# Computing the Euclidean Distance

In [None]:
import pandas as pd
import numpy as np
import os 
import math
import random
import matplotlib.pyplot as plt

### Load a Data Set and Save it as a Pandas DataFrame

We will work with a new data set called "cell2cell." It describes cellular telephone customers and can be used to predict whether a customer will stay with their current telecom provider or switch to another.



In [None]:
filename = os.path.join(os.getcwd(), "data","cell2cell.csv")
df = pd.read_csv(filename, header=0)

### Inspect the Data

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df.shape

## Euclidean Distance

#### KNN
<p>k-Nearest Neighbors (KNN) is an instance-based learning algorithm. To make a classification for a given unlabeled example $A$, we search the training data for the $k$ nearest neighbors, as defined by some distance metric $d(A,B)$ in which $B$ represents another example. We choose the most common label among the nearest neighbor examples to be our prediction (label) for the unlabeled example.<br>
    
The most commonly used distance metric for KNN is the Euclidean distance.

#### Euclidean Distance

For two n-dimensional, real-valued vectors $A,B \in \mathbb{R}^n$, the Euclidean distance $eud$ is defined as:

$$ eud(A, B) = \sqrt{\sum_{i=1}^{n}{(B_i-A_i)^2}}$$


Euclidean distance measures the distance between two vectors of the same length. In this formula, $A_i$ is the $i$-th coordinate of vector $A$, and $B_i$ is the $i$-th coordinate of vector $B$.


Let's relate this to a dataset. Let's think of the vectors $A$ and $B$ as being two examples (rows) in a dataset. 

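Let $A = \langle x^a_1, \ldots, x^a_n \rangle$ be an $n$-dimensional vector, where $x^a_i$ is the $i$-th feature of example $A$ and $n$ is the total number of features. The vector $B = \langle x^b_1, \ldots, x^b_n \rangle$ is defined in the same way.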

Then for two vectors (examples) $A$ and $B$ the Euclidean distance is defined as:<br><br>
<center>$eud(A, B) = \sqrt{(x^b_1-x^a_1)^2+ (x^b_2-x^a_2)^2+...+(x^b_n-x^a_n)^2} = \sqrt{\sum\limits_{i=1}^n (x^b_i-x^a_i)^2}$
</center>
<br><br>
</p>
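
To make the link between KNN and the distance metric concrete, here is a minimal sketch of the voting step described above. The toy training set, the labels, and the helper name `knn_predict` are all hypothetical and are not part of the cell2cell data. The sketch relies on the standard library function `math.dist()` (Python 3.8 or later) for the Euclidean distance.

```python
import math
from collections import Counter

# Hypothetical toy training set: (feature vector, label) pairs.
training = [
    ((1.0, 1.0), "stay"),
    ((1.2, 0.8), "stay"),
    ((3.0, 3.2), "leave"),
    ((3.1, 2.9), "leave"),
    ((0.9, 1.3), "stay"),
]

def knn_predict(query, examples, k=3):
    """Return the most common label among the k examples closest to query."""
    # Sort the examples by their Euclidean distance to the query point.
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    labels = [label for _, label in nearest]
    # Majority vote over the k nearest labels.
    return Counter(labels).most_common(1)[0][0]

prediction = knn_predict((1.1, 1.0), training, k=3)
print(prediction)
```

With an odd `k` and two classes, the vote can never tie, which is one reason small odd values of `k` are a common starting point.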


To visualize KNN, you can picture plotting the examples (also called data points) in our dataset and measuring the distance between pairs of them. Let's create a visualization to see how we plot two examples and find the distance between them.

To easily visualize this, let's plot two examples from DataFrame `df`. Note that each example contains many features, but to make this visualization even simpler, we will work with two dimensions (that is, two features). 

Euclidean distance is best used to calculate the distance between vectors containing numerical values. Therefore, we will choose two features that have numerical values.

Let us use row 0 and row 4 in DataFrame `df` and focus on features `HandsetModels` and `AgeHH1`. Run the code below to examine the two examples we will be plotting.



In [None]:
display(df.loc[[0,4],['HandsetModels','AgeHH1']])


Each example (row) can be viewed as a vector:

example 1: `(0.487071, 1.387766)`

example 2:  `(1.590917, 0.663601)`


You will use the Euclidean distance formula to find the distance between these two vectors. First, let's plot them. Run the code cell below to generate the plot, then examine the result.

In [None]:
# example 1 (row 0):
vector_A = [df.loc[0, 'HandsetModels'], df.loc[0, 'AgeHH1']]

# example 2 (row 4):
vector_B = [df.loc[4, 'HandsetModels'], df.loc[4, 'AgeHH1']]


plt.scatter(vector_A[0], vector_A[1], c='b', label='example 1')
plt.scatter(vector_B[0], vector_B[1], c='r', label='example 2')
plt.plot([vector_A[0], vector_B[0]], [vector_A[1], vector_B[1]], c='g', linestyle='dashed', label='distance')

plt.xlim([0, 3])
plt.ylim([0, 3])
plt.xlabel('HandsetModels')
plt.ylabel('AgeHH1')

plt.legend(loc='upper right');
plt.show()



Use the Euclidean distance formula to calculate the distance between `vector_A` and `vector_B` by hand and save the result to variable `euc_distance`.

For simplicity, use the following rounded vector values in your calculation:

vector_A: `(0.5, 1.4)`

vector_B:  `(1.6, 0.7)`
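
As a reminder of how the arithmetic unfolds, here is the formula applied to a different, hypothetical pair of points, $P = (1, 2)$ and $Q = (4, 6)$. These are not the vectors from the exercise:

$$ eud(P, Q) = \sqrt{(4-1)^2 + (6-2)^2} = \sqrt{9 + 16} = \sqrt{25} = 5 $$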

### Graded Cell
The cell below will be graded. Remove the line "raise NotImplementedError()" before writing your code.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Self-Check

Run the cell below to test the correctness of your code above before submitting for grading. Do not add code or delete code in the cell.

In [None]:
# Run this self-test cell to check your code; 
# do not add code or delete code in this cell
from jn import testEuc

try:
    p, err = testEuc(euc_distance)
    print(err)
except Exception as e:
    print("Error!\n" + str(e))
    


## Step 1: Filter Numerical Features

We will now compute the Euclidean distance between two rows in DataFrame `df`, using all of their numerical feature values. Let us create a new DataFrame that contains only the numerically valued columns of the original `df` DataFrame.

In [None]:
df_numerical = df.select_dtypes(include=['int64','float64'])

print(df_numerical.shape)
df_numerical.head()
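
As an aside, listing `int64` and `float64` explicitly would skip any columns stored in other numeric widths, such as `int32` or `float32`. The sketch below uses a small hypothetical DataFrame (the column names are made up for illustration) to show that passing `include='number'` to `select_dtypes()` selects every numeric dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; these columns are not from cell2cell.
toy = pd.DataFrame({
    "id_code": ["a", "b", "c"],                               # text column
    "calls": np.array([3, 5, 2], dtype="int32"),              # 32-bit integers
    "minutes": np.array([10.5, 7.25, 0.0], dtype="float64"),  # 64-bit floats
})

# Only 64-bit integer and float columns are kept here.
explicit = toy.select_dtypes(include=["int64", "float64"])

# 'number' matches every numeric dtype, regardless of width.
all_numeric = toy.select_dtypes(include="number")

print(list(explicit.columns))
print(list(all_numeric.columns))
```

Whether this matters for cell2cell depends on how `read_csv` inferred the column types, which `df.dtypes` above will show.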


We will exclude the `CustomerID` column, since it is an identifier rather than a feature that describes the customer.

In [None]:
df_numerical = df_numerical.drop(columns=['CustomerID'])
df_numerical.head()

We will compute the Euclidean distance between two examples in our data. In other words, our vectors $A$ and $B$ will be two distinct *rows* of our DataFrame `df_numerical` (which we filtered to include only numerical columns).

The code cell below randomly samples two rows from the `df_numerical` dataset and stores each in new DataFrame objects named `A` and `B`, respectively.

In [None]:
# Draw two rows in a single call so that A and B are guaranteed to be distinct
sampled = df_numerical.sample(n=2, replace=False)
A = sampled.iloc[[0]]
B = sampled.iloc[[1]]

In [None]:
A

In [None]:
B
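
Because the rows are drawn at random, `A` and `B` change every time the sampling cell runs. If you would like repeatable results while debugging, one option is to pass a seed through the `random_state` parameter of `sample()`. The sketch below uses a hypothetical toy DataFrame and an arbitrary seed of 0. It also draws both rows in a single call, which ensures they are different rows.

```python
import pandas as pd

# Hypothetical stand-in for df_numerical.
toy = pd.DataFrame({"f1": [1, 2, 3, 4, 5], "f2": [10, 20, 30, 40, 50]})

# Draw two rows in one call; without replacement they cannot coincide.
pair = toy.sample(n=2, replace=False, random_state=0)
row_a = pair.iloc[[0]]   # double brackets keep a one-row DataFrame
row_b = pair.iloc[[1]]

# Re-running with the same seed reproduces the same rows.
pair_again = toy.sample(n=2, replace=False, random_state=0)
print(pair.index.tolist() == pair_again.index.tolist())
```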

## Step 2: Compute the Euclidean Distance Between Two Vectors Using Python

We will first implement a function that computes the Euclidean distance in plain Python, without NumPy. To do so, let us convert DataFrames `A` and `B` into Python lists.

In [None]:
list_A = A.values.flatten().tolist()
list_B = B.values.flatten().tolist()

list_A
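
To see what each step in that conversion does, here is the same chain applied to a hypothetical one-row DataFrame: `.values` gives a 2-D array with a single row, `.flatten()` collapses it to one dimension, and `.tolist()` turns it into a plain Python list.

```python
import pandas as pd

# Hypothetical one-row DataFrame standing in for A.
row = pd.DataFrame({"f1": [1.5], "f2": [2.0], "f3": [-0.5]})

as_2d = row.values          # shape (1, 3): one row, three columns
as_1d = as_2d.flatten()     # shape (3,)
as_list = as_1d.tolist()    # plain Python floats

print(as_2d.shape, as_1d.shape, as_list)
```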


Using the definition above, complete the function below so that it returns the Euclidean distance between its two list inputs. <br>

You will use a traditional `for` loop to handle the computation for each pair of $i$-th coordinates of the two input lists. (You can think of each pair as a 'column' in a DataFrame with just two rows, $A$ and $B$.)

<b>Tip</b>: to compute the square root, use the Python `math.sqrt()` function.

### Graded Cell
The cell below will be graded. Remove the line "raise NotImplementedError()" before writing your code.

In [None]:
def euclidean_distance(vector1 , vector2):
    ## the sum_squares variable will contain the current value of the sum of squares of each i-th coordinate pair
    sum_squares = 0
    
    numberOfIterations = len(vector1)
    
    ## TODO: Complete loop below ## 
    
    # The number of times the loop will be executed is the length of the vectors.
    # 
    # At each loop iteration, you will:
    #  Step 1. index into each vector and find the difference between the ith element in vector2 and vector1 
    #  Step 2. square the difference
    #  Step 3. update the value of the 'sum_squares' variable by adding the result in Step 2 to 
    #          the existing value of sum_squares
    

    for i in range(numberOfIterations):
        
        # Inside this loop follow steps 1-3 to update the value of the 'sum_squares' variable by
        # adding the squared difference of the i'th coordinate pair to the sum.
        

        # YOUR CODE HERE
        raise NotImplementedError()
        

    ### TODO: Compute the Distance ###  
    
    # Compute the square root of the variable 'sum_squares' and assign 
    # that result to a new variable named 'distance'
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # return the Euclidean distance
    return distance

### Self-Check

Run the cell below to test the correctness of your code above before submitting for grading. Do not add code or delete code in the cell.

In [None]:
# Run this self-test cell to check your code; 
# do not add code or delete code in this cell
from jn import testFunction

try:
    p, err = testFunction(euclidean_distance)
    print(err)
except Exception as e:
    print("Error!\n" + str(e))
    


The code cell below tests your function. Run the cell to view the results.

In [None]:
euclidean_distance(list_A, list_B)
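
If you want an independent check of your result, Python 3.8 and later include `math.dist()` in the standard library, which computes the Euclidean distance between two sequences of equal length. A minimal sketch with hypothetical vectors `p` and `q`:

```python
import math

# Hypothetical vectors; the coordinate differences are 3, 4 and 0.
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]

reference = math.dist(p, q)
print(reference)
```

In the notebook, `math.dist(list_A, list_B)` should agree with the output of your `euclidean_distance` function up to floating-point rounding.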


## Step 3: Compute the Euclidean Distance Between Two Vectors Using NumPy 

The NumPy package provides an easy way to compute the Euclidean distance between two vectors. NumPy has a `norm()` function, which is part of its linear algebra module, `linalg`. You can call the function using this syntax: `np.linalg.norm()`. Given a vector, `norm()` returns its norm. A vector has both magnitude and direction, and the norm measures the magnitude.

By default, the `norm()` function calculates the L2 norm, also known as the Euclidean norm. The Euclidean norm of the difference between two vectors is exactly the Euclidean distance between them, so we can apply `norm()` to that difference to calculate the distance.

Subtracting two vectors element by element requires NumPy arrays, since Python lists do not support subtraction. The code cell below converts DataFrames `A` and `B` to NumPy arrays, subtracts them, and passes the difference to `norm()` to find the Euclidean distance.

Run the cell below and compare the results. Is the Euclidean distance the same value as what your function `euclidean_distance` produces? Try using the `norm()` function to find the Euclidean distance between the vectors `vector_A` and `vector_B` as well.



In [None]:
array1 = np.array(A)
array2 = np.array(B)
np.linalg.norm(array2-array1)
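
As a small sanity check of what `norm()` does, the sketch below takes two hypothetical 2-D points and compares `np.linalg.norm()` on their difference with the formula written out in NumPy operations. It also shows that a one-row, 2-D array (the shape that `np.array()` produces from a one-row DataFrame) gives the same value: for 2-D input, the default is the Frobenius norm, which also takes the square root of the sum of all squared entries.

```python
import numpy as np

# Hypothetical points; the coordinate differences are 3 and 4.
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

via_norm = np.linalg.norm(q - p)
via_formula = np.sqrt(np.sum((q - p) ** 2))

# Same data shaped as single-row matrices, shape (1, 2).
p_row = p.reshape(1, -1)
q_row = q.reshape(1, -1)
via_row_norm = np.linalg.norm(q_row - p_row)

print(via_norm, via_formula, via_row_norm)
```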


You can see how easy it is to find the Euclidean distance between two vectors (or examples) using NumPy! For more information about the `norm()` function, consult the online [documentation](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html).