In [13]:
import seaborn as sns
import pandas as pd
import plotly.express as px
import numpy as np

In [54]:
data = sns.load_dataset("iris")

## Feature Vector

In [55]:
data.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In data science, each data point (each row) can be treated as a vector, called a "feature vector." Each input feature (sepal_length, sepal_width, petal_length, petal_width) acts as one coordinate, so every row is a point in a 4-dimensional space (4D because we have 4 input features; species is the label, not a coordinate).

Since a 4-dimensional space is hard to visualize, let's simplify and keep only the first 3 features.

In [56]:
data = data.iloc[:, 0:3]

In [57]:
data

,sepal_length,sepal_width,petal_length
0,5.1,3.5,1.4
1,4.9,3.0,1.4
2,4.7,3.2,1.3
3,4.6,3.1,1.5
4,5.0,3.6,1.4
...,...,...,...
145,6.7,3.0,5.2
146,6.3,2.5,5.0
147,6.5,3.0,5.2
148,6.2,3.4,5.4


Consider the vector at index 0: [5.1, 3.5, 1.4]. These three values are referred to as the components of the vector.
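As a small illustration, a row can be held as a NumPy array and its components read by position. The values are hard-coded from the first row of the table above so the snippet runs without loading the dataset; the variable names are only illustrative.

```python
import numpy as np

# First row of the reduced dataset (sepal_length, sepal_width, petal_length),
# hard-coded from the table above.
feature_vector = np.array([5.1, 3.5, 1.4])

# Each entry is one component of the vector, i.e. one coordinate in 3D space.
sepal_length, sepal_width, petal_length = feature_vector

# The number of components is the dimension of the space the vector lives in.
n_dimensions = feature_vector.shape[0]
print(n_dimensions, sepal_length, sepal_width, petal_length)
```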

In [82]:
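# replace=False avoids drawing the same row twice, which would later give two identical vectors.
random_data_index = np.random.choice(data.index, 5, replace=False)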
random_data_index_copy = random_data_index.copy()

In [83]:
fig = px.scatter_3d(data.iloc[random_data_index, :], x = "sepal_length" , y = "sepal_width" , z = "petal_length")
for index in random_data_index:
    line_trace = px.line_3d(
            pd.DataFrame({'x': [0, data.iloc[index, 0]], 'y': [0,data.iloc[index, 1]], 'z': [0, data.iloc[index, 2]]}),
            x='x', y='y', z='z'
        ).data[0]
    fig.add_trace(line_trace)
fig.show()

## Distance from Origin
Let's examine one of the vectors.

The distance from the origin of any vector is calculated as the square root of the sum of the squares of each component.

In [70]:
v1 = np.array(data.iloc[0, :])

In [71]:
v1

array([5.1, 3.5, 1.4])

In [73]:
np.linalg.norm(v1) # Distance from origin

6.34192399828317

In [77]:
s = 0
for value in v1:
    s = s + value**2
s = np.sqrt(s)

In [78]:
s

6.34192399828317
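The same calculation can be applied to many rows at once. The sketch below hard-codes the first five rows from the table above (an assumption made so it runs on its own) and compares the manual formula with `np.linalg.norm` using `axis=1`.

```python
import numpy as np

# First five rows of the reduced dataset, hard-coded from the table above.
rows = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [4.7, 3.2, 1.3],
    [4.6, 3.1, 1.5],
    [5.0, 3.6, 1.4],
])

# Square every component, sum across each row, then take the square root.
norms_manual = np.sqrt((rows ** 2).sum(axis=1))

# np.linalg.norm with axis=1 computes one norm per row.
norms_numpy = np.linalg.norm(rows, axis=1)

print(norms_numpy)
```

The first entry should match the value computed for `v1` above, since both use the row at index 0.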

## Distance Between Two Vectors

In [85]:
random_data_index = random_data_index[0:2]
fig = px.scatter_3d(data.iloc[random_data_index, :], x = "sepal_length" , y = "sepal_width" , z = "petal_length")
for index in random_data_index:
    line_trace = px.line_3d(
            pd.DataFrame({'x': [0, data.iloc[index, 0]], 'y': [0,data.iloc[index, 1]], 'z': [0, data.iloc[index, 2]]}),
            x='x', y='y', z='z'
        ).data[0]
    fig.add_trace(line_trace)
fig.show()

To find the distance between two vectors, we need to determine the length of the line segment connecting the two given data points.

In [95]:
random_data_index = random_data_index[0:2]
fig = px.scatter_3d(data.iloc[random_data_index, :], x = "sepal_length" , y = "sepal_width" , z = "petal_length")
# Segment joining the two data points; its length is the distance between them.
p1 = data.iloc[random_data_index[0]]
p2 = data.iloc[random_data_index[1]]
line_trace = px.line_3d(
        pd.DataFrame({'x': [p1.iloc[0], p2.iloc[0]], 'y': [p1.iloc[1], p2.iloc[1]], 'z': [p1.iloc[2], p2.iloc[2]]}),
        x='x', y='y', z='z'
    ).data[0]

fig.add_trace(line_trace)

for index in random_data_index:
    line_trace = px.line_3d(
            pd.DataFrame({'x': [0, data.iloc[index, 0]], 'y': [0,data.iloc[index, 1]], 'z': [0, data.iloc[index, 2]]}),
            x='x', y='y', z='z'
        ).data[0]
    fig.add_trace(line_trace)
fig.show()

In [119]:
v1 = np.array(data.iloc[random_data_index[0], :])
v2 = np.array(data.iloc[random_data_index[1], :])

In [151]:
v3 = v1 - v2 # Vector v3 is obtained by subtracting the components of vector v2 from vector v1.

In [121]:
print(v1, v2)
print(v3)

[5.7 4.4 1.5] [5.  3.4 1.5]
[0.7 1.  0. ]


The distance of vector v3 from the origin is the same as the distance between vectors v1 and v2.

In [122]:
np.linalg.norm(v3)

1.2206555615733707
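To make the link with the earlier formula explicit, here is a short sketch that computes the same distance directly from the component-wise differences. The two vectors are hard-coded from the printed output above so the snippet is self-contained.

```python
import numpy as np

# The two vectors printed above.
v1 = np.array([5.7, 4.4, 1.5])
v2 = np.array([5.0, 3.4, 1.5])

# Square root of the sum of squared component-wise differences.
distance_manual = np.sqrt(np.sum((v1 - v2) ** 2))

# Same result via the norm of the difference vector.
distance_norm = np.linalg.norm(v1 - v2)

# Distance is symmetric: swapping the vectors does not change it.
distance_swapped = np.linalg.norm(v2 - v1)

print(distance_manual, distance_norm, distance_swapped)
```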

The distance between two vectors computed this way is called the **"Euclidean Distance"**.

**Application**: This concept is used in Machine Learning. For instance, to predict the output for a new data point, we calculate its Euclidean distance to every point in the dataset and pick the K closest ones. The new point is assumed to resemble those neighbors, so its output is predicted from theirs (a majority vote for classification, an average for regression). This approach is known as the "K-Nearest Neighbors" algorithm.
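A minimal sketch of the nearest-neighbor idea described above, assuming classification by majority vote among the k closest points. The function name `knn_predict` and the toy training set are illustrative; the rows come from the tables above, and the "virginica" labels are taken from the full iris dataset rather than from the output shown here.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy training set (sepal_length, sepal_width, petal_length).
train_X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [4.7, 3.2, 1.3],
    [6.5, 3.0, 5.2],
    [6.2, 3.4, 5.4],
])
train_y = np.array(["setosa", "setosa", "setosa", "virginica", "virginica"])

# A new, unlabeled point to classify.
query = np.array([5.0, 3.4, 1.5])
prediction = knn_predict(train_X, train_y, query, k=3)
print(prediction)
```

In practice a library implementation such as scikit-learn's `KNeighborsClassifier` would normally be used; the loop-free NumPy version here is only meant to show where the Euclidean distance enters.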

# Dot Product

In [123]:
v1

array([5.7, 4.4, 1.5])

In [124]:
v2

array([5. , 3.4, 1.5])

In [150]:
np.dot(v1, v2) # Note that the dot product of two vectors is a scalar.

45.71

**Calculation**:

v1 = [5.7, 4.4, 1.5]  
v2 = [5.0, 3.4, 1.5]  
v1 · v2 = 5.7 * 5.0 + 4.4 * 3.4 + 1.5 * 1.5 = 45.71

In [126]:
5.7*5 + 4.4*3.4 + 1.5*1.5

45.71

## Application in Machine Learning:

The dot product of two vectors A and B can be expressed as:

A · B = ||A|| * ||B|| * cos(θ)

Where:

||A|| is the norm (distance from the origin) of vector A.  
||B|| is the norm of vector B.  
θ is the angle between vectors A and B.

Rearranging the equation gives the angle between the two vectors:

θ = arccos((A · B) / (||A|| * ||B||))

In [146]:
angle = np.arccos((np.dot(v1, v2))/(np.linalg.norm(v1) * np.linalg.norm(v2)))

In [147]:
angle # In radians

0.0698139105674889

In [148]:
angle * 180 / np.pi # In degree

4.000042426820892

So, the angle between the two vectors is approximately 4 degrees.

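The angle can be used much like the Euclidean distance. When a new data point arrives, we can calculate the angle between its vector and each vector already in the dataset. The vector that forms the smallest angle with the new one is considered the most similar. Unlike the Euclidean distance, the angle ignores the length of the vectors and compares only their direction; cos(θ) itself is commonly called the cosine similarity.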
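The angle-based comparison can be sketched as follows. The helper names `cosine_similarity` and `angle_degrees` are illustrative, and the "existing" vectors are a hand-picked sample of rows from the tables above rather than the full dataset. The `np.clip` call is a precaution: rounding can push the cosine slightly above 1, which would make `np.arccos` return NaN.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def angle_degrees(a, b):
    """Angle between a and b in degrees."""
    # Clip to the valid arccos domain to guard against rounding error.
    cos_theta = np.clip(cosine_similarity(a, b), -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

# A few vectors already in the dataset (sepal_length, sepal_width, petal_length).
existing = np.array([
    [5.7, 4.4, 1.5],
    [6.5, 3.0, 5.2],
    [4.6, 3.1, 1.5],
])

# The new data point to compare against them.
new_point = np.array([5.0, 3.4, 1.5])

# Angle between the new point and every existing vector.
angles = np.array([angle_degrees(row, new_point) for row in existing])

# The smallest angle marks the most similar vector.
most_similar_index = int(np.argmin(angles))
print(angles, most_similar_index)

# Scaling a vector changes its length but not its direction.
angle_to_scaled_copy = angle_degrees(2 * new_point, new_point)
```

Because only direction matters, a vector and any positive multiple of it are treated as identical by this measure, which is why `angle_to_scaled_copy` is effectively zero.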