
# Distance Matrix

### Distance Metrics

#### 1. Euclidean Distance

The Euclidean distance between two points \( x \) and \( y \) in \( n \)-dimensional space is given by:

$$ 
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} 
$$

#### 2. Manhattan Distance (L1 Distance)
The Manhattan distance between two points \( x \) and \( y \) is defined as:

$$ 
d(x, y) = \sum_{i=1}^{n} |x_i - y_i| 
$$
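
As a quick check of the two formulas above, here is a minimal sketch that evaluates both distances for one pair of points. The points `x` and `y` are made-up values chosen for this illustration; they are not part of the notebook's data.

```python
import numpy as np

# Two illustrative points in 2-dimensional space (assumed values)
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(x - y))

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, which matches the values here.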


#### There are other ways to measure distances and dissimilarities!

We will mostly be dealing with the Euclidean distance.
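
One example of such an alternative is the Minkowski distance, which contains both metrics above as special cases (order 1 gives Manhattan, order 2 gives Euclidean). The sketch below is only an illustration: the helper name `minkowski_dist` and the sample points are assumptions made for this example.

```python
def minkowski_dist(p1, p2, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(p1, p2)) ** (1 / p)

# illustrative points (assumed values)
x, y = (1, 2), (4, 6)

print(minkowski_dist(x, y, 1))  # 7.0, same as the Manhattan distance
print(minkowski_dist(x, y, 2))  # 5.0, same as the Euclidean distance
```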

---

References: 

**Data Mining: Concepts and Techniques[^1].**

Section 2.4 Measuring Data Similarity and Dissimilarity


**Data Mining: The Textbook[^2].**

Chapter 3: Similarity and Distances

---

[^1]: Han, J., Kamber, M., & Pei, J. (2012). *Data Mining: Concepts and Techniques* (3rd ed.). Morgan Kaufmann.

[^2]: Aggarwal, C. C. (2015). *Data Mining: The Textbook*. Springer. ISBN: 978-3-319-14142-8

### Data

Let's get some example data.

In [None]:
# imports
import numpy as np
import pandas as pd

In [None]:
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
df = pd.DataFrame(data, columns=['Feature 1', 'Feature 2'], index=['P1','P2','P3','P4'])
df


In [None]:
## Distance between P1, P1

d_1_1 = ((2-2)**2 + (1-1)**2)**.5
d_1_1

In [None]:
## Distance between P1, P2

d_1_2 = (((4-2)**2) + ((3-1)**2))**.5
d_1_2

In [None]:
## Distance between P1, P3
# d_1_3 = 0 # Calculate

In [None]:
## Distance between P1, P4

# d_1_4 = 0 # Calculate


In [None]:
df


In [None]:
###########################

## Surely there's a better way?

def euclidean_dist_2d(p1, p2):
    # Euclidean distance between two 2D points given as (feature 1, feature 2)
    return ((p2[1] - p1[1])**2 + (p2[0] - p1[0])**2)**.5

###########################
## Distance between P2, P1
print(euclidean_dist_2d((3,4),(1,2)))

## Distance between P2, P2
print(euclidean_dist_2d((3,4), (3,4)))

## Distance between P2, P3
print(euclidean_dist_2d((3,4),(5,6)))

## Distance between P2, P4
print(euclidean_dist_2d((3,4),(7,8)))

#### ... and so on ... ####

###########################
## Distance between P3, P1
## Distance between P3, P2
## Distance between P3, P3
## Distance between P3, P4
###########################
## Distance between P4, P1
## Distance between P4, P2
## Distance between P4, P3
## Distance between P4, P4
###########################
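
Rather than writing one line per pair, the remaining pairs can be covered with two nested loops. This is a minimal sketch, not the notebook's official solution: the `points`, `labels`, and `pairwise` names are assumptions introduced for the example, and the points repeat the four rows of the example data.

```python
def euclidean_dist_2d(p1, p2):
    # Euclidean distance between two 2D points
    return ((p2[1] - p1[1]) ** 2 + (p2[0] - p1[0]) ** 2) ** 0.5

# the four example points, with hypothetical helper names
points = [(1, 2), (3, 4), (5, 6), (7, 8)]
labels = ["P1", "P2", "P3", "P4"]

# distance for every ordered pair, keyed by the two labels
pairwise = {}
for i, a in enumerate(points):
    for j, b in enumerate(points):
        pairwise[(labels[i], labels[j])] = euclidean_dist_2d(a, b)

for pair, d in pairwise.items():
    print(pair, round(d, 3))
```

Each point appears once as the first element and once as the second, so the 4 points produce 16 distances, including the zeros for each point paired with itself.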


### Iris Dataset

Let's get some (better) data.

In [None]:
# imports
from sklearn import datasets

In [None]:
# load the Iris dataset; the result is a dictionary-like object
dataset = datasets.load_iris()

In [None]:
# the dataset behaves like a dictionary, so we can list its keys
dataset.keys()

In [None]:
dataset["feature_names"]

We are using the [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set).

![Iris Flower](https://raw.githubusercontent.com/fpontejos/Data-Mining-24-25/main/figures/img/iris.png)

In [None]:
data = dataset["data"]
data  # data is a NumPy array: think of it as a matrix of data (or as an Excel spreadsheet)

In [None]:
data.shape

In [None]:
print(data[1])
print(data[0])


`numpy` arrays behave differently from Python lists!

In [None]:
# data[1] - data[0]

In [None]:
# data[1].tolist() - data[0].tolist()
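
To make the difference concrete, here is a small sketch that runs both subtractions side by side. The arrays `a` and `b` are short illustrative vectors (assumed values), and the `try`/`except` is only there so the example runs to the end.

```python
import numpy as np

# two short illustrative vectors (assumed values)
a = np.array([5.1, 3.5])
b = np.array([4.9, 3.0])

# NumPy arrays subtract element by element
diff = a - b

# plain Python lists do not support "-": the attempt raises a TypeError
try:
    a.tolist() - b.tolist()
    list_error = None
except TypeError as err:
    list_error = err

print(diff)
print(type(list_error).__name__)  # TypeError
```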

So we can do something like this: 

In [None]:
(data[0] - data[1])

In [None]:
(data[0] - data[1])**2

In [None]:
sum((data[0] - data[1])**2)

In [None]:
(sum((data[0] - data[1])**2))**(1/2)

In [None]:
# Let's wrap it into a function:
# Euclidean distance of 2 observations

def euclidean_dist(p1, p2):
    return sum((p1 - p2)**2)**(1/2)
    
print(euclidean_dist(data[1], data[0]))
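
As a sanity check, the hand-written function can be compared with NumPy's `np.linalg.norm`, which returns the Euclidean length of a vector by default. This is a sketch under the assumption that the two observations `p1` and `p2` are 1D arrays; the values are illustrative.

```python
import numpy as np

def euclidean_dist(p1, p2):
    # same formula as in the cell above
    return sum((p1 - p2) ** 2) ** (1 / 2)

# two illustrative observations with four features each (assumed values)
p1 = np.array([5.1, 3.5, 1.4, 0.2])
p2 = np.array([4.9, 3.0, 1.4, 0.2])

manual = euclidean_dist(p1, p2)
builtin = np.linalg.norm(p1 - p2)  # Euclidean norm of the difference vector

print(np.isclose(manual, builtin))  # True
```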


In [None]:
# initialize distance matrix. What will be its final shape?
dist = []

In [None]:
# Build the distance matrix. Use 2 for loops, the append list method and the euclidean distance formula

dist = []

for i in range(data.shape[0]):
    dist_row = []
    for j in range(data.shape[0]):
        single_dist = sum((data[i] - data[j]) ** 2) ** (1/2)
        dist_row.append(single_dist)
    dist.append(dist_row)        



In [None]:
len(dist), len(dist[0])
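
The two loops can also be replaced by NumPy broadcasting, which computes all pairwise differences at once. The sketch below is one possible alternative, not the approach the exercise asks for; the function name `distance_matrix_vectorized` is an assumption, and a small array stands in for the real data.

```python
import numpy as np

def distance_matrix_vectorized(points):
    """Pairwise Euclidean distances between the rows of a 2D array."""
    points = np.asarray(points, dtype=float)
    # shape (n, 1, d) minus shape (1, n, d) broadcasts to shape (n, n, d)
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    # sum the squared differences over the feature axis, then take the root
    return np.sqrt((diff ** 2).sum(axis=-1))

# small stand-in for the real data (assumed values)
points = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
dist_fast = distance_matrix_vectorized(points)

print(dist_fast.shape)  # (4, 4)
```

The result has one row and one column per observation, it is symmetric, and its diagonal is zero. The intermediate array holds n × n × d values, so this trades memory for speed.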

In [None]:
# another import (usually all imports are done at the top of the script/notebook)
import seaborn as sns

In [None]:
sns.heatmap(dist)

## Why do we care about distances?

> In data mining applications, such as clustering, outlier analysis, and nearest-neighbor
classification, we need ways to assess how alike or unalike objects are in comparison to
one another. For example, a store may want to search for clusters of customer objects,
resulting in groups of customers with similar characteristics (e.g., similar income, area
of residence, and age). Such information can then be used for marketing. A cluster is
a collection of data objects such that the objects within a cluster are similar to one
another and dissimilar to the objects in other clusters.
> 
> (Han et al., 2011, pp. 65-66)



---

Han, J., Kamber, M., & Pei, J. (2011). *Data Mining: Concepts and Techniques*. Elsevier.

# Plotting data
Don't worry about the code for now; it is not the goal of the exercise. We will learn how to plot data in future classes.

### How can we represent an observation in an N-dimensional space?

In [None]:
# another import (usually all imports are done at the top of the script/notebook)
import matplotlib.pyplot as plt

In [None]:
# 2D scatter plot
plt.scatter(data[:, 0], data[:, 1])
plt.xlabel(dataset["feature_names"][0])
plt.ylabel(dataset["feature_names"][1])
plt.show()

In [None]:
# 1D scatter plot: place every point at y = 0
plt.scatter(data[:, 0], np.zeros(data.shape[0]))
plt.xlabel(dataset["feature_names"][0])
plt.show()

In [None]:
# 3D scatter plot
fig = plt.figure(figsize=(14, 7))  # defining a figure so we can add a 3d subplot
ax = fig.add_subplot(111, projection="3d")
ax.scatter(data[:, 0], data[:, 1], data[:, 2])
ax.set_xlabel(dataset["feature_names"][0])
ax.set_ylabel(dataset["feature_names"][1])
ax.set_zlabel(dataset["feature_names"][2])
plt.show()

## Finding nearest neighbors

**Pseudo code:**

```python
# initialize global min_dist to a very large number:
min_dist = 9999999

# save coordinates of pair with smallest dist
min_args = (-,-)

for each row in dist_matrix:
    row_dist =  distances from point in row_id
                to all other points in the lower triangle of matrix
                (Why?)
    if the smallest distance in row_dist is smaller than current global min_dist,
        replace min_dist with this dist
        replace min_args with the row_number and col_number
```
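
The pseudo code above could be turned into Python along the following lines. This is one possible sketch, not the only solution: the function name `closest_pair` and the `toy` matrix are assumptions made for this example.

```python
def closest_pair(dist_matrix):
    """Return ((row, col), distance) for the closest pair of distinct points."""
    min_dist = float("inf")
    min_args = (None, None)
    for id_r, row in enumerate(dist_matrix):
        # lower triangle only: the diagonal is always 0 and the matrix is
        # symmetric, so columns left of the diagonal cover each pair once
        for id_c in range(id_r):
            if row[id_c] < min_dist:
                min_dist = row[id_c]
                min_args = (id_r, id_c)
    return min_args, min_dist

# small symmetric distance matrix for three points (assumed values)
toy = [
    [0.0, 2.0, 5.0],
    [2.0, 0.0, 1.5],
    [5.0, 1.5, 0.0],
]
print(closest_pair(toy))  # ((2, 1), 1.5)
```

Starting from `float("inf")` instead of a large constant means the first distance examined always becomes the running minimum.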

In [None]:
# get variables to save closest neighbors later
min_args, min_dist = (None, 9e99)
for id_r, row in enumerate(dist):
    # CODE HERE


In [None]:
min_args

In [None]:
print(data[min_args[0]])
print(data[min_args[1]])
print('minimum distance:\t', min_dist)

## Define functions
Why do we want to define functions in this case?

In [None]:
def distance_matrix(data):
    # CODE HERE
    return dist    

def closest_points(dist_matrix):
    # CODE HERE
    return min_args, min_dist

## Finding the `n` shortest distances

In [None]:
dist_matrix = distance_matrix(data)
n_distances = 10

# CODE HERE

distances
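
One way to approach this step is to collect the lower-triangle entries of the matrix, sort them, and keep the first `n`. The sketch below assumes the distance matrix is square and symmetric; the function name `n_shortest_distances` and the toy matrix are assumptions made for the example.

```python
import numpy as np

def n_shortest_distances(dist_matrix, n):
    """Return the n smallest pairwise distances as (row, col, distance) tuples."""
    dist_array = np.asarray(dist_matrix, dtype=float)
    # indices of the strict lower triangle: each pair once, diagonal excluded
    rows, cols = np.tril_indices(dist_array.shape[0], k=-1)
    values = dist_array[rows, cols]
    # positions of the n smallest values, in ascending order
    order = np.argsort(values)[:n]
    return [(int(rows[k]), int(cols[k]), float(values[k])) for k in order]

# toy matrix for four points on a line at positions 0, 1, 3 and 7
positions = np.array([0.0, 1.0, 3.0, 7.0])
toy_dist = np.abs(positions[:, None] - positions[None, :])

print(n_shortest_distances(toy_dist, 3))
# [(1, 0, 1.0), (2, 1, 2.0), (2, 0, 3.0)]
```

When several pairs share the same distance, the order among them is not guaranteed by this sketch.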