# Worksheet 04

Name:  Di Wang
UID: U22721196

### Topics

- Distance & Similarity
- Cost Functions
- K means

### Distance & Similarity

#### Part 1

a) In the Minkowski distance, describe what the parameters p and d are.

The parameter p is the order of the distance metric. 
The parameter d is the number of dimensions in the space in which the points are defined. 

b) In your own words describe the difference between the Euclidean distance and the Manhattan distance.

Euclidean distance: It's the straight-line distance between two points which can be calculated as the square root of the sum of the squares of the differences between the coordinates of the two points.

Manhattan distance: It's the distance between two points which can be calculated as the sum of the absolute differences between their coordinates.

Consider A = (0, 0) and B = (1, 1). When:

- p = 1, d(A, B) = 2
- p = 2, d(A, B) = $\sqrt{2} \approx 1.41$
- p = 3, d(A, B) = $2^{1/3} \approx 1.26$
- p = 4, d(A, B) = $2^{1/4} \approx 1.19$
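
The values above can be reproduced with a small helper. This is a minimal sketch; the function name `minkowski` is chosen here for illustration and is not part of the worksheet.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A, B = (0, 0), (1, 1)
for p in (1, 2, 3, 4):
    # prints the order p and the distance rounded to two decimals
    print(p, round(minkowski(A, B, p), 2))
```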

c) Describe what you think distance would look like when p is very large.

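When p is very large, the largest coordinate difference dominates the sum, so the Minkowski distance approaches the Chebyshev distance $\max_i |a_i - b_i|$. For A = (0, 0) and B = (1, 1), d(A, B) = $2^{1/p}$, which decreases toward 1 as p grows.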
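
A numeric check of how the distance behaves as p grows, using the same points A and B. This is only a sketch; `minkowski` and `chebyshev` are illustrative helper names, not functions defined by the worksheet.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    """Largest absolute coordinate difference between two points."""
    return max(abs(x - y) for x, y in zip(a, b))

A, B = (0, 0), (1, 1)
for p in (1, 10, 100, 1000):
    print(p, minkowski(A, B, p))
print("max coordinate difference:", chebyshev(A, B))
```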

d) Is the Minkowski distance still a distance function when p < 1? Explain why / why not.

No. When p < 1, the Minkowski distance is not a distance function because it violates the triangle inequality.
For example, with p = 0.5, A = (0, 0), B = (1, 1), and C = (0, 1): d(A, B) = $(1 + 1)^2 = 4$, but d(A, C) + d(C, B) = 1 + 1 = 2, so d(A, B) > d(A, C) + d(C, B).
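
The triangle inequality can be tested directly for a value of p below 1. The sketch below assumes p = 0.5 and three illustrative points A, B, and C chosen for this example.

```python
def minkowski(a, b, p):
    """Minkowski-style formula of order p (not a true metric when p < 1)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A, B, C = (0, 0), (1, 1), (0, 1)
p = 0.5

direct = minkowski(A, B, p)                       # going straight from A to B
detour = minkowski(A, C, p) + minkowski(C, B, p)  # going from A to B via C

# The triangle inequality requires direct <= detour
print(direct, detour, direct <= detour)
```

If the last value printed is `False`, the detour through C is "shorter" than the direct path, which a proper distance function never allows.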

e) When would you use cosine similarity over the Euclidean distance?

Cosine similarity is used when the magnitude of the vectors is not important, but rather the orientation or direction of the vectors is what is important. It is often used in information retrieval and natural language processing tasks, where the goal is to measure the similarity of text documents or word vectors.
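
To illustrate, consider two hypothetical word-count vectors with the same proportions but very different lengths. The vectors and helper names below are assumptions made for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [1, 2, 0]    # hypothetical word counts for a short document
long_doc = [10, 20, 0]   # same word proportions, ten times longer

print(cosine_similarity(short_doc, long_doc))  # close to 1: same direction
print(euclidean(short_doc, long_doc))          # large: magnitudes differ
```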

f) What does the Jaccard distance account for that the Manhattan distance doesn't?

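Jaccard distance compares the number of elements shared by both sets to the number of unique elements in their union, so it is normalized by the size of the sets and ignores elements that are absent from both. Manhattan distance only counts the positions that differ, regardless of how many elements the sets have in common.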
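
A small sketch of the contrast, using hypothetical word sets. In both pairs below exactly two words differ, so the Manhattan distance on presence/absence vectors is 2 in each case, while the Jaccard distance changes with how much the sets share.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two collections treated as sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

def manhattan_on_sets(a, b):
    """Manhattan distance between presence/absence vectors over the union."""
    a, b = set(a), set(b)
    return sum(abs((w in a) - (w in b)) for w in a | b)

long_a = ["hello", "my", "name", "is", "Alice"]
long_b = ["hello", "my", "name", "is", "Bob"]
short_a, short_b = ["Alice"], ["Bob"]

print(manhattan_on_sets(long_a, long_b), jaccard_distance(long_a, long_b))
print(manhattan_on_sets(short_a, short_b), jaccard_distance(short_a, short_b))
```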

#### Part 2

Consider the following two sentences:

In [2]:
s1 = "hello my name is Alice"  
s2 = "hello my name is Bob"

Using the union of words from both sentences, we can represent each sentence as a vector. Each element of the vector represents the presence or absence of the word at that index.

In this example, the union of words is ("hello", "my", "name", "is", "Alice", "Bob") so we can represent the above sentences as such:

In [3]:
v1 = [1,    1, 1,   1, 1,    0]
#     hello my name is Alice
v2 = [1,    1, 1,   1, 0,    1]
#     hello my name is       Bob

Programmatically, we can do the following:

In [4]:
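corpus = [s1, s2]
# vocabulary: every distinct word in the corpus (set order is arbitrary)
all_words = list(set([item for x in corpus for item in x.split()]))
print(all_words)
# compare against whole words (s1.split()), not substrings of the sentence
v1 = [1 if x in s1.split() else 0 for x in all_words]
print(v1)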

['my', 'Bob', 'is', 'Alice', 'name', 'hello']
[1, 0, 1, 1, 1, 1]


Let's add a new sentence to our corpus:

In [5]:
s3 = "hi my name is Claude"
corpus.append(s3)

a) What is the new union of words used to represent s1, s2, and s3?

In [6]:
all_words = list(set([item for x in corpus for item in x.split()]))
print(all_words)

['my', 'hi', 'Bob', 'is', 'Alice', 'name', 'hello', 'Claude']


b) Represent s1, s2, and s3 as vectors as above, using this new set of words.

In [7]:
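# split each sentence so membership is tested on whole words, not substrings
v1 = [1 if x in s1.split() else 0 for x in all_words]
v2 = [1 if x in s2.split() else 0 for x in all_words]
v3 = [1 if x in s3.split() else 0 for x in all_words]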
print(v1)
print(v2)
print(v3)

[1, 0, 0, 1, 1, 1, 1, 0]
[1, 0, 1, 1, 0, 1, 1, 0]
[1, 1, 0, 1, 0, 1, 0, 1]


c) Write a function that computes the manhattan distance between two vectors. Which pair of vectors are the most similar under that distance function?

In [9]:
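def manhattan(x, y):
    # x and y must have the same number of dimensions
    assert len(x) == len(y), "vectors must have the same length"
    res = 0
    for i in range(len(x)):
        # sum of absolute coordinate differences
        res += abs(x[i] - y[i])
    return res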

Under this distance function, the most similar pair of vectors are v1 and v2 with a Manhattan distance of 2.
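
The pairwise distances can be checked with a short, self-contained sketch. It assumes a sorted vocabulary (so the column order is reproducible) and whole-word matching; the dictionary keys `s1`, `s2`, `s3` simply mirror the sentence names used above.

```python
from itertools import combinations

sentences = {
    "s1": "hello my name is Alice",
    "s2": "hello my name is Bob",
    "s3": "hi my name is Claude",
}

# Sorted vocabulary gives a deterministic column order
vocab = sorted({w for s in sentences.values() for w in s.split()})
vectors = {
    name: [1 if w in s.split() else 0 for w in vocab]
    for name, s in sentences.items()
}

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

distances = {
    (a, b): manhattan(vectors[a], vectors[b])
    for a, b in combinations(vectors, 2)
}
print(distances)  # {('s1', 's2'): 2, ('s1', 's3'): 4, ('s2', 's3'): 4}
```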

d) Create a matrix of all these vectors (row major) and add the following sentences in vector form:

- "hi Alice"
- "hello Claude"
- "Bob my name is Claude"
- "hi Claude my name is Alice"
- "hello Bob"

In [14]:
corpus = ["hello my name is Alice", "hello my name is Bob", "hi my name is Claude", "hi Alice", "hello Claude", "Bob my name is Claude", "hi Claude my name is Alice", "hello Bob"]
all_words = list(set([item for x in corpus for item in x.split()]))
matrix = []
for sentence in corpus:
    # match whole words, not substrings of the sentence
    vector = [1 if word in sentence.split() else 0 for word in all_words]
    matrix.append(vector)

e) How many rows and columns does this matrix have?

In [15]:
print(len(all_words))

8


The matrix has 8 rows (one per sentence) and 8 columns (one per unique word).

f) When using the Manhattan distance, which two sentences are the most similar?

In [12]:
def manhattan(x, y):
    res = 0
    for i in range(len(x)):
        res += abs(x[i] - y[i])
    return res

most_similar = (None, None, float("inf"))
for i in range(len(matrix)):
    for j in range(i+1, len(matrix)):
        distance = manhattan(matrix[i], matrix[j])
        if distance < most_similar[2]:
            most_similar = (i, j, distance)

sentence1 = corpus[most_similar[0]]
sentence2 = corpus[most_similar[1]]
print("Sentences: '{}' and '{}' are the most similar with a Manhattan distance of {}".format(sentence1, sentence2, most_similar[2]))


Sentences: 'hi my name is Claude' and 'hi Claude my name is Alice' are the most similar with a Manhattan distance of 1


Sentences 'hi my name is Claude' and 'hi Claude my name is Alice' are the most similar with a Manhattan distance of 1.

### Cost Function

Solving Data Science problems often starts by defining a metric with which to evaluate solutions, should you be able to find any. This metric is called a cost function. Data Science then works backwards and tries to find a process / algorithm that produces solutions optimizing that cost function.

For example suppose you are asked to cluster three points A, B, C into two non-empty clusters. If someone gave you the solution `{A, B}, {C}`, how would you evaluate that this is a good solution?

Notice that because the clusters need to be non-empty and all points must be assigned to a cluster, it must be that two of the three points will be together in one cluster and the third will be alone in the other cluster.

In the above solution, if A and B are closer than A and C, and B and C, then this is a good solution. The smaller the distance between the two points in the same cluster (here A and B), the better the solution. So we can define our cost function to be that distance (between A and B here)!

The algorithm / process would involve clustering together the two closest points and putting the third in its own cluster. This process optimizes for that cost function because no other pair of points could have a lower distance (although it could equal it).
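
As a rough illustration of this process, the sketch below uses hypothetical 1-D coordinates for A, B, and C, takes the distance between the paired points as the cost, and picks the pair that minimizes it.

```python
from itertools import combinations

# Hypothetical 1-D coordinates, chosen only for this example
points = {"A": 0.0, "B": 1.0, "C": 5.0}

def cost(pair):
    """Cost of a solution: distance between the two points clustered together."""
    a, b = pair
    return abs(points[a] - points[b])

# Try every possible pair and keep the one with the lowest cost
best_pair = min(combinations(points, 2), key=cost)
singleton = (set(points) - set(best_pair)).pop()

print(best_pair, singleton, cost(best_pair))  # ('A', 'B') C 1.0
```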

### K means

a) (1-dimensional clustering) Walk through Lloyd's algorithm step by step on the following dataset:

`[0, .5, 1.5, 2, 6, 6.5, 7]` (note: each of these is a 1-dimensional data point)

Given the initial centroids:

`[0, 2]`

Iteration 1, step 1: assign each data point to the nearest centroid (centroids are 0 and 2):

- 0: nearest centroid is 0
- 0.5: nearest centroid is 0 (distance 0.5 vs. 1.5)
- 1.5: nearest centroid is 2 (distance 1.5 vs. 0.5)
- 2: nearest centroid is 2
- 6: nearest centroid is 2
- 6.5: nearest centroid is 2
- 7: nearest centroid is 2

Iteration 1, step 2: recalculate each centroid as the mean of its assigned data points:

- Centroid 0: mean of [0, 0.5] = (0 + 0.5) / 2 = 0.25
- Centroid 2: mean of [1.5, 2, 6, 6.5, 7] = (1.5 + 2 + 6 + 6.5 + 7) / 5 = 4.6

Repeat steps 1 and 2 until the centroids stop changing.

Iteration 2, step 1 (centroids are 0.25 and 4.6):

- 0: nearest centroid is 0.25
- 0.5: nearest centroid is 0.25
- 1.5: nearest centroid is 0.25 (distance 1.25 vs. 3.1)
- 2: nearest centroid is 0.25 (distance 1.75 vs. 2.6)
- 6: nearest centroid is 4.6
- 6.5: nearest centroid is 4.6
- 7: nearest centroid is 4.6

Iteration 2, step 2:

- Centroid 0.25: mean of [0, 0.5, 1.5, 2] = (0 + 0.5 + 1.5 + 2) / 4 = 1
- Centroid 4.6: mean of [6, 6.5, 7] = (6 + 6.5 + 7) / 3 = 6.5

Iteration 3, step 1 (centroids are 1 and 6.5): 0, 0.5, 1.5, and 2 are still nearest to 1, and 6, 6.5, and 7 are still nearest to 6.5, so no assignment changes.

Iteration 3, step 2: the centroids stay at 1 and 6.5.

Since the centroids are no longer changing, the algorithm has converged and the final clusters are [0, 0.5, 1.5, 2] and [6, 6.5, 7].
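
The procedure can also be run mechanically. Below is a minimal 1-D sketch of Lloyd's algorithm; the function name `lloyd_1d`, the `max_iter` cap, and the choice to keep an empty cluster's old centroid are assumptions made for this illustration.

```python
def lloyd_1d(points, centroids, max_iter=100):
    """Run Lloyd's algorithm on 1-D points from the given initial centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Step 2: move each centroid to the mean of its assigned points
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Stop once the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

data = [0, .5, 1.5, 2, 6, 6.5, 7]
centroids, clusters = lloyd_1d(data, [0, 2])
print(centroids)
print(clusters)
```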

b) Describe in plain English what the cost function for k means is.

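The cost function for k-means is the sum, over all data points, of the squared distance between each point and the centroid of the cluster it is assigned to. A low cost means every point sits close to the center of its cluster.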
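
One common way to make this concrete is the sum of squared distances from each point to the mean of its cluster. The sketch below assumes 1-D points; the name `kmeans_cost` is illustrative.

```python
def kmeans_cost(clusters):
    """Sum of squared distances from each point to its cluster's mean."""
    total = 0.0
    for cluster in clusters:
        centroid = sum(cluster) / len(cluster)
        total += sum((x - centroid) ** 2 for x in cluster)
    return total

# Two different ways to split the same seven points into two clusters
tight = [[0, .5, 1.5, 2], [6, 6.5, 7]]
loose = [[0, .5], [1.5, 2, 6, 6.5, 7]]

print(kmeans_cost(tight))
print(kmeans_cost(loose))
```

The clustering with the lower value is the better one under this cost function.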

c) For the same number of clusters K, why could there be very different solutions to the K means algorithm on a given dataset?

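Because Lloyd's algorithm only finds a local minimum of the cost function, and which local minimum it reaches depends on the initial placement of the centroids. The initial centroids are not fixed for a given K, so different starting points can lead to different final assignments, some with a higher cost than others.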
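
A small experiment shows the effect of initialization. The dataset `[0, 4, 5, 9]` and the two sets of initial centroids are hypothetical, and `lloyd_1d` and `kmeans_cost` are illustrative helpers written for this sketch.

```python
def lloyd_1d(points, centroids, max_iter=100):
    """Lloyd's algorithm on 1-D points; returns final centroids and clusters."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def kmeans_cost(clusters):
    """Sum of squared distances from each point to its cluster's mean."""
    return sum(
        sum((x - sum(c) / len(c)) ** 2 for x in c)
        for c in clusters if c
    )

data = [0, 4, 5, 9]
results = {}
for init in ([0, 4], [0, 9]):
    _, clusters = lloyd_1d(data, init)
    results[tuple(init)] = (clusters, kmeans_cost(clusters))
    print(init, clusters, kmeans_cost(clusters))
```

Both runs use K = 2 on the same data, yet they stop at different clusterings with different costs.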

d) Does Lloyd's Algorithm always converge? Why / why not?

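Yes, it always converges, although not necessarily to the best solution.
Neither the assignment step nor the centroid update step can increase the cost function, and there are only finitely many ways to partition the data points into K clusters, so the algorithm must stop after a finite number of iterations. The solution it stops at may be a local minimum rather than the global one, and it depends on the initial centroids.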
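
One way to probe convergence empirically is to record the cost after every assignment step and check whether it ever increases. The sketch below assumes a squared-distance cost on 1-D points; `lloyd_cost_trace` is a hypothetical helper name.

```python
def lloyd_cost_trace(points, centroids, max_iter=100):
    """Run Lloyd's algorithm and return the cost after each assignment step."""
    centroids = list(centroids)
    trace = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point
        labels = [
            min(range(len(centroids)), key=lambda i: (x - centroids[i]) ** 2)
            for x in points
        ]
        trace.append(sum((x - centroids[l]) ** 2 for x, l in zip(points, labels)))
        # Update step: mean of each cluster (an empty cluster keeps its centroid)
        new_centroids = []
        for i in range(len(centroids)):
            members = [x for x, l in zip(points, labels) if l == i]
            new_centroids.append(sum(members) / len(members) if members else centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return trace

trace = lloyd_cost_trace([0, .5, 1.5, 2, 6, 6.5, 7], [0, 2])
print(trace)
# True when the cost never goes up from one iteration to the next
print(all(later <= earlier for earlier, later in zip(trace, trace[1:])))
```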