In [4]:
import numpy
import scipy
import scipy.sparse
import sklearn.metrics.pairwise

# 5 Distance and Similarity

## 5.1 Distance




A distance metric quantifies how far apart two points $x$ and $y$ are.

Some frequently used distance metrics include:
- Euclidean distance: $$D_{euc}(x, y) = \left(\sum_{i=1}^{n}(x_{i}-y_{i})^2\right)^{0.5}$$
- Manhattan distance: $$D_{manh}(x, y) = \sum_{i=1}^{n} \left|x_{i}-y_{i}\right|$$

Fun note: If you have run into the concepts elsewhere, the Euclidean distance between $x$ and $y$ is equivalent to the L2 norm of the difference vector $x - y$, and the Manhattan distance corresponds to the L1 norm of $x - y$.
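
The link between distances and norms can be checked numerically. The sketch below is only an illustration: it assumes `numpy.linalg.norm` with its `ord` argument (not used elsewhere in this notebook) and compares the Manhattan distance, written out as a sum, with the L1 norm of the difference $x - y$.

```python
import numpy

x = numpy.array([1, 3])
y = numpy.array([2, 4])

# Manhattan distance written out as a sum of absolute differences
manhattan_sum = numpy.sum(numpy.absolute(x - y))

# the same quantity expressed as the L1 norm of the difference vector
manhattan_norm = numpy.linalg.norm(x - y, ord=1)

print(manhattan_sum, manhattan_norm)
```

Both values are equal. The same pattern applies to the Euclidean distance and the L2 norm.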

In general, distance metrics satisfy some properties such as:
- The distance between two points is non-negative. 
- The distance from a point to the point itself is equal to 0. 
- The distance from point $i$ to point $j$ is exactly equal to the distance from point $j$ to point $i$ (symmetry). 
- The distance from point $i$ to point $j$ via point $k$ is never shorter than the direct distance from $i$ to $j$ (triangle inequality). 

In [5]:
# function to calculate manhattan distance between two points x and y
def manhattan_distance(x, y):
    diff = x - y
    absolute_diff = numpy.absolute(diff)
    dist = numpy.sum(absolute_diff)
    return dist
        
x = numpy.array([1,3])
y = numpy.array([2,4])
print("Manhattan distance: ", manhattan_distance(x, y))

Manhattan distance:  2
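
The properties of a distance metric listed earlier can be spot-checked for the Manhattan distance. This is a minimal sketch, not a proof: the helper `manhattan` and the three points are chosen here purely for illustration, and a check on a few points only shows that the properties hold for those points.

```python
import numpy

def manhattan(x, y):
    # sum of absolute coordinate differences
    return numpy.sum(numpy.absolute(x - y))

# three illustrative points (chosen arbitrarily for this check)
i = numpy.array([1, 3])
j = numpy.array([2, 4])
k = numpy.array([1, 6])

non_negative = manhattan(i, j) >= 0
identity = manhattan(i, i) == 0
symmetric = manhattan(i, j) == manhattan(j, i)
# going from i to j via k is never shorter than going directly
triangle = manhattan(i, j) <= manhattan(i, k) + manhattan(k, j)

print(non_negative, identity, symmetric, triangle)
```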


## Exercise 5.1.1

Create a function to calculate the Euclidean distance between two points x and y. 
Make sure that you do not use loops!

For example:
```python
a = numpy.array([0.0, 0.0])
b = numpy.array([1.0, 1.0])
print(euclidean(a, b))
```
Output:
```
1.41421356237
```

## Exercise 5.1.2

Load the iris data into a 150x4 numpy array. Compute the Euclidean distance between the first row and each other row.

### 5.1.1 Distance for categorical data

The most common way to compute distance across categorical dimensions is based on Jaccard's coefficient. The coefficient is the ratio between the number of features shared by two objects (the intersection) and the total number of features across the two objects (the union), so objects that share more features are more similar. The corresponding distance is one minus the coefficient.
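
As a rough sketch of this idea, the function below treats each object as a Python set of features. The name `jaccard_coefficient`, the example feature sets, and the convention for two empty sets are assumptions made for this illustration.

```python
def jaccard_coefficient(a, b):
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # convention chosen here: two empty feature sets count as identical
        return 1.0
    # size of the intersection divided by size of the union
    return len(a & b) / len(union)

x = {"red", "round", "small"}
y = {"red", "round", "large"}

similarity = jaccard_coefficient(x, y)
distance = 1 - similarity
print(similarity, distance)
```

The two objects share 2 of 4 distinct features, so both the coefficient and the derived distance are 0.5.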

### 5.1.2 Distance for String data

Distance measures for comparing strings with each other:
- Hamming distance: number of positions at which two strings of the same length differ.
    $$D_{hamming}(x, y) =  \sum_{i=1}^{n} x_{i} \neq y_{i} $$
- Levenshtein distance: the minimum number of operations required to transform one string into another. The operations are insertion, deletion and substitution, and each operation has a cost (often unit costs are used). The Levenshtein distance is used in many text mining techniques where the similarity between two words needs to be measured. 
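
To make the Levenshtein distance concrete, here is a minimal sketch using the standard dynamic-programming approach with unit costs. The function name `levenshtein` and the example words are chosen for illustration and are not part of the course material.

```python
def levenshtein(s, t):
    # previous[j] is the distance between the prefix of s processed so far
    # and the first j characters of t; initially s's prefix is empty
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # turning i characters into an empty string costs i deletions
        for j, ct in enumerate(t, start=1):
            substitution = previous[j - 1] + (cs != ct)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```

Unlike the Hamming distance, this works for strings of different lengths.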

## Exercise 5.1.3
Create your own function to calculate the Hamming distance between two strings. The strings need to have the same length.

For example:

```python
a = 'Panama'
b = 'Pamela'
hamming(a, b)
```
Output:
```
3
```

## 5.2 Similarity

A similarity metric quantifies how similar two objects are. Often, similarity is coded as a non-linear transformation of the inverse of the distance between two objects. This is intuitive from the definitions of the concepts: as the distance between two objects increases (all other things being equal), the similarity should decrease, and vice versa.

A very common similarity metric in data science is the cosine similarity. This measure treats $x$ (a point in an $n$-dimensional space) as a vector of length $n$. The cosine similarity between $x$ and $y$ is the dot product between $x$ and $y$ divided by the product of the L2 norms of $x$ and $y$. By normalizing by the length of the vectors (dividing by the L2 norms), the cosine similarity reflects the angle between the two vectors in high-dimensional space. When the cosine similarity is equal to 1, the vectors point in exactly the same direction; when it is equal to -1, they point in opposite directions; and perpendicular vectors have a cosine similarity of 0.

The formal definition of cosine similarity is: $$cosine(x, y) = \frac{x^Ty}{||x||\cdot ||y||} = \frac{\sum_{i=1}^n x_i y_i}{\left(\sum_{i=1}^n x_{i}^2\right)^{0.5} \left(\sum_{i=1}^n y_{i}^2\right)^{0.5}}$$
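
As a worked example of the formula, take $x = (1, 0, -1)$ and $y = (-1, -1, 0)$, the same vectors used in the scipy example in section 5.3.2:

$$cosine(x, y) = \frac{(1)(-1) + (0)(-1) + (-1)(0)}{\left(1^2 + 0^2 + (-1)^2\right)^{0.5} \left((-1)^2 + (-1)^2 + 0^2\right)^{0.5}} = \frac{-1}{2^{0.5} \cdot 2^{0.5}} = -0.5$$

The negative value indicates that the angle between the two vectors is larger than 90 degrees.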

In text mining, the cosine similarity can be used to represent the similarity between two documents.
Each document is represented by a vector that has one dimension for each word in the corpus. The value in each dimension is the frequency of the corresponding word in the document.

The cosine similarity between two documents thus represents how similar they are in the set of words they contain. Documents that contain many of the same words are said to be more similar than documents with hardly any overlapping words.



### Exercise 5.2.1

Create a function `cosine_similarity` that calculates the cosine similarity between two vectors. Do not use for-loops, Scipy, or Scikit-learn.

For example:
```python
a = numpy.array([0.0, 1.0])
b = numpy.array([1.0, 0.0])
c = numpy.array([0.0, -1.0])
print(cosine_similarity(a, b))
print(cosine_similarity(a, a))
print(cosine_similarity(a, c))
```
Output:
```
0.0
1.0
-1.0
```


## 5.3 Using scikit-learn to compute distances and similarity

In most cases, there is no need to write your own function to calculate distances.
The `DistanceMetric` class in the `scikit-learn` library will calculate most distances for you.

Scikit-learn is a library for machine learning algorithms in Python. Scikit-learn includes functions for classification, regression, clustering, dimensionality reduction, model selection and preprocessing. 
For more information:  http://scikit-learn.org/stable/

### 5.3.1 Distance measures in scipy and scikit-learn

The class to calculate distances is in the module "neighbors" of the scikit-learn library.
One can access the class via: `sklearn.neighbors.DistanceMetric`
Note that you first have to import the library `sklearn` and more specifically you have to import `sklearn.neighbors`.
In recent scikit-learn releases (1.0 and later) the class is also exposed as `sklearn.metrics.DistanceMetric`, and the `sklearn.neighbors` alias may no longer be available, so check the documentation for your installed version.

The `DistanceMetric` class includes different distance metrics, such as Euclidean and Manhattan. Other metrics are also available, such as chebyshev, minkowski, mahalanobis, haversine, hamming, canberra, braycurtis and many more. 

For more information about the distance metrics that are included in the `DistanceMetric` class:  http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html

The most important methods of the `DistanceMetric` class are `get_metric()` and `pairwise()`. 
The `get_metric` method takes the name of the metric to use, for example 'euclidean' or 'manhattan', and returns the corresponding metric object. 
The `pairwise` method of that object calculates the pairwise distances between the points in an array x. 
See the examples below.

There is also a simple function to calculate pairwise distances according to a number of metrics:
```python
from sklearn.metrics.pairwise import pairwise_distances
```

In [8]:
import sklearn.neighbors
from sklearn.metrics.pairwise import pairwise_distances

# we want to calculate the pairwise distance between three points in a 2D space
x = numpy.array([[1,3], [2, 4], [1, 6]])

# Euclidean distance
d1 = sklearn.neighbors.DistanceMetric.get_metric('euclidean')
print("Euclidean distance:")
print(d1.pairwise(x))
print()
print(pairwise_distances(x, metric='euclidean'))
print()

# Manhattan distance
d2 = sklearn.neighbors.DistanceMetric.get_metric('manhattan')
print("Manhattan distance:")
print(d2.pairwise(x))
print()
print(pairwise_distances(x, metric='manhattan'))

Euclidean distance:
[[ 0.          1.41421356  3.        ]
 [ 1.41421356  0.          2.23606798]
 [ 3.          2.23606798  0.        ]]

[[ 0.          1.41421356  3.        ]
 [ 1.41421356  0.          2.23606798]
 [ 3.          2.23606798  0.        ]]

Manhattan distance:
[[ 0.  2.  3.]
 [ 2.  0.  3.]
 [ 3.  3.  0.]]

[[ 0.  2.  3.]
 [ 2.  0.  3.]
 [ 3.  3.  0.]]
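
The section title also mentions scipy. As a hedged sketch, the same Manhattan distance matrix can be obtained with `scipy.spatial.distance.pdist` and `squareform`. These two functions do not appear elsewhere in this notebook, and note that scipy calls the Manhattan metric `'cityblock'`.

```python
import numpy
from scipy.spatial.distance import pdist, squareform

# the same three points in a 2D space as in the scikit-learn example
x = numpy.array([[1, 3], [2, 4], [1, 6]])

# pdist returns the distances for each pair of rows in condensed (1-D) form
condensed = pdist(x, metric='cityblock')

# squareform expands the condensed form into a symmetric matrix
manhattan_matrix = squareform(condensed)
print(manhattan_matrix)
```

The resulting matrix holds the same values as the Manhattan output of the scikit-learn functions.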


### 5.3.2 Cosine similarity in scipy and scikit-learn

The cosine similarity is implemented in both the scipy library and the scikit-learn library.

In scipy:
`scipy.spatial.distance.cosine(x,y)` calculates the cosine distance between two points `x` and `y` where `x` and `y` are both represented by a 1-dimensional vector.
Note that in order to convert the distance to a similarity, you should subtract the distance from 1.

In scikit-learn:
`sklearn.metrics.pairwise.cosine_similarity(x)` calculates the pairwise cosine similarities between all points in array `x`.


**IMPORTANT** Verify, and remember, which of these implementations work with sparse matrices. 

In [9]:
# cosine similarity in scipy 
import scipy.spatial.distance
x = [1, 0, -1]
y = [-1,-1, 0]
print(1 - scipy.spatial.distance.cosine(x, y))

-0.5


In [10]:
# pairwise cosine similarity in sklearn
import sklearn.metrics.pairwise
x = numpy.array([[1, 0, -1], [-1,-1, 0]])
print(sklearn.metrics.pairwise.cosine_similarity(x))

[[ 1.  -0.5]
 [-0.5  1. ]]


### Exercise 5.3.1

Use the function `word_count` which you implemented last week to create a document-term matrix from the first 1000 rows of file [coco_val.txt](coco_val.txt). 
- Does your cosine function work on rows of the sparse document-term matrix? If not, do you know why not?
- Compute the pairwise cosine similarity between the rows of your document-term matrix using the `sklearn` implementation. Which document is the most similar to the first row according to cosine similarity?  
- Based on the above experiment, suggest ways of modifying the word counts to make the cosine similarity more useful as a metric of text similarity. Implement your idea and test it.

In [3]:
N = 1000
text = []

with open("coco_val.txt") as myfile:
    for index in range(N):
        line = next(myfile)
        text.append(line.split())
        
# vocab, M = word_count(text)


### Exercise 5.3.2

Scikit-learn has a couple of classes which are useful for creating various versions of document-word matrices:

- `sklearn.feature_extraction.text.CountVectorizer`
- `sklearn.feature_extraction.text.TfidfVectorizer`

Read the documentation of these classes and try to apply them to the [coco_val.txt](coco_val.txt) data. Compute similarities/distances using a few versions of these document-word matrices and check how they compare to using plain word counts as we have been doing so far. 