# MSBD5001 Foundations of Data Analytics

### Fall 2020

## Tutorial: Similarity Measures

This tutorial walks through several Python packages that implement common similarity and distance measures.
We can also implement these measures ourselves when a package is unavailable or we want to see how they work.

### NLTK -- Natural Language Toolkit
<strong>NLTK -- Natural Language Toolkit</strong> provides edit distance, Jaccard distance, and related measures.
<url>http://www.nltk.org/index.html</url>

To install:
<pre>pip install nltk</pre>

In [None]:
%pip install nltk

In [None]:
from nltk.metrics import distance
distance.edit_distance("dave", "dav")

1
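As noted at the start of the tutorial, these measures can also be implemented by hand. The following is a minimal sketch of the classic dynamic-programming edit distance, assuming unit cost for insertion, deletion, and substitution and no transpositions. The function name `edit_distance_dp` is introduced here for illustration and is not part of NLTK.

```python
def edit_distance_dp(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance_dp("dave", "dav"))  # 1
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of the second string.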

In [None]:
from nltk.metrics import distance
s1 = "who are you"
s2 = "how are you"
distance.jaccard_distance(set(s1.split()), set(s2.split()))

0.5
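The Jaccard distance above is one minus the ratio of shared items to total distinct items. A hand-written sketch follows. The name `jaccard_distance_sets` is hypothetical, and returning 0.0 for two empty sets is an assumption made here, not documented NLTK behaviour.

```python
def jaccard_distance_sets(a: set, b: set) -> float:
    """Jaccard distance: (|A ∪ B| - |A ∩ B|) / |A ∪ B|."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 0.0
    return (len(union) - len(a & b)) / len(union)


s1 = "who are you"
s2 = "how are you"
# Two of the four distinct words are shared, so the distance is 2/4.
print(jaccard_distance_sets(set(s1.split()), set(s2.split())))  # 0.5
```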

Word N-grams

In [None]:
from nltk.util import ngrams
from nltk.metrics import distance
s1 = "who are you"
s2 = "how are you"
wbigrams1 = set(ngrams(s1.split(), 2))
wbigrams2 = set(ngrams(s2.split(), 2))
print("Set1:", wbigrams1)
print("Set2:", wbigrams2)
distance.jaccard_distance(wbigrams1, wbigrams2)

Set1: {('who', 'are'), ('are', 'you')}
Set2: {('are', 'you'), ('how', 'are')}


0.6666666666666666

Character N-grams

In [None]:
from nltk.util import ngrams
from nltk.metrics import distance
text1 = 'pedro'
text2 = "peter"
bigrams1 = set(ngrams(text1, 2))
bigrams2 = set(ngrams(text2, 2))
print("Set1:", bigrams1)
print("Set2:", bigrams2)
distance.jaccard_distance(bigrams1, bigrams2)

Set1: {('p', 'e'), ('d', 'r'), ('r', 'o'), ('e', 'd')}
Set2: {('p', 'e'), ('e', 'r'), ('e', 't'), ('t', 'e')}


0.8571428571428571
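The word and character n-gram examples can be reproduced without NLTK. The sketch below uses a hypothetical helper, `ngram_set`, that slides a window of length `n` over any sequence (a string for character n-grams, a list of tokens for word n-grams) and then computes the Jaccard distance inline.

```python
def ngram_set(seq, n):
    """Return the set of n-grams (as tuples) found in a string or list."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def ngram_jaccard_distance(seq1, seq2, n=2):
    """Jaccard distance between the n-gram sets of two sequences."""
    grams1, grams2 = ngram_set(seq1, n), ngram_set(seq2, n)
    union = grams1 | grams2
    if not union:
        return 0.0  # assumption: no n-grams on either side counts as identical
    return (len(union) - len(grams1 & grams2)) / len(union)


# Character bigrams: only ('p', 'e') is shared out of 7 distinct bigrams.
print(ngram_jaccard_distance("pedro", "peter"))
# Word bigrams: pass token lists instead of strings.
print(ngram_jaccard_distance("who are you".split(), "how are you".split()))
```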

### Python-Levenshtein

<strong>Python-Levenshtein</strong> provides edit distance, Hamming distance, Jaro and Jaro-Winkler similarity, and related measures.
<url>https://pypi.python.org/pypi/python-Levenshtein</url>

To install, run in a terminal:
<pre>pip install python-levenshtein</pre>
or
<pre>conda install -c conda-forge python-levenshtein</pre>

In [None]:
%pip install python-levenshtein

In [None]:
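# Import only the functions we use instead of a wildcard import.
from Levenshtein import distance, hamming, jaro, jaro_winkler

# Edit (Levenshtein) distance: minimum number of single-character edits.
edit_dist = distance("abc", "abd")
print(edit_dist)

# Hamming distance: number of differing positions (equal-length strings).
hamming_dist = hamming("abc", "abd")
print(hamming_dist)

# jaro() and jaro_winkler() return similarities in [0, 1], not distances.
jaro_sim = jaro("abc", "abd")
print(jaro_sim)

jaro_winkler_sim = jaro_winkler("abc", "abd")
print(jaro_winkler_sim)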

1
1
0.7777777777777777
0.8222222222222222
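To see where the last two numbers come from, here is a hand-written sketch of the textbook Jaro and Jaro-Winkler similarities. It assumes the usual conventions (matching window of half the longer length minus one, a common prefix capped at 4 characters, prefix weight 0.1). The function names are illustrative, and the library may differ in edge cases.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro_similarity(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return j + prefix * prefix_weight * (1 - j)


print(jaro_similarity("abc", "abd"))          # about 0.7778
print(jaro_winkler_similarity("abc", "abd"))  # about 0.8222
```

For "abc" and "abd" there are 2 matches and no transpositions, so Jaro is (2/3 + 2/3 + 2/2) / 3. The shared prefix "ab" has length 2, so Jaro-Winkler adds 2 × 0.1 × (1 − Jaro).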


### FuzzyWuzzy

<strong>FuzzyWuzzy</strong> provides fuzzy string matching scores (0 to 100) based on Levenshtein distance.
<url>https://pypi.python.org/pypi/fuzzywuzzy</url>

To install, run in a terminal:
<pre>pip install fuzzywuzzy</pre>

In [None]:
%pip install fuzzywuzzy

In [None]:
from fuzzywuzzy import fuzz
fuzz.ratio("abc", "abd")

67
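The score of 67 can be reproduced with the standard library. The sketch below assumes that a simple ratio of this kind is `2 * M / T` scaled to 0-100, where `M` is the number of matching characters and `T` is the total length of both strings. The name `simple_ratio` is hypothetical, and FuzzyWuzzy's own preprocessing and backend may differ.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity score from 0 to 100 based on difflib's matching blocks."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# "ab" matches (M = 2) and the combined length is T = 6, so 2 * 2 / 6 = 0.667.
print(simple_ratio("abc", "abd"))  # 67
```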

### Scikit-learn

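To install, run in a terminal (the PyPI package is named <strong>scikit-learn</strong>; it is imported as <strong>sklearn</strong>):
<pre>pip install scikit-learn</pre>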

The <strong>sklearn.metrics.pairwise</strong> submodule implements utilities to evaluate pairwise distances.
<url>https://scikit-learn.org/stable/modules/metrics.html#metrics</url>

In [None]:
from sklearn.metrics import pairwise as p
X = [[0, 1], [1, 1]]
Y = [[0, 1], [2, 1]]
# Row-wise distances: X[0] vs Y[0], X[1] vs Y[1]. `metric` is passed by keyword.
p.paired_distances(X, Y, metric="euclidean")

array([0., 1.])

In [None]:
p.paired_distances(X, Y, metric="cosine")

array([0.       , 0.0513167])

In [None]:
p.cosine_similarity(X, Y)

array([[1.        , 0.4472136 ],
       [0.70710678, 0.9486833 ]])

In [None]:
p.cosine_distances(X, Y)

array([[0.        , 0.5527864 ],
       [0.29289322, 0.0513167 ]])
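The entries of these matrices can be checked by hand. The following sketch computes cosine similarity as the dot product divided by the product of the vector norms, using only the standard library. The function names are illustrative, and the code assumes neither vector is all zeros.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity: (u . v) / (||u|| * ||v||). Assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_sim_matrix(X, Y):
    """Entry [i][j] is the cosine similarity between X[i] and Y[j]."""
    return [[cosine_sim(x, y) for y in Y] for x in X]


X = [[0, 1], [1, 1]]
Y = [[0, 1], [2, 1]]
for row in cosine_sim_matrix(X, Y):
    print([round(value, 7) for value in row])
```

Cosine distance is one minus each of these values, which is how the second matrix above follows from the first.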

The <strong>sklearn.neighbors.DistanceMetric</strong> class provides a uniform interface to many distance metrics, selected by name.
<url>https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#</url>

In [None]:
from sklearn.neighbors import DistanceMetric
d = DistanceMetric.get_metric('jaccard')
X = [[1, 0, 0, 0], [1, 1, 0, 1]]
d.pairwise(X)

array([[0.        , 0.66666667],
       [0.66666667, 0.        ]])
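For boolean vectors, the Jaccard distance counts positions where exactly one vector is non-zero and divides by positions where at least one is non-zero. A hand-written sketch follows. The name `jaccard_distance_bool` is hypothetical, and returning 0.0 for two all-zero vectors is an assumption made here.

```python
def jaccard_distance_bool(x, y):
    """Jaccard distance between two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    if either == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    return (either - both) / either


X = [[1, 0, 0, 0], [1, 1, 0, 1]]
# One position is set in both vectors and three in at least one: (3 - 1) / 3.
print(jaccard_distance_bool(X[0], X[1]))
```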