In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
%matplotlib inline

# Text Similarity

Text similarity can be analyzed in several ways. Broadly, the intent of a similarity analysis falls into the following two areas:

- Lexical similarity
  - This involves observing the contents of the text documents with regard to **syntax**, **structure**, and content and measuring their similarity based on these parameters.
- Semantic similarity
  - This involves trying to find out the **semantics**, **meaning**, and context of the documents and then trying to see how close they are to each other. Dependency grammars and entity recognition are handy tools that can help in this.



### Semantic similarity

Some aspects of **semantic similarity** can also be captured with simple models such as Bag of Words.

Distance metrics are usually used to compute similarity scores between text entities. This notebook mainly covers the following two broad areas of text similarity.

Using Bag of Words (BOW):
- term similarity
- document similarity

# Analyzing Term Similarity


We will start with analyzing term similarity or, to be more precise, similarity between individual word tokens.

References:
- [tokenization](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)


Usage:
- Applications such as autocompleters, spell checkers, and spelling correctors use some of these techniques to correct misspelled terms.
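
As a rough illustration of the spell-correction use case, the sketch below uses `difflib.get_close_matches` from the Python standard library. Note that it ranks candidates with `difflib`'s own similarity ratio, not with any of the distance metrics covered in this notebook, and that the tiny `vocabulary` list and the `suggest` helper are made up for the example.

```python
import difflib

# hypothetical mini-vocabulary, for illustration only
vocabulary = ['believe', 'bargain', 'elephant']

def suggest(term, candidates, n=1):
    # return up to n candidates that difflib considers close to the term
    return difflib.get_close_matches(term.lower(), candidates, n=n)

suggestions = suggest('beleive', vocabulary)
print(suggestions)
```

A real corrector would typically work against a much larger dictionary and combine the similarity score with word frequencies.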


The **word representations** we will be using are as follows:
- Character vectorization
- Bag of Characters vectorization


## Character Vectorization


In [8]:
import numpy as np
from scipy.stats import itemfreq

In [9]:
def vectorize_terms(terms):
    # lower-case each term so that the comparison is case-insensitive
    terms = [term.lower() for term in terms]
    # map every character to its Unicode code point
    terms = [np.array([ord(char) for char in term]) for term in terms]
    return terms

In [10]:
root = 'Believe'
term1 = 'beleive'
term2 = 'bargain'
term3 = 'Elephant'    

terms = [root, term1, term2, term3]

vec_root, vec_term1, vec_term2, vec_term3 = vectorize_terms(terms)
print ('''
root: {}
term1: {}
term2: {}
term3: {}
'''.format(vec_root, vec_term1, vec_term2, vec_term3))


root: [ 98 101 108 105 101 118 101]
term1: [ 98 101 108 101 105 118 101]
term2: [ 98  97 114 103  97 105 110]
term3: [101 108 101 112 104  97 110 116]



In [11]:
print('b = {}, e = {}'.format(ord('b'),ord('e')))

b = 98, e = 101


## Bag of Characters vectorization

**Bag of Characters** vectorization is very similar to the **Bag of Words** model except here we compute the frequency of each character in the word. 

The order of the characters within the word is not taken into account.

The following function helps in computing this:

In [14]:
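def boc_term_vectors(word_list):
    # lower-case so that 'B' and 'b' count as the same character
    word_list = [word.lower() for word in word_list]
    # sorted vocabulary of every character seen in any word
    unique_chars = np.unique(np.hstack([list(word) for word in word_list]))
    # per-word character counts; np.unique(..., return_counts=True) is used here
    # because scipy.stats.itemfreq is no longer available in recent SciPy releases
    word_list_term_counts = [dict(zip(*np.unique(list(word), return_counts=True)))
                             for word in word_list]
    # one count vector per word, aligned with unique_chars
    boc_vectors = [np.array([int(word_term_counts.get(char, 0)) for char in unique_chars])
                   for word_term_counts in word_list_term_counts]
    return list(unique_chars), boc_vectors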

In [15]:
root = 'Believe'
term1 = 'beleive'
term2 = 'bargain'
term3 = 'Elephant'    

terms = [root, term1, term2, term3]

features, (boc_root, boc_term1, boc_term2, boc_term3) = boc_term_vectors(terms)
print ('Features:', features)
print ('''
root: {}
term1: {}
term2: {}
term3: {}
'''.format(boc_root, boc_term1, boc_term2, boc_term3))

Features: ['a', 'b', 'e', 'g', 'h', 'i', 'l', 'n', 'p', 'r', 't', 'v']

root: [0 1 3 0 0 1 1 0 0 0 0 1]
term1: [0 1 3 0 0 1 1 0 0 0 0 1]
term2: [2 1 0 1 0 1 0 1 0 1 0 0]
term3: [1 0 2 0 1 0 1 1 1 0 1 0]
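
The vectors for `root` and `term1` are identical because the model discards character order. A minimal sketch with `collections.Counter` makes this visible. The `bag_of_chars` helper is a hypothetical stand-in for the notebook's `boc_term_vectors`, used only for illustration.

```python
from collections import Counter

def bag_of_chars(word):
    # count each character; the position of the characters is discarded
    return Counter(word.lower())

same_bag = bag_of_chars('Believe') == bag_of_chars('beleive')
different_bag = bag_of_chars('Believe') == bag_of_chars('bargain')
print(same_bag, different_bag)
```

A transposition such as "ie" to "ei" is therefore invisible to any metric computed on Bag of Characters vectors.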



# Distance :

- Hamming distance
- Manhattan distance
- Euclidean distance
- Levenshtein distance
- Cosine distance and similarity

References:
- https://www.slideshare.net/smanjunath1/similarity-measures
- https://www.slideshare.net/khusuma/distance-function
- https://medium.com/@adriensieg/text-similarities-da019229c894
- http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/


# Hamming Distance

Let u and v be two terms of equal length n.

Hamming distance:
$$
\large hd(u,v) = \sum_{i=1}^{n}(u_i \ne v_i)
$$

Normalized Hamming Distance :
$$
\large norm\_hd(u,v) = \frac{\sum_{i=1}^{n}{(u_i \ne v_i)}}{n}
$$

References:
- https://en.wikipedia.org/wiki/Hamming_distance


The Hamming distance between:
- "karolin" and "kathrin" is 3.
- "karolin" and "kerstin" is 3.
- 1011101 and 1001001 is 2.
- 2173896 and 2233796 is 3.
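
The examples in the list above can be checked with a few lines of plain Python. This is a minimal sketch that works directly on strings. The `hamming` helper is a hypothetical name and is separate from the NumPy-based function defined later in the notebook.

```python
def hamming(a, b):
    # count the positions at which two equal-length sequences differ
    if len(a) != len(b):
        raise ValueError('The sequences must have equal lengths.')
    return sum(x != y for x, y in zip(a, b))

pairs = [('karolin', 'kathrin'), ('karolin', 'kerstin'),
         ('1011101', '1001001'), ('2173896', '2233796')]
results = [hamming(a, b) for a, b in pairs]
print(results)
```

Dividing each result by the common length of the pair gives the normalized Hamming distance.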


Two example distances: 100→011 has distance 3; 010→111 has distance 2 :

[![3-bit binary cube Hamming distance examples](https://upload.wikimedia.org/wikipedia/commons/thumb/6/6e/Hamming_distance_3_bit_binary_example.svg/140px-Hamming_distance_3_bit_binary_example.svg.png)](https://en.wikipedia.org/wiki/File:Hamming_distance_3_bit_binary_example.svg)


In [33]:
def hamming_distance(u, v, norm=False):
    if u.shape != v.shape:
        raise ValueError('The vectors must have equal lengths.')
    return (u != v).sum() if not norm else (u != v).mean()

In [34]:
root_term = root
root_vector = vec_root
root_boc_vector = boc_root

terms = [term1, term2, term3]
vector_terms = [vec_term1, vec_term2, vec_term3]
boc_vector_terms = [boc_term1, boc_term2, boc_term3]

print('root_vector={}'.format(root_vector))
print('vec_term1  ={}'.format(vec_term1))

root_vector=[ 98 101 108 105 101 118 101]
vec_term1  =[ 98 101 108 101 105 118 101]


### hamming distance 

In [27]:
# HAMMING DISTANCE DEMO
for term, vector_term in zip(terms, vector_terms):
    print ('Hamming distance between root: {} and term: {} is {}'.
           format(root_term, 
                  term,
                  hamming_distance(root_vector, vector_term, norm=False)
                 ))

Hamming distance between root: Believe and term: beleive is 2
Hamming distance between root: Believe and term: bargain is 6


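ValueError: The vectors must have equal lengths.

The error for the third term is expected: "Elephant" has eight characters while "Believe" has seven, and the Hamming distance is defined only for vectors of equal length.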

###  hamming distance (normalized)

In [28]:
for term, vector_term in zip(terms, vector_terms):
    print ('Normalized Hamming distance between root: {} and term: {} is {}'.
           format(root_term,
                  term,
                  round(hamming_distance(root_vector, vector_term, norm=True), 2)
                 ))

Normalized Hamming distance between root: Believe and term: beleive is 0.29
Normalized Hamming distance between root: Believe and term: bargain is 0.86


ValueError: The vectors must have equal lengths.

# Manhattan Distance

The Manhattan distance sums the absolute differences between the characters at each position of the two strings.

Formally, the Manhattan distance is also known as the city block distance, L1 norm, or taxicab metric. It is defined as the distance between two points in a grid along strictly horizontal or vertical paths, rather than the diagonal (straight-line) distance that the Euclidean metric measures.

$$
\Large md(u,v) = || u - v ||_1 = \sum_{i=1}^{n}{|u_i - v_i|}
$$

Normalized :

$$
\Large norm\_md(u,v) = \frac{|| u - v ||_1}{n} = \frac{\sum_{i=1}^{n}{|u_i - v_i|} }{n}
$$
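
As a worked example of both formulas, the sketch below applies them to the character vectors of "believe" and "beleive" (the code points printed earlier in the notebook). The variable names are chosen only for this illustration.

```python
import numpy as np

# code points of 'believe' and 'beleive'
u = np.array([98, 101, 108, 105, 101, 118, 101])
v = np.array([98, 101, 108, 101, 105, 118, 101])

abs_diff = np.abs(u - v)     # per-position absolute differences
md = abs_diff.sum()          # Manhattan distance
norm_md = abs_diff.mean()    # normalized by the vector length n
print(abs_diff, md, round(norm_md, 2))
```

Only the two swapped positions contribute, each with |105 - 101| = 4, so the distance is 8 and the normalized distance is 8/7.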

References:
- [https://www.sciencedirect.com/topics/engineering/manhattan-distance](https://www.sciencedirect.com/topics/engineering/manhattan-distance)
  - example:
    ![image](https://ars.els-cdn.com/content/image/3-s2.0-B9780128038185000093-f09-18-9780128038185.jpg?_)
- L1 norm vs. L2 norm 
  - http://www.chioka.in/differences-between-the-l1-norm-and-the-l2-norm-least-absolute-deviations-and-least-squares/

In [41]:
def manhattan_distance(u, v, norm=False):
    if u.shape != v.shape:
        raise ValueError('The vectors must have equal lengths.')
    return abs(u - v).sum() if not norm else abs(u - v).mean()

In [42]:
root_term = root
root_vector = vec_root
root_boc_vector = boc_root

terms = [term1, term2, term3]
vector_terms = [vec_term1, vec_term2, vec_term3]
boc_vector_terms = [boc_term1, boc_term2, boc_term3]

print('root_vector={}'.format(root_vector))
print('vec_term1  ={}'.format(vec_term1))

root_vector=[ 98 101 108 105 101 118 101]
vec_term1  =[ 98 101 108 101 105 118 101]


### manhattan distance

In [43]:
for term, vector_term in zip(terms, vector_terms):
    print('Manhattan distance between root: {} and term: {} is {}'.
          format(root_term,
                 term,
                 manhattan_distance(root_vector, vector_term, norm=False)
                ))

Manhattan distance between root: Believe and term: beleive is 8
Manhattan distance between root: Believe and term: bargain is 38


ValueError: The vectors must have equal lengths.

### manhattan distance (normalized)

In [44]:
for term, vector_term in zip(terms, vector_terms):
    print ('Normalized Manhattan distance between root: {} and term: {} is {}'.
           format(root_term,
                  term,
                  round(manhattan_distance(root_vector, vector_term, norm=True),2)
                 ))

Normalized Manhattan distance between root: Believe and term: beleive is 1.14
Normalized Manhattan distance between root: Believe and term: bargain is 5.43


ValueError: The vectors must have equal lengths.

# Euclidean Distance

The Euclidean distance is also known as the Euclidean norm, L2 norm, or L2 distance, and is defined as the shortest straight-line distance between two points.

Mathematically, this can be denoted as:
$$
ed(u,v) = || u - v ||_2 = \sqrt{\sum_{i=1}^{n}{(u_i - v_i)^2}}
$$
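
The formula can be cross-checked against NumPy's built-in L2 norm. This is a minimal sketch on the character vectors of "believe" and "beleive". It assumes nothing beyond `numpy` itself.

```python
import numpy as np

# code points of 'believe' and 'beleive'
u = np.array([98, 101, 108, 105, 101, 118, 101])
v = np.array([98, 101, 108, 101, 105, 118, 101])

manual = np.sqrt(np.sum((u - v) ** 2))   # the formula written out
builtin = np.linalg.norm(u - v)          # L2 norm of the difference vector
print(manual, builtin)
```

Both expressions compute the square root of 4² + 4² = 32.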

References:
- [https://en.wikipedia.org/wiki/Euclidean_distance](https://en.wikipedia.org/wiki/Euclidean_distance)
- example
  - Illustration for n=3, repeated application of the Pythagorean theorem yields the formula
    - ![img](https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Euclidean_distance_3d_2_cropped.png/330px-Euclidean_distance_3d_2_cropped.png)

In [45]:
def euclidean_distance(u,v):
    if u.shape != v.shape:
        raise ValueError('The vectors must have equal lengths.')
    distance = np.sqrt(np.sum(np.square(u - v)))
    return distance

In [46]:
root_term = root
root_vector = vec_root
root_boc_vector = boc_root

terms = [term1, term2, term3]
vector_terms = [vec_term1, vec_term2, vec_term3]
boc_vector_terms = [boc_term1, boc_term2, boc_term3]

print('root_vector={}'.format(root_vector))
print('vec_term1  ={}'.format(vec_term1))
print('ed(root_vector,vec_term1)={}'.format(np.sqrt(np.sum([np.square(4),np.square(4)]))))

root_vector=[ 98 101 108 105 101 118 101]
vec_term1  =[ 98 101 108 101 105 118 101]
ed(root_vector,vec_term1)=5.656854249492381


### euclidean distance

In [48]:
for term, vector_term in zip(terms, vector_terms):
    print ('Euclidean distance between root: {} and term: {} is {}'.
           format(root_term,
                  term,
                  round(euclidean_distance(root_vector, vector_term),2)
                 ))

Euclidean distance between root: Believe and term: beleive is 5.66
Euclidean distance between root: Believe and term: bargain is 17.94


ValueError: The vectors must have equal lengths.

## Levenshtein Edit Distance



$$
ld_{u,v}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0 \\
\min
\begin{cases}
ld_{u,v}(i-1,j) + 1 \\
ld_{u,v}(i,j-1) + 1 \\
ld_{u,v}(i-1,j-1) + C_{u_i \ne v_j}
\end{cases} & \text{otherwise}
\end{cases}
$$

$$
C_{u_i \ne v_j} =
\begin{cases}
1 & \text{if } u_i \ne v_j \\
0 & \text{if } u_i = v_j
\end{cases}
$$

References:
- https://en.wikipedia.org/wiki/Levenshtein_distance
- https://www.cuelogic.com/blog/the-levenshtein-algorithm

<h4>Example</h4>
For example, the Levenshtein distance between "kitten" and "sitting" is 3, <br/>
since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

1. kitten → sitten (substitution of "s" for "k")
2. sitten → sittin (substitution of "i" for "e")
3. sittin → sitting (insertion of "g" at the end).
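
The recurrence given earlier in this section can be transcribed almost literally into a memoized recursive function. This is a minimal sketch for checking small examples such as the one above. The names `lev` and `ld` are chosen for this illustration. It is not the row-based implementation used in the rest of the notebook, and it is not meant for long strings.

```python
from functools import lru_cache

def lev(u, v):
    @lru_cache(maxsize=None)
    def ld(i, j):
        # distance between the first i characters of u and the first j of v
        if min(i, j) == 0:
            return max(i, j)
        cost = 0 if u[i - 1] == v[j - 1] else 1
        return min(ld(i - 1, j) + 1,         # deletion
                   ld(i, j - 1) + 1,         # insertion
                   ld(i - 1, j - 1) + cost)  # substitution or match
    return ld(len(u), len(v))

kitten_sitting = lev('kitten', 'sitting')
print(kitten_sitting)
```

The three branches of `min` correspond to the three lines inside the inner brace of the recurrence.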

In [49]:
import copy
import pandas as pd

def levenshtein_edit_distance(u, v):
    # convert to lower case
    u = u.lower()
    v = v.lower()
    # base case: if either string is empty, the distance is the length of the other
    # (an empty matrix is returned so that callers can always unpack two values)
    if len(u) == 0 or len(v) == 0:
        return max(len(u), len(v)), pd.DataFrame()
    # rows of the edit distance matrix, one per character of u
    edit_matrix = []
    # du: the previous row of edit distances, dv: the current row
    du = list(range(len(v) + 1))
    dv = [0] * (len(v) + 1)
    for i in range(len(u)):
        dv[0] = i + 1
        # compute cost as per algorithm
        for j in range(len(v)):
            cost = 0 if u[i] == v[j] else 1
            dv[j + 1] = min(dv[j] + 1, du[j + 1] + 1, du[j] + cost)
        # the current row becomes the previous row for the next iteration
        du = copy.copy(dv)
        # copy dv to the edit matrix
        edit_matrix.append(copy.copy(dv))
    # compute the final edit distance and edit matrix
    distance = dv[len(v)]
    # transpose so that rows follow v and columns follow u, then drop the empty-prefix row
    edit_matrix = np.array(edit_matrix).T[1:, ]
    edit_matrix = pd.DataFrame(data=edit_matrix,
                               index=list(v),
                               columns=list(u))
    return distance, edit_matrix

In [50]:
for term in terms:
    edit_d, edit_m = levenshtein_edit_distance(root_term, term)
    print ('Computing distance between root: {} and term: {}'.format(root_term, term))
    print ('Levenshtein edit distance is {}'.format(edit_d))
    print ('The complete edit distance matrix is depicted below')
    print (edit_m)
    print ('-'*30)

Computing distance between root: Believe and term: beleive
Levenshtein edit distance is 2
The complete edit distance matrix is depicted below
   b  e  l  i  e  v  e
b  0  1  2  3  4  5  6
e  1  0  1  2  3  4  5
l  2  1  0  1  2  3  4
e  3  2  1  1  1  2  3
i  4  3  2  1  2  2  3
v  5  4  3  2  2  2  3
e  6  5  4  3  2  3  2
------------------------------
Computing distance between root: Believe and term: bargain
Levenshtein edit distance is 6
The complete edit distance matrix is depicted below
   b  e  l  i  e  v  e
b  0  1  2  3  4  5  6
a  1  1  2  3  4  5  6
r  2  2  2  3  4  5  6
g  3  3  3  3  4  5  6
a  4  4  4  4  4  5  6
i  5  5  5  4  5  5  6
n  6  6  6  5  5  6  6
------------------------------
Computing distance between root: Believe and term: Elephant
Levenshtein edit distance is 7
The complete edit distance matrix is depicted below
   b  e  l  i  e  v  e
e  1  1  2  3  4  5  6
l  2  2  1  2  3  4  5
e  3  2  2  2  2  3  4
p  4  3  3  3  3  3  4
h  5  4  4  4  4  4  4
a  6 

## Cosine Similarity

Dot product between two vectors:

$$ u \cdot v = \lVert u \rVert \, \lVert v \rVert \cos(\theta) $$

Cosine similarity:

$$ cs(u, v) = \cos(\theta) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}} $$

Cosine distance:

$$ cd(u, v) = 1 - cs(u, v) = 1 - \cos(\theta) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = 1 - \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}} $$

![img](https://learning.oreilly.com/library/view/text-analytics-with/9781484223871/A427287_1_En_6_Fig2_HTML.jpg)

References:
- [https://en.wikipedia.org/wiki/Cosine_similarity](https://en.wikipedia.org/wiki/Cosine_similarity)

Examples:
- ![img](http://i0.wp.com/techinpink.com/wp-content/uploads/2017/07/cosine.png?w=697)
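
To make the formulas concrete, the sketch below evaluates them on the Bag of Characters vectors for "believe", "beleive", and "bargain" (the values printed earlier in the notebook). It is self-contained and uses `np.linalg.norm` for the L2 norms. The helper name `cosine_similarity` and the variable names are chosen only for this illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    # dot product divided by the product of the two L2 norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Bag of Characters vectors over the features
# ['a', 'b', 'e', 'g', 'h', 'i', 'l', 'n', 'p', 'r', 't', 'v']
boc_believe = np.array([0, 1, 3, 0, 0, 1, 1, 0, 0, 0, 0, 1])
boc_beleive = np.array([0, 1, 3, 0, 0, 1, 1, 0, 0, 0, 0, 1])
boc_bargain = np.array([2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0])

cs_same = cosine_similarity(boc_believe, boc_beleive)
cs_diff = cosine_similarity(boc_believe, boc_bargain)
print(round(cs_same, 2), round(cs_diff, 2))           # cosine similarities
print(round(1 - cs_same, 2), round(1 - cs_diff, 2))   # cosine distances
```

Identical vectors point in the same direction, so their similarity is 1 and their distance is 0. "believe" and "bargain" share only the characters "b" and "i", which gives a dot product of 2 against norms of √13 and 3.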

In [22]:
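def cosine_distance(u, v):
    # 1 - cosine similarity, where the similarity is the dot product
    # divided by the product of the two L2 norms
    distance = 1.0 - (np.dot(u, v) /
                      (np.sqrt(np.sum(np.square(u))) * np.sqrt(np.sum(np.square(v))))
                     )
    return distance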

In [51]:
root_term = root
root_vector = vec_root
root_boc_vector = boc_root

terms = [term1, term2, term3]
vector_terms = [vec_term1, vec_term2, vec_term3]
boc_vector_terms = [boc_term1, boc_term2, boc_term3]

print('root_vector={}'.format(root_vector))
print('vec_term1  ={}'.format(vec_term1))
print('ed(root_vector,vec_term1)={}'.format(np.sqrt(np.sum([np.square(4),np.square(4)]))))

root_vector=[ 98 101 108 105 101 118 101]
vec_term1  =[ 98 101 108 101 105 118 101]
ed(root_vector,vec_term1)=5.656854249492381


In [52]:
# COSINE DISTANCE/SIMILARITY DEMO
for term, boc_term in zip(terms, boc_vector_terms):
    print ('Analyzing similarity between root: {} and term: {}'.format(root_term, term))
    distance = round(cosine_distance(root_boc_vector, boc_term),2)
    # round again so that floating-point noise does not show up in the printed value
    similarity = round(1 - distance, 2)
    print ('Cosine distance  is {}'.format(distance))
    print ('Cosine similarity  is {}'.format(similarity))
    print ('-'*40)

Analyzing similarity between root: Believe and term: beleive


NameError: name 'cosine_distance' is not defined