# Application: Word Similarity

In [1]:
import numpy as np
from rich.pretty import pprint
from rich import print
from sklearn.metrics.pairwise import cosine_similarity

Consider the four words `battle`, `good`, `fool` and `wit`, whose vectors are defined
in {eq}`eq:term-document-matrix-doc-dimensions` in the previous section,
[](../words_and_vectors/concept.ipynb).

We can calculate the cosine similarity between any pair of these words as follows:

In [9]:
x1 = np.array([1, 0, 7, 13])  # battle
x2 = np.array([114, 80, 62, 89])  # good
x3 = np.array([36, 58, 1, 4])  # fool
x4 = np.array([20, 15, 2, 3])  # wit

# cosine_similarity expects 2-D inputs, so each vector is wrapped in a list.
print(cosine_similarity([x3], [x4]))  # fool vs. wit
print(cosine_similarity([x1], [x3]))  # battle vs. fool
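
To make explicit what the library call computes, here is a minimal sketch of cosine similarity written directly with NumPy, i.e. the dot product divided by the product of the Euclidean norms. The helper name `cosine` is hypothetical and not part of scikit-learn; the vectors are the `fool`, `wit` and `battle` rows used above.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


battle = np.array([1, 0, 7, 13])
fool = np.array([36, 58, 1, 4])
wit = np.array([20, 15, 2, 3])

fool_wit = cosine(fool, wit)        # should agree with cosine_similarity([x3], [x4])
battle_fool = cosine(battle, fool)  # should agree with cosine_similarity([x1], [x3])

print(fool_wit)
print(battle_fool)
```

The two printed values should agree with the scikit-learn results up to floating-point rounding.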

We can also calculate all pairwise cosine similarities at once
by stacking all $4$ vectors into a matrix $\mathbf{X}$,
where each row is one word vector.

In [7]:
X = np.array(
    [
        [1, 0, 7, 13],  # battle
        [114, 80, 62, 89],  # good
        [36, 58, 1, 4],  # fool
        [20, 15, 2, 3],  # wit
    ]
)
print(X)

In [8]:
kernel_matrix = cosine_similarity(X)
print(kernel_matrix)

array([[1.        , 0.65267448, 0.09386806, 0.1952947 ],
       [0.65267448, 1.        , 0.75892858, 0.86817473],
       [0.09386806, 0.75892858, 1.        , 0.92856079],
       [0.1952947 , 0.86817473, 0.92856079, 1.        ]])

The `(3, 4)` entry (row 3, column 4, counting from 1) is the similarity between `x3` and `x4`.
With NumPy's zero-based indexing, this is `kernel_matrix[2, 3]`.
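
As a rough illustration of how such a matrix can be obtained, the sketch below rebuilds it with plain NumPy by scaling every row to unit length and multiplying the result by its own transpose. This is an assumed equivalent formulation, not a description of scikit-learn's internals; the names `X_unit` and `S` are chosen here for illustration.

```python
import numpy as np

X = np.array(
    [
        [1, 0, 7, 13],      # battle
        [114, 80, 62, 89],  # good
        [36, 58, 1, 4],     # fool
        [20, 15, 2, 3],     # wit
    ],
    dtype=float,
)

# Scale each row to unit Euclidean length.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

# Dot products of unit vectors are cosine similarities.
S = X_unit @ X_unit.T

# Zero-based indices: row 2 is `fool`, column 3 is `wit`.
fool_wit = S[2, 3]
print(round(float(fool_wit), 4))  # approximately 0.9286
```

The matrix is symmetric with ones on the diagonal, so `S[2, 3]` and `S[3, 2]` hold the same value.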

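Because all counts are non-negative, every cosine similarity here lies between 0 and 1.
We see that `x1` (`battle`) and `x3` (`fool`) are highly dissimilar, with a cosine similarity of 0.09386806,
while `x3` (`fool`) and `x4` (`wit`) are highly similar, with a cosine similarity of 0.92856079.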
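
The same comparison can be made for every pair of words rather than just two. The following sketch, which assumes the row order `battle`, `good`, `fool`, `wit` and recomputes the similarity matrix with NumPy so that it runs on its own, ranks all six distinct pairs from most to least similar. The variable names are illustrative.

```python
import numpy as np

words = ["battle", "good", "fool", "wit"]
X = np.array(
    [
        [1, 0, 7, 13],
        [114, 80, 62, 89],
        [36, 58, 1, 4],
        [20, 15, 2, 3],
    ],
    dtype=float,
)

# Pairwise cosine similarities via unit-length rows.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
S = X_unit @ X_unit.T

# Indices strictly above the diagonal give each unordered pair once.
rows, cols = np.triu_indices(len(words), k=1)
pairs = sorted(
    ((float(S[i, j]), words[i], words[j]) for i, j in zip(rows, cols)),
    reverse=True,
)

for score, a, b in pairs:
    print(f"{a:>6} ~ {b:<6} {score:.4f}")

most_similar = pairs[0]
least_similar = pairs[-1]
```

Sorting the upper triangle only avoids listing each pair twice and skips the trivial self-similarities on the diagonal.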