### Covariance Pair Problem, August 10, 2020

#### Pair Problem \#1

You are given the following five documents (i.e., this `text` is our corpus):

```python
text = ["wookie stormtrooper",
        "harry potter",
        "wookie stormtrooper stormtrooper",
        "hairy wookie stormtrooper",
        "hairy harry potter"]
```

1. Transform this corpus into a *bag of words* representation, with simple word counts for each document (each row is a document, each column represents a word in the corpus, each value counts the word in the document). How informative is this format? What information do you have about individual words and their relationships to one another?
2. Calculate Euclidean and cosine distances between each pair of documents. How do these distances relate to your intuition for the documents' similarities?
3. Normalize these data (by row!), and calculate (a) one minus the cosine distance, and (b) the Pearson correlation coefficient between each pair of documents. How are they related? Is this a coincidence? Find a counterexample or prove that there isn't one.

#### Pair Problem \#2

Return to the matrix that you created in \#1 of Problem 1 above. 

1. Multiply (`np.matmul`) this matrix by the transpose of itself.
2. How many rows are there? Pick two (different valued, *off*-diagonal) cells of this matrix $(a_1, a_2)$ and $(b_1, b_2)$, where $a_1$ represents the row index of cell $a$.
3. Compare documents $a_1$ and $a_2$ (from `text`), and then compare documents $b_1$ and $b_2$. Is the value in cell $a$ bigger/smaller than the value in cell $b$? How does that relate to the comparisons between the corresponding documents?

What familiar calculation is this? What result have you produced?

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances

import pandas as pd
import numpy as np

In [3]:
text = ["wookie stormtrooper",
        "harry potter",
        "wookie stormtrooper stormtrooper",
        "hairy wookie stormtrooper",
        "hairy harry potter"]

In [4]:
cv = CountVectorizer()

In [7]:
text_cv = cv.fit_transform(text)

In [9]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
text_df = pd.DataFrame(text_cv.toarray(), columns=cv.get_feature_names_out())

In [10]:
text_df

| | hairy | harry | potter | stormtrooper | wookie |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 2 | 1 |
| 3 | 1 | 0 | 0 | 1 | 1 |
| 4 | 1 | 1 | 1 | 0 | 0 |


In [13]:
dist_euc = pairwise_distances(text_cv, metric='euclidean')

In [14]:
dist_cos = pairwise_distances(text_cv, metric='cosine')

In [21]:
# Rows and columns are documents (not words), so label them by document index
euc_df = pd.DataFrame(dist_euc)

In [22]:
euc_df

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 1.0 | 1.0 | 2.236068 |
| 1 | 2.0 | 0.0 | 2.645751 | 2.236068 | 1.0 |
| 2 | 1.0 | 2.645751 | 0.0 | 1.414214 | 2.828427 |
| 3 | 1.0 | 2.236068 | 1.414214 | 0.0 | 2.0 |
| 4 | 2.236068 | 1.0 | 2.828427 | 2.0 | 0.0 |


In [23]:
# Rows and columns are documents (not words), so label them by document index
cos_df = pd.DataFrame(dist_cos)

In [24]:
cos_df

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.051317 | 0.183503 | 1.0 |
| 1 | 1.0 | 0.0 | 1.0 | 1.0 | 0.183503 |
| 2 | 0.051317 | 1.0 | 0.0 | 0.225403 | 1.0 |
| 3 | 0.183503 | 1.0 | 0.225403 | 0.0 | 0.666667 |
| 4 | 1.0 | 0.183503 | 1.0 | 0.666667 | 0.0 |
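As a sanity check on the two distance tables, the following sketch recomputes a few entries directly from the definitions. The helper names `euclidean` and `cosine_distance` are made up for this illustration, and the count matrix is typed in by hand from the bag-of-words table.

```python
import numpy as np

# Bag-of-words counts; columns: hairy, harry, potter, stormtrooper, wookie
X = np.array([[0, 0, 0, 1, 1],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 2, 1],
              [1, 0, 0, 1, 1],
              [1, 1, 1, 0, 0]], dtype=float)

def euclidean(u, v):
    # Square root of the summed squared differences
    return np.sqrt(np.sum((u - v) ** 2))

def cosine_distance(u, v):
    # One minus the cosine of the angle between u and v
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

d_euc_02 = euclidean(X[0], X[2])        # documents 0 and 2 differ by one "stormtrooper"
d_cos_02 = cosine_distance(X[0], X[2])  # nearly parallel count vectors
d_cos_01 = cosine_distance(X[0], X[1])  # no shared words, so orthogonal vectors

print(d_euc_02, round(d_cos_02, 6), d_cos_01)
```

Documents that share no words always sit at cosine distance 1, while their Euclidean distance still grows with document length.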


In [25]:
np.sum(text_cv, axis=1)

matrix([[2],
        [2],
        [3],
        [3],
        [3]])

In [26]:
# NOT correct for normalization - need to update
#    to subtract mean and divide by standard deviation

text_normal = text_cv/np.sum(text_cv, axis=1)

In [27]:
text_normal

matrix([[0.        , 0.        , 0.        , 0.5       , 0.5       ],
        [0.        , 0.5       , 0.5       , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.66666667, 0.33333333],
        [0.33333333, 0.        , 0.        , 0.33333333, 0.33333333],
        [0.33333333, 0.33333333, 0.33333333, 0.        , 0.        ]])
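"Normalize by row" can mean several things. The sketch below contrasts three common readings on a hand-typed copy of the count matrix: dividing by the row sum (what the cell above does), dividing by the row's Euclidean length, and standardizing each row (subtract the row mean, divide by the row standard deviation). The variable names are illustrative, and which reading the exercise intends is an assumption.

```python
import numpy as np

# Bag-of-words counts; columns: hairy, harry, potter, stormtrooper, wookie
X = np.array([[0, 0, 0, 1, 1],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 2, 1],
              [1, 0, 0, 1, 1],
              [1, 1, 1, 0, 0]], dtype=float)

# Row-sum scaling: each row sums to 1 (term frequencies)
row_sum_scaled = X / X.sum(axis=1, keepdims=True)

# Unit-length scaling: each row has Euclidean norm 1
unit_length = X / np.linalg.norm(X, axis=1, keepdims=True)

# Row standardization: each row has mean 0 and standard deviation 1
# (safe here because no row is constant, so no standard deviation is 0)
standardized = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

print(row_sum_scaled.round(3))
print(standardized.round(3))
```

The first two only rescale each row, so they leave the angle between documents unchanged; standardization also shifts each row, which does change the angles.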

In [28]:
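# One minus the cosine distance, i.e. the cosine similarity.
# np.asarray: newer scikit-learn versions may reject np.matrix input.
dist_cos_normal = 1 - pairwise_distances(np.asarray(text_normal), metric='cosine')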

In [29]:
dist_cos_normal

array([[1.        , 0.        , 0.9486833 , 0.81649658, 0.        ],
       [0.        , 1.        , 0.        , 0.        , 0.81649658],
       [0.9486833 , 0.        , 1.        , 0.77459667, 0.        ],
       [0.81649658, 0.        , 0.77459667, 1.        , 0.33333333],
       [0.        , 0.81649658, 0.        , 0.33333333, 1.        ]])

In [50]:
dist_normal_df = pd.DataFrame(text_normal)

In [53]:
round((dist_normal_df.T.corr(method='pearson')/2 + 0.5),3)

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.167 | 0.959 | 0.833 | 0.0 |
| 1 | 0.167 | 1.0 | 0.194 | 0.0 | 0.833 |
| 2 | 0.959 | 0.194 | 1.0 | 0.806 | 0.041 |
| 3 | 0.833 | 0.0 | 0.806 | 1.0 | 0.167 |
| 4 | 0.0 | 0.833 | 0.041 | 0.167 | 1.0 |
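Note that the table above shows correlations rescaled to $[0, 1]$ via $r/2 + 0.5$, not raw Pearson coefficients. The sketch below, on a hand-typed copy of the count matrix, checks one way to relate the two quantities: the Pearson correlation of two rows equals the cosine similarity of the same rows after each has its own mean subtracted. The helper `cos_sim_matrix` is a name invented for this illustration.

```python
import numpy as np

# Bag-of-words counts; columns: hairy, harry, potter, stormtrooper, wookie
X = np.array([[0, 0, 0, 1, 1],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 2, 1],
              [1, 0, 0, 1, 1],
              [1, 1, 1, 0, 0]], dtype=float)

def cos_sim_matrix(A):
    # Pairwise cosine similarity between the rows of A
    unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    return unit @ unit.T

scaled = X / X.sum(axis=1, keepdims=True)       # row-sum scaling, as in the notebook
centered = X - X.mean(axis=1, keepdims=True)    # subtract each row's mean

pearson = np.corrcoef(X)                 # rows are treated as variables by default
cos_raw = cos_sim_matrix(X)              # cosine similarity of raw counts
cos_scaled = cos_sim_matrix(scaled)      # unchanged by positive row scaling
cos_centered = cos_sim_matrix(centered)  # matches Pearson

print(round(cos_raw[0, 1], 3), round(pearson[0, 1], 3))
```

Documents 0 and 1 already serve as a counterexample to the two measures being equal: their cosine similarity is 0 while their Pearson correlation is $-2/3$ (which the rescaling maps to 0.167). The two agree only when every row has mean zero.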


How many rows are there? Pick two (different valued, *off*-diagonal) cells of this matrix $(a_1, a_2)$ and $(b_1, b_2)$, where $a_1$ represents the row index of cell $a$.
Compare documents $a_1$ and $a_2$ (from `text`), and then compare documents $b_1$ and $b_2$. Is the value in cell $a$ bigger/smaller than the value in cell $b$? How does that relate to the comparisons between the corresponding documents?

In [31]:
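# Multiply the raw count arrays; rows and columns of the result are both documents
text_matrix_mul = pd.DataFrame(np.matmul(text_df.to_numpy(), text_df.to_numpy().T))

In [43]:
text_matrix_mul

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 2 | 0 | 3 | 2 | 0 |
| 1 | 0 | 2 | 0 | 0 | 2 |
| 2 | 3 | 0 | 5 | 3 | 0 |
| 3 | 2 | 0 | 3 | 3 | 1 |
| 4 | 0 | 2 | 0 | 1 | 3 |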


In [35]:
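# Values of two off-diagonal cells: a = (0, 2) and b = (2, 3)
# Note: both cells equal 3, so this pair is not "different valued"
cell_a = text_matrix_mul.iloc[0,2]
cell_b = text_matrix_mul.iloc[2,3]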

In [36]:
text

['wookie stormtrooper',
 'harry potter',
 'wookie stormtrooper stormtrooper',
 'hairy wookie stormtrooper',
 'hairy harry potter']
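One way to read the product matrix: entry $(i, j)$ is the dot product of the count vectors of documents $i$ and $j$, and the diagonal holds each document's squared length. The sketch below, using a hand-typed copy of the count matrix and illustrative variable names, divides each dot product by the two document lengths to recover the cosine similarities computed earlier.

```python
import numpy as np

# Bag-of-words counts; columns: hairy, harry, potter, stormtrooper, wookie
X = np.array([[0, 0, 0, 1, 1],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 2, 1],
              [1, 0, 0, 1, 1],
              [1, 1, 1, 0, 0]])

# Document-by-document matrix of dot products
gram = X @ X.T

# The diagonal holds squared Euclidean norms of the documents
norms = np.sqrt(np.diag(gram))

# Dividing each dot product by both norms gives cosine similarity
cos_sim = gram / np.outer(norms, norms)

print(gram[0, 2], gram[2, 3])
print(round(cos_sim[0, 2], 4), round(cos_sim[2, 3], 4))
```

Cells $(0, 2)$ and $(2, 3)$ hold the same dot product, yet their cosine similarities differ because document 3 contains an extra word that document 2 lacks. The raw product rewards longer documents and repeated words, which is why it is usually length-normalized before being read as similarity.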