# Linear Algebra


Links: https://www.fastcompany.com/90205359/google-you-auto-complete-me

## Dot Products

A dot product is defined as

$ a \cdot b = \sum_{i=1}^{n} a_{i}b_{i} = a_{1}b_{1} + a_{2}b_{2} + a_{3}b_{3} + \dots + a_{n}b_{n}$

The geometric definition of a dot product is 

$ a \cdot b = \lVert a \rVert \, \lVert b \rVert \cos(\theta) $

where $\theta$ is the angle between $a$ and $b$.
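As a quick check that the algebraic and geometric forms agree, here is a minimal sketch. It takes the geometric form to be the product of the two vector lengths and the cosine of the angle between them. The lengths and angles are arbitrary illustrative values, not taken from any dataset.

```python
import numpy as np

# Build two 2-D vectors from a chosen length and direction (illustrative values).
len_a, angle_a = 2.0, np.radians(30)
len_b, angle_b = 3.0, np.radians(75)
a = len_a * np.array([np.cos(angle_a), np.sin(angle_a)])
b = len_b * np.array([np.cos(angle_b), np.sin(angle_b)])

# Algebraic definition: sum of element-wise products.
algebraic = sum(a_i * b_i for a_i, b_i in zip(a, b))

# Geometric definition: product of the lengths times cos(theta),
# where theta is the angle between the vectors (75 - 30 = 45 degrees).
theta = angle_b - angle_a
geometric = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)

# np.dot computes the algebraic form directly.
print(algebraic, geometric, np.dot(a, b))
```

All three printed values should match up to floating-point rounding.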

### What does a dot product conceptually mean?

A dot product measures how similar two vectors are: each term $a_{i}b_{i}$ contributes only when both vectors have a non-zero value in the same position, so the sum reflects the elements they share.


The actual value of a dot product reflects the direction of change:

* **Zero**: we don't have any growth in the original direction
* **Positive** number: we have some growth in the original direction
* **Negative** number: we have negative (reverse) growth in the original direction
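The three cases in the list above can be illustrated with a small sketch. The vectors are made-up examples, and `reference` stands in for the "original direction".

```python
import numpy as np

reference = np.array([1, 0])        # the "original direction"

same_way = np.array([3, 1])         # has a component along the reference
perpendicular = np.array([0, 2])    # at a right angle to the reference
opposite_way = np.array([-2, 1])    # points partly against the reference

positive = np.dot(reference, same_way)
zero = np.dot(reference, perpendicular)
negative = np.dot(reference, opposite_way)

print(positive, zero, negative)
```

The sign of each result matches the direction of the second vector relative to `reference`.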

In [30]:
A = [0,2]
B = [0,1]


# What will the dot product of A and B be?


In [None]:
A = [1,2]
B = [2,4]
# What will the dot product of A and B be?

In [1]:
data_corpus = ["John likes to watch movies. Mary likes movies too.", 
"John also likes to watch football games. Mary does not like football much."]

from sklearn.feature_extraction.text import CountVectorizer





# Bag of Words Models

You can use **`sklearn.feature_extraction.text.CountVectorizer`** to easily convert your corpus into a bag of words matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
data_corpus = ["John likes to watch movies. Mary likes movies too.", 
"John also likes to watch football games. Mary does not like football much."]
X = vectorizer.fit_transform(data_corpus) 
```
Note that the output `X` here is not a traditional NumPy array! Calling **`type(X)`** here will yield **`<class 'scipy.sparse.csr.csr_matrix'>`** (the exact module path varies between SciPy versions), which is a **CSR ([compressed sparse row format matrix](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html))**. To convert it into a dense NumPy array, call the `toarray()` method:

```python
X.toarray()
```
Your output will be 

```
array([[0, 0, 0, 0, 1, 0, 2, 1, 2, 0, 0, 1, 1, 1],
       [1, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1]], dtype=int64)
```
Notice that **`X.shape`** $\rightarrow$ `(2, 14)`, indicating a total vocabulary size $V$ of 14. To see which word each of the 14 columns corresponds to, use **`vectorizer.get_feature_names_out()`** (called `get_feature_names()` before scikit-learn 1.0), which returns the following words:
```
['also', 'does', 'football', 'games', 'john', 'like', 'likes', 'mary', 'movies', 'much', 'not', 'to', 'too', 'watch']
```

However, as the vocabulary size $V$ increases, the percentage of the matrix taken up by zero values also increases, because each document uses only a small fraction of the full vocabulary:

```python
corpus = [
    "Some analysts think demand could drop this year because a large number of homeowners take on remodeling projectsafter buying a new property. With fewer homes selling, home values easing, and mortgage rates rising, they predict home renovations could fall to their lowest levels in three years.",
    "Most home improvement stocks are expected to report fourth-quarter earnings next month.",
    "The conversation boils down to how much leverage management can get out of its wide-ranging efforts to re-energize operations, branding, digital capabilities, and the menu–and, for investors, how much to pay for that.",
    "RMD’s software acquisitions, efficiency, and mix overcame pricing and its gross margin improved by 90 bps Y/Y while its operating margin (including amortization) improved by 80 bps Y/Y. Since RMD expects the slower international flow generator growth to continue for the next few quarters, we have lowered our organic growth estimates to the mid-single digits."
]

X = vectorizer.fit_transform(corpus).toarray()
```



In [60]:
import numpy as np
from sys import getsizeof

corpus = [
    "Some analysts think demand could drop this year because a large number of homeowners take on remodeling projectsafter buying a new property. With fewer homes selling, home values easing, and mortgage rates rising, they predict home renovations could fall to their lowest levels in three years.",
    "Most home improvement stocks are expected to report fourth-quarter earnings next month.",
    "The conversation boils down to how much leverage management can get out of its wide-ranging efforts to re-energize operations, branding, digital capabilities, and the menu–and, for investors, how much to pay for that.",
    "RMD’s software acquisitions, efficiency, and mix overcame pricing and its gross margin improved by 90 bps Y/Y while its operating margin (including amortization) improved by 80 bps Y/Y. Since RMD expects the slower international flow generator growth to continue for the next few quarters, we have lowered our organic growth estimates to the mid-single digits. "
]

# Dense bag of words matrix: one row per document, one column per vocabulary word.
X = vectorizer.fit_transform(corpus).toarray()

# Fraction of entries in the dense matrix that are zero.
zeroes = np.count_nonzero(X == 0)
percent_sparse = zeroes / X.size
print(f"The bag of words feature space is {round(percent_sparse * 100, 2)}% sparse. \n"
      f"That's approximately {round(getsizeof(X) * percent_sparse, 2)} bytes of wasted memory.")

The bag of words feature space is 72.63% sparse. 
That's approximately 2777.34 bytes of wasted memory.
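To see why a sparse format helps, the sketch below compares the storage used by a dense array with the three arrays a CSR matrix keeps internally. The matrix is a hypothetical one (100 documents, 1,000-word vocabulary, one non-zero count per row), chosen only to make the gap obvious. It is not derived from the corpus above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical bag of words matrix: 100 documents, 1,000-word vocabulary,
# with a single non-zero count in each row.
dense = np.zeros((100, 1000), dtype=np.int64)
dense[np.arange(100), np.arange(100) * 10] = 1

compressed = csr_matrix(dense)

dense_bytes = dense.nbytes
# CSR stores only the non-zero values, their column indices,
# and one pointer per row (plus one) marking where each row starts.
csr_bytes = (compressed.data.nbytes
             + compressed.indices.nbytes
             + compressed.indptr.nbytes)

print(f"dense: {dense_bytes} bytes, CSR: {csr_bytes} bytes")
```

For very small or mostly non-zero matrices the CSR overhead can outweigh the savings, so the benefit depends on how sparse the data really is.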


# Distance Measures


## Euclidean Distance

Euclidean distances can range from 0 (completely identical) to $\infty$ (extremely dissimilar). **Magnitude** plays an extremely important role:

In [3]:
from math import sqrt
 
# Method 1
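One possible way to fill in Method 1 is sketched below, using only the standard library. The function name `euclidean_distance` and the sample vectors are illustrative choices, not part of the original notebook.

```python
from math import sqrt

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    # Square each coordinate difference, add them up, take the square root.
    return sqrt(sum((u_i - v_i) ** 2 for u_i, v_i in zip(u, v)))

identical = euclidean_distance([1, 2], [1, 2])
classic = euclidean_distance([0, 0], [3, 4])
# Same direction, different magnitude: the distance is still non-zero.
scaled = euclidean_distance([1, 2], [2, 4])

print(identical, classic, scaled)
```

The last case shows the role of magnitude: `[1, 2]` and `[2, 4]` point the same way, yet they are not at distance 0.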

There's typically an easier way to write this distance function that takes advantage of NumPy's vectorization capabilities:

In [2]:
import numpy as np

# Method 2
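A vectorized version of Method 2 might look like the following sketch. The name `euclidean_distance_np` is hypothetical. It relies on `np.linalg.norm`, which returns the Euclidean (L2) norm of a 1-D array by default.

```python
import numpy as np

def euclidean_distance_np(u, v):
    """Euclidean distance computed as the L2 norm of the difference vector."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))

distance = euclidean_distance_np([0, 0], [3, 4])
print(distance)
```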

# Similarity Measures

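The similarity measures covered here, Pearson correlation and cosine similarity, always range between -1 and 1. A similarity of -1 means the two objects are complete opposites, 0 means they are unrelated (uncorrelated or orthogonal), and 1 means they are identical as far as the measure is concerned (perfectly correlated, or pointing in the same direction).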

## Pearson Correlation Coefficient
* We use **ρ** when the correlation is being measured from the population, and **r** when it is being generated from a sample.
* An r value of 1 represents a **perfect positive linear** relationship, and a value of -1 represents a perfect inverse linear relationship.

The equation for Pearson's correlation coefficient is 
$$
\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}
$$

### Intuition Behind Pearson Correlation Coefficient

#### When $\rho_{X,Y} = 1$ or $\rho_{X,Y} = -1$

This requires **$\operatorname{cov}(X,Y) = \sigma_X \sigma_Y$** (in the case of $\rho = 1$) or **$\operatorname{cov}(X,Y) = -\sigma_X \sigma_Y$** (in the case of $\rho = -1$). Both cases correspond to all the data points lying perfectly on the same line.
![Correlations](images/correlation.png "Visualization of various r values for Pearson correlation coefficient")
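The formula can be sketched directly in NumPy. This version assumes population statistics (dividing by $n$ rather than $n - 1$) for both the covariance and the standard deviations. The helper name `pearson` and the data are illustrative.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation as covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
    return cov_xy / (x.std() * y.std())                # np.std defaults to population std

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_up = pearson(x, 2 * x + 1)        # points on a rising line
r_down = pearson(x, -3 * x + 10)    # points on a falling line
r_noisy = pearson(x, [2.0, 1.0, 4.0, 3.0, 5.0])

print(r_up, r_down, r_noisy)
```

`np.corrcoef(x, y)[0, 1]` should give the same values and is usually the more convenient call in practice.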


## Cosine Similarity

The cosine similarity of two vectors (each vector will usually represent one document) is a measure that calculates $\cos(\theta)$, where $\theta$ is the angle between the two vectors.

Therefore, if the vectors are **orthogonal** to each other (90 degrees), $\cos(90^\circ) = 0$. If the vectors are in exactly the same direction, $\theta = 0$ and $\cos(0) = 1$. If they point in exactly opposite directions, $\theta = 180^\circ$ and $\cos(180^\circ) = -1$.

Cosine similarity **does not care about the magnitude of the vector, only the direction** in which it points. This can help normalize when comparing across documents that are different in terms of word count.
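A minimal sketch of this idea, using the dot product divided by the product of the vector lengths. The function name `cosine_similarity` and the toy count vectors are assumptions for illustration, not a library API.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same word proportions, ten times as many words
unrelated = [0, 0, 5]    # shares no words with the other two

same_direction = cosine_similarity(short_doc, long_doc)
orthogonal = cosine_similarity(short_doc, unrelated)

print(same_direction, orthogonal)
```

The short and long documents score approximately 1 despite their very different word counts, while the document with no shared words scores 0.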

### Shift Invariance

* The Pearson correlation coefficient between X and Y does not change when you transform $X \rightarrow a + bX$ and $Y \rightarrow c + dY$, assuming $a$, $b$, $c$, and $d$ are constants and $b$ and $d$ are positive.
* Cosine similarity, however, does change under this transformation: it is unaffected by the positive scale factors $b$ and $d$, but the shifts $a$ and $c$ change the angle between the vectors.
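Both bullet points can be checked numerically. In this sketch the helper names `pearson` and `cosine`, the data, and the constants $a = 5$, $b = 2$, $c = -3$, $d = 0.5$ are all arbitrary illustrative choices.

```python
import numpy as np

def pearson(u, v):
    return float(np.corrcoef(u, v)[0, 1])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

# Shift and scale: X -> a + bX, Y -> c + dY with b, d > 0.
x_shifted = 5 + 2 * x
y_shifted = -3 + 0.5 * y

# Scale only (no shift).
x_scaled = 2 * x
y_scaled = 0.5 * y

print("Pearson:", pearson(x, y), pearson(x_shifted, y_shifted))
print("Cosine: ", cosine(x, y), cosine(x_shifted, y_shifted), cosine(x_scaled, y_scaled))
```

Pearson is the same before and after the transformation. Cosine similarity survives the pure scaling but changes once the shifts are applied.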


<h1><span style="background-color: #FFFF00">Exercise (10 minutes):</span></h1>

>In Python, find the **cosine similarity** of the two following sentences, assuming a **bag of words** model. You may use a library to create the BoW feature space, but do not use libraries other than `numpy` or `scipy` to compute Pearson and cosine similarity:

>`A = "John likes to watch movies. Mary likes movies too"`

>`B = "John also likes to watch football games, but he likes to watch movies on occasion as well"`

# Pointwise Mutual Information

Pointwise mutual information compares the **joint probability of two events happening** with the probability of both happening if they were independent, by taking the log of the ratio between the two. It can be defined with the following equation:

$$
\begin{equation}
MI_{i,j} = \log\left(\frac{P(i,j)}{P(i)P(j)}\right)
\end{equation}
$$

Remember that when two events are independent, $P(i,j) = P(i)P(j)$, so the ratio is 1 and the pointwise mutual information is $\log(1) = 0$.
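The equation can be sketched in a few lines. The probabilities below are made up for illustration, the function name `pmi` is hypothetical, and the natural logarithm is an assumption, since the text does not fix a base.

```python
from math import log

def pmi(p_joint, p_i, p_j):
    """Pointwise mutual information from a joint and two marginal probabilities."""
    return log(p_joint / (p_i * p_j))

p_i, p_j = 0.2, 0.3   # hypothetical marginal probabilities

independent = pmi(0.06, p_i, p_j)   # joint equals the product of the marginals
co_occurring = pmi(0.15, p_i, p_j)  # events occur together more often than chance
avoiding = pmi(0.01, p_i, p_j)      # events occur together less often than chance

print(independent, co_occurring, avoiding)
```

The independent case comes out at approximately 0, events that co-occur more than chance give a positive value, and events that co-occur less than chance give a negative one.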