In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
import numpy

<h1>Vector space models</h1>

<h2>About vectors</h2>

There are many ways to think about what a vector is. You can think of it as a list of numbers:

In [None]:
v1 = [3, 4]
v2 = [5, 2]

You can also think of a vector as a point in space.

In [None]:
plt.grid(True)
plt.axis([0, 10, 0, 10])
plt.plot(3, 4, "ro")
plt.plot(5, 2, "ro")

Or you can think of a vector as an arrow drawn from the origin to that point in space.

In [None]:
def plot_vec(v):
    # Draw v as an arrow from the origin to the point (v[0], v[1]).
    plt.arrow(0, 0, v[0], v[1], head_width=.2, head_length=.4, length_includes_head=True)
plt.grid(True)
plt.axis([0, 10, 0, 10])
plot_vec(v1)
plot_vec(v2)

To get the magnitude (length) of a vector, you use the **Pythagorean theorem**.

In [None]:
from math import sqrt
magnitude_of_v1 = sqrt(3**2 + 4**2)
magnitude_of_v1

## Closeness of two vectors

We are going to want some way of measuring how close two vectors are.

One way to do this is to use the Pythagorean theorem again, this time to find the distance between the points at the ends of the two vectors.

In [None]:
distance_btwn_v1_and_v2 = sqrt((5 - 3)**2 + (2 - 4)**2)
distance_btwn_v1_and_v2
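
The same calculation extends to any number of dimensions. Here is a minimal sketch, assuming the vectors are plain Python lists of equal length; `euclidean_distance` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
from math import sqrt

def euclidean_distance(a, b):
    """Distance between the endpoints of vectors a and b (hypothetical helper)."""
    # Sum the squared difference in each dimension, then take the square root.
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_2d = euclidean_distance([3, 4], [5, 2])        # the same two points as the cell above
d_3d = euclidean_distance([1, 2, 3], [4, 6, 3])  # the formula is unchanged in three dimensions
print(d_2d, d_3d)
```

The two-dimensional result should match `distance_btwn_v1_and_v2`.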

### The dot product

More common, though, is to use a measurement known as the "dot product."

The dot product captures something like the overlap between two vectors. If the vectors are perpendicular to each other then the dot product is 0. If they point in the same direction then the dot product is the product of their lengths.

Here are two ways of calculating the dot product.

$\vec{v}\bullet\vec{w} = {v_1}{w_1}+{v_2}{w_2}+{v_3}{w_3}+\cdots$

$\vec{v}\bullet\vec{w} = |\vec{v}||\vec{w}|\cos(\theta)$, where $\theta$ is the angle between the two vectors.

In Python, the first formula looks like this:

In [None]:
v1[0] * v2[0] + v1[1] * v2[1]

numpy has a function to do this for you, `dot`

In [None]:
from numpy import dot
dot(v1, v2)
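
The two formulas should give the same number. The sketch below checks this for the same pair of vectors, using only the standard library. The names `a`, `b`, and `theta` are illustrative, and the angle is found with `atan2`, which is just one way to get it in two dimensions.

```python
from math import atan2, cos, sqrt

a = [3, 4]
b = [5, 2]

# Component form: multiply matching coordinates and add them up.
component_form = a[0] * b[0] + a[1] * b[1]

# Geometric form: |a| |b| cos(theta), where theta is the angle between the arrows.
len_a = sqrt(a[0] ** 2 + a[1] ** 2)
len_b = sqrt(b[0] ** 2 + b[1] ** 2)
theta = atan2(a[1], a[0]) - atan2(b[1], b[0])
geometric_form = len_a * len_b * cos(theta)

print(component_form, geometric_form)
```

Apart from floating-point rounding, both values should agree.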

You can use `dot` to compute the magnitude of a vector compactly, because the dot product of a vector with itself is the square of its length.

In [None]:
v1_mag = sqrt(dot(v1, v1))
v1_mag

### "Normalizing" vectors

Normalizing a vector means converting it to an arrow that points in the same direction but has a length of 1.

To accomplish this, we divide each dimension of the vector by the length of the vector.

(Yes, this is yet another meaning of the word "normalize." There is even a third meaning that is relevant here. "Normal" is also a word used to mean "perpendicular.")

In [None]:
v1_normalized = [v1[0] / v1_mag, v1[1] / v1_mag]
v1_normalized

In [None]:
dot(v1_normalized, v1_normalized)

Normalizing is important because we often want to know the extent to which two vectors point in the same direction, regardless of how long each one is. You'll see why in a little bit.
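
To make the point concrete, here is a small sketch in plain Python. The helpers `dot_product` and `unit` are hypothetical names introduced only for this illustration. Stretching a vector changes its raw dot product with another vector, but not the dot product of the normalized versions.

```python
from math import sqrt

def dot_product(a, b):
    """Sum of the products of matching coordinates (hypothetical helper)."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v so that its length is 1 (hypothetical helper)."""
    length = sqrt(dot_product(v, v))
    return [x / length for x in v]

short = [3, 4]
stretched = [30, 40]   # same direction as `short`, ten times longer
other = [5, 2]

raw_short = dot_product(short, other)
raw_stretched = dot_product(stretched, other)
norm_short = dot_product(unit(short), unit(other))
norm_stretched = dot_product(unit(stretched), unit(other))

print(raw_short, raw_stretched)     # the raw values differ by a factor of ten
print(norm_short, norm_stretched)   # the normalized values should match
```

The dot product of two normalized vectors is the cosine of the angle between them, so it depends only on direction.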

### Vectors in numpy

What happens if we try to add and multiply vectors without numpy?

It doesn't do what we want. With plain Python lists, `+` concatenates the lists and `*` repeats them.

In [None]:
v1 + v2

In [None]:
3 * v1

With numpy arrays, vectors behave like vectors.

In [None]:
import numpy as np
v1 = np.array([3, 4])
v2 = np.array([5, 2])

In [None]:
v1 + v2

In [None]:
3 * v1

In [None]:
v1_mag = sqrt(dot(v1, v1))
print(v1_mag)
v1_mag = np.linalg.norm(v1)
print(v1_mag)

In [None]:
v1_normalized = v1 / v1_mag
print(v1_normalized)

In [None]:
v2_normalized = v2 / np.linalg.norm(v2)

In [None]:
plt.grid(True)
plt.axis([0, 3, 0, 3])
plot_vec(v1_normalized)
plot_vec(v2_normalized)

## Converting text to a vector

Now we are going to convert a block of text to a vector. To do that, we are going to first decide on an ordered list of words. We'll call this the "vocabulary." Then, to convert a block of text, we'll count how many times each of these words appears in the block of text.

In [None]:
import nltk

In [None]:
t1 = "now is the time for all good men to come to the aid of their country"
t1w = nltk.word_tokenize(t1)
t2 = "now is the time for all good women to come to the aid of their country"
t2w = nltk.word_tokenize(t2)
t3 = "is it time for the women to lead us all"
t3w = nltk.word_tokenize(t3)

In [None]:
t1w

In [None]:
vocab = sorted(list(set(t1w + t2w + t3w)))

In [None]:
print(vocab)

In [None]:
import numpy as np

In [None]:
mylist = []
for word in vocab:
    mylist.append(t1w.count(word))

In [None]:
v1 = np.array([t1w.count(word) for word in vocab])
print(v1)

In [None]:
def norm_vec(v):
    return v / np.linalg.norm(v)
np.set_printoptions(precision=3)

In [None]:
v1 = norm_vec(v1)
print(v1)

In [None]:
v2 = norm_vec(np.array([t2w.count(word) for word in vocab]))
v3 = norm_vec(np.array([t3w.count(word) for word in vocab]))

In [None]:
print("dot product of v1 and v2 is ", dot(v1, v2))
print("dot product of v1 and v3 is ", dot(v1, v3))
print("dot product of v2 and v3 is ", dot(v2, v3))

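## Squashing the vectors

Each text vector has one dimension per vocabulary word, so it cannot be plotted directly. Principal component analysis (PCA) projects the vectors down to two dimensions so that we can draw them.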

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

In [None]:
X = np.array([v1, v2, v3])

In [None]:
X_squashed = pca.fit_transform(X)
X_squashed

In [None]:
xs = [v[0] for v in X_squashed]
ys = [v[1] for v in X_squashed]
plt.scatter(xs, ys)
labels = ["v1", "v2", "v3"]
for n, v in enumerate(X_squashed):
    plt.annotate(labels[n], (v[0], v[1]), textcoords="offset points", xytext=(5, 5,))
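
For a rough idea of what the projection does, here is a sketch of PCA written with numpy alone: center the data, find the directions of greatest variance with a singular value decomposition, and project onto the first two. This is an illustration under stated assumptions, not scikit-learn's implementation. The helper `pca_project` and the array `toy` are hypothetical, and the signs of the resulting coordinates may differ from what `PCA` returns.

```python
import numpy as np

def pca_project(points, n_components=2):
    """Project the rows of `points` onto their top principal directions."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)  # PCA works on mean-centered data
    # Rows of vt are the principal directions, ordered from most to least variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def pairwise_distances(m):
    """Distances between every pair of rows in m."""
    return [np.linalg.norm(m[i] - m[j])
            for i in range(len(m)) for j in range(i + 1, len(m))]

# Three made-up points in four dimensions.
toy = np.array([
    [2.0, 1.0, 0.0, 1.0],
    [2.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
])
toy_2d = pca_project(toy)

print(toy_2d.shape)
print(np.allclose(pairwise_distances(toy), pairwise_distances(toy_2d)))
```

Three points always lie in a common plane, so with only three vectors a two-dimensional projection keeps the distances between them intact. With more vectors, some information would generally be lost.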