# Word Vectors

Anytime you see ``______ # TODO: FILL IN HERE.`` in the code, you should replace the ``______`` with your own code.

As always, ask your neighbors or an instructor if you have any questions!

## Pre-trained Word Vectors
Let's start by working with pre-trained word vectors. These are word representations that have been created by other people, which we can download and use for our applications.

We'll work with vectors trained using the GloVe model, which are available to download online. Don't worry about the specific method used to create these vectors for this class, but if you're interested in learning more, feel free to ask one of the instructors.

### 0. Import packages.

First, we'll import the packages we need in the rest of this notebook.

As a reminder, press Ctrl-Enter to run a cell.

In [1]:
from math import sqrt

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### 1. Download pre-trained word vectors.

#### 1a. Download the data online.

To download the word vectors, go to this link: https://nlp.stanford.edu/projects/glove/
and search for "glove.6B.zip".

Download this file, and save it in the same folder as this notebook.

Then extract the files contained in the .zip.
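
If you'd rather extract the archive from Python than from your file manager, the standard-library ``zipfile`` module can do it. This is a minimal sketch, assuming the download is named ``glove.6B.zip`` and sits next to the notebook. The helper name ``extract_archive`` and the target folder ``glove.6B`` are our own choices, not something the download page prescribes.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every file in zip_path into target_dir and return their names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Only run the extraction if the archive has actually been downloaded.
# (The file name below is an assumption about how you saved the download.)
if Path("glove.6B.zip").exists():
    print(extract_archive("glove.6B.zip", "glove.6B"))
```

Extracting into a ``glove.6B`` folder keeps the text files together in one place next to the notebook.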

#### 1b. Load the word embeddings.

We've provided the ``read_embeddings`` function, which takes the name of one of the embedding files and reads it into a pandas DataFrame.

Run the following two cells to create the ``read_embeddings`` function and load a set of embeddings into a DataFrame.

In [2]:
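def read_embeddings(filename):
    # Each line is a word followed by its vector, separated by single spaces.
    # quoting=3 (csv.QUOTE_NONE) treats quote characters as ordinary text;
    # keep_default_na=False keeps words such as "null" or "nan" as strings.
    word_embeddings = pd.read_table(filename, header=None, sep=" ", index_col=0, quoting=3, keep_default_na=False)
    return word_embeddings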

In [3]:
embeddings_file = 'glove.6B/glove.6B.200d.txt'
embeddings_df = read_embeddings(embeddings_file)
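
For intuition about what ``read_embeddings`` is parsing: each line of the embeddings file holds a word followed by its vector components, separated by single spaces. The sketch below parses two made-up lines using only the standard library. The words and numbers are invented for illustration, and ``parse_embedding_line`` is a hypothetical helper, not part of the notebook.

```python
import io


def parse_embedding_line(line):
    """Split one 'word v1 v2 ... vn' line into (word, list of floats)."""
    parts = line.rstrip("\n").split(" ")
    return parts[0], [float(value) for value in parts[1:]]


# Two invented lines in the same layout as the embedding text files.
toy_file = io.StringIO("cat 0.1 -0.2 0.3\ndog 0.4 0.5 -0.6\n")
toy_embeddings = dict(parse_embedding_line(line) for line in toy_file)

print(toy_embeddings["dog"])
```

pandas does the same job far more efficiently on the full file, and gives us a DataFrame indexed by word.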

#### 1c. Look at some examples of word embeddings.

To examine the first few rows of a DataFrame by eye, we can use its ``.head()`` method.

In this case, we named the DataFrame ``embeddings_df``, so we run ``embeddings_df.head()`` to view the first few rows.

Run the following cell to see the first few embeddings.

In [4]:
embeddings_df.head()

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| the | -0.071549 | 0.093459 | 0.023738 | -0.090339 | 0.056123 | 0.32547 | -0.39796 | -0.092139 | 0.061181 | -0.1895 | ... | 0.1218 | 0.19957 | -0.20303 | 0.34474 | -0.24328 | 0.13139 | -0.008877 | 0.33617 | 0.030591 | 0.25577 |
| , | 0.17651 | 0.29208 | -0.002077 | -0.37523 | 0.004914 | 0.23979 | -0.28893 | -0.014643 | -0.10993 | 0.15592 | ... | -0.32582 | 0.19153 | -0.15469 | -0.14679 | 0.046971 | 0.032325 | -0.22006 | -0.20774 | -0.23189 | -0.10814 |
| . | 0.12289 | 0.58037 | -0.069635 | -0.50288 | 0.10503 | 0.39945 | -0.38635 | -0.084279 | 0.12219 | 0.080312 | ... | -0.035236 | 0.17688 | -0.0536 | 0.007003 | -0.033006 | -0.080021 | -0.24451 | -0.039174 | -0.16236 | -0.096652 |
| of | 0.052924 | 0.25427 | 0.31353 | -0.35613 | 0.029629 | 0.51034 | -0.10716 | 0.15195 | 0.057698 | 0.06149 | ... | -0.040886 | 0.3894 | -0.10509 | 0.23372 | 0.096027 | -0.30324 | 0.24488 | -0.086254 | -0.41917 | 0.46496 |
| to | 0.57346 | 0.5417 | -0.23477 | -0.3624 | 0.4037 | 0.11386 | -0.44933 | -0.30991 | -0.005341 | 0.58426 | ... | -0.27915 | 0.43742 | -0.31237 | 0.13194 | -0.33278 | 0.18877 | -0.23422 | 0.54418 | -0.23069 | 0.34947 |


You can also view the embedding of a specific word using the code

``embeddings_df.loc[word]``

Try it with a few words of your choice.

In [4]:
selected_word = ______ # TODO: FILL IN HERE.
embeddings_df.loc[selected_word]


### 2. Create a function to compute cosine similarity.

Now we'll write a function to compute the cosine similarity between two words.

#### 2a: Fill in the function definition.

Fill in the following function; feel free to discuss with your neighbors and look back to today's slides.

In [None]:
def compute_cosine_similarity(word_a, word_b):
    word_a_vector = embeddings_df.loc[word_a]
    word_b_vector = ______ # TODO: FILL IN HERE
    word_a_vectornorm = np.linalg.norm(word_a_vector)
    word_b_vectornorm = ______ # TODO: FILL IN HERE
    a_dot_b = np.dot(word_a_vector, ______) # TODO: FILL IN HERE
    cosine_similarity = ______ # TODO: FILL IN HERE
    return cosine_similarity

#### 2b: Test your cosine similarity function with different words.

Try out your function with some sample words, and write some examples of cosine similarities between words you tried. Do the results make sense?

In [None]:
word_a = ______ # TODO: FILL IN HERE
word_b = ______ # TODO: FILL IN HERE
similarity = compute_cosine_similarity(word_a, word_b)
print(similarity)
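
One way to check your function is to compare it against cases you can work out by hand. The sketch below computes the same quantity, $\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$, on plain Python lists. The helper ``cosine_of_lists`` and the toy vectors are ours, not part of the notebook. Parallel vectors should score close to 1, perpendicular vectors 0, and opposite vectors close to -1.

```python
from math import sqrt


def cosine_of_lists(a, b):
    """Cosine similarity of two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_of_lists([1.0, 2.0], [2.0, 4.0]))    # parallel vectors
print(cosine_of_lists([1.0, 0.0], [0.0, 3.0]))    # perpendicular vectors
print(cosine_of_lists([1.0, 2.0], [-1.0, -2.0]))  # opposite vectors
```

If ``compute_cosine_similarity`` gives values outside the range from -1 to 1, that is a sign that one of the norms or the dot product is not being computed as intended.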

### 3. Create a function to find the closest words to a selected word.

Now that we've written a function to compute the cosine similarity between two words, it's time to try solving our own analogies!

As a recap, we can solve analogies using the following translation from words to math:

$A \text{ is to } B \text{ as } C \text{ is to } D \;\leftrightarrow\; A - B \approx C - D \;\leftrightarrow\; C + B - A \approx D$

Using this, try filling in the following function to solve an analogy.

In [None]:
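def get_predicted_index_vectorized(v1, v2, v3, embeddings_matrix):
    # Solve "v1 is to v2 as v3 is to ?" for three words v1, v2, v3.
    # embeddings_matrix is assumed to be a NumPy array whose rows line up
    # with embeddings_df, e.g. embeddings_df.values.
    embeddings_index = embeddings_df.index
    v1_index = embeddings_index.get_loc(v1)
    v2_index = embeddings_index.get_loc(v2)
    v3_index = embeddings_index.get_loc(v3)
    # Look up the word vectors (this rebinds v1, v2, v3 from words to vectors).
    v1 = embeddings_matrix[v1_index, :]
    v2 = embeddings_matrix[v2_index, :]
    v3 = embeddings_matrix[v3_index, :]
    predicted_vec = (______ + ______ - ______).reshape(1, embeddings_df.shape[1]) # TODO: FILL IN HERE.
    # Squared Euclidean distance from every embedding to the predicted vector.
    diffs = np.sum((embeddings_matrix - predicted_vec) ** 2, axis=1)
    # Among the four nearest rows, at least one is not an input word.
    min_indices = diffs.argsort()[:4]
    for i in range(3):
        min_index = min_indices[i]
        if min_index != v1_index and min_index != v2_index and min_index != v3_index:
            return embeddings_index[min_index]
    return embeddings_index[min_indices[3]]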
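
To see the idea on something small enough to check by hand, here is a sketch on five invented two-dimensional "embeddings". The words, the numbers, and the helper ``solve_toy_analogy`` are all made up for illustration. Like the function above, it ranks candidates by squared Euclidean distance to the predicted vector and skips the three input words.

```python
import numpy as np

# Invented 2-D "embeddings": axis 0 is roughly royalty, axis 1 is roughly gender.
toy_words = ["man", "woman", "king", "queen", "apple"]
toy_matrix = np.array([
    [0.0, 1.0],    # man
    [0.0, -1.0],   # woman
    [1.0, 1.0],    # king
    [1.0, -1.0],   # queen
    [-1.0, 0.0],   # apple
])


def solve_toy_analogy(a, b, c):
    """Return the word closest to C + B - A, skipping the three input words."""
    idx = [toy_words.index(w) for w in (a, b, c)]
    vec_a, vec_b, vec_c = toy_matrix[idx]
    predicted = vec_c + vec_b - vec_a
    # Squared Euclidean distance from every row to the predicted vector.
    distances = np.sum((toy_matrix - predicted) ** 2, axis=1)
    for candidate in np.argsort(distances):
        if candidate not in idx:
            return toy_words[candidate]


print(solve_toy_analogy("man", "woman", "king"))
```

Here "man" is to "woman" as "king" is to the word whose vector lies nearest to king + woman - man, which in this toy space is exactly the row for "queen".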

### 4. Try solving your own analogies.

Try using your function to solve your own analogies.

To start, here are some you might try:

    "boy" is to "girl" as "brother" is to ?
    "uncle" is to "aunt" as "policeman" is to ?
    "occasional" is to "occasionally" as "lucky" is to ?
    "jumping" is to "jumped" as "flying" is to ?


## Creating Your Own Word Vectors

Now let's try creating our own word embeddings.

### 1. Create vectors using co-occurrence counts.

### 2. Create vectors using PPMI.

### 3. Visualize projections of these embeddings.

## Create word vectors

- Distribution of nearby words, with different window sizes
- PPMI
- Visualize
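
The first two items can be sketched on a toy corpus using only the standard library. Everything here is an assumption made for illustration: the two sentences are invented, ``cooccurrence_counts`` and ``ppmi`` are hypothetical helper names, and PPMI is taken to mean $\max\left(0, \log_2 \frac{P(w, c)}{P(w)\,P(c)}\right)$ with probabilities estimated from the pair counts.

```python
from collections import Counter
from math import log2

# Invented toy corpus, already split into words.
toy_sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]


def cooccurrence_counts(sentences, window=1):
    """Count (word, context) pairs where context is within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def ppmi(counts):
    """Positive PMI for every observed pair, estimated from the pair counts."""
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    return {
        (word, context): max(
            0.0, log2(n * total / (word_totals[word] * context_totals[context]))
        )
        for (word, context), n in counts.items()
    }


pair_counts = cooccurrence_counts(toy_sentences, window=1)
pair_ppmi = ppmi(pair_counts)
print(pair_counts[("sat", "on")], pair_ppmi[("sat", "on")])
```

Changing ``window`` changes which neighbors count as context: a small window tends to capture words that fill similar grammatical slots, while a larger one picks up looser topical associations.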

### 3. Visualize projections of selected words.

These embeddings are useful mathematically, but they have a lot of dimensions, which makes it hard to visualize them.

For easier visualization, we can project them onto fewer dimensions. Using projection, we can map something in many dimensions (in the case of these word vectors, 200 dimensions) into fewer dimensions (in particular, two dimensions are easy to visualize).

Projecting onto fewer dimensions makes us lose some information. For instance, imagine that we have a full-color picture and map it to a black-and-white picture -- the new picture might be easier to represent, but we lose some information in the process.

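Some projections keep more information than others. One standard choice is principal component analysis (PCA), which projects onto the directions along which the data varies the most, so the two-dimensional picture keeps as much of that variation as possible. The following cell applies PCA to a matrix ``X``, assumed to hold one row per selected word.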

In [None]:
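# X is assumed to be a NumPy array with one row per selected word.
mu = X.mean(0)                    # mean of each dimension
C = np.cov(X - mu, rowvar=False)  # covariance between dimensions
d, u = np.linalg.eigh(C)          # eigenvalues in ascending order
U = u.T[::-1]                     # eigenvectors as rows, largest eigenvalue first
Z = np.dot(X - mu, U[:2].T)       # coordinates along the top two directions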
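
The cell above needs ``X`` to exist already. As a self-contained sketch, the same steps can be wrapped in a function and run on random stand-in data. The name ``project_to_2d`` and the random matrix are ours; in the notebook, ``X`` would presumably hold the embeddings of the words you want to plot.

```python
import numpy as np


def project_to_2d(X):
    """Project the rows of X onto their top two principal components."""
    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False)               # covariance between dimensions
    eigenvalues, eigenvectors = np.linalg.eigh(C)  # ascending eigenvalues
    top_two = eigenvectors[:, ::-1][:, :2]         # columns for the two largest
    return np.dot(X - mu, top_two)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 200))  # stand-in for 50 word vectors of 200 dimensions
Z_demo = project_to_2d(X_demo)
print(Z_demo.shape)
```

Each row of ``Z_demo`` is a point in the plane, so a call such as ``plt.scatter(Z_demo[:, 0], Z_demo[:, 1])`` would draw the projection. With real embeddings, labeling each point with its word makes the plot much easier to read.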
