# Word Vectors and the FNC

## 0. Import packages.

In [5]:
import nltk
import numpy as np
from gensim.models.keyedvectors import KeyedVectors

## 1. Download the word vectors.

Go to https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit and download `GoogleNews-vectors-negative300.bin.gz`. When it finishes downloading, extract the file into this folder.

This may take a while. While it's downloading, read this page for more information about Word2Vec: https://code.google.com/archive/p/word2vec/. Then, read this more in-depth description: https://www.tensorflow.org/tutorials/representation/word2vec

## 2. Load the word vectors.

In [6]:
word2vec = {}

_wnl = nltk.WordNetLemmatizer()
def normalize_word(w):
    return _wnl.lemmatize(w).lower()

# Load Google's pre-trained Word2Vec model.
def initialize():
    global word2vec
    if len(word2vec) == 0:
        print('loading word2vec...')
        word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
        # Note: gensim 4.0+ removed `.vocab`; iterate over `word_vectors.key_to_index` there instead.
        for word in word_vectors.vocab:
            word2vec[normalize_word(word)] = word_vectors[word]
        print('word2vec loaded')

In [7]:
initialize()

loading word2vec...
word2vec loaded


Now, `word2vec` is a dictionary that maps each normalized word to its vector, so you can look up the vectors for various words. If you're curious, try running the following cell with different words.

In [8]:
word2vec['hi']

array([ -2.20703125e-01,   3.27148438e-02,   1.29882812e-01,
         2.31933594e-02,  -4.53125000e-01,  -7.51953125e-02,
        -9.37500000e-02,   3.90625000e-02,  -1.54296875e-01,
        -1.39160156e-02,   2.22167969e-02,  -1.19628906e-01,
        -1.68945312e-01,  -2.13623047e-02,  -1.87500000e-01,
         1.97753906e-02,   2.22656250e-01,   5.27343750e-01,
        -1.21582031e-01,   4.37011719e-02,  -3.71093750e-01,
         4.34570312e-02,   7.86132812e-02,  -3.00781250e-01,
        -1.56250000e-01,  -2.27539062e-01,  -2.03857422e-02,
         1.84326172e-02,   4.37011719e-02,   3.20434570e-03,
         2.29492188e-01,   8.00781250e-02,  -1.10839844e-01,
        -2.51953125e-01,  -6.05468750e-02,  -4.56542969e-02,
        -2.06054688e-01,  -7.71484375e-02,  -5.83496094e-02,
        -1.31835938e-01,  -8.39843750e-02,   1.87500000e-01,
        -1.34887695e-02,   1.39648438e-01,  -1.77001953e-02,
        -1.73828125e-01,  -2.51464844e-02,  -3.16406250e-01,
        -1.12792969e-01,

## 2. Write a function to convert a sentence into a vector.

The word vectors we downloaded give us a vector for each word, but we want a vector to represent each sentence. This function does just that. Read through the code, and try to answer the following questions:

**In Line $2$, why do we write `[0.0] * 300`?** (Hint: try running the following cell and see what happens).

In [9]:
[0.0] * 5

[0.0, 0.0, 0.0, 0.0, 0.0]

**Why do we need the `if` statement on line $5$? What do you think would happen if we removed it?**

**What's the point of line $9$ and line $10$?** (Hint: Think about why we divided by the norms of the vectors when computing cosine similarity).

In [17]:
def sentence2vector(sentence, word2vec):
    vector = np.array([0.0] * 300)
    count = 0
    for word in sentence:
        if word in word2vec:
            vector += word2vec[word]
            count += 1
    if count > 0:
        vector /= count
        vector /= np.linalg.norm(vector)
    return vector

Now, try creating vectors from sentences.

In [11]:
title1 = "sam I am"
title2 = "I am sam"
title3 = "green eggs and ham"

In [18]:
titlevector1 = sentence2vector(title1, word2vec)
titlevector2 = sentence2vector(title2, word2vec)
titlevector3 = sentence2vector(title3, word2vec)
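
One detail worth checking here: `title1` is a plain string, and a `for` loop over a Python string visits characters rather than words, so the calls above look up single characters (and spaces) in `word2vec`. The sketch below illustrates the difference on a toy dictionary. `toy_vectors` and `average_vector` are hypothetical stand-ins with made-up 3-dimensional values, not the real embeddings, and lowercasing plus whitespace splitting is only one assumed way to tokenize.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, made up for illustration only.
toy_vectors = {
    "sam": np.array([1.0, 0.0, 0.0]),
    "i": np.array([0.0, 1.0, 0.0]),
    "am": np.array([0.0, 0.0, 1.0]),
    "a": np.array([1.0, 1.0, 0.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Average the vectors of the items that appear in the vocabulary."""
    total = np.zeros(dim)
    count = 0
    for token in tokens:
        if token in vectors:
            total += vectors[token]
            count += 1
    return total / count if count > 0 else total

title = "sam I am"

# Iterating over the string visits characters: 's', 'a', 'm', ' ', 'I', ...
chars_seen = [c for c in title]

# Lowercasing and splitting on whitespace visits words instead.
tokens = title.lower().split()

char_vector = average_vector(title, toy_vectors)   # only the two 'a' characters match
word_vector = average_vector(tokens, toy_vectors)  # 'sam', 'i', and 'am' all match

print(chars_seen)
print(tokens)
print(char_vector, word_vector)
```

If you want word-level vectors from the real model, passing a list of normalized tokens to `sentence2vector` instead of the raw string is one option; the similarity scores you get may then differ from the ones recorded in this notebook.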

Compute the cosine similarities between these titles. Do the results surprise you?

In [19]:
def compute_cosine_similarity(vector1, vector2):
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

In [20]:
print(compute_cosine_similarity(titlevector1, titlevector2))
print(compute_cosine_similarity(titlevector1, titlevector3))

1.0
0.667999486431
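
The first result follows from how `sentence2vector` is built: it sums and averages vectors, and addition does not depend on order, so two inputs made of the same items map to the same vector. Here is a minimal sketch with made-up 2-dimensional vectors; the `toy` dictionary and the `mean_unit_vector` helper are hypothetical and only mirror the function above on token lists.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, made up for illustration only.
toy = {
    "sam": np.array([1.0, 0.0]),
    "i": np.array([0.0, 1.0]),
    "am": np.array([1.0, 1.0]),
}

def mean_unit_vector(tokens, vectors):
    """Average the known tokens' vectors, then scale to unit length."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(2)
    mean = np.mean(known, axis=0)
    return mean / np.linalg.norm(mean)

# Same words, different order.
v1 = mean_unit_vector(["sam", "i", "am"], toy)
v2 = mean_unit_vector(["i", "am", "sam"], toy)

similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(similarity)
```

Averaging is a bag-of-words representation: it captures which words occur, but not the order they occur in.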


Try creating more sentence vectors of your own choosing.

## 3. Write a function to compute similarity using sentence vectors.

Instead of directly computing the cosine similarity between pairs of sentences, the FNC solution uses a slightly different similarity metric. This is one part of their `semantic_similarities` feature (we'll talk a bit more about the other part tomorrow, when we put everything together).
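
One way to relate this metric to cosine similarity: `sentence2vector` returns either a unit-length vector or, when no word is found, the all-zero vector. For two unit vectors, the plain dot product equals the cosine similarity; for a zero vector, the dot product is simply 0, whereas the cosine formula would divide by zero. The sketch below checks this on made-up vectors (all names are illustrative only).

```python
import numpy as np

# Two made-up vectors, each of length 5.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Scale each to unit length, as sentence2vector does for non-empty sentences.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# Cosine similarity of the raw vectors vs. dot product of the unit vectors.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_units = np.dot(a_unit, b_unit)

# A sentence with no known words yields the zero vector; its dot product is 0.
zero = np.zeros(2)
dot_with_zero = np.dot(a_unit, zero)

print(cosine, dot_of_units, dot_with_zero)
```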

Read the code, and try to answer the following questions:

**Lines $7$ through $16$ loop through each sentence in `body_sentences`. What does this loop do with each body sentence?** (The `body_sentences` argument is a list of all the sentences in an article body.)

**What does the returned feature include?**

**Try to re-write this function in pseudo-code, writing a one-line description of what each line of code does.**

In [21]:
def semantic_similarities(title, body_sentences, word2vec):
    title_vector = sentence2vector(title, word2vec)
    max_sim = -1
    best_vector = np.array([0.0] * 300)

    supports = []
    for sub_body in body_sentences:
        sub_body_vector = sentence2vector(sub_body, word2vec)
        similarity = 0
        for i in range(300):
            similarity += title_vector[i] * sub_body_vector[i]
        if similarity > max_sim:
            max_sim = similarity
            best_vector = sub_body_vector

        supports.append(similarity)

    features = [max(supports), min(supports)]

    for v in best_vector:
        features.append(v)
    for v in title_vector:
        features.append(v)
    return features
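
As a cross-check on your pseudo-code, the same computation can be sketched with NumPy array operations. This is an assumed equivalent, not the FNC authors' code: `semantic_similarities_vectorized` is a hypothetical name, it takes already-computed sentence vectors rather than raw text, and the toy vectors are 3-dimensional instead of 300-dimensional.

```python
import numpy as np

def semantic_similarities_vectorized(title_vector, body_vectors):
    """Hypothetical NumPy version: dot each body vector with the title vector."""
    body_matrix = np.vstack(body_vectors)    # shape: (num_sentences, dim)
    supports = body_matrix @ title_vector    # one dot product per body sentence
    best_vector = body_matrix[np.argmax(supports)]  # first sentence with the highest score
    # Feature layout: [max, min, best body vector..., title vector...]
    return [supports.max(), supports.min(), *best_vector, *title_vector]

# Made-up unit vectors standing in for sentence2vector outputs.
title_vector = np.array([1.0, 0.0, 0.0])
body_vectors = [
    np.array([0.0, 1.0, 0.0]),
    np.array([0.6, 0.8, 0.0]),
    np.array([1.0, 0.0, 0.0]),
]

features = semantic_similarities_vectorized(title_vector, body_vectors)
print(len(features))
print(features[:2])
```

With 3-dimensional vectors the feature list has 2 + 3 + 3 = 8 entries; with the 300-dimensional vectors used in this notebook it would have 602. The sketch and the loop can differ in one edge case: the loop leaves `best_vector` as all zeros when no similarity exceeds -1. Both versions fail on an empty list of body sentences.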

We'll build upon this code tomorrow, when we start looking at the full FNC solution!