## Word2Vec tutorial

In this tutorial we go over the basic usage of a pre-trained word2vec model to obtain embeddings for a piece of text. You will learn how to download a pre-trained model, load it into Python, convert a piece of text to an embedding, and save the embeddings. You can then use these embeddings for a classification task, for analysis of the corpus, or for other similarity-oriented tasks.

The dataset used for this tutorial is the same as the one used for the LDA tutorial: a collection of papers from the prestigious [NIPS](https://nips.cc/) conference, obtained from a [Kaggle dataset](https://www.kaggle.com/benhamner/nips-papers). You can download the data from the Kaggle link; it is also attached to this file and placed in the folder "Data".



In [1]:
import pandas as pd
import numpy as np
import string # Used to remove punctuation
import nltk 
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1108)>


# Pre-processing the data
Similar to any other NLP project, we first have to pre-process the data. We will do so in the same way as in the LDA tutorial: lowercase the text, remove punctuation, and remove stop words.

In [2]:
def column_to_lower(df, column):
    return df[column].str.lower()

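def column_remove_punctuation(df, column):
    # delete every punctuation character; avoids relying on pandas' regex default, which differs between versions
    return df[column].str.translate(str.maketrans('', '', string.punctuation))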

def column_remove_stop_words(df, column, stop_words):
    print(f"Currently processing the column: {column}")
    stop_words = set(stop_words) # set membership checks are faster than list lookups
    return df[column].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

In [3]:
df = pd.read_csv("Data/papers.csv")
stop_words = stopwords.words('english')
df['abstract'] = column_to_lower(df, 'abstract')
df['abstract'] = column_remove_punctuation(df, 'abstract')
df['abstract'] = column_remove_stop_words(df, 'abstract', stop_words)

Currently processing the column: abstract


# Downloading and loading the pre-trained word2vec model
We will use the package gensim to work with the word2vec model. Gensim provides access to a range of pre-trained models; one of them is the Google News model (GoogleNews-vectors-negative300), which was trained on part of the Google News corpus, roughly 100 billion words. The number 300 is the dimensionality of the word vectors, i.e. of the embedding space. In a Bag of Words setting this dimensionality is determined by the number of unique words in the corpus, whereas in word2vec it is fixed before training the model.

To download the pre-trained model I took the following steps:
 - Install wget if it is not available yet. On macOS with Homebrew, type in your terminal: brew install wget
 - In your terminal type: wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
 - Move the downloaded archive to the folder that contains your code
 - Decompress the model, either manually or by navigating to that folder and using the terminal command: gzip -d GoogleNews-vectors-negative300.bin.gz
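
If you prefer to stay inside Python for the last step, the decompression can also be sketched with the standard library. The helper name `gunzip` is made up for this example; it is not part of gensim or of the original workflow.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dst=None):
    """Decompress a .gz file to dst (defaults to src without the .gz suffix)."""
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix("")
    # Copy in chunks so the multi-gigabyte model never has to fit in memory at once
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst
```

Calling `gunzip("GoogleNews-vectors-negative300.bin.gz")` should write `GoogleNews-vectors-negative300.bin` next to the archive. Unlike `gzip -d`, this sketch leaves the original archive in place.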

In [4]:
from gensim import models
# load the pre-trained word2vec model; the file is several GB, so this can take a while
# https://stackoverflow.com/questions/46433778/import-googlenews-vectors-negative300-bin
w2v_vectors = models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)


Now that we have loaded the model into the Python variable "w2v_vectors", we can use it to get our embeddings. First, let us inspect some functions that gensim provides. One famous example is the analogy: man is to king as woman is to ?. We can get the answer with the following code:

In [5]:
result = w2v_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
print(f"Considering the word relationship (king, man) \n The model deems the pair (woman, ?) to be answered as ? = {result[0][0]} with score {result[0][1]} ")

Considering the word relationship (king, man) 
 The model deems the pair (woman, ?) to be answered as ? = queen with score 0.7118192911148071 


As you can see, the model fills in the missing word with "queen". We found this result by asking the word2vec model which word vector is most similar to the vector obtained by adding "king" and "woman" and subtracting "man". The similarity is measured with the cosine similarity, a metric that is often used to determine how similar two words are. Intuitively, a high cosine similarity score (close to 1) indicates that two word vectors point in nearly the same direction, so the words are very similar, while a score close to 0 indicates that the words are unrelated. The cosine similarity is calculated as

<img src="Images/cossim.png">

This allows us to compare word vectors with each other. Consider this example

<img src="Images/cosine_sim_example.png">
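
As a minimal sketch, the cosine similarity can be written in a few lines of NumPy. The function name and the toy 2D vectors are made up for illustration and are not taken from the model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 2D vectors, invented for this example
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0 and -1. Note that the length of the vectors does not matter, only their direction, and that the sketch does not guard against zero vectors.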

It is difficult to visualise word vectors beyond 3D space. Our word vectors are 300-dimensional, which nobody can picture directly, but for illustrative purposes we can pretend they are 2D and make use of a 2D plot. Consider this figure:

<img src="Images/word_e_r.png">

You can see that the relationship between words can be interpreted as vector operations. Being able to answer analogies such as "man is to king as woman is to ?" is important: it shows that the word vectors capture relationships between words, and with them some of the words' meaning. We cannot directly infer such relationships using e.g. BoW or tf-idf.
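
To make the vector-operation view concrete, here is a simplified sketch with hand-picked 2D vectors. The vectors and the helper `analogy` are hypothetical; the real model uses 300 dimensions, and gensim's own implementation differs in details such as normalisation.

```python
import numpy as np

# Hypothetical 2D "embeddings", chosen by hand for illustration only.
# Axis 0 loosely encodes royalty, axis 1 loosely encodes gender.
toy_vectors = {
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(positive, negative, vectors):
    """Return the word whose vector is closest to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # The query words themselves are not allowed as an answer
    exclude = set(positive) | set(negative)
    scores = {
        word: cosine_similarity(target, vec)
        for word, vec in vectors.items()
        if word not in exclude
    }
    return max(scores, key=scores.get)


answer = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(answer)  # queen
```

Here king + woman - man equals [1, -1], which is exactly the toy vector for "queen", so the analogy resolves as in the figure.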

## Other functions using Gensim
Gensim also offers some other functions that could be useful for your project. One of them answers the question: given a list of words, which word does not match the others?

In [6]:
print(w2v_vectors.doesnt_match("breakfast cereal dinner lunch".split()))

cereal




We can also ask for the distance between two words. Gensim computes this as the cosine distance, i.e. 1 minus the cosine similarity. Intuitively, words that are similar lie closer to each other in the embedding space, so their distance is smaller than that of words that have nothing to do with each other. The distance between a vector and itself is always 0.

In [7]:
distance = w2v_vectors.distance("coffee", "tea")
print(f"The distance between the words 'coffee' and 'tea' is: {distance}")

distance = w2v_vectors.distance("coffee", "coffee")
print("The distance between a word and itself (using w2v) is always: ", distance)

The distance between the words 'coffee' and 'tea' is: 0.43647080659866333
The distance between a word and itself (using w2v) is always:  0.0


Now we will convert a word to its corresponding word2vec embedding and inspect the result. Given that we use the pre-trained Google News model with 300 dimensions, we expect to retrieve a vector with 300 elements for each word; in NumPy terms its shape is (300,). We will obtain a word vector, print its shape, and inspect the actual values.

In [8]:
vector = w2v_vectors['computer']
print("Shape of the vector is: ", vector.shape)
print("With values: ", vector)

Shape of the vector is:  (300,)
With values:  [ 1.07421875e-01 -2.01171875e-01  1.23046875e-01  2.11914062e-01
 -9.13085938e-02  2.16796875e-01 -1.31835938e-01  8.30078125e-02
  2.02148438e-01  4.78515625e-02  3.66210938e-02 -2.45361328e-02
  2.39257812e-02 -1.60156250e-01 -2.61230469e-02  9.71679688e-02
 -6.34765625e-02  1.84570312e-01  1.70898438e-01 -1.63085938e-01
 -1.09375000e-01  1.49414062e-01 -4.65393066e-04  9.61914062e-02
  1.68945312e-01  2.60925293e-03  8.93554688e-02  6.49414062e-02
  3.56445312e-02 -6.93359375e-02 -1.46484375e-01 -1.21093750e-01
 -2.27539062e-01  2.45361328e-02 -1.24511719e-01 -3.18359375e-01
 -2.20703125e-01  1.30859375e-01  3.66210938e-02 -3.63769531e-02
 -1.13281250e-01  1.95312500e-01  9.76562500e-02  1.26953125e-01
  6.59179688e-02  6.93359375e-02  1.02539062e-02  1.75781250e-01
 -1.68945312e-01  1.21307373e-03 -2.98828125e-01 -1.15234375e-01
  5.66406250e-02 -1.77734375e-01 -2.08984375e-01  1.76757812e-01
  2.38037109e-02 -2.57812500e-01 -4.46777344

So the word vector is indeed a continuous vector with a dimensionality of 300. As you can see, we as humans cannot directly interpret these numbers; on their own they do not hold any meaning to us. However, measures such as the cosine similarity or cosine distance help us understand the relationships between these vectors.

The next step is to convert a sentence to an embedding. Since a sentence consists of multiple words, we retrieve the embedding for each word and average the vectors to obtain a sentence embedding. The code below contains a try/except block because you may encounter a word that the model does not know. This can have several reasons: the word is a stopword, it is not an English word, or the model did not encounter it during training. In that case we simply skip the word and do not take it into consideration for the sentence embedding. If you find that the model does not know many of the words you encounter, you have to either train or fine-tune a model on your own corpus or download a different pre-trained model.

In [9]:
sentence = "In the academic world there are numerous interesting and prestigous journals"
parsed_sentence = sentence.lower().split()
print(sentence)
print(parsed_sentence)

word_vectors = []

# For each word in the sentence
# - Try to retrieve the corresponding word vector
# - Append the word embedding to a list
# Once we have all word embeddings, we take the average over the first dimension to obtain an average embedding for the sentence
for word in parsed_sentence:
    try:
        word_vector = w2v_vectors[word]
        word_vectors.append(word_vector)
    except KeyError:
        # the model has no vector for this word, so we skip it
        print(f"Word '{word}' is not in the vocabulary.")

print(np.asarray(word_vectors).shape) # the first dimension is the number of words that the model has word vectors for
sentence_w2v_embedding = np.average(np.asarray(word_vectors), axis = 0) # hence we take the average over the first dimension
print(sentence_w2v_embedding) # the end result remains uninterpretable for humans!

In the academic world there are numerous interesting and prestigous journals
['in', 'the', 'academic', 'world', 'there', 'are', 'numerous', 'interesting', 'and', 'prestigous', 'journals']
Word 'and' is not in the vocabulary.
(10, 300)
[-0.00956116  0.03077087  0.01315918  0.12819824  0.07929687 -0.06503906
  0.03223877 -0.04815968  0.07803345  0.06192627 -0.02783203 -0.12019996
  0.02840576  0.08427735 -0.06013184  0.03520508  0.02338257  0.07210694
 -0.02193603  0.00145264 -0.03520203  0.03729858 -0.06897583 -0.02338638
 -0.03984986 -0.04063721 -0.09047394  0.08147583 -0.05607452 -0.05244141
  0.02787743 -0.10141907  0.03948975  0.03804016  0.07524414 -0.031073
 -0.02215385  0.00544434  0.07575683  0.05131836  0.10772705  0.01655274
 -0.03143311  0.10469361 -0.00141602  0.02293854  0.01469421 -0.04886322
 -0.0708374   0.01257324 -0.02077484  0.02348023 -0.09191284 -0.02677612
  0.02268066  0.09912644  0.08406983 -0.14236145  0.09207153 -0.08804779
 -0.0247757   0.03312988  0.05045166 
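
The averaging step can also be wrapped in a small reusable function. The sketch below is one possible way to do it, not the only one: the name `average_embedding` is made up, and the function assumes that `vectors` supports `token in vectors` and `vectors[token]`, which holds for a plain dict and, as far as I know, for gensim's KeyedVectors. A toy dict stands in for the model here.

```python
import numpy as np


def average_embedding(tokens, vectors, dim=300):
    """Return (mean vector of the known tokens, list of unknown tokens)."""
    known = [vectors[t] for t in tokens if t in vectors]
    missing = [t for t in tokens if t not in vectors]
    if not known:
        # No token was found: return zeros instead of averaging an empty list
        return np.zeros(dim, dtype=np.float32), missing
    return np.mean(known, axis=0), missing


# Toy 3D vectors, invented for this example
toy = {
    "academic": np.array([1.0, 0.0, 2.0]),
    "world": np.array([3.0, 2.0, 0.0]),
}
embedding, missing = average_embedding(["academic", "and", "world"], toy, dim=3)
print(embedding, missing)
```

Checking membership with `in` avoids the try/except block, and the explicit zero-vector fallback avoids a NaN result when none of the words are known. Whether a zero vector is the right fallback depends on your downstream task.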

# Convert abstracts to word embeddings and save
Now that we have seen how to convert words (strings) to word vectors, let us consider the real data and create an embedding for each available abstract. We will store the embeddings in a dict of the format {abstract_id: sentence_embedding} and pickle the result. Once this script has been run, we can retrieve the embedding of an abstract by looking up its ID, which is available in the dataframe.

For illustrative purposes we only consider the first non-missing abstract and show how you can store and save the result. To let the code go over every abstract in the data, remove the "break" at line 26. This is just an example; there are many other ways to save the data!

In [10]:
dict_sentence_embeddings = {} # dict with mapping of {id: sentence_embedding}
for index, row in df.iterrows():
    sentence_embedding_list = []
    abstract = row['abstract']
    row_id = row['id']

    missing_words = 0
    if abstract == "abstract missing":
        # scenario where there is no abstract
        continue
    else:
        print(" === new non-missing abstract ===")
        for word in abstract.split():
            try:
                word_vector = w2v_vectors[word]
                sentence_embedding_list.append(word_vector)
            except KeyError:
                missing_words += 1
                print(f"Word '{word}' is not in the vocabulary.")

    # average over the word axis to get one 300-dimensional vector per abstract
    sentence_embedding = np.average(np.asarray(sentence_embedding_list), axis = 0)
    dict_sentence_embeddings[row_id] = sentence_embedding

    print(f"Out of {len(abstract.split())} words there were {missing_words} missing in the w2v model")
    break

 === new non-missing abstract ===
Word 'nonnegative' is not in the vocabulary.
Word 'nmf' is not in the vocabulary.
Word 'plicative' is not in the vocabulary.
Word 'nmf' is not in the vocabulary.
Word 'kullbackleibler' is not in the vocabulary.
Word 'onally' is not in the vocabulary.
Out of 65 words there were 6 missing in the w2v model


In [11]:
import pickle

# dump the dictionary with mapping {id: sentence_embedding} using pickle
with open('id_w2v_map.pickle', 'wb') as handle:
    pickle.dump(dict_sentence_embeddings, handle, protocol=pickle.HIGHEST_PROTOCOL)

# obtain the dictionary with mapping {id: sentence_embedding} using pickle
with open('id_w2v_map.pickle', 'rb') as handle:
    dict_embeddings = pickle.load(handle)

#print(dict_embeddings)
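
As one example of another way to save the data, the mapping can be stored as two NumPy arrays in a compressed .npz file. This is only a sketch: the function names and the file layout (keys "ids" and "embeddings") are choices made for this example, and it assumes the dict is non-empty and that all embeddings have the same shape.

```python
import numpy as np


def save_embeddings_npz(path, id_to_embedding):
    """Store {id: embedding} as an id array and a (n_items, dim) matrix."""
    ids = np.array(list(id_to_embedding.keys()))
    matrix = np.stack(list(id_to_embedding.values()))
    # path should end in ".npz"; NumPy appends the extension otherwise
    np.savez_compressed(path, ids=ids, embeddings=matrix)


def load_embeddings_npz(path):
    """Rebuild the {id: embedding} dict written by save_embeddings_npz."""
    with np.load(path) as data:
        return dict(zip(data["ids"].tolist(), data["embeddings"]))
```

With the variables from this tutorial, usage would look like `save_embeddings_npz('id_w2v_map.npz', dict_sentence_embeddings)` followed by `load_embeddings_npz('id_w2v_map.npz')`. Compared to pickle, the resulting file can be read without unpickling arbitrary Python objects.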