# Word2Vec Example

### Getting and preprocessing the dataset 

This example uses a dataset of movie plots (in txt format).

In [4]:
import gdown

url = 'https://drive.google.com/uc?id=1nrHLegM4ee7RoVNZEXMpxskQAMaNu06W'
output = "movie_plots.txt"
gdown.download(url, output, quiet=False)


Downloading...
From: https://drive.google.com/uc?id=1nrHLegM4ee7RoVNZEXMpxskQAMaNu06W
To: /content/movie_plots.txt
20.7MB [00:00, 151MB/s]


'movie_plots.txt'

Once the dataset is downloaded, we need to do some cleaning and preprocessing **before** generating the embeddings. Preprocessing includes tasks such as tokenization, lowercasing, and the removal of very short tokens (such as single letters) and very long ones. **Gensim** includes a simple preprocessing method, *simple_preprocess*, that performs these tasks automatically.

In [5]:
import gensim

def process_input_file(input_file):
  token_list=[]
  with open(input_file,'r',encoding='utf-8',errors='ignore') as f:
    for line in f:
      token_list.append(gensim.utils.simple_preprocess(line))
  return token_list

documents=process_input_file("./movie_plots.txt")



Once this preprocessing is done, we end up with a list of token lists, one per line of the input file, in which tokens keep the order they had in the original text. This ensures that word context is maintained. To check that this holds, we take a look at one of the processed plots (the entry at index 1).

In [None]:
documents[1]

['bartender',
 'is',
 'working',
 'at',
 'saloon',
 'serving',
 'drinks',
 'to',
 'customers',
 'after',
 'he',
 'fills',
 'stereotypically',
 'irish',
 'man',
 'bucket',
 'with',
 'beer',
 'carrie',
 'nation',
 'and',
 'her',
 'followers',
 'burst',
 'inside',
 'they',
 'assault',
 'the',
 'irish',
 'man',
 'pulling',
 'his',
 'hat',
 'over',
 'his',
 'eyes',
 'and',
 'then',
 'dumping',
 'the',
 'beer',
 'over',
 'his',
 'head',
 'the',
 'group',
 'then',
 'begin',
 'wrecking',
 'the',
 'bar',
 'smashing',
 'the',
 'fixtures',
 'mirrors',
 'and',
 'breaking',
 'the',
 'cash',
 'register',
 'the',
 'bartender',
 'then',
 'sprays',
 'seltzer',
 'water',
 'in',
 'nation',
 'face',
 'before',
 'group',
 'of',
 'policemen',
 'appear',
 'and',
 'order',
 'everybody',
 'to',
 'leave']

As shown, all tokens have been converted to lowercase, and punctuation and single-character words such as "a" have been removed. Common stopwords such as "the" and "is" are kept. Still, the context of the original sentence remains.
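To make the effect of this cleaning step concrete, here is a rough, standard-library-only approximation of it: lowercase the text, split it into alphabetic tokens, and keep only tokens between 2 and 15 characters long. The function name `rough_preprocess` and the sample sentence are illustrative assumptions, and Gensim's own `simple_preprocess` may differ in details.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate the cleaning step: lowercase, split into alphabetic
    tokens, and drop tokens that are too short or too long."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

# Hypothetical input sentence, written only for this illustration.
sample = "A bartender is working at a saloon, serving drinks to customers."
print(rough_preprocess(sample))
```

In this sketch the article "A" disappears because it is a single character, while short stopwords such as "is" and "at" survive, which matches the token list shown above.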

### Initializing and training the model

Once the documents are tokenized, we can initialize and train our Word2Vec model. Before training the model, certain parameters need to be defined:


*   *size*: Dimensionality of word vectors. It should be chosen in proportion to the size of the corpus and of the vocabulary. **As we have a reduced corpus, a size of 100 should be adequate for our problem.**
*   *window*: Size of the context window, that is, how many neighbouring words are considered around each word. It should be large enough to contextualize a word. In this example, **we will use a window of 10**.
*   *min_count*: Minimum number of appearances of a word in the corpus to be considered for embedding. **We will consider that at least 5 occurrences of a word are enough for it to be embedded**.
*   *sg*: Training algorithm: 1 for skip-gram; 0 for CBOW. **We will use skip-gram**.
*   *iter*: Number of training iterations. We will do **5 iterations**.
*   *seed*: Initialization seed. **We will use 1852 as our seed.**
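The roles of *min_count* and *window* can be illustrated with a small standard-library sketch. It assumes that rare words are dropped before context windows are formed and that every word within `window` positions of a target counts as context. The helper names `filter_rare` and `skipgram_pairs` and the toy corpus are hypothetical, and the real training procedure includes further details that are not shown here.

```python
from collections import Counter

def filter_rare(sentences, min_count):
    """Drop words that occur fewer than min_count times in the corpus."""
    counts = Counter(w for s in sentences for w in s)
    return [[w for w in s if counts[w] >= min_count] for s in sentences]

def skipgram_pairs(sentence, window):
    """Return (target, context) pairs for every word that lies within
    `window` positions of each target word."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

# Toy corpus, made up for this illustration.
toy_corpus = [
    ["the", "bartender", "serves", "the", "beer"],
    ["the", "bartender", "sprays", "the", "water"],
]
kept = filter_rare(toy_corpus, min_count=2)
print(kept[0])
print(skipgram_pairs(kept[0], window=1))
```

With `min_count=2`, only "the" and "bartender" survive in this toy corpus, so a larger *min_count* shrinks the vocabulary, while a larger *window* produces more (target, context) pairs per word.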





In [6]:
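# Gensim 3.x argument names (Gensim 4 renamed size to vector_size and iter to epochs).
w2v_model=gensim.models.Word2Vec(documents,size=100,window=10,min_count=5,sg=1,iter=5,seed=1852)
# Passing documents above already builds the vocabulary and trains; this call runs 5 further epochs.
w2v_model.train(documents, total_examples=w2v_model.corpus_count, epochs=w2v_model.iter)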

  


(12958641, 17280470)

After the model has been trained, we can extract the generated word embeddings (a dictionary-like *KeyedVectors* object) and run queries on them.

In [7]:
w2v_embeddings=w2v_model.wv

With the extracted embeddings, we can perform certain operations such as:

### Ask for the most similar word to a given word

In [8]:
word="singer"
w2v_embeddings.most_similar(positive=word)

[('nightclub', 0.7739318609237671),
 ('dancer', 0.7558026909828186),
 ('hotheaded', 0.7398906946182251),
 ('performer', 0.7246049642562866),
 ('ingenue', 0.7170679569244385),
 ('soprano', 0.7109376788139343),
 ('netta', 0.7076431512832642),
 ('starlet', 0.706315815448761),
 ('superstar', 0.7013533711433411),
 ('vocalist', 0.7001423835754395)]

If we just want to get the *N* most similar words, then:

In [9]:
word=["dog"]
n=3
w2v_embeddings.most_similar(positive=word,topn=n)

[('labrador', 0.7114201188087463),
 ('dogs', 0.7100064158439636),
 ('puppy', 0.6919729113578796)]
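The scores in these lists are cosine similarities between word vectors. A minimal sketch of the ranking, using a plain dictionary of made-up three-dimensional vectors instead of the trained model (the names `cosine`, `top_n_similar` and `toy_vectors` are assumptions for illustration), could look like this:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_similar(vectors, word, n):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

# Made-up vectors; real embeddings have 100 dimensions in this example.
toy_vectors = {
    "dog": [1.0, 0.9, 0.0],
    "puppy": [0.9, 1.0, 0.1],
    "cat": [0.7, 0.2, 0.6],
    "beach": [0.0, 0.1, 1.0],
}
print(top_n_similar(toy_vectors, "dog", n=2))
```

This brute-force loop is only meant to show the idea; a library implementation would typically compute all similarities at once with matrix operations.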

### Check whether a word is represented or not

As mentioned before, the embeddings behave like a dictionary in which words play the role of keys and the embedding vectors are the values. However, because it is a special dictionary-like object, the vocabulary can't be retrieved by calling *keys()* on it directly. The following snippet retrieves the vocabulary represented by the W2V model through its *vocab* attribute (named *key_to_index* in Gensim 4), and then uses a membership test to detect whether a word is represented or not:

In [10]:
w2v_vocabulary=w2v_embeddings.vocab.keys()
word='platypus'
if word in w2v_vocabulary:
  print(True)
else:
  print(False)

False
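When several words need to be checked at once, the same membership test can be applied with list comprehensions. The sketch below uses a plain dictionary as a stand-in for the model vocabulary; `split_by_vocabulary` and `toy_vocabulary` are hypothetical names introduced only for this illustration.

```python
def split_by_vocabulary(words, vocabulary):
    """Separate words into those present in the vocabulary and those missing."""
    known = [w for w in words if w in vocabulary]
    unknown = [w for w in words if w not in vocabulary]
    return known, unknown

# Stand-in vocabulary; with the trained model this would be its vocabulary mapping.
toy_vocabulary = {"singer": 0, "dog": 1, "beach": 2}
known, unknown = split_by_vocabulary(["dog", "platypus", "beach"], toy_vocabulary)
print(known, unknown)
```

Filtering out unknown words first avoids lookup errors in later queries.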


### Measure the similarity between pairs of words

Given two words that exist in the corpus, we can measure the similarity between them. If the model is properly trained, words that refer to similar concepts, such as synonyms, should get high similarity scores.

In [11]:
word1="movie"
word2="film"
w2v_embeddings.similarity(word1,word2)

0.825347

By contrast, antonyms should receive low scores:

In [12]:
word1="great"
word2="awful"
w2v_embeddings.similarity(word1,word2)

0.32689977
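The value returned by *similarity* is the cosine similarity between the two word vectors. As a reminder of how that score behaves, for two vectors $u$ and $v$:

$$
\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

As a toy illustration (these vectors are made up, not taken from the model), $u = (1, 0)$ and $v = (1, 1)$ give $\mathrm{sim}(u, v) = 1 / (1 \cdot \sqrt{2}) \approx 0.707$. The score is 1 when both vectors point in the same direction and 0 when they are orthogonal.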

### Retrieve the embedding of a given word

There are two ways of retrieving the vector of a word. The first is the built-in method *get_vector*, which returns the embedding associated with the input word if it exists in the vocabulary and raises a *KeyError* otherwise.

In [13]:
word="beach"
w2v_embeddings.get_vector(word)

array([-0.11691452, -0.1356225 ,  0.10066555,  0.35933575,  0.26329288,
       -0.25205   ,  0.24960966, -0.06589814,  0.01037013,  1.0342513 ,
       -0.47822237, -0.10310647,  0.29664803, -0.2733945 , -0.25119177,
        0.28906366,  0.18302594, -0.37111542,  0.14916371,  0.01552051,
        0.11886437, -0.10779689,  0.2465281 ,  0.15453239, -0.3312453 ,
        0.13145737, -0.04551599,  0.21664234, -0.3127712 , -0.2797321 ,
       -0.2933099 , -0.22166942,  0.18876773, -0.6957975 ,  0.13720626,
       -0.10923652,  0.5118466 ,  0.07903919, -0.00449225, -0.50585634,
        0.62448114, -0.02667939, -0.19505975,  0.05632718, -0.13266708,
        0.17240706,  0.33905092,  0.20434055,  0.3421863 , -0.19651157,
        0.11847249, -0.4463276 ,  0.2541473 , -0.0915373 ,  0.31275782,
       -0.26218522,  0.10099335, -0.08984791,  0.30922264,  0.56262803,
       -0.07055393,  0.35954636,  0.05894015,  0.2949058 ,  0.43041632,
       -0.27441806, -0.13565955, -0.00531093, -0.05123455,  0.02

Because the embeddings can be indexed like a dictionary whose keys are words, the same vector can also be obtained with square-bracket indexing:

In [14]:
w2v_embeddings['beach']

array([-0.11691452, -0.1356225 ,  0.10066555,  0.35933575,  0.26329288,
       -0.25205   ,  0.24960966, -0.06589814,  0.01037013,  1.0342513 ,
       -0.47822237, -0.10310647,  0.29664803, -0.2733945 , -0.25119177,
        0.28906366,  0.18302594, -0.37111542,  0.14916371,  0.01552051,
        0.11886437, -0.10779689,  0.2465281 ,  0.15453239, -0.3312453 ,
        0.13145737, -0.04551599,  0.21664234, -0.3127712 , -0.2797321 ,
       -0.2933099 , -0.22166942,  0.18876773, -0.6957975 ,  0.13720626,
       -0.10923652,  0.5118466 ,  0.07903919, -0.00449225, -0.50585634,
        0.62448114, -0.02667939, -0.19505975,  0.05632718, -0.13266708,
        0.17240706,  0.33905092,  0.20434055,  0.3421863 , -0.19651157,
        0.11847249, -0.4463276 ,  0.2541473 , -0.0915373 ,  0.31275782,
       -0.26218522,  0.10099335, -0.08984791,  0.30922264,  0.56262803,
       -0.07055393,  0.35954636,  0.05894015,  0.2949058 ,  0.43041632,
       -0.27441806, -0.13565955, -0.00531093, -0.05123455,  0.02

### Get the closest word to a given vector

As words are represented in a vector space, we can apply operations such as addition or subtraction, which also produce vectors. Because these operations are approximate, the resulting vector does not correspond exactly to any word in the vocabulary. We can retrieve the words closest to a given vector as follows:


In [15]:
vector1=w2v_embeddings['music']
vector2=w2v_embeddings['film']
operation=vector1+vector2
w2v_embeddings.similar_by_vector(operation)

[('music', 0.8312228322029114),
 ('film', 0.8091937303543091),
 ('comedians', 0.7460590600967407),
 ('segue', 0.7459224462509155),
 ('movie', 0.7425822615623474),
 ('theme', 0.7345827221870422),
 ('previn', 0.7257819175720215),
 ('storyline', 0.7213714122772217),
 ('orchestral', 0.7164971828460693),
 ('rhapsody', 0.7151800394058228)]

### Detect unfitting terms in a list of words

Given a list of words, we can identify which one of them is unrelated to the rest using the built-in function *doesnt_match*:

In [16]:
word_list=['cat','dog','mouse','actress','bird']
w2v_embeddings.doesnt_match(word_list)


'actress'
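One plausible way to find the odd word out, sketched here under the assumption that vectors are normalized to unit length first, is to average the vectors and return the word whose vector is least aligned with that average. The names `unit`, `odd_one_out` and `toy_vectors` are hypothetical, and the two-dimensional vectors are made up for illustration.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least aligned with the
    mean direction of all the words in the list."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(units[words[0]])
    mean = unit([sum(units[w][k] for w in words) / len(words)
                 for k in range(dim)])
    # The dot product with the mean direction measures how well a word fits.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up vectors: the three animals point roughly the same way.
toy_vectors = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "bird": [0.8, 0.3],
    "actress": [0.1, 1.0],
}
print(odd_one_out(toy_vectors, ["cat", "dog", "actress", "bird"]))
```

Because the animal vectors dominate the mean direction, the word pointing elsewhere gets the lowest score and is returned.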

### Perform analogies between words

One of the biggest improvements introduced by Word2Vec was the ability to perform analogies between words. This property not only demonstrates the quality of the embeddings generated by the model, but also enables the **approximate** representation of words that were not seen during training:

In [17]:
word1=w2v_embeddings['cat']
word2=w2v_embeddings['cats']
word3=w2v_embeddings['mouse']
operation=word1-word3+word2
w2v_embeddings.similar_by_vector(operation)

[('cats', 0.8587690591812134),
 ('cat', 0.628413200378418),
 ('mice', 0.6051133871078491),
 ('scat', 0.5927438735961914),
 ('dogs', 0.5771578550338745),
 ('lounging', 0.5638996362686157),
 ('jaws', 0.5573769211769104),
 ('feline', 0.5521344542503357),
 ('screeching', 0.5508318543434143),
 ('noisy', 0.5451743602752686)]

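A cleaner and faster way of performing analogical inference is to pass the words directly to *most_similar*, which normalizes the vectors and leaves the query words out of the results: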

In [18]:
w2v_embeddings.most_similar(positive=['cat', 'cats'], negative=['mouse'])

[('mice', 0.6658698320388794),
 ('scat', 0.6330621242523193),
 ('jaws', 0.6269769072532654),
 ('feline', 0.6197717785835266),
 ('dogs', 0.6094540357589722),
 ('canine', 0.6084303259849548),
 ('canary', 0.6057158708572388),
 ('screeching', 0.603484570980072),
 ('dog', 0.5941815376281738),
 ('hens', 0.5933408141136169)]
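The two result lists differ mainly because the second one no longer contains the query words "cats" and "cat". The sketch below imitates that behaviour on made-up vectors: it adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity while skipping the inputs. The function `toy_analogy`, the vectors, and the query ("cat is to cats as mouse is to ?") are assumptions chosen for illustration only.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_analogy(vectors, positive, negative, topn=3):
    """Combine unit vectors of positive and negative words, then rank
    all other words by cosine similarity to the combined vector."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, unit(vectors[w]))]
    for w in negative:
        query = [q - x for q, x in zip(query, unit(vectors[w]))]
    query = unit(query)
    inputs = set(positive) | set(negative)
    scores = [(w, sum(a * b for a, b in zip(query, unit(v))))
              for w, v in vectors.items() if w not in inputs]  # skip query words
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Made-up vectors: axis 0 ~ "cat-like", axis 1 ~ "mouse-like", axis 2 ~ "plural".
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "cats": [1.0, 0.0, 1.0],
    "mouse": [0.0, 1.0, 0.0],
    "mice": [0.0, 1.0, 1.0],
    "dogs": [0.9, 0.1, 1.0],
    "beach": [0.2, 0.1, 0.0],
}
print(toy_analogy(toy_vectors, positive=["cats", "mouse"], negative=["cat"]))
```

In this toy space the top answer is "mice", and none of the three query words can appear in the output.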