# TP3: Word embeddings
Master LiTL

Course page: https://github.com/chloebt/m2-litl-students


In this practical session, we will explore the generation of word embeddings.

We will use *gensim* to generate word embeddings. 
If you want to use your own computer, make sure it is installed (e.g. with the command ```pip install gensim```). 
If you’re using Anaconda/Miniconda, you can use the command ```conda install <modulename>``` instead.

Sources:
- Practical from T. van de Cruys
- https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
- https://radimrehurek.com/gensim/models/word2vec.html
- https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/: see an example based on the 20NewsGroup corpus
- http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/ 
- (not used but seems interesting: https://www.machinelearningplus.com/nlp/gensim-tutorial/#14howtotrainword2vecmodelusinggensim)


## 1. Look at the data

Upload the data: *corpus_food.txt.gz*. 
The data come from blogs on cooking.

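You can take a look at your data using a terminal and the following commands (after decompressing a copy of the file, e.g. with ```gunzip -c corpus_food.txt.gz > corpus_food.txt```):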

* Number of lines:
```
$ wc -l corpus_food.txt
1161270 corpus_food.txt
```

* First ten lines:
```
$ head -n 10 corpus_food.txt
-mention meilleur espoir féminin : on aurait pu ajouter ioudgine .
malheureusement , comme presque tout ce qui est bon , c' est bourré de beurre et de sucre .
j' avais déjà façonné une recette allégée pour weight watchers mais elle contenait encore du beurre et un peu de sucre .
aujourd' hui je vous propose cette recette que j' ai improvisée hier soir , sans beurre et sans sucre .
n' empêche que pour acheter sa propre baguette magique ou pour déguster des bières au beurre , on pourrait partir au bout du monde !
menthe , sucre de canne , rhum , citron vert , sont vos meilleurs amis en soirée ?
parfois , on rêve d' un bon verre de vin .
la marque de biscuits oreo a pensé aux gourmandes et aux gourmands , et s' apprête à lancer des gâteaux dotés de nouvelles saveurs : caramel beurre salé et noix de coco .
rangez les parapluies , et sortez le sel et le citron !
le vin on adore le savourer avec modération .
```

The first sentence looks odd, but the following ones are the beginning of: http://www.leblogdelaura.com/2017/03/pancakes-sans-sucre-et-sans-graisses.html
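The same inspection can also be done from Python, without decompressing the file. This is only a minimal sketch based on the standard library; the helper names `count_lines` and `head` are ours and do not come from any library.

```python
import gzip
from itertools import islice


def count_lines(path):
    """Count the lines of a gzipped text file without loading it into memory."""
    with gzip.open(path, 'rt', encoding='utf-8') as corpus_file:
        return sum(1 for _ in corpus_file)


def head(path, n=10):
    """Return the first n lines of a gzipped text file (newlines stripped)."""
    with gzip.open(path, 'rt', encoding='utf-8') as corpus_file:
        return [line.rstrip('\n') for line in islice(corpus_file, n)]


# Usage, assuming corpus_food.txt.gz is in the current directory:
# print(count_lines('corpus_food.txt.gz'))
# for line in head('corpus_food.txt.gz'):
#     print(line)
```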

## 2. Build word embeddings

We will use *gensim* in order to induce word embeddings from text.
*gensim* is a vector space modeling and topic modeling toolkit for Python, and contains an efficient implementation of the *word2vec* algorithms.

*word2vec* consists of two different algorithms: *skipgram* (sg) and *continuous-bag-of-words* (cbow). 
The underlying prediction task of the former is to estimate the context words from the target word; the prediction task of the latter is to estimate the target word from the sum of the context words. 
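To make the difference between the two prediction tasks concrete, here is a small sketch that enumerates the training examples each algorithm would see for one sentence. It only illustrates how examples are built: the function `training_examples` and the window size of 2 are our own choices, and details of the real implementation (subsampling, negative sampling, etc.) are left out.

```python
def training_examples(tokens, window=2):
    """Yield (target, context_words) for each position of a tokenized sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


sentence = "sans beurre et sans sucre".split()

# skipgram: one prediction per (target -> single context word) pair
skipgram_pairs = [(target, word)
                  for target, context in training_examples(sentence)
                  for word in context]

# cbow: one prediction per position (all context words -> target)
cbow_examples = [(context, target)
                 for target, context in training_examples(sentence)]

print(skipgram_pairs[:3])
print(cbow_examples[0])
```

Note that skipgram produces several examples per position (one per context word), whereas cbow produces a single one.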

### 2.1 Train a model
▶▶**Run the following code: it will build word embeddings based on the food corpus using the Word2Vec algorithm.**
The model will be saved on your disk.

In [None]:
# If you get an error related to smart_open, try installing version 2.0.0:
# conda install smart_open==2.0.0 (or: pip install smart_open==2.0.0)


# construct word2vec model using gensim

from gensim.models import Word2Vec 

import gzip
import logging

import time

# set up logging for gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# we define a PlainTextCorpus class; this will provide us with an
# iterator over the corpus (so that we don't have to load the corpus
# into memory)
class PlainTextCorpus(object):
    def __init__(self, fileName):
        self.fileName = fileName

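    def __iter__(self):
        # read the gzipped corpus line by line and yield each sentence as
        # a list of tokens; the file is closed once the iteration is over
        with gzip.open(self.fileName, 'rt', encoding='utf-8') as corpus_file:
            for line in corpus_file:
                yield line.split()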

# -- Instantiate the corpus class using corpus location
sentences = PlainTextCorpus('corpus_food.txt.gz')

# -- Training
# we only take into account words with a frequency of at least 50, and
# we iterate over the corpus only once
# (gensim 3.x API: in gensim 4, the `iter` parameter is called `epochs`)
model = Word2Vec(sentences, min_count=50, iter=1, sorted_vocab=1)

# -- Finally, save the constructed model to disk
# When getting started, you can save the learned vectors in ASCII format and review the contents.
model.wv.save_word2vec_format('model_word2vec_food.txt', binary=False)
# the full model is saved in binary format
model.save('model_word2vec_food.bin')
# a saved model can be loaded again using:
#model = Word2Vec.load('model_word2vec_food.bin')

### 2.2 A few remarks:

From: http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/

#### Downsampling

Subsampling frequent words to decrease the number of training examples.

There are two “problems” with common words like “the”:
* When looking at word pairs, (“fox”, “the”) doesn’t tell us much about the meaning of “fox”. “the” appears in the context of pretty much every word.
* We will have many more samples of (“the”, …) than we need to learn a good vector for “the”.

Word2Vec implements a “subsampling” scheme to address this. 
For each word we encounter in our training text, there is a chance that we will effectively delete it from the text. 
The probability that we cut the word is related to the word’s frequency.

If we have a window size of 10, and we remove a specific instance of “the” from our text:

* As we train on the remaining words, “the” will not appear in any of their context windows.
* We’ll have 10 fewer training samples where “the” is the input word.

There is also a parameter in the code named ‘sample’ which controls how much subsampling occurs, and the default value is 0.001. Smaller values of ‘sample’ mean words are less likely to be kept.
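As an illustration, the tutorial cited above gives the probability of keeping a word as a function of the fraction of the corpus it represents. The sketch below implements that formula; the function name `keep_probability` is ours, and the actual gensim implementation may differ in its details.

```python
import math


def keep_probability(fraction, sample=0.001):
    """Probability of keeping one occurrence of a word.

    fraction: share of all corpus tokens that are this word (between 0 and 1)
    sample:   the subsampling parameter (0.001 by default)
    """
    if fraction <= 0:
        return 1.0
    probability = (math.sqrt(fraction / sample) + 1) * (sample / fraction)
    return min(1.0, probability)


# rare words are always kept, very frequent words are often dropped
for fraction in (0.0001, 0.01, 0.1):
    print(fraction, round(keep_probability(fraction), 3))
```

With this formula, lowering `sample` lowers the probability of keeping a frequent word, which matches the remark above.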


#### Negative sampling (for SkipGram)

Training a neural network means taking a training example and adjusting all of the neuron weights slightly so that it predicts that training sample more accurately. 
In other words, each training sample will tweak all of the weights in the neural network, which is prohibitively expensive with a large vocabulary.

Negative sampling addresses this by having each training sample only modify a small percentage of the weights, rather than all of them.

When training the network on the word pair (“fox”, “quick”), i.e. 'fox' is the target and 'quick' a context word, the expected output is 1 for “quick” and 0 for all of the other thousands of output neurons.

With negative sampling, we instead randomly select just a small number of “negative” words (let’s say 5) to update the weights for: the expected output is 1 for “quick” and 0 for the 5 other random words.

Recall that the output layer of our model has a weight matrix of size d x |V|, e.g. 100 x 23,000. So we will just be updating the weights for our positive word (“quick”), plus the weights for 5 other words that we want to output 0. That’s a total of 6 output neurons, and 600 weight values total. That’s only about 0.03% of the 2.3M weights in the output layer! (In the hidden layer, only the weights for the input word are updated -- this is true whether you’re using Negative Sampling or not).
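The number of output weights touched by one training sample only depends on the embedding size, the vocabulary size and the number of negative words. A small sketch to redo the arithmetic for any setting (the function name `updated_output_weights` is ours):

```python
def updated_output_weights(embedding_size, vocab_size, negatives=5):
    """Return (updated weights, total output weights, share updated)."""
    # one positive word + `negatives` negative words, each with one weight per dimension
    updated = (1 + negatives) * embedding_size
    total = embedding_size * vocab_size
    return updated, total, updated / total


updated, total, share = updated_output_weights(100, 23000)
print(f"{updated} of {total} output weights updated ({share:.3%})")
```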

### 2.3 Print information about the model learned
Note that the corpus is food-related, so food-related terms will work best. 

You can print the vocabulary using:
```
vocabulary = list(model.wv.vocab)
```

It is possible to look at the individual word embeddings using the following:
```
model.wv['citron']
```
```
print(model.wv['citron'])
```

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

▶▶ **Print the vocabulary and then the vectors for a few terms, e.g. 'citron' and 'fruit'. Do they seem close?**

## 3. Compute word similarity 

You can now compute a similarity measure between two words using:
```
model.wv.similarity('manger', 'goûter')
```

You can also print the most similar words (measured by the cosine similarity between the word vectors) with the following command:
```
model.wv.most_similar('citron')
```
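As a reminder, the cosine similarity between two vectors is their dot product divided by the product of their norms. The sketch below computes it by hand on toy 3-dimensional vectors that we invented for the illustration (they are not taken from the trained model).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# toy vectors, invented for the illustration
citron = [0.9, 0.1, 0.3]
orange = [0.8, 0.2, 0.4]
marteau = [-0.2, 0.9, -0.5]

print(cosine_similarity(citron, orange))   # vectors pointing in similar directions
print(cosine_similarity(citron, marteau))  # vectors pointing in different directions
```

Applied to two vectors of the trained model, this function should give a value close to the one returned by the similarity method above.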
▶▶**Print the similarity between some terms, e.g. ('manger', 'boire'), ('manger', 'dormir') ... Do the results seem coherent?**

▶▶**Print the words that are most similar to: 'citron', 'manger' and other words, e.g. not related to food. Do the results seem coherent?**

## 4. Exercise: change the parameter values

By default, the *word2vec* module creates **word embeddings of size 100**, using a **cbow model** with a **window of 5 words**.

▶▶**Train a model with different parameters:**
- using a different window size,
- using a different embedding size,
- using *skipgram*.

Inspect the results (similar words) qualitatively. Do the similarity computations change? Are they better or worse?

See doc: https://radimrehurek.com/gensim_3.8.3/models/word2vec.html
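One possible way to organise these experiments is to enumerate the configurations first and then train one model per configuration. The sketch below only builds the list of configurations. The values are arbitrary examples, and the parameter names `window`, `size` and `sg` are those of the gensim 3.x API referenced above.

```python
from itertools import product

windows = [2, 5, 10]     # context window sizes to try
sizes = [50, 100, 300]   # embedding sizes to try
algorithms = [0, 1]      # sg=0: cbow, sg=1: skipgram

configurations = [
    {'window': window, 'size': size, 'sg': sg}
    for window, size, sg in product(windows, sizes, algorithms)
]

for params in configurations:
    name = 'w{window}_d{size}_sg{sg}'.format(**params)
    print(name, params)
    # With the `sentences` corpus defined earlier, one could then train with:
    # model = Word2Vec(sentences, min_count=50, iter=1, **params)
```

Training every combination takes time, so in practice it is reasonable to change one parameter at a time and compare with the default model.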

## 5. Analogical reasoning

As we saw in class, word embeddings allow us to do analogical reasoning using vector addition and subtraction. *gensim* offers the possibility to do so. 

▶▶ **Try to perform analogical reasoning in the food realm, e.g. fourchette - légume + soupe = ?**

Hint: the function *most_similar()* takes the arguments positive and negative. 

▶▶ **Try the same using the function most_similar_cosmul()** (which performs a similar computation but uses multiplication and division instead), and see what works best.

See: https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html

https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar_cosmul.html
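To see what the additive version of analogical reasoning computes, here is a toy sketch: the positive vectors are added, the negative ones subtracted, and the remaining words are ranked by cosine similarity to the result. The vectors are invented for the illustration (their dimensions loosely stand for "utensil", "solid food" and "liquid food"), and the function `analogy` is our own simplified version, not the gensim implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    unit = {word: normalize(v) for word, v in vectors.items()}
    dim = len(next(iter(unit.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, unit[word])]
    query = normalize(query)
    # the words of the query itself are excluded from the candidates
    scores = [(word, sum(q * x for q, x in zip(query, v)))
              for word, v in unit.items()
              if word not in positive and word not in negative]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# toy vectors, invented for the illustration
toy = {
    'fourchette': [1.0, 1.0, 0.0],
    'légume': [0.0, 1.0, 0.0],
    'soupe': [0.0, 0.0, 1.0],
    'cuillère': [1.0, 0.0, 1.0],
    'couteau': [1.0, 1.0, 0.1],
    'pain': [0.0, 1.0, 0.2],
}

print(analogy(toy, positive=['fourchette', 'soupe'], negative=['légume']))
```

With real embeddings the result is of course much noisier than with these hand-made vectors.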

## 6. Visualize word embeddings

Once you have learned word embeddings from your text data, it can be useful to explore them with visualization.

You can use classical projection methods to reduce the high-dimensional word vectors to two-dimensional plots and plot them on a graph.

The visualizations can provide a qualitative diagnostic for your learned model.

We can retrieve all of the vectors from a trained model as follows:
```
X = model.wv[list(model.wv.vocab)]
```

We can then train a projection method on the vectors, such as those methods offered in scikit-learn, then use matplotlib to plot the projection as a scatter plot.

Let’s look at an example with Principal Component Analysis or PCA.

In [None]:
# keep only the vectors of the first 1000 words of the vocabulary
X = model.wv[list(model.wv.vocab)[:1000]]
print(X.shape)

### 6.1 Using PCA

We can create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class as follows.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
result = pca.fit_transform(X)

In [None]:
from matplotlib import pyplot

pyplot.scatter(result[:, 0], result[:, 1])

We can go one step further and annotate the points on the graph with the words themselves. A crude version without any nice offsets looks as follows.

In [None]:
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)[:1000]
for i, word in enumerate(words):
	pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
#pyplot.show()
pyplot.savefig('plot_w2v.png')

In [None]:
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)[:100]
for i, word in enumerate(words):
	pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
#pyplot.show()
pyplot.savefig('plot_w2v.png')

### 6.2 Using TensorFlow projector

As we saw during the course, TensorFlow provides a tool to visualize word embeddings. We need to provide:    
* A TSV file with the vectors
* Another TSV file with the words

The following code writes these two files from the model. 

It is adapted from the source code of the script: https://radimrehurek.com/gensim/scripts/word2vec2tensor.html 

▶▶ **Run the following code and then load the files in the TensorFlow projector. Look e.g. for 'citron', 'manger', 'pain'..., and check their neighbors (with PCA and/or T-SNE).**

https://projector.tensorflow.org/

In [None]:
# if needed, reload the model saved earlier:
#model = Word2Vec.load('model_word2vec_food.bin')
tensorsfp = "model_word2vec_food_tensor.tsv"
metadatafp = "metadata_word2vec_food_tensor.tsv"
# write one word per line in the metadata file, and the corresponding
# tab-separated vector on the same line of the tensors file
with open(tensorsfp, 'w', encoding='utf-8') as tensors:
    with open(metadatafp, 'w', encoding='utf-8') as metadata:
        for word in model.wv.index2word:
            metadata.write(word + '\n')
            vector_row = '\t'.join(map(str, model.wv[word]))
            tensors.write(vector_row + '\n')

## 7. Embeddings with multiword ngrams

There is a *gensim.models.phrases* module which lets you automatically detect phrases longer than one word, using collocation statistics. Using phrases, you can learn a word2vec model where “words” are actually multiword expressions, such as new_york_times or financial_crisis:



In [None]:
from gensim.models import Phrases

# Train a bigram detector.
bigram_transformer = Phrases(sentences)

# Apply the trained MWE detector to a corpus, using the result to train a Word2vec model.
model = Word2Vec(bigram_transformer[sentences], min_count=10, iter=1)

In [None]:
print(list(model.wv.vocab))
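To give an idea of the collocation statistics involved, the sketch below scores each bigram of a toy corpus with a formula in the spirit of Mikolov et al. (2013): frequent pairs made of otherwise rare words get a high score, and pairs above a threshold would be merged into a single token. This is a simplified illustration under our own assumptions (function name `score_bigrams`, toy sentences, `min_count=1`); the exact computation done by *gensim* may differ.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score every bigram: (count(a b) - min_count) / (count(a) * count(b)) * |V|."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (count - min_count) / (unigrams[pair[0]] * unigrams[pair[1]]) * vocab_size
        for pair, count in bigrams.items()
    }


# toy corpus, invented for the illustration
toy_sentences = [
    "le beurre salé est bon".split(),
    "du caramel au beurre salé".split(),
    "le caramel est très bon".split(),
]

scores = score_bigrams(toy_sentences)
best = max(scores, key=scores.get)
print(best, scores[best])
```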

If you’re finished training a model (i.e. no more updates, only querying), you can switch to the KeyedVectors instance:



In [None]:
word_vectors = model.wv
del model

## 8. Other algorithms

We will not have time to cover them in this session, but note that you can also build embeddings using the FastText algorithm with Gensim. Doc2vec is also available.

https://radimrehurek.com/gensim/apiref.html

