[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/tut5_embeddings_student.ipynb)

# Tutorial 5: Word embeddings and Word2Vec
This tutorial covers word embeddings in general and one of the best-known embedding models, Word2Vec (W2V). We revisit the IMDB text dataset from the last tutorial, but this time we preprocess the data with `Keras` using the `TextVectorization` layer, which handles standardization, tokenization, and indexing. Next, we build a simple binary classification model with an embedding layer to see how embeddings work in practice. Finally, we apply the W2V model to the same data using the [Gensim library](https://pypi.org/project/gensim/).

Several libraries make it easy to use W2V directly. The [Gensim library](https://pypi.org/project/gensim/) is one of them; it offers a friendly interface for training embeddings, as you will see in this tutorial.

However, if you would like to start from scratch and code W2V yourself using just `NumPy`, we recommend [Nathan Rooy's post](https://nathanrooy.github.io/posts/2018-03-22/word2vec-from-scratch-with-python-and-numpy/). Or, if you would like to do it with `TensorFlow`, there is an excellent tutorial [here](https://www.tensorflow.org/tutorials/text/word2vec) from the TensorFlow website.

The outline is as follows:
1. Preparing the IMDB dataset with `Keras`
2. Understanding embeddings with a simple binary classification model
3. Word2Vec using `Gensim`

For further examples, please visit the demo [word-2-vec.ipynb](https://github.com/Humboldt-WI/adams/blob/master/demos/nlp/word-2-vec.ipynb).

## 1. Preparing the IMDB dataset
Setting things up

In [39]:
# Import the required libraries
import io
import re
import string
import pandas as pd
import gensim
from gensim.models import Word2Vec  
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras import layers
# In TensorFlow >= 2.6 the layer lives directly under tensorflow.keras.layers
from tensorflow.keras.layers import TextVectorization

### Exercise 1
Load the `IMDB-50K-Movie-Review.zip` file, and map the labels to 1 (positive) and 0 (negative). Then, have a look at the first rows.

In [None]:
# load the data (be sure to provide the correct file path)


### Exercise 2
Split the data into training and validation sets. Use 80% of the data for training. You can use the `train_test_split()` function from `sklearn`. In addition, transform the sets into `NumPy` arrays using `to_numpy()`.

In [41]:
# Split the data

# transform them to numpy


So far, we have just created training and validation sets of text and labels. However, as we saw in the last tutorial, we cannot feed a neural network with this text format. We need numeric tensors. 

The transformation of text to numeric tensors is known as *vectorization*. This process can be split into three steps:
1. **Standardization** of the text, such as removing punctuation, converting all the text to lowercase, etc.
2. **Tokenization** of the standardized text, where we separate the text into units or *tokens*, usually words or n-grams.
3. **Indexing** of the tokens into a numerical vector.

These 3 steps are implemented in the Keras `TextVectorization` layer.

```python
TextVectorization(
    standardize = our_standardization,
    max_tokens = vocab_size,
    output_sequence_length = seq_length
    )
```
Here, `our_standardization` is our custom function to standardize the text. For example, we saw that some reviews in the dataset contain HTML tags such as `<br />`, and we would like to delete them.
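To make the three steps concrete, here is a minimal pure-Python sketch of standardization, tokenization, and indexing on a toy corpus. It is not how `TextVectorization` is implemented; the helper names (`standardize`, `build_vocab`, `vectorize`) and the toy sentences are made up for illustration. The only convention borrowed from Keras is that index 0 is reserved for padding and index 1 for unknown words.

```python
import re
import string
from collections import Counter


def standardize(text):
    """Step 1: lowercase, drop <br /> tags and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    return re.sub(r"\s+", " ", text).strip()


def build_vocab(texts, max_tokens):
    """Count the tokens of all texts and keep the most frequent ones."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Two slots are reserved: "" (padding, index 0) and "[UNK]" (unknown, index 1)
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocab, seq_length):
    """Steps 2 and 3: split into tokens, map them to indices, truncate or pad."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:seq_length]
    return ids + [0] * (seq_length - len(ids))


toy_corpus = ["A great movie.<br />Great cast!", "A dull movie"]
toy_vocab = build_vocab(toy_corpus, max_tokens=6)
toy_ids = vectorize("A great, GREAT film", toy_vocab, seq_length=6)
print(toy_vocab)
print(toy_ids)
```

The word "film" never appears in the toy corpus, so it is mapped to the unknown index 1, and the four-token sentence is padded with zeros up to the sequence length of 6.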

### Exercise 3
So, let's first build our own standardization function called `our_standardization`. The function should convert the text to lowercase (`tf.strings.lower`), remove HTML tags (`tf.strings.regex_replace`), delete punctuation (`re.escape(string.punctuation)`), and collapse double spaces.

In [1]:
# def our_standardization(text_data):


### Exercise 4
Apply the `our_standardization` function to the following text and see how it works

`"Bruce Dern also is in the mix and Dern never fails to fascinate in about any film.<br /><br />The movie could be considered kind of downer to the average viewer, but I found it fascinating....and I don't like depressing movies normally. What I found was a kind of quirky crime film."`

In [2]:
# An example of the our_standardization function


### Exercise 5
Great! So let's now vectorize our data (use `TextVectorization`) with a vocabulary of the 10,000 most frequent words and a maximum sequence length of 100 tokens. Call this layer `vectorize_layer`.

In [3]:
# Define the size of the vocabulary and the max number of words in a sequence
vocab_size = 10000
seq_length = 100

# Create a vectorization layer


### Exercise 6
Index the vocabulary. To do it, you need to call the `adapt()` method from the `vectorize_layer` and apply it to `X_train`. Then, retrieve the computed vocabulary using `get_vocabulary()` and save it into `vocab`. Finally, print the first 10 words of the vocabulary (`print(vocab[:10])`).

In [4]:
# To create the vocabulary, we need to call adapt. The input is only the text

# Check the first 10 words of the vocabulary. It is sorted by frequency 


### Exercise 7
Apply the `vectorize_layer` to the same example as before:

`"Bruce Dern also is in the mix and Dern never fails to fascinate in about any film.<br /><br />The movie could be considered kind of downer to the average viewer, but I found it fascinating....and I don't like depressing movies normally. What I found was a kind of quirky crime film."`


In [5]:
# Check the vectorization layer


## 2. Understanding word embeddings
We have not introduced any embedding layer until now; instead, we created a vectorization layer that transforms text inputs into numeric tensors. For example, the text "tomorrow is Saturday" will be transformed into something like `[23, 45, 5, 0, 0, 0]` (if the sequence length is 6). An embedding layer transforms each word index, which can be thought of as a one-hot vector, into a dense vector of much lower dimension.

Let's see an example of the `Embedding` layer in Keras, where we hypothetically have only 100 words in the vocabulary, and we want to transform this space into a 5-dimensional one.

In [47]:
# Create the embedding layer of shape (100,5)
embedding_layer = layers.Embedding(100, 5)
# Feed a sequence of word indices
result1 = embedding_layer(tf.constant([23, 45, 5, 0, 0, 0]))
# We can also feed batches    
result2 = embedding_layer(tf.constant([[23, 45, 5, 0, 0, 0], [3, 4, 55, 4, 0, 0]]))
print("result1:",result1.shape,"\nresult2:",result2.shape)

result1: (6, 5) 
result2: (2, 6, 5)


Each word index has been transformed into a 5-dimensional vector (the values are random because the layer has not been trained yet). For example, the values of `result1` are

In [48]:
result1

<tf.Tensor: shape=(6, 5), dtype=float32, numpy=
array([[-0.01591078,  0.03600693, -0.04597142, -0.01586562,  0.02273249],
       [ 0.03328559, -0.01752152, -0.00554677, -0.00369354,  0.02254558],
       [ 0.0063172 ,  0.02094451,  0.01136215,  0.02023139,  0.0041159 ],
       [-0.00736328,  0.01393194,  0.0357581 , -0.02324574, -0.0478739 ],
       [-0.00736328,  0.01393194,  0.0357581 , -0.02324574, -0.0478739 ],
       [-0.00736328,  0.01393194,  0.0357581 , -0.02324574, -0.0478739 ]],
      dtype=float32)>
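Note that the last three rows above are identical, because all three positions hold the padding index 0. This is what we expect if an embedding layer is simply a lookup table. The following pure-Python sketch (not Keras code) illustrates the idea: picking row `i` of a matrix gives the same result as multiplying a one-hot vector for index `i` by that matrix. The matrix values and the initialization range are assumptions made for the example.

```python
import math
import random

random.seed(0)
vocab_size, embedding_dim = 100, 5

# Toy embedding matrix: one row of small random values per word index
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(indices):
    """Look up one row of the matrix per word index."""
    return [embedding_matrix[i] for i in indices]


def embed_one_hot(indices):
    """Same result, computed as one-hot vector times embedding matrix."""
    rows = []
    for i in indices:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab_size)]
        rows.append([
            sum(one_hot[j] * embedding_matrix[j][d] for j in range(vocab_size))
            for d in range(embedding_dim)
        ])
    return rows


sequence = [23, 45, 5, 0, 0, 0]
looked_up = embed(sequence)
multiplied = embed_one_hot(sequence)

print(len(looked_up), len(looked_up[0]))   # sequence length, embedding dimension
print(looked_up[3] == looked_up[5])        # padding positions share one vector
```

The lookup is much cheaper than the matrix product, which is why embedding layers take integer indices as input rather than one-hot vectors.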

### Exercise 8
Build a simple model to infer the sentiment. Use a `Sequential` model in which the first layer transforms the text into integer tensors (`vectorize_layer`), the second layer embeds the vocabulary into a 16-dimensional space (`layers.Embedding`), the third layer uses `layers.GlobalAveragePooling1D()` to reduce each text to a single average vector in the embedding space, and a final `Dense` layer with a sigmoid activation infers the sentiment.
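Before building the model, it may help to see what the last two layers compute. The sketch below is a pure-Python illustration with made-up numbers, not the Keras implementation: it averages the token vectors of one review into a single vector and passes it through one sigmoid unit. The weights, the bias, and the 3-dimensional embeddings are hypothetical.

```python
import math


def global_average_pooling(token_vectors):
    """Average the token vectors of one text, dimension by dimension."""
    dim = len(token_vectors[0])
    n_tokens = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]


def dense_sigmoid(features, weights, bias):
    """One dense unit with a sigmoid activation: returns a value in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# A toy review of three tokens, each already embedded in 3 dimensions
embedded_review = [[0.2, -0.1, 0.4], [0.6, 0.3, -0.2], [0.1, 0.1, 0.1]]

pooled = global_average_pooling(embedded_review)
probability = dense_sigmoid(pooled, weights=[1.0, -2.0, 0.5], bias=0.1)

print(pooled)       # one vector for the whole review
print(probability)  # interpreted as the probability of a positive review
```

Because of the averaging, word order is ignored: the model only sees which words occur in a review, not where they occur.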

In [6]:
# Create a simple model to use word embeddings
embedding_dim = 16



### Exercise 9 
Compile the model with the `rmsprop` optimizer, the adequate loss function and monitor the `accuracy`.

In [7]:
# let's compile it


### Exercise 10
Fit the model using 10 `epochs`. Remember to specify the validation dataset in `validation_data`. How accurate is the model on the validation set? How many parameters does it have?

In [8]:
# ~ 2 minutes


In [9]:
# Check the number of trainable parameters



### Exercise 11
We would like to visualize the trained embeddings. First, get the weights of the embedding layer with the `get_weights()` method and save them in `embeddings`. Then run the following code to save the embeddings and labels in the format expected by [https://projector.tensorflow.org/](https://projector.tensorflow.org/).

In [10]:
# Get the embeddings: a matrix of shape (vocab_size, embedding_dim)


In [54]:
# Save the embeddings and their words as TSV files for https://projector.tensorflow.org/
with io.open('simple_mod_tensor.tsv', 'w', encoding='utf-8') as embeddings_doc, \
     io.open('simple_mod_metadata.tsv', 'w', encoding='utf-8') as words_doc:
    for i, word in enumerate(vocab):
        if i == 0:
            continue  # skip the padding token
        embedding = embeddings[i]
        embeddings_doc.write('\t'.join(str(x) for x in embedding) + "\n")
        words_doc.write(word + "\n")

Go and check how these embeddings look in 2D or 3D.

## 3. Word2Vec with the Gensim library
Let's now apply W2V to the same text data. Recall that W2V proposes two models for learning word vectors: continuous bag-of-words (CBOW) and skip-gram (SG). In a nutshell, CBOW predicts a central <font color='yellow'>target</font> word from the surrounding <font color='green'>context</font> words, while SG takes the opposite approach: given a <font color='yellow'>target</font> word, it predicts the <font color='green'>context</font> words. For example, consider a window size of 2 applied to the following phrase:

> I finally <font color='green'>found a</font><font color='yellow'> machine</font><font color='green'> at the </font>  gym that I like: the vending machine!

So in CBOW, the problem is 

[I finally <font color='green'>found a</font><font color='yellow'> ?</font><font color='green'> at the </font> gym]

And in SG,

[I finally <font color='green'>? ?</font><font color='yellow'> machine</font><font color='green'> ? ? </font> gym]

In this tutorial, we apply SG (argument `sg=1` in `gensim`). But you are welcome to compare results for CBOW (`sg=0`).
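The difference between the two models can also be seen in the training examples they are built from. The following pure-Python sketch generates these examples for the phrase above with a window size of 2. The function names are made up for illustration, and Gensim's actual training procedure differs in its details; the sketch only shows how targets and contexts are paired.

```python
def skipgram_pairs(tokens, window):
    """SG: one (target, context) pair per context word in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """CBOW: one (context words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, stop) if j != i]
        examples.append((context, target))
    return examples


tokens = "i finally found a machine at the gym".split()

sg_for_machine = [p for p in skipgram_pairs(tokens, window=2) if p[0] == "machine"]
cbow_for_machine = cbow_examples(tokens, window=2)[tokens.index("machine")]

print(sg_for_machine)
# [('machine', 'found'), ('machine', 'a'), ('machine', 'at'), ('machine', 'the')]
print(cbow_for_machine)
# (['found', 'a', 'at', 'the'], 'machine')
```

SG turns one position in the text into several training pairs, whereas CBOW turns it into a single example with several inputs.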

### Exercise 12
Create `X_train_vec` by applying the vectorization layer you have already created to `X_train`. In this way, we will be using the same vectorization procedure as before (same vocabulary, length of the sequence, etc.)

In [11]:
# Apply vectorize_layer to X_train



### Exercise 13
Gensim expects each text as a list of words, so convert `X_train_vec` into a list of word lists. Call this object `X_train_words`.
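The idea behind this conversion is a reverse lookup in the vocabulary: every integer index is replaced by the word stored at that position. Here is a minimal sketch on toy data, assuming a hypothetical six-word vocabulary in which index 0 is the empty padding token and index 1 is the unknown token. The real `X_train_vec` is a tensor, so it may need to be converted to a NumPy array or a list first.

```python
# Hypothetical vocabulary: position 0 is padding, position 1 is the unknown token
toy_vocab = ["", "[UNK]", "the", "movie", "great", "was"]

# Two toy reviews, already vectorized and padded to length 6
toy_vec = [
    [2, 3, 5, 4, 0, 0],
    [2, 1, 3, 0, 0, 0],
]

# Replace every index with its word; padding stays in as the empty string
toy_words = [[toy_vocab[i] for i in row] for row in toy_vec]
print(toy_words[0])
# ['the', 'movie', 'was', 'great', '', '']

# Variant that drops the padding so that each text keeps only real tokens
toy_words_no_pad = [[toy_vocab[i] for i in row if i != 0] for row in toy_vec]
print(toy_words_no_pad[1])
# ['the', '[UNK]', 'movie']
```

Whether to keep the padding token is a design choice: keeping it reproduces the vectorized input exactly, while dropping it avoids learning an embedding for a token that carries no meaning.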

In [12]:
# ~ 6 min

### Exercise 14
Train a W2V model using the `Word2Vec` class with `X_train_words` as input (call it `w2v_model`). Use a `min_count` of 1, a `window` of 5, 50 `epochs`, a `vector_size` of 100 for the embeddings, and SG (`sg=1`).

In [14]:
# Train a Word2Vec model ~ 5 min

# summarize the loaded model
# print(w2v_model)


### Exercise 15
Check how similar *great* is to *good*, *great* to *horrible*, and so on. Use the `w2v_model.wv.similarity()` method.

### Exercise 16 
Get the 5 words most similar to *great*. Use the `w2v_model.wv.most_similar` method with `topn=5`.
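Both exercises rely on cosine similarity, the cosine of the angle between two word vectors, which ranges from -1 (opposite direction) to 1 (same direction). The sketch below computes it in pure Python on hand-made 3-dimensional vectors and ranks the neighbours of a word. The vectors and the helper functions are hypothetical; the embeddings you trained will produce different numbers.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=5):
    """Rank all other words by their cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Hand-made toy vectors, for illustration only
toy_vectors = {
    "great": [0.9, 0.8, 0.1],
    "good": [0.8, 0.9, 0.2],
    "horrible": [-0.7, -0.9, 0.3],
    "boring": [-0.6, -0.5, 0.4],
}

print(cosine_similarity(toy_vectors["great"], toy_vectors["good"]))
print(cosine_similarity(toy_vectors["great"], toy_vectors["horrible"]))
print(most_similar("great", toy_vectors, topn=2))
```

In these toy vectors, *great* and *good* point in almost the same direction and get a similarity close to 1, while *great* and *horrible* point in roughly opposite directions and get a negative value.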

Now run the following code to save the embeddings and labels so we can visualize them with https://projector.tensorflow.org/

In [60]:
# Let's save this matrix to visualize it in https://projector.tensorflow.org/
w2v_model.wv.save_word2vec_format("word2vec.model")
model = gensim.models.KeyedVectors.load_word2vec_format("word2vec.model", binary=False)
embeddings_doc = io.open('w2v_mod_tensor.tsv', 'w', encoding='utf-8')
words_doc = io.open('w2v_mod_metadata.tsv', 'w', encoding='utf-8')

for word in model.index_to_key:
    vector_row = '\t'.join(str(x) for x in model[word])
    word = "PAD" if word == "" else word  # give the empty padding token a visible label
    words_doc.write(word + "\n")
    embeddings_doc.write(vector_row + "\n")

embeddings_doc.close()
words_doc.close()