# Understand embeddings with Word2Vec

### Exercise objectives:
- Convert 🔠 words to 🔢 vector representations thanks to embeddings
- Discover the powerful Word2Vec algorithm

<hr>

_Embeddings_ are vector representations of words. They can be learned inside a Neural Network, together with the rest of the model, but this can slow down convergence. Another option is to learn them in a separate first step, then feed the resulting word representations directly into a Recurrent Neural Network.

▶️ Run this cell and make sure the version of 📚 [Gensim - Word2Vec](https://radimrehurek.com/gensim/auto_examples/index.html) you are using is ≥ 4.0!

In [1]:
!pip freeze | grep gensim

gensim==4.2.0


In [2]:
!pip freeze | grep tensorflow

tensorflow==2.10.0
tensorflow-datasets==4.6.0
tensorflow-estimator==2.10.0
tensorflow-io-gcs-filesystem==0.27.0
tensorflow-metadata==1.10.0


# The data

Keras provides many datasets, among which is the IMDB dataset 🎬:
- It is comprised of sentences that are ***movie reviews***. 
- Each of these reviews is related to a score given by the reviewer.

❓ **Question** ❓ First of all, let's load the data. You don't have to understand what is going on in the function; it does not matter here.

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, because your RAM can overflow. For that reason, **you should start with 10% of the sentences** and see whether your computer can handle it. If it cannot, rerun with a lower percentage.

⚠️ **DISCLAIMER** ⚠️ **No need to play _who has the biggest_ (RAM) !** The idea is to get to run your models quickly to prototype. Even in real life, it is recommended that you start with a subset of your data to loop and debug quickly. So increase the number only if you are into getting the best accuracy. 

In [3]:
###########################################
### Just run this cell to load the data ###
###########################################

import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.text import text_to_word_sequence

def load_data(percentage_of_sentences=None):
    train_data, test_data = tfds.load(name="imdb_reviews", split=["train", "test"], batch_size=-1, as_supervised=True)

    train_sentences, y_train = tfds.as_numpy(train_data)
    test_sentences, y_test = tfds.as_numpy(test_data)
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert(percentage_of_sentences> 0 and percentage_of_sentences<=100)
        
        len_train = int(percentage_of_sentences/100*len(train_sentences))
        train_sentences, y_train = train_sentences[:len_train], y_train[:len_train]
  
        len_test = int(percentage_of_sentences/100*len(test_sentences))
        test_sentences, y_test = test_sentences[:len_test], y_test[:len_test]
    
    X_train = [text_to_word_sequence(_.decode("utf-8")) for _ in train_sentences]
    X_test = [text_to_word_sequence(_.decode("utf-8")) for _ in test_sentences]
    
    return X_train, y_train, X_test, y_test

X_train, y_train, X_test, y_test = load_data(percentage_of_sentences=10)

2023-12-22 15:39:22.071815: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to ~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to ~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.




<b><u>Embeddings in the previous challenge</u></b>:

In the previous exercise, we jointly learned a representation for the words and fed this representation to an RNN, as shown below 👇:

<img src="layers_embedding.png" width="400px" />

However, this increases the number of parameters to learn, which slows down training and makes convergence harder!

<b><u>Embeddings in the current challenge</u></b>:

For this reason, we will separate the two steps: first learn the word representation, then feed it into an RNN, as shown here:

<img src="word2vec_representation.png" width="400px" />

We will learn the embedding with Word2Vec.

The drawback is that the learned embeddings are not _specifically_ designed for our task. However, learning them independently of the task at hand (sentiment analysis) has some advantages:
- it is very fast to do in general (with Word2Vec)
- the representation learned by Word2Vec is still meaningful 
- the convergence of the RNN alone will be easier and faster

So let's learn an embedding with Word2Vec and see how meaningful it is!

# Embedding with Word2Vec

Let's use Word2Vec to embed the words of our sentences. Word2Vec converts each word into a fixed-size vector representation.

For instance, we will have:
- 🐶 _dog_ $\rightarrow$ [0.1, -0.3, 0.8]
- 🐱 _cat_ $\rightarrow$ [-1.1, 2.3, 0.7]
- 🍏 _apple_ $\rightarrow$ [3.1, 0.9, -4.7]

Here, your embedding space is of size 3.

***What is a "good" numerical representation of words?***

- ***Words with close meanings should be geometrically close in your embedding space!***

    - Look at the following example, which represents a two-dimensional embedding space.

![Embedding](word_embedding.png)
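"Geometrically close" is usually measured with the cosine similarity between two vectors. Here is a minimal sketch using the three toy vectors from the example above; these values are illustrative and not learned, and the helper name `cosine_similarity` is an assumption of this sketch.

```python
import numpy as np

# Toy 3-dimensional embeddings copied from the example above (not learned values).
toy_embeddings = {
    "dog": np.array([0.1, -0.3, 0.8]),
    "cat": np.array([-1.1, 2.3, 0.7]),
    "apple": np.array([3.1, 0.9, -4.7]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, -1 = opposite."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

dog_cat = cosine_similarity(toy_embeddings["dog"], toy_embeddings["cat"])
dog_apple = cosine_similarity(toy_embeddings["dog"], toy_embeddings["apple"])

print(f"dog vs cat:   {dog_cat:.3f}")
print(f"dog vs apple: {dog_apple:.3f}")
```

With these made-up numbers, _dog_ comes out closer to _cat_ than to _apple_, even though both similarities are negative. A well-trained embedding would be expected to show a much clearer gap.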

❓ **Question** ❓ Let's run Word2Vec! 

[📚 **Gensim**](https://radimrehurek.com/gensim/) is a great Python package that makes the Word2Vec algorithm easy to use, fast and accurate (which is not an easy task!).

In the cell below:

1. The first line imports Word2Vec from Gensim.
2. The second line learns the embedding representation of the words from the sentences in `X_train`.
3. The third line stores the words and their trained embeddings in `wv`.

In [4]:
from gensim.models import Word2Vec

word2vec = Word2Vec(sentences=X_train)
wv = word2vec.wv

Let's look at the embedded representation of some words.

You can use `wv` as a dictionary.
For instance, `wv['dog']` will return a representation of `dog` in the embedding space.

❓ **Question** ❓ Try different words - especially, try non-existing words to see that they don't have any representation (which is perfectly normal as their representation was not learned). 

In [5]:
# Reuse `wv` from the previous cell. Retraining the model here is unnecessary
# and may produce slightly different vectors from one run to the next.

# Examples of words to check their embeddings
words_to_check = ["dog", "cat", "apple", "non_existing_word"]

for word in words_to_check:
    try:
        print(f"Embedding for '{word}': {wv[word]}")
    except KeyError:
        print(f"No embedding found for '{word}'")

Embedding for 'dog': [-0.08555823  0.17339347 -0.08181087  0.21108234 -0.02032751 -0.32627067
  0.04378813  0.5240822  -0.20800665 -0.15857576 -0.00494731 -0.28844658
  0.0585472   0.15533428  0.02426083 -0.23761147  0.17209926 -0.22785889
 -0.07049159 -0.4302882   0.19429903  0.05993873  0.31784904 -0.15720111
 -0.00408607  0.01088329 -0.18901946 -0.10262786 -0.20828311 -0.01181161
  0.1945869   0.00372149  0.13275695 -0.32653853 -0.1841536   0.23749484
  0.08341638 -0.10567769 -0.26443875 -0.34345752 -0.14148036 -0.28053665
 -0.23240499  0.19585615  0.27611262 -0.04167099 -0.1541255  -0.13019587
  0.20721921  0.18885952  0.05336806 -0.18910179 -0.27054396 -0.05237565
 -0.13248296  0.14661637  0.11713735  0.01220182 -0.21446398  0.03165093
  0.12909262 -0.0588934   0.04267502  0.10239565 -0.21414989  0.31340128
 -0.0568453   0.21128896 -0.33794597  0.20631337 -0.19552083  0.23485897
  0.24956402 -0.2103832   0.38875028  0.06379774  0.02877027  0.11462229
 -0.19564226 -0.05439083 -0.23

❓ **Question** ❓ What is the size of each word representation, and therefore, what is the size of the embedding space?

In [6]:
# Assuming 'dog' is in the vocabulary and its embedding is already printed
embedding_size_for_dog = len(wv['dog'])

print("Size of the embedding space:", embedding_size_for_dog)

Size of the embedding space: 100


🧐 How do we know whether this embedding makes any sense or not?

💡 To investigate this question, we will check that words with a close meaning have close representations.

👉 Let's use the [**`Word2Vec.wv.most_similar`**](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) method that, given an input word, displays the "closest" words in the embedding space. If the embedding is well done, then words with similar meanings will have similar representations in the embedding space.

❓ **Question** ❓ Try out the `most_similar` method on different words. 

🧑🏿‍🏫 The quality of the closeness will depend on the quality of your embedding, and thus, depend on the number of sentences that you have loaded and from which you create your embedding.

In [7]:
# Examples of words to check for similarity
words_to_check_similarity = ["dog", "movie", "happy", "sad"]

for word in words_to_check_similarity:
    try:
        similar_words = wv.most_similar(word)
        print(f"Words most similar to '{word}':")
        for similar_word, similarity in similar_words:
            print(f"  {similar_word} (similarity: {similarity})")
        print()
    except KeyError:
        print(f"No similar words found for '{word}' - it might not be in the vocabulary.")
        print()

Words most similar to 'dog':
  research (similarity: 0.9912723898887634)
  man's (similarity: 0.9891747236251831)
  intention (similarity: 0.9889662861824036)
  political (similarity: 0.9889214634895325)
  greek (similarity: 0.9880340099334717)
  babies (similarity: 0.9876160621643066)
  unforgettable (similarity: 0.9876061081886292)
  streisand (similarity: 0.9872338175773621)
  experiment (similarity: 0.9867767095565796)
  nuclear (similarity: 0.9863289594650269)

Words most similar to 'movie':
  film (similarity: 0.9661290645599365)
  show (similarity: 0.843681812286377)
  thing (similarity: 0.8244465589523315)
  ending (similarity: 0.7924070358276367)
  series (similarity: 0.7861598134040833)
  sequel (similarity: 0.7814816832542419)
  flick (similarity: 0.7599536180496216)
  book (similarity: 0.7547544240951538)
  fun (similarity: 0.7492166757583618)
  watching (similarity: 0.7457419037818909)

Words most similar to 'happy':
  saying (similarity: 0.9677345156669617)
  normally (si

📚 Similarly to `most_similar` used on words directly, we can use [**`similar_by_vector`**](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.similar_by_vector) on vectors to do the same thing:

In [8]:
# Example: Using the vector representation of 'dog' to find similar words
dog_vector = wv['dog']

# Find words most similar to the vector of 'dog'
similar_words_to_vector = wv.similar_by_vector(dog_vector)

print("Words most similar to the vector of 'dog':")
for similar_word, similarity in similar_words_to_vector:
    print(f"  {similar_word} (similarity: {similarity})")

Words most similar to the vector of 'dog':
  dog (similarity: 0.9999999403953552)
  research (similarity: 0.9912723898887634)
  man's (similarity: 0.9891747236251831)
  intention (similarity: 0.9889662861824036)
  political (similarity: 0.9889214634895325)
  greek (similarity: 0.9880339503288269)
  babies (similarity: 0.9876160621643066)
  unforgettable (similarity: 0.9876061081886292)
  streisand (similarity: 0.9872338175773621)
  experiment (similarity: 0.9867767095565796)
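
Both `most_similar` and `similar_by_vector` rank the vocabulary by cosine similarity to the query. Below is a minimal NumPy sketch of that ranking on hand-made 2-D vectors. The function name `most_similar_sketch` and all the values are hypothetical, and this is not Gensim's actual implementation.

```python
import numpy as np

def most_similar_sketch(query, words, vectors, topn=3, exclude=()):
    """Rank `words` by cosine similarity between their `vectors` rows and `query`."""
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    similarities = unit_vectors @ unit_query
    ranked = np.argsort(-similarities)  # indices from most to least similar
    results = [(words[i], float(similarities[i])) for i in ranked if words[i] not in exclude]
    return results[:topn]

# Hypothetical 2-D embeddings, hand-made for illustration only.
words = ["movie", "film", "show", "dog"]
vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.6, 0.6],
    [-0.2, 1.0],
])

query = vectors[words.index("movie")]
by_word = most_similar_sketch(query, words, vectors, exclude={"movie"})  # query by word
by_vector = most_similar_sketch(query, words, vectors)                   # query by vector
print(by_word)
print(by_vector)
```

This also mirrors the difference visible in the outputs above: a query by word leaves the word itself out of the results, while a query by vector returns it first with a similarity of about 1.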


# Arithmetic on words

Now, let's perform some mathematical operations on words, i.e. on their vector representations!

As any learned word is represented as a vector, you can do basic arithmetic operations, such as:

$$W2V(good) - W2V(bad)$$

❓ **Question** ❓ Do this mathematical operation and print the result

In [9]:
# Perform the operation: vector("good") - vector("bad")
vector_good = wv['good']
vector_bad = wv['bad']
result_vector = vector_good - vector_bad

print("Result of W2V('good') - W2V('bad'):", result_vector)

Result of W2V('good') - W2V('bad'): [-0.33491522 -0.39934835  0.21403015  0.3361305   0.0590021  -0.1518879
 -0.08099687  0.04494332 -0.1558781  -0.25252956  0.10859609  0.09595746
  0.05276501 -0.08170962 -0.08264568  0.4310434  -0.3747596   0.29057598
 -0.1192195   0.07043684 -0.35550702  0.10595404 -0.2046206   0.07378142
 -0.1618473  -0.10472789 -0.1766685   0.7271621  -0.01663208 -0.83334863
 -0.15421146  0.24641627  0.03751349  0.30387425  0.0009017  -0.4853022
  0.25204936 -0.13798207 -0.5347204   0.7624682  -0.11195929 -0.16153759
 -0.32178906 -0.34848517  0.5733923  -0.14840174  0.4846446  -0.10596311
  0.05075854  0.553197   -0.07314749  0.01497471 -0.4091641   0.12564003
  0.11956668  0.05222023 -0.26890734  0.12221307  0.1411688   0.68235123
  0.03219012  0.31542993 -0.15815812 -0.42742652  0.15388364 -0.00543487
 -0.53245044 -0.1607559   0.48167282  0.56848186 -0.743979   -0.33208567
  0.3595556   0.16524208 -0.17978448  0.04131694 -0.00478932 -0.12449896
  0.28178966  0.0

Now, imagine for a second that the following equality holds true:

$$W2V(good) - W2V(bad) = W2V(nice) - W2V(stupid)$$

which is equivalent to:

$$W2V(good) - W2V(bad) + W2V(stupid) = W2V(nice)$$

❓ **Question** ❓ Just for fun (it would be bold of us to think that this equality holds true...), compute $W2V(good) - W2V(bad) + W2V(stupid)$ and store it in a `res` variable (a vector of size 100 that you can print).

In [10]:
# Perform the operation: W2V("good") - W2V("bad") + W2V("stupid")
vector_good = wv['good']
vector_bad = wv['bad']
vector_stupid = wv['stupid']

res = vector_good - vector_bad + vector_stupid

print("Result of W2V('good') - W2V('bad') + W2V('stupid'):", res)

Result of W2V('good') - W2V('bad') + W2V('stupid'): [-0.07430738 -0.46666953  0.3914961  -0.02891597  0.14834842 -0.7675248
 -0.26757705  0.47757328 -0.5598122  -0.52179074 -0.11470663 -0.31723464
  0.07442862  0.16343768  0.08463763  0.19721806 -0.07568207 -0.23376471
 -0.3592999  -0.7580217   0.11102498  0.26100138  0.26174116  0.11038011
 -0.3222077  -0.27706707 -0.3240376   0.50751185 -0.24323463 -0.7517084
  0.43493748  0.26979607  0.32388505  0.04884326 -0.37326285  0.3248775
  0.2095495   0.04816389 -0.69738847 -0.17218566 -0.05783308 -0.62679446
 -0.02515575  0.1850509   0.7196395  -0.06600647  0.16997868 -0.49373916
 -0.06357201  0.636631    0.06898028 -0.32254255 -0.53596157  0.17868918
  0.10893975  0.40474984 -0.05074216  0.06358797 -0.40205082  0.25674167
  0.10113913  0.09353857  0.23224327 -0.16455022 -0.5994894   0.5754331
 -0.48052624 -0.11690301  0.11707014  0.7247374  -0.6012769  -0.03476503
  0.54854554  0.32502759  0.6474283   0.18940568  0.03015812 -0.21185443
 -0

We said earlier that, for any vector, it is possible to look up the closest vectors in the embedding space.

❓ **Question** ❓ Look at the closest vectors of `res`

💡 _Hint_: `similar_by_vector`

In [11]:
# Find the words most similar to the vector stored in 'res'
similar_words_to_res = wv.similar_by_vector(res)

print("Words most similar to the result vector:")
for similar_word, similarity in similar_words_to_res:
    print(f"  {similar_word} (similarity: {similarity})")

Words most similar to the result vector:
  given (similarity: 0.7771015763282776)
  nice (similarity: 0.7645534873008728)
  always (similarity: 0.7637762427330017)
  spark (similarity: 0.7472996711730957)
  posted (similarity: 0.7446976900100708)
  although (similarity: 0.7440087199211121)
  used (similarity: 0.7399207353591919)
  fair (similarity: 0.7395617961883545)
  good (similarity: 0.733678936958313)
  decent (similarity: 0.731594979763031)


Incredible, right? You can do arithmetic operations on words!

❓ **Question** ❓ You can also try arithmetic such as

$$W2V(Boy) - W2V(Girl) = W2V(Man) - W2V(Woman)$$

or 

$$W2V(Queen) - W2V(King) = W2V(actress) - W2V(actor)$$

❗ **Remark** ❗ You will probably see that the results are not perfect. But don't forget that you trained your model on a very small corpus.

In [12]:
# Operation: W2V("Boy") - W2V("Girl")
vector_boy = wv['boy']
vector_girl = wv['girl']
result_vector_boy_girl = vector_boy - vector_girl

# Find the words most similar to the result vector for "Boy" - "Girl"
similar_words_boy_girl = wv.similar_by_vector(result_vector_boy_girl)

print("Words most similar to the vector of 'Boy' - 'Girl':")
for similar_word, similarity in similar_words_boy_girl:
    print(f"  {similar_word} (similarity: {similarity})")
print()

# Operation: W2V("Queen") - W2V("King")
vector_queen = wv['queen']
vector_king = wv['king']
result_vector_queen_king = vector_queen - vector_king

# Find the words most similar to the result vector for "Queen" - "King"
similar_words_queen_king = wv.similar_by_vector(result_vector_queen_king)

print("Words most similar to the vector of 'Queen' - 'King':")
for similar_word, similarity in similar_words_queen_king:
    print(f"  {similar_word} (similarity: {similarity})")

Words most similar to the vector of 'Boy' - 'Girl':
  10 (similarity: 0.6828743815422058)
  2 (similarity: 0.6552428603172302)
  1 (similarity: 0.6201376914978027)
  3 (similarity: 0.6161310076713562)
  minutes (similarity: 0.5957394242286682)
  5 (similarity: 0.5681634545326233)
  tv (similarity: 0.5670452117919922)
  my (similarity: 0.5658004879951477)
  worst (similarity: 0.5565696954727173)
  4 (similarity: 0.5293103456497192)

Words most similar to the vector of 'Queen' - 'King':
  she (similarity: 0.18317453563213348)
  he (similarity: 0.13897664844989777)
  who (similarity: 0.09356987476348877)
  never (similarity: 0.08082055300474167)
  what (similarity: 0.05337127670645714)
  when (similarity: 0.04753798991441727)
  why (similarity: 0.040311217308044434)
  her (similarity: 0.035780299454927444)
  him (similarity: 0.034226350486278534)
  that (similarity: 0.013694081455469131)
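
The cell above inspects the neighbours of the difference vectors on their own. To test an analogy itself, one term can be moved across the equality, as was done with `res`, and the closest word to $W2V(Boy) - W2V(Girl) + W2V(Woman)$ compared with _man_. Here is a minimal sketch on hand-built vectors where the analogies hold by construction. The axes, values and the `analogy` helper are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical embeddings built by hand so that the analogies hold exactly.
# Axes (an assumption for illustration): (gender, age, royalty).
toy = {
    "man":   np.array([ 1.0,  1.0, 0.0]),
    "woman": np.array([-1.0,  1.0, 0.0]),
    "boy":   np.array([ 1.0, -1.0, 0.0]),
    "girl":  np.array([-1.0, -1.0, 0.0]),
    "king":  np.array([ 1.0,  1.0, 1.0]),
    "queen": np.array([-1.0,  1.0, 1.0]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, embeddings):
    """Return the word closest to W2V(a) - W2V(b) + W2V(c), excluding a, b and c."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

print(analogy("boy", "girl", "woman", toy))   # boy - girl + woman
print(analogy("queen", "king", "man", toy))   # queen - king + man
```

With Gensim, a related query is `wv.most_similar(positive=["boy", "woman"], negative=["girl"])`, which also leaves the input words out of the results. On a corpus as small as ours, do not expect the answer to be as clean as in this toy example.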


<u><i>Some notes about Word2Vec as an internal Neural Network</i></u>:

You might wonder where this magic comes from (at quite a low price: you just ran one line of code on a very small corpus, and it trained within a few minutes). The magic comes from the way Word2Vec is trained. The details are quite complex, but remember that Word2Vec, in `word2vec = Word2Vec(sentences=X_train)`, actually trains an internal neural network (that you don't see).

In a nutshell, this internal neural network predicts a word from the surrounding words in a sentence. To do so, it splits the original sentences and, for each split, chooses some words as inputs $X$ and one word as the output $y$, which it tries to predict using the embedding space.
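
To make this less abstract, here is a deliberately simplified sketch of one prediction step: the embeddings of the surrounding words are averaged, then turned into a probability for each word of the vocabulary. The vocabulary, sizes and random weights are assumptions for illustration. Gensim's real training relies on further optimisations (such as negative sampling) that are left out here.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocabulary = ["this", "movie", "is", "really", "good"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}
vocab_size, embedding_size = len(vocabulary), 3

# Randomly initialised weights; training would adjust both matrices.
input_embeddings = rng.normal(size=(vocab_size, embedding_size))  # one row per word
output_weights = rng.normal(size=(embedding_size, vocab_size))

def predict_center_word(context_words):
    """Return, for each vocabulary word, the probability of being the missing word."""
    context_ids = [word_to_index[w] for w in context_words]
    hidden = input_embeddings[context_ids].mean(axis=0)  # average the context embeddings
    scores = hidden @ output_weights
    exp_scores = np.exp(scores - scores.max())            # numerically stable softmax
    return exp_scores / exp_scores.sum()

# X = the surrounding words, y = the word to predict ("is")
probabilities = predict_center_word(["this", "movie", "really", "good"])
print({word: round(float(p), 3) for word, p in zip(vocabulary, probabilities)})
```

Training would nudge the weights so that the probability of the true word ("is" here) increases. The rows of `input_embeddings` play the role of the word vectors that end up in `wv`.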

And as with any neural network, Word2Vec has some hyperparameters. Let's play with some of these. 

# Word2Vec hyperparameters

❓ **Question** ❓ The first important hyperparameter is the `vector_size` argument. It corresponds to the size of the embedding space. Learn a new `word2vec_2` model, still trained on `X_train`, but with a smaller or larger `vector_size`.

Verify on some words that the embedding size is the one you chose.

In [13]:
from gensim.models import Word2Vec

# Choose a new vector size (e.g., 20, 50, or 300; the default is 100)
new_vector_size = 50  # Example value, you can adjust this

# Train a new Word2Vec model with the new vector size
word2vec_2 = Word2Vec(sentences=X_train, vector_size=new_vector_size)

# Access the word vectors
wv_2 = word2vec_2.wv

# Examples of words to check their new embeddings
words_to_check = ["dog", "movie", "happy"]

for word in words_to_check:
    try:
        print(f"Embedding for '{word}' with vector size {new_vector_size}: {wv_2[word]}")
        print(f"Size of embedding: {len(wv_2[word])}")  # This should match new_vector_size
        print()
    except KeyError:
        print(f"No embedding found for '{word}'")
        print()


Embedding for 'dog' with vector size 50: [ 0.04225151  0.00188194 -0.04602172  0.08343252 -0.07825615 -0.44203973
  0.32264626  0.62627864 -0.74150693 -0.22374304 -0.12112553 -0.5285265
 -0.13321267  0.12990339  0.0505675   0.2655882   0.25878403  0.16548042
 -0.59291196 -0.40122205  0.0933851   0.20387961  0.5519996  -0.23980533
  0.31096533  0.05027333 -0.24279663  0.17375876 -0.31118268  0.11954911
 -0.06141366 -0.26943132 -0.25731024  0.03257808 -0.16933754  0.13686617
  0.31253004  0.05351512  0.09517422 -0.21734266  0.3185163  -0.04427274
 -0.09767125  0.23966514  0.49648875  0.12202334  0.13892101 -0.5442693
  0.3798602   0.27710918]
Size of embedding: 50

Embedding for 'movie' with vector size 50: [-0.3032787   0.73370713 -0.02492724 -1.0187318   0.6587918  -1.1219642
  0.67373866  1.9380773   0.5520067  -1.0656253   0.0099706  -1.1125219
  3.1161475   0.5385937  -1.6240096   1.3214861   1.182277    1.7344649
 -2.1508431  -0.2676459  -0.5228351   0.28891042  0.80866903  2.34802

❓ **Question** ❓ Use the **`Word2Vec.wv.key_to_index`** attribute to display the size of the learned vocabulary. Compare it to the number of different words in `X_train`.

In [14]:
# Size of the learned vocabulary
vocab_size_word2vec = len(wv_2.key_to_index)
print("Size of learned vocabulary:", vocab_size_word2vec)

# Calculate the number of unique words in X_train
unique_words_in_X_train = set(word for sentence in X_train for word in sentence)
unique_word_count_in_X_train = len(unique_words_in_X_train)
print("Number of different words in X_train:", unique_word_count_in_X_train)

Size of learned vocabulary: 8006
Number of different words in X_train: 30419


There is a large difference between the number of distinct words in the train sentences and the size of the Word2Vec vocabulary, even though the model was trained on those same sentences. The reason is the second important hyperparameter of Word2Vec: `min_count`.

`min_count` is an integer that sets the minimum number of occurrences a word must have in the corpus to be learned in the embedding space. For instance, let's say that the word "movie" appears 1000 times in the corpus and "simba" only 2 times. If `min_count=3`, the word "simba" will be skipped during the training.

The intention is to learn representations only for words that are frequent enough in the corpus to get a robust embedding.
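
The filtering rule can be sketched in a few lines of plain Python. The corpus and the `build_vocabulary` helper below are made up for illustration. Gensim builds its vocabulary internally, so this only mimics the rule described above.

```python
from collections import Counter

def build_vocabulary(sentences, min_count):
    """Keep only the words that occur at least `min_count` times in `sentences`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

# Tiny made-up corpus: "movie" appears 4 times, "simba" only twice.
toy_sentences = [
    ["this", "movie", "is", "great"],
    ["simba", "is", "in", "this", "movie"],
    ["this", "movie", "is", "long"],
    ["simba", "saves", "the", "movie"],
]

vocab_min_3 = build_vocabulary(toy_sentences, min_count=3)
vocab_min_1 = build_vocabulary(toy_sentences, min_count=1)
print(sorted(vocab_min_3))
print(len(vocab_min_1))
```

Raising `min_count` shrinks the vocabulary, which is the same effect as the gap between the two numbers printed above for `X_train`.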

❓ **Question** ❓ Learn a new `word2vec_3` model with a `min_count` higher than 5 (which is the default value) and a `word2vec_4` with a `min_count` smaller than 5, and then, compare the size of the vocabulary for all the different word2vecs that you have trained (you can choose any `vector_size` you want).

In [15]:
# Train word2vec_3 with min_count higher than 5
word2vec_3 = Word2Vec(sentences=X_train, vector_size=50, min_count=6)  # Example: min_count set to 6
vocab_size_word2vec_3 = len(word2vec_3.wv.key_to_index)

# Train word2vec_4 with min_count lower than 5
word2vec_4 = Word2Vec(sentences=X_train, vector_size=50, min_count=4)  # Example: min_count set to 4
vocab_size_word2vec_4 = len(word2vec_4.wv.key_to_index)

# Print the sizes of the vocabularies
print("Size of vocabulary in word2vec_2 (default min_count=5):", vocab_size_word2vec)
print("Size of vocabulary in word2vec_3 (min_count=6):", vocab_size_word2vec_3)
print("Size of vocabulary in word2vec_4 (min_count=4):", vocab_size_word2vec_4)

Size of vocabulary in word2vec_2 (default min_count=5): 8006
Size of vocabulary in word2vec_3 (min_count=6): 6892
Size of vocabulary in word2vec_4 (min_count=4): 9584


Remember that Word2Vec has an internal neural network that is optimized by predicting a word from its surrounding words. The surrounding words are taken from a `window`, which sets how many words around the target are taken into account. You can train Word2Vec with different `window` sizes.

❓ **Question** ❓ Train a new `word2vec_5` model with a `window` different from the one used previously (the default is 5).
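
As a rough illustration of what the `window` controls, the sketch below lists, for each target word, the words found within `window` positions on either side. The helper `context_target_pairs` is hypothetical, and real implementations may vary the effective window during training, which is ignored here.

```python
def context_target_pairs(sentence, window):
    """Return a (context_words, target_word) pair for every position in `sentence`."""
    pairs = []
    for position, target in enumerate(sentence):
        left = sentence[max(0, position - window):position]
        right = sentence[position + 1:position + 1 + window]
        pairs.append((left + right, target))
    return pairs

sentence = ["this", "movie", "is", "the", "worst", "action", "movie", "ever"]

for window in (1, 2):
    context, target = context_target_pairs(sentence, window)[3]  # target word: "the"
    print(f"window={window}: {context} -> {target}")
```

A larger window gives each prediction more context, at the cost of mixing in words that are further away from the target.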

In [16]:
# Train word2vec_5 with a different window size
new_window_size = 7  # Example: window size set to 7, but you can choose any size other than 5
word2vec_5 = Word2Vec(sentences=X_train, vector_size=50, window=new_window_size)

# Size of the vocabulary for word2vec_5
vocab_size_word2vec_5 = len(word2vec_5.wv.key_to_index)

# Print the size of the vocabulary
print("Size of vocabulary in word2vec_5 (window size = {}): {}".format(new_window_size, vocab_size_word2vec_5))

Size of vocabulary in word2vec_5 (window size = 7): 8006


The arguments you have seen (`vector_size`, `min_count` and `window`) are usually the ones you should start tuning to get better performance from your model.

But you can also look at the other arguments in the [**📚 Documentation - gensim.models.word2vec.Word2Vec**](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)

# Convert our train and test set to RNN-ready datasets

Remember that `Word2Vec` is only the first step of the overall process; the second step feeds this representation into an RNN, as shown here:

<img src="word2vec_representation.png" width="400px" />



Now, let's work on Step 2 by converting the training and test data into their vector representations, ready to be fed into RNNs.

❓ **Question** ❓ Now, write a function that, given a sentence, returns a matrix corresponding to the embedding of the full sentence. This means that you have to embed each word one after the other and stack the results into a 2D matrix (make sure that your output is a NumPy array).

❗ **Remark** ❗ You will probably notice that some words you are trying to convert throw errors because they do not belong to the dictionary:

- For the <font color=orange>test</font> set, this is understandable: <font color=orange>some words were not</font> in the <font color=blue>train</font> set and thus, their <font color=orange>embedded representation is unknown</font>
- For the <font color=blue>train set</font>, due to the `min_count` hyperparameter, not all the words have a vector representation.

In any case, just skip the missing words here.

In [18]:
import numpy as np

example = ['this', 'movie', 'is', 'the', 'worst', 'action', 'movie', 'ever']
example_missing_words = ['this', 'movie', 'is', 'laaaaaaaaaame']

def embed_sentence(word2vec, sentence):
    embedded_sentence = []
    
    for word in sentence:
        if word in word2vec.wv.key_to_index:  # Check if the word is in the Word2Vec vocabulary
            embedded_sentence.append(word2vec.wv[word])
    
    return np.array(embedded_sentence)
    
### Checks
embedded_sentence = embed_sentence(word2vec, example)
assert(type(embedded_sentence) == np.ndarray)
assert(embedded_sentence.shape == (8, 100))

embedded_sentence_missing_words = embed_sentence(word2vec, example_missing_words)  
assert(type(embedded_sentence_missing_words) == np.ndarray)
assert(embedded_sentence_missing_words.shape == (3, 100))

❓ **Question** ❓ Write a function that, given a list of sentences (each sentence being a list of words/strings), returns a list of embedded sentences (each sentence is a matrix). Apply this function to the train and test sentences

💡 _Hint_: Use the previous function `embed_sentence`

In [20]:
def embedding(word2vec, sentences):
    # Keep one entry per sentence, even when none of its words is in the vocabulary,
    # so that the embedded sentences stay aligned with the labels in y_train / y_test
    return [embed_sentence(word2vec, sentence) for sentence in sentences]
    
X_train_embedded = embedding(word2vec, X_train)
X_test_embedded = embedding(word2vec, X_test)

❓ **Question** ❓ In order to have ready-to-use data, do not forget to pad your sequences so you have tensors which can be divided into batches (of `batch_size`) during the optimization. Store the padded values in `X_train_pad` and `X_test_pad`. Do not forget the important arguments of the padding ;)

In [None]:
### YOUR CODE HERE

assert(len(X_train_pad.shape) == 3)
assert(len(X_test_pad.shape) == 3)
assert(X_train_pad.shape[2] == 100)
assert(X_test_pad.shape[2] == 100)
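
To illustrate what padding does to embedded sentences, here is a NumPy-only sketch. It is not the expected solution: the helper `pad_embedded_sentences` and the choice of `maxlen` are assumptions, and the fake data stands in for `X_train_embedded`.

```python
import numpy as np

def pad_embedded_sentences(embedded_sentences, maxlen, embedding_size, value=0.0):
    """Stack variable-length (n_words, embedding_size) matrices into one 3D array.

    Sentences shorter than `maxlen` are padded at the end with `value`;
    longer ones are truncated to their first `maxlen` words.
    """
    padded = np.full((len(embedded_sentences), maxlen, embedding_size), value, dtype="float32")
    for i, sentence in enumerate(embedded_sentences):
        sentence = np.asarray(sentence, dtype="float32")[:maxlen]
        if len(sentence) > 0:  # a sentence with no known word stays all padding
            padded[i, :len(sentence)] = sentence
    return padded

# Two fake embedded sentences of 2 and 4 words in a 3-dimensional embedding space.
fake_embedded = [np.ones((2, 3)), np.ones((4, 3))]
fake_pad = pad_embedded_sentences(fake_embedded, maxlen=3, embedding_size=3)
print(fake_pad.shape)
```

In the notebook itself, Keras' `pad_sequences` (from `tensorflow.keras.preprocessing.sequence`) does the same job. The arguments to watch are `dtype` (the default `'int32'` would turn the float embeddings into integers, so `dtype='float32'` is needed), `padding` (`'pre'` by default, `'post'` to pad at the end) and `maxlen`.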



🏁 Congratulations, you are now able to use `Word2Vec` to embed your words :)

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!
