# Vectorize Text for Exploration Using Word Embeddings

[![Open In Colab](colab-badge.svg)](https://colab.research.google.com/github/alexisperrier/intro2nlp/blob/master/notebooks/intro2nlp_08_word_embeddings.ipynb)

Word embeddings are numerical representations of words or phrases that capture their meaning in a vector space. They are useful for natural language processing tasks because they encode the semantic relationships between words, which lets algorithms make predictions and decisions based on the meaning of the text rather than on its surface form alone.

This notebook follows chapter 3 of the [intro to NLP](https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing) course on [openclassrooms](https://openclassrooms.com).

We will use the gensim library version 4. Note that Gensim has had a significant [upgrade from version 3 to 4](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4).


## Measuring the similarity between words


In [None]:
# install gensim with
!pip install --upgrade gensim
# check that the version is 4+
!pip show gensim

In [None]:
# other libs we will need
import numpy as np

In [None]:
# import gensim and load the model
# this might take a while, especially on Google Colab.
# you can use a smaller model and follow along.
# The results will be slightly different but the conclusions will be roughly the same

import gensim.downloader as api
model = api.load("word2vec-google-news-300")

# or if the above model takes too long to download, use
# model = api.load("glove-wiki-gigaword-50")

The vector for the word 'book' has dimension 300. It is obtained with

In [None]:
print(model['book'])

The `most_similar()` function returns the 10 most similar words and their similarity scores:


In [None]:
model.most_similar("book")

Similarly, the words most similar to apple are:

In [None]:
model.most_similar("apple")

By contrast, Apple with a capital A is associated with the brand:

In [None]:
model.most_similar("Apple")

We can also measure the similarity score between a pair of words. For instance:

In [None]:
print("similarity score between apple and banana:", model.similarity("apple", "banana")) 
print("similarity score between apple and dog:   ", model.similarity("apple", "dog")) 
print("similarity score between cat   and dog:   ", model.similarity("cat", "dog")) 


According to word2vec, a cat is more similar to a dog than an apple is.

### Vocabulary
Let's take a look at the vocabulary in the word2vec model.

Note: this syntax has changed between Gensim 3.x and 4.x. 

In 3.x, you would get the vocab with

```
vocab = model.vocab.keys()
```
In 4.x, you need

```
vocab = model.index_to_key
```



In [None]:
# print 5 random words out of the whole available vocabulary, do it 10 times
vocab = model.index_to_key
for _ in range(10):
    print(np.random.choice(vocab,5))

#### Side note: Levenshtein distance and cosine similarity

Cosine similarity can be calculated with

In [None]:
from scipy import spatial
vector1 = [1, 1, 2, 2, 3]
vector2 = [1, 3, 1, 2, 6]

cosine_similarity = 1 - spatial.distance.cosine(vector1, vector2)
print (cosine_similarity)
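
As a rough illustration of what `spatial.distance.cosine` computes, the cosine similarity is the dot product of the two vectors divided by the product of their norms. The sketch below recomputes it with NumPy only; the helper name `cosine_sim` is ours and not part of any library.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vector1 = [1, 1, 2, 2, 3]
vector2 = [1, 3, 1, 2, 6]

# Should match the scipy result above (about 0.8995)
print(cosine_sim(vector1, vector2))
```

A value close to 1 means the vectors point in nearly the same direction, while a value close to 0 means they are nearly orthogonal.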


The cosine similarity for 2 words such as grass and tree is:

In [None]:
print("similarity score between grass and tree:", model.similarity("grass", "tree")) 

# and with scipy
cosine_similarity = 1 - spatial.distance.cosine(model['grass'], model['tree'])
print ("cosine similarity between grass and tree:",cosine_similarity)
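
One way to think about `most_similar()` is as a ranking of every word in the vocabulary by its cosine similarity to the query word. The sketch below illustrates that idea on four made-up 3-dimensional vectors. The values and the helper `toy_most_similar` are invented for illustration; gensim's own implementation is vectorized and far more efficient.

```python
import numpy as np

# Made-up toy embeddings (NOT taken from word2vec)
toy_vectors = {
    "cat":    np.array([1.0, 0.9, 0.0]),
    "dog":    np.array([0.9, 1.0, 0.0]),
    "apple":  np.array([0.0, 0.1, 1.0]),
    "banana": np.array([0.1, 0.0, 1.0]),
}

def toy_most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue  # skip the query word itself
        cos = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scores.append((other, float(cos)))
    # highest similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for token, score in toy_most_similar("cat", toy_vectors):
    print(f"{token:>8} {score:.3f}")
```

With these toy values, `dog` ranks first for `cat`, well ahead of the two fruits.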



Similarity between words can also be measured with other methods that ignore meaning and only look at spelling.
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another.

In [None]:
!pip install levenshtein

In [None]:
from Levenshtein import distance

In [None]:
print(f"distance('test','test') = {distance('test','test')}  because no character substitution is needed")
print(f"distance('test','team') = {distance('test','team')}  because two character substitutions are needed: s -> a and t -> m")
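
For readers curious how such a distance can be computed, here is a minimal pure-Python sketch of the classic dynamic-programming approach. The function name `levenshtein` is ours; the `Levenshtein` package used above is implemented differently and is much faster.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    # previous[j] = distance between the current prefix of a and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        previous = current
    return previous[-1]

print(levenshtein("test", "test"))  # 0
print(levenshtein("test", "team"))  # 2
```

Note that this distance only compares characters: two words with very different meanings can be close, and two synonyms can be far apart.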

##### Cultural bias

In the US, Alexis is a feminine name, while in the rest of the world it is a masculine name. Word2vec was trained on US-centric data. This shows up when looking at the names the model considers most similar to 'Alexis': Nicole, Erica, Marissa, Alicia ... all women's names.


In [None]:
model.most_similar('Alexis')

##### Out of Vocabulary OOV

Some words are not in the word2vec vocabulary, for instance Covid and ... word2vec.

In [None]:
vocab = model.index_to_key

# no covid words (only 'covidien' which is a company)
start_with = 'covid'
vocab_subset = [tk.lower() for tk in  vocab if tk.lower()[:len(start_with)] == start_with]
vocab_subset.sort()
print(vocab_subset)

# no word2vec words
start_with = 'word2vec'
vocab_subset = [tk.lower() for tk in  vocab if tk.lower()[:len(start_with)] == start_with]
vocab_subset.sort()
print(vocab_subset)
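
Looking up an out-of-vocabulary word directly, as in `model['covid']`, raises a `KeyError`, so it can help to test membership first. gensim's `KeyedVectors` supports the `in` operator. In the sketch below a plain dictionary with made-up vectors stands in for the model so that the example runs without downloading anything, and `vector_or_none` is a hypothetical helper.

```python
def vector_or_none(vectors, word):
    """Return the vector for `word`, or None when the word is out of vocabulary."""
    return vectors[word] if word in vectors else None

# Stand-in for the model: made-up 2-dimensional vectors
toy_vectors = {"book": [0.1, 0.3], "apple": [0.7, 0.2]}

for word in ["book", "covid"]:
    vec = vector_or_none(toy_vectors, word)
    print(word, "->", "OOV" if vec is None else vec)
```

The same helper should work unchanged when `vectors` is the loaded `model`, since it only relies on `in` and on indexing by word.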



## Train Your First Embedding Models

To train your first model, we’ll use the Shakespeare corpus, composed of all the lines of all the Shakespeare plays available on Kaggle (or here). The idea behind working on classic literature is not to be snobbish, but to find a corpus different enough from the ones word2vec and GloVe were trained on (Google U.S. News and Wikipedia). We expect the Shakespeare dataset to have a different worldview and vocabulary. The dataset is also large and already in a short-sequence format, which will speed up the calculations.



In [None]:
import requests
import re

url = 'https://raw.githubusercontent.com/alexisperrier/intro2nlp/master/data/Shakespeare_alllines.txt'

r = requests.get(url)
lines = r.text.encode('ascii',errors='ignore').decode('utf-8').split("\n")

# remove all punctuation and only keep verses with more than one token to reduce the size of the corpus
sentences = []

for line in lines:
   # remove punctuation
   line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]','',line).strip()

   # simple tokenizer
   tokens = re.findall(r'\b\w+\b', line)

   # only keep lines with more than one token
   if len(tokens) > 1:
      sentences.append(tokens)
print("This gives: ", len(sentences), "sentences")
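
To see what the cleaning and tokenizing steps do to a single line, here is a small self-contained sketch that reuses the same two regular expressions. The `tokenize` wrapper and the sample lines are ours, added for illustration.

```python
import re

# Same punctuation pattern as in the cell above
PUNCTUATION = r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]'

def tokenize(line):
    """Strip punctuation, then split the line into word tokens."""
    line = re.sub(PUNCTUATION, '', line).strip()
    return re.findall(r'\b\w+\b', line)

print(tokenize("To be, or not to be: that is the question."))
# ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']

# A one-word line yields a single token, so the filter above would drop it
print(tokenize("Exeunt."))
# ['Exeunt']
```

Case is preserved, which is why the trained model distinguishes `King` from `king`.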

Let's train a word2vec model, which we will call bard2vec

In [None]:
from gensim.models import Word2Vec

bard2vec = Word2Vec(
         sentences,
         min_count=3,   # Ignore words that appear fewer than 3 times
         vector_size=50,       # Dimensionality of word embeddings
         sg = 1,        # skipgrams
         window=7,      # Context window for words during training
         epochs=40)       # Number of epochs training over corpus
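
To make the `window` and `sg` parameters more concrete: with `sg=1` (skip-gram), the model learns to predict the words that surround each word, up to `window` positions away. The sketch below lists the (center, context) pairs for one tokenized line. It is a simplification with a hypothetical helper `skipgram_pairs`; gensim's actual training adds refinements such as negative sampling and subsampling of frequent words.

```python
def skipgram_pairs(tokens, window):
    """List (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["my", "kingdom", "for", "a", "horse"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

A larger window produces more pairs per line and links words that are further apart, which tends to shift what the model treats as similar.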

Once the training is done, we can explore the results by looking at word similarity for certain words

In [None]:
def similar_words(word):
    print("-- most similar words to: ", word)
    for (token, score) in bard2vec.wv.most_similar(word):
        print(f"\t{token:>10} {np.round(score,2)}")
    print()
    
similar_words('King')
similar_words('sword')
similar_words('husband')
similar_words('Hamlet')


The results depend on how we trained the model.
Let's compare with a model that is trained for longer, with a larger window, and with CBOW instead of skip-gram.



In [None]:
from gensim.models import Word2Vec

bard2vec = Word2Vec(
         sentences,
         min_count=3,   # same
         vector_size=50,  # same
         sg = 0,        # cbow instead of skip-grams
         window=10,      # larger context windows
         epochs=100)       # longer training

In [None]:
def similar_words(word):
    print("-- most similar words to: ", word)
    for (token, score) in bard2vec.wv.most_similar(word):
        print(f"\t{token:>10} {np.round(score,2)}")
    print()
    
similar_words('King')
similar_words('sword')
similar_words('husband')
similar_words('Hamlet')
