***
## <center>NLP word embeddings using Word2Vec</center>

#### <center> Student. Benhari Abdessalam -- Mines Paristech - March 2021 </center>
***
### > Introduction :


In this notebook we build a word2vec pipeline: we train a model on a corpus of reviews and save it for use in our app. We use the _gensim_ library. The original word2vec paper can be found <a href="https://arxiv.org/pdf/1301.3781.pdf">HERE</a>.

The training data comes from the <a href="http://kavita-ganesan.com/entity-ranking-data/">OpinRank</a> dataset.

In [1]:
# Library imports
import gzip
import gensim
import logging

# Logging configuration
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

> ### >> Data extraction : 

First, we read the reviews and preprocess them into an input-ready list of token lists for our model:

In [2]:
# Path to the gzipped reviews file
dataPath = "../data/mLearning/reviews_data.txt.gz"

def readInput(inputPath):
    """ Reads the gzip-compressed input file and yields
        one list of tokens per line (one review per line).
    """
    logging.info("reading file {0}...".format(inputPath))

    with gzip.open(inputPath, 'rb') as f:
        for i, line in enumerate(f):
            # log progress every 10,000 reviews
            if i % 10000 == 0:
                logging.info("read {0} reviews".format(i))

            # lowercase and tokenize the review into a list of words
            yield gensim.utils.simple_preprocess(line)

# Materialize the generator: one list of tokens per review
documents = list(readInput(dataPath))
logging.info("Done reading data file")

2021-03-22 18:22:58,020 : INFO : reading file ../data/mLearning/reviews_data.txt.gz...
2021-03-22 18:22:58,032 : INFO : read 0 reviews
2021-03-22 18:22:59,773 : INFO : read 10000 reviews
2021-03-22 18:23:01,547 : INFO : read 20000 reviews
2021-03-22 18:23:03,473 : INFO : read 30000 reviews
2021-03-22 18:23:05,377 : INFO : read 40000 reviews
2021-03-22 18:23:07,429 : INFO : read 50000 reviews
2021-03-22 18:23:09,332 : INFO : read 60000 reviews
2021-03-22 18:23:10,978 : INFO : read 70000 reviews
2021-03-22 18:23:12,435 : INFO : read 80000 reviews
2021-03-22 18:23:13,984 : INFO : read 90000 reviews
2021-03-22 18:23:15,746 : INFO : read 100000 reviews
2021-03-22 18:23:17,357 : INFO : read 110000 reviews
2021-03-22 18:23:18,985 : INFO : read 120000 reviews
2021-03-22 18:23:21,141 : INFO : read 130000 reviews
2021-03-22 18:23:23,234 : INFO : read 140000 reviews
2021-03-22 18:23:25,523 : INFO : read 150000 reviews
2021-03-22 18:23:27,516 : INFO : read 160000 reviews
2021-03-22 18:23:29,349 : 
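
The cell above materializes every review in a list because a generator can only be consumed once, while Word2Vec makes several passes over the corpus. A possible alternative, sketched here, is a restartable iterable that re-opens the file on each pass, so the corpus never has to fit in memory. The class name `ReviewCorpus` and the regex tokenizer are hypothetical stand-ins (the notebook itself uses `gensim.utils.simple_preprocess`), and the demo file is made up on the fly rather than taken from OpinRank.

```python
import gzip
import os
import re
import tempfile


def simple_tokenize(text):
    """Hypothetical stand-in for gensim.utils.simple_preprocess:
    lowercase the text and keep alphabetic tokens of 2 to 15 characters."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if 2 <= len(t) <= 15]


class ReviewCorpus:
    """Restartable iterable: each call to __iter__ re-opens the gzip file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # 'rt' mode decodes bytes to str while streaming line by line
        with gzip.open(self.path, "rt", encoding="utf-8") as f:
            for line in f:
                yield simple_tokenize(line)


# Demo on a tiny, made-up gzip file (not the OpinRank data)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_reviews.txt.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("The room was clean and quiet.\n")
    f.write("Great staff, a bit noisy at night!\n")

corpus = ReviewCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # iterating again works, unlike a plain generator
print(first_pass)
```

An object like this could be passed to the model in place of the `documents` list, at the cost of re-reading and re-tokenizing the file on every epoch.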

> ### >> Model training :

In this section we train our model on the data prepared above, then save it so that it can be reused without retraining from scratch.

In [3]:
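# Build the vocabulary and initialize the model.
# Passing `documents` here already runs gensim's default training epochs.
# Note: `size` is the gensim 3.x name; gensim 4 renamed it to `vector_size`.
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)

# Train for 10 more epochs
model.train(documents, total_examples=len(documents), epochs=10)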

2021-03-22 18:28:35,152 : INFO : EPOCH 6 - PROGRESS: at 5.50% examples, 826650 words/s, in_qsize 17, out_qsize 2
2021-03-22 18:28:36,170 : INFO : EPOCH 6 - PROGRESS: at 7.93% examples, 800827 words/s, in_qsize 20, out_qsize 2
2021-03-22 18:28:37,187 : INFO : EPOCH 6 - PROGRESS: at 10.32% examples, 808637 words/s, in_qsize 19, out_qsize 0
2021-03-22 18:28:38,194 : INFO : EPOCH 6 - PROGRESS: at 12.96% examples, 839734 words/s, in_qsize 19, out_qsize 0
2021-03-22 18:28:39,199 : INFO : EPOCH 6 - PROGRESS: at 15.78% examples, 856257 words/s, in_qsize 18, out_qsize 1
2021-03-22 18:28:40,200 : INFO : EPOCH 6 - PROGRESS: at 18.61% examples, 882516 words/s, in_qsize 19, out_qsize 0
2021-03-22 18:28:41,212 : INFO : EPOCH 6 - PROGRESS: at 21.48% examples, 896591 words/s, in_qsize 18, out_qsize 1
2021-03-22 18:28:42,215 : INFO : EPOCH 6 - PROGRESS: at 24.07% examples, 908557 words/s, in_qsize 18, out_qsize 1
2021-03-22 18:28:43,218 : INFO : EPOCH 6 - PROGRESS: at 27.92% examples, 925473 words/

(303497013, 415193580)
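
The tuple printed at the end of the cell is the return value of `model.train`. Assuming it follows gensim's convention of (effective words trained on, raw words seen), a quick back-of-the-envelope check, using only the two numbers shown above, gives the share of raw tokens that actually contributed to training and the raw token count per epoch:

```python
# Values copied from the training output above
effective_words, raw_words = 303497013, 415193580
epochs = 10  # as passed to model.train

# Fraction of the raw tokens that were actually used in training updates
trained_share = effective_words / raw_words

# Raw tokens seen in a single pass over the corpus
raw_words_per_epoch = raw_words // epochs

print(f"share of raw words used for training: {trained_share:.1%}")
print(f"raw words per epoch: {raw_words_per_epoch}")
```

If that reading of the tuple is right, roughly 73% of the tokens were used, and the rest would be words dropped by `min_count=2` or skipped by gensim's subsampling of very frequent words.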

In [5]:
# Save the model to the models folder
model.save('models/word2vecModel')

2021-03-22 18:32:28,281 : INFO : saving Word2Vec object under models/word2vecModel, separately None
2021-03-22 18:32:28,282 : INFO : storing np array 'vectors' to models/word2vecModel.wv.vectors.npy
2021-03-22 18:32:28,525 : INFO : not storing attribute vectors_norm
2021-03-22 18:32:28,527 : INFO : storing np array 'syn1neg' to models/word2vecModel.trainables.syn1neg.npy
2021-03-22 18:32:28,908 : INFO : saved models/word2vecModel
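
Since the point of saving is to reuse the model without retraining, here is a minimal sketch of how the app side might load it back. `Word2Vec.load` is gensim's counterpart to `save`; the helper name `load_word2vec` and the early path check are assumptions added for illustration.

```python
from pathlib import Path

MODEL_PATH = Path("models/word2vecModel")  # same path as in the save cell


def load_word2vec(path=MODEL_PATH):
    """Load a saved Word2Vec model, failing early if the file is missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"no saved model at {path}")
    # Imported here so the path check works even without gensim installed
    from gensim.models import Word2Vec
    return Word2Vec.load(str(path))


# Usage in the app (hypothetical; requires the model saved above):
# model = load_word2vec()
# vector = model.wv["happy"]
```

As the log above shows, the large arrays are stored in separate `.npy` files, so those files need to stay next to `models/word2vecModel` for loading to succeed.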


> ### >> Output examples : 

The trained word2vec model can be used to compute the similarity or dissimilarity between words, find the odd word out in a list, and much more. In our case, we only use it to retrieve the words most similar to a given word (approximate synonyms) and to get the embedding vectors:

In [6]:
# EXAMPLE : Similarity (synonym) results :
w1 = "happy"
model.wv.most_similar(positive=w1)

2021-03-22 18:32:41,834 : INFO : precomputing L2-norms of word weight vectors


[('pleased', 0.8089096546173096),
 ('satisfied', 0.7358423471450806),
 ('delighted', 0.6559268832206726),
 ('impressed', 0.6552660465240479),
 ('thrilled', 0.6467134952545166),
 ('disappointed', 0.5812833309173584),
 ('dissapointed', 0.5792136192321777),
 ('grateful', 0.5529959201812744),
 ('willing', 0.5381923913955688),
 ('displeased', 0.5198877453804016)]
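
The scores returned by `most_similar` are cosine similarities between embedding vectors. The ranking can be sketched in plain Python as follows; the function names and the 3-dimensional toy vectors are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 3-dimensional vectors, made up for illustration (not from the model)
toy_vectors = {
    "happy": [1.0, 0.9, 0.1],
    "pleased": [0.9, 1.0, 0.2],
    "dirty": [-0.8, 0.1, 1.0],
    "clean": [0.7, 0.2, -0.9],
}

ranking = most_similar("happy", toy_vectors)
print(ranking)
```

Because the score only reflects how similar the contexts of two words are, antonyms such as "disappointed" can rank highly, as the real output above shows.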

In [7]:
# EXAMPLE : Embedding vector of the word "dirty"
model.wv['dirty']

array([ 4.990052  ,  0.841634  ,  1.5524447 ,  3.2968087 ,  1.7365276 ,
        0.8118474 ,  2.306224  , -1.2724187 ,  1.0596699 ,  0.9143704 ,
        1.8354827 , -0.41388938,  0.4152715 , -0.4517619 , -3.010841  ,
       -1.8359199 , -1.5408039 ,  0.4325712 , -2.0794213 , -0.5403505 ,
       -1.6379613 , -0.24902526, -2.384686  , -1.2031227 ,  1.813023  ,
       -0.9492246 , -2.1577306 , -1.5225183 ,  1.4909753 ,  1.3825532 ,
        3.086713  , -3.0291204 , -0.84425706, -0.9748284 , -2.697659  ,
       -0.49744514, -0.6550763 , -0.85399914,  3.6276917 ,  0.73236203,
       -1.7324402 ,  0.97531676, -0.9627324 , -1.8066303 , -0.01335902,
        0.7721689 , -2.3003924 , -0.7372992 , -0.48276767, -2.8071015 ,
       -1.3335373 ,  3.072297  ,  1.7701883 , -2.007972  , -1.0845212 ,
        0.8838573 ,  0.42267337,  0.99321747, -1.3929528 , -4.410457  ,
       -1.7682109 , -2.0521379 ,  0.25185904, -0.32647207,  0.24923967,
        3.6221285 ,  1.3064955 ,  1.1378423 , -2.738603  ,  1.51