
# The data

## About the data
The goal is to learn vector representations of words from raw text, so it does not matter whether the text is labelled. The data set consists of **50,000 IMDB reviews**, each with a review ID and no sentiment label. We'll use this unlabelled data to train a model that can then be applied to other text, such as test data.

Please visit the Kaggle page below to download the data:
https://www.kaggle.com/c/word2vec-nlp-tutorial/data

In [None]:
import numpy as np
import pandas as pd

## Import the data

The file is downloaded into the working directory with the command below.

In [None]:
!wget -q https://www.dropbox.com/s/0ygoimffauvl7x5/unlabeledTrainData.tsv

In [None]:
df=pd.read_csv("unlabeledTrainData.tsv",delimiter="\t",quoting=3,header=0)
df.shape

(50000, 2)
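
The `quoting=3` argument is the numeric value of `csv.QUOTE_NONE`, which tells the parser to treat double quotes as ordinary characters. The sketch below uses only the standard `csv` module and a made-up two-line sample (not the real file) to show the effect. It illustrates the setting itself, not how pandas is implemented.

```python
import csv
import io

# Made-up sample in the same tab-separated layout as the data file.
sample = 'id\treview\n"9999_0"\t"Watching Time Chasers, it obvious..."\n'


def read_rows(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))


rows_none = read_rows(sample, csv.QUOTE_NONE)        # quotes kept as literal characters
rows_minimal = read_rows(sample, csv.QUOTE_MINIMAL)  # quotes interpreted and stripped

print(csv.QUOTE_NONE)       # numeric value passed as quoting=3
print(rows_none[1][0])      # id field with its surrounding quotes
print(rows_minimal[1][0])   # id field without quotes
```

With quoting disabled, a stray double quote inside a review cannot swallow the following tabs or newlines, which is presumably why the setting is used here.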

In [None]:
df.head()

Unnamed: 0,id,review
0,"""9999_0""","""Watching Time Chasers, it obvious that it was..."
1,"""45057_0""","""I saw this film about 20 years ago and rememb..."
2,"""15561_0""","""Minor Spoilers<br /><br />In New York, Joan B..."
3,"""7161_0""","""I went to see this film with a great deal of ..."
4,"""43971_0""","""Yes, I agree with everyone on this site this ..."


In [None]:
import re

##  Data Cleaning
A look through the reviews shows that many of them contain punctuation. Punctuation contributes nothing to this analysis. Worse, if it is left in, punctuation marks are treated as separate tokens or stay attached to words, which distorts the vocabulary. The data therefore needs to be cleaned before the core analysis.

In [None]:
def clean_string(text):                                                           # Clean a single review
  try:
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^A-Za-z]", " ", text)     # replace every non-alphabetic character with a space
    words = text.strip().lower().split()       # lowercase and collapse repeated whitespace
    return " ".join(words)
  except TypeError:                            # non-string input, e.g. a missing value
    return " "
  

In [None]:
doc = "my name is anshu 123anshu http://www.anshi.com jeio"
re.sub(r"[^A-Za-z]"," ",doc)

'my name is anshu    anshu http   www anshi com jeio'
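
The output above shows that a character-level filter keeps URL fragments such as `http`, `www` and `com` as separate tokens. One possible refinement, sketched here under the assumption that URLs and HTML tags such as `<br />` carry no useful signal, is to strip them before removing non-letters. The function name `clean_review_text` and the sample sentence are hypothetical.

```python
import re


def clean_review_text(text):
    """Hypothetical cleaner: drop URLs and HTML tags, then keep letters only."""
    text = re.sub(r"https?://\S+", " ", text)  # remove whole URLs first
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags such as <br />
    text = re.sub(r"[^A-Za-z]", " ", text)     # replace remaining non-letters
    return " ".join(text.lower().split())      # lowercase, collapse whitespace


sample = "Minor Spoilers<br /><br />See http://www.example.com now, 20 years on!"
cleaned = clean_review_text(sample)
print(cleaned)
```

The order matters: once non-letters are replaced, `<br />` has already become the bare token `br` and can no longer be told apart from ordinary text.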

Above we defined a function called **clean_string**. Below, it is applied to the raw `review` column, and the result is stored in a new column, **clean_review**.

In [None]:
df['clean_review']=df.review.apply(clean_string)                                  # Apply the cleaning function to every review


In [None]:
print("No. of samples:", len(df))
df.head()

No. of samples: 50000


Unnamed: 0,id,review,clean_review
0,"""9999_0""","""Watching Time Chasers, it obvious that it was...",watching time chasers it obvious that it was m...
1,"""45057_0""","""I saw this film about 20 years ago and rememb...",i saw this film about years ago and remember i...
2,"""15561_0""","""Minor Spoilers<br /><br />In New York, Joan B...",minor spoilers br br in new york joan barnard ...
3,"""7161_0""","""I went to see this film with a great deal of ...",i went to see this film with a great deal of e...
4,"""43971_0""","""Yes, I agree with everyone on this site this ...",yes i agree with everyone on this site this mo...


In the **clean_review** column, punctuation and digits are gone and all text is lower-case. Note that HTML line breaks (`<br />`) survive as the token `br`.

# Word2Vec with Gensim (the Word2Vec toolkit)

Gensim is an open-source Python library for natural language processing, with a focus on topic modeling. Gensim was developed and is maintained by the Czech natural language processing researcher **Radim Řehůřek** and his company RaRe Technologies.

It is not an everything-including-the-kitchen-sink NLP research library (like NLTK); instead, Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling. Most notably for this tutorial, it supports an implementation of the **Word2Vec word embedding** for learning new word vectors from text.

It also provides tools for loading pre-trained word embeddings in a few formats and for using and querying a loaded embedding.


### Objective

In this tutorial, we dig a little "deeper" into sentiment analysis. Google's Word2Vec is a deep-learning inspired method that focuses on the meaning of words. Word2Vec attempts to understand meaning and **semantic relationships** among words. It works in a way that is similar to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient. This tutorial focuses on Word2Vec for sentiment analysis.

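**Please install and import gensim every time you start a new Google Colab session.**

The code below uses the gensim 3.x API (for example the `size` argument and `model.wv.vocab`). Gensim 4 renamed these to `vector_size` and `model.wv.key_to_index`.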

In [None]:
import gensim

**Since we are going to work with words, each review must be split into word tokens.**

In [None]:
# Word tokenization: split each cleaned review on whitespace
Document = [doc.split() for doc in df['clean_review']]

In [None]:
len(Document)

50000
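
Python's `str.split` behaves differently with and without an explicit separator, which matters for tokenization whenever a string contains repeated or trailing spaces. A small illustration on a made-up string:

```python
text = "good  movie "  # made-up string with a double space and a trailing space

tokens_explicit = text.split(" ")  # splits on every single space, keeps empty strings
tokens_default = text.split()      # splits on runs of whitespace, drops empty strings

print(tokens_explicit)
print(tokens_default)
```

Empty strings produced by the explicit-separator form would be counted as a word by any model trained on the tokens, so the no-argument form is usually the safer choice.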

**Let us explore the tokenized reviews**

In [None]:
Document[10][6:13]                                                                # Tokens at positions 6 to 12 of the document at index 10

['movie', 'i', 'am', 'not', 'sure', 'whether', 'i']

In [None]:
print(len(Document[10]))                                                          # Length of the document at index 10: it has 524 words
print(Document[10])

524
['after', 'reading', 'the', 'comments', 'for', 'this', 'movie', 'i', 'am', 'not', 'sure', 'whether', 'i', 'should', 'be', 'angry', 'sad', 'or', 'sickened', 'seeing', 'comments', 'typical', 'of', 'people', 'who', 'a', 'know', 'absolutely', 'nothing', 'about', 'the', 'military', 'or', 'b', 'who', 'base', 'everything', 'they', 'think', 'they', 'know', 'on', 'movies', 'like', 'this', 'or', 'on', 'cnn', 'reports', 'about', 'abu', 'gharib', 'makes', 'me', 'wonder', 'about', 'the', 'state', 'of', 'intellectual', 'stimulation', 'in', 'the', 'world', 'br', 'br', 'at', 'the', 'time', 'i', 'type', 'this', 'the', 'number', 'of', 'people', 'in', 'the', 'us', 'military', 'million', 'on', 'active', 'duty', 'with', 'another', 'almost', 'in', 'the', 'guard', 'and', 'reserves', 'for', 'a', 'total', 'of', 'roughly', 'million', 'br', 'br', 'the', 'number', 'of', 'people', 'indicted', 'for', 'abuses', 'at', 'at', 'abu', 'gharib', 'currently', 'less', 'than', 'br', 'br', 'that', 'makes', 'the', 'total',

In [None]:
import logging                                                                    # Logging lets us follow the progress of Word2Vec training

In [None]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

model=gensim.models.Word2Vec(Document,                                           # List of tokenized reviews
                          min_count=10,                                          # Ignore words that appear fewer than 10 times in the corpus
                          workers=4,                                             # Number of worker threads (faster training on multicore machines)
                          size=50,                                               # Each word is represented by 50 numbers, i.e. the hidden layer has 50 neurons (gensim 4+ calls this vector_size)
                          window=5)                                              # Up to 5 neighbors on either side of a word

2021-04-09 09:23:30,430 : INFO : collecting all words and their counts
2021-04-09 09:23:30,432 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-04-09 09:23:31,009 : INFO : PROGRESS: at sentence #10000, processed 2399440 words, keeping 51654 word types
2021-04-09 09:23:31,622 : INFO : PROGRESS: at sentence #20000, processed 4835846 words, keeping 69077 word types
2021-04-09 09:23:32,196 : INFO : PROGRESS: at sentence #30000, processed 7267977 words, keeping 81515 word types
2021-04-09 09:23:32,789 : INFO : PROGRESS: at sentence #40000, processed 9669772 words, keeping 91685 word types
2021-04-09 09:23:33,370 : INFO : collected 100479 word types from a corpus of 12084660 raw words and 50000 sentences
2021-04-09 09:23:33,372 : INFO : Loading a fresh vocabulary
2021-04-09 09:23:33,781 : INFO : effective_min_count=10 retains 28322 unique words (28% of original 100479, drops 72157)
2021-04-09 09:23:33,785 : INFO : effective_min_count=10 leaves 11910457 word cor
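
The `window` argument controls which neighbouring words count as context for a given word. The following simplified sketch lists (center, context) pairs for a small window. `context_pairs` is a hypothetical helper, and real implementations add refinements (for example, varying the effective window size during training), so treat it as an illustration of the idea only.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)              # do not run past the start
        stop = min(len(tokens), i + window + 1)  # do not run past the end
        for j in range(start, stop):
            if j != i:                           # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["i", "am", "not", "sure", "whether"]
pairs = context_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

A larger window produces more pairs per word and tends to capture broader topical similarity, while a smaller one emphasises immediate neighbours.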

**Note that after training Word2Vec on the tokenized clean_review column with the arguments above, the vocabulary contains 28322 words.**

In [None]:
print(len(model.wv.vocab))                                                        # The vocabulary contains 28322 unique words

28322
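
The training log reports that `min_count=10` keeps 28322 of the 100479 distinct words. The pruning idea can be reproduced on toy data with a plain frequency count. The documents and the threshold of 2 below are made up for illustration.

```python
from collections import Counter

# Made-up tokenized documents.
toy_docs = [
    ["great", "movie", "great", "cast"],
    ["bad", "movie"],
    ["great", "plot"],
]
min_count = 2  # hypothetical threshold, much smaller than the 10 used above

# Count every token across all documents, then keep the frequent ones.
counts = Counter(token for doc in toy_docs for token in doc)
kept = sorted(word for word, count in counts.items() if count >= min_count)

print(counts)
print(kept)
```

Rare words have too few contexts to learn a reliable vector from, so dropping them shrinks the model with little loss.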


**Let's check the dimension of a vector, i.e. how many numbers represent each word.**

In [None]:
print(model.wv.vector_size)                                                       # Each word is a vector of 50 numbers, as set by the size argument

50


In [None]:
model.wv.vectors.shape                                                            # Shape of the embedding matrix: (vocabulary size, vector size)

(28322, 50)
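
As a rough back-of-the-envelope check, the size of this matrix follows directly from its shape and the `float32` values it stores. The figure covers only this one array, not the other weights or the vocabulary that a saved model also contains.

```python
vocab_size = 28322       # rows: one per word in the vocabulary
vector_size = 50         # columns: numbers per word
bytes_per_float32 = 4    # a float32 value occupies 4 bytes

total_values = vocab_size * vector_size
total_bytes = total_values * bytes_per_float32

print(total_values)                           # number of stored values
print(round(total_bytes / (1024 ** 2), 2))    # size in MiB
```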

In [None]:
model.wv['beautiful'].shape


(50,)

### Let's explore some interesting results of the Word2Vec experiment



In [None]:
model.wv.most_similar("beautiful")                                                # The 10 words most similar to "beautiful", scored by cosine similarity.
                                                                                  # Scores range from -1 (opposite direction) through 0 (unrelated) to 1 (same direction).

2021-04-09 09:26:13,854 : INFO : precomputing L2-norms of word weight vectors


[('gorgeous', 0.8588871359825134),
 ('lovely', 0.833928108215332),
 ('stunning', 0.8132712244987488),
 ('wonderful', 0.7337294816970825),
 ('haunting', 0.7228685021400452),
 ('breathtaking', 0.6874433755874634),
 ('delightful', 0.6828726530075073),
 ('fabulous', 0.6743853092193604),
 ('beauty', 0.6634806394577026),
 ('exquisite', 0.663422167301178)]
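
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch, the measure can be written in plain Python and tried on made-up two-dimensional vectors (real word vectors here have 50 dimensions):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction

print(same, orthogonal, opposite)
```

Because only the direction of the vectors matters, a word that occurs very often and a rare synonym can still score close to 1.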

In [None]:
model.wv.most_similar("princess")                                                  # The 10 most similar words, each with its similarity score

[('widow', 0.8474273085594177),
 ('maria', 0.8233842849731445),
 ('maid', 0.8153141736984253),
 ('nurse', 0.8119286894798279),
 ('prince', 0.801642894744873),
 ('daisy', 0.7929926514625549),
 ('queen', 0.7837755084037781),
 ('belle', 0.7829538583755493),
 ('servant', 0.7782474756240845),
 ('virgin', 0.7725341320037842)]

In [None]:
model.wv.doesnt_match("she richard talked to me in the evening publicly".split())         # Returns the word that fits least with the others; here it is "richard"


'richard'
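
An odd-one-out query of this kind can be understood as: normalise every word vector, average them, and return the word least similar to that average. The sketch below follows that idea on made-up two-dimensional vectors. `odd_one_out` is a hypothetical helper written for illustration, not gensim's code, and the toy numbers are not taken from the trained model.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(vectors):
    """Return the key whose unit vector is least aligned with the mean direction."""
    names = list(vectors)
    units = [unit(vectors[name]) for name in names]
    dim = len(units[0])
    mean = unit([sum(u[k] for u in units) / len(units) for k in range(dim)])
    scores = {
        name: sum(a * b for a, b in zip(u, mean))
        for name, u in zip(names, units)
    }
    return min(scores, key=scores.get)


# Made-up vectors: three point roughly the same way, one does not.
toy = {
    "she": [1.0, 0.1],
    "me": [0.9, 0.2],
    "talked": [1.0, 0.3],
    "richard": [-0.2, 1.0],
}
print(odd_one_out(toy))
```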

Below, the word **right** is represented by a dense 50-dimensional vector.

In [None]:
model.wv["right"]                                                                  # The word "right" is represented by a vector of 50 numbers.
                                                                                   # These numbers are the learned weights connecting the word to the 50 hidden-layer neurons.

array([ 0.6846868 , -0.7968005 , -0.6302324 , -3.394566  ,  0.7201638 ,
        1.1136231 ,  0.15784374,  2.5323277 ,  3.5603287 ,  0.59281504,
       -0.9188659 , -0.447955  ,  0.62603706, -0.03631172,  1.1669568 ,
        1.2037969 , -1.878727  ,  1.026431  ,  1.2620952 ,  0.87099075,
       -0.40098897, -2.2058098 , -2.5231655 , -0.3569199 , -1.5878537 ,
       -1.0750184 ,  0.15629587, -0.23911397, -1.4470414 , -0.1245377 ,
        1.136354  , -0.6609408 ,  0.5265115 ,  0.6114063 , -0.15428022,
       -0.1421413 , -0.569659  ,  2.165041  , -2.8660035 ,  0.5492531 ,
        0.93151784,  0.21207607,  2.0477517 ,  1.5115618 ,  0.76157683,
       -1.3251568 ,  0.78988296,  0.98710644,  1.5385407 ,  3.4042273 ],
      dtype=float32)

In [None]:
model.wv['great']

array([ 1.4401122 ,  0.76371634, -1.4465237 , -2.4345956 ,  1.4818008 ,
        2.3014462 ,  0.35223305, -0.28248826,  1.3010585 ,  0.46277562,
       -0.43015203,  0.64026016,  0.26170298,  0.63460237,  2.4360445 ,
        1.3040127 ,  1.5610905 , -2.5930145 , -0.68386453,  1.2764063 ,
       -2.0910609 , -0.79727757, -1.6921145 ,  0.81008387, -1.4320858 ,
        2.696001  , -0.83759457, -1.203519  ,  0.7344471 ,  4.284749  ,
       -0.97111064, -0.95781547,  5.1075478 , -0.13110286, -1.0417333 ,
        1.0495613 ,  0.5642384 ,  4.2558985 , -3.4282224 , -0.5562852 ,
        2.6257257 ,  0.06977661, -0.5748929 ,  2.0517044 ,  1.0178621 ,
       -0.41138136, -1.5831146 ,  2.879354  , -0.30539533,  2.2108846 ],
      dtype=float32)


## Saving the model

In [None]:
model.save("word2vec movie-50")                                                    # Save the model for later use.
                                                                                   # Google also publishes pre-trained Word2Vec models.

2020-01-19 05:41:08,396 : INFO : saving Word2Vec object under word2vec movie-50, separately None
2020-01-19 05:41:08,398 : INFO : not storing attribute vectors_norm
2020-01-19 05:41:08,404 : INFO : not storing attribute cum_table
2020-01-19 05:41:08,692 : INFO : saved word2vec movie-50
