<a href="https://colab.research.google.com/github/aishwikr/NLP/blob/master/NLP_using_Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Processing

In [1]:
# Data handling
import pandas as pd

# string manipulations
import string
import re

# contains the word2vec algorithm
import gensim

# nlp libs
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial
import numpy as np

# ignore the warnings
import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
# Read inputs (i.e. dataset) and create the list of documents
d1 = pd.read_csv("dataset_1.csv", index_col='id')
print("\nDimensions of Dataset:",d1.shape)
print(d1.head())

documents = []
for idx, doc in d1["text"].iteritems():
    doc = re.sub('<[^<]+?>', '', doc)
    table = str.maketrans({key: None for key in string.punctuation})
    translated = doc.translate(table)
    documents.append(translated)
    

print("\n\n")
print("Documents after html tag removal:\n[{}]".format(documents[0]))
# Generating the vocabulary, previously Gensim use to generate it automatically.
vocab = [s.split() for s in documents] ## Exercise: Use NLTK or Spacy tokenizer

doc_idx = 5
print("Tokens of document [{}]:\n".format(doc_idx))
print(vocab[doc_idx])


Dimensions of Dataset: (3126, 1)
                                                 text
id                                                   
0   <p>What does "backprop" mean? I've Googled it,...
1   <p>Does increasing the noise in data help to i...
2   <p>When you're writing your algorithm, how do ...
3   <p>I have a LEGO Mindstorms EV3 and I'm wonder...
4   <p>The intelligent agent definition of intelli...



Documents after html tag removal:
[What does backprop mean Ive Googled it but its showing backpropagation

Is the backprop term basically the same as backpropagation or does it have a different meaning
]
Tokens of document [5]:

['This', 'quote', 'by', 'Stephen', 'Hawking', 'has', 'been', 'in', 'headlines', 'for', 'quite', 'some', 'time', 'Artificial', 'Intelligence', 'could', 'wipe', 'out', 'humanity', 'when', 'it', 'gets', 'too', 'clever', 'as', 'humans', 'will', 'be', 'like', 'ants', 'Why', 'does', 'he', 'say', 'this', 'To', 'put', 'it', 'simply', 'in', 'layman', 'terms', 'wh

## TF-IDF Vector Representation of Documents

TF-IDF is an information retrieval technique that weighs a term’s frequency (TF) and its inverse document frequency (IDF). Each word or term has its respective TF and IDF score. The product of the TF and IDF scores of a term is called the TF*IDF weight of that term.
Put simply, the higher the TF*IDF score (weight), the rarer the term and vice versa.



TF-IDF Vecorizer also takes stopwords as an argument,yo can specify you list of stopwords and those words will be removed from the data 

In [0]:
stop_words = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(stop_words=stop_words,lowercase = False)
Doc_TFIDF_Vector = vectorizer.fit_transform(documents)
#Shape of the Document Vector
print("Shape of the TF-IDF Vector is - ")
Doc_TFIDF_Vector.shape

Shape of the TF-IDF Vector is - 


(3126, 27898)

### Visualizing TF-IDF Vector of the first document

In [0]:
import pandas as pd 
first_doc_vector = Doc_TFIDF_Vector[0]
# place tf-idf values in a pandas data frame
df = pd.DataFrame(first_doc_vector.T.todense(), index=vectorizer.get_feature_names(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

Unnamed: 0,tfidf
backprop,0.591263
backpropagation,0.450301
Googled,0.346861
showing,0.284609
basically,0.230987
meaning,0.218159
term,0.210052
mean,0.169560
Ive,0.156563
different,0.139041


In [0]:
#Vector shape of above document
df.shape

(27898, 1)

### Visualizing TF-IDF Vector representation of all documents

In [0]:
df = Doc_TFIDF_Vector.toarray()
Docvec = pd.DataFrame(df, columns=vectorizer.get_feature_names())
Docvec

Unnamed: 0,00,000,0000,00000000,000000000000,0000001,00000010,00000020,00000030,00000040,00000050,00000060,00000070,000001,0000010,000005,00001,0000100,0000111111111000,00002,000025,00005,00009,0001,00011110,0001123,0001169,000125,0002,0003,000321,0004,0005,00051,000524,00053,0005416,000577191,0007253,000925775,...,zeroth,zetabytes,zig,zillion,zip,zipinputs,zipped,zippiparams,zippolicysavedlogprobs,ziptargetangles,zj0,zj1,zlogvar,zlogvarhidden,zmean,zmeanhidden,zombies,zonal,zone,zones,zoo,zoom,zoomed,zoos,zright,zscore,zugzwangs,zwei,Épocas,ˆQN,Σai,α1,αex,αx,πθold,σw1x1,σz,ϵdecay,ϵgreedy,龍爭虎鬥
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [0]:
# To know which words the indices are referring to
indices  = vectorizer.get_feature_names()


In [0]:
def get_queryDocVector(querD):
    query_tfidf_vector = vectorizer.transform(querD)
    # place tf-idf values in a pandas data frame
    df = Doc_TFIDF_Vector.toarray()
    Queryvec = pd.DataFrame(df)
    return Queryvec

To calculate TF of a term pass a term of a document to the function below.
The argument doc takes tokens of the document as input

In [0]:
def TF(term,doc):
    doc_wordcount = {}
    filtered = [t for t in doc if not t in stop_words]
    print(filtered)
    for word in filtered:
        count = doc_wordcount.get(word,0)
        doc_wordcount[word] = count + 1
    print(doc_wordcount)
    return doc_wordcount[term]/len(doc)


To calculate IDF of a term pass a term of a document to the function below. The argument vocab takes list of tokens of all documents  as input

In [0]:
import math
def IDF(term,vocab):
    DF = 0
    for doc in vocab:
       if term in doc:
        DF = DF+1
    return math.log2(len(vocab)/DF)



To calculate IDF of a term pass a term of a document to the function below. The argument vocab takes list of tokens of all documents  as input

In [0]:
def TF_IDF(term, vocab):
    tf = TF(term, vocab[0])
    idf = IDF(term, vocab)
    return tf*idf



In [0]:
term = 'backpropagation'

In [0]:
tfidf = TF_IDF(term, vocab)
# Gives index of the term in TF_IDF document vector
indx = indices.index(term)
print(tfidf)
# Retrieves TF-IDF values of that term in a document
print(Docvec.iloc[0,indx])



['What', 'backprop', 'mean', 'Ive', 'Googled', 'showing', 'backpropagation', 'Is', 'backprop', 'term', 'basically', 'backpropagation', 'different', 'meaning']
{'What': 1, 'backprop': 2, 'mean': 1, 'Ive': 1, 'Googled': 1, 'showing': 1, 'backpropagation': 2, 'Is': 1, 'term': 1, 'basically': 1, 'different': 1, 'meaning': 1}
0.41227466249194156
0.4503011368762714


## **For query document(QD), finding the closest documents according to the documents vector representations**

In [0]:
query_doc = pd.read_csv("q_d.csv")
print(query_doc)

    id                                               text
0    1  <p>Obviously this is hypothetical, but is true...
1    2  <p>I'm curious about Artificial Intelligence. ...
2    3  <p>I've heard of AI that can solve math proble...
3    4  <p>I'm trying to gain some intuition beyond de...
4    5  <p>It seems that deep neural networks are maki...
5    6  <p>I'm a bit confused about the definition of ...
6    7  <p>Can one actually kill a machine? Not only d...
7    8  <p>Generally, people can be classified as aggr...
8    9  <p>Assuming mankind will eventually create art...
9   10  <p>AI death is still unclear a concept, as it ...
10  11  <p>Can self-driving cars deal with snow, heavy...
11  12  <p>Most of the people is trying to answer ques...


In [0]:
for query in query_doc.iloc[:, 1]:
    queryDoc = re.sub('<[^<]+?>', '', query)
    indx = documents.index(queryDoc)
    tempScoreList = []
    queryTIDFVec = Docvec.iloc[indx, :]
    for i in range(len(Docvec)):
        result = 1 - spatial.distance.cosine(queryTIDFVec, Docvec.iloc[i, :])
        tempScoreList.append(result)
    arr = np.array(tempScoreList)
    Top5docIndex = arr.argsort()[-5:][::-1]
    print("====================================================================================")
    print("Query Doc : ", queryDoc)
    print("====================================================================================")
    print("Most relevent Document 1 :", tempScoreList[Top5docIndex[0]], documents[Top5docIndex[0]])
    print("Most relevent Document 2 :", tempScoreList[Top5docIndex[1]], documents[Top5docIndex[1]])
    print("Most relevent Document 3 :", tempScoreList[Top5docIndex[2]], documents[Top5docIndex[2]])
    print("Most relevent Document 4 :", tempScoreList[Top5docIndex[3]], documents[Top5docIndex[3]])
    print("Most relevent Document 5 :", tempScoreList[Top5docIndex[4]], documents[Top5docIndex[4]])
    print("====================================================================================")


ValueError: ignored

In [0]:
for word in word_doc.iloc[:, 0]:
    print("=====================================")
    print("Query word :", str(word))
    print("=====================================")
    print("Similarity score and Words")
    print(w2v_model.wv.most_similar(positive=str(word)))

Query word : need
Similarity score and Words
[('wanted', 0.9845367074012756), ('get', 0.9826719760894775), ('expect', 0.9822689294815063), ('switch', 0.9804394841194153), ('configuration,', 0.9800515174865723), ('know,', 0.9770534634590149), ('anywhere', 0.9759050607681274), ('see', 0.974753201007843), ('seem', 0.9744973182678223), ('puzzle', 0.9739166498184204)]
Query word : networks
Similarity score and Words
[('networks,', 0.9927573800086975), ('networks?', 0.9824339151382446), ('networks.', 0.9757485389709473), ('nets.', 0.9662461280822754), ('learnt', 0.9549965262413025), ('network?', 0.9527811408042908), ('network', 0.9524372220039368), ('ranking', 0.9508349299430847), ('network.', 0.946325421333313), ('network,', 0.9433037638664246)]
Query word : artificial
Similarity score and Words
[('implementation', 0.9918379187583923), ('done', 0.9913190603256226), ('deal', 0.990505576133728), ('models', 0.9885209798812866), ('lot', 0.9870597124099731), ('examples', 0.9860020875930786), ('s

###Extras:
Task 5: Representing a document as the average of the vectors for the words that it contains.

In [0]:
word_doc = pd.read_csv("w_d.csv", index_col='id')
print(word_doc)

          text
id            
1         need
2     networks
3   artificial
4         game
5     possible
6   understand
7      example
8      problem
9        human
10        good


In [0]:
#pass the word2vec model trained above along with the document. Remember to pass the document after pre-processing
def get_mean_vector(word2vec_model, doc):
    # remove out-of-vocabulary words
    words = [word for word in doc if word in word2vec_model.wv.vocab]
    if len(words) >= 1:
        return np.mean(word2vec_model[words], axis=0)
    else:
        return []

- The following function call returns the average word vector for a single document. You've to repeat the same for all the documents in the corpus and find all similar docments as done in previous task.

In [0]:
vec = get_mean_vector(w2v_model, queryDoc)
print(vec)

[ 0.47703236 -0.21929555  0.3995885  -0.09952543 -0.44992185 -0.22234799
  1.6736174  -0.61966014  0.36841467 -0.80694157  0.60952157 -0.3281841
 -0.15494032 -0.4060407   0.17575401  0.3796392   0.32797995  1.1934158
 -0.80852795  0.21073261  0.54533285 -0.497017    0.05648672 -0.3234889
 -0.2711816   0.7113703  -0.43529156  0.7314049   1.1142087  -0.2520997
  0.39278498 -1.2586411  -0.41629076 -0.22663155  1.4499834  -0.29076865
 -0.1009325   0.1935157   0.3299008   0.03653309  0.3519862   0.62983453
  0.8645436  -0.14427772 -0.29531655  0.22189255  0.4769987  -0.26186493
 -1.1183815   0.3630509   2.2980194   0.02910349 -0.04526319  1.6317432
  0.11493681 -0.47510928  0.25154963  0.6644172  -0.73696196  0.6563686
  0.04603515  0.30947697  0.17366384 -0.10571729  0.37261674  0.08772404
 -1.4433769   0.0236024  -0.77642834 -0.37085894  1.0359833  -0.57590896
  0.79816073  0.37161687 -0.20915106 -2.6090138  -1.0103724   0.796883
 -0.15708397  0.47139907 -0.02072892 -0.12124224 -1.4402592

Task 6: Similarly for task 6, you've to find the weighted average of vectors of word. Think of a metric you could use for the same. Hint: tf-idf


## Natural Language Processing using Word2Vec

The idea behind Word2Vec is pretty simple. We’re making an assumption that the **meaning of a word can be inferred by the company it keeps**. This is analogous to the saying, “*show me your friends, and I’ll tell who you are*”. For example, the words shocked, appalled and astonished are usually used in a similar context.

In this hands-on session, you will learn how to use the [**Gensim**](https://radimrehurek.com/gensim/) implementation of Word2Vec (in python) and actually get it to work.

### Training the Word2Vec model

To train a Word2Vec model you need to pass all the documents. So, we are essentially passing a list of strings where each string within the main list contains the document. Gensim implementation needs vocabulary. And by vocabulary, I mean a set of unique words.

After building the vocabulary, we just need to call `train(...)`

*Behind the scenes*, we are actually training a **neural network with a single hidden layer**. But, we are actually not going to use the neural network after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn.

In [0]:
# First define the model
w2v_model = gensim.models.Word2Vec(vocab, size=300, window=10, min_count=1,
                                   workers=10)

# Now train the model
w2v_model.train(documents, total_examples=len(documents), epochs=10)
print("Training complete...")

Training complete...


### Understanding the parameters

> **size**: The size of the dense vector to represent each token (here word). If you have very limited data, then size should be a much smaller value. If you have lots of data, its good to experiment with various sizes. The standard size is usually `300`.

> **window**: The maximum distance between the target word and its neighboring words when considering context. If your neighbor's position is greater than the maximum window width to the left and the right, then, some neighbors are not considered as being related to the target word. In theory, a smaller window should give you terms that are more related. If you have lots of data, then the window size should not matter much, as long as its a decent sized window.

> **min_count**: Minimium frequency count of words. The model would ignore words that do not statisfy the min_count. Extremely infrequent words are usually unimportant, so its best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.

> **workers**: How many threads to use behind the scenes?

**Note:** You can experiment with these parameter values. Specially experiment with `size` parameter.

### Now, let's look at some outputs of our trained model

1.   This first example shows a simple case of generating a vector for a token which is present in the vocabulary.
2.   Secondly we will look up words similar to a word in the corpus. All we need to do here is to call the `most_similar()` function and provide the word as the positive example. This returns the top 10 similar words.

In [0]:
w1 = "backprop"
w2v_model.wv.most_similar (positive=w1)

  if np.issubdtype(vec.dtype, np.int):


[('comment', 0.9915459752082825),
 ('nothing', 0.9901270270347595),
 ('DL', 0.9890097975730896),
 ('maths', 0.9879535436630249),
 ('nice', 0.9868414998054504),
 ('solvers', 0.9856299161911011),
 ('who', 0.9853726029396057),
 ('artistic', 0.9852790236473083),
 ('Even', 0.9850383400917053),
 ('Yes', 0.9847003221511841)]

You can even specify several positive examples to get things that are related all of them in the provided context and provide negative examples to say what should not be considered as related.

In [0]:
pos_words = ["improve",'increasing']
neg_words = ['calculate']
w2v_model.wv.most_similar (positive=pos_words, negative=neg_words, topn=10)

  if np.issubdtype(vec.dtype, np.int):


[('sent', 0.965648353099823),
 ('close', 0.9633491635322571),
 ('always', 0.9618666768074036),
 ('again', 0.9615074396133423),
 ('gesture', 0.9605923295021057),
 ('decode', 0.9597782492637634),
 ('end', 0.9586308598518372),
 ('choose', 0.9581032991409302),
 ('death', 0.9576351046562195),
 ('Clearly', 0.9574530124664307)]

### Similarity between two words in the vocabulary
Use the Word2Vec model to return the similarity between two words that are present in the vocabulary. There are many similarity measurement metric, one of them is `cosine_similarity` which is used by Gensim internally.

Contextually related words should have higher similarity score and vice versa.

**Exercise:** Check what happens if two words should be close according to you.

In [0]:
w2v_model.wv.similarity(w1="learning", w2="improving")

  if np.issubdtype(vec.dtype, np.int):


0.68711406

### Exercise: Try changing the size parameter and analyze it's effects in similarity score and other functions.

In [0]:
# Hint: redefine the model with appropriate parameters and train:
word_vector_size = ???
w2v_model_new = gensim.models.Word2Vec(vocab, size=word_vector_size, window=10,
                                   min_count=1, workers=10)
w2v_model_new.train(documents, total_examples=len(documents), epochs=10)
print("Redefined training complete with [word_vector_size] = {}."
      .format(word_vector_size))

**Compare the similarity values of two models and let TA's know what you observed?**

In [0]:
## Your code here to compare 2 models based on word similarity and other attributes:

## Add any number of code blocks you deem necessary.


### Importing a pre-trained model like, "Glove" or "GoogleNews"


In [0]:
## Don't run, file not present.
from gensim.models.keyedvectors import KeyedVectors

pretrained_w2v_model = KeyedVectors.load_word2vec_format(
    join(model_dir,model_file_name + '.bin'),binary=True)

## For Each query word(QW), finding the closest words as per word2vec Representations

In [0]:
for word in word_doc.iloc[:, 0]:
    print("=====================================")
    print("Query word :", str(word))
    print("=====================================")
    print("Similarity score and Words")
    print(w2v_model.wv.most_similar(positive=str(word)))