### Use Word2Vec to train your own model on a dataset.

1) **Optional** - Find your own dataset of documents to train your model on. You will need a lot of data, so scraping it yourself is probably not realistic given the time constraints of this assignment. Try to find a dataset with more than 5,000 documents.

- If you can't find a dataset to use try this one: <https://www.kaggle.com/c/quora-question-pairs>

2) Clean/Tokenize the documents.

3) Vectorize the documents using Word2Vec and explore the results using each of the following at least once:

- your_model.wv.most_similar()
- your_model.wv.similarity()
- your_model.wv.doesnt_match()

In [3]:
import pandas as pd

##### Your Code Here #####
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train = train.dropna()

In [4]:
train.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [5]:
train = train.drop(['id','qid1','qid2','is_duplicate'], axis='columns')

In [6]:
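# join with a space so the last word of question1 does not fuse with the first word of question2
train['questions'] = train['question1'] + ' ' + train['question2']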
train.head()

Unnamed: 0,question1,question2,questions
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,"Which one dissolve in water quikly sugar, salt..."


In [7]:
train = train.drop(['question1', 'question2'], axis='columns')

In [8]:
train.head()

Unnamed: 0,questions
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [9]:
from nltk.corpus import stopwords
import string

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words (disabled here, so stop words stay in the training data)
    # stop_words = set(stopwords.words('english'))
    # tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

train['cleaned'] = train.questions.apply(clean_doc)
print(train.shape)
train.head()

(404287, 2)


Unnamed: 0,questions,cleaned
0,What is the step by step guide to invest in sh...,"[What, is, the, step, by, step, guide, to, inv..."
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,"[What, is, the, story, of, Kohinoor, KohiNoor,..."
2,How can I increase the speed of my internet co...,"[How, can, increase, the, speed, of, my, inter..."
3,Why am I mentally very lonely? How can I solve...,"[Why, am, mentally, very, lonely, How, can, so..."
4,"Which one dissolve in water quikly sugar, salt...","[Which, one, dissolve, in, water, quikly, suga..."
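
The `cleaned` column above keeps the original casing, so `What` and `what` would end up as separate vocabulary entries. If merging them is desirable (an assumption, not something the assignment requires), a lowercasing variant of `clean_doc` could look like the sketch below. The name `clean_doc_lower` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_doc_lower(doc):
    """Hypothetical variant of clean_doc that also lowercases tokens."""
    # lowercase first, then split on white space
    tokens = doc.lower().split()
    # strip punctuation from each token
    tokens = [w.translate(_PUNCT_TABLE) for w in tokens]
    # keep alphabetic tokens longer than one character
    return [w for w in tokens if w.isalpha() and len(w) > 1]

print(clean_doc_lower("What is the step by step guide? What IS it"))
# ['what', 'is', 'the', 'step', 'by', 'step', 'guide', 'what', 'is', 'it']
```

It can be applied the same way as the original, for example `train.questions.apply(clean_doc_lower)`.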


In [10]:
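from gensim.models import Word2Vec
# min_count=20 drops rare words, window=3 is the context size, negative=20 is the number of negative samples
# `size` is the gensim 3.x name for the embedding dimension; gensim 4.x renamed it to `vector_size`
w2v = Word2Vec(train.cleaned, min_count=20, window=3, size=300, negative=20)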

In [12]:
w2v.wv.most_similar('sugar', topn=15)

[('milk', 0.7301733493804932),
 ('caffeine', 0.7053263187408447),
 ('banana', 0.7019811868667603),
 ('cholesterol', 0.6990753412246704),
 ('bread', 0.6863564252853394),
 ('vitamin', 0.6849280595779419),
 ('cow', 0.6844279766082764),
 ('butter', 0.6700103282928467),
 ('rice', 0.6671006083488464),
 ('meat', 0.6637934446334839),
 ('fruit', 0.657090425491333),
 ('honey', 0.6517457962036133),
 ('corn', 0.6513240337371826),
 ('vinegar', 0.6485158205032349),
 ('seeds', 0.6471840143203735)]

In [14]:
w2v.wv.doesnt_match(['sugar', 'rice', 'vegetables'])

  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)


'vegetables'
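
To get a feel for what `doesnt_match` is doing, here is a rough pure-Python sketch of the idea as I understand it from gensim: normalize each word vector to unit length, average them, and return the word least similar to that average. The 2-D vectors are made up for illustration and are not taken from the trained model, and `odd_one_out` is a hypothetical helper name.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the key whose unit vector is least similar to the mean direction."""
    names = list(vectors)
    normed = [unit(vectors[n]) for n in names]
    dim = len(normed[0])
    # mean of the unit vectors, re-normalized to unit length
    mean = unit([sum(v[i] for v in normed) / len(normed) for i in range(dim)])
    # dot product with the mean direction equals cosine similarity here
    sims = [sum(a * b for a, b in zip(v, mean)) for v in normed]
    return names[sims.index(min(sims))]

# Toy vectors (assumed values, for illustration only).
toy = {'sugar': [1.0, 0.1], 'rice': [0.9, 0.2], 'vegetables': [0.1, 1.0]}
print(odd_one_out(toy))  # vegetables
```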

In [16]:
w2v.wv.similarity("vegetable","sugar")

0.5364637839008122
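
The number returned by `similarity` is the cosine similarity between the two word vectors. A minimal pure-Python sketch of that formula, using made-up vectors rather than the model's real 300-dimensional ones:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Values near 1 mean the two words appear in similar contexts in the training data. A score of about 0.54 for `vegetable` and `sugar` therefore indicates a moderate relationship.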

### Stretch Goals:

1) Use Doc2Vec to train a model on your dataset, then provide the model with a new document and let it find similar documents.

2) Download the pre-trained word vectors from Google. Access the pre-trained vectors via the following link: https://code.google.com/archive/p/word2vec

Load the pre-trained word vectors and train the Word2vec model

Examine the first 100 keys or words of the vocabulary

Output the vector representation for a select set of words - the words can be of your choice

Examine the similarity between words - the words can be of your choice

For example:

model.similarity('house', 'bungalow')

model.similarity('house', 'umbrella')
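
One possible way to approach the pre-trained vectors is sketched below. It assumes gensim is installed and that the downloaded archive has been unpacked to a binary file; the path in the usage comment is an assumption, and the helper names `load_pretrained` and `first_keys` are hypothetical. The vocabulary attribute is `index_to_key` in gensim 4.x and `index2word` in gensim 3.x, so the sketch checks for both.

```python
def load_pretrained(path, limit=200000):
    """Load word2vec-format binary vectors; `limit` caps how many words are read."""
    # imported lazily so the helper below can be used without gensim
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)

def first_keys(kv, n=100):
    """Return the first n vocabulary words of a KeyedVectors-like object."""
    keys = getattr(kv, "index_to_key", None)  # gensim 4.x
    if keys is None:
        keys = kv.index2word                  # gensim 3.x
    return list(keys[:n])

# Usage (not run here; the file name is an assumption):
# model = load_pretrained("GoogleNews-vectors-negative300.bin")
# print(first_keys(model, 100))
# print(model["house"])                      # vector for one word
# print(model.similarity("house", "bungalow"))
```

Loading with a `limit` keeps memory use manageable, since the full vector file is several gigabytes.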