### Use Word2Vec to train your own model on a dataset.

1) **Optional** - Find your own dataset of documents to train your model on. You are going to need a lot of data, so it's probably not realistic to scrape data for this assignment given the time constraints that we're working under. Try to find a dataset that has > 5000 documents.

- If you can't find a dataset to use try this one: <https://www.kaggle.com/c/quora-question-pairs>

2) Clean/Tokenize the documents.

3) Vectorize the documents using Word2Vec and explore the results using each of the following at least once:

- your_model.wv.most_similar()
- your_model.wv.similarity()
- your_model.wv.doesnt_match()
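
As a rough illustration of what the first two calls compute, the sketch below reimplements cosine similarity and a nearest-neighbour ranking in plain Python. The toy vectors and the helper names `cosine` and `most_similar` are made up for this example; gensim's own implementation is vectorized and differs in detail.

```python
import math

# Hypothetical 3-dimensional "embeddings", made up for illustration only.
toy_vectors = {
    'cat': [0.9, 0.1, 0.0],
    'dog': [0.8, 0.2, 0.1],
    'car': [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(cosine(toy_vectors['cat'], toy_vectors['dog']))
print(most_similar('cat', toy_vectors))
```

In this toy setup 'dog' ranks above 'car' for the query 'cat' because its vector points in nearly the same direction.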

# 1: Use the quora-question-pairs dataset
### The goal of this assignment is not to enter the Kaggle quora-question-pairs competition, but to investigate the data with various Word2Vec techniques

In [3]:
import pandas as pd
import numpy as np

from nltk import word_tokenize
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.models.word2vec import Word2Vec

In [4]:
train = pd.read_csv('/Users/samirgadkari/data/kaggle_competition_quora_question_pairs/train.csv', header=0)
print(train.shape)
train.head()

(404290, 6)


| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


### We are not using this dataset to solve a problem, only to get familiar with Word2Vec, so we will use just the question1 column

In [5]:
questions = train.question1
questions.shape

(404290,)

# 2: Clean/Tokenize the data

In [6]:
questions.isnull().sum()

1

In [7]:
questions = questions.dropna()

In [8]:
questions.isnull().sum()

0

In [9]:
lemmatizer = WordNetLemmatizer()

In [10]:
tokenized = [word_tokenize(q.lower()) for q in questions]
tokenized[:3]

[['what',
  'is',
  'the',
  'step',
  'by',
  'step',
  'guide',
  'to',
  'invest',
  'in',
  'share',
  'market',
  'in',
  'india',
  '?'],
 ['what',
  'is',
  'the',
  'story',
  'of',
  'kohinoor',
  '(',
  'koh-i-noor',
  ')',
  'diamond',
  '?'],
 ['how',
  'can',
  'i',
  'increase',
  'the',
  'speed',
  'of',
  'my',
  'internet',
  'connection',
  'while',
  'using',
  'a',
  'vpn',
  '?']]

In [11]:
all_alphabetic = [[w for w in q if w.isalpha()] for q in tokenized]
all_alphabetic[:3]

[['what',
  'is',
  'the',
  'step',
  'by',
  'step',
  'guide',
  'to',
  'invest',
  'in',
  'share',
  'market',
  'in',
  'india'],
 ['what', 'is', 'the', 'story', 'of', 'kohinoor', 'diamond'],
 ['how',
  'can',
  'i',
  'increase',
  'the',
  'speed',
  'of',
  'my',
  'internet',
  'connection',
  'while',
  'using',
  'a',
  'vpn']]

In [12]:
lemmatized = [[lemmatizer.lemmatize(tok) for tok in q] for q in all_alphabetic]
lemmatized[:3]

[['what',
  'is',
  'the',
  'step',
  'by',
  'step',
  'guide',
  'to',
  'invest',
  'in',
  'share',
  'market',
  'in',
  'india'],
 ['what', 'is', 'the', 'story', 'of', 'kohinoor', 'diamond'],
 ['how',
  'can',
  'i',
  'increase',
  'the',
  'speed',
  'of',
  'my',
  'internet',
  'connection',
  'while',
  'using',
  'a',
  'vpn']]
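
The `stopwords` import at the top of the notebook is never used. If stop-word removal were wanted, it could slot in after the alphabetic filter roughly as sketched below. The stop list here is a tiny hand-picked set chosen for this example, not NLTK's English list, and whether dropping stop words helps Word2Vec on this data is an open question rather than something this notebook tests.

```python
# Hypothetical, hand-picked stop list for illustration;
# NLTK's stopwords.words('english') is much longer.
STOP_WORDS = {'what', 'is', 'the', 'by', 'to', 'in', 'of',
              'how', 'can', 'i', 'my', 'a', 'while'}

def remove_stop_words(tokenized_docs, stop_words=STOP_WORDS):
    """Drop stop words from each tokenized document, keeping token order."""
    return [[tok for tok in doc if tok not in stop_words]
            for doc in tokenized_docs]

# Two of the lemmatized questions shown above.
sample = [
    ['what', 'is', 'the', 'story', 'of', 'kohinoor', 'diamond'],
    ['how', 'can', 'i', 'increase', 'the', 'speed', 'of', 'my',
     'internet', 'connection', 'while', 'using', 'a', 'vpn'],
]
filtered = remove_stop_words(sample)
print(filtered)
```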

In [13]:
# sentences = [' '.join(lemma) for lemma in lemmatized]
# sentences[:3]

# 3: Vectorize using Word2Vec

In [16]:
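def run_w2v_model(data, min_count, size):
    # Train a Word2Vec model on tokenized documents and report its vocabulary.
    # Note: `size` and `wv.vocab` are the gensim 3.x names; gensim 4 renamed
    # them to `vector_size` and `wv.key_to_index`.
    model = Word2Vec(data, min_count=min_count, size=size)
    print('model:', model)
    print('vocab len:', len(model.wv.vocab), 'vocab:', list(model.wv.vocab))
    return model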

In [17]:
model = run_w2v_model(lemmatized, 1, 5)

model: Word2Vec(vocab=54852, size=5, alpha=0.025)
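
The `size=5` in the output above indicates a gensim 3.x install. In gensim 4 the constructor argument is `vector_size`, and `model.wv.vocab` is replaced by `model.wv.key_to_index`. A minimal sketch of one way to keep the call working across both versions follows; the helper name `w2v_kwargs` is hypothetical and not part of gensim.

```python
def w2v_kwargs(gensim_version, min_count, size):
    """Build Word2Vec keyword arguments matching the gensim major version."""
    major = int(gensim_version.split('.')[0])
    # gensim 4.0 renamed the `size` argument to `vector_size`.
    dim_key = 'vector_size' if major >= 4 else 'size'
    return {'min_count': min_count, dim_key: size}

print(w2v_kwargs('3.8.3', 1, 5))
print(w2v_kwargs('4.3.2', 1, 5))

# Intended use (assumes gensim is installed and `lemmatized` exists):
#   import gensim
#   model = Word2Vec(lemmatized, **w2v_kwargs(gensim.__version__, 1, 5))
```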


In [19]:
model.wv['kohinoor']

array([ 0.2542926 ,  0.11173115,  0.27236062,  0.49554548, -0.28072038],
      dtype=float32)

In [21]:
model.wv.most_similar('kohinoor')

[('agnosticism', 0.9999101161956787),
 ('嘚瑟', 0.9993659257888794),
 ('oos', 0.9993642568588257),
 ('vallabhbhai', 0.9992117285728455),
 ('constitute', 0.999170184135437),
 ('marble', 0.998907744884491),
 ('defines', 0.9988858103752136),
 ('mixpanel', 0.9988610744476318),
 ('himalaya', 0.998708188533783),
 ('workplace', 0.9986512064933777)]

In [25]:
model.wv.most_similar('programming')

[('testifying', 0.9984339475631714),
 ('optionally', 0.9983837604522705),
 ('knowledge', 0.9976974129676819),
 ('jradiobutton', 0.9922666549682617),
 ('practice', 0.9903624057769775),
 ('grassroots', 0.9903255105018616),
 ('essay', 0.9899473786354065),
 ('largen', 0.9889476299285889),
 ('guoyu', 0.9886331558227539),
 ('useful', 0.9885057210922241)]
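
The neighbours above look unrelated, yet every score is close to 1. One possible contributing factor, not verified on this model, is the very small vector size: with only 5 dimensions there is little room to keep 54,852 words apart. The sketch below uses random Gaussian vectors (not the trained embeddings) to show that pairs of vectors tend to be more aligned in 5 dimensions than in 100. The function name `random_abs_cosines` is made up for this example.

```python
import math
import random

def random_abs_cosines(dim, n_pairs, rng):
    """Absolute cosine similarity for n_pairs pairs of random Gaussian vectors."""
    values = []
    for _ in range(n_pairs):
        u = [rng.gauss(0, 1) for _ in range(dim)]
        v = [rng.gauss(0, 1) for _ in range(dim)]
        dot = sum(a * b for a, b in zip(u, v))
        norm = (math.sqrt(sum(a * a for a in u))
                * math.sqrt(sum(b * b for b in v)))
        values.append(abs(dot / norm))
    return values

rng = random.Random(0)  # fixed seed so the run is repeatable
low = random_abs_cosines(5, 2000, rng)
high = random_abs_cosines(100, 2000, rng)
mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)
print('mean |cos| in 5 dims:  ', mean_low)
print('mean |cos| in 100 dims:', mean_high)
```

Retraining with a larger vector size and a higher `min_count` would be the natural experiment to check whether this explains the results here.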

In [26]:
model.wv.similarity('beneficial', 'facebook')

-0.1486107

In [27]:
model.wv.similarity('beneficial', 'considering')

0.8864524

In [29]:
model.wv.doesnt_match(['beneficial', 'considering', 'helpful', 'easygoing', 'facebook'])

'facebook'

In [31]:
model.wv.doesnt_match(['quality', 'craftsmanship', 'unique', 'beautiful', 'utilitarian'])

'quality'

### I was hoping it would pick 'utilitarian'. The model clearly gets this one wrong, possibly because 5-dimensional vectors are too small to separate these words.
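
To make the odd-one-out choice less of a black box, here is a plain-Python sketch of the idea as I understand it: normalize each word vector, average them, and return the word least aligned with that average direction. The toy 2-dimensional vectors are invented for this example, and gensim's actual implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def doesnt_match(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return min(zip(scores, words))[1]

# Hypothetical 2-dimensional vectors, made up for illustration only.
toy_vectors = {
    'apple': [0.9, 0.1],
    'pear': [0.8, 0.3],
    'plum': [0.85, 0.2],
    'truck': [0.1, 0.95],
}
print(doesnt_match(['apple', 'pear', 'plum', 'truck'], toy_vectors))
```

Because the answer depends only on directions in the embedding space, a model whose vectors carry little meaning will pick an essentially arbitrary word.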

### Stretch Goals:

1) Use Doc2Vec to train a model on your dataset, and then provide the model with a new document and let it find similar documents.

2) Download the pre-trained word vectors from Google. Access the pre-trained vectors via the following link: https://code.google.com/archive/p/word2vec

Load the pre-trained word vectors and train the Word2Vec model

Examine the first 100 keys or words of the vocabulary

Output the vector representation for a select set of words - the words can be of your choice

Examine the similarity between words - the words can be of your choice

For example:

model.similarity('house', 'bungalow')

model.similarity('house', 'umbrella')

### Working on part 2) Google pre-trained word vectors