### Use Word2Vec to train your own model on a dataset.

1) **Optional** - Find your own dataset of documents to train your model on. You are going to need a lot of data, so it's probably not realistic to scrape data for this assignment given the time constraints that we're working under. Try to find a dataset that has more than 5000 documents.

- If you can't find a dataset to use try this one: <https://www.kaggle.com/c/quora-question-pairs>

2) Clean/Tokenize the documents.

3) Vectorize the documents using Word2Vec and explore the results using each of the following at least once:

- your_model.wv.most_similar()
- your_model.wv.similarity()
- your_model.wv.doesnt_match()
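
As a rough picture of what these three calls compute, here is a minimal pure-Python sketch. All three are based on cosine similarity between word vectors. The toy 3-dimensional vectors and the function bodies are illustrative assumptions, not gensim's actual implementation, which works on learned vectors and is vectorized.

```python
import math

# Hypothetical toy "embeddings"; real Word2Vec vectors are learned and much larger.
toy_vectors = {
    "internet": [0.9, 0.1, 0.0],
    "wifi": [0.8, 0.2, 0.1],
    "phone": [0.7, 0.3, 0.2],
    "table": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity(w1, w2):
    """Analogue of wv.similarity: cosine similarity of two words."""
    return cosine(toy_vectors[w1], toy_vectors[w2])

def most_similar(word, topn=3):
    """Analogue of wv.most_similar: rank all other words by cosine similarity."""
    scores = [(other, similarity(word, other)) for other in toy_vectors if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

def doesnt_match(words):
    """Analogue of wv.doesnt_match: the word least similar to the group's mean direction."""
    unit = {}
    for w in words:
        length = math.sqrt(sum(a * a for a in toy_vectors[w]))
        unit[w] = [a / length for a in toy_vectors[w]]
    mean = [sum(column) / len(words) for column in zip(*unit.values())]
    return min(words, key=lambda w: cosine(unit[w], mean))

print(most_similar("internet"))
print(doesnt_match(["internet", "phone", "table"]))
```

With these toy vectors, "wifi" ranks first for "internet" and "table" is the odd one out, because its vector points in a different direction from the other two.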

In [1]:
pwd

'/home/mishraka/Documents/Manjula/Lambda_School/Assignments/Unit4/NLP'

In [2]:
ls

Copy_of_LS_DS_421_Text_Data_Assignment (1).ipynb
data.txt
LS_DS_421_Text_Data_Assignment.ipynb
LS_DS_421_Text_Data_Lecture.ipynb
LS_DS_422_BOW_Assignment.ipynb
LS_DS_423_Document_Classification_Assignment.ipynb
LS_DS_424_Word_Embeddings_Assignment.ipynb
sample_submission.csv
String_manipulation_practice_from_Monday.ipynb
test.csv
train.csv


In [30]:
import re
import string

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import sent_tokenize # Sentence Tokenizer
from nltk.tokenize import word_tokenize # Word Tokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.probability import FreqDist

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

[nltk_data] Downloading package punkt to /home/mishraka/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/mishraka/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [100]:
train_df = pd.read_csv('train.csv', encoding="ISO-8859-1")


In [101]:
print(train_df.shape)

(404290, 6)


In [102]:
train_df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [103]:
train_df1= train_df.copy()

In [104]:
train_df= train_df.drop(columns=['qid1', 'qid2', 'is_duplicate'])

In [105]:
df = train_df

In [106]:
df.isna().sum()



id           0
question1    1
question2    2
dtype: int64

In [107]:
df = df.dropna()

In [108]:
df.shape

(404287, 3)

In [109]:
df.isna().sum()

id           0
question1    0
question2    0
dtype: int64

In [110]:
df['question1'].head()

0    What is the step by step guide to invest in sh...
1    What is the story of Kohinoor (Koh-i-Noor) Dia...
2    How can I increase the speed of my internet co...
3    Why am I mentally very lonely? How can I solve...
4    Which one dissolve in water quikly sugar, salt...
Name: question1, dtype: object

In [111]:
### Clean the questions and tokenize them with NLTK

In [112]:
df['question1'] = df['question1'].apply(lambda x: re.sub(r'[^\w]', ' ', x))
df['question2'] = df['question2'].apply(lambda x: re.sub(r'[^\w]', ' ', x))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [113]:
df.head()

| | id | question1 | question2 |
|---|---|---|---|
| 0 | 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... |
| 1 | 1 | What is the story of Kohinoor Koh i Noor Dia... | What would happen if the Indian government sto... |
| 2 | 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... |
| 3 | 3 | Why am I mentally very lonely How can I solve... | Find the remainder when math 23 24 math i... |
| 4 | 4 | Which one dissolve in water quikly sugar salt... | Which fish would survive in salt water |


In [114]:
df['question2'].head()

0    What is the step by step guide to invest in sh...
1    What would happen if the Indian government sto...
2    How can Internet speed be increased by hacking...
3    Find the remainder when  math 23  24   math  i...
4              Which fish would survive in salt water 
Name: question2, dtype: object
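
The runs of extra spaces in the output above come from replacing every non-word character with a single space, one for one. A minimal sketch of that effect, with an optional whitespace-collapsing step that the notebook does not perform, might look like this. The sample string is hypothetical, not a row from the dataset.

```python
import re

# Hypothetical sample text containing the same kind of markup as the dataset.
raw = "Find [math]23^{24}[/math] now"

# Same substitution as in the notebook: every character that is not a letter,
# digit, or underscore becomes one space.
replaced = re.sub(r"[^\w]", " ", raw)

# Optional extra step (an assumption, not in the notebook): collapse whitespace runs.
collapsed = re.sub(r"\s+", " ", replaced).strip()

print(repr(replaced))
print(repr(collapsed))
```

Because `word_tokenize` splits on whitespace anyway, the extra spaces do not change the resulting tokens. Collapsing them only matters if the intermediate strings are inspected or stored.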

In [115]:
table = str.maketrans('', '', string.punctuation)
stop_words = set(stopwords.words('english'))

def nltk_tokenize(text):
    """Return the lowercase, alphabetic, non-stop-word tokens of `text`."""
    # Tokenize by word
    tokens = word_tokenize(text)
    # Make all words lowercase
    lowercase_tokens = [w.lower() for w in tokens]
    # Strip punctuation from within words
    no_punctuation = [w.translate(table) for w in lowercase_tokens]
    # Keep only purely alphabetic tokens (this also drops numbers)
    alphabetic = [w for w in no_punctuation if w.isalpha()]
    # Remove stopwords
    words = [w for w in alphabetic if w not in stop_words]
    return words
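
To make the order of the cleaning steps easier to follow, here is a minimal standard-library sketch of the same pipeline. It is an approximation under stated assumptions: `simple_tokenize` and `TOY_STOP_WORDS` are hypothetical names, whitespace splitting stands in for NLTK's `word_tokenize`, and the stop-word set is a tiny stand-in for NLTK's English list.

```python
import string

# Hypothetical, deliberately tiny stop-word set; NLTK's English list is much longer.
TOY_STOP_WORDS = {"what", "is", "the", "by", "to", "in"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def simple_tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, keep alphabetic non-stop-words."""
    tokens = text.lower().split()
    stripped = [t.translate(PUNCT_TABLE) for t in tokens]
    return [t for t in stripped if t.isalpha() and t not in TOY_STOP_WORDS]

print(simple_tokenize("What is the step by step guide to invest in share market in India?"))
# → ['step', 'step', 'guide', 'invest', 'share', 'market', 'india']
```

The `isalpha()` filter is why numbers disappear from the cleaned questions, which may or may not be what you want for a question dataset.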
    

In [116]:
df['question1_cleaned']=df['question1'].apply(nltk_tokenize)



In [117]:
df['question2_cleaned']=df['question2'].apply(nltk_tokenize)



In [118]:
df.head()

| | id | question1 | question2 | question1_cleaned | question2_cleaned |
|---|---|---|---|---|---|
| 0 | 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | [step, step, guide, invest, share, market, india] | [step, step, guide, invest, share, market] |
| 1 | 1 | What is the story of Kohinoor Koh i Noor Dia... | What would happen if the Indian government sto... | [story, kohinoor, koh, noor, diamond] | [would, happen, indian, government, stole, koh... |
| 2 | 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | [increase, speed, internet, connection, using,... | [internet, speed, increased, hacking, dns] |
| 3 | 3 | Why am I mentally very lonely How can I solve... | Find the remainder when math 23 24 math i... | [mentally, lonely, solve] | [find, remainder, math, math, divided] |
| 4 | 4 | Which one dissolve in water quikly sugar salt... | Which fish would survive in salt water | [one, dissolve, water, quikly, sugar, salt, me... | [fish, would, survive, salt, water] |


In [119]:
# Join each token list back into a single space-separated string
df['question1_vector'] = df['question1_cleaned'].apply(' '.join)



In [120]:
# Join each token list back into a single space-separated string
df['question2_vector'] = df['question2_cleaned'].apply(' '.join)



In [121]:
df.head()

| | id | question1 | question2 | question1_cleaned | question2_cleaned | question1_vector | question2_vector |
|---|---|---|---|---|---|---|---|
| 0 | 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | [step, step, guide, invest, share, market, india] | [step, step, guide, invest, share, market] | step step guide invest share market india | step step guide invest share market |
| 1 | 1 | What is the story of Kohinoor Koh i Noor Dia... | What would happen if the Indian government sto... | [story, kohinoor, koh, noor, diamond] | [would, happen, indian, government, stole, koh... | story kohinoor koh noor diamond | would happen indian government stole kohinoor ... |
| 2 | 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | [increase, speed, internet, connection, using,... | [internet, speed, increased, hacking, dns] | increase speed internet connection using vpn | internet speed increased hacking dns |
| 3 | 3 | Why am I mentally very lonely How can I solve... | Find the remainder when math 23 24 math i... | [mentally, lonely, solve] | [find, remainder, math, math, divided] | mentally lonely solve | find remainder math math divided |
| 4 | 4 | Which one dissolve in water quikly sugar salt... | Which fish would survive in salt water | [one, dissolve, water, quikly, sugar, salt, me... | [fish, would, survive, salt, water] | one dissolve water quikly sugar salt methane c... | fish would survive salt water |


In [126]:
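import gensim
from gensim.models import Word2Vec
# Train on the tokenized question1 column only.
# Written for gensim 3.x: in gensim >= 4.0 the `size` argument is named `vector_size`.
w2v = Word2Vec(df['question1_cleaned'], min_count=20, window=3, size=500, negative=20)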

#### your_model.wv.most_similar()
#### your_model.wv.similarity()
#### your_model.wv.doesnt_match()

In [127]:
words = list(w2v.wv.vocab)
print(f'Vocabulary Size: {len(words)}')

Vocabulary Size: 9707
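
The training and vocabulary cells above use the gensim 3.x API. In gensim 4.0 and later, `size` was renamed to `vector_size` and `wv.vocab` was replaced by `wv.key_to_index`. The helpers below are a hedged sketch of one way to write code that tolerates both. `dimension_kwarg` and `vocabulary_words` are hypothetical names introduced for illustration, not part of gensim.

```python
def dimension_kwarg(gensim_version, dim):
    """Return the embedding-size keyword argument for a given gensim version string."""
    major = int(gensim_version.split(".")[0])
    return {"vector_size": dim} if major >= 4 else {"size": dim}

def vocabulary_words(keyed_vectors):
    """List the vocabulary of a KeyedVectors-like object under either API."""
    if hasattr(keyed_vectors, "key_to_index"):  # gensim >= 4.0
        return list(keyed_vectors.key_to_index)
    return list(keyed_vectors.vocab)  # gensim 3.x

# Intended usage (requires gensim and the DataFrame built earlier, so left as comments):
# import gensim
# from gensim.models import Word2Vec
# w2v = Word2Vec(df['question1_cleaned'], min_count=20, window=3, negative=20,
#                **dimension_kwarg(gensim.__version__, 500))
# print(len(vocabulary_words(w2v.wv)))

print(dimension_kwarg("4.3.2", 500))
print(dimension_kwarg("3.8.3", 500))
```

Pinning a gensim version in the environment is a simpler alternative if the notebook only needs to run in one place.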


In [129]:
w2v.wv.most_similar('internet', topn=15)

[('wifi', 0.6335042715072632),
 ('vpn', 0.5967395305633545),
 ('tor', 0.5836657881736755),
 ('broadband', 0.5628470182418823),
 ('network', 0.5504523515701294),
 ('unlimited', 0.5464783906936646),
 ('mbps', 0.5295549631118774),
 ('surveys', 0.5169956684112549),
 ('connection', 0.4992689788341522),
 ('router', 0.49836266040802),
 ('torrent', 0.4979938566684723),
 ('browsing', 0.4973679482936859),
 ('torrents', 0.48768365383148193),
 ('downloading', 0.4820516109466553),
 ('itunes', 0.4775010943412781)]

In [134]:
w2v.wv.similarity("internet", 'itunes')

0.47750106

In [138]:
w2v.wv.doesnt_match(["internet", "phone", "table"])

'table'

### Stretch Goals:

1) Use Doc2Vec to train a model on your dataset, then give the model a new document and let it find similar documents.

2) Download the pre-trained word vectors from Google. Access the pre-trained vectors via the following link: https://code.google.com/archive/p/word2vec

Load the pre-trained word vectors and train the Word2Vec model.

Examine the first 100 keys (words) of the vocabulary.

Output the vector representation for a set of words of your choice.

Examine the similarity between words of your choice.

For example:

model.similarity('house', 'bungalow')

model.similarity('house', 'umbrella')
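
One possible starting point for the pre-trained vectors goal is sketched below. It assumes gensim is installed and that the downloaded archive has been unpacked to a local binary file. The path shown in the comments is an assumption about your setup, and `load_google_vectors` and `first_keys` are hypothetical helper names.

```python
def load_google_vectors(path):
    """Load word2vec-format binary vectors from `path` as gensim KeyedVectors."""
    from gensim.models import KeyedVectors  # imported lazily; requires gensim
    return KeyedVectors.load_word2vec_format(path, binary=True)

def first_keys(keyed_vectors, n=100):
    """Return the first n vocabulary words under either the gensim 4.x or 3.x API."""
    keys = getattr(keyed_vectors, "index_to_key", None)  # gensim >= 4.0
    if keys is None:
        keys = keyed_vectors.index2word  # gensim 3.x
    return list(keys[:n])

# Intended usage (needs the downloaded file, so left as comments):
# model = load_google_vectors("GoogleNews-vectors-negative300.bin")  # assumed local path
# print(first_keys(model, 100))
# print(model["house"])                       # vector for a single word
# print(model.similarity("house", "bungalow"))
```

Loading the full set of vectors takes several gigabytes of memory. `load_word2vec_format` also accepts a `limit` argument if you only want the most frequent words.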