### Use Word2Vec to train your own model on a dataset.

1) **Optional** - Find your own dataset of documents to train you model on. You are going to need a lot of data, so it's probably not realistic to scrape data for this assignment given the time constraints that we're working under. Try to find a dataset that has > 5000 documents.

- If you can't find a dataset to use try this one: <https://www.kaggle.com/c/quora-question-pairs>

2) Clean/Tokenize the documents.

3) Vectorize the model using Word2Vec and explore the results using each of the following at least one time:

- your_model.wv.most_similar()
- your_model.wv.similarity()
- your_model.wv.doesn't_match()

In [1]:
import gensim
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models.word2vec import Word2Vec
import string
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)



In [2]:
df = pd.read_csv('news_dataset.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,title,content,publication,label
0,0,Muslims BUSTED: They Stole Millions In Gov’t B...,Print They should pay all the back all the mon...,100percentfedup,fake
1,1,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,100percentfedup,fake
2,2,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,100percentfedup,fake
3,3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,100percentfedup,fake
4,4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,100percentfedup,fake


In [3]:
df = df.drop('Unnamed: 0', axis=1)
df.tail()

Unnamed: 0,title,content,publication,label
28706,An eavesdropping Uber driver saved his 16-year...,Uber driver Keith Avila picked up a p...,Washington Post,real
28707,Plane carrying six people returning from a Cav...,Crews on Friday continued to search L...,Washington Post,real
28708,After helping a fraction of homeowners expecte...,When the Obama administration announced a...,Washington Post,real
28709,"Yes, this is real: Michigan just banned bannin...",This story has been updated. A new law in...,Washington Post,real
28710,What happened in Washington state after voters...,The nation’s first recreational marijuana...,Washington Post,real


In [4]:
df.isna().sum()

title          680
content         46
publication      0
label            0
dtype: int64

In [5]:
df = df.dropna(subset=['content']).reset_index(drop=True)
df.isna().sum()

title          680
content          0
publication      0
label            0
dtype: int64

In [6]:
df['title_text'] = df['title'].fillna('') + df['content']
df = df.drop(columns=['title', 'content'])
df.head()

Unnamed: 0,publication,label,title_text
0,100percentfedup,fake,Muslims BUSTED: They Stole Millions In Gov’t B...
1,100percentfedup,fake,Re: Why Did Attorney General Loretta Lynch Ple...
2,100percentfedup,fake,BREAKING: Weiner Cooperating With FBI On Hilla...
3,100percentfedup,fake,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...
4,100percentfedup,fake,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...


In [7]:
df.label.value_counts()

real    15712
fake    12953
Name: label, dtype: int64

In [8]:
df.isna().sum()

publication    0
label          0
title_text     0
dtype: int64

In [9]:
df.title_text = df.title_text.str.lower()
df.tail()

Unnamed: 0,publication,label,title_text
28660,Washington Post,real,an eavesdropping uber driver saved his 16-year...
28661,Washington Post,real,plane carrying six people returning from a cav...
28662,Washington Post,real,after helping a fraction of homeowners expecte...
28663,Washington Post,real,"yes, this is real: michigan just banned bannin..."
28664,Washington Post,real,what happened in washington state after voters...


In [10]:
# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

df['cleaned'] = df.title_text.apply(clean_doc)
print(df.shape)
df.head()

(28665, 4)


Unnamed: 0,publication,label,title_text,cleaned
0,100percentfedup,fake,muslims busted: they stole millions in gov’t b...,"[muslims, busted, they, stole, millions, in, b..."
1,100percentfedup,fake,re: why did attorney general loretta lynch ple...,"[re, why, did, attorney, general, loretta, lyn..."
2,100percentfedup,fake,breaking: weiner cooperating with fbi on hilla...,"[breaking, weiner, cooperating, with, fbi, on,..."
3,100percentfedup,fake,pin drop speech by father of daughter kidnappe...,"[pin, drop, speech, by, father, of, daughter, ..."
4,100percentfedup,fake,fantastic! trump's 7 point plan to reform heal...,"[fantastic, trumps, point, plan, to, reform, h..."


In [11]:
w2v = Word2Vec(df.cleaned, min_count=20, window=3, size=300, negative=20)

In [12]:
words = list(w2v.wv.vocab)
print(f'Vocabulary Size: {len(words)}')

Vocabulary Size: 31065


In [13]:
w2v.wv.most_similar('hillary', topn=15)

[('mrs', 0.7284717559814453),
 ('hilary', 0.592483401298523),
 ('trump', 0.5030040740966797),
 ('chelsea', 0.4897885322570801),
 ('hrc', 0.46649107336997986),
 ('killary', 0.44560566544532776),
 ('sanders', 0.4441118836402893),
 ('hitlery', 0.4227789640426636),
 ('candidate', 0.39056992530822754),
 ('illegitimacy', 0.38745203614234924),
 ('blumenthal', 0.3853873610496521),
 ('snipes', 0.38438570499420166),
 ('foundation', 0.3803062438964844),
 ('ppp', 0.3782358169555664),
 ('hillarys', 0.37600916624069214)]

In [14]:
w2v.wv.most_similar('trump', topn=15)

[('trumps', 0.598925769329071),
 ('rumsfeld', 0.5547703504562378),
 ('pence', 0.505335807800293),
 ('hillary', 0.5030041933059692),
 ('obama', 0.4950125515460968),
 ('he', 0.49073681235313416),
 ('renzi', 0.48838692903518677),
 ('clinton', 0.476057767868042),
 ('candidacy', 0.4716278612613678),
 ('presidentelect', 0.4615992605686188),
 ('netanyahu', 0.4477747082710266),
 ('duck', 0.4463779926300049),
 ('conway', 0.44119954109191895),
 ('romney', 0.4337351322174072),
 ('nominee', 0.43239378929138184)]

In [15]:
w2v.wv.most_similar(positive=["fbi"], topn=15)

[('comey', 0.6401978731155396),
 ('doj', 0.5706790685653687),
 ('cia', 0.5471445918083191),
 ('dea', 0.5226351022720337),
 ('bureau', 0.4935360848903656),
 ('nypd', 0.4841116666793823),
 ('pentagon', 0.4792800545692444),
 ('nsa', 0.46470406651496887),
 ('weiner', 0.45316457748413086),
 ('fbis', 0.4505649209022522),
 ('investigation', 0.4398224353790283),
 ('irs', 0.43377307057380676),
 ('fdic', 0.4272654056549072),
 ('wada', 0.4242689609527588),
 ('clapper', 0.41915765404701233)]

In [16]:
w2v.wv.most_similar('fbi', topn=15)

[('comey', 0.6401978731155396),
 ('doj', 0.5706790685653687),
 ('cia', 0.5471445918083191),
 ('dea', 0.5226351022720337),
 ('bureau', 0.4935360848903656),
 ('nypd', 0.4841116666793823),
 ('pentagon', 0.4792800545692444),
 ('nsa', 0.46470406651496887),
 ('weiner', 0.45316457748413086),
 ('fbis', 0.4505649209022522),
 ('investigation', 0.4398224353790283),
 ('irs', 0.43377307057380676),
 ('fdic', 0.4272654056549072),
 ('wada', 0.4242689609527588),
 ('clapper', 0.41915765404701233)]

In [17]:
w2v.wv.doesnt_match(['hillary', 'trump', 'sanders'])

  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)


'sanders'

In [18]:
w2v.wv.doesnt_match(['fbi', 'cia', 'nypd'])

'nypd'

In [19]:
w2v.wv.similarity("fbi","cia")

0.5471445972638371

### Stretch Goals:

1) Use Doc2Vec to train a model on your dataset, and then provide model with a new document and let it find similar documents.

2) Download the pre-trained word vectors from Google. Access the pre-trained vectors via the following link: https://code.google.com/archive/p/word2vec

Load the pre-trained word vectors and train the Word2vec model

Examine the first 100 keys or words of the vocabulary

Outputs the vector representation for a select set of words - the words can be of your choice

Examine the similarity between words - the words can be of your choice

For example:

model.similarity('house', 'bungalow')

model.similarity('house', 'umbrella')