# Preprocessing, Word2Vec Model Training and Performance Evaluation

* Part 1: Pre-processing: This code takes in a text file with the text data (specifically written for NYT text data in txt format from Lexis Nexis), cleans it, and splits it into tokenized sentences

* Part 2: Train a Word2Vec model on this text data

* Part 3: Get accuracy scores on a trained Word2Vec model based on the Google Analogy Test

* Part 4: Quick Exploration of a Word2Vec model

Along the way there are suggestions for a clean dataset if you want to jump straight to Part 2 (training), and for a pre-trained model if you want to jump straight to model exploration (Part 4).

In [3]:
import re    
from nltk.corpus import stopwords
import cython #ENSURE cython package is installed on computer/canopy
import sklearn.cluster
import numpy as np
from gensim.models import word2vec
from gensim.models import phrases 
np.set_printoptions(threshold=np.inf) #set to print full output

## Part 1: Pre-Processing 

In [2]:
#load the text file containing the text data you want to train the Word2Vec model on

fname= "dataALLYEARS.txt"
lnraw=open(fname, encoding="utf8").read() #read the whole file into one string

In [3]:
lnraw2= lnraw[3:] #this is for my data, get rid of extra stuff at beginning of each lexis nexis file

In [4]:
#TO MAKE SURE TEXT FILE LOADED CORRECTLY:
len(lnraw2)
#lnraw2= lnraw2[:500000] #for testing code, shorten number characters

In [4]:
#replace these punctuation marks with periods so that they all act as sentence delimiters. Double check this works on a variety of punctuation. Also recall the goop at the end of each article: how will this affect it?
lnrawtrial2 = re.sub(r'[;?:!"]', '.', lnraw2) #square brackets make a character class, so each mark is replaced wherever it occurs

In [5]:
lnrawtrial2 = re.sub(r'[^\w\s.]',' ', lnrawtrial2)  #remove all punctuation except periods, and replace with spaces
lnrawtrial2= lnrawtrial2.lower()  #make all letters lowercase
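
The two substitutions are easy to check on a toy string. The sketch below is illustrative only: `normalize_delimiters` and `sample` are hypothetical names that do not appear elsewhere in this notebook. It also shows that a pattern written without square brackets is read as one literal sequence, so it matches nothing in ordinary text.

```python
import re

def normalize_delimiters(text):
    """Turn ; ? : ! and double quotes into periods, blank out other punctuation, lowercase."""
    text = re.sub(r'[;?:!"]', '.', text)   # character class: matches any ONE of these marks
    text = re.sub(r'[^\w\s.]', ' ', text)  # anything that is not a word char, whitespace or period
    return text.lower()

sample = 'Is it new? Yes! He said: "fine"; done.'
cleaned = normalize_delimiters(sample)
print(cleaned)

# Without brackets the pattern means "optional ; followed by the exact sequence :!" plus a quote",
# which does not occur in the sample, so nothing is replaced.
unchanged = re.sub(';?:!"', '.', sample)
print(unchanged == sample)
```

Doubled periods (as after a closing quote) produce empty sentences later on, which is why empty strings have to be filtered out when splitting into words.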

Remove Stopwords? (Optional, not done for paper)

In [None]:
#remove stopwords... sometimes better not to do this; it's subjective. See: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors
lnrawtrial3= lnrawtrial2.split()
cachedStopWords = set(stopwords.words("english")) #a set makes the membership test below much faster than a list
lnrawtrial4 = ' '.join([word for word in lnrawtrial3 if word not in cachedStopWords])

#note that lnrawtrial4 is one long string again, so split it into words again:
lnrawtrial5= lnrawtrial4.split()
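
A minimal sketch of the same filtering step on a toy sentence, assuming a tiny hand-made stop list (`STOP_WORDS` and `remove_stopwords` are hypothetical; the notebook itself uses NLTK's English list). Note that a token with a period still attached, such as `high.`, is compared as-is, so a stopword that ends a sentence would not be removed.

```python
# Hypothetical mini stop list for illustration; not the NLTK list used in the notebook.
STOP_WORDS = frozenset({"the", "a", "of", "and", "is"})

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that is exactly a stop word."""
    return ' '.join(word for word in text.split() if word not in stop_words)

filtered = remove_stopwords("the cost of the drug is high. a trial is planned.")
print(filtered)
```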

If not removing stopwords, skip right to here:

In [6]:
#if not removing stopwords/rare words do this (if you did remove stopwords, use lnrawtrial4 here instead)
lnrawtrial7=lnrawtrial2 

#NOW, regardless of stopwords/rarewords:

#last, split on periods to turn into a "sentences" object. obj= [["w1sent1", "w2sent1"], ["w1sent2", "w2sent2", "w3sent2"]]
lnrawtrial8= lnrawtrial7.split('.') #there are also other ways to split into sentences, such as also splitting on ! ? ; :

lnrawtrial9 = [None] * len(lnrawtrial8)
for sent in range(1,len(lnrawtrial8)):
    lnrawtrial9[sent]=list(filter(None, lnrawtrial8[sent].split(" "))) #filter removes the '' strings left over after splitting the sentence into words; in py 3 wrap with list(), not needed in py 2

lnrawtrial10=lnrawtrial9[1:] #drop element 0, which the loop above skips (stubborn space at beginning of my data)
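
The split-on-periods, then split-on-spaces logic can be sketched as one small function. This is an illustrative variant, not the cell above: `to_sentences` is a hypothetical name, it uses `str.split()` with no argument (which splits on any whitespace, including newlines), and it drops empty sentences instead of keeping them as empty lists.

```python
def to_sentences(text):
    """Split cleaned text on periods, then split each sentence into word tokens."""
    sentences = []
    for chunk in text.split('.'):
        tokens = chunk.split()  # no argument: splits on any whitespace and never yields ''
        if tokens:              # skip chunks that held only whitespace (e.g. after '..')
            sentences.append(tokens)
    return sentences

example = to_sentences("obesity rates rose. doctors  disagreed.. ")
print(example)
```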

To save your cleaned, tokenized text data for later:

In [7]:
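with open('alldat_NYT_tokenized.txt','w', encoding="utf8") as f: #'w' overwrites the file, so re-running this cell does not append duplicate copies
    f.writelines(["%s\n" % ' '.join(item) for item in lnrawtrial10]) #one sentence per line, words separated by spaces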
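
A small round-trip sketch, assuming the tokenized sentences are stored one sentence per line with words separated by spaces (the layout that gensim's `LineSentence` reader expects). The file name and the toy `sentences` list are made up for illustration.

```python
import os
import tempfile

sentences = [["obesity", "rates", "rose"], ["doctors", "disagreed"]]

# Hypothetical file in a temporary directory, so nothing in the working directory is touched.
path = os.path.join(tempfile.mkdtemp(), "tokenized_example.txt")

with open(path, "w", encoding="utf8") as out:
    for tokens in sentences:
        out.write(" ".join(tokens) + "\n")

# Reading back: each line becomes one list of word tokens again.
with open(path, encoding="utf8") as src:
    reloaded = [line.split() for line in src]

print(reloaded == sentences)
```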

In [None]:
#save some space if you ran the pre-processing code
del lnraw
del lnrawtrial2
#del lnrawtrial3
#del lnrawtrial4
#del lnrawtrial5
#del lnrawtrial6
del lnrawtrial7
del lnrawtrial8
del lnrawtrial9

## Part 2: Train a Word2Vec Model, Explore Hyperparameters

#### Don't have a dataset? Gensim has a few free suggestions, such as the Text8 Corpus:
* https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Text8Corpus

#### Here's a possible baseline model to get to know arguments:

Word2Vec in Gensim Documentation/Tutorial: https://radimrehurek.com/gensim/models/word2vec.html

Notes on a few hyperparameters:
* num_features (`size` in the code below) = word vector dimensionality. Usually ranges from 50 to 500, with gains in model performance diminishing after 300.
* min_word_count (`min_count`) = minimum word count. Any word that does not occur at least this many times across all documents is ignored. I use 40 after seeing this done in Kaggle code.
* num_workers (`workers`) = number of threads to run in parallel; I set it to 4. This SERIOUSLY speeds up training, but to use it make sure Cython is installed.
* context (`window`) = context window size, 10 in the Kaggle code. A larger window is more about topical similarity, a smaller one is more about semantic similarity.
* downsampling (`sample`) = downsampling setting for frequent words.
* seedie (`seed`) = a seed you can set for reproducibility. Note that you also cannot use multiple workers if you want a fully reproducible model.
* sg = learning architecture: skip-gram (1) or CBOW (0)
* hs = training algorithm to speed up computations: hierarchical softmax (1) or negative sampling (0)
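
The names in the notes above are descriptive labels, while the training cells below pass gensim keyword arguments. A minimal sketch of that correspondence, using the Model A values from this notebook; `NOTE_TO_KWARG` and `to_word2vec_kwargs` are hypothetical helpers, and the keyword names are those of the older gensim releases this notebook was written against (gensim 4 renamed `size` to `vector_size`).

```python
# Label used in the notes -> keyword argument used in the training cells.
NOTE_TO_KWARG = {
    "num_features": "size",
    "min_word_count": "min_count",
    "num_workers": "workers",
    "context": "window",
    "downsampling": "sample",
    "seedie": "seed",
    "sg": "sg",
    "hs": "hs",
}

def to_word2vec_kwargs(settings):
    """Translate a dict keyed by the note labels into keyword arguments."""
    return {NOTE_TO_KWARG[name]: value for name, value in settings.items()}

# Model A settings as they appear in the training cell.
model_a_settings = {
    "num_features": 100,
    "min_word_count": 40,
    "num_workers": 4,
    "context": 5,
    "downsampling": 1e-3,
    "sg": 0,
}

kwargs = to_word2vec_kwargs(model_a_settings)
print(kwargs)
```

The resulting dict could then be unpacked into the constructor call with `**kwargs`.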

#### Train Models with Four Combos of Hyperparameters

After trying out the four possible combinations on my training data, I found that Model A performed best, so I used its parameters for my final model.

In [12]:
bigram_transformer = phrases.Phrases(lnrawtrial10) #lnrawtrial10 is my "sentences" object with the cleaned text data. do this first before training model if you want to allow for bigrams in data (two word expressions, rather than only one word)

In [41]:
print("Training model...") 
modelA_ALLYEARS= word2vec.Word2Vec(bigram_transformer[lnrawtrial10], workers=4, sg=0,
                          size=100, min_count=40,
                          window=5, sample=1e-3)
print("Done")

Training model...
Done


In [31]:
modelB_ALLYEARS= word2vec.Word2Vec(bigram_transformer[lnrawtrial10], workers=4, sg=1,
                          size=500, min_count=40,
                          window=10, sample=1e-3)
print("Done")

Done


In [38]:
modelC_ALLYEARS= word2vec.Word2Vec(bigram_transformer[lnrawtrial10], workers=4, sg=0, hs=1,
                          size=500, min_count=40,
                          window=10, sample=1e-3)
print("Done")

Done


In [11]:
modelD_ALLYEARS= word2vec.Word2Vec(bigram_transformer[lnrawtrial10], workers=4, sg=1, hs=1,
                          size=500, min_count=40,
                          window=10, sample=1e-3)
print("Done")

Done


In [42]:
modelA_ALLYEARS.init_sims(replace=True) #Precompute L2-normalized vectors. If replace is set to True, forget the original vectors and only keep the normalized ones = saves lots of memory, but can't continue to train the model.

In [43]:
modelA_ALLYEARS.save("your_model_name_here")

## Part 3: Word2Vec Model Accuracy

Load a trained model you want to evaluate:

In [9]:
from gensim.models import Word2Vec
currentmodel=  Word2Vec.load("modelA_ALLYEARS_500dim_10CW")

#### Don't have a model? Use a pretrained Word2Vec Model from Google, trained on Google News
* Read and download here: https://code.google.com/archive/p/word2vec/
* File is called "GoogleNews-vectors-negative300.bin.gz"

In [10]:
accuracy=currentmodel.accuracy('questions_words_pasted.txt') #FOR THIS TO WORK, NEED TO HAVE THIS TXT FILE IN WORKING DIRECTORY

In [11]:
#accuracy is a list with one dict per section of the analogy file, in file order, plus a final "total" entry
#(section names in the comments assume the standard Google analogy file)
world_capitals1= accuracy[0] #capital-common-countries
world_capitals2= accuracy[1] #capital-world
money= accuracy[2] #currency
US_capitals= accuracy[3] #city-in-state
family= accuracy[4]
adj_to_adverbs= accuracy[5]
opposites= accuracy[6]
comparative= accuracy[7]
superlative= accuracy[8]
present_participle= accuracy[9]
nationality= accuracy[10]
past_tense= accuracy[11]
plural = accuracy[12]
plural_verbs= accuracy[13]
overall= accuracy[14] #the last entry pools all sections ("total"); it is not a third capitals section

In [12]:
for i in accuracy:
    sum_corr = len(i['correct'])
    sum_incorr = len(i['incorrect'])
    total = sum_corr + sum_incorr
    print(float(sum_corr)/(total))

#print('Total sentences: {}, Correct: {:.2f}%, Incorrect: {:.2f}%'.format(total, percent(sum_corr), percent(sum_incorr)))

#print sum_corr[1]
#print sum_incorr[1]
#print total

0.5684210526315789
0.4547244094488189
0.1
0.1790794979079498
0.8973684210526316
0.1431451612903226
0.29210526315789476
0.8536036036036037
0.7210144927536232
0.7149425287356321
0.5718206770356816
0.6518218623481782
0.6857142857142857
0.7321937321937322
0.5711067051189618
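
The bare numbers above are easier to read with their section names attached. A minimal sketch, assuming each entry of the accuracy list is a dict with `'section'`, `'correct'` and `'incorrect'` keys; `toy_accuracy` is made-up data standing in for the real result, and `section_scores` is a hypothetical helper.

```python
def section_scores(sections):
    """Map each section name to its fraction of correct analogies (None if the section is empty)."""
    scores = {}
    for sec in sections:
        n_correct = len(sec["correct"])
        total = n_correct + len(sec["incorrect"])
        scores[sec["section"]] = n_correct / total if total else None
    return scores

# Made-up stand-in for the list returned by the accuracy call.
toy_accuracy = [
    {"section": "family",
     "correct": [("boy", "girl", "man", "woman")] * 3,
     "incorrect": [("king", "queen", "uncle", "aunt")]},
    {"section": "currency", "correct": [], "incorrect": []},
    {"section": "total",
     "correct": [("boy", "girl", "man", "woman")] * 3,
     "incorrect": [("king", "queen", "uncle", "aunt")]},
]

scores = section_scores(toy_accuracy)
for name, score in scores.items():
    print(name, score)
```

Guarding against an empty section avoids a division by zero when none of a section's words are in the model's vocabulary.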


## Part 4: Word2Vec Model Exploring

Load a trained model you want to explore:

In [None]:
from gensim.models import Word2Vec
currentmodel=  Word2Vec.load("modelA_ALLYEARS_500dim_10CW")

#### Don't have a model? Use a pretrained Word2Vec Model from Google, trained on Google News
* Read and download here: https://code.google.com/archive/p/word2vec/
* File is called "GoogleNews-vectors-negative300.bin.gz"

In [228]:
from gensim import corpora, models, similarities #calc all similarities at once, from http://radimrehurek.com/gensim/tut3.html
from sklearn.metrics.pairwise import cosine_similarity
from scipy import spatial
import numpy as np

In [12]:
currentmodel['woman'] #this gives the word-vector for 'woman'

In [12]:
result = 1 - spatial.distance.cosine(currentmodel['computer'], currentmodel['software']) #calculate cosine similarity more manually; double-checked against model.most_similar and results match
print(result)

0.754645441633
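
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch on toy vectors (`cosine_sim` is a hypothetical helper, not part of gensim or scipy):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
sim_orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
print(sim_same_direction, sim_orthogonal)
```

Because only the angle matters, scaling a vector does not change its similarity to any other vector.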


In [14]:
currentmodel.most_similar('overweight', topn=5) #asks which words are closest to overweight

[('obese', 0.8001347780227661),
 ('underweight', 0.6493887901306152),
 ('normal_weight', 0.5515424609184265),
 ('being_overweight', 0.5171246528625488),
 ('anorexic', 0.5130723714828491)]

In [17]:
currentmodel.most_similar(negative=['big']) #asks which words are FARTHEST from big; doesn't work so well. Shows how distance can break down as a meaningful measure in this model.

[('lab_tests', 0.24826177954673767),
 ('sterilization', 0.2312188297510147),
 ('endocrinologist', 0.2287030965089798),
 ('bouvia', 0.22845697402954102),
 ('auschwitz', 0.2241036295890808),
 ('legal_rights', 0.22197261452674866),
 ('antidepressant_drugs', 0.21886323392391205),
 ('pliny', 0.21702751517295837),
 ('medically', 0.21255454421043396),
 ('were_randomly', 0.206563338637352)]

In [20]:
print(currentmodel.most_similar(positive=['man', 'obesity'], negative=['woman'])) #woman:obesity as man:_?___
print("break")
print(currentmodel.most_similar(positive=['woman', 'obesity'], negative=['man'])) #man:obesity as woman:_?___

[('hypertension', 0.5848986506462097), ('gout', 0.5804318785667419), ('neurological', 0.5797110795974731), ('congenital', 0.5704867839813232), ('diabetes', 0.5627163648605347), ('atherosclerosis', 0.5601633191108704), ('infectious', 0.5572311878204346), ('retardation', 0.5475963950157166), ('cognition', 0.5436995029449463), ('gastrointestinal', 0.5427045226097107)]
break
[('diabetes', 0.6686073541641235), ('pregnancy', 0.6431320905685425), ('estrogen', 0.6330111622810364), ('childbirth', 0.6287267804145813), ('asthma', 0.6252803802490234), ('alcoholism', 0.6236932277679443), ('anorexia', 0.619240403175354), ('reproductive', 0.6185950040817261), ('syndrome', 0.6179408431053162), ('diseases', 0.615617573261261)]
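
A rough sketch of the vector arithmetic behind an analogy query, on made-up 3-dimensional vectors: add the unit vectors of the positive words, subtract those of the negative words, then rank the remaining words by cosine similarity to the result. This approximates the idea and is not gensim's actual implementation; `unit`, `analogy` and `toy_vectors` are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative), excluding the query words."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)
    scored = [(word, sum(q * x for q, x in zip(query, unit(vec))))
              for word, vec in vectors.items() if word not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up vectors; the axes are loosely (royal, male, female).
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.9, 1.0, 0.0],
}

best = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(best)
```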


In [19]:
currentmodel.doesnt_match("noodle chicken turkey beef".split()) #A TRICK! My models seem to do pretty well when I try to trick them

'noodle'
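
One way to think about an odd-one-out query: average the unit vectors of all the words, then pick the word least similar to that average. The sketch below illustrates that idea on made-up 2-dimensional vectors; it is an assumption about the general approach, not a copy of gensim's code, and `odd_one_out` and `toy_vectors` are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the mean of all the unit vectors."""
    units = {word: unit(vectors[word]) for word in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    return min(words, key=lambda word: sum(a * b for a, b in zip(units[word], mean)))

# Made-up vectors: three "meat" words point one way, "noodle" points another.
toy_vectors = {
    "chicken": [1.0, 0.1],
    "turkey": [0.9, 0.2],
    "beef": [1.0, 0.3],
    "noodle": [0.1, 1.0],
}

outlier = odd_one_out(toy_vectors, "noodle chicken turkey beef".split())
print(outlier)
```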

#### Extract Raw Word/Word-Vectors

You could run k-means on these vectors (though it is computationally expensive), explore the dimensions the model learned, etc.

In [4]:
import csv

In [6]:
wordvecs= currentmodel.syn0 #outputs all word-vectors en masse (see http://rare-technologies.com/word2vec-tutorial/), but without the word attached to each vector

In [7]:
#writes a csv with a row for each word-vector, total 17,855 rows. But this data doesn't include the word represented by each word-vector.
with open('test.csv', 'w', newline='') as b: #in py 3 the csv module needs text mode with newline=''; 'wb' only works in py 2
    a = csv.writer(b)
    a.writerows(wordvecs)
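
To keep the word next to its vector, each row can start with the word itself. A minimal sketch on toy data, written to an in-memory buffer: `words` and `vectors` are made-up stand-ins for the model's vocabulary (in model order) and the matching rows of the vector matrix. In older gensim releases that ordered word list is, as far as I know, exposed as `index2word`; treat that as an assumption to verify for your version.

```python
import csv
import io

words = ["woman", "man"]            # stand-in for the vocabulary, in the same order as the rows
vectors = [[0.1, 0.2], [0.3, 0.4]]  # stand-in for the matching word-vectors

buffer = io.StringIO()              # swap for open('some_file.csv', 'w', newline='') to write to disk
writer = csv.writer(buffer, lineterminator="\n")
for word, vector in zip(words, vectors):
    writer.writerow([word] + list(vector))  # first column is the word, the rest are its dimensions

csv_text = buffer.getvalue()
print(csv_text)
```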

In [8]:
len(currentmodel.vocab) #this is the length of the list of all words in the Word2Vec model's vocab

17855