# Word Representations

## *"I know words. I have the best words!"*
    - Noam Chomsky

# Dense Distributed Representations

In [1]:
import pandas as pd
df = pd.read_csv('../data/reviews.full.tsv', sep='\t', nrows=100000)
documents = df.text.tolist()
print(documents[:2])

["Prices change daily and if you want to really research the price continually at many different sites , I have found cheaper cars elsewhere . However , if you don ' t have a lot of time to research the price , this site has always been among the top three ( e . g ., cheapest ) of the ten sites I use to reserve a car .", 'and the fact that they will match other companies is awesome !!']


## Word embeddings with `Word2vec`

In [59]:
from gensim.models import Word2Vec
from gensim.models.word2vec import FAST_VERSION

corpus = [document.split() for document in documents]

# initialize model
w2v_model = Word2Vec(size=100,
                     window=15,
                     sample=0.0001,
                     iter=200,
                     negative=5, 
                     min_count=100,
                     workers=-1, 
                     hs=0
)

w2v_model.build_vocab(corpus)

w2v_model.train(corpus, 
                total_examples=w2v_model.corpus_count, 
                epochs=w2v_model.epochs)


(0, 0)

Now, we can use the embeddings of the model

In [60]:
w2v_model.wv['delivery']

array([ 4.8546400e-03, -1.4327493e-03,  9.1873267e-04, -2.5443123e-03,
       -3.9414167e-03, -1.5230225e-03, -2.1216949e-03,  9.5906004e-04,
       -4.7519077e-03,  1.8372270e-03, -2.3460684e-03,  2.0625452e-03,
        4.5630126e-03,  1.0871316e-03,  3.0135075e-03,  4.8537026e-03,
       -4.5566674e-04, -4.9030469e-03,  1.6111378e-03,  2.3842263e-03,
       -7.6960924e-04, -3.0595718e-03, -1.9958208e-03,  1.3386792e-03,
       -2.7547716e-03,  4.3944218e-03, -1.3478342e-03, -3.2697483e-03,
       -2.3716041e-03, -3.0164132e-03,  2.2254842e-03, -3.8527506e-03,
       -4.9539991e-03,  6.9463247e-04,  1.9091720e-03, -4.5727027e-04,
       -3.9925748e-03,  4.2467550e-03, -4.0459507e-03, -2.6428143e-03,
        4.7718184e-03,  4.1489154e-03,  3.0665381e-03, -6.8244664e-04,
       -3.1425054e-03,  9.5672999e-04,  3.4863157e-03,  3.8578920e-03,
       -1.9977193e-03,  4.8923180e-03, -2.6220127e-03, -4.8777540e-03,
        1.4871495e-03, -4.4299471e-03, -4.6825400e-03, -1.2769677e-03,
      

In [61]:
w2v_model.wv.most_similar(['delivery'])

[('clear', 0.34266629815101624),
 ('fast', 0.3159208297729492),
 ('genuine', 0.3109842538833618),
 ('Jan', 0.29935598373413086),
 ('blind', 0.2912609875202179),
 ('learned', 0.2893880307674408),
 ('services', 0.2849019765853882),
 ('hanging', 0.27297908067703247),
 ('product', 0.26602858304977417),
 ('models', 0.26145869493484497)]

In [66]:
# birthday - present + husband => birthday:present as husband:?
w2v_model.wv.most_similar(positive=['dog', 'husband'], negative=['cat'], topn=3)

[('act', 0.35530605912208557),
 ('Because', 0.33676496148109436),
 ('strange', 0.31482774019241333)]

In [67]:
word1 = "Cheapest"
word2 = "friendly"

# retrieve the actual vector
# print(w2v_model.wv[word1])

# compare
print(w2v_model.wv.similarity(word1, word2))

# get the 3 most similar words
print(w2v_model.wv.most_similar(word1, topn=3))


-0.21957012540682005
[('zero', 0.2897338271141052), ('filling', 0.2875811457633972), ('Here', 0.2875189781188965)]


In [69]:
corpus[1]

(3376, 100)


### Exercise
Use `spacy` to restrict the words in the tweets to *content words*, i.e., nouns, verbs, and adjectives. Transform the words to lower case and add the POS with an underderscore. E.g.:

`love_VERB old-fashioneds_NOUN`

This also allows us to distinguish between homographs, i.e., words that are written the same, but belong to different word classes, e.g., *love* in "I **love** old-fashioneds" vs. "He felt so sick, it must have been **love**".


Make sure to exclude sentences that contain none of the above.

Write the resulting corpus to a variable called `word_corpus`.

In [None]:
# Your code here


Rerun the `Word2vec` model from above on the new data set and test the words out

In [None]:
# Your code here

## Exercise

Train 4 more `Word2vec` models and average the resulting embedding matrices.

In [None]:
# Your code here

## Document embeddings with `Doc2Vec`

In [71]:
df.head()

Unnamed: 0,score,category,uid,gender,age,text
0,5,Car Rental,899881,F,50,Prices change daily and if you want to really ...
1,5,Fitness & Nutrition,828184,M,32,and the fact that they will match other compan...
2,5,Electronic Payment,1698375,M,48,Used Paypal for my buying and selling for the ...
3,5,Gaming,3324079,M,29,I ' ve made two purchases on CJ ' s for Fallou...
4,4,Jewelry,719816,F,29,I was very happy with the diamond that I order...


In [72]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import FAST_VERSION
from gensim.models.doc2vec import TaggedDocument

corpus = []

for row in df.iterrows():
    label = row[1].score
    text = row[1].text
    corpus.append(TaggedDocument(words=text.split(), tags=[str(label)]))

print('done')
d2v_model = Doc2Vec(vector_size=100, 
                    window=15,
                    hs=0,
                    sample=0.000001,
                    negative=5,
                    min_count=100,
                    workers=-1,
                    epochs=500,
                    dm=0, 
                    dbow_words=1)

d2v_model.build_vocab(corpus)

d2v_model.train(corpus, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)

done


We can now look at the elements

In [73]:
d2v_model.docvecs.doctags

{'5': Doctag(offset=0, word_count=4205492, doc_count=78827),
 '4': Doctag(offset=1, word_count=604853, doc_count=9164),
 '1': Doctag(offset=2, word_count=1205430, doc_count=7316),
 '2': Doctag(offset=3, word_count=301478, doc_count=2197),
 '3': Doctag(offset=4, word_count=254820, doc_count=2496)}

In [75]:
target_doc = '1'

similar_docs = d2v_model.docvecs.most_similar(target_doc, topn=5)
print(similar_docs)

[('4', 0.05546309053897858), ('3', 0.04905076324939728), ('2', -0.020994702354073524), ('5', -0.10378839075565338)]


## Exercise

What are the 10 most similar ***words*** to each category?

In [80]:
# your code here
d2v_model.wv.most_similar([d2v_model.docvecs['3']], topn=5)

[('season', 0.33074963092803955),
 ('trip', 0.3259233236312866),
 ('term', 0.31779444217681885),
 ('include', 0.31765735149383545),
 ('preferred', 0.31474271416664124)]