# Word embedding

## An alternative approach

Can we define words by the company they keep?

"If A and B have almost identical environments we say that they are synonyms" (Zellig Harris, 1954)

## Vector representations

One-hot encodings are long and sparse

Alternative: dense vectors

Short (length 50-1000) and dense (most elements are non-zero)
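To make the contrast concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the dense values are made up for illustration; they do not come from any trained model.

```python
# Toy vocabulary; real vocabularies contain tens of thousands of terms.
vocab = ["cat", "dog", "lemon", "university"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's index and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

cat_one_hot = one_hot("cat", vocab)
dog_one_hot = one_hot("dog", vocab)

# Hypothetical dense vectors (invented values, 3 dimensions for readability).
dense = {
    "cat": [0.8, 0.1, -0.3],
    "dog": [0.7, 0.2, -0.2],
}

print(cat_one_hot)                      # one entry per vocabulary term
print(dot(cat_one_hot, dog_one_hot))    # distinct one-hot vectors never overlap
print(dot(dense["cat"], dense["dog"]))  # dense vectors can express relatedness
```

A one-hot vector grows with the vocabulary and treats every pair of distinct words as equally unrelated, whereas dense vectors stay short and can place related words close together.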

## Benefits
1. Easier to use in ML
2. Offer better generalization capabilities 

## 1. Train your own embeddings

Instead of counting terms, we train a classifier on a prediction task: 'does A occur near B?'

The learnt classifier weights become our embeddings
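As a rough illustration of where the training data for such a classifier can come from, the sketch below collects (target, context) pairs from a sliding window. This mirrors the skip-gram framing of word2vec; the function name `context_pairs` and the window size are assumptions made for this example, and the sketch leaves out negative sampling and the training itself.

```python
def context_pairs(tokens, window=2):
    """Collect (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = context_pairs(sentence, window=1)

for target, context in pairs:
    print(target, "->", context)
```

Each pair is a positive example ("A does occur near B"); negative examples are typically drawn by pairing the target with randomly sampled words.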

We can train our own word embeddings on the Brown corpus using gensim (an open-source library for unsupervised topic modeling and NLP).

In [2]:
# set up libraries and data
import gensim
from nltk.corpus import brown

In [3]:
# train the model
model = gensim.models.Word2Vec(brown.sents())

In [4]:
# save a copy for later re-use
model.save('brown.embedding')

In [5]:
# we can load models on demand
brown_model = gensim.models.Word2Vec.load('brown.embedding')

In [14]:
len(brown_model.wv.key_to_index)

15173

In [17]:
# inspect the vector for one term (gensim's default is 100 dimensions)
brown_model.wv['university']

array([ 0.11194711,  0.25647733,  0.20054246,  0.11136482, -0.06431483,
       -0.33305392,  0.20328386,  0.34726298, -0.3063523 , -0.28288716,
        0.16903785, -0.22773439,  0.19072987,  0.16677798,  0.24305744,
       -0.16282238,  0.2695498 , -0.14498329, -0.52258486, -0.5172722 ,
        0.28510782, -0.11505552,  0.48348287,  0.07632669, -0.05928324,
       -0.13334103, -0.2300787 ,  0.01776804, -0.21009324,  0.23414841,
        0.21649893, -0.06716539,  0.2807849 , -0.3703491 , -0.17710349,
        0.05796826, -0.18897642, -0.05402145, -0.3442866 , -0.05958011,
        0.01251105, -0.26838008,  0.17947957,  0.11442474,  0.2036804 ,
       -0.00861567, -0.01929104, -0.02451881,  0.09921164,  0.31112176,
        0.01644453, -0.28586334, -0.25123176, -0.18944709, -0.10734125,
       -0.22981353,  0.17728397,  0.05186184, -0.06207724, -0.0792463 ,
        0.02941441,  0.20066275, -0.03754928, -0.16876425, -0.18955061,
        0.45259425,  0.02250968,  0.28408766, -0.2634104 ,  0.37

In [18]:
# calculate similarity between terms
brown_model.wv.similarity('university','school')

0.81546724
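The similarity score reported above is a cosine similarity: the dot product of the two vectors divided by the product of their lengths. A minimal sketch in plain Python, using made-up vectors rather than the trained embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a, different length
c = [-1.0, -2.0, -3.0]  # opposite direction to a
d = [3.0, 0.0, -1.0]    # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # close to -1: opposite direction
print(cosine_similarity(a, d))  # close to 0: unrelated
```

Because the score depends only on direction, two vectors of very different lengths can still be maximally similar.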

In [19]:
# find similar terms
brown_model.wv.most_similar('university', topn=5)

[('membership', 0.9543903470039368),
 ('profession', 0.9534215331077576),
 ('neighborhood', 0.9529263973236084),
 ('congregation', 0.9514262676239014),
 ('selection', 0.948223888874054)]

In [20]:
brown_model.wv.most_similar('lemon',topn=5)

[('marble', 0.9657631516456604),
 ('pension', 0.9650222063064575),
 ('frankfurters', 0.9640059471130371),
 ('herd', 0.9625304937362671),
 ('towel', 0.9624120593070984)]

In [22]:
brown_model.wv.most_similar('government',topn=5)

[('nation', 0.9278519749641418),
 ('policy', 0.9250001907348633),
 ('power', 0.923238217830658),
 ('Christian', 0.9232304096221924),
 ('education', 0.9201815128326416)]

## 2. Use pre-trained embeddings

We can load pre-built embeddings, e.g. a sample from a model trained on 100 billion words from the Google News Dataset

In [23]:
import nltk
nltk.download('word2vec_sample')

[nltk_data] Downloading package word2vec_sample to C:\Users\yi
[nltk_data]     quan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping models\word2vec_sample.zip.


True

In [24]:
# load a pre-built model
from nltk.data import find
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
news_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample,binary=False)

In [27]:
# how many terms?
len(news_model.key_to_index)

43981

In [37]:
# how many dimensions?
len(news_model['university'])

300

In [32]:
# are they any better?
news_model.most_similar(positive=['university'], topn=5)

[('universities', 0.7003917098045349),
 ('faculty', 0.6780909895896912),
 ('undergraduate', 0.6587098240852356),
 ('campus', 0.6434985995292664),
 ('college', 0.638526976108551)]

In [38]:
news_model.most_similar(positive=['lemon'],topn=5)

[('lemons', 0.6462560296058655),
 ('apricot', 0.6199415326118469),
 ('avocado', 0.5922888517379761),
 ('fennel', 0.5873183012008667),
 ('coriander', 0.5828487873077393)]

In [39]:
news_model.most_similar(positive=['government'],topn=5)

[('Government', 0.7132058143615723),
 ('governments', 0.6521531343460083),
 ('administration', 0.546237051486969),
 ('legislature', 0.5307288765907288),
 ('parliament', 0.5268455147743225)]

## 3. Perform vector algebra

We can use embeddings to perform verbal reasoning, e.g. A is to B as C is to....

e.g. 'man is to king as woman is to...'

vec("king") - vec("man") + vec("woman") ≈ vec("queen")
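The arithmetic can be sketched by hand with toy vectors. The three-dimensional values below are invented so that the dimensions loosely read as (royalty, male, female); the helper `analogy` is hypothetical and is not gensim's implementation, which differs in details such as normalising the vectors first.

```python
import math

# Invented toy embeddings; dimensions loosely mean (royalty, male, female).
vectors = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.7, 0.9, 0.1],
    "apple":  [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves, as gensim's most_similar also does.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

best = analogy("man", "king", "woman", vectors)
print(best)  # queen
```

Excluding the query words matters: without that step the nearest neighbour of the target vector is often one of the inputs.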

In [40]:
news_model.most_similar(positive=['woman','king'], negative=['man'], topn=5)

[('queen', 0.7118194103240967),
 ('monarch', 0.6189676523208618),
 ('princess', 0.5902429819107056),
 ('prince', 0.5377322435379028),
 ('kings', 0.5236845016479492)]

In [41]:
news_model.most_similar(positive=['king','woman'], negative=['man'], topn=5)

[('queen', 0.7118194103240967),
 ('monarch', 0.6189676523208618),
 ('princess', 0.5902429819107056),
 ('prince', 0.5377322435379028),
 ('kings', 0.5236845016479492)]

In [42]:
# encyclopaedic knowledge
news_model.most_similar(positive=['Paris', 'Germany'], negative=['Berlin'], topn=5)

[('France', 0.7884091138839722),
 ('Belgium', 0.6197876334190369),
 ('Spain', 0.5664774179458618),
 ('Italy', 0.5654899477958679),
 ('Switzerland', 0.5609694123268127)]

In [43]:
# syntactic patterns (superlative adjectives)
news_model.most_similar(positive=['longest','short'],negative=['long'], topn=5)

[('shortest', 0.5145130157470703),
 ('steepest', 0.4244834780693054),
 ('first', 0.402511864900589),
 ('flattest', 0.40171924233436584),
 ('consecutive', 0.3951871693134308)]

In [44]:
# more encyclopaedic knowledge
news_model.most_similar(positive=['blue', 'tulip'], negative=['sky'], topn=5)

[('purple', 0.5252775549888611),
 ('tulips', 0.49382397532463074),
 ('brown', 0.49077507853507996),
 ('pink', 0.4860530197620392),
 ('maroon', 0.4805646240711212)]

In [45]:
# syntactic knowledge (pronouns)
news_model.most_similar(positive=['him','she'], negative=['he'], topn=5)

[('her', 0.8049386143684387),
 ('herself', 0.6881042718887329),
 ('me', 0.5886673927307129),
 ('She', 0.5803763270378113),
 ('woman', 0.5470801591873169)]

In [46]:
# lexical knowledge (antonyms)
news_model.most_similar(positive=['light','long'],negative=['dark'],topn=5)

[('short', 0.40772414207458496),
 ('longer', 0.36700794100761414),
 ('lengthy', 0.3622959852218628),
 ('Long', 0.3600354790687561),
 ('continuous', 0.34982699155807495)]

In [47]:
# find the odd one out
news_model.doesnt_match('breakfast cereal dinner lunch'.split())

'cereal'
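One way to picture the odd-one-out test: average the vectors of all the words, then return the word whose vector is least similar to that average. The sketch below follows that idea with invented two-dimensional vectors; the helper `odd_one_out` is hypothetical, and gensim's `doesnt_match` may differ in details such as normalisation.

```python
import math

# Invented toy embeddings: the three meals point one way, 'cereal' another.
toy_vectors = {
    "breakfast": [0.90, 0.10],
    "cereal":    [0.20, 0.90],
    "dinner":    [0.85, 0.15],
    "lunch":     [0.80, 0.20],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def odd_one_out(words, vectors):
    """Return the word least similar to the mean vector of all the words."""
    dims = len(vectors[words[0]])
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

result = odd_one_out("breakfast cereal dinner lunch".split(), toy_vectors)
print(result)  # cereal
```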