# Word embeddings

## An alternative approach
Can we define words by the company they keep?  
"If A and B have almost identical environments we say that they are synonyms" (Zellig Harris, 1954)

## Vector representations
One-hot encodings are long (one element per vocabulary word) and sparse (all zeros except a single 1)  
Alternative: **dense vectors**  
short (length 50-1000) + dense (most elements are non-zero)
### Benefits
1. Easier to use as features in ML (fewer weights to learn)  
2. Generalize better: words used in similar contexts get similar vectors
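
To make the contrast concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the dense values are made up for illustration; they are not taken from any trained model.

```python
# Toy vocabulary (hypothetical, for illustration only)
vocab = ['cat', 'dog', 'car']

def one_hot(word):
    """Return a one-hot vector: 1 at the word's index, 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hand-picked dense vectors (made-up values, not learned)
dense = {
    'cat': [0.9, 0.8],
    'dog': [0.8, 0.9],
    'car': [-0.7, 0.1],
}

# One-hot vectors of two different words never overlap
one_hot_cat_dog = dot(one_hot('cat'), one_hot('dog'))
one_hot_cat_car = dot(one_hot('cat'), one_hot('car'))

# Dense vectors can express graded similarity
dense_cat_dog = dot(dense['cat'], dense['dog'])
dense_cat_car = dot(dense['cat'], dense['car'])

print(one_hot_cat_dog, one_hot_cat_car)
print(dense_cat_dog > dense_cat_car)
```

With one-hot vectors every pair of distinct words looks equally unrelated, whereas the dense toy vectors place 'cat' closer to 'dog' than to 'car'.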

## 1. Train your own embeddings
Instead of counting terms, we train a classifier on a prediction task: 'does A occur near B?'  
The learned classifier weights become our embeddings  
We can create our own word embeddings using gensim (an open-source library for unsupervised topic modeling and NLP),  
training them on the Brown corpus  
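
The prediction task can be pictured as building labelled word pairs from running text. The sketch below is a simplification under stated assumptions, not gensim's implementation: the helper `training_pairs`, the window size and the example sentence are all hypothetical.

```python
def training_pairs(tokens, window=2):
    """Collect (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical example sentence
sentence = 'the quick brown fox jumps'.split()
positives = training_pairs(sentence, window=1)
print(positives)
```

Each pair is a positive example ('yes, A occurs near B'). A classifier also needs negative examples, which word2vec-style training typically draws at random from the vocabulary.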

In [2]:
# set up libraries & data
import gensim
from nltk.corpus import brown
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     /Users/atsushihatakeyama/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [3]:
# train the model
model = gensim.models.Word2Vec(brown.sents())

In [4]:
# save a copy, for later re-use
model.save('brown.embedding')

In [5]:
# we can load models on demand
brown_model = gensim.models.Word2Vec.load('brown.embedding')

In [6]:
# how many words?
len(brown_model.wv.index_to_key)

15173

In [7]:
# inspect the vector for one word (its length is the number of dimensions)
brown_model.wv['university']

array([-0.10108075,  0.22742166,  0.3202568 ,  0.23411836, -0.21170661,
       -0.30038887,  0.409905  ,  0.35845628, -0.30588168, -0.21326882,
       -0.14555615, -0.25232542,  0.0048379 ,  0.16435574,  0.41183546,
       -0.15681866,  0.31765732, -0.06036779, -0.3866692 , -0.6257796 ,
        0.32944852, -0.1516606 ,  0.42890576,  0.07599909, -0.01374124,
       -0.10768668, -0.232468  ,  0.09907427, -0.38865626,  0.1297758 ,
        0.27606502,  0.02651284,  0.12728298, -0.24867278,  0.00420449,
        0.02078652, -0.15709856,  0.10768691, -0.21103078, -0.04226524,
       -0.02831067, -0.23004474,  0.0896466 ,  0.07167623,  0.0951155 ,
       -0.06976258, -0.09656906, -0.12872428,  0.05893157,  0.25232992,
        0.24184497, -0.3139338 , -0.17986003, -0.02292724, -0.1581031 ,
       -0.16923527,  0.26761666,  0.06093049, -0.19247353, -0.13254713,
        0.04970529,  0.12182314,  0.11722158, -0.09273087, -0.33360812,
        0.23861901,  0.03607434,  0.18643394, -0.09065596,  0.33

In [8]:
# calculate similarity between terms
brown_model.wv.similarity('university','inception')

np.float32(0.822145)
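
The score returned by `similarity` is the cosine of the angle between the two word vectors. A minimal sketch of that computation in plain Python, using made-up three-dimensional vectors rather than real embeddings (the helper name `cosine_similarity` is ours):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice as long
c = [-3.0, 0.0, 1.0]  # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

Because the measure ignores vector length, `a` and `b` score (up to rounding) 1, while the orthogonal pair scores 0.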

In [9]:
# find similar terms
brown_model.wv.most_similar('university', topn=5)

[('membership', 0.9555637836456299),
 ('inception', 0.9551917910575867),
 ('profession', 0.9533959627151489),
 ('neighborhood', 0.951281726360321),
 ('congregation', 0.950244665145874)]

In [10]:
brown_model.wv.most_similar('lemon', topn=5)

[('marble', 0.9649989008903503),
 ('elaborate', 0.9647454023361206),
 ('neat', 0.9607310891151428),
 ('Cape', 0.9601399302482605),
 ('pension', 0.9597789645195007)]

In [11]:
brown_model.wv.most_similar('government', topn=5)

[('power', 0.9291643500328064),
 ('policy', 0.927480161190033),
 ('education', 0.9245082139968872),
 ('Christian', 0.920354425907135),
 ('nation', 0.9192814826965332)]

## 2. Use pre-trained embeddings
We can load pre-built embeddings, e.g. a sample from a model trained on 100 billion words from the Google News Dataset

In [12]:
import nltk
nltk.download('word2vec_sample')

[nltk_data] Downloading package word2vec_sample to
[nltk_data]     /Users/atsushihatakeyama/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.


True

In [13]:
# load a pre-built model
from nltk.data import find
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
news_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

In [14]:
# how many terms?
len(news_model.key_to_index)

43981

In [15]:
# how many dimensions?
len(news_model['university'])

300

In [16]:
# are the nearest neighbours any better than with the Brown model?
news_model.most_similar(positive=['university'], topn = 5)

[('universities', 0.7003918290138245),
 ('faculty', 0.6780906915664673),
 ('undergraduate', 0.6587096452713013),
 ('campus', 0.6434987783432007),
 ('college', 0.6385269165039062)]

In [17]:
news_model.most_similar(positive=['lemon'], topn = 5)

[('lemons', 0.646256148815155),
 ('apricot', 0.619941771030426),
 ('avocado', 0.5922888517379761),
 ('fennel', 0.5873182415962219),
 ('coriander', 0.5828487277030945)]

In [18]:
news_model.most_similar(positive=['government'], topn = 5)

[('Government', 0.7132059931755066),
 ('governments', 0.6521531939506531),
 ('administration', 0.5462369322776794),
 ('legislature', 0.5307288765907288),
 ('parliament', 0.5268454551696777)]

## 3. Perform vector algebra
We can use embeddings to perform verbal reasoning, e.g. A is to B as C is to...  
e.g. 'man is to king as woman is to...'  
vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”)
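
The arithmetic can be traced by hand on toy data. The two-dimensional vectors below are invented for illustration (first axis roughly 'royalty', second roughly 'gender') and the helper `analogy` is hypothetical. It adds and subtracts the raw vectors and ranks the remaining words by cosine similarity; gensim's `most_similar` rests on the same idea, though it works with length-normalised vectors.

```python
import math

# Invented 2-d vectors: [royalty, gender]; not from a trained model
toy = {
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'apple': [-1.0, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for w in positive:
        query = [q + x for q, x in zip(query, vectors[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, vectors[w])]
    # Leave out the query words themselves
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)

ranking = analogy(positive=['king', 'woman'], negative=['man'], vectors=toy)
print(ranking[0])
```

Here king - man + woman lands exactly on the toy vector for 'queen'; with real embeddings the match is only approximate, which is why the nearest neighbour is reported with a similarity score.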

In [19]:
news_model.most_similar(positive=['woman','king'], negative=['man'], topn = 5)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236843228340149)]

In [20]:
# the order of the positive terms does not change the result
news_model.most_similar(positive=['king', 'woman'], negative=['man'], topn = 5)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236843228340149)]

In [21]:
# encyclopaedic knowledge
news_model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 5)

[('France', 0.7884091734886169),
 ('Belgium', 0.6197876930236816),
 ('Spain', 0.566477358341217),
 ('Italy', 0.5654898881912231),
 ('Switzerland', 0.560969352722168)]

In [22]:
# syntactic patterns (verbs)
news_model.most_similar(positive=['has','be'], negative=['have'], topn = 5)

[('is', 0.6774995923042297),
 ('was', 0.5710029006004333),
 ('remains', 0.47552669048309326),
 ('been', 0.4538104236125946),
 ('being', 0.4456518888473511)]

In [23]:
# syntactic patterns (adjectives)
news_model.most_similar(positive=['longest','short'], negative=['long'], topn = 5)

[('shortest', 0.5145130753517151),
 ('steepest', 0.42448344826698303),
 ('first', 0.4025117754936218),
 ('flattest', 0.4017193019390106),
 ('consecutive', 0.3951870799064636)]

In [24]:
# more encyclopaedic knowledge
news_model.most_similar(positive=['blue','tulip'], negative=['sky'], topn = 5)

[('purple', 0.5252774953842163),
 ('tulips', 0.4938238859176636),
 ('brown', 0.490774929523468),
 ('pink', 0.4860529899597168),
 ('maroon', 0.48056456446647644)]

In [25]:
# syntactic knowledge (pronouns)
news_model.most_similar(positive=['him','she'], negative=['he'], topn = 5)

[('her', 0.804938554763794),
 ('herself', 0.6881043314933777),
 ('me', 0.5886672139167786),
 ('She', 0.5803765058517456),
 ('woman', 0.5470799207687378)]

In [26]:
# lexical knowledge
news_model.most_similar(positive=['light','long'], negative=['dark'], topn = 5)

[('short', 0.4077242910861969),
 ('longer', 0.3670077621936798),
 ('lengthy', 0.36229580640792847),
 ('Long', 0.3600355386734009),
 ('continuous', 0.34982720017433167)]

In [None]:
# Find the odd one out
news_model.doesnt_match('breakfast cereal dinner lunch'.split())
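
One way to picture how an odd-one-out query can work: average the length-normalised vectors of all the words and return the word least similar to that average. The sketch below does this on invented two-dimensional vectors; the helper `odd_one_out` and the values are hypothetical, and the actual gensim implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the mean of all the words."""
    normed = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(normed.values())))
    mean = unit([sum(normed[w][i] for w in words) / len(words) for i in range(dims)])
    # For unit vectors the dot product equals the cosine similarity
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Invented 2-d vectors, chosen so that the three meals point the same way
toy = {
    'breakfast': [0.9, 0.1],
    'lunch':     [1.0, 0.2],
    'dinner':    [0.8, 0.15],
    'cereal':    [0.2, 1.0],
}

result = odd_one_out(['breakfast', 'cereal', 'dinner', 'lunch'], toy)
print(result)
```

By construction of the toy vectors, 'cereal' points away from the three meals and is returned as the odd one out.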

### Quiz questions + homework

In [None]:
news_model.most_similar(positive=['be','has'], negative=['is'], topn = 5)

In [None]:
brown_model.wv.most_similar('university', topn=5)

In [None]:
news_model.most_similar('university', topn=5)

In [None]:
brown_model.wv.most_similar('college', topn=5)

In [None]:
news_model.most_similar('college', topn=5)

In [None]:
news_model.similarity('university','turtle')

In [None]:
news_model.similarity('university', 'school')

In [None]:
news_model.similarity('university', 'factory')

In [None]:
news_model.similarity('university', 'supermarket')
