# Deep NLP - Word Embeddings

Think back to NLP as we've understood it so far.

If we've had some luck with NLP modeling, likely with a Naive Bayes algorithm, we were able to illustrate some correlations between words and some other feature of interest.

But to whatever extent our models were able to make connections and pick up on correlations, they did this *without any understanding of the **meaning** of the words in question*.

We can make sense of meaning for computational purposes by thinking about meaning in terms of similarity, i.e. thinking about meaning *holistically*.

Q. Is there any precedent for this way of thinking about meaning? <br/>
A. [Yes](https://plato.stanford.edu/entries/meaning-holism/#ArgForMeaHol)

So what will this look like for us?

*Remember cosine similarity?*

$\rightarrow$ We'll have much the same idea here: Associate each word with values along particular dimensions in a multi-dimensional space. If we had a dimension for *softness*, for example, then pillows and marshmallows would score higher on it than rocks and bricks.
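As a refresher, here is a minimal sketch of cosine similarity on a few hand-made vectors. The two dimensions (softness, heaviness) and all the numbers are invented for illustration; real embeddings have many more dimensions, and those dimensions are learned rather than labeled.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-d "embeddings": (softness, heaviness). Values are made up.
toy_vectors = {
    'pillow':      [0.9, 0.1],
    'marshmallow': [0.8, 0.05],
    'rock':        [0.05, 0.9],
    'brick':       [0.1, 0.95],
}

soft_pair = cosine_similarity(toy_vectors['pillow'], toy_vectors['marshmallow'])
mixed_pair = cosine_similarity(toy_vectors['pillow'], toy_vectors['rock'])

# The two soft things point in nearly the same direction;
# the pillow and the rock do not.
print(soft_pair > mixed_pair)
```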

In [25]:
import gensim
import numpy as np
import json
from nltk.stem import WordNetLemmatizer
import string

What is Gensim? See [here](https://en.wikipedia.org/wiki/Gensim) and [here](https://radimrehurek.com/gensim/). But, basically, gensim is a package with lots of topic-modeling and NLP tools, including Word2Vec.

## EDA

In [4]:
# Reading in the data

with open('JEOPARDY_QUESTIONS1.json') as f:
    data = json.load(f)

In [5]:
type(data)

list

In [6]:
len(data)

216930

In [7]:
# Let's look at the first element in our list

data[0]

{'category': 'HISTORY',
 'air_date': '2004-12-31',
 'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
 'value': '$200',
 'answer': 'Copernicus',
 'round': 'Jeopardy!',
 'show_number': '4680'}

In [10]:
data[0]['question']

"'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'"

How many words do we have in this first question?

In [9]:
data[0]['question'].split(' ')

["'For",
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life,',
 'Galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 "man's",
 "theory'"]

In [11]:
len(data[0]['question'].split(' '))

18

How many words do we have in total?

In [12]:
length = 0
for clue in data:
    length += len(clue['question'].split(' '))
length

3169994

## Data Preparation

Word2Vec requires that our text have the form of a list of 'sentences', where each sentence is itself a list of words. How can we put our _Jeopardy!_ clues in that shape?

In [17]:
text = []

for clue in data:
    sentence = clue['question'].translate(str.maketrans('', '',
                                                        string.punctuation)).split(' ')
    
    new_sent = []
    for word in sentence:
        new_sent.append(word.lower())
    
    text.append(new_sent)

In [18]:
text[0]

['for',
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life',
 'galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 'mans',
 'theory']
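The same cleaning steps can be wrapped in a small helper and tried on a couple of made-up clues. This is only a sketch: `clue_to_sentence` and the toy clues are hypothetical names and data, not part of the dataset.

```python
import string

def clue_to_sentence(clue_text):
    """Strip punctuation, then lowercase and split on single spaces."""
    no_punct = clue_text.translate(str.maketrans('', '', string.punctuation))
    return [word.lower() for word in no_punct.split(' ')]

# Hypothetical clues, invented for illustration.
toy_clues = ["This 'Father of Our Country' didn't chop down a cherry tree",
             "It's the capital of France"]

# A list of sentences, each a list of words: the shape Word2Vec expects.
toy_text = [clue_to_sentence(clue) for clue in toy_clues]
print(toy_text[1])
```

One thing to keep in mind: splitting on `' '` keeps an empty string wherever two spaces sit side by side, so `clue_to_sentence('a  b')` yields three items. Calling `split()` with no argument would avoid that, at the cost of changing the word counts computed earlier.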

In [23]:
lemma = WordNetLemmatizer()

lemmatized = []
for clue in text:
    lemmatized.append([lemma.lemmatize(word) for word in clue])

In [24]:
lemmatized[0]

['for',
 'the',
 'last',
 '8',
 'year',
 'of',
 'his',
 'life',
 'galileo',
 'wa',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 'man',
 'theory']

## Modeling

Continuous Bag of Words vs. Skipgram

<a href="https://www.researchgate.net/figure/Illustration-of-the-Skip-gram-and-Continuous-Bag-of-Word-CBOW-models_fig1_281812760"><img src="https://www.researchgate.net/profile/Wang_Ling/publication/281812760/figure/fig1/AS:613966665486361@1523392468791/Illustration-of-the-Skip-gram-and-Continuous-Bag-of-Word-CBOW-models.png" alt="Illustration of the Skip-gram and Continuous Bag-of-Word (CBOW) models."/></a>

[More on Skipgram](https://towardsdatascience.com/word2vec-skip-gram-model-part-1-intuition-78614e4d6e0b)

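Word2Vec supports both **continuous bag of words** (CBOW) and **skipgram** models. CBOW predicts a word from the words around it; skipgram goes the other way and predicts the surrounding words from a given word. [This post](https://towardsdatascience.com/nlp-101-word2vec-skip-gram-and-cbow-93512ee24314) is helpful on their structures.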
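One way to see the difference between the two architectures is to look at the training examples each would draw from a single sentence. The sketch below is a simplification: the function names are hypothetical, and it leaves out details of real Word2Vec training such as subsampling of frequent words and negative sampling.

```python
def skipgram_pairs(sentence, window=2):
    """(target, context) pairs: each word predicts each of its neighbors."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

def cbow_examples(sentence, window=2):
    """(context_words, target) examples: the neighbors jointly predict the word."""
    examples = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

toy_sentence = ['galileo', 'was', 'under', 'house', 'arrest']

print(skipgram_pairs(toy_sentence, window=1)[:3])
print(cbow_examples(toy_sentence, window=1)[1])
```

With a window of 1, skipgram produces one example per (word, neighbor) pair, while CBOW produces one example per word with all its neighbors bundled together.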

In [26]:
# Instantiating a Word2Vec object with a corpus builds the vocabulary
# and trains the model; sg=1 selects skipgram (the default is CBOW).

model = gensim.models.Word2Vec(lemmatized, sg=1)

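See the gensim documentation for its note on seeding and reproducibility: a fully deterministic run needs a fixed `seed`, a single worker thread (`workers=1`), and, in Python 3, a fixed `PYTHONHASHSEED`.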

In [27]:
# To train further, call 'train()'! Words that are not in the
# vocabulary built above are skipped during training.

model.train(text, total_examples=model.corpus_count, epochs=model.epochs)

(10299626, 15849970)

In [28]:
# Checking word count

model.corpus_total_words

3169994

## `model.wv`

In [29]:
# The '.wv' attribute stores the word vectors

model.wv

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x15705a940>

In [30]:
# The vectors are keyed by the words

model.wv['child']

array([ 0.35933653, -0.16895719,  0.08698016, -0.2724295 ,  0.03410744,
       -0.876136  ,  0.22552657,  0.18308343,  0.08276779,  0.12514946,
       -0.3964169 ,  0.03763022, -0.56658256,  0.2665006 , -0.21136418,
        0.09357793,  0.81534517,  0.01875736, -0.52860904,  0.1600756 ,
       -0.09026229,  0.18312268, -0.14695741, -0.09544276,  0.46544105,
        0.67673206,  0.38662106, -0.30341986, -0.28568813, -0.27198285,
       -0.10253334,  0.8163238 , -0.04682573,  0.4146795 ,  0.25085577,
       -0.29462567, -0.16199362, -0.38281444,  0.09366592, -0.07808338,
        0.50254357, -0.34361532, -0.21382678, -0.07930967,  0.16761082,
        0.5728547 ,  0.11123757,  0.48029777, -0.10854993, -0.3600292 ,
        0.28469074,  0.00913351, -0.22246398, -0.55484813,  0.39430794,
        0.09160047, -0.22096094,  0.7419396 , -0.01934808, -0.17086174,
       -0.25481266,  0.5247727 , -0.3014785 , -0.02293869, -0.20031291,
       -0.23553422, -0.29731867,  0.07297182, -0.17568329,  0.06

### `model.wv` methods
#### `most_similar()` and `similarity()`

In [31]:
model.wv.most_similar('furniture')

[('ceramic', 0.7321969866752625),
 ('cabriole', 0.714465320110321),
 ('pottery', 0.6944854259490967),
 ('linen', 0.6927464008331299),
 ('fastener', 0.6918306946754456),
 ('flooring', 0.6882908344268799),
 ('artwork', 0.6843509674072266),
 ('chippendale', 0.6841990947723389),
 ('decorative', 0.6809938549995422),
 ('mannequin', 0.6777957677841187)]

In [33]:
model.wv.similarity('furniture', 'ceramic')

0.73219705

In [34]:
model.wv.most_similar(positive=['cat', 'animal', 'pet', 'mammal'])

[('carnivore', 0.8239153623580933),
 ('cheetah', 0.8141511678695679),
 ('predatory', 0.7844293117523193),
 ('pachyderm', 0.7825244069099426),
 ('arachnid', 0.777582049369812),
 ('reptile', 0.7768771648406982),
 ('shorthaired', 0.7749311923980713),
 ('rodent', 0.7739750742912292),
 ('giraffe', 0.7706668376922607),
 ('scavenger', 0.7691413164138794)]

Consider the following 'equations':

King + Woman - Man = x

Brother + Woman - Man = y

What values would you suggest for x and y here?

Clearly, getting good answers to these equations depends on understanding the *meanings* of the underlying words.

Or does it? The `most_similar()` method takes a 'negative' parameter as well as a 'positive' one, so we can consult our trained word vectors to see how they would answer these questions!

In [35]:
model.wv.most_similar(positive=['cat', 'animal'], negative='pet')

[('rodent', 0.3702376186847687),
 ('sheep', 0.36358124017715454),
 ('breed', 0.35542649030685425),
 ('vulture', 0.35324373841285706),
 ('insect', 0.3531191647052765),
 ('dog', 0.34904420375823975),
 ('extinction', 0.34766697883605957),
 ('lizard', 0.3446741998195648),
 ('creature', 0.3275783658027649),
 ('domesticated', 0.3261025547981262)]

In [36]:
model.wv.most_similar(positive=['king', 'woman'], negative='man')

[('emperor', 0.26721689105033875),
 ('throne', 0.26415058970451355),
 ('empress', 0.22536341845989227),
 ('spaniard', 0.22355356812477112),
 ('slave', 0.2168462574481964),
 ('ruler', 0.21443983912467957),
 ('queen', 0.20846901834011078),
 ('medici', 0.20117774605751038),
 ('conqueror', 0.1999264359474182),
 ('aristotle', 0.19925367832183838)]

In [37]:
model.wv.most_similar(positive=['brother', 'woman'], negative='man')

[('wife', 0.2738255560398102),
 ('husband', 0.2578567564487457),
 ('motherinlaw', 0.2373926043510437),
 ('slave', 0.23261277377605438),
 ('nymph', 0.2305372804403305),
 ('son', 0.2225310206413269),
 ('jacob', 0.2190614640712738),
 ('odysseus', 0.2141270637512207),
 ('daughter', 0.21392333507537842),
 ('jealous', 0.21081501245498657)]
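The arithmetic behind these queries can be sketched with a handful of invented vectors: add the 'positive' vectors, subtract the 'negative' ones, and return the nearest remaining word by cosine similarity. The two dimensions (royalty, femaleness) and the `analogy` helper are hypothetical, and gensim's own implementation differs in details (for example, it works with normalized vectors).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Invented 2-d vectors: (royalty, femaleness).
toy = {
    'king':  [0.9, 0.1],
    'queen': [0.9, 0.9],
    'man':   [0.1, 0.1],
    'woman': [0.1, 0.9],
    'apple': [-0.5, 0.0],
}

def analogy(positive, negative, vectors):
    """Word nearest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    candidates = [w for w in vectors
                  if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy(['king', 'woman'], ['man'], toy))  # -> queen
```

Note that both arguments here are lists. It is worth checking how your gensim version treats a bare string passed as `negative`: in some older releases it may be iterated character by character, so wrapping words in lists (e.g. `negative=['man']`) is the safer form.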

In [38]:
model.wv.most_similar(positive='usa')

[('pageant', 0.61308354139328),
 ('monorail', 0.610979437828064),
 ('fargo', 0.6104167103767395),
 ('whitewater', 0.6084681749343872),
 ('coanchor', 0.5854692459106445),
 ('firecracker', 0.5806131362915039),
 ('brisbane', 0.5774937868118286),
 ('shannon', 0.5772475600242615),
 ('tyra', 0.574211061000824),
 ('dogpatch', 0.5650832653045654)]

In [39]:
model.wv.most_similar('canada')

[('marianas', 0.6302838325500488),
 ('britain', 0.6298997402191162),
 ('myanmar', 0.623246431350708),
 ('manitoba', 0.6185489296913147),
 ('klm', 0.6138970851898193),
 ('saskatchewan', 0.6122548580169678),
 ('zambia', 0.6085365414619446),
 ('albania', 0.6065628528594971),
 ('mozambique', 0.606482744216919),
 ('canadian', 0.6046032309532166)]

In [40]:
model.wv.most_similar('shakespeare')

[('sophocles', 0.7020344734191895),
 ('shakespearean', 0.68357914686203),
 ('euripides', 0.6722855567932129),
 ('macbeth', 0.6563365459442139),
 ('falstaff', 0.6521875858306885),
 ('hamlet', 0.6492114067077637),
 ('moliere', 0.6449787616729736),
 ('ibsen', 0.6441601514816284),
 ('hellman', 0.6436213254928589),
 ('laertes', 0.6376091241836548)]

In [41]:
model.wv.most_similar('greg')

[('kinnear', 0.8460290431976318),
 ('conner', 0.8010389804840088),
 ('shoeless', 0.7997528314590454),
 ('bebe', 0.7965857982635498),
 ('kerri', 0.7933998107910156),
 ('connors', 0.7874476909637451),
 ('shaun', 0.7821326851844788),
 ('waterstona', 0.7787505388259888),
 ('abduljabbar', 0.773412823677063),
 ('hamlisch', 0.7722470760345459)]

In [42]:
model.wv.most_similar('jefferson')

[('madison', 0.6711403131484985),
 ('prescott', 0.6692360043525696),
 ('lincoln', 0.6623405814170837),
 ('mifflin', 0.6616520285606384),
 ('romney', 0.6580953598022461),
 ('lubbock', 0.657781183719635),
 ('colby', 0.6562408208847046),
 ('seward', 0.6557228565216064),
 ('polk', 0.65288245677948),
 ('judiciary', 0.6506866216659546)]

In [43]:
model.wv.most_similar('washington')

[('dc', 0.8224882483482361),
 ('dca', 0.6604635715484619),
 ('hw', 0.6486342549324036),
 ('newseum', 0.6236661672592163),
 ('p3', 0.6230477690696716),
 ('statuary', 0.6045047640800476),
 ('rotunda', 0.5910694003105164),
 ('gw', 0.5870948433876038),
 ('arlington', 0.5859220027923584),
 ('lincoln', 0.5850906372070312)]

In [44]:
model.wv.most_similar(positive=['president', 'germany'], negative='usa')

[('dictator', 0.24386319518089294),
 ('ruled', 0.23667289316654205),
 ('assassinated', 0.20818036794662476),
 ('1894', 0.20603422820568085),
 ('rebellion', 0.20375829935073853),
 ('slave', 0.2037426233291626),
 ('premier', 0.19761990010738373),
 ('emperor', 0.19710686802864075),
 ('haile', 0.19596856832504272),
 ('occupied', 0.1949566900730133)]

In [45]:
model.wv.most_similar(positive=['president', 'france'], negative='usa')

[('shah', 0.24299493432044983),
 ('dictator', 0.23337674140930176),
 ('medici', 0.22955113649368286),
 ('conquest', 0.22408586740493774),
 ('ruled', 0.22132796049118042),
 ('assassinated', 0.21796737611293793),
 ('exile', 0.21268150210380554),
 ('occupied', 0.2110612690448761),
 ('frenchman', 0.21021825075149536),
 ('spaniard', 0.20750625431537628)]

#### `doesnt_match()`

In [46]:
model.wv.doesnt_match(['breakfast', 'lunch', 'frog', 'food'])

'frog'

In [48]:
model.wv.doesnt_match(['tree', 'flower', 'bush', 'plant', 'toothbrush'])

'bush'

In [49]:
model.wv.doesnt_match(['tree', 'flower', 'plant', 'toothbrush'])

'toothbrush'
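Roughly, the odd word out is the one least similar to the average direction of the group. Here is a minimal sketch of that idea on invented vectors; `odd_one_out` is a hypothetical helper, not gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Word whose unit vector is least similar to the group's mean direction."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units)
                 for d in range(dims)])
    return min(words,
               key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Invented 2-d vectors for illustration.
toy = {
    'breakfast': [0.9, 0.1],
    'lunch':     [0.8, 0.2],
    'food':      [0.7, 0.3],
    'frog':      [0.1, 0.9],
}

print(odd_one_out(['breakfast', 'lunch', 'frog', 'food'], toy))  # -> frog
```

Because the mean includes every word in the list, a strong outlier pulls the average toward itself, which is one reason results can shift when a word is added to or removed from the list.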

#### `closer_than()`

In [50]:
# Which words are closer to 'king' than 'queen' is?

model.wv.closer_than('king', 'queen')

['iv', 'ix', 'olaf']
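The idea can be sketched on toy data: compute the similarity between the two given words, then keep every other word that is more similar to the first word than that. The vectors and the standalone `closer_than` function below are invented for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def closer_than(anchor, reference, vectors):
    """Words more similar to `anchor` than `reference` is."""
    threshold = cosine(vectors[anchor], vectors[reference])
    return [w for w in vectors
            if w not in (anchor, reference)
            and cosine(vectors[anchor], vectors[w]) > threshold]

# Invented 2-d vectors for illustration.
toy = {
    'king':  [0.9, 0.1],
    'queen': [0.9, 0.9],
    'olaf':  [0.8, 0.15],
    'apple': [-0.5, 0.2],
}

print(closer_than('king', 'queen', toy))  # -> ['olaf']
```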

#### `distance()`

In [51]:
# For this it will make more sense to
# normalize our vectors.

model.init_sims(replace=True)

In [52]:
model.wv.distance('king', 'king')

-1.1920928955078125e-07

In [53]:
model.wv.distance('joy', 'happiness')

0.4816325902938843
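The distance here is one minus the cosine similarity, which for unit-length vectors is just one minus the dot product. A minimal sketch, with hypothetical helper names:

```python
import math

def normalize(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(u, v):
    """1 minus the dot product of the two unit vectors."""
    u, v = normalize(u), normalize(v)
    return 1.0 - sum(a * b for a, b in zip(u, v))

print(cosine_distance([1, 0], [0, 1]))   # perpendicular vectors -> 1.0
print(cosine_distance([1, 0], [-1, 0]))  # opposite vectors -> 2.0
```

A word's distance from itself should be exactly zero. The tiny nonzero value shown above for `'king'` is most likely floating-point rounding in the dot product rather than anything meaningful.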

#### `evaluate_word_analogies()`

The `evaluate_word_analogies()` method takes the path to a file of properly formatted analogy quadruples (see [here](https://radimrehurek.com/gensim/models/keyedvectors.html)) and returns a tuple: an overall accuracy score and a list of dictionaries, one per section of the file. Each dictionary has a 'section' key naming the section plus two keys, 'correct' and 'incorrect', the values for which are lists of the analogies that the model correctly or incorrectly predicted.

Check out [this text file](https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt)!

In [54]:
# The result is (score, sections); section 4 holds the family analogies.
url = ('https://raw.githubusercontent.com/'
       'nicholas-leonard/word2vec/master/questions-words.txt')
relatives = model.wv.evaluate_word_analogies(url)[1][4]

In [55]:
len(relatives['correct'])

94

In [56]:
len(relatives['incorrect'])

248

In [57]:
relatives['correct'][:5]

[('BOY', 'GIRL', 'DAD', 'MOM'),
 ('BOY', 'GIRL', 'HE', 'SHE'),
 ('BOY', 'GIRL', 'HIS', 'HER'),
 ('BOY', 'GIRL', 'KING', 'QUEEN'),
 ('BOY', 'GIRL', 'MAN', 'WOMAN')]

In [58]:
relatives['incorrect'][:5]

[('BOY', 'GIRL', 'BROTHER', 'SISTER'),
 ('BOY', 'GIRL', 'FATHER', 'MOTHER'),
 ('BOY', 'GIRL', 'GRANDFATHER', 'GRANDMOTHER'),
 ('BOY', 'GIRL', 'GRANDPA', 'GRANDMA'),
 ('BOY', 'GIRL', 'GRANDSON', 'GRANDDAUGHTER')]
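To make the bookkeeping concrete, here is a rough sketch of how an analogy file in this format could be parsed and scored. In the file, lines beginning with `: ` name a section and every other line holds four words. The helper names, the tiny stand-in file, and the lookup-table predictor are all hypothetical; a real evaluation would use the trained vectors to make each prediction.

```python
def parse_analogy_sections(text):
    """Split analogy-file text into sections of (a, b, c, expected) tuples."""
    sections = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(': '):
            sections.append({'section': line[2:], 'questions': []})
        else:
            a, b, c, expected = line.split()
            sections[-1]['questions'].append((a, b, c, expected))
    return sections

def score_section(section, predict):
    """Sort each quadruple into 'correct' or 'incorrect' using `predict`."""
    result = {'section': section['section'], 'correct': [], 'incorrect': []}
    for a, b, c, expected in section['questions']:
        bucket = 'correct' if predict(a, b, c) == expected else 'incorrect'
        result[bucket].append((a, b, c, expected))
    return result

# A tiny stand-in for the real file.
toy_file = """: family
boy girl king queen
boy girl brother sister
"""

# Hypothetical predictor that only knows one answer.
def toy_predict(a, b, c):
    return {'king': 'queen'}.get(c)

scored = score_section(parse_analogy_sections(toy_file)[0], toy_predict)
print(len(scored['correct']), len(scored['incorrect']))  # -> 1 1
```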

## Stanford Parser
https://nlp.stanford.edu/software/lex-parser.shtml <br/>
http://nlp.stanford.edu:8080/parser/