# Deep NLP - Word Embeddings

Think back to NLP as we've understood it so far.

If we've had some luck with NLP modeling, likely with a Naive Bayes classifier, we were able to find some correlations between words and some other feature of interest.

But to whatever extent our models were able to make connections and pick up on correlations, they did so *without any understanding of the **meaning** of the words in question*.

Let's think for a minute about words and objective meanings!

We can make sense of meaning for computational purposes by thinking about meaning in terms of similarity, i.e. thinking about meaning *holistically*.

Q. Is there any precedent for this way of thinking about meaning? <br/>
A. [Yes](https://plato.stanford.edu/entries/meaning-holism/#ArgForMeaHol)

So what will this look like for us?

*Remember cosine similarity?*

$\rightarrow$ We'll have much the same idea here: associate each word with values along particular dimensions in a multi-dimensional space. If we had a dimension for *softness*, for example, then pillows and marshmallows would score higher on it than rocks and bricks.
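
To make the *softness* idea concrete, here is a minimal sketch of cosine similarity on hand-made vectors. The three dimensions (softness, weight, edibility) and all of the scores are made up for illustration; a trained embedding learns its dimensions from data, and they are generally not this interpretable.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical hand-assigned scores along (softness, weight, edibility).
toy_vectors = {
    'pillow':      [0.9, 0.1, 0.0],
    'marshmallow': [0.8, 0.1, 0.9],
    'rock':        [0.0, 0.9, 0.0],
    'brick':       [0.1, 0.8, 0.0],
}

soft_pair = cosine_similarity(toy_vectors['pillow'], toy_vectors['marshmallow'])
mixed_pair = cosine_similarity(toy_vectors['pillow'], toy_vectors['rock'])
hard_pair = cosine_similarity(toy_vectors['rock'], toy_vectors['brick'])

print(soft_pair, mixed_pair, hard_pair)
```

With these scores, the two soft things and the two hard things each come out more similar to each other than a pillow is to a rock.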

In [1]:
#!pip install --upgrade gensim

In [2]:
#!conda upgrade gensim

In [4]:
import gensim
import numpy as np
# "Meaning" comes about based on how a word relates to many other words.

In [5]:
# Reading in the data

import json

with open('JEOPARDY_QUESTIONS1.json') as f:
    data = json.load(f)

In [6]:
# Let's check the datatype of our data
type(data)

list

In [7]:
# And the length

len(data)

216930

In [8]:
# Let's look at the first element in our list

data[0]

{'category': 'HISTORY',
 'air_date': '2004-12-31',
 'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
 'value': '$200',
 'answer': 'Copernicus',
 'round': 'Jeopardy!',
 'show_number': '4680'}

In [9]:
# How many words do we have in our first question?

len(data[0]['question'])  # careful: this counts characters, not words

98

In [10]:
# Let's try that again!
data[0]['question'].split(' ')

["'For",
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life,',
 'Galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 "man's",
 "theory'"]

In [11]:
len(data[0]['question'].split(' '))

18

In [12]:
# Let's count the total number of
# clue words we have.

length = 0

for clue in data:
    length += len(clue['question'].split(' '))
    
length

3169994

## Using Word2Vec

In [13]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [14]:
# Word2Vec is a shallow neural network that learns a vector for
# each word. Here we train it on our own corpus.
# Word2Vec requires that our text have the form of a list
# of 'sentences', where each sentence is itself a list of
# words. How can we put our _Jeopardy!_ clues in that shape?

text = []
for clue in data:
    sentence = clue['question'].translate(str.maketrans('','',
                                                        string.punctuation)).split(' ')
    new_sent = []
    for word in sentence:
        new_sent.append(word.lower())
        
    text.append(new_sent)
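
The same cleaning step can be sketched as a small helper. The name `clean_clue` is hypothetical (it is not part of the notebook or of gensim), and the sample string is shortened from the first clue. One deliberate difference from the loop above: `split()` with no argument splits on any run of whitespace, so doubled spaces do not leave empty-string "words" in the sentence.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def clean_clue(question):
    """Strip punctuation, lowercase, and split on any whitespace."""
    return question.translate(_STRIP_PUNCT).lower().split()

# Note the doubled space after the comma.
sample = "'For the last 8 years of his life,  Galileo was under house arrest'"
tokens = clean_clue(sample)
print(tokens)
```

With a helper like this, the whole corpus could be built as `text = [clean_clue(clue['question']) for clue in data]`.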

In [15]:
# Let's check the new structure of our first clue

text[0]

['for',
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life',
 'galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 'mans',
 'theory']

In [16]:
# Skip-gram: make a prediction about context, given a word.

# Constructing the model is simply a matter of
# instantiating a Word2Vec object.

model = gensim.models.Word2Vec(sentences=text,
                               size=100,  # named `vector_size` in gensim 4.0+
                               sg=1)      # 0 = CBOW (continuous bag of words), 1 = skip-gram

# CBOW uses context to predict a word.
# Skip-gram uses a word to predict its context.

# size: dimensionality of the word vectors
# alpha: initial learning rate of the network
# window: how far to either side of the word to go to grab context

# King + Woman - Man = Queen
# Brother + Woman - Man = Sister
# e.g. the gender vector is the conceptual difference between King and Queen, or Brother and Sister
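
To see what "context" and "window" mean for the two architectures, here is a rough sketch of the training examples each one would draw from a single sentence. The functions `skipgram_pairs` and `cbow_examples` are illustrative names, not gensim functions, and the real training loop adds details (such as subsampling of frequent words and negative sampling) that are left out here.

```python
def skipgram_pairs(sentence, window=2):
    """(center word, one context word) pairs: the word predicts its context."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

def cbow_examples(sentence, window=2):
    """(context words, target word) examples: the context predicts the word."""
    examples = []
    for i, target in enumerate(sentence):
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

sentence = ['galileo', 'was', 'under', 'house', 'arrest']

print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

A larger `window` produces more pairs per word and tends to pull in broader, more topical context.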

In [17]:
# Passing `sentences` to the constructor already trained the model.
# Calling 'train()' runs further passes over the corpus.

model.train(text, total_examples=model.corpus_count, epochs=model.epochs)

(11338462, 15849970)

In [19]:
# Checking word count

model.corpus_total_words

3169994

## model.wv

In [20]:
# The '.wv' attribute stores the word vectors

model.wv

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x1a34ff7128>

In [21]:
# The vectors are keyed by the words

model.wv['child']

array([ 3.96697342e-01, -5.93880773e-01, -2.74347633e-01,  2.53852785e-01,
        1.82011873e-01, -5.10767922e-02, -4.84780967e-01,  2.62110412e-01,
        3.82081300e-01, -1.00361042e-01, -1.78871229e-01, -1.78836599e-01,
        7.54738986e-01, -6.07567489e-01, -2.99261272e-01, -2.87170380e-01,
        1.66734859e-01,  1.82041407e-01, -2.42542922e-01, -4.12691832e-01,
        7.30799586e-02,  1.04826912e-01, -8.53641033e-02, -1.81949705e-01,
       -7.24344015e-01,  3.55688930e-01, -2.60027826e-01,  5.55334166e-02,
       -7.15716378e-05,  3.43768716e-01, -1.15363607e-02, -7.07383826e-03,
        4.51993197e-02, -4.48755533e-01, -2.04059198e-01,  3.43149826e-02,
        3.39572400e-01, -1.19446926e-01, -2.64043808e-01, -6.82634413e-02,
       -1.06936181e+00, -7.91898593e-02, -1.18580468e-01, -2.89856106e-01,
       -6.87765181e-01,  2.50533614e-02,  1.26647681e-01, -2.41489246e-01,
        4.09442484e-01,  9.42368954e-02, -4.11345333e-01, -2.24804029e-01,
        3.89060140e-01, -

### model.wv methods
#### 'most_similar()' and 'similarity()'

In [22]:
model.wv.most_similar('furniture') #cosine similarity

[('artwork', 0.7391272783279419),
 ('decorative', 0.7103092670440674),
 ('pottery', 0.7016913890838623),
 ('linen', 0.6976805925369263),
 ('ceramic', 0.694246768951416),
 ('flooring', 0.6923009157180786),
 ('wicker', 0.691487729549408),
 ('accessory', 0.6886346340179443),
 ('canvas', 0.6861882209777832),
 ('plaster', 0.6851691007614136)]

In [23]:
model.wv.similarity('furniture', 'jewelry') #cosine similarity

0.6710565

In [24]:
# What's most similar to 'cat'?

model.wv.most_similar('cat')

[('dog', 0.742168664932251),
 ('cheetah', 0.6951326131820679),
 ('shorthaired', 0.6944791078567505),
 ('hound', 0.6816151142120361),
 ('mouse', 0.6631695032119751),
 ('rabbit', 0.6609938144683838),
 ('carnivore', 0.6571004986763),
 ('pet', 0.6562229990959167),
 ('terrier', 0.6538774967193604),
 ('parrot', 0.6514949202537537)]

In [25]:
# Let's try the familiar example: King - Man + Woman = Queen

model.wv.most_similar(positive=['king','woman'],
                      negative=['man'], #must be list for single word, otherwise it uses list of characters
                      topn=10)

[('queen', 0.6917461156845093),
 ('princess', 0.6278905868530273),
 ('noor', 0.5958117246627808),
 ('aquitaine', 0.5931550860404968),
 ('iv', 0.5787621736526489),
 ('isabella', 0.573489785194397),
 ('prince', 0.569932222366333),
 ('aragon', 0.5671788454055786),
 ('throne', 0.563604474067688),
 ('margrethe', 0.5608233213424683)]
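
The arithmetic behind this kind of query can be sketched with toy two-dimensional vectors. The dimensions (roughly "royalty" and "gender") and every value below are invented for illustration, and `analogy` is a hypothetical helper. gensim's own implementation differs in its details, but the idea is the same: add the positive vectors, subtract the negative ones, and return the nearest remaining word by cosine similarity.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors along (royalty, gender).
toy = {
    'king':  [0.9, 0.9],
    'queen': [0.9, -0.9],
    'man':   [0.1, 0.9],
    'woman': [0.1, -0.9],
    'apple': [-0.5, 0.0],
}

def analogy(positive, negative, vectors):
    """Nearest word to sum(positive) - sum(negative), skipping the query words."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

result = analogy(positive=['king', 'woman'], negative=['man'], vectors=toy)
print(result)
```

Skipping the query words matters: without it, the nearest neighbor of the combined vector is often just one of the inputs.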

In [26]:
# Shakespeare

model.wv.most_similar(['shakespeare'])

[('sophocles', 0.755678653717041),
 ('euripides', 0.7241655588150024),
 ('shakespeares', 0.7147544622421265),
 ('ibsen', 0.6938177943229675),
 ('moliere', 0.6914081573486328),
 ('falstaff', 0.6863422989845276),
 ('shaws', 0.6812687516212463),
 ('shakespearean', 0.681247353553772),
 ('fairies', 0.6790329217910767),
 ('aeschylus', 0.678644597530365)]

In [27]:
# Greg

model.wv.most_similar(['greg'])

[('kinnear', 0.8448551893234253),
 ('shaun', 0.8085587620735168),
 ('prinze', 0.7852723598480225),
 ('conner', 0.7843407988548279),
 ('connors', 0.7816767692565918),
 ('walston', 0.7802096605300903),
 ('ari', 0.7777183651924133),
 ('langham', 0.7763891220092773),
 ('shoeless', 0.7717835903167725),
 ('baxter', 0.7701588869094849)]

In [28]:
# Washington

model.wv.most_similar(['washington'])

[('dc', 0.8206110596656799),
 ('dcs', 0.6679648160934448),
 ('washingtons', 0.6554760932922363),
 ('p3', 0.6474977731704712),
 ('newseum', 0.6419258713722229),
 ('dca', 0.6341737508773804),
 ('virginia', 0.6280378103256226),
 ('arlington', 0.6277400851249695),
 ('statuary', 0.616987943649292),
 ('abilene', 0.609310507774353)]

#### 'doesnt_match()'

In [29]:
model.wv.doesnt_match(['breakfast', 'lunch','frog']) #returns frog because it's UNRELATED

'frog'

In [30]:
model.wv.doesnt_match(['tree','flower','plant','toothbrush'])

'toothbrush'
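
One way to think about how the odd word gets picked: normalize each word's vector, average them, and return the word least similar to that average. The sketch below does this on invented two-dimensional vectors (roughly "meal-ness" and "animal-ness"). The helper `odd_one_out` is a hypothetical name and only approximates what the library does.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical vectors along (meal-ness, animal-ness).
toy = {
    'breakfast': [0.9, 0.1],
    'lunch':     [0.8, 0.2],
    'frog':      [0.1, 0.9],
    'toad':      [0.2, 0.8],
}

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    unit_vecs = [unit(vectors[w]) for w in words]
    dims = len(unit_vecs[0])
    mean = unit([sum(v[d] for v in unit_vecs) / len(unit_vecs) for d in range(dims)])
    sims = [sum(a * b for a, b in zip(v, mean)) for v in unit_vecs]
    return words[sims.index(min(sims))]

print(odd_one_out(['breakfast', 'lunch', 'frog'], toy))
```

Notice that the answer depends on the group: 'frog' is the outlier among meals, but 'lunch' would be the outlier among amphibians.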

#### 'closer_than()'

In [31]:
# Which words are closer to 'king' than 'queen' is?

model.wv.closer_than('king','queen')

['prince',
 'emperor',
 'kings',
 'iii',
 'ruler',
 'iv',
 'vi',
 'vii',
 'tudor',
 'ix',
 'darius',
 'haakon',
 'olaf',
 'canute']
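
Conceptually, this query keeps every vocabulary word whose similarity to the first word beats the similarity between the two given words. A minimal sketch on invented vectors, with `closer_than` written here as a plain function rather than the gensim method:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors; the values are made up for illustration.
toy = {
    'king':    [0.9, 0.9],
    'queen':   [0.9, -0.3],
    'prince':  [0.8, 0.9],
    'emperor': [1.0, 0.8],
    'apple':   [-0.5, 0.1],
}

def closer_than(w1, w2, vectors):
    """Words that are more similar to w1 than w2 is."""
    threshold = cosine_similarity(vectors[w1], vectors[w2])
    return [w for w in vectors
            if w not in (w1, w2)
            and cosine_similarity(vectors[w1], vectors[w]) > threshold]

closer = closer_than('king', 'queen', toy)
print(closer)
```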

#### 'distance()'

In [40]:
# For this it will make more sense to
# normalize our vectors: divide each row of the
# vector matrix by its Euclidean length.

norm_vecs = model.wv.vectors / np.linalg.norm(model.wv.vectors, axis=1, keepdims=True)

In [33]:
model.wv.distance('king', 'king')

0.0

In [34]:
model.wv.distance('joy', 'happiness')

0.42578697204589844
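
The distance reported here is cosine distance, i.e. one minus the cosine similarity. A small sketch on made-up vectors shows the relationship, and also shows that rescaling a vector leaves the cosine distance unchanged. The vectors and the helper names are assumptions for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(u, v):
    """1 - cosine similarity, computed on unit-length copies of u and v."""
    u_hat, v_hat = normalize(u), normalize(v)
    return 1.0 - sum(a * b for a, b in zip(u_hat, v_hat))

# Hypothetical two-dimensional stand-ins for real word vectors.
joy = [2.0, 1.0]
happiness = [1.0, 2.0]
scaled_joy = [10 * x for x in joy]

print(cosine_distance(joy, joy))
print(cosine_distance(joy, happiness))
print(cosine_distance(scaled_joy, happiness))
```

A word is at distance 0 from itself (up to floating-point rounding), and a distance of 1 would mean the two vectors are orthogonal.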

#### 'evaluate_word_analogies()'

Check out [this text file](https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt)!

In [35]:
relatives = model.wv.evaluate_word_analogies(
    'https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt')[1][4]


In [36]:
len(relatives['correct'])

136

In [37]:
len(relatives['incorrect'])

284

In [38]:
relatives['correct'][:5]

[('BOY', 'GIRL', 'BROTHER', 'SISTER'),
 ('BOY', 'GIRL', 'BROTHERS', 'SISTERS'),
 ('BOY', 'GIRL', 'FATHER', 'MOTHER'),
 ('BOY', 'GIRL', 'GRANDSON', 'GRANDDAUGHTER'),
 ('BOY', 'GIRL', 'HE', 'SHE')]

In [39]:
relatives['incorrect'][:5]

[('BOY', 'GIRL', 'DAD', 'MOM'),
 ('BOY', 'GIRL', 'GRANDFATHER', 'GRANDMOTHER'),
 ('BOY', 'GIRL', 'GRANDPA', 'GRANDMA'),
 ('BOY', 'GIRL', 'GROOM', 'BRIDE'),
 ('BOY', 'GIRL', 'HUSBAND', 'WIFE')]
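
The two counts above can be turned into an accuracy for this section of the analogy file. The helper below is a hypothetical convenience, using the counts printed earlier (136 correct, 284 incorrect):

```python
def section_accuracy(n_correct, n_incorrect):
    """Fraction of analogies answered correctly (0.0 if there were none)."""
    total = n_correct + n_incorrect
    return n_correct / total if total else 0.0

family_accuracy = section_accuracy(136, 284)
print(f"{family_accuracy:.1%}")  # 32.4%
```

In the notebook this would be `section_accuracy(len(relatives['correct']), len(relatives['incorrect']))`. So the model gets roughly a third of these analogies right, which is modest, but the model was trained only on _Jeopardy!_ clues.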