# DSCI 575: Advanced Machine Learning (in the context of Natural Language Processing (NLP) applications)

UBC Master of Data Science program, 2019-20

Instructor: Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]

### Lecture plan
- Course information
- What is NLP?
- Today's lecture: Word embeddings: 
    - Meaning representation
    - Word2Vec Skip-gram
    - Pre-trained embeddings
    - Summary and preview for the next lecture

## Slide settings 

In [3]:
# And import the libraries 

import pandas as pd
import numpy as np
import IPython
import altair as alt
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [4]:
# Thanks to Firas for the following code for making jupyter RISE slides pretty! 
from traitlets.config.manager import BaseJSONConfigManager
from pathlib import Path
path = Path.home() / ".jupyter" / "nbconfig"
cm = BaseJSONConfigManager(config_dir=str(path))
tmp = cm.update(
        "rise",
        {
            "theme": "serif",
            "transition": "fade",
            "start_slideshow_at": "selected",            
            "width": "100%",
            "height": "100%",
            "header": "",
            "footer":"",
            "scroll": True,
            "enable_chalkboard": True,
            "slideNumber": True,
            "center": False,
            "controlsLayout": "edges",
            "slideNumber": True,
            "hash": True,
        }
    )

## Set Altair default size

def theme_fm(*args, **kwargs):
    return {'height': 300,
            'config': {'style': {'circle': {'size': 400},
                                'point': {'size': 30},
                                'square': {'size': 400},
                                },
                       'legend': {'symbolSize': 20, 'titleFontSize': 20, 'labelFontSize': 20}, 
                       'axis': {'titleFontSize': 20, 'labelFontSize': 20}},
            }

alt.themes.register('theme_fm', theme_fm)
alt.themes.enable('theme_fm')

ThemeRegistry.enable('theme_fm')

In [5]:
%%HTML
<style>
.rendered_html table, .rendered_html th, .rendered_html tr, .rendered_html td {
     font-size: 130%;
}

body.rise-enabled div.inner_cell>div.input_area {
    font-size: 100%;
}

body.rise-enabled div.output_subarea.output_text.output_result {
    font-size: 100%;
}
body.rise-enabled div.output_subarea.output_text.output_stream.output_stdout {
  font-size: 150%;
}
</style>

In [6]:
import pandas as pd
import numpy as np
import os, sys
from IPython.display import display, HTML

import matplotlib.pyplot as plt

from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

import re
from collections import defaultdict
from collections import Counter

plt.rcParams['font.size'] = 16
sys.path.append('code/.')
from preprocessing import MyPreprocessor
from comat import CooccurrenceMatrix

## Course information

### High-level goals of this course

- Apply machine learning algorithms you have learned so far in interesting applications. 
- Learn new ML algorithms and methods with the theme of NLP applications.
- Prepare you a bit for employment in the NLP area.
- Have fun! 


<img src="images/NLP_in_industry.png" width="900" height="900">

### Topics we'll be covering in this course

### Week 1 

- Representation Learning
- Word vectors and word embeddings

<img src="images/tsne_example.png" height="1000" width="1000"> 

### Week 2

- Markov models
- Hidden Markov models

<img src="images/Markov_autocompletion.png" height="800" width="800"> 

### Markov models application

<img src="images/Markov_chain_applications.png" width="500" height="500">

### Week 3

- Topic modeling (Latent Dirichlet Allocation (LDA))
    - Suppose you are given a large collection of documents and are asked to 
        - Infer different topics in the documents
        - Pull all documents about a certain topic    
- Introduction to Recurrent Neural Networks (RNNs)
<img src="images/TM_food_magazines.png" height="1000" width="1000"> 


### Week 4 

- LSTMs, GRUs 
- RNN applications: Image captioning 

<blockquote>

<img src="images/image_captioning.png" width="1000" height="1000">

<p style="font-size:30px"></p>
</blockquote>    
Source: https://cs.stanford.edu/people/karpathy/sfmltalk.pdf

### ASIDE: [Neural Storyteller](https://github.com/ryankiros/neural-storyteller)

<img src="images/RNN_example.jpg" width="600" height="600">

<blockquote> 
<p style="font-size:30px">We were barely able to catch the breeze at the beach , and it felt as if someone stepped out of my mind . She was in love with him for the first time in months , so she had no intention of escaping . The sun had risen from the ocean , making her feel more alive than normal . She 's beautiful , but the truth is that I do n't know what to do ...</p>
</blockquote>    

Source: https://github.com/ryankiros/neural-storyteller

### That's all about course information. In the next video we will talk about what Natural Language Processing (NLP) is. 



# DSCI 575 Lecture 1: Word Embeddings

UBC Master of Data Science program, 2019-20


#### Today's promise

- We will learn a state-of-the-art method for word "meaning" representation.  

#### Specific learning outcomes

From this class, you will be able to 

- Explain what natural language processing is.
- Explain the general idea of a vector space model.
- Explain the skip-gram model at a high level.
- Explain the difference between sparse and dense word representations.
- Train your own word vectors with `Gensim`. 
- Use word2vec models for word similarity and analogies. 
- Load pre-trained word embeddings.

### What is Natural Language Processing (NLP)?

### What should a search engine return when asked the following question? 


<img src="images/lexical_ambiguity.png" width="1000" height="1000">


### What is Natural Language Processing (NLP)?
#### How often do you search every day? 

<img src="files/images/Google_search.png" width="900" height="900">


### What is Natural Language Processing (NLP)?

<img src="images/WhatisNLP.png" width="800" height="800">

### Why is NLP hard?

- Language is complex and subtle. 
- Language is ambiguous at different levels. 
- Language understanding involves common-sense knowledge and real-world reasoning.

## Example: Lexical ambiguity

<img src="files/images/lexical_ambiguity.png" width="800" height="800">

## Example: Referential ambiguity

<img src="files/images/referential_ambiguity.png" width="800" height="800">

### [Ambiguous news headlines](http://www.fun-with-words.com/ambiguous_headlines.html)

<blockquote>
PROSTITUTES APPEAL TO POPE
</blockquote>    

- Does **appeal to** mean "make a serious or urgent request" or "be attractive or interesting"?

<blockquote>
KICKING BABY CONSIDERED TO BE HEALTHY    
</blockquote> 

- Is **kicking** used as an adjective or a verb?

<blockquote>
MILK DRINKERS ARE TURNING TO POWDER
</blockquote>

- Does **turning to** mean "becoming" or "taking up"?

### Why is NLP hard?

- All the problems related to representation and reasoning in artificial intelligence arise in this domain. 
- For language understanding, we need a representation that captures its "meaning". 

### Word meaning 

- A favourite topic of philosophers for centuries. 
- An example from the legal domain: [Are hockey gloves gloves or "articles of plastics"?](https://www.scc-csc.ca/case-dossier/info/sum-som-eng.aspx?cas=36258)

<blockquote>
Canada (A.G.) v. Igloo Vikski Inc. was a tariff code case that made its way to the SCC (Supreme Court of Canada). The case disputed the definition of hockey gloves as either gloves or as "articles of plastics."
</blockquote>

<img src="images/hockey_gloves_case.png" width="900" height="900">

### Word meaning: NLP view
- Modeling word meaning that allows us to 
    * draw useful inferences to solve meaning-related problems 
    * find relationships between words
        * E.g., which words are similar, which ones have positive or negative connotations
    

### Reminder: One-hot representation

- Build the **vocabulary** containing all unique words from the corpus. 
- Represent each word as **one-hot** encoding.
- A one-hot vector has length $V$ (the vocabulary size), with a 1 at the word's index and 0 everywhere else.
- Example: 
    * Vocabulary size = 10
    * Index of the word *pineapple* = 4
    * One-hot vector for *pineapple*:
    $$\begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\end{bmatrix}$$

### Activity 1:  Brainstorm ways to represent words (~4 mins) 

- Suppose you are building a Question Answering system and you are given the following question and three candidate answers. 
- Think about the following questions.  
    - What kind of relationship between words do we need to capture in order to arrive at the correct answer?  
    - Would one-hot representation help in this context?
    
<blockquote>       
<p style="font-size:30px"><b>Question:</b> How <b>tall</b> is Machu Picchu?</p>
    <p style="font-size:30px"><b>Candidate 1:</b> Machu Picchu is 13.164 degrees south of the equator.</p>    
<p style="font-size:30px"><b>Candidate 2:</b> The official height of Machu Picchu is 2,430 m.</p>
<p style="font-size:30px"><b>Candidate 3:</b> Machu Picchu is 80 kilometres (50 miles) northwest of Cusco.</p>    
</blockquote> 
    

In [7]:
def get_onehot_encoding(word, vocab):
    onehot = np.zeros(len(vocab), dtype='float64')    
    onehot[vocab[word]] = 1
    print('one-hot encoding of the word "%s" is: %s' % (word, str(onehot)))
    return onehot

### Vocabulary and one-hot encoding

In [10]:
# Note: In the NLP community a text data set is referred 
# to as a **corpus** (plural: corpora).
corpus = """make your smoothie special .
          add freshly squeezed pineapple juice in it .
          """
unique_words = list(set(corpus.split()))
unique_words.sort()
vocab = {word: index for index, word in enumerate(unique_words)}
print('Size of the vocabulary: %d' %(len(vocab)))
print(vocab)

word1 = 'pineapple'
onehot_word1 = get_onehot_encoding(word1, vocab)

word2 = 'juice'
onehot_word2 = get_onehot_encoding(word2, vocab)

print("The dot product between %s and %s is %d" % 
      (word1, word2, onehot_word1.dot(onehot_word2)))

Size of the vocabulary: 12
{'.': 0, 'add': 1, 'freshly': 2, 'in': 3, 'it': 4, 'juice': 5, 'make': 6, 'pineapple': 7, 'smoothie': 8, 'special': 9, 'squeezed': 10, 'your': 11}
one-hot encoding of the word "pineapple" is: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
one-hot encoding of the word "juice" is: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
The dot product between pineapple and juice is 0


### Problem with one-hot encoding

-  The problem with this representation is that there is no inherent notion of relationship between words.

<center>
$\vec{pineapple}\cdot\vec{juice} = 0$ 
</center>

### Need a representation that captures relationships between words.

- We will be looking at two such representations.  
    1. Sparse representation with **term-term co-occurrence matrix**
    2. Dense representation with **Word2Vec skip-gram model**
- Both are based on two ideas: **distributional hypothesis** and **vector space model**.

### Distributional hypothesis

<blockquote> 
    <p>You shall know a word by the company it keeps.</p>
    <footer>Firth, 1957</footer>        
</blockquote>

<blockquote> 
If A and B have almost identical environments we say that they are synonyms.
<footer>Harris, 1954</footer>    
</blockquote>    

Example: 

- Her **child** loves to play in the playground. 
- Her **kid** loves to play in the playground. 



### Vector space model

- Model the meaning of a word by placing it into a vector space.  
- A standard way to represent meaning in NLP
- Distances among words in the vector space indicate the relationship between them. 
- Called an "embedding" because it's embedded into a high-dimensional space

<img src="images/t-SNE_word_embeddings.png" width="700" height="700">
(Attribution: Jurafsky and Martin 3rd edition)

### Representation 1: Term-term co-occurrence matrix

### Term-term co-occurrence matrix

- The idea is to go through a corpus of text, keeping a count of all of the words that appear in context of each word (within a window).

- An example: 
<img src="images/term-term_comat.png" width="600" height="600">
(Credit: Jurafsky and Martin 3rd edition)
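
The counting procedure described above can be sketched in a few lines of NumPy. This is only a minimal illustration and not the `CooccurrenceMatrix` class imported earlier in these notes; the function name `build_cooccurrence`, the symmetric window, and the toy sentences are assumptions made for the example.

```python
import numpy as np

def build_cooccurrence(sentences, window=2):
    """Count how often each pair of words appears within `window` tokens of each other."""
    # Map each unique word to a row/column index
    vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}
    counts = np.zeros((len(vocab), len(vocab)), dtype=int)
    for sent in sentences:
        for pos, word in enumerate(sent):
            # Context = up to `window` tokens on each side of the current word
            lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[vocab[word], vocab[sent[ctx_pos]]] += 1
    return vocab, counts

# Hypothetical toy corpus, already tokenized
toy_sentences = [["add", "pineapple", "juice"], ["squeezed", "pineapple", "juice"]]
toy_vocab, toy_counts = build_cooccurrence(toy_sentences, window=1)
print(toy_counts[toy_vocab["pineapple"], toy_vocab["juice"]])  # 2
```

Each row of `toy_counts` is the sparse vector for one word. Because the window is symmetric, the matrix is symmetric too.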


### Visualizing word vectors and similarity 

<img src="images/word_vectors_and_angles.png" width="700" height="700">
(Credit: Jurafsky and Martin 3rd edition)

- The similarity is calculated using dot products between word vectors.
    - Example: $\vec{\text{digital}} \cdot \vec{\text{information}} = 0 \times 1 + 1\times 6 = 6$
    - The higher the dot product, the more similar the words.

- We can also calculate a normalized version of dot products. 
    $$\text{similarity}_{\text{cosine}}(w_1,w_2) = \frac{w_1 \cdot w_2}{\left\lVert w_1\right\rVert_2 \left\lVert w_2\right\rVert_2}$$
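
As a quick check of the difference between the raw and the normalized score, the sketch below uses the two-dimensional *digital* and *information* vectors from the dot-product example above. The helper name `cosine_sim` is an assumption for illustration.

```python
import numpy as np

def cosine_sim(u, v):
    """Dot product divided by the product of the L2 norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

digital = np.array([0.0, 1.0])
information = np.array([1.0, 6.0])

print(np.dot(digital, information))      # raw dot product: 6.0
print(cosine_sim(digital, information))  # normalized score, at most 1
# Scaling a vector changes the dot product but not the cosine similarity
print(np.dot(digital, 10 * information), cosine_sim(digital, 10 * information))
```

The normalization matters because frequent words have long count vectors, so their raw dot products are large whether or not the words are related.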


In [11]:
### Let's build term-term co-occurrence matrix for our text. 
corpus = ["How tall is Machu Picchu?",
          "Machu Picchu is 13.164 degrees south of the equator.", 
          "The official height of Machu Picchu is 2,430 m.",
          "Machu Picchu is 80 kilometres (50 miles) northwest of Cusco.",
          "It is 80 kilometres (50 miles) northwest of Cusco, on the crest of the mountain Machu Picchu, located about 2,430 metres (7,970 feet) above mean sea level, over 1,000 metres (3,300 ft) lower than Cusco, which has an elevation of 3,400 metres (11,200 ft)."
         ]
pp = MyPreprocessor()
pp_corpus = pp.preprocess_corpus(corpus)
cm = CooccurrenceMatrix(pp_corpus)
vocab, comat = cm.fit_transform()
words = [key for key, value in sorted(vocab.items(), 
                                      key = lambda item: (item[1],item[0]))]
df = pd.DataFrame(comat.todense(), columns = words, 
                  index = words, dtype = np.int8)
df.head()

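| | tall | machu | picchu | 13.164 | degrees | south | equator | official | height | 2,430 | ... | mean | sea | level | 1,000 | 3,300 | ft | lower | elevation | 3,400 | 11,200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tall | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| machu | 1 | 0 | 5 | 1 | 1 | 0 | 0 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| picchu | 1 | 5 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13.164 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| degrees | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |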


In [12]:
from sklearn.metrics.pairwise import cosine_similarity
def similarity(word1, word2): 
    """
    Prints the dot product and cosine similarity between word1 and word2.

    Arguments
    ---------
    word1 -- (str)
        The first word
    word2 -- (str)
        The second word
        
    Returns
    --------
    None
    """
    # Convert the sparse word vectors to flat 1-D NumPy arrays
    v1 = np.asarray(cm.get_word_vector(word1).todense()).ravel()
    v2 = np.asarray(cm.get_word_vector(word2).todense()).ravel()
    # cosine_similarity expects 2-D inputs and returns a 1x1 array
    cos_sim = cosine_similarity(v1.reshape(1, -1), v2.reshape(1, -1))[0, 0]
    print('The dot product between %s and %s is %0.2f and cosine similarity is %0.2f' 
          %(word1, word2, v1.dot(v2), cos_sim))
    
similarity('tall', 'height')
similarity('tall', 'official')
### Not very reliable similarity scores because we used only 5 short sentences.     

The dot product between tall and height is 2.00 and cosine similarity is 0.71
The dot product between tall and official is 2.00 and cosine similarity is 0.82


### Break (~5 mins)

### Representation 2: Dense word embeddings

### Sparse vs. dense word vectors

- Term-term co-occurrence matrices are long and sparse. 
    - length |V| is usually large (e.g., > 50,000) 
    - most elements are zero
- OK because there are efficient ways to deal with sparse matrices.


### Alternative 
- Learn short (~100 to 1000 dimensions) and dense vectors. 
- Short vectors may be easier to train with ML models (fewer weights to train).
- They may generalize better.
- In practice they work much better! 

### Word2Vec 

- A family of algorithms to create dense word embeddings
<img src="images/word2vec.png" width="1000" height="1000">

In [13]:
# Load Google's pre-trained Word2Vec model.
# You can download them from here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
# (Under the pre-trained embeddings section on this page: https://code.google.com/archive/p/word2vec/)
# You'll need to install gensim (https://radimrehurek.com/gensim/)
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('/Users/kvarada/MDS/2019-20/575/data/GoogleNews-vectors-negative300.bin', binary=True)

In [14]:
# Note: `model.vocab` is the gensim < 4.0 API; in gensim >= 4.0 use `model.key_to_index`.
print('Size of vocabulary: ', len(model.vocab))
word_pairs = [('height','tall'),
              ('pineapple','mango'), 
              ('pineapple','juice'), 
              ('sun','robot'), 
              ('GPU','lion')]
for pair in word_pairs: 
    print('The similarity between %s and %s is %0.3f' %(pair[0], pair[1], model.similarity(pair[0], pair[1])))

Size of vocabulary:  3000000
The similarity between height and tall is 0.473
The similarity between pineapple and mango is 0.668
The similarity between pineapple and juice is 0.418
The similarity between sun and robot is 0.029
The similarity between GPU and lion is 0.002


In [15]:
model.most_similar('UBC')

[('UVic', 0.7886475324630737),
 ('SFU', 0.7588527202606201),
 ('Simon_Fraser', 0.7356574535369873),
 ('UFV', 0.688043475151062),
 ('VIU', 0.6778583526611328),
 ('Kwantlen', 0.677142858505249),
 ('UBCO', 0.6734488010406494),
 ('UPEI', 0.6731126308441162),
 ('UBC_Okanagan', 0.6709134578704834),
 ('Lakehead_University', 0.6622507572174072)]

#### Activity 2: Try out Word similarity with the code above (~4 mins)

- Take a moment here and try out some words to find most similar words. To get you started here are some words: *Vancouver, bread, Computer_Science*

### Word2Vec 

- A family of models to obtain dense word vectors.

- Two primary algorithms 
    - **Skip-gram**
    - Continuous bag of words (CBOW)
- Two moderately efficient training methods 
    - Hierarchical softmax
    - Negative sampling 
       

### Skip-gram

- A neural network model to obtain robust and dense representations of words. 

### Fake word-prediction task 

- Given a target word (i.e., center word), predict context words (i.e., surrounding words). 
- Note that we are using "target" in a different sense here compared to how we use it in supervised machine learning.  
<blockquote>
    Add freshly squeezed$_{context}$ pineapple$_{target}$ juice$_{context}$ to your smoothie. 
</blockquote> 

<img src="images/target_context.png" width="300" height="300">

- So in the example above given the target word **pineapple**, predict whether: 
    - **juice** is likely to occur in the context of **pineapple**
    - **squeezed** is likely to occur in the context of **pineapple** 

### Skip-gram objective
- Consider the conditional probabilities $p(w_c|w_t)$ and set the parameters $\theta$ of $p(w_c|w_t; \theta)$ so as to maximize the corpus probability. 

$$\arg \max\limits_\theta \prod\limits_{(w_c,w_t) \in D} p(w_c|w_t;\theta)$$


- $w_t$ &rarr; target word
- $w_c$ &rarr; context word
- $D$ &rarr; the set of all (context, target) word pairs from the text. 
- $V$ &rarr; vocabulary
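
To make the set $D$ concrete, here is a minimal sketch that enumerates (context, target) pairs with a sliding window. The function name `skipgram_pairs`, the window size of 1, and the toy sentence are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """Return (context, target) pairs for every token and its neighbours within `window`."""
    pairs = []
    for pos, target in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                pairs.append((tokens[ctx_pos], target))
    return pairs

tokens = "add freshly squeezed pineapple juice".split()
D = skipgram_pairs(tokens, window=1)
print([pair for pair in D if pair[1] == "pineapple"])
# [('squeezed', 'pineapple'), ('juice', 'pineapple')]
```

Every pair in `D` contributes one factor $p(w_c|w_t;\theta)$ to the product in the objective.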

### Skip-gram objective

- Model the conditional probability using the softmax of the dot product.
    * The higher the dot product, the higher the probability, and vice versa.
    

$$p(w_c|w_t;\theta) = \frac{\exp(w_c \cdot w_t)}{\sum\limits_{c' \in V} \exp(w_{c'} \cdot w_t)}$$

- Substituting the softmax of the dot product for the conditional probability: 
$$ \arg \max\limits_\theta \prod\limits_{(w_c,w_t) \in D} p(w_c|w_t;\theta) = \arg \max\limits_\theta \prod\limits_{(w_c,w_t) \in D}\frac{\exp(w_c \cdot w_t)}{\sum\limits_{c' \in V} \exp(w_{c'} \cdot w_t)}$$
- Assumption: Maximizing this objective on a large corpus will result in meaningful embeddings for all words in the vocabulary. 
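
The softmax over dot products can be sketched as follows. The vocabulary size, embedding dimension, and random vectors are toy assumptions; in a trained model the two matrices would hold the learned target and context vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                             # toy vocabulary size and embedding dimension (assumed)
target_vecs = rng.normal(size=(V, d))   # one vector per word when it is the target
context_vecs = rng.normal(size=(V, d))  # one vector per word when it is a context word

def p_context_given_target(t):
    """Softmax over the dot products of every context vector with target vector t."""
    scores = context_vecs @ target_vecs[t]
    scores = scores - scores.max()      # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = p_context_given_target(2)
print(probs, probs.sum())               # a distribution over the V words; sums to 1
```

The denominator sums over the whole vocabulary, which is why the full softmax becomes expensive for realistic vocabulary sizes.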

### How do we do it?

- We use a neural network architecture with 
    - an input layer
    - a hidden layer
    - an output layer 
- We use the softmax activation function for the output layer. 
    


### Example 

<img src="images/skipgram_0.png" width="1000" height="1000">

### Input layer and "gold" 

<img src="images/skipgram_1.png" width="1000" height="1000">

### Hidden layer

<img src="images/skipgram_2.png" width="1000" height="1000">

### What will be the dimensions of the weight matrix between input and hidden layers?

1. $10000 \times 1$
2. $10000 \times 300$
3. $300 \times 300$


<img src="images/skipgram_2.png" width="700" height="700">

### Hidden layer and output layer 

<img src="images/skipgram_3.png" width="1000" height="1000">

### What will be the dimensions of the weight matrix between hidden and output layers?

1. $10000 \times 1$
2. $300 \times 10000$
3. $300 \times 300$


<img src="images/skipgram_3.png" width="800" height="800">

### Softmax activation function 

- Apply softmax to get probability distribution 

<img src="images/skipgram_4.png" width="1000" height="1000">

### Compare prediction ($\hat{y}$) with "gold" ($y$)

- Learn weights using backpropagation and gradient descent. 
- We want a number closer to 1 in the prediction at index 5,428
    - Loss is high!

<img src="images/skipgram_5.png" width="800" height="800">
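
The forward pass walked through in the slides above can be sketched in NumPy for a single (target, context) pair. The sizes $V = 10{,}000$ and $d = 300$ and the context index 5,428 come from the slides; the target index, the random initialization, and the use of cross-entropy as the loss are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 300                             # sizes used in the slides
W_in = rng.normal(scale=0.01, size=(V, d))     # input -> hidden weights
W_out = rng.normal(scale=0.01, size=(d, V))    # hidden -> output weights

target_idx, context_idx = 42, 5428             # target index is an arbitrary assumption

x = np.zeros(V)
x[target_idx] = 1.0                            # one-hot input for the target word
hidden = x @ W_in                              # shape (d,); equals row target_idx of W_in
scores = hidden @ W_out                        # shape (V,); one score per vocabulary word
scores -= scores.max()                         # numerical stability
y_hat = np.exp(scores) / np.exp(scores).sum()  # softmax prediction

loss = -np.log(y_hat[context_idx])             # cross-entropy against the one-hot "gold" vector
print(hidden.shape, y_hat.shape, loss)
```

With untrained weights the prediction is close to uniform, so the loss is near $\log V$. Training adjusts `W_in` and `W_out` to push `y_hat[context_idx]` towards 1.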


### Skip-gram model for two target-context pairs 
<center>
<img src="images/skip-gram.png" width="1000" height="1000">
</center>

### Parameters to learn

- Given a corpus with a vocabulary of size $V$, where a word $w_i$ is identified by its index $i \in \{1, \dots, V\}$, learn a vector representation for each $w_i$ by predicting the words that appear in its context. 
- Learn the following parameters of the model: a $d$-dimensional target vector and a $d$-dimensional context vector for every word.
    - Suppose $V = 10{,}000$ and $d = 300$. The number of parameters to learn is $2dV = 6{,}000{,}000$! 

$$
\theta = 
\begin{bmatrix} \text{aardvark}_t\\
                \text{aback}_t\\
                \vdots\\
                \text{zymurgy}_t\\
                \text{aardvark}_c\\
                \text{aback}_c\\
                \vdots\\
                \text{zymurgy}_c
\end{bmatrix} \in \mathbb{R}^{2dV}
$$
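
The parameter count is simple arithmetic and can be checked directly. The variable names are assumptions for illustration.

```python
V, d = 10_000, 300
n_target = V * d    # one d-dimensional vector per word as a target
n_context = V * d   # one d-dimensional vector per word as a context
print(n_target + n_context)  # 6000000
```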


### Main hyperparameters of the model

- Dimensionality of the word vectors 
- Window size
    * shorter window: more syntactic representation
    * longer window: more semantic representation 
- Number of negative samples $k$ (when training with negative sampling)
    * Mikolov et al. (2013) suggest setting this parameter in the range 5 to 20 for small training datasets and in the range 2 to 5 for large training datasets.

### Video lecture1.5

### Training word2vec embeddings 

- [Original C code](https://code.google.com/archive/p/word2vec/) 
- [GitHub version of the code](https://github.com/tmikolov/word2vec)
- [Gensim](https://radimrehurek.com/gensim/), an open-source Python library that provides a Python interface for the word2vec family of algorithms

In [18]:
### First let's preprocess the corpus. 
### We have already done this for the sparse representation.  
### Let's reuse it. 
print("\ncorpus:\n", corpus)
print("\nPreprocessed corpus: \n", pp_corpus)


corpus:
 ['How tall is Machu Picchu?', 'Machu Picchu is 13.164 degrees south of the equator.', 'The official height of Machu Picchu is 2,430 m.', 'Machu Picchu is 80 kilometres (50 miles) northwest of Cusco.', 'It is 80 kilometres (50 miles) northwest of Cusco, on the crest of the mountain Machu Picchu, located about 2,430 metres (7,970 feet) above mean sea level, over 1,000 metres (3,300 ft) lower than Cusco, which has an elevation of 3,400 metres (11,200 ft).']

Preprocessed corpus: 
 [['tall', 'machu', 'picchu'], ['machu', 'picchu', '13.164', 'degrees', 'south', 'equator'], ['official', 'height', 'machu', 'picchu', '2,430'], ['machu', 'picchu', '80', 'kilometres', '50', 'miles', 'northwest', 'cusco'], ['80', 'kilometres', '50', 'miles', 'northwest', 'cusco', 'crest', 'mountain', 'machu', 'picchu', 'located', '2,430', 'metres', '7,970', 'feet', 'mean', 'sea', 'level', '1,000', 'metres', '3,300', 'ft', 'lower', 'cusco', 'elevation', '3,400', 'metres', '11,200', 'ft']]


In [19]:
# Let's build a word2vec model on our tiny Machu Picchu corpus.
# This is just for demonstration. It won't give any meaningful relationships 
# because the corpus is so small. 
import gensim
from gensim.models import Word2Vec
# Note: `size` is the gensim < 4.0 name for the embedding dimension;
# gensim >= 4.0 calls it `vector_size`.
model = Word2Vec(pp_corpus, 
                 size=100, 
                 window=4, 
                 min_count=1)

# What does a learned dense word vector look like? 
model.wv['tall']

array([-1.8686343e-03, -4.3829111e-03,  8.2461245e-04, -2.5939611e-03,
        3.8368141e-03, -6.4964296e-04, -1.1072899e-03, -1.6571992e-03,
       -2.0547302e-03, -1.0690627e-03, -3.4108125e-03,  2.7660755e-03,
        8.8897522e-04,  1.9116396e-03,  1.3635573e-03,  8.5409568e-04,
        2.1677304e-03,  2.7105369e-04,  2.0211251e-03,  3.6618642e-03,
       -9.9170825e-04, -1.3098032e-03,  1.0668045e-03, -1.0649072e-03,
       -2.9500048e-03, -2.9360277e-03,  3.1440256e-03, -2.2108371e-03,
       -5.1903533e-04, -3.2085821e-03,  3.9160475e-03,  1.1417432e-04,
       -4.1589267e-03, -3.9233039e-03, -3.6281554e-03,  3.3286007e-03,
        2.6871550e-03, -1.6688406e-04, -8.8384928e-05,  2.6939563e-03,
       -3.0400269e-03,  1.6563045e-03, -1.4687789e-03, -6.1041053e-04,
        3.8005984e-03,  2.4170585e-03, -1.7380904e-03,  4.3574700e-04,
       -1.6147987e-03, -4.7198241e-03,  4.2445287e-03, -2.4527828e-03,
        4.8504239e-03,  1.7698731e-03, -4.0522716e-03,  4.8034373e-03,
      

### Other popular methods to get embeddings

### [fastText](https://fasttext.cc/)

- NLP library by Facebook research  
- Includes an algorithm which is an extension to Word2Vec
- Helps deal with unknown words elegantly
- Breaks words into several n-gram subwords 
- Example: trigram sub-words for *berry* are *ber*, *err*, *rry*
- Embedding(*berry*) = embedding(*ber*) + embedding(*err*) + embedding(*rry*)
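
The subword idea can be sketched as follows. This follows the simplified description above. The actual fastText model additionally adds word-boundary markers and uses several n-gram lengths. The helper names and the random 4-dimensional subword vectors are assumptions for illustration.

```python
import numpy as np

def char_ngrams(word, n=3):
    """All contiguous character n-grams of `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("berry"))  # ['ber', 'err', 'rry']

rng = np.random.default_rng(0)
# Hypothetical 4-dimensional subword embeddings, random for illustration only
subword_vecs = {g: rng.normal(size=4) for g in char_ngrams("berry") + char_ngrams("cherry")}

def word_vector(word):
    """Sum of the embeddings of the word's character trigrams."""
    return np.sum([subword_vecs[g] for g in char_ngrams(word)], axis=0)

berry_vec = word_vector("berry")
print(berry_vec)
```

Because *berry* and *cherry* share the trigrams *err* and *rry*, their vectors share components. This is also how a vector can be composed for a word that was never seen in training.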

### [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)
- Starts with the co-occurrence matrix
    - Co-occurrence can be interpreted as an indicator of semantic proximity of words
- Takes advantage of global count statistics    
- Predicts co-occurrence ratios
- Loss based on word frequency
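
For reference, the weighted least-squares objective from the GloVe paper can be written as below. The notation follows that paper rather than these slides: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that caps the influence of very frequent pairs.

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2, \qquad f(x) = \begin{cases}(x/x_{\max})^{\alpha} & \text{if } x < x_{\max}\\ 1 & \text{otherwise}\end{cases}$$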

### Pre-trained embeddings

- Training embeddings is computationally expensive
- For typical corpora, the vocabulary size is greater than 100,000.  
- If the size of embeddings is 300, the number of parameters of the model is $2 \times 30,000,000$. 
- So people have trained embeddings on huge corpora and made them available.  

### Pre-trained embeddings

A number of pre-trained word embeddings are available. The most popular ones are:  

- [Word2Vec](https://code.google.com/archive/p/word2vec/)
    * trained on several corpora using the word2vec algorithm 
- [wikipedia2vec](https://wikipedia2vec.github.io/wikipedia2vec/pretrained/)
    * pretrained embeddings for 12 languages 
- [GloVe](https://nlp.stanford.edu/projects/glove/)
    * trained using [the GloVe algorithm](https://nlp.stanford.edu/pubs/glove.pdf) 
    * published by Stanford University 
- [fastText pre-trained embeddings for 294 languages](https://fasttext.cc/docs/en/pretrained-vectors.html) 
    * trained using [the fastText algorithm](http://aclweb.org/anthology/Q17-1010)
    * published by Facebook    

In [20]:
# Load Google's pre-trained Word2Vec model.
# You can download them from here: https://code.google.com/archive/p/word2vec/
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('/Users/kvarada/MDS/2019-20/575/data/GoogleNews-vectors-negative300.bin', binary=True)

In [21]:
# Note: `model.vocab` is the gensim < 4.0 API; in gensim >= 4.0 use `model.key_to_index`.
print('Size of vocabulary: ', len(model.vocab))
word_pairs = [('height','tall'),
              ('pineapple','mango'), 
              ('pineapple','juice'), 
              ('sun','robot'), 
              ('GPU','lion')]
for pair in word_pairs: 
    print('The similarity between %s and %s is %0.3f' %(pair[0], pair[1], model.similarity(pair[0], pair[1])))

Size of vocabulary:  3000000
The similarity between height and tall is 0.473
The similarity between pineapple and mango is 0.668
The similarity between pineapple and juice is 0.418
The similarity between sun and robot is 0.029
The similarity between GPU and lion is 0.002


### Finding similar words 

- Given word $w$, search in the vector space for the word closest to $w$ as measured by cosine distance. 

In [22]:
model.most_similar('UBC')

[('UVic', 0.7886475324630737),
 ('SFU', 0.7588527202606201),
 ('Simon_Fraser', 0.7356574535369873),
 ('UFV', 0.688043475151062),
 ('VIU', 0.6778583526611328),
 ('Kwantlen', 0.677142858505249),
 ('UBCO', 0.6734488010406494),
 ('UPEI', 0.6731126308441162),
 ('UBC_Okanagan', 0.6709134578704834),
 ('Lakehead_University', 0.6622507572174072)]

In [23]:
# Captures different contracted forms and misspelled occurrences 
model.most_similar('information')

[('info', 0.7363681793212891),
 ('infomation', 0.6800296306610107),
 ('infor_mation', 0.6733849048614502),
 ('informaiton', 0.6639008522033691),
 ('informa_tion', 0.660125732421875),
 ('informationon', 0.6339334845542908),
 ('informationabout', 0.6320979595184326),
 ('Information', 0.6186580657958984),
 ('informaion', 0.6093292236328125),
 ('details', 0.6063088774681091)]

### Success of Word2Vec

- Able to capture complex relationships between words.
- Example: What is the word that is similar to **WOMAN** in the same sense as **KING** is similar to **MAN**?
- Perform simple algebraic operations on the vector representations of words.
    $\vec{X} = \vec{\text{KING}} - \vec{\text{MAN}} + \vec{\text{WOMAN}}$
- Search in the vector space for the word closest to $\vec{X}$ as measured by cosine distance.

<img src="images/word_analogies1.png" width="500" height="500">
(Credit: Mikolov et al. 2013)    
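
The vector arithmetic can be illustrated with hand-made vectors. The 2-dimensional vectors below are hypothetical and chosen so that the analogy works; real embeddings have hundreds of dimensions and are learned from data. Note that gensim's `most_similar` also leaves the query words out of its results, which the sketch imitates.

```python
import numpy as np

# Hypothetical 2-d vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vecs = {
    "man":   np.array([0.1,  1.0]),
    "woman": np.array([0.1, -1.0]),
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine_sim(u, v):
    """Dot product divided by the product of the L2 norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

x = toy_vecs["king"] - toy_vecs["man"] + toy_vecs["woman"]   # approximately [1, -1]
# Rank the remaining words by cosine similarity to x (input words are excluded)
candidates = {w: cosine_sim(x, v) for w, v in toy_vecs.items()
              if w not in {"king", "man", "woman"}}
best = max(candidates, key=candidates.get)
print(best)  # queen
```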


In [24]:
def analogy(word1, word2, word3, model = model):
    '''    
    Returns analogy word using the given model. 
    
    Parameters
    --------------
    word1 : (str) 
        word1 in the analogy relation
    word2 : (str)
        word2 in the analogy relation    
    word3 : (str)
        word3 in the analogy relation         
    model : 
        word embedding model
    
    Returns
    ---------------
        pd.dataframe
    '''
    print('%s : %s :: %s : ?' %(word1, word2, word3))
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1])
    return pd.DataFrame(sim_words, columns=['Analogy word', 'Score'])

In [25]:
analogy('man','king','woman')

man : king :: woman : ?


| | Analogy word | Score |
|---|---|---|
| 0 | queen | 0.711819 |
| 1 | monarch | 0.618967 |
| 2 | princess | 0.590243 |
| 3 | crown_prince | 0.549946 |
| 4 | prince | 0.537732 |
| 5 | kings | 0.523684 |
| 6 | Queen_Consort | 0.523595 |
| 7 | queens | 0.518113 |
| 8 | sultan | 0.509859 |
| 9 | monarchy | 0.508741 |


In [26]:
analogy('Montreal', 'Canadiens', 'Vancouver')

Montreal : Canadiens :: Vancouver : ?


| | Analogy word | Score |
|---|---|---|
| 0 | Canucks | 0.821327 |
| 1 | Vancouver_Canucks | 0.750401 |
| 2 | Calgary_Flames | 0.70547 |
| 3 | Leafs | 0.695783 |
| 4 | Maple_Leafs | 0.691617 |
| 5 | Thrashers | 0.687504 |
| 6 | Avs | 0.681716 |
| 7 | Sabres | 0.665307 |
| 8 | Blackhawks | 0.664625 |
| 9 | Habs | 0.661023 |


In [27]:
### Recall the title of today's lesson 
analogy('Toronto', 'UofT', 'Vancouver')

Toronto : UofT :: Vancouver : ?


| | Analogy word | Score |
|---|---|---|
| 0 | SFU | 0.579245 |
| 1 | UVic | 0.576921 |
| 2 | UBC | 0.571431 |
| 3 | Simon_Fraser | 0.543464 |
| 4 | Langara_College | 0.541347 |
| 5 | UVIC | 0.520495 |
| 6 | Grant_MacEwan | 0.517273 |
| 7 | UFV | 0.51415 |
| 8 | Ubyssey | 0.510421 |
| 9 | Kwantlen | 0.503807 |


### Examples of semantic and syntactic relationships

<center>
<img src="files/images/word_analogies2.png" width="800" height="800">
(Credit: Mikolov 2013)
</center>

### Implicit biases and stereotypes in word embeddings

- Word embeddings reflect gender stereotypes present in broader society.
- They may also amplify these stereotypes because of their widespread usage. 
- See the paper [Man is to Computer Programmer as Woman is to ...](http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf).

In [28]:
analogy('man', 'computer_programmer', 'woman')

man : computer_programmer :: woman : ?


| | Analogy word | Score |
|---|---|---|
| 0 | homemaker | 0.562712 |
| 1 | housewife | 0.510505 |
| 2 | graphic_designer | 0.50518 |
| 3 | schoolteacher | 0.497949 |
| 4 | businesswoman | 0.493489 |
| 5 | paralegal | 0.492551 |
| 6 | registered_nurse | 0.490797 |
| 7 | saleswoman | 0.488163 |
| 8 | electrical_engineer | 0.479773 |
| 9 | mechanical_engineer | 0.47554 |


### Summary

- Vector space model 
    * Modeling word meaning by placing it in a vector space.
    * Distances between words in this vector space indicate the relationships between them. 
- Word embeddings
    * Sparse embeddings using co-occurrence matrix
    * Dense embeddings using word2vec models 
        * Freely available code and pre-trained models 
        * Available for many different languages. 
        * Stereotypes in the society reflected in word embeddings

### Post-assessment

1. Word representations created by a term-term co-occurrence matrix are long and sparse, whereas the ones created by word2vec models are short and dense. True or False? 
2. The skip-gram model predicts context words given a target word. True or False? 
3. Given the following table, which word pair is more similar in terms of dot product: (word 1, word 2) or (word 1, word 3)?

<img src="images/similarity_question.png" width="500" height="500">

4. True or False? Suppose you learn word embeddings for a vocabulary of 20,000 words using Word2Vec. Then each dense word embedding associated with a word is of size 20,000 to make sure that we capture the full range of meaning of that word.
<br><br><br><br><br><br><br><br><br><br>

## Relevant papers

- [Distributed representations of words and phrases and their compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
- [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)
- [Linguistic regularities in continuous space word representations](https://www.aclweb.org/anthology/N13-1090)
- [Enriching Word Vectors with Subword Information](http://aclweb.org/anthology/Q17-1010)

## Fun tools
[wevi: word embedding visual inspector](https://ronxin.github.io/wevi/)