<h1 id="tocheading">Natural Language Understanding DS-GA 1012 Homework 1</h1>
<div id="toc"></div>

__Due Feburary 13, 2019 at 2pm (ET)__

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>

In [237]:
import numpy as np
import itertools
import pandas as pd
import matplotlib.pyplot as plt
import torch, torchvision
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from collections import OrderedDict
import os
from numpy.linalg import norm
import argparse
from collections import Counter
import operator

%matplotlib inline

## Part I: Exploring effect of context size [30 pts]

We face many implicit and explicit design decisions in creating distributional word representations. For example, in lecture and in lab, we created word vectors using a co-occurence matrix built on neighboring pairs of words. We might suspect, however, that we can get more signal of word similarity by considering larger contexts than pairs of words.

### Co-occurence Matrix
__a__. Write `build_cooccurrence_matrix`, which generates the co-occurence matrix for a window of arbitrary size and for the vocabulary of `max_vocab_size` most frequent words. Feel free to modify the code used in lab [10 pts]

In [206]:
def remove_(string, list_of_char):
    for x in list_of_char:
        string = string.replace(str(x), "")
    return string

In [207]:
def text_to_list(filepath, mode="w"):
    """args
        - filepath: path to text file
        - mode: "w" for word, "s" for sentence list
    
    returns:
        - text_list: word or sentence list depending on the mode"""
    
    text = open(filepath, "r")
    
    if mode == "w":
        text_list = text.read().replace("\t"," ").replace("\n"," ")
        text_list = remove_(text_list,range(10)).lower().split(" ")[1:]
    elif mode == "s":
        text_list = text.read().split("\n")
        text_list = [sent.replace("\t","") for sent in text_list]
        text_list = [remove_(x,range(10)).lower().split(" ") for x in text_list][1:]

    else:
        raise ValueError ("mode must be 'w'(word) or 's'(sentence)!")
        
    return text_list

In [208]:
data_sentence = text_to_list("data/datasetSentences.txt", "s")

In [209]:
data_word = text_to_list("data/datasetSentences.txt", "w")

In [210]:
data_word[0]

'sentence'

In [219]:
def build_cooccurrence_matrix(data, 
                              max_vocab_size=10000, 
                              context_size=1):
    
    """ Build a co-occurrence matrix
    
    args:
        - data: iterable where each item is a list of tokens (string) 
        - max_vocab_size: maximum vocabulary size
        - context_size: window around a word that is considered context
            context_size=1 should consider pairs of adjacent words
            
    returns:
        - co-occurrence matrix: numpy array where row i corresponds 
        to the co-occurrence counts for word i"""
    
    assert (type(data) == list or type(data) == np.ndarray), "First input must be a list or a numpy ndarray!"
    
    if type(data) == list:
        assert (len(data) > 0), "Data must be non-empty."
    else:
        assert (data.shape[0] > 0), "Data must be non-empty."
        
    ## assuming data is a list of sentences (each split into tokens)
    word_data = ((" ").join([(" ").join(x) for x in data])).split(" ")
    word2count = Counter(word_data)
    sorted_by_freq = sorted(word2count.items(), 
                            key=lambda kv: kv[1])
    
    # vocab = {word: count} for the most frequent max_vocab_size words
    vocab = dict(sorted_by_freq[-max_vocab_size:])
    keys = [*vocab.keys()]
    
    token2id = {}
    id2token = {}
    
    id2token[len(keys)] = "<UNK>"
    token2id["<UNK>"] = len(keys)
    
    for i in range(len(data)):
        data[i] = [x if x in [*vocab.keys()] else "<UNK>" for x in data[i]]
        
    for j in range(len(keys)):
        id2token[j] = keys[j]
        
    for token in keys:
        token2id[token] = keys.index(token)
    
    edge = len(keys)
    comatrix = np.zeros((edge, edge))
    
    # dict where each key is a unique token and each value is another dict of other unique keys 
    # that hold cooccurrence counts
    full_keys = keys + ["<UNK>"]
    occurrences = OrderedDict((key, OrderedDict((key, 0) for key in full_keys)) for key in full_keys)


    for sent in data:
        sent_length = len(sent)
        if context_size >= sent_length:
            for i in range(sent_length):
                for item in sent:
                    occurrences[sent[i]][item] += 1
        else:
            for i in range(sent_length):
                if i <= context_size and sent_length - context_size - 1 <= i:
                    for item in sent:
                        occurrences[sent[i]][item] += 1
                elif i <= context_size and sent_length - context_size - 1 > i:
                    for item in sent[:i+context_size+1]:
                        occurrences[sent[i]][item] += 1
                elif i > context_size and sent_length - context_size - 1 > i:
                    for item in sent[i-context_size:i+context_size+1]:
                        occurrences[sent[i]][item] += 1
                elif i > context_size and sent_length - context_size - 1 <= i:
                    for item in sent[i-context_size:]:
                        occurrences[sent[i]][item] += 1

    for i in range(edge):
        for key in full_keys:
            if token2id[key] == i:
                comatrix[token2id[key]] = np.array([occurrences[key][co] for co in [*occurrences[key].keys()]])
                    
    return token2id, id2token, comatrix

In [220]:
# token2id, id2token, co_matrix = build_cooccurrence_matrix(data_sentence, max_vocab_size=10000, context_size=2)


### Matrix for Sentence Data

Use your implementation of `build_cooccurrence_matrix` to generate the co-occurence matrix from the sentences of [SST](http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip) (file `datasetSentences.txt`) with `context_size=2` and `max_vocab_size=10000`. What is the co-occurrence count of the words "the" and "end"? 

In [221]:
token2id, id2token, co_matrix = build_cooccurrence_matrix(data_sentence, max_vocab_size=10000, context_size=2)


In [None]:
print (f'the id: {token2id["the"]}')
print (f'and id: {token2id["and"]}')

In [None]:
co_matrix[token2id["the"]][token2id["and"]]

### Context Size Effect
__b__. Plot the effect of varying context size in $\{1, 2, 3, 4\}$ (leaving all the other settings the same) on the quality of the learned word embeddings, as measured by performance (Spearman correlation) on the word similarity dataset [MTurk-771](http://www2.mta.ac.il/~gideon/mturk771.html) between human judgments and cosine similarity of the learned word vectors (see lab). [12 pts]

In [None]:
# for each word couple in mturk dataset
#     get mturk similarity
#     get vocab_size x 1 vectors from the cooccurrence matrix, normalize them
#         AND measure their cosine similarity

$$ \left(\frac{\sum_{i=1}^{n} u_{i} \cdot v_{i}}{\|u\|\cdot \|v\|}\right) $$

In [238]:
def cosine_similarity(v1, v2):
    co_sim = np.dot(v1, v2)/(norm(v1)*norm(v2))
    return co_sim

In [242]:
co_matrix[25]

array([0., 0., 0., ..., 0., 0., 1.])

In [243]:
def get_similarity(token1, token2, mturk_df, 
                   token2id, cooccurrence_matrix):
    
    mturk_similarity = float(mturk_df[(mturk_df["word1"]==token1)\
                                      &(mturk_df["word2"]==token2)]["similarity"])
    
    keys = [*token2id.keys()]
    
    if token1 not in keys:
        covector_1 = np.zeros((cooccurrence_matrix.shape[0],))
    else:
        covector_1 = cooccurrence_matrix[token2id[token1]]
    
    if token2 not in keys:
        covector_2 = np.zeros((cooccurrence_matrix.shape[0],))
    else:
        covector_2 = cooccurrence_matrix[token2id[token2]]
        
    ## TODO
    ## co_similarity = cosina_similarity(covector1, covector2)
    co_similarity = cosine_similarity(covector_1, covector_2)
    
    return mturk_similarity, co_similarity

In [244]:
mturk_data = pd.DataFrame(pd.read_csv("data/MTURK-771.csv",header=None))
mturk_data.columns = ["word1", "word2", "similarity"]

In [224]:
mturk_data.head(3)

Unnamed: 0,word1,word2,similarity
0,access,gateway,3.791667
1,account,explanation,2.0
2,account,invoice,3.75


In [226]:
# construct an mturk_similarity and a co_similarity vector and 
# get their Spearman correlation

# change context size and redo
get_similarity("access", "gateway", mturk_data, token2id, co_matrix)

(3.791666667, 0)

In [245]:
for i in range(10):
    print (get_similarity(mturk_data.iloc[i]["word1"], (mturk_data.iloc[i]["word2"]), mturk_data, token2id, co_matrix))

(3.791666667, nan)
(2.0, 0.19053511582654864)
(3.75, nan)
(3.6818181819999998, 0.2511599254077232)
(1.2272727270000001, nan)
(2.739130435, nan)
(2.0, nan)
(1.5833333330000001, nan)
(4.083333333, 0.10203255357771537)
(2.6818181819999998, 0.10456666167222464)


  


#### Context Size = 1

In [229]:
token2id_c1, id2token_c1, co_matrix_c1 = build_cooccurrence_matrix(data_sentence, max_vocab_size=10000, context_size=1)

In [246]:
for i in range(mturk_data.shape[0]):
    print (get_similarity(mturk_data.iloc[i]["word1"], 
                          (mturk_data.iloc[i]["word2"]), 
                          mturk_data, token2id_c1, co_matrix_c1))

  


(3.791666667, nan)
(2.0, 0.13871914988396253)
(3.75, nan)
(3.6818181819999998, 0.01211609529181641)
(1.2272727270000001, nan)
(2.739130435, nan)
(2.0, nan)
(1.5833333330000001, nan)
(4.083333333, 0.024000768036865964)
(2.6818181819999998, 0.046657661165392404)
(3.45, nan)
(2.5, 0.0680676474781128)
(4.608695652, nan)
(2.47826087, 0.036514837167011066)
(2.782608696, nan)
(1.0, 0.0702795114271846)
(3.782608696, 0.05863019699779286)
(2.916666667, nan)
(3.857142857, nan)
(3.19047619, nan)
(4.0, 0.04647342912165046)
(4.476190476, nan)
(4.3636363639999995, 0.014394450569584348)
(2.869565217, nan)
(2.434782609, nan)
(3.458333333, nan)
(2.0, nan)
(1.863636364, nan)
(3.428571429, 0.0)
(3.954545455, nan)
(3.5, nan)
(1.608695652, nan)
(1.9583333330000001, 0.017993390891808767)
(4.1363636360000005, 0.3894875139238524)
(1.434782609, 0.05759268014617334)
(2.6, nan)
(4.041666667, nan)
(1.454545455, nan)
(2.565217391, nan)
(2.0, nan)
(2.72, 0.0)
(4.227272727, nan)
(3.727272727, nan)
(2.44, nan)
(4.2272

(4.0, nan)
(2.782608696, nan)
(4.739130435, 0.14638501094227996)
(1.9166666669999999, 0.023743065646653218)
(1.739130435, nan)
(1.8260869569999998, nan)
(2.52, nan)
(4.045454545, nan)
(3.96, nan)
(1.2380952379999999, nan)
(3.826086957, 0.014217490555006984)
(2.666666667, nan)
(2.5238095240000002, nan)
(1.5777777780000002, 0.0902674143362211)
(1.772727273, 0.015476258713813196)
(2.714285714, 0.1048445352542672)
(2.375, 0.17682682338364944)
(2.958333333, 0.07808688094430302)
(1.8181818180000002, nan)
(2.5238095240000002, nan)
(2.869565217, nan)
(2.666666667, nan)
(2.772727273, 0.015310507072487512)
(2.090909091, nan)
(3.1, nan)
(2.136363636, 0.022049884265973686)
(3.523809524, nan)
(1.4166666669999999, nan)
(3.652173913, 0.0)
(3.833333333, nan)
(3.9090909089999997, nan)
(4.5, nan)
(3.916666667, 0.14769119412392961)
(1.64, nan)
(3.125, nan)
(1.318181818, 0.03711028633270896)
(3.3181818180000002, 0.10660035817780521)
(3.0, 0.3743681545638132)
(3.782608696, nan)
(4.72, nan)
(1.25, nan)
(3.3

(3.260869565, nan)
(2.083333333, nan)
(1.086956522, 0.11016316230980792)
(1.136363636, nan)
(3.208333333, 0.082263430601408)
(4.045454545, 0.16770188976110348)
(4.347826087, 0.07994118492388493)
(4.2173913039999995, 0.08438436994681206)
(1.2028985509999999, nan)
(1.476190476, 0.10999623648805061)
(3.208333333, 0.006839695689557284)
(4.25, nan)
(2.641791045, nan)
(1.038461538, nan)
(3.08, nan)
(4.818181818, 0.15908332480718904)
(2.227272727, nan)
(2.590909091, nan)
(3.25, nan)
(2.791666667, 0.02731791823540765)
(2.038461538, 0.012346619958119868)
(2.347826087, 0.016435457315155326)
(3.857142857, nan)
(3.0, nan)
(2.454545455, 0.14281563333595165)
(4.045454545, 0.14509525002200233)
(3.380952381, nan)
(2.086956522, 0.03782347372361169)
(4.173913043, 0.16643576233486101)
(4.083333333, 0.0)
(3.291666667, 0.03943001861728198)
(1.25, 0.0)
(3.52173913, 0.23145502494313785)
(1.21875, nan)
(1.956521739, nan)
(2.375, 0.06523416275000178)
(1.553191489, nan)
(2.9090909089999997, nan)
(3.884615385, 0

#### Context Size = 2

In [232]:
token2id_c2, id2token_c2, co_matrix_c2 = token2id, id2token, co_matrix

In [247]:
for i in range(mturk_data.shape[0]):
    print (get_similarity(mturk_data.iloc[i]["word1"], 
                          (mturk_data.iloc[i]["word2"]), 
                          mturk_data, token2id_c2, co_matrix_c2))

  


(3.791666667, nan)
(2.0, 0.19053511582654864)
(3.75, nan)
(3.6818181819999998, 0.2511599254077232)
(1.2272727270000001, nan)
(2.739130435, nan)
(2.0, nan)
(1.5833333330000001, nan)
(4.083333333, 0.10203255357771537)
(2.6818181819999998, 0.10456666167222464)
(3.45, nan)
(2.5, 0.23300984431452454)
(4.608695652, nan)
(2.47826087, 0.19194297398747864)
(2.782608696, nan)
(1.0, 0.14282184770086429)
(3.782608696, 0.176964638605189)
(2.916666667, nan)
(3.857142857, nan)
(3.19047619, nan)
(4.0, 0.1283751055584798)
(4.476190476, nan)
(4.3636363639999995, 0.0542828996898106)
(2.869565217, nan)
(2.434782609, nan)
(3.458333333, nan)
(2.0, nan)
(1.863636364, nan)
(3.428571429, 0.036037498507822355)
(3.954545455, nan)
(3.5, nan)
(1.608695652, nan)
(1.9583333330000001, 0.1693259883027908)
(4.1363636360000005, 0.41342287388607174)
(1.434782609, 0.23446825587698925)
(2.6, nan)
(4.041666667, nan)
(1.454545455, nan)
(2.565217391, nan)
(2.0, nan)
(2.72, 0.0)
(4.227272727, nan)
(3.727272727, nan)
(2.44, nan

(2.2, 0.1742459737060647)
(2.583333333, 0.031497039417435604)
(4.7826086960000005, nan)
(3.44, 0.15346201966925543)
(1.217391304, nan)
(2.64, nan)
(3.6818181819999998, nan)
(3.826086957, 0.41564821342381103)
(2.173913043, nan)
(3.142857143, nan)
(3.1363636360000005, nan)
(3.3913043480000002, nan)
(4.0, nan)
(2.782608696, nan)
(4.739130435, 0.24873416908154553)
(1.9166666669999999, 0.13072255685909367)
(1.739130435, nan)
(1.8260869569999998, nan)
(2.52, nan)
(4.045454545, nan)
(3.96, nan)
(1.2380952379999999, nan)
(3.826086957, 0.08630345864773568)
(2.666666667, nan)
(2.5238095240000002, nan)
(1.5777777780000002, 0.17715241664567902)
(1.772727273, 0.08313446682693307)
(2.714285714, 0.2182759304831494)
(2.375, 0.25781016882773394)
(2.958333333, 0.19444444444444445)
(1.8181818180000002, nan)
(2.5238095240000002, nan)
(2.869565217, nan)
(2.666666667, nan)
(2.772727273, 0.08467433897978408)
(2.090909091, nan)
(3.1, nan)
(2.136363636, 0.03521560976575096)
(3.523809524, nan)
(1.41666666699999

(3.4, 0.03302891295379082)
(3.875, 0.10646189059775953)
(2.875, nan)
(1.333333333, 0.09174113175777962)
(1.608695652, 0.18795375411207577)
(4.476190476, 0.06561340795497624)
(2.826086957, nan)
(3.666666667, nan)
(4.227272727, nan)
(2.1, nan)
(3.260869565, nan)
(2.083333333, nan)
(1.086956522, 0.14146557113794814)
(1.136363636, nan)
(3.208333333, 0.20508450027282096)
(4.045454545, 0.31378545133328833)
(4.347826087, 0.2385520889881711)
(4.2173913039999995, 0.24393816040373542)
(1.2028985509999999, nan)
(1.476190476, 0.18271209218135562)
(3.208333333, 0.04418608385115478)
(4.25, nan)
(2.641791045, nan)
(1.038461538, nan)
(3.08, nan)
(4.818181818, 0.3172573470373102)
(2.227272727, nan)
(2.590909091, nan)
(3.25, nan)
(2.791666667, 0.08545309378438275)
(2.038461538, 0.0834191997480073)
(2.347826087, 0.04855713604861532)
(3.857142857, nan)
(3.0, nan)
(2.454545455, 0.16439441628626017)
(4.045454545, 0.20192307692307696)
(3.380952381, nan)
(2.086956522, 0.15726002465745234)
(4.173913043, 0.2437

#### Context Size = 3

In [235]:
token2id_c3, id2token_c3, co_matrix_c3 = build_cooccurrence_matrix(data_sentence, max_vocab_size=10000, context_size=3)

In [248]:
for i in range(mturk_data.shape[0]):
    print (get_similarity(mturk_data.iloc[i]["word1"], 
                        (mturk_data.iloc[i]["word2"]), 
                        mturk_data, token2id_c3, co_matrix_c3))

  


(3.791666667, nan)
(2.0, 0.1975601758824086)
(3.75, nan)
(3.6818181819999998, 0.22987571891216904)
(1.2272727270000001, nan)
(2.739130435, nan)
(2.0, nan)
(1.5833333330000001, nan)
(4.083333333, 0.10830607221477646)
(2.6818181819999998, 0.12873825199836225)
(3.45, nan)
(2.5, 0.307257674543199)
(4.608695652, nan)
(2.47826087, 0.17383200648512992)
(2.782608696, nan)
(1.0, 0.24993258966916737)
(3.782608696, 0.25922266742110234)
(2.916666667, nan)
(3.857142857, nan)
(3.19047619, nan)
(4.0, 0.18753100879764556)
(4.476190476, nan)
(4.3636363639999995, 0.07199649817549103)
(2.869565217, nan)
(2.434782609, nan)
(3.458333333, nan)
(2.0, nan)
(1.863636364, nan)
(3.428571429, 0.05263157894736841)
(3.954545455, nan)
(3.5, nan)
(1.608695652, nan)
(1.9583333330000001, 0.20118782105670852)
(4.1363636360000005, 0.43304845899090955)
(1.434782609, 0.3154343418557466)
(2.6, nan)
(4.041666667, nan)
(1.454545455, nan)
(2.565217391, nan)
(2.0, nan)
(2.72, 0.012235219605809908)
(4.227272727, nan)
(3.72727272

(2.64, nan)
(3.6818181819999998, nan)
(3.826086957, 0.4718526772994135)
(2.173913043, nan)
(3.142857143, nan)
(3.1363636360000005, nan)
(3.3913043480000002, nan)
(4.0, nan)
(2.782608696, nan)
(4.739130435, 0.23328473740792172)
(1.9166666669999999, 0.2099580125958015)
(1.739130435, nan)
(1.8260869569999998, nan)
(2.52, nan)
(4.045454545, nan)
(3.96, nan)
(1.2380952379999999, nan)
(3.826086957, 0.1805422299343205)
(2.666666667, nan)
(2.5238095240000002, nan)
(1.5777777780000002, 0.17850769884025436)
(1.772727273, 0.14940660111240983)
(2.714285714, 0.3182934943412483)
(2.375, 0.24834195395504818)
(2.958333333, 0.2449894717530557)
(1.8181818180000002, nan)
(2.5238095240000002, nan)
(2.869565217, nan)
(2.666666667, nan)
(2.772727273, 0.16094642949135055)
(2.090909091, nan)
(3.1, nan)
(2.136363636, 0.09831239992280215)
(3.523809524, nan)
(1.4166666669999999, nan)
(3.652173913, 0.14697408781305388)
(3.833333333, nan)
(3.9090909089999997, nan)
(4.5, nan)
(3.916666667, 0.2775362139705803)
(1.64

(4.608695652, 0.2090661398887553)
(2.5, nan)
(1.12, nan)
(2.041666667, 0.09582062134976903)
(2.0, 0.2545875386086578)
(3.708333333, nan)
(4.090909091, nan)
(3.428571429, nan)
(2.409090909, nan)
(4.523809524, nan)
(1.7, nan)
(4.739130435, nan)
(4.909090909, 0.2635231383473649)
(2.19047619, nan)
(3.807692308, nan)
(3.958333333, 0.2694079530401624)
(2.545454545, nan)
(3.0909090910000003, nan)
(3.4, 0.13325932093055393)
(3.875, 0.2066462534800938)
(2.875, nan)
(1.333333333, 0.2500254723636671)
(1.608695652, 0.22407133233117493)
(4.476190476, 0.11701896245495681)
(2.826086957, nan)
(3.666666667, nan)
(4.227272727, nan)
(2.1, nan)
(3.260869565, nan)
(2.083333333, nan)
(1.086956522, 0.219942959691286)
(1.136363636, nan)
(3.208333333, 0.27484448736197903)
(4.045454545, 0.4134829639038109)
(4.347826087, 0.2794963665320338)
(4.2173913039999995, 0.35592022422366143)
(1.2028985509999999, nan)
(1.476190476, 0.35983815906232086)
(3.208333333, 0.08806494904123198)
(4.25, nan)
(2.641791045, nan)
(1.03

#### Context Size = 4

In [249]:
token2id_c4, id2token_c4, co_matrix_c4 = build_cooccurrence_matrix(data_sentence, max_vocab_size=10000, context_size=4)

In [250]:
for i in range(mturk_data.shape[0]):
    print (get_similarity(mturk_data.iloc[i]["word1"], 
                        (mturk_data.iloc[i]["word2"]), 
                        mturk_data, token2id_c4, co_matrix_c4))

  


(3.791666667, nan)
(2.0, 0.18051747296260967)
(3.75, nan)
(3.6818181819999998, 0.3412462799816195)
(1.2272727270000001, nan)
(2.739130435, nan)
(2.0, nan)
(1.5833333330000001, nan)
(4.083333333, 0.1992841386722568)
(2.6818181819999998, 0.21971401263748425)
(3.45, nan)
(2.5, 0.28991037829690375)
(4.608695652, nan)
(2.47826087, 0.22931308955606852)
(2.782608696, nan)
(1.0, 0.2807125545535065)
(3.782608696, 0.3399922704491989)
(2.916666667, nan)
(3.857142857, nan)
(3.19047619, nan)
(4.0, 0.2880639455423924)
(4.476190476, nan)
(4.3636363639999995, 0.17520253806545724)
(2.869565217, nan)
(2.434782609, nan)
(3.458333333, nan)
(2.0, nan)
(1.863636364, nan)
(3.428571429, 0.08942036680197918)
(3.954545455, nan)
(3.5, nan)
(1.608695652, nan)
(1.9583333330000001, 0.2591549916800769)
(4.1363636360000005, 0.492928741714154)
(1.434782609, 0.37958108018777575)
(2.6, nan)
(4.041666667, nan)
(1.454545455, nan)
(2.565217391, nan)
(2.0, nan)
(2.72, 0.1789826897859943)
(4.227272727, nan)
(3.727272727, nan

(3.44, 0.36525207342014926)
(1.217391304, nan)
(2.64, nan)
(3.6818181819999998, nan)
(3.826086957, 0.5150101627804187)
(2.173913043, nan)
(3.142857143, nan)
(3.1363636360000005, nan)
(3.3913043480000002, nan)
(4.0, nan)
(2.782608696, nan)
(4.739130435, 0.3043903887640506)
(1.9166666669999999, 0.287317558943746)
(1.739130435, nan)
(1.8260869569999998, nan)
(2.52, nan)
(4.045454545, nan)
(3.96, nan)
(1.2380952379999999, nan)
(3.826086957, 0.2652625078526814)
(2.666666667, nan)
(2.5238095240000002, nan)
(1.5777777780000002, 0.24768001467279435)
(1.772727273, 0.22843532260625435)
(2.714285714, 0.4039101700013926)
(2.375, 0.3220252551823784)
(2.958333333, 0.2866297455600074)
(1.8181818180000002, nan)
(2.5238095240000002, nan)
(2.869565217, nan)
(2.666666667, nan)
(2.772727273, 0.23391832362084428)
(2.090909091, nan)
(3.1, nan)
(2.136363636, 0.15502846636625675)
(3.523809524, nan)
(1.4166666669999999, nan)
(3.652173913, 0.13770607453181927)
(3.833333333, nan)
(3.9090909089999997, nan)
(4.5, 

(2.641791045, nan)
(1.038461538, nan)
(3.08, nan)
(4.818181818, 0.43217858579776774)
(2.227272727, nan)
(2.590909091, nan)
(3.25, nan)
(2.791666667, 0.20477119164785468)
(2.038461538, 0.18896140857304602)
(2.347826087, 0.2816161861126244)
(3.857142857, nan)
(3.0, nan)
(2.454545455, 0.21452933534849952)
(4.045454545, 0.3477740407438389)
(3.380952381, nan)
(2.086956522, 0.3516770238958945)
(4.173913043, 0.38680677722677614)
(4.083333333, 0.14433756729740646)
(3.291666667, 0.24760432530017915)
(1.25, 0.22875450543583878)
(3.52173913, 0.14291548761875736)
(1.21875, nan)
(1.956521739, nan)
(2.375, 0.3094153006593328)
(1.553191489, nan)
(2.9090909089999997, nan)
(3.884615385, 0.2798257255846983)
(4.04, 0.22880215766121476)


### Discussion

__c__. Briefly discuss the pros and cons of varying 

    I.  the context size 
    II.  the vocabulary size 
    III. using bigrams instead of unigrams 
    IV. using subword tokens instead of words. [8 pts]

## Part 2: Pointwise Mutual Information [20 pts]

In lecture, we introduced __pointwise mutual information__ (PMI), which addresses the issue of normalization removing information about absolute magnitudes of counts. The PMI for word $\times$ context pair $(w,c)$ is 

$$\log\left(\frac{P(w,c)}{P(w) \cdot P(c)}\right)$$

with $\log(0) = 0$. This is a measure of how far that cell's value deviates from what we would expect given the row and column sums for that cell.

### PMI

__a__. Implement `pmi`, a function which takes in a co-occurence matrix and returns the matrix with PMI normalization applied. [15 pts]

In [38]:
def pmi(mat):
    """Pointwise mutual information
    
    args:
        - mat: 2d np.array to apply PMI
        
    returns:
        - pmi_mat: matrix of same shape with PMI applied
    """    
    raise NotImplementedError

Apply PMI to the co-occurence matrix computed above with `context_size=1`. What is the PMI between the words "the" and "end"?

### PPMI

__b__. We also consider an extension of PMI, positive PMI (PPMI), that maps all negative PMI values to 0.0 ([Levy and Goldberg 2014](http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization)). 
Write `ppmi`, which is the same as `pmi` except it applies PPMI instead of PMI (feel free to implement it as an option of `pmi`). What is the PMI of the words "the" and "start"? The PPMI? [5 pts]

## Part 3: Analyzing PMI [25 pts]

### Reweight Matrix

__a__. Consider the matrix `np.array([[1.0, 0.0, 0.0], [1000.0, 1000.0, 4000.0], [1000.0, 2000.0, 999.0]])`. Reweight this matrix using `ppmi`. 

    I. What is the value obtained for cell `[0,0]`, and 
    II. (ii) give a brief description for what is likely problematic about this value. [10 pts]

### Dealing with the Problematic Value
__b__. Give a suggestion for dealing with the problematic value and explain why it deals with this. Demonstrate your suggestion empirically [10 pts]

### PMI for Word-Word Co-occurence Matrix
__c__. Consider starting with a word-word co-occurence matrix and apply PMI to this matrix. 

        I. Which of the following describe the resulting vectors: sparse, dense, high-dimensional, low-dimensional
        II. If you wanted the opposite style of representation, what could you do? [5 pts]


## Part 4: Word Analogy Evaluation [25 pts]

Word analogies provide another kind of evaluation for distributed representations. Here, we are given three vectors A, B, and C, in the relationship

_A is to B as C is to __ _

and asked to identify the fourth that completes the analogy. These analogies are by and large substantially easier than the classic brain-teaser analogies that used to appear on tests like the SAT, but it's still an interesting, demanding
task. 

The core idea is that we make predictions by creating the vector

$$(A - B) + C$$ 

and then ranking all vectors based on their distance from this new vector, choosing the closest as our prediction.

### Analogy Completion
__a__. Implement the function `analogy_completion`. [9 pts]

In [None]:
def analogy_completion(a, b, c, mat):
    """Compute ? in 
    a is to b as c is to ? 
    as the closest to (b-a) + c
    """
    raise NotImplementedError

### GloVe
__b__. Our simple word embeddings likely won't perform well on this task. Let's instead look at some high quality pretrained word embeddings. Write code to load 300-dimensional [GloVe word embeddings](http://nlp.stanford.edu/data/glove.840B.300d.zip) trained on 840B tokens. Each line of the file is formatted as a word followed by 300 floats that make up its corresponding word embedding (all space delimited). The entries of GloVe word embeddings are not counts, but instead are learned via machine learning. Use your `analogy_completion` code to complete the following analogies using the GloVe word embeddings. [6 pts]

- "Beijing" is to "China" as "Paris" is to ?
- "gold" is to "first" as "silver" is to ?
- "Italian" is to "mozzarella" as "American" is to ?
- "research" is to "fun" as "engineering" is to ?

### Evaluate GloVe
c. Let's get a more quantitative, aggregate sense of the quality of GloVe embeddings. Load the analogies from `gram6-nationality-adjective.txt` and evaluate GloVe embeddings. Report the mean reciprocal rank of the correct answer (the last word on each line) for each analogy. [10 pts]

__Solution__

In [13]:
def analogy_evaluation(glove_vecs, test_file, verbose=False):
    """Basic analogies evaluation for a file `src_filename `
    in `question-data/`.
    
    Parameters
    ----------    
    mat : 2d np.array
        The VSM being evaluated.
        
    rownames : list of str
        The names of the rows in `mat`.
        
    src_filename : str
        Basename of the file to be evaluated. It's assumed to be in
        `vsmdata_home`/question-data.
        
    distfunc : function mapping vector pairs to floats (default: `cosine`)
        The measure of distance between vectors. Can also be `euclidean`, 
        `matching`, `jaccard`, as well as any other distance measure 
        between 1d vectors.
    
    Returns
    -------
    (float, float)
        The first is the mean reciprocal rank of the predictions and 
        the second is the accuracy of the predictions.
    
    """
    raise NotImplementedError

In [16]:
analogy_evaluation(glove_vecs, "gram6-nationality-adjective.txt")

(0.9391509433962264, defaultdict(int, {True: 97, False: 9}))