<h1 id="tocheading">Natural Language Understanding DS-GA 1012 Homework 1</h1>
<div id="toc"></div>

__Due February 13, 2019 at 2pm (ET)__

In [143]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


In [101]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch, torchvision
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
import os
import argparse
from collections import Counter
import operator

%matplotlib inline

## Part 1: Exploring the effect of context size [30 pts]

We face many implicit and explicit design decisions in creating distributional word representations. For example, in lecture and in lab, we created word vectors using a co-occurrence matrix built on neighboring pairs of words. We might suspect, however, that we can get more signal of word similarity by considering larger contexts than pairs of words.
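
To make the notion of a context window concrete, the following minimal sketch lists the (word, context word) pairs that a symmetric window produces for a single tokenized sentence. The helper name `window_pairs` and the example sentence are illustrative assumptions, not part of the lab code.

```python
def window_pairs(tokens, context_size=1):
    """List (word, context_word) pairs for a symmetric window.

    Hypothetical helper for illustration: each token is paired with up to
    `context_size` tokens on either side, clipped at the sentence edges.
    """
    pairs = []
    for i, word in enumerate(tokens):
        start = max(0, i - context_size)
        end = min(len(tokens), i + context_size + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                pairs.append((word, tokens[j]))
    return pairs


sentence = ["the", "end", "is", "near"]  # made-up example sentence
narrow = window_pairs(sentence, context_size=1)
wide = window_pairs(sentence, context_size=2)
print(len(narrow), len(wide))  # prints "6 10"
```

Widening the window adds pairs such as `("the", "is")` that adjacent-pair counting never sees, which is the extra signal (and extra noise) that this part of the assignment asks about.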

### Co-occurrence Matrix
__a__. Write `build_cooccurrence_matrix`, which generates the co-occurrence matrix for a window of arbitrary size and for the vocabulary of the `max_vocab_size` most frequent words. Feel free to modify the code used in lab. [10 pts]

In [92]:
def text_to_list(filepath, mode="w"):
    """args
        - filepath: path to text file
        - mode: "w" for word, "s" for sentence list
    
    returns:
        - text_list: word or sentence list depending on the mode"""
    
    # read the whole file and close the handle when done
    with open(filepath, "r") as f:
        text = f.read()
    
    if mode == "w":
        # split() with no argument splits on any run of whitespace
        text_list = text.lower().split()
    elif mode == "s":
        # skip the header row; assumes each remaining line is "<index>\t<sentence>"
        lines = text.split("\n")[1:]
        # drop the index column at the first tab (x[2:] only handled 1-digit indices)
        text_list = [x.split("\t", 1)[-1].lower().split(" ") for x in lines if x]
    else:
        raise ValueError("mode must be 'w' (word) or 's' (sentence)!")
        
    return text_list

In [93]:
data_sentence = text_to_list("DS-GA1012_HW1/datasetSentences.txt", "s")

In [13]:
def build_cooccurrence_matrix(data, 
                              max_vocab_size=20000, 
                              context_size=1):
    
    """ Build a co-occurrence matrix
    
    args:
        - data: iterable where each item is a list of tokens (string) 
        - max_vocab_size: maximum vocabulary size
        - context_size: window around a word that is considered context
            context_size=1 should consider pairs of adjacent words
            
    returns:
        - co-occurrence matrix: pandas DataFrame with rows and columns 
        labeled by vocabulary word, where row w holds the co-occurrence 
        counts for word w (use `.values` for the underlying numpy array)"""
    
    assert isinstance(data, (list, np.ndarray)), "First input must be a list or a numpy ndarray!"
    assert len(data) > 0, "Data must be non-empty."
        
    ## assuming data is a list of sentences (each split into tokens)
    word2count = Counter(token for sent in data for token in sent)
    
    # keep the max_vocab_size most frequent words
    vocab = [word for word, _ in word2count.most_common(max_vocab_size)]
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    
    # accumulate counts in a numpy array, then label it with the vocabulary
    mat = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
    for sent in data:
        for i, word in enumerate(sent):
            if word not in word2idx:
                continue
            # window of up to context_size tokens on each side of position i,
            # clipped at the sentence edges
            start = max(0, i - context_size)
            end = min(len(sent), i + context_size + 1)
            for j in range(start, end):
                if j != i and sent[j] in word2idx:
                    mat[word2idx[word], word2idx[sent[j]]] += 1
    
    return pd.DataFrame(mat, index=vocab, columns=vocab)

### Matrix for Sentence Data

Use your implementation of `build_cooccurrence_matrix` to generate the co-occurrence matrix from the sentences of [SST](http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip) (file `datasetSentences.txt`) with `context_size=2` and `max_vocab_size=10000`. What is the co-occurrence count of the words "the" and "end"?

### Context Size Effect
__b__. Plot the effect of varying context size in $\{1, 2, 3, 4\}$ (leaving all the other settings the same) on the quality of the learned word embeddings, as measured by performance (Spearman correlation) on the word similarity dataset [MTurk-771](http://www2.mta.ac.il/~gideon/mturk771.html) between human judgments and cosine similarity of the learned word vectors (see lab). [12 pts]
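
As a rough illustration of the evaluation described above, the sketch below scores a few word pairs by cosine similarity and compares the result with human ratings using Spearman correlation (the Pearson correlation of the two rank lists). The vectors, word pairs, and ratings are made-up toy values, and the helper names are assumptions for illustration; `scipy.stats.spearmanr` computes the same statistic.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two 1-d vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


# Toy data (made up): word vectors and (word1, word2, human rating) triples.
vectors = {"car": [1.0, 0.1], "auto": [0.8, 0.3], "apple": [0.2, 0.9], "fruit": [0.1, 1.0]}
pairs = [("car", "auto", 4.5), ("apple", "fruit", 4.0), ("car", "apple", 1.2)]

human = [score for _, _, score in pairs]
model = [cosine_similarity(vectors[a], vectors[b]) for a, b, _ in pairs]
rho = spearman(human, model)
```

Repeating this for each context size and plotting `rho` against the context size gives the curve the question asks for.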

### Discussion

__c__. Briefly discuss the pros and cons of each of the following:

    I.   varying the context size 
    II.  varying the vocabulary size 
    III. using bigrams instead of unigrams 
    IV.  using subword tokens instead of words. [8 pts]

## Part 2: Pointwise Mutual Information [20 pts]

In lecture, we introduced __pointwise mutual information__ (PMI), which addresses the issue of normalization removing information about absolute magnitudes of counts. The PMI for word $\times$ context pair $(w,c)$ is 

$$\log\left(\frac{P(w,c)}{P(w) \cdot P(c)}\right)$$

with $\log(0) = 0$. This is a measure of how far that cell's value deviates from what we would expect given the row and column sums for that cell.
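
The formula can be evaluated for a whole count matrix at once. The following is a minimal sketch under the convention $\log(0) = 0$ stated above, estimating $P(w,c)$, $P(w)$, and $P(c)$ from the cell counts, row sums, and column sums; the name `pmi_sketch` and the toy matrix are assumptions for illustration, and this is one possible approach rather than the reference solution.

```python
import numpy as np


def pmi_sketch(counts):
    """PMI of every cell of a count matrix, with log(0) treated as 0."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_sums = counts.sum(axis=1, keepdims=True)
    col_sums = counts.sum(axis=0, keepdims=True)
    # Count expected in each cell if rows and columns were independent.
    # P(w,c) / (P(w) P(c)) simplifies to counts / expected.
    expected = row_sums * col_sums / total
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.log(counts / expected)
    # Zero counts give -inf and empty rows/columns give nan; map both to 0.
    out[~np.isfinite(out)] = 0.0
    return out


toy = np.array([[2.0, 0.0],
                [1.0, 1.0]])  # made-up counts
toy_pmi = pmi_sketch(toy)
```

For the toy matrix, cell `[0, 0]` has an observed count of 2 against an expected count of 1.5, so its PMI is $\log(2/1.5) = \log(4/3) > 0$: the pair occurs more often than the row and column sums alone would predict.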

### PMI

__a__. Implement `pmi`, a function which takes in a co-occurrence matrix and returns the matrix with PMI normalization applied. [15 pts]

In [38]:
def pmi(mat):
    """Pointwise mutual information
    
    args:
        - mat: 2d np.array to apply PMI
        
    returns:
        - pmi_mat: matrix of same shape with PMI applied
    """    
    raise NotImplementedError

Apply PMI to the co-occurrence matrix computed above with `context_size=1`. What is the PMI between the words "the" and "end"?

### PPMI

__b__. We also consider an extension of PMI, positive PMI (PPMI), that maps all negative PMI values to 0.0 ([Levy and Goldberg 2014](http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization)). 
Write `ppmi`, which is the same as `pmi` except it applies PPMI instead of PMI (feel free to implement it as an option of `pmi`). What is the PMI of the words "the" and "start"? The PPMI? [5 pts]
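
Given a matrix of PMI values, the PPMI mapping itself is a single clipping step. The sketch below applies it to a small array of made-up PMI values (the numbers are illustrative, not computed from SST).

```python
import numpy as np

# Made-up PMI values for illustration; negative entries mark pairs that
# co-occur less often than expected under independence.
pmi_values = np.array([[0.29, 0.00],
                       [-0.41, 0.69]])

# PPMI keeps positive PMI values and maps negative ones to 0.0.
ppmi_values = np.maximum(pmi_values, 0.0)
```

Only the `-0.41` entry changes; all non-negative entries pass through unchanged.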

## Part 3: Analyzing PMI [25 pts]

### Reweight Matrix

__a__. Consider the matrix `np.array([[1.0, 0.0, 0.0], [1000.0, 1000.0, 4000.0], [1000.0, 2000.0, 999.0]])`. Reweight this matrix using `ppmi`. 

    I. What is the value obtained for cell `[0,0]`?
    II. Give a brief description of what is likely problematic about this value. [10 pts]

### Dealing with the Problematic Value
__b__. Suggest a way of dealing with the problematic value and explain why it works. Demonstrate your suggestion empirically. [10 pts]

### PMI for Word-Word Co-occurrence Matrix
__c__. Consider starting with a word-word co-occurrence matrix and applying PMI to it. 

        I. Which of the following describe the resulting vectors: sparse, dense, high-dimensional, low-dimensional?
        II. If you wanted the opposite style of representation, what could you do? [5 pts]


## Part 4: Word Analogy Evaluation [25 pts]

Word analogies provide another kind of evaluation for distributed representations. Here, we are given three vectors A, B, and C, in the relationship

_A is to B as C is to ?_

and asked to identify the fourth that completes the analogy. These analogies are by and large substantially easier than the classic brain-teaser analogies that used to appear on tests like the SAT, but it's still an interesting, demanding task.

The core idea is that we make predictions by creating the vector

$$(B - A) + C$$ 

and then ranking all vectors based on their distance from this new vector, choosing the closest as our prediction.
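
The ranking step can be sketched as follows, using the offset $B - A + C$ as the target and cosine similarity as the closeness measure. The two-dimensional vectors are made-up toy values, the function name is hypothetical, and excluding the three query words from the candidates is a common convention assumed here rather than something the assignment specifies.

```python
import numpy as np


def rank_by_offset(a, b, c, words, mat):
    """Rank candidate words by cosine similarity to mat[b] - mat[a] + mat[c].

    `words[i]` names row i of `mat`. The query words a, b, c are excluded
    from the returned ranking (an assumption of this sketch).
    """
    idx = {w: i for i, w in enumerate(words)}
    target = mat[idx[b]] - mat[idx[a]] + mat[idx[c]]
    sims = mat @ target / (np.linalg.norm(mat, axis=1) * np.linalg.norm(target))
    # Highest similarity first, i.e. smallest cosine distance first.
    return [words[i] for i in np.argsort(-sims) if words[i] not in (a, b, c)]


# Toy vectors (made up) in which "woman" - "man" + "king" lands on "queen".
words = ["man", "woman", "king", "queen", "apple"]
mat = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [3.0, 0.0],
                [3.0, 1.0],
                [0.0, 3.0]])

ranking = rank_by_offset("man", "woman", "king", words, mat)
print(ranking[0])  # prints "queen"
```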

### Analogy Completion
__a__. Implement the function `analogy_completion`. [9 pts]

In [None]:
def analogy_completion(a, b, c, mat):
    """Compute ? in 
    a is to b as c is to ? 
    as the closest to (b-a) + c
    """
    raise NotImplementedError

### GloVe
__b__. Our simple word embeddings likely won't perform well on this task. Let's instead look at some high quality pretrained word embeddings. Write code to load 300-dimensional [GloVe word embeddings](http://nlp.stanford.edu/data/glove.840B.300d.zip) trained on 840B tokens. Each line of the file is formatted as a word followed by 300 floats that make up its corresponding word embedding (all space delimited). The entries of GloVe word embeddings are not counts, but instead are learned via machine learning. Use your `analogy_completion` code to complete the following analogies using the GloVe word embeddings. [6 pts]

- "Beijing" is to "China" as "Paris" is to ?
- "gold" is to "first" as "silver" is to ?
- "Italian" is to "mozzarella" as "American" is to ?
- "research" is to "fun" as "engineering" is to ?
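
The file format described above (a word followed by space-delimited floats) can be parsed roughly as follows. This is a sketch, not the reference loader: the function name, the `max_words` option, and the file name in the usage comment are assumptions. The float fields are taken from the end of each line so that a token which itself contains spaces does not break the parse.

```python
import numpy as np


def load_vectors(lines, dim=300, max_words=None):
    """Parse lines of the form '<word> <float> ... <float>' into a dict.

    `lines` is any iterable of strings, such as an open text file.
    The last `dim` fields are the vector; everything before is the word.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        word = " ".join(parts[:-dim])
        vectors[word] = np.asarray([float(x) for x in parts[-dim:]], dtype=np.float32)
        if max_words is not None and len(vectors) >= max_words:
            break
    return vectors


# Tiny in-memory demonstration with 3-dimensional vectors (made-up values).
demo = load_vectors(["cat 1 2 3\n", ". . 4 5 6\n"], dim=3)

# Hypothetical usage, assuming the archive was unzipped to glove.840B.300d.txt:
# with open("glove.840B.300d.txt", encoding="utf-8") as f:
#     glove_vecs = load_vectors(f, dim=300, max_words=100000)
```

Capping the number of words with `max_words` keeps memory manageable while experimenting, at the cost of possibly missing rarer words that appear in the analogies.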

### Evaluate GloVe
__c__. Let's get a more quantitative, aggregate sense of the quality of GloVe embeddings. Load the analogies from `gram6-nationality-adjective.txt` and evaluate GloVe embeddings. Report the mean reciprocal rank of the correct answer (the last word on each line) across all analogies. [10 pts]
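
For reference, mean reciprocal rank averages $1/\text{rank}$ of the correct answer over all analogies, where rank 1 is the top prediction. The sketch below assumes each analogy already has a ranked list of candidate words; the function name and the example rankings are made up for illustration, and scoring a missing answer as 0 is an assumption of this sketch.

```python
def mean_reciprocal_rank(rankings, gold_answers):
    """Average of 1/rank of the gold answer in each ranked candidate list.

    An analogy whose gold answer does not appear in its ranking
    contributes 0 (an assumption of this sketch).
    """
    total = 0.0
    for ranking, gold in zip(rankings, gold_answers):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(gold_answers)


# Made-up rankings: the first gold answer is ranked 1st, the second 2nd.
rankings = [["French", "France", "Paris"], ["Spain", "Spanish", "Madrid"]]
gold_answers = ["French", "Spanish"]
mrr = mean_reciprocal_rank(rankings, gold_answers)
print(mrr)  # prints "0.75"
```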

__Solution__

In [13]:
def analogy_evaluation(glove_vecs, test_file, verbose=False):
    """Basic analogies evaluation for the file `test_file`.
    
    Parameters
    ----------    
    glove_vecs : 
        The pretrained GloVe word embeddings being evaluated.
        
    test_file : str
        Name of the analogy file to evaluate, e.g. 
        `gram6-nationality-adjective.txt`. Each line holds one analogy, 
        with the correct answer as its last word.
        
    verbose : bool (default: False)
        If True, print additional output during evaluation.
    
    Returns
    -------
    (float, defaultdict)
        The first is the mean reciprocal rank of the predictions and 
        the second maps True/False to the number of analogies whose 
        top-ranked prediction was correct/incorrect.
    
    """
    raise NotImplementedError

In [16]:
analogy_evaluation(glove_vecs, "gram6-nationality-adjective.txt")

(0.9391509433962264, defaultdict(int, {True: 97, False: 9}))