In [1]:
from google.colab import drive
drive.mount('/content/gdrive')
nbdir = "/content/gdrive/My Drive/DSCI691/Assignment 3/module-A"

Mounted at /content/gdrive


In [2]:
%cd /content/gdrive/My Drive/DSCI691/Assignment 3/module-A

/content/gdrive/My Drive/DSCI691/Assignment 3/module-A


## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member Names and Drexel email addresses in the below markdown list, under header __Module submission group__. It is required to fill out descriptive notes pertaining to any tutoring support received in the completion of this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may as well list other optional comments pertaining to the submission at bottom. _Any distruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Xi Chen
    - Email: xc98@drexel.edu
- Group member 2
    - Name: Tai Nguyen
    - Email: tdn47@drexel.edu
- Group member 3
    - Name: Tien Nguyen
    - Email: thn44@drexel.edu
- Group member 4
    - Name: Raymond Yung
    - Email: raymond.yung@drexel.edu


### Additional submission comments
- Tutoring support received: NA
- Other (other): We were not able to finish running A8 for sanity check because it took us hours each time running it. Therefore, we were not able to check A9, A10, A11. However, we tried our best to write the functions. 

# DSCI 691: Natural language processing with deep learning <br> Assignment 3: GloVe semantic representation
## Data and Utilities 
Here, we'll be working again with the same linked NewsTweet data, some essential utilities presented in the __Chapter 1 Notes__, as well as the adagrad optimization algorithm from the __Chapter 3 Notes__.

In [3]:
import json
newstweet = json.load(open('./data/newstweet-sample-linked.json'))
exec(open('./01-utilities.py').read())
exec(open('./03-utilities.py').read())

## Overview 
The purpose of this assignment (50 pts) is to gain some experience with iteration-based learning, which officially gets us into the DL work mode. Rather than the Word2Vec algorithm (based on language modeling), the GloVe algorithm's objective is to predict frequency. Let's see how this works!

Note, there are several files which your code should produce, which to speed up your work will make it possible to work on later parts sooner. These are:
- a cached copy of the co-occurrence counts: `./data/cached/data-True-20-0.json`
- a cached copy of the training losses: `./data/cached/saved_losses_newstweet-GloVe-50-20_25000.json`
- a cached copy of the learned parameters: `./data/cached/saved_params_newstweet-GloVe-50-20_25000.npy`
- a cached copy of the sum of squared gradients: `./data/cached/saved_SSG_newstweet-GloVe-50-20_25000.npy`
- a cached copy of the model's state: `./data/cached/saved_state_newstweet-GloVe-50-20_25000.pickle`

If you'd like to test later sections of code prior to earlier ones being completed, bring these files out of their `./data/cached/` directory and into the main `./data/` directory.

### 1. (3 pts) Build the model's training data
The [GloVe algorithm](https://nlp.stanford.edu/pubs/glove.pdf) relies on having access to the non-zero co-ocurrence values of a corpus to form the target variable for the regression task. Since it's cumbersome to construct and utilize these as a sparse matrix (and we don't want access to the zero-values anyway), it'll make sense to pre-compute these values and write them to disk. Hence, your first job is to complete the `make_co_counts(documents, space = True, k = 20, gamma = 0)` function, which
1. attempts to load pre-computed co-ocurrence counts (`co_counts`), or 
2. computes them and store them to disk for future experimentation.

So to complete the function, you'll have to construct it's return value, `co_counts`, which should be a `Counter()` of counts, keyed by a comma-separated string of the $i,j$ indices of the co-ocurring types from the _implied_ co-ocurrence matrix's index:
```
ij = ",".join(map(str,[type_index[ti], type_index[tj]]))
```
Note: the `type_index` construction is already filled into to the function, and much of this computation should be essentially the same as for the construction of `co_counts` inside of the body of the `make_CoM` function.

In [4]:
# A1:Function(3/3)

import os

def make_co_counts(documents, space = True, k = 20, gamma = 0):
    
    handle = "-".join(map(str,[space, k, gamma]))
    if os.path.exists('./data/data-' + handle + '.json'):
        return json.load(open('./data/data-' + handle + '.json'))
    
    document_frequency = Counter()
    for j, document in enumerate(documents):
        sentences = sentokenize(document.lower(), space = space)
        documents[j] = sentences
        frequency = Counter([t for s in documents[j] for t in s])
        document_frequency += Counter(frequency.keys())
    type_index = {t:i for i, t in enumerate(sorted(list(document_frequency.keys())))}

    co_counts = Counter()
    
    #--- your code starts here
    for document in documents:
        for sentence in document:
            for i, ti in enumerate(sentence):
                context, weights = get_context(i, sentence, k = k, gamma = gamma)        
                for j, tj in enumerate(context):
                    ij = ",".join(map(str,[type_index[ti], type_index[tj]]))
                    co_counts[ij] += weights[j]
    #--- your code stops here

    data = {'co_counts': dict(co_counts), 'type_index': dict(type_index)}
    
    with open('./data/data-' + handle + '.json', "w") as f:
        f.write(json.dumps(data))
    
    return data

For reference, your output should be:
```
(111259, 15169286, 276685120.0)
```

In [5]:
# A1:SanityCheck

data = make_co_counts([x['text'].lower() for x in newstweet])
len(data['type_index']), len(data['co_counts']), sum(data['co_counts'].values())

(111259, 15169286, 276685120.0)

### 2. (4 pts) Weight the co-occurrence matrix
Per the GloVe algorithm definition, we'll need to weight the terms of our loss function:

$$
\mathcal{L}= \sum_{i = 1}^{|W|}\sum_{j = 1}^{|W|}f(CoM_{i,j})\left(\vec{u}_i^T\vec{v}_j + a_i + b_j - \log{CoM_{i,j}}\right)^{2}
$$

By a weighting of the co-occurrence matrix, $CoM$, as:
$$
f(CoM_{i,j}) = 1
\hspace{10pt}\text{ if } 
\hspace{10pt}CoM_{i,j} \geq CoM_{\text{max}};
\hspace{10pt}\text{ otherwise } 
\hspace{10pt}f(CoM_{i,j}) = \left(\frac{CoM_{i,j}}{CoM_{\text{max}}}\right)^{\alpha}.
$$
Generally, $\alpha = 0.75$ and $CoM_{\text{max}} = 100$ are recommended by the authors, but we'll leave these as presets for them as optional hyperparameters. 

In this part of the problem, your job is specifically to implement the $f$-weighting function for our $CoM$, below, by filling in the `weight_nonzero_data(data, comax = 100, alpha = 0.75)`. In particular, you should add a dctionary to the `data` object keyed as `data['fco_counts']`, which has `ij` keys (the co-ocurrence type-type comma-separated indices) corresponding to the weighted, $f(CoM_{i,j})$ values.

Note: because this function is supposed to add a field to `data` it should have no return value.

In [6]:
# A2:Function(4/4)

def weight_nonzero_data(data, comax = 100, alpha = 0.75):
    
    #--- your code starts here
    data['fco_counts'] = {ij: 1 if data['co_counts'][ij] >= comax 
                          else ((data['co_counts'][ij])/comax)**alpha for ij in data['co_counts']}

  
    #--- your code stops here
    
    # note: this function has no return value

For reference, your output should be:
```
(15169286, 1049012.2482885977)
```

In [7]:
# A2:SanityCheck

weight_nonzero_data(data)
len(data['fco_counts']), sum(data['fco_counts'].values())

(15169286, 1049012.2482885977)

### 3. (3 pts) Implement the loss function for an `ij` co-ocurrence pair

Now that we have our co-ocurrence frequencies and weights, we can compute the GloVe loss function as:

$$
\mathcal{L}= \sum_{i = 1}^{|W|}\sum_{j = 1}^{|W|}f(CoM_{i,j})\left(\vec{u}_i^T\vec{v}_j + a_i + b_j - \log{CoM_{i,j}}\right)^{2}
$$

You job here is to implement the above math within the `GloVe_loss` function as `L` (which is the return value).

In [8]:
# A3:Function(3/3)

def GloVe_loss(ij, data, U, V, a, b):
    
    i, j = map(int, ij.split(','))
    
    #--- your code starts here
    import numpy as np
    L = data['fco_counts'][ij]*((np.dot(U[i], V[j]) + 
                                 a[i] + b[j] - np.log(data['co_counts'][ij]))**2)

    #--- your code stops here
    
    return L

For reference, your output should be:
```
3.069212443574228
```

In [9]:
# A3:SanityCheck

np.random.seed(691)
GloVe_loss(",".join(map(str, [data['type_index']['robert'], data['type_index']['downey']])), data, 
           np.zeros((len(data['type_index']), 50)), np.zeros((len(data['type_index']), 50)), 
                        np.zeros(len(data['type_index'])), np.zeros(len(data['type_index'])))

3.069212443574228

### 4. (4 pts) Derive the gradient of the loss
Next, your job is to derive the gradients for the type $\vec{u}_i$ and context $\vec{v}_j$ vectors, in addition to for the bias terms, $a_i$ and $b_j$&mdash;to exhibit this work, fill in the steps you take to compute the gradients as markdown in the specified cells, below.

#### First, derive the scalar, partial-derivative bias terms:

In [10]:
# A4:Derivation(1/4)

##### \#\#\# your derivation starts here
$$
\frac{\partial{\mathcal{L}}}{\partial{a}_i} = \sum_{i = 1}^{|W|}\sum_{j = 1}^{|W|}2f(CoM_{i,j})\left(\vec{u}_i^T\vec{v}_j + a_i + b_j - \log{CoM_{i,j}}\right)
$$

$$
\frac{\partial{\mathcal{L}}}{\partial{b}_j} = \sum_{i = 1}^{|W|}\sum_{j = 1}^{|W|}2f(CoM_{i,j})\left(\vec{u}_i^T\vec{v}_j + a_i + b_j - \log{CoM_{i,j}}\right)
$$
##### \#\#\# your derivation stops here

#### Next, derive vector-gradients w/r to the $d$-dimensions of parameters:

In [11]:
# A4:Derivation(3/4)

##### \#\#\# your derivation starts here
$$
\frac{\partial\mathcal{L}}{\partial\vec{u}_i} = \sum_{i = 1}^{|W|}\sum_{j = 1}^{|W|}2\vec{v}_jf(CoM_{i,j})\left(\vec{u}_i^T\vec{v}_j + a_i + b_j - \log{CoM_{i,j}}\right)
$$

$$
\frac{\partial\mathcal{L}}{\partial\vec{v}_j} = \sum_{i = 1}^{|W|}\sum_{j = 1}^{|W|}2\vec{u}_if(CoM_{i,j})\left(\vec{u}_i^T\vec{v}_j + a_i + b_j - \log{CoM_{i,j}}\right)
$$
##### \#\#\# your derivation stops here

### 5. (6 pts) Implement a GloVe loss & gradient function
Using the results from your derivation above, implement the loss and gradient for all type and context embeddings. To do so, complete the `GloVe_loss_and_gradient` function, which has the following input _Arguments_:
 - `ij`: comma-separated (typed to strings) indices co-ocurring types, corresponding to a row from $V$ ($i$), $U$ ($j$), or index pair ($i,j$) from $CoM$ or $f(CoM)$.
 - `U`: the matrix of type vectors, i.e., rows of the matrix $U$ for all types
 - `V`: the matrix of context vectors, i.e., rows of the matrix $V$ for all contexts
 - `a`: the vector of bias terms for the types, $\vec{a}$
 - `b`: the vector of bias terms for the contexts, $\vec{b}$
 - `CoM`: the training data, co-occurrence matrix counting type-type 'context' pairs.
 - `fCoM`: the weighted training data of type-type 'context' pairs, $f(CoM)$.

and the following _Return_ values:
 - `L`: the _scalar_ GLOVE loss function, $\mathcal{L}$.
 - `dLdui`: the gradient _vector_ with respect to the specified type $\frac{\partial\mathcal{L}}{\partial\vec{u}_i}$
 - `dLdvj`: the gradient _vector_ with respect to the specified context $\frac{\partial\mathcal{L}}{\partial\vec{v}_j}$
 - `dai`: the _scalar_ partial derivative with respect to the type's bias parameter, $\frac{\partial{\mathcal{L}}}{\partial{a}_i}$
 - `dbj`: the _scalar_ partial derivative with respect to the context's bias parameter, $\frac{\partial{\mathcal{L}}}{\partial{b}_j}$
 
Note: the `return`ed loss and gradient values should be computed using your pre-defined function and derivation, above.

In [12]:
# A5:Function(6/6)

def GloVe_loss_and_gradient(ij, data, U, V, a, b): 
    
    i, j = map(int, ij.split(','))
    
    #--- your code starts here
    L = GloVe_loss(ij, data, U, V, a, b)
    dLdai = 2*data['fco_counts'][ij]*(np.dot(U[i], V[j]) + 
                                 a[i] + b[j] - np.log(data['co_counts'][ij]))
    dLdbj = dLdai
    dLdui = V[j]*dLdai
    dLdvj = U[i]*dLdai
    
    
    #--- your code stops here
    
    return L, dLdui, dLdvj, dLdai, dLdbj

For reference, your output should be:
```
(3.069212443574228,
 array([-0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.]),
 array([-0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.]),
 -1.9858753089849215,
 -1.9858753089849215)
```

In [13]:
# A5:SanityCheck

np.random.seed(691)
GloVe_loss_and_gradient(",".join(map(str, [data['type_index']['robert'], data['type_index']['downey']])), data, 
                        np.zeros((len(data['type_index']), 50)), np.zeros((len(data['type_index']), 50)), 
                        np.zeros(len(data['type_index'])), np.zeros(len(data['type_index'])))

(3.069212443574228,
 array([-0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.]),
 array([-0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.]),
 -1.9858753089849215,
 -1.9858753089849215)

### 6. (4 pts) Building a sampling function
Here, you're required to build out the `sample_nonz(batch_size, ijs)` function, which must sample `batch_size` pairs of ($i,j$)-indices that are keys for the non-zero values from the $CoM$ within `data['co_counts']`. This will make it efficient to iterate over non-trivial portions of the co-occurrence matrix. 

Note: for the trivial case, when `batch_size == len(ijs)`, your function should just return a copy of `ijs`, and in order to match the `# A6:SanityCheck` output, build your code by using the `random` module's `sample` function, which can be accessed via this assignment's namespace as `ra.sample`.

In [14]:
# A6:Function(4/4)

def sample_ijs(batch_size, ijs):
    
    #--- your code starts here
    
    if batch_size == len(ijs):
      ijs_sample = ijs
    else:
      ijs_sample = [item for item in ra.sample( ijs, batch_size)]
    #--- your code stops here
    
    ra.shuffle(ijs_sample)
    return ijs_sample

For reference, your output should be:
```
['40084,14097',
 '71564,105579',
 '56945,14591',
 '110912,95636',
 '38321,38439',
 '56064,108528',
 '89963,59086',
 '100840,33233',
 '33494,108283',
 '99008,94653']
```

In [15]:
# A6:SanityCheck

ra.seed(691)
sample_ijs(10, list(data['co_counts'].keys()))

['40084,14097',
 '71564,105579',
 '56945,14591',
 '110912,95636',
 '38321,38439',
 '56064,108528',
 '89963,59086',
 '100840,33233',
 '33494,108283',
 '99008,94653']

### 7. (6 pts) Operating GloVe on batches of `ij` index-pairs
Here, we'll produce the primary `GloVe` function, which computes gradient components as prescribed by our non-zero $CoM$-entry-sampling function and a given `batch_size`. This function has the following input _Arguments_:
- `UVab`: the current model parameters for the GloVe algorithm
- `data`: the training data, continaing co-occurrence counts keyed by `'co_counts'` and weighted frequency, keyed by `'fco_counts'`, in addition to the type index, keyed by `'type_index'`.
- `batch_size`: the number of word-context pairs to process into gradients at once
- `loss_and_gradient`: the loss and gradient function for GloVe

and the following _Return_ values:
- `total_L`: the loss function value for the skip-gram model ($\mathcal{L}$)
- `gradient`: the stacked gradient vectors in the `UVab` order

Your job is to aggregate the gradient components into their correct portions of the matrix representation named `gradient`. In particular, for each `ij` co-ocurrence index pair, the gradient output (`dLduj, dLdvi, dLdai, dLdbj`) from the `loss_and_gradient` function must be added to the `gradient` matrix derivatives in the same locations corresponding to the $i,j$ parameters from $U$, $V$, $a$, and $b$, respectively. 

Additionally, your code must aggregate into `total_weight` the value from all `ij` co-occurrence index pairs.

In [16]:
# A7:Function(6/6)

def GloVe(UVab, data, batch_size = 0, loss_and_gradient = GloVe_loss_and_gradient):
    dim = int((UVab.shape[1]/2) - 1)
    ## zero batch_size means compute over whole dataset
    if not batch_size:
        batch_size = len(co_counts)
    U = UVab[:,:dim]; V = UVab[:,dim:2*dim]
    a = UVab[:,-2]; b = UVab[:,-1]
    gradient = np.zeros(UVab.shape)
    total_weight = 0; total_L = 0.0
    for ij in sample_ijs(batch_size, list(data['co_counts'].keys())):
        
        i, j = map(int, ij.split(','))
        
        #--- your code starts here
        L, dLdui, dLdvj, dLdai, dLdbj = loss_and_gradient(ij, data, U, V, a, b)
        total_L += L
        gradient[i, :dim] = dLdui
        gradient[:,dim:2*dim] = dLdvj
        gradient[:,-2] = dLdai
        gradient[:, -1] = dLdbj
        total_weight += data['fco_counts'][ij]
        #--- your code stops here
        
    total_L, gradient = total_L/total_weight, gradient/total_weight
    return total_L, gradient

For reference, your output should be:
```
(3.5667257586946493,
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]))
```

In [17]:
# A7:SanityCheck

ra.seed(691)
np.random.seed(691)
GloVe(np.zeros((len(data['type_index']), (50+1)*2)), data, batch_size = 100)

(3.5667257586946493, array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]))

### 8. (4 pts) Complete the GloVe training wrapper
Now it's time to complete the GloVe implementation by filling in the `train` function, which operates much like that from __Chapter 3, Section 3.1.1.6__, which also uses the adaptive gradient descent algorithm for implementation. You'll have access to this again via the code from `03-utilities.py`. 

In particular, use the pre-set model initial random state:
$$
UVab_0\in[-0.5,0.5]^{|W|\times 2(d+1)},
$$
which stacks $U$ $V$ $a$ and $b$ as columns in a matrix with values from $[-1,1]$ to apply the `adagrad()` function by passing the `GloVe` model as a lambda function with variable argument as the model parameters, i.e., an 'anonymous' matrix variable `UVab`.

When applying the `adagrad` function, be sure to supply the correct parameters from the `train()` function's input, in addition to catching the `UVab_m, losses` output that it returns (the final result from `m` iterations).

Note: your code should consist only of a single function call&mdash;to the `adagrad()` function.

In [18]:
# A8:Function(4/4)

import time
def train(data, handle, k = 20, dim = 50, comax = 100, alpha = 0.75, 
          batch_size = 100, m = 25000, eta = 0.5,
          use_saved = True, save_every = 1000, print_every=100,
          loss_and_gradient = GloVe_loss_and_gradient):
    
    N = len(data['type_index'])
    handle = "-".join([handle, str(dim), str(k)])
    ra.seed(691);np.random.seed(691)
    
    start_time=time.time()
    
    UVab_0 = np.concatenate(((np.random.rand(N, dim + 1) - 0.5)/(dim + 1),
                             (np.random.rand(N, dim + 1) - 0.5)/(dim + 1)), axis=1)
    
    #--- your code starts here
    UVab_m, losses = adagrad(lambda UVab: GloVe(UVab, data, batch_size = batch_size),
                             UVab_0, eta, m, handle, use_saved, save_every, print_every)
    #--- your code stops here
    
    stop_time=time.time()
    return UVab_m, losses, (start_time, stop_time)

For reference, your output should be:

```
...
iteration:  100 avg loss up to batch:  4.8556129373186545
...
iteration:  500 avg loss up to batch:  3.268102535178591
...
iteration:  1000 avg loss up to batch:  2.700337325161406
...
iteration:  2000 avg loss up to batch:  2.2696841678746846
...
iteration:  3000 avg loss up to batch:  2.060780232568986
...
iteration:  4000 avg loss up to batch:  1.937148112389848
...
iteration:  5000 avg loss up to batch:  1.8688602231891303
...
iteration:  10000 avg loss up to batch:  1.700315477882425
...
iteration:  15000 avg loss up to batch:  1.6058466121965451
...
iteration:  20000 avg loss up to batch:  1.5143681294461326
...
iteration:  25000 avg loss up to batch:  1.432184821156592
training time (in hours):  <3–6, depending on system>
```

In [None]:
# A8:SanityCheck

UVab, losses, tt = train(data, 'newstweet-GloVe')
print("training time (in hours): ", (tt[1] - tt[0])/(60*60))

iteration:  11100 avg loss up to batch:  3.2584514900783565
iteration:  11200 avg loss up to batch:  3.2552854689140363
iteration:  11300 avg loss up to batch:  3.254052727689305
iteration:  11400 avg loss up to batch:  3.2540302338073013
iteration:  11500 avg loss up to batch:  3.2510969610523253
iteration:  11600 avg loss up to batch:  3.2475417875641943
iteration:  11700 avg loss up to batch:  3.2492006878278397
iteration:  11800 avg loss up to batch:  3.24846168095704
iteration:  11900 avg loss up to batch:  3.2484763135324894
iteration:  12000 avg loss up to batch:  3.247301684025125
iteration:  12100 avg loss up to batch:  3.2464563177376906
iteration:  12200 avg loss up to batch:  3.247560358801099
iteration:  12300 avg loss up to batch:  3.245473802247293
iteration:  12400 avg loss up to batch:  3.245634459712037
iteration:  12500 avg loss up to batch:  3.2452445361177826
iteration:  12600 avg loss up to batch:  3.243407523405031
iteration:  12700 avg loss up to batch:  3.24374

### 9. (6 pts) Build an analogy tester
Now that we've got some vectors (even if under-trained), let's go ahead with [Mikolov's analogy test](https://arxiv.org/pdf/1301.3781.pdf), which checks to see if SAT-like analogies can be completed via cosine similarity comparison of vectors. For example, given the usual analogy: "
> king is to man as woman is to queen

let's refer to `('man', 'king')` as the known `pair`, `woman` as the uknown's `predicate`, and `queen` as the unknown's `target` type. According to the word analogy test, 'good' word vectors will satisfy the relationship:
$$
\hat{v} = v_\text{king} - v_\text{man} + v_\text{woman} \sim v_\text{queen}
$$

Where $\sim$ is specifically measured as the cosine similarity. 

Your job is build this evaluation for a given set of vectors, `X`, and their `type_index`, and a specific analogy, provided to the `test_analogy` function as arguments, named `pair`, `predicate`, and `target`. As output, your function must compute the `rank` of the `target` type, according to the sorted set, decreasing, by cosine similarity. 

Also, there's a catch&mdash;make sure your output filters both `pair` terms and the `predicate` _before_ ranking, otherwise your performance will be underestimated!!

Note: please use the `sklearn.metrics.pairwise.cosine_similarity` implementation of the cosine similarity in order to match output with the provided sample. Sorting issues have been observed from standard numpy matrix operations.

In [None]:
# A9:Function(6/6)

from sklearn.metrics.pairwise import cosine_similarity

def test_analogy(pair, predicate, target, type_index, X):

    #--- your code starts here
    vec = X/np.linalg.norm(X, axis = 1)[:, np.newaxis]
    vp1 = vec[type_index[pair[0]], :]
    vp2 = vec[type_index[pair[1]], :]
    vpred = vec[type_index[predicate], :]
    vtarget = vec[type_index[target], :]
    
    vhat = vp2 - vp1 + vpred
    target_sim = cosine_similarity([vhat], [vtarget])[0][0]
    
    sims = []
    for token in type_index:
        v_token = vec[type_index[token], :]
        token_sim = cosine_similarity([vhat], [vtarget])[0][0]
        sims.append(token_sim)
        
    sims.sort(reverse=True)
    
    for i, sim in enumerate(sims):
        if sim == target_sim:
            rank = i
    #--- your code stops here

    return rank, sim

Note: the `# A9:SanityCheck` below tests agains the different components of your trained vectors, in addition to those from the `CoM` model (measured from the larger data set) _and_ the set 50-dimensional GloVe pretrained "Wikipedia 2014 + Gigaword 5" vectors, [obtained from the author's original resource page](https://nlp.stanford.edu/projects/glove/).

For reference, your output should be:
```
[('U rank, similarity: ', (27, 0.9873522387316712)),
 ('V rank, similarity: ', (554, 0.966891580621855)),
 ('UVab rank, similarity: ', (31, 0.9804160087551583)),
 ('CoM rank, similarity: ', (9, 0.9996402262687847)),
 ('pretrained rank, similarity: ', (1, 0.9373217383382935))]
```

In [None]:
# A9:SanityCheck

# let's break down our vectors to see how a few different pieces perform
dim = int((UVab.shape[1]/2) - 1)
U = UVab[:,:dim]; V = UVab[:,dim:2*dim]

# load some CoM statistics from the larger file
CoM_d = np.load("./data/newstweet-sample-linked-CoM_d.npy")
CoM_d_index = json.load(open("./data/newstweet-sample-linked-type_index.json"))

# load the pre-trained 50-dimensional wikipedia GloVe vectors
import csv
GloVe_pretrained = []; GloVe_pretrained_index = {}
for line in open('./data/glove.6B.50d.txt'):
    row = line.split(" ")
    if row[0] in data['type_index']:
        GloVe_pretrained.append(list(map(float, row[1:])))
        GloVe_pretrained_index[row[0]] = len(GloVe_pretrained_index)
    
GloVe_pretrained = np.array(GloVe_pretrained)

[("U rank, similarity: ", test_analogy(('man', 'he'), 'woman', 'she', data['type_index'], U)),
 ("V rank, similarity: ", test_analogy(('man', 'he'), 'woman', 'she', data['type_index'], V)),
 ("UVab rank, similarity: ", test_analogy(('man', 'he'), 'woman', 'she', data['type_index'], UVab)),
 ("CoM rank, similarity: ", test_analogy(('man', 'he'), 'woman', 'she', CoM_d_index, CoM_d)),
 ("pretrained rank, similarity: ", test_analogy(('man', 'he'), 'woman', 'she', GloVe_pretrained_index, GloVe_pretrained))]

### 10. (6 pts) Test a sample of analogies using the tester
The `analogies` we'll be using are the [MSR](https://www.microsoft.com/en-us/research/people/) set (see [here](https://aclweb.org/aclwiki/Analogy_(State_of_the_art)) for more details) and will be loaded as a dataframe of shape:
```
>>> print(analogies.head())
   Unnamed: 0     type   word1   word2     word3    target
0           0   JJ_JJR    good  better     rough   rougher
1           1   JJR_JJ  better    good   rougher     rough
2           2   JJ_JJS    good    best     rough  roughest
3           3   JJS_JJ    best    good  roughest     rough
4           4  JJS_JJR    best  better  roughest   rougher
```
i.e., so that by column: `pair = (word1, word2)`, `predicate = word3`, and `target = target`.

Using these data, form the input needed and apply the `test_analogy` function. Then, use its `rank`  ($r$) output to compute a `score` as:
$$
\text{score} = 1 - \frac{r - 1}{|W|}
$$
Each computed `rank` and `score` should be appended into its respective list, and the scores will provide a baseline idea of the model's performance against random guessing, i.e., whereupon the list of `scores` should average to $\approx 0.5$. We'll use the `ranks` to compute another ranking metric in the next, too.

Note: you must make sure that _all_ components of the analogy exist in the `type_index` before attempting to `rank` and `score`, otherwise, ignore the analogy!

In [None]:
# A10:Function(6/6)

def analyze_analogies(analogies, type_index, X, num = 250, verbose = True):
    scores = []; ranks = []
    for i, row in analogies.sample(n=analogies.shape[0], random_state=691).iterrows():

        #--- your code starts here
        if (row['word1'] and row['word2'] and row['word3'] and row['target']) in type_index:
          pair = (row['word1'], row['word2'])
          predicate = row['word3']
          target = row['target']

          rank, sim = test_analogy(pair, predicate, target, type_index, X)

          score = 1 - (rank-1)/len(type_index)

          ranks.append(rank)
          scores.append(score)
        #--- your code stops here
        
        if verbose and not len(scores) % int(num/10):
            print(100*len(scores)/num, "% complete")
        
        if len(scores) == num:
            break
    return scores, ranks

For reference, your output should be:
```
Analyzing U-matrix output...
10.0 % complete
20.0 % complete
30.0 % complete
40.0 % complete
50.0 % complete
60.0 % complete
70.0 % complete
80.0 % complete
90.0 % complete
100.0 % complete
done. total score:  172.67554085512182
Analyzing V-matrix output...
10.0 % complete
20.0 % complete
30.0 % complete
40.0 % complete
50.0 % complete
60.0 % complete
70.0 % complete
80.0 % complete
90.0 % complete
100.0 % complete
done. total score:  175.3225446930135
Analyzing UVab-matrix output...
10.0 % complete
20.0 % complete
30.0 % complete
40.0 % complete
50.0 % complete
60.0 % complete
70.0 % complete
80.0 % complete
90.0 % complete
100.0 % complete
done. total score:  186.3084334750446
Analyzing CoM-matrix output...
10.0 % complete
20.0 % complete
30.0 % complete
40.0 % complete
50.0 % complete
60.0 % complete
70.0 % complete
80.0 % complete
90.0 % complete
100.0 % complete
done. total score:  193.1615869277991
Analyzing pretrained-matrix output...
10.0 % complete
20.0 % complete
30.0 % complete
40.0 % complete
50.0 % complete
60.0 % complete
70.0 % complete
80.0 % complete
90.0 % complete
100.0 % complete
done. total score:  249.48278079549016
```

In [None]:
# A10:SanityCheck

import pandas as pd
analogies = pd.read_csv('./data/msr.csv')

print("Analyzing U-matrix output...")
U_scores, U_ranks = analyze_analogies(analogies, data['type_index'], U)
print("done. total score: ", sum(U_scores))

print("Analyzing V-matrix output...")
V_scores, V_ranks = analyze_analogies(analogies, data['type_index'], V)
print("done. total score: ", sum(V_scores))

print("Analyzing UVab-matrix output...")
UVab_scores, UVab_ranks = analyze_analogies(analogies, data['type_index'], UVab)
print("done. total score: ", sum(UVab_scores))

print("Analyzing CoM-matrix output...")
CoM_scores, CoM_ranks = analyze_analogies(analogies, CoM_d_index, CoM_d)
print("done. total score: ", sum(CoM_scores))

print("Analyzing pretrained-matrix output...")
pretrained_scores, pretrained_ranks = analyze_analogies(analogies, GloVe_pretrained_index, GloVe_pretrained)
print("done. total score: ", sum(pretrained_scores))

### 11. (4 pts) Test a sample of analogies using the tester
Finally, to complete the assignmen your job is to produce a function that computes aggregated performance statistics for the a given list of `scores` and `ranks`. In particular, your code must compute 
1. the arithmetic `mean_score`, i.e., simple arithmetic average of the `scores`;
2. the model's overall `accuracy`, i.e., the number of analogies of rank `1`; and
3. the model's [mean reciprocal rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank), which is the reciprocal of the harmonic mean of ranks.

In [None]:
# A11:Function(4/4)

def evaluate_performance(scores, ranks):
    
    #--- your code starts here
    mean_score = np.average(scores)
    accuracy = 100*np.sum([1 for rank in ranks if rank ==1 ])/len(scores)
    MRR = (1/len(ranks)* np.sum(1/np.array(ranks))


    #--- your code stops here
    
    return mean_score, accuracy, MRR

For reference, your output should be:
```
U performance (mean score, accuracy, MRR):  (0.6907021634204873, 0.0, 0.001740256245558779)
V performance (mean score, accuracy, MRR):  (0.701290178772054, 0.0, 0.0009884626617602233)
UVab performance (mean score, accuracy, MRR):  (0.7452337339001784, 0.0, 0.0024205505157109185)
CoM performance (mean score, accuracy, MRR):  (0.7726463477111964, 0.4, 0.010905247640529447)
pre-trained performance (mean score, accuracy, MRR):  (0.9979311231819606, 40.0, 0.5068096534223601)
```

In [None]:
# A11:SanityCheck

# test the NewsTweet-trained GloVe model U component
print("U performance (mean score, accuracy, MRR): ", 
      evaluate_performance(U_scores, U_ranks))

# test the NewsTweet-trained GloVe model V component
print("V performance (mean score, accuracy, MRR): ", 
      evaluate_performance(V_scores, V_ranks))

# test the NewsTweet-trained full GloVe model
print("UVab performance (mean score, accuracy, MRR): ", 
      evaluate_performance(UVab_scores, UVab_ranks))

# test the NewsTweet-trained CoM model
print("CoM performance (mean score, accuracy, MRR): ", 
      evaluate_performance(CoM_scores, CoM_ranks))

# test the Wikipedia pre-trained 50-d GloVe model
print("pre-trained performance (mean score, accuracy, MRR): ", 
      evaluate_performance(pretrained_scores, pretrained_ranks))