# Homework and bake-off: Word similarity

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Dataset readers](#Dataset-readers)
1. [Dataset comparisons](#Dataset-comparisons)
  1. [Vocab overlap](#Vocab-overlap)
  1. [Pair overlap and score correlations](#Pair-overlap-and-score-correlations)
1. [Evaluation](#Evaluation)
  1. [Dataset evaluation](#Dataset-evaluation)
  1. [Dataset error analysis](#Dataset-error-analysis)
  1. [Full evaluation](#Full-evaluation)
1. [Homework questions](#Homework-questions)
  1. [PPMI as a baseline [0.5 points]](#PPMI-as-a-baseline-[0.5-points])
  1. [Gigaword with LSA at different dimensions [0.5 points]](#Gigaword-with-LSA-at-different-dimensions-[0.5-points])
  1. [Gigaword with GloVe for a small number of iterations [0.5 points]](#Gigaword-with-GloVe-for-a-small-number-of-iterations-[0.5-points])
  1. [Dice coefficient [0.5 points]](#Dice-coefficient-[0.5-points])
  1. [t-test reweighting [2 points]](#t-test-reweighting-[2-points])
  1. [Enriching a VSM with subword information [2 points]](#Enriching-a-VSM-with-subword-information-[2-points])
  1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

Word similarity datasets have long been used to evaluate distributed representations. This notebook provides basic code for conducting such analyses with a number of datasets:

| Dataset | Pairs | Task-type | Current best Spearman $\rho$ | Best $\rho$ paper |
|---------|-------|-----------|------------------------------|-------------------|
| [WordSim-353](http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/) | 353 | Relatedness | 82.8 | [Speer et al. 2017](https://arxiv.org/abs/1612.03975) |
| [MTurk-771](http://www2.mta.ac.il/~gideon/mturk771.html) | 771 | Relatedness | 81.0 | [Speer et al. 2017](https://arxiv.org/abs/1612.03975) |
| [The MEN Test Collection](http://clic.cimec.unitn.it/~elia.bruni/MEN) | 3,000 | Relatedness | 86.6 | [Speer et al. 2017](https://arxiv.org/abs/1612.03975) |
| [SimVerb-3500-dev](http://people.ds.cam.ac.uk/dsg40/simverb.html) | 500 | Similarity | 61.1 | [Mrkšić et al. 2016](https://arxiv.org/pdf/1603.00892.pdf) |
| [SimVerb-3500-test](http://people.ds.cam.ac.uk/dsg40/simverb.html) | 3,000 | Similarity | 62.4 | [Mrkšić et al. 2016](https://arxiv.org/pdf/1603.00892.pdf) |

Each of the similarity datasets contains word pairs with an associated human-annotated similarity score. (We convert these to distances to align intuitively with our distance measure functions.) The evaluation code measures the distance between the word pairs in your chosen VSM (which should be a `pd.DataFrame`).

The evaluation metric for each dataset is the [Spearman correlation coefficient $\rho$](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) between the annotated scores and your distances, as is standard in the literature. We also macro-average these correlations across the datasets for an overall summary. (In using the macro-average, we are saying that we care about all the datasets equally, even though they vary in size.)
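
To make the metric concrete, here is a minimal pure-Python sketch of Spearman's $\rho$ (the Pearson correlation of the rank-transformed values). The helper names and the toy numbers are hypothetical; the notebook itself relies on `scipy.stats.spearmanr`. The sketch also shows why the similarity scores are negated: a negated similarity should correlate positively with a distance.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: human similarity ratings and model distances.
similarities = [9.0, 7.5, 3.0, 1.0]
distances = [0.10, 0.35, 0.30, 0.90]

# Negate the ratings, as the dataset readers do, so that a good model
# yields a positive correlation with its distances.
neg_scores = [-s for s in similarities]
rho = spearman(neg_scores, distances)
```

Only the rankings matter here: rescaling the distances by any monotonically increasing function would leave `rho` unchanged.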

This homework ([questions at the bottom of this notebook](#Homework-questions)) asks you to write code that uses the count matrices in `data/vsmdata` to create and evaluate some baseline models as well as an original model $M$ that you design. This accounts for 9 of the 10 points for this assignment.

For the associated bake-off, we will distribute two new word similarity or relatedness datasets and associated reader code, and you will evaluate $M$ (no additional training or tuning allowed!) on those new datasets. Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points.

## Set-up

In [1]:
from collections import defaultdict
import csv
import itertools
import numpy as np
import os
import pandas as pd
from scipy.stats import spearmanr
import vsm
from IPython.display import display

In [2]:
VSM_HOME = os.path.join('data', 'vsmdata')

WORDSIM_HOME = os.path.join('data', 'wordsim')

## Dataset readers

In [3]:
def wordsim_dataset_reader(
        src_filename, 
        header=False, 
        delimiter=',', 
        score_col_index=2):
    """Basic reader that works for all similarity datasets. They are 
    all tabular-style releases where the first two columns give the 
    words and a later column (`score_col_index`) gives the score.

    Parameters
    ----------
    src_filename : str
        Full path to the source file.
    header : bool
        Whether `src_filename` has a header. Default: False
    delimiter : str
        Field delimiter in `src_filename`. Default: ','
    score_col_index : int
        Column containing the similarity scores. Default: 2

    Yields
    ------
    (str, str, float)
       (w1, w2, score) where `score` is the negative of the similarity
       score in the file so that we are intuitively aligned with our
       distance-based code. To align with our VSMs, all the words are 
       downcased.

    """
    with open(src_filename) as f:
        reader = csv.reader(f, delimiter=delimiter)
        if header:
            next(reader)
        for row in reader:
            w1 = row[0].strip().lower()
            w2 = row[1].strip().lower()
            score = row[score_col_index]
            # Negative of scores to align intuitively with distance functions:
            score = -float(score)
            yield (w1, w2, score)

def wordsim353_reader():
    """WordSim-353: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/"""
    src_filename = os.path.join(
        WORDSIM_HOME, 'wordsim353', 'combined.csv')
    return wordsim_dataset_reader(
        src_filename, header=True)

def mturk771_reader():
    """MTURK-771: http://www2.mta.ac.il/~gideon/mturk771.html"""
    src_filename = os.path.join(
        WORDSIM_HOME, 'MTURK-771.csv')
    return wordsim_dataset_reader(
        src_filename, header=False)

def simverb3500dev_reader():
    """SimVerb-3500: http://people.ds.cam.ac.uk/dsg40/simverb.html"""
    src_filename = os.path.join(
        WORDSIM_HOME, 'SimVerb-3500', 'SimVerb-500-dev.txt')
    return wordsim_dataset_reader(
        src_filename, delimiter="\t", header=False, score_col_index=3)

def simverb3500test_reader():
    """SimVerb-3500: http://people.ds.cam.ac.uk/dsg40/simverb.html"""
    src_filename = os.path.join(
        WORDSIM_HOME, 'SimVerb-3500', 'SimVerb-3000-test.txt')
    return wordsim_dataset_reader(
        src_filename, delimiter="\t", header=False, score_col_index=3)

def men_reader():
    """MEN: http://clic.cimec.unitn.it/~elia.bruni/MEN"""
    src_filename = os.path.join(
        WORDSIM_HOME, 'MEN', 'MEN_dataset_natural_form_full')
    return wordsim_dataset_reader(
        src_filename, header=False, delimiter=' ') 

This collection of readers will be useful for flexible evaluations:

In [4]:
READERS = (wordsim353_reader, mturk771_reader, simverb3500dev_reader, 
           simverb3500test_reader, men_reader)

## Dataset comparisons

This section does some basic analysis of the datasets. The goal is to obtain a deeper understanding of what problem we're solving – what strengths and weaknesses the datasets have and how they relate to each other. For a full-fledged project, we would want to continue work like this and report on it in the paper, to provide context for the results.

In [5]:
def get_reader_name(reader):
    """Return a cleaned-up name for the similarity dataset 
    iterator `reader`
    """
    return reader.__name__.replace("_reader", "")

### Vocab overlap

How many vocabulary items are shared across the datasets?

In [6]:
def get_reader_vocab(reader):
    """Return the set of words (str) in `reader`."""
    vocab = set()
    for w1, w2, _ in reader():
        vocab.add(w1)
        vocab.add(w2)
    return vocab

In [7]:
def get_reader_vocab_overlap(readers=READERS):
    """Get data on the vocab-level relationships between pairs of 
    readers. Returns a `pd.DataFrame` containing this information.
    """
    data = []
    for r1, r2 in itertools.product(readers, repeat=2):       
        v1 = get_reader_vocab(r1)
        v2 = get_reader_vocab(r2)
        d = {
            'd1': get_reader_name(r1),
            'd2': get_reader_name(r2),
            'overlap': len(v1 & v2), 
            'union': len(v1 | v2),
            'd1_size': len(v1),
            'd2_size': len(v2)}
        data.append(d)
    return pd.DataFrame(data)

In [8]:
vocab_overlap = get_reader_vocab_overlap()

In [9]:
def vocab_overlap_crosstab(vocab_overlap):
    """Return an intuitively formatted `pd.DataFrame` giving 
    vocab-overlap counts for all the datasets represented in 
    `vocab_overlap`, the output of `get_reader_vocab_overlap`.
    """        
    xtab = pd.crosstab(
        vocab_overlap['d1'], 
        vocab_overlap['d2'], 
        values=vocab_overlap['overlap'], 
        aggfunc=np.mean)
    # Blank out the upper right to reduce visual clutter:
    for i in range(0, xtab.shape[0]):
        for j in range(i+1, xtab.shape[1]):
            xtab.iloc[i, j] = ''        
    return xtab        

In [10]:
vocab_overlap_crosstab(vocab_overlap)

| d1 \ d2 | men | mturk771 | simverb3500dev | simverb3500test | wordsim353 |
|---------|-----|----------|----------------|-----------------|------------|
| men | 751 | | | | |
| mturk771 | 230 | 1113 | | | |
| simverb3500dev | 23 | 67 | 536 | | |
| simverb3500test | 30 | 94 | 532 | 823 | |
| wordsim353 | 86 | 158 | 13 | 17 | 437 |


This looks reasonable. By design, the SimVerb dev and test sets have a lot of overlap. The other overlap numbers are pretty small, even adjusting for dataset size.
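
One way to adjust for dataset size is to divide each overlap by the size of the smaller vocabulary. The sketch below does this with the counts from the crosstab output above; the normalization itself (an overlap coefficient) and the helper name are illustrative assumptions, not something the notebook computes.

```python
# Vocabulary sizes (the diagonal of the crosstab) and pairwise overlaps,
# copied from the output above.
vocab_size = {
    "men": 751,
    "mturk771": 1113,
    "simverb3500dev": 536,
    "simverb3500test": 823,
    "wordsim353": 437,
}
overlap = {
    ("men", "mturk771"): 230,
    ("men", "simverb3500dev"): 23,
    ("men", "simverb3500test"): 30,
    ("men", "wordsim353"): 86,
    ("mturk771", "simverb3500dev"): 67,
    ("mturk771", "simverb3500test"): 94,
    ("mturk771", "wordsim353"): 158,
    ("simverb3500dev", "simverb3500test"): 532,
    ("simverb3500dev", "wordsim353"): 13,
    ("simverb3500test", "wordsim353"): 17,
}


def overlap_coefficient(d1, d2):
    """Shared vocabulary as a fraction of the smaller vocabulary."""
    return overlap[(d1, d2)] / min(vocab_size[d1], vocab_size[d2])


# Print the pairs from most to least overlapping.
for (d1, d2) in sorted(overlap, key=lambda p: -overlap_coefficient(*p)):
    print("{:16s} {:16s} {:.3f}".format(d1, d2, overlap_coefficient(d1, d2)))
```

On this measure the SimVerb dev and test sets share almost all of the dev vocabulary, while every other pair stays well below one half.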

### Pair overlap and score correlations

How many word pairs are shared across datasets and, for shared pairs, what is the correlation between their scores? That is, do the datasets agree?

In [11]:
def get_reader_pairs(reader):
    """Return a dict mapping each alphabetically sorted word-pair 
    tuple (str, str) in `reader` to its score.
    """
    return {tuple(sorted([w1, w2])): score for w1, w2, score in reader()}

In [12]:
def get_reader_pair_overlap(readers=READERS):
    """Return a `pd.DataFrame` giving the number of overlapping 
    word-pairs in pairs of readers, along with the Spearman 
    correlations.
    """    
    data = []
    for r1, r2 in itertools.product(readers, repeat=2):
        if r1.__name__ != r2.__name__:
            d1 = get_reader_pairs(r1)
            d2 = get_reader_pairs(r2)
            overlap = []
            for p, s in d1.items():
                if p in d2:
                    overlap.append([s, d2[p]])
            if overlap:
                s1, s2 = zip(*overlap)
                rho = spearmanr(s1, s2)[0]
            else:
                rho = None
            # Canonical order for the pair:
            n1, n2 = sorted([get_reader_name(r1), get_reader_name(r2)])
            d = {
                'd1': n1,
                'd2': n2,
                'pair_overlap': len(overlap),
                'rho': rho}
            data.append(d)
    df = pd.DataFrame(data)
    df = df.sort_values(['pair_overlap','d1','d2'], ascending=False)
    # Return only every other row to avoid repeats:
    return df[::2].reset_index(drop=True)

In [13]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    display(get_reader_pair_overlap())

|   | d1 | d2 | pair_overlap | rho |
|---|----|----|--------------|-----|
| 0 | men | mturk771 | 11 | 0.592191 |
| 1 | men | wordsim353 | 5 | 0.7 |
| 2 | mturk771 | simverb3500test | 4 | 0.4 |
| 3 | men | simverb3500test | 2 | 1.0 |
| 4 | simverb3500dev | simverb3500test | 1 | |
| 5 | simverb3500test | wordsim353 | 0 | |
| 6 | simverb3500dev | wordsim353 | 0 | |
| 7 | mturk771 | wordsim353 | 0 | |
| 8 | mturk771 | simverb3500dev | 0 | |
| 9 | men | simverb3500dev | 0 | |


This looks reasonable: none of the datasets have a lot of overlapping pairs, so we don't have to worry too much about places where they give conflicting scores.

## Evaluation

This section builds up the evaluation code that you'll use for the homework and bake-off. For illustrations, I'll read in a VSM created from `data/vsmdata/giga_window5-scaled.csv.gz`:

In [14]:
giga5 = pd.read_csv(
    os.path.join(VSM_HOME, "giga_window5-scaled.csv.gz"), index_col=0)

### Dataset evaluation

In [15]:
def word_similarity_evaluation(reader, df, distfunc=vsm.cosine):
    """Word-similarity evaluation framework.
    
    Parameters
    ----------
    reader : iterator
        A reader for a word-similarity dataset. Just has to yield
        tuples (word1, word2, score).    
    df : pd.DataFrame
        The VSM being evaluated.        
    distfunc : function mapping vector pairs to floats.
        The measure of distance between vectors. Can also be 
        `vsm.euclidean`, `vsm.matching`, `vsm.jaccard`, as well as 
        any other float-valued function on pairs of vectors.    
        
    Raises
    ------
    ValueError
        If a word in `reader` is not in `df.index`.
    
    Returns
    -------
    float, data
        `float` is the Spearman rank correlation coefficient between 
        the dataset scores and the similarity values obtained from 
        `df` using  `distfunc`. This evaluation is sensitive only to 
        rankings, not to absolute values.  `data` is a `pd.DataFrame` 
        with columns['word1', 'word2', 'score', 'distance'].
        
    """
    data = []
    for w1, w2, score in reader():
        d = {'word1': w1, 'word2': w2, 'score': score}
        for w in [w1, w2]:
            if w not in df.index:
                raise ValueError(
                    "Word '{}' is in the similarity dataset {} but not in the "
                    "DataFrame, making this evaluation ill-defined. Please "
                    "switch to a DataFrame with an appropriate vocabulary.".
                    format(w, get_reader_name(reader))) 
        d['distance'] = distfunc(df.loc[w1], df.loc[w2])
        data.append(d)
    data = pd.DataFrame(data)
    rho, pvalue = spearmanr(data['score'].values, data['distance'].values)
    return rho, data

In [16]:
rho, eval_df = word_similarity_evaluation(men_reader, giga5)

In [17]:
rho

0.40375964105441753

In [18]:
eval_df.head()

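|   | word1 | word2 | score | distance |
|---|-------|-------|-------|----------|
| 0 | sun | sunlight | -50.0 | 0.956828 |
| 1 | automobile | car | -50.0 | 0.979143 |
| 2 | river | water | -49.0 | 0.970105 |
| 3 | stairs | staircase | -49.0 | 0.980475 |
| 4 | morning | sunrise | -49.0 | 0.963624 |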


### Dataset error analysis

For error analysis, we can look at the words with the largest delta between the gold score and the distance value in our VSM. We do these comparisons based on ranks, just as with our primary metric (Spearman $\rho$), and we normalize both rankings so that they have a comparable number of levels.

In [19]:
def word_similarity_error_analysis(eval_df):    
    eval_df['distance_rank'] = _normalized_ranking(eval_df['distance'])
    eval_df['score_rank'] = _normalized_ranking(eval_df['score'])
    eval_df['error'] =  abs(eval_df['distance_rank'] - eval_df['score_rank'])
    return eval_df.sort_values('error')
    
    
def _normalized_ranking(series):
    ranks = series.rank(method='dense')
    return ranks / ranks.sum()    
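
As an illustration of the normalization, the following pure-Python sketch mimics what `_normalized_ranking` does with pandas' `rank(method='dense')`, applied to the five rows of `eval_df.head()` shown earlier. The function name is hypothetical and the code is a stand-in for the pandas version, not a replacement for it.

```python
def normalized_dense_ranking(values):
    """Dense-rank `values` (ties share a rank, no gaps between ranks),
    then divide each rank by the sum of all ranks."""
    distinct = sorted(set(values))
    rank_of = {v: r for r, v in enumerate(distinct, start=1)}
    ranks = [rank_of[v] for v in values]
    total = sum(ranks)
    return [r / total for r in ranks]


# Scores and distances for the first five MEN pairs shown above.
scores = [-50.0, -50.0, -49.0, -49.0, -49.0]
distances = [0.956828, 0.979143, 0.970105, 0.980475, 0.963624]

score_rank = normalized_dense_ranking(scores)        # only 2 distinct levels
distance_rank = normalized_dense_ranking(distances)  # 5 distinct levels
errors = [abs(d - s) for d, s in zip(distance_rank, score_rank)]
```

The gold scores have far fewer distinct levels than the real-valued distances, so dividing by the rank sum is what puts the two rankings on a comparable scale before they are subtracted.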

Best predictions:

In [20]:
word_similarity_error_analysis(eval_df).head()

|   | word1 | word2 | score | distance | distance_rank | score_rank | error |
|---|-------|-------|-------|----------|---------------|------------|-------|
| 1041 | hummingbird | pelican | -32.0 | 0.975007 | 0.000243 | 0.000244 | 2.434543e-07 |
| 2315 | lily | pigs | -13.0 | 0.980834 | 0.000488 | 0.000487 | 4.016842e-07 |
| 2951 | bucket | girls | -4.0 | 0.983473 | 0.000602 | 0.000603 | 4.151568e-07 |
| 150 | night | sunset | -43.0 | 0.96869 | 0.000102 | 0.000103 | 6.520315e-07 |
| 2062 | oak | petals | -17.0 | 0.979721 | 0.000435 | 0.000436 | 7.162632e-07 |


Worst predictions:

In [21]:
word_similarity_error_analysis(eval_df).tail()

|   | word1 | word2 | score | distance | distance_rank | score_rank | error |
|---|-------|-------|-------|----------|---------------|------------|-------|
| 67 | branch | twigs | -45.0 | 0.984622 | 0.00063 | 7.7e-05 | 0.000553 |
| 190 | birds | stork | -43.0 | 0.987704 | 0.000657 | 0.000103 | 0.000554 |
| 185 | bloom | tulip | -43.0 | 0.990993 | 0.000663 | 0.000103 | 0.000561 |
| 167 | bloom | blossom | -43.0 | 0.99176 | 0.000664 | 0.000103 | 0.000561 |
| 198 | bloom | rose | -43.0 | 0.992406 | 0.000664 | 0.000103 | 0.000561 |


### Full evaluation

A full evaluation is just a loop over all the readers on which one wants to evaluate, with a macro-average at the end:

In [22]:
def full_word_similarity_evaluation(df, readers=READERS, distfunc=vsm.cosine):
    """Evaluate a VSM against all datasets in `readers`.
    
    Parameters
    ----------
    df : pd.DataFrame
    readers : tuple 
        The similarity dataset readers on which to evaluate.
    distfunc : function mapping vector pairs to floats.
        The measure of distance between vectors. Can also be 
        `vsm.euclidean`, `vsm.matching`, `vsm.jaccard`, as well as 
        any other float-valued function on pairs of vectors.    
    
    Returns
    -------
    pd.Series
        Mapping dataset names to Spearman r values.
        
    """        
    scores = {}     
    for reader in readers:
        score, data_df = word_similarity_evaluation(reader, df, distfunc=distfunc)
        scores[get_reader_name(reader)] = score
    series = pd.Series(scores, name='Spearman r')
    series['Macro-average'] = series.mean()
    return series

In [23]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    display(full_word_similarity_evaluation(giga5))

wordsim353         0.327831
mturk771           0.143146
simverb3500dev    -0.068038
simverb3500test   -0.066348
men                0.403760
Macro-average      0.148070
Name: Spearman r, dtype: float64
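
The macro-average above is the unweighted mean of the five per-dataset correlations. The short sketch below recomputes it from the printed values and contrasts it with a pair-count-weighted mean, using the pair counts from the table in the Overview. The weighted variant is shown only to illustrate what "caring about all datasets equally" rules out; it is not a metric used in this notebook.

```python
# Per-dataset Spearman correlations, copied from the output above.
rho = {
    "wordsim353": 0.327831,
    "mturk771": 0.143146,
    "simverb3500dev": -0.068038,
    "simverb3500test": -0.066348,
    "men": 0.403760,
}
# Number of word pairs per dataset, from the Overview table.
n_pairs = {
    "wordsim353": 353,
    "mturk771": 771,
    "simverb3500dev": 500,
    "simverb3500test": 3000,
    "men": 3000,
}

# Macro-average: every dataset counts equally.
macro = sum(rho.values()) / len(rho)

# Illustrative alternative: weight each dataset by its number of pairs,
# so the two 3,000-pair datasets dominate the summary.
weighted = sum(rho[d] * n_pairs[d] for d in rho) / sum(n_pairs.values())

print("macro-average: {:.6f}".format(macro))
print("pair-weighted: {:.6f}".format(weighted))
```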

## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### PPMI as a baseline [0.5 points]

The insight behind PPMI is a recurring theme in word representation learning, so it is a natural baseline for our task. For this question, write a function called `run_giga_ppmi_baseline` that does the following:

1. Reads the Gigaword count matrix with a window of 20 and a flat scaling function into a `pd.DataFrame`, as is done in the VSM notebooks. The file is `data/vsmdata/giga_window20-flat.csv.gz`, and the VSM notebooks provide examples of the needed code.

1. Reweights this count matrix with PPMI.

1. Evaluates this reweighted matrix using `full_word_similarity_evaluation`. The return value of `run_giga_ppmi_baseline` should be the return value of this call to `full_word_similarity_evaluation`.

The goal of this question is to help you get more familiar with the code in `vsm` and the function `full_word_similarity_evaluation`.

The function `test_run_giga_ppmi_baseline` can be used to test that you've implemented this specification correctly.
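
For intuition about the reweighting step, here is a small pure-Python sketch of PPMI on a toy count matrix (the same 4×4 matrix used in the tests later in this notebook). It is an illustration under the usual definition, the log of observed over expected counts with negative values clipped to zero, and treats zero counts as 0; it is not claimed to match every detail of `vsm.pmi`.

```python
import math


def ppmi(counts):
    """Positive PMI for a list-of-lists count matrix."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    reweighted = []
    for i, row in enumerate(counts):
        new_row = []
        for j, observed in enumerate(row):
            # Count expected if row i and column j were independent.
            expected = row_sums[i] * col_sums[j] / total
            if observed > 0 and expected > 0:
                new_row.append(max(0.0, math.log(observed / expected)))
            else:
                # Assumption: treat log(0) as 0 rather than -inf.
                new_row.append(0.0)
        reweighted.append(new_row)
    return reweighted


X = [
    [4.0, 4.0, 2.0, 0.0],
    [4.0, 61.0, 8.0, 18.0],
    [2.0, 8.0, 10.0, 0.0],
    [0.0, 18.0, 0.0, 5.0],
]
X_ppmi = ppmi(X)
```

Cell (0, 0) is observed more often than independence would predict (4 versus 100/144), so it receives a positive weight, whereas cell (0, 1) is observed less often than expected (4 versus 910/144) and is clipped to 0.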

In [25]:

def run_giga_ppmi_baseline():    
    giga20 = pd.read_csv(os.path.join(VSM_HOME, 'giga_window20-flat.csv.gz'), index_col=0)
    giga20_pmi = vsm.pmi(giga20, positive=True)
    res1 = full_word_similarity_evaluation(giga20_pmi)
    return res1



In [26]:
def test_run_giga_ppmi_baseline(run_giga_ppmi_baseline):
    result = run_giga_ppmi_baseline()
    ws_result = result.loc['wordsim353'].round(2)
    ws_expected = 0.58
    assert ws_result == ws_expected, \
        "Expected wordsim353 value of {}; got {}".format(ws_expected, ws_result)

In [27]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_run_giga_ppmi_baseline(run_giga_ppmi_baseline)

### Gigaword with LSA at different dimensions [0.5 points]

We might expect PPMI and LSA to form a solid pipeline that combines the strengths of PPMI with those of dimensionality reduction. However, LSA has a hyper-parameter $k$ – the dimensionality of the final representations – that will impact performance. For this problem, write a wrapper function `run_ppmi_lsa_pipeline` that does the following:

1. Takes as input a count `pd.DataFrame` and an LSA parameter `k`.
1. Reweights the count matrix with PPMI.
1. Applies LSA with dimensionality `k`.
1. Evaluates this reweighted matrix using `full_word_similarity_evaluation`. The return value of `run_ppmi_lsa_pipeline` should be the return value of this call to `full_word_similarity_evaluation`.

The goal of this question is to help you get a feel for how much LSA alone can contribute to this problem. 

The  function `test_run_ppmi_lsa_pipeline` will test your function on the count matrix in `data/vsmdata/giga_window20-flat.csv.gz`.
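
To see what the parameter `k` controls, the following sketch performs a truncated SVD with NumPy and keeps the first `k` scaled left singular vectors as the new row representations. The function name is hypothetical and the code is a generic illustration of the technique, not a copy of `vsm.lsa`.

```python
import numpy as np


def truncated_svd_rows(X, k):
    """Return k-dimensional row representations U_k * S_k of `X`."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Scale each of the first k left singular vectors by its singular value.
    return U[:, :k] * s[:k]


X = np.array([
    [4.0, 4.0, 2.0, 0.0],
    [4.0, 61.0, 8.0, 18.0],
    [2.0, 8.0, 10.0, 0.0],
    [0.0, 18.0, 0.0, 5.0],
])

reduced = truncated_svd_rows(X, k=2)  # shape (4, 2)
full = truncated_svd_rows(X, k=4)     # no truncation
```

With no truncation, the dot products between rows are preserved exactly (`full @ full.T` equals `X @ X.T` up to floating-point error); smaller values of `k` keep only the directions with the largest singular values, which is where the smoothing effect of LSA comes from.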

In [28]:
def run_ppmi_lsa_pipeline(count_df, k):
    pmi_df = vsm.pmi(count_df, positive=True)
    lsa_df = vsm.lsa(pmi_df, k=k)
    res = full_word_similarity_evaluation(lsa_df)
    return res
    




In [29]:
def test_run_ppmi_lsa_pipeline(run_ppmi_lsa_pipeline):
    giga20 = pd.read_csv(
        os.path.join(VSM_HOME, "giga_window20-flat.csv.gz"), index_col=0)
    results = run_ppmi_lsa_pipeline(giga20, k=10)
    men_expected = 0.57
    men_result = results.loc['men'].round(2)
    assert men_result == men_expected,\
        "Expected men value of {}; got {}".format(men_expected, men_result)

In [30]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_run_ppmi_lsa_pipeline(run_ppmi_lsa_pipeline)

### Gigaword with GloVe for a small number of iterations [0.5 points]

Ideally, we would run GloVe for a very large number of iterations on a GPU machine to compare it against its close cousin PMI. However, we don't want this homework to cost you a lot of money or monopolize a lot of your available computing resources, so let's instead just probe GloVe a little bit to see if it has promise for our task. For this problem, write a function `run_small_glove_evals` that does the following:

1. Reads in `data/vsmdata/giga_window20-flat.csv.gz`.
1. Runs GloVe for 10, 100, and 200 iterations on `data/vsmdata/giga_window20-flat.csv.gz`, using the `mittens` implementation of `GloVe`. 
  * For all the other parameters to `mittens.GloVe` besides `max_iter`, use the package's defaults.
  * Because of the way that implementation is designed, these will have to be separate runs, but they should be relatively quick. 
1. Stores the values in a `dict` mapping each `max_iter` value to its associated 'Macro-average' score according to `full_word_similarity_evaluation`. `run_small_glove_evals`  should return this `dict`.

The trend should give you a sense for whether it is worth running GloVe for more iterations.

Some implementation notes:

* Your trained GloVe matrix `X` needs to be wrapped in a `pd.DataFrame` to work with `full_word_similarity_evaluation`. `pd.DataFrame(X, index=giga20.index)` will do the trick.

* If `glv` is your GloVe model, then running `glv.sess.close()` after each model is trained will silence warnings from TensorFlow about interactive sessions being active.

Performance will vary a lot for this function, so there is some uncertainty in the testing, but `test_run_small_glove_evals` will at least check that you wrote a function with the right general logic.

In [31]:
def run_small_glove_evals():
    from mittens import GloVe
    all_res = {}
    giga20 = pd.read_csv(
        os.path.join(VSM_HOME, "giga_window20-flat.csv.gz"), index_col=0)
    for max_iter in [10, 100, 200]:
        glove_model = GloVe(max_iter=max_iter)
        np_giga20 = glove_model.fit(giga20.values)
        glove_model.sess.close()
        giga20_glove = pd.DataFrame(np_giga20, index=giga20.index)
        res = full_word_similarity_evaluation(giga20_glove)
        all_res[max_iter] = res.loc['Macro-average'].round(2)
    return all_res





In [32]:
def test_run_small_glove_evals(run_small_glove_evals):
    data = run_small_glove_evals()
    for max_iter in (10, 100, 200):
        assert max_iter in data
        assert isinstance(data[max_iter], float)

In [33]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_run_small_glove_evals(run_small_glove_evals)

Iteration 200: loss: 1981474.755

### Dice coefficient [0.5 points]

Implement the Dice coefficient for real-valued vectors, as

$$
\textbf{dice}(u, v) = 
1 - \frac{
  2 \sum_{i=1}^{n}\min(u_{i}, v_{i})
}{
    \sum_{i=1}^{n} (u_{i} + v_{i})
}$$
 
You can use `test_dice_implementation` below to check that your implementation is correct.
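
As a worked example, take $u = [4, 4, 2, 0]$ and $v = [4, 61, 8, 18]$, the first two rows of the matrix used in `test_dice_implementation`. The element-wise minima are $[4, 4, 2, 0]$, so:

$$
\textbf{dice}(u, v) =
1 - \frac{2\,(4 + 4 + 2 + 0)}{(4 + 4 + 2 + 0) + (4 + 61 + 8 + 18)}
= 1 - \frac{20}{101}
\approx 0.80198
$$

This agrees with the first value expected by the test.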

In [34]:
def test_dice_implementation(func):
    """`func` should be an implementation of `dice` as defined above."""
    X = np.array([
        [  4.,   4.,   2.,   0.],
        [  4.,  61.,   8.,  18.],
        [  2.,   8.,  10.,   0.],
        [  0.,  18.,   0.,   5.]]) 
    assert func(X[0], X[1]).round(5) == 0.80198
    assert func(X[1], X[2]).round(5) == 0.67568

In [35]:
def dice(u, v):
    n = 2 * np.minimum(u,v).sum()
    d = (u+v).sum()
    return 1 - n/d




In [36]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_dice_implementation(dice)

### t-test reweighting [2 points]



The t-test statistic can be thought of as a reweighting scheme. For a count matrix $X$, row index $i$, and column index $j$:

$$\textbf{ttest}(X, i, j) = 
\frac{
    P(X, i, j) - \big(P(X, i, *)P(X, *, j)\big)
}{
\sqrt{(P(X, i, *)P(X, *, j))}
}$$

where $P(X, i, j)$ is $X_{ij}$ divided by the total values in $X$, $P(X, i, *)$ is the sum of the values in row $i$ of $X$ divided by the total values in $X$, and $P(X, *, j)$ is the sum of the values in column $j$ of $X$ divided by the total values in $X$.

For this problem, implement this reweighting scheme. You can use `test_ttest_implementation` below to check that your implementation is correct. You do not need to use this for any evaluations, though we hope you will be curious enough to do so!
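
The definition can be checked by hand on a small matrix. The sketch below is a pure-Python rendering of the formula using nested lists (the function name is hypothetical, and it assumes every row and column sum is positive so that the denominator is never zero). For cell $(0, 0)$ of the test matrix, $P(X, 0, 0) = 4/144$ and $P(X, 0, *) = P(X, *, 0) = 10/144$, which gives $476/1440 \approx 0.33056$.

```python
def ttest_reweight(counts):
    """t-test reweighting of a list-of-lists count matrix.

    Assumes all row and column sums are positive.
    """
    total = sum(sum(row) for row in counts)
    row_p = [sum(row) / total for row in counts]        # P(X, i, *)
    col_p = [sum(col) / total for col in zip(*counts)]  # P(X, *, j)
    return [
        [(x / total - row_p[i] * col_p[j]) / (row_p[i] * col_p[j]) ** 0.5
         for j, x in enumerate(row)]
        for i, row in enumerate(counts)
    ]


X = [
    [4.0, 4.0, 2.0, 0.0],
    [4.0, 61.0, 8.0, 18.0],
    [2.0, 8.0, 10.0, 0.0],
    [0.0, 18.0, 0.0, 5.0],
]
T = ttest_reweight(X)
```

Unlike PPMI, the result keeps negative values: cells that occur less often than independence predicts, such as $(0, 1)$, receive negative weights.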

In [37]:
def test_ttest_implementation(func):
    """`func` should be an implementation of t-test reweighting as 
    defined above.
    """
    X = pd.DataFrame(np.array([
        [  4.,   4.,   2.,   0.],
        [  4.,  61.,   8.,  18.],
        [  2.,   8.,  10.,   0.],
        [  0.,  18.,   0.,   5.]]))    
    actual = np.array([
        [ 0.33056, -0.07689,  0.04321, -0.10532],
        [-0.07689,  0.03839, -0.10874,  0.07574],
        [ 0.04321, -0.10874,  0.36111, -0.14894],
        [-0.10532,  0.07574, -0.14894,  0.05767]])    
    predicted = func(X)
    assert np.array_equal(predicted.round(5), actual)

In [38]:
def ttest(df):
    X = df.values
    P_X_i_j = X / X.sum()
    col = (X.sum(axis=1) / X.sum()).reshape(-1,1)
    row = X.sum(axis=0) / X.sum()
    P_X_i_s = np.hstack([col for _ in range(X.shape[1])])
    P_X_j_s = np.vstack([row for _ in range(X.shape[0])])
    d = np.sqrt(P_X_i_s * P_X_j_s)
    n = P_X_i_j - (P_X_i_s * P_X_j_s)    
    res = pd.DataFrame( n / d, index=df.index, columns=df.columns)
    return res



In [39]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_ttest_implementation(ttest)

### Enriching a VSM with subword information [2 points]

It might be useful to combine character-level information with word-level information. To help you begin assessing this idea, this question asks you to write a function that modifies an existing VSM so that the representation for each word $w$ is the element-wise sum of $w$'s original word-level representation with all the representations for the n-grams $w$ contains.

The following starter code should help you structure this and clarify the requirements, and a simple test is included below as well.

You don't need to write a lot of code; the motivation for this question is that the function you write could have practical value.
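
To make the idea concrete without relying on `vsm`, here is a dictionary-based sketch of the computation on the toy vocabulary from `test_subword_enrichment`. The function names are hypothetical, and padding each word with `<w>` and `</w>` boundary symbols is an assumption of this sketch (one that reproduces the expected values in that test); your solution should use `vsm.ngram_vsm` and `vsm.character_level_rep` as described in the starter code.

```python
from collections import defaultdict


def char_ngrams(word, n):
    """Character n-grams of `word`, padded with boundary symbols that
    each count as a single token (an assumption of this sketch)."""
    chars = ["<w>"] + list(word) + ["</w>"]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def enrich(vectors, n):
    """`vectors` maps each word to a list of numbers. Returns a dict
    mapping each word to its vector plus the vectors of its n-grams."""
    dim = len(next(iter(vectors.values())))
    # An n-gram's vector is the sum of the vectors of the words containing it.
    ngram_vecs = defaultdict(lambda: [0] * dim)
    for word, vec in vectors.items():
        for ng in char_ngrams(word, n):
            ngram_vecs[ng] = [a + b for a, b in zip(ngram_vecs[ng], vec)]
    enriched = {}
    for word, vec in vectors.items():
        total = list(vec)  # start from the word-level representation
        for ng in char_ngrams(word, n):
            total = [a + b for a, b in zip(total, ngram_vecs[ng])]
        enriched[word] = total
    return enriched


toy = {
    "ABCD": [1, 1, 2, 1],
    "BCDA": [3, 4, 2, 4],
    "CDAB": [0, 0, 1, 0],
    "DABC": [1, 0, 0, 0],
}
enriched = enrich(toy, n=2)
```

For `"ABCD"` with `n=2`, the n-grams are `<w>A`, `AB`, `BC`, `CD`, and `D</w>`; summing their vectors and adding the original row gives `[14, 14, 18, 14]`, the first row expected by the test.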

In [40]:
def subword_enrichment(df, n=4):
    
    # 1. Use `vsm.ngram_vsm` to create a character-level 
    # VSM from `df`, using the above parameter `n` to 
    # set the size of the ngrams.
    
    df_ngram = vsm.ngram_vsm(df, n=n)

        
    # 2. Use `vsm.character_level_rep` to get the representation
    # for every word in `df` according to the character-level
    # VSM you created above.
    
    df_new_vsm = df.apply(func=lambda x: pd.Series(vsm.character_level_rep(x.name, cf=df_ngram, n=n),
                                                   index=df.columns), 
                          axis=1)
    
    # 3. For each representation created at step 2, add in its
    # original representation from `df`. (This should use
    # element-wise addition; the dimensionality of the vectors
    # will be unchanged.)
                            
    df_final_vsm = df + df_new_vsm

    
    # 4. Return a `pd.DataFrame` with the same index and column
    # values as `df`, but filled with the new representations
    # created at step 3.
                            
    return df_final_vsm

In [41]:
def test_subword_enrichment(func):
    """`func` should be an implementation of subword_enrichment as 
    defined above.
    """
    vocab = ["ABCD", "BCDA", "CDAB", "DABC"]
    df = pd.DataFrame([
        [1, 1, 2, 1],
        [3, 4, 2, 4],
        [0, 0, 1, 0],
        [1, 0, 0, 0]], index=vocab)
    expected = pd.DataFrame([
        [14, 14, 18, 14],
        [22, 26, 18, 26],
        [10, 10, 14, 10],
        [14, 10, 10, 10]], index=vocab)
    new_df = func(df, n=2)
    assert np.array_equal(expected.columns, new_df.columns), \
        "Columns are not the same"
    assert np.array_equal(expected.index, new_df.index), \
        "Indices are not the same"
    assert np.array_equal(expected.values, new_df.values), \
        "Co-occurrence values aren't the same"    

In [42]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_subword_enrichment(subword_enrichment)

### Your original system [3 points]

This question asks you to design your own model. You can of course include steps made above (ideally, the above questions informed your system design!), but your model should not be literally identical to any of the above models. Other ideas: retrofitting, autoencoders, GloVe, subword modeling, ... 

Requirements:

1. Your code must operate on one of the count matrices in `data/vsmdata`. You can choose which one. __Other pretrained vectors cannot be introduced__.

1. Your code must be self-contained, so that we can work with your model directly in your homework submission notebook. If your model depends on external data or other resources, please submit a ZIP archive containing these resources along with your submission.

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [43]:
# Enter your system description in this cell.
# Please do not remove this comment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    """
    The general approach of the system was to explore as many available options as possible in a manner similar to a 
    hand-crafted grid search (a discrete, manually defined grid). In terms of actual models, four different categories 
    have been targeted: baseline models (such as PPMI, discounted PPMI, LSA), retrofitted baseline models (such as
    retrofitted discounted positive PMI), autoencoders and GloVe-based models (the autoencoder is based on the simple
    provided architecture without any other improvements such as additional hidden layers, and uses LSA-based 
    dimensionality reduction of the PMI-transformed MCO; in particular, we found that positive PMI works better than
    negative). 
    The fourth and final category is that of ensembles that concatenate (or concatenate-and-SVM-reduce) multiple 
    embeddings into one single embedding (for example: GloVe 150 retrofitted with LSA450 - autoencoder150 
    retrofitted and then finally LSA reduced).
    Another important architectural decision was to train and evaluate each proposed candidate using all provided 
    MCOs from `data/vsmdata`, so that we could later decide which model-dataset combination works best for our 
    experiment.
    
    After performing more than 80 training iterations (more than 20 approaches on each of the 4 provided MCOs), we 
    found that our best option is to retrofit the results of an autoencoder trained on the LSA-reduced 
    discounted positive PMI matrix of the imdb-window5 MCO. With this conclusion, the initial stage of 
    heuristic grid exploration was closed and the focus shifted to fine-tuning the proposed model. To give
    a picture of the grid-search results, below is a dataframe output of the results, where:
     - LSAxxx_PPMIyy are positive PMI matrices, LSA-reduced at k=xxx, that might have discounts (2 different approaches)
     - AExxx_yK2 are autoencoders with xxx embeddings trained for y thousand epochs 
     - PPMIxxxx are various positive PMI baselines (w/o discount applied with 2 options)
     - Cx are ensembles (two architectures)
     - _RETR added after the model name signifies that the result has been retrofitted with WordNet
     
                 MODEL                 DATA  wordsim353  mturk771  simverb3500dev  simverb3500test       men  Macro-average
39              G75_2K   giga_window20-flat    0.346009  0.401967        0.056723         0.049627  0.487193       0.268304
41         G75_2K_RETR   giga_window20-flat    0.321310  0.363652        0.108407         0.103789  0.445423       0.268516
42        G200_2K_RETR   giga_window20-flat    0.313237  0.359199        0.126266         0.115625  0.458164       0.274498
40             G200_2K   giga_window20-flat    0.335347  0.414257        0.071788         0.062357  0.496666       0.276083
82             C2_150r   imdb_window20-flat    0.325856  0.352635        0.149259         0.132554  0.448724       0.281806
81              C2_150   imdb_window20-flat    0.332357  0.352534        0.152638         0.134357  0.450520       0.284481
66              G75_2K  imdb_window5-scaled    0.370917  0.388565        0.093948         0.105596  0.482576       0.288321
68         G75_2K_RETR  imdb_window5-scaled    0.374163  0.385513        0.197215         0.195217  0.467951       0.324012
27              C2_150   giga_window20-flat    0.394362  0.427948        0.165727         0.154533  0.551803       0.338874
28             C2_150r   giga_window20-flat    0.394906  0.428740        0.174473         0.159296  0.560530       0.343589
12              G75_2K  giga_window5-scaled    0.402494  0.479712        0.168492         0.112990  0.589486       0.350635
67             G200_2K  imdb_window5-scaled    0.438300  0.474744        0.169428         0.143061  0.544740       0.354055
14         G75_2K_RETR  giga_window5-scaled    0.388175  0.441037        0.214476         0.181746  0.560286       0.357144
69        G200_2K_RETR  imdb_window5-scaled    0.399811  0.441326        0.261823         0.233856  0.524049       0.372173
13             G200_2K  giga_window5-scaled    0.432319  0.508929        0.186463         0.147255  0.621126       0.379219
15        G200_2K_RETR  giga_window5-scaled    0.404446  0.471663        0.247540         0.221316  0.592824       0.387558
55             C2_150r  imdb_window5-scaled    0.430801  0.453289        0.264101         0.240273  0.556222       0.388937
54              C2_150  imdb_window5-scaled    0.442718  0.460120        0.269422         0.240553  0.573499       0.397263
1              C2_150r  giga_window5-scaled    0.443774  0.498379        0.250595         0.227297  0.630670       0.410143
52        LSA75_PPMID2   giga_window20-flat    0.551618  0.488368        0.230429         0.155304  0.656758       0.416495
0               C2_150  giga_window5-scaled    0.447012  0.509440        0.256311         0.231371  0.643621       0.417551
43                PPMI   giga_window20-flat    0.582573  0.495228        0.232637         0.158925  0.624885       0.418850
45              PPMID2   giga_window20-flat    0.587157  0.507723        0.241369         0.166956  0.637283       0.428097
83              C1_150   imdb_window20-flat    0.518883  0.541709        0.245962         0.191885  0.648792       0.429446
84             C1_150r   imdb_window20-flat    0.519434  0.542172        0.246448         0.192447  0.649361       0.429972
51       LSA200_PPMID2   giga_window20-flat    0.573059  0.512123        0.247582         0.172587  0.669115       0.434893
47       LSA100_PPMID1   giga_window20-flat    0.582617  0.508020        0.244603         0.164431  0.693490       0.438632
29              C1_150   giga_window20-flat    0.573512  0.515020        0.253321         0.187255  0.672496       0.440321
30             C1_150r   giga_window20-flat    0.573272  0.515224        0.253351         0.187827  0.673018       0.440538
36          AE100_1Kd2   giga_window20-flat    0.574683  0.527902        0.253993         0.179512  0.682303       0.443679
86          AE100_1Kd1   imdb_window20-flat    0.585852  0.569396        0.193383         0.174075  0.696205       0.443782
44              PPMID1   giga_window20-flat    0.615723  0.514810        0.256390         0.172416  0.669964       0.445860
85          AE200_1Kd1   imdb_window20-flat    0.595826  0.566020        0.207111         0.179068  0.684143       0.446434
46       LSA200_PPMID1   giga_window20-flat    0.601415  0.527075        0.255420         0.174647  0.694131       0.450538
32          AE100_1Kd1   giga_window20-flat    0.600119  0.532503        0.248483         0.184230  0.706567       0.454380
48    LSA200_PPMI_RETR   giga_window20-flat    0.559960  0.517598        0.285164         0.259575  0.659936       0.456446
87     AE200_1Kd1_RETR   imdb_window20-flat    0.558048  0.522121        0.307551         0.256777  0.658868       0.460673
25        LSA75_PPMID2  giga_window5-scaled    0.551251  0.579595        0.264351         0.196538  0.721627       0.462672
90          AE100_1Kd2   imdb_window20-flat    0.580800  0.563325        0.267625         0.219934  0.695829       0.465503
50   LSA75_PPMID1_RETR   giga_window20-flat    0.577490  0.524386        0.285175         0.247567  0.694228       0.465769
53  LSA200_PPMID2_RETR   giga_window20-flat    0.566863  0.526416        0.296860         0.265071  0.674468       0.465936
88     AE100_1Kd1_RETR   imdb_window20-flat    0.553460  0.527072        0.306167         0.265256  0.679480       0.466287
35          AE200_1Kd2   giga_window20-flat    0.599898  0.563121        0.275055         0.207868  0.695622       0.468313
31          AE200_1Kd1   giga_window20-flat    0.618046  0.567278        0.271859         0.200761  0.710471       0.473683
89          AE200_1Kd2   imdb_window20-flat    0.602134  0.590578        0.249717         0.235149  0.696128       0.474741
70                PPMI  imdb_window5-scaled    0.577190  0.562433        0.310526         0.270943  0.660408       0.476300
9           AE100_1Kd2  giga_window5-scaled    0.568349  0.595862        0.259853         0.226785  0.743782       0.478926
72              PPMID2  imdb_window5-scaled    0.581862  0.572660        0.296686         0.271814  0.675204       0.479645
71              PPMID1  imdb_window5-scaled    0.588492  0.563234        0.307426         0.266128  0.676107       0.480277
49  LSA200_PPMID1_RETR   giga_window20-flat    0.596064  0.539958        0.304601         0.265470  0.696623       0.480543
38     AE100_1Kd2_RETR   giga_window20-flat    0.595299  0.541724        0.314201         0.272006  0.690043       0.482655
79        LSA75_PPMID2  imdb_window5-scaled    0.603732  0.586757        0.266972         0.255839  0.702384       0.483137
17              PPMID1  giga_window5-scaled    0.591688  0.581325        0.306978         0.239871  0.706676       0.485307
20       LSA100_PPMID1  giga_window5-scaled    0.568276  0.599770        0.294443         0.230995  0.750179       0.488733
5           AE100_1Kd1  giga_window5-scaled    0.561517  0.591780        0.301127         0.244475  0.750541       0.489888
16                PPMI  giga_window5-scaled    0.604212  0.587155        0.316475         0.250925  0.696400       0.491033
34     AE100_1Kd1_RETR   giga_window20-flat    0.612804  0.554966        0.310947         0.274225  0.715919       0.493772
24       LSA200_PPMID2  giga_window5-scaled    0.582203  0.619073        0.286614         0.236081  0.746754       0.494145
18              PPMID2  giga_window5-scaled    0.590522  0.609254        0.311360         0.247677  0.712918       0.494346
4           AE200_1Kd1  giga_window5-scaled    0.587092  0.612370        0.275897         0.249625  0.749850       0.494967
58          AE200_1Kd1  imdb_window5-scaled    0.607768  0.599900        0.275105         0.268931  0.729448       0.496230
59          AE100_1Kd1  imdb_window5-scaled    0.593525  0.601826        0.295940         0.266454  0.732591       0.498067
63          AE100_1Kd2  imdb_window5-scaled    0.592976  0.602873        0.281321         0.282096  0.741488       0.500151
19       LSA200_PPMID1  giga_window5-scaled    0.592502  0.622624        0.294481         0.241731  0.750920       0.500452
37     AE200_1Kd2_RETR   giga_window20-flat    0.605090  0.565620        0.340461         0.303114  0.700335       0.502924
8           AE200_1Kd2  giga_window5-scaled    0.594110  0.632567        0.276709         0.255564  0.758668       0.503523
2               C1_150  giga_window5-scaled    0.586615  0.619190        0.311337         0.265907  0.744875       0.505585
3              C1_150r  giga_window5-scaled    0.586178  0.620189        0.311115         0.267009  0.745321       0.505962
23   LSA75_PPMID1_RETR  giga_window5-scaled    0.537824  0.590572        0.344034         0.315286  0.743571       0.506257
74       LSA100_PPMID1  imdb_window5-scaled    0.604813  0.616821        0.308645         0.266924  0.735871       0.506615
91     AE200_1Kd2_RETR   imdb_window20-flat    0.585607  0.563341        0.377524         0.323619  0.687013       0.507421
73       LSA200_PPMID1  imdb_window5-scaled    0.618076  0.620043        0.306045         0.274083  0.737493       0.511148
33     AE200_1Kd1_RETR   giga_window20-flat    0.623699  0.586210        0.338690         0.299378  0.713936       0.512382
11     AE100_1Kd2_RETR  giga_window5-scaled    0.574093  0.602111        0.325602         0.329555  0.751056       0.516483
62          AE200_1Kd2  imdb_window5-scaled    0.629625  0.614388        0.300465         0.296204  0.743991       0.516935
78       LSA200_PPMID2  imdb_window5-scaled    0.630171  0.618935        0.307187         0.298894  0.740287       0.519095
21    LSA200_PPMI_RETR  giga_window5-scaled    0.580715  0.605465        0.353050         0.339956  0.736162       0.523070
7      AE100_1Kd1_RETR  giga_window5-scaled    0.566219  0.603120        0.352415         0.338784  0.761193       0.524346
26  LSA200_PPMID2_RETR  giga_window5-scaled    0.572713  0.614095        0.348592         0.338870  0.748955       0.524645
77   LSA75_PPMID1_RETR  imdb_window5-scaled    0.581110  0.593024        0.381243         0.350897  0.719023       0.525059
22  LSA200_PPMID1_RETR  giga_window5-scaled    0.583648  0.612118        0.348962         0.336825  0.752110       0.526732
56              C1_150  imdb_window5-scaled    0.623301  0.638265        0.347895         0.306515  0.736579       0.530511
57             C1_150r  imdb_window5-scaled    0.624408  0.638717        0.347726         0.307648  0.736912       0.531082
6      AE200_1Kd1_RETR  giga_window5-scaled    0.588966  0.621233        0.357107         0.350816  0.756436       0.534911
76  LSA200_PPMID1_RETR  imdb_window5-scaled    0.610808  0.590201        0.398355         0.356933  0.721921       0.535644
75    LSA200_PPMI_RETR  imdb_window5-scaled    0.599354  0.587413        0.413494         0.364765  0.714033       0.535812
10     AE200_1Kd2_RETR  giga_window5-scaled    0.591959  0.633071        0.355606         0.355505  0.760712       0.539371
60     AE200_1Kd1_RETR  imdb_window5-scaled    0.615839  0.591141        0.398939         0.367284  0.727692       0.540179
61     AE100_1Kd1_RETR  imdb_window5-scaled    0.608664  0.600581        0.404892         0.365140  0.744997       0.544855
65     AE100_1Kd2_RETR  imdb_window5-scaled    0.623134  0.627776        0.376672         0.377537  0.747669       0.550558
80  LSA200_PPMID2_RETR  imdb_window5-scaled    0.627497  0.617972        0.409345         0.394991  0.746077       0.559176
64     AE200_1Kd2_RETR  imdb_window5-scaled    0.640188  0.613962        0.407867         0.399198  0.753825       0.563008     

    During the fine-tuning stage we decided to increase the number of autoencoder epochs to 10K and to introduce
    an early stopping mechanism into the training procedure (evaluate every 100 epochs, stop early after 5 failed
    evaluations). We re-ran the proposed architectures on each available MCO together with the DPPMI-based LSA in order
    to obtain the final "best" embedding matrix. Below are the results of this final model search phase:
    
                 MODEL                 DATA  wordsim353  mturk771  simverb3500dev  simverb3500test       men  Macro-average
41    LSA200_PPMI_RETR   imdb_window20-flat    0.446412  0.479198        0.315149         0.238742  0.597088       0.415318
42  LSA200_PPMID1_RETR   imdb_window20-flat    0.502515  0.497265        0.304745         0.237583  0.633848       0.435191
19    LSA200_PPMI_RETR   giga_window20-flat    0.559960  0.517598        0.285164         0.259575  0.659936       0.456446
21  LSA200_PPMID2_RETR   giga_window20-flat    0.566863  0.526416        0.296860         0.265071  0.674468       0.465936
43  LSA200_PPMID2_RETR   imdb_window20-flat    0.511102  0.532578        0.360753         0.299072  0.652557       0.471212
20  LSA200_PPMID1_RETR   giga_window20-flat    0.596064  0.539958        0.304601         0.265470  0.696623       0.480543
37     AE200_f3d1_RETR   imdb_window20-flat    0.616646  0.541377        0.326162         0.284005  0.687487       0.491135
38     AE100_f3d1_RETR   imdb_window20-flat    0.600432  0.559047        0.326005         0.285873  0.702826       0.494837
33     AE200_f2d1_RETR   imdb_window20-flat    0.614167  0.547061        0.347045         0.285870  0.684567       0.495742
34     AE100_f2d1_RETR   imdb_window20-flat    0.601524  0.560869        0.325217         0.287026  0.704686       0.495864
14     AE100_f2d2_RETR   giga_window20-flat    0.605782  0.580163        0.338330         0.299358  0.726288       0.509984
18     AE100_f3d2_RETR   giga_window20-flat    0.615087  0.585906        0.340997         0.297099  0.725216       0.512861
16     AE100_f3d1_RETR   giga_window20-flat    0.624479  0.580914        0.330287         0.293817  0.736412       0.513182
12     AE100_f2d1_RETR   giga_window20-flat    0.617250  0.588658        0.336470         0.298640  0.738844       0.515972
8     LSA200_PPMI_RETR  giga_window5-scaled    0.580715  0.605465        0.353050         0.339956  0.736162       0.523070
3      AE100_f2d2_RETR  giga_window5-scaled    0.582202  0.611080        0.332042         0.337763  0.755006       0.523619
7      AE100_f3d2_RETR  giga_window5-scaled    0.580033  0.613528        0.335735         0.338676  0.755206       0.524636
10  LSA200_PPMID2_RETR  giga_window5-scaled    0.572713  0.614095        0.348592         0.338870  0.748955       0.524645
9   LSA200_PPMID1_RETR  giga_window5-scaled    0.583648  0.612118        0.348962         0.336825  0.752110       0.526732
36     AE100_f2d2_RETR   imdb_window20-flat    0.617523  0.591198        0.377491         0.332253  0.719064       0.527506
39     AE200_f3d2_RETR   imdb_window20-flat    0.622171  0.586796        0.377890         0.345824  0.710781       0.528692
17     AE200_f3d2_RETR   giga_window20-flat    0.627954  0.598789        0.358710         0.325629  0.732573       0.528731
40     AE100_f3d2_RETR   imdb_window20-flat    0.622195  0.594116        0.381423         0.329926  0.718871       0.529306
13     AE200_f2d2_RETR   giga_window20-flat    0.635104  0.598258        0.359671         0.325113  0.728910       0.529411
35     AE200_f2d2_RETR   imdb_window20-flat    0.624684  0.587335        0.383164         0.340411  0.716238       0.530366
5      AE100_f3d1_RETR  giga_window5-scaled    0.583922  0.611640        0.359178         0.342194  0.763243       0.532036
11     AE200_f2d1_RETR   giga_window20-flat    0.635968  0.608072        0.360000         0.319900  0.740668       0.532922
1      AE100_f2d1_RETR  giga_window5-scaled    0.581132  0.624879        0.357372         0.343446  0.759918       0.533349
15     AE200_f3d1_RETR   giga_window20-flat    0.645774  0.610491        0.356034         0.318727  0.738964       0.533998
31  LSA200_PPMID1_RETR  imdb_window5-scaled    0.610808  0.590201        0.398355         0.356933  0.721921       0.535644
30    LSA200_PPMI_RETR  imdb_window5-scaled    0.599354  0.587413        0.413494         0.364765  0.714033       0.535812
4      AE200_f3d1_RETR  giga_window5-scaled    0.593181  0.631004        0.368145         0.357143  0.758945       0.541684
0      AE200_f2d1_RETR  giga_window5-scaled    0.603547  0.634335        0.357063         0.361884  0.761070       0.543580
6      AE200_f3d2_RETR  giga_window5-scaled    0.587981  0.641383        0.361646         0.362897  0.772710       0.545323
25     AE100_f2d2_RETR  imdb_window5-scaled    0.628521  0.629260        0.362378         0.375002  0.738579       0.546748
22     AE200_f2d1_RETR  imdb_window5-scaled    0.625893  0.596149        0.405862         0.377832  0.729325       0.547012
2      AE200_f2d2_RETR  giga_window5-scaled    0.601161  0.636068        0.365801         0.372329  0.767164       0.548505
27     AE100_f3d1_RETR  imdb_window5-scaled    0.617286  0.599758        0.409597         0.371462  0.749250       0.549471
23     AE100_f2d1_RETR  imdb_window5-scaled    0.621731  0.618707        0.404083         0.371269  0.751147       0.553387
29     AE100_f3d2_RETR  imdb_window5-scaled    0.642001  0.626371        0.376091         0.378915  0.746271       0.553930
32  LSA200_PPMID2_RETR  imdb_window5-scaled    0.627497  0.617972        0.409345         0.394991  0.746077       0.559176
26     AE200_f3d1_RETR  imdb_window5-scaled    0.638723  0.619183        0.417913         0.379170  0.744900       0.559978
28     AE200_f3d2_RETR  imdb_window5-scaled    0.631439  0.621045        0.409567         0.401222  0.749504       0.562556
24     AE200_f2d2_RETR  imdb_window5-scaled    0.648426  0.619787        0.410915         0.401384  0.747792       0.565661    
    
    
    Observation: due to the small size of the MCOs, GloVe, one of the most obvious candidates for best model, is
    easily outperformed by other methods/models when tested against the validation procedure/datasets.
    
    Finally, the proposed embedding matrix is generated directly by a retrofitted autoencoder whose encoding
    layer outputs an embedding of size 300, after training on the imdb MCO with window size 5, weighted with
    DPPMI and reduced to 600 features with LSA.
    To further enhance the retrofitting mechanism, the counter-fitting method proposed by Mrkšić et al. was used
    (based on the source code available from the GitHub repository https://github.com/nmrksic/counter-fitting).
    This counter-fitting mechanism was applied to only about 300 antonym pairs, and only after all the previously
    mentioned steps (DPPMI-LSA-AE-RETRO).
    """
    
##########################################################
##########################################################
##########################################################
##########################################################
##########################################################
##########################################################
        


def normalise_word_vectors(word_vectors, norm=1.0):
    """
    This method normalises the collection of word vectors provided in the word_vectors dictionary.
    
    
    adapted from:
      https://raw.githubusercontent.com/nmrksic/counter-fitting/master/counterfitting.py
    """
    for word in word_vectors:
        word_vectors[word] /= np.sqrt((word_vectors[word]**2).sum() + 1e-6)
        word_vectors[word] = word_vectors[word] * norm
    return word_vectors

def distance(v1, v2, normalised_vectors=True):
    """
    Returns the cosine distance between two vectors.
    If the vectors are normalised, there is no need for the denominator, which is always one.

    adapted from:
      https://raw.githubusercontent.com/nmrksic/counter-fitting/master/counterfitting.py
    """
    if normalised_vectors:
        return 1 - np.dot(v1, v2)
    else:
        return 1 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))


def vector_partial_gradient(u, v, normalised_vectors=True):
    r"""
    This function returns the gradient of cosine distance: \frac{ \partial dist(u,v)}{ \partial u}
    If they are both of norm 1 (we do full batch and we renormalise at every step), we can save some time.

    adapted from:
      https://raw.githubusercontent.com/nmrksic/counter-fitting/master/counterfitting.py
    """
    # The docstring is a raw string so that \frac and \partial are not read as escape sequences.
    if normalised_vectors:
        gradient = u * np.dot(u, v) - v
    else:
        norm_u = np.linalg.norm(u)
        norm_v = np.linalg.norm(v)
        numerator = u * np.dot(u, v) - v * np.power(norm_u, 2)
        denominator = norm_v * np.power(norm_u, 3)
        gradient = numerator / denominator

    return gradient


def one_step_SGD(word_vectors, antonym_pairs,
                 delta=1.0, lr=0.1, gamma=0):
    """
    This method performs a step of SGD to optimise the counterfitting cost function.

    adapted from:
      https://raw.githubusercontent.com/nmrksic/counter-fitting/master/counterfitting.py
    """
    from copy import deepcopy
    new_word_vectors = deepcopy(word_vectors)

    gradient_updates = {}
    update_count = {}

    # AR term:
    for i, (word_i, word_j) in enumerate(antonym_pairs):
        print("\r    Processing antonym pairs {:.1f}%...".format(i/len(antonym_pairs)*100), flush=True, end='')

        current_distance = distance(new_word_vectors[word_i], new_word_vectors[word_j])

        if current_distance < delta:
    
            gradient = vector_partial_gradient( new_word_vectors[word_i], new_word_vectors[word_j])
            gradient = gradient * lr 

            if word_i in gradient_updates:
                gradient_updates[word_i] += gradient
                update_count[word_i] += 1
            else:
                gradient_updates[word_i] = gradient
                update_count[word_i] = 1

   
    print("\r    Applying gradients...", flush=True, end='')
    for word in gradient_updates:
        # we've found that scaling the update term for each word helps with convergence speed. 
        update_term = gradient_updates[word] / (update_count[word]) 
        new_word_vectors[word] += update_term 
    print("\r    Done Applying gradients.", flush=True, end='')
        
    return normalise_word_vectors(new_word_vectors)
  
  
def counter_fit_antonyms(dct_word_vectors,  antonyms, epochs=20, lr=0.1):
  """
  This method repeatedly applies SGD steps to counter-fit word vectors to linguistic constraints. 
  
  https://raw.githubusercontent.com/nmrksic/counter-fitting/master/counterfitting.py
  """
  word_vectors = normalise_word_vectors(dct_word_vectors)
  
  current_iteration = 0
  
  
  max_iter = epochs
  print("Antonym pairs:", len(antonyms), flush=True)
  print("Running the optimisation procedure for", max_iter, "SGD steps...", flush=True)
  
  while current_iteration < max_iter:
    current_iteration += 1
    print("\r  Counter-fitting SGD step {}...".format(current_iteration), flush=True, end='')
    word_vectors = one_step_SGD(word_vectors, antonyms, lr=lr)
  print("")
  return word_vectors  

##################################################        
##################################################  
  

def eval_model(dct, dataset_name, model_name, df, distfunc=vsm.cosine):
  print("\nEvaluation of model '{}' {} based on '{}' MCO".format(
        model_name, df.shape, dataset_name),
        flush=True)
  if 'MODEL' not in dct:
    dct['MODEL'] = []
  if 'DATA' not in dct:
    dct['DATA'] = []

  if 'DST' not in dct:
    dct['DST'] = []
  
  dct['DATA'].append(dataset_name)
  dct['MODEL'].append(model_name)
  dct['DST'].append(distfunc.__name__)
  
  res = full_word_similarity_evaluation(df, distfunc=distfunc)
  print("Results for '{}' on data '{}' with distfunc={}:".format(
      model_name, dataset_name, distfunc.__name__), flush=True)
  
  for key in dict(res):
    if key not in dct:
      dct[key] = []
    dct[key].append(res[key])
    
  return dct, res['Macro-average']

def grid_search_vsm(files, dct_model_funcs):
  pd.set_option('display.max_rows', 500)
  pd.set_option('display.max_columns', 500)
  pd.set_option('display.width', 1000)
  
  best_model = None
  best_macro = 0
  best_model_name = ''
  dct_res = {}
  dist_funcs = [vsm.cosine] # [dice, vsm.cosine]
  for distfunc in dist_funcs:
    print("Using distfunc={}".format(distfunc.__name__))
    for fn in files:
      print("Loading '{}'".format(fn), flush=True)
      data = pd.read_csv(os.path.join(VSM_HOME, fn), index_col=0)  
      print("  Loaded {}".format(data.shape))
      data_name = fn[:-7]
      for model_name, model_func in dct_model_funcs.items():
        print("=" * 70)
        print("Running '{}' on '{}'".format(model_name, data_name), flush=True)
        df = model_func(data)
        if df is None:
          print("{} returned None".format(model_func.__name__))
          continue
        print("Done running {}. Obtained df: {}".format(
            model_func.__name__, df.shape), flush=True)
        dct_res, macro = eval_model(dct_res, 
                                    dataset_name=data_name, 
                                    model_name=model_name, 
                                    df=df,
                                    distfunc=distfunc
                                    )
        if best_macro < macro:
          old_best_file = best_model_name
          best_macro = macro
          best_model = df
          best_model_name = 'best_model_{:.4f}_'.format(macro).replace('.','') + model_name + '_' + fn
          print("Found new best macro-average: {:.4f}".format(best_macro), flush=True)
          if old_best_file != '':
            try:
              # The saved name is built from `fn`, so it already ends in '.csv.gz'.
              os.remove(old_best_file)
              print("Old best '{}' deleted.".format(old_best_file))
            except OSError:
              print("ERROR: Could not remove file '{}' !!!".format(old_best_file))
          best_model.to_csv(best_model_name, compression='gzip')
          
        df_res = pd.DataFrame(dct_res).sort_values('Macro-average')
        df_res.to_csv("20200303_test.csv")
        print("\nResults so far:\n{}".format(df_res), flush=True)
  
  return best_model


def glove_model(df, n_embeds=75, max_iters=10000, retrofit=False):
  from mittens import GloVe
  print("Computing GloVe model with {} embeds for {} iters".format(
      n_embeds, max_iters), flush=True)
  glove_model = GloVe(n=n_embeds, max_iter=max_iters)
  np_res = glove_model.fit(df.values)
  glove_model.sess.close()
  print("", flush=True)
  df_res = pd.DataFrame(np_res, index=df.index)
  if retrofit:
    df_out = retrofit_model(df_res)
  else:
    df_out = df_res
  return df_out



def calc_delta(mco):
  # Discount factor: (f / (f + 1)) * (m / (m + 1)), where f is the cell value
  # and m is the smaller of the cell's row total and column total.
  values = np.array(mco)
  col_totals = values.sum(axis=0)
  row_totals = values.sum(axis=1)
  # Broadcasting yields the (n_rows, n_cols) matrix of min(row total, column total).
  mins = np.minimum(row_totals.reshape((-1, 1)), col_totals.reshape((1, -1)))
  d1 = mco / (mco + 1)
  d2 = mins / (mins + 1)
  delta = d1 * d2
  return delta

def pmid(m, positive=True, delta_on_pmi=True, before=True):
  df = vsm.observed_over_expected(m)
  # Silence distracting warnings about log(0):
  with np.errstate(divide='ignore'):
    pmi = np.log(df)
  pmi[np.isinf(pmi)] = 0.0  # log(0) = 0
  if positive and before:
    pmi[pmi < 0] = 0.0
  delta = calc_delta(pmi if delta_on_pmi else m)
  pmi = pmi * delta
  if positive and not before:
    pmi[pmi < 0] = 0.0
  return pmi


def counterfit_model(data):
  print("Counter-fitting on {} embeds".format( data.shape), flush=True)


  from nltk.corpus import wordnet as wn
  print("  Preparing antonyms", flush=True)
  ant_set = set()
  for ss in wn.all_synsets():
    lema = ss.lemmas()[0]      
    w1 = lema.name()
    ants = [lem.name() for lem in lema.antonyms()]
    if len(ants)>0:
      for w2 in ants:
        if w1 in data.index and w2 in data.index:
          ant_set.add((w1, w2))
  
    
  dct_word_vectors = {k:v for k,v in zip(data.index, data.values)}
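  # Pick a probe word that is in the vocabulary (the row index) for a before/after comparison.
  # Note that `word in data` would test the column labels, not the rows.
  for word in ['expensive','east','smart','adult']:
    if word in data.index:
      break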
  neibs1 = vsm.neighbors(word, data)
  dct_new_embeds = counter_fit_antonyms(dct_word_vectors, ant_set, lr=0.1, epochs=30)
  df = pd.DataFrame.from_dict(dct_new_embeds, orient='index')
  neibs = vsm.neighbors(word, df)

  print("  Status before counter-fitting for word: {}".format(word), flush=True)
  for w in dict(neibs1.iloc[:5]):
    print("  {:<20} {:.3f}".format(w+':', neibs1[w]))
  print("  Status AFTER counter-fitting for word: {}".format(word), flush=True)
  for w in dict(neibs.iloc[:5]):
    print("  {:<20} {:.3f}".format(w+':', neibs[w]))
  return df
  

def retrofit_model(data, name=''):
  from retrofitting import Retrofitter
  print("Retrofitting model '{}' with {} embeds".format(name, data.shape[1]), flush=True)
  from nltk.corpus import wordnet as wn
  print("  Constructing edges for words similarity", flush=True)
  edges = defaultdict(set)
  for ss in wn.all_synsets():
    lem_names = {lem.name() for lem in ss.lemmas()}
    for lem in lem_names:
      edges[lem] |= lem_names            
  print("  Preparing indices...",flush=True)
  lookup = dict(zip(data.index, range(data.shape[0])))
  index_edges = defaultdict(set)
  for start, finish_nodes in edges.items():
    s = lookup.get(start)
    # Compare against None: row position 0 is valid but falsy.
    if s is not None:
      f = {lookup[n] for n in finish_nodes if n in lookup}
      if f:
        index_edges[s] = f
  
  wn_retro = Retrofitter(verbose=True,
                         max_iter=1000,
                         tol=1e-4,
                         )
  print("  Running retrofitter ...", flush=True)
  retro_result = wn_retro.fit(data, index_edges)
  print("")
  return retro_result



def lsa_model(data, k=100, use_ttest=False, disc_pmi=True, retrofit=False, delta=True):
  if use_ttest:
    print("Computing ttest reweighting for LSA {}".format(k), flush=True)
    lsa_input = ttest(data)
  else:
    if disc_pmi:
      print("Computing discounted positive PMI reweighting for LSA {} (delta:{})".format(
          k, delta), flush=True)
      lsa_input = pmid(data, delta_on_pmi=delta)
    else:
      print("Computing positive PMI reweighting for LSA {}".format(k), flush=True)
      lsa_input = vsm.pmi(data)    
  print("Computing LSA k={}...".format(k))
  lsa_output = vsm.lsa(lsa_input, k=k)
  if retrofit:
    df_out = retrofit_model(lsa_output)
  else:
    df_out = lsa_output
  return df_out

def ae_model(data, n_embeds, 
             distfunc=vsm.cosine,
             epochs=100000, 
             retrofit=True, 
             delta=True, 
             max_patience=5, 
             lr=1e-2,
             lsa_factor=2,
             end_counterfit=False,
             ):
  
  from torch_autoencoder import TorchAutoencoder
  
  print("Generating autoencoder based model with {} embeds...".format(n_embeds), flush=True)
  lsa_output = lsa_model(data, k=int(n_embeds * lsa_factor), disc_pmi=True, use_ttest=False, delta=delta)

  n_step_epochs = 100
  steps = epochs // n_step_epochs
    
  # Use a distinct local name so the enclosing function `ae_model` is not shadowed.
  autoencoder = TorchAutoencoder(max_iter=n_step_epochs,
                                 hidden_dim=n_embeds,
                                 eta=lr,
                                 warm_start=True)
  # Start below any attainable score so the first step always sets best_model.
  best_macro = float("-inf")
  best_model = None
  patience = 0
  print("Performing autoencoder training for {} steps of {} epochs with early stopping on {}...".format(
      steps, n_step_epochs, lsa_output.shape), flush=True)
  for step in range(1, steps+1):
    print("Fitting step {}/{} for {} step-epochs".format(step, steps, n_step_epochs), flush=True)
    ae_output = autoencoder.fit(lsa_output)
    print("\nCalculating step {} results...".format(step))
    print("  Before retrofit ...")
    res = full_word_similarity_evaluation(ae_output, distfunc=distfunc)
    mb = res['Macro-average']
    if retrofit:
      df_out = retrofit_model(ae_output)
    else:
      df_out = ae_output
    res = full_word_similarity_evaluation(df_out, distfunc=distfunc)
    macro = res['Macro-average']
    print("  Before retro: {:.4f}".format(mb))
    print("  After retro:  {:.4f}".format(macro))
    print("  {}".format("Good!" if macro>mb else "WORSE!!!"), flush=True)
    if macro > best_macro:
      patience = 0
      best_macro = macro
      best_model = df_out
      print("Found best model at epoch {}".format(step * n_step_epochs), flush=True)
    else:
      patience += 1
      print("Macro {:.4f} < {:.4f} best. Patience {}/{}".format(
          macro, best_macro, patience, max_patience))
      
    if patience >= max_patience:
      print("Early stopping training loop at step {}".format(step))
      break      
  if end_counterfit:
    best_model = counterfit_model(best_model)
  return best_model
 
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass

## Bake-off [1 point]

For the bake-off, we will release two additional datasets. The announcement will go out on the discussion forum. We will also release reader code for these datasets that you can paste into this notebook. You will evaluate your custom model $M$ (from the previous question) on these new datasets using `full_word_similarity_evaluation`. Rules:

1. Only one evaluation is permitted.
1. No additional system tuning is permitted once the bake-off has started.

The cells below this one constitute your bake-off entry.

People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

The announcement will include the details on where to submit your entry.
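
To make the scoring concrete, here is a minimal, dependency-free sketch of a macro-averaged Spearman evaluation. It is not the course's `full_word_similarity_evaluation`. The helper names (`cosine_distance`, `average_ranks`, `spearman`, `macro_average_spearman`) and the toy data are hypothetical, and the sketch assumes that human similarity scores are negated so they can be compared with distances, and that pairs with out-of-vocabulary words are skipped.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def macro_average_spearman(embeddings, datasets, distfunc=cosine_distance):
    """Score each dataset separately, then average the per-dataset rhos."""
    results = {}
    for name, pairs in datasets.items():
        # Assumption of this sketch: skip pairs with out-of-vocabulary words.
        covered = [(w1, w2, s) for w1, w2, s in pairs
                   if w1 in embeddings and w2 in embeddings]
        # Negate human similarity scores so they behave like distances.
        gold = [-s for _, _, s in covered]
        pred = [distfunc(embeddings[w1], embeddings[w2]) for w1, w2, _ in covered]
        results[name] = spearman(gold, pred)
    results["Macro-average"] = sum(results.values()) / len(results)
    return results


# Toy embeddings and two toy "datasets" of (word1, word2, human score) triples.
toy_embeddings = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.9, 0.2, 0.0],
    "car": [0.0, 1.0, 0.2],
    "bus": [0.1, 0.9, 0.3],
}
toy_datasets = {
    "toy_a": [("cat", "dog", 9.0), ("cat", "car", 1.0), ("car", "bus", 8.0)],
    "toy_b": [("dog", "bus", 2.0), ("cat", "bus", 1.5), ("dog", "cat", 9.5)],
}
scores = macro_average_spearman(toy_embeddings, toy_datasets)
print(scores)
```

Because the macro-average weights every dataset equally regardless of its size, a small dataset influences the final score as much as a large one.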

In [33]:
# Enter your bake-off assessment code into this cell. 
# Please do not remove this comment.

if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    # Please enter your code in the scope of the above conditional.
    ##### YOUR CODE HERE
    
    # The block below (disabled by wrapping it in a string literal) records the
    # hyperparameters that generated the submitted embedding matrix.
    # Model preparation
    # load data
    """
    mco_data = pd.read_csv(os.path.join(VSM_HOME, "giga_window20-flat.csv.gz"), index_col=0)
    # train model & get retrofitted embeds
    ae_embeds = ae_model(data=mco_data, 
                         n_embeds=300, 
                         epochs=100000, 
                         retrofit=True, 
                         delta=False,
                         max_patience=5,
                         end_counterfit=True,
                         lsa_factor=2)
    # calculate score
    res = full_word_similarity_evaluation(ae_embeds)
    print("Result:\n{}".format(res))
    m_score = res['Macro-average']
    fn = 'best_model_{:.4f}'.format(m_score).replace('.','')+'.csv.gz'
    ae_embeds.to_csv(fn,compression='gzip')
    print("Final embeddings {} result: {:.4f}".format(ae_embeds.shape, m_score))
    """
    # The training run above produced "best_model_05842.csv.gz", which we load here.
    distfunc = vsm.cosine
    # This file contains the embeddings from the homework submission.
    best_embeds = pd.read_csv("best_model_05842.csv.gz", index_col=0)
    res = full_word_similarity_evaluation(best_embeds, distfunc=distfunc)
    # The macro-average should match the 0.5842 obtained with cosine distance.
    print("\nResults on the DEV sets using distfunc '{}'':\n{}".format(distfunc.__name__, res))
    
    def mturk287_reader():
        """MTurk-287: http://tx.technion.ac.il/~kirar/Datasets.html"""
        src_filename = os.path.join(
            WORDSIM_HOME, 'bakeoff-wordsim-test-data', 'MTurk-287.csv')
        return wordsim_dataset_reader(
            src_filename, header=False)

    def simlex999_reader(wordsim_test_home=WORDSIM_HOME):
        """SimLex999: https://www.cl.cam.ac.uk/~fh295/SimLex-999.zip"""
        src_filename = os.path.join(
            wordsim_test_home, 'bakeoff-wordsim-test-data', 'SimLex-999', 'SimLex-999.txt')
        return wordsim_dataset_reader(
            src_filename, delimiter="\t", header=True, score_col_index=3)

    BAKEOFF = (simlex999_reader, mturk287_reader)
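    # Note: the test sets are scored with Jaccard distance, whereas the dev
    # sets above were scored with cosine distance.
    distfunc = vsm.jaccard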
    res = full_word_similarity_evaluation(best_embeds, readers=BAKEOFF, distfunc=distfunc)
    print("\n\nResults on the TEST sets using distfunc '{}':\n{}".format(distfunc.__name__, res))
    print("\nFinal bake-off score: {}".format(res['Macro-average']))





Results on the DEV sets using distfunc 'cosine'':
wordsim353         0.630593
mturk771           0.641565
simverb3500dev     0.470225
simverb3500test    0.429700
men                0.748682
Macro-average      0.584153
Name: Spearman r, dtype: float64


Results on the TEST sets using distfunc 'jaccard':
simlex999        0.046431
mturk287         0.062449
Macro-average    0.054440
Name: Spearman r, dtype: float64

Final bake-off score: 0.05444045540308437


In [31]:
# On an otherwise blank line in this cell, please enter
# your "Macro-average" value as reported by the code above. 
# Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    # Please enter your score in the scope of the above conditional.
    ##### YOUR CODE HERE

    0.054440