# Homework 4: Word-level entailment with neural networks

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2019"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Data](#Data)
  1. [Edge disjoint](#Edge-disjoint)
  1. [Word disjoint](#Word-disjoint)
1. [Baseline](#Baseline)
  1. [Representing words: vector_func](#Representing-words:-vector_func)
  1. [Combining words into inputs: vector_combo_func](#Combining-words-into-inputs:-vector_combo_func)
  1. [Classifier model](#Classifier-model)
  1. [Baseline results](#Baseline-results)
1. [Homework questions](#Homework-questions)
  1. [Hypothesis-only baseline [2 points]](#Hypothesis-only-baseline-[2-points])
  1. [Alternatives to concatenation [1 point]](#Alternatives-to-concatenation-[1-point])
  1. [A deeper network [2 points]](#A-deeper-network-[2-points])
  1. [Your original system [4 points]](#Your-original-system-[4-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

The general problem is word-level natural language inference.

Training examples are pairs of words $(w_{L}, w_{R}), y$ with $y = 1$ if $w_{L}$ entails $w_{R}$, otherwise $0$.

The homework questions below ask you to define baseline models for this and develop your own system for entry in the bake-off, which will take place on a held-out test-set distributed at the start of the bake-off. (Thus, all the data you have available for development is available for training your final system before the bake-off begins.)

<img src="fig/wordentail-diagram.png" width=600 alt="wordentail-diagram.png" />

## Set-up

See [the first notebook in this unit](nli_01_task_and_data.ipynb) for set-up instructions.

In [1]:
from collections import defaultdict
import json
import numpy as np
import os
import pandas as pd
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
import nli
import utils

In [2]:
DATA_HOME = 'data'

NLIDATA_HOME = os.path.join(DATA_HOME, 'nlidata')

wordentail_filename = os.path.join(
    NLIDATA_HOME, 'nli_wordentail_bakeoff_data.json')

GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')

## Data

I've processed the data into two different train/test splits, in an effort to put some pressure on our models to actually learn these semantic relations, as opposed to exploiting regularities in the sample.

* `edge_disjoint`: The `train` and `dev` __edge__ sets are disjoint, but many __words__ appear in both `train` and `dev`.
* `word_disjoint`: The `train` and `dev` __vocabularies are disjoint__, and thus the edges are disjoint as well.

These are very different problems. For `word_disjoint`, there is real pressure on the model to learn abstract relationships, as opposed to memorizing properties of individual words.
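To make the two notions of disjointness concrete, here is a minimal sketch on invented toy data. It assumes each example has the form `[[w_left, w_right], label]`; the helper names `vocab_of` and `edges_of` are hypothetical and are not part of the `nli` module.

```python
def vocab_of(examples):
    """Collect every word that appears in a list of [[w_left, w_right], label] examples."""
    return {w for (pair, _label) in examples for w in pair}

def edges_of(examples):
    """Collect the (w_left, w_right) edges, ignoring the labels."""
    return {tuple(pair) for (pair, _label) in examples}

# Toy splits, invented for illustration only.
toy_train = [[["herring", "animal"], 1], [["sweater", "stroke"], 0]]
toy_dev = [[["herring", "fish"], 1], [["disease", "inflammation"], 0]]

shared_edges = edges_of(toy_train) & edges_of(toy_dev)
shared_words = vocab_of(toy_train) & vocab_of(toy_dev)

print(len(shared_edges))  # no word pair occurs in both splits
print(shared_words)       # but the two vocabularies still overlap
```

These toy splits are edge-disjoint but not word-disjoint: a model could still exploit what it learned about the shared word during training.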

In [3]:
with open(wordentail_filename) as f:
    wordentail_data = json.load(f)

The outer keys are the splits plus a list giving the vocabulary for the entire dataset:

In [4]:
wordentail_data.keys()

dict_keys(['edge_disjoint', 'vocab', 'word_disjoint'])

### Edge disjoint

In [5]:
wordentail_data['edge_disjoint'].keys()

dict_keys(['dev', 'train'])

This is what the split looks like; every split uses this same format:

In [6]:
wordentail_data['edge_disjoint']['dev'][: 5]

[[['sweater', 'stroke'], 0],
 [['constipation', 'hypovolemia'], 0],
 [['disease', 'inflammation'], 0],
 [['herring', 'animal'], 1],
 [['cauliflower', 'outlook'], 0]]

Let's test to make sure no edges are shared between `train` and `dev`:

In [7]:
nli.get_edge_overlap_size(wordentail_data, 'edge_disjoint')

0

As we expect, a *lot* of vocabulary items are shared between `train` and `dev`:

In [8]:
nli.get_vocab_overlap_size(wordentail_data, 'edge_disjoint')

2916

This is a large percentage of the entire vocab:

In [9]:
len(wordentail_data['vocab'])

8470

Here's the distribution of labels in the `train` set. It's highly imbalanced, which will pose a challenge for learning. (I'll go ahead and reveal that the `dev` set is similarly distributed.)

In [10]:
def label_distribution(split):
    return pd.DataFrame(wordentail_data[split]['train'])[1].value_counts()

In [11]:
label_distribution('edge_disjoint')

0    14650
1     2745
Name: 1, dtype: int64
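To see why this imbalance matters for evaluation, the following sketch scores a trivial classifier that always predicts `0`, using the label counts printed above. The helper `f1` is hypothetical and written out by hand so the arithmetic is visible.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Label counts for the edge_disjoint train set, copied from the output above.
n_neg, n_pos = 14650, 2745
total = n_neg + n_pos

# A classifier that always predicts 0:
accuracy = n_neg / total
f1_neg = f1(tp=n_neg, fp=n_pos, fn=0)  # class 0 treated as the target class
f1_pos = f1(tp=0, fp=0, fn=n_pos)      # class 1 is never predicted
macro_f1 = (f1_neg + f1_pos) / 2

print(f"accuracy={accuracy:.3f}, macro-F1={macro_f1:.3f}")
```

Accuracy looks respectable for this do-nothing classifier, while macro-F1 is penalized heavily for ignoring the positive class, which is one reason to track macro-F1 on this task.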

### Word disjoint

In [12]:
wordentail_data['word_disjoint'].keys()

dict_keys(['dev', 'train'])

In the `word_disjoint` split, no __words__ are shared between `train` and `dev`:

In [13]:
nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint')

0

Because no words are shared between `train` and `dev`, no edges are either:

In [14]:
nli.get_edge_overlap_size(wordentail_data, 'word_disjoint')

0

The label distribution is similar to that of `edge_disjoint`, though the `train` set is only about half as large (8,548 examples versus 17,395):

In [15]:
label_distribution('word_disjoint')

0    7199
1    1349
Name: 1, dtype: int64

## Baseline

Even in deep learning, __feature representation is vital and requires care!__ For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input.
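The two parts can be sketched as a small featurization loop. This is only an illustration under assumptions: `build_dataset`, `toy_vec`, and `toy_concat` are hypothetical names, and the code is not necessarily how `nli.wordentail_experiment` builds its inputs internally.

```python
import numpy as np

def build_dataset(examples, vector_func, vector_combo_func):
    """Turn [[w_left, w_right], label] examples into a feature matrix and a label list."""
    X, y = [], []
    for (w_left, w_right), label in examples:
        u = vector_func(w_left)            # part 1: represent each word
        v = vector_func(w_right)
        X.append(vector_combo_func(u, v))  # part 2: combine into one network input
        y.append(label)
    return np.array(X), y

# Hypothetical stand-ins for a word-vector lookup, for illustration only.
rng = np.random.default_rng(0)
toy_words = ["herring", "animal", "sweater", "stroke"]
toy_lookup = {w: rng.uniform(-1, 1, size=4) for w in toy_words}

def toy_vec(w):
    return toy_lookup[w]

def toy_concat(u, v):
    return np.concatenate((u, v))

toy_examples = [[["herring", "animal"], 1], [["sweater", "stroke"], 0]]
X, y = build_dataset(toy_examples, toy_vec, toy_concat)
print(X.shape, y)
```

Each row of `X` is one example, and its width is set entirely by the combination function (here, twice the word-vector dimension).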

### Representing words: vector_func

Let's consider two baseline word representation methods:

1. Random vectors (as returned by `utils.randvec`).
1. 50-dimensional GloVe representations.

In [16]:
def randvec(w, n=50, lower=-1.0, upper=1.0):
    """Returns a random vector of length `n`. `w` is ignored."""
    return utils.randvec(n=n, lower=lower, upper=upper)

In [17]:
# Any of the files in glove.6B will work here:

glove_dim = 50

glove_src = os.path.join(GLOVE_HOME, 'glove.6B.{}d.txt'.format(glove_dim))

# Creates a dict mapping strings (words) to GloVe vectors:
GLOVE = utils.glove2dict(glove_src)

def glove_vec(w):    
    """Return `w`'s GloVe representation if available, else return 
    a random vector."""
    return GLOVE.get(w, randvec(w, n=glove_dim))
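One detail of `glove_vec` is worth noting: the fallback `randvec(...)` is evaluated on every call, so an out-of-vocabulary word receives a fresh random vector each time it is looked up. If you would rather give each unknown word one fixed vector, a cache is one option. The sketch below assumes a plain dict lookup; `make_cached_vec` and `toy_glove` are hypothetical names.

```python
import numpy as np

def make_cached_vec(lookup, dim, seed=0):
    """Return a vector_func that gives each unknown word one fixed random vector."""
    rng = np.random.default_rng(seed)
    oov_cache = {}

    def vec(w):
        if w in lookup:
            return lookup[w]
        if w not in oov_cache:
            # Draw the random vector once, then reuse it on later calls.
            oov_cache[w] = rng.uniform(-1.0, 1.0, size=dim)
        return oov_cache[w]

    return vec

toy_glove = {"animal": np.zeros(3)}  # stand-in for the real GloVe dict
cached_vec = make_cached_vec(toy_glove, dim=3)

first = cached_vec("herring")
second = cached_vec("herring")
print(np.array_equal(first, second))
```

Whether this changes results depends on how many words in the data are missing from the embedding file, which this sketch does not measure.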

### Combining words into inputs: vector_combo_func

Here we decide how to combine the two word vectors into a single representation. In more detail, where `u` is a vector representation of the left word and `v` is a vector representation of the right word, we need a function `vector_combo_func` such that `vector_combo_func(u, v)` returns a new input vector `z` of dimension `m`. A simple example is concatenation:

In [18]:
def vec_concatenate(u, v):
    """Concatenate np.array instances `u` and `v` into a new np.array"""
    return np.concatenate((u, v))

`vector_combo_func` could instead be vector average, vector difference, etc. (even combinations of those) – there's lots of space for experimentation here; [homework question 2](#Alternatives-to-concatenation-[1-point]) below pushes you to do some exploration.
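As a rough illustration of that space, here are three candidate combination functions. The names are hypothetical, and all three assume `u` and `v` have the same length.

```python
import numpy as np

def vec_average(u, v):
    """Element-wise mean of the two word vectors."""
    return (u + v) / 2.0

def vec_diff(u, v):
    """Element-wise difference; unlike the average, this is order-sensitive."""
    return u - v

def vec_concat_diff(u, v):
    """Concatenation plus the difference, giving an input of dimension 3 * len(u)."""
    return np.concatenate((u, v, u - v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, 2.0, -1.0])
for func in (vec_average, vec_diff, vec_concat_diff):
    print(func.__name__, func(u, v).shape)
```

Note that the average is symmetric in `u` and `v`, so on its own it cannot distinguish the pair $(w_{L}, w_{R})$ from $(w_{R}, w_{L})$, whereas entailment is a directional relation.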

### Classifier model

For a baseline model, I chose `TorchShallowNeuralClassifier`:

In [19]:
net = TorchShallowNeuralClassifier(hidden_dim=50, max_iter=100)

### Baseline results

The following puts the above pieces together, using `vector_func=glove_vec`, since `vector_func=randvec` seems so hopelessly misguided for `word_disjoint`!

In [20]:
word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=vec_concatenate)

Finished epoch 100 of 100; error is 0.025328553980216384

              precision    recall  f1-score   support

           0       0.93      0.93      0.93      1910
           1       0.43      0.41      0.42       239

   micro avg       0.87      0.87      0.87      2149
   macro avg       0.68      0.67      0.68      2149
weighted avg       0.87      0.87      0.87      2149



## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Hypothesis-only baseline [2 points]

During our discussion of SNLI and MultiNLI, we noted that a number of research teams have shown that hypothesis-only baselines for inference tasks can be remarkably robust. This question asks you to explore briefly how this baseline affects the 'edge_disjoint' and 'word_disjoint' versions of our task.

For this problem, submit code for the following:

1. A `vector_combo_func` function called `hypothesis_only` that simply throws away the premise, using the unmodified hypothesis (second) vector as its representation of the example.

1. Code for looping over the two conditions 'word_disjoint' and 'edge_disjoint' and the two `vector_combo_func` values `vec_concatenate` and `hypothesis_only`, calling `nli.wordentail_experiment` to train on the condition's 'train' portion and assess on its 'dev' portion, with `glove_vec` as the `vector_func`. So that the results are consistent, use an `sklearn.linear_model.LogisticRegression` with default parameters as the model.

1. Print out the percentage-wise increase in macro-F1 that `vec_concatenate` delivers over `hypothesis_only` for each of the two conditions. For example, if `hypothesis_only` returns 0.5 for condition `C` and `vec_concatenate` delivers 0.75 for `C`, then you'd report a 50% increase for `C`. The values you need are stored in the dictionary returned by `nli.wordentail_experiment`, with key 'macro-F1'. Please use two digits of precision for the increases.
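The arithmetic in the worked example can be checked with a short helper. The name `percent_increase` is hypothetical; it simply encodes the relative change, expressed as a percentage.

```python
def percent_increase(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a percentage."""
    return 100.0 * (improved - baseline) / baseline

# The example from the question: hypothesis_only = 0.5, vec_concatenate = 0.75.
print(f"{percent_increase(0.5, 0.75):.2f}%")  # prints 50.00%
```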

In [32]:
import sklearn.linear_model  # a bare `import sklearn` does not reliably expose this submodule

#1
def hypothesis_only(u, v):
    return v

#2
res = {}
for k in ['edge_disjoint', 'word_disjoint']:
    print(k.upper())
    for vec_func in [vec_concatenate, hypothesis_only]:
        print(f"Experiment for {k} challenge and function {vec_func.__name__}:")
        
        res_key = f"{k}_{vec_func.__name__}"
        
        res[res_key] = nli.wordentail_experiment(
            train_data=wordentail_data[k]['train'],
            assess_data=wordentail_data[k]['dev'],
            model=sklearn.linear_model.LogisticRegression(solver='liblinear'),
            vector_func=glove_vec,
            vector_combo_func=vec_func
        )
        print("-"*80)
    #3
    vc_key = f"{k}_vec_concatenate"
    ho_key = f"{k}_hypothesis_only"
    vc_res = res[vc_key]['macro-F1']
    ho_res = res[ho_key]['macro-F1']
    # Relative gain of vec_concatenate over hypothesis_only, as a percentage.
    pct_inc = 100.0 * (vc_res - ho_res) / ho_res
    print(f"Percent increase for {k}: {pct_inc:.2f}%")
    print("="*80)
    print()

EDGE_DISJOINT
Experiment for edge_disjoint challenge and function vec_concatenate:
              precision    recall  f1-score   support

           0       0.88      0.97      0.92      7376
           1       0.59      0.23      0.33      1321

   micro avg       0.86      0.86      0.86      8697
   macro avg       0.73      0.60      0.63      8697
weighted avg       0.83      0.86      0.83      8697

--------------------------------------------------------------------------------
Experiment for edge_disjoint challenge and function hypothesis_only:
              precision    recall  f1-score   support

           0       0.87      0.98      0.92      7376
           1       0.59      0.20      0.29      1321

   micro avg       0.86      0.86      0.86      8697
   macro avg       0.73      0.59      0.61      8697
weighted avg       0.83      0.86      0.83      8697

--------------------------------------------------------------------------------
Percent increase for edge_disjoi

### Alternatives to concatenation [1 point]

We've so far just used vector concatenation to represent the premise and hypothesis words. This question asks you to explore a simple alternative. 

For this problem, submit code for the following:

1. A new potential value for `vector_combo_func` that does something different from concatenation. Options include, but are not limited to, element-wise addition, difference, and multiplication. These can be combined with concatenation if you like.
1. Include a use of `nli.wordentail_experiment` in the same configuration as the one in [Baseline results](#Baseline-results) above, but with your new value of `vector_combo_func`.

In [38]:
def elementwise_sum(u, v):
    # Use a length check in case len(u) != len(v)
    if len(u) < len(v):
        q = v.copy()
        q[:len(u)] += u
    else:
        q = u.copy()
        q[:len(v)] += v
    return q

print("ELEMENTWISE_SUM")
elementwise_sum_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=elementwise_sum
)

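def cross_dot_product(u, v):
    """Return outer(u, v) @ v, which equals u scaled by the squared norm of v.
    The output has the same dimension as u."""
    u_np = np.array(u)
    v_np = np.array(v)
    return np.outer(u_np, v_np).dot(v_np)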

print("CROSS_DOT_PRODUCT")
dot_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=cross_dot_product
)


def concat_elementwise_sum(u, v):
    s = elementwise_sum(u, v)
    return np.concatenate([u, s, v])

print("CONCATENATED_SUM")
concat_sum_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=concat_elementwise_sum
)

## Basically the same as the sum...
# def elementwise_diff(u, v):
#     """
#     Returns u - v. If len(v) > len(u), first zero-fill u
#     """
#     # Use a length check in case len(u) != len(v)
#     if len(u) < len(v):
#         q = v.copy()
#         q = -1 * q
#         q[:len(u)] += u
#     else:
#         q = u.copy()
#         q[:len(v)] -= v
#     return q
    
# def concat_elementwise_diff(u, v):
#     s = elementwise_diff(u, v)
#     return np.concatenate([u, s, v])

# print("CONCATENATED_DIFF")
# concat_sum_experiment = nli.wordentail_experiment(
#     train_data=wordentail_data['word_disjoint']['train'],
#     assess_data=wordentail_data['word_disjoint']['dev'], 
#     model=net, 
#     vector_func=glove_vec,
#     vector_combo_func=concat_elementwise_diff
# )


ELEMENTWISE_SUM


Finished epoch 100 of 100; error is 0.8535327166318893

              precision    recall  f1-score   support

           0       0.91      0.92      0.91      1910
           1       0.31      0.29      0.30       239

   micro avg       0.85      0.85      0.85      2149
   macro avg       0.61      0.61      0.61      2149
weighted avg       0.84      0.85      0.85      2149

CROSS_DOT_PRODUCT


Finished epoch 100 of 100; error is 2.7634670436382294

              precision    recall  f1-score   support

           0       0.90      0.78      0.83      1910
           1       0.14      0.29      0.19       239

   micro avg       0.72      0.72      0.72      2149
   macro avg       0.52      0.53      0.51      2149
weighted avg       0.81      0.72      0.76      2149

CONCATENATED_SUM


Finished epoch 100 of 100; error is 0.024772734846919775

              precision    recall  f1-score   support

           0       0.92      0.94      0.93      1910
           1       0.43      0.35      0.39       239

   micro avg       0.88      0.88      0.88      2149
   macro avg       0.68      0.65      0.66      2149
weighted avg       0.87      0.88      0.87      2149

CONCATENATED_DIFF


Finished epoch 100 of 100; error is 0.012119481223635375

              precision    recall  f1-score   support

           0       0.92      0.93      0.93      1910
           1       0.42      0.38      0.40       239

   micro avg       0.87      0.87      0.87      2149
   macro avg       0.67      0.66      0.66      2149
weighted avg       0.87      0.87      0.87      2149



### A deeper network [2 points]

It is very easy to subclass `TorchShallowNeuralClassifier` if all you want to do is change the network graph: all you have to do is write a new `define_graph`. If your graph has new arguments that the user might want to set, then you should also redefine `__init__` so that these values are accepted and set as attributes.

For this question, please subclass `TorchShallowNeuralClassifier` so that it defines the following graph:

$$\begin{align}
h_{1} &= xW_{1} + b_{1} \\
r_{1} &= \textbf{Bernoulli}(1 - \textbf{dropout_prob}, n) \\
d_{1} &= r_1 * h_{1} \\
h_{2} &= f(d_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}$$

Here, $r_{1}$ and $d_{1}$ define a dropout layer: $r_{1}$ is a random binary vector of dimension $n$, where the probability of a value being $1$ is given by $1 - \textbf{dropout_prob}$. $r_{1}$ is multiplied element-wise by our first hidden representation, thereby zeroing out some of the values. The result is fed to the user's activation function $f$, and the result of that is fed through another linear layer to produce $h_{3}$. (Inside `TorchShallowNeuralClassifier`, $h_{3}$ is the basis for a softmax classifier, so no activation function is applied to it.)
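The equations above can be traced step by step in NumPy. This is a minimal sketch of a single forward pass, not the PyTorch solution: the function name `dropout_forward`, the toy dimensions, and the choice of `np.tanh` for $f$ are all assumptions made for illustration.

```python
import numpy as np

def dropout_forward(x, W1, b1, W2, b2, dropout_prob, rng, f=np.tanh):
    """One forward pass through the graph above; `f=np.tanh` is an assumed activation."""
    h1 = x @ W1 + b1
    # r1: binary mask whose entries are 1 with probability 1 - dropout_prob.
    r1 = rng.binomial(1, 1.0 - dropout_prob, size=h1.shape)
    d1 = r1 * h1           # zero out the dropped hidden units
    h2 = f(d1)
    h3 = h2 @ W2 + b2      # scores passed to the softmax classifier
    return h3, r1

rng = np.random.default_rng(0)
input_dim, hidden_dim, n_classes = 4, 5, 2  # toy sizes
x = rng.normal(size=(1, input_dim))
W1 = rng.normal(size=(input_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(hidden_dim, n_classes))
b2 = np.zeros(n_classes)

h3, r1 = dropout_forward(x, W1, b1, W2, b2, dropout_prob=0.5, rng=rng)
h3_no_drop, _ = dropout_forward(x, W1, b1, W2, b2, dropout_prob=0.0, rng=rng)
print(h3.shape, r1)
```

With `dropout_prob=0.0` the mask is all ones and the graph reduces to the shallow network. If you implement the layer with PyTorch's `nn.Dropout`, keep in mind that it also rescales the surviving values by $1 / (1 - p)$ during training and is a no-op in evaluation mode, so it is not a literal transcription of the equations.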

For comparison, using this notation, `TorchShallowNeuralClassifier` defines the following graph:

$$\begin{align}
h_{1} &= xW_{1} + b_{1} \\
h_{2} &= f(h_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}$$

The following code starts this sub-class for you, so that you can concentrate on `define_graph`. Be sure to make use of `self.dropout_prob`.

For this problem, submit just your completed `TorchDeepNeuralClassifier`. You needn't evaluate it, though we assume you will be keen to do that!

In [47]:
import torch.nn as nn

class TorchDeepNeuralClassifier(TorchShallowNeuralClassifier):
    def __init__(self, dropout_prob=0.7, **kwargs):
        self.dropout_prob = dropout_prob
        super().__init__(**kwargs)
    
    def define_graph(self):
        """Complete this method!
        
        Returns
        -------
        an `nn.Module` instance, which can be a free-standing class you 
        write yourself, as in `torch_rnn_classifier`, or the output of 
        `nn.Sequential`, as in `torch_shallow_neural_classifier`.
        
        """
        return nn.Sequential(nn.Linear(self.input_dim, self.hidden_dim),
                             nn.Dropout(p=self.dropout_prob),
                             self.hidden_activation,
                             nn.Linear(self.hidden_dim, self.n_classes_)
                            )

advanced_net = TorchDeepNeuralClassifier(hidden_dim=100, max_iter=250, dropout_prob=0.1)

baseline_plus_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=advanced_net, 
    vector_func=glove_vec,
    vector_combo_func=concat_elementwise_sum)


Finished epoch 250 of 250; error is 0.10502827400341636

              precision    recall  f1-score   support

           0       0.92      0.95      0.93      1910
           1       0.44      0.32      0.37       239

   micro avg       0.88      0.88      0.88      2149
   macro avg       0.68      0.63      0.65      2149
weighted avg       0.86      0.88      0.87      2149



### Your original system [4 points]

This is a simple dataset, but our focus on the 'word_disjoint' condition ensures that it's a challenging one, and there are lots of modeling strategies one might adopt. 

You are free to do whatever you like. We require only that your system differ in some way from those defined in the preceding questions. They don't have to be completely different, though. For example, you might want to stick with the model but represent examples differently, or the reverse.

Keep in mind that, for the bake-off evaluation, the 'edge_disjoint' portions of the data are off limits. You can, though, train on the combination of the 'word_disjoint' 'train' and 'dev' portions. You are free to use different pretrained word vectors and the like. Please do not introduce additional entailment datasets into your training data, though.

Please embed your code in this notebook so that we can rerun it.

In [117]:
import copy
import sklearn.utils as sku

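def balance_data(in_data, style='oversample'):
    """Rebalance the label distribution of `in_data` by resampling with replacement.
    'oversample' draws positives up to the number of negatives, 'undersample' draws
    negatives down to the number of positives, and 'both' draws each class to half
    the difference in class sizes. Any other `style` returns `in_data` unchanged."""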
    out_data = []
    pos_samples = [d for d in in_data if d[1] == 1]
    neg_samples = [d for d in in_data if d[1] == 0]
#     print(len(pos_samples))
#     print(len(neg_samples))
    if style == 'oversample':
        out_data.extend(neg_samples)
        out_data.extend(sku.resample(pos_samples, n_samples=len(neg_samples)))
    elif style == 'undersample':
        out_data.extend(pos_samples)
        out_data.extend(sku.resample(neg_samples, n_samples=len(pos_samples)))
    elif style == 'both':
        num_samps = int(abs(len(pos_samples) - len(neg_samples)) / 2)
        out_data.extend(sku.resample(pos_samples, n_samples=num_samps))
        out_data.extend(sku.resample(neg_samples, n_samples=num_samps))
    else:
        return in_data
    return out_data

glove_dim = 100
glove_src = os.path.join(GLOVE_HOME, 'glove.6B.{}d.txt'.format(glove_dim))
# Creates a dict mapping strings (words) to GloVe vectors:
glove_lookup = utils.glove2dict(glove_src)
def better_glove_vec(w):    
    """Return `w`'s GloVe representation if available, else return 
    a random vector."""
    return glove_lookup.get(w, randvec(w, n=glove_dim))

# Let's try BERT embeddings
# from bert_serving.client import BertClient

# # The BERT Client
# bc = BertClient(check_length=False)
# print("Encoding vocab")
# # print(len(wordentail_data['vocab']))
# # print(wordentail_data['vocab'])
# bert_vocab, bert_toks = bc.encode(wordentail_data['vocab'], show_tokens=True)
# bert_lookup = {}
# print("Building bert_lookup")
# for w, bv in zip(wordentail_data['vocab'], bert_vocab):
#     bert_lookup[w] = bv
# print("Done creating bert lookup")

# def bert_vec(w):
#     return bert_lookup[w]
## BERT doesn't appear to be much better -- this makes sense,
## since BERT is for word embeddings based on their context,
## and in single-word entailment, there is no context to use anyways.
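`balance_data` above draws its samples with `sklearn.utils.resample`, which gives a different sample on each run unless a `random_state` is passed. For reproducible comparisons, the oversampling idea can be sketched with the standard library and a fixed seed. The function name `oversample_minority` and the toy data are hypothetical, and the sketch assumes both classes are present.

```python
import random

def oversample_minority(examples, seed=0):
    """Resample the smaller class (with replacement) up to the size of the larger one."""
    rng = random.Random(seed)
    pos = [ex for ex in examples if ex[1] == 1]
    neg = [ex for ex in examples if ex[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw with replacement, so minority examples are repeated.
    resampled = rng.choices(minority, k=len(majority))
    return majority + resampled

# Toy data in the [[w_left, w_right], label] format, invented for illustration.
toy = [[["w%d" % i, "x%d" % i], 0] for i in range(5)]
toy += [[["herring", "animal"], 1], [["puppy", "dog"], 1]]

balanced = oversample_minority(toy)
n_pos = sum(1 for ex in balanced if ex[1] == 1)
n_neg = sum(1 for ex in balanced if ex[1] == 0)
print(n_pos, n_neg)
```

Only the training data should be rebalanced this way; the dev set is best left at its natural label distribution so that scores reflect the distribution the model will actually face.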

In [118]:
best_s = None
best_f1 = 0
best_hidden = None
best_dropout = None
best_experiment = None
for s in ['oversample', 'both', 'original']:
    for h in [50, 100, 200]:
        for d in [0.1, 0.2, 0.3]:
            print(f"Data augmentation style: {s}")
            print(f"Hidden dim: {h}")
            print(f"Dropout: {d}")
            aug_train_data = balance_data(wordentail_data['word_disjoint']['train'], style=s)

            ## Maybe don't want to do this b/c we won't generalize to the test data in the same way...
        #     aug_dev_data = balance_data(wordentail_data['word_disjoint']['dev'], style=s)

            bake_mod = TorchDeepNeuralClassifier(hidden_dim=h, max_iter=200, dropout_prob=d)
        #     bake_mod = TorchShallowNeuralClassifier(hidden_dim=50, max_iter=100)
        #     bake_mod = sklearn.linear_model.LogisticRegression(solver='liblinear')

            sample_experiment = nli.wordentail_experiment(
                train_data=aug_train_data,
                assess_data=wordentail_data['word_disjoint']['dev'], 
                model=bake_mod, 
                vector_func=better_glove_vec,
                vector_combo_func=hypothesis_only)
            
            if sample_experiment['macro-F1'] > best_f1:
                best_f1 = sample_experiment['macro-F1']
                best_s = s
                best_hidden = h
                best_dropout = d
                # This is the object we'll use for bake-off submission
                best_experiment = sample_experiment
                
            print("=" * 80)
    
print(f"Best model was hidden={best_hidden}, dropout={best_dropout}, with data augmentation: {best_s}. "
      f"Macro-F1 of the experiment selected for the bake-off: {best_experiment['macro-F1']}")
    
# aug_train_data = balance_data(wordentail_data['word_disjoint']['train'], style=best_s)
# bake_mod = TorchDeepNeuralClassifier(hidden_dim=best_hidden, max_iter=200, dropout_prob=best_dropout)
# bake_off_experiment = nli.wordentail_experiment(
#     train_data=aug_train_data,
#     assess_data=wordentail_data['word_disjoint']['dev'], 
#     model=bake_mod, 
#     vector_func=better_glove_vec,
#     vector_combo_func=concat_elementwise_sum)

Data augmentation style: oversample
Hidden dim: 50
Dropout: 0.1


Finished epoch 200 of 200; error is 3.8489951379597187

              precision    recall  f1-score   support

           0       0.91      0.92      0.92      1910
           1       0.32      0.30      0.31       239

   micro avg       0.85      0.85      0.85      2149
   macro avg       0.62      0.61      0.61      2149
weighted avg       0.85      0.85      0.85      2149

Data augmentation style: oversample
Hidden dim: 50
Dropout: 0.2


Finished epoch 200 of 200; error is 4.1500989086925985

              precision    recall  f1-score   support

           0       0.92      0.93      0.93      1910
           1       0.40      0.38      0.39       239

   micro avg       0.87      0.87      0.87      2149
   macro avg       0.66      0.65      0.66      2149
weighted avg       0.86      0.87      0.87      2149

Data augmentation style: oversample
Hidden dim: 50
Dropout: 0.3


Finished epoch 200 of 200; error is 4.6389740705490115

              precision    recall  f1-score   support

           0       0.94      0.87      0.90      1910
           1       0.35      0.54      0.42       239

   micro avg       0.84      0.84      0.84      2149
   macro avg       0.64      0.71      0.66      2149
weighted avg       0.87      0.84      0.85      2149

Data augmentation style: oversample
Hidden dim: 100
Dropout: 0.1


Finished epoch 200 of 200; error is 3.7394493483006954

              precision    recall  f1-score   support

           0       0.92      0.89      0.91      1910
           1       0.30      0.36      0.32       239

   micro avg       0.83      0.83      0.83      2149
   macro avg       0.61      0.63      0.61      2149
weighted avg       0.85      0.83      0.84      2149

Data augmentation style: oversample
Hidden dim: 100
Dropout: 0.2


Finished epoch 200 of 200; error is 4.2429143413901334

              precision    recall  f1-score   support

           0       0.92      0.86      0.89      1910
           1       0.28      0.43      0.34       239

   micro avg       0.81      0.81      0.81      2149
   macro avg       0.60      0.64      0.61      2149
weighted avg       0.85      0.81      0.83      2149

Data augmentation style: oversample
Hidden dim: 100
Dropout: 0.3


Finished epoch 200 of 200; error is 4.6657730340957645

              precision    recall  f1-score   support

           0       0.93      0.93      0.93      1910
           1       0.44      0.46      0.45       239

   micro avg       0.88      0.88      0.88      2149
   macro avg       0.69      0.69      0.69      2149
weighted avg       0.88      0.88      0.88      2149

Data augmentation style: oversample
Hidden dim: 200
Dropout: 0.1


Finished epoch 200 of 200; error is 4.8119736760854725

              precision    recall  f1-score   support

           0       0.92      0.91      0.91      1910
           1       0.32      0.33      0.33       239

   micro avg       0.85      0.85      0.85      2149
   macro avg       0.62      0.62      0.62      2149
weighted avg       0.85      0.85      0.85      2149

Data augmentation style: oversample
Hidden dim: 200
Dropout: 0.2


Finished epoch 200 of 200; error is 4.0038951784372335

              precision    recall  f1-score   support

           0       0.92      0.93      0.92      1910
           1       0.38      0.34      0.36       239

   micro avg       0.86      0.86      0.86      2149
   macro avg       0.65      0.64      0.64      2149
weighted avg       0.86      0.86      0.86      2149

Data augmentation style: oversample
Hidden dim: 200
Dropout: 0.3


Finished epoch 200 of 200; error is 4.7622310221195224

              precision    recall  f1-score   support

           0       0.92      0.91      0.92      1910
           1       0.36      0.41      0.38       239

   micro avg       0.85      0.85      0.85      2149
   macro avg       0.64      0.66      0.65      2149
weighted avg       0.86      0.85      0.86      2149

Data augmentation style: both
Hidden dim: 50
Dropout: 0.1


Finished epoch 200 of 200; error is 1.2572931200265884

              precision    recall  f1-score   support

           0       0.93      0.91      0.92      1910
           1       0.37      0.43      0.39       239

   micro avg       0.85      0.85      0.85      2149
   macro avg       0.65      0.67      0.66      2149
weighted avg       0.86      0.85      0.86      2149

Data augmentation style: both
Hidden dim: 50
Dropout: 0.2


Finished epoch 200 of 200; error is 1.4539739638566973

              precision    recall  f1-score   support

           0       0.94      0.86      0.90      1910
           1       0.33      0.56      0.42       239

   micro avg       0.83      0.83      0.83      2149
   macro avg       0.64      0.71      0.66      2149
weighted avg       0.87      0.83      0.84      2149

Data augmentation style: both
Hidden dim: 50
Dropout: 0.3


Finished epoch 200 of 200; error is 1.5344067811965942

              precision    recall  f1-score   support

           0       0.94      0.86      0.90      1910
           1       0.34      0.56      0.42       239

   micro avg       0.83      0.83      0.83      2149
   macro avg       0.64      0.71      0.66      2149
weighted avg       0.87      0.83      0.85      2149

Data augmentation style: both
Hidden dim: 100
Dropout: 0.1


Finished epoch 200 of 200; error is 1.2588433623313904

              precision    recall  f1-score   support

           0       0.92      0.90      0.91      1910
           1       0.31      0.38      0.34       239

   micro avg       0.84      0.84      0.84      2149
   macro avg       0.62      0.64      0.62      2149
weighted avg       0.85      0.84      0.84      2149

Data augmentation style: both
Hidden dim: 100
Dropout: 0.2


Finished epoch 200 of 200; error is 1.3076198250055313

              precision    recall  f1-score   support

           0       0.94      0.85      0.89      1910
           1       0.31      0.55      0.40       239

   micro avg       0.81      0.81      0.81      2149
   macro avg       0.62      0.70      0.64      2149
weighted avg       0.87      0.81      0.84      2149

Data augmentation style: both
Hidden dim: 100
Dropout: 0.3


Finished epoch 200 of 200; error is 1.4018592685461044

              precision    recall  f1-score   support

           0       0.94      0.86      0.90      1910
           1       0.34      0.57      0.43       239

   micro avg       0.83      0.83      0.83      2149
   macro avg       0.64      0.72      0.66      2149
weighted avg       0.87      0.83      0.85      2149

Data augmentation style: both
Hidden dim: 200
Dropout: 0.1


Finished epoch 200 of 200; error is 1.2073002755641937

              precision    recall  f1-score   support

           0       0.92      0.90      0.91      1910
           1       0.29      0.34      0.31       239

   micro avg       0.83      0.83      0.83      2149
   macro avg       0.60      0.62      0.61      2149
weighted avg       0.85      0.83      0.84      2149

Data augmentation style: both
Hidden dim: 200
Dropout: 0.2


Finished epoch 200 of 200; error is 1.3277530670166016

              precision    recall  f1-score   support

           0       0.93      0.89      0.91      1910
           1       0.34      0.46      0.39       239

   micro avg       0.84      0.84      0.84      2149
   macro avg       0.63      0.67      0.65      2149
weighted avg       0.86      0.84      0.85      2149

Data augmentation style: both
Hidden dim: 200
Dropout: 0.3


Finished epoch 200 of 200; error is 1.3433310985565186

              precision    recall  f1-score   support

           0       0.92      0.90      0.91      1910
           1       0.32      0.38      0.35       239

   micro avg       0.84      0.84      0.84      2149
   macro avg       0.62      0.64      0.63      2149
weighted avg       0.85      0.84      0.85      2149

Data augmentation style: original
Hidden dim: 50
Dropout: 0.1


Finished epoch 200 of 200; error is 1.6674161255359657

              precision    recall  f1-score   support

           0       0.89      0.98      0.94      1910
           1       0.33      0.07      0.11       239

   micro avg       0.88      0.88      0.88      2149
   macro avg       0.61      0.53      0.52      2149
weighted avg       0.83      0.88      0.84      2149

Data augmentation style: original
Hidden dim: 50
Dropout: 0.2


Finished epoch 200 of 200; error is 1.8623005300760277

              precision    recall  f1-score   support

           0       0.90      0.98      0.94      1910
           1       0.43      0.12      0.19       239

   micro avg       0.88      0.88      0.88      2149
   macro avg       0.67      0.55      0.56      2149
weighted avg       0.85      0.88      0.85      2149

Data augmentation style: original
Hidden dim: 50
Dropout: 0.3


Finished epoch 200 of 200; error is 1.9002169221639633

              precision    recall  f1-score   support

           0       0.90      0.98      0.94      1910
           1       0.39      0.08      0.14       239

   micro avg       0.88      0.88      0.88      2149
   macro avg       0.64      0.53      0.54      2149
weighted avg       0.84      0.88      0.85      2149

Data augmentation style: original
Hidden dim: 100
Dropout: 0.1


Finished epoch 200 of 200; error is 1.6132768094539642

              precision    recall  f1-score   support

           0       0.90      0.98      0.94      1910
           1       0.38      0.09      0.15       239

   micro avg       0.88      0.88      0.88      2149
   macro avg       0.64      0.54      0.54      2149
weighted avg       0.84      0.88      0.85      2149

Data augmentation style: original
Hidden dim: 100
Dropout: 0.2


Finished epoch 200 of 200; error is 1.7000831216573715

              precision    recall  f1-score   support

           0       0.90      0.99      0.94      1910
           1       0.47      0.08      0.13       239

   micro avg       0.89      0.89      0.89      2149
   macro avg       0.68      0.53      0.54      2149
weighted avg       0.85      0.89      0.85      2149

Data augmentation style: original
Hidden dim: 100
Dropout: 0.3


Finished epoch 200 of 200; error is 1.7737471461296082

              precision    recall  f1-score   support

           0       0.90      0.99      0.94      1910
           1       0.49      0.09      0.15       239

   micro avg       0.89      0.89      0.89      2149
   macro avg       0.69      0.54      0.54      2149
weighted avg       0.85      0.89      0.85      2149

Data augmentation style: original
Hidden dim: 200
Dropout: 0.1


Finished epoch 200 of 200; error is 1.6135762631893158

              precision    recall  f1-score   support

           0       0.89      0.98      0.94      1910
           1       0.38      0.08      0.13       239

   micro avg       0.88      0.88      0.88      2149
   macro avg       0.64      0.53      0.53      2149
weighted avg       0.84      0.88      0.85      2149

Data augmentation style: original
Hidden dim: 200
Dropout: 0.2


Finished epoch 200 of 200; error is 1.6133108884096146

              precision    recall  f1-score   support

           0       0.90      0.98      0.94      1910
           1       0.48      0.16      0.24       239

   micro avg       0.89      0.89      0.89      2149
   macro avg       0.69      0.57      0.59      2149
weighted avg       0.86      0.89      0.86      2149

Data augmentation style: original
Hidden dim: 200
Dropout: 0.3


Finished epoch 200 of 200; error is 1.7036578208208084

              precision    recall  f1-score   support

           0       0.89      0.99      0.94      1910
           1       0.30      0.04      0.07       239

   micro avg       0.88      0.88      0.88      2149
   macro avg       0.60      0.51      0.51      2149
weighted avg       0.83      0.88      0.84      2149

Best model was hidden=100, dropout=0.3, with data augmentation: oversample
Using those params for the bake off experiment: 0.6891276842891627


### Bake-Off Submission Based on the Above
Use the variable `best_experiment` to submit to the bake-off. The best-performing parameters found by the search are:
- hidden_dim = 100
- dropout = 0.3
- data augmentation style: oversample
- macro-F1 = 0.689
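
The `oversample` augmentation style is not defined in this part of the notebook. As a rough illustration only, the sketch below shows one common reading of the term: resampling the smaller class with replacement until the classes are the same size. The function name `oversample_minority` and the `((w_left, w_right), label)` example layout are assumptions made for this sketch, and it is not the notebook's `balance_data`.

```python
import random
from collections import Counter


def oversample_minority(examples, seed=0):
    """Return a class-balanced copy of `examples`.

    Hypothetical helper: `examples` is assumed to be a list of
    ((w_left, w_right), label) pairs.
    """
    rng = random.Random(seed)

    # Group the examples by label.
    by_label = {}
    for pair, label in examples:
        by_label.setdefault(label, []).append((pair, label))

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Sample with replacement to fill the gap (k may be 0).
        balanced.extend(rng.choices(items, k=target - len(items)))

    rng.shuffle(balanced)
    return balanced


# Toy data: one positive pair and four negative pairs.
toy = [(("dog", "animal"), 1)] + [(("dog", f"w{i}"), 0) for i in range(4)]
balanced = oversample_minority(toy)
label_counts = Counter(label for _, label in balanced)
```

Resampling of this kind should only be applied to the training split; the assessment data keeps its original class distribution so that the reported scores remain comparable.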

In [94]:
# This BERT-based model is commented out because it does not do better than the model above.

# aug_train_data = balance_data(wordentail_data['word_disjoint']['train'], style='both')

# bake_mod = TorchDeepNeuralClassifier(hidden_dim=50, max_iter=200, dropout_prob=0.1)
# bake_off_experiment = nli.wordentail_experiment(
#     train_data=aug_train_data,
#     assess_data=wordentail_data['word_disjoint']['dev'], 
#     model=bake_mod, 
#     vector_func=bert_vec,
#     vector_combo_func=vec_concatenate)

Finished epoch 200 of 200; error is 0.07415337534621358

              precision    recall  f1-score   support

           0       0.94      0.85      0.89      1910
           1       0.31      0.54      0.39       239

   micro avg       0.81      0.81      0.81      2149
   macro avg       0.62      0.69      0.64      2149
weighted avg       0.87      0.81      0.83      2149



## Bake-off [1 point]

The goal of the bake-off is to achieve the highest macro-average F1 score on __word_disjoint__, on a test set that we will make available at the start of the bake-off on May 6. The announcement will go out on Piazza. To enter, you'll be asked to run `nli.bake_off_evaluation` on the output of your chosen `nli.wordentail_experiment` run. 

To enter the bake-off, upload this notebook on Canvas:

https://canvas.stanford.edu/courses/99711/assignments/187250

The cells below this one constitute your bake-off entry.

The rules described in the [Your original system](#Your-original-system-[4-points]) homework question are also in effect for the bake-off.

Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

The bake-off will close at 4:30 pm on May 8. Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.
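
For reference, macro-average F1 is the unweighted mean of the per-class F1 scores, so the rarer class counts as much as the frequent one. The sketch below computes it in plain Python on toy labels; the helper names `f1_for_class` and `macro_f1` are hypothetical and are not part of `nli` or `utils`.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for a single class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        # No true positives: F1 is taken to be 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels: four examples of class 0 and two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = macro_f1(y_true, y_pred)  # class 0 F1 = 0.75, class 1 F1 = 0.5
```

Because each class contributes equally, a system that does well on class 0 but poorly on class 1 is penalized more under macro averaging than under micro or weighted averaging.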

In [119]:
# Enter your bake-off assessment code into this cell. 
# Please do not remove this comment.
test_data_filename = os.path.join(
    NLIDATA_HOME,
    "bakeoff4-wordentail-data",
    "nli_wordentail_bakeoff_data-test.json")

nli.bake_off_evaluation(
    best_experiment,
    test_data_filename)

              precision    recall  f1-score   support

           0       0.86      0.90      0.88      1767
           1       0.52      0.43      0.47       446

   micro avg       0.80      0.80      0.80      2213
   macro avg       0.69      0.67      0.68      2213
weighted avg       0.79      0.80      0.80      2213



In [24]:
# On an otherwise blank line in this cell, please enter
# your macro-avg f1 value as reported by the code above. 
# Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.
0.68