# Homework 4: Word-level entailment with neural networks

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2019"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Data](#Data)
  1. [Edge disjoint](#Edge-disjoint)
  1. [Word disjoint](#Word-disjoint)
1. [Baseline](#Baseline)
  1. [Representing words: vector_func](#Representing-words:-vector_func)
  1. [Combining words into inputs: vector_combo_func](#Combining-words-into-inputs:-vector_combo_func)
  1. [Classifier model](#Classifier-model)
  1. [Baseline results](#Baseline-results)
1. [Homework questions](#Homework-questions)
  1. [Hypothesis-only baseline [2 points]](#Hypothesis-only-baseline-[2-points])
  1. [Alternatives to concatenation [1 point]](#Alternatives-to-concatenation-[1-point])
  1. [A deeper network [2 points]](#A-deeper-network-[2-points])
  1. [Your original system [4 points]](#Your-original-system-[4-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

The general problem is word-level natural language inference.

Training examples are pairs of words $(w_{L}, w_{R}), y$ with $y = 1$ if $w_{L}$ entails $w_{R}$, otherwise $0$.

The homework questions below ask you to define baseline models for this and develop your own system for entry in the bake-off, which will take place on a held-out test-set distributed at the start of the bake-off. (Thus, all the data you have available for development is available for training your final system before the bake-off begins.)

<img src="fig/wordentail-diagram.png" width=600 alt="wordentail-diagram.png" />

## Set-up

See [the first notebook in this unit](nli_01_task_and_data.ipynb) for set-up instructions.

In [1]:
from bert_serving.client import BertClient 
from collections import defaultdict
import json
import numpy as np
import os
import pandas as pd
import random
from sklearn.metrics import classification_report
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
from torch_rnn_classifier import TorchRNNClassifier
import nli
import utils

In [2]:
DATA_HOME = '/home/kd/data/data'

NLIDATA_HOME = os.path.join(DATA_HOME, 'nlidata')

wordentail_filename = os.path.join(
    NLIDATA_HOME, 'nli_wordentail_bakeoff_data.json')

GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')

## Data

I've processed the data into two different train/test splits, in an effort to put some pressure on our models to actually learn these semantic relations, as opposed to exploiting regularities in the sample.

* `edge_disjoint`: The `train` and `dev` __edge__ sets are disjoint, but many __words__ appear in both `train` and `dev`.
* `word_disjoint`: The `train` and `dev` __vocabularies are disjoint__, and thus the edges are disjoint as well.

These are very different problems. For `word_disjoint`, there is real pressure on the model to learn abstract relationships, as opposed to memorizing properties of individual words.
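To make the two notions of disjointness concrete, here is a minimal sketch on toy data. The helpers `edge_overlap_size` and `vocab_overlap_size` and the toy examples are hypothetical illustrations, not the `nli` module's implementation; they assume only that each example has the form `[[left_word, right_word], label]`.

```python
def edge_overlap_size(train, dev):
    """Number of (left, right) word pairs that occur in both splits."""
    train_edges = {tuple(pair) for pair, label in train}
    dev_edges = {tuple(pair) for pair, label in dev}
    return len(train_edges & dev_edges)

def vocab_overlap_size(train, dev):
    """Number of individual words that occur in both splits."""
    train_vocab = {w for pair, label in train for w in pair}
    dev_vocab = {w for pair, label in dev for w in pair}
    return len(train_vocab & dev_vocab)

# Toy data (hypothetical), in the [[left, right], label] format:
toy_train = [[['herring', 'animal'], 1], [['sweater', 'stroke'], 0]]

# No pair is repeated, but three of the four train words reappear:
toy_dev_edge_disjoint = [[['herring', 'fish'], 1], [['animal', 'sweater'], 0]]

# No word is repeated, so no pair can be repeated either:
toy_dev_word_disjoint = [[['deanery', 'building'], 1], [['acid', 'ship'], 0]]

for name, dev in [('edge_disjoint', toy_dev_edge_disjoint),
                  ('word_disjoint', toy_dev_word_disjoint)]:
    print(name,
          'edge overlap:', edge_overlap_size(toy_train, dev),
          'vocab overlap:', vocab_overlap_size(toy_train, dev))
```

In the edge-disjoint toy split, a model can still lean on what it learned about `herring` or `animal` in training; in the word-disjoint toy split, every dev word is new to it.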

In [3]:
with open(wordentail_filename, encoding='utf8') as f:
    wordentail_data = json.load(f)

The outer keys are the splits plus a list giving the vocabulary for the entire dataset:

In [4]:
wordentail_data.keys()

dict_keys(['edge_disjoint', 'vocab', 'word_disjoint'])

### Edge disjoint

In [5]:
wordentail_data['edge_disjoint'].keys()

dict_keys(['dev', 'train'])

This is what the split looks like; both splits have this same format:

In [6]:
wordentail_data['edge_disjoint']['dev'][: 5]

[[['sweater', 'stroke'], 0],
 [['constipation', 'hypovolemia'], 0],
 [['disease', 'inflammation'], 0],
 [['herring', 'animal'], 1],
 [['cauliflower', 'outlook'], 0]]

Let's test to make sure no edges are shared between `train` and `dev`:

In [7]:
nli.get_edge_overlap_size(wordentail_data, 'edge_disjoint')

0

As we expect, a *lot* of vocabulary items are shared between `train` and `dev`:

In [8]:
nli.get_vocab_overlap_size(wordentail_data, 'edge_disjoint')

2916

This is a large percentage of the entire vocab:

In [9]:
len(wordentail_data['vocab'])

8470

Here's the distribution of labels in the `train` set. It's highly imbalanced, which will pose a challenge for learning. (I'll go ahead and reveal that the `dev` set is similarly distributed.)

In [4]:
def label_distribution(split, dataset='train'):
    return pd.DataFrame(wordentail_data[split][dataset])[1].value_counts()

In [11]:
label_distribution('edge_disjoint')

0    14650
1     2745
Name: 1, dtype: int64
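One way to see why this imbalance matters is to score a trivial classifier that always predicts `0`. The sketch below is plain arithmetic on the label counts shown above; the helper `f1` is a hypothetical name introduced for illustration, and it follows the common convention of scoring F1 as 0 for a class that is never predicted correctly.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; taken to be 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Label counts for the `edge_disjoint` train split, copied from the output above:
n_neg, n_pos = 14650, 2745

# A classifier that always predicts 0 gets every negative right
# and every positive wrong:
accuracy = n_neg / (n_neg + n_pos)
f1_class0 = f1(tp=n_neg, fp=n_pos, fn=0)   # treating class 0 as the target class
f1_class1 = f1(tp=0, fp=0, fn=n_pos)       # class 1 is never predicted
macro_f1 = (f1_class0 + f1_class1) / 2

print('accuracy: {:.3f}  macro-F1: {:.3f}'.format(accuracy, macro_f1))
```

Accuracy looks respectable for this do-nothing classifier, while macro-F1 stays below 0.5, which is one reason to read the macro-averaged rows of the classification reports rather than accuracy alone.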

### Word disjoint

In [12]:
wordentail_data['word_disjoint'].keys()

dict_keys(['dev', 'train'])

In the `word_disjoint` split, no __words__ are shared between `train` and `dev`:

In [13]:
nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint')

0

Because no words are shared between `train` and `dev`, no edges are either:

In [14]:
nli.get_edge_overlap_size(wordentail_data, 'word_disjoint')

0

The label distribution is similar to that of `edge_disjoint`, though the overall number of examples is a bit smaller:

In [179]:
label_distribution('word_disjoint')

0    7199
1    1349
Name: 1, dtype: int64

In [181]:
label_distribution('word_disjoint')[1]

1349

In [5]:
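# Indices of the positive (label 1) examples in the train set:
train_ones = [i for i, row in enumerate(wordentail_data['word_disjoint']['train']) if row[1] == 1]
# Oversample: draw, with replacement, one position in `train_ones` per negative example:
random_ones = np.random.randint(0,label_distribution('word_disjoint')[1], size=label_distribution('word_disjoint')[0])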

In [6]:
balanced_ones = [wordentail_data['word_disjoint']['train'][train_ones[one]] for one in random_ones]

In [7]:
train_zeros = [i for i, row in enumerate(wordentail_data['word_disjoint']['train']) if row[1] == 0]

In [8]:
len(train_zeros)

7199

In [9]:
balanced_zeros = [wordentail_data['word_disjoint']['train'][zero] for zero in train_zeros]

In [10]:
wordentail_data['word_disjoint']['train_balanced'] = balanced_zeros + balanced_ones

In [45]:
wordentail_data['word_disjoint']['train_balanced'][-10:-1]

[[['ship', 'deck'], 1],
 [['amphetamine', 'drug'], 1],
 [['acid', 'chemical'], 1],
 [['deanery', 'building'], 1],
 [['state', 'government'], 1],
 [['monocyte', 'lymphocyte'], 1],
 [['ingredient', 'element'], 1],
 [['list', 'conglomerate'], 1],
 [['eatery', 'building'], 1]]

In [11]:
label_distribution('word_disjoint', 'train_balanced')

1    7199
0    7199
Name: 1, dtype: int64

## Baseline

Even in deep learning, __feature representation is vital and requires care!__ For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input.

### Representing words: vector_func

Let's consider two baseline word representation methods:

1. Random vectors (as returned by `utils.randvec`).
1. 50-dimensional GloVe representations.

In [12]:
def randvec(w, n=50, lower=-1.0, upper=1.0):
    """Returns a random vector of length `n`. `w` is ignored."""
    return utils.randvec(n=n, lower=lower, upper=upper)

In [13]:
# Any of the files in glove.6B will work here:

glove_dim = 50

glove_src = os.path.join(GLOVE_HOME, 'glove.6B.{}d.txt'.format(glove_dim))

# Creates a dict mapping strings (words) to GloVe vectors:
GLOVE = utils.glove2dict(glove_src)

def glove_vec(w):    
    """Return `w`'s GloVe representation if available, else return 
    a random vector."""
    return GLOVE.get(w, randvec(w, n=glove_dim))

### Combining words into inputs: vector_combo_func

Here we decide how to combine the two word vectors into a single representation. In more detail, where `u` is a vector representation of the left word and `v` is a vector representation of the right word, we need a function `vector_combo_func` such that `vector_combo_func(u, v)` returns a new input vector `z` of dimension `m`. A simple example is concatenation:

In [14]:
def vec_concatenate(u, v):
    """Concatenate np.array instances `u` and `v` into a new np.array"""
    return np.concatenate((u, v))

`vector_combo_func` could instead be vector average, vector difference, etc. (even combinations of those) – there's lots of space for experimentation here; [homework question 2](#Alternatives-to-concatenation-[1-point]) below pushes you to do some exploration.
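As a small illustration of how the choice of combination function changes the input dimension `m`, here is a sketch with three toy alternatives. The function names and the 3-dimensional toy vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def vec_diff(u, v):
    """Element-wise difference; output has the same dimension as `u`."""
    return u - v

def vec_average(u, v):
    """Element-wise average; output has the same dimension as `u`."""
    return (u + v) / 2.0

def vec_concat_with_diff(u, v):
    """Concatenation of `u`, `v`, and their difference; three times the dimension."""
    return np.concatenate((u, v, u - v))

# Toy 3-dimensional "word vectors":
u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, 2.0, -1.0])

for combo in (vec_diff, vec_average, vec_concat_with_diff):
    print(combo.__name__, combo(u, v).shape)
```

Note that averaging (like addition) is symmetric: swapping the two words gives the same input vector, so the network cannot tell $(w_{L}, w_{R})$ from $(w_{R}, w_{L})$. Difference and concatenation preserve the order, which may matter for a directional relation like entailment.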

### Classifier model

For a baseline model, I chose `TorchShallowNeuralClassifier`:

In [15]:
net = TorchShallowNeuralClassifier(hidden_dim=50, max_iter=100)

### Baseline results

The following puts the above pieces together, using `vector_func=glove_vec`, since `vector_func=randvec` seems so hopelessly misguided for `word_disjoint`!

In [16]:
word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=vec_concatenate)

Finished epoch 100 of 100; error is 0.025432696798816323

              precision    recall  f1-score   support

           0       0.92      0.94      0.93      1910
           1       0.42      0.36      0.38       239

    accuracy                           0.87      2149
   macro avg       0.67      0.65      0.66      2149
weighted avg       0.86      0.87      0.87      2149



## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Hypothesis-only baseline [2 points]

During our discussion of SNLI and MultiNLI, we noted that a number of research teams have shown that hypothesis-only baselines for NLI tasks can be remarkably robust. This question asks you to explore briefly how this baseline affects the 'edge_disjoint' and 'word_disjoint' versions of our task.

For this problem, submit the following:

1. A `vector_combo_func` function called `hypothesis_only` that simply throws away the premise, using the unmodified hypothesis (second) vector as its representation of the example.

1. Code for looping over the two conditions 'word_disjoint' and 'edge_disjoint' and the two `vector_combo_func` values `vec_concatenate` and `hypothesis_only`, calling `nli.wordentail_experiment` to train on the condition's 'train' portion and assess on its 'dev' portion, with `glove_vec` as the `vector_func`. So that the results are consistent, use an `sklearn.linear_model.LogisticRegression` with default parameters as the model.

1. Print out the percentage-wise increase in macro-F1 over the `hypothesis_only` baseline that `vec_concatenate` delivers for each of the two conditions. For example, if `hypothesis_only` returns 0.52 for condition `C` and `vec_concatenate` delivers 0.75 for `C`, then you'd report a ((0.75 / 0.52) - 1) * 100 = 44.23 percent increase for `C`. The values you need are stored in the dictionary returned by `nli.wordentail_experiment`, with key 'macro-F1'. Please round the percentages to two digits.
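The percentage calculation in the last item can be sketched as follows. The helper name `percent_increase` is hypothetical, and the two scores are the example values from the text, not experimental results.

```python
def percent_increase(baseline, new, digits=2):
    """Percentage-wise increase of `new` over `baseline`, rounded to `digits` places."""
    if baseline == 0:
        raise ValueError("Percentage increase is undefined for a baseline of 0.")
    return round(((new / baseline) - 1) * 100, digits)

# Example values from the text: hypothesis_only scores 0.52, vec_concatenate 0.75.
example_increase = percent_increase(0.52, 0.75)
print(example_increase)
```

In the actual solution, the two arguments would presumably come from the 'macro-F1' entries of the dictionaries returned by `nli.wordentail_experiment` for each condition.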

In [16]:
def hypothesis_only(prem, hyp):
    """Ignore the premise vector `prem` and return the hypothesis
    vector `hyp` unchanged."""
    return hyp

word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=hypothesis_only)

Finished epoch 100 of 100; error is 1.5703061372041702

              precision    recall  f1-score   support

           0       0.90      0.91      0.91      1910
           1       0.23      0.20      0.21       239

    accuracy                           0.83      2149
   macro avg       0.56      0.56      0.56      2149
weighted avg       0.83      0.83      0.83      2149



### Alternatives to concatenation [1 point]

We've so far just used vector concatenation to represent the premise and hypothesis words. This question asks you to explore a simple alternative. 

For this problem, submit the following:

1. A new potential value for `vector_combo_func` that does something different from concatenation. Options include, but are not limited to, element-wise addition, difference, and multiplication. These can be combined with concatenation if you like.
1. Include a use of `nli.wordentail_experiment` in the same configuration as the one in [Baseline results](#Baseline-results) above, but with your new value of `vector_combo_func`.

In [18]:
def add_vectors(prem, hyp):
    """Return the element-wise sum of np.array instances `prem` and `hyp`"""
    return prem + hyp

word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=add_vectors)

Finished epoch 100 of 100; error is 0.8312673419713974

              precision    recall  f1-score   support

           0       0.91      0.91      0.91      1910
           1       0.27      0.26      0.27       239

    accuracy                           0.84      2149
   macro avg       0.59      0.59      0.59      2149
weighted avg       0.84      0.84      0.84      2149



In [19]:
def minus_vectors(prem, hyp):
    """Return the element-wise difference `prem - hyp` of two np.array instances"""
    return prem - hyp

word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=minus_vectors)

Finished epoch 100 of 100; error is 0.2638537138700485

              precision    recall  f1-score   support

           0       0.91      0.85      0.88      1910
           1       0.21      0.32      0.25       239

    accuracy                           0.79      2149
   macro avg       0.56      0.58      0.57      2149
weighted avg       0.83      0.79      0.81      2149



In [20]:
def multiply_vectors(prem, hyp):
    """Return the element-wise product of np.array instances `prem` and `hyp`"""
    return prem * hyp

word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=multiply_vectors)

Finished epoch 100 of 100; error is 1.0253554061055183

              precision    recall  f1-score   support

           0       0.90      0.89      0.90      1910
           1       0.21      0.23      0.22       239

    accuracy                           0.82      2149
   macro avg       0.55      0.56      0.56      2149
weighted avg       0.82      0.82      0.82      2149



In [21]:
def concat_add_vectors(prem, hyp):
    """Concatenate `prem`, the element-wise sum `prem + hyp`, and `hyp`"""
    return np.concatenate((prem, prem + hyp, hyp))

word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=concat_add_vectors)

Finished epoch 100 of 100; error is 0.02553921565413475

              precision    recall  f1-score   support

           0       0.92      0.94      0.93      1910
           1       0.41      0.34      0.37       239

    accuracy                           0.87      2149
   macro avg       0.66      0.64      0.65      2149
weighted avg       0.86      0.87      0.87      2149



In [22]:
def concat_minus_vectors(prem, hyp):
    """Concatenate `prem`, the element-wise difference `prem - hyp`, and `hyp`"""
    return np.concatenate((prem, prem - hyp, hyp))

word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=concat_minus_vectors)

Finished epoch 100 of 100; error is 0.01280289446003735

              precision    recall  f1-score   support

           0       0.92      0.94      0.93      1910
           1       0.41      0.35      0.38       239

    accuracy                           0.87      2149
   macro avg       0.67      0.64      0.65      2149
weighted avg       0.86      0.87      0.87      2149



In [23]:
def concat_multiply_vectors(prem, hyp):
    """Concatenate `prem`, the element-wise product `prem * hyp`, and `hyp`"""
    return np.concatenate((prem, prem * hyp, hyp))

word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=concat_multiply_vectors)

Finished epoch 100 of 100; error is 0.012708959984593093

              precision    recall  f1-score   support

           0       0.93      0.94      0.93      1910
           1       0.47      0.39      0.42       239

    accuracy                           0.88      2149
   macro avg       0.70      0.67      0.68      2149
weighted avg       0.87      0.88      0.88      2149



In [24]:
def concat_minus_multiply_vectors(prem, hyp):
    """Concatenate `prem`, `prem - hyp`, `prem * hyp`, and `hyp`"""
    return np.concatenate((prem, prem - hyp, prem * hyp, hyp))

word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=concat_minus_multiply_vectors)

Finished epoch 100 of 100; error is 0.008822197036352009

              precision    recall  f1-score   support

           0       0.93      0.94      0.93      1910
           1       0.44      0.40      0.42       239

    accuracy                           0.88      2149
   macro avg       0.69      0.67      0.68      2149
weighted avg       0.87      0.88      0.87      2149



### A deeper network [2 points]

It is very easy to subclass `TorchShallowNeuralClassifier` if all you want to do is change the network graph: all you have to do is write a new `define_graph`. If your graph has new arguments that the user might want to set, then you should also redefine `__init__` so that these values are accepted and set as attributes.

For this question, please subclass `TorchShallowNeuralClassifier` so that it defines the following graph:

$$\begin{align}
h_{1} &= xW_{1} + b_{1} \\
r_{1} &= \textbf{Bernoulli}(1 - \textbf{dropout_prob}, n) \\
d_{1} &= r_1 * h_{1} \\
h_{2} &= f(d_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}$$

Here, $r_{1}$ and $d_{1}$ define a dropout layer: $r_{1}$ is a random binary vector of dimension $n$, where the probability of a value being $1$ is given by $1 - \textbf{dropout_prob}$. $r_{1}$ is multiplied element-wise by our first hidden representation, thereby zeroing out some of the values. The result is fed to the user's activation function $f$, and the result of that is fed through another linear layer to produce $h_{3}$. (Inside `TorchShallowNeuralClassifier`, $h_{3}$ is the basis for a softmax classifier, so no activation function is applied to it.)

For comparison, using this notation, `TorchShallowNeuralClassifier` defines the following graph:

$$\begin{align}
h_{1} &= xW_{1} + b_{1} \\
h_{2} &= f(h_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}$$
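The two graphs above can be traced step by step in NumPy. This is a minimal sketch of the forward pass only, assuming `tanh` for $f$ and randomly initialized toy weights; the function `forward` and all array names are hypothetical, and it is not a substitute for the PyTorch `define_graph` that the question asks for.

```python
import numpy as np

def forward(x, W1, b1, W2, b2, f=np.tanh, dropout_prob=0.0, rng=None):
    """Forward pass for the dropout graph; `dropout_prob=0.0` gives the shallow graph."""
    rng = np.random.default_rng() if rng is None else rng
    h1 = x @ W1 + b1                                          # first linear layer
    r1 = rng.binomial(1, 1.0 - dropout_prob, size=h1.shape)   # 1 = keep, 0 = drop
    d1 = r1 * h1                                              # zero out dropped units
    h2 = f(d1)                                                # activation
    h3 = h2 @ W2 + b2                                         # second linear layer
    return h3

# Toy dimensions (assumptions): 4 examples, 3 input features, 5 hidden units, 2 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

no_dropout = forward(x, W1, b1, W2, b2, dropout_prob=0.0, rng=rng)
with_dropout = forward(x, W1, b1, W2, b2, dropout_prob=0.5, rng=rng)
print(no_dropout.shape, with_dropout.shape)
```

One difference to keep in mind: PyTorch's `nn.Dropout` also rescales the surviving values by $1 / (1 - \textbf{dropout_prob})$ during training and acts as the identity in evaluation mode, so its outputs will not match this sketch number for number.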

The following code starts this sub-class for you, so that you can concentrate on `define_graph`. Be sure to make use of `self.dropout_prob`.

For this problem, submit just your completed `TorchDeepNeuralClassifier`. You needn't evaluate it, though we assume you will be keen to do that!

In [25]:
import torch.nn as nn

class TorchDeepNeuralClassifier(TorchShallowNeuralClassifier):
    def __init__(self, dropout_prob=0.7, **kwargs):
        self.dropout_prob = dropout_prob
        super().__init__(**kwargs)
    
    def define_graph(self):
        """Complete this method!
        
        Returns
        -------
        an `nn.Module` instance, which can be a free-standing class you 
        write yourself, as in `torch_rnn_classifier`, or the output of 
        `nn.Sequential`, as in `torch_shallow_neural_classifier`.
        
        """
        return nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            nn.Dropout(self.dropout_prob),
            self.hidden_activation,
            nn.Linear(self.hidden_dim, self.n_classes_))

deep_net = TorchDeepNeuralClassifier(hidden_dim=50, max_iter=100, dropout_prob=.275)
word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=deep_net, 
    vector_func=glove_vec,
    vector_combo_func=concat_minus_multiply_vectors)

Finished epoch 100 of 100; error is 0.7703916653990746

              precision    recall  f1-score   support

           0       0.92      0.96      0.94      1910
           1       0.54      0.38      0.44       239

    accuracy                           0.89      2149
   macro avg       0.73      0.67      0.69      2149
weighted avg       0.88      0.89      0.89      2149



### Your original system [4 points]

This is a simple dataset, but our focus on the 'word_disjoint' condition ensures that it's a challenging one, and there are lots of modeling strategies one might adopt. 

You are free to do whatever you like. We require only that your system differ in some way from those defined in the preceding questions. They don't have to be completely different, though. For example, you might want to stick with the model but represent examples differently, or the reverse.

Keep in mind that, for the bake-off evaluation, the 'edge_disjoint' portions of the data are off limits. You can, though, train on the combination of the 'word_disjoint' 'train' and 'dev' portions. You are free to use different pretrained word vectors and the like. Please do not introduce additional entailment datasets into your training data, though.

Please embed your code in this notebook so that we can rerun it.

In [26]:
bc = BertClient(check_length=False)

In [27]:
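# Dimensionality of each BERT vector:

bert_dim = 768

def bert_vec(w):    
    """Return `w`'s BERT representation, as computed by the server 
    that `bc` is connected to."""
    # With `show_tokens=True`, `encode` returns a (vectors, tokens) pair; 
    # take the vectors, then the entry for our single input `w`:
    bert = bc.encode([w], show_tokens=True)[0][0]
    #print(f"word={w}; bert.shape={bert.shape}")
    return bert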

bert_net = TorchShallowNeuralClassifier(hidden_dim=768, max_iter=100)

In [28]:
X_train, y_train = nli.word_entail_featurize(wordentail_data['word_disjoint']['train'], bert_vec, vec_concatenate)

In [46]:
X_train_balanced, y_train_balanced = nli.word_entail_featurize(wordentail_data['word_disjoint']['train_balanced'], bert_vec, vec_concatenate)

In [30]:
X_dev, y_dev = nli.word_entail_featurize(wordentail_data['word_disjoint']['dev'], bert_vec, vec_concatenate)

In [31]:
len(X_train[0])

50

In [32]:
bert_rnn_layers = TorchRNNClassifier(
    vocab=[],
    max_iter=100,
    num_layers=1,
    use_embedding=False)

In [33]:
%time _ = bert_rnn_layers.fit(X_train, y_train)

Finished epoch 100 of 100; error is 0.5158032402396202

CPU times: user 43min 21s, sys: 33min 11s, total: 1h 16min 32s
Wall time: 12min 50s


In [34]:
bert_rnn_preds_layers = bert_rnn_layers.predict(X_dev)

In [35]:
print(classification_report(y_dev, bert_rnn_preds_layers, digits=3))

              precision    recall  f1-score   support

           0      0.916     0.940     0.928      1910
           1      0.392     0.310     0.346       239

    accuracy                          0.870      2149
   macro avg      0.654     0.625     0.637      2149
weighted avg      0.858     0.870     0.863      2149



In [36]:
def bakeoff_experiment(wordentail_data, vec_func=vec_concatenate,
                       X_train=None, y_train=None,
                       X_dev=None, y_dev=None,
                       train_dataset_name='train', num_layers=1, 
                       hidden_dim=50, hidden_activation=nn.ReLU(), 
                       l2_strength=0.001,
                       eta=0.01, max_iter=100):
    bert_rnn_layers = TorchRNNClassifier(
                                        vocab=[],
                                        max_iter=max_iter,
                                        num_layers=num_layers,
                                        l2_strength=l2_strength,
                                        hidden_dim=hidden_dim,
                                        #hidden_activation=hidden_activation,
                                        eta=eta,
                                        use_embedding=False)
    
    if X_train is None or y_train is None:
        X_train, y_train = nli.word_entail_featurize(wordentail_data['word_disjoint'][train_dataset_name], bert_vec, vec_func)
    if X_dev is None or y_dev is None:
        X_dev, y_dev = nli.word_entail_featurize(wordentail_data['word_disjoint']['dev'], bert_vec, vec_func)

    %time _ = bert_rnn_layers.fit(X_train, y_train)
    bert_rnn_preds_layers = bert_rnn_layers.predict(X_dev)
    print(classification_report(y_dev, bert_rnn_preds_layers, digits=3))

In [37]:
bakeoff_experiment(wordentail_data, train_dataset_name='train_balanced')

Finished epoch 100 of 100; error is 10.418009340763092

CPU times: user 1h 10min 36s, sys: 55min 2s, total: 2h 5min 38s
Wall time: 21min 4s
              precision    recall  f1-score   support

           0      0.889     1.000     0.941      1910
           1      0.000     0.000     0.000       239

    accuracy                          0.889      2149
   macro avg      0.444     0.500     0.471      2149
weighted avg      0.790     0.889     0.836      2149



  _warn_prf(average, modifier, msg_start, len(result))


In [47]:
X_train_balanced_concat, y_train_balanced_concat = nli.word_entail_featurize(wordentail_data['word_disjoint']['train_balanced'], bert_vec, concat_minus_multiply_vectors)

In [48]:
X_dev_concat, y_dev_concat = nli.word_entail_featurize(wordentail_data['word_disjoint']['dev'], bert_vec, concat_minus_multiply_vectors)

In [49]:
bakeoff_experiment(wordentail_data, train_dataset_name='train_balanced', 
                   eta=0.01, max_iter=100,
                   vec_func=concat_minus_multiply_vectors,
                   X_train=X_train_balanced_concat, y_train=y_train_balanced_concat,
                   X_dev=X_dev_concat, y_dev=y_dev_concat)

Finished epoch 100 of 100; error is 10.397958040237427

CPU times: user 3h 47min 54s, sys: 2h 11min 57s, total: 5h 59min 51s
Wall time: 1h 37s
              precision    recall  f1-score   support

           0      0.889     1.000     0.941      1910
           1      0.000     0.000     0.000       239

    accuracy                          0.889      2149
   macro avg      0.444     0.500     0.471      2149
weighted avg      0.790     0.889     0.836      2149



In [41]:
def fit_rnn_classifier_bert(X, y):    
    mod = TorchRNNClassifier(
        vocab=[],
        max_iter=35,
        eta=0.02,
        use_embedding=False)
    mod.fit(X, y)
    return mod

In [42]:
def fit_shallow_nn_classifier_with_crossvalidation(X, y):
    basemod = fit_rnn_classifier_bert(X, y)
    cv = 3
    param_grid = {
                    'hidden_dim': [50, 100], 
                    'num_layers' : [1,2,3],
                    'l2_strength' : [0.0003, 0.001, 0.003]
                 }    
    best_mod = utils.fit_classifier_with_crossvalidation(
        X, y, basemod, cv, param_grid)
    return best_mod

In [38]:
%time _ = fit_shallow_nn_classifier_with_crossvalidation(X_train_balanced_concat, y_train_balanced_concat)

IndexError: index 514 is out of bounds for dimension 0 with size 2

IndexError: index 1023 is out of bounds for dimension 0 with size 2

IndexError: index 520 is out of bounds for dimension 0 with size 2

IndexError: index 1023 is out of bounds for dimension 0 with size 3

IndexError: index 509 is out of bounds for dimension 0 with size 3

IndexError: index 509 is out of bounds for dimension 0 with size 2

IndexError: index 520 is out of bounds for dimension 0 with size 3

IndexError: index 514 is out of bounds for dimension 0 with size 3

IndexError: index 6 is out of bounds for dimension 0 with size 2

IndexError: index 6 is out of bounds for dimension 0 with size 3

Finished epoch 35 of 35; error is 6.6890704929828643

Best params: {'hidden_dim': 100, 'l2_strength': 0.0003, 'num_layers': 1}
Best score: 0.111
CPU times: user 1d 22min 57s, sys: 15h 46min 4s, total: 1d 16h 9min 2s
Wall time: 6h 46min 14s


In [45]:
'''bakeoff_experiment(wordentail_data, train_dataset_name='train_balanced', 
                   eta=0.02, max_iter=80, hidden_dim=100, l2_strength=0.0001, num_layers=1,
                   vec_func=concat_minus_multiply_vectors,
                   X_train=X_train_balanced_concat, y_train=y_train_balanced_concat,
                   X_dev=X_dev_concat, y_dev=y_dev_concat)'''

Finished epoch 57 of 80; error is 10.397670865058899

KeyboardInterrupt: 

              precision    recall  f1-score   support

           0      0.889     1.000     0.941      1910
           1      0.000     0.000     0.000       239

    accuracy                          0.889      2149
   macro avg      0.444     0.500     0.471      2149
weighted avg      0.790     0.889     0.836      2149



In [23]:
# Enter your bake-off assessment code into this cell. 
# Please do not remove this comment.


In [24]:
# On an otherwise blank line in this cell, please enter
# your macro-avg f1 value as reported by the code above. 
# Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.
