<a href="https://colab.research.google.com/github/pgmikhael/mit_deeplearning_bootcamp/blob/master/Tutorial1_NLI_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Natural Language Inference Classifier

Natural language inference is the task of determining whether or not a given statement (the "hypothesis") is entailed by another given statement (the "premise").

The label is *entailment* if the premise implies that the hypothesis is true, *contradiction* if the premise implies that the hypothesis is false, and *neutral* if the premise does not determine the hypothesis either way.

An example is:

| Premise | Label | Hypothesis |
| ---  | --- | --- |
|The Golden State Warriors scored 100 points last night.| Entailment | Someone scored a basket in the game. |
|The Golden State Warriors scored 100 points last night. | Neutral | The Warriors won the game. |
| The Golden State Warriors scored 100 points last night. | Contradiction | The Warriors struggled to make baskets. |


## Dataset

For this exercise we'll use a portion of the [MNLI](https://arxiv.org/abs/1704.05426) dataset, a natural language inference dataset that spans multiple genres and writing styles. To keep things simple, we will only deal with the "Entailment" and "Contradiction" classes, which makes this a binary classification task.

The data is provided as a list of examples, where each `example` has the following structure (`y` is 0 for contradiction and 1 for entailment):

```
example.x1 = ["the", "tokenized", "premise"]
example.x2 = ["the", "tokenized", "hypothesis"]
example.y = 0 or 1
```
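
As a minimal sketch of how one record maps onto this structure, the snippet below parses a single made-up JSON line. The sentence content and the `toy_`-prefixed names are assumptions for illustration; only the `x1`, `x2`, and `y` field names come from the loading code in this notebook.

```python
import collections
import json

# Hypothetical raw line (invented content), assuming one JSON object per line.
toy_line = '{"x1": ["a", "man", "sleeps"], "x2": ["a", "person", "rests"], "y": "entailment"}'

# The position of a label string in this list is its integer label.
TOY_LABELS = ["contradiction", "entailment"]
ToyExample = collections.namedtuple("ToyExample", ["x1", "x2", "y"])

fields = json.loads(toy_line)
toy_example = ToyExample(fields["x1"], fields["x2"], TOY_LABELS.index(fields["y"]))

print(toy_example.x1)  # tokenized premise
print(toy_example.x2)  # tokenized hypothesis
print(toy_example.y)   # integer label
```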

In [None]:
# Load the data.
!wget https://raw.githubusercontent.com/pgmikhael/mit_deeplearning_bootcamp/master/data/nli/train.txt
!wget https://raw.githubusercontent.com/pgmikhael/mit_deeplearning_bootcamp/master/data/nli/valid.txt
!wget https://raw.githubusercontent.com/pgmikhael/mit_deeplearning_bootcamp/master/data/nli/test.txt

import collections
import json
import numpy as np

LABELS = ["contradiction", "entailment"]

# x1: tokenized premise, x2: tokenized hypothesis, y: index into LABELS.
Example = collections.namedtuple("Example", ["x1", "x2", "y"])

def load_data(filename):
  examples = []
  with open(filename, "r") as f:
    for line in f:
      fields = json.loads(line)
      x1 = fields["x1"]
      x2 = fields["x2"]
      if fields["y"] not in LABELS:
        continue
      y = LABELS.index(fields["y"])
      examples.append(Example(x1, x2, y))
  return examples

train_examples = load_data("train.txt")
valid_examples = load_data("valid.txt")
test_examples = load_data("test.txt")

## Feature Engineering

As you can see from the example, this task takes **two** inputs: the premise $x_1$ and the hypothesis $x_2$. We'll experiment with some basic featurization options.


### Majority baseline

It's always good to start simple when approaching a new task. Naïve baselines are often good at uncovering biases in the data that you might not have noticed otherwise.

A good one to start with is the majority baseline: what is the prior for entailment? This model ignores the input and always predicts the most common class in the training data.

We can use the [DummyClassifier](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators) from `sklearn` to create a majority baseline and record its accuracy.

In [None]:
from sklearn.dummy import DummyClassifier
majority_baseline = DummyClassifier(strategy='most_frequent', random_state=0)

X_train = np.ones([len(train_examples)])
Y_train = np.array([ex.y for ex in train_examples])
X_valid = np.ones([len(valid_examples)])
Y_valid = np.array([ex.y for ex in valid_examples])

majority_baseline.fit(X_train, Y_train)
majority_accuracy = majority_baseline.score(X_valid, Y_valid)
print(majority_accuracy)
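
To make the behavior of the `most_frequent` strategy concrete, here is a small sketch on made-up labels (the `toy_` arrays are invented and are not the MNLI data). The classifier memorizes the majority class at fit time and predicts it for every input, so its accuracy equals the share of validation labels that match that class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (invented for illustration): class 1 is the majority in training.
toy_y_train = np.array([1, 1, 1, 0])
toy_y_valid = np.array([1, 0, 0, 1])

# This strategy ignores the features, so a constant column is enough.
toy_X_train = np.ones((len(toy_y_train), 1))
toy_X_valid = np.ones((len(toy_y_valid), 1))

toy_baseline = DummyClassifier(strategy="most_frequent")
toy_baseline.fit(toy_X_train, toy_y_train)

toy_predictions = toy_baseline.predict(toy_X_valid)
toy_accuracy = toy_baseline.score(toy_X_valid, toy_y_valid)

print(toy_predictions)  # every prediction is the majority training class
print(toy_accuracy)     # fraction of validation labels equal to that class
```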

## Exercise:

1. Using `Y_train`, calculate the priors for the different labels. What is the most frequent class?

In [None]:
# Write code here!

### Hypothesis- and premise-only baselines

Two other simple baselines classify the data using just the hypothesis (ignoring the premise) or just the premise (ignoring the hypothesis). We will use a bag-of-words representation, which maps each sentence to a vector of word counts over a fixed vocabulary.
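
As a rough illustration of a bag-of-words representation, the sketch below runs `CountVectorizer` with default settings on two made-up sentences. Each row of the result is one sentence and each column counts one vocabulary word; word order is discarded.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented sentences, already joined into whitespace-separated strings.
toy_sentences = ["the cat sat on the mat", "the dog sat"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_sentences)

# vocabulary_ maps each word to its column index; sort words by column.
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_terms)             # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(toy_counts.toarray())  # [[1 0 1 1 1 2], [0 1 0 0 1 1]]
```

Note that the default tokenizer lowercases text and drops single-character tokens and punctuation, so tokens such as "a" or "." never become features unless `token_pattern` is changed.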

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Set vocab using train text.
min_df = 5
max_features = 1000
vectorizer = CountVectorizer(min_df=min_df, max_features=max_features)
vectorizer.fit([" ".join(ex.x1) for ex in train_examples] +
               [" ".join(ex.x2) for ex in train_examples])

# x2 holds the tokenized hypothesis.
train_hypothesis_only = vectorizer.transform([" ".join(ex.x2) for ex in train_examples])
valid_hypothesis_only = vectorizer.transform([" ".join(ex.x2) for ex in valid_examples])

# x1 holds the tokenized premise.
train_premise_only = vectorizer.transform([" ".join(ex.x1) for ex in train_examples])
valid_premise_only = vectorizer.transform([" ".join(ex.x1) for ex in valid_examples])

### Independent features

Let's now create a featurization that includes both $x_1$ and $x_2$. A simple one to begin with is the concatenation of their bag-of-words vectors: $[\texttt{BoW}(x_1); \texttt{BoW}(x_2)]$.

In [None]:
import scipy.sparse

# Simply concatenate the two featurizations together.
train_concatenated = scipy.sparse.hstack([train_premise_only, train_hypothesis_only])
valid_concatenated = scipy.sparse.hstack([valid_premise_only, valid_hypothesis_only])
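
The effect of the concatenation can be seen on a tiny made-up example. In this sketch, two sparse count matrices over a three-word vocabulary are stacked side by side, which doubles the number of columns. The `toy_` matrices are invented for illustration.

```python
import numpy as np
import scipy.sparse

# Invented count matrices: 2 examples, 3 vocabulary words each.
toy_premise = scipy.sparse.csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))
toy_hypothesis = scipy.sparse.csr_matrix(np.array([[0, 1, 1], [1, 0, 0]]))

# Columns 0-2 come from the premise, columns 3-5 from the hypothesis.
toy_concat = scipy.sparse.hstack([toy_premise, toy_hypothesis]).tocsr()

print(toy_concat.shape)      # (2, 6)
print(toy_concat.toarray())  # [[1 0 2 0 1 1], [0 1 0 1 0 0]]
```

Because each word gets one column per sentence, a linear model on these features weighs premise words and hypothesis words independently and has no single feature that says a word occurs in both.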

In [None]:
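# Word overlap: the element-wise minimum of the two count vectors is non-zero
# only for words that occur in both the premise and the hypothesis.
train_overlap = np.minimum(train_hypothesis_only.toarray(),
                           train_premise_only.toarray())
valid_overlap = np.minimum(valid_hypothesis_only.toarray(),
                           valid_premise_only.toarray())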
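
The cell above combines the two count matrices element-wise. As a small sketch on made-up count vectors, the snippet below contrasts the two obvious element-wise choices: the minimum is non-zero only for words that occur in both sentences, while the maximum is non-zero for words that occur in either one. The vocabulary and counts are invented for illustration.

```python
import numpy as np

# Invented vocabulary and counts for one premise/hypothesis pair.
pair_vocab = ["warriors", "scored", "won", "game"]
pair_premise_counts = np.array([1, 1, 0, 0])
pair_hypothesis_counts = np.array([1, 0, 1, 1])

# Minimum keeps only words present in both sentences (intersection-like).
toy_shared = np.minimum(pair_premise_counts, pair_hypothesis_counts)
# Maximum keeps words present in either sentence (union-like).
toy_either = np.maximum(pair_premise_counts, pair_hypothesis_counts)

print(dict(zip(pair_vocab, toy_shared.tolist())))
print(dict(zip(pair_vocab, toy_either.tolist())))
```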

## Modeling

Using the different featurizations as inputs, we can now experiment with different modeling choices using `sklearn`.

In [None]:
from collections import defaultdict
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

features = {
  "hyp_only": (train_hypothesis_only, valid_hypothesis_only),
  "prem_only": (train_premise_only, valid_premise_only),
  "concat": (train_concatenated, valid_concatenated),
}

models = {
    "logreg": LogisticRegression,
    "rand_forest": RandomForestClassifier,
}

# accuracies[feature_name][model_name] -> validation accuracy
accuracies = defaultdict(dict)
for name, (train, valid) in features.items():
  print("Running on %s" % name)
  for model_name, model_class in models.items():
    print("Using %s" % model_name)
    # Put extra hyper-params here...
    model = model_class()
    clf = model.fit(train, Y_train)
    accuracies[name][model_name] = clf.score(valid, Y_valid)

In [None]:
for name, model_acc in accuracies.items():
  print(name)
  for model, acc in model_acc.items():
    print('\t%s: %2.2f' % (model, acc))
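
One possible way to fill in the "extra hyper-params" placeholder in the training loop is to pair each model class with a dictionary of keyword arguments. The sketch below does this on synthetic data from `make_classification`. The specific values (`C`, `max_iter`, `n_estimators`) are arbitrary assumptions for illustration, not tuned settings for this task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own.
toy_X, toy_y = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(toy_X, toy_y, test_size=0.25, random_state=0)

# Each entry pairs a model class with the keyword arguments to construct it.
toy_models = {
    "logreg": (LogisticRegression, {"C": 1.0, "max_iter": 1000}),
    "rand_forest": (RandomForestClassifier, {"n_estimators": 50, "random_state": 0}),
}

toy_accuracies = {}
for model_name, (model_class, params) in toy_models.items():
    toy_clf = model_class(**params).fit(X_tr, y_tr)
    toy_accuracies[model_name] = toy_clf.score(X_va, y_va)
    print("%s: %.2f" % (model_name, toy_accuracies[model_name]))
```

Keeping the hyperparameters next to the class makes it easy to add several configurations of the same model under different names and compare them in the same table.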