# The Stanford Natural Language Inference Dataset

SNLI turns questions like "is a dachshund a dog?" into a machine learning task. The goal of SNLI is to handle not only **hypernym** relationships like this one, but also other types of entailment that encode "common sense" knowledge about the world. This dataset has become a staple benchmark in natural language understanding research.

This is a three-way classification task. The model uses embeddings to estimate the probability that each `Hypothesis` is `neutral` with respect to the original `Text`, or that it stands in an `entailment` or `contradiction` relation to it. If the `Hypothesis` is **entailed** by the `Text`, then whenever the `Text` is true, the `Hypothesis` MUST be true as well. If the `Hypothesis` **contradicts** the `Text`, then the two cannot both be the case -- that is, whenever the `Text` is true, the `Hypothesis` MUST be false. If the `Hypothesis` is **neutral**, the `Text` does not settle the question either way.
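
To make the three labels concrete, here is a minimal sketch that parses a few lines shaped like SNLI records. The field names `sentence1`, `sentence2`, and `gold_label` are the ones used later in this notebook; the sentences themselves are invented for illustration and are not taken from the dataset.

```python
import json
from collections import Counter

# Hypothetical records in a JSONL shape (one JSON object per line).
# The sentences are made up for illustration; they are not from SNLI.
toy_jsonl = "\n".join([
    json.dumps({"sentence1": "A dachshund runs across the yard.",
                "sentence2": "A dog is outside.",
                "gold_label": "entailment"}),
    json.dumps({"sentence1": "A dachshund runs across the yard.",
                "sentence2": "The dog is asleep on the couch.",
                "gold_label": "contradiction"}),
    json.dumps({"sentence1": "A dachshund runs across the yard.",
                "sentence2": "The dog is chasing a squirrel.",
                "gold_label": "neutral"}),
])

# parse each non-empty line into a dict
toy_records = [json.loads(line) for line in toy_jsonl.split("\n") if line != ""]

# how many examples do we have per label?
label_counts = Counter(record["gold_label"] for record in toy_records)

for record in toy_records:
    print(f'{record["gold_label"]:>13}: {record["sentence1"]} => {record["sentence2"]}')
```

Note that the entailment example only works if you know that a dachshund is a dog, which is exactly the kind of knowledge the task is meant to probe.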

There are approximately 550,000 **training** items, and 10,000 **test** items. We use the test items to evaluate a model that was fit on the training items -- we don't want our models to just memorize the training data.



In [1]:
import json
import os
from urllib.request import urlretrieve
from zipfile import ZipFile

SNLI_URL = 'https://nlp.stanford.edu/projects/snli/snli_1.0.zip'
SNLI_ZIP = './snli_1.0.zip'

# download the archive once; urlretrieve avoids depending on a system `wget`
if not os.path.exists(SNLI_ZIP):
  urlretrieve(SNLI_URL, SNLI_ZIP)

# read every file in the archive into memory as bytes
with ZipFile(SNLI_ZIP) as archive:
  data = {name: archive.read(name) for name in archive.namelist()}

# break training data down into lines
training_file_contents = data['snli_1.0/snli_1.0_train.jsonl'].decode('utf-8')

# store as jsons (one JSON object per non-empty line)
training_jsons = []
for x in training_file_contents.split("\n"):
  if x!='':
    training_jsons.append(json.loads(x))

## Getting a model ready for our NLI prediction task

Goal: Predict one of three labels in SNLI given a preamble (`Text`) and a sentence that is potentially implied by that text (`Hypothesis`).

To do this the modern way, we are going to need to extract word or sentence embedding representations from a language model.

In [None]:
!pip install transformers

In [None]:
from transformers import DistilBertModel, DistilBertTokenizer 
import numpy as np

model = DistilBertModel.from_pretrained(
    'distilbert-base-uncased', output_hidden_states=True)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [None]:
def get_sentence_embedding(sentence, neural_model, neural_tokenizer,
                           cls_token, layer_number):
  tokenized = neural_tokenizer(sentence, return_tensors='pt')
  embeddings = neural_model(**tokenized)['hidden_states']
  # one tensor per layer, each of shape (batch=1, num_tokens, hidden_size)
  layerwise_embeddings = embeddings[layer_number]
  if cls_token:
    # then grab very first index (the [CLS] token of our single sentence)
    out_embedding = layerwise_embeddings[0, 0].detach().numpy().flatten()
  else:
    # then average all indices (mean over the token axis)
    out_embedding = layerwise_embeddings.detach().numpy().mean(axis=1).flatten()

  return out_embedding

In [None]:
def process_snli_distilbert(jsons: list, neural_model, neural_tokenizer,
                            cls_token=True, layer_number=-1):
  """
  This function will convert all our jsons into a dataset for sklearn
  jsons: list[dict] of SNLI data
  neural_model: e.g., DistilBertModel from huggingface
  neural_tokenizer: e.g., DistilBertTokenizer from huggingface
  cls_token: Which word embedding(s) do you want (defaults to CLS/first)
  layer_number: What layer do you want (defaults to last)
  """
  Xs, ys = [], []
  for record in jsons:
    sent1, sent2 = record['sentence1'], record['sentence2']
    label = record['gold_label']
    s1_embedding = get_sentence_embedding(
        sent1, neural_model, neural_tokenizer,
        cls_token, layer_number)
    s2_embedding = get_sentence_embedding(
        sent2, neural_model, neural_tokenizer,
        cls_token, layer_number)
    # convolution
    convolved = np.convolve(s1_embedding, s2_embedding)
    # concatenate into one long vector
    embedding = np.hstack([s1_embedding, s2_embedding, convolved])
    Xs.append(embedding)
    ys.append(label)
  return Xs, ys

In [None]:
process_snli_distilbert([{"sentence1": "My sentence", "sentence2": "Another sentence", "gold_label": "neutral"}],
                        model, tokenizer)

### What other representations could we try to use to characterize the relationships between two sentences?

## Building our classifier

We are going to create a standard classifier -- the goal of this model is to find the best way to use the sentence embeddings to carve up the space of categories. When we build a **linear classifier** we are trying to find **lines** that separate our categories (**classes**). The property of our data that we are hoping for is **linear separability**. In the SNLI context, we have 3 classes (entailment, contradiction, neutral), and up to 3071 **slopes** (one per input feature) we can learn on each of our lines. (This is just like $y = mx + b$ in school!)
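
Where does 3071 come from? The sketch below is a toy illustration, not the notebook's pipeline. It uses a hand-written `full_convolve` helper (a hypothetical stand-in for `np.convolve` in its default "full" mode) on two tiny vectors, then repeats the length arithmetic for 768-dimensional embeddings, which is the hidden size of `distilbert-base-uncased`.

```python
def full_convolve(a, b):
    """Full discrete convolution; the output has len(a) + len(b) - 1 entries."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out

# tiny stand-ins for two sentence embeddings
s1 = [1.0, 2.0, 3.0]
s2 = [0.0, 1.0, 0.5]

convolved = full_convolve(s1, s2)
# same layout as the feature vector built above: [s1, s2, convolution]
features = s1 + s2 + convolved

# the same arithmetic for two 768-dimensional embeddings
hidden_size = 768
n_features = hidden_size + hidden_size + (2 * hidden_size - 1)

print(convolved, len(features), n_features)
```

Each of those features gets its own slope, so the two embeddings contribute 768 slopes apiece and the convolution contributes the remaining 1535.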

<center><img src="https://machinelearningmastery.com/wp-content/uploads/2020/01/Scatter-Plot-of-Multi-Class-Classification-Dataset.png" width=500 /></center>

If we can draw lines that separate the blue blob from the orange and green, and a line that separates the orange from the green, we have a nice lil classifier that knows where to put every new data point. While this plot is in two dimensions, our data are much bigger than that. The math is thankfully still the same!
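
As a rough sketch of what "knowing where to put a new data point" means, the snippet below scores a two-feature point against one line per class and turns the scores into probabilities with a softmax. The weights and biases are made-up numbers chosen for illustration, and `predict_proba` here is a hand-written helper, not the scikit-learn method of the same name.

```python
import math

def predict_proba(x, weights, biases):
    """Softmax over one linear score (w . x + b) per class."""
    scores = {label: sum(w_i * x_i for w_i, x_i in zip(w, x)) + biases[label]
              for label, w in weights.items()}
    biggest = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - biggest) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# hypothetical slopes (m) and intercepts (b) for a 2-feature problem
weights = {"entailment": [2.0, 0.0],
           "contradiction": [-2.0, 0.0],
           "neutral": [0.0, 1.0]}
biases = {"entailment": 0.0, "contradiction": 0.0, "neutral": -0.5}

new_point = [1.5, 0.2]
probs = predict_proba(new_point, weights, biases)
prediction = max(probs, key=probs.get)  # the class with the highest probability

print(prediction, probs)
```

With 3071 features instead of 2, the only change is that each weight list gets longer.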

In [None]:
# downsample because nobody has time to embed 550,000 sentences lol
sub_training_jsons = [x for i, x in enumerate(training_jsons) if i%100==0]

# now create our training dataset
train_X, train_y = process_snli_distilbert(sub_training_jsons, model, tokenizer,
                                           cls_token=True, layer_number=-1)

In [None]:
from sklearn.linear_model import LogisticRegression
import numpy as np

classifier = LogisticRegression(max_iter=1000)
classifier.fit(np.vstack(train_X), train_y)

## Evaluating our classifier

In order to know if our models are any good, we first want to see if our model is learning anything from our training data. To do this, we have to compute some kind of accuracy measure. For this, we will need to think about how to count up incorrect guesses.

<center><img src="https://miro.medium.com/max/1024/1*u8PgZ_84no_swpLnuMf-PQ.png" width=500 /></center>

In our **multiclass** classification task for SNLI, we have three labels. But, let's pretend we only have two for now: **entailment** and **contradiction**. Our models will make guesses about **entailment** and **contradiction** -- so we just need to count up all four of these different combinations.

| Gold Label      | Predicted==entailment | Predicted==contradiction|
| ----------- | ----------- | -----------|
| entailment      | Correct       | Incorrect |
| contradiction   | Incorrect        | Correct |


When we evaluate our models, we are often using the proportions of correct responses relative to different kinds of errors. This is what the two estimates of $Precision$ and $Recall$ are for.
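
Before turning those proportions into scores, the four cells of the table above can be counted directly. This is a minimal sketch with made-up gold labels and guesses, not output from our classifier.

```python
from collections import Counter

# made-up gold labels and model guesses for six sentence pairs
gold = ["entailment", "entailment", "entailment",
        "contradiction", "contradiction", "contradiction"]
pred = ["entailment", "entailment", "contradiction",
        "contradiction", "entailment", "contradiction"]

# count every (gold, predicted) combination
confusion = Counter(zip(gold, pred))

# print one row per gold label, one column per predicted label
labels = ["entailment", "contradiction"]
for g in labels:
    print(g, [confusion[(g, p)] for p in labels])
```

The diagonal cells (gold and predicted agree) are the correct guesses; everything off the diagonal is an error of one kind or another.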

<center>
$\text{Precision} = p(\text{gold label}==\text{entailment} | \text{predicted}==\text{entailment})$
</center>

> Pokemon analogy: How many of the Pokemon that you caught were Pikachus?

<center>
$\text{Recall} = p(\text{predicted}==\text{entailment} | \text{gold label}==\text{entailment})$
</center>

> Pokemon analogy: How many of all of the Pikachus in the world did you catch?

And then, if we want to combine these two things together, we compute the F score, which allows us to weigh Precision and Recall. The resulting score is a **harmonic mean**. If we weight them equally, then we compute what is called $F_1$.

<center>
$F_1 = 2 \cdot \large \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
</center>
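
Here is a small worked example of the three quantities for the `entailment` label. The labels are invented, and `precision_recall_f1` is a hand-written helper for illustration rather than a library function.

```python
def precision_recall_f1(gold, pred, target):
    """Precision, recall, and F1 for one target label."""
    tp = sum(g == target and p == target for g, p in zip(gold, pred))  # hits
    fp = sum(g != target and p == target for g, p in zip(gold, pred))  # false alarms
    fn = sum(g == target and p != target for g, p in zip(gold, pred))  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["entailment"] * 5 + ["contradiction"] * 5
pred = (["entailment"] * 3 + ["contradiction"] * 2      # 3 hits, 2 misses
        + ["entailment"] * 1 + ["contradiction"] * 4)   # 1 false alarm

precision, recall, f1 = precision_recall_f1(gold, pred, "entailment")
print(precision, recall, round(f1, 3))
```

In Pokemon terms: we threw four Poke Balls that we thought were at Pikachus and three of them really were (precision 3/4), but there were five Pikachus out there and we only caught three (recall 3/5).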

In [None]:
### Scoring our model on guesses for the training data
from sklearn.metrics import f1_score
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

f1_score(train_y,
         classifier.predict(np.vstack(train_X)),
         average='micro')

## Micro vs. Macro F1

From the scikit-learn documentation:

'micro':

    Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':

    Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
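
To see how the two averages can disagree, here is a hand-rolled sketch on an imbalanced toy dataset. It follows the definitions quoted above but is not scikit-learn's implementation, and the labels are invented so that the model over-predicts the majority class.

```python
def counts(gold, pred, label):
    """True positives, false positives, false negatives for one label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    # algebraically the same as 2 * P * R / (P + R)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

labels = ["entailment", "contradiction", "neutral"]
gold = ["entailment"] * 6 + ["contradiction"] * 2 + ["neutral"] * 2
pred = ["entailment"] * 6 + ["entailment", "contradiction"] + ["entailment"] * 2

per_label = {label: counts(gold, pred, label) for label in labels}

# macro: compute F1 per label, then take the unweighted mean
macro_f1 = sum(f1_from_counts(*c) for c in per_label.values()) / len(labels)

# micro: add up the counts across labels first, then compute a single F1
micro_f1 = f1_from_counts(*(sum(column) for column in zip(*per_label.values())))

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(round(macro_f1, 3), round(micro_f1, 3), accuracy)
```

The micro score matches plain accuracy here because every item gets exactly one label, while the macro score is dragged down by the `neutral` class that the toy model never predicts.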

In [None]:
from sklearn.metrics import f1_score

f1_score(train_y,
         # our predicted training labels
         classifier.predict(np.vstack(train_X)),
         average='macro')

In [None]:
# accuracy
sum(train_y == classifier.predict(np.vstack(train_X))) / len(train_y)

In [None]:
from collections import Counter

Counter(classifier.predict(np.vstack(train_X)))

At least our model isn't just predicting one class!

In [None]:
# break test data down into lines
test_file_contents = data['snli_1.0/snli_1.0_test.jsonl'].decode('utf-8')

# store as jsons
test_jsons = []
for x in test_file_contents.split("\n"):
  if x!='':
    test_jsons.append(json.loads(x))

# create our test dataset (same embedding settings as the training data)
test_X, test_y = process_snli_distilbert(test_jsons, model, tokenizer,
                                         cls_token=True, layer_number=-1)

In [None]:
# assess performance on our test data
f1_score(test_y,
         classifier.predict(np.vstack(test_X)),
         average='macro')

In [None]:
# assess performance on our test data
f1_score(test_y,
         classifier.predict(np.vstack(test_X)),
         average='micro')

Yikes! That's not great.

In [None]:
# raw accuracy is just micro f1
sum(test_y == classifier.predict(np.vstack(test_X))) / len(test_y)

# Semantic role labeling

Semantic roles are ways of characterizing the elements that make sentences similar. When we say these two sentences:

1. Bollywood Bistro serves vegetarian food
2. They serve vegetarian dishes at Bollywood Bistro

We know that these two things "mean" roughly the same thing. That is, they both express the same underlying truth about the world -- there exists some restaurant Bollywood Bistro, and they serve food, which is appropriate for vegetarians to eat. That is, they have the same **canonical form**. We can express both of these statements with something like these **predicates**:

`Serves(BollywoodBistro, VegetarianFood)`. Stated another way:

* There exists some food item on Bollywood Bistro's menu that vegetarians can eat

If we want to know whether some restaurant ($x$) serves some particular kind of food ($y$), we can make use of **variables** and create an even more general formula:

`Serves(Restaurant(x), FoodType(y))`

When we look at the structure of predicates like `Serves` above, we might want to break down the **arguments** of those predicates. Often, we want to **label** those arguments so we can evaluate whether two sentences have the same canonical form.
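
One way to picture this is to store each canonical form as a small data structure and compare them. In the sketch below the canonical forms are written by hand (no parser is involved), and `Predicate` and `matches` are hypothetical helpers in which `None` plays the role of a variable.

```python
from typing import NamedTuple

class Predicate(NamedTuple):
    name: str
    args: tuple

# Hand-written canonical forms for the two example sentences.
canonical_forms = {
    "Bollywood Bistro serves vegetarian food":
        Predicate("Serves", ("BollywoodBistro", "VegetarianFood")),
    "They serve vegetarian dishes at Bollywood Bistro":
        Predicate("Serves", ("BollywoodBistro", "VegetarianFood")),
}

def matches(pattern, fact):
    """True if `fact` fits `pattern`; None in the pattern acts as a variable."""
    return (pattern.name == fact.name
            and len(pattern.args) == len(fact.args)
            and all(p is None or p == f for p, f in zip(pattern.args, fact.args)))

# two sentences "mean the same thing" if their canonical forms are equal
first, second = canonical_forms.values()
same_meaning = first == second

# Serves(x, VegetarianFood): which restaurants serve vegetarian food?
query = Predicate("Serves", (None, "VegetarianFood"))
restaurants = [fact.args[0] for fact in set(canonical_forms.values())
               if matches(query, fact)]

print(same_meaning, restaurants)
```

The hard part, of course, is getting from the raw sentence to the predicate automatically, which is where labeled arguments come in.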

Let's start with a more "human readable" framework before we move to a more abstract one.

## FrameNet

https://framenet.icsi.berkeley.edu/fndrupal/

FrameNet is an **annotation schema** that has been used in **semantic role labeling**. 

Like our `Serves(Restaurant,FoodType)` example before, a restaurant serving food is similar to other types of entities offering services or products, so FrameNet lumps these two things together into an `Offering` frame. That is, similar types of events correspond roughly to the same **semantic frame** (hence the name).

All Frames have **seed** sentences or **exemplars** that highlight good examples of the use of a Frame.

FrameNet's documentation points out a tendency among data annotation types to either **lump** or **split** categories. An older version of FrameNet contained about 300 frames and under 500 semantic roles. The newer FrameNet lists 877 frames and 1,068 role types. This greatly increases **sparsity** but produces much more semantically informative labels.

<center><h4>lumping (few labels; PropBank) <--------------------------------------------> splitting (lots of labels; FrameNet)</h4></center>

* The FrameNet entry for "serve": https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu17140.xml?mode=lexentry
* Let's take a look at the Offering page in FrameNet. https://framenet2.icsi.berkeley.edu/fnReports/data/frameIndex.xml?frame=Offering
* For example, one in-depth annotation of the `Offering` frame can be found here: https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu15659.xml?mode=annotation

In [None]:
#@title Example annotation for preference schema
from IPython.display import HTML

schema='An <font style="color: #FFFFFF; background-color: #FF0000;">Experiencer</font>' \
       ' has a greater desire to participate in some <font style="color: #FFFFFF;' \
       ' background-color: #00008B;">Event</font>, as against another (contextually' \
       ' recoverable) event which exhibits a specific <font style="color: #FFFFFF;' \
       ' background-color: #A52A2A;">Contrast</font> with the <font style="color: #FFFFFF;' \
       ' background-color: #00008B;">Event</font>. Alternatively, the <font style="color: #FFFFFF;' \
       ' background-color: #FF0000;">Experiencer</font> may have a greater desire that some' \
       ' <font style="color: #FFFFFF; background-color: #0000FF;">Focal_participant</font>' \
       ' participate in the <font style="color: #FFFFFF; background-color: #00008B;">Event</font>.' \
       ' The <font style="color: #FFFFFF; background-color: #00BFFF;">Location_of_event</font> may also be mentioned.' \
       '<br><font style="color: #FFFFFF; background-color: #9F79EE;">Why</font> do <font style="color: #FFFFFF; background-color: #FF0000;">women</font> <font style="color: rgb(255, 255, 255); background-color: rgb(0, 0, 0); text-transform:uppercase;">prefer</font> <font style="color: #FFFFFF; background-color: #0000FF;">manly faces</font>?<br>' \
       '<br><font style="color: #FFFFFF; background-color: #FF0000;">I</font> <font style="color: rgb(255, 255, 255); background-color: rgb(0, 0, 0); text-transform:uppercase;">prefer</font> <font style="color: #FFFFFF; background-color: #0000FF;">open source programs</font> <font style="color: #FFFFFF; background-color: #A52A2A;">over proprietary ones</font>.<br>' \
       '<br><font style="color: #FFFFFF; background-color: #FF0000;">Other customers</font> <font style="color: rgb(255, 255, 255); background-color: rgb(0, 0, 0); text-transform:uppercase;">prefer</font> <font style="color: #FFFFFF; background-color: #00008B;">to send us an order together with a cheque</font>.<br>' \
       '<br><font style="color: #FFFFFF; background-color: #FF0000;">I</font> <font style="color: rgb(255, 255, 255); background-color: rgb(0, 0, 0); text-transform:uppercase;">prefer</font> <font style="color: #FFFFFF; background-color: #0000FF;">my tartar sauce</font> <font style="color: #FFFFFF; background-color: #00BFFF;">on fish</font>.<br>'
HTML(schema)

## PropBank

From Das, Chen, Martins, Schneider, and Smith (2014).
> PropBank defines **core roles** ARG0 through ARG5, which receive different interpretations for different predicates. Additional modifier roles ARGM-* include ARGM-TMP (temporal) and ARGM-DIR (directional)

> The PropBank representation therefore has a small number of roles, and the training data set comprises some 40,000 sentences, thus making the semantic role labeling task an attractive one from the perspective of machine learning.

The goal of PropBank is to give a sufficiently abstract, dense set of labels. Because there are only a few labels, each one is seen many times in the training data, so data sparsity is less likely to hurt learning or lead to overfitting.

#### Plain sentence:
    But I will never learn that.

#### Leaves:

    0   But
    1   I
           coref: IDENT        m_0   1-1    I
    2   will
    3   never
    4   learn
           sense: learn-v.1
           prop:  learn.01
            v          * -> 4:0,  learn
            ARGM-DIS   * -> 0:0,  But
            ARG0       * -> 1:1,  I
            ARGM-MOD   * -> 2:0,  will
            ARGM-TMP   * -> 3:1,  never
            ARG1       * -> 5:1,  that
    5   that
           coref: IDENT        284   5-5    that
    6   /.
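
The role lines in this annotation are regular enough to pull apart with a regular expression. The sketch below assumes the layout shown above (role label, `* ->`, a pair of numbers, then the covered text). The first number lines up with the token numbering in the leaves; the second number is kept but not interpreted here.

```python
import re

annotation = """\
    4   learn
           sense: learn-v.1
           prop:  learn.01
            v          * -> 4:0,  learn
            ARGM-DIS   * -> 0:0,  But
            ARG0       * -> 1:1,  I
            ARGM-MOD   * -> 2:0,  will
            ARGM-TMP   * -> 3:1,  never
            ARG1       * -> 5:1,  that
"""

# role label, then "* ->", then two colon-separated numbers, then the text
ROLE_LINE = re.compile(r"^\s*(\S+)\s+\*\s+->\s+(\d+):(\d+),\s+(.+?)\s*$")

roles = {}
for line in annotation.splitlines():
    match = ROLE_LINE.match(line)
    if match:  # lines such as "sense: ..." and "prop: ..." do not match
        label, token_index, _second_number, text = match.groups()
        roles[label] = (int(token_index), text)

for label, (token_index, text) in roles.items():
    print(f"{label:<9} token {token_index}: {text}")
```

This gives us a dictionary from role labels to words, which is the kind of structure a classifier could consume as features.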


### PropBank predicates are associated with "frames" as well

But these frames are quite different from the ones in FrameNet. Their roles are numbered instead of named! Check out the `n` properties of each `<role>` object corresponding to each of the roles in the `prefer` schema or frame from `PropBank` below:

```xml
<roles>
    <role descr="chooser, agent" f="" n="0">
        <vnrole vncls="31.2" vntheta="experiencer"/>
        <vnrole vncls="32.1-1-1" vntheta="experiencer"/>
    </role>
    <role descr="entity chosen" f="" n="1">
        <vnrole vncls="31.2" vntheta="theme"/>
        <vnrole vncls="32.1-1-1" vntheta="theme"/>
    </role>
    <role descr="entity compared to" f="" n="2"/>
    
    
    <note/>
</roles>
```

Each role in a PropBank frame has a natural language description (e.g., "entity chosen") but also some labels from formal linguistics ("experiencer", "theme"). 
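
As a small illustration, the XML above can be read with Python's standard library to list each numbered role next to its description and linguistic labels. Mapping `n="0"` to the name `ARG0` is an assumption made here to line up with the ARG0 through ARG5 naming in the quote above.

```python
import xml.etree.ElementTree as ET

roles_xml = """
<roles>
    <role descr="chooser, agent" f="" n="0">
        <vnrole vncls="31.2" vntheta="experiencer"/>
        <vnrole vncls="32.1-1-1" vntheta="experiencer"/>
    </role>
    <role descr="entity chosen" f="" n="1">
        <vnrole vncls="31.2" vntheta="theme"/>
        <vnrole vncls="32.1-1-1" vntheta="theme"/>
    </role>
    <role descr="entity compared to" f="" n="2"/>
    <note/>
</roles>
"""

root = ET.fromstring(roles_xml.strip())

arg_roles = {}
for role in root.findall("role"):
    # collect the distinct vntheta labels attached to this role, if any
    thetas = sorted({vn.get("vntheta") for vn in role.findall("vnrole")})
    arg_roles["ARG" + role.get("n")] = (role.get("descr"), thetas)

for name, (description, thetas) in arg_roles.items():
    print(name, description, thetas)
```

Notice that the third role has a description but no linguistic labels attached.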


## Automatic semantic role labeling

The task is typically broken down into three phases:

1. Identifying which chunks correspond to different aspects of an event schema or frame (e.g., identifying the **span** of the argument) -- **target identification**
2. **Frame identification**
3. Identifying what role the identified chunk occupies in a given predicate, or **argument identification**

Getting anything wrong earlier in these stages will lead to worse performance downstream. You need a good parser (1), a good understanding of semantic or event frames (2), and an ability to map contents onto predicates (3). Typically, the first is done using **shallow parsers**. 

The second and third points are usually accomplished using **analogy** or **clustering**. Statistical models attempt to assign sentences to a cluster based on how similar that sentence is to the _exemplars_ defined earlier. Once a Frame has been identified, then it is just a matter of finding the arguments.
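
A deliberately simple sketch of the "similarity to exemplars" idea: assign a sentence to whichever frame has the most similar exemplar, using bag-of-words cosine similarity. The exemplar lists are toy examples assembled for illustration (some reuse sentences from this notebook, one is invented), and real systems use much richer representations than word counts.

```python
import math
from collections import Counter

def bag_of_words(sentence):
    """Lowercased word counts for one sentence."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# toy exemplars per frame (hypothetical, for illustration only)
exemplars = {
    "Offering": ["Bollywood Bistro serves vegetarian food",
                 "The cafe offers fresh coffee"],
    "Preference": ["I prefer open source programs over proprietary ones",
                   "Other customers prefer to send us an order together with a cheque"],
}

def identify_frame(sentence):
    """Pick the frame whose closest exemplar is most similar to the sentence."""
    query = bag_of_words(sentence)
    scores = {frame: max(cosine(query, bag_of_words(ex)) for ex in sentences)
              for frame, sentences in exemplars.items()}
    return max(scores, key=scores.get)

frame_a = identify_frame("They serve vegetarian dishes at Bollywood Bistro")
frame_b = identify_frame("I prefer my tartar sauce on fish")
print(frame_a, frame_b)
```

Word overlap is brittle: "serve" and "serves" count as different words here, which is one reason embeddings are attractive for this step.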

#### Can you think of a reason that we might want to know the arguments of a frame before we know exactly which frame it is?


# Classifiers for semantic role labeling

While it would take too long to run this in class, I'd like to write down our ideas for how to incorporate features such as **propositions** to represent our utterances, so we can learn that two sentences are similar. 

What tools do we have already for creating **proposition representations** that we can use in a classifier?

* 
* 
* 

How might we represent any one of these things mathematically? What is hard about doing this?