# What kind of semantic knowledge do we want?

We want systems that can handle tasks such as:

* Textual entailment (deciding whether $X$ follows from $Y$)
* Building up knowledge of facts about the world

A lot of knowledge already exists in structured datasets like WordNet, which contain a deep hierarchy of words and how they relate to each other. But, many more things are true in the world than are in any of our databases :)

So, we must be creative.

# Building a classifier to (try to) do Natural Language Inference (NLI) from sentence embeddings

Tools you will take home today:

* How to interact with a standard natural language inference dataset (SNLI: https://nlp.stanford.edu/projects/snli/)
* A better sense of the different **types of inference** and **entailment** there can be
* How to get word embeddings from a model from HuggingFace to use in a classifier
* How to build a simple linear classifier (logistic regression) in scikit-learn to do inference classification


# "Solving" entailment with statistics

The general insight behind natural language inference datasets is that **entailment** is a relationship that can be **learned** or **approximated**. With sufficiently large data, we should be able to reconstruct a very large proportion of the knowledge ordinary people have about the real world. Then, we just need to build some kind of **model** that can take a representation of a text, and either:

1. Enumerate all valid facts that follow from that knowledge, or,
2. Verify for a given fact whether it follows from existing knowledge

Nowadays, we build **statistical** models of this relationship, seeking to find what regularities predict entailment between two propositions.

In formal semantics, we would think about entailment differently -- perhaps by thinking of all valid subsets of a category. For example, if I ask you whether a _Dachshund_ is a kind of dog or not, the answer is "certainly yes" because you know something about what a Dachshund is.

<center><img src="https://s3.amazonaws.com/cdn-origin-etr.akc.org/wp-content/uploads/2017/11/07143625/Dachshund-standing-outdoors.jpg" width=400 /></center>

We can even see what WordNet has to say about Dachshunds:


In [35]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

print(wordnet.synset("dachshund.n.01").hypernyms())
print(wordnet.synset("hunting_dog.n.01").hypernyms())
print(wordnet.synset("dog.n.01").hypernyms())
print(wordnet.synset("domestic_animal.n.01").hypernyms())

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[Synset('hunting_dog.n.01')]
[Synset('dog.n.01')]
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
[Synset('animal.n.01')]
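
The hypernym chain printed above can be followed automatically. Here is a minimal sketch that copies just those four WordNet entries into a plain dictionary and walks upward to answer "is X a kind of Y?". The names `HYPERNYMS` and `is_a` are made up for this illustration and are not part of NLTK.

```python
# Toy hypernym graph copied by hand from the WordNet output above.
# Each synset name maps to the list of its direct hypernyms.
HYPERNYMS = {
    "dachshund.n.01": ["hunting_dog.n.01"],
    "hunting_dog.n.01": ["dog.n.01"],
    "dog.n.01": ["canine.n.02", "domestic_animal.n.01"],
    "domestic_animal.n.01": ["animal.n.01"],
}


def is_a(synset: str, ancestor: str, graph: dict = HYPERNYMS) -> bool:
    """Return True if `ancestor` is reachable from `synset` via hypernym links."""
    frontier = [synset]
    seen = set()
    while frontier:
        current = frontier.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        # synsets we have no entry for simply have no known hypernyms
        frontier.extend(graph.get(current, []))
    return False


print(is_a("dachshund.n.01", "dog.n.01"))     # True
print(is_a("dachshund.n.01", "animal.n.01"))  # True
print(is_a("dog.n.01", "dachshund.n.01"))     # False
```

This only works for facts someone has already written down in the graph, which is exactly the limitation that motivates learning entailment from data.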


# The Stanford Natural Language Inference Dataset

SNLI attempts to turn the question of asking whether a dachshund is a dog or not into a machine learning task. The goal of SNLI is to be able to handle all kinds of questions like the **hypernym** relationships above, but also other types of entailment that encode "common sense" knowledge about the world. This dataset has become a staple test dataset in natural language understanding research.

This is a three-way classification task. The model is going to use the word embeddings to estimate the probability of each label for a `Text`/`Hypothesis` pair: `entailment`, `contradiction`, or `neutral`. If the `Hypothesis` is **entailed** by the `Text`, then the `Hypothesis` MUST be true whenever the `Text` is true. If the `Hypothesis` **contradicts** the `Text`, then the `Hypothesis` MUST be false whenever the `Text` is true. If it is **neutral**, the `Text` does not settle the question either way.
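
To make "estimate the probability of each label" concrete, here is a small sketch of one common way to turn three raw scores into three probabilities (a softmax). The scores are made up for illustration and do not come from any real model; `LABELS` and `softmax` are names chosen for this example.

```python
import math

LABELS = ["entailment", "neutral", "contradiction"]


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    largest = max(scores)
    # subtracting the largest score keeps exp() from overflowing
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for one Text/Hypothesis pair (not from a real model).
example_scores = [2.0, 0.5, -1.0]
probabilities = softmax(example_scores)
prediction = LABELS[probabilities.index(max(probabilities))]

for label, p in zip(LABELS, probabilities):
    print(f"{label}: {p:.3f}")
print("predicted:", prediction)
```

The predicted label is simply the one with the highest probability.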

There are approximately 550,000 **training** items, and 10,000 **test** items. We use the test items to evaluate a model that was fit on the training items -- we don't want our models to just memorize the training data.



In [None]:
!wget https://nlp.stanford.edu/projects/snli/snli_1.0.zip

from zipfile import ZipFile
import json

data = ZipFile('./snli_1.0.zip')
data = {name: data.read(name) for name in data.namelist()}

# break training data down into lines
training_file_contents = data['snli_1.0/snli_1.0_train.jsonl'].decode('utf-8')

# store as jsons
training_jsons = []
for x in training_file_contents.split("\n"):
  if x!='':
    training_jsons.append(json.loads(x))

In [12]:
#@title Example sentence pairs and label

def print_snli(json_line: dict):
  text = json_line['sentence1']
  hypothesis = json_line['sentence2']
  category = json_line['gold_label']
  print(f"Text: {text}")
  print(f"Hypothesis: {hypothesis}")
  print(f"Text-Hypothesis Relationship: {category}")
  
print_snli(training_jsons[0])

Text: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is training his horse for a competition.
Text-Hypothesis Relationship: neutral


In [39]:
for line in training_jsons[0:40]:
  print_snli(line)
  print()

Text: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is training his horse for a competition.
Text-Hypothesis Relationship: neutral

Text: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is at a diner, ordering an omelette.
Text-Hypothesis Relationship: contradiction

Text: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is outdoors, on a horse.
Text-Hypothesis Relationship: entailment

Text: Children smiling and waving at camera
Hypothesis: They are smiling at their parents
Text-Hypothesis Relationship: neutral

Text: Children smiling and waving at camera
Hypothesis: There are children present
Text-Hypothesis Relationship: entailment

Text: Children smiling and waving at camera
Hypothesis: The kids are frowning
Text-Hypothesis Relationship: contradiction

Text: A boy is jumping on skateboard in the middle of a red bridge.
Hypothesis: The boy skates down the sidewalk.
Text-Hypothesis Relationship: cont

## Getting a model ready for our NLI prediction task

Goal: Predict one of three labels in SNLI given a preamble (`Text`) and a sentence that is potentially implied by that text (`Hypothesis`).

To do this the modern way, we are going to need to extract word or sentence embedding representations from a language model. 

### Given what you know about the data above, what features do you think are going to matter for our models?

In [None]:
!pip install transformers

In [None]:
from transformers import DistilBertModel, DistilBertTokenizer 

model = DistilBertModel.from_pretrained(
    'distilbert-base-uncased', output_hidden_states=True)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [6]:
def get_sentence_embedding(sentence, neural_model, neural_tokenizer,
                           cls_token, layer_number):
  # tokenize and run the model; hidden_states holds one tensor per layer,
  # each of shape (batch=1, n_tokens, hidden_size)
  tokenized = neural_tokenizer(sentence, return_tensors='pt')
  embeddings = neural_model(**tokenized)['hidden_states']
  layerwise_embeddings = embeddings[layer_number]
  if cls_token:
    # then grab the very first token ([CLS]) of the only sentence in the batch
    out_embedding = layerwise_embeddings[0, 0].detach().numpy()
  else:
    # then average over all tokens (axis 1 is the token axis)
    out_embedding = layerwise_embeddings.detach().numpy().mean(axis=1)
  
  return out_embedding
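
The `cls_token` flag chooses between two ways of squashing a whole sentence into one vector. Here is a toy sketch of the difference using plain Python lists instead of model output; the numbers are made up, and `cls_pool` and `mean_pool` are names invented for this example.

```python
# Toy "hidden states" for one sentence: 3 tokens, 4 dimensions each.
# In the real model the first row would be the [CLS] token's vector.
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],   # position 0 ([CLS])
    [3.0, 2.0, 0.0, 0.0],   # position 1
    [2.0, 4.0, 1.0, 2.0],   # position 2
]


def cls_pool(vectors):
    """Use only the first token's vector as the sentence representation."""
    return list(vectors[0])


def mean_pool(vectors):
    """Average every dimension across all tokens."""
    n_tokens = len(vectors)
    return [sum(column) / n_tokens for column in zip(*vectors)]


print(cls_pool(token_vectors))   # [1.0, 0.0, 2.0, 4.0]
print(mean_pool(token_vectors))  # [2.0, 2.0, 1.0, 2.0]
```

Either way, the result has one number per hidden dimension, no matter how long the sentence was.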

In [7]:
def process_snli_distilbert(jsons: list, neural_model, neural_tokenizer,
                            cls_token=True, layer_number=-1):
  """
  This function will convert all our jsons into a dataset for sklearn
  jsons: list[dict] of SNLI data
  neural_model: e.g., DistilBertModel from huggingface
  neural_tokenizer: e.g., DistilBertTokenizer from huggingface
  cls_token: Which word embedding(s) do you want (defaults to CLS/first)
  layer_number: What layer do you want (defaults to last)
  """
  Xs, ys = [], []
  for record in jsons:
    sent1, sent2 = record['sentence1'], record['sentence2']
    label = record['gold_label']
    # for the heck of it, concatenate the two
    concatenated = sent1 + " " + sent2
    embedding = get_sentence_embedding(
        concatenated, neural_model, neural_tokenizer,
        cls_token, layer_number)
    Xs.append(embedding)
    ys.append(label)
  return Xs, ys

### What other representations could we try to use to characterize the relationships between two sentences?
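
One possibility, offered here only as a sketch: embed the `Text` and the `Hypothesis` separately, then hand the classifier both vectors plus their element-wise difference and product. This kind of combination shows up in some sentence-embedding work, but it is not what the notebook above does. The function name `pair_features` and the tiny vectors are made up for illustration.

```python
def pair_features(u, v):
    """Combine two equal-length sentence vectors into one feature vector.

    Returns [u, v, |u - v|, u * v] concatenated, so a classifier can see
    each sentence on its own as well as where they differ and overlap.
    """
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same length")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff + product


# Made-up 3-dimensional "embeddings" for a Text and a Hypothesis.
text_vec = [1.0, 2.0, 0.0]
hypothesis_vec = [1.0, 0.0, 3.0]
features = pair_features(text_vec, hypothesis_vec)

print(len(features))  # 12
print(features)
```

With 768-dimensional embeddings this would give 3,072 features per pair, so the classifier has more to learn but also more to work with.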

## Building our classifier

We are going to create a standard classifier -- the goal of this model is to find the best way to use the word embeddings to carve up the space of categories. When we build a **linear classifier** we are trying to find **lines** that separate our categories (**classes**). The property of our data that we are hoping for is **linear separability**. In the SNLI context, we have 3 classes (entailment, contradiction, neutral), and for each class up to 768 **slopes** we can learn, one per embedding dimension. (This is just like $y = mx + b$ in school!)

<center><img src="https://machinelearningmastery.com/wp-content/uploads/2020/01/Scatter-Plot-of-Multi-Class-Classification-Dataset.png" width=500 /></center>

If we can draw lines that separate the blue blob from the orange and green, and a line that separates the orange from the green, we have a nice lil classifier that knows where to put every new data point. While this plot is in two dimensions, our data are much bigger than that. The math is thankfully still the same!
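
Here is a tiny sketch of what "the math is still the same" means: each class gets its own slopes and intercept, we compute one score per class, and the highest score wins. The weights below are hand-picked for a 2-dimensional toy problem, not learned from data, and `linear_scores` is a name invented for this example.

```python
def linear_scores(x, weights, biases):
    """Compute one score per class: score_k = w_k . x + b_k."""
    return {
        label: sum(w_i * x_i for w_i, x_i in zip(weights[label], x)) + biases[label]
        for label in weights
    }


# Hand-picked weights for a 2-dimensional toy problem (not learned from data).
# With DistilBERT embeddings each list would have 768 entries instead of 2.
weights = {
    "entailment": [1.0, 0.0],
    "neutral": [0.0, 1.0],
    "contradiction": [-1.0, -1.0],
}
biases = {"entailment": 0.0, "neutral": 0.5, "contradiction": 0.0}

point = [2.0, 1.0]
scores = linear_scores(point, weights, biases)
predicted = max(scores, key=scores.get)

print(scores)
print(predicted)
```

Training a classifier amounts to picking the weights and biases so that as many training points as possible land on the right side of the lines.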

In [40]:
# downsample because nobody has time to embed 550,000 sentences lol
sub_training_jsons = [x for i, x in enumerate(training_jsons) if i%15==0]

# now create our training dataset
train_X, train_y = process_snli_distilbert(sub_training_jsons, model, tokenizer,
                                           cls_token=False, layer_number=-1)

In [None]:
from sklearn.linear_model import LogisticRegression
import numpy as np

classifier = LogisticRegression(max_iter=1000)
classifier.fit(np.vstack(train_X), train_y)

## Evaluating our classifier

In order to know if our models are any good, we first want to see if our model is learning anything from our training data. To do this, we have to compute some kind of accuracy measure. For this, we will need to think about how to count up incorrect guesses.

<center><img src="https://miro.medium.com/max/1024/1*u8PgZ_84no_swpLnuMf-PQ.png" width=500 /></center>

In our **multiclass** classification task for SNLI, we have three labels. But, let's pretend we only have two for now: **entailment** and **contradiction**. Our models will make guesses about **entailment** and **contradiction** -- so we just need to count up all four of these different combinations.

| Gold Label      | Predicted==entailment | Predicted==contradiction|
| ----------- | ----------- | -----------|
| entailment      | Correct       | Incorrect |
| contradiction   | Incorrect        | Correct |
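
Counting up the four cells of this table takes only a few lines. This is a sketch on six made-up items, not on SNLI data.

```python
from collections import Counter

# Made-up gold labels and predictions for six items.
gold = ["entailment", "entailment", "contradiction",
        "contradiction", "entailment", "contradiction"]
pred = ["entailment", "contradiction", "contradiction",
        "entailment", "entailment", "contradiction"]

# Each (gold, predicted) pair is one cell of the table.
cells = Counter(zip(gold, pred))

for gold_label in ["entailment", "contradiction"]:
    for pred_label in ["entailment", "contradiction"]:
        print(f"gold={gold_label:13s} predicted={pred_label:13s} "
              f"count={cells[(gold_label, pred_label)]}")
```

The two cells where gold and predicted agree are the correct guesses; the other two are the two different kinds of error.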


When we evaluate our models, we are often using the proportions of correct responses relative to different kinds of errors. This is what the two estimates of $Precision$ and $Recall$ are for.

<center>
$\text{Precision} = p(\text{gold label}==\text{entailment} | \text{predicted}==\text{entailment})$
</center>

> Pokemon analogy: How many of the Pokemon that you caught were Pikachus?

<center>
$\text{Recall} = p(\text{predicted}==\text{entailment} | \text{gold label}==\text{entailment})$
</center>

> Pokemon analogy: How many of all of the Pikachus in the world did you catch?

And then, if we want to combine these two things together, we compute the F score, which allows us to weigh Precision against Recall. The resulting score is a **harmonic mean**. If we weight them equally, then we compute what is called $F_1$.

<center>
$F_1 = 2 \cdot \large \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
</center>
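
To see the three formulas working together, here is a sketch that computes them by hand for the `entailment` label on five made-up items. The helper name `precision_recall_f1` is invented for this example; in the notebook itself we use scikit-learn's `f1_score`.

```python
def precision_recall_f1(gold, pred, target):
    """Precision, recall, and F1 for one `target` label."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g == target and p == target)
    predicted_pos = sum(1 for p in pred if p == target)   # everything we "caught"
    actual_pos = sum(1 for g in gold if g == target)      # all the Pikachus out there
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels: four true entailments, one true contradiction.
gold = ["entailment", "entailment", "entailment", "entailment", "contradiction"]
pred = ["entailment", "entailment", "contradiction", "contradiction", "entailment"]

precision, recall, f1 = precision_recall_f1(gold, pred, target="entailment")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here 2 of the 3 predicted entailments are right (precision 2/3), but only 2 of the 4 real entailments were found (recall 1/2), and $F_1$ lands between the two at 4/7.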

In [42]:
### Scoring our model on guesses for the training data
from sklearn.metrics import f1_score

f1_score(train_y,
         classifier.predict(np.vstack(train_X)),
         average='micro')

0.6506529977915315

In [43]:
from sklearn.metrics import f1_score

f1_score(train_y,
         # our predicted training labels
         classifier.predict(np.vstack(train_X)),
         average='macro')

0.5005404557061213

In [44]:
# accuracy
sum(train_y == classifier.predict(np.vstack(train_X))) / len(train_y)

0.6506529977915315
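
Notice that the micro-averaged $F_1$ and the accuracy are the same number. When every item has exactly one gold label and one predicted label, that is expected: micro-averaging pools all the hits and misses before computing $F_1$, which works out to plain accuracy. Macro-averaging instead computes $F_1$ per class and averages those, so a class the model almost never gets right pulls the score down. Here is a hand-rolled sketch on made-up data; the function names are invented for this example and it is not scikit-learn's implementation.

```python
def per_class_f1(gold, pred, label):
    """F1 for a single label, written as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred):
    """Unweighted average of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    return sum(per_class_f1(gold, pred, label) for label in labels) / len(labels)


def micro_f1(gold, pred):
    """Pool TP/FP/FN over all classes before computing F1."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p)
    errors = len(gold) - tp
    # every wrong guess is one false positive (for the predicted class)
    # and one false negative (for the gold class)
    denom = 2 * tp + 2 * errors
    return 2 * tp / denom if denom else 0.0


# Made-up data: three well-predicted classes plus one rare label that is missed.
gold = ["entailment"] * 4 + ["neutral"] * 4 + ["contradiction"] * 4 + ["-"]
pred = ["entailment"] * 4 + ["neutral"] * 4 + ["contradiction"] * 4 + ["neutral"]

accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
print(f"accuracy={accuracy:.3f} micro={micro_f1(gold, pred):.3f} "
      f"macro={macro_f1(gold, pred):.3f}")
```

In this toy case a single missed item from a rare class costs the macro score far more than it costs the micro score.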

In [45]:
from collections import Counter

Counter(classifier.predict(np.vstack(train_X)))

Counter({'-': 1,
         'contradiction': 12178,
         'entailment': 13100,
         'neutral': 11398})

At least our model isn't just predicting one class!
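
The `'-'` entry in those counts is worth a second look. SNLI uses `-` as the `gold_label` for pairs where the annotators did not reach a majority, and since `process_snli_distilbert` keeps every record, our classifier has been treating `-` as a fourth class. Below is a sketch of how such records could be dropped before building the dataset. `VALID_LABELS` and `drop_unlabeled` are names made up for this example, and the last record in `sample` is invented.

```python
VALID_LABELS = {"entailment", "neutral", "contradiction"}


def drop_unlabeled(records):
    """Keep only records whose gold_label is one of the three SNLI classes."""
    return [r for r in records if r["gold_label"] in VALID_LABELS]


# Tiny stand-in for training_jsons; the last record's sentences are made up.
sample = [
    {"sentence1": "Children smiling and waving at camera",
     "sentence2": "There are children present",
     "gold_label": "entailment"},
    {"sentence1": "Children smiling and waving at camera",
     "sentence2": "The kids are frowning",
     "gold_label": "contradiction"},
    {"sentence1": "A made-up text.",
     "sentence2": "A made-up hypothesis.",
     "gold_label": "-"},
]

cleaned = drop_unlabeled(sample)
print(len(sample), "->", len(cleaned))
```

In this notebook that would mean calling something like `drop_unlabeled(sub_training_jsons)` before `process_snli_distilbert`. The scores reported here were computed without that step, so they would change if you add it.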

In [46]:
# break test data down into lines
test_file_contents = data['snli_1.0/snli_1.0_test.jsonl'].decode('utf-8')

# store as jsons
test_jsons = []
for x in test_file_contents.split("\n"):
  if x!='':
    test_jsons.append(json.loads(x))

# create our test dataset
test_X, test_y = process_snli_distilbert(test_jsons, model, tokenizer, cls_token=False)

In [47]:
# assess performance on our test data
f1_score(test_y,
         classifier.predict(np.vstack(test_X)),
         average='macro')

0.4707575281996137

Yikes! That's not great.

In [48]:
sum(test_y == classifier.predict(np.vstack(test_X))) / len(test_y)

0.6236

## Preview of Friday!

We will be building more classifiers to keep getting practice in this area.

Next time, we will learn more about what kinds of entailment we might want to capture. How do we create **propositions** to represent our utterances, so we can learn that two sentences are similar?

### Semantic role frameworks

### PropBank

#### Plain sentence:
    But I will never learn that.


#### Leaves:

    0   But
    1   I
           coref: IDENT        m_0   1-1    I
    2   will
    3   never
    4   learn
           sense: learn-v.1
           prop:  learn.01
            v          * -> 4:0,  learn
            ARGM-DIS   * -> 0:0,  But
            ARG0       * -> 1:1,  I
            ARGM-MOD   * -> 2:0,  will
            ARGM-TMP   * -> 3:1,  never
            ARG1       * -> 5:1,  that
    5   that
           coref: IDENT        284   5-5    that
    6   /.
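
As a rough preview of what a **proposition** could look like in code, here is the `learn.01` annotation above rewritten by hand as a Python dictionary, plus a toy comparison that ignores the `ARGM-*` modifiers. Everything here is a sketch: the dictionary layout, the function `same_core_event`, and the second annotation are made up for illustration and are not a PropBank API.

```python
# The learn.01 annotation above, rewritten by hand as a Python dict.
proposition = {
    "predicate": "learn.01",
    "roles": {
        "ARG0": "I",          # roughly, the learner
        "ARG1": "that",       # roughly, the thing learned
        "ARGM-DIS": "But",
        "ARGM-MOD": "will",
        "ARGM-TMP": "never",
    },
}


def core_roles(prop):
    """Numbered arguments only, leaving out the ARGM-* modifiers."""
    return {role: word for role, word in prop["roles"].items()
            if not role.startswith("ARGM")}


def same_core_event(prop_a, prop_b):
    """True if two propositions share a predicate and numbered arguments."""
    return (prop_a["predicate"] == prop_b["predicate"]
            and core_roles(prop_a) == core_roles(prop_b))


# A hypothetical annotation for "I will learn that" (not from PropBank data).
other = {
    "predicate": "learn.01",
    "roles": {"ARG0": "I", "ARG1": "that", "ARGM-MOD": "will"},
}

print(same_core_event(proposition, other))  # True
print(proposition["roles"]["ARGM-TMP"])     # never
```

The two sentences describe the same kind of event with the same participants, but the modifier `never` flips the meaning, so matching core roles alone is clearly not enough to decide entailment.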


### FrameNet

https://framenet.icsi.berkeley.edu/fndrupal/