<a href="https://colab.research.google.com/github/felixbmuller/nlp-commonsense/blob/main/NLP_Commonsense_Assignment_2_KB_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Commonsense Assignment 2 - Knowledge Base Model

## Setup

In [1]:
!pip install -q transformers datasets torch torchvision
!apt install git-lfs >/dev/null




In [2]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [3]:
!git clone https://github.com/felixbmuller/nlp-commonsense.git --depth 1

Cloning into 'nlp-commonsense'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 27 (delta 2), reused 12 (delta 1), pack-reused 0
Unpacking objects: 100% (27/27), done.


In [35]:
!git pull

remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 8 (delta 6), reused 8 (delta 6), pack-reused 0
Unpacking objects:  75% (6/8)

In [4]:
from datasets import load_dataset, load_metric
import pandas as pd
import transformers

print(transformers.__version__)

model_checkpoint = "bert-base-uncased"
batch_size = 16

datasets = load_dataset("super_glue", "copa")

4.16.2


Downloading and preparing dataset super_glue/copa (download: 42.96 KiB, generated: 119.62 KiB, post-processed: Unknown size, total: 162.57 KiB) to /root/.cache/huggingface/datasets/super_glue/copa/1.0.2/d040c658e2ddef6934fdd97deb45c777b6ff50c524781ea434e7219b56a428a7...


Dataset super_glue downloaded and prepared to /root/.cache/huggingface/datasets/super_glue/copa/1.0.2/d040c658e2ddef6934fdd97deb45c777b6ff50c524781ea434e7219b56a428a7. Subsequent calls will reuse this data.



## Setup and Test Knowledge Base

In [5]:
%cd nlp-commonsense/src/

/content/nlp-commonsense/src


In [36]:
%load_ext autoreload
%autoreload 2

import utils
import process_examples
import find_shortest_path
import renderer as R
import qa_preprocessing as QA

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [7]:
conceptnet = utils.load_conceptnet(load_compressed=True)

In [8]:
example = datasets["train"][0]

example

{'choice1': 'The sun was rising.',
 'choice2': 'The grass was cut.',
 'idx': 0,
 'label': 0,
 'premise': 'My body cast a shadow over the grass.',
 'question': 'cause'}

In [31]:
print(process_examples.extract_terms(example["premise"]))
print(process_examples.extract_terms(example["choice1"]))
print(process_examples.extract_terms(example["choice2"]))
print(find_shortest_path.find_word_path('body', 'sun', conceptnet))
print(find_shortest_path.find_word_path('body', 'sun', conceptnet, renderer=None))

{'my body', 'grass', 'cast', 'shadow', 'body'}
{'wa', 'rising', 'sun'}
{'grass', 'cut', 'wa'}
body <--RelatedTo-- sun
[182090, 1539020]


In [37]:
R.render_path_natural([], conceptnet)

('', [])

In [38]:
R.render_path_natural([182090, 1539020], conceptnet)

('sun is like body.', [0.909])

In [39]:
print(QA.get_knowledge_for_example(example["premise"], example["choice1"], conceptnet, max_paths=100))
print(QA.get_knowledge_for_example(example["premise"], example["choice1"], conceptnet, max_paths=3))



grass is like side. side is like wa.grass is in the context of slang. rising is in the context of slang.grass is like plant. sun is like plant.cast is like rise. rising and rise have similar meanings.iron can be cast . sun has iron.shadow is like sun.wash is like body. wash and wa have similar meanings.dyke is like body. dyke is like rising.sun is like body.
sun is like body.shadow is like sun.cast is like rise. rising and rise have similar meanings.
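
The repository's `find_shortest_path` and `renderer` modules are not shown in this notebook. As a rough illustration of the idea only, the sketch below runs a breadth-first search over a tiny hand-written relation graph and renders the resulting path with sentence templates. The `EDGES` list, the `TEMPLATES` mapping, and both function names are hypothetical. The real ConceptNet loader, edge weights, and path ranking are not reproduced here.

```python
from collections import deque

# Hypothetical toy edges (head, relation, tail); not real ConceptNet data.
EDGES = [
    ("sun", "RelatedTo", "body"),
    ("shadow", "RelatedTo", "sun"),
    ("cast", "RelatedTo", "rise"),
    ("rising", "Synonym", "rise"),
]

# Hypothetical templates, modelled on the rendered sentences printed above.
TEMPLATES = {
    "RelatedTo": "{head} is like {tail}.",
    "Synonym": "{head} and {tail} have similar meanings.",
}


def shortest_path(start, goal, edges):
    """Return the shortest list of edges linking start to goal, ignoring direction."""
    neighbours = {}
    for edge in edges:
        head, _, tail = edge
        neighbours.setdefault(head, []).append((tail, edge))
        neighbours.setdefault(tail, []).append((head, edge))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, edge in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [edge]))
    return None  # no connection found


def render(path):
    """Turn a list of edges into one natural-language sentence per edge."""
    return " ".join(TEMPLATES[rel].format(head=h, tail=t) for h, rel, t in path)


print(render(shortest_path("body", "sun", EDGES)))
print(render(shortest_path("cast", "rising", EDGES)))
```

Breadth-first search is used here because it returns a path with the fewest edges on an unweighted graph. Whether the repository's search works the same way is an assumption.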


## Preprocessing the data

In [40]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


In [41]:
ending_names = ["choice1", "choice2"]

QUESTION_MAP = {
    "cause": "What was the cause of this?",
    "effect": "What happened as a RESULT?",
}

MAX_PATHS = 3 # only take the three most relevant knowledge paths into account 

def preprocess_function(examples):
    # Repeat the premise and question once per answer choice. In front of each
    # repetition, add knowledge-base text describing connections between the
    # premise and that answer choice. The question type (cause/effect) is
    # appended to the premise before the knowledge lookup, so it is taken into
    # account as well.
    first_sentences = [
                       [f"{QA.get_knowledge_for_example(f'{context} {question}', c1, conceptnet, MAX_PATHS)} {context} {QUESTION_MAP[question]}", 
                        f"{QA.get_knowledge_for_example(f'{context} {question}', c2, conceptnet, MAX_PATHS)} {context} {QUESTION_MAP[question]}"] 
                       for context, question, c1, c2 in zip(
                           examples["premise"], 
                           examples["question"], 
                           examples["choice1"], 
                           examples["choice2"]
                           )
                       ]
    # Grab all second sentences possible for each context.
    second_sentences = [[c1, c2] 
                        for c1, c2 in zip(examples["choice1"], examples["choice2"])]
    
    # Flatten everything
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    if len(first_sentences) != len(second_sentences):
        raise ValueError("lengths don't match")
    
    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten
    return {k: [v[i:i+2] for i in range(0, len(v), 2)] for k, v in tokenized_examples.items()}

This function works with one or several examples. For several examples, it returns a list of lists of lists for each key: a list over all examples (here 5), then a list over the answer choices (2 for COPA), then a list of input IDs (of varying length, since no padding has been applied yet).
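
The nesting can be illustrated without loading a real tokenizer. The sketch below is a simplified stand-in: `fake_tokenize` is a hypothetical helper that maps each whitespace token to its length, and the premises and choices are made up. Only the flatten and regroup logic mirrors `preprocess_function`.

```python
# Hypothetical stand-in for the tokenizer: one integer per whitespace token.
def fake_tokenize(first, second):
    return {
        "input_ids": [
            [len(word) for word in f"{a} {b}".split()]
            for a, b in zip(first, second)
        ]
    }


NUM_CHOICES = 2
premises = ["It rained.", "The glass fell."]
choices = [["The street got wet.", "The sun set."], ["It broke.", "It flew away."]]

# Repeat each premise once per choice, then flatten into two parallel lists.
first = [p for p in premises for _ in range(NUM_CHOICES)]
second = [c for pair in choices for c in pair]

flat = fake_tokenize(first, second)

# Regroup: every NUM_CHOICES consecutive rows belong to one example.
grouped = {
    key: [rows[i:i + NUM_CHOICES] for i in range(0, len(rows), NUM_CHOICES)]
    for key, rows in flat.items()
}

# Number of examples, then number of choices per example.
print(len(grouped["input_ids"]), len(grouped["input_ids"][0]))
```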

### Test Tokenizer and Preprocessing

In [42]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [43]:
examples = datasets["train"][:5]
features = preprocess_function(examples)

print(features.keys())
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])



dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
5 2 [46, 38]


To check that nothing went wrong when grouping all possibilities and then unflattening, let's have a look at the decoded inputs for a given example:

In [48]:
len(datasets["train"]), len(datasets["test"]), len(datasets["validation"])

(400, 500, 100)

In [None]:
idx = 3
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(2)]

['[CLS] the runner wore shorts. what was the cause of this? [SEP] the forecast predicted high temperatures. [SEP]',
 '[CLS] the runner wore shorts. what was the cause of this? [SEP] she planned to run along the beach. [SEP]']

We can compare it to the ground truth:

In [None]:
datasets["train"][3]

{'choice1': 'The forecast predicted high temperatures.',
 'choice2': 'She planned to run along the beach.',
 'idx': 3,
 'label': 0,
 'premise': 'The runner wore shorts.',
 'question': 'cause'}

### Apply Preprocessing to the Whole Dataset

In [None]:
encoded_datasets = datasets.map(preprocess_function, batched=True)


## Fine-tuning the model

In [None]:
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch
import numpy as np

model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)

model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-copa",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)


@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch


def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMultipleChoice: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly


When called on a list of examples, the data collator flattens all the inputs, attention masks, etc. into big lists that it passes to the `tokenizer.pad` method. This returns a dictionary of big tensors (of shape `(batch_size * 2) x seq_length`, since COPA has two choices per example), which are then unflattened.
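
A plain-Python sketch of the same flatten, pad, and unflatten sequence, using nested lists instead of tensors. The `collate` function and the toy input IDs are hypothetical. The real collator delegates padding to `tokenizer.pad` and reshapes with `view`.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def collate(features, pad_id=PAD_ID):
    """Pad a batch of multiple-choice features to [batch][choice][seq_len]."""
    batch_size = len(features)
    num_choices = len(features[0]["input_ids"])

    # Flatten: one row per (example, choice) pair.
    rows = [ids for feature in features for ids in feature["input_ids"]]

    # Pad every row to the longest row in this batch (dynamic padding).
    max_len = max(len(row) for row in rows)
    padded = [row + [pad_id] * (max_len - len(row)) for row in rows]
    mask = [[1] * len(row) + [0] * (max_len - len(row)) for row in rows]

    # Un-flatten back to one group of num_choices rows per example.
    def regroup(flat):
        return [flat[i * num_choices:(i + 1) * num_choices] for i in range(batch_size)]

    return {"input_ids": regroup(padded), "attention_mask": regroup(mask)}


features = [
    {"input_ids": [[101, 7, 102], [101, 7, 8, 9, 102]]},
    {"input_ids": [[101, 5, 6, 102], [101, 5, 102]]},
]
batch = collate(features)
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```

Padding to the longest row in the batch, rather than to a fixed maximum length, keeps the padded sequences as short as each batch allows.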

### Test Collator

We can check that this data collator works on a list of features. We just have to remove all features that are not inputs accepted by our model (something the `Trainer` does automatically later):

In [None]:
accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(10)]
batch = DataCollatorForMultipleChoice(tokenizer)(features)

Again, all this flattening and unflattening is a potential source of errors, so let's run another sanity check on our inputs:

In [None]:
[tokenizer.decode(batch["input_ids"][8][i].tolist()) for i in range(2)]

['[CLS] the physician misdiagnosed the patient. what happened as a result? [SEP] the patient filed a malpractice lawsuit against the physician. [SEP] [PAD]',
 '[CLS] the physician misdiagnosed the patient. what happened as a result? [SEP] the patient disclosed confidential information to the physician. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]']

In [None]:
datasets["train"][8]

{'choice1': 'The patient filed a malpractice lawsuit against the physician.',
 'choice2': 'The patient disclosed confidential information to the physician.',
 'idx': 8,
 'label': 0,
 'premise': 'The physician misdiagnosed the patient.',
 'question': 'effect'}

### Run Training

In [None]:
trainer.train()

#model.push_to_hub("felixbmuller/bert-base-uncased-finetuned-copa")

The following columns in the training set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: idx, question, choice2, premise, choice1.
***** Running training *****
  Num examples = 400
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 75


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.694743 | 0.47 |
| 2 | No log | 0.662411 | 0.59 |
| 3 | No log | 0.656186 | 0.63 |


The following columns in the evaluation set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: idx, question, choice2, premise, choice1.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)


Cloning https://huggingface.co/felixbmuller/bert-base-uncased-finetuned-copa into local empty directory.



Configuration saved in felixbmuller/bert-base-uncased-finetuned-copa/config.json
Model weights saved in felixbmuller/bert-base-uncased-finetuned-copa/pytorch_model.bin



To https://huggingface.co/felixbmuller/bert-base-uncased-finetuned-copa
   9e249f5..59bfdb7  main -> main



'https://huggingface.co/felixbmuller/bert-base-uncased-finetuned-copa/commit/59bfdb7734ce937ddd698b868add485eb2a71956'
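
The step count in the training log can be reproduced by hand. The sketch below assumes a single device and no gradient accumulation, as the log reports, and assumes the `Trainer` default of `logging_steps=500`, which the `TrainingArguments` above do not override. Under that assumption the run is too short to ever report a training loss, which is consistent with the `No log` entries.

```python
import math

num_examples = 400   # size of the COPA training split
batch_size = 16      # per-device batch size on a single device
num_epochs = 3
logging_steps = 500  # assumed Trainer default, not set in this notebook

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# The training loss is only reported every `logging_steps` optimizer steps.
print("training loss logged:", total_steps >= logging_steps)
```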

## Evaluate the Model


In [None]:
predictions, label_ids, metrics = trainer.predict(encoded_datasets["validation"], metric_key_prefix="val")

The following columns in the test set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: idx, question, choice2, premise, choice1.
***** Running Prediction *****
  Num examples = 100
  Batch size = 16


In [None]:
val = pd.DataFrame(datasets["validation"])
val["label_ids"] = label_ids
val["pred0"] = predictions[:, 0]
val["pred1"] = predictions[:, 1]
val["pred_label"] = np.argmax(predictions, axis=1)

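Sanity check to ensure that the predictions behave as expected: the accuracy recomputed from `pred_label` should match the validation accuracy reported after the last training epoch (0.63).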

In [None]:
print("accuracy", (val.pred_label == val.label).astype(np.float32).mean())

accuracy 0.6299999952316284


In [None]:
val.head(20)

| | premise | choice1 | choice2 | question | idx | label | label_ids | pred0 | pred1 | pred_label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The man turned on the faucet. | The toilet filled with water. | Water flowed from the spout. | effect | 0 | 1 | 1 | 0.179625 | 0.236074 | 1 |
| 1 | The girl found a bug in her cereal. | She poured milk in the bowl. | She lost her appetite. | effect | 1 | 1 | 1 | -0.424878 | 0.07543 | 1 |
| 2 | The woman retired. | She received her pension. | She paid off her mortgage. | effect | 2 | 0 | 0 | 0.05091 | -0.158558 | 0 |
| 3 | I wanted to conserve energy. | I swept the floor in the unoccupied room. | I shut off the light in the unoccupied room. | effect | 3 | 1 | 1 | -0.398636 | -0.347026 | 1 |
| 4 | The hamburger meat browned. | The cook froze it. | The cook grilled it. | cause | 4 | 1 | 1 | -0.147592 | -0.440141 | 0 |
| 5 | I doubted the salesman's pitch. | I turned his offer down. | He persuaded me to buy the product. | effect | 5 | 0 | 0 | -0.371013 | -0.377121 | 0 |
| 6 | I decided to stay home for the night. | The forecast called for storms. | My friends urged me to go out. | cause | 6 | 0 | 0 | -0.309059 | -0.428363 | 0 |
| 7 | My eyes became red and puffy. | I was sobbing. | I was laughing. | cause | 7 | 0 | 0 | -0.19218 | -0.279411 | 0 |
| 8 | The flame on the candle went out. | I blew on the wick. | I put a match to the wick. | cause | 8 | 0 | 0 | -0.401704 | -0.382581 | 1 |
| 9 | The man drank heavily at the party. | He had a headache the next day. | He had a runny nose the next day. | effect | 9 | 0 | 0 | -0.328688 | -0.182583 | 1 |
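
The `pred_label` column can be checked by hand against the ten rows shown above. The sketch below copies `label`, `pred0`, and `pred1` from the table and recomputes the predicted label as the index of the larger score. The `predict` helper is hypothetical and only mimics what `np.argmax` does for two columns.

```python
# (label, pred0, pred1) for validation rows 0-9, copied from the table above.
rows = [
    (1, 0.179625, 0.236074),
    (1, -0.424878, 0.07543),
    (0, 0.05091, -0.158558),
    (1, -0.398636, -0.347026),
    (1, -0.147592, -0.440141),
    (0, -0.371013, -0.377121),
    (0, -0.309059, -0.428363),
    (0, -0.19218, -0.279411),
    (0, -0.401704, -0.382581),
    (0, -0.328688, -0.182583),
]


def predict(pred0, pred1):
    """Index of the larger score; ties go to choice 0, as np.argmax does."""
    return 0 if pred0 >= pred1 else 1


pred_labels = [predict(p0, p1) for _, p0, p1 in rows]
correct = sum(pred == label for pred, (label, _, _) in zip(pred_labels, rows))
accuracy = correct / len(rows)
print(pred_labels, accuracy)
```

On these ten rows the model gets seven right, so the subset accuracy differs from the 0.63 measured on all 100 validation examples.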


In [None]:
wrong_samples = val[val.label !=  val.pred_label]
wrong_samples.sample(25, random_state=42)

| | premise | choice1 | choice2 | question | idx | label | label_ids | pred0 | pred1 | pred_label |
|---|---|---|---|---|---|---|---|---|---|---|
| 36 | The couple travelled south for the winter. | They were retired. | They were separated. | cause | 36 | 0 | 0 | -0.432991 | -0.076568 | 1 |
| 27 | The driver turned on the car's headlights. | He heard thunder. | The sun went down. | cause | 27 | 1 | 1 | -0.137905 | -0.31643 | 0 |
| 14 | The player caught the ball. | Her teammate threw it to her. | Her opponent tried to intercept it. | cause | 14 | 0 | 0 | 0.491094 | 0.524528 | 1 |
| 76 | The lobbyist persuaded the legislature to supp... | The president vetoed the bill. | The legislature passed the bill. | effect | 76 | 1 | 1 | 0.679025 | 0.558334 | 0 |
| 97 | The airline mishandled my luggage. | They offered me compensation. | They cancelled my flight. | effect | 97 | 0 | 0 | -0.36953 | 0.636157 | 1 |
| 61 | The driver slammed on his brakes. | A deer appeared on the road. | The car radio shut off. | cause | 61 | 0 | 0 | -0.108249 | 0.352234 | 1 |
| 17 | The kidnappers released the hostage. | They accepted ransom money. | They escaped from jail. | cause | 17 | 0 | 0 | -0.377641 | -0.14109 | 1 |
| 63 | I put rubber gloves on. | I was preparing to wash my hands. | I was preparing to clean the bathroom. | cause | 63 | 1 | 1 | -0.408901 | -0.428135 | 0 |
| 60 | The mother called an ambulance. | Her son lost his cat. | Her son fell out of his bed. | cause | 60 | 1 | 1 | 0.50456 | 0.401593 | 0 |
| 35 | The police searched the offender's car. | They were trying to elicit a confession. | They were looking for illegal drugs. | cause | 35 | 1 | 1 | 0.134396 | -0.202672 | 0 |
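
The `pred0` and `pred1` columns are raw logits, so their absolute values are hard to read. One way to judge how confident the model was on a misclassified example is to convert the two logits to probabilities with a softmax. The sketch below does this for two rows copied from the table above. `softmax2` is a hypothetical helper and is not part of the notebook.

```python
import math


def softmax2(pred0, pred1):
    """Softmax over two logits, shifted by the max for numerical stability."""
    m = max(pred0, pred1)
    e0, e1 = math.exp(pred0 - m), math.exp(pred1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)


# (idx, pred0, pred1) for two misclassified rows copied from the table above.
for idx, p0, p1 in [(97, -0.36953, 0.636157), (63, -0.408901, -0.428135)]:
    prob0, prob1 = softmax2(p0, p1)
    print(idx, round(prob0, 3), round(prob1, 3))
```

A probability close to 0.5, as for example 63, indicates a near coin flip. A clearly skewed one, as for example 97, indicates a confident mistake.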
