# CSE 354 - Nxt Lvl Programmars (Final Project)



## Project Description: TellMeWhy with Small Models: Fine-Tuning Through Contextual Injection
### Authors: Jaret McManus, Dane Meister
- Project Number: 9
- Project Name: Answering Why Questions in TellMeWhy
- Project Area:
- Brief 2 line Project Description (starting point): Given a short story and a why question about an action in the story, generate an answer that explains the reason for performing the action.
- Relevant Baseline Model: T5, GPT, Gemini
- Relevant Dataset: TellMeWhy
- Relevant Papers: Lal et al 2021, Raffel et al 2020, Brown et al 2020, Gemini Team 2023


**Readme**: [Gdrive link to readme](https://drive.google.com/file/d/1Dnp3O8TfjzQ75mYuFVYoyuBFkl9e1IrX/view?usp=drive_link)

In [None]:
# upload and display README for project if necessary
from google.colab import files

uploaded = files.upload()  # This will prompt you to upload the file

from IPython.display import Markdown, display

# Get the filename from the uploaded dictionary
filename = list(uploaded.keys())[0]  # Assuming only the readme file is uploaded

# Now open the file using the filename
with open(filename, "r") as file:
    content = file.read()

display(Markdown(content))

Saving README.md to README.md


# 354 README
 ## Original Source
 Our project is based on a paper/project linked below. We used their dataset and trained our models from the HuggingFace transformers API.
 [Original Paper where the Idea came from](https://aclanthology.org/2022.emnlp-main.79/)
 [Original Git Project where the Idea came from](https://github.com/StonyBrookNLP/knowwhy)
[Data used from that project](https://huggingface.co/datasets/StonyBrookNLP/tellmewhy)


 ## Modified Files
 We ended up not reusing any files from the original paper. We did modify their dataset however:
 [Data used from that project](https://huggingface.co/datasets/StonyBrookNLP/tellmewhy) 
 

## How to Train and Test our Models
We used a Google Colab notebook that can be run sequentially on a GPU instance to train, save, and evaluate our models. Without a GPU instance the notebook will crash, because in some places we move tensors to the GPU explicitly. The notebook requires access to the runner's Google Drive and will open a folder named "CSE_354_Project", creating it if it doesn't exist. Preprocessing the data requires loading the JSON file "*ALL_CONTEXT_DATA_1.json*" that we created and linked in the data section below.
[Link to Notebook](https://colab.research.google.com/drive/1OD20QNgY24b6lUG22S8CK1lcuuheJw_N?usp=sharing)

Additionally, inside the notebook we link to the context generation we did with Gemini, which ran over the course of several days due to Gemini's API limits. To run the context generation successfully you need to add a Google Gemini API key in the secrets section of Colab.

## Our Models and Data
[Drive link to our no context tuned model](https://drive.google.com/drive/folders/1jmfPBUO8D9ErSxLEyizCAHfwYmCZsZPi?usp=sharing)
[Drive link to our context tuned model](https://drive.google.com/drive/folders/1-5-wsyQgdH7D9G9WtVD3Udr09FqwllPW?usp=sharing)
[Drive link to our modified data in JSON format](https://drive.google.com/file/d/1yfivuLQud6rmaVqtAW3mtTR2PIL8oIZ6/view?usp=drive_link)
[Drive link to folder containing all files above](https://drive.google.com/drive/folders/1MQjohFNC19qhgc5hetsw9g9yr6bheoF2?usp=sharing)

## Prompts used
Prompt we gave to Gemini for context injection:
```py
prompt = '''Given the following narrative sentences that describe a story, produce a sequence of concise and to the point sentences that bring in commonsense information, and external world knowledge that is relevant. Be very verbose about commonsense knowledge and explain the reason why things are done.

  

Here is an example:

narrative: Cam ordered a pizza and took it home. He opened the box to take out a slice. Cam discovered that the store did not cut the pizza for him. He looked for his pizza cutter but did not find it. He had to use his chef knife to cut a slice.

Pizza is a food. People eat food when they are hungry. Pizza is usually already cut. Cam got the pizza from the store.

  

Produce context sentences to the following narrative without any formatting, just as a sequence of 4 short, simple, and single clause sentences, do NOT reason through multiple sentences, each sentence should state commonsense information related to the narrative:

{narrative}

'''
```
Prompt format we trained our models on, either:
```py
f"Narrative: {narrative} Question: {question}"
```
or
```py
f"Narrative: {narrative} Context: {context} Question: {question}"
```
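As a minimal sketch of how the two formats relate, the helper below builds either string depending on whether a context is supplied. The name `build_input` is hypothetical and does not appear in the notebook, which builds these strings inline.

```python
def build_input(narrative, question, context=None):
    """Build a model input string in one of the two formats above.

    `build_input` is a hypothetical helper used only for illustration.
    """
    if context is None:
        return f"Narrative: {narrative} Question: {question}"
    return f"Narrative: {narrative} Context: {context} Question: {question}"


example = build_input(
    "Cam ordered a pizza and took it home.",
    "Why did Cam order a pizza?",
    context="People eat food when they are hungry.",
)
print(example)
```

Keeping the prefix words (`Narrative:`, `Context:`, `Question:`) identical at training and inference time matters, since the fine-tuned model only ever saw inputs in these exact layouts.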

## Requirements
All requirements are set to be installed in the notebook before, or when they are needed. The notebook works if run sequentially.


# *Loading Google Drive*

In [None]:
import os
from google.colab import drive

base_dir = '/content/drive/MyDrive/CSE_354_Project'
drive.mount('/content/drive/')
if not os.path.exists(base_dir):
    print(f"Directory '{base_dir}' does not exist. Creating it...")
    os.makedirs(base_dir)
else:
    print(f"Directory '{base_dir}' already exists.")

%cd $base_dir

Mounted at /content/drive/
Directory '/content/drive/MyDrive/CSE_354_Project' already exists.
/content/drive/MyDrive/CSE_354_Project


# *Loading Dataset with context injected data*

We injected the data with additional context using Google's Gemini LLM. For each unique narrative, we prompted Gemini to generate commonsense and external context and saved it to a JSON file for later use. Because of Google's API limits, we modified the data in chunks and later recombined all the chunks into one JSON file.

For more details on how we did this, here are links to the two notebooks containing the code:
- [Notebook for Injecting Context in Chunks](https://colab.research.google.com/drive/1S50O26o_tLbaYE2-s-hRelSfroXaaMaC?usp=sharing)
- [Notebook for Combining Chunks](https://colab.research.google.com/drive/1-o8IBF1KQgMBm7m_2hQVA83sE-cPh-yU?usp=sharing)
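The chunk-and-recombine workflow could look roughly like the sketch below. It is not the code from the linked notebooks: the file naming scheme, the chunk size, and the `generate_context` placeholder (which stands in for the Gemini request and returns a dummy string) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def generate_context(narrative):
    # Placeholder for the Gemini request (assumption); returns a dummy string.
    return f"(context for: {narrative[:20]})"


def process_in_chunks(narratives, chunk_size, out_dir):
    """Generate context chunk by chunk, writing one JSON file per chunk."""
    paths = []
    for start in range(0, len(narratives), chunk_size):
        chunk = narratives[start:start + chunk_size]
        records = [{"narrative": n, "context": generate_context(n)} for n in chunk]
        path = os.path.join(out_dir, f"chunk_{start // chunk_size}.json")
        with open(path, "w") as f:
            json.dump(records, f)
        paths.append(path)
    return paths


def combine_chunks(paths):
    """Concatenate the per-chunk JSON lists into a single list."""
    combined = []
    for path in paths:
        with open(path, "r") as f:
            combined.extend(json.load(f))
    return combined


with tempfile.TemporaryDirectory() as tmp:
    stories = [f"Story number {i}." for i in range(5)]
    chunk_paths = process_in_chunks(stories, chunk_size=2, out_dir=tmp)
    merged = combine_chunks(chunk_paths)
    print(len(chunk_paths), len(merged))
```

Writing each chunk to its own file means a run that stops at an API limit loses at most one chunk of work, and the merge step is a plain list concatenation.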

In [None]:
!pip install datasets
from datasets import Dataset
import json

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl 

In [None]:
#load our data
file_name = 'ALL_CONTEXT_DATA_1.json'
with open(file_name, 'r') as f:
    all_context_data = json.load(f)
print(f"Loaded {len(all_context_data)} records from {file_name}")

Loaded 73247 records from ALL_CONTEXT_DATA_1.json


# *Preprocess Data for Transformer*

In [None]:
!pip install transformers
from transformers import T5Tokenizer



In [None]:
# Tokenization function
def tokenize_function(examples):
    # Tokenize the input
    model_inputs = tokenizer(
        examples["input"],
        max_length=128,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    )
    labels = tokenizer(
        examples["target"],
        max_length=128,
        truncation=True,
        padding="max_length",
    ).input_ids

    # Replace padding token IDs in labels with -100
    labels = [[-100 if token == tokenizer.pad_token_id else token for token in label] for label in labels]
    model_inputs["labels"] = labels #add to dictionary

    return model_inputs

# Convert data into Dataset format
context_dataset = Dataset.from_dict({
    "input": [f"Narrative: {item['narrative']} Context: {item['context']} Question: {item['question']}" for item in all_context_data],
    "target": [item['answer'] for item in all_context_data],
})
print(f"Context injected Dataset loaded with {len(context_dataset)} samples.")

no_context_dataset = Dataset.from_dict({
    "input": [f"Narrative: {item['narrative']} Question: {item['question']}" for item in all_context_data],
    "target": [item['answer'] for item in all_context_data],
})
print(f"No Context injected Dataset loaded with {len(no_context_dataset)} samples.")


# Initialize the T5 tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')

# Apply tokenization to the dataset
print("Tokenizing the dataset...")
tokenized_context_dataset = context_dataset.map(tokenize_function, batched=True)
tokenized_no_context_dataset = no_context_dataset.map(tokenize_function, batched=True)

# View tokenized example for verification
print("Tokenized example:", tokenized_context_dataset[0])
print("Tokenized example:", tokenized_no_context_dataset[100])

Context injected Dataset loaded with 73247 samples.
No Context injected Dataset loaded with 73247 samples.




Tokenizing the dataset...


Tokenized example: {'input': 'Narrative: Cam ordered a pizza and took it home. He opened the box to take out a slice. Cam discovered that the store did not cut the pizza for him. He looked for his pizza cutter but did not find it. He had to use his chef knife to cut a slice. Context: - Pizza is usually cut into slices before it is served.\n- It is common to find a pizza cutter in homes.\n- Pizza is typically ordered when someone is hungry and wants to eat.\n- Chef knives are sharp and can be used to cut food. Question: Why did Cam order a pizza?', 'target': 'Cam was hungry.', 'input_ids': [13346, 52, 1528, 10, 5184, 5563, 3, 9, 6871, 11, 808, 34, 234, 5, 216, 2946, 8, 1367, 12, 240, 91, 3, 9, 13810, 5, 5184, 3883, 24, 8, 1078, 410, 59, 1340, 8, 6871, 21, 376, 5, 216, 2299, 21, 112, 6871, 20634, 68, 410, 59, 253, 34, 5, 216, 141, 12, 169, 112, 6380, 10821, 12, 1340, 3, 9, 13810, 5, 1193, 6327, 10, 3, 18, 15365, 19, 1086, 1340, 139, 18647, 274, 34, 19, 2098, 5, 3, 18, 94, 19, 1017, 12, 2

# *Set up data for training*

In [None]:
split_context_dataset = tokenized_context_dataset.select(range(10_000)).train_test_split(test_size=0.15)
split_no_context_dataset = tokenized_no_context_dataset.select(range(10_000)).train_test_split(test_size=0.15)

train_context_dataset = split_context_dataset["train"]
test_context_dataset = split_context_dataset["test"]

train_no_context_dataset = split_no_context_dataset["train"]
test_no_context_dataset = split_no_context_dataset["test"]

print("length of context train:", len(train_context_dataset))
print("length of context test:", len(test_context_dataset))
print("length of no context train:", len(train_no_context_dataset))
print("length of no context test:", len(test_no_context_dataset))

length of context train: 8500
length of context test: 1500
length of no context train: 8500
length of no context test: 1500


# *Load BLEURT for metric*

In [None]:
!pip install evaluate
import evaluate
!pip install git+https://github.com/google-research/bleurt.git
bleurt = evaluate.load("bleurt")

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3
Collecting git+https://github.com/google-research/bleurt.git
  Cloning https://github.com/google-research/bleurt.git to /tmp/pip-req-build-k_8mn9xa
  Running command git clone --filter=blob:none --quiet https://github.com/google-research/bleurt.git /tmp/pip-req-build-k_8mn9xa
  Resolved https://github.com/google-research/bleurt.git to commit cebe7e6f996b40910cfaa520a63db47807e3bf5c
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: BLEURT
  Building wheel for BLEURT (setup.py) ... done
  Created wheel for BLEURT: filename=BLEURT-0.0.2-py3-none-any.whl size=16456764 sha256=7d9dce3227073316c90e31ae5f406


In [None]:
import numpy as np
import torch
import gc
def compute_metrics(eval_pred):
    print('eval')
    logits, label_ids = eval_pred.predictions, eval_pred.label_ids
    logits = logits[0]
    with torch.no_grad():
        # Convert logits to predictions (use argmax to get the most probable token)
        preds = np.argmax(logits, axis=-1)

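        # Labels use -100 for ignored (padding) positions; restore the pad token id so they can be decoded
        label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)

        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(label_ids, skip_special_tokens=True)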

        # Compute BLEURT scores for the current batch
        scores = bleurt.compute(predictions=decoded_preds, references=decoded_labels)

        del logits, label_ids, preds, decoded_preds, decoded_labels
        gc.collect()
        torch.cuda.empty_cache()  # Free unused GPU memory

        # Return the average BLEURT score across all batches
        return {"bleurt": np.mean(scores["scores"])}

# *Retrieve pretrained T5 model to finetune without context*

In [None]:
from transformers import T5ForConditionalGeneration

# Load the T5 model
t5 = model = T5ForConditionalGeneration.from_pretrained("t5-small")  # Change to a larger version if needed


# *Finetuning a T5 without context*

In [None]:
from transformers import Trainer, TrainingArguments

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",         # Output directory
    eval_strategy="epoch",   # Evaluate every epoch
    learning_rate=5e-5,
    per_device_train_batch_size=1, # Adjust based on memory
    per_device_eval_batch_size=1,
    num_train_epochs=4,            # Adjust based on performance
    weight_decay=0.01,
    save_total_limit=1,            # Keep at most one checkpoint on disk (the most recent)
    logging_dir="./logs",          # Directory for logs
    logging_steps=60,
    save_steps=1000,
    gradient_accumulation_steps=4,
    fp16=True,
    report_to="none",
    eval_accumulation_steps=8
)

model = T5ForConditionalGeneration.from_pretrained('t5-small')
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_no_context_dataset,
    eval_dataset=test_no_context_dataset,
    compute_metrics=compute_metrics
)


# Train the model
trainer.train()
trainer.save_model("./fine_tuned_whitespace_no_context")
trainer.evaluate()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.


KeyboardInterrupt



In [None]:
import shutil
shutil.copytree('./fine_tuned_whitespace_no_context', base_dir+'/fine_tuned_whitespace_no_context')

In [None]:
# for conserving credits
# leave training running, and disconnect when it finishes
from google.colab import runtime
runtime.unassign()

# *Retrieve a pretrained T5 to finetune with context*

In [None]:
from transformers import T5ForConditionalGeneration

# Reload the T5 model
model = T5ForConditionalGeneration.from_pretrained("t5-small")  # Change to a larger version if needed

# *Finetuning a T5 with context*

In [None]:
from transformers import Trainer, TrainingArguments

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",         # Output directory
    eval_strategy="epoch",   # Evaluate every epoch
    learning_rate=5e-5,
    per_device_train_batch_size=4, # Adjust based on memory
    per_device_eval_batch_size=4,
    num_train_epochs=4,            # Adjust based on performance
    weight_decay=0.01,
    save_total_limit=1,            # Keep at most one checkpoint on disk (the most recent)
    logging_dir="./logs",          # Directory for logs
    logging_steps=60,
    save_steps=1000,
    gradient_accumulation_steps=4,
    fp16=True,
    report_to="none",
    eval_accumulation_steps=8
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_context_dataset,
    eval_dataset=test_context_dataset,
    compute_metrics=compute_metrics
)


# Train the model
trainer.train()
trainer.save_model("./fine_tuned_whitespace_context")
trainer.evaluate()



Epoch,Training Loss,Validation Loss,Bleurt
0,0.166,0.150395,-0.983019
1,0.1564,0.146696,-0.958194
2,0.1478,0.144871,-0.948719
3,0.1465,0.144614,-0.944824


eval
eval
eval
eval


eval


{'eval_loss': 0.14461372792720795,
 'eval_bleurt': -0.9448243341172735,
 'eval_runtime': 265.3165,
 'eval_samples_per_second': 5.654,
 'eval_steps_per_second': 1.413,
 'epoch': 3.9981176470588236}
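
The final `epoch` value of roughly 3.998 rather than 4 follows from the batch settings above. The sketch below is a rough reconstruction of that arithmetic. It assumes a single GPU and that `Trainer` drops the leftover micro-batches that do not fill a complete gradient accumulation step, which is our reading of the reported number rather than something stated in the notebook.

```python
import math

train_examples = 8500          # size of the context training split
per_device_batch_size = 4      # per_device_train_batch_size
grad_accum_steps = 4           # gradient_accumulation_steps
num_epochs = 4                 # num_train_epochs

# Micro-batches per epoch, then optimizer updates per epoch (leftover dropped).
batches_per_epoch = math.ceil(train_examples / per_device_batch_size)
updates_per_epoch = batches_per_epoch // grad_accum_steps
total_updates = updates_per_epoch * num_epochs

# Examples seen per optimizer update, and the fraction of epochs completed.
effective_batch_size = per_device_batch_size * grad_accum_steps
reported_epoch = total_updates * grad_accum_steps / batches_per_epoch

print(effective_batch_size, updates_per_epoch, total_updates, reported_epoch)
```

Under these assumptions each optimizer update sees 16 examples, and the run performs 2124 updates in total.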

In [None]:
import shutil
shutil.copytree('./fine_tuned_whitespace_context', base_dir+'/fine_tuned_whitespace_context')

In [None]:
# for conserving credits
# leave training running, and disconnect when it finishes
from google.colab import runtime
runtime.unassign()

# *Evaluating models*

Load models from Google Drive

In [None]:
context_model = T5ForConditionalGeneration.from_pretrained(f"{base_dir}/fine_tuned_whitespace_context")
no_context_model = T5ForConditionalGeneration.from_pretrained(f"{base_dir}/fine_tuned_whitespace_no_context")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

Reload our data in case the Colab session is started from this point

In [None]:
#load our data
file_name = 'ALL_CONTEXT_DATA_1.json'
with open(file_name, 'r') as f:
    all_context_data = json.load(f)
print(f"Loaded {len(all_context_data)} records from {file_name}")

Loaded 73247 records from ALL_CONTEXT_DATA_1.json


In [None]:
# Get smaller subset to evaluate on
no_context_dataset = Dataset.from_dict({
    "input": [f"Narrative: {item['narrative']} Question: {item['question']}" for item in all_context_data],
    "target": [item['answer'] for item in all_context_data],
})
context_dataset = Dataset.from_dict({
    "input": [f"Narrative: {item['narrative']} Context: {item['context']} Question: {item['question']}" for item in all_context_data],
    "target": [item['answer'] for item in all_context_data],
})
context_dataset = context_dataset.shuffle(seed=42).select(range(1_000))
print(context_dataset[0])
no_context_dataset = no_context_dataset.shuffle(seed=42).select(range(1_000))
print(no_context_dataset[0])

{'input': "Narrative: Ed wanted to paint his house blue. But Sarah wanted it white. They argued for a little while. Then they compromised. They painted the house white with blue shutters! Context: * Painting is used to enhance a house's appearance.\n* Compromise is a negotiation technique used to reach an agreement.\n* Shutters are commonly used to cover windows.\n* Different colors evoke different emotions and perceptions. Question: Why did Ed want to paint his house blue?", 'target': 'Ed wanted to paint his house blue.so.'}
{'input': 'Narrative: Ed wanted to paint his house blue. But Sarah wanted it white. They argued for a little while. Then they compromised. They painted the house white with blue shutters! Question: Why did Ed want to paint his house blue?', 'target': 'Ed wanted to paint his house blue.so.'}


In [None]:
!pip install evaluate
!pip install tqdm
import evaluate
from tqdm import tqdm



In [None]:
!pip install rouge_score
!pip install git+https://github.com/google-research/bleurt.git
rouge_metric = evaluate.load("rouge")
bleurt_metric = evaluate.load("bleurt")

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=0e0fc89e225ab12f9f974daff0c55da78d3664f60c89d2753729c532060f1bb1
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2
Collecting git+https://github.com/google-research/bleurt.git
  Cloning https://github.com/google-research/bleurt.git to /tmp/pip-req-build-hgc_d_el
  Running command git clone --filter=blob:none --quiet https://github.com/google-research/bleurt.git /tmp/pip-req-build-hgc_d_el
  Resolved https://github.com/google-research/bleurt.git to commit cebe7e6f996b40910cfaa520a63db47807




Code for metrics

In [None]:
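# F1 Score
# Treats each whitespace-normalized answer string as a class label and computes the
# macro-averaged F1 with scikit-learn, so a prediction only earns credit when it
# equals its reference string exactly. Note that this is not a word-overlap F1.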
from sklearn.metrics import precision_recall_fscore_support

def compute_f1_score(predictions, references):
    preds = [" ".join(p.split()) for p in predictions]
    refs = [" ".join(r.split()) for r in references]
    precision, recall, f1, _ = precision_recall_fscore_support(refs, preds, average="macro")
    return f1 * 100

# Exact Match (EM)
# Measures the percentage of predictions that exactly match the references.

def compute_exact_match(predictions, references):
    return sum([p.strip() == r.strip() for p, r in zip(predictions, references)]) / len(references) * 100
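
For comparison, a token-overlap F1 in the style commonly used for question answering evaluation could be computed as sketched below. This is not the metric behind the F1 numbers reported in this notebook. The function names `token_f1` and `mean_token_f1` and the lowercase, whitespace-split tokenization are assumptions made for illustration.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between one prediction and one reference (0 to 1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_token_f1(predictions, references):
    """Average token-overlap F1 over paired lists, as a percentage."""
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return 100 * sum(scores) / len(scores)


print(token_f1("she was searching everywhere.", "she found her ring."))
```

Unlike the string-level macro F1 above, this variant gives partial credit when a prediction shares some words with the reference, which suits free-form answers that rarely match verbatim.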

In [None]:
def evaluate_model(predictions, references, bleurt_metric, rouge_metric):
    """
    Evaluate predictions using BLEURT, ROUGE, Exact Match, and F1 Score.

    Args:
        predictions (list): List of model-generated answers.
        references (list): List of ground-truth answers.
        bleurt_metric (evaluate.Metric): BLEURT metric instance.
        rouge_metric (evaluate.Metric): ROUGE metric instance.

    Returns:
        dict: Dictionary containing BLEURT, ROUGE, EM, and F1 scores.
    """
    # BLEURT scores
    bleurt_results = bleurt_metric.compute(predictions=predictions, references=references)
    bleurt_scores = bleurt_results["scores"]
    avg_bleurt = sum(bleurt_scores) / len(bleurt_scores)

    # ROUGE scores
    rouge_results = rouge_metric.compute(predictions=predictions, references=references)
    rouge_l_f1 = rouge_results["rougeLsum"]

    # Exact Match (EM) Score
    em_score = compute_exact_match(predictions, references)

    # F1 Score
    f1_score = compute_f1_score(predictions, references)

    return {
        "BLEURT Average": avg_bleurt,
        "ROUGE-L F1": rouge_l_f1 * 100,  # Convert to percentage
        "Exact Match": em_score,
        "F1 Score": f1_score,
    }

def generate_predictions(model, tokenizer, dataset, max_length=256):
    """
    Generate predictions for evaluation.
    """
    predictions = []
    references = []
    model.eval()  # Set model to evaluation mode

    for item in tqdm(dataset):
        # Tokenize input dynamically
        inputs = tokenizer(item["input"], return_tensors="pt", truncation=True, padding=True).to("cuda")
        # inputs = tokenizer(item["input"], return_tensors="pt", truncation=True, padding=True)

        # Generate predictions
        outputs = model.generate(inputs["input_ids"], max_length=max_length)
        prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)

        predictions.append(prediction)
        references.append(item["target"])

    return predictions, references

Results for each combination of model and input format (with or without context)

In [None]:
no_context_model.to("cuda")
context_model.to("cuda")

# Shorthand: NC = no context, C = context, NCM = model trained without context, CM = model trained with context
NCM_given_NC_predictions, NCM_given_NC_references = generate_predictions(no_context_model, tokenizer, no_context_dataset)
NCM_given_C_predictions, NCM_given_C_references = generate_predictions(no_context_model, tokenizer, context_dataset)
CM_given_NC_predictions, CM_given_NC_references = generate_predictions(context_model, tokenizer, no_context_dataset)
CM_given_C_predictions, CM_given_C_references = generate_predictions(context_model, tokenizer, context_dataset)

# Evaluate the models
NCM_given_NC_results = evaluate_model(NCM_given_NC_predictions, NCM_given_NC_references, bleurt_metric, rouge_metric)
NCM_given_C_results = evaluate_model(NCM_given_C_predictions, NCM_given_C_references, bleurt_metric, rouge_metric)
CM_given_NC_results = evaluate_model(CM_given_NC_predictions, CM_given_NC_references, bleurt_metric, rouge_metric)
CM_given_C_results = evaluate_model(CM_given_C_predictions, CM_given_C_references, bleurt_metric, rouge_metric)

print("No Context Model Given No Context Results:")
print(f"BLEURT Average Score: {NCM_given_NC_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {NCM_given_NC_results['ROUGE-L F1']:.2f}%")
print()

print("No Context Model Given Context Results:")
print(f"BLEURT Average Score: {NCM_given_C_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {NCM_given_C_results['ROUGE-L F1']:.2f}%")
print()

print("Context Model Given No Context Results:")
print(f"BLEURT Average Score: {CM_given_NC_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {CM_given_NC_results['ROUGE-L F1']:.2f}%")
print()

print("Context Model Given Context Results:")
print(f"BLEURT Average Score: {CM_given_C_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {CM_given_C_results['ROUGE-L F1']:.2f}%")
print()

100%|██████████| 10000/10000 [17:52<00:00,  9.33it/s]
100%|██████████| 10000/10000 [17:42<00:00,  9.41it/s]
100%|██████████| 10000/10000 [17:33<00:00,  9.50it/s]
100%|██████████| 10000/10000 [18:06<00:00,  9.20it/s]


No Context Model Given No Context Results:
BLEURT Average Score: -0.7853
ROUGE-L F1 Score: 27.10%

No Context Model Given Context Results:
BLEURT Average Score: -0.7631
ROUGE-L F1 Score: 26.87%

Context Model Given No Context Results:




TypeError: list indices must be integers or slices, not str

In [None]:
# The run of the previous cell hit a TypeError because of a typo; rather than rerun generation for over an hour, we print the results in this cell
print("No Context Model Given No Context Results:")
print(f"BLEURT Average Score: {NCM_given_NC_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {NCM_given_NC_results['ROUGE-L F1']:.2f}%")
print(f"Exact Match Score: {NCM_given_NC_results['Exact Match']:.2f}%")
print(f"F1 Score: {NCM_given_NC_results['F1 Score']:.2f}%")
print()

print("No Context Model Given Context Results:")
print(f"BLEURT Average Score: {NCM_given_C_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {NCM_given_C_results['ROUGE-L F1']:.2f}%")
print(f"Exact Match Score: {NCM_given_C_results['Exact Match']:.2f}%")
print(f"F1 Score: {NCM_given_C_results['F1 Score']:.2f}%")
print()

print("Context Model Given No Context Results:")
print(f"BLEURT Average Score: {CM_given_NC_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {CM_given_NC_results['ROUGE-L F1']:.2f}%")
print(f"Exact Match Score: {CM_given_NC_results['Exact Match']:.2f}%")
print(f"F1 Score: {CM_given_NC_results['F1 Score']:.2f}%")
print()

print("Context Model Given Context Results:")
print(f"BLEURT Average Score: {CM_given_C_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {CM_given_C_results['ROUGE-L F1']:.2f}%")
print(f"Exact Match Score: {CM_given_C_results['Exact Match']:.2f}%")
print(f"F1 Score: {CM_given_C_results['F1 Score']:.2f}%")
print()

No Context Model Given No Context Results:
BLEURT Average Score: -0.7853
ROUGE-L F1 Score: 27.10%
Exact Match Score: 1.59%
F1 Score: 0.72%

No Context Model Given Context Results:
BLEURT Average Score: -0.7631
ROUGE-L F1 Score: 26.87%
Exact Match Score: 1.72%
F1 Score: 0.78%

Context Model Given No Context Results:
BLEURT Average Score: -0.8840
ROUGE-L F1 Score: 24.37%
Exact Match Score: 1.07%
F1 Score: 0.45%

Context Model Given Context Results:
BLEURT Average Score: -0.8772
ROUGE-L F1 Score: 24.02%
Exact Match Score: 1.09%
F1 Score: 0.47%



Pretrained results

In [None]:
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

t5.to("cuda")
# no_context_model.to("cuda")
t5_predictions, t5_references = generate_predictions(t5, tokenizer, no_context_dataset)
t5_results = evaluate_model(t5_predictions, t5_references, bleurt_metric, rouge_metric)

100%|██████████| 1000/1000 [07:23<00:00,  2.25it/s]


In [None]:
print("Pretrained T5-small Given No Context Results:")
print(f"BLEURT Average Score: {t5_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {t5_results['ROUGE-L F1']:.2f}%")
print(f"Exact Match Score: {t5_results['Exact Match']:.2f}%")
print(f"F1 Score: {t5_results['F1 Score']:.2f}%")
print()

Pretrained T5-small Given No Context Results:
BLEURT Average Score: -1.1427
ROUGE-L F1 Score: 17.55%
Exact Match Score: 0.00%
F1 Score: 0.00%



# *Human Eval*
We also hand-evaluated a sample of model outputs to supplement the automatic metrics.
- Here is a link to [the notebook that made the spreadsheet](https://colab.research.google.com/drive/166EPUJBX61T4QMm7JF3Jt2KOtlePo6dv?usp=sharing)
- Here is a link to [the Google spreadsheet we filled in](https://docs.google.com/spreadsheets/d/1Y30Vr81zzLvb_XWCEKM1OiXPq_ToM3CVctbVE6fEKME/edit?usp=sharing)

# *Demo*

In [None]:
!pip install datasets
from datasets import load_dataset
import textwrap # for nice formatting



In [None]:
# load Stony Brook TellMeWhy and find the bad Morocco example
tmw_dataset = load_dataset("StonyBrookNLP/tellmewhy")   # provided dataset


Generating train split:   0%|          | 0/71892 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/8976 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10689 [00:00<?, ? examples/s]

In [None]:
def contains_morocco(data):
    return "Morocco" in data["answer"]

morocco_example = list(filter(contains_morocco, tmw_dataset["train"]))[0]

def print_par_neatly(text):
    print(textwrap.fill(text, width=80))

print("Morocco example we mentioned at the presentation:")
print(f"\n{'Narrative':10}: ")
print_par_neatly(morocco_example['narrative'])
print(f"\n{'Question':10}: {morocco_example['question']}")
print(f"\n{'Answer':10}: {morocco_example['answer']}")


Morocco example we mentioned at the presentation:

Narrative : 
Lisa saw a new kid at school. She thought he was dressed a bit odd. But she
remembered what her parents told her about being nice. So she decided to be nice
to him. They weren't friends yet but at the very least friendly.

Question  : Why did She think he was dressed a bit odd?

Answer    : the new kid was from Morocco.


In [None]:
# get random example
SEED = 1014 # ask for a number
random_data = no_context_dataset.shuffle(seed=SEED)[0]

# print data
print("Shuffled data example:")
print(f"\n{'Concatted input'}: ")
print_par_neatly(random_data['input'])
print(f"\n{'Target'}: {random_data['target']}")

# generate output from both our models and a pretrained t5-small
input_ids = tokenizer(random_data['input'], return_tensors="pt")["input_ids"]

def generate_answer(model):
    # move the input to the model's device (the models were moved to the GPU during evaluation)
    output_ids = model.generate(input_ids.to(model.device))
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

context_output = generate_answer(context_model)
no_context_output = generate_answer(no_context_model)
t5_output = generate_answer(t5)

print("\nPretrained T5-small output:")
print(t5_output)
print()

print("No context model output:")
print(no_context_output)
print()

print("Context model output:")
print(context_output)
print()

Shuffled data example:

Concatted input: 
Narrative: Lisa has a beautiful sapphire ring. She always takes it off to wash
her hands. One afternoon, she noticed it was missing from her finger! Lisa
searched everywhere she had been that day. She was elated when she found it on
the bathroom floor! Question: Why was She elated?

Target: she found her ring.

Pretrained T5-small output:
: Why was she elated?

No context model output:
Lisa was searching everywhere she had been.

Context model output:
she was searching everywhere.

