# CSE 354 - Nxt Lvl Programmars (Final Project)



## Project Description: TellMeWhy with Small Models: Fine-Tuning Through Contextual Injection
### Authors: Jaret McManus, Dane Meister
- Project Number: 9
- Project Name: Answering Why Questions in TellMeWhy
- Project Area:
- Brief 2 line Project Description (starting point): Given a short story and a why question about an action in the story, generate an answer that explains the reason for performing the action.
- Relevant Baseline Model: T5, GPT, Gemini
- Relevant Dataset: TellMeWhy
- Relevant Papers: Lal et al. 2021, Raffel et al. 2020, Brown et al. 2020, Gemini Team 2023


**Readme**: [Gdrive link to readme](https://drive.google.com/file/d/1Dnp3O8TfjzQ75mYuFVYoyuBFkl9e1IrX/view?usp=drive_link)

In [None]:
# upload and display README for project if necessary
from google.colab import files

uploaded = files.upload()  # This will prompt you to upload the file

from IPython.display import Markdown, display

# Get the filename from the uploaded dictionary
filename = list(uploaded.keys())[0]  # Assuming only the readme file is uploaded

# Now open the file using the filename
with open(filename, "r") as file:
    content = file.read()

display(Markdown(content))

# *Loading Google Drive*

In [None]:
import os
from google.colab import drive

base_dir = '/content/drive/MyDrive/CSE_354_Project'
drive.mount('/content/drive/')
if not os.path.exists(base_dir):
    print(f"Directory '{base_dir}' does not exist. Creating it...")
    os.makedirs(base_dir)
else:
    print(f"Directory '{base_dir}' already exists.")

%cd $base_dir

# *Loading Dataset with context injected data*

We injected additional context into the data using Google's Gemini LLM. For each unique narrative, we prompted Gemini to generate commonsense and external context and saved the result to a JSON file for later use. Because of Google's API limits, we modified the data in chunks and later recombined all the chunks into one JSON file.

For more details on how we did this, here are links to the two notebooks containing the code:
- [Notebook for Injecting Context in Chunks](https://colab.research.google.com/drive/1S50O26o_tLbaYE2-s-hRelSfroXaaMaC?usp=sharing)
- [Notebook for Combining Chunks](https://colab.research.google.com/drive/1-o8IBF1KQgMBm7m_2hQVA83sE-cPh-yU?usp=sharing)
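
The chunk-and-recombine workflow described above could look roughly like the following. This is a minimal sketch, not the code from the linked notebooks: `split_into_chunks`, `combine_chunks`, the chunk size, and the chunk file names are all hypothetical, and the Gemini call itself is left out.

```python
import json
import os
import tempfile


def split_into_chunks(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def combine_chunks(chunk_paths):
    """Load each chunk file (a JSON list) and concatenate them in order."""
    combined = []
    for path in chunk_paths:
        with open(path, "r") as f:
            combined.extend(json.load(f))
    return combined


# Toy demonstration with placeholder records (not real TellMeWhy data).
records = [{"narrative": f"story {i}", "context": f"context {i}"} for i in range(5)]
chunks = split_into_chunks(records, chunk_size=2)

with tempfile.TemporaryDirectory() as tmp_dir:
    chunk_paths = []
    for idx, chunk in enumerate(chunks):
        # Hypothetical naming scheme; each chunk is written once it is complete.
        path = os.path.join(tmp_dir, f"context_chunk_{idx}.json")
        with open(path, "w") as f:
            json.dump(chunk, f)
        chunk_paths.append(path)
    recombined = combine_chunks(chunk_paths)

print(f"{len(chunks)} chunks recombined into {len(recombined)} records")
```

Writing each chunk to its own file means a run that hits the API limit only loses the chunk in progress.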

In [None]:
!pip install datasets
from datasets import Dataset
import json

In [None]:
#load our data
file_name = 'ALL_CONTEXT_DATA_1.json'
with open(file_name, 'r') as f:
    all_context_data = json.load(f)
print(f"Loaded {len(all_context_data)} records from {file_name}")

# *Preprocess Data for Transformer*

In [None]:
!pip install transformers
from transformers import T5Tokenizer

In [None]:
# Tokenization function
def tokenize_function(examples):
    # Tokenize the input
    model_inputs = tokenizer(
        examples["input"],
        max_length=128 ,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    )
    labels = tokenizer(
        examples["target"],
        max_length=128,
        truncation=True,
        padding="max_length",
    ).input_ids

    # Replace padding token IDs in labels with -100
    labels = [[-100 if token == tokenizer.pad_token_id else token for token in label] for label in labels]
    model_inputs["labels"] = labels #add to dictionary

    return model_inputs

# Convert data into Dataset format
context_dataset = Dataset.from_dict({
    "input": [f"Narrative: {item['narrative']} Context: {item['context']} Question: {item['question']}" for item in all_context_data],
    "target": [item['answer'] for item in all_context_data],
})
print(f"Context injected Dataset loaded with {len(context_dataset)} samples.")

no_context_dataset = Dataset.from_dict({
    "input": [f"Narrative: {item['narrative']} Question: {item['question']}" for item in all_context_data],
    "target": [item['answer'] for item in all_context_data],
})
print(f"No Context injected Dataset loaded with {len(no_context_dataset)} samples.")


# Initialize the T5 tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')

# Apply tokenization to the dataset
print("Tokenizing the dataset...")
tokenized_context_dataset = context_dataset.map(tokenize_function, batched=True)
tokenized_no_context_dataset = no_context_dataset.map(tokenize_function, batched=True)

# View tokenized example for verification
print("Tokenized example:", tokenized_context_dataset[0])
print("Tokenized example:", tokenized_no_context_dataset[100])
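
The label-masking step in `tokenize_function` can be illustrated on its own. The sketch below uses plain lists and made-up token ids, with 0 standing in for the pad id; `mask_padding` and `unmask_padding` are hypothetical helper names, not part of `transformers`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding(label_batch, pad_token_id):
    """Replace padding token ids with IGNORE_INDEX in every label sequence."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in label]
        for label in label_batch
    ]


def unmask_padding(label_batch, pad_token_id):
    """Undo mask_padding so the ids can be handed to a tokenizer for decoding."""
    return [
        [pad_token_id if token == IGNORE_INDEX else token for token in label]
        for label in label_batch
    ]


# Made-up token ids; 0 stands in for the pad id.
toy_labels = [[37, 1712, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_padding(toy_labels, pad_token_id=0)
print(masked)
```

Masking keeps padded positions out of the training loss. The reverse mapping matters at evaluation time because -100 is not a valid token id for decoding.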

# *Set up data for training*

In [None]:
split_context_dataset = tokenized_context_dataset.select(range(10_000)).train_test_split(test_size=0.15)
split_no_context_dataset = tokenized_no_context_dataset.select(range(10_000)).train_test_split(test_size=0.15)

train_context_dataset = split_context_dataset["train"]
test_context_dataset = split_context_dataset["test"]

train_no_context_dataset = split_no_context_dataset["train"]
test_no_context_dataset = split_no_context_dataset["test"]

print("length of context train:", len(train_context_dataset))
print("length of context test:", len(test_context_dataset))
print("length of no context train:", len(train_no_context_dataset))
print("length of no context test:", len(test_no_context_dataset))
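
Neither `train_test_split` call above passes a `seed`, so the context and no-context test sets are not guaranteed to hold the same examples. One way to get matching splits is to draw the indices once and reuse them for both datasets. The following is a minimal sketch with a hypothetical helper `split_indices` and an arbitrary seed:

```python
import random


def split_indices(num_examples, test_size, seed):
    """Return (train_indices, test_indices) from a single seeded shuffle."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_examples * test_size))
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(num_examples=10_000, test_size=0.15, seed=42)
print(len(train_idx), len(test_idx))  # 8500 1500
```

The resulting index lists could then be passed to `Dataset.select` on both tokenized datasets, so each model is evaluated on the same held-out narratives.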

# *Load BLEURT for metric*

In [None]:
!pip install evaluate
import evaluate
!pip install git+https://github.com/google-research/bleurt.git
bleurt = evaluate.load("bleurt")

In [None]:
import numpy as np
import torch
import gc
def compute_metrics(eval_pred):
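    logits, label_ids = eval_pred.predictions, eval_pred.label_ids
    logits = logits[0]  # predictions is a tuple for T5; the first element holds the token logits
    with torch.no_grad():
        # Convert logits to predictions (use argmax to get the most probable token)
        preds = np.argmax(logits, axis=-1)

        # Labels were padded with -100 for the loss; restore the pad token before decoding
        label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)

        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(label_ids, skip_special_tokens=True)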

        # Compute BLEURT scores for the current batch
        scores = bleurt.compute(predictions=decoded_preds, references=decoded_labels)

        del logits, label_ids, preds, decoded_preds, decoded_labels
        gc.collect()
        torch.cuda.empty_cache()  # Free unused GPU memory

        # Return the average BLEURT score across all batches
        return {"bleurt": np.mean(scores["scores"])}
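
The argmax step in `compute_metrics` picks the highest-scoring vocabulary entry at each position. A minimal sketch on nested lists, with a hypothetical helper name and toy numbers, shows the shape of the operation. As far as we understand, the logits the `Trainer` passes to `compute_metrics` are produced with the reference answer fed to the decoder, so this score is not the same as scoring free-running `generate` output.

```python
def argmax_token_ids(logits):
    """Pick the highest-scoring vocabulary index at every position.

    logits is a nested list shaped [batch][sequence][vocab].
    """
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in sequence]
        for sequence in logits
    ]


# One sequence, three positions, toy vocabulary of four tokens.
toy_logits = [[[0.1, 2.0, -1.0, 0.3],
               [1.5, 0.2, 0.1, 0.0],
               [-0.5, -0.2, 0.9, 0.4]]]
print(argmax_token_ids(toy_logits))  # [[1, 0, 2]]
```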

# *Retrieve pretrained T5 model to finetune without context*

In [None]:
from transformers import T5ForConditionalGeneration

# Load the T5 model
t5 = model = T5ForConditionalGeneration.from_pretrained("t5-small")  # Change to a larger version if needed

# *Finetuning a T5 without context*

In [None]:
from transformers import Trainer, TrainingArguments

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",         # Output directory
    eval_strategy="epoch",   # Evaluate every epoch
    learning_rate=5e-5,
    per_device_train_batch_size=1, # Adjust based on memory
    per_device_eval_batch_size=1,
    num_train_epochs=4,            # Adjust based on performance
    weight_decay=0.01,
    save_total_limit=1,            # Keep only the most recent checkpoint
    logging_dir="./logs",          # Directory for logs
    logging_steps=60,
    save_steps=1000,
    gradient_accumulation_steps=4,
    fp16=True,
    report_to="none",
    eval_accumulation_steps=8
)

model = T5ForConditionalGeneration.from_pretrained('t5-small')
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_no_context_dataset,
    eval_dataset=test_no_context_dataset,
    compute_metrics=compute_metrics
)


# Train the model
trainer.train()
trainer.save_model("./fine_tuned_whitespace_no_context")
trainer.evaluate()

In [None]:
import shutil
shutil.copytree('./fine_tuned_whitespace_no_context', base_dir+'/fine_tuned_whitespace_no_context')

In [None]:
# for conserving credits
# leave training running, and disconnect when it finishes
from google.colab import runtime
runtime.unassign()

# *Retrieve a pretrained T5 to finetune with context*

In [None]:
from transformers import T5ForConditionalGeneration

# Reload the T5 model
model = T5ForConditionalGeneration.from_pretrained("t5-small")  # Change to a larger version if needed

# *Finetuning a T5 with context*

In [None]:
from transformers import Trainer, TrainingArguments

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",         # Output directory
    eval_strategy="epoch",   # Evaluate every epoch
    learning_rate=5e-5,
    per_device_train_batch_size=4, # Adjust based on memory
    per_device_eval_batch_size=4,
    num_train_epochs=4,            # Adjust based on performance
    weight_decay=0.01,
    save_total_limit=1,            # Keep only the most recent checkpoint
    logging_dir="./logs",          # Directory for logs
    logging_steps=60,
    save_steps=1000,
    gradient_accumulation_steps=4,
    fp16=True,
    report_to="none",
    eval_accumulation_steps=8
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_context_dataset,
    eval_dataset=test_context_dataset,
    compute_metrics=compute_metrics
)


# Train the model
trainer.train()
trainer.save_model("./fine_tuned_whitespace_context")
trainer.evaluate()
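
The two fine-tuning runs use different `per_device_train_batch_size` values (1 without context, 4 with context) but the same `gradient_accumulation_steps=4`, so their effective batch sizes differ. The rough calculation below assumes a single GPU and the 8,500-example training split. The helper names are hypothetical, and the step count is only an approximation because the `Trainer`'s own rounding may differ slightly.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_optimizer_steps(num_examples, per_device_batch, grad_accum_steps,
                           epochs, num_devices=1):
    """Approximate total optimizer updates over all epochs."""
    per_update = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return math.ceil(num_examples / per_update) * epochs


NUM_TRAIN = 8_500  # 85% of the 10,000 selected examples

for name, batch in [("no-context run", 1), ("context run", 4)]:
    size = effective_batch_size(batch, grad_accum_steps=4)
    steps = approx_optimizer_steps(NUM_TRAIN, batch, grad_accum_steps=4, epochs=4)
    print(f"{name}: effective batch size {size}, about {steps} optimizer steps")
```

With the same learning rate and epoch count, the no-context model receives roughly four times as many parameter updates. Keep this in mind when comparing the two models.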



In [None]:
import shutil
shutil.copytree('./fine_tuned_whitespace_context', base_dir+'/fine_tuned_whitespace_context')

In [None]:
# for conserving credits
# leave training running, and disconnect when it finishes
from google.colab import runtime
runtime.unassign()

# *Evaluating models*

Load models from Google Drive

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

context_model = T5ForConditionalGeneration.from_pretrained(f"{base_dir}/fine_tuned_whitespace_context")
no_context_model = T5ForConditionalGeneration.from_pretrained(f"{base_dir}/fine_tuned_whitespace_no_context")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

Reload our data in case the Colab session is started from this point

In [None]:
import json
from datasets import Dataset

# Load our data
file_name = 'ALL_CONTEXT_DATA_1.json'
with open(file_name, 'r') as f:
    all_context_data = json.load(f)
print(f"Loaded {len(all_context_data)} records from {file_name}")

In [None]:
# Get smaller subset to evaluate on
no_context_dataset = Dataset.from_dict({
    "input": [f"Narrative: {item['narrative']} Question: {item['question']}" for item in all_context_data],
    "target": [item['answer'] for item in all_context_data],
})
context_dataset = Dataset.from_dict({
    "input": [f"Narrative: {item['narrative']} Context: {item['context']} Question: {item['question']}" for item in all_context_data],
    "target": [item['answer'] for item in all_context_data],
})
context_dataset = context_dataset.shuffle(seed=42).select(range(1_000))
print(context_dataset[0])
no_context_dataset = no_context_dataset.shuffle(seed=42).select(range(1_000))
print(no_context_dataset[0])
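
The two prompt formats used throughout the notebook differ only in the `Context:` segment. The sketch below puts both templates in one function so they stay in sync. `build_input` is a hypothetical helper, and the record is a placeholder, not a row from the dataset.

```python
def build_input(item, with_context):
    """Format one record the same way the notebook's f-strings do."""
    if with_context:
        return (f"Narrative: {item['narrative']} Context: {item['context']} "
                f"Question: {item['question']}")
    return f"Narrative: {item['narrative']} Question: {item['question']}"


# Placeholder record for illustration only (not taken from the dataset).
toy_item = {
    "narrative": "Sam missed the bus.",
    "context": "Buses run on a fixed schedule.",
    "question": "Why did Sam walk to school?",
    "answer": "Because he missed the bus.",
}
print(build_input(toy_item, with_context=True))
print(build_input(toy_item, with_context=False))
```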

In [None]:
!pip install evaluate
!pip install tqdm
import evaluate
from tqdm import tqdm

In [None]:
!pip install rouge_score
!pip install git+https://github.com/google-research/bleurt.git
rouge_metric = evaluate.load("rouge")
bleurt_metric = evaluate.load("bleurt")

Code for metrics

In [None]:
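# F1 Score
# Calculates the harmonic mean of precision and recall based on word overlap.
from collections import Counter

def compute_f1_score(predictions, references):
    f1_scores = []
    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        # Words shared by prediction and reference, counted with multiplicity
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        precision = overlap / len(pred_tokens) if pred_tokens else 0.0
        recall = overlap / len(ref_tokens) if ref_tokens else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores) * 100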

# Exact Match (EM)
# Measures the percentage of predictions that exactly match the references.

def compute_exact_match(predictions, references):
    return sum([p.strip() == r.strip() for p, r in zip(predictions, references)]) / len(references) * 100

In [None]:
def evaluate_model(predictions, references, bleurt_metric, rouge_metric):
    """
    Evaluate predictions using BLEURT, ROUGE, Exact Match, and F1 Score.

    Args:
        predictions (list): List of model-generated answers.
        references (list): List of ground-truth answers.
        bleurt_metric (evaluate.Metric): BLEURT metric instance.
        rouge_metric (evaluate.Metric): ROUGE metric instance.

    Returns:
        dict: Dictionary containing BLEURT, ROUGE, EM, and F1 scores.
    """
    # BLEURT scores
    bleurt_results = bleurt_metric.compute(predictions=predictions, references=references)
    bleurt_scores = bleurt_results["scores"]
    avg_bleurt = sum(bleurt_scores) / len(bleurt_scores)

    # ROUGE scores
    rouge_results = rouge_metric.compute(predictions=predictions, references=references)
    rouge_l_f1 = rouge_results["rougeLsum"]

    # Exact Match (EM) Score
    em_score = compute_exact_match(predictions, references)

    # F1 Score
    f1_score = compute_f1_score(predictions, references)

    return {
        "BLEURT Average": avg_bleurt,
        "ROUGE-L F1": rouge_l_f1 * 100,  # Convert to percentage
        "Exact Match": em_score,
        "F1 Score": f1_score,
    }

def generate_predictions(model, tokenizer, dataset, max_length=256):
    """
    Generate predictions for evaluation.
    """
    predictions = []
    references = []
    model.eval()  # Set model to evaluation mode

    for item in tqdm(dataset):
        # Tokenize input dynamically and move it to the model's device
        inputs = tokenizer(item["input"], return_tensors="pt", truncation=True, padding=True).to(model.device)

        # Generate predictions (passes both input_ids and attention_mask)
        outputs = model.generate(**inputs, max_length=max_length)
        prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)

        predictions.append(prediction)
        references.append(item["target"])

    return predictions, references
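
For intuition about the ROUGE-L number reported by `evaluate_model`, here is a simplified sketch of ROUGE-L F1 based on the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization, which is simpler than what the `evaluate` ROUGE metric does, so the values will not match the library exactly. For single-line answers, `rougeLsum` should behave much like plain `rougeL`. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1 on lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up strings for illustration: four of five tokens align in order.
print(round(rouge_l_f1("she wanted to buy food", "she wanted to eat food"), 4))
```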

Results for combinations of models and given context

In [None]:
no_context_model.to("cuda")
context_model.to("cuda")

# Shorthand: NC = no context, C = context, NCM = model trained without context, CM = model trained with context
NCM_given_NC_predictions, NCM_given_NC_references = generate_predictions(no_context_model, tokenizer, no_context_dataset)
NCM_given_C_predictions, NCM_given_C_references = generate_predictions(no_context_model, tokenizer, context_dataset)
CM_given_NC_predictions, CM_given_NC_references = generate_predictions(context_model, tokenizer, no_context_dataset)
CM_given_C_predictions, CM_given_C_references = generate_predictions(context_model, tokenizer, context_dataset)

# Evaluate the models
NCM_given_NC_results = evaluate_model(NCM_given_NC_predictions, NCM_given_NC_references, bleurt_metric, rouge_metric)
NCM_given_C_results = evaluate_model(NCM_given_C_predictions, NCM_given_C_references, bleurt_metric, rouge_metric)
CM_given_NC_results = evaluate_model(CM_given_NC_predictions, CM_given_NC_references, bleurt_metric, rouge_metric)
CM_given_C_results = evaluate_model(CM_given_C_predictions, CM_given_C_references, bleurt_metric, rouge_metric)

print("No Context Model Given No Context Results:")
print(f"BLEURT Average Score: {NCM_given_NC_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {NCM_given_NC_results['ROUGE-L F1']:.2f}%")
print()

print("No Context Model Given Context Results:")
print(f"BLEURT Average Score: {NCM_given_C_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {NCM_given_C_results['ROUGE-L F1']:.2f}%")
print()

print("Context Model Given No Context Results:")
print(f"BLEURT Average Score: {CM_given_NC_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {CM_given_NC_results['ROUGE-L F1']:.2f}%")
print()

print("Context Model Given Context Results:")
print(f"BLEURT Average Score: {CM_given_C_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {CM_given_C_results['ROUGE-L F1']:.2f}%")
print()

In [None]:
# Reprint all results, including Exact Match and F1, without rerunning the previous cell (which takes over an hour)
print("No Context Model Given No Context Results:")
print(f"BLEURT Average Score: {NCM_given_NC_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {NCM_given_NC_results['ROUGE-L F1']:.2f}%")
print(f"Exact Match Score: {NCM_given_NC_results['Exact Match']:.2f}%")
print(f"F1 Score: {NCM_given_NC_results['F1 Score']:.2f}%")
print()

print("No Context Model Given Context Results:")
print(f"BLEURT Average Score: {NCM_given_C_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {NCM_given_C_results['ROUGE-L F1']:.2f}%")
print(f"Exact Match Score: {NCM_given_C_results['Exact Match']:.2f}%")
print(f"F1 Score: {NCM_given_C_results['F1 Score']:.2f}%")
print()

print("Context Model Given No Context Results:")
print(f"BLEURT Average Score: {CM_given_NC_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {CM_given_NC_results['ROUGE-L F1']:.2f}%")
print(f"Exact Match Score: {CM_given_NC_results['Exact Match']:.2f}%")
print(f"F1 Score: {CM_given_NC_results['F1 Score']:.2f}%")
print()

print("Context Model Given Context Results:")
print(f"BLEURT Average Score: {CM_given_C_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {CM_given_C_results['ROUGE-L F1']:.2f}%")
print(f"Exact Match Score: {CM_given_C_results['Exact Match']:.2f}%")
print(f"F1 Score: {CM_given_C_results['F1 Score']:.2f}%")
print()
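
The four result dictionaries can also be laid out side by side. The sketch below uses a hypothetical helper `format_results_table` and placeholder zeros in place of real scores, which should be replaced with the dictionaries returned by `evaluate_model`.

```python
def format_results_table(results_by_setting):
    """Render {setting: {metric: value}} as an aligned plain-text table."""
    metrics = list(next(iter(results_by_setting.values())).keys())
    name_width = max(len("Setting"), max(len(name) for name in results_by_setting))
    header = " | ".join([f"{'Setting':<{name_width}}"] + [f"{m:>14}" for m in metrics])
    lines = [header, "-" * len(header)]
    for name, scores in results_by_setting.items():
        cells = [f"{scores[m]:>14.4f}" for m in metrics]
        lines.append(" | ".join([f"{name:<{name_width}}"] + cells))
    return "\n".join(lines)


# Placeholder zeros only; these are not results from the notebook.
placeholder = {"BLEURT Average": 0.0, "ROUGE-L F1": 0.0, "Exact Match": 0.0, "F1 Score": 0.0}
table = format_results_table({
    "NCM given NC": placeholder,
    "NCM given C": placeholder,
    "CM given NC": placeholder,
    "CM given C": placeholder,
})
print(table)
```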

Pretrained results

In [None]:
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

t5.to("cuda")
# no_context_model.to("cuda")
t5_predictions, t5_references = generate_predictions(t5, tokenizer, no_context_dataset)
t5_results = evaluate_model(t5_predictions, t5_references, bleurt_metric, rouge_metric)

In [None]:
print("Pretrained T5-small Given No Context Results:")
print(f"BLEURT Average Score: {t5_results['BLEURT Average']:.4f}")
print(f"ROUGE-L F1 Score: {t5_results['ROUGE-L F1']:.2f}%")
print(f"Exact Match Score: {t5_results['Exact Match']:.2f}%")
print(f"F1 Score: {t5_results['F1 Score']:.2f}%")
print()

# *Human Eval*
We also hand-evaluated some of the model outputs to supplement the automatic metrics.
- Here is a link to [the notebook that made the spreadsheet](https://colab.research.google.com/drive/166EPUJBX61T4QMm7JF3Jt2KOtlePo6dv?usp=sharing)
- Here is a link to [the google spreadsheet we filled in](https://docs.google.com/spreadsheets/d/1Y30Vr81zzLvb_XWCEKM1OiXPq_ToM3CVctbVE6fEKME/edit?usp=sharing)

# *Demo*

In [None]:
!pip install datasets
from datasets import load_dataset
import textwrap # for nice formatting

In [None]:
# load Stony Brook TellMeWhy and find the bad Morocco example
tmw_dataset = load_dataset("StonyBrookNLP/tellmewhy")   # provided dataset

In [None]:
contains_morocco = lambda data: "Morocco" in data["answer"]
morocco_example = list(filter(contains_morocco, tmw_dataset["train"]))[0]

def print_par_neatly(text):
    # Wrap long paragraphs at 80 characters for readable output
    print(textwrap.fill(text, width=80))

print("Morocco example we mentioned at the presentation:")
print(f"\n{'Narrative':10}: ")
print_par_neatly(morocco_example['narrative'])
print(f"\n{'Question':10}: {morocco_example['question']}")
print(f"\n{'Answer':10}: {morocco_example['answer']}")


In [None]:
# get random example
SEED = 1014 # ask for a number
random_data = no_context_dataset.shuffle(seed=SEED)[0]

# print data
print("Shuffled data example:")
print(f"\n{'Concatted input'}: ")
print_par_neatly(random_data['input'])
print(f"\n{'Target'}: {random_data['target']}")

# generate output from both our models and a pretrained t5-small
# (assumes all three models are on the same device)
input_ids = tokenizer(random_data['input'], return_tensors="pt")["input_ids"].to(context_model.device)
context_output = tokenizer.decode(context_model.generate(input_ids)[0], skip_special_tokens=True)
no_context_output = tokenizer.decode(no_context_model.generate(input_ids)[0], skip_special_tokens=True)
t5_output = tokenizer.decode(t5.generate(input_ids)[0], skip_special_tokens=True)

print("\nPretrained T5-small output:")
print(t5_output)
print()

print("No context model output:")
print(no_context_output)
print()

print("Context model output:")
print(context_output)
print()