# Text-as-Data Coursework Introduction

The TaD coding coursework assesses your ability to apply text processing techniques to a multi-class classification problem.

Your work will be submitted through a Moodle quiz. For each question, you should submit your text answer (providing the required information) separately from your code.

## The Task

A museum records organisation has a large set of records that need to be assigned to one of five institutions (listed in the table below). They have provided a small set of data to be used as the training, validation, and test sets. Our goal is to build a classifier that can assign an unseen record to the correct institution.

| Class Index | Institution                                |
|-------------|--------------------------------------------|
| 0           | National Maritime Museum                  |
| 1           | National Railway Museum                   |
| 2           | Royal Botanic Gardens, Kew                |
| 3           | Royal College of Physicians of London     |
| 4           | Shakespeare Birthplace Trust              |

The dataset can be downloaded from the following link: [Download Dataset](https://tinyurl.com/tadarchives)

## Generative AI Usage Policy

You are free to use generative AI in any way during this assessed exercise. Please note your usage in the final question. Material from this coursework may appear on the final exam.

## Questions

### Q1: Training Data Cleaning [9 marks]

Download and load the dataset. There are some issues with the training split of the data that would stop it from being used to train a classifier. Report all issues and how you fixed them.

---

### Q2: Exploration [5 marks]

Once the training set has been fixed, report the following:

- The sample counts for the training, validation, and test sets
- The percentage splits for training, validation, and test sets
- The minimum and maximum length (in characters) of the texts, reported separately for the training, validation, and test sets
- The most frequent five tokens in each class (after tokenizing with `text_pipeline_spacy` from Lab 2)

---

### Q3: Prompting with a Large Language Model [10 marks]

A colleague has tried prompting a large language model (Llama-3.1-8B-Instruct) to classify each of the records in the training set. They evaluated three different prompt templates and saved the results to the provided files.

Calculate the following for each prompt template:

- Accuracy
- Macro precision
- Macro recall
- Macro F1

Comment on the results. Consider any invalid output from the LLM as predicting a sixth hypothetical class.

---

### Q4: Fine-tune a Transformer [10 marks]

Fine-tune a `bert-base-uncased` transformer model using the training set. You should use an `AutoModelForSequenceClassification` and the HuggingFace `Trainer`. Use the following hyperparameters:

- **Epochs:** 8
- **Learning rate:** 5e-5
- **Batch size:** 8

Evaluate on the validation set. Report the following:

- Per-class precision, recall, and F1 score
- Accuracy
- Macro precision
- Macro recall
- Macro F1 score

Don't be surprised by poor performance (see the next question).

---

### Q5: A Problem with the Validation Set [5 marks]

There is an issue with the validation set which causes poor performance.

- Provide the confusion matrix.
- Describe the problem, how you identified it, and how you fixed it.

---

### Q6: Hyperparameter Tuning [12 marks]

Train and evaluate several fine-tuned transformer models using the corrected training and validation sets. Try the four base models listed below. Use the following hyperparameters:

- **Epochs:** 8
- **Learning rate:** 5e-5
- **Batch size:** 8

Base models to try:

- `bert-base-uncased`
- `roberta-base`
- `distilbert-base-uncased`
- `microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract`

For each base model, we want to keep the best model found during training and save it for the final analysis. For example, if the model after 3 epochs is the best performing on the validation set (by macro-F1), we want to keep that one.

You should investigate the `load_best_model_at_end` parameter for the `Trainer` (which does require other parameters).

Evaluate each fine-tuned model on the validation set. Report the following:

- Per-class precision, recall, and F1 score
- Accuracy
- Macro precision
- Macro recall
- Macro F1 score

Comment on the performance of each model.

---

### Q7: Final Evaluation and Deployment [6 marks]

Load the best model (based on macro-F1 on the validation set) that was saved in the previous question using a `text-classification` pipeline. Evaluate the best of the four fine-tuned models on the testing set.

- State which model you used.
- Report the per-class precision, recall, and F1 score.
- Report accuracy, macro precision, macro recall, and macro F1 score.
- Comment on the performance and discuss whether the quality is high enough to be deployed for the client.

---

### Q8: Generative AI Usage [1 mark]

Report on whether and how you used generative AI in this assignment.



In [None]:
# !pip install fuzzywuzzy python-Levenshtein transformers datasets evaluate

In [1]:
import os
import sys

# Check if running in Google Colab
try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Set the path based on the environment
if IN_COLAB:
    drive.mount('/content/drive')
    path = '/content/drive/MyDrive/Projects/TasD/'
else:
    # Adjust the local path to your project folder
    path = os.path.expanduser('./')

# Ensure the path exists
if not os.path.exists(path):
    os.makedirs(path)

print(f"Using path: {path}")


Using path: ./


In [2]:
import os
import zipfile
import requests
from io import BytesIO

# Define the required filenames
required_files = [
    "dataset.json",
    "llm_prompt_template_1.json",
    "llm_prompt_template_2.json",
    "llm_prompt_template_3.json"
]

# Check for missing files
missing_files = [file for file in required_files if not os.path.exists(os.path.join(path, file))]

# Download and extract if any file is missing
if missing_files:
    print(f"Missing files: {', '.join(missing_files)}")
    print("Downloading and extracting the dataset...")

    # URL to the dataset
    url = "https://tinyurl.com/tadarchives"
    response = requests.get(url)

    if response.status_code == 200:
        # Extract ZIP from response content
        with zipfile.ZipFile(BytesIO(response.content)) as zip_ref:
            zip_ref.extractall(path)
        print("Download and extraction complete.")
    else:
        print("Failed to download the dataset. Please check the URL or your internet connection.")
else:
    print("All required files are present.")


All required files are present.


In [3]:
import json

In [4]:
with open(path+"dataset.json", "r") as file:
    data = json.load(file)

In [5]:
# Standardized keys
CONTENT_KEYS = ["text", "description", "content"]
LABEL_KEYS = ["label", "labl","key"]

### Q1: Training Data Cleaning [9 marks]

Download and load the dataset. There are some issues with the training split of the data that would stop it from being used to train a classifier. Report all issues and how you fixed them.

---


## **Data Cleaning Process**  

### **Objective:**  
The goal of the `clean_entry` function is to clean and standardize dataset entries by:  
- Consolidating multiple alternative keys for **content** (`text`, `description`, `content`) into a single standardized `content` key.  
- Consolidating multiple alternative keys for **labels** (`label`, `labl`, `key`) into a standardized `label` key.  
- Accepting only **string** labels: a label stored as any other type (for example a number or `null`) is treated as missing.  
- Checking for **unexpected keys** — any keys not related to `id`, `content`, or `label` — and issuing warnings if such keys are found.  
- **Removing entries** where `content` or `label` is missing.  

### **Key Steps:**  

1. **Checking for unexpected keys:**  
   - The function defines the allowed keys:  
     - `id`  
     - The standardized `content` and `label` keys  
     - All possible content keys (`text`, `description`, `content`)  
     - All possible label keys (`label`, `labl`, `key`)  
   - Any additional keys are flagged as unexpected and a **warning** is printed. They are not copied into the cleaned entry, which keeps only `id`, `content`, and `label`.  

2. **Standardizing the content key:**  
   - The function looks for the first available content key (`text`, `description`, or `content`).  
   - If none are found, `content` is set to `None`.  
   - **Entries with null content are removed.**  

3. **Standardizing the label key:**  
   - The function looks for the first available label key (`label`, `labl`, or `key`) whose value is a string.  
   - If no such key is found, for example because the label is a number or `null`, `label` is set to `None`.  
   - **Entries with null labels are removed.**  

4. **Fuzzy Matching for Label Standardization:**  
   - The function compares the extracted label against a predefined list of valid labels.  
   - If a close enough match (score ≥ 80%) is found, the label is replaced with the standardized version.  
   - Otherwise, a warning is issued, and the original label is kept.  

5. **Removing Invalid Entries:**  
   - If an entry has `None` for either `content` or `label`, it is removed from the dataset.  

6. **Returning Cleaned Entries:**  
   - A cleaned entry containing only `id`, `content`, and `label` is returned.  
   - Any errors during processing are caught and logged, and `None` is returned for problematic entries.   
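
The fuzzy-matching step can be illustrated without any third-party dependency. The sketch below swaps in the standard library's `difflib.SequenceMatcher` for `fuzzywuzzy`, so its scores are on a 0 to 1 scale and may differ slightly from `fuzz.ratio`. The helper name `match_label`, the lower-casing, and the example strings are assumptions made for illustration only.

```python
from difflib import SequenceMatcher

VALID_LABELS = [
    "National Maritime Museum",
    "National Railway Museum",
    "Royal Botanic Gardens, Kew",
    "Royal College of Physicians of London",
    "Shakespeare Birthplace Trust",
]


def match_label(raw_label, valid_labels=VALID_LABELS, threshold=0.8):
    """Return the closest valid label, or None if no candidate reaches the threshold."""
    cleaned = raw_label.strip().lower()
    best_label, best_score = None, 0.0
    for candidate in valid_labels:
        # ratio() is 1.0 for identical strings and falls towards 0.0 as they diverge
        score = SequenceMatcher(None, cleaned, candidate.lower()).ratio()
        if score > best_score:
            best_label, best_score = candidate, score
    return best_label if best_score >= threshold else None


# Made-up inputs: a misspelt label and a label that matches nothing
print(match_label("National Maritme Museum "))
print(match_label("Unknown Archive"))
```

Returning `None` for a poor match, rather than keeping the raw label, is a design choice of this sketch. It makes unmatched labels easy to count and inspect before deciding whether to drop them.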

---


In [6]:
from fuzzywuzzy import fuzz


def clean_entry(entry):
    """Cleans a single record by standardizing keys, handling missing values, and checking for unexpected keys."""
    valid_labels = [
        "National Maritime Museum",
        "Shakespeare Birthplace Trust",
        "National Railway Museum",
        "Royal Botanic Gardens, Kew",
        "Royal College of Physicians of London"
    ]
    try:
        # Allowed keys for a valid entry
        allowed_keys = set(["id", "content", "label"] + CONTENT_KEYS + LABEL_KEYS)

        # Check for unexpected keys BEFORE cleaning
        unexpected_keys = set(entry.keys()) - allowed_keys
        if unexpected_keys:
            print(f"Warning: Unexpected keys found {unexpected_keys} in entry with id {entry.get('id')}")

        # Standardize content key
        entry["content"] = next(
            (
                entry[key]
                for key in CONTENT_KEYS
                if key in entry and isinstance(entry[key], str)
            ),
            None,
        )

        # Standardize label key
        entry["label"] = next(
            (
                entry[key]
                for key in LABEL_KEYS
                if key in entry and isinstance(entry[key], str)
            ),
            None,
        )

        if entry["content"] is None or entry["label"] is None:
            print(
                f"Removing entry with id {entry.get('id')} due to invalid content or label"
            )
            return None

        raw_label = entry["label"]
        matched_label = None
        highest_score = 0

        for valid_label in valid_labels:
            score = fuzz.ratio(raw_label.strip(), valid_label.strip())
            if score > highest_score:
                highest_score = score
                matched_label = valid_label

        # If the highest score exceeds a threshold (e.g., 80%), accept the matched label
        if highest_score >= 80:  # Adjust the threshold as needed
            entry["label"] = matched_label
        else:
            print(
                f"Warning: Label '{raw_label}' did not match any valid label well. Keeping original."
            )

        cleaned_entry = {
            "id": entry.get("id"),
            "label": entry["label"],
            "content": entry["content"],
        }

        return cleaned_entry
    except Exception as e:
        print(f"Error processing entry with id {entry.get('id')}: {e}")
        return None

In [7]:
def clean_split(entries):
    """Clean each entry exactly once and drop the ones clean_entry rejects."""
    cleaned = (clean_entry(entry) for entry in entries)
    return [entry for entry in cleaned if entry is not None]


for split in ("train", "val", "test"):
    data[split] = clean_split(data[split])

Removing entry with id 08838524-6d11-729c-39d8-fa4d712b7476 due to invalid content or label
Removing entry with id 7ab43067-856c-5cee-8574-ac59a24de44b due to invalid content or label
Removing entry with id 598e3e51-2390-addf-4363-09503488d4a5 due to invalid content or label
Removing entry with id 4cbde7c0-1a7a-ef93-5720-a4b6b5f9c49f due to invalid content or label
Removing entry with id b2e99517-4fdc-9e57-4557-580418e61c82 due to invalid content or label
Removing entry with id 7148fce2-2315-1826-2615-05abd9b0b806 due to invalid content or label
Removing entry with id d438b1f5-6dfa-24bc-7f0e-fc0161ee7be8 due to invalid content or label
Removing entry with id b7a9494b-2738-ec6a-e5b3-523d2cb27b74 due to invalid content or label
Removing entry with id 26dc19c0-9a55-2cae-20f9-a782a56d5c76 due to invalid content or label
Removing entry with id ccd1b1b5-e044-98ca-d21e-7315c1e702f7 due to invalid content or label
Removing entry with id 2e596900-d420-23e4-f5d1-c56cfc0b2352 due to invalid conte

In [8]:
with open(path+"cleaned_dataset.json", "w") as file:
    json.dump(data, file, indent=4)

print("Dataset cleaned and saved as cleaned_dataset.json")

Dataset cleaned and saved as cleaned_dataset.json


In [9]:
with open(path+"cleaned_dataset.json") as f:
    data = json.load(f)

train_data = data["train"]
val_data = data["val"]
test_data = data["test"]

### Q2: Exploration [5 marks]

Once the training set has been fixed, report the following:

- The sample counts for the training, validation, and test sets
- The percentage splits for training, validation, and test sets
- The minimum and maximum length (in characters) of the texts, reported separately for the training, validation, and test sets
- The most frequent five tokens in each class (after tokenizing with `text_pipeline_spacy` from Lab 2)

---

In [10]:
# Sample counts
train_count = len(train_data)
val_count = len(val_data)
test_count = len(test_data)
total_count = train_count + val_count + test_count

# Percentage splits
train_pct = (train_count / total_count) * 100
val_pct = (val_count / total_count) * 100
test_pct = (test_count / total_count) * 100

# Display results
print("Sample Counts:")
print(f"Training set: {train_count}")
print(f"Validation set: {val_count}")
print(f"Test set: {test_count}")

print("\nPercentage Splits:")
print(f"Training set: {train_pct:.2f}%")
print(f"Validation set: {val_pct:.2f}%")
print(f"Test set: {test_pct:.2f}%")

Sample Counts:
Training set: 102
Validation set: 50
Test set: 50

Percentage Splits:
Training set: 50.50%
Validation set: 24.75%
Test set: 24.75%


In [11]:
# Function to calculate min/max lengths of content
def get_text_lengths(data):
    lengths = [len(entry["content"]) for entry in data if entry["content"]]
    return min(lengths, default=0), max(lengths, default=0)


# Get lengths for each set
train_min, train_max = get_text_lengths(train_data)
val_min, val_max = get_text_lengths(val_data)
test_min, test_max = get_text_lengths(test_data)

# Display results
print("\nText Lengths (in characters):")
print(f"Training set - Min: {train_min}, Max: {train_max}")
print(f"Validation set - Min: {val_min}, Max: {val_max}")
print(f"Test set - Min: {test_min}, Max: {test_max}")


Text Lengths (in characters):
Training set - Min: 163, Max: 2349
Validation set - Min: 154, Max: 2794
Test set - Min: 167, Max: 3479


In [12]:
import spacy
from collections import Counter

# Load spaCy model
nlp = spacy.load("en_core_web_sm")


# Tokenization using text_pipeline_spacy
def tokenize(text):
    tokens = []
    doc = nlp(text)
    for t in doc:
        if not t.is_stop and not t.is_punct and not t.is_space:
            tokens.append(t.lemma_.lower())
    return tokens

# Get most frequent tokens per class
def get_top_tokens_by_class(data, top_n=5):
    class_tokens = {}

    for entry in data:
        label = entry["label"]
        tokens = tokenize(entry["content"]) if entry["content"] else []
        if label not in class_tokens:
            class_tokens[label] = Counter()
        class_tokens[label].update(tokens)

    # Get the top N tokens for each class
    top_tokens = {
        label: counter.most_common(top_n) for label, counter in class_tokens.items()
    }
    return top_tokens

label_to_class_mapping = {
    'National Maritime Museum': '0',
    'National Railway Museum': '1',
    'Royal Botanic Gardens, Kew': '2',
    'Royal College of Physicians of London': '3',
    'Shakespeare Birthplace Trust': '4'
}

# Map each record id to its class index ("5" for labels outside the five institutions)
def create_label_mapping(train_data):
    return {
        entry["id"]: label_to_class_mapping.get(entry["label"], "5")  # Default to "5" for unknown labels
        for entry in train_data
        if "id" in entry and "label" in entry
    }

# Get top tokens for each dataset
train_top_tokens = get_top_tokens_by_class(train_data)
val_top_tokens = get_top_tokens_by_class(val_data)
test_top_tokens = get_top_tokens_by_class(test_data)

# Display results
print("\nTop 5 Tokens Per Class (Training set):")
for label, tokens in train_top_tokens.items():
    print(f"Class {label}: {tokens}")

label_mapping = create_label_mapping(train_data)


print(label_to_class_mapping)
print(label_mapping)


Top 5 Tokens Per Class (Training set):
Class Royal Botanic Gardens, Kew: [('letter', 25), ('paper', 12), ('include', 9), ('ridley', 9), ('journal', 8)]
Class Shakespeare Birthplace Trust: [('mr.', 10), ('account', 10), ('letter', 10), ('william', 7), ('bridges', 6)]
Class National Railway Museum: [('2000', 108), ('7200', 108), ('756', 107), ('gb', 106), ('water', 23)]
Class Royal College of Physicians of London: [('mr.', 17), ('seal', 15), ('plate', 14), ('college', 14), ('common', 13)]
Class National Maritime Museum: [('sir', 10), ('henry', 8), ('john', 7), ('enclosure', 7), ('land', 6)]
{'National Maritime Museum': '0', 'National Railway Museum': '1', 'Royal Botanic Gardens, Kew': '2', 'Royal College of Physicians of London': '3', 'Shakespeare Birthplace Trust': '4'}
{'467eb98e-1866-708e-03cd-9ab9f8a7f4e5': '2', 'b78f8473-e487-b3cb-be65-4f854d97228b': '2', 'ea15bfc9-9499-e9e2-b3e4-654714a795a5': '4', '03556d52-3c1e-f36e-97bd-1a09c87ef949': '1', 'f94bc0a8-a0a4-c96f-e3a8-9e71050ff05a'

In [13]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np
valid_classes = {"0", "1", "2", "3", "4"}
# Map LLM predictions to classes (invalid outputs go to class 5)
def map_llm_prediction(next_token):
    return int(next_token) if next_token in valid_classes else 5

# Load LLM results and match them to true labels
def load_llm_results(file_path, label_mapping):
    with open(path + file_path, "r") as file:
        data = json.load(file)

    results = []
    for entry in data:
        entry_id = entry.get("id")
        true_label = int(label_mapping.get(entry_id, 5))
        next_token = entry.get("next_token", "").strip()
        predicted_label = map_llm_prediction(next_token)
        # print(f"{entry_id}, next_token: {next_token}, (true_label:predicted_label): ({true_label}:{predicted_label})")

        results.append({"true_label": true_label, "predicted_label": predicted_label})

    return results

# Calculate accuracy, macro precision, recall, and F1 score
def calculate_metrics(results):
    true_labels = [result["true_label"] for result in results]
    predicted_labels = [result["predicted_label"] for result in results]

    # Accuracy is computed only on pairs where both labels are one of the five
    # real classes, so invalid LLM outputs (class 5) are excluded here rather
    # than counted as errors.
    valid_pairs = [
        (true, pred)
        for true, pred in zip(true_labels, predicted_labels)
        if true != 5 and pred != 5
    ]
    valid_true_labels = [true for true, _ in valid_pairs]
    valid_predicted_labels = [pred for _, pred in valid_pairs]
    accuracy = accuracy_score(valid_true_labels, valid_predicted_labels) if valid_pairs else 0.0

    # The macro scores do treat invalid output as a sixth class (5).
    # average="macro" already returns one averaged value per metric.
    macro_precision, macro_recall, macro_f1, _ = precision_recall_fscore_support(
        true_labels, predicted_labels, average="macro", labels=[0, 1, 2, 3, 4, 5], zero_division=0
    )

    return {
        "Accuracy": accuracy,
        "Macro Precision": macro_precision,
        "Macro Recall": macro_recall,
        "Macro F1": macro_f1
    }
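
Q3 asks for invalid LLM output to be treated as a sixth hypothetical class. As a rough, dependency-free illustration of what that means for the four reported metrics, the sketch below counts every invalid prediction as class 5 and averages over all six classes, scoring an undefined precision or recall as 0. The function name `macro_metrics` and the toy labels are assumptions for illustration, not values from the coursework data.

```python
def macro_metrics(true_labels, predicted_labels, classes=(0, 1, 2, 3, 4, 5)):
    """Accuracy plus macro precision, recall and F1 over a fixed list of classes."""
    pairs = list(zip(true_labels, predicted_labels))
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # A class with no predictions (or no true samples) scores 0
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append((precision, recall, f1))

    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "macro_precision": sum(p for p, _, _ in per_class) / n,
        "macro_recall": sum(r for _, r, _ in per_class) / n,
        "macro_f1": sum(f for _, _, f in per_class) / n,
    }


# Toy example: the third prediction was invalid output, so it is mapped to class 5
toy_true = [0, 1, 2, 3]
toy_pred = [0, 1, 5, 3]
print(macro_metrics(toy_true, toy_pred))
```

Under this convention an invalid output always counts as an error for accuracy, and class 5 pulls the macro averages down because it never has a true positive.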

In [14]:
# Load predictions for each prompt template
prompt1_results = load_llm_results("llm_prompt_template_1.json",label_mapping)
prompt2_results = load_llm_results("llm_prompt_template_2.json",label_mapping)
prompt3_results = load_llm_results("llm_prompt_template_3.json",label_mapping)

# Evaluate predictions
prompt1_metrics = calculate_metrics(prompt1_results)
prompt2_metrics = calculate_metrics(prompt2_results)
prompt3_metrics = calculate_metrics(prompt3_results)

# Display the results
for i, metrics in enumerate([prompt1_metrics, prompt2_metrics, prompt3_metrics], 1):
    print(f"Prompt {i} Metrics:")
    for metric, value in metrics.items():
        print(f"{metric}: {value}")
    print("\n" + "=" * 50 + "\n")

Prompt 1 Metrics:
Accuracy: 0.0
Macro Precision: 0.05333333333333334
Macro Recall: 0.16666666666666666
Macro F1: 0.08080808080808081


Prompt 2 Metrics:
Accuracy: 0.7843137254901961
Macro Precision: 0.44549604705801643
Macro Recall: 0.6380025440127742
Macro F1: 0.5087234738327103


Prompt 3 Metrics:
Accuracy: 0.7216494845360825
Macro Precision: 0.4693225321301613
Macro Recall: 0.5625067659914207
Macro F1: 0.46729988965283087




In [15]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

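# Six outputs: the five institutions plus index 5, which the label mapping
# assigns to any label it does not recognise.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=6
)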

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
from datasets import Dataset
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize_function(examples):
    tokenized_inputs = tokenizer(examples['content'], padding=True, truncation=True, max_length=512)

    # Ensure labels are correctly batched as lists
    tokenized_inputs["labels"] = examples['label']  # Pass the whole batch of labels directly

    return tokenized_inputs


# Create datasets
train_dataset = Dataset.from_dict({
    "content": [x["content"] for x in train_data],
    "label": [int(label_to_class_mapping.get(x["label"], 5)) for x in train_data]  # Default to 5 if label not found
})

val_dataset = Dataset.from_dict({
    "content": [x["content"] for x in val_data],
    "label": [int(label_to_class_mapping.get(x["label"], 5)) for x in val_data]
})

# Tokenize datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/102 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [17]:
import evaluate
import numpy as np

# Load each metric individually
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")
f1_metric = evaluate.load("f1")
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    precision = precision_metric.compute(
        predictions=predictions, references=labels, average="macro",
    )["precision"]
    recall = recall_metric.compute(
        predictions=predictions, references=labels, average="macro",
    )["recall"]
    f1 = f1_metric.compute(
        predictions=predictions, references=labels, average="macro",
    )["f1"]
    accuracy = accuracy_metric.compute(
        predictions=predictions, references=labels
    )["accuracy"]

    return {
        "accuracy": accuracy,
        "macro_precision": precision,
        "macro_recall": recall,
        "macro_f1": f1,
    }


In [18]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
model.to(device)

print(f"Model is on device: {model.device}")

Using device: cpu
Model is on device: cpu


In [19]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./bert_results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=8,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=200,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_macro_f1",
    no_cuda=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

In [20]:
import os
# The W&B API key is expected in the WANDB_API_KEY environment variable.
# Never hard-code it in the notebook. On Colab it can be read from userdata:
# from google.colab import userdata
# os.environ["WANDB_API_KEY"] = userdata.get('WANDB_API_KEY')

trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [None]:
results = trainer.evaluate()
print(results)

### Q5: A Problem with the Validation Set [5 marks]

There is an issue with the validation set which causes poor performance. Provide the confusion matrix. Describe the problem, how you identified it, and how you fixed it.
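
As a general aid for reading a confusion matrix, the sketch below builds one in plain Python and reports, for each true class, which predicted class received the most samples. A consistent off-diagonal pattern tends to point at the data rather than the model, whereas scattered errors look more like ordinary misclassification. The toy labels are made up and the helper names are assumptions, so this shows one possible pattern and makes no claim about what is actually wrong with the validation set.

```python
from collections import Counter


def confusion_matrix_rows(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in range(num_classes)] for t in range(num_classes)]


def dominant_prediction(matrix):
    """For each true class, the predicted class that received the most samples."""
    return [row.index(max(row)) for row in matrix]


# Toy labels: classes 1 and 2 are consistently predicted as each other
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 2, 2, 1, 1]

matrix = confusion_matrix_rows(toy_true, toy_pred, num_classes=3)
for row in matrix:
    print(row)
print(dominant_prediction(matrix))
```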

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Get predictions and true labels from the validation dataset
predictions = trainer.predict(val_dataset)

# Extract predicted labels (the model's predictions)
predicted_labels = predictions.predictions.argmax(axis=-1)

# True labels from the validation dataset
true_labels = predictions.label_ids

# Assuming you have predictions and true labels in the 'predictions' and 'true_labels' variables
conf_matrix = confusion_matrix(true_labels, predicted_labels)

# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix)
disp.plot(cmap=plt.cm.Blues)
plt.show()


### Q6: Hyperparameter Tuning [12 marks]

Train and evaluate several fine-tuned transformer models using the corrected training and validation sets. Try the four base models listed below. Use 8 epochs, a `learning_rate` of 5e-5 and a batch size of 8. Ideally, you would try different base models, learning rates, batch sizes, etc., but we will limit this to evaluating four different base models and keep the remaining hyperparameters static.

Base models to try: `bert-base-uncased`, `roberta-base`, `distilbert-base-uncased`, and `microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract`

This time, we want the best model found during the training process for each base model and to save it for the final analysis. For example, if the model after 3 epochs is the best performing on the validation set (by macro-F1), we want to keep that. You should investigate the `load_best_model_at_end` parameter for the `Trainer` (which does require other parameters).

Evaluate each fine-tuned model on the validation set. Report the per-class precision, recall and F1 score as well as the accuracy, macro precision, macro recall and macro F1 score. Comment on the performance of each model.
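
The "other parameters" that `load_best_model_at_end` depends on can be summarised as a plain dictionary of keyword arguments for `TrainingArguments`. This is a sketch under assumptions: `eval_strategy` is the name used in this notebook (older `transformers` releases call it `evaluation_strategy`), `eval_macro_f1` assumes the metrics function returns a `macro_f1` key, and `check_best_model_kwargs` is a hypothetical helper, not part of the library.

```python
# Keyword arguments intended for transformers.TrainingArguments
best_model_kwargs = {
    "eval_strategy": "epoch",                  # evaluate at the end of every epoch
    "save_strategy": "epoch",                  # checkpoint on the same schedule
    "load_best_model_at_end": True,            # reload the best checkpoint after training
    "metric_for_best_model": "eval_macro_f1",  # "eval_" + key returned by compute_metrics
    "greater_is_better": True,                 # a higher macro-F1 is better
}


def check_best_model_kwargs(kwargs):
    """Hypothetical helper: list likely configuration problems before training."""
    problems = []
    if kwargs.get("load_best_model_at_end"):
        if kwargs.get("eval_strategy") != kwargs.get("save_strategy"):
            problems.append("eval_strategy and save_strategy should match")
        if not kwargs.get("metric_for_best_model"):
            problems.append("metric_for_best_model is unset, so the loss would be used")
    return problems


print(check_best_model_kwargs(best_model_kwargs))
```

With settings like these, the model held by the `Trainer` after `train()` is the checkpoint with the best validation macro-F1, which need not be the one from the final epoch.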

In [None]:
def compute_metrics2(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(axis=-1)

    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }


In [None]:
training_args = TrainingArguments(
    output_dir="./results",  # Save outputs here
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=8,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=200,
    save_strategy="epoch",
    load_best_model_at_end=True,  # Load the best model based on evaluation metrics
    metric_for_best_model="eval_macro_f1",  # Select the best model based on macro F1 score
    no_cuda=False
)


In [None]:
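def train_and_evaluate(base_model_name, train_dataset, val_dataset, tokenizer):
    # Every base model has its own vocabulary, so load the matching tokenizer
    # unless the one passed in already belongs to this model.
    if tokenizer.name_or_path != base_model_name:
        tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    def retokenize(dataset):
        # Drop token ids produced by another tokenizer and encode the raw text again
        stale = [c for c in dataset.column_names if c not in ("content", "label")]
        return dataset.remove_columns(stale).map(
            lambda batch: tokenizer(batch["content"], truncation=True, max_length=512),
            batched=True,
        )

    # One output per institution
    model = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=5)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=retokenize(train_dataset),
        eval_dataset=retokenize(val_dataset),
        processing_class=tokenizer,
        compute_metrics=compute_metrics,
    )

    # Train the model
    trainer.train()

    # Evaluate the model on the validation set
    eval_results = trainer.evaluate()

    return model, eval_results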


In [None]:
base_models = [
    'bert-base-uncased',
    'roberta-base',
    'distilbert-base-uncased',
    'microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract'
]

# Assuming you have train_dataset, val_dataset, and tokenizer already defined
best_models = {}
eval_results = {}

for model_name in base_models:
    print(f"Training and evaluating model: {model_name}")

    # Train and evaluate the model
    model, results = train_and_evaluate(model_name, train_dataset, val_dataset, tokenizer)

    # Save the best model for each base model
    best_models[model_name] = model
    eval_results[model_name] = results

    print(f"Evaluation results for {model_name}: {results}")


In [None]:
for model_name, results in eval_results.items():
    print(f"Evaluation results for {model_name}:")
    print(f"  Accuracy: {results['eval_accuracy']:.4f}")
    print(f"  Macro Precision: {results['eval_macro_precision']:.4f}")
    print(f"  Macro Recall: {results['eval_macro_recall']:.4f}")
    print(f"  Macro F1: {results['eval_macro_f1']:.4f}")


### Q7: Final Evaluation and Deployment [6 marks]

Load the best model (based on macro-F1 on the validation set) that was saved in the previous question using a `text-classification` pipeline. Evaluate the best of the four fine-tuned models on the testing set.

State which model you used and report the per-class precision, recall and F1 score as well as the accuracy, macro precision, macro recall and macro F1 score. Comment on the performance and discuss whether the quality is high enough to be deployed for the client.
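
One detail worth keeping in mind when scoring a `text-classification` pipeline: if the model was created without an `id2label` mapping, the pipeline reports labels as strings of the form `LABEL_<index>`, which need converting back to integers before they can be compared with the integer class indices. A minimal sketch, where the helper name and the sample outputs are made up for illustration:

```python
def pipeline_label_to_index(label):
    """Convert a default pipeline label such as 'LABEL_3' to the integer 3."""
    prefix, _, index = label.rpartition("_")
    if prefix != "LABEL" or not index.isdigit():
        raise ValueError(f"Unexpected pipeline label: {label!r}")
    return int(index)


# Made-up pipeline outputs, in the shape the pipeline returns for a list of texts
sample_outputs = [
    {"label": "LABEL_1", "score": 0.91},
    {"label": "LABEL_4", "score": 0.62},
]
predicted = [pipeline_label_to_index(output["label"]) for output in sample_outputs]
print(predicted)
```

Raising on an unexpected label, rather than silently mapping it to a default class, makes a model configured with custom label names fail loudly instead of producing misleading scores.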

In [None]:
from transformers import pipeline

# Pick the base model with the highest validation macro-F1 from Q6
best_model_name = max(eval_results, key=lambda name: eval_results[name]["eval_macro_f1"])
best_model = best_models[best_model_name]
print(f"Best model: {best_model_name}")


In [None]:
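# Use the tokenizer that matches the chosen model rather than the BERT tokenizer loaded earlier
best_tokenizer = AutoTokenizer.from_pretrained(best_model.name_or_path)
text_classifier = pipeline("text-classification", model=best_model, tokenizer=best_tokenizer)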


In [None]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Build the test inputs from the cleaned test split, using the same label mapping as training
test_texts = [x["content"] for x in test_data]
true_labels = [int(label_to_class_mapping.get(x["label"], 5)) for x in test_data]

# The pipeline returns default label names such as "LABEL_3"; convert them back to integers
outputs = text_classifier(test_texts, truncation=True, max_length=512)
predictions = [int(output["label"].split("_")[-1]) for output in outputs]

# Per-class metrics for the five institutions
class_ids = [0, 1, 2, 3, 4]
precision, recall, f1, _ = precision_recall_fscore_support(
    true_labels, predictions, labels=class_ids, average=None, zero_division=0
)
accuracy = accuracy_score(true_labels, predictions)

# Macro scores are the unweighted means of the per-class scores
macro_precision = precision.mean()
macro_recall = recall.mean()
macro_f1 = f1.mean()

# Print results
print(f"Accuracy: {accuracy:.4f}")
print(f"Per-class Precision: {precision}")
print(f"Per-class Recall: {recall}")
print(f"Per-class F1: {f1}")
print(f"Macro Precision: {macro_precision:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")
print(f"Macro F1: {macro_f1:.4f}")
