# **Assignment 5: Transformers and Natural Language Processing (Part 2, V2)**
# **Note: there is a separate submission portal for part 2 on Moodle**

## *YOUR FULL NAME HERE*
Netid: Your netid here

Note: this assignment falls under collaboration Mode 2: Individual Assignment – Collaboration Permitted. Please refer to the syllabus for additional information.


**Problem 3: Text Classification with a Large Language Model (30 points)**  In this problem you will use a modern large language model to classify text. Specifically, you will load a pre-trained BERT-based encoder (the code below uses TinyBERT, a smaller variant of the BERT model that we discussed in class) and then fine-tune it to solve a custom text classification problem: classifying news articles into one of four categories (world, sports, business, sci/tech).

To assist with this exercise, we will use several libraries from Hugging Face, an organization that provides many widely used libraries to support deep learning applications ([link](https://huggingface.co/)).

Below is a code skeleton for completing this task, with comments to guide you through the process. Please complete the code below and submit a PDF of your completed code with results.  *There will be a separate submission portal for this question on Moodle.  Although your code will be reviewed, you will be graded primarily on the correctness of your output.*

Although the code skeleton below provides useful guidance and hints for filling in the code, I highly recommend that you review the text classification tutorial provided by Hugging Face before, or while, you complete this exercise ([tutorial link](https://huggingface.co/docs/transformers/en/tasks/sequence_classification)).


**Installations:** Make sure you use pip or conda to install the following libraries for this exercise: datasets, evaluate, accelerate, transformers, numpy, and torch.

Google Colab already has torch and numpy, but you will still need to install transformers, datasets, evaluate, and accelerate. You can copy and paste the line below into Colab to install them.

*pip install transformers datasets evaluate accelerate*

In [None]:
# Necessary Imports
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import torch

# Load the AG News dataset using load_dataset
dataset = load_dataset("ag_news")
train_dataset = dataset['train']
test_dataset = dataset['test']

# Load the tokenizer for a BERT-based model, "TinyBERT"
# (the number of labels is specified when loading the model, not the tokenizer)
tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

# Define a function to tokenize the data
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)

# TODO: Tokenize the training and testing data. Hint: use .map to apply the tokenize function above to your train and test datasets
train_dataset = ...
test_dataset = ...

# Load the TinyBERT model. We use TinyBERT, which requires substantially less
# compute than BERT, with only a modest reduction in accuracy.
# num_labels=4 sets the size of the classification head (one output per category).
model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=4)
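
To make the `padding="max_length"` and `truncation=True` arguments above more concrete, here is a toy sketch of what padding and truncating to a fixed length do. It is not the Hugging Face implementation: it uses a hypothetical whitespace "tokenizer" and a hypothetical helper `pad_or_truncate`, whereas the real tokenizer splits text into WordPiece tokens, adds special tokens such as `[CLS]` and `[SEP]`, and returns integer IDs.

```python
# Toy illustration only: a whitespace split stands in for the real tokenizer.
# All names here (PAD, pad_or_truncate) are hypothetical.
PAD = "[PAD]"

def pad_or_truncate(text, max_length):
    tokens = text.split()[:max_length]            # truncation: keep at most max_length tokens
    attention_mask = [1] * len(tokens)            # 1 marks a real token
    n_pad = max_length - len(tokens)
    tokens = tokens + [PAD] * n_pad               # padding: fill up to max_length
    attention_mask = attention_mask + [0] * n_pad # 0 marks padding the model should ignore
    return {"tokens": tokens, "attention_mask": attention_mask}

short = pad_or_truncate("stocks rally on earnings", max_length=6)
long = pad_or_truncate("the team won the final game of the season", max_length=6)

print(short["tokens"])
print(short["attention_mask"])
print(long["tokens"])
```

Every example ends up with the same length, which is what allows examples to be stacked into fixed-size batches during training.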

In [None]:

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',             # output directory
    num_train_epochs=3,                 # number of training epochs
    per_device_train_batch_size=8,      # batch size for training
    per_device_eval_batch_size=16,      # batch size for evaluation
    warmup_steps=500,                   # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                  # strength of weight decay
    logging_dir='./logs',               # directory for storing logs
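    logging_steps=100,                  # log the training loss every 100 steps
    evaluation_strategy="epoch"         # evaluate at the end of each epoch (newer transformers releases name this eval_strategy)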
)
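
To give a feel for what `warmup_steps=500` does, the sketch below computes a learning rate that rises linearly during warmup and then decays linearly to zero. It rests on several assumptions that are not stated in the skeleton: that the `Trainer` uses its default linear scheduler and default base learning rate of 5e-5, that training runs on a single device, and that the AG News training split has 120,000 examples. The function name `linear_warmup_lr` is hypothetical.

```python
import math

def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumed values (see the paragraph above); adjust to your own setup.
num_train_examples = 120_000
batch_size = 8
num_epochs = 3
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

for step in [0, 250, 500, total_steps // 2, total_steps]:
    lr = linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=total_steps)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

Under these assumptions the warmup covers only a small fraction of training, so most updates happen during the decay phase.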


In [None]:
# TODO: Function to compute accuracy of the model
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = ...
    return {'accuracy': (predictions == labels).mean()}
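
The return line above relies on NumPy's elementwise comparison: `predictions == labels` yields a boolean array, and its mean is the fraction of correct predictions. A minimal sketch on made-up values (the arrays below are illustrative, not model output):

```python
import numpy as np

# Hypothetical predicted class indices and true labels for five examples.
predictions = np.array([0, 2, 1, 3, 3])
labels = np.array([0, 2, 3, 3, 1])

correct = predictions == labels   # boolean array, True where the prediction matches
accuracy = correct.mean()         # True counts as 1, False as 0

print(correct)
print(accuracy)
```

This only works if `predictions` has the same shape as `labels`, that is, one class index per example rather than one row of logits per example.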

In [None]:
# TODO: Initialize the Trainer
trainer = Trainer(
    model=...,
    args=...,
    train_dataset=...,
    eval_dataset=...,
    compute_metrics=...
)

# Train the model
trainer.train()

# TODO: Evaluate the model
results = ...
print(results)

In [None]:
num_examples = 6

def get_example(data, idx):
    # Index the row first so only one example is read, not the whole column
    example = data[idx]
    return example['text'], example['label']

# TODO: Make a label mapping dictionary for the AG News dataset (keys should be numbers and values should be the category as a string)
label_map = {...}

# TODO: Select num_examples examples from the test dataset
examples_text = []
examples_label = []
for i in range(num_examples):
    text, label = get_example(test_dataset, i)
    ...
    ...

# TODO: Tokenize the examples
# Hint: similar to how we defined the tokenize_function above, except here you also want to set return_tensors="pt"
# to ensure that the output from the tokenizer is ready for a PyTorch model
inputs = [tokenizer(...) for text in examples_text]

# Move to the same device as model
if torch.cuda.is_available():
    inputs = [{k: v.cuda() for k, v in inp.items()} for inp in inputs]
    model.cuda()

# For people with a GPU on a Macintosh machine, uncomment this
# elif torch.backends.mps.is_available():
#     device = torch.device("mps")
#     inputs = [{k: v.to(device) for k, v in inp.items()} for inp in inputs]
#     model = model.to(device)


# Get predictions
with torch.no_grad():
    outputs = [model(**inp) for inp in inputs]

# TODO: Extract logits from the output and apply softmax to get probabilities
# Hint: ModelOutput class documentation https://huggingface.co/docs/transformers/en/main_classes/output
probabilities = [... for output in outputs]

# Get the predicted class indices
predicted_classes = [torch.argmax(prob, dim=-1) for prob in probabilities]

# TODO: Print 6 examples where you have the example text on one line, and the true and predicted labels on the next.
for i in range(num_examples):
    ...
    ...
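
As background for the softmax step in the cell above, the sketch below shows what softmax does to a row of logits, using NumPy and made-up numbers rather than real model output. In the assignment itself you would work with PyTorch tensors (for example, `torch.softmax` with `dim=-1`); the helper `softmax` here is a hypothetical stand-in written only for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability; this does not change the result.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# One hypothetical example with four class scores.
logits = np.array([[2.0, 0.5, -1.0, 0.0]])
probs = softmax(logits)

print(probs.round(3))
print(probs.sum(axis=-1))                                             # each row sums to 1
print(int(probs.argmax(axis=-1)[0]), int(logits.argmax(axis=-1)[0]))  # same predicted class
```

Softmax preserves the ordering of the scores, so the predicted class is the same whether you take the argmax of the logits or of the probabilities; the probabilities are mainly useful for reporting how confident the model is.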
