# Transformers and Natural Language Processing 
## *Bogdan Bošković*


**Problem 3: Text Classification with a Large Language Model (30 points)** In this example you will use a modern large language model to classify text. Specifically, you will load the pre-trained BERT encoder that we discussed in class and then fine-tune it to solve a custom text classification problem: classifying news articles into one of four categories (world, sports, business, sci/tech).

To assist with this exercise, we will need to make use of some libraries from Hugging Face, an organization that provides many widely-used libraries to support deep learning applications ([link](https://huggingface.co/)).   

Below is a code skeleton for this task, with comments to guide you through completing it. Please complete the code below and submit a PDF of your completed code with results. *There will be a separate submission portal for this question on Moodle. Although your code will be reviewed, you will be graded primarily on the correctness of your output.*

Although the code skeleton below provides useful guidance and hints for filling in the code, I highly recommend that you review the Hugging Face tutorial on text classification before, or while, you complete this exercise ([tutorial link](https://huggingface.co/docs/transformers/en/tasks/sequence_classification)).


**Installations:** Make sure you use pip or conda to install the following
libraries for this exercise: datasets, evaluate, accelerate, transformers, numpy, and torch.

Google Colab already has torch and numpy, but you will still need to install
transformers, datasets, evaluate, and accelerate. You can copy and paste the line below into Colab to install them.

*pip install transformers datasets evaluate accelerate*
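
Before running the notebook, it can help to confirm that these packages are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical and not part of the assignment.

```python
import importlib.util

# Packages from the install line above, plus the two that Colab already provides.
REQUIRED = ["transformers", "datasets", "evaluate", "accelerate", "numpy", "torch"]

def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(REQUIRED)
for name, available in status.items():
    print(f"{name:<14}{'ok' if available else 'MISSING'}")
```

Any package reported as `MISSING` needs to be installed before the import cell will run.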

In [None]:
# Necessary Imports
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import torch
import warnings

warnings.simplefilter('ignore')

# Load the AG News dataset using load_dataset
dataset = load_dataset("ag_news")
train_dataset = dataset['train']
test_dataset = dataset['test']

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Define a function to tokenize the data
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)

# TODO: Tokenize the training and testing data. Hint: use .map to apply the tokenize function above to your train and test datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Load the pre-trained BERT model (bert-base-cased) with a newly initialized
# classification head for the four AG News categories
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=4)

Map: 100%|██████████| 7600/7600 [00:00<00:00, 13426.69 examples/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
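
The `padding="max_length"`, `truncation=True`, and `max_length=128` arguments in `tokenize_function` make every tokenized example exactly 128 token IDs long. The sketch below illustrates the idea on plain Python lists. It is a simplified illustration, not the tokenizer's actual implementation: the helper `pad_or_truncate` is hypothetical, a padding ID of 0 is assumed, and special tokens such as `[CLS]` and `[SEP]` are ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a list of token IDs to exactly max_length entries.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding, so a model can ignore the padded positions.
    """
    kept = token_ids[:max_length]            # truncation: drop the overflow
    n_pad = max_length - len(kept)           # amount of padding needed, if any
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# A short sequence gets padded; a long one gets cut off (toy IDs, max_length=6).
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Fixed-length inputs are simple to batch, at the cost of spending compute on padding for short articles.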


In [33]:

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',             # output directory
    num_train_epochs=3,                 # number of training epochs
    per_device_train_batch_size=8,      # batch size for training
    per_device_eval_batch_size=16,      # batch size for evaluation
    warmup_steps=500,                   # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                  # strength of weight decay
    logging_dir='./logs',               # directory for storing logs
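    logging_steps=100,                  # log the training loss every 100 steps
    evaluation_strategy="epoch"         # evaluate at the end of each epoch (recent transformers releases call this eval_strategy)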
)
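
With `warmup_steps=500`, the learning rate ramps up from zero over the first 500 optimizer steps and then decays. The sketch below assumes the `Trainer` defaults of a linear schedule and a peak learning rate of 5e-5 (neither is set explicitly above), and assumes a single device with 120,000 training examples, so that batch size 8 over 3 epochs gives 45,000 steps. The function `linear_warmup_lr` is a hypothetical helper for illustration.

```python
def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=45_000):
    """Learning rate at a given step: linear warmup, then linear decay to zero.

    All default values are assumptions described in the text above.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

for step in (0, 250, 500, 22_750, 45_000):
    print(f"step {step:>6}: lr = {linear_warmup_lr(step):.2e}")
```

The warmup covers only about 1% of training under these assumptions, so most updates happen during the decay phase.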

In [36]:
# TODO: Function to compute accuracy of the model
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': (predictions == labels).mean()}
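
To sanity-check `compute_metrics` before training, it can be called on a small hand-made batch. The logits and labels below are made up for illustration, and the function body is repeated so the snippet runs on its own.

```python
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # highest-scoring class per row
    return {'accuracy': (predictions == labels).mean()}

# Four toy examples with four class scores each (made-up values).
toy_logits = np.array([
    [2.0, 0.1, 0.1, 0.1],   # predicted class 0
    [0.1, 3.0, 0.2, 0.1],   # predicted class 1
    [0.1, 0.2, 0.3, 1.5],   # predicted class 3
    [1.0, 0.9, 0.2, 0.1],   # predicted class 0 (true label is 1, so this one is wrong)
])
toy_labels = np.array([0, 1, 3, 1])

result = compute_metrics((toy_logits, toy_labels))
print(result)
```

Three of the four toy predictions match their labels, so the reported accuracy is 0.75.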

In [37]:
# TODO: Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

# TODO: Evaluate the model
results = trainer.evaluate()
print(results)

Epoch,Training Loss,Validation Loss,Accuracy
1,0.209,0.180367,0.941974
2,0.1156,0.18534,0.943421
3,0.0681,0.223086,0.945658


{'eval_loss': 0.22308553755283356, 'eval_accuracy': 0.9456578947368421, 'eval_runtime': 9.6449, 'eval_samples_per_second': 787.982, 'eval_steps_per_second': 12.338, 'epoch': 3.0}
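
The reported accuracies can be converted into counts of test articles, and the table can be scanned for the epoch with the lowest validation loss. The sketch below uses the numbers printed above; the test-set size of 7,600 comes from the `Map` progress bar in the first cell's output.

```python
# (epoch, training loss, validation loss, accuracy) copied from the table above
history = [
    (1, 0.2090, 0.180367, 0.941974),
    (2, 0.1156, 0.185340, 0.943421),
    (3, 0.0681, 0.223086, 0.945658),
]
n_test = 7600  # number of test examples

for epoch, train_loss, val_loss, acc in history:
    n_correct = round(acc * n_test)
    print(f"epoch {epoch}: {n_correct} correct, {n_test - n_correct} misclassified")

best_by_loss = min(history, key=lambda row: row[2])[0]
best_by_acc = max(history, key=lambda row: row[3])[0]
print(f"lowest validation loss: epoch {best_by_loss}; highest accuracy: epoch {best_by_acc}")
```

Validation loss is lowest after epoch 1 while accuracy keeps improving slightly through epoch 3. One possible reading is that the model grows more confident on its mistakes as training loss falls, though the table alone does not establish this.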


In [38]:
from torch import softmax

num_examples = 6

def get_example(data, idx):
    return data['text'][idx], data['label'][idx]

# TODO: Make a label mapping dictionary for the AG News dataset (keys should be numbers and values should be the category as a string)
label_map = {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}

# TODO: Select num_examples examples from the test dataset
examples_text = []
examples_label = []
for i in range(num_examples):
    text, label = get_example(test_dataset, np.random.randint(len(test_dataset)))
    examples_text.append(text)
    examples_label.append(label)

# TODO: Tokenize the examples
# Hint: similar to how we defined the tokenize_function above, except here you also want to set return_tensors="pt"
# to ensure that the output from the tokenizer is ready for a PyTorch model
inputs = [tokenizer(text, padding="max_length", truncation=True, max_length=128, return_tensors="pt") for text in examples_text]

# Move to the same device as model
if torch.cuda.is_available():
    inputs = [{k: v.cuda() for k, v in inp.items()} for inp in inputs]
    model.cuda()

# For people with a GPU on a Macintosh machine, uncomment this
# elif torch.backends.mps.is_available():
#     device = torch.device("mps")
#     inputs = [inp.to(device) for inp in inputs]
#     model = model.to(device)

# Get predictions
with torch.no_grad():
    outputs = [model(**inp) for inp in inputs]

# TODO: Extract logits from the output and apply softmax to get probabilities
# Hint: ModelOutput class documentation https://huggingface.co/docs/transformers/en/main_classes/output
probabilities = [softmax(output.logits, dim=-1) for output in outputs]

# Get the predicted class indices
predicted_classes = [torch.argmax(prob, dim=-1) for prob in probabilities]

# TODO: Print 6 examples where you have the example text on one line, and the true and predicted labels on the next.
for i in range(num_examples):
    print('"' + examples_text[i] + '"')
    print(f"\nTrue label: {label_map[examples_label[i]]};   Predicted label: {label_map[predicted_classes[i].item()]}\n\n")

"Intuit gets deeper into IT, revamps Quicken The software maker adds a network management application. It also updates its Quicken personal-finance software."

True label: Sci/Tech;   Predicted label: Sci/Tech


"SEVEN KILLED IN KABUL BLOODSHED At least seven people have been killed in a bomb blast in central Kabul - the second deadly explosion in Afghanistan over the weekend."

True label: World;   Predicted label: World


"Bluetooth Group Outlines Strategy (NewsFactor) NewsFactor - With Bluetooth short-range wireless technology finding its way into an array of hardware products, ranging from mobile phones to in-vehicle telematics systems, a working group promoting the specification has outlined a strategy to make it even more attractive and useful."

True label: Sci/Tech;   Predicted label: Sci/Tech


"Inheriting Aura From Woods, the New King of Golf Is a Lion Vijay Singh has a golf swing to envy, even when fooling around. A few days ago on the driving range at the Tour Championship,