## ECON 2355 Implementation Exercise 3: Transformers and Language Models

This exercise has just one part:
 - **1: Large Language Models for Language Understanding**: This task applies several modern language models to a sentiment classification task, demonstrating downstream applications of these large language models. It also introduces [huggingface](https://huggingface.co/), an extremely popular framework and API for working with these models, similar to the timm library introduced in the previous exercises.

### Notes on the class's implementation exercises in general:

 - These exercises are still being finalized! If you encounter problems please don't hesitate to reach out: tom_bryan@fas.harvard.edu

 - You are welcome to download these notebooks and complete them on your local machine, or work on them in colab. If you are hoping to run things on your local machine you will likely want to set up an [Anaconda](https://www.anaconda.com/products/distribution) python environment and run notebooks from either [VS Code](https://code.visualstudio.com/download) or [Jupyter Lab](https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html). For your future Deep Learning-oriented endeavors, knowing how to set up an environment to run the frameworks and libraries discussed here will likely be important, so it might not be a bad idea to try setting things up locally. On the other hand, working in colab is nice for reproducibility purposes--anyone can run and/or debug your code without problems.

 - Exercises in this class use [PyTorch](https://pytorch.org/get-started/locally/), the [dominant](https://www.assemblyai.com/blog/pytorch-vs-tensorflow-in-2023/) research deep learning python framework. If you have a _compelling_ reason why you wish to become more familiar with another framework, like Tensorflow, reach out and we _may_ be able to accommodate that.

 - In these exercises we'll try to find the sweet spot between providing so much of the code that the implementation is meaningless and leaving so much that the work is overly tedious. Feedback is appreciated!

 - To submit the assignments, please save the exercise as a `.ipynb` file named `ECON_2355_Exercise_{n}_{firstname}_{lastname}.ipynb` and submit to the appropriate place in XXXXX

 - These exercises are graded as complete/incomplete. _Complete_ is defined as showing effort to complete at least half of the steps.

 - Many of these exercises are adapted from other courses, tutorials, or other sources. Like any good social scientist, I list those sources, so should you choose there are often other places to look for help/partial solutions. How and when you use those resources are entirely up to you and your learning style. One caveat: outside sources for exercises will likely be less and less common as we progress through the course.  

### Exercise Set 3: Intro to Large Language Models and Sentiment Classification with BERT



### 1. Intro to Large Language Models

This set of exercises introduces the huggingface `transformers` library, which provides a wide variety of pretrained language models. Pretrained language models have (hopefully) grasped much of language's structure and composition via their training procedure and can quickly generalize to other tasks.

This is very similar to how vision models are pretrained on ImageNet, which gives them a broad understanding of how vision works and what features in an image might be important to making determinations about that image. In the same way, language models know how to break text down into its salient features.

Let's start by installing the HuggingFace `transformers` library.

In [None]:
!pip install transformers

And some necessary imports:

In [None]:
import os
import torch
from PIL import Image
import torchvision
from matplotlib import pyplot as plt
from tqdm import tqdm
import numpy as np
from transformers import pipeline, BertTokenizer, BertModel, BertForPreTraining

We've talked a good amount about BERT in the course, so you likely know that it is a masked language model. In essence, it is trained to predict the token masked by a `[MASK]` token in a given sentence.

You can see a good example of this task by using the `fill-mask` pipeline option, as shown here. Calling `unmasker` on a sentence containing the `[MASK]` token will provide the five most likely candidates for the masked word.

In [None]:
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker('BERT is a [MASK] language model.')

Making predictions like this is one of the two essential tasks in BERT's pretraining, and it provides the majority of the model's token-level understanding. As it learns to infer the masked word, it learns how language is used in context. Feel free to experiment by using different sentences and masking different words. Are there situations where BERT does better or worse?
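To make the "most likely candidates" idea concrete, here is a minimal sketch of the final step of masked-token prediction: a softmax over vocabulary scores at the masked position, followed by a top-k selection. The six-word vocabulary and the scores are made up for illustration; the real model scores its full vocabulary of roughly 30,000 tokens.

```python
import torch

# Hypothetical toy vocabulary and scores at the [MASK] position (not real BERT output).
vocab = ["good", "bad", "large", "small", "new", "old"]
mask_logits = torch.tensor([2.0, 0.5, 3.0, 1.0, 2.5, 0.0])

# Softmax turns the raw scores into a probability distribution over the vocabulary.
probs = torch.softmax(mask_logits, dim=0)

# Keep the k most likely candidates; k=3 here, whereas the pipeline above returns five.
top = torch.topk(probs, k=3)
candidates = [(vocab[i], round(p, 3))
              for p, i in zip(top.values.tolist(), top.indices.tolist())]
print(candidates)
```

The candidates come out ordered from most to least likely, which is the same ordering the `fill-mask` pipeline uses for its results.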

Experimenting with these masked examples is also a good way to see some of the biases or inaccuracies embedded in the model. Try, for example, the input sentence


```
My mother worked hard as a [MASK] to feed our family.
```
compared with
```
My father worked hard as a [MASK] to feed our family.
```

Note that the biases reflected in responses like this do not reflect a problem with the model _per se_, but with the training data provided to the model. BERT and models like it use words in similar contexts as they've seen previously. In this case, BERT has consistently seen "father" associated with historically masculine professions and "mother" with historically feminine ones.

This also explains apparent social, moral, or political stances the model favors. See, for example, responses to:

```
Capitalism is a [MASK] idea.
Socialism is a [MASK] idea.
Communism is a [MASK] idea.
```

Of course, these are just anecdotal examples, although others ([1](https://arxiv.org/pdf/2004.09456.pdf), [2](https://arxiv.org/pdf/2010.00133.pdf)) have proposed methods to quantify this bias.
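For side-by-side comparisons like the ones above, a small helper can collect the top predictions for several prompts at once. This is only a sketch: `compare_prompts` and `fake_unmasker` are hypothetical names, and the stub stands in for the real pipeline so the code runs without downloading a model. It relies on the `fill-mask` pipeline returning a list of dictionaries with `token_str` and `score` keys.

```python
def compare_prompts(unmasker, sentences, k=5):
    """Map each sentence to the k most likely fills for its [MASK] token."""
    comparison = {}
    for sentence in sentences:
        results = unmasker(sentence)  # list of dicts, best candidate first
        comparison[sentence] = [r["token_str"] for r in results[:k]]
    return comparison


# Stand-in for `unmasker` with placeholder tokens and made-up scores.
def fake_unmasker(sentence):
    return [{"token_str": f"fill_{i}", "score": 1.0 / (i + 2), "sequence": sentence}
            for i in range(5)]


prompts = ["My mother worked hard as a [MASK] to feed our family.",
           "My father worked hard as a [MASK] to feed our family."]
comparison = compare_prompts(fake_unmasker, prompts, k=2)
for sentence, fills in comparison.items():
    print(sentence, "->", fills)
```

In the notebook, passing the real `unmasker` in place of `fake_unmasker` should give the actual model predictions for each prompt.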

##### **Applying BERT to Sentiment Classification**

In the following exercises, we will use BERT to predict sentiments corresponding to tweets about US airlines. This [dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment) comes from [Kaggle](https://www.kaggle.com/), a good repository for example datasets like this.

We will teach BERT to predict sentiments by finetuning it on this particular task. In particular, we will train the model to adapt its [CLS] token to predict whether a tweet is positive, negative, or neutral in its stance towards airlines.

First, let's bring in the data and show some examples. **Important:** You will need a Kaggle account to access this data, which is a good thing to have in general. Once you've made an account, you will need to follow the instructions [here](https://www.kaggle.com/docs/api) under **Authentication** to generate a Kaggle API key. Upload the key (the file called `kaggle.json`) to colab in the default location (under `/content/`) and then run the following cell to download the relevant data.

##### a) **Data and Examples**



In [None]:
!mkdir -p /root/.kaggle
!cp /content/kaggle.json /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d crowdflower/twitter-airline-sentiment -p ./airline_tweets
!unzip -o ./airline_tweets/twitter-airline-sentiment.zip

This dataset comes in csv format, with a series of columns, including the text of the tweet, the sentiment behind the tweet (one of `positive`, `negative`, or `neutral`), and a number of other features like the date, airline the tweet refers to, etc. In this example we will use only the text of the tweet to predict its sentiment, so we will discard the remaining features.

The following cell will bring the data into memory and show some illustrative examples (the data is originally sorted by airline, so they will all be from Virgin America). Click the Magic Wand symbol in the output to see things in a more readable format.

In [None]:
import pandas as pd

tweets = pd.read_csv('Tweets.csv')[['airline_sentiment', 'text']]
tweets.head(10)
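Before training a classifier it can help to check how the sentiment labels are distributed, since a skewed split changes how an accuracy number should be read. The sketch below uses a small, made-up frame so it runs on its own; with the real data, the same `value_counts` call can be applied to the `tweets` frame loaded above.

```python
import pandas as pd

# Hypothetical stand-in rows with the same two columns as `tweets`.
toy = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "neutral", "positive"],
    "text": ["late again", "lost my bag", "what time is boarding?", "great crew"],
})

# Share of each sentiment class in the data.
class_share = toy["airline_sentiment"].value_counts(normalize=True)
print(class_share)
```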

##### b) **Bringing in BERT**

Now that we have our text data, we need to get a model! In this project we will use BERT base uncased from huggingface. You can see more information about that model [here](https://huggingface.co/bert-base-uncased); it is essentially the same as the one described in the original [BERT paper](https://arxiv.org/pdf/1810.04805.pdf). This is one of the most common and widely used LLMs, as you can see from the model's download statistics. It's often instructive to try this model on your tasks first to get a benchmark on performance, before trying other, more task-specific models.

BERT needs two objects to run: the model and the tokenizer. The tokenizer takes arbitrary text sequences and turns them into sequences of integer token ids (plus an attention mask marking which positions are real tokens rather than padding), while the model applies BERT's self-attention mechanism to those ids to create a dense, feature-rich embedding of the sequence.
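As a rough illustration of what a tokenizer produces, here is a toy version that splits on whitespace, adds the special tokens, looks up ids, and pads. Everything in it is an assumption made for the sketch: the eight-entry vocabulary, the ids, and the `toy_encode` helper are hypothetical, and BERT's real tokenizer uses WordPiece sub-word units rather than whole words.

```python
# Toy vocabulary (hypothetical ids); BERT's real vocabulary is far larger.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "this": 4, "is": 5, "a": 6, "test": 7}


def toy_encode(text, max_length=8):
    """Lowercase, split on whitespace, add special tokens, then pad to max_length."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens][:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [vocab["[PAD]"]] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


encoded = toy_encode("This is a test flight")
print(encoded)
```

The word "flight" is not in the toy vocabulary, so it maps to `[UNK]`; a sub-word tokenizer would instead break unfamiliar words into smaller known pieces.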

The following cell will bring in both parts of BERT and run it over a sample sequence.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining.from_pretrained("bert-base-uncased")
text = "This is a test sentence."  # the tokenizer adds [CLS] and [SEP] itself
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

The `output` object here contains all of BERT's output. For `BertForPreTraining` (see [here](https://huggingface.co/docs/transformers/model_doc/bert#transformers.models.bert.modeling_bert.BertForPreTrainingOutput); other BERT models will have different output formats) this consists of masked-language-model prediction scores for each token (`prediction_logits`), sentence relationship classification predictions (`seq_relationship_logits`), and, when labels are supplied, a loss. You can access each of these parts separately.
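To see the shape of each part without downloading any weights, the sketch below builds a tiny, randomly initialised `BertForPreTraining` from a `BertConfig`. The sizes are arbitrary choices for illustration and the random token ids are not real text, so the values themselves are meaningless; only the structure of the output matters.

```python
import torch
from transformers import BertConfig, BertForPreTraining

# Tiny configuration with arbitrary sizes, so nothing is downloaded.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
tiny_model = BertForPreTraining(config).eval()

input_ids = torch.randint(0, 100, (1, 6))  # one sequence of six random token ids
with torch.no_grad():
    out = tiny_model(input_ids=input_ids)

print(out.prediction_logits.shape)        # (batch, sequence length, vocab size)
print(out.seq_relationship_logits.shape)  # (batch, 2)
print(out.loss)                           # None, because no labels were passed
```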

##### c) **The `[CLS]` token**

You may recall from the lecture that BERT's pretraining has two parts: first, it attempts to fill in masked tokens, and second, it attempts to determine if one sentence logically follows another. BERT input is structured like

```
[CLS] Sentence 1 [SEP] Sentence 2 [SEP]
```

BERT then uses a classification head over the output of the `CLS` token to predict whether Sentence 2 logically follows Sentence 1. So BERT is tuned to produce meaningful representations for classifying the entire sequence in its `CLS` space.
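The sentence-pair layout can be sketched in plain Python. The helper name `build_pair` is hypothetical and the sentences are already split into words for simplicity. The segment ids it returns play the role of the `token_type_ids` that the huggingface tokenizer produces, and the sketch includes the closing `[SEP]` that the tokenizer appends after the second sentence.

```python
def build_pair(tokens_a, tokens_b):
    """Lay out two tokenised sentences the way BERT expects a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 for [CLS], sentence A and its [SEP]; 1 for sentence B and its [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(["the", "plane", "landed"], ["we", "got", "off"])
for tok, seg in zip(tokens, token_type_ids):
    print(f"{tok:10s} segment {seg}")
```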

In a first example, let's see how BERT does at generating predictions based on its original classification task: does one sentence logically follow the other?

Run the model we've initialized over some examples, some of which logically follow each other, and some of which do not. How does BERT do? You can access BERT's classification predictions via `output.seq_relationship_logits`.

**Note:** The model here is not really tuned for classification and may make incorrect predictions. For unrelated sentences, however, you should see a _lower_ first value in `output.seq_relationship_logits` and a _higher_ second value (which will likely still be negative).

In [None]:
# Each example is a (sentence A, sentence B) pair
follows = ('The location of the earthquake places it within the vicinity of a triple junction between the Anatolian, Arabian, and African plates.',
           'The mechanism and location of the earthquake are consistent with it having occurred in either the East Anatolian Fault zone or the Dead Sea Transform Fault Zone.')
does_not_follow = ('The Dead Sea Transform extends north–south from the Red Sea to the Marash Triple Junction where it meets the East Anatolian Fault.',
                   'Like certain other upper houses of state and territorial legislatures and the United States Senate, the state Senate can confirm or reject gubernatorial appointments to state departments, commissions, boards, and other state governmental agencies.')

# Encode both pairs (pass 'add_special_tokens=True' as an argument to the tokenizer).
# Passing the two sentences as separate arguments makes the tokenizer place [SEP] between them.
follows_encoded = tokenizer(follows[0], follows[1], return_tensors='pt', add_special_tokens=True)
not_follows_encoded = tokenizer(does_not_follow[0], does_not_follow[1], return_tensors='pt', add_special_tokens=True)

# Run the model over the tokenized text
follows_output = model(**follows_encoded)
not_follows_output = model(**not_follows_encoded)

# Check the outputs and classification scores
print('Follows predictions: {}'.format(follows_output.seq_relationship_logits))
print('Does not follow predictions: {}'.format(not_follows_output.seq_relationship_logits))
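Raw logits are easier to read once converted to probabilities. The sketch below applies a softmax to a made-up logit pair laid out like `seq_relationship_logits`; in the huggingface implementation, index 0 corresponds to "sentence B continues sentence A" and index 1 to "sentence B is a random sentence". The numbers are invented for illustration and are not model output.

```python
import torch

# Made-up logits in the same (batch, 2) layout as `seq_relationship_logits`.
example_logits = torch.tensor([[3.2, -1.1]])

# Softmax over the two classes gives probabilities that sum to one.
probs = torch.softmax(example_logits, dim=1)
p_follows, p_random = probs[0].tolist()
print(f"P(B follows A) = {p_follows:.3f}, P(B is random) = {p_random:.3f}")
```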


##### d) **Adding a classification head**

The example above produces predictions (the final `seq_relationship_logits` pair) because it has a classification head on top. Here we add a classification head for our problem (classifying airline tweet sentiment) on top of a BERT model.

Now, instead of using the `BertForPreTraining` model, we will switch to the standard `BertModel`, since we no longer need to access the pretraining behavior.

What does a classification head look like? The `CLS` token is output from the model as a dense, 768-dimensional vector. We will add a linear layer to produce a three-dimensional output, and then add a softmax layer to produce probabilities for each of the classes: positive, negative, and neutral.

In [None]:
class BertSentimentClassifier(torch.nn.Module):
  def __init__(self, bert_model):
    super(BertSentimentClassifier, self).__init__()
    self.bert_model = bert_model

    # TODO: Create a linear layer and softmax to move from a 768 dimensional vector to a three-dimensional probability vector
    self.linear_1 = torch.nn.Linear(768, 3)
    self.softmax = torch.nn.Softmax(dim=1)

  def forward(self, input_ids, mask):
    _, pooled_output = self.bert_model(input_ids = input_ids, attention_mask = mask, return_dict = False)
    # TODO: run the linear layer and softmax over the pooled output (a 768-d vector)
    return self.softmax(self.linear_1(pooled_output))

In [None]:
bert_model = BertModel.from_pretrained('bert-base-uncased')
model = BertSentimentClassifier(bert_model)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


##### e) **Preparing to train the model**

Now that we have our encoder, custom model, and data all initialized, we need to set up the standard elements for a PyTorch training loop (similar to the first two exercise sets). In particular, we will need to create PyTorch datasets and dataloaders, an optimizer, and a loss function.

First, we apply a mapping to the data to get numeric classification values.

In [None]:
mapping = {'negative': 0,
           'neutral': 1,
           'positive': 2}

# TODO: Encode the sentiment classifications
tweets['enc_sentiment'] = tweets['airline_sentiment'].apply(lambda x: mapping[x])

Next, we need to create a dataset object for this dataset. `X` should be the actual texts, while `y` should be the encoded sentiment values.

In [None]:
# TODO: Finish implementing the TweetDataset object
class TweetDataset(torch.utils.data.Dataset):
  def __init__(self, tweet_data):
    # Plain Python lists give simple positional indexing regardless of the DataFrame's index
    self.x = tweet_data['text'].tolist()
    self.y = tweet_data['enc_sentiment'].tolist()

  def __len__(self):
    return len(self.x)

  def __getitem__(self, i):
    return self.x[i], self.y[i]

Then, as usual, we need to create train, test, and validation sets. We also need to create `torch.utils.data.DataLoader` objects from the various datasets. Let's again use **80% train, 10% validation, and 10% test.**

In [None]:
tweet_dataset = TweetDataset(tweets)

# TODO: Split data into three sets, with 80% train, 10% validation, and 10% test
train_set, val_set, test_set = torch.utils.data.random_split(tweet_dataset, [.8, .1, .1])

# TODO: Create dataloaders for the data
train_loader = torch.utils.data.DataLoader(train_set, batch_size = 16, shuffle = True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size = 16, shuffle = True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size = 1, shuffle = True)
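Passing fractions to `random_split` relies on a reasonably recent PyTorch release. A version-independent alternative, sketched here on a toy dataset, is to compute integer lengths explicitly and pass a seeded generator so the split is reproducible. The seed value and the 100-item stand-in dataset are arbitrary choices for the sketch.

```python
import torch
from torch.utils.data import random_split

toy_dataset = list(range(100))  # any object with __len__ and __getitem__ works

n_train = int(0.8 * len(toy_dataset))
n_val = int(0.1 * len(toy_dataset))
n_test = len(toy_dataset) - n_train - n_val  # remainder, so the sizes always add up

generator = torch.Generator().manual_seed(0)  # fixed seed (arbitrary) for reproducibility
train_part, val_part, test_part = random_split(
    toy_dataset, [n_train, n_val, n_test], generator=generator)

print(len(train_part), len(val_part), len(test_part))
```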

Now we need to create the remaining objects for our training loop. Move the model to the GPU and create a `CrossEntropyLoss` loss function and an `Adam` optimizer.

In [None]:
learning_rate = 1e-6

#TODO: Move the model to the GPU
model = model.cuda()

# TODO: create an Adam optimizer for the model
optim = torch.optim.Adam(model.parameters(), lr = learning_rate)

# TODO: create a cross entropy loss function
loss_fn = torch.nn.CrossEntropyLoss()
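It is useful to know what `CrossEntropyLoss` computes: it takes raw class scores together with integer class indices, applies a log-softmax internally, and averages the negative log-probability of the correct class. The sketch below checks this by hand on made-up scores for two examples.

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])  # made-up raw scores for two tweets
targets = torch.tensor([0, 2])            # integer class indices, as produced by `mapping`

loss = torch.nn.CrossEntropyLoss()(logits, targets)

# The same value by hand: mean negative log-probability of the correct class.
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), targets].mean()

print(loss.item(), manual.item())
```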

##### f) **Training the model**

Finally, create a training loop for this model. This will look very similar to previous training loops, with a few exceptions:
- Input will have to be tokenized _before_ being moved onto the GPU (since the tokenizer is still on the CPU)
- Both the `input_ids` and `attention_mask` attributes of the encoded input will need to be moved to the GPU
- Running the model over the input data will need to take both the input ids and attention mask as inputs
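The device handling described in the list above can be sketched on its own. This version picks the GPU when one is available and falls back to the CPU otherwise, which is an addition for portability; the exercise itself assumes a GPU. The `enc` dictionary is a stand-in for the tokenizer output with made-up values.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the tokenizer output: CPU tensors with made-up values.
enc = {"input_ids": torch.tensor([[101, 2023, 102, 0]]),
       "attention_mask": torch.tensor([[1, 1, 1, 0]])}
y = torch.tensor([2])

# Move every tensor the model needs onto the same device as the model.
batch = {name: tensor.to(device) for name, tensor in enc.items()}
y = y.to(device)

print({name: tensor.device for name, tensor in batch.items()}, y.device)
```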

In [None]:
num_epochs = 5

for epoch in range(num_epochs):

  model.train()
  for X, y in tqdm(train_loader):
    # TODO: tokenize the input
    enc = tokenizer(list(X), padding='max_length', max_length=64, truncation=True, return_tensors='pt')

    # TODO: move the encoded input ids, encoded attention mask, and y to the GPU
    input_ids = enc.input_ids.cuda()
    attention_mask = enc.attention_mask.cuda()
    y = y.cuda()

    # TODO: Run the model over the input, compute the loss, zero the gradients, backpropagate, and step the optimizer
    output = model(input_ids, attention_mask)
    loss = loss_fn(output, y)
    optim.zero_grad()
    loss.backward()
    optim.step()

  model.eval()
  n_correct = 0
  with torch.no_grad():  # gradients are not needed for evaluation
    for X, y in tqdm(val_loader):
      # TODO: complete the evaluation loop, computing validation accuracy
      enc = tokenizer(list(X), padding='max_length', max_length=64, truncation=True, return_tensors='pt')
      input_ids = enc.input_ids.cuda()
      attention_mask = enc.attention_mask.cuda()
      output = model(input_ids, attention_mask)
      preds = np.argmax(output.detach().cpu().numpy(), axis = 1)
      n_correct += (preds == y.numpy()).sum()

  print(f'Val Accuracy epoch {epoch + 1}: {n_correct / len(val_set)}')

After five epochs, you should see accuracy in the 80-90% range. If you don't, consider checking the model you defined earlier for any possible errors.

##### g) **A more direct, pretrained and finetuned approach**

While we can adapt the general BERT model to just about any task, there are also many models (provided by huggingface or other sites) pretrained for a specific task. In the majority of cases, if you have a widely-used data source (like twitter, in our case) and a broad, often-explored task (like sentiment classification, in our case) there is likely a variety of pretrained models to choose from.

In this case we can use the [Twitter-roBERTa for Sentiment Analysis](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) model, provided through huggingface by [a team of NLP researchers with the University of Cardiff](https://arxiv.org/pdf/2202.03829.pdf). This model is much easier to work with and can be expected to provide more initial knowledge than the first. It:
- Has been pretrained on tweets, unlike BERT
- Has already been finetuned to classify sentiments

The code below will initialize the model and tokenizer for you:

In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig

MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
roberta_model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Inputting a sequence into the model will produce a list of pre-softmax outputs for Negative, Neutral, and Positive. We will need to recreate a custom classifier, but **the only change you should need to make from the earlier classifier** is removing the linear layer (since that is provided in the finetuned model).
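As a small illustration of reading those pre-softmax outputs, the sketch below converts made-up scores to probabilities and then to label names. The `id2label` dictionary is an assumption that follows the Negative, Neutral, Positive order described above; on the real model the mapping can be checked via `config.id2label`.

```python
import torch

# Assumed label order; verify against `config.id2label` for the real model.
id2label = {0: "negative", 1: "neutral", 2: "positive"}

logits = torch.tensor([[-1.2, 0.3, 2.1],
                       [1.5, 0.2, -0.7]])  # made-up pre-softmax scores for two tweets

probs = torch.softmax(logits, dim=1)
labels = [id2label[i] for i in probs.argmax(dim=1).tolist()]
print(labels)
```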

In [None]:
# TODO: Create a RoBERTa sentiment classifier, very similar to the previous custom classifier.
class RoBertaSentimentClassifier(torch.nn.Module):
  def __init__(self, bert_model):
    super(RoBertaSentimentClassifier, self).__init__()
    self.bert_model = bert_model
    self.softmax = torch.nn.Softmax(dim=1)

  def forward(self, input_ids, mask):
    # With return_dict=False the model returns a tuple whose first element holds the (batch, 3) logits
    outputs = self.bert_model(input_ids = input_ids, attention_mask = mask, return_dict = False)
    return self.softmax(outputs[0])

Now we need to move the model to the GPU, and create an optimizer and loss function again.

In [None]:
learning_rate = 1e-6
model = RoBertaSentimentClassifier(roberta_model)

#TODO: Move the model to the GPU
model = model.cuda()

# TODO: create an Adam optimizer for the model
optim = torch.optim.Adam(model.parameters(), lr = learning_rate)

# TODO: create a cross entropy loss function
loss_fn = torch.nn.CrossEntropyLoss()

This model should perform well zero-shot on our data, but we can improve performance by training a bit more on our specific dataset (so that the model can get used to airline-specific terms, for example).

In this last step, you need to recreate the training loop from above. It should be nearly an exact copy-paste. We should make one adjustment, however: run a validation loop _first_ (before doing any training) so that you can measure the model's zero-shot performance. How much does the model improve from that benchmark? Does the model converge faster or slower than the bert-base model? Is final accuracy (after five epochs) higher or lower?

In [None]:
num_epochs = 5

# Iteration 0 measures zero-shot accuracy; iteration k reports accuracy after k epochs of training
for epoch in range(num_epochs + 1):

  model.eval()
  n_correct = 0
  with torch.no_grad():  # gradients are not needed for evaluation
    for X, y in tqdm(val_loader):
      # TODO: complete the evaluation loop, computing validation accuracy
      enc = tokenizer(list(X), padding='max_length', max_length=64, truncation=True, return_tensors='pt')
      input_ids = enc.input_ids.cuda()
      attention_mask = enc.attention_mask.cuda()
      output = model(input_ids, attention_mask)
      preds = np.argmax(output.detach().cpu().numpy(), axis = 1)
      n_correct += (preds == y.numpy()).sum()
  print(f'Val accuracy after {epoch} epoch(s) of training: {n_correct / len(val_set)}')

  if epoch == num_epochs:
    break  # final validation pass is done; no further training

  model.train()
  for X, y in tqdm(train_loader):
    # TODO: tokenize the input
    enc = tokenizer(list(X), padding='max_length', max_length=64, truncation=True, return_tensors='pt')

    # TODO: move the encoded input ids, encoded attention mask, and y to the GPU
    input_ids = enc.input_ids.cuda()
    attention_mask = enc.attention_mask.cuda()
    y = y.cuda()

    # TODO: Run the model over the input, compute the loss, zero the gradients, backpropagate, and step the optimizer
    output = model(input_ids, attention_mask)
    loss = loss_fn(output, y)
    optim.zero_grad()
    loss.backward()
    optim.step()



Hopefully this exercise has provided a window into not only sentiment classification, but downstream language tasks in general. The system used here (finding a pretrained model, adapting it to a specific task, and then finetuning) will generalize remarkably well across datasets and language tasks.

### Part 2: Perplexity

In [None]:
# TODO: Create the get_perplexity function
def get_perplexity(tokenizer, model, text = "Economics is a very interesting topic"):

  # Tokenize the inputs
  inputs = tokenizer(text, return_tensors='pt')

  # Run the inputs through the model, saving the logits (no gradients are needed)
  input_seq = inputs['input_ids']
  with torch.no_grad():
    results = model(input_seq)
  logits = results.logits

  # Initialize the summed log-likelihood at zero
  loss = torch.tensor(0.0)

  # For each logit index (not including the last one), compute the log of the softmax probabilities.
  # Then, look up the log-probability of the NEXT (i + 1) token in the input sequence.
  # That value is log p(x_{i+1} | x_{<=i})
  # Sum those values
  for i in range(input_seq.size(-1) - 1):
    log_softmax = torch.log_softmax(logits[0, i, :], dim = -1)
    sub_loss = log_softmax[input_seq[0, i + 1]]
    loss += sub_loss

  # Take the exponential of the negative average log-likelihood
  perplexity = torch.exp(-1 * loss / (input_seq.size(-1) - 1))

  # Return the result
  return perplexity.item()
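One way to sanity-check a perplexity computation is to feed it logits from a model that knows nothing. If every token is equally likely, the perplexity should equal the vocabulary size. The sketch below is a vectorised variant of the loop above that works directly on logits, so it needs no tokenizer or model. The function name `perplexity_from_logits` and the toy sizes are assumptions made for this illustration.

```python
import torch


def perplexity_from_logits(logits, input_ids):
    """logits: (1, seq_len, vocab); input_ids: (1, seq_len). Returns a Python float."""
    # Position i predicts token i + 1, so drop the last position's prediction...
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    # ...and drop the first token, which is never predicted.
    next_tokens = input_ids[:, 1:].unsqueeze(-1)
    token_log_probs = log_probs.gather(-1, next_tokens).squeeze(-1)
    # Exponential of the negative average log-likelihood.
    return torch.exp(-token_log_probs.mean()).item()


vocab_size = 50
input_ids = torch.randint(0, vocab_size, (1, 7))
uniform_logits = torch.zeros(1, 7, vocab_size)  # every token equally likely
ppl = perplexity_from_logits(uniform_logits, input_ids)
print(ppl)
```

A trained language model should score well below the vocabulary size on fluent text, since it assigns much more than uniform probability to the tokens that actually occur.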
