# Classifying IMDB Movie Reviews with BERT Transformers

- [View Solution Notebook](./solutions.html)
- [View Project Page](https://www.codecademy.com/content-items/29838c7636654e48ac72458af6373d5d)

**Setup - Import Libraries**

Run the cell below to import the libraries and set the random seed.

In [1]:
import numpy as np
import pandas as pd

import torch
torch.manual_seed(42) # set random seed --do not change!

from transformers import logging
logging.set_verbosity_error() # remove warnings --do not change!

## Task Group 1 - Import and Inspect the Movie Reviews

Let's first import and inspect the datasets containing the movie reviews!

### Task 1

In this project, we'll be classifying different _aspects_ within movie reviews from IMDB (Internet Movie Database), which is a free online database containing data on movies, TV shows, and other types of media.

We've obtained the movie reviews from this [Hugging Face Dataset](https://huggingface.co/datasets/Lowerated/lm6-movies-reviews-aspects) that we've cleaned and pre-processed into the following files:
- `"datasets/imdb_movie_reviews_train.csv"` - contains movie reviews in the training dataset
- `"datasets/imdb_movie_reviews_test.csv"` - contains the movie reviews in the testing dataset

Here's a quick summary of the columns in the dataset:

- `review` - the text of the movie review
- `aspect` - the thematic aspect of the movie the review targets
- `aspect_encoded` - an integer label encoding the `aspect` column

Start by importing the CSV file to a pandas DataFrame named `train_reviews_df`.

Use the `.head()` method to preview the first five rows.

In [2]:
train_reviews_df = pd.read_csv("datasets/imdb_movie_reviews_train.csv")
train_reviews_df.head()

Unnamed: 0,review,aspect,aspect_encoded
0,Ibiza filming location looks very enchanting,Cinematography,0
1,RANDOLPH SCOTT always played men you could loo...,Characters,1
2,"interesting and promising basic idea', 'some p...",Story,2
3,"the film could explore very powerful politics,...",Story,2
4,"The animation is nice, and the use of color ma...",Cinematography,0


### Task 2

Next, use `.info()` to inspect the training dataset. Check that the data type of each column makes sense and whether there are any missing values.

In [3]:
train_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 369 entries, 0 to 368
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   review          369 non-null    object
 1   aspect          369 non-null    object
 2   aspect_encoded  369 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 8.8+ KB


### Task 3

Let's explore the different aspect labels in the training dataset.

Use the `.value_counts()` method on the `aspect` column to count the number of aspect labels.

Do the same for the `aspect_encoded` column to verify that the count for each encoded label matches the count for its corresponding aspect.

In [4]:
print(train_reviews_df['aspect'].value_counts())
print(train_reviews_df['aspect_encoded'].value_counts())

aspect
Cinematography    125
Characters        123
Story             121
Name: count, dtype: int64
aspect_encoded
0    125
1    123
2    121
Name: count, dtype: int64


Here's a quick summary of each aspect:

- **Cinematography** focuses on the movie's visual elements like the locations, quality, lighting, and visual appeal
- **Characters** addresses the portrayal of characters and their development throughout the movie, which can focus on their acting, personality, depth, and relatability
- **Story** highlights the movie's themes, plots, originality, and quality of storytelling

### Task 4

Since each movie review is classified into one of the three aspects, we'll frame this task as a **multi-class** classification task. 

Let's save the number of aspects to the variable `n_aspects`, which we'll reference later when we build the neural network and BERT transformer using PyTorch.

In [5]:
n_aspects = train_reviews_df['aspect'].nunique()
print("Number of aspects (labels)", n_aspects)

Number of aspects (labels) 3


### Task 5

Next, create the _training corpus_ by using the `.tolist()` method to convert the review texts and their corresponding aspect labels into the following lists:

- `train_texts` contains the text of each movie review
- `train_labels` contains the encoded aspect label of each review

In [6]:
train_texts = train_reviews_df['review'].tolist()
train_labels = train_reviews_df['aspect_encoded'].tolist()

## Task Group 2 - Pre-processing the Text Data

Next, we'll need to pre-process the text data into a numerical representation that our text classification model will understand!

### Task 6

Next, let's tokenize each review into word tokens.

A. Create a function named `tokenize` that takes a review text as input.
- lowercase the text using `.lower()`, then use the `re.findall()` function from the `re` module to locate words (without punctuation or special characters) using the regular expression `r'\b\w+\b'`
- save the tokenized text to the variable `tokenized_review_text` within the function
- make sure the function returns the tokenized text

B. Apply the function to tokenize each review text in the training corpus.
- use a `for` loop to apply the `tokenize` function to each review text in `train_texts`
- save the list of tokenized reviews to the variable `tokenized_corpus`

In [7]:
import re
def tokenize(text):
    tokenized_review_text = re.findall(r'\b\w+\b', text.lower())
    return tokenized_review_text

tokenized_corpus = [tokenize(review_text) for review_text in train_texts]
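
To see what the regular expression does with punctuation, here is a minimal sketch that applies the same pattern to a made-up sentence (the sample text is an assumption for illustration and is not taken from the dataset). Because apostrophes are not word characters, contractions are split into separate tokens, which is why short tokens such as `s` can show up among frequent words.

```python
import re

def tokenize(text):
    # Lowercase the text, then keep runs of word characters between word boundaries
    return re.findall(r'\b\w+\b', text.lower())

# Hypothetical sample sentence, not from the review dataset
sample = "It's a GREAT film, isn't it?"
tokens = tokenize(sample)
print(tokens)
```

Punctuation is dropped entirely, and `It's` becomes the two tokens `it` and `s`.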

### Task 7

From the `collections` module, use `Counter()` to count the number of occurrences of each token and save the counts to the variable `word_freqs`:
- First, we'll need to combine each tokenized text sequence together into a single, long list of tokens of the full training corpus. We'll save the list to the variable `combined_corpus`.
- Next, you'll need to create a _nested loop_ that first loops through each tokenized text sequence within the training corpus
- Then, the second loop iterates through each token within each text sequence and appends each token to `combined_corpus`
- Lastly, apply the `Counter()` function to `combined_corpus` to count the number of occurrences of each token

In [8]:
from collections import Counter
combined_corpus = []
for text in tokenized_corpus:
    for token in text:
        combined_corpus.append(token)

word_freqs = Counter(combined_corpus)

### Task 8

Use the `.most_common()` method to obtain the top 1000 most commonly occurring tokens.
- Set the value `1000` to the variable `MAX_VOCAB_SIZE`
- Save the top 1000 most commonly occurring tokens to the variable `most_common_words`

Print out the top 10 most common words in the training corpus.

In [9]:
MAX_VOCAB_SIZE = 1000
most_common_words = word_freqs.most_common(MAX_VOCAB_SIZE)
print("Top 10 Most Common Words: ", most_common_words[0:10])

Top 10 Most Common Words:  [('the', 732), ('a', 307), ('and', 306), ('of', 296), ('is', 218), ('to', 213), ('in', 177), ('it', 134), ('s', 109), ('that', 105)]


### Task 9

Create the vocabulary as a dictionary using the top 1000 most commonly occurring word tokens.
- use the `enumerate` function to assign token IDs for each word token in `most_common_words` based on its positional value **starting at the value 2**
- save the vocabulary to the variable `vocab`

Add the special tokens `<unk>` and `<pad>` to the vocabulary by:
- assigning the special token key `<unk>` with the token ID value `0`
- assigning the special token key `<pad>` with the token ID value `1`

In [10]:
vocab = {word: idx + 2 for idx, (word, freq) in enumerate(most_common_words)}
vocab['<unk>'] = 0
vocab['<pad>'] = 1 
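
The vocabulary construction can be traced on a tiny example. The sketch below uses a made-up two-review corpus and keeps only the top 2 tokens (both the corpus and the names prefixed with `toy_` are assumptions for illustration), so that the resulting IDs are easy to check by hand.

```python
from collections import Counter

# Hypothetical tokenized corpus, not from the review dataset
toy_corpus = [["the", "plot", "is", "the", "best"],
              ["the", "cast", "is", "great"]]

# Flatten the corpus and count every token
toy_freqs = Counter(token for seq in toy_corpus for token in seq)

# Keep the 2 most frequent tokens; IDs start at 2 to leave room for special tokens
toy_common = toy_freqs.most_common(2)
toy_vocab = {word: idx + 2 for idx, (word, freq) in enumerate(toy_common)}
toy_vocab['<unk>'] = 0
toy_vocab['<pad>'] = 1
print(toy_vocab)
```

Every token outside the top 2 (such as `plot` or `cast`) has no entry and will later be mapped to `<unk>`.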

### Task 10

Using the vocabulary, create a function named `encode_text` that tokenizes and encodes each review text into a sequence of token IDs.
- specify the inputs `text` and `vocab`
    - `text` will be the raw review text to be tokenized and encoded
    - `vocab` is the vocabulary
- apply the `tokenize` function to the input `text`
- encode the tokenized text into a sequence of token IDs
    - assign the token ID value for the special token `<unk>` for word tokens that are not in the vocabulary
- the function should return the encoded review text

In [11]:
def encode_text(text, vocab):
    tokenized_text = tokenize(text)
    encoded_text = [vocab.get(word, vocab['<unk>']) for word in tokenized_text]
    return encoded_text

### Task 11

Let's create another function named `pad_or_truncate` that pre-processes each encoded text to have the same length specified by a maximum length value.
- specify the inputs `encoded_text` and `max_len`
    - `encoded_text` will be the input encoded review text
    - `max_len` is the specified maximum length value
- use an `if` statement to identify reviews longer than the maximum length and return the review truncated to the maximum length value
- use an `else` statement to handle all remaining reviews (those no longer than the maximum length) and return the review padded with `1` values (corresponding to the token ID for the special token `<pad>`)

In [12]:
def pad_or_truncate(encoded_text, max_len):
    if len(encoded_text) > max_len:
        return encoded_text[:max_len]
    else:
        return encoded_text + [vocab['<pad>']] * (max_len - len(encoded_text)) 
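
The following self-contained sketch runs encoding, padding, and truncation on one made-up sentence. The toy vocabulary and the sentence are assumptions for illustration, and this variant passes the pad ID as an explicit `pad_id` argument instead of reading the global `vocab`, which is a small departure from the function above.

```python
import re

# Hypothetical vocabulary for illustration only
toy_vocab = {'<unk>': 0, '<pad>': 1, 'the': 2, 'is': 3, 'plot': 4}

def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

def encode_text(text, vocab):
    # Unknown words fall back to the <unk> ID
    return [vocab.get(word, vocab['<unk>']) for word in tokenize(text)]

def pad_or_truncate(encoded_text, max_len, pad_id):
    if len(encoded_text) > max_len:
        return encoded_text[:max_len]
    return encoded_text + [pad_id] * (max_len - len(encoded_text))

encoded = encode_text("The plot is brilliant", toy_vocab)
padded = pad_or_truncate(encoded, 6, toy_vocab['<pad>'])
truncated = pad_or_truncate(encoded, 3, toy_vocab['<pad>'])
print(encoded, padded, truncated)
```

The word `brilliant` is not in the toy vocabulary, so it is encoded as `0`; padding appends `1` values on the right, and truncation keeps only the first tokens.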

### Task 12

Now, let's fully tokenize and encode each review text by applying the `encode_text` and `pad_or_truncate` functions:
- specify a maximum length of `128` for each review text sequence and save the value to the variable `MAX_SEQ_LENGTH`
- tokenize and encode each review text as follows:
    - use a `for` loop to iterate through each training review text in `train_texts`
    - apply the `encode_text` function to each review text (be sure to specify the vocabulary)
    - apply the `pad_or_truncate` function to each encoded text (be sure to specify the maximum length value)
- save the fully pre-processed review text sequences to the variable `padded_text_seqs`

In [13]:
MAX_SEQ_LENGTH = 128
padded_text_seqs = [pad_or_truncate(encode_text(text, vocab), max_len=MAX_SEQ_LENGTH) for text in train_texts]

### Task 13

Let's prepare our text sequences for training by converting them into PyTorch tensors:

- create the input tensor `X_tensor` by converting the padded/truncated sequences
- create the label tensor `y_tensor` by converting the training review labels (be sure to specify the datatype `torch.long`)

Using the PyTorch utility module `torch.utils.data`, let's organize the training input and label tensors into a single dataset object and an iterable that will allow us to load the training data in batches:
- create the variable `train_dataset` using the `TensorDataset` utility class to organize the input tensor and label tensor into a single dataset object
- create the variable `train_dataloader` using the `DataLoader` utility class to create an iterable that loads the `train_dataset` in batches of `16` (be sure to set `shuffle` to `True`)

In [14]:
import torch
X_tensor = torch.tensor(padded_text_seqs)
y_tensor = torch.tensor(train_labels, dtype=torch.long)

from torch.utils.data import DataLoader, TensorDataset
batch_size = 16
train_dataset = TensorDataset(X_tensor, y_tensor)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
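
As a quick sanity check on batching, the sketch below builds a dataloader over placeholder tensors with the same shape as the training data (369 sequences of 128 token IDs). The all-zero tensors and the `demo_` names are assumptions for illustration only. With a batch size of 16, the loader yields 23 full batches plus one final batch holding the single remaining example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors with the same shapes as the real training tensors
X_demo = torch.zeros((369, 128), dtype=torch.long)
y_demo = torch.zeros(369, dtype=torch.long)

demo_loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=16, shuffle=True)

# Record how many examples each batch contains
batch_sizes = [batch_X.shape[0] for batch_X, batch_y in demo_loader]
n_batches = len(demo_loader)
print(n_batches, batch_sizes[-1])
```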

## Task Group 3 - Training a Simple Neural Network

The first text classification model we'll build and train to classify movie reviews is a simple neural network with an embedding layer.

### Task 14

Let's construct the neural network architecture that will be trained to classify movie reviews!

Create a class named `SimpleNNWithEmbedding` for our neural network (using the PyTorch's `nn.Module`) with the following:

A. Create the `__init__` method that takes in the following parameters:
- `vocab_size` is the number of tokens in the vocabulary 
- `embed_size` is the embedding size
- `hidden_size` is the number of neurons in the linear layer
- `output_size` is the number of output classes

B. Build the `__init__` method by initializing the following layers:
- `self.embedding` is an embedding layer with one embedding vector per token in the vocabulary (`vocab_size`), each with the size specified by the `embed_size` input
- `self.fc1` is the first linear layer with an input size equal to the embedding size in the embedding layer and an output size equal to the number of neurons in the hidden layer specified by the `hidden_size` input
- `self.fc2` is the second linear layer with an input size equal to the hidden size of the first linear layer and an output size equal to the number of classes specified by the `output_size` input

C. Build the `forward` method and create the forward operations in the following order:
1. Start by passing the input `x` into the embedding layer
2. Average the embeddings into a single representation
3. Pass the averaged embedding into the first linear layer
4. Apply the ReLU activation function 
5. Pass the activated output to the second linear layer
6. Return the output of the second linear layer

In [15]:
import torch.nn as nn
torch.manual_seed(42) # set random seed --do not change!

class SimpleNNWithEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, output_size):
        super(SimpleNNWithEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.fc1 = nn.Linear(embed_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.embedding(x)
        x = torch.mean(x, dim=1)
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.fc2(x)
        return x
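
It can help to follow the tensor shapes through the forward pass. The sketch below repeats the same operations outside the class on a random batch; the batch of random token IDs and the layer sizes (vocabulary of 1002, embedding size 50, hidden size 100, 3 classes) are assumptions chosen to match the values used later in this project.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-alone layers with assumed sizes
embedding = nn.Embedding(1002, 50)
fc1 = nn.Linear(50, 100)
fc2 = nn.Linear(100, 3)

# A random batch of 16 sequences, each 128 token IDs long
x = torch.randint(0, 1002, (16, 128))

embedded = embedding(x)                  # one 50-dim vector per token
pooled = torch.mean(embedded, dim=1)     # average over the sequence dimension
logits = fc2(torch.relu(fc1(pooled)))    # one score per class

print(embedded.shape, pooled.shape, logits.shape)
```

Averaging over `dim=1` collapses the 128 token embeddings of each review into a single vector, so the linear layers see one fixed-size input per review regardless of word order.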

### Task 15

Next, let's create an instance of our `SimpleNNWithEmbedding` neural network class.

Start by defining the following variables:
- `vocab_size` - integer containing the vocabulary size (vocabulary length)
- `embed_size` - an embedding size of `50` dimensions for each token
- `hidden_size` - specifies the hidden layer in the neural network with `100` neurons
- `output_size` - specifies the number of class labels in our **multi-class** classification task
    - hint: each review will be classified as one of the following aspects: Cinematography, Characters, or Story
 
Instantiate the model with the variable parameters to the variable `text_classifier_nn`.

In [16]:
vocab_size = len(vocab)
embed_size = 50 
hidden_size = 100
output_size = 3

text_classifier_nn = SimpleNNWithEmbedding(vocab_size, embed_size, hidden_size, output_size)
print(text_classifier_nn)

SimpleNNWithEmbedding(
  (embedding): Embedding(1002, 50)
  (fc1): Linear(in_features=50, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=3, bias=True)
)


### Task 16

Next, let's initialize the loss function and optimizer for training:
- create an instance of the **cross-entropy** loss function in PyTorch (from the `torch.nn` module) and save it to the variable `criterion`
- create an instance of the **Adam** optimizer in PyTorch (from the `torch.optim` module) with a learning rate of **0.005** and save it to the variable `optimizer`

In [17]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(text_classifier_nn.parameters(), lr=0.005)
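
`nn.CrossEntropyLoss` takes raw logits and applies log-softmax internally, which is why the model's `forward` method returns the output of `fc2` without a softmax. For a single example, the loss is the negative log of the probability assigned to the true class. The sketch below computes this by hand for two made-up probability vectors (the helper `cross_entropy` and the numbers are assumptions for illustration), which gives a reference point for reading the training losses: a model that guesses uniformly over 3 classes has a loss of ln(3), about 1.0986.

```python
import math

def cross_entropy(probs, true_class):
    # Negative log-probability of the correct class for one example
    return -math.log(probs[true_class])

# Uniform guess over 3 classes
uniform_loss = cross_entropy([1/3, 1/3, 1/3], true_class=0)

# Confident and correct prediction
confident_loss = cross_entropy([0.98, 0.01, 0.01], true_class=0)

print(round(uniform_loss, 4), round(confident_loss, 4))
```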

### Task 17

Now let's train our neural network to classify movie reviews!

A. Create a function named `train_model` that takes in the following inputs:
- `model` is the text classification model
- `train_loader` is the training data contained in a PyTorch `DataLoader` object
- `criterion` is the loss function used to train the model
- `optimizer` is the optimizer used to train the model
- `num_epochs` is the number of training epochs

B. In the function, create a `for` loop that loops through the number of epochs specified by the input `num_epochs`. Within the loop:
- set the model to training mode
- track the loss per epoch by initializing the variable `epoch_loss` to `0.0`

C. Create a nested `for` loop within the first `for` loop such that:

1. Loops through the inputs and labels of each training batch in the input `train_loader` 
2. Reset the gradients at each iteration
3. Input the training batch through the forward pass
4. Calculate the cross-entropy loss
5. Backpropagate the loss through the network
6. Adjust the weights and biases
7. Update the total loss within the current training epoch

D. Run the function to train the `text_classifier_nn` model on the training dataset stored in the iterable `train_dataloader` using the loss function saved in `criterion` and the optimizer saved in `optimizer` for `50` epochs.

In [18]:
def train_model(model, train_loader, criterion, optimizer, num_epochs=5):
    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0.0
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            output = model(batch_X)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        avg_loss = epoch_loss / len(train_loader)
        if (epoch + 1) % 5 == 0:
            print(f"[Epoch {epoch + 1}/{num_epochs}], Average CE Loss: {avg_loss:.4f}")

train_model(text_classifier_nn, train_dataloader, criterion, optimizer, num_epochs=50)

[Epoch 5/50], Average CE Loss: 0.9832
[Epoch 10/50], Average CE Loss: 0.7402
[Epoch 15/50], Average CE Loss: 0.4075
[Epoch 20/50], Average CE Loss: 0.1535
[Epoch 25/50], Average CE Loss: 0.0710
[Epoch 30/50], Average CE Loss: 0.0363
[Epoch 35/50], Average CE Loss: 0.0280
[Epoch 40/50], Average CE Loss: 0.0120
[Epoch 45/50], Average CE Loss: 0.0163
[Epoch 50/50], Average CE Loss: 0.0119


### Task 18

Let's evaluate the trained neural network on an out-of-sample testing dataset.

Import the testing dataset stored in the CSV file `"datasets/imdb_movie_reviews_test.csv"` to a pandas DataFrame named `test_reviews_df`.

Then, create the _testing corpus_ by using the `.tolist()` method to convert the review texts and their corresponding aspect labels into the following lists:

- `test_texts` contains the text of each movie review
- `test_labels` contains the encoded aspect label of each review

In [19]:
test_reviews_df = pd.read_csv("datasets/imdb_movie_reviews_test.csv")

test_texts = test_reviews_df['review'].tolist()
test_labels = test_reviews_df['aspect_encoded'].tolist()

### Task 19

Let's pre-process (tokenize, encode, pad, and truncate) the text sequences in the testing dataset using the previous functions:
- `encode_text` - uses the vocabulary to tokenize and encode text into token IDs
- `pad_or_truncate` - pads or truncates the encoded text to a specified maximum sequence length

Be sure to use the same maximum sequence length we used for the training dataset.

Save the pre-processed test sequences to the variable `padded_text_seqs_test`.

In [20]:
padded_text_seqs_test = [pad_or_truncate(encode_text(test_seq, vocab), MAX_SEQ_LENGTH) for test_seq in test_texts]

### Task 20

Next, let's convert the pre-processed test sequences into PyTorch tensors:
- create the input tensor `X_tensor_test` by converting the padded/truncated test sequences
- create the label tensor `y_tensor_test` by converting the testing labels (be sure to specify the datatype `torch.long`)


Using the PyTorch utility module `torch.utils.data`, let's organize the testing input and label tensors into a single dataset object and an iterable that will allow us to load the testing data in batches:
- create the variable `test_dataset` using the `TensorDataset` utility class to organize the input tensor and label tensor into a single dataset object
- create the variable `test_loader` using the `DataLoader` utility class to create an iterable that loads the `test_dataset` in batches of `8` (be sure to set `shuffle` to `False`)

In [21]:
X_tensor_test = torch.tensor(padded_text_seqs_test)
y_tensor_test = torch.tensor(test_labels, dtype=torch.long)

from torch.utils.data import DataLoader, TensorDataset
test_dataset = TensorDataset(X_tensor_test, y_tensor_test)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

### Task 21

Let's generate predictions from the trained neural network on the reviews in the testing dataset!

A. Create the function `get_predictions_and_probabilities` that takes in the trained model and testing dataloader as input.

B. Within the function:
- set the model to evaluation mode
- initialize the following empty lists:
    - `all_probs` - stores all of the predicted probabilities for the testing dataset
    - `all_labels` - stores all of the predicted labels for the testing dataset

- using `with torch.no_grad()`, loop through each batch in the testing dataloader and:
    -  generate outputs from the forward pass
    -  use the **softmax** function to generate predicted probabilities (be sure to add the probabilities to the `all_probs` list using `extend()`)
    -  use the **argmax** function to select the class label with the highest probability (be sure to add the labels to the `all_labels` list using `extend()`)
- the function should return the predicted probabilities `all_probs` and predicted labels `all_labels`

C. Apply the `get_predictions_and_probabilities` to generate predictions. Save the predicted probabilities to the variable `pred_probs` and predicted labels to the variable `pred_labels`.

In [22]:
import torch.nn.functional as F

def get_predictions_and_probabilities(model, test_loader):
    model.eval()

    all_probs = [] 
    all_labels = []

    with torch.no_grad(): 
        for batch_X, batch_y in test_loader:
            outputs = model(batch_X)
            probs = F.softmax(outputs, dim=1)  
            all_probs.extend(probs.cpu().numpy())
            predicted_labels = torch.argmax(outputs, dim=1)
            all_labels.extend(predicted_labels.cpu().numpy())

    return all_probs, all_labels

pred_probs, pred_labels = get_predictions_and_probabilities(text_classifier_nn, test_loader)
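
The relationship between logits, softmax probabilities, and the argmax label can be checked by hand. The sketch below is a plain-Python version for a single made-up logit vector (the helper `softmax` and the numbers are assumptions for illustration). Because softmax preserves the ordering of its inputs, taking the argmax of the logits gives the same label as taking the argmax of the probabilities.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one review over the 3 aspects
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

pred_from_logits = max(range(len(logits)), key=lambda i: logits[i])
pred_from_probs = max(range(len(probs)), key=lambda i: probs[i])

print([round(p, 3) for p in probs], pred_from_logits)
```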

### Task 22

Let's evaluate the model's predictions (`pred_labels`) with the true labels (`test_labels`).

A. Generate a **confusion matrix** to count the number of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). Save the confusion matrix to the variable `conf_matrix`.

B. Generate a **classification report** to calculate the accuracy, precision, recall, and F1-score metrics for each label. Save the classification report to the variable `report`.

Print the confusion matrix and classification report.

In [23]:
from sklearn.metrics import confusion_matrix, classification_report
conf_matrix = confusion_matrix(test_labels, pred_labels)
report = classification_report(test_labels, pred_labels)

print(conf_matrix)
print(report)

[[27 18  4]
 [ 1 35  2]
 [ 2 15 28]]
              precision    recall  f1-score   support

           0       0.90      0.55      0.68        49
           1       0.51      0.92      0.66        38
           2       0.82      0.62      0.71        45

    accuracy                           0.68       132
   macro avg       0.75      0.70      0.68       132
weighted avg       0.76      0.68      0.69       132
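
The numbers in the classification report can be reproduced directly from the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below is a plain-Python recomputation using the matrix printed above; the variable names are assumptions for illustration. Precision for a class divides its diagonal entry by its column total, and recall divides it by its row total.

```python
# Confusion matrix from the output above (rows: true label, columns: predicted label)
conf = [[27, 18, 4],
        [1, 35, 2],
        [2, 15, 28]]
n = len(conf)

metrics = {}
for k in range(n):
    tp = conf[k][k]
    predicted_k = sum(conf[i][k] for i in range(n))  # column total
    actual_k = sum(conf[k])                          # row total
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    metrics[k] = (round(precision, 2), round(recall, 2), round(f1, 2))

accuracy = sum(conf[k][k] for k in range(n)) / sum(sum(row) for row in conf)
print(metrics, round(accuracy, 2))
```

For example, label 1 (Characters) was predicted 68 times but only 35 of those were correct, which is where its precision of 0.51 comes from.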



<details><summary style="display:list-item; font-size:16px; color:blue;">How well does the simple neural network perform when classifying movie reviews?</summary>

    
**Confusion Matrix**

We can interpret the confusion matrix by this structure:
    
```py
    [[TP0 FN0  FN0]
     [FN1 TP1  FN1]
     [FN2 FN2  TP2]]
```

Each row corresponds to the TPs and FNs for each review aspect: 
    
- row 1 corresponds to the TPs and FNs for Cinematography reviews (label 0)
    - 27 Cinematography reviews were correctly classified as Cinematography reviews (label 0)
    - 18 Cinematography reviews were incorrectly classified as Character reviews (label 1)
    - 4 Cinematography reviews were incorrectly classified as Story reviews (label 2)
- row 2 corresponds to the TPs and FNs for Character reviews (label 1)
    - 35 Character reviews were correctly classified as Character reviews (label 1)
    - 1 Character review was incorrectly classified as a Cinematography review (label 0)
    - 2 Character reviews were incorrectly classified as Story reviews (label 2)
- row 3 corresponds to the TPs and FNs for Story reviews (label 2)
    - 28 Story reviews were correctly classified as Story reviews (label 2)
    - 2 Story reviews were incorrectly classified as Cinematography reviews (label 0)
    - 15 Story reviews were incorrectly classified as Character reviews (label 1)    

**Classification Report**

When classifying movie reviews, the simple neural network (with an embedding layer) had an overall accuracy of 68%, which is better than chance for three classes but still leaves many reviews misclassified.
    
For Cinematography reviews (label 0):
- the model has a high precision score of 90%, meaning that when it classified a review as a Cinematography review, it was right 90% of the time
- however, the model has a low recall score of 55%, meaning that it identified only 55% of all Cinematography reviews, which suggests that it is underpredicting this aspect (18 of the 49 Cinematography reviews were labeled as Character reviews)
    
For Character reviews (label 1):
- the model has a low precision score of 51%, meaning that when it classified a review as a Character review, it was right only about half of the time
- however, the model has a high recall score of 92%, meaning that it successfully identified 92% of all Character reviews
- this suggests that the model rarely misses a Character review, but often mislabels reviews of the other aspects as Character reviews
    
For Story reviews (label 2):
- the model has a decent precision score of 82%, meaning that when it classified a review as a Story review, it was right 82% of the time
- however, the model has a low recall score of 62%, meaning that it could only identify 62% of all Story reviews
    
Overall, the model performs moderately well in classifying movie review aspects, but there is definitely room for improvement.

</details>

## Task Group 4 - Fine-tuning a TinyBERT Transformer

Let's now train a more advanced BERT transformer language model! Specifically, let's fine-tune a TinyBERT model to classify movie reviews! 

In [24]:
import torch
torch.manual_seed(42) # set random seed --do not change!

<torch._C.Generator at 0x20a53550110>

### Task 23

Load the pre-trained BERT model using the `transformers` library from Hugging Face.

Due to memory and hardware constraints, we won't be able to use the full pre-trained BERT model, which contains 12 layers and 110 million parameters. Instead, we'll use **TinyBERT**, a smaller, distilled version of BERT with 4 layers and 14.5 million parameters, developed by Huawei Noah's Ark Lab (https://huggingface.co/huawei-noah).

A. First, save the TinyBERT model name `'huawei-noah/TinyBERT_General_4L_312D'` as a string to the variable `model_name`.

B. Download and load the TinyBERT tokenizer using the `BertTokenizer` class and save the tokenizer to the variable `tinybert_tokenizer`.

C. Download and load the TinyBERT model (weights) using the `BertForSequenceClassification` class. Be sure to set `num_labels` to the correct number of class labels (the number of movie aspects). Save the TinyBERT model to the variable `text_classifier_tinybert`.

In [25]:
from transformers import BertTokenizer, BertForSequenceClassification

model_name = 'huawei-noah/TinyBERT_General_4L_312D'
tinybert_tokenizer = BertTokenizer.from_pretrained(model_name)
text_classifier_tinybert = BertForSequenceClassification.from_pretrained(model_name, num_labels=n_aspects)

### Task 24

Before fine-tuning the TinyBERT, let's **freeze** and **unfreeze** the following layers:

- first, freeze all of the parameters in the pre-trained TinyBERT
- second, unfreeze the classification layer added on top of the pre-trained model
- third, unfreeze the encoder layer specified at `layer[3]` 

In [26]:
for param in text_classifier_tinybert.bert.parameters():
    param.requires_grad = False

for param in text_classifier_tinybert.classifier.parameters():
    param.requires_grad = True

for param in text_classifier_tinybert.bert.encoder.layer[3].parameters():
    param.requires_grad = True
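
The freeze-then-unfreeze pattern can be verified by counting trainable parameters. The sketch below applies the same idea to a small stand-in module rather than the real TinyBERT; the `toy` module, its layer sizes, and its key names are assumptions for illustration and do not mirror the actual model structure.

```python
import torch.nn as nn

# Hypothetical stand-in: 4 "encoder" layers plus a classification head
toy = nn.ModuleDict({
    "encoder": nn.ModuleList([nn.Linear(8, 8) for _ in range(4)]),
    "classifier": nn.Linear(8, 3),
})

# Freeze every encoder layer, then unfreeze only the last one
for param in toy["encoder"].parameters():
    param.requires_grad = False
for param in toy["encoder"][3].parameters():
    param.requires_grad = True

total = sum(p.numel() for p in toy.parameters())
trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
print(f"{trainable} of {total} parameters are trainable")
```

The same two `sum(...)` expressions can be run on `text_classifier_tinybert.parameters()` to confirm that only the classifier and the last encoder layer will be updated during fine-tuning.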

### Task 25

Let's pre-process the training text reviews (`train_texts`) with the TinyBERT tokenizer:
- specify a maximum sequence length of `124` and save the value to the variable `MAX_SEQ_LENGTH_TINYBERT`
- create the tensor `X_train` that contains the tokenized training text tokenized by the TinyBERT tokenizer:
    - the sequences have a maximum length of `MAX_SEQ_LENGTH_TINYBERT` tokens
    - the sequences are **padded**
    - the sequences are **truncated**
    - the padded and truncated sequences are returned as PyTorch tensors
- create the tensor `y_train` that contains the corresponding training labels in `train_labels` converted to a PyTorch tensor (be sure to specify the `torch.long` datatype)

Using the PyTorch utility module `torch.utils.data`, let's organize the training input and label tensors into a single dataset object and an iterable that will allow us to load the training data in batches:
- create the variable `train_dataset` using the `TensorDataset` utility class to organize the input tensor `X_train` and label tensor `y_train` into a single dataset object (be sure to specify the attention mask in the input tensor created by the TinyBERT tokenizer)
- create the variable `train_loader` using the `DataLoader` utility class to create an iterable that loads the `train_dataset` in batches of `16` (be sure to set `shuffle` to `True`)

In [27]:
MAX_SEQ_LENGTH_TINYBERT = 124

X_train = tinybert_tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt", max_length=MAX_SEQ_LENGTH_TINYBERT)
y_train = torch.tensor(train_labels, dtype=torch.long)

from torch.utils.data import DataLoader, TensorDataset
train_dataset = TensorDataset(X_train['input_ids'], X_train['attention_mask'], y_train)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

### Task 26

Next, let's initialize the optimizer and loss function to fine-tune our TinyBERT model (`text_classifier_tinybert`).

A. Initialize the Adam optimizer with **weight decay** with a learning rate of `0.0025`. Ensure that only the unfrozen layers are optimized during training. Save the optimizer to the variable `optimizer`.

B. Initialize the cross-entropy loss function. Save the loss function to the variable `criterion`.

In [28]:
import torch.nn as nn
import torch.optim as optim

optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, text_classifier_tinybert.parameters()), lr=0.0025)
criterion = nn.CrossEntropyLoss()

### Task 27

Let's create a training loop to fine-tune the TinyBERT model for `10` epochs!

A. Set the value of `10` to the variable `num_epochs`.

B. Create a training loop that iterates over the epochs. In each epoch:
- set the TinyBERT model to training mode
- initialize the variable `total_loss` to `0.0` to keep track of the total loss per epoch
- create a nested loop that loops through each batch (be sure to specify the batch's attention mask) in the training dataloader where for each batch:
    1. reset the gradients
    2. apply the forward pass to the training batch (be sure to pass the current batch's attention mask)
    3. extract the logits from the forward pass output
    4. calculate the cross-entropy loss
    5. update the `total_loss`
    6. backpropagate the loss through the network
    7. adjust the weights and biases

In [29]:
num_epochs = 10
for epoch in range(num_epochs):
    text_classifier_tinybert.train()
    total_loss = 0.0
    
    for batch_X, batch_attention_mask, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = text_classifier_tinybert(input_ids=batch_X, 
                                     attention_mask=batch_attention_mask)
        logits = outputs.logits
        loss = criterion(logits, batch_y)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    
    avg_loss = total_loss / len(train_loader)
    print(f'Epoch {epoch+1}, Loss: {avg_loss}')

Epoch 1, Loss: 1.0000568603475888
Epoch 2, Loss: 0.584193958590428
Epoch 3, Loss: 0.3997136053318779
Epoch 4, Loss: 0.3257132777944207
Epoch 5, Loss: 0.37498734301577014
Epoch 6, Loss: 0.26159133203327656
Epoch 7, Loss: 0.23366263881325722
Epoch 8, Loss: 0.22857480434079966
Epoch 9, Loss: 0.2677521393634379
Epoch 10, Loss: 0.49066755175590515
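
The reported loss is `total_loss / len(train_loader)`, and `len(train_loader)` is the number of batches, so each epoch's value is the mean of the per-batch mean losses. If the last batch is smaller than the others, this differs slightly from the mean loss per review. The sketch below illustrates the difference; the loss values and batch sizes are made up for illustration and are not taken from the training run.

```python
# Hypothetical per-batch mean losses and batch sizes for one epoch;
# the last batch is smaller, as happens when the dataset size is not
# a multiple of the batch size.
batch_losses = [1.0, 0.5, 0.2]
batch_sizes = [16, 16, 4]

# What the training loop reports: the mean of the per-batch means.
mean_of_batches = sum(batch_losses) / len(batch_losses)

# Alternative: weight each batch by its size to get the per-review mean.
per_sample_mean = (
    sum(loss * n for loss, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)
)

print(f"mean of batch means: {mean_of_batches:.4f}")
print(f"per-review mean:     {per_sample_mean:.4f}")
```

For tracking whether the loss is trending down across epochs, either average works; the distinction matters mostly when comparing numbers across different batch sizes.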


### Task 28

Now that the TinyBERT model is fine-tuned, let's evaluate how the transformer performs on the movie reviews in the testing set and compare its performance with the simple neural network from before!

Pre-process the testing dataset into the following PyTorch tensors:

- create the tensor `X_test` by tokenizing the testing dataset (`test_texts`) with the TinyBERT tokenizer:
    - the maximum sequence length should be the same as the length used when fine-tuning the TinyBERT model
    - apply padding and truncation
    - return the tokenized text as a PyTorch tensor
- create the tensor `y_test` that contains the testing labels `test_labels` as a PyTorch tensor (be sure to specify the `torch.long` datatype).

Using the PyTorch utility module `torch.utils.data`, let's organize the testing input and label tensors into a single dataset object and an iterable that will allow us to load the testing data in batches:
- create the variable `test_dataset` using the `TensorDataset` utility class to organize the input tensor `X_test` and label tensor `y_test` into a single dataset object (be sure to include the attention mask created by the TinyBERT tokenizer)
- create the variable `test_loader` using the `DataLoader` utility class to create an iterable that loads the `test_dataset` in batches of `8` (be sure to set `shuffle` to `False`)

In [30]:
X_test = tinybert_tokenizer(test_texts, padding=True, truncation=True, return_tensors="pt", max_length=MAX_SEQ_LENGTH_TINYBERT)
y_test = torch.tensor(test_labels, dtype=torch.long)

from torch.utils.data import DataLoader, TensorDataset
test_dataset = TensorDataset(X_test['input_ids'], X_test['attention_mask'], y_test)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)
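
As a rough illustration of how `TensorDataset` and `DataLoader` fit together, the sketch below uses random toy tensors in place of real tokenizer output. The names `toy_input_ids`, `toy_attention_mask`, and `toy_labels` are hypothetical, as are the sizes (20 reviews, 6 tokens each).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins for tokenizer output: 20 "reviews", 6 tokens each.
toy_input_ids = torch.randint(0, 100, (20, 6))
toy_attention_mask = torch.ones(20, 6, dtype=torch.long)
toy_labels = torch.randint(0, 3, (20,))

toy_dataset = TensorDataset(toy_input_ids, toy_attention_mask, toy_labels)
toy_loader = DataLoader(toy_dataset, batch_size=8, shuffle=False)

# Each batch unpacks in the same order the tensors were passed in.
for ids, mask, labels in toy_loader:
    print(ids.shape, mask.shape, labels.shape)
```

With `shuffle=False`, the batches follow the original row order, so the predictions collected later line up with `test_labels`. The final batch is smaller whenever the dataset size is not a multiple of the batch size.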

### Task 29

Next, let's generate predictions from the fine-tuned TinyBERT!

A. Set the fine-tuned TinyBERT model to evaluation mode.

B. Initialize the following empty lists:
- `pred_probs` - stores the predicted probabilities for the reviews in the testing dataset
- `pred_labels` - stores the predicted labels for the reviews in the testing dataset

C. Within `with torch.no_grad()`:
- loop through each batch in the testing dataloader (be sure to specify the attention masks)
- apply the forward pass to each batch
- obtain the logits from the outputs from the forward pass
- use the **softmax** function to generate predicted probabilities (be sure to add the probabilities to the `pred_probs` list using `extend()`)
- use the **argmax** function to select the class label with the highest probability (be sure to add the labels to the `pred_labels` list using `extend()`)

In [31]:
import torch.nn.functional as F

text_classifier_tinybert.eval()
pred_probs = []
pred_labels = []

with torch.no_grad():
    for batch_X, batch_attention_mask, batch_y in test_loader:
        outputs = text_classifier_tinybert(input_ids=batch_X,
                                           attention_mask=batch_attention_mask)
        logits = outputs.logits
        probs = F.softmax(logits, dim=1)
        pred_probs.extend(probs.cpu().numpy())
        
        predicted_labels = torch.argmax(logits, dim=1)
        pred_labels.extend(predicted_labels.cpu().numpy())
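
The relationship between logits, softmax probabilities, and argmax labels can be checked on a small example. The logits below are made-up values for two reviews over three aspect classes, not outputs of the TinyBERT model.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for two reviews over three aspect classes.
toy_logits = torch.tensor([[2.0, 0.5, -1.0],
                           [0.1, 0.2, 3.0]])

# Softmax turns each row of logits into probabilities that sum to 1.
toy_probs = F.softmax(toy_logits, dim=1)

# Softmax preserves the ordering within a row, so argmax gives the same
# label whether it is applied to the logits or to the probabilities.
labels_from_logits = torch.argmax(toy_logits, dim=1)
labels_from_probs = torch.argmax(toy_probs, dim=1)

print(toy_probs.sum(dim=1))
print(labels_from_logits, labels_from_probs)
```

This is why the cell above can take the argmax of `logits` directly: the softmax step is only needed when the probabilities themselves are of interest.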

### Task 30

Let's see how well our fine-tuned TinyBERT model classifies movie reviews!

Compare the TinyBERT model's predictions (`pred_labels`) against the true labels (`test_labels`) using the following:

A. Generate a **confusion matrix** to count the number of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). Save the confusion matrix to the variable `conf_matrix`.

B. Generate a **classification report** to calculate the accuracy, precision, recall, and F1-score metrics for each label. Save the classification report to the variable `report`.

Print the confusion matrix and classification report.

How does the fine-tuned TinyBERT model's performance compare to the simple neural network?

In [32]:
from sklearn.metrics import confusion_matrix, classification_report

# Use `conf_matrix` so the imported `confusion_matrix` function is not overwritten
conf_matrix = confusion_matrix(test_labels, pred_labels)
report = classification_report(test_labels, pred_labels)

print(conf_matrix)
print(report)

[[47  0  2]
 [ 4 29  5]
 [ 3  2 40]]
              precision    recall  f1-score   support

           0       0.87      0.96      0.91        49
           1       0.94      0.76      0.84        38
           2       0.85      0.89      0.87        45

    accuracy                           0.88       132
   macro avg       0.89      0.87      0.87       132
weighted avg       0.88      0.88      0.88       132



<details><summary style="display:list-item; font-size:16px; color:blue;">How does the fine-tuned TinyBERT model's performance compare to the simple neural network?</summary>
   
**Confusion Matrix**

We can interpret the confusion matrix by this structure:
    
```py
    [[TP0 FN0  FN0]
     [FN1 TP1  FN1]
     [FN2 FN2  TP2]]
```

Each row corresponds to the TPs and FNs for each review aspect: 
    
- row 1 corresponds to the TPs and FNs for Cinematography reviews (label 0)
    - 47 Cinematography reviews were correctly classified as Cinematography reviews (label 0)
    - 0 Cinematography reviews were incorrectly classified as Character reviews (label 1)
    - 2 Cinematography reviews were incorrectly classified as Story reviews (label 2)
- row 2 corresponds to the TPs and FNs for Character reviews (label 1)
    - 29 Character reviews were correctly classified as Character reviews (label 1)
    - 4 Character reviews were incorrectly classified as Cinematography reviews (label 0)
    - 5 Character reviews were incorrectly classified as Story reviews (label 2)
- row 3 corresponds to the TPs and FNs for Story reviews (label 2)
    - 40 Story reviews were correctly classified as Story reviews (label 2)
    - 3 Story reviews were incorrectly classified as Cinematography reviews (label 0)
    - 2 Story reviews were incorrectly classified as Character reviews (label 1)
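
The per-label precision and recall in the classification report can be recomputed by hand from the confusion matrix printed in the cell output. The sketch below is a plain-Python check, assuming scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix copied from the cell output
# (rows = true labels, columns = predicted labels).
cm = [[47, 0, 2],
      [4, 29, 5],
      [3, 2, 40]]

num_classes = len(cm)

# Recall for label k: correct predictions divided by the row total
# (all reviews whose true label is k).
recall = [cm[k][k] / sum(cm[k]) for k in range(num_classes)]

# Precision for label k: correct predictions divided by the column total
# (all reviews predicted as label k).
precision = [cm[k][k] / sum(row[k] for row in cm) for k in range(num_classes)]

# Accuracy: the diagonal total divided by the number of reviews.
accuracy = sum(cm[k][k] for k in range(num_classes)) / sum(map(sum, cm))

for k in range(num_classes):
    print(f"label {k}: precision={precision[k]:.2f}, recall={recall[k]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading across a row gives recall, and reading down a column gives precision.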

**Classification Report**

When classifying movie reviews, the fine-tuned TinyBERT achieved an overall accuracy of 88%, outperforming the simple neural network, which had an overall accuracy of 68%.

For Cinematography reviews (label 0):
- the model has a high precision score of 87%, meaning that when it classified a review as a Cinematography review, it was right 87% of the time
- the model has a very high recall score of 96%, meaning that it correctly identified 96% of all Cinematography reviews

For Character reviews (label 1):
- the model has a high precision score of 94%, meaning that when it classified a review as a Character review, it was right 94% of the time
- the model has a lower recall score of 76%, meaning that it identified 76% of all Character reviews

For Story reviews (label 2):
- the model has a high precision score of 85%, meaning that when it classified a review as a Story review, it was right 85% of the time
- the model has a high recall score of 89%, meaning that it identified 89% of all Story reviews

The macro and weighted average F1 scores of 87% and 88% tell us that the model performs well across all review aspects.

Overall, the model performs well in classifying movie review aspects, showing that fine-tuning a TinyBERT model outperforms the simple neural network for this text classification task!

</details>

Note: The Jupyter Notebook kernel may crash or be unresponsive after fine-tuning the TinyBERT model. Be sure to save your work every so often to avoid losing any work you've done so far! Feel free to restart the kernel and re-run the cells.

That's the end of our project on building text classification models to classify different aspects of movie reviews! There's definitely a lot of room for improvement and we encourage you to use your skills to explore different techniques to enhance the language models.

Here are some areas for improvement:
- increase the maximum sequence length
- increase the training batch sizes
- freeze/unfreeze different layers in the TinyBERT architecture
- increase the number of training epochs
- try different optimizers and learning rates
- try a different transformer model
- increase the size of the training dataset ([Full Hugging Face Dataset](https://huggingface.co/datasets/Lowerated/lm6-movies-reviews-aspects))

You might want to consider building and training larger language models on your own device or cloud platform with greater memory.

Happy coding!