# Fine-Tuning a transformer model

For this assignment I use the IMDB movie review dataset that I found on Kaggle. I could not import the dataset from Hugging Face because of a server error, so I did this assignment with a different dataset.

## 1. Data understanding

I'll start by importing the dataset.

In [58]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

In [59]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset from the local file
df = pd.read_csv("imdb dataset.csv")

# Preview the first few rows
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


Then I look at the shape of the dataset.

In [60]:
df.shape

(50000, 2)

This dataset has 50,000 rows and 2 columns. The columns are:
- review
- sentiment

Next I create a brief overview of the dataset with the describe() function. For text columns it reports the number of non-missing values (count), the number of distinct values (unique), the most frequent value (top), and how often that value occurs (freq).

In [61]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


From this summary, we can infer the following:

- The dataset contains 50,000 reviews in total.
- The "review" column has 49,582 unique values, so some reviews appear more than once in the dataset.
- The "sentiment" column has 2 unique values: each review is either positive or negative.
- The sentiment listed as most frequent is positive, with 25,000 reviews. That is exactly half of the 50,000 rows, so the two classes are perfectly balanced and neither sentiment is actually more common.
- The most frequent review is "Loved today's show!!! It was a variety and not...", which appears 5 times. The summary does not show whether these copies all carry the same sentiment label.
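
Whether duplicated reviews carry consistent labels can be checked directly. The sketch below is a minimal illustration on a made-up toy DataFrame; the rows and the names `toy`, `dupes` and `conflicting` are assumptions for the example and not part of the notebook. The same calls should work on `df`.

```python
import pandas as pd

# Made-up toy data standing in for the real review DataFrame.
toy = pd.DataFrame({
    "review": ["great show", "great show", "awful", "great show", "fine"],
    "sentiment": ["positive", "positive", "negative", "negative", "positive"],
})

# All rows whose review text occurs more than once (keep=False marks every copy).
dupes = toy[toy.duplicated(subset="review", keep=False)]

# Number of distinct sentiment labels per duplicated review text.
labels_per_review = dupes.groupby("review")["sentiment"].nunique()

# Review texts whose copies disagree on the label.
conflicting = labels_per_review[labels_per_review > 1].index.tolist()
print(conflicting)
```

If the list comes back empty on the real data, the duplicates could be dropped without losing label information.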

## 2. Cleaning the data

In this part, I will examine whether there are any missing values in the dataset and whether any rows or columns can be removed. Missing values refer to the absence of data in certain cells, and they can occur for various reasons such as errors in data collection or data entry.

In [62]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

There are no missing values in this dataset.

## 3. Tokenize the data with NLTK

Now I'll start with tokenizing the data.

In [63]:
import nltk
import numpy as np
import torch
from torch import nn
import time

df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


The command "nltk.download('punkt')" starts the NLTK downloader and installs the Punkt data. Punkt is a pre-trained sentence tokenizer that splits text into sentences. The function nltk.word_tokenize relies on it to find sentence boundaries before it breaks each sentence down into individual word tokens.

In [64]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Warmtebron\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [65]:
# Tokenize a review and keep only purely alphabetic tokens
# (this drops punctuation, numbers, and tokens such as "'s")
def tokenize(column):
    tokens = nltk.word_tokenize(column)
    return [w for w in tokens if w.isalpha()]

We will use our function on the IMDB dataset.

In [66]:
df['tokenized'] = df['review'].apply(tokenize)
df[['tokenized']].head()

Unnamed: 0,tokenized
0,"[One, of, the, other, reviewers, has, mentione..."
1,"[A, wonderful, little, production, br, br, The..."
2,"[I, thought, this, was, a, wonderful, way, to,..."
3,"[Basically, there, a, family, where, a, little..."
4,"[Petter, Mattei, Love, in, the, Time, of, Mone..."
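
The tokenized output above contains `br` tokens, which are left over from the `<br />` HTML tags in the raw reviews. One possible way to avoid them is to strip the tags before tokenizing. The sketch below uses only the standard library; `strip_html_breaks` and `simple_tokenize` are hypothetical helpers, and `simple_tokenize` is only a rough stand-in for `nltk.word_tokenize` plus the `isalpha()` filter, not an exact equivalent.

```python
import re

def strip_html_breaks(text):
    """Replace <br>, <br/> and <br /> tags with a single space."""
    return re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)

def simple_tokenize(text):
    """Rough stand-in for the notebook's tokenizer: runs of ASCII letters only."""
    return re.findall(r"[A-Za-z]+", text)

raw = "A wonderful little production. <br /><br />The filming technique"

tokens_before = simple_tokenize(raw)                    # still contains 'br'
tokens_after = simple_tokenize(strip_html_breaks(raw))  # tags removed first
print(tokens_after)
```

In the notebook, the same idea would amount to applying `strip_html_breaks` to the review column before calling `tokenize`.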


## 4. Encoding the reviews

To encode the tokenized data, I assign a numerical value to each word that appears in the review column. Then I pad the encoded data so that all sequences of numbers have the same length. Padding adds zeros to the end of every sequence that is shorter than the chosen length, and sequences that are longer are cut off at the end.
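
To make the two steps concrete, here is a minimal pure-Python sketch of frequency-based word indexing followed by post-padding and post-truncation. It only loosely mirrors what the Keras `Tokenizer` and `pad_sequences` do; the helper names `fit_word_index`, `encode` and `pad_post` and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

def fit_word_index(tokenized_texts):
    """Map each lowercased word to an integer; the most frequent word gets 1."""
    counts = Counter(w.lower() for tokens in tokenized_texts for w in tokens)
    # Index 0 is left free so it can serve as the padding value.
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def encode(tokens, word_index):
    """Turn a list of tokens into a list of integers, skipping unknown words."""
    return [word_index[w.lower()] for w in tokens if w.lower() in word_index]

def pad_post(seq, maxlen, value=0):
    """Cut the sequence off after maxlen items, then fill the end with `value`."""
    seq = seq[:maxlen]
    return seq + [value] * (maxlen - len(seq))

texts = [["The", "movie", "was", "good"],
         ["The", "movie", "was", "bad", "the", "end"]]

word_index = fit_word_index(texts)
padded = [pad_post(encode(t, word_index), maxlen=5) for t in texts]
print(padded)
```

The first toy sentence is shorter than `maxlen` and gets a trailing zero, while the second is longer and loses its last word.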

In [67]:
from keras.preprocessing.text import Tokenizer

# Tokenize the keywords
t = Tokenizer()
t.fit_on_texts(df['tokenized'])
df['sequences'] = t.texts_to_sequences(df['tokenized'])

df.head()

Unnamed: 0,review,sentiment,tokenized,sequences
0,One of the other reviewers has mentioned that ...,positive,"[One, of, the, other, reviewers, has, mentione...","[28, 4, 1, 79, 1940, 45, 1025, 12, 99, 142, 40..."
1,A wonderful little production. <br /><br />The...,positive,"[A, wonderful, little, production, br, br, The...","[3, 370, 118, 351, 7, 7, 1, 1321, 2928, 6, 52,..."
2,I thought this was a wonderful way to spend ti...,positive,"[I, thought, this, was, a, wonderful, way, to,...","[10, 187, 11, 13, 3, 370, 96, 5, 1072, 60, 21,..."
3,Basically there's a family where a little boy ...,negative,"[Basically, there, a, family, where, a, little...","[641, 38, 3, 223, 111, 3, 118, 405, 3192, 1166..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[Petter, Mattei, Love, in, the, Time, of, Mone...","[60692, 10349, 112, 9, 1, 60, 4, 281, 6, 3, 20..."


In [68]:
# Separating all the numbers
df['sequences'].apply(lambda x: pd.Series(str(x).split(",")))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2444,2445,2446,2447,2448,2449,2450,2451,2452,2453
0,[28,4,1,79,1940,45,1025,12,99,142,...,,,,,,,,,,
1,[3,370,118,351,7,7,1,1321,2928,6,...,,,,,,,,,,
2,[10,187,11,13,3,370,96,5,1072,60,...,,,,,,,,,,
3,[641,38,3,223,111,3,118,405,3192,1166,...,,,,,,,,,,
4,[60692,10349,112,9,1,60,4,281,6,3,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,[10,187,11,17,70,3,180,195,49,282,...,,,,,,,,,,
49996,[84,114,84,392,84,113,2888,926,1,605,...,,,,,,,,,,
49997,[10,222,3,3380,4195,9,36866,8061,5260,32,...,,,,,,,,,,
49998,[10,159,5,26,5,2930,15,1,858,875,...,,,,,,,,,,


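The longest review contains 2454 numbers, so 2454 is the maxlen value that would keep every review complete. The next code snippet uses maxlen=100 instead, which cuts longer reviews off after their first 100 numbers.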
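
The longest length can also be computed without expanding every sequence into columns, and it helps to know how many reviews a given maxlen would truncate. The sketch below works on a made-up list of sequences; `length_summary` and `toy_sequences` are hypothetical names. On the real data, something like `df['sequences'].str.len().max()` should give the same maximum.

```python
def length_summary(sequences, maxlen):
    """Return the longest sequence length and the share of sequences above maxlen."""
    lengths = [len(s) for s in sequences]
    truncated = sum(1 for n in lengths if n > maxlen)
    return max(lengths), truncated / len(lengths)

# Made-up sequences of different lengths standing in for the encoded reviews.
toy_sequences = [[1] * 3, [1] * 8, [1] * 120, [1] * 250]

longest, share_truncated = length_summary(toy_sequences, maxlen=100)
print(longest, share_truncated)
```

A high share of truncated reviews would suggest trying a larger maxlen, at the cost of more padding and slower training.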

In [69]:
from keras_preprocessing.sequence import pad_sequences

# Padding
input_ids = pad_sequences(
    df['sequences'], maxlen=100, dtype="long", truncating="post", padding="post"
)

#print(input_ids)

np.array(input_ids).shape

(50000, 100)

In [70]:
# Replacing the sequence column with the new sequence
input_ids = input_ids.tolist()
df['sequences'] = input_ids
df

Unnamed: 0,review,sentiment,tokenized,sequences
0,One of the other reviewers has mentioned that ...,positive,"[One, of, the, other, reviewers, has, mentione...","[28, 4, 1, 79, 1940, 45, 1025, 12, 99, 142, 40..."
1,A wonderful little production. <br /><br />The...,positive,"[A, wonderful, little, production, br, br, The...","[3, 370, 118, 351, 7, 7, 1, 1321, 2928, 6, 52,..."
2,I thought this was a wonderful way to spend ti...,positive,"[I, thought, this, was, a, wonderful, way, to,...","[10, 187, 11, 13, 3, 370, 96, 5, 1072, 60, 21,..."
3,Basically there's a family where a little boy ...,negative,"[Basically, there, a, family, where, a, little...","[641, 38, 3, 223, 111, 3, 118, 405, 3192, 1166..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[Petter, Mattei, Love, in, the, Time, of, Mone...","[60692, 10349, 112, 9, 1, 60, 4, 281, 6, 3, 20..."
...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,"[I, thought, this, movie, did, a, down, right,...","[10, 187, 11, 17, 70, 3, 180, 195, 49, 282, 8,..."
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"[Bad, plot, bad, dialogue, bad, acting, idioti...","[84, 114, 84, 392, 84, 113, 2888, 926, 1, 605,..."
49997,I am a Catholic taught in parochial elementary...,negative,"[I, am, a, Catholic, taught, in, parochial, el...","[10, 222, 3, 3380, 4195, 9, 36866, 8061, 5260,..."
49998,I'm going to have to disagree with the previou...,negative,"[I, going, to, have, to, disagree, with, the, ...","[10, 159, 5, 26, 5, 2930, 15, 1, 858, 875, 2, ..."


## 5. Encoding the sentiment

In this step I encode the sentiments. Because there are only two sentiments, the labels are 0 and 1.
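
Which sentiment becomes 0 and which becomes 1 follows from the sorted order of the class names, which is how `LabelEncoder` assigns its integers. The pure-Python sketch below reproduces that rule on a few made-up labels; the variable names are assumptions for the example.

```python
# Made-up labels standing in for the sentiment column.
labels = ["positive", "positive", "negative", "positive"]

# Classes are sorted, so 'negative' comes before 'positive'.
classes = sorted(set(labels))
to_id = {name: i for i, name in enumerate(classes)}

encoded = [to_id[name] for name in labels]
print(to_id, encoded)
```

This matches the labels used in this notebook: 0 for negative and 1 for positive.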

In [71]:
from sklearn.preprocessing import LabelEncoder

# Creating instance of labelencoder
labelencoder = LabelEncoder()

# Assigning numerical values and storing in another column
df['Label Encode'] = labelencoder.fit_transform(df['sentiment'])

df

Unnamed: 0,review,sentiment,tokenized,sequences,Label Encode
0,One of the other reviewers has mentioned that ...,positive,"[One, of, the, other, reviewers, has, mentione...","[28, 4, 1, 79, 1940, 45, 1025, 12, 99, 142, 40...",1
1,A wonderful little production. <br /><br />The...,positive,"[A, wonderful, little, production, br, br, The...","[3, 370, 118, 351, 7, 7, 1, 1321, 2928, 6, 52,...",1
2,I thought this was a wonderful way to spend ti...,positive,"[I, thought, this, was, a, wonderful, way, to,...","[10, 187, 11, 13, 3, 370, 96, 5, 1072, 60, 21,...",1
3,Basically there's a family where a little boy ...,negative,"[Basically, there, a, family, where, a, little...","[641, 38, 3, 223, 111, 3, 118, 405, 3192, 1166...",0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[Petter, Mattei, Love, in, the, Time, of, Mone...","[60692, 10349, 112, 9, 1, 60, 4, 281, 6, 3, 20...",1
...,...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,"[I, thought, this, movie, did, a, down, right,...","[10, 187, 11, 17, 70, 3, 180, 195, 49, 282, 8,...",1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"[Bad, plot, bad, dialogue, bad, acting, idioti...","[84, 114, 84, 392, 84, 113, 2888, 926, 1, 605,...",0
49997,I am a Catholic taught in parochial elementary...,negative,"[I, am, a, Catholic, taught, in, parochial, el...","[10, 222, 3, 3380, 4195, 9, 36866, 8061, 5260,...",0
49998,I'm going to have to disagree with the previou...,negative,"[I, going, to, have, to, disagree, with, the, ...","[10, 159, 5, 26, 5, 2930, 15, 1, 858, 875, 2, ...",0


In [72]:
tokenized_df = df[['Label Encode', 'sequences']].copy()
tokenized_df

Unnamed: 0,Label Encode,sequences
0,1,"[28, 4, 1, 79, 1940, 45, 1025, 12, 99, 142, 40..."
1,1,"[3, 370, 118, 351, 7, 7, 1, 1321, 2928, 6, 52,..."
2,1,"[10, 187, 11, 13, 3, 370, 96, 5, 1072, 60, 21,..."
3,0,"[641, 38, 3, 223, 111, 3, 118, 405, 3192, 1166..."
4,1,"[60692, 10349, 112, 9, 1, 60, 4, 281, 6, 3, 20..."
...,...,...
49995,1,"[10, 187, 11, 17, 70, 3, 180, 195, 49, 282, 8,..."
49996,0,"[84, 114, 84, 392, 84, 113, 2888, 926, 1, 605,..."
49997,0,"[10, 222, 3, 3380, 4195, 9, 36866, 8061, 5260,..."
49998,0,"[10, 159, 5, 26, 5, 2930, 15, 1, 858, 875, 2, ..."


## 6. Splitting the data into training and test sets

Now I'll split the data into training and test sets.

In [73]:
from sklearn.model_selection import train_test_split

# Features first, labels second; the split itself is the same for both arrays
train_x, test_x, train_y, test_y = train_test_split(
    np.array(input_ids), tokenized_df['Label Encode'], test_size=0.2, random_state=25
)

## 7. Defining the model

Now I'm going to build the model. The labels are as follows:
- 0. Negative
- 1. Positive

In [74]:
# Create DataLoaders from the train and test data.
def create_set(x, y):
    dataset = TensorDataset(
        torch.tensor(x), torch.tensor(y.values, dtype=torch.long)
    )
    sampler = RandomSampler(dataset)
    dataloader = DataLoader(
        dataset, sampler=sampler, batch_size=32
    )
    return dataloader

train_dataloader = create_set(train_x, train_y)
test_dataloader = create_set(test_x, test_y)

print(train_x)

[[  157  2874  4510 ...    50    11    19]
 [   10    76   798 ...   919    12    10]
 [   89     4    30 ...     4   521    47]
 ...
 [ 5364     1   415 ...   571   607     1]
 [   10   187    11 ... 10368    21 78608]
 [   10   196    11 ...   343     2     8]]


In [75]:
from torch import nn

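# Text classifier: averaged word embeddings followed by a linear layer
class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        # sparse=True makes the embedding gradients sparse tensors. The
        # clip_grad_norm_ call in train() does not seem to support those, which
        # appears to be what the NotImplementedError at the end reports.
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    # With a 2D input of shape (batch, sequence length), EmbeddingBag treats
    # each row as one bag, so no `offsets` argument is needed.
    # Defining forward() instead of overriding __call__ keeps nn.Module hooks working.
    def forward(self, text):
        embedded = self.embedding(text)
        return self.fc(embedded)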

Next I count all the unique words, because the model needs the vocabulary size later. The cell below is left commented out.

In [76]:
# #Counting all unique words.
# all_words = sum(df['tokenized'], [])
# unique_words = list(set(all_words))

# # Counting the unique words
# num_unique_words = len(unique_words)
# num_words = len(all_words)

# print("The number of all words together:", num_words)
# print("The number of all unique words:", num_unique_words)

- num_class = the number of unique labels
- vocab_size = the number of unique words in the dataframe (review)
- emsize = the dimension of the word embeddings
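
Instead of hard-coding the vocabulary size, it can be derived from the encoded data. The embedding table needs one row per index that can occur, including index 0 for padding, so the size has to be the largest index plus one. The sketch below shows this on made-up sequences; `vocab_size_from_sequences` is a hypothetical helper. With the Keras tokenizer from section 4, `len(t.word_index) + 1` should give a value that covers every index.

```python
def vocab_size_from_sequences(sequences):
    """Smallest embedding table size that covers every index in the sequences."""
    largest_index = max(max(seq) for seq in sequences if seq)
    return largest_index + 1  # +1 because indices start at 0 (the padding value)

# Made-up padded sequences standing in for the real input_ids.
toy_input_ids = [[28, 4, 1, 0], [3, 370, 0, 0]]

toy_vocab_size = vocab_size_from_sequences(toy_input_ids)
print(toy_vocab_size)
```

An index equal to or larger than the table size would raise an out-of-range error inside the embedding layer, so erring on the large side is the safer choice.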

In [77]:
# Number of unique labels in the 'Label Encode' column
num_class = len(tokenized_df['Label Encode'].unique())
# Vocabulary size: the number of rows in the embedding table
vocab_size = 124048
# Embedding size: each word in the input is
# represented by a 64-dimensional vector
emsize = 64
# Initialize a new text classification model
model = TextClassificationModel(vocab_size, emsize, num_class)

In [78]:
def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (text, label) in enumerate(dataloader):
        optimizer.zero_grad()
        # Call the model without the `offsets` argument   
        predicted_label = model(text)

        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()
            

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (text, label) in enumerate(dataloader):
            # Call the model without the `offsets` argument
            predicted_label = model(text)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count


In [79]:
import matplotlib.pyplot as plt

# Hyperparameters
EPOCHS = 10  # number of epochs per learning rate
LR = [0.01, 1, 2]  # learning rates to compare
BATCH_SIZE = 43  # not used below: the DataLoaders were created with batch_size=32

criterion = torch.nn.CrossEntropyLoss()

accuracy_lr = []



# Train once per learning rate and collect the accuracy values at each epoch.
# Note: the same model instance is reused, so each learning rate continues
# from the weights left by the previous one.
for lr in LR:
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
    total_accu = None

    train_accuracies = []
    val_accuracies = []

    for epoch in range(1, EPOCHS + 1):
        # Train the model and collect the training accuracy
        epoch_start_time = time.time()
        train(train_dataloader)
        train_acc = evaluate(train_dataloader)
        train_accuracies.append(train_acc)

        # Evaluate the model and collect the validation accuracy
        val_acc = evaluate(test_dataloader)
        val_accuracies.append(val_acc)

        # Lower the learning rate if the validation accuracy dropped
        if total_accu is not None and total_accu > val_acc:
            scheduler.step()
        else:
            total_accu = val_acc

        print('-' * 59)
        print('| end of epoch {:3d} | time: {:5.2f}s | '
              'train accuracy {:8.3f} | val accuracy {:8.3f}'.format(
                  epoch, time.time() - epoch_start_time, train_acc, val_acc))
        print('-' * 59)

    # Keep the final validation accuracy for this learning rate
    accuracy_lr.append(val_acc)

print(accuracy_lr)

# Plot the training and validation accuracies
# (the lists are reset per learning rate, so this shows the last one in LR)
plt.plot(train_accuracies, label='Training accuracy')
plt.plot(val_accuracies, label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

NotImplementedError: Could not run 'aten::_foreach_norm.Scalar' with arguments from the 'SparseCPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_foreach_norm.Scalar' is only available for these backends: [CPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

CPU: registered at aten\src\ATen\RegisterCPU.cpp:31034 [kernel]
BackendSelect: fallthrough registered at ..\aten\src\ATen\core\BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ..\aten\src\ATen\core\PythonFallbackKernel.cpp:144 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ..\aten\src\ATen\functorch\DynamicLayer.cpp:491 [backend fallback]
Functionalize: registered at ..\aten\src\ATen\FunctionalizeFallbackKernel.cpp:280 [backend fallback]
Named: registered at ..\aten\src\ATen\core\NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ..\aten\src\ATen\ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ..\aten\src\ATen\native\NegateFallback.cpp:19 [backend fallback]
ZeroTensor: registered at ..\aten\src\ATen\ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ..\aten\src\ATen\core\VariableFallbackKernel.cpp:63 [backend fallback]
AutogradOther: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradCPU: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradCUDA: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradHIP: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradXLA: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradMPS: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradIPU: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradXPU: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradHPU: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradVE: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradLazy: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradMeta: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradMTIA: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradPrivateUse1: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradPrivateUse2: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradPrivateUse3: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
AutogradNestedTensor: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:17472 [autograd kernel]
Tracer: registered at ..\torch\csrc\autograd\generated\TraceType_2.cpp:16726 [kernel]
AutocastCPU: fallthrough registered at ..\aten\src\ATen\autocast_mode.cpp:487 [backend fallback]
AutocastCUDA: fallthrough registered at ..\aten\src\ATen\autocast_mode.cpp:354 [backend fallback]
FuncTorchBatched: registered at ..\aten\src\ATen\functorch\LegacyBatchingRegistrations.cpp:815 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ..\aten\src\ATen\functorch\VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at ..\aten\src\ATen\LegacyBatchingRegistrations.cpp:1073 [backend fallback]
VmapMode: fallthrough registered at ..\aten\src\ATen\VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ..\aten\src\ATen\functorch\TensorWrapper.cpp:210 [backend fallback]
PythonTLSSnapshot: registered at ..\aten\src\ATen\core\PythonFallbackKernel.cpp:152 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ..\aten\src\ATen\functorch\DynamicLayer.cpp:487 [backend fallback]
PythonDispatcher: registered at ..\aten\src\ATen\core\PythonFallbackKernel.cpp:148 [backend fallback]
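
The error above appears to come from combining `sparse=True` in `nn.EmbeddingBag` with the `clip_grad_norm_` call in `train()`: the embedding then produces sparse gradients, and the norm computation used for clipping is not available for the SparseCPU backend. A possible fix, sketched below under the assumption that dense gradients are acceptable for this model, is to create the embedding with `sparse=False`. The layer sizes and the random batch are made up for the example.

```python
import torch
from torch import nn

# Dense gradients (sparse=False) so that clip_grad_norm_ can compute their norm.
embedding = nn.EmbeddingBag(num_embeddings=50, embedding_dim=8, sparse=False)
fc = nn.Linear(8, 2)
params = list(embedding.parameters()) + list(fc.parameters())

optimizer = torch.optim.SGD(params, lr=0.1)
criterion = nn.CrossEntropyLoss()

# Made-up batch: 4 reviews of 10 token ids each, with binary labels.
text = torch.randint(0, 50, (4, 10))
label = torch.tensor([0, 1, 1, 0])

optimizer.zero_grad()
logits = fc(embedding(text))  # each row of the 2D input is averaged as one bag
loss = criterion(logits, label)
loss.backward()

grad_is_sparse = embedding.weight.grad.is_sparse
total_norm = torch.nn.utils.clip_grad_norm_(params, 0.1)  # works on dense grads
optimizer.step()
print(grad_is_sparse, logits.shape)
```

Another option would be to keep `sparse=True` and remove the gradient clipping, but that changes the training procedure rather than only the gradient storage format.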
