Text Classification on BERT
Bidirectional Encoder Representations from Transformers (BERT)

In [1]:
import os
import torch

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'imdb-dataset-of-50k-movie-reviews' dataset.
Path to dataset files: /kaggle/input/imdb-dataset-of-50k-movie-reviews


In [3]:
import pandas as pd
data = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
print(data)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


In [4]:
texts = data['review'].tolist()
labels = [1 if label == 'positive' else 0 for label in data['sentiment'].tolist()]

In [5]:
from sklearn.model_selection import train_test_split

#Split the data into train and text (8:2)
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)

#Further splitting the training data into train and validation sets, leaving 60% for training and 20% for testing
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.25, random_state=42)

In [6]:
from transformers import BertTokenizer
from torch.utils.data import Dataset

#Define the name of the BERT Model
bert_model_name = 'bert-base-uncased'

max_lenght = 128
tokenizer = BertTokenizer.from_pretrained(bert_model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [7]:
class TextClassificationDataset(Dataset):
  def __init__(self, texts, labels, tokenizer, max_length):
    self.texts = texts
    self.labels = labels
    self.tokenizer = tokenizer
    self.max_length = max_length

  def __len__(self):
    return len(self.texts)

  def __getitem__(self, idx):
    text = self.texts[idx]
    label = self.labels[idx]

    #Tokenizer and preprocess the text sample
    encoding = self.tokenizer(text,
      truncation=True, # Truncate sequence if they exceed the max length.
      padding='max_length', # pad sequence to have same length.
      max_length=self.max_length, # Truncate or pad text to the specified max length.
      return_tensors='pt') #Return the poytorch tensors.

    return {
      'input_ids': encoding['input_ids'].flatten(),
      'attention_mask': encoding['attention_mask'].flatten(),
      'label': torch.tensor(label, dtype=torch.long)
    }

In [8]:
#Sample Tokenizer

tokenizer = BertTokenizer.from_pretrained(bert_model_name)
text = "This movies is one of the best production I have seen in years."

encoding = tokenizer(text, truncation=True, padding='max_length', max_length=128, return_tensors='pt')

print("Tokenizer arguments:", encoding.keys())
print("\n Tokenizer Result size:", encoding['input_ids'].shape)
print("\n Input ids:", encoding['input_ids'])
print("\n Token type ids:", encoding['token_type_ids'])
print("\n Attention mask:", encoding['attention_mask'])

Tokenizer arguments: KeysView({'input_ids': tensor([[ 101, 2023, 5691, 2003, 2028, 1997, 1996, 2190, 2537, 1045, 2031, 2464,
         1999, 2086, 1012,  102,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [9]:
input_ids = encoding['input_ids'][0]
tokens = tokenizer.convert_ids_to_tokens(input_ids)

for token, input_id in zip(tokens, input_ids):
  print(f"Token: {token}\t Input ID: {input_id}")
  if input_id == 0:
    break

Token: [CLS]	 Input ID: 101
Token: this	 Input ID: 2023
Token: movies	 Input ID: 5691
Token: is	 Input ID: 2003
Token: one	 Input ID: 2028
Token: of	 Input ID: 1997
Token: the	 Input ID: 1996
Token: best	 Input ID: 2190
Token: production	 Input ID: 2537
Token: i	 Input ID: 1045
Token: have	 Input ID: 2031
Token: seen	 Input ID: 2464
Token: in	 Input ID: 1999
Token: years	 Input ID: 2086
Token: .	 Input ID: 1012
Token: [SEP]	 Input ID: 102
Token: [PAD]	 Input ID: 0


In [10]:
train_dataset = TextClassificationDataset(train_texts, train_labels, tokenizer, max_lenght)
val_dataset = TextClassificationDataset(val_texts, val_labels, tokenizer, max_lenght)

In [11]:
from torch.utils.data import DataLoader
batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

In [12]:
from torch import nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_classes, dropout=0.5):
        super(BertClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # hidden_state = outputs.last_hidden_state # This line is not used and can be removed
        pooled_output = outputs.pooler_output
        x = self.dropout(pooled_output)
        logits = torch.sigmoid(self.fc(x)) # Apply dropout to pooled_output before the linear layer
        return logits # No need to return hidden_state if not used later

In [13]:
def train(model, data_loader, optimizer, scheduler, devices):
    model.train()
    for batch in data_loader:
        optimizer.zero_grad()

        input_ids = batch['input_ids'].to(devices)
        attention_mask = batch['attention_mask'].to(devices)
        labels = batch['label'].to(devices)

        outputs = model(input_ids, attention_mask=attention_mask) # Removed labels argument and _ placeholder
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()

In [14]:
from sklearn.metrics import accuracy_score, classification_report

def evaluate(model, data_loader, devices):
    model.eval()
    predictions = []
    true_labels = []

    with torch.no_grad():
        for batch in data_loader:
          input_ids = batch['input_ids'].to(devices)
          attention_mask = batch['attention_mask'].to(devices)
          labels = batch['label'].to(devices)


          outputs = model(input_ids, attention_mask=attention_mask) # Removed _ placeholder
          _, preds = torch.max(outputs, dim=1)

          predictions.extend(preds.cpu().tolist())
          true_labels.extend(labels.cpu().tolist())
    return accuracy_score(true_labels, predictions), classification_report(true_labels, predictions)

In [15]:
num_classes = 2
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BertClassifier(num_classes=num_classes).to(device)
model

BertClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

In [16]:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
num_epochs = 15
optimizer = AdamW(model.parameters(), lr=2e-5)
train_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=train_steps)

In [None]:
accuracy_array = []
for epoch in range(num_epochs):
    train(model, train_loader, optimizer, scheduler, device)
    accuracy, report = evaluate(model, val_loader, device)
    accuracy_array.append(accuracy)

In [18]:
torch.save(model.state_dict(), 'bert_classifier.pth')

In [19]:
load_model = BertClassifier(num_classes).to(device)
load_model.load_state_dict(torch.load('bert_classifier.pth'))

<All keys matched successfully>

In [20]:
test_dataset = TextClassificationDataset(test_texts, test_labels, tokenizer, max_lenght)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
accuracy, report = evaluate(load_model, test_loader, device)
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Test Report:\n{report}")

Test Accuracy: 0.8889
Test Report:
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      4961
           1       0.88      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



In [21]:
def predict(model, text, tokenizer, device, max_length = 128):
    model.eval()
    encoding = tokenizer(text, truncation=True, padding='max_length', max_length=128, return_tensors='pt')
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    with torch.no_grad():
        outputs = model(input_ids = input_ids, attention_mask=attention_mask) # Changed input_id to input_ids
        _, predicted = torch.max(outputs, dim=1)

    print(f"Text: {text}, \n Predicted Sentiment: {predicted.item()}\n") # Moved print and added .item()

    if predicted.item() == 1: # Changed .items() to .item()
        return 'Positive'
    else:
        return 'Negative'

In [22]:
import random
for i in range(5):
  text = random.choice(test_texts)
  print(predict(load_model, text, tokenizer, device))

Text: Such a highly-anticipated remake of a cherished musical classic and such a bitter pill it was to have to take. Very, very hard to swallow...all of it. It didn't have an ounce of believability anywhere. And when you don't have a Rose, you don't have a show.<br /><br />Bette Midler seemed born to play this part. Yet, all she was able to produce was a cute, funny, glitzy, trademark Bette Midler...weighed down with all the familiar Midlerisms. Roz Russell has nothing to worry about. She can rest in her grave knowing she is still the definitive Mama Rose (of film, anyway).<br /><br />I thought Midler was really going to put it across this time...to throw herself into what is one of the greatest musical roles of all time...like she did in "The Rose." But, no, she played it safe. She played herself. She made Rose a total dinner-theatre cartoon. Even her songs were uninspired. It was maddening to watch, knowing Midler has the talent to rise above her money-making schtick. She showed prom

In [23]:
text = "I am sure this is a masterpiece from the directors"
print(predict(load_model, text, tokenizer, device))

Text: I am sure this is a masterpiece from the directors, 
 Predicted Sentiment: 1

Positive
