# Training a DistilBERT model to recognize movie review sentiment

This was built following a tutorial in the wonderful book Machine Learning with PyTorch and Scikit-Learn by Raschka, Liu and Mirjalili (2022, Packt Publishing). In this notebook, we do the following:

1. Use the Hugging Face transformers library (https://huggingface.co/docs/transformers/index) to load the pre-trained model DistilBertForSequenceClassification and the tokenizer DistilBertTokenizerFast. DistilBERT was proposed here: https://arxiv.org/abs/1910.01108
2. Fine-tune the pre-trained model on the IMDb Large Movie Review Dataset: https://ai.stanford.edu/~amaas/data/sentiment/
3. Deploy the trained model as an interactive web app using the Gradio library (https://gradio.app/).

The Gradio app can be found on my Hugging Face page:

https://huggingface.co/spaces/etweedy/movie_review_2

Import libraries, select the GPU device if one is available, and set the random seed for reproducibility:

In [2]:
%%capture
! pip install torchtext==0.13.0
! pip install torchdata==0.4.0

In [15]:
import gzip
import shutil

import pandas as pd
import requests
import torch
import torch.nn.functional as F
import torchtext
from torch.utils.data import Dataset, DataLoader
import itertools

import transformers
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification

In [16]:
torch.backends.cudnn.deterministic = True
torch.manual_seed(123)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

NUM_EPOCHS=3

Download the gzipped CSV file and extract it.

In [17]:
url = ("https://github.com/rasbt/machine-learning-book/raw/main/ch08/movie_data.csv.gz")
filename = url.split("/")[-1]

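# Download the archive; raise on an HTTP error instead of saving an error page
r = requests.get(url)
r.raise_for_status()
with open(filename, 'wb') as f:
    f.write(r.content)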
    
with gzip.open('movie_data.csv.gz','rb') as f_in:
    with open('movie_data.csv','wb') as f_out:
        shutil.copyfileobj(f_in,f_out)
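
The extraction step can also be wrapped in a small reusable function. Here is a minimal sketch using only the standard library. The name `extract_gzip` is hypothetical (it is not part of the notebook), and the demonstration writes a tiny throwaway archive rather than touching the real dataset.

```python
import gzip
import os
import shutil
import tempfile


def extract_gzip(src_path, dst_path):
    """Decompress src_path (a .gz file) into dst_path, streaming in chunks."""
    with gzip.open(src_path, 'rb') as f_in, open(dst_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst_path


# Demonstration on a tiny throwaway archive (not the real dataset)
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, 'demo.csv.gz')
csv_path = os.path.join(tmp_dir, 'demo.csv')

with gzip.open(gz_path, 'wb') as f:
    f.write(b'review,sentiment\n"Great film",1\n')

extract_gzip(gz_path, csv_path)
with open(csv_path, 'rb') as f:
    demo_contents = f.read()
```

Because `shutil.copyfileobj` copies in chunks, the decompressed file never has to fit in memory all at once.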

Read the CSV into a DataFrame and split it sequentially into training (70% of samples), validation (10% of samples), and test (20% of samples) sets.

In [18]:
df = pd.read_csv('movie_data.csv')
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


In [19]:
train_texts = df.iloc[:35000]['review'].values
train_labels = df.iloc[:35000]['sentiment'].values
valid_texts = df.iloc[35000:40000]['review'].values
valid_labels = df.iloc[35000:40000]['sentiment'].values
test_texts = df.iloc[40000:]['review'].values
test_labels = df.iloc[40000:]['sentiment'].values
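
The hard-coded indices 35000 and 40000 follow from the 70/10/20 percentages. As a small sketch, they can be derived rather than typed by hand. The helper `split_bounds` is a hypothetical name of my own, and the figure of 50,000 rows is an assumption implied by the indices and percentages above rather than something checked here.

```python
def split_bounds(n_samples, train_pct=70, valid_pct=10):
    """Return (train_end, valid_end) row indices for a sequential split.

    Integer percentages avoid floating-point rounding surprises.
    """
    train_end = n_samples * train_pct // 100
    valid_end = n_samples * (train_pct + valid_pct) // 100
    return train_end, valid_end


# Assumption: the CSV holds 50,000 reviews
n_samples = 50_000
train_end, valid_end = split_bounds(n_samples)

n_train = train_end
n_valid = valid_end - train_end
n_test = n_samples - valid_end
print(n_train, n_valid, n_test)  # 35000 5000 10000
```

The bounds would then be used as `df.iloc[:train_end]`, `df.iloc[train_end:valid_end]`, and `df.iloc[valid_end:]`. A sequential split like this is only representative if the rows are not sorted by label, so it is worth checking the label counts in each split.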

Load in the tokenizer and encode the three data sets.

In [20]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_enc = tokenizer(list(train_texts),truncation=True,padding=True)
valid_enc = tokenizer(list(valid_texts),truncation=True,padding=True)
test_enc = tokenizer(list(test_texts),truncation=True,padding=True)
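
To see what `truncation=True` and `padding=True` do conceptually, here is a simplified pure-Python illustration. It is not the library's implementation: the function `pad_and_truncate` and the token ids are made up, and the real tokenizer handles special tokens more carefully when truncating (for example, it keeps the closing separator token). The padding id of 0 matches the `padding_idx=0` that DistilBERT's embedding layer reports.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


# Made-up token ids, purely for illustration
toy_batch = [[101, 7, 8, 102], [101, 7, 102], [101, 7, 8, 9, 10, 102]]
enc = pad_and_truncate(toy_batch, max_length=5)
print(enc['input_ids'])
print(enc['attention_mask'])
```

Every row ends up with the same length, which is what lets a batch be stacked into a single tensor, and the attention mask tells the model which positions are padding.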

Create a custom Dataset class to hold the encoded text and labels, then initialize datasets and dataloaders for the training, validation, and test sets.

In [21]:
class IMDbDataset(Dataset):
    def __init__(self,encodings,labels):
        self.encodings=encodings
        self.labels=labels
    
    def __getitem__(self,idx):
        item = {key:torch.tensor(val[idx]) for key,val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    
    def __len__(self):
        return len(self.labels)

In [22]:
train_ds = IMDbDataset(train_enc,train_labels)
valid_ds = IMDbDataset(valid_enc,valid_labels)
test_ds = IMDbDataset(test_enc,test_labels)

In [23]:
train_dl = DataLoader(train_ds,batch_size=16,shuffle=True)
# Accuracy does not depend on sample order, so evaluation sets need no shuffling
valid_dl = DataLoader(valid_ds,batch_size=16,shuffle=False)
test_dl = DataLoader(test_ds,batch_size=16,shuffle=False)
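
To make the shape of a batch concrete, here is a small sketch on made-up data. `ToyDataset` simply repeats the structure of the dataset class above so the example is self-contained; the encodings and labels are invented for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Same structure as the notebook's dataset class, repeated for self-containment."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# Made-up, already padded encodings for four "reviews"
toy_enc = {
    'input_ids': [[101, 7, 102, 0], [101, 8, 9, 102], [101, 5, 102, 0], [101, 6, 4, 102]],
    'attention_mask': [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]],
}
toy_labels = [1, 0, 0, 1]

toy_dl = DataLoader(ToyDataset(toy_enc, toy_labels), batch_size=2, shuffle=False)
first_batch = next(iter(toy_dl))
print(sorted(first_batch.keys()))
print(first_batch['input_ids'].shape)
print(first_batch['labels'])
```

The default collate function stacks the per-example dictionaries key by key, so each batch is a dictionary of tensors whose first dimension is the batch size.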

Load the pre-trained DistilBERT model and declare the optimizer.

In [24]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(DEVICE)
model.train()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [25]:
opt = torch.optim.Adam(model.parameters(),lr=5e-5)

This function computes accuracy, as a percentage of correct predictions, and is used during both training and evaluation.

In [26]:
def compute_accuracy(model,dl,device):
    with torch.no_grad():
        correct_pred,num_examples = 0,0

        for batch_idx,batch in enumerate(dl):
            # Collect the ids, mask, labels from dl and pass to device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            # Evaluate the model, collect output logits, and calculate predictions
            outputs = model(input_ids,attention_mask=attention_mask)
            logits = outputs['logits']
            pred_labels = torch.argmax(logits,1)
            
            num_examples += labels.size(0)
            correct_pred += (pred_labels == labels).sum()
            
        # Compute percentage of accurate predictions
        return correct_pred.float()/num_examples*100
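
The core of the accuracy computation can be checked by hand on a tiny example. The logits and labels below are made up; the steps mirror the ones inside the function above.

```python
import torch

# Made-up logits for four examples and two classes (columns: negative, positive)
logits = torch.tensor([[2.0, -1.0],
                       [0.1, 0.3],
                       [-0.5, 1.5],
                       [1.2, 0.4]])
labels = torch.tensor([0, 1, 0, 0])

pred_labels = torch.argmax(logits, 1)        # index of the larger logit in each row
correct = (pred_labels == labels).sum()      # number of matching predictions
accuracy = correct.float() / labels.size(0) * 100
print(f'{accuracy:.2f}')  # 75.00
```

Three of the four predictions match their labels, giving 75%.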

Fine-tune the model and print scoring results as we go.

In [14]:
for epoch in range(NUM_EPOCHS):
    model.train()
    
    for batch_idx, batch in enumerate(train_dl):
        # Collect the ids, mask, labels from dl and pass to device
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)
        
        # Forward pass of model
        outputs = model(input_ids,attention_mask=attention_mask,labels=labels)
        loss,logits = outputs['loss'],outputs['logits']
        
        # Backward pass of model
        opt.zero_grad()
        loss.backward()
        opt.step()
        
        # Print loss every 250 batches
        if batch_idx % 250 == 0:
            print(f'Epoch {epoch+1:04d} of {NUM_EPOCHS:04d}, batch {batch_idx:04d} of {len(train_dl):04d} === Loss: {loss:.4f}')
    model.eval()
    
    # Print training and validation accuracy after each epoch
    with torch.set_grad_enabled(False):
        print(f'Training accuracy: {compute_accuracy(model,train_dl,DEVICE):.2f} === Valid accuracy: {compute_accuracy(model,valid_dl,DEVICE):.2f}')

# Check accuracy of the fine-tuned model on the test set
print(f'Test accuracy: {compute_accuracy(model,test_dl,DEVICE):.2f}')

Epoch 0001 of 0003, batch 0000 of 2188 === Loss: 0.6800
Epoch 0001 of 0003, batch 0250 of 2188 === Loss: 0.2488
Epoch 0001 of 0003, batch 0500 of 2188 === Loss: 0.4501
Epoch 0001 of 0003, batch 0750 of 2188 === Loss: 0.1309
Epoch 0001 of 0003, batch 1000 of 2188 === Loss: 0.4273
Epoch 0001 of 0003, batch 1250 of 2188 === Loss: 0.3193
Epoch 0001 of 0003, batch 1500 of 2188 === Loss: 0.5093
Epoch 0001 of 0003, batch 1750 of 2188 === Loss: 0.4583
Epoch 0001 of 0003, batch 2000 of 2188 === Loss: 0.3154
Training accuracy: 96.62 === Valid accuracy: 92.54
Time elapsed: 18.81 min
Epoch 0002 of 0003, batch 0000 of 2188 === Loss: 0.1179
Epoch 0002 of 0003, batch 0250 of 2188 === Loss: 0.0136
Epoch 0002 of 0003, batch 0500 of 2188 === Loss: 0.1435
Epoch 0002 of 0003, batch 0750 of 2188 === Loss: 0.0454
Epoch 0002 of 0003, batch 1000 of 2188 === Loss: 0.0768
Epoch 0002 of 0003, batch 1250 of 2188 === Loss: 0.2802
Epoch 0002 of 0003, batch 1500 of 2188 === Loss: 0.0200
Epoch 0002 of 0003, batch 175
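
The batch count of 2188 in the log follows directly from the training-set size and the batch size. Here is a short worked check, using only the numbers already given in the notebook:

```python
import math

n_train = 35_000      # rows in the training split
batch_size = 16       # as passed to the training DataLoader
log_every = 250       # the loop prints when batch_idx % 250 == 0

n_batches = math.ceil(n_train / batch_size)
logged_batches = list(range(0, n_batches, log_every))
last_batch_size = n_train - (n_batches - 1) * batch_size

print(n_batches)        # 2188
print(logged_batches)   # [0, 250, 500, 750, 1000, 1250, 1500, 1750, 2000]
print(last_batch_size)  # 8
```

Since 35,000 is not a multiple of 16, the final batch of each epoch holds only 8 examples. The DataLoader keeps it because `drop_last` defaults to `False`.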

Create save directory and save our fine-tuned model for later use.

In [14]:
import os
output_dir = './model_save/'
os.makedirs(output_dir, exist_ok=True)

In [16]:
print("Saving model to %s" % output_dir)
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)


Saving model to ./model_save/


('./model_save/tokenizer_config.json',
 './model_save/special_tokens_map.json',
 './model_save/vocab.txt',
 './model_save/added_tokens.json',
 './model_save/tokenizer.json')

Finally, we implement a small Gradio web app that we can use to interact with our model. The app asks the user to enter a movie review in a text box and returns a prediction of 'Positive' or 'Negative' sentiment. The code below generates a locally hosted app; for a nice tutorial on deploying a web app on Hugging Face, see the following blog post: https://huggingface.co/blog/gradio-spaces

In [None]:
%%capture
! pip install gradio
import gradio as gr

Our app first needs a simple predict function.

In [27]:
def predict(text):
    # Tokenize (truncating to the model's maximum length) and move tensors to the model's device
    encoding = tokenizer(text,truncation=True,return_tensors='pt').to(DEVICE)
    model.eval()
    with torch.no_grad():
        outputs = model(encoding['input_ids'],attention_mask=encoding['attention_mask'])
    logits = outputs['logits']
    # Class index 1 is positive sentiment, 0 is negative
    pred_label = torch.argmax(logits,1)[0]
    return 'Positive' if pred_label == 1 else 'Negative'
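
If class confidences are wanted rather than a single label, the logits can be passed through a softmax first. To my understanding, Gradio's "label" output also accepts a dictionary mapping each label to a confidence. The sketch below is pure Python so it runs without the model; the function name `logits_to_confidences` and the example logits are made up, and the class order (index 0 negative, index 1 positive) is the one assumed by the predict function above.

```python
import math


def logits_to_confidences(logits, class_names=('Negative', 'Positive')):
    """Convert a list of raw logits into a {label: probability} dict via softmax."""
    # Subtract the maximum for numerical stability before exponentiating
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(class_names, exps)}


# Made-up logits for one review (index 0 = negative, index 1 = positive)
conf = logits_to_confidences([-1.0, 2.0])
print(conf)
```

With the real model, the input would be something like `logits[0].tolist()`, and the resulting dictionary could be returned from the predict function in place of the plain string.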

Then we initialize the Gradio app.

In [30]:
title = "Write a movie review"
description = "Enter a review for a movie you've seen.  This tool will try to guess whether your review is positive or negative."
gr.Interface(fn=predict, 
             inputs="text",
             outputs="label",
             title = title,
             description = description,
              ).launch()

IMPORTANT: You are using gradio version 3.12.0, however version 3.14.0 is available, please upgrade.
--------
Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.


