# Data Exploration

Let's use the `.info()`, `.describe()`, and `.head()` methods to learn more about the dataset.

In [1]:
import pandas as pd

df = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [2]:
df.describe()

                                                   review  sentiment
count                                               50000      50000
unique                                              49582          2
top     Loved today's show!!! It was a variety and not...   positive
freq                                                    5      25000


In [3]:
df.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...   positive
1  A wonderful little production. <br /><br />The...   positive
2  I thought this was a wonderful way to spend ti...   positive
3  Basically there's a family where a little boy ...   negative
4  Petter Mattei's "Love in the Time of Money" is...   positive


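We can see that the dataset contains one feature column, `review`, along with the binary target variable that we'd like to predict, `sentiment` (positive/negative). There are no missing values (phew!). Note that only 49,582 of the 50,000 reviews are unique, so the dataset contains some duplicates.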

In [4]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

The dataset is perfectly balanced: it contains the same number of positive and negative samples, 25,000 each.

In [5]:
df['review'].head(20)

0     One of the other reviewers has mentioned that ...
1     A wonderful little production. <br /><br />The...
2     I thought this was a wonderful way to spend ti...
3     Basically there's a family where a little boy ...
4     Petter Mattei's "Love in the Time of Money" is...
5     Probably my all-time favorite movie, a story o...
6     I sure would like to see a resurrection of a u...
7     This show was an amazing, fresh & innovative i...
8     Encouraged by the positive comments about this...
9     If you like original gut wrenching laughter yo...
10    Phil the Alien is one of those quirky films wh...
11    I saw this movie when I was about 12 when it c...
12    So im not a big fan of Boll's work but then ag...
13    The cast played Shakespeare.<br /><br />Shakes...
14    This a fantastic movie of three prisoners who ...
15    Kind of drawn in by the erotic scenes, only to...
16    Some films just simply should not be remade. T...
17    This movie made it into one of my top 10 m

A quick look at the first 20 rows of reviews reveals that the dataset contains some HTML elements like the `<br />` line break element.

## Data Cleaning

In [6]:
df['review'].str.contains(r'<.*?>').sum()

29202

There are 29,202 reviews containing HTML elements! Let's remove them.

In [7]:
import re

html = re.compile(r'<.*?>')
df['review'] = df['review'].str.replace(html, '', regex=True)
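To see what the pattern does on a single string, here is a minimal sketch using only the standard library. The sample string is made up (modeled on row 13 above), and the "replace with a space" variant is an alternative worth considering, not what the cell above does: deleting a tag outright can glue the words on either side of it together.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")  # non-greedy, so each tag is matched separately

# Made-up sample string, modeled on row 13 of the dataset
sample = "The cast played Shakespeare.<br /><br />Shakespeare lost."

# Same substitution as in the cell above: tags are deleted outright
glued = TAG_PATTERN.sub("", sample)

# Alternative: replace each tag with a space, then collapse repeated whitespace
spaced = re.sub(r"\s+", " ", TAG_PATTERN.sub(" ", sample)).strip()

print(glued)
print(spaced)
```

With the first variant the sentence boundary disappears (`Shakespeare.Shakespeare`), which a word tokenizer usually still splits correctly, but the second variant is safer when a tag sits between two letters.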

Let's also lowercase the text.

In [8]:
df['review'] = df['review'].str.lower()

## Data Visualization

Let's visualize the dataset using a word cloud.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords



In [10]:
# Build the stop word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

class CustomCountVectorizer(CountVectorizer):
    def build_tokenizer(self):
        tokenizer = super().build_tokenizer()
        # We will ignore stop words
        return lambda doc: [token for token in tokenizer(doc) if token not in stop_words]

tokenizer = CustomCountVectorizer().build_tokenizer()
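A word cloud sizes each word by how often it occurs, so the data behind it is just a token count. The sketch below shows that counting step on two made-up reviews. The stop word list here is a tiny hypothetical stand-in for NLTK's English list, and the regular expression is the default `token_pattern` of `CountVectorizer`.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "was", "and", "it", "is", "this"}

# Default token pattern of CountVectorizer: words of two or more characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Made-up, already lowercased reviews
reviews = [
    "this movie was wonderful and the cast was wonderful",
    "the plot is boring and the movie is long",
]

# Count every token that is not a stop word
counts = Counter(
    token
    for review in reviews
    for token in TOKEN_PATTERN.findall(review)
    if token not in STOP_WORDS
)

top_words = counts.most_common(2)
print(top_words)
```

The resulting `counts` mapping is the kind of input a word cloud library would typically take as word frequencies.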

# Modeling

Substitute the values in the target column with 0/1.

In [11]:
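# Assign the result back instead of using `inplace=True` on a column selection,
# which may operate on a temporary copy in recent pandas versions
df['sentiment'] = df['sentiment'].map({"positive": 1, "negative": 0})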

Split the dataset into a training (80%) and a test (20%) set with stratification.

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, stratify=df['sentiment'], random_state=1337)
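To illustrate what stratification buys us, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper: it simply shuffles the indices of each class separately and cuts off the same fraction from each, so both subsets keep the original class proportions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=1337):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 50 positive and 50 negative samples
labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

On the real dataset the same idea yields 20,000 reviews per class for training and 5,000 per class for testing, which matches the `support` column in the classification reports.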

We will use a TF-IDF (term frequency–inverse document frequency) vectorizer because it down-weights tokens that appear in most reviews and therefore carry little predictive power, such as *movie*, *film*, *story*, *one*, ...

And we will also remove stop words for the same reason.
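The down-weighting comes from the inverse document frequency term. As a small worked example, the sketch below computes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn uses by default, on three made-up reviews. The corpus and the helper name `smoothed_idf` are assumptions for illustration only.

```python
import math

# Made-up mini corpus; "movie" occurs in every document
docs = [
    "great movie with a brilliant cast",
    "boring movie and a weak plot",
    "the movie was fine",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

idf_movie = smoothed_idf("movie")          # appears in all 3 documents
idf_brilliant = smoothed_idf("brilliant")  # appears in 1 document

print(f"movie: {idf_movie:.2f}, brilliant: {idf_brilliant:.2f}")
```

A token that occurs in every document gets the minimum weight of 1, while rarer tokens get a larger multiplier on their term frequency.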

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert stop words to a set to speed up the stop word lookup.
# A distinct name is used so that the imported `stopwords` module is not shadowed.
stop_words = set(stopwords.words('english'))

class CustomTfidfVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenizer = super().build_tokenizer()
        return lambda doc: [token for token in tokenizer(doc) if token not in stop_words] # Remove stop words

vectorizer = CustomTfidfVectorizer()

In [14]:
X_train_vectorized = vectorizer.fit_transform(X_train)

Since this is a binary classification problem, it makes sense to use a logistic regression model.

In [20]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(n_jobs=-1)
clf.fit(X_train_vectorized, y_train)
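For intuition, logistic regression scores a review as a weighted sum of its features and squashes that score into a probability with the sigmoid function. The sketch below shows the computation for one review. All feature values and coefficients are hypothetical and are not taken from the fitted `clf`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical TF-IDF values for one review and hypothetical learned coefficients
features = {"wonderful": 0.6, "boring": 0.0, "plot": 0.3}
coefficients = {"wonderful": 2.5, "boring": -3.0, "plot": -0.1}
intercept = 0.0

# Linear score: intercept plus the sum of coefficient * feature value
z = intercept + sum(coefficients[term] * value for term, value in features.items())

p_positive = sigmoid(z)
predicted = int(p_positive >= 0.5)  # 1 = positive, 0 = negative
print(f"P(positive) = {p_positive:.3f}, predicted label = {predicted}")
```

Because the model is linear in the TF-IDF features, each coefficient can be read as how strongly a token pushes a review towards the positive or negative class.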



In [21]:
# Calculate predictions on the test set
X_test_vectorized = vectorizer.transform(X_test)
y_pred = clf.predict(X_test_vectorized)

In [22]:
from sklearn.metrics import accuracy_score, classification_report

accuracy_score(y_test, y_pred)

0.8995

In [23]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.88      0.90      5000
           1       0.89      0.92      0.90      5000

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000
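As a reminder of how the columns in this report are defined, here is a minimal sketch that computes precision, recall and F1 for the positive class by hand. The labels and predictions are a toy example, not the notebook's data.

```python
# Toy ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = sum(t == p for t, p in pairs) / len(pairs)

print(precision, recall, f1, accuracy)
```

In the report above, class 0 has slightly higher precision and class 1 slightly higher recall, which suggests the model leans a little towards predicting "positive".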



90% accuracy and F1 score! That's pretty good, but maybe we can improve on it. Let's try fine-tuning a pretrained DistilBERT model.

In [15]:
import numpy as np
import random
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

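# `transformers.AdamW` is deprecated; the PyTorch implementation is the recommended
# replacement (note that its default weight decay is 0.01 rather than 0)
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, DistilBertTokenizer, get_linear_schedule_with_warmup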

In [16]:
# Use GPU when available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

Tokenize the dataset.

In [17]:
# Load the DistilBert tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)

def tokenize_dataset(reviews):
    input_ids = []
    attention_masks = []

    for review in reviews:
        # `encode_plus` will:
        #   (1) Tokenize the sentence.
        #   (2) Prepend the `[CLS]` token to the start.
        #   (3) Append the `[SEP]` token to the end.
        #   (4) Map tokens to their IDs.
        #   (5) Pad or truncate the sentence to `max_length`
        #   (6) Create attention masks for [PAD] tokens.
        encoded_dict = tokenizer.encode_plus(
                            review,                        # Sentence to encode
                            add_special_tokens = True,     # Add '[CLS]' and '[SEP]'
                            max_length = 512,              # Pad & truncate all sentences
                            padding = 'max_length',        # Replaces the deprecated `pad_to_max_length=True`
                            truncation = True,             # Truncate explicitly to `max_length`
                            return_attention_mask = True,  # Construct attn. masks
                            return_tensors = 'pt',         # Return pytorch tensors
                       )

        # Add the encoded sentence to the list
        input_ids.append(encoded_dict['input_ids'])

        # And its attention mask (simply differentiates padding from non-padding)
        attention_masks.append(encoded_dict['attention_mask'])

    return input_ids, attention_masks

input_ids_train, attention_masks_train = tokenize_dataset(X_train)
input_ids_test, attention_masks_test = tokenize_dataset(X_test)

# Convert the lists into tensors
input_ids_train = torch.cat(input_ids_train, dim=0)
attention_masks_train = torch.cat(attention_masks_train, dim=0)
labels_train = torch.tensor(y_train.values)

input_ids_test = torch.cat(input_ids_test, dim=0)
attention_masks_test = torch.cat(attention_masks_test, dim=0)
labels_test = torch.tensor(y_test.values)

# Combine the training inputs into a TensorDataset
train_dataset = TensorDataset(input_ids_train, attention_masks_train, labels_train)
test_dataset = TensorDataset(input_ids_test, attention_masks_test, labels_test)
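To make steps (2) to (6) from the comment in `tokenize_dataset` concrete, here is a minimal pure-Python sketch of adding special tokens, truncating, padding and building the attention mask. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation. The special token IDs are the ones used by the uncased BERT vocabulary, while the word IDs are made up.

```python
# Special token IDs in the uncased BERT vocabulary
CLS, SEP, PAD = 101, 102, 0

def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Leave room for [CLS] and [SEP]; anything beyond that is cut off
    body = token_ids[: max_length - 2]
    input_ids = [CLS] + body + [SEP]

    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(input_ids)

    n_padding = max_length - len(input_ids)
    return input_ids + [PAD] * n_padding, attention_mask + [0] * n_padding

# Made-up word IDs for a short and a long "review"
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

One consequence of this scheme is that reviews longer than 512 tokens lose their tail, so any sentiment expressed only at the very end of a long review is invisible to the model.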


Create the DataLoaders for our training and test sets.

In [18]:
batch_size = 32

# We'll take training samples in random order
train_dataloader = DataLoader(
            train_dataset,
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size
        )

# For testing the order doesn't matter, so we'll just read them sequentially
test_dataloader = DataLoader(
            test_dataset,
            sampler = SequentialSampler(test_dataset), # Pull out batches sequentially
            batch_size = batch_size
        )

Define the model.

In [19]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels = 2,
    output_attentions = False,
    output_hidden_states = False
)

model = model.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Define the optimizer.

In [20]:
optimizer = AdamW(model.parameters(),
                  lr = 2e-5,
                  eps = 1e-8
                )



Define the learning rate scheduler.

In [21]:
epochs = 3

# Total number of training steps is [number of batches] x [number of epochs]
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)
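With zero warmup steps, this schedule decays the learning rate linearly from its initial value down to zero over the course of training. The sketch below reproduces that multiplier in plain Python as an illustration of the documented behavior. `linear_schedule_multiplier` is a hypothetical helper, not the library's code, and the 1250 batches per epoch come from 40,000 training reviews at a batch size of 32.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 1250 * 3  # batches per epoch x epochs

lr_start = base_lr * linear_schedule_multiplier(0, 0, total_steps)
lr_middle = base_lr * linear_schedule_multiplier(total_steps // 2, 0, total_steps)
lr_end = base_lr * linear_schedule_multiplier(total_steps, 0, total_steps)

print(lr_start, lr_middle, lr_end)
```

The learning rate is therefore halved by the middle of the second epoch and reaches zero on the last step, which is why `scheduler.step()` has to be called once per batch rather than once per epoch.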

Define our metric.

In [22]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
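One subtlety: the training loop below averages these per-batch accuracies, which weights a short final batch the same as a full one. With 10,000 test reviews and a batch size of 32, the last of the 313 batches holds only 16 samples, so the averaged figure can differ slightly from the accuracy over all samples. This may be why the last epoch reports 0.942 while `accuracy_score` gives 0.9415. The toy sketch below, using made-up predictions, shows the effect.

```python
# Each batch is (predictions, labels); the numbers are made up
batches = [
    ([1, 0, 1, 1], [1, 0, 1, 1]),  # full batch: 4 of 4 correct
    ([0, 1], [0, 0]),              # short final batch: 1 of 2 correct
]

# Accuracy computed per batch, then averaged (what the training loop does)
batch_accuracies = [
    sum(p == t for p, t in zip(preds, labels)) / len(labels)
    for preds, labels in batches
]
mean_of_batches = sum(batch_accuracies) / len(batch_accuracies)

# Accuracy computed over all samples at once
n_correct = sum(p == t for preds, labels in batches for p, t in zip(preds, labels))
n_total = sum(len(labels) for _, labels in batches)
overall = n_correct / n_total

print(mean_of_batches, overall)
```

Accumulating the number of correct predictions and dividing by the total sample count at the end avoids the discrepancy.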

Training loop.

In [23]:
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

for epoch in range(epochs):
    # Training
    print('======== Epoch {:} / {:} ========'.format(epoch + 1, epochs))
    total_train_loss = 0
    model.train()

    for batch in tqdm(train_dataloader, desc="Training"):
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        optimizer.zero_grad()
        output = model(b_input_ids,
                         attention_mask=b_input_mask, 
                         labels=b_labels)        
        loss = output.loss
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients
        loss.backward()

        # Clip the norm of the gradients to 1.0 to help prevent the "exploding gradients" problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient
        optimizer.step()

        # Update the learning rate
        scheduler.step()

    # Calculate the average loss over all of the batches
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))

    # Evaluation
    print("")
    # Put the model in evaluation mode
    model.eval()

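    total_test_accuracy = 0
    # Initialise the best score only once; resetting it every epoch would
    # make the "save the best model" check below always succeed
    if epoch == 0:
        best_test_accuracy = 0
    total_test_loss = 0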

    for batch in tqdm(test_dataloader, desc="Evaluating"):
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Constructing the compute graph is only needed for backprop (training)
        with torch.no_grad():        
            output = model(b_input_ids,
                           attention_mask=b_input_mask,
                           labels=b_labels)
        loss = output.loss
        total_test_loss += loss.item()

        # Move logits and labels to CPU if we are using GPU
        logits = output.logits
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of test reviews, and
        # accumulate it over all batches
        total_test_accuracy += flat_accuracy(logits, label_ids)

    # Report the final accuracy for this validation run
    avg_test_accuracy = total_test_accuracy / len(test_dataloader)
    print("  Accuracy: {0:.3f}".format(avg_test_accuracy))
    print("")

    # Calculate the average loss over all of the batches
    avg_val_loss = total_test_loss / len(test_dataloader)

    # Save the best model
    if avg_test_accuracy > best_test_accuracy:
        torch.save(model, 'bert_model.pt')
        best_test_accuracy = avg_test_accuracy

print("")
print("Training complete!")



Training: 100%|██████████| 1250/1250 [17:21<00:00,  1.20it/s]
  Average training loss: 0.24
Evaluating: 100%|██████████| 313/313 [01:28<00:00,  3.54it/s]
  Accuracy: 0.935

Training: 100%|██████████| 1250/1250 [17:22<00:00,  1.20it/s]
  Average training loss: 0.13
Evaluating: 100%|██████████| 313/313 [01:28<00:00,  3.54it/s]
  Accuracy: 0.941

Training: 100%|██████████| 1250/1250 [17:22<00:00,  1.20it/s]
  Average training loss: 0.08
Evaluating: 100%|██████████| 313/313 [01:28<00:00,  3.53it/s]
  Accuracy: 0.942

Training complete!


Let's calculate predictions on the entire test set.

In [24]:
predictions = []

for batch in test_dataloader:
    b_input_ids = batch[0].to(device)
    b_input_mask = batch[1].to(device)
    b_labels = batch[2].to(device)

    with torch.no_grad():        
        output = model(b_input_ids,
                       attention_mask=b_input_mask,
                       labels=b_labels)

    logits = output.logits
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    preds = np.argmax(logits, axis=1).flatten()
    predictions.append(preds)

predictions = np.concatenate(predictions)

In [28]:
from sklearn.metrics import accuracy_score, classification_report
accuracy_score(y_test, predictions)

0.9415

In [29]:
print(classification_report(y_test, predictions))


              precision    recall  f1-score   support

           0       0.94      0.94      0.94      5000
           1       0.94      0.94      0.94      5000

    accuracy                           0.94     10000
   macro avg       0.94      0.94      0.94     10000
weighted avg       0.94      0.94      0.94     10000

