# BERT text classification on movie reviews dataset

This dataset is a combination of movie and hotel reviews. It can be used for text classification or sentiment analysis tasks, with the objective of predicting whether a review is positive or negative.

## Load and install necessary libraries

In [1]:
!pip install datasets
!pip install transformers
!pip install torch
!pip install evaluate

In [2]:
import warnings
warnings.filterwarnings('ignore')

from datasets import load_dataset
import logging
import torch
import transformers
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import evaluate
import numpy as np
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [3]:
!git clone https://github.com/bonniektran/BERT_Text_Classification.git

Cloning into 'BERT_Text_Classification'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (3/3), done.


In [4]:
import sys
sys.path.append('/content/MovieReviews_Text_Classification')

## Load data from Hugging Face

https://huggingface.co/datasets/arize-ai/movie_reviews_with_context_drift

*   The data is split into a `train` set with 9,916 rows and a `validation` set with 2,479 rows.

*   There are a total of 6 columns: `prediction_ts`, `age`, `gender`, `context`, `text`, and `label`.

In [5]:
reviews = load_dataset("arize-ai/movie_reviews_with_context_drift")

In [6]:
reviews

DatasetDict({
    train: Dataset({
        features: ['prediction_ts', 'age', 'gender', 'context', 'text', 'label'],
        num_rows: 9916
    })
    validation: Dataset({
        features: ['prediction_ts', 'age', 'gender', 'context', 'text', 'label'],
        num_rows: 2479
    })
})

## View the first row of the `train` data

In [7]:
reviews['train'][0]

{'prediction_ts': 1650092400.0,
 'age': 44,
 'gender': 'female',
 'context': 'movies',
 'text': "An interesting premise, and Billy Drago is always good as a dangerous nut-bag (side note: I'd love to see Drago, Stephen McHattie and Lance Hendrikson in a flick together; talk about raging cheekbones!). The soundtrack wasn't terrible, either.<br /><br />But the acting--even that of such professionals as Drago and Debbie Rochon--was terrible, the directing worse (perhaps contributory to the former), the dialog chimp-like, and the camera work, barely tolerable. Still, it was the SETS that got a big 10 on my oy-vey scale. I don't know where this was filmed, but were I to hazard a guess, it would be either an open-air museum, or one of those re-enactment villages, where everything is just a bit too well-kept to do more than suggest the real Old West. Okay, so it was shot on a college kid's budget. That said, I could have forgiven one or two of the aforementioned faults. But taken all together,

## Keep only the relevant columns: `text` and `label`

In [8]:
for dataset in reviews.keys():
  reviews[dataset] = reviews[dataset].remove_columns(column_names=[col for col in reviews[dataset].column_names if col not in ["text", "label"]])

reviews["train"][0]

{'text': "An interesting premise, and Billy Drago is always good as a dangerous nut-bag (side note: I'd love to see Drago, Stephen McHattie and Lance Hendrikson in a flick together; talk about raging cheekbones!). The soundtrack wasn't terrible, either.<br /><br />But the acting--even that of such professionals as Drago and Debbie Rochon--was terrible, the directing worse (perhaps contributory to the former), the dialog chimp-like, and the camera work, barely tolerable. Still, it was the SETS that got a big 10 on my oy-vey scale. I don't know where this was filmed, but were I to hazard a guess, it would be either an open-air museum, or one of those re-enactment villages, where everything is just a bit too well-kept to do more than suggest the real Old West. Okay, so it was shot on a college kid's budget. That said, I could have forgiven one or two of the aforementioned faults. But taken all together, and being generous, I could not see giving it more than three stars.",
 'label': 'nega

## Set up tokenizer and model

*   Use `AutoTokenizer` and `AutoModelForSequenceClassification` so the same code works with other checkpoints.
*   `num_labels=2` since the label is either negative or positive.



In [9]:
model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Test the base model with no finetuning

Try the pretrained `google-bert/bert-base-uncased` checkpoint as-is, before any fine-tuning.

In [10]:
train = reviews["train"]

clf = pipeline("text-classification", model=model_name, tokenizer=tokenizer)

# Classify the first five training reviews with the untuned model
for idx in range(5):
  example = train[idx]
  print(f'text: {example["text"]}')
  print(f'label: {example["label"]}')
  print(f'prediction: {clf(example["text"])}\n')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


text: An interesting premise, and Billy Drago is always good as a dangerous nut-bag (side note: I'd love to see Drago, Stephen McHattie and Lance Hendrikson in a flick together; talk about raging cheekbones!). The soundtrack wasn't terrible, either.<br /><br />But the acting--even that of such professionals as Drago and Debbie Rochon--was terrible, the directing worse (perhaps contributory to the former), the dialog chimp-like, and the camera work, barely tolerable. Still, it was the SETS that got a big 10 on my oy-vey scale. I don't know where this was filmed, but were I to hazard a guess, it would be either an open-air museum, or one of those re-enactment villages, where everything is just a bit too well-kept to do more than suggest the real Old West. Okay, so it was shot on a college kid's budget. That said, I could have forgiven one or two of the aforementioned faults. But taken all together, and being generous, I could not see giving it more than three stars.
label: negative
predi

If we take `{"positive": 1, "negative": 0}` as the appropriate mapping, then the pretrained `google-bert/bert-base-uncased` predicts correctly only about 50% of the time. That is chance level for a binary task, which is consistent with the warning above that the classifier weights are newly initialized.

For example, the reviewer in the last text-label pair (`idx=4`) doubts they could sit through the movie a second time, so the review is clearly negative. Yet the model predicts `LABEL_1`, i.e. positive.
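The comparison above depends on how the generic `LABEL_0`/`LABEL_1` names are mapped to sentiment classes. A minimal sketch of that check follows, assuming the `{"positive": 1, "negative": 0}` mapping from the text. The helper `accuracy_under_mapping` is hypothetical, and the `predictions` and `gold` lists are made-up values for illustration, not outputs from this notebook.

```python
# Hypothetical helper: score pipeline-style outputs against string labels.
# Assumes the mapping LABEL_0 -> negative, LABEL_1 -> positive from the text.
ID2LABEL = {"LABEL_0": "negative", "LABEL_1": "positive"}


def accuracy_under_mapping(predictions, gold_labels, id2label=ID2LABEL):
    """Return the fraction of predictions whose mapped label matches gold."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have equal length")
    correct = sum(
        id2label[pred["label"]] == gold
        for pred, gold in zip(predictions, gold_labels)
    )
    return correct / len(gold_labels)


# Made-up example: a model that always answers LABEL_1 on a balanced sample.
predictions = [{"label": "LABEL_1", "score": 0.55} for _ in range(4)]
gold = ["negative", "positive", "negative", "positive"]
print(accuracy_under_mapping(predictions, gold))  # 0.5
```

A model that always predicts the same class scores 50% on a balanced sample, which is the pattern one would expect from an untrained classification head.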

# Finetuning the model

## Preprocessing: map `label` to binary values and tokenize `text`

In [11]:
def binary_mapping(example):
    # Map the string label to an integer: "positive" -> 1, anything else -> 0
    example["label"] = 1 if example["label"] == "positive" else 0
    return example

reviews = reviews.map(binary_mapping)

reviews["train"][0] # label is 0 for negative

{'text': "An interesting premise, and Billy Drago is always good as a dangerous nut-bag (side note: I'd love to see Drago, Stephen McHattie and Lance Hendrikson in a flick together; talk about raging cheekbones!). The soundtrack wasn't terrible, either.<br /><br />But the acting--even that of such professionals as Drago and Debbie Rochon--was terrible, the directing worse (perhaps contributory to the former), the dialog chimp-like, and the camera work, barely tolerable. Still, it was the SETS that got a big 10 on my oy-vey scale. I don't know where this was filmed, but were I to hazard a guess, it would be either an open-air museum, or one of those re-enactment villages, where everything is just a bit too well-kept to do more than suggest the real Old West. Okay, so it was shot on a college kid's budget. That said, I could have forgiven one or two of the aforementioned faults. But taken all together, and being generous, I could not see giving it more than three stars.",
 'label': 0}

In [12]:
def tokenize_text(batch):
    # Pad or truncate every review in the batch to exactly 400 tokens
    return tokenizer(batch["text"], max_length=400, padding="max_length", truncation=True)


reviews = reviews.map(tokenize_text, batched=True)

print(reviews["train"][0]["input_ids"])

[101, 2019, 5875, 18458, 1010, 1998, 5006, 8011, 2080, 2003, 2467, 2204, 2004, 1037, 4795, 17490, 1011, 4524, 1006, 2217, 3602, 1024, 1045, 1005, 1040, 2293, 2000, 2156, 8011, 2080, 1010, 4459, 11338, 12707, 9515, 1998, 9993, 21863, 28730, 3385, 1999, 1037, 17312, 2362, 1025, 2831, 2055, 17559, 27181, 999, 1007, 1012, 1996, 6050, 2347, 1005, 1056, 6659, 1010, 2593, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2021, 1996, 3772, 1011, 1011, 2130, 2008, 1997, 2107, 8390, 2004, 8011, 2080, 1998, 16391, 21326, 8747, 1011, 1011, 2001, 6659, 1010, 1996, 9855, 4788, 1006, 3383, 12130, 2100, 2000, 1996, 2280, 1007, 1010, 1996, 13764, 8649, 9610, 8737, 1011, 2066, 1010, 1998, 1996, 4950, 2147, 1010, 4510, 2000, 3917, 3085, 1012, 2145, 1010, 2009, 2001, 1996, 4520, 2008, 2288, 1037, 2502, 2184, 2006, 2026, 1051, 2100, 1011, 2310, 2100, 4094, 1012, 1045, 2123, 1005, 1056, 2113, 2073, 2023, 2001, 6361, 1010, 2021, 2020, 1045, 2000, 15559, 1037, 3984, 1010, 2009, 2052, 2022, 2593, 2019, 233

In [13]:
reviews

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9916
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2479
    })
})
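Because the tokenizer is called with `max_length=400`, `padding="max_length"`, and `truncation=True`, every row now holds exactly 400 `input_ids` plus a matching `attention_mask`. The toy function below sketches that behavior on already-tokenized ids. It is a simplification, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical name, WordPiece splitting and `token_type_ids` are left out, and the special-token ids are assumed to be those of `bert-base-uncased` (`[PAD]`=0, `[CLS]`=101, `[SEP]`=102).

```python
# Toy illustration of padding="max_length" with truncation=True.
# Assumed special-token ids for bert-base-uncased: [PAD]=0, [CLS]=101, [SEP]=102.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def pad_or_truncate(token_ids, max_length):
    """Build fixed-length input_ids and attention_mask for one sequence."""
    # Reserve two positions for [CLS] and [SEP], then cut the rest.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Fill the remaining positions with [PAD]; the mask marks them as 0.
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


short = pad_or_truncate([2019, 5875, 18458], max_length=8)
print(short["input_ids"])       # [101, 2019, 5875, 18458, 102, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]

long = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(long["input_ids"])        # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```

The attention mask is what lets the model ignore the padded positions, while reviews longer than the limit lose their tail.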

## Training the model

In [14]:
import os
from transformers import DefaultDataCollator

# function to calculate model accuracy
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_predictions):
    predictions, labels = eval_predictions
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
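To make the metric concrete, here is a dependency-free sketch of what `compute_metrics` does: take the index of the largest logit per row (the role of `np.argmax(..., axis=1)`), then compare with the reference labels. The function name `argmax_accuracy` and the toy logits are made up for illustration.

```python
# Dependency-free sketch of what compute_metrics does: argmax, then accuracy.
def argmax_accuracy(logits, labels):
    """logits: list of per-class score lists; labels: list of int class ids."""
    predicted = [row.index(max(row)) for row in logits]
    matches = sum(p == y for p, y in zip(predicted, labels))
    return {"accuracy": matches / len(labels)}


# Made-up logits for four reviews (columns: negative, positive).
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
toy_labels = [0, 1, 1, 1]
print(argmax_accuracy(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

The third row is the miss: its larger logit is in column 0, but the label is 1.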

In [15]:
main_dir = "MovieReviews_Text_Classification/"
model_path = os.path.join(main_dir, "trained_model")
# training arguments
args = TrainingArguments(
    output_dir=model_path,
    learning_rate=1e-5,
    per_device_train_batch_size=50,
    per_device_eval_batch_size=10,
    optim="adafactor",
    num_train_epochs=3,
    evaluation_strategy="epoch",
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    push_to_hub=False,
)

# train the model
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=reviews["train"],
    eval_dataset=reviews["validation"],
    tokenizer=tokenizer,
    data_collator=DefaultDataCollator(),
    compute_metrics=compute_metrics,
)

In [16]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.264438        | 0.901977 |
| 2     | No log        | 0.222794        | 0.922146 |
| 3     | 0.241700      | 0.237453        | 0.922146 |


TrainOutput(global_step=597, training_loss=0.22782900345385373, metrics={'train_runtime': 2504.2225, 'train_samples_per_second': 11.879, 'train_steps_per_second': 0.238, 'total_flos': 6114865370976000.0, 'train_loss': 0.22782900345385373, 'epoch': 3.0})
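The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation. All inputs are copied from the training arguments and the output above.

```python
import math

# Values taken from the training cell and the TrainOutput above.
train_rows = 9916
batch_size = 50            # per_device_train_batch_size
epochs = 3                 # num_train_epochs
train_runtime = 2504.2225  # seconds

# Assumption: a single device and no gradient accumulation.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime

print(steps_per_epoch)               # 199
print(total_steps)                   # 597
print(round(samples_per_second, 3))  # 11.879
```

Both results match the reported `global_step=597` and `train_samples_per_second=11.879`, which indicates that each epoch passed over all 9,916 training rows. If the `Trainer` default of logging every 500 steps was in effect (an assumption, since `logging_steps` is not set here), the first training loss would be logged during epoch 3, which would explain the `No log` entries for epochs 1 and 2.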

## Model evaluation

The model performs quite well after only three epochs with small batch sizes, reaching a validation accuracy of 92.21% and a validation loss of 0.2375. Note that validation loss rose slightly from epoch 2 (0.2228) to epoch 3 (0.2375) while accuracy stayed flat, which may be an early sign of overfitting.

## Save the finetuned model

In [17]:
trainer.save_model(model_path)

## Test out the finetuned model

In [26]:
id2label = {0: "negative", 1: "positive"}
label2id = {"negative": 0, "positive": 1}

tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
finetuned_model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2, id2label=id2label, label2id=label2id, local_files_only=True)

In [32]:
clf = pipeline("text-classification", model=finetuned_model, tokenizer=tokenizer)

validation = reviews["validation"]

# Spot-check a handful of validation reviews with the fine-tuned model
for idx in [5, 200, 300, 600, 800]:
  example = validation[idx]
  print(f'text: {example["text"]}')
  print(f'label: {example["label"]}')
  print(f'prediction: {clf(example["text"])}\n')

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


text: If you want to watch something that is for 'him' and 'her' so to say then this is the film to pick. I am a sucker for rom coms but my husband is not always so keen (what a guy!!!). Anyway I managed to get him to watch it because I told him it was about sport, and you know what, he loved it!!!<br /><br />Drew Barrymore is very funny and her leading man (sorry but can't remember his name) is equally as good. When I watched the film it was called 'The Perfect Match' but I think the title was changed for the UK as it is based on the book Fever Pitch and there was already a film made about football with that title (the same film but the UK version - phew!),<br /><br />Anyway all of the reviews on here will tell you more details if you need them buy girls, take it from me, get your hubby/boyfriend in front of the television on a Saturday night and you will both laugh and cry together. A real gem.
label: 1
prediction: [{'label': 'positive', 'score': 0.994286835193634}]

text: I love a f

## Conclusion


The fine-tuned model appears to work quite well 🙌! It classified all five spot-checked validation reviews correctly, whereas the pretrained model got only about half right, in effect predicting all positives or all negatives.
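Each prediction printed above includes a `score` alongside the label. For a two-class, single-label model the pipeline is understood to apply a softmax over the two logits and report the probability of the winning class. The sketch below shows that calculation on made-up logits. It illustrates the math only and is not the pipeline's own code.

```python
import math


def softmax(logits):
    """Convert raw class logits into probabilities that sum to 1."""
    # Subtract the max logit first for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one review (order: negative, positive).
probs = softmax([-2.5, 2.7])
id2label = {0: "negative", 1: "positive"}
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": id2label[best], "score": probs[best]})
```

A score close to 1, such as the 0.994 shown for the first validation review, therefore means the two logits were far apart.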

With more compute, the two batch-size arguments would be set to `per_device_train_batch_size=3000` and `per_device_eval_batch_size=600`, since there are 9,000+ samples in the train set and 2,000+ in the validation set, and `num_train_epochs` would start at 100 or more. However, the Colab session used here could handle only `num_train_epochs=3` and batch sizes up to a certain double-digit value, so the limited values above were chosen.
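One way to approximate a larger batch without more memory is gradient accumulation, which `TrainingArguments` exposes as `gradient_accumulation_steps`. This was not tried in the notebook, so the sketch below only does the bookkeeping. The helper names are hypothetical, the update count is approximate (the `Trainer` may round differently), and whether a larger effective batch would improve results here is untested.

```python
import math


def effective_batch_size(per_device_batch, accumulation_steps=1, n_devices=1):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * n_devices


def optimizer_updates(n_rows, per_device_batch, epochs, accumulation_steps=1):
    """Approximate count of weight updates over a whole run (single device)."""
    batches_per_epoch = math.ceil(n_rows / per_device_batch)
    return math.ceil(batches_per_epoch / accumulation_steps) * epochs


# The run in this notebook: 50 examples per update.
print(effective_batch_size(50), optimizer_updates(9916, 50, 3))
# Hypothetical: same per-step memory, gradients accumulated over 60 batches.
print(effective_batch_size(50, accumulation_steps=60),
      optimizer_updates(9916, 50, 3, accumulation_steps=60))
```

The trade-off is visible in the counts: an effective batch of 3,000 leaves only about a dozen weight updates across three epochs, compared with 597 at a batch size of 50.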