# Movie Sentiment Analysis

Goal: Use the pre-trained DistilBERT model and fine-tune it on IMDB movie reviews so that it can perform sentiment analysis on reviews.

Outcome: The model should classify each review as positive or negative.


# Summary of Work

In this task, I fine-tuned a pre-trained transformer model to perform sentiment analysis on movie reviews. While BERT was initially considered, its large size made it less practical for this project, so I opted to use DistilBERT, a smaller and more efficient distilled version of BERT.

After selecting the model, I sourced the IMDB movie review dataset from Kaggle. The dataset was organized into two columns, review and sentiment, with sentiment labels of positive or negative. Although the data was already labeled, some preprocessing was required to prepare it for training with DistilBERT.

First, the dataset was split into training and testing sets using train_test_split, with 20% of the data reserved for evaluation. The columns were then mapped to text and labels to match the format expected by the model and the Trainer. Once properly formatted, the data was tokenized using the tokenizer from the pre-trained DistilBERT model.

With the data prepared, training arguments were defined and the model was fine-tuned using the Hugging Face Trainer class. This approach simplified the training process by handling the training loop and evaluation internally, allowing the focus to remain on model configuration rather than low-level implementation.

In the end, the model was trained on the IMDB dataset and achieved an accuracy of approximately 90% (0.8975 on the held-out test split).

# What I Learned

This task may seem simple, and at first glance it might not look like much was done, but I ended up spending a large portion of my time trying to understand how the model worked and digging into the finer details of the code. Along the way, I ran into several issues that took time to identify and fully understand.

To begin with, the output of the sentiment analysis was not using the correct labels. Instead of returning positive or negative, the model was outputting LABEL_0 and LABEL_1. I learned how to modify the model's configuration so that it would return human-readable labels instead. This was done by adding an 'id2label' argument to the model config that maps 0 and 1 to 'negative' and 'positive', respectively.
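As a rough illustration of what the mapping does, the sketch below mimics the lookup in plain Python without loading the model. The `decode_prediction` helper and the sample logits are hypothetical; the `LABEL_<i>` fallback mirrors the generic names that appear when no `id2label` is supplied.

```python
# Hypothetical helper: turn raw class logits into a readable label.
def decode_prediction(logits, id2label=None):
    # Without an explicit mapping, fall back to generic LABEL_<i> names.
    if id2label is None:
        id2label = {i: f"LABEL_{i}" for i in range(len(logits))}
    # Pick the index of the largest logit (argmax).
    predicted_id = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[predicted_id]

id2label = {0: "negative", 1: "positive"}
# The inverse mapping is often supplied alongside id2label as label2id.
label2id = {label: idx for idx, label in id2label.items()}

sample_logits = [-1.2, 2.3]  # made-up values for illustration
print(decode_prediction(sample_logits))            # LABEL_1
print(decode_prediction(sample_logits, id2label))  # positive
```

The model's predictions are the same in both calls; only the name attached to the winning class index changes.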

I also learned that the Trainer class from Hugging Face requires the dataset to include a labels column, which meant I had to map the original sentiment values to numerical labels and update the dataset accordingly. In addition, the IMDB dataset contains many long reviews, which caused issues during tokenization. The model accepts a maximum sequence length of 512 tokens, and many of the reviews exceeded this limit. I learned how and why truncation is required for the model to work properly, and that the maximum length can be reduced to a much smaller size (I used 256) to try to speed up training.
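To make the effect of truncation concrete, here is a minimal sketch that works on plain lists of token ids. It is a simplification: a real tokenizer also reserves room for special tokens such as [CLS] and [SEP], and `truncate_and_pad` and `PAD_ID` are hypothetical names chosen for this example.

```python
PAD_ID = 0  # placeholder padding id, chosen for this sketch only

def truncate_and_pad(token_ids, max_length):
    """Clip a sequence to max_length, then right-pad it to exactly max_length."""
    clipped = token_ids[:max_length]
    padding = max_length - len(clipped)
    return {
        "input_ids": clipped + [PAD_ID] * padding,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(clipped) + [0] * padding,
    }

long_review = list(range(1, 701))  # stand-in for a 700-token review
short_review = list(range(1, 6))   # stand-in for a 5-token review

for review in (long_review, short_review):
    encoded = truncate_and_pad(review, max_length=256)
    # Prints the padded length and the number of real (unmasked) tokens.
    print(len(encoded["input_ids"]), sum(encoded["attention_mask"]))
```

Both reviews come out at the same length, which is what lets them be stacked into one batch. The long review loses everything past position 256, which is the trade-off accepted in exchange for faster training.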

Beyond data handling, I also explored transfer learning concepts, including how to freeze certain layers of the model to reduce training time. Finally, I gained experience working with the TrainingArguments class and learned how different parameters affect training and evaluation behavior.
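The selection logic behind layer freezing can be sketched without loading a model. The example below decides, by name, which parameters would stay trainable. The `select_trainable` helper is hypothetical, and the parameter names are only an illustrative subset of what a DistilBERT classifier exposes.

```python
def select_trainable(param_names, trainable_markers):
    """Map each parameter name to True (trainable) or False (frozen)."""
    return {
        name: any(marker in name for marker in trainable_markers)
        for name in param_names
    }

# Illustrative subset of parameter names, not the full model.
param_names = [
    "distilbert.embeddings.word_embeddings.weight",
    "distilbert.transformer.layer.0.attention.q_lin.weight",
    "distilbert.transformer.layer.5.attention.q_lin.weight",
    "pre_classifier.weight",
    "classifier.weight",
]

# Keep only the last transformer layer and the classification head trainable.
plan = select_trainable(param_names, ["transformer.layer.5", "classifier"])
for name, trainable in plan.items():
    print(f"{'train ' if trainable else 'frozen'}  {name}")
```

With a real model, the same decision would be applied by setting `param.requires_grad` for each entry returned by `named_parameters()`. Fewer trainable parameters means fewer gradients to compute and store on each step.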

# Future Work

As for future work, the next step would be to take the sentiment analysis further by applying it to live internet reviews. This would involve adding functionality to extract movie reviews from platforms like Reddit or Twitter (now known as X) and using that data to further retrain or fine-tune the model.

It would also be interesting to perform large-scale sentiment analysis across many reviews and generate a concise overall sentiment for a movie based on the proportion of positive and negative feedback found online. In addition, presenting the results as a numeric score could make the output more intuitive and easier for users to interpret.
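One possible shape for that numeric score is sketched below. Nothing here is implemented in this project yet: `overall_sentiment` is a hypothetical helper, the 50-point cutoff is an arbitrary assumption, and the sample predictions are made up to match the list-of-dicts format the classifier returns.

```python
def overall_sentiment(predictions):
    """Summarize pipeline-style predictions as a 0-100 positivity score."""
    if not predictions:
        raise ValueError("need at least one prediction")
    positive = sum(1 for p in predictions if p["label"] == "positive")
    score = round(100 * positive / len(predictions))
    # Arbitrary cutoff for this sketch: 50 or above counts as positive overall.
    verdict = "positive" if score >= 50 else "negative"
    return {"score": score, "verdict": verdict, "reviews": len(predictions)}

# Made-up predictions in the same shape the classifier returns.
sample = [
    {"label": "positive", "score": 0.98},
    {"label": "positive", "score": 0.91},
    {"label": "negative", "score": 0.96},
    {"label": "positive", "score": 0.75},
]
print(overall_sentiment(sample))
# {'score': 75, 'verdict': 'positive', 'reviews': 4}
```

This version counts every review equally. Weighting each vote by the model's confidence score would be a natural variation to try.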

As a side note, I think it would be fun to present the sentiment analysis output in meme form. It would be a funny way to convey the overall sentiment on a particular movie.

## Data processing and setup

In [1]:
# Load and split data
from datasets import load_dataset

imbd_raw_data = load_dataset('csv', data_files='IMBD_Dataset.csv')
imbd_dataset = imbd_raw_data['train'].train_test_split(test_size=0.2)

In [2]:
label_map = {
    "negative": 0,
    "positive": 1,
}

def encode_labels(example):
    example["text"] = example["review"]
    example["labels"] = label_map[example["sentiment"]]
    return example

imbd_train_test_data = imbd_dataset.map(encode_labels).remove_columns(['review','sentiment'])

Map:   0%|          | 0/40000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [3]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DistilBertConfig

model_name = "distilbert/distilbert-base-uncased"

my_config = DistilBertConfig.from_pretrained(
    model_name, 
    num_labels=2,
    id2label={
        0: "negative",
        1: "positive",
    },
)

model = AutoModelForSequenceClassification.from_pretrained(model_name, config=my_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
def tokenize_data(dataset):
    return tokenizer(dataset["text"], truncation=True, padding=True, max_length=256)

imbd_data = imbd_train_test_data.map(tokenize_data, batched=True)


In [5]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [6]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return { 
        "accuracy" : accuracy_score(labels, preds),
        "precision" : precision_score(labels, preds),
        "f1_score" : f1_score(labels, preds),
    }

In [7]:
# Freeze all base model (DistilBERT encoder) parameters
for param in model.base_model.parameters():
    param.requires_grad = False

# Unfreeze the last transformer layer. The classification head
# (pre_classifier, classifier) lives outside base_model, so it was
# never frozen and stays trainable.
for name, param in model.base_model.named_parameters():
    if "transformer.layer.5" in name:
        param.requires_grad = True

## Training

In [8]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./distilbert-imbd-dataset",
    save_strategy="best",
    eval_strategy="steps",
    metric_for_best_model="loss",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    load_best_model_at_end=True,
    dataloader_pin_memory=False,
    push_to_hub=False,
)

In [9]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=imbd_data["train"],
    eval_dataset=imbd_data["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


In [10]:
trainer.train()

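| Step | Training Loss | Validation Loss | Accuracy | Precision | F1 Score |
|------|---------------|-----------------|----------|-----------|----------|
| 500  | 0.3788 | 0.274759 | 0.8842 | 0.894437 | 0.883454 |
| 1000 | 0.2828 | 0.26319  | 0.8915 | 0.877489 | 0.894177 |
| 1500 | 0.2771 | 0.255893 | 0.8962 | 0.88695  | 0.898095 |
| 2000 | 0.2651 | 0.250657 | 0.8969 | 0.891961 | 0.898213 |
| 2500 | 0.2636 | 0.249731 | 0.8975 | 0.891628 | 0.898925 |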


TrainOutput(global_step=2500, training_loss=0.2934693115234375, metrics={'train_runtime': 6532.8263, 'train_samples_per_second': 12.246, 'train_steps_per_second': 0.383, 'total_flos': 5298695946240000.0, 'train_loss': 0.2934693115234375, 'epoch': 2.0})
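The step counts in this log can be reproduced with a little arithmetic. The sketch below assumes a single device and no gradient accumulation. It also assumes evaluation runs every 500 steps, which appears to be the fallback when `eval_steps` is not set. The variable names are illustrative.

```python
import math

train_examples = 40_000  # size of the training split (80% of the data)
batch_size = 32          # per_device_train_batch_size
epochs = 2               # num_train_epochs
eval_every = 500         # assumed default evaluation interval

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
eval_steps = list(range(eval_every, total_steps + 1, eval_every))

print(steps_per_epoch)  # 1250
print(total_steps)      # 2500
print(eval_steps)       # [500, 1000, 1500, 2000, 2500]
```

The total matches `global_step=2500` in the training output, and it is also why the final checkpoint directory is named `checkpoint-2500`.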

In [11]:
trainer.evaluate()

{'eval_loss': 0.2497306913137436,
 'eval_accuracy': 0.8975,
 'eval_precision': 0.8916275430359938,
 'eval_f1_score': 0.8989251553101272,
 'eval_runtime': 83.2547,
 'eval_samples_per_second': 120.113,
 'eval_steps_per_second': 3.76,
 'epoch': 2.0}
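Recall is not reported by `compute_metrics`, but it can be recovered from the precision and F1 values above, since F1 is the harmonic mean of the two. This is a small worked derivation and not an additional measurement.

```python
precision = 0.8916275430359938  # eval_precision from the output above
f1 = 0.8989251553101272         # eval_f1_score from the output above

# F1 = 2PR / (P + R)  =>  R = F1 * P / (2P - F1)
recall = f1 * precision / (2 * precision - f1)
print(round(recall, 3))  # 0.906

# Sanity check: rebuilding F1 from P and R should give back the reported value.
rebuilt_f1 = 2 * precision * recall / (precision + recall)
print(abs(rebuilt_f1 - f1) < 1e-9)  # True
```

A recall slightly above precision suggests the model produces somewhat more false positives than false negatives on this test split.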

## Results

In [12]:
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-imbd-dataset/checkpoint-2500"
)

classifier("This movie was surprisingly bad.")

Device set to use mps:0


[{'label': 'negative', 'score': 0.9619709849357605}]

In [13]:
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-imbd-dataset/checkpoint-2500"
)

classifier("This movie was the wonderful.")

Device set to use mps:0


[{'label': 'positive', 'score': 0.9867086410522461}]

# Store Model and Dataset to Hugging Face

In [14]:
model.push_to_hub("movie_sentiment-distilbert")  # same repo name as the tokenizer below
tokenizer.push_to_hub("movie_sentiment-distilbert")

CommitInfo(commit_url='https://huggingface.co/acosio14/movie_sentiment-distilbert/commit/6f4dc3a4b94aa3c725b1f18d177f66670953cf80', commit_message='Upload tokenizer', commit_description='', oid='6f4dc3a4b94aa3c725b1f18d177f66670953cf80', pr_url=None, repo_url=RepoUrl('https://huggingface.co/acosio14/movie_sentiment-distilbert', endpoint='https://huggingface.co', repo_type='model', repo_id='acosio14/movie_sentiment-distilbert'), pr_revision=None, pr_num=None)

In [15]:
imbd_data.push_to_hub("acosio14/imbd-movie-reviews")

CommitInfo(commit_url='https://huggingface.co/datasets/acosio14/imbd-movie-reviews/commit/5dfd25cf1162851cd6fcea5e1950c277382aa69f', commit_message='Upload dataset', commit_description='', oid='5dfd25cf1162851cd6fcea5e1950c277382aa69f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/acosio14/imbd-movie-reviews', endpoint='https://huggingface.co', repo_type='dataset', repo_id='acosio14/imbd-movie-reviews'), pr_revision=None, pr_num=None)