## Sentiment Analysis on Movie Reviews

The aim is to predict whether a film review is positive or negative.\
We will use the IMDb dataset, which contains labelled reviews.\
We will use the DistilBERT language model, a small, fast, cheap and light Transformer model trained by distilling BERT base.


Load the IMDb dataset from the Hugging Face Datasets library:

In [1]:
from datasets import load_dataset  
import pandas as pd
imdb = load_dataset("imdb")

In [2]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

There are two fields in this dataset:

'text': the movie review text.\
'label': a value that is either 0 for a negative review or 1 for a positive review.

We train on 25,000 reviews: 12,500 positive and 12,500 negative.

In [3]:
train_reviews = pd.DataFrame(imdb["train"][:])
train_reviews.label.value_counts()

0    12500
1    12500
Name: label, dtype: int64

Some examples of negative reviews:

In [4]:
train_reviews[1:10]

| | text | label |
|---|---|---|
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |
| 5 | I would put this at the top of my list of film... | 0 |
| 6 | Whoever wrote the screenplay for this movie ob... | 0 |
| 7 | When I first saw a glimpse of this movie, I qu... | 0 |
| 8 | Who are these "They"- the actors? the filmmake... | 0 |
| 9 | This is said to be a personal film for Peter B... | 0 |


In [7]:
train_reviews["text"][4]

'Oh, brother...after hearing about this ridiculous film for umpteen years all I can think of is that old Peggy Lee song..<br /><br />"Is that all there is??" ...I was just an early teen when this smoked fish hit the U.S. I was too young to get in the theater (although I did manage to sneak into "Goodbye Columbus"). Then a screening at a local film museum beckoned - Finally I could see this film, except now I was as old as my parents were when they schlepped to see it!!<br /><br />The ONLY reason this film was not condemned to the anonymous sands of time was because of the obscenity case sparked by its U.S. release. MILLIONS of people flocked to this stinker, thinking they were going to see a sex film...Instead, they got lots of closeups of gnarly, repulsive Swedes, on-street interviews in bland shopping malls, asinie political pretension...and feeble who-cares simulated sex scenes with saggy, pale actors.<br /><br />Cultural icon, holy grail, historic artifact..whatever this thing was,

Some examples of positive reviews:

In [5]:
train_reviews[20000:20010]

| | text | label |
|---|---|---|
| 20000 | After reading some quite negative views for th... | 1 |
| 20001 | This is one of those movies that's difficult t... | 1 |
| 20002 | Strangely, this version of OPEN YOUR EYES is m... | 1 |
| 20003 | I really liked this movie. I've read a few of ... | 1 |
| 20004 | Despite excellent trailers for Vanilla Sky, I ... | 1 |
| 20005 | I went to see Vanilla Sky with a huge, huge, h... | 1 |
| 20006 | This film was just absolutly brilliant. It act... | 1 |
| 20007 | As the one-line summary says, two movies have ... | 1 |
| 20008 | David Aames is a rich good-looking guy who liv... | 1 |
| 20009 | this movie is another on the list that i did n... | 1 |


In [8]:
train_reviews["text"][20000]

"After reading some quite negative views for this movie, I was not sure whether I should fork out some money to rent it. However, it was a pleasant surprise. I haven't seen the original movie, but if its better than this, I'd be in heaven.<br /><br />Tom Cruise gives a strong performance as the seemingly unstable David, convincing me that he is more than a smile on legs (for only the third time in his career- the other examples were Magnolia and Born on the Fourth of July). Penelope Cruz is slightly lightweight but fills the demands for her role, as does Diaz. The only disappointment is the slightly bland Kurt Russell. In the movie, however, it is not the acting that really impresses- its the filmmaking.<br /><br />Cameron Crowe excels in the director's role, providing himself with a welcome change of pace from his usual schtick. The increasing insanity of the movie is perfectly executed by Crowe (the brief sequence where Cruise walks through an empty Time Square is incredibly effectiv

**Preprocess**

The next step is to load a DistilBERT tokenizer to preprocess the text field:

In [9]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT’s maximum input length:

In [10]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
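
To make the effect of truncation=True concrete, here is a minimal sketch in plain Python. It is not the tokenizer's actual implementation: truncate_ids is a hypothetical helper, the ids 101 and 102 are assumed to stand for the [CLS] and [SEP] special tokens, and 512 is assumed to be DistilBERT's maximum input length.

```python
CLS_ID, SEP_ID = 101, 102  # assumed ids of the [CLS] and [SEP] special tokens
MAX_LENGTH = 512           # assumed maximum input length for DistilBERT


def truncate_ids(token_ids, max_length=MAX_LENGTH):
    """Keep at most max_length ids, reserving two slots for [CLS] and [SEP]."""
    body = token_ids[: max_length - 2]
    return [CLS_ID] + body + [SEP_ID]


long_review = list(range(1000, 1700))   # 700 placeholder token ids
short_review = list(range(1000, 1010))  # 10 placeholder token ids

print(len(truncate_ids(long_review)))   # 512: the tail of the review is dropped
print(len(truncate_ids(short_review)))  # 12: short inputs are left intact
```

Long reviews therefore lose their ending, which is usually acceptable for sentiment classification but worth keeping in mind.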

To apply the preprocessing function over the entire dataset, use the 'datasets' map function. \
You can speed up map by setting batched=True to process multiple elements of the dataset at once:

In [11]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)

In [12]:
tokenized_imdb["train"][0].keys()

dict_keys(['text', 'label', 'input_ids', 'attention_mask'])
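
The new input_ids and attention_mask columns appear because a batched map hands the function a dictionary of lists (one list per column) and merges whatever columns it returns back into the dataset. The sketch below imitates that behaviour in plain Python. batched_map and toy_preprocess are hypothetical stand-ins, not the library's code, and the toy "tokenizer" simply uses word lengths as ids.

```python
def batched_map(rows, function, batch_size=1000):
    """Apply `function` to column-wise batches and merge the new columns into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Column-oriented view of the chunk: {"text": [...], "label": [...]}
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            out.append(merged)
    return out


def toy_preprocess(examples):
    # Stand-in for the tokenizer: one fake "id" (the word length) per word.
    ids = [[len(word) for word in text.split()] for text in examples["text"]]
    return {"input_ids": ids, "attention_mask": [[1] * len(x) for x in ids]}


rows = [
    {"text": "great film", "label": 1},
    {"text": "dull and far too long", "label": 0},
]
result = batched_map(rows, toy_preprocess, batch_size=2)
print(list(result[0].keys()))  # ['text', 'label', 'input_ids', 'attention_mask']
```

This is why preprocess_function indexes examples["text"] directly: with batching, that value is a list of review strings rather than a single string.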

Now create a batch of examples using DataCollatorWithPadding. \
It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [13]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
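
As a rough illustration of dynamic padding, the sketch below pads each batch only to the length of its own longest sequence. pad_batch is a hypothetical helper written for this example, the padding id 0 is an assumption, and the token ids are placeholders. The real collator also handles labels and tensor conversion.

```python
PAD_ID = 0  # assumed id of the padding token


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2307, 102], [101, 2023, 2001, 2307, 102]])
print(batch["input_ids"])       # [[101, 2307, 102, 0, 0], [101, 2023, 2001, 2307, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Here the batch is 5 tokens wide instead of the maximum input length, so far fewer padding positions are processed.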

**Evaluate**

Including a metric during training is often helpful for evaluating your model’s performance. \
You can quickly load an evaluation metric with the Hugging Face Evaluate library. \
For this task, load the accuracy metric:

In [14]:
import evaluate
accuracy = evaluate.load("accuracy")

In [15]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
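
To see what compute_metrics does with the model output, here is a dependency-free sketch of the same two steps: take the index of the largest logit in each row, then compare against the labels. toy_accuracy and the logit values are made up for illustration. The notebook itself relies on np.argmax and the loaded accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list (plain-Python analogue of np.argmax)."""
    return max(range(len(row)), key=lambda i: row[i])


def toy_accuracy(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews: column 0 = negative, column 1 = positive.
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]
labels = [0, 1, 0, 0]

print(toy_accuracy(logits, labels))  # {'accuracy': 0.75}
```

The third review is predicted positive but labelled negative, so three of the four predictions are correct.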

**Train**

Before you start training your model, create a map of the expected ids to their labels with id2label and label2id:

In [16]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
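
The two dictionaries must be exact inverses of each other. One optional way to guarantee that, shown here as a small sketch rather than a required step, is to derive label2id from id2label:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Invert the mapping so the two dictionaries cannot drift apart.
label2id = {label: idx for idx, label in id2label.items()}

print(label2id)  # {'NEGATIVE': 0, 'POSITIVE': 1}

# Round trip: every id maps to a label and back to the same id.
assert all(label2id[id2label[i]] == i for i in id2label)
```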

In [17]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1) Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You’ll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer will evaluate the accuracy and save the training checkpoint.
2) Pass the training arguments to Trainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
3) Call train() to finetune your model.

In [None]:
training_args = TrainingArguments(
    output_dir="sentiment_analysis",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

  0%|          | 0/3126 [00:00<?, ?it/s]

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
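
The total of 3126 in the progress bar can be checked with a short calculation. This sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. total_training_steps is a hypothetical helper, not a Trainer method.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when each batch yields exactly one step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 25,000 training reviews, per_device_train_batch_size=16, num_train_epochs=2
print(total_training_steps(25_000, 16, 2))  # 3126
```

Each epoch takes 1563 steps (1562 full batches plus one half-sized batch), and two epochs give 3126.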


In [None]:
trainer.push_to_hub()

**Inference**

Great, now that you’ve finetuned a model, you can use it for inference!\
Grab some text you’d like to run inference on:

In [None]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
# text = "This film was amazing. A bit long, but it captivated me from beginning to end. It is definitely one of my favorites."
# text = "This film was very boring from beginning to end. I can't recommend it to anyone and I am 1000% that I will never see it again."

The simplest way to try out your finetuned model for inference is to use it in a pipeline(). \
Instantiate a pipeline for sentiment analysis with your model, and pass your text to it:

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="alexisdpc/my_awesome_model")
classifier(text)

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alexisdpc/my_awesome_model")
inputs = tokenizer(text, return_tensors="pt")

In [None]:
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("alexisdpc/my_awesome_model")
# Disable gradient tracking: inference needs only the forward pass.
with torch.no_grad():
    logits = model(**inputs).logits

In [None]:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
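
The argmax above returns only the winning label. If a confidence score is also wanted, the logits can be passed through a softmax. The sketch below does this in plain Python with hypothetical logit values. In the notebook the same idea would apply to the logits tensor.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-2.0, 2.5]  # hypothetical model output for one review

probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])
print(id2label[best], round(probs[best], 3))
```

Softmax does not change which class wins, so the predicted label is the same as with argmax on the raw logits.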