<a href="https://colab.research.google.com/github/danadria/Skills-Lab-Introduction-to-Transformers-BERT-and-Explainable-NLP/blob/main/skills_lab_sentiment_analysis_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practical 10: Transformers for Sentiment Analysis
#### Daniel Anadria

<img src="img/uu_logo.png" alt="logo" align="right" title="UU" width="50" height="20" />


#### Applied Text Mining - Utrecht Summer School

**Practical is under construction, parts remain to be annotated.**


In this practical, we are going to perform sentiment analysis of movie reviews using a transformer model, have a look under its hood, and try to explain the model predictions using SHAP.

Our model of choice is [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5), a light-weight transformer whose performance is comparable to Google's [BERT base model](https://arxiv.org/abs/1810.04805).



## Overview

In Part 1, we will use an off-the-shelf sentiment analysis pipeline from the Hugging Face [`transformers`](https://huggingface.co/docs/transformers/index) module to classify two movie reviews.

In Part 2, we will disassemble the sentiment analysis pipeline by performing the same analysis as in Part 1 step-by-step.

In Part 3, we will open the black box and explore which tokens were most important for DistilBERT's sentiment classification. We do this using [Shapley Additive Explanations \(SHAP\)](https://arxiv.org/abs/1705.07874).

In Part 4, we will fine-tune DistilBERT on the IMDB movie review dataset.

## Prepare the Colab Environment

**Fine-tuning a transformer model is quite resource-intensive! Switch your runtime type to GPU T4 under Runtime > Change runtime type.**

**Running this practical requires a more recent version of the `accelerate` package than installed by default in Google Colab. Run the code below to upgrade `accelerate`.**

In [2]:
!pip install -q -U accelerate # update accelerate

**Now restart your runtime under Runtime > Restart runtime or by pressing `ctrl + M`**

**All set?**

## Part 1: Off-the-shelf sentiment analysis pipeline

Since sentiment analysis is a popular application, there are off-the-shelf pipelines which we can use to quickly classify documents by sentiment. One such pipeline is part of the Hugging Face `transformers` module.



<blockquote> 🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch.

[ ... ]

The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the task summary for examples of use.

[ ... ]

The `pipeline()` is the most powerful object encapsulating all other pipelines.


</blockquote>



In [2]:
!pip install -q transformers
from transformers import pipeline

Our model will be DistilBERT base uncased. Uncased means that the model disregards casing (upper or lower case): the text is lowercased before tokenization. Our DistilBERT version has been fine-tuned for binary sentiment classification using the Stanford Sentiment Treebank corpus ([SST-2; Pang and Lee, 2005](https://arxiv.org/abs/cs/0506075)). Hence, we can use it off-the-shelf.

Pre-trained BERT models are available for many different natural language processing tasks based on the [General Language Understanding Evaluation (GLUE) benchmark resources](https://huggingface.co/datasets/glue).


In [11]:
sentiment_pipeline = pipeline("sentiment-analysis", model = 'distilbert-base-uncased-finetuned-sst-2-english')


To showcase how to use the sentiment analysis pipeline, we will compare two relatively complex IMDb reviews of Mark Mylod's movie The Menu (2022). Load the following two reviews:

In [3]:
review1 = "The Menu isn't the first to satirise the rich and their incompetence and isn't saying anything new but that definitely doesn't prevent it from being a great satire that pokes fun at everything it can in ways that are often consistently funny, playful and extremely stylish. Ralph Fiennes gives a terrific performance full of awkward unease that only enhances his commanding screen presence. Anya Taylor-Joy is a perfect audience surrogate amongst a sea of deliberately unlikeable characters of which the best is Nicholas Hoult whose almost too good at making his character hilariously pathetic. Mark Mylod's direction is excellent, the film has more than enough visual style to match the pretentiousness of its characters and is really good at building tension. The music by Colin Stetson is fantastic, striking a unusual balance between beautiful and unnerving."


In [8]:
review2 = "This looked like an interesting film based on the trailer and the first half of it was just that. The tension and suspense was building nicely. There were little dribs and drabs and hints of what might be coming without being too obvious. The acting from everyone in the film was good. Even supporting characters with only a few lines. Were well realized I remember thinking that I couldn't wait to see where it was all going. Sadly it didn't really go anywhere. It all unwound in the second half. The acting was still on but the writing failed. That's the most i can say without giving up any spoilers. And that was extra disappointing because the first half was so good. This Menu did not deliver the meal as advertised."


You can skim the reviews.

In [5]:
print(review1)

The Menu isn't the first to satirise the rich and their incompetence and isn't saying anything new but that definitely doesn't prevent it from being a great satire that pokes fun at everything it can in ways that are often consistently funny, playful and extremely stylish. Ralph Fiennes gives a terrific performance full of awkward unease that only enhances his commanding screen presence. Anya Taylor-Joy is a perfect audience surrogate amongst a sea of deliberately unlikeable characters of which the best is Nicholas Hoult whose almost too good at making his character hilariously pathetic. Mark Mylod's direction is excellent, the film has more than enough visual style to match the pretentiousness of its characters and is really good at building tension. The music by Colin Stetson is fantastic, striking a unusual balance between beautiful and unnerving.


In [6]:
print(review2)

This looked like an interesting film based on the trailer and the first half of it was just that. The tension and suspense was building nicely. There were little dribs and drabs and hints of what might be coming without being too obvious. The acting from everyone in the film was good. Even supporting characters with only a few lines. Were well realized I remember thinking that I couldn't wait to see where it was all going. Sadly it didn't really go anywhere. It all unwound in the second half. The acting was still on but the writing failed. That's the most i can say without giving up any spoilers. And that was extra disappointing because the first half was so good. This Menu did not deliver the meal as advertised.


What is your guess of the sentiment of these two reviews? On a scale of 1-10, what rating do you think the respective authors gave The Menu?

**1. Use the sentiment pipeline to predict the sentiment of the two reviews.**

In [12]:
sentiment_pipeline(review1) # predict sentiment

[{'label': 'POSITIVE', 'score': 0.9993577599525452}]

In [13]:
sentiment_pipeline(review2) # predict sentiment

[{'label': 'NEGATIVE', 'score': 0.9622442722320557}]

For each review, we see the output label and the associated probability.

Do you agree with the model's predictions?

Here is the ground truth:

[Review 1](https://www.imdb.com/review/rw8682076/?ref_=tt_urv) is a positive review with a rating of 8/10.

[Review 2](https://www.imdb.com/review/rw8693249/?ref_=tt_urv) is a negative review with a rating of 4/10.


Now we are going to show you how to build your own sentiment analysis pipeline from scratch. In practice you can use the already existing one, but it is helpful to understand the steps associated with setting up a transformer-based pipeline.



## Part 2: Sentiment Analysis Pipeline - Deconstructed

We now repeat the same analysis on the same two reviews, this time step by step.

**2. Define the tokenizer and model. For the tokenizer, use the pretrained DistilBERT tokenizer (`DistilBertTokenizer.from_pretrained`) and for the model use `"distilbert-base-uncased-finetuned-sst-2-english"`.**

In [14]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

**3. Tokenize the `review1` and `review2` objects. Pad and truncate the sequences, and return PyTorch (`pt`) tensors. Save the output object as `encoding`.**


In [None]:
encoding = tokenizer([review1, review2], padding = True, truncation = True, return_tensors = 'pt') # tokenize the reviews

BERT and several other transformer models use tokenizers based on [WordPiece](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt), a subword tokenization algorithm. The main advantage of a subword tokenizer is that it interpolates between word-based and character-based tokenization. Common words get a slot in the vocabulary, but the tokenizer can fall back to word pieces and individual characters for unknown words.
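
To make the fallback from whole words to word pieces concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece applies to a single word. The vocabulary below is a toy one made up for illustration (the real DistilBERT vocabulary has about 30,000 entries), and the function name `wordpiece_tokenize` is hypothetical, not part of `transformers`.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercased word into word pieces by greedy longest match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, an assumption for this sketch only.
toy_vocab = {"the", "menu", "satire", "sat", "##iri", "##se", "st", "##yl", "##ish"}

print(wordpiece_tokenize("satire", toy_vocab))    # common word: a single token
print(wordpiece_tokenize("satirise", toy_vocab))  # rarer word: several pieces
print(wordpiece_tokenize("stylish", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))       # no matching piece at all
```

The real tokenizer additionally lowercases the text and splits on whitespace and punctuation before this step.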

Since the batched inputs (our reviews) have different lengths, they cannot be stacked directly into a fixed-size tensor to be fed to the model.

There are two main strategies for solving this problem -- *padding* and *truncation*.

In order to create rectangular tensors from batches of varying lengths, padding adds a special padding token to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences.

`padding = True`: pad to the longest sequence in the batch (no padding is applied if you only provide a single sequence).

`truncation = True`: truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None).
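
The two strategies can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the tokenizer's actual implementation: the helper name `pad_and_truncate` is hypothetical, the token ids other than `101`, `102` and `0` are arbitrary, and the real tokenizer keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
PAD_ID = 0  # id of the [PAD] token in the BERT vocabulary


def pad_and_truncate(batch, max_length):
    """Return (input_ids, attention_mask) as equal-length lists of lists."""
    # Truncation: cut every sequence down to at most max_length ids.
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend shorter sequences to the longest one left in the batch.
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Two sequences of different lengths (middle ids are arbitrary placeholders).
batch = [[101, 7, 8, 9, 10, 102], [101, 7, 102]]
ids, mask = pad_and_truncate(batch, max_length=5)
print(ids)
print(mask)
```

Both rows now have the same length, so they can be stacked into one rectangular tensor, and the mask records which positions are padding.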

**4. Inspect the `encoding` object by printing the first review's input ids.**

In [None]:
print(encoding['input_ids'][0]) # first review's input_ids

We see that the tokenizer maps each token to an integer id from the model's vocabulary (`input_ids`).

**5. Convert first review's input ids to tokens using `convert_ids_to_tokens` to see how the text got tokenized.**

In [None]:
print(tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])) # first review's tokens

['[CLS]', 'the', 'menu', 'isn', "'", 't', 'the', 'first', 'to', 'sat', '##iri', '##se', 'the', 'rich', 'and', 'their', 'inc', '##omp', '##ete', '##nce', 'and', 'isn', "'", 't', 'saying', 'anything', 'new', 'but', 'that', 'definitely', 'doesn', "'", 't', 'prevent', 'it', 'from', 'being', 'a', 'great', 'satire', 'that', 'poke', '##s', 'fun', 'at', 'everything', 'it', 'can', 'in', 'ways', 'that', 'are', 'often', 'consistently', 'funny', ',', 'playful', 'and', 'extremely', 'st', '##yl', '##ish', '.', 'ralph', 'fi', '##enne', '##s', 'gives', 'a', 'terrific', 'performance', 'full', 'of', 'awkward', 'unease', 'that', 'only', 'enhance', '##s', 'his', 'commanding', 'screen', 'presence', '.', 'anya', 'taylor', '-', 'joy', 'is', 'a', 'perfect', 'audience', 'sur', '##rogate', 'amongst', 'a', 'sea', 'of', 'deliberately', 'unlike', '##able', 'characters', 'of', 'which', 'the', 'best', 'is', 'nicholas', 'ho', '##ult', 'whose', 'almost', 'too', 'good', 'at', 'making', 'his', 'character', 'hilarious', 

Note that BERT-based models also operate with special tokens:


| Token      | Token ID | Meaning                                 |
|:----------:|:--------:|:---------------------------------------:|
| `[CLS]`    | `101`    | Beginning of input                     |
| `[SEP]`    | `102`    | End of input or sentence               |
| `[MASK]`   | `103`    | Masked tokens the model should predict |
| `[PAD]`    | `0`      | Padding                                 |
| `[UNK]`    | `100`    | Unknown token not in the vocabulary    |

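Now we are ready to do some sentiment prediction. We import `torch` and call the model with `input_ids`, `attention_mask` and `labels`. The attention mask is a binary tensor that marks real tokens with 1 and padded positions with 0, so that the model does not attend to the padding. The `labels` are the true classes (1 = positive, 0 = negative); they are only used to compute the loss and do not change the predicted logits.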

In [None]:
# prediction of sentiment
import torch

output = model(input_ids = encoding['input_ids'], attention_mask = encoding['attention_mask'], labels = torch.tensor([1, 0]))
print("Predicted logits:", output['logits']) # logits
print("Predicted probabilities:", torch.nn.functional.softmax(output['logits'], dim=-1)) # from logits to probabilities
prediction = torch.argmax(output['logits'], 1) # from logits to binary class
print("Predicted classes:", prediction)

Predicted logits: tensor([[-3.5648,  3.7852],
        [ 1.8161, -1.4221]], grad_fn=<AddmmBackward0>)
Predicted probabilities: tensor([[6.4221e-04, 9.9936e-01],
        [9.6224e-01, 3.7756e-02]], grad_fn=<SoftmaxBackward0>)
Predicted classes: tensor([1, 0])


We see that the predicted sentiment and probabilities are the same as when using the off-the-shelf sentiment classification pipeline.
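
The step from logits to probabilities can be checked by hand. The sketch below applies a plain-Python softmax to the logits printed above; the label order (index 0 = NEGATIVE, index 1 = POSITIVE) follows the model's label mapping, and the helper name `softmax` is our own.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [[-3.5648, 3.7852], [1.8161, -1.4221]]  # copied from the output above
labels = ["NEGATIVE", "POSITIVE"]

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    results.append((labels[best], probs[best]))
    print(labels[best], round(probs[best], 4))
```

Up to rounding of the printed logits, the winning probabilities match the scores returned by the pipeline in Part 1.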

## Part 3: Feature importance with SHAP

Now that we have classified our two reviews, we might want to explain DistilBERT's predictions using SHapley Additive exPlanations (SHAP).


We install the `shap` module, create a `shap.Explainer` and feed it our sentiment pipeline. Computing the SHAP values for our two reviews takes about 5 minutes here, and the cost grows quickly with the number and length of documents, so it can become very expensive in real-life applications.

In [None]:
!pip install -q shap
import shap

explainer = shap.Explainer(sentiment_pipeline)

shap_values = explainer([review1, review2])

Partition explainer: 3it [04:52, 146.30s/it]


All done! A nice thing about the `shap` module is that it comes with a built-in visualizer.

In [None]:
shap.plots.text(shap_values[0]) # first review

In [None]:
shap.plots.text(shap_values[1]) # second review
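
To build intuition for the numbers behind these plots, the sketch below computes exact Shapley values for a tiny made-up scoring function over three tokens. Everything in it is an assumption for illustration: `toy_score` is not a real model, and the brute-force loop over all orderings is only feasible for a handful of tokens. The `shap` library instead uses approximations (the Partition explainer in the run above) to handle full reviews.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        present = set()
        for p in order:
            before = value(frozenset(present))
            present.add(p)
            totals[p] += value(frozenset(present)) - before
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_score(tokens):
    """Hypothetical sentiment score of a subset of tokens (not a real model)."""
    score = 0.0
    if "great" in tokens:
        score += 2.0
    if "not" in tokens:
        score -= 0.5
    if "not" in tokens and "great" in tokens:
        score -= 3.0  # interaction: "not great" flips the sentiment
    return score


tokens = ["not", "great", "film"]
phi = shapley_values(tokens, toy_score)
print(phi)
```

The values add up to the difference between the score of the full input and the score of the empty input, which is the additivity property that lets SHAP split a prediction into per-token contributions.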

## Part 4: Fine-tuning DistilBERT using the IMDb dataset

That was a toy example. Now let's move on to sentiment analysis of the IMDb dataset.

Since the DistilBERT we are using was fine-tuned on the Stanford Sentiment Treebank (SST-2) rather than on IMDb movie reviews, we fine-tune it further on IMDb movie reviews.


In [None]:
!pip install -q datasets
!pip install -q transformers
!pip install -q evaluate


In [None]:
from datasets import load_dataset

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

import evaluate
import numpy as np

In [None]:
imdb = load_dataset("imdb")

In [None]:
imdb["test"][0] # examine the first instance in test

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

In [None]:
imdb.shape # inspect dimensions full data

{'train': (25000, 2), 'test': (25000, 2), 'unsupervised': (50000, 2)}

Because fine-tuning on the entire IMDb dataset would be too resource-intensive to run in this practical, we will work with a randomly sampled 10% of the original train and test dataset size.

In [None]:
from datasets import DatasetDict

# Build a new DatasetDict so the full `imdb` object is left untouched
# (`imdb_sample = imdb` would only create a second name for the same object).
imdb_sample = DatasetDict({
    split: ds.shuffle(seed=42).select(range(int(0.1 * len(ds))))  # random 10% of each split
    for split, ds in imdb.items()
})



In [None]:
imdb_sample.shape

{'train': (2500, 2), 'test': (2500, 2), 'unsupervised': (5000, 2)}

The next step is to load a DistilBERT tokenizer to preprocess the text field:

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT’s maximum input length:

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [None]:
tokenized_imdb = imdb_sample.map(preprocess_function, batched=True)


Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation metric with the 🤗 `Evaluate` library. For this task, load the accuracy metric.

In [None]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred  # model logits and true labels
    predictions = np.argmax(predictions, axis=1)  # logits -> predicted class ids
    return accuracy.compute(predictions=predictions, references=labels)
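
What `compute_metrics` does can be reproduced without any libraries. The following is a minimal sketch with made-up logits and labels; the helper name `argmax_accuracy` is hypothetical and only mirrors the argmax-then-compare logic of the function above.

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per row and compare with the true labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up example: four reviews, two classes (0 = NEGATIVE, 1 = POSITIVE).
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

print(argmax_accuracy(toy_logits, toy_labels))
```

Here the third review is misclassified, so three of the four predictions are correct.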


In [None]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(

    "distilbert-base-uncased-finetuned-sst-2-english", num_labels=2, id2label=id2label, label2id=label2id

)

In [None]:
training_args = TrainingArguments(
    output_dir="tuned_model",  # where checkpoints and the final model are saved
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # evaluate on the test sample after every epoch
    logging_steps=100,
    save_strategy="epoch",  # must match the evaluation strategy for load_best_model_at_end
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,  # also enables dynamic padding of each batch
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model()  # writes the fine-tuned model to output_dir



| Epoch | Training Loss | Validation Loss | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|
| 1     | 0.201         | 0.35047         | 0.8852   |
| 2     | 0.1295        | 0.321993        | 0.9056   |
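
As a quick sanity check on the size of this training run, the number of optimizer steps follows from the sample size, batch size and number of epochs. The sketch assumes a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept; the helper name `training_steps` is our own.

```python
import math


def training_steps(n_examples, batch_size, epochs):
    """Return (steps per epoch, total steps) for a simple training loop."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # last partial batch counts
    return steps_per_epoch, steps_per_epoch * epochs


per_epoch, total = training_steps(n_examples=2500, batch_size=16, epochs=2)
print(per_epoch, total)
```

With `logging_steps = 100`, the training loss is therefore logged only a few times over the whole run.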


In [None]:
classifier = pipeline("sentiment-analysis", model="tuned_model")
classifier("The movie was good.")

[{'label': 'POSITIVE', 'score': 0.9479438066482544}]

## Remember: Be on the lookout for bias and other limitations!

Pre-trained transformer models have been made available for many different tasks and by many different people. It is important to be aware that there may be bias and other limitations in the models that could affect your results.

DistilBERT is known to produce biased predictions that target underrepresented populations. For instance, for sentences like `This film was filmed in COUNTRY`, a DistilBERT binary sentiment classifier will give radically different probabilities for the positive label depending on the country (0.89 if the country is France, but 0.08 if the country is Afghanistan), even though nothing in the input indicates such a strong semantic shift.

See:

[Risks, Limitations and Biases](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english#risks-limitations-and-biases)

[Aurélien Géron's Sentiment Bias Map](https://colab.research.google.com/gist/ageron/fb2f64fb145b4bc7c49efc97e5f114d3/biasmap.ipynb)

In [None]:
sentiment_pipeline("French movie")

[{'label': 'POSITIVE', 'score': 0.9987333416938782}]

In [None]:
sentiment_pipeline("Yemeni movie")

[{'label': 'POSITIVE', 'score': 0.5799139142036438}]
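
Such comparisons can be made more systematic by filling one template with different words and looking at the spread of the scores. The sketch below shows only the bookkeeping. The function names are hypothetical, and `stand_in_score` is a placeholder that is NOT a sentiment model: it returns a deterministic number based on text length so that the example runs without downloading anything.

```python
def probe_template(score_fn, template, fillers):
    """Score one template with different fillers and report the spread.

    score_fn is any function that maps a text to P(positive) in [0, 1].
    """
    scores = {filler: score_fn(template.format(filler)) for filler in fillers}
    spread = max(scores.values()) - min(scores.values())
    return scores, spread


def stand_in_score(text):
    # Placeholder only: not a real classifier, just a number derived from
    # the text length so the sketch is self-contained.
    return (len(text) % 10) / 10


scores, spread = probe_template(
    stand_in_score, "This film was filmed in {}.", ["France", "Afghanistan"]
)
print(scores, spread)
```

In the notebook, you could replace `stand_in_score` with a small function that calls `sentiment_pipeline` and returns the score when the label is `POSITIVE` and one minus the score otherwise. A large spread for fillers that should be sentiment-neutral is a warning sign.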

When in doubt, fine-tune and use feature importance measures!

## Further reading / materials



*   How to fine-tune a model: https://huggingface.co/docs/transformers/training



## Credits

Many code and quote blocks are adapted from the [HuggingFace Documentation website](https://huggingface.co/docs/). The website contains a lot of additional information and is a great resource for learners.