<a href="https://colab.research.google.com/github/danadria/Skills-Lab-Introduction-to-Transformers-BERT-and-Explainable-NLP/blob/main/skills_lab_sentiment_analysis_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis in movie reviews using BERT

## BERT off-the-shelf sentiment analysis pipeline (Hugging Face)

In [1]:
# Hugging Face transformers pipeline using DistilBERT fine-tuned on SST-2, the Stanford Sentiment Treebank movie-review task in GLUE (https://huggingface.co/datasets/glue)
!pip install -q transformers
from transformers import pipeline
# DistilBERT is faster than BERT with similar performance
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")



Movie reviews for "The Menu" (2022) on IMDb

In [2]:
# positive review 8/10 (https://www.imdb.com/review/rw8682076/?ref_=tt_urv)
review1 = "The Menu isn't the first to satirise the rich and their incompetence and isn't saying anything new but that definitely doesn't prevent it from being a great satire that pokes fun at everything it can in ways that are often consistently funny, playful and extremely stylish. Ralph Fiennes gives a terrific performance full of awkward unease that only enhances his commanding screen presence. Anya Taylor-Joy is a perfect audience surrogate amongst a sea of deliberately unlikeable characters of which the best is Nicholas Hoult whose almost too good at making his character hilariously pathetic. Mark Mylod's direction is excellent, the film has more than enough visual style to match the pretentiousness of its characters and is really good at building tension. The music by Colin Stetson is fantastic, striking a unusual balance between beautiful and unnerving."
sentiment_pipeline(review1)

[{'label': 'POSITIVE', 'score': 0.9993577599525452}]

In [3]:
# negative review 4/10 (https://www.imdb.com/review/rw8693249/?ref_=tt_urv)
review2 = "This looked like an interesting film based on the trailer and the first half of it was just that. The tension and suspense was building nicely. There were little dribs and drabs and hints of what might be coming without being too obvious. The acting from everyone in the film was good. Even supporting characters with only a few lines. Were well realized I remember thinking that I couldn't wait to see where it was all going. Sadly it didn't really go anywhere. It all unwound in the second half. The acting was still on but the writing failed. That's the most i can say without giving up any spoilers. And that was extra disappointing because the first half was so good. This Menu did not deliver the meal as advertised."
sentiment_pipeline(review2)

[{'label': 'NEGATIVE', 'score': 0.9622442722320557}]
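
The `score` in each result is the probability the classifier assigns to the predicted label, so a value near 1.0 signals high confidence rather than especially strong sentiment. Below is a minimal sketch of how two raw class scores (logits) become a label and a score, assuming a two-class softmax. The logit values are hypothetical, and the index-to-label order (0 = NEGATIVE, 1 = POSITIVE) is an assumption about this checkpoint's configuration.

```python
import math

# Assumed label order for this checkpoint (0 = NEGATIVE, 1 = POSITIVE).
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits):
    """Return a pipeline-style dict: the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

# Hypothetical logits for one review, ordered [NEGATIVE, POSITIVE].
example_logits = [-3.5, 3.8]
prediction = to_prediction(example_logits)
print(prediction)
```

With real inputs, the logits would come from the model's forward pass (for example `model(**inputs).logits` in PyTorch) instead of being typed in by hand.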

## Going step by step - taking the pipeline apart

In [4]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english") 

In [5]:
tokens = tokenizer([review1, review2])
print(tokens['input_ids'][0])
print(tokenizer.convert_ids_to_tokens(tokens['input_ids'][0]))

[101, 1996, 12183, 3475, 1005, 1056, 1996, 2034, 2000, 2938, 15735, 3366, 1996, 4138, 1998, 2037, 4297, 25377, 12870, 5897, 1998, 3475, 1005, 1056, 3038, 2505, 2047, 2021, 2008, 5791, 2987, 1005, 1056, 4652, 2009, 2013, 2108, 1037, 2307, 18312, 2008, 26202, 2015, 4569, 2012, 2673, 2009, 2064, 1999, 3971, 2008, 2024, 2411, 10862, 6057, 1010, 18378, 1998, 5186, 2358, 8516, 4509, 1012, 6798, 10882, 24336, 2015, 3957, 1037, 27547, 2836, 2440, 1997, 9596, 27880, 2008, 2069, 11598, 2015, 2010, 7991, 3898, 3739, 1012, 21728, 4202, 1011, 6569, 2003, 1037, 3819, 4378, 7505, 21799, 5921, 1037, 2712, 1997, 9969, 4406, 3085, 3494, 1997, 2029, 1996, 2190, 2003, 6141, 7570, 11314, 3005, 2471, 2205, 2204, 2012, 2437, 2010, 2839, 26316, 2135, 17203, 1012, 2928, 2026, 4135, 2094, 1005, 1055, 3257, 2003, 6581, 1010, 1996, 2143, 2038, 2062, 2084, 2438, 5107, 2806, 2000, 2674, 1996, 3653, 6528, 20771, 2791, 1997, 2049, 3494, 1998, 2003, 2428, 2204, 2012, 2311, 6980, 1012, 1996, 2189, 2011, 6972, 26261, 25
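
Several words in the review map to more than one id because BERT-style tokenizers split words that are not in the vocabulary into sub-word pieces (WordPiece), marking continuation pieces with `##`. The sketch below illustrates the greedy longest-match-first idea on a tiny, hypothetical vocabulary. It is not the library's implementation, and the real vocabulary may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"the", "menu", "styl", "##ish", "un", "##like", "##able"}

print(wordpiece("menu", TOY_VOCAB))
print(wordpiece("stylish", TOY_VOCAB))
print(wordpiece("unlikeable", TOY_VOCAB))
print(wordpiece("xyz", TOY_VOCAB))
```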

Special tokens

[CLS] - 101 Beginning of input

[SEP] - 102 End of input, or separator between sentences

[MASK] - 103 Masked token the model should predict

[PAD] - 0 Padding

[UNK] - 100 Unknown token, not in the tokenizer's vocabulary
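
To make the role of these ids concrete, here is a small sketch of how a sequence could be wrapped in `[CLS]` and `[SEP]` and padded to a fixed length, together with the attention mask that tells the model which positions are real tokens. The helper name `add_special_tokens` is hypothetical. In practice the tokenizer does this itself when called with padding enabled.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids from the list above

def add_special_tokens(token_ids, max_length):
    """Wrap ids in [CLS] ... [SEP], truncate if needed, then pad to max_length."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 = padding
    }

# 1996 and 12183 are the first two word ids in the output printed above.
encoded = add_special_tokens([1996, 12183], max_length=6)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```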

## Be on the lookout for bias and other limitations

(https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english#risks-limitations-and-biases)

In [6]:
sentiment_pipeline("French movie")

[{'label': 'POSITIVE', 'score': 0.9987333416938782}]

In [7]:
sentiment_pipeline("Yemeni movie")

[{'label': 'POSITIVE', 'score': 0.5799139142036438}]
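
The two phrases differ only in the nationality, yet the confidence drops from about 0.999 to about 0.58. One way to check this more systematically is to fill a fixed template with different words and compare signed scores. The sketch below assumes a `classify` callable that takes a list of strings and returns one pipeline-style dict per input. The helper names are hypothetical, and the stand-in classifier only replays the two outputs recorded above so that the example runs without downloading a model.

```python
def signed_score(result):
    """Map a pipeline-style result to [-1, 1]; NEGATIVE labels give negative values."""
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]

def probe_template(classify, template, fillers):
    """Score one template with different fillers and rank from least to most positive."""
    texts = [template.format(f) for f in fillers]
    results = classify(texts)
    scored = zip(fillers, (signed_score(r) for r in results))
    return sorted(scored, key=lambda pair: pair[1])

# Stand-in classifier replaying the outputs recorded in this notebook.
RECORDED = {
    "French movie": {"label": "POSITIVE", "score": 0.9987333416938782},
    "Yemeni movie": {"label": "POSITIVE", "score": 0.5799139142036438},
}

def replay_classifier(texts):
    return [RECORDED[t] for t in texts]

ranking = probe_template(replay_classifier, "{} movie", ["French", "Yemeni"])
for filler, score in ranking:
    print(f"{filler:>8}: {score:+.3f}")
```

In the notebook, `sentiment_pipeline` could be passed in place of `replay_classifier`, with a longer list of fillers, since it accepts a list of strings and returns one result per input.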

Feature importance with SHAP