# Case study: the IMDb dataset

Foundations of Data Science

## How to use this material

-   The coding is an important part of this lecture and I will comment
    on the most relevant aspects.

-   The analyses made in these slides are fully
    <span style="color:DarkOrange">reproducible</span>. You are
    encouraged to:

    -   Create your own Python notebook and reproduce the results, using
        the code provided in these slides.
    -   Be independent. If something does not work, try to fix it before
        asking.
    -   Be creative. Modify the code and explore the data yourself.

## Textual data

-   Data cleaning and pre-processing
    -   <span style="color:DarkOrange">Cleaning the data</span>. Removal
        of abbreviations, HTML/XML tags (if any), special symbols, and
        other jargon.
    -   <span style="color:DarkOrange">Tokenization and stemming</span>.
        Tokenization means that we split sentences into “tokens”, i.e.
        words. Stemming means that words are reduced to their root, i.e.
        without suffixes.
-   Data analysis
    -   <span style="color:DarkOrange">Term document matrix
        (TDM)</span>. Textual data are converted into a matrix with
        numeric entries. Frequency and related quantities (tf-idf) are
        obtained.
    -   <span style="color:DarkOrange">Unsupervised sentiment
        analysis</span>. Classification of the sentiment of each
        document using dictionaries and no human intervention.

## Getting started

-   First, let us load the relevant Python packages.

-   The [`nltk` (Natural Language
    Toolkit)](https://www.nltk.org/index.html) package is the main tool
    we will use in this notebook.

-   You should already be familiar with the `pandas` package at this
    stage.

In [1]:
# Loading the necessary packages
import pandas as pd
import re # Package for regular expressions
import nltk # Main python package for natural language processing

## String manipulation I

-   Let us recap some basic notions about
    <span style="color:DarkOrange">strings</span>.

-   An example of a string is the following:

In [2]:
monty = 'Monty Python'
monty

'Monty Python'

-   You can use the `"` symbol instead of `'`, which is necessary when
    the string itself contains an apostrophe.

In [3]:
circus = "Monty Python's Flying Circus"
circus

"Monty Python's Flying Circus"

## String manipulation II

-   An alternative syntax is the following:

In [4]:
circus = 'Monty Python\'s Flying Circus'
circus

"Monty Python's Flying Circus"

-   Elements of a string can be accessed as if they were a list:

In [5]:
monty[0] # First character of the string

'M'

In [6]:
monty[0:5] # From 0 to 5 (not including 5)

'Monty'

## String manipulation III

-   Strings can be combined using the `+` operator, and repeated using
    the `*` operator:

In [7]:
'hello' + 'very' + ' ' + 'much'

'hellovery much'

In [8]:
'very' * 3

'veryveryvery'

-   A string can be split, using space as the separator.

In [9]:
monty_list = monty.split()
monty_list

['Monty', 'Python']

-   This is similar in spirit to
    <span style="color:DarkOrange">tokenization</span>.

-   A list containing strings can be joined into a single string, using
    the following syntax:

In [10]:
" ".join(monty_list)

'Monty Python'

## Regular expressions

-   <span style="color:DarkOrange">Regular expressions</span>, sometimes
    shortened to “regex”, are a powerful and flexible method for
    specifying **search patterns**.

-   An example of a regular expression is `([A-Za-z])+`, which matches
    one or more consecutive letters.

-   To use regular expressions in Python, we need the package `re`.

-   There are several online resources about
    <span style="color:DarkOrange">regex</span>.

-   If you want to play around with this kind of syntax, you can visit
    the website <https://regexr.com>.

-   We will not discuss regular expressions in detail in this notebook,
    but it is essential to keep in mind that they are often at the core
    of more advanced functions.
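As a brief illustration of search patterns, the following sketch uses only the standard `re` module; the example strings are made up. It also shows a common pitfall: the range `[A-z]` covers the ASCII characters that sit between `Z` and `a` as well, including the underscore, so `[A-Za-z]` is the safer way to say “any letter”.

```python
import re

text = "Monty Python's Flying Circus (1969)"

# Find every run of one or more letters
words = re.findall(r"[A-Za-z]+", text)
print(words)  # ['Monty', 'Python', 's', 'Flying', 'Circus']

# Find every run of digits
years = re.findall(r"[0-9]+", text)
print(years)  # ['1969']

# Pitfall: [A-z] also matches the underscore (and a few other symbols)
print(re.findall(r"[A-z]+", "snake_case"))     # ['snake_case']
print(re.findall(r"[A-Za-z]+", "snake_case"))  # ['snake', 'case']
```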

## The IMDb dataset

-   In this notebook, we will analyze a
    <span style="color:DarkOrange">subset</span> of the [**Large Movie
    Review Dataset**](https://ai.stanford.edu/~amaas/data/sentiment/).

-   This dataset is associated with the paper

> Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y.
> Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment
> Analysis. *The 49th Annual Meeting of the Association for
> Computational Linguistics (ACL 2011)*.

-   The `IMDB_small.csv` dataset contains only `200` movie reviews. The
    original dataset is much bigger.

-   Let us import it into python, using `pandas`:

In [11]:
# Let us download the dataset from the course repository
imdb = pd.read_csv('https://datasciencebocconi.github.io/Data/IMDB_small.csv')
imdb.shape # Size of the dataset

(200, 1)

## The IMDb dataset

![](https://datasciencebocconi.github.io/Images/text_mining/imdb.png)

## A glimpse of the data

-   The first `5` reviews (out of `200`) can be displayed as follows:

In [12]:
# Display the first 5 rows of this dataset
imdb.head(5)

-   <span style="color:DarkOrange">Disclaimer</span>. These are
    <span style="color:DarkOrange">real movie reviews</span> written by
    anonymous people over the internet, which means there might be
    <span style="color:DarkOrange">offensive words</span>.

## Document term matrix

-   We wish to transform the data into something like this:

| Document   | Word 1   | Word 2   | Word 3   | …       | Word $p-1$    | Word $p$ |
|------------|----------|----------|----------|---------|---------------|----------|
| Review 1   | $n_{11}$ | $n_{12}$ | $n_{13}$ | $\dots$ | $n_{1,p-1}$   | $n_{1p}$ |
| Review 2   | $n_{21}$ | $n_{22}$ | $n_{23}$ | $\dots$ | $n_{2,p-1}$   | $n_{2p}$ |
| $\vdots$   | $\vdots$ | $\vdots$ | $\vdots$ |         | $\vdots$      | $\vdots$ |
| Review $N$ | $n_{N1}$ | $n_{N2}$ | $n_{N3}$ | $\dots$ | $n_{N,p-1}$   | $n_{Np}$ |

-   Each $n_{ij}$ is the number of times the $j$th word appears in the
    $i$th review.

-   This object is sometimes called
    <span style="color:DarkOrange">document term matrix</span> and it is
    the starting point of most analyses.

-   This is a deceptively simple problem: in practice, it requires a lot
    of <span style="color:DarkOrange">pre-processing</span>.

-   A <span style="color:DarkOrange">bag of words</span>. What is the
    implicit assumption behind this representation?
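To make the target object concrete, here is a minimal sketch that builds such a matrix by hand for two made-up “reviews”, using only the standard library. The documents and variable names are illustrative assumptions, not part of the IMDb data. The last lines hint at the assumption hidden in the bag-of-words representation.

```python
from collections import Counter

# Two made-up "reviews", already lowercased and free of punctuation
docs = ["good film good cast", "bad film"]

# Vocabulary: every distinct word, in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per word of the vocabulary
counts = [Counter(doc.split()) for doc in docs]
dtm = [[count[word] for word in vocabulary] for count in counts]

print(vocabulary)  # ['bad', 'cast', 'film', 'good']
print(dtm)         # [[0, 1, 1, 2], [1, 0, 1, 0]]

# Word order is discarded: a shuffled review yields the same row
shuffled = Counter("cast good film good".split())
print([shuffled[word] for word in vocabulary] == dtm[0])  # True
```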

## HTML and abbreviations I

-   Let us look at a random review, say the 9th.

In [13]:
review = imdb.iloc[8, 0]
review

"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. <br /><br />The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."

-   It is not the most positive review ever written :-)

-   Let us focus on the technical aspects at the moment.

## HTML and abbreviations II

-   There are several weird `<br />` symbols, which are
    <span style="color:DarkOrange">HTML tags</span>.

-   In fact, these movie reviews have been downloaded from the [IMDb
    website](https://www.imdb.com).

-   These tags are not informative, so we need to remove them. A first
    approach is using <span style="color:DarkOrange">regular
    expressions</span>.

-   The following command replaces `<br />` with a blank space.

In [14]:
re.sub(r"<br />", " ", review) # Removes the <br /> tag

"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film.   The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."

## HTML and abbreviations III

-   Albeit useful, the above regular expression fixes only a very
    specific HTML tag.

-   To remove all the HTML parts of the text, we need an
    <span style="color:DarkOrange">HTML parser</span>.

-   Here, we make use of the `BeautifulSoup` package, whose
    [documentation](https://beautiful-soup-4.readthedocs.io/en/latest/)
    is available online.

In [15]:
from bs4 import BeautifulSoup # Load the package

# Removes the <br /> and other HTML tags
def remove_html(data):
    # Name the parser explicitly to avoid a warning and parser-dependent results
    data = BeautifulSoup(data, "html.parser")
    return data.get_text()

review = remove_html(review)
review

"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."

## English abbreviations

-   Another source of concern is the presence of standard English
    abbreviations (contractions such as “it's”), which we want to
    replace with their extended form.

-   We can do this by defining our own
    <span style="color:DarkOrange">dictionary</span>.

-   The following dictionary is by no means exhaustive. Feel free to
    modify it and add other examples.

In [16]:
def remove_abb(review):
    replacements = {
        "ain't": "am not",
        "aren't": "are not",
        "can't": "cannot",
        "could've": "could have",
        "couldn't": "could not",
        "didn't": "did not",
        "doesn't": "does not",
        "don't": "do not",
        "gonna": "going to",
        "hadn't": "had not",
        "hasn't": "has not",
        "haven't": "have not",
        "he'd": "he would",
        "he'll": "he will",
        "he's": "he is",
        "how'd": "how did",
        "how'll": "how will",
        "how's": "how is",
        "I'd": "I would",
        "I'll": "I will",
        "I'm": "I am",
        "I've": "I have",
        "isn't": "is not",
        "it'd": "it would",
        "it'll": "it will",
        "it's": "it is",
        "Its" : "It is",
        "let's": "let us",
        "mightn't": "might not",
        "mustn't": "must not",
        "shan't": "shall not",
        "she'd": "she would",
        "she'll": "she will",
        "she's": "she is",
        "should've": "should have",
        "shouldn't": "should not",
        "that's": "that is",
        "there's": "there is",
        "they'd": "they would",
        "wanna" : "want to",
        "We're" : "We are"
    }
    for key, value in replacements.items():
        review = re.sub(r"{}".format(key), value, review)
    return review
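The loop in `remove_abb` matches each key anywhere in the text and is case-sensitive. As a possible refinement, the sketch below compiles a single pattern with word boundaries and ignores case, so that `he's` is not matched inside `she's` and a capitalized `It's` is also expanded. The function name `expand_contractions` and the short mapping are hypothetical, and the results may differ from those of `remove_abb` on some reviews.

```python
import re

# Hypothetical, deliberately short mapping (lowercase keys); extend as needed
CONTRACTIONS = {"it's": "it is", "he's": "he is", "can't": "cannot", "i've": "i have"}

# One alternation of all keys, wrapped in word boundaries (\b)
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        # Keep the capital letter if the contraction started with one
        if found[0].isupper():
            return expanded[0].upper() + expanded[1:]
        return expanded
    return PATTERN.sub(_replace, text)

print(expand_contractions("It's true that she's here and he's not."))
# It is true that she's here and he is not.
```

Note that `she's` is left untouched here only because it is missing from the short mapping; it is no longer rewritten by accident through the `he's` entry.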

-   We apply this function to the current review:

In [17]:
review = remove_abb(review)
review

"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I have seen 950+ films and this is truly one of the worst of them - it is awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."

## Normalization

-   We now convert the text to
    <span style="color:DarkOrange">lowercase</span>.

-   This can be done by using the `.lower()` method:

In [18]:
review = review.lower()
review

"encouraged by the positive comments about this film on here i was looking forward to watching this film. bad mistake. i have seen 950+ films and this is truly one of the worst of them - it is awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). the film looks cheap and nasty and is boring in the extreme. rarely have i been so happy to see the end credits of a film. the only thing that prevents me giving this a 1-score is harvey keitel - while this is far from his best performance he at least seems to be making a bit of an effort. one for keitel obsessives only."

## Tokenization

-   <span style="color:DarkOrange">Tokenization</span> is the task of
    cutting a string into linguistic units that constitute a piece of
    language data.

-   Tokenization is performed using specialized functions, such as the
    `word_tokenize` function of the `nltk` Python package:

In [19]:
review_tokens = nltk.word_tokenize(review) # Perform tokenization
review_tokens[140:] # Shows the last tokens

['one', 'for', 'keitel', 'obsessives', 'only', '.']

-   Composite words (e.g., “San Siro”) should be treated as a single
    token, but `word_tokenize` fails to recognize them.
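One simple way to handle composite words is to merge known pairs after tokenization. The sketch below is a hand-rolled illustration with a made-up phrase list; the helper `merge_bigrams` is hypothetical. The `nltk` package offers a similar facility, `nltk.tokenize.MWETokenizer`, which is not used in these slides.

```python
def merge_bigrams(tokens, phrases):
    """Merge adjacent tokens that form a known two-word expression."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            # Known composite word: emit it as a single token
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Made-up example; the phrase list must be supplied by the analyst
tokens = ["the", "san", "siro", "stadium"]
print(merge_bigrams(tokens, {("san", "siro")}))
# ['the', 'san siro', 'stadium']
```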

## Special symbols and punctuation

-   In our analyses, we wish to focus on **words**, therefore we
    **delete** commas, dots, and other special symbols such as `!@#*`.

-   This is a simplification, because
    <span style="color:DarkOrange">punctuation</span> can be very
    informative.

In [20]:
# Retain a token only if it is purely alphabetic (letters only)
review_tokens = [words for words in review_tokens if words.isalpha()]
review_tokens[115:]

['of', 'an', 'effort', 'one', 'for', 'keitel', 'obsessives', 'only']
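The filter relies on `str.isalpha()`, which keeps a token only if all its characters are letters, so numbers are discarded together with punctuation. A small check on made-up tokens shows the difference with `str.isalnum()`:

```python
tokens = ["seen", "950+", "films", "1-score", "950", "keitel", "."]

# str.isalpha() is True only if every character is a letter
print([t for t in tokens if t.isalpha()])  # ['seen', 'films', 'keitel']

# str.isalnum() would also keep purely numeric tokens such as '950'
print([t for t in tokens if t.isalnum()])  # ['seen', 'films', '950', 'keitel']
```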

## Filtering stopwords

-   In many languages, there are
    <span style="color:DarkOrange">high-frequency words</span> that have
    no meaning on their own, such as conjunctions and articles.

-   These tokens are called
    <span style="color:DarkOrange">stopwords</span> and we wish to
    eliminate them.

-   A list of stopwords is conveniently stored in the `nltk.corpus`
    package, as shown below.

In [21]:
from nltk.corpus import stopwords
stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

-   Stopwords can be removed as follows:

In [22]:
review_tokens = [words for words in review_tokens if words not in stopwords.words('english')]
review_tokens[:6]

['encouraged', 'positive', 'comments', 'film', 'looking', 'forward']
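The same filtering idea can be sketched with a tiny, made-up stopword list. Storing the stopwords in a `set` makes each membership test fast; with the real list, one could similarly build `set(stopwords.words('english'))` once and reuse it, rather than rebuilding the list for every token.

```python
# Hypothetical mini stopword list; the English list in nltk is much longer
stop_words = {"the", "a", "of", "and", "is", "to"}

tokens = ["the", "film", "is", "a", "waste", "of", "time"]

# Keep only the tokens that are not stopwords
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)  # ['film', 'waste', 'time']
```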

## Stemming I

-   Stemming reduces each word to its
    <span style="color:DarkOrange">root</span> by deleting suffixes.
    This shrinks the vocabulary and avoids counting variants of the
    same word as distinct tokens.

-   Stemming is performed using the `SnowballStemmer` class.

-   Other stemmers are available in the `nltk` package; please see the
    [documentation](https://www.nltk.org/howto/stem.html) for further
    info.

-   Stemmers are language-dependent: we need to specify that the reviews
    are written in English.

## Stemming II

-   Let us see what the effect of a stemmer on a couple of words is:

In [23]:
nltk.SnowballStemmer("english").stem("films")

'film'

In [24]:
nltk.SnowballStemmer("english").stem("filmed")

'film'

-   We now apply stemming to the full review:

In [25]:
review_tokens = [nltk.SnowballStemmer("english").stem(words) for words in review_tokens]
review_tokens[:8]

['encourag', 'posit', 'comment', 'film', 'look', 'forward', 'watch', 'film']

-   Finally, we convert the tokens into a single string:

In [26]:
from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenizer = TreebankWordDetokenizer()

# Create a "review" from the stemmed tokens
detokenizer.detokenize(review_tokens)

'encourag posit comment film look forward watch film bad mistak seen film truli one worst aw almost everi way edit pace storylin soundtrack film song lame countri tune play less four time film look cheap nasti bore extrem rare happi see end credit film thing prevent give harvey keitel far best perform least seem make bit effort one keitel obsess'
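To see why stemming shrinks the vocabulary, consider the toy suffix stripper below. It is emphatically not the Snowball algorithm, which applies many more rules; the function `naive_stem` and its suffix list are assumptions made purely for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Toy stemmer: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["film", "films", "filmed", "filming"]

# Four distinct tokens collapse into a single stem
print({naive_stem(w) for w in words})  # {'film'}
```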

## Putting pieces together

In [27]:
# 1st round of pre-processing
def basic_cleaning(review):
    review = remove_html(review) # Remove HTML
    review = remove_abb(review) # Expand abbreviations
    return review

# 2nd round of pre-processing
def advanced_cleaning(review):

    # Basic cleaning (HTML + abbreviations)
    review = basic_cleaning(review)

    # Normalization
    review = review.lower()

    # Tokenization
    review_tokens = nltk.word_tokenize(review)

    # Special symbols and punctuation: keep alphabetic tokens only
    review_tokens = [words for words in review_tokens if words.isalpha()]

    # Filtering: build the stopword set once, not once per token
    stop_words = set(stopwords.words('english'))
    review_tokens = [words for words in review_tokens if words not in stop_words]

    # Stemming: create the stemmer once, not once per token
    stemmer = nltk.SnowballStemmer("english")
    review_tokens = [stemmer.stem(words) for words in review_tokens]

    # Conversion to a single string
    review = detokenizer.detokenize(review_tokens)
    return review

## Original document

-   For instance, the
    <span style="color:DarkOrange">original document</span> is:

In [28]:
# Original document
imdb.iloc[4,0]

'Petter Mattei\'s "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler\'s play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case wit

## Basic cleaning

-   The document after <span style="color:DarkOrange">basic
    cleaning</span> is:

In [29]:
# Basic cleaning
basic_cleaning(imdb.iloc[4,0])

'Petter Mattei\'s "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. This being a variation on the Arthur Schnitzler\'s play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we 

-   The document after <span style="color:DarkOrange">stemming</span>
    is:

In [30]:
# After stemming
advanced_cleaning(imdb.iloc[4,0])

'petter mattei love time money visual stun film watch mattei offer us vivid portrait human relat movi seem tell us money power success peopl differ situat encount variat arthur schnitzler play theme director transfer action present time new york differ charact meet connect one connect one way anoth next person one seem know previous point contact stylish film sophist luxuri look taken see peopl live world live thing one get soul pictur differ stage loneli one inhabit big citi exact best place human relat find sincer fulfil one discern case peopl act good mattei direct steve buscemi rosario dawson carol kane michael imperioli adrian grenier rest talent cast make charact come wish mattei good luck await anxious next work'

## Global operations

-   We apply the `basic_cleaning` and `advanced_cleaning` functions to
    all the reviews.

-   Create two new variables in the dataset: `review_clean` and
    `review_token`.

In [31]:
# This could take a while
imdb['review_clean'] = imdb['review'].apply(basic_cleaning)
imdb['review_token'] = imdb['review'].apply(advanced_cleaning)

imdb.head(2)

## Term document matrix

-   We want to know which are the <span style="color:DarkOrange">most
    common words</span>. This can quickly be done with the following
    chunk of code:

In [32]:
# Put everything into a single string
words  = ' '.join(imdb['review_token'])
# Create a global tokenization
tokens = nltk.word_tokenize(words)

# Conversion to "text"
text = nltk.Text(tokens)
# Compute the most common words
fdist = nltk.FreqDist(text)

# Use pandas for organizing and displaying the results
df_words = pd.DataFrame(list(fdist.items()), columns = ["Word","Frequency"])
# Order words from the most frequent
df_words = df_words.sort_values(by = "Frequency", ascending = False)

# Dimension of the dataset
df_words.shape

(5294, 2)

-   Hence, we obtained `5294` different stems.
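The counting step can also be illustrated with the standard library alone. The sketch below uses three made-up stemmed reviews. Since `nltk.FreqDist` is a subclass of `collections.Counter`, methods such as `most_common` are available on `fdist` as well.

```python
from collections import Counter

# Made-up stemmed reviews, one string per document
reviews = ["film look cheap", "film bore", "good film good cast"]

# Count every token across the whole corpus
freq = Counter(" ".join(reviews).split())

print(len(freq))            # 6
print(freq.most_common(2))  # [('film', 3), ('good', 2)]
```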

## Most frequent words

In [33]:
df_words.head(10)

## Document term matrix I

-   Our original goal was to obtain something like the following:

| Document   | Word 1   | Word 2   | Word 3   | …       | Word $p-1$    | Word $p$ |
|------------|----------|----------|----------|---------|---------------|----------|
| Review 1   | $n_{11}$ | $n_{12}$ | $n_{13}$ | $\dots$ | $n_{1,p-1}$   | $n_{1p}$ |
| Review 2   | $n_{21}$ | $n_{22}$ | $n_{23}$ | $\dots$ | $n_{2,p-1}$   | $n_{2p}$ |
| $\vdots$   | $\vdots$ | $\vdots$ | $\vdots$ |         | $\vdots$      | $\vdots$ |
| Review $N$ | $n_{N1}$ | $n_{N2}$ | $n_{N3}$ | $\dots$ | $n_{N,p-1}$   | $n_{Np}$ |

-   This is now an <span style="color:DarkOrange">easy task</span>,
    because documents (reviews) have been tokenized and stemmed.

-   In practice, we will make use of the `CountVectorizer` class of the
    `sklearn` Python package.

## Document term matrix II

-   The total number of distinct stems we obtained after cleaning is
    `5294`.

-   We consider only a fraction of them ($p = 500$): those having higher
    frequencies.

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

# Creation of a TDM with p = 500 words
vectorizer = CountVectorizer(max_features = 500)
X = vectorizer.fit_transform(imdb['review_token'])
word_names = list(vectorizer.get_feature_names_out())

# Conversion to dataframe
X = pd.DataFrame(X.toarray())
# Renaming columns according to words
X.columns = word_names

-   The `CountVectorizer` class performs more operations than we
    need.

-   For example, it silently converts the text to lowercase. It can
    also remove stopwords.

-   These operations are <span style="color:DarkOrange">redundant</span>
    in our case.

-   Please refer to the [official
    documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
    for further details.

## Document term matrix III

In [36]:
X.head(8)

## TF-IDF transformation I

-   Sometimes, one might be interested in obtaining a variation of the
    former TDM, which is based on the so-called
    <span style="color:DarkOrange">term frequency - inverse document
    frequency</span>.

-   Each $n_{ij}$ is the number of times the $j$th word appears in the
    $i$th review, for $i = 1,\dots,N$ and $j = 1,\dots,p$.

-   Let us define the following quantity:

$$
N_j = \sum_{i=1}^N I(n_{ij} > 0) =  \text{("Number of documents containing the j-th word")}.
$$

-   Moreover, we define:

$$
n_{i \cdot} = \sum_{j=1}^p n_{ij} = \text{("Number of words in the i-th document")}.
$$

## TF-IDF transformation II

-   The so-called <span style="color:DarkOrange">term frequencies</span>
    (TF) are just the fraction of times the $j$th word appears in the
    $i$th document, that is:

$$
f_{ij} = \frac{n_{ij}}{n_{i \cdot}}.
$$

-   The <span style="color:DarkOrange">inverse document frequency</span>
    (IDF) is a measure of how much information the word provides, that
    is, whether it is common or rare across all documents. It is
    defined as follows:

$$
\text{IDF}_{j} = \log\left({\frac{N}{N_j}}\right).
$$

## TF-IDF transformation III

-   The <span style="color:DarkOrange">term frequency-inverse document
    frequency</span> is then defined as:

$$
\text{TF-IDF}_{ij} = f_{ij} \times \text{IDF}_j.
$$

-   In other words, the TF-IDF is just a weighted version of the
    original frequencies $n_{ij}$, accounting for the fact that certain
    words are “more relevant” (i.e., rare) than others.

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Creation of a TDM TF-IDF with p = 500 words
vectorizer = TfidfVectorizer(max_features = 500)
X = vectorizer.fit_transform(imdb['review_token'])
word_names = list(vectorizer.get_feature_names_out())

# Conversion to dataframe
X = pd.DataFrame(X.toarray())
# Renaming columns according to words
X.columns = word_names
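The definitions of $N_j$, $f_{ij}$ and $\text{IDF}_j$ can be checked by hand on a tiny made-up count matrix, as in the sketch below. Note that, with its default settings, `TfidfVectorizer` uses a smoothed variant of the IDF and rescales each row to unit Euclidean norm, so its numbers will generally differ from those produced by the textbook formulas.

```python
import math

# Made-up counts n_ij: N = 2 documents (rows), p = 3 words (columns)
counts = [[2, 1, 0],
          [0, 1, 3]]
N, p = len(counts), len(counts[0])

# N_j: number of documents containing the j-th word
# (assumes every word appears in at least one document, so N_j > 0)
N_j = [sum(1 for i in range(N) if counts[i][j] > 0) for j in range(p)]

# IDF_j = log(N / N_j)
idf = [math.log(N / N_j[j]) for j in range(p)]

# TF-IDF_ij = (n_ij / n_i.) * IDF_j
tfidf = [[counts[i][j] / sum(counts[i]) * idf[j] for j in range(p)]
         for i in range(N)]

print(N_j)  # [1, 2, 1]
print([[round(v, 3) for v in row] for row in tfidf])
# [[0.462, 0.0, 0.0], [0.0, 0.0, 0.52]]
```

The second word appears in every document, hence its IDF is $\log(1) = 0$ and it receives zero weight everywhere.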

## TF-IDF transformation IV

In [38]:
X.head(8)

## Sentiment analysis I

-   Sentiment analysis is the practice of understanding the overall
    <span style="color:DarkOrange">opinion</span> (sentiment) of a
    document.

-   It is arguably a very difficult (sometimes impossible) task,
    especially in the presence of complex texts.

-   Here, we showcase a straightforward algorithm for sentiment
    analysis, based on the idea of
    <span style="color:DarkOrange">scoring</span>, called VADER.

-   The [associated
    article](https://ojs.aaai.org/index.php/ICWSM/article/view/14550/14399)
    is:

> Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based
> Model for Sentiment Analysis of Social Media Text. Eighth
> International Conference on Weblogs and Social Media (ICWSM-14). Ann
> Arbor, MI, June 2014.

## Sentiment analysis II

-   The core idea is straightforward: “positive words” are given a
    positive score, and “negative words” a negative one.

-   A human identifies whether these are positive or negative terms.

-   Then, the scores are weighted, manipulated, and summarized through a
    large number of <span style="color:DarkOrange">heuristics</span>.

-   Even though VADER is very simplistic, it is quick to compute and is
    a reasonable starting point for more complex analysis. Let us see it
    in action:

In [39]:
# Download the VADER lexicon if it is not already installed
# nltk.download('vader_lexicon')

## Sentiment analysis III

-   Let us consider the 2nd review in our dataset, which is arguably
    <span style="color:DarkOrange">positive</span>.

In [40]:
review = basic_cleaning(imdb.iloc[1, 0])
review

'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'

In [41]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentiment = SentimentIntensityAnalyzer()
sentiment.polarity_scores(review)

{'neg': 0.055, 'neu': 0.768, 'pos': 0.177, 'compound': 0.9641}

-   The `compound` term is a <span style="color:DarkOrange">standardized
    score</span> between $-1$ and $1$, measuring whether the sentiment
    is negative or positive.

-   In this case, the VADER algorithm correctly identifies the
    sentiment.
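The scoring idea can be sketched in a few lines. This toy version is only an illustration: the lexicon values are hypothetical, and it ignores all of VADER's heuristics (negations, intensifiers, punctuation, capitalization). The squashing step, $x / \sqrt{x^2 + \alpha}$ with $\alpha = 15$, mirrors the normalization that, as far as we can tell, the `nltk` implementation applies to obtain the `compound` score.

```python
import math

# Hypothetical valence scores; the real VADER lexicon is much larger
lexicon = {"amazing": 2.8, "brilliant": 2.8, "awful": -2.0, "bad": -2.5}

def toy_compound(tokens, alpha=15):
    # Sum the valence of every token found in the lexicon
    total = sum(lexicon.get(token, 0.0) for token in tokens)
    # Squash the sum into the interval (-1, 1)
    return total / math.sqrt(total * total + alpha)

print(toy_compound("an amazing and brilliant show".split()) > 0)  # True
print(toy_compound("an awful and bad show".split()) < 0)          # True

# A few strongly positive words can outweigh a negative one
print(toy_compound("amazing brilliant but awful".split()) > 0)    # True
```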

## Comparison with ChatGPT

-   In some other cases, the VADER algorithm
    <span style="color:DarkOrange">fails badly</span>:

In [42]:
review = basic_cleaning(imdb.iloc[7, 0])
review

"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it is continued its decline further to the complete waste of time it is today.It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I cannot believe it is still on the air."

-   Despite the quite negative review, the VADER algorithm produces the
    following output:

In [43]:
sentiment.polarity_scores(review)

{'neg': 0.149, 'neu': 0.654, 'pos': 0.197, 'compound': 0.8596}

This score is clearly wrong: a `compound` value of about `0.86`
indicates a strongly positive review. Perhaps the words `amazing`,
`fresh` and `innovative` (and others) misled the algorithm.

## Comparison with ChatGPT

-   [ChatGPT](https://openai.com/blog/chatgpt/) returns this:

![](https://datasciencebocconi.github.io/Images/text_mining/chatbot.png)