# The dataset

The IMDB sentiment dataset is a collection of 50K movie reviews, annotated as positive or negative, and split into two sets of equal size: a training set and a test set. Both sets have an equal number of positive and negative reviews. The dataset is available in several libraries, but we ask that you use the HuggingFace [datasets](https://huggingface.co/datasets/imdb) version. Follow their [tutorial](https://huggingface.co/docs/datasets/load_hub) on how to use the library for more details.

Download and look at the dataset, and answer the following questions.
1. How many splits does the dataset have? (1 point)
2. How big are these splits? (1 point)
3. What is the proportion of each class on the supervised splits? (1 point)

In [1]:
!pip install datasets



In [4]:
!pip install spacy

Collecting spacy
  Downloading spacy-3.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.2-py3-none-any.whl (27 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.8-py3-none-any.whl (17 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting srsly<3.0.0,>=2.4.3
  Downloading srsly-2.4.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (461 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Downloading murmurhash-1.0.8-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl 

In [5]:
!python3 -m spacy download en_core_web_sm

2022-09-15 10:59:44.611559: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-09-15 10:59:49.556915: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-09-15 10:59:49.556975: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (yacine-ROG-Strix-G533ZW-G533ZW): /proc/driver/nvidia/version does not exist
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-cor

In [33]:
from datasets import load_dataset_builder
from datasets import load_dataset
import pandas as pd
import numpy as np

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 
import re
import spacy

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

SEED = 42

In [34]:
database_name = "imdb"
ds_builder = load_dataset_builder(database_name)
print(ds_builder.info.description)
print(ds_builder.info.features)

dataset = load_dataset(database_name)

Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}


In [7]:
from datasets import get_dataset_split_names
print("Split names", get_dataset_split_names(database_name))
dataset

Split names ['train', 'test', 'unsupervised']


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

We can see that this dataset has 3 splits. The "train" and "test" splits have 25,000 rows each and the "unsupervised" split has 50,000 rows.

In [11]:
# To start, we convert the train and test splits into pandas DataFrames
train = dataset["train"].to_pandas()
test = dataset["test"].to_pandas()

# Then we look at the number of rows in each split
print("Test values count : {0}".format(len(test)))
print("Train values count : {0}".format(len(train)))

Test values count : 25000
Train values count : 25000


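There are as many positive as negative reviews in the supervised splits: each class accounts for 50% of both the train and the test split, i.e. 12,500 reviews per class in each split, or 25,000 occurrences per class over the two splits combined.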
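The row counts printed above do not show the class balance by themselves. A minimal sketch of how the proportion of each class could be checked, using a toy label list in place of `train["label"]` (the `class_proportions` helper and the toy data are illustrative and not part of the original notebook):

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: fraction of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: count / total for cls, count in counts.items()}


# Toy labels standing in for train["label"] (0 = neg, 1 = pos).
toy_labels = [0, 1, 1, 0, 1, 0]
proportions = class_proportions(toy_labels)
print(proportions)
```

On the real data, `train["label"].value_counts(normalize=True)` in pandas should give the same proportions directly.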

## Naive Bayes classifier **(9 points)**

Implement your own naive Bayes classifier (the pseudo code can be found in the slides or the [book reference](https://web.stanford.edu/~jurafsky/slp3/)) or use [one provided by scikit-learn](https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes) combined with a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

Go through the following steps.
1. (2 points) Take a look at the data and create a suitable preprocessing function that at least:
   1. Lowercases the text.
   2. Removes punctuation (you can use `from string import punctuation` to ease your work).
2. (4 points) Implement and train a naive Bayes classifier on the training data. Either:
   * Code your own classifier following the algorithm given in class.
   * Or use a scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline) with a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) classifier. (Recommended)
3. (1 point) Report the accuracy on both training and test set.
4. (1 point) Why is accuracy a sufficient measure of evaluation here?
5. **\[Bonus\]** What are the top 10 most important words (features) for each class? (bonus points)
   1. Look at the words with the highest likelihood in each class (if you use scikit-learn, you want to check `feature_log_prob_`).
   2. Remove stopwords (see [NLTK stopwords corpus](https://pythonspot.com/nltk-stop-words/)) and check again.
6. Take at least 2 wrongly classified examples from the test set and try to explain why the model failed. (1 point)

### Preprocessing

Let's now have a look at our DataFrame and our data.

In [12]:
train.head()

| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


Let's lowercase the text and remove the punctuation.

In [14]:
import utility_functions

# Lowercase the text and remove punctuation
train = utility_functions.preprocess_df(train)
test = utility_functions.preprocess_df(test)

In [15]:
train.head()

| | text | label |
|---|---|---|
| 0 | i rented i am curiousyellow from my video stor... | 0 |
| 1 | i am curious yellow is a risible and pretentio... | 0 |
| 2 | if only to avoid making this type of film in t... | 0 |
| 3 | this film was probably inspired by godards mas... | 0 |
| 4 | oh brotherafter hearing about this ridiculous ... | 0 |
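The `utility_functions` module is not shown in this notebook. As a rough sketch, a text-level preprocessing step consistent with the output above could look like the following (the name `preprocess_text` and the choice of `str.translate` are assumptions, not the actual helper):

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", punctuation)


def preprocess_text(text):
    """Lowercase the text and delete ASCII punctuation characters."""
    return text.lower().translate(_PUNCT_TABLE)


sample = "Oh, brother...after hearing about this"
cleaned = preprocess_text(sample)
print(cleaned)  # oh brotherafter hearing about this
```

Applied to a DataFrame column, this would be something like `df["text"].apply(preprocess_text)`. Note that deleting punctuation without inserting a space glues some words together ("brotherafter") and turns HTML tags such as `<br />` into a stray "br" token, so replacing those tags with a space first may be worth considering.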


Now let's create a Naive Bayes classifier for our dataset.

To do that, we will use a Pipeline with two steps:

- A CountVectorizer that turns each review into a vector of word counts

- A MultinomialNB, our naive Bayes classifier

In [16]:
# Creating our pipeline 
pipeline = Pipeline([('Vect', CountVectorizer()), ('Mnb', MultinomialNB())])

# Fitting our pipeline on the train data (with the labels, as we are doing supervised learning)
pipeline.fit(train['text'], train['label'])

Pipeline(steps=[('Vect', CountVectorizer()), ('Mnb', MultinomialNB())])

In [52]:
# Predicting the results of the test data 
predictions = pipeline.predict(test['text'])

# Computing the F1 score on the test set (a different metric from accuracy)
print("test f1-score", f1_score(test['label'], predictions))

print("test accuracy", pipeline.score(test['text'], test['label']))
print("train accuracy", pipeline.score(train['text'], train['label']))

test f1-score 0.8051338904997442
test accuracy 0.8172
train accuracy 0.91284
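F1 and accuracy are related but distinct metrics. As a rough sketch of how both derive from the confusion counts, here is a small pure-Python version (the `binary_metrics` helper and the toy labels are hypothetical; they only mirror what `accuracy_score` and `f1_score` compute for a binary problem with positive label 1):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 3 true positives, 1 false negative, 3 true negatives, 1 false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy counts correct predictions in both classes, while F1 only looks at the positive class. On a balanced dataset such as this one the two usually stay close, which matches the values reported above.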


In [21]:
# Creating a new DataFrame to have another look at the results of our prediction
test_df = pd.concat([pd.Series(predictions), test['label']], axis=1)

print("same values :", test_df[test_df[0] == test_df['label']].count())
print("different values : {0}".format(test_df[test_df[0] != test_df['label']].count()))

same values : 0        20430
label    20430
dtype: int64
different values : 0        4570
label    4570
dtype: int64


Now let's add a normalization step to our preprocessing. There are two possible choices:

- Stemming

- Lemmatization

Let's have a look at the results of both before choosing.

Stemming 

In [19]:
# Download the 'punkt' tokenizer models (used by nltk.word_tokenize in the lemmatization cell)
nltk.download('punkt')

stemmer = SnowballStemmer("english")
# Reuse CountVectorizer's default analyzer (lowercasing + tokenization) before stemming
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

pipeline_stemmer = Pipeline([('Vect', CountVectorizer(analyzer=stemmed_words)), ('Mnb', MultinomialNB())])

pipeline_stemmer.fit(train['text'], train['label'])

# Predicting the results of the test data 
predictions_stemmer = pipeline_stemmer.predict(test['text'])

# Computing the F1 score on the test set
f1_score(test['label'], predictions_stemmer)

[nltk_data] Downloading package punkt to /home/yacine/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


0.7966195740321823

Lemmatization

In [22]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download("omw-1.4")

# Create the lemmatizer once instead of once per document
lemmatizer = WordNetLemmatizer()

def tokenize(text):
    # Tokenize with NLTK, then lemmatize each token
    tokens = nltk.word_tokenize(text)
    return [lemmatizer.lemmatize(token) for token in tokens]

# Note: unlike the stemming pipeline, this one also removes English stop words
pipeline_lemmatizer = Pipeline([('Vect', CountVectorizer(tokenizer=tokenize, stop_words='english')), ('Mnb', MultinomialNB())])

pipeline_lemmatizer.fit(train['text'], train['label'])

# Predicting the results of the test data 
predictions_lemmatizer = pipeline_lemmatizer.predict(test['text'])

# Computing the F1 score on the test set
f1_score(test['label'], predictions_lemmatizer)

[nltk_data] Downloading package wordnet to /home/yacine/nltk_data...
[nltk_data] Downloading package omw-1.4 to /home/yacine/nltk_data...


0.8088703853179829

In [44]:
print("Lemmatizer errors:")
wrong_prediction_sample = test[test['label'] != predictions_lemmatizer]["text"].sample(2, random_state=SEED)
for review in wrong_prediction_sample:
    print(review, "\n")
    print("Lemmatized text:")
    print(tokenize(review), "\n")

print("==========================================================================")
print("==========================================================================\n")

print("Stemmer errors:")
wrong_prediction_sample = test[test['label'] != predictions_stemmer]["text"].sample(2, random_state=SEED)
for review in wrong_prediction_sample:
    print(review, "\n")
    print("Stemmed text:")
    print(list(stemmed_words(review)), "\n")

Lemmatizer errors:
im sorry but star wars episode 1 did not do any justice to natalie portmans talent and undeniable cuteness she was entirely underused as queen amidala and when she was used her makeup was frighteningly terrible for anywhere but here she sheds her godawful makeup and she acts normally and not only can she act good she looks good doing it im a bit older than she shes only 18 and i have little or no chance of meeting her but hey a guy is allowed to dream rightbr br even though susan sarandon does take a good turn in this movie the film belongs entirely to portman ive been a watcher of portmans since beautiful girls where she was younger but just as cute theres big things for her in the future  i can see it 

Lemmatized text:
['im', 'sorry', 'but', 'star', 'war', 'episode', '1', 'did', 'not', 'do', 'any', 'justice', 'to', 'natalie', 'portmans', 'talent', 'and', 'undeniable', 'cuteness', 'she', 'wa', 'entirely', 'underused', 'a', 'queen', 'amidala', 'and', 'when', 'she', 

We can see that both lemmatization and stemming make mistakes and create non-words (in the output above, the lemmatizer turns "was" into "wa" and "as" into "a"), which likely contributes to the misclassification of the examples above.

In the first example review, we find words with a positive connotation, such as talent, cuteness, good and beautiful, but also words with a negative connotation: underused, godawful, frighteningly terrible. Mixing positive and negative words makes the classification harder, which may explain the error.
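To illustrate why mixed vocabulary is hard for a bag-of-words model, here is a toy multinomial naive Bayes with add-one smoothing. The four-document corpus, the helper names and the smoothing value are all made up for illustration; this is a sketch of the general technique, not the fitted pipeline above.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model with add-alpha smoothing."""
    vocab = {w for doc in docs for w in doc.split()}
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik


def class_scores(text, log_prior, log_lik):
    """Return the unnormalised log posterior of each class for one text."""
    scores = {}
    for c in log_prior:
        # Each known word adds its log-likelihood; unseen words are skipped.
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in text.split() if w in log_lik[c]
        )
    return scores


def margin(scores):
    """Log-score gap between the positive and the negative class."""
    return scores["pos"] - scores["neg"]


# Hypothetical toy corpus.
docs = [
    "good talent good",
    "beautiful good acting",
    "terrible godawful makeup",
    "terrible underused plot",
]
labels = ["pos", "pos", "neg", "neg"]
log_prior, log_lik = train_nb(docs, labels)

clear = class_scores("good beautiful", log_prior, log_lik)
mixed = class_scores("good terrible", log_prior, log_lik)
print(round(margin(clear), 3))  # about 2.079
print(round(margin(mixed), 3))  # about 0.288
```

Because the model simply adds per-word log-likelihoods, a negative word cancels most of the evidence from a positive one. The margin shrinks and the decision can flip easily, even when a human reader sees that the negative words refer to a different film.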