# Lab 02

## Installation and import of necessary librairies

In [1]:
!pip install datasets



In [2]:
from datasets import load_dataset_builder
from datasets import load_dataset
import pandas as pd
import numpy as np

import nltk
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

import utility_functions

SEED = 42

[nltk_data] Downloading package stopwords to /home/yacine/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## The dataset

The IMDB sentiment dataset is a collection of 50K movie reviews, annotated as positive or negative, and split in two sets of equal size: a training and a test set. Both set have an equal number of positive and negative review. The dataset is available on several libraries, but we ask that you use the HuggingFace [datasets](https://huggingface.co/datasets/imdb) version. Follow their [tutorial](https://huggingface.co/docs/datasets/load_hub) on how to use the library for more details.

Download and look at the dataset, and answer the following questions.
1. How many splits does the dataset has? (1 point)
2. How big are these splits? (1 point)
3. What is the proportion of each class on the supervised splits? (1 point)

To start, we will load our dataset and then we will have a look on how it is strutured.

In [3]:
database_name = "imdb"
ds_builder = load_dataset_builder(database_name)
print(ds_builder.info.description)
print(ds_builder.info.features)

dataset = load_dataset(database_name)

Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}


Reusing dataset imdb (/home/yacine/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


  0%|          | 0/3 [00:00<?, ?it/s]

In [4]:
from datasets import get_dataset_split_names
print("Split names", get_dataset_split_names(database_name))
dataset

Split names ['train', 'test', 'unsupervised']


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

As we can see, our dataset has 3 splits. The "train" and "test" splits have 25000 rows each and the unsupervised split has 50000 rows. We will in our case use the "train" and "test" splits that have 25000 rows each as we will do some supervised learning.

In [5]:
# To start we are going to split our datasets into 3 differents datasets
train = dataset["train"].to_pandas()
test = dataset["test"].to_pandas()

# Then we will have a look on the 
print("Test values count : {0}".format(len(test)))
print("Train values count : {0}".format(len(train)))

Test values count : 25000
Train values count : 25000


We can see that there are as many positive as negative reviews in the supervised split. Indeed, each class has 25000 occurrences.

## Naive Bayes classifier **(9 points)**

Implement your own naive Bayes classifier (the pseudo code can be found in the slides or the [book reference](https://web.stanford.edu/~jurafsky/slp3/)) or use [one provided by scikit-learn](https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes) combined with a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

Go through the following steps.
1. (2 points) Take a look at the data and create an adapted preprocessing function which at least:
   1. Lower case the text.
   2. Remove punctuation (you can use `from string import punctuation` to ease your work).
2. (4 points) Implement and train a naive Bayes classifier on the training data. Either:
   * Code your own classifier following the algorithm given in class.
   * Or use a scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline) with a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) classifier. (Recommended)
3. (1 point) Report the accuracy on both training and test set.
4. (1 point) Why is accuracy a sufficient measure of evaluation here?
5. **\[Bonus\]** What are the top 10 most important words (features) for each class? (bonus points)
   1. Look at the words with the highest likelihood in each class (if you use scikit-learn, you want to check `feature_log_prob_`).
   2. Remove stopwords (see [NLTK stopwords corpus](https://pythonspot.com/nltk-stop-words/)) and check again.
6. Take at least 2 wrongly classified example from the test set and try explaining why the model failed. (1 point)

### 1. Preprocessing

Before doing any preprocessing on our data, let's now have a look on our dataframe.

In [6]:
train.head()

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


We can see here that we have some uppercase characters and ponctuation. So for our process, let's change all of our uppercase to lowercase and remove the punctuation.

To do this, we will use our utility_functions.py file that contains all the needed function for our preprocessing.

In [7]:
# Replace uppercase with lowercases + remove punctuation
train = utility_functions.preprocess_df(train)
test = utility_functions.preprocess_df(test)

Let's now have a look on the result that is our processed dataframe.

In [8]:
train.head()

Unnamed: 0,text,label
0,i rented i am curiousyellow from my video stor...,0
1,i am curious yellow is a risible and pretentio...,0
2,if only to avoid making this type of film in t...,0
3,this film was probably inspired by godards mas...,0
4,oh brotherafter hearing about this ridiculous ...,0


### 2. Naive Bayes classifier 

As our data is now procesed, let's create a Naive Bayes classifier for our dataset.

To do that, we will use a Pipeline that will have 2 parameters : 

- One CountVectorizer to process our data

- Our Naive Bayes which will be a MultinomialNB

In [9]:
# Creating our pipeline 
pipeline = Pipeline([('Vect', CountVectorizer()), ('Mnb', MultinomialNB())])

# Fitting our pipeline on the train data (with the lables as we are doing supervised learning)
pipeline.fit(train['text'], train['label'])

Pipeline(steps=[('Vect', CountVectorizer()), ('Mnb', MultinomialNB())])

#### 3. Accuracy

In [10]:
# Predicting the results of the test data 
predictions = pipeline.predict(test['text'])
train_predictions = pipeline.predict(train['text'])

# Computing the F1_score and accuracy
# For the train dataset 
print("train accuracy", pipeline.score(train['text'], train['label']))
print("train f1-score", f1_score(train['label'], train_predictions))

# For the test dataset
print("\ntest accuracy", pipeline.score(test['text'], test['label']))
print("test f1-score", f1_score(test['label'], predictions))

train accuracy 0.91284
train f1-score 0.9099847151650349

test accuracy 0.8172
test f1-score 0.8051338904997442


#### 4. Why is accuracy a sufficient measure of evaluation here?

We can see that the f1-score and the accuracy are simillar because the dataset is perfectly balenced. So the accuracy is a sufficient metric.

We can also have a look on how many good predictions our model have and how many bad ones he did.

#### 5.1 top 10 most important words for each class

In [11]:
pipeline["Mnb"].feature_log_prob_
# pipeline["Mnb"].feature_names_in_
# mnb = pipeline["Mnb"]

array([[-13.77050692, -14.17597202, -14.17597202, ..., -14.17597202,
        -14.8691192 , -14.17597202],
       [-14.89530489, -14.89530489, -14.89530489, ..., -14.89530489,
        -14.20215771, -14.89530489]])

#### 5.2 Remove stopwords

In [12]:
stopWords = set(stopwords.words('english'))
print("Stopwords:", stopWords)

Stopwords: {'had', 'she', 'over', 'herself', 'isn', 'again', 'on', "haven't", 'has', 'or', 'no', 'few', 'have', 'wouldn', 'yourselves', "you'll", 'why', 'these', 'from', 'am', "wouldn't", 'under', "don't", 'so', 'because', 'yourself', 'against', 'itself', 'if', 'at', 'having', 'o', "hasn't", 'were', 'up', 'once', 'her', "won't", "isn't", 'its', 'you', 'of', "shouldn't", 'just', 'needn', "she's", 'in', "aren't", 'me', 'what', 'those', 'will', 'such', "it's", 'same', 'while', 'it', "didn't", 'an', 'd', 'hadn', 'is', 'himself', 'a', 've', 'above', 'than', 'both', 'whom', 'before', 'he', "mightn't", 'during', 'further', 'but', "doesn't", 'nor', 'are', 'm', 'haven', 'our', 'until', 'don', "you're", 'ours', 'yours', 's', 'how', 'there', 'can', 'that', 'who', 'off', 'not', 'most', 'should', 'mightn', 'ain', 'ma', "that'll", 'wasn', 'out', "mustn't", 'now', 'all', 'more', 'hasn', 'and', 'weren', 'doing', 'do', 'him', 'your', 'for', 'about', 'down', 'other', 'being', 'doesn', 'i', 'into', 'aren

In [25]:
train_without_stopwords = train
train_without_stopwords["text"] = train_without_stopwords["text"].apply(lambda text: utility_functions.remove_stopwords(text, stopWords))

test_without_stopwords = test
test_without_stopwords["text"] = test_without_stopwords["text"].apply(lambda text: utility_functions.remove_stopwords(text, stopWords))

print("Text without stopwords:")
list(train_without_stopwords["text"][:3])

Text without stopwords:


['rented curiousyellow video store controversy surrounded first released 1967 also heard first seized us customs ever tried enter country therefore fan films considered controversial really see myselfbr br plot centered around young swedish drama student named lena wants learn everything life particular wants focus attentions making sort documentary average swede thought certain political issues vietnam war race issues united states asking politicians ordinary denizens stockholm opinions politics sex drama teacher classmates married menbr br kills curiousyellow 40 years ago considered pornographic really sex nudity scenes far even shot like cheaply made porno countrymen mind find shocking reality sex nudity major staple swedish cinema even ingmar bergman arguably answer good old boy john ford sex scenes filmsbr br commend filmmakers fact sex shown film shown artistic purposes rather shock people make money shown pornographic theaters america curiousyellow good film anyone wanting study

In [26]:
# Creating our pipeline 
no_stopwords_pipeline = Pipeline([('Vect', CountVectorizer()), ('Mnb', MultinomialNB())])

# Fitting our pipeline on the train data (with the lables as we are doing supervised learning)
no_stopwords_pipeline.fit(train_without_stopwords['text'], train_without_stopwords['label'])

# Predicting the results of the test data 
no_stopwords_predictions = pipeline.predict(test_without_stopwords['text'])

# Computing the F1_score and accuracy
print("\ntest accuracy", no_stopwords_pipeline.score(test_without_stopwords['text'], test_without_stopwords['label']))
print("test f1-score", f1_score(test_without_stopwords['label'], no_stopwords_predictions))


test accuracy 0.82692
test f1-score 0.8202030371675475


We notice that the accuracy is slightly better without stopwords.

#### 6. Wrongly classified example explaination

In [14]:
naive_bayes_predictions = pipeline.predict(test["text"])

print("Classification errors:")
wrong_prediction_sample = test[test['label'] != naive_bayes_predictions]["text"][-2:]
for review in wrong_prediction_sample:
    print(review, "\n")

Classification errors:
five minutes in i started to feel how naff this was looking youve got a completely unheroic hero and his overweight fool of a friend seen it all before yeah right i was getting ready to be bored out of my mind for a good few hours this is something i have become quite used to havent we all then after a few minutes of testosterone fuelled insults and such the truck appeared okay the filming techniques used to make it look fast were clumsy but who cares that truck is amazing soon however that is taken away again and were back to the geek and his overweight friend but now im satisfied that at least it wont be too terrible i then proceed to be amazed again and again by the cleverness of the film there are so many jokes at their expense its like everyone in the world is in on this except the two of them the mind behind the makeup and effects was a genius i swear it believe me if you are a man you miss so many of the jokes in this film there is so much here that only a

The first review is positive and has been classified as negative. Because the author had a bad feeling about the first minutes of the film, many words with a negative connotation are present such as: "unheroic", "fool", "bored", "clumsy". But as time goes by, the author appreciates the film more and more and uses words with positive connotations. So there are both words with positive and words with negative connotations which makes the classification harder for the model. The review being about an horror movie, negative words such as: "killer", "blood", "horror"; were used to discribe the atmosphere and events of the movie. These words may have influenced the model to classify the review as a negative review.

Like the first one, the second review is also positive and has been classified as negative. And just like the first review, we find a lot of negative words such as: "worst", "freaky", "ridiculously", "passable","horror", "terror". These words may have caused a wrong classification. 

## Stemming and Lemmatization

The two notebooks in this directory give examples on how to run stemming and lemmatization using NLTK or spaCy. Pick either stemming or lemmatization, and add the operation to your pretreatment.

1. (2 points) Add stemming or lemmatization to your pretreatment.
2. (1 point) Train and evaluate your model again with these pretreatment.
3. (1 point) Are the results better or worse? Try explaining why the accuracy changed.

Now that we saw the results of our model with a very simple preprocessing, let's add some processing to our data with two possible choices :

- Stemming

- Lemmatization 

Let's have a look on the results of these two before choosing.

### Stemming 

To start, before adding the stemming to our preprocessing, we will need to download the necessary packages and setup our stemmer and the function that will be used in our CountVectorizer to use our stemmer.

In [16]:
# We need to download a package for word tokenization
nltk.download('punkt')

# Setting up our stemmer and CountVectorizer
re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")
analyzer = CountVectorizer().build_analyzer()

# Function used in the CountVectorizer to use our stemmer
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

[nltk_data] Downloading package punkt to /home/yacine/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Then now, we can use our pipeline with our stemmer and check the results that it will give us !

In [17]:
# Creation of our pipeline 
pipeline_stemmer = Pipeline([('Vect', CountVectorizer(analyzer=stemmed_words)), ('Mnb', MultinomialNB())])

# Fitting on the train data 
pipeline_stemmer.fit(train['text'], train['label'])

# Predicting the results of the test data 
predictions_stemmer = pipeline_stemmer.predict(test['text'])

Now that we have the results, let's have a look on the accuracy of our model with stemming :

In [18]:
# Computing the F1_score and accuracy
# For the train dataset 
print("train accuracy", pipeline_stemmer.score(train['text'], train['label']))
print("train f1-score", f1_score(train['label'], pipeline_stemmer.predict(train['text'])))

# For the test dataset
print("\ntest accuracy", pipeline_stemmer.score(test['text'], test['label']))
print("test f1-score", f1_score(test['label'], predictions_stemmer))

train accuracy 0.9134
train f1-score 0.9115134671189766

test accuracy 0.608
test f1-score 0.7151494012324148


As we can see it here, the results are not really improving by adding the stemming to our preprocessing. It is in fact the opposite in our case. 

### Lemmatization

To start, before adding the lemmatization to our preprocessing, we will need to download the necessary packages and setup our lemmatizer and the function that will be used in our CountVectorizer to use our lemmatizer.

In [19]:
from nltk.stem import WordNetLemmatizer

# Downloading needed packages for our lemmatizer
nltk.download('wordnet')
nltk.download("omw-1.4")
re_word = re.compile(r"^\w+$")
# Creating our lemmatizer
lemmatizer = WordNetLemmatizer()

# This is the function that will be used in our CountVectorizer
def tokenize(text):
    tokens = nltk.word_tokenize(text)
    return [lemmatizer.lemmatize(token) for token in tokens]

[nltk_data] Downloading package wordnet to /home/yacine/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/yacine/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Then now, we can use our pipeline with our lemmatizer and check the results that it will give us !

In [20]:
# Creation of our pipeline 
pipeline_lemmatizer = Pipeline([('Vect', CountVectorizer(tokenizer=tokenize, stop_words='english')), ('Mnb', MultinomialNB())])

# Fitting on the train data 
pipeline_lemmatizer.fit(train['text'], train['label'])

# Predicting the results of the test data 
predictions_lemmatizer = pipeline_lemmatizer.predict(test['text'])



Now that we have the results, let's have a look on the accuracy of our model with the lemmatizer :

In [21]:
# Computing the F1_score and accuracy
# For the train dataset 
print("train accuracy", pipeline_lemmatizer.score(train['text'], train['label']))
print("train f1-score", f1_score(train['label'], pipeline_lemmatizer.predict(train['text'])))

# For the test dataset
print("\ntest accuracy", pipeline_lemmatizer.score(test['text'], test['label']))
print("test f1-score", f1_score(test['label'], predictions_lemmatizer))

train accuracy 0.92204
train f1-score 0.920309113955105

test accuracy 0.8194
test f1-score 0.809002072845721


As we can see it, our results are better than the ones with the stemmer but they are still very close to our original results so it may not be an optimal adding to our preprocessing.

## A final look on the results

In [22]:
print("Lemmatizer errors:")
wrong_prediction_sample = test[test['label'] != predictions_lemmatizer]["text"].sample(2, random_state=SEED)
for review in wrong_prediction_sample:
    print(review, "\n")
    print("Lemmatized text:")
    print(tokenize(review), "\n")

print("==========================================================================")
print("==========================================================================\n")

print("Stemmer errors:")
wrong_prediction_sample = test[test['label'] != predictions_stemmer]["text"].sample(2, random_state=SEED)
for review in wrong_prediction_sample:
    print(review, "\n")
    print("Stemmed text:")
    print(list(stemmed_words(review)), "\n")

Lemmatizer errors:
director and playwright richard day adapted his own stage material for the screen clearly inspired by rock hudsons reallife dilemma from the 1950s what to do with a screen idol who is secretly homosexual marry him off to an unsuspecting woman in order to quell the gossips and keep him working wispythin idea given some energy by the good cast and retro production design which amusingly resembles a greeting card by shag the dialogue isnt very clever and theres some slapstick goofing around near the beginning which fails to work spitting out food etc still when a serious tone comes over the final act it is handled with great tasteand is far more welcomed by the viewer than all the klutzy silliness matt letscher does good work as movie heromale whore guy stone but are his experiences here enough to strengthen his character or would he be right back at the bar the next night the movie seems not to knowor care day wants to get off a few oneliners and one carefully written 

We can see that both lemmatization and stemming make mistakes and creates word which surely contributes to the error in the classification of the above examples.

In the first example review, we find words with positive connotation such as: talent, cuteness, good and beautifull. But we also find words with negative connotation: underused, godawful, frighteningly terrible. Having both positive and negative connotated words makes the classification harder, the other examples also have positive and negative words. This may explain the classification errors.