# The dataset

The IMDB sentiment dataset is a collection of 50K movie reviews, annotated as positive or negative, and split into two sets of equal size: a training set and a test set. Both sets have an equal number of positive and negative reviews. The dataset is available in several libraries, but we ask that you use the HuggingFace [datasets](https://huggingface.co/datasets/imdb) version. Follow their [tutorial](https://huggingface.co/docs/datasets/load_hub) on how to use the library for more details.

Download and look at the dataset, and answer the following questions.
1. How many splits does the dataset have? (1 point)
2. How big are these splits? (1 point)
3. What is the proportion of each class in the supervised splits? (1 point)

In [None]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.13-py37-none-any.whl (115 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected packa

In [None]:
!python -m spacy download en_core_web_sm

2022-09-13 17:20:42.647802: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
from datasets import load_dataset_builder
from datasets import load_dataset
import pandas as pd
import numpy as np

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 
import re
import spacy

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

database_name = "imdb"
ds_builder = load_dataset_builder(database_name)
print(ds_builder.info.description)
print(ds_builder.info.features)

dataset = load_dataset(database_name)

Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}





In [None]:
from datasets import get_dataset_split_names
print("Split names", get_dataset_split_names(database_name))
dataset

Split names ['train', 'test', 'unsupervised']


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

We can see that this dataset has 3 splits. The `train` and `test` splits have 25,000 rows each, and the `unsupervised` split has 50,000 rows.

In [None]:
# To start, we convert the two labeled splits to pandas dataframes and concatenate them
train = dataset["train"].to_pandas()
test = dataset["test"].to_pandas()
supervised = pd.concat([train, test])

# Then we look at the size of each split and the class distribution
print("Test values count : {0}".format(len(test)))
print("Train values count : {0}".format(len(train)))
supervised["label"].value_counts()

Test values count : 25000
Train values count : 25000


0    25000
1    25000
Name: label, dtype: int64

Let's now have a look at our dataframe and our data.

In [None]:
supervised.head()

|   | text | label |
|---|------|-------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


We can see that there are as many positive as negative reviews in the supervised splits. Indeed, each class has 25,000 occurrences, i.e. 50% of the labeled data.
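As a minimal sketch of how the class proportions can be derived from the counts, the snippet below uses only the standard library. The `class_proportions` helper and the stand-in `labels` list are illustrative assumptions, not part of the notebook; on the `supervised` dataframe, `supervised["label"].value_counts(normalize=True)` should give the same proportions directly.

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: fraction of examples carrying that label}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Stand-in labels mirroring the counts reported above (0 = neg, 1 = pos)
labels = [0] * 25000 + [1] * 25000
proportions = class_proportions(labels)
print(proportions)  # {0: 0.5, 1: 0.5}
```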

Now let's create a Naive Bayes classifier for our dataset.

To do that, we will use a Pipeline with two steps:

- A CountVectorizer to turn each review into a vector of token counts

- A MultinomialNB, which is our Naive Bayes classifier
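To make the two steps concrete, here is a rough standard-library sketch of what such a pipeline computes: token counts per document, then a multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, the same default smoothing value as `MultinomialNB`). The function names and the toy corpus are hypothetical, and the whitespace tokenizer is a crude stand-in for `CountVectorizer`, which tokenizes with a regular expression.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens."""
    return Counter(text.lower().split())


def train_multinomial_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods for each class."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        token_counts[label].update(bag_of_words(text))
    vocab = {tok for counts in token_counts.values() for tok in counts}
    model = {}
    for label, n_docs in class_docs.items():
        # Denominator: all tokens seen in the class plus smoothing mass
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                tok: math.log((token_counts[label][tok] + alpha) / total)
                for tok in vocab
            },
        }
    return model


def predict(model, text):
    """Return the class with the highest log posterior; unseen tokens are ignored."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok, count in bag_of_words(text).items():
            if tok in params["log_likelihood"]:
                score += count * params["log_likelihood"][tok]
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpus (0 = negative, 1 = positive)
texts = ["great movie loved it", "terrible movie hated it",
         "loved the acting", "hated the plot"]
labels = [1, 0, 1, 0]
model = train_multinomial_nb(texts, labels)
print(predict(model, "loved this great film"))  # 1
print(predict(model, "terrible plot"))          # 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long documents such as movie reviews.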

In [None]:
# Creating our pipeline 
pipeline = Pipeline([('Vect', CountVectorizer()), ('Mnb', MultinomialNB())])

# Fitting our pipeline on the train data (with the labels, as we are doing supervised learning)
pipeline.fit(train['text'], train['label'])

Pipeline(steps=[('Vect', CountVectorizer()), ('Mnb', MultinomialNB())])

In [None]:
# Predicting the results of the test data 
predictions = pipeline.predict(test['text'])

# Computing the F1 score (on the positive class) to evaluate our model
f1_score(test['label'], predictions)

0.8005648025330537

In [None]:
# Building a dataframe that puts predictions next to the true labels, to count matches and mismatches
test_df = pd.concat([pd.Series(predictions), test['label']], axis=1)

print("same values : {0}".format(test_df[test_df[0] == test_df['label']].count()))
print("different values : {0}".format(test_df[test_df[0] != test_df['label']].count()))

same values : 0        20339
label    20339
dtype: int64
different values : 0        4661
label    4661
dtype: int64
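Note that the F1 score and the accuracy are two different metrics. The sketch below, which uses only the standard library, derives the accuracy implied by the match and mismatch counts printed above and shows how F1 is built from precision and recall. The `precision_recall_f1` helper and the toy labels are illustrative assumptions; with its default arguments, `f1_score` reports the F1 of the positive class (label 1).

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Accuracy implied by the counts printed above: matches / all predictions
accuracy = 20339 / (20339 + 4661)
print(round(accuracy, 4))  # 0.8136

# Hypothetical toy labels, where accuracy (0.5) and F1 (0.4) disagree
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, round(recall, 4), round(f1, 4))  # 0.5 0.3333 0.4
```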


Now let's add some preprocessing to our data, with two possible choices:

- Stemming

- Lemmatization

Let's have a look at the results of both before choosing.
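Before running the real tools, the difference between the two approaches can be illustrated with a deliberately simplified sketch. Everything below is hypothetical: `toy_stem` applies a few suffix-stripping rules (far cruder than the Snowball stemmer), and `toy_lemmatize` looks words up in a tiny hand-written dictionary standing in for a real lexical resource such as WordNet.

```python
# Toy suffix rules, checked in order (hypothetical, much simpler than Snowball)
SUFFIXES = ("ies", "ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, even if the result is not a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Toy lemma dictionary (hypothetical stand-in for a real lexicon)
LEMMAS = {"movies": "movie", "studies": "study",
          "watched": "watch", "better": "good"}


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return LEMMAS.get(word, word)


for word in ["movies", "studies", "watched", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
# movies mov movie
# studies stud study
# watched watch watch
# better better good
```

The stemmer is cheap and needs no dictionary, but it can produce truncated forms that are not words. The lemmatizer returns valid dictionary forms and can handle irregular words, at the cost of needing a lexicon (and, in real tools, often part-of-speech information).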

Stemming 

In [None]:
# Download the punkt tokenizer models (nltk.word_tokenize relies on them in the lemmatization cell)
nltk.download('punkt')

stemmer = SnowballStemmer("english")
# Reuse CountVectorizer's default analyzer (lowercasing + tokenization), then stem each token
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

pipeline_stemmer = Pipeline([('Vect', CountVectorizer(analyzer=stemmed_words)), ('Mnb', MultinomialNB())])

pipeline_stemmer.fit(train['text'], train['label'])

# Predicting the results of the test data 
predictions_stemmer = pipeline_stemmer.predict(test['text'])

# Computing the F1 score (on the positive class) to evaluate our model
f1_score(test['label'], predictions_stemmer)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


0.7923413004369056

Lemmatization

In [None]:
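from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs the WordNet data
nltk.download('wordnet', '/usr/share/nltk_data')
nltk.download("omw-1.4")

# Loading the small English spaCy model (loaded here but not used by tokenize below)
nlp = spacy.load("en_core_web_sm")

# Create the lemmatizer once instead of on every call to tokenize
lemmatizer = WordNetLemmatizer()

def tokenize(text):
    # Split the review into tokens, then lemmatize each one (default part of speech: noun)
    tokens = nltk.word_tokenize(text)
    return [lemmatizer.lemmatize(token) for token in tokens]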

pipeline_lemmatizer = Pipeline([('Vect', CountVectorizer(tokenizer=tokenize, stop_words='english')), ('Mnb', MultinomialNB())])

pipeline_lemmatizer.fit(train['text'], train['label'])

# Predicting the results of the test data 
predictions_lemmatizer = pipeline_lemmatizer.predict(test['text'])

# Computing the F1 score (on the positive class) to evaluate our model
f1_score(test['label'], predictions_lemmatizer)

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
  % sorted(inconsistent)


0.8024222781355281