# The dataset

The IMDB sentiment dataset is a collection of 50K movie reviews, annotated as positive or negative, and split into two sets of equal size: a training set and a test set. Both sets contain an equal number of positive and negative reviews. The dataset is available in several libraries, but we ask that you use the HuggingFace [datasets](https://huggingface.co/datasets/imdb) version. Follow their [tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to use the library.

Download and look at the dataset, and answer the following questions.
1. How many splits does the dataset have? (1 point)
2. How big are these splits? (1 point)
3. What is the proportion of each class on the supervised splits? (1 point)

## Installation and import of necessary libraries


In [1]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!pip install fasttext

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
!python -m spacy download en_core_web_sm

2022-10-11 14:23:27.492906: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
from datasets import load_dataset_builder
from datasets import load_dataset
import pandas as pd
import numpy as np

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 
import re
import spacy

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

database_name = "imdb"
ds_builder = load_dataset_builder(database_name)
print(ds_builder.info.description)
print(ds_builder.info.features)

dataset = load_dataset(database_name)

from datasets import get_dataset_split_names
import fasttext

Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}





In [5]:
# Import our file of useful helper functions
%load utility_functions.py
import sys
sys.path.append("../tools/")
import utility_functions

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## The dataset

Now that all the necessary libraries are installed, let's have a look at our data.

In [6]:
print("Split names", get_dataset_split_names(database_name))
dataset

Split names ['train', 'test', 'unsupervised']


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

We can see that this dataset has three splits. The "train" and "test" splits have 25000 rows each, and the "unsupervised" split has 50000 rows.

Since we are only doing supervised learning here, we will only use two of them: train and test.

In [7]:
# To start, we convert the two supervised splits into pandas DataFrames
train = dataset["train"].to_pandas()
test = dataset["test"].to_pandas()

# Then we look at the size and class distribution of each split
print("Test values count : {0}".format(len(test)))
print(test["label"].value_counts())
print("Train values count : {0}".format(len(train)))
print(train["label"].value_counts())

Test values count : 25000
0    12500
1    12500
Name: label, dtype: int64
Train values count : 25000
0    12500
1    12500
Name: label, dtype: int64


As the counts show, both splits contain an equal number of examples for each label (0 and 1).
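Question 3 asks for proportions rather than raw counts. Here is a minimal sketch using only the standard library and the counts printed above; the helper name `class_proportions` is ours and not part of any library. On the DataFrames used here, `train["label"].value_counts(normalize=True)` should give the same figures directly.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: share of rows} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Rebuild a label column with the counts reported above (12500 per class).
labels = [0] * 12500 + [1] * 12500
print(class_proportions(labels))  # {0: 0.5, 1: 0.5}
```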

Let's now have a look at the DataFrame itself:

In [8]:
train.head()

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


We will shuffle the dataset shortly so that its rows are not in any particular order.

## Preprocessing

Before doing anything else with our dataset, we preprocess it by converting all letters to lowercase and removing punctuation.

In [9]:
train = utility_functions.preprocess_df(train)
test = utility_functions.preprocess_df(test)

train.head()

Unnamed: 0,text,label
0,i rented i am curiousyellow from my video stor...,0
1,i am curious yellow is a risible and pretentio...,0
2,if only to avoid making this type of film in t...,0
3,this film was probably inspired by godards mas...,0
4,oh brotherafter hearing about this ridiculous ...,0


As we can see, the preprocessing is now done!
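`utility_functions.preprocess_df` is defined in a separate file that is not shown here. As a rough idea of what such a helper might do, here is a minimal sketch that lowercases a review and drops every character that is neither a word character nor whitespace. The function name `preprocess_text` and the exact regular expression are assumptions, not the actual implementation.

```python
import re

# Anything that is not a letter, digit, underscore or whitespace.
_PUNCTUATION = re.compile(r"[^\w\s]")

def preprocess_text(text):
    """Lowercase the text and remove punctuation (hypothetical helper)."""
    return _PUNCTUATION.sub("", text.lower())

print(preprocess_text("Oh, brother...after hearing about this"))
# oh brotherafter hearing about this
```

Applied to a DataFrame column, this would be `df["text"].map(preprocess_text)`. Note that removing punctuation without inserting a space glues words together, as in `curiousyellow` and `brotherafter` in the output above, and would turn `<br />` tags into stray `br` tokens.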

## Shuffling the dataset

To get more consistent results, it is necessary to shuffle the dataset (we will also need this for the hyperparameter optimization later), as follows:

In [10]:
# Function used to shuffle the given dataset
def shuffle(dataset):
  return dataset.sample(frac=1).reset_index(drop=True)

In [11]:
train = shuffle(train)
test = shuffle(test)

# Displaying a dataset to see if the shuffle worked
train.head()

Unnamed: 0,text,label
0,this game is the bomb and this is the 007 game...,1
1,george lopez is a funny man even without the s...,1
2,despite positive reviews and screenings at the...,0
3,cinematographycompared to the wrestler a degre...,1
4,i just saw this movie tonight opening night it...,1


We can see that the rows of the datasets have indeed been shuffled, as compared to their previous order.

## 1 - (2 points) Turn the dataset into a dataset compatible with FastText (see the Tips on using FastText section a bit lower)

The FastText dataset format is:

```__label__<your_label> <corresponding text>```

For convenience, we will replace the original labels (0 and 1) by their intended meaning, "negative" and "positive" respectively.

In [12]:
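def pandas_to_fasttext(pandas_data):
  '''Convert a pandas DataFrame to a single string in FastText format.'''
  pandas_data = pandas_data.astype('str')
  fasttext_data = '__label__' + pandas_data['label'] + ' ' + pandas_data['text']
  # Caution: to_string() renders the Series for display, so long texts are
  # shortened and end with '...' (visible in the output further down).
  fasttext_str = fasttext_data.to_string(index=False)
  # Caution: this replaces every '0' and '1' in the string, including digits
  # that appear inside the review text, not only the label.
  fasttext_str = fasttext_str.replace('0', 'negative').replace('1', 'positive')
  return fasttext_str

def str_to_txt_file(text, filename):
  '''Save a string to a text file.'''
  # The context manager closes the file even if the write fails
  with open(filename, "w") as file:
    file.write(text)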

We now convert each dataset (train, test) to fasttext format using the above functions, and save them to text files for later use:

In [13]:
fasttext_train = pandas_to_fasttext(train)
str_to_txt_file(fasttext_train, 'train.txt')

fasttext_test = pandas_to_fasttext(test)
str_to_txt_file(fasttext_test, 'test.txt')

fasttext_train.split('\n')[:10]
print(fasttext_train.split('\n')[:1])

['__label__positive this game is the bomb and this is th...']


The output above shows the result of the FastText formatting, which corresponds to the expected format.
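One thing to watch for in the output above: the trailing `...` is part of the stored string, because `Series.to_string()` renders values for display and shortens long ones. Replacing `'0'` and `'1'` on the whole string also touches digits inside the reviews. Here is a minimal sketch of an alternative that maps the label first and writes one full review per line. The helper names `to_fasttext_line` and `write_fasttext_file` and the `LABEL_NAMES` mapping are ours, and this variant was not used to produce the results reported in this notebook.

```python
# Hypothetical mapping, following the 0 -> negative, 1 -> positive convention above.
LABEL_NAMES = {0: "negative", 1: "positive"}

def to_fasttext_line(label, text):
    """Format one example as '__label__<name> <text>' on a single line."""
    # FastText reads one example per line, so collapse any whitespace runs
    # (including newlines) into single spaces.
    single_line = " ".join(str(text).split())
    return f"__label__{LABEL_NAMES[int(label)]} {single_line}"

def write_fasttext_file(labels, texts, filename):
    """Write the examples to filename, one per line, without truncation."""
    with open(filename, "w", encoding="utf-8") as handle:
        for label, text in zip(labels, texts):
            handle.write(to_fasttext_line(label, text) + "\n")

print(to_fasttext_line(1, "this is the 007 game"))
# __label__positive this is the 007 game
```

With the DataFrames above, the call would look like `write_fasttext_file(train["label"], train["text"], "train.txt")`.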

## 2 - (2 points) Train a FastText classifier with default parameters on the training data, and evaluate it on the test data using accuracy.

Training a simple classifier on the formatted training dataset:

In [14]:
model = fasttext.train_supervised(input='train.txt')

Displaying the results of the model on the formatted test dataset:

In [15]:
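# test() returns (number of samples, precision@1, recall@1); with a single
# label per example, precision@1 is the accuracy
test_results = model.test("test.txt")
print('Number of samples:', test_results[0])
print('Accuracy:', test_results[1])

Number of samples: 25000
Accuracy: 0.64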


With an accuracy of 0.64 on a balanced binary task (where chance level is 0.5), the default model is only modestly better than chance. Let's see if we can improve it!

## 3 - (3 points) Use the hyperparameters search functionality of FastText and repeat step 2

Creating the `hyper_validation` dataset from the second half of the shuffled test dataset:

In [16]:
test_size = len(test)//2

hyper_validation = test.iloc[test_size:,:]

display(hyper_validation)

Unnamed: 0,text,label
12500,this film is just really great i dont know why...,1
12501,challenge to be free was one of the first film...,1
12502,one of the worst movies ive ever seen absolute...,0
12503,the cinematography is the films shining featur...,1
12504,no doubt about it this is the animated short t...,1
...,...,...
24995,ok i wanted to see this because it had a few g...,0
24996,heart of darkness was terrible the novel was d...,0
24997,ive seen three of the animatrix episodes and t...,1
24998,this is a surprisingly great low budget horror...,1


Formatting this validation dataset to fasttext format and saving it to txt file:

In [17]:
fasttext_validation = pandas_to_fasttext(hyper_validation)
str_to_txt_file(fasttext_validation, 'hyper_validation.txt')

fasttext_validation.split('\n')[:10]

['__label__positive this film is just really great i don...',
 '__label__positive challenge to be free was one of the ...',
 '__label__negative one of the worst movies ive ever see...',
 '__label__positive the cinematography is the films shin...',
 '__label__positive no doubt about it this is the animat...',
 '__label__positive who should watch this film anyone wh...',
 '__label__negative i cant believe how dumb this movie t...',
 '__label__positive after all these years of solving cri...',
 '__label__positive the final installment sees sho aikaw...',
 '__label__negative im writing this as i watch the dvd i...']
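Because `hyper_validation` is the second half of `test`, the tuned model is later scored on `test.txt`, which contains the very rows used for tuning. One common way to avoid this overlap is to hold out part of the training data instead. Here is a minimal standard-library sketch, assuming a hypothetical helper `holdout_split` and an arbitrary 80/20 ratio and seed. `sklearn.model_selection.train_test_split`, imported earlier, offers the same idea with stratification support.

```python
import random

def holdout_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, validation)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split can be reproduced
    n_validation = int(len(rows) * validation_fraction)
    return rows[n_validation:], rows[:n_validation]

# Toy (label, text) pairs standing in for the 25000 training rows.
examples = [(i % 2, f"review {i}") for i in range(10)]
train_rows, validation_rows = holdout_split(examples)
print(len(train_rows), len(validation_rows))  # 8 2
```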

Training the hyperparameter optimized model with the train and validation dataset:

In [18]:
hyper_model = fasttext.train_supervised(input='train.txt', autotuneValidationFile='hyper_validation.txt')

Displaying optimized model results:

In [19]:
hyper_test_results = hyper_model.test("test.txt")
print('Number of samples:', hyper_test_results[0])
print('Accuracy:', hyper_test_results[1])

Number of samples: 25000
Accuracy: 0.65132


## 4 - (1 point) Look at the differences between the default model and the attributes found with hyperparameters search. How do the two models differ?

Displaying the performance of the two models on the same test dataset:

In [20]:
print('Number of samples:', hyper_test_results[0])
print('Accuracy (without hyperparameters search):\n', test_results[1])
print('Accuracy (with hyperparameters search):\n', hyper_test_results[1])

Number of samples: 25000
Accuracy (without hyperparameters search):
 0.64
Accuracy (with hyperparameters search):
 0.65132


Hyperparameter search gives a modest improvement in accuracy on the same test dataset: 0.64 without it and 0.65132 with it, a gain of about one percentage point.
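The question also asks how the two models differ in their attributes, not only in accuracy. Here is a minimal sketch of such a comparison, assuming the Python bindings expose training arguments as attributes on the model object (for example `lr`, `dim`, `epoch`, `wordNgrams`). This is an assumption about the installed fasttext version, so the helper falls back to `None` for anything missing. The helper name `diff_hyperparameters` is ours, and the stand-in objects below use made-up values purely to show the shape of the output.

```python
from types import SimpleNamespace

# Attribute names assumed to exist on a trained fasttext model.
HYPERPARAMETERS = ("lr", "dim", "ws", "epoch", "minCount", "minn", "maxn",
                   "wordNgrams", "loss", "bucket")

def diff_hyperparameters(default_model, tuned_model, names=HYPERPARAMETERS):
    """Return {name: (default value, tuned value)} for attributes that differ."""
    differences = {}
    for name in names:
        before = getattr(default_model, name, None)
        after = getattr(tuned_model, name, None)
        if before != after:
            differences[name] = (before, after)
    return differences

# Stand-in objects with invented values, NOT results from this notebook.
fake_default = SimpleNamespace(lr=0.1, dim=100, epoch=5, wordNgrams=1)
fake_tuned = SimpleNamespace(lr=0.1, dim=50, epoch=20, wordNgrams=2)
print(diff_hyperparameters(fake_default, fake_tuned))
# {'dim': (100, 50), 'epoch': (5, 20), 'wordNgrams': (1, 2)}
```

In the notebook this would be called as `diff_hyperparameters(model, hyper_model)`.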

## 5 - (1 point) Using the tuned model, take at least 2 wrongly classified examples from the test set, and try explaining why the model failed.

Getting all predictions of the tuned model and saving them to the `predictions` variable:

In [21]:
predictions = hyper_model.predict(test['text'].tolist())

Displaying the actual versus predicted labels with associated inputs:

In [22]:
print('\nActual values:')
display(fasttext_test.split('\n')[:5])

print('\nPredictions:')
display(predictions[0][:5])

print('\nThe model guessed wrongly on the following inputs:')

print('- Error 1: Was positive, guessed negative:')
print(test.text[1].replace('<br /><br />', ' '))

print('\n- Error 2: Was negative, guessed positive:')
print(test.text[5].replace('<br /><br />', ' '))


Actual values:


['__label__negative i happened to catch this on tv and w...',
 '__label__negative at last a film to rival el padrino a...',
 '__label__positive yes it feels and for the most part p...',
 '__label__negative i havent seen any other films by ant...',
 '__label__negative reading through most of the other re...']


Predictions:


[['__label__negative'],
 ['__label__negative'],
 ['__label__positive'],
 ['__label__positive'],
 ['__label__negative']]


The model guessed wrongly on the following inputs:
- Error 1: Was positive, guessed negative:
at last a film to rival el padrino and darkness falls in terms of sheer and utter dullness this is actually the first film ive ever given 1 out of 10 for on imdb and with good reasonbr br for one the cast is nothing special thats usually not a problem for me except that the only character thats in anyways interesting or different from all the rest is grand l bushs harrington secondly the production values a substandard  television scifi such as stargate has more convincing sets and all of the underwater scenes not handled by the sfx teams are filmed on dry sets with falling particles that arent very convincing this film is literally drydocked the worst part though is that this film is boring for the first 45 minutes i felt as if we were going round and round in circles its a prehistoric shark bullsht no really bullsht im not making this up bullsht there it is now i didnt see anything let me gu

We can see that the model guessed wrongly on inputs 1 and 5 of the test dataset.

After analysis, it seems that these inputs contain words that the model probably uses to determine whether or not an input is positive:

In input 1, the input is 'positive' but the model guessed 'negative'.
This is probably because the following negative words are present in it:
- cliché
- not-so-popular
- tragic
- disappointed

We can also notice that the first two sentences are obviously negative, which is probably a good part of why the model guessed wrongly.

In input 2, the input is 'negative' but the model guessed 'positive'.
This is probably because the following positive expressions are present in it:
- I agree
- his best sound film
- This is a Must-See film!

The last sentence, 'This is a Must-See film!', is probably what got the model to guess wrongly.
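Rather than picking rows by eye, the misclassified examples can be located by comparing each prediction with its true label. Here is a minimal sketch, assuming predictions come as a list of single-label lists like the one displayed above and true labels are the 0/1 integers of the `label` column. The helper name `misclassified_indices` and the `FASTTEXT_LABELS` mapping are ours.

```python
# Same convention as the FastText files: 0 -> negative, 1 -> positive.
FASTTEXT_LABELS = {0: "__label__negative", 1: "__label__positive"}

def misclassified_indices(true_labels, predicted_labels):
    """Return the positions where the top prediction differs from the truth."""
    return [
        index
        for index, (truth, predicted) in enumerate(zip(true_labels, predicted_labels))
        if FASTTEXT_LABELS[int(truth)] != predicted[0]
    ]

# The five actual labels and predictions displayed in the output above.
actual = [0, 0, 1, 0, 0]
predicted = [["__label__negative"], ["__label__negative"], ["__label__positive"],
             ["__label__positive"], ["__label__negative"]]
print(misclassified_indices(actual, predicted))  # [3]
```

In the notebook the call would be `misclassified_indices(test["label"], predictions[0])`; `test.iloc[i]` then gives the review and its label for each returned index `i`.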