# The dataset

The IMDB sentiment dataset is a collection of 50K movie reviews, annotated as positive or negative, and split into two sets of equal size: a training set and a test set. Both sets have an equal number of positive and negative reviews. The dataset is available in several libraries, but we ask that you use the HuggingFace [datasets](https://huggingface.co/datasets/imdb) version. Follow their [tutorial](https://huggingface.co/docs/datasets/load_hub) on how to use the library for more details.

Download and look at the dataset, and answer the following questions.
1. How many splits does the dataset have? (1 point)
2. How big are these splits? (1 point)
3. What is the proportion of each class in the supervised splits? (1 point)

In [1]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!pip install fasttext

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
!python -m spacy download en_core_web_sm

2022-09-30 19:08:41.306441: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
     |████████████████████████████████| 12.8 MB 7.0 MB/s
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
from datasets import load_dataset_builder
from datasets import load_dataset
import pandas as pd
import numpy as np

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 
import re
import spacy

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

database_name = "imdb"
ds_builder = load_dataset_builder(database_name)
print(ds_builder.info.description)
print(ds_builder.info.features)

dataset = load_dataset(database_name)



Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}





In [5]:
from datasets import get_dataset_split_names
print("Split names", get_dataset_split_names(database_name))
dataset

Split names ['train', 'test', 'unsupervised']


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

We can see that this dataset has 3 splits. The "train" and "test" splits have 25000 rows each, and the "unsupervised" split has 50000 rows.

In [6]:
# To start, we build 3 dataframes: train, test, and their concatenation (supervised)
train = dataset["train"].to_pandas()
test = dataset["test"].to_pandas()
supervised = pd.concat([train, test])

# Then we look at the size of each split and at the class distribution
print("Test values count : {0}".format(len(test)))
print("Train values count : {0}".format(len(train)))
supervised["label"].value_counts()

Test values count : 25000
Train values count : 25000


0    25000
1    25000
Name: label, dtype: int64

Let's now have a look at our dataframe:

In [7]:
supervised.head()

| | text | label |
| --- | --- | --- |
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


We can see that there are as many positive as negative reviews in the supervised split. Indeed, each class has 25000 occurrences, so each class makes up 50% of the data.
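
The class proportions asked for in question 3 can also be computed directly instead of being read off the counts. This is a minimal sketch, assuming a dataframe with a `label` column like `supervised`; the small `toy` frame is a hypothetical stand-in so that the snippet runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for the `supervised` dataframe (0 = neg, 1 = pos).
toy = pd.DataFrame({"text": ["bad", "great", "awful", "fine"],
                    "label": [0, 1, 0, 1]})

# normalize=True returns fractions of the total instead of raw counts.
proportions = toy["label"].value_counts(normalize=True).sort_index()
print(proportions)
```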

## Shuffling the dataset

To get more coherent results, it is necessary to shuffle the dataset (we will also need it for the hyperparameter optimization later), as follows:

In [8]:
def shuffle(dataset):
  return dataset.sample(frac=1).reset_index(drop=True)

In [9]:
train = shuffle(train)
test = shuffle(test)
supervised = shuffle(supervised)

supervised.head()

| | text | label |
| --- | --- | --- |
| 0 | Eleven years ago, Stanley Ipkiss released his ... | 0 |
| 1 | I went into this movie after having read it wa... | 1 |
| 2 | Style but no substance. Not as funny as it sho... | 0 |
| 3 | This movie, quite literally, does not have one... | 0 |
| 4 | This film is so wonderful it captures the gami... | 1 |


We can see that the rows of the datasets have indeed been shuffled, as compared to their previous order.
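
Because `sample(frac=1)` draws a new random order on every run, the rows shown above will differ between executions. If reproducibility matters, one option is to pass a seed. This is a minimal sketch; `shuffle_seeded` and the seed value are illustrative choices and not part of the original notebook:

```python
import pandas as pd

def shuffle_seeded(df, seed=42):
    """Return the rows of df in a random but repeatable order."""
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy dataframe standing in for train/test/supervised.
toy = pd.DataFrame({"text": list("abcdefgh"), "label": [0, 1] * 4})
first = shuffle_seeded(toy)
second = shuffle_seeded(toy)  # same seed, so same order as `first`
```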

## 1 - (2 points) Turn the dataset into a dataset compatible with FastText (see the Tips on using FastText section a bit lower)

The FastText dataset format is:

```__label__<your_label> <corresponding text>```

For convenience, we will replace the original labels (0 and 1) by their intended meaning, "negative" and "positive" respectively.

In [10]:
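def pandas_to_fasttext(pandas_data):
  '''Convert a pandas dataset to a string in fasttext format'''
  pandas_data = pandas_data.astype('str')
  fasttext_data = '__label__' + pandas_data['label'] + ' ' + pandas_data['text']
  # Caveat: to_string() is a display rendering, so long reviews are cut off with '...'
  fasttext_str = fasttext_data.to_string(index=False)
  # Caveat: these replacements run over the whole string, so every '0' and '1'
  # inside the review text is rewritten too, not only the label
  fasttext_str = fasttext_str.replace('0', 'negative').replace('1', 'positive')
  return fasttext_str

def str_to_txt_file(text, filename):
  '''Save a string to a text file'''
  # The context manager closes the file even if the write fails
  with open(filename, "w") as file:
    file.write(text)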

We now convert each dataset (train, test and supervised) to fasttext format using the above functions, and save them to text files for later use:

In [11]:
fasttext_train = pandas_to_fasttext(train)
str_to_txt_file(fasttext_train, 'train.txt')

fasttext_test = pandas_to_fasttext(test)
str_to_txt_file(fasttext_test, 'test.txt')

fasttext_supervised = pandas_to_fasttext(supervised)
str_to_txt_file(fasttext_supervised, 'supervised.txt')

fasttext_train.split('\n')[:10]

['__label__positive Most 7negatives (and 8negatives) Kong Kong martial...',
 '__label__positive This film, in my opinion, is, despit...',
 '__label__negative Oh, man, how low serials had fallen ...',
 '__label__negative I am a massive fan of the book and O...',
 '__label__positive After a big tip of the hat to Spinal...',
 '__label__negative This is probably the best horror fil...',
 '__label__negative By my "Kool-Aid drinkers" remark, I ...',
 '__label__negative I loved "The Curse of Frankenstein" ...',
 "__label__positive If you a purist, don't waste your ti...",
 '__label__negative Probably one of the most boriest sla...']

We can see above the result of the fasttext formatting, which corresponds to the expected format.
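
The sample above also shows reviews ending in `...` and digits in the text turned into words (`7negatives`), which means the files do not contain the full, unmodified reviews. A possible alternative, sketched here under the assumption that labels are the integers 0 and 1, maps only the label and keeps the whole text. The names `LABEL_NAMES` and `to_fasttext_lines` are hypothetical:

```python
LABEL_NAMES = {0: "negative", 1: "positive"}  # assumed label mapping

def to_fasttext_lines(labels, texts):
    """Build one '__label__<name> <text>' line per example."""
    lines = []
    for label, text in zip(labels, texts):
        # FastText reads one example per line, so collapse any whitespace
        # (including newlines) inside the review into single spaces.
        clean_text = " ".join(str(text).split())
        lines.append(f"__label__{LABEL_NAMES[int(label)]} {clean_text}")
    return lines

lines = to_fasttext_lines([1, 0], ["Great 80s movie, 10/10", "Boring.\nAvoid."])
print("\n".join(lines))
```

With the notebook's dataframes this could be called as `to_fasttext_lines(train['label'], train['text'])`, and the joined result written with `str_to_txt_file`.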

## 2 - (2 points) Train a FastText classifier with default parameters on the training data, and evaluate it on the test data using accuracy.

In [12]:
import fasttext

Training a simple classifier on the formatted training dataset:

In [13]:
model = fasttext.train_supervised(input='train.txt')

Displaying the results of the model on the formatted test dataset:

In [14]:
test_results = model.test("test.txt")
print('Number of samples:', test_results[0])
print('Accuracy:', test_results[1])

Number of samples: 25000
Accuracy: 0.63036
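
`model.test` returns a tuple holding the number of examples, the precision at 1 and the recall at 1. With exactly one label per review, precision at 1 coincides with accuracy, which is why `test_results[1]` is reported as the accuracy here. The same quantity can be recomputed by hand. This is a minimal sketch on hypothetical toy labels, assuming predictions shaped like the first element returned by `model.predict` on a list of texts (one list of labels per input):

```python
def accuracy(predicted, actual):
    """Fraction of examples whose top predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(pred[0] == true for pred, true in zip(predicted, actual))
    return correct / len(actual)

# Toy data: one list of labels per input, and one true label per input.
predicted = [["__label__positive"], ["__label__negative"],
             ["__label__positive"], ["__label__negative"]]
actual = ["__label__positive", "__label__negative",
          "__label__negative", "__label__negative"]
score = accuracy(predicted, actual)
print(score)
```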


## 3 - (3 points) Use the hyperparameters search functionality of FastText and repeat step 2

Creating the hyper_validation dataset by taking the second half of the test dataset:

In [15]:
test_size = len(test)//2

hyper_validation = test.iloc[test_size:,:]

display(hyper_validation)

| | text | label |
| --- | --- | --- |
| 12500 | "Boy Next Door" is a hilarious romp through ma... | 1 |
| 12501 | I was shocked at how bad it was and unable to ... | 0 |
| 12502 | I remembered this awful movie I bought at Came... | 0 |
| 12503 | An interesting movie with Jordana Brewster as ... | 1 |
| 12504 | Jane Russell was an underrated comedienne and ... | 0 |
| ... | ... | ... |
| 24995 | Once you sit down to see this film " A Cannon ... | 1 |
| 24996 | I watched this film on the advice of a friend ... | 0 |
| 24997 | The MTV sci-fi animated series "Æon Flux" is b... | 0 |
| 24998 | A new and innovative show with a great cast th... | 1 |
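
Since `hyper_validation` is half of the test set, the tuned model is later evaluated partly on data that guided the search. One alternative is to hold out part of the training data instead, so that the test set stays unseen. This is a minimal sketch, assuming a dataframe shaped like `train`; the function name, the 20% fraction and the seed are illustrative:

```python
import pandas as pd

def split_train_validation(df, validation_fraction=0.2, seed=0):
    """Split df into disjoint training and validation parts.

    Assumes df has a unique index, as it does after reset_index(drop=True).
    """
    validation = df.sample(frac=validation_fraction, random_state=seed)
    training = df.drop(validation.index)
    return training.reset_index(drop=True), validation.reset_index(drop=True)

# Toy dataframe standing in for `train`.
toy = pd.DataFrame({"text": [f"review {i}" for i in range(10)],
                    "label": [0, 1] * 5})
train_part, validation_part = split_train_validation(toy)
```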


Formatting this validation dataset to fasttext format and saving it to txt file:

In [16]:
fasttext_validation = pandas_to_fasttext(hyper_validation)
str_to_txt_file(fasttext_validation, 'hyper_validation.txt')

fasttext_validation.split('\n')[:10]

['__label__positive "Boy Next Door" is a hilarious romp ...',
 '__label__negative I was shocked at how bad it was and ...',
 '__label__negative I remembered this awful movie I boug...',
 '__label__positive An interesting movie with Jordana Br...',
 '__label__negative Jane Russell was an underrated comed...',
 '__label__positive I was blubbing like an idiot during ...',
 '__label__negative Great book, poorly done movie. Chees...',
 '__label__negative This movie was definitely on the bor...',
 '__label__positive From the offset, I knew this was goi...',
 "__label__positive What's the matter with you people? J..."]

Training the hyperparameter-optimized model with the training and validation datasets:

In [17]:
hyper_model = fasttext.train_supervised(input='train.txt', autotuneValidationFile='hyper_validation.txt')

Displaying optimized model results:

In [18]:
hyper_test_results = hyper_model.test("test.txt")
print('Number of samples:', hyper_test_results[0])
print('Accuracy:', hyper_test_results[1])

Number of samples: 25000
Accuracy: 0.64376


## 4 - (1 point) Look at the differences between the default model and the attributes found with hyperparameters search. How do the two models differ?

Displaying the performance of the two models on the same test dataset:

In [19]:
print('Number of samples:', hyper_test_results[0])
print('Accuracy (without hyperparameters search):\n', test_results[1])
print('Accuracy (with hyperparameters search):\n', hyper_test_results[1])

Number of samples: 25000
Accuracy (without hyperparameters search):
 0.63036
Accuracy (with hyperparameters search):
 0.64376


We can see an improvement in model accuracy when using hyperparameter search: from 0.63036 to 0.64376 on the same test dataset, a gain of about 1.3 percentage points.
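
The accuracy gap alone does not show which settings the search changed. Recent versions of the `fasttext` Python package are assumed here to expose the training arguments of a model as attributes such as `lr`, `dim`, `epoch` and `wordNgrams`; if that holds for the installed version, the two models can be compared attribute by attribute. The sketch below uses `SimpleNamespace` objects with made-up values as stand-ins for the models, and `diff_attributes` is a hypothetical helper:

```python
from types import SimpleNamespace

def diff_attributes(model_a, model_b, names):
    """Return {name: (value_a, value_b)} for the attributes that differ."""
    differences = {}
    for name in names:
        value_a = getattr(model_a, name, None)
        value_b = getattr(model_b, name, None)
        if value_a != value_b:
            differences[name] = (value_a, value_b)
    return differences

# Stand-ins with invented values, NOT results from the notebook.
default_like = SimpleNamespace(lr=0.1, dim=100, epoch=5, wordNgrams=1)
tuned_like = SimpleNamespace(lr=0.1, dim=100, epoch=25, wordNgrams=2)
changed = diff_attributes(default_like, tuned_like,
                          ["lr", "dim", "epoch", "wordNgrams"])
print(changed)
```

With the real models, the call would be `diff_attributes(model, hyper_model, [...])` with the attribute names of interest.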

## 5 - (1 point) Using the tuned model, take at least 2 wrongly classified examples from the test set, and try explaining why the model failed.

Getting all predictions of the tuned model and saving them to the `predictions` variable:

In [20]:
predictions = hyper_model.predict(test['text'].tolist())

Displaying the actual versus predicted labels with associated inputs:

In [21]:
print('\nActual values:')
display(fasttext_test.split('\n')[:5])

print('\nPredictions:')
display(predictions[0][:5])

print('\nThe model guessed wrongly on the following inputs:')

print('- Error 1: Was positive, guessed negative:')
print(test.text[1].replace('<br /><br />', ' '))

print('\n- Error 2: Was negative, guessed positive:')
print(test.text[5].replace('<br /><br />', ' '))


Actual values:


['__label__positive The feature length CGI movie has jus...',
 '__label__negative I will never go to another Tarantino...',
 '__label__positive A study of one of those universally ...',
 '__label__negative Target is the story of a special age...',
 '__label__positive I had always heard about this great ...']


Predictions:


[['__label__positive'],
 ['__label__negative'],
 ['__label__positive'],
 ['__label__negative'],
 ['__label__positive']]


The model guessed wrongly on the following inputs:
- Error 1: Was positive, guessed negative:
I will never go to another Tarantino movie again. The entire film was worthless. My wife and I both regret that we didn't get up and walk out at the first indication of what the film was really going to be about (which is still hard to determine since it was such a ridiculous storyline...blood, guts, and violence seemed to be the only real theme), but we kept hoping there'd be something redeeming just around the corner. Unfortunately, there wasn't because there wasn't anything that made sense! We, along with a lot of the other people in the audience walked out of the theater muttering "that was disgusting", "what a waste of time", "I should've walked out", "where was the comedy", "that was pathetic", etc. It actually made us, the audience, voice our disgust and the feeling that we had just been thoroughly ripped off. The only thing of merit in the film was the costuming and the acting ability 

We can see that the model guessed wrongly on inputs 1 and 5 of the test dataset.

After analysis, it seems that these inputs contain words that the model probably uses to decide whether an input is positive or negative:

In input 1, the input is 'positive' but the model guessed 'negative'.
This is probably because the following negative words are present in it:
- cliché
- not-so-popular
- tragic
- disappointed

We can also notice that the first two sentences are obviously negative, which is probably a good part of why the model guessed wrongly.

In input 2, the input is 'negative' but the model guessed 'positive'.
This is probably because the following positive expressions are present in it:
- I agree
- his best sound film
- This is a Must-See film!

The last sentence, 'This is a Must-See film!', is probably what got the model to guess wrongly.
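
Misclassified reviews can also be located programmatically instead of by inspecting the first few rows. This is a minimal sketch, assuming the true labels are the integers from `test['label']`, the predictions have the shape of `predictions[0]`, and the label names follow the `negative`/`positive` mapping used earlier. `misclassified_indices` is a hypothetical helper and the data is a toy example:

```python
LABEL_TO_FASTTEXT = {0: "__label__negative", 1: "__label__positive"}

def misclassified_indices(true_labels, predicted):
    """Return the positions where the top prediction differs from the truth."""
    return [
        i
        for i, (true, pred) in enumerate(zip(true_labels, predicted))
        if LABEL_TO_FASTTEXT[int(true)] != pred[0]
    ]

# Toy data: integer true labels and one list of predicted labels per input.
true_labels = [1, 0, 1, 0]
predicted = [["__label__positive"], ["__label__positive"],
             ["__label__negative"], ["__label__negative"]]
errors = misclassified_indices(true_labels, predicted)
print(errors)
```

On the notebook's data, `misclassified_indices(test['label'], predictions[0])` would give positions that can then be looked up with `test.text.iloc[...]`.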