# The dataset

The IMDB sentiment dataset is a collection of 50K movie reviews, annotated as positive or negative, and split into two sets of equal size: a training set and a test set. Both sets have an equal number of positive and negative reviews. The dataset is available in several libraries, but we ask that you use the HuggingFace [datasets](https://huggingface.co/datasets/imdb) version. Follow their [tutorial](https://huggingface.co/docs/datasets/load_hub) on how to use the library for more details.

Download and look at the dataset, and answer the following questions.
1. How many splits does the dataset have? (1 point)
2. How big are these splits? (1 point)
3. What is the proportion of each class on the supervised splits? (1 point)

In [1]:
!pip install datasets



In [2]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
     |████████████████████████████████| 68 kB 871 kB/s eta 0:00:01
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp38-cp38-linux_x86_64.whl size=4415765 sha256=ea4e57a0a2bc424e7365772306debc894eeb73d4c92eefb825e8c7319034819e
  Stored in directory: /home/yacine/.cache/pip/wheels/93/61/2a/c54711a91c418ba06ba195b1d78ff24fcaad8592f2a694ac94
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.2


In [3]:
!python -m spacy download en_core_web_sm

pyenv: python: command not found

The `python' command exists in these Python versions:
  3.10.4
  3.10.4/envs/ML
  3.10.4/envs/ML_3.10.4
  3.10.4/envs/PFEE
  3.10.4/envs/PYBD
  ML
  ML_3.10.4
  PFEE
  PYBD

Note: See 'pyenv help global' for tips on allowing both
      python2 and python3 to be found.


In [4]:
from datasets import load_dataset_builder
from datasets import load_dataset
import pandas as pd
import numpy as np

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 
import re
import spacy

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

database_name = "imdb"
ds_builder = load_dataset_builder(database_name)
print(ds_builder.info.description)
print(ds_builder.info.features)

dataset = load_dataset(database_name)

2022-10-11 09:47:02.472041: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-10-11 09:47:07.187969: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-10-11 09:47:07.188045: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (yacine-ROG-Strix-G533ZW-G533ZW): /proc/driver/nvidia/version does not exist


Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}


Reusing dataset imdb (/home/yacine/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


  0%|          | 0/3 [00:00<?, ?it/s]

In [5]:
from datasets import get_dataset_split_names
print("Split names", get_dataset_split_names(database_name))
dataset

Split names ['train', 'test', 'unsupervised']


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

We can see that this dataset has 3 splits. The "train" and "test" splits have 25000 rows each, and the "unsupervised" split has 50000 rows.

In [6]:
# To start, we convert the two supervised splits to pandas and concatenate them into a third dataframe
train = dataset["train"].to_pandas()
test = dataset["test"].to_pandas()
supervised = pd.concat([train, test])

# Then we look at the size of each split and at the class counts
print("Test values count : {0}".format(len(test)))
print("Train values count : {0}".format(len(train)))
supervised["label"].value_counts()

Test values count : 25000
Train values count : 25000


0    25000
1    25000
Name: label, dtype: int64

Let's now have a look at our dataframe and our data:

In [7]:
supervised.head()

| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


We can see that there are as many positive as negative reviews in the supervised splits. Indeed, each class has 25000 occurrences, which is a proportion of 50% per class.
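
The proportions asked for in question 3 follow directly from the counts. Here is a minimal sketch, using a small hand-made list of labels as a stand-in for the `label` column (the `class_proportions` helper is hypothetical and not part of the notebook). On the real dataframe, `supervised["label"].value_counts(normalize=True)` should give the same kind of result.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: share of rows carrying that label}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy stand-in for the label column: as many 0s (negative) as 1s (positive)
toy_labels = [0] * 4 + [1] * 4
proportions = class_proportions(toy_labels)
print(proportions)
```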

## Shuffling the dataset

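The supervised splits are ordered by label (the first rows displayed above are all negative), so to get more coherent results it is necessary to shuffle the dataset (we will also need this for the hyperparameter optimization later), as follows: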

In [8]:
def shuffle(dataset):
  return dataset.sample(frac=1).reset_index(drop=True)

In [9]:
train = shuffle(train)
test = shuffle(test)
supervised = shuffle(supervised)

supervised.head()

| | text | label |
|---|---|---|
| 0 | I had the opportunity to see this film twice a... | 1 |
| 1 | Having been driven out of the house and into t... | 1 |
| 2 | the only enjoyable thing about this highly moc... | 0 |
| 3 | "A Family Affair" takes us back to a less comp... | 1 |
| 4 | `See Dick work.<br /><br />See Jane work.<br />...` | 0 |


We can see that the rows of the datasets have indeed been shuffled, as compared to their previous order.
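
Because `sample(frac=1)` draws a new random order on every run, figures such as the accuracy can vary slightly between executions. If reproducibility matters, a fixed seed can be used. The sketch below illustrates the idea with the standard library on a toy list of rows; `shuffle_rows` and the seed value 42 are illustrative assumptions. With pandas, passing `random_state=42` to `sample` plays the same role.

```python
import random

def shuffle_rows(rows, seed=42):
    """Return a shuffled copy of rows; the same seed always gives the same order."""
    shuffled = list(rows)                  # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # private generator, seeded for reproducibility
    return shuffled

# Toy rows in the form (text, label)
rows = [("review %d" % i, i % 2) for i in range(6)]
first = shuffle_rows(rows)
second = shuffle_rows(rows)
print(first == second)                 # compares two runs with the same seed
print(sorted(first) == sorted(rows))   # checks that only the order changed
```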

## 1 - (2 points) Turn the dataset into a dataset compatible with FastText (see the Tips on using FastText section a bit lower)

The FastText dataset format is:

```__label__<your_label> <corresponding text>```

For convenience, we will replace the original labels (0 and 1) by their intended meaning, "negative" and "positive" respectively.

In [10]:
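def pandas_to_fasttext(pandas_data):
  '''Convert a pandas dataset to a string of fasttext format'''
  # Cast every column to str so the label can be concatenated with the text
  pandas_data = pandas_data.astype('str')
  fasttext_data = '__label__' + pandas_data['label'] + ' ' + pandas_data['text']
  # Caution: to_string() abbreviates long values with '...' (visible in the output below)
  fasttext_str = fasttext_data.to_string(index=False)
  # Caution: this replaces every '0' and '1' in the string, including digits inside the reviews
  fasttext_str = fasttext_str.replace('0', 'negative').replace('1', 'positive')
  return fasttext_str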

def str_to_txt_file(text, filename):
  '''Saves a string as a txt file'''
  # The with-statement closes the file even if the write fails
  with open(filename, "w") as file:
    file.write(text)

We now convert each dataset (train, test and supervised) to fasttext format using the above functions, and save them to text files for later use:

In [11]:
fasttext_train = pandas_to_fasttext(train)
str_to_txt_file(fasttext_train, 'train.txt')

fasttext_test = pandas_to_fasttext(test)
str_to_txt_file(fasttext_test, 'test.txt')

fasttext_supervised = pandas_to_fasttext(supervised)
str_to_txt_file(fasttext_supervised, 'supervised.txt')

fasttext_train.split('\n')[:10]

['__label__negative I actually saw this movie at a theat...',
 '__label__positive I think this movie would be more enj...',
 '__label__positive I really enjoyed watching this movie...',
 "__label__positive Let's cut to the chase: If you're a ...",
 '__label__negative It is quite simple. Friends is a com...',
 '__label__negative So far after week two of "The lone o...',
 '__label__positive Here, on IMDb.com I read an opinion,...',
 '__label__positive I just watched I. Q. again tonight a...',
 '__label__negative I rented this flick for one reason T...',
 '__label__positive Casting unknown Michelle Rodriguez a...']

The output above shows the result of the fastText formatting, which corresponds to the expected format.
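
Each line in the sample above ends with `...`, which suggests that `to_string` abbreviated the reviews before they were written to disk. The global `replace('0', ...)` also rewrites digits that appear inside the reviews. A possible alternative, sketched here under the assumption that each row is available as a `(label, text)` pair (for instance via `zip(train['label'], train['text'])`), builds every line explicitly. The names `LABEL_NAMES` and `to_fasttext_line` are hypothetical, and the two reviews are toy examples.

```python
LABEL_NAMES = {0: "negative", 1: "positive"}

def to_fasttext_line(label, text):
    """Format one example as '__label__<name> <text>' on a single line."""
    # fastText reads one example per line, so all whitespace runs
    # (including line breaks) are collapsed to single spaces
    flat_text = " ".join(text.split())
    return "__label__" + LABEL_NAMES[label] + " " + flat_text

# Toy rows: digits inside the text must stay untouched
rows = [
    (1, "A solid 10/10, I watched it\ntwice."),
    (0, "Released in 2001 and already forgotten."),
]
fasttext_str = "\n".join(to_fasttext_line(label, text) for label, text in rows)
print(fasttext_str)
```

Only the label is mapped to a name here, so the text is written in full and unchanged apart from whitespace.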

## 2 - (2 points) Train a FastText classifier with default parameters on the training data, and evaluate it on the test data using accuracy.

In [12]:
import fasttext

Training a simple classifier on the formatted training dataset:

In [13]:
model = fasttext.train_supervised(input='train.txt')

Read 0M words
Number of words:  26144
Number of labels: 2
Progress: 100.0% words/sec/thread:  609972 lr:  0.000000 avg.loss:  0.556414 ETA:   0h 0m 0s


Displaying the results of the model on the formatted test dataset:

In [14]:
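test_results = model.test("test.txt")
# test() returns (number of samples, precision@1, recall@1);
# with exactly one label per review, precision@1 is the accuracy
print('Number of samples:', test_results[0])
print('Accuracy:', test_results[1])

Number of samples: 25000
Accuracy: 0.62468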
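
As a sanity check, an accuracy of this kind can be recomputed by hand from a list of true labels and a list of predicted labels. The sketch below uses toy label lists, and `accuracy` is a hypothetical helper rather than a fastText function.

```python
def accuracy(true_labels, predicted_labels):
    """Share of examples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Toy data: three of the four predictions match the true label
true_labels = ["positive", "negative", "negative", "positive"]
predicted_labels = ["positive", "positive", "negative", "positive"]
score = accuracy(true_labels, predicted_labels)
print(score)
```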


## 3 - (3 points) Use the hyperparameters search functionality of FastText and repeat step 2

We create the `hyper_validation` dataset by taking the second half of the test dataset:

In [15]:
test_size = len(test)//2

hyper_validation = test.iloc[test_size:,:]

display(hyper_validation)

| | text | label |
|---|---|---|
| 12500 | Personally, I LOVED TRIS MOVIE! My best friend... | 1 |
| 12501 | Is there any question that Jeffrey Combs is on... | 1 |
| 12502 | I swear I didn't mean to! I picked this out on... | 0 |
| 12503 | This film is full of charming situations and h... | 1 |
| 12504 | The authors know nothing about Russians prison... | 0 |
| ... | ... | ... |
| 24995 | I had never even heard of ONE DARK NIGHT until... | 0 |
| 24996 | I went straight to the big screen to view this... | 1 |
| 24997 | I think Samuel Goldwyn was trying to accomplis... | 1 |
| 24998 | I had never heard of Dead Man's Bounty when I ... | 0 |
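
Note that `test.txt` still contains these validation rows, so a model tuned on `hyper_validation.txt` is later scored partly on data it was tuned on. One way to avoid this overlap, sketched here on a toy list, is to cut the shuffled test set into two disjoint halves and keep the first one for the final evaluation. `split_in_half` is a hypothetical helper; on the dataframe, `test.iloc[:test_size]` and `test.iloc[test_size:]` would be the two parts.

```python
def split_in_half(rows):
    """Split rows into two disjoint parts: (evaluation half, validation half)."""
    middle = len(rows) // 2
    return rows[:middle], rows[middle:]

rows = list(range(10))   # stand-in for the shuffled test rows
final_test_rows, validation_rows = split_in_half(rows)
print(len(final_test_rows), len(validation_rows))
print(set(final_test_rows) & set(validation_rows))   # rows shared by both halves
```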


Formatting this validation dataset to the fastText format and saving it to a txt file:

In [16]:
fasttext_validation = pandas_to_fasttext(hyper_validation)
str_to_txt_file(fasttext_validation, 'hyper_validation.txt')

fasttext_validation.split('\n')[:10]

['__label__positive Personally, I LOVED TRIS MOVIE! My b...',
 '__label__positive Is there any question that Jeffrey C...',
 "__label__negative I swear I didn't mean to! I picked t...",
 '__label__positive This film is full of charming situat...',
 '__label__negative The authors know nothing about Russi...',
 '__label__negative Yes, I admire the independent spirit...',
 '__label__negative I am a fan of a few of the Vacation ...',
 '__label__positive This super creepy Southern Gothic me...',
 '__label__positive The Regard of Flight and the Clown B...',
 '__label__negative Bad plot, bad acting, bad direction....']

Training the hyperparameter-optimized model with the training and validation datasets:

In [None]:
hyper_model = fasttext.train_supervised(input='train.txt', autotuneValidationFile='hyper_validation.txt')

Progress:   3.0% Trials:   10 Best score:  0.630480 ETA:   0h 4m50s

Displaying optimized model results:

In [None]:
hyper_test_results = hyper_model.test("test.txt")
print('Number of samples:', hyper_test_results[0])
print('Accuracy:', hyper_test_results[1])

## 4 - (1 point) Look at the differences between the default model and the attributes found with hyperparameters search. How do the two models differ?

Displaying the performance of the two models on the same test dataset:

In [None]:
print('Number of samples:', hyper_test_results[0])
print('Accuracy (without hyperparameters search):\n', test_results[1])
print('Accuracy (with hyperparameters search):\n', hyper_test_results[1])

On the same test dataset, the accuracy rises from 0.62992 without hyperparameters search to 0.6498 with it, an improvement of about 2 percentage points.
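
The question also asks how the two models differ beyond accuracy. Assuming the fastText Python model objects expose their training arguments as attributes (for example `lr`, `epoch`, `dim`, `wordNgrams`; this is an assumption about the installed version), they could be compared side by side. The sketch uses `SimpleNamespace` stand-ins with made-up values so that it runs without fastText. `diff_hyperparameters` is a hypothetical helper, to which `model` and `hyper_model` would be passed instead.

```python
from types import SimpleNamespace

def diff_hyperparameters(default_model, tuned_model, names):
    """Return {name: (default value, tuned value)} for the attributes that differ."""
    differences = {}
    for name in names:
        default_value = getattr(default_model, name, None)
        tuned_value = getattr(tuned_model, name, None)
        if default_value != tuned_value:
            differences[name] = (default_value, tuned_value)
    return differences

# Stand-ins with illustrative values only, not results from the notebook
default_like = SimpleNamespace(lr=0.1, epoch=5, dim=100, wordNgrams=1)
tuned_like = SimpleNamespace(lr=0.1, epoch=20, dim=100, wordNgrams=2)

changed = diff_hyperparameters(
    default_like, tuned_like, ["lr", "epoch", "dim", "wordNgrams"]
)
print(changed)
```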

## 5 - (1 point) Using the tuned model, take at least 2 wrongly classified examples from the test set, and try explaining why the model failed.

Getting all predictions of the tuned model and saving them to the `predictions` variable:

In [None]:
predictions = hyper_model.predict(test['text'].tolist())

Displaying the actual versus predicted labels with associated inputs:

In [None]:
print('\nActual values:')
display(fasttext_test.split('\n')[:5])

print('\nPredictions:')
display(predictions[0][:5])

print('\nThe model guessed wrongly on the following inputs:')

print('- Error 1: Was positive, guessed negative:')
print(test.text[1].replace('<br /><br />', ' '))

print('\n- Error 2: Was negative, guessed positive:')
print(test.text[5].replace('<br /><br />', ' '))

We can see that the model guessed wrongly on inputs 1 and 5 of the test dataset.
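
Rather than spotting errors by eye, the misclassified rows can be listed programmatically. The sketch below assumes predictions shaped like the first element returned by `predict` on a list of texts, that is one list of `__label__...` strings per example. It runs on toy stand-ins, and `misclassified_indices` is a hypothetical helper.

```python
def misclassified_indices(true_names, predicted_labels, prefix="__label__"):
    """Return the positions where the top predicted label differs from the true one."""
    wrong = []
    for index, (true_name, labels) in enumerate(zip(true_names, predicted_labels)):
        top_label = labels[0]                  # first (best) label for this example
        if top_label.startswith(prefix):       # strip the fastText label prefix
            top_label = top_label[len(prefix):]
        if top_label != true_name:
            wrong.append(index)
    return wrong

# Toy stand-ins: true class names and fastText-style predictions
true_names = ["negative", "positive", "positive", "negative"]
predicted_labels = [["__label__negative"], ["__label__negative"],
                    ["__label__positive"], ["__label__positive"]]
errors = misclassified_indices(true_names, predicted_labels)
print(errors)
```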

After analysis, it seems that these inputs contain words that the model probably uses to determine whether an input is positive or negative:

In input 1, the input is 'positive' but the model guessed 'negative'.
This is probably because the following negative words are present in it:
- cliché
- not-so-popular
- tragic
- disappointed

We can also notice that the first two sentences are obviously negative, which is probably a good part of why the model guessed wrongly.

In input 5, the input is 'negative' but the model guessed 'positive'.
This is probably because the following positive expressions are present in it:
- I agree
- his best sound film
- This is a Must-See film!

The last sentence, 'This is a Must-See film!', is probably what got the model to guess wrongly.
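
To back this kind of explanation with a quick check, one can list which hand-picked cue words occur in a review. This is only a rough sketch: the cue lists are assumptions taken from the observations above, the review is a toy example, `find_cues` is a hypothetical helper, and simple substring matching says nothing about how the model actually weighs each word.

```python
import re

def find_cues(text, cues):
    """Return the cue words or phrases that occur in text (case-insensitive)."""
    cleaned = re.sub(r"<br\s*/?>", " ", text).lower()   # drop HTML line breaks
    return [cue for cue in cues if cue.lower() in cleaned]

# Hand-picked cue lists (assumptions, not learned from the model)
negative_cues = ["cliché", "tragic", "disappointed"]
positive_cues = ["must-see", "his best"]

toy_review = "A tragic story.<br /><br />I was not disappointed: this is a Must-See film!"
found_negative = find_cues(toy_review, negative_cues)
found_positive = find_cues(toy_review, positive_cues)
print(found_negative)
print(found_positive)
```

The toy review is positive overall yet contains two negative cues, which is the kind of mix that can mislead a bag-of-words style model.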