# FastText and Word Vector (TP n°3)

In [3]:
import os

import numpy as np

from datasets import load_dataset

from nltk import tokenize
from nltk.stem.snowball import SnowballStemmer

import spacy
import fasttext as fast
#import transformers

from typing import Dict
from typing import Callable
from typing import List
import re

# set a defined random generator, better for reproducible results.
random = np.random.default_rng(42)

## A look at the IMDB dataset:

In [4]:
imdb = load_dataset('imdb')
print(imdb)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


The train and test splits have the following number of entries:

In [5]:
print(f"train entries: {len(imdb['train'])}\ntest entries: {len(imdb['test'])}")

train entries: 25000
test entries: 25000


## Convert the dataset to the FastText format:

Generate shuffled index lists (one for the train split, one for the test split):

In [6]:
rand_idx = np.arange(len(imdb['train']))
np.random.shuffle(rand_idx)
print(rand_idx[:10])
rand_idy = np.arange(len(imdb['train']))
np.random.shuffle(rand_idy)
print(rand_idy[:10])

[ 7615  7076  4795  2022 12317 22115 23608 20182 19417  1655]
[23422  8434  2358  8980 15792  7655 20818 11901  1219 22105]
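The cell above shuffles with `np.random.shuffle`, which draws from NumPy's global random state and not from the seeded generator `random` created in the first cell, so the printed indices are not reproducible from one run to the next. A minimal sketch of a reproducible variant, assuming it is acceptable to use one seeded generator per split (the seeds 42 and 43 and the names `train_rng`, `test_rng`, `train_order` and `test_order` are arbitrary choices for illustration):

```python
import numpy as np

n_rows = 25000  # size of each IMDB split, as printed above

# One seeded generator per split (seeds are arbitrary, chosen for illustration).
train_rng = np.random.default_rng(42)
test_rng = np.random.default_rng(43)

# permutation(n) returns a shuffled copy of np.arange(n).
train_order = train_rng.permutation(n_rows)
test_order = test_rng.permutation(n_rows)

# Re-creating a generator with the same seed reproduces the same order.
same_order = np.random.default_rng(42).permutation(n_rows)
```

With this approach, rerunning the notebook from scratch would write the lines of the FastText files in the same order each time.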


Write the IMDB dataset to files in the FastText format:

In [7]:
%%time

if not os.path.exists("imdb_train.txt"):
    with open("imdb_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            s = f"__label__{entry['label']} {entry['text']}\n".encode("utf-8")
            f.write(s)
        
if not os.path.exists("imdb_test.txt"):
    with open("imdb_test.txt", "wb") as f:
        for i in rand_idy:
            entry = imdb['test'][int(i)]
            s = f"__label__{entry['label']} {entry['text']}\n".encode("utf-8")
            f.write(s)

CPU times: user 317 µs, sys: 89 µs, total: 406 µs
Wall time: 251 µs


Let's see the input format of an entry:

In [8]:
!head -n 1 imdb_train.txt

__label__0 Omen IV: The Awakening starts at the 'St. Frances Orphanage' where husband & wife Karen (Faye Grant) & Gene York (Michael Woods) are given a baby girl by Sister Yvonne (Megan Leitch) who they have adopted, they name her Delia. At first things go well but as the years pass & Delia (Asia Vieria) grows up Karen becomes suspicious of her as death & disaster follows her, Karen is convinced that she is evil itself. Karen then finds out that she is pregnant but discovers a sinister plot to use her as a surrogate mother for th next Antichrist & gets a shock when she finds out who Delia's real father was...<br /><br />Originally to be directed by Dominique Othenin-Girard who either quit or was sacked & was replaced by Jorge Montesi who completed the film although why he bothered is anyone's guess as Omen IV: The Awakening is absolutely terrible & a disgrace when compared to it illustrious predecessors. The script by Brian Taggert is hilariously bad, I'm not sure whether this nonsense
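As the entry above shows, each example is a single line that starts with `__label__<label>` followed by the text. A small sketch of a formatting helper, assuming that line breaks inside a review can safely be replaced by spaces (the name `to_fasttext_line` is hypothetical and not part of the notebook):

```python
def to_fasttext_line(label: int, text: str) -> str:
    """Format one example as a single line: '__label__<label> <text>'."""
    # Each example must fit on one line, so line breaks inside the text
    # are replaced by spaces (assumption: losing them is acceptable here).
    one_line = " ".join(text.splitlines())
    return f"__label__{label} {one_line}\n"


# Made-up review, not taken from IMDB.
line = to_fasttext_line(1, "Great film.\nWould watch again.")
```

Here `line` is one FastText-formatted example that can be encoded and written to the file as in the cell above.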

## First training with a FastText model:

In [9]:
fast_model = fast.train_supervised('imdb_train.txt')

Read 5M words
Number of words:  281132
Number of labels: 2
Progress: 100.0% words/sec/thread: 1312665 lr:  0.000000 avg.loss:  0.426153 ETA:   0h 0m 0s


Let's look at the training vocabulary:

In [10]:
print(f"the vocabulary size is: {len(fast_model.words)}\n\nThis is a slice of it:\n{fast_model.words[:20]}")

the vocabulary size is: 281132

This is a slice of it:
['the', 'a', 'and', 'of', 'to', 'is', 'in', 'I', 'that', 'this', 'it', '/><br', 'was', 'as', 'with', 'for', 'but', 'The', 'on', 'movie']


### Results of the model:

We copy this print function from the FastText documentation to display the results:

In [11]:
def print_results(N : int, p : float, r : float) -> None:
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

So let's compute the precision at 1 (P@1) and the recall at 1 (R@1) on the test dataset:

In [12]:
print_results(*fast_model.test('imdb_test.txt'))

N	25000
P@1	0.860
R@1	0.860


We can also compute these metrics for each label separately:

In [13]:
def print_labels_results(l_scores : Dict[str, Dict[str, float]]) -> None:
    for label in l_scores:
        print(f"label '{label}':\n")
        print(f"\tprecision: {np.round(l_scores[label]['precision'], 3)}")
        print(f"\trecall: {np.round(l_scores[label]['recall'], 3)}")
        print(f"\tF1 score: {np.round(l_scores[label]['f1score'], 3)}\n")

In [14]:
print_labels_results(fast_model.test_label('imdb_test.txt'))

label '__label__1':

	precision: 0.856
	recall: nan
	F1 score: 1.713

label '__label__0':

	precision: 0.863
	recall: nan
	F1 score: 1.726
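In the output above, the recall is `nan` and the F1 values are greater than 1, which is not possible for a genuine F1 score (it is bounded by 1); the printed values are close to twice the precision. The cause is not visible in this notebook. As a cross-check, per-label scores can be recomputed directly from the predictions. The sketch below is a plain Python implementation of the usual definitions; the function name `per_label_scores` and the toy labels are assumptions for illustration, and in the notebook `y_pred` would have to be collected with the model's `predict` method.

```python
from collections import Counter
from typing import Dict, List


def per_label_scores(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall and F1 for each label from two aligned lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted label was wrong
            fn[truth] += 1  # the true label was missed

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1score": f1}
    return scores


# Toy example with made-up labels (not IMDB results).
y_true = ["__label__0", "__label__0", "__label__1", "__label__1"]
y_pred = ["__label__0", "__label__1", "__label__1", "__label__1"]
toy_scores = per_label_scores(y_true, y_pred)
```

The returned dictionary has the same keys (`precision`, `recall`, `f1score`) as the ones read by `print_labels_results`, so it can be passed to that function unchanged.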



## Pre-processing on IMDB dataset:

### Clean the text

The text format is not perfect: it contains markup such as '\t' or '<br />'. So we replace all special characters with spaces, and we add a space before and after '!' so that it becomes a separate word.

In [15]:
def clean_the_text(text_array : str) -> str:
    '''
        Return a cleaned, lower-cased copy of the text given as parameter.

            Parameters:
                    text_array (str): The text in a string format.

            Returns:
                    result (str) : The text with special characters replaced by spaces
                                   and '!' surrounded by spaces.
    '''
    
    specialChars = "()\\\''.,;:\"?-" 
    for specialChar in specialChars:
        text_array = text_array.replace(specialChar, ' ')
        
    # Remove the two halves of the HTML line breaks ("<br />").
    text_array = text_array.replace("/>", ' ')
    text_array = text_array.replace("<br", ' ')
    
    # Add a space before and after '!' so that it is split off as its own word.
    text_array = text_array.replace("!", " ! ")
    
    return text_array.lower()
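The same cleaning steps can also be written with regular expressions. This is only a sketch of a variant, not the function used for the results in this notebook: the name `clean_with_regex` is hypothetical, and unlike `clean_the_text` it also collapses runs of spaces into a single space.

```python
import re

# Matches an HTML line break such as "<br />" or "<br>".
BR_TAG = re.compile(r"<br\s*/?>")
# Same special characters as in the function above.
PUNCT = re.compile(r"""[()\\'.,;:"?-]""")


def clean_with_regex(text: str) -> str:
    """Lower-case the text, replace markup and punctuation by spaces, isolate '!'."""
    text = text.lower()
    text = BR_TAG.sub(" ", text)
    text = PUNCT.sub(" ", text)
    text = text.replace("!", " ! ")
    # Collapse consecutive whitespace (difference with the function above).
    return " ".join(text.split())


cleaned = clean_with_regex("Great movie!<br /><br />Isn't it?")
```

On this made-up sentence, `cleaned` is `"great movie ! isn t it"`.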

Now we can try the same model on the clean text and see whether this modification changes the results.

In [16]:
%%time

if not os.path.exists("imdb_clean_train.txt"):
    with open("imdb_clean_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            s = f"__label__{entry['label']} {clean_the_text(entry['text'])}\n".encode("utf-8")
            f.write(s)
        
if not os.path.exists("imdb_clean_test.txt"):
    with open("imdb_clean_test.txt", "wb") as f:
        for i in rand_idy:
            entry = imdb['test'][int(i)]
            s = f"__label__{entry['label']} {clean_the_text(entry['text'])}\n".encode("utf-8")
            f.write(s)

CPU times: user 402 µs, sys: 36 µs, total: 438 µs
Wall time: 242 µs


In [17]:
!head -n 1 imdb_clean_train.txt

__label__0 what a time we live in when someone like this joe swan whatever the hell is considered a good filmmaker   or even a filmmaker at all !  where are the new crop of filmmakers with brains and talent    we need them bad  and to hell with mumblecore !       this movie is about nothing  just as the characters in the film stand for nothing  it s this horrible  so called gen y  that is full of bored idiots  some of which declare themselves filmmakers with out bothering to learn anything about the craft before shooting  well  orson welles was a filmmaker  john huston was a filmmaker  fellini was a filmmaker  dreyer was a filmmaker  etc  current films like these show just how stupid young  so called  filmmakers  can be when they believe going out with no script  no direction  no thought  no legit  camerawork   everything shot horribly on dv   no craft of editing  no nothing  stands for  rebellious  or  advanced  film making  nope  it s called ignorance and laziness or just pure mastur

In [18]:
fast_model_clean = fast.train_supervised('imdb_clean_train.txt')

Read 6M words
Number of words:  80799
Number of labels: 2
Progress: 100.0% words/sec/thread: 1572989 lr:  0.000000 avg.loss:  0.389734 ETA:   0h 0m 0s


In [19]:
print(f"the vocabulary size is: {len(fast_model_clean.words)}\n\nThis is a slice of it:\n{fast_model_clean.words[:20]}")

the vocabulary size is: 80799

This is a slice of it:
['the', 'and', 'a', 'of', 'to', 'is', 'it', 'in', 'i', 'this', 'that', 's', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'you']


#### Results of the clean model

In [20]:
print_results(*fast_model_clean.test('imdb_clean_test.txt'))

N	25000
P@1	0.879
R@1	0.879


In [21]:
print_labels_results(fast_model_clean.test_label('imdb_clean_test.txt'))

label '__label__0':

	precision: 0.878
	recall: nan
	F1 score: 1.757

label '__label__1':

	precision: 0.879
	recall: nan
	F1 score: 1.758



On average, our results improve by about 0.02 (P@1 goes from 0.860 to 0.879). It is not a huge improvement, but it is worth a few more seconds of computation.

### Clean text and remove stop words

We have seen that the cleaning function improves our results, but why stop here?

We can add another cleaning step: removing stop words. What are stop words? They are non-discriminating words such as __the__, __a__, __an__, __this__, ...

So we first create a list of stop words, and then we remove them from the text.

In [22]:
list_of_stop_word = ["the", "and", "a", "of", "to", "is", "it", "in", "this", "that", "s", "was", "as", "for", "with", "but", "then", "an", "at", "who", "when", "than", "where", "which", "with", "on", "t", "are", "by", "so", "from", "have", "be", "or", "just", "about", ""]

Now we create our extended clean text function.

In [23]:
def clean_the_text_extend(text : str, list_of_stop_word : List[str]) -> str:
    '''
        Return a cleaned, lower-cased copy of the text with the stop words removed.

            Parameters:
                    text (str): The text in a string format.
                    
                    list_of_stop_word (List[str]): The stop words to remove from the text.

            Returns:
                    result (str) : The cleaned text without the stop words.
    '''
    text = text.lower()
    
    specialChars = "()\\\''.,;:\"?-" 
    for specialChar in specialChars:
        text = text.replace(specialChar, ' ')
        
    text = text.replace("/>", ' ')
    text = text.replace("<br", ' ')
    
    text = text.replace("</s>", " ")
    
    # Add a space before and after '!' so that it is split off as its own word.
    text = text.replace("!", " ! ")
    
    for word in list_of_stop_word:
        # Surround the stop word with spaces so that only whole words are removed,
        # not the same letters inside another word.
        word = " " + word + " "
        text = text.replace(word, " ")
    
    # The text was already lower-cased at the top of the function.
    return text
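Removing a stop word by replacing `" word "` only works when the word has a space on both sides. A stop word at the very start or end of the text, or one that is immediately repeated, is therefore kept. The sketch below compares this with a token-based removal; both helper names (`remove_by_replace`, `remove_by_tokens`) and the sample sentence are made up for illustration, and the token-based variant also collapses repeated spaces.

```python
from typing import Iterable


def remove_by_replace(text: str, stop_words: Iterable[str]) -> str:
    """Replace-based removal, same idea as in the function above."""
    for word in stop_words:
        text = text.replace(f" {word} ", " ")
    return text


def remove_by_tokens(text: str, stop_words: Iterable[str]) -> str:
    """Token-based removal: split on whitespace and drop every stop word."""
    stop_set = set(stop_words)
    return " ".join(tok for tok in text.split() if tok not in stop_set)


sample = "the the movie is good the"
stops = ["the", "is"]
by_replace = remove_by_replace(sample, stops)  # "the movie good the"
by_tokens = remove_by_tokens(sample, stops)    # "movie good"
```

The leading and trailing `the` survive the replace-based version, while the token-based version removes every occurrence.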

Now we try again with this new function:

In [24]:
%%time

if not os.path.exists("imdb_clean_extend_train.txt"):
    with open("imdb_clean_extend_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            s = f"__label__{entry['label']} {clean_the_text_extend(entry['text'], list_of_stop_word)}\n".encode("utf-8")
            f.write(s)

if not os.path.exists("imdb_clean_extend_test.txt"):
    with open("imdb_clean_extend_test.txt", "wb") as f:
        for i in rand_idy:
            entry = imdb['test'][int(i)]
            s = f"__label__{entry['label']} {clean_the_text_extend(entry['text'], list_of_stop_word)}\n".encode("utf-8")
            f.write(s)

CPU times: user 635 µs, sys: 40 µs, total: 675 µs
Wall time: 427 µs


In [25]:
!head -n 1 imdb_clean_extend_train.txt

__label__0 this yawn titles credits  boring point tedium acting wooden stilted !  admittedly director richard jobson directing debut  earth green lit script poorly developed one  looks like another money down drain government project  scottish screen credited surprise  surprise   nearly fell asleep three times my review will unfortunately more restrained one  please  please mister jobson what ever you ve been doing prior directing sedative  go back ! 


In [26]:
fast_model_clean_extend = fast.train_supervised('imdb_clean_extend_train.txt')

Read 3M words
Number of words:  80799
Number of labels: 2
Progress: 100.0% words/sec/thread: 1550060 lr:  0.000000 avg.loss:  0.323175 ETA:   0h 0m 0s


In [27]:
print(f"the vocabulary size is: {len(fast_model_clean_extend.words)}\n\nThis is a slice of it:\n{fast_model_clean_extend.words[:20]}")

the vocabulary size is: 80799

This is a slice of it:
['you', 'not', 'one', '</s>', '!', 'all', 'they', 'like', 'there', 'or', 'just', 'about', 'out', 'if', 'has', 'what', 'some', 'good', 'can', 'more']


#### Results of the extended clean model

In [28]:
print_results(*fast_model_clean_extend.test('imdb_clean_extend_test.txt'))

N	25000
P@1	0.885
R@1	0.885


In [29]:
print_labels_results(fast_model_clean_extend.test_label('imdb_clean_extend_test.txt'))

label '__label__1':

	precision: 0.888
	recall: nan
	F1 score: 1.775

label '__label__0':

	precision: 0.883
	recall: nan
	F1 score: 1.766



These results are a bit better, so from now on we use the extended clean text instead of the basic clean text.

### Stemming the data:

First of all we need to create a function that stems a word.

In [30]:
re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")

stem_word: Callable[[str], str] = lambda w : stemmer.stem(w.lower()) if re_word.match(w) else w

Now we have to create a function that applies stemming to a whole text.

In [31]:
def stemming_text(text : str) -> str:
    '''
        Stem every word of the text given as parameter and return the result.
        
        Parameters :
                text (str) : the text to stem
                
        Returns :
                return_text (str) : the stemmed text
    '''
    list_of_words = text.split(" ")
    
    list_of_words = [stem_word(word) for word in list_of_words]
    
    return_text = " ".join(list_of_words)
    
    return return_text
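Because `stemming_text` splits on single spaces, a word with punctuation attached, such as `years.`, does not match `re_word` and is left unstemmed. A possible variant, sketched below, first separates words from punctuation with a regular expression. The names `stem_tokens` and `toy_stem` are hypothetical; `toy_stem` is only a stand-in so that the example runs without NLTK, and in the notebook the `stem` argument would be `stemmer.stem`. Note that this tokenization also splits contractions (`didn't` becomes `didn ' t`).

```python
import re
from typing import Callable

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
WORD_RE = re.compile(r"^\w+$")


def stem_tokens(text: str, stem: Callable[[str], str]) -> str:
    """Split the text into words and punctuation, then stem the words only."""
    tokens = TOKEN_RE.findall(text)
    return " ".join(stem(tok) if WORD_RE.match(tok) else tok for tok in tokens)


def toy_stem(word: str) -> str:
    """Stand-in stemmer for the example: lower-case and strip trailing 's'."""
    return word.lower().rstrip("s")


stemmed = stem_tokens("Several years.", toy_stem)
```

With the stand-in stemmer, `stemmed` is `"several year ."`: the final dot no longer prevents `years` from being stemmed.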

With this function we can now create a new model where every text is stemmed before being written to the file.

In [32]:
%%time

if not os.path.exists("imdb_stemmed_train.txt"):
    with open("imdb_stemmed_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            s = f"__label__{entry['label']} {stemming_text(entry['text'])}\n".encode("utf-8")
            f.write(s)
        
if not os.path.exists("imdb_stemmed_test.txt"):
    with open("imdb_stemmed_test.txt", "wb") as f:
        for i in rand_idy:
            entry = imdb['test'][int(i)]
            s = f"__label__{entry['label']} {stemming_text(entry['text'])}\n".encode("utf-8")
            f.write(s)

CPU times: user 140 µs, sys: 7 µs, total: 147 µs
Wall time: 155 µs


In [33]:
!head -n 1 imdb_stemmed_train.txt

__label__1 i watch this last night after not have seen it for sever years. it realli is a fun littl film, with a bunch of face you didn't know were in it. arkin shine as always. check it out; you won't be dissappointed. by the way, it was just releas on dvd and contrari to it packaging, it is widescreen. the transfer is rather poor, but at least the whole movi is visible. ;-)


In [34]:
fast_model_stemmed = fast.train_supervised('imdb_stemmed_train.txt')

Read 5M words
Number of words:  245430
Number of labels: 2
Progress: 100.0% words/sec/thread: 1497238 lr:  0.000000 avg.loss:  0.419089 ETA:   0h 0m 0s


In [35]:
print(f"the vocabulary size is: {len(fast_model_stemmed.words)}\n\nThis is a slice of it:\n{fast_model_stemmed.words[:20]}")

the vocabulary size is: 245430

This is a slice of it:
['the', 'a', 'and', 'of', 'to', 'is', 'in', 'it', 'i', 'this', 'that', '/><br', 'was', 'as', 'for', 'with', 'but', 'movi', 'film', 'be']


#### Results of the stemmed model

In [36]:
print_results(*fast_model_stemmed.test('imdb_stemmed_test.txt'))

N	25000
P@1	0.861
R@1	0.861


In [37]:
print_labels_results(fast_model_stemmed.test_label('imdb_stemmed_test.txt'))

label '__label__1':

	precision: 0.853
	recall: nan
	F1 score: 1.705

label '__label__0':

	precision: 0.87
	recall: nan
	F1 score: 1.739



Stemming barely changes the results (P@1 goes from 0.860 to 0.861), so it is not convincing.

### Lemmatizing the data:

First, we need to download the English model used by spaCy for lemmatization:

In [38]:
!python -m spacy download en_core_web_sm > output_dl.txt

In [39]:
# loading the small English model
nlp = spacy.load("en_core_web_sm")

In [40]:
%%time

if not os.path.exists("lemmed_imdb_train.txt"):
    with open("lemmed_imdb_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            
            # lemmatize before writing
            lemmed_text = ' '.join([token.lemma_ for token in nlp(entry['text'])])
            s = f"__label__{entry['label']} {lemmed_text}\n".encode("utf-8")
            f.write(s)

CPU times: user 213 µs, sys: 10 µs, total: 223 µs
Wall time: 251 µs


Do the same for the test dataset:

In [41]:
if not os.path.exists("lemmed_imdb_test.txt"):
    with open("lemmed_imdb_test.txt", "wb") as f:
        for i in rand_idy:
            entry = imdb['test'][int(i)]
            
            # lemmatize before writing
            lemmed_text = ' '.join([token.lemma_ for token in nlp(entry['text'])])
            s = f"__label__{entry['label']} {lemmed_text}\n".encode("utf-8")
            f.write(s)

In [42]:
fast_model_lemming = fast.train_supervised('lemmed_imdb_train.txt')

Read 6M words
Number of words:  106199
Number of labels: 2
Progress: 100.0% words/sec/thread: 1511882 lr:  0.000000 avg.loss:  0.417234 ETA:   0h 0m 0s


In [43]:
print(f"the vocabulary size is: {len(fast_model_lemming.words)}\n\nThis is a slice of it:\n{fast_model_lemming.words[:20]}")

the vocabulary size is: 106199

This is a slice of it:
['the', 'be', ',', '.', 'and', 'a', 'of', 'to', 'it', 'I', 'in', 'this', 'that', '"', 'have', '-', '/><br', 'movie', 'film', 'as']


#### Results

In [44]:
print_results(*fast_model_lemming.test('lemmed_imdb_test.txt'))

N	25000
P@1	0.867
R@1	0.867


In [45]:
print_labels_results(fast_model_lemming.test_label('lemmed_imdb_test.txt'))

label '__label__0':

	precision: 0.869
	recall: nan
	F1 score: 1.739

label '__label__1':

	precision: 0.865
	recall: nan
	F1 score: 1.729



This time the improvement is a little more interesting (P@1 goes from 0.860 to 0.867), even though lemmatization takes a long time (about 1 hour on the computer where we ran the test). We keep it despite the cost because it does bring an improvement, and we will see how it combines with the other optimisations.

## Hyperparameter tuning:

We need to extract a validation set from our train dataset to avoid tuning on the test dataset:

In [46]:
# The split command copies the file into a set of files of 20,000 lines each.
# The train file has 25,000 lines, so the train part gets 20,000 lines and the validation part 5,000 lines.
!split -l20000 "imdb_train.txt" tuning_

!ls -l ./tuning_*

-rw-r--r-- 1 leherlemaxime leherlemaxime 26719815 Oct  6 16:57 ./tuning_aa
-rw-r--r-- 1 leherlemaxime leherlemaxime  6713008 Oct  6 16:57 ./tuning_ab
-rw-r--r-- 1 leherlemaxime leherlemaxime 19853802 Oct  6 16:48 ./tuning_complet_aa
-rw-r--r-- 1 leherlemaxime leherlemaxime  4974531 Oct  6 16:48 ./tuning_complet_ab
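Taking the first 20,000 lines for training and the rest for validation gives a random split here only because the lines of `imdb_train.txt` were already written in shuffled order. For environments where the `split` command is not available, the same cut can be sketched in Python; the function name `split_lines` and the toy file names are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def split_lines(src: Path, train_out: Path, valid_out: Path, n_train: int) -> Tuple[int, int]:
    """Copy the first n_train lines of src to train_out and the rest to valid_out."""
    n_tr = n_va = 0
    with src.open(encoding="utf-8") as fin, \
         train_out.open("w", encoding="utf-8") as ftr, \
         valid_out.open("w", encoding="utf-8") as fva:
        for i, line in enumerate(fin):
            if i < n_train:
                ftr.write(line)
                n_tr += 1
            else:
                fva.write(line)
                n_va += 1
    return n_tr, n_va


# Toy demonstration on a temporary 10-line file (not the IMDB data).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    source = tmp_dir / "toy_train.txt"
    source.write_text(
        "".join(f"__label__{i % 2} review {i}\n" for i in range(10)),
        encoding="utf-8",
    )
    counts = split_lines(source, tmp_dir / "toy_aa", tmp_dir / "toy_ab", n_train=8)
    first_valid_line = (tmp_dir / "toy_ab").read_text(encoding="utf-8").splitlines()[0]
```

For the notebook, the equivalent call would use `imdb_train.txt` as the source and `n_train=20000`.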


### Try the default hyperparameter tuning of FastText:

In [47]:
tunning_fast_model = fast.train_supervised(input='tuning_aa', autotuneValidationFile='tuning_ab', autotuneMetric="f1:__label__0")

Progress: 100.0% Trials:    9 Best score:  0.884967 ETA:   0h 0m 0s
Training again with best arguments
Read 4M words
Number of words:  244423
Number of labels: 2
Progress: 100.0% words/sec/thread:  603064 lr:  0.000000 avg.loss:  0.048096 ETA:   0h 0m 0s


Let's compute global metrics:

In [48]:
print_results(*tunning_fast_model.test('imdb_test.txt'))

N	25000
P@1	0.883
R@1	0.883


The default hyperparameter tuning seems to give better results. But how do the per-label scores change?

In [49]:
print_labels_results(tunning_fast_model.test_label('imdb_test.txt'))

label '__label__1':

	precision: 0.882
	recall: nan
	F1 score: 1.765

label '__label__0':

	precision: 0.884
	recall: nan
	F1 score: 1.768



Results are better with tuning. Note that optimising the F1 score of the __negative__ label brings a larger improvement on the __positive__ label. The reason is that we only have two labels, and the __negative__ label already had fewer misclassifications.

## Combining optimisations

First we combine the extended clean text optimisation with another one, because we saw that it cleans the text and keeps only its important parts. We also saw that lemmatization works much better than stemming.

This is why we try a model with both the extended clean text and lemmatization.

In [48]:
%%time

if not os.path.exists("lemmed_clean_imdb_train.txt"):
    with open("lemmed_clean_imdb_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            
            # lemmatize before writing
            lemmed_text = ' '.join([token.lemma_ for token in nlp(clean_the_text_extend(entry['text'], list_of_stop_word))])
            s = f"__label__{entry['label']} {lemmed_text}\n".encode("utf-8")
            f.write(s)

if not os.path.exists("lemmed_clean_imdb_test.txt"):
    with open("lemmed_clean_imdb_test.txt", "wb") as f:
        for i in rand_idy:
            entry = imdb['test'][int(i)]
            
            # lemmatize before writing
            lemmed_text = ' '.join([token.lemma_ for token in nlp(clean_the_text_extend(entry['text'], list_of_stop_word))])
            s = f"__label__{entry['label']} {lemmed_text}\n".encode("utf-8")
            f.write(s)

CPU times: user 6.48 ms, sys: 0 ns, total: 6.48 ms
Wall time: 6.01 ms


In [49]:
fast_model_lemming_clean = fast.train_supervised('lemmed_clean_imdb_train.txt')

Read 3M words
Number of words:  64059
Number of labels: 2
Progress: 100.0% words/sec/thread: 1563884 lr:  0.000000 avg.loss:  0.326712 ETA:   0h 0m 0s


In [50]:
print(f"the vocabulary size is: {len(fast_model_lemming_clean.words)}\n\nThis is a slice of it:\n{fast_model_lemming_clean.words[:20]}")

the vocabulary size is: 64059

This is a slice of it:
['you', 'not', 'they', 'have', 'be', 'one', 'do', '</s>', '!', 'all', 'see', 'make', 'like', 'good', 'there', 'well', 'or', 'just', 'about', 'out']


#### Results of the combined optimisations

In [51]:
print_results(*fast_model_lemming_clean.test('lemmed_clean_imdb_test.txt'))

N	25000
P@1	0.880
R@1	0.880


In [52]:
print_labels_results(fast_model_lemming_clean.test_label('lemmed_clean_imdb_test.txt'))

label '__label__0':

	precision: 0.874
	recall: nan
	F1 score: 1.747

label '__label__1':

	precision: 0.886
	recall: nan
	F1 score: 1.772



So combining these 2 optimisations brings no improvement compared to the extended clean text alone (P@1 of 0.880 against 0.885). Our hypothesis is that we start to overfit on the train set. A possible solution is to keep this method but tune the hyperparameters.

## Our final model

In [53]:
# The split command copies the file into a set of files of 20,000 lines each.
# The train file has 25,000 lines, so the train part gets 20,000 lines and the validation part 5,000 lines.
!split -l20000 "lemmed_clean_imdb_train.txt" tuning_complet_

!ls -l ./tuning_complet_*

-rw-r--r-- 1 leherlemaxime leherlemaxime 19853802 Oct  6 16:48 ./tuning_complet_aa
-rw-r--r-- 1 leherlemaxime leherlemaxime  4974531 Oct  6 16:48 ./tuning_complet_ab


We manually tuned the number of epochs, the learning rate and the n-gram size, but we did not manage to find a better result than the one given by FastText's automatic tuning, so we simply use it to get our best result on this dataset.

In [50]:
tunning_fast_model_complet = fast.train_supervised(input='tuning_complet_aa', autotuneValidationFile='tuning_complet_ab', autotuneMetric="f1:__label__0")

Progress: 100.0% Trials:    9 Best score:  0.890872 ETA:   0h 0m 0s
Training again with best arguments
Read 3M words
Number of words:  58103
Number of labels: 2
Progress: 100.0% words/sec/thread:  406096 lr:  0.000000 avg.loss:  0.035149 ETA:   0h 0m 0s


### Our final results

In [51]:
print_results(*tunning_fast_model_complet.test('lemmed_clean_imdb_test.txt'))

N	25000
P@1	0.894
R@1	0.894


In [52]:
print_labels_results(tunning_fast_model_complet.test_label('lemmed_clean_imdb_test.txt'))

label '__label__1':

	precision: 0.897
	recall: nan
	F1 score: 1.795

label '__label__0':

	precision: 0.892
	recall: nan
	F1 score: 1.783



## Conclusion:

Our final model reaches really satisfactory results for what we wanted: we get close to 90% accuracy (P@1 of 0.894), with similar precision on both labels. Even if the whole pipeline is quite long to run (about 1h20 to do everything), we find the time acceptable for this result. We will now look at examples of misclassified texts to try to understand why they are misclassified and to see the weaknesses of our model.

In [66]:
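wrong_0_index = []
wrong_1_index = []

# Note: predict() is given the raw review text here, whereas the model was trained
# on cleaned and lemmatized text, so the counts below are not directly comparable
# with the test scores above.
for i in range(len(imdb["test"])):
    label_predict = tunning_fast_model_complet.predict(imdb["test"][i]["text"])[0][0]
    if ((label_predict == "__label__1") and (imdb["test"][i]["label"] == 0)):
        wrong_1_index.append(i)
    elif ((label_predict == "__label__0") and (imdb["test"][i]["label"] == 1)):
        wrong_0_index.append(i)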
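The loop above passes the raw review text to the model, while the model was trained on text that went through `clean_the_text_extend` and lemmatization. A sketch of the same error collection with an explicit preprocessing step is given below. The function `collect_errors` and the toy `preprocess` and `predict_label` callables are hypothetical; in the notebook, `preprocess` would presumably be the cleaning plus lemmatization used to build `lemmed_clean_imdb_test.txt`, and `predict_label` a wrapper returning `tunning_fast_model_complet.predict(text)[0][0]`.

```python
from typing import Callable, List, Sequence, Tuple


def collect_errors(
    texts: Sequence[str],
    labels: Sequence[int],
    preprocess: Callable[[str], str],
    predict_label: Callable[[str], str],
) -> Tuple[List[int], List[int]]:
    """Return (indices wrongly predicted as 0, indices wrongly predicted as 1)."""
    wrong_as_0, wrong_as_1 = [], []
    for i, (text, label) in enumerate(zip(texts, labels)):
        # Apply the same preprocessing as at training time before predicting.
        predicted = predict_label(preprocess(text))
        if predicted == "__label__1" and label == 0:
            wrong_as_1.append(i)
        elif predicted == "__label__0" and label == 1:
            wrong_as_0.append(i)
    return wrong_as_0, wrong_as_1


# Toy stand-ins (not the notebook's model): predict positive if "good" appears.
toy_texts = ["GOOD film", "bad film", "Good grief, awful", "dull"]
toy_labels = [1, 0, 0, 1]
wrong_0, wrong_1 = collect_errors(
    toy_texts,
    toy_labels,
    preprocess=str.lower,
    predict_label=lambda t: "__label__1" if "good" in t else "__label__0",
)
```

In this toy run, the third text is wrongly predicted as label 1 and the fourth as label 0, so `wrong_0` is `[3]` and `wrong_1` is `[2]`.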

In [67]:
len(wrong_0_index)

717

In [68]:
len(wrong_1_index)

5202

2 examples wrongly predicted as label 0:

In [69]:
imdb["test"][int(random.choice(wrong_0_index))]["text"]

"black tar can't be snorted there's a documentary: dark end of the street about s.f. street punks and b.t. abuse - not bad - quite heavy. in wasted there's this stuff that looks like coke but should be something else... no big deal. black tar can't be snorted there's a documentary: dark end of the street about s.f. street punks and b.t. abuse - not bad - quite heavy. in wasted there's this stuff that looks like coke but should be something else... no big deal. black tar can't be snorted there's a documentary: dark end of the street about s.f. street punks and b.t. abuse - not bad - quite heavy. in wasted there's this stuff that looks like coke but should be something else... no big deal."

In [70]:
imdb["test"][int(random.choice(wrong_0_index))]["text"]

"I loved this film! Fantastically original and different! A solid, intense, hard-core and suspenseful movie that has just the right touch of (dark?) humor. If you're tired of the typical, overdone, ridiculous Hollywood B.S. movies, how many big explosions and awful and unrealistic shoot em up gun fights that insult your intelligence can we take, then this film is for you. Fantastic characters that are wonderfully original and believable, and solid performances by all actors, not a weak character or performance in the film. Skip Woods' film is a breath of fresh air and I applaud his originality and efforts, his film has the feel of a cross between a Quentin Tarantino and a Cohen brothers film (not a bad mix at all in my opinion). This movie grabs you by the throat and doesn't let go, there's nothing boring or bubble gum about this film. The only disappointment is that nobody seems to know about it, everyone I've recommended it to has thanked me and shared my opinion on it. This film is 

2 examples wrongly predicted as label 1:

In [71]:
imdb["test"][int(random.choice(wrong_1_index))]["text"]

'I haven\'t seen this film in years, but the awful "taste" of Quaid\'s performance still lingers on my tongue. Some have commented on how Quaid has Jerry Lee Lewis "to a tee" but the fact is he only appears to have the most extreme stage Jerry in mind. Nobody acts that way all the time, and the performance comes off as hopelessly clownish, reducing Lewis to a buffoonish caricature. The nuances of a man\'s life are lost in the rubble of sheer over-acting.<br /><br />The author of the book this is based on (Nick Tosches) is a good writer, who has written several fine musical bios (I particularly liked "Dino" on Dean Martin); in the books Tosches gives us a full human being, both separate from and involved in the "biz." Quaid\'s acting seems to imply that Jerry never acted like a human being. If people were like this, no one would bother to hang around them. As cartoons go, it is mildly amusing, but otherwise it is one of the most egregious, film-destroying performances I have had the "ho

In [72]:
imdb["test"][int(random.choice(wrong_1_index))]["text"]

"This is a movie with an excellent concept for a story but that got sidetracked but a large number of clichéd sub-plots, hackneyed and unrealistic portrayed characterizations and performances, and some frankly implausible (and highly coincidental and, not to mention, convenient as plot points to move the story to its inexorable finish).<br /><br />The lack of anything that marked the lead as actually gay, other than some coincidental references to Crow Bar or that he's gay, was troubling. It wouldn't have hurt to actually show him do something, even if it was just meet a friend for drinks.<br /><br />It's worth checking out and has it's merits. There isn't much, even now a few years after the movie was released, in the way of movies that feature both a lead that is gay, or a significant gay plot line, and that is also about African-Americans. For that, it's worth checking out. I wouldn't look too hard for it and I wouldn't waste my time looking for it to own. This is a rental, and not 

After analysing some examples, the previous ones and others not shown here, we understand the main problem behind the roughly 10% of errors: the model does not take the context into account. First, a person can say that a movie is good because it is different from other movies; in that case there is a strong chance that the reviewer expresses negative feelings about those other movies, and since our classifier ignores the context, it attributes these negative adjectives to the movie being reviewed. Second, some expressions have negative connotations but are used in a positive context, for example "with an extremely dark humor": this is not necessarily negative, but because the individual terms are, the review ends up misclassified.

To summarise, our model has a very good success rate, and the main remaining problem is handling context, which is much more complicated.