# FastText and Word Vectors (Lab No. 3)

In [1]:
import os

import numpy as np

from datasets import load_dataset

from nltk import tokenize
from nltk.stem.snowball import SnowballStemmer

import spacy
import fasttext as fast
#import transformers

from typing import Dict
from typing import Callable
from typing import List
import re

# set a defined random generator, better for reproducible results.
random = np.random.default_rng(42)

## A look at the IMDB dataset:

In [2]:
imdb = load_dataset('imdb')
print(imdb)

Reusing dataset imdb (/home/leherlemaxime/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


And we have the following number of entries:

In [3]:
print(f"train entries: {len(imdb['train'])}\ntest entries: {len(imdb['test'])}")

train entries: 25000
test entries: 25000


## Convert the dataset to the FastText format:

Generate a shuffled index list:

In [4]:
rand_idx = np.arange(len(imdb['train']))
np.random.shuffle(rand_idx)
print(rand_idx[:10])

[ 6401 19017  6643 17772 19755 23333 15152 18940 14600 14428]
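The cell above calls `np.random.shuffle`, which draws from NumPy's global state rather than from the seeded `random` generator created in the first cell, so the order can differ between runs. A minimal sketch of a reproducible variant, assuming reproducibility is wanted here (the names `rng`, `n_entries` and `same_idx` are illustrative):

```python
import numpy as np

# Seeded generator, mirroring the `random` generator created in the first cell.
rng = np.random.default_rng(42)

n_entries = 25_000  # size of the IMDB train split reported above

# permutation() returns a new shuffled array containing 0..n_entries-1.
rand_idx = rng.permutation(n_entries)

# Re-creating the generator with the same seed reproduces the same order.
same_idx = np.random.default_rng(42).permutation(n_entries)
print(np.array_equal(rand_idx, same_idx))
```

With this variant the shuffled order depends only on the seed, not on how many times other cells have called the global random functions.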


Write the IMDB dataset to a file in the FastText format:

In [5]:
%%time

if not os.path.exists("imdb_train.txt"):
    with open("imdb_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            s = f"__label__{entry['label']} {entry['text']}\n".encode("utf-8")
            f.write(s)

# Reshuffle rand_idx before writing the test dataset
np.random.shuffle(rand_idx)
        
if not os.path.exists("imdb_test.txt"):
    with open("imdb_test.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['test'][int(i)]
            s = f"__label__{entry['label']} {entry['text']}\n".encode("utf-8")
            f.write(s)

CPU times: user 461 µs, sys: 167 µs, total: 628 µs
Wall time: 529 µs


Let's see the input format of an entry:

In [6]:
!head -n 1 imdb_train.txt

__label__0 Omen IV: The Awakening starts at the 'St. Frances Orphanage' where husband & wife Karen (Faye Grant) & Gene York (Michael Woods) are given a baby girl by Sister Yvonne (Megan Leitch) who they have adopted, they name her Delia. At first things go well but as the years pass & Delia (Asia Vieria) grows up Karen becomes suspicious of her as death & disaster follows her, Karen is convinced that she is evil itself. Karen then finds out that she is pregnant but discovers a sinister plot to use her as a surrogate mother for th next Antichrist & gets a shock when she finds out who Delia's real father was...<br /><br />Originally to be directed by Dominique Othenin-Girard who either quit or was sacked & was replaced by Jorge Montesi who completed the film although why he bothered is anyone's guess as Omen IV: The Awakening is absolutely terrible & a disgrace when compared to it illustrious predecessors. The script by Brian Taggert is hilariously bad, I'm not sure whether this nonsense
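As the line above shows, each example is a single line made of the `__label__<label>` prefix, a space, and the review text. The sketch below builds and parses such a line; the helper names are hypothetical, and replacing embedded line breaks is an added precaution that the cell above does not perform.

```python
from typing import Tuple


def to_fasttext_line(label: int, text: str) -> str:
    """Build one training line: '__label__<label> <text>'."""
    # One example per line: embedded line breaks are replaced by spaces
    # (an assumption of this sketch, not done in the notebook cell).
    single_line = " ".join(text.splitlines())
    return f"__label__{label} {single_line}\n"


def parse_fasttext_line(line: str) -> Tuple[int, str]:
    """Split a line back into (label, text)."""
    prefix, _, text = line.rstrip("\n").partition(" ")
    return int(prefix[len("__label__"):]), text


line = to_fasttext_line(0, "Absolutely terrible.\nA disgrace.")
print(repr(line))
print(parse_fasttext_line(line))
```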

## First training with FastText model:

In [7]:
fast_model = fast.train_supervised('imdb_train.txt')

Read 5M words
Number of words:  281132
Number of labels: 2
Progress: 100.0% words/sec/thread: 1826810 lr:  0.000000 avg.loss:  0.425605 ETA:   0h 0m 0s


Let's look at the training vocabulary:

In [8]:
print(f"the vocabulary size is: {len(fast_model.words)}\n\nThis is a slice of it:\n{fast_model.words[:20]}")

the vocabulary size is: 281132

This is a slice of it:
['the', 'a', 'and', 'of', 'to', 'is', 'in', 'I', 'that', 'this', 'it', '/><br', 'was', 'as', 'with', 'for', 'but', 'The', 'on', 'movie']
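The vocabulary contains both 'the' and 'The', as well as the odd token '/><br'. A small sketch of why, using plain whitespace splitting as an approximation of how the raw text is cut into words (the sample review is made up for illustration):

```python
from collections import Counter

review = "The plot is good.<br /><br />The end of the movie is not."

# Whitespace-only tokenisation: punctuation and HTML remnants stay glued
# to the neighbouring words, and case is preserved.
tokens = review.split()
counts = Counter(tokens)

print(tokens)
print(counts["The"], counts["the"])
```

Tokens such as 'good.<br' and '/>The' are each counted as distinct words, which inflates the vocabulary on uncleaned text.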


### Results of the model:

We reuse the following print function from the FastText documentation to display the results:

In [9]:
def print_results(N : int, p : float, r : float) -> None:
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

Let's compute the precision at 1 (P@1) and the recall at 1 (R@1) on the test dataset:

In [10]:
print_results(*fast_model.test('imdb_test.txt'))

N	25000
P@1	0.860
R@1	0.860
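P@1 and R@1 are identical here because every review has exactly one gold label and the model returns exactly one prediction. The sketch below follows the usual definitions (correct predictions divided by the number of predictions, and by the number of gold labels); it is an illustration under that assumption, not FastText's internal code, and the function name is hypothetical.

```python
from typing import List, Set, Tuple


def precision_recall_at_1(gold: List[Set[str]], top1: List[str]) -> Tuple[float, float]:
    """Return (P@1, R@1) given gold label sets and the top prediction per example."""
    correct = sum(1 for labels, pred in zip(gold, top1) if pred in labels)
    n_predictions = len(top1)                           # one prediction per example
    n_gold_labels = sum(len(labels) for labels in gold) # one gold label per example here
    return correct / n_predictions, correct / n_gold_labels


gold = [{"__label__0"}, {"__label__1"}, {"__label__1"}, {"__label__0"}]
top1 = ["__label__0", "__label__1", "__label__0", "__label__0"]
p_at_1, r_at_1 = precision_recall_at_1(gold, top1)
print(p_at_1, r_at_1)
```

In this single-label setting both denominators equal the number of examples, so both values reduce to plain accuracy.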


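We can also compute these metrics for each label separately. Note that in the outputs of this notebook the per-label recall is reported as `nan` and the F1 score exceeds 1, which cannot be valid values for these metrics, so only the precision column should be relied on: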

In [11]:
def print_labels_results(l_scores : Dict[str, Dict[str, float]]) -> None:
    for label in l_scores:
        print(f"label '{label}':\n")
        print(f"\tprecision: {np.round(l_scores[label]['precision'], 3)}")
        print(f"\trecall: {np.round(l_scores[label]['recall'], 3)}")
        print(f"\tF1 score: {np.round(l_scores[label]['f1score'], 3)}\n")

In [12]:
print_labels_results(fast_model.test_label('imdb_test.txt'))

label '__label__1':

	precision: 0.858
	recall: nan
	F1 score: 1.717

label '__label__0':

	precision: 0.861
	recall: nan
	F1 score: 1.722
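Recall and F1 both lie between 0 and 1, so the `nan` recall and the F1 values above 1 (roughly twice the precision) cannot be taken at face value. One way to cross-check them is to recompute the metrics from the predictions. The sketch below assumes the gold and predicted labels are available as two lists (for example collected with `fast_model.predict`); the function name and the toy data are hypothetical.

```python
from typing import Dict, List


def per_label_scores(gold: List[str], pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall and F1 for every label."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1score": f1}
    return scores


gold = ["__label__0", "__label__0", "__label__1", "__label__1", "__label__1"]
pred = ["__label__0", "__label__1", "__label__1", "__label__1", "__label__0"]
scores = per_label_scores(gold, pred)
print(scores)
```

The returned dictionary has the same keys as the one consumed by `print_labels_results`, so it can be passed to that function directly.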



## Pre-processing the IMDB dataset:

### Clean the text

The text format is not perfect: it contains, for example, '\t' characters and '<br />' HTML tags. We therefore replace every special character with a space. We also add a space before and after '!' so that it becomes a separate word.

In [13]:
def clean_the_text(text_array : str) -> str:
    '''
        Replace special characters and HTML line-break remnants with spaces,
        isolate '!' as a separate word, and lowercase the text.

            Parameters:
                    text_array (str): The text in a string format.

            Returns:
                    result (str): The cleaned, lowercased text.
    '''
    
    specialChars = "()\\\''.,;:\"?-" 
    for specialChar in specialChars:
        text_array = text_array.replace(specialChar, ' ')
        
    text_array = text_array.replace("/>", ' ')
    text_array = text_array.replace("<br", ' ')
    # Add a space before and after '!' so that it becomes a separate word
    text_array = text_array.replace("!", " ! ")
    
    return text_array.lower()
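
The same cleaning idea can be written with regular expressions. The sketch below is an alternative, not the function used in this notebook: `clean_with_regex` is a hypothetical name, and unlike `clean_the_text` it also collapses runs of whitespace (including tabs) into single spaces.

```python
import re

_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)  # <br>, <br/>, <br />
_PUNCT = re.compile(r"[()\\'.,;:\"?\-]")                 # same characters as specialChars
_SPACES = re.compile(r"\s+")


def clean_with_regex(text: str) -> str:
    """Remove <br> tags and punctuation, isolate '!', normalise spaces, lowercase."""
    text = _BR_TAG.sub(" ", text)
    text = _PUNCT.sub(" ", text)
    text = text.replace("!", " ! ")
    return _SPACES.sub(" ", text).strip().lower()


print(clean_with_regex("Great movie!<br /><br />Don't miss it."))
```

Collapsing whitespace does not change the tokens seen by the model, but it makes the written files easier to read.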

Now we can train the same model on the cleaned text and see whether this modification changes the results.

In [14]:
%%time

if not os.path.exists("imdb_clean_train.txt"):
    with open("imdb_clean_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            s = f"__label__{entry['label']} {clean_the_text(entry['text'])}\n".encode("utf-8")
            f.write(s)

# Reshuffle rand_idx before writing the test dataset
np.random.shuffle(rand_idx)
        
if not os.path.exists("imdb_clean_test.txt"):
    with open("imdb_clean_test.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['test'][int(i)]
            s = f"__label__{entry['label']} {clean_the_text(entry['text'])}\n".encode("utf-8")
            f.write(s)

CPU times: user 3.72 s, sys: 200 ms, total: 3.92 s
Wall time: 3.91 s


In [15]:
!head -n 1 imdb_clean_train.txt

__label__0 what a time we live in when someone like this joe swan whatever the hell is considered a good filmmaker   or even a filmmaker at all !  where are the new crop of filmmakers with brains and talent    we need them bad  and to hell with mumblecore !       this movie is about nothing  just as the characters in the film stand for nothing  it s this horrible  so called gen y  that is full of bored idiots  some of which declare themselves filmmakers with out bothering to learn anything about the craft before shooting  well  orson welles was a filmmaker  john huston was a filmmaker  fellini was a filmmaker  dreyer was a filmmaker  etc  current films like these show just how stupid young  so called  filmmakers  can be when they believe going out with no script  no direction  no thought  no legit  camerawork   everything shot horribly on dv   no craft of editing  no nothing  stands for  rebellious  or  advanced  film making  nope  it s called ignorance and laziness or just pure mastur

In [16]:
fast_model_clean = fast.train_supervised('imdb_clean_train.txt')

Read 6M words
Number of words:  80799
Number of labels: 2
Progress: 100.0% words/sec/thread: 2116883 lr:  0.000000 avg.loss:  0.390082 ETA:   0h 0m 0s


In [17]:
print(f"the vocabulary size is: {len(fast_model_clean.words)}\n\nThis is a slice of it:\n{fast_model_clean.words[:20]}")

the vocabulary size is: 80799

This is a slice of it:
['the', 'and', 'a', 'of', 'to', 'is', 'it', 'in', 'i', 'this', 'that', 's', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'you']


#### Results of the cleaned model

In [18]:
print_results(*fast_model_clean.test('imdb_clean_test.txt'))

N	25000
P@1	0.879
R@1	0.879


In [19]:
print_labels_results(fast_model_clean.test_label('imdb_clean_test.txt'))

label '__label__0':

	precision: 0.882
	recall: nan
	F1 score: 1.765

label '__label__1':

	precision: 0.875
	recall: nan
	F1 score: 1.75



On average the results improve by about 0.02 (P@1 rises from 0.860 to 0.879). This is not a huge gain, but it is worthwhile for a few more seconds of computation.

### Stemming the data:

First of all, we need to create a function that stems a single word.

In [20]:
re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")

stem_word: Callable[[str], str] = lambda w : stemmer.stem(w.lower()) if re_word.match(w) else w

Now we have to create a function that applies stemming to a whole text.

In [21]:
def stemming_text(text : str) -> str:
    '''
        Stem every space-separated word of the text.
        
        Parameters :
                text (str) : the text to stem
                
        Returns :
                return_text (str) : the stemmed text
    '''
    list_of_words = text.split(" ")
    
    list_of_words = [stem_word(word) for word in list_of_words]
    
    return_text = " ".join(list_of_words)
    
    return return_text
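
Because `stemming_text` splits on spaces only, a word followed by punctuation (for instance 'years.') does not match `re_word` and is left unstemmed. The sketch below separates words from punctuation before applying a stemming function. It is a hedged variant: `apply_to_words` and `toy_stem` are hypothetical names, and `toy_stem` is only a stand-in for illustration (in the notebook, `stemmer.stem` would be passed instead).

```python
import re
from typing import Callable

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
WORD_RE = re.compile(r"\w+")


def apply_to_words(text: str, stem: Callable[[str], str]) -> str:
    """Apply `stem` to every word token; punctuation tokens are kept as they are."""
    tokens = TOKEN_RE.findall(text)
    return " ".join(stem(tok.lower()) if WORD_RE.fullmatch(tok) else tok for tok in tokens)


def toy_stem(word: str) -> str:
    """Toy stand-in for a real stemmer: strips a final 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


print(apply_to_words("It was released several years. Fun films, really!", toy_stem))
```

One side effect of this tokenisation is that contractions such as "didn't" are split around the apostrophe, which may or may not be desirable.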

With this function we can build the new model: every text is stemmed before being written to the file.

In [22]:
%%time

if not os.path.exists("imdb_stemmed_train.txt"):
    with open("imdb_stemmed_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            s = f"__label__{entry['label']} {stemming_text(entry['text'])}\n".encode("utf-8")
            f.write(s)

# Reshuffle rand_idx before writing the test dataset
np.random.shuffle(rand_idx)
        
if not os.path.exists("imdb_stemmed_test.txt"):
    with open("imdb_stemmed_test.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['test'][int(i)]
            s = f"__label__{entry['label']} {stemming_text(entry['text'])}\n".encode("utf-8")
            f.write(s)

CPU times: user 1.31 ms, sys: 60 µs, total: 1.37 ms
Wall time: 1.17 ms


In [23]:
!head -n 1 imdb_stemmed_train.txt

__label__1 i watch this last night after not have seen it for sever years. it realli is a fun littl film, with a bunch of face you didn't know were in it. arkin shine as always. check it out; you won't be dissappointed. by the way, it was just releas on dvd and contrari to it packaging, it is widescreen. the transfer is rather poor, but at least the whole movi is visible. ;-)


In [24]:
fast_model_stemmed = fast.train_supervised('imdb_stemmed_train.txt')

Read 5M words
Number of words:  245430
Number of labels: 2
Progress: 100.0% words/sec/thread: 2100794 lr:  0.000000 avg.loss:  0.418999 ETA:   0h 0m 0s


In [25]:
print(f"the vocabulary size is: {len(fast_model_stemmed.words)}\n\nThis is a slice of it:\n{fast_model_stemmed.words[:20]}")

the vocabulary size is: 245430

This is a slice of it:
['the', 'a', 'and', 'of', 'to', 'is', 'in', 'it', 'i', 'this', 'that', '/><br', 'was', 'as', 'for', 'with', 'but', 'movi', 'film', 'be']


#### Results of the stemmed model

In [26]:
print_results(*fast_model_stemmed.test('imdb_stemmed_test.txt'))

N	25000
P@1	0.861
R@1	0.861


In [27]:
print_labels_results(fast_model_stemmed.test_label('imdb_stemmed_test.txt'))

label '__label__1':

	precision: 0.857
	recall: nan
	F1 score: 1.715

label '__label__0':

	precision: 0.864
	recall: nan
	F1 score: 1.728



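Stemming barely changes the results: P@1 moves from 0.860 to 0.861 and the per-label precision changes by at most 0.003, so the gain is not convincing. Note that `stemming_text` splits on spaces only, so words attached to punctuation (such as 'years.' in the sample above) do not match `re_word` and are left unstemmed.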

### Lemmatizing the data:

First, we need to download the English spaCy model used for lemmatization:

In [28]:
!python -m spacy download en_core_web_sm > output_dl.txt

In [29]:
# loading the small English model
nlp = spacy.load("en_core_web_sm")

In [30]:
%%time

# Shuffle order again
np.random.shuffle(rand_idx)

if not os.path.exists("lemmed_imdb_train.txt"):
    with open("lemmed_imdb_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            
            # lemmatize before writing
            lemmed_text = ' '.join([token.lemma_ for token in nlp(entry['text'])])
            s = f"__label__{entry['label']} {lemmed_text}\n".encode("utf-8")
            f.write(s)

CPU times: user 11min 19s, sys: 650 ms, total: 11min 20s
Wall time: 11min 20s
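
The lemmatization cell above takes about 11 minutes because `nlp` is called once per review. spaCy's `nlp.pipe` processes texts in batches, which is generally faster. The sketch below is an assumption about how the loop could be restructured, not a measured improvement; `lemmatize_texts` is a hypothetical helper, and whether disabling the parser and NER components leaves the lemmas unchanged should be checked before relying on it.

```python
from typing import Iterable, Iterator


def lemmatize_texts(nlp, texts: Iterable[str], batch_size: int = 64) -> Iterator[str]:
    """Yield the lemmatized version of each text, processed in batches."""
    # nlp.pipe streams the texts through the pipeline batch by batch
    # instead of running the full pipeline once per call.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield " ".join(token.lemma_ for token in doc)


# Assumed usage with the objects of this notebook (not run here):
#   nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
#   texts = (imdb["train"][int(i)]["text"] for i in rand_idx)
#   for lemmed_text in lemmatize_texts(nlp, texts):
#       ...
```

Because the helper is a generator, the lemmatized reviews can be written to the file one by one without holding all of them in memory.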


Do the same for the test dataset:

In [31]:
# Shuffle order again
np.random.shuffle(rand_idx)

if not os.path.exists("lemmed_imdb_test.txt"):
    with open("lemmed_imdb_test.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['test'][int(i)]
            
            # lemmatize before writing
            lemmed_text = ' '.join([token.lemma_ for token in nlp(entry['text'])])
            s = f"__label__{entry['label']} {lemmed_text}\n".encode("utf-8")
            f.write(s)

In [32]:
fast_model_lemming = fast.train_supervised('lemmed_imdb_train.txt')

Read 6M words
Number of words:  106199
Number of labels: 2
Progress: 100.0% words/sec/thread: 2445864 lr:  0.000000 avg.loss:  0.416560 ETA:   0h 0m 0s


In [33]:
print(f"the vocabulary size is: {len(fast_model_lemming.words)}\n\nThis is a slice of it:\n{fast_model_lemming.words[:20]}")

the vocabulary size is: 106199

This is a slice of it:
['the', 'be', ',', '.', 'and', 'a', 'of', 'to', 'it', 'I', 'in', 'this', 'that', '"', 'have', '-', '/><br', 'movie', 'film', 'as']


#### Results

In [34]:
print_results(*fast_model_lemming.test('lemmed_imdb_test.txt'))

N	25000
P@1	0.867
R@1	0.867


In [37]:
print_labels_results(fast_model_lemming.test_label('lemmed_imdb_test.txt'))

label '__label__0':

	precision: 0.872
	recall: nan
	F1 score: 1.743

label '__label__1':

	precision: 0.862
	recall: nan
	F1 score: 1.723



## Hyperparameter tuning:

We need to extract a validation set from our train dataset, so that the hyperparameters are not tuned on the test dataset:

In [38]:
# The split command cuts the file into chunks of at most 20,000 lines.
# The train file has 25,000 lines, so the training part gets 20,000 lines and the validation part 5,000.
!split -l20000 "imdb_train.txt" tuning_

!ls -l ./tuning_*

-rw-r--r-- 1 leherlemaxime leherlemaxime 26719815 Oct  4 19:16 ./tuning_aa
-rw-r--r-- 1 leherlemaxime leherlemaxime  6713008 Oct  4 19:16 ./tuning_ab
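
The `split` shell command is not available on every platform. A portable sketch of the same train/validation split in Python follows; the function name and the toy file names are hypothetical, and the demonstration uses a temporary 10-line file rather than the real dataset.

```python
import os
import tempfile
from itertools import islice


def split_train_validation(src: str, train_out: str, valid_out: str, n_train: int) -> None:
    """Write the first n_train lines of src to train_out and the rest to valid_out."""
    with open(src, "r", encoding="utf-8") as f_in, \
         open(train_out, "w", encoding="utf-8") as f_train, \
         open(valid_out, "w", encoding="utf-8") as f_valid:
        # islice consumes exactly n_train lines from the file iterator ...
        f_train.writelines(islice(f_in, n_train))
        # ... so the remaining lines go to the validation file.
        f_valid.writelines(f_in)


# Small self-contained demonstration on a temporary 10-line file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy_train.txt")
    train_path = os.path.join(tmp, "toy_tuning_train.txt")
    valid_path = os.path.join(tmp, "toy_tuning_valid.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.writelines(f"__label__{i % 2} review number {i}\n" for i in range(10))
    split_train_validation(src, train_path, valid_path, n_train=8)
    with open(train_path, encoding="utf-8") as f:
        train_lines = f.read().splitlines()
    with open(valid_path, encoding="utf-8") as f:
        valid_lines = f.read().splitlines()

print(len(train_lines), len(valid_lines))
```

For the notebook's data the equivalent call would be `split_train_validation("imdb_train.txt", "tuning_aa", "tuning_ab", 20_000)`. Since the file was written in shuffled order, taking the first 20,000 lines gives a random split.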


### Try the default hyperparameter tuning of FastText:

In [39]:
tunning_fast_model = fast.train_supervised(input='tuning_aa', autotuneValidationFile='tuning_ab', autotuneMetric="f1:__label__0")

Progress: 100.0% Trials:    9 Best score:  0.884740 ETA:   0h 0m 0s
Training again with best arguments
Read 4M words
Number of words:  244423
Number of labels: 2
Progress: 100.0% words/sec/thread:  767266 lr:  0.000000 avg.loss:  0.048147 ETA:   0h 0m 0s


Let's compute global metrics:

In [40]:
print_results(*tunning_fast_model.test('imdb_test.txt'))

N	25000
P@1	0.883
R@1	0.883


The default hyperparameter tuning appears to give better results (P@1 rises from 0.860 to 0.883). But how do the per-label scores change?

In [41]:
print_labels_results(tunning_fast_model.test_label('imdb_test.txt'))

label '__label__1':

	precision: 0.882
	recall: nan
	F1 score: 1.764

label '__label__0':

	precision: 0.884
	recall: nan
	F1 score: 1.768



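Results are better with tuning. Note that optimizing the F1 score on the negative label (`__label__0`) improves the positive label (`__label__1`) at least as much (precision +0.024 versus +0.023). This is because there are only two labels, so a review misclassified for one label is also an error for the other, and the negative label had fewer misclassifications to begin with.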

## Conclusion: