# FastText and Word Vector (TP n°3)

In [2]:
import os

import numpy as np

from datasets import load_dataset

import spacy
import fasttext as fast
import transformers

# Seeded generator for reproducible results. Note: the shuffles below call
# np.random.shuffle, which uses NumPy's global state rather than this generator.
random = np.random.default_rng(42)

## Take a look at the IMDB dataset:

In [3]:
imdb = load_dataset('imdb')
print(imdb)

Reusing dataset imdb (/home/cloud441/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


And we have the following number of entries:

In [4]:
print(f"train entries: {len(imdb['train'])}\ntest entries: {len(imdb['test'])}")

train entries: 25000
test entries: 25000


## Convert the dataset to the FastText format:

Generate a shuffled index list:

In [5]:
rand_idx = np.arange(len(imdb['train']))
np.random.shuffle(rand_idx)
print(rand_idx[:10])

[ 2187 22262  4672 12858 23855 19338 23333  2842 20483 23441]
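The shuffle can also be driven by a seeded generator, which makes the order repeatable across runs. A minimal sketch, assuming the generator is built with the same seed (42) as in the imports cell; `n_entries` is an illustrative stand-in for `len(imdb['train'])`:

```python
import numpy as np

n_entries = 25000  # stand-in for len(imdb['train'])

# A generator seeded with a fixed value always yields the same permutation.
rng = np.random.default_rng(42)
shuffled_idx = rng.permutation(n_entries)

# Rebuilding the generator with the same seed reproduces the order exactly.
same_idx = np.random.default_rng(42).permutation(n_entries)
print(np.array_equal(shuffled_idx, same_idx))
```

With this approach the index order does not depend on NumPy's global random state.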


Write the IMDB dataset to a file in the FastText format:

In [6]:
%%time

if not os.path.exists("imdb_train.txt"):
    # The with block closes the file on exit, so no explicit close is needed.
    with open("imdb_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            s = f"__label__{entry['label']} {entry['text']}\n".encode("utf-8")
            f.write(s)

# Reshuffle rand_idx before writing the test set (it has the same number of entries).
np.random.shuffle(rand_idx)

if not os.path.exists("imdb_test.txt"):
    with open("imdb_test.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['test'][int(i)]
            s = f"__label__{entry['label']} {entry['text']}\n".encode("utf-8")
            f.write(s)

CPU times: user 399 µs, sys: 269 µs, total: 668 µs
Wall time: 628 µs


Let's see the input format of an entry:

In [7]:
!head -n 1 imdb_train.txt

__label__0 People who actually liked Problem Child (1990) need to have their heads examined. Who would take the idea of watching a malevolent little boy wreak havoc on others and deem it funny? The movie is not funny, ever, in any way, beginning to end. It wants to be a cartoon, but the writers don't realize that slapstick isn't funny when people get attacked by bears, or hit with baseball bats. It may be funny in cartoons, but not in a motion picture.<br /><br />The film's young hero is Junior (Michael Oliver) who, since he was a baby, has been placed at the front doors of foster parents for adoption. The families reject him, because Junior tends to give them a hard time.<br /><br />He is then thrown into an orphanage, where he terrorizes the nuns, and writes pen pal letters to the convicted Bow-Tie Killer (Michael Richards). He is soon adopted by Ben and Flo Healy (the late John Ritter and his wife, Amy Yasbeck), who are dying to have a child, in order to be just like every other par
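Each line holds one example: the `__label__` prefix followed by the label, then the text. Since FastText reads one example per line, a review containing a newline would be split into several examples. A minimal sketch of a formatter that guards against this; `to_fasttext_line` is a hypothetical helper, not part of the notebook:

```python
def to_fasttext_line(label: int, text: str) -> str:
    """Format one example as '__label__<label> <text>' on a single line."""
    # Collapse every whitespace run (including embedded newlines)
    # into a single space so the example stays on one line.
    one_line = " ".join(text.split())
    return f"__label__{label} {one_line}"


print(to_fasttext_line(0, "Not funny,\never."))
```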

## First training of a FastText model:

In [8]:
fast_model = fast.train_supervised('imdb_train.txt')

Read 5M words
Number of words:  281132
Number of labels: 2
Progress: 100.0% words/sec/thread: 3338162 lr:  0.000000 avg.loss:  0.425961 ETA:   0h 0m 0s 0m 0s


Let's look at the training vocabulary:

In [9]:
print(f"the vocabulary size is: {len(fast_model.words)}\n\nThis is a slice of it:\n{fast_model.words[:20]}")

the vocabulary size is: 281132

This is a slice of it:
['the', 'a', 'and', 'of', 'to', 'is', 'in', 'I', 'that', 'this', 'it', '/><br', 'was', 'as', 'with', 'for', 'but', 'The', 'on', 'movie']
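The slice contains `/><br`, a fragment of the `<br />` tags present in the raw reviews, and it counts `the` and `The` as separate words. A minimal sketch of a text normalizer that could be applied before writing the files; `normalize_review` is a hypothetical helper, and its effect on the scores is not measured in this notebook:

```python
import re


def normalize_review(text: str) -> str:
    """Hypothetical cleaner: drop <br /> tags, lowercase, space out punctuation."""
    text = re.sub(r"<br\s*/?>", " ", text)         # remove HTML line breaks
    text = text.lower()                             # merge 'The' and 'the'
    text = re.sub(r"([.!?,'/()])", r" \1 ", text)  # separate punctuation from words
    return " ".join(text.split())                   # collapse repeated spaces


print(normalize_review("The movie is not funny.<br /><br />It wants to be a cartoon."))
```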


### Results of the model:

We reuse the print helper from the FastText documentation to display the results:

In [10]:
def print_results(N: int, p: float, r: float) -> None:
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

Let's compute the precision at 1 (P@1) and the recall at 1 (R@1) on the test dataset:

In [11]:
print_results(*fast_model.test('imdb_test.txt'))

N	25000
P@1	0.860
R@1	0.860
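P@1 and R@1 coincide here because each review has exactly one gold label and the model returns exactly one prediction: precision divides the number of correct predictions by the number of predictions, recall divides it by the number of gold labels, and both counts equal N. A minimal sketch of that arithmetic; the function name and the toy labels are invented for illustration:

```python
def precision_recall_at_1(gold: list[str], pred: list[str]) -> tuple[float, float]:
    """P@1 and R@1 when every example has one gold label and one prediction."""
    correct = sum(g == p for g, p in zip(gold, pred))
    n_predictions = len(pred)  # one prediction per example
    n_gold_labels = len(gold)  # one gold label per example
    return correct / n_predictions, correct / n_gold_labels


# Toy labels, invented for illustration only.
gold = ["__label__0", "__label__1", "__label__1", "__label__0"]
pred = ["__label__0", "__label__1", "__label__0", "__label__0"]
p, r = precision_recall_at_1(gold, pred)
print(p, r)
```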


And we can compute these metrics for all labels separately:

In [12]:
def print_labels_results(l_scores: dict[str, dict[str, float]]) -> None:
    for label in l_scores:
        print(f"label '{label}':\n")
        print(f"\tprecision: {np.round(l_scores[label]['precision'], 3)}")
        print(f"\trecall: {np.round(l_scores[label]['recall'], 3)}")
        print(f"\tF1 score: {np.round(l_scores[label]['f1score'], 3)}\n")

In [13]:
print_labels_results(fast_model.test_label('imdb_test.txt'))

label '__label__0':

	precision: 0.86
	recall: nan
	F1 score: 1.72

label '__label__1':

	precision: 0.86
	recall: nan
	F1 score: 1.719



## Pre-processing the IMDB dataset:

### Stemming the data:

### Lemmatizing the data:

First, we need to download the English spaCy model used for lemmatization:

In [14]:
!python -m spacy download en_core_web_sm > output_dl.txt

In [15]:
# loading the small English model
nlp = spacy.load("en_core_web_sm")

In [16]:
%%time

# Shuffle the order again
np.random.shuffle(rand_idx)

if not os.path.exists("lemmed_imdb_train.txt"):
    # The with block closes the file on exit, so no explicit close is needed.
    with open("lemmed_imdb_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]

            # Lemmatize before writing
            lemmed_text = ' '.join([token.lemma_ for token in nlp(entry['text'])])
            s = f"__label__{entry['label']} {lemmed_text}\n".encode("utf-8")
            f.write(s)

CPU times: user 350 µs, sys: 41 µs, total: 391 µs
Wall time: 393 µs


Do the same for the test dataset:

In [17]:
# Shuffle the order again
np.random.shuffle(rand_idx)

if not os.path.exists("lemmed_imdb_test.txt"):
    # The with block closes the file on exit, so no explicit close is needed.
    with open("lemmed_imdb_test.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['test'][int(i)]

            # Lemmatize before writing
            lemmed_text = ' '.join([token.lemma_ for token in nlp(entry['text'])])
            s = f"__label__{entry['label']} {lemmed_text}\n".encode("utf-8")
            f.write(s)
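The two cells above call `nlp(...)` once per review. spaCy also provides `nlp.pipe`, which processes texts in batches and is usually faster on large collections. A minimal sketch of the same lemmatization written that way; `lemmatized_lines` is a hypothetical helper, `nlp` is assumed to be the pipeline loaded earlier, and the batch size of 64 is an arbitrary choice:

```python
from typing import Iterable, Iterator


def lemmatized_lines(nlp, entries: Iterable[dict], batch_size: int = 64) -> Iterator[str]:
    """Yield FastText-formatted lines with lemmatized text.

    `nlp` is assumed to be a loaded spaCy pipeline; `entries` are dicts
    with 'text' and 'label' keys, as in the IMDB dataset.
    """
    entries = list(entries)
    texts = (entry["text"] for entry in entries)
    # nlp.pipe processes the texts in batches instead of one call per review.
    docs = nlp.pipe(texts, batch_size=batch_size)
    for entry, doc in zip(entries, docs):
        lemmas = " ".join(token.lemma_ for token in doc)
        yield f"__label__{entry['label']} {lemmas}"
```

Each yielded line can then be encoded and written to the output file, followed by a newline, as in the cells above.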

## Hyperparameter tuning:

We need to hold out a validation set from the training data, so that the tuning is not validated on the test dataset:

In [18]:
# The split command copies the file into chunks of at most 20,000 lines each.
# The train file has 25,000 lines, so tuning_aa gets 20,000 lines (train) and tuning_ab gets 5,000 (validation).
!split -l20000 "imdb_train.txt" tuning_

!ls -l ./tuning_*

-rw-r--r-- 1 cloud441 cloud441 26845142 28 sept. 12:19 ./tuning_aa
-rw-r--r-- 1 cloud441 cloud441  6587681 28 sept. 12:19 ./tuning_ab


### Try the automatic hyperparameter tuning of FastText:

In [19]:
tunning_fast_model = fast.train_supervised(input='tuning_aa', autotuneValidationFile='tuning_ab', autotuneMetric="f1:__label__0")

Progress: 100.0% Trials:    9 Best score:  0.889420 ETA:   0h 0m 0s
Training again with best arguments
Read 4M words
Number of words:  245418
Number of labels: 2
Progress: 100.0% words/sec/thread: 1349410 lr:  0.000000 avg.loss:  0.049218 ETA:   0h 0m 0s


Let's compute global metrics:

In [20]:
print_results(*tunning_fast_model.test('imdb_test.txt'))

N	25000
P@1	0.884
R@1	0.884


Automatic hyperparameter tuning appears to give better results. But how do the per-label scores change?

In [21]:
print_labels_results(tunning_fast_model.test_label('imdb_test.txt'))

label '__label__0':

	precision: 0.888
	recall: nan
	F1 score: 1.777

label '__label__1':

	precision: 0.879
	recall: nan
	F1 score: 1.758



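Results are better with tuning. Although the tuning optimized the F1 score of the __negative__ label (`__label__0`) only, precision also improves on the __positive__ label (`__label__1`): from 0.860 to 0.879, against 0.860 to 0.888 for the negative label. The reason is that we have just two labels: every review that is not predicted negative is predicted positive, so fewer misclassifications on the __negative__ label also mean fewer on the __positive__ one. Only precision is compared here, because the reported recall is `nan` and the reported F1 values exceed 1, so neither can be interpreted.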

## Conclusion: