# FastText and Word Vectors (Lab 3)

In [1]:
import os

import numpy as np

from datasets import load_dataset

import fasttext as fast
import transformers

# Seeded random generator, so that shuffles are reproducible across runs.
random = np.random.default_rng(42)

## Take a look at the IMDB dataset:

In [2]:
imdb = load_dataset('imdb')
print(imdb)

Reusing dataset imdb (/home/cloud441/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


The train and test splits contain the following number of entries:

In [3]:
print(f"train entries: {len(imdb['train'])}\ntest entries: {len(imdb['test'])}")

train entries: 25000
test entries: 25000


## Convert the dataset for the FastText API:

Generate a shuffled index list:

In [4]:
rand_idx = np.arange(len(imdb['train']))
random.shuffle(rand_idx)  # shuffle in place with the seeded generator defined above
print(rand_idx[:10])

[24965 23207  4265 23127 11815 10280  4488 15073 11023 16587]


Write the IMDB dataset to files in the FastText format:

In [5]:
%%time

if not os.path.exists("imdb_train.txt"):
    # The `with` block closes the file, so no explicit f.close() is needed.
    with open("imdb_train.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['train'][int(i)]
            # FastText expects one example per line: "__label__<label> <text>"
            s = f"__label__{entry['label']} {entry['text']}\n".encode("utf-8")
            f.write(s)

# Reshuffle the indices so the test file is written in a different order.
random.shuffle(rand_idx)

if not os.path.exists("imdb_test.txt"):
    with open("imdb_test.txt", "wb") as f:
        for i in rand_idx:
            entry = imdb['test'][int(i)]
            s = f"__label__{entry['label']} {entry['text']}\n".encode("utf-8")
            f.write(s)

CPU times: user 1.5 s, sys: 87.4 ms, total: 1.59 s
Wall time: 1.59 s


Let's see the input format of an entry:

In [6]:
!head -n 1 imdb_train.txt

__label__0 People who actually liked Problem Child (1990) need to have their heads examined. Who would take the idea of watching a malevolent little boy wreak havoc on others and deem it funny? The movie is not funny, ever, in any way, beginning to end. It wants to be a cartoon, but the writers don't realize that slapstick isn't funny when people get attacked by bears, or hit with baseball bats. It may be funny in cartoons, but not in a motion picture.<br /><br />The film's young hero is Junior (Michael Oliver) who, since he was a baby, has been placed at the front doors of foster parents for adoption. The families reject him, because Junior tends to give them a hard time.<br /><br />He is then thrown into an orphanage, where he terrorizes the nuns, and writes pen pal letters to the convicted Bow-Tie Killer (Michael Richards). He is soon adopted by Ben and Flo Healy (the late John Ritter and his wife, Amy Yasbeck), who are dying to have a child, in order to be just like every other par
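
FastText's supervised mode reads one example per line, with the label prefix first and the text after it. A review that contained a raw newline would therefore be split across several lines. The sketch below shows one way to guard against that; `to_fasttext_line` is a hypothetical helper name, not part of the FastText API, and the sample text is made up.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Format one example as a single FastText supervised-mode line.

    Hypothetical helper: collapses every whitespace run (including
    newlines and tabs) into one space so the example stays on one line.
    """
    single_line = " ".join(text.split())
    return f"{prefix}{label} {single_line}\n"


line = to_fasttext_line(0, "Not funny,\never.\tAt all.")
print(repr(line))
```

The returned string can be encoded with `.encode("utf-8")` and written exactly as in the cell above.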

## Training a FastText model:

In [7]:
fast_model = fast.train_supervised('imdb_train.txt')

Read 5M words
Number of words:  281132
Number of labels: 2
Progress: 100.0% words/sec/thread: 2972655 lr:  0.000000 avg.loss:  0.434704 ETA:   0h 0m 0s
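
The call above relies on FastText's default hyperparameters. `train_supervised` also accepts keyword arguments such as `lr`, `epoch`, and `wordNgrams`. The sketch below shows how they could be passed; the values and the `train_with_params` wrapper are illustrative assumptions, not settings tuned or evaluated in this notebook.

```python
# Illustrative values only; they were not tuned on IMDB in this notebook.
TRAIN_PARAMS = {
    "lr": 0.5,         # learning rate
    "epoch": 25,       # number of passes over the training file
    "wordNgrams": 2,   # use word bigrams in addition to unigrams
}


def train_with_params(train_path, params=TRAIN_PARAMS):
    """Train a supervised FastText model with explicit hyperparameters."""
    import fasttext  # imported lazily so the sketch loads without the library

    return fasttext.train_supervised(input=train_path, **params)
```

A possible call would be `train_with_params('imdb_train.txt')`, followed by the same evaluation on `imdb_test.txt` to compare against the default configuration.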


Let's look at the training vocabulary:

In [8]:
print(f"the vocabulary size is: {len(fast_model.words)}\n\nThis is a slice of it:\n{fast_model.words[:20]}")

the vocabulary size is: 281132

This is a slice of it:
['the', 'a', 'and', 'of', 'to', 'is', 'in', 'I', 'that', 'this', 'it', '/><br', 'was', 'as', 'with', 'for', 'but', 'The', 'on', 'movie']
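
The slice contains `/><br`, a leftover of the `<br />` tags in the raw reviews, as well as both `the` and `The`: FastText splits tokens on whitespace and is case-sensitive. A minimal normalization sketch follows, assuming a hypothetical helper `normalize_review` applied to each text before the files are written. Its effect on accuracy for this dataset is not measured here.

```python
import re

BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
PUNCT = re.compile(r'([.!?,;:()"])')


def normalize_review(text):
    """Lowercase a review, drop <br /> tags, and space out punctuation."""
    text = BR_TAG.sub(" ", text)            # replace HTML line breaks with a space
    text = PUNCT.sub(r" \1 ", text)         # make punctuation marks separate tokens
    return " ".join(text.lower().split())   # lowercase and collapse whitespace


print(normalize_review("The movie is not funny.<br /><br />The film's hero is Junior."))
```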


### Results of the model:

We reuse the `print_results` helper from the FastText documentation to display the results:

In [9]:
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

Now let's compute the precision at 1 (P@1) and the recall at 1 (R@1) on the test dataset:

In [10]:
print_results(*fast_model.test('imdb_test.txt'))

N	25000
P@1	0.859
R@1	0.859
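
P@1 and R@1 are identical here because every review carries exactly one label and the model is asked for exactly one prediction, so both reduce to accuracy. The sketch below illustrates this on toy data; `precision_recall_at_1` is a hypothetical helper that mirrors the definitions, not FastText's internal code.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Compute P@1 and R@1 when one label is predicted per example.

    true_labels: list of sets of gold labels, one set per example.
    predicted_labels: list with the single top prediction per example.
    """
    hits = sum(pred in gold for gold, pred in zip(true_labels, predicted_labels))
    n_predicted = len(predicted_labels)               # one prediction per example
    n_gold = sum(len(gold) for gold in true_labels)   # total number of gold labels
    return hits / n_predicted, hits / n_gold


gold = [{"0"}, {"1"}, {"1"}, {"0"}]
pred = ["0", "1", "0", "0"]
p, r = precision_recall_at_1(gold, pred)
print(p, r)
```

The two values only diverge when an example has more than one gold label, since the recall denominator then exceeds the number of predictions.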
