## Installing fastText

In [None]:
from fast_word import *

The first step of this tutorial is to install and build fastText.

### Download the English model directly from the command line

In [6]:
!cd fastText-0.9.2 && python3 download_model.py en

File exists. Use --overwrite to download anyway.


### Loading the dataset "IMDB"

Using the `split` argument, we can load the IMDB dataset as two separate datasets: one for training and one for testing.

In [83]:
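from datasets import load_dataset  # Hugging Face `datasets`; harmless if `fast_word` already exports it
train_dataset = load_dataset('imdb', split='train')
test_dataset = load_dataset('imdb', split='test')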

Reusing dataset imdb (/home/token/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)
Reusing dataset imdb (/home/token/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


*train_dataset* is a `Dataset` object that exposes two pieces of information:
- `features`, which lists the two columns: text and label (our x_train and our y_train)
- `num_rows`, the number of rows in our dataset

In [21]:
print(train_dataset)

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})


If we take the first sample of our training dataset, we can see the first review and the label associated with it (1 for positive, 0 for negative).

In [22]:
print(train_dataset[0])

{'text': 'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!', 'label': 1}


### Preprocessing the dataset

Judging from the raw data, the reviews are not preprocessed: they still contain HTML tags, among other things.
We definitely need to work on those.
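To make the cleaning step concrete, here is a minimal sketch of what such preprocessing could look like. The helper names `clean_review` and `to_fasttext_line` are hypothetical and are not the functions from `fast_word`, whose implementation is not shown here. The only fastText-specific detail is the `__label__` prefix, which is the default label marker in fastText's supervised input format.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches HTML tags such as <br />


def clean_review(text: str) -> str:
    """Strip HTML tags, unescape entities, lowercase, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)   # replace each tag with a space
    text = html.unescape(text)     # turn &amp; into &, etc.
    text = text.lower()
    return " ".join(text.split())  # collapse runs of whitespace


def to_fasttext_line(text: str, label: int) -> str:
    """Format one example as '__label__<label> <text>' on a single line."""
    return f"__label__{label} {clean_review(text)}"


sample = "A classic line:<br /><br />Welcome to Bromwell High."
print(to_fasttext_line(sample, 1))
# __label__1 a classic line: welcome to bromwell high.
```

Writing one such line per review to a text file yields a file that `fasttext.train_supervised` can read.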

In [85]:
y_train = train_dataset['label']

In [86]:
y_test = test_dataset['label']

In [54]:
create_imdb_train_file(train_dataset, y_train)
create_imdb_train_and_validation_file(train_dataset, y_train)
create_imdb_test_file(test_dataset, y_test)

100%|██████████| 25000/25000 [00:13<00:00, 1796.02it/s]
100%|██████████| 5000/5000 [00:02<00:00, 1864.51it/s]
100%|██████████| 25000/25000 [00:13<00:00, 1812.98it/s]


## Let's train the model using fastText

##### 1) Lemmatization

In [87]:
create_imdb_train_file(train_dataset, y_train, lemm = True)
create_imdb_train_and_validation_file(train_dataset, y_train, lemm = True)
create_imdb_test_file(test_dataset, y_test, lemm = True)

100%|██████████| 25000/25000 [00:32<00:00, 761.64it/s]
100%|██████████| 5000/5000 [00:06<00:00, 762.32it/s]
100%|██████████| 25000/25000 [00:33<00:00, 744.22it/s]


In [88]:
%%time
model = fasttext.train_supervised(input="imdb_train_lemmed.txt")

CPU times: user 21.5 s, sys: 183 ms, total: 21.6 s
Wall time: 3.97 s


In [91]:
res = model.test("imdb_test.txt")
print("Number of samples: " + str(res[0]) + ", Precision: " + str(res[1]) + ", Recall: " + str(res[2]))

Number of samples: 25000, Precision: 0.8726, Recall: 0.8726
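Precision and recall are identical in the output above, and this is expected. `model.test` returns the number of samples followed by precision and recall at `k` (with `k=1` by default). When every review has exactly one true label and we keep exactly one prediction, both ratios reduce to plain accuracy. The sketch below illustrates this on toy data; the function name and the toy labels are ours, not part of fastText.

```python
def precision_recall_at_k(predictions, gold, k=1):
    """predictions: ranked label lists; gold: sets of true labels per sample."""
    n_correct = 0    # predicted labels that are true labels
    n_predicted = 0  # total number of labels we predicted
    n_gold = 0       # total number of true labels
    for ranked, truth in zip(predictions, gold):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    return n_correct / n_predicted, n_correct / n_gold


predictions = [["pos", "neg"], ["neg", "pos"], ["pos", "neg"], ["neg", "pos"]]
gold = [{"pos"}, {"pos"}, {"pos"}, {"neg"}]

precision, recall = precision_recall_at_k(predictions, gold, k=1)
print(precision, recall)  # 0.75 0.75
```

With `k=2` on the same toy data, the two values diverge (0.5 and 1.0), because we now predict twice as many labels as there are true ones.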


##### 2) Stemming

In [92]:
create_imdb_train_file(train_dataset, y_train, stemm = True)
create_imdb_train_and_validation_file(train_dataset, y_train, stemm = True)
create_imdb_test_file(test_dataset, y_test, stemm = True)

100%|██████████| 25000/25000 [01:36<00:00, 258.92it/s]
100%|██████████| 5000/5000 [00:19<00:00, 253.94it/s]
100%|██████████| 25000/25000 [01:36<00:00, 259.78it/s]


In [93]:
%%time
model = fasttext.train_supervised(input="imdb_train_stemmed.txt")

CPU times: user 20.2 s, sys: 167 ms, total: 20.4 s
Wall time: 3.58 s


In [94]:
res = model.test("imdb_test.txt")
print("Number of samples: " + str(res[0]) + ", Precision: " + str(res[1]) + ", Recall: " + str(res[2]))

Number of samples: 25000, Precision: 0.8164, Recall: 0.8164
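Stemming and lemmatization both reduce words to a base form, but they do it differently: a lemmatizer maps a word to a dictionary form, while a stemmer chops off suffixes and may produce strings that are not words. The toy sketch below shows the difference. The lookup table and the suffix list are deliberately tiny and hypothetical; this is not the algorithm used by `fast_word` or by any real stemmer such as Porter's.

```python
# Toy lemmatizer: a hand-written lookup table (illustration only).
LEMMAS = {"was": "be", "studies": "study", "better": "good", "movies": "movie"}

# Toy stemmer: strips a few suffixes without checking the result is a word.
SUFFIXES = ("ies", "ing", "ed", "es", "s")


def toy_lemmatize(word):
    return LEMMAS.get(word, word)


def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["studies", "movies", "better", "was"]
print([toy_lemmatize(w) for w in words])  # ['study', 'movie', 'good', 'be']
print([toy_stem(w) for w in words])       # ['stud', 'mov', 'better', 'was']
```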


##### 3) Stop Words

In [95]:
create_imdb_train_file(train_dataset, y_train, stop_words = True)
create_imdb_train_and_validation_file(train_dataset, y_train, stop_words = True)
create_imdb_test_file(test_dataset, y_test, stop_words = True)

100%|██████████| 25000/25000 [00:17<00:00, 1451.18it/s]
100%|██████████| 5000/5000 [00:03<00:00, 1533.33it/s]
100%|██████████| 25000/25000 [00:15<00:00, 1569.33it/s]


In [96]:
%%time
model = fasttext.train_supervised(input="imdb_train_sw.txt")

CPU times: user 9.91 s, sys: 183 ms, total: 10.1 s
Wall time: 2.08 s


In [97]:
res = model.test("imdb_test.txt")
print("Number of samples: " + str(res[0]) + ", Precision: " + str(res[1]) + ", Recall: " + str(res[2]))

Number of samples: 25000, Precision: 0.87496, Recall: 0.87496
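For reference, stop-word removal can be sketched in a few lines. The stop-word set below is a small hypothetical sample, not the list used by `fast_word`. It includes "not" on purpose, since some stop-word lists contain negations, which matters for sentiment analysis.

```python
# Hypothetical, deliberately small stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "it", "of", "to", "and", "not"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token found in the stop-word set."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stop_words)


print(remove_stop_words("The plot is not good and the acting is worse"))
# plot good acting worse
```

Note how dropping "not" turns "not good" into "good": shorter inputs train faster, but removing stop words can also remove signal.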


##### 4) No extra preprocessing

In [98]:
%%time
model = fasttext.train_supervised(input="imdb_train.txt")

CPU times: user 18.5 s, sys: 276 ms, total: 18.8 s
Wall time: 3.52 s


In [99]:
res = model.test("imdb_test.txt")
print("Number of samples: " + str(res[0]) + ", Precision: " + str(res[1]) + ", Recall: " + str(res[2]))

Number of samples: 25000, Precision: 0.87548, Recall: 0.87548


### Making the model better

#### How about n-grams?

In [100]:
%%time
model = fasttext.train_supervised(input="imdb_train.txt", wordNgrams=2) 

CPU times: user 43.2 s, sys: 542 ms, total: 43.8 s
Wall time: 7.72 s


In [101]:
res = model.test("imdb_test.txt")
print("Number of samples: " + str(res[0]) + ", Precision: " + str(res[1]) + ", Recall: " + str(res[2]))

Number of samples: 25000, Precision: 0.86576, Recall: 0.86576
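To see what `wordNgrams=2` adds, here is a rough sketch of the features it corresponds to: every single word plus every pair of consecutive words. This only illustrates the idea. fastText itself hashes n-grams into a fixed number of buckets rather than storing them as strings, and the function names below are ours.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text, max_n=2):
    """Collect unigrams up to max_n-grams for a whitespace-tokenized text."""
    tokens = text.split()
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(word_ngrams(tokens, n))
    return feats


print(features("not a good movie", max_n=2))
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

Bigrams let the model see some word order (for example "not a"), at the cost of many more features, which is consistent with the longer training time above.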


#### With more epochs and a higher learning rate (lr=0.5)?

In [102]:
%%time
model = fasttext.train_supervised(input="imdb_train.txt", lr=0.5, epoch=25, wordNgrams=2)

CPU times: user 3min 32s, sys: 1.61 s, total: 3min 33s
Wall time: 34.1 s


In [103]:
res = model.test("imdb_test.txt")
print("Number of samples: " + str(res[0]) + ", Precision: " + str(res[1]) + ", Recall: " + str(res[2]))

Number of samples: 25000, Precision: 0.89536, Recall: 0.89536


#### With more epochs and a smaller learning rate (lr=0.25)?

In [104]:
%%time
model = fasttext.train_supervised(input="imdb_train.txt", lr=0.25, epoch=25, wordNgrams=2)

CPU times: user 3min 33s, sys: 1.91 s, total: 3min 35s
Wall time: 33.9 s


In [105]:
res = model.test("imdb_test.txt")
print("Number of samples: " + str(res[0]) + ", Precision: " + str(res[1]) + ", Recall: " + str(res[2]))

Number of samples: 25000, Precision: 0.89628, Recall: 0.89628


#### With more epochs and a bigger learning rate (lr=1.0)?

In [106]:
%%time
model = fasttext.train_supervised(input="imdb_train.txt", lr=1.0, epoch=25, wordNgrams=2)

CPU times: user 3min 31s, sys: 1.7 s, total: 3min 33s
Wall time: 33.6 s


In [107]:
res = model.test("imdb_test.txt")
print("Number of samples: " + str(res[0]) + ", Precision: " + str(res[1]) + ", Recall: " + str(res[2]))

Number of samples: 25000, Precision: 0.8952, Recall: 0.8952
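The three learning rates tried above give very similar scores. One possible reason, offered here as an assumption rather than a verified explanation, is that fastText does not keep `lr` constant: as far as we understand its trainer, the rate decays linearly towards zero as training progresses. The sketch below shows such a schedule at five points of training; `decayed_lr` is a hypothetical helper, not a fastText function.

```python
def decayed_lr(initial_lr, progress):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    return initial_lr * (1.0 - progress)


for initial_lr in (0.25, 0.5, 1.0):
    # Sample the schedule at 0%, 25%, 50%, 75% and 100% of training.
    schedule = [decayed_lr(initial_lr, step / 4) for step in range(5)]
    print(initial_lr, schedule)
```

Under this schedule every run ends with very small updates, whatever the starting value.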


#### With `autotuneValidationFile` (for testing and performance comparison)

In [108]:
%%time
model = fasttext.train_supervised(input="imdb_train_splited_with_the_validation.txt", autotuneValidationFile='imdb_validation.txt', autotuneDuration=150)

CPU times: user 25min 48s, sys: 12.5 s, total: 26min
Wall time: 4min 17s


In [109]:
res = model.test("imdb_test.txt")
print("Number of samples: " + str(res[0]) + ", Precision: " + str(res[1]) + ", Recall: " + str(res[2]))

Number of samples: 25000, Precision: 0.87896, Recall: 0.87896
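Autotuning needs a validation file that is separate from both the training and the test data. The progress bars in the preprocessing cells suggest that 5000 of the 25000 training reviews are held out for validation, but the exact logic of `create_imdb_train_and_validation_file` is not shown. The sketch below is therefore only an assumption of how such a split could be made; the function name, the 20% fraction, and the seed are all hypothetical.

```python
import random


def split_train_validation(lines, validation_fraction=0.2, seed=0):
    """Shuffle the examples reproducibly and hold out a validation share."""
    shuffled = list(lines)                 # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatability
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder lines standing in for the 25000 formatted training reviews.
lines = [f"__label__{i % 2} review number {i}" for i in range(25000)]
train_lines, validation_lines = split_train_validation(lines)
print(len(train_lines), len(validation_lines))  # 20000 5000
```

Holding out data this way means the autotuned model trains on fewer reviews than the manually tuned models above.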


## Beating the baseline

###### GloVe

In [120]:
glove = EmbeddingTransformer('glove')
x_train = glove.transform(x_train)

In [121]:
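from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Fit a logistic regression on the GloVe features
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(x_train, y_train)

# Embed the test reviews the same way, then evaluate
x_test = glove.transform(x_test)
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))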

              precision    recall  f1-score   support

           0       0.74      0.77      0.76     12500
           1       0.76      0.73      0.75     12500

    accuracy                           0.75     25000
   macro avg       0.75      0.75      0.75     25000
weighted avg       0.75      0.75      0.75     25000
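The internals of `EmbeddingTransformer` are not shown in this notebook. A common way to turn word vectors into a fixed-size document vector for logistic regression is to average the vectors of the words in the text, and the sketch below assumes that approach. The two-dimensional toy embeddings and the function name are hypothetical; real GloVe vectors have tens or hundreds of dimensions.

```python
def average_embedding(text, embeddings, dim):
    """Average the vectors of known words; return zeros if none are known."""
    vectors = [embeddings[tok] for tok in text.lower().split() if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]


toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "bad": [-1.0, 0.0]}

print(average_embedding("Good movie", toy_embeddings, dim=2))          # [0.5, 0.5]
print(average_embedding("unknown words only", toy_embeddings, dim=2))  # [0.0, 0.0]
```

Averaging discards word order entirely, which may partly explain why this approach scores lower than the fastText models above.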

