## Objective

Transform the data into FastText format and learn the model. In order to do that, we have to download the AG NEWS data, iterate over its data instances and export them as a TXT file.

**Note:** we have not removed special characters from the data. It should be done to improve the performance of FastText

### 1 - Load the raw data from TorchText and generate train/validation/test datasets

In [1]:
from torchtext.datasets import AG_NEWS
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

train_iter, test_iter = AG_NEWS()

train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])

train_dataset = split_train_
validation_dataset = split_valid_

### 2 - Put data in FastText format

In [2]:
def fromPyTorchToFastext(dataset):
    string_list = []
    label = None
    text = None
    s = None
    for instance in dataset:
        label = instance[0]
        text = instance[1]
        s = "__label__" + str(label) + " " + text
        string_list.append(s)
    return string_list

def writeFasttextToFile(string_list, filePath):
    with open(filePath, "w") as file:
        for s in string_list:
            file.write(s+"\n")

validation_dataset_fasttext = fromPyTorchToFastext(validation_dataset)
writeFasttextToFile(validation_dataset_fasttext, "fasttext_data/ag_news_validation.txt")

train_dataset_fasttext = fromPyTorchToFastext(validation_dataset)
writeFasttextToFile(validation_dataset_fasttext, "fasttext_data/ag_news_train.txt")

test_dataset_fasttext = fromPyTorchToFastext(test_dataset)
writeFasttextToFile(validation_dataset_fasttext, "fasttext_data/ag_news_test.txt")  

### 3 - Training

In [3]:
import fasttext

model = fasttext.train_supervised(input="fasttext_data/ag_news_train.txt", epoch=5)

Read 0M words
Number of words:  35357
Number of labels: 4
Progress: 100.0% words/sec/thread: 1678454 lr:  0.000000 avg.loss:  1.172119 ETA:   0h 0m 0s


#### 3.1 - autotune

FastText provides the option to automatically search for the best set of hyperparameter (need to research a littlebit more on which of the whole set of hyperparameters are considered, e.g., if it considers epoch or the size word n-grams).

Once the model has finished learning, we can extract the best set of hyperparameters by accessing the resulting object

In [7]:
model = fasttext.train_supervised(
    input="fasttext_data/ag_news_train.txt", 
    autotuneValidationFile="fasttext_data/ag_news_validation.txt", 
    autotuneDuration=10) # seconds

Progress: 100.0% Trials:    5 Best score:  0.999667 ETA:   0h 0m 0s
Training again with best arguments
Read 0M words
Number of words:  35357
Number of labels: 4
Progress: 100.0% words/sec/thread: 1777062 lr:  0.000000 avg.loss:  0.297760 ETA:   0h 0m 0s


In [8]:
args_obj = model.f.getArgs()
for hparam in dir(args_obj):
    if not hparam.startswith('__'):
        print(f"{hparam} -> {getattr(args_obj, hparam)}")

autotuneDuration -> 10
autotuneMetric -> f1
autotuneModelSize -> 
autotunePredictions -> 1
autotuneValidationFile -> fasttext_data/ag_news_validation.txt
bucket -> 0
cutoff -> 0
dim -> 60
dsub -> 2
epoch -> 45
input -> fasttext_data/ag_news_train.txt
label -> __label__
loss -> loss_name.softmax
lr -> 0.04552101402547641
lrUpdateRate -> 100
maxn -> 0
minCount -> 1
minCountLabel -> 0
minn -> 0
model -> model_name.supervised
neg -> 5
output -> 
pretrainedVectors -> 
qnorm -> False
qout -> False
retrain -> False
saveOutput -> False
seed -> 0
setManual -> <bound method PyCapsule.setManual of <fasttext_pybind.args object at 0x7fc21d2b63b0>>
t -> 0.0001
thread -> 7
verbose -> 2
wordNgrams -> 1
ws -> 5


### 4 - Evaluation

In [9]:
model.test("fasttext_data/ag_news_test.txt")

(6000, 0.9996666666666667, 0.9996666666666667)