# FastText
<h4 style="font-size:14px; font-family:Calibry" align="left"> Andrii Kruchko </h4>

<hr style="height: 1px; background-color: #808080">
## Table of Contents

1. FastText overview
2. Data preprocessing for fastText
3. The model training and parameters overview

<hr style="height: 1px; background-color: #808080">
## 1. FastText overview
### What is it?

FastText is a linear model with a rank constraint and a fast loss approximation.<br>
It achieves accuracy comparable to deep learning classifiers.<br>

But it is much faster:
- FastText can train on more than 200M words in less than five minutes using a standard multicore CPU
- It can classify nearly 150K reviews in less than a minute

<hr style="height: 1px; width: 100px; background-color: #808080" align="left">

### Architecture

<img src="https://raw.githubusercontent.com/akruchko/test/master/1_model_architecture_of_fastText.PNG">
The model architecture of fastText for a sentence with $N$ n-gram features $x_1, \ldots, x_N$.<br> The features are embedded and averaged to form the hidden variable.$^1$

<hr style="height: 1px; width: 100px; background-color: #808080" align="left">
$^1$ https://arxiv.org/pdf/1607.01759.pdf

### Algorithm

FastText uses the softmax function $f$ to compute the probability distribution over the predefined classes. For a set of $N$ documents, this leads to minimizing the negative log-likelihood over the classes:

\begin{align}
-\frac{1}{N} \sum_{n=1}^N y_n \log(f(BAx_n))
\end{align}
$x_n$ - the normalized bag of features of the n-th document, <br>
$y_n$ - the label, <br>
$A, B$ - weight matrices

Optimization is performed using stochastic gradient descent with a linearly decaying learning rate.
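
The objective above can be sketched in plain Python. This is a minimal illustration, not the fastText implementation: the helper names (`matvec`, `softmax`, `nll_loss`, `linear_decay_lr`) are hypothetical, the label $y_n$ is passed as a class index rather than a one-hot vector, and the schedule `lr0 * (1 - progress)` is an assumed form of "linearly decaying".

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax of a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(X, labels, A, B):
    """Mean negative log-likelihood: -1/N * sum_n log f(B A x_n)[y_n]."""
    total = 0.0
    for x, y in zip(X, labels):
        hidden = matvec(A, x)               # averaged feature embeddings
        probs = softmax(matvec(B, hidden))  # distribution over classes
        total -= math.log(probs[y])         # one-hot y_n picks the true class
    return total / len(X)

def linear_decay_lr(lr0, progress):
    """Learning rate decaying linearly from lr0 to 0 as progress goes 0 -> 1."""
    return lr0 * (1.0 - progress)

# Toy data: 2 documents, 4 features, hidden size 3, 2 classes.
X = [[0.5, 0.5, 0.0, 0.0],   # normalized bag of features of document 1
     [0.0, 0.0, 0.5, 0.5]]   # normalized bag of features of document 2
labels = [0, 1]
A = [[0.2, 0.1, -0.1, -0.2],
     [0.0, 0.3, -0.3, 0.0],
     [0.1, 0.1, 0.1, 0.1]]
B = [[1.0, 1.0, 0.0],
     [-1.0, -1.0, 0.0]]
loss = nll_loss(X, labels, A, B)
```

With all-zero weights the softmax is uniform, so the loss equals $\log K$ for $K$ classes; training should push it below that value.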

<hr style="height: 1px; background-color: #808080">
## 2. Data preprocessing for fastText
- remove non-printable characters
- rewrite contractions such as `n't`, `'re`, `'s` and similar cases
- remove punctuation and digits
- apply Porter stemming

In [1]:
import pandas as pd

from string import punctuation, digits
from nltk.stem import PorterStemmer

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from fasttext import supervised, load_model

In [2]:
train = 'data/movie_reviews.csv'
test = 'data/test.csv'

In [3]:
# the helper function for reading the test data
import os
BASE_DIR = ''
TEXT_DATA_DIR = BASE_DIR + 'test/'
TEXT_DATA_FILE_1 = "rt-polarity_neg.txt"
TEXT_DATA_FILE_2 = "rt-polarity_pos.txt"
HEADER = True

def load_data():
    x = []
    y = []
    for i in [TEXT_DATA_FILE_1, TEXT_DATA_FILE_2]:
        with open(os.path.join(TEXT_DATA_DIR, i), "r", encoding='utf-8', errors='ignore') as f:
            if HEADER:
                _ = next(f)
            # label 1 for the positive file, 0 for the negative one
            temp_y = 1 if i[-7:-4] == "pos" else 0
            for line in f:
                x.append(line.rstrip("\n"))
                y.append(temp_y)

    return x, y

In [4]:
df_train = pd.read_csv(train).sample(n=80000, replace=False, random_state=42)
df_train = df_train.reset_index(drop=True)  # drop the shuffled index instead of keeping it as a column
df_test, labels_test = load_data()
df_test = pd.DataFrame({'label': labels_test, 'text': df_test})

In [5]:
# prepare punctuation and digits list for removal
translator = str.maketrans('', '', punctuation + digits)

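# basic preprocessing: strip control characters, rewrite contractions,
# then remove punctuation and digits and lowercase the text
def clean_data(df, col):
    # remove newlines, carriage returns and tabs
    df['clean_text'] = df[col].str.replace('\n', '').str.replace('\r', '').str.replace('\t', '')
    # rewrite common contractions before the apostrophes are stripped
    df.clean_text = df.clean_text.str.replace("n't", " not").str.replace("'re", " are").str.replace("'s", " s")
    df.clean_text = df.clean_text.str.replace("'ve", " have").str.replace("'ll", " will").str.replace("'d", " d")
    # remove punctuation and digits, trim whitespace, lowercase
    df.clean_text = df.clean_text.str.translate(translator).str.strip().str.lower()
    return df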

In [6]:
df_train = clean_data(df_train, 'text')
df_test = clean_data(df_test, 'text')

In [7]:
%%time
# Porter stemming
stemmer = PorterStemmer()
df_train['porter_text'] = df_train['clean_text'].apply(lambda x: ' '.join([stemmer.stem(w) for w in x.split()]))
df_train = df_train[df_train.porter_text.apply(len) != 0]
df_test['porter_text'] = df_test['clean_text'].apply(lambda x: ' '.join([stemmer.stem(w) for w in x.split()]))

CPU times: user 1min 19s, sys: 48 ms, total: 1min 19s
Wall time: 1min 19s


In [8]:
# split into training and validation sets
df_train2, df_val = train_test_split(df_train[['label', 'porter_text']], test_size=0.2, random_state=42)

In [9]:
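# fastText is trained from a text file, so every line must start with a label marker.
# The default prefix is `__label__`, but a custom one can be used.
df_train2 = df_train2.copy()  # avoid a possible SettingWithCopyWarning when adding a column
df_train2['ft_label'] = df_train2['label'].apply(lambda x: '__label__1 ' if x == 1 else '__label__0 ')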
df_train2[['ft_label', 'porter_text']].to_csv('data/train_fastText.csv', index=False, header=False)
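
The `to_csv` call above joins the label and the text with a comma, so each line looks like `__label__1 ,great movi` and the first word of every training document carries a leading comma. A minimal sketch of an alternative writer, assuming one `<prefix><label> <text>` line per document. The helper names `to_fasttext_lines` and `write_fasttext_file` are hypothetical and not part of fastText:

```python
def to_fasttext_lines(labels, texts, prefix="__label__"):
    """Yield one training line per document: '<prefix><label> <text>'."""
    for label, text in zip(labels, texts):
        yield f"{prefix}{label} {text}"

def write_fasttext_file(path, labels, texts, prefix="__label__"):
    """Write the training lines to a UTF-8 text file, one document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in to_fasttext_lines(labels, texts, prefix):
            f.write(line + "\n")

# Two already-stemmed toy documents
example = list(to_fasttext_lines([1, 0], ["great movi", "bore plot"]))
```

Retraining on a file written this way may change the scores reported in the cells that follow slightly, since the tokens seen during training differ.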

### Observations and another preprocessing variant

What did not help:
- Removing the default stop words made the model performance worse. The list should be revised or created from scratch.
- Removing entities hurt the model's results.

spaCy lemmatization is less aggressive than the Porter stemmer.

<hr style="height: 1px; background-color: #808080">
## 3. The model training and parameters overview

### The model training

In [11]:
%%time
# Let's try to train the model with default parameters
clf = supervised('data/train_fastText.csv', 'data/fastText_porter', label_prefix='__label__')

CPU times: user 12 s, sys: 372 ms, total: 12.3 s
Wall time: 6.94 s


In [12]:
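def get_score(df, clf, label='label', text='porter_text', model_name='data/fastText_porter'):
    # model_name is not used inside the function
    # predict_proba is expected to return, for each text, a list of (label, probability) pairs;
    # keep the top label and convert it to int
    prediction = clf.predict_proba(list(df[text]))
    prediction = [int(item[0][0]) for item in prediction]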

    print('Accuracy:', round(accuracy_score(list(df[label]), prediction), 4), 
          'F1:', round(f1_score(list(df[label]), prediction), 4))

In [13]:
# The training scores are noticeably higher than the validation scores, so the model overfits
get_score(df_train2, clf)
get_score(df_val, clf)

Accuracy: 0.8546 F1: 0.8794
Accuracy: 0.8056 F1: 0.8378


### Parameters overview

- `lr` - learning rate. Default: **0.1**.
- `dim` - size of the word vectors (the hidden layer). Default: **100**. Use a smaller value for small datasets and a small number of labels.
- `epoch` - number of epochs. Default: **5**. Use more epochs with small learning rates.
- `min_count` - minimum number of word occurrences. Default: **1**. Set it to 5 or higher to reduce overfitting.
- `word_ngrams` - maximum length of word n-grams. Default: **1**. Higher-order n-grams lead to overfitting on small datasets. If the value is greater than 1, the learning rate and the number of epochs should be revisited.
- `bucket` - number of hash buckets. Default: **2000000**. The developers recommend lower values for small datasets (e.g. 100K).
- `minn` - minimum length of character n-grams. Default: **0**.
- `maxn` - maximum length of character n-grams. Default: **0**.
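
To make `word_ngrams`, `minn`/`maxn` and `bucket` more concrete, here is a simplified sketch of how n-gram features can be generated and mapped to buckets. It is an illustration under assumptions, not fastText's code: the function names are hypothetical, and `zlib.crc32` stands in for fastText's own hash function.

```python
import zlib

def word_ngrams(tokens, max_n):
    """All word n-grams of length 1..max_n (the role of `word_ngrams`)."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def char_ngrams(word, minn, maxn):
    """Character n-grams of length minn..maxn, with '<' and '>' marking word boundaries."""
    if minn <= 0 or maxn < minn:
        return []  # minn = maxn = 0 disables character n-grams
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(marked) - n + 1)]

def bucket_index(ngram, bucket):
    """Map an n-gram to one of `bucket` slots (crc32 is a stand-in hash)."""
    return zlib.crc32(ngram.encode("utf-8")) % bucket

bigrams = word_ngrams(["not", "good", "film"], 2)
trigrams = char_ngrams("where", 3, 3)
slot = bucket_index("not good", 100_000)
```

Because different n-grams can land in the same bucket, a smaller `bucket` value reduces the number of parameters at the cost of more collisions.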

In [14]:
%%time
# let's train the model with a higher minimum number of word occurrences
clf = supervised('data/train_fastText.csv', 'data/fastText_porter', label_prefix='__label__', 
                 min_count=5)

CPU times: user 10.7 s, sys: 284 ms, total: 11 s
Wall time: 6.07 s


In [15]:
# the gap between training and validation scores narrowed slightly, so the model overfits a bit less
get_score(df_train2, clf)
get_score(df_val, clf)

Accuracy: 0.8474 F1: 0.874
Accuracy: 0.8051 F1: 0.8377


In [16]:
# Get score on the test data
get_score(df_test, clf)

Accuracy: 0.7693 F1: 0.7896


### Small exercise

Try to improve the previous results using different parameter values.

In [None]:
%%time
# let's try to improve the results
clf = supervised('data/train_fastText.csv', 'data/fastText_porter', label_prefix='__label__', 
                # your code here 
                )

In [None]:
get_score(df_train2, clf)
get_score(df_val, clf)

In [None]:
get_score(df_test, clf)
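
One way to organize the exercise is a small grid search over parameter combinations. The sketch below is a generic helper under assumptions: `param_grid`, `search` and `score_fn` are hypothetical names, and `score_fn` is a placeholder for a function that would train a model with the given parameters and return a validation score.

```python
from itertools import product

def param_grid(grid):
    """Yield every combination of the parameter values as a dict."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def search(grid, score_fn):
    """Return the (params, score) pair with the highest score."""
    best_params, best_score = None, float("-inf")
    for params in param_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Candidate values for three of the parameters described above
grid = {"lr": [0.05, 0.1], "epoch": [5, 10], "word_ngrams": [1, 2]}
combinations = list(param_grid(grid))
```

In this notebook, `score_fn` could pass the parameters to `supervised(...)` and score the result on `df_val`. Note that `get_score` prints its metrics rather than returning them, so a variant that returns the accuracy would be needed. Selecting parameters on the validation set keeps the test set for the final check only.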

## Conclusions

- FastText is really fast, both in training and in prediction.
- It was developed mainly for large datasets (e.g. 1 billion words). For small datasets, either tune the hyperparameters carefully to avoid overfitting or get more data.