### fastText

A method to deal with out of vocabulary (OOV). And it is extremely fast at learning on large corpora. 

FastText is based on the idea of enriching word embeddings with subword-level information. Thus, the embedding representation for each word is represented as a sum of the representations of individual **character** n-grams.

FastText is a text classification model. 

We’ll work with the DBpedia dataset containined 14 classes of 560,000 training examples and 70,000 texting examples. 

In [8]:
import os
import pandas as pd
import wget
import tarfile
from fasttext import train_supervised 

### Download & load the data

In [9]:
data_path = 'DATAPATH'
train_file = data_path + '/dbpedia_csv/train.csv'
test_file = data_path + '/dbpedia_csv/test.csv'
if not os.path.exists(train_file):
    !wget -P DATAPATH https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz
    !tar -xvf DATAPATH/dbpedia_csv.tar.gz -C DATAPATH
    
df = pd.read_csv(train_file, header=None, names=['class','name','description'])
df_test = pd.read_csv(test_file, header=None, names=['class','name','description'])
print("Train:{} Test:{}".format(df.shape,df_test.shape))

Train:(560000, 3) Test:(70000, 3)


In [25]:
# Mapping from class number to class name
class_dict={
            1:'Company',
            2:'EducationalInstitution',
            3:'Artist',
            4:'Athlete',
            5:'OfficeHolder',
            6:'MeanOfTransportation',
            7:'Building',
            8:'NaturalPlace',
            9:'Village',
            10:'Animal',
            11:'Plant',
            12:'Album',
            13:'Film',
            14:'WrittenWork'}
# Mapping the classes
df['class_name'] = df['class'].map(class_dict)
df.sample(5, random_state=42)

Unnamed: 0,class,name,description,class_name
34566,1,Sterling Piano Company,The Sterling Piano Company was a piano manufa...,Company
223092,6,NYC S-Motor,S-Motor was the class designation given by th...,MeanOfTransportation
110270,3,Axel Zwingenberger,Axel Zwingenberger (born May 7 1955 Hamburg G...,Artist
365013,10,Sceptrophasma hispidulum,Sceptrophasma hispidulum commonly known as th...,Animal
311625,8,Nucet River (Chiojdeanca),The Nucet River is a tributary of the Chiojde...,NaturalPlace


### Text Cleaning

In [30]:
def clean_it(text,normalize=True):
    # Replacing possible issues with data. We can add or reduce the replacemtent in this chain
    s = str(text).replace(',',' ').replace('"','').replace('\'',' \' ').replace('.',' . ').replace('(',' ( ').\
            replace(')',' ) ').replace('!',' ! ').replace('?',' ? ').replace(':',' ').replace(';',' ').lower()
    # normalizing / encoding the text
    if normalize:
        s = s.normalize('NFKD').str.encode('ascii','ignore').str.decode('utf-8')
    return s

# Now lets define a small function where we can use above cleaning on datasets
def clean_df(data, cleanit= False, shuffleit=False, encodeit=False, label_prefix='__class__'):
    # Defining the new data
    df = data[['name','description']].copy(deep=True)
    df['class'] = label_prefix + data['class'].astype(str) + ' '
    # cleaning it
    if cleanit:
        df['name'] = df['name'].apply(lambda x: clean_it(x,encodeit))
        df['description'] = df['description'].apply(lambda x: clean_it(x,encodeit))
    # shuffling it
    if shuffleit:
        df.sample(frac=1).reset_index(drop=True)
    return df

# Transform the datasets using the above clean functions
df_train_cleaned = clean_df(df, True, True)
df_test_cleaned = clean_df(df_test, True, True)

# Write files to disk as fastText classifier API reads files from disk.
train_file = data_path + '/dbpedia_train.csv'
df_train_cleaned.to_csv(train_file, header=None, index=False, columns=['class','name','description'] )
test_file = data_path + '/dbpedia_test.csv'
df_test_cleaned.to_csv(test_file, header=None, index=False, columns=['class','name','description'] )

All the labels start by the `__label__` prefix (in this example we use `__class__`), which is how fastText recognize what is a label or what is a word.

In [31]:
df_train_cleaned.sample(5, random_state=42)

Unnamed: 0,name,description,class
34566,sterling piano company,the sterling piano company was a piano manufa...,__class__1
223092,nyc s-motor,s-motor was the class designation given by th...,__class__6
110270,axel zwingenberger,axel zwingenberger ( born may 7 1955 hamburg...,__class__3
365013,sceptrophasma hispidulum,sceptrophasma hispidulum commonly known as th...,__class__10
311625,nucet river ( chiojdeanca ),the nucet river is a tributary of the chiojde...,__class__8


### Model training
Notice that we gave the classifier raw text and not the feature vector.

In [32]:
%%time
model = train_supervised(input=train_file, label="__class__", 
                         lr=1.0, epoch=75, loss='ova', wordNgrams=2, 
                         dim=200, thread=2, verbose=100)

Read 31M words
Number of words:  1116962
Number of labels: 14
Progress: 100.0% words/sec/thread:  598277 lr:  0.000078 avg.loss:  0.003162 ETA:   0h 0m 0s33m59s avg.loss:  0.056193 ETA:   0h34m17s 580394 lr:  0.985471 avg.loss:  0.056543 ETA:   0h33m45s 0.985384 avg.loss:  0.056265 ETA:   0h33m46s 0.984468 avg.loss:  0.053297 ETA:   0h33m46s lr:  0.984209 avg.loss:  0.052462 ETA:   0h33m44s avg.loss:  0.051262 ETA:   0h33m45s  1.7% words/sec/thread:  580071 lr:  0.983177 avg.loss:  0.049215 ETA:   0h33m41s  0h33m38s 0.982795 avg.loss:  0.047964 ETA:   0h33m36s  0h33m33s 0.980356 avg.loss:  0.043070 ETA:   0h33m32s% words/sec/thread:  580492 lr:  0.979637 avg.loss:  0.041645 ETA:   0h33m32s  2.1% words/sec/thread:  580861 lr:  0.979280 avg.loss:  0.041615 ETA:   0h33m30s% words/sec/thread:  581498 lr:  0.978668 avg.loss:  0.040631 ETA:   0h33m27s 0.978468 avg.loss:  0.040468 ETA:   0h33m26s avg.loss:  0.040320 ETA:   0h33m26s% words/sec/thread:  580965 lr:  0.978146 avg.loss:  0.040000 

CPU times: user 1h 6min 16s, sys: 55.9 s, total: 1h 7min 12s
Wall time: 33min 37s


Progress: 100.0% words/sec/thread:  598276 lr:  0.000029 avg.loss:  0.003170 ETA:   0h 0m 0sProgress: 100.0% words/sec/thread:  598263 lr: -0.000000 avg.loss:  0.003172 ETA:   0h 0m 0sProgress: 100.0% words/sec/thread:  598263 lr:  0.000000 avg.loss:  0.003172 ETA:   0h 0m 0s


Evaluate fast text model

In [33]:
for k in range(1,6):
    results = model.test(test_file,k=k)
    print(f"Test Samples: {results[0]} \
    Precision@{k} : {results[1]*100:2.4f} \
    Recall@{k} : {results[2]*100:2.4f}")

Test Samples: 70000     Precision@1 : 88.3500     Recall@1 : 88.3500
Test Samples: 70000     Precision@2 : 47.0121     Recall@2 : 94.0243
Test Samples: 70000     Precision@3 : 31.8224     Recall@3 : 95.4671
Test Samples: 70000     Precision@4 : 24.1404     Recall@4 : 96.5614
Test Samples: 70000     Precision@5 : 19.4271     Recall@5 : 97.1357


The only downside is that the trained model carries the entire character n-gram embeddings dictionary with it. This results in a bulky model and can result in engineering issues. 

However, fastText implementation also comes with options to reduce the memory footprint of its classification models with minimal reduction in classification performance. It does this by doing vocabulary pruning and using compression algorithms.

### Let's try (for fun) a logistic regression model with BoW embeddings on the same dataset to compare