In this notebook we will demonstrate using the fastText library to perform text classificatoin on the dbpedie data which can we downloaded from [here](https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz). <br>fastText is a library for learning of word embeddings and text classification created by Facebook's AI Research (FAIR) lab. The model allows to create an unsupervised learning or supervised learning algorithm for obtaining vector representations for words. Facebook makes available pretrained models for 294 languages(source: [wiki](https://en.wikipedia.org/wiki/FastText)).<br>
**Note**: This notebook uses an older version of fasttext.

In [None]:
!pip install fasttext

Collecting fasttext
[?25l  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
[K     |████████████████████████████████| 71kB 1.8MB/s 
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... [?25l[?25hdone
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3016703 sha256=7168ce45f9e56adbf64adb4f947699bc4f77b9a2941416c8e75293b267507f92
  Stored in directory: /root/.cache/pip/wheels/98/ba/7f/b154944a1cf5a8cee91c154b75231136cc3a3321ab0e30f592
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.2


In [85]:
#necessary imports
import numpy as np
import pandas as pd
from time import time
from fasttext import supervised 

In [None]:
!wget -c "https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz"

--2020-08-01 10:14:33--  https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/srhrshr/torchDatasets/raw/master/dbpedia_csv.tar.gz [following]
--2020-08-01 10:14:33--  https://github.com/srhrshr/torchDatasets/raw/master/dbpedia_csv.tar.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/srhrshr/torchDatasets/master/dbpedia_csv.tar.gz [following]
--2020-08-01 10:14:33--  https://raw.githubusercontent.com/srhrshr/torchDatasets/master/dbpedia_csv.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connecte

In [86]:
import tarfile
tar = tarfile.open("/content/dbpedia_csv.tar.gz")
tar.extractall()
for member in tar.getmembers():
    print("Extracting %s" % member.name)
    tar.extract(member, #path='/home/connor/'
    )

Extracting dbpedia_csv
Extracting dbpedia_csv/test.csv
Extracting dbpedia_csv/classes.txt
Extracting dbpedia_csv/train.csv
Extracting dbpedia_csv/readme.txt


In [None]:

# Loading train data
train_file = '/content/dbpedia_csv/train.csv'
df = pd.read_csv(train_file, header=None, names=['class','name','description'])
# Loading test data
test_file = '/content/dbpedia_csv/test.csv'
df_test = pd.read_csv(test_file, header=None, names=['class','name','description'])
# Data we have
print("Train:{} Test:{}".format(df.shape,df_test.shape))


Train:(560000, 3) Test:(70000, 3)


The remaining
part of this subsection shows how to use the fastText classifier [17] for text
classification. We’ll work with the DBpedia dataset [18]. It’s a balanced dataset
consisting of 14 classes, with 40,000 training and 5,000 testing examples per class.
Thus, the total size of the dataset is 560,000 training and 70,000 testing data points.
Clearly, this is a much larger dataset than what we saw before. Can we build a fast
training model using fastText? Let’s check it out!

In [87]:
df

Unnamed: 0,class,name,description,class_name
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a Br...,Company
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for ...,Company
2,1,Q-workshop,Q-workshop is a Polish company located in Poz...,Company
3,1,Marvell Software Solutions Israel,Marvell Software Solutions Israel known as RA...,Company
4,1,Bergan Mercy Medical Center,Bergan Mercy Medical Center is a hospital loc...,Company
...,...,...,...,...
559995,14,Barking in Essex,Barking in Essex is a Black comedy play direc...,WrittenWork
559996,14,Science & Spirit,Science & Spirit is a discontinued American b...,WrittenWork
559997,14,The Blithedale Romance,The Blithedale Romance (1852) is Nathaniel Ha...,WrittenWork
559998,14,Razadarit Ayedawbon,Razadarit Ayedawbon (Burmese: ရာဇာဓိရာဇ် အရေး...,WrittenWork


In [88]:
df['class'].value_counts()

14    40000
13    40000
12    40000
11    40000
10    40000
9     40000
8     40000
7     40000
6     40000
5     40000
4     40000
3     40000
2     40000
1     40000
Name: class, dtype: int64

In [89]:
# Since we have no clue about the classes lets build one
# Mapping from class number to class name
class_dict={
            1:'Company',
            2:'EducationalInstitution',
            3:'Artist',
            4:'Athlete',
            5:'OfficeHolder',
            6:'MeanOfTransportation',
            7:'Building',
            8:'NaturalPlace',
            9:'Village',
            10:'Animal',
            11:'Plant',
            12:'Album',
            13:'Film',
            14:'WrittenWork'
        }

# Mapping the classes
df['class_name'] = df['class'].map(class_dict)
df.head()

Unnamed: 0,class,name,description,class_name
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a Br...,Company
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for ...,Company
2,1,Q-workshop,Q-workshop is a Polish company located in Poz...,Company
3,1,Marvell Software Solutions Israel,Marvell Software Solutions Israel known as RA...,Company
4,1,Bergan Mercy Medical Center,Bergan Mercy Medical Center is a hospital loc...,Company


In [90]:
df["class_name"].value_counts()

Animal                    40000
MeanOfTransportation      40000
NaturalPlace              40000
Building                  40000
Plant                     40000
WrittenWork               40000
Company                   40000
EducationalInstitution    40000
Album                     40000
Artist                    40000
OfficeHolder              40000
Film                      40000
Village                   40000
Athlete                   40000
Name: class_name, dtype: int64

In [91]:
# Lets do some cleaning of this text
def clean_it(text,normalize=True):
    # Replacing possible issues with data. We can add or reduce the replacemtent in this chain
    s = str(text).replace(',',' ').replace('"','').replace('\'',' \' ').replace('.',' . ').replace('(',' ( ').\
            replace(')',' ) ').replace('!',' ! ').replace('?',' ? ').replace(':',' ').replace(';',' ').lower()
    
    # normalizing / encoding the text
    if normalize:
        s = s.normalize('NFKD').str.encode('ascii','ignore').str.decode('utf-8')
    
    return s

# Now lets define a small function where we can use above cleaning on datasets
def clean_df(data, cleanit= False, shuffleit=False, encodeit=False, label_prefix='__class__'):
    # Defining the new data
    df = data[['name','description']].copy(deep=True)
    df['class'] = label_prefix + data['class'].astype(str) + ' '
    
    # cleaning it
    if cleanit:
        df['name'] = df['name'].apply(lambda x: clean_it(x,encodeit))
        df['description'] = df['description'].apply(lambda x: clean_it(x,encodeit))
    
    # shuffling it
    if shuffleit:
        df.sample(frac=1).reset_index(drop=True)
            
    return df

In [92]:
%%time
# Transform the datasets using the above clean functions
df_train_cleaned = clean_df(df, True, True)
df_test_cleaned = clean_df(df_test, True, True)

CPU times: user 5.04 s, sys: 573 ms, total: 5.61 s
Wall time: 5.62 s


In [93]:
df

Unnamed: 0,class,name,description,class_name
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a Br...,Company
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for ...,Company
2,1,Q-workshop,Q-workshop is a Polish company located in Poz...,Company
3,1,Marvell Software Solutions Israel,Marvell Software Solutions Israel known as RA...,Company
4,1,Bergan Mercy Medical Center,Bergan Mercy Medical Center is a hospital loc...,Company
...,...,...,...,...
559995,14,Barking in Essex,Barking in Essex is a Black comedy play direc...,WrittenWork
559996,14,Science & Spirit,Science & Spirit is a discontinued American b...,WrittenWork
559997,14,The Blithedale Romance,The Blithedale Romance (1852) is Nathaniel Ha...,WrittenWork
559998,14,Razadarit Ayedawbon,Razadarit Ayedawbon (Burmese: ရာဇာဓိရာဇ် အရေး...,WrittenWork


In [94]:
df_train_cleaned

Unnamed: 0,name,description,class
0,e . d . abbott ltd,abbott of farnham e d abbott limited was a br...,__class__1
1,schwan-stabilo,schwan-stabilo is a german maker of pens for ...,__class__1
2,q-workshop,q-workshop is a polish company located in poz...,__class__1
3,marvell software solutions israel,marvell software solutions israel known as ra...,__class__1
4,bergan mercy medical center,bergan mercy medical center is a hospital loc...,__class__1
...,...,...,...
559995,barking in essex,barking in essex is a black comedy play direc...,__class__14
559996,science & spirit,science & spirit is a discontinued american b...,__class__14
559997,the blithedale romance,the blithedale romance ( 1852 ) is nathanie...,__class__14
559998,razadarit ayedawbon,razadarit ayedawbon ( burmese ရာဇာဓိရာဇ် အရ...,__class__14


In [95]:
# Write files to disk as fastText classifier API reads files from disk.
train_file = 'dbpedia_train_clean.csv'
df_train_cleaned.to_csv(train_file, header=None, 
                        index=False, columns=['class','name','description'] )

test_file ='dbpedia_test_clean.csv'
df_test_cleaned.to_csv(test_file, header=None, 
                       index=False, columns=['class','name','description'] )


In [96]:
pd.read_csv('dbpedia_train_clean.csv').head()

Unnamed: 0,__class__1,e . d . abbott ltd,abbott of farnham e d abbott limited was a british coachbuilding business based in farnham surrey trading under that name from 1929 . a major part of their output was under sub-contract to motor vehicle manufacturers . their business closed in 1972 .
0,__class__1,schwan-stabilo,schwan-stabilo is a german maker of pens for ...
1,__class__1,q-workshop,q-workshop is a polish company located in poz...
2,__class__1,marvell software solutions israel,marvell software solutions israel known as ra...
3,__class__1,bergan mercy medical center,bergan mercy medical center is a hospital loc...
4,__class__1,the unsigned guide,the unsigned guide is an online contacts dire...


Now that we have the train and test files written into disk in a format fastText wants, we are ready to use it for text classification!

In [97]:
import fasttext
%time model = fasttext.train_supervised(input='dbpedia_train_clean.csv',label_prefix="__class__")
results=model.test('dbpedia_test_clean.csv')
results
 # run with no header

CPU times: user 1min 6s, sys: 452 ms, total: 1min 6s
Wall time: 1min 6s


(70000, 0.07142857142857142, 0.07142857142857142)

(70000, 0.07142857142857142, 0.07142857142857142)

Do chinh xac ban dau: 0.07%

In [None]:
model.predict("schwan-stabilo is a german maker of pens for ")

(('__class__14',), array([0.96447182]))

Imporve performance: add learning rate and epoch

In [None]:
%time model = fasttext.train_supervised(input='dbpedia_train_clean.csv',\
                                  lr=1.0, epoch=25,label_prefix="__class__")
results=model.test('dbpedia_test_clean.csv')
results

CPU times: user 5min 38s, sys: 2.06 s, total: 5min 40s
Wall time: 5min 40s


(70000, 0.2650142857142857, 0.2650142857142857)

Tang do chinh xac tu 1% len 26%, thoi gian training tang len nhieu lan

In [None]:
%time model = fasttext.train_supervised(input='dbpedia_train_clean.csv',\
                                  lr=1.0, epoch=25,wordNgrams=2,\
                                  label_prefix="__class__")
results=model.test('dbpedia_test_clean.csv')
results

CPU times: user 11min 57s, sys: 3.25 s, total: 12min
Wall time: 12min 1s


(70000, 0.4786285714285714, 0.4786285714285714)

Do chinh xac tang tu 26% len 47%

Tim cach cai thien toc do train bang ham softmax (loss= 'hs')

In [98]:
%time model = fasttext.train_supervised(input='dbpedia_train_clean.csv',\
                                        lr=1.0, epoch=25,wordNgrams=2,\
                                        bucket=200000, dim=50, loss='hs',\
                                        label_prefix="__class__")
results=model.test('dbpedia_test_clean.csv')
results
 # run with no header

CPU times: user 5min 54s, sys: 2.42 s, total: 5min 57s
Wall time: 5min 57s


(70000, 0.41482857142857144, 0.41482857142857144)

Toc do nhanh hon gap doi, tu 12 phut xuong 6 phut, do chinh xac giam 4%

In [None]:
temp = fasttext.load_model("temp")

In [None]:
## Using fastText for feature extraction and training
from fasttext import supervised 
"""fastText expects and training file (csv), a model name as input arguments.
label_prefix refers to the prefix before label string in the dataset.
default is __label__. In our dataset, it is __class__. 
There are several other parameters which can be seen in: 
https://pypi.org/project/fasttext/
"""
%time model = supervised(train_file, 'temp', label_prefix="__class__")
results = model.test(test_file)
print(results.nexamples, results.precision, results.recall)

CPU times: user 1min 11s, sys: 908 ms, total: 1min 12s
Wall time: 1min 12s


AttributeError: ignored

Model 'temp' cho do chinh xac la 97% voi chi vai giay training, hien tai khong co o day


CPU times: user 56.5 s, sys: 1.51 s, total: 58 s

Wall time: 12.6 s

70000 | 0.9710571428571428 | 0.9710571428571428

Try training a classifier on this dataset with, say, LogisticRegression to realize how fast fastText is! 97% Precision and Recall are hard numbers to beat, too!

# Fast text example

https://fasttext.cc/docs/en/supervised-tutorial.html

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
!head cooking.stackexchange.txt

--2020-08-01 11:50:41--  https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 457609 (447K) [application/x-tar]
Saving to: ‘cooking.stackexchange.tar.gz’


2020-08-01 11:50:42 (1.21 MB/s) - ‘cooking.stackexchange.tar.gz’ saved [457609/457609]

cooking.stackexchange.id
cooking.stackexchange.txt
readme.txt
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife ski

Before training our first classifier, we need to split the data into train and validation. We will use the validation set to evaluate how good the learned classifier is on new data.

In [None]:
!wc cooking.stackexchange.txt

  15404  169582 1401900 cooking.stackexchange.txt


Our full dataset contains 15404 examples. Let's split it into a training set of 12404 examples and a validation set of 3000 examples:

In [None]:
!head -n 12404 cooking.stackexchange.txt > cooking.train

In [None]:
!tail -n 3000 cooking.stackexchange.txt > cooking.valid

## Our first classifier

In [None]:
import fasttext
model = fasttext.train_supervised(input="cooking.train")

In [None]:
model.predict("Which baking dish is best to bake a banana bread ?")

(('__label__baking',), array([0.07257967]))

In [None]:
model.predict("Why not put knives in the dishwasher?")

(('__label__food-safety',), array([0.07451777]))

In [None]:
model.test("cooking.valid")

(3000, 0.135, 0.05838258613233386)

The output are the number of samples (here 3000), the precision at one (0.124) and the recall at one (0.0541).

We can also compute the precision at five and recall at five with:

In [None]:
model.test("cooking.valid", k=5)

(3000, 0.06606666666666666, 0.14285714285714285)

In [None]:
model.predict("Why not put knives in the dishwasher?", k=5)

(('__label__food-safety',
  '__label__baking',
  '__label__bread',
  '__label__substitutions',
  '__label__equipment'),
 array([0.07451777, 0.07366108, 0.04390582, 0.0373    , 0.03408055]))

## Making the model better

preprocessing the data

In [None]:
!cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
!head -n 12404 cooking.preprocessed.txt > cooking.train
!tail -n 3000 cooking.preprocessed.txt > cooking.valid

In [None]:
import fasttext
model = fasttext.train_supervised(input="cooking.train")

In [None]:
model.test("cooking.valid")

(3000, 0.16433333333333333, 0.07106818509442121)

We observe that thanks to the pre-processing, the vocabulary is smaller (from 14k words to 9k). The precision is also starting to go up by 4%!

##more epochs and larger learning rate

By default, fastText sees each training example only five times during training, which is pretty small, given that our training set only have 12k training examples. The number of times each examples is seen (also known as the number of epochs), can be increased using the -epoch option:

In [None]:
import fasttext
model = fasttext.train_supervised(input="cooking.train", epoch=25)

In [None]:
model.test("cooking.valid") # precision increase 35%

(3000, 0.52, 0.22488107250973044)

This is much better! Another way to change the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would mean that the model does not change at all, and thus, does not learn anything. Good values of the learning rate are in the range 0.1 - 1.0.

In [None]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0)
model.test("cooking.valid")

(3000, 0.5693333333333334, 0.2462159434914228)

Even better! Let's try both together:

In [None]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25)
model.test("cooking.valid")

(3000, 0.5843333333333334, 0.25270289750612657)

Let us now add a few more features to improve even further our performance!

## word n-grams

Finally, we can improve the performance of a model by using word bigrams, instead of just unigrams. This is especially important for classification problems where word order is important, such as sentiment analysis.

In [None]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2)
model.test("cooking.valid")

(3000, 0.5996666666666667, 0.2593340060544904)

With a few steps, we were able to go from a precision at one of 12.4% to 59.9%. Important steps included:

- preprocessing the data ;
- changing the number of epochs (using the option -epoch, standard range [5 - 50]) ;
- changing the learning rate (using the option -lr, standard range [0.1 - 1.0]) ;
- using word n-grams (using the option -wordNgrams, standard range [1 - 5]).

## Advanced readers: What is a Bigram?

A 'unigram' refers to a single undividing unit, or token, usually used as an input to a model. For example a unigram can be a word or a letter depending on the model. In fastText, we work at the word level and thus unigrams are words.

Similarly we denote by 'bigram' the concatenation of 2 consecutive tokens or words. Similarly we often talk about n-gram to refer to the concatenation any n consecutive tokens.

For example, in the sentence, 'Last donut of the night', the unigrams are 'last', 'donut', 'of', 'the' and 'night'. The bigrams are: 'Last donut', 'donut of', 'of the' and 'the night'.

Bigrams are particularly interesting because, for most sentences, you can reconstruct the order of the words just by looking at a bag of n-grams.

Let us illustrate this by a simple exercise, given the following bigrams, try to reconstruct the original sentence: 'all out', 'I am', 'of bubblegum', 'out of' and 'am all'. It is common to refer to a word as a unigram.

## Scaling things up

Since we are training our model on a few thousands of examples, the training only takes a few seconds. But training models on larger datasets, with more labels can start to be too slow. A potential solution to make the training faster is to use the hierarchical softmax, instead of the regular softmax. This can be done with the option -loss hs:

In [None]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, 
                                  wordNgrams=2, bucket=200000, dim=50, loss='hs')
model.test("cooking.valid")

(3000, 0.5806666666666667, 0.25111719763586565)

In [None]:
model.predict("Which baking dish is best to bake a banana bread ?")

(('__label__baking',), array([0.41760537]))

##Multi-label classification
When we want to assign a document to multiple labels, we can still use the softmax loss and play with the parameters for prediction, namely the number of labels to predict and the threshold for the predicted probability. However playing with these arguments can be tricky and unintuitive since the probabilities must sum to 1.

A convenient way to handle multiple labels is to use independent binary classifiers for each label. This can be done with -loss one-vs-all or -loss ova.

In [None]:
model = fasttext.train_supervised(input="cooking.train", lr=0.5, epoch=25, wordNgrams=2, 
                                  bucket=200000, dim=50, loss='ova')
model.test("cooking.valid")

(3000, 0.604, 0.2612080149920715)

In [None]:
model.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)

(('__label__baking',
  '__label__equipment',
  '__label__bread',
  '__label__bananas'),
 array([1.00001001, 0.97967768, 0.97632056, 0.8872146 ]))