In this notebook we will demonstrate using the fastText library to perform text classification on the DBpedia data, which can be downloaded from [here](https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz). <br>fastText is a library for learning word embeddings and for text classification, created by Facebook's AI Research (FAIR) lab. It supports both unsupervised and supervised learning of vector representations for words. Facebook makes pretrained models available for 294 languages (source: [wiki](https://en.wikipedia.org/wiki/FastText)).<br>
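**Note**: This notebook uses the `supervised` function from an older version of the fasttext Python package. Newer releases (0.9.x) provide `train_supervised` instead.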

In [2]:
!pip install fasttext

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
     |████████████████████████████████| 71kB 1.8MB/s
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3016703 sha256=7168ce45f9e56adbf64adb4f947699bc4f77b9a2941416c8e75293b267507f92
  Stored in directory: /root/.cache/pip/wheels/98/ba/7f/b154944a1cf5a8cee91c154b75231136cc3a3321ab0e30f592
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.2


In [1]:
#necessary imports
import numpy as np
import pandas as pd
from time import time
from fasttext import supervised 

In [2]:
!wget -c "https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz"

--2020-08-01 10:14:33--  https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/srhrshr/torchDatasets/raw/master/dbpedia_csv.tar.gz [following]
--2020-08-01 10:14:33--  https://github.com/srhrshr/torchDatasets/raw/master/dbpedia_csv.tar.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/srhrshr/torchDatasets/master/dbpedia_csv.tar.gz [following]
--2020-08-01 10:14:33--  https://raw.githubusercontent.com/srhrshr/torchDatasets/master/dbpedia_csv.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connecte

In [3]:
import tarfile

# Open the archive, list its members, then extract everything once
# into the current working directory.
with tarfile.open("/content/dbpedia_csv.tar.gz") as tar:
    for member in tar.getmembers():
        print("Extracting %s" % member.name)
    tar.extractall()

Extracting dbpedia_csv
Extracting dbpedia_csv/test.csv
Extracting dbpedia_csv/classes.txt
Extracting dbpedia_csv/train.csv
Extracting dbpedia_csv/readme.txt


In [4]:

# Loading train data
train_file = '/content/dbpedia_csv/train.csv'
df = pd.read_csv(train_file, header=None, names=['class','name','description'])
# Loading test data
test_file = '/content/dbpedia_csv/test.csv'
df_test = pd.read_csv(test_file, header=None, names=['class','name','description'])
# Data we have
print("Train:{} Test:{}".format(df.shape,df_test.shape))


Train:(560000, 3) Test:(70000, 3)


The remaining part of this subsection shows how to use the fastText classifier [17] for text classification. We’ll work with the DBpedia dataset [18]. It’s a balanced dataset consisting of 14 classes, with 40,000 training and 5,000 testing examples per class. Thus, the total size of the dataset is 560,000 training and 70,000 testing data points. Clearly, this is a much larger dataset than what we saw before. Can we train a model on it quickly using fastText? Let’s check it out!
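
As a quick sanity check of the sizes quoted above, the totals follow directly from the per-class counts. The snippet also prints the accuracy of a trivial classifier that always predicts the same class, which is a useful floor to keep in mind on a balanced dataset. The variable names are illustrative only.

```python
# Sanity-check the dataset sizes quoted in the text.
n_classes = 14
train_per_class, test_per_class = 40_000, 5_000

n_train = n_classes * train_per_class
n_test = n_classes * test_per_class
print(n_train, n_test)  # 560000 70000

# On a perfectly balanced dataset, always predicting one class scores 1/14.
chance_accuracy = 1 / n_classes
print(f"{chance_accuracy:.3f}")  # 0.071
```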

In [5]:
df

Unnamed: 0,class,name,description
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a Br...
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for ...
2,1,Q-workshop,Q-workshop is a Polish company located in Poz...
3,1,Marvell Software Solutions Israel,Marvell Software Solutions Israel known as RA...
4,1,Bergan Mercy Medical Center,Bergan Mercy Medical Center is a hospital loc...
...,...,...,...
559995,14,Barking in Essex,Barking in Essex is a Black comedy play direc...
559996,14,Science & Spirit,Science & Spirit is a discontinued American b...
559997,14,The Blithedale Romance,The Blithedale Romance (1852) is Nathaniel Ha...
559998,14,Razadarit Ayedawbon,Razadarit Ayedawbon (Burmese: ရာဇာဓိရာဇ် အရေး...


In [6]:
df['class'].value_counts()

14    40000
13    40000
12    40000
11    40000
10    40000
9     40000
8     40000
7     40000
6     40000
5     40000
4     40000
3     40000
2     40000
1     40000
Name: class, dtype: int64

In [7]:
# The labels are numeric ids, so let's build a lookup table
# mapping each class number to its class name
class_dict={
            1:'Company',
            2:'EducationalInstitution',
            3:'Artist',
            4:'Athlete',
            5:'OfficeHolder',
            6:'MeanOfTransportation',
            7:'Building',
            8:'NaturalPlace',
            9:'Village',
            10:'Animal',
            11:'Plant',
            12:'Album',
            13:'Film',
            14:'WrittenWork'
        }

# Mapping the classes
df['class_name'] = df['class'].map(class_dict)
df.head()

Unnamed: 0,class,name,description,class_name
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a Br...,Company
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for ...,Company
2,1,Q-workshop,Q-workshop is a Polish company located in Poz...,Company
3,1,Marvell Software Solutions Israel,Marvell Software Solutions Israel known as RA...,Company
4,1,Bergan Mercy Medical Center,Bergan Mercy Medical Center is a hospital loc...,Company
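
The mapping above is typed in by hand. Since the archive also ships a `classes.txt` file, the same dictionary could be built from it instead. The sketch below assumes that file lists one class name per line in label order (line 1 is class 1); `load_class_names` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

def load_class_names(path):
    """Return {1-based class id: class name} from a one-name-per-line file.

    Assumption: the file lists class names in label order, one per line.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    names = [line.strip() for line in lines if line.strip()]
    return {i: name for i, name in enumerate(names, start=1)}
```

In this notebook it would be called as `class_dict = load_class_names('/content/dbpedia_csv/classes.txt')`, after which `df['class'].map(class_dict)` works as before.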


In [8]:
df["class_name"].value_counts()

WrittenWork               40000
Album                     40000
Film                      40000
Animal                    40000
NaturalPlace              40000
Village                   40000
MeanOfTransportation      40000
OfficeHolder              40000
Building                  40000
Company                   40000
Athlete                   40000
Plant                     40000
Artist                    40000
EducationalInstitution    40000
Name: class_name, dtype: int64

In [None]:
import unicodedata

# Let's do some cleaning of this text
def clean_it(text,normalize=True):
    # Replace characters that could cause issues. Replacements can be added to or removed from this chain
    s = str(text).replace(',',' ').replace('"','').replace('\'',' \' ').replace('.',' . ').replace('(',' ( ').\
            replace(')',' ) ').replace('!',' ! ').replace('?',' ? ').replace(':',' ').replace(';',' ').lower()
    
    # Normalize to NFKD, then drop any character that cannot be encoded as ASCII
    if normalize:
        s = unicodedata.normalize('NFKD', s).encode('ascii','ignore').decode('utf-8')
    
    return s

# Now let's define a small function that applies the above cleaning to a dataset
def clean_df(data, cleanit= False, shuffleit=False, encodeit=False, label_prefix='__class__'):
    # Defining the new data
    df = data[['name','description']].copy(deep=True)
    df['class'] = label_prefix + data['class'].astype(str) + ' '
    
    # cleaning it
    if cleanit:
        df['name'] = df['name'].apply(lambda x: clean_it(x,encodeit))
        df['description'] = df['description'].apply(lambda x: clean_it(x,encodeit))
    
    # shuffling it
    if shuffleit:
        # sample() returns a new DataFrame, so the result must be assigned back
        df = df.sample(frac=1).reset_index(drop=True)
            
    return df
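
The optional normalization step in `clean_it` is meant to reduce text to plain ASCII. A minimal standalone sketch of that idea, using the standard-library `unicodedata` module, is shown below; `to_ascii` is a hypothetical name used only for illustration.

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters (NFKD), then drop anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Poznań café"))  # Poznan cafe
```

Accented Latin letters keep their base letter, but text in scripts with no ASCII decomposition (such as the Burmese title in the last row shown earlier) is dropped entirely, which is worth keeping in mind before enabling this option.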

In [None]:
%%time
# Transform the datasets using the above clean functions
df_train_cleaned = clean_df(df, True, True)
df_test_cleaned = clean_df(df_test, True, True)

CPU times: user 4.44 s, sys: 68.1 ms, total: 4.51 s
Wall time: 4.51 s


In [None]:
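# Write files to disk as fastText classifier API reads files from disk.
# data_path is not defined earlier in this notebook; '/content/' (the Colab working directory) is an assumption.
data_path = '/content/'
train_file = data_path + 'dbpedia_train.csv'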
df_train_cleaned.to_csv(train_file, header=None, index=False, columns=['class','name','description'] )

test_file = data_path + 'dbpedia_test.csv'
df_test_cleaned.to_csv(test_file, header=None, index=False, columns=['class','name','description'] )


Now that the train and test files are written to disk in the format fastText expects, we are ready to use it for text classification!
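
To make the on-disk format concrete, here is a sketch of how a single cleaned row ends up as one line of text: the prefixed label comes first, followed by the name and description. It uses the standard-library `csv` module as a stand-in for `to_csv`, and `to_fasttext_row` is a hypothetical helper, not something the notebook defines.

```python
import csv
import io

def to_fasttext_row(label, name, description, label_prefix="__class__"):
    """Serialize one example as: <prefix><label> ,<name>,<description>."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # The trailing space after the label mirrors clean_df, so the label
    # stays a separate whitespace-delimited token on the line.
    writer.writerow([f"{label_prefix}{label} ", name, description])
    return buf.getvalue().rstrip("\n")

line = to_fasttext_row(1, "q-workshop", "q-workshop is a polish company")
print(line)  # __class__1 ,q-workshop,q-workshop is a polish company
```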

In [None]:
## Using fastText for feature extraction and training
from fasttext import supervised 
"""fastText expects a training file (csv) and a model name as input arguments.
label_prefix refers to the prefix before the label string in the dataset.
The default is __label__. In our dataset, it is __class__. 
There are several other parameters, which are listed at: 
https://pypi.org/project/fasttext/
"""
%time model = supervised(train_file, 'temp', label_prefix="__class__")
results = model.test(test_file)
print(results.nexamples, results.precision, results.recall)

CPU times: user 56.5 s, sys: 1.51 s, total: 58 s
Wall time: 12.6 s
70000 0.9710571428571428 0.9710571428571428
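
The `supervised` call above belongs to the older Python bindings. For readers on a newer release, a rough equivalent is sketched below, assuming the 0.9.x API in which `train_supervised` takes the label prefix as `label` and `test` returns a tuple rather than a result object. `train_and_evaluate` is a hypothetical wrapper, and the numbers it returns will not necessarily match the output above.

```python
def train_and_evaluate(train_path, test_path, label_prefix="__class__"):
    """Train a supervised fastText model; return (n_examples, precision, recall)."""
    # Imported lazily so this helper can be defined without the package installed.
    import fasttext

    # `label` is the prefix that marks label tokens in the training file.
    model = fasttext.train_supervised(input=train_path, label=label_prefix)
    # Assumption: in the 0.9.x bindings, test() returns (N, precision@1, recall@1).
    n_examples, precision, recall = model.test(test_path)
    return n_examples, precision, recall
```

With the files written earlier, it would be called as `train_and_evaluate(train_file, test_file)`.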


Try training a classifier on this dataset with, say, LogisticRegression to see how fast fastText is by comparison! 97% precision and recall are hard numbers to beat, too!
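
Precision and recall come out identical in the output above. That is expected when every example has exactly one true label and the model returns exactly one prediction per example: both metrics then reduce to the fraction of correct predictions. The sketch below illustrates this with toy labels, assuming micro-averaged precision and recall at one prediction per example; it is not fastText's own implementation.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Micro-averaged precision@1 and recall@1 for single-label data."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = correct / len(predicted_labels)  # correct / number of predictions
    recall = correct / len(true_labels)          # correct / number of gold labels
    return precision, recall

p, r = precision_recall_at_1([1, 2, 3, 3], [1, 2, 3, 1])
print(p, r)  # 0.75 0.75
```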