# DBPedia Classification with fastText

### "fastText: Faster, better text classification!", a research project from the Facebook AI Research (FAIR) lab.

fastText, as the name suggests, is built for fast text classification. It uses character n-grams, along with several other techniques, to get better results.
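To make the idea of character n-grams concrete, here is a minimal sketch of how a word can be broken into them. The function name `char_ngrams`, its parameters, and the `<` / `>` word-boundary markers are illustrative assumptions for this sketch and are not taken from the fastText source code.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of `word` with lengths min_n..max_n.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the word.
    """
    token = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped token
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))
```

Because rare or misspelled words still share many n-grams with common ones, a model built on such pieces can say something useful about words it has barely seen.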

The [paper]() gives a fairly detailed view of how things work here.

Let's get hands-on with fastText using the DBPedia text classification dataset. The dataset consists of text descriptions belonging to 14 different classes. The training set contains 560,000 samples and the test set contains 70,000.

Download this dataset from [here](https://drive.google.com/drive/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M). 


In [1]:
# Importing Libraries
import os,sys  

# For loading data and doing some exploration
import pandas as pd

# The default import
import numpy as np

In [2]:
# Set the path for loading data, saving processed data and saving the model
# Always use the full path: 'fastText' may give errors when the full path is not specified
data_path = '/Users/data/dbpedia_csv/'

In [3]:
# Loading train data
train_file = data_path + 'train.csv'
df = pd.read_csv(train_file, header=None, names=['class','name','description'])

# Loading test data
test_file = data_path + 'test.csv'
df_test = pd.read_csv(test_file, header=None, names=['class','name','description'])

# Data with us
print("Train:{} Test:{}".format(df.shape,df_test.shape))

Train:(560000, 3) Test:(70000, 3)


In [4]:
# The data only carries class numbers, so let's build a lookup ourselves
# Mapping from class number to class name
class_dict={
            1:'Company',
            2:'EducationalInstitution',
            3:'Artist',
            4:'Athlete',
            5:'OfficeHolder',
            6:'MeanOfTransportation',
            7:'Building',
            8:'NaturalPlace',
            9:'Village',
            10:'Animal',
            11:'Plant',
            12:'Album',
            13:'Film',
            14:'WrittenWork'
        }

# Mapping the classes
df['class_name'] = df['class'].map(class_dict)
df.head()

,class,name,description,class_name
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a Br...,Company
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for ...,Company
2,1,Q-workshop,Q-workshop is a Polish company located in Poz...,Company
3,1,Marvell Software Solutions Israel,Marvell Software Solutions Israel known as RA...,Company
4,1,Bergan Mercy Medical Center,Bergan Mercy Medical Center is a hospital loc...,Company


In [5]:
df.tail()

,class,name,description,class_name
559995,14,Barking in Essex,Barking in Essex is a Black comedy play direc...,WrittenWork
559996,14,Science & Spirit,Science & Spirit is a discontinued American b...,WrittenWork
559997,14,The Blithedale Romance,The Blithedale Romance (1852) is Nathaniel Ha...,WrittenWork
559998,14,Razadarit Ayedawbon,Razadarit Ayedawbon (Burmese: ရာဇာဓိရာဇ် အရေး...,WrittenWork
559999,14,The Vinyl Cafe Notebooks,Vinyl Cafe Notebooks: a collection of essays ...,WrittenWork


In [6]:
# Summary statistics for each class
desc = df.groupby('class')
desc.describe()

,class_name,class_name,class_name,class_name,description,description,description,description,name,name,name,name
class,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq
1,40000,1,Company,40000,40000,39996,MegaPath Corporation—headquartered in Pleasan...,2,40000,40000,Echologics,1
2,40000,1,EducationalInstitution,40000,40000,39992,St. Croix Country Day School is an independen...,2,40000,40000,Bishop Perowne CofE College,1
3,40000,1,Artist,40000,40000,40000,Eiji Maruyama (丸山 詠二 Maruyama Eiji born Octob...,1,40000,40000,Reina,1
4,40000,1,Athlete,40000,40000,40000,Guillermo Andres Rivera Aránguiz born 2 Febru...,1,40000,40000,Maximiliano Estévez,1
5,40000,1,OfficeHolder,40000,40000,39998,Dr.,3,40000,40000,Angie Bray,1
6,40000,1,MeanOfTransportation,40000,40000,39998,The Hero Karizma ZMR is a motorcycle manufact...,2,40000,40000,USS Nassau (CVE-16),1
7,40000,1,Building,40000,40000,39998,Kuo Yuan Ye (Chinese: 郭元益; pinyin: Guōyuányì)...,2,40000,40000,Telšiai Cathedral,1
8,40000,1,NaturalPlace,40000,40000,39927,Steinkopf is a mountain of Hesse Germany.,4,40000,40000,Darul River,1
9,40000,1,Village,40000,40000,39999,Chah Amiq-e Astan Qods (Persian: چاه عميق است...,2,40000,40000,Boraszyce Małe,1
10,40000,1,Animal,40000,40000,39995,Typhlops leucomelas is a species of snake in ...,2,40000,40000,Eulepidotis persimilis,1


In [7]:
# Let's do some cleaning
import unicodedata

def clean_it(text, normalize=True):
    # Replace characters that may cause issues. Replacements can be added to or removed from this chain
    s = str(text).replace(',',' ').replace('"','').replace('\'',' \' ').replace('.',' . ').replace('(',' ( ').\
            replace(')',' ) ').replace('!',' ! ').replace('?',' ? ').replace(':',' ').replace(';',' ').lower()
    
    # normalizing / encoding the text: decompose accented characters (NFKD),
    # then drop everything that is not plain ASCII
    if normalize:
        s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('utf-8')
    
    return s

# Now let's define a small function that applies the cleaning above to a dataset
def clean_df(data, cleanit=False, shuffleit=False, encodeit=False, label_prefix='__class__'):
    # Defining the new data
    df = data[['name','description']].copy(deep=True)
    df['class'] = label_prefix + data['class'].astype(str) + ' '
    
    # cleaning it
    if cleanit:
        df['name'] = df['name'].apply(lambda x: clean_it(x,encodeit))
        df['description'] = df['description'].apply(lambda x: clean_it(x,encodeit))
    
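    # shuffling it
    # NOTE: sample() returns a new DataFrame and the result is not assigned here,
    # so the rows keep their original order. To actually shuffle, use:
    #     df = df.sample(frac=1).reset_index(drop=True)
    if shuffleit:
        df.sample(frac=1).reset_index(drop=True)
        
    # surround each field with spaces so the words stay separated from the
    # CSV commas when fastText splits the line on whitespace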
    df['name'] = ' ' + df['name'] + ' '
    df['description'] = ' ' + df['description'] + ' '
        
    return df



In [8]:
%%time
# Transform datasets
df_train = clean_df(df, True, True)
df_test_cleaned = clean_df(df_test, True, False)

CPU times: user 6.79 s, sys: 624 ms, total: 7.41 s
Wall time: 7.44 s


In [9]:
df_train.head()

,name,description,class
0,e . d . abbott ltd,abbott of farnham e d abbott limited was a b...,__class__1
1,schwan-stabilo,schwan-stabilo is a german maker of pens for...,__class__1
2,q-workshop,q-workshop is a polish company located in po...,__class__1
3,marvell software solutions israel,marvell software solutions israel known as r...,__class__1
4,bergan mercy medical center,bergan mercy medical center is a hospital lo...,__class__1


In [10]:
df_train.tail()

,name,description,class
559995,barking in essex,barking in essex is a black comedy play dire...,__class__14
559996,science & spirit,science & spirit is a discontinued american ...,__class__14
559997,the blithedale romance,the blithedale romance ( 1852 ) is nathani...,__class__14
559998,razadarit ayedawbon,razadarit ayedawbon ( burmese ရာဇာဓိရာဇ် အ...,__class__14
559999,the vinyl cafe notebooks,vinyl cafe notebooks a collection of essays...,__class__14


In [11]:
df['description'][661]

' İzmir Banliyö Anonym Şirketi or İZBAN A.Ş. is the holding company of İZBAN. It was created in 2006 to operate a commuter railroad around İzmir. İZBAN A.Ş. is owned 50% by the Turkish State Railways and 50% by the İzmir Municipality.'

In [12]:
df_train['description'][661]

'  i̇zmir banliyö anonym şirketi or i̇zban a . ş .  is the holding company of i̇zban .  it was created in 2006 to operate a commuter railroad around i̇zmir .  i̇zban a . ş .  is owned 50% by the turkish state railways and 50% by the i̇zmir municipality .  '
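The cleaned description above still contains non-ASCII characters because `encodeit` was left at its default of `False`. As a minimal sketch of what the NFKD normalization step does when it is switched on, the helper below (the name `to_ascii` is hypothetical) decomposes accented characters and then drops whatever cannot be represented in ASCII.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop every non-ASCII code point."""
    # NFKD splits a character such as 'ö' into 'o' plus a combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # encode(..., 'ignore') silently discards the combining marks and any
    # character that has no ASCII base letter at all
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("İzmir Banliyö"))
```

Note the trade-off: accented Latin letters are reduced to their base letter, but text in scripts without an ASCII base (such as the Burmese title in the training data) disappears completely.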

### fastText is built in C++ for direct command-line use, so the exposed API reads its data from files on disk. Hence we need to save the data and keep its path to pass to the fastText model.
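As a rough illustration of the kind of file fastText reads, the sketch below writes one example per line with the prefixed label first and the text after it. The helper name `write_fasttext_file` and the two toy rows are assumptions made for this example; only the `__class__` prefix comes from this notebook.

```python
import os
import tempfile


def write_fasttext_file(rows, path, label_prefix="__class__"):
    """Write (class_id, text) pairs as '<prefix><class_id> <text>' lines."""
    with open(path, "w", encoding="utf-8") as handle:
        for class_id, text in rows:
            handle.write("{}{} {}\n".format(label_prefix, class_id, text))


# Toy rows for illustration only, not taken from the dataset files
rows = [
    (1, "q-workshop is a polish company"),
    (14, "the blithedale romance is a novel"),
]
path = os.path.join(tempfile.mkdtemp(), "toy.train")
write_fasttext_file(rows, path)

with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```

The label prefix used when writing the file has to match the `label` argument given to the trainer, otherwise the labels are treated as ordinary words.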

In [13]:
# Write files to disk
train_file = data_path + 'dbpedia.train'
df_train.to_csv(train_file, header=None, index=False, columns=['class','name','description'] )

test_file = data_path + 'dbpedia.valid'
df_test_cleaned.to_csv(test_file, header=None, index=False, columns=['class','name','description'] )

# Also a small helper to print the evaluation results.
def print_results(N, p, r):
    print("N\t" + str(N))
    print("Precision {}\t{:.3f}".format(1, p))
    print("Recall    {}\t{:.3f}".format(1, r))
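The precision and recall printed by `print_results` are measured at k = 1, meaning only the top prediction per example is considered. A minimal sketch of how such numbers can be computed is shown below; the function `precision_recall_at_k` and its inputs are hypothetical and are not part of the fastText API.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k=1):
    """Compute precision@k and recall@k over a set of examples.

    true_labels        -- list of sets, the gold labels of each example
    ranked_predictions -- list of lists, predicted labels ordered best first
    """
    n_correct = 0    # predictions that hit a gold label
    n_predicted = 0  # predictions made in total
    n_true = 0       # gold labels in total
    for truth, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_true += len(truth)
    return n_correct / n_predicted, n_correct / n_true


truth = [{"a"}, {"b"}, {"c"}, {"a"}]
preds = [["a", "b"], ["c", "b"], ["c", "a"], ["b", "a"]]
print(precision_recall_at_k(truth, preds, k=1))
print(precision_recall_at_k(truth, preds, k=2))
```

When every example has exactly one gold label and k = 1, the two denominators are the same, so precision and recall come out equal.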

In [14]:
# The library under exploration
# (newer releases of the Python bindings are imported as lowercase `fasttext`)
from fastText import train_supervised

### Building a basic model with fastText

In [15]:
%%time
print('Train a classifier')
model = train_supervised(input=train_file, label='__class__', epoch=25, lr=1.0, wordNgrams=2, minCount=1, verbose=1)

print('Evaluating results')
print_results(*model.test(test_file))

print('Saving model')
model.save_model(data_path +"basic_model")
                 

Train a classifier
Evaluating results
N	70000
Precision 1	0.987
Recall    1	0.987
Saving model
CPU times: user 9min 1s, sys: 4.41 s, total: 9min 5s
Wall time: 1min 21s
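Predictions from the trained model are expected to come back in the same prefixed form used in the training file (for example `__class__7`). A minimal sketch for turning such a label back into a readable class name follows; the helper `label_to_name` is hypothetical, and the mapping repeats the `class_dict` defined earlier so that the block is self-contained.

```python
CLASS_NAMES = {
    1: 'Company', 2: 'EducationalInstitution', 3: 'Artist', 4: 'Athlete',
    5: 'OfficeHolder', 6: 'MeanOfTransportation', 7: 'Building',
    8: 'NaturalPlace', 9: 'Village', 10: 'Animal', 11: 'Plant',
    12: 'Album', 13: 'Film', 14: 'WrittenWork',
}


def label_to_name(label, prefix="__class__"):
    """Map a prefixed label such as '__class__7' to its class name."""
    if not label.startswith(prefix):
        raise ValueError("unexpected label format: {!r}".format(label))
    # Strip the prefix and surrounding whitespace, then look up the number
    class_id = int(label[len(prefix):].strip())
    return CLASS_NAMES[class_id]


print(label_to_name("__class__7"))
```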


### Setting a cutoff and other options, then quantizing and retraining the model

In [17]:
%%time
print('Classifier retraining')
model.quantize(input=train_file, qnorm=True, retrain=True, cutoff=100000)

print('Again Evaluating')
print_results(*model.test(test_file))

print('Saving retrained model')
model.save_model(data_path +"basic_model_quantized")

Classifier retraining
Again Evaluating
N	70000
Precision 1	0.986
Recall    1	0.986
Saving retrained model
CPU times: user 1min 51s, sys: 805 ms, total: 1min 52s
Wall time: 1min 51s
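The scores above show that quantization costs very little accuracy, but the main point of quantizing is a smaller model file. A minimal sketch for comparing the two saved files on disk is given below. The helper `size_report` is hypothetical, and the demo uses dummy temporary files in place of the real `basic_model` and `basic_model_quantized` files, so the printed numbers say nothing about the actual models.

```python
import os
import tempfile


def size_report(original_path, quantized_path):
    """Return (original_bytes, quantized_bytes, compression_ratio)."""
    original = os.path.getsize(original_path)
    quantized = os.path.getsize(quantized_path)
    return original, quantized, original / quantized


# Demo with dummy files standing in for the two saved models
tmp_dir = tempfile.mkdtemp()
big_path = os.path.join(tmp_dir, "big.bin")
small_path = os.path.join(tmp_dir, "small.bin")
with open(big_path, "wb") as handle:
    handle.write(b"\0" * 1000)
with open(small_path, "wb") as handle:
    handle.write(b"\0" * 100)

print(size_report(big_path, small_path))
```

With the real models, the two paths would be `data_path + "basic_model"` and `data_path + "basic_model_quantized"`.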
