In this notebook we demonstrate how to use the fastText library to perform text classification on the DBpedia data, which can be downloaded from [here](https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz). <br>fastText is a library for learning word embeddings and for text classification, created by Facebook's AI Research (FAIR) lab. It supports both unsupervised and supervised learning for obtaining vector representations of words. Facebook makes pretrained models available for 294 languages (source: [wiki](https://en.wikipedia.org/wiki/FastText)).<br>

More information: 
- https://fasttext.cc/docs/en/supervised-tutorial.html
- https://pypi.org/project/fasttext/#text-classification-model

**Note**: This notebook uses an older version of fasttext.

## 1. Prepare the data
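
The archive linked in the introduction can also be fetched and unpacked from Python instead of the shell. The following is a minimal sketch using only the standard library; the helper names `extract_archive` and `download_and_extract` and the destination directory are assumptions made for illustration, not part of the original notebook.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the notebook introduction.
DBPEDIA_URL = "https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz"


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


def download_and_extract(url, dest_dir):
    """Download the archive at url into dest_dir (hypothetical helper), then unpack it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / Path(url).name
    if not archive_path.exists():  # skip the download if the file is already there
        urllib.request.urlretrieve(url, archive_path)
    return extract_archive(archive_path, dest)
```

Calling `download_and_extract(DBPEDIA_URL, "data")` would download the archive once and unpack it under a `data` directory; that directory name is a placeholder.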

In [6]:
# !pip install fasttext==0.9.2

In [5]:
#necessary imports
import pandas as pd

In [None]:
"""
# downloading the data
!wget -P DATAPATH https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz

# untarring the required file
!tar -xvf DATAPATH/dbpedia_csv.tar.gz -C DATAPATH

# sneak peek at the folder structure
!ls -lah DATAPATH
"""

In [7]:
data_path = '/Users/chenwang/Workspace/github/practical_nlp/chapter4_text_classification/data/dbpedie'

# Loading train data
train_file = data_path + '/train.csv'
df = pd.read_csv(train_file, header=None, names=['class','name','description'])

# Loading test data
test_file = data_path + '/test.csv'
df_test = pd.read_csv(test_file, header=None, names=['class','name','description'])

# Data we have
print("Train:{} Test:{}".format(df.shape,df_test.shape))


Train:(560000, 3) Test:(70000, 3)


In [8]:
df.head()

Unnamed: 0,class,name,description
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a Br...
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for ...
2,1,Q-workshop,Q-workshop is a Polish company located in Poz...
3,1,Marvell Software Solutions Israel,Marvell Software Solutions Israel known as RA...
4,1,Bergan Mercy Medical Center,Bergan Mercy Medical Center is a hospital loc...


In [9]:
df_test.head()

Unnamed: 0,class,name,description
0,1,TY KU,TY KU /taɪkuː/ is an American alcoholic bever...
1,1,Odd Lot Entertainment,OddLot Entertainment founded in 2001 by longt...
2,1,Henkel,Henkel AG & Company KGaA operates worldwide w...
3,1,GOAT Store,The GOAT Store (Games Of All Type Store) LLC ...
4,1,RagWing Aircraft Designs,RagWing Aircraft Designs (also called the Rag...


#### Give each class label a description

In [10]:
# The data only gives class numbers, so let's build a lookup table
# Mapping from class number to class name
class_dict={
            1:'Company',
            2:'EducationalInstitution',
            3:'Artist',
            4:'Athlete',
            5:'OfficeHolder',
            6:'MeanOfTransportation',
            7:'Building',
            8:'NaturalPlace',
            9:'Village',
            10:'Animal',
            11:'Plant',
            12:'Album',
            13:'Film',
            14:'WrittenWork'
        }



In [11]:
# Add a column to the DataFrame
# Mapping the classes

df['class_name'] = df['class'].map(class_dict)
df.head()

Unnamed: 0,class,name,description,class_name
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a Br...,Company
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for ...,Company
2,1,Q-workshop,Q-workshop is a Polish company located in Poz...,Company
3,1,Marvell Software Solutions Israel,Marvell Software Solutions Israel known as RA...,Company
4,1,Bergan Mercy Medical Center,Bergan Mercy Medical Center is a hospital loc...,Company


In [12]:
df["class_name"].value_counts()

Plant                     40000
WrittenWork               40000
Artist                    40000
OfficeHolder              40000
Building                  40000
Animal                    40000
Company                   40000
NaturalPlace              40000
Album                     40000
EducationalInstitution    40000
Athlete                   40000
Film                      40000
Village                   40000
MeanOfTransportation      40000
Name: class_name, dtype: int64

#### Data cleaning

- normalize
- encoding

In [13]:
import unicodedata

# Let's do some cleaning of this text
def clean_it(text,normalize=True):
    # Replace characters that may cause issues. Replacements can be added to or removed from this chain
    s = str(text).replace(',',' ').replace('"','').replace('\'',' \' ').replace('.',' . ').replace('(',' ( ').\
            replace(')',' ) ').replace('!',' ! ').replace('?',' ? ').replace(':',' ').replace(';',' ').lower()
    
    # normalizing / encoding the text: decompose with NFKD, then drop anything that is not ASCII
    if normalize:
        s = unicodedata.normalize('NFKD', s).encode('ascii','ignore').decode('utf-8')
    
    return s
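
To see what the cleaning step does to a single string, the sketch below re-declares the same replacement chain under the hypothetical name `clean_text` and performs the normalization with `unicodedata.normalize` from the standard library. The sample sentence is made up for illustration.

```python
import unicodedata


def clean_text(text, normalize=True):
    """Same replacement chain as clean_it, with normalization done via unicodedata."""
    s = (
        str(text)
        .replace(",", " ").replace('"', "").replace("'", " ' ")
        .replace(".", " . ").replace("(", " ( ").replace(")", " ) ")
        .replace("!", " ! ").replace("?", " ? ")
        .replace(":", " ").replace(";", " ")
        .lower()
    )
    if normalize:
        # NFKD splits an accented letter into a base letter plus a combining mark;
        # encoding to ASCII with errors="ignore" then drops the marks.
        s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("utf-8")
    return s


sample = 'Café "Zoë", founded in 1990 (Paris).'  # made-up example text
print(clean_text(sample).split())
print(clean_text(sample, normalize=False).split())
```

Punctuation ends up as separate whitespace-delimited tokens, and with `normalize=True` the accented letters are reduced to their ASCII base letters.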


In [14]:
# Now let's define a small function that applies the cleaning above to a dataset
def clean_df(data, cleanit=False, shuffleit=False, encodeit=False, label_prefix='__class__'):
    # Defining the new data
    df = data[['name','description']].copy(deep=True)
    df['class'] = label_prefix + data['class'].astype(str) + ' '
    
    # cleaning it
    if cleanit:
        df['name'] = df['name'].apply(lambda x: clean_it(x,encodeit))
        df['description'] = df['description'].apply(lambda x: clean_it(x,encodeit))
    
    # shuffling it (sample returns a new DataFrame, so the result must be kept)
    if shuffleit:
        df = df.sample(frac=1).reset_index(drop=True)
            
    return df

In [15]:
%%time
# Transform the datasets using the above clean functions
df_train_cleaned = clean_df(df, True, True)
df_test_cleaned = clean_df(df_test, True, True)

CPU times: user 5.47 s, sys: 265 ms, total: 5.74 s
Wall time: 5.81 s


#### Write the cleaned data back to disk as input for fastText

In [16]:
# Write files to disk as fastText classifier API reads files from disk.
train_file = data_path + '/dbpedia_train.csv'
# df_train_cleaned.to_csv(train_file, header=None, index=False, columns=['class','name','description'] )

test_file = data_path + '/dbpedia_test.csv'
# df_test_cleaned.to_csv(test_file, header=None, index=False, columns=['class','name','description'] )
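
For reference, each line of a fastText training file carries the label token (prefix plus class number) followed by the text. The sketch below builds such lines in plain Python; the helper name `to_fasttext_line` and the sample rows are hypothetical and only illustrate the layout, they are not the notebook's own export code.

```python
def to_fasttext_line(class_id, name, description, label_prefix="__class__"):
    """Build one training line: '<prefix><class_id> <name> <description>'."""
    # Collapse runs of whitespace (including newlines) so each example stays on one line.
    text = " ".join(f"{name} {description}".split())
    return f"{label_prefix}{class_id} {text}"


# Illustrative rows: (class number, name, description)
rows = [
    (1, "q-workshop", "q-workshop is a polish company"),
    (13, "some film", "a made-up\nfilm description"),
]
lines = [to_fasttext_line(*row) for row in rows]
print("\n".join(lines))
```

Writing one example per line this way sidesteps CSV quoting, at the cost of bypassing `to_csv`; either approach keeps the label as the first whitespace-delimited token.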


## 2. Train a model with fastText

Now that the train and test files are written to disk in the format fastText expects, we are ready to use it for text classification!

In [21]:
## Using fastText for feature extraction and training
from fasttext import train_supervised 
import fasttext

In [24]:
help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    cbow(*kargs, **kwargs)
    
    eprint(*args, **kwargs)
    
    load_model(path)
        Load a model given a filepath and return a model object.
    
    read_args(arg_list, arg_dict, arg_names, default_values)
    
    skipgram(*kargs, **kwargs)
    
    supervised(*kargs, **kwargs)
    
    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
    
    train_supervised(*kargs, **kwargs)
        Train a supervised model and return a model object.
        
        input must be a filepath. The input text does not need to be tokenized
        as per the tokenize function, but it must be preprocessed and encoded
        as UTF-8. You might wan

In [18]:
%%time

"""train_supervised takes the path of a training file (csv) as `input`.
The `label` argument is the prefix that marks a label string in the dataset.
The default is __label__. In our dataset, it is __class__.
There are several other parameters, which are listed at:
https://pypi.org/project/fasttext/
"""
model = train_supervised(input=train_file, label="__class__", lr=1.0, epoch=75, loss='ova', wordNgrams=2, dim=200, thread=2, verbose=100)


CPU times: user 31min 36s, sys: 32.2 s, total: 32min 9s
Wall time: 16min 42s


Once the model is trained, we can retrieve the list of words and labels:

In [35]:
print(model.words[:10])

['.', 'the', 'in', 'of', 'a', 'is', 'and', '</s>', '(', ')']


In [29]:
print(model.labels)

['__class__14', '__class__13', '__class__12', '__class__2', '__class__11', '__class__3', '__class__10', '__class__9', '__class__8', '__class__4', '__class__7', '__class__6', '__class__5', '__class__1']


In [36]:
def print_results(N, p, r):
    print("Number\t" + str(N))
    print("Precision@{}\t{:.3f}".format(1, p))
    print("Recall@{}\t{:.3f}".format(1, r))

print_results(*model.test(test_file))

Number	70000
Precision@1	0.877
Recall@1	0.877


In [19]:
for k in range(1,6):
    results = model.test(test_file,k=k)
    print(f"Test Samples: {results[0]} Precision@{k} : {results[1]*100:2.4f} Recall@{k} : {results[2]*100:2.4f}")

Test Samples: 70000 Precision@1 : 87.6886 Recall@1 : 87.6886
Test Samples: 70000 Precision@2 : 47.2907 Recall@2 : 94.5814
Test Samples: 70000 Precision@3 : 31.8819 Recall@3 : 95.6457
Test Samples: 70000 Precision@4 : 23.9821 Recall@4 : 95.9286
Test Samples: 70000 Precision@5 : 19.2034 Recall@5 : 96.0171
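
The pattern in these numbers (Recall@k divided by k equals Precision@k) is what one would expect when every example has exactly one true label and the model always returns k predictions. The sketch below reproduces that relationship on toy data; the function `precision_recall_at_k` is a hypothetical helper and only an assumption about how the counts are formed, consistent with the output above.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k):
    """Compute precision@k and recall@k for single-label examples.

    true_labels: one gold label per example.
    ranked_predictions: per example, a list of labels sorted by decreasing score.
    """
    hits = sum(1 for gold, ranked in zip(true_labels, ranked_predictions) if gold in ranked[:k])
    n_predicted = sum(len(ranked[:k]) for ranked in ranked_predictions)
    precision = hits / n_predicted      # correct labels / labels predicted
    recall = hits / len(true_labels)    # correct labels / gold labels
    return precision, recall


# Toy data with three classes
gold = ["a", "b", "c", "a"]
ranked = [["a", "b", "c"], ["a", "b", "c"], ["b", "a", "c"], ["c", "a", "b"]]

for k in range(1, 4):
    p, r = precision_recall_at_k(gold, ranked, k)
    print(f"k={k} precision={p:.3f} recall={r:.3f}")
```

Because the number of predictions grows with k while each example still has a single gold label, precision necessarily falls as k increases even though recall improves.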


Try training a classifier on this dataset with, say, LogisticRegression to see how fast fastText is! With the settings above, precision and recall at k=1 are about 87.7%, and recall rises to about 96% at k=5.

#### Predict the label of a sentence

By default, `predict` returns the label with the highest probability. Use the parameter `k` to return the `k` labels with the highest probabilities.

In [37]:
model.predict("Which baking dish is best to bake a banana bread ?")

(('__class__1',), array([1.00000034e-05]))

In [38]:
model.predict("Which baking dish is best to bake a banana bread ?", k=3)

(('__class__12', '__class__1', '__class__5'),
 array([1.00000034e-05, 1.00000034e-05, 1.00000034e-05]))
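
The raw labels are easier to read once the prefix is stripped and the class number is mapped back to a name. A minimal sketch, assuming the same number-to-name mapping as `class_dict` earlier in the notebook; `decode_prediction` is a hypothetical helper, and the inputs are copied from the `k=3` output above.

```python
# Copy of the notebook's class_dict so the example is self-contained.
CLASS_NAMES = {
    1: "Company", 2: "EducationalInstitution", 3: "Artist", 4: "Athlete",
    5: "OfficeHolder", 6: "MeanOfTransportation", 7: "Building",
    8: "NaturalPlace", 9: "Village", 10: "Animal", 11: "Plant",
    12: "Album", 13: "Film", 14: "WrittenWork",
}


def decode_prediction(labels, probabilities, label_prefix="__class__"):
    """Turn ('__class__12', ...) and their scores into [(class name, score), ...]."""
    decoded = []
    for label, prob in zip(labels, probabilities):
        class_id = int(label[len(label_prefix):])  # strip the prefix, keep the number
        decoded.append((CLASS_NAMES[class_id], float(prob)))
    return decoded


# Values copied from the k=3 prediction shown above.
labels = ("__class__12", "__class__1", "__class__5")
probabilities = [1.00000034e-05, 1.00000034e-05, 1.00000034e-05]

for name, prob in decode_prediction(labels, probabilities):
    print(f"{name}: {prob:.2e}")
```

In the notebook, the two arguments would come from unpacking the tuple returned by `model.predict(...)`.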

#### Save the model

We can see that the output file is about 2.5 GB in size.

In [25]:
model.save_model("output/fasttext_model.bin")
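
Since the target path points into an `output` directory, it may help to make sure that directory exists before saving, and to check the size of the file afterwards. The helpers below are a hypothetical sketch using only the standard library; whether the directory already exists in a given setup is an assumption.

```python
import os


def prepare_model_path(path):
    """Create the parent directory of path if it is missing and return path unchanged."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return path


def size_in_gib(path):
    """Return the size of the file at path in GiB."""
    return os.path.getsize(path) / 1024 ** 3


# Intended use inside the notebook, where `model` is the trained classifier:
# model_path = prepare_model_path("output/fasttext_model.bin")
# model.save_model(model_path)
# print(f"{size_in_gib(model_path):.2f} GiB")
```

The saved file can later be restored with `fasttext.load_model(path)`, which appears in the `help` output earlier in this notebook.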