In this notebook we will demonstrate using the fastText library to perform text classificatoin on the dbpedia data which can we downloaded from [here](https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz). <br>fastText is a library for learning of word embeddings and text classification created by Facebook's AI Research (FAIR) lab. The model allows to create an unsupervised learning or supervised learning algorithm for obtaining vector representations for words. Facebook makes available pretrained models for 294 languages(source: [wiki](https://en.wikipedia.org/wiki/FastText)).<br>
**Note**: This notebook uses an older version of fasttext.

In [2]:
import os
import pandas as pd
import wget
import tarfile

In [3]:
try :
    
    from google.colab import files
    
    # downloading the data
    !wget -P DATAPATH https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz

    # untaring the required file
    !tar -xvf DATAPATH/dbpedia_csv.tar.gz -C DATAPATH

    # sneek peek in the folder structure
    !ls -lah DATAPATH
    
    # specifying the data_path
    data_path = 'DATAPATH'
    
except ModuleNotFoundError:
    
    if not os.path.exists(os.getcwd()+'\\Data\\dbpedia_csv') :
        # downloading the data
        url="https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz"
        path=os.getcwd()+'\Data'
        wget.download(url,path)

        # untaring the required file
        temp=path+'\dbpedia_csv.tar.gz'
        tar = tarfile.open(temp, "r:gz")
        tar.extractall(path)     
        tar.close()
    
    # specifying the data_path
    data_path='Data'

In [4]:
# Loading train data
train_file = data_path + '/dbpedia_csv/train.csv'
df = pd.read_csv(train_file, header=None, names=['class','name','description'])
# Loading test data
test_file = data_path + '/dbpedia_csv/test.csv'
df_test = pd.read_csv(test_file, header=None, names=['class','name','description'])
# Data we have
print("Train:{} Test:{}".format(df.shape,df_test.shape))


Train:(560000, 3) Test:(70000, 3)


In [5]:
# Since we have no clue about the classes lets build one
# Mapping from class number to class name
class_dict={
            1:'Company',
            2:'EducationalInstitution',
            3:'Artist',
            4:'Athlete',
            5:'OfficeHolder',
            6:'MeanOfTransportation',
            7:'Building',
            8:'NaturalPlace',
            9:'Village',
            10:'Animal',
            11:'Plant',
            12:'Album',
            13:'Film',
            14:'WrittenWork'
        }

# Mapping the classes
df['class_name'] = df['class'].map(class_dict)
df.head()

Unnamed: 0,class,name,description,class_name
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a Br...,Company
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for ...,Company
2,1,Q-workshop,Q-workshop is a Polish company located in Poz...,Company
3,1,Marvell Software Solutions Israel,Marvell Software Solutions Israel known as RA...,Company
4,1,Bergan Mercy Medical Center,Bergan Mercy Medical Center is a hospital loc...,Company


In [6]:
df["class_name"].value_counts()

Company                   40000
EducationalInstitution    40000
Artist                    40000
Athlete                   40000
OfficeHolder              40000
MeanOfTransportation      40000
Building                  40000
NaturalPlace              40000
Village                   40000
Animal                    40000
Plant                     40000
Album                     40000
Film                      40000
WrittenWork               40000
Name: class_name, dtype: int64

In [7]:
# Lets do some cleaning of this text
def clean_it(text, normalize=True):
    # Replacing possible issues with data. We can add or reduce the replacement in this chain
    s = str(text).replace(',',' ').replace('"','').replace('\'',' \' ').replace('.',' . ').replace('(',' ( ').\
            replace(')',' ) ').replace('!',' ! ').replace('?',' ? ').replace(':',' ').replace(';',' ').lower()
    
    # normalizing / encoding the text
    if normalize:
        # This is to do with unicode compatibility.  See: https://docs.python.org/3/library/unicodedata.html
        s = s.normalize('NFKD').str.encode('ascii','ignore').str.decode('utf-8')
    
    return s

# Now lets define a small function where we can use above cleaning on datasets
def clean_df(data, cleanit=False, shuffleit=False, encodeit=False, label_prefix='__class__'):
    # Defining the new data
    df = data[['name','description']].copy(deep=True)
    df['class'] = label_prefix + data['class'].astype(str) + ' '
    
    # cleaning it
    if cleanit:
        df['name'] = df['name'].apply(lambda x: clean_it(x, encodeit))
        df['description'] = df['description'].apply(lambda x: clean_it(x, encodeit))
    
    # shuffling it
    if shuffleit:
        df.sample(frac=1).reset_index(drop=True)
            
    return df

In [8]:
%%time
# Transform the datasets using the above clean functions
df_train_cleaned = clean_df(df, True, True)
df_test_cleaned = clean_df(df_test, True, True)

CPU times: total: 2.78 s
Wall time: 2.78 s


In [9]:
# Write files to disk as fastText classifier API reads files from disk.
train_file = data_path + '/dbpedia_train.csv'
df_train_cleaned.to_csv(train_file, header=None, index=False, columns=['class','name','description'] )

test_file = data_path + '/dbpedia_test.csv'
df_test_cleaned.to_csv(test_file, header=None, index=False, columns=['class','name','description'] )


Now that we have the train and test files written into disk in a format fastText wants, we are ready to use it for text classification!

In [10]:
%%time
## Using fastText for feature extraction and training
from fasttext import train_supervised 
"""fastText expects and training file (csv), a model name as input arguments.
label_prefix refers to the prefix before label string in the dataset.
default is __label__. In our dataset, it is __class__. 
There are several other parameters which can be seen in: 
https://pypi.org/project/fasttext/
"""
model = train_supervised(input=train_file, label="__class__", lr=1.0, epoch=75, loss='ova', wordNgrams=2, dim=200, thread=4, verbose=100)

CPU times: total: 59min 15s
Wall time: 14min 53s


In [11]:
for k in range(1,6):
    results = model.test(test_file,k=k)
    print(f"Test Samples: {results[0]} Precision@{k} : {results[1]*100:2.4f} Recall@{k} : {results[2]*100:2.4f}")

Test Samples: 70000 Precision@1 : 98.3700 Recall@1 : 98.3700
Test Samples: 70000 Precision@2 : 49.5686 Recall@2 : 99.1371
Test Samples: 70000 Precision@3 : 33.0624 Recall@3 : 99.1871
Test Samples: 70000 Precision@4 : 24.8139 Recall@4 : 99.2557
Test Samples: 70000 Precision@5 : 19.8740 Recall@5 : 99.3700


Try training a classifier on this dataset with, say, LogisticRegression to realize how fast fastText is! 93% Precision and Recall are hard numbers to beat, too!

In [15]:
from gensim.models import Word2Vec, KeyedVectors
#Load W2V model. This will take some time. 
path_to_model = "GoogleNews-vectors-negative300.bin"
%time w2v_model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
print('done loading Word2Vec')

CPU times: total: 18.7 s
Wall time: 18.7 s
done loading Word2Vec


In [16]:
#Inspect the model
word2vec_vocab = w2v_model.key_to_index.keys()
word2vec_vocab_lower = [item.lower() for item in word2vec_vocab]
print(len(word2vec_vocab))

3000000


In [17]:
df_train_cleaned.head()

Unnamed: 0,name,description,class
0,e . d . abbott ltd,abbott of farnham e d abbott limited was a br...,__class__1
1,schwan-stabilo,schwan-stabilo is a german maker of pens for ...,__class__1
2,q-workshop,q-workshop is a polish company located in poz...,__class__1
3,marvell software solutions israel,marvell software solutions israel known as ra...,__class__1
4,bergan mercy medical center,bergan mercy medical center is a hospital loc...,__class__1


In [20]:
texts = list(df_train_cleaned["name"] +  " " + df_train_cleaned["description"])

In [22]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

#preprocess the text.
def preprocess_corpus(texts):
    mystopwords = set(stopwords.words("english"))
    def remove_stops_digits(tokens):
        #Nested function that lowercases, removes stopwords and digits from a list of tokens
        return [token.lower() for token in tokens if token.lower() not in mystopwords and not token.isdigit()
               and token not in punctuation]
    #This return statement below uses the above function to process twitter tokenizer output further. 
    return [remove_stops_digits(word_tokenize(text)) for text in texts]
 
texts_processed = preprocess_corpus(texts)

In [23]:
import numpy as np
# Creating a feature vector by averaging all embeddings for all sentences
def embedding_feats(list_of_lists):
    DIMENSION = 300
    zero_vector = np.zeros(DIMENSION)
    feats = []
    for tokens in list_of_lists:
        feat_for_this =  np.zeros(DIMENSION)
        count_for_this = 0 + 1e-5 # to avoid divide-by-zero 
        for token in tokens:
            if token in w2v_model:
                feat_for_this += w2v_model[token]
                count_for_this +=1
        if(count_for_this!=0):
            feats.append(feat_for_this/count_for_this) 
        else:
            feats.append(zero_vector)
    return feats


train_vectors = embedding_feats(texts_processed)
print(len(train_vectors))

560000


In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

classifier = LogisticRegression(random_state=42)
classifier.fit(train_vectors, df_train_cleaned["class"])


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [26]:
texts_test = list(df_test_cleaned["name"] +  " " + df_test_cleaned["description"])
texts_test_processed = preprocess_corpus(texts_test)
test_vectors = embedding_feats(texts_test_processed)
print(len(test_vectors))

70000


In [27]:
print("Accuracy: ", classifier.score(test_vectors, df_test_cleaned["class"]))
preds = classifier.predict(test_vectors)
print(classification_report(df_test_cleaned["class"], preds))

Accuracy:  0.9666142857142858
              precision    recall  f1-score   support

 __class__1        0.93      0.92      0.93      5000
__class__10        0.98      0.98      0.98      5000
__class__11        0.98      0.98      0.98      5000
__class__12        0.97      0.98      0.97      5000
__class__13        0.97      0.97      0.97      5000
__class__14        0.93      0.93      0.93      5000
 __class__2        0.97      0.98      0.97      5000
 __class__3        0.93      0.91      0.92      5000
 __class__4        0.99      0.99      0.99      5000
 __class__5        0.97      0.97      0.97      5000
 __class__6        0.97      0.98      0.98      5000
 __class__7        0.96      0.96      0.96      5000
 __class__8        0.99      0.99      0.99      5000
 __class__9        0.99      0.99      0.99      5000

    accuracy                           0.97     70000
   macro avg       0.97      0.97      0.97     70000
weighted avg       0.97      0.97      0.97     70