# Working with FastText Classifiers in Scikit-learn Pipelines

The NLP field has been advancing at a rapid pace, with a new state-of-the-art technique being published every few months.

I am currently working on a project where we are trying to automatically classify OCR-ed documents. FastText is giving the best results so far because it makes use of subword information, which makes it a little more robust to OCR errors. Curious readers can find more information at [fasttext.cc](https://fasttext.cc/).
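To make the subword idea concrete, here is a minimal sketch of character n-gram extraction in the style fastText describes (word padded with `<` and `>`, n-gram lengths 3 to 6). The function name `char_ngrams` and the OCR-style misspelling are illustrative assumptions, not part of the fastText API.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the set of character n-grams of `word`, padded with < and >."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(padded) - n + 1)
    }


clean = char_ngrams("document")
noisy = char_ngrams("docurnent")  # hypothetical OCR error: "m" read as "rn"

# N-grams present in both spellings; these let the noisy word
# still share part of its representation with the clean one.
shared = clean & noisy
print(sorted(shared))
```

Even though the two strings differ, they still share n-grams such as `<doc` and `ent>`, which is the intuition behind the robustness to OCR errors.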

![tSNE of word embeddings](tsne.png)

The problem with FastText is that it is still young: the paper was only published in 2016. The fine engineers provided working code with their paper, but it is written in C++ and the Python bindings they provide are somewhat minimal. There are also many wrappers available, but many either miss functionality, contain bugs, or don't match the current APIs.

So in this notebook I want to share the work I did to get fastText working as an sk-learn estimator, which means it can be put into a pipeline and used with packages like `LIME` and `ELI5`.

## System configuration

Before you can run this code, a few special steps are needed to get all the Python packages installed. The main package which makes all this possible is [skift](https://github.com/shaypal5/skift), which actually implements the sk-learn estimator for fastText.

Following these steps should ensure this notebook can be run. They can also be found on the skift GitHub page:

1. Run `pip install skift` in your python environment.
2. Clone https://github.com/facebookresearch/fastText and build the fasttext binary (using `make`).
3. In order to install the correct version of the fasttext python bindings, run
```
pip install git+https://github.com/facebookresearch/fastText.git@ca8c5face7d5f3a64fff0e4dfaf58d60a691cb7c
```

Now you should be ready to rumble!

## Importing the data

For this notebook we will be using the super-handy 20newsgroups dataset. It is a typical text classification dataset, consisting of newsgroup posts which we want to classify by topic.

The loader is included by default in sklearn, but it does need an internet connection to download the dataset.

In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Getting data and transforming it to a pandas dataframe
twenty_train = fetch_20newsgroups(
    subset='train',
    remove=('headers', 'footers'),
)
twenty_test = fetch_20newsgroups(
    subset='test',
    remove=('headers', 'footers'),
)

twenty_train_df = pd.DataFrame({
    'text': twenty_train.data,
    'target': twenty_train.target,
    'target_text': [twenty_train.target_names[i] 
                    for i in twenty_train.target]
})
twenty_test_df = pd.DataFrame({
    'text': twenty_test.data,
    'target': twenty_test.target,
    'target_text': [twenty_test.target_names[i] 
                    for i in twenty_test.target]
})

twenty_train_df.head()

| | text | target | target_text |
|---|---|---|---|
| 0 | I was wondering if anyone out there could enli... | 7 | rec.autos |
| 1 | A fair number of brave souls who upgraded thei... | 4 | comp.sys.mac.hardware |
| 2 | well folks, my mac plus finally gave up the gh... | 4 | comp.sys.mac.hardware |
| 3 | Robert J.C. Kyanko (rob@rjck.UUCP) wrote:\n> a... | 1 | comp.graphics |
| 4 | From article <C5owCB.n3p@world.std.com>, by to... | 14 | sci.space |


## Creating the Transformers we need

Because of how the FtClassifier is implemented, we need some extra transformers to get everything working:

* **RemoveWsChars:** FastText can't handle newlines, so this transformer replaces every run of whitespace with a single space.
* **CastToPandas:** The FtClassifier can't handle list inputs, but TextExplainer requires pipelines that can handle lists. We cast list inputs to a DataFrame, so we don't have to change the package's code.

These changes allow us to use GridSearchCV and TextExplainer.

In [2]:
from sklearn.base import BaseEstimator, TransformerMixin

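class RemoveWsChars(BaseEstimator, TransformerMixin):
    """Collapse every run of whitespace (including newlines) into one space."""

    def fit(self, *_):
        return self

    def transform(self, X, *_):
        # X is a pandas DataFrame; the raw string avoids an invalid "\s" escape
        return X.replace(r"\s+", " ", regex=True)


class CastToPandas(BaseEstimator, TransformerMixin):
    """Turn list inputs into a single-column DataFrame; pass the rest through."""

    def fit(self, *_):
        return self

    def transform(self, X, *_):
        if isinstance(X, list):
            return pd.DataFrame(X)
        return X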
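
As a quick sanity check of what these two steps do to the data, the sketch below applies the same pandas calls directly to a small, made-up list of strings (the sample texts are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical input: a plain list of strings, as TextExplainer would pass it.
docs = ["first line\nsecond line", "tabs\tand   spaces"]

# Same cast as CastToPandas: a list becomes a single-column DataFrame.
frame = pd.DataFrame(docs)

# Same cleaning as RemoveWsChars: collapse each whitespace run into one space.
cleaned = frame.replace(r"\s+", " ", regex=True)

print(cleaned[0].tolist())
```

The resulting DataFrame has one column (named `0`) holding the cleaned text, which is the shape the first-column classifier expects.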

## Creating the Pipeline

Another problem with this implementation is that, if we want to grid-search over parameters, we have to initialize FtClassifier with those parameters when we build the Pipeline.

This has to do with the get_params() method which, in this implementation, only shows the parameters given when the Estimator instance was created.
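
To illustrate the constraint, here is a minimal sketch using a hypothetical `KwargsEstimator` class. It is not skift's actual code; it only assumes an estimator whose `get_params()` reports the keyword arguments passed at construction, combined with scikit-learn-style validation in which `set_params` rejects names that `get_params()` does not list.

```python
class KwargsEstimator:
    """Hypothetical stand-in: stores only the kwargs it was created with."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def get_params(self, deep=True):
        # Only parameters given at construction time are visible.
        return dict(self.kwargs)

    def set_params(self, **params):
        # Mimic scikit-learn: unknown parameter names are rejected.
        unknown = set(params) - set(self.get_params())
        if unknown:
            raise ValueError(f"Invalid parameters: {sorted(unknown)}")
        self.kwargs.update(params)
        return self


est = KwargsEstimator(lr=0.1)
est.set_params(lr=0.5)  # works: lr was passed to the constructor

try:
    est.set_params(epoch=10)  # fails: epoch was never passed
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

Under these assumptions, a grid search can only vary parameters that were explicitly set when the estimator was created, which is why the pipeline below spells them all out.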

> Note that a lot of these parameters are for the Skipgram training. These aren't used when training with `pretrainedVectors`.

In [3]:
from sklearn.pipeline import Pipeline
from skift import FirstColFtClassifier as FtClassifier

ft_clf = Pipeline([
    ("casting", CastToPandas()),
    ("cleaning", RemoveWsChars()),
    ("fasttext", FtClassifier(lr=0.1,
                              dim=300,
                              ws=5,
                              epoch=50,
                              minCount=1,
                              minCountLabel=0,
                              minn=0,
                              maxn=0,
                              neg=5,
                              wordNgrams=1,
                              loss="softmax",
                              bucket=2000000,
                              thread=12,
                              lrUpdateRate=100,
                              t=1e-4,
                              pretrainedVectors="wiki-news-300d-1M.vec"))
])

# It works!
ft_clf.fit(twenty_train_df[["text"]], twenty_train_df.target)

Pipeline(memory=None,
     steps=[('casting', CastToPandas()), ('cleaning', RemoveWsChars()), ('fasttext', FirstColFtClassifier(bucket=2000000, dim=300, epoch=50, loss='softmax',
           lr=0.1, lrUpdateRate=100, maxn=0, minCount=1, minCountLabel=0,
           minn=0, neg=5, pretrainedVectors='wiki-news-300d-1M.vec',
           t=0.0001, thread=12, wordNgrams=1, ws=5))])

In [4]:
test_sentence = ["Hello. Yes, this is dog"]

print("Predicted Probabilities: \n\n",
      ft_clf.predict_proba(test_sentence), "\n")
print("Predicted Class:",
      twenty_train.target_names[int(ft_clf.predict(test_sentence)[0])])

Predicted Probabilities: 

 [[1.45242693e-04 1.95201011e-04 9.84643399e-02 6.70902460e-03
  3.47427488e-04 3.26935098e-04 2.66173911e-05 4.80787383e-05
  8.84267272e-01 1.73957531e-05 9.99873769e-06 1.00005881e-05
  9.35861500e-03 1.11199213e-05 1.00033352e-05 9.99802297e-06
  1.00035834e-05 9.99800388e-06 1.22712423e-05 1.04567449e-05]] 

Predicted Class: rec.motorcycles


## GridSearching our Pipeline

Here we simply show that our newly made Pipeline now works with GridSearchCV.

In [5]:
from sklearn.model_selection import GridSearchCV

params = {
    "fasttext__bucket": [10000],
    "fasttext__wordNgrams": [1],
}

gs = GridSearchCV(ft_clf, params)
gs.fit(twenty_train_df[["text"]], twenty_train_df.target)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('casting', CastToPandas()), ('cleaning', RemoveWsChars()), ('fasttext', FirstColFtClassifier(bucket=2000000, dim=300, epoch=50, loss='softmax',
           lr=0.1, lrUpdateRate=100, maxn=0, minCount=1, minCountLabel=0,
           minn=0, neg=5, pretrainedVectors='wiki-news-300d-1M.vec',
           t=0.0001, thread=12, wordNgrams=1, ws=5))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'fasttext__bucket': [10000], 'fasttext__wordNgrams': [1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [6]:
gs.best_score_

0.6979847975958989
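
The grid above contains a single combination, so it mainly confirms that the plumbing works. As a sketch of a wider search, the grid below uses hypothetical candidate values (chosen for illustration, not tuned recommendations), and `ParameterGrid` lists the combinations that a grid search would try. Keys follow the `<step name>__<parameter>` convention, and both parameters are ones the FtClassifier was initialized with in the Pipeline.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical candidate values, chosen only for illustration.
wider_params = {
    "fasttext__lr": [0.05, 0.1],
    "fasttext__wordNgrams": [1, 2],
}

# Expand the grid into the individual parameter combinations.
combinations = list(ParameterGrid(wider_params))
for combo in combinations:
    print(combo)
```

Passing `wider_params` to `GridSearchCV(ft_clf, wider_params)` in place of `params` would then fit the pipeline once per combination and cross-validation fold.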

## LIME and TextExplainer

Now that we have the Pipeline set up, we can very easily pass it to TextExplainer to generate some LIME visualizations that better explain the results.

Play around with `doc_ind` to get a feeling for the explanations.

In [31]:
from re import sub

doc_ind = 15
explain_doc = sub(r"\s+", " ", twenty_test_df.text.iloc[doc_ind])
print(explain_doc)

From article <C68uBG.K2w@world.std.com>, by cfw@world.std.com (Christopher F Wroten): > I have an EISA machine and I just do not understand why most > EISA video cards only match the performance of their ISA > counterparts. For instance, the EISA Orchid Pro Designer IIs-E is > only about as "fast" as the ISA Diamond SpeedStar Plus, which isn't > what I would call "fast." > > I don't understand why EISA video cards aren't, as a group, on the > same level of performance as Local Bus cards, given that EISA video > cards have a 32 bit bus to move data around, instead of ISA's 8 bits. > Good question. Answer: The EISA bus does move 32 bits rather than ISA's 8/(16?) But it still moves it at about the speed as the ISA bus. I think that's either 8 or 10 mhz. The local bus designs also move 32 bits like the EISA, but they move the data at the cpu speed, up to 40 mhz. So, on a 33mhz cpu, the local bus is moving 32bit data at 33 mhz, and the EISA is moving 32bit data at 8 or 10 mhz. So the local 

In [33]:
from eli5.lime import TextExplainer

te = TextExplainer()
te.fit(explain_doc, ft_clf.predict_proba)
print("Actual Class: ", 
      twenty_test.target_names[twenty_test_df.target.iloc[doc_ind]],
      "\n")
te.show_prediction(target_names=twenty_train.target_names, 
                   top_targets = 3, 
                   force_weights = True,
                   top = (10,10))

Actual Class:  comp.os.ms-windows.misc 



| Contribution? | Feature |
|---|---|
| +2.245 | bus |
| +1.291 | isa |
| +1.015 | eisa |
| +0.716 | cards |
| +0.672 | the |
| +0.569 | data |
| +0.480 | i |
| +0.379 | local |
| +0.369 | s |
| +0.296 | speed |

| Contribution? | Feature |
|---|---|
| +0.743 | a |
| +0.460 | 32bit data |
| +0.448 | the |
| +0.423 | as the |
| +0.397 | i think |
| +0.379 | as a |
| +0.354 | the local |
| +0.348 | data at |
| +0.337 | get a |
| +0.336 | than isa |

| Contribution? | Feature |
|---|---|
| +0.715 | diamond |
| +0.670 | move |
| +0.494 | file |
| +0.451 | about |
| +0.429 | have |
| +0.415 | but |
| +0.351 | more |
| +0.339 | performance |
| +0.318 | data at |
| +0.306 | that s |

| Contribution? | Feature |
|---|---|
| +5.226 | Highlighted in text (sum) |
| | … 120 more positive … |
| | … 109 more negative … |
| -1.097 | <BIAS> |

| Contribution? | Feature |
|---|---|
| | … 113 more positive … |
| | … 66 more negative … |
| -5.894 | Highlighted in text (sum) |

| Contribution? | Feature |
|---|---|
| | … 89 more positive … |
| | … 102 more negative … |
| -3.418 | Highlighted in text (sum) |
