# Understanding fasttext using LIME

## Importing the data

In [90]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Getting data and transforming it to a pandas dataframe
twenty_train = fetch_20newsgroups(
    subset='train',
    remove=('headers', 'footers'),
)
twenty_test = fetch_20newsgroups(
    subset='test',
    remove=('headers', 'footers'),
)

twenty_train_df = pd.DataFrame({
    'text': twenty_train.data,
    'target': twenty_train.target
})
twenty_test_df = pd.DataFrame({
    'text': twenty_test.data,
    'target': twenty_test.target
})

for cat in twenty_train.target_names:
    print(cat)

twenty_train_df.head()

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc


| | text | target |
|---|---|---|
| 0 | I was wondering if anyone out there could enli... | 7 |
| 1 | A fair number of brave souls who upgraded thei... | 4 |
| 2 | well folks, my mac plus finally gave up the gh... | 4 |
| 3 | Robert J.C. Kyanko (rob@rjck.UUCP) wrote:\n> a... | 1 |
| 4 | From article <C5owCB.n3p@world.std.com>, by to... | 14 |


## Creating the transformers we need

Because of how the `FtClassifier` wrapper is implemented, we need two extra transformers to get it working inside a scikit-learn Pipeline:

* **RemoveWsChars:** FastText can't handle newlines, so this transformer collapses every run of whitespace (newlines included) into a single space.
* **CastToPandas:** The FtClassifier can't handle list inputs, but TextExplainer requires a Pipeline that can. We convert list inputs to a DataFrame so that we don't have to change the package's code.

These changes allow us to use GridSearchCV and TextExplainer.

In [98]:
from sklearn.base import BaseEstimator, TransformerMixin

class RemoveWsChars(BaseEstimator, TransformerMixin):
    def fit(self, *_):
        return self

    def transform(self, X, *_):
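        # Collapse every run of whitespace (newlines included) into one space.
        # The raw string keeps "\s" from being read as a string escape.
        return X.replace(r"\s+", " ", regex=True)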


class CastToPandas(BaseEstimator, TransformerMixin):
    def fit(self, *_):
        return self

    def transform(self, X, *_):
        # TextExplainer passes plain lists; the classifier expects a DataFrame.
        if isinstance(X, list):
            return pd.DataFrame(X)
        return X

## Creating the Pipeline

Another problem with this implementation: if we want to grid-search over parameters, we have to pass them to FtClassifier explicitly when we build the Pipeline.

This is because the `get_params()` method, in this implementation, only reports the parameters that were given when the estimator instance was created.
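
To see why this matters, here is a minimal sketch using a hypothetical stand-in class, `KwargsOnlyEstimator`, which is not skift's actual code. It assumes only the behaviour described above (`get_params()` reports constructor arguments and nothing else) and mimics the way scikit-learn's `set_params()` rejects names that `get_params()` does not report.

```python
class KwargsOnlyEstimator:
    """Hypothetical estimator: get_params() only knows constructor kwargs."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def get_params(self, deep=True):
        return dict(self.kwargs)

    def set_params(self, **params):
        # Mimics scikit-learn: names missing from get_params() are rejected.
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            self.kwargs[name] = value
        return self


# Created without wordNgrams: a grid search could not set it later.
bare = KwargsOnlyEstimator()
try:
    bare.set_params(wordNgrams=2)
    outcome = "accepted"
except ValueError as exc:
    outcome = f"rejected: {exc}"
print(outcome)

# Created with wordNgrams: the same call now succeeds.
declared = KwargsOnlyEstimator(wordNgrams=1)
declared.set_params(wordNgrams=2)
print(declared.get_params())
```

GridSearchCV relies on `set_params()` to apply each candidate, which is why every parameter we want to search over is spelled out in the constructor call.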

> Note that many of these parameters only apply to skip-gram training. They aren't used when training with `pretrainedVectors`.

In [99]:
from sklearn.pipeline import Pipeline
from skift import FirstColFtClassifier as FtClassifier

ft_clf = Pipeline([
    ("casting", CastToPandas()),
    ("cleaning", RemoveWsChars()),
    ("fasttext", FtClassifier(lr=0.1,
                              dim=300,
                              ws=5,
                              epoch=50,
                              minCount=1,
                              minCountLabel=0,
                              minn=0,
                              maxn=0,
                              neg=5,
                              wordNgrams=1,
                              loss="softmax",
                              bucket=2000000,
                              thread=12,
                              lrUpdateRate=100,
                              t=1e-4,
                              pretrainedVectors="wiki-news-300d-1M.vec"))
])

# It works!
ft_clf.fit(twenty_train_df[["text"]], twenty_train_df.target)

Pipeline(memory=None,
     steps=[('casting', CastToPandas()), ('cleaning', RemoveWsChars()), ('fasttext', FirstColFtClassifier(bucket=2000000, dim=300, epoch=50, loss='softmax',
           lr=0.1, lrUpdateRate=100, maxn=0, minCount=1, minCountLabel=0,
           minn=0, neg=5, pretrainedVectors='wiki-news-300d-1M.vec',
           t=0.0001, thread=12, wordNgrams=1, ws=5))])

In [100]:
test_sentence = ["Hello. Yes, this is dog"]

print("Predicted Probabilities: \n\n",
      ft_clf.predict_proba(test_sentence), "\n")
print("Predicted Class:",
      twenty_train.target_names[int(ft_clf.predict(test_sentence)[0])])

Predicted Probabilities: 

 [[1.11318608e-04 2.08665675e-04 5.78481741e-02 5.02797030e-03
  1.67364051e-04 2.16755638e-04 2.57845804e-05 3.74646770e-05
  9.28558290e-01 1.43055177e-05 1.00004709e-05 1.00013667e-05
  7.89040886e-03 1.10070769e-05 1.00057650e-05 1.00000125e-05
  1.00050875e-05 1.00000034e-05 1.22522715e-05 1.03720531e-05]] 

Predicted Class: rec.motorcycles
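
The raw probability vector is hard to read at a glance. As a small sketch, the values printed above can be paired with the class names and sorted to show the most likely classes. The helper name `top_k` is hypothetical, and the probabilities are copied from the output above rather than recomputed.

```python
target_names = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]

# Probabilities copied from the predict_proba output above.
probs = [
    1.11318608e-04, 2.08665675e-04, 5.78481741e-02, 5.02797030e-03,
    1.67364051e-04, 2.16755638e-04, 2.57845804e-05, 3.74646770e-05,
    9.28558290e-01, 1.43055177e-05, 1.00004709e-05, 1.00013667e-05,
    7.89040886e-03, 1.10070769e-05, 1.00057650e-05, 1.00000125e-05,
    1.00050875e-05, 1.00000034e-05, 1.22522715e-05, 1.03720531e-05,
]


def top_k(probabilities, names, k=3):
    """Return the k (name, probability) pairs with the highest probability."""
    ranked = sorted(zip(names, probabilities), key=lambda p: p[1], reverse=True)
    return ranked[:k]


top3 = top_k(probs, target_names)
for name, prob in top3:
    print(f"{name:28s} {prob:.4f}")
```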


## GridSearching our Pipeline

Here we simply show that our newly built Pipeline works with GridSearchCV. The grid holds a single value per parameter, so only one candidate is evaluated.

In [101]:
from sklearn.model_selection import GridSearchCV

params = {
    "fasttext__bucket": [10000],
    "fasttext__wordNgrams": [1],
}

gs = GridSearchCV(ft_clf, params)
gs.fit(twenty_train_df[["text"]], twenty_train_df.target)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('casting', CastToPandas()), ('cleaning', RemoveWsChars()), ('fasttext', FirstColFtClassifier(bucket=2000000, dim=300, epoch=50, loss='softmax',
           lr=0.1, lrUpdateRate=100, maxn=0, minCount=1, minCountLabel=0,
           minn=0, neg=5, pretrainedVectors='wiki-news-300d-1M.vec',
           t=0.0001, thread=12, wordNgrams=1, ws=5))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'fasttext__bucket': [10000], 'fasttext__wordNgrams': [1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [102]:
gs.best_score_

0.6986035000883861
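
The grid above describes a single candidate. As a sketch of how a wider grid expands, the snippet below enumerates every combination with `itertools.product` and estimates the number of model fits. The extra parameter values and the fold count `n_folds = 3` are assumptions for illustration; check the `cv` default of your scikit-learn version.

```python
from itertools import product

# Hypothetical wider grid; the values are illustrative only.
param_grid = {
    "fasttext__bucket": [10000, 100000],
    "fasttext__wordNgrams": [1, 2, 3],
}

# Every combination of one value per parameter is one candidate.
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_folds = 3  # assumption: number of cross-validation folds
# One fit per candidate per fold, plus one final refit (refit=True).
total_fits = len(candidates) * n_folds + 1

for candidate in candidates:
    print(candidate)
print("Total fits:", total_fits)
```

Each fit here retrains fastText from scratch, so the grid size is worth checking before starting a search.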

## LIME and TextExplainer

Now that the Pipeline is set up, we can pass its `predict_proba` method to TextExplainer to generate LIME visualizations that help explain the results.

Play around with `doc_ind` to get a feeling for the explanations.

In [103]:
from re import sub

doc_ind = 12
# Collapse whitespace so the document is a single line (raw string for the regex).
explain_doc = sub(r"\s+", " ", twenty_test_df.text.iloc[doc_ind])
print(explain_doc)

In article <1993May13.201441.23139@nysernet.org> astein@nysernet.org (Alan Stein) writes: >It seems that, to keep the peace talks going, Israel has to keep >making goodwill gesture after goodwill gesture, while Palestinian >Arabs continue to go around hunting Jews. You *know* that putting something like this out on the newsgroup is *only* going to generate flames, not discussion. Try adding some substance to the issue of "gestures" you mentioned. > >If the peace talks are going to have any realistic chance of success, >the Arabs are going to have to start reciprocating, especially since >they are the ones who will be getting tangible concessions in return >for giving up only intangibles. What is it you feel that Israel *has* offered as a "gesture"? What would you (*realistically*) expect to see presented by the Arabs/Palestinians in the way of "gesture"? >If they keep trying to change the already agreed upon rules, which seems >to be one of their favorite games, the Israelis are not li

In [104]:
from eli5.lime import TextExplainer

te = TextExplainer()
te.fit(explain_doc, ft_clf.predict_proba)
print("Actual Class: ", 
      twenty_test.target_names[twenty_test_df.target.iloc[doc_ind]],
      "\n")
te.show_prediction(target_names=twenty_train.target_names)

Actual Class:  talk.politics.mideast 



| Table (in output order) | `<BIAS>` | Highlighted in text (sum) |
|---|---|---|
| 1 | -0.314 | -6.102 |
| 2 | -0.144 | -44.647 |
| 3 | -0.122 | -34.361 |
| 4 | -0.131 | -37.45 |
| 5 | -0.118 | -40.106 |
| 6 | -0.151 | -37.84 |
| 7 | 0.011 | -18.264 |
| 8 | -0.145 | -36.076 |
| 9 | -0.207 | -23.757 |
| 10 | -0.218 | -22.959 |
| 11 | -0.325 | -5.406 |
| 12 | -0.198 | -30.498 |
| 13 | -0.144 | -33.256 |
| 14 | -0.174 | -38.266 |
| 15 | -0.16 | -37.997 |
| 16 | -0.18 | -32.075 |
| 17 | -0.391 | -5.602 |
| 18 | -0.177 | 1.55 |
| 19 | -0.332 | -4.166 |
| 20 | -0.497 | -1.906 |
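
To give a feeling for where the `<BIAS>` and per-word contributions come from, here is a simplified, pure-Python sketch of the idea behind LIME: perturb the text by dropping words, query the black-box model on each variant, and fit a linear surrogate whose weights act as contributions. It is not eli5's implementation. `toy_predict` is a hypothetical stand-in for `ft_clf.predict_proba`, and the sketch omits the proximity weighting of samples that LIME applies.

```python
import random


def toy_predict(text):
    """Hypothetical black box: probability of one class for a text."""
    words = set(text.split())
    return 0.1 + 0.6 * ("peace" in words) + 0.2 * ("talks" in words)


def explain(text, predict, n_samples=200, steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    words = text.split()
    rows, targets = [], []
    for _ in range(n_samples):
        # 1. Perturb: keep each word with probability 0.5.
        keep = [1.0 if rng.random() < 0.5 else 0.0 for _ in words]
        perturbed = " ".join(w for w, k in zip(words, keep) if k)
        rows.append([1.0] + keep)  # leading 1.0 is the bias feature
        # 2. Query the black box on the perturbed text.
        targets.append(predict(perturbed))
    # 3. Fit a linear surrogate by gradient descent on squared error.
    coef = [0.0] * (len(words) + 1)
    for _ in range(steps):
        grad = [0.0] * len(coef)
        for row, y in zip(rows, targets):
            err = sum(c * x for c, x in zip(coef, row)) - y
            for j, x in enumerate(row):
                grad[j] += err * x / n_samples
        coef = [c - lr * g for c, g in zip(coef, grad)]
    return dict(zip(["<BIAS>"] + words, coef))


weights = explain("the peace talks continue today", toy_predict)
for name, value in weights.items():
    print(f"{name:10s} {value:+.3f}")
```

Because the toy black box is itself additive, the surrogate recovers its weights almost exactly. A real classifier is not linear, so the fitted weights are only a local approximation around the explained document.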
