# Debugging scikit-learn text classification pipeline




## 1. Baseline model

First, we need some data. Let's load the 20 Newsgroups dataset, keeping only four categories:

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42
)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


A basic text processing pipeline - bag of words features and Logistic Regression as a classifier:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, SGDClassifier
from sklearn.pipeline import make_pipeline

In [None]:
vec = CountVectorizer()
clf = LogisticRegression(max_iter=1000)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target);

Let's check the quality of this pipeline:

In [None]:
from sklearn import metrics

def print_report(pipe):
    y_test = twenty_test.target
    y_pred = pipe.predict(twenty_test.data)
    report = metrics.classification_report(y_test, y_pred, 
        target_names=twenty_test.target_names)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))
    
print_report(pipe)

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.74      0.83       319
         comp.graphics       0.81      0.95      0.87       389
               sci.med       0.90      0.85      0.88       396
soc.religion.christian       0.88      0.93      0.90       398

              accuracy                           0.87      1502
             macro avg       0.88      0.87      0.87      1502
          weighted avg       0.88      0.87      0.87      1502

accuracy: 0.874


Not bad. We could try other classifiers and preprocessing methods, but let's first check what the model learned, using the ``eli5.show_weights`` function:

In [1]:
!pip install eli5



In [None]:
import eli5



In [None]:
eli5.show_weights(clf, top=10)



Weight?,Feature
+0.502,x19218
+0.420,x5714
+0.406,x21167
+0.399,x6597
+0.377,x5718
+0.355,x23677
+0.350,x35350
+0.323,x6472
… 8455 more positive …,… 8455 more positive …
… 27324 more negative …,… 27324 more negative …

Weight?,Feature
+0.924,x15699
+0.781,<BIAS>
+0.429,x17366
+0.426,x14277
+0.423,x30117
+0.422,x17356
+0.420,x14281
+0.378,x7874
+0.373,x2148
… 10727 more positive …,… 10727 more positive …

Weight?,Feature
+0.504,x17854
+0.497,x25234
+0.496,x12026
+0.419,x11729
+0.418,x22379
+0.401,x32847
+0.356,x16328
… 14209 more positive …,… 14209 more positive …
… 21570 more negative …,… 21570 more negative …
-0.342,x7005

Weight?,Feature
+1.224,<BIAS>
+0.606,x28473
+0.556,x8609
+0.542,x8559
+0.499,x8553
+0.475,x15521
+0.442,x8544
… 9529 more positive …,… 9529 more positive …
… 26250 more negative …,… 26250 more negative …
-0.506,x25663


The tables above don't make much sense: eli5 was not able to get feature and class names from the classifier object alone, so it shows column indices (``x19218``, ``x5714``, ...) instead of words.

We can solve this by passing the **vectorizer** (and the class names via ``target_names``) and letting eli5 figure out the details automatically.

In [None]:
eli5.show_weights(clf, vec=vec, top=10,
                  target_names=twenty_test.target_names)

Weight?,Feature
+0.502,keith
+0.420,atheism
+0.406,mathew
+0.399,bible
+0.377,atheists
+0.355,okcforum
+0.350,writes
+0.323,benedikt
… 8455 more positive …,… 8455 more positive …
… 27324 more negative …,… 27324 more negative …

Weight?,Feature
+0.924,graphics
+0.781,<BIAS>
+0.429,images
+0.426,file
+0.423,software
+0.422,image
+0.420,files
+0.378,card
+0.373,3d
… 10727 more positive …,… 10727 more positive …

Weight?,Feature
+0.504,information
+0.497,pitt
+0.496,doctor
+0.419,disease
+0.418,msg
+0.401,treatment
+0.356,health
… 14209 more positive …,… 14209 more positive …
… 21570 more negative …,… 21570 more negative …
-0.342,book

Weight?,Feature
+1.224,<BIAS>
+0.606,rutgers
+0.556,church
+0.542,christians
+0.499,christian
+0.475,god
+0.442,christ
… 9529 more positive …,… 9529 more positive …
… 26250 more negative …,… 26250 more negative …
-0.506,posting


This starts to make more sense. Each table corresponds to one target class, in the order of ``target_names``, and lists features with their weights.

The intercept (bias) feature is shown as ``<BIAS>`` in the same table.

We can inspect features and weights because we're using a bag-of-words vectorizer and a linear classifier (so there is a direct mapping between individual words and classifier coefficients). 
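
To make that mapping concrete, here is a minimal sketch that pairs each coefficient with its word by hand. The toy corpus, the labels and the ``toy_`` names are made up for illustration; this is not how eli5 is implemented, only the idea it relies on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus (hypothetical), two classes: 0 = graphics, 1 = medicine.
docs = [
    "image file graphics card",
    "graphics software image",
    "doctor disease treatment",
    "health doctor treatment",
]
labels = [0, 0, 1, 1]

toy_vec = CountVectorizer()
X = toy_vec.fit_transform(docs)
toy_clf = LogisticRegression(max_iter=1000).fit(X, labels)

# vocabulary_ maps each word to its column index; invert it.
index_to_word = {idx: word for word, idx in toy_vec.vocabulary_.items()}

# Binary problem: coef_ has a single row; positive weights favour class 1.
weights = toy_clf.coef_[0]
ranked = sorted(
    ((weights[i], index_to_word[i]) for i in range(len(weights))),
    reverse=True,
)
for weight, word in ranked:
    print("{:+.3f}  {}".format(weight, word))
```

Column ``i`` of the count matrix is the count of ``index_to_word[i]``, and ``weights[i]`` multiplies exactly that column, which is why a weight can be read as "the weight of a word".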

Some features look good, but some don't. It seems the model learned some names specific to the dataset (parts of email addresses, etc.) instead of learning topic-specific words.

Let's check prediction results on an example:

In [None]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec, 
                     target_names=twenty_test.target_names)

# probability is the value after softmax function
# score is score before softmax function.

Contribution?,Feature
1.875,Highlighted in text (sum)
-2.027,<BIAS>

Contribution?,Feature
0.781,<BIAS>
-1.293,Highlighted in text (sum)

Contribution?,Feature
2.606,Highlighted in text (sum)
0.022,<BIAS>

Contribution?,Feature
1.224,<BIAS>
-3.188,Highlighted in text (sum)


Features that can be highlighted in the text are highlighted there; the separate tables list features that cannot be highlighted, ``<BIAS>`` in this case, together with the summed contribution of the highlighted words. Hovering over a highlighted word shows its weight in a tooltip, and words are colored according to their weights.
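
The relation between contributions, score and probability mentioned in the cell comments can be checked by hand. The sketch below copies the numbers from the four tables above and assumes they appear in the order of ``twenty_test.target_names``; it also assumes the probabilities come from a softmax over the class scores, as the comment states.

```python
import math

# Contributions copied from the four tables above: (highlighted-text sum, <BIAS>).
# Assumption: the tables appear in the order of twenty_test.target_names.
contributions = {
    "alt.atheism": (1.875, -2.027),
    "comp.graphics": (-1.293, 0.781),
    "sci.med": (2.606, 0.022),
    "soc.religion.christian": (-3.188, 1.224),
}

# Score of a class = sum of all feature contributions, including the bias.
scores = {name: text + bias for name, (text, bias) in contributions.items()}

# Softmax turns the scores into probabilities that sum to 1.
max_score = max(scores.values())  # subtract the max for numerical stability
exps = {name: math.exp(s - max_score) for name, s in scores.items()}
total = sum(exps.values())
probabilities = {name: e / total for name, e in exps.items()}

for name in scores:
    print("{:<24} score={:+.3f}  probability={:.3f}".format(
        name, scores[name], probabilities[name]))
```

The class with the largest score also gets the largest probability, so the highlighted words and the bias together fully account for the prediction.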

## 2. Baseline model, improved data

The highlighting above shows that the classifier did learn some uninteresting things, e.g. it memorized parts of email addresses.

We should probably **clean the data first** to make it more interesting; improving the model (trying different classifiers, etc.) doesn't make sense at this point, because it may just learn to leverage these email addresses better.

In practice we'd have to do the cleaning ourselves; in this example the **20 Newsgroups dataset provides an option to remove footers and headers** from the messages. Nice. Let's clean up the data and re-train the classifier.

In [None]:
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=['headers', 'footers'],
)

In [None]:
vec = CountVectorizer()
clf = LogisticRegression(max_iter=1000)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target);

We just made the task harder and more realistic for a classifier.
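
For a corpus without a built-in ``remove`` option, a first approximation of header removal could look like the sketch below. ``strip_header`` is a hypothetical helper, and it assumes that headers are separated from the body by the first blank line, as in Usenet or email messages; the sample message is made up.

```python
def strip_header(message: str) -> str:
    """Drop everything up to and including the first blank line.

    Hypothetical helper: assumes the header block is separated from
    the body by a blank line, as in Usenet/email messages.
    """
    _header, separator, body = message.partition("\n\n")
    # If there is no blank line, treat the whole message as body.
    return body if separator else message


raw = (
    "From: someone@example.edu (Some One)\n"
    "Subject: Re: kidney stones\n"
    "Organization: Example University\n"
    "\n"
    "Drinking more water is the usual advice.\n"
)
print(strip_header(raw))
```

Footers and quoted replies are harder to detect reliably, so real cleaning code usually needs dataset-specific rules on top of this.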

In [None]:
print_report(pipe)

                        precision    recall  f1-score   support

           alt.atheism       0.80      0.76      0.78       319
         comp.graphics       0.81      0.93      0.86       389
               sci.med       0.86      0.77      0.81       396
soc.religion.christian       0.86      0.86      0.86       398

              accuracy                           0.83      1502
             macro avg       0.83      0.83      0.83      1502
          weighted avg       0.83      0.83      0.83      1502

accuracy: 0.832


A great result: we just made quality worse! Does it mean the pipeline is worse now? No, it likely performs better on unseen messages; it is the evaluation that is fairer now. Inspecting the features used by the classifier allowed us to notice a problem with the data and make a good change, despite the numbers telling us not to.

So, what has the updated classifier learned? (The output is less verbose because only a subset of classes is shown; see the ``targets`` argument.)

In [None]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec, 
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])

Contribution?,Feature
1.677,Highlighted in text (sum)
0.092,<BIAS>


It no longer uses email addresses (e.g. from: brian@ucsd.edu (brian kantor)), but it still doesn't look good: the classifier assigns high weights to seemingly unrelated words like 'do' or 'my'. These words appear in many texts, so maybe the classifier uses them as a proxy for bias. Or maybe some of them are more common in some of the classes.
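
One quick way to see whether such words are candidates for removal is to test them against scikit-learn's built-in English stop-word list and to compare vocabularies with and without filtering. The sentence below is a made-up example, not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Check the suspicious words against scikit-learn's built-in English list.
for word in ("do", "my"):
    print(word, word in ENGLISH_STOP_WORDS)

# Toy sentence (hypothetical) to compare vocabularies with and without filtering.
docs = ["do not stop taking my medication without asking a doctor"]

plain_vocab = set(CountVectorizer().fit(docs).vocabulary_)
filtered_vocab = set(CountVectorizer(stop_words="english").fit(docs).vocabulary_)

print(sorted(plain_vocab - filtered_vocab))  # words removed by the filter
print(sorted(filtered_vocab))                # words that remain
```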

## 3. Pipeline improvements

To help the classifier, we can filter out stop words:

In [None]:
vec = CountVectorizer(stop_words='english')
clf = LogisticRegression(max_iter=1000)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)

                        precision    recall  f1-score   support

           alt.atheism       0.85      0.77      0.81       319
         comp.graphics       0.84      0.94      0.89       389
               sci.med       0.92      0.85      0.88       396
soc.religion.christian       0.85      0.89      0.87       398

              accuracy                           0.87      1502
             macro avg       0.87      0.86      0.86      1502
          weighted avg       0.87      0.87      0.87      1502

accuracy: 0.866


In [None]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec, 
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])

Contribution?,Feature
1.299,Highlighted in text (sum)
0.137,<BIAS>



Alternatively, we can use the TF\*IDF scheme; it should give a somewhat similar effect.
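
The effect is similar because the IDF part scales down words that occur in many documents instead of dropping them. A small sketch on a made-up corpus, where 'my' occurs in every document and 'doctor' in only one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "my" occurs in every document, "doctor" in one.
docs = [
    "my doctor called",
    "my image file",
    "my church choir",
]

toy_vec = TfidfVectorizer()
toy_vec.fit(docs)

# idf_ holds one inverse-document-frequency value per vocabulary column.
idf = {word: toy_vec.idf_[idx] for word, idx in toy_vec.vocabulary_.items()}
for word in ("my", "doctor"):
    print("{:<8} idf={:.3f}".format(word, idf[word]))
```

A word present everywhere gets the smallest possible IDF, so it contributes less to each document vector, but unlike a stop word it is still a feature the classifier can use.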



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)

                        precision    recall  f1-score   support

           alt.atheism       0.91      0.75      0.82       319
         comp.graphics       0.83      0.95      0.89       389
               sci.med       0.90      0.87      0.89       396
soc.religion.christian       0.85      0.88      0.87       398

              accuracy                           0.87      1502
             macro avg       0.87      0.86      0.87      1502
          weighted avg       0.87      0.87      0.87      1502

accuracy: 0.869


In [None]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec, 
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])

Contribution?,Feature
0.827,Highlighted in text (sum)
0.315,<BIAS>


It helped, but didn't have quite the same effect. Why not do both?

In [None]:
vec = TfidfVectorizer(stop_words='english')
clf = LogisticRegression(max_iter=1000)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)

                        precision    recall  f1-score   support

           alt.atheism       0.93      0.74      0.82       319
         comp.graphics       0.90      0.97      0.93       389
               sci.med       0.92      0.92      0.92       396
soc.religion.christian       0.86      0.93      0.89       398

              accuracy                           0.90      1502
             macro avg       0.90      0.89      0.89      1502
          weighted avg       0.90      0.90      0.89      1502

accuracy: 0.896


In [None]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec, 
                     target_names=twenty_test.target_names,
                     targets=['sci.med'])

Contribution?,Feature
0.793,Highlighted in text (sum)
0.27,<BIAS>



## 4. Char-based pipeline

Let's try another analyzer and use character n-grams instead of words:

In [None]:
# stop_words only applies to analyzer='word', so it is omitted here
vec = TfidfVectorizer(analyzer='char', ngram_range=(3, 5))
clf = LogisticRegression(max_iter=500)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.74      0.83       319
         comp.graphics       0.81      0.95      0.87       389
               sci.med       0.90      0.85      0.88       396
soc.religion.christian       0.88      0.93      0.90       398

              accuracy                           0.87      1502
             macro avg       0.88      0.87      0.87      1502
          weighted avg       0.88      0.87      0.87      1502

accuracy: 0.874


In [None]:
eli5.show_prediction(clf, twenty_test.data[0], vec=vec, 
                     target_names=twenty_test.target_names)

Contribution?,Feature
-0.838,Highlighted in text (sum)
-6.48,<BIAS>

Contribution?,Feature
0.934,<BIAS>
-6.052,Highlighted in text (sum)

Contribution?,Feature
4.493,Highlighted in text (sum)
-5.149,<BIAS>

Contribution?,Feature
0.6,Highlighted in text (sum)
-5.648,<BIAS>


It works, but quality is a bit worse than with the word-based TF\*IDF pipeline with stop-word filtering (accuracy 0.874 vs. 0.896). Also, it takes ages to train.
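
A likely reason for the slow training is the size of the feature space: every position in the text starts several overlapping n-grams. The sketch below uses a hypothetical ``char_ngrams`` helper that mimics, in simplified form, what ``analyzer='char'`` with ``ngram_range=(3, 5)`` extracts; it is not scikit-learn's own code.

```python
def char_ngrams(text, min_n=3, max_n=5):
    """Return all character n-grams of `text` for min_n <= n <= max_n.

    Hypothetical helper that mimics, in simplified form, what
    analyzer='char' with ngram_range=(3, 5) extracts from a string.
    """
    text = text.lower()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


sentence = "kidney stones hurt"
words = sentence.split()
grams = char_ngrams(sentence)

print(char_ngrams("image"))
print("{} words vs {} character n-grams".format(len(words), len(grams)))
```

Note that n-grams also span spaces between words, so they capture short word combinations as well as word fragments.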



If the vocabulary (the number of words or other features) is huge, consider using ``HashingVectorizer``, or performing Latent Semantic Analysis (``TruncatedSVD``) before applying the classification algorithm.
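
Both options can be sketched as pipelines. The toy corpus, ``n_features=2**10`` and ``n_components=2`` below are arbitrary values chosen for illustration, not tuned settings for 20 Newsgroups.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical); 0 = graphics, 1 = medicine.
docs = [
    "image file graphics card",
    "graphics software image",
    "doctor disease treatment",
    "health doctor treatment",
]
labels = [0, 0, 1, 1]

# Option 1: hash words into a fixed number of columns; no vocabulary is stored.
hashing_pipe = make_pipeline(
    HashingVectorizer(n_features=2**10),  # n_features chosen arbitrarily here
    LogisticRegression(max_iter=1000),
)
hashing_pipe.fit(docs, labels)

# Option 2: LSA, i.e. TF-IDF followed by truncated SVD to a few dense components.
lsa_pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),  # tiny value for the toy data
    LogisticRegression(max_iter=1000),
)
lsa_pipe.fit(docs, labels)

print(hashing_pipe.predict(["graphics card"]))
print(lsa_pipe.predict(["doctor treatment"]))
```

Both choices trade away interpretability: hashed columns have no stored word names, and SVD components are mixtures of many words, so the direct word-to-coefficient inspection used above no longer applies as is.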