In [5]:
from sklearn.datasets import fetch_20newsgroups

# Uncomment this to learn more about the API.
# help(fetch_20newsgroups)

We load the 20 Newsgroups data, keeping only 4 categories.

In [6]:
categories = ["alt.atheism", "soc.religion.christian", "comp.graphics", "sci.med"]

# We use a fixed random state to ensure the shuffle is deterministic.
twenty_train = fetch_20newsgroups(
    subset="train",
    categories=categories,
    shuffle=True,
    random_state=42,
)
twenty_test = fetch_20newsgroups(
    subset="test",
    categories=categories,
    shuffle=True,
    random_state=42,
)

We create a basic text-processing pipeline: bag-of-words features with logistic regression as the classifier.

See [How Count Vectorizer works](./refresher/001_how_count_vectorizer_works.md)
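
As a quick illustration of bag-of-words features, here is a minimal sketch on a two-document toy corpus. The `toy_*` names and the documents are made up for this example and are not part of the notebook's pipeline. Each column corresponds to one vocabulary word and each cell holds that word's count in the document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
toy_docs = ["the cat sat", "the cat sat on the mat"]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index; sort the words by that index.
toy_vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
toy_rows = toy_counts.toarray().tolist()

print(toy_vocab)  # ['cat', 'mat', 'on', 'sat', 'the']
print(toy_rows)   # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```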

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

vec = CountVectorizer()

# We increase the number of iterations to avoid the warning below:
# ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT
clf = LogisticRegressionCV(max_iter=1_000)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)
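
Later cells pass `vec` and `clf` to eli5 directly instead of going through `pipe`. That works because, with default settings, the pipeline fits the very objects it was given rather than copies. A small sketch on invented toy data (plain `LogisticRegression` stands in for `LogisticRegressionCV` to keep it short; all `toy_*` names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
toy_docs = ["image pixel", "pixel file", "doctor health", "health pain"]
toy_labels = [0, 0, 1, 1]

toy_vec = CountVectorizer()
toy_clf = LogisticRegression()
toy_pipe = make_pipeline(toy_vec, toy_clf)
toy_pipe.fit(toy_docs, toy_labels)

# make_pipeline names each step after its lowercased class name.
same_vec = toy_pipe.named_steps["countvectorizer"] is toy_vec
same_clf = toy_pipe.named_steps["logisticregression"] is toy_clf

print(same_vec, same_clf)
print(hasattr(toy_vec, "vocabulary_"), hasattr(toy_clf, "coef_"))
```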

To check the quality of the model, we run it against the test data.

In [47]:
from sklearn import metrics


def print_report(pipe):
    y_test = twenty_test.target
    y_pred = pipe.predict(twenty_test.data)
    report = metrics.classification_report(
        y_test, y_pred, target_names=twenty_test.target_names
    )
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))

In [48]:
print_report(pipe)

                        precision    recall  f1-score   support

           alt.atheism       0.91      0.81      0.85       319
         comp.graphics       0.86      0.94      0.90       389
               sci.med       0.92      0.81      0.86       396
soc.religion.christian       0.88      0.98      0.92       398

              accuracy                           0.89      1502
             macro avg       0.89      0.89      0.89      1502
          weighted avg       0.89      0.89      0.89      1502

accuracy: 0.889
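
For reference, the per-class columns of the report can be computed by hand from true and predicted labels. The sketch below does this for one class on invented toy labels; `per_class_scores` is a hypothetical helper, not part of scikit-learn.

```python
# Toy labels and predictions, invented for illustration.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


precision, recall, f1 = per_class_scores(y_true, y_pred, label=1)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```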


The model achieves a relatively high accuracy. To understand what the model learned, we use the `eli5.show_weights` function. The function accepts the classifier as the first argument:

In [55]:
import eli5

eli5.show_weights(clf, vec=vec, top=10, target_names=twenty_test.target_names)

Weight?,Feature
+0.567,keith
+0.482,atheism
+0.472,mathew
+0.471,bible
+0.429,atheists
+0.415,okcforum
+0.392,writes
+0.376,benedikt
… 9550 more positive …,… 9550 more positive …
… 26229 more negative …,… 26229 more negative …

Weight?,Feature
+1.083,graphics
+0.835,<BIAS>
+0.508,images
+0.496,software
+0.495,file
+0.489,image
+0.487,files
+0.453,package
+0.450,card
… 14331 more positive …,… 14331 more positive …

Weight?,Feature
+0.589,information
+0.580,pitt
+0.579,doctor
+0.486,disease
+0.477,msg
+0.471,treatment
+0.418,health
+0.409,radford
… 13222 more positive …,… 13222 more positive …
… 22557 more negative …,… 22557 more negative …

Weight?,Feature
+1.261,<BIAS>
+0.705,rutgers
+0.643,church
+0.631,christians
+0.584,christian
+0.543,god
+0.510,christ
… 10623 more positive …,… 10623 more positive …
… 25156 more negative …,… 25156 more negative …
-0.563,posting


How to interpret:

- Each table (a column in the rendered notebook) corresponds to one target class, in the order of `twenty_test.target_names`.
- Each table lists features and their weights.
- The intercept (bias) feature is shown as `<BIAS>` in the same table.
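
For a linear model, these weights are the entries of the fitted coefficient matrix. The sketch below reproduces the idea by hand on a toy corpus. The documents, labels, and `toy_*` names are invented, and plain `LogisticRegression` stands in for `LogisticRegressionCV` to keep it small. Each row of `coef_` belongs to one class, and sorting a row gives that class's top features.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus with three classes, invented for illustration only.
toy_docs = [
    "image pixel render", "pixel image file",
    "doctor health disease", "health doctor pain",
    "church god christ", "god church faith",
]
toy_labels = [0, 0, 1, 1, 2, 2]
toy_names = ["graphics", "med", "religion"]

toy_vec = CountVectorizer()
toy_clf = LogisticRegression().fit(toy_vec.fit_transform(toy_docs), toy_labels)

# Column index -> word, derived from the fitted vocabulary.
index_to_word = {int(idx): word for word, idx in toy_vec.vocabulary_.items()}

top_features = {}
for row, name in zip(toy_clf.coef_, toy_names):
    top_idx = np.argsort(row)[::-1][:2]  # indices of the two largest weights
    top_features[name] = [index_to_word[int(i)] for i in top_idx]

print(toy_clf.coef_.shape)  # (n_classes, n_features)
print(top_features)
```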


Documents containing words such as `graphics` and `images` are strongly associated with the class `comp.graphics`.

In [56]:
eli5.show_prediction(
    clf, twenty_test.data[0], vec=vec, target_names=twenty_test.target_names
)

Contribution?,Feature
1.968,Highlighted in text (sum)
-2.093,<BIAS>

Contribution?,Feature
0.835,<BIAS>
-1.567,Highlighted in text (sum)

Contribution?,Feature
3.106,Highlighted in text (sum)
-0.003,<BIAS>

Contribution?,Feature
1.261,<BIAS>
-3.507,Highlighted in text (sum)


We can visualize how the predictions are made, along with the highlighted features (words) that contribute to each prediction.

> NOTE: Hovering on the highlighted words shows you the weight of the word.
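
Each table above splits a class score into the summed contribution of the highlighted words and the `<BIAS>` term. For a linear model on count features, a word's contribution is its weight times its count in the document. A minimal sketch of that arithmetic, using three `sci.med` weights from the `show_weights` output and the `sci.med` bias from the prediction output; the sentence is invented for illustration:

```python
from collections import Counter

# Weights and bias copied from the sci.med outputs above; the sentence is invented.
weights = {"doctor": 0.579, "disease": 0.486, "treatment": 0.471}
bias = -0.003

tokens = "the doctor said the treatment helps and the doctor agreed".split()
counts = Counter(tokens)

# Contribution of a word = weight * count. Only the three listed words are
# considered here; the real model has a weight for every vocabulary word.
contributions = {word: weights[word] * counts[word] for word in weights if counts[word]}
text_sum = sum(contributions.values())
score = text_sum + bias

print(contributions)
print(round(text_sum, 3), round(score, 3))
```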


We notice that the classifier learned some uninformative features, such as email addresses. To improve the model, we should clean the data.

We can take that step by first removing the `headers` and `footers`.

In [57]:
twenty_train = fetch_20newsgroups(
    subset="train",
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=["headers", "footers"],
)

twenty_test = fetch_20newsgroups(
    subset="test",
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=["headers", "footers"],
)

vec = CountVectorizer()
clf = LogisticRegressionCV(max_iter=1_000)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target);
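
As a rough picture of what removing `headers` means: a newsgroup post's header block ends at the first blank line, so dropping everything before it leaves the body. The helper below is a simplified, hypothetical stand-in and not scikit-learn's actual implementation; the sample post is invented.

```python
def strip_header(post: str) -> str:
    """Return the text after the first blank line (the body of the post).

    If the post contains no blank line, the result is an empty string.
    """
    _header, _blank, body = post.partition("\n\n")
    return body


# Invented example post: two header lines, a blank line, then the body.
toy_post = (
    "From: someone@example.edu\n"
    "Subject: Re: image formats\n"
    "\n"
    "Which file format should I use?\n"
)
print(strip_header(toy_post))
```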

In [58]:
print_report(pipe)

                        precision    recall  f1-score   support

           alt.atheism       0.81      0.76      0.79       319
         comp.graphics       0.82      0.93      0.87       389
               sci.med       0.87      0.78      0.82       396
soc.religion.christian       0.86      0.88      0.87       398

              accuracy                           0.84      1502
             macro avg       0.84      0.84      0.84      1502
          weighted avg       0.84      0.84      0.84      1502

accuracy: 0.840


The accuracy of the model drops from `0.889` to `0.840`, which suggests that the model was relying on information in the headers and footers.

In [59]:
eli5.show_prediction(
    clf, twenty_test.data[0], vec=vec, target_names=twenty_test.target_names
)



Contribution?,Feature
-0.58,Highlighted in text (sum)
-1.002,<BIAS>

Contribution?,Feature
0.959,<BIAS>
-1.529,Highlighted in text (sum)

Contribution?,Feature
1.435,Highlighted in text (sum)
0.112,<BIAS>

Contribution?,Feature
0.675,Highlighted in text (sum)
-0.069,<BIAS>


Email addresses no longer appear in the result, but the classifier still doesn't look good: it seems to assign high weights to unrelated words such as `my` and `to`.

## Pipeline Improvement

To improve the classifier, we may filter out stop words.

In [60]:
vec = CountVectorizer(stop_words="english")
clf = LogisticRegressionCV(max_iter=1_000)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)

                        precision    recall  f1-score   support

           alt.atheism       0.86      0.76      0.81       319
         comp.graphics       0.85      0.94      0.89       389
               sci.med       0.92      0.85      0.88       396
soc.religion.christian       0.86      0.89      0.87       398

              accuracy                           0.87      1502
             macro avg       0.87      0.86      0.86      1502
          weighted avg       0.87      0.87      0.87      1502

accuracy: 0.868
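
To see the effect of `stop_words="english"` in isolation, the sketch below compares the vocabulary learned with and without the filter on one invented sentence (the `plain_vocab` and `filtered_vocab` names are made up for this example):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentence containing the stop words "my", "the" and "to".
toy_doc = ["my doctor sent the images to my hospital"]

plain_vocab = set(CountVectorizer().fit(toy_doc).vocabulary_)
filtered_vocab = set(CountVectorizer(stop_words="english").fit(toy_doc).vocabulary_)

# Words that the stop-word filter removed from the vocabulary.
print(sorted(plain_vocab - filtered_vocab))
```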


In [61]:
eli5.show_prediction(
    clf, twenty_test.data[0], vec=vec, target_names=twenty_test.target_names
)



Contribution?,Feature
-0.084,Highlighted in text (sum)
-0.954,<BIAS>

Contribution?,Feature
0.732,<BIAS>
-0.654,Highlighted in text (sum)

Contribution?,Feature
1.035,Highlighted in text (sum)
0.176,<BIAS>

Contribution?,Feature
0.046,<BIAS>
-0.297,Highlighted in text (sum)


The accuracy of the model is now `0.868`, which is better than the previous run (`0.840`) but still not as good as the first (`0.889`).

However, the highlighted words seem to be more relevant.

We can also experiment with a TF-IDF vectorizer.

In [64]:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(stop_words="english")
clf = LogisticRegressionCV(max_iter=1_000)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)

                        precision    recall  f1-score   support

           alt.atheism       0.92      0.80      0.86       319
         comp.graphics       0.90      0.96      0.93       389
               sci.med       0.95      0.92      0.93       396
soc.religion.christian       0.89      0.94      0.91       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502

accuracy: 0.911
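
TF-IDF down-weights words that occur in many documents. With default settings, scikit-learn's `TfidfVectorizer` uses a smoothed inverse document frequency, `ln((1 + n_docs) / (1 + doc_freq)) + 1`, and then L2-normalizes each row. A minimal sketch on an invented three-document corpus, where `doctor` appears everywhere and `pain` only once:

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
toy_docs = ["doctor pain", "doctor health", "doctor cancer"]

toy_tfidf = TfidfVectorizer()
toy_X = toy_tfidf.fit_transform(toy_docs)

# Learned idf value for each word.
idf = {word: float(toy_tfidf.idf_[idx]) for word, idx in toy_tfidf.vocabulary_.items()}

# Smoothed idf computed by hand: ln((1 + n_docs) / (1 + doc_freq)) + 1
expected_common = math.log((1 + 3) / (1 + 3)) + 1  # "doctor" is in all 3 documents
expected_rare = math.log((1 + 3) / (1 + 1)) + 1    # "pain" is in 1 document

print(idf["doctor"], expected_common)
print(idf["pain"], expected_rare)
```

In the first document, the rarer word `pain` therefore ends up with a larger TF-IDF value than `doctor`, even though both occur once.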


In [65]:
eli5.show_prediction(
    clf,
    twenty_test.data[0],
    vec=vec,
    target_names=twenty_test.target_names,
    targets=["sci.med"],
)



Contribution?,Feature
1.803,Highlighted in text (sum)
0.245,<BIAS>


## Debugging HashingVectorizer

Next, we try word n-grams (unigrams and bigrams).

To handle large vocabularies, we can use `HashingVectorizer`; to make training faster, we can use `SGDClassifier`:

In [71]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(stop_words="english", ngram_range=(1, 2))
clf = SGDClassifier(max_iter=20, random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(twenty_train.data, twenty_train.target)

print_report(pipe)

                        precision    recall  f1-score   support

           alt.atheism       0.92      0.79      0.85       319
         comp.graphics       0.87      0.96      0.92       389
               sci.med       0.93      0.89      0.91       396
soc.religion.christian       0.88      0.93      0.91       398

              accuracy                           0.90      1502
             macro avg       0.90      0.89      0.90      1502
          weighted avg       0.90      0.90      0.90      1502

accuracy: 0.899


The classifier ran super fast. Let's check what the model learned.

In [72]:
eli5.show_prediction(
    clf,
    twenty_test.data[0],
    vec=vec,
    target_names=twenty_test.target_names,
    targets=["sci.med"],
)

Contribution?,Feature
0.783,Highlighted in text (sum)
-0.558,<BIAS>


The result looks similar to `CountVectorizer`. But with `HashingVectorizer` we don't even have a vocabulary. How does this work?
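
The absence of a vocabulary can be checked directly. In the sketch below (the `toy_*` names and the input text are invented), the vectorizer transforms text without being fitted, stores no `vocabulary_`, and places each n-gram in a column chosen by a hash function:

```python
from sklearn.feature_extraction.text import HashingVectorizer

toy_hasher = HashingVectorizer(stop_words="english", ngram_range=(1, 2))

# No fit is needed: the column of each n-gram comes from a hash function.
toy_X = toy_hasher.transform(["graphics card drivers"])

print(toy_X.shape)                          # (1, n_features); n_features defaults to 2**20
print(hasattr(toy_hasher, "vocabulary_"))   # whether a vocabulary is stored
print(sorted(toy_X.nonzero()[1].tolist()))  # column indices used by this document
```

Only these column indices are available to describe the features, since the hash cannot be reversed without seeing the original terms.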

In [74]:
eli5.show_weights(clf, vec=vec, top=10, target_names=twenty_test.target_names)

Weight?,Feature
+2.709,x199378
+2.601,x938889
+1.820,x349126
+1.691,x718537
+1.562,x242643
+1.522,x71928
… 53649 more positive …,… 53649 more positive …
… 53981 more negative …,… 53981 more negative …
-1.694,x683213
-1.757,x741207

Weight?,Feature
+3.530,x580586
+1.848,x342790
+1.747,x771885
+1.716,x363686
+1.663,x111283
… 32802 more positive …,… 32802 more positive …
… 32516 more negative …,… 32516 more negative …
-1.643,x1031983
-1.669,x85557
-1.851,x120354

Weight?,Feature
+2.206,x988761
+2.135,x337555
+1.937,x154565
+1.683,x806262
… 46295 more positive …,… 46295 more positive …
… 46148 more negative …,… 46148 more negative …
-1.661,x34701
-1.721,x354651
-1.734,x790864
-1.956,x85557

Weight?,Feature
+3.150,x641063
+2.955,x199709
+2.793,x741207
+2.034,x396081
+1.778,x274863
… 55186 more positive …,… 55186 more positive …
… 55313 more negative …,… 55313 more negative …
-1.983,x672777
-2.066,x443433
-2.087,x199378


The feature names are not what we expect: since we don't have a vocabulary, we don't have feature names. We can use `InvertableHashingVectorizer` to get feature names for `HashingVectorizer` without fitting a huge vocabulary.

We still need some data to learn the word-to-hash mapping, and we can use a random subset of the data to fit it.
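
The idea can be pictured with a small pure-Python sketch. This is only a conceptual illustration, not how eli5 or scikit-learn implement it: CRC32 is used as a stand-in hash, the bucket count is tiny, and all names (`bucket_of`, `learn_inverse`, `bucket_name`) are hypothetical. Terms seen in the sample can be mapped back from their bucket, several terms may share a bucket (a collision), and buckets with no seen term keep a raw index name.

```python
import zlib
from collections import defaultdict

N_BUCKETS = 16  # tiny on purpose so collisions are likely; real vectorizers use far more


def bucket_of(term: str) -> int:
    """Map a term to a column index with a deterministic hash (CRC32 here)."""
    return zlib.crc32(term.encode("utf-8")) % N_BUCKETS


def learn_inverse(sample_docs):
    """Record which terms from the sample land in each bucket."""
    inverse = defaultdict(set)
    for doc in sample_docs:
        for term in doc.lower().split():
            inverse[bucket_of(term)].add(term)
    return inverse


def bucket_name(inverse, bucket: int) -> str:
    """Readable name for a bucket; fall back to the raw index if no term was seen."""
    terms = sorted(inverse.get(bucket, ()))
    return " | ".join(terms) if terms else f"x{bucket}"


# Invented sample documents.
sample_docs = ["church christ heaven", "graphics image file", "doctor disease health"]
inverse = learn_inverse(sample_docs)
for bucket in range(N_BUCKETS):
    print(bucket, bucket_name(inverse, bucket))
```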

In [75]:
import numpy as np
from eli5.sklearn import InvertableHashingVectorizer

In [77]:
ivec = InvertableHashingVectorizer(vec)
sample_size = len(twenty_train.data) // 10
X_sample = np.random.choice(twenty_train.data, size=sample_size)
ivec.fit(X_sample)
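
Note that `np.random.choice`, as called above, samples with replacement and without a fixed seed, so the sample (and therefore the recovered feature names) can differ between runs. If reproducibility matters, one possible variant uses a seeded generator and samples without replacement. It is shown here on an invented list of documents standing in for `twenty_train.data`:

```python
import numpy as np

# Stand-in for twenty_train.data, invented for illustration.
toy_docs = [f"document number {i}" for i in range(50)]

rng = np.random.default_rng(42)  # fixed seed for a repeatable sample
toy_sample_size = len(toy_docs) // 10
toy_sample = rng.choice(toy_docs, size=toy_sample_size, replace=False)

# Sample size and number of distinct documents in the sample.
print(len(toy_sample), len(set(toy_sample)))
```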

In [78]:
eli5.show_weights(clf, vec=ivec, top=20, target_names=twenty_test.target_names)

Weight?,Feature
+2.709,atheism
+2.601,writes
+1.820,motto
+1.694,morality
+1.562,religion
+1.522,islam
+1.502,wrote
+1.451,keith
+1.392,objective
+1.357,livesey

Weight?,Feature
+3.530,graphics
+2.257,image
+1.991,3d
+1.851,files
+1.848,code
+1.747,images
+1.716,software
+1.663,file
+1.643,card
+1.600,computer

Weight?,Feature
+2.206,health
+2.135,msg
+2.081,disease
+2.006,treatment
+1.937,doctor
+1.734,pain
+1.683,com
+1.661,medical
+1.474,effects
+1.459,cancer

Weight?,Feature
+3.150,christians
+3.136,church
+2.955,christ
+2.793,rutgers
+2.756,rutgers edu
+2.066,christian
+2.034,heaven
+1.983,love
+1.778,athos
+1.778,athos rutgers


There are collisions (hover the mouse over features with "..."), and there are important features which were not seen in the random sample (`FEATURE[...]`), but overall it looks fine.

The `rutgers edu` bigram is suspicious though; it looks like part of an email address.

In [80]:
rutgers_example = [x for x in twenty_train.data if "rutgers" in x.lower()][0]
rutgers_example

"In article <Apr.8.00.57.41.1993.28246@athos.rutgers.edu> REXLEX@fnal.gov writes:\n>In article <Apr.7.01.56.56.1993.22824@athos.rutgers.edu> shrum@hpfcso.fc.hp.com\n>Matt. 22:9-14 'Go therefore to the main highways, and as many as you find\n>there, invite to the wedding feast.'...\n\n>hmmmmmm.  Sounds like your theology and Christ's are at odds. Which one am I \n>to believe?"

Yup, it seems like the model learned this email address instead of something useful.

In [81]:
eli5.show_prediction(
    clf,
    rutgers_example,
    vec=vec,
    target_names=twenty_test.target_names,
    targets=["soc.religion.christian"],
)

Contribution?,Feature
2.645,Highlighted in text (sum)
-0.667,<BIAS>
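
One possible next cleaning step, sketched here as an assumption rather than something this notebook does, is to drop email-like tokens before vectorizing. The helper name `strip_emails` and the regular expression are illustrative, and the pattern is deliberately crude.

```python
import re

# Any run of non-space characters containing "@" is treated as an email-like token.
EMAIL_LIKE = re.compile(r"\S*@\S*")


def strip_emails(text: str) -> str:
    """Replace email-like tokens with a single space."""
    return EMAIL_LIKE.sub(" ", text)


line = "In article <Apr.8.00.57.41.1993.28246@athos.rutgers.edu> REXLEX@fnal.gov writes:"
cleaned = strip_emails(line)
print(cleaned)
```

Such a function could be applied to every document before fitting, for example with a list comprehension over `twenty_train.data`.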
