In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

Load the 20 Newsgroups dataset (by default, `fetch_20newsgroups` returns the training subset).

In [2]:
data = fetch_20newsgroups()

Use `HashingVectorizer` to encode the text into sparse features.

I found out that English has about 1M unique words, so the default n_features=2**20 (1,048,576 hash buckets) should be enough to keep hash collisions rare.
I also set the stop_words parameter to exclude common words from the corpus, because they carry little information about the topic of a post.

In [3]:
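# The string 'english' selects the built-in stop-word list; a collection like {'english'} is read as a custom one-word list.
hv = HashingVectorizer(n_features=2**20, binary=True, stop_words='english')
transformed_texts = hv.fit_transform(data.data)  # the vectorizer is stateless, so this only transforms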
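
To make the effect of the stop_words argument concrete, here is a minimal sketch on a made-up two-document corpus (the documents and variable names are assumptions for illustration, not part of the 20 Newsgroups data). It compares the built-in English list with a custom list that contains only the word "english".

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy corpus, invented for this illustration.
docs = [
    "the english language has many words",
    "hashing maps words to columns",
]

# Built-in English stop-word list.
builtin = HashingVectorizer(n_features=2**20, binary=True, stop_words='english')
# Custom list with a single entry: only the token "english" is removed.
custom = HashingVectorizer(n_features=2**20, binary=True, stop_words=['english'])

X_builtin = builtin.transform(docs)
X_custom = custom.transform(docs)

# The number of columns is fixed by n_features, whatever the vocabulary is.
print(X_builtin.shape)
# Stored (non-zero) entries: fewer remain when common words are dropped.
print(X_builtin.nnz, X_custom.nnz)
```

The output is a sparse matrix, so the 2**20 columns cost memory only for the entries that are actually set.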

Use K-Fold cross-validation to split the dataset into training and test parts.

The dataset is split into 5 folds. The data is shuffled before splitting, and I set the random_state parameter to make the output reproducible.

In [4]:
kf = KFold(n_splits=5, shuffle=True, random_state=7)
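
As a quick check of what this splitter does, the sketch below runs the same KFold configuration on ten toy samples (the array `X` is an assumption made up for the example). Each sample should end up in the test part of exactly one fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 toy samples, one feature each
kf = KFold(n_splits=5, shuffle=True, random_state=7)

folds = list(kf.split(X))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train size={len(train_idx)}, test indices={test_idx.tolist()}")

# Collect the test indices of all folds; together they cover every sample once.
all_test = np.sort(np.concatenate([test_idx for _, test_idx in folds]))
```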

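Use logistic regression to create a model.

I used `SGDClassifier` with log loss, which fits a logistic regression by stochastic gradient descent (scikit-learn 1.1 renamed this loss to 'log_loss', and later releases no longer accept 'log'). To get first results I kept the default parameters except for alpha, which I set to 1e-5 instead of the default 1e-4. To compute cross-validated accuracy I used the cross_val_score function.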

In [5]:
clf = SGDClassifier(loss='log', penalty='l2', alpha=1e-5, random_state=42, n_jobs=-1)

In [6]:
cross_val_score(clf, transformed_texts, data.target, cv=kf)

array([0.89262042, 0.89703933, 0.87892179, 0.8935042 , 0.90318302])
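
The five fold scores are easier to compare with later results when reduced to a mean and a spread. A small sketch using only the standard library and the values printed above:

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the cross_val_score output above.
fold_scores = [0.89262042, 0.89703933, 0.87892179, 0.8935042, 0.90318302]

mean_score = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across folds

print(f"accuracy: {mean_score:.4f} +/- {spread:.4f}")
```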

These results are already pretty good, so now I want to experiment with different parameters (penalty, loss function, alpha) to find the best combination.

I decided to use the GridSearchCV class, which runs the whole search with very little code.

In [7]:
params = {
    'penalty': ['l1', 'l2'],
    'loss': ['hinge', 'log'],
    'alpha': [1e-6, 1e-5, 1e-4, 1e-3]
}
clf_search = GridSearchCV(clf, params, cv=kf, verbose=2)

In [8]:
clf_search.fit(transformed_texts, data.target)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END ................alpha=1e-06, loss=hinge, penalty=l1; total time=   2.5s
[CV] END ................alpha=1e-06, loss=hinge, penalty=l1; total time=   2.3s
[CV] END ................alpha=1e-06, loss=hinge, penalty=l1; total time=   2.2s
[CV] END ................alpha=1e-06, loss=hinge, penalty=l1; total time=   2.2s
[CV] END ................alpha=1e-06, loss=hinge, penalty=l1; total time=   2.8s
[CV] END ................alpha=1e-06, loss=hinge, penalty=l2; total time=   0.9s
[CV] END ................alpha=1e-06, loss=hinge, penalty=l2; total time=   0.9s
[CV] END ................alpha=1e-06, loss=hinge, penalty=l2; total time=   0.9s
[CV] END ................alpha=1e-06, loss=hinge, penalty=l2; total time=   0.7s
[CV] END ................alpha=1e-06, loss=hinge, penalty=l2; total time=   0.8s
[CV] END ..................alpha=1e-06, loss=log, penalty=l1; total time=   2.1s
[CV] END ..................alpha=1e-06, loss=log

GridSearchCV(cv=KFold(n_splits=5, random_state=7, shuffle=True),
             estimator=SGDClassifier(alpha=1e-05, loss='log', n_jobs=-1,
                                     random_state=42),
             param_grid={'alpha': [1e-06, 1e-05, 0.0001, 0.001],
                         'loss': ['hinge', 'log'], 'penalty': ['l1', 'l2']},
             verbose=2)
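
The "16 candidates, totalling 80 fits" line follows directly from the size of the grid. A minimal sketch that enumerates the same combinations with itertools (this only illustrates the counting; it is not how GridSearchCV is implemented internally):

```python
from itertools import product

params = {
    'penalty': ['l1', 'l2'],
    'loss': ['hinge', 'log'],
    'alpha': [1e-6, 1e-5, 1e-4, 1e-3],
}
n_splits = 5  # same as KFold(n_splits=5)

# One candidate per combination of parameter values.
candidates = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(candidates) * n_splits

print(len(candidates), total_fits)  # 16 80
```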

In [9]:
print('best_params:', clf_search.best_params_, '\nbest_score:', clf_search.best_score_)

best_params: {'alpha': 1e-05, 'loss': 'hinge', 'penalty': 'l2'} 
best_score: 0.8982685753557498


The results obtained above can be explained.

Since we only need class labels, `hinge loss` works slightly better here than log loss. Log loss is the natural choice when we need to estimate class probabilities. Hinge loss only cares about the margin, so it gives up probability estimates in exchange for somewhat better accuracy.
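
The difference between the two losses can be seen by evaluating them on the margin y * f(x) of a single sample. This is a sketch of the textbook formulas, with function names chosen for the example:

```python
import math

def hinge_loss(margin):
    """Hinge loss: zero once the sample is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - margin)

def log_loss(margin):
    """Logistic loss: always positive, it keeps pushing confident samples further."""
    return math.log1p(math.exp(-margin))

for margin in (-1.0, 0.0, 1.0, 3.0):
    print(f"margin={margin:+.1f}  hinge={hinge_loss(margin):.4f}  log={log_loss(margin):.4f}")
```

Hinge loss stops penalizing samples that are classified with enough margin, while log loss never reaches zero, which is what lets its output be read as a probability.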

`L2 penalty` shrinks all model weights toward zero, which makes the model more robust. L1 regularization drives the weights of some features to exactly zero, so it is useful when we want to remove the influence of uninformative features. We have already done part of that by excluding stop words.
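
A simplified way to see the difference is to apply one regularization step to a few weights. The sketch below uses the standard proximal updates (soft-thresholding for L1, uniform scaling for L2); it is an illustration under that assumption, not the exact update that SGDClassifier performs, and the weights are made up.

```python
def l1_step(weight, strength):
    """Soft-thresholding: move the weight toward zero by a fixed amount, clipping at zero."""
    if weight > strength:
        return weight - strength
    if weight < -strength:
        return weight + strength
    return 0.0

def l2_step(weight, strength):
    """Uniform shrinkage: scale the weight down, never reaching exactly zero."""
    return weight / (1.0 + strength)

weights = [0.05, -0.2, 1.5]  # hypothetical feature weights
strength = 0.1

after_l1 = [l1_step(w, strength) for w in weights]
after_l2 = [l2_step(w, strength) for w in weights]

print("L1:", after_l1)
print("L2:", after_l2)
```

The smallest weight is removed entirely by the L1 step, while the L2 step keeps every feature and only reduces its magnitude.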

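`Alpha` is the multiplier of the regularization term: the larger it is, the stronger the regularization. With the default `learning_rate='optimal'` it is also used to compute the step size, so it affects how big the steps toward the function minimum are. It is chosen experimentally and depends on the dataset.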
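
The scikit-learn documentation gives the 'optimal' schedule as eta = 1 / (alpha * (t + t0)), where t0 is picked by a heuristic. The sketch below evaluates that formula with a fixed, hypothetical t0 to show the direction of the effect; the real t0 depends on alpha and the loss, so the numbers are illustrative only.

```python
def optimal_eta(alpha, t, t0):
    """Step size of the documented 'optimal' schedule: 1 / (alpha * (t + t0))."""
    return 1.0 / (alpha * (t + t0))

T0 = 1000.0  # hypothetical constant; scikit-learn derives t0 from a heuristic

for alpha in (1e-6, 1e-5, 1e-4, 1e-3):
    steps = [optimal_eta(alpha, t, T0) for t in (1, 1_000, 100_000)]
    print(f"alpha={alpha:g}: " + ", ".join(f"{eta:.4g}" for eta in steps))
```

Under this simplification a smaller alpha gives both weaker regularization and larger steps, which is one reason its best value has to be found by experiment.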

In [10]:
import pandas as pd

In [11]:
pd.DataFrame(clf_search.cv_results_)[['param_alpha', 'param_loss', 'param_penalty', 'mean_test_score']]

    param_alpha  param_loss  param_penalty  mean_test_score
0   1e-06        hinge       l1             0.87131
1   1e-06        hinge       l2             0.880327
2   1e-06        log         l1             0.882536
3   1e-06        log         l2             0.897208
4   1e-05        hinge       l1             0.863621
5   1e-05        hinge       l2             0.898269
6   1e-05        log         l1             0.874316
7   1e-05        log         l2             0.893054
8   0.0001       hinge       l1             0.813948
9   0.0001       hinge       l2             0.897473


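The rows shown in this table are broadly consistent with the conclusions made above: L2 beats L1 for every alpha and loss, and the best score comes from hinge loss with alpha=1e-5. With the L1 penalty, accuracy drops as alpha grows (from 0.871 at 1e-6 to 0.814 at 1e-4 for hinge loss), so an alpha greater than 1e-5 is too big there and hurts the accuracy.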
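
These observations can be checked mechanically. The sketch below works on the ten rows shown above only (the remaining grid rows are not visible here), copied into a plain list; the variable names are chosen for the example.

```python
# (alpha, loss, penalty, mean_test_score) for the ten rows shown above.
rows = [
    (1e-06, "hinge", "l1", 0.87131),
    (1e-06, "hinge", "l2", 0.880327),
    (1e-06, "log", "l1", 0.882536),
    (1e-06, "log", "l2", 0.897208),
    (1e-05, "hinge", "l1", 0.863621),
    (1e-05, "hinge", "l2", 0.898269),
    (1e-05, "log", "l1", 0.874316),
    (1e-05, "log", "l2", 0.893054),
    (0.0001, "hinge", "l1", 0.813948),
    (0.0001, "hinge", "l2", 0.897473),
]

# Best configuration among the visible rows.
best = max(rows, key=lambda row: row[3])

# Group scores by (alpha, loss) so the two penalties can be compared pairwise.
by_setting = {}
for alpha, loss, penalty, score in rows:
    by_setting.setdefault((alpha, loss), {})[penalty] = score

l2_always_wins = all(scores["l2"] > scores["l1"] for scores in by_setting.values())

print("best:", best)
print("L2 better than L1 in every visible pair:", l2_always_wins)
```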