## day11: Text classification with simple features

In [130]:
import heapq
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, make_scorer

random_state = 42

# Text classification

Text classification is the task of assigning a class label to a document based on its content.

Here the documents are newsgroup posts, each pre-labeled with one of 20 topics.

In [None]:
all_categories = fetch_20newsgroups().target_names
all_categories

Let's take only 3 topics, all from the same section (documents on similar topics are harder to tell apart).

In [None]:
categories = [
    'sci.electronics',
    'sci.med',
    'sci.space',
]

train_data = fetch_20newsgroups(subset='train',
                                categories=categories,
                                remove=('headers', 'footers', 'quotes'))

test_data = fetch_20newsgroups(subset='test',
                               categories=categories,
                               remove=('headers', 'footers', 'quotes'))

## Text vectorization
**Question: how can text documents be described in a feature space?**


**Idea #1**: bag-of-words. Each document is treated as an unordered collection of words, with no information about the relationships between them.
<img src='https://st2.depositphotos.com/2454953/9959/i/450/depositphotos_99593622-stock-photo-holidays-travel-bag-word-cloud.jpg'>

**Idea #2**: represent each document as a vector of word counts, where each component corresponds to a separate word.

For text vectorization let's use [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). You can vary the extraction of features in every possible way (remove rare words, remove frequent words, remove words of general vocabulary, take bigrams, etc.)
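As a minimal sketch of the bag-of-words idea, the cell below vectorizes a tiny made-up corpus. The two sentences and the `toy_*` names are illustrative assumptions and are not part of the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration only)
toy_docs = [
    "the rocket reached orbit",
    "the doctor treated the patient",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # sparse document-term matrix

# vocabulary_ maps each word to its column index; sort words by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_dense = toy_counts.toarray()  # rows are documents, columns are words

print(toy_vocab)
print(toy_dense)
```

Each row counts how often every vocabulary word occurs in the corresponding document; word order is discarded.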

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
CountVectorizer()

In [None]:
count_vectorizer = CountVectorizer(min_df=5, ngram_range=(1, 2)) 

In [None]:
sparse_feature_matrix = count_vectorizer.fit_transform(train_data.data)
sparse_feature_matrix

In [None]:
num_2_words = {
    v: k
    for k, v in count_vectorizer.vocabulary_.items()
}

Once a linear model is trained, the words with the largest positive weights for a class are the characteristic words of that topic.

Let's use the `macro`-averaged F1 score to estimate the quality of the solution in this multiclass classification problem.

In [None]:
f_scorer = make_scorer(f1_score, average='macro')
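To make the macro average concrete, here is a small sketch on made-up labels (the `y_true` and `y_pred` lists are assumptions for illustration). The macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for three classes, purely illustrative
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 value per class
macro_f1 = f1_score(y_true, y_pred, average='macro')   # unweighted mean of the above

print(per_class_f1)
print(macro_f1)
```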

Train logistic regression to predict the document topic.

In [None]:
lr_algo_params = {"solver": "liblinear"}
grid_search_params = {
    "verbose": 1,
    "n_jobs": -1,
    "cv": StratifiedKFold(shuffle=True, random_state=random_state),
    "scoring": f_scorer,
}

lr_param_grid = {
    "C": np.logspace(-4, 4, 9)
}

algo = LogisticRegression(**lr_algo_params)

gscv_lr = GridSearchCV(algo, param_grid=lr_param_grid, **grid_search_params)
gscv_lr.fit(sparse_feature_matrix, train_data.target)
algo = gscv_lr.best_estimator_
print(f"best params: {gscv_lr.best_params_}")
train_accuracy = algo.score(sparse_feature_matrix, train_data.target)
print(f"train accuracy: {train_accuracy:.4f}")

In [None]:
W = algo.coef_.shape[1]

for c in algo.classes_:
    topic_words = [
        num_2_words[w_num]
        for w_num in heapq.nlargest(10, range(W), key=lambda w: algo.coef_[c, w])
    ]
    print(f"topic: {categories[c]}")
    print(',  '.join(topic_words))
    print()

Let's compare the quality for train and test samples.

In [None]:
f_scorer(algo, sparse_feature_matrix, train_data.target)

In [None]:
f_scorer(algo, count_vectorizer.transform(test_data.data), test_data.target)

The f-measure values are very low.

**Question:** what is the reason?

In [None]:
plt.hist(algo.coef_[0], bins=500)
plt.xlim([-0.06, 0.06])
plt.show()

**Which norm should we choose for the regularization penalty?**

In [None]:
# Add L1 regularization to the logistic regression
algo = LogisticRegression(penalty="l1", **lr_algo_params)
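One reason to consider the L1 penalty is that it tends to set many weights exactly to zero, which acts as feature selection. The sketch below illustrates this on synthetic data; the dataset from `make_classification`, the value `C=0.1`, and the `*_demo` names are assumptions chosen for illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative assumption): only 5 of 100 features are informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=100, n_informative=5, n_redundant=0, random_state=0
)

l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X_demo, y_demo)
l2_model = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X_demo, y_demo)

# Count weights that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print(f"zero weights with L1: {n_zero_l1}, with L2: {n_zero_l2}")
```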

In [None]:
arr = cross_val_score(algo, sparse_feature_matrix, train_data.target, cv=grid_search_params["cv"], scoring=f_scorer)
print(arr)
print(np.mean(arr))

In [None]:
algo.fit(sparse_feature_matrix, train_data.target)

In [None]:
f_scorer(algo, sparse_feature_matrix, train_data.target)

In [None]:
f_scorer(algo, count_vectorizer.transform(test_data.data), test_data.target)

Let's select the optimal value of the regularization parameter

In [None]:
def grid_plot(x, y, x_label, title, y_label='f_measure'):
    plt.figure(figsize=(12, 6))
    plt.grid(True)
    plt.plot(x, y, 'go-')
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)

In [None]:
np.logspace(-2, 2, 11)
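`np.logspace` returns values whose base-10 exponents are evenly spaced, which suits a parameter like `C` that is searched over several orders of magnitude. A small sketch (the `demo_*` names are illustrative):

```python
import numpy as np

demo_grid = np.logspace(-2, 2, 5)     # exponents -2, -1, 0, 1, 2
demo_exponents = np.log10(demo_grid)  # recovers the evenly spaced exponents

print(demo_grid)
print(demo_exponents)
```

Note that `np.log` is the natural logarithm, so an axis built with `np.log(C)` is evenly spaced too, but its tick values are the base-10 exponents scaled by ln(10).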

In [None]:
lr_grid = {
    'C': np.logspace(-2, 2, 11),
}
gs = GridSearchCV(algo, lr_grid, **grid_search_params)
%time  gs.fit(sparse_feature_matrix, train_data.target)
print(f"best_params: {gs.best_params_}, best_score: {gs.best_score_:.4f}")

Let's look at the plot:

In [None]:
grid_plot(
    np.log(lr_grid['C']), gs.cv_results_['mean_test_score'], 'log(C), C - inverse regularization strength', 'LogReg(penalty=l1)'
)

In [None]:
lr_grid = {
    "C": np.logspace(0, 2, 11), # OR YOUR CODE: create your own grid
}
gs = GridSearchCV(
    LogisticRegression(**lr_algo_params),
    lr_grid,
    **grid_search_params
)
gs.fit(sparse_feature_matrix, train_data.target)
print(f"best_params: {gs.best_params_}, best_score: {gs.best_score_:.4f}")

In [None]:
grid_plot(
    np.log(lr_grid['C']), gs.cv_results_['mean_test_score'], 'log(C), C - inverse regularization strength', 'LogReg(penalty=l2)'
)

In [None]:
lr_final = LogisticRegression(penalty='l2', C=6.309573444801933, solver="liblinear")
%time lr_final.fit(sparse_feature_matrix, train_data.target)

In [None]:
accuracy_score(train_data.target, lr_final.predict(sparse_feature_matrix))

In [None]:
f_scorer(lr_final, sparse_feature_matrix, train_data.target)

In [None]:
accuracy_score(test_data.target, lr_final.predict(count_vectorizer.transform(test_data.data)))

In [None]:
f_scorer(lr_final, count_vectorizer.transform(test_data.data), test_data.target)

## Regularization along with feature vectorization
To avoid running vectorization and training as separate steps, scikit-learn provides the convenient Pipeline class. It chains a sequence of transformations and a final estimator into a single object.

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
pipeline = Pipeline([
    ("vectorizer", CountVectorizer(min_df=5, ngram_range=(1, 2))),
    ("algo", LogisticRegression(**lr_algo_params))
])

In [None]:
pipeline.fit(train_data.data, train_data.target)

In [None]:
f_scorer(pipeline, train_data.data, train_data.target)

In [None]:
f_scorer(pipeline, test_data.data, test_data.target)

The values are the same as we got earlier, taking the steps separately.

In [None]:
from sklearn.pipeline import make_pipeline

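Cross-validation requires that the CountVectorizer is not fit on the held-out fold (otherwise information from that fold leaks into training and the score estimate is biased). Pipeline makes this easy, because the whole chain is refit on the training part of each split.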
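The sketch below shows what fitting on the training fold only means for a vectorizer. The toy sentences and the `fold_*` names are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy split (made up): a "training" fold and one held-out document
fold_train = ["the probe entered orbit", "the patient needs rest"]
fold_valid = ["the telescope entered orbit"]

fold_vectorizer = CountVectorizer().fit(fold_train)   # vocabulary comes from the training fold only
valid_counts = fold_vectorizer.transform(fold_valid)  # held-out text is only transformed

# "telescope" never appeared in the training fold, so it has no column
has_unseen_word = "telescope" in fold_vectorizer.vocabulary_
print(has_unseen_word)
print(valid_counts.sum())  # number of held-out tokens that matched the vocabulary
```

Inside `cross_val_score`, a pipeline repeats this fit-then-transform pattern for every split, so the held-out fold never influences the vocabulary.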

In [None]:
pipeline = make_pipeline(
    CountVectorizer(min_df=5, ngram_range=(1, 2)),
    LogisticRegression(penalty="l2", **lr_algo_params)
)
arr = cross_val_score(pipeline, train_data.data, train_data.target, cv=5, scoring=f_scorer)
print(arr)
print(np.mean(arr))

New preprocessing steps can be added to the Pipeline.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
pipeline = make_pipeline(
    CountVectorizer(min_df=5, ngram_range=(1, 2)),
    TfidfTransformer(),
    LogisticRegression(penalty="l2", solver="liblinear"),
)
arr = cross_val_score(pipeline, train_data.data, train_data.target, cv=5, scoring=f_scorer)
print(arr)
print(np.mean(arr))
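The `TfidfVectorizer` imported earlier combines the two preprocessing steps: with default settings it is equivalent to `CountVectorizer` followed by `TfidfTransformer`. A minimal check on a made-up corpus (the `tfidf_docs` sentences and variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus, for illustration only
tfidf_docs = [
    "the rocket reached orbit",
    "the doctor treated the patient",
    "the circuit needs a resistor",
]

two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
one_step = TfidfVectorizer()

X_two = two_step.fit_transform(tfidf_docs).toarray()
X_one = one_step.fit_transform(tfidf_docs).toarray()

same_features = np.allclose(X_two, X_one)
print(same_features)
```

The two-step form is handy when raw counts are also needed; the one-step form keeps the pipeline shorter.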

In [None]:
pipeline.fit(train_data.data, train_data.target)

In [None]:
accuracy_score(train_data.target, pipeline.predict(train_data.data))

In [None]:
f_scorer(pipeline, train_data.data, train_data.target)

In [None]:
accuracy_score(test_data.target, pipeline.predict(test_data.data))

In [None]:
f_scorer(pipeline, test_data.data, test_data.target)

The quality is slightly better

## Task: try a scikit-learn algorithm other than Logistic Regression

In [None]:
# YOUR CODE