## week08: Text classification with simple features

In [None]:
import heapq
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import StratifiedKFold

%matplotlib inline

# Text classification

The text classification task is to predict a document's class from its contents.

Here the documents are newsgroup posts that have been pre-classified into 20 topics.

In [None]:
all_categories = fetch_20newsgroups().target_names
all_categories

Let's take only 3 topics, all from the same section (documents on similar topics are harder to tell apart).

In [None]:
categories = [
    'sci.electronics',
    'sci.space',
    'sci.med'
]

train_data = fetch_20newsgroups(subset='train',
                                categories=categories,
                                remove=('headers', 'footers', 'quotes'))

test_data = fetch_20newsgroups(subset='test',
                               categories=categories,
                               remove=('headers', 'footers', 'quotes'))

## Text vectorization
**Question: how can text documents be described in a feature space?**


**Idea #1**: bag-of-words. Each document is treated as an unordered collection of words, with no information about the relationships between them.
<img src='https://st2.depositphotos.com/2454953/9959/i/450/depositphotos_99593622-stock-photo-holidays-travel-bag-word-cloud.jpg'>

**Idea #2**: create a vector of "words", where each component corresponds to a separate word and holds that word's count in the document.

To vectorize the text, let's use [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Feature extraction can be varied in many ways (dropping rare words, dropping frequent words, dropping general-vocabulary words, adding bigrams, etc.).
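
To make the two ideas concrete, here is a minimal pure-Python sketch of count vectorization on a toy corpus. The helper names (`tokenize`, `build_vocabulary`, `vectorize`) and the example sentences are assumptions made for illustration; `CountVectorizer` does the same job with a more careful tokenizer and sparse output.

```python
from collections import Counter

def tokenize(text):
    # Simplistic tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()

def build_vocabulary(docs):
    # Map each distinct word to a column index, in sorted order.
    words = sorted({w for doc in docs for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}

def vectorize(doc, vocabulary):
    # Count occurrences; words outside the vocabulary are ignored.
    counts = Counter(tokenize(doc))
    vec = [0] * len(vocabulary)
    for word, n in counts.items():
        if word in vocabulary:
            vec[vocabulary[word]] = n
    return vec

docs = ["the rocket reached orbit", "the doctor saw the patient"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

Word order is discarded: `vectorize("orbit the the", vocab)` and `vectorize("the orbit the", vocab)` give the same vector.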

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
CountVectorizer()

In [None]:
count_vectorizer = CountVectorizer(min_df=5, ngram_range=(1, 2)) 

In [None]:
sparse_feature_matrix = count_vectorizer.fit_transform(train_data.data)
sparse_feature_matrix

In [None]:
num_2_words = {
    v: k
    for k, v in count_vectorizer.vocabulary_.items()
}

After training, the words with the highest positive weights for a class are the characteristic words of that topic.

In [None]:
from sklearn.linear_model import LogisticRegression
# make_scorer is exported from sklearn.metrics; the private sklearn.metrics.scorer module no longer exists
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import cross_val_score, GridSearchCV

Let's use the macro-averaged F-measure (`average='macro'`) to evaluate the quality of the solution in this multiclass classification problem.
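
As a rough illustration of what macro-averaging does, the sketch below computes the F-measure for each class separately and then takes the plain mean, so every class counts equally regardless of its size. The function names and the toy labels are assumptions for this example, not part of the notebook.

```python
def f1_for_class(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        # No correct predictions for this class: F1 is taken to be 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    # Unweighted mean of the per-class F1 scores.
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)

# Class 2 is rare and never predicted.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 1, 1]
score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the accuracy is 5/6, but the macro F-measure is noticeably lower because the completely missed class contributes a zero to the average.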

In [None]:
f_scorer = make_scorer(f1_score, average='macro')

Train logistic regression to predict the document topic.

In [None]:
algo = # YOUR code
# Your code for algo fitting

In [None]:
W = algo.coef_.shape[1]
for c in algo.classes_:
    topic_words = [
        num_2_words[w_num]
        for w_num in heapq.nlargest(10, range(W), key=lambda w: algo.coef_[c, w])
    ]
    print(',  '.join(topic_words))


Let's compare the quality for train and test samples.

In [None]:
f_scorer(algo, sparse_feature_matrix, train_data.target)

In [None]:
f_scorer(algo, count_vectorizer.transform(test_data.data), test_data.target)

The f-measure values are very low.

**Question:** what is the reason?

In [None]:
plt.hist(algo.coef_[0], bins=500)
plt.xlim([-0.0006, 0.0006])
plt.show()

**Which norm (L1 or L2) should we choose for regularization?**
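
One way to build intuition for the choice is to compare how each penalty shrinks a single weight. The sketch below is a simplified one-dimensional illustration, not how `LogisticRegression` solvers are implemented; the function names and the value of `lam` are assumptions. The L1 penalty sets small weights exactly to zero, while the L2 penalty only scales them down.

```python
def shrink_l1(w, lam):
    # Soft-thresholding: the minimizer of 0.5 * (x - w)**2 + lam * abs(x).
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def shrink_l2(w, lam):
    # The minimizer of 0.5 * (x - w)**2 + 0.5 * lam * x**2.
    return w / (1.0 + lam)

weights = [2.0, -0.3, 0.05, -1.5]
lam = 0.5  # illustrative penalty strength
after_l1 = [shrink_l1(w, lam) for w in weights]
after_l2 = [shrink_l2(w, lam) for w in weights]
```

With tens of thousands of n-gram features, most of which carry no signal, the exact zeros produced by L1 act as a form of feature selection.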

In [None]:
algo = #  YOUR CODE: Add regularization to the logistic Regression

In [None]:
arr = cross_val_score( # YOUR CODE: cross_val_parameters )
print(arr)
print(np.mean(arr))

In [None]:
algo.fit(sparse_feature_matrix, train_data.target)

In [None]:
f_scorer(algo, sparse_feature_matrix, train_data.target)

In [None]:
f_scorer(algo, count_vectorizer.transform(test_data.data), test_data.target)

Let's select the optimal value of the regularization parameter `C` with a grid search.

In [None]:
def grid_plot(x, y, x_label, title, y_label='f_measure'):
    plt.figure(figsize=(12, 6))
    plt.grid(True)
    plt.plot(x, y, 'go-')
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)

In [None]:
print(*map(float, np.logspace(-2, 2, 10)))

In [None]:
lr_grid = {
    'C': np.logspace(-2, 2, 10),
}
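# penalty='l1' needs a solver that supports it; the default lbfgs solver in recent scikit-learn versions does not
gs = GridSearchCV(LogisticRegression(penalty='l1', solver='liblinear'), lr_grid, scoring=f_scorer, cv=5, n_jobs=5)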
%time  gs.fit(sparse_feature_matrix, train_data.target)
print("best_params: {}, best_score: {}".format(gs.best_params_, gs.best_score_))

Let's look at the plot:

In [None]:
grid_plot(
    lr_grid['C'], gs.cv_results_['mean_test_score'], 'C - inverse regularization strength', 'LogReg(penalty=l1)'
)

In [None]:
lr_grid = {
    # YOUR CODE: create your own grid
}
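# penalty='l1' needs a solver that supports it; the default lbfgs solver in recent scikit-learn versions does not
gs = GridSearchCV(LogisticRegression(penalty='l1', solver='liblinear'), lr_grid, scoring=f_scorer, cv=5, n_jobs=5)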
%time  gs.fit(sparse_feature_matrix, train_data.target)
print("best_params: {}, best_score: {}".format(gs.best_params_, gs.best_score_))

In [None]:
grid_plot(
    lr_grid['C'], gs.cv_results_['mean_test_score'], 'C - inverse regularization strength', 'LogReg(penalty=l1)'
)

In [None]:
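# penalty='l1' needs a solver that supports it; the default lbfgs solver in recent scikit-learn versions does not
lr_final = LogisticRegression(penalty='l1', solver='liblinear', C=10)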
%time lr_final.fit(sparse_feature_matrix, train_data.target)

In [None]:
accuracy_score(lr_final.predict(sparse_feature_matrix), train_data.target)

In [None]:
f_scorer(lr_final, sparse_feature_matrix, train_data.target)

In [None]:
accuracy_score(lr_final.predict(count_vectorizer.transform(test_data.data)), test_data.target)

In [None]:
f_scorer(lr_final, count_vectorizer.transform(test_data.data), test_data.target)

## Regularization along with feature vectorization
To avoid running vectorization and training as separate steps, scikit-learn provides the convenient `Pipeline` class, which chains a sequence of steps into a single estimator.
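
As a small sketch of the idea, the example below fits the same vectorizer and classifier twice on a toy corpus: once as two manual steps and once chained in a `Pipeline`. The documents, labels, and the name `toy_pipeline` are assumptions made for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (an assumption for illustration): 0 = space, 1 = medicine.
docs = [
    "rocket orbit launch",
    "orbit moon rocket",
    "doctor patient medicine",
    "patient medicine dose",
]
labels = [0, 0, 1, 1]
new_docs = ["rocket launch", "medicine dose"]

# Two separate steps: vectorize, then fit the classifier.
vectorizer = CountVectorizer()
clf = LogisticRegression()
clf.fit(vectorizer.fit_transform(docs), labels)
manual_pred = clf.predict(vectorizer.transform(new_docs))

# The same two steps chained: raw text goes in, labels come out.
toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("algo", LogisticRegression()),
])
toy_pipeline.fit(docs, labels)
pipeline_pred = toy_pipeline.predict(new_docs)
```

Because the fitted vectorizer lives inside the pipeline, `toy_pipeline.predict` accepts raw strings directly, and both approaches should give the same predictions.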

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
pipeline = Pipeline([
    ("vectorizer", CountVectorizer(min_df=5, ngram_range=(1, 2))),
    ("algo", LogisticRegression())
])

In [None]:
pipeline.fit(train_data.data, train_data.target)

In [None]:
f_scorer(pipeline, train_data.data, train_data.target)

In [None]:
f_scorer(pipeline, test_data.data, test_data.target)

The values are the same as we got earlier, taking the steps separately.

In [None]:
from sklearn.pipeline import make_pipeline

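Cross-validation requires that `CountVectorizer` is never fit on the held-out fold (otherwise information from the validation documents leaks into training). Pipeline makes this easy, because the vectorizer is refit on the training part of every split.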
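
The leak can be sketched in pure Python. Below, `build_vocabulary` is a hypothetical stand-in for fitting a vectorizer, and the four toy documents are assumptions; the point is only that a vocabulary fit before splitting contains words that occur nowhere in the training fold.

```python
def build_vocabulary(docs):
    # Stand-in for fitting a vectorizer: collect the distinct lowercase words.
    return {word for doc in docs for word in doc.lower().split()}

docs = [
    "rocket reached orbit",
    "satellite in orbit",
    "doctor prescribed medicine",
    "patient took medicine",
]
train_fold, held_out_fold = docs[:3], docs[3:]

leaky_vocab = build_vocabulary(docs)        # fit before splitting: sees held-out text
clean_vocab = build_vocabulary(train_fold)  # fit inside the fold: training text only

# Words the model could only know about by peeking at the held-out fold.
leaked_words = leaky_vocab - clean_vocab
```

With `min_df` thresholds or TF-IDF weights the effect is subtler than extra vocabulary entries, since held-out documents also shift the document-frequency statistics.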

In [None]:
pipeline = make_pipeline(CountVectorizer(min_df=5, ngram_range=(1, 2)), LogisticRegression())
arr = cross_val_score(pipeline, train_data.data, train_data.target, cv=5, scoring=f_scorer)
print(arr)
print(np.mean(arr))

New data preprocessing steps can be added to the Pipeline.
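
The step added below is `TfidfTransformer`, which reweights raw counts so that words occurring in many documents matter less. Here is a minimal pure-Python sketch of that reweighting, assuming the transformer's default settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The function name `tfidf` and the toy count matrix are assumptions for illustration.

```python
import math

def tfidf(count_rows):
    # count_rows: one list of term counts per document, same column order.
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf; terms present in every document get the minimum value 1.0.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    result = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # L2-normalize each document; leave all-zero rows untouched.
        result.append([v / norm for v in weighted] if norm else weighted)
    return result

# Columns (assumed): "the", "rocket", "doctor".
counts = [[2, 1, 0],
          [2, 0, 1]]
weights = tfidf(counts)
```

In the first document "the" occurs twice as often as "rocket", but after reweighting its advantage is smaller, because "the" appears in every document and "rocket" in only one.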

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
pipeline = make_pipeline(CountVectorizer(min_df=5, ngram_range=(1, 2)), TfidfTransformer(), LogisticRegression())
arr = cross_val_score(pipeline, train_data.data, train_data.target, cv=5, scoring=f_scorer)
print(arr)
print(np.mean(arr))

In [None]:
pipeline.fit(train_data.data, train_data.target)

In [None]:
accuracy_score(pipeline.predict(train_data.data), train_data.target)

In [None]:
f_scorer(pipeline, train_data.data, train_data.target)

In [None]:
accuracy_score(pipeline.predict(test_data.data), test_data.target)

In [None]:
f_scorer(pipeline, test_data.data, test_data.target)

The quality is slightly better with TF-IDF weighting.

## Task: try an sklearn algorithm other than Logistic Regression

In [None]:
# YOUR CODE