# Classification

This notebook evaluates methods for classification using the [academia.stackexchange.com](https://academia.stackexchange.com/) data dump.

## Table of Contents
* [Data import](#data_import)
* [Classification methods](#methods)

In [1]:
%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
import numpy as np
import time
from joblib import dump, load
from academia_tag_recommender.definitions import MODELS_PATH

<a id='data_import'/>

## Data import

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from academia_tag_recommender.preprocessor import BasicPreprocessor
from academia_tag_recommender.tokenizer import BasicTokenizer, EnglishStemmer, PorterStemmer, LancasterStemmer, Lemmatizer
from academia_tag_recommender.vectorizer_computation import get_vect_feat_with_params
from academia_tag_recommender.documents import documents as get_documents

documents = get_documents()
texts = [document.text for document in documents]

[vectorizer, features] = get_vect_feat_with_params(texts, TfidfVectorizer, BasicTokenizer, BasicPreprocessor, None, (1, 1))

## Data Preparation

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

label = [document.tags for document in documents]

X = features
print('X data with shape {}'.format(X.shape))

y = MultiLabelBinarizer().fit_transform(label)
print('Y data with shape {}'.format(y.shape))
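
To make the label encoding concrete, here is a minimal sketch of what `MultiLabelBinarizer` produces. The three tag lists are made up for illustration and are not taken from the data dump.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag lists, one per question (illustration only).
toy_tags = [['phd', 'advisor'], ['publications'], ['phd']]

mlb = MultiLabelBinarizer()
toy_y = mlb.fit_transform(toy_tags)

# One column per distinct tag, in sorted order; 1 marks that the tag is present.
print(list(mlb.classes_))
print(toy_y)
```

Keeping a reference to the fitted binarizer, as `mlb` does here, makes it possible to look up via `mlb.classes_` which tag a given column of the label matrix corresponds to.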

In [None]:
from sklearn.model_selection import train_test_split

y_one_label = y[:,1]
X_train, X_test, y_train, y_test = train_test_split(X, y_one_label, test_size=0.5, random_state=0)

In [None]:
from sklearn.multioutput import MultiOutputClassifier

X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X, y, test_size=0.5, random_state=0)

multi = False

<a id='methods'/>

## Classification methods

Probabilistic algorithms (high dimensionality, data sparsity)
- Naive Bayes (NB), [Explanation](https://scikit-learn.org/stable/modules/naive_bayes.html)
- [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression), [Explanation](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
- Linear Classifier (LLSF: Linear Least Squares Fit), [Explanation](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)


Geometric algorithms
- [k-Nearest Neighbor (kNN)](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier), [Explanation](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification)
- [Support Vector Machine (SVM)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC), [Explanation](https://scikit-learn.org/stable/modules/svm.html)


- [Neural Network (NN)](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier), [Explanation](https://scikit-learn.org/stable/modules/neural_networks_supervised.html)


In [None]:
from sklearn.metrics import mean_squared_error

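results = []

def test_classifier(name, clf, X_train, y_train, X_test, y_test, use_score=True):
    """Fit clf, score it on the train and test split, and append the outcome to results.

    With use_score=True the estimator's own score() is recorded; otherwise the
    mean squared error of its predictions is recorded (lower is better).
    """
    start = time.time()
    clf_fit = clf.fit(X_train, y_train)
    if use_score:
        score_orig = clf_fit.score(X_train, y_train)
        score_pred = clf_fit.score(X_test, y_test)
    else:
        pred_train = clf_fit.predict(X_train)
        pred_test = clf_fit.predict(X_test)
        score_orig = mean_squared_error(y_train, pred_train)
        score_pred = mean_squared_error(y_test, pred_test)
    end = time.time()
    # The recorded time covers fitting and scoring, not fitting alone.
    process_time = end - start
    results.append([name, score_orig, score_pred, process_time])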
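
The `use_score` switch exists because `score()` means different things for different estimators: for scikit-learn classifiers it is the mean accuracy, for regressors it is the coefficient of determination R². The sketch below checks this on synthetic data; the data and the names `X_toy`, `y_toy`, `toy_clf` and `toy_reg` are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

# Synthetic binary problem (illustration only, not the notebook's data).
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 5)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1.0).astype(int)

toy_clf = LogisticRegression(random_state=0).fit(X_toy, y_toy)
toy_reg = LinearRegression().fit(X_toy, y_toy)

# Classifier: score() is the mean accuracy of predict().
acc = toy_clf.score(X_toy, y_toy)
# Regressor: score() is R^2, which is not comparable to an accuracy.
r2 = toy_reg.score(X_toy, y_toy)

print('accuracy: {:.3f}, R^2: {:.3f}'.format(acc, r2))
```

Because accuracy is higher-is-better and mean squared error is lower-is-better, the two kinds of entries in `results` should be compared within their own group rather than against each other.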

In [None]:
X_train_one_feat = X_train[:,1].toarray()
X_test_one_feat = X_test[:,1].toarray()

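def plot_decision(clf):
    """Fit clf on a single feature and plot its predictions against the test labels."""
    clf_fit = clf.fit(X_train_one_feat, y_train)
    y_pred = clf_fit.predict(X_test_one_feat)

    # Sort by feature value so the prediction line is drawn left to right
    # instead of zigzagging between unordered points.
    order = np.argsort(X_test_one_feat[:, 0])

    plt.scatter(X_test_one_feat, y_test, color='black')
    plt.plot(X_test_one_feat[order, 0], y_pred[order], color='blue', linewidth=3)
    plt.xticks(())
    plt.yticks(())
    plt.show()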

**Naive Bayes**

- [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)
- [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)
- [Complement Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html#sklearn.naive_bayes.ComplementNB)
- [Categorical Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB)

In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB, CategoricalNB

*Gaussian Naive Bayes*

In [None]:
test_classifier('Gaussian Naive Bayes', 
                GaussianNB(),
                X_train.toarray(), y_train,
                X_test.toarray(), y_test)

*Multinomial Naive Bayes*

In [None]:
test_classifier('Multinomial Naive Bayes', 
                MultinomialNB(), 
                X_train.toarray(), y_train, 
                X_test.toarray(), y_test)
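
`GaussianNB` needs dense input, but `MultinomialNB` and `ComplementNB` also accept SciPy sparse matrices, so the `.toarray()` conversion is optional for them and skipping it can save memory on a large vocabulary. A minimal sketch on a made-up count matrix (not the notebook's features) showing that both input forms lead to the same predictions:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import MultinomialNB

# Tiny non-negative count matrix (illustration only).
dense = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
labels = np.array([0, 1, 0, 1])
sparse = csr_matrix(dense)

# Fit once on the dense array and once on the sparse matrix.
pred_dense = MultinomialNB().fit(dense, labels).predict(dense)
pred_sparse = MultinomialNB().fit(sparse, labels).predict(sparse)

print(pred_dense, pred_sparse)
```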

*Complement Naive Bayes*

In [None]:
test_classifier('Complement Naive Bayes', 
                ComplementNB(), 
                X_train.toarray(), y_train, 
                X_test.toarray(), y_test)

*Categorical Naive Bayes*

In [None]:
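# Note: CategoricalNB expects integer-encoded categorical features, while
# TF-IDF weights are continuous, so this result should be read with caution.
test_classifier('Categorical Naive Bayes', 
                CategoricalNB(), 
                X_train.toarray(), y_train,
                X_test.toarray(), y_test)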

*Multioutput - Multinomial Naive Bayes*

In [None]:
if multi:
    test_classifier('Multioutput - Multinomial Naive Bayes', 
                    MultiOutputClassifier(MultinomialNB()), 
                    X_train_multi.toarray(), y_train_multi, 
                    X_test_multi.toarray(), y_test_multi)
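
When a `MultiOutputClassifier` is scored with `score()`, a sample counts as correct only if all of its labels are predicted correctly (subset accuracy), which is a strict measure when there are many tags. The sketch below illustrates this on a made-up two-label problem; the toy arrays and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy multi-label problem: 6 samples, 3 features, 2 labels (illustration only).
X_toy = np.array([[3, 0, 0], [0, 3, 0], [0, 0, 3],
                  [2, 1, 0], [0, 2, 1], [1, 0, 2]])
Y_toy = np.array([[1, 0], [0, 1], [0, 0],
                  [1, 0], [0, 1], [1, 0]])

multi_clf = MultiOutputClassifier(MultinomialNB()).fit(X_toy, Y_toy)
Y_pred = multi_clf.predict(X_toy)

# A sample counts as correct only if every one of its labels is predicted correctly.
subset_accuracy = np.mean(np.all(Y_pred == Y_toy, axis=1))
# Per-label accuracy is more forgiving and can be reported alongside it.
per_label_accuracy = np.mean(Y_pred == Y_toy, axis=0)

print(subset_accuracy, per_label_accuracy)
```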

**Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
test_classifier('Logistic Regression', 
                LogisticRegression(random_state=0), 
                X_train, y_train, 
                X_test, y_test)

In [None]:
plot_decision(LogisticRegression(random_state=0))

*Multioutput - Logistic Regression*

In [None]:
if multi:
    test_classifier('Multioutput - Logistic Regression', 
                    MultiOutputClassifier(LogisticRegression(random_state=0)), 
                    X_train_multi.toarray(), y_train_multi, 
                    X_test_multi.toarray(), y_test_multi)

**Linear Regression**

- [Ordinary Least Squares](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)
- [Non Negative Least Squares](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)
- [Ridge Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier)
- [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.MultiTaskLasso.html#sklearn.linear_model.MultiTaskLasso)
- [Elastic Net](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.MultiTaskElasticNet.html#sklearn.linear_model.MultiTaskElasticNet)

In [None]:
from sklearn.linear_model import LinearRegression, RidgeClassifier, MultiTaskLasso, Lasso, MultiTaskElasticNet, ElasticNet

*Ordinary Least Squares*

In [None]:
test_classifier('Ordinary Least Squares', 
                LinearRegression(),
                X_train, y_train, 
                X_test, y_test,
                False)
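
Regression models return continuous values rather than class labels, which is why they are evaluated with the mean squared error here. One possible way to obtain an accuracy for them is to threshold the predictions. The sketch below does this on synthetic data; the 0.5 cut-off and all variable names are assumptions for illustration, not something this notebook does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# Synthetic binary target that depends on the first feature (illustration only).
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 5)
y_toy = (X_toy[:, 0] > 0.5).astype(int)

toy_reg = LinearRegression().fit(X_toy, y_toy)

# Turn continuous outputs into 0/1 labels; the 0.5 cut-off is an assumption.
y_label = (toy_reg.predict(X_toy) >= 0.5).astype(int)
acc = accuracy_score(y_toy, y_label)

print('accuracy after thresholding: {:.3f}'.format(acc))
```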

In [None]:
plot_decision(LinearRegression())

*Non Negative Least Squares*

In [None]:
test_classifier('Non Negative Least Squares', 
                LinearRegression(positive=True),
                X_train.toarray(), y_train, 
                X_test.toarray(), y_test,
                False)

*Ridge Regression*

In [None]:
test_classifier('Ridge Classifier',
                RidgeClassifier(),
                X_train.toarray(), y_train, 
                X_test.toarray(), y_test)

In [None]:
plot_decision(RidgeClassifier())

*Lasso*

In [None]:
test_classifier('Lasso',
                Lasso(),
                X_train.toarray(), y_train, 
                X_test.toarray(), y_test,
                False)

In [None]:
if multi:
    test_classifier('Multioutput - Lasso', 
                    MultiTaskLasso(alpha=0.1),
                    X_train_multi.toarray(), y_train_multi, 
                    X_test_multi.toarray(), y_test_multi,
                    False)

In [None]:
plot_decision(Lasso(alpha=0.1))

*Elastic Net*

In [None]:
test_classifier('Elastic Net',
                ElasticNet(alpha=0.1),
                X_train.toarray(), y_train, 
                X_test.toarray(), y_test,
                False)

In [None]:
if multi:
    test_classifier('Multioutput - Elastic Net', 
                    MultiTaskElasticNet(alpha=0.1),
                    X_train_multi.toarray(), y_train_multi, 
                    X_test_multi.toarray(), y_test_multi,
                    False)

In [None]:
plot_decision(ElasticNet(alpha=0.1))

**k-Nearest Neighbors**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
test_classifier('k-Nearest Neighbors',
                KNeighborsClassifier(),
                X_train.toarray(), y_train, 
                X_test.toarray(), y_test)

*Multioutput - k-Nearest Neighbors*

In [None]:
if multi:
    test_classifier('Multioutput - k-Nearest Neighbors', 
                    MultiOutputClassifier(KNeighborsClassifier()),
                    X_train_multi.toarray(), y_train_multi, 
                    X_test_multi.toarray(), y_test_multi)

**Support Vector Machines**

- [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
- [NuSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVC)
- [Linear SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)

In [None]:
from sklearn.svm import SVC, NuSVC, LinearSVC

*SVC*

In [None]:
test_classifier('SVC',
                SVC(),
                X_train, y_train, 
                X_test, y_test)

*NuSVC*

In [None]:
#test_classifier('NuSVC',
#                NuSVC(),
#                X_train, y_train, 
#                X_test, y_test)

*Linear SVC*

In [None]:
test_classifier('Linear SVC',
                LinearSVC(),
                X_train, y_train, 
                X_test, y_test)

In [None]:
if multi:
    test_classifier('Multioutput - Linear SVC', 
                    MultiOutputClassifier(LinearSVC()),
                    X_train_multi, y_train_multi,
                    X_test_multi, y_test_multi)

**Neural Networks**

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
test_classifier('Neural Network',
                MLPClassifier(random_state=1),
                X_train, y_train,
                X_test, y_test)

In [None]:
if multi:
    test_classifier('Multioutput - Neural Network', 
                    MultiOutputClassifier(MLPClassifier(random_state=1)), 
                    X_train_multi, y_train_multi, 
                    X_test_multi, y_test_multi)

In [None]:
print('{:<30}{:<25}{:<25}{:<25}'.format("Classifier", "Train Score", "Test Score", "Time"))
for result in results:
    [name, score_orig, score_pred, process_time] = result
    print('{:<30}{:<25}{:<25}{:<25}'.format(name, score_orig, score_pred, process_time))

In [None]:
indices = np.arange(len(results))

results_ = [[x[i] for x in results] for i in range(4)]

clf_names, score_orig, score_pred, proc_time = results_
proc_time = np.array(proc_time) / np.max(proc_time)
score_pred = [score if score > 0 else 0 for score in score_pred]

plt.figure(figsize=(12, 8))
plt.title("Score")
plt.barh(indices, score_orig, .2, label="score train", color='navy')
plt.barh(indices + .3, score_pred, .2, label="score test", color='darkorange')
plt.barh(indices + .6, proc_time, .2, label="time (normalized)",
         color='c')
plt.yticks(())
plt.legend(loc='best')
plt.subplots_adjust(left=.25)
plt.subplots_adjust(top=.95)
plt.subplots_adjust(bottom=.05)

for i, c in zip(indices, clf_names):
    plt.text(-.3, i, c)

plt.show()