#### Improving m001 (log loss: 0.117) by exploring a naive Bayes + logistic regression approach
Found in Jeremy Howard's Kaggle kernel [here](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline-eda-0-052-lb)

In [1]:
from __future__ import division, print_function
import sys
sys.path.append('..')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_selection import chi2, mutual_info_classif, SelectKBest
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import binarize 
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict 
%matplotlib inline
%load_ext autoreload
%autoreload 2

from evaluation import cross_validate_multilabel, multilabel_results, log_loss_multilabel

In [2]:
toxic_classes = [
    'toxic', 'severe_toxic', 'obscene', 
    'threat', 'insult', 'identity_hate' 
]

df = pd.read_csv('../data/train.csv', na_filter=False)
X_text = df['comment_text'].values
Y = df[toxic_classes].values
ids = df['id']

First, we run the code found in the Kaggle kernel to confirm that we can replicate its log loss of 0.052.

In [3]:
import re, string
re_tok = re.compile('([{}“”¨«»®´·º½¾¿¡§£₤‘’])'.format(string.punctuation))
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
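
As a quick check of what this tokenizer does, the sketch below applies it to a made-up sentence (the sample text is an assumption for illustration, not taken from the dataset). Every punctuation character is padded with spaces so that it becomes its own token before the whitespace split.

```python
import re, string

# Same tokenizer as in the cell above, repeated so the example is self-contained
re_tok = re.compile('([{}“”¨«»®´·º½¾¿¡§£₤‘’])'.format(string.punctuation))
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

# Illustrative input: the apostrophe and each '!' are split into separate tokens
tokens = tokenize("Don't shout!!")
print(tokens)
```

Note that contractions are split at the apostrophe, so "Don't" contributes three tokens rather than one.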

We can replicate the result, so we will now investigate why it is significantly better than m001. An initial comparison of the differences between the two models (we will refer to the new model as **d002**):
* m001 uses a multinomial naive Bayes with class priors whereas d002 uses a regularized logistic regression
* m001 features are token counts, d002 are tf-idf values (multiplied by a multinomial class conditional probability)
* d002 uses a combination of unigrams and bigrams as features (only unigrams in m001)
* tokenization is different between the models and accents are stripped in d002
* chi2 explicit feature selection in m001 compared to implicit feature "reduction" through l2 regularization in d002
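
To make the feature differences in the list above concrete, here is a minimal sketch on a two-sentence toy corpus. The corpus is invented, and a default `CountVectorizer()` is only an assumed stand-in for the m001 features, whose exact settings are not shown in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (illustrative only)
corpus = ["you are not nice", "you are nice"]

# Assumed m001-style features: raw unigram counts
counts = CountVectorizer()
X_counts = counts.fit_transform(corpus)

# d002-style features: tf-idf over unigrams and bigrams
tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X_tfidf = tfidf.fit_transform(corpus)

print(sorted(counts.vocabulary_))  # unigrams only
print(sorted(tfidf.vocabulary_))   # unigrams plus bigrams such as 'not nice'
```

With bigrams, a negation such as "not nice" becomes a feature of its own, which a unigram model cannot represent.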

Below is the original approach from the **Jeremy Howard** Kaggle kernel, adapted into a scikit-learn classifier by **AlexSánchez**.

In [12]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_is_fitted
from sklearn.linear_model import LogisticRegression
from scipy import sparse
class NbSvmClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, C=1.0, dual=False, n_jobs=1):
        self.C = C
        self.dual = dual
        self.n_jobs = n_jobs

    def predict(self, x):
        # Verify that model has been fit
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict(x.multiply(self._r))

    def predict_proba(self, x):
        # Verify that model has been fit
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict_proba(x.multiply(self._r))

    def fit(self, x, y):
        # Check that X and y have correct shape
        x, y = check_X_y(x, y, accept_sparse=True)

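        def pr(x, y_i, y):
            # Smoothed per-feature average over the rows whose label is y_i
            p = x[y==y_i].sum(0)
            return (p+1) / ((y==y_i).sum()+1)

        # Log-count ratio r: how much more each feature is associated with class 1 than class 0
        self._r = sparse.csr_matrix(np.log(pr(x,1,y) / pr(x,0,y)))
        # Scale every feature by r, then fit the logistic regression on the scaled features
        x_nb = x.multiply(self._r)
        self._clf = LogisticRegression(C=self.C, dual=self.dual, n_jobs=self.n_jobs).fit(x_nb, y)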
        return self

According to the paper referenced by Jeremy, the motivation for this approach is that the SVM component (here a logistic regression) performs better on longer text documents, whereas naive Bayes is better on shorter snippets of text.

By combining the two approaches we can increase accuracy across documents of varying length, as found here in the toxic comments dataset. 
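
To see what the naive Bayes part contributes, the sketch below reproduces the `pr` helper and the log-count ratio `r` from `NbSvmClassifier.fit` on a tiny dense matrix. The toy data and the use of dense NumPy arrays (instead of the sparse tf-idf matrix) are assumptions made for readability.

```python
import numpy as np

# Toy document-term matrix: 4 documents, 2 features (illustrative values)
x = np.array([[1., 0.],
              [1., 1.],
              [0., 1.],
              [0., 1.]])
y = np.array([1, 1, 0, 0])

def pr(x, y_i, y):
    # Smoothed per-feature average over the documents with label y_i
    p = x[y == y_i].sum(axis=0)
    return (p + 1) / ((y == y_i).sum() + 1)

# Log-count ratio: positive for features seen more in class 1, negative otherwise
r = np.log(pr(x, 1, y) / pr(x, 0, y))

# The logistic regression is then trained on the features scaled by r
x_nb = x * r

print(r)
print(x_nb)
```

Feature 0 only occurs in class 1 documents and gets a positive weight, while feature 1 occurs mostly in class 0 and gets a negative one. The logistic regression therefore starts from features that already encode the naive Bayes evidence.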

In [4]:
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=1,
               smooth_idf=1, sublinear_tf=1 )

X = vec.fit_transform(X_text)

In [16]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8)

In [17]:
nbsvm = NbSvmClassifier(C=4, dual=True, n_jobs=-1)

In [18]:
models = OneVsRestClassifier(nbsvm) 
models.fit(X_train, Y_train)

Y_test_prob = models.predict_proba(X_test)

In [19]:
log_loss_multilabel(Y_test, Y_test_prob)

0.05314303268580043
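
The `log_loss_multilabel` helper comes from the local `evaluation` module, whose source is not shown here. Assuming it averages the binary log loss over the six label columns (as the competition metric did), a rough equivalent might look like the following. The function name `mean_columnwise_log_loss` and the toy inputs are hypothetical.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_prob, eps=1e-15):
    """Hypothetical stand-in for log_loss_multilabel: the binary log loss
    is computed per label column and then averaged over the columns."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_entry = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    per_class = per_entry.mean(axis=0)  # one log loss per label
    return per_class.mean()             # average over labels

# Toy example with two documents and two labels
Y_toy = [[1, 0], [0, 1]]
P_toy = [[0.9, 0.1], [0.2, 0.8]]
toy_loss = mean_columnwise_log_loss(Y_toy, P_toy)
print(toy_loss)
```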

So we can replicate the improved performance recorded in the original notebook. This approach is just an l2-regularized logistic regression that uses tf-idf values as features, with each feature additionally multiplied by the naive Bayes log-count ratio `r` (described as a prior in the notebook).

Next we will investigate how much benefit is gained from using this "multinomial prior" vs. tf-idf values alone.

In [20]:
class SvmClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, C=1.0, dual=False, n_jobs=1):
        self.C = C
        self.dual = dual
        self.n_jobs = n_jobs

    def predict(self, x):
        # Verify that model has been fit
        check_is_fitted(self, ['_clf'])
        return self._clf.predict(x)

    def predict_proba(self, x):
        # Verify that model has been fit
        check_is_fitted(self, ['_clf'])
        return self._clf.predict_proba(x)

    def fit(self, x, y):
        # Check that X and y have correct shape
        x, y = check_X_y(x, y, accept_sparse=True)

        self._clf = LogisticRegression(C=self.C, dual=self.dual, n_jobs=self.n_jobs).fit(x, y)
        return self

In [23]:
lr = LogisticRegression(C=4, dual=True, n_jobs=-1)

In [24]:
models = OneVsRestClassifier(lr) 
models.fit(X_train, Y_train)

Y_test_prob = models.predict_proba(X_test)
                                  
log_loss_multilabel(Y_test, Y_test_prob)

0.053508391182241773

We can achieve a very similar log loss without the "multinomial prior" (0.0535 vs. 0.0531), so we will continue without it for now: this keeps the model simpler with no noticeable decrease in performance.

#### UPDATE: New train/test set released by Kaggle. Now re-evaluating models on latest data.

In [26]:
df = pd.read_csv('../data/train_new.csv', na_filter=False)
X_text = df['comment_text'].values
Y = df[toxic_classes].values
ids = df['id']

In [27]:
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=1,
               smooth_idf=1, sublinear_tf=1 )

X = vec.fit_transform(X_text)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8)

In [28]:
nbsvm = NbSvmClassifier(C=4, dual=True, n_jobs=-1)

models = OneVsRestClassifier(nbsvm) 
models.fit(X_train, Y_train)

Y_test_prob = models.predict_proba(X_test)

log_loss_multilabel(Y_test, Y_test_prob)

0.052406458093759718

Now re-running without the "multinomial prior":

In [29]:
lr = LogisticRegression(C=4, dual=True, n_jobs=-1)

models = OneVsRestClassifier(lr) 
models.fit(X_train, Y_train)

Y_test_prob = models.predict_proba(X_test)

log_loss_multilabel(Y_test, Y_test_prob)

0.05196229289823228

We will now submit this latter model to the leaderboard. In the next notebook we will test more of the hyperparameters of the logistic regression and the vectorizer to see if we can increase performance and/or decrease complexity.