Naive Bayes combined with logistic regression, based on Jeremy Howard's notebook [here](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline-eda-0-052-lb)

Also refining the Naive Bayes hyperparameters.
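
For orientation, the Naive Bayes part of the combined model can be sketched as a per-feature log-count ratio that rescales the features before a logistic regression is fitted. The snippet below is a minimal sketch under stated assumptions, not the notebook's exact code: the function name `log_count_ratio`, the `alpha` smoothing parameter and the toy data are hypothetical, and dense NumPy arrays stand in for the sparse TF-IDF matrix.

```python
import numpy as np

def log_count_ratio(X, y, alpha=1.0):
    """Log ratio of smoothed per-feature averages in positive vs negative rows.

    Hypothetical helper for illustration; alpha is an assumed smoothing term.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos = (X[y == 1].sum(axis=0) + alpha) / ((y == 1).sum() + alpha)
    neg = (X[y == 0].sum(axis=0) + alpha) / ((y == 0).sum() + alpha)
    return np.log(pos / neg)

# Toy data (made up): 4 documents, 3 features, one binary label.
X_toy = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 1, 0]])
y_toy = np.array([1, 1, 0, 0])

r = log_count_ratio(X_toy, y_toy)
X_nb = X_toy * r  # rescaled features that a logistic regression would be fitted on
print(r)
```

Features seen mostly in positive rows get a positive ratio, features seen mostly in negative rows get a negative one, and features that are equally common in both get a ratio of zero.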

In [None]:
from __future__ import division, print_function  # __future__ imports must come first

import sys
sys.path.append('..')  # so that the local `evaluation` module can be imported

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_selection import chi2, mutual_info_classif, SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import binarize 
from sklearn.metrics import log_loss
import pandas as pd
import numpy as np

from evaluation import multilabel_log_loss

Define the classes and load the train/test data. Some input text is "N/A", so turn na_filter off to prevent it being converted to NaN.

In [None]:
toxic_classes = [
    'toxic', 'severe_toxic', 'obscene', 
    'threat', 'insult', 'identity_hate' 
]

df = pd.read_csv('../data/train.csv', na_filter=False)
X_train_text = df['comment_text'].values
Y_train = df[toxic_classes].values
id_train = df['id']

df = pd.read_csv('../data/test.csv', na_filter=False)
X_test_text = df['comment_text'].values
id_test = df['id']

del(df)

We will first try to improve on m000 by changing the standard hyperparameters and pre-processing for our simple MultinomialNB model, and later implement the combined model shown in the Jeremy Howard notebook.
Overview of differences between jhoward 

In [None]:
tvec = TfidfVectorizer(ngram_range=(1, 2),
                       min_df=3, max_df=0.9, strip_accents='unicode',
                       use_idf=True, smooth_idf=True, sublinear_tf=True)

X_train = tvec.fit_transform(X_train_text)
X_test = tvec.transform(X_test_text)

del(X_train_text)
del(X_test_text)

print('number of features:', len(tvec.vocabulary_))

The competition metric is the mean of the per-class log loss. We compute it across 10 folds to get our baseline score (the scores are negated log losses, so values closer to zero are better).

In [51]:
cv_scores = cross_validate_multilabel(MultinomialNB(), X_train, Y_train, cv=10, scoring='neg_log_loss')
np.mean(cv_scores), np.std(cv_scores)

(-0.18786538228255142, 0.0080973807398286302)
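
The metric described above can be sketched directly in NumPy. This is an illustrative sketch only: the name `mean_columnwise_log_loss`, the `eps` clipping value and the toy arrays are assumptions, and it is not the `multilabel_log_loss` helper imported from the local `evaluation` module.

```python
import numpy as np

def mean_columnwise_log_loss(Y_true, Y_prob, eps=1e-15):
    """Binary log loss per class (column), averaged over all classes."""
    Y_true = np.asarray(Y_true, dtype=float)
    # Clip to avoid log(0) for fully confident predictions.
    P = np.clip(np.asarray(Y_prob, dtype=float), eps, 1 - eps)
    per_class = -np.mean(Y_true * np.log(P) + (1 - Y_true) * np.log(1 - P), axis=0)
    return per_class.mean()

# Toy example (made up): 3 records, 2 classes.
Y_true_toy = np.array([[1, 0], [0, 0], [1, 1]])
Y_prob_toy = np.array([[0.9, 0.1], [0.2, 0.1], [0.8, 0.7]])
score = mean_columnwise_log_loss(Y_true_toy, Y_prob_toy)
print(score)
```

A `neg_log_loss` score is the negative of this quantity, which is why the cross-validated mean above is below zero.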

Fit the models on all of the training data, then apply them to the test data to get the probabilities for submission.

We use OneVsRestClassifier to fit one model per class, as MultinomialNB cannot handle a multi-label input by default.

In [53]:
nb_model = MultinomialNB()
ovr_nb = OneVsRestClassifier(nb_model)
ovr_nb.fit(X_train, Y_train) 
ovr_nb.estimators_

[MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)]

In [56]:
Y_test_prob = ovr_nb.predict_proba(X_test)
Y_test_prob

array([[  2.52842777e-05,   1.52804881e-10,   1.26619762e-05,
          2.38209407e-10,   3.36363407e-06,   3.01831396e-08],
       [  6.66552593e-16,   3.46862042e-30,   3.85420157e-18,
          5.95280927e-22,   6.38911831e-20,   9.44841693e-21],
       [  5.13770247e-19,   1.04641141e-42,   9.34651433e-23,
          1.00644397e-41,   7.25965785e-23,   6.35388326e-38],
       ..., 
       [  3.06263881e-06,   5.02402659e-11,   9.02998535e-07,
          2.34925660e-11,   5.77075874e-07,   1.24178475e-09],
       [  1.64277107e-02,   1.97501507e-02,   2.54775188e-02,
          2.17313876e-02,   2.30510110e-02,   3.94433234e-02],
       [  1.78417474e-21,   8.62141034e-55,   2.28563059e-26,
          4.45122550e-60,   3.05508745e-29,   4.89934988e-51]])

Note that the predict_proba function returns normalised probabilities for each one-vs-rest binary problem.

In other words, we don't use the raw joint term `P(x|y)•P(y)` calculated by MultinomialNB. These values are very small, and they are not true posterior probabilities because they omit the "evidence" term P(x) from Bayes' rule, which can be dropped when comparing classes because it is the same for every class. Instead, the values are normalised so that the probabilities for each record sum to 1 within each binary class decision: `P(yi|xi) + P(not(yi)|xi) ≈ 1`.

Because this is a multilabel classification, the one-vs-rest estimator normalises probabilities within each class and not across _all_ classes, so the probabilities across all classes do not have to sum to 1. For example, we can have `P(y1|xi) ≈ P(y2|xi) ≈ 1`, which gives a high probability of that record belonging to multiple classes. Equally, a record can have low probabilities across all classes, suggesting it does not belong to any of them (many training examples exhibit this).
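
The normalisation described above can be illustrated with a small NumPy sketch. The function name `normalise_binary` and the joint log-likelihood values are made up for illustration; the sketch assumes each label is treated as its own binary problem with outcomes (not-class, class).

```python
import numpy as np

def normalise_binary(joint_log_likelihood):
    """Convert log[P(x|y) P(y)] for (not-class, class) into probabilities."""
    jll = np.asarray(joint_log_likelihood, dtype=float)
    # Log-sum-exp over the two outcomes gives log P(x), the "evidence" term.
    m = jll.max(axis=-1, keepdims=True)
    log_evidence = m + np.log(np.exp(jll - m).sum(axis=-1, keepdims=True))
    return np.exp(jll - log_evidence)

# Hypothetical joint log-likelihoods for ONE record under three labels;
# the last axis is (not-class, class).
jll_toy = np.array([[-120.0, -118.0],
                    [-119.5, -117.0],
                    [-110.0, -125.0]])

probs = normalise_binary(jll_toy)
positive = probs[:, 1]  # P(label | x) for each of the three labels

print(probs.sum(axis=1))  # each binary problem sums to 1
print(positive.sum())     # the sum across labels is not constrained to 1
```

In this toy case the first two labels both get a high probability and the third gets a very low one, so the per-label probabilities add up to more than 1.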

In [59]:
df_submit = pd.concat([id_test, pd.DataFrame(Y_test_prob, columns=toxic_classes)], axis=1)
df_submit.head()

         id         toxic  severe_toxic       obscene        threat        insult  identity_hate
0   6044863  2.528428e-05  1.528049e-10  1.266198e-05  2.382094e-10  3.363634e-06   3.018314e-08
1   6102620  6.665526e-16  3.468620e-30  3.854202e-18  5.952809e-22  6.389118e-20   9.448417e-21
2  14563293  5.137702e-19  1.046411e-42  9.346514e-23  1.006444e-41  7.259658e-23   6.353883e-38
3  21086297  6.786617e-03  1.692325e-05  1.369010e-03  6.565989e-05  7.297767e-04   5.367645e-06
4  22982444  2.723515e-02  4.956793e-03  2.108912e-02  4.873167e-03  1.588704e-02   1.861962e-03


In [60]:
df_submit.to_csv('../results/m000.csv', index=False)