Creating a simple Naive Bayes model to get a baseline score

In [1]:
# __future__ imports must come before any other code in the cell
from __future__ import division, print_function

import sys
sys.path.append('..')

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

Define the classes and load the train/test data. Some input text is the literal string "N/A", so turn `na_filter` off to prevent it being converted to NaN.

In [2]:
toxic_classes = [
    'toxic', 'severe_toxic', 'obscene', 
    'threat', 'insult', 'identity_hate' 
]

df = pd.read_csv('../data/train.csv', na_filter=False)
# single column containing comment strings 
X_train_text = df['comment_text'].values
# matrix of shape (n_sample, n_classes) containing class indicator variables 
Y_train = df[toxic_classes].values
id_train = df['id']

df = pd.read_csv('../data/test.csv', na_filter=False)
X_test_text = df['comment_text'].values
id_test = df['id']

del(df)

Build sparse feature matrices (one column per token). The vectorizer fitted on the training data is reused on the test data, so that the test features are the same as the training features.
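As a small illustration of reusing a fitted vectorizer, the sketch below uses two invented training comments and one invented test comment (none of them from the competition data). It shows that `transform` keeps the training columns and silently drops tokens that were not seen during `fit_transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy comments, invented for illustration only.
train_docs = ["you are great", "you are rude"]
test_docs = ["you are awful and rude"]

vec = CountVectorizer()
train_counts = vec.fit_transform(train_docs)  # learns the vocabulary
test_counts = vec.transform(test_docs)        # reuses it; unseen tokens are dropped

print(sorted(vec.vocabulary_))                # tokens learned from the training text
print(train_counts.shape, test_counts.shape)  # same number of columns
print(test_counts.toarray())                  # counts for known tokens only
```

The tokens "awful" and "and" never appear in the training text, so they have no column in either matrix.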

It is clear why the two sparse matrices need to align, but regardless of this we would want to exclude features not seen in the training data anyway. Such features would carry nothing learned from training, yet they _would_ enlarge the denominator of the smoothed multinomial `P(xi|y)` estimate (which grows with the vocabulary size) and so shift the estimates for the other "live" features.
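To make the denominator effect concrete, here is a minimal sketch of the Laplace-smoothed multinomial estimate, `(count + alpha) / (class_total + alpha * vocab_size)`. The function name and all counts are hypothetical and chosen only for illustration; the point is that adding vocabulary entries lowers the estimate for a token whose own counts have not changed.

```python
def smoothed_likelihood(token_count, class_total, vocab_size, alpha=1.0):
    """Laplace-smoothed multinomial estimate of P(x_i | y)."""
    return (token_count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical counts: a token seen 30 times among 1,000 tokens of one class.
train_only = smoothed_likelihood(30, 1000, vocab_size=5000)
# Same token and counts, but with 1,000 extra (test-only) vocabulary entries.
with_extra = smoothed_likelihood(30, 1000, vocab_size=6000)

print(train_only)  # 31 / 6000
print(with_extra)  # 31 / 7000, smaller although the token's counts are unchanged
```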

In [3]:
cvec = CountVectorizer()
X_train = cvec.fit_transform(X_train_text)
X_test = cvec.transform(X_test_text)

del(X_train_text)
del(X_test_text)

print('number of features:', len(cvec.vocabulary_))

number of features: 139171


In [4]:
def cross_validate_multilabel(model, X, Y, **cv_kwargs):
    """cross validation for a multi-label target"""
    # scores_per_class is an ndarray of shape (n_classes, n_folds)
    scores_per_class = np.array([cross_val_score(model, X, y, **cv_kwargs) for y in Y.T])
    # return average score across folds, for each class
    return scores_per_class.mean(axis=1)

def multilabel_results(cv_scores, class_labels, aggregate=True, index=None):
    df = pd.DataFrame([cv_scores], columns=class_labels, index=index)
    if aggregate:
        df['all'] = df.mean(axis=1)
    return df

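The competition metric is the mean of the log loss across all classes. We compute it over 10 folds to give us our baseline score. The scores are reported as negative log loss (`scoring='neg_log_loss'`), so values closer to zero are better.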
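For reference, the metric described above can be sketched in plain Python as the binary log loss of each class column, averaged over the columns. The function names and the toy labels and probabilities below are illustrative assumptions, not part of the competition tooling.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average binary log loss, clipping probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def mean_columnwise_log_loss(Y_true, Y_prob):
    """Mean of the per-class log losses for (n_samples, n_classes) nested lists."""
    n_classes = len(Y_true[0])
    losses = [
        binary_log_loss([row[j] for row in Y_true], [row[j] for row in Y_prob])
        for j in range(n_classes)
    ]
    return sum(losses) / n_classes

# Toy example: predicting 0.5 everywhere costs log(2) for every class.
Y_true = [[1, 0], [0, 0]]
Y_prob = [[0.5, 0.5], [0.5, 0.5]]
score = mean_columnwise_log_loss(Y_true, Y_prob)
print(score)
```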

N.B. The folds will be different for each class (model), because stratified folds are the default for classifiers and the stratification depends on each class's labels. In future it would be preferable to find a way to stratify across all classes, so that the averaged log loss comes from models trained on the same data.
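One possible way to at least share the same folds across classes is to pass a single pre-built splitter as `cv`. The sketch below uses a plain shuffled `KFold`, which is an assumption of this example and is not stratified, so a rare class may end up with very few positives in some folds. The helper name and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

def cross_validate_shared_folds(model, X, Y, n_splits=10, seed=0,
                                scoring='neg_log_loss'):
    """Score each class column on identical (unstratified) folds."""
    # A fixed integer random_state makes every call to split() yield the same folds.
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    per_class = np.array(
        [cross_val_score(model, X, y, cv=folds, scoring=scoring) for y in Y.T]
    )
    return per_class.mean(axis=1)

# Synthetic stand-in data: 60 samples, 5 count features, 3 balanced labels.
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 5, size=(60, 5))
Y_toy = np.array([(np.arange(60) + j) % 2 for j in range(3)]).T

scores = cross_validate_shared_folds(MultinomialNB(), X_toy, Y_toy, n_splits=3)
print(scores)
```

The result has the same shape as the output of `cross_validate_multilabel`, so it could be passed to `multilabel_results` unchanged.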

In [5]:
cv_scores = cross_validate_multilabel(MultinomialNB(), X_train, Y_train, cv=10, scoring='neg_log_loss')
multilabel_results(cv_scores, toxic_classes)

       toxic  severe_toxic   obscene     threat     insult  identity_hate        all
0  -0.531283     -0.183396  -0.41944  -0.118345  -0.450686      -0.222666  -0.320969


Fit the models on all the training data, then apply them to the test data to give us our probabilities for submission.

We use OneVsRestClassifier to fit one model per class, as MultinomialNB cannot handle a multi-label input by default.

In [6]:
nb_model = MultinomialNB()
ovr_nb = OneVsRestClassifier(nb_model)
ovr_nb.fit(X_train, Y_train) 
ovr_nb.estimators_

[MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)]

In [7]:
Y_test_prob = ovr_nb.predict_proba(X_test)
Y_test_prob

array([[  1.78686108e-005,   1.90288426e-011,   5.72171323e-006,
          3.47230459e-012,   1.21433115e-006,   1.63990509e-009],
       [  8.11093685e-030,   2.61409997e-083,   3.94468214e-039,
          1.49574877e-099,   1.26741258e-044,   1.55645858e-073],
       [  4.40003800e-029,   6.16082809e-082,   7.82543846e-038,
          1.07465287e-098,   2.06179104e-040,   5.31569640e-080],
       ..., 
       [  1.04167968e-006,   5.83827617e-013,   1.64828968e-007,
          1.34969600e-014,   7.39746231e-008,   6.53275870e-012],
       [  5.72750836e-003,   4.33806004e-004,   5.94946763e-003,
          1.01509543e-004,   4.55276950e-003,   1.10152721e-003],
       [  1.34294945e-040,   1.56448653e-130,   3.23802221e-055,
          8.54112961e-168,   2.05750382e-061,   6.77445083e-127]])

Note that `predict_proba` returns probabilities that are normalised within each one-vs-rest binary decision.

In other words, we do not use the raw value `P(x|y)•P(y)` calculated by MultinomialNB. These values are very small, and they are not true posterior probabilities anyway, because they omit the "evidence" term `P(x)` of the Bayes equation, which is the same for every class. Instead, each value is normalised so that the "probabilities" for each record sum to ~1 within each binary class decision: `P(yi|xi) + P(not(yi)|xi) ≈ 1`.

Because this is a multi-label classification, the one-vs-rest estimator normalises the probability within each class and not across _all_ classes, so the probabilities across all classes do not have to sum to 1. For example, we can have `P(y1|xi) = P(y2|xi) ≈ 1`, which gives a high probability of that record belonging to multiple classes. Equally, a record can have low probabilities for every class, suggesting it does not belong to any of them (many training examples exhibit this characteristic).
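The normalisation described above can be sketched with a log-sum-exp over the two outcomes of a single binary decision. The helper name and the log values below are hypothetical, chosen to resemble the very small joint likelihoods of a long comment; this is an illustration of the idea rather than scikit-learn's internal code.

```python
import math

def normalise_binary(log_joint_neg, log_joint_pos):
    """Convert log(P(x|y) * P(y)) for the two outcomes into probabilities."""
    m = max(log_joint_neg, log_joint_pos)  # subtract the max for numerical stability
    log_evidence = m + math.log(
        math.exp(log_joint_neg - m) + math.exp(log_joint_pos - m)
    )
    return (math.exp(log_joint_neg - log_evidence),
            math.exp(log_joint_pos - log_evidence))

# Hypothetical log joint values for one record under two separate labels.
toxic_neg, toxic_pos = normalise_binary(-260.0, -250.0)
insult_neg, insult_pos = normalise_binary(-258.0, -251.0)

print(toxic_neg + toxic_pos)    # ~1 within the 'toxic' decision
print(insult_neg + insult_pos)  # ~1 within the 'insult' decision
print(toxic_pos + insult_pos)   # close to 2: no normalisation across labels
```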

In [8]:
df_submit = pd.concat([id_test, pd.DataFrame(Y_test_prob, columns=toxic_classes)], axis=1)
df_submit.head()

         id         toxic  severe_toxic       obscene        threat        insult  identity_hate
0   6044863  1.786861e-05  1.902884e-11  5.721713e-06  3.472305e-12  1.214331e-06   1.639905e-09
1   6102620  8.110937e-30    2.6141e-83  3.944682e-39  1.495749e-99  1.267413e-44   1.556459e-73
2  14563293  4.400038e-29  6.160828e-82  7.825438e-38  1.074653e-98  2.061791e-40   5.315696e-80
3  21086297    0.01447875  1.218987e-07  0.0005627875  7.577484e-09  0.0002140137   2.541342e-08
4  22982444    0.02433156   0.002481446    0.01626303   0.001194813    0.01136432   0.0007060328


In [9]:
df_submit.to_csv('../results/m000.csv', index=False)