## Toxic comments analysis

In [48]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
import matplotlib.pyplot as plt
import string
import scipy
import keras
import sklearn
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.metrics import AUC
from tensorflow.keras.models import load_model

### Read data

Read the cleaned data and split it into features (comments) and labels.

In [2]:
# Input data
train_df = pd.read_csv('data/train_clean.csv')
X_train = train_df['comment_clean']
y_train = train_df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

### Build sklearn pipeline and tune hyperparameters

We need only a small improvement in model performance, less than 1%, to reach the required AUC score. A hyperparameter search, which explores a multidimensional parameter space to find the best model, is usually expensive in both computation and time. I will therefore focus on a few parameters that I think can improve the model's performance. Judging by the learning curves of the initial model, the neural network fit the data quite easily. Most likely the input feature space was too small to retain all the necessary information from the comments, so increasing the number of LSA dimensions (`n_dim` below, `n_components` in sklearn's `TruncatedSVD`) should help improve the model's performance.
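
The reasoning about LSA dimensions can be checked on a small scale. The sketch below is only an illustration on a hypothetical toy corpus (`toy_comments` and `retained_variance` are not part of the notebook): it fits `TruncatedSVD` with a few values of `n_components` and reports how much of the TFIDF variance each setting retains. On the real data the numbers will differ, but the trend (more components retain more information) is what motivates the search.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the notebook itself uses train_df['comment_clean'].
toy_comments = [
    "you are a wonderful editor thank you",
    "this article needs more reliable sources",
    "stop vandalising the page you idiot",
    "i will report you to the admins",
    "please read the talk page before reverting",
    "what a stupid and useless edit",
    "thanks for fixing the citation format",
    "nobody cares about your opinion",
]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_comments)


def retained_variance(n_components):
    """Fraction of the TFIDF variance kept by an LSA projection of this size."""
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    svd.fit(tfidf)
    return svd.explained_variance_ratio_.sum()


variance_by_dim = {k: retained_variance(k) for k in (2, 4, 6)}
for k, v in variance_by_dim.items():
    print(f"n_components={k}: retained variance ratio = {v:.3f}")
```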

For hyperparameter tuning I will use an sklearn pipeline. Since we are using the TFIDF and LSA implementations from sklearn, the pipeline incorporates these algorithms easily. For the Keras NN model I need to define an estimator, and for this I will use the Keras wrapper for sklearn.

In [3]:
# Define model
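def create_model(l1_dim=10, lsa_dim=10, dropout=0.):
    # lsa_dim must equal the number of LSA components fed into the network
    model = Sequential()
    model.add(Dense(l1_dim, input_dim=lsa_dim, activation='relu'))
    # Dropout after the hidden layer; a rate of 0 disables it.
    # The grid search below tunes this through 'nn__dropout'.
    model.add(Dropout(dropout))
    # One sigmoid output per label (multi-label classification)
    model.add(Dense(6, activation='sigmoid'))

    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[AUC(name='auc')])
    return model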

# Build keras estimator
keras_estimator = KerasClassifier(build_fn=create_model, verbose=3, epochs=20)

Next we create the pipeline. We provide cleaned comments as an input to the pipeline. Inside the pipeline the data goes through TFIDF, LSA and NN steps and we get predictions as an output.

In [4]:
pipeline_estimator = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')), 
                               ('lsa', TruncatedSVD()), 
                               ('nn', keras_estimator)])

We also need to define which parameters to search over. For the parameter search I will use grid search. Mainly I will vary the number of LSA dimensions, the TFIDF vocabulary size (`max_features`), the hidden layer size and the dropout rate.

Initial model's parameters:
- TFIDF: max_features=10000
- LSA: n_dim=10
- NN: hidden_layer_dim=10

In [18]:
# The grid is split into 3 dictionaries because the NN input size depends on the number of LSA dimensions,
# and I could not find a more convenient way to specify dependent parameters in a grid search :)

param_grid = [ 
    {
        'tfidf__max_features': [10000, 30000, 100000],
        'lsa__n_components': [30],
        'nn__lsa_dim': [30],
        'nn__l1_dim': [20, 30, 50],
        'nn__dropout': [0.5, 0.3, 0.]
    },
    {
        'tfidf__max_features': [10000, 30000, 100000],
        'lsa__n_components': [50],
        'nn__lsa_dim': [50],
        'nn__l1_dim': [20, 30, 50],
        'nn__dropout': [0.5, 0.3, 0.]
    },
    {
        'tfidf__max_features': [10000, 30000, 100000],
        'lsa__n_components': [100],
        'nn__lsa_dim': [100],
        'nn__l1_dim': [20, 30, 50],
        'nn__dropout': [0.5, 0.3, 0.]
    }]

       
kfold_splits = 5
grid = GridSearchCV(estimator=pipeline_estimator,  
                    scoring='roc_auc',
                    n_jobs=5, 
                    verbose=1,
                    return_train_score=True,
                    cv=kfold_splits,
                    param_grid=param_grid)
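
One possible way to avoid writing the three dictionaries by hand is to generate them in a list comprehension, so that `lsa__n_components` and `nn__lsa_dim` always stay in sync. This is only a sketch: the names `lsa_dims` and `param_grid_generated` are introduced here for illustration, and `ParameterGrid` is used just to count the candidates before starting a long search.

```python
from sklearn.model_selection import ParameterGrid

# LSA sizes to try; each one produces its own sub-grid (hypothetical name)
lsa_dims = [30, 50, 100]

param_grid_generated = [
    {
        'tfidf__max_features': [10000, 30000, 100000],
        'lsa__n_components': [dim],
        'nn__lsa_dim': [dim],  # must match lsa__n_components
        'nn__l1_dim': [20, 30, 50],
        'nn__dropout': [0.5, 0.3, 0.0],
    }
    for dim in lsa_dims
]

# Count candidates up front: 3 sub-grids x (3 * 3 * 3) combinations each
n_candidates = len(ParameterGrid(param_grid_generated))
n_fits = n_candidates * 5  # 5-fold cross-validation
print(n_candidates, "candidates,", n_fits, "fits")
```

With 5-fold cross-validation this grid amounts to 81 candidates and 405 fits, which shows why the full search is expensive.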


In [79]:
  
param_grid = [ 
    {
        'tfidf__max_features': [10000],
        'lsa__n_components': [100],
        'nn__lsa_dim': [100]
    }]

kfold_splits = 5
grid = GridSearchCV(estimator=pipeline_estimator,  
                    scoring='roc_auc',
                    n_jobs=5, 
                    verbose=1,
                    return_train_score=True,
                    cv=kfold_splits,
                    param_grid=param_grid)

In [80]:
grid_result = grid.fit(X_train, y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done   2 out of   5 | elapsed:  5.5min remaining:  8.2min
[Parallel(n_jobs=5)]: Done   5 out of   5 | elapsed:  5.5min finished


Train on 159565 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [81]:
# Show hyperparameter tuning results
models_df = pd.DataFrame(grid_result.cv_results_['mean_test_score'], columns=['mean_auc_score'])
models_df['std'] = grid_result.cv_results_['std_test_score']
models_df = pd.concat((models_df, pd.DataFrame(grid_result.cv_results_['params'])), axis=1)

models_df.sort_values(by=['mean_auc_score'], ascending=False)

| | mean_auc_score | std | lsa__n_components | nn__lsa_dim | tfidf__max_features |
|---|---|---|---|---|---|
| 0 | 0.958446 | 0.00102 | 100 | 100 | 10000 |


Save best trained model

In [82]:
# Workaround to save Keras model that is within sklearn pipeline

# Save Keras model separately
grid_result.best_estimator_.named_steps['nn'].model.save('models/keras_model.h5')

# Remove keras model from pipeline and save the rest of pipeline
grid_result.best_estimator_.named_steps['nn'].model = None
joblib.dump(grid_result.best_estimator_, 'models/pipeline.pkl')

['models/pipeline.pkl']

In [83]:
# Test loading the pipeline
pipeline = joblib.load('models/pipeline.pkl')
pipeline.named_steps['nn'].model = load_model('models/keras_model.h5')

AttributeError: 'GridSearchCV' object has no attribute 'named_steps'
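
The error message suggests that the object loaded from disk was a `GridSearchCV` rather than a `Pipeline`. One possible explanation (an assumption, the notebook does not confirm it) is that an earlier version of the save cell pickled the whole search object. The sketch below uses a hypothetical toy pipeline (`toy_pipeline`, `toy_grid`) with a plain sklearn classifier instead of the Keras model to show the difference: only the `best_estimator_` of the search exposes `named_steps`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data and pipeline standing in for the TFIDF/LSA/NN pipeline
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
toy_pipeline = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression())])
toy_grid = GridSearchCV(toy_pipeline, {'clf__C': [0.1, 1.0]}, cv=3).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    grid_path = os.path.join(tmp_dir, 'grid.pkl')
    pipe_path = os.path.join(tmp_dir, 'pipeline.pkl')
    joblib.dump(toy_grid, grid_path)                  # whole search object
    joblib.dump(toy_grid.best_estimator_, pipe_path)  # fitted pipeline only
    loaded_grid = joblib.load(grid_path)
    loaded_pipe = joblib.load(pipe_path)

# The search object has no named_steps; the fitted pipeline does
print(type(loaded_grid).__name__, hasattr(loaded_grid, 'named_steps'))
print(type(loaded_pipe).__name__, hasattr(loaded_pipe, 'named_steps'))
# From a loaded search object, the pipeline is reachable via best_estimator_
print(type(loaded_grid.best_estimator_).__name__)
```

If the pickle does hold a search object, `loaded.best_estimator_.named_steps['nn']` is the place to reattach the Keras model.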

### Checking performance on test data

To actually understand how the model would perform with unseen data, we need to evaluate on test data. This will be the result that we will compare with the benchmark.

In [68]:
# Read data
X_test = pd.read_csv('data/test.csv')
y_test = pd.read_csv('data/test_labels.csv')

X_test.head()

| | id | comment_text |
|---|---|---|
| 0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
| 1 | 0000247867823ef7 | == From RfC == \n\n The title is fine as it is... |
| 2 | 00013b17ad220c46 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... |
| 3 | 00017563c3f7919a | :If you have a look back at the source, the in... |
| 4 | 00017695ad8997eb | I don't anonymously edit articles at all. |


In [69]:
# Points with value of -1 were not used for evaluation
y_test.shape

(153164, 7)

In [70]:
# join comments with labels and remove rows where labels are '-1'
test_df = X_test.merge(y_test, on='id')
test_df_filtered = test_df[test_df['toxic'] != -1]
test_df_filtered.shape

(63978, 8)
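
The join-and-filter step can be illustrated on a tiny example. The frames below (`toy_comments`, `toy_labels`) are made up for illustration. The sketch also checks the `-1` marker across all six label columns rather than only `toxic`, which is a slightly stricter variant of the filter used above, and passes `validate='one_to_one'` so that duplicate ids would raise an error instead of silently multiplying rows.

```python
import pandas as pd

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical stand-ins for test.csv and test_labels.csv
toy_comments = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'comment_text': ['first', 'second', 'third'],
})
toy_labels = pd.DataFrame(
    [['a', 0, 0, 0, 0, 0, 0],
     ['b', -1, -1, -1, -1, -1, -1],  # row excluded from scoring
     ['c', 1, 0, 1, 0, 1, 0]],
    columns=['id'] + label_cols,
)

# Join comments with labels; fail loudly if ids are not unique on both sides
merged = toy_comments.merge(toy_labels, on='id', how='inner', validate='one_to_one')

# Keep only rows where no label column carries the -1 marker
scored = merged[(merged[label_cols] != -1).all(axis=1)]
print(scored.shape)
print(scored['id'].tolist())
```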

In [71]:
X_test_filtered = test_df_filtered['comment_text']
y_test_filtered = test_df_filtered[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

In [72]:
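def clean_comments(df):
    # df is a pandas Series of raw comment strings

    # Lowercase
    df_clean = df.str.lower()

    # Replace every whitespace character (tabs, newlines, ...) with a space
    df_clean = df_clean.str.replace(r'\s', ' ', regex=True)

    # Keep only lowercase ASCII letters, spaces and apostrophes
    df_clean = df_clean.str.replace(r"[^a-z ']+", '', regex=True)

    # Collapse repeated spaces and trim the ends; unlike building a new
    # Series from a list, this keeps the index of the filtered test rows
    df_clean = df_clean.str.split().str.join(' ')

    return df_clean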
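
To see what the cleaning steps do, here is a small self-contained sketch. `clean_series` is a hypothetical helper that applies the same lowercase and regex steps as `clean_comments`; it normalizes spaces with `.str.split().str.join(' ')`, which keeps the Series index. The two input strings are made up.

```python
import pandas as pd


def clean_series(comments):
    """Lowercase, replace whitespace with spaces, keep only a-z, space and apostrophe."""
    cleaned = comments.str.lower()
    cleaned = cleaned.str.replace(r'\s', ' ', regex=True)
    cleaned = cleaned.str.replace(r"[^a-z ']+", '', regex=True)
    # Collapse runs of spaces and strip the ends, preserving the index
    return cleaned.str.split().str.join(' ')


# Non-default index, as in a filtered DataFrame
raw = pd.Series(["Hello,\n\nWORLD!  It's 2020.", "== Sources ==\n* Zawe Ashton"], index=[10, 42])
cleaned = clean_series(raw)

print(cleaned.tolist())
print(cleaned.index.tolist())
```

Digits and punctuation are dropped entirely, so a token such as `2020` disappears rather than being replaced by a placeholder.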

In [73]:
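def eval_auc(y, pred):
    # y: DataFrame with one column per label
    # pred: array of predicted probabilities, shape (n_samples, 6)
    auc_list = []
    for i in range(6):
        # ROC AUC for label i, computed from its ROC curve
        fpr, tpr, _ = sklearn.metrics.roc_curve(y.iloc[:, i].values, pred[:, i])
        auc_list.append(sklearn.metrics.auc(fpr, tpr))

    print('AUC for each class:')
    print(np.array(auc_list))
    print('Mean AUC:', np.mean(auc_list))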
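
The per-label loop above can also be expressed with `sklearn.metrics.roc_auc_score`. The sketch below uses random synthetic labels and scores (`y_true`, `y_score` are made up) to check that the two approaches agree, and that the mean of the per-label values equals the macro average.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Synthetic multi-label data: 200 samples, 6 binary labels
rng = np.random.default_rng(0)
n_samples, n_labels = 200, 6
y_true = rng.integers(0, 2, size=(n_samples, n_labels))
# Scores loosely correlated with the labels, plus noise
y_score = 0.6 * y_true + rng.normal(0.2, 0.3, size=y_true.shape)

# Manual computation, one ROC curve per label (as in eval_auc)
per_label_manual = np.array([
    auc(*roc_curve(y_true[:, i], y_score[:, i])[:2])
    for i in range(n_labels)
])

# Same values in one call; average=None returns one AUC per label
per_label = roc_auc_score(y_true, y_score, average=None)
macro = roc_auc_score(y_true, y_score, average='macro')

print('AUC for each class:', np.round(per_label, 4))
print('Mean AUC:', round(float(macro), 4))
```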

In [78]:
X_test_clean = clean_comments(X_test_filtered)
pred_test = pipeline.predict_proba(X_test_clean)

NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.

In [59]:
# Evaluate
eval_auc(y_test_filtered, pred_test)

NameError: name 'pred_test' is not defined

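In my run the model performed well on the test data and scored above the set benchmark. The error outputs captured in the cells above come from an out-of-order execution (note the cell numbers), so the AUC values themselves are not displayed here.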