# Train Toxicity Model

This notebook trains a model to detect toxicity in online comments. It uses a CNN architecture for text classification, trained on the [Wikipedia Talk Labels: Toxicity dataset](https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973) with pre-trained GloVe embeddings, which can be found at:
http://nlp.stanford.edu/data/glove.6B.zip
(source page: http://nlp.stanford.edu/projects/glove/).

This model is a modification of [example code](https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py) from the [Keras GitHub repository](https://github.com/fchollet/keras), which is released under an [MIT license](https://github.com/fchollet/keras/blob/master/LICENSE). The full license text is available [online](https://github.com/fchollet/keras/blob/master/LICENSE) or in this repository in the file KERAS_LICENSE.

## Usage Instructions
(TODO: nthain) - Move to README

Prior to running the notebook, you must:

* Download the [Wikipedia Talk Labels: Toxicity dataset](https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973)
* Download pre-trained [GloVe embeddings](http://nlp.stanford.edu/data/glove.6B.zip)
* (optional) To skip the training step, you will need to download a model and tokenizer file. We are looking into the appropriate means for distributing these (sometimes large) files.
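
For orientation, each line of a GloVe text file (for example `glove.6B.100d.txt` inside the zip) holds a token followed by its vector components, separated by spaces. A minimal sketch of parsing that layout into a dictionary is shown here; the function name `load_glove` and the tiny in-memory sample are hypothetical and are not taken from `model_tool`.

```python
import io


def load_glove(lines):
    """Parse GloVe-format lines into a {token: [float, ...]} dict."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Tiny made-up sample in the same layout (real vectors are much longer).
sample = io.StringIO(u"the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")
vectors = load_glove(sample)
print(vectors['cat'])
```

With the real download, the same function could be given an open file handle instead of the in-memory sample.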

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import pandas as pd

from model_tool import ToxModel

Using TensorFlow backend.


HELLO from model_tool


## Load Data

In [2]:
SPLITS = ['train', 'dev', 'test']

wiki = {}
debias = {}
random = {}
for split in SPLITS:
    wiki[split] = '../data/wiki_%s.csv' % split
    debias[split] = '../data/wiki_debias_%s.csv' % split
    random[split] = '../data/wiki_debias_random_%s.csv' % split
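
The training calls below read the `comment` and `is_toxic` columns from these files, so it can help to confirm the columns exist before starting a long run. The following is a small sketch of such a check, written for Python 3 with the standard library only; `missing_columns`, the demo file, and the `rev_id` column in it are hypothetical and used only for illustration.

```python
import csv
import os
import tempfile


def missing_columns(csv_path, required=('comment', 'is_toxic')):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline='') as f:
        header = next(csv.reader(f), [])
    return [name for name in required if name not in header]


# Demonstration on a throwaway file with made-up rows.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'wiki_demo.csv')
with open(demo_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['rev_id', 'comment', 'is_toxic'])
    writer.writerow([1, 'thanks for the edit', False])

print(missing_columns(demo_path))
```

An empty list means every required column is present; the same call could be applied to each path in `wiki`, `debias`, and `random`.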

## Train Models
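
The hyperparameters printed in the training logs below describe three convolution blocks (`cnn_kernel_sizes`, `cnn_pooling_sizes`) applied to sequences of `max_sequence_length` = 250 tokens. As a rough worked example, this sketch traces the sequence length through those blocks. It assumes every convolution and pooling layer uses 'same' padding with a pooling stride equal to the pool size; that padding choice is an assumption about `ToxModel`, not something this notebook states, and `trace_lengths` is a hypothetical helper.

```python
import math


def trace_lengths(seq_len, pool_sizes):
    """Sequence length after each conv + max-pool block, assuming 'same' padding.

    With 'same' padding a stride-1 convolution leaves the length unchanged,
    and a pooling layer with stride == pool size yields ceil(length / pool).
    """
    lengths = []
    for pool in pool_sizes:
        seq_len = int(math.ceil(seq_len / float(pool)))
        lengths.append(seq_len)
    return lengths


print(trace_lengths(250, [5, 5, 40]))  # [50, 10, 1]
```

Under that assumption, the final pooling layer of size 40 collapses the remaining 10 positions into a single one, which would act like a global max pool over the sequence.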

### Random model

In [3]:
# Train ten models (name suffixes 100-109). After the loop,
# debias_random_model refers to the last model trained.
for i in range(100, 110):
    MODEL_NAME = 'cnn_debias_random_tox_v3_{}'.format(i)
    debias_random_model = ToxModel()
    debias_random_model.train(random['train'], random['dev'], text_column='comment',
                              label_column='is_toxic', model_name=MODEL_NAME)

Hyperparameters
---------------
max_num_words: 10000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 5e-05
es_patience: 1
batch_size: 128
embedding_dim: 100
epochs: 20
cnn_filter_sizes: [128, 128, 128]
cnn_kernel_sizes: [5, 5, 5]
max_sequence_length: 250
stop_early: True
embedding_trainable: False

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 99157 samples, validate on 33283 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 0.16887, saving model to ../models/cnn_debias_random_tox_v3_100_model.h5
9s - loss: 0.2342 - acc: 0.9189 - val_loss: 0.1689 - val_acc: 0.9387
Epoch 2/20
Epoch 00001: val_loss improved from 0.16887 to 0.14691, saving model to ../models/cnn_debias_random_tox_v3_100_model.h5
8s - loss: 0.1607 - acc: 0.9409 - val_loss: 0.1469 - val_acc: 0.9461
Epoch 3/20
Epoch 00002: val_loss improved from 0.14691

Epoch 00007: val_loss did not improve
8s - loss: 0.1058 - acc: 0.9612 - val_loss: 0.1452 - val_acc: 0.9432
Epoch 9/20
Epoch 00008: val_loss did not improve
8s - loss: 0.1007 - acc: 0.9630 - val_loss: 0.1291 - val_acc: 0.9560
Epoch 00008: early stopping
Model trained!
Best model saved to ../models/cnn_debias_random_tox_v3_103_model.h5
Loading best model from checkpoint...
Model loaded!
Hyperparameters
---------------
max_num_words: 10000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 5e-05
es_patience: 1
batch_size: 128
embedding_dim: 100
epochs: 20
cnn_filter_sizes: [128, 128, 128]
cnn_kernel_sizes: [5, 5, 5]
max_sequence_length: 250
stop_early: True
embedding_trainable: False

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 99157 samples, validate on 33283 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 0.18119,

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 99157 samples, validate on 33283 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 0.17391, saving model to ../models/cnn_debias_random_tox_v3_107_model.h5
9s - loss: 0.2413 - acc: 0.9161 - val_loss: 0.1739 - val_acc: 0.9367
Epoch 2/20
Epoch 00001: val_loss improved from 0.17391 to 0.14993, saving model to ../models/cnn_debias_random_tox_v3_107_model.h5
8s - loss: 0.1666 - acc: 0.9390 - val_loss: 0.1499 - val_acc: 0.9452
Epoch 3/20
Epoch 00002: val_loss improved from 0.14993 to 0.13633, saving model to ../models/cnn_debias_random_tox_v3_107_model.h5
8s - loss: 0.1457 - acc: 0.9463 - val_loss: 0.1363 - val_acc: 0.9490
Epoch 4/20
Epoch 00003: val_loss improved from 0.13633 to 0.13114, saving model to ../models/cnn_debias_random_tox_v3_107_model.h5
8s - loss: 0.1333 - acc: 0.9509 - val_loss: 0.1311 - val_acc: 0.9511

In [4]:
random_test = pd.read_csv(random['test'])
debias_random_model.score_auc(random_test['comment'], random_test['is_toxic'])

0.95278138776342269
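
The value above is the score returned by `score_auc`. Assuming it is the standard ROC AUC (the notebook does not show the implementation), it can be read as the probability that a randomly chosen toxic comment receives a higher score than a randomly chosen non-toxic one. A minimal pure-Python sketch of that definition, using made-up labels and scores and a hypothetical `roc_auc` helper:

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This O(n_pos * n_neg) loop is for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError('need at least one positive and one negative label')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Here three of the four positive/negative pairs are ordered correctly, giving 0.75. For datasets of the size used in this notebook, a sort-based implementation from an established library would be the practical choice.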

### Plain wikipedia model

In [5]:
# Train ten models (name suffixes 100-109). After the loop,
# wiki_model refers to the last model trained.
for i in range(100, 110):
    MODEL_NAME = 'cnn_wiki_tox_v3_{}'.format(i)
    wiki_model = ToxModel()
    wiki_model.train(wiki['train'], wiki['dev'], text_column='comment',
                     label_column='is_toxic', model_name=MODEL_NAME)

Hyperparameters
---------------
max_num_words: 10000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 5e-05
es_patience: 1
batch_size: 128
embedding_dim: 100
epochs: 20
cnn_filter_sizes: [128, 128, 128]
cnn_kernel_sizes: [5, 5, 5]
max_sequence_length: 250
stop_early: True
embedding_trainable: False

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 95692 samples, validate on 32128 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 0.18188, saving model to ../models/cnn_wiki_tox_v3_100_model.h5
9s - loss: 0.2384 - acc: 0.9154 - val_loss: 0.1819 - val_acc: 0.9354
Epoch 2/20
Epoch 00001: val_loss improved from 0.18188 to 0.15620, saving model to ../models/cnn_wiki_tox_v3_100_model.h5
8s - loss: 0.1705 - acc: 0.9373 - val_loss: 0.1562 - val_acc: 0.9416
Epoch 3/20
Epoch 00002: val_loss improved from 0.15620 to 0.14465, savin

Epoch 00003: val_loss improved from 0.13900 to 0.13266, saving model to ../models/cnn_wiki_tox_v3_103_model.h5
8s - loss: 0.1350 - acc: 0.9501 - val_loss: 0.1327 - val_acc: 0.9506
Epoch 5/20
Epoch 00004: val_loss did not improve
8s - loss: 0.1260 - acc: 0.9530 - val_loss: 0.1342 - val_acc: 0.9522
Epoch 6/20
Epoch 00005: val_loss improved from 0.13266 to 0.12538, saving model to ../models/cnn_wiki_tox_v3_103_model.h5
8s - loss: 0.1188 - acc: 0.9565 - val_loss: 0.1254 - val_acc: 0.9531
Epoch 7/20
Epoch 00006: val_loss improved from 0.12538 to 0.12468, saving model to ../models/cnn_wiki_tox_v3_103_model.h5
8s - loss: 0.1131 - acc: 0.9580 - val_loss: 0.1247 - val_acc: 0.9537
Epoch 8/20
Epoch 00007: val_loss did not improve
8s - loss: 0.1069 - acc: 0.9606 - val_loss: 0.1276 - val_acc: 0.9547
Epoch 9/20
Epoch 00008: val_loss did not improve
8s - loss: 0.1012 - acc: 0.9625 - val_loss: 0.1312 - val_acc: 0.9508
Epoch 00008: early stopping
Model trained!
Best model saved to ../models/cnn_wiki_to

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 95692 samples, validate on 32128 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 0.18286, saving model to ../models/cnn_wiki_tox_v3_107_model.h5
10s - loss: 0.2392 - acc: 0.9162 - val_loss: 0.1829 - val_acc: 0.9341
Epoch 2/20
Epoch 00001: val_loss improved from 0.18286 to 0.15677, saving model to ../models/cnn_wiki_tox_v3_107_model.h5
8s - loss: 0.1661 - acc: 0.9387 - val_loss: 0.1568 - val_acc: 0.9417
Epoch 3/20
Epoch 00002: val_loss improved from 0.15677 to 0.15115, saving model to ../models/cnn_wiki_tox_v3_107_model.h5
8s - loss: 0.1476 - acc: 0.9456 - val_loss: 0.1512 - val_acc: 0.9432
Epoch 4/20
Epoch 00003: val_loss improved from 0.15115 to 0.14173, saving model to ../models/cnn_wiki_tox_v3_107_model.h5
8s - loss: 0.1355 - acc: 0.9497 - val_loss: 0.1417 - val_acc: 0.9493
Epoch 5/20
Epoch 00004: val_loss i

In [6]:
wiki_test = pd.read_csv(wiki['test'])
wiki_model.score_auc(wiki_test['comment'], wiki_test['is_toxic'])

0.95696629963337654
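
The training logs above end with `early stopping` once `val_loss` fails to improve on consecutive epochs, in line with `es_patience: 1` and `es_min_delta: 0`. The sketch below replays the validation losses logged for `cnn_wiki_tox_v3_103` (epochs 4 to 9) through a simplified patience rule. The rule is modeled on what the log shows, namely stopping at the second consecutive non-improving epoch; it is an assumption rather than code from `ToxModel` or Keras, and `stopping_index` is a hypothetical name.

```python
def stopping_index(val_losses, patience=1, min_delta=0.0):
    """Index at which training stops, or None if it runs to the end.

    Stops at the first non-improving value seen after `patience`
    non-improving values have already accumulated since the last best.
    """
    best = float('inf')
    wait = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            if wait >= patience:
                return i
            wait += 1
    return None


# val_loss values for epochs 4-9 of cnn_wiki_tox_v3_103, copied from the log.
losses = [0.1327, 0.1342, 0.1254, 0.1247, 0.1276, 0.1312]
print(stopping_index(losses))  # 5, the sixth value (Epoch 9/20)
```

The single non-improving epoch 5 does not trigger a stop because epoch 6 improves and resets the counter; epochs 8 and 9 both fail to improve, so training ends at epoch 9 and the checkpoint from epoch 7 remains the best saved model.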

### Debiased model

In [7]:
# Train ten models (name suffixes 100-109). After the loop,
# debias_model refers to the last model trained.
for i in range(100, 110):
    MODEL_NAME = 'cnn_debias_tox_v3_{}'.format(i)
    debias_model = ToxModel()
    debias_model.train(debias['train'], debias['dev'], text_column='comment',
                       label_column='is_toxic', model_name=MODEL_NAME)

Hyperparameters
---------------
max_num_words: 10000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 5e-05
es_patience: 1
batch_size: 128
embedding_dim: 100
epochs: 20
cnn_filter_sizes: [128, 128, 128]
cnn_kernel_sizes: [5, 5, 5]
max_sequence_length: 250
stop_early: True
embedding_trainable: False

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 99157 samples, validate on 33283 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 0.17132, saving model to ../models/cnn_debias_tox_v3_100_model.h5
10s - loss: 0.2267 - acc: 0.9213 - val_loss: 0.1713 - val_acc: 0.9376
Epoch 2/20
Epoch 00001: val_loss improved from 0.17132 to 0.15116, saving model to ../models/cnn_debias_tox_v3_100_model.h5
9s - loss: 0.1631 - acc: 0.9409 - val_loss: 0.1512 - val_acc: 0.9446
Epoch 3/20
Epoch 00002: val_loss improved from 0.15116 to 0.14667, 

Epoch 00008: val_loss did not improve
9s - loss: 0.1020 - acc: 0.9629 - val_loss: 0.1244 - val_acc: 0.9545
Epoch 00008: early stopping
Model trained!
Best model saved to ../models/cnn_debias_tox_v3_103_model.h5
Loading best model from checkpoint...
Model loaded!
Hyperparameters
---------------
max_num_words: 10000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 5e-05
es_patience: 1
batch_size: 128
embedding_dim: 100
epochs: 20
cnn_filter_sizes: [128, 128, 128]
cnn_kernel_sizes: [5, 5, 5]
max_sequence_length: 250
stop_early: True
embedding_trainable: False

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 99157 samples, validate on 33283 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 0.17438, saving model to ../models/cnn_debias_tox_v3_104_model.h5
11s - loss: 0.2409 - acc: 0.9163 - val_loss: 0.1744 - val_acc: 0.93

Epoch 00007: val_loss did not improve
9s - loss: 0.1047 - acc: 0.9612 - val_loss: 0.1289 - val_acc: 0.9545
Epoch 9/20
Epoch 00008: val_loss did not improve
9s - loss: 0.0995 - acc: 0.9633 - val_loss: 0.1321 - val_acc: 0.9551
Epoch 00008: early stopping
Model trained!
Best model saved to ../models/cnn_debias_tox_v3_107_model.h5
Loading best model from checkpoint...
Model loaded!
Hyperparameters
---------------
max_num_words: 10000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 5e-05
es_patience: 1
batch_size: 128
embedding_dim: 100
epochs: 20
cnn_filter_sizes: [128, 128, 128]
cnn_kernel_sizes: [5, 5, 5]
max_sequence_length: 250
stop_early: True
embedding_trainable: False

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 99157 samples, validate on 33283 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 0.17738, saving

In [8]:
debias_test = pd.read_csv(debias['test'])
# ToxModel has no prep_data_and_score attribute (calling it raises AttributeError);
# score_auc is the scoring method used for the other two models above.
debias_model.score_auc(debias_test['comment'], debias_test['is_toxic'])