# Baseline
We trained a simple one-layer RNN with LSTM units and a pretrained word-embedding layer. The model was trained for 5 epochs and achieved the following metrics:

**Train average AUROC:** 0.984

**Validation average AUROC:** 0.98

**Test average AUROC:** 0.968

In [0]:
import json
import csv
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# tf-related
import keras
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras import regularizers
from keras.layers import Embedding
from keras.initializers import Constant

Using TensorFlow backend.


In [0]:
from google.colab import drive 
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
# execution hyperparameters
MAX_WORDS = 100000
VALIDATION_SPLIT = .2
EMBEDDING_DIM = 100
DATA_PATH = 'gdrive/My Drive/Toxic Comments/data/'
glove_path = DATA_PATH + "glove.6B/glove.6B.100d.txt" # file name specifies dimension of embedding space

# settings not referenced by the cells below
embedding_dim = 100
max_length = 16
trunc_type = 'post'
padding_type = 'post'
oov_tok = "<OOV>"
training_size = 160000
test_portion = .1
corpus = []


## Tokenization and sequence formation

In [0]:
# obtain training 'labels' and 'sentences'
train = pd.read_csv(DATA_PATH + "train.csv")
labels = train[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].values.tolist()
labels = np.array(labels)
sentences = train["comment_text"].values.tolist()

In [0]:
# word tokenization
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print("Words in training set: "+str(len(word_index)))

Words in training set: 210337


The training set has 210,337 unique tokens, more than twice the `MAX_WORDS` = 100,000 cap we impose for computational reasons, so the rarest words are dropped when sequences are built. This calls for further investigation. For now, let's look at the distribution of sentence lengths to decide on a reasonable maximum length.
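
The length inspection itself is not shown in this notebook. A minimal sketch of how it could be done, assuming length is measured in tokens per sequence; the helper name `length_percentiles` and the `toy_sequences` list are illustrative stand-ins for the output of `tokenizer.texts_to_sequences(sentences)`:

```python
import numpy as np

def length_percentiles(sequences, percentiles=(50, 90, 95, 99)):
    """Return {percentile: length} for a list of token-id sequences."""
    lengths = np.array([len(seq) for seq in sequences])
    return {p: float(np.percentile(lengths, p)) for p in percentiles}

# toy stand-in for tokenizer.texts_to_sequences(sentences)
toy_sequences = [[1] * n for n in range(1, 101)]  # lengths 1..100
summary = length_percentiles(toy_sequences)
print(summary)
```

On the real data, the 90th-percentile entry is the number to compare against the chosen maximum length.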

In [0]:
# cutting around the 90th percentile seems reasonable given the distribution's long tail
MAX_SEQ_LENGTH = 150

In [0]:
# make the sequences uniform to pass them to a network
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=MAX_SEQ_LENGTH)
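
With its default arguments, `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`), so comments longer than `MAX_SEQ_LENGTH` keep their last tokens. A pure-Python sketch of that default behavior; `pad_pre` is a hypothetical helper written for illustration and is not part of Keras:

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pad_sequences defaults: keep the last `maxlen` items, left-pad with `value`."""
    kept = seq[-maxlen:]                          # truncating='pre' drops items from the front
    return [value] * (maxlen - len(kept)) + kept  # padding='pre' adds the fill value in front

print(pad_pre([5, 6, 7], maxlen=5))           # short sequence: zeros in front
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # long sequence: first two tokens dropped
```

If the beginning of a comment is expected to matter more than its end, passing `truncating='post'` to `pad_sequences` is the alternative.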

## Data split

In [0]:
# split the data into a training set and a validation set
indices = np.arange(padded.shape[0])
np.random.shuffle(indices)
padded = padded[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * padded.shape[0])

x_train = padded[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = padded[-num_validation_samples:]
y_val = labels[-num_validation_samples:]
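
As a sanity check on the split arithmetic, the sketch below reproduces the same shuffle-and-slice logic on dummy data. The sample count of 159,571 is inferred from the training log (127,657 training plus 31,914 validation samples); the helper name `shuffle_split` and the fixed seed are illustrative additions, not part of the notebook:

```python
import numpy as np

def shuffle_split(x, y, val_fraction, seed=0):
    """Shuffle x and y with one shared permutation, then hold out the tail for validation."""
    rng = np.random.RandomState(seed)   # fixed seed so the split is reproducible
    idx = rng.permutation(x.shape[0])
    x, y = x[idx], y[idx]
    n_val = int(val_fraction * x.shape[0])
    return x[:-n_val], y[:-n_val], x[-n_val:], y[-n_val:]

n_samples = 159571                      # inferred from the training log
x_dummy = np.arange(n_samples).reshape(-1, 1)
y_dummy = np.arange(n_samples)
x_tr, y_tr, x_va, y_va = shuffle_split(x_dummy, y_dummy, val_fraction=0.2)
print(x_tr.shape[0], x_va.shape[0])
```

The notebook's own shuffle is unseeded, so the rows that land in the validation set differ between runs; seeding as above is one way to make results comparable across runs.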

## Transfer learning
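We followed [this tutorial from the Keras documentation](https://keras.io/examples/pretrained_word_embeddings/) to reuse a pretrained GloVe embedding as a frozen input layer. The `glove.6B` vectors used here were trained on Wikipedia 2014 and Gigaword 5 (about 6 billion tokens); the forum-like 'netnews' messages mentioned in the tutorial are its classification dataset (20 Newsgroups), not the embedding's training data.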

In [0]:
# load pretrained embedding matrix (implemented as an index for memory efficiency)
embeddings_index = {}
with open(glove_path, encoding="utf-8") as f:  # GloVe files are UTF-8 encoded
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [0]:
# prepare embedding matrix
num_words = min(MAX_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index keep their all-zero row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQ_LENGTH,
                            trainable=False)
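
Because words missing from GloVe keep an all-zero row, it can be useful to measure how much of the vocabulary actually receives a pretrained vector. A minimal sketch on toy data; `embedding_coverage` is a hypothetical helper, and the tiny `toy_word_index` and `toy_embeddings` dictionaries stand in for `word_index` and `embeddings_index`:

```python
import numpy as np

def embedding_coverage(word_index, embeddings_index, max_words):
    """Fraction of the vocabulary (indices 1..max_words) that has a pretrained vector."""
    in_vocab = [w for w, i in word_index.items() if i <= max_words]
    found = sum(1 for w in in_vocab if w in embeddings_index)
    return found / len(in_vocab)

toy_word_index = {"the": 1, "cat": 2, "zzxqy": 3, "rare": 4}
toy_embeddings = {"the": np.zeros(3), "cat": np.ones(3), "rare": np.ones(3)}
coverage = embedding_coverage(toy_word_index, toy_embeddings, max_words=3)
print(coverage)
```

A low coverage on user-generated comments would suggest that many misspellings and slang terms are represented by zero vectors.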


## RNN training

In [0]:
# Keras-ready average AUROC
from avg_auroc import AvgAurocCallback, avg_auroc_metric
avg_auroc_callback = AvgAurocCallback(x_train, y_train, x_val, y_val)
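
The `avg_auroc` module is a local helper whose source is not included here. Judging by its name, the metric is presumably the mean of the per-label ROC AUC scores; the sketch below shows that computation under this assumption, using a rank-based (Mann-Whitney) AUROC in plain NumPy. The function names `auroc` and `average_auroc` are hypothetical and are not the module's actual API:

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney U statistic; tied scores receive average ranks."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(y_score, kind="mergesort")
    sorted_scores = y_score[order]
    ranks = np.empty(len(y_score), dtype=float)
    i = 0
    while i < len(sorted_scores):
        j = i
        while j + 1 < len(sorted_scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank of the tie group
        i = j + 1
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

def average_auroc(y_true, y_score):
    """Mean AUROC over label columns (assumes every column contains both classes)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    return float(np.mean([auroc(y_true[:, k], y_score[:, k])
                          for k in range(y_true.shape[1])]))

y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]])
print(average_auroc(y_true, y_score))
```

In this toy example the first label is ranked perfectly and the second gets three of four positive/negative pairs right, so the average sits between the two.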

In [0]:
!pip install git+https://www.github.com/keras-team/keras-contrib.git

Collecting git+https://www.github.com/keras-team/keras-contrib.git
  Cloning https://www.github.com/keras-team/keras-contrib.git to /tmp/pip-req-build-6pggm_2_
  Running command git clone -q https://www.github.com/keras-team/keras-contrib.git /tmp/pip-req-build-6pggm_2_
Building wheels for collected packages: keras-contrib
  Building wheel for keras-contrib (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-dsbzoc8p/wheels/11/27/c8/4ed56de7b55f4f61244e2dc6ef3cdbaff2692527a2ce6502ba
Successfully built keras-contrib


In [0]:
from keras_contrib.layers.capsule import Capsule

In [0]:
model = keras.Sequential([
    embedding_layer,
    keras.layers.Bidirectional(
        keras.layers.GRU(128, activation='relu', dropout=0.25, 
                         recurrent_dropout=0.25, return_sequences=True)),
    Capsule(num_capsule=10, dim_capsule=16, routings=5, 
            activation='sigmoid', share_weights=True),
    keras.layers.Flatten(),
    keras.layers.Dropout(rate=0.25),
    keras.layers.Dense(6, activation='sigmoid')
])

In [0]:
model.compile(loss='binary_crossentropy', optimizer='adam')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 150, 100)          10000100  
_________________________________________________________________
bidirectional_1 (Bidirection (None, 150, 256)          175872    
_________________________________________________________________
capsule_1 (Capsule)          (None, 10, 16)            40960     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 966       
Total params: 10,217,898
Trainable params: 217,798
Non-trainable params: 10,000,100
__________________________________________________________
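
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below assumes a GRU with one bias vector per weight block (which matches the reported 175,872) and a shared-weight capsule kernel mapping the input features to `num_capsule * dim_capsule` outputs (which matches 40,960); the variable names are illustrative:

```python
vocab_rows, emb_dim = 100001, 100   # MAX_WORDS + 1 rows, EMBEDDING_DIM columns
units = 128                         # GRU units per direction
num_capsule, dim_capsule = 10, 16
n_labels = 6

embedding = vocab_rows * emb_dim
# three weight blocks (update gate, reset gate, candidate state), each with
# input weights, recurrent weights and one bias vector
gru_one_direction = 3 * (units * (emb_dim + units) + units)
bidirectional = 2 * gru_one_direction
# shared weights: a single kernel from the 2 * units features to all capsules
capsule = (2 * units) * (num_capsule * dim_capsule)
dense = (num_capsule * dim_capsule) * n_labels + n_labels

trainable = bidirectional + capsule + dense
print(embedding, bidirectional, capsule, dense, trainable, embedding + trainable)
```

The embedding accounts for almost all parameters but is frozen, so only about 2% of the weights are updated during training.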

In [0]:
num_epochs = 6
history = model.fit(x_train, y_train, epochs=num_epochs, verbose=1, batch_size=128, validation_data=(x_val, y_val),
                   callbacks=[avg_auroc_callback])

Train on 127657 samples, validate on 31914 samples
Epoch 1/6

Train avg_auroc: 0.970, Val avg_auroc: 0.969
Epoch 2/6

Train avg_auroc: 0.979, Val avg_auroc: 0.977
Epoch 3/6

Train avg_auroc: 0.982, Val avg_auroc: 0.979
Epoch 4/6

Train avg_auroc: 0.983, Val avg_auroc: 0.981
Epoch 5/6

Train avg_auroc: 0.986, Val avg_auroc: 0.983
Epoch 6/6

Train avg_auroc: 0.987, Val avg_auroc: 0.984


In [0]:
history.history['roc_train'] = avg_auroc_callback.roc_train
history.history['roc_val'] = avg_auroc_callback.roc_val

## Make a submission

In [0]:
test = pd.read_csv(DATA_PATH + "test.csv")
test_sentences = test.pop("comment_text").values.tolist()

In [0]:
# make the sequences uniform to pass them to a network
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, maxlen=MAX_SEQ_LENGTH)

In [0]:
# make predictions
y_test_hat = model.predict(test_padded)

In [0]:
# write submission file
y_test_hat = pd.DataFrame(data=y_test_hat, columns=["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
test = test.join(y_test_hat)
test.to_csv("tuned_capsule_biderectional_gru_submission.csv", index=False)

## Store results
We'll store the final `History` object to make graphs and the model weights to further train this model: after 6 epochs the training and validation average AUROC are still close (0.987 vs. 0.984) and both are still rising, which suggests the model hasn't reached its full potential.

In [0]:
import pickle

In [0]:
# the context manager closes the file even if pickling fails
with open("tuned_capsule_b-gru_history", "wb") as hist_file:
    pickle.dump(history.history, hist_file)
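
To reuse the stored history for plotting later, the pickle can be read back the same way it was written. A minimal round-trip sketch using a temporary file and a made-up `toy_history` dictionary in place of `history.history`; the file name and the loss values are illustrative only:

```python
import os
import pickle
import tempfile

toy_history = {"loss": [0.06, 0.05], "roc_val": [0.969, 0.977]}  # stand-in values

path = os.path.join(tempfile.mkdtemp(), "toy_history.pkl")
with open(path, "wb") as f:   # the file is closed even if dump raises
    pickle.dump(toy_history, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_history)
```

The cells shown here do not save the weights; in Keras, `model.save_weights(path)` is the usual call for that, with a path of your choosing.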