# Toxic Comment Classification with Deep Learning

A CTAWG demo with PyTorch, fast.ai, and Keras. [View the Toxic Comment Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) on Kaggle.

In Colab: Click View -> Table of Contents to show an outline to the left.

Steps:
- Sign up for Kaggle account
- Download data from Kaggle
- Clean data
- Fine-tune language model
- Build classifier
- Generate predictions
- Submit to leaderboard (using Kaggle CLI)


Todo:
- Replace hate speech example prediction text with a hateful comment from the training data (or test).
- Fix manual install of fast.ai module which doesn't persist across workspace restarts in FloydHub
- Fix processed data being saved into the data-raw folder
- Multi-output model for all 6 outcomes (fast.ai lesson 8/9 in part 2)
   - Also see https://forums.fast.ai/t/creating-a-multi-label-torchtext-dataset/11960
- Generate predictions for all models (see bottom of https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline)
- Compare to BERT, GPT-2, standard LSTM/CNN, etc.
- Handle class imbalance (or just ignore?)
- Get working on Benten, home desktop, and XSEDE

## Prepare environment

### Setup Google Colab

Go to Runtime -> Change runtime type and under "Hardware Accelerator" select "GPU".

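Alternatively, use FloydHub: training is much faster there (about 7 minutes per language-model epoch on a Tesla V100, versus about 46 on a Colab GPU), although it is expensive.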

The cell below should return `True` if the GPU is working:

In [None]:
# Test pytorch GPU access.
import torch
torch.cuda.is_available()

### Setup Kaggle Account and Data

Using https://www.kaggle.com/general/51898 as a guide.

Go to [Kaggle](https://www.kaggle.com)
- Create an account, or log in to your existing account
- Click "My Account"
- Click the "Create New API Token" button
- This will automatically download a "kaggle.json" file.
- Run the cell below and upload that file:

In [None]:
# Run this in Colab; on FloydHub, upload the file via JupyterLab instead.
# Run this cell, then browse to your downloaded kaggle.json and upload it.
from google.colab import files
files.upload()

Set up the Kaggle profile. Code via https://www.kaggle.com/general/51898

In [None]:
!mkdir -p ~/.kaggle && mv kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
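# Caution: this prints your API key into the cell output. Clear the output before sharing the notebook.
!cat ~/.kaggle/kaggle.json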
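
Before calling the Kaggle CLI, it can help to confirm that the credentials file is well-formed and private. The following is a minimal sketch, assuming the token file is a JSON object with `username` and `key` fields; `check_kaggle_credentials` is a hypothetical helper written for this notebook, not part of the Kaggle package.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]

    # Assumption: the token is a JSON object with "username" and "key" fields.
    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing field: {field}")

    # Mirror the `chmod 600` step: no access for group or others (POSIX only).
    if os.name == "posix":
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            problems.append(f"permissions too open: {oct(mode)}")
    return problems
```

Calling `check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json")` returns an empty list when nothing looks wrong, without echoing the key itself.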

### Install python packages

In [None]:
# Install kaggle packages (not needed on floydhub)
# !pip3 install -q kaggle kaggle-cli
# Already installed: pytorch 1.0.1, torchvision, numpy, fast.ai, tensorflow-gpu, keras

In [None]:
# Install GitHub version of fast.ai
!git clone https://github.com/fastai/fastai
# Run this in a single line because "!cd" does not persist across lines.
# NOTE: on FloydHub I had to manually edit 3 files in the tools/ directory to specify python 3.6 rather than 3.5
!cd fastai && tools/run-after-git-clone && pip install -e ".[dev]"

Download the data files for this competition. [See kaggle API documentation](https://github.com/Kaggle/kaggle-api) for more details.

In [None]:
# Store our kaggle data in the data-raw subdirectory to stay organized.
# Also create a folder for processed data.
!mkdir -p data-raw data

# This will download zipfiles of sample_submission.csv, test.csv, train.csv, test_labels.csv into the current directory
!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge -p data-raw

# We need to quote the "*" in order for unzip to work on multiple files.
# -o: overwrite existing files without prompting
# -q: quiet mode
!unzip -oq 'data-raw/*.zip' -d data-raw

# Check the number of lines in each of these new csv files.
!wc -l data-raw/*.csv

## Preprocessing

In [1]:
# The star import below also brings in pandas as pd.
# If this fails, make sure that fastai is installed from GitHub as shown above.
import fastai
from fastai.text import *

# Should be 1.0.51.dev0, not 1.0.42 (too old)
print(fastai.__version__)

raw_data = Path("data-raw")
data_path = Path("data")

#bs = 8
#bs = 48
#bs = 128 # works on tesla V100 (FloydHub)
bs = 160 # works for LM on tesla V100 (FloydHub)

1.0.51.dev0


In [9]:
df = pd.read_csv(raw_data / "train.csv")

print(df.head(), "\n"*2)

# Last 6 columns are the outcomes.
outcomes = df.columns[-6:]

# Examine the outcome distribution. Identity hate is only 0.9% positive!
print(df[outcomes].mean(axis = 0), "\n")

# We have only 1,405 positive cases vs. 158,166 negative cases :/
print(df["identity_hate"].value_counts())

                 id                                       comment_text  toxic  \
0  0000997932d777bf  Explanation\nWhy the edits made under my usern...      0   
1  000103f0d9cfb60f  D'aww! He matches this background colour I'm s...      0   
2  000113f07ec002fd  Hey man, I'm really not trying to edit war. It...      0   
3  0001b41b1c6bb37e  "\nMore\nI can't make any real suggestions on ...      0   
4  0001d958c54c6e35  You, sir, are my hero. Any chance you remember...      0   

   severe_toxic  obscene  threat  insult  identity_hate  
0             0        0       0       0              0  
1             0        0       0       0              0  
2             0        0       0       0              0  
3             0        0       0       0              0  
4             0        0       0       0              0   


toxic            0.095844
severe_toxic     0.009996
obscene          0.052948
threat           0.002996
insult           0.049364
identity_hate    0.008805
dtype:
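
With so few positive rows, accuracy alone says very little about a classifier for `identity_hate`. The sketch below uses only the two counts quoted in the cell above to compute the accuracy of a trivial model that labels every comment as negative; any trained model should be compared against that figure, which is why AUC and PR-AUC are tracked later on.

```python
# Counts for identity_hate in train.csv, taken from the cell above.
n_positive = 1_405
n_negative = 158_166
n_total = n_positive + n_negative

# Share of positive rows; this is the column mean reported for identity_hate.
prevalence = n_positive / n_total

# Accuracy of a "classifier" that labels every comment as non-hateful.
baseline_accuracy = n_negative / n_total

print(f"rows:              {n_total}")
print(f"prevalence:        {prevalence:.6f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```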

In [10]:
# Take a look at the first comment
print(df["comment_text"][0], "\n")

# And look at some identity_hate rows.
# Trigger warning: this may contain slurs and other offensive language.
print(df[df["identity_hate"] == 1].head(), "\n")

# Look at outcome distribution among the identity hate observations.
print(df[df["identity_hate"] == 1][outcomes].mean(axis = 0))

Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27 

                   id                                       comment_text  \
42   001810bf8c45bf5f  You are gay or antisemmitian? \n\nArchangel WH...   
105  00472b8e2d38d1ea         A pair of jew-hating weiner nazi schmucks.   
176  006b94add72ed61c  I think that your a Fagget get a oife and burn...   
218  008e0818dde894fb  Kill all niggers. \n\nI have hard, that others...   
238  0097dd5c29bf7a15  u r a tw@ fuck off u gay boy.U r smelly.Fuck u...   

     toxic  severe_toxic  obscene  threat  insult  identity_hate  
42       1             0        1       0       1              1  
105      1             0        1       0       1              1  
176      1             0        1       1       1              1  
218     

In [None]:
# This will take about 3 minutes.
# TODO: also include test.csv for the language model training.
data_lm = (TextList.from_csv(raw_data, 'train.csv', cols = "comment_text")
             # Randomly split and keep 10% for validation of the language model.
             .split_by_rand_pct(0.1)
             .label_for_lm()           
             .databunch(bs = bs))
             
# Type needs to be TextLMDataBunch for use in language model fine-tuning.
print(type(data_lm))

# This will be saved into the data-raw directory unfortunately.
data_lm.save('data_lm.pkl')

In [None]:
# The next time we run this notebook, skip above cell and load the preprocessed data.
# If you get an error like "load_data does not exist", your fastai version is too old.
data_lm = load_data(raw_data, 'data_lm.pkl', bs=bs)
# This should be a TextLMDataBunch
type(data_lm)

In [None]:
data_lm.show_batch()

# Examine vocabulary
print(data_lm.vocab.itos[:15])

# 60k tokens in our vocabulary
print(len(data_lm.vocab.itos))
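
To make the `itos` list above more concrete, here is a simplified sketch of how a vocabulary can be built from tokenized text with a size cap and a minimum frequency. It is an illustration only: `build_vocab` and `numericalize` are hypothetical helpers, and fastai's own `Vocab` handles tokenization and several more special tokens than the two assumed here.

```python
from collections import Counter

UNK, PAD = "xxunk", "xxpad"  # special tokens, named after fastai's conventions


def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2):
    """Build index-to-string (itos) and string-to-index (stoi) mappings."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Keep the most frequent tokens that occur at least min_freq times.
    kept = [tok for tok, n in counts.most_common(max_vocab) if n >= min_freq]
    itos = [UNK, PAD] + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to integer ids; unseen or rare tokens fall back to UNK."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


# Tiny made-up corpus, already tokenized.
texts = [["you", "are", "my", "hero"], ["you", "are", "great"], ["you", "rock"]]
itos, stoi = build_vocab(texts, max_vocab=10, min_freq=2)
print(itos)
print(numericalize(["you", "are", "awesome"], stoi))
```

Because ids depend on token frequencies in whatever text was used to build the vocabulary, two vocabularies of the same size built from different samples need not assign the same id to the same token.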

## Language Modeling

In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

# Clear unused GPU memory, to help lr_find().
torch.cuda.empty_cache()

In [None]:
from fastai.utils.mem import gpu_mem_get
gpu_mem_get()

In [None]:
# This can give a RuntimeError if the GPU runs out of memory ("CUDA out of memory").
# If this happens we need to use a smaller batch size.
# GPU RAM usage is also affected by other running notebooks in Google Colab.
# (See Runtime -> "Manage sessions" to delete old sessions.)
# This will take about 1 minute.
learn.lr_find()

In [None]:
learn.recorder.plot(skip_end=12, skip_start=60)

In [None]:
# This will take 46 minutes on Google Colab GPU! Accuracy after epoch 0: 21%
# Takes only 7 minutes per epoch on FloydHub Tesla V100, with batch size of 160!
learn.fit_one_cycle(1, 2e-1, moms=(0.8,0.7))
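
For intuition about what `fit_one_cycle` does with the learning rate and `moms`, the sketch below reproduces a one-cycle schedule in plain Python. The specific defaults (`div_factor=25`, `pct_start=0.3`, cosine annealing, and a final learning rate of `lr_max / (div_factor * 1e4)`) are assumptions about fastai 1.0 and may not match the installed version; `one_cycle` is a hypothetical helper, not a fastai function.

```python
import math


def annealing_cos(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle(step, total_steps, lr_max, moms=(0.8, 0.7),
              div_factor=25.0, pct_start=0.3):
    """Return (learning rate, momentum) at a given training step."""
    warmup_steps = int(total_steps * pct_start)
    lr_start = lr_max / div_factor
    lr_end = lr_max / (div_factor * 1e4)
    if step < warmup_steps:
        # Phase 1: learning rate rises while momentum falls.
        pct = step / warmup_steps
        return (annealing_cos(lr_start, lr_max, pct),
                annealing_cos(moms[0], moms[1], pct))
    # Phase 2: learning rate decays while momentum recovers.
    pct = (step - warmup_steps) / (total_steps - warmup_steps)
    return (annealing_cos(lr_max, lr_end, pct),
            annealing_cos(moms[1], moms[0], pct))


for step in (0, 30, 100):
    lr, mom = one_cycle(step, total_steps=100, lr_max=2e-1)
    print(f"step {step:3d}: lr={lr:.7f} momentum={mom:.3f}")
```

Under these assumptions, `moms=(0.8, 0.7)` means momentum starts at 0.8, drops to 0.7 when the learning rate peaks, and returns to 0.8 by the end of the cycle.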

In [None]:
learn.save('fit_head')

In [None]:
learn.load("fit_head");

In [None]:
learn.unfreeze()

# 51 mins per epoch on Colab GPU, 7.5 per epoch on FloydHub v100
learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))

# 2 mins / X mins
learn.save('fine_tuned')

In [None]:
learn.load('fine_tuned');

In [None]:
#TEXT = "I liked this movie because"
TEXT = "I hate myself for"
N_WORDS = 30
N_SENTENCES = 4

In [None]:
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
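
The `temperature` argument controls how adventurous the sampled text is. One common formulation, sketched below in plain Python, divides the model's scores by the temperature before the softmax; this is an illustration of the idea with made-up scores, not a copy of fastai's implementation, and both function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
for t in (1.0, 0.75, 0.25):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the highest-scoring tokens, so 0.75 trades some variety for more predictable sentences.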

In [None]:

learn.save_encoder('fine_tuned_enc')

## Classification

In [None]:
# Following lesson3-imdb from fast.ai course 1
# https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson3-imdb.ipynb

# Define our outcome column. Other options shown in the cell output above.
outcome = "identity_hate"

# More documentation at https://docs.fast.ai/text.data.html
# This will take about 2-3 minutes to run.
# Other parameters: max_vocab (default 60k), min_freq (default 2), bs (default 64)
data_clas = TextDataBunch.from_csv(raw_data, 'train.csv', text_cols = "comment_text",
                              # max_vocab should match the vocabulary size of the LM encoder.
                              # Smaller values (e.g. 20000 or 50000) save GPU memory but
                              # no longer match the encoder.
                              # Note: a matching size does not guarantee matching token ids;
                              # lesson3-imdb passes vocab = data_lm.vocab for that reason.
                              max_vocab = 60000,
                              # Limit batch size due to GPU memory constraints.
                              label_cols = outcome, bs = bs)

# This takes 45 seconds or so.
data_clas.save('data_clas.pkl')

In [3]:
# 160 bs is too big for FloydHub, 64 seems to work though.
data_clas = load_data(raw_data, 'data_clas.pkl', bs = 64)

In [None]:
data_clas.show_batch()

In [4]:
# AUC callback code via https://forums.fast.ai/t/using-auc-as-metric-in-fastai/38917/7
from sklearn.metrics import roc_auc_score, average_precision_score
# For average precision score see:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score

def auroc_score(input, target):
    input, target = input.cpu().numpy()[:,1], target.cpu().numpy()
    return roc_auc_score(target, input)

class callback_auc(Callback):
    _order = -20 #Needs to run before the recorder

    def __init__(self, learn, **kwargs): self.learn = learn
    def on_train_begin(self, **kwargs): self.learn.recorder.add_metric_names(['auc'])
    def on_epoch_begin(self, **kwargs): self.output, self.target = [], []
    
    def on_batch_end(self, last_target, last_output, train, **kwargs):
        if not train:
            self.output.append(last_output)
            self.target.append(last_target)
                
    def on_epoch_end(self, last_metrics, **kwargs):
        if len(self.output) > 0:
            output = torch.cat(self.output)
            target = torch.cat(self.target)
            preds = F.softmax(output, dim=1)
            metric = auroc_score(preds, target)
            return add_metrics(last_metrics, [metric])
        
def prauc_score(input, target):
    input, target = input.cpu().numpy()[:,1], target.cpu().numpy()
    return average_precision_score(target, input)

class callback_prauc(Callback):
    # Run before recorder but after AUC callback
    _order = -19

    def __init__(self, learn, **kwargs): self.learn = learn
    def on_train_begin(self, **kwargs): self.learn.recorder.add_metric_names(['pr-auc'])
    def on_epoch_begin(self, **kwargs): self.output, self.target = [], []
    
    def on_batch_end(self, last_target, last_output, train, **kwargs):
        if not train:
            self.output.append(last_output)
            self.target.append(last_target)
                
    def on_epoch_end(self, last_metrics, **kwargs):
        if len(self.output) > 0:
            output = torch.cat(self.output)
            target = torch.cat(self.target)
            preds = F.softmax(output, dim=1)
            metric = prauc_score(preds, target)
            return add_metrics(last_metrics, [metric])

# AWD = ASGD Weight-Dropped, ASGD = averaged stochastic gradient descent
# See Merity et al. (2017) Regularizing and optimizing LSTM language models
# Other model options here: https://docs.fast.ai/text.models.html
# Transformer and TransformerXL are the main two alternatives currently.
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5, callback_fns = (callback_auc, callback_prauc))
learn.load_encoder('fine_tuned_enc')
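
The two callbacks above delegate the actual metrics to scikit-learn. To show what those numbers mean, here is a plain-Python sketch of both: ROC AUC as the fraction of positive/negative pairs ranked correctly, and average precision as the mean precision at the rank of each positive example. These are illustrative reimplementations (quadratic time for the AUC, and no tie handling for average precision), not replacements for `roc_auc_score` or `average_precision_score`.

```python
def roc_auc(targets, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(targets, scores) if t == 1]
    neg = [s for t, s in zip(targets, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(targets, scores):
    """Mean of the precision values at the rank of each positive example.

    Assumes all scores are distinct (no tie handling).
    """
    ranked = sorted(zip(scores, targets), reverse=True)
    hits, precisions = 0, []
    for rank, (_, target) in enumerate(ranked, start=1):
        if target == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive example")
    return sum(precisions) / len(precisions)


# Small made-up example: two negatives and two positives.
targets = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(targets, scores), average_precision(targets, scores))
```

A useful reference point: a classifier that ranks comments at random scores about 0.5 on ROC AUC, while its average precision is roughly the positive rate, which for `identity_hate` is under 1%.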

In [None]:
# 10 seconds
learn.lr_find()

In [None]:
learn.recorder.plot(skip_end=7, skip_start = 50)

In [None]:
# 3-4 mins on FloydHub, 99% accuracy, 0.85-0.87 AUC, PR-AUC 0.08
learn.fit_one_cycle(1, 4e-1, moms=(0.8,0.7))

In [None]:
learn.save('first')

In [None]:
learn.freeze_to(-2)
# 4.5 minutes
# AUC now up to 0.928, PR-AUC 0.243
# Note that if we only looked at accuracy, it would seem to be doing worse.
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))
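
Passing `slice(lo, hi)` asks fastai to train earlier layer groups with smaller learning rates than later ones. The sketch below shows where the `2.6**4` comes from, assuming the rates are spaced geometrically and that the classifier is split into five layer groups; both are assumptions about fastai 1.0, and `layer_group_lrs` is a hypothetical helper.

```python
def layer_group_lrs(lr_min, lr_max, n_groups):
    """Geometrically spaced learning rates from lr_min to lr_max."""
    if n_groups == 1:
        return [lr_max]
    step = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * step ** i for i in range(n_groups)]


lr_max = 1e-2
# Five layer groups is an assumption about the AWD-LSTM classifier's split.
lrs = layer_group_lrs(lr_max / 2.6 ** 4, lr_max, n_groups=5)
for i, lr in enumerate(lrs):
    print(f"group {i}: {lr:.6f}")
```

With five groups there are four steps between the first and last, so each group's rate is 2.6 times that of the group before it, the ratio recommended in the ULMFiT paper.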

In [None]:
learn.save('second')

In [None]:
learn.freeze_to(-3)
# Some improvement: AUC now 0.949 and PR-AUC 0.343
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))

In [None]:
learn.save('third')

In [None]:
learn.load("third");

In [6]:
learn.unfreeze()
# 8.5 then 9 minutes
# AUC now up to 0.954 and PR-AUC 0.385 - great.
# Do we need a slower learning rate?
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,auc,pr-auc,time
0,0.028363,0.953902,0.989535,0.95116,0.380817,08:03
1,0.027963,1.236983,0.98966,0.954707,0.384531,08:53


In [7]:
learn.save('last')

## Prediction

Trigger warning: we need to test some hateful speech in order for this to be a worthwhile exercise.

In [None]:
# Our third model was best so far.
#learn.load("third");
# Now our last model is best - random variation or something else?
learn.load("last");

In [11]:
# Probability distributions look good!
print(learn.predict(df[df["identity_hate"] == 1]["comment_text"].iloc[4]))
print(learn.predict("I hate dirty green people, kill them all!! So vile and disgusting!!! Fuck em"))
# Positive identity speech - still an increase in probability due to the identity terms.
print(learn.predict("I am proud to be a brown Mexican immigrant latina"))
# Positive robot speech
print(learn.predict("I am a happy deep learning algorithm"))

(Category 1, tensor(1), tensor([1.9644e-04, 9.9980e-01]))
(Category 0, tensor(0), tensor([0.7174, 0.2826]))
(Category 0, tensor(0), tensor([0.9455, 0.0545]))
(Category 0, tensor(0), tensor([0.9961, 0.0039]))
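
For a two-class model, picking the most probable class amounts to a 0.5 cut-off on the positive-class probability. With a positive class this rare, a lower cut-off may be worth considering. The sketch below reuses the four probabilities printed above; the 0.25 threshold is an arbitrary illustration, not a tuned value, and `flag` is a hypothetical helper.

```python
# P(identity_hate = 1) for the four examples, copied from the output above.
positive_probs = [0.99980, 0.2826, 0.0545, 0.0039]


def flag(probs, threshold=0.5):
    """Return 1 where the positive-class probability reaches the threshold."""
    return [int(p >= threshold) for p in probs]


print(flag(positive_probs))                  # default cut-off
print(flag(positive_probs, threshold=0.25))  # hypothetical, more sensitive cut-off
```

The second example, which is clearly hostile, is only flagged at the lower cut-off. In practice the threshold would be chosen on validation data by weighing false positives against false negatives.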


In [None]:
# Compare to perspective API?

In [None]:
## Score on test
## Generate export

## Submit Kaggle entry (for posterity) using kaggle-cli
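
This step is not implemented yet. As a starting point, the sketch below writes a submission file with an `id` column followed by one probability column per outcome, matching the column names in train.csv. The ids and probabilities shown are hypothetical placeholders, and filling the five unmodeled outcomes with 0.5 is only a stopgap until the multi-output model from the todo list exists.

```python
import csv

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def write_submission(path, ids, probs_by_label, default=0.5):
    """Write one row per test id with a probability for each label."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id"] + LABELS)
        for i, row_id in enumerate(ids):
            # Labels without predictions fall back to the default value.
            row = [probs_by_label[label][i] if label in probs_by_label else default
                   for label in LABELS]
            writer.writerow([row_id] + row)


# Hypothetical ids and predictions, for illustration only.
write_submission("submission.csv",
                 ids=["example_id_1", "example_id_2"],
                 probs_by_label={"identity_hate": [0.9998, 0.0039]})

# Then, from a notebook cell (requires the Kaggle CLI and credentials):
# !kaggle competitions submit -c jigsaw-toxic-comment-classification-challenge -f submission.csv -m "ULMFiT identity_hate"
```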