# Toxic Comment Classification Benchmarks
A collection of deep learning text classification models and benchmarks. All GPU models in this repo were trained on four NVIDIA V100 GPUs, using an AWS p3.8xlarge instance and the AWS Deep Learning AMI.
## Table of Contents
- [Competition Overview](#Competition-Overview)
- [Embeddings Used](#Embeddings)
- [Hardware Used](#Hardware-Used)
- [Models](#Models)

## Competition Overview
Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. The Kaggle Toxic Comment Classification Challenge, sponsored by the Conversation AI team, sets out to apply machine learning to identify toxic comments. This could allow platforms to identify toxic comments and facilitate discussions at scale.
<br>
<br>
<br>
### Description
The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet), is working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion). So far they’ve built a range of publicly available models served through the Perspective API, including toxicity. But the current models still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding (e.g. some platforms may be fine with profanity, but not with other types of toxic content).
<br>
<br>
In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective’s current models. You’ll be using a dataset of comments from Wikipedia’s talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful.
<br>
<br>
***Disclaimer: the dataset for this competition contains text that may be considered profane, vulgar, or offensive.***
<br>
<br>
<br>
### Evaluation
We randomly select 10% of the training data as the development set. The evaluation metric is the mean column-wise ROC AUC. In other words, the score is the average of the individual AUCs of each predicted column.
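
As a rough illustration of the metric, the sketch below computes the AUC of each label column with the pairwise-ranking (Mann-Whitney) identity and then averages the results. The helper names `roc_auc` and `mean_columnwise_auc` and the toy values are assumptions made for this example; in practice `sklearn.metrics.roc_auc_score` computes the same per-column quantity.

```python
def roc_auc(y_true, y_score):
    """ROC AUC for one binary label via the pairwise-ranking identity."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_columnwise_auc(y_true_rows, y_score_rows):
    """Average the per-label AUCs; each row holds one value per label."""
    true_cols = list(zip(*y_true_rows))
    score_cols = list(zip(*y_score_rows))
    aucs = [roc_auc(t, s) for t, s in zip(true_cols, score_cols)]
    return sum(aucs) / len(aucs)


# Toy data: 4 comments, 2 labels (illustrative values, not from the dataset).
y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
y_pred = [[0.9, 0.2], [0.1, 0.6], [0.8, 0.7], [0.3, 0.4]]
score = mean_columnwise_auc(y_true, y_pred)
```

Here the first label is ranked perfectly (AUC 1.0) and the second gets three of four pairs right (AUC 0.75), so the mean column-wise score is 0.875. The pairwise loop is quadratic and only meant for reading; a library implementation is the better choice on the full dataset.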
<br>
<br>
<br>
### Data Overview
You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate

The goal is a model that predicts, for each comment, the probability of each type of toxicity.
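
Because a comment can carry several labels at once, one common way to produce these probabilities is an independent sigmoid per label rather than a softmax across labels. The sketch below assumes the model emits one raw score (logit) per label; the `predict_proba` helper and the logit values are hypothetical and only illustrate the output shape.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def sigmoid(z):
    """Squash a raw score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(logits):
    """Map one raw score per label to an independent probability."""
    return {label: sigmoid(z) for label, z in zip(LABELS, logits)}


# Hypothetical logits for a single comment (not produced by a real model).
logits = [2.0, -3.0, 1.5, -4.0, 1.0, -2.5]
probs = predict_proba(logits)
```

The six probabilities are not constrained to sum to 1, so a comment can score high on `toxic`, `obscene`, and `insult` at the same time.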
<br>
<br>

File descriptions:
- train.csv - the training set, contains comments with their binary labels
- test.csv - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
<br>
<br>

***Source: Toxic Comment Classification Challenge, https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/***
<br>
<br>
<br>

## Embeddings

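The word embeddings are pre-trained on much larger unannotated corpora to achieve better generalization given a limited amount of training data (Turian et al., 2010). In particular, our experiments use the GloVe embeddings trained by Pennington et al. (2014) on 6 billion tokens of Wikipedia 2014 and Gigaword 5. Words not present in the set of pre-trained words are initialized to zero vectors. The word embeddings have 300 dimensions. Note that `EMBEDDING_FILE` in the loading code points to `glove.42B.300d.txt`, which is the GloVe release trained on 42 billion tokens of Common Crawl rather than the 6-billion-token Wikipedia and Gigaword release.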
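
The zero-initialization rule can be sketched as follows. This is a minimal example, assuming GloVe's plain-text format (one token per line followed by its space-separated vector) and a hypothetical `word_index` mapping from a tokenizer. The in-memory file, its numbers, and the 4-dimensional vectors are made up to keep the example small; real code would read the 300-dimensional file and would typically store the matrix as a NumPy array.

```python
import io

EMBED_DIM = 4  # the text above uses 300; 4 keeps the toy example readable


def load_glove(handle):
    """Parse GloVe's text format: one token per line, followed by its vector."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 = padding
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix


# Stand-in for a GloVe file (made-up numbers, not real GloVe vectors).
fake_glove = io.StringIO("the 0.1 0.2 0.3 0.4\nwiki 0.5 0.6 0.7 0.8\n")
word_index = {"the": 1, "wiki": 2, "zzyzx": 3}  # hypothetical tokenizer vocabulary

vectors = load_glove(fake_glove)
embedding_matrix = build_embedding_matrix(word_index, vectors, EMBED_DIM)
```

The row for `zzyzx`, which does not appear in the stand-in file, stays all zeros, matching the rule for out-of-vocabulary words.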

More info on GloVe Embeddings: https://nlp.stanford.edu/projects/glove/
<br>
<br>
<br>

## Hardware Used
These models were trained on an Amazon EC2 p3.8xlarge instance running the AWS Deep Learning AMI.

### p3.8xlarge

| Instance Size | GPUs - Tesla V100 | GPU Peer to Peer | GPU Memory (GB) | vCPUs | Memory (GB) | Network Bandwidth | EBS Bandwidth | On-Demand Price/hr* | 1-yr Reserved Instance Effective Hourly* | 3-yr Reserved Instance Effective Hourly* |
|---------------|-------------------|------------------|-----------------|-------|-------------|-------------------|---------------|---------------------|------------------------------------------|------------------------------------------|
| p3.8xlarge    | 4                 | NVLink           | 64              | 32    | 244         | 10 Gbps           |  7 Gbps        | 12.24 USD             | 7.96 USD                                  | 9.87 USD                               |

https://aws.amazon.com/ec2/instance-types/p3/
<br>
<br>
<br>

```python
import pandas as pd

# Global paths
TRAIN_FILE = '../data/train.csv'
TEST_FILE = '../data/test.csv'
EMBEDDING_FILE = '../data/glove.42B.300d.txt'

# Load the competition data
train = pd.read_csv(TRAIN_FILE)
test = pd.read_csv(TEST_FILE)

# Extract the X's and y's; missing comments are replaced with the
# placeholder string "fillna" so every row contains text
X_train = train["comment_text"].fillna("fillna").values
y_train = train[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].values
X_test = test["comment_text"].fillna("fillna").values
```
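
The 10% development split described under Evaluation could be drawn along these lines. This is only a sketch: the `train_dev_split` helper, the fixed seed, and the row count of 1000 are assumptions for illustration, not the split actually used for the reported benchmarks.

```python
import random


def train_dev_split(n_rows, dev_fraction=0.1, seed=0):
    """Return (train_idx, dev_idx): a random, non-overlapping split of row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_dev = int(round(n_rows * dev_fraction))
    return sorted(indices[n_dev:]), sorted(indices[:n_dev])


# Illustrative row count; in the notebook this would be len(X_train).
train_idx, dev_idx = train_dev_split(n_rows=1000)
```

With the arrays loaded above, `X_train[dev_idx]` and `y_train[dev_idx]` would select the development rows, and `train_idx` the rows left for fitting.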