
# Toxic Comments

- Chris Haddad, Sanjay Roberts, and Jeff Coady

## Dataset

- **Kaggle Competition**: Jigsaw Unintended Bias in Toxicity Classification 


- ~$2$ million public comments from various platforms
    - `train.csv`: 812 MB
    - `test.csv`: 30 MB
    
    
- `comment_text` and `target` toxicity columns

    - `target` ranges from $0$ to $1$; a comment with `target` $\geq 0.5$ is classified as **'toxic'**
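
The labeling rule above can be sketched in a few lines. The function name `label_toxic` and the sample scores are illustrative only and do not come from the project code.

```python
def label_toxic(target, threshold=0.5):
    """Return 1 (toxic) when the score is at or above the threshold, else 0."""
    return int(target >= threshold)


# Hypothetical target scores, chosen to straddle the threshold.
scores = [0.0, 0.49, 0.5, 0.83]
labels = [label_toxic(s) for s in scores]
print(labels)  # [0, 0, 1, 1]
```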

## Motivation

- Build a model that recognizes toxicity in comments while minimizing unintended bias with respect to mentions of identities
- Experiment with different `sklearn` classifiers
- Work with an imbalanced dataset
    - ~10% of comments are classified as toxic 

# Exploratory Data Analysis

### Toxicity distribution
<table><tr><td><img src='dist.jpg'></td><td><img src='class.jpg' style='width: 670px;'></td></tr></table>

## Word clouds 
<table><tr><td><img src='nontoxic.jpg'></td><td><img src='toxic.jpg'></td></tr></table>

<img src="most_toxic.jpg" alt="Drawing" style="width: 600px;"/>



## Topic Modeling

![title](ldavis.jpg)

# Modeling

### AWS
- EC2 instance shared by multiple users over SSH, with port forwarding for Jupyter Notebooks
    - Instance type: `r4.16xlarge` - 64 vCPUs, **488 GiB RAM**
    
    
- Still ran into memory issues!
    - Loading and processing the data before modeling needed at least 150 GB of memory
    - Running `clf.fit(X, y)` killed the kernel every time
        - Had to use `partial_fit`, or otherwise batch the data, feeding the classifiers chunks of 100,000 comments
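
A minimal sketch of chunked training with `partial_fit`, assuming a scikit-learn estimator that supports it (for example `SGDClassifier` or `MultinomialNB`). The helper `fit_in_chunks` and the random demo data are hypothetical stand-ins, not the project's code or data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def fit_in_chunks(clf, X, y, chunk_size=100_000, classes=(0, 1)):
    """Feed X, y to clf.partial_fit one chunk of rows at a time."""
    classes = np.asarray(classes)
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        # `classes` lists every label up front, so a chunk that happens
        # to contain only one class is still accepted.
        clf.partial_fit(X[start:stop], y[start:stop], classes=classes)
    return clf


# Toy stand-in data: 1,000 rows, roughly 10% positives.
rng = np.random.default_rng(0)
X_demo = rng.random((1_000, 20))
y_demo = (X_demo[:, 0] > 0.9).astype(int)

clf = fit_in_chunks(SGDClassifier(random_state=0), X_demo, y_demo, chunk_size=250)
preds = clf.predict(X_demo)
```

Only one chunk is handed to the estimator per call, so the fitting step works on 100,000 rows at a time. This sketch still holds the full matrix in memory; streaming chunks from disk would be needed to reduce that further.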

### HTOP

<img src="htop4.jpg" alt="Drawing" style="width: 800px;"/>

### Comment Cleaning - NLTK
- Lower case words
- Remove stop words
- Remove punctuation
- Tokenize and lemmatize comments
- Word frequency count; keep words with more than 10 occurrences
    - `dict` of 60,000 words
- Create matrix of word frequency and outcome `(0: non-toxic, 1: toxic)` per comment    
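
The cleaning steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's NLTK pipeline: `STOP_WORDS` is a tiny stand-in for NLTK's stop word list, whitespace splitting replaces NLTK tokenization, lemmatization is omitted, and all function names are hypothetical.

```python
import string
from collections import Counter

# Tiny stand-in for NLTK's English stop word list (assumption, illustration only).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "to", "of", "you"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_comment(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


def build_vocab(token_lists, min_count=10):
    """Map each word seen more than min_count times to a column index."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    kept = sorted(word for word, n in counts.items() if n > min_count)
    return {word: idx for idx, word in enumerate(kept)}


def count_vector(tokens, vocab):
    """Word-frequency row for one comment, aligned to the vocabulary columns."""
    row = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            row[vocab[tok]] += 1
    return row


comments = ["You are an idiot!", "The idiot is great, great idea."]
cleaned = [clean_comment(c) for c in comments]
vocab = build_vocab(cleaned, min_count=1)  # low threshold for the toy data
rows = [count_vector(tokens, vocab) for tokens in cleaned]
print(vocab)  # {'great': 0, 'idiot': 1}
print(rows)   # [[0, 1], [2, 1]]
```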

### Stochastic Gradient Descent Classifier

- Treat dataset as balanced:
    - Train time: 14 min
    - Accuracy: 93.8%
    - Precision: 0.814
    - Recall: 0.2881
    - F-1 Score: 0.42557

```
| Conf Mat |   | prediction |      |
|----------|---|------------|------|
|          |   | 0          | 1    |
| actual   | 0 | 330295     | 1895 |
|          | 1 | 20492      | 8293 |
```
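
The reported metrics follow directly from the four confusion-matrix cells. A small sketch (the helper name `binary_metrics` is an assumption for illustration) that reproduces them, returning 0 when a ratio has an empty denominator:

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against empty denominators (e.g. a model that never predicts 1).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the confusion matrix above.
m = binary_metrics(tn=330295, fp=1895, fn=20492, tp=8293)
print({name: round(value, 4) for name, value in m.items()})
```

High accuracy alongside low recall is typical when roughly 92% of the comments are non-toxic: most correct predictions come from the majority class.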
    
    
    
- Take into account class imbalance:
    - Accuracy: 93.36%
    - Precision: 0.5841
    - Recall: 0.5821
    - F-1 Score: 0.5831
    
```
| Conf Mat |   | prediction |       |
|----------|---|------------|-------|
|          |   | 0          | 1     |
| actual   | 0 | 320261     | 11929 |
|          | 1 | 12030      | 16755 |
```
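
The slides do not say how class imbalance was taken into account. One common option, assumed here purely for illustration, is the "balanced" heuristic that scikit-learn applies for `class_weight="balanced"`: each class gets weight `n_samples / (n_classes * class_count)`. The helper name below is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Class totals read from the confusion matrices: 332,190 non-toxic, 28,785 toxic.
weights = balanced_class_weights({0: 332190, 1: 28785})
print({label: round(w, 4) for label, w in weights.items()})
```

With these counts, an error on a toxic comment costs about 11.5 times as much as an error on a non-toxic one. That matches the direction of the change above, where recall rises and precision falls.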

### Multi-Layer Perceptron

- Treat dataset as balanced:
    - Train time: 9 min
    - Accuracy: 92.03%
    - Precision: 0 (undefined in practice; no comment was predicted toxic)
    - Recall: 0
    - F-1 Score: 0

```
| Conf Mat |   | prediction |   |
|----------|---|------------|---|
|          |   | 0          | 1 |
| actual   | 0 | 332190     | 0 |
|          | 1 | 28785      | 0 |
```

### Naive Bayes

- Treat dataset as balanced:
    - Train time: 15 min
    - Accuracy: 89.38%
    - Precision: 0.3961
    - Recall: 0.6323
    - F-1 Score: 0.4871
    
```
| Conf Mat |   | prediction |       |
|----------|---|------------|-------|
|          |   | 0          | 1     |
| actual   | 0 | 304445     | 27745 |
|          | 1 | 10584      | 18201 |
```
    
- Take into account class imbalance:
    - Accuracy: 88.56%
    - Precision: 0.3762
    - Recall: 0.6607
    - F-1 Score: 0.4794 
    
```
| Conf Mat |   | prediction |       |
|----------|---|------------|-------|
|          |   | 0          | 1     |
| actual   | 0 | 300655     | 31535 |
|          | 1 | 9767       | 19018 |
```

### Random Forest 

- Treat dataset as balanced:
    - Train time: 8 min
    - Accuracy: 92.03%
    - Precision: 0 (undefined in practice; no comment was predicted toxic)
    - Recall: 0
    - F-1 Score: 0
    
```
| Conf Mat |   | prediction |   |
|----------|---|------------|---|
|          |   | 0          | 1 |
| actual   | 0 | 332190     | 0 |
|          | 1 | 28785      | 0 |
```
    
- Take into account class imbalance:
    - Accuracy: 92.78%
    - Precision: 0.5499
    - Recall: 0.51905
    - F-1 Score: 0.53404

```
| Conf Mat |   | prediction |       |
|----------|---|------------|-------|
|          |   | 0          | 1     |
| actual   | 0 | 319962     | 12228 |
|          | 1 | 13844      | 14941 |
```

# Competition Placing
- Predicted the toxicity of the test set with each model and submitted the results to Kaggle
- `W-` marks the variants that take class imbalance into account

```
| Model           | Score   |
|-----------------|---------|
| SGDC            | 0.63615 |
| W-SGDC          | 0.73864 |
| Naive Bayes     | 0.73172 |
| W-Naive Bayes   | 0.73555 |
| Random Forest   | 0.50000 |
| W-Random Forest | 0.69865 |
```
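
The competition score is AUC-based, which would explain the 0.50000 for the unweighted Random Forest: a model that gives every comment the same prediction cannot rank toxic comments above non-toxic ones. The sketch below implements plain ROC AUC as a pairwise ranking probability. It is not the full competition metric, which also combines identity-subgroup AUCs, and the function name and toy labels are assumptions.

```python
def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 0, 1]
constant = roc_auc(y_true, [0.0] * 5)                 # same score for every comment
ranked = roc_auc(y_true, [0.1, 0.2, 0.9, 0.3, 0.8])   # positives ranked on top
print(constant, ranked)  # 0.5 1.0
```

The pairwise loop is quadratic in the number of examples, so it suits small illustrations only.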

# Place: 2158 out of 2278

<img src="place.jpg" alt="Drawing" style="width: 1000px;"/>

## Convolutional Neural Network Using Attention

- Takes top 300,000 features
- Uses fastText and GloVe embeddings


    - Train time: 8 hours
    - Accuracy: 94.52%
    - Precision: 0.75742
    - Recall: 0.46212
    - F-1 Score: 0.57402
    
```
| Conf Mat |   | prediction |       |
|----------|---|------------|-------|
|          |   | 0          | 1     |
| actual   | 0 | 1639180    | 21362 |
|          | 1 | 77634      | 66700 |
```

### GPU usage
<img src="gpu.png" alt="Drawing" style="width: 500px;"/>