# Detecting and Classifying Toxic Comments

# Part 1: Initial Exploration and Basic Data Preparation

The models will be trained on a publicly available dataset of human-labeled comments.

- Data Source:
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

Because our first models will be built in spaCy, the first cleaning passes are minimally invasive: they mainly remove IP addresses, URLs, and extra whitespace.

## Python Library Imports

In [1]:
import pandas as pd
import numpy as np

# scikit learn imports
from sklearn.model_selection import train_test_split

%load_ext autoreload
%autoreload 2

## Import Custom Functions

Resources:
- [Importing custom functions (relative filepath)](https://www.codegrepper.com/code-examples/python/import+files+in+src+folder+without+referencing+src+folder+python)  

In [2]:
import sys

# add src folder to path
sys.path.insert(1, '../src')

from text_prep import tidy_series, uppercase_proportion_column 

## Import Data to DataFrame

In [3]:
! ls ../data

toxic_2-1.pkl   toxic_2-2.pkl   toxic_2-3.pkl   toxic_basic.pkl train.csv


In [4]:
# Load original from csv

# path if using Google Colab
# path = "gdrive/MyDrive/Colab Notebooks/capstone_exploration/data/toxic_comment_data/train.csv"

# local path
path = '../data/train.csv'

toxic_df = pd.read_csv(path)

# Basic Exploration

Texts in the dataset are labeled by human users as either **Toxic** or **Not Toxic**. 

Toxic comments can be further labeled with any combination of five subcategories: a comment may belong to one subcategory, to several, or to none.

Subcategories:
- Severely toxic
- Obscene
- Threat
- Insult
- Identity hate
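
Because each label is stored as its own binary column, the overlap between subcategories can be checked directly. The sketch below uses a tiny invented DataFrame (not rows from the dataset) that only borrows the column names from `train.csv`; the names `demo_df`, `sub_cols`, and `n_sub` are illustrative assumptions.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented, only the column names match train.csv
demo_df = pd.DataFrame({
    'toxic':         [0, 1, 1, 1],
    'severe_toxic':  [0, 0, 0, 1],
    'obscene':       [0, 0, 1, 1],
    'threat':        [0, 0, 0, 0],
    'insult':        [0, 0, 1, 1],
    'identity_hate': [0, 0, 0, 0],
})

sub_cols = ['severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# number of subcategory labels attached to each comment
n_sub = demo_df[sub_cols].sum(axis=1)

# distribution of subcategory counts among toxic comments only
toxic_sub_counts = n_sub[demo_df['toxic'] == 1].value_counts().sort_index()
print(toxic_sub_counts)
```

Run against `toxic_df` instead of `demo_df`, the same lines would show how many toxic comments carry zero, one, or several subcategory labels.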

### Category Summary

| Category            	| Totals 	|
|---------------------	|-------:	|
| Not Toxic         	| 144277 	|
| Toxic             	|  15294 	|
| Toxic Subcategories 	|        	|
| Severely toxic      	|   1595 	|
| Obscene             	|   8449 	|
| Threat              	|    478 	|
| Insult              	|   7877 	|
| Identity hate       	|   1405 	|
| Subcategories Total 	|  19804 	|


### Proportions

About 9.6% of the comments in the dataset are labeled Toxic, so the two classes are heavily imbalanced.

```
Proportion of Not Toxic Comments in Dataset: 0.9041555169799024
Proportion of Toxic Comments in Dataset: 0.09584448302009764
```


Resources:
- [Table Generator](https://www.tablesgenerator.com/markdown_tables#)  

In [5]:
# how many rows labeled as not toxic?
not_toxic_count = toxic_df[toxic_df['toxic']==0].shape[0]
print(f"Rows labeled as Not Toxic: {not_toxic_count}") # not toxic: (144277) 

# rows labeled toxic
toxic_count = toxic_df[toxic_df['toxic']==1].shape[0]
print(f"Rows labeled as Toxic:      {toxic_count}") # toxic: (15294)
sub_toxic = toxic_df[['severe_toxic', 'obscene','threat','insult','identity_hate']].sum()

print(sub_toxic, '\n')
print(f"total sub_toxic:            {sub_toxic.sum()}")

Rows labeled as Not Toxic: 144277
Rows labeled as Toxic:      15294


severe_toxic     1595
obscene          8449
threat            478
insult           7877
identity_hate    1405
dtype: int64 

total sub_toxic:            19804


In [6]:
# Proportions:
total_rows = toxic_df.shape[0] # 159571

# Not Toxic Proportion
not_toxic_prop = not_toxic_count/total_rows # 0.9041555169799024
print(f"Proportion of Not Toxic Comments in Dataset: {not_toxic_prop}")

# Toxic Proportion
toxic_prop = toxic_count/total_rows # 0.09584448302009764
print(f"Proportion of Toxic Comments in Dataset: {toxic_prop}")


Proportion of Not Toxic Comments in Dataset: 0.9041555169799024
Proportion of Toxic Comments in Dataset: 0.09584448302009764
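
The same proportions can also be read off in one step with `Series.value_counts(normalize=True)`. This is a minimal sketch on an invented ten-row label column; in the notebook the input would presumably be `toxic_df['toxic']`.

```python
import pandas as pd

# invented labels for illustration: 9 not toxic, 1 toxic
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name='toxic')

# normalize=True returns each class's share of the rows instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions)
```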


# Basic Data Cleaning

## Drop 'id' Column From Full Dataset
The `id` column is not useful for modeling, so we'll drop it from the DataFrame.

In [7]:
toxic_df.drop(columns='id', inplace=True)

## Tidy the 'comment_text' column
`tidy_series` performs a few basic cleaning steps:
- convert all interior quotes to single quotes
- strip extraneous whitespace
- strip IP addresses
- [strip URLs](https://stackoverflow.com/a/62729865)
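
`tidy_series` is imported from the `text_prep` module in `../src`, and its source is not shown in this notebook. As a rough, hypothetical sketch of the kind of cleaning listed above, the function below applies the four steps to a single string with regular expressions. The name `tidy_text_sketch` and all three patterns are assumptions for illustration, not the project's implementation; the IPv4 and URL patterns in particular are deliberately simple.

```python
import re

# Assumed patterns for illustration only; the real tidy_series may differ
IP_PATTERN = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')   # IPv4-looking tokens
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')        # simple URL matcher
WHITESPACE_PATTERN = re.compile(r'\s+')

def tidy_text_sketch(text: str) -> str:
    """Apply minimal cleaning to one comment string."""
    text = text.replace('"', "'")               # double quotes -> single quotes
    text = URL_PATTERN.sub(' ', text)           # drop URLs
    text = IP_PATTERN.sub(' ', text)            # drop IP addresses
    text = WHITESPACE_PATTERN.sub(' ', text)    # collapse runs of whitespace
    return text.strip()                         # trim leading/trailing whitespace

example = 'He said "hi"   from 192.168.0.1\n see http://example.com/page now'
print(tidy_text_sketch(example))
```

To clean a whole column with a function like this, something along the lines of `toxic_df['comment_text'].map(tidy_text_sketch)` should work.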


In [8]:
# tidy comment_text
toxic_df['comment_text'] = tidy_series(toxic_df['comment_text'])

In [9]:
toxic_df['comment_text'].head()

0    Explanation Why the edits made under my userna...
1    D'aww! He matches this background colour I'm s...
2    Hey man, I'm really not trying to edit war. It...
3    ' More I can't make any real suggestions on im...
4    You, sir, are my hero. Any chance you remember...
Name: comment_text, dtype: object

# Basic Feature Engineering

A few features that are not explicit in the original dataset may be useful for prediction and classification.

## Proportion of All-Caps Type

In many circles, typing in all caps is read as yelling. Before changing the text any further, we'll record the proportion of uppercase letters among all alphabetic characters.

Possible Confounds:
- [People with dyslexia occasionally choose all caps as an accommodation](https://www.readandspell.com/us/writing-in-all-caps)
- Quoted all-caps text
    - Not counting quoted and block-quoted text may help here.
- Text referencing all-caps acronyms
- Programming language conventions
    - e.g. SQL syntax typically includes all-caps reserved words
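
The measure itself is simple: uppercase letters divided by all alphabetic characters. The notebook's `uppercase_proportion_column` is imported from `text_prep` and its source is not shown here, so the function below is only a hypothetical single-string sketch (`uppercase_proportion_sketch` is an invented name). Returning 0.0 for text that contains no letters is an assumption.

```python
def uppercase_proportion_sketch(text: str) -> float:
    """Share of alphabetic characters in `text` that are uppercase."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # assumption: no alphabetic characters -> treat the proportion as 0.0
        return 0.0
    upper = sum(ch.isupper() for ch in letters)
    return upper / len(letters)

print(uppercase_proportion_sketch("STOP it now"))   # 4 of 9 letters are uppercase
print(uppercase_proportion_sketch("1234 !!!"))      # no letters at all
```

Digits, punctuation, and whitespace are ignored, so a short comment made up mostly of symbols is scored only on the letters it does contain.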

In [10]:
toxic_df.columns

Index(['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
       'identity_hate'],
      dtype='object')

In [11]:
%%time
# create uppercase_proportion column
toxic_df.insert(1, 'uppercase_proportion', uppercase_proportion_column(toxic_df['comment_text']))
toxic_df.columns

CPU times: user 10.1 s, sys: 1.31 s, total: 11.5 s
Wall time: 10.8 s


Index(['comment_text', 'uppercase_proportion', 'toxic', 'severe_toxic',
       'obscene', 'threat', 'insult', 'identity_hate'],
      dtype='object')

In [12]:
mean_all = toxic_df['uppercase_proportion'].mean()
mean_not_toxic = toxic_df['uppercase_proportion'][toxic_df['toxic']==0].mean()
mean_toxic = toxic_df['uppercase_proportion'][toxic_df['toxic']==1].mean()

'''
Uppercase Proportion mean:           0.06970968433852934
Uppercase Proportion mean not toxic: 0.06073052868635834
Uppercase Proportion mean toxic:     0.15440166285616988
'''

print(f"Uppercase Proportion mean:           {mean_all}")
print(f"Uppercase Proportion mean not toxic: {mean_not_toxic}")
print(f"Uppercase Proportion mean toxic:     {mean_toxic}")
# the mean uppercase proportion for toxic comments is about 2.5 times that of not toxic comments

Uppercase Proportion mean:           0.06970968433852934
Uppercase Proportion mean not toxic: 0.06073052868635834
Uppercase Proportion mean toxic:     0.15440166285616988


# Save Basic Columns As Pickle File

In [13]:
%%time
'''
CPU times: user 84.7 ms, sys: 87.5 ms, total: 172 ms
Wall time: 215 ms
'''
# Pickle basic
toxic_df.to_pickle("../data/toxic_basic.pkl")

CPU times: user 84.7 ms, sys: 87.5 ms, total: 172 ms
Wall time: 215 ms


# Resources & Articles

Resources:
- [Detecting Insults in Social Commentary Dataset On Kaggle](https://www.kaggle.com/c/detecting-insults-in-social-commentary/data) 
- [Cleaned Toxic Comments on Kaggle](https://www.kaggle.com/fizzbuzz/cleaned-toxic-comments)  
- [Insult Sets](https://www.kaggle.com/rogier2012/insult-sets)  
- [Wikipedia Talk Labels: Personal Attacks](https://datasetsearch.research.google.com/search?query=stalking%20text&docid=L2cvMTFqbnl5cWw0Xw%3D%3D) 
    -  [At Kaggle](https://datasetsearch.research.google.com/search?query=stalking%20text&docid=L2cvMTFqbnl5cWw0Xw%3D%3D)  
- [Toxic Dataset](https://www.kaggle.com/ra2041/toxic-dataset)  
- [Dataset for Mean Birds: Detecting Aggression and Bullying on Twitter](https://zenodo.org/record/1184178)

Articles: 
- [NLP AND MACHINE LEARNING TECHNIQUES TO DETECT ONLINE HARASSMENT... (has links to datasets)](https://dalspace.library.dal.ca/handle/10222/76331)
- [Detecting Cyberbullying...](http://www.ijetsr.com/images/short_pdf/1517199597_1428-1435-oucip915_ijetsr.pdf) 


