# Toxic Text


Detecting Insults in Social Commentary

Data from Wikipedia 

Data Source:
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



# Resources & Articles

Resources:
- [Detecting Insults in Social Commentary Dataset On Kaggle](https://www.kaggle.com/c/detecting-insults-in-social-commentary/data) 
- [Cleaned Toxic Comments on Kaggle](https://www.kaggle.com/fizzbuzz/cleaned-toxic-comments)  
- [Insult Sets](https://www.kaggle.com/rogier2012/insult-sets)  
- [Wikipedia Talk Labels: Personal Attacks](https://datasetsearch.research.google.com/search?query=stalking%20text&docid=L2cvMTFqbnl5cWw0Xw%3D%3D) 
    -  [At Kaggle](https://datasetsearch.research.google.com/search?query=stalking%20text&docid=L2cvMTFqbnl5cWw0Xw%3D%3D)  
- [Toxic Dataset](https://www.kaggle.com/ra2041/toxic-dataset)  
- [Dataset for Mean Birds: Detecting Aggression and Bullying on Twitter](https://zenodo.org/record/1184178) 

Articles: 
- [NLP and Machine Learning Techniques to Detect Online Harassment... (has links to datasets)](https://dalspace.library.dal.ca/handle/10222/76331) 
- [Detecting Cyberbullying...](http://www.ijetsr.com/images/short_pdf/1517199597_1428-1435-oucip915_ijetsr.pdf) 




## Python Library Imports



In [1]:
import pandas as pd
import numpy as np

from collections import Counter
import re

# nltk imports
import nltk
from nltk.corpus import stopwords

# scikit learn imports
from sklearn.model_selection import train_test_split

%load_ext autoreload
%autoreload 2

## Import Data to DataFrame

In [2]:
! ls ../data

toxic_2-1.pkl toxic_2-2.pkl toxic_2-3.pkl train.csv


In [3]:
# Load original from csv

# path if using google colabs
# path = "gdrive/MyDrive/Colab Notebooks/capstone_exploration/data/toxic_comment_data/train.csv"

# local path
path = '../data/train.csv'

toxic_df = pd.read_csv(path)

# Basic Exploration

Texts in the dataset are labeled by human raters as either **Toxic** or **Not Toxic**. 

A Toxic comment can also carry any combination of five subcategory labels: exactly one, several at once, or none at all.

Subcategories:
- Severely toxic
- Obscene
- Threat
- Insult
- Identity hate
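
Because the subcategory labels are independent flags, a single comment can be counted under several of them. The sketch below tallies how many subcategory flags each toxic row carries. It uses plain Python and a handful of made-up rows (not taken from `train.csv`); `subcategory_count` is a hypothetical helper, and only the column names come from the dataset.

```python
from collections import Counter

# Column names as they appear in the dataset.
SUBCATEGORIES = ["severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical toy rows for illustration only; these are NOT real dataset rows.
rows = [
    {"toxic": 1, "severe_toxic": 0, "obscene": 1, "threat": 0, "insult": 1, "identity_hate": 0},
    {"toxic": 1, "severe_toxic": 0, "obscene": 0, "threat": 0, "insult": 0, "identity_hate": 0},
    {"toxic": 0, "severe_toxic": 0, "obscene": 0, "threat": 0, "insult": 0, "identity_hate": 0},
    {"toxic": 1, "severe_toxic": 1, "obscene": 1, "threat": 1, "insult": 1, "identity_hate": 0},
]


def subcategory_count(row):
    """Number of subcategory flags set on one row."""
    return sum(row[name] for name in SUBCATEGORIES)


# Tally: number of subcategory flags -> number of toxic rows with that many flags.
flags_per_toxic_row = Counter(subcategory_count(r) for r in rows if r["toxic"] == 1)
print(flags_per_toxic_row)
```

On the real DataFrame, something like `toxic_df.loc[toxic_df['toxic'] == 1, SUBCATEGORIES].sum(axis=1).value_counts()` should produce the same kind of tally.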

### Category Summary

| Category            	| Totals 	|
|---------------------	|-------:	|
| Not Toxic         	| 144277 	|
| Toxic             	|  15294 	|
| Toxic Subcategories 	|        	|
| Severely toxic      	|   1595 	|
| Obscene             	|   8449 	|
| Threat              	|    478 	|
| Insult              	|   7877 	|
| Identity hate       	|   1405 	|
| Subcategories Total 	|  19804 	|


### Proportions

About 9.6% of the comments in the dataset are labeled Toxic.

```
Proportion of Not Toxic Comments in Dataset: 0.9041555169799024
Proportion of Toxic Comments in Dataset: 0.09584448302009764
```


Resources:
- [Table Generator](https://www.tablesgenerator.com/markdown_tables#)  

In [4]:
# how many rows labeled as not toxic?
not_toxic_count = toxic_df[toxic_df['toxic']==0].shape[0]
print(f"Rows labeled as Not Toxic: {not_toxic_count}") # not toxic: (144277) 

# rows labeled toxic
toxic_count = toxic_df[toxic_df['toxic']==1].shape[0]
print(f"Rows labeled as Toxic:      {toxic_count}") # toxic: (15294)
print('\n')
sub_toxic = toxic_df[['severe_toxic', 'obscene','threat','insult','identity_hate']].sum()

print(sub_toxic, '\n')
print(f"total sub_toxic:            {sub_toxic.sum()}")


Rows labeled as Not Toxic: 144277
Rows labeled as Toxic:      15294


severe_toxic     1595
obscene          8449
threat            478
insult           7877
identity_hate    1405
dtype: int64 

total sub_toxic:            19804


In [5]:
# Proportions:
total_rows = toxic_df.shape[0] # 159571

# Not Toxic Proportion
not_toxic_prop = not_toxic_count/total_rows # 0.9041555169799024
print(f"Proportion of Not Toxic Comments in Dataset: {not_toxic_prop}")

# Toxic Proportion
toxic_prop = toxic_count/total_rows # 0.09584448302009764
print(f"Proportion of Toxic Comments in Dataset: {toxic_prop}")


Proportion of Not Toxic Comments in Dataset: 0.9041555169799024
Proportion of Toxic Comments in Dataset: 0.09584448302009764


# Drop 'id' Column From Full Dataset
The `id` column is not useful for our purposes, so we'll drop it from the DataFrame.

In [6]:
toxic_df.drop(columns='id', inplace=True)

# Basic Data Cleaning

Cleaning Functions:
- convert interior quotes to all single quotes
- strip any extraneous whitespace
- strip any ip addresses
- [strip url](https://stackoverflow.com/a/62729865)  
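
As a rough illustration of the steps listed above, here is a plain-Python sketch that applies the same regular expressions to one string using the standard `re` module. `tidy_text` is a hypothetical helper, not part of the notebook, and the sample string is made up. Unlike `tidy_series`, this sketch collapses whitespace last; that ordering is an assumption, chosen so that gaps left by removed IP addresses and URLs are cleaned up too.

```python
import re

# Patterns mirror the ones used in the notebook's cleaning functions.
QUOTES = re.compile(r'["]+')
IP = re.compile(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}')
URL = re.compile(r'https?://\S*')
WHITESPACE = re.compile(r'\s+')


def tidy_text(text):
    """Hypothetical single-string helper: quotes, IPs, URLs, then whitespace."""
    text = QUOTES.sub("'", text)   # runs of double quotes become one single quote
    text = IP.sub("", text)        # drop anything shaped like an IPv4 address
    text = URL.sub("", text)       # drop http/https URLs
    # Collapse whitespace last and trim the edges.
    return WHITESPACE.sub(" ", text).strip()


sample = 'He said ""hello""   from 192.168.0.1 see http://example.com/page  now'
print(tidy_text(sample))
```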


In [7]:
# Convert all interior quotes to single quotes

def convert_interior_quotes(s):
    '''
    Arguments:
        s = Series of strings
            Takes a series of strings as an argument
            converts all interior quotes in a string to single quotes
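    Returns: 
        Series of strings with double quotes converted to single quotes
    '''
    quotes_pattern = r'["]+'
    # regex=True is required: pandas >= 2.0 treats the pattern as a literal by default
    return s.str.replace(quotes_pattern, "'", regex=True)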

def strip_ip(s):
    '''
    Arguments:
        s = Series of strings
            Takes a series of strings as an argument
            removes any ip addresses
    Returns: 
        Series of strings without ip addresses
    '''
    ip_pat = r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}'
    return s.str.replace(ip_pat, "", regex=True)

def strip_url(s):
    '''
    Arguments:
        s = Series of strings
            Takes a series of strings as an argument
            removes any URLs
    Returns: 
        Series of strings without URLs
    '''
    url_pat = r'https?://\S*'
    return s.str.replace(url_pat, "", regex=True)

def strip_whitespace(s):
    '''
    Arguments:
        s = Series of strings
            Takes a series of strings as an argument
            removes extraneous whitespace
    Returns: 
        Series of strings without extraneous whitespace
    '''
    
    t = s.copy()
    # remove whitespace from edge
    t = t.str.strip()

    # reduce interior whitespace to single space
    t = t.str.replace(r'\s+', ' ', regex=True)

    return t


def remove_all_punct(s):
    '''
    Arguments:
        s = Series of strings
            Takes a series of strings as an argument
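            removes every character that is not an ASCII letter or whitespace
            (punctuation, digits, and other symbols)
    Returns: 
        Series of strings containing only letters and whitespace
    '''
    not_alpha_pattern = r'[^A-Za-z\s]'
    return s.str.replace(not_alpha_pattern, "", regex=True)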

def tidy_series(s):
    '''
    returns tidied series
    '''
    # copy series
    t = s.copy()

    # call individual functions
    t = convert_interior_quotes(t)
    t = strip_whitespace(t)
    t = strip_ip(t)
    t = strip_url(t)

    return t

## Apply Basic Cleaning to Full Dataset


In [8]:
# tidy comment_text
toxic_df['comment_text'] = tidy_series(toxic_df['comment_text'])

In [9]:
toxic_df['comment_text'].head()

0    Explanation Why the edits made under my userna...
1    D'aww! He matches this background colour I'm s...
2    Hey man, I'm really not trying to edit war. It...
3    ' More I can't make any real suggestions on im...
4    You, sir, are my hero. Any chance you remember...
Name: comment_text, dtype: object

# Basic Feature Engineering

There are a few features that are not obvious in the original dataset that may be useful for prediction and classification.

Resource:  
- [running pandas operations in parallel](http://www.racketracer.com/2016/07/06/pandas-in-parallel/)  

In [10]:
# # parallelize dataframe

from multiprocessing import Pool
# import multiprocessing

# multiprocessing.cpu_count() # 2 for colabs
# num_partitions = 100
# num_cores = 4

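def parallelize_dataframe(df, func, num_cores=2, num_partitions=100):
    '''
    Split df into num_partitions chunks, apply func to each chunk
    in a pool of num_cores worker processes, and concatenate the results.
    '''
    df_split = np.array_split(df, num_partitions)
    with Pool(num_cores) as pool:
        df = pd.concat(pool.map(func, df_split))

    return df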
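
The split, map, and recombine pattern behind `parallelize_dataframe` can be sketched without pandas. Note the swap: this version uses a thread pool from `concurrent.futures` instead of `multiprocessing.Pool`, purely so the example runs anywhere without process-spawning concerns; for CPU-bound pandas work, processes would be the more usual choice. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_partitions):
    """Split a list into at most num_partitions contiguous chunks."""
    size = max(1, -(-len(items) // num_partitions))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_uppercase(chunk):
    """Example per-chunk function: uppercase letters in each string."""
    return [sum(ch.isupper() for ch in text) for text in chunk]


def parallel_map_chunks(items, func, num_workers=2, num_partitions=4):
    """Apply func to each chunk in a thread pool and flatten the results."""
    chunks = split_into_chunks(items, num_partitions)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = pool.map(func, chunks)  # map preserves chunk order
        return [value for chunk_result in results for value in chunk_result]


comments = ["HELLO there", "quiet", "Mixed Case", "ALL CAPS", "ok"]
print(parallel_map_chunks(comments, count_uppercase))
```

Because `map` returns results in input order, the flattened output lines up with the original rows, which is the same property `pd.concat` relies on in the DataFrame version.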

## Proportion of All-Caps Type

In many circles, typing in all caps is considered a way to indicate yelling. Before changing the initial text, we'll record the proportion of upper case letters to the total number of alphabetical characters. 

Possible Confounds:
- [People with dyslexia occasionally choose all-caps as an accommodation](https://www.readandspell.com/us/writing-in-all-caps)  
- Quoted all-caps text
    - not counting quoted and block quoted text may help here.
- Text referencing all-caps acronyms
- Programming language conventions
    - e.g. SQL syntax typically includes all-caps reserved words
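
For a single string, the ratio of uppercase letters to all alphabetic characters can be sketched as below. `uppercase_proportion` is a hypothetical plain-Python helper that counts ASCII letters only, matching the `[A-Za-z]` pattern used in this notebook. Returning `None` for text with no letters is an assumption of this sketch; a pandas Series division would yield `NaN` in that case.

```python
def uppercase_proportion(text):
    """Share of ASCII letters in text that are uppercase; None if no letters."""
    letters = [ch for ch in text if ch.isascii() and ch.isalpha()]
    if not letters:
        return None  # no alphabetic characters, so the ratio is undefined
    return sum(ch.isupper() for ch in letters) / len(letters)


print(uppercase_proportion("STOP yelling"))  # 4 uppercase letters out of 11
print(uppercase_proportion("12345 !!!"))     # no letters at all
```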

### Custom Function: uppercase_proportion_column(s)


In [11]:
def uppercase_proportion_column(s):
    '''
    given a pandas Series:
        containing rows of strings
    returns: a Series of floats representing
        the proportion of capital letters to total alpha chars
        in each string (NaN when a string has no alpha chars)
    '''

    uc_pattern = '[A-Z]'
    alpha_pattern = '[A-Za-z]'

    cap_count = s.str.findall(uc_pattern).str.len()
    # print(cap_count)

    alpha_char_count = s.str.findall(alpha_pattern).str.len()
    # print(alpha_char_count)

    uc_proportion = cap_count / alpha_char_count
    # print(uc_proportion)

    return uc_proportion

## Apply Custom Features to Full Dataset

In [12]:
toxic_df.columns

Index(['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
       'identity_hate'],
      dtype='object')

In [13]:
%%time
# create uppercase_proportion column
toxic_df.insert(1, 'uppercase_proportion', uppercase_proportion_column(toxic_df['comment_text']))
toxic_df.columns

CPU times: user 9.74 s, sys: 806 ms, total: 10.5 s
Wall time: 9.92 s


Index(['comment_text', 'uppercase_proportion', 'toxic', 'severe_toxic',
       'obscene', 'threat', 'insult', 'identity_hate'],
      dtype='object')

In [20]:
mean_all = toxic_df['uppercase_proportion'].mean()
mean_not_toxic = toxic_df['uppercase_proportion'][toxic_df['toxic']==0].mean()
mean_toxic = toxic_df['uppercase_proportion'][toxic_df['toxic']==1].mean()

'''
Uppercase Proportion mean:           0.06970968433852934
Uppercase Proportion mean not toxic: 0.06073052868635834
Uppercase Proportion mean toxic:     0.15440166285616988
'''

print(f"Uppercase Proportion mean:           {mean_all}")
print(f"Uppercase Proportion mean not toxic: {mean_not_toxic}")
print(f"Uppercase Proportion mean toxic:     {mean_toxic}")
# uppercase proportion for toxic comments is about 2.5 times that of not toxic comments.

Uppercase Proportion mean:           0.06970968433852934
Uppercase Proportion mean not toxic: 0.06073052868635834
Uppercase Proportion mean toxic:     0.15440166285616988
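
The gap between the two group means can be expressed as a single ratio. The short check below reuses the printed values as literals, so it runs without the dataset.

```python
# Means copied from the output above.
mean_not_toxic = 0.06073052868635834
mean_toxic = 0.15440166285616988

# How many times larger the toxic mean is than the not-toxic mean.
ratio = mean_toxic / mean_not_toxic
print(f"Toxic / not toxic uppercase proportion: {ratio:.2f}")
```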


# Save Basic Columns As Pickle File

In [21]:
%%time
'''
CPU times: user 87.8 ms, sys: 68.3 ms, total: 156 ms
Wall time: 214 ms
'''
# Pickle basic
toxic_df.to_pickle("../data/toxic_basic.pkl")

CPU times: user 87.8 ms, sys: 68.3 ms, total: 156 ms
Wall time: 214 ms
