# Toxic Text


Detecting Insults in Social Commentary

Data from Wikipedia 

# Resources & Articles

Resources:
- [Detecting Insults in Social Commentary Dataset On Kaggle](https://www.kaggle.com/c/detecting-insults-in-social-commentary/data) 
- [Cleaned Toxic Comments on Kaggle](https://www.kaggle.com/fizzbuzz/cleaned-toxic-comments)  
- [Insult Sets](https://www.kaggle.com/rogier2012/insult-sets)  
- [Wikipedia Talk Labels: Personal Attacks](https://datasetsearch.research.google.com/search?query=stalking%20text&docid=L2cvMTFqbnl5cWw0Xw%3D%3D) 
    -  [At Kaggle](https://datasetsearch.research.google.com/search?query=stalking%20text&docid=L2cvMTFqbnl5cWw0Xw%3D%3D)  
- [Toxic Dataset](https://www.kaggle.com/ra2041/toxic-dataset)  
- [Dataset for Mean Birds: Detecting Aggression and Bullying on Twitter](https://zenodo.org/record/1184178) 

Articles: 
- [NLP AND MACHINE LEARNING TECHNIQUES TO DETECT ONLINE HARASSMENT... (has links to datasets)](https://dalspace.library.dal.ca/handle/10222/76331) 
- [Detecting Cyberbullying...](http://www.ijetsr.com/images/short_pdf/1517199597_1428-1435-oucip915_ijetsr.pdf) 




# Setup

We'll mount our Google Drive and import any necessary Python libraries.

## Mount Google Drive


In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
# ! pwd
# /content

! ls /content/gdrive/MyDrive/'Colab Notebooks'/capstone_exploration/data

ship_detection_data  toxic_comment_data


## Kaggle Setup & Imports

We'll be using at least one Kaggle dataset.

Resources: 

- [Downloading Datasets directly into Google Drive](https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166)  


In [3]:
# '''
# This code has been commented out as
# it is only necessary to run this once with your credentials.

# credentials, however, seem to be stored on the local machine
# '''

# from google.colab import files
# files.upload() # this will prompt you to upload your kaggle.json

# !pip install -q kaggle
# !mkdir -p ~/.kaggle
# !cp kaggle.json ~/.kaggle/
# !ls ~/.kaggle
# !chmod 600 /root/.kaggle/kaggle.json  # set permission

In [4]:
# ! ls gdrive/MyDrive/'Colab Notebooks'/capstone_exploration/data/toxic_comment_data

In [5]:
# ! pwd

# ! ls gdrive/MyDrive/'Colab Notebooks'/capstone_exploration/data

In [6]:
# ! kaggle competitions list -s jigsaw-toxic-comment-classification-challenge

# ! kaggle competitions download -c jigsaw-toxic-comment-classification-challenge -p /content/gdrive/MyDrive/Colab\ Notebooks/capstone_exploration/data

## spaCy Setup & Imports

In [7]:
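# spaCy Setup & Imports
# upgrade to spaCy v3+ and download the large English model
# before importing, so the upgraded version is the one that gets loaded
! pip install -U spacy
! python -m spacy download en_core_web_lg

import spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS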

Requirement already up-to-date: spacy in /usr/local/lib/python3.7/dist-packages (3.0.5)
2021-03-22 01:24:37.630099: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


## Python Library Imports



In [8]:
import pandas as pd
import numpy as np

from collections import Counter
import re

# nltk imports
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords


# scikit learn imports
from sklearn.model_selection import train_test_split

%load_ext autoreload
%autoreload 2

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Import Data to DataFrame

In [9]:
path = "gdrive/MyDrive/Colab Notebooks/capstone_exploration/data/toxic_comment_data/train.csv"

toxic_df = pd.read_csv(path)

# Basic Exploration

Texts in the dataset are labeled by human raters as either **Toxic** or **Not Toxic**. 

A toxic comment can additionally carry any combination of five subcategory labels: one, several, or none at all.

Subcategories:
- Severely toxic
- Obscene
- Threat
- Insult
- Identity hate

### Category Summary

| Category            	| Totals 	|
|---------------------	|-------:	|
| Not Toxic         	| 144277 	|
| Toxic             	|  15294 	|
| Toxic Subcategories 	|        	|
| Severely toxic      	|   1595 	|
| Obscene             	|   8449 	|
| Threat              	|    478 	|
| Insult              	|   7877 	|
| Identity hate       	|   1405 	|
| Subcategories Total 	|  19804 	|


### Proportions

About 9.6% of the comments in the dataset (roughly one in ten) are labeled Toxic, so the two classes are heavily imbalanced.

```
Proportion of Not Toxic Comments in Dataset: 0.9041555169799024
Proportion of Toxic Comments in Dataset: 0.09584448302009764
```


Resources:
- [Table Generator](https://www.tablesgenerator.com/markdown_tables#)  

In [10]:
# how many rows labeled as not toxic?
not_toxic_count = toxic_df[toxic_df['toxic']==0].shape[0]
print(f"Rows labeled as Not Toxic: {not_toxic_count}") # not toxic: (144277) 

# rows labeled toxic
toxic_count = toxic_df[toxic_df['toxic']==1].shape[0]
print(f"Rows labeled as Toxic:      {toxic_count}") # toxic: (15294)
print('\n')
sub_toxic = toxic_df[['severe_toxic', 'obscene','threat','insult','identity_hate']].sum()

print(sub_toxic, '\n')
print(f"total sub_toxic:            {sub_toxic.sum()}")


Rows labeled as Not Toxic: 144277
Rows labeled as Toxic:      15294


severe_toxic     1595
obscene          8449
threat            478
insult           7877
identity_hate    1405
dtype: int64 

total sub_toxic:            19804


In [11]:
# Proportions:
total_rows = toxic_df.shape[0] # 159571

# Not Toxic Proportion
not_toxic_prop = not_toxic_count/total_rows # 0.9041555169799024
print(f"Proportion of Not Toxic Comments in Dataset: {not_toxic_prop}")

# Toxic Proportion
toxic_prop = toxic_count/total_rows # 0.09584448302009764
print(f"Proportion of Toxic Comments in Dataset: {toxic_prop}")


Proportion of Not Toxic Comments in Dataset: 0.9041555169799024
Proportion of Toxic Comments in Dataset: 0.09584448302009764
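
The subcategory total (19804) is larger than the number of toxic rows (15294), which is consistent with some comments carrying more than one subcategory label. Below is a minimal sketch of how that overlap could be inspected. It uses a tiny made-up DataFrame (`toy_df` and its values are illustrative assumptions, not rows from the dataset) with the same label columns.

```python
import pandas as pd

# Hypothetical toy data with the same label columns as the real dataset
toy_df = pd.DataFrame({
    "toxic":         [1, 1, 0, 1],
    "severe_toxic":  [0, 1, 0, 0],
    "obscene":       [1, 1, 0, 0],
    "threat":        [0, 0, 0, 0],
    "insult":        [1, 1, 0, 0],
    "identity_hate": [0, 0, 0, 0],
})

sub_cols = ["severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# number of subcategory labels attached to each row
labels_per_row = toy_df[sub_cols].sum(axis=1)

# how many rows carry 0, 1, 2, ... subcategory labels
label_count_distribution = labels_per_row.value_counts().sort_index()
print(label_count_distribution)

# class proportions in a single call
class_proportions = toy_df["toxic"].value_counts(normalize=True)
print(class_proportions)
```

Running the same two calls on `toxic_df` would show how many comments have several subcategory labels at once.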


# Drop 'id' Column From Full Dataset
The `id` column is not useful for our purposes, so we'll drop it from the dataframe.

In [12]:
toxic_df.drop(columns='id', inplace=True)

# Basic Data Cleaning

Cleaning Functions:
- convert interior quotes to all single quotes
- strip any extraneous whitespace
- strip any IP addresses


In [13]:
# Convert all interior quotes to single quotes

def convert_interior_quotes(s):
    '''
    Arguments:
        s = Series of strings
            Takes a series of strings as an argument
            converts each run of double quotes in a string to one single quote
    Returns: 
        Series of strings with double quotes replaced by single quotes
    '''
    quotes_pattern = '["]+'
    # regex=True is explicit: pandas >= 2.0 treats the pattern literally by default
    return s.str.replace(quotes_pattern, "'", regex=True)

def strip_ip(s):
    '''
    Arguments:
        s = Series of strings
            Takes a series of strings as an argument
            removes any IPv4-style addresses
    Returns: 
        Series of strings without ip addresses
    '''
    # raw string avoids invalid-escape warnings for \.
    ip_pat = r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}'
    return s.str.replace(ip_pat, "", regex=True)

def strip_whitespace(s):
    '''
    Arguments:
        s = Series of strings
            Takes a series of strings as an argument
            removes extraneous whitespace
    Returns: 
        Series of strings without extraneous whitespace
    '''
    
    t = s.copy()
    # remove whitespace from edge
    t = t.str.strip()

    # reduce interior whitespace to single space
    t = t.str.replace(r'\s+', ' ', regex=True)

    return t


def remove_all_punct(s):
    '''
    Arguments:
        s = Series of strings
            Takes a series of strings as an argument
            removes every character that is not an ASCII letter or whitespace
            (punctuation, digits, and symbols)
    Returns: 
        Series of strings containing only letters and whitespace
    '''
    not_alpha_pattern = r'[^A-Za-z\s]'
    return s.str.replace(not_alpha_pattern, "", regex=True)

def tidy_series(s):
    '''
    returns tidied series
    '''
    # copy series
    t = s.copy()

    # call individual functions
    t = convert_interior_quotes(t)
    # strip IPs before whitespace so gaps left by removed IPs are cleaned up too
    t = strip_ip(t)
    t = strip_whitespace(t)

    return t
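
As a quick sanity check, the same three cleaning steps can be tried on a single plain string with the standard `re` module. This is only a sketch: `tidy_text` and the sample string are illustrative names and data, and the sketch removes IP-like tokens before collapsing whitespace so that no gap is left behind.

```python
import re

def tidy_text(text):
    """Apply the three cleaning steps to one string (illustrative helper)."""
    # runs of double quotes -> one single quote
    text = re.sub(r'"+', "'", text)
    # drop anything shaped like an IPv4 address
    text = re.sub(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', '', text)
    # collapse whitespace runs to one space, then trim the edges
    text = re.sub(r'\s+', ' ', text).strip()
    return text

sample = '  He said ""hello""   from 192.168.0.1  \n today '
cleaned = tidy_text(sample)
print(cleaned)
```

Note that the IP pattern is deliberately loose: it also matches dotted numbers that are not valid addresses, such as version strings with four numeric parts.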



## Apply Cleaning to Full Dataset


In [14]:
# tidy comment_text
toxic_df['comment_text'] = tidy_series(toxic_df['comment_text'])

# Feature Engineering

There are a few features that are not obvious in the original dataset that may be useful for prediction and classification.

spaCy: doc and raw

In [19]:
toxic_df.columns

Index(['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
       'identity_hate'],
      dtype='object')

In [20]:
import spacy

nlp = spacy.load('en_core_web_lg')

In [21]:
# doc1 = nlp(toxic_df['comment_text'][0])

# for x, token in enumerate(doc1):
#     print(x, token.lemma_)

# textcat = nlp.add_pipe('textcat')
# print(nlp.pipeline)

In [22]:
%%time

def doc_per_row(s):

    t = s.copy()
    
    t = remove_all_punct(t)
    t = t.str.strip()

    return t.apply(lambda x: nlp(x))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.72 µs


In [None]:
%%time
toxic_df['doc_raw'] = doc_per_row(toxic_df['comment_text'])

Resource:  
- [running pandas operations in parallel](http://www.racketracer.com/2016/07/06/pandas-in-parallel/)  

In [34]:
# # parallelize dataframe

from multiprocessing import Pool
# import multiprocessing

# multiprocessing.cpu_count() # 2 for colabs
num_partitions = 100
num_cores = 2

def parallelize_dataframe(df, func):
    # split into chunks, apply func to each chunk in a worker process, recombine
    df_split = np.array_split(df, num_partitions)
    with Pool(num_cores) as pool:
        df = pd.concat(pool.map(func, df_split))

    return df

In [None]:
%%time
toxic_df['doc_per_row'] = parallelize_dataframe(toxic_df['comment_text'], doc_per_row)
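
An alternative to calling `nlp(x)` once per row is spaCy's own batching method, `nlp.pipe`, which streams many texts through the pipeline. The sketch below is an assumption-laden illustration rather than the notebook's method: it uses `spacy.blank("en")` (a tokenizer-only pipeline) so it runs without a model download, and `nlp_blank` and `texts` are made-up names and data.

```python
import spacy

# spacy.blank("en") builds a tokenizer-only English pipeline, so this sketch
# runs without downloading a model; the notebook itself loads en_core_web_lg.
nlp_blank = spacy.blank("en")

# made-up comments standing in for toxic_df['comment_text']
texts = ["Hello world", "This is a second comment", "And a third"]

# nlp.pipe processes the texts in batches and yields one Doc per text
docs = list(nlp_blank.pipe(texts, batch_size=2))

for doc in docs:
    print([token.text for token in doc])
```

`nlp.pipe` also accepts an `n_process` argument for multiprocessing, which may be worth comparing against the hand-rolled `Pool` approach on the full dataset.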

classification with spacy

- [Really useful article](https://www.machinelearningplus.com/nlp/custom-text-classification-spacy/)  
- [another project on kaggle](https://www.kaggle.com/poonaml/text-classification-using-spacy)  
- [turbo charge your spacy nlp pipeline](https://towardsdatascience.com/turbo-charge-your-spacy-nlp-pipeline-551435b664ad) 
- [python & spacy nlp](https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/)  
- [cheat sheet](https://www.datacamp.com/community/blog/spacy-cheatsheet)  


Other Resources:
- [Naive bayes](https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html)  

Video Resources:
- [spacy introduction, tokenization, lemmatization, stemming & techniques](https://www.youtube.com/watch?v=ZIiho_JfJNw)  




In [None]:
%%time

def raw_lemma_per_row(s):

    t = s.copy()
    return t.apply(lambda x: [i.lemma_ for i in x])

# test_lemma = raw_lemma_per_row(test)

# print(test_lemma)
# # raw_lemma_lst = [doc_v.lemma_ for doc_v in raw_doc_list]

# # toxic_df['raw_lemma'] = pd.Series(raw_lemma_lst)

# type(test_lemma)

# print(toxic_df['comment_text'][0])



In [None]:
%%time
toxic_df['lemma_raw'] = raw_lemma_per_row(toxic_df['doc_raw'])

In [None]:
toxic_df['lemma_raw'][0]

In [None]:
ls


In [None]:
# for token in doc1:
# #   print(f'{token.text:{20}} {token.lemma_:{20}} {token.pos_:{10}}')
#     print(f'{token.lemma_:{20}}')
#     # print(f'{token.lemma_:{21}}')

## Proportion of All-Caps Type

In many circles, typing in all caps is considered a way to indicate yelling. Before changing the initial text, we'll record the proportion of upper case letters to the total number of alphabetical characters. 

Possible Confounds:
- [People with dyslexia occasionally choose all-caps as an accommodation](https://www.readandspell.com/us/writing-in-all-caps)  
- Quoted all-caps text
    - not counting quoted and block quoted text may help here.
- Text referencing all-caps acronyms
- Programming language conventions
    - e.g. SQL syntax typically includes all-caps reserved words

### Custom Function: uppercase_proportion_column(s)


In [None]:
def uppercase_proportion_column(s):
    '''
    given a pandas Series:
        containing rows of strings
    returns: a series of floats representing
        the proportion (0 to 1) of capital letters among all alpha chars
        in provided strings; rows with no alpha chars yield NaN (0 / 0)
    '''
    uc_pattern = '[A-Z]'
    alpha_pattern = '[A-Za-z]'

    # count uppercase letters and all letters per row
    cap_count = s.str.findall(uc_pattern).str.len()
    alpha_char_count = s.str.findall(alpha_pattern).str.len()

    uc_proportion = cap_count / alpha_char_count

    return uc_proportion
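
To make the calculation concrete, here is a minimal single-string sketch of the same ratio in plain Python. The name `uppercase_proportion` and the choice to return `0.0` for text without letters are assumptions made for illustration; the Series version divides 0 by 0 in that case, which pandas reports as `NaN`.

```python
def uppercase_proportion(text):
    """Share of ASCII letters in `text` that are uppercase."""
    letters = [ch for ch in text if ch.isascii() and ch.isalpha()]
    if not letters:
        # no letters at all: return 0.0 rather than dividing by zero
        return 0.0
    capitals = [ch for ch in letters if ch.isupper()]
    return len(capitals) / len(letters)

print(uppercase_proportion("ABcd"))       # two of four letters are capitals
print(uppercase_proportion("12345 !!!"))  # no letters in this string
```

Whether a letterless comment should count as `0.0`, stay `NaN`, or be dropped depends on the downstream model, so the guard above is only one option.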

In [None]:
doc_lemma = nlp('Practice, practiced, practicing')
doc_lemma
print(doc_lemma) # can be indexed as a list

In [None]:
for token in doc_lemma:
    print(token.text, token.lemma_, token.lemma_.lower().strip())

In [None]:
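# Assumption: short_df (not defined above) is a small sample of toxic_df
short_df = toxic_df.head(100).copy()
short_df['uppercase_proportion'] = uppercase_proportion_column(short_df['comment_text'])
short_df.columns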

## Apply Custom Features to Full Dataset

In [None]:
# create uppercase_proportion column
toxic_df['uppercase_proportion'] = uppercase_proportion_column(toxic_df['comment_text'])
toxic_df.columns

# Simple Train Test Split

Because the process first determines whether a text is toxic or not toxic, we'll make a simple stratified train/test split, so that toxic and non-toxic rows appear in the same proportion in both sets.

For now, we won't be too concerned with the proportion of sub-categories. The first step will be to filter not toxic from toxic; after that we can run a separate operation for each toxic sub-category, since the sub-categories are not mutually exclusive.

## Stratified Split maintaining ratio of toxic to not toxic texts


In [None]:
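# split df into X (independent) and y (dependent) groups
ind_cols = ['comment_text', 'uppercase_proportion']
# name the label columns explicitly so engineered columns never leak into y
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

X = toxic_df[ind_cols]
y = toxic_df[label_cols]

print(f"X columns: {X.columns}\ny columns:{y.columns}")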

In [None]:
# Train Test Split. Stratified on y['toxic']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42, 
                                                    stratify=y['toxic'])
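
To illustrate what stratifying on a label means, here is a minimal plain-Python sketch that splits each class separately so both sets keep roughly the same class ratio. It is not how scikit-learn implements `stratify`; `stratified_split` and the toy labels are hypothetical and exist only for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.33, seed=42):
    """Return (train_idx, test_idx), splitting each label group separately."""
    rng = random.Random(seed)

    # group row indices by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # take the same fraction of every label group for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# toy labels: 10% "toxic" (1) and 90% "not toxic" (0)
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)
```

Both rates land close to the overall 10% toxic share, which is the property the stratified `train_test_split` call is relied on to provide for the real data.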

# spaCy

Let's try out spaCy, an NLP library!

- https://course.spacy.io/en/chapter1
- [text classification with spaCy](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/) 
- [customized list of stopwords](https://spacy.io/usage/linguistic-features#stop-words)  
- [Split Series into list of sentences](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.cat.html)  
- [contractions](https://theslaps.medium.com/cant-stand-don-t-want-contractions-with-spacy-39715cac2ebb)  


In [None]:
# # check version
! python -m spacy info

## adjusting pipeline with small data subset

In [None]:
# sentence tokenization
nlp = English()

# add the component to the pipeline
nlp.add_pipe('sentencizer')

text = short_df['comment_text'].str.cat(sep=" ")

# 'nlp' object is used to create documents with linguistic annotations
doc = nlp(text)

sents_list = [sent.text for sent in doc.sents]

print(len(sents_list))
print(sents_list[0:3])

In [None]:
token_list = [token.text for token in doc]
print(token_list)

In [None]:
# for token in doc:
#     print(token.text, token.pos_, token.i)

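# NOTE: in spaCy v3, a blank English() pipeline with only a sentencizer has no
# tagger or lemmatizer, so lemma_ and pos_ print as empty strings here
for word in doc[0:10]:
    print(word.text, word.lemma_, word.pos_)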
    


# NLTK Naive Bayes


In [None]:
# ! pip install --user -U nltk
# ! pip install scikit-learn

In [None]:
# from nltk.classify import naivebayes

# from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier
# classif = SklearnClassifier(LinearSVC())

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)
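
NLTK's `SklearnClassifier` is trained on a list of `(feature_dict, label)` pairs rather than on raw strings. The sketch below shows one way that input could be built and used. The `bag_of_words` helper, the four training sentences, and the label names are all made up for illustration, and `k="all"` replaces `k=1000` because the toy vocabulary is far smaller than 1000 features.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def bag_of_words(text):
    """Map a text to a {token: count} feature dict (hypothetical helper)."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts

# made-up training examples, for illustration only
train_data = [
    ("you are an idiot", "toxic"),
    ("what a stupid idiot", "toxic"),
    ("thanks for the helpful edit", "not_toxic"),
    ("great article thanks", "not_toxic"),
]
train_set = [(bag_of_words(text), label) for text, label in train_data]

# k="all" keeps every feature; the toy vocabulary is far smaller than 1000
toy_pipeline = Pipeline([("tfidf", TfidfTransformer()),
                         ("chi2", SelectKBest(chi2, k="all")),
                         ("nb", MultinomialNB())])
toy_classif = SklearnClassifier(toy_pipeline).train(train_set)

# tokens unseen during training (here "total") are ignored by the vectorizer
prediction = toy_classif.classify(bag_of_words("total idiot"))
print(prediction)
```

For the real dataset, the feature dicts would more plausibly come from the cleaned or lemmatized comment text than from a whitespace split.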

# SKLearn 

Resources:
- [Naive Bayes Classification](https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html)  

In [None]:
# imports
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()