# Detecting and Classifying Toxic Comments
# Part 3-1: Sequential Binary Classifiers

Sequential binary models may give better results on the rarer classes.

If we first classify comments as Toxic or Not Toxic, we can then pass only the Toxic results to models that have been trained solely to recognise sub-classes of toxic comments.
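
The two-stage idea can be sketched as follows. This only illustrates the control flow and is not the set of models trained in this series: `cascade_predict`, `is_toxic` and `subclass_models` are hypothetical names, and the stand-in "models" are simple keyword checks.

```python
def cascade_predict(comments, is_toxic, subclass_models):
    """Run stage-2 sub-class models only on comments that stage 1 flags as toxic.

    is_toxic: callable(str) -> bool (stage 1, hypothetical)
    subclass_models: dict of label -> callable(str) -> bool (stage 2, hypothetical)
    """
    results = []
    for text in comments:
        labels = {"toxic": bool(is_toxic(text))}
        for label, model in subclass_models.items():
            # Non-toxic comments skip stage 2 and get False for every sub-class.
            labels[label] = bool(model(text)) if labels["toxic"] else False
        results.append(labels)
    return results


# Stand-in "models": keyword checks, purely for demonstration.
stage_one = lambda text: "idiot" in text.lower() or "kill" in text.lower()
stage_two = {
    "insult": lambda text: "idiot" in text.lower(),
    "threat": lambda text: "kill" in text.lower(),
}

predictions = cascade_predict(
    ["Thanks for the edit.", "You idiot.", "I will kill you."],
    stage_one,
    stage_two,
)
```

One consequence of this design is that a toxic comment missed by stage 1 never reaches the sub-class models, so stage-1 recall caps what the later stages can find.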

# Setup

## Python Library Imports

In [1]:
import pandas as pd
import numpy as np

from timeit import default_timer as timer

%load_ext autoreload
%autoreload 2

## spaCy Setup and Imports

This time, we'll only use spaCy for data cleaning.

In [22]:
import spacy

from spacy.lang.en import English
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

from spacy.tokens import Doc
import en_core_web_lg
nlp = en_core_web_lg.load()

## Import Custom Functions

I've created a few custom functions to assist in text preparation.

In [19]:
import sys

# add src folder to path
sys.path.insert(1, '../src')

# from text_prep import tidy_series, uppercase_proportion_column
from spacy_helper import doc_check

## Load Train & Test Dataframes from Pickle File

We've already done a stratified train/test split and some very basic text processing.

In [8]:
# ! ls ../data/basic_df_split/

X_train = pd.read_pickle('../data/basic_df_split/basic_X_train.pkl')
X_test = pd.read_pickle('../data/basic_df_split/basic_X_test.pkl')
y_train = pd.read_pickle('../data/basic_df_split/basic_y_train.pkl')
y_test = pd.read_pickle('../data/basic_df_split/basic_y_test.pkl')

basic_X_test.pkl  basic_X_train.pkl basic_y_test.pkl  basic_y_train.pkl


In [12]:
print(X_train.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 106912 entries, 27301 to 14596
Data columns (total 2 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   comment_text          106912 non-null  object 
 1   uppercase_proportion  106897 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.4+ MB
None


In [13]:
print(y_train.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 106912 entries, 27301 to 14596
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   toxic          106912 non-null  int64
 1   severe_toxic   106912 non-null  int64
 2   obscene        106912 non-null  int64
 3   threat         106912 non-null  int64
 4   insult         106912 non-null  int64
 5   identity_hate  106912 non-null  int64
dtypes: int64(6)
memory usage: 5.7 MB
None


# 1: Use spaCy for feature reduction

We will use spaCy to reduce the number of features by:
- removing stopwords
- removing punctuation
- retaining only lemmas
- rendering all lemmas to lowercase

## 1a: Testing the process on a subset of text data

In [15]:
# create test subset copy
text_sub = X_train['comment_text'].sample(5)

In [25]:
text_sub.iloc[0]

"It was you who initiated this request and your colleague BlankVerse added me into the mix [here]. I'm sure you are aware that those IP checks are for Arbitration Committee issues only right? Simply to satify your curiousity does not qualify as a reason to invade my (or anyone else's) privacy. I left the same message on BlankVerse's talk page as well. Sincerely,"

In [29]:
test_doc = nlp(text_sub.iloc[0])
print(type(test_doc))

<class 'spacy.tokens.doc.Doc'>


In [33]:
lemmas_lc = [i.lemma_.lower() for i in test_doc if doc_check(i)]
lemmas_lc

['initiate',
 'request',
 'colleague',
 'blankverse',
 'add',
 'mix',
 'sure',
 'aware',
 'ip',
 'check',
 'arbitration',
 'committee',
 'issue',
 'right',
 'simply',
 'satify',
 'curiousity',
 'qualify',
 'reason',
 'invade',
 'privacy',
 'leave',
 'message',
 'blankverse',
 'talk',
 'page',
 'sincerely']

In [39]:
# vector of the document as a whole:
test_doc.vector

array([-2.13121418e-02,  1.48708686e-01, -2.18804672e-01,  2.12305575e-03,
        6.60265684e-02,  4.28793095e-02, -2.85865436e-03, -2.22052231e-01,
       -2.84318291e-02,  2.12891936e+00, -1.49136350e-01,  5.84894195e-02,
        7.73073360e-02, -6.52949363e-02, -9.85272750e-02, -4.07859012e-02,
       -8.40210617e-02,  1.03252184e+00, -2.14530215e-01,  2.21992079e-02,
       -4.42361413e-03, -2.86064092e-02, -3.81505936e-02, -2.50708833e-02,
        1.76838264e-02,  2.72275545e-02, -3.40951905e-02, -9.39358771e-02,
        3.59738655e-02, -1.24521852e-01, -1.76976230e-02,  1.41400397e-01,
       -7.65471235e-02,  6.01183064e-02,  1.30067756e-02, -8.55363235e-02,
        1.19203480e-03,  5.07408418e-02, -8.37911591e-02, -7.08299801e-02,
       -2.41037142e-02,  2.25989874e-02,  1.92791447e-02, -4.06242758e-02,
       -2.73231473e-02,  6.00152463e-02, -1.01027787e-01,  1.79745089e-02,
        1.89492758e-02,  2.74333346e-04, -8.50247666e-02,  2.67418311e-03,
       -4.88690697e-02, -

## Confirm behavior on small subset of data

In [119]:
# return lowercase lemmas of the tokens that pass doc_check
def to_lc_lemmas(doc):
    return [tok.lemma_.lower() for tok in doc if doc_check(tok)]

In [120]:
tiny_df = X_train[0:2].copy()

In [121]:
%%time
# create docs from text
tiny_df['docs'] = tiny_df['comment_text'].apply(nlp)
tiny_df['docs']

In [130]:
%%time
# keep subset of lc lemmas to reduce dimensions
tiny_df['lemmas'] = tiny_df['docs'].apply(to_lc_lemmas)
tiny_df['lemmas']

CPU times: user 1.08 ms, sys: 67 µs, total: 1.15 ms
Wall time: 1.18 ms


27301     [meša, selimović, oppose, formulation, instead...
141668    [september, utc, talk, victimize, release, imp...
Name: lemmas, dtype: object


# Create Columns

## Doc Column

This column takes the longest to process, but the docs must be created before the other features can be pulled from them.

In [131]:
%%time
'''
CPU times: user 38min 19s, sys: 3min 42s, total: 42min 2s
Wall time: 43min 21s
'''

# create docs from text
X_train['docs'] = X_train['comment_text'].apply(nlp)
X_train['docs'].head(2)

CPU times: user 38min 19s, sys: 3min 42s, total: 42min 2s
Wall time: 43min 21s


27301     (', Meša, Selimović, I, 'm, not, opposing, suc...
141668    (', September, 2008, (, UTC, ), Talking, about...
Name: docs, dtype: object

## Lemmas Column

- render text to lemmas
- remove pronouns
- keep only alphabetic tokens
- remove stopwords in spaCy's default stopword set
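
`doc_check` is imported from the `spacy_helper` module in `../src` and is not shown in this notebook. A filter matching the list above might look like the sketch below. The attribute checks are assumptions about what the helper does, not its actual code. `is_alpha`, `is_stop`, `pos_` and `lemma_` are standard spaCy `Token` attributes; the `SimpleNamespace` objects are stand-ins so the sketch runs without loading a model.

```python
from types import SimpleNamespace


def doc_check_sketch(tok):
    """Hypothetical stand-in for doc_check: keep alphabetic, non-stopword, non-pronoun tokens."""
    # spaCy v3 marks pronouns with pos_ == "PRON"; spaCy v2 uses the lemma "-PRON-".
    is_pronoun = tok.pos_ == "PRON" or tok.lemma_ == "-PRON-"
    return tok.is_alpha and not tok.is_stop and not is_pronoun


def to_lc_lemmas_sketch(doc):
    """Lowercase lemmas of the tokens that pass the filter."""
    return [tok.lemma_.lower() for tok in doc if doc_check_sketch(tok)]


def fake_token(lemma, pos, is_alpha=True, is_stop=False):
    # Stand-in exposing the same attribute names as a spaCy Token.
    return SimpleNamespace(lemma_=lemma, pos_=pos, is_alpha=is_alpha, is_stop=is_stop)


fake_doc = [
    fake_token("I", "PRON", is_stop=True),
    fake_token("leave", "VERB"),
    fake_token("the", "DET", is_stop=True),
    fake_token("Message", "NOUN"),
    fake_token("!", "PUNCT", is_alpha=False),
]
lemmas = to_lc_lemmas_sketch(fake_doc)
```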

In [132]:
%%time
'''
CPU times: user 12.8 s, sys: 1.54 s, total: 14.3 s
Wall time: 14.9 s
'''

# keep subset of lc lemmas to reduce dimensions
X_train['lemmas'] = X_train['docs'].apply(to_lc_lemmas)
X_train['lemmas'].head(2)

CPU times: user 12.8 s, sys: 1.54 s, total: 14.3 s
Wall time: 14.9 s


27301     [meša, selimović, oppose, formulation, instead...
141668    [september, utc, talk, victimize, release, imp...
Name: lemmas, dtype: object

## Doc Vector Column

In [153]:
%%time
'''
CPU times: user 56.3 s, sys: 4.04 s, total: 1min
Wall time: 1min 6s
'''

X_train['doc_vectors'] = X_train['docs'].apply(lambda x: x.vector)

CPU times: user 56.3 s, sys: 4.04 s, total: 1min
Wall time: 1min 6s


## List of word vectors

We will reduce the number of vectors by keeping only those for lemmas that pass the earlier filters and have a non-zero vector.

Resource:
- [Getting Vector for Lemma](https://github.com/explosion/spaCy/issues/956) 
    - This was especially helpful for correctly formatting the lambda function.
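
One detail worth checking in the cells below: `tok.has_vector` tests the token's own text, while the vector that is kept comes from the lemma's entry in `nlp.vocab`, which can lack a vector even when the token has one. A sketch that tests the lemma's lexeme instead is shown here. `lemma_vectors` and the `keep` parameter are hypothetical names, with `keep` standing in for `doc_check`.

```python
def lemma_vectors(doc, vocab, keep=lambda tok: True):
    """Collect the vocab vector of each kept token's lemma, skipping lemmas with no vector."""
    vectors = []
    for tok in doc:
        if not keep(tok):
            continue
        lexeme = vocab[tok.lemma]  # look up the lemma, not the surface form
        if lexeme.has_vector:
            vectors.append(lexeme.vector)
    return vectors


# Intended use in this notebook (not run here):
# X_train['tok_vectors'] = X_train['docs'].apply(
#     lambda doc: lemma_vectors(doc, nlp.vocab, keep=doc_check))
```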

In [158]:
tiny_doc_sample = X_train['docs'].head(2)

# for doc in tiny_doc_sample:
#     print(doc.vector)
#     for tok in doc:
#         if doc_check(tok) and tok.has_vector:
#             print(tok.text, tok.has_vector, tok.vector_norm)

# try with small subset
tiny_doc_sample.apply(lambda doc: [nlp.vocab[tok.lemma].vector for tok in doc if doc_check(tok) and tok.has_vector])

27301     [[0.12798, -0.43185, 0.034991, 0.27789, -0.061...
141668    [[-0.02074, 0.42632, 0.59367, -0.090906, -0.08...
Name: docs, dtype: object

In [159]:
%%time
'''
CPU times: user 17.4 s, sys: 396 ms, total: 17.8 s
Wall time: 18 s

'''
X_train['tok_vectors'] = X_train['docs'].apply(lambda doc: [nlp.vocab[tok.lemma].vector for tok in doc if doc_check(tok) and tok.has_vector])

CPU times: user 17.4 s, sys: 396 ms, total: 17.8 s
Wall time: 18 s


In [None]:
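# describe() on a column of lists tries to hash each list and typically raises a TypeError,
# so summarize the number of vectors per comment instead
X_train['tok_vectors'].apply(len).describe()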

## Preserve Doc column separately

As the doc column is quite large, we'll preserve it separately.

In [133]:
%%time
'''
CPU times: user 48.3 s, sys: 30.3 s, total: 1min 18s
Wall time: 1min 55s
'''

X_train[['docs']].to_pickle('../data/basic_df_split/X_train_docs_series.pkl')

CPU times: user 48.3 s, sys: 30.3 s, total: 1min 18s
Wall time: 1min 55s


# Toxic Text


Detecting Insults in Social Commentary

Data from Wikipedia 

Data Source:
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

