# Detecting and Classifying Toxic Comments
# Part 3: Sequential Binary Classifiers

Sequential binary models may give better results on the rarer classes.

If we first classify comments as Toxic or Not Toxic, we can then pass only the Toxic results to models that have been trained solely to recognise sub-classes of toxic comments.
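
The two-stage idea can be sketched in plain Python. This is only an illustration under stated assumptions: `cascade_predict`, the `Classifier` interface, and the keyword-based toy models are hypothetical stand-ins, not the models trained in this project. Only the label names come from the `y_train` columns.

```python
from typing import Callable, Dict, List

# Sub-class labels taken from the y_train columns in this notebook.
SUBCLASSES = ["severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical interface: a model is any callable mapping text -> True/False.
Classifier = Callable[[str], bool]


def cascade_predict(
    texts: List[str],
    is_toxic: Classifier,
    sub_models: Dict[str, Classifier],
) -> List[Dict[str, int]]:
    """Run stage 1 on every text and stage 2 only on texts flagged toxic."""
    results = []
    for text in texts:
        labels = {"toxic": int(is_toxic(text))}
        for name in SUBCLASSES:
            # The sub-class model is only consulted when stage 1 said "toxic";
            # otherwise the label stays 0.
            labels[name] = int(bool(labels["toxic"]) and sub_models[name](text))
        results.append(labels)
    return results


# Toy keyword rules standing in for trained models (assumption for the demo).
def toy_is_toxic(text: str) -> bool:
    lowered = text.lower()
    return "idiot" in lowered or "kill" in lowered


toy_sub_models: Dict[str, Classifier] = {name: (lambda text: False) for name in SUBCLASSES}
toy_sub_models["insult"] = lambda text: "idiot" in text.lower()
toy_sub_models["threat"] = lambda text: "kill" in text.lower()

predictions = cascade_predict(
    ["Thanks for the edit.", "You are an idiot."], toy_is_toxic, toy_sub_models
)
print(predictions[1])
```

One consequence of this design is that any comment stage 1 misses can never receive a sub-class label, so the recall of the first model caps the recall of every later one.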

# Setup

## Python Library Imports

In [1]:
import pandas as pd
import numpy as np

from timeit import default_timer as timer

%load_ext autoreload
%autoreload 2

## spaCy Setup and Imports

This time, we'll only use spaCy for data cleaning.

In [2]:
import spacy

from spacy.lang.en import English
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

from spacy.tokens import Doc
# import en_core_web_lg
nlp = spacy.load('../models/spacy_multi_cat_model/')

## Import Custom Functions

I've created a few custom functions to assist in text preparation.

In [3]:
import sys

# add src folder to path
sys.path.insert(1, '../src')

# from text_prep import tidy_series, uppercase_proportion_column
from spacy_helper import doc_check

## Load Train & Test Dataframes from Pickle File

We've already done a stratified Train Test Split, and a little bit of very basic text processing.

In [4]:
# ! ls ../data/basic_df_split/

X_train = pd.read_pickle('../data/basic_df_split/basic_X_train.pkl')
X_test = pd.read_pickle('../data/basic_df_split/basic_X_test.pkl')
y_train = pd.read_pickle('../data/basic_df_split/basic_y_train.pkl')
y_test = pd.read_pickle('../data/basic_df_split/basic_y_test.pkl')

In [5]:
print(X_train.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 106912 entries, 27301 to 14596
Data columns (total 2 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   comment_text          106912 non-null  object 
 1   uppercase_proportion  106897 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.4+ MB
None


In [6]:
print(y_train.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 106912 entries, 27301 to 14596
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   toxic          106912 non-null  int64
 1   severe_toxic   106912 non-null  int64
 2   obscene        106912 non-null  int64
 3   threat         106912 non-null  int64
 4   insult         106912 non-null  int64
 5   identity_hate  106912 non-null  int64
dtypes: int64(6)
memory usage: 5.7 MB
None


# 1: Use spaCy for feature reduction

We will use spaCy to reduce the feature space by:
- removing stopwords
- removing punctuation
- retaining only lemmas
- rendering all lemmas to lowercase
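
The steps above can be sketched as a small token filter. The real filtering in this notebook is done by the custom `doc_check` helper, whose implementation is not shown here, so `keep_token` and `lowercase_lemmas` below are hypothetical illustrations. They rely only on the `is_stop`, `is_punct`, and `lemma_` attributes that spaCy tokens expose, and the demo uses simple stand-in objects so that it runs without a loaded model.

```python
from types import SimpleNamespace


def keep_token(token) -> bool:
    """Hypothetical filter: drop stopwords and punctuation."""
    return not token.is_stop and not token.is_punct


def lowercase_lemmas(tokens) -> list:
    """Return the lowercased lemma of every token that passes the filter."""
    return [tok.lemma_.lower() for tok in tokens if keep_token(tok)]


def fake_token(text, lemma, is_stop=False, is_punct=False):
    # Stand-in for a spaCy Token, carrying only the attributes used above.
    return SimpleNamespace(text=text, lemma_=lemma, is_stop=is_stop, is_punct=is_punct)


tokens = [
    fake_token("Still", "still", is_stop=True),
    fake_token("editing", "edit"),
    fake_token("Wikipedia", "Wikipedia"),
    fake_token("?", "?", is_punct=True),
]
print(lowercase_lemmas(tokens))  # ['edit', 'wikipedia']
```

With a real pipeline, the same function could be called on a `Doc`, since iterating over a `Doc` yields tokens with these attributes.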

## 1a: testing process with subset of text data

In [7]:
# create test subset copy
text_sub = X_train['comment_text'].sample(5)

In [8]:
text_sub.iloc[0]

"Sorry Still can't find your penis?"

In [9]:
test_doc = nlp(text_sub.iloc[0])
print(type(test_doc))

<class 'spacy.tokens.doc.Doc'>


In [10]:
lemmas_lc = [i.lemma_.lower() for i in test_doc if doc_check(i)]
lemmas_lc

['sorry', 'find', 'penis']

In [11]:
# vector of the document as a whole:
test_doc.vector

array([-2.08142996e-01,  1.86965004e-01, -3.79616290e-01, -6.50937557e-02,
        6.80181161e-02,  1.81352478e-02,  8.92658755e-02, -2.32500494e-01,
        3.28790024e-02,  2.19221258e+00, -2.24580005e-01,  3.60621251e-02,
        1.41831264e-01, -9.37562287e-02, -2.12733522e-01, -6.34263754e-02,
       -9.71264318e-02,  9.87537503e-01, -2.17758760e-01,  8.26437324e-02,
        1.53897002e-01, -7.26040006e-02, -1.44263683e-02, -1.23997256e-01,
       -1.07008487e-01,  3.40928733e-02, -6.34614229e-02, -9.45577472e-02,
        3.38668734e-01, -1.06250711e-01, -1.82056248e-01,  2.19470993e-01,
       -1.58892125e-01,  6.11282513e-02,  1.33511633e-01,  5.97183742e-02,
        1.34243131e-01,  1.33611500e-01, -1.02997571e-01, -1.49885744e-01,
       -3.75362486e-02, -3.38061228e-02, -8.67414996e-02, -3.63667533e-02,
        2.00054139e-01,  1.42539874e-01, -2.84347497e-02, -3.69590893e-02,
        5.21373786e-02,  1.01151876e-01, -1.56987518e-01, -5.21989912e-02,
        7.85263702e-02, -

## Confirm behavior on small subset of data

In [12]:
# return lowercase lemmas of the tokens that pass doc_check
def to_lc_lemmas(s):
    return [i.lemma_.lower() for i in s if doc_check(i)]

In [13]:
tiny_df = X_train[0:2].copy()

In [14]:
%%time
# create docs from text
tiny_df['docs'] = tiny_df['comment_text'].apply(nlp)
tiny_df['docs']

CPU times: user 39 ms, sys: 3.09 ms, total: 42.1 ms
Wall time: 41.3 ms


27301     (', Meša, Selimović, I, 'm, not, opposing, suc...
141668    (', September, 2008, (, UTC, ), Talking, about...
Name: docs, dtype: object

In [15]:
%%time
# keep subset of lc lemmas to reduce dimensions
tiny_df['lemmas'] = tiny_df['docs'].apply(to_lc_lemmas)
tiny_df['lemmas']

CPU times: user 1.21 ms, sys: 35 µs, total: 1.25 ms
Wall time: 1.22 ms


27301     [meša, selimović, oppose, formulation, instead...
141668    [september, utc, talk, victimize, release, imp...
Name: lemmas, dtype: object

# Create Columns

## Doc Column

This column takes the longest to process, but the docs must be created before the other features can be pulled from them.

In [16]:
%%time
'''
CPU times: user 3min 17s, sys: 12.5 s, total: 3min 30s
Wall time: 3min 32s
'''
# if already created, load the column from pickle file
# X_train['docs'] = pd.read_pickle('../data/basic_df_split/X_train_docs_series.pkl')

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs


'\nCPU times: user 3min 17s, sys: 12.5 s, total: 3min 30s\nWall time: 3min 32s\n'

In [22]:
%%time
# '''
# CPU times: user 38min 19s, sys: 3min 42s, total: 42min 2s
# Wall time: 43min 21s
# '''

# create docs from text
X_train['docs'] = X_train['comment_text'].apply(nlp)
X_train['docs'].head(2)

CPU times: user 1h 1min 51s, sys: 31min 39s, total: 1h 33min 31s
Wall time: 36min 56s


27301     (', Meša, Selimović, I, 'm, not, opposing, suc...
141668    (', September, 2008, (, UTC, ), Talking, about...
Name: docs, dtype: object
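
Applying `nlp` one row at a time accounts for most of the runtime above. spaCy's documentation recommends `nlp.pipe` for processing many texts, because it works through them in batches. The helper below is a minimal sketch of that alternative; `docs_from_texts` and the batch size of 256 are assumptions for illustration, and it was not timed on this dataset.

```python
from typing import Iterable, List


def docs_from_texts(texts: Iterable[str], nlp, batch_size: int = 256) -> List:
    """Build Doc objects in batches with nlp.pipe instead of one call per row.

    `nlp` is expected to be a loaded spaCy pipeline; the batch size is an
    arbitrary starting point, not a tuned value.
    """
    return list(nlp.pipe(texts, batch_size=batch_size))


# In this notebook the call might look like this (not run here):
# X_train['docs'] = docs_from_texts(X_train['comment_text'], nlp)
```

Because `nlp.pipe` yields docs in input order, assigning the resulting list back to the dataframe keeps each doc aligned with its row.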

In [23]:
X_train.columns

Index(['comment_text', 'uppercase_proportion', 'docs'], dtype='object')

## Lemmas Column

- text rendered to lemmas  
- pronouns removed
- preserve only alphabetical entities
- remove stopwords in spaCy's default stopwords set

In [24]:
%%time
'''
CPU times: user 12.8 s, sys: 1.54 s, total: 14.3 s
Wall time: 14.9 s
'''

# keep subset of lc lemmas to reduce dimensions
X_train['lemmas'] = X_train['docs'].apply(to_lc_lemmas)
X_train['lemmas'].head(2)

CPU times: user 8.58 s, sys: 133 ms, total: 8.72 s
Wall time: 8.72 s


27301     [meša, selimović, oppose, formulation, instead...
141668    [september, utc, talk, victimize, release, imp...
Name: lemmas, dtype: object

## Doc Vector Column

In [25]:
%%time
'''
CPU times: user 56.3 s, sys: 4.04 s, total: 1min
Wall time: 1min 6s
'''

X_train['doc_vectors'] = X_train['docs'].apply(lambda x: x.vector)

CPU times: user 1min 3s, sys: 262 ms, total: 1min 4s
Wall time: 1min 4s


## List of word vectors Column

We will reduce the number of vectors by limiting our selection to those vectors representing lemmas that conform to our previous parameters and also have a non-zero vector.

Resource:
- [Getting Vector for Lemma](https://github.com/explosion/spaCy/issues/956) 
    - This was especially helpful for correctly formatting the lambda function.

In [26]:
tiny_doc_sample = X_train['docs'].head(2)

# for doc in tiny_doc_sample:
#     print(doc.vector)
#     for tok in doc:
#         if doc_check(tok) and tok.has_vector:
#             print(tok.text, tok.has_vector, tok.vector_norm)

# try with small subset
tiny_doc_sample.apply(lambda doc: [nlp.vocab[tok.lemma].vector for tok in doc if doc_check(tok) and tok.has_vector])

27301     [[0.12798, -0.43185, 0.034991, 0.27789, -0.061...
141668    [[-0.02074, 0.42632, 0.59367, -0.090906, -0.08...
Name: docs, dtype: object

In [27]:
%%time
'''
CPU times: user 17.4 s, sys: 396 ms, total: 17.8 s
Wall time: 18 s

'''
X_train['tok_vectors'] = X_train['docs'].apply(lambda doc: [nlp.vocab[tok.lemma].vector for tok in doc if doc_check(tok) and tok.has_vector])

CPU times: user 15.3 s, sys: 183 ms, total: 15.5 s
Wall time: 15.5 s


In [28]:
X_train.columns

Index(['comment_text', 'uppercase_proportion', 'docs', 'lemmas', 'doc_vectors',
       'tok_vectors'],
      dtype='object')

## Preserve Doc column separately

As the doc column is quite large, we'll preserve it separately.

In [29]:
%%time
'''
CPU times: user 48.3 s, sys: 30.3 s, total: 1min 18s
Wall time: 1min 55s
'''

X_train['docs'].to_pickle('../data/basic_df_split/X_train_docs_series.pkl')

CPU times: user 45.3 s, sys: 23.2 s, total: 1min 8s
Wall time: 1min 32s
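
Since building the docs is slow and saving them is comparatively cheap, a load-or-compute pattern avoids repeating the expensive step on later runs. The sketch below uses the standard-library `pickle` module; `load_or_compute` is a hypothetical helper, not part of this project's `src` folder.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_compute(path: str, compute: Callable[[], Any]) -> Any:
    """Return the pickled object at `path`, computing and saving it if absent."""
    cache = Path(path)
    if cache.exists():
        # Cached result found: skip the expensive computation.
        with cache.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    cache.parent.mkdir(parents=True, exist_ok=True)
    with cache.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

In this notebook, `pd.read_pickle` and `Series.to_pickle` could play the loading and saving roles in the same pattern.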


In [37]:
# ls ../models
# ! mkdir ../models/spacy_2
! ls ../models

base_config.cfg       spacy_2               spacy_multi_cat_model


In [39]:
! ls ../models
nlp.to_disk("../models/spacy_2/")

base_config.cfg       spacy_2               spacy_multi_cat_model


## Preserve X_train with new columns

In [41]:
%%time
X_train.to_pickle('../data/basic_df_split/X_train_2-1.pkl')

CPU times: user 1min 14s, sys: 1min 12s, total: 2min 27s
Wall time: 4min 33s


# Toxic Text


Detecting Insults in Social Commentary

Data from Wikipedia 

Data Source:
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

