# Detecting and Classifying Toxic Comments
# Part 1-2: Split the Data

We'll first split our data into a training group and a holdout group. The split is stratified on the `toxic` label so that both groups keep the Not Toxic to Toxic ratio of roughly 9:1.

# Setup

## Python Library Imports



In [1]:
import pandas as pd
import numpy as np

from collections import Counter
import re
import random

# scikit learn imports
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# tqdm & time
from tqdm.auto import tqdm
import time

%load_ext autoreload
%autoreload 2

## spaCy Setup & Imports

As mentioned previously, we'll be using spaCy version 2.3.5.

In [2]:
# # check version
! python -m spacy info
spaCy version    2.3.5                         
Location         /opt/anaconda3/lib/python3.7/site-packages/spacy
Platform         Darwin-20.3.0-x86_64-i386-64bit
Python version   3.7.6                         
Models                                         



In [3]:
# spaCy Imports
import spacy

from spacy.lang.en import English
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

from spacy.util import minibatch, compounding
from spacy import displacy
from spacy.tokens import Doc

from spacy.scorer import Scorer

## Load data from toxic_basic Pickle File

In [4]:
%%time
'''
last load time:

CPU times: user 67 ms, sys: 46.7 ms, total: 114 ms
Wall time: 114 ms
'''

# load toxic_basic pickle into dataframe
path_toxic_basic = "../data/toxic_basic.pkl"

toxic_df = pd.read_pickle(path_toxic_basic)

CPU times: user 114 ms, sys: 83.9 ms, total: 197 ms
Wall time: 213 ms


In [5]:
toxic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   comment_text          159571 non-null  object 
 1   uppercase_proportion  159548 non-null  float64
 2   toxic                 159571 non-null  int64  
 3   severe_toxic          159571 non-null  int64  
 4   obscene               159571 non-null  int64  
 5   threat                159571 non-null  int64  
 6   insult                159571 non-null  int64  
 7   identity_hate         159571 non-null  int64  
dtypes: float64(1), int64(6), object(1)
memory usage: 9.7+ MB


In [6]:
# Pair each comment with its toxic label as a (text, label) tuple, stored in a new column

toxic_df["tuples"] = toxic_df.apply(lambda row: (row['comment_text'], row['toxic']), axis=1)

In [7]:
toxic_df['tuples'][0]

("Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.",
 0)

# Simple Train Test Split

Because our process should first determine whether a text is toxic or not toxic, we'll make a simple stratified train/test split so that toxic and non-toxic rows are distributed proportionally across both sets.

For now, we won't be too concerned with the proportions of the sub-categories. Our first step will be to separate not-toxic from toxic texts; we'll then run a separate operation for each toxic sub-category in parallel, because the sub-categories are not mutually exclusive.

## Stratified Split maintaining ratio of toxic to not toxic texts


In [8]:
# check current columns
toxic_df.columns

Index(['comment_text', 'uppercase_proportion', 'toxic', 'severe_toxic',
       'obscene', 'threat', 'insult', 'identity_hate', 'tuples'],
      dtype='object')

In [9]:
# split df into X (independent) and y (dependent) groups
ind_cols = ['comment_text', 'uppercase_proportion']

X = toxic_df[ind_cols]
y = toxic_df.drop(columns=ind_cols)

print(f"X columns: {X.columns}\ny columns:{y.columns}")

X columns: Index(['comment_text', 'uppercase_proportion'], dtype='object')
y columns:Index(['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate',
       'tuples'],
      dtype='object')


In [10]:
# Train Test Split. Stratified on y['toxic']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42, 
                                                    stratify=y['toxic'])
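
To make the effect of `stratify` concrete, here is a minimal pure-Python sketch of the idea behind a stratified split: group row indices by label, shuffle each group, and cut every group at the same fraction. It runs on synthetic labels and is not scikit-learn's implementation; the helper name `stratified_indices` and the toy 9:1 label list are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_indices(labels, test_size=0.33, seed=42):
    """Split row indices into train/test while keeping each label's share.

    Hypothetical helper for illustration only; not how scikit-learn
    implements `train_test_split(..., stratify=...)` internally.
    """
    rng = random.Random(seed)

    # Group row positions by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)  # same fraction for every label
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])

    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 9:1 not-toxic (0) to toxic (1) ratio
toy_labels = [0] * 900 + [1] * 100

train_idx, test_idx = stratified_indices(toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Because each label group is cut at the same fraction, both the training and the holdout counts keep the 9:1 ratio of the toy input.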

# Stratified K Fold

- [SKF docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#)  

In [11]:
tiny_df = toxic_df.sample(20)

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3,
                      random_state=42,
                      shuffle=True)
print(skf)

skf.get_n_splits(X_train['comment_text'], y_train['toxic'])

# for tr_in, test_in in skf.split(toxic_df['comment_text'], toxic_df['toxic']):
#     print(tr_in, test_in)


train_indx, test_indx = next(skf.split(toxic_df['comment_text'], toxic_df['toxic']))

StratifiedKFold(n_splits=3, random_state=42, shuffle=True)


# spaCy

Let's try out spaCy, an NLP library!

- https://course.spacy.io/en/chapter1
- [text classification with spaCy](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/) 
- [customized list of stopwords](https://spacy.io/usage/linguistic-features#stop-words)  
- [Split Series into list of sentences](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.cat.html)  
- [contractions](https://theslaps.medium.com/cant-stand-don-t-want-contractions-with-spacy-39715cac2ebb)  


- [v2.spacy.io](https://v2.spacy.io/)

# Try new, blank spaCy model

## Establish spaCy Pipeline

"spaCy's components are supervised models for text annotations, meaning they can only learn to reproduce examples, not guess new labels from raw text."

Resources:
- [for emojis](https://spacy.io/universe/project/spacymoji)  

Code is modified from this tutorial:
- https://www.machinelearningplus.com/nlp/custom-text-classification-spacy/

In [12]:
# ! python -m spacy download en_core_web_lg

Resources:
- [spaCy docs: scorer](https://spacy.io/api/scorer)  
- [F-Score](https://en.wikipedia.org/wiki/F-score)  

In [13]:
# nlp = spacy.blank('en')
import en_core_web_lg
nlp = en_core_web_lg.load()

# Provide scoring pipeline
scorer = Scorer(nlp)

In [14]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [15]:
# tagger = nlp.create_pipe('tagger')
textcat = nlp.create_pipe('textcat')

In [16]:
# nlp.add_pipe(tagger)
nlp.add_pipe(textcat)

In [17]:
textcat.add_label("TOXIC")
textcat.add_label("NOT TOXIC")

1

In [18]:
nlp.pipe_names

['tagger', 'parser', 'ner', 'textcat']

# I left off here!!!

- https://v2.spacy.io/usage/processing-pipelines#pipelines
- https://v2.spacy.io/usage/processing-pipelines


In [19]:
# from spacy.tokens import Doc
# from spacy.training import Example


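def txt_and_cat(txt_series, cat_series):
    """Pair each text with its spaCy textcat annotation dict.

    Returns a list of (text, {"cats": {"TOXIC": bool, "NOT TOXIC": bool}}) tuples.
    """
    # convert each series or series slice to list
    t = txt_series.tolist()
    c = cat_series.tolist()

    # format categories: exactly one of the two labels is True per text
    c = [{"TOXIC": bool(y), "NOT TOXIC": not bool(y)} for y in c]
    c = [{'cats': i} for i in c]

    docs = list(zip(t, c))

    return docs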

[Article for v3](https://medium.com/analytics-vidhya/building-a-text-classifier-with-spacy-3-0-dd16e9979a)

In [20]:
# ! ls ../models/base_config.cfg
# ! python -m spacy init fill-config ../models/base_config.cfg config.cfg --diff

In [21]:
# ! python -m spacy validate

In [22]:
# formatting list of tuples for spacy training
txt = toxic_df['comment_text'][train_indx]
cat = toxic_df['toxic'][train_indx]

train_docs = txt_and_cat(txt, cat)

test_txt = toxic_df['comment_text'][test_indx]
test_cat = toxic_df['toxic'][test_indx]

test_docs = txt_and_cat(test_txt, test_cat)

In [23]:
# this should be the correct format expected by the trainer
print(len(train_docs), len(test_docs))

# print(train_docs[0][1])

print([i for i in train_docs[:5]])

106380 53191
[("Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.", {'cats': {'TOXIC': False, 'NOT TOXIC': True}}), ("D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)", {'cats': {'TOXIC': False, 'NOT TOXIC': True}}), ("Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.", {'cats': {'TOXIC': False, 'NOT TOXIC': True}}), ("' More I can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of 'types of accidents' -I think the references may need tidying so that they are all in the exact

In [24]:
test_txt_lst = [i[0] for i in test_docs]
test_cat_lst = [i[1] for i in test_docs]


# Not providing proper scoring...

In [25]:
def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 0.0  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 0.0  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if label == "TOXIC":
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.0
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.0
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}


In [26]:
train_data = train_docs[:1000]
dev_texts = test_txt_lst[:1000]
dev_cats = test_cat_lst[:1000]

In [27]:
from spacy.util import minibatch, compounding


# Number of training iterations
n_iter = 10

# Disabling other components
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes):  
    optimizer = nlp.begin_training()

    print("Training the model...")
    print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))

    # Performing training
    for i in range(n_iter):
        losses = {}
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
                       losses=losses)

        # Calling the evaluate() function and printing the scores
        with textcat.model.use_params(optimizer.averages):
            scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
        print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  
              .format(losses['textcat'], scores['textcat_p'],
                      scores['textcat_r'], scores['textcat_f']))

Training the model...
LOSS 	  P  	  R  	  F  
3.391	0.000	0.000	0.000
2.828	0.000	0.000	0.000
2.170	0.000	0.000	0.000
1.448	0.000	0.000	0.000
1.198	0.000	0.000	0.000
0.724	0.000	0.000	0.000
0.427	0.000	0.000	0.000
0.219	0.000	0.000	0.000
0.230	0.000	0.000	0.000
0.240	0.000	0.000	0.000
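
The all-zero P, R, and F columns above are consistent with how `dev_cats` is built: each element of `test_cat_lst` is still wrapped as `{'cats': {...}}`, so inside `evaluate` the check `label not in gold` is true for every label and nothing is ever counted. Below is a minimal sketch of the counting logic with the gold dicts unwrapped first, using plain dictionaries in place of `doc.cats`. The function name `prf_for_label`, the toy scores, and the choice of `TOXIC` as the positive label are assumptions for illustration.

```python
def prf_for_label(pred_cats, gold_annotations, label="TOXIC", threshold=0.5):
    """Precision, recall, and F1 for one label.

    pred_cats:        list of {label: score} dicts (the shape of `doc.cats`)
    gold_annotations: list of {"cats": {label: bool}} dicts (the shape
                      produced by `txt_and_cat`)
    """
    tp = fp = fn = 0
    for pred, annotation in zip(pred_cats, gold_annotations):
        gold = annotation["cats"]  # unwrap before looking up the label
        predicted = pred[label] >= threshold
        actual = bool(gold[label])
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}


# Toy predictions and gold labels (made up for the example)
toy_pred = [
    {"TOXIC": 0.9, "NOT TOXIC": 0.1},   # true positive
    {"TOXIC": 0.8, "NOT TOXIC": 0.3},   # false positive
    {"TOXIC": 0.2, "NOT TOXIC": 0.7},   # false negative
    {"TOXIC": 0.1, "NOT TOXIC": 0.95},  # true negative
]
toy_gold = [
    {"cats": {"TOXIC": True, "NOT TOXIC": False}},
    {"cats": {"TOXIC": False, "NOT TOXIC": True}},
    {"cats": {"TOXIC": True, "NOT TOXIC": False}},
    {"cats": {"TOXIC": False, "NOT TOXIC": True}},
]

scores = prf_for_label(toy_pred, toy_gold)
print(scores)
```

In the notebook, the equivalent change would be to build the gold list as `[i[1]['cats'] for i in test_docs]`, or to unwrap inside `evaluate`. Note also that `evaluate` as written skips the `TOXIC` label, so once it does count anything it reports scores for `NOT TOXIC`.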


In [28]:
# Testing the model on a single held-out comment
test_text = test_txt

# test_text is a Series, so [6] looks up the row labelled 6
print(test_text[6])

doc = nlp(test_text[6])
doc.cats

COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK


{'TOXIC': 0.9999545812606812, 'NOT TOXIC': 0.002241710666567087}
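
To turn the score dictionary into a single prediction, one option is to take the highest-scoring label. This is a small sketch that uses the scores printed above as a plain dictionary; the variable names are illustrative. The two scores do not sum to 1, which suggests the labels are scored independently rather than as mutually exclusive classes.

```python
# Scores copied from the `doc.cats` output above
cats = {"TOXIC": 0.9999545812606812, "NOT TOXIC": 0.002241710666567087}

# Pick the label with the highest score
predicted_label = max(cats, key=cats.get)
print(predicted_label)

# The scores are not normalised to sum to 1
print(sum(cats.values()))
```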

Resources:
- [expanding contractions](https://gist.github.com/widiger-anna/deefac010da426911381c118a97fc23f)  
- [contractions](https://theslaps.medium.com/cant-stand-don-t-want-contractions-with-spacy-39715cac2ebb)  
- [text wrangling](https://www.kdnuggets.com/2018/08/practitioners-guide-processing-understanding-text-2.html)  
- [nlp nltk vs spacy](https://www.activestate.com/blog/natural-language-processing-nltk-vs-spacy/)  
- [pytorch](https://pytorch.org/)  
- [text classification in python with spacy (try this one!)](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/)  
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a