# Detecting and Classifying Toxic Comments
# Part 2-3: spaCy Multi-Category Test

Testing multiple, non-mutually exclusive categories, with labels encoded as 1/0 instead of True/False

# Setup

## Python Library Imports



In [176]:
import pandas as pd
import numpy as np

from collections import Counter
import re
import random

# scikit learn imports
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# # tqdm & time
# from tqdm.auto import tqdm
# import time
from timeit import default_timer as timer

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## spaCy Setup & Imports

As mentioned previously, we'll be using spaCy version 2.3.5

In [2]:
# # check version
! python -m spacy info

spaCy version    2.3.5                         
Location         /opt/anaconda3/lib/python3.7/site-packages/spacy
Platform         Darwin-20.3.0-x86_64-i386-64bit
Python version   3.7.6                         
Models                                         



In [3]:
# spaCy Imports
import spacy

from spacy.lang.en import English
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

from spacy.util import minibatch, compounding
from spacy import displacy
from spacy.tokens import Doc

from spacy.scorer import Scorer

## Load data from toxic_basic Pickle File

In [4]:
%%time
'''
last load time:

CPU times: user 67 ms, sys: 46.7 ms, total: 114 ms
Wall time: 114 ms
'''

# load toxic_basic pickle into dataframe
path_toxic_basic = "../data/toxic_basic.pkl"

toxic_df = pd.read_pickle(path_toxic_basic)

CPU times: user 72.5 ms, sys: 55.7 ms, total: 128 ms
Wall time: 136 ms


In [5]:
toxic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   comment_text          159571 non-null  object 
 1   uppercase_proportion  159548 non-null  float64
 2   toxic                 159571 non-null  int64  
 3   severe_toxic          159571 non-null  int64  
 4   obscene               159571 non-null  int64  
 5   threat                159571 non-null  int64  
 6   insult                159571 non-null  int64  
 7   identity_hate         159571 non-null  int64  
dtypes: float64(1), int64(6), object(1)
memory usage: 9.7+ MB


In [6]:
# Removing tuples here...
# Convert training text and training outcomes into a list of tuples

# toxic_df["tuples"] = toxic_df.apply(lambda row: (row['comment_text'], row['toxic']), axis=1)

In [7]:
# toxic_df['tuples'][0]

# Simple Train Test Split

Since the first step is to decide whether a text is toxic at all, we'll make a simple stratified train/test split so that toxic and non-toxic rows keep the same proportion in both sets.

For now, we won't worry about the proportions of the sub-categories. The plan is to first separate non-toxic from toxic texts and then score each toxic sub-category in parallel, since the sub-categories are not mutually exclusive.
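
To make the idea concrete, here is a minimal, library-free sketch of what stratifying on the `toxic` flag does: row indices are grouped by label, each group is shuffled, and every group is cut at the same test fraction. The function name `stratified_split` and the toy labels are assumptions for illustration only; the notebook itself relies on scikit-learn's `train_test_split` with `stratify=`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.33, seed=42):
    """Return (train_idx, test_idx) with the label ratio preserved in both parts."""
    rng = random.Random(seed)

    # group row indices by their label value
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        # cut every group at the same fraction
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])

    return sorted(train_idx), sorted(test_idx)


# hypothetical toy labels: 10% toxic (1), 90% not toxic (0)
toy_labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(toy_labels)

train_ratio = sum(toy_labels[i] for i in train_idx) / len(train_idx)
test_ratio = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(f"toxic share, train: {train_ratio:.2f}, test: {test_ratio:.2f}")
```

Both shares stay close to the original 10%, which is the property we want from the real split.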

## Stratified Split maintaining ratio of toxic to not toxic texts


In [8]:
# check current columns
toxic_df.columns

Index(['comment_text', 'uppercase_proportion', 'toxic', 'severe_toxic',
       'obscene', 'threat', 'insult', 'identity_hate'],
      dtype='object')

In [9]:
# split df into X (independent) and y (dependent) groups
ind_cols = ['comment_text', 'uppercase_proportion']

X = toxic_df[ind_cols]
y = toxic_df.drop(columns=ind_cols)

print(f"X columns: {X.columns}\ny columns:{y.columns}")

X columns: Index(['comment_text', 'uppercase_proportion'], dtype='object')
y columns:Index(['toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
       'identity_hate'],
      dtype='object')


In [10]:
# Train Test Split. Stratified on y['toxic']
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42, 
                                                    stratify=y['toxic'])

# Stratified K Fold

- [SKF docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#)  

In [11]:
# tiny_df = toxic_df.sample(20)

# from sklearn.model_selection import StratifiedKFold

# skf = StratifiedKFold(n_splits=3,
#                       random_state=42,
#                       shuffle=True)
# print(skf)

# skf.get_n_splits(X_train['comment_text'],
#                  y_train[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']])

# train_indx, test_indx = next(skf.split(toxic_df['comment_text'], toxic_df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]))

# spaCy

Let's try out spaCy, an NLP library!

- https://course.spacy.io/en/chapter1
- [text classification with spaCy](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/) 
- [customized list of stopwords](https://spacy.io/usage/linguistic-features#stop-words)  
- [Split Series into list of sentences](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.cat.html)  
- [contractions](https://theslaps.medium.com/cant-stand-don-t-want-contractions-with-spacy-39715cac2ebb)  


- [v2.spacy.io](https://v2.spacy.io/)

# Train spaCy Model for Classification

## Establish spaCy Pipeline

"spaCy's components are supervised models for text annotations, meaning that they can only learn to reproduce examples, not guess new labels from raw text."

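By default, spaCy v2's text categorizer uses its `ensemble` architecture, which stacks a bag-of-words model with a convolutional neural network. A lighter `simple_cnn` architecture is also available.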
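
Because the toxicity labels are not mutually exclusive, each label needs its own independent score rather than a share of one probability mass. The sketch below contrasts the two scoring schemes on made-up raw scores (the `logits` values are an assumption, not model output): a softmax forces the scores to sum to 1, while a per-label sigmoid does not. Our understanding is that spaCy v2's `textcat` scores labels independently when `exclusive_classes` is left at False, but treat that as an assumption and check the v2 docs.

```python
import math


def softmax(scores):
    """Mutually exclusive scoring: outputs compete and sum to 1."""
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores):
    """Independent scoring: each output is its own probability in (0, 1)."""
    return [1 / (1 + math.exp(-s)) for s in scores]


labels = ["TOXIC", "OBSCENE", "INSULT"]
logits = [2.0, 1.5, -3.0]  # hypothetical raw scores for one comment

exclusive = dict(zip(labels, softmax(logits)))
independent = dict(zip(labels, sigmoid(logits)))

print("softmax sum:", round(sum(exclusive.values()), 6))
print("sigmoid sum:", round(sum(independent.values()), 6))
```

With independent scoring, a single comment can score high on both TOXIC and OBSCENE at once, which matches how the dataset is labeled.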

Resources:
- [for emojis](https://spacy.io/universe/project/spacymoji)  

Code is modified from this tutorial:
https://www.machinelearningplus.com/nlp/custom-text-classification-spacy/

In [12]:
# ! python -m spacy download en_core_web_lg

Resources
- [spaCy docs: scorer](https://spacy.io/api/scorer)  

- [F-Score](https://en.wikipedia.org/wiki/F-score)  

In [13]:
## if the model is not yet locally available
# ! python -m spacy download en_core_web_lg

import en_core_web_lg
nlp = en_core_web_lg.load()

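# Provide scoring pipeline
# in spaCy v2 the first positional argument of Scorer is eval_punct,
# so the pipeline is passed by keyword
scorer = Scorer(pipeline=nlp.pipeline)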

In [14]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [15]:
# tagger = nlp.create_pipe('tagger')
textcat = nlp.create_pipe('textcat')

In [16]:
# nlp.add_pipe(tagger)
nlp.add_pipe(textcat)

In [17]:
# textcat.add_label("TOXIC")
# textcat.add_label("SEVERE_TOXIC")
# textcat.add_label("OBSCENE")
# textcat.add_label("THREAT")
# textcat.add_label("INSULT")
# textcat.add_label("IDENTITY_HATE")
# # textcat.add_label("NOT TOXIC")

def add_labels_helper(df):
    '''
    takes a dataframe,
    unpacks the column labels and adds each as a label to textcat
        formatted as uppercase
    '''

    for col in df.columns:
        print(col)
        textcat.add_label(col.upper())

In [18]:
add_labels_helper(y_train)
    

toxic
severe_toxic
obscene
threat
insult
identity_hate


In [19]:
nlp.pipe_names

['tagger', 'parser', 'ner', 'textcat']

In [20]:
textcat.labels

('TOXIC', 'SEVERE_TOXIC', 'OBSCENE', 'THREAT', 'INSULT', 'IDENTITY_HATE')

# I left off here!!!

https://v2.spacy.io/usage/processing-pipelines#pipelines
https://v2.spacy.io/usage/processing-pipelines


[zipping from dict of unknown size](https://stackoverflow.com/a/40658867)

In [21]:
def txt_and_multi_cat(txt_series, multi_cat_df):
    '''
    takes a series of texts and a dataframe of 0/1 category columns
    returns a list of (text, {'cats': {LABEL: value}}) tuples,
        the format expected by nlp.update in spaCy v2
    '''
    # convert the text series (or series slice) to a list
    t = txt_series.tolist()

    # uppercase each dependent column name to match the textcat labels
    cats = [col.upper() for col in multi_cat_df.columns]
    print(cats)

    # build one {LABEL: value} dict per row
    cat_vals = multi_cat_df.values.tolist()
    c = [dict(zip(cats, row)) for row in cat_vals]

    # wrap each dict in the 'cats' annotation key
    c = [{'cats': i} for i in c]

    docs = list(zip(t, c))

    return docs


In [22]:
tot = 20
tiny_X = X_train['comment_text'][:tot]
tiny_y = y_train[:tot]

txt_and_multi_cat(tiny_X, tiny_y)
# print(tiny_X, tiny_y)

['TOXIC', 'SEVERE_TOXIC', 'OBSCENE', 'THREAT', 'INSULT', 'IDENTITY_HATE']


[("' Meša Selimović I'm not opposing such a formulation. Instead I'm trying to prevent potential editwars by inclusion of both views (Bosnian and Serbian writer). However, it seems that there is a third view (Yugoslavian writer) and so an option is to mention all of them (Yugoslavian, Bosnian and Serbian writer). Although this may read cumbersome, maybe it's worth trying. All the best. 's talk '",
  {'cats': {'TOXIC': 0,
    'SEVERE_TOXIC': 0,
    'OBSCENE': 0,
    'THREAT': 0,
    'INSULT': 0,
    'IDENTITY_HATE': 0}}),
 ("' September 2008 (UTC) Talking about how he was victimized because of the release of his name is an implication that releasing his name was bad. (talk | contribs) 03:17, 9'",
  {'cats': {'TOXIC': 0,
    'SEVERE_TOXIC': 0,
    'OBSCENE': 0,
    'THREAT': 0,
    'INSULT': 0,
    'IDENTITY_HATE': 0}}),
 (', 1 August 2012 (UTC) Danke, - supports what we found, no need to ask further, - the Main page appearance on 26 July opened new heights, 08:05',
  {'cats': {'TOXIC': 

[Article for v3](https://medium.com/analytics-vidhya/building-a-text-classifier-with-spacy-3-0-dd16e9979a)

In [23]:
# ! ls ../models/base_config.cfg
# ! python -m spacy init fill-config ../models/base_config.cfg config.cfg --diff

In [24]:
# ! python -m spacy validate


In [25]:
# formatting list of tuples for spacy training
train_txt = X_train['comment_text']
train_cat = y_train[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

train_docs = txt_and_multi_cat(train_txt, train_cat)

test_txt = X_test['comment_text']
test_cat = y_test[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

test_docs = txt_and_multi_cat(test_txt, test_cat)

['TOXIC', 'SEVERE_TOXIC', 'OBSCENE', 'THREAT', 'INSULT', 'IDENTITY_HATE']
['TOXIC', 'SEVERE_TOXIC', 'OBSCENE', 'THREAT', 'INSULT', 'IDENTITY_HATE']


In [26]:
# this should be the correct format expected by the trainer

# print(train_docs[0][1])
first_five = [i for i in train_docs[:5]]

for i in first_five:
    print(f"{i[0][:30]}, {i[1]}")

' Meša Selimović I'm not oppos, {'cats': {'TOXIC': 0, 'SEVERE_TOXIC': 0, 'OBSCENE': 0, 'THREAT': 0, 'INSULT': 0, 'IDENTITY_HATE': 0}}
' September 2008 (UTC) Talking, {'cats': {'TOXIC': 0, 'SEVERE_TOXIC': 0, 'OBSCENE': 0, 'THREAT': 0, 'INSULT': 0, 'IDENTITY_HATE': 0}}
, 1 August 2012 (UTC) Danke, -, {'cats': {'TOXIC': 0, 'SEVERE_TOXIC': 0, 'OBSCENE': 0, 'THREAT': 0, 'INSULT': 0, 'IDENTITY_HATE': 0}}
' The Aurora name One of the p, {'cats': {'TOXIC': 0, 'SEVERE_TOXIC': 0, 'OBSCENE': 0, 'THREAT': 0, 'INSULT': 0, 'IDENTITY_HATE': 0}}
' WP:NOT: W is not a 'how to' , {'cats': {'TOXIC': 0, 'SEVERE_TOXIC': 0, 'OBSCENE': 0, 'THREAT': 0, 'INSULT': 0, 'IDENTITY_HATE': 0}}
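
A quick sanity check on this format can catch label mismatches before training. The sketch below is a hypothetical helper (`check_annotations` is not part of spaCy) that verifies every item is a `(text, {'cats': {...}})` pair whose keys match the expected labels and whose values are 0 or 1.

```python
EXPECTED_LABELS = ("TOXIC", "SEVERE_TOXIC", "OBSCENE",
                   "THREAT", "INSULT", "IDENTITY_HATE")


def check_annotations(docs, labels=EXPECTED_LABELS):
    """Return the indices of items that do not match the expected format."""
    bad = []
    for idx, item in enumerate(docs):
        try:
            text, annotation = item
            cats = annotation["cats"]
        except (TypeError, ValueError, KeyError):
            # not a 2-tuple, or no 'cats' key
            bad.append(idx)
            continue
        ok = (isinstance(text, str)
              and set(cats) == set(labels)
              and all(v in (0, 1) for v in cats.values()))
        if not ok:
            bad.append(idx)
    return bad


# hypothetical examples: the second item is missing most labels
good = ("some comment", {"cats": {label: 0 for label in EXPECTED_LABELS}})
broken = ("another comment", {"cats": {"TOXIC": 1}})

print(check_annotations([good, broken]))
```

An empty list means every item is in the shape the trainer expects.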


https://www.machinelearningplus.com/nlp/custom-text-classification-spacy

# Not providing proper scoring...

In [162]:
def evaluate(tokenizer, textcat, val_texts, val_cats, thresh=0.5):
    '''
    scores textcat predictions against gold annotations, label by label
    returns a dict of {label: {tp, fp, fn, tn, precision, recall, f_score}}
    '''
    docs = (tokenizer(val_text) for val_text in val_texts)

    # create dict of results
    evals_by_cat = dict()

    for i, doc in enumerate(textcat.pipe(docs)):
        gold = val_cats[i]['cats']

        for label, score in doc.cats.items():

            # add label to dict if not already present
            if label not in evals_by_cat:
                evals_by_cat[label] = {'tp': 0, 'fp': 0, 'fn': 0, 'tn': 0}

            predicted = score >= thresh
            actual = gold[label] >= thresh

            if predicted and actual:
                evals_by_cat[label]['tp'] += 1
            elif predicted and not actual:
                evals_by_cat[label]['fp'] += 1
            elif not predicted and not actual:
                evals_by_cat[label]['tn'] += 1
            else:
                evals_by_cat[label]['fn'] += 1

    for counts in evals_by_cat.values():
        tp, fp, fn = counts['tp'], counts['fp'], counts['fn']

        # precision
        # edge case: avoid dividing by zero: precision = 1 when nothing was predicted positive
        counts['precision'] = tp / (tp + fp) if tp + fp else 1

        # recall
        # edge case: avoid dividing by zero: recall = 1 when there are no gold positives
        counts['recall'] = tp / (tp + fn) if tp + fn else 1

        precision = counts['precision']
        recall = counts['recall']

        if precision + recall == 0:
            counts['f_score'] = 0.0
        else:
            counts['f_score'] = 2 * (precision * recall) / (precision + recall)

    # the textcat loss is printed by the training loop rather than stored here,
    # so this function does not depend on a global `losses` dict
    return evals_by_cat

In [139]:
size = 1000
train_data = train_docs[:size]
dev_texts = [i[0] for i in test_docs[:size]]
dev_cats = [i[1] for i in test_docs[:size]]

In [196]:
%%time
from spacy.util import minibatch, compounding

# number of training iterations
n_iter = 2

# Disabling other components
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()

    start_cumulative = timer()
    print(f"Training the model.\nIterations: {n_iter}\nTimer Begins:{start_cumulative:.2f}")

    # Performing training
    for i in range(n_iter):
        start_epoch = timer()
        print(f'epoch {i + 1} start time: {start_epoch:.2f}')
        losses = {}
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts,
                       annotations,
                       sgd=optimizer,
                       drop=0.2,
                       losses=losses)

        # Calling the evaluate() function with the averaged weights and printing the loss
        with textcat.model.use_params(optimizer.averages):
            evals_by_cat = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
            print('textcat losses:', losses['textcat'])

            end_epoch = timer()
            print(f'epoch {i + 1} end: {end_epoch}, epoch elapsed: {end_epoch-start_epoch:.2f}\n')

end_cumulative = timer()
print(f"Training complete. End:{end_cumulative:.2f} Training Elapsed: {end_cumulative - start_cumulative:.2f}\n")

Training the model.
Iterations: 2
Timer Begins:13336.02
epoch 1 start time: 13336.02
textcat losses: 0.0012461869023688843
epoch 1 end: 13337.610593877, epoch elapsed: 1.59

epoch 2 start time: 13337.61
textcat losses: 0.011721035442237682
epoch 2 end: 13339.06375051, epoch elapsed: 1.45

Training complete. End:13339.07 Training Elapsed: 3.04

CPU times: user 5.64 s, sys: 4.36 s, total: 10 s
Wall time: 3.1 s
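
The batch schedule `compounding(4., 32., 1.001)` starts at 4 and multiplies the size by 1.001 after every batch, capped at 32. The library-free sketch below mirrors that behavior as we understand it, to show how slowly the batch size grows on a 1,000-example sample. `compounding_sketch` and `batch_lengths` are re-implementations written for illustration, not spaCy's own code.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield an infinite series that grows by `compound`, capped at `stop` (start < stop)."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound


def batch_lengths(n_items, sizes):
    """Consume items in batches of int(next size) and return each batch length."""
    items = iter(range(n_items))
    lengths = []
    while True:
        batch = list(islice(items, int(next(sizes))))
        if not batch:
            break
        lengths.append(len(batch))
    return lengths


lengths = batch_lengths(1000, compounding_sketch(4., 32., 1.001))
print(f"batches: {len(lengths)}, smallest: {min(lengths)}, largest: {max(lengths)}")
```

In this sketch the largest batch holds only 5 examples, so on a sample this small the schedule stays far from its cap of 32. A larger compound rate or starting size may be worth trying.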


In [171]:
first_df = pd.DataFrame.from_dict(evals_by_cat['INSULT'], orient='index', columns=['INSULT_epoch1'])
second_df = pd.DataFrame.from_dict(evals_by_cat['TOXIC'], orient='index', columns=['TOXIC_epoch1'])


In [174]:
pd.concat([first_df, second_df], axis=1)

| | INSULT_epoch1 | TOXIC_epoch1 |
|---|---|---|
| tp | 7.0 | 8.0 |
| fp | 1.0 | 0.0 |
| fn | 0.0 | 4.0 |
| tn | 92.0 | 88.0 |
| precision | 0.875 | 1.0 |
| recall | 1.0 | 0.666667 |
| f_score | 0.933333 | 0.8 |


In [154]:
evals_df = pd.DataFrame.from_dict(evals_by_cat)


In [153]:
evals_df

| | TOXIC | SEVERE_TOXIC | OBSCENE | THREAT | INSULT | IDENTITY_HATE |
|---|---|---|---|---|---|---|
| tp | 9.0 | 3.0 | 5.0 | 0.0 | 7.0 | 0.0 |
| fp | 4.0 | 0.0 | 2.0 | 3.0 | 5.0 | 1.0 |
| fn | 3.0 | 1.0 | 1.0 | 1.0 | 0.0 | 4.0 |
| tn | 84.0 | 96.0 | 92.0 | 96.0 | 88.0 | 95.0 |
| precision | 0.692308 | 1.0 | 0.714286 | 0.0 | 0.583333 | 0.0 |
| recall | 0.75 | 0.75 | 0.833333 | 0.0 | 1.0 | 0.0 |
| f_score | 0.72 | 0.857143 | 0.769231 | 0.0 | 0.736842 | 0.0 |
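
The derived rows follow directly from the counts. As a worked check using the TOXIC column above (tp=9, fp=4, fn=3), the sketch below recomputes precision, recall and F-score. `prf` is a hypothetical helper that mirrors the arithmetic in `evaluate`, including its convention of returning 1 when a denominator is zero.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from confusion counts (1 when a denominator is 0)."""
    precision = tp / (tp + fp) if (tp + fp) else 1
    recall = tp / (tp + fn) if (tp + fn) else 1
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


precision, recall, f_score = prf(tp=9, fp=4, fn=3)
print(f"precision={precision:.6f} recall={recall:.2f} f_score={f_score:.2f}")
# precision=0.692308 recall=0.75 f_score=0.72
```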


In [30]:
one_text = dev_texts[11]
# one_text

In [31]:
print(one_text)

doc=nlp(one_text)
doc.cats 

# sentiment_sum = sum([i.sentiment for i in doc])
# print(sentiment_sum)
# doc[0].vector

' Ok, let me say it again Come on, now you guys are just being piece of shit jews. I mean you have to admit, the guys in pink floyd play their instruments about as slow as a nigger works. I shouldn't even call what they play music. It's just a bunch of alarm clocks and cashier regirsters! But you know what the most pretentious thing about them is, its their lyrics. All af their songs are just surrealist poetry sung over doom-noise pop, and everyone starts calling them genius's over it. The truth is, their songs have no meaning. Take the album 'The Wall' for instance; sure it tells a story, but what is the moral and the meaning of the story? And dont tell me that the purpose of their songs is to make you think. The only way that music as slow as pink floyd could make you fucking think is if you were just as stoned as they are, which you kikes probly are... And one last time: 1) pink floyd fucking sucks 2) david fuckmor should taste my ass 3) you should to'


{'TOXIC': 0.9091559052467346,
 'SEVERE_TOXIC': 0.0608687661588192,
 'OBSCENE': 0.7635753154754639,
 'THREAT': 0.007229546085000038,
 'INSULT': 0.5386186242103577,
 'IDENTITY_HATE': 0.08038196712732315}
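
To turn these scores into predicted labels, we can apply the same 0.5 threshold that `evaluate` uses by default. A minimal sketch, with the scores copied from the output above (`labels_above` is a hypothetical helper name):

```python
# scores copied from the doc.cats output above
cats = {
    "TOXIC": 0.9091559052467346,
    "SEVERE_TOXIC": 0.0608687661588192,
    "OBSCENE": 0.7635753154754639,
    "THREAT": 0.007229546085000038,
    "INSULT": 0.5386186242103577,
    "IDENTITY_HATE": 0.08038196712732315,
}


def labels_above(cats, thresh=0.5):
    """Return every label whose score meets the threshold (labels are independent)."""
    return [label for label, score in cats.items() if score >= thresh]


predicted = labels_above(cats)
print(predicted)
# ['TOXIC', 'OBSCENE', 'INSULT']
```

Since the categories are not mutually exclusive, more than one label can pass the threshold for the same text.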

In [194]:
def doc_check(tok):
    '''
    helper function for getting only desired lemmas from spaCy doc
    argument: a spaCy token

    checks for rejection conditions
        not alpha
        pronoun
        stopword

    returns True if no rejection conditions are met
    '''
    # reject if not alpha
    if not tok.is_alpha:
        return False

    # reject if pronoun (spaCy v2 lemmatizes pronouns to "-PRON-")
    if tok.lemma_ == "-PRON-":
        return False

    # reject if stopword
    if tok.is_stop:
        return False

    # if not rejected, return True
    return True

lemmas_lc = [i.lemma_.lower() for i in doc if doc_check(i)]
lemmas_lc

# sentiment_sum = sum([i.sentiment for i in doc])
# print(sentiment_sum)

['meša',
 'selimović',
 'oppose',
 'formulation',
 'instead',
 'try',
 'prevent',
 'potential',
 'editwar',
 'inclusion',
 'view',
 'bosnian',
 'serbian',
 'writer',
 'view',
 'yugoslavian',
 'writer',
 'option',
 'mention',
 'yugoslavian',
 'bosnian',
 'serbian',
 'writer',
 'read',
 'cumbersome',
 'maybe',
 'worth',
 'try',
 'good',
 'talk']
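
`Counter` (imported at the top of the notebook) gives a quick frequency view of a lemma list like this one. A minimal sketch, using the lemmas printed above:

```python
from collections import Counter

# lemmas copied from the output above
lemmas = ['meša', 'selimović', 'oppose', 'formulation', 'instead', 'try',
          'prevent', 'potential', 'editwar', 'inclusion', 'view', 'bosnian',
          'serbian', 'writer', 'view', 'yugoslavian', 'writer', 'option',
          'mention', 'yugoslavian', 'bosnian', 'serbian', 'writer', 'read',
          'cumbersome', 'maybe', 'worth', 'try', 'good', 'talk']

lemma_counts = Counter(lemmas)
print(lemma_counts.most_common(1))
# [('writer', 3)]
```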

Resources:
- [expanding contractions](https://gist.github.com/widiger-anna/deefac010da426911381c118a97fc23f)
- [contractions](https://theslaps.medium.com/cant-stand-don-t-want-contractions-with-spacy-39715cac2ebb)
- [text wrangling](https://www.kdnuggets.com/2018/08/practitioners-guide-processing-understanding-text-2.html)
- [nlp nltk vs spacy](https://www.activestate.com/blog/natural-language-processing-nltk-vs-spacy/)
- [pytorch](https://pytorch.org/)
- [text classification in python with spacy (try this one!)](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/)
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

In [195]:
! ls ../models
nlp.to_disk("../models/spacy_multi_cat_model/")
! ls ../models

base_config.cfg       spacy_multi_cat_model
base_config.cfg       spacy_multi_cat_model


In [190]:
! ls

config.cfg.txt                        part_2-2_nltk.ipynb
part_1-1_preparation_toxic_text.ipynb part_2-2_spacy_multi_cat_test.ipynb
part_2-1_spacy_toxic_text.ipynb       part_2-3_spacy_multi_cat_test.ipynb


In [192]:
! ls ../models

base_config.cfg ner             tagger          tokenizer
meta.json       parser          textcat         vocab
