# Detecting and Classifying Toxic Comments
# Part 2-1: spaCy Multi-Category Test

We'll train a spaCy text classifier that scores each comment against multiple categories that are not mutually exclusive.

# Setup

## Python Library Imports



In [176]:
import pandas as pd
import numpy as np

from collections import Counter
import re
import random

# scikit learn imports
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from timeit import default_timer as timer

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## spaCy Setup & Imports

As mentioned previously, we'll be using spaCy version 2.3.5.

In [2]:
# check version
! python -m spacy info

spaCy version    2.3.5                         
Location         /opt/anaconda3/lib/python3.7/site-packages/spacy
Platform         Darwin-20.3.0-x86_64-i386-64bit
Python version   3.7.6                         
Models                                         



In [3]:
# spaCy Imports
import spacy

from spacy.lang.en import English
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

from spacy.util import minibatch, compounding
from spacy import displacy
from spacy.tokens import Doc

from spacy.scorer import Scorer

In [236]:
import sys

# add src folder to path
sys.path.insert(1, '../src')

# from text_prep import tidy_series, uppercase_proportion_column
from spacy_helper import doc_check, add_labels_helper, txt_and_multi_cat

## Load data from toxic_basic Pickle File

In [4]:
%%time
'''
last load time:

CPU times: user 67 ms, sys: 46.7 ms, total: 114 ms
Wall time: 114 ms
'''

# load toxic_basic pickle into dataframe
path_toxic_basic = "../data/toxic_basic.pkl"

toxic_df = pd.read_pickle(path_toxic_basic)

CPU times: user 72.5 ms, sys: 55.7 ms, total: 128 ms
Wall time: 136 ms


In [5]:
toxic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   comment_text          159571 non-null  object 
 1   uppercase_proportion  159548 non-null  float64
 2   toxic                 159571 non-null  int64  
 3   severe_toxic          159571 non-null  int64  
 4   obscene               159571 non-null  int64  
 5   threat                159571 non-null  int64  
 6   insult                159571 non-null  int64  
 7   identity_hate         159571 non-null  int64  
dtypes: float64(1), int64(6), object(1)
memory usage: 9.7+ MB


# Stratified Train Test Split

Since the pipeline should first decide whether a text is toxic at all, we'll make a simple stratified train/test split on the `toxic` column, so that toxic and non-toxic rows appear in the same proportion in both sets.

For now we won't try to balance the sub-categories (`severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`); only the primary `toxic` label is stratified.
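To make the idea of stratification concrete, here is a minimal plain-Python sketch. It is only an illustration of what `stratify=` asks for, not scikit-learn's implementation; the function name `stratified_split` and the toy labels are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.33, seed=42):
    """Return (train_idx, test_idx) so each label keeps roughly the same share in both parts."""
    # group row indices by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        # shuffle within each label group, then cut the same fraction from every group
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# toy labels (made up): 10% "toxic" (1), 90% "not toxic" (0)
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)

def toxic_rate(indices):
    """Share of rows labelled 1 among the given indices."""
    return sum(labels[i] for i in indices) / len(indices)

print(f"train: {toxic_rate(train_idx):.3f}, test: {toxic_rate(test_idx):.3f}")
# prints: train: 0.100, test: 0.100
```

Both parts keep the 10% toxic share of the toy data, which is the property we want from the real split.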

## Stratified Split maintaining ratio of toxic to not toxic texts


In [8]:
# check current columns
toxic_df.columns

Index(['comment_text', 'uppercase_proportion', 'toxic', 'severe_toxic',
       'obscene', 'threat', 'insult', 'identity_hate'],
      dtype='object')

In [9]:
# split df into X (independent) and y (dependent) groups
ind_cols = ['comment_text', 'uppercase_proportion']

X = toxic_df[ind_cols]
y = toxic_df.drop(columns=ind_cols)

print(f"X columns: {X.columns}\ny columns:{y.columns}")

X columns: Index(['comment_text', 'uppercase_proportion'], dtype='object')
y columns:Index(['toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
       'identity_hate'],
      dtype='object')


In [10]:
# Train Test Split. Stratified on y['toxic']
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42, 
                                                    stratify=y['toxic'])

## Preserve Test Train Split Files

In [226]:
# ! ls ../data/basic_df_split/

X_train.to_pickle('../data/basic_df_split/basic_X_train.pkl')
X_test.to_pickle('../data/basic_df_split/basic_X_test.pkl')
y_train.to_pickle('../data/basic_df_split/basic_y_train.pkl')
y_test.to_pickle('../data/basic_df_split/basic_y_test.pkl')

# spaCy

Let's try out spaCy, an NLP library!

- https://course.spacy.io/en/chapter1
- [text classification with spaCy](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/) 
- [customized list of stopwords](https://spacy.io/usage/linguistic-features#stop-words)  
- [Split Series into list of sentences](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.cat.html)  
- [contractions](https://theslaps.medium.com/cant-stand-don-t-want-contractions-with-spacy-39715cac2ebb)  


- [v2.spacy.io](https://v2.spacy.io/)

# Train spaCy Model for Classification

## Establish spaCy Pipeline

"spaCy's components are supervised models for text annotations, meaning that they can only learn to reproduce examples, not guess new labels from raw text."

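By default, spaCy v2's text categorizer uses the `ensemble` architecture, which stacks a bag-of-words model with a convolutional neural network. With `exclusive_classes` left at its default, each label gets an independent score, which is what a multi-label setup like ours needs.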

Resources:
- [for emojis](https://spacy.io/universe/project/spacymoji)  
- [spaCy docs: scorer](https://spacy.io/api/scorer)  
- [F-Score](https://en.wikipedia.org/wiki/F-score)  

The code below is modified from this tutorial:
https://www.machinelearningplus.com/nlp/custom-text-classification-spacy/

In [13]:
## if the model is not yet locally available
# ! python -m spacy download en_core_web_lg

import en_core_web_lg
nlp = en_core_web_lg.load()

# Provide scoring pipeline
scorer = Scorer(nlp)

In [14]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [16]:
textcat = nlp.create_pipe('textcat')

nlp.add_pipe(textcat)

In [19]:
nlp.pipe_names

['tagger', 'parser', 'ner', 'textcat']

In [18]:
# use custom function to add labels to textcat
add_labels_helper(y_train)  

toxic
severe_toxic
obscene
threat
insult
identity_hate


In [20]:
textcat.labels

('TOXIC', 'SEVERE_TOXIC', 'OBSCENE', 'THREAT', 'INSULT', 'IDENTITY_HATE')
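The source of `add_labels_helper` lives in `../src/spacy_helper.py` and is not shown here. Judging only from the printed column names and the resulting upper-cased labels, a helper along these lines would produce the same result. This is a hedged sketch: `add_labels_sketch` and the `FakeTextCat` stand-in are hypothetical names, and the stand-in exists only so the snippet runs without spaCy installed. With a real spaCy v2 `textcat` component, `add_label` is the method that registers a label.

```python
def add_labels_sketch(textcat, column_names):
    """Register one upper-cased label per column name on a textcat-like object."""
    for name in column_names:
        textcat.add_label(name.upper())

class FakeTextCat:
    """Hypothetical stand-in exposing only `add_label` and `labels`."""
    def __init__(self):
        self._labels = []

    def add_label(self, label):
        # ignore labels that were already registered
        if label not in self._labels:
            self._labels.append(label)

    @property
    def labels(self):
        return tuple(self._labels)

columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
fake = FakeTextCat()
add_labels_sketch(fake, columns)
print(fake.labels)
# prints: ('TOXIC', 'SEVERE_TOXIC', 'OBSCENE', 'THREAT', 'INSULT', 'IDENTITY_HATE')
```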

[zipping from dict of unknown size](https://stackoverflow.com/a/40658867)

In [22]:
tot = 20
tiny_X = X_train['comment_text'][:tot]
tiny_y = y_train[:tot]

txt_and_multi_cat(tiny_X, tiny_y)
# print(tiny_X, tiny_y)

['TOXIC', 'SEVERE_TOXIC', 'OBSCENE', 'THREAT', 'INSULT', 'IDENTITY_HATE']


[("' Meša Selimović I'm not opposing such a formulation. Instead I'm trying to prevent potential editwars by inclusion of both views (Bosnian and Serbian writer). However, it seems that there is a third view (Yugoslavian writer) and so an option is to mention all of them (Yugoslavian, Bosnian and Serbian writer). Although this may read cumbersome, maybe it's worth trying. All the best. 's talk '",
  {'cats': {'TOXIC': 0,
    'SEVERE_TOXIC': 0,
    'OBSCENE': 0,
    'THREAT': 0,
    'INSULT': 0,
    'IDENTITY_HATE': 0}}),
 ("' September 2008 (UTC) Talking about how he was victimized because of the release of his name is an implication that releasing his name was bad. (talk | contribs) 03:17, 9'",
  {'cats': {'TOXIC': 0,
    'SEVERE_TOXIC': 0,
    'OBSCENE': 0,
    'THREAT': 0,
    'INSULT': 0,
    'IDENTITY_HATE': 0}}),
 (', 1 August 2012 (UTC) Danke, - supports what we found, no need to ask further, - the Main page appearance on 26 July opened new heights, 08:05',
  {'cats': {'TOXIC': 

[Article for v3](https://medium.com/analytics-vidhya/building-a-text-classifier-with-spacy-3-0-dd16e9979a)

In [23]:
# ! ls ../models/base_config.cfg
# ! python -m spacy init fill-config ../models/base_config.cfg config.cfg --diff

In [24]:
# ! python -m spacy validate


In [25]:
# formatting list of tuples for spacy training
train_txt = X_train['comment_text']
train_cat = y_train[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

train_docs = txt_and_multi_cat(train_txt, train_cat)

test_txt = X_test['comment_text']
test_cat = y_test[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

test_docs = txt_and_multi_cat(test_txt, test_cat)

['TOXIC', 'SEVERE_TOXIC', 'OBSCENE', 'THREAT', 'INSULT', 'IDENTITY_HATE']
['TOXIC', 'SEVERE_TOXIC', 'OBSCENE', 'THREAT', 'INSULT', 'IDENTITY_HATE']


In [26]:
# this should be the correct format expected by the trainer

# preview the first five (text, annotations) tuples
for text, annotations in train_docs[:5]:
    print(f"{text[:30]}, {annotations}")

' Meša Selimović I'm not oppos, {'cats': {'TOXIC': 0, 'SEVERE_TOXIC': 0, 'OBSCENE': 0, 'THREAT': 0, 'INSULT': 0, 'IDENTITY_HATE': 0}}
' September 2008 (UTC) Talking, {'cats': {'TOXIC': 0, 'SEVERE_TOXIC': 0, 'OBSCENE': 0, 'THREAT': 0, 'INSULT': 0, 'IDENTITY_HATE': 0}}
, 1 August 2012 (UTC) Danke, -, {'cats': {'TOXIC': 0, 'SEVERE_TOXIC': 0, 'OBSCENE': 0, 'THREAT': 0, 'INSULT': 0, 'IDENTITY_HATE': 0}}
' The Aurora name One of the p, {'cats': {'TOXIC': 0, 'SEVERE_TOXIC': 0, 'OBSCENE': 0, 'THREAT': 0, 'INSULT': 0, 'IDENTITY_HATE': 0}}
' WP:NOT: W is not a 'how to' , {'cats': {'TOXIC': 0, 'SEVERE_TOXIC': 0, 'OBSCENE': 0, 'THREAT': 0, 'INSULT': 0, 'IDENTITY_HATE': 0}}
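The `txt_and_multi_cat` helper is also defined in `../src/spacy_helper.py`, so its code is not visible in this notebook. The output above shows the shape it returns: a list of `(text, {'cats': {LABEL: value}})` tuples. A minimal sketch that builds the same shape from plain Python data could look like this; `to_textcat_examples` and the toy rows are assumptions for illustration, not the author's implementation.

```python
def to_textcat_examples(texts, label_rows):
    """Pair each text with {'cats': {LABEL: value}}, upper-casing the label names."""
    examples = []
    for text, row in zip(texts, label_rows):
        cats = {name.upper(): value for name, value in row.items()}
        examples.append((text, {'cats': cats}))
    return examples

# toy inputs (made up): two comments, each with two label columns
texts = ["first comment", "second comment"]
label_rows = [
    {'toxic': 0, 'insult': 0},
    {'toxic': 1, 'insult': 1},
]

examples = to_textcat_examples(texts, label_rows)
print(examples[1])
# prints: ('second comment', {'cats': {'TOXIC': 1, 'INSULT': 1}})
```

With a pandas DataFrame of labels, the per-row dictionaries could be obtained with `DataFrame.to_dict('records')`, assuming the rows line up with the texts.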



# Evaluate Function

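The `evaluate` function scores the model on held-out data after each epoch. For every label it counts true and false positives and negatives at a given threshold, then derives precision, recall, and F-score, so we can track the model's effectiveness per category.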

In [162]:
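def evaluate(tokenizer, textcat, val_texts, val_cats, thresh=0.5):
    """Score `textcat` on validation data; return per-label counts and metrics.

    A predicted score or gold value counts as positive when it is >= `thresh`.
    """
    # tokenize lazily; textcat.pipe() then fills in doc.cats for each doc
    docs = (tokenizer(val_text) for val_text in val_texts)  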
    
    # create dict of results
    evals_by_cat = dict()
    
    for i, doc in enumerate(textcat.pipe(docs)):
        
        # actual scores 
        gold = val_cats[i]['cats']
        
        for label, score in doc.cats.items():
            
            # add label to dict if not already present
            if label not in evals_by_cat:
                evals_by_cat[label] = {'tp':0,
                                       'fp':0,
                                       'fn':0,
                                       'tn':0,}
            
            if score >= thresh and gold[label] >= thresh:
                evals_by_cat[label]['tp'] += 1

            elif score >= thresh and gold[label] < thresh:
                evals_by_cat[label]['fp'] += 1

            elif score < thresh and gold[label] < thresh:
                evals_by_cat[label]['tn'] += 1
            
            elif score < thresh and gold[label] >= thresh:
                evals_by_cat[label]['fn'] += 1
    
    for key in evals_by_cat.keys():

        # local scope variables to ease reading & debugging
        tp = evals_by_cat[key]['tp']
        fp = evals_by_cat[key]['fp']
        fn = evals_by_cat[key]['fn']
        tn = evals_by_cat[key]['tn']
        
        # precision
        # edge case: no positive predictions (tp + fp == 0); report 1 to avoid dividing by zero
        if tp + fp == 0:
            evals_by_cat[key]['precision'] = 1
        else:    
            evals_by_cat[key]['precision'] = tp / (tp + fp)
        
        # recall
        # edge case: no positive gold labels (tp + fn == 0); report 1 to avoid dividing by zero
        if tp + fn == 0:
            evals_by_cat[key]['recall'] = 1
        else:    
            evals_by_cat[key]['recall'] = tp / (tp + fn)
            
        # local variables to ease reading & debugging
        precision = evals_by_cat[key]['precision']
        recall = evals_by_cat[key]['recall']
        
        # f score
        if precision  + recall == 0:
            evals_by_cat[key]['f_score'] = 0.0
        else:
            evals_by_cat[key]['f_score'] = 2 * (precision * recall) / (precision + recall)
    
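    # preserve spaCy's native losses metric
    # note: `losses` is the notebook-level dict filled by nlp.update() in the training loop,
    # so it must exist before evaluate() is called
    evals_by_cat['TEXTCAT_LOSSES'] = losses['textcat']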

    return evals_by_cat
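As a quick sanity check of the metric arithmetic, the sketch below recomputes precision, recall, and F-score from raw counts using the same edge-case conventions as `evaluate`. The function name `precision_recall_f` is an assumption for this example; the counts are the TOXIC and THREAT entries from the evaluation log shown later in this notebook.

```python
def precision_recall_f(tp, fp, fn):
    """Same conventions as evaluate(): an empty denominator yields 1; F is 0 when both are 0."""
    precision = 1 if tp + fp == 0 else tp / (tp + fp)
    recall = 1 if tp + fn == 0 else tp / (tp + fn)
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score

# counts copied from the TOXIC and THREAT entries of the logged evaluation output
toxic = precision_recall_f(tp=3854, fp=663, fn=1193)
threat = precision_recall_f(tp=0, fp=0, fn=150)

print([round(value, 4) for value in toxic])
# prints: [0.8532, 0.7636, 0.8059]
print(threat)
# prints: (1, 0.0, 0.0)
```

The THREAT row is worth a second look: a precision of 1 next to a recall of 0 only means the model never predicted that label, so the "perfect" precision comes from the edge-case convention rather than from correct predictions.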

In [200]:
# variable used in testing
# size = 1000
train_data = train_docs
dev_texts = [i[0] for i in test_docs]
dev_cats = [i[1] for i in test_docs]

# Do 5 Training Iterations 

We'll begin with 5 training iterations to help us estimate how long training will be and get a few baseline predictions.

Each iteration takes roughly 12 minutes, so 5 will take about an hour.


In [205]:
%%time
from spacy.util import minibatch, compounding

log_list = list()

# number of training iterations
n_iter = 5

# Disabling other components
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes):  
    optimizer = nlp.begin_training()

    start_cumulative = timer()
    print(f"Training the model.\nIterations: {n_iter}\nTimer Begins:{start_cumulative:.2f}")
#     print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))

    # Performing training
    for i in range(n_iter):
        start_epoch = timer()
        print(f'epoch {i + 1} start time: {start_epoch:.2f}')
        losses = {}
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        
        for batch in batches:
#             print(f"new batch at :{timer():.2f}")
            texts, annotations = zip(*batch)
            nlp.update(texts, 
                       annotations, 
                       sgd=optimizer, 
                       drop=0.2,
                       losses=losses)

        # evaluate on the dev set using the optimizer's averaged parameters
        with textcat.model.use_params(optimizer.averages):
            evals_by_cat = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
            print(f"textcat losses: {losses['textcat']}")
            log_list.append(evals_by_cat)

            end_epoch = timer()
            print(f'epoch {i + 1} end: {end_epoch}, epoch elapsed: {end_epoch-start_epoch:.2f}\n')

end_cumulative = timer()
print(f"Training complete. End:{end_cumulative:.2f} Training Elapsed: {end_cumulative - start_cumulative:.2f}\n")

Training the model.
Iterations: 5
Timer Begins:14924.26
epoch 1 start time: 14924.26
textcat losses: 3.300574591791758
epoch 1 end: 15645.511968644, epoch elapsed: 721.25

epoch 2 start time: 15645.52
textcat losses: 3.2344222257217226
epoch 2 end: 16362.236106837, epoch elapsed: 716.72

epoch 3 start time: 16362.24
textcat losses: 2.8232882740304106
epoch 3 end: 17072.387902298, epoch elapsed: 710.15

epoch 4 start time: 17072.39
textcat losses: 2.6497837365776666
epoch 4 end: 17787.474139616, epoch elapsed: 715.08

epoch 5 start time: 17787.48
textcat losses: 2.3880158715215387
epoch 5 end: 18499.152896641, epoch elapsed: 711.68

Training complete. End:18499.16 Training Elapsed: 3574.90

CPU times: user 1h 39min 33s, sys: 1h 3min 52s, total: 2h 43min 26s
Wall time: 59min 34s
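The loop above draws its batches with `minibatch(train_data, size=compounding(4., 32., 1.001))`, which means the batch size starts at 4 and grows by 0.1% per batch until it is capped at 32. The sketch below re-creates that schedule in plain Python so the growth can be inspected without spaCy. The names `compounding_sketch` and `minibatch_sketch` are made up for this illustration; the real functions live in `spacy.util`.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop (assumes start < stop)."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound

def minibatch_sketch(items, sizes):
    """Cut `items` into consecutive batches whose lengths follow the `sizes` generator."""
    items = iter(items)
    for size in sizes:
        batch = list(islice(items, int(size)))
        if not batch:
            return
        yield batch

# 100,000 dummy items stand in for the training tuples
batches = list(minibatch_sketch(range(100_000), compounding_sketch(4.0, 32.0, 1.001)))
batch_lengths = [len(batch) for batch in batches]

print(batch_lengths[0], max(batch_lengths))
# prints: 4 32
first_full = batch_lengths.index(32)
print(f"first batch of 32 appears at batch index {first_full}")
```

With a growth factor this close to 1, it takes a little over 2,000 batches before the size reaches 32, so the early part of every epoch is spent on many small updates.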


In [212]:
# len(log_list)

# import pickle

# with open('../logs/log_list_2.pkl', 'wb') as f:
#     pickle.dump(log_list, f)

# log_list

In [213]:
import pickle

# reload a previously saved evaluation log so that `test_list` is defined
with open('../logs/log_list_1.pkl', 'rb') as f:
    test_list = pickle.load(f)

In [214]:
test_list

[{'TOXIC': {'tp': 3854,
   'fp': 663,
   'fn': 1193,
   'tn': 46949,
   'precision': 0.8532211644897055,
   'recall': 0.7636219536358233,
   'f_score': 0.8059389376829779},
  'SEVERE_TOXIC': {'tp': 95,
   'fp': 77,
   'fn': 422,
   'tn': 52065,
   'precision': 0.5523255813953488,
   'recall': 0.18375241779497098,
   'f_score': 0.2757619738751814},
  'OBSCENE': {'tp': 2211,
   'fp': 409,
   'fn': 551,
   'tn': 49488,
   'precision': 0.8438931297709924,
   'recall': 0.8005068790731354,
   'f_score': 0.8216276477146044},
  'THREAT': {'tp': 0,
   'fp': 0,
   'fn': 150,
   'tn': 52509,
   'precision': 1,
   'recall': 0.0,
   'f_score': 0.0},
  'INSULT': {'tp': 1884,
   'fp': 691,
   'fn': 690,
   'tn': 49394,
   'precision': 0.7316504854368931,
   'recall': 0.7319347319347319,
   'f_score': 0.7317925810837055},
  'IDENTITY_HATE': {'tp': 2,
   'fp': 2,
   'fn': 475,
   'tn': 52180,
   'precision': 0.5,
   'recall': 0.0041928721174004195,
   'f_score': 0.008316008316008316},
  'TEXTCAT_LOSSES

In [230]:
first_df = pd.DataFrame.from_dict(evals_by_cat['INSULT'], orient='index', columns=['INSULT_epoch1'])
second_df = pd.DataFrame.from_dict(evals_by_cat['TOXIC'], orient='index', columns=['TOXIC_epoch1'])

pd.concat([first_df, second_df], axis=1)

| | INSULT_epoch1 | TOXIC_epoch1 |
|---|---|---|
| tp | 1870.0 | 3953.0 |
| fp | 814.0 | 957.0 |
| fn | 704.0 | 1094.0 |
| tn | 49271.0 | 46655.0 |
| precision | 0.696721 | 0.805092 |
| recall | 0.726496 | 0.783238 |
| f_score | 0.711297 | 0.794014 |


In [229]:
evals_df = pd.DataFrame.from_dict(evals_by_cat)
evals_df

| | TOXIC | SEVERE_TOXIC | OBSCENE | THREAT | INSULT | IDENTITY_HATE | TEXTCAT_LOSSES |
|---|---|---|---|---|---|---|---|
| tp | 3953.0 | 171.0 | 2272.0 | 18.0 | 1870.0 | 183.0 | 1.392019 |
| fp | 957.0 | 165.0 | 578.0 | 16.0 | 814.0 | 118.0 | 1.392019 |
| fn | 1094.0 | 346.0 | 490.0 | 132.0 | 704.0 | 294.0 | 1.392019 |
| tn | 46655.0 | 51977.0 | 49319.0 | 52493.0 | 49271.0 | 52064.0 | 1.392019 |
| precision | 0.805092 | 0.508929 | 0.797193 | 0.529412 | 0.696721 | 0.607973 | 1.392019 |
| recall | 0.783238 | 0.330754 | 0.822592 | 0.12 | 0.726496 | 0.383648 | 1.392019 |
| f_score | 0.794014 | 0.400938 | 0.809694 | 0.195652 | 0.711297 | 0.470437 | 1.392019 |
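In the table above, `TEXTCAT_LOSSES` repeats the same value in every row because it is stored as a single number next to the per-label dictionaries, and pandas broadcasts it down the column. When comparing epochs it can be handier to pull out one metric per label and skip the scalar entry. Here is a small sketch of that idea; the helper name `f_scores_by_label` and the toy log values are assumptions, not results from this notebook.

```python
def f_scores_by_label(log_list):
    """Return {label: [f_score per epoch]}, skipping non-dict entries such as TEXTCAT_LOSSES."""
    history = {}
    for epoch_scores in log_list:
        for label, metrics in epoch_scores.items():
            if isinstance(metrics, dict):
                history.setdefault(label, []).append(metrics['f_score'])
    return history

# toy log with made-up numbers, shaped like the entries appended to log_list
toy_log = [
    {'TOXIC': {'f_score': 0.70}, 'THREAT': {'f_score': 0.00}, 'TEXTCAT_LOSSES': 3.0},
    {'TOXIC': {'f_score': 0.75}, 'THREAT': {'f_score': 0.10}, 'TEXTCAT_LOSSES': 2.5},
]

history = f_scores_by_label(toy_log)
print(history)
# prints: {'TOXIC': [0.7, 0.75], 'THREAT': [0.0, 0.1]}
```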


In [30]:
one_text = dev_texts[11]
# one_text

In [31]:
print(one_text)

doc=nlp(one_text)
doc.cats 

# sentiment_sum = sum([i.sentiment for i in doc])
# print(sentiment_sum)
# doc[0].vector

' Ok, let me say it again Come on, now you guys are just being piece of shit jews. I mean you have to admit, the guys in pink floyd play their instruments about as slow as a nigger works. I shouldn't even call what they play music. It's just a bunch of alarm clocks and cashier regirsters! But you know what the most pretentious thing about them is, its their lyrics. All af their songs are just surrealist poetry sung over doom-noise pop, and everyone starts calling them genius's over it. The truth is, their songs have no meaning. Take the album 'The Wall' for instance; sure it tells a story, but what is the moral and the meaning of the story? And dont tell me that the purpose of their songs is to make you think. The only way that music as slow as pink floyd could make you fucking think is if you were just as stoned as they are, which you kikes probly are... And one last time: 1) pink floyd fucking sucks 2) david fuckmor should taste my ass 3) you should to'


{'TOXIC': 0.9091559052467346,
 'SEVERE_TOXIC': 0.0608687661588192,
 'OBSCENE': 0.7635753154754639,
 'THREAT': 0.007229546085000038,
 'INSULT': 0.5386186242103577,
 'IDENTITY_HATE': 0.08038196712732315}

In [206]:
! ls ../models
nlp.to_disk("../models/spacy_multi_cat_model/")

base_config.cfg       spacy_multi_cat_model
base_config.cfg       spacy_multi_cat_model


# Do 20 Additional Training Iterations

In [216]:
%%time
log_list = list()

# number of training iterations
n_iter = 20

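# Disabling other components
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes):  
    # begin_training() creates a fresh optimizer; the losses logged below pick up near where
    # the first 5 epochs ended, so the textcat weights trained so far appear to carry over
    optimizer = nlp.begin_training()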

    start_cumulative = timer()
    print(f"Training the model.\nIterations: {n_iter}\nTimer Begins:{start_cumulative:.2f}")
#     print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))

    # Performing training
    for i in range(n_iter):
        start_epoch = timer()
        print(f'epoch {i + 1} start time: {start_epoch:.2f}')
        losses = {}
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        
        for batch in batches:
#             print(f"new batch at :{timer():.2f}")
            texts, annotations = zip(*batch)
            nlp.update(texts, 
                       annotations, 
                       sgd=optimizer, 
                       drop=0.2,
                       losses=losses)

        # evaluate on the dev set using the optimizer's averaged parameters
        with textcat.model.use_params(optimizer.averages):
            evals_by_cat = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
            print(f"textcat losses: {losses['textcat']}")
            log_list.append(evals_by_cat)

            end_epoch = timer()
            print(f'epoch {i + 1} end: {end_epoch}, epoch elapsed: {end_epoch-start_epoch:.2f}\n')

end_cumulative = timer()
print(f"Training complete. End:{end_cumulative:.2f} Training Elapsed: {end_cumulative - start_cumulative:.2f}\n")

Training the model.
Iterations: 20
Timer Begins:20204.43
epoch 1 start time: 20204.43
textcat losses: 2.420857442188398
epoch 1 end: 20904.773435177, epoch elapsed: 700.34

epoch 2 start time: 20904.78
textcat losses: 2.3911622935770898
epoch 2 end: 21525.821319783, epoch elapsed: 621.05

epoch 3 start time: 21525.82
textcat losses: 2.1606388487485946
epoch 3 end: 22142.51697684, epoch elapsed: 616.69

epoch 4 start time: 22142.52
textcat losses: 2.1463427109298654
epoch 4 end: 22758.429444661, epoch elapsed: 615.91

epoch 5 start time: 22758.43
textcat losses: 1.9554609553060178
epoch 5 end: 23372.526017913, epoch elapsed: 614.09

epoch 6 start time: 23372.53
textcat losses: 1.7881683352414486
epoch 6 end: 23989.573797575, epoch elapsed: 617.05

epoch 7 start time: 23989.58
textcat losses: 1.7377497433548796
epoch 7 end: 24608.659860403, epoch elapsed: 619.08

epoch 8 start time: 24608.66
textcat losses: 1.8173315413520463
epoch 8 end: 25227.00988865, epoch elapsed: 618.35

epoch 9 st

## Preserve log list

Each epoch's evaluation is stored as a dictionary in `log_list`. We'll pickle the list so it is preserved after the session ends.

In [217]:
import pickle

with open('../logs/log_list_2.pkl', 'wb') as f:
    pickle.dump(log_list, f)
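To confirm that a pickled log comes back unchanged, a round trip like the one below can be used. It is a minimal sketch that writes to a temporary directory instead of `../logs`, and `log_list_example` is a toy stand-in for the real `log_list`.

```python
import os
import pickle
import tempfile

# toy stand-in (made up) for one epoch's entry in log_list
log_list_example = [{'TOXIC': {'tp': 1, 'fp': 0, 'fn': 0, 'tn': 1}}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'log_list_example.pkl')

    # write the list to disk
    with open(path, 'wb') as f:
        pickle.dump(log_list_example, f)

    # read it back
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == log_list_example)
# prints: True
```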

## Preserve Model

In [218]:
! ls ../models
nlp.to_disk("../models/spacy_multi_cat_model/")
! ls ../models

base_config.cfg       spacy_multi_cat_model
base_config.cfg       spacy_multi_cat_model


In [228]:
# test after 25 epochs
print(one_text[:100])

doc=nlp(one_text)
doc.cats 

' Ok, let me say it again Come on, now you guys are just being piece of shit jews. I mean you have t


{'TOXIC': 0.9849706292152405,
 'SEVERE_TOXIC': 0.007135968655347824,
 'OBSCENE': 0.9465410113334656,
 'THREAT': 0.007597020361572504,
 'INSULT': 0.9003435373306274,
 'IDENTITY_HATE': 0.7753530144691467}
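Because the categories are not mutually exclusive, the scores in `doc.cats` do not sum to 1, and a prediction is simply every label whose score clears a threshold. The sketch below applies the same 0.5 threshold used in `evaluate` to the two sets of scores printed for this comment (the earlier run and the run after 25 epochs), rounded to four decimals. The helper name `labels_above` is an assumption for this example.

```python
def labels_above(cats, thresh=0.5):
    """Return the labels whose score is at or above the threshold."""
    return [label for label, score in cats.items() if score >= thresh]

# scores copied from the two doc.cats outputs in this notebook, rounded to 4 decimals
earlier_scores = {
    'TOXIC': 0.9092, 'SEVERE_TOXIC': 0.0609, 'OBSCENE': 0.7636,
    'THREAT': 0.0072, 'INSULT': 0.5386, 'IDENTITY_HATE': 0.0804,
}
after_25_epochs = {
    'TOXIC': 0.9850, 'SEVERE_TOXIC': 0.0071, 'OBSCENE': 0.9465,
    'THREAT': 0.0076, 'INSULT': 0.9003, 'IDENTITY_HATE': 0.7754,
}

print(labels_above(earlier_scores))
# prints: ['TOXIC', 'OBSCENE', 'INSULT']
print(labels_above(after_25_epochs))
# prints: ['TOXIC', 'OBSCENE', 'INSULT', 'IDENTITY_HATE']
```

For this comment, the additional training moves `IDENTITY_HATE` above the threshold, while `SEVERE_TOXIC` and `THREAT` stay far below it.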

# Notes to self

- https://v2.spacy.io/usage/processing-pipelines#pipelines
- [expanding contractions](https://gist.github.com/widiger-anna/deefac010da426911381c118a97fc23f)
- [contractions](https://theslaps.medium.com/cant-stand-don-t-want-contractions-with-spacy-39715cac2ebb)
- [text wrangling](https://www.kdnuggets.com/2018/08/practitioners-guide-processing-understanding-text-2.html)
- [nlp nltk vs spacy](https://www.activestate.com/blog/natural-language-processing-nltk-vs-spacy/)
- [pytorch](https://pytorch.org/)
- [text classification in python with spacy (try this one!)](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/)
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a