# Detecting and Classifying Toxic Comments
# Part 3-1: TF*IDF & Random Forest Classifiers

It may be possible to chain binary models sequentially to get better results on the rarer classes.

If we first classify comments as Toxic or Not Toxic, we could then pass only the Toxic results to models that have been trained solely to recognise sub-classes of toxic comments.
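
The two-stage idea can be sketched roughly as follows. This is only an illustration on synthetic data: `X_demo`, `rare_subclass`, and `predict_cascade` are hypothetical names, and `LogisticRegression` stands in for whichever binary model is eventually chosen.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in features (assumption: not the real comment features).
X_demo = rng.normal(size=(200, 5))
# Stage-1 label: toxic or not toxic.
toxic = (X_demo[:, 0] > 0).astype(int)
# Stage-2 label: a hypothetical rarer sub-class that only occurs among toxic rows.
rare_subclass = ((X_demo[:, 1] > 0.5) & (toxic == 1)).astype(int)

# Stage 1 is trained on every row.
stage1 = LogisticRegression().fit(X_demo, toxic)

# Stage 2 is trained only on the rows that are actually toxic.
toxic_rows = toxic == 1
stage2 = LogisticRegression().fit(X_demo[toxic_rows], rare_subclass[toxic_rows])


def predict_cascade(X):
    """Run stage 1 on all rows, then stage 2 only on rows predicted toxic."""
    is_toxic = stage1.predict(X).astype(bool)
    sub = np.zeros(X.shape[0], dtype=int)
    if is_toxic.any():
        sub[is_toxic] = stage2.predict(X[is_toxic])
    return is_toxic.astype(int), sub


pred_toxic, pred_sub = predict_cascade(X_demo)
```

With this layout a row can only receive the sub-class label if stage 1 has already flagged it as toxic, so stage-1 misses propagate to stage 2.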

## Python Library Imports

In [1]:
import pandas as pd
import numpy as np

## Import spaCy

In [2]:
import spacy

from spacy.lang.en import English
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

from spacy.tokens import Doc

# import custom trained spaCy model
nlp = spacy.load("../models/spacy_2/")


## Import nltk

In [3]:
# nltk imports
import nltk
from nltk.corpus import stopwords

## Import Custom Functions

In [4]:
import sys

# add src folder to path
sys.path.insert(1, '../src')

# from text_prep import tidy_series, uppercase_proportion_column
from spacy_helper import doc_check

# Getting info from preserved spaCy docs

I've had some difficulty un-pickling the docs in a way that lets them be processed further later on. A `Doc` seems to depend on some vocab properties of the model that are not saved within the doc itself.
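
One possible alternative to pickling, sketched here under assumptions, is spaCy's own byte serialisation, which restores a `Doc` against an explicitly supplied vocab. The example uses a blank English pipeline (`nlp_demo`, a hypothetical stand-in for the custom model), and it has not been checked against the pickled series used in this notebook.

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline as a stand-in for the custom-trained model (assumption).
nlp_demo = spacy.blank("en")
doc = nlp_demo("You are being rude.")

# Serialise the doc to bytes instead of pickling it.
doc_bytes = doc.to_bytes()

# Restore it by passing in the vocab of the pipeline that created it.
restored = Doc(nlp_demo.vocab).from_bytes(doc_bytes)
restored_tokens = [token.text for token in restored]
```

The key point is that the vocab is passed in at load time rather than being expected inside the serialised doc.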

In [5]:
%%time
'''
CPU times: user 3min 44s, sys: 33.5 s, total: 4min 17s
Wall time: 4min 32s
'''
X_train = pd.read_pickle('../data/basic_df_split/X_train_2-1.pkl')

CPU times: user 3min 39s, sys: 28.7 s, total: 4min 8s
Wall time: 4min 16s


In [6]:
# load y_train
! ls ../data/basic_df_split/
y_train = pd.read_pickle('../data/basic_df_split/basic_y_train.pkl')
X_test = pd.read_pickle('../data/basic_df_split/basic_X_test.pkl')
y_test = pd.read_pickle('../data/basic_df_split/basic_y_test.pkl')


X_train_2-1.pkl         basic_X_test.pkl        basic_y_test.pkl
X_train_docs_series.pkl basic_X_train.pkl       basic_y_train.pkl


In [7]:
X_train.columns

Index(['comment_text', 'uppercase_proportion', 'docs', 'lemmas', 'doc_vectors',
       'tok_vectors'],
      dtype='object')

### Create list of lemmas, less nltk stopwords

In [8]:
stopw_set = set(stopwords.words('english'))

In [9]:
%%time
# remove lemmas that appear in nltk stopword list
X_train['lemmas_less'] = X_train['lemmas'].apply(lambda row: [lemma for lemma in row if lemma not in stopw_set])

CPU times: user 1.48 s, sys: 993 ms, total: 2.47 s
Wall time: 3.03 s


## Further reduce lemmas by min_length & max_length

Exploration of the corpus vocabulary suggests that lemmas of 2 or fewer characters are unlikely to be useful and can be removed to reduce the number of features.

Lemmas of longer than 20 characters are often run-on words (where spaces have been omitted). Although a few of them have words hidden within them that may be considered toxic, the rarity and non-standard format make them unlikely to be generalizable.

In [10]:
%%time
min_l = 3
max_l = 20

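# keep only lemmas whose length falls within [min_l, max_l];
# start from 'lemmas_less' so the stopword removal above is not discarded
X_train['lemmas_less'] = X_train['lemmas_less'].apply(lambda row: [lemma for lemma in row if min_l <= len(lemma) <= max_l])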

CPU times: user 663 ms, sys: 77.1 ms, total: 740 ms
Wall time: 817 ms
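
As a small check of the length rule, the sketch below applies both bounds to a made-up list of lemmas. `filter_by_length` and `sample` are hypothetical names used only for illustration. Both bounds have to hold at once: a condition joined with `or` would keep every lemma, because any length is either at least 3 or at most 20.

```python
def filter_by_length(lemmas, min_len=3, max_len=20):
    """Keep lemmas whose length is between min_len and max_len inclusive."""
    return [lemma for lemma in lemmas if min_len <= len(lemma) <= max_len]


# Made-up lemmas: two too short, one run-on word, two of ordinary length.
sample = ["go", "idiot", "a", "thisisaverylongrunonwordwithoutspaces", "article"]
kept = filter_by_length(sample)

# The same list filtered with `or` instead of requiring both bounds.
kept_with_or = [lemma for lemma in sample if len(lemma) >= 3 or len(lemma) <= 20]
```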


# TF*IDF

## Scikit Learn Imports

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
%%time
tfidf_sklearn = TfidfVectorizer(ngram_range = (1,3),
                                min_df = 2)

# return sparse matrix
# join list into individual strings
tfidf_values = tfidf_sklearn.fit_transform(X_train['lemmas_less'].apply(lambda x: " ".join(x)))

CPU times: user 17.1 s, sys: 1.19 s, total: 18.3 s
Wall time: 18.5 s


In [13]:
'''
last values (with min_df as 1):
<106912x4089577 sparse matrix of type '<class 'numpy.float64'>'
	with 8002628 stored elements in Compressed Sparse Row format>
'''

tfidf_values

<106912x459917 sparse matrix of type '<class 'numpy.float64'>'
	with 4387497 stored elements in Compressed Sparse Row format>
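
The drop in column count between `min_df = 1` and `min_df = 2` comes from discarding n-grams that occur in only one document. A toy sketch of the same effect, using a made-up three-document corpus (`toy_docs` and `vocab_size` are hypothetical names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only.
toy_docs = ["you are a jerk", "you are kind", "what a kind article"]


def vocab_size(min_df):
    """Fit a 1- to 3-gram TF*IDF vectorizer and return its vocabulary size."""
    vec = TfidfVectorizer(ngram_range=(1, 3), min_df=min_df)
    vec.fit(toy_docs)
    return len(vec.vocabulary_)


size_all = vocab_size(1)      # every n-gram is kept
size_pruned = vocab_size(2)   # only n-grams found in at least 2 documents
```

Most bigrams and trigrams are unique to one document, which is why raising `min_df` by one removes such a large share of the features.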

In [14]:
%%time
'''
CPU times: user 11.9 s, sys: 8.27 s, total: 20.2 s
Wall time: 23 s
'''
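# NOTE: .toarray() densifies the sparse matrix, which is a very large allocation at this shape
lemmas_less_tfidf = pd.DataFrame(tfidf_values.toarray(),
                                 # get_feature_names() was removed in scikit-learn 1.2
                                 columns=tfidf_sklearn.get_feature_names_out())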

CPU times: user 3.86 s, sys: 4.86 s, total: 8.72 s
Wall time: 9.76 s


In [15]:
lemmas_less_tfidf.shape

(106912, 459917)

In [16]:
y_train['toxic'].shape

(106912,)

In [17]:
# sum(lemmas_less_tfidf['jerk'])

# search_term = 'jerk'

# # running this portion will crash the kernel
# bool_mask = lemmas_less_tfidf.sort_values(search_term, ascending=False)[search_term][:10]
# bool_mask
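
A dense float64 array of shape (106912, 459917) needs roughly 393 GB, which is a plausible reason for operations on the dense frame to crash the kernel. A lookup like the one above can instead be run against the sparse matrix by finding the term's column through the vectorizer's `vocabulary_`. The sketch below uses a made-up corpus; `toy_docs`, `vec`, and `mat` are hypothetical stand-ins for the notebook's data, `tfidf_sklearn`, and `tfidf_values`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only.
toy_docs = ["you are a jerk", "such a jerk jerk", "what a kind article"]
vec = TfidfVectorizer()
mat = vec.fit_transform(toy_docs)  # sparse CSR matrix

search_term = "jerk"
col = vec.vocabulary_[search_term]  # column index of the term

# Densify one column only: a vector with one entry per document.
scores = mat[:, col].toarray().ravel()

total = scores.sum()                     # counterpart of sum(df['jerk'])
top_rows = np.argsort(scores)[::-1][:2]  # rows with the highest scores
```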

# Toxic: Random Forest Classifier

Resources:
- [Explanation of Warm Start for RFC (not what you may think)](https://stackoverflow.com/questions/42757892/how-to-use-warm-start/42763502)  
- [TfidfVectorizer Docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)  

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [19]:
toxic_rfc = RandomForestClassifier(n_estimators=100,
                                   max_depth = 10,
                                   oob_score=True,
                                   n_jobs=-1,
                                   random_state=42,
                                   warm_start=True)
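
As the first resource above points out, `warm_start` on a random forest does not continue training the existing trees. A second call to `fit` only adds trees if `n_estimators` has been raised, and the new trees are fitted on whatever data that call receives. A minimal sketch on synthetic data (`X_demo`, `y_demo`, and `forest` are hypothetical names):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=42)

# First fit: 10 trees built from the first half of the data.
forest.fit(X_demo[:50], y_demo[:50])
trees_after_first_fit = len(forest.estimators_)

# Raise n_estimators, then fit again: 10 more trees are added, built from
# the second half only. The first 10 trees are left unchanged.
forest.set_params(n_estimators=20)
forest.fit(X_demo[50:], y_demo[50:])
trees_after_second_fit = len(forest.estimators_)
```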

In [20]:
X = lemmas_less_tfidf
y = y_train

In [21]:
toxic_logistic = LogisticRegression(warm_start=True,
                                    random_state=42,
                                    verbose=True,
                                    solver='sag',
                                    multi_class='ovr',
                                    max_iter=100,
                                    n_jobs=-1)

In [22]:
%%time
toxic_logistic.fit(X[:10],y[:10])

ValueError: y should be a 1d array, got an array of shape (10, 6) instead.
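
The error arises because `y` holds all six label columns, while a binary classifier expects a single 1-D target. Selecting one column, such as `y_train['toxic']`, gives the expected shape. A minimal sketch on made-up data (`toy_text`, `toy_labels`, and `other_label` are hypothetical), which also passes the sparse matrix directly instead of a dense DataFrame:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up comments and labels for illustration only.
toy_text = ["you are a jerk", "what a kind article", "total jerk", "thanks for the edit"]
toy_labels = pd.DataFrame({"toxic": [1, 0, 1, 0], "other_label": [0, 0, 1, 0]})

X_sparse = TfidfVectorizer().fit_transform(toy_text)  # stays sparse
y_single = toy_labels["toxic"]                        # 1-D target, shape (4,)

clf = LogisticRegression()
clf.fit(X_sparse, y_single)
preds = clf.predict(X_sparse)
```

Note that a small slice such as the first 10 rows may contain only one class, in which case the fit fails for a different reason.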

In [None]:
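def batch_train(X, y, model, batch_size=1000, verbose=False, start=0):
    """Fit `model` on consecutive row slices of X and y, beginning at row `start`.

    Only meaningful for models that keep state between fit() calls
    (e.g. warm_start=True). Each batch must contain every class.
    """
    n_rows = X.shape[0]  # works for DataFrames, arrays and sparse matrices

    b = start
    while b < n_rows:
        # the last batch may be shorter than batch_size
        e = min(b + batch_size, n_rows)

        model.fit(X[b:e], y[b:e])

        if verbose:
            print(f'fitted rows {b}:{e} of {n_rows}')

        b = e

    return model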

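
For comparison, scikit-learn also has estimators with a `partial_fit` method that is intended for batch-wise training. The sketch below swaps in `SGDClassifier`, which is a different technique from the `warm_start` approach above and is not a model used in this notebook. The data is synthetic and the names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

sgd = SGDClassifier(random_state=42)
all_classes = np.array([0, 1])  # partial_fit needs the full class list up front

batch_size = 100
n_batches = 0
for b in range(0, X_demo.shape[0], batch_size):
    e = b + batch_size
    # Each call updates the existing weights with one more batch.
    sgd.partial_fit(X_demo[b:e], y_demo[b:e], classes=all_classes)
    n_batches += 1

preds = sgd.predict(X_demo)
```

Because the class list is supplied explicitly, a batch that happens to contain a single class does not break training.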
In [22]:
# naive_toxic = BernoulliNB
# print(vect_X_train.shape)
# print(y_train.shape)
# print(X_train.shape)

# vect_X_train.iloc[0]

# naive_toxic.fit(vect_X_train, y_train)