In [6]:
import pandas as pd
import numpy as np

# The option below lets us see the entire comment_text column
pd.set_option('display.max_colwidth', None)
# Read in the dataset
train = pd.read_csv("../../data/kaggle_train.csv")
train = train.drop(columns=['id'])

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

print("Stats of training set: ", train.shape)
print("Labels:", labels)

Stats of training set:  (159571, 7)
Labels: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


In [7]:
train.head()

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,"Explanation\r\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,0,0,0,0,0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,0,0,0,0,0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,0,0,0,0,0
3,"""\r\nMore\r\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\r\n\r\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0,0,0,0,0,0
4,"You, sir, are my hero. Any chance you remember what page that's on?",0,0,0,0,0,0


# Text Preprocessing

The raw comments contain inconsistencies such as mixed case, links, punctuation, control characters, and numbers. The steps below normalize them to produce a cleaner dataset.

In [8]:
# Convert comment to lowercase
def to_lowercase(text):
    return text.lower()

train['comment_text'] = train['comment_text'].apply(to_lowercase)
train.head()

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,"explanation\r\nwhy the edits made under my username hardcore metallica fan were reverted? they weren't vandalisms, just closure on some gas after i voted at new york dolls fac. and please don't remove the template from the talk page since i'm retired now.89.205.38.27",0,0,0,0,0,0
1,"d'aww! he matches this background colour i'm seemingly stuck with. thanks. (talk) 21:51, january 11, 2016 (utc)",0,0,0,0,0,0
2,"hey man, i'm really not trying to edit war. it's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. he seems to care more about the formatting than the actual info.",0,0,0,0,0,0
3,"""\r\nmore\r\ni can't make any real suggestions on improvement - i wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -i think the references may need tidying so that they are all in the exact same format ie date format etc. i can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\r\n\r\nthere appears to be a backlog on articles for review so i guess there may be a delay until a reviewer turns up. it's listed in the relevant form eg wikipedia:good_article_nominations#transport """,0,0,0,0,0,0
4,"you, sir, are my hero. any chance you remember what page that's on?",0,0,0,0,0,0


In [9]:
import re
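# Remove HTML tags from the comments
def remove_html(text):
    # [^>]+ stops at the first ">", so the text between two tags is kept;
    # a greedy ".*" would delete everything from the first "<" to the last ">" on a line
    return re.sub(r"<[^>]+>", "", text)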
    
train['comment_text'] = train['comment_text'].apply(remove_html)
train.head()

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,"explanation\r\nwhy the edits made under my username hardcore metallica fan were reverted? they weren't vandalisms, just closure on some gas after i voted at new york dolls fac. and please don't remove the template from the talk page since i'm retired now.89.205.38.27",0,0,0,0,0,0
1,"d'aww! he matches this background colour i'm seemingly stuck with. thanks. (talk) 21:51, january 11, 2016 (utc)",0,0,0,0,0,0
2,"hey man, i'm really not trying to edit war. it's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. he seems to care more about the formatting than the actual info.",0,0,0,0,0,0
3,"""\r\nmore\r\ni can't make any real suggestions on improvement - i wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -i think the references may need tidying so that they are all in the exact same format ie date format etc. i can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\r\n\r\nthere appears to be a backlog on articles for review so i guess there may be a delay until a reviewer turns up. it's listed in the relevant form eg wikipedia:good_article_nominations#transport """,0,0,0,0,0,0
4,"you, sir, are my hero. any chance you remember what page that's on?",0,0,0,0,0,0
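
None of the five rows shown contains markup, so the effect of the tag pattern is not visible in the output. A minimal sketch on a made-up string shows how a greedy pattern and a bounded pattern differ (the sample text and the helper names `strip_greedy` and `strip_bounded` are illustrative, not part of the notebook):

```python
import re

def strip_greedy(text):
    # ".*" is greedy: it runs from the first "<" to the last ">" on the line
    return re.sub(r"<.*>", "", text)

def strip_bounded(text):
    # "[^>]+" stops at the first ">", so each tag is matched on its own
    return re.sub(r"<[^>]+>", "", text)

sample = "this is <b>very</b> rude"
print(repr(strip_greedy(sample)))   # 'this is  rude'
print(repr(strip_bounded(sample)))  # 'this is very rude'
```

Note that even the bounded pattern treats any `<...>` span as a tag, so a comment like `a < b and c > d` would lose the text between the two signs.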


In [10]:
# Remove links from the comments
def remove_links(text):
    # Replace anything starting with "http" or "www" (up to the next whitespace) with a space
    text = re.sub(r"http\S+", " ", text)
    return re.sub(r"www\S+", " ", text)

train['comment_text'] = train['comment_text'].apply(remove_links)
train.head()

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,"explanation\r\nwhy the edits made under my username hardcore metallica fan were reverted? they weren't vandalisms, just closure on some gas after i voted at new york dolls fac. and please don't remove the template from the talk page since i'm retired now.89.205.38.27",0,0,0,0,0,0
1,"d'aww! he matches this background colour i'm seemingly stuck with. thanks. (talk) 21:51, january 11, 2016 (utc)",0,0,0,0,0,0
2,"hey man, i'm really not trying to edit war. it's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. he seems to care more about the formatting than the actual info.",0,0,0,0,0,0
3,"""\r\nmore\r\ni can't make any real suggestions on improvement - i wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -i think the references may need tidying so that they are all in the exact same format ie date format etc. i can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\r\n\r\nthere appears to be a backlog on articles for review so i guess there may be a delay until a reviewer turns up. it's listed in the relevant form eg wikipedia:good_article_nominations#transport """,0,0,0,0,0,0
4,"you, sir, are my hero. any chance you remember what page that's on?",0,0,0,0,0,0


In [11]:
import string
# Remove punctuation marks 
def remove_punctuation(text):
    for i in string.punctuation:
        text = text.replace(i, "")
    return text

train['comment_text'] = train['comment_text'].apply(remove_punctuation)
train.head()

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,explanation\r\nwhy the edits made under my username hardcore metallica fan were reverted they werent vandalisms just closure on some gas after i voted at new york dolls fac and please dont remove the template from the talk page since im retired now892053827,0,0,0,0,0,0
1,daww he matches this background colour im seemingly stuck with thanks talk 2151 january 11 2016 utc,0,0,0,0,0,0
2,hey man im really not trying to edit war its just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info,0,0,0,0,0,0
3,\r\nmore\r\ni cant make any real suggestions on improvement i wondered if the section statistics should be later on or a subsection of types of accidents i think the references may need tidying so that they are all in the exact same format ie date format etc i can do that later on if noone else does first if you have any preferences for formatting style on references or want to do it yourself please let me know\r\n\r\nthere appears to be a backlog on articles for review so i guess there may be a delay until a reviewer turns up its listed in the relevant form eg wikipediagoodarticlenominationstransport,0,0,0,0,0,0
4,you sir are my hero any chance you remember what page thats on,0,0,0,0,0,0
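
The character-by-character loop can also be written with a translation table, which makes a single pass over each comment. A minimal sketch, checking that both versions agree on a sample string (`PUNCT_TABLE`, `remove_punctuation_fast`, and `remove_punctuation_loop` are names chosen here for illustration):

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    return text.translate(PUNCT_TABLE)

def remove_punctuation_loop(text):
    # Same logic as the cell above, kept here for comparison
    for ch in string.punctuation:
        text = text.replace(ch, "")
    return text

sample = "d'aww! he matches... (talk) 21:51"
assert remove_punctuation_fast(sample) == remove_punctuation_loop(sample)
print(remove_punctuation_fast(sample))  # daww he matches talk 2151
```

Either way, deleting punctuation outright glues neighbouring tokens together, as in `now892053827` and `wikipediagoodarticlenominationstransport` above. Mapping punctuation to a space instead would keep them apart, at the cost of splitting contractions.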


In [12]:
# Remove special characters such as: \n \r \t
def remove_special(text):
    return re.sub(r"[\n\t\\\/\r]"," ",text, flags=re.MULTILINE)

train['comment_text'] = train['comment_text'].apply(remove_special)
train.head()

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,explanation why the edits made under my username hardcore metallica fan were reverted they werent vandalisms just closure on some gas after i voted at new york dolls fac and please dont remove the template from the talk page since im retired now892053827,0,0,0,0,0,0
1,daww he matches this background colour im seemingly stuck with thanks talk 2151 january 11 2016 utc,0,0,0,0,0,0
2,hey man im really not trying to edit war its just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info,0,0,0,0,0,0
3,more i cant make any real suggestions on improvement i wondered if the section statistics should be later on or a subsection of types of accidents i think the references may need tidying so that they are all in the exact same format ie date format etc i can do that later on if noone else does first if you have any preferences for formatting style on references or want to do it yourself please let me know there appears to be a backlog on articles for review so i guess there may be a delay until a reviewer turns up its listed in the relevant form eg wikipediagoodarticlenominationstransport,0,0,0,0,0,0
4,you sir are my hero any chance you remember what page thats on,0,0,0,0,0,0


In [13]:
# Remove stopwords using nltk's stopwords package
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# A set gives constant-time membership tests; the list returned by nltk is scanned linearly
stop_words = set(stopwords.words('english'))

train['comment_text'] = train['comment_text'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop_words)
)
train.head()


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,explanation edits made username hardcore metallica fan reverted werent vandalisms closure gas voted new york dolls fac please dont remove template talk page since im retired now892053827,0,0,0,0,0,0
1,daww matches background colour im seemingly stuck thanks talk 2151 january 11 2016 utc,0,0,0,0,0,0
2,hey man im really trying edit war guy constantly removing relevant information talking edits instead talk page seems care formatting actual info,0,0,0,0,0,0
3,cant make real suggestions improvement wondered section statistics later subsection types accidents think references may need tidying exact format ie date format etc later noone else first preferences formatting style references want please let know appears backlog articles review guess may delay reviewer turns listed relevant form eg wikipediagoodarticlenominationstransport,0,0,0,0,0,0
4,sir hero chance remember page thats,0,0,0,0,0,0
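
The filtering step itself can be sketched without the NLTK download. The stop list below is a small hand-written subset standing in for NLTK's English list, so it is an assumption for illustration only:

```python
# Illustrative subset only; the notebook uses nltk's full English stop list
STOP_WORDS = frozenset({"you", "are", "my", "any", "what", "that", "on", "the"})

def remove_stopwords(text, stop_words=STOP_WORDS):
    # split() with no arguments also collapses runs of whitespace
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stopwords("you sir are my hero any chance you remember what page thats on"))
# sir hero chance remember page thats
```

Because punctuation was stripped in an earlier step, contractions such as `thats`, `dont`, and `im` no longer match their stop-word forms and stay in the text, which is also visible in the output above.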


In [14]:
# The output above still contains numbers and dates.
# Remove the digits, as they are not helpful for this task.

def remove_numbers(text):
    return re.sub(r'\d'," ",text, flags=re.MULTILINE)

train['comment_text'] = train['comment_text'].apply(remove_numbers)
train.head()

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,explanation edits made username hardcore metallica fan reverted werent vandalisms closure gas voted new york dolls fac please dont remove template talk page since im retired now,0,0,0,0,0,0
1,daww matches background colour im seemingly stuck thanks talk january utc,0,0,0,0,0,0
2,hey man im really trying edit war guy constantly removing relevant information talking edits instead talk page seems care formatting actual info,0,0,0,0,0,0
3,cant make real suggestions improvement wondered section statistics later subsection types accidents think references may need tidying exact format ie date format etc later noone else first preferences formatting style references want please let know appears backlog articles review guess may delay reviewer turns listed relevant form eg wikipediagoodarticlenominationstransport,0,0,0,0,0,0
4,sir hero chance remember page thats,0,0,0,0,0,0
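
The cleaning steps above can also be chained into one function, which makes it easier to apply the same cleaning to new comments at prediction time. A minimal sketch under a few assumptions: `clean_comment` is a hypothetical name, tags are replaced with a space rather than removed, whitespace is collapsed at the end, and stop-word removal is left out to keep the example free of downloads:

```python
import re
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_comment(text):
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)          # HTML-like tags
    text = re.sub(r"(?:http|www)\S+", " ", text)  # links
    text = text.translate(PUNCT_TABLE)            # punctuation
    text = re.sub(r"\d", " ", text)               # digits
    return " ".join(text.split())                 # newlines, tabs, repeated spaces

print(clean_comment("Explanation\r\nSee <b>this</b> http://example.com on 11 January!"))
# explanation see this on january
```

With a single function the whole column can be cleaned in one pass, for example `train['comment_text'].apply(clean_comment)`.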


# Train, Test, and Validation Split

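Below I split the dataset into train and test sets.
This is a multi-label problem (a comment can carry several of the six labels at once),
so there is no single target column to stratify on with sklearn's `train_test_split`.

Therefore, I make one random 80/20 split that keeps all six label columns together. Each label gets its own model later, and validation is handled by cross-validation on the training set.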

In [15]:
from sklearn.model_selection import train_test_split

# Keep comment_text as a one-column DataFrame; `labels` holds the six target columns
X_train, X_test, y_train, y_test = train_test_split(
    train[["comment_text"]], train[labels], test_size=0.20
)
X_train

Unnamed: 0,comment_text
64330,saved lives well delayed death serious injury shell finally hit trench ordered top attack
99454,estimate total biomass incorrect page defines biomass biomass organic nonfossil material collectively words biomass describes mass biological organisms dead alive goes state entire earth contains billion tons biomass assuming billion estimate low consistent definition give example global forest resources assessment fao estimates global total aboveground woody biomass billion tonnes david wardle
44699,thanks fuck
9616,song adaptation worthes theres song iron maiden movie heart darkness called edge darkness x factor album
98170,thats interesting however fact pira murdered around maimed around people single important fact organistion must included within lead pira article real issue ensuring correct verifiable sources figures dead injured
...,...
89688,still stupid cunt whore still stupid cunt whore
46589,happens may worth noting present view like advert encyclopedic entry macdui
101304,apologize giving benefit doubt someone drew borderline pedophilic image
36347,im sorry must misread article look back absense mention wmd bush telling public iraq suddenly npov pov silly link iraq terrorists oh wait right included negative things bush forgot cant


In [16]:
print("Train shape:",X_train.shape)
print("Test shape:", X_test.shape)

Train shape: (127656, 1)
Test shape: (31915, 1)
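
With a purely random split, rare labels can end up with somewhat different positive rates in the train and test sets. A minimal sketch of how to check this, on synthetic data (the `toy` frame, its label rates, and the fixed `random_state` are assumptions for illustration; the notebook does not set a seed):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(n)],
    "toxic": (rng.random(n) < 0.10).astype(int),   # roughly 10% positives
    "threat": (rng.random(n) < 0.02).astype(int),  # roughly 2% positives
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["comment_text"]], toy[["toxic", "threat"]],
    test_size=0.20, random_state=42,
)

# Positive rate per label in each split
rates = pd.DataFrame({"train": y_tr.mean(), "test": y_te.mean()})
print(rates)
```

Fixing `random_state` also makes the split reproducible across notebook sessions, which matters when models are saved in one session and evaluated in another.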


# Vectorizing the Comment Text

*Bernoulli Naive Bayes can't take text values as input*

Since the only independent variable is text, we need a vectorizer to convert the comments into numeric features that the classifier can use.

```

# max_features = build a vocabulary that only considers the top max_features terms, ordered by term frequency

# analyzer = whether the features are word or character n-grams. The option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with spaces.

# ngram_range = (1,1) means only unigrams, (1,2) means unigrams and bigrams, (1,3) means unigrams, bigrams, and trigrams

# n-grams = a bigram is a sequence of two consecutive tokens, a trigram is a sequence of three, etc.

# dtype = type of the matrix returned, default is float64
```

We use both word and character n-grams because some people obfuscate words, for example by repeating characters; by using both we can hope to catch these.
The idea came from [here](https://www.kaggle.com/code/tunguz/logistic-regression-with-words-and-char-n-grams/comments), which has one of the best results for Logistic Regression. That user also optimized the ngram_range.

We use a `ColumnTransformer` to apply both vectorizers to `comment_text` and concatenate the word and char n-gram outputs into one feature matrix. This is similar to how `hstack` works in the previous non-Pipeline example, and to the `FeatureUnion` approach described in this [post](https://stackoverflow.com/questions/65765954/word-and-char-ngram-with-different-ngram-range-on-tfidfvectorizer-pipeline).
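
To see why character n-grams help with obfuscated words, the analyzers can be inspected directly without fitting anything. A minimal sketch using `build_analyzer()` on made-up strings (the n-gram ranges here are shortened for readability and are not the ones used in the notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

word_analyzer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2)).build_analyzer()
char_analyzer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).build_analyzer()

# Word analyzer: unigrams first, then bigrams
print(word_analyzer("you are stupid"))
# ['you', 'are', 'stupid', 'you are', 'are stupid']

# Character trigrams shared between a word and a stretched spelling of it
plain = set(char_analyzer("stupid"))
obfuscated = set(char_analyzer("stuuupid"))
shared = plain & obfuscated
print(sorted(shared))  # ['pid', 'stu', 'upi']
```

At the word level `stupid` and `stuuupid` are unrelated tokens, while at the character level they still share three of the four trigrams of the plain spelling.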

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer

cols_trans = ColumnTransformer([
    ("txt_word", TfidfVectorizer(max_features=10000, binary=True, analyzer="word", ngram_range=(1,3), dtype=np.float32), 'comment_text'),
    ("txt_char", TfidfVectorizer(max_features=10000, binary=True, analyzer="char", ngram_range=(3,6), dtype=np.float32), 'comment_text')
])

## Pipeline 

Create a Pipeline for the data to flow through:

TF-IDF vectorize the data

then

Fit a Bernoulli Naive Bayes classifier

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB

pipe = Pipeline([
    ('trans', cols_trans),
    ('clf', BernoulliNB())
])

In [19]:
from sklearn import set_config
set_config(display='diagram')
# with display='diagram', simply use display() to see the diagram
display(pipe)
# if desired, set display back to the default
set_config(display='text')

# Hyperparameter Tuning with GridSearchCV

Below I build and train the `BernoulliNB` model, using GridSearch with 5-fold cross-validation to find the best hyperparameters and to check whether the model is overfit, underfit, or well fit.

We create a separate model for each label in order to handle the multi-label problem: for example, we cross-validate the `toxic` label, then the `severe_toxic` label, and so on. This is the preferred method based on previous implementations for the Kaggle competition.

By doing this, we can evaluate the predicted probability for each label and assign every label that scores highly. For example, a comment may be both `toxic` and `obscene` rather than only `toxic` or only `obscene`.
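
The one-model-per-label idea can be sketched on a handful of made-up comments. Everything in this example is an assumption for illustration: the `toy` data, the two labels, the `make_pipe` helper, and the simplified word-only vectorizer:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

toy = pd.DataFrame({
    "comment_text": ["you are stupid", "stupid idiot", "thanks for the help",
                     "great article thanks", "i will hurt you", "hurt you idiot"],
    "insult": [1, 1, 0, 0, 0, 1],
    "threat": [0, 0, 0, 0, 1, 1],
})
toy_labels = ["insult", "threat"]

def make_pipe():
    # A fresh, unfitted pipeline per label so the models do not share state
    return Pipeline([
        ("trans", ColumnTransformer([
            ("txt_word", TfidfVectorizer(analyzer="word"), "comment_text"),
        ])),
        ("clf", BernoulliNB()),
    ])

toy_models = {
    label: make_pipe().fit(toy[["comment_text"]], toy[label])
    for label in toy_labels
}

new = pd.DataFrame({"comment_text": ["you idiot", "thanks"]})
# Column 1 of predict_proba is the probability of the positive class
probs = pd.DataFrame(
    {label: model.predict_proba(new)[:, 1] for label, model in toy_models.items()}
)
print(probs.round(2))
```

Each row of `probs` holds one independent probability per label, so a comment can exceed the decision threshold for several labels at once.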

In [20]:
# specify parameter values to search
alpha = np.logspace(-3, 3, 10)  # 10 values from 1e-3 to 1e3, evenly spaced on a log scale
alpha = np.append(0, alpha)     # also try alpha = 0 (no smoothing)

print("Alpha Values: ")
for c in alpha:
    print(c)
params = {}
params['clf__alpha'] = alpha
params['clf__fit_prior'] = [True, False]

Alpha Values: 
0.0
0.001
0.004641588833612777
0.021544346900318832
0.1
0.46415888336127775
2.154434690031882
10.0
46.41588833612773
215.44346900318823
1000.0
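
To get a feel for what these `alpha` values do, here is a minimal sketch assuming the usual additive-smoothing estimate for a binary feature, `(count + alpha) / (class_count + 2 * alpha)`. The counts are hypothetical and `smoothed_prob` is a helper defined only for this example:

```python
def smoothed_prob(feature_count, class_count, alpha):
    # Estimate of P(feature present | class) with additive smoothing over two outcomes
    return (feature_count + alpha) / (class_count + 2 * alpha)

# Hypothetical counts: a term seen in 900 of 1,000 comments of one class
for a in [0.001, 1.0, 1000.0]:
    print(a, round(smoothed_prob(900, 1000, a), 3))
# 0.001 0.9
# 1.0 0.899
# 1000.0 0.633
```

Small values leave the observed frequency almost untouched, while very large values pull every estimate toward 0.5 and make individual features less informative.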


In [21]:
from sklearn.model_selection import GridSearchCV
import time
import warnings
warnings.filterwarnings('ignore') 

best_results = {}

for label in labels:
    start = time.time()

    grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
    grid.fit(X_train, y_train[label])

    best_results[label] = {
        "score": grid.best_score_,
        "parameters": grid.best_params_,
        "estimator": grid.best_estimator_
    }

    print(f"Time to tune [{label}]: {time.time() - start}")
    print(f"\tBest Score: {grid.best_score_}")
    print(f"\tFinal Model:: {grid.best_estimator_}")


Time to tune [toxic]: 9947.086345672607
	Best Score: 0.9019082523638465
	Final Model:: Pipeline(steps=[('trans',
                 ColumnTransformer(transformers=[('txt_word',
                                                  TfidfVectorizer(binary=True,
                                                                  dtype=<class 'numpy.float32'>,
                                                                  max_features=10000,
                                                                  ngram_range=(1,
                                                                               3)),
                                                  'comment_text'),
                                                 ('txt_char',
                                                  TfidfVectorizer(analyzer='char',
                                                                  binary=True,
                                                                  dtype=<class 'numpy.float32'>,
         

In [22]:
final_results = pd.DataFrame.from_dict(best_results)
final_results.head(6)

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
score,0.901908,0.988767,0.932013,0.997031,0.935812,0.990545
parameters,"{'clf__alpha': 215.44346900318823, 'clf__fit_prior': True}","{'clf__alpha': 1000.0, 'clf__fit_prior': True}","{'clf__alpha': 1000.0, 'clf__fit_prior': True}","{'clf__alpha': 1000.0, 'clf__fit_prior': True}","{'clf__alpha': 1000.0, 'clf__fit_prior': True}","{'clf__alpha': 1000.0, 'clf__fit_prior': True}"
estimator,"(ColumnTransformer(transformers=[('txt_word',\n TfidfVectorizer(binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(1, 3)),\n 'comment_text'),\n ('txt_char',\n TfidfVectorizer(analyzer='char', binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(3, 6)),\n 'comment_text')]), BernoulliNB(alpha=215.44346900318823))","(ColumnTransformer(transformers=[('txt_word',\n TfidfVectorizer(binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(1, 3)),\n 'comment_text'),\n ('txt_char',\n TfidfVectorizer(analyzer='char', binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(3, 6)),\n 'comment_text')]), BernoulliNB(alpha=1000.0))","(ColumnTransformer(transformers=[('txt_word',\n TfidfVectorizer(binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(1, 3)),\n 'comment_text'),\n ('txt_char',\n TfidfVectorizer(analyzer='char', binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(3, 6)),\n 'comment_text')]), BernoulliNB(alpha=1000.0))","(ColumnTransformer(transformers=[('txt_word',\n TfidfVectorizer(binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(1, 3)),\n 'comment_text'),\n ('txt_char',\n TfidfVectorizer(analyzer='char', binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(3, 6)),\n 'comment_text')]), BernoulliNB(alpha=1000.0))","(ColumnTransformer(transformers=[('txt_word',\n TfidfVectorizer(binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(1, 3)),\n 'comment_text'),\n ('txt_char',\n TfidfVectorizer(analyzer='char', binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(3, 6)),\n 'comment_text')]), BernoulliNB(alpha=1000.0))","(ColumnTransformer(transformers=[('txt_word',\n TfidfVectorizer(binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(1, 3)),\n 'comment_text'),\n ('txt_char',\n TfidfVectorizer(analyzer='char', binary=True,\n dtype=<class 'numpy.float32'>,\n max_features=10000,\n ngram_range=(3, 6)),\n 'comment_text')]), BernoulliNB(alpha=1000.0))"


In [23]:
final_results.iloc[0].mean()

0.9576792340781544

# Saving the Best Models

In [24]:
import joblib

for k, v in best_results.items():
    print(k, v['estimator'])
    joblib.dump(v['estimator'], f'F:/Thesis/models/bernoulli_naive/{k}.pkl')

toxic Pipeline(steps=[('trans',
                 ColumnTransformer(transformers=[('txt_word',
                                                  TfidfVectorizer(binary=True,
                                                                  dtype=<class 'numpy.float32'>,
                                                                  max_features=10000,
                                                                  ngram_range=(1,
                                                                               3)),
                                                  'comment_text'),
                                                 ('txt_char',
                                                  TfidfVectorizer(analyzer='char',
                                                                  binary=True,
                                                                  dtype=<class 'numpy.float32'>,
                                                                  max_features=10000,
    

# Loading the Saved Models

In [4]:
import joblib

models = {}

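for label in labels:
    # Loads from the `logistic` folder (the save cell above wrote to `bernoulli_naive`)
    # joblib.load accepts a path directly, so no open file handle is left behind
    models[label] = joblib.load(f'F:/Thesis/models/logistic/{label}.pkl')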
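
Saving and loading can be wrapped in two small helpers so that both sides build the file names the same way. A minimal sketch, round-tripping stand-in objects through a temporary directory (`save_models`, `load_models`, and the stored dictionaries are hypothetical and only illustrate the pattern):

```python
import tempfile
from pathlib import Path

import joblib

def save_models(models, model_dir):
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    for label, model in models.items():
        joblib.dump(model, model_dir / f"{label}.pkl")

def load_models(labels, model_dir):
    model_dir = Path(model_dir)
    return {label: joblib.load(model_dir / f"{label}.pkl") for label in labels}

# Round trip with stand-in objects instead of fitted pipelines
with tempfile.TemporaryDirectory() as tmp:
    save_models({"toxic": {"alpha": 1.0}, "threat": {"alpha": 10.0}}, tmp)
    restored = load_models(["toxic", "threat"], tmp)

print(restored)  # {'toxic': {'alpha': 1.0}, 'threat': {'alpha': 10.0}}
```

Passing the same `model_dir` to both helpers avoids saving to one folder and loading from another.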

In [19]:
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict

for k, v in models.items():
    y_actual = y_train[k]
    y_predicted = cross_val_predict(v, X_train, y_train[k], cv=5)
    print(f"-----------{k}-----------")
    print(metrics.confusion_matrix(y_actual, y_predicted))
    print(metrics.classification_report(y_actual, y_predicted))

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  X, fitted_transformer = fit_transform_one_cached(

-----------toxic-----------
[[114070   1307]
 [  3768   8511]]
              precision    recall  f1-score   support

           0       0.97      0.99      0.98    115377
           1       0.87      0.69      0.77     12279

    accuracy                           0.96    127656
   macro avg       0.92      0.84      0.87    127656
weighted avg       0.96      0.96      0.96    127656
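
The numbers in the report follow directly from the confusion matrix. As a worked check for the positive class of `toxic`, using the four counts printed above (rows are actual classes, columns are predicted classes):

```python
# Confusion matrix for `toxic`: [[tn, fp], [fn, tp]]
tn, fp = 114070, 1307
fn, tp = 3768, 8511

precision = tp / (tp + fp)  # of the comments flagged toxic, how many really are
recall = tp / (tp + fn)     # of the truly toxic comments, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.87 recall=0.69 f1=0.77 accuracy=0.96
```

The gap between 0.96 accuracy and 0.69 recall shows how much the large negative class dominates the accuracy figure.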



-----------severe_toxic-----------
[[126063    319]
 [   899    375]]
              precision    recall  f1-score   support

           0       0.99      1.00      1.00    126382
           1       0.54      0.29      0.38      1274

    accuracy                           0.99    127656
   macro avg       0.77      0.65      0.69    127656
weighted avg       0.99      0.99      0.99    127656



-----------obscene-----------
[[120201    658]
 [  1756   5041]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99    120859
           1       0.88      0.74      0.81      6797

    accuracy                           0.98    127656
   macro avg       0.94      0.87      0.90    127656
weighted avg       0.98      0.98      0.98    127656



-----------threat-----------
[[127227     61]
 [   275     93]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    127288
           1       0.60      0.25      0.36       368

    accuracy                           1.00    127656
   macro avg       0.80      0.63      0.68    127656
weighted avg       1.00      1.00      1.00    127656



-----------insult-----------
[[120391    919]
 [  2649   3697]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.99    121310
           1       0.80      0.58      0.67      6346

    accuracy                           0.97    127656
   macro avg       0.89      0.79      0.83    127656
weighted avg       0.97      0.97      0.97    127656



-----------identity_hate-----------
[[126368    154]
 [   812    322]]
              precision    recall  f1-score   support

           0       0.99      1.00      1.00    126522
           1       0.68      0.28      0.40      1134

    accuracy                           0.99    127656
   macro avg       0.84      0.64      0.70    127656
weighted avg       0.99      0.99      0.99    127656
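
Since every label is heavily imbalanced, it helps to compare accuracy with a majority-class baseline, that is, the accuracy of always predicting 0. A minimal sketch using the support counts from the reports above (the `support` and `baselines` names are chosen here for illustration):

```python
# (negatives, positives) per label, copied from the "support" columns above
support = {
    "toxic": (115377, 12279),
    "severe_toxic": (126382, 1274),
    "obscene": (120859, 6797),
    "threat": (127288, 368),
    "insult": (121310, 6346),
    "identity_hate": (126522, 1134),
}

# Accuracy obtained by always predicting the majority class
baselines = {
    label: max(neg, pos) / (neg + pos)
    for label, (neg, pos) in support.items()
}

for label, baseline in baselines.items():
    print(f"{label:>14}: {baseline:.4f}")
```

For rare labels such as `threat` the baseline is already above 0.99, so the recall and f1-score of class 1 are more informative than accuracy when comparing models.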

