# NLP toxic comments classification

## Project Description

An online store is launching a new service. Now users can edit and expand product descriptions, similar to how it's done in wiki communities. This means that customers can suggest edits and comment on changes made by others. The store needs a tool that can identify toxic comments and send them for moderation.

The task is to train a model that classifies comments as toxic or non-toxic. A dataset with labels indicating the toxicity of the edits is available. The goal is to build a model with an F1 score of at least 0.74.
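For reference, F1 is the harmonic mean of precision and recall computed for the positive (toxic) class. The sketch below works through the arithmetic on a small set of made-up labels (the `y_true` and `y_pred` values are purely illustrative, not taken from the dataset) and compares the manual result with `sklearn.metrics.f1_score`.

```python
from sklearn.metrics import f1_score

# Hypothetical labels for illustration: 1 = toxic, 0 = non-toxic
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Count true positives, false positives and false negatives
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print(precision, recall, f1_manual, f1_sklearn)
```

Because true negatives do not enter the formula, F1 is not inflated by the large non-toxic majority the way accuracy would be.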

The data is stored in the toxic_comments.csv file. The 'text' column contains the comment text, and 'toxic' is the target feature.

In [None]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.2-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.2


In [None]:
!pip install spacy



In [None]:
from spacy.cli import download
download("en_core_web_trf")
download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
!pip install 'spacy[transformers]'



## Preprocessing Data

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import f1_score
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

import re
import spacy_transformers
import spacy

from tqdm import tqdm

from google.colab import drive

**Let's take a look at the raw data.**

In [None]:
drive.mount('/content/drive')
data = pd.read_csv('/content/drive/My Drive/projects/NLP_toxic_comments_classification/toxic_comments.csv')
data.head()

Mounted at /content/drive


Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


**Delete the extra column.**

In [None]:
data = data.drop('Unnamed: 0', axis=1)
data.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [None]:
data['toxic'].mean()

0.10161213369158527

**Toxic comments make up 10% of the total sample.**

**Let's look at the absolute class counts.**

In [None]:
data['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

**Let's take a random sample of 50,000 comments from the original dataset and do all further work on it, because there is not enough computing power to process the full dataset. With enough power, the work on the full dataset would be identical. A simple random sample of this size should keep the class ratio close to the original, which we check right after sampling.**

In [None]:
df = data.sample(50000).reset_index(drop=True)
df['toxic'].mean()

0.10116

**The ratio of classes in the new sample remained similar to the original one.**
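Plain random sampling only preserves the class ratio approximately. If an exact match were needed, one option is a stratified subsample. The sketch below is an alternative, not what this notebook does: it uses `train_test_split` with `stratify` on a toy frame (`toy` is a made-up stand-in for `data`) and a fixed `random_state` so the sample is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `data`: exactly 10% positive class (illustrative only)
toy = pd.DataFrame({
    'text': [f'comment {i}' for i in range(1000)],
    'toxic': [1 if i % 10 == 0 else 0 for i in range(1000)],
})

# train_size is the size of the subsample; stratify keeps the class ratio
sample, _ = train_test_split(
    toy, train_size=200, stratify=toy['toxic'], random_state=1
)
sample = sample.reset_index(drop=True)

print(len(sample), sample['toxic'].mean())
```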

**We will prepare the data: tokenize and lemmatize the text using the spaCy library.**

In [None]:
nlp = spacy.load('en_core_web_sm')
tqdm.pandas()

new_corpus = []
for doc in tqdm(nlp.pipe(df['text'], batch_size=64, n_process=-1, disable=["parser", "ner"]), total=len(df['text'])):
    word_list = [tok.lemma_ for tok in doc]
    new_corpus.append(' '.join(word_list))

df['lemm_text'] = new_corpus

100%|██████████| 50000/50000 [08:39<00:00, 96.19it/s] 


**Clean up the lemmatized text**

In [None]:
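def clear_text(text):
    # keep only Latin letters; every other character becomes a space
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', text)
    cleared_text = cleared_text.lower()
    # collapse runs of whitespace into single spaces
    return ' '.join(cleared_text.split())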

df['clear_text'] = df['lemm_text'].progress_apply(clear_text)

100%|██████████| 50000/50000 [00:02<00:00, 19375.59it/s]
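To make the effect of the cleanup concrete, here is a small self-contained check. The function is repeated so the snippet runs on its own, and the `example` string is made up for illustration.

```python
import re

def clear_text(text):
    # keep only Latin letters; every other character becomes a space
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', text)
    cleared_text = cleared_text.lower()
    # collapse runs of whitespace into single spaces
    return ' '.join(cleared_text.split())

# Made-up lemmatized comment with punctuation, digits and an emoticon
example = "D'aww! He match this background colour, I be 100% sure :)"
cleaned = clear_text(example)
print(cleaned)  # d aww he match this background colour i be sure
```

Note that apostrophes are replaced by spaces too, so contractions are split into separate tokens.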


**We will split our sample into training and test sets. There will be no separate validation set, because we will use cross-validation.**

In [None]:
# split the data into training and test samples in a ratio of 80/20
train, test = train_test_split(df, test_size = 0.2, random_state=1)

display('train shape:', train.shape)
display('test shape:', test.shape)

'train shape:'

(40000, 4)

'test shape:'

(10000, 4)

**Now we will build the training and test corpora, fit the TF-IDF vectorizer on the training corpus to obtain its features, and then use the fitted vectorizer to transform the test corpus.**

In [None]:
%%time
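# NOTE: the corpora are built from 'lemm_text'; the 'clear_text' column created above is not used here
corpus_train = train['lemm_text']
corpus_test = test['lemm_text']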


count_tf_idf_train = TfidfVectorizer()

features_train = count_tf_idf_train.fit_transform(corpus_train)
target_train = train['toxic']
features_test = count_tf_idf_train.transform(corpus_test)
target_test = test['toxic']

CPU times: user 4.83 s, sys: 166 ms, total: 4.99 s
Wall time: 5.95 s
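Fitting the vectorizer once on the whole training set means that, during cross-validation, the IDF weights have already seen the validation folds. A possible alternative is to wrap the vectorizer and the classifier in a `Pipeline`, so the vectorizer is refit on each training fold. The sketch below is illustrative only: it uses a tiny made-up corpus and `LogisticRegression` as a lightweight stand-in classifier, not the models trained in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only (1 = toxic, 0 = non-toxic)
texts = ['you be an idiot', 'thank for the edit', 'stupid idiot go away',
         'nice work on the article', 'shut up you moron', 'i agree with this change',
         'what a dumb idiot', 'please see the talk page', 'you stupid moron',
         'good point thank you', 'go away idiot', 'the source look reliable']
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# The vectorizer is a pipeline step, so it is fit only on each training fold
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

scores = cross_validate(pipe, texts, labels, scoring='f1', cv=3)
print(scores['test_score'])
```

The same pipeline object could be passed to `GridSearchCV` as the estimator, with raw text instead of precomputed features as input.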


## Training Models

### Decision tree classifier model

**We will use GridSearchCV with built-in cross-validation to find the hyperparameters that give the best F1 score, the metric the task requires us to evaluate the model with.**

In [None]:
%%time
model_tree = DecisionTreeClassifier(random_state=1)
param = {
         'criterion': ['gini', 'entropy'],
         'max_depth': range(1, 10, 3)
        }
gridsearch_tree = GridSearchCV(
    estimator=model_tree,
    param_grid=param,
    scoring='f1',
    cv=3)
gridsearch_tree.fit(features_train, target_train)
gridsearch_tree.best_params_

CPU times: user 1min 6s, sys: 125 ms, total: 1min 6s
Wall time: 1min 26s


{'criterion': 'gini', 'max_depth': 7}

**Now let's see the best cross-validated F1 score achieved by the decision tree classifier.**

In [None]:
f1_train_tree = round(gridsearch_tree.best_score_, 3)
f1_train_tree

0.559

### CatBoost classifier model

**We again use GridSearchCV with built-in cross-validation, this time with an empty parameter grid, so CatBoost is evaluated with its default hyperparameters. Let's display the resulting F1 score.**

In [None]:
%%time
model = CatBoostClassifier(silent=True, random_state=1)
param = {}
gridsearch_cat = GridSearchCV(
    estimator=model,
    param_grid=param,
    scoring='f1',
    cv=3)
gridsearch_cat.fit(features_train, target_train)
f1_train_cat = round(gridsearch_cat.best_score_, 3)
f1_train_cat

CPU times: user 1h 36min 33s, sys: 36.7 s, total: 1h 37min 10s
Wall time: 1h 4min 31s


0.72
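With an empty parameter grid, `GridSearchCV` evaluates a single configuration, so its `best_score_` is simply the mean cross-validated score. The sketch below illustrates this equivalence with `cross_val_score`. To keep it fast it uses synthetic data from `make_classification` and `LogisticRegression` as a stand-in estimator; both are assumptions made for the example, not part of the original run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic imbalanced data standing in for the TF-IDF features
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.8, 0.2], random_state=1)
est = LogisticRegression(max_iter=1000)

# Empty grid: one candidate, i.e. the estimator with its current parameters
grid = GridSearchCV(estimator=est, param_grid={}, scoring='f1', cv=3)
grid.fit(X, y)

# Plain cross-validation with the same splitter and metric
cv_scores = cross_val_score(est, X, y, scoring='f1', cv=3)

print(grid.best_score_, cv_scores.mean())
```

The practical difference is that `GridSearchCV` also refits the estimator on the full training set afterwards, which is what makes `best_estimator_` available for the test-set prediction.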

## Model Comparison

**Of the two models, CatBoost performed better. Let's now check its F1 score on the test sample.**

In [None]:
%%time
predictions_test = gridsearch_cat.best_estimator_.predict(features_test)
f1_test = f1_score(target_test, predictions_test)
f1_test

CPU times: user 410 ms, sys: 7.97 ms, total: 418 ms
Wall time: 593 ms


0.7402135231316725

**We obtained an F1 score of 0.74 on the test sample, which meets the requirement of the original task.**

## Checking models for adequacy

**Let's check the model for adequacy using DummyClassifier from sklearn, which was imported in the first step.**

**The check passes if the quality of the model selected above is higher than that of the dummy model, which makes predictions without relying on the features of the training set.**

In [None]:
model_dummy = DummyClassifier(random_state=10)
parameters_dummy = {'strategy':['most_frequent', 'prior', 'stratified', 'uniform'],
                   }
gridsearch_dummy = GridSearchCV(
    estimator=model_dummy,
    param_grid=parameters_dummy,
    scoring='f1',
    cv=3, n_jobs=-1)
gridsearch_dummy.fit(features_train, target_train)
print('The best f1 metric value for the Dummy model on the training set =', gridsearch_dummy.best_score_)

The best f1 metric value for the Dummy model on the training set = 0.1684021448565768


In [None]:
predict_dummy = gridsearch_dummy.best_estimator_.predict(features_test)
print('The best value of the f1 metric for the Dummy model on the test set =', f1_score(target_test, predict_dummy))

Лучшее значение метрики f1 для Dummy-модели на тестовой выборке = 0.16993895396799208


## Result

**The best dummy model reaches an F1 score of only 0.17 on the test sample, which is significantly lower than the 0.74 achieved by the selected and trained CatBoost model. This confirms the adequacy of the model we selected.**