# Defining Comment Toxicity

**Project Description**

Users of an online store can edit and supplement product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

**Research Objective**

Train a model that classifies comments as toxic or non-toxic. A dataset of comments labeled for toxicity is available.

**Research Tasks**

- train different models with different hyperparameters;
- the *F1* metric value should be at least 0.75.
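
To make the target concrete, here is a minimal sketch of how F1 is obtained from prediction counts for the toxic class. The function name `f1_from_counts` and the counts are hypothetical and serve only as an illustration; they are not taken from the dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 missed.
example_f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(f"F1 = {example_f1:.3f}")  # F1 = 0.762
```

With these counts precision is 0.80 and recall is about 0.73, so the model would just clear the 0.75 threshold.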

**Data Description**

The data is located in the `/datasets/toxic_comments.csv` file. The *text* column contains the comment text, and *toxic* is the target feature.

## Import Libraries

In [None]:
!pip install scikit-learn --upgrade -q
!pip install -U imbalanced-learn -q

In [None]:
import pandas as pd
import os
import re
import spacy
import numpy as np

import nltk
from nltk.corpus import stopwords as nltk_stopwords

from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

from catboost import CatBoostClassifier

from sklearn.metrics import make_scorer, f1_score

RANDOM_STATE = 42
TEST_SIZE = 0.10

## Preparation

### Loading Data

In [None]:
path_1 = '/datasets/toxic_comments.csv'
path_2 = 'https://.../datasets/toxic_comments.csv'

if os.path.exists(path_1):
    data = pd.read_csv(path_1, index_col=[0])
else:
    try:
        data = pd.read_csv(path_2, index_col=[0])
    except Exception as e:
        print(f'Error loading data from URL: {e}')
        raise  # stop here: every later cell needs `data`

Let's take a look at the data:

In [None]:
# universal function for reviewing data
def data_review(data):
    '''
    Show the first rows, general info, NaN counts and percentages,
    and descriptive statistics.

    data - DataFrame
    '''
    print('*'*10, 'The Original DataFrame', '*'*10)
    display(data.head())
    print('')
    print('')
    print('*'*10, 'General Information', '*'*10)
    print('')
    data.info()
    print('')
    print('')
    print('*'*10, 'Has NaN', '*'*10)
    display(pd.DataFrame(data.isna().sum()).style.background_gradient('coolwarm'))
    print('*'*10, 'Has NaN Percentage', '*'*10)
    display(pd.DataFrame(round(data.isna().mean()*100,1)).style.background_gradient('coolwarm'))
    print('')
    print('')
    print('*'*10, 'Descriptive Statistics', '*'*10)
    display(pd.DataFrame(data.describe()))

In [None]:
data_review(data)

********** The Original DataFrame **********


Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0




********** General Information **********

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


********** Has NaN **********


Unnamed: 0,0
text,0
toxic,0


********** Has NaN Percentage **********


Unnamed: 0,0
text,0.0
toxic,0.0




********** Descriptive Statistics **********


Unnamed: 0,toxic
count,159292.0
mean,0.101612
std,0.302139
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


#### Conclusions

- the loaded data has two columns: the comment text (`text`) and the target feature `toxic`;
- no missing values were found in the data;
- the classes are imbalanced: class `1` (toxic) makes up only about 10% of the data;
- the index has gaps: 159292 entries carry labels from 0 to 159450.
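
The imbalance is the reason accuracy would be a misleading metric here. The sketch below compares two trivial baselines using only the toxic share reported in the descriptive statistics. The helper `f1` and the baseline names are illustrative assumptions.

```python
TOXIC_SHARE = 0.101612  # mean of the `toxic` column in the statistics above


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Baseline A: label every comment as non-toxic.
accuracy_all_negative = 1 - TOXIC_SHARE  # correct on every non-toxic comment
f1_all_negative = f1(0.0, 0.0)           # no toxic comment is ever found

# Baseline B: label every comment as toxic.
# Precision equals the toxic share, recall is 1.0.
f1_all_positive = f1(TOXIC_SHARE, 1.0)

print(f"always non-toxic: accuracy={accuracy_all_negative:.3f}, F1={f1_all_negative:.3f}")
print(f"always toxic:     F1={f1_all_positive:.3f}")
```

A model that never flags anything reaches roughly 90% accuracy while being useless for moderation, which is why the project requirement is stated in terms of F1.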

### Dataset processing

To speed up calculations, we will take a random sample of 75,000 texts and reset the index.

In [None]:
data_cut = data.sample(75000, random_state=RANDOM_STATE)

In [None]:
data_cut.reset_index(drop=True, inplace=True)

In [None]:
data_review(data_cut)

********** The Original DataFrame **********


Unnamed: 0,text,toxic
0,"Sometime back, I just happened to log on to ww...",0
1,"""\n\nThe latest edit is much better, don't mak...",0
2,""" October 2007 (UTC)\n\nI would think you'd be...",0
3,Thanks for the tip on the currency translation...,0
4,I would argue that if content on the Con in co...,0




********** General Information **********

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75000 entries, 0 to 74999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    75000 non-null  object
 1   toxic   75000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ MB


********** Has NaN **********


Unnamed: 0,0
text,0
toxic,0


********** Has NaN Percentage **********


Unnamed: 0,0
text,0.0
toxic,0.0




********** Descriptive Statistics **********


Unnamed: 0,toxic
count,75000.0
mean,0.10084
std,0.301119
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


### Cleaning and lemmatization

Let's clean up and lemmatize the text:

In [None]:
#universal text lemmatization function
nlp = spacy.load("en_core_web_sm")

def lemmatize(text):
    doc = nlp(text)
    lemm_text = ' '.join([token.lemma_ for token in doc])
    return lemm_text

In [None]:
# universal text cleaning function: lower-case, keep only Latin letters, collapse whitespace
def clear_text(text):
    return ' '.join(re.sub(r'[^a-zA-Z ]', ' ', text.lower()).split())
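
As an illustration of this cleaning step, the sketch below applies a letters-only character class to a made-up string. The name `clean_text_sketch` and the sample text are hypothetical. Note that the upper bound of the class is an upper-case `Z`: a range written as `A-z` would also admit the ASCII characters that sit between `Z` and `a` (brackets, backslash, caret, underscore, and backtick).

```python
import re


def clean_text_sketch(text: str) -> str:
    """Lower-case, replace every non-letter with a space, collapse whitespace."""
    letters_only = re.sub(r"[^a-zA-Z ]", " ", text.lower())
    return " ".join(letters_only.split())


sample = "Don't_edit [this] page!!\n\nOctober 2007 (UTC)"
cleaned = clean_text_sketch(sample)
print(cleaned)  # don t edit this page october utc
```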

In [None]:
data_cut['clean_text'] = data_cut['text'].apply(clear_text)

In [None]:
data_cut.head()

Unnamed: 0,text,toxic,clean_text
0,"Sometime back, I just happened to log on to ww...",0,sometime back i just happened to log on to www...
1,"""\n\nThe latest edit is much better, don't mak...",0,the latest edit is much better don t make this...
2,""" October 2007 (UTC)\n\nI would think you'd be...",0,october utc i would think you d be able to get...
3,Thanks for the tip on the currency translation...,0,thanks for the tip on the currency translation...
4,I would argue that if content on the Con in co...,0,i would argue that if content on the con in co...


In [None]:
tqdm.pandas()

data_cut['lemm_text'] = data_cut['clean_text'].progress_apply(lemmatize)

100%|██████████| 75000/75000 [17:30<00:00, 71.40it/s] 


In [None]:
data_cut.head(5)

Unnamed: 0,text,toxic,clean_text,lemm_text
0,"Sometime back, I just happened to log on to ww...",0,sometime back i just happened to log on to www...,sometime back I just happen to log on to www i...
1,"""\n\nThe latest edit is much better, don't mak...",0,the latest edit is much better don t make this...,the late edit be much well don t make this art...
2,""" October 2007 (UTC)\n\nI would think you'd be...",0,october utc i would think you d be able to get...,october utc I would think you d be able to get...
3,Thanks for the tip on the currency translation...,0,thanks for the tip on the currency translation...,thank for the tip on the currency translation ...
4,I would argue that if content on the Con in co...,0,i would argue that if content on the con in co...,I would argue that if content on the con in co...


In [None]:
data_cut['lemm_text'].duplicated().sum()

379

In [None]:
data_cut = data_cut.drop_duplicates(subset=['lemm_text'])

In [None]:
data_cut['lemm_text'].duplicated().sum()

0

Let's split the data into training and test samples:

In [None]:
features = data_cut['lemm_text']
target = data_cut['toxic']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
        features,
        target,
        test_size=TEST_SIZE,
        stratify=target,
        random_state=RANDOM_STATE
    )

Let's check the sizes and dimensions of the obtained samples:

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((67158,), (7463,), (67158,), (7463,))
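
These shapes can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional test share up to a whole number of rows and gives the remainder to the training set. That assumption is consistent with the output above.

```python
import math

n_samples = 75_000 - 379  # rows left after dropping duplicated lemmatized texts
test_size = 0.10          # TEST_SIZE in the notebook

n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
n_train = n_samples - n_test

print(n_train, n_test)  # 67158 7463
```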

## Training

Let's build a pipeline for training and compare two models: `LogisticRegression` and `CatBoostClassifier`. The pipeline also computes TF-IDF features for the text corpus. Beforehand, we create a list of stop words so that words carrying little meaning are removed from the corpus.
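
To show what the TF-IDF step produces, here is a simplified plain-Python sketch. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the variant documented for the default settings of `TfidfVectorizer`. Tokenization is reduced to a whitespace split, and the function name, documents, and stop-word set are made up for illustration.

```python
import math
from collections import Counter


def tfidf_sketch(docs, stop_words):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + count)) + 1 for w, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {w: tf[w] * idf[w] for w in tf}
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        rows.append({w: v / norm for w, v in weights.items()})
    return rows


docs = ["you be a great editor", "you be a stupid idiot", "thank for the great edit"]
stop = {"you", "be", "a", "for", "the"}  # hypothetical stop-word list
rows = tfidf_sketch(docs, stop)
for row in rows:
    print({w: round(v, 3) for w, v in row.items()})
```

The word "great" occurs in two of the three documents, so it receives a lower weight than "editor", which occurs in only one. Stop words do not appear in the result at all.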

In [None]:
nltk.download('stopwords')
stopwords = list(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
pipe_final = Pipeline(
    [
        ('vect', TfidfVectorizer(stop_words=stopwords)),
        ('models', LogisticRegression())
    ]
)

In [None]:
param_grid = [
    {
        'models': [LogisticRegression(random_state=RANDOM_STATE,
                                      solver='liblinear',
                                      max_iter=1000,
                                      class_weight='balanced')],
        'models__penalty': ['l1', 'l2'],
        'models__C': range(5, 15)
    },
    {
        'models': [CatBoostClassifier(random_seed=RANDOM_STATE,
                                      iterations=500,
                                      depth=6,
                                      auto_class_weights='Balanced',
                                      verbose=False)],
        'models__learning_rate': [0.1, 0.5]
    },
]
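
Both candidates are configured to compensate for the class imbalance. For scikit-learn, the `class_weight='balanced'` option is documented as `n_samples / (n_classes * n_samples_in_class)`. The sketch below evaluates that formula using the toxic share of the 75,000-row sample, so the numbers are approximate: the actual training folds differ slightly after deduplication and splitting. CatBoost's `auto_class_weights='Balanced'` pursues the same goal with its own formula, which is not reproduced here.

```python
toxic_share = 0.10084  # mean of `toxic` in the 75,000-row sample above
class_shares = {0: 1 - toxic_share, 1: toxic_share}
n_classes = len(class_shares)

# n_samples / (n_classes * n_samples_c) simplifies to 1 / (n_classes * share_c)
balanced_weights = {c: 1 / (n_classes * share) for c, share in class_shares.items()}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight ~ {weight:.2f}")
```

Each toxic comment therefore counts roughly nine times as much as a non-toxic one in the logistic regression loss.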

Let's run an automatic search for the best model using the GridSearchCV method with the F1 metric:

In [None]:
f1_scorer = make_scorer(f1_score)

In [None]:
grid_search = GridSearchCV(
    pipe_final,
    param_grid=param_grid,
    cv=3,
    scoring=f1_scorer,
    n_jobs=-1
)
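
It helps to estimate the cost of the search before running it. The sketch below counts the candidates in the two sub-grids defined above and multiplies by the number of folds. The dictionaries restate the grid without the pipeline prefixes, and `count_candidates` is a helper introduced only for this illustration.

```python
from itertools import product

logreg_grid = {"penalty": ["l1", "l2"], "C": list(range(5, 15))}
catboost_grid = {"learning_rate": [0.1, 0.5]}


def count_candidates(grid: dict) -> int:
    """Number of parameter combinations in one sub-grid."""
    return len(list(product(*grid.values())))


n_candidates = count_candidates(logreg_grid) + count_candidates(catboost_grid)
n_folds = 3
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 22 66
```

With the default `refit=True`, the best candidate is then trained once more on the whole training sample.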

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
F1_score = grid_search.best_score_
print(f'F1_score for the best model on cross-validation: {F1_score:.2f}')

F1_score for the best model on cross-validation: 0.76


Let's look at the parameters of the best model:

In [None]:
grid_search.best_params_

{'models': <catboost.core.CatBoostClassifier at 0x7fd2b2cbe970>,
 'models__learning_rate': 0.5}

In [None]:
results = pd.DataFrame(grid_search.cv_results_)
results = results.sort_values(by='mean_test_score', ascending=False).head(5)
results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_models,param_models__C,param_models__penalty,param_models__learning_rate,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
21,431.007029,5.067633,1.117746,0.03244,<catboost.core.CatBoostClassifier object at 0x...,,,0.5,{'models': <catboost.core.CatBoostClassifier o...,0.755479,0.750384,0.763359,0.756407,0.005338,1
17,15.660775,0.235865,0.961803,0.02314,"LogisticRegression(class_weight='balanced', ma...",13.0,l2,,{'models': LogisticRegression(class_weight='ba...,0.76223,0.748434,0.75436,0.755008,0.005651,2
5,12.493378,0.258806,1.02203,0.005056,"LogisticRegression(class_weight='balanced', ma...",7.0,l2,,{'models': LogisticRegression(class_weight='ba...,0.761464,0.746916,0.755254,0.754545,0.00596,3
11,14.377527,0.485345,0.997361,0.009521,"LogisticRegression(class_weight='balanced', ma...",10.0,l2,,{'models': LogisticRegression(class_weight='ba...,0.762187,0.747639,0.753258,0.754362,0.00599,4
19,15.301699,0.419365,0.999549,0.033471,"LogisticRegression(class_weight='balanced', ma...",14.0,l2,,{'models': LogisticRegression(class_weight='ba...,0.761395,0.747676,0.753504,0.754192,0.005622,5
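
The summary columns can be reproduced from the per-fold scores, which also shows how close the two leading candidates are. The sketch below uses the rounded values printed in the table and assumes that `std_test_score` is the population standard deviation over the folds.

```python
from statistics import mean, pstdev

# split0..split2 test scores of the top-ranked row (CatBoost)
split_scores = [0.755479, 0.750384, 0.763359]

cv_mean = mean(split_scores)
cv_std = pstdev(split_scores)  # population std, assumed to match std_test_score

# Gap between rank 1 and rank 2 (LogisticRegression, C=13, l2)
gap = 0.756407 - 0.755008

print(f"mean={cv_mean:.6f}, std={cv_std:.6f}, gap to rank 2={gap:.6f}")
```

The gap between the first and second place is several times smaller than the fold-to-fold spread, so the ranking of these two models should not be read as a decisive difference.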


Let's save the best model into a variable:

In [None]:
best_model = grid_search.best_estimator_

Because the CatBoostClassifier model takes much longer to train on average (see `mean_fit_time` above), we will also consider the second-ranked model, a LogisticRegression with these parameters:

In [None]:
second_best_params = results.iloc[1]['params']
second_best_params

{'models': LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42,
                    solver='liblinear'),
 'models__C': 13,
 'models__penalty': 'l2'}

In [None]:
second_best_model = LogisticRegression(random_state=RANDOM_STATE,
                                       solver='liblinear',
                                       max_iter=1000,
                                       class_weight='balanced',
                                       C=13,
                                       penalty='l2'
                                      )

In [None]:
pipe_final_second = Pipeline(
    [
        ('vect', TfidfVectorizer(stop_words=stopwords)),
        # reuse the estimator defined in the previous cell
        ('model_2', second_best_model)
    ]
)

In [None]:
pipe_final_second.fit(X_train, y_train)

## Testing

Let's check the prediction quality of the best model on test data:

In [None]:
y_test_pred = best_model.predict(X_test)

In [None]:
F1_score = f1_score(y_test, y_test_pred)  # argument order is (y_true, y_pred)
print(f'F1_score for the best model on test data: {F1_score:.2f}')

F1_score for the best model on test data: 0.77


Let's check the prediction quality of the second best model on the test data:

In [None]:
y_test_pred_second = pipe_final_second.predict(X_test)

In [None]:
F1_score = f1_score(y_test, y_test_pred_second)  # argument order is (y_true, y_pred)
print(f'F1_score for the second best model on test data: {F1_score:.2f}')

F1_score for the second best model on test data: 0.76


## Conclusions

Based on the GridSearchCV cross-validation results, the best model was selected, with these hyperparameters:

In [None]:
grid_search.best_params_

{'models': <catboost.core.CatBoostClassifier at 0x7fd2b2cbe970>,
 'models__learning_rate': 0.5}

The best model's F1 on cross-validation is 0.76, which meets the project requirement (at least 0.75); on the test sample it reaches 0.77.

The second-best model, a logistic regression with the parameters:

In [None]:
second_best_params

{'models': LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42,
                    solver='liblinear'),
 'models__C': 13,
 'models__penalty': 'l2'}

also meets the requirement, with F1 = 0.76 on the test sample, and trains much faster than the CatBoostClassifier model. Either model can be recommended, depending on whether prediction quality or training time matters more.
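
The difference in training time can be quantified from the `mean_fit_time` column of the cross-validation results. The sketch below only divides the two reported values. The timings were measured per fold while several jobs ran in parallel, so the ratio is an order-of-magnitude estimate rather than a benchmark.

```python
catboost_fit_s = 431.007029  # mean_fit_time of the top-ranked CatBoost candidate
logreg_fit_s = 15.660775     # mean_fit_time of the second-ranked LogisticRegression

ratio = catboost_fit_s / logreg_fit_s
print(f"CatBoost took about {ratio:.1f}x longer per fit")
```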