<h1>Content<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preparation" data-toc-modified-id="Preparation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Training" data-toc-modified-id="Training-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

# Comment classification model

The online store "Wikishop" is launching a new service: users can now edit and supplement product descriptions, as in wiki communities. That is, customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative (toxic). A dataset labeled for the toxicity of edits is available. The model must reach an *F1* score of at least 0.75.

**Work Plan**

1. Download and prepare data.
2. Train different models.
3. Draw conclusions.

**Data Description**

The data is in the `toxic_comments.csv` file. The *text* column contains the text of the comment, and *toxic* is the target attribute.

## Preparation

In [1]:
import numpy as np
import pandas as pd

import re

import spacy
from spacy.cli import download
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

from tqdm import tqdm
tqdm.pandas()

import time

import warnings
warnings.filterwarnings("ignore")

In [2]:
download('en_core_web_sm')

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
#tcomm = pd.read_csv('toxic_comments.csv')
tcomm = pd.read_csv('/datasets/toxic_comments.csv')

In [4]:
tcomm.info()
tcomm.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


Everything needed is imported and downloaded, and the data is loaded. There are no missing values (both columns have 159571 non-null entries), and no obvious artifacts are visible in the first rows. We can proceed to processing.

## Training

In [5]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [6]:
def lemmatization(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])

In [7]:
tcomm['text'] = tcomm['text'].progress_apply(lemmatization)

100%|██████████| 159571/159571 [15:55<00:00, 167.01it/s]


In [8]:
tcomm.info()
tcomm.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Unnamed: 0,text,toxic
0,explanation \n why the edit make under my user...,0
1,d'aww ! he match this background colour I be s...,0
2,"hey man , I be really not try to edit war . it...",0
3,""" \n More \n I can not make any real suggestio...",0
4,"you , sir , be my hero . any chance you rememb...",0
5,""" \n\n congratulation from I as well , use the...",0
6,COCKSUCKER before you pis around on my work,1
7,your vandalism to the Matt Shirvington article...,0
8,sorry if the word ' nonsense ' be offensive to...,0
9,alignment on this subject and which be contrar...,0


The text was lemmatized with the spaCy library (the `en_core_web_sm` model, with the parser and NER components disabled).

In [9]:
def clear_text(text):
    return ' '.join(re.sub(r'[^a-z]', ' ', text.lower()).split())

In [10]:
tcomm['text'] = tcomm['text'].progress_apply(clear_text)

100%|██████████| 159571/159571 [00:05<00:00, 30037.20it/s]


In [11]:
tcomm.info()
tcomm.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Unnamed: 0,text,toxic
0,explanation why the edit make under my usernam...,0
1,d aww he match this background colour i be see...,0
2,hey man i be really not try to edit war it be ...,0
3,more i can not make any real suggestion on imp...,0
4,you sir be my hero any chance you remember wha...,0
5,congratulation from i as well use the tool wel...,0
6,cocksucker before you pis around on my work,1
7,your vandalism to the matt shirvington article...,0
8,sorry if the word nonsense be offensive to you...,0
9,alignment on this subject and which be contrar...,0


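Regular expressions were used to clean the text: the text is lowercased, every character other than the Latin letters a-z is replaced with a space, and repeated whitespace is collapsed.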
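As a quick illustration of the cleaning step, the sketch below applies the same expression to a single string. The sample string and the variable names `sample` and `cleaned` are chosen here for illustration only.

```python
import re

def clear_text(text):
    # Lowercase, replace every character outside a-z with a space,
    # then collapse runs of whitespace into single spaces.
    return ' '.join(re.sub(r'[^a-z]', ' ', text.lower()).split())

# Illustrative input (an assumption, not a full row from the dataset)
sample = "D'aww! He matches this background colour"
cleaned = clear_text(sample)
print(cleaned)  # d aww he matches this background colour
```

Note that digits and any non-Latin characters are dropped as well, so this cleaning only suits English-language text.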

In [12]:
features = tcomm['text']
target = tcomm['toxic']

Let's split the data into the features and the target.

In [13]:
features_train, features_test, target_train, target_test = train_test_split(features, target, 
                                                                            test_size=0.2, random_state=12345)

In [14]:
print(f'Shape of features_train: {features_train.shape}, shape of target_train: {target_train.shape}')
print(f'Shape of features_test: {features_test.shape}, shape of target_test: {target_test.shape}')

Shape of features_train: (127656,), shape of target_train: (127656,)
Shape of features_test: (31915,), shape of target_test: (31915,)


The `train_test_split` function was used to split the data into training and test samples in a 4:1 ratio (80%:20%). The shape check passed.
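The reported sample sizes can be checked with simple arithmetic. The sketch below assumes that a fractional test size is rounded up and that the training sample receives the remaining rows; the variable names are illustrative.

```python
import math

n_samples = 159571   # number of rows reported by tcomm.info()
test_size = 0.2

# Assumption: the fractional test size is rounded up, and the
# training sample gets whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 127656 31915
```

Both numbers match the shapes printed for `features_train` and `features_test`.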

In [15]:
stop_words = list(en_stop)

The list of English stop words is prepared here so that it can be passed to the vectorizer.

In [16]:
%%time

my_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stop_words)),
    ('clf', 'passthrough')
])


parameters = [
    {
        'clf':[LinearSVC(class_weight='balanced', random_state=12345)],
        'clf__C':np.arange(0.5, 11.0, 0.5)
    },
    {
        'clf':[LogisticRegression(class_weight='balanced', random_state=12345)],
        'clf__C':np.arange(0.5, 11.0, 0.5)
    }
]

grid_search = GridSearchCV(my_pipeline, param_grid=parameters, cv=5, n_jobs=-1, scoring='f1', verbose=1)
grid_search.fit(features_train, target_train)

print(f'Best model parameters: {grid_search.best_params_}')

print(f'Best model F1: {grid_search.best_score_}')

Fitting 5 folds for each of 42 candidates, totalling 210 fits
Best model parameters: {'clf': LogisticRegression(C=6.0, class_weight='balanced', random_state=12345), 'clf__C': 6.0}
Best model F1: 0.7623229699542351
CPU times: user 28.5 s, sys: 20.4 s, total: 48.9 s
Wall time: 11min 46s
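The same search pattern (a `'passthrough'` placeholder step that the parameter grid swaps for concrete classifiers) can be tried in miniature. The sketch below is not the notebook's run: the toy corpus, the built-in `'english'` stop-word list, the two-value grid for `C`, and the explicit two-fold `StratifiedKFold` are all assumptions made to keep it small and fast.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus (an assumption, not taken from toxic_comments.csv);
# classes are interleaved so every fold contains both labels.
texts = [
    'thank you for the helpful edit', 'you are a stupid idiot',
    'great article nice work',        'shut up you moron',
    'please add a source for this',   'stupid idiot go away',
    'good catch fixed the typo',      'what a moron shut up',
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('clf', 'passthrough'),  # placeholder, replaced by the grid below
])

# Each dict is searched separately: 2 classifiers x 2 values of C
param_grid = [
    {'clf': [LinearSVC(class_weight='balanced', random_state=12345)],
     'clf__C': [0.5, 1.0]},
    {'clf': [LogisticRegression(class_weight='balanced', random_state=12345)],
     'clf__C': [0.5, 1.0]},
]

search = GridSearchCV(pipeline, param_grid=param_grid,
                      cv=StratifiedKFold(n_splits=2), scoring='f1')
search.fit(texts, labels)

n_candidates = len(search.cv_results_['params'])
print(n_candidates)  # 4
print(type(search.best_params_['clf']).__name__, search.best_params_['clf__C'])
```

With a corpus this small the winning classifier carries no meaning; the point is only how the candidate count arises (in the notebook, 2 classifiers x 21 values of `C` = 42 candidates, times 5 folds = 210 fits).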


## Conclusions

In [17]:
prediction_test = grid_search.best_estimator_.predict(features_test)

f1_test = f1_score(target_test, prediction_test)

print(f'F1 of the best model on the test sample: {f1_test}')

F1 of the best model on the test sample: 0.7615157064375264


For the best LogisticRegression model, F1 on the test sample (about 0.7615) is only slightly lower than the cross-validation score (about 0.7623) and still exceeds the threshold of 0.75.

Based on these results, the best LogisticRegression model (`C=6.0`, `class_weight='balanced'`) is recommended for finding toxic comments.
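For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it by hand on made-up labels; the helper name `f1_from_labels` and the label lists are illustrative assumptions and are not part of the notebook.

```python
def f1_from_labels(y_true, y_pred):
    # Count true positives, false positives and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

score = f1_from_labels(y_true, y_pred)
print(score)  # 0.75
```

Here precision and recall are both 3/4, so F1 is also 0.75, which happens to equal the project threshold.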