# Project for Wikishop

The online store Wikishop («Викишоп») is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [21]:
import re
import warnings

import nltk
import pandas as pd
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# NLTK resources used by the tokenizer, POS tagger, lemmatizer and stop-word list
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords', quiet=True)  # needed by stopwords.words('english') further down

# silences all warnings, including those raised by failed grid-search fits
warnings.filterwarnings('ignore')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dimson\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dimson\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dimson\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


In [2]:
try:
    data = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    # fall back to the local copy when the hosted path is unavailable
    data = pd.read_csv('C:/Study/Yandex_Practicum/Negative comments/toxic_comments.csv')

In [3]:
data.head(5)

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [5]:
data = data.drop('Unnamed: 0', axis=1)  # drop the column, since it carries no information

In [6]:
data.head(5)

,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [7]:
data['toxic'].value_counts(normalize=True)

0    0.898388
1    0.101612
Name: toxic, dtype: float64

Conclusion: the classes are imbalanced, with only about 10% of the comments labeled toxic.
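
With roughly 90% of comments non-toxic, accuracy alone would be misleading, which is presumably why F1 on the toxic class is the target metric. The sketch below illustrates this in plain Python. The helper name `f1_from_counts` and the 1,000-comment toy sample are assumptions made for the example, not part of the dataset or of any library.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # no correct positive predictions: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical sample that mirrors the class shares above: 102 toxic out of 1000
n_total, n_toxic = 1000, 102

# A "classifier" that labels every comment as non-toxic
accuracy_all_negative = (n_total - n_toxic) / n_total
f1_all_negative = f1_from_counts(tp=0, fp=0, fn=n_toxic)

print(accuracy_all_negative)  # 0.898
print(f1_all_negative)        # 0.0
```

A model that never flags anything looks about 90% accurate here, yet its F1 is zero, so F1 is the more informative number for this task.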

Let's create functions for cleaning and lemmatizing the text.

In [8]:
def clear_text(text):
    # keep only Latin letters and apostrophes; everything else becomes a space
    cleaned = re.sub(r"[^a-zA-Z']", ' ', text)
    # collapse runs of whitespace into single spaces
    return " ".join(cleaned.split())
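
To make the effect of the cleaning step concrete, here is a standalone copy of `clear_text` applied to a made-up comment. The sample string is an assumption for illustration only.

```python
import re

def clear_text(text):
    # keep only Latin letters and apostrophes; everything else becomes a space
    cleaned = re.sub(r"[^a-zA-Z']", ' ', text)
    # collapse runs of whitespace into single spaces
    return " ".join(cleaned.split())

sample = "Hey man,\nI'm 100% sure!!! See https://example.org"
print(clear_text(sample))  # Hey man I'm sure See https example org
```

Digits, punctuation and line breaks are removed, while letter case is preserved. `TfidfVectorizer` lowercases text by default, so case is handled later at the vectorization step.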

In [9]:
data['lemm_text'] = data['text'].apply(clear_text)
data

,text,toxic,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,D'aww He matches this background colour I'm se...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I'm really not trying to edit war It's...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can't make any real suggestions on impr...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...
...,...,...,...
159287,""":::::And for the second time of asking, when ...",0,And for the second time of asking when your vi...
159288,You should be ashamed of yourself \n\nThat is ...,0,You should be ashamed of yourself That is a ho...
159289,"Spitzer \n\nUmm, theres no actual article for ...",0,Spitzer Umm theres no actual article for prost...
159290,And it looks like it was actually you who put ...,0,And it looks like it was actually you who put ...


The lemmatization cell below took 69 minutes to run.

In [22]:
def get_wordnet_pos(word):
    # map the Penn Treebank tag of a single word to a WordNet POS (default: noun)
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # tokenize, then lemmatize each token using its POS tag
    word_list = nltk.word_tokenize(text)
    return ' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in word_list])

data['lemm_text'] = data['lemm_text'].apply(lemmatize_text)
data.head()


,text,toxic,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits make under my userna...
1,D'aww! He matches this background colour I'm s...,0,D'aww He match this background colour I 'm see...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I 'm really not try to edit war It 's ...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I ca n't make any real suggestion on impr...
4,"You, sir, are my hero. Any chance you remember...",0,You sir be my hero Any chance you remember wha...
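
The long runtime of the lemmatization cell is likely due to `get_wordnet_pos` calling `nltk.pos_tag` once for every word of every comment, even though the same words recur constantly. One possible remedy is to cache results per distinct token. The sketch below shows only the caching idea with the standard library. `make_cached_lemmatizer` and `toy_lemmatize` are hypothetical names, and the toy function merely stands in for the real NLTK lemmatizer.

```python
from functools import lru_cache

def make_cached_lemmatizer(lemmatize_word, tokenize=str.split, maxsize=None):
    """Build a text lemmatizer that processes each distinct token only once."""
    cached_lemmatize = lru_cache(maxsize=maxsize)(lemmatize_word)

    def lemmatize(text):
        return ' '.join(cached_lemmatize(token) for token in tokenize(text))

    lemmatize.cache_info = cached_lemmatize.cache_info  # expose hit/miss statistics
    return lemmatize

# Toy stand-in for the real lemmatizer (an assumption for the demo): strip a trailing "s"
calls = []
def toy_lemmatize(word):
    calls.append(word)
    return word[:-1] if word.endswith('s') else word

cached_lemmatize_text = make_cached_lemmatizer(toy_lemmatize)
print(cached_lemmatize_text("cats chase cats"))  # cat chase cat
print(calls)                                     # ['cats', 'chase']
```

In the notebook, `lemmatize_word` could be `lambda w: lemmatizer.lemmatize(w, get_wordnet_pos(w))` and `tokenize` could be `nltk.word_tokenize`. How much time this would save has not been measured here.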


Let's split the data into three samples: training (60%), validation (20%) and test (20%).

In [23]:
features = data['lemm_text'].values
target = data['toxic'].values

train_valid_features, test_features, train_valid_target, test_target = train_test_split(features, target, test_size=0.2, random_state=12345)

In [24]:
train_features, valid_features, train_target, valid_target = train_test_split(train_valid_features, train_valid_target, test_size=0.25, random_state=12345)
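
The two calls above first hold out 20% of the rows for testing and then 25% of the remainder for validation, which gives roughly a 60/20/20 split. The arithmetic can be checked with a small sketch. `split_sizes` is a hypothetical helper that assumes the held-out part is rounded up, which, to my understanding, is how `train_test_split` rounds a fractional `test_size`.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the held-out part up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_total = 159292                                     # rows reported by data.info()
n_train_valid, n_test = split_sizes(n_total, 0.2)    # first split: 80% / 20%
n_train, n_valid = split_sizes(n_train_valid, 0.25)  # second split: 75% / 25% of the remainder

print(n_train, n_valid, n_test)  # 95574 31859 31859
```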

Let's vectorize the text with TF-IDF, fitting the vectorizer on the training sample only.

In [25]:
stopwordss = list(set(stopwords.words('english')))

count_tf_idf = TfidfVectorizer(stop_words = stopwordss)
train_features = count_tf_idf.fit_transform(train_features)
test_features = count_tf_idf.transform(test_features)
valid_features = count_tf_idf.transform(valid_features)
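
The vectorizer is fitted on the training sample and only applied to the validation and test samples, so the vocabulary and the IDF weights never see held-out text. The sketch below is a simplified pure-Python illustration of that fit/transform split. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization, which is what `TfidfVectorizer` does by default as far as I know, and it leaves out tokenization details, lowercasing and stop words. The function names and the toy documents are assumptions.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """Weight term counts by IDF and L2-normalize; terms unseen during fit are ignored."""
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

train_docs = ["you are great", "you are rude"]
idf = fit_idf(train_docs)

# "awful" never occurred in the training documents, so it gets no weight
vec = transform("you are awful rude", idf)
print(sorted(vec))  # ['are', 'rude', 'you']
```

The rarer term `rude` receives a larger weight than `you` or `are`, which appear in every training document.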


## Training

### Logistic regression

In [26]:
%%time
lr_model = LogisticRegression()
# NOTE: each tuple is a set of three candidate values, not a (start, stop, step) range;
# C=0 is not a valid regularization strength, so fits that use it fail
hyperparams = [{'C': (0, 10, 0.5),
                'class_weight': ['balanced'],
                'max_iter': (50, 200, 50)}]
clf = GridSearchCV(lr_model, hyperparams, scoring='f1', cv=3)
clf.fit(train_features, train_target)
print("Best model parameters:", clf.best_params_)
print()
print('F1:', clf.best_score_)

Best model parameters: {'C': 10, 'class_weight': 'balanced', 'max_iter': 50}

F1: 0.7573909608656693
CPU times: total: 45.4 s
Wall time: 41 s
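
The grid in the cell above passes tuples such as `(0, 10, 0.5)`. `GridSearchCV` treats any sequence as a list of candidate values, so these are three discrete settings rather than a start/stop/step range, which matches the reported best parameters. The sketch below enumerates the combinations in plain Python and shows one way to build real ranges. `float_range` and the range bounds are illustrative assumptions, not the grid used for the results in this notebook.

```python
from itertools import product

# The grid as written in the cell above: every tuple is a list of candidates
grid_as_written = {'C': (0, 10, 0.5), 'max_iter': (50, 200, 50)}
combos = list(product(*grid_as_written.values()))
print(len(combos), len(set(combos)))  # 9 6

def float_range(start, stop, step):
    """Values start, start + step, ... below stop; assumes (stop - start) is a multiple of step."""
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n_steps)]

# Hypothetical grid that actually sweeps ranges; C starts above 0
grid_as_ranges = {'C': float_range(0.5, 10.5, 0.5),
                  'max_iter': list(range(50, 250, 50))}
print(grid_as_ranges['C'][:3], grid_as_ranges['C'][-1])  # [0.5, 1.0, 1.5] 10.0
print(grid_as_ranges['max_iter'])                        # [50, 100, 150, 200]
```

Only six of the nine combinations are distinct, because `50` appears twice among the `max_iter` candidates. A full sweep over ranges would be considerably more expensive to run.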


In [27]:
# refit with the selected C; note that max_iter is 200 here, not the 50 chosen by the grid search
lr_model = LogisticRegression(C=10, class_weight='balanced', max_iter=200).fit(train_features, train_target)
pred_lr = lr_model.predict(valid_features)
print('F1 on the validation data:', f1_score(valid_target, pred_lr))


F1 on the validation data: 0.7502165752237944


### CatBoost classifier

In [28]:
%%time
cat_model = CatBoostClassifier()
# NOTE: each tuple lists three candidate values, not a (start, stop, step) range,
# so learning_rate=3 is tried as well (hence the warnings in the log)
params = [{'eval_metric': ["F1"],
          'iterations': (50, 200, 50),
          'max_depth': (5, 20, 1),
          'learning_rate': (0.1, 3, 0.1),
          'random_state': [12345]}]
grid = GridSearchCV(cat_model, params, cv=3, scoring='f1')
grid.fit(train_features, train_target, verbose=10)
print("Best model parameters:", grid.best_params_)
print()
print('F1:', grid.best_score_)

0:	learn: 0.3750779	total: 544ms	remaining: 26.7s
10:	learn: 0.4785947	total: 4.49s	remaining: 15.9s
20:	learn: 0.5104297	total: 8.42s	remaining: 11.6s
30:	learn: 0.5387190	total: 12.3s	remaining: 7.54s
40:	learn: 0.5623971	total: 16.3s	remaining: 3.57s
49:	learn: 0.5772103	total: 19.5s	remaining: 0us
[... similar training logs repeat for the other folds and parameter combinations ...]

learning rate is greater than 1. You probably need to decrease learning rate.
Training has stopped (degenerate solution on iteration 4, probably too small l2-regularization, try to increase it)
[... the two messages above repeat many times ...]

[... output truncated ...]

In [29]:
cat_model1 = CatBoostClassifier(eval_metric="F1", iterations=200, max_depth=6, learning_rate=0.1, random_state=12345).fit(train_features, train_target, verbose=2)
pred_cb = cat_model1.predict(valid_features)
print('CatBoost classifier F1 on the validation data:', f1_score(valid_target, pred_cb))

0:	learn: 0.3513993	total: 910ms	remaining: 3m 1s
2:	learn: 0.4644113	total: 2.52s	remaining: 2m 45s
4:	learn: 0.4608519	total: 4.18s	remaining: 2m 42s
[... output truncated ...]

### LightGBM classifier

In [30]:
# the cell takes 40 minutes to run, so the values that were tried are noted in the comments
LightGBM_model = LGBMClassifier()
hyperparams = [{'num_leaves': [20],        # tried: (20, 50, 5)
                'learning_rate': [0.1],    # tried: (0.1, 1, 0.1)
                'n_estimators': [1000],    # tried: (50, 1000, 50)
                'random_state': [12345]}]
clf = GridSearchCV(LightGBM_model, hyperparams, scoring='f1', cv=3)
clf.fit(train_features, train_target)
print("Best model parameters:")
print()
LGBM_best_params = clf.best_params_
print(LGBM_best_params)
print()
print('F1:', clf.best_score_)

Best model parameters:

{'learning_rate': 0.1, 'n_estimators': 1000, 'num_leaves': 20, 'random_state': 12345}

F1: 0.7605102024783673


In [31]:
lgbm = LGBMClassifier(num_leaves=20, learning_rate=0.1, n_estimators=1000, random_state=12345).fit(train_features, train_target)
pred_lg = lgbm.predict(valid_features)
print('LGBM classifier F1 on the validation data:', f1_score(valid_target, pred_lg))

LGBM classifier F1 on the validation data: 0.7657360846193862


The best model is LGBMClassifier, with F1 = 0.766 on the validation data.

## Conclusions

Let's check the best model on the test data.

In [32]:
final_model = LGBMClassifier(num_leaves=20, learning_rate=0.1, n_estimators=1000, random_state=12345).fit(train_features, train_target)
pred = final_model.predict(test_features)
print('F1 on the test data:', f1_score(test_target, pred))

F1 on the test data: 0.7711761649055949


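Conclusion:

In this project I cleaned, tokenized and lemmatized the comment text, vectorized it with TF-IDF, and compared logistic regression, CatBoost and LightGBM. The most suitable model, LGBMClassifier, reached F1 = 0.771 on the test data, which is above the required 0.75.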