<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Функции-лематизации-и-очистки-текста" data-toc-modified-id="Функции-лематизации-и-очистки-текста-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Lemmatization and text-cleaning functions</a></span></li><li><span><a href="#Подготовка-данных" data-toc-modified-id="Подготовка-данных-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Data preparation</a></span><ul class="toc-item"><li><span><a href="#Почистим-переменные" data-toc-modified-id="Почистим-переменные-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Clean up variables</a></span></li></ul></li></ul></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>LogisticRegression</a></span></li><li><span><a href="#CatBoostClassifier" data-toc-modified-id="CatBoostClassifier-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>CatBoostClassifier</a></span></li></ul></li><li><span><a href="#Лучшая-модель" data-toc-modified-id="Лучшая-модель-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Best model</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

# Project for an online store

An online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

The task is to train a model that classifies comments as positive or negative. A dataset labeled for the toxicity of edits is available.

The model's *F1* score must be at least 0.75.

**Data description**

The data are in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [None]:
#pip install -U catboost

In [None]:
import pandas as pd
import numpy as np

import re
import nltk
from nltk.corpus import stopwords as nltk_stopwords
import spacy
from tqdm import notebook

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import f1_score

from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier, Pool, cv

In [None]:
data = pd.read_csv('/datasets/toxic_comments.csv', index_col=0)

In [None]:
data.info()  # info() prints its report itself and returns None, so it is not wrapped in display()
display(data.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


| | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


### Lemmatization and text-cleaning functions

In [None]:
nlp = spacy.load('en_core_web_sm', disable = ['ner', 'parser'])

In [None]:
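def clear_text(text):
    # Keep only Latin letters and apostrophes; every other character becomes a space.
    cleaned = re.sub(r"[^a-zA-Z']", ' ', text)
    # Collapse runs of whitespace into single spaces.
    cleaned = " ".join(cleaned.split())
    return cleaned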
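
To see what this cleaning step keeps and drops, here is a small self-contained check that uses the same regular expression. The sample string is made up for illustration and does not come from the dataset.

```python
import re


def clear_text(text):
    # Replace everything except Latin letters and apostrophes with a space.
    cleaned = re.sub(r"[^a-zA-Z']", " ", text)
    # Collapse repeated whitespace into single spaces.
    return " ".join(cleaned.split())


sample = "Hey!!  You're WRONG...\n(see rule #3)"
print(clear_text(sample))  # prints: Hey You're WRONG see rule
```

Punctuation, digits and line breaks disappear, while letter case and apostrophes survive. Lowercasing is left to a later step.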

### Data preparation

In [None]:
%%time

data['text'] = data['text'].apply(clear_text)

CPU times: user 7.69 s, sys: 119 ms, total: 7.81 s
Wall time: 12.3 s


In [None]:
%%time

lemm_texts = []

for doc in notebook.tqdm(nlp.pipe(data['text'].values, disable = ['ner', 'parser'], n_process=-1, batch_size=512), total=data.shape[0]):
    lemm_text = " ".join([i.lemma_ for i in doc])
    lemm_texts.append(lemm_text)


CPU times: user 6min 4s, sys: 5.96 s, total: 6min 10s
Wall time: 18min 27s


In [None]:
data['lemm_texts'] = lemm_texts

In [None]:
data['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64
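
The counts above show a strong class imbalance, which is why the project is judged on F1 rather than accuracy. The arithmetic below is a small sketch that uses only those two counts. The two "baseline" classifiers are hypothetical and appear nowhere in the project.

```python
# Class counts copied from the value_counts() output above.
n_negative = 143106
n_positive = 16186
n_total = n_negative + n_positive

toxic_share = n_positive / n_total

# Hypothetical baseline 1: label every comment as non-toxic.
# Accuracy looks high, but there are no true positives, so F1 is 0.
accuracy_all_negative = n_negative / n_total

# Hypothetical baseline 2: label every comment as toxic.
precision = n_positive / n_total  # only the truly toxic share is correct
recall = 1.0                      # every toxic comment is caught
f1_all_positive = 2 * precision * recall / (precision + recall)

print(f"toxic share:              {toxic_share:.4f}")
print(f"accuracy, all non-toxic:  {accuracy_all_negative:.4f}")
print(f"F1, all toxic:            {f1_all_positive:.4f}")
```

Both trivial strategies fall far short of the 0.75 threshold on F1, even though the first reaches roughly 90% accuracy.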

In [None]:
features_train, features_test, target_train, target_test = train_test_split(data['lemm_texts'], data['toxic'],
                                                                            test_size=0.1,
                                                                            random_state=12345,
                                                                            stratify=data['toxic']
                                                                           )

In [None]:
# CatBoost receives the cleaned, non-lemmatized text; rows match the split above by index label.
train_mask = data.index.isin(features_train.index)
test_mask = data.index.isin(features_test.index)

features_train_cb = data.loc[train_mask, 'text']
features_test_cb = data.loc[test_mask, 'text']

target_train_cb = data.loc[train_mask, 'toxic']
target_test_cb = data.loc[test_mask, 'toxic']

In [None]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))

In [None]:
tf_idf_train = count_tf_idf.fit_transform(features_train)
tf_idf_test = count_tf_idf.transform(features_test)
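
The fit-on-train, transform-on-test pattern above can be illustrated without any library. The sketch below is a simplified pure-Python TF-IDF that uses the smoothed formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which scikit-learn documents as its default. The helpers `fit_idf` and `transform` and the toy documents are hypothetical, and stop-word removal and lowercasing are left out.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def transform(doc, idf):
    """Map one document to an L2-normalized TF-IDF vector, stored as a dict."""
    # Terms never seen during fitting have no IDF weight and are dropped.
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


train_docs = ["good edit thanks", "bad edit", "good source"]
idf = fit_idf(train_docs)

test_vector = transform("bad unknown edit", idf)
print(test_vector)
```

The word `unknown` is absent from the training vocabulary, so it contributes nothing to the test vector. The test set therefore cannot influence the learned weights.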

#### Clean up variables

In [None]:
del stopwords, count_tf_idf, lemm_texts, features_train, features_test

## Training

### LogisticRegression

In [None]:
model = LogisticRegression(random_state=12345, class_weight='balanced')

In [None]:
print('Mean F1 score:', cross_val_score(estimator=model, X=tf_idf_train, y=target_train, scoring='f1').mean())

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Mean F1 score: 0.7522113875586871
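
The warnings above mean the solver stopped at its iteration limit (`max_iter` defaults to 100 in scikit-learn) before converging. One possible remedy, sketched below on an invented toy corpus, is to raise `max_iter`. The sketch also wraps the vectorizer and the classifier in a pipeline, so that each cross-validation fold fits TF-IDF only on its own training part. The value 1000 and the two-fold split are assumptions chosen for illustration. This variant was not run on the project data, so its effect on the reported F1 is unknown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented toy corpus, for illustration only (1 = toxic, 0 = not toxic).
texts = [
    "you are a stupid idiot", "what a dumb worthless edit",
    "shut up you fool", "idiot stop editing",
    "thanks for fixing the article", "please add a source here",
    "great edit thank you", "I agree with this change",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refit inside every fold, so validation texts never shape the vocabulary.
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=12345),
)

scores = cross_val_score(pipeline, texts, labels, scoring="f1", cv=2)
print("F1 per fold:", scores)
```

On the real data the same pipeline object could be passed to `cross_val_score` with the lemmatized training texts in place of the toy list.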


### CatBoostClassifier

In [None]:
params = {
    "iterations": 700,
    "learning_rate": 0.1,
    "eval_metric": 'F1',
    "loss_function": 'Logloss'
}

In [None]:
cv_dataset = Pool(
        features_train_cb.to_frame('text'),
        target_train_cb,
        text_features=['text']
    )

In [None]:
scores = cv(cv_dataset,
            params,
            fold_count=3,
            shuffle=True,
            stratified=True,
            verbose=100,
            plot=True)


Training on fold [0/3]
0:	learn: 0.6673802	test: 0.6858447	best: 0.6858447 (0)	total: 630ms	remaining: 7m 20s
100:	learn: 0.7330467	test: 0.7354260	best: 0.7354260 (100)	total: 49.8s	remaining: 4m 55s
200:	learn: 0.7531792	test: 0.7403635	best: 0.7408607 (193)	total: 1m 28s	remaining: 3m 40s
300:	learn: 0.7645766	test: 0.7479430	best: 0.7480881 (299)	total: 2m 8s	remaining: 2m 50s
400:	learn: 0.7755974	test: 0.7480014	best: 0.7499132 (366)	total: 2m 53s	remaining: 2m 9s
500:	learn: 0.7861012	test: 0.7474771	best: 0.7499132 (366)	total: 3m 33s	remaining: 1m 24s
600:	learn: 0.7941379	test: 0.7484335	best: 0.7499132 (366)	total: 4m 13s	remaining: 41.7s
699:	learn: 0.8001378	test: 0.7514196	best: 0.7514196 (699)	total: 4m 52s	remaining: 0us

bestTest = 0.7514196315
bestIteration = 699

Training on fold [1/3]
0:	learn: 0.6688139	test: 0.6920195	best: 0.6920195 (0)	total: 368ms	remaining: 4m 17s
100:	learn: 0.7343831	test: 0.7361795	best: 0.7363720 (98)	total: 41.5s	remaining: 4m 6s
200:	lea

In [None]:
print('Mean F1 score:', scores.mean())

Mean F1 score: iterations            349.500000
test-F1-mean            0.746177
test-F1-std             0.005233
train-F1-mean           0.764392
train-F1-std            0.001612
test-Logloss-mean       0.129461
test-Logloss-std        0.001470
train-Logloss-mean      0.122150
train-Logloss-std       0.000331
dtype: float64
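
`cv` returns a table with one row per boosting iteration, so `scores.mean()` averages every column over all iterations. That is why `iterations` shows 349.5 above, and the 0.746177 is an average over the whole training curve rather than the score of the final or best model. The sketch below uses invented toy values in a table shaped like that result to contrast the three readings; the column names follow the output above.

```python
import pandas as pd

# Invented toy values, shaped like the per-iteration table returned by catboost.cv.
scores = pd.DataFrame({
    "iterations": [0, 1, 2, 3],
    "test-F1-mean": [0.68, 0.73, 0.75, 0.74],
})

# What scores.mean() reports for this column: an average over the training curve.
mean_over_iterations = scores["test-F1-mean"].mean()

# Cross-validated F1 after the last boosting iteration.
final_f1 = scores["test-F1-mean"].iloc[-1]

# Row of the iteration with the highest cross-validated F1.
best_row = scores.loc[scores["test-F1-mean"].idxmax()]

print("averaged over iterations:", round(mean_over_iterations, 4))
print("final iteration:", final_f1)
print("best:", best_row["test-F1-mean"], "at iteration", int(best_row["iterations"]))
```

When comparing against another model's cross-validation score, the final or best row is usually the more comparable figure.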


## Best model

Results of the trained models:
- LogisticRegression
  - F1 = 0.7522113875586871 (mean over the cross-validation folds)

- CatBoostClassifier
  - F1 = 0.746177 (`test-F1-mean`, averaged over all 700 iterations)

LogisticRegression achieved the better result.

An advantage of CatBoost is that the text data need no additional processing.

In [None]:
model.fit(tf_idf_train, target_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
predicted_test = model.predict(tf_idf_test)

In [None]:
print('Final F1 score:', f1_score(target_test, predicted_test))

Final F1 score: 0.7612061939690302


## Conclusions

The goal of the project was to build a model that classifies comments as toxic or non-toxic.
We loaded and prepared the data:
- added a new column, `lemm_texts`, containing the text after:
    - cleaning
    - lemmatization

We trained two models:
- LogisticRegression
    - F1 = 0.7522113875586871

- CatBoostClassifier
    - F1 = 0.746177

The task is therefore complete: LogisticRegression meets the requirement, scoring F1 = 0.7612 on the test set against a threshold of 0.75.