<h1>Содержание<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Подготовка</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Обучение</a></span></li><li><span><a href="#Тестирование-лучшей-модели" data-toc-modified-id="Тестирование-лучшей-модели-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Тестирование лучшей модели</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Выводы</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Чек-лист проверки</a></span></li></ul></div>

# Проект для «Викишоп»

Интернет-магазин «Викишоп» запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

Требуется обучить модель классифицировать комментарии на позитивные и негативные. В нашем распоряжении набор данных с разметкой о токсичности правок. Значением метрики качества *F1* должно быть не меньше 0.75. 

Данные находятся в файле `toxic_comments.csv`. Столбец *text* в нём содержит текст комментария, а *toxic* — целевой признак.

## Imports

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score
import numpy as np

import re
from tqdm import notebook 
from tqdm.notebook import tqdm
tqdm.pandas()


from nltk import pos_tag
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')

wnl = WordNetLemmatizer()

from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from catboost import  CatBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


## Functions

In [None]:
def penn2morphy(penntag):
    """ Converts Penn Treebank tags to WordNet. """
    morphy_tag = {'NN':'n', 'JJ':'a',
                  'VB':'v', 'RB':'r'}
    try:
        return morphy_tag[penntag[:2]]
    except:
        return 'n' 

def lemmatize_sent(text): 
    """ Text input is string, returns lowercased strings. """
    return [wnl.lemmatize(word.lower(), pos=penn2morphy(tag)) 
            for word, tag in pos_tag(word_tokenize(text))]

def lemmatize(text):
    return " ".join(lemmatize_sent(text))

def clear_text(text):
    return ' '.join(re.sub(r'[^a-zA-z ]', ' ', text).split())

In [2]:
lemmatize_sent('He is walking to school')

['he', 'be', 'walk', 'to', 'school']

In [3]:
clear_text('He +++5565 is walking to school')

'He is walking to school'

## Loading dataset

In [4]:
try:
    data = pd.read_csv('/datasets/toxic_comments.csv')
except:
    data = pd.read_csv(r'C:\Users\yaros\Новая папка\toxic_comments.csv')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [6]:
data.head()

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


In [7]:
data['toxic'].mean()

0.10161213369158527

## Lemmatizing text

In [8]:
data['lemm_text'] = data['text'].progress_apply(clear_text)

  0%|          | 0/159292 [00:00<?, ?it/s]

In [9]:
data['lemm_text'] = data['lemm_text'].progress_apply(lemmatize)

  0%|          | 0/159292 [00:00<?, ?it/s]

In [10]:
data['text'][0]

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

In [11]:
data['lemm_text'][0]

'explanation why the edits make under my username hardcore metallica fan be revert they weren t vandalism just closure on some gas after i vote at new york doll fac and please don t remove the template from the talk page since i m retire now'

## Features, target

In [12]:
features = data['lemm_text']
target = data['toxic']

In [13]:
features_train, features_test, target_train, target_test = train_test_split(
features, target, test_size=0.1, random_state=12345)

## English stopwords

In [14]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Training

In [15]:
model = DummyClassifier(strategy='stratified')

print(cross_val_score(model, features_train, target_train, scoring = 'f1').mean())

0.1017495945583284


In [16]:
model = LogisticRegression(max_iter=200)

pipeline = Pipeline(
    [
        ("vect", count_tf_idf),
        ("model", model),
    ]
)


parameters = {'model__C':(5, 10, 15)}
grid_lr = GridSearchCV(pipeline, parameters, scoring='f1', cv=3)
grid_lr.fit(features_train, target_train)

print(grid_lr.best_score_)

grid_lr.best_params_


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

0.7669076326408172


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


{'model__C': 15}

In [17]:
model = RandomForestClassifier(random_state=12345)

pipeline = Pipeline(
    [
        ("vect", count_tf_idf),
        ("model", model),
    ]
)

parameters = {'model__max_depth':(1, 5, 10), 'model__n_estimators':(1, 10, 20, 30)}
grid_forest = GridSearchCV(pipeline, parameters, scoring='f1', cv=3)
grid_forest.fit(features_train, target_train)

print(grid_forest.best_score_)

grid_forest.best_params_

0.08925632762047625


{'model__max_depth': 10, 'model__n_estimators': 1}

In [18]:
model = CatBoostClassifier()

pipeline = Pipeline(
    [
        ("vect", count_tf_idf),
        ("model", model),
    ]
)

parameters = {'model__depth':(1, 5, 10), 'model__n_estimators':(10, 15, 20)}
grid_cat = GridSearchCV(pipeline, parameters, scoring='f1', cv=3)
grid_cat.fit(features_train, target_train)

print(grid_cat.best_score_)

grid_cat.best_params_


Learning rate set to 0.5
0:	learn: 0.3592494	total: 545ms	remaining: 4.91s
1:	learn: 0.2967167	total: 1.02s	remaining: 4.06s
2:	learn: 0.2764759	total: 1.5s	remaining: 3.51s
3:	learn: 0.2684851	total: 1.99s	remaining: 2.98s
4:	learn: 0.2627348	total: 2.48s	remaining: 2.48s
5:	learn: 0.2567165	total: 2.98s	remaining: 1.99s
6:	learn: 0.2513139	total: 3.45s	remaining: 1.48s
7:	learn: 0.2470820	total: 3.91s	remaining: 978ms
8:	learn: 0.2430194	total: 4.39s	remaining: 487ms
9:	learn: 0.2406573	total: 4.85s	remaining: 0us
Learning rate set to 0.5
0:	learn: 0.3647015	total: 479ms	remaining: 4.31s
1:	learn: 0.3000873	total: 956ms	remaining: 3.83s
2:	learn: 0.2791163	total: 1.41s	remaining: 3.3s
3:	learn: 0.2705679	total: 1.87s	remaining: 2.81s
4:	learn: 0.2651159	total: 2.34s	remaining: 2.34s
5:	learn: 0.2588268	total: 2.8s	remaining: 1.86s
6:	learn: 0.2544001	total: 3.26s	remaining: 1.4s
7:	learn: 0.2507501	total: 3.72s	remaining: 931ms
8:	learn: 0.2476708	total: 4.18s	remaining: 464ms
9:	lea

{'model__depth': 10, 'model__n_estimators': 20}

In [19]:
model = LGBMClassifier()

pipeline = Pipeline(
    [
        ("vect", count_tf_idf),
        ("model", model),
    ]
)

parameters = {'model__max_depth':(1, 5, 10), 'model__n_estimators':(10, 15, 20)}
grid_lgbm = GridSearchCV(pipeline, parameters, scoring='f1', cv=3)
grid_lgbm.fit(features_train, target_train)

print(grid_lgbm.best_score_)

grid_lgbm.best_params_

0.5542505168542124


{'model__max_depth': 10, 'model__n_estimators': 20}

## Best model

In [18]:
model = grid_lr.best_estimator_

predictions = model.predict(features_test)
print(f1_score(target_test, predictions))

0.776973457428473


## Conclusions

Удалось добится значения F1 почти в 0.777 используя логистическую регрессию, что соответствует требованиям заказчика (F1 > 0.75).