<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#LightGBM" data-toc-modified-id="LightGBM-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>LightGBM</a></span></li><li><span><a href="#Древо-решений" data-toc-modified-id="Древо-решений-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Decision tree</a></span></li><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>LogisticRegression</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span><ul class="toc-item"><li><span><a href="#LightGBM" data-toc-modified-id="LightGBM-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>LightGBM</a></span></li><li><span><a href="#Древо-решений" data-toc-modified-id="Древо-решений-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Decision tree</a></span></li><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>LogisticRegression</a></span></li></ul></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop («Викишоп»)

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

Load the libraries.

In [43]:
import pandas as pd
import re

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import f1_score, confusion_matrix

from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [44]:
SEED = 666

Load the data.

In [45]:
df = pd.read_csv('/datasets/toxic_comments.csv')

Take a look at the data.

In [46]:
display(df.head(10))

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


Drop the redundant `Unnamed: 0` column.

In [48]:
df = df.drop(['Unnamed: 0'], axis=1)

Check the class balance of the target.

In [49]:
display(df['toxic'].value_counts())

0    143106
1     16186
Name: toxic, dtype: int64
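The counts above show a strongly imbalanced target. A minimal sketch of the arithmetic, with both counts copied from the output above; the variable names are illustrative and do not come from the notebook:

```python
# Class counts copied from the value_counts() output above
non_toxic = 143106
toxic = 16186

total = non_toxic + toxic
toxic_share = toxic / total          # fraction of toxic comments
imbalance_ratio = non_toxic / toxic  # non-toxic comments per toxic one

print(f"total: {total}")
print(f"toxic share: {toxic_share:.2%}")
print(f"non-toxic per toxic: {imbalance_ratio:.1f}")
```

Roughly one comment in ten is toxic, so plain accuracy would look good even for a model that never flags anything. This is consistent with the task asking for F1.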

A function for cleaning the comments.

In [50]:
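def clear_text(text):
    # Lowercase, replace every non-letter character with a space,
    # then collapse runs of whitespace into single spaces
    text = text.lower()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = ' '.join(text.split())

    return text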

Apply the function.

In [51]:
df['text'] = df['text'].apply(clear_text) 
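To see what the cleaning step does to a single comment, here is a small self-contained sketch. It repeats the same three steps as `clear_text`; the sample string is taken from the data preview and shortened.

```python
import re

def clear_text(text):
    # Same steps as the notebook function: lowercase, keep letters only, squeeze spaces
    text = text.lower()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(text.split())

sample = "D'aww! He matches this background colour"
cleaned = clear_text(sample)
print(cleaned)
```

Note that apostrophes are replaced by spaces as well, so contractions such as "I'm" become two tokens, "i m". This is visible in the cleaned data shown later.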

In [52]:
wlm = WordNetLemmatizer()

In [53]:
pos_tag_map = {
    'N': wordnet.NOUN,
    'V': wordnet.VERB,
    'R': wordnet.ADV,
    'J': wordnet.ADJ
}

In [54]:
def get_wordnet_pos(tag):
    return pos_tag_map.get(tag[0].upper(), wordnet.NOUN)
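`nltk.pos_tag` returns Penn Treebank tags such as `NNS` or `VBD`, while the WordNet lemmatizer expects one of four part-of-speech codes. The sketch below shows how the mapping above behaves. It is an illustration only: the WordNet constants are written as plain strings (an assumption made so that the example runs without downloading NLTK data).

```python
# WordNet POS codes as plain strings so the sketch runs without NLTK data;
# they correspond to wordnet.NOUN, wordnet.VERB, wordnet.ADV and wordnet.ADJ.
NOUN, VERB, ADV, ADJ = 'n', 'v', 'r', 'a'

pos_tag_map = {'N': NOUN, 'V': VERB, 'R': ADV, 'J': ADJ}

def get_wordnet_pos(tag):
    # Only the first letter of the Penn Treebank tag is used;
    # tags outside the map (determiners, pronouns, ...) fall back to noun
    return pos_tag_map.get(tag[0].upper(), NOUN)

for tag in ['NNS', 'VBD', 'RB', 'JJR', 'DT']:
    print(tag, '->', get_wordnet_pos(tag))
```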

In [55]:
def lemmatize_text(text):
    
    words = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(words)
    
    lemmatized_words = [wlm.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
    
    return ' '.join(lemmatized_words)

In [56]:
df['lemmatised'] = df['text'].apply(lemmatize_text)

In [57]:
df

Unnamed: 0,text,toxic,lemmatised
0,explanation why the edits made under my userna...,0,explanation why the edits make under my userna...
1,d aww he matches this background colour i m se...,0,d aww he match this background colour i m seem...
2,hey man i m really not trying to edit war it s...,0,hey man i m really not try to edit war it s ju...
3,more i can t make any real suggestions on impr...,0,more i can t make any real suggestion on impro...
4,you sir are my hero any chance you remember wh...,0,you sir be my hero any chance you remember wha...
...,...,...,...
159287,and for the second time of asking when your vi...,0,and for the second time of ask when your view ...
159288,you should be ashamed of yourself that is a ho...,0,you should be ashamed of yourself that be a ho...
159289,spitzer umm theres no actual article for prost...,0,spitzer umm theres no actual article for prost...
159290,and it looks like it was actually you who put ...,0,and it look like it be actually you who put on...


In [58]:
df = df.drop(['text'], axis=1)

Build a corpus from the cleaned and lemmatized texts.

In [59]:
corpus = df['lemmatised'].values

Split the data into training and test sets.

In [60]:
features = corpus
target = df['toxic'].values

train_features, test_features, train_target, test_target = train_test_split(features, target, test_size=0.2, random_state=SEED)

In [61]:
stopwordss = stopwords.words('english')  # kept as a list: recent scikit-learn versions reject a set here

count_tf_idf = TfidfVectorizer(stop_words=stopwordss)
train_features = count_tf_idf.fit_transform(train_features)

In [62]:
test_features = count_tf_idf.transform(test_features)
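The vectorizer is fitted on the training texts only, and the test texts are merely transformed, so no vocabulary or IDF statistics leak from the test set. A minimal sketch of that pattern on a toy corpus; the texts are invented for illustration and stop words are left out for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for illustration
train_texts = ["good article thanks", "bad stupid edit"]
test_texts = ["stupid unknown"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_texts)        # reuses them, learns nothing new

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, and the word "unknown", which never occurred in the training texts, is simply ignored.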

## Training

### LightGBM

In [63]:
LightGBM_model = LGBMClassifier()

Hyperparameter search.

In [64]:
hyperparams = [{'max_depth' : [1],# 2, 5, 10],
                'learning_rate':[0.1],# 0.2, 0.3],
                'n_estimators' : [500], #100, 200, 500],
                'random_state':[SEED]}]

In [65]:
clf = GridSearchCV(LightGBM_model, hyperparams, scoring='f1',cv=3)

In [66]:
clf.fit(train_features, train_target)

GridSearchCV(cv=3, estimator=LGBMClassifier(),
             param_grid=[{'learning_rate': [0.1], 'max_depth': [1],
                          'n_estimators': [500], 'random_state': [666]}],
             scoring='f1')

In [67]:
LGBM_best_params = clf.best_params_
print(LGBM_best_params)
print()
print('F1:', clf.best_score_)

{'learning_rate': 0.1, 'max_depth': 1, 'n_estimators': 500, 'random_state': 666}

F1: 0.6191364534769429


{'learning_rate': 0.1, 'max_depth': 1, 'n_estimators': 500, 'random_state': 666}

F1: 0.3483369671298225

### Decision tree

In [68]:
classificator = DecisionTreeClassifier()

In [69]:
hyperparams = [{'max_depth':[x for x in range(50,100,2)],
                'random_state':[SEED]}]

In [70]:
clf = GridSearchCV(classificator, hyperparams, scoring='f1',cv=3)

In [71]:
clf.fit(train_features, train_target)

GridSearchCV(cv=3, estimator=DecisionTreeClassifier(),
             param_grid=[{'max_depth': [50, 52, 54, 56, 58, 60, 62, 64, 66, 68,
                                        70, 72, 74, 76, 78, 80, 82, 84, 86, 88,
                                        90, 92, 94, 96, 98],
                          'random_state': [666]}],
             scoring='f1')

In [72]:
DTC_best_params = clf.best_params_

In [73]:
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.6f for %r"% (mean, params))
print()

cv_f1_DTC = max(means)

0.705156 for {'max_depth': 50, 'random_state': 666}
0.704926 for {'max_depth': 52, 'random_state': 666}
0.706928 for {'max_depth': 54, 'random_state': 666}
0.707551 for {'max_depth': 56, 'random_state': 666}
0.711923 for {'max_depth': 58, 'random_state': 666}
0.711353 for {'max_depth': 60, 'random_state': 666}
0.713031 for {'max_depth': 62, 'random_state': 666}
0.713149 for {'max_depth': 64, 'random_state': 666}
0.712236 for {'max_depth': 66, 'random_state': 666}
0.712163 for {'max_depth': 68, 'random_state': 666}
0.714754 for {'max_depth': 70, 'random_state': 666}
0.714461 for {'max_depth': 72, 'random_state': 666}
0.715219 for {'max_depth': 74, 'random_state': 666}
0.717355 for {'max_depth': 76, 'random_state': 666}
0.716880 for {'max_depth': 78, 'random_state': 666}
0.719879 for {'max_depth': 80, 'random_state': 666}
0.718294 for {'max_depth': 82, 'random_state': 666}
0.718306 for {'max_depth': 84, 'random_state': 666}
0.721060 for {'max_depth': 86, 'random_state': 666}
0.718588 for

### LogisticRegression

In [74]:
lr_model = LogisticRegression()

In [75]:
hyperparams = [{'C':[10],   # also tried [0.1, 1, 3]
                'class_weight':['balanced']}]

In [76]:
clf = GridSearchCV(lr_model, hyperparams, scoring='f1',cv=3)

In [77]:
clf.fit(train_features, train_target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

GridSearchCV(cv=3, estimator=LogisticRegression(),
             param_grid=[{'C': [10], 'class_weight': ['balanced']}],
             scoring='f1')

In [78]:
print("Best model parameters:")
print()
LR_best_params = clf.best_params_
print(LR_best_params)
print()
print('F1:', clf.best_score_)

Best model parameters:

{'C': 10, 'class_weight': 'balanced'}

F1: 0.7591129500382627
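With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the full-dataset class counts from the preparation step. This is for illustration only: during training the weights are derived from the training split, so the actual values differ slightly.

```python
# Class counts of the full dataset, used here only as an illustration
counts = {0: 143106, 1: 16186}

n_samples = sum(counts.values())
n_classes = len(counts)

# scikit-learn's 'balanced' heuristic: n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Each toxic example counts almost nine times as much as a non-toxic one, so both classes contribute equally to the loss in total.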


## Conclusions

### LightGBM

In [79]:
LightGBM_model = LGBMClassifier()  # default hyperparameters; LGBM_best_params from the grid search is not applied here

In [80]:
LightGBM_model.fit(train_features, train_target)

LGBMClassifier()

In [81]:
prediction = LightGBM_model.predict(test_features)

In [82]:
f1 = f1_score(test_target, prediction)

In [83]:
print('F1:', f1)
print()
print('Confusion matrix')
print(confusion_matrix(test_target, prediction))
print()

F1: 0.7351212343864805

Confusion matrix
[[28416   244]
 [ 1198  2001]]



### Decision tree

In [84]:
classificator = DecisionTreeClassifier()

In [85]:
classificator.set_params(**DTC_best_params)

DecisionTreeClassifier(max_depth=96, random_state=666)

In [86]:
classificator.fit(train_features, train_target)

DecisionTreeClassifier(max_depth=96, random_state=666)

In [87]:
target_predict = classificator.predict(test_features)

In [88]:
valid_f1_DTC = f1_score(test_target, target_predict)

In [89]:
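# cv_f1_DTC is the best cross-validation score; the test-set score is stored in valid_f1_DTC
print('Best cross-validation F1:', cv_f1_DTC)
print()
print('Confusion matrix')
print(confusion_matrix(test_target, target_predict))
print()

Best cross-validation F1: 0.7215058231446254

Confusion matrix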
[[28124   536]
 [ 1131  2068]]
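The score printed above comes from `cv_f1_DTC`, that is, from cross-validation, while the test-set value `valid_f1_DTC` is never shown. It can be recovered from the confusion matrix. A minimal sketch, assuming the usual scikit-learn layout `[[TN, FP], [FN, TP]]`; the helper name `f1_from_counts` is hypothetical:

```python
# Confusion matrix of the decision tree on the test set, copied from the output above.
# scikit-learn layout: rows are true classes, columns are predicted classes.
matrix = [[28124, 536],
          [1131, 2068]]

(tn, fp), (fn, tp) = matrix

def f1_from_counts(tp, fp, fn):
    # F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)
    return 2 * tp / (2 * tp + fp + fn)

test_f1 = f1_from_counts(tp, fp, fn)
print(f"test F1: {test_f1:.4f}")
```

This gives about 0.713 on the test set, slightly below the cross-validation score and below the 0.75 target.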



### LogisticRegression

In [90]:
lr_model = LogisticRegression()

In [91]:
lr_model.set_params(**LR_best_params)

LogisticRegression(C=10, class_weight='balanced')

In [92]:
lr_model.fit(train_features, train_target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(C=10, class_weight='balanced')

In [93]:
prediction = lr_model.predict(test_features)

In [94]:
f1 = f1_score(test_target, prediction)

In [95]:
print('Logistic regression F1:', f1)
print()
print('Confusion matrix')
print(confusion_matrix(test_target, prediction))
print()

Logistic regression F1: 0.7540425531914894

Confusion matrix
[[27467  1193]
 [  541  2658]]
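The F1 scores hide how differently the models err. The sketch below derives precision and recall from the LightGBM and logistic regression test-set confusion matrices printed above, again assuming the scikit-learn layout `[[TN, FP], [FN, TP]]`; the helper `precision_recall` is hypothetical.

```python
def precision_recall(matrix):
    # scikit-learn layout: [[TN, FP], [FN, TP]]
    (tn, fp), (fn, tp) = matrix
    return tp / (tp + fp), tp / (tp + fn)

# Test-set confusion matrices copied from the outputs above
matrices = {
    'LightGBM': [[28416, 244], [1198, 2001]],
    'LogisticRegression': [[27467, 1193], [541, 2658]],
}

scores = {name: precision_recall(m) for name, m in matrices.items()}
for name, (precision, recall) in scores.items():
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}")
```

LightGBM is precise (about 0.89) but finds only about 63% of toxic comments, while logistic regression with balanced class weights finds about 83% at a precision of about 0.69. For a moderation queue, higher recall means fewer toxic comments slip through, at the cost of more false alarms.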



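# Conclusion

Of the models considered, only logistic regression reaches the required quality: its test-set F1 is about 0.754, just above the 0.75 threshold. LightGBM with default parameters scores about 0.735 on the test set, and the decision tree reaches about 0.72 in cross-validation, so both fall short.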