An online store is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative (that is, non-toxic or toxic). You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.


## Preparation

In [1]:
!pip install xgboost



In [3]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import re
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
import spacy

In [4]:
df = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')

In [5]:
df

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
...,...,...,...
159287,159446,""":::::And for the second time of asking, when ...",0
159288,159447,You should be ashamed of yourself \n\nThat is ...,0
159289,159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159290,159449,And it looks like it was actually you who put ...,0


In [6]:
df.isna().sum()

Unnamed: 0    0
text          0
toxic         0
dtype: int64

In [7]:
df.duplicated().sum()

0

There are no missing values and no fully duplicated rows.
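One caveat worth checking: `df.duplicated()` compares whole rows, and the `Unnamed: 0` column appears to be a unique row identifier, so two rows with the same comment text would not be counted as duplicates. A minimal sketch on a toy frame (the data below is invented for illustration) that checks the `text` column on its own:

```python
import pandas as pd

# Toy data, invented for illustration: rows 0 and 2 share the same text
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "text": ["same comment", "another comment", "same comment"],
    "toxic": [0, 0, 0],
})

# Compares all columns, so the unique identifier hides the repeated text
full_row_duplicates = int(toy.duplicated().sum())

# Compares the text column only
text_duplicates = int(toy.duplicated(subset=["text"]).sum())

print(full_row_duplicates, text_duplicates)
```

On the real data the equivalent call would be `df.duplicated(subset=['text']).sum()`; this notebook did not run it, so its result is unknown.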

In [8]:
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/alexander/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/alexander/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alexander/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/alexander/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/alexander/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Let's write a function that lemmatizes the text and strips extraneous characters.

In [9]:
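def get_wordnet_pos(word):
    """Map the NLTK POS tag of a single word to the matching WordNet POS constant."""
    # pos_tag returns [(word, tag)]; the first letter of the tag identifies the word class
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Any other tag falls back to noun
    return tag_dict.get(tag, wordnet.NOUN)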

In [10]:
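def lemma(sentence):
    """Lowercase, tokenize and lemmatize a comment, then keep only the letters a-z."""
    sentence = sentence.lower()
    lemmatizer = WordNetLemmatizer()
    word_list = nltk.word_tokenize(sentence)
    # Lemmatize each token using its own POS tag
    lemma_text = ' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in word_list])
    # Replace every non-letter character with a space and collapse repeated whitespace
    return ' '.join((re.sub(r'[^a-z]', ' ', lemma_text)).split())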

In [11]:
df['lemma'] = df['text'].apply(lemma)

In [12]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,toxic,lemma
0,0,Explanation\nWhy the edits made under my usern...,0,explanation why the edits make under my userna...
1,1,D'aww! He matches this background colour I'm s...,0,d aww he match this background colour i m seem...
2,2,"Hey man, I'm really not trying to edit war. It...",0,hey man i m really not try to edit war it s ju...
3,3,"""\nMore\nI can't make any real suggestions on ...",0,more i ca n t make any real suggestion on impr...
4,4,"You, sir, are my hero. Any chance you remember...",0,you sir be my hero any chance you remember wha...
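Fragments such as `ca n t` and `i m` in the `lemma` column come from the order of operations: the tokenizer splits contractions first (`can't` becomes `ca` and `n't`), and only afterwards does the regular expression replace the apostrophe with a space. The cleanup step can be illustrated in isolation with the standard library alone; `clean_text` is a hypothetical helper name, not part of the notebook:

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, replace every character outside a-z with a space, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z]", " ", text.lower()).split())


print(clean_text("D'aww! He matches this background colour"))
```

Whether cleaning before tokenizing would change the final score was not tested here; the sketch only shows what the regular expression does to punctuation.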


In [13]:
stopwords = list(nltk_stopwords.words('english'))

Now compute TF-IDF features for the training and test sets.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(df['lemma'], df['toxic'], test_size=0.3)
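The split above is neither seeded nor stratified, so the exact numbers in this notebook will vary from run to run. A minimal sketch of a reproducible, stratified split on toy data (the series below are invented for illustration; `stratify` and `random_state` are standard `train_test_split` parameters):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 100 texts, 10% of them toxic
texts = pd.Series([f"comment {i}" for i in range(100)])
labels = pd.Series([1] * 10 + [0] * 90)

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.3,
    stratify=labels,   # keep the toxic share equal in both parts
    random_state=42,   # make the split reproducible
)

print(y_tr.mean(), y_te.mean())
```

With stratification both parts keep the same 10% share of the positive class, which matters when the classes are imbalanced.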

In [15]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
X_train_tf_idf = count_tf_idf.fit_transform(X_train)
X_test_tf_idf = count_tf_idf.transform(X_test)
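The vectorizer here is fitted once on the whole training part, so when the models are later tuned with cross-validation, the IDF weights have already seen the validation folds. One common way to avoid that, sketched below on toy data, is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` and grid-search the pipeline. The step names `tfidf` and `clf`, the toy texts and the small grid are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration
texts = ["you are an idiot", "thanks for the edit", "stupid idiot go away",
         "nice article well done", "what an idiot", "good source thanks",
         "you stupid fool", "please see the talk page"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=20, class_weight="balanced",
                                   random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
grid = GridSearchCV(pipe, param_grid={"clf__max_depth": [3, 5]}, cv=2, scoring="f1")
grid.fit(texts, labels)

print(grid.best_params_)
```

The fitted search object accepts raw strings in `predict`, because the vectorizer is part of the estimator.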

In [16]:
df['toxic'].value_counts(normalize=True)

0    0.898388
1    0.101612
Name: toxic, dtype: float64

`value_counts` shows a strong class imbalance: only about 10% of the comments are toxic. We will train the models with class weighting, using each library's built-in mechanism.
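For the scikit-learn side, `class_weight='balanced'` assigns each class the weight `n_samples / (n_classes * count_of_that_class)`. A small sketch with toy labels (invented for illustration, roughly mirroring the 90/10 split above) shows the resulting numbers:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels, invented for illustration: 90 non-toxic, 10 toxic
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# Class 0: 100 / (2 * 90), class 1: 100 / (2 * 10)
print(dict(zip([0, 1], weights)))
```

The rare class receives a weight nine times larger than the frequent one, so both classes contribute equally to the loss in total.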

## Training

In [17]:
params = {'max_depth': [7, 9, 11]}

### RandomForestClassifier

In [18]:
rf = RandomForestClassifier(class_weight='balanced', random_state=42)

grid_rf = GridSearchCV(rf, param_grid=params, cv=3, scoring='f1', n_jobs=-1)

grid_rf.fit(X_train_tf_idf, y_train)

In [19]:
best_result_rf = pd.DataFrame(grid_rf.cv_results_).sort_values('rank_test_score').head(1)
best_result_rf

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
2,22.822221,2.293654,0.248303,0.008963,11,{'max_depth': 11},0.350875,0.388732,0.377861,0.372489,0.015915,1
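The same information is also exposed directly on the fitted search object through `best_params_` and `best_score_`, which avoids sorting `cv_results_` by hand. A small sketch on synthetic data (the dataset and the grid below are made up for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced data standing in for the TF-IDF matrix
X, y = make_classification(n_samples=300, n_features=10, weights=[0.9, 0.1],
                           random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=20, class_weight="balanced", random_state=42),
    param_grid={"max_depth": [3, 5]},
    cv=3,
    scoring="f1",
)
grid.fit(X, y)

# best_score_ is the mean cross-validated F1 of the best parameter set
top_row = pd.DataFrame(grid.cv_results_).sort_values("rank_test_score").iloc[0]
print(grid.best_params_, grid.best_score_, top_row["mean_test_score"])
```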


### XGBClassifier

In [20]:
weight = y_train.value_counts(normalize=True)[0] * 100
weight

89.82996125699526
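Note that this value is the percentage of non-toxic comments, not a class ratio. The XGBoost documentation suggests `sum(negative instances) / sum(positive instances)` as a typical starting value for `scale_pos_weight`; with the class shares of this training split that is roughly 8.8 rather than 89.8. A minimal sketch of that heuristic, using the share printed above (whether it improves F1 on this task was not tested in this notebook):

```python
# Share of the negative class in y_train, taken from the output above (89.83%)
negative_share = 0.8982996125699526
positive_share = 1 - negative_share

# Heuristic from the XGBoost docs: sum(negative) / sum(positive)
scale_pos_weight = negative_share / positive_share

print(round(scale_pos_weight, 2))
```

On the real data the same ratio could be computed from counts, for example `(y_train == 0).sum() / (y_train == 1).sum()`.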

In [21]:
xgbc = XGBClassifier(scale_pos_weight=weight, random_state=42)

In [22]:
grid_xgbc = GridSearchCV(xgbc, param_grid=params, cv=3,  scoring='f1', n_jobs=-1)

In [23]:
grid_xgbc.fit(X_train_tf_idf, y_train)

In [24]:
best_result_xgbc = pd.DataFrame(grid_xgbc.cv_results_).sort_values('rank_test_score').head(1)
best_result_xgbc

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
2,1913.443158,26.268132,0.199807,0.022708,11,{'max_depth': 11},0.385726,0.392767,0.384259,0.387584,0.003714,1


### CatBoostClassifier

In [25]:
weights = [y_train.value_counts(normalize=True)[0], y_train.value_counts(normalize=True)[1]]
weights

[0.8982996125699526, 0.10170038743004735]

In [26]:
ctbc = CatBoostClassifier(iterations=1000, learning_rate=0.1, 
                          auto_class_weights='Balanced', random_state=42, verbose=False)

grid_ctbc = GridSearchCV(ctbc, param_grid=params, cv=3, scoring='f1', n_jobs=-1)

grid_ctbc.fit(X_train_tf_idf, y_train)

In [27]:
best_result_ctbc = pd.DataFrame(grid_ctbc.cv_results_).sort_values('rank_test_score').head(1)
best_result_ctbc

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
2,23561.002911,7679.78733,0.45887,0.233008,11,{'max_depth': 11},0.774556,0.764622,0.778525,0.772568,0.005847,1


In [28]:
result = pd.concat([best_result_rf, best_result_xgbc, best_result_ctbc])
result.index = ['RandomForestClassifier', 'XGBClassifier', 'CatBoostClassifier']
result[['mean_test_score']]

Unnamed: 0,mean_test_score
RandomForestClassifier,0.372489
XGBClassifier,0.387584
CatBoostClassifier,0.772568


Cross-validation shows that CatBoostClassifier performs best by a wide margin. Let's check its score on the test set.

In [29]:
f1_score(y_test, grid_ctbc.predict(X_test_tf_idf))

0.7653936087295402
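For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The toy example below (labels invented for illustration) computes it by hand and compares the result with `f1_score`:

```python
from sklearn.metrics import f1_score

# Toy labels, invented for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # toxic, flagged
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # clean, flagged
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # toxic, missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))
```

Because true negatives do not enter the formula, F1 is not inflated by the large non-toxic class, which is why it suits this imbalanced task better than accuracy.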

## Conclusions

We examined the dataset and found no missing values or duplicates. We then preprocessed the text by lemmatizing the words and removing extraneous characters, and converted the comments to TF-IDF vectors for training and prediction. Because the data showed a strong class imbalance, class weighting was applied during training. CatBoostClassifier achieved the best cross-validated F1 of 0.773; on the test set it scored 0.765, which meets the requirement of at least 0.75.