<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.
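For reference, *F1* is the harmonic mean of precision and recall for the positive (toxic) class. The snippet below is a minimal sketch in plain Python. The function name `f1_from_counts` and the example counts are illustrative assumptions and are not results from this project.

```python
def f1_from_counts(tp, fp, fn):
    """Return F1 for the positive class from confusion-matrix counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not taken from the project data)
example_f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(round(example_f1, 4))
```

Because the metric ignores true negatives, it is not inflated by the large share of non-toxic comments in the way plain accuracy would be.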

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [1]:
import re
import warnings

import nltk
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

warnings.filterwarnings('ignore')

# English stop words for TF-IDF vectorization. The lemmatization cells below
# also rely on NLTK's tokenizer, POS tagger and WordNet data being installed.
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Load the data
df = pd.read_csv('/datasets/toxic_comments.csv')

In [3]:
df.head(10)

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [5]:
# Check the ratio of toxic to non-toxic comments
class_counts = df['toxic'].value_counts()
class_counts

0    143106
1     16186
Name: toxic, dtype: int64

The classes are imbalanced: about 10% of the comments (16,186 of 159,292) are toxic.
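To make the imbalance concrete, the sketch below computes the toxic share and the per-class weights given by the "balanced" heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The counts are copied from the output above. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 143106, 1: 16186}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

toxic_share = class_counts[1] / n_samples
print(f"toxic share: {toxic_share:.3f}")
for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these weights, an error on a toxic comment costs the model several times more than an error on a non-toxic one, which offsets the roughly 9:1 class ratio.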

In [6]:
# Define the text corpus (raw comments, before cleaning)
corpus = df['text'].values
print(corpus[0])

Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27


In [7]:
def clean_text(text):
    """Lowercase a comment, expand common contractions and strip punctuation."""
    text = text.lower()
    # Expand contractions; "can't" must be handled before the generic "n't" rule
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"'s", " ", text)
    text = re.sub(r"'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"'re", " are ", text)
    text = re.sub(r"'d", " would ", text)
    text = re.sub(r"'ll", " will ", text)
    # Replace non-word characters with spaces and collapse repeated whitespace
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip(' ')

In [8]:
df['text'] = df['text'].map(clean_text)
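The order of the substitutions in `clean_text` matters: the specific `can't` rule has to run before the generic `n't` rule. The sketch below uses a reduced, hypothetical helper `expand` (not part of the notebook) to show what happens when the two rules are swapped.

```python
import re


def expand(text, specific_first=True):
    """Expand "can't" and "n't" in the given order, then strip punctuation."""
    rules = [(r"can't", "cannot "), (r"n't", " not ")]
    if not specific_first:
        rules.reverse()
    text = text.lower()
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    # Same final steps as clean_text: drop non-word characters, squeeze spaces
    text = re.sub(r"\W", " ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "I can't edit; they weren't vandalisms!"
right_order = expand(sample, specific_first=True)
wrong_order = expand(sample, specific_first=False)
print(right_order)
print(wrong_order)
```

With the generic rule first, `can't` becomes `ca not`, leaving a fragment `ca` that is not a real word and ends up as a separate token in the vocabulary.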

In [9]:
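def get_wordnet_pos(word):
    """Map the Penn Treebank POS tag of a single word to a WordNet POS constant."""
    # The word is tagged in isolation, so sentence context is not used
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Fall back to noun for any other tag
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def lemm_text(text):
    """Tokenize a comment and lemmatize every token using its POS tag."""
    tokens = nltk.word_tokenize(text)
    return ' '.join(lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in tokens)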

In [10]:
%%time

df['text'] = df['text'].apply(lemm_text) 

CPU times: user 18min 34s, sys: 1min 48s, total: 20min 23s
Wall time: 20min 25s
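Since `get_wordnet_pos` tags each word in isolation, its result depends only on the word itself, so repeated words could be served from a cache. The sketch below illustrates the idea with `functools.lru_cache`. `slow_tag` is a hypothetical stand-in for the expensive tagger call, and any speed-up on the real data has not been measured here.

```python
from functools import lru_cache

call_counter = {"n": 0}


def slow_tag(word):
    """Hypothetical stand-in for an expensive per-word lookup."""
    call_counter["n"] += 1
    return "V" if word.endswith("ing") else "N"


@lru_cache(maxsize=None)
def cached_tag(word):
    """Compute the tag once per distinct word and reuse it afterwards."""
    return slow_tag(word)


tokens = "editing the page and editing the talk page".split()
tags = [cached_tag(token) for token in tokens]

print(len(tokens), "tokens,", call_counter["n"], "underlying calls")
```

In a large corpus most tokens are repeats of a limited vocabulary, so the number of underlying calls is bounded by the number of distinct words rather than by the total token count.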


In [11]:
# Store the cleaned, lemmatized text in a separate column.
# `corpus` still holds the raw comments, so the column is copied from df['text'].
df['lemm_text'] = df['text']
display(df.head(10))
df.info()

,Unnamed: 0,text,toxic,lemm_text
0,0,explanation why the edits make under my userna...,0,explanation why the edits make under my userna...
1,1,d aww he match this background colour i be see...,0,d aww he match this background colour i be see...
2,2,hey man i be really not try to edit war it jus...,0,hey man i be really not try to edit war it jus...
3,3,more i can not make any real suggestion on imp...,0,more i can not make any real suggestion on imp...
4,4,you sir be my hero any chance you remember wha...,0,you sir be my hero any chance you remember wha...
5,5,congratulation from me a well use the tool wel...,0,congratulation from me a well use the tool wel...
6,6,cocksucker before you piss around on my work,1,cocksucker before you piss around on my work
7,7,your vandalism to the matt shirvington article...,0,your vandalism to the matt shirvington article...
8,8,sorry if the word nonsense be offensive to you...,0,sorry if the word nonsense be offensive to you...
9,9,alignment on this subject and which be contrar...,0,alignment on this subject and which be contrar...


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
 3   lemm_text   159292 non-null  object
dtypes: int64(2), object(2)
memory usage: 4.9+ MB


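**CONCLUSION**

There are no duplicates or missing values.

The classes are imbalanced. No resampling was applied; the imbalance is handled at training time with `class_weight='balanced'`.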

## Training

In [12]:
features = df['lemm_text']
target = df['toxic']

features_train_1, features_test_1, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=42, stratify = target)
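The `stratify=target` argument keeps the share of toxic comments roughly equal in the training and test sets. The sketch below is a simplified plain-Python illustration of that idea, not scikit-learn's implementation. The helper `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with similar class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 80 non-toxic, 20 toxic (illustrative only)
toy_labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(toy_labels)

test_toxic_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_toxic_share)
```

Without stratification, a random split of a dataset with a rare class can leave the test set with a noticeably different toxic share, which makes the test *F1* less comparable to the cross-validation score.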

In [13]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

count_tf_idf = TfidfVectorizer(stop_words=stopwords) 

# Vectorize the texts: fit TF-IDF on the training set only, then transform the test set
features_train = count_tf_idf.fit_transform(features_train_1)
features_test = count_tf_idf.transform(features_test_1)

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
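The distinction between `fit_transform` on the training texts and `transform` on the test texts can be sketched in plain Python. The code below is a simplified illustration that assumes whitespace tokenization, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization. `fit_idf` and `transform` are hypothetical helpers, not the `TfidfVectorizer` implementation.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def transform(doc, idf):
    """Map one document to an L2-normalized TF-IDF dict; unseen terms are dropped."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


train_docs = ["you be my hero", "stop your vandalism", "you vandalism"]
idf = fit_idf(train_docs)

# "troll" never occurred in training, so it has no column in the test vector
test_vector = transform("you hero troll", idf)
print(test_vector)
```

Fitting on the training set only means the test texts cannot influence the vocabulary or the IDF weights, so the test score is not contaminated by information from the test data.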


In [14]:
print(features_train.shape)
print(features_test.shape)
print(target_train.shape)
print(target_test.shape)

(119469, 143727)
(39823, 143727)
(119469,)
(39823,)


**LogisticRegression**

In [15]:
%%time
# Tune C with a grid search, then cross-validate the baseline model
regression = LogisticRegression(fit_intercept=True,
                                class_weight='balanced',
                                random_state=42,
                                solver='liblinear'
                               )
regression_parametrs = {'C': [0.1, 1, 10]}

regression_grid = GridSearchCV(regression, regression_parametrs, scoring='f1', cv=3)
regression_grid.fit(features_train, target_train)

# cross_val_score clones the estimator, so this evaluates the default C=1
regression_cv_score = cross_val_score(regression, features_train, target_train, scoring='f1', cv=3).mean()
print('Logistic regression F1 on cross-validation:', regression_cv_score)

Logistic regression F1 on cross-validation: 0.7434791500248843
CPU times: user 1min 20s, sys: 1min 30s, total: 2min 50s
Wall time: 2min 50s
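The `cv=3` setting splits the training set into three folds, and each fold serves once as the validation part. The sketch below shows plain, unshuffled k-fold index generation in pure Python. It is a simplified illustration with a hypothetical helper `k_fold_indices`. For classifiers, scikit-learn additionally stratifies the folds by class, which this sketch does not do.

```python
def k_fold_indices(n_samples, n_splits=3):
    """Yield (train_idx, val_idx) pairs for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

The reported cross-validation score is the mean of the per-fold scores, each computed on data the model did not see during that fold's training.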


In [16]:
%%time
# Best hyperparameters and F1 of the refit model on the training set
# (the cross-validated F1 is stored in regression_grid.best_score_)
regression_params = regression_grid.best_params_
regression_score = regression_grid.score(features_train, target_train)
print(regression_params)
print(regression_score)
print('')

{'C': 10}
0.9079071167814534

CPU times: user 39.2 ms, sys: 3.28 ms, total: 42.5 ms
Wall time: 40.8 ms


In [17]:
%%time
# Train logistic regression with the best C and predict on the test set:
regression_model = LogisticRegression(fit_intercept=True,
                                class_weight='balanced',
                                random_state=42,
                                solver='liblinear',
                                C=regression_params['C']
                               )

regression_model.fit(features_train, target_train)
regression_model_predictions = regression_model.predict(features_test)

CPU times: user 12.1 s, sys: 11.6 s, total: 23.7 s
Wall time: 23.7 s


In [18]:
regression_predictions = regression_model.predict(features_test)
regression_f1 = round(f1_score(target_test, regression_predictions), 5) 
print(regression_f1)

0.75781


**RandomForestClassifier**

In [19]:
%%time
# Tune hyperparameters with cross-validation
forest = RandomForestClassifier(class_weight='balanced', n_jobs=-1)

forest_parametrs = {'n_estimators': range(20, 40, 5),
                    'max_depth': range(4, 8, 2),
                    'min_samples_leaf': range(3, 5),
                    'min_samples_split': range(2, 6, 2)}

# Run GridSearchCV with 3-fold cross-validation
forest_grid = GridSearchCV(forest, forest_parametrs, scoring='f1', cv=3)
forest_grid.fit(features_train, target_train)

CPU times: user 3min 53s, sys: 3.06 s, total: 3min 56s
Wall time: 3min 56s


GridSearchCV(cv=3,
             estimator=RandomForestClassifier(class_weight='balanced',
                                              n_jobs=-1),
             param_grid={'max_depth': range(4, 8, 2),
                         'min_samples_leaf': range(3, 5),
                         'min_samples_split': range(2, 6, 2),
                         'n_estimators': range(20, 40, 5)},
             scoring='f1')

In [20]:
%%time
# Best hyperparameters and F1 of the refit model on the training set
# (the cross-validated F1 is stored in forest_grid.best_score_)
forest_params = forest_grid.best_params_
forest_score = forest_grid.score(features_train, target_train)
print(forest_params)
print(forest_score)
print('')

{'max_depth': 6, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 35}
0.3244483113185162

CPU times: user 751 ms, sys: 4.44 ms, total: 756 ms
Wall time: 762 ms


In [21]:
%%time
# Train the model with the selected hyperparameters:
forest_model = RandomForestClassifier(random_state=42, n_jobs=-1, class_weight='balanced',
                                     max_depth=forest_params['max_depth'],
                                     min_samples_leaf = forest_params['min_samples_leaf'],
                                     min_samples_split = forest_params['min_samples_split'],
                                     n_estimators = forest_params['n_estimators'])

forest_model.fit(features_train, target_train)
forest_model_predictions = forest_model.predict(features_test)

CPU times: user 1.04 s, sys: 17.5 ms, total: 1.06 s
Wall time: 1.06 s


In [22]:
forest_predictions = forest_model.predict(features_test)
forest_f1 =  round(f1_score(target_test, forest_predictions), 3)
print(forest_f1)

0.309


## Conclusions

In [23]:
columns = ['Model', 'Time (s)', 'F1']
# Times are entered manually, in seconds
regression_row = ['LogisticRegression', 26.1, regression_f1]
forest_row = ['RandomForestClassifier', 2.5, forest_f1]


table = pd.DataFrame([regression_row, forest_row], columns=columns)


display(table)

,Model,Time (s),F1
0,LogisticRegression,26.1,0.75781
1,RandomForestClassifier,2.5,0.309


**CONCLUSION**

The required *F1* value was reached on this data. Of the two models, LogisticRegression performed best: its *F1* on the test set is 0.758, which is above the required minimum of 0.75. RandomForestClassifier reached only 0.309.