<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>LogisticRegression</a></span></li><li><span><a href="#RandomForestClassifier" data-toc-modified-id="RandomForestClassifier-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>RandomForestClassifier</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

# Project for Wikishop

The online store Wikishop is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. A dataset labelled for edit toxicity is available.

Build a model with an *F1* score of at least 0.75.

## Preparation

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from pymystem3 import Mystem
import re
from sklearn.model_selection import train_test_split  # splits the dataset into training and test sets
from sklearn.linear_model import LogisticRegression  # logistic regression classifier
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

from tqdm import tqdm
tqdm.pandas()

from nltk.stem import WordNetLemmatizer 
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize


In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv')


In [3]:
display(data)

| | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |
| ... | ... | ... |
| 159566 | ":::::And for the second time of asking, when ... | 0 |
| 159567 | You should be ashamed of yourself \n\nThat is ... | 0 |
| 159568 | Spitzer \n\nUmm, theres no actual article for ... | 0 |
| 159569 | And it looks like it was actually you who put ... | 0 |


Check for missing values.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Count how many 0 and 1 labels the dataset contains.

In [5]:
display(data['toxic'].value_counts())

0    143346
1     16225
Name: toxic, dtype: int64

In [6]:
class_counts = data['toxic'].value_counts()
ratio_value_counts = class_counts[0] / class_counts[1]  # non-toxic to toxic ratio

In [7]:
display(ratio_value_counts)

8.834884437596301

The classes are imbalanced: there are about 8.83 non-toxic comments for every toxic one (a ratio of 1:8.83). We therefore set `class_weight='balanced'` in the models.
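For reference, scikit-learn documents the `balanced` mode as weighting each class by `n_samples / (n_classes * class_count)`. A minimal sketch of that arithmetic on the class counts above; the helper name `balanced_class_weights` is illustrative and not part of any library:

```python
def balanced_class_weights(class_counts):
    """Per-class weights following n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Class counts taken from the value_counts() output above
weights = balanced_class_weights({0: 143346, 1: 16225})
print(weights)
```

With these weights the toxic class counts about 8.83 times more per comment than the non-toxic class, so both classes carry the same total weight during training.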

Lemmatize and clean the text.

In [8]:
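%%time

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    """Lowercase the comment, keep Latin letters only, and collapse whitespace."""
    text = text.lower()
    # Note: WordNetLemmatizer.lemmatize expects a single word. Called on a whole
    # comment it returns multi-word text unchanged, so this step effectively only
    # cleans the text (forms such as "edits" and "matches" survive in lemm_text).
    lemm_text = lemmatizer.lemmatize(text)
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', lemm_text)
    return " ".join(cleared_text.split())

data['lemm_text'] = data['text'].progress_apply(lemmatize_text)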


100%|██████████| 159571/159571 [00:07<00:00, 20606.15it/s]

CPU times: user 7.44 s, sys: 176 ms, total: 7.62 s
Wall time: 7.75 s
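`WordNetLemmatizer.lemmatize` works on one word at a time, so a per-token variant of the cleaning step might look like the sketch below. This is not what the cell above runs, and the scores reported later were not obtained with it. The names `clean_and_lemmatize`, `make_wordnet_lemmatizer`, and `toy_lemmas` are hypothetical, and the toy dictionary only stands in for a real lemmatizer.

```python
import re

def clean_and_lemmatize(text, lemmatize_word):
    """Lowercase, keep only Latin letters, then lemmatize token by token."""
    cleaned = re.sub(r'[^a-z]', ' ', text.lower())
    return ' '.join(lemmatize_word(token) for token in cleaned.split())

def make_wordnet_lemmatizer():
    """Build a word-level lemmatizer from NLTK (requires the 'wordnet' corpus)."""
    from nltk.stem import WordNetLemmatizer  # imported lazily; needs nltk installed
    return WordNetLemmatizer().lemmatize

# Stand-in lemmatizer for illustration only; swap in make_wordnet_lemmatizer()
toy_lemmas = {'edits': 'edit', 'matches': 'match'}
result = clean_and_lemmatize("He matches the EDITS!", lambda w: toy_lemmas.get(w, w))
print(result)  # he match the edit
```

Lemmatizing token by token would be noticeably slower on 159571 comments than the whole-string call, and it would change the vocabulary, so the models would need to be retrained to see its effect on F1.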





In [9]:
display(data)

| | text | toxic | lemm_text |
|---|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 | explanation why the edits made under my userna... |
| 1 | D'aww! He matches this background colour I'm s... | 0 | d aww he matches this background colour i m se... |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 | hey man i m really not trying to edit war it s... |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 | more i can t make any real suggestions on impr... |
| 4 | You, sir, are my hero. Any chance you remember... | 0 | you sir are my hero any chance you remember wh... |
| ... | ... | ... | ... |
| 159566 | ":::::And for the second time of asking, when ... | 0 | and for the second time of asking when your vi... |
| 159567 | You should be ashamed of yourself \n\nThat is ... | 0 | you should be ashamed of yourself that is a ho... |
| 159568 | Spitzer \n\nUmm, theres no actual article for ... | 0 | spitzer umm theres no actual article for prost... |
| 159569 | And it looks like it was actually you who put ... | 0 | and it looks like it was actually you who put ... |


In [10]:
text = 'caring touching'
print('Source text:', text)
tokens = nltk.word_tokenize(text)
print('Tokenization:', tokens)
print()
# The default part of speech is noun (pos='n'), so the -ing forms come back unchanged
lemmatizer = WordNetLemmatizer()
lemm_review = [lemmatizer.lemmatize(word) for word in tokens]
print('Lemmatization:', lemm_review)
print()
stemmer = PorterStemmer()
stemmer_review = [stemmer.stem(word) for word in tokens]
print('Stemming:', stemmer_review)



Source text: caring touching
Tokenization: ['caring', 'touching']

Lemmatization: ['caring', 'touching']

Stemming: ['care', 'touch']


In [11]:
data = data.drop(['text'], axis=1)  # keep only the cleaned text and the label

In [12]:
display(data)

| | toxic | lemm_text |
|---|---|---|
| 0 | 0 | explanation why the edits made under my userna... |
| 1 | 0 | d aww he matches this background colour i m se... |
| 2 | 0 | hey man i m really not trying to edit war it s... |
| 3 | 0 | more i can t make any real suggestions on impr... |
| 4 | 0 | you sir are my hero any chance you remember wh... |
| ... | ... | ... |
| 159566 | 0 | and for the second time of asking when your vi... |
| 159567 | 0 | you should be ashamed of yourself that is a ho... |
| 159568 | 0 | spitzer umm theres no actual article for prost... |
| 159569 | 0 | and it looks like it was actually you who put ... |


Split the data into samples.

Features

In [13]:
features = data.drop(['toxic'], axis=1)
display(features.shape)

(159571, 1)

Target

In [14]:
target = data['toxic']
display(target.shape)

(159571,)

Split the data into training and test sets, with 40% held out for testing.

In [15]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.40, random_state=12345)  # hold out 40% of the data for testing

In [16]:
display(features_train.shape)

(95742, 1)
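As a quick sanity check on the training shape above, the split sizes can be reproduced by hand. The sketch assumes that `train_test_split` rounds a fractional `test_size` up and gives the remaining rows to the training set, which is our understanding of its behaviour when only `test_size` is set:

```python
import math

n_samples = 159571   # rows in the dataset, from data.info()
test_size = 0.40     # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # assumed: test size is rounded up
n_train = n_samples - n_test                # the remaining rows go to training
print(n_train, n_test)
```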

Build TF-IDF features, excluding English stop words from the vocabulary.

In [17]:
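nltk.download('stopwords')
# Keep the stop words as a list: recent scikit-learn releases validate `stop_words` and may reject a set
stopwords = list(nltk_stopwords.words('english'))

count_tf_idf = TfidfVectorizer(stop_words=stopwords)

# Fit the vocabulary and IDF weights on the training set only, then reuse them for the test set
features_train = count_tf_idf.fit_transform(features_train['lemm_text'].values)
features_test = count_tf_idf.transform(features_test['lemm_text'].values)
print(features_train.shape)
print(features_test.shape)
cv_counts = 3  # number of cross-validation folds (not used in the cells below)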

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


(95742, 125610)
(63829, 125610)
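To make the weighting behind these matrices more concrete, here is a small pure-Python sketch of TF-IDF. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is how we understand the `TfidfVectorizer` defaults. It is a simplification: it splits on whitespace, removes no stop words, and fits and transforms the same documents. The function name `tfidf_vectors` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed idf = ln((1 + n) / (1 + df)) + 1 and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within one document
        raw = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(value * value for value in raw.values()))
        vectors.append({term: value / norm for term, value in raw.items()} if norm else {})
    return vectors

toy_docs = ["you are my hero", "you are wrong", "edit war"]  # made-up examples
vectors = tfidf_vectors(toy_docs)
print(vectors[1])
```

In the second toy document the rarer word `wrong` receives a larger weight than `you` or `are`, which occur in two documents.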


## Training

### LogisticRegression

In [18]:
%%time

model_logist = LogisticRegression(class_weight='balanced', random_state=12345, max_iter=1000, n_jobs=-1)  # initialise the model
model_logist.fit(features_train, target_train)  # fit the model on the training set
predicted_logist_test = model_logist.predict(features_test)  # predict labels for the test set
f1_logist = f1_score(target_test, predicted_logist_test)

print("F1-score", f1_logist)

F1-score 0.7529135921660576
CPU times: user 12.4 s, sys: 26.2 s, total: 38.6 s
Wall time: 38.7 s
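For readers less familiar with the metric, F1 for the positive (toxic) class is the harmonic mean of precision and recall. A minimal sketch on made-up labels; `f1_binary` is an illustrative helper, not the scikit-learn implementation:

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 3 toxic comments, 2 of them found, plus 1 false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_binary(y_true, y_pred)
print(score)
```

Because true negatives do not enter the formula, the large non-toxic majority cannot inflate the score, which makes F1 a reasonable choice for this imbalanced dataset.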


### RandomForestClassifier

In [19]:
%%time
model_RFC = RandomForestClassifier(class_weight='balanced', random_state=12345, n_jobs=-1)
model_RFC.fit(features_train, target_train)  # fit the model on the training set
predicted_RFC_test = model_RFC.predict(features_test)  # predict labels for the test set

f1_RFC = f1_score(target_test, predicted_RFC_test)
print("F1-score", f1_RFC)


F1-score 0.6425956061838893
CPU times: user 9min 9s, sys: 1.16 s, total: 9min 10s
Wall time: 9min 10s


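## Conclusions

1. The data analysis showed that the classes are imbalanced: the ratio of toxic to non-toxic comments is about 1:8.83. For this reason both models were trained with `class_weight='balanced'`.
2. We trained two models: LogisticRegression and RandomForestClassifier.
3. LogisticRegression gave the best result, with an F1 score of about 0.753 on the test set, which meets the 0.75 target. RandomForestClassifier reached about 0.643.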