# Project for Wikishop («Викишоп»)

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative. A dataset labeled for the toxicity of edits is available.

The model must reach an *F1* score of at least 0.75.

**The project requires:**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

**Data description**

The data is stored in the file `toxic_comments.csv`. Its *text* column contains the comment text, and *toxic* is the target feature.


## Data preparation

In [54]:
# import libraries
import warnings
import pandas as pd
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer
import re
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import lightgbm as ltb
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [2]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fokin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
# suppress non-critical warnings
warnings.filterwarnings('ignore')

In [4]:
# show all rows and columns when displaying data
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [5]:
# load the data: try the server path first, then fall back to the local copy
try:
    data = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    data = pd.read_csv('C:/Users/fokin/OneDrive/Рабочий стол/Дата Сайнс/Машинное обучение для текстов/Проект/toxic_comments.csv')

In [6]:
# inspect the data: structure, first and last rows, statistics, missing values, duplicates
display(data.info())
display(data.head(10))
display(data.tail(10))
display(data.describe())
display(data.isna().sum())
display(data.duplicated().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


None

| | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |
| 5 | "\n\nCongratulations from me as well, use the ... | 0 |
| 6 | COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK | 1 |
| 7 | Your vandalism to the Matt Shirvington article... | 0 |
| 8 | Sorry if the word 'nonsense' was offensive to ... | 0 |
| 9 | alignment on this subject and which are contra... | 0 |


| | text | toxic |
|---|---|---|
| 159561 | "\nNo he did not, read it again (I would have ... | 0 |
| 159562 | "\n Auto guides and the motoring press are not... | 0 |
| 159563 | "\nplease identify what part of BLP applies be... | 0 |
| 159564 | Catalan independentism is the social movement ... | 0 |
| 159565 | The numbers in parentheses are the additional ... | 0 |
| 159566 | ":::::And for the second time of asking, when ... | 0 |
| 159567 | You should be ashamed of yourself \n\nThat is ... | 0 |
| 159568 | Spitzer \n\nUmm, theres no actual article for ... | 0 |
| 159569 | And it looks like it was actually you who put ... | 0 |
| 159570 | "\nAnd ... I really don't think you understand... | 0 |


| | toxic |
|---|---|
| count | 159571.0 |
| mean | 0.101679 |
| std | 0.302226 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.0 |
| max | 1.0 |


text     0
toxic    0
dtype: int64

0

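### Conclusion

We loaded and inspected the data from `toxic_comments.csv`. The dataset has two columns, `text` and `toxic`; `toxic` is the target for model training. There are 159571 records with no missing values and no duplicates. The mean of `toxic` is 0.101679, so about 10% of the comments are labeled toxic and the classes are imbalanced.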
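The class imbalance is one reason F1 is the metric here, not accuracy. The following is a minimal sketch on synthetic labels with a 10% share of the positive class (not the real dataset); the helper names `accuracy` and `f1_binary` are illustrative and not part of the notebook. It shows that a model that never flags a comment would look good on accuracy while being useless by F1.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic labels (assumption): 10 toxic comments out of 100.
y_true = [1] * 10 + [0] * 90
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

baseline_accuracy = accuracy(y_true, y_pred)
baseline_f1 = f1_binary(y_true, y_pred)
print(baseline_accuracy, baseline_f1)
```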

## Preparing, training, and testing the models

In [7]:
%%time
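# lemmatize and clean the text
lemmatizer = WordNetLemmatizer()
def lemmatize_clear(text):
    text = text.lower()
    # NOTE: lemmatize() expects a single word; here it receives the whole comment,
    # so individual words (e.g. 'edits', 'matches' in the output) are left as they are
    lem_list = lemmatizer.lemmatize(text)
    lem_text = "".join(lem_list)
    # replace everything except Latin letters with spaces
    lem_clear_text = re.sub(r'[^a-zA-Z]', ' ', lem_text)
    # collapse repeated whitespace
    lem_clear_text_1 = lem_clear_text.split()
    lem_clear_text_2 = " ".join(lem_clear_text_1)
    return lem_clear_text_2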

data['lemmatize_clear_text'] = data['text'].apply(lemmatize_clear)
display(data.head(10))

| | text | toxic | lemmatize_clear_text |
|---|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 | explanation why the edits made under my userna... |
| 1 | D'aww! He matches this background colour I'm s... | 0 | d aww he matches this background colour i m se... |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 | hey man i m really not trying to edit war it s... |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 | more i can t make any real suggestions on impr... |
| 4 | You, sir, are my hero. Any chance you remember... | 0 | you sir are my hero any chance you remember wh... |
| 5 | "\n\nCongratulations from me as well, use the ... | 0 | congratulations from me as well use the tools ... |
| 6 | COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK | 1 | cocksucker before you piss around on my work |
| 7 | Your vandalism to the Matt Shirvington article... | 0 | your vandalism to the matt shirvington article... |
| 8 | Sorry if the word 'nonsense' was offensive to ... | 0 | sorry if the word nonsense was offensive to yo... |
| 9 | alignment on this subject and which are contra... | 0 | alignment on this subject and which are contra... |


CPU times: total: 6.92 s
Wall time: 6.94 s
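In the output above, words such as "edits" and "matches" keep their original form, because the lemmatizer is called once on the whole comment and not on each word. A minimal sketch of per-token lemmatization could look like the following. The function `clean_and_lemmatize` and the toy lemmatizer are hypothetical and only illustrate the idea; the word-level lemmatizer is passed in as a parameter so the example runs without any NLTK data.

```python
import re
from typing import Callable


def clean_and_lemmatize(text: str, lemmatize_word: Callable[[str], str]) -> str:
    """Lowercase, keep only Latin letters, and lemmatize each token separately."""
    tokens = re.sub(r'[^a-z]', ' ', text.lower()).split()
    return ' '.join(lemmatize_word(token) for token in tokens)


# Toy stand-in for a real lemmatizer (assumption, for demonstration only).
TOY_LEMMAS = {'edits': 'edit', 'matches': 'match'}


def toy_lemmatize(word: str) -> str:
    return TOY_LEMMAS.get(word, word)


example = clean_and_lemmatize("Why the edits... He matches!", toy_lemmatize)
print(example)
```

With NLTK, the same function could be called as `clean_and_lemmatize(text, WordNetLemmatizer().lemmatize)`; note that `lemmatize` treats each word as a noun unless a `pos` argument is given. Switching to per-token lemmatization would change the features, so all scores reported below would have to be recomputed.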


In [8]:
# drop the column with the raw text, which is no longer needed
data = data.drop('text', axis=1)
display(data.head(10))

| | toxic | lemmatize_clear_text |
|---|---|---|
| 0 | 0 | explanation why the edits made under my userna... |
| 1 | 0 | d aww he matches this background colour i m se... |
| 2 | 0 | hey man i m really not trying to edit war it s... |
| 3 | 0 | more i can t make any real suggestions on impr... |
| 4 | 0 | you sir are my hero any chance you remember wh... |
| 5 | 0 | congratulations from me as well use the tools ... |
| 6 | 1 | cocksucker before you piss around on my work |
| 7 | 0 | your vandalism to the matt shirvington article... |
| 8 | 0 | sorry if the word nonsense was offensive to yo... |
| 9 | 0 | alignment on this subject and which are contra... |


In [9]:
# separate the target from the features
target = data['toxic']
features = data.drop('toxic', axis=1)
display(target.shape)
display(features.shape)

(159571,)

(159571, 1)

In [10]:
# split the data into training (60%) and test (40%) sets
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.40, random_state=654321)
display(features_train.shape)
display(features_test.shape)
display(target_train.shape)
display(target_test.shape)

(95742, 1)

(63829, 1)

(95742,)

(63829,)
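The split above is purely random. With only about 10% toxic comments, one option is to pass `stratify` to `train_test_split` so that both parts keep the same class share. The sketch below uses synthetic texts and labels (an assumption, not the project data) and reuses the notebook's `test_size` and `random_state`.

```python
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 100 toxic and 900 non-toxic comments.
labels = [1] * 100 + [0] * 900
texts = [f"comment {i}" for i in range(1000)]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.40,
    random_state=654321,
    stratify=labels,  # keep the toxic share equal in both parts
)

train_share = sum(y_train) / len(y_train)
test_share = sum(y_test) / len(y_test)
print(len(X_train), len(X_test), train_share, test_share)
```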

In [11]:
# compute TF-IDF with sklearn
nltk.download('stopwords')

# stop_words is documented as accepting a list, so the NLTK list is passed as-is
stopwords = nltk_stopwords.words('english')
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

features_train = count_tf_idf.fit_transform(features_train['lemmatize_clear_text'].values.astype('U'))
features_test = count_tf_idf.transform(features_test['lemmatize_clear_text'].values.astype('U'))

display(features_train)
display(features_test)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fokin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<95742x125348 sparse matrix of type '<class 'numpy.float64'>'
	with 2607940 stored elements in Compressed Sparse Row format>

<63829x125348 sparse matrix of type '<class 'numpy.float64'>'
	with 1693744 stored elements in Compressed Sparse Row format>
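Both matrices above have the same 125348 columns because the vocabulary is learned only from the training texts (`fit_transform`) and then reused for the test texts (`transform`). A small sketch on made-up sentences illustrates this; the short stop-word list here is an assumption chosen for the example, not the NLTK list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts (assumption), already lowercased and cleaned.
train_texts = ["you are my hero", "your vandalism is not welcome"]
test_texts = ["my hero is unknownword"]

vectorizer = TfidfVectorizer(stop_words=['you', 'are', 'my', 'your', 'is', 'not'])

# The vocabulary is built from the training texts only.
train_matrix = vectorizer.fit_transform(train_texts)
# Words unseen during fitting (here 'unknownword') are ignored.
test_matrix = vectorizer.transform(test_texts)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary, train_matrix.shape, test_matrix.shape)
```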

### LogisticRegression model

In [23]:
%%time
# find the best LogisticRegression parameters
# (scoring is not set, so GridSearchCV ranks candidates by the default score, accuracy)
params = {'fit_intercept': [True, False],
          'class_weight': ['balanced', None],
          'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
          'random_state': [654321]}
lr = LogisticRegression()
grid_lr = GridSearchCV(lr, params, cv=5)
grid_lr.fit(features_train, target_train)
display(grid_lr.best_params_)

{'class_weight': None,
 'fit_intercept': False,
 'random_state': 654321,
 'solver': 'newton-cg'}

CPU times: total: 8min 42s
Wall time: 3min 14s
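The grid search above does not set `scoring`, so candidates are compared by accuracy while the project target is F1. One possible variant, sketched below on a tiny made-up dataset (an assumption), passes `scoring='f1'` and wraps the vectorizer and the classifier in a `Pipeline`, so that TF-IDF is refit inside every cross-validation fold. The step names `tfidf` and `clf` are arbitrary, and the grid is reduced to keep the example fast.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up dataset (assumption): 1 = toxic, 0 = not toxic.
texts = ["you are a stupid idiot", "what a stupid edit", "idiot vandal go away",
         "thanks for the helpful edit", "nice article good work",
         "please see the talk page"] * 4
labels = [1, 1, 1, 0, 0, 0] * 4

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(random_state=654321)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {'clf__class_weight': ['balanced', None],
              'clf__fit_intercept': [True, False]}

# scoring='f1' ranks candidates by F1 for the positive (toxic) class.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1')
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

Applied to the real data, this would rerun the search with a different selection criterion, so the best parameters could differ from those shown above.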


In [29]:
%%time
# cross-validate the LogisticRegression model (macro-averaged F1)
lr = LogisticRegression(class_weight=None, fit_intercept=False, random_state=654321, solver='newton-cg')
scores = cross_val_score(lr, features_train, target_train, cv=5, scoring='f1_macro')
print('Macro F1, cross-validation on the training set:', scores.mean())

Macro F1, cross-validation on the training set: 0.8485993943827251
CPU times: total: 34.3 s
Wall time: 5.74 s


In [31]:
%%time
# compute F1 for the LogisticRegression model on the test set
lr.fit(features_train, target_train)
predictions = lr.predict(features_test)
print('F1 on the test set:', f1_score(target_test, predictions))

F1 on the test set: 0.7362841240392262
CPU times: total: 8.52 s
Wall time: 1.44 s
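The cross-validation score is computed with `scoring='f1_macro'`, while `f1_score` on the test set uses its default binary averaging, so the two numbers measure different things. The sketch below, on synthetic predictions (an assumption, not the model's output), shows how macro F1 can be noticeably higher than the F1 of the toxic class on imbalanced data, because it also averages in the F1 of the large non-toxic class.

```python
from sklearn.metrics import f1_score

# Synthetic example (assumption): 10 toxic and 90 non-toxic comments.
y_true = [1] * 10 + [0] * 90
# 6 toxic comments found, 4 missed, 2 false alarms, 88 correct negatives.
y_pred = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 88

# F1 of the positive (toxic) class only: the default for f1_score.
binary_f1 = f1_score(y_true, y_pred)
# Unweighted mean of the per-class F1 values: what 'f1_macro' computes.
macro_f1 = f1_score(y_true, y_pred, average='macro')

print(round(binary_f1, 4), round(macro_f1, 4))
```

For a like-for-like comparison with the test score, cross-validation could use `scoring='f1'` instead.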


### RandomForestClassifier model

In [32]:
%%time
# find the best RandomForestClassifier parameters (default scoring: accuracy)
params = {'n_estimators': [10, 50, 100, 200],
          'criterion': ['gini', 'entropy', 'log_loss'],
          'max_depth': [10, 50, 100, 200],
          'random_state': [654321]}
rfc = RandomForestClassifier()
grid_rfc = GridSearchCV(rfc, params, cv=5)
grid_rfc.fit(features_train, target_train)
display(grid_rfc.best_params_)

{'criterion': 'gini',
 'max_depth': 200,
 'n_estimators': 10,
 'random_state': 654321}

CPU times: total: 7h 15min 4s
Wall time: 7h 15min 8s


In [35]:
%%time
# cross-validate the RandomForestClassifier model (macro-averaged F1)
rfc = RandomForestClassifier(random_state=654321, criterion='gini', max_depth=200, n_estimators=10)
scores = cross_val_score(rfc, features_train, target_train, cv=5, scoring='f1_macro')
print('Macro F1, cross-validation on the training set:', scores.mean())

Macro F1, cross-validation on the training set: 0.7482636497830291
CPU times: total: 3min 14s
Wall time: 3min 14s


In [36]:
%%time
# compute F1 for the RandomForestClassifier model on the test set
rfc.fit(features_train, target_train)
predictions = rfc.predict(features_test)
print('F1 on the test set:', f1_score(target_test, predictions))

F1 on the test set: 0.5349577433871144
CPU times: total: 14 s
Wall time: 14 s


### DecisionTreeClassifier model

In [33]:
%%time
# find the best DecisionTreeClassifier parameters (default scoring: accuracy)
params = {'splitter': ['best', 'random'],
          'criterion': ['gini', 'entropy', 'log_loss'],
          'max_depth': [10, 50, 100, 200],
          'random_state': [654321]}
dtc = DecisionTreeClassifier()
grid_dtc = GridSearchCV(dtc, params, cv=5)
grid_dtc.fit(features_train, target_train)
display(grid_dtc.best_params_)

{'criterion': 'gini',
 'max_depth': 100,
 'random_state': 654321,
 'splitter': 'best'}

CPU times: total: 34min 6s
Wall time: 34min 6s


In [37]:
%%time
# cross-validate the DecisionTreeClassifier model (macro-averaged F1)
dtc = DecisionTreeClassifier(random_state=654321, criterion='gini', max_depth=100, splitter='best')
scores = cross_val_score(dtc, features_train, target_train, cv=5, scoring='f1_macro')
print('Macro F1, cross-validation on the training set:', scores.mean())

Macro F1, cross-validation on the training set: 0.8458030843457094
CPU times: total: 2min 46s
Wall time: 2min 46s


In [38]:
%%time
# compute F1 for the DecisionTreeClassifier model on the test set
dtc.fit(features_train, target_train)
predictions = dtc.predict(features_test)
print('F1 on the test set:', f1_score(target_test, predictions))

F1 on the test set: 0.7275812335623992
CPU times: total: 32.2 s
Wall time: 32.2 s


### LGBMClassifier model

In [58]:
%%time
# find the best LGBMClassifier parameters (default scoring: accuracy)
params = {'max_depth': [10, 50, 100, 200],
          'num_leaves': [10, 50, 100, 200],
          'random_state': [654321]}
lgbmc = ltb.LGBMClassifier(objective='binary')
grid_lgbmc = GridSearchCV(lgbmc, params, cv=5)
grid_lgbmc.fit(features_train, target_train)
display(grid_lgbmc.best_params_)

{'max_depth': 100, 'num_leaves': 100, 'random_state': 654321}

CPU times: total: 5h 20min 15s
Wall time: 29min 28s


In [59]:
%%time
# cross-validate the LGBMClassifier model (macro-averaged F1)
lgbmc = ltb.LGBMClassifier(random_state=654321, objective='binary', verbosity=-1, max_depth=100, num_leaves=100)
scores = cross_val_score(lgbmc, features_train, target_train, cv=5, scoring='f1_macro')
print('Macro F1, cross-validation on the training set:', scores.mean())

Macro F1, cross-validation on the training set: 0.8701563893654882
CPU times: total: 27min 1s
Wall time: 2min 26s


In [60]:
%%time
# compute F1 for the LGBMClassifier model on the test set
lgbmc.fit(features_train, target_train)
predictions = lgbmc.predict(features_test)
print('F1 on the test set:', f1_score(target_test, predictions))

F1 on the test set: 0.7726916034527858
CPU times: total: 6min 49s
Wall time: 36.7 s


## Conclusions

In [64]:
# build a summary table to compare the results
pivot_data_models = {'Model': ['LogisticRegression', 'RandomForestClassifier', 'DecisionTreeClassifier', 'LGBMClassifier'],
                     'Macro F1 (CV, training set)': [0.8485993943827251, 0.7482636497830291, 0.8458030843457094, 0.8701563893654882],
                     'Wall time CV': ['5.74s', '3min 14s', '2min 46s', '2min 26s'],
                     'F1 on the test set': [0.7362841240392262, 0.5349577433871144, 0.7275812335623992, 0.7726916034527858],
                     'Wall time test': ['1.44s', '14s', '32.2s', '36.7s']}
pivot_data_models = pd.DataFrame(pivot_data_models)
display(pivot_data_models.sort_values('F1 on the test set', ascending=False))

| | Model | Macro F1 (CV, training set) | Wall time CV | F1 on the test set | Wall time test |
|---|---|---|---|---|---|
| 3 | LGBMClassifier | 0.870156 | 2min 26s | 0.772692 | 36.7s |
| 0 | LogisticRegression | 0.848599 | 5.74s | 0.736284 | 1.44s |
| 2 | DecisionTreeClassifier | 0.845803 | 2min 46s | 0.727581 | 32.2s |
| 1 | RandomForestClassifier | 0.748264 | 3min 14s | 0.534958 | 14s |


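Based on the results above, the LGBMClassifier model achieves the highest F1 on the test set, 0.772692, and is the only model that exceeds the target of 0.75. LogisticRegression (0.736284) and DecisionTreeClassifier (0.727581) fall slightly short of the target, and RandomForestClassifier (0.534958) is far below it. The cross-validation scores are macro-averaged F1 and are therefore not directly comparable with the test F1, which is computed for the toxic class only. LGBMClassifier is recommended as the primary candidate for this classification task.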