# Project for Wikishop («Викишоп»)

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation
### Importing libraries and loading data

In [1]:
import re
import numpy as np
import pandas as pd
# Show the full comment text in DataFrame output (None replaces the deprecated -1)
pd.set_option('display.max_colwidth', None)

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# WordNetLemmatizer also needs the 'wordnet' corpus to be available locally
nltk.download('stopwords')
# Note: this rebinds the name `stopwords` from the corpus reader to a set of English stop words
stopwords = set(stopwords.words('english'))

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv', index_col=0)
data.head()

Unnamed: 0,text,toxic
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0
4,"You, sir, are my hero. Any chance you remember what page that's on?",0


### Data analysis


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


There are no missing values, and the data types are correct.  
Let's check for exact duplicates.

In [4]:
data.duplicated().sum()

0

Now let's look at the class balance in the dataset.

In [5]:
data['toxic'].value_counts()

0    143106
1    16186 
Name: toxic, dtype: int64

The classes are imbalanced: about 10% of the comments are toxic (16,186 of 159,292). We will account for this when building the models.  
Now we can move on to preprocessing.
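
To make the imbalance concrete, here is a minimal sketch that computes the toxic share and the per-class weights given by the "balanced" heuristic, `n_samples / (n_classes * class_count)`. The counts are copied from the `value_counts()` output above. The models below are fitted on the training split only, so the weights they actually use will differ slightly from these.

```python
# Class counts copied from the value_counts() output above
counts = {0: 143106, 1: 16186}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of toxic comments in the whole dataset
toxic_share = counts[1] / n_samples

# "Balanced" heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"toxic share: {toxic_share:.3f}")
for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these weights, an error on a toxic comment costs the model roughly nine times more than an error on a non-toxic one.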

### Data preprocessing

First, we lowercase the text and use a regular expression to remove the extraneous characters that stand out even in the first rows of the dataset.

In [6]:
data['edited_text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z ]', ' ', x.lower()))
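
As an illustration, the same substitution can be applied to a single string. The helper name `clean_text` is introduced here for the example only, and the sample is one of the comments shown in the dataset preview.

```python
import re

def clean_text(text):
    # Lowercase, then replace every character that is not a Latin letter
    # or a space with a space (digits, punctuation, newlines, apostrophes)
    return re.sub(r'[^a-zA-Z ]', ' ', text.lower())

sample = "D'aww! He matches this background colour I'm seemingly stuck with."
cleaned = clean_text(sample)

print(cleaned)
print(cleaned.split())
```

Because apostrophes are replaced too, contractions such as "I'm" are split into the separate tokens "i" and "m".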

Next come tokenization and lemmatization.

In [7]:
WNL = WordNetLemmatizer()

def lemmatize(text):
    # `text` is a list of tokens; without a POS tag each token is lemmatized as a noun
    return [WNL.lemmatize(i) for i in text]

data['edited_text'] = data['edited_text'].apply(lambda x: lemmatize(x.split()))

Now let's see what the transformations produced.

In [8]:
data.head()

Unnamed: 0,text,toxic,edited_text
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,"[explanation, why, the, edits, made, under, my, username, hardcore, metallica, fan, were, reverted, they, weren, t, vandalism, just, closure, on, some, gas, after, i, voted, at, new, york, doll, fac, and, please, don, t, remove, the, template, from, the, talk, page, since, i, m, retired, now]"
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,"[d, aww, he, match, this, background, colour, i, m, seemingly, stuck, with, thanks, talk, january, utc]"
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,"[hey, man, i, m, really, not, trying, to, edit, war, it, s, just, that, this, guy, is, constantly, removing, relevant, information, and, talking, to, me, through, edits, instead, of, my, talk, page, he, seems, to, care, more, about, the, formatting, than, the, actual, info]"
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0,"[more, i, can, t, make, any, real, suggestion, on, improvement, i, wondered, if, the, section, statistic, should, be, later, on, or, a, subsection, of, type, of, accident, i, think, the, reference, may, need, tidying, so, that, they, are, all, in, the, exact, same, format, ie, date, format, etc, i, can, do, that, later, on, if, no, one, else, doe, first, if, you, have, any, preference, for, formatting, style, on, reference, or, want, to, do, it, yourself, please, let, me, know, there, appears, to, be, a, backlog, on, article, for, review, so, i, guess, there, may, be, a, delay, until, a, ...]"
4,"You, sir, are my hero. Any chance you remember what page that's on?",0,"[you, sir, are, my, hero, any, chance, you, remember, what, page, that, s, on]"


### Train/test split and TF-IDF

In [9]:
features_train, features_test, target_train, target_test = \
    train_test_split(data['edited_text'], data['toxic'], test_size=0.25, random_state=42)

In [10]:
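# Recent scikit-learn versions expect stop_words as a list, not a set
tf_idf = TfidfVectorizer(stop_words=list(stopwords))
# Join each token list back into one space-separated string before vectorizing
features_train = tf_idf.fit_transform(features_train.str.join(' '))
features_test = tf_idf.transform(features_test.str.join(' '))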

features_train.shape, features_test.shape

((119469, 132803), (39823, 132803))
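
For intuition about what the vectorizer computes, here is a minimal pure-Python sketch of TF-IDF. It assumes scikit-learn's default settings (smoothed IDF and L2 normalization) and skips stop-word removal. The toy corpus is hypothetical and is not taken from the dataset.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already tokenized comments
docs = [
    ["you", "are", "my", "hero"],
    ["you", "are", "stupid"],
    ["stupid", "stupid", "edit"],
]

n_docs = len(docs)
# Document frequency: the number of documents in which each term occurs
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + n_docs) / (1 + df)) + 1
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    # L2-normalize so that every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

Terms that appear in fewer documents, such as "hero", receive a larger weight than terms shared by several documents, such as "you".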

**Conclusion:** The data has been loaded, explored, and prepared for modeling.

## Training

In [11]:
models = ['LogisticRegression', 'RandomForestClassifier']
scores = []
params = []

### LogisticRegression

In [12]:
%%time
parameters = {
    'C': [5, 10, 15]
}

logreg = GridSearchCV(LogisticRegression(random_state=42, class_weight='balanced'), parameters,
                      cv=3, scoring='f1', error_score='raise', verbose=10)
logreg.fit(features_train, target_train)

scores.append(logreg.best_score_)
params.append(logreg.best_params_)

Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV 1/3; 1/3] START C=5.........................................................
[CV 1/3; 1/3] END .......................................C=5; total time=  42.0s
[CV 2/3; 1/3] START C=5.........................................................
[CV 2/3; 1/3] END .......................................C=5; total time=  39.7s
[CV 3/3; 1/3] START C=5.........................................................
[CV 3/3; 1/3] END .......................................C=5; total time=  41.0s
[CV 1/3; 2/3] START C=10........................................................
[CV 1/3; 2/3] END ......................................C=10; total time=  38.8s
[CV 2/3; 2/3] START C=10........................................................
[CV 2/3; 2/3] END ......................................C=10; total time=  40.0s
[CV 3/3; 2/3] START C=10........................................................
[CV 3/3; 2/3] END ................................

### RandomForestClassifier

In [13]:
%%time
parameters = {
    'n_estimators': [20, 40],
    'max_depth': [5, 8, 10]
} 

RFC = GridSearchCV(RandomForestClassifier(random_state=42, class_weight='balanced'), parameters,
                   cv=3, scoring='f1', error_score='raise', verbose=10)
RFC.fit(features_train, target_train)

scores.append(RFC.best_score_)
params.append(RFC.best_params_)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV 1/3; 1/6] START max_depth=5, n_estimators=20................................
[CV 1/3; 1/6] END ..............max_depth=5, n_estimators=20; total time=   1.6s
[CV 2/3; 1/6] START max_depth=5, n_estimators=20................................
[CV 2/3; 1/6] END ..............max_depth=5, n_estimators=20; total time=   1.5s
[CV 3/3; 1/6] START max_depth=5, n_estimators=20................................
[CV 3/3; 1/6] END ..............max_depth=5, n_estimators=20; total time=   1.6s
[CV 1/3; 2/6] START max_depth=5, n_estimators=40................................
[CV 1/3; 2/6] END ..............max_depth=5, n_estimators=40; total time=   3.0s
[CV 2/3; 2/6] START max_depth=5, n_estimators=40................................
[CV 2/3; 2/6] END ..............max_depth=5, n_estimators=40; total time=   2.9s
[CV 3/3; 2/6] START max_depth=5, n_estimators=40................................
[CV 3/3; 2/6] END ..............max_depth=5, n_es

In [14]:
print('Cross-validation results')
pd.DataFrame(data={'F1':scores,
                   'Parameters':params},
                    index=models)

Cross-validation results


Unnamed: 0,F1,Parameters
LogisticRegression,0.758252,{'C': 10}
RandomForestClassifier,0.341406,"{'max_depth': 10, 'n_estimators': 40}"


Logistic regression performed best in cross-validation, so we evaluate it on the test set.

In [16]:
predictions = logreg.best_estimator_.predict(features_test)
f1_score(target_test, predictions)

0.7668213457076566
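
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it by hand; the function name `f1_from_labels` and the label lists are hypothetical and are not taken from the dataset.

```python
def f1_from_labels(y_true, y_pred):
    # Count outcomes for the positive class (label 1)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: F1 is defined as 0 here
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

score = f1_from_labels(y_true, y_pred)
print(round(score, 3))

# A model that never predicts the toxic class gets F1 = 0,
# even though its accuracy on an imbalanced dataset would look high
print(f1_from_labels(y_true, [0] * len(y_true)))
```

This is why F1 rather than accuracy is a sensible metric for a dataset in which only about one comment in ten is toxic.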

## Conclusions

We built several models that classify comments as positive or negative.
1. The data was loaded and preprocessed.
2. Models were compared across different hyperparameter sets.
3. The best model was selected by the F1 score.

Models used:
- LogisticRegression
- RandomForestClassifier

Logistic regression with C=10 achieved the best cross-validation F1 (0.758, versus 0.341 for the random forest) and scored 0.767 on the test set, which meets the required threshold of 0.75.