<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Тестирование" data-toc-modified-id="Тестирование-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Testing</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop («Викишоп»)

The Wikishop online store is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* for this project is optional, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [1]:
import pandas as pd
import numpy as np
import re
import warnings

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

import spacy
from tqdm import tqdm

# suppress non-critical warnings
warnings.filterwarnings("ignore")

# widen the displayed column width so long comments stay readable
pd.set_option('display.max_colwidth', 500)

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
# global random state for reproducibility

STATE = np.random.RandomState(42)

In [3]:
# load the dataset: try the local path first, then the platform path

try:
    data = pd.read_csv('/Users/alex/Downloads/toxic_comments.csv')
except FileNotFoundError:
    data = pd.read_csv('/datasets/toxic_comments.csv')

In [4]:
data.head()

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess t...",0
4,4,"You, sir, are my hero. Any chance you remember what page that's on?",0


In [5]:
# look at the target distribution

data['toxic'].value_counts(normalize = True, dropna = False)

0    0.898388
1    0.101612
Name: toxic, dtype: float64
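
Only about 10% of the comments are toxic, which is why the models later in this notebook are created with `class_weight = 'balanced'`. As a rough sketch of what that option does, scikit-learn documents the balanced weight as `n_samples / (n_classes * count)`; the helper `balanced_weights` below is a hypothetical name that applies the same formula to the class shares printed above.

```python
# Class shares copied from the value_counts output above.
class_share = {0: 0.898388, 1: 0.101612}

def balanced_weights(shares):
    """Weight per class: 1 / (n_classes * share), i.e. n_samples / (n_classes * count)."""
    n_classes = len(shares)
    return {label: 1.0 / (n_classes * share) for label, share in shares.items()}

weights = balanced_weights(class_share)
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Under this formula a toxic comment weighs roughly nine times as much as a non-toxic one, so both classes contribute equally to the loss in total.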

In [6]:
# dataset size

len(data)

159292

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [8]:
# drop the unneeded 'Unnamed: 0' column

data.drop(['Unnamed: 0'], axis = 1, inplace = True)

__Preparing the text__

In [9]:
# use spaCy for lemmatization (parser and NER are disabled for speed)

nlp = spacy.load('en_core_web_sm', disable = ['parser', 'ner'])
# note: this rebinds the name `stopwords` from the NLTK module to a plain list
stopwords = stopwords.words('english')


def words_only(text):
    # lowercase and replace every non-letter character with a space
    return ' '.join(re.sub(r'[^a-zA-Z]', ' ', text.lower()).split())

def lemmatize_text(text, nlp):
    return [token.lemma_ for token in nlp(text)]

def remove_stopwords(text, stopwords):
    # drop stop words and one-letter tokens
    return [word for word in text if word not in stopwords and len(word) > 1]

def clean_text(text, nlp, stopwords):
    
    # strip non-letter characters
    text_words = words_only(text)
    
    # lemmatize
    text_lemmas = lemmatize_text(text_words, nlp)
    
    # remove stop words
    return ' '.join(remove_stopwords(text_lemmas, stopwords))



lemmas = []

for i in tqdm(range(len(data))):
    lemmas.append(clean_text(data['text'][i], nlp, stopwords))
    
data['lemmas'] = lemmas

100%|██████████| 159292/159292 [14:24<00:00, 184.25it/s]
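
To make the cleaning steps concrete, here is a minimal sketch that runs the regex step and the stop-word filter on one comment from the dataset. It deliberately skips spaCy lemmatization so that it runs without a language model, and `DEMO_STOPWORDS` is a hypothetical, tiny stand-in for the NLTK English list.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook uses the full NLTK English list.
DEMO_STOPWORDS = {"he", "this", "with", "i", "m"}

def words_only(text):
    """Lowercase the text and replace every non-letter character with a space."""
    return " ".join(re.sub(r"[^a-zA-Z]", " ", text.lower()).split())

def drop_stopwords(tokens, stop_words):
    """Remove stop words and one-letter leftovers such as 'd' from "D'aww"."""
    return [t for t in tokens if t not in stop_words and len(t) > 1]

raw = "D'aww! He matches this background colour I'm seemingly stuck with."
letters = words_only(raw)
cleaned = " ".join(drop_stopwords(letters.split(), DEMO_STOPWORDS))

print(letters)  # d aww he matches this background colour i m seemingly stuck with
print(cleaned)  # aww matches background colour seemingly stuck
```

Splitting on apostrophes leaves fragments such as `d`, `i` and `m`, which is what the `len(word) > 1` condition removes. Without lemmatization `matches` stays as it is, whereas the notebook output shows it reduced to `match`.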


In [11]:
data.head()

Unnamed: 0,text,toxic,lemmas
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,explanation edit make username hardcore metallica fan revert vandalism closure gas vote new york doll fac please remove template talk page since retire
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,aww match background colour seemingly stuck thank talk january utc
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,hey man really try edit war guy constantly remove relevant information talk edit instead talk page seem care formatting actual info
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess t...",0,make real suggestion improvement wonder section statistic later subsection type accident think reference may need tidy exact format ie date format etc later one else first preference format style reference want please let know appear backlog article review guess may delay reviewer turn list relevant form eg wikipedia good article nomination transport
4,"You, sir, are my hero. Any chance you remember what page that's on?",0,sir hero chance remember page


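After cleaning the text (lowercasing, stripping non-letter characters, and so on), we check for duplicate rows. Note that `duplicated()` compares whole rows, including the original `text` column, so comments that differ only in case or punctuation are not counted as duplicates.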

In [12]:
data.duplicated().sum()

0

In [13]:
# drop duplicate rows (a no-op here, since none were found)

data.drop_duplicates(inplace = True)
data.duplicated().sum()

0

__Preparing the splits__

In [14]:
# split the data, stratifying on the target

train, test = train_test_split(data, test_size = 0.2, stratify = data['toxic'], random_state=STATE)

corpus_train = train['lemmas'].values
corpus_test = test['lemmas'].values

target_train = train['toxic']
target_test = test['toxic']

print(f'\nTRAIN size: {corpus_train.shape[0]}\nclass counts (0 = non-toxic, 1 = toxic)\n{target_train.value_counts()}')
print(f'\nTEST size: {corpus_test.shape[0]}\nclass counts (0 = non-toxic, 1 = toxic)\n{target_test.value_counts()}')


TRAIN size: 127433
class counts (0 = non-toxic, 1 = toxic)
0    114484
1     12949
Name: toxic, dtype: int64

TEST size: 31859
class counts (0 = non-toxic, 1 = toxic)
0    28622
1     3237
Name: toxic, dtype: int64
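
A quick arithmetic check, using only the counts printed above, suggests that stratification kept the toxic share the same in both parts. The helper name `toxic_share` is introduced here for illustration.

```python
# Class counts copied from the split output above.
train_counts = {0: 114484, 1: 12949}
test_counts = {0: 28622, 1: 3237}

def toxic_share(counts):
    """Fraction of rows labelled toxic (class 1)."""
    return counts[1] / sum(counts.values())

train_share = toxic_share(train_counts)
test_share = toxic_share(test_counts)

print(f"toxic share in train: {train_share:.4f}")
print(f"toxic share in test:  {test_share:.4f}")
```

Both shares come out at about 0.1016, matching the distribution of the full dataset.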


__Building the vectorizer__

In [15]:
# fit TF-IDF on the training corpus only, then apply it to the test corpus
count_tf_idf = TfidfVectorizer(min_df = 2, stop_words = stopwords) 

tf_idf_train = count_tf_idf.fit_transform(corpus_train) 
tf_idf_test = count_tf_idf.transform(corpus_test) 

print(f'Train matrix shape: {tf_idf_train.shape}')
print(f'Test matrix shape: {tf_idf_test.shape}')

Train matrix shape: (127433, 52764)
Test matrix shape: (31859, 52764)
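
For intuition about what ends up in these matrices, the sketch below computes TF-IDF by hand on a toy corpus. It assumes the default settings documented for `TfidfVectorizer` (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization of each row) and does not model `min_df` or stop-word filtering. The function `tfidf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return (idf, rows): smoothed idf per term and one L2-normalized dict per document."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows

# Hypothetical toy corpus of already-cleaned comments.
toy_corpus = ["thank talk page", "stupid page", "stupid stupid idiot"]
idf, rows = tfidf(toy_corpus)

print({t: round(v, 3) for t, v in sorted(idf.items())})
print({t: round(v, 3) for t, v in rows[2].items()})
```

Terms that occur in fewer documents get a larger idf, so rare words weigh more than common ones, and every row has unit length regardless of how long the comment is.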


___Interim summary___
- imported the required libraries and loaded the data file
- dropped the uninformative 'Unnamed: 0' column
- wrote functions for cleaning and lemmatizing the text
- checked for duplicate rows (none were found)
- vectorized the text with TF-IDF

## Training

We train the following models and tune their hyperparameters:
- LogisticRegression
- DecisionTreeClassifier
- RandomForestClassifier
- LGBMClassifier

__LogisticRegression__

In [16]:
# note: n_jobs has no effect with solver = 'liblinear'
lr = LogisticRegression(random_state = STATE, solver = 'liblinear', class_weight = 'balanced', n_jobs = -1)

params = {
   'penalty':['l1', 'l2'],        
   'C': list(range(1, 20, 2)) 
}

grid_lr = GridSearchCV(lr, params, cv = 3, scoring = 'f1', verbose = True)
grid_lr.fit(tf_idf_train, target_train)

print(f'Best hyperparameters: {grid_lr.best_params_}')
print(f'Best cross-validated F1 for LogisticRegression: {grid_lr.best_score_}')

Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best hyperparameters: {'C': 5, 'penalty': 'l2'}
Best cross-validated F1 for LogisticRegression: 0.7619761969999032
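
The "20 candidates, 60 fits" line in the output follows directly from the grid. A small sketch with `itertools.product` enumerates the same combinations; it only counts candidates and does not train anything.

```python
from itertools import product

# Same grid as in the LogisticRegression cell above.
params = {
    "penalty": ["l1", "l2"],
    "C": list(range(1, 20, 2)),
}
cv_folds = 3

# Every combination of parameter values is one candidate.
names = list(params)
candidates = [dict(zip(names, values)) for values in product(*params.values())]

print(f"{len(candidates)} candidates x {cv_folds} folds = {len(candidates) * cv_folds} fits")
print(candidates[:3])
```

With the default `refit = True`, `GridSearchCV` additionally fits the best candidate once more on the whole training set, which is the model later used for prediction.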


__DecisionTreeClassifier__

In [17]:
dt = DecisionTreeClassifier(random_state = STATE, class_weight = 'balanced')

params_dt = {
   'criterion': ['gini', 'entropy'],        
   'max_depth': list(range(2, 20, 2))
}

grid_dt = GridSearchCV(dt, params_dt, cv = 3, scoring = 'f1', verbose = True, n_jobs = -1)
grid_dt.fit(tf_idf_train, target_train)

print(f'Best hyperparameters: {grid_dt.best_params_}')
print(f'Best cross-validated F1 for DecisionTreeClassifier: {grid_dt.best_score_}')

Fitting 3 folds for each of 18 candidates, totalling 54 fits
Best hyperparameters: {'criterion': 'gini', 'max_depth': 18}
Best cross-validated F1 for DecisionTreeClassifier: 0.6268644554120802


__RandomForestClassifier__

In [18]:
rf = RandomForestClassifier(random_state = STATE, class_weight = 'balanced')

params_rf = {
    'n_estimators': [70, 80, 100],
    'max_depth': range(1, 40, 10)
}

grid_rf = GridSearchCV(rf, params_rf, cv = 3, scoring = 'f1', verbose = True, n_jobs = -1)
grid_rf.fit(tf_idf_train, target_train)

print(f'Best hyperparameters: {grid_rf.best_params_}')
print(f'Best cross-validated F1 for RandomForestClassifier: {grid_rf.best_score_}')

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best hyperparameters: {'max_depth': 31, 'n_estimators': 100}
Best cross-validated F1 for RandomForestClassifier: 0.4582261348912236


__LGBMClassifier__

In [19]:
model_lgb = LGBMClassifier(random_state = STATE, class_weight = 'balanced', n_jobs = -1)

param_lgb = {
    "max_depth" : [20],
    "n_estimators" : [100]
}

grid_lgb = GridSearchCV(model_lgb, param_lgb, cv = 3, scoring = 'f1', verbose = True)
lgb_model = grid_lgb.fit(tf_idf_train, target_train)

print(f'Best cross-validated F1 for LGBMClassifier: {lgb_model.best_score_} with parameters {lgb_model.best_params_}')

Fitting 3 folds for each of 1 candidates, totalling 3 fits
Best cross-validated F1 for LGBMClassifier: 0.7449946332995899 with parameters {'max_depth': 20, 'n_estimators': 100}


In [20]:
print(f'Best cross-validated F1 for LogisticRegression: {grid_lr.best_score_}')
print(f'Best cross-validated F1 for DecisionTreeClassifier: {grid_dt.best_score_}')
print(f'Best cross-validated F1 for RandomForestClassifier: {grid_rf.best_score_}')
print(f'Best cross-validated F1 for LGBMClassifier: {lgb_model.best_score_}')

Best cross-validated F1 for LogisticRegression: 0.7619761969999032
Best cross-validated F1 for DecisionTreeClassifier: 0.6268644554120802
Best cross-validated F1 for RandomForestClassifier: 0.4582261348912236
Best cross-validated F1 for LGBMClassifier: 0.7449946332995899


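Logistic regression performed best in cross-validation (F1 ≈ 0.762, against 0.745 for LGBMClassifier, 0.627 for DecisionTreeClassifier and 0.458 for RandomForestClassifier), so it is the model we evaluate on the test data.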

## Testing

In [21]:
# evaluate the tuned LogisticRegression on the test data

predict = grid_lr.predict(tf_idf_test)
f1 = f1_score(target_test, predict)

print(f'F1 of LogisticRegression on the test data: {f1}')

F1 of LogisticRegression on the test data: 0.7578858825178747
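
As a reminder of what this number measures, the sketch below computes F1 for the positive (toxic) class from true positives, false positives and false negatives. The function `f1_from_labels` and the toy labels are hypothetical and are not taken from the dataset.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from TP, FP and FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy labels, not taken from the dataset.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
score = f1_from_labels(y_true, y_pred)
print(score)  # 0.75

# A model that never predicts "toxic" scores zero, however accurate it looks.
print(f1_from_labels(y_true, [0] * len(y_true)))  # 0.0
```

Because true negatives do not enter the formula, F1 is a stricter measure than accuracy on data where only about 10% of the comments are toxic.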


___CONCLUSION___

The goal "build a model with an F1 score of at least 0.75" was met: the tuned LogisticRegression reaches F1 ≈ 0.758 on the test set (0.762 in cross-validation).
