<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>LogisticRegression</a></span><ul class="toc-item"><li><span><a href="#Pipeline" data-toc-modified-id="Pipeline-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Pipeline</a></span></li></ul></li><li><span><a href="#xgbClassifier" data-toc-modified-id="xgbClassifier-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>xgbClassifier</a></span></li><li><span><a href="#RandomForestClassifier" data-toc-modified-id="RandomForestClassifier-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>RandomForestClassifier</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

In [None]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
import xgboost as xgb
import time

## Preparation

In [None]:
data = pd.read_csv('/datasets/toxic_comments.csv')

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [None]:
data['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

The classes are clearly imbalanced: only about 10% of the comments (16,225 of 159,571) are toxic.
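To see why accuracy would be a misleading metric with this class balance, the sketch below uses the counts printed above to evaluate two trivial baselines: a classifier that never flags a comment and one that flags every comment. It is only an illustration; the helper name `f1_from_counts` is made up for this example.

```python
# Class counts taken from the value_counts() output above.
n_clean, n_toxic = 143346, 16225
n_total = n_clean + n_toxic


def f1_from_counts(tp, fp, fn):
    """F1 of the positive class from true positive, false positive and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


toxic_share = n_toxic / n_total

# Baseline 1: never flag anything. Accuracy is high, but no toxic comment is found.
acc_never = n_clean / n_total
f1_never = f1_from_counts(tp=0, fp=0, fn=n_toxic)

# Baseline 2: flag everything. Recall is 1, precision equals the toxic share.
f1_always = f1_from_counts(tp=n_toxic, fp=n_clean, fn=0)

print(f"toxic share: {toxic_share:.3f}")
print(f"never flag:  accuracy={acc_never:.3f}, F1={f1_never:.3f}")
print(f"always flag: F1={f1_always:.3f}")
```

The "never flag" baseline scores close to 0.9 accuracy while being useless for moderation, which is why the task is judged by F1.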

In [None]:
data['text'][0]

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

All comments are in English.

We will lemmatize the text, stripping punctuation and line breaks along the way. First, check for duplicate rows.

In [None]:
Dup_Rows = data[data.duplicated()]
print("\n\nDuplicate rows : \n {}".format(Dup_Rows))



Duplicate rows : 
 Empty DataFrame
Columns: [text, toxic]
Index: []


In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Drop punctuation: keep only word characters and whitespace
    text = re.sub(r'[^\w\s]', '', text)
    # Lemmatize each token using its part-of-speech tag
    lemmas = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(text)]
    # Joining with single spaces also removes the original line breaks
    return " ".join(lemmas)

In [None]:
#lemmatize_text('how are you?')

In [None]:
data['lemm_text'] = data['text'].apply(lemmatize_text)

In [None]:
data['lemm_text']

0         Explanation Why the edits make under my userna...
1         Daww He match this background colour Im seemin...
2         Hey man Im really not try to edit war Its just...
3         More I cant make any real suggestion on improv...
4         You sir be my hero Any chance you remember wha...
                                ...                        
159566    And for the second time of ask when your view ...
159567    You should be ashamed of yourself That be a ho...
159568    Spitzer Umm there no actual article for prosti...
159569    And it look like it be actually you who put on...
159570    And I really dont think you understand I come ...
Name: lemm_text, Length: 159571, dtype: object

In [None]:
print("Original text:", data['text'][2])
print("Lemmatized text:", data['lemm_text'][2])

Original text: Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.
Lemmatized text: Hey man Im really not try to edit war Its just that this guy be constantly remove relevant information and talk to me through edits instead of my talk page He seem to care more about the format than the actual info
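The tokens `Im` and `Its` in the output above come from the cleaning regex, which deletes apostrophes instead of separating the parts of a contraction. The sketch below reproduces only that cleaning step on a short made-up string. `str.split()` stands in for `nltk.word_tokenize` to keep the example dependency-free, and the space-replacing variant is a hypothetical alternative, not what the notebook uses.

```python
import re

raw = "Hey man, I'm really not trying to edit war.\nIt's just that..."

# Same pattern as in lemmatize_text: delete every character that is
# neither a word character nor whitespace.
dropped = re.sub(r"[^\w\s]", "", raw)

# Hypothetical alternative: replace punctuation with a space, so that
# contractions are split instead of glued together.
spaced = re.sub(r"[^\w\s]", " ", raw)

# split() with no arguments splits on any whitespace run, including "\n",
# so re-joining with single spaces removes the line break.
dropped_clean = " ".join(dropped.split())
spaced_clean = " ".join(spaced.split())

print(dropped_clean)  # Hey man Im really not trying to edit war Its just that
print(spaced_clean)   # Hey man I m really not trying to edit war It s just that
```

Neither variant is obviously better for TF-IDF features: the first creates tokens such as `Im`, the second creates one-letter tokens that the default scikit-learn tokenizer ignores.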


Build the corpus as a list of lemmatized texts.

In [None]:
corpus = list(data['lemm_text'])

Download the English stop words; they are passed to the TF-IDF vectorizer later.

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
stop_words = set(stopwords.words('english'))

Split the data into training and validation sets.

In [None]:
features = data['lemm_text']
target = data['toxic']

In [None]:
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345) 
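With only about 10% positives, a plain random split can leave the two parts with slightly different toxic shares. `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. A minimal sketch on synthetic labels (the toy arrays are assumptions, not the project data; the split in this notebook was made without stratification):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 samples, 10% positives.
toy_target = np.array([1] * 100 + [0] * 900)
toy_features = np.arange(len(toy_target))

x_train, x_valid, y_train, y_valid = train_test_split(
    toy_features,
    toy_target,
    test_size=0.25,
    random_state=12345,
    stratify=toy_target,  # preserve the class ratio in both parts
)

print("train toxic share:", y_train.mean())
print("valid toxic share:", y_valid.mean())
```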

In [None]:
features_train

111565    Ath Cliath section We have a bit of a problem ...
8575      Sure thing By the way I have a new userbox tha...
153402    We be in the same boat a Britannica which be a...
65019     I have a look through the section on theology ...
155787    Warren Commission Exhibit 746E the Chin It be ...
                                ...                        
109993    Gam keep your CRAP off my Talk Page Your KKK s...
85412                               I correct what you list
133249    hmm yes a I say I also watch the Emily Ruetear...
130333                          kinda like miachael jackson
77285     criterion for deletion Im just wonder what be ...
Name: lemm_text, Length: 119678, dtype: object

## Training

In [None]:
# Recent scikit-learn versions expect a list (not a set) for stop_words
count_tf_idf = TfidfVectorizer(stop_words=list(stop_words))

In [None]:
# Fit the vectorizer on the training texts only, then reuse it for validation
train_tfidf = count_tf_idf.fit_transform(features_train)
test_tfidf = count_tf_idf.transform(features_valid)

### LogisticRegression

#### Pipeline

In [None]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("LGR", LogisticRegression(class_weight='balanced',solver='lbfgs', max_iter=1000)),
    ]
)

parameters = {'LGR__C': [1, 10, 0.1]}
# No `scoring` is passed, so candidates are ranked by accuracy
# (the default score for classifiers), not by F1
grid_search = GridSearchCV(pipeline, parameters, verbose=1)
print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)

grid_search.fit(features_train, target_train)

predict = grid_search.predict(features_valid)
f1 = f1_score(target_valid,predict)
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'tfidf', 'LGR']
parameters:
{'LGR__C': [1, 10, 0.1]}
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best score: 0.950
Best parameters set:
	LGR__C: 10


In [None]:
print(f1)

0.7753465459961049


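With the pipeline, F1 on the validation set is about 0.775, slightly better than the 0.765 of the variant below, which vectorizes the data once before the grid search. The two variants are not identical: the pipeline's `CountVectorizer` does not remove stop words.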
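One reason to prefer a pipeline inside `GridSearchCV` is that the vectorizer is refit on the training part of every fold, so the held-out fold never influences the vocabulary or the IDF weights. The sketch below shows that setup and also passes `scoring='f1'`, so the cross-validated score is measured in the same metric as the validation result. The corpus, labels and step names are made up for illustration and are far too small to say anything about model quality.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = toxic, 0 = not toxic.
toy_texts = [
    "you are an idiot", "stupid idiot go away", "shut up you moron",
    "what a stupid moron", "idiot moron shut up", "go away stupid",
    "thanks for the edit", "please see the talk page", "nice article thanks",
    "i fixed the reference", "see the new section please", "thanks for the fix",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # refit on the training part of every fold
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

toy_search = GridSearchCV(
    toy_pipeline,
    {"clf__C": [0.1, 1, 10]},
    scoring="f1",  # rank candidates by F1 instead of the default accuracy
    cv=3,
)
toy_search.fit(toy_texts, toy_labels)

print(toy_search.best_params_, round(toy_search.best_score_, 3))
```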

In [None]:
def LogisticRegression_model():
    model = LogisticRegression(class_weight='balanced',solver='lbfgs', max_iter=1000)
    clf = GridSearchCV(model, {'C': [1, 10, 0.1]})            
                                                                    
    clf.fit(train_tfidf, target_train)
    predict = clf.predict(test_tfidf)
    f1 = f1_score(target_valid,predict)
    print(f1)
    print(clf.best_params_)

In [None]:
LogisticRegression_model()

0.7651221182378453
{'C': 10}


After tuning the regularization parameter and balancing the class weights, we get F1 ≈ 0.765 on the validation set, which meets the task requirement (F1 of at least 0.75).
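For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`, as described in the scikit-learn documentation. The sketch below evaluates that formula on the full-dataset counts shown earlier. The model itself sees only the training split, so the weights it actually uses differ slightly.

```python
# Full-dataset class counts from the value_counts() output (label -> count).
class_counts = {0: 143346, 1: 16225}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" heuristic: n_samples / (n_classes * count of that class).
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these weights both classes contribute the same total weight to the loss, so errors on the rare toxic class are no longer outvoted by the majority class.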

### xgbClassifier

In [None]:
def xgbClassifier_model():
    model_xgb = xgb.XGBClassifier(booster='gbtree',use_label_encoder =False,eval_metric='mlogloss')
    clf = GridSearchCV(model_xgb, {'max_depth': [1, 10, 1],
                                    #'subsample':[0.5,0.9,0.1]
                                    #'colsample_bytree':[0.5,0.9,0.1]
                                    'n_estimators': [1, 50, 5]}, verbose=1,
                                     scoring='f1')
                                                                    
    clf.fit(train_tfidf, target_train)
    predict = clf.predict(test_tfidf)
    f1 = f1_score(target_valid,predict)
    print(f1)
    print(clf.best_params_)

In [None]:
xgbClassifier_model()

Fitting 5 folds for each of 9 candidates, totalling 45 fits
0.7167819866307025
{'max_depth': 10, 'n_estimators': 50}


Training took fairly long. XGBoost reached F1 ≈ 0.717 on the validation set, which is below the 0.75 target.
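A note on the grid in the cell above: `GridSearchCV` treats each list as an explicit set of candidate values, not as `(start, stop, step)`. The log reports 9 candidates because each list has three entries, but `[1, 10, 1]` repeats the value 1, so fewer distinct combinations are actually tried. The sketch below counts them and shows one way a range could be spelled out. The wider grid values are hypothetical and were not run in this project.

```python
from itertools import product

# The grid as written in the cell above.
grid = {"max_depth": [1, 10, 1], "n_estimators": [1, 50, 5]}

candidates = list(product(grid["max_depth"], grid["n_estimators"]))
distinct = set(candidates)
print(len(candidates), "candidates,", len(distinct), "distinct")

# If a range was intended, it has to be expanded explicitly.
# These values are hypothetical and were not evaluated here.
range_grid = {
    "max_depth": list(range(1, 11, 3)),       # 1, 4, 7, 10
    "n_estimators": list(range(10, 51, 20)),  # 10, 30, 50
}
print(range_grid)
```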

### RandomForestClassifier

In [None]:
def RandomForestClassifier_model():
    model_RF = RandomForestClassifier(random_state=12345)
    clf = GridSearchCV(model_RF, {'max_depth': [1, 10, 1],
                               'n_estimators': [1, 50, 5]}, verbose=1,
                                scoring='f1')
                                                                    
    clf.fit(train_tfidf, target_train)
    predict = clf.predict(test_tfidf)
    f1 = f1_score(target_valid,predict)
    print(f1)
    print(clf.best_params_)

In [None]:
RandomForestClassifier_model()

Fitting 5 folds for each of 9 candidates, totalling 45 fits
0.025949062950504566
{'max_depth': 10, 'n_estimators': 1}


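## Conclusions

With the small grids tried here (at most 50 trees, depth at most 10), random forest and gradient boosting did not reach the required quality: XGBoost scored F1 ≈ 0.72 and RandomForestClassifier F1 ≈ 0.03 on the validation set. LogisticRegression with balanced class weights performed best; the pipeline variant reached F1 ≈ 0.775, above the 0.75 target.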

## Review checklist

- [x]  Jupyter Notebook opens
- [x]  All code runs without errors
- [x]  Code cells are in execution order
- [x]  Data is loaded and prepared
- [x]  Models are trained
- [x]  The *F1* score is at least 0.75
- [x]  Conclusions are written