<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#SGDClassifier" data-toc-modified-id="SGDClassifier-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>SGDClassifier</a></span></li><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>LogisticRegression</a></span></li><li><span><a href="#AdaBoostClassifier" data-toc-modified-id="AdaBoostClassifier-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>AdaBoostClassifier</a></span></li><li><span><a href="#SVM" data-toc-modified-id="SVM-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>SVM</a></span></li></ul></li><li><span><a href="#Проверка-лучшей-модели-на-тестовой-выборке" data-toc-modified-id="Проверка-лучшей-модели-на-тестовой-выборке-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Evaluating the best model on the test set</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop («Викишоп»)

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.
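
As a reminder of what the target metric measures, here is a minimal sketch of F1 computed from confusion counts for the positive (toxic) class. The function name `f1_from_counts` and the counts are hypothetical and serve only as an illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 = 2*TP / (2*TP + FP + FN); 0.0 when the denominator is zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts for the toxic class, for illustration only.
example_f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(round(example_f1, 4))  # 0.7619
```

Because FP and FN enter the formula symmetrically, swapping the true and predicted labels does not change the binary F1 value.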

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score,make_scorer,accuracy_score
from sklearn.svm import SVC,LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

from tqdm import notebook,tqdm, trange
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.dummy import DummyClassifier
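# nltk.download takes one package id per call (its second positional argument is download_dir)
for nltk_package in ('wordnet', 'stopwords', 'punkt', 'averaged_perceptron_tagger'):
    nltk.download(nltk_package)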
from nltk.corpus import wordnet
from lightgbm import LGBMClassifier
import lightgbm as lgb
import time 
 


from IPython.display import clear_output
import matplotlib.pyplot as plt
import io

import torch
import warnings
warnings.filterwarnings('ignore')

In [2]:
try:
    comments = pd.read_csv(r"C:\Users\Специалист\Downloads\toxic_comments.csv")
except FileNotFoundError:
    # fall back to the path used in the hosted environment
    comments = pd.read_csv('/datasets/toxic_comments.csv')

comments.head(10)


Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


In [3]:
comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [4]:
comments.duplicated().sum()

0

In [5]:
comments.describe(include='all')

Unnamed: 0.1,Unnamed: 0,text,toxic
count,159292.0,159292,159292.0
unique,,159292,
top,,well i can guarantee that brazil has more pote...,
freq,,1,
mean,79725.697242,,0.101612
std,46028.837471,,0.302139
min,0.0,,0.0
25%,39872.75,,0.0
50%,79721.5,,0.0
75%,119573.25,,0.0


In [6]:
# Most comments are non-toxic: the split is about 89.8% to 10.2%
comments.toxic.value_counts()/comments.shape[0]*100 

0    89.838787
1    10.161213
Name: toxic, dtype: float64

In [7]:
# convert all text to lowercase
comments['text'] = comments['text'].str.lower()

In [8]:
# keep only Latin letters (either case) and digits; everything else becomes a space
comments_new = []
pattern = r'[^a-zA-Z0-9]'
for sentence in comments.text:
    cleared_text = re.sub(pattern, " ", sentence)
    # split/join collapses runs of whitespace into single spaces
    comments_new.append(" ".join(cleared_text.split()))
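
The same cleaning can also be expressed with vectorized pandas string methods instead of an explicit loop. This is a sketch on a toy `Series`; the sample strings are made up, and on the real data the `text` column would take their place.

```python
import pandas as pd

# Toy stand-in for the `text` column; the real data comes from toxic_comments.csv.
texts = pd.Series(["D'aww! He matches\nthis colour.", "Hey man, I'm  OK 42"])

clean = (
    texts.str.lower()
    .str.replace(r"[^a-z0-9]", " ", regex=True)  # non-alphanumerics -> space
    .str.replace(r"\s+", " ", regex=True)        # collapse runs of whitespace
    .str.strip()                                  # drop leading/trailing spaces
)
print(clean.tolist())
```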

In [9]:
comments["clear_text"]=comments_new
comments.head(10)

Unnamed: 0.1,Unnamed: 0,text,toxic,clear_text
0,0,explanation\nwhy the edits made under my usern...,0,explanation why the edits made under my userna...
1,1,d'aww! he matches this background colour i'm s...,0,d aww he matches this background colour i m se...
2,2,"hey man, i'm really not trying to edit war. it...",0,hey man i m really not trying to edit war it s...
3,3,"""\nmore\ni can't make any real suggestions on ...",0,more i can t make any real suggestions on impr...
4,4,"you, sir, are my hero. any chance you remember...",0,you sir are my hero any chance you remember wh...
5,5,"""\n\ncongratulations from me as well, use the ...",0,congratulations from me as well use the tools ...
6,6,cocksucker before you piss around on my work,1,cocksucker before you piss around on my work
7,7,your vandalism to the matt shirvington article...,0,your vandalism to the matt shirvington article...
8,8,sorry if the word 'nonsense' was offensive to ...,0,sorry if the word nonsense was offensive to yo...
9,9,alignment on this subject and which are contra...,0,alignment on this subject and which are contra...


## Training

The dataset is large (159,292 rows), so taking a sample was considered; the sampling cell below is left commented out, and the full dataset is used.

The **corpus** variable was intended for TF-IDF.


In [10]:
#sample_size = 60000
#corpus = comments.sample(n=sample_size,random_state=0).reset_index(drop=True)
#print('class ratio in the corpus sample\n', corpus.toxic.value_counts()/corpus.shape[0]*100)

With sampling, the class ratio stays approximately the same as in the original dataset.
The accuracy baseline for this dataset can be taken as about 90%: the share of the majority (non-toxic) class.
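
To illustrate why accuracy alone is misleading at this class ratio, the sketch below fits a majority-class `DummyClassifier` on synthetic labels with a 90/10 split. The labels are made up; only the ratio mirrors the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with roughly the 90/10 class split reported above.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)
f1 = f1_score(y, pred, zero_division=0)  # no positive predictions, so F1 is 0
print(acc, f1)
```

The baseline reaches 90% accuracy while never flagging a toxic comment, which is why F1 of the toxic class is the metric to watch here.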

In [11]:
#corpus.head()

Apply the WordNet lemmatizer with POS tags.

In [12]:
# initialize the WordNet lemmatizer
L = WordNetLemmatizer()

In [13]:

def lemmatizered(comments):
    '''Tokenize and lemmatize each text in an iterable of texts (no POS tags).'''
    corpus_new = []
    for sentence in comments:  # iterate over the argument, not an undefined global
        word_list = nltk.word_tokenize(sentence)
        corpus_new.append(' '.join([L.lemmatize(w) for w in word_list]))
    return corpus_new

In [14]:
def get_wordnet_pos(word):
    """Return the WordNet part-of-speech constant for a word (noun by default)."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
 
    return tag_dict.get(tag, wordnet.NOUN)
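
The mapping step in the function above can be shown in isolation. The sketch below assumes the tokens are already tagged, so it runs without NLTK data; the `(token, tag)` pairs are hand-written stand-ins for `nltk.pos_tag` output, and `penn_to_wordnet` is a hypothetical helper name. The single-letter codes equal the values of `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV`.

```python
# WordNet POS codes; these match nltk.corpus.wordnet.ADJ/NOUN/VERB/ADV.
PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(penn_tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code, defaulting to noun."""
    return PENN_TO_WORDNET.get(penn_tag[:1].upper(), "n")


# Hand-written (token, tag) pairs standing in for nltk.pos_tag(tokens) output.
tagged = [("edits", "NNS"), ("made", "VBN"), ("really", "RB"), ("good", "JJ"), ("the", "DT")]
wordnet_tags = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(wordnet_tags)
```

Tagging a whole token list in one call, rather than one word at a time, gives the tagger sentence context and avoids a tagger call per word; whether that changes the final F1 here is untested.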
     

In [15]:
def get_word_text(comments):
    '''Tokenize each text, drop English stop words, and lemmatize using POS tags.'''
    english_stopwords = set(stopwords.words('english'))  # build once instead of per token
    corpus_new = []
    for sentence in comments:
        tokens = [w for w in nltk.word_tokenize(sentence) if w not in english_stopwords]
        corpus_new.append(' '.join(L.lemmatize(w, get_wordnet_pos(w)) for w in tokens))
    return corpus_new

Lemmatize the corpus using POS tags.





In [16]:
%%time
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
 
comments['lemma_text_no_sw'] = get_word_text(comments['clear_text'])

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


CPU times: user 24min 34s, sys: 3min 5s, total: 27min 40s
Wall time: 27min 43s


In [17]:
comments.head()

Unnamed: 0.1,Unnamed: 0,text,toxic,clear_text,lemma_text_no_sw
0,0,explanation\nwhy the edits made under my usern...,0,explanation why the edits made under my userna...,explanation edits make username hardcore metal...
1,1,d'aww! he matches this background colour i'm s...,0,d aww he matches this background colour i m se...,aww match background colour seemingly stuck th...
2,2,"hey man, i'm really not trying to edit war. it...",0,hey man i m really not trying to edit war it s...,hey man really try edit war guy constantly rem...
3,3,"""\nmore\ni can't make any real suggestions on ...",0,more i can t make any real suggestions on impr...,make real suggestion improvement wonder sectio...
4,4,"you, sir, are my hero. any chance you remember...",0,you sir are my hero any chance you remember wh...,sir hero chance remember page


In [18]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
stop_words = set(stopwords.words("english"))

In [20]:
train = []
# needed for Word2vec
for sentences in comments['lemma_text_no_sw']:
    train.append(sentences.split())

comments['split'] = train

In [21]:
train_comments,test_comments = train_test_split(comments, test_size=0.2, random_state =0,stratify = comments['toxic'])
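
The `stratify` argument keeps the class ratio the same in both parts of the split. A minimal sketch on synthetic labels with a 90/10 split (the data is illustrative, not the real comments):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 90/10 class split (illustrative, not the real data).
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
# Share of the positive class in each part of the split.
print(y_tr.mean(), y_te.mean())
```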

In [22]:
train_comments.head()

Unnamed: 0.1,Unnamed: 0,text,toxic,clear_text,lemma_text_no_sw,split
138503,138654,"""::::*the name of the geographical reference i...",0,the name of the geographical reference isn t r...,name geographical reference really pertinent d...,"[name, geographical, reference, really, pertin..."
87577,87658,"""\n\n disscussions with dmacks \n\noth have un...",0,disscussions with dmacks oth have understandin...,disscussions dmacks oth understand difficulty ...,"[disscussions, dmacks, oth, understand, diffic..."
26277,26311,"on further investigation, i think i see what y...",0,on further investigation i think i see what yo...,investigation think see question edit sub stub...,"[investigation, think, see, question, edit, su..."
56665,56726,i am going to make a good-faith assumption tha...,0,i am going to make a good faith assumption tha...,go make good faith assumption reply request ma...,"[go, make, good, faith, assumption, reply, req..."
143787,143942,i suggest you take a look at the archived link...,0,i suggest you take a look at the archived link...,suggest take look archive link current site qu...,"[suggest, take, look, archive, link, current, ..."


In [23]:
def result_write(model_name, f1, cv_score, time_fit):
    '''
    Store a model's test-set F1, mean cross-validated F1, and fit time in seconds
    in the global result_df, then return result_df.
    '''
    result_df.loc[model_name,'F1_predict'] = f1
    result_df.loc[model_name,'cross_val_score'] = cv_score
    result_df.loc[model_name,'Time_fit'] = time_fit

    return result_df


In [24]:
result_df = pd.DataFrame(columns=['F1_predict','cross_val_score','Time_fit'])
result_df

Unnamed: 0,F1_predict,cross_val_score,Time_fit


# TF-IDF


Vectorize the texts with TfidfVectorizer.

The fit method learns the vocabulary and IDF weights from the training texts; transform converts texts into TF-IDF vectors.
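
A small sketch of the fit/transform split on made-up texts: the vocabulary is learned from the training texts only, and words that appear only in the test texts are ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpora standing in for the lemmatized train and test texts.
train_texts = ["edit war page", "page vandalism revert", "thank edit"]
test_texts = ["revert edit unknownword"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learn vocabulary + IDF on train only
test_matrix = vectorizer.transform(test_texts)        # reuse them; unseen words are dropped

print(train_matrix.shape, test_matrix.shape)
```

Both matrices share the same number of columns, one per vocabulary term learned from the training texts.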

In [25]:

count_tf_idf = TfidfVectorizer(stop_words = 'english') 
tf_idf = count_tf_idf.fit(train_comments['lemma_text_no_sw'])
tf_idf_train = tf_idf.transform(train_comments['lemma_text_no_sw'])
test_tf_idf = tf_idf.transform(test_comments['lemma_text_no_sw'])

In [26]:
print("Train matrix shape:", tf_idf_train.shape, "Test matrix shape:", test_tf_idf.shape)


Train matrix shape: (127433, 143240) Test matrix shape: (31859, 143240)


In [27]:
# Define the features and targets for train and test.
X_train = tf_idf_train
X_test = test_tf_idf
y_train = train_comments['toxic'].values
y_test = test_comments['toxic'].values

In [28]:
print(X_train.shape,X_test.shape)

(127433, 143240) (31859, 143240)


In [29]:
f1 = make_scorer(f1_score)

### SGDClassifier

In [30]:
modelSGDClassifier_TF_IDF = SGDClassifier()

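# Note: loss='log' was renamed to 'log_loss' in scikit-learn 1.1.
hyperparams = [{'loss':['hinge', 'log', 'modified_huber'],
                'learning_rate':['constant', 'optimal', 'invscaling', 'adaptive'],
                'eta0':[0.01, 0.05, 0.1, 0.2, 0.3, 0.5],
                'random_state':[12082020],
                'class_weight':['balanced']}]

# No `scoring` is passed, so candidates are ranked by accuracy (the classifier's default score).
grid_search_SGDClassifier = GridSearchCV(modelSGDClassifier_TF_IDF, hyperparams, cv=5)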

start = time.time()

grid_search_SGDClassifier.fit(X_train, y_train)
grid_search_SGDClassifier.best_params_ 
grid_search_SGDClassifier.best_score_ 

0.9435154236822731
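
When no `scoring` argument is given, `GridSearchCV` uses the estimator's default score, which is accuracy for classifiers, so the best score above is an accuracy value. The sketch below shows one way to rank candidates by F1 instead. It runs on synthetic data, and the small grid (`loss`, `alpha`) is an assumption chosen for speed, not the grid used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 90/10), used only to make the sketch runnable.
X, y = make_classification(n_samples=600, n_features=20, weights=[0.9, 0.1], random_state=0)

param_grid = {"loss": ["hinge", "modified_huber"], "alpha": [1e-4, 1e-3]}
search = GridSearchCV(
    SGDClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1",  # rank candidates by F1 of the positive class instead of accuracy
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```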

In [31]:

modelSGDClassifier = SGDClassifier(**grid_search_SGDClassifier.best_params_)

model_sgd = modelSGDClassifier.fit(X_train, y_train)
end= time.time()
time_fit = end - start

In [32]:
SGD_predict = model_sgd.predict(X_test)
f1_SGD = f1_score(y_test, SGD_predict)  # signature is f1_score(y_true, y_pred)
f1_SGD

0.7447300421596628

In [33]:
cross_val_sgd = cross_val_score(model_sgd,X_train,y_train,cv=5,scoring='f1').mean()
cross_val_sgd

0.7512679897107728
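
In the cross-validation above, TF-IDF was fitted once on the whole training set, so the validation folds also contribute to the vocabulary and IDF weights. One way to refit the vectorizer inside each fold is to wrap it in a pipeline with the classifier. This is a sketch on a made-up miniature corpus; the texts, labels, and choice of `LogisticRegression` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up miniature corpus; 1 marks a "toxic" text.
texts = ["you are stupid", "thank you for the edit", "stupid idiot page",
         "nice article thanks", "idiot go away", "please see the talk page",
         "you idiot", "good edit thanks"] * 5
labels = [1, 0, 1, 0, 1, 0, 1, 0] * 5

# The vectorizer is refitted on the training part of every fold.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1")
print(scores.mean())
```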

In [34]:
result_write('TF-IDF+SGDClassifier',f1_SGD,cross_val_sgd,time_fit)

Unnamed: 0,F1_predict,cross_val_score,Time_fit
TF-IDF+SGDClassifier,0.74473,0.751268,437.849358


### LogisticRegression

In [35]:
start = time.time()

logreg = LogisticRegression()
result_logreg = logreg.fit(X_train, y_train)
end = time.time()
time_fit = end - start


In [36]:
lr_predict = result_logreg.predict(X_test)
lr_f1 = f1_score(y_test, lr_predict)  # signature is f1_score(y_true, y_pred)
lr_f1

0.7362576346474182

In [37]:
lr_cross_val = cross_val_score(result_logreg,X_train,y_train,cv=5,scoring='f1').mean()
lr_cross_val

0.7137018791041585

In [38]:
result_write('TF-IDF+LogisticRegression',lr_f1,lr_cross_val,time_fit)

Unnamed: 0,F1_predict,cross_val_score,Time_fit
TF-IDF+SGDClassifier,0.74473,0.751268,437.849358
TF-IDF+LogisticRegression,0.736258,0.713702,44.909829


### AdaBoostClassifier

In [39]:
start = time.time()
# Note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2.
modelClf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2), n_estimators=100, random_state=42)
modelclf_fit = modelClf.fit(X_train, y_train)
end = time.time()
time_fit = end - start


In [40]:
adaboost_predict = modelclf_fit.predict(X_test)
adaboost_f1 = f1_score(y_test, adaboost_predict)  # signature is f1_score(y_true, y_pred)
adaboost_f1

0.7300407151708267

In [41]:
adaboost_cross_val = cross_val_score(modelclf_fit,X_train,y_train,cv=5,scoring='f1').mean()
adaboost_cross_val

0.7235590738255137

In [42]:
result_write('TF-IDF+AdaBoostClassifier',adaboost_f1,adaboost_cross_val,time_fit)

Unnamed: 0,F1_predict,cross_val_score,Time_fit
TF-IDF+SGDClassifier,0.74473,0.751268,437.849358
TF-IDF+LogisticRegression,0.736258,0.713702,44.909829
TF-IDF+AdaBoostClassifier,0.730041,0.723559,86.912569


### SVM

In [44]:
start = time.time()
metodsvm = svm.SVC()
result_svm = metodsvm.fit(X_train, y_train)
end = time.time()
time_fit = end - start


In [45]:
svm_predict = result_svm.predict(X_test)
svm_f1 = f1_score(y_test, svm_predict)  # signature is f1_score(y_true, y_pred)
svm_f1

0.7517367458866545

In [46]:
svm_cross_val = cross_val_score(result_svm,X_train,y_train,cv=5,scoring='f1').mean()
svm_cross_val

0.7358101074687179

In [47]:
result_write('TF-IDF+svm',svm_f1,svm_cross_val,time_fit)

Unnamed: 0,F1_predict,cross_val_score,Time_fit
TF-IDF+SGDClassifier,0.74473,0.751268,437.849358
TF-IDF+LogisticRegression,0.736258,0.713702,44.909829
TF-IDF+AdaBoostClassifier,0.730041,0.723559,86.912569
TF-IDF+svm,0.751737,0.73581,3128.358924
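
The kernel `SVC` is by far the slowest model in the table. `LinearSVC`, which is imported at the top of the notebook but not used, is a linear alternative that typically scales better to large sparse matrices. The sketch below runs on synthetic data; the parameters are assumptions, and whether it would match the F1 of the kernel SVM on the real comments is untested.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and toxic labels.
X, y = make_classification(n_samples=1000, n_features=50, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

# class_weight="balanced" compensates for the rare positive class.
linear_svm = LinearSVC(C=1.0, class_weight="balanced", random_state=0, max_iter=5000)
linear_svm.fit(X_tr, y_tr)
linear_f1 = f1_score(y_te, linear_svm.predict(X_te))
print(round(linear_f1, 3))
```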


## Evaluating the best model on the test set

**SVM** was selected as the best model. Obtain its predictions on the test set.



In [48]:
# Check the quality of the 'SVM' model on the test set
svm_predict = result_svm.predict(X_test)

f1_svm_test = round(f1_score(y_test, svm_predict),4)
print("TF-IDF+svm:", f1_svm_test)

TF-IDF+svm: 0.7517


## Conclusions

The following work was done in this project:

* The data was loaded and prepared for model training: the text was lowercased, cleaned of non-alphanumeric characters, lemmatized with POS tags, stripped of stop words, and vectorized with TF-IDF.

* The models **TF-IDF+SGDClassifier**, **TF-IDF+LogisticRegression**, **TF-IDF+AdaBoostClassifier**, and **TF-IDF+svm** were evaluated.

* **The best test-set f1_score is achieved by svm: 0.7517**, which meets the required threshold of 0.75.
