# Project for "Wikishop"

### Installing required packages

In [1]:
! pip install nltk

! pip install spacy
! python -m spacy download en_core_web_sm

### Importing required libraries

In [2]:
import pandas as pd

import re  

import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

from tqdm.notebook import tqdm
tqdm.pandas()

import spacy

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer 

from sklearn.linear_model import LogisticRegression, SGDClassifier, PassiveAggressiveClassifier
from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, roc_auc_score, roc_curve

from sklearn.ensemble import VotingClassifier

## Project description

We need to train a model that classifies comments as positive or negative (toxic). We have a dataset in which each edit comment is labeled as toxic or not.  

The resulting model must reach an F1 score of at least 0.75.


## Contents

[1. Loading the data](#1)  
[2. Data analysis and preprocessing](#2)  
[3. Model training](#3)  
[4. Model testing](#4)  
[Overall conclusion](#itog)  

# 1. Loading the data<a id="1"></a>

In [4]:
comments.info()
display(comments.head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


# 2. Data analysis and preprocessing<a id="2"></a>

Let's check the class balance of the provided dataset:

In [5]:
comments['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

The classes are imbalanced at roughly 1:9 in favor of non-toxic comments (16,225 toxic versus 143,346 non-toxic, about 10% toxic). We will account for this imbalance when training the models.
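As a rough illustration of how such an imbalance can be compensated, the sketch below computes per-class weights with the `n_samples / (n_classes * class_count)` heuristic that scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_weights` is made up for this example; the counts come from the `value_counts()` output above.

```python
def balanced_weights(class_counts):
    """Return {label: weight} using n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Class counts taken from comments['toxic'].value_counts()
counts = {0: 143346, 1: 16225}
weights = balanced_weights(counts)

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

The rarer toxic class receives the larger weight, so each of its errors costs more during training.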

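### Text cleaning

+ Convert all strings to lowercase;
+ Keep only Latin letters: the active regular expression also replaces apostrophes with spaces, so contractions such as can't, he's, I'm are split into separate tokens (can t, he s, i m);
+ Remove special characters, digits, and extra whitespace.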

In [7]:
def clear_text(text):
    """
    Clean the text from "noise":
    text - source text
    Returns the cleaned text.
    """
    text = text.lower()
    # Replace everything except Latin letters (apostrophes included) with spaces
    text = re.sub(r'[^a-z]', ' ', text)
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text
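A quick check of what this cleaning step does to contractions, next to a hypothetical variant `clear_text_keep_apostrophes` (not used in this notebook) that keeps apostrophes inside words. The sample sentence is invented for the illustration.

```python
import re


def clear_text(text):
    # Same logic as the notebook's cleaning function
    text = text.lower()
    text = re.sub(r"[^a-z]", " ", text)
    return " ".join(text.split())


def clear_text_keep_apostrophes(text):
    # Hypothetical variant: keep letters and apostrophes
    text = text.lower()
    text = re.sub(r"[^a-z']", " ", text)
    # Drop apostrophes that are not surrounded by letters (e.g. quotes)
    text = re.sub(r"(?<![a-z])'|'(?![a-z])", " ", text)
    return " ".join(text.split())


sample = "D'aww! I can't edit 'this' page, he's #1."
print(clear_text(sample))
print(clear_text_keep_apostrophes(sample))
```

Whether keeping contractions helps depends on the tokenizer and lemmatizer used afterwards, so this is only one possible design choice.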


In [8]:
%%time
comments['clear_text'] = comments['text'].progress_apply(clear_text)



CPU times: user 9.91 s, sys: 376 ms, total: 10.3 s
Wall time: 10.1 s


In [9]:
comments.sample(10)

Unnamed: 0,text,toxic,clear_text
85503,"""\n\nWasn't Warrant Officer (TNG - Hollow Pip)...",0,wasn t warrant officer tng hollow pip listed i...
4122,"Cape Town meetup \nHi Piet, I just wanted to l...",0,cape town meetup hi piet i just wanted to let ...
86447,"""BTW The Adversary on the wikipediareview webs...",0,btw the adversary on the wikipediareview websi...
37350,"""\n\n Archiving? \n\nTo help keep it tidy, you...",0,archiving to help keep it tidy you might want ...
64992,"""\n\n Wikidata and Interwiki links \n\nYou are...",0,wikidata and interwiki links you are receiving...
25198,Please remove the Criticisms sectoin\nThere ar...,0,please remove the criticisms sectoin there are...
155881,Anal leakage ban hammer?,1,anal leakage ban hammer
136000,Except they clearly state he left for South Am...,0,except they clearly state he left for south am...
29835,"Also, I remember reading that the menthol in m...",0,also i remember reading that the menthol in me...
62784,"""\nI just retooled it into a more concise vers...",0,i just retooled it into a more concise version...


In [10]:
comments = comments.drop('text', axis = 1)

### Lemmatization

### spaCy

In [11]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [12]:
def lemmatization3(text):
    """
    Lemmatization function:
    text - source text
    Returns the lemmatized text
    """
    doc = nlp(text)
    text_lemm = ' '.join([token.lemma_ for token in doc])
    return text_lemm
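Calling `nlp(text)` once per row works, but it processes documents one at a time. A possible alternative, sketched here under the assumption that a loaded spaCy pipeline is passed in as `nlp`, streams the texts through `nlp.pipe` in batches. The function name `lemmatize_batch` and the batch size are illustrative choices, not part of the original notebook.

```python
def lemmatize_batch(texts, nlp, batch_size=1000):
    """Lemmatize an iterable of strings with nlp.pipe; returns a list of strings."""
    return [
        " ".join(token.lemma_ for token in doc)
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]


# Intended usage in this notebook (requires spaCy and en_core_web_sm):
# nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# comments['lemm_text'] = lemmatize_batch(comments['clear_text'], nlp)
```

The output should match the per-row version, since only the way documents are fed to the pipeline changes.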

In [13]:
comments['lemm_text'] = comments['clear_text'].progress_apply(lemmatization3)





In [15]:
comments

Unnamed: 0,toxic,clear_text,lemm_text
0,0,explanation why the edits made under my userna...,explanation why the edit make under my usernam...
1,0,d aww he matches this background colour i m se...,d aww he match this background colour I m seem...
2,0,hey man i m really not trying to edit war it s...,hey man I m really not try to edit war it s ju...
3,0,more i can t make any real suggestions on impr...,more I can t make any real suggestion on impro...
4,0,you sir are my hero any chance you remember wh...,you sir be my hero any chance you remember wha...
...,...,...,...
159566,0,and for the second time of asking when your vi...,and for the second time of ask when your view ...
159567,0,you should be ashamed of yourself that is a ho...,you should be ashamed of yourself that be a ho...
159568,0,spitzer umm theres no actual article for prost...,spitzer umm there s no actual article for pros...
159569,0,and it looks like it was actually you who put ...,and it look like it be actually you who put on...


In [16]:
comments['lemm_text'].loc[0]

'explanation why the edit make under my username hardcore metallica fan be revert they weren t vandalism just closure on some gas after I vote at new york doll fac and please don t remove the template from the talk page since I m retire now'

We also evaluated NLTK's WordNetLemmatizer. WordNetLemmatizer with POS tags produces results similar to spaCy's but takes more time, so spaCy was chosen.
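The WordNetLemmatizer experiment itself is not shown in the notebook. A minimal sketch of how a POS-aware variant is commonly written follows; `penn_to_wordnet` and `lemmatize_wordnet` are hypothetical names, and the mapping to WordNet POS codes (`'a'`, `'v'`, `'n'`, `'r'`) is a simplification that treats every unknown tag as a noun.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also the fallback for all other tags


def lemmatize_wordnet(text):
    """Lemmatize a cleaned string with NLTK; needs the NLTK data downloaded earlier."""
    # Imported lazily so the tag mapping above can be used without NLTK data
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return " ".join(
        lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged
    )
```

Tagging every token before lemmatizing is an extra pass over the text, which is one plausible reason for the longer run time noted above.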


## Conclusion

The texts were cleaned and lemmatized with spaCy, and the lemmatization was checked visually on sample rows.

# 3. Model training<a id="3"></a>

Split the dataset into training (80%) and test (20%) sets:

In [18]:
y = comments['toxic']
X = comments['lemm_text']

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 12345)
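The split above is purely random. With roughly 10% toxic comments, one option (not used in this notebook) is to pass `stratify` so that both parts keep the same class ratio. The toy data below is invented for the illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 100 comments, 10% of them labeled as toxic
texts = [f"comment {i}" for i in range(100)]
labels = [1] * 10 + [0] * 90

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=12345, stratify=labels
)

# Both parts keep the 1:9 ratio of the full toy set
print(Counter(y_tr), Counter(y_te))
```

On a dataset of about 160 thousand rows a random split is unlikely to distort the ratio much, so this mainly matters for smaller samples.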

In [21]:
display(X.shape)
display(y.shape)
display(X_train.shape)
display(y_train.shape)
display(X_test.shape)
display(y_test.shape)

(159571,)

(159571,)

(127656,)

(127656,)

(31915,)

(31915,)

In [22]:
display(X_train)

45800     I have offer as well but he must do my map fir...
94209     your biased orthodox view have no more merit t...
135210    thank you very much for your quick respond and...
89158     I move this page when I do I find another link...
61233     gray powell article nominate for deletion nomi...
                                ...                        
109993    gam keep your crap off my talk page your kkk s...
85412                               I correct what you list
133249    hmm yes as I say I also watch the emily ruete ...
130333                          kinda like miachael jackson
77285     criterion for deletion I m just wonder what be...
Name: lemm_text, Length: 127656, dtype: object

In [23]:
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer 

In [24]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

count_tf_idf = TfidfVectorizer(stop_words=stopwords)

X_train = count_tf_idf.fit_transform(X_train.values)
X_test = count_tf_idf.transform(X_test.values)
                                
print(X_train.shape)
print(X_test.shape)                                

[nltk_data] Downloading package stopwords to /home/master/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


(127656, 131657)
(31915, 131657)
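To make the vectorization step less of a black box, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed `idf = ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function `tfidf_rows` is hypothetical and splits on whitespace only, which is simpler than the vectorizer's real tokenizer; stop-word removal is left out.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        # L2-normalize the row; guard against empty documents
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({term: v / norm for term, v in raw.items()})
    return rows


rows = tfidf_rows(["edit war page", "talk page", "page page vandalism"])
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first toy document the word "page" occurs in every text and therefore gets a lower weight than "edit" or "war".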


Let's evaluate several models:

In [34]:
logr = LogisticRegression(class_weight = 'balanced', random_state = 12345)
dtree = DecisionTreeClassifier(class_weight = 'balanced', random_state = 12345)
sgdc = SGDClassifier(class_weight = 'balanced', random_state = 12345)
pagr = PassiveAggressiveClassifier(class_weight = 'balanced', random_state = 12345)

logr_params = {'solver':['newton-cg', 'lbfgs', 'liblinear'],'penalty':['l1', 'l2'], 'C':range(1,15,3)}
dtree_params = {'max_depth': range(2, 10, 2)}
sgdc_params = {'loss':['hinge', 'log', 'modified_huber'],
                'learning_rate':['constant', 'optimal', 'invscaling', 'adaptive'],
                'eta0':[0.01, 0.05, 0.1, 0.2, 0.3, 0.5]}
pagr_params = {'C':range(1,15,3)}
                


In [35]:
%%time

warnings.filterwarnings('ignore')

gs = GridSearchCV(logr, logr_params, cv=3, scoring='f1', verbose=True).fit(X_train, y_train)
print('F1_logr:', gs.best_score_)
print('Best model parameters:', gs.best_params_)

gs = GridSearchCV(dtree, dtree_params, cv=3, scoring='f1', verbose=True).fit(X_train, y_train)
print('F1_tree:', gs.best_score_)
print('Best model parameters:', gs.best_params_)

gs = GridSearchCV(sgdc, sgdc_params, cv=3, scoring='f1', verbose=True).fit(X_train, y_train)
print('F1_sgdc:', gs.best_score_)
print('Best model parameters:', gs.best_params_)

gs = GridSearchCV(pagr, pagr_params, cv=3, scoring='f1', verbose=True).fit(X_train, y_train)
print('F1_pagr:', gs.best_score_)
print('Best model parameters:', gs.best_params_)

Fitting 3 folds for each of 30 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:  6.6min finished


F1_logr: 0.7682212380049465
Best model parameters: {'C': 7, 'penalty': 'l2', 'solver': 'liblinear'}
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:  2.9min finished


F1_tree: 0.5397354243938718
Best model parameters: {'max_depth': 8}
Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 216 out of 216 | elapsed:  6.9min finished


F1_sgdc: 0.7583809421230114
Best model parameters: {'eta0': 0.05, 'learning_rate': 'adaptive', 'loss': 'modified_huber'}
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   23.3s finished


F1_pagr: 0.7236441759568116
Best model parameters: {'C': 1}
CPU times: user 17min 26s, sys: 3.15 s, total: 17min 29s
Wall time: 17min 7s
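The logistic-regression grid above combines every solver with every penalty, although `newton-cg` and `lbfgs` do not support the `l1` penalty, so 10 of the 30 candidates cannot be fitted. `GridSearchCV` also accepts a list of dictionaries, which allows only valid combinations to be listed. The sketch below shows such a grid; `logr_param_grid` and `count_candidates` are names introduced for this example only.

```python
from itertools import product

# Each dict is expanded separately, so l1 is only paired with liblinear
logr_param_grid = [
    {'solver': ['newton-cg', 'lbfgs', 'liblinear'],
     'penalty': ['l2'],
     'C': list(range(1, 15, 3))},
    {'solver': ['liblinear'],
     'penalty': ['l1'],
     'C': list(range(1, 15, 3))},
]


def count_candidates(param_grid):
    """Count the parameter combinations a list-of-dicts grid expands to."""
    return sum(len(list(product(*grid.values()))) for grid in param_grid)


print(count_candidates(logr_param_grid))

# Possible usage in place of logr_params:
# gs = GridSearchCV(logr, logr_param_grid, cv=3, scoring='f1').fit(X_train, y_train)
```

With this grid the search covers 20 candidates instead of 30, and no fits are spent on combinations that fail.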


## Conclusion

Four models were evaluated. LogisticRegression (cross-validated F1 of about 0.768) and SGDClassifier (about 0.758) exceed the required F1 of 0.75.  
The final test is run with logistic regression.

# 4. Model testing <a id="4"></a>

In [38]:
model = LogisticRegression(solver = 'liblinear', class_weight = 'balanced', C = 7, penalty = 'l2', random_state = 12345)
model.fit(X_train, y_train)
predict = model.predict(X_test)
f1 = f1_score(y_test, predict)
print('F1 score:', f1)

F1 score: 0.7611814345991561


## Conclusion

Logistic regression reached an F1 score of 0.76 on the test set, which satisfies the minimum required value of 0.75.  
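For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from confusion-matrix counts; the function name and the counts are made up for the illustration and are not taken from this notebook's predictions.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Equivalent to 2 * precision * recall / (precision + recall)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives
precision, recall, f1 = f1_from_counts(tp=8, fp=2, fn=2)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because true negatives do not enter the formula, F1 is not inflated by the large non-toxic class, which makes it a reasonable metric for this imbalanced task.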


# Overall conclusion<a id="itog"></a>

A model was trained to classify comments as positive or negative (toxic).  
Along the way, the user comments were cleaned and lemmatized, and several models were compared.
The final logistic regression model reached an F1 score of 0.76 on the test set.
