<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Очистка-и-лемматизация-теста" data-toc-modified-id="Очистка-и-лемматизация-теста-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Text cleaning and lemmatization</a></span></li></ul></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>LogisticRegression</a></span></li><li><span><a href="#LightGBM" data-toc-modified-id="LightGBM-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>LightGBM</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

# Classification of text comments

## Preparation

In [48]:
import pandas as pd
import numpy as np
import re
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from lightgbm import LGBMClassifier
import spacy
import nltk
from nltk.corpus import stopwords

from tqdm import notebook
notebook.tqdm.pandas()
    
import warnings
warnings.filterwarnings('ignore')

In [3]:
try:
    df = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    # Fall back to a local copy when the platform path is unavailable.
    df = pd.read_csv('toxic_comments.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [5]:
df.head()

                                                text  toxic
0  Explanation\nWhy the edits made under my usern...      0
1  D'aww! He matches this background colour I'm s...      0
2  Hey man, I'm really not trying to edit war. It...      0
3  "\nMore\nI can't make any real suggestions on ...      0
4  You, sir, are my hero. Any chance you remember...      0


In [6]:
df.isna().sum()

text     0
toxic    0
dtype: int64

There are no missing values.

Let's examine the target column.

In [7]:
df['toxic'].value_counts(normalize=True).map('{:.2%}'.format)

0    89.83%
1    10.17%
Name: toxic, dtype: object

**Summary:**
 - The data contains 159,571 rows and 2 columns (the comment text and the target)
 - The target column toxic is imbalanced: about 89.8% of comments are non-toxic and 10.2% are toxic
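
With a split this uneven, accuracy alone says little, which is why F1 on the toxic class is used for evaluation later on. The sketch below is a minimal pure-Python illustration on synthetic labels (not the real data); the helper `f1_binary` is a hypothetical name introduced here, not part of the notebook.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Synthetic labels that mimic the roughly 90/10 split (illustrative only).
y_true = [0] * 90 + [1] * 10

always_clean = [0] * 100   # never flags a comment
always_toxic = [1] * 100   # flags every comment

accuracy_clean = sum(t == p for t, p in zip(y_true, always_clean)) / len(y_true)
f1_clean = f1_binary(y_true, always_clean)
f1_toxic = f1_binary(y_true, always_toxic)

print(accuracy_clean, f1_clean, round(f1_toxic, 3))
```

A model that never flags anything is 90% accurate here yet has an F1 of zero, and flagging everything is not much better. Both trivial strategies are far below the scores a useful classifier should reach.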

### Text cleaning and lemmatization

In [8]:
def clear_text(text):
    return " ".join(re.sub(r'[^a-zA-Z]', ' ', text).split())
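
To make the effect of the cleaning step concrete, here is a small self-contained check that repeats the same regular expression on a made-up sample string (the sample is an assumption for illustration only).

```python
import re

def clear_text(text):
    # Same logic as the cell above: keep Latin letters only, collapse whitespace.
    return " ".join(re.sub(r'[^a-zA-Z]', ' ', text).split())

sample = "D'aww! He matches this\nbackground colour 100%"
cleaned = clear_text(sample)
print(cleaned)
```

Note that apostrophes are replaced by spaces, so contractions such as "I'm" are split into "I m", and digits are dropped entirely. This matches what later appears in the lemm_text column.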

In [9]:
# Load the small English spaCy pipeline once and reuse it for every comment.
nlp = spacy.load('en_core_web_sm')

In [10]:
def lemmatize(text):
    """Lowercase the text and replace every token with its spaCy lemma."""
    doc = nlp(text.lower())
    return " ".join(token.lemma_ for token in doc)

In [11]:
df['lemm_text'] = df['text'].apply(clear_text).progress_apply(lemmatize)


In [12]:
df.head()

                                                text  toxic                                          lemm_text
0  Explanation\nWhy the edits made under my usern...      0  explanation why the edit make under my usernam...
1  D'aww! He matches this background colour I'm s...      0  d aww he match this background colour I m seem...
2  Hey man, I'm really not trying to edit war. It...      0  hey man I m really not try to edit war it s ju...
3  "\nMore\nI can't make any real suggestions on ...      0  more I can t make any real suggestion on impro...
4  You, sir, are my hero. Any chance you remember...      0  you sir be my hero any chance you remember wha...


## Training

Split the data into features and target.

In [27]:
features = df['lemm_text']
target = df['toxic']

Split the data into training and test sets, stratifying by the target so that both sets keep the same class balance.

In [49]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target,
    test_size=0.1,
    stratify=target,
    random_state=2019,
)

In [50]:
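# Use a separate name so the imported `stopwords` module is not shadowed
# (otherwise re-running this cell fails), and pass the stop words as a list.
stop_words_en = stopwords.words('english')
count_tf_idf = TfidfVectorizer(stop_words=stop_words_en)
# Fit the vectorizer on the training texts only to avoid leaking test data.
features_train_tf_idf = count_tf_idf.fit_transform(features_train)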
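
The vectorizer is fitted on the training texts only, and the test texts are later transformed with the vocabulary and IDF weights learned there. The pure-Python sketch below illustrates that fit/transform split on toy sentences. It is a simplified illustration, not the library's implementation: the helpers `fit_idf` and `transform` are hypothetical names, stop-word removal is omitted, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization is assumed to mirror scikit-learn's default settings.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def transform(doc, idf):
    """Turn one document into an L2-normalized TF-IDF dict; unseen terms are dropped."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy "training" corpus (illustrative only).
train_docs = ["you be my hero", "you be wrong", "edit war"]
idf = fit_idf(train_docs)

# A "test" document: words never seen during fitting are ignored.
test_vector = transform("you be a troll", idf)
print(test_vector)
```

Words that occur in fewer training documents ("edit") get a larger IDF than frequent ones ("you"), and test-only words such as "troll" contribute nothing because they are not in the fitted vocabulary.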

Check the split proportions

In [51]:
(features_train.shape[0]/features.shape[0])

0.8999943598774214

In [52]:
(features_test.shape[0]/features.shape[0])

0.10000564012257866

Class balance in the training set

In [53]:
target_train.value_counts(normalize=True).map('{:.2%}'.format)

0    89.83%
1    10.17%
Name: toxic, dtype: object

Class balance in the test set

In [54]:
target_test.value_counts(normalize=True).map('{:.2%}'.format)

0    89.83%
1    10.17%
Name: toxic, dtype: object

### LogisticRegression

In [55]:
model_linear = LogisticRegression(random_state=2019, class_weight='balanced')
model_linear.fit(features_train_tf_idf, target_train)

features_test_tf_idf = count_tf_idf.transform(features_test)
predictions_linear = model_linear.predict(features_test_tf_idf)
    
f1_score(target_test, predictions_linear)

0.762186910853757
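
The model above is trained with `class_weight='balanced'`. According to the scikit-learn documentation, this mode weights each class by `n_samples / (n_classes * count)`. The sketch below reproduces that arithmetic in pure Python on synthetic labels; `balanced_class_weights` is a hypothetical helper written for illustration, not a library function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Synthetic labels approximating the roughly 90/10 split (illustrative only).
labels = [0] * 900 + [1] * 100
weights = balanced_class_weights(labels)
print(weights)
```

With a 90/10 split, an error on a toxic comment costs nine times as much as an error on a non-toxic one, which pushes the model toward better recall on the minority class.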

### LightGBM

In [56]:
model_LGBM = LGBMClassifier(num_leaves=200, n_estimators=100, class_weight='balanced', random_state=2019)
model_LGBM.fit(features_train_tf_idf, target_train)
predictions_LGBM = model_LGBM.predict(features_test_tf_idf)
    
f1_score(target_test, predictions_LGBM)

0.7649560533030905

## Conclusions

Both models reach an F1 score above 0.75 on the test set (0.762 for LogisticRegression and 0.765 for LightGBM). The scores are nearly identical, but LogisticRegression runs noticeably faster, so it is the model recommended for detecting toxic comments.
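
The speed comparison can be made concrete by timing both `fit` calls. The following is a minimal sketch using only the standard library; `time_call` is a hypothetical helper, and the workload shown is a stand-in so the block runs on its own.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload so the sketch is self-contained.
result, seconds = time_call(sum, range(1_000_000))
print(f"elapsed: {seconds:.4f} s")

# In the notebook this would wrap the real training calls, for example:
# _, lr_seconds = time_call(model_linear.fit, features_train_tf_idf, target_train)
# _, lgbm_seconds = time_call(model_LGBM.fit, features_train_tf_idf, target_train)
```

A single run is only a rough indication; repeating the measurement a few times gives a more reliable comparison.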