<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Тестирование-лучшей-модели" data-toc-modified-id="Тестирование-лучшей-модели-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Testing the best model</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

In [70]:
import numpy as np
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords as nltk_stopwords
from pymystem3 import Mystem
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from tqdm import notebook 

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from gensim.models import Word2Vec
import spacy

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

## Preparation

In [71]:
data = pd.read_csv('/datasets/toxic_comments.csv')

In [72]:
data.head()

,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [73]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Lemmatize the text and strip out unneeded characters.

In [74]:
nlp = spacy.load("en_core_web_sm")

In [76]:
# Lemmatize each comment with spaCy, keep only Latin letters, and collapse whitespace.
# Results are collected in a list and assigned once, which avoids chained
# assignment (data['text'][i] = ...) and the SettingWithCopyWarning it triggers.
lemmatized = []
for text in notebook.tqdm(data['text']):
    lemmas = " ".join(token.lemma_ for token in nlp(text))
    letters_only = re.sub(r'[^a-zA-Z]', ' ', lemmas)
    lemmatized.append(' '.join(letters_only.split()))
data['text'] = lemmatized

  0%|          | 0/159571 [00:00<?, ?it/s]


Wrapping this in a function and calling `apply` killed the kernel immediately, so the processing is done in a loop instead.
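
One way to keep memory bounded on a corpus of this size is to process the comments in fixed-size chunks. The sketch below is illustrative only: `clean_text`, `chunked`, and `process_in_chunks` are hypothetical helper names that do not appear in the notebook, and the lemmatizer is passed in as a plain callable so the example runs without spaCy. With spaCy itself, the batched `nlp.pipe(texts, batch_size=...)` call would be a natural candidate for that step.

```python
import re
from typing import Callable, Iterator, List


def clean_text(text: str) -> str:
    """Replace every character that is not a Latin letter with a space,
    then collapse runs of whitespace into single spaces."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(letters_only.split())


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_chunks(
    texts: List[str],
    lemmatize: Callable[[str], str],
    chunk_size: int = 1000,
) -> List[str]:
    """Apply `lemmatize` and then `clean_text` to every text, chunk by chunk."""
    result: List[str] = []
    for chunk in chunked(texts, chunk_size):
        result.extend(clean_text(lemmatize(text)) for text in chunk)
    return result


# str.lower stands in for a real lemmatizer here (an assumption for the demo).
sample = ["D'aww! He matches this background colour", "Hey man,\nI'm really not trying"]
cleaned = process_in_chunks(sample, lemmatize=str.lower, chunk_size=1)
print(cleaned)
# ['d aww he matches this background colour', 'hey man i m really not trying']
```

The chunk size is a tuning knob: smaller chunks lower peak memory at the cost of more loop overhead.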

In [77]:
data.head()

,text,toxic
0,explanation why the edit make under my usernam...,0
1,d aww he match this background colour I be see...,0
2,hey man I be really not try to edit war it be ...,0
3,More I can not make any real suggestion on imp...,0
4,you sir be my hero any chance you remember wha...,0


In [78]:
features = data['text']
target = data['toxic']
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=148)

In [79]:
features_train = features_train.values.astype('U')

In [80]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))
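# Recent scikit-learn versions accept stop_words only as 'english', a list, or None,
# so the set is converted to a list.
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))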
features_train = count_tf_idf.fit_transform(features_train)


[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
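
To make the vectorization step concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The helper names `fit_idf` and `transform` and the toy documents are assumptions made for illustration, and tokenization is simplified to whitespace splitting rather than the vectorizer's own token pattern.

```python
import math
from collections import Counter
from typing import Dict, List


def fit_idf(docs: List[str]) -> Dict[str, float]:
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def transform(doc: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Weight one document with a fitted IDF table and L2-normalize the result.
    Terms that were not seen during fitting are ignored."""
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


train_docs = ["you be my hero", "you be rude", "edit war"]
idf = fit_idf(train_docs)

# A "test" document is only transformed; the word "a" is out of vocabulary.
vec = transform("you be a hero", idf)
print(sorted(vec, key=vec.get, reverse=True))
```

Fitting on the training split and only transforming the test split, as the notebook does, keeps test-set vocabulary and document frequencies from leaking into the features.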


## Training

In [81]:
%%time
# %%time (cell magic) times the whole cell; a bare %time line times an empty statement.
model_l_r = LogisticRegression(class_weight='balanced', solver="saga", max_iter=1000)
scores = cross_val_score(model_l_r, features_train, target_train, scoring='f1')
print(scores)
print(scores.mean())

[0.75112185 0.76029056 0.75082365 0.75208044 0.75655431]
0.7541741619693467
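
The `class_weight='balanced'` option reweights classes inversely to their frequency, using `n_samples / (n_classes * count_of_class)`. A small sketch of that arithmetic follows; the function name `balanced_class_weights` and the 9:1 toy label ratio are made up for illustration and are not the actual class balance of this dataset.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels: nine non-toxic comments and one toxic comment.
toy_labels = [0] * 9 + [1]
weights = balanced_class_weights(toy_labels)
print(weights[1])  # 5.0
```

With these weights an error on the rare class costs as much in total as errors on the frequent class, which tends to trade precision for recall on the minority class.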


In [12]:
%%time
# %%time (cell magic) times the whole cell; a bare %time line times an empty statement.
model_f = RandomForestClassifier(class_weight='balanced')
scores = cross_val_score(model_f, features_train, target_train, scoring='f1', cv=3)
print(scores)
print(scores.mean())

[0.62195728 0.61264425 0.60332771]
0.6126430797876959


In [13]:
%%time
# %%time (cell magic) times the whole cell; a bare %time line times an empty statement.
model_t = DecisionTreeClassifier(class_weight='balanced')
scores = cross_val_score(model_t, features_train, target_train, scoring='f1', cv=3)
print(scores)
print(scores.mean())

[0.64482527 0.59119691 0.59754601]
0.611189397428018


In [84]:
grid = {
    'C': np.logspace(-3, 3, 10),
    'max_iter': [100],
    # None (not the string 'None') means no class reweighting
    'class_weight': ['balanced', None],
}

In [1]:
rf_cv = GridSearchCV(estimator=LogisticRegression(), param_grid=grid, cv=3, scoring='f1')
rf_cv.fit(features_train, target_train)

In [None]:
rf_cv.best_params_
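
Before launching a grid search it helps to estimate how many model fits it implies. The sketch below enumerates a grid of the same shape with the standard library only: ten log-spaced `C` values, one `max_iter` value, and two `class_weight` options, `'balanced'` and `None`. The variable names are illustrative, and the `C` values are computed by hand to mirror the spacing of `np.logspace(-3, 3, 10)`.

```python
from itertools import product

# Ten values spaced evenly in log10 from 1e-3 to 1e3.
c_values = [10 ** (-3 + 6 * i / 9) for i in range(10)]

param_grid = {
    "C": c_values,
    "max_iter": [100],
    "class_weight": ["balanced", None],
}

# Every combination of parameter values is one candidate model.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]

n_folds = 3
total_fits = len(candidates) * n_folds  # the default final refit adds one more fit
print(len(candidates), total_fits)  # 20 60
```

Sixty fits of a logistic regression on a sparse matrix with roughly 128,000 rows is a substantial cost, so narrowing the `C` range first is a reasonable way to cut the run time.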

### Testing the best model

In [82]:
features_test = count_tf_idf.transform(features_test)

In [83]:
model_l_r = LogisticRegression(class_weight='balanced', solver="saga", max_iter=1000, random_state=148, C=1.05)
model_l_r.fit(features_train, target_train)
p = model_l_r.predict(features_test)
f1_score(target_test, p)

0.7505783099741461
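
For reference, the F1 score is the harmonic mean of precision and recall on the positive (toxic) class. The following is a minimal sketch of that calculation on made-up labels; `f1_from_labels` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
from typing import Sequence


def f1_from_labels(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Compute F1 for the positive class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 3 toxic and 5 non-toxic comments, with one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 4))  # 0.6667
```

On this toy data accuracy is 0.75 while F1 is about 0.67, which shows why F1 is the more demanding metric when the positive class is rare.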

## Conclusions

The raw text was lemmatized and stripped of non-letter characters, then vectorized with TF-IDF for model training. Training was slow; CatBoost alone took more than an hour. Logistic regression gave the best result (F1 of about 0.754 in cross-validation and 0.751 on the test set, versus about 0.61 for both the random forest and the decision tree), which meets the target.