## Data description
---
- `text` - the comment text;
- `toxic` - the target feature.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/1e/21/d1718eb4c93d6bacdd540b3792187f32ccb1ad9c51b9c4f10875d63ec176/catboost-0.25-cp37-none-manylinux1_x86_64.whl (67.3MB)
     |████████████████████████████████| 67.3MB 58kB/s 
Installing collected packages: catboost
Successfully installed catboost-0.25


In [None]:
import pandas as pd
import numpy as np
from nltk.stem import WordNetLemmatizer
import nltk
import re
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Preparation

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Data/ml-5-toxic_comments.csv')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [None]:
df.duplicated().sum()

0

There are no missing values and no duplicates, so I move on.

In [None]:
df.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [None]:
df['toxic'].mean()

0.10167887648758234

Only about 10% of the comments belong to the positive (toxic) class, so the classes are imbalanced.
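With a positive share this small, accuracy alone says little, which is presumably why F1 is the metric tracked later in the notebook. A minimal sketch on made-up labels (not the real dataset) of a "model" that never predicts the toxic class:

```python
# Toy labels with roughly the same 10% positive share (illustrative only).
y_true = [1] * 10 + [0] * 90
# A degenerate classifier that always predicts the negative class.
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = sum(t == p for t, p in pairs) / len(pairs)
# F1 = 2*TP / (2*TP + FP + FN); taken as 0 when there are no true positives.
f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(accuracy, f1)
```

The accuracy looks respectable while F1 is zero, so a high accuracy on this kind of data does not by itself mean the toxic comments are being found.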

In [None]:
def clear_text(text):
    text = re.sub(r'[^a-zA-Z]',' ', text)
    text = ' '.join(text.split())
    return text

The function replaces every character that is not a Latin letter with a space and collapses runs of whitespace.
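To make the effect concrete, here is the same function applied to a made-up string (the sample text is an assumption, not a row from the dataset):

```python
import re

def clear_text(text):
    # Replace every character that is not a Latin letter with a space...
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # ...then collapse whitespace runs and trim both ends.
    return ' '.join(text.split())

sample = "Hey, it's 21:51!\nDon't   edit-war."
cleaned = clear_text(sample)
print(cleaned)  # Hey it s Don t edit war
```

Note that case is preserved and contractions are split into fragments such as `it s`. With its default settings `TfidfVectorizer` lowercases the text and ignores one-letter tokens, so these fragments should not reach the feature matrix.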

In [None]:
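def lemmatizer(row):
    # Tokenize the cleaned text and lemmatize every token as a verb (pos='v').
    # The local object is named wnl so that it does not shadow the function itself.
    wnl = WordNetLemmatizer()
    word_list = nltk.word_tokenize(row)
    lemmatized_output = ' '.join([wnl.lemmatize(w, pos = 'v') for w in word_list])
    return lemmatized_output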

The function lemmatizes the text of a cell, treating every token as a verb.

In [None]:
df['text'] = df['text'].apply(clear_text).apply(lemmatizer)

In [None]:
x_col = 'text'

In [None]:
y_col = 'toxic'

In [None]:
train, test = train_test_split(df, test_size = 0.4, random_state = 4, stratify = df[y_col])

In [None]:
valid, test = train_test_split(test, test_size = 0.5, random_state = 4, stratify = test[y_col])

I split the data into training, validation, and test sets (60%, 20%, 20%).

In [None]:
len(df) == len(train) + len(test) + len(valid)

True

In [None]:
test[y_col].mean(), train[y_col].mean(), valid[y_col].mean()

(0.10167632774557418, 0.10167951369305007, 0.10167951369305007)

I check that the split worked: the subset sizes add up to the full dataset, and the share of the positive class is the same in each subset.
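The two-step split can be reproduced on toy data to see where the 60/20/20 proportions come from. This is a sketch with made-up labels; only `train_test_split` and its arguments are taken from the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, 10% positives (illustrative, not the real dataset).
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y))

# Step 1: hold out 40% of the rows, keeping the class ratio.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=4, stratify=y)
# Step 2: cut the held-out part in half: 20% validation, 20% test.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=4, stratify=y_rest)

sizes = (len(y_train), len(y_valid), len(y_test))
shares = (y_train.mean(), y_valid.mean(), y_test.mean())
print(sizes, shares)
```

Because `stratify` is passed at both steps, each subset keeps the 10% positive share, which is the same property the check above confirms on the real data.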

In [None]:
corpus = train[x_col].values.astype('U')

In [None]:
stopwords = set(nltk_stopwords.words('english'))

In [None]:
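# Recent scikit-learn versions accept stop_words only as 'english', a list, or None, so the set is converted.
count_tf_idf = TfidfVectorizer(stop_words = list(stopwords))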

In [None]:
tf_idf = count_tf_idf.fit_transform(corpus)

In [None]:
valid_features = count_tf_idf.transform(valid[x_col].values.astype('U'))

In [None]:
test_features = count_tf_idf.transform(test[x_col].values.astype('U'))

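I compute TF-IDF for all texts in the DataFrame: the vectorizer is fitted on the training set only, and the validation and test sets are transformed with the fitted vocabulary.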
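A small sketch of the fit-on-train, transform-elsewhere pattern, using made-up sentences (`train_texts` and `valid_texts` are hypothetical names, not variables from the notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already cleaned and lemmatized texts.
train_texts = ["you be a hero", "this edit be wrong", "stop the edit war"]
valid_texts = ["the hero stop the vandal"]

vectorizer = TfidfVectorizer()
# The vocabulary and IDF weights are learned from the training texts only.
train_matrix = vectorizer.fit_transform(train_texts)
# New texts are projected onto that fixed vocabulary; unseen words are dropped.
valid_matrix = vectorizer.transform(valid_texts)

vocab = vectorizer.vocabulary_
print(train_matrix.shape, valid_matrix.shape, "vandal" in vocab)
```

Both matrices have the same number of columns, so a model trained on one can score the other, and the word `vandal`, which never occurs in the training texts, contributes nothing to the validation row.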

## Training

In [None]:
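def f1_eval(y_pred, dtrain):
    # Custom XGBoost metric: 1 - F1, so that lower is better, like the built-in error.
    y_true = dtrain.get_label()
    err = 1-f1_score(y_true, np.round(y_pred))
    return 'f1_err', err
    
model = XGBClassifier(random_state = 4)
model.fit(tf_idf, train[y_col], eval_set=[(valid_features, valid[y_col])], eval_metric = f1_eval)
predictions = model.predict(valid_features)
# The trailing comma makes the next line a standalone statement whose value is discarded,
# so the cell displays only the F1 score.
accuracy_score(valid[y_col],predictions),
f1_score(valid[y_col],predictions)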

[0]	validation_0-error:0.072946	validation_0-f1_err:0.542152
[1]	validation_0-error:0.072977	validation_0-f1_err:0.54125
[2]	validation_0-error:0.072852	validation_0-f1_err:0.543351
[3]	validation_0-error:0.072915	validation_0-f1_err:0.541793
[4]	validation_0-error:0.072915	validation_0-f1_err:0.543819
[5]	validation_0-error:0.072664	validation_0-f1_err:0.540433
[6]	validation_0-error:0.072664	validation_0-f1_err:0.540938
[7]	validation_0-error:0.073322	validation_0-f1_err:0.551107
[8]	validation_0-error:0.073165	validation_0-f1_err:0.549283
[9]	validation_0-error:0.075014	validation_0-f1_err:0.573826
[10]	validation_0-error:0.073009	validation_0-f1_err:0.547977
[11]	validation_0-error:0.072977	validation_0-f1_err:0.550461
[12]	validation_0-error:0.077301	validation_0-f1_err:0.605102
[13]	validation_0-error:0.077113	validation_0-f1_err:0.60363
[14]	validation_0-error:0.075077	validation_0-f1_err:0.576793
[15]	validation_0-error:0.077019	validation_0-f1_err:0.602746
[16]	validation_0-er

0.5664214625703159

In [None]:
model = DecisionTreeClassifier(random_state = 4)
model.fit(tf_idf, train[y_col])
predictions = model.predict(valid_features)
# As in the previous cell, only the F1 score on the last line is displayed.
accuracy_score(valid[y_col],predictions),
f1_score(valid[y_col],predictions)

0.7040421248257704

In [None]:
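# Grid search over the regularization strength C and the decision threshold,
# scored by F1 on the validation set.
best_f1 = 0
for c in np.linspace(0.001, 10, 100):
    # The default max_iter triggers the convergence warnings shown below;
    # raising it would silence them but may change the reported scores.
    model = LogisticRegression(C = c)
    model.fit(tf_idf, train[y_col])
    proba = model.predict_proba(valid_features)
    for threshold in np.arange(0, 1, 0.01):
        predictions = proba[:,1] > threshold
        f = f1_score(valid[y_col], predictions)
        if f > best_f1:
            best_f1 = f
            best_thrs = threshold
            best_c = c
best_f1, best_thrs, best_c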

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

(0.7858608174214928, 0.28, 4.142)

In [None]:
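# The search above reported best_c = 4.142; this cell uses the neighbouring grid value 4.041,
# so the F1 below is marginally lower than best_f1.
model = LogisticRegression(C = 4.041)
model.fit(tf_idf, train[y_col])
proba = model.predict_proba(valid_features)
predictions = proba[:,1] > 0.28
f1_score(valid[y_col], predictions)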

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.7852857593937479

In [None]:
proba = model.predict_proba(valid_features)

In [None]:
# model.predict applies the default 0.5 threshold, not the 0.28 tuned on the validation set.
f1_score(test[y_col], model.predict(test_features))

0.7688219663418956

Logistic regression gave the best result: an F1 of 0.785 on the validation set and 0.769 on the test set.
The other models performed worse, and CatBoost crashes the kernel altogether (I hid those cells because they take a long time to train).
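The validation score relies on a tuned threshold, while the test score comes from `model.predict` with its default 0.5 cut-off. A minimal sketch of picking a threshold on validation data so that it can be reused elsewhere; the helper names `f1_binary` and `best_threshold` and the numbers are made up for illustration:

```python
import numpy as np

def f1_binary(y_true, y_pred):
    # F1 = 2*TP / (2*TP + FP + FN); 0 when there are no true positives.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, proba):
    # Scan the same grid as the notebook and keep the first threshold with the top F1.
    grid = np.arange(0, 1, 0.01)
    scores = [f1_binary(y_true, (proba > t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

# Made-up validation labels and positive-class probabilities.
y_valid = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
p_valid = np.array([0.052, 0.104, 0.153, 0.207, 0.452, 0.303,
                    0.355, 0.408, 0.705, 0.901])

threshold, f1_tuned = best_threshold(y_valid, p_valid)
f1_default = f1_binary(y_valid, (p_valid > 0.5).astype(int))
print(threshold, f1_tuned, f1_default)
```

On this toy data the tuned threshold beats the default one. In the notebook's terms, the equivalent step would be comparing `model.predict_proba(test_features)[:, 1]` with the threshold chosen on the validation set; whether that improves the real test score is not something this sketch can show.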

## Conclusions

### The following work was done in the course of the project:
- the data was checked: there are no missing values and no duplicates;
- the sample turned out to be imbalanced: only about 10% of the comments belong to the positive class;
- the text was processed: extraneous characters were removed and lemmatization was applied;
- the data was split into 3 sets (60%, 20%, 20%);
- the models were trained and evaluated on the validation set, and logistic regression gave the best result;
- on the test set, logistic regression reached an f1_score of 0.77, so the task is complete.