# Toxic comment detection (CatBoost)

Train a model that classifies comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.
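For reference, *F1* is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from confusion counts on made-up labels; `y_true`, `y_pred`, and the helper `f1_from_labels` are illustrative names, not part of the dataset or of any library. For binary 0/1 labels the result should agree with `sklearn.metrics.f1_score` at its default settings.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for illustration only: 1 = toxic, 0 = non-toxic.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# 3 true positives, 1 false positive, 1 false negative.
f1_example = f1_from_labels(y_true, y_pred)
print(f1_example)
```

Here precision and recall are both 3/4, so the score lands exactly on the 0.75 target.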

In [None]:
! pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.0.6-cp37-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.0.6


In [None]:
import pandas as pd
import numpy as np

import re
import string

from sklearn.model_selection import train_test_split

from catboost import Pool, cv
from catboost import CatBoostClassifier

from sklearn.metrics import f1_score
#from sklearn.model_selection import cross_val_score


In [None]:
df_tweets = pd.read_csv('/content/toxic_comments.csv')

In [None]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [None]:
df_tweets['toxic'].value_counts(normalize = True)

0    0.898321
1    0.101679
Name: toxic, dtype: float64
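
With roughly 10% toxic comments, accuracy alone says little. A quick back-of-the-envelope check, using only the class shares printed above, shows what two trivial classifiers would score; the variable names are illustrative.

```python
# Class shares taken from the value_counts output above.
share_toxic = 0.101679
share_clean = 0.898321

# Classifier A: always predicts "not toxic".
accuracy_all_clean = share_clean  # correct on every non-toxic comment
f1_all_clean = 0.0                # no true positives, so F1 for the toxic class is 0

# Classifier B: always predicts "toxic".
precision_all_toxic = share_toxic  # only the truly toxic share is correct
recall_all_toxic = 1.0             # every toxic comment is flagged
f1_all_toxic = (2 * precision_all_toxic * recall_all_toxic
                / (precision_all_toxic + recall_all_toxic))

print(f"always clean: accuracy={accuracy_all_clean:.3f}, F1={f1_all_clean:.3f}")
print(f"always toxic: F1={f1_all_toxic:.3f}")
```

Both baselines fall far below the 0.75 target even though the first one is right about 90% of the time, which is why F1 on the toxic class is the more informative yardstick here.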

In [None]:
df_tweets.duplicated().sum()

0

-------

In [None]:
df_tweets.text[0]

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

In [None]:
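def text_cleaning(text):
    """Lowercase the text and strip brackets, punctuation, digits, and stray letters."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)          # drop [bracketed] fragments
    text = re.sub(r'\W', ' ', text)              # non-word characters -> space
    # The URL, HTML-tag, and newline patterns below run after the \W
    # substitution, so they no longer find anything to match.
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # leftover underscores
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)         # tokens containing digits
    text = re.sub(r'\s+[a-zA-Z]\s', ' ', text)   # single letters, first pass
    # Second pass catches a letter adjacent to one removed above; then collapse whitespace.
    text = ' '.join(re.sub(r'\s+[a-zA-Z]\s', ' ', text).split())

    return text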

In [None]:
text_cleaning(df_tweets.text[0])

'explanation why the edits made under my username hardcore metallica fan were reverted they weren vandalisms just closure on some gas after voted at new york dolls fac and please don remove the template from the talk page since retired now'
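
Applying the single-letter pattern twice in `text_cleaning` has a visible effect: each match consumes the whitespace after the letter, so an adjacent stray letter has no leading whitespace left to match in the same pass. The sketch below reproduces this on a short string. `SINGLE_LETTER_LOOKAHEAD` is an alternative added here for illustration (an assumption, not part of the original function) that handles adjacent letters in one pass.

```python
import re

# Pattern used in text_cleaning: the trailing \s is consumed by each match.
SINGLE_LETTER = re.compile(r'\s+[a-zA-Z]\s')
# Alternative (assumption): a lookahead leaves the trailing whitespace unconsumed.
SINGLE_LETTER_LOOKAHEAD = re.compile(r'\s+[a-zA-Z](?=\s)')

# What "since I'm retired now" looks like once punctuation has become spaces.
sample = 'since i m retired now'

one_pass = SINGLE_LETTER.sub(' ', sample)
two_pass = SINGLE_LETTER.sub(' ', one_pass)
lookahead = SINGLE_LETTER_LOOKAHEAD.sub('', sample)

print(one_pass)   # since m retired now
print(two_pass)   # since retired now
print(lookahead)  # since retired now
```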

------

In [None]:
df_tweets['text'] = df_tweets['text'].apply(text_cleaning)

-----

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    df_tweets.drop('toxic', axis=1), 
    df_tweets.toxic, 
    test_size=0.2, 
    random_state=12345, 
    stratify=df_tweets.toxic) 

------

In [None]:
parameters = {'loss_function': 'Logloss',
              'task_type': 'GPU',
              'eval_metric': 'AUC',
              'early_stopping_rounds': 200,
              # 'learning_rate': 0.1,
              # 'depth': 8,
              'iterations': 3000,  # default is 1000
              'random_seed': 12345,
              'verbose': 200}

cv_dataset = Pool(data=features_train,
                  label=target_train,
                  text_features=['text'])

scores = cv(cv_dataset,
            parameters,
            fold_count=5,
            shuffle=True,
            stratified=True)  # plotting (plot=True) is left disabled

Training on fold [0/5]


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.9098963	best: 0.9098963 (0)	total: 102ms	remaining: 5m 6s
200:	test: 0.9578780	best: 0.9578780 (200)	total: 9.57s	remaining: 2m 13s
400:	test: 0.9625102	best: 0.9625495 (395)	total: 19s	remaining: 2m 3s
600:	test: 0.9640998	best: 0.9640998 (600)	total: 28.4s	remaining: 1m 53s
800:	test: 0.9649914	best: 0.9650127 (792)	total: 37.7s	remaining: 1m 43s
1000:	test: 0.9655859	best: 0.9655939 (998)	total: 47.1s	remaining: 1m 34s
1200:	test: 0.9660353	best: 0.9660386 (1196)	total: 56.6s	remaining: 1m 24s
1400:	test: 0.9664170	best: 0.9664314 (1398)	total: 1m 5s	remaining: 1m 15s
1600:	test: 0.9665658	best: 0.9665861 (1536)	total: 1m 15s	remaining: 1m 5s
1800:	test: 0.9668894	best: 0.9668894 (1800)	total: 1m 25s	remaining: 56.7s
2000:	test: 0.9670136	best: 0.9670225 (1990)	total: 1m 34s	remaining: 47.4s
2200:	test: 0.9671061	best: 0.9671159 (2188)	total: 1m 45s	remaining: 38.2s
2400:	test: 0.9671874	best: 0.9671890 (2398)	total: 1m 54s	remaining: 28.6s
2600:	test: 0.9673427	best: 0.9

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8968612	best: 0.8968612 (0)	total: 96.1ms	remaining: 4m 48s
200:	test: 0.9602778	best: 0.9602778 (200)	total: 9.69s	remaining: 2m 14s
400:	test: 0.9655087	best: 0.9655098 (399)	total: 19.3s	remaining: 2m 4s
600:	test: 0.9676024	best: 0.9676024 (600)	total: 28.9s	remaining: 1m 55s
800:	test: 0.9687332	best: 0.9687357 (799)	total: 38.6s	remaining: 1m 45s
1000:	test: 0.9695002	best: 0.9695002 (1000)	total: 48.2s	remaining: 1m 36s
1200:	test: 0.9700526	best: 0.9700729 (1194)	total: 57.8s	remaining: 1m 26s
1400:	test: 0.9703726	best: 0.9704238 (1384)	total: 1m 7s	remaining: 1m 16s
1600:	test: 0.9705029	best: 0.9705178 (1592)	total: 1m 16s	remaining: 1m 7s
1800:	test: 0.9706233	best: 0.9706233 (1800)	total: 1m 26s	remaining: 57.4s
2000:	test: 0.9707013	best: 0.9707386 (1980)	total: 1m 35s	remaining: 47.8s
2200:	test: 0.9708872	best: 0.9708913 (2198)	total: 1m 45s	remaining: 38.2s
2400:	test: 0.9709927	best: 0.9710124 (2373)	total: 1m 54s	remaining: 28.6s
2600:	test: 0.9710133	best

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8999985	best: 0.8999985 (0)	total: 96.7ms	remaining: 4m 50s
200:	test: 0.9608703	best: 0.9608910 (199)	total: 9.75s	remaining: 2m 15s
400:	test: 0.9653779	best: 0.9653779 (400)	total: 19.4s	remaining: 2m 5s
600:	test: 0.9672682	best: 0.9672830 (593)	total: 29.1s	remaining: 1m 56s
800:	test: 0.9684839	best: 0.9684850 (798)	total: 38.7s	remaining: 1m 46s
1000:	test: 0.9691704	best: 0.9691977 (990)	total: 48.3s	remaining: 1m 36s
1200:	test: 0.9697018	best: 0.9697079 (1182)	total: 58s	remaining: 1m 26s
1400:	test: 0.9700062	best: 0.9700358 (1397)	total: 1m 7s	remaining: 1m 17s
1600:	test: 0.9703532	best: 0.9703537 (1598)	total: 1m 16s	remaining: 1m 7s
1800:	test: 0.9705747	best: 0.9705747 (1800)	total: 1m 26s	remaining: 57.6s
2000:	test: 0.9707494	best: 0.9707494 (2000)	total: 1m 36s	remaining: 48.1s
2200:	test: 0.9707520	best: 0.9708335 (2123)	total: 1m 45s	remaining: 38.4s
2400:	test: 0.9708870	best: 0.9708936 (2396)	total: 1m 55s	remaining: 28.8s
2600:	test: 0.9710017	best: 0

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.9006012	best: 0.9006012 (0)	total: 96.8ms	remaining: 4m 50s
200:	test: 0.9593019	best: 0.9593019 (200)	total: 9.7s	remaining: 2m 15s
400:	test: 0.9635668	best: 0.9635698 (399)	total: 19.3s	remaining: 2m 5s
600:	test: 0.9656924	best: 0.9656924 (600)	total: 28.9s	remaining: 1m 55s
800:	test: 0.9666409	best: 0.9666559 (798)	total: 38.6s	remaining: 1m 45s
1000:	test: 0.9673520	best: 0.9673554 (999)	total: 48.2s	remaining: 1m 36s
1200:	test: 0.9679722	best: 0.9679722 (1200)	total: 57.9s	remaining: 1m 26s
1400:	test: 0.9683293	best: 0.9683399 (1398)	total: 1m 7s	remaining: 1m 16s
1600:	test: 0.9685046	best: 0.9685046 (1600)	total: 1m 16s	remaining: 1m 7s
1800:	test: 0.9688278	best: 0.9688278 (1800)	total: 1m 26s	remaining: 57.6s
2000:	test: 0.9689571	best: 0.9689697 (1999)	total: 1m 35s	remaining: 47.9s
2200:	test: 0.9691450	best: 0.9691450 (2200)	total: 1m 45s	remaining: 38.3s
2400:	test: 0.9692770	best: 0.9692897 (2395)	total: 1m 54s	remaining: 28.7s
2600:	test: 0.9693113	best: 

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.9014562	best: 0.9014562 (0)	total: 105ms	remaining: 5m 13s
200:	test: 0.9586489	best: 0.9586489 (200)	total: 9.73s	remaining: 2m 15s
400:	test: 0.9630933	best: 0.9630933 (400)	total: 19.4s	remaining: 2m 5s
600:	test: 0.9652278	best: 0.9652278 (600)	total: 29s	remaining: 1m 55s
800:	test: 0.9662094	best: 0.9662548 (794)	total: 38.6s	remaining: 1m 46s
1000:	test: 0.9668905	best: 0.9668905 (1000)	total: 48.2s	remaining: 1m 36s
1200:	test: 0.9674586	best: 0.9674586 (1200)	total: 57.7s	remaining: 1m 26s
1400:	test: 0.9679002	best: 0.9679028 (1399)	total: 1m 7s	remaining: 1m 16s
1600:	test: 0.9682296	best: 0.9682559 (1593)	total: 1m 16s	remaining: 1m 6s
1800:	test: 0.9684521	best: 0.9684668 (1793)	total: 1m 26s	remaining: 57.3s
2000:	test: 0.9686081	best: 0.9686253 (1988)	total: 1m 35s	remaining: 47.7s
2200:	test: 0.9688155	best: 0.9688155 (2200)	total: 1m 45s	remaining: 38.1s
2400:	test: 0.9689761	best: 0.9689930 (2390)	total: 1m 54s	remaining: 28.6s
2600:	test: 0.9691341	best: 0

In [None]:
scores

      iterations  test-AUC-mean  test-AUC-std  test-Logloss-mean  test-Logloss-std  train-Logloss-mean  train-Logloss-std
0              0       0.901763      0.004867           0.645636          0.000667            0.646171           0.000774
1              1       0.910910      0.006782                NaN               NaN                 NaN                NaN
2              2       0.913170      0.007098                NaN               NaN                 NaN                NaN
3              3       0.923859      0.005722                NaN               NaN                 NaN                NaN
4              4       0.926564      0.006852                NaN               NaN                 NaN                NaN
...          ...            ...           ...                ...               ...                 ...                ...
2995        2995       0.969632      0.001509           0.113593          0.003358            0.094577           0.001017
2996        2996       0.969631      0.001510                NaN               NaN                 NaN                NaN
2997        2997       0.969633      0.001510                NaN               NaN                 NaN                NaN
2998        2998       0.969632      0.001507                NaN               NaN                 NaN                NaN


In [None]:
best_value = np.max(scores['test-AUC-mean'])    # highest mean cross-validated AUC
best_iter = np.argmax(scores['test-AUC-mean'])  # zero-based position of that iteration

In [None]:
best_value

0.9696337223052979

In [None]:
best_iter

2970
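
Note that `np.argmax` returns a zero-based position, whereas CatBoost's `iterations` parameter is a count of trees. A toy sketch with made-up AUC values (not from the run above; all names are illustrative) shows the difference: to include the best iteration itself, the count would be the index plus one.

```python
# Hypothetical mean AUC per boosting iteration (illustrative values only).
mean_auc = [0.90, 0.93, 0.95, 0.97, 0.96]

# Zero-based position of the maximum; for a plain list this matches np.argmax.
best_index = max(range(len(mean_auc)), key=mean_auc.__getitem__)

# Number of trees needed so that the iteration at best_index is included.
trees_to_train = best_index + 1

print(best_index, trees_to_train)  # 3 4
```

In the run above the difference is a single tree out of roughly 3000, so its practical effect is likely negligible.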

In [None]:
params = {'task_type': 'GPU',
          'text_features': ['text'],
          'eval_metric': 'AUC',  # better not to use F1 here, since the classes are imbalanced
          'early_stopping_rounds': 200,
          'random_seed': 12345,
          'verbose': 200}

In [None]:
model = CatBoostClassifier(**params, iterations=best_iter)

In [None]:
model.fit(features_train, target_train)

Learning rate set to 0.009735


Default metric period is 5 because AUC is/are not implemented for GPU


0:	total: 22.4ms	remaining: 1m 6s
200:	total: 2.68s	remaining: 36.9s
400:	total: 5.19s	remaining: 33.3s
600:	total: 7.77s	remaining: 30.6s
800:	total: 10.3s	remaining: 27.9s
1000:	total: 12.8s	remaining: 25.2s
1200:	total: 15.3s	remaining: 22.6s
1400:	total: 17.8s	remaining: 20s
1600:	total: 20.3s	remaining: 17.4s
1800:	total: 22.8s	remaining: 14.8s
2000:	total: 25.3s	remaining: 12.3s
2200:	total: 27.8s	remaining: 9.72s
2400:	total: 30.3s	remaining: 7.19s
2600:	total: 32.9s	remaining: 4.66s
2800:	total: 35.4s	remaining: 2.13s
2969:	total: 37.5s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7fceb0305850>

In [None]:
predictions = model.predict(features_train)

In [None]:
f1_score(target_train, predictions) 

0.8130723347055363

In [None]:
predictions = model.predict(features_test)  # predictions on the test set

In [None]:
f1_score(target_test, predictions) # f1

0.7797833935018051

------

In [None]:
parameters_CB = {'task_type': 'GPU',
                 'text_features': ['text'],
                 'eval_metric': 'AUC',  # better not to use F1 here, since the classes are imbalanced
                 'early_stopping_rounds': 200,
                 # 'learning_rate': 0.1,
                 # 'depth': 8,
                 'iterations': 1000,  # the default value
                 'random_seed': 12345,
                 'verbose': 200}

In [None]:
model_CB = CatBoostClassifier(**parameters_CB)

In [None]:
model_CB.fit(features_train, target_train)

Learning rate set to 0.025819


Default metric period is 5 because AUC is/are not implemented for GPU


0:	total: 22.5ms	remaining: 22.5s
200:	total: 2.67s	remaining: 10.6s
400:	total: 5.19s	remaining: 7.75s
600:	total: 7.7s	remaining: 5.11s
800:	total: 10.2s	remaining: 2.53s
999:	total: 12.7s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7fceb00a6710>

In [None]:
model_CB.best_score_

{'learn': {'Logloss': 0.11569275258066601}}

In [None]:
predictions = model_CB.predict(features_train)

In [None]:
f1_score(target_train, predictions) 

0.8107971745711403

In [None]:
predictions = model_CB.predict(features_test)  # predictions on the test set

In [None]:
f1_score(target_test, predictions) # f1

0.7788858321870702