# Project for "Wikishop"

The "Wikishop" online store is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

**Contents**
- [Preparation](#section1)
 - [Loading the dataset](#section2)
 - [Brief summary of the dataset](#section3)
 - [Feature creation](#section4)
 - [Summary](#section5)
- [Training](#section6)
- [Conclusions](#section7)
 

<a id = 'section1'></a>
## Preparation

In [3]:
import pandas as pd
import numpy as np
import re
from pymystem3 import Mystem
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

<a id = 'section2'></a>
### Loading the dataset

In [16]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [18]:
df.sample(5)

Unnamed: 0,text,toxic
107495,"""\nYou said that, not me ! ≈talk≈ """,0
108959,Link \nSomeone placed a link to the prior arti...,0
67571,Clearly you have personal issues with my polit...,0
57733,These messages keep getting censored by the no...,0
67051,Merry Christmas and Happy New Year!,0


In [19]:
df['toxic'].mean()

0.10167887648758234

In [20]:
df['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

In [22]:
class_ratio = df['toxic'].value_counts()[0] / df['toxic'].value_counts()[1]
class_ratio

8.834884437596301

The classes are imbalanced: the ratio of toxic to non-toxic comments is about 1:8.83.

<a id = 'section3'></a>
### Brief summary of the dataset

There are no missing values, and the comments are written in conversational English. The classes are imbalanced: toxic comments are clearly fewer than positive ones and make up only about 10% of the sample. To reach the required metric level (*F1* of at least 0.75), we first build models without accounting for the imbalance, then with it, and pick the best models.

To account for the imbalance we will use:
- Upsampling
- Changing class weights in the models
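To illustrate why F1, rather than accuracy, is a sensible target under this imbalance, the sketch below evaluates two trivial classifiers on the class counts reported above (143346 non-toxic, 16225 toxic). The helper names `f1_from_counts` and `accuracy_from_counts` are hypothetical and exist only for this illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of correct predictions among all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


N_NEG, N_POS = 143346, 16225  # class counts from value_counts() above

# A classifier that always answers "non-toxic": no true or false positives.
always_negative = {
    "accuracy": accuracy_from_counts(tp=0, tn=N_NEG, fp=0, fn=N_POS),
    "f1": f1_from_counts(tp=0, fp=0, fn=N_POS),
}

# A classifier that always answers "toxic": every negative becomes a false positive.
always_positive = {
    "accuracy": accuracy_from_counts(tp=N_POS, tn=0, fp=N_NEG, fn=0),
    "f1": f1_from_counts(tp=N_POS, fp=N_NEG, fn=0),
}

print(always_negative)
print(always_positive)
```

The always-negative classifier is right on roughly 90% of comments yet has an F1 of zero, which is why accuracy alone would be misleading for this task.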

I ran into a problem: the kernel on the Yandex servers kept dying, so I could not finish the project there and did the work locally. My local machine runs Windows, where I could not run the lemmatization because, as far as I understand, it only works well on Linux. I therefore ran the lemmatization on the Yandex servers, exported the dataset, and worked with that file on the local machine. Link to my OneDrive: https://1drv.ms/u/s!AuAtVzzlUFeVgc4ptFqqpcE-x_ns-w?e=T5z1ND The comment lemmatization below works, but only on the Yandex servers (the code cell was converted to markdown).

In [23]:
%%time

m = Mystem()

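def lemmatize(text):
    # Lowercase, then replace everything except Latin letters with a space.
    text = text.lower()
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Lemmatize the cleaned text rather than the raw input;
    # lemmatizing `text` would leave punctuation in the result.
    lemm_text = "".join(m.lemmatize(cleared_text))

    # Collapse the repeated whitespace left after cleaning.
    return " ".join(lemm_text.split())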

df['lemm_text'] = df['text'].apply(lemmatize)

del m

CPU times: user 39.2 s, sys: 8.12 s, total: 47.3 s
Wall time: 1min 40s
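
The cleaning step of `lemmatize` can be tried in isolation, without `pymystem3`. The sketch below is a minimal illustration that uses only the standard library; `clean_text` is a hypothetical helper name, and the sample string is the visible beginning of one of the comments previewed in this notebook.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, replace non-Latin-letter characters with spaces, collapse whitespace."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z]", " ", lowered)
    return " ".join(letters_only.split())


sample = "D'aww! He matches this background colour"
cleaned = clean_text(sample)
print(cleaned)
```

Newlines and punctuation both become single spaces, so the output is ready to be passed to a lemmatizer or a vectorizer.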


In [24]:
df

Unnamed: 0,text,toxic,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,explanation why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,d'aww! he matches this background colour i'm s...
2,"Hey man, I'm really not trying to edit war. It...",0,"hey man, i'm really not trying to edit war. it..."
3,"""\nMore\nI can't make any real suggestions on ...",0,""" more i can't make any real suggestions on im..."
4,"You, sir, are my hero. Any chance you remember...",0,"you, sir, are my hero. any chance you remember..."
...,...,...,...
159566,""":::::And for the second time of asking, when ...",0,""":::::and for the second time of asking, when ..."
159567,You should be ashamed of yourself \n\nThat is ...,0,you should be ashamed of yourself that is a ho...
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,"spitzer umm, theres no actual article for pros..."
159569,And it looks like it was actually you who put ...,0,and it looks like it was actually you who put ...


In [9]:
df = pd.read_csv(r'C:\Users\thund\OneDrive\Рабочий стол\DS\яндекс\data.csv')

In [10]:
df.sample(5)

Unnamed: 0,toxic,lemm_text
156663,0,and perpetuating beliefs of those that just do...
7414,0,you dodged my request for libel about me to be...
26790,1,stop removing my edits why do you insist on be...
35290,0,thanks to sasquatch for blocking this ip
33960,0,please do not add nonsense to wikipedia it is ...


<a id = 'section4'></a>
### Feature creation

We separate the features from the target and split the data into test and train sets in a 25% / 75% ratio, using stratification.

In [25]:
x = df['lemm_text']
y = df['toxic']

In [26]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size = 0.25, stratify = y, random_state = 42)

In [27]:
y_train.mean()

0.10168117782716957

In [28]:
y_test.mean()

0.1016719725265084
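
The near-identical means above are what `stratify` is meant to guarantee. As a small self-contained sketch (synthetic labels with 10% positives, not the project data), the same call keeps the positive share equal in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target: 100 positives out of 1000 rows.
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# With stratification both shares stay close to the overall 0.10.
print(y_tr.mean(), y_te.mean())
```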

Now we prepare the text corpora and compute TF-IDF for them, removing stop words first.

In [29]:
corpus_x_train = x_train.values

In [30]:
corpus_x_test = x_test.values

In [31]:
nltk.download('stopwords')
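# Keep the stop words as a list: TfidfVectorizer documents stop_words as 'english', a list, or None.
stopwords = list(nltk_stopwords.words('english'))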

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [32]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

In [33]:
x_train = count_tf_idf.fit_transform(corpus_x_train)

In [34]:
x_test = count_tf_idf.transform(corpus_x_test)

In [35]:
x_train.shape

(119678, 160737)

In [36]:
x_test.shape

(39893, 160737)
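
Both matrices have the same number of columns because the vocabulary is learned from the training corpus only and then reused for the test corpus. The sketch below shows this on a toy corpus; the sentences are made up for illustration, and the built-in `stop_words="english"` list is an assumption used here instead of the NLTK list so that the example runs without downloads.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = [
    "thanks for blocking this ip",
    "please do not add nonsense to wikipedia",
    "stop removing my edits",
]
test_corpus = [
    "this nonsense again",
    "unseen words are ignored",
]

vectorizer = TfidfVectorizer(stop_words="english")
x_train_demo = vectorizer.fit_transform(train_corpus)  # learns vocabulary and IDF
x_test_demo = vectorizer.transform(test_corpus)        # reuses them unchanged

print(x_train_demo.shape, x_test_demo.shape)
# Words that never occurred in the training corpus get no column at all.
print(x_test_demo[1].nnz)
```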

<a id = 'section5'></a>
### Summary

- The comments are lemmatized
- The features for training are prepared
- A plan for dealing with the class imbalance of the target is described


<a id = 'section6'></a>
## Training

We train 4 models without accounting for class imbalance:
- LogisticRegression
- DecisionTreeClassifier
- CatBoostClassifier
- RandomForestClassifier

### Training models without accounting for class imbalance

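Hyperparameters for LogisticRegression, DecisionTreeClassifier and RandomForestClassifier were tuned with GridSearchCV (CatBoostClassifier is used with its default settings). The models were then evaluated with cross-validation, with the F1 score as the target metric.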
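
As a minimal sketch of this tuning step, the example below runs `GridSearchCV` with `scoring="f1"` on a made-up toy corpus. One assumption differs from the cells that follow: the vectorizer is placed inside a `Pipeline`, so TF-IDF is refit on the training part of every fold, whereas the notebook vectorizes the whole training set once. The texts, labels and the small grid are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    "you are an idiot", "thanks for the helpful edit",
    "shut up you moron", "please add a source",
    "stupid useless edit", "merry christmas and happy new year",
    "this is garbage you fool", "i fixed the link in the article",
    "idiot stop vandalizing", "good point thanks",
    "you moron get lost", "see the talk page for details",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear", random_state=42)),
])

# Pipeline parameters are addressed as "<step name>__<parameter>".
param_grid = {
    "clf__C": [1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=3, scoring="f1")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

A coarse grid like this keeps the number of fits small: 4 parameter combinations times 3 folds, plus one final refit.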

#### LogisticRegression

In [37]:
Logreg = LogisticRegression(random_state = 42)

In [38]:
logreg_grid = {'C': np.arange(1.0, 100.0, .1),
               'solver':['newton-cg','liblinear'],
               'class_weight': [None, 'balanced'],
              }

In [39]:
gs_logreg = GridSearchCV(Logreg, param_grid=logreg_grid, cv =3, scoring = 'f1')

In [None]:
%%time

gs_logreg.fit(x_train, y_train)

In [27]:
logreg_best_params = gs_logreg.best_params_
logreg_best_params

{'class_weight': 'balanced', 'solver': 'newton-cg'}

In [28]:
Logreg = LogisticRegression(random_state = 42)
Logreg.set_params(**logreg_best_params)

LogisticRegression(class_weight='balanced', random_state=42, solver='newton-cg')

In [29]:
logreg_f1_train = cross_val_score(Logreg, x_train, y_train, cv = 3, scoring='f1').mean()
print(f'f1_score of LogisticRegression on the train set: {logreg_f1_train}')

f1_score of LogisticRegression on the train set: 0.7495763720303711


#### DecisionTreeClassifier

In [30]:
DecTree = DecisionTreeClassifier(random_state=42) 

In [31]:
DecTree_grid= {'max_depth':np.arange(1,31,5),
               'class_weight': [None, 'balanced'],
               'criterion':['gini', 'entropy'],
              }

In [32]:
gs_DecTree = GridSearchCV(DecTree, DecTree_grid, cv = 3, scoring = 'f1', verbose = 10)

%%time

gs_DecTree.fit(x_train, y_train)

In [33]:
DecTree_best_params = {'class_weight': None, 'criterion': 'gini', 'max_depth': 26}
DecTree_best_params

{'class_weight': None, 'criterion': 'gini', 'max_depth': 26}

In [34]:
DecTree = DecisionTreeClassifier(random_state=42)
DecTree.set_params(**DecTree_best_params)

DecisionTreeClassifier(max_depth=26, random_state=42)

In [35]:
DecTree_f1_train = cross_val_score(DecTree, x_train, y_train, cv = 3, scoring='f1').mean()
print(f'f1_score of DecisionTreeClassifier on the train set: {DecTree_f1_train}')

f1_score of DecisionTreeClassifier on the train set: 0.6595829568748616


#### RandomForestClassifier

In [36]:
RandomForest = RandomForestClassifier(random_state=42)

In [37]:
RandomForest_grid = {'n_estimators':np.arange(1,31, 5),
                     'max_depth':np.arange(1,31, 5),
                     'class_weight': [None, 'balanced'],
                     'criterion':['gini', 'entropy']
                    }

In [38]:
gs_RandomForest = GridSearchCV(RandomForest, RandomForest_grid, cv = 3, 
                               scoring = 'f1', verbose = 20)

%%time

gs_RandomForest.fit(x_train, y_train)

In [39]:
RandomForest_best_params = {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 26, 'n_estimators': 26}
RandomForest_best_params

{'class_weight': 'balanced',
 'criterion': 'gini',
 'max_depth': 26,
 'n_estimators': 26}

In [40]:
RandomForest = RandomForestClassifier(random_state=42)
RandomForest.set_params(**RandomForest_best_params)

RandomForestClassifier(class_weight='balanced', max_depth=26, n_estimators=26,
                       random_state=42)

In [41]:
RandomForest_f1_train = cross_val_score(RandomForest, x_train, y_train, cv = 3, scoring='f1').mean()
print(f'f1_score of RandomForestClassifier on the train set: {RandomForest_f1_train}')

f1_score of RandomForestClassifier on the train set: 0.41928586842614995


#### CatBoostClassifier

In [42]:
catboost_model = CatBoostClassifier(random_state = 42, verbose = 20)

In [44]:
catboost__f1_train = cross_val_score(catboost_model, x_train, y_train, scoring='f1', cv = 3).mean()
print(f'f1_score of CatBoostClassifier on the train set: {catboost__f1_train}')

Learning rate set to 0.066843
0:	learn: 0.6239303	total: 1.4s	remaining: 23m 18s
20:	learn: 0.2503416	total: 27.5s	remaining: 21m 23s
40:	learn: 0.2166752	total: 53.6s	remaining: 20m 53s
60:	learn: 0.2013085	total: 1m 19s	remaining: 20m 24s
80:	learn: 0.1918484	total: 1m 45s	remaining: 19m 57s
100:	learn: 0.1842798	total: 2m 11s	remaining: 19m 30s
120:	learn: 0.1782587	total: 2m 38s	remaining: 19m 9s
140:	learn: 0.1726844	total: 3m 3s	remaining: 18m 37s
160:	learn: 0.1682544	total: 3m 32s	remaining: 18m 26s
180:	learn: 0.1645008	total: 3m 59s	remaining: 18m 2s
200:	learn: 0.1606302	total: 4m 26s	remaining: 17m 39s
220:	learn: 0.1573103	total: 4m 55s	remaining: 17m 22s
240:	learn: 0.1541688	total: 5m 30s	remaining: 17m 19s
260:	learn: 0.1512340	total: 6m 1s	remaining: 17m 2s
280:	learn: 0.1488709	total: 6m 32s	remaining: 16m 44s
300:	learn: 0.1464044	total: 7m 2s	remaining: 16m 21s
320:	learn: 0.1441915	total: 7m 32s	remaining: 15m 57s
340:	learn: 0.1422495	total: 8m 2s	remaining: 15m 3

In [45]:
data = [logreg_f1_train, DecTree_f1_train, RandomForest_f1_train, catboost__f1_train]
index = ['LogisticRegression', 'DecisionTreeClassifier', 'RandomForestClassifier', 'CatBoostClassifier']

In [46]:
df_f1 = pd.DataFrame(data = data, index = index, columns = ['f1_train_without_balance'])
df_f1

Unnamed: 0,f1_train_without_balance
LogisticRegression,0.749576
DecisionTreeClassifier,0.659583
RandomForestClassifier,0.419286
CatBoostClassifier,0.741834


None of the models reached the required f1_score level, but LogisticRegression and CatBoostClassifier came closest. Next, we address the class imbalance.

### Upsampling

Now we increase the share of class 1 in the target through upsampling: we write a function and prepare the features for classification.

In [23]:
def upsample(x, y, repeat):
    x_zeros = x[y == 0]
    x_ones = x[y == 1]
    y_zeros = y[y == 0]
    y_ones = y[y == 1]

    x_upsampled = pd.concat([x_zeros] + [x_ones] * repeat)
    y_upsampled = pd.concat([y_zeros] + [y_ones] * repeat)
    
    x_upsampled, y_upsampled = shuffle(x_upsampled, y_upsampled, random_state=42)
    
    return x_upsampled, y_upsampled

In [24]:
x = df['lemm_text']
y = df['toxic']

In [25]:
x_train_up, x_test_up, y_train_up, y_test_up = train_test_split(
    x, y, test_size = 0.25, stratify = y, random_state = 42)

In [26]:
x_train_upsampled, y_train_upsampled = upsample(x_train_up, y_train_up, 10)

In [27]:
corpus_x_train_upsampled = x_train_upsampled.values

In [28]:
corpus_x_test = x_test_up.values

In [29]:
x_train_upsampled = count_tf_idf.fit_transform(corpus_x_train_upsampled)

In [30]:
x_test = count_tf_idf.transform(corpus_x_test)

In [31]:
y_train.mean()

0.10168117782716957

In [32]:
y_train_upsampled.mean()

0.5309359988481626
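
The new positive share follows directly from the original share and the repeat factor. The sketch below reproduces the arithmetic; `upsampled_positive_share` is a hypothetical helper written for this illustration.

```python
def upsampled_positive_share(positive_share: float, repeat: int) -> float:
    """Share of class 1 after every positive row is repeated `repeat` times."""
    positives = positive_share * repeat
    negatives = 1.0 - positive_share
    return positives / (positives + negatives)


share_10 = upsampled_positive_share(0.10168117782716957, repeat=10)
share_9 = upsampled_positive_share(0.10168117782716957, repeat=9)
print(round(share_10, 4), round(share_9, 4))
```

With `repeat=10` the positives slightly outnumber the negatives (about 53%), while `repeat=9` would land closer to an even split.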

##### LogisticRegression after upsampling

In [57]:
logreg_f1_train_upsampled = cross_val_score(
    Logreg, x_train_upsampled, y_train_upsampled, cv = 3, scoring='f1').mean()
print(f'f1_score of LogisticRegression on the train set: {logreg_f1_train_upsampled}')

f1_score of LogisticRegression on the train set: 0.963294766513464


##### DecisionTreeClassifier after upsampling

In [58]:
DecTree_f1_train_upsampled = cross_val_score(
    DecTree, x_train_upsampled, y_train_upsampled, cv = 3, scoring='f1').mean()
print(f'f1_score of DecisionTreeClassifier on the train set: {DecTree_f1_train_upsampled}')

f1_score of DecisionTreeClassifier on the train set: 0.7678372098392678


##### RandomForestClassifier after upsampling

In [59]:
RandomForest_f1_train_upsampled = cross_val_score(
    RandomForest, x_train_upsampled, y_train_upsampled, cv = 3, scoring='f1').mean()
print(f'f1_score of RandomForestClassifier on the train set: {RandomForest_f1_train_upsampled}')

f1_score of RandomForestClassifier on the train set: 0.8466208323985693


##### CatBoostClassifier after upsampling

In [60]:
catboost__f1_train_upsampled = cross_val_score(
    catboost_model, x_train_upsampled, y_train_upsampled, scoring='f1', cv = 3).mean()
print(f'f1_score of CatBoostClassifier on the train set: {catboost__f1_train_upsampled}')

Learning rate set to 0.088217
0:	learn: 0.6517676	total: 2.5s	remaining: 41m 38s
20:	learn: 0.4843196	total: 38s	remaining: 29m 33s
40:	learn: 0.4378246	total: 1m 12s	remaining: 28m 4s
60:	learn: 0.4091305	total: 1m 46s	remaining: 27m 14s
80:	learn: 0.3873838	total: 2m 19s	remaining: 26m 27s
100:	learn: 0.3707355	total: 2m 53s	remaining: 25m 48s
120:	learn: 0.3581101	total: 3m 26s	remaining: 25m 3s
140:	learn: 0.3465543	total: 4m 2s	remaining: 24m 34s
160:	learn: 0.3353063	total: 4m 34s	remaining: 23m 52s
180:	learn: 0.3257487	total: 5m 6s	remaining: 23m 8s
200:	learn: 0.3166141	total: 5m 38s	remaining: 22m 26s
220:	learn: 0.3085338	total: 6m 11s	remaining: 21m 48s
240:	learn: 0.3011535	total: 6m 46s	remaining: 21m 21s
260:	learn: 0.2948045	total: 7m 19s	remaining: 20m 44s
280:	learn: 0.2889669	total: 7m 52s	remaining: 20m 7s
300:	learn: 0.2831725	total: 8m 24s	remaining: 19m 32s
320:	learn: 0.2782422	total: 8m 57s	remaining: 18m 56s
340:	learn: 0.2733234	total: 9m 29s	remaining: 18m 2

In [61]:
data1 = [logreg_f1_train_upsampled, DecTree_f1_train_upsampled,
         RandomForest_f1_train_upsampled, catboost__f1_train_upsampled]

In [62]:
df_f1['f1_train_upsampled'] = data1

In [63]:
df_f1

Unnamed: 0,f1_train_without_balance,f1_train_upsampled
LogisticRegression,0.749576,0.963295
DecisionTreeClassifier,0.659583,0.767837
RandomForestClassifier,0.419286,0.846621
CatBoostClassifier,0.741834,0.949946
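
The upsampled cross-validation scores should be read with caution: when rows are duplicated before the folds are formed, copies of the same toxic comment can land in both the training and the validation part of a fold, which tends to inflate the score. One possible alternative, sketched here on synthetic data and not used in this notebook, is to duplicate positives only inside the training part of each fold. The function name `cv_f1_upsampling_inside_folds` and the synthetic dataset are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def cv_f1_upsampling_inside_folds(X, y, repeat, n_splits=3, seed=42):
    """Mean F1 over folds; positives are duplicated only in each fold's training part."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in skf.split(X, y):
        pos_idx = train_idx[y[train_idx] == 1]
        # Every training row once, plus (repeat - 1) extra copies of the positives.
        up_idx = np.concatenate([train_idx] + [pos_idx] * (repeat - 1))
        model = LogisticRegression(max_iter=1000)
        model.fit(X[up_idx], y[up_idx])
        # The validation part is left untouched, so it keeps the original class ratio.
        scores.append(f1_score(y[valid_idx], model.predict(X[valid_idx])))
    return float(np.mean(scores))


# Synthetic data with roughly 10% positives, standing in for the TF-IDF matrix.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=20, weights=[0.9, 0.1], random_state=42
)
fold_score = cv_f1_upsampling_inside_folds(X_syn, y_syn, repeat=10)
print(fold_score)
```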


### Changing class weights in the models
We pass class weights to the models according to the class ratio: class 0 of the target gets weight 1, and class 1 gets weight 8.834884437596301.

{0:1, 1:8.834884437596301}
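
The weight dictionary can be derived from the labels instead of being typed in by hand. The sketch below computes it from the class counts and, for comparison, applies the `n_samples / (n_classes * count)` heuristic that scikit-learn documents for `class_weight='balanced'`. Both helper names are hypothetical.

```python
from collections import Counter


def ratio_class_weights(labels):
    """Weight 1 for class 0 and (count of class 0 / count of class 1) for class 1."""
    counts = Counter(labels)
    return {0: 1.0, 1: counts[0] / counts[1]}


def balanced_class_weights(labels):
    """n_samples / (n_classes * class_count) for every class."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Labels rebuilt from the class counts of the full dataset.
labels = [0] * 143346 + [1] * 16225

ratio_w = ratio_class_weights(labels)
balanced_w = balanced_class_weights(labels)
print(ratio_w)
print(balanced_w)
```

Both schemes weight class 1 about 8.83 times more heavily than class 0 and differ only in overall scale. For a regularized model such as LogisticRegression, that scale interacts with `C`, so the two settings need not produce identical scores.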

#### Changing class weights in LogisticRegression

In [66]:
logreg_weight_params = {'class_weight': {0:1, 1:8.834884437596301}, 'solver': 'newton-cg'}

In [67]:
Logreg.set_params(**logreg_weight_params)

LogisticRegression(class_weight={0: 1, 1: 8.834884437596301}, random_state=42,
                   solver='newton-cg')

In [68]:
x = df['lemm_text']
y = df['toxic']

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size = 0.25, stratify = y, random_state = 42)

corpus_x_train = x_train.values.astype('U')
corpus_x_test = x_test.values.astype('U')

count_tf_idf = TfidfVectorizer(stop_words=stopwords)

x_train = count_tf_idf.fit_transform(corpus_x_train)
x_test = count_tf_idf.transform(corpus_x_test)

In [69]:
logreg_f1_train_weighted = cross_val_score(Logreg, x_train, y_train, cv = 3, scoring='f1').mean()
print(f'f1_score of LogisticRegression on the train set: {logreg_f1_train_weighted}')

f1_score of LogisticRegression on the train set: 0.7556673333022562


#### Changing class weights in DecisionTreeClassifier

In [70]:
DecTree_weighted_params = {'class_weight': {0:1, 1:8.834884437596301},
                                'criterion': 'gini',
                                'max_depth': 26}

In [71]:
DecTree.set_params(**DecTree_weighted_params)

DecisionTreeClassifier(class_weight={0: 1, 1: 8.834884437596301}, max_depth=26,
                       random_state=42)

In [72]:
DecTree_f1_train_weighted = cross_val_score(DecTree, x_train, y_train, cv = 3, scoring='f1').mean()
print(f'f1_score of DecisionTreeClassifier on the train set: {DecTree_f1_train_weighted}')

f1_score of DecisionTreeClassifier on the train set: 0.6091857844491904


#### Changing class weights in RandomForestClassifier

In [73]:
RandomForest_weighted_params = {'class_weight': {0: 1, 1: 8.834884437596301},
                               'criterion': 'gini',
                               'max_depth': 26,
                               'n_estimators': 26}

In [74]:
RandomForest.set_params(**RandomForest_weighted_params)

RandomForestClassifier(class_weight={0: 1, 1: 8.834884437596301}, max_depth=26,
                       n_estimators=26, random_state=42)

In [75]:
RandomForest_f1_train_weighted = cross_val_score(RandomForest, x_train, y_train, cv = 3, scoring='f1').mean()
print(f'f1_score of RandomForestClassifier on the train set: {RandomForest_f1_train_weighted}')

f1_score of RandomForestClassifier on the train set: 0.4166828341033015


#### Changing class weights in CatBoostClassifier

In [76]:
catboost_model = CatBoostClassifier(random_state = 42, verbose = 20, class_weights={0: 1, 1: 8.834884437596301})

In [77]:
catboost__f1_train_weighted = cross_val_score(catboost_model, x_train, y_train, scoring='f1', cv = 3).mean()
print(f'f1_score of CatBoostClassifier on the train set: {catboost__f1_train_weighted}')

Learning rate set to 0.066843
0:	learn: 0.6635636	total: 1.3s	remaining: 21m 42s
20:	learn: 0.5137158	total: 26.8s	remaining: 20m 49s
40:	learn: 0.4687054	total: 51.8s	remaining: 20m 11s
60:	learn: 0.4403953	total: 1m 16s	remaining: 19m 43s
80:	learn: 0.4203014	total: 1m 42s	remaining: 19m 20s
100:	learn: 0.4054671	total: 2m 6s	remaining: 18m 47s
120:	learn: 0.3935949	total: 2m 31s	remaining: 18m 21s
140:	learn: 0.3813221	total: 2m 56s	remaining: 17m 56s
160:	learn: 0.3717839	total: 3m 21s	remaining: 17m 30s
180:	learn: 0.3611955	total: 3m 46s	remaining: 17m 4s
200:	learn: 0.3502980	total: 4m 11s	remaining: 16m 38s
220:	learn: 0.3398635	total: 4m 36s	remaining: 16m 13s
240:	learn: 0.3318978	total: 5m	remaining: 15m 47s
260:	learn: 0.3247031	total: 5m 25s	remaining: 15m 22s
280:	learn: 0.3180520	total: 5m 50s	remaining: 14m 56s
300:	learn: 0.3109229	total: 6m 15s	remaining: 14m 31s
320:	learn: 0.3050243	total: 6m 39s	remaining: 14m 5s
340:	learn: 0.2999886	total: 7m 4s	remaining: 13m 40

In [78]:
data2 = [logreg_f1_train_weighted, DecTree_f1_train_weighted, RandomForest_f1_train_weighted, catboost__f1_train_weighted]

In [79]:
df_f1['f1_train_weighted'] = data2
df_f1

Unnamed: 0,f1_train_without_balance,f1_train_upsampled,f1_train_weighted
LogisticRegression,0.749576,0.963295,0.755667
DecisionTreeClassifier,0.659583,0.767837,0.609186
RandomForestClassifier,0.419286,0.846621,0.416683
CatBoostClassifier,0.741834,0.949946,0.758964


Because Jupyter keeps crashing, I saved the DataFrame to the same shared folder: https://1drv.ms/u/s!AuAtVzzlUFeVgc4ptFqqpcE-x_ns-w?e=T5z1ND

In [226]:
df_f1.to_csv(r'C:\Users\thund\OneDrive\Рабочий стол\DS\яндекс\train_f1.csv')

In [138]:
df_f1 = pd.read_csv(r'C:\Users\thund\OneDrive\Рабочий стол\DS\яндекс\train_f1.csv')
df_f1

Unnamed: 0.1,Unnamed: 0,f1_train_without_balance,f1_train_upsampled,f1_train_weighted
0,LogisticRegression,0.749576,0.963295,0.755667
1,DecisionTreeClassifier,0.659583,0.767837,0.609186
2,RandomForestClassifier,0.419286,0.846621,0.416683
3,CatBoostClassifier,0.741834,0.949946,0.758964


### Evaluating the models on the test set
We evaluate test quality only for LogisticRegression and CatBoostClassifier, since these models showed good f1_score values.

##### LogisticRegression on the test set

In [87]:
logreg_params = {'class_weight': 'balanced', 'solver': 'newton-cg'}

In [88]:
Logreg = LogisticRegression()
Logreg.set_params(**logreg_params)

LogisticRegression(class_weight='balanced', solver='newton-cg')

In [89]:
%%time

Logreg.fit(x_train, y_train)

Wall time: 3.65 s


LogisticRegression(class_weight='balanced', solver='newton-cg')

In [90]:
predict_y = Logreg.predict(x_test)

In [91]:
logreg_f1_test = f1_score(y_test, predict_y)
print(f'f1_score of LogisticRegression on the test set: {logreg_f1_test}')

f1_score of LogisticRegression on the test set: 0.7581939032775613


##### LogisticRegression on the test set with changed class weights

In [92]:
logreg_weight_params = {'class_weight': {0:1, 1:8.834884437596301}, 'solver': 'newton-cg'}

In [93]:
Logreg_weighted = LogisticRegression()
Logreg_weighted.set_params(**logreg_weight_params)

LogisticRegression(class_weight={0: 1, 1: 8.834884437596301},
                   solver='newton-cg')

In [94]:
%%time

Logreg_weighted.fit(x_train, y_train)

Wall time: 3.94 s


LogisticRegression(class_weight={0: 1, 1: 8.834884437596301},
                   solver='newton-cg')

In [95]:
predict_y_weighted = Logreg_weighted.predict(x_test)

In [97]:
logreg_f1_test_weighted  = f1_score(y_test, predict_y_weighted)
print(f'f1_score of LogisticRegression on the test set: {logreg_f1_test_weighted}')

f1_score of LogisticRegression on the test set: 0.7632895195998602


##### LogisticRegression on the test set with upsampling

In [98]:
logreg_params = {'class_weight': 'balanced', 'solver': 'newton-cg'}

In [99]:
Logreg_upsampled = LogisticRegression()
Logreg_upsampled.set_params(**logreg_params)

LogisticRegression(class_weight='balanced', solver='newton-cg')

In [100]:
%%time

Logreg_upsampled.fit(x_train_upsampled, y_train_upsampled)

Wall time: 6.24 s


LogisticRegression(class_weight='balanced', solver='newton-cg')

In [101]:
predict_y_upsampled = Logreg_upsampled.predict(x_test)

In [102]:
logreg_f1_test_upsampled = f1_score(y_test, predict_y_upsampled)
print(f'f1_score of LogisticRegression on the test set: {logreg_f1_test_upsampled}')

f1_score of LogisticRegression on the test set: 0.7508833922261484


##### CatBoostClassifier on the test set

In [50]:
catboost_model = CatBoostClassifier(random_state = 42, verbose = 20, )

In [51]:
catboost_model.fit(x_train, y_train)

Learning rate set to 0.079478
0:	learn: 0.6136806	total: 1.72s	remaining: 28m 38s
20:	learn: 0.2373423	total: 38.4s	remaining: 29m 48s
40:	learn: 0.2080187	total: 1m 11s	remaining: 28m 2s
60:	learn: 0.1947347	total: 1m 45s	remaining: 26m 57s
80:	learn: 0.1850258	total: 2m 18s	remaining: 26m 9s
100:	learn: 0.1778031	total: 2m 51s	remaining: 25m 25s
120:	learn: 0.1720127	total: 3m 24s	remaining: 24m 44s
140:	learn: 0.1672912	total: 3m 57s	remaining: 24m 5s
160:	learn: 0.1628970	total: 4m 29s	remaining: 23m 26s
180:	learn: 0.1586568	total: 5m 2s	remaining: 22m 50s
200:	learn: 0.1551989	total: 5m 35s	remaining: 22m 12s
220:	learn: 0.1519379	total: 6m 8s	remaining: 21m 37s
240:	learn: 0.1490600	total: 6m 40s	remaining: 21m 2s
260:	learn: 0.1464873	total: 7m 13s	remaining: 20m 28s
280:	learn: 0.1441954	total: 7m 46s	remaining: 19m 54s
300:	learn: 0.1421130	total: 8m 19s	remaining: 19m 19s
320:	learn: 0.1401196	total: 8m 52s	remaining: 18m 45s
340:	learn: 0.1383469	total: 9m 24s	remaining: 18

<catboost.core.CatBoostClassifier at 0x152dd8b9548>

In [81]:
cat_predict_y = catboost_model.predict(x_test)

In [82]:
catboost_f1_test = f1_score(y_test, cat_predict_y )
print(f'f1_score of CatBoostClassifier on the test set: {catboost_f1_test}')

f1_score of CatBoostClassifier on the test set: 0.7318310699895538


##### CatBoostClassifier on the test set with changed class weights

In [57]:
catboost_model_weight = CatBoostClassifier(random_state = 42, verbose = 20, class_weights={0: 1, 1: 8.834884437596301})

In [58]:
catboost_model_weight.fit(x_train, y_train)

Learning rate set to 0.079478
0:	learn: 0.6576543	total: 1.73s	remaining: 28m 47s
20:	learn: 0.4960730	total: 37.2s	remaining: 28m 54s
40:	learn: 0.4508399	total: 1m 11s	remaining: 27m 59s
60:	learn: 0.4242410	total: 1m 47s	remaining: 27m 29s
80:	learn: 0.4044771	total: 2m 22s	remaining: 26m 58s
100:	learn: 0.3907624	total: 2m 58s	remaining: 26m 24s
120:	learn: 0.3769295	total: 3m 32s	remaining: 25m 45s
140:	learn: 0.3645842	total: 4m 10s	remaining: 25m 25s
160:	learn: 0.3533815	total: 4m 45s	remaining: 24m 48s
180:	learn: 0.3422867	total: 5m 24s	remaining: 24m 30s
200:	learn: 0.3329509	total: 5m 58s	remaining: 23m 45s
220:	learn: 0.3246282	total: 6m 32s	remaining: 23m 3s
240:	learn: 0.3173770	total: 7m 5s	remaining: 22m 21s
260:	learn: 0.3104736	total: 7m 39s	remaining: 21m 41s
280:	learn: 0.3036453	total: 8m 13s	remaining: 21m 1s
300:	learn: 0.2980372	total: 8m 46s	remaining: 20m 23s
320:	learn: 0.2931408	total: 9m 19s	remaining: 19m 43s
340:	learn: 0.2878740	total: 9m 52s	remaining:

<catboost.core.CatBoostClassifier at 0x152dbef01c8>

In [59]:
cat_predict_y_weighted = catboost_model_weight.predict(x_test)

In [60]:
catboost_f1_test_weighted = f1_score(y_test, cat_predict_y_weighted )
print(f'f1_score of CatBoostClassifier on the test set: {catboost_f1_test_weighted}')

f1_score of CatBoostClassifier on the test set: 0.7663226255363563


##### CatBoostClassifier on the test set with upsampling

In [108]:
catboost_model_upsampled = CatBoostClassifier(random_state = 42, verbose = 20, )

In [109]:
catboost_model_upsampled.fit(x_train_upsampled, y_train_upsampled)

Learning rate set to 0.104893
0:	learn: 0.6445284	total: 2.07s	remaining: 34m 25s
20:	learn: 0.4727473	total: 40.8s	remaining: 31m 43s
40:	learn: 0.4254919	total: 1m 20s	remaining: 31m 13s
60:	learn: 0.3953586	total: 1m 57s	remaining: 30m 6s
80:	learn: 0.3739800	total: 2m 35s	remaining: 29m 21s
100:	learn: 0.3578923	total: 3m 12s	remaining: 28m 35s
120:	learn: 0.3429422	total: 3m 49s	remaining: 27m 50s
140:	learn: 0.3314279	total: 4m 26s	remaining: 27m 5s
160:	learn: 0.3198615	total: 5m 3s	remaining: 26m 23s
180:	learn: 0.3109446	total: 5m 42s	remaining: 25m 48s
200:	learn: 0.3023077	total: 6m 18s	remaining: 25m 3s
220:	learn: 0.2939911	total: 6m 54s	remaining: 24m 20s
240:	learn: 0.2869835	total: 7m 30s	remaining: 23m 39s
260:	learn: 0.2799163	total: 8m 7s	remaining: 22m 59s
280:	learn: 0.2746902	total: 8m 43s	remaining: 22m 19s
300:	learn: 0.2693415	total: 9m 19s	remaining: 21m 39s
320:	learn: 0.2644601	total: 9m 55s	remaining: 21m
340:	learn: 0.2595302	total: 10m 31s	remaining: 20m 

<catboost.core.CatBoostClassifier at 0x152e590fd48>

In [110]:
cat_predict_y_upsampled = catboost_model_upsampled.predict(x_test)

In [111]:
catboost_f1_test_upsampled = f1_score(y_test, cat_predict_y_upsampled)
print(f'f1_score of CatBoostClassifier on the test set: {catboost_f1_test_upsampled}')

f1_score of CatBoostClassifier on the test set: 0.7489784649364992


<a id = 'section7'></a>
### Conclusions
We collect a table for the models that showed the best results.

In [112]:
logregression = [logreg_f1_test, logreg_f1_test_weighted, logreg_f1_test_upsampled]
catboostclass = [catboost_f1_test, catboost_f1_test_weighted, catboost_f1_test_upsampled]

In [149]:
test =[logreg_f1_test, catboost_f1_test]
test_upsampled = [logreg_f1_test_upsampled, catboost_f1_test_upsampled]
test_weighted = [logreg_f1_test_weighted, catboost_f1_test_weighted]

In [140]:
df_f1.columns = ['index', 'f1_train_without_balance', 'f1_train_upsampled',
       'f1_train_weighted']

In [141]:
best_models = df_f1.copy().iloc[[0,3]]

Unnamed: 0,index,f1_train_without_balance,f1_train_upsampled,f1_train_weighted
0,LogisticRegression,0.749576,0.963295,0.755667
3,CatBoostClassifier,0.741834,0.949946,0.758964


In [142]:
best_models = best_models.set_index('index')

In [150]:
best_models['f1_test_without_balance'] = test
best_models['f1_test_upsampled'] = test_upsampled
best_models['f1_test_weighted'] = test_weighted

In [151]:
best_models

index,f1_train_without_balance,f1_train_upsampled,f1_train_weighted,f1_test_without_balance,f1_test_upsampled,f1_test_weighted
LogisticRegression,0.749576,0.963295,0.755667,0.758194,0.750883,0.76329
CatBoostClassifier,0.741834,0.949946,0.758964,0.731831,0.748978,0.766323


The following was done in this project:

- The data was prepared for model training.
- Several class-balancing methods were chosen.
- Models were trained and the best ones were selected.
- A summary table of f1_score values was compiled.

After vectorization the data has a very large number of features: there are more TF-IDF columns (160737) than training records (119678). Since TF-IDF turns text into numerical values, the LogisticRegression models performed best. CatBoostClassifier also performs very well when given a long time to train on the data. During testing, this classifier could take up to 5 hours to train.
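
As a small check of the summary table against the project target, the sketch below takes the test F1 values reported above and lists the configurations that reach 0.75. The dictionary layout and variable names are chosen for this illustration only.

```python
F1_TARGET = 0.75

# Test-set F1 values copied from the summary table.
test_f1 = {
    "LogisticRegression": {
        "without_balance": 0.758194, "upsampled": 0.750883, "weighted": 0.76329,
    },
    "CatBoostClassifier": {
        "without_balance": 0.731831, "upsampled": 0.748978, "weighted": 0.766323,
    },
}

# Flatten to (model, strategy) -> score and keep the ones that meet the target.
flat = {
    (model, strategy): score
    for model, scores in test_f1.items()
    for strategy, score in scores.items()
}
passing = {key: score for key, score in flat.items() if score >= F1_TARGET}
best_config = max(flat, key=flat.get)

for key, score in sorted(passing.items(), key=lambda item: -item[1]):
    print(key, score)
print("best:", best_config)
```

All three LogisticRegression configurations meet the target, while CatBoostClassifier does so only with class weights.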
