# Introduction

## Task description
[source](https://www.kaggle.com/c/ml-posterior-gym-training-prediction)

### EN

A network of sports clubs tracks the training sessions that its coaches conduct for clients. To sign up for a training session, a client calls the reception or comes in person. The gym's staff assigns a suitable time, coach, and club. You have a database from several gyms with this data. The trainset contains the log of training sessions for 2017 and 2018, and the testset contains the training sessions for 2019. Each training session has a boolean target flag: whether the session took place.

Your task is to reduce the number of skipped training sessions by identifying the factors that affect skips. To do this, build a model that predicts the probability that the client attends the training session.

Optional question:

* How could we enrich the data to improve the model quality?

## Dataset description

### EN

* Id: index of training session
* ClientID: Client who signed up for a training session
* CoachID: Coach the client signed up with
* GymID: Gym where the training will take place. One client can have trainings in different gyms, and one coach can have trainings in different gyms.
* TrainingID: Training type: strength training, cardio, swimming pool, etc.
* Time: Scheduled time
* Target: Whether the training session took place: Yes (1) or No (0)

## Ideas

The predictions could be improved by adding new features. For example:
* Geographic location
* Weather (can be derived from time and location)
* Client age
* Client sex (should not have a strong effect)
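
As a rough illustration of the weather idea, external data could be joined to the sessions by gym and calendar date. The `weather` table, its `Date` and `Temperature` columns, and all values below are hypothetical; the competition data contains nothing of the kind.

```python
import pandas as pd

# Toy sessions in the same shape as the competition data (values are made up)
sessions = pd.DataFrame({
    'Id': [1, 2, 3],
    'GymID': [0, 0, 1],
    'Time': pd.to_datetime([
        '2017-01-01 08:00:00',
        '2017-01-02 19:00:00',
        '2017-01-01 10:30:00',
    ]),
})

# Hypothetical external table: one weather record per gym per day
weather = pd.DataFrame({
    'GymID': [0, 1],
    'Date': pd.to_datetime(['2017-01-01', '2017-01-01']),
    'Temperature': [-5.0, -3.5],
})

# Join on gym and calendar date; sessions without a weather record get NaN
sessions['Date'] = sessions['Time'].dt.normalize()
enriched = sessions.merge(weather, on=['GymID', 'Date'], how='left')
print(enriched[['Id', 'GymID', 'Date', 'Temperature']])
```

A left join keeps every session, so missing weather shows up as NaN rather than silently dropping rows.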

# Notebook setup

## Initial constants

In [1]:
PROJECT_NAME = "Kaggle. ML Posterior. Gym training prediction"
MOUNT_DIR = '/content/drive' # In case Colab Usage
VALIDATE_RATIO = 0.2

## Additional installs

In [2]:
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/b2/aa/e61819d04ef2bbee778bf4b3a748db1f3ad23512377e43ecfdc3211437a0/catboost-0.23.2-cp36-none-manylinux1_x86_64.whl (64.8MB)
Installing collected packages: catboost
Successfully installed catboost-0.23.2


## Libraries

In [3]:
import os

from datetime import datetime

from collections import Counter

import requests, json 

import numpy as np

import pandas as pd

import catboost as ctb
from catboost import CatBoostClassifier, Pool
from catboost.utils import get_roc_curve

from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import roc_auc_score, roc_curve, auc

from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px

import matplotlib.pyplot as plt

%matplotlib inline

## Handling the Google Colab case

### Importing libraries

In [4]:
try:
    from google.colab import files, drive
    
    USE_COLAB = True
except ImportError:
    USE_COLAB = False

if USE_COLAB:
    print("Don't forget to avoid disconnections:")
    print("""
function ClickConnect(){
    console.log("Clicking"); 
    document.querySelector("colab-connect-button").click() 
}
setInterval(ClickConnect,60000)
    """)

Don't forget to avoid disconnections:

function ClickConnect(){
    console.log("Clicking"); 
    document.querySelector("colab-connect-button").click() 
}
setInterval(ClickConnect,60000)
    


### Connecting to Google Drive

In [5]:
MOUNT_DIR = '/content/drive' # In case Colab Usage
if USE_COLAB:
    drive.mount(MOUNT_DIR)
    DRIVE_DIR = os.path.join(MOUNT_DIR, 'My Drive')
    print(f"Drive directory is {DRIVE_DIR}")

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
Drive directory is /content/drive/My Drive


### Setting up the Kaggle connection

In [6]:
if USE_COLAB:
    !pip install -q kaggle
    !mkdir -p ~/.kaggle
    kaggle_file = os.path.join(DRIVE_DIR, 'kaggle.json')
    !cp "$kaggle_file" ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json

## Declaring the working directory

Google Drive is used when running in Google Colab.

In [7]:
PROJECT_DIR = os.path.join(DRIVE_DIR, 'projects', PROJECT_NAME) if USE_COLAB else './'
WORK_DIR = '/content' if USE_COLAB else PROJECT_DIR
print(f"Project directory is {PROJECT_DIR}")
print(f"Working directory is {WORK_DIR}")

Project directory is /content/drive/My Drive/projects/Kaggle. ML Posterior. Gym training prediction
Working directory is /content


# Data processing

This section must define the following variables:

* **X_train**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day
* **y_train**: pd.Series\
Id | Target
* **X_valid**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day
* **y_valid**: pd.Series\
Id | Target
* **X_test**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day

TODO: Update

## Loading

Load the train and test datasets into a single dataframe.

This step must define the variable:
* initial_dataset: pd.DataFrame

### Declaring paths

In [8]:
src_data_dir_path = os.path.join(PROJECT_DIR, 'data')

src_train_file_path = os.path.join(src_data_dir_path, 'train.csv.zip')
src_test_file_path = os.path.join(src_data_dir_path, 'test.csv.zip')

dist_data_dir = os.path.join(WORK_DIR, 'data')

### Downloading the archive


In [9]:
!kaggle competitions download -c ml-posterior-gym-training-prediction  -p "$src_data_dir_path"

test.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)


### Unpacking

In [10]:
!unzip -o "$src_train_file_path" -d "$dist_data_dir" > /dev/null
!unzip -o "$src_test_file_path" -d "$dist_data_dir" > /dev/null

### Reading

In [56]:
train_file_path = os.path.join(dist_data_dir, 'train.csv')
test_file_path = os.path.join(dist_data_dir, 'test.csv')

train_initial_dataset = pd.read_csv(train_file_path)
test_initial_dataset = pd.read_csv(test_file_path)

initial_dataset = pd.concat([train_initial_dataset, test_initial_dataset]).reset_index(drop=True)

## Dataset preparation
This step must define the following variables:
* **X_train_pure**: pd.DataFrame
* **y_train**: pd.Series
* **X_valid_pure**: pd.DataFrame
* **y_valid**: pd.Series
* **X_test_pure**: pd.DataFrame

Every **X_\*** sample must contain the fields

* Month
* Day
* Hour
* DayOfWeek
* Season

### Adding new features

* Time features \
Month | Day | Hour | DayOfWeek | Season

* Historical features \
NumberOfAppointments | LastVisitType
    - NumberOfAppointments - number of the client's earlier appointments
    - LastVisitType - how long ago the client's previous appointment was
        - 0 - first appointment
        - 1 - less than a day ago
        - 2 - less than a week ago
        - 3 - less than a month (30 days) ago
        - 4 - more than a month ago
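
The time features can be checked on a few timestamps. The sketch below uses the `.dt` accessor and the same `Season` formula as the notebook's `add_time_features`; the sample timestamps are made up. The formula maps December to February to 1, March to May to 2, June to August to 3, and September to November to 4.

```python
import pandas as pd

# Made-up timestamps, one per season plus a December date
times = pd.Series(pd.to_datetime([
    '2017-01-15 08:00:00',
    '2017-04-02 19:30:00',
    '2017-07-20 10:00:00',
    '2017-10-05 07:15:00',
    '2017-12-31 21:00:00',
]))

features = pd.DataFrame({
    'Month': times.dt.month,
    'Day': times.dt.day,
    'Hour': times.dt.hour,
    'DayOfWeek': times.dt.dayofweek,  # Monday is 0, Sunday is 6
})
# December wraps around to 0 before the shift, so it lands in season 1
features['Season'] = (features['Month'] % 12 + 3) // 3
print(features)
```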

In [79]:
def add_time_features(X):
    X['Time'] = pd.to_datetime(X['Time'], format='%Y-%m-%d %H:%M:%S')
    X['Month'] = pd.DatetimeIndex(X['Time']).month
    X['Day'] = pd.DatetimeIndex(X['Time']).day
    X['Hour'] = pd.DatetimeIndex(X['Time']).hour
    X['DayOfWeek'] = pd.DatetimeIndex(X['Time']).dayofweek
    X['Season'] = (X['Month'] % 12 + 3) // 3
    return X

def add_history_features(X):
    # A stable sort keeps the original order of rows with equal timestamps
    X = X.sort_values('Time', kind='mergesort')
    clients = set(X.ClientID)

    X['NumberOfAppointments'] = -1
    X['LastVisitType'] = -1
    for client in clients:
        X_client = X[X.ClientID == client]

        visit_times = X_client.Time.values
        # 4 (more than a month) by default; the first appointment is always 0
        last_visit_type = np.full(len(visit_times), 4, dtype=int)
        last_visit_type[0] = 0
        if len(visit_times) > 1:
            time_since_last_visit = visit_times[1:] - visit_times[:-1]
            last_visit_type[1:][time_since_last_visit < np.timedelta64(30, 'D')] = 3
            last_visit_type[1:][time_since_last_visit < np.timedelta64(7, 'D')] = 2
            last_visit_type[1:][time_since_last_visit < np.timedelta64(1, 'D')] = 1

        # Assign through a single .loc call instead of chained indexing,
        # which triggers SettingWithCopyWarning
        X.loc[X_client.index, 'NumberOfAppointments'] = np.arange(len(X_client))
        X.loc[X_client.index, 'LastVisitType'] = last_visit_type

    # Cast once after the loop rather than on every iteration
    X['NumberOfAppointments'] = X['NumberOfAppointments'].astype(int)
    X['LastVisitType'] = X['LastVisitType'].astype(int)
    return X


def prepare_dataset(X):
    # X = X.copy()
    X = add_time_features(X)
    X = add_history_features(X)
    # X = X.drop('Time', axis=1)
    return X
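
The per-client loop can be slow on a large log. A possible vectorized alternative, sketched under the assumption that `Time` is already a datetime column, uses `groupby` with `cumcount` and `diff`. The name `add_history_features_vectorized` is hypothetical and not part of the notebook; the sample rows repeat the notebook's unit-test data.

```python
import numpy as np
import pandas as pd

def add_history_features_vectorized(X):
    # Stable sort so rows with equal timestamps keep their original order
    X = X.sort_values('Time', kind='mergesort')
    grouped = X.groupby('ClientID')

    # Number of earlier appointments of the same client
    X['NumberOfAppointments'] = grouped.cumcount()

    # Time since the client's previous appointment (NaT for the first one)
    gap = grouped['Time'].diff()
    conditions = [
        gap.isna().to_numpy(),
        (gap < pd.Timedelta(days=1)).to_numpy(),
        (gap < pd.Timedelta(days=7)).to_numpy(),
        (gap < pd.Timedelta(days=30)).to_numpy(),
    ]
    # The first matching condition wins; anything longer falls back to 4
    X['LastVisitType'] = np.select(conditions, [0, 1, 2, 3], default=4)
    return X

visits = pd.DataFrame({
    'Time': pd.to_datetime([
        '2017-01-01 08:00:00', '2017-01-01 09:00:00',
        '2017-01-01 08:00:00', '2017-01-01 09:00:00',
        '2017-01-01 10:00:00', '2017-01-04 10:00:00',
        '2017-01-24 10:00:00', '2018-03-04 10:00:00',
    ]),
    'ClientID': [1, 1, 2, 2, 2, 2, 2, 2],
})
history = add_history_features_vectorized(visits)
print(history)
```

On this sample the result matches the expectations in the notebook's unit test for `add_history_features`.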

In [80]:
dataset = prepare_dataset(initial_dataset)



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



### Splitting into train, valid, test

In [120]:
# Split the dataset into train and test
test_mask = dataset.Target.isna()
train_mask = ~test_mask
train_dataset = dataset[train_mask].set_index('Id')
test_dataset = dataset[test_mask].set_index('Id').drop('Target', axis=1)

# Split train into train_train and train_valid (chronologically)
train_dataset = train_dataset.sort_values(by='Time')
partition = int(len(train_dataset) * (1 - VALIDATE_RATIO))

# Split the datasets into features and target variables
X_train = train_dataset.drop('Target', axis=1)
y_train = train_dataset['Target']
X_train_train = X_train.iloc[:partition]
y_train_train = y_train.iloc[:partition]
X_train_valid = X_train.iloc[partition:]
y_train_valid = y_train.iloc[partition:]
X_test = test_dataset

## Visualization

### Distribution of Target by appointment time

In [121]:
df = train_dataset.loc[:, ['Time', 'Target']]
df.Time =  pd.DatetimeIndex(df.Time)
df.loc[:, "Time"] = df.Time.apply(lambda d: d.replace(year=1967, month=1, day=1))

fig = px.histogram(df, x='Time', color='Target')

fig.update_layout(
    xaxis=dict(type="date", tickformat="%H:%M:%S"),
    width=297. * 3,
    height=210. * 3
)
fig.show()

### Viewing the first rows

In [122]:
train_dataset.head()

| Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID | Target | Month | Day | Hour | DayOfWeek | Season | NumberOfAppointments | LastVisitType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46246 | 2017-01-01 08:00:00 | 63159 | 2 | 71 | 0 | 840 | 1.0 | 1 | 1 | 8 | 6 | 1 | 0 | 0 |
| 43829 | 2017-01-01 08:00:00 | 60091 | 1 | 523 | 0 | 1564 | 1.0 | 1 | 1 | 8 | 6 | 1 | 0 | 0 |
| 24306 | 2017-01-01 08:00:00 | 33441 | 2 | 270 | 3 | 840 | 1.0 | 1 | 1 | 8 | 6 | 1 | 0 | 0 |
| 113828 | 2017-01-01 08:00:00 | 10962 | 2 | 495 | 1 | 847 | 1.0 | 1 | 1 | 8 | 6 | 1 | 0 | 0 |
| 89407 | 2017-01-01 08:15:00 | 114152 | 2 | 440 | 1 | 726 | 1.0 | 1 | 1 | 8 | 6 | 1 | 0 | 0 |


In [123]:
test_dataset.head()

| Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | Hour | DayOfWeek | Season | NumberOfAppointments | LastVisitType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24601 | 2019-01-01 08:00:00 | 91745 | 1 | 540 | 4 | 1185 | 1 | 1 | 8 | 1 | 1 | 16 | 4 |
| 45300 | 2019-01-01 08:00:00 | 4727 | 1 | 220 | 4 | 954 | 1 | 1 | 8 | 1 | 1 | 5 | 2 |
| 1786 | 2019-01-01 08:00:00 | 18232 | 1 | 474 | 1 | 729 | 1 | 1 | 8 | 1 | 1 | 5 | 3 |
| 24602 | 2019-01-01 08:00:00 | 91745 | 1 | 456 | 4 | 1454 | 1 | 1 | 8 | 1 | 1 | 17 | 1 |
| 26554 | 2019-01-01 08:00:00 | 65917 | 1 | 534 | 0 | 711 | 1 | 1 | 8 | 1 | 1 | 35 | 2 |


# Training experiments

This section must define the following variables

* **y_test_predicted**: pd.Series \
Id | Target
* **SUBMISSTION_FILE_NAME**: string\
Name of the file to save the submission to

## Catboost

### Hyperparameters

In [124]:
used_features = [
    'ClientID',
    'ClientType',
    'CoachID',
    'GymID',
    'TrainingID',
    'Month',
    'Day',
    'Hour',
    'DayOfWeek',
    'Season',
    'NumberOfAppointments',
    'LastVisitType'
]
cat_features = [
    'ClientID',
    'ClientType',
    'CoachID',
    'GymID',
    'TrainingID',
    'Month',
    'Day',
    # 'Hour',
    'DayOfWeek',
    'Season',
    'LastVisitType'
]
ignored_features = [
]
params = {
    'iterations': 400,
    'learning_rate': 0.08,
    'eval_metric': 'AUC',
#     'random_seed': 113,
    'logging_level': 'Silent',
    'use_best_model': False,
    'ignored_features': ignored_features
}

### Declaring the pool

In [125]:
train_train_pool = Pool(
    X_train_train.loc[:, used_features],
    y_train_train,
    cat_features=cat_features
)
train_valid_pool = Pool(
    X_train_valid.loc[:, used_features],
    y_train_valid,
    cat_features=cat_features
)

### Training with validation

In [126]:
%%time
model = CatBoostClassifier(**params)
model.fit(train_train_pool, eval_set=train_valid_pool)

CPU times: user 8min 14s, sys: 29.7 s, total: 8min 44s
Wall time: 4min 32s


### Visualizing the results

In [127]:
def visualize_learning_results(metrics):
    """
    Plot one graph for each metric

    :param metrics: dict of metrics
    """

    n = len(metrics)
    
    fig = make_subplots(
        rows=n,
        cols=1,
        subplot_titles=list(metrics.keys())
    )

    for i, (metric_name, curves) in enumerate(metrics.items()):
        for dataset_type, curve in curves.items():
            m = len(curve)
            fig.add_trace(
                go.Scatter(
                    x=np.arange(m),
                    y=curve,
                    mode='lines',
                    name=f'{dataset_type} {metric_name}'
                ),
                row=i + 1,
                col=1
            )

    fig.update_layout(
        title_text="Learning results",
        width=297. * 3,
        height=210. * 3 * n
    )
    fig.show()

In [128]:
eval_results = model.get_evals_result()
metrics = dict(
    Logloss=dict(
        validation=eval_results['validation']['Logloss'],
        learn=eval_results['learn']['Logloss']
    ),
    AUC=dict(
        validation=eval_results['validation']['AUC']
    )
)

In [129]:
visualize_learning_results(metrics)
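
One way to choose the number of iterations for the final model is to read it off the validation curve. A minimal sketch, assuming a list shaped like `eval_results['validation']['AUC']` (one value per iteration); the `best_iteration` helper and the toy curve are hypothetical, not values from this run.

```python
import numpy as np

def best_iteration(curve):
    """Return (number of iterations, metric value) at the maximum of a validation curve."""
    curve = np.asarray(curve, dtype=float)
    best_index = int(np.argmax(curve))
    # Iteration indices are 0-based, so the number of trees is index + 1
    return best_index + 1, float(curve[best_index])

# Toy curve that rises and then degrades
toy_auc = [0.70, 0.74, 0.77, 0.78, 0.775, 0.77]
n_iterations, best_auc = best_iteration(toy_auc)
print(n_iterations, best_auc)
```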

### Declaring the final parameters

In [134]:
n_models = 10
SUBMISSTION_FILE_NAME = 'Catboost_5_attemp_10_models_300_iters_0.8_lr.csv'

### Declaring the final pool

In [131]:
train_pool = Pool(
    X_train.loc[:, used_features],
    y_train,
    cat_features=cat_features
)
params.update(
    iterations=300
)

### Training

In [132]:
%%time
catboost_models = []
for i in range(n_models):
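    # Use a different seed per model; with the same seed the ensemble members would coincide
    model = CatBoostClassifier(**{**params, 'random_seed': i})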
    model.fit(train_pool)
    catboost_models.append(model)

CPU times: user 1h 17min 55s, sys: 4min 3s, total: 1h 21min 58s
Wall time: 42min 8s


### Getting predictions

In [133]:
def get_catboost_predictions():
    y_test_predicted = None
    for model in catboost_models:
        predicted_probas = model.predict_proba(X_test.loc[:, used_features])
        y_test_cur_predicted = pd.Series(
            predicted_probas[:, 1],
            index=X_test.index,
            name='Target'
        ).sort_index()
        
        if y_test_predicted is None:
            y_test_predicted = y_test_cur_predicted
        else:
            y_test_predicted += y_test_cur_predicted

    y_test_predicted /= n_models
    return y_test_predicted 

y_test_predicted = get_catboost_predictions()
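
The accumulation loop above is an arithmetic mean of the positive-class probabilities. A minimal sketch of the same averaging with NumPy, using made-up probability vectors in place of the `predict_proba` outputs; `average_predictions` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def average_predictions(per_model_probas, index):
    """Average positive-class probabilities over models and sort by Id."""
    stacked = np.vstack(per_model_probas)  # shape: (n_models, n_rows)
    return pd.Series(stacked.mean(axis=0), index=index, name='Target').sort_index()

# Two toy "models" scoring three test rows with Ids 30, 10 and 20
probas = [np.array([0.2, 0.9, 0.5]), np.array([0.4, 0.7, 0.5])]
averaged = average_predictions(probas, index=[30, 10, 20])
print(averaged)
```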

## Naive Bayes

### Declaring parameters

In [None]:
N_REPETITIONS = 1  # required feature repetitions

### Modifying the validation sample
Keep only the rows whose ClientID, CoachID, and TrainingID all occur in the training sample


In [None]:
def filter_new_features(X_train, X_test, y_test=None, n_repetitions=1):
    client_counter = Counter(X_train.ClientID.values)
    coach_counter = Counter(X_train.CoachID.values)
    training_counter = Counter(X_train.TrainingID.values)
    X_test_filtered_mask = X_test.apply(
        lambda x:
            client_counter[x.ClientID] >= n_repetitions and
            coach_counter[x.CoachID] >= n_repetitions and
            training_counter[x.TrainingID] >= n_repetitions,
        axis=1
    )
    return X_test[X_test_filtered_mask], y_test[X_test_filtered_mask] if y_test is not None else y_test
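
The row-wise `apply` above can be slow on large samples. A possible vectorized variant, sketched with `value_counts` and `isin`; `filter_seen_categories` is a hypothetical name, and the sketch assumes `n_repetitions >= 1`.

```python
import pandas as pd

def filter_seen_categories(X_train, X_test,
                           columns=('ClientID', 'CoachID', 'TrainingID'),
                           n_repetitions=1):
    """Keep test rows whose values in `columns` occur at least n_repetitions times in train."""
    mask = pd.Series(True, index=X_test.index)
    for column in columns:
        counts = X_train[column].value_counts()
        frequent = counts[counts >= n_repetitions].index
        mask &= X_test[column].isin(frequent)
    return X_test[mask], mask

# Toy data: only the first test row has all three identifiers seen in train
toy_train = pd.DataFrame({
    'ClientID': [1, 1, 2],
    'CoachID': [10, 10, 11],
    'TrainingID': [100, 101, 100],
})
toy_test = pd.DataFrame({
    'ClientID': [1, 3, 2],
    'CoachID': [10, 10, 12],
    'TrainingID': [100, 100, 101],
}, index=[7, 8, 9])
toy_filtered, toy_mask = filter_seen_categories(toy_train, toy_test)
print(toy_filtered)
```

Returning the mask as well lets the caller apply the same selection to a target series, for example `y_test[toy_mask]`.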

In [None]:
X_train_valid_existed, y_train_valid_existed = filter_new_features(
    X_train_train,
    X_train_valid,
    y_train_valid,
    n_repetitions=N_REPETITIONS
)

### Training the model and predicting probabilities on the validation dataset

In [None]:
model = CategoricalNB()

In [None]:
model = model.fit(X_train_train.values, y_train_train.values)

In [None]:
y_train_valid_probas = model.predict_proba(X_train_valid_existed.values)

In [None]:
roc_auc_score(y_train_valid_existed.values, y_train_valid_probas[:,1])

0.7873626517964328
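
The score above can be read as the probability that a randomly chosen attended session receives a higher predicted probability than a randomly chosen skipped one. A small sketch of that pairwise definition on toy values (not the notebook's data); `pairwise_auc` is a hypothetical helper and is quadratic in the sample size, so it is meant only for illustration.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    positives = y_score[y_true == 1]
    negatives = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = positives[:, None] - negatives[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return correct / (len(positives) * len(negatives))

auc_value = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_value)
```

Here three of the four positive-negative pairs are ordered correctly, so the value is 0.75.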

### Visualization

In [None]:
def draw_roc_auc(y_true, y_prob):
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    roc_auc = auc(fpr, tpr)

    fig = go.Figure()


    fig.add_trace(
        go.Scatter(
            x=fpr,
            y=tpr, 
            mode='lines', 
            line=dict(color='darkorange', width=2),
            name='ROC curve (area = %0.2f)' % roc_auc
        )
    )

    fig.add_trace(
        go.Scatter(
            x=[0, 1],
            y=[0, 1], 
            mode='lines', 
            line=dict(color='navy', width=2),
            showlegend=False
        )
    )

    fig.update_layout(
        title_text="ROC curve on validation sample",
        width=297. * 3,
        height=210. * 3
    )
    fig.show()

In [None]:
draw_roc_auc(y_train_valid_existed.values, y_train_valid_probas[:,1])

### Declaring the final training sample

In [None]:
X_train = pd.concat([X_train_train, X_train_valid])
y_train = pd.concat([y_train_train, y_train_valid])

X_test_existed, _ = filter_new_features(
    X_train,
    X_test,
    n_repetitions=N_REPETITIONS
)

### Training

In [None]:
nb_model = CategoricalNB()
nb_model = nb_model.fit(X_train.values, y_train.values)

### Getting predictions

In [None]:
def get_nb_predictions():
    predicted_probas = nb_model.predict_proba(X_test_existed)
    y_test_nb_predicted = pd.Series(
        predicted_probas[:, 1],
        index=X_test_existed.index,
        name='Target'
    )

    y_test_catboost_predicted = get_catboost_predictions()
    absense_mask = ~y_test_catboost_predicted.index.isin(list(y_test_nb_predicted.index))
    
    return pd.concat([y_test_catboost_predicted[absense_mask], y_test_nb_predicted]).sort_index()

y_test_predicted = get_nb_predictions()
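
The concatenation above keeps the naive Bayes prediction wherever one exists and falls back to CatBoost otherwise. The same rule can be sketched with `Series.combine_first`; the values below are made up for illustration.

```python
import pandas as pd

# Toy predictions: naive Bayes covers only the rows that passed the filter
nb_predicted = pd.Series([0.9, 0.2], index=[2, 5], name='Target')
catboost_predicted = pd.Series([0.5, 0.6, 0.7, 0.8], index=[1, 2, 5, 7], name='Target')

# Take the naive Bayes value where present, otherwise the CatBoost value
combined = nb_predicted.combine_first(catboost_predicted)
print(combined)
```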

SUBMISSTION_FILE_NAME = 'NB_with_Catboost_1_attemp_10_n_repetitions_10_models_300_iters_0.8_lr.csv'

# Submitting the results

## Defining the path

In [135]:
submission_folder_path = os.path.join(PROJECT_DIR, 'submissions')
file_path = os.path.join(submission_folder_path, SUBMISSTION_FILE_NAME)
print(f"File will be saved to {file_path}")

File will be saved to /content/drive/My Drive/projects/Kaggle. ML Posterior. Gym training prediction/submissions/Catboost_5_attemp_10_models_300_iters_0.8_lr.csv


## Saving

In [136]:
y_test_predicted.to_csv(file_path)

## Submitting to Kaggle

In [137]:
!kaggle competitions submit -c ml-posterior-gym-training-prediction -f "$file_path" -m "send $SUBMISSTION_FILE_NAME"

100% 1.09M/1.09M [00:00<00:00, 2.45MB/s]
Successfully submitted to ML Posterior. Gym training prediction

# Unit testing


## Importing libraries

In [82]:
import unittest

## Declaring the test class

In [118]:
class TestNotebook(unittest.TestCase):
    def test_add_time_features(self):
        df = pd.DataFrame(
            [
                {'Time' : datetime.strptime('2017-01-01 08:00:00', '%Y-%m-%d %H:%M:%S'), 'ClientID' : 1},
            ]
        )
        df = add_time_features(df)
        self.assertTrue(df.loc[0, 'ClientID'] == 1)
        self.assertTrue(df.loc[0, 'Month'] == 1)
        self.assertTrue(df.loc[0, 'Day'] == 1)
        self.assertTrue(df.loc[0, 'Hour'] == 8)
        self.assertTrue(df.loc[0, 'DayOfWeek'] == 6)
        self.assertTrue(df.loc[0, 'Season'] == 1)

    def test_add_history_features(self):
        df = pd.DataFrame(
            [
                {'Time' : datetime.strptime('2017-01-01 08:00:00', '%Y-%m-%d %H:%M:%S'), 'ClientID' : 1},
                {'Time' : datetime.strptime('2017-01-01 09:00:00', '%Y-%m-%d %H:%M:%S'), 'ClientID' : 1},
                {'Time' : datetime.strptime('2017-01-01 08:00:00', '%Y-%m-%d %H:%M:%S'), 'ClientID' : 2},
                {'Time' : datetime.strptime('2017-01-01 09:00:00', '%Y-%m-%d %H:%M:%S'), 'ClientID' : 2},
                {'Time' : datetime.strptime('2017-01-01 10:00:00', '%Y-%m-%d %H:%M:%S'), 'ClientID' : 2},
                {'Time' : datetime.strptime('2017-01-04 10:00:00', '%Y-%m-%d %H:%M:%S'), 'ClientID' : 2},
                {'Time' : datetime.strptime('2017-01-24 10:00:00', '%Y-%m-%d %H:%M:%S'), 'ClientID' : 2},
                {'Time' : datetime.strptime('2018-03-04 10:00:00', '%Y-%m-%d %H:%M:%S'), 'ClientID' : 2},
            ]
        )
        df = add_history_features(df)
        
        self.assertTrue((df.ClientID.values == np.array([1, 2, 1, 2, 2, 2, 2, 2])).all())
        self.assertTrue((df.NumberOfAppointments.values == np.array([0, 0, 1, 1, 2, 3, 4, 5])).all())
        self.assertTrue((df.LastVisitType.values == np.array([0, 0, 1, 1, 1, 2, 3, 4])).all())

## Running the tests

In [119]:
unittest.main(argv=[''], verbosity=2, exit=False)



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

ok
test_add_time_features (__main__.TestNotebook) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.026s

OK


<unittest.main.TestProgram at 0x7f87bf679b70>

# Conclusions

CatBoost performed best. Combining the naive Bayes classifier with CatBoost turned out to be less effective. My impression is that this is caused by overfitting in the naive Bayes classifier.