# Introduction

## Task description
[source](https://www.kaggle.com/c/ml-posterior-gym-training-prediction)

### EN

A network of sports clubs tracks the training sessions its coaches conduct for clients. To sign up for a session, a client calls the reception or comes in person, and the gym's staff assigns a suitable time, coach, and club. You have a database covering several gyms: the training set contains the log of training sessions for 2017 and 2018, and the test set contains the sessions for 2019. Each training session has a boolean target flag indicating whether the session took place.

Your task is to reduce the number of skipped training sessions by identifying the factors that affect skipping. To do this, build a model that predicts the probability that the client attends the scheduled training session.

Optional question:

* How could we enrich the data to improve the model quality?

## Dataset description

### EN

* Id: index of training session
* ClientID: Client who signed up for a training session
* CoachID: Coach with whom the client signed up
* GymID: Gym where the training will take place. One client can have training sessions in different gyms, and one coach can also hold training sessions in different gyms.
* TrainingID: Training type: strength training, cardio, swimming pool, etc.
* Time: Scheduled time
* Target: Whether the training session took place: Yes (1) or No (0)

# Notebook setup

## Basic constants

In [1]:
PROJECT_NAME = "Kaggle. ML Posterior. Gym training prediction"
MOUNT_DIR = '/content/drive'  # Google Drive mount point, used only when running in Colab
VALIDATE_RATIO = 0.2

## Additional installations

In [2]:
!pip install catboost

Collecting graphviz
  Downloading graphviz-0.14.1-py2.py3-none-any.whl (18 kB)
Collecting plotly
  Downloading plotly-4.9.0-py2.py3-none-any.whl (12.9 MB)
Collecting retrying>=1.3.3
  Downloading retrying-1.3.3.tar.gz (10 kB)
Building wheels for collected packages: retrying
  Building wheel for retrying (setup.py) ... done
  Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11430 sha256=2e9e71c9ad66226b0d2cafe88ba0ac59b8e567e869600c36733171405e940ab3
  Stored in directory: /Users/alex-kozinov/Library/Caches/pip/wheels/ac/cb/8a/b27bf6323e2f4c462dcbf77d70b7c5e7868a7fbe12871770cf
Successfully built retrying
Installing collected packages: graphviz, retrying, plotly
Successfully installed graphviz-0.14.1 plotly-4.9.0 retrying-1.3.3


## Libraries

In [120]:
import os

from datetime import datetime

from collections import Counter

import numpy as np

import pandas as pd

import catboost as ctb
from catboost import CatBoostClassifier, Pool
from catboost.utils import get_roc_curve

from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import roc_auc_score

from plotly.subplots import make_subplots
import plotly.graph_objects as go

import matplotlib.pyplot as plt

%matplotlib inline

## Handling the Google Colab case

### Importing libraries

In [3]:
try:
    from google.colab import files, drive
    
    USE_COLAB = True
except ImportError:
    USE_COLAB = False

if USE_COLAB:
    print("Don't forget to avoid disconnections:")
    print("""
function ClickConnect(){
    console.log("Clicking"); 
    document.querySelector("colab-connect-button").click() 
}
setInterval(ClickConnect,60000)
    """)

### Connecting to Google Drive

In [4]:
if USE_COLAB:
    drive.mount(MOUNT_DIR)
    DRIVE_DIR = os.path.join(MOUNT_DIR, 'My Drive')
    print(f"Drive directory is {DRIVE_DIR}")

### Setting up the Kaggle connection

In [5]:
if USE_COLAB:
    !pip install -q kaggle
    !mkdir -p ~/.kaggle
    kaggle_file = os.path.join(DRIVE_DIR, 'kaggle.json')
    !cp "$kaggle_file" ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json

## Declaring the working directory

The project directory lives on Google Drive when the notebook runs in Google Colab.

In [6]:
PROJECT_DIR = os.path.join(DRIVE_DIR, 'projects', PROJECT_NAME) if USE_COLAB else './'
WORK_DIR = '/content' if USE_COLAB else PROJECT_DIR
print(f"Project directory is {PROJECT_DIR}")
print(f"Working directory is {WORK_DIR}")

Project directory is ./
Working directory is ./


# Data processing

This section should end with the following variables declared:

* **X_train**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day
* **y_train**: pd.Series\
Id | Target
* **X_valid**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day
* **y_valid**: pd.Series\
Id | Target
* **X_test**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day

TODO: Update

## Loading

Load the train/test datasets into DataFrames.

This step should end with two variables declared:
* train_dataset: pd.DataFrame
* test_dataset: pd.DataFrame

### Declaring paths

In [9]:
src_data_dir_path = os.path.join(PROJECT_DIR, 'data')

src_train_file_path = os.path.join(src_data_dir_path, 'train.csv.zip')
src_test_file_path = os.path.join(src_data_dir_path, 'test.csv.zip')

dist_data_dir = os.path.join(WORK_DIR, 'data')

### Downloading the archive


In [11]:
!kaggle competitions download -c ml-posterior-gym-training-prediction -p "$src_data_dir_path"

Downloading ml-posterior-gym-training-prediction.zip to ./data
 91%|██████████████████████████████████▋   | 3.00M/3.28M [00:00<00:00, 3.68MB/s]
100%|██████████████████████████████████████| 3.28M/3.28M [00:00<00:00, 4.72MB/s]


### Unpacking

In [12]:
!unzip -o "$src_train_file_path" -d "$dist_data_dir" > /dev/null
!unzip -o "$src_test_file_path" -d "$dist_data_dir" > /dev/null

### Reading

In [10]:
train_file_path = os.path.join(dist_data_dir, 'train.csv')
test_file_path = os.path.join(dist_data_dir, 'test.csv')

train_dataset = pd.read_csv(train_file_path, index_col='Id')
test_dataset = pd.read_csv(test_file_path, index_col='Id')

## Visualization

### Viewing the first rows

In [11]:
train_dataset.head()

| Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID | Target |
|---|---|---|---|---|---|---|---|
| 0 | 2018-07-02 09:40:00 | 1 | 1 | 780 | 0 | 694 | 0 |
| 1 | 2018-07-20 11:50:00 | 1 | 1 | 567 | 0 | 655 | 0 |
| 2 | 2018-11-01 11:25:00 | 1 | 1 | 622 | 0 | 2523 | 0 |
| 3 | 2017-01-18 13:05:00 | 60 | 1 | 105 | 0 | 719 | 1 |
| 4 | 2017-05-17 07:40:00 | 60 | 1 | 622 | 0 | 2523 | 1 |


In [12]:
test_dataset.head()

| Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID |
|---|---|---|---|---|---|---|
| 0 | 2019-06-21 10:30:00 | 1 | 1 | 463 | 3 | 692 |
| 1 | 2019-05-19 15:25:00 | 1 | 1 | 728 | 6 | 2523 |
| 2 | 2019-05-24 18:20:00 | 1 | 1 | 622 | 0 | 434 |
| 3 | 2019-05-24 18:25:00 | 1 | 1 | 622 | 0 | 2523 |
| 4 | 2019-06-21 08:00:00 | 1 | 1 | 565 | 3 | 2523 |


## Final processing

### Extracting the sample and the target

This step should end with the following variables declared:

* **X_train_train_pure**: pd.DataFrame\
Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID
* **y_train_train**: pd.Series\
Id | Target
* **X_train_valid_pure**: pd.DataFrame\
Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID
* **y_train_valid**: pd.Series\
Id | Target
* **X_test_pure**: pd.DataFrame\
Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID

In [34]:
train_dataset = train_dataset.sort_values(by='Time')

partition = int(len(train_dataset) * (1 - VALIDATE_RATIO))
train_train_dataset = train_dataset.iloc[:partition]
train_valid_dataset = train_dataset.iloc[partition:]

X_train_train_pure = train_train_dataset.drop('Target', axis=1)
y_train_train = train_train_dataset['Target']

X_train_valid_pure = train_valid_dataset.drop('Target', axis=1)
y_train_valid = train_valid_dataset['Target']
X_test_pure = test_dataset
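
The split above is chronological rather than random: rows are ordered by `Time` and the latest `VALIDATE_RATIO` share is held out, which mirrors the way the 2019 test set follows the 2017 and 2018 training data. A minimal sketch of the same idea on toy data follows; the helper name `chronological_split` and the toy rows are hypothetical and are not part of the notebook.

```python
import pandas as pd


def chronological_split(df, validate_ratio, time_col="Time"):
    """Return (train, valid) where valid holds the latest share of rows by time."""
    # Sorting the raw strings is chronological here only because the
    # timestamps are zero-padded 'YYYY-MM-DD HH:MM:SS' values.
    ordered = df.sort_values(by=time_col)
    partition = int(len(ordered) * (1 - validate_ratio))
    return ordered.iloc[:partition], ordered.iloc[partition:]


# Toy rows (hypothetical values, not taken from the competition files).
toy = pd.DataFrame(
    {
        "Time": [
            "2018-03-01 10:00:00",
            "2017-01-05 09:00:00",
            "2018-11-20 18:30:00",
            "2017-06-15 12:00:00",
            "2018-07-02 09:40:00",
        ],
        "Target": [1, 0, 1, 1, 0],
    }
)

toy_train, toy_valid = chronological_split(toy, validate_ratio=0.2)

# No validation row is earlier than any training row.
print(toy_train["Time"].max(), "<=", toy_valid["Time"].min())
```

Holding out the most recent rows gives a validation score that is closer to what the model will face on future data than a shuffled split would.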

### Adding extra features

* **X_train_train**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season
* **y_train_train**: pd.Series\
Id | Target
* **X_train_valid**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season
* **y_train_valid**: pd.Series\
Id | Target
* **X_test**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season

In [35]:
def add_time_features(X):
    datetime_series = pd.to_datetime(X['Time'], format='%Y-%m-%d %H:%M:%S')
    X['Month'] = pd.DatetimeIndex(datetime_series).month
    X['Day'] = pd.DatetimeIndex(datetime_series).day
    X['DayOfWeek'] = pd.DatetimeIndex(datetime_series).dayofweek
    X['Season'] = (X['Month'] % 12 + 3) // 3
    return X

def prepare_sample(X):
    X = X.copy()
    X = add_time_features(X)
    return X.drop('Time', axis=1)

In [36]:
X_train_train = prepare_sample(X_train_train_pure)
X_train_valid = prepare_sample(X_train_valid_pure)
X_test = prepare_sample(X_test_pure)
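
The `Season` expression in `add_time_features` is compact enough to be worth checking by hand. The sketch below applies the same formula to every month. The season labels are an assumption (meteorological seasons in the northern hemisphere), and the helper name `season_from_month` is hypothetical.

```python
def season_from_month(month: int) -> int:
    """Same arithmetic as the notebook's Season feature, for a single month (1..12)."""
    return (month % 12 + 3) // 3


# Assumed labels: the notebook only stores the numeric codes 1..4.
SEASON_NAMES = {1: "winter", 2: "spring", 3: "summer", 4: "autumn"}

month_to_season = {
    month: SEASON_NAMES[season_from_month(month)] for month in range(1, 13)
}

for month, season in month_to_season.items():
    print(f"month {month:2d} -> {season}")
```

The `% 12` step is what moves December next to January and February, so the three winter months share one code.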

# Training experiments

This section should end with the following variables declared:

* **y_test_predicted**: pd.Series \
Id | Target
* **SUBMISSTION_FILE_NAME**: string\
Name of the file to save the predictions to

## Catboost

### Declaring the pool and parameters

In [37]:
cat_features = [
    'ClientID',
    'ClientType',
    'CoachID',
    'GymID',
    'TrainingID',
    'Month',
    'Day',
    'DayOfWeek',
    'Season'
]
ignored_features = [
]
train_train_pool = Pool(
    X_train_train,
    y_train_train,
    cat_features=cat_features
)
train_valid_pool = Pool(
    X_train_valid,
    y_train_valid,
    cat_features=cat_features
)
params = {
    'iterations': 400,
    'learning_rate': 0.08,
    'eval_metric': 'AUC',
#     'random_seed': 113,
    'logging_level': 'Silent',
    'use_best_model': False,
    'ignored_features': ignored_features
}

### Training with validation

In [38]:
%%time
model = CatBoostClassifier(**params)
model.fit(train_train_pool, eval_set=train_valid_pool)

CPU times: user 9min 6s, sys: 2min 7s, total: 11min 13s
Wall time: 2min 6s


<catboost.core.CatBoostClassifier at 0x1a285f4be0>

### Visualizing the results

In [39]:
def visualize_learning_results(metrics):
    """
    Plot one graph for each metric

    :param metrics: dict mapping each metric name to a dict of {dataset type: list of per-iteration values}
    """

    n = len(metrics)
    
    fig = make_subplots(
        rows=n,
        cols=1,
        subplot_titles=list(metrics.keys())
    )

    for i, (metric_name, curves) in enumerate(metrics.items()):
        for dataset_type, curve in curves.items():
            m = len(curve)
            fig.add_trace(
                go.Scatter(
                    x=np.arange(m),
                    y=curve,
                    mode='lines',
                    name=f'{dataset_type} {metric_name}'
                ),
                row=i + 1,
                col=1
            )

    fig.update_layout(
        title_text="Learning results",
        width=297. * 3,
        height=210. * 3 * n
    )
    fig.show()

In [40]:
eval_results = model.get_evals_result()
metrics = dict(
    Logloss=dict(
        validation=eval_results['validation']['Logloss'],
        learn=eval_results['learn']['Logloss']
    ),
    AUC=dict(
        validation=eval_results['validation']['AUC']
    )
)

In [41]:
visualize_learning_results(metrics)
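
The curves plotted above can also be inspected numerically, for example to see at which iteration the validation metric peaks before it flattens or degrades. The sketch below works on made-up curves; `best_iteration` and the toy values are hypothetical, and in the notebook the real lists would come from `eval_results`.

```python
def best_iteration(curve, higher_is_better=True):
    """Return (index, value) of the best point on a per-iteration metric curve."""
    if not curve:
        raise ValueError("curve must contain at least one value")
    pick = max if higher_is_better else min
    # Ties resolve to the earliest iteration, i.e. the smallest model.
    index = pick(range(len(curve)), key=curve.__getitem__)
    return index, curve[index]


# Made-up curves for illustration only.
toy_auc = [0.70, 0.74, 0.77, 0.78, 0.775, 0.77]
toy_logloss = [0.60, 0.52, 0.48, 0.47, 0.475, 0.49]

auc_index, auc_value = best_iteration(toy_auc, higher_is_better=True)
loss_index, loss_value = best_iteration(toy_logloss, higher_is_better=False)

print(f"AUC peaks at iteration {auc_index} with {auc_value}")
print(f"Logloss bottoms out at iteration {loss_index} with {loss_value}")
```

Curve indices start at zero, so an index of `k` corresponds to a model with `k + 1` trees.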

### Declaring the final parameters

In [42]:
cat_features = [
    'ClientID',
    'ClientType',
    'CoachID',
    'GymID',
    'TrainingID',
    'Month',
    'Day',
    'DayOfWeek',
    'Season'
]
ignored_features = [
]
train_pool = Pool(
    pd.concat([X_train_train, X_train_valid]),
    pd.concat([y_train_train, y_train_valid]),
    cat_features=cat_features
)
params = {
    'iterations': 300,
    'learning_rate': 0.08,
    'logging_level': 'Silent',
    'use_best_model': False,
    'ignored_features': ignored_features
}
n_models = 10

### Training

In [43]:
%%time
models = []
for i in range(n_models):
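    # Use a distinct seed per model: identical seeds would yield identical ensemble members.
    model = CatBoostClassifier(**params, random_seed=i)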
    model.fit(train_pool)
    models.append(model)

CPU times: user 1h 12min 12s, sys: 18min 23s, total: 1h 30min 35s
Wall time: 14min 53s


### Getting predictions

In [44]:
y_test_predicted = None
for model in models:
    predicted_probas = model.predict_proba(X_test)
    y_test_cur_predicted = pd.Series(
        predicted_probas[:, 1],
        index=X_test.index,
        name='Target'
    ).sort_index()
    
    if y_test_predicted is None:
        y_test_predicted = y_test_cur_predicted
    else:
        y_test_predicted += y_test_cur_predicted

y_test_predicted /= n_models

SUBMISSTION_FILE_NAME = 'Catboost_3_attemp_10_models_300_iters_0.8_lr.csv'
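
The loop above keeps a running sum of positive-class probabilities and divides it by `n_models`. An equivalent way to express that average is to stack the per-model vectors and take the mean along the model axis. The sketch uses toy arrays; the helper name `average_probabilities` and the values are hypothetical.

```python
import numpy as np


def average_probabilities(per_model_probas):
    """Average positive-class probabilities over models, one vector per model."""
    stacked = np.stack(per_model_probas, axis=0)  # shape: (n_models, n_samples)
    return stacked.mean(axis=0)


# Toy predictions from three hypothetical models for three samples.
toy_probas = [
    np.array([0.9, 0.2, 0.6]),
    np.array([0.7, 0.4, 0.6]),
    np.array([0.8, 0.3, 0.6]),
]

averaged = average_probabilities(toy_probas)
print(averaged)
```

Averaging only helps when the models differ from one another, which is why each ensemble member needs its own source of randomness.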

## Naive Bayes

In [81]:
model = CategoricalNB()

In [111]:
X_train_valid.TrainingID.isin(list(X_train_train.TrainingID.values)).describe()

count     64234
unique        2
top        True
freq      63939
Name: TrainingID, dtype: object

In [124]:
existed_in_train_mask = X_train_valid.ClientID.isin(list(X_train_train.ClientID.values))
existed_in_train_mask &= X_train_valid.CoachID.isin(list(X_train_train.CoachID.values))
existed_in_train_mask &= X_train_valid.TrainingID.isin(list(X_train_train.TrainingID.values))

X_train_valid_existed = X_train_valid[existed_in_train_mask]
y_train_valid_existed = y_train_valid[existed_in_train_mask]
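
The mask above keeps only validation rows whose `ClientID`, `CoachID`, and `TrainingID` values all occur in the training part, presumably because `CategoricalNB` cannot score category codes beyond the range it saw during fitting. A generic version of the same filter is sketched below on toy frames; `seen_in_train_mask` and the toy values are hypothetical.

```python
import pandas as pd


def seen_in_train_mask(train, other, columns):
    """True for rows of `other` whose value in every listed column occurs in `train`."""
    mask = pd.Series(True, index=other.index)
    for column in columns:
        mask &= other[column].isin(train[column].unique())
    return mask


# Toy frames (hypothetical values).
toy_train = pd.DataFrame({"ClientID": [1, 1, 60], "CoachID": [780, 567, 105]})
toy_valid = pd.DataFrame({"ClientID": [1, 60, 99], "CoachID": [780, 999, 105]})

mask = seen_in_train_mask(toy_train, toy_valid, ["ClientID", "CoachID"])
print(toy_valid[mask])
```

Only the first toy row survives: the second has an unseen coach and the third an unseen client.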

In [125]:
model = model.fit(X_train_train.values, y_train_train.values)

In [126]:
y_train_valid_probas = model.predict_proba(X_train_valid_existed.values)

In [129]:
y_train_valid_probas

array([[0.00107237, 0.99892763],
       [0.00139774, 0.99860226],
       [0.00981249, 0.99018751],
       ...,
       [0.13281903, 0.86718097],
       [0.00404909, 0.99595091],
       [0.00374611, 0.99625389]])

In [130]:
roc_auc_score(y_train_valid_existed.values, y_train_valid_probas[:,1])

0.7764905616335358
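
Note that this score is computed on the filtered subset `X_train_valid_existed`, not on the full validation part. As a reminder of what the number means, ROC AUC equals the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The brute-force sketch below illustrates that definition on toy data; `pairwise_auc` is a hypothetical helper and is far slower than `roc_auc_score` on real data.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of correctly ordered (positive, negative) pairs."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # a tie counts as half a correctly ordered pair
    return wins / (len(positives) * len(negatives))


# Toy labels and scores: three of the four pairs are ordered correctly.
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

Because the metric depends only on the ordering of scores, the extreme probabilities shown in the array above do not by themselves affect it.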

# Submitting results

## Defining the path

In [46]:
submission_folder_path = os.path.join(PROJECT_DIR, 'submissions')
file_path = os.path.join(submission_folder_path, SUBMISSTION_FILE_NAME)
print(f"File will be saved to {file_path}")

File will be saved to ./submissions/Catboost_3_attemp_10_models_300_iters_0.8_lr.csv


## Saving

In [47]:
os.makedirs(submission_folder_path, exist_ok=True)  # make sure the target folder exists
y_test_predicted.to_csv(file_path)

## Submitting to Kaggle

In [48]:
!kaggle competitions submit -c ml-posterior-gym-training-prediction -f "$file_path" -m "send $SUBMISSTION_FILE_NAME"

100%|███████████████████████████████████████| 1.09M/1.09M [00:06<00:00, 185kB/s]
Successfully submitted to ML Posterior. Gym training prediction

# Unit testing


## Importing libraries

In [None]:
import unittest

## Declaring the test class

In [None]:
class TestNotebook(unittest.TestCase):
    def test_add(self):
        self.assertEqual(2 + 2, 4)

## Running the tests

In [None]:
unittest.main(argv=[''], verbosity=2, exit=False)

test_add (__main__.TestNotebook) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.002s

OK


<unittest.main.TestProgram at 0x7f929c006dd8>
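
The test class above is a placeholder. One possible next step is to test the date logic behind the time features. The sketch below uses a standard-library stand-in rather than the notebook's pandas code: `time_features` is a hypothetical helper that parses one timestamp in the same `%Y-%m-%d %H:%M:%S` format and returns the month, day, and day of week (Monday is 0, matching pandas' `dayofweek`).

```python
import unittest
from datetime import datetime


def time_features(timestamp):
    """Month, day of month and day of week for one 'YYYY-MM-DD HH:MM:SS' string."""
    moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {"Month": moment.month, "Day": moment.day, "DayOfWeek": moment.weekday()}


class TestTimeFeatures(unittest.TestCase):
    def test_known_monday(self):
        # 2018-07-02 was a Monday, so DayOfWeek is 0.
        self.assertEqual(
            time_features("2018-07-02 09:40:00"),
            {"Month": 7, "Day": 2, "DayOfWeek": 0},
        )

    def test_rejects_unexpected_format(self):
        with self.assertRaises(ValueError):
            time_features("02.07.2018 09:40")


# Run only this test case, without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTimeFeatures)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Loading the suite explicitly keeps the run limited to this class, which is convenient in a notebook where other test classes may already be defined.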