# Introduction

## Task description
[source](https://www.kaggle.com/c/ml-posterior-gym-training-prediction)

### EN

A network of sports clubs tracks the training sessions that its coaches conduct for clients. To sign up for a session, a client calls the reception or comes in person, and the gym's staff assigns a suitable time, coach and club. You have a database covering several gyms: the train set contains a log of training sessions for 2017 and 2018, and the test set contains sessions for 2019. Each training session has a boolean target flag: whether the session took place.

Your task is to reduce the number of skipped training sessions by identifying the factors that affect skips. To do this, build a model that predicts the probability that the client attends the scheduled session.

Optional question:

* How could we enrich the data to improve the model quality?

## Dataset description

### EN

* Id: index of training session
* ClientID: Client who signed up for a training session
* CoachID: Coach the client signed up with
* GymID: Gym where the training will take place. One client can train in different gyms, and one coach can hold trainings in different gyms.
* TrainingID: Training type: strength training, cardio, swimming pool, etc.
* Time: Scheduled time
* Target: Whether the training session took place: Yes (1) or No (0)
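
One way to start looking for factors that affect skips, as the task asks, is to compare attendance rates across the values of a single column. The sketch below uses a tiny made-up table with some of the columns listed above; the rows and the `attendance_rate_by` helper are illustrative assumptions, not part of the competition data.

```python
import pandas as pd

# Made-up rows that follow the column layout described above.
toy = pd.DataFrame({
    "ClientID": [1, 1, 60, 60, 60, 7],
    "GymID": [0, 0, 0, 3, 3, 3],
    "Target": [0, 0, 1, 1, 0, 1],
})

def attendance_rate_by(df, column):
    """Share of sessions that took place (Target == 1) per value of `column`."""
    return df.groupby(column)["Target"].mean()

rates = attendance_rate_by(toy, "GymID")
print(rates)
```

Large differences between groups would only hint at a factor worth modelling; they do not show that the column causes the skips.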

# Notebook setup

## Basic constants

In [1]:
PROJECT_NAME = "Kaggle. ML Posterior. Gym training prediction"
MOUNT_DIR = '/content/drive'  # Used only when running in Colab
VALIDATE_RATIO = 0.2

## Additional installs

In [2]:
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/b2/aa/e61819d04ef2bbee778bf4b3a748db1f3ad23512377e43ecfdc3211437a0/catboost-0.23.2-cp36-none-manylinux1_x86_64.whl (64.8MB)
     |████████████████████████████████| 64.8MB 58kB/s
Installing collected packages: catboost
Successfully installed catboost-0.23.2


## Libraries

In [124]:
import os

from datetime import datetime

from collections import Counter

import numpy as np

import pandas as pd

import catboost as ctb
from catboost import CatBoostClassifier, Pool
from catboost.utils import get_roc_curve

from sklearn.naive_bayes import CategoricalNB

from plotly.subplots import make_subplots
import plotly.graph_objects as go

import matplotlib.pyplot as plt

%matplotlib inline

## Handling the Google.Colab case

### Importing libraries

In [5]:
try:
    from google.colab import files, drive
    
    USE_COLAB = True
except ImportError:  # not running in Colab
    USE_COLAB = False

if USE_COLAB:
    print("Don't forget to avoid disconnections:")
    print("""
function ClickConnect(){
    console.log("Clicking"); 
    document.querySelector("colab-connect-button").click() 
}
setInterval(ClickConnect,60000)
    """)

Don't forget to avoid disconnections:

function ClickConnect(){
    console.log("Clicking"); 
    document.querySelector("colab-connect-button").click() 
}
setInterval(ClickConnect,60000)
    


### Connecting to Google.Drive

In [6]:
if USE_COLAB:
    drive.mount(MOUNT_DIR)
    DRIVE_DIR = os.path.join(MOUNT_DIR, 'My Drive')
    print(f"Drive directory is {DRIVE_DIR}")

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
Drive directory is /content/drive/My Drive


### Setting up the Kaggle connection

In [7]:
if USE_COLAB:
    !pip install -q kaggle
    !mkdir -p ~/.kaggle
    kaggle_file = os.path.join(DRIVE_DIR, 'kaggle.json')
    !cp "$kaggle_file" ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json

## Declaring the working directory

Google.Drive is used for the project directory when running in Google.Colab.

In [8]:
PROJECT_DIR = os.path.join(DRIVE_DIR, 'projects', PROJECT_NAME) if USE_COLAB else './'
WORK_DIR = '/content' if USE_COLAB else PROJECT_DIR
print(f"Project directory is {PROJECT_DIR}")
print(f"Working directory is {WORK_DIR}")

Project directory is /content/drive/My Drive/projects/Kaggle. ML Posterior. Gym training prediction
Working directory is /content


# Data processing

On output, the following variables must be declared:

* **X_train**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season
* **y_train**: pd.Series\
Id | Target
* **X_valid**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season
* **y_valid**: pd.Series\
Id | Target
* **X_test**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season

TODO: Update

## Loading

Load the train/test datasets into dataframes.

On output, two variables must be declared:
* train_dataset: pd.DataFrame
* test_dataset: pd.DataFrame

### Downloading the archive


In [9]:
src_data_dir_path = os.path.join(PROJECT_DIR, 'data')
!kaggle competitions download -c ml-posterior-gym-training-prediction  -p "$src_data_dir_path"

sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
test.csv.zip: Skipping, found more recently modified local copy (use --force to force download)


### Unpacking

In [11]:
src_train_file_path = os.path.join(src_data_dir_path, 'train.csv.zip')
src_test_file_path = os.path.join(src_data_dir_path, 'test.csv.zip')
dist_data_dir = os.path.join(WORK_DIR, 'data')
!unzip -o "$src_train_file_path" -d "$dist_data_dir" > /dev/null
!unzip -o "$src_test_file_path" -d "$dist_data_dir" > /dev/null

### Reading

In [12]:
train_file_path = os.path.join(dist_data_dir, 'train.csv')
test_file_path = os.path.join(dist_data_dir, 'test.csv')

train_dataset = pd.read_csv(train_file_path, index_col='Id')
test_dataset = pd.read_csv(test_file_path, index_col='Id')
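
The `Time` column is read as plain strings here and parsed later in the notebook. As an alternative sketch, `pd.read_csv` can parse it on load through its `parse_dates` parameter. The inline CSV below is a small sample written to mirror the layout of train.csv, and `toy_train` is a hypothetical name.

```python
import io
import pandas as pd

# Small inline sample in the same column layout as train.csv.
csv_text = """Id,Time,ClientID,ClientType,CoachID,GymID,TrainingID,Target
0,2018-07-02 09:40:00,1,1,780,0,694,0
3,2017-01-18 13:05:00,60,1,105,0,719,1
"""

# parse_dates converts the listed columns to datetime64 while reading.
toy_train = pd.read_csv(io.StringIO(csv_text), index_col="Id", parse_dates=["Time"])
print(toy_train.dtypes["Time"])
```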

## Visualization

### Viewing the first rows

In [13]:
train_dataset.head()

| Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID | Target |
|---|---|---|---|---|---|---|---|
| 0 | 2018-07-02 09:40:00 | 1 | 1 | 780 | 0 | 694 | 0 |
| 1 | 2018-07-20 11:50:00 | 1 | 1 | 567 | 0 | 655 | 0 |
| 2 | 2018-11-01 11:25:00 | 1 | 1 | 622 | 0 | 2523 | 0 |
| 3 | 2017-01-18 13:05:00 | 60 | 1 | 105 | 0 | 719 | 1 |
| 4 | 2017-05-17 07:40:00 | 60 | 1 | 622 | 0 | 2523 | 1 |


In [14]:
test_dataset.head()

| Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID |
|---|---|---|---|---|---|---|
| 0 | 2019-06-21 10:30:00 | 1 | 1 | 463 | 3 | 692 |
| 1 | 2019-05-19 15:25:00 | 1 | 1 | 728 | 6 | 2523 |
| 2 | 2019-05-24 18:20:00 | 1 | 1 | 622 | 0 | 434 |
| 3 | 2019-05-24 18:25:00 | 1 | 1 | 622 | 0 | 2523 |
| 4 | 2019-06-21 08:00:00 | 1 | 1 | 565 | 3 | 2523 |


## Final processing

### Extracting the sample and the target

On output, the following variables must be declared:

* **X_train_pure**: pd.DataFrame\
Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID
* **y_train**: pd.Series\
Id | Target
* **X_valid_pure**: pd.DataFrame\
Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID
* **y_valid**: pd.Series\
Id | Target
* **X_test_pure**: pd.DataFrame\
Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID

In [15]:
train_dataset = train_dataset.sort_values(by='Time')

partition = int(len(train_dataset) * (1 - VALIDATE_RATIO))
train_train_dataset = train_dataset.iloc[:partition]
train_validation_dataset = train_dataset.iloc[partition:]

X_train_pure = train_train_dataset.drop('Target', axis=1)
y_train = train_train_dataset['Target']
X_valid_pure = train_validation_dataset.drop('Target', axis=1)
y_valid = train_validation_dataset['Target']
X_test_pure = test_dataset
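
The split above is chronological: rows are sorted by `Time` and the last `VALIDATE_RATIO` share becomes the validation set, so validation sessions never precede training sessions. A minimal sketch of the same idea as a reusable function, run on made-up data (the name `chronological_split` is hypothetical):

```python
import pandas as pd

def chronological_split(df, time_column, validate_ratio):
    """Sort by time and hold out the latest `validate_ratio` share of rows."""
    # ISO-formatted timestamps sort correctly even when stored as strings.
    ordered = df.sort_values(by=time_column)
    partition = int(len(ordered) * (1 - validate_ratio))
    return ordered.iloc[:partition], ordered.iloc[partition:]

toy = pd.DataFrame({
    "Time": ["2018-07-02 09:40:00", "2017-01-18 13:05:00", "2018-11-01 11:25:00",
             "2017-05-17 07:40:00", "2018-07-20 11:50:00"],
    "Target": [0, 1, 0, 1, 0],
})
train_part, valid_part = chronological_split(toy, "Time", 0.2)
print(len(train_part), len(valid_part))
```

Holding out the most recent rows mimics the competition setup, where the test set lies after the train set in time.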

### Adding extra features

* **X_train**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season
* **y_train**: pd.Series\
Id | Target
* **X_valid**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season
* **y_valid**: pd.Series\
Id | Target
* **X_test**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season

In [16]:
def add_time_features(X):
    # Parse the timestamp once and derive calendar features from it (modifies X in place)
    datetime_series = pd.to_datetime(X['Time'], format='%Y-%m-%d %H:%M:%S')
    X['Month'] = datetime_series.dt.month
    X['Day'] = datetime_series.dt.day
    X['DayOfWeek'] = datetime_series.dt.dayofweek  # Monday=0, Sunday=6
    X['Season'] = (X['Month'] % 12 + 3) // 3  # 1: Dec-Feb, 2: Mar-May, 3: Jun-Aug, 4: Sep-Nov
    return X

def prepare_sample(X):
    X = X.copy()
    X = add_time_features(X)
    return X.drop('Time', axis=1)

In [110]:
X_train = prepare_sample(X_train_pure)
X_valid = prepare_sample(X_valid_pure)
X_test = prepare_sample(X_test_pure)
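
The `Season` expression maps months to four groups numbered from 1, with December, January and February grouped together. A quick standalone check of that arithmetic follows; the season names are an interpretation added here, not something the dataset defines, and `season_of` is a hypothetical helper.

```python
def season_of(month):
    """Same arithmetic as the Season feature: 1 for Dec-Feb, ..., 4 for Sep-Nov."""
    return (month % 12 + 3) // 3

# Season names are an assumed reading of the codes, for illustration only.
names = {1: "winter", 2: "spring", 3: "summer", 4: "autumn"}
for month in range(1, 13):
    print(month, season_of(month), names[season_of(month)])
```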

# Training experiments

On output, the following variables must be declared:

* **y_test_predicted**: pd.Series \
Id | Target
* **SUBMISSTION_FILE_NAME**: string\
Name of the file to save the submission to

## Catboost

### Declaring the pools and parameters

In [19]:
cat_features = [
    'ClientID',
    'ClientType',
    'CoachID',
    'GymID',
    'TrainingID',
    'Month',
    'Day',
    'DayOfWeek',
    'Season'
]
ignored_features = [
]
train_train_pool = Pool(
    X_train,
    y_train,
    cat_features=cat_features
)
valid_pool = Pool(
    X_valid,
    y_valid,
    cat_features=cat_features
)
params = {
    'iterations': 400,
    'learning_rate': 0.08,
    'eval_metric': 'AUC',
    'random_seed': 113,
    'logging_level': 'Silent',
    'use_best_model': False,
    'ignored_features': ignored_features
}
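
`eval_metric` is set to `AUC`, which can be read as the probability that a randomly chosen attended session receives a higher score than a randomly chosen skipped one. A minimal pure-Python sketch of that pairwise definition follows. It is not CatBoost's implementation, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

The double loop is quadratic in the number of rows, so it only suits small illustrations like this one.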

### Training with validation

In [20]:
%%time
model = CatBoostClassifier(**params)
model.fit(train_train_pool, eval_set=valid_pool)

CPU times: user 8min, sys: 26.4 s, total: 8min 26s
Wall time: 4min 19s


### Visualizing the results

In [58]:
def visualize_learning_results(metrics):
    """
    Plot one graph for each metric

    :param metrics: dict mapping metric name to {dataset type: curve}
    """

    n = len(metrics)
    
    fig = make_subplots(
        rows=n,
        cols=1,
        subplot_titles=list(metrics.keys())
    )

    for i, (metric_name, curves) in enumerate(metrics.items()):
        for dataset_type, curve in curves.items():
            m = len(curve)
            fig.add_trace(
                go.Scatter(
                    x=np.arange(m),
                    y=curve,
                    mode='lines',
                    name=f'{dataset_type} {metric_name}'
                ),
                row=i + 1,
                col=1
            )

    fig.update_layout(
        title_text="Learning results",
        width=297. * 3,
        height=210. * 3 * n
    )
    fig.show()

In [59]:
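# eval_results is not defined elsewhere in this notebook; it is assumed here to be
# the per-iteration metric history of the model trained above.
eval_results = model.get_evals_result()
metrics = dict(
    Logloss=dict(
        validation=eval_results['validation']['Logloss'],
        learn=eval_results['learn']['Logloss']
    ),
    AUC=dict(
        validation=eval_results['validation']['AUC']
    )
)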

In [60]:
visualize_learning_results(metrics)

### Declaring the final parameters

In [126]:
cat_features = [
    'ClientID',
    'ClientType',
    'CoachID',
    'GymID',
    'TrainingID',
    'Month',
    'Day',
    'DayOfWeek',
    'Season'
]
ignored_features = [
]
train_pool = Pool(
    pd.concat([X_train, X_valid]),
    pd.concat([y_train, y_valid]),
    cat_features=cat_features
)
params = {
    'iterations': 300,
    'learning_rate': 0.08,
    'logging_level': 'Silent',
    'use_best_model': False,
    'ignored_features': ignored_features
}
n_models = 5

### Training

In [127]:
%%time
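models = []
for i in range(n_models):
    # Use a different seed per model: with identical parameters and the default
    # seed the models would be expected to coincide, and averaging would change nothing.
    model = CatBoostClassifier(**params, random_seed=i)
    model.fit(train_pool)
    models.append(model)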

CPU times: user 33min 14s, sys: 1min 43s, total: 34min 57s
Wall time: 17min 53s


### Getting predictions

In [128]:
y_test_predicted = None
for model in models:
    predicted_probas = model.predict_proba(X_test)
    y_test_cur_predicted = pd.Series(
        predicted_probas[:, 1],
        index=X_test.index,
        name='Target'
    ).sort_index()
    
    if y_test_predicted is None:
        y_test_predicted = y_test_cur_predicted
    else:
        y_test_predicted += y_test_cur_predicted

y_test_predicted /= n_models

SUBMISSTION_FILE_NAME = 'Catboost_2_attemp_5_models_300_iters_0.8_lr.csv'
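
The loop above averages the positive-class probabilities of the models. The same averaging can be sketched with NumPy on made-up arrays shaped like the output of `predict_proba` (rows are sessions, columns are classes 0 and 1); `fake_probas` and `averaged` are hypothetical names.

```python
import numpy as np

# Made-up outputs of three models for two sessions; each row sums to 1.
fake_probas = [
    np.array([[0.8, 0.2], [0.3, 0.7]]),
    np.array([[0.6, 0.4], [0.1, 0.9]]),
    np.array([[0.7, 0.3], [0.2, 0.8]]),
]

# Keep the probability of class 1 from each model and average across models.
averaged = np.mean([p[:, 1] for p in fake_probas], axis=0)
print(averaged)
```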

# Submitting the results

## Defining the path

In [131]:
submission_folder_path = os.path.join(PROJECT_DIR, 'submissions')
file_path = os.path.join(submission_folder_path, SUBMISSTION_FILE_NAME)
print(f"File will be saved to {file_path}")

File will be saved to /content/drive/My Drive/projects/Kaggle. ML Posterior. Gym training prediction/submissions/Catboost_2_attemp_5_models_300_iters_0.8_lr.csv


## Saving

In [132]:
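os.makedirs(submission_folder_path, exist_ok=True)  # create the folder on first run
y_test_predicted.to_csv(file_path, header=True)  # write the Id,Target header explicitly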
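
A quick way to check the shape of the file being written, sketched on a made-up Series with the same index name and Series name as `y_test_predicted`. The header it produces is `Id,Target`; whether that matches sample_submission.csv is an assumption, since that file is not shown in this notebook.

```python
import io
import pandas as pd

# Made-up predictions with the same index name and Series name as y_test_predicted.
toy_predictions = pd.Series(
    [0.25, 0.75],
    index=pd.Index([0, 1], name="Id"),
    name="Target",
)

# Write to an in-memory buffer instead of a file to inspect the result.
buffer = io.StringIO()
toy_predictions.to_csv(buffer, header=True)  # header=True keeps the column names
lines = buffer.getvalue().splitlines()
print(lines)
```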

## Submitting to Kaggle

In [133]:
!kaggle competitions submit -c ml-posterior-gym-training-prediction -f "$file_path" -m "send $SUBMISSTION_FILE_NAME"

100% 1.09M/1.09M [00:08<00:00, 134kB/s]
Successfully submitted to ML Posterior. Gym training prediction

# Unit testing

## Importing libraries

In [None]:
import unittest

## Declaring the test class

In [None]:
class TestNotebook(unittest.TestCase):
    def test_add(self):
        self.assertEqual(2 + 2, 4)

## Running the tests

In [None]:
unittest.main(argv=[''], verbosity=2, exit=False)

test_add (__main__.TestNotebook) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.002s

OK


<unittest.main.TestProgram at 0x7f929c006dd8>
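
The placeholder test above could be replaced with checks on the notebook's own code. Below is a sketch of such a test for the time features. It redefines a compact equivalent of `add_time_features` so that the block runs on its own, and the class name `TestTimeFeatures` is hypothetical.

```python
import unittest
import pandas as pd

def add_time_features(X):
    # Compact equivalent of the notebook function, repeated so this block is self-contained.
    times = pd.to_datetime(X["Time"], format="%Y-%m-%d %H:%M:%S")
    X["Month"] = times.dt.month
    X["Day"] = times.dt.day
    X["DayOfWeek"] = times.dt.dayofweek
    X["Season"] = (X["Month"] % 12 + 3) // 3
    return X

class TestTimeFeatures(unittest.TestCase):
    def test_known_date(self):
        X = add_time_features(pd.DataFrame({"Time": ["2018-07-02 09:40:00"]}))
        row = X.iloc[0]
        self.assertEqual(row["Month"], 7)
        self.assertEqual(row["Day"], 2)
        self.assertEqual(row["DayOfWeek"], 0)  # 2 July 2018 was a Monday
        self.assertEqual(row["Season"], 3)

# Run only this test case, without touching other test classes in the session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTimeFeatures)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```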