# Introduction

## Task description
[source](https://www.kaggle.com/c/ml-posterior-gym-training-prediction)

### EN

A network of sports clubs tracks the training sessions its coaches conduct for clients. To sign up for a session, a client calls the reception desk or comes in person, and the gym's staff assigns a suitable time, coach, and club. You have a database covering several gyms. It contains a log of training sessions for 2017 and 2018 in the train set and sessions for 2019 in the test set. Each training session has a boolean target flag indicating whether the session took place.

Your task is to reduce the number of skipped training sessions by identifying the factors that lead to a skip. To do this, build a model that predicts the probability that a client attends a scheduled training session.

Optional question:

* How could we enrich the data to improve the model quality?

## Dataset description

### EN

* Id: index of training session
* ClientID: Client who signed up for a training session
* CoachID: Coach with whom the client signed up
* GymID: Gym where the training will take place. One client can have training sessions in different gyms, and so can one coach.
* TrainingID: Training type: strength training, cardio, swimming pool, etc.
* Time: Scheduled time
* Target: Whether the training session took place: Yes (1) or No (0)

## Ideas

In [None]:
The predictions could be improved by adding new features. For example:
* Geographic location
* Weather (can be derived from the time and location)
* Client age
* Client gender (should not have a strong effect)

# Notebook setup

## Initial constants

In [1]:
PROJECT_NAME = "Kaggle. ML Posterior. Gym training prediction"
MOUNT_DIR = '/content/drive'  # Used only when running in Colab
VALIDATE_RATIO = 0.2

## Additional installations

In [2]:
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/b2/aa/e61819d04ef2bbee778bf4b3a748db1f3ad23512377e43ecfdc3211437a0/catboost-0.23.2-cp36-none-manylinux1_x86_64.whl (64.8MB)
     |████████████████████████████████| 64.8MB 56kB/s
Installing collected packages: catboost
Successfully installed catboost-0.23.2


## Libraries

In [63]:
import os

from datetime import datetime

from collections import Counter

import numpy as np

import pandas as pd

import catboost as ctb
from catboost import CatBoostClassifier, Pool
from catboost.utils import get_roc_curve

from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import roc_auc_score, roc_curve, auc

from plotly.subplots import make_subplots
import plotly.graph_objects as go

import matplotlib.pyplot as plt

%matplotlib inline

## Handling the Google Colab case

### Importing libraries

In [4]:
try:
    from google.colab import files, drive
    
    USE_COLAB = True
except ImportError:
    USE_COLAB = False

if USE_COLAB:
    print("Don't forget to avoid disconnections:")
    print("""
function ClickConnect(){
    console.log("Clicking"); 
    document.querySelector("colab-connect-button").click() 
}
setInterval(ClickConnect,60000)
    """)

Don't forget to avoid disconnections:

function ClickConnect(){
    console.log("Clicking"); 
    document.querySelector("colab-connect-button").click() 
}
setInterval(ClickConnect,60000)
    


### Connecting to Google Drive

In [5]:
if USE_COLAB:
    drive.mount(MOUNT_DIR)
    DRIVE_DIR = os.path.join(MOUNT_DIR, 'My Drive')
    print(f"Drive directory is {DRIVE_DIR}")

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
Drive directory is /content/drive/My Drive


### Setting up the Kaggle connection

In [6]:
if USE_COLAB:
    !pip install -q kaggle
    !mkdir -p ~/.kaggle
    kaggle_file = os.path.join(DRIVE_DIR, 'kaggle.json')
    !cp "$kaggle_file" ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json

## Declaring the working directory

Google Drive is used when running in Google Colab.

In [7]:
PROJECT_DIR = os.path.join(DRIVE_DIR, 'projects', PROJECT_NAME) if USE_COLAB else './'
WORK_DIR = '/content' if USE_COLAB else PROJECT_DIR
print(f"Project directory is {PROJECT_DIR}")
print(f"Working directory is {WORK_DIR}")

Project directory is /content/drive/My Drive/projects/Kaggle. ML Posterior. Gym training prediction
Working directory is /content


# Data processing

By the end of this section, the following variables must be defined:

* **X_train**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day
* **y_train**: pd.Series\
Id | Target
* **X_valid**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day
* **y_valid**: pd.Series\
Id | Target
* **X_test**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day

TODO: Update

## Loading

Load the train/test datasets into DataFrames.

By the end of this step, two variables must be defined:
* train_dataset: pd.DataFrame
* test_dataset: pd.DataFrame

### Declaring paths

In [8]:
src_data_dir_path = os.path.join(PROJECT_DIR, 'data')

src_train_file_path = os.path.join(src_data_dir_path, 'train.csv.zip')
src_test_file_path = os.path.join(src_data_dir_path, 'test.csv.zip')

dist_data_dir = os.path.join(WORK_DIR, 'data')

### Downloading the archive


In [9]:
!kaggle competitions download -c ml-posterior-gym-training-prediction  -p "$src_data_dir_path"

train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
test.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)


### Unpacking

In [10]:
!unzip -o "$src_train_file_path" -d "$dist_data_dir" > /dev/null
!unzip -o "$src_test_file_path" -d "$dist_data_dir" > /dev/null

### Reading

In [11]:
train_file_path = os.path.join(dist_data_dir, 'train.csv')
test_file_path = os.path.join(dist_data_dir, 'test.csv')

train_dataset = pd.read_csv(train_file_path, index_col='Id')
test_dataset = pd.read_csv(test_file_path, index_col='Id')

## Visualization

### Viewing the first rows

In [12]:
train_dataset.head()

Id,Time,ClientID,ClientType,CoachID,GymID,TrainingID,Target
0,2018-07-02 09:40:00,1,1,780,0,694,0
1,2018-07-20 11:50:00,1,1,567,0,655,0
2,2018-11-01 11:25:00,1,1,622,0,2523,0
3,2017-01-18 13:05:00,60,1,105,0,719,1
4,2017-05-17 07:40:00,60,1,622,0,2523,1


In [13]:
test_dataset.head()

Id,Time,ClientID,ClientType,CoachID,GymID,TrainingID
0,2019-06-21 10:30:00,1,1,463,3,692
1,2019-05-19 15:25:00,1,1,728,6,2523
2,2019-05-24 18:20:00,1,1,622,0,434
3,2019-05-24 18:25:00,1,1,622,0,2523
4,2019-06-21 08:00:00,1,1,565,3,2523


## Final processing

### Extracting the samples and the target

By the end of this step, the following variables must be defined:

* **X_train_train_pure**: pd.DataFrame\
Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID
* **y_train_train**: pd.Series\
Id | Target
* **X_train_valid_pure**: pd.DataFrame\
Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID
* **y_train_valid**: pd.Series\
Id | Target
* **X_test_pure**: pd.DataFrame\
Id | Time | ClientID | ClientType | CoachID | GymID | TrainingID

In [14]:
train_dataset = train_dataset.sort_values(by='Time')

partition = int(len(train_dataset) * (1 - VALIDATE_RATIO))
train_train_dataset = train_dataset.iloc[:partition]
train_valid_dataset = train_dataset.iloc[partition:]

X_train_train_pure = train_train_dataset.drop('Target', axis=1)
y_train_train = train_train_dataset['Target']

X_train_valid_pure = train_valid_dataset.drop('Target', axis=1)
y_train_valid = train_valid_dataset['Target']
X_test_pure = test_dataset
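
The split above is chronological rather than random: rows are sorted by `Time`, and the most recent `VALIDATE_RATIO` share is held out, which mirrors the train (2017 and 2018) versus test (2019) setup. A minimal dependency-free sketch of the same idea follows. The helper name `chronological_split` is an illustrative assumption, and the toy rows reuse the `Time` and `Target` values from the `train_dataset.head()` output.

```python
def chronological_split(records, validate_ratio):
    """Sort records by timestamp and hold out the latest share for validation.

    `records` is a list of (time_string, target) tuples; ISO-like
    'YYYY-MM-DD HH:MM:SS' strings sort correctly as plain text.
    """
    ordered = sorted(records, key=lambda record: record[0])
    partition = int(len(ordered) * (1 - validate_ratio))
    return ordered[:partition], ordered[partition:]


# Toy rows, deliberately out of chronological order.
toy_records = [
    ("2018-11-01 11:25:00", 0),
    ("2017-01-18 13:05:00", 1),
    ("2018-07-02 09:40:00", 0),
    ("2017-05-17 07:40:00", 1),
    ("2018-07-20 11:50:00", 0),
]

train_part, valid_part = chronological_split(toy_records, validate_ratio=0.2)
print(len(train_part), len(valid_part))
print(valid_part)
```

Every validation row is later than every training row, so the validation score is measured on the "future" relative to the data the model was fitted on.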

### Adding extra features

* **X_train_train**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season
* **y_train_train**: pd.Series\
Id | Target
* **X_train_valid**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season
* **y_train_valid**: pd.Series\
Id | Target
* **X_test**: pd.DataFrame\
Id | ClientID | ClientType | CoachID | GymID | TrainingID | Month | Day | DayOfWeek | Season

In [15]:
def add_time_features(X):
    # Parse the timestamp once and derive calendar features from it.
    timestamps = pd.to_datetime(X['Time'], format='%Y-%m-%d %H:%M:%S')
    X['Month'] = timestamps.dt.month
    X['Day'] = timestamps.dt.day
    X['DayOfWeek'] = timestamps.dt.dayofweek  # Monday = 0
    # 1 = winter (Dec-Feb), 2 = spring, 3 = summer, 4 = autumn
    X['Season'] = (X['Month'] % 12 + 3) // 3
    return X

def prepare_sample(X):
    X = X.copy()
    X = add_time_features(X)
    return X.drop('Time', axis=1)
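
To make the `Season` formula concrete: `(month % 12 + 3) // 3` maps December, January, and February to 1, March through May to 2, June through August to 3, and September through November to 4. The sketch below reproduces the four derived features for a single timestamp using only the standard library. `time_features` is a hypothetical helper, and `weekday()` follows the same Monday = 0 convention as pandas' `dayofweek`.

```python
from datetime import datetime


def time_features(timestamp):
    """Return Month, Day, DayOfWeek and Season for one 'YYYY-MM-DD HH:MM:SS' string."""
    moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "Month": moment.month,
        "Day": moment.day,
        "DayOfWeek": moment.weekday(),  # Monday = 0 ... Sunday = 6
        "Season": (moment.month % 12 + 3) // 3,  # 1 = Dec-Feb, 2 = Mar-May, ...
    }


features = time_features("2018-07-02 09:40:00")
print(features)

# Season code for every calendar month.
season_by_month = {month: (month % 12 + 3) // 3 for month in range(1, 13)}
print(season_by_month)
```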

In [16]:
X_train_train = prepare_sample(X_train_train_pure)
X_train_valid = prepare_sample(X_train_valid_pure)
X_test = prepare_sample(X_test_pure)

# Training experiments

By the end of this section, the following variables must be defined:

* **y_test_predicted**: pd.Series \
Id | Target
* **SUBMISSTION_FILE_NAME**: string\
Name of the file the submission is saved to

## Catboost

### Pool and parameter declaration

In [76]:
cat_features = [
    'ClientID',
    'ClientType',
    'CoachID',
    'GymID',
    'TrainingID',
    'Month',
    'Day',
    'DayOfWeek',
    'Season'
]
ignored_features = [
]
train_train_pool = Pool(
    X_train_train,
    y_train_train,
    cat_features=cat_features
)
train_valid_pool = Pool(
    X_train_valid,
    y_train_valid,
    cat_features=cat_features
)
params = {
    'iterations': 400,
    'learning_rate': 0.08,
    'eval_metric': 'AUC',
#     'random_seed': 113,
    'logging_level': 'Silent',
    'use_best_model': False,
    'ignored_features': ignored_features
}

### Training with validation

In [77]:
%%time
model = CatBoostClassifier(**params)
model.fit(train_train_pool, eval_set=train_valid_pool)

CPU times: user 6min 47s, sys: 24.4 s, total: 7min 12s
Wall time: 3min 41s


### Visualizing the results

In [78]:
def visualize_learning_results(metrics):
    """
    Plot one graph for each metric

    :param metrics: dict of metrics
    """

    n = len(metrics)
    
    fig = make_subplots(
        rows=n,
        cols=1,
        subplot_titles=list(metrics.keys())
    )

    for i, (metric_name, curves) in enumerate(metrics.items()):
        for dataset_type, curve in curves.items():
            m = len(curve)
            fig.add_trace(
                go.Scatter(
                    x=np.arange(m),
                    y=curve,
                    mode='lines',
                    name=f'{dataset_type} {metric_name}'
                ),
                row=i + 1,
                col=1
            )

    fig.update_layout(
        title_text="Learning results",
        width=297. * 3,
        height=210. * 3 * n
    )
    fig.show()

In [79]:
eval_results = model.get_evals_result()
metrics = dict(
    Logloss=dict(
        validation=eval_results['validation']['Logloss'],
        learn=eval_results['learn']['Logloss']
    ),
    AUC=dict(
        validation=eval_results['validation']['AUC']
    )
)

In [80]:
visualize_learning_results(metrics)
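
Because `use_best_model` is `False`, the fitted model keeps all 400 trees, so the curves are the main guide for choosing an iteration count. One possible way to read them programmatically is to locate the iteration with the highest validation AUC. The sketch below does this on a hand-written dictionary shaped like the output of `get_evals_result()`. The numbers are invented for illustration, and `best_iteration` is a hypothetical helper.

```python
def best_iteration(curve):
    """Return (index, value) of the highest point on a metric curve."""
    index = max(range(len(curve)), key=lambda i: curve[i])
    return index, curve[index]


# Invented values, shaped like model.get_evals_result().
toy_eval_results = {
    "learn": {"Logloss": [0.69, 0.60, 0.55, 0.52, 0.50]},
    "validation": {
        "Logloss": [0.69, 0.62, 0.59, 0.60, 0.61],
        "AUC": [0.70, 0.76, 0.79, 0.78, 0.77],
    },
}

iteration, value = best_iteration(toy_eval_results["validation"]["AUC"])
print(f"Best validation AUC {value:.2f} at iteration {iteration}")
```

In this toy curve the learn loss keeps falling after the validation AUC has peaked, which is the usual sign that further iterations mostly fit noise.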

### Final parameter declaration

In [81]:
cat_features = [
    'ClientID',
    'ClientType',
    'CoachID',
    'GymID',
    'TrainingID',
    'Month',
    'Day',
    'DayOfWeek',
    'Season'
]
ignored_features = [
]
train_pool = Pool(
    pd.concat([X_train_train, X_train_valid]),
    pd.concat([y_train_train, y_train_valid]),
    cat_features=cat_features
)
params = {
    'iterations': 300,
    'learning_rate': 0.08,
    'logging_level': 'Silent',
    'use_best_model': False,
    'ignored_features': ignored_features
}
n_models = 10

### Training

In [83]:
%%time
catboost_models = []
for i in range(n_models):
    model = CatBoostClassifier(**params)
    model.fit(train_pool)
    catboost_models.append(model)

CPU times: user 59min 18s, sys: 3min 15s, total: 1h 2min 33s
Wall time: 31min 59s


### Getting predictions

In [84]:
def get_catboost_predictions():
    y_test_predicted = None
    for model in catboost_models:
        predicted_probas = model.predict_proba(X_test)
        y_test_cur_predicted = pd.Series(
            predicted_probas[:, 1],
            index=X_test.index,
            name='Target'
        ).sort_index()
        
        if y_test_predicted is None:
            y_test_predicted = y_test_cur_predicted
        else:
            y_test_predicted += y_test_cur_predicted

    y_test_predicted /= n_models
    return y_test_predicted 

y_test_predicted = get_catboost_predictions()

SUBMISSTION_FILE_NAME = 'Catboost_3_attemp_10_models_300_iters_0.8_lr.csv'
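
`get_catboost_predictions` averages the positive-class probability over all fitted models. The arithmetic can be sketched without CatBoost: each hypothetical model below is represented only by the probabilities it would return, keyed by test `Id`. The values and the `average_predictions` name are assumptions made for illustration.

```python
def average_predictions(per_model_probas):
    """Average positive-class probabilities across models, per Id."""
    ids = sorted(per_model_probas[0])
    n_models = len(per_model_probas)
    return {
        row_id: sum(probas[row_id] for probas in per_model_probas) / n_models
        for row_id in ids
    }


# Invented outputs of three models for test Ids 0, 1, 2.
toy_probas = [
    {0: 0.80, 1: 0.30, 2: 0.55},
    {0: 0.90, 1: 0.20, 2: 0.50},
    {0: 0.70, 1: 0.40, 2: 0.45},
]

averaged = average_predictions(toy_probas)
print({row_id: round(p, 2) for row_id, p in averaged.items()})
```

Dividing by the length of the model list, rather than by a separately maintained constant, keeps the average correct if the number of models changes.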

## Naive Bayes

### Parameter declaration

In [123]:
N_REPETITIONS = 1  # minimum number of times an ID must occur in the training set

### Modifying the validation set
Keep only the rows whose ClientID, CoachID, and TrainingID all occur in the training set.


In [86]:
def filter_new_features(X_train, X_test, y_test=None, n_repetitions=1):
    client_counter = Counter(X_train.ClientID.values)
    coach_counter = Counter(X_train.CoachID.values)
    training_counter = Counter(X_train.TrainingID.values)
    X_test_filtered_mask = X_test.apply(
        lambda x:
            client_counter[x.ClientID] >= n_repetitions and
            coach_counter[x.CoachID] >= n_repetitions and
            training_counter[x.TrainingID] >= n_repetitions,
        axis=1
    )
    X_test_filtered = X_test[X_test_filtered_mask]
    y_test_filtered = y_test[X_test_filtered_mask] if y_test is not None else None
    return X_test_filtered, y_test_filtered
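
The filter keeps a row only when each of its three identifiers has been seen at least `n_repetitions` times in the training data. A `Counter` returns 0 for unseen keys, which makes the comparison safe for brand-new IDs. The sketch below applies the same rule to plain dictionaries. The rows are invented, and `is_known` is a hypothetical helper.

```python
from collections import Counter

KEYS = ("ClientID", "CoachID", "TrainingID")


def is_known(row, counters, n_repetitions=1):
    """True if every identifier of `row` occurs often enough in training."""
    return all(counters[key][row[key]] >= n_repetitions for key in KEYS)


# Invented training and validation rows.
train_rows = [
    {"ClientID": 1, "CoachID": 780, "TrainingID": 694},
    {"ClientID": 1, "CoachID": 567, "TrainingID": 655},
    {"ClientID": 60, "CoachID": 780, "TrainingID": 694},
]
valid_rows = [
    {"ClientID": 1, "CoachID": 780, "TrainingID": 655},   # all IDs seen
    {"ClientID": 99, "CoachID": 780, "TrainingID": 694},  # new client
    {"ClientID": 60, "CoachID": 567, "TrainingID": 694},  # all IDs seen
]

# One counter per identifier column, built from the training rows only.
counters = {key: Counter(row[key] for row in train_rows) for key in KEYS}

kept = [row for row in valid_rows if is_known(row, counters)]
kept_twice = [row for row in valid_rows if is_known(row, counters, n_repetitions=2)]
print(len(kept), len(kept_twice))
```

Raising `n_repetitions` shrinks the filtered set quickly, because a single rarely seen identifier is enough to exclude a row.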

In [87]:
X_train_valid_existed, y_train_valid_existed = filter_new_features(
    X_train_train,
    X_train_valid,
    y_train_valid,
    n_repetitions=N_REPETITIONS
)

### Training the model and predicting probabilities on the validation dataset

In [89]:
model = CategoricalNB()

In [90]:
model = model.fit(X_train_train.values, y_train_train.values)

In [91]:
y_train_valid_probas = model.predict_proba(X_train_valid_existed.values)

In [92]:
roc_auc_score(y_train_valid_existed.values, y_train_valid_probas[:,1])

0.7873626517964328
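
ROC AUC can be read as the probability that a randomly chosen attended session receives a higher score than a randomly chosen skipped one, with ties counted as one half. The brute-force sketch below computes it from that definition on invented labels and scores. It is meant for intuition only, since `roc_auc_score` is far more efficient; `pairwise_auc` is a hypothetical name.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(positives) * len(negatives))


# Invented labels (1 = session took place) and predicted probabilities.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.6]

print(pairwise_auc(labels, scores))
```

Here 6.5 of the 9 positive-negative pairs are ordered correctly, so the result is about 0.72. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.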

### Visualization

In [93]:
def draw_roc_auc(y_true, y_prob):
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    roc_auc = auc(fpr, tpr)

    fig = go.Figure()


    fig.add_trace(
        go.Scatter(
            x=fpr,
            y=tpr, 
            mode='lines', 
            line=dict(color='darkorange', width=2),
            name='ROC curve (area = %0.2f)' % roc_auc
        )
    )

    fig.add_trace(
        go.Scatter(
            x=[0, 1],
            y=[0, 1], 
            mode='lines', 
            line=dict(color='navy', width=2),
            showlegend=False
        )
    )

    fig.update_layout(
        title_text="ROC curve on validation sample",
        width=297. * 3,
        height=210. * 3
    )
    fig.show()

In [94]:
draw_roc_auc(y_train_valid_existed.values, y_train_valid_probas[:,1])

### Declaring the final training set

In [124]:
X_train = pd.concat([X_train_train, X_train_valid])
y_train = pd.concat([y_train_train, y_train_valid])

X_test_existed, _ = filter_new_features(
    X_train,
    X_test,
    n_repetitions=N_REPETITIONS
)

### Training

In [125]:
nb_model = CategoricalNB()
nb_model = nb_model.fit(X_train.values, y_train.values)

### Getting predictions

In [126]:
def get_nb_predictions():
    # Pass a plain array, matching how the model was fitted.
    predicted_probas = nb_model.predict_proba(X_test_existed.values)
    y_test_nb_predicted = pd.Series(
        predicted_probas[:, 1],
        index=X_test_existed.index,
        name='Target'
    )

    y_test_catboost_predicted = get_catboost_predictions()
    absence_mask = ~y_test_catboost_predicted.index.isin(y_test_nb_predicted.index)

    return pd.concat([y_test_catboost_predicted[absence_mask], y_test_nb_predicted]).sort_index()

y_test_predicted = get_nb_predictions()

SUBMISSTION_FILE_NAME = 'NB_with_Catboost_1_attemp_10_n_repetitions_10_models_300_iters_0.8_lr.csv'
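
The final prediction is a per-row choice between two models: Naive Bayes wherever the row passed `filter_new_features`, and the CatBoost ensemble for every remaining `Id`. A minimal sketch of that merge with plain dictionaries follows. The probabilities are invented, and `combine_predictions` is a hypothetical helper.

```python
def combine_predictions(fallback, preferred):
    """Use `preferred` where it has a value for an Id, else `fallback`."""
    combined = dict(fallback)
    combined.update(preferred)  # preferred values overwrite the fallback
    return dict(sorted(combined.items()))


# Invented probabilities keyed by test Id.
catboost_probas = {0: 0.81, 1: 0.35, 2: 0.62, 3: 0.48}
nb_probas = {1: 0.20, 3: 0.55}  # only rows whose IDs were seen in training

final = combine_predictions(catboost_probas, nb_probas)
print(final)
```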

# Submitting results

## Defining the path

In [127]:
submission_folder_path = os.path.join(PROJECT_DIR, 'submissions')
file_path = os.path.join(submission_folder_path, SUBMISSTION_FILE_NAME)
print(f"File will be saved to {file_path}")

File will be saved to /content/drive/My Drive/projects/Kaggle. ML Posterior. Gym training prediction/submissions/NB_with_Catboost_1_attemp_10_n_repetitions_10_models_300_iters_0.8_lr.csv


## Saving

In [128]:
y_test_predicted.to_csv(file_path, header=True)  # write the "Id,Target" header explicitly

## Submitting to Kaggle

In [129]:
!kaggle competitions submit -c ml-posterior-gym-training-prediction -f "$file_path" -m "send $SUBMISSTION_FILE_NAME"

  0% 0.00/1.09M [00:00<?, ?B/s]100% 1.09M/1.09M [00:00<00:00, 5.32MB/s]
Successfully submitted to ML Posterior. Gym training prediction

# Unit testing


## Importing libraries

In [None]:
import unittest

## Declaring the test class

In [None]:
class TestNotebook(unittest.TestCase):
    def test_add(self):
        self.assertEqual(2 + 2, 4)
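
The placeholder test above only checks arithmetic. As a sketch of what a more useful notebook test might look like, the case below checks the `Season` formula used in `add_time_features`. The formula is re-declared locally as `season_of` (a hypothetical helper) so that the block runs on its own; in the notebook it would be more natural to test `add_time_features` directly.

```python
import unittest


def season_of(month):
    """Season code used in the notebook: 1 = Dec-Feb ... 4 = Sep-Nov."""
    return (month % 12 + 3) // 3


class TestSeason(unittest.TestCase):
    def test_winter_wraps_around_year_end(self):
        self.assertEqual([season_of(m) for m in (12, 1, 2)], [1, 1, 1])

    def test_each_season_has_three_months(self):
        seasons = [season_of(m) for m in range(1, 13)]
        for code in (1, 2, 3, 4):
            self.assertEqual(seasons.count(code), 3)


# Run only this test case, without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSeason)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```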

## Running the tests

In [None]:
unittest.main(argv=[''], verbosity=2, exit=False)

test_add (__main__.TestNotebook) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.002s

OK


<unittest.main.TestProgram at 0x7f929c006dd8>