# Project title: Diabetes Prediction

## Authors: Denys Herasymuk & Yaroslav Morozevych

**Variant**: the remainder of the division = 3

## Contents of This Notebook

Click a section title to jump directly to it.

* [Section 1. Explore Data](#section_1)
* [Section 3. Train and Validate Model](#section_3)


Running this notebook with the Run All button takes approximately 10-15 minutes to produce the output of all cells.

**How to run this notebook**

* Create a new conda environment with Python 3.7.
* Run `jupyter notebook` from a terminal inside the new environment (do not install any packages yet).
* Following [this Stack Overflow question](https://stackoverflow.com/questions/61353951/no-module-named-fbprophet), run these two commands in Jupyter cells:
```
!conda install -c conda-forge fbprophet -y
!pip install --upgrade plotly
```

* In the terminal, run `pip install -r requirements.txt`.
* A useful command for cloning an environment is `conda create --clone py35 --name py35-2`, taken from the conda cheat sheet: https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf

* To install PyTorch, follow https://pytorch.org/get-started/locally/
* On Ubuntu, it was installed with this command:
`pip3 install torch==1.10.0+cpu torchvision==0.11.1+cpu torchaudio==0.10.0+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html`

* To install N-BEATS, run `pip3 install nbeats-pytorch`.
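
After following the steps above, a quick check can confirm that the packages are importable. The snippet below is a minimal sketch using only the standard library. The helper `missing_modules` is hypothetical, and the import names in `required` are assumptions about how the installed packages are exposed (for example, `nbeats_pytorch` for `nbeats-pytorch`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Assumed import names for the packages installed in the steps above.
required = ["fbprophet", "plotly", "torch", "nbeats_pytorch"]
missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, rerun the matching install command in the same environment that runs Jupyter.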

## General Configuration

In [55]:
import os
import sys
import math
import sklearn
import itertools
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels as ss
import matplotlib.pyplot as plt

from pprint import pprint
from copy import deepcopy
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [18]:
%matplotlib inline

# alt.data_transformers.disable_max_rows()
# alt.renderers.enable('html')

plt.style.use('mpl20')
matplotlib.rcParams['figure.dpi'] = 100
matplotlib.rcParams['figure.figsize'] = 15, 5

# import warnings
# warnings.filterwarnings('ignore')

## Python & Library Versions

In [19]:
versions = ( ("matplotlib", matplotlib.__version__),
             ("numpy", np.__version__),
             ("pandas", pd.__version__),
             ("statsmodels", ss.__version__),
             ("seaborn", sns.__version__),
             ("sklearn", sklearn.__version__),
             # ("keras", keras.__version__),
             # ("xgboost", xgboost.__version__),
             )

print(sys.version, "\n")
print("library" + " " * 4 + "version")
print("-" * 18)

for tup1, tup2 in versions:
    print("{:11} {}".format(tup1, tup2))

3.8.12 (default, Oct 12 2021, 13:49:34) 
[GCC 7.5.0] 

library    version
------------------
matplotlib  3.5.1
numpy       1.19.2
pandas      1.3.5
statsmodels 0.13.1
seaborn     0.11.2
sklearn     1.0.1


<a id='section_1'></a>

## Section 1. Explore Data

In [20]:
diabetes_df = pd.read_csv(os.path.join(".", "data", "diabetes.csv"))

In [21]:
diabetes_df.head()

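| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |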


In [22]:
diabetes_df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [23]:
diabetes_df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [24]:
total_n_rows = len(diabetes_df)
print('Total number of occurrences: ', total_n_rows)
# The two values below are proportions of the dataset, not counts.
print('Share of diabetic occurrences: ', round(len(diabetes_df[diabetes_df.Outcome == 1]) / total_n_rows, 2))
print('Share of non-diabetic occurrences: ', round(len(diabetes_df[diabetes_df.Outcome == 0]) / total_n_rows, 2))

Total number of occurrences:  768
Share of diabetic occurrences:  0.35
Share of non-diabetic occurrences:  0.65


<a id='section_3'></a>

## Section 3. Train and Validate Model

### Prepare model

In [25]:
y = diabetes_df['Outcome']
X = diabetes_df.loc[:, ~diabetes_df.columns.isin(['Outcome'])]

In [26]:
SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
samples_per_fold = len(y_test)
n_folds = 3

In [27]:
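# Custom CV iterator for GridSearchCV: fold i validates on a block of
# `samples_per_fold` rows counted from the end of the data and trains
# on all rows that come before that block.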
def folds_iterator(n_folds, samples_per_fold, size):
    for i in range(n_folds):
        yield np.arange(0, size - samples_per_fold * (i + 1)), \
              np.arange(size - samples_per_fold * (i + 1), size - samples_per_fold * i)
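
To see which index splits `folds_iterator` hands to `GridSearchCV`, here is a small sketch with toy sizes (10 samples, 2 per fold, 3 folds). These numbers are illustrative assumptions, not the notebook's actual values. The function body is copied from the cell above.

```python
import numpy as np


def folds_iterator(n_folds, samples_per_fold, size):
    # Same logic as the notebook's iterator: yields (train_idx, validation_idx).
    for i in range(n_folds):
        yield np.arange(0, size - samples_per_fold * (i + 1)), \
              np.arange(size - samples_per_fold * (i + 1), size - samples_per_fold * i)


# Toy sizes chosen only for readability.
folds = [(train.tolist(), val.tolist())
         for train, val in folds_iterator(n_folds=3, samples_per_fold=2, size=10)]

for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

Each fold validates on a block nearer the start of the data than the previous one and trains only on the rows before that block, so later folds train on fewer rows.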


In [28]:
def validate_model(model, x, y, params, n_folds, samples_per_fold):
    grid_search = GridSearchCV(estimator=model,
                               param_grid=params,
                               # Micro-averaged F1; for single-label classification this equals accuracy.
                               scoring={"F1_Score": make_scorer(f1_score, average='micro')},
                               refit="F1_Score",
                               n_jobs=-1, cv=folds_iterator(n_folds, samples_per_fold, x.shape[0]))
    grid_search.fit(x, y)
    best_index = grid_search.best_index_
    return grid_search.best_estimator_, grid_search.cv_results_["mean_test_F1_Score"][best_index], grid_search.best_params_


In [56]:
def validate_ML_models(X, y, show_plots, debug_mode):
    results_df = pd.DataFrame(columns=('Model_Name', 'F1_Score',
                                       'Model_Best_Params'))

    config_models = [
        {
            'model_name': 'RandomForestClassifier',
            'model': RandomForestClassifier(random_state=SEED),
            'params': {"max_depth": [3, 4, 6],
                      "min_samples_split": [6],
                      "n_estimators": [200, 500, 1000],
                      "max_features": [0.6]}
        },
        {
            'model_name': 'DecisionTreeClassifier',
            'model': DecisionTreeClassifier(random_state=SEED),
            'params': {"max_depth": [3, 4, 6],
                      "min_samples_split": [6],
                      "max_features": [0.6]}
        },
    ]

    best_f1_score = -np.inf  # np.Inf was removed in NumPy 2.0
    best_model = None
    best_model_name = 'No model'
    best_params = None
    idx = 0
    for model_config in config_models:
        cur_model, cur_f1_score, cur_params = validate_model(deepcopy(model_config['model']),
                                                             X, y, model_config['params'],
                                                             n_folds, samples_per_fold)
        print('Model name: ', model_config['model_name'])
        print('Best model params: ')
        pprint(cur_params)
        results_df.loc[idx] = [model_config['model_name'],
                               cur_f1_score,
                               cur_params]
        idx += 1

        if cur_f1_score > best_f1_score:
            best_f1_score = cur_f1_score
            best_model_name = model_config['model_name']
            best_model = cur_model
            best_params = cur_params

    # add visualizations here

    return results_df, best_model, best_model_name

In [57]:
ML_results_df, best_model, best_model_name = validate_ML_models(X_train, y_train, show_plots=False, debug_mode=False)

Model name:  RandomForestClassifier
Best model params: 
{'max_depth': 3,
 'max_features': 0.6,
 'min_samples_split': 6,
 'n_estimators': 200}
Model name:  DecisionTreeClassifier
Best model params: 
{'max_depth': 3, 'max_features': 0.6, 'min_samples_split': 6}


In [58]:
ML_results_df

| | Model_Name | F1_Score | Model_Best_Params |
|---|---|---|---|
| 0 | RandomForestClassifier | 0.768398 | {'max_depth': 3, 'max_features': 0.6, 'min_sam... |
| 1 | DecisionTreeClassifier | 0.729437 | {'max_depth': 3, 'max_features': 0.6, 'min_sam... |


### Test set evaluation

In [60]:
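y_hat = best_model.predict(X_test)
# f1_score defaults to average='binary' (F1 of the positive class),
# unlike the micro average used during validation.
best_f1_score = f1_score(y_test, y_hat)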

In [61]:
print('Best model: ', best_model_name)
print('Test score of best model: ', best_f1_score)

Best model:  RandomForestClassifier
Test score of best model:  0.6534653465346534
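
The validation scores in this notebook come from `make_scorer(f1_score, average='micro')`, while the test score comes from `f1_score` with its default `average='binary'`, so the two numbers are not directly comparable. The sketch below uses plain Python on made-up labels (not the diabetes data) to illustrate the difference. The helper functions are hypothetical re-implementations written for illustration, not scikit-learn's code.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 of the positive class only, as reported by average='binary'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    pairs = list(zip(y_true, y_pred))
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        tp += sum(t == label and p == label for t, p in pairs)
        fp += sum(t != label and p == label for t, p in pairs)
        fn += sum(t == label and p != label for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print("binary F1:", binary_f1(y_true, y_pred))
print("micro F1: ", micro_f1(y_true, y_pred))
print("accuracy: ", accuracy(y_true, y_pred))
```

On these labels the micro-averaged F1 matches accuracy, while the binary F1 is lower because it counts only the positive class. Using the same averaging for validation and test would make the two scores comparable.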


### Retrain and save model

In [None]:
# Not executed yet: refit the best model on the full dataset, then pickle it.
# import pickle
#
# best_model.fit(X, y)
#
# models_path = "models"
# os.makedirs(models_path, exist_ok=True)
# with open(os.path.join(models_path, "demand_forecasting_model.pkl"), "wb") as f:
#     pickle.dump(best_model, f)
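
The save step outlined above can be sketched as a pickle round trip. This is a minimal illustration under stated assumptions: `save_model` and `load_model` are hypothetical helpers, the file name `model.pkl` is a placeholder, and a plain dictionary stands in for the fitted estimator so that the snippet runs without scikit-learn.

```python
import os
import pickle
import tempfile


def save_model(model, models_path, file_name):
    """Pickle `model` into models_path/file_name, creating the folder if needed."""
    os.makedirs(models_path, exist_ok=True)
    model_file = os.path.join(models_path, file_name)
    with open(model_file, "wb") as f:
        pickle.dump(model, f)
    return model_file


def load_model(model_file):
    """Load a pickled object back from disk."""
    with open(model_file, "rb") as f:
        return pickle.load(f)


# Placeholder object standing in for the fitted estimator (assumption).
placeholder_model = {"model_name": "RandomForestClassifier", "max_depth": 3}

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_file = save_model(placeholder_model, os.path.join(tmp_dir, "models"), "model.pkl")
    restored_model = load_model(saved_file)

print("Round trip preserved the object:", restored_model == placeholder_model)
```

Opening the file in a `with` block closes it even if pickling fails. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.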