# N-STEP demo

## Quickstart

* Change uname to your database username
* Run all the cells
* Make coffee
* Check the table landed_test.nstep_example in the database

## Overview
This notebook shows how to use the N-step-ahead forecasting tools.
The models are deliberately crude and serve only as an illustration.
Everything is built around model dictionaries.

A model dictionary contains all the information needed for a forecast:

* outcome
* list of features
* estimator object (usually a scikit-learn Pipeline)
* time limits for training and forecasting
* which time steps to forecast for
* which input table to use
* shares of y=1 and y=0 rows to keep when downsampling
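
As a rough illustration of that structure, the sketch below checks a model dictionary before it is passed on. The key names are the ones used in the cells further down; `check_model_dict` and the rules it enforces are hypothetical and not part of nstep.

```python
# Keys used by the model dictionaries in this notebook.
REQUIRED_KEYS = {
    "name", "outcome", "estimator", "features", "steps",
    "share_zeros_keep", "share_ones_keep",
    "train_start", "train_end", "forecast_start", "forecast_end",
    "table",
}


def check_model_dict(model):
    """Hypothetical sanity check for a model dictionary (not part of nstep)."""
    missing = REQUIRED_KEYS - set(model)
    if missing:
        raise KeyError(f"model dictionary is missing keys: {sorted(missing)}")

    # Assumed rule: training must end before forecasting starts.
    ordered = (model["train_start"] <= model["train_end"]
               < model["forecast_start"] <= model["forecast_end"])
    if not ordered:
        raise ValueError("expected train_start <= train_end "
                         "< forecast_start <= forecast_end")

    # Assumed rule: the downsampling shares are fractions in (0, 1].
    for key in ("share_zeros_keep", "share_ones_keep"):
        if not 0 < model[key] <= 1:
            raise ValueError(f"{key} must be in (0, 1]")
    return True


example = {
    "name": "rf",
    "outcome": "ged_dummy_sb",
    "estimator": None,  # placeholder; normally a scikit-learn estimator
    "features": ["ged_dummy_sb", "ged_dummy_ns", "ged_dummy_os"],
    "steps": [1, 12, 24, 36],
    "share_zeros_keep": 0.5,
    "share_ones_keep": 0.5,
    "train_start": 300,
    "train_end": 408,
    "forecast_start": 409,
    "forecast_end": 444,
    "table": {"schema": "launched", "table": "imp_imp_1"},
}
is_valid = check_model_dict(example)
```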

The nstep.forecast_many() function is the key component:
given a list of model dictionaries, it returns a dataframe of forecasted values.


It performs the following steps:

* Reading from the database
* Fitting the model for each step
* Forecasting for each step, both predicted probabilities and predicted (discrete) outcomes
* Interpolating between the steps
* Merging the forecasts of many models into one dataframe
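
The interpolation step can be pictured with a small sketch. It assumes that a model is fitted only at the listed steps and that values for the steps in between are filled in linearly; `interpolate_steps` is a hypothetical helper built on numpy.interp, not the nstep implementation, and the probabilities are made up.

```python
import numpy as np


def interpolate_steps(steps, values, horizon):
    """Linearly interpolate values known at `steps` onto steps 1..horizon."""
    all_steps = np.arange(1, horizon + 1)
    return all_steps, np.interp(all_steps, steps, values)


# Made-up predicted probabilities at the four fitted steps.
steps = [1, 12, 24, 36]
p_at_steps = [0.10, 0.21, 0.09, 0.33]

all_steps, p_li = interpolate_steps(steps, p_at_steps, horizon=36)
```

With only two steps, such as [1, 36], every value in between lies on a single straight line, so adding intermediate steps trades fitting time for a less crude forecast path.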

For each model, five columns go into the database:

* actual_model : the actual value of the outcome
* model : predicted value from the model, usually binary (predict())
* p_model : predicted probability (predict_proba())
* model_li : linear interpolation of predicted value
* p_model_li : linear interpolation of predicted probability
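
A small sketch of that naming pattern, under the assumption that "model" in the list above stands for the 'name' entry of each model dictionary. `forecast_columns` is a hypothetical helper and the exact names nstep produces are not confirmed here.

```python
def forecast_columns(name):
    """Column names for one model, assuming `name` replaces "model" (hypothetical)."""
    return [
        f"actual_{name}",  # actual value of the outcome
        name,              # predicted value (predict())
        f"p_{name}",       # predicted probability (predict_proba())
        f"{name}_li",      # linear interpolation of the predicted value
        f"p_{name}_li",    # linear interpolation of the predicted probability
    ]


# Two models, as in this notebook, would give ten columns in total.
columns = forecast_columns("mlp") + forecast_columns("rf")
```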

The last cell then writes these predictions to the specified output table using 
dbutils.df_to_db().


In [1]:
import os
import sys

import pandas as pd
import numpy as np

sys.path.insert(0, "..")

import views_utils.dbutils as dbutils
import nstep.utils as nstep

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [2]:
# db parameters
uname    = "VIEWSADMIN"
prefix   = "postgresql"
db       = "views"
port     = "5432"
hostname = "VIEWSHOST"
connectstring = dbutils.make_connectstring(prefix, db, uname, hostname, port)

output_schema   = "landed_test"
output_table    = "nstep_example"

# specify as many as you want
table_input = {
    'connectstring' : connectstring,
    'schema'    : 'launched',
    'table'     : 'imp_imp_1',
    'timevar'   : 'month_id',
    'groupvar'  : 'pg_id'
}

table_input_noimp = {
    'connectstring' : connectstring,
    'schema'    : 'preflight',
    'table'     : 'flight_pgm',
    'timevar'   : 'month_id',
    'groupvar'  : 'pg_id'
}
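
For orientation, a connect string of the kind built above might look like the sketch below. It assumes an SQLAlchemy-style URL without a password; what dbutils.make_connectstring() actually returns is not shown in this notebook, so `make_connectstring_sketch` is purely hypothetical.

```python
def make_connectstring_sketch(prefix, db, uname, hostname, port):
    """Hypothetical stand-in: build an SQLAlchemy-style database URL."""
    return f"{prefix}://{uname}@{hostname}:{port}/{db}"


# Same placeholder values as in the cell above.
url = make_connectstring_sketch(
    "postgresql", "views", "VIEWSADMIN", "VIEWSHOST", "5432")
```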

In [3]:
# Remember: all X are lagged by step, so outcomes can be used as features
features_mini = [    
    "ged_dummy_sb",
    "ged_dummy_ns",
    "ged_dummy_os"
    ]
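
Lagging by step is what makes it safe to use an outcome as a feature: when forecasting s months ahead, the model only sees values observed s months earlier. The sketch below shows one way to do this with pandas, assuming a gap-free panel keyed by pg_id and month_id. `lag_features` and the `_lag` suffix are hypothetical, not the nstep implementation.

```python
import pandas as pd


def lag_features(df, features, step, groupvar="pg_id", timevar="month_id"):
    """Add columns holding each feature's value `step` periods earlier, per group."""
    df = df.sort_values([groupvar, timevar])
    lagged = df.groupby(groupvar)[features].shift(step)
    lagged.columns = [f"{col}_lag{step}" for col in features]
    return pd.concat([df, lagged], axis=1)


# Toy panel: two grid cells observed over three months.
panel = pd.DataFrame({
    "pg_id":        [1, 1, 1, 2, 2, 2],
    "month_id":     [300, 301, 302, 300, 301, 302],
    "ged_dummy_sb": [0, 1, 0, 1, 0, 0],
})
lagged = lag_features(panel, ["ged_dummy_sb"], step=1)
```

The first `step` rows of each group have no earlier observation, so their lagged values are missing and would have to be dropped before fitting.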


In [4]:
# MLPC
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp', MLPClassifier())
])

params = {
    'mlp__hidden_layer_sizes' : ((50,)),  # ((50,)) is just (50,): a single candidate value, 50
    'mlp__solver' : ('lbfgs', 'adam'),
    'mlp__alpha' : (1e-3, 1e-7)
}

gscv = GridSearchCV(
    pipeline, 
    params, 
    verbose=1, 
    cv=3, 
    n_jobs=2)

model_mlpc = {
        'name'      : 'mlp',
        'outcome'   : 'ged_dummy_sb',
        'estimator' : gscv,
        'features'  : features_mini,
        'steps'     : [1,36],
        'share_zeros_keep'  : 0.1,
        'share_ones_keep'   : 0.1,
        'train_start'   : 300,
        'train_end'     : 408,
        'forecast_start': 409,
        'forecast_end'  : 444,
        'table' : table_input
        }
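
The share_zeros_keep and share_ones_keep entries control how much of the training data is kept. A minimal sketch of such downsampling is shown below, assuming each class is sampled independently at random; `downsample` and its `seed` argument are hypothetical and not taken from nstep.

```python
import pandas as pd


def downsample(df, outcome, share_zeros_keep, share_ones_keep, seed=0):
    """Keep a random share of the y=0 rows and of the y=1 rows."""
    zeros = df[df[outcome] == 0].sample(frac=share_zeros_keep, random_state=seed)
    ones = df[df[outcome] == 1].sample(frac=share_ones_keep, random_state=seed)
    return pd.concat([zeros, ones]).sort_index()


# Toy training set: 90 zeros and 10 ones.
train = pd.DataFrame({"ged_dummy_sb": [0] * 90 + [1] * 10})
small = downsample(train, "ged_dummy_sb",
                   share_zeros_keep=0.1, share_ones_keep=0.5)
```

Keeping a smaller share of zeros than of ones, as in this toy call, rebalances a rare outcome; equal shares, as in the models in this notebook, only shrink the training set and speed up fitting.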

In [5]:
# RF
rf = RandomForestClassifier()
scaler = StandardScaler()

# Syntax is pipelinecomponent__parameter, notice double underscores
params = {
    'rf__n_estimators' : (1, 2, 3, 4, 5, 6)
}

pipeline = Pipeline([
    ('scaler', scaler),
    ('rf', rf)
])

gscv = GridSearchCV(
    pipeline, 
    params, 
    verbose=1, 
    cv=3, 
    n_jobs=2)

model_rf = { 
        'name'      : 'rf',
        'outcome'   : 'ged_dummy_sb',
        'estimator' : gscv,
        'features'  : features_mini,
        'steps'     : [1, 12, 24, 36],
        'share_zeros_keep'  : 0.5,
        'share_ones_keep'   : 0.5,
        'train_start'   : 300,
        'train_end'     : 408,
        'forecast_start': 409,
        'forecast_end'  : 444,
        'table' : table_input
        }

In [6]:
# Forecast the models
models = [model_mlpc, model_rf]
df_results = nstep.forecast_many(models)

Starting forecast mlp
Getting 6 cols from launched.imp_imp_1
Getting 3 cols from launched.imp_imp_1
Training mlp step 1
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=2)]: Done  12 out of  12 | elapsed:    9.6s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('mlp', MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rat...=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid={'mlp__hidden_layer_sizes': (50,), 'mlp__solver': ('lbfgs', 'adam'), 'mlp__alpha': (0.001, 1e-07)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)




Training mlp step 36
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=2)]: Done  12 out of  12 | elapsed:    7.5s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('mlp', MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rat...=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid={'mlp__hidden_layer_sizes': (50,), 'mlp__solver': ('lbfgs', 'adam'), 'mlp__alpha': (0.001, 1e-07)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)




Finished forecasting mlp
Starting forecast rf
Getting 6 cols from launched.imp_imp_1
Getting 3 cols from launched.imp_imp_1
Training rf step 1
Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=2)]: Done  18 out of  18 | elapsed:    5.0s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
      ...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid={'rf__n_estimators': (1, 2, 3, 4, 5, 6)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)




Training rf step 12
Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=2)]: Done  18 out of  18 | elapsed:    4.4s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
      ...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid={'rf__n_estimators': (1, 2, 3, 4, 5, 6)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
Training rf step 24




Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=2)]: Done  18 out of  18 | elapsed:    3.6s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
      ...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid={'rf__n_estimators': (1, 2, 3, 4, 5, 6)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
Training rf step 36




Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=2)]: Done  18 out of  18 | elapsed:    3.4s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
      ...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid={'rf__n_estimators': (1, 2, 3, 4, 5, 6)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
Finished forecasting rf
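
The "candidates" and "fits" counts in the log follow directly from the parameter grids: the number of candidates is the product of the number of values per parameter, and each candidate is fitted once per cross-validation fold. The sketch below reproduces that arithmetic with itertools; `count_fits` is a hypothetical helper, and the final refit on the full training data is not counted.

```python
from itertools import product


def count_fits(param_grid, cv):
    """Return (number of candidates, number of cross-validation fits)."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * cv


mlp_grid = {
    "mlp__hidden_layer_sizes": [(50,)],
    "mlp__solver": ["lbfgs", "adam"],
    "mlp__alpha": [1e-3, 1e-7],
}
rf_grid = {"rf__n_estimators": [1, 2, 3, 4, 5, 6]}

mlp_counts = count_fits(mlp_grid, cv=3)  # 1 * 2 * 2 candidates, 3 folds each
rf_counts = count_fits(rf_grid, cv=3)    # 6 candidates, 3 folds each
```

Because the grid search is repeated for every step, the total cost grows with the length of the 'steps' list as well as with the grid size.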


In [7]:
# Write forecast to db
dbutils.df_to_db(connectstring, df_results, output_schema, output_table, 
    if_exists="replace", write_index=True)

Pushing 384372 rows to landed_test.nstep_example
[OK]
runtime:  395.04440784454346 rows/second:  972.984283203059
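
For readers without access to views_utils, the write step can be approximated with plain pandas. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL and pandas' DataFrame.to_sql(); `write_forecast` is hypothetical, the data is made up, and SQLite has no schemas, so only the table name is used.

```python
import sqlite3

import pandas as pd


def write_forecast(con, df, table, if_exists="replace", write_index=True):
    """Hypothetical stand-in for dbutils.df_to_db() using DataFrame.to_sql()."""
    df.to_sql(table, con, if_exists=if_exists, index=write_index)
    return len(df)


# Made-up forecast rows indexed by month.
df_demo = pd.DataFrame(
    {"p_rf": [0.12, 0.40], "rf": [0, 1]},
    index=pd.Index([409, 410], name="month_id"),
)

con = sqlite3.connect(":memory:")
n_written = write_forecast(con, df_demo, "nstep_example")
df_back = pd.read_sql("SELECT * FROM nstep_example", con)
```

Writing the index matters here because the time and group identifiers live in it; with if_exists="replace", rerunning the notebook overwrites the previous table instead of appending to it.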
