# Production model example

This uses a simple [Kaggle dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) 
to create a house price predictive model.  

The goal of this notebook is to find good LightGBM parameters to use for production model training.

In [1]:
# data
import hashlib
import pandas as pd
import numpy as np

# parameter tuning
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV

# model
from sklearn.pipeline import Pipeline
from lightgbm import LGBMRegressor
from joblib import dump

# sampling
from scipy.stats import uniform, randint

# custom objects 
from production_demo import (CategoriesTransformer, 
                             CATEGORIES, 
                             NUMERICS, 
                             OUTPUT)

### Data prep

In [2]:
train = pd.read_csv('../data/train.csv')
print(train.shape)
train.head()

(1460, 81)


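| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |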


In [3]:
# hasher 
hct = CategoriesTransformer(CATEGORIES)


# prepare train/test splitting
train.sort_values(by=['YrSold', 'MoSold'], 
                  inplace=True)
tss = TimeSeriesSplit(n_splits=10)


# parameter space
param_distributions = dict(
    LGBM__num_leaves=randint(2, 5000),
    LGBM__max_depth=randint(2, 20),
    LGBM__learning_rate=uniform(0.01, 0.9),
    LGBM__n_estimators=randint(5, 1000),
    LGBM__min_split_gain=uniform(0.0, 0.1),
    LGBM__min_child_weight=uniform(0.0, 0.1),
    LGBM__subsample=uniform(0.1, 0.9),
    LGBM__colsample_bytree=uniform(0.1, 0.9),
    LGBM__reg_alpha=uniform(0.0, 5000.0),
    LGBM__reg_lambda=uniform(0.0, 5000.0),
)
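
A note on reading the parameter space above: `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, not from `[loc, scale]`, and `randint(low, high)` excludes `high`. The sketch below just inspects three of the distributions to show the ranges they actually cover; the variable names are chosen here for illustration.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale]
learning_rate = uniform(0.01, 0.9)
subsample = uniform(0.1, 0.9)
lr_low, lr_high = learning_rate.support()
ss_low, ss_high = subsample.support()
print(f'learning_rate range: [{lr_low:.2f}, {lr_high:.2f}]')  # [0.01, 0.91]
print(f'subsample range:     [{ss_low:.2f}, {ss_high:.2f}]')  # [0.10, 1.00]

# randint(low, high) draws integers from low up to high - 1
max_depth = randint(2, 20)
md_low, md_high = max_depth.support()
print(f'max_depth range:     {int(md_low)}..{int(md_high)}')  # 2..19

# a few reproducible draws, similar to what the random search samples
depth_draws = max_depth.rvs(size=5, random_state=22)
print(depth_draws)
```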

### Features subset

We're subsetting features here based on what we will have **at time of prediction**. Either not all 80+ features from training are going to be available to us at prediction time, or we want to make it easier to fill out the form on our web page that requests a prediction. We will only *require* the features below in order to make a prediction.

In [4]:
print(f'Categories used:\n{CATEGORIES}')
print(f'\nNumerics used:\n{NUMERICS}')
print(f'\n Total features used: {len(CATEGORIES)+len(NUMERICS)}')

Categories used:
['BldgType', 'CentralAir', 'Electrical', 'ExterCond', 'ExterQual', 'Fence', 'FireplaceQu', 'Foundation', 'Functional', 'GarageCond', 'GarageQual', 'GarageType', 'Heating', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LotConfig', 'MasVnrType', 'MSSubClass', 'PavedDrive', 'RoofStyle']

Numerics used:
['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'EnclosedPorch', 'Fireplaces', 'FullBath', 'GarageArea', 'GarageCars', 'GrLivArea', 'HalfBath', 'KitchenAbvGr', 'LotArea', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'PoolArea', 'TotRmsAbvGrd', 'TotalBsmtSF', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd']

 Total features used: 42
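
As a rough illustration of what "required" could mean at prediction time, the sketch below checks a form submission against the feature lists. The shortened lists, the `missing_features` helper, and the example payload are all hypothetical; the real lists come from `production_demo`.

```python
# Hypothetical, shortened stand-ins for the CATEGORIES and NUMERICS lists
CATEGORIES = ['BldgType', 'CentralAir']
NUMERICS = ['GrLivArea', 'LotArea', 'YearBuilt']
REQUIRED = NUMERICS + CATEGORIES


def missing_features(payload, required=REQUIRED):
    """Return the required feature names that are absent from a payload."""
    return [name for name in required if name not in payload]


# Hypothetical form submission in which 'LotArea' was left out
form = {'BldgType': '1Fam', 'CentralAir': 'Y',
        'GrLivArea': 1710, 'YearBuilt': 2003}
missing = missing_features(form)
print(missing)  # ['LotArea']
```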


### Parameter tuning

In [5]:
model = Pipeline([
    ('hash', hct),
    ('LGBM', LGBMRegressor(random_state=22)),
])
rsv = RandomizedSearchCV(estimator=model,
                         param_distributions=param_distributions,
                         n_iter=1000,
                         cv=tss,
                         scoring='neg_root_mean_squared_error')
_ = rsv.fit(train[NUMERICS + CATEGORIES], train[OUTPUT])
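
Because the rows were sorted by sale date, `TimeSeriesSplit` always validates on rows that come after the rows it trained on. The sketch below uses a plain index array as a stand-in for the 1460 sorted training rows, only to show the fold sizes that `n_splits=10` produces.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# stand-in for the 1460 chronologically sorted training rows
rows = np.arange(1460)
tss = TimeSeriesSplit(n_splits=10)
folds = list(tss.split(rows))

for i, (train_idx, test_idx) in enumerate(folds):
    # every validation row comes after every training row of the same fold
    print(f'fold {i}: train rows 0..{train_idx.max()} '
          f'({len(train_idx)}), validation rows '
          f'{test_idx.min()}..{test_idx.max()} ({len(test_idx)})')
```

The training window grows with each fold (140 rows in the first, 1328 in the last) while every validation window holds 132 rows, so the early folds score models that were fit on comparatively little data.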

In [6]:
best_params_dict = rsv.best_params_

print(f'Best params:\n {best_params_dict}')
print(f'\nBest score:\n {rsv.best_score_:.4f}')

# build the final model from the tuned parameters; strip the pipeline's
# 'LGBM__' step prefix so the keys are valid LGBMRegressor arguments
lgbm_params = {key.replace('LGBM__', '', 1): value
               for key, value in best_params_dict.items()}
model = LGBMRegressor(**lgbm_params)

Best params:
 {'LGBM__colsample_bytree': 0.4569053964451767, 'LGBM__learning_rate': 0.060215823699935396, 'LGBM__max_depth': 10, 'LGBM__min_child_weight': 0.06494958890542044, 'LGBM__min_split_gain': 0.05643039388402624, 'LGBM__n_estimators': 354, 'LGBM__num_leaves': 3192, 'LGBM__reg_alpha': 260.8146685873852, 'LGBM__reg_lambda': 101.43819465504578, 'LGBM__subsample': 0.4382540932521588}

Best score:
 -32854.9616


## Train

In [7]:
# note: this fits on the numeric features only, whereas the search above
# also used the transformed CATEGORIES columns
model.fit(train[NUMERICS], train[OUTPUT])

LGBMRegressor(colsample_bytree=0.4569053964451767,
              learning_rate=0.060215823699935396, max_depth=10,
              min_child_weight=0.06494958890542044,
              min_split_gain=0.05643039388402624, n_estimators=354,
              num_leaves=3192, reg_alpha=260.8146685873852,
              reg_lambda=101.43819465504578, subsample=0.4382540932521588)

In [8]:
# save model artifacts
dump(model, '../data/trained_model')

['../data/trained_model']
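
To use the saved artifact elsewhere, `joblib.load` restores the fitted estimator. The sketch below shows the round trip with a small `LinearRegression` written to a temporary directory as a stand-in, so it runs without the notebook's data; the toy arrays and file location are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# toy stand-in for the trained LGBMRegressor
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'trained_model')
    dump(model, path)      # same call as in the cell above
    restored = load(path)  # returns the fitted estimator

# the restored estimator predicts exactly like the original
same_predictions = np.allclose(restored.predict(X), model.predict(X))
print(same_predictions)
```

The loading environment should generally have the same library versions as the one that wrote the file, since joblib artifacts are pickles.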