# Sales Prediction for Time Series Data

This notebook gathers predictions from the three base models and train the stacked model.  For this project, the simple averaging of the model predictions did better than stacking.  

## Part 5: Ensembles

Predictions for date block 32 will be gathered from the models trained previously, which includes linear regression, light GBM, and neural networks.  

In [1]:
import numpy as np
import pandas as pd 
import os
import time 
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import gc
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.externals import joblib
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.wrappers.scikit_learn import KerasRegressor
from keras.callbacks import ModelCheckpoint
from keras.callbacks import EarlyStopping
from keras.models import load_model
import h5py
import lightgbm as lgb

%matplotlib inline 

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


Load the base model files trained on date_block_num from 12 to 31

In [2]:
DATA_FOLDER = '../script/'
lrModel = joblib.load(DATA_FOLDER + 'lr_robust_scale_0201_15.05.pkl') 
lgbmModel = lgb.Booster(model_file='lgb_model_0131_17.30.txt') 
nnModel = load_model(DATA_FOLDER + "nnBestModel.hdf5")

Load the X_test and y_test for obtaining predictions.  

date_block_num = 32.  Used for meta model features

In [4]:
DATA_FOLDER = '../data/'
X_test_robustScaler = pd.read_pickle(DATA_FOLDER + 'X_test_lev_1_RobustScaler') # LR
X_test_standardScaler = pd.read_pickle(DATA_FOLDER + 'X_test_lev_1_standardScaler') # NNet
X_test_robustScalerTree = pd.read_pickle(DATA_FOLDER + 'X_test_lev_1_RobustScalerTree') #light GBM
y_test = joblib.load(DATA_FOLDER + 'y_val_lev_1.pkl')
X_test_robustScaler_lr = X_test_robustScaler.drop(['shop_id','item_id', 'item_category_id'], axis = 1)

date_block_num = 33.  Used for meta model validation

In [5]:
DATA_FOLDER = '../data/'
X_test_2_RobustScaler = pd.read_pickle(DATA_FOLDER + 'X_test_lev_2_RobustScaler') # LR
X_test_2_standardScaler = pd.read_pickle(DATA_FOLDER + 'X_test_lev_2_standardScaler') # NNet
X_test_2_robustScalerTree = pd.read_pickle(DATA_FOLDER + 'X_test_lev_2_RobustScalerTree') #light GBM
y_test_2 = joblib.load(DATA_FOLDER + 'y_val_lev_2.pkl')
X_test_2_robustScaler_lr = X_test_2_RobustScaler.drop(['shop_id','item_id', 'item_category_id'], axis = 1)

### Stacking

#### Gather predictions from base models. 
These are the meta features

In [7]:
lrPred = lrModel.predict(X_test_robustScaler_lr.values).clip(0,20)
lgbmPred = lgbmModel.predict(X_test_robustScalerTree.values).clip(0,20)
nnPred = nnModel.predict(X_test_standardScaler.values).clip(0,20)

In [9]:
lrPred_val = lrModel.predict(X_test_2_robustScaler_lr.values).clip(0,20)
lgbmPred_val = lgbmModel.predict(X_test_2_robustScalerTree.values).clip(0,20)
nnPred_val = nnModel.predict(X_test_2_standardScaler.values).clip(0,20)

In [10]:
preds = pd.DataFrame(columns = ['lr', 'lgbm', 'nnet'])
preds['lr'] = lrPred
preds['lgbm'] = lgbmPred
preds['nnet'] = nnPred
preds['mean'] = pd.DataFrame.mean(preds, axis = 1)

In [11]:
preds_val = pd.DataFrame(columns = ['lr', 'lgbm', 'nnet'])
preds_val['lr'] = lrPred_val
preds_val['lgbm'] = lgbmPred_val
preds_val['nnet'] = nnPred_val
preds_val['mean'] = pd.DataFrame.mean(preds_val, axis = 1)

In [12]:
X_train_meta = preds.iloc[:,0:4]
y_train_meta = y_test
X_test_meta = preds_val.iloc[:, 0:4]
y_test_meta = y_test_2

#### Baseline:  mean prediction from the base models

The base models were trained with data before date_block_num 32.  Below we calculate the MSE for block_num 32.  The meta model's training MSE should be better than these MSE.  

In [13]:
lrMSE = mean_squared_error(y_test, lrPred)
lgbmMSE = mean_squared_error(y_test, lgbmPred)
nnMSE = mean_squared_error(y_test, nnPred)
meanMSE = mean_squared_error(y_test, preds['mean'])
print('lr:', lrMSE, 'light gbm: ', lgbmMSE, 'neural netowrks:', nnMSE)
print('mean MSE of all models: ', meanMSE)

lr: 1.3378885 light gbm:  1.5158774749749127 neural netowrks: 1.2493004
mean MSE of all models:  1.255182221247377


Date_block_num 33 is used for validation of the meta model.  Below we calculate the MSE using the base models for this block. The meta model's testing performance should be better than these numbers.   

In [14]:
lrMSE = mean_squared_error(y_test_2, lrPred_val)
lgbmMSE = mean_squared_error(y_test_2, lgbmPred_val)
nnMSE = mean_squared_error(y_test_2, nnPred_val)
meanMSE = mean_squared_error(y_test_2, preds_val['mean'])
print('lr:', lrMSE, 'light gbm: ', lgbmMSE, 'neural netowrks:', nnMSE)
print('mean MSE of all models: ',meanMSE)

lr: 0.7841173 light gbm:  1.0244859643995765 neural netowrks: 0.9388006
mean MSE of all models:  0.7606538814281627


#### Linear Regression

In [15]:
model = linear_model.LinearRegression(n_jobs = 4)
model.fit(X_train_meta, y_train_meta)
lr_meta_test_pred = model.predict(X_test_meta.values)
lr_meta_train_pred = model.predict(X_train_meta.values)
lr_meta_test_MSE = mean_squared_error(y_test_meta, lr_meta_test_pred.clip(0, 20))
lr_meta_train_MSE = mean_squared_error(y_train_meta, lr_meta_train_pred.clip(0, 20))
print('Linear Regression Meta Model Train (block 32) MSE: ',lr_meta_train_MSE)
print('Linear Regression Meta Model Test (block 33) MSE: ',lr_meta_test_MSE)

Linear Regression Meta Model Train (block 32) MSE:  1.0863823291453925
Linear Regression Meta Model Test (block 33) MSE:  0.9168139375772854


In [16]:
st = datetime.datetime.fromtimestamp(time.time()).strftime('%m%d_%H.%M')
joblib.dump(model, 'lr_meta_' + st + '.pkl')

['lr_meta_0201_21.09.pkl']

#### Light GBM

In [17]:
def lgb_model(params):
    lgb_model = lgb.train(params, train_data, valid_sets=[train_data, test_data], verbose_eval=25)
    return lgb_model

In [28]:
params = {
    'application':'regression',
    'learning_rate':0.001,
    'early_stopping_round':10,
    'metric':'l2_root', #RMSE
    'nthread':-1, 
    'train_metric': True,
    'num_boost_round': 1000,
    'max_depth:': 2
}

In [29]:
train_data = lgb.Dataset(X_train_meta, label= y_train_meta)
test_data = lgb.Dataset(X_test_meta, label = y_test_meta)

In [30]:
model = lgb_model(params) 



Training until validation scores don't improve for 10 rounds.
[25]	training's rmse: 1.45944	valid_1's rmse: 1.12902
[50]	training's rmse: 1.43986	valid_1's rmse: 1.11237
[75]	training's rmse: 1.42098	valid_1's rmse: 1.09665
[100]	training's rmse: 1.40278	valid_1's rmse: 1.08182
[125]	training's rmse: 1.38525	valid_1's rmse: 1.06789
[150]	training's rmse: 1.36835	valid_1's rmse: 1.05483
[175]	training's rmse: 1.35208	valid_1's rmse: 1.04256
[200]	training's rmse: 1.33641	valid_1's rmse: 1.03109
[225]	training's rmse: 1.32132	valid_1's rmse: 1.02041
[250]	training's rmse: 1.30682	valid_1's rmse: 1.01046
[275]	training's rmse: 1.29287	valid_1's rmse: 1.00121
[300]	training's rmse: 1.27945	valid_1's rmse: 0.992641
[325]	training's rmse: 1.26654	valid_1's rmse: 0.984741
[350]	training's rmse: 1.25412	valid_1's rmse: 0.977437
[375]	training's rmse: 1.24217	valid_1's rmse: 0.970748
[400]	training's rmse: 1.23065	valid_1's rmse: 0.964681
[425]	training's rmse: 1.21954	valid_1's rmse: 0.959208


In [31]:
print(model.best_iteration)

791


In [32]:
lgbm_meta_test_pred = model.predict(X_test_meta.values)
lgbm_meta_train_pred = model.predict(X_train_meta.values)
lgbm_meta_test_MSE = mean_squared_error(y_test_meta, lgbm_meta_test_pred.clip(0, 20))
lgbm_meta_train_MSE = mean_squared_error(y_train_meta, lgbm_meta_train_pred.clip(0, 20))
print('Light GBM Meta Model Train (block 32) MSE: ',lgbm_meta_train_MSE)
print('Light GBM Meta Model Test (block 33) MSE: ',lgbm_meta_test_MSE)

Light GBM Meta Model Train (block 32) MSE:  1.2064141123142558
Light GBM Meta Model Test (block 33) MSE:  0.8575943376305404


In [33]:
st = datetime.datetime.fromtimestamp(time.time()).strftime('%m%d_%H.%M')
model.save_model('lgb_meta_' + st + '.txt', num_iteration= model.best_iteration)
print('lgb_meta_' + st + '.txt')

lgb_meta_0201_21.13.txt


### Final Submission

Load models that are trained on the full dataset.

In [34]:
DATA_FOLDER = '../script/'
lrModel = joblib.load(DATA_FOLDER + 'lr_full_0201_21.06.pkl') 
lgbmModel = lgb.Booster(model_file ='lgb_model_full_0131_17.52.txt') 
nnModel = load_model(DATA_FOLDER + 'nn_full_model.h5')
lgb_meta = lgb.Booster(model_file = DATA_FOLDER + 'lgb_meta_0201_21.13.txt')
lr_meta = joblib.load(DATA_FOLDER + 'lr_meta_0201_21.09.pkl') 

In [37]:
DATA_FOLDER = '../data/'
X_test_robustScaler = pd.read_pickle(DATA_FOLDER + 'X_test_RobustScaler') # LR
X_test_standardScaler = pd.read_pickle(DATA_FOLDER + 'X_test_standardScaler') # NNet
X_test_robustScalerTree = pd.read_pickle(DATA_FOLDER + 'X_test_RobustScalerTree') #light GBM
X_test_robustScaler_lr = X_test_robustScaler.drop(['shop_id','item_id', 'item_category_id'], axis = 1)

Get predictions for the base learners

In [39]:
lrFinalPred = lrModel.predict(X_test_robustScaler_lr).clip(0,20)
lgbmFinalPred = lgbmModel.predict(X_test_robustScalerTree).clip(0,20)
nnFinalPred = nnModel.predict(X_test_standardScaler).clip(0,20)

In [40]:
preds_test = pd.DataFrame(columns = ['lr', 'lgbm', 'nnet'])
preds_test['lr'] = lrFinalPred
preds_test['lgbm'] = lgbmFinalPred 
preds_test['nnet'] = nnFinalPred
preds_test['mean'] = pd.DataFrame.mean(preds_test, axis = 1)

In [41]:
preds_test.shape

(214200, 4)

Get predictions for the meta learners

In [42]:
lrMetaPred = lr_meta.predict(preds_test).clip(0,20)
lgbMetaPred = lgb_meta.predict(preds_test).clip(0,20)

In [43]:
def submit_final_pred(pred):
    ID = joblib.load('ID.pkl')
    predDF = pd.DataFrame() 
    predDF['ID'] = ID
    predDF['item_cnt_month'] = pred
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%m%d_%H.%M')
    print('submission_' + st + '.csv')

    predDF.to_csv(header=True, index=False, path_or_buf = 'submission_' + st + '.csv')
    
    return None

In [44]:
submit_final_pred(lrMetaPred) # 1.00192

submission_0201_21.16.csv


In [45]:
submit_final_pred(lgbMetaPred) # 0.97329

submission_0201_21.17.csv


In [46]:
submit_final_pred(preds_test['mean']) # 0.92366

submission_0201_21.24.csv
