# Bike sharing (Kaggle) (Moritz)
- https://www.kaggle.com/c/184702-tu-ml-ws-18-bike-sharing
- large samples (train = 8690), small dimension (15)
- attribute characteristics: numeric, date?

## Preprocessing
- preprocessing: scale (standardize)
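
The notebook calls a `scale_data(train, test)` helper that is not defined in this file. The following is only a sketch of what such a helper could look like, assuming it standardizes with scikit-learn's `StandardScaler` fitted on the training set alone. The name `scale_data_sketch` and the toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_data_sketch(train: pd.DataFrame, test: pd.DataFrame):
    """Standardize both frames using mean/std estimated on the training set only."""
    scaler = StandardScaler().fit(train)
    train_s = pd.DataFrame(scaler.transform(train), index=train.index, columns=train.columns)
    test_s = pd.DataFrame(scaler.transform(test), index=test.index, columns=test.columns)
    return train_s, test_s


# Toy data standing in for the bike sharing features (values are made up).
toy_train = pd.DataFrame({"temp": [0.2, 0.4, 0.6, 0.8], "hum": [0.5, 0.7, 0.3, 0.9]})
toy_test = pd.DataFrame({"temp": [0.5], "hum": [0.6]})
toy_train_s, toy_test_s = scale_data_sketch(toy_train, toy_test)

# After scaling, each training column has mean ~0 and (population) std ~1.
print(toy_train_s.mean().round(6).tolist(), toy_train_s.std(ddof=0).round(6).tolist())
```

Fitting the scaler on the training set and reusing it for the test set keeps test statistics out of the model inputs.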

### Linear Regression
Scores are computed on the training data itself, without(!) a train-test split.
- with and without preprocessing
- with all samples: 
    - <1 s
    - R^2: 0.38403
    - RMSE: 140.25758
    - Kaggle: 143.91026
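
Because the model is scored on the same rows it was fitted on, the RMSE and R^2 above are in-sample values and tend to be somewhat optimistic compared to the Kaggle score. A minimal sketch of that computation, assuming plain scikit-learn calls on synthetic data (the notebook's own `linear_reg` helper is not shown, so this is not necessarily what it does):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression data (made up, only to make the example runnable).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)  # predictions on the training rows themselves

rmse = np.sqrt(mean_squared_error(y, pred))
r2 = r2_score(y, pred)
print(f"RMSE: {rmse:.5f}, R^2 Score: {r2:.5f}")
```

Here `r2` equals `reg.score(X, y)`, which is why the notebook output reports the same R^2 value twice.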

### SVR
- without preprocessing
- with all samples
    - few minutes
    - C: 1.0, kernel: linear, epsilon: 0.3, gamma: auto
    - RMSE: 147.88896
    - Kaggle: 152.19112
- with preprocessing
- with all samples: 
    - few minutes   
    - C: 1.0, kernel: linear, epsilon: 0.5, gamma: auto 
    - RMSE: 146.11651
    - Kaggle: 150.71520
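
The SVR parameters above come from a grid search over `C` and `epsilon`. The sketch below shows the general pattern with scikit-learn's `GridSearchCV` on synthetic data. The scoring choice (`neg_mean_squared_error`) is an assumption, since the notebook's `run_svr` helper is not shown. `gamma` is left out because the linear kernel does not use it.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic data (made up); the real run used the full training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(scale=0.3, size=120)

param_grid = {
    "C": np.linspace(0.2, 1, 5),
    "kernel": ["linear"],
    "epsilon": np.linspace(0, 0.5, 6),
}

# 5-fold cross-validation over all 30 parameter combinations.
gs = GridSearchCV(SVR(), param_grid, cv=5, scoring="neg_mean_squared_error")
gs.fit(X, y)

# best_score_ is the mean negative MSE over the folds; convert it to an RMSE.
best_rmse = np.sqrt(-gs.best_score_)
print(gs.best_params_, round(best_rmse, 5))

# np.linspace produces floating-point grid values, so the fourth epsilon
# candidate is not exactly 0.3 when printed at full precision.
print(repr(float(np.linspace(0, 0.5, 6)[3])))
```

The last line explains the long `epsilon` value in the logged parameters and file name: it is the grid point 0.3 printed with full floating-point precision.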

### Gradient Boosted Decision Tree
- with preprocessing
- with 50 samples: 
    - 13.559 s
    - {'max_depth': 5, 'min_samples_split': 15, 'n_estimators': 350}
    - RMSE: 89.9571
    - Kaggle: 143.91026
- with 150 samples: 
    - 25.025 s
    - {'max_depth': 5, 'min_samples_split': 15, 'n_estimators': 350}
    - RMSE: 118.99678
    - Kaggle: 128.42365
- with 500 samples: 
    - 74.803 s
    - {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 350}
    - RMSE: 78.88731
    - Kaggle: 128.39004
- with 2000 samples:
    - 431.691 s
    - {'max_depth': 10, 'min_samples_split': 15, 'n_estimators': 350}
    - RMSE: 54.24448
    - Kaggle: 54.09653
- with all samples:
    - 4361.643 s
    - {'max_depth': 10, 'min_samples_split': 15, 'n_estimators': 350}
    - RMSE: 43.12057
    - Kaggle: 43.16327
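
The runs above repeat the same grid search on growing subsets of the training data and record the wall-clock time. A rough sketch of that loop follows, on synthetic data and with a much smaller grid so that it finishes quickly. The learning rate, the grid values, and the scoring are assumptions and differ from the notebook's settings, and the default squared-error loss stands in for the notebook's `'ls'` (a name that newer scikit-learn releases replaced with `'squared_error'`).

```python
import time

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear regression data (made up).
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))
y = 100 * np.sin(6 * X[:, 0]) + 50 * X[:, 1] + rng.normal(scale=5, size=300)

# Deliberately small grid; the notebook searched a larger one.
param_grid = {"n_estimators": (10, 50), "max_depth": (1, 3), "min_samples_split": (2, 15)}

results = []
for num_samples in (50, 150, 300):
    gs = GridSearchCV(
        GradientBoostingRegressor(learning_rate=0.1, random_state=0),
        param_grid,
        cv=5,
        scoring="neg_mean_squared_error",
    )
    start = time.perf_counter()
    gs.fit(X[:num_samples], y[:num_samples])  # first num_samples rows only
    elapsed = time.perf_counter() - start
    rmse = np.sqrt(-gs.best_score_)  # cross-validated RMSE of the best setting
    results.append((num_samples, elapsed, rmse, gs.best_params_))
    print(f"with {num_samples} samples: {elapsed:.3f} s, RMSE: {rmse:.5f}, {gs.best_params_}")
```

Timing each subset separately gives a feel for how the search cost grows before committing to the full data set.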
    
### AutoML 
- without scaling
- with all samples
    - minutes? (max 600 s)
    - XGBoost_3_AutoML_20190106_135032
    - RMSE: 40.08211
    - Kaggle: __39.61732__
    
    
- with preprocessing
- with all samples
    - minutes (max 600 s)
    - XGBoost_3_AutoML_20190106_144525
    - RMSE: 40.24830
    - Kaggle: 39.70656

In [19]:
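import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
import datetime as dt
import h2o
from sklearn.model_selection import train_test_split
# base.ipynb presumably defines the helpers used below (scale_data, linear_reg, run_svr, ...)
%run './base.ipynb'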

In [15]:
# read train data and drop date
train = pd.read_csv('./data/bike_sharing_kaggle/bikeSharing.shuf.train.csv').drop(['dteday'], axis=1)
train.set_index(['id'], inplace=True)
# extract, then drop 'cnt' col
train_target = train[['cnt']]
train.drop(['cnt'], axis='columns', inplace=True)

# read test data and drop date
test = pd.read_csv('./data/bike_sharing_kaggle/bikeSharing.shuf.test.csv').drop(['dteday'], axis=1)
test.set_index(['id'], inplace=True)

# scale train and test
train_s, test_s = scale_data(train, test)

X_train = train_s
y_train = train_target
X_test = test_s

#X_train = train
#y_train = train_target
#X_test = test

#display(train)
#display(X_train.shape)

In [3]:
# Linear Regression
# predict on X_train (to replace missing MSE in LinearRegression)
reg = linear_reg(X_train, y_train, X_train, y_train)
result = pd.DataFrame(reg.predict(X_test), columns=['cnt'])

# join id col
result = pd.concat([X_test.reset_index()[['id']], result], axis='columns')

# Save result
filename = f'''lr_{dt.datetime.now()}.csv'''

result.to_csv('./predictions/bike_sharing_kaggle/' + filename, sep = ",", index=False)
print(f'''Saved as {filename}''')

#display(result)



R^2 value for model: 0.38403
Predict:
RMSE: 140.25758
R^2 Score: 0.38403
Saved as lr_2019-01-06 13:34:13.249858.csv
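
The file name above contains a space and colons because `dt.datetime.now()` is formatted directly, and some file systems (Windows, for example) reject colons in file names. A minimal sketch of the same id-plus-prediction export with a portable timestamp follows. The helper name `save_submission`, the example values, and the temporary output directory are hypothetical.

```python
import datetime as dt
import os
import tempfile

import pandas as pd


def save_submission(ids, predictions, prefix, out_dir):
    """Write an id,cnt CSV and return the file name (hypothetical helper)."""
    result = pd.DataFrame({"id": list(ids), "cnt": list(predictions)})
    # strftime avoids the spaces and colons of the default datetime format.
    stamp = dt.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"{prefix}_{stamp}.csv"
    result.to_csv(os.path.join(out_dir, filename), index=False)
    return filename


# Made-up ids and predictions, written to a temporary directory.
out_dir = tempfile.mkdtemp()
filename = save_submission([3, 7, 11], [12.5, 140.0, 80.25], "lr", out_dir)
saved = pd.read_csv(os.path.join(out_dir, filename))
print(f"Saved as {filename}")
```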


In [4]:
# SVR
# params
param_grid = {
    'C': np.linspace(.2,1,5),
    'kernel': ['linear'],#, 'rbf', 'sigmoid', 'poly'], # poly very slow
    'epsilon': np.linspace(0,.5,6),
    'gamma': ['auto']
}

# run grid search
gs = run_svr(X_train, y_train.values.ravel(), cv=5, param_grid=param_grid)

# predict
result = pd.DataFrame(gs.best_estimator_.predict(X_test), columns=['cnt'])

# join id col
result = pd.concat([X_test.reset_index()[['id']], result], axis='columns')
#display(result)

# Create SVR filename
filename = f'''svr_'''\
           f'''C-{gs.best_estimator_.C}_'''\
           f'''k-{gs.best_estimator_.kernel}_'''\
           f'''e-{gs.best_estimator_.epsilon}_'''\
           f'''g-{gs.best_estimator_.gamma}_'''\
           f'''{dt.datetime.now()}.csv'''

result.to_csv('./predictions/bike_sharing_kaggle/' + filename, sep = ",", index=False)
print(f'''Saved as {filename}''')

GridSearch initializing...
SVR model in training...
MSE: 21871.14445, RMSE: 147.88896, 
C: 1.0, kernel: linear, epsilon: 0.30000000000000004, gamma: auto 
Saved as svr_C-1.0_k-linear_e-0.30000000000000004_g-auto_2019-01-06 13:43:58.951959.csv


In [21]:
# Gradient Boosted Decision Tree
param_fix = {
    'learning_rate': .01, 
    'loss': 'ls'
}

param_grid = {
    'n_estimators': (1, 10, 100, 200, 350),# 500), 
    'max_depth': (1, 5, 10, 25),# 50), 
    'min_samples_split': (2, 5, 15),# 50)
}

num_samples = 500
#X = X_train.iloc[:num_samples, :]
#y = y_train.iloc[:num_samples, :].values.ravel()

X = X_train
y = y_train.values.ravel()

gs = run_boosted_tree(X, y, [], [], param_fix=param_fix, cv=10, param_grid=param_grid)

#plot_scores(gbt.cv_results_)
#plot_training_deviance(gbt, test_data, test_target)

# predict
result = pd.DataFrame(gs.best_estimator_.predict(X_test), columns=['cnt'])

# join id col
result = pd.concat([X_test.reset_index()[['id']], result], axis='columns')
#display(result)

# Create GBDT filename
filename = f'''gbdtree_'''\
           f'''ne-{gs.best_estimator_.n_estimators}_'''\
           f'''md-{gs.best_estimator_.max_depth}_'''\
           f'''mss-{gs.best_estimator_.min_samples_split}_'''\
           f'''{dt.datetime.now()}.csv'''

result.to_csv('./predictions/bike_sharing_kaggle/' + filename, sep = ",", index=False)
print(f'''Saved as {filename}''')

GridSearch initializing...
GradientBoostedRegressor model in training...
GradientBoostedRegressor model selected and fitted in 4361.643 s

MSE: 1859.38335, RMSE: 43.12057
Best parameters selected by GridSearch: {'max_depth': 10, 'min_samples_split': 15, 'n_estimators': 350}
Saved as gbdtree_ne-350_md-10_mss-15_2019-01-02 01:53:02.486364.csv


In [20]:
# AutoML
# reset indices to id col
X_train_r = X_train.reset_index()
#display(X_train_r)
y_train_r = y_train.reset_index()
#display(y_train_r.info())

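# create train-test split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_train_r, y_train_r, random_state=100)
# merge features and target on 'id'; note that 'id' stays in train as a column
# and shows up as a feature in the variable importances below
train = pd.merge(X_train1, y_train1, on='id')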

# drop id cols from test set
X_test2 = X_test1.drop(['id'], axis=1)
y_test2 = y_test1.drop(['id'], axis=1)

y_name = 'cnt'
aml = run_autoML_moritz(train, y_name, X_test2, y_test2, 
                        max_models=30, max_runtime=600)

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O cluster uptime:,1 hour 9 mins
H2O cluster timezone:,Europe/Vienna
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.1.1
H2O cluster version age:,9 days
H2O cluster name:,H2O_from_python_Moritz_rnh332
H2O cluster total nodes:,1
H2O cluster free memory:,3.212 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,4


Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%


'AUTOML Leaderboard'

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
XGBoost_3_AutoML_20190106_144525,1619.93,40.2483,1619.93,25.6588,
XGBoost_2_AutoML_20190106_144525,1643.4,40.5388,1643.4,25.9113,
GBM_3_AutoML_20190106_144525,1650.32,40.6242,1650.32,25.8912,
StackedEnsemble_AllModels_AutoML_20190106_144525,1666.49,40.8226,1666.49,26.1142,
GBM_4_AutoML_20190106_144525,1675.25,40.9298,1675.25,25.9138,
GBM_2_AutoML_20190106_144525,1711.36,41.3686,1711.36,26.4678,
GBM_1_AutoML_20190106_144525,1743.26,41.7523,1743.26,27.0199,
XGBoost_grid_1_AutoML_20190106_144525_model_1,1881.53,43.3766,1881.53,27.9474,
XGBoost_1_AutoML_20190106_144525,1891.06,43.4863,1891.06,28.1736,
StackedEnsemble_BestOfFamily_AutoML_20190106_144525,2101.54,45.8425,2101.54,29.6678,




In [21]:
print(aml.leader)

Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  XGBoost_3_AutoML_20190106_144525


ModelMetricsRegression: xgboost
** Reported on train data. **

MSE: 351.28735220796074
RMSE: 18.742661289367653
MAE: 11.653210594200623
RMSLE: NaN
Mean Residual Deviance: 351.28735220796074

ModelMetricsRegression: xgboost
** Reported on cross-validation data. **

MSE: 1619.9264050292184
RMSE: 40.24830934373789
MAE: 25.65879186368691
RMSLE: NaN
Mean Residual Deviance: 1619.9264050292184
Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,25.658718,0.5891565,26.651115,25.142027,24.30492,26.119354,26.076178
mean_residual_deviance,1619.9198,65.04105,1725.4581,1557.3733,1479.5155,1706.0093,1631.2428
mse,1619.9198,65.04105,1725.4581,1557.3733,1479.5155,1706.0093,1631.2428
r2,0.9498392,0.0024831,0.9509453,0.9525676,0.9539651,0.9443955,0.9473224
residual_deviance,1619.9198,65.04105,1725.4581,1557.3733,1479.5155,1706.0093,1631.2428
rmse,40.23184,0.812119,41.538635,39.46357,38.46447,41.303864,40.388645
rmsle,0.0,,,,,,


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2019-01-06 14:47:37,39.264 sec,0.0,261.4297733,189.5742673,68345.5263542
,2019-01-06 14:47:37,39.318 sec,5.0,208.2368331,148.1900374,43362.5786443
,2019-01-06 14:47:38,39.368 sec,10.0,171.6200143,118.2726732,29453.4293233
,2019-01-06 14:47:38,39.425 sec,15.0,143.2111822,96.7991612,20509.4426965
,2019-01-06 14:47:38,39.494 sec,20.0,119.2136477,79.2451104,14211.8938097
---,---,---,---,---,---,---
,2019-01-06 14:47:42,44.269 sec,145.0,20.2559424,12.6042208,410.3032037
,2019-01-06 14:47:43,44.571 sec,150.0,19.8681261,12.3610693,394.7424336
,2019-01-06 14:47:43,44.915 sec,155.0,19.4309868,12.0814534,377.5632471



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
hr,647353472.0000000,1.0,0.4958914
id,177065552.0000000,0.2735222,0.1356373
atemp,108771960.0000000,0.1680256,0.0833225
workingday,80005520.0000000,0.1235886,0.0612865
temp,79177984.0000000,0.1223103,0.0606526
hum,60228916.0000000,0.0930387,0.0461371
weekday,51562268.0000000,0.0796509,0.0394982
yr,33940796.0000000,0.0524301,0.0259996
weathersit,24621174.0000000,0.0380336,0.0188605





In [22]:
# create predictions for test data
X_test_h2o = h2o.H2OFrame(X_test.reset_index())
result = aml.predict(X_test_h2o)
#result.head(rows=result.nrows)

result_df = result.as_data_frame()
result_df[['id']] = X_test.reset_index()[['id']]
result_df.rename({'predict': 'cnt'}, axis=1, inplace=True)
#display(result_df)

# save to file
filename = f'''autoML_'''\
           f'''{dt.datetime.now()}.csv'''

result_df.to_csv('./predictions/bike_sharing_kaggle/' + filename, sep = ",", index=False)
print(f'''Saved as {filename}''')

Parse progress: |█████████████████████████████████████████████████████████| 100%
xgboost prediction progress: |████████████████████████████████████████████| 100%
Saved as autoML_2019-01-06 15:08:12.524794.csv


In [None]:
# shutdown h2o cluster
h2o.cluster().shutdown()