# <img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/4594/logos/front_page.png"/><span style="color:blue;text-align:center;">v5 XGBoost</span>

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams!
<img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/4594/media/rossmann_banner2.png"/>

In [1]:
using DataFrames
using XGBoost
using Gadfly

### Load Data

In [2]:
train = readtable("data/training_feat_engineering.csv")
test = readtable("data/test_feat_engineering.csv");

### Training

In [3]:
features = names(train)[1:end-1];

In [4]:
train_y = Array{Float64}(train[:Sales])
train_x = Array{Float64,2}(train[:, features])
test_x = Array{Float64,2}(test[:, features]);

In [9]:
num_round = 2000
params = Dict{Any,Any}("max_depth"=>50, "eta"=>0.3, "subsample"=>0.5, "objective"=>"reg:linear")
tic(); model = XGBoost.xgboost(train_x, num_round, label=train_y, param=params); toc();
ypred = XGBoost.predict(model, test_x);

[1]	train-rmse:4965.934082
[2]	train-rmse:3619.122070
[3]	train-rmse:2695.842285
[4]	train-rmse:2067.988770
[5]	train-rmse:1623.628174
[6]	train-rmse:1332.710327
[7]	train-rmse:1116.742554
[8]	train-rmse:959.140015
[9]	train-rmse:832.532715
[10]	train-rmse:737.954285
[11]	train-rmse:659.990417
[12]	train-rmse:589.286621
[13]	train-rmse:536.055786
[14]	train-rmse:488.016327
[15]	train-rmse:447.094513
[16]	train-rmse:411.355408
[17]	train-rmse:377.169647
[18]	train-rmse:347.837036
[19]	train-rmse:323.062897
[20]	train-rmse:301.574493
[21]	train-rmse:281.200867
[22]	train-rmse:264.297272
[23]	train-rmse:247.655563
[24]	train-rmse:232.788818
[25]	train-rmse:218.171509
[26]	train-rmse:204.924622
[27]	train-rmse:193.965271
[28]	train-rmse:182.875107
[29]	train-rmse:172.325684
[30]	train-rmse:165.362015
[31]	train-rmse:156.593475
[32]	train-rmse:148.079514
[33]	train-rmse:140.360168
[34]	train-rmse:133.059509
[35]	train-rmse:126.242569
[36]	train-rmse:

elapsed time: 13329.400299309 seconds

In [10]:
ypred_round = map(v -> v < 0 ? 0 : v, round(Int, ypred));
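
The training log above reports only the training RMSE, which with a `max_depth` of 50 will keep shrinking whether or not the model generalizes. One way to get an out-of-sample number is to hold out part of the rows before training. The sketch below only builds the index split. It is written for Julia 1.x rather than the older syntax used in this notebook, and `holdout_indices` with its parameters is a hypothetical helper, not part of the notebook. A random split is also an assumption: for a forecasting task, holding out the most recent weeks may be more representative.

```julia
using Random

# Hypothetical helper: split row indices 1:n into a training part and a
# validation part. `frac_valid` is the share of rows held out.
function holdout_indices(n::Integer; frac_valid::Real=0.1,
                         rng::AbstractRNG=MersenneTwister(42))
    0 < frac_valid < 1 || throw(ArgumentError("frac_valid must be in (0, 1)"))
    perm = randperm(rng, n)                        # shuffled row indices
    n_valid = max(1, round(Int, frac_valid * n))   # at least one held-out row
    return perm[n_valid+1:end], perm[1:n_valid]    # (train rows, validation rows)
end

# Example with a made-up row count
train_idx, valid_idx = holdout_indices(1000; frac_valid=0.2)
```

The resulting indices could then be used as `train_x[train_idx, :]` and `train_x[valid_idx, :]`, with the model fitted on the first part and scored on the second using the `rmspe` function from the Simple Eval section.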


### Simple Eval 

In [11]:
rmspe(yreal, ypred) = sqrt(sum(((yreal - ypred) ./ (yreal+.00001)) .^2)/length(yreal))
rmspe(train_y, XGBoost.predict(model, train_x))

68.48976543618694
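
For reference, the competition scores submissions with the root mean square percentage error,

$$\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2},$$

and ignores any store-day with zero sales. The `rmspe` above instead adds a small offset to the denominator, so if `train_y` still contains zero-sales rows (an assumption, since the feature-engineering step is not shown here), each of them contributes an enormous term, which may be why the training score is far above 1. A minimal sketch of a variant that drops those rows follows. It is written for Julia 1.x rather than the older syntax used in this notebook, and `rmspe_nonzero` is a hypothetical name.

```julia
# RMSPE computed only over rows whose true value is non-zero,
# mirroring the competition rule that zero-sales days are not scored.
function rmspe_nonzero(yreal::AbstractVector{<:Real}, ypred::AbstractVector{<:Real})
    length(yreal) == length(ypred) ||
        throw(DimensionMismatch("yreal and ypred must have the same length"))
    mask = yreal .!= 0                 # keep only days with non-zero sales
    any(mask) || return NaN            # nothing left to score
    rel = (yreal[mask] .- ypred[mask]) ./ yreal[mask]
    return sqrt(sum(abs2, rel) / count(mask))
end

# Toy example with made-up numbers: the zero-sales row is skipped,
# leaving relative errors of -0.1 and 0.1.
y_true = [100.0, 0.0, 200.0]
y_hat  = [110.0, 5.0, 180.0]
score = rmspe_nonzero(y_true, y_hat)
```

On the notebook's data this would be called as `rmspe_nonzero(train_y, XGBoost.predict(model, train_x))`, which should give a number on the same scale as the public leaderboard scores, although it remains a training-set figure.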

### Create Submission

In [12]:
submission = DataFrame(Id=test[:Id], Sales=ypred_round)
writetable("data/submission_v5_xgboost_nr2000_eta.3_md50_ss.9_reglinear.csv", submission, separator=',');

### Kaggle Public Result

v5.6 + XGBoost nr500 eta .3 md 30 ss .9 reg:linear: **** (eval: )  
v5.5 + XGBoost nr500 eta .3 md 30 ss .9 reg:linear: **0.16466** (eval: 102.72031811851865)  
v5.4 + XGBoost nr10 eta .3 md 30 ss .5 reg:linear: **0.17142** (eval: 2.403523758865878e6)  
v5.3 + XGBoost nr10 eta .3 md 15 ss .5 reg:linear: **0.22616** (eval: 4.229142277636823e6)  
v5.2 + XGBoost nr300 eta .3 md 15 ss .5 reg:linear: **0.15229** (eval: 1.7664821959061403e7)  
v5.1 + XGBoost: **0.42537** (eval: 2.1655121558566544e7)  
v4 + Random Forest 100 Trees, Feat 5: **0.30076**  
v4 + Initial Random Forest: **0.30879**  
v2 + CV + Selected Features: **0.42446**  
v1 All Features - Only to set an initial score: **0.42591**  
Baseline benchmark (all zeros), which every version above beats: **1.0000**