## Predictive Model Exploration

#### I tried a wide variety of models for this competition. The caret framework makes it very easy to test different models in R. Some of the models that I tried are:

* Random Forest
* Extremely Randomized Trees
* Extreme Gradient Boosted Trees (XGBoost)
* Elastic Nets
* Generalized Linear Models
* Generalized Additive Models
* Online Linear Learning (via Vowpal Wabbit)

The nice thing about the caret package is that it provides a unified interface for all of these models. Of these, I found that XGBoost had the most predictive power. Once I settled on XGBoost, I ran it directly, without the caret wrapper.
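To give a feel for what "unified interface" means, here is a minimal sketch using caret's `train()` function. The built-in `mtcars` data and the two methods (`"lm"`, `"glm"`) are stand-ins chosen so the snippet runs without extra packages; they are not the data or the models from this competition.

```r
library(caret)

set.seed(1)

# The same resampling setup is shared by every model: 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Only the method string changes between models. Strings such as "rf",
# "glmnet" or "xgbTree" follow the same pattern, but need their backing
# packages installed.
methods <- c("lm", "glm")

fits <- lapply(methods, function(m) {
  train(mpg ~ wt + hp, data = mtcars, method = m, trControl = ctrl)
})
names(fits) <- methods

# Best cross-validated RMSE for each model
cvRmse <- sapply(fits, function(f) min(f$results$RMSE))
print(cvRmse)
```

Because every fit comes back as the same kind of object, comparing candidates boils down to comparing their resampled error estimates.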

### Log Transformation of the Dependent Variable

The demand variable should be log transformed, for two reasons:

1. The metric for the competition is RMSLE (Root Mean Squared Logarithmic Error), while most methods support the RMSE (Root Mean Squared Error) metric. Computing the RMSE on log-transformed demand (here, log(1 + demand)) is equivalent to computing the RMSLE on the raw demand.
2. Demand is never negative, whereas most methods assume that the dependent variable ranges from -Inf to +Inf. Modeling on the log scale and inverting the transformation afterwards bounds the predictions from below.
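The first point can be checked numerically. The sketch below assumes the transform is log(1 + y), which is what the `exp (...) - 1` in the prediction step suggests; the demand values are made up for illustration.

```r
# RMSE and RMSLE written out explicitly
rmse  <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# Made-up hourly demand values and predictions
actual    <- c(0, 3, 16, 120, 450)
predicted <- c(1, 5, 12, 100, 500)

# RMSLE on the raw scale ...
rmsleDirect <- rmsle(actual, predicted)

# ... is the same number as RMSE on the log1p scale
rmseOnLog <- rmse(log1p(actual), log1p(predicted))

print(all.equal(rmsleDirect, rmseOnLog))

# expm1() undoes log1p(), so predictions can be mapped back to counts
roundTrip <- expm1(log1p(actual))
```

So a model that minimizes RMSE on the transformed target is minimizing the competition metric directly.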

I modified the formatData() function to support a logTransform parameter for this purpose.
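The real formatData() lives in support.R and is not reproduced here. Purely as an illustration, a function of this kind might look like the sketch below: `formatDataSketch` is a hypothetical name, and both the derived columns (guessed from the model formula used later) and the use of `log1p()` are assumptions.

```r
# Hypothetical stand-in for formatData(); NOT the function from support.R
formatDataSketch <- function(df, logTransform = FALSE) {
  dt <- as.POSIXlt(df$datetime, tz = "UTC")

  # Calendar features derived from the timestamp
  df$hour  <- dt$hour
  df$wday  <- dt$wday
  df$day   <- dt$mday
  df$month <- dt$mon + 1
  df$year  <- dt$year + 1900

  # Optionally move the demand columns to the log(1 + y) scale.
  # intersect() keeps this safe for test data, which has no demand columns.
  if (logTransform) {
    for (col in intersect(c("casual", "registered", "count"), names(df))) {
      df[[col]] <- log1p(df[[col]])
    }
  }
  df
}

toy <- data.frame(datetime   = c("2011-01-01 00:00:00", "2012-06-15 17:00:00"),
                  casual     = c(0, 40),
                  registered = c(3, 300))
toy$count <- toy$casual + toy$registered

out <- formatDataSketch(toy, logTransform = TRUE)
print(out)
```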

```r
library (plyr)
library (dplyr)
library (ggplot2)
library (lubridate)
library (reshape2)
library (xgboost)

# Helper functions, including formatData()
source ("support.R")

train.df <- read.csv ("data/train.csv")
test.df <- read.csv ("data/test.csv")

# Only the training data has demand columns, so only it is log transformed
train.df <- formatData (train.df, logTransform=TRUE) %>% tbl_df()
test.df <- formatData (test.df) %>% tbl_df()

# Treat month and year as categorical variables
train.df$month <- factor (train.df$month)
train.df$year <- factor (train.df$year)

test.df$month <- factor (test.df$month)
test.df$year <- factor (test.df$year)
```


Note that I've converted the month and year variables to factors. I did this because I didn't see any continuous behavior in demand as a function of these two variables.
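The practical effect of the conversion shows up in the design matrix. In this small sketch (toy data, not the competition set), a numeric month becomes a single column that the model can only split on as an ordered quantity, while a factor month becomes one indicator column per level.

```r
toy <- data.frame(month = c(1, 6, 12, 6))

# Numeric month: one column holding the raw month numbers
numericDesign <- model.matrix(~ month - 1, toy)

# Factor month: one 0/1 indicator column per distinct month
toy$month <- factor(toy$month)
factorDesign <- model.matrix(~ month - 1, toy)

print(colnames(numericDesign))
print(colnames(factorDesign))
```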

We are now ready to run the XGBoost model. At this point, we have a choice to make: we can either predict total demand directly, or predict casual and registered demand separately and then sum them. From my explorations, predicting casual and registered demand separately yielded a slightly better score.
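One detail matters when summing the two predictions: since both models work on the log scale, each prediction has to be mapped back to a count before adding. A small sketch with made-up numbers, assuming the log(1 + y) transform:

```r
# Made-up predictions from the two models, on the log1p scale
casualLog     <- log1p(c(0, 10, 50))
registeredLog <- log1p(c(5, 100, 300))

# Correct: back-transform each prediction, then add the counts
total <- expm1(casualLog) + expm1(registeredLog)

# Incorrect: adding on the log scale multiplies (1 + casual) by
# (1 + registered) instead of summing the two counts
wrong <- expm1(casualLog + registeredLog)

print(total)
print(wrong)
```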

### Model for Casual Users

```r
# One-sided formula: predictors only, no intercept column
train.formula <-  ~ season + holiday + workingday + weather + temp + atemp +
        humidity + windspeed + year + month + wday + day + hour - 1

# The casual label was already log transformed by formatData()
trainData <- xgb.DMatrix (model.matrix (train.formula, train.df), label=train.df$casual)
testData <- xgb.DMatrix (model.matrix (train.formula, test.df))

set.seed (4322)

# The parameters below are optimal parameters, calculated via hyperparameter optimization using the hyperopt package
# I'll show my setup to use this package a bit later. If you want to take a quick peek, look at the hyperopt/ directory
# Note: "reg:linear" was renamed to "reg:squarederror" in newer xgboost releases
params <- list (booster="gbtree",
                eta=0.00330925962444,
                gamma=0.684530964272,
                max_depth=7,
                min_child_weight=0.596497397942,
                subsample=0.678093555386,
                colsample_bytree=0.662176894972,
                objective="reg:linear",
                eval_metric="rmse")

# xgb.train() takes nrounds; nfold is an xgb.cv() argument and has no effect here
fit1 <- xgb.train (params, trainData, nrounds = 8500)
```
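Cross-validation folds belong to `xgb.cv()` rather than `xgb.train()`. Below is a minimal sketch of how a number of boosting rounds could be checked with 5-fold cross-validation before the final fit. The data is synthetic, the parameter values are arbitrary rather than the tuned ones above, and the objective name `reg:squarederror` assumes a reasonably recent xgboost release.

```r
library(xgboost)

set.seed(4322)

# Synthetic stand-in for the design matrix and the log1p-transformed demand
n <- 500
x <- matrix(rnorm(n * 4), ncol = 4, dimnames = list(NULL, paste0("f", 1:4)))
y <- log1p(rpois(n, lambda = exp(1 + 0.5 * x[, 1])))
toyData <- xgb.DMatrix(x, label = y)

# Arbitrary illustrative parameters, not the tuned competition values
toyParams <- list(booster = "gbtree",
                  eta = 0.1,
                  max_depth = 3,
                  objective = "reg:squarederror",
                  eval_metric = "rmse")

# 5-fold cross-validation over up to 200 boosting rounds
cv <- xgb.cv(params = toyParams, data = toyData,
             nrounds = 200, nfold = 5, verbose = 0)

# Round with the lowest mean held-out RMSE
bestRounds <- which.min(cv$evaluation_log$test_rmse_mean)

# Final model trained on all rows with that many rounds
toyFit <- xgb.train(params = toyParams, data = toyData, nrounds = bestRounds)
toyPred <- predict(toyFit, toyData)
```

With a small learning rate like the ones used in this notebook, the cross-validated error curve tends to flatten slowly, which is consistent with the large round counts chosen here.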

### Model for Registered Users

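```r
# Same predictors as before, now with the (log-transformed) registered demand as the label
trainData <- xgb.DMatrix (model.matrix (train.formula, train.df), label=train.df$registered)

# Tuned separately from the casual model
params <- list (booster="gbtree",
                eta=0.00291713063475,
                gamma=0.00833471795637,
                max_depth=6,
                min_child_weight=1.57952698042,
                subsample=0.626763785155,
                colsample_bytree=0.685032802413,
                objective="reg:linear",
                eval_metric="rmse")

# xgb.train() takes nrounds; nfold is an xgb.cv() argument and has no effect here
fit2 <- xgb.train (params, trainData, nrounds = 10000)
```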

### Generate Predictions

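```r
# Undo the log transform for each model, then add casual and registered demand
y.pred <- (exp (predict (fit1, testData)) - 1) + (exp (predict (fit2, testData)) - 1)

# Demand cannot be negative
y.pred[y.pred < 0] <- 0

# Submission format: timestamp plus predicted total count
result.df <- data.frame (datetime=strftime (test.df$datetime, 
                                            format="%Y-%m-%d %H:%M:%S", 
                                            tz="UTC"),
                         count=y.pred)
```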

#### This pair of optimized XGBoost models is good for 97th place. Not bad, eh?

![97th Place](files/images/xgb-separate.png)

### At this point, I decided that I really wanted a top 50 finish. So I set about trying to improve this model.

#### How I actually accomplished this is a story for a different day