## Autocorrelative Models

#### Even after extensive hyperparameter optimization, I was only able to reach 97th place on the leaderboard. At this point, I decided that it was time to try something else. In general, I group model improvement techniques into the following buckets:

1. Feature Engineering
2. Model Ensembling
3. Parameter Tuning

Having already squeezed as much performance as I could out of a single XGBoost model through parameter tuning, the only remaining ways to improve were feature engineering and model ensembling.

My initial forays into model ensembling weren't particularly successful. None of the other models reduced the error by much when combined with the XGBoost model. I decided that I needed new features to really vault me up the leaderboard.

#### After giving it some thought, I settled on building an autocorrelative model. My rationale was that the demand at a given hour should be highly correlated with the demand in the hours just preceding it. I decided to use the values for previous hours as predictors for subsequent hours, in addition to the other predictors such as temperature and humidity.

Normal scheme: Season + Holiday + Workingday + Weather + Temp + Atemp + Humidity + Windspeed + Year + Month + Day + Day of the Week + Hour

Autocorrelative scheme: Season + Holiday + Workingday + Weather + Temp + Atemp + Humidity + Windspeed + Year + Month + Day + Day of the Week + Hour + **Predictions for the past 4 hours**
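
To make the extra predictors concrete, here is a minimal sketch of how lag columns could be attached to the training data. It assumes the training-set lags are filled from the observed counts, which the post does not state explicitly. The helper `addLagColumns` and the `_lag1`, `_lag2` column names are hypothetical and only serve the illustration.

```r
# Hypothetical helper: for each row, look up the value recorded exactly
# k hours earlier and store it in a new column "<valueColumn>_lag<k>".
addLagColumns <- function(df, valueColumn = "registered", window = 4,
                          timeColumn = "datetime") {
  times <- as.numeric(df[[timeColumn]])  # seconds since the epoch
  for (k in seq_len(window)) {
    # match() returns NA when the hour k steps back is absent from the data
    idx <- match(times - k * 3600, times)
    df[[paste0(valueColumn, "_lag", k)]] <- df[[valueColumn]][idx]
  }
  df
}

# Toy data with one missing hour (03:00)
train <- data.frame(
  datetime = as.POSIXct("2011-01-01 00:00:00", tz = "UTC") +
    3600 * c(0, 1, 2, 4, 5),
  registered = c(13, 32, 27, 1, 1)
)
trainLagged <- addLagColumns(train, window = 2)
print(trainLagged)
```

Rows whose preceding hour is not in the data end up with `NA` in the corresponding lag column, which is exactly the missing-value problem that has to be handled before fitting or predicting.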


While conceptually simple, this idea poses a few implementation challenges:

1. Dealing with missing values: Certain hours are missing in the train/test dataset. This makes predictions challenging for subsequent hours, which depend on the value for that hour. My solution was to impute these values: when only some of the previous hours are available, the gaps are padded with their mean, and when none are available, the mean value for that hour of the day is used.

2. Keeping track of past predictions: Usually, during the prediction process each test case can be presented in any order. Since we are trying to use past predictions as predictors for future predictions, we have to predict in temporal order.
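
Both challenges can be checked up front. The sketch below, on made-up timestamps, sorts the rows into temporal order and lists the hours that are absent from the data. The variable names `fullGrid` and `missingHours` are my own and do not come from the original code.

```r
# Toy test set: rows arrive out of order and hours 02:00 and 04:00 are absent
test <- data.frame(
  datetime = as.POSIXct("2011-01-20 00:00:00", tz = "UTC") +
    3600 * c(3, 0, 1, 5)
)

# Challenge 2: put the rows in temporal order before predicting sequentially
test <- test[order(test$datetime), , drop = FALSE]

# Challenge 1: compare against a complete hourly grid to find the gaps
fullGrid <- seq(min(test$datetime), max(test$datetime), by = "hour")
missingHours <- fullGrid[!(as.numeric(fullGrid) %in% as.numeric(test$datetime))]
print(missingHours)
```

Knowing where the gaps are makes it easier to see which predictions will rely on imputed lag values.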

### Getting Previous Predictions

I opted to do a straight lookup for the 4 previous predictions when provided with a data point. The function is shown below:

```r
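library (dplyr)      # tbl_df, select_, filter_ (the underscored verbs are deprecated in current dplyr)
library (lubridate)  # dhours, hour

#This function obtains the previous predictions for a given point in time
#
#PARAMETERS:
#df - Data frame containing the data (must have an "hour" column)
#lookupValue - The point in time (date-time) whose preceding values are wanted
#window - How many previous predictions are desired
#lookupColumn - Which column will serve as index?
#valueColumn - Which column has the actual prediction?
#impute - If no previous predictions are found at all, how should this be handled?
#         If impute is TRUE, the mean value for that hour of the day is used;
#         otherwise NAs are returned
#
#RETURN:
#A vector of length `window` containing the previous predictions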

getPrevPreds <- function (df, lookupValue, window=4, lookupColumn="datetime", 
                          valueColumn="registered", impute=FALSE)
{
        df <- tbl_df (df)
        dots <- list (lazyeval::interp (~lookupColumn, lookupColumn=as.name (lookupColumn)))
        dots <- c(dots, lazyeval::interp (~ valueColumn, valueColumn=as.name (valueColumn)))
        dots <- c(dots, lazyeval::interp (~ hour))
                  
        df <- select_(df, .dots=dots)
        lowerLookupBound <- lookupValue - dhours (window)
        upperLookupBound <- lookupValue - dhours (1)
        
        dots <- lazyeval::interp (~ lookupColumn >= lowerLookupBound & 
                                lookupColumn <= upperLookupBound, 
                                lookupColumn=as.name (lookupColumn),
                                lowerLookupBound=lowerLookupBound,
                                upperLookupBound=upperLookupBound)
        
        filteredDF <- filter_ (df, dots)
        prevPreds <- filteredDF[[valueColumn]]
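        # Fewer than `window` values found: pad or impute
        if (length (prevPreds) < window) {
            if (length (prevPreds) > 0) {
                # Some values found: pad the front with their mean
                prevPreds <- c(rep (mean (prevPreds, na.rm=TRUE), 
                                    window-length (prevPreds)), 
                               prevPreds)
            } else if (!impute) {
                # Nothing found and imputation disabled: return NAs
                prevPreds <- rep (NA, window)
            } else {
                # Nothing found: fall back to the mean for this hour of the day
                print (paste0 (lookupValue, " : imputed"))
                subsetDF <- df[df$hour==hour (lookupValue),]
                prevPreds <- rep (mean (subsetDF[[valueColumn]], 
                                        na.rm=TRUE), 
                                  window)
            }
        }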
        return (prevPreds)
}
```
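
To see the three cases of the lookup in isolation, here is a compact base-R sketch of the same pad-or-impute logic on toy data. The function `prevValues` is a simplified stand-in written for this illustration, not the function used in the competition code.

```r
# Simplified stand-in for the lookup: returns `window` values preceding
# lookupTime, padding with the mean of what was found, or falling back to
# the mean for the same hour of the day when nothing was found.
prevValues <- function(times, values, lookupTime, window = 4) {
  inWindow <- times >= lookupTime - window * 3600 &
    times <= lookupTime - 3600
  found <- values[inWindow]
  if (length(found) == window) return(found)
  if (length(found) > 0) {
    return(c(rep(mean(found, na.rm = TRUE), window - length(found)), found))
  }
  sameHour <- format(times, "%H") == format(lookupTime, "%H")
  rep(mean(values[sameHour], na.rm = TRUE), window)
}

t0 <- as.POSIXct("2011-01-01 00:00:00", tz = "UTC")
times <- t0 + 3600 * c(0, 1, 2, 3, 5, 34)
values <- c(10, 20, 30, 40, 60, 8)

full <- prevValues(times, values, t0 + 3600 * 4)      # 10 20 30 40
partial <- prevValues(times, values, t0 + 3600 * 7)   # 50 50 40 60
fallback <- prevValues(times, values, t0 + 3600 * 58) # 8 8 8 8
```

One consequence of this design is that the padding always goes at the front of the vector, regardless of which hours in the window were actually missing, so a padded value does not necessarily sit in the position of the hour it replaces.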


### Using Previous Predictions as Predictors

The code snippet below shows how the previous predictions can be built up in a sequential way and used to predict the next value.

```r
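# Assumes test.df is sorted by datetime, shares its column layout with
# composite.df, and that `fit` and `trainCol` are already defined.
# composite.df holds the known values plus the predictions made so far.
preds <- numeric (nrow (test.df))

# Positions of the lag columns, which are named "<trainCol>_..."
relevantCols <- grep (paste0 (trainCol, "_"), colnames (composite.df))

for (currentRow in seq_len (nrow (test.df))) {
        print (currentRow)        
        currentDataRow <- test.df[currentRow,]
        currentDateTime <- currentDataRow$datetime
        # Look up the values for the 4 hours preceding this row
        prevPreds <- getPrevPreds(composite.df, currentDateTime, window=4, 
                                  valueColumn=trainCol, impute=TRUE)
        
        currentDataRow[1,relevantCols] <- prevPreds
        preds[currentRow] <- predict (fit, currentDataRow)
        # Store the prediction so that later rows can use it as a lag value
        rowNumberInCompositeDF <- which (composite.df$datetime==currentDateTime)
        composite.df[rowNumberInCompositeDF, trainCol] <- preds[currentRow]
}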
```

The code snippet steps through each row in the test data frame and obtains a prediction. It then saves the prediction into `composite.df`, where it can be looked up as one of the previous values when later rows are predicted.
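
The same feed-forward idea can be tried end to end on synthetic data. The sketch below is a toy under stated assumptions: `lm` stands in for the XGBoost model, the demand series is simulated, and only two lags are used. None of the names or numbers come from the competition data.

```r
set.seed(42)
n <- 200
hourIdx <- 0:(n - 1)

# Simulated hourly demand with a daily cycle plus noise
series <- data.frame(
  hour = hourIdx %% 24,
  demand = 50 + 30 * sin(2 * pi * hourIdx / 24) + rnorm(n, sd = 3)
)
series$lag1 <- c(NA, head(series$demand, -1))
series$lag2 <- c(NA, NA, head(series$demand, -2))

trainRows <- 3:150
testRows <- 151:n

# Stand-in model: hour-of-day terms plus the two lagged values
fit <- lm(demand ~ sin(2 * pi * hour / 24) + cos(2 * pi * hour / 24) +
            lag1 + lag2, data = series[trainRows, ])

# Pretend the test targets are unknown, then predict in temporal order,
# writing each prediction back so the next rows can use it as a lag
history <- series$demand
history[testRows] <- NA
preds <- numeric(length(testRows))
for (i in seq_along(testRows)) {
  r <- testRows[i]
  newRow <- data.frame(hour = series$hour[r],
                       lag1 = history[r - 1],
                       lag2 = history[r - 2])
  preds[i] <- predict(fit, newRow)
  history[r] <- preds[i]
}
```

Because every prediction after the first few depends on earlier predictions rather than on observed values, errors can compound over the test horizon, which is worth keeping in mind when comparing this scheme against a model without lag features.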

#### This model resulted in a 101st place finish

![101st Place](files/images/xgb-prevPreds.png)