In this kernel we compare the prediction accuracy of the machine learning method "Random Forest" (default) with penalized linear models (tuned).



**1. Five Minute Model: Random Forest with ranger**

To show how easy, quick, and yet powerful machine learning methods can be, we start with the R package ranger, a fast implementation of Leo Breiman's original random forest algorithm. We read the data, replace loss with log(loss), build the model on the training data, make predictions on the test data, and finally write a well-formatted submission file.

In [None]:
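tic <- proc.time() # start time. Sparse code (just 5 min ..):

# stringsAsFactors = TRUE restores the read.csv default of R < 4.0 (categories as factors)
d <- read.csv("../input/train.csv", stringsAsFactors = TRUE)
v <- read.csv("../input/test.csv", stringsAsFactors = TRUE)
d$loss <- log(d$loss)  # model log(loss); predictions are transformed back with exp()
library(ranger)
m <- ranger(loss ~ ., data = d, importance = 'impurity', verbose = FALSE)
p <- predict(m, data = v)
write.csv(data.frame(id = v$id, loss = exp(p$predictions)), "subm.csv", row.names = FALSE)

print(proc.time() - tic)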

A late submission of this file (no upload required) will result in a score of 1191 (MAE) on the public leaderboard.

To get a clue about the model we can display the importance of the variables [top ten]: 

In [None]:
sort(m$variable.importance, decreasing = TRUE)[1:10]

The most important features seem to be cat80 and cat79, both categorical. Interestingly, id is listed in the top 10. Either it carries some meaning or it is an indicator of overfitting. Either way, it should not be there, so we exclude id in the next steps.


**2.  Split data and play with settings**

Now we split the training data into two halves. We train the model on the first half and evaluate it on the second half, and vice versa. The evaluation metric is the mean absolute error (MAE).
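
Because the models here are fitted to log(loss), predictions have to be transformed back with exp() before the MAE is computed. A minimal sketch with made-up numbers; the helper name `mae_from_log` is hypothetical and not part of this kernel:

```r
# Hypothetical helper: MAE on the original loss scale
# for predictions and targets that are both on the log scale.
mae_from_log <- function(pred_log, actual_log) {
  mean(abs(exp(pred_log) - exp(actual_log)))
}

# Toy values (not taken from the claims data)
actual <- c(100, 200, 400)
pred   <- c(110, 180, 400)

toy_mae <- mae_from_log(log(pred), log(actual))
print(toy_mae)  # absolute errors are 10, 20 and 0
```

The same expression, written inline, is what the evaluation cells in this section compute.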

In [None]:
# split training data based on row number (to get same samples on any system)
d1 <- d[seq_len(nrow(d)) %% 10 < 5, -1]  # -1: without id
d2 <- d[seq_len(nrow(d)) %% 10 >= 5, -1] # -1: without id
y1 <- d1$loss
y2 <- d2$loss

# print number of claims and variables
dim(d); dim(d1); dim(d2)
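
To see what the modulo-based split does, here is a sketch on a toy data frame with 20 rows (a hypothetical stand-in for d, with made-up columns). Rows alternate between the two halves in blocks of five, and the result depends only on the row number, not on a random seed:

```r
# Toy stand-in for d: id in the first column, as in the claims data
toy <- data.frame(id = 1:20, cont1 = (1:20) / 20, loss = log(100 * (1:20)))

row_nr     <- seq_len(nrow(toy))
first_half <- row_nr %% 10 < 5   # TRUE for rows 1-4, 10-14 and 20

t1 <- toy[first_half, -1]        # -1: without id
t2 <- toy[!first_half, -1]

which(first_half)                # row numbers that go into the first half
dim(t1); dim(t2)                 # both halves have 10 rows and 2 columns
```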

Again we use ranger with default settings.

In [None]:
# ranger tuning parameters: default values
# mtry = 11. Number of variables to possibly split at in each node. Default is the (rounded down) square root of the number of variables.
# num.trees = 500. Number of trees.
tic <- proc.time()        # timer: start

rg1 <- ranger(loss ~ ., data = d1, importance = 'impurity', verbose = FALSE)
sort(rg1$variable.importance, decreasing = TRUE)[1:10] # display top10 variables
rgpred12 <- predict(rg1, data = d2)
# back-transform predictions and targets with exp() to get the MAE on the loss scale
print(paste("MAE Ranger,12 = ", mean(abs(exp(rgpred12$predictions) - exp(y2)))))

print(proc.time() - tic)  # timer: stop
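
The default mtry quoted in the comments above follows from the number of predictors. A quick check, assuming 130 predictor columns remain once id is dropped and loss is set aside (this count is an assumption; with the data loaded, `ncol(d1) - 1` gives the actual number):

```r
n_predictors <- 130                        # assumed; see ncol(d1) - 1
default_mtry <- floor(sqrt(n_predictors))  # square root, rounded down
print(default_mtry)                        # sqrt(130) is about 11.4
```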

In [None]:
rg2 <- ranger(loss ~ ., data = d2, importance = 'impurity', verbose = FALSE)
sort(rg2$variable.importance, decreasing = TRUE)[1:10] # display top10 variables
rgpred21 <- predict(rg2, data = d1)
print(paste("MAE Ranger,21 = ", mean(abs(exp(rgpred21$predictions) - exp(y1)))))

Result: 
The mean absolute errors of the two models, each trained on just half of the claims, are quite similar, yet higher than the MAE of the model trained on all claims. As expected, more data allows for more model complexity and thus better predictions.

We can take these results as a benchmark and try to improve the model to get a lower MAE. To speed things up, we start with fewer trees and guess a reasonable number of split variables.

In [None]:
rg1 <- ranger(loss ~ ., data = d1, num.trees = 400, mtry = 40, importance = 'impurity', verbose = FALSE)
rgpred12 <- predict(rg1, data = d2)
print(paste("MAE Ranger,12 = ", mean(abs(exp(rgpred12$predictions) - exp(y2)))))

rg2 <- ranger(loss ~ ., data = d2, num.trees = 400, mtry = 40, importance = 'impurity', verbose = FALSE)
rgpred21 <- predict(rg2, data = d1)
print(paste("MAE Ranger,21 = ", mean(abs(exp(rgpred21$predictions) - exp(y1)))))

A bit better. Since two parameters were changed at the same time, we cannot tell which change helped or how to continue. Performing a grid search might help here.
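
A grid search over mtry and num.trees could look roughly like the sketch below. It runs on a small synthetic data set (a hypothetical stand-in, not the claims data) so that it finishes in seconds; the grid values are arbitrary choices for illustration, and on the real data each fit would take considerably longer:

```r
library(ranger)  # assumes ranger is installed, as in the cells above

set.seed(1)
# Synthetic stand-in for d1/d2 (made-up data)
n   <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$loss <- toy$x1 + 0.5 * toy$x2^2 + rnorm(n, sd = 0.1)
train <- toy[1:200, ]
valid <- toy[201:400, ]

# All combinations of the two tuning parameters
grid <- expand.grid(mtry = c(1, 2, 4), num.trees = c(50, 100))
grid$mae <- NA_real_

for (i in seq_len(nrow(grid))) {
  fit <- ranger(loss ~ ., data = train,
                mtry = grid$mtry[i], num.trees = grid$num.trees[i],
                seed = 1, verbose = FALSE)
  pred <- predict(fit, data = valid)$predictions
  grid$mae[i] <- mean(abs(pred - valid$loss))  # validation MAE for this setting
}

grid[order(grid$mae), ]  # best setting first
```

Because only one parameter differs between neighbouring grid rows, the table shows the effect of each setting separately, which the single mtry = 40, num.trees = 400 run could not.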

Anyway, how does this compare to linear models?

**3. Regularized Linear Models: LASSO**

In machine learning, the term "linear models" usually refers to the newer, regularized variants of linear models such as Ridge regression or LASSO. These methods are well implemented in the R package glmnet. Let's try LASSO, as this method performs automatic variable selection.


In [None]:
library(glmnet) 
# create matrix with dummy variables for factors (one-hot-encoding)
x1 <- model.matrix(loss~.-1,data=d1) 
x2 <- model.matrix(loss~.-1,data=d2)
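
To illustrate what model.matrix produces, here is a sketch on a tiny made-up data frame (column names mimic the claims data but the values are invented). With the intercept removed via `-1`, the first factor gets one dummy column per level. The second call shows that a row subset of a data frame keeps all factor levels, so both halves end up with the same columns as long as the categorical variables are stored as factors:

```r
toy <- data.frame(
  loss  = c(7.1, 6.5, 8.0, 7.4),
  cat1  = factor(c("A", "B", "A", "B")),
  cont1 = c(0.2, 0.5, 0.9, 0.1)
)

x_toy <- model.matrix(loss ~ . - 1, data = toy)
colnames(x_toy)  # "cat1A" "cat1B" "cont1"

# Subset that contains only level "A": the level "B" column is still created
x_sub <- model.matrix(loss ~ . - 1, data = toy[c(1, 3), ])
colnames(x_sub)
```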

Fortunately there are no missing values in the records. This allows us to directly apply linear models without further preprocessing.
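
Missing values would matter here because model.matrix, under R's default na.action (na.omit), silently drops incomplete rows, so the design matrix and the response vector would no longer line up. A small sketch on made-up data; on the real data, `anyNA(d)` performs the same check:

```r
toy <- data.frame(
  loss  = c(7.1, 6.5, 8.0),
  cont1 = c(0.2, NA, 0.9)
)

anyNA(toy)     # TRUE: one value is missing

x_toy <- model.matrix(loss ~ . - 1, data = toy)
nrow(x_toy)    # 2: the incomplete row was dropped
length(toy$loss)  # 3: the response still has all rows
```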

In the following we calculate a set of linear models for a grid of lambdas (the complexity-parameter). Then we determine the best model by using cross validation (default: 10-fold) and make the predictions.

In [None]:
tic <- proc.time()        # timer: start

fit.lasso <- glmnet(x1, y1) # Lasso with default-grid
plot(fit.lasso, xvar = "lambda", main = "Lasso")
cv1 <- cv.glmnet(x1, y1)
pred12 <- predict(fit.lasso, s = cv1$lambda.min, newx = x2)

print(proc.time() - tic)  # timer: stop

On the left-hand side of the plot the penalty is almost zero, so we see the coefficients of the (usually overfitting) full model. On the right-hand side the penalty is very high and all coefficients are set to zero; only the intercept (the average) remains. Such a model usually underfits.
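
For reference, the criterion that glmnet minimizes for the LASSO with a Gaussian response (its defaults) can be written as follows. The notation is introduced here for illustration: $N$ is the number of rows of x1, $x_i$ its $i$-th row, $y_i$ the corresponding log(loss), $\beta_0$ the intercept, $\beta$ the vector of $p$ coefficients, and $\lambda$ the penalty weight whose logarithm is shown on the horizontal axis of the plot.

$$
\min_{\beta_0,\,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
$$

For $\lambda$ close to zero this is ordinary least squares; as $\lambda$ grows, the absolute-value penalty pushes more and more coefficients to exactly zero, which is why the paths in the plot vanish one by one towards the right.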

The best lambda and thus the best model complexity is determined by cross-validation. The results:

In [None]:
print(paste("Best Log(Lambda),1 = ",log(cv1$lambda.min)))
print(paste("MAE Lasso,12 = ",mean(abs(exp(pred12)-exp(y2)))))

The MAE is substantially worse.

**4. Summary**

Based on this claims data set, a default random forest model outperforms an optimized regularized linear model in prediction accuracy (and speed).



**Extensions**

a) Gradient boosting: Gradient boosting machines like xgboost and lightgbm usually need their settings (called “hyperparameters”) to be carefully tuned. Despite that, even a default GBM model is better than the random forest; see https://www.kaggle.com/floser/r-starter-lightgbm-regression, which is based on the same data set.

b) Neural nets: Neural nets have recently become very successful in claim prediction competitions. 
Unfortunately, training on large data sets requires a lot of computing power. An introduction based on small data sets can be found here: https://www.kaggle.com/floser/neuralnet-plots-and-deeper-learning .

c) Generalized Linear Mixed Models (GLMM): Further extensions of linear models are available. If you are interested in how to apply Generalized Linear Models (GLMs) and Generalized Linear Mixed Models (GLMMs) to claims data with R, see https://www.kaggle.com/floser/claim-frequency-glms-and-glmms .
