In [1]:
##Libraries
library(tidyverse)
library(tidymodels)
library(embed)
library(themis)


# Read the data and convert the categorical columns (cat1 to cat116) to factors
train <- read_csv("/kaggle/input/allstate-claims-severity/train.csv") %>%
  mutate(across(cat1:cat116, as.factor))

test <- read_csv("/kaggle/input/allstate-claims-severity/test.csv") %>%
  mutate(across(cat1:cat116, as.factor))



In this next step I log-transform the target (loss) and create my recipe. The recipe that worked best for me simply target-encodes the nominal predictors; the random forest needed no further feature engineering from there.

In [2]:
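# Model the log of the loss; predictions are exponentiated back later
train$loss <- log(train$loss)
rf_recipe <- recipe(loss ~ ., data = train) %>%
  step_lencode_mixed(all_nominal_predictors(), outcome = vars(loss))

# Prep and bake the recipe to inspect the encoded training data
prepped <- prep(rf_recipe)
baked <- bake(prepped, new_data = NULL)
baked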



boundary (singular) fit: see help('isSingular')



id,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,⋯,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,cont14,loss
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,7.808970,7.929607,7.647261,7.832552,7.609000,7.774355,7.658897,7.666263,7.934139,⋯,0.718367,0.335060,0.30260,0.67135,0.83510,0.569745,0.594646,0.822493,0.714843,7.702186
2,7.808970,7.929607,7.647261,7.616529,7.609000,7.774355,7.658897,7.666263,7.934139,⋯,0.438917,0.436585,0.60087,0.35127,0.43919,0.338312,0.366307,0.611431,0.304496,7.157424
5,7.808970,7.929607,7.647261,7.616529,7.831398,7.774355,7.658897,7.666263,7.934139,⋯,0.289648,0.315545,0.27320,0.26076,0.32446,0.381398,0.373424,0.195709,0.774425,8.008063
10,7.310865,7.929607,7.647261,7.832552,7.609000,7.774355,7.658897,7.666263,7.934139,⋯,0.440945,0.391128,0.31796,0.32128,0.44467,0.327915,0.321570,0.605077,0.602642,6.845720
11,7.808970,7.929607,7.647261,7.832552,7.609000,7.774355,7.658897,7.666263,7.934139,⋯,0.178193,0.247408,0.24564,0.22089,0.21230,0.204687,0.202213,0.246011,0.432606,7.924380
13,7.808970,7.929607,7.647261,7.616529,7.609000,7.774355,7.658897,7.666263,7.934139,⋯,0.364464,0.401162,0.26847,0.46226,0.50556,0.366788,0.359249,0.345247,0.726792,8.545367
14,7.808970,7.498451,7.647261,7.616529,7.831398,7.774355,7.658897,7.666263,7.519834,⋯,0.381515,0.363768,0.24564,0.40455,0.47225,0.334828,0.352251,0.342239,0.382931,7.031936
20,7.808970,7.929607,7.647261,7.832552,7.609000,7.774355,7.658897,7.666263,7.934139,⋯,0.867021,0.583389,0.90267,0.84847,0.80218,0.644013,0.785706,0.859764,0.242416,8.184723
23,7.808970,7.929607,8.340463,7.832552,7.831398,7.774355,7.658897,7.666263,7.934139,⋯,0.628534,0.384099,0.61229,0.38249,0.51111,0.682315,0.669033,0.756454,0.361191,9.237975
24,7.808970,7.929607,7.647261,7.616529,7.831398,7.478077,7.658897,7.666263,7.934139,⋯,0.713343,0.469223,0.30260,0.67135,0.83510,0.863052,0.879347,0.822493,0.294523,8.729816
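To make the encoded columns above easier to read, here is a minimal base-R sketch of the idea behind target encoding: each factor level is replaced by a summary of the outcome for that level. It is a simplification, not what `step_lencode_mixed()` does internally; that step fits a mixed model with partial pooling, while this sketch uses plain per-level means. The toy data, the column `cat1_encoded`, and the fallback to the overall mean for unseen levels are all assumptions made for illustration.

```r
# Toy data: one categorical predictor and a log-scale outcome (made-up values).
toy <- data.frame(
  cat1 = factor(c("A", "A", "B", "B", "B", "C")),
  loss = log(c(1000, 3000, 500, 700, 900, 2000))
)

# Plain per-level mean of the outcome (no pooling, unlike step_lencode_mixed()).
level_means <- vapply(split(toy$loss, toy$cat1), mean, numeric(1))

# Replace each level with its encoded value.
toy$cat1_encoded <- unname(level_means[as.character(toy$cat1)])

# Levels unseen in training ("D" here) fall back to the overall mean in this sketch.
new_levels <- c("A", "D")
encoded_new <- unname(level_means[new_levels])
encoded_new[is.na(encoded_new)] <- mean(toy$loss)

print(toy)
print(encoded_new)
```

Because the outcome is on the log scale, the encoded values sit in the same range as log(loss), which is why the encoded cat columns in the baked output are all close to 7 or 8.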


Here is where I set up and fit my model. I initially set every parameter to tune(), but have since hard-coded the best values to save runtime. My best model was a random forest with 1000 trees, 5 predictors sampled at every split, and a minimum node size of 2.

In [3]:
#Set up the model
my_mod <- rand_forest(mtry = 5,
                      min_n = 2,
                      trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")

## Create a workflow with model & recipe
rf_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(my_mod) %>%
  fit(data=train)

boundary (singular) fit: see help('isSingular')



This is where I performed my cross-validation. Tuning this model took about 22 hours, so I have commented the code out to keep the notebook's runtime down.

In [4]:
## Set up grid of tuning values

#CV Results 1,23
# ## NOTE: rf_workflow is the unfitted, tunable workflow (the recipe plus a model
# ## with mtry = tune() and min_n = tune()); it is not defined in this version.
# tuning_grid <- grid_regular(mtry(c(1,5)),
#                             min_n(),
#                             levels = 5) ## L^2 total tuning possibilities
#
# ## Set up K-fold CV
# folds <- vfold_cv(train, v = 3, repeats = 1)
#
# ## Run the CV
# CV_results <- rf_workflow %>%
#   tune_grid(resamples = folds,
#             grid = tuning_grid,
#             metrics = metric_set(mae)) # Or leave metrics NULL
#
# ## Plot mean MAE for each combination of tuning values
# collect_metrics(CV_results) %>% # Gathers metrics into DF
#   filter(.metric == "mae") %>%
#   ggplot(data = ., aes(x = mtry, y = mean, color = factor(min_n))) +
#   geom_line()
#
# collect_metrics(CV_results)
#
# CV_results
# ## Find Best Tuning Parameters
# bestTune <- CV_results %>%
#   select_best(metric = "mae")
# bestTune
#
# ## Finalize the Workflow & fit it
# final_wf <- rf_workflow %>%
#   finalize_workflow(bestTune) %>%
#   fit(data = train)
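A rough way to see why the tuning run is so slow is to count the forests it has to grow. The sketch below uses only base R and mirrors the grid above: five values of mtry, five values of min_n, and three folds. The min_n range of 2 to 40 and the use of 1000 trees during tuning are assumptions for illustration; neither is confirmed by the notebook output.

```r
# Candidate values: mtry 1 to 5, and five evenly spaced min_n values.
# The min_n range of 2 to 40 is an assumption, not taken from the notebook.
mtry_values  <- 1:5
min_n_values <- round(seq(2, 40, length.out = 5))

grid <- expand.grid(mtry = mtry_values, min_n = min_n_values)

n_folds <- 3     # v = 3 in vfold_cv()
n_trees <- 1000  # assumed to match trees = 1000 in the final model

n_fits        <- nrow(grid) * n_folds
n_total_trees <- n_fits * n_trees

cat("Parameter combinations:", nrow(grid), "\n")
cat("Forests fitted during CV:", n_fits, "\n")
cat("Trees grown during CV:", n_total_trees, "\n")
```

Each of those fits also re-estimates the target encoding on its own training fold, which adds to the cost of every resample.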

This is where I create my predictions and prepare the data for submission. Notice that I exponentiate the predictions to reverse the log transformation applied to loss at the beginning.
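The round trip between the two scales can be checked on a few made-up numbers. This base-R sketch shows that `exp()` undoes `log()`, and also that an average taken on the log scale back-transforms to the geometric mean rather than the arithmetic mean. The values in `loss` are invented for illustration and are not from the competition data.

```r
# Made-up claim amounts on the original scale.
loss <- c(500, 1200, 3000, 8000)

# Train-time transform: model the log of the loss.
log_loss <- log(loss)

# exp() inverts log() (up to floating-point error).
recovered <- exp(log_loss)
stopifnot(isTRUE(all.equal(recovered, loss)))

# A log-scale average back-transforms to the geometric mean,
# which is never larger than the arithmetic mean.
back_transformed <- exp(mean(log_loss))
arithmetic_mean  <- mean(loss)

cat("exp(mean(log(loss))):", back_transformed, "\n")
cat("mean(loss):          ", arithmetic_mean, "\n")
```

In practice this means exponentiated predictions from a model fit on log(loss) tend to sit below the average loss, which is worth keeping in mind when comparing scores across the two scales.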

In [5]:
rf_predictions <- rf_wf %>%
  predict(new_data = test)

Sub1 <- rf_predictions %>% 
  bind_cols(test) %>% 
  select(id,.pred) %>%
  rename(loss = .pred) %>%
  mutate(loss = exp(loss))
  

write_csv(Sub1, "RFSubmission.csv")

“Novel levels found in column 'cat89': 'F'. The levels have been removed, and values have been coerced to 'NA'.”
“Novel levels found in column 'cat92': 'E', 'G'. The levels have been removed, and values have been coerced to 'NA'.”
“Novel levels found in column 'cat96': 'H'. The levels have been removed, and values have been coerced to 'NA'.”
“Novel levels found in column 'cat99': 'U'. The levels have been removed, and values have been coerced to 'NA'.”
“Novel levels found in column 'cat103': 'M'. The levels have been removed, and values have been coerced to 'NA'.”
“Novel levels found in column 'cat106': 'Q'. The levels have been removed, and values have been coerced to 'NA'.”
“Novel levels found in column 'cat109': 'AD'. The levels have been removed, and values have been coerced to 'NA'.”
“Novel levels found in column 'cat110': 'BH', 'CA', 'EN'. The levels have been removed, and values have been coerced to 'NA'.”
“Novel levels found in column 'cat111': 'L'. The levels have been removed