In [None]:
## We will complete the Allstate Claims Severity competition using a boosted tree model. The goal of this competition is to predict the cost of a claim from a variety of factors. We start by loading the necessary packages and data.

In [1]:
library(tidyverse)
library(tidymodels)
library(vroom)
library(embed)
library(bonsai)
library(lightgbm)

train <- vroom("/kaggle/input/allstate-claims-severity/train.csv")
test <- vroom("/kaggle/input/allstate-claims-severity/test.csv")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.6      ✔ rsample

In [None]:
## Next, we make a recipe to clean the data. We remove the 'id' column because it isn't useful for prediction. We then combine rare categories into a single level, target encode the categorical predictors, remove predictors that are highly correlated with one another (absolute correlation above 0.6) to reduce multicollinearity, normalize the numeric predictors, and remove any predictors with zero variance.

In [4]:
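my_recipe <- recipe(loss ~ ., data = train) |>
  step_rm(id) |>                                                       # drop the row identifier
  step_other(all_nominal_predictors(), threshold = .001) |>            # pool levels seen in under 0.1% of rows
  step_lencode_glm(all_nominal_predictors(), outcome = vars(loss)) |>  # target encode the categorical predictors
  step_corr(all_numeric_predictors(), threshold = 0.6) |>              # drop highly correlated predictors
  step_normalize(all_numeric_predictors()) |>                          # center and scale
  step_zv(all_predictors())                                            # drop zero-variance predictors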
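
In [None]:
## To make the first two encoding steps concrete, here is a minimal base R sketch on a made-up toy data frame. It pools rare levels and then replaces each level with the mean outcome for that level, which is roughly what target encoding does for a numeric outcome. The data, the 20% threshold, and the column names are assumptions for illustration; this is not how the embed package implements step_lencode_glm().

```r
# Made-up toy data: one categorical predictor and a numeric outcome
toy <- data.frame(
  cat1 = c("A", "A", "A", "B", "B", "C"),
  loss = c(100, 200, 300, 1000, 2000, 50)
)

# 1. Pool rare levels (here: levels seen in under 20% of rows) into "other",
#    which is the idea behind step_other()
freq <- table(toy$cat1) / nrow(toy)
rare <- names(freq)[freq < 0.20]
toy$cat1_pooled <- ifelse(toy$cat1 %in% rare, "other", toy$cat1)

# 2. Replace each (pooled) level with the mean outcome observed for that level
level_means <- tapply(toy$loss, toy$cat1_pooled, mean)
toy$cat1_encoded <- as.numeric(level_means[toy$cat1_pooled])

toy
```

## On the real data, prep(my_recipe) followed by bake(new_data = NULL) shows what the full recipe produces.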

In [None]:
## Next, we create our model. We tune the number of trees, the tree depth, and the learn rate to find good values for each parameter. We use the 'lightgbm' engine and set the mode to regression because the response (loss) is quantitative rather than categorical.

In [5]:
model <- boost_tree(
  mode = "regression",
  engine = "lightgbm",
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune()
)
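
In [None]:
## As a rough illustration of what the number of trees and the learn rate control, here is a toy gradient boosting loop in base R. It uses depth-1 trees (stumps) and squared-error residuals. This is a teaching sketch on made-up data, not LightGBM's algorithm, and every name in it is hypothetical.

```r
# Made-up data: one numeric predictor and a noisy step-shaped response
set.seed(1)
x <- runif(200)
y <- ifelse(x > 0.5, 10, 2) + rnorm(200)

# Fit a depth-1 tree: choose the cut point that minimizes squared error
fit_stump <- function(x, r) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]  # cutting at the maximum would leave one side empty
  sse <- sapply(cuts, function(cc) {
    left <- r[x <= cc]
    right <- r[x > cc]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

# Each new tree is fit to the current residuals and added with a shrunken weight
boost <- function(x, y, trees, learn_rate) {
  pred <- rep(mean(y), length(y))
  for (i in seq_len(trees)) {
    s <- fit_stump(x, y - pred)
    pred <- pred + learn_rate * predict_stump(s, x)
  }
  pred
}

# Training mean absolute error with few versus many trees at the same learn rate
mae_few  <- mean(abs(y - boost(x, y, trees = 5,   learn_rate = 0.1)))
mae_many <- mean(abs(y - boost(x, y, trees = 100, learn_rate = 0.1)))
c(few = mae_few, many = mae_many)
```

## With a small learn rate each tree corrects only a fraction of the remaining error, so more trees are needed; this is why the two parameters are usually tuned together.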

In [None]:
## We now combine the recipe and model into a workflow. We make a regular tuning grid over the three tuning parameters, with 3 levels per parameter (3 × 3 × 3 = 27 candidate combinations) to keep the computational load manageable. We then evaluate every candidate in the grid with 5-fold cross-validation on the training data.

In [None]:
workflow <- workflow() |>
  add_recipe(my_recipe) |>
  add_model(model)

grid <- grid_regular(trees(),
                     tree_depth(),
                     learn_rate(),
                     levels = 3)

fold <- vfold_cv(train, v = 5, repeats = 1)

CV <- workflow |>
  tune_grid(
    resamples = fold,
    grid = grid,
    metrics = metric_set(mae),
    control = control_grid(verbose = TRUE)
  )

i Fold1: preprocessor 1/1
✓ Fold1: preprocessor 1/1
i Fold1: preprocessor 1/1, model 1/9
✓ Fold1: preprocessor 1/1, model 1/9
i Fold1: preprocessor 1/1, model 1/9 (extracts)
i Fold1: preprocessor 1/1, model 1/9 (predictions)
i Fold1: preprocessor 1/1, model 2/9
✓ Fold1: preprocessor 1/1, model 2/9
i Fold1: preprocessor 1/1, model 2/9 (extracts)
i Fold1: preprocessor 1/1, model 2/9 (predictions)
i Fold1: preprocessor 1/1, model 3/9
✓ Fold1: preprocessor 1/1, model 3/9
i Fold1: preprocessor 1/1, model 3/9 (extracts)
i Fold1: preprocessor 1/1, model 3/9 (predictions)
i Fold1: preprocessor 1/1, model 4/9
✓ Fold1: preprocessor 1/1, model 4/9
i Fo

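In [None]:
## The grid in the cell above relies on the default parameter ranges from dials, which are broad, and the learn rate is spaced on a log10 scale. The sketch below prints the default grid so the candidate values can be inspected, then builds an alternative grid with narrower ranges. The narrower ranges are illustrative assumptions, not tuned recommendations.

```r
library(dials)

# Default ranges, as used by the grid in the cell above
grid_default <- grid_regular(trees(), tree_depth(), learn_rate(), levels = 3)
nrow(grid_default)                      # 3 levels for each of 3 parameters
sort(unique(grid_default$learn_rate))   # inspect the candidate learn rates

# Narrower ranges (assumed values for illustration only)
grid_custom <- grid_regular(
  trees(range = c(200L, 1000L)),
  tree_depth(range = c(3L, 9L)),
  learn_rate(range = c(-2, -0.5)),      # log10 scale: 0.01 up to about 0.32
  levels = 3
)
grid_custom
```

## Either grid can be passed to tune_grid() through its grid argument.
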
In [None]:
## Now that we have the cross-validation results, we select the best parameter values using mean absolute error (the metric this competition is judged on). We then finalize the workflow with those values, fit it on the full training data, and predict the values in the test dataset. Finally, we create a submission file to submit to Kaggle.

In [2]:
best <- CV |> select_best(metric = 'mae')
best

final_wf <- workflow |>
  finalize_workflow(best) |>
  fit(data = train)

pred <- predict(final_wf, new_data = test)

submission <- pred |>
  mutate(id = test$id, loss = .pred) |>
  select(id, loss)

vroom_write(x = submission, file = "./AllstateBoosted.csv", delim = ",")

ERROR: Error in eval(expr, envir, enclos): object 'CV' not found
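
In [None]:
## Before writing the file, it can help to check that the submission has the expected shape. The sketch below uses small made-up stand-ins for test and pred (the names test_toy, pred_toy, and submission_toy are hypothetical) and assumes the submission needs exactly two columns, id and loss.

```r
# Hypothetical stand-ins for the real `test` and `pred` objects
test_toy <- data.frame(id = c(4L, 6L, 9L))
pred_toy <- data.frame(.pred = c(1800.5, 2300.0, 950.25))

# Build the submission with explicit column names instead of column positions
submission_toy <- data.frame(id = test_toy$id, loss = pred_toy$.pred)

# Sanity checks: correct columns, one row per test id, no missing predictions
stopifnot(
  identical(names(submission_toy), c("id", "loss")),
  nrow(submission_toy) == nrow(test_toy),
  !anyNA(submission_toy$loss)
)
```

## The same stopifnot() checks can be run on the real submission object once the tuning cell has finished and CV exists in the session.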
