In [None]:
#load libraries
library(embed)
library(themis)
library(vroom)
library(tidyverse)
library(tidymodels)
library(DataExplorer)
library(bonsai)
library(lightgbm)

In [None]:
#read in data
train <- vroom("/kaggle/input/allstate-claims-severity/train.csv")
test <- vroom("/kaggle/input/allstate-claims-severity/test.csv")

**Exploratory Data Analysis**

In [None]:
#eda
plot_correlation(train, type = "continuous") #high correlation off the diagonal

hist(train$loss) #right skewed

train_ex <- train %>%
  mutate(loss_ex = log(loss))
hist(train_ex$loss_ex) #log transform makes response more normal

train_ex_2 <- train %>%
  mutate(loss_ex_2 = (loss+1)^.25) #transformation as suggested by other winners
hist(train_ex_2$loss_ex_2)


Exploratory analysis shows that several of the continuous variables are highly correlated with one another. The data set also contains a large number of predictors. Together, these suggest that a feature engineering step may be needed to remove some of the highly correlated variables.

Further exploration shows a right-skewed response. A log transformation is a natural choice for large, positive, right-skewed values like these. Winners of this competition have also suggested the following transformation of the response: (loss + 1)^.25. This also results in more normal-looking data, although it is still slightly right skewed. After running models with both the log transformation and the (loss + 1)^.25 transformation, I selected the latter for my final model. I will refer to this transformation as the "winner" transformation.
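As a rough illustration of the comparison above, the sketch below measures sample skewness before and after each transformation and checks that the winner transformation can be undone. The data are simulated log-normal values, which is an assumption made only so the example runs on its own; the helper names `skewness`, `to_winner`, and `from_winner` are hypothetical and not part of the notebook.

```r
# Simulated right-skewed "loss" values (an assumption, not the Allstate data)
set.seed(42)
loss <- rlnorm(10000, meanlog = 7.7, sdlog = 0.8)

# Sample skewness: the third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Winner transformation and its inverse (hypothetical helper names)
to_winner   <- function(x) (x + 1)^0.25
from_winner <- function(y) y^4 - 1

# Compare skewness on the raw, log, and winner scales
skews <- c(raw    = skewness(loss),
           log    = skewness(log(loss)),
           winner = skewness(to_winner(loss)))
print(round(skews, 2))

# Applying the inverse recovers the original values
round_trip_ok <- isTRUE(all.equal(from_winner(to_winner(loss)), loss))
print(round_trip_ok)
```

On simulated data like this, the winner scale should be much less skewed than the raw scale while keeping some right skew. The inverse, `y^4 - 1`, is what returns predictions to the original loss scale.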

**Transform Response and Create Recipe**

In [None]:
train <- train %>%
    mutate(loss = (loss+1)^.25)

allstate_recipe <- recipe(loss ~ ., train) %>% #w/ winner transformation
  step_lencode_mixed(all_nominal_predictors(), outcome = vars(loss))

The code above applies the winner transformation to the response before the recipe is defined. The recipe then uses target encoding on all of the categorical predictors.
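To make the idea of target encoding concrete, here is a minimal base R sketch on a toy data frame. It is a deliberate simplification: it replaces each category level with the plain mean of the response for that level, whereas `step_lencode_mixed()` estimates the level effects with a mixed model, so its encodings are pulled toward the overall mean, especially for rare levels. The column names and values are made up for illustration.

```r
# Toy data: hypothetical category levels and an already-transformed response
toy <- data.frame(
  cat1 = c("A", "A", "B", "B", "B", "C"),
  loss = c(6, 8, 4, 5, 6, 10)
)

# Unpooled target encoding: mean response within each level
level_means <- tapply(toy$loss, toy$cat1, mean)

# Replace each category value with the mean for its level
toy$cat1_encoded <- as.numeric(level_means[as.character(toy$cat1)])
print(toy)
```

Level C has a single observation, so its plain mean is just that one value. This is the kind of case where the partial pooling in a mixed model gives a more cautious encoding.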

**Creating the Model**

In [None]:
#create model
boost_model <- boost_tree(tree_depth = tune(),
                          trees = tune(),
                          learn_rate = tune(),
                          mode = "regression") %>%
  set_engine("lightgbm") #or "xgboost" but lightgbm is faster

#set workflow
boost_wf <- workflow() %>%
  add_recipe(allstate_recipe) %>%
  add_model(boost_model)

This is a boosted tree model using the lightgbm engine. Tree depth, the number of trees, and the learning rate are marked with tune() so that they can be tuned. Since the response is numeric, the mode is regression. The model and recipe are then added to a workflow.

**Cross Validation**

In [None]:
#set up tuning grid
boost_tuneGrid <- grid_regular(tree_depth(), trees(), learn_rate(), levels = 3)

#set up cv
boost_folds <- vfold_cv(train, v = 5, repeats = 1)

CV_boost_results <- boost_wf %>%
  tune_grid(resamples = boost_folds,
            grid = boost_tuneGrid,
            metrics = metric_set(mae))

#find best tuning parameters
bestTune_boost <- CV_boost_results %>%
  select_best(metric = "mae") #mean absolute error used by this particular Kaggle comp

To find the best values for tree depth, number of trees, and learning rate, we perform cross-validation. This splits the training data into 5 folds and evaluates every combination in the tuning grid (3 levels of each of the three parameters) to see which combination gives the lowest mean absolute error. Mean absolute error was chosen as the metric because it is what the Kaggle competition uses to score submissions.
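The cost of this search can be estimated before running it. The sketch below builds a 3 x 3 x 3 grid in base R and counts the model fits needed for 5-fold cross-validation. The candidate values are hypothetical placeholders and are not the defaults that `grid_regular()` would generate.

```r
# Hypothetical candidate values, three per parameter (not the dials defaults)
grid <- expand.grid(
  tree_depth = c(3, 6, 9),
  trees      = c(500, 1000, 1500),
  learn_rate = c(0.01, 0.05, 0.1)
)

n_folds  <- 5
n_combos <- nrow(grid)          # one row per parameter combination
n_fits   <- n_combos * n_folds  # each combination is fit once per fold

print(c(combinations = n_combos, fits = n_fits))
```

With three levels per parameter, the grid has 27 combinations, so 5 folds require 135 model fits before the final fit on the full training data.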

**Fit the Model and Make Predictions**

In [None]:
#finalize workflow and fit it
final_boost_wf <- boost_wf %>%
  finalize_workflow(bestTune_boost) %>%
  fit(train)

#make predictions
pred_boost <- predict(final_boost_wf, new_data = test) 

Next, we finalize the workflow with the optimal tuning parameters and fit it on the training data. Once the model is fit, we make predictions on the test data.

**Format for Kaggle**

In [None]:
#format for Kaggle
boost_final <- pred_boost %>%
  bind_cols(test) %>%
  select(id,.pred) %>%
  rename(loss = .pred) %>%
  mutate(loss = loss^4-1)

write_csv(boost_final, "boostSubmission.csv")

##SCORE: 1,123.17

The boosted model with the winner transformation and target encoding resulted in a score of 1,123.17. This is the 1,462nd best score of 3,048 competitors.