First, I load the necessary libraries.

In [1]:
# LIBRARIES
library(tidyverse)
library(tidymodels)
library(vroom)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ rsample

Now I need to read in the data.

In [2]:
# READ IN DATA
all_train <- vroom('/kaggle/input/allstate-claims-severity/train.csv')
all_test <- vroom('/kaggle/input/allstate-claims-severity/test.csv')

Rows: 188318 Columns: 132
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (116): cat1, cat2, cat3, cat4, cat5, cat6, cat7, cat8, cat9, cat10, cat1...
dbl  (16): id, cont1, cont2, cont3, cont4, cont5, cont6, cont7, cont8, cont9...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 125546 Columns: 131
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (116): cat1, cat2, cat3, cat4, cat5, cat6, cat7, cat8, cat9, cat10, cat1...
dbl  (15): id, cont1, cont2, cont3, cont4, cont5, cont6, cont7, cont8, cont9...

ℹ Use `spec()` to retrieve the full column specification f

Next comes feature engineering and modeling. I first take the log of the loss variable. I then build a recipe (each step is explained in the code comments), specify a linear regression model, and fit the workflow. Finally, I generate predictions for the test set, exponentiate them back to the original loss scale, format them the way Kaggle expects (an id column and a loss column), and write the submission csv file.
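To illustrate why the predictions later need `exp()`, here is a minimal sketch on made-up loss values (the vector `toy_loss` is hypothetical and not taken from the Allstate data). It only shows that `exp()` undoes `log()`, so a model trained on log loss produces predictions that must be exponentiated to get back to the original scale.

```r
# Hypothetical loss values on the original scale
toy_loss <- c(120.5, 2213.18, 45000)

# Log transform: compresses the large values, as done for the training target
toy_log <- log(toy_loss)

# Back-transform: exp() returns the values to the original scale
toy_back <- exp(toy_log)

# Check that the round trip recovers the original values (up to floating point error)
print(all.equal(toy_back, toy_loss))
```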

In [3]:
# FEATURE ENGINEERING

# take log of loss variable
all_train$loss <- log(all_train$loss)

# recipe
my_recipe <- recipe(loss ~ ., data = all_train) %>%
  update_role(id, new_role = 'ID') %>% # treat id as an identifier, not a predictor
  step_scale(all_numeric_predictors()) %>% # scale all numeric predictors to have sd = 1
  step_corr(all_numeric_predictors(), threshold = .6) %>% # remove numeric predictors as needed so that no absolute pairwise correlation exceeds 0.6
  step_novel(all_nominal_predictors()) %>% # assign factor levels not seen in training to a "new" level
  step_unknown(all_nominal_predictors()) %>% # assign missing values to an "unknown" level
  step_dummy(all_nominal_predictors()) # create dummy variables for all nominal predictors
# no prep() here: the workflow estimates the recipe itself when fit() is called

# model
my_mod <- linear_reg() %>% # Type of model
  set_engine("lm")

# workflow
lin_wf <- workflow() %>%
  add_recipe(my_recipe) %>%
  add_model(my_mod) %>%
  fit(data = all_train) # Fit the workflow

# predictions
lin_predictions <- predict(lin_wf, new_data = all_test)

# correctly format predictions
lin_predictions <- lin_predictions %>%
  bind_cols(., all_test) %>%
  select(id, .pred) %>%
  rename(loss = .pred) %>%
  mutate(loss = exp(loss)) # must do this because we previously took the log of loss

# create csv file
vroom_write(x=lin_predictions, file="linear_predictions.csv", delim=",")

“prediction from a rank-deficient fit may be misleading”
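This warning comes from `lm`: a rank-deficient fit means at least one column of the design matrix is a linear combination of the others, so its coefficient cannot be estimated. With this many dummy variables, that is a plausible cause here, though I have not confirmed which columns are responsible. Below is a minimal sketch, on made-up data, of how such a fit can be spotted in base R. The data frame `toy` and its columns are hypothetical and only serve the illustration.

```r
# Hypothetical toy data: x3 is an exact linear combination of x1 and x2,
# so the design matrix does not have full column rank.
set.seed(1)
toy <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
toy$x3 <- toy$x1 + toy$x2
toy$y  <- 1 + 2 * toy$x1 - toy$x2 + rnorm(20, sd = 0.1)

fit <- lm(y ~ x1 + x2 + x3, data = toy)

# lm() reports coefficients it could not estimate as NA
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# The fitted rank is smaller than the number of design-matrix columns
rank_deficient <- fit$rank < ncol(model.matrix(fit))
print(rank_deficient)
```

In this notebook, the same check could presumably be run on the underlying `lm` object, for example one pulled out of the fitted workflow with `extract_fit_engine(lin_wf)`. The columns listed as `NA` would then be candidates for removal in the recipe.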
