Load all needed packages:

In [1]:
library(tidyverse) #Basic Functions
library(vroom) #Read in the data 
library(tidymodels) #Random Forest
library(embed) 
library(ranger) #Random Forest 
library(discrim)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘vroom’


The following objects are masked from ‘package:readr’:

    as.col_spec, col_character, col_date, col_datetime, col_double,
    col_factor,

Load the data set as "kobe" and create three new variables to help predict whether or not Kobe made each shot. The first variable is time remaining, which combines the minutes-remaining and seconds-remaining variables into a single value in seconds. The second variable is Home vs. Away, which records whether the game was played at home or away, based on the matchup string. The last variable is Season, which replaces the existing season string (for example "2000-01") with a single digit taken from the part after the dash, giving the model a more compact label. 
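To make the three derived variables concrete, here is a small sketch on made-up rows. The toy values and the names toy, home_away and season_digit are assumptions for illustration only; the expressions mirror the ones in the cell that follows, except that fixed() is used so "vs." is matched literally rather than as a regular expression.

```r
library(stringr)

# Hypothetical toy rows; the column names follow the Kaggle file
toy <- data.frame(
  minutes_remaining = c(10, 0),
  seconds_remaining = c(27, 5),
  matchup = c("LAL @ POR", "LAL vs. UTA"),
  season = c("2000-01", "2010-11"),
  stringsAsFactors = FALSE
)

# Minutes and seconds combined into a single value in seconds
toy$time_remaining <- toy$minutes_remaining * 60 + toy$seconds_remaining

# fixed() matches "vs." literally; in a plain regex the "." matches any character
toy$home_away <- ifelse(str_detect(toy$matchup, fixed("vs.")), "Home", "Away")

# Same expression as the notebook: second character of the part after the dash
toy$season_digit <- substr(str_split_fixed(toy$season, "-", 2)[, 2], 2, 2)

print(toy)
```

On these toy rows the two seasons, ten years apart, end up with the same season digit, which is worth keeping in mind when interpreting the Season variable.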

In [2]:
kobe <- vroom("/kaggle/input/kobe-bryant-shot-selection/data.csv.zip") #Read in the data 
kobe$time_remaining <- (kobe$minutes_remaining*60)+kobe$seconds_remaining #Create a time-remaining variable (in seconds)
kobe$matchup <- ifelse(str_detect(kobe$matchup, 'vs.'), 'Home', 'Away') #Create a Home vs. Away variable
kobe['season'] <- substr(str_split_fixed(kobe$season, '-',2)[,2],2,2) #Create a season variable

Rows: 30697 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (10): action_type, combined_shot_type, season, shot_type, shot_zone_are...
dbl  (14): game_event_id, game_id, lat, loc_x, loc_y, lon, minutes_remaining...
date  (1): game_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


First, I am removing variables that I feel won't contribute to my predictions. They include redundant variables like Team Name and Team ID, the minutes and seconds columns that time_remaining replaces, and variables that describe Kobe's position on the court, which I felt weren't making my model more accurate. 
Second, I am splitting the data into a train set and a test set. The train set contains the rows where we already know whether or not Kobe made the shot. The test set contains the rows where that information is missing, so those are the shots we will be trying to predict. 

In [3]:
kobe <- kobe %>%
  select(-c( 'team_id', 'team_name', 'shot_zone_range', 'lon', 'lat', 
            'seconds_remaining', 'minutes_remaining', 'game_event_id', 
            'game_id', 'loc_x', 'loc_y'))

# Train
train <- kobe %>%
  filter(!is.na(shot_made_flag)) #Split the data into the training set with shot_made_flag indicators
# Test 
test <- kobe %>% 
  filter(is.na(shot_made_flag)) #Split the data into the testing set missing shot_made_flag indicators 
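
As a quick illustration of the split above, the sketch below applies the same two filters to a made-up tibble and checks that every row lands in exactly one of the two sets. The names toy, toy_train and toy_test and the values are hypothetical.

```r
library(dplyr)

# Hypothetical data: NA marks the shots whose outcome must be predicted
toy <- tibble(shot_id = 1:5, shot_made_flag = c(NA, 0, 1, NA, 1))

toy_train <- toy %>% filter(!is.na(shot_made_flag)) # rows with a known outcome
toy_test <- toy %>% filter(is.na(shot_made_flag))   # rows to predict

# Every row should land in exactly one of the two sets
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))
print(toy_test$shot_id)
```

The same row-count check can be run on the real train, test and kobe objects as a sanity test.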

Next I am creating a recipe for my model, with several feature engineering steps. Step Novel assigns any factor level that was not seen in the training data to a new level, which can happen since this data moves forward in time row by row. Step Unknown assigns missing values in nominal predictors the label "unknown". update_role gives shot_id the role "ID", so it stays in the data as an identifier but is not used as a predictor, which usually improves accuracy. Step Dummy creates dummy variables for all nominal predictors, and step naomit is meant to remove rows with NA values; I added it just in case, even though I don't think there are any NA values in the data. Note that it is called here without any column selectors, so it may not drop anything as written. 

In [4]:
my_recipe <- recipe(shot_made_flag ~ ., data = train) %>%  #Create a recipe 
  step_novel(all_nominal_predictors()) %>% #Send factor levels unseen in training to a "new" level
  step_unknown(all_nominal_predictors()) %>% #Label missing nominal values as "unknown"
  update_role(shot_id, new_role = "ID") %>% #Mark shot_id as an ID so it is not used as a predictor - usually improves score 
  step_dummy(all_nominal_predictors()) %>% #Create dummy variables for all nominal columns 
  step_naomit() #Remove NA values (just in case- even though I don't think there are any)
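
To see what the novel, unknown and dummy steps do, here is a minimal sketch on made-up data. The tibbles and the names toy_train, toy_test, toy_recipe and toy_prepped are hypothetical; only the step functions are taken from the recipe above. The toy test set contains one opponent that never appears in training and one missing opponent.

```r
library(dplyr)
library(recipes)

# Hypothetical training data with one missing opponent
toy_train <- tibble(
  shot_made_flag = c(1, 0, 1, 0),
  shot_distance = c(2, 15, 8, 22),
  opponent = c("POR", "SAC", "POR", NA)
)

# Hypothetical test data: "UTA" never appears in training, and one value is missing
toy_test <- tibble(
  shot_made_flag = NA_real_,
  shot_distance = c(5, 18),
  opponent = c("UTA", NA)
)

toy_recipe <- recipe(shot_made_flag ~ ., data = toy_train) %>%
  step_novel(all_nominal_predictors()) %>%   # unseen levels become "new"
  step_unknown(all_nominal_predictors()) %>% # missing values become "unknown"
  step_dummy(all_nominal_predictors())       # one indicator column per level

toy_prepped <- prep(toy_recipe)
baked_test <- bake(toy_prepped, new_data = toy_test)
print(baked_test)
```

The baked test set should contain opponent_new and opponent_unknown indicator columns, the same kind of columns that appear at the end of the baked output further down in this notebook.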

After creating a recipe, I want to test it by prepping and baking the data using that recipe. This step ensures that my recipe is working. 

In [5]:
prepped_recipe <- prep(my_recipe) #Prep recipe to see if it works (named so it does not shadow prep())
bake(prepped_recipe, new_data = train) #Bake training set
bake(prepped_recipe, new_data = test) #Bake testing set 

period,playoffs,shot_distance,game_date,shot_id,time_remaining,shot_made_flag,action_type_Alley.Oop.Layup.shot,action_type_Cutting.Layup.Shot,action_type_Driving.Bank.shot,⋯,opponent_POR,opponent_SAC,opponent_SAS,opponent_SEA,opponent_TOR,opponent_UTA,opponent_VAN,opponent_WAS,opponent_new,opponent_unknown
<dbl>,<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0,15,2000-10-31,2,622,0,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
1,0,16,2000-10-31,3,465,1,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
1,0,22,2000-10-31,4,412,0,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
2,0,0,2000-10-31,5,379,1,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
3,0,14,2000-10-31,6,572,0,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
3,0,0,2000-10-31,7,532,1,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
3,0,12,2000-10-31,9,372,1,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
3,0,12,2000-10-31,10,216,0,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
3,0,25,2000-10-31,11,116,0,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
1,0,17,2000-11-01,12,660,1,0,0,0,⋯,0,0,0,0,0,1,0,0,0,0


period,playoffs,shot_distance,game_date,shot_id,time_remaining,shot_made_flag,action_type_Alley.Oop.Layup.shot,action_type_Cutting.Layup.Shot,action_type_Driving.Bank.shot,⋯,opponent_POR,opponent_SAC,opponent_SAS,opponent_SEA,opponent_TOR,opponent_UTA,opponent_VAN,opponent_WAS,opponent_new,opponent_unknown
<dbl>,<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0,18,2000-10-31,1,627,,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
3,0,2,2000-10-31,8,485,,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
1,0,0,2000-11-01,17,1,,0,0,0,⋯,0,0,0,0,0,1,0,0,0,0
3,0,0,2000-11-01,20,646,,0,0,0,⋯,0,0,0,0,0,1,0,0,0,0
1,0,17,2000-11-04,33,686,,0,0,0,⋯,0,0,0,0,0,0,1,0,0,0
1,0,20,2000-11-04,34,658,,0,0,0,⋯,0,0,0,0,0,0,1,0,0,0
1,0,1,2000-11-04,35,453,,0,0,0,⋯,0,0,0,0,0,0,1,0,0,0
1,0,1,2000-11-04,36,358,,0,0,0,⋯,0,0,0,0,0,0,1,0,0,0
1,0,0,2000-11-04,37,249,,0,0,0,⋯,0,0,0,0,0,0,1,0,0,0
2,0,16,2000-11-04,38,333,,0,0,0,⋯,0,0,0,0,0,0,1,0,0,0


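Once my recipe is finalized and working, I create the model I am going to use to predict Kobe's shots. In this case, I am using a random forest with 500 trees; mtry and min_n are left as tune() placeholders and are tuned later on. The mode is set to regression, so the 0/1 outcome is treated as a number and the predictions can be read as the probability that a shot is made.  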

In [6]:
my_mod_forest <- rand_forest(mtry = tune(), #Create a random forest model with 500 trees and tuned attributes 
                             min_n=tune(),
                             trees=500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

After creating a model, we need to create a workflow, which bundles the recipe and the model so that the data is preprocessed and fit in one step.  

In [7]:
workflow_forest <- workflow() %>% # Create a workflow with model & recipe
  add_recipe(my_recipe) %>%
  add_model(my_mod_forest)


After the workflow is completed, the next step is a tuning grid, which lists the candidate values to try for each tuned parameter.

In [8]:
tuning_grid <- grid_regular(mtry(range=c(1, ncol(train) - 1)), # Set up grid of tuning values
                            min_n(),
                            levels = 5) # L^2 total tuning possibilities
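
For a sense of what the grid contains, the sketch below builds a regular grid with a hypothetical upper bound of 10 for mtry (the real cell uses ncol(train) - 1) and the default dials range for min_n. With 5 levels for each of the two parameters, the grid should have 5 x 5 = 25 rows. The name toy_grid is an assumption for illustration.

```r
library(dials)

toy_grid <- grid_regular(
  mtry(range = c(1, 10)), # hypothetical upper bound, for illustration only
  min_n(),                # default range supplied by dials
  levels = 5              # 5 values per parameter
)

print(nrow(toy_grid))
print(head(toy_grid))
```

One thing to keep in mind: the upper bound in the real cell is based on the column count before dummy encoding, so the grid only covers the lower part of the mtry values that are possible once the dummy columns exist.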



After creating the tuning grid, we must decide on the number of folds, that is, how many sections the training data is split into for cross validation. Here we are using 5 folds, so it will be 5-fold cross validation. Cross validation then fits the model with every parameter combination in the grid, each time holding out one fold and scoring the model on it. This competition is scored with the log loss function, but my model has limited capabilities, so I am tuning it to minimize root mean squared error instead. 

In [9]:
folds <- vfold_cv(train, v = 5, repeats=1) #decide how many folds you want to do for k-fold cross validation


CV_results_forest <- workflow_forest %>% #cross validate and tune values 
  tune_grid(resamples=folds,
            grid=tuning_grid,
            metrics=metric_set(rmse))
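
To show how the two metrics mentioned above differ, here is a small base R sketch that computes both on made-up outcomes and predicted probabilities. The vectors and the helper log_loss are hypothetical and are not part of the tuning code; the clipping constant eps is an assumption that keeps log(0) from occurring when a prediction is exactly 0 or 1.

```r
# Hypothetical outcomes and predicted probabilities
actual <- c(1, 0, 1, 0)
predicted <- c(0.9, 0.2, 0.6, 0.5)

# Root mean squared error: square root of the mean squared difference
rmse <- sqrt(mean((actual - predicted)^2))

# Binary log loss; eps is an assumed clipping constant
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps) # clip so log(0) never occurs
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

print(rmse)
print(log_loss(actual, predicted))
```

Both metrics reward predictions close to the true 0/1 outcome, but log loss penalizes confident wrong predictions much more heavily than RMSE does.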

Once cross validation has been run, the function below returns the parameter combination with the lowest root mean squared error, so we can use the best model found to make predictions. 

In [10]:
bestTune <- CV_results_forest %>% # Find best tuning parameters
  select_best(metric = "rmse") # Name the metric argument; recent versions of tune require it

After the best model has been chosen, we create the workflow with the best parameters based on what the "best tuned" model is above. 

In [11]:
final_wf_forest <- workflow_forest %>% # Finalize workflow 
  finalize_workflow(bestTune) %>%
  fit(data = train)

Once the best workflow has been created, we will use it to predict our response variable: whether or not Kobe made the shot. This code predicts the probability that each shot is made. Kaggle has a specific way that predictions must be formatted, so the last chunk of code formats it properly, with one column for Shot ID and one column for the prediction. 

In [12]:
predictions_forest <- final_wf_forest %>% #Predict Values 
  predict(test)

predictions_forest <- predictions_forest %>% #Bind values to shot_id to upload to Kaggle 
  bind_cols(., test) %>%
  select(shot_id, .pred) %>%
  rename(shot_made_flag = .pred)
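
The formatting step can be checked on made-up values. In the sketch below, toy_preds and toy_test are hypothetical stand-ins for the real predictions and test set; the pipeline is the same bind, select and rename sequence as above, and the final check confirms the two-column layout.

```r
library(dplyr)

# Hypothetical stand-ins for the real predictions and test set
toy_preds <- tibble(.pred = c(0.41, 0.63))
toy_test <- tibble(shot_id = c(1, 8), shot_made_flag = NA_real_)

toy_submission <- toy_preds %>%
  bind_cols(toy_test) %>%        # attach the test columns to the predictions
  select(shot_id, .pred) %>%     # drops the all-NA shot_made_flag from the test set
  rename(shot_made_flag = .pred) # the prediction takes over that column name

print(toy_submission)
stopifnot(identical(names(toy_submission), c("shot_id", "shot_made_flag")))
```

Because select() drops the original all-NA shot_made_flag column before the rename, the two names never clash.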

The properly formatted predictions are written to a CSV file.

In [13]:
vroom_write(x= predictions_forest, file="./submission.csv", delim=",") #Push Predictions to Output Document 