# Basketball Analytics Data Engineering

The clusters generated in the previous part are used for modeling defensive performance. A defensive play is defined as guarding a ballhandler (Center, Power Forward, etc.) at a certain location on the court at a certain time on the shot clock. In other words, a defensive play is who is guarded, when, and where.

After building defensive plays as a combination of these three dimensions, the next step is to reformat the data from the ball-touch level to the ball-possession level. Each row becomes a ball possession with defensive plays as features. The target variable is whether the ball possession ended with a shot attempt.

Let's load the data from the previous clustering step.

In [1]:
# Load the required packages and the clustered touch data


pacman::p_load(dplyr,readr, FactoInvestigate, Factoshiny, DT, corrplot, rio, FactoMineR, tidyr, shiny, lubridate, broom)
 
library(ggfortify)
library(ggplot2)
library(grid)
library(jpeg)
library(cluster) 
library(factoextra)
options(warn=-1)


# Restore the object
touch_data=readRDS(file = "my_data_with_18_clusters.rds")





Let's create bins for the time on the shot clock. It is divided into three bins: 0-5, 5-10, and 10-24 seconds.

In [2]:

touch_data$time_group=" "

touch_data$time_group[touch_data$start_shot_clock <5]="0_to_5"
touch_data$time_group[touch_data$start_shot_clock >=5 & touch_data$start_shot_clock <10]="5_to_10"
touch_data$time_group[touch_data$start_shot_clock >=10]="10_to_24"

touch_data$cluster <- as.numeric(as.character(touch_data$cluster))
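The same binning can also be written with base R's `cut()`. The following is a minimal sketch on made-up shot-clock values (the vector below is hypothetical, not taken from the real data). It uses left-closed intervals so that a value of exactly 5 falls in `5_to_10` and exactly 10 falls in `10_to_24`, matching the comparisons above. One difference: missing shot-clock values become `NA` here instead of the `" "` placeholder.

```r
# Hypothetical shot-clock values in seconds (illustration only)
start_shot_clock <- c(0.4, 4.9, 5.0, 9.9, 10.0, 24.0)

# right = FALSE gives left-closed intervals: [-Inf, 5), [5, 10), [10, Inf)
time_group <- cut(
  start_shot_clock,
  breaks = c(-Inf, 5, 10, Inf),
  labels = c("0_to_5", "5_to_10", "10_to_24"),
  right = FALSE
)

# Convert the factor to plain strings, as in the cell above
time_group <- as.character(time_group)
print(time_group)
```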


Let's clean up the ballhandler positions. We only need the five main player positions: C, PF, SF, SG, and PG. Hybrid labels such as PG-SG are mapped to one of these five.

In [5]:
touch_data$bh_pos[touch_data$bh_pos=='PG-SG']='PG'
touch_data$bh_pos[touch_data$bh_pos=='SF-SG']='SF'
touch_data$bh_pos[touch_data$bh_pos=='SG-PG']='SG'
touch_data$bh_pos[touch_data$bh_pos=='PF-SF']='PF'
touch_data$bh_pos[touch_data$bh_pos=='SF-PF']='SF'
touch_data$bh_pos[touch_data$bh_pos=='SG-SF']='SG'
touch_data$bh_pos[touch_data$bh_pos=='G-F']='PG'
touch_data$bh_pos[touch_data$bh_pos=='F']='PF'

unique(touch_data$bh_pos)
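The eight assignments above can also be kept in a single named lookup vector, which makes the mapping easier to review. This is only a sketch on a hypothetical `bh_pos` vector; the mapping itself is copied from the cell above.

```r
# Mapping from hybrid labels to the five main positions (same as the cell above)
position_map <- c(
  "PG-SG" = "PG", "SF-SG" = "SF", "SG-PG" = "SG", "PF-SF" = "PF",
  "SF-PF" = "SF", "SG-SF" = "SG", "G-F" = "PG", "F" = "PF"
)

# Hypothetical sample of ballhandler positions (illustration only)
bh_pos <- c("C", "PG-SG", "F", "SG", "G-F", "SF-PF")

# Recode only the labels that appear in the map; leave the rest untouched
needs_recode <- bh_pos %in% names(position_map)
bh_pos[needs_recode] <- unname(position_map[bh_pos[needs_recode]])
print(bh_pos)
```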




Now we combine the three variables discussed earlier into one feature: the cluster of the location on the court, the time on the shot clock, and the ballhandler position. We end up with a total of 270 features.

In [6]:
touch_data$combined= paste(touch_data$cluster,"-",touch_data$time_group,"-",touch_data$bh_pos)

In [7]:
length(unique(touch_data$combined))
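The figure of 270 is consistent with 18 clusters × 3 time bins × 5 positions. The sketch below enumerates every possible combination to check that arithmetic. It assumes the clusters are labelled 1 to 18, which is a guess based on the file name; the real labels may differ.

```r
# Assumed cluster labels (hypothetical: 1..18, inferred from the file name)
clusters <- 1:18
time_groups <- c("0_to_5", "5_to_10", "10_to_24")
positions <- c("C", "PF", "SF", "SG", "PG")

# Every possible (cluster, time bin, position) triple
all_plays <- expand.grid(
  cluster = clusters,
  time_group = time_groups,
  bh_pos = positions,
  stringsAsFactors = FALSE
)

# Build labels the same way as the cell above (paste with the default " " separator)
labels <- paste(all_plays$cluster, "-", all_plays$time_group, "-", all_plays$bh_pos)

print(nrow(all_plays))
print(head(labels, 3))
```

If `length(unique(touch_data$combined))` comes out lower than this count, some combinations never occur in the data.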

Now we reformat the data from the ball-touch level to the ball-possession level. This is achieved by building a pivot table on the chance id: we count the defensive plays per possession and then spread them out as columns.

In [None]:
touch_data_aggregate=touch_data %>%
    group_by(chance_id)%>%
    count(chance_id,combined) %>%
    spread(combined, n)

touch_data_aggregate[is.na(touch_data_aggregate)] = 0
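To make the shape of the result concrete, here is a small base R sketch of the same touch-to-possession pivot on a toy data frame. The rows and feature labels are made up for illustration, and `table()` is used here instead of the `count()` and `spread()` pipeline above.

```r
# Toy touch-level data: five touches across two possessions (hypothetical values)
toy <- data.frame(
  chance_id = c(1, 1, 1, 2, 2),
  combined = c("3 - 0_to_5 - PG", "3 - 0_to_5 - PG", "7 - 10_to_24 - C",
               "7 - 10_to_24 - C", "9 - 5_to_10 - SF"),
  stringsAsFactors = FALSE
)

# Cross-tabulate possessions against defensive plays; absent pairs are already 0
counts <- table(toy$chance_id, toy$combined)

# Convert to a wide data frame with one row per possession
wide <- as.data.frame.matrix(counts)
wide <- cbind(chance_id = as.numeric(rownames(wide)), wide)
rownames(wide) <- NULL
print(wide)
```

Each row is one possession and each remaining column holds the number of times that defensive play occurred in it.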

After the pivot table, we add the target variable, which is 1 if a shot was attempted at the end of the ball possession and 0 otherwise.

In [7]:
touch_data$led_to_shot=as.numeric(touch_data$led_to_shot)

touch_data_shot=touch_data %>% group_by(chance_id) %>% summarise(shot=sum(led_to_shot))
touch_data_shot$shot[touch_data_shot$shot > 1]=1

touch_data_aggregate=merge(touch_data_aggregate,touch_data_shot,by='chance_id')
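The target construction amounts to asking whether any touch in a possession led to a shot. A minimal base R sketch on toy data (hypothetical values, and assuming `led_to_shot` is coded as 0/1) looks like this:

```r
# Toy touch-level data: three possessions (hypothetical values)
toy <- data.frame(
  chance_id = c(1, 1, 1, 2, 2, 3),
  led_to_shot = c(0, 0, 1, 0, 0, 1)
)

# One row per possession: 1 if any touch led to a shot, 0 otherwise
shot_by_chance <- aggregate(
  led_to_shot ~ chance_id,
  data = toy,
  FUN = function(x) as.numeric(any(x > 0))
)
names(shot_by_chance)[2] <- "shot"
print(shot_by_chance)
```

Note that `merge()` with `by = 'chance_id'` performs an inner join by default, so possessions missing from either table are dropped.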


Save the data for later use in predictive modeling.

In [8]:
# Save an object to a file
saveRDS(touch_data_aggregate, file = "my_data_with_18_cluster_features.rds")
# Restore the object
#readRDS(file = "my_data.rds")

### Player defensive plays count

After aggregating the data per ball possession to model the effect of each defensive play on the likelihood of a shot being attempted, we do the same per player. We count the number of times each player has made each defensive play and normalize it by the total number of plays made by that player. This way, based on how often each player makes a defensive play and how much that play affects the likelihood of a shot attempt, we can calculate how effective each player is at preventing a shot attempt, which is what our defensive ranking is based on.

In [6]:
player_aggregate=touch_data %>%
    group_by(defender_id)%>%
    count(defender_id,combined) %>%
    spread(combined, n)

player_aggregate[is.na(player_aggregate)] = 0

Add a sum column that counts the total defensive plays made by a player

In [7]:
player_aggregate$sum=rowSums(subset(player_aggregate,select=-defender_id),na.rm=TRUE)

Check the summary statistics of the play counts and keep only players with more than 700 defensive plays. We only want to analyze players who have played enough for the results to be meaningful.

In [40]:
summary(player_aggregate$sum)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    2.0   448.2  1979.0  2648.8  4469.2 10728.0 

In [8]:
player_aggregate=player_aggregate %>% filter(sum>700)
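The normalization by each player's total is described above but not shown in these cells. One way it could be done is sketched below on a toy table; the column names `play_a`, `play_b`, `play_c` and the defender ids are hypothetical, and this is not necessarily how the later ranking step computes it.

```r
# Toy per-player counts of three defensive plays (hypothetical values)
toy_counts <- data.frame(
  defender_id = c(101, 102),
  play_a = c(6, 1),
  play_b = c(2, 1),
  play_c = c(2, 2)
)

# Total defensive plays per player
play_cols <- setdiff(names(toy_counts), "defender_id")
toy_counts$sum <- rowSums(toy_counts[, play_cols])

# Divide each play count by the player's total so each row of shares sums to 1
play_shares <- toy_counts
play_shares[, play_cols] <- toy_counts[, play_cols] / toy_counts$sum
print(play_shares)
```

Working with shares rather than raw counts keeps players with very different numbers of defensive plays comparable.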

Save the dataframe for later use in creating the defensive player rankings

In [9]:
# Save an object to a file
saveRDS(player_aggregate, file = "my_data_with_player_features_18.rds")
# Restore the object
#readRDS(file = "my_data.rds")