### We will use the 2016-2017 basketball shot log data to demonstrate how to test the hot hand.

#### Import useful libraries and the shot log data  

#### Please note that the 3 lecture notebooks for this week must be run in order, as each notebook relies on the output of the previous one.

In [1]:
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np

Shotlog=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Sports Performance Analytics Specialization/Course 1 - Foundations of Sports Analytics: Data, Representation, and Models in Sports/work/Data/Week 6/Shotlog_16_17.csv")
Shotlog.head()

Unnamed: 0,team_previous_shot,player_position,home_game,location_x,opponent_previous_shot,home_team,shot_type,points,away_team,location_y,time,date,shoot_player,time_from_last_shot,quarter,current_shot_outcome
0,,SF,Yes,97.0,SCORED,ATL,Pullup Jump Shot,2,WAS,405.0,1:09,10/27/2016,Kent Bazemore,,1,MISSED
1,MISSED,C,Yes,52.0,SCORED,ATL,Tip Dunk Shot,2,WAS,250.0,1:11,10/27/2016,Dwight Howard,2.0,1,SCORED
2,SCORED,SG,Yes,239.0,MISSED,ATL,Jump Shot,2,WAS,223.0,1:41,10/27/2016,Kyle Korver,30.0,1,SCORED
3,SCORED,PG,Yes,102.0,SCORED,ATL,Pullup Jump Shot,2,WAS,385.0,2:16,10/27/2016,Dennis Schroder,35.0,1,SCORED
4,SCORED,PF,Yes,128.0,MISSED,ATL,Turnaround Jump Shot,2,WAS,265.0,2:40,10/27/2016,Paul Millsap,24.0,1,MISSED


In [3]:
Shotlog.shape

(210072, 16)

## Data Preparation

### Missing Values

In [4]:
Shotlog.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210072 entries, 0 to 210071
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   team_previous_shot      207612 non-null  object 
 1   player_position         210072 non-null  object 
 2   home_game               210072 non-null  object 
 3   location_x              209675 non-null  float64
 4   opponent_previous_shot  208462 non-null  object 
 5   home_team               210072 non-null  object 
 6   shot_type               210072 non-null  object 
 7   points                  210072 non-null  int64  
 8   away_team               210072 non-null  object 
 9   location_y              209675 non-null  float64
 10  time                    210072 non-null  object 
 11  date                    210072 non-null  object 
 12  shoot_player            210072 non-null  object 
 13  time_from_last_shot     200072 non-null  float64
 14  quarter             

### Let's create some useful variables.
- Create dummy variables that indicate whether the current shot and the previous shot were hits or misses.


In [5]:
Shotlog['current_shot_hit'] = np.where(Shotlog['current_shot_outcome']=="SCORED", 1, 0)
Shotlog.head()

Unnamed: 0,team_previous_shot,player_position,home_game,location_x,opponent_previous_shot,home_team,shot_type,points,away_team,location_y,time,date,shoot_player,time_from_last_shot,quarter,current_shot_outcome,current_shot_hit
0,,SF,Yes,97.0,SCORED,ATL,Pullup Jump Shot,2,WAS,405.0,1:09,10/27/2016,Kent Bazemore,,1,MISSED,0
1,MISSED,C,Yes,52.0,SCORED,ATL,Tip Dunk Shot,2,WAS,250.0,1:11,10/27/2016,Dwight Howard,2.0,1,SCORED,1
2,SCORED,SG,Yes,239.0,MISSED,ATL,Jump Shot,2,WAS,223.0,1:41,10/27/2016,Kyle Korver,30.0,1,SCORED,1
3,SCORED,PG,Yes,102.0,SCORED,ATL,Pullup Jump Shot,2,WAS,385.0,2:16,10/27/2016,Dennis Schroder,35.0,1,SCORED,1
4,SCORED,PF,Yes,128.0,MISSED,ATL,Turnaround Jump Shot,2,WAS,265.0,2:40,10/27/2016,Paul Millsap,24.0,1,MISSED,0
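As a minimal sketch of the dummy-variable step above, the toy frame below (made-up rows, not taken from the shot log) shows that `np.where` maps "SCORED" to 1 and everything else to 0. The column names `hit_where` and `hit_eq` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy outcomes (made-up values, not rows from the shot log)
toy = pd.DataFrame({"current_shot_outcome": ["SCORED", "MISSED", "MISSED", "SCORED"]})

# np.where returns 1 where the condition holds and 0 otherwise
toy["hit_where"] = np.where(toy["current_shot_outcome"] == "SCORED", 1, 0)

# An equivalent formulation: compare, then cast the booleans to integers
toy["hit_eq"] = toy["current_shot_outcome"].eq("SCORED").astype(int)

print(toy)
```

Both columns hold the same values; any outcome string other than "SCORED" would be coded as 0.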


- Make sure the variable "date" is stored as a datetime variable.


In [6]:
import datetime as dt
Shotlog['date']=pd.to_datetime(Shotlog['date'])

- Convert the variable "time" to a timedelta variable.
 1. The raw values are in 'MM:SS' format, so we first prepend the hour ('00:') to obtain the 'HH:MM:SS' format;
 2. We then use "to_timedelta", which works with variables that hold only time information (no calendar date).


In [7]:
Shotlog['time'] = pd.to_timedelta('00:'+ Shotlog['time'])
Shotlog['time'].describe()

count                       210072
mean     0 days 00:06:08.994773220
std      0 days 00:03:28.346263848
min                0 days 00:00:00
25%                0 days 00:03:08
50%                0 days 00:06:06
75%                0 days 00:09:10
max                0 days 00:12:00
Name: time, dtype: object
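The conversion above can be sketched on a few toy clock strings (made-up values in the same 'MM:SS' style as the shot log). Converting the result to seconds is an optional extra shown for illustration; the notebook itself does not do this.

```python
import pandas as pd

# Toy clock readings in MM:SS format (illustrative values)
clock = pd.Series(["1:09", "2:16", "12:00"])

# Prepend the hour so the strings read HH:MM:SS, then parse as timedeltas
as_delta = pd.to_timedelta("00:" + clock)

# Optional: express each timedelta as a number of seconds
seconds = as_delta.dt.total_seconds()

print(as_delta)
print(seconds)
```

Timedeltas sort correctly as durations, which is what the later sort by quarter and time relies on.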

- Create a lagged variable to indicate the result of the previous shot by the same player in the same game.
 1. We will first sort the shots by quarter and by time within the quarter;
 2. We will group the data by player and game (date) and use the "shift" command to create the lagged variable.


In [8]:
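# Sort by quarter and time, then shift the hit/miss dummy down one row within each
# player-game group; the assignment aligns the result on the original row index.
Shotlog['lag_shot_hit'] = (
    Shotlog.sort_values(by=['quarter', 'time'], ascending=[True, True])
    .groupby(['shoot_player', 'date'])['current_shot_hit']
    .shift(1)
)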
Shotlog.head()

Unnamed: 0,team_previous_shot,player_position,home_game,location_x,opponent_previous_shot,home_team,shot_type,points,away_team,location_y,time,date,shoot_player,time_from_last_shot,quarter,current_shot_outcome,current_shot_hit,lag_shot_hit
0,,SF,Yes,97.0,SCORED,ATL,Pullup Jump Shot,2,WAS,405.0,0 days 00:01:09,2016-10-27,Kent Bazemore,,1,MISSED,0,
1,MISSED,C,Yes,52.0,SCORED,ATL,Tip Dunk Shot,2,WAS,250.0,0 days 00:01:11,2016-10-27,Dwight Howard,2.0,1,SCORED,1,
2,SCORED,SG,Yes,239.0,MISSED,ATL,Jump Shot,2,WAS,223.0,0 days 00:01:41,2016-10-27,Kyle Korver,30.0,1,SCORED,1,
3,SCORED,PG,Yes,102.0,SCORED,ATL,Pullup Jump Shot,2,WAS,385.0,0 days 00:02:16,2016-10-27,Dennis Schroder,35.0,1,SCORED,1,
4,SCORED,PF,Yes,128.0,MISSED,ATL,Turnaround Jump Shot,2,WAS,265.0,0 days 00:02:40,2016-10-27,Paul Millsap,24.0,1,MISSED,0,


#### We can sort the shot log data by player, game (date), quarter, and time of the shot.



In [9]:
Shotlog.sort_values(by=['shoot_player', 'date', 'quarter', 'time'], ascending=[True, True, True, True])

Unnamed: 0,team_previous_shot,player_position,home_game,location_x,opponent_previous_shot,home_team,shot_type,points,away_team,location_y,time,date,shoot_player,time_from_last_shot,quarter,current_shot_outcome,current_shot_hit,lag_shot_hit
42660,MISSED,C,No,210.0,SCORED,GSW,Jump Shot,2,DAL,269.0,0 days 00:07:23,2016-11-09,A.J. Hammons,43.0,4,SCORED,1,
42661,SCORED,C,No,308.0,SCORED,GSW,Jump Shot,3,DAL,202.0,0 days 00:07:56,2016-11-09,A.J. Hammons,33.0,4,SCORED,1,1.0
42664,MISSED,C,No,167.0,SCORED,GSW,Jump Shot,2,DAL,318.0,0 days 00:09:26,2016-11-09,A.J. Hammons,51.0,4,MISSED,0,1.0
42667,SCORED,C,No,131.0,MISSED,GSW,Jump Shot,2,DAL,337.0,0 days 00:11:46,2016-11-09,A.J. Hammons,62.0,4,MISSED,0,0.0
42668,MISSED,C,No,72.0,MISSED,GSW,Tip Layup Shot,2,DAL,248.0,0 days 00:11:47,2016-11-09,A.J. Hammons,1.0,4,SCORED,1,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70218,SCORED,C,Yes,58.0,MISSED,GSW,Layup,2,UTA,241.0,0 days 00:11:08,2017-04-10,Zaza Pachulia,47.0,2,SCORED,1,0.0
70255,BLOCKED,C,Yes,866.0,SCORED,GSW,Layup,2,UTA,252.0,0 days 00:10:33,2017-04-10,Zaza Pachulia,5.0,4,MISSED,0,1.0
70264,SCORED,C,Yes,239.0,SCORED,GSW,Jump Shot,2,LAL,272.0,0 days 00:01:41,2017-04-12,Zaza Pachulia,29.0,1,SCORED,1,
70270,MISSED,C,Yes,52.0,SCORED,GSW,Tip Layup Shot,2,LAL,251.0,0 days 00:04:32,2017-04-12,Zaza Pachulia,1.0,1,SCORED,1,1.0


_Notice that the lagged outcome variable is missing for each player's first shot of a game, since there is no previous shot in that game to refer to._
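To see why the first shot in each player-game group has a missing lag, here is a small sketch on made-up data. The players "A" and "B" and all values are hypothetical; the sort and `groupby(...).shift(1)` pattern mirrors the cell above.

```python
import pandas as pd

# Toy shot log, deliberately stored out of chronological order
toy = pd.DataFrame({
    "shoot_player": ["A", "B", "A", "A", "B"],
    "date": pd.to_datetime(["2016-10-27"] * 5),
    "quarter": [1, 1, 1, 2, 2],
    "time": pd.to_timedelta(["00:05:00", "00:01:00", "00:02:00", "00:01:00", "00:03:00"]),
    "current_shot_hit": [1, 0, 0, 1, 1],
})

# Sort chronologically, shift within each player-game, assign back by index
toy["lag_shot_hit"] = (
    toy.sort_values(by=["quarter", "time"])
    .groupby(["shoot_player", "date"])["current_shot_hit"]
    .shift(1)
)

print(toy.sort_values(by=["shoot_player", "date", "quarter", "time"]))
```

Player A's earliest shot (quarter 1, 2:00) and player B's earliest shot (quarter 1, 1:00) have no predecessor, so their lag is NaN; every later shot carries the outcome of the shot before it.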

#### Let's create a dataframe for average success rate of players over the season.

Since the "current_shot_hit" variable is a dummy variable (=1 if hit, =0 if miss), the average of this variable would indicate the success rate of the player over the season.

In [10]:
Player_Stats=Shotlog.groupby(['shoot_player'])['current_shot_hit'].mean()
Player_Stats=Player_Stats.reset_index()
Player_Stats.head()

Unnamed: 0,shoot_player,current_shot_hit
0,A.J. Hammons,0.404762
1,Aaron Brooks,0.403333
2,Aaron Gordon,0.454861
3,Aaron Harrison,0.0
4,Adreian Payne,0.425926
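A quick check of the reasoning above, on made-up data: the mean of a 0/1 dummy equals shots made divided by shots attempted. The players and values here are hypothetical.

```python
import pandas as pd

# Toy data: player A makes 3 of 4 shots, player B makes 1 of 2
toy = pd.DataFrame({
    "shoot_player": ["A", "A", "A", "A", "B", "B"],
    "current_shot_hit": [1, 0, 1, 1, 0, 1],
})

grouped = toy.groupby("shoot_player")["current_shot_hit"]

rate_from_mean = grouped.mean()                    # mean of the dummy
rate_from_counts = grouped.sum() / grouped.count() # made / attempted

print(rate_from_mean)
```

The two series agree, which is why averaging `current_shot_hit` gives each player's success rate.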


- Let's rename the "current_shot_hit" variable in the newly created data frame as "average_hit".


In [11]:
Player_Stats.rename(columns={'current_shot_hit':'average_hit'}, inplace=True)

#### We will use the player statistics to analyze the hot hand. So we will merge average player statistics dataframe back to the shot log dataframe.


In [12]:
Shotlog=pd.merge(Shotlog, Player_Stats, on=['shoot_player'])
Shotlog.head()

Unnamed: 0,team_previous_shot,player_position,home_game,location_x,opponent_previous_shot,home_team,shot_type,points,away_team,location_y,time,date,shoot_player,time_from_last_shot,quarter,current_shot_outcome,current_shot_hit,lag_shot_hit,average_hit
0,,SF,Yes,97.0,SCORED,ATL,Pullup Jump Shot,2,WAS,405.0,0 days 00:01:09,2016-10-27,Kent Bazemore,,1,MISSED,0,,0.408587
1,MISSED,SF,Yes,279.0,SCORED,ATL,Jump Shot,3,WAS,130.0,0 days 00:03:11,2016-10-27,Kent Bazemore,4.0,1,MISSED,0,0.0,0.408587
2,MISSED,SF,Yes,58.0,SCORED,ATL,Cutting Layup Shot,2,WAS,275.0,0 days 00:09:53,2016-10-27,Kent Bazemore,30.0,2,MISSED,0,0.0,0.408587
3,SCORED,SF,Yes,868.0,SCORED,ATL,Jump Shot,3,WAS,475.0,0 days 00:01:02,2016-10-27,Kent Bazemore,47.0,3,MISSED,0,0.0,0.408587
4,SCORED,SF,Yes,691.0,MISSED,ATL,Pullup Jump Shot,3,WAS,100.0,0 days 00:04:50,2016-10-27,Kent Bazemore,39.0,3,SCORED,1,0.0,0.408587
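The merge above attaches one player-level value to many shot-level rows. The sketch below, on made-up data, shows the same pattern with two optional arguments that the notebook does not use: `how="left"` keeps every shot row, and `validate="many_to_one"` makes pandas raise an error if the player table accidentally contains duplicate players.

```python
import pandas as pd

# Toy shot-level data (hypothetical players and outcomes)
shots = pd.DataFrame({
    "shoot_player": ["A", "B", "A"],
    "current_shot_hit": [1, 0, 0],
})

# Player-level averages, renamed as in the notebook
stats = (
    shots.groupby("shoot_player")["current_shot_hit"]
    .mean()
    .reset_index()
    .rename(columns={"current_shot_hit": "average_hit"})
)

# Many shots per player on the left, one row per player on the right
merged = pd.merge(shots, stats, on="shoot_player", how="left", validate="many_to_one")

print(merged)
```

The merged frame has the same number of rows as the shot-level frame, with each player's average repeated on every one of that player's shots.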


- Create a variable to indicate the total number of shots recorded in the dataset for each player.


In [13]:
Player_Shots=Shotlog.groupby(['shoot_player']).size().reset_index(name='shot_count')

In [14]:
Player_Shots.sort_values(by=['shot_count'], ascending=[False]).head()

Unnamed: 0,shoot_player,shot_count
402,Russell Westbrook,1940
25,Andrew Wiggins,1568
106,DeMar DeRozan,1545
193,James Harden,1532
28,Anthony Davis,1525


We should also note that players take different numbers of shots in each individual game. We will need to treat the data differently for a player who took only two shots in a game compared with one who attempted 30.

- Create a variable to indicate the number of shots taken by each player in each game.


In [15]:
Player_Game=Shotlog.groupby(['shoot_player','date']).size().reset_index(name='shot_per_game')
Player_Game.head()

Unnamed: 0,shoot_player,date,shot_per_game
0,A.J. Hammons,2016-11-09,5
1,A.J. Hammons,2016-11-23,1
2,A.J. Hammons,2016-11-25,1
3,A.J. Hammons,2016-12-03,2
4,A.J. Hammons,2016-12-07,2


#### We will merge the shot count data frames back to the shot log dataframe.


In [16]:
Shotlog=pd.merge(Shotlog, Player_Shots, on=['shoot_player'])
Shotlog=pd.merge(Shotlog, Player_Game, on=['shoot_player','date'])
display(Shotlog)

Unnamed: 0,team_previous_shot,player_position,home_game,location_x,opponent_previous_shot,home_team,shot_type,points,away_team,location_y,...,date,shoot_player,time_from_last_shot,quarter,current_shot_outcome,current_shot_hit,lag_shot_hit,average_hit,shot_count,shot_per_game
0,,SF,Yes,97.0,SCORED,ATL,Pullup Jump Shot,2,WAS,405.0,...,2016-10-27,Kent Bazemore,,1,MISSED,0,,0.408587,722,7
1,MISSED,SF,Yes,279.0,SCORED,ATL,Jump Shot,3,WAS,130.0,...,2016-10-27,Kent Bazemore,4.0,1,MISSED,0,0.0,0.408587,722,7
2,MISSED,SF,Yes,58.0,SCORED,ATL,Cutting Layup Shot,2,WAS,275.0,...,2016-10-27,Kent Bazemore,30.0,2,MISSED,0,0.0,0.408587,722,7
3,SCORED,SF,Yes,868.0,SCORED,ATL,Jump Shot,3,WAS,475.0,...,2016-10-27,Kent Bazemore,47.0,3,MISSED,0,0.0,0.408587,722,7
4,SCORED,SF,Yes,691.0,MISSED,ATL,Pullup Jump Shot,3,WAS,100.0,...,2016-10-27,Kent Bazemore,39.0,3,SCORED,1,0.0,0.408587,722,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210067,SCORED,C,No,878.0,MISSED,DET,Layup,2,WAS,250.0,...,2017-04-10,Ian Mahinmi,21.0,1,MISSED,0,,0.590909,110,5
210068,SCORED,C,No,878.0,MISSED,DET,Driving Layup,2,WAS,250.0,...,2017-04-10,Ian Mahinmi,31.0,1,SCORED,1,0.0,0.590909,110,5
210069,MISSED,C,No,863.0,SCORED,DET,Hook Shot,2,WAS,303.0,...,2017-04-10,Ian Mahinmi,78.0,2,MISSED,0,1.0,0.590909,110,5
210070,SCORED,C,No,58.0,SCORED,DET,Layup,2,WAS,264.0,...,2017-04-10,Ian Mahinmi,43.0,3,SCORED,1,0.0,0.590909,110,5
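As an aside, the same two count columns can be attached without building separate data frames and merging. This is an alternative sketch on made-up data, not the approach the notebook takes; it assumes the separate `Player_Shots` and `Player_Game` tables are not needed afterwards (the notebook does save them later, so the merge approach is kept there).

```python
import pandas as pd

# Toy shot log: player A has 2 shots in one game and 1 in another; B has 2 in one game
toy = pd.DataFrame({
    "shoot_player": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2016-10-27", "2016-10-27", "2016-10-28", "2016-10-27", "2016-10-27"]
    ),
    "current_shot_hit": [1, 0, 1, 0, 1],
})

# transform("size") broadcasts each group's row count back to every row in the group
toy["shot_count"] = toy.groupby("shoot_player")["current_shot_hit"].transform("size")
toy["shot_per_game"] = (
    toy.groupby(["shoot_player", "date"])["current_shot_hit"].transform("size")
)

print(toy)
```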


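#### We will view the data in sorted order again after merging. The sorted result is only displayed, not assigned back, so "Shotlog" itself keeps its merged order.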


In [17]:
Shotlog.sort_values(by=['shoot_player', 'date', 'quarter', 'time'], ascending=[True, True, True, True])

Unnamed: 0,team_previous_shot,player_position,home_game,location_x,opponent_previous_shot,home_team,shot_type,points,away_team,location_y,...,date,shoot_player,time_from_last_shot,quarter,current_shot_outcome,current_shot_hit,lag_shot_hit,average_hit,shot_count,shot_per_game
50484,MISSED,C,No,210.0,SCORED,GSW,Jump Shot,2,DAL,269.0,...,2016-11-09,A.J. Hammons,43.0,4,SCORED,1,,0.404762,42,5
50485,SCORED,C,No,308.0,SCORED,GSW,Jump Shot,3,DAL,202.0,...,2016-11-09,A.J. Hammons,33.0,4,SCORED,1,1.0,0.404762,42,5
50486,MISSED,C,No,167.0,SCORED,GSW,Jump Shot,2,DAL,318.0,...,2016-11-09,A.J. Hammons,51.0,4,MISSED,0,1.0,0.404762,42,5
50487,SCORED,C,No,131.0,MISSED,GSW,Jump Shot,2,DAL,337.0,...,2016-11-09,A.J. Hammons,62.0,4,MISSED,0,0.0,0.404762,42,5
50488,MISSED,C,No,72.0,MISSED,GSW,Tip Layup Shot,2,DAL,248.0,...,2016-11-09,A.J. Hammons,1.0,4,SCORED,1,0.0,0.404762,42,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72781,SCORED,C,Yes,58.0,MISSED,GSW,Layup,2,UTA,241.0,...,2017-04-10,Zaza Pachulia,47.0,2,SCORED,1,0.0,0.535948,306,5
72782,BLOCKED,C,Yes,866.0,SCORED,GSW,Layup,2,UTA,252.0,...,2017-04-10,Zaza Pachulia,5.0,4,MISSED,0,1.0,0.535948,306,5
72783,SCORED,C,Yes,239.0,SCORED,GSW,Jump Shot,2,LAL,272.0,...,2017-04-12,Zaza Pachulia,29.0,1,SCORED,1,,0.535948,306,3
72784,MISSED,C,Yes,52.0,SCORED,GSW,Tip Layup Shot,2,LAL,251.0,...,2017-04-12,Zaza Pachulia,1.0,1,SCORED,1,1.0,0.535948,306,3


#### We will treat the "points" and "quarter" variables as objects.

In [18]:
Shotlog['points'] = Shotlog['points'].astype(object)
Shotlog['quarter'] = Shotlog['quarter'].astype(object)

#### Missing values
- Drop observations with a missing value in the lagged variable.


In [19]:
Shotlog=Shotlog[pd.notnull(Shotlog["lag_shot_hit"])]

#### Let's take a quick look at the number of variables and the number of observations in our clean dataframe.

In [20]:
Shotlog.shape

(185052, 21)
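One way to sanity-check the drop is sketched below on made-up data. Because each player-game group contributes exactly one shot with no predecessor, the number of rows removed should equal the number of player-game groups. In the notebook, 210072 - 185052 = 25020 rows were removed, which would be the number of player-game combinations under that reasoning. The sketch also shows `dropna(subset=...)`, an equivalent way to write the filter.

```python
import pandas as pd

# Toy data: two player-game groups, each with one missing lag (the first shot)
toy = pd.DataFrame({
    "shoot_player": ["A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2016-10-27"] * 5),
    "lag_shot_hit": [float("nan"), 1.0, float("nan"), 0.0, 0.0],
})

# Filter used in the notebook
kept_notnull = toy[pd.notnull(toy["lag_shot_hit"])]

# Equivalent formulation with dropna
kept_dropna = toy.dropna(subset=["lag_shot_hit"])

# Rows removed versus number of player-game groups
dropped = len(toy) - len(kept_dropna)
n_groups = toy.groupby(["shoot_player", "date"]).ngroups

print(dropped, n_groups)
```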

### Save our updated data

In [21]:
Shotlog.to_csv("/content/drive/MyDrive/Colab Notebooks/Sports Performance Analytics Specialization/Course 1 - Foundations of Sports Analytics: Data, Representation, and Models in Sports/work/Data/Week 6/Shotlog1.csv", index=False)
Player_Stats.to_csv("/content/drive/MyDrive/Colab Notebooks/Sports Performance Analytics Specialization/Course 1 - Foundations of Sports Analytics: Data, Representation, and Models in Sports/work/Data/Week 6/Player_Stats1.csv", index=False)
Player_Shots.to_csv("/content/drive/MyDrive/Colab Notebooks/Sports Performance Analytics Specialization/Course 1 - Foundations of Sports Analytics: Data, Representation, and Models in Sports/work/Data/Week 6/Player_Shots1.csv", index=False)
Player_Game.to_csv("/content/drive/MyDrive/Colab Notebooks/Sports Performance Analytics Specialization/Course 1 - Foundations of Sports Analytics: Data, Representation, and Models in Sports/work/Data/Week 6/Player_Game1.csv", index=False)