$$ \textbf{Introduction to the project}$$

This project compares the shooting frequency and efficiency of 20 of the best NBA players of the 21st century. Using several metrics, models will be developed to estimate the probability that a given shot succeeds for each player. The challenge is to build a reliable model that accounts for the factors that influence a shot: defensive pressure, contact with defenders, body orientation toward the basket, ball control, the quality of the last pass, and the player's physical condition. The goal is a clear picture of these players' shooting abilities and of the dynamics behind their success or difficulties on the court.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
nba = pd.read_csv("2019-20_pbp.csv")

In [8]:
nba.head(3)

Unnamed: 0,URL,GameType,Location,Date,Time,WinningTeam,Quarter,SecLeft,AwayTeam,AwayPlay,...,EnterGame,LeaveGame,TurnoverPlayer,TurnoverType,TurnoverCause,TurnoverCauser,JumpballAwayPlayer,JumpballHomePlayer,JumpballPoss,Unnamed: 40
0,/boxscores/201910220TOR.html,regular,Scotiabank Arena Toronto Canada,October 22 2019,8:00 PM,TOR,1,720,NOP,Jump ball: D. Favors vs. M. Gasol (L. Ball gai...,...,,,,,,,D. Favors - favorde01,M. Gasol - gasolma01,L. Ball - balllo01,
1,/boxscores/201910220TOR.html,regular,Scotiabank Arena Toronto Canada,October 22 2019,8:00 PM,TOR,1,708,NOP,L. Ball misses 2-pt jump shot from 11 ft,...,,,,,,,,,,
2,/boxscores/201910220TOR.html,regular,Scotiabank Arena Toronto Canada,October 22 2019,8:00 PM,TOR,1,707,NOP,Offensive rebound by D. Favors,...,,,,,,,,,,


In [10]:
print('The number of rows in the dataframe is', nba.shape[0])
print('The number of columns in the dataframe is', nba.shape[1])

The number of rows in the dataframe is 539265
The number of columns in the dataframe is 41


In [16]:
# List the column names
columns = nba.columns
columns

Index(['URL', 'GameType', 'Location', 'Date', 'Time', 'WinningTeam', 'Quarter',
       'SecLeft', 'AwayTeam', 'AwayPlay', 'AwayScore', 'HomeTeam', 'HomePlay',
       'HomeScore', 'Shooter', 'ShotType', 'ShotOutcome', 'ShotDist',
       'Assister', 'Blocker', 'FoulType', 'Fouler', 'Fouled', 'Rebounder',
       'ReboundType', 'ViolationPlayer', 'ViolationType', 'TimeoutTeam',
       'FreeThrowShooter', 'FreeThrowOutcome', 'FreeThrowNum', 'EnterGame',
       'LeaveGame', 'TurnoverPlayer', 'TurnoverType', 'TurnoverCause',
       'TurnoverCauser', 'JumpballAwayPlayer', 'JumpballHomePlayer',
       'JumpballPoss', 'Unnamed: 40'],
      dtype='object')

In [17]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539265 entries, 0 to 539264
Data columns (total 41 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   URL                 539265 non-null  object 
 1   GameType            539265 non-null  object 
 2   Location            539265 non-null  object 
 3   Date                539265 non-null  object 
 4   Time                539265 non-null  object 
 5   WinningTeam         539265 non-null  object 
 6   Quarter             539265 non-null  int64  
 7   SecLeft             539265 non-null  int64  
 8   AwayTeam            539265 non-null  object 
 9   AwayPlay            272389 non-null  object 
 10  AwayScore           539265 non-null  int64  
 11  HomeTeam            539265 non-null  object 
 12  HomePlay            266868 non-null  object 
 13  HomeScore           539265 non-null  int64  
 14  Shooter             202397 non-null  object 
 15  ShotType            202397 non-nul

In [18]:
nba.describe()

Unnamed: 0,Quarter,SecLeft,AwayScore,HomeScore,ShotDist,Unnamed: 40
count,539265.0,539265.0,539265.0,539265.0,202397.0,0.0
mean,2.545066,331.896518,56.943722,58.136317,14.049778,
std,1.137296,207.61365,33.376788,33.806187,10.854431,
min,1.0,0.0,0.0,0.0,0.0,
25%,2.0,153.0,28.0,29.0,3.0,
50%,3.0,326.0,56.0,58.0,14.0,
75%,4.0,508.0,84.0,86.0,25.0,
max,6.0,720.0,159.0,158.0,88.0,
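
The summary above reports a count of 0 for `Unnamed: 40`, so that column holds no values at all. The following is a minimal sketch of one way to drop such columns, shown on a small toy frame rather than on `nba` itself. The toy values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `nba`; the values are made up for illustration.
toy = pd.DataFrame({
    "Quarter": [1, 1, 2],
    "ShotDist": [11.0, np.nan, 3.0],
    "Unnamed: 40": [np.nan, np.nan, np.nan],
})

# Drop every column in which all values are missing.
cleaned = toy.dropna(axis="columns", how="all")
print(list(cleaned.columns))
```

Applied to the real data, the same call would presumably remove only `Unnamed: 40`, since every other column has at least some non-null entries.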


$$\textbf{Missing values}$$

In [41]:
missing_percentage = (nba.isna().sum() / len(nba)) * 100

for column, value in missing_percentage.items():
    print(f"{column}: {value}")


URL: 0.0
GameType: 0.0
Location: 0.0
Date: 0.0
Time: 0.0
WinningTeam: 0.0
Quarter: 0.0
SecLeft: 0.0
AwayTeam: 0.0
AwayPlay: 49.48884129324173
AwayScore: 0.0
HomeTeam: 0.0
HomePlay: 50.51264220744903
HomeScore: 0.0
Shooter: 62.467988836657305
ShotType: 62.467988836657305
ShotOutcome: 62.467988836657305
ShotDist: 62.467988836657305
Assister: 89.70914114581885
Blocker: 97.94201366675011
FoulType: 90.97586529813728
Fouler: 90.97586529813728
Fouled: 91.17038932621253
Rebounder: 77.50938777780868
ReboundType: 77.50938777780868
ViolationPlayer: 99.63283357903813
ViolationType: 99.63283357903813
TimeoutTeam: 97.63604164928189
FreeThrowShooter: 90.1773710513384
FreeThrowOutcome: 90.1773710513384
FreeThrowNum: 90.1773710513384
EnterGame: 89.66222543647372
LeaveGame: 89.66222543647372
TurnoverPlayer: 93.8681353323505
TurnoverType: 93.85775082751523
TurnoverCause: 96.77598212381668
TurnoverCauser: 96.77598212381668
JumpballAwayPlayer: 99.62949570248394
JumpballHomePlayer: 99.62949570248394
Jumpbal
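
`Shooter`, `ShotType`, `ShotOutcome`, and `ShotDist` share exactly the same missing percentage (about 62.47%), which suggests they are filled only on rows that describe a shot. The sketch below shows one possible way to keep those rows and summarise attempts and success rate per shooter. It runs on a toy frame with hypothetical player names, and it assumes that `ShotOutcome` uses the label `"make"` for a successful shot; that label should be checked against the real data.

```python
import numpy as np
import pandas as pd

# Toy play-by-play rows; names and values are hypothetical.
toy = pd.DataFrame({
    "Shooter": ["A. Player - playea01", np.nan, "A. Player - playea01",
                "B. Player - playeb01", np.nan],
    "ShotOutcome": ["make", np.nan, "miss", "make", np.nan],  # assumed labels
    "ShotDist": [2.0, np.nan, 24.0, 15.0, np.nan],
})

# Keep only the rows that record a shot.
shots = toy.dropna(subset=["Shooter"])

# Shooting frequency (attempts) and efficiency (share of made shots) per player.
summary = (
    shots.assign(made=shots["ShotOutcome"].eq("make"))
    .groupby("Shooter")
    .agg(attempts=("made", "count"), fg_pct=("made", "mean"))
)
print(summary)
```

The high missing rates in the other event columns (fouls, rebounds, turnovers) would follow the same logic: each group is populated only for its own event type, so they are not necessarily data-quality problems.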

$$\textbf{Categorical \& quantitative variables}$$

In [59]:
# Number of distinct values per column (NaN is counted as a value)
for column in nba.columns:
    print(column, ':', nba[column].nunique(dropna=False))

URL : 1143
GameType : 2
Location : 34
Date : 193
Time : 24
WinningTeam : 30
Quarter : 6
SecLeft : 721
AwayTeam : 30
AwayPlay : 85920
AwayScore : 159
HomeTeam : 30
HomePlay : 83417
HomeScore : 158
Shooter : 531
ShotType : 6
ShotOutcome : 3
ShotDist : 89
Assister : 518
Blocker : 484
FoulType : 10
Fouler : 565
Fouled : 516
Rebounder : 528
ReboundType : 3
ViolationPlayer : 16
ViolationType : 7
TimeoutTeam : 31
FreeThrowShooter : 507
FreeThrowOutcome : 3
FreeThrowNum : 8
EnterGame : 533
LeaveGame : 514
TurnoverPlayer : 516
TurnoverType : 20
TurnoverCause : 2
TurnoverCauser : 503
JumpballAwayPlayer : 300
JumpballHomePlayer : 330
JumpballPoss : 355
Unnamed: 40 : 1
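
The counts above can be combined with the column dtypes to separate quantitative from categorical variables. The following is a rough sketch on a toy frame (illustrative values only) that splits columns by dtype with `select_dtypes`. It is one simple rule among several, not necessarily the split used later in this project.

```python
import numpy as np
import pandas as pd

# Toy frame with a mix of numeric and text columns; values are illustrative.
toy = pd.DataFrame({
    "GameType": ["regular", "regular", "playoff", "regular"],
    "Quarter": [1, 2, 3, 4],
    "ShotDist": [11.0, np.nan, 3.0, 25.0],
    "ShotOutcome": ["miss", np.nan, "make", "make"],
})

# Numeric dtypes are treated as quantitative, everything else as categorical.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Quantitative:", numeric_cols)
print("Categorical:", categorical_cols)
```

A dtype-based split is only a starting point. `Quarter`, for example, is stored as an integer but has just 6 distinct values in the data, so it may be more natural to treat it as categorical.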
