## Capstone 2: Data Wrangling


### Overview

How can we use MLB pitch data from 2015-2018 to predict or assess which type of pitch should be thrown in a given at-bat?

This question can be approached from the pitching team’s perspective (what is the ideal pitch for a given situation?) and from the batting team’s perspective (what pitch should I expect, assuming the pitcher chooses the optimal pitch?). Using ab_id, we can link the rows in the pitches csv to the rows in the atbats csv and examine the outcome of each at-bat alongside the exact type and order of the pitches thrown. We can then use batter_id and pitcher_id to look up a specific hitter or pitcher in the player_names csv.

The data comes from Kaggle (https://www.kaggle.com/pschale/mlb-pitch-data-20152018?select=pitches.csv), which was scraped from http://gd2.mlb.com/components/game/mlb/.
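The linking described above can be sketched as follows. This is a minimal illustration, not the notebook's actual join: the three small dataframes are made-up stand-ins for the csv files, and the `pitcher_first` / `pitcher_last` column names are hypothetical. It also assumes ab_id has no missing values in pitches, since the cast to an integer would fail otherwise. The cast is included because ab_id appears as a float in the `pitches.head()` output but as int64 in atbats.

```python
import pandas as pd

# Made-up stand-ins for pitches.csv, atbats.csv and player_names.csv.
pitches = pd.DataFrame({
    "ab_id": [2015000001.0, 2015000001.0, 2015000002.0],  # float, as in the raw file
    "pitch_type": ["FF", "SL", "CH"],
    "pitch_num": [1.0, 2.0, 1.0],
})
atbats = pd.DataFrame({
    "ab_id": [2015000001, 2015000002],
    "batter_id": [111, 222],
    "pitcher_id": [333, 333],
    "event": ["Groundout", "Single"],
})
names = pd.DataFrame({
    "id": [111, 222, 333],
    "first_name": ["A", "B", "C"],
    "last_name": ["Batter", "Hitter", "Pitcher"],
})

# Cast the float key to int so both sides of the join share a dtype
# (assumes no missing ab_id values).
pitches["ab_id"] = pitches["ab_id"].astype("int64")

# Many pitches belong to one at-bat; validate raises if ab_id repeats in atbats.
merged = pitches.merge(atbats, on="ab_id", how="left", validate="many_to_one")

# Attach the pitcher's name; the renamed columns are hypothetical labels.
pitcher_names = names.rename(columns={
    "id": "pitcher_id",
    "first_name": "pitcher_first",
    "last_name": "pitcher_last",
})
merged = merged.merge(pitcher_names, on="pitcher_id", how="left", validate="many_to_one")

print(merged[["ab_id", "pitch_num", "pitch_type", "event", "pitcher_last"]])
```

The same rename-and-merge step on batter_id would attach the hitter's name.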


In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

from library.sb_utils import save_file


### Import the data
#### Start with a small file (player_names), then load the rest

In [17]:
#the CSV data files are in the data/raw directory
#player names
names = pd.read_csv('../data/raw/player_names.csv')

In [18]:
names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          2218 non-null   int64 
 1   first_name  2218 non-null   object
 2   last_name   2218 non-null   object
dtypes: int64(1), object(2)
memory usage: 52.1+ KB


In [19]:
names.head()

|   | id | first_name | last_name |
|---|---|---|---|
| 0 | 452657 | Jon | Lester |
| 1 | 425794 | Adam | Wainwright |
| 2 | 457435 | Phil | Coke |
| 3 | 435400 | Jason | Motte |
| 4 | 519166 | Neil | Ramirez |


## Import Games and At-Bats Data

In [20]:
#the CSV data files are in the data/raw directory
#at bat info
atbats = pd.read_csv('../data/raw/atbats.csv')
#games info 
games = pd.read_csv('../data/raw/games.csv')


In [21]:
atbats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 740389 entries, 0 to 740388
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   ab_id       740389 non-null  int64 
 1   batter_id   740389 non-null  int64 
 2   event       740389 non-null  object
 3   g_id        740389 non-null  int64 
 4   inning      740389 non-null  int64 
 5   o           740389 non-null  int64 
 6   p_score     740389 non-null  int64 
 7   p_throws    740389 non-null  object
 8   pitcher_id  740389 non-null  int64 
 9   stand       740389 non-null  object
 10  top         740389 non-null  bool  
dtypes: bool(1), int64(7), object(3)
memory usage: 57.2+ MB


In [24]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9718 entries, 0 to 9717
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   attendance        9718 non-null   int64 
 1   away_final_score  9718 non-null   int64 
 2   away_team         9718 non-null   object
 3   date              9718 non-null   object
 4   elapsed_time      9718 non-null   int64 
 5   g_id              9718 non-null   int64 
 6   home_final_score  9718 non-null   int64 
 7   home_team         9718 non-null   object
 8   start_time        9718 non-null   object
 9   umpire_1B         9718 non-null   object
 10  umpire_2B         9715 non-null   object
 11  umpire_3B         9718 non-null   object
 12  umpire_HP         9718 non-null   object
 13  venue_name        9718 non-null   object
 14  weather           9718 non-null   object
 15  wind              9718 non-null   object
 16  delay             9718 non-null   int64 
dtypes: int64(6), object(11)

## Import Pitch Data (800 MB)

In [22]:
#the CSV data files are in the data/raw directory
pitches = pd.read_csv('../data/raw/pitches.csv')
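Because the pitches file is large, one optional way to reduce memory is to pass `usecols` and `dtype` to `pd.read_csv`. The sketch below is only an illustration and is not what the notebook does: it reads a three-row in-memory sample instead of the real file, and the pitch_type and ab_id values in the sample are made up.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for pitches.csv.
sample_csv = io.StringIO(
    "px,pz,start_speed,pitch_type,ab_id\n"
    "0.416,2.963,92.9,FF,2015000001\n"
    "-0.191,2.347,92.8,FF,2015000001\n"
    "-1.821,2.083,75.4,CU,2015000002\n"
)

# Load only the columns needed, store measurements as float32 and the
# pitch label as a category to cut memory use.
pitches_small = pd.read_csv(
    sample_csv,
    usecols=["px", "pz", "start_speed", "pitch_type", "ab_id"],
    dtype={
        "px": "float32",
        "pz": "float32",
        "start_speed": "float32",
        "pitch_type": "category",
    },
)

print(pitches_small.dtypes)
print(pitches_small.memory_usage(deep=True).sum(), "bytes")
```

Note that float32 keeps roughly seven significant digits, so this trade-off only makes sense for columns where that precision is enough.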

In [23]:
pitches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2867154 entries, 0 to 2867153
Data columns (total 40 columns):
 #   Column           Dtype  
---  ------           -----  
 0   px               float64
 1   pz               float64
 2   start_speed      float64
 3   end_speed        float64
 4   spin_rate        float64
 5   spin_dir         float64
 6   break_angle      float64
 7   break_length     float64
 8   break_y          float64
 9   ax               float64
 10  ay               float64
 11  az               float64
 12  sz_bot           float64
 13  sz_top           float64
 14  type_confidence  float64
 15  vx0              float64
 16  vy0              float64
 17  vz0              float64
 18  x                float64
 19  x0               float64
 20  y                float64
 21  y0               float64
 22  z0               float64
 23  pfx_x            float64
 24  pfx_z            float64
 25  nasty            float64
 26  zone             float64
 27  code        

In [27]:
pitches.head()

|   | px | pz | start_speed | end_speed | spin_rate | spin_dir | break_angle | break_length | break_y | ax | ... | event_num | b_score | ab_id | b_count | s_count | outs | pitch_num | on_1b | on_2b | on_3b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.416 | 2.963 | 92.9 | 84.1 | 2305.052 | 159.235 | -25.0 | 3.2 | 23.7 | 7.665 | ... | 3 | 0.0 | 2015000000.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.191 | 2.347 | 92.8 | 84.1 | 2689.935 | 151.402 | -40.7 | 3.4 | 23.7 | 12.043 | ... | 4 | 0.0 | 2015000000.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 2 | -0.518 | 3.284 | 94.1 | 85.2 | 2647.972 | 145.125 | -43.7 | 3.7 | 23.7 | 14.368 | ... | 5 | 0.0 | 2015000000.0 | 0.0 | 2.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.641 | 1.221 | 91.0 | 84.0 | 1289.59 | 169.751 | -1.3 | 5.0 | 23.8 | 2.104 | ... | 6 | 0.0 | 2015000000.0 | 0.0 | 2.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| 4 | -1.821 | 2.083 | 75.4 | 69.6 | 1374.569 | 280.671 | 18.4 | 12.0 | 23.8 | -10.28 | ... | 7 | 0.0 | 2015000000.0 | 1.0 | 2.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 |


## Check missing values for at-bats, players, pitches, and games

In [26]:
#missing values for each column in pitches
missingP = pd.concat([pitches.isnull().sum(), 100 * pitches.isnull().mean()], axis=1)
missingP.columns=['count', '%']
missingP.sort_values(by=['count'], ascending = False)

|   | count | % |
|---|---|---|
| px | 14189 | 0.494881 |
| type_confidence | 14189 | 0.494881 |
| pitch_type | 14189 | 0.494881 |
| zone | 14189 | 0.494881 |
| nasty | 14189 | 0.494881 |
| z0 | 14189 | 0.494881 |
| y0 | 14189 | 0.494881 |
| pz | 14189 | 0.494881 |
| x0 | 14189 | 0.494881 |
| vz0 | 14189 | 0.494881 |


In [28]:
#missing values for each column in atbats
missingAB = pd.concat([atbats.isnull().sum(), 100 * atbats.isnull().mean()], axis=1)
missingAB.columns=['count', '%']
missingAB.sort_values(by=['count'], ascending = False)

|   | count | % |
|---|---|---|
| ab_id | 0 | 0.0 |
| batter_id | 0 | 0.0 |
| event | 0 | 0.0 |
| g_id | 0 | 0.0 |
| inning | 0 | 0.0 |
| o | 0 | 0.0 |
| p_score | 0 | 0.0 |
| p_throws | 0 | 0.0 |
| pitcher_id | 0 | 0.0 |
| stand | 0 | 0.0 |


In [29]:
#missing values for players

missingPlayers = pd.concat([names.isnull().sum(), 100 * names.isnull().mean()], axis=1)
missingPlayers.columns=['count', '%']
missingPlayers.sort_values(by=['count'], ascending = False)

|   | count | % |
|---|---|---|
| id | 0 | 0.0 |
| first_name | 0 | 0.0 |
| last_name | 0 | 0.0 |


In [30]:
#missing values for games
missingG = pd.concat([games.isnull().sum(), 100 * games.isnull().mean()], axis=1)
missingG.columns=['count', '%']
missingG.sort_values(by=['count'], ascending = False)

|   | count | % |
|---|---|---|
| umpire_2B | 3 | 0.030871 |
| attendance | 0 | 0.0 |
| umpire_1B | 0 | 0.0 |
| wind | 0 | 0.0 |
| weather | 0 | 0.0 |
| venue_name | 0 | 0.0 |
| umpire_HP | 0 | 0.0 |
| umpire_3B | 0 | 0.0 |
| start_time | 0 | 0.0 |
| away_final_score | 0 | 0.0 |
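The four missing-value summaries above repeat the same three lines, so they could be folded into one helper. The function name `missing_summary` and the `demo` dataframe below are hypothetical and only illustrate the pattern.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest count first."""
    summary = pd.DataFrame({
        "count": df.isnull().sum(),
        "%": 100 * df.isnull().mean(),
    })
    return summary.sort_values(by="count", ascending=False)


# Made-up example: column "a" is missing one of its four values.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", "z", "w"],
})
result = missing_summary(demo)
print(result)
```

With the notebook's dataframes loaded, the calls would be `missing_summary(pitches)`, `missing_summary(atbats)`, `missing_summary(names)` and `missing_summary(games)`.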


### There are no missing values in the atbats and names dataframes

### Because no column in the pitches or games dataframes is missing more than 0.5% of its values (i.e. there are no large gaps), I will simply drop the rows that contain missing values
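The row-dropping step described above could look like the sketch below. It uses a small made-up dataframe instead of the real pitches data, and the variable names are hypothetical. Recording the row count before and after makes it easy to confirm how much data was removed.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for pitches: the second row is missing its tracking data.
demo = pd.DataFrame({
    "px": [0.416, np.nan, -0.518],
    "pitch_type": ["FF", None, "SL"],
    "ab_id": [2015000001, 2015000001, 2015000002],
})

rows_before = len(demo)

# Drop every row that has at least one missing value, then renumber the index.
cleaned = demo.dropna().reset_index(drop=True)

rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

The same pattern would apply to the games dataframe.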