# Project Title: Fantasy Baseball 2017 Pitching Advanced Metrics Exploration
Project goal: Uncover undervalued players, make predictions for 2018 statistics

MLB first collected these advanced metrics in 2015, so I'll use the 2015, 2016, and 2017 data.

I set the minimum FanGraphs threshold for pitching statistics at 10 IP. I wasn't sure where to set it, but that was the lowest threshold that would cut out position players. I will merge the Savant and FanGraphs data, as I did with the hitting data, for further analysis.

I will clean and merge the 2015, 2016, and 2017 data in a similar way, this time for pitching statistics.

*** TODO: The current Statcast data uses a minimum threshold of 10 AB. At some point, update it to include all players, to show the position players and the reason for creating the 10 IP threshold.

# MLB Statcast Data Dictionary:

* pitches - ?
* player_id - unique player id
* player_name - player's name
* total_pitches - number of pitches thrown
* pitch_percent - ?
* ba - batting average
* iso - isolated power
* babip - batting average on balls in play
* slg - slugging percentage
* woba - weighted on-base average
* xwoba - expected weighted on-base average
* xba - expected batting average
* hits - total base hits
* abs - total at bats
* launch_speed - average launch speed off of bat
* launch_angle - average launch angle
* spin_rate - average spin rate 
* velocity - average velocity
* effective_speed - average effective velocity
* whiffs - how many swings and misses
* swings - number of swings
* takes - number of pitches taken
* eff_min_vel - effective_speed minus velocity
* release_extension - release extension
* posX_int_start_distance - starting distance from each position in the field
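
Some of these columns are derived from others, which gives a quick sanity check on the definitions. The sketch below assumes the standard definition of isolated power (slugging percentage minus batting average) and uses two rows copied from the 2017 Statcast output further down in this notebook. The names `sample` and `iso_matches` are only for illustration.

```python
import pandas as pd

# Two rows copied from the 2017 Statcast output shown later in this notebook.
sample = pd.DataFrame({
    "player_name": ["Chad Qualls", "Domingo German"],
    "ba": [0.309, 0.333],
    "slg": [0.564, 0.515],
    "iso": [0.255, 0.182],
})

# Isolated power is slugging percentage minus batting average.
sample["iso_check"] = sample["slg"] - sample["ba"]

# Allow a small tolerance because the published values are rounded to 3 decimals.
iso_matches = bool(((sample["iso_check"] - sample["iso"]).abs() < 0.001).all())
print(sample)
print("iso equals slg - ba for every row:", iso_matches)
```

If a check like this fails on the full data, the column probably means something other than what the dictionary says.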


# FanGraphs Data Dictionary:

* Pitches - pitches thrown that year
* IP - innings pitched
* W - Wins
* L - Losses
* SV - Saves
* G - Games appeared in
* GS - Games Started
* K/9 - Strikeouts per 9 innings pitched
* BB/9 - Walks per 9 innings pitched
* HR/9 - Home Runs per 9 innings pitched
* BABIP - Batting average on balls in play
* LOB% - Left on base percentage
* GB% - Ground ball percentage
* LD% - Line drive percentage
* FB% - Fly ball percentage
* HR/FB - Home Runs per fly ball
* WHIP - Walks and hits per inning pitched
* ERA - Earned run average
* FIP - Fielding independent pitching
* xFIP - Expected fielding independent pitching
* SIERA - Skills interactive ERA
* WAR - Wins above replacement
* playerid - unique player id
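
One thing to watch in the IP column: the digit after the decimal point counts outs, not tenths, so 68.1 means 68 and 1/3 innings. Here is a minimal sketch of converting that notation before doing arithmetic with it, then using K/9 to back out an approximate strikeout total. The helper names `ip_to_innings` and `strikeouts_from_k9` are my own, and the numbers are Kenley Jansen's 2017 row from the FanGraphs output later in this notebook.

```python
def ip_to_innings(ip):
    """Convert box-score IP notation (e.g. 68.1) to true innings (68 + 1/3)."""
    tenths = round(ip * 10)           # 68.1 -> 681, avoids float noise
    whole, outs = divmod(tenths, 10)  # 681 -> (68, 1)
    if outs not in (0, 1, 2):
        raise ValueError(f"unexpected IP value: {ip}")
    return whole + outs / 3


def strikeouts_from_k9(k9, ip):
    """Back out an approximate strikeout count from K/9 and IP notation."""
    return round(k9 * ip_to_innings(ip) / 9)


# Kenley Jansen, 2017: IP = 68.1, K/9 = 14.36
jansen_innings = ip_to_innings(68.1)
jansen_strikeouts = strikeouts_from_k9(14.36, 68.1)
print(jansen_innings, jansen_strikeouts)
```

Treating 68.1 as a plain decimal would understate the innings slightly, which matters most for the low-IP relievers near the threshold.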

In [12]:
# import packages
import pandas as pd
pd.options.display.max_columns = None

In [13]:
# read in 2015, 2016, and 2017 baseball savant data
statcast_pitching_2015 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/statcast_pitching_2015.csv")
statcast_pitching_2016 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/statcast_pitching_2016.csv")
statcast_pitching_2017 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/statcast_pitching_2017.csv")

In [14]:
# look at background info, mostly floats and ints with one object (player_name)
statcast_pitching_2015.info()
print("-------------------------------------")
statcast_pitching_2016.info()
print("-------------------------------------")
statcast_pitching_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 734 entries, 0 to 733
Data columns (total 31 columns):
pitches                    734 non-null int64
player_id                  734 non-null int64
player_name                734 non-null object
total_pitches              734 non-null int64
pitch_percent              734 non-null float64
ba                         734 non-null float64
iso                        734 non-null float64
babip                      734 non-null float64
slg                        734 non-null float64
woba                       734 non-null float64
xwoba                      734 non-null float64
xba                        734 non-null float64
hits                       734 non-null int64
abs                        734 non-null int64
launch_speed               734 non-null float64
launch_angle               734 non-null float64
spin_rate                  734 non-null int64
velocity                   734 non-null float64
effective_speed            734 non-null floa

In [15]:
# now let's look at the guts of each dataset to see what we're working with
statcast_pitching_2015.describe()

Unnamed: 0,pitches,player_id,total_pitches,pitch_percent,ba,iso,babip,slg,woba,xwoba,xba,hits,abs,launch_speed,launch_angle,spin_rate,velocity,effective_speed,whiffs,swings,takes,eff_min_vel,release_extension
count,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0,734.0
mean,177.782016,508789.592643,957.074932,18.7747,0.335129,0.215211,0.300786,0.550349,0.371436,0.357084,0.317183,57.365123,174.467302,87.575886,10.495504,2105.133515,88.772752,88.160995,0.0,177.782016,0.0,-0.610082,6.010695
std,174.358857,75992.8292,915.912471,4.833262,0.086589,0.138991,0.084293,0.199445,0.109829,0.0734,0.051549,55.877938,171.369401,2.972343,6.568235,190.643689,3.781948,3.995576,0.0,174.358857,0.0,0.795764,0.465977
min,1.0,112526.0,1.0,6.67,0.0,0.0,0.0,0.0,0.0,0.012,0.01,0.0,1.0,62.6,-19.8,939.0,56.9,55.29,0.0,1.0,0.0,-4.6,3.99
25%,47.25,456549.75,254.0,16.62,0.305,0.149,0.27,0.46925,0.329,0.324,0.296,16.0,46.25,86.3,6.8,1990.25,86.825,86.1925,0.0,47.25,0.0,-1.1,5.73
50%,122.5,517503.5,680.0,18.52,0.333,0.192,0.3,0.527,0.3605,0.351,0.316,40.0,120.5,87.45,10.4,2120.5,89.2,88.57,0.0,122.5,0.0,-0.6,6.02
75%,216.0,571577.75,1163.0,20.28,0.364,0.25,0.333,0.60075,0.405,0.379,0.335,71.0,212.0,88.7,14.2,2226.75,91.2,90.8175,0.0,216.0,0.0,-0.1,6.32
max,696.0,648737.0,3493.0,100.0,1.0,1.5,1.0,2.5,1.45,1.078,0.771,228.0,677.0,105.2,39.5,2624.0,96.8,98.35,0.0,696.0,0.0,3.1,7.44


In [16]:
# the 2016 statcast pitching data
statcast_pitching_2016.describe()

Unnamed: 0,pitches,player_id,total_pitches,pitch_percent,ba,iso,babip,slg,woba,xwoba,xba,hits,abs,launch_speed,launch_angle,velocity,whiffs,swings,takes
count,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0,742.0
mean,173.623989,531248.679245,964.871968,18.417763,0.34245,0.219868,0.306318,0.562334,0.380325,0.364117,0.320999,56.973046,170.606469,87.728167,11.378167,88.941509,0.0,173.623989,0.0
std,168.804137,70388.885609,910.919235,4.844252,0.084218,0.118522,0.084603,0.165956,0.096499,0.06644,0.053279,54.911233,166.170502,2.59187,6.995957,3.661939,0.0,168.804137,0.0
min,1.0,112526.0,1.0,7.41,0.0,0.0,0.0,0.0,0.0,0.004,0.003,0.0,1.0,71.4,-25.6,64.0,0.0,1.0,0.0
25%,44.25,474477.5,249.0,16.1925,0.30725,0.15825,0.271,0.481,0.332,0.333,0.3,15.0,44.0,86.425,7.3,87.1,0.0,44.25,0.0
50%,117.5,543078.5,721.0,17.99,0.3365,0.211,0.305,0.546,0.3725,0.3615,0.322,39.5,115.5,87.8,11.3,89.3,0.0,117.5,0.0
75%,221.5,592776.0,1227.5,19.8575,0.375,0.267,0.333,0.62775,0.42175,0.39075,0.341,75.75,218.0,89.0,14.975,91.275,0.0,221.5,0.0
max,672.0,664641.0,3668.0,100.0,1.0,1.2,1.0,1.6,0.9,0.787,0.801,227.0,655.0,97.8,65.7,98.8,0.0,672.0,0.0


In [17]:
# and the 2017 statcast pitching data
statcast_pitching_2017.describe()

Unnamed: 0,pitches,player_id,total_pitches,pitch_percent,ba,iso,babip,slg,woba,xwoba,xba,hits,abs,launch_speed,launch_angle,spin_rate,velocity,effective_speed,whiffs,swings,takes,eff_min_vel,release_extension,pos3_int_start_distance,pos4_int_start_distance,pos5_int_start_distance,pos6_int_start_distance,pos7_int_start_distance,pos8_int_start_distance,pos9_int_start_distance
count,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0,752.0
mean,169.614362,547951.37367,963.543883,17.886077,0.349815,0.240016,0.309537,0.589831,0.393653,0.366311,0.32135,56.130319,166.831117,86.929255,11.805984,2184.404255,88.659043,88.200691,0.0,169.614362,0.0,-0.459043,6.028285,110.029255,149.992021,112.921543,146.81516,294.698138,317.480053,293.553191
std,157.627492,70297.536864,871.982247,3.408102,0.096862,0.125392,0.094881,0.188691,0.110408,0.06573,0.05152,52.282919,155.271943,2.722495,6.706682,192.971311,4.096873,4.326713,0.0,157.627492,0.0,0.775162,0.448702,3.909318,2.943759,4.502754,2.345437,5.852888,6.340085,5.456019
min,1.0,112526.0,5.0,7.63,0.0,0.0,0.0,0.0,0.0,0.001,0.0,0.0,1.0,70.8,-35.3,1357.0,58.4,56.9,0.0,1.0,0.0,-2.9,4.01,94.0,127.0,98.0,129.0,272.0,294.0,272.0
25%,48.0,501949.0,267.75,16.01,0.30775,0.17475,0.269,0.49775,0.34075,0.333,0.296,17.0,47.75,85.5,7.8,2059.5,87.0,86.35,0.0,48.0,0.0,-1.0,5.74,108.0,149.0,110.0,146.0,291.0,313.0,290.0
50%,128.5,570444.5,777.0,17.675,0.338,0.227,0.302,0.5675,0.382,0.359,0.317,41.0,126.5,86.8,11.9,2186.5,89.1,88.735,0.0,128.5,0.0,-0.5,6.03,110.0,150.0,112.0,147.0,295.0,318.0,294.0
75%,216.25,605201.0,1213.25,19.5275,0.3755,0.283,0.333,0.645,0.42625,0.38925,0.339,73.0,212.75,88.1,15.525,2315.5,91.1,90.7725,0.0,216.25,0.0,0.1,6.34,112.0,152.0,115.0,148.0,299.0,322.0,297.0
max,650.0,664701.0,3546.0,50.0,1.0,1.5,1.0,2.0,1.075,0.727,0.621,240.0,643.0,102.3,69.0,2748.0,99.5,100.05,0.0,650.0,0.0,2.3,7.59,131.0,168.0,154.0,158.0,316.0,334.0,313.0


Again, all three datasets have the same columns, which will make cleaning convenient. (describe() only summarizes numeric columns, so a summary above can show fewer columns than info() reports.)

We'll drop pitches, pitch_percent, whiffs (all 0s), takes (all 0s), eff_min_vel (the difference between velocity and effective velocity; we'll use both of those columns directly in our exploration), and all of the posX_int_start_distance columns (not a concern right now, but maybe later on).
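
Before dropping, it can be worth confirming that the frames really do share the same columns rather than eyeballing the summaries. This is a small sketch on toy frames, not the real data. `same_columns` is a hypothetical helper, and `errors="ignore"` is the pandas option that skips labels a frame doesn't have instead of raising a KeyError.

```python
import pandas as pd


def same_columns(frames):
    """Return True if every frame has the same columns in the same order."""
    first = list(frames[0].columns)
    return all(list(frame.columns) == first for frame in frames[1:])


# Toy frames standing in for two seasons of Statcast data.
season_a = pd.DataFrame({"player_id": [1], "pitches": [10], "velocity": [90.1]})
season_b = pd.DataFrame({"player_id": [2], "pitches": [12], "velocity": [88.4]})

columns_match = same_columns([season_a, season_b])

# 'pos3_int_start_distance' is absent here; errors="ignore" skips it quietly.
unwanted = ["pitches", "pos3_int_start_distance"]
trimmed = season_a.drop(columns=unwanted, errors="ignore")
print(columns_match, list(trimmed.columns))
```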

In [18]:
# drop specified columns from data
drop_cols = ['pitches', 'pitch_percent', 'whiffs', 'takes', 'eff_min_vel', 'pos3_int_start_distance',
             'pos4_int_start_distance', 'pos5_int_start_distance', 'pos6_int_start_distance',
             'pos7_int_start_distance', 'pos8_int_start_distance', 'pos9_int_start_distance']
for df in (statcast_pitching_2015, statcast_pitching_2016, statcast_pitching_2017):
    df.drop(drop_cols, axis=1, inplace=True)

The only matching column I could find between the Savant and FanGraphs data was total pitches. FanGraphs doesn't show ABs, so I am going to read in the FanGraphs data, look at the Savant data for the lowest number of pitches, and then drop all FanGraphs rows with fewer pitches than that. I set the initial threshold on the Savant data at 10 AB.

In [19]:
# read in 2015, 2016, and 2017 fangraphs data
fangraphs_pitching_2015 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/fangraphs_pitching_2015.csv")
fangraphs_pitching_2016 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/fangraphs_pitching_2016.csv")
fangraphs_pitching_2017 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/fangraphs_pitching_2017.csv")

In [20]:
fangraphs_pitching_2017.head()

Unnamed: 0,Name,Team,Pitches,IP,W,L,SV,G,GS,K/9,BB/9,HR/9,BABIP,LOB%,GB%,LD%,FB%,HR/FB,WHIP,ERA,FIP,xFIP,SIERA,WAR,playerid
0,Tyler Olson,Indians,300,20.0,1,0,1,30,0,8.1,2.7,0.0,0.25,100.0 %,52.9 %,21.6 %,25.5 %,0.0 %,0.95,0.0,2.41,3.56,3.53,0.5,14741
1,Jimmie Sherfy,Diamondbacks,156,10.2,2,0,1,11,0,7.59,1.69,0.0,0.192,100.0 %,53.8 %,3.8 %,42.3 %,0.0 %,0.66,0.0,2.03,3.87,3.36,0.3,15118
2,Austin Maddox,Red Sox,267,17.1,0,0,0,13,0,7.27,1.04,0.52,0.24,100.0 %,26.0 %,14.0 %,60.0 %,3.3 %,0.87,0.52,2.64,4.97,4.2,0.3,14241
3,Ben Heller,Yankees,181,11.0,1,0,0,9,0,7.36,4.91,0.0,0.179,90.9 %,46.4 %,14.3 %,39.3 %,0.0 %,1.0,0.82,3.16,4.94,4.89,0.2,15100
4,Kenley Jansen,Dodgers,1012,68.1,5,0,41,65,0,14.36,0.92,0.66,0.289,91.3 %,38.4 %,21.0 %,40.6 %,8.9 %,0.75,1.32,1.31,1.82,1.48,3.6,3096


In [21]:
# change fangraphs 'Name' column to 'player_name' to merge with savant data
# also rename the columns containing '%' or '/' (to 'perc' and 'per') to make them easier to analyze later
fangraphs_pitching_2015.rename(columns={'Name':'player_name', 'K/9':'Kper9', 'BB/9':'BBper9', 'HR/9':'HRper9', 'LOB%':'LOBperc', 
                                        'GB%':'GBperc', 'LD%':'LDperc', 'FB%':'FBperc', 'HR/FB':'HRperFB'}, inplace=True)
fangraphs_pitching_2016.rename(columns={'Name':'player_name', 'K/9':'Kper9', 'BB/9':'BBper9', 'HR/9':'HRper9', 'LOB%':'LOBperc', 
                                        'GB%':'GBperc', 'LD%':'LDperc', 'FB%':'FBperc', 'HR/FB':'HRperFB'}, inplace=True)
fangraphs_pitching_2017.rename(columns={'Name':'player_name', 'K/9':'Kper9', 'BB/9':'BBper9', 'HR/9':'HRper9', 'LOB%':'LOBperc', 
                                        'GB%':'GBperc', 'LD%':'LDperc', 'FB%':'FBperc', 'HR/FB':'HRperFB'}, inplace=True)

In [22]:
# convert the percentage columns (LOB%, GB%, LD%, FB%, HR/FB) from strings like '52.9 %' to fractions
perc_cols = ['LOBperc', 'GBperc', 'LDperc', 'FBperc', 'HRperFB']
for df in (fangraphs_pitching_2015, fangraphs_pitching_2016, fangraphs_pitching_2017):
    for col in perc_cols:
        df[col] = df[col].replace('%', '', regex=True).astype('float') / 100
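
To see what the percentage conversion does to a single column, here is a minimal sketch on a toy Series with values in the same '52.9 %' format as the FanGraphs export. It uses `Series.str.rstrip` rather than a regex replace, which is just an alternative I'm assuming behaves the same for this format.

```python
import pandas as pd

# Toy values in the same format as the FanGraphs percentage columns.
raw = pd.Series(["100.0 %", "52.9 %", "3.3 %"], name="LOB%")

# Strip the trailing space and percent sign, then convert to a 0-1 fraction.
fractions = raw.str.rstrip(" %").astype(float) / 100
print(fractions)
```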

In [23]:
# for 2017, the lowest pitch count among pitchers with at least 10 IP is 138, so we'd cut out players with < 138 pitches in savant
fangraphs_pitching_2017.sort_values(by='Pitches')

Unnamed: 0,player_name,Team,Pitches,IP,W,L,SV,G,GS,Kper9,BBper9,HRper9,BABIP,LOBperc,GBperc,LDperc,FBperc,HRperFB,WHIP,ERA,FIP,xFIP,SIERA,WAR,playerid
161,Aaron Wilkerson,Brewers,138,10.1,1,0,0,3,2,6.10,0.87,0.87,0.179,0.536,0.448,0.138,0.414,0.083,0.68,3.48,3.35,4.16,4.11,0.2,16904
1,Jimmie Sherfy,Diamondbacks,156,10.2,2,0,1,11,0,7.59,1.69,0.00,0.192,1.000,0.538,0.038,0.423,0.000,0.66,0.00,2.03,3.87,3.36,0.3,15118
277,Logan Verrett,Orioles,157,10.2,2,0,0,4,0,7.59,2.53,2.53,0.267,0.848,0.367,0.133,0.500,0.200,1.31,4.22,6.53,5.38,4.29,-0.2,12905
45,Chaz Roe,- - -,164,10.2,0,0,0,12,0,10.97,4.22,0.84,0.250,0.690,0.560,0.200,0.240,0.167,1.13,2.53,3.63,3.41,3.37,0.1,9866
10,Victor Arano,Phillies,172,10.2,1,0,0,10,0,10.97,3.38,0.00,0.240,0.800,0.440,0.200,0.360,0.000,0.94,1.69,1.85,3.35,3.18,0.3,15915
412,Jason Hursh,Braves,173,10.2,1,0,0,9,0,5.91,3.38,0.84,0.353,0.723,0.571,0.229,0.200,0.143,1.59,5.06,4.47,4.42,4.26,0.0,14914
67,Chih-Wei Hu,Rays,176,10.0,1,1,0,6,0,8.10,3.60,1.80,0.120,0.807,0.370,0.148,0.481,0.154,0.90,2.70,5.16,4.87,4.27,-0.1,16419
3,Ben Heller,Yankees,181,11.0,1,0,0,9,0,7.36,4.91,0.00,0.179,0.909,0.464,0.143,0.393,0.000,1.00,0.82,3.16,4.94,4.89,0.2,15100
574,Josh Lindblom,Pirates,186,10.1,0,0,0,4,0,8.71,2.61,0.00,0.474,0.571,0.395,0.211,0.395,0.000,2.03,7.84,2.09,4.68,4.06,0.2,7882
46,Jake Diekman,Rangers,188,10.2,0,0,1,11,0,10.97,8.44,0.84,0.143,0.873,0.591,0.136,0.273,0.167,1.31,2.53,4.75,4.53,5.30,0.0,5003


In [24]:
# threshold for 10IP in 2016 was 127
fangraphs_pitching_2016.sort_values(by='Pitches')

Unnamed: 0,player_name,Team,Pitches,IP,W,L,SV,G,GS,Kper9,BBper9,HRper9,BABIP,LOBperc,GBperc,LDperc,FBperc,HRperFB,WHIP,ERA,FIP,xFIP,SIERA,WAR,playerid
13,Ryan Merritt,Indians,127,11.0,1,0,0,4,1,4.91,0.00,0.00,0.194,0.667,0.533,0.200,0.267,0.000,0.55,1.64,2.06,3.27,3.37,0.3,13688
327,Chad Girodo,Blue Jays,144,10.1,0,0,0,14,0,4.35,1.74,2.61,0.242,0.918,0.694,0.167,0.139,0.600,1.26,4.35,6.82,3.86,3.11,-0.2,15542
18,Tyler Wagner,Diamondbacks,145,10.0,1,0,0,3,0,6.30,1.80,0.00,0.290,0.727,0.484,0.194,0.323,0.000,1.10,1.80,2.35,4.01,3.83,0.4,13796
208,Phil Coke,- - -,166,10.0,0,0,0,6,0,3.60,6.30,0.90,0.281,0.783,0.394,0.303,0.303,0.100,1.70,3.60,6.05,6.41,6.52,-0.1,5535
440,Josh Edgin,Mets,166,10.1,1,0,0,16,0,9.58,5.23,0.87,0.333,0.685,0.429,0.214,0.357,0.100,1.55,5.23,4.02,4.37,4.22,0.0,10796
207,Chris Stratton,Giants,168,10.0,1,0,0,7,0,5.40,4.50,0.90,0.323,0.822,0.406,0.156,0.438,0.071,1.60,3.60,4.75,5.78,5.35,-0.1,13761
3,Dan Altavilla,Mariners,172,12.1,0,0,0,15,0,7.30,0.73,0.00,0.306,0.923,0.500,0.250,0.250,0.000,0.97,0.73,2.01,3.23,2.98,0.3,16507
149,Dalier Hinojosa,Phillies,177,11.0,0,1,0,10,0,6.55,2.45,0.82,0.281,0.776,0.515,0.152,0.333,0.091,1.18,3.27,3.69,4.18,3.93,0.1,15671
326,Juan Minaya,White Sox,180,10.1,1,0,0,11,0,5.23,4.35,0.00,0.294,0.647,0.235,0.235,0.529,0.000,1.45,4.35,4.02,6.92,5.82,0.0,10341
559,Joseph Colon,Indians,186,10.0,1,3,0,11,0,9.00,6.30,1.80,0.323,0.617,0.333,0.212,0.455,0.133,1.90,7.20,5.85,5.74,4.97,-0.1,4920


In [25]:
# threshold for 10IP in 2015 was 142
fangraphs_pitching_2015.sort_values(by='Pitches')

Unnamed: 0,player_name,Team,Pitches,IP,W,L,SV,G,GS,Kper9,BBper9,HRper9,BABIP,LOBperc,GBperc,LDperc,FBperc,HRperFB,WHIP,ERA,FIP,xFIP,SIERA,WAR,playerid
0,Adam Ottavino,Rockies,142,10.1,1,0,3,10,0,11.32,1.74,0.00,0.158,1.000,0.632,0.053,0.316,0.000,0.48,0.00,1.49,2.35,1.62,0.4,1247
2,Jordan Walden,Cardinals,156,10.1,0,1,1,12,0,10.45,3.48,0.00,0.269,0.909,0.423,0.269,0.308,0.000,1.06,0.87,1.97,3.12,3.00,0.4,3271
195,Peter Moylan,Braves,158,10.1,1,0,0,22,0,6.97,0.00,0.87,0.314,0.660,0.694,0.083,0.222,0.125,1.16,3.48,2.84,2.73,2.02,0.2,4891
292,Stolmy Pimentel,Rangers,165,11.1,0,1,0,8,0,5.56,2.38,0.79,0.286,0.714,0.486,0.229,0.286,0.100,1.24,3.97,3.84,4.00,3.98,0.1,6415
464,Diego Moreno,Yankees,165,10.1,1,0,0,4,0,6.97,2.61,0.87,0.258,0.635,0.406,0.125,0.469,0.067,1.16,5.23,4.30,5.18,4.02,0.0,6238
20,Rex Brothers,Rockies,168,10.1,1,0,0,17,0,4.35,6.97,0.00,0.273,0.882,0.563,0.219,0.219,0.000,1.65,1.74,4.49,5.49,5.90,0.0,9794
490,Homer Bailey,Reds,171,11.1,0,1,0,2,2,2.38,3.18,2.38,0.317,0.823,0.524,0.167,0.310,0.231,1.76,5.56,7.10,5.36,5.65,-0.2,8362
78,Paco Rodriguez,Dodgers,175,10.1,0,0,0,18,0,6.97,2.61,0.00,0.323,0.769,0.484,0.290,0.226,0.000,1.26,2.61,2.46,3.46,3.50,0.2,13398
564,Jonathan Aro,Red Sox,183,10.1,0,1,0,6,0,6.97,3.48,1.74,0.371,0.679,0.162,0.297,0.541,0.100,1.84,6.97,5.26,5.60,5.04,-0.1,14699
540,Jeff Ferrell,Tigers,184,11.1,0,0,0,9,0,4.76,3.18,2.38,0.243,0.678,0.400,0.200,0.400,0.188,1.41,6.35,6.58,5.22,4.88,-0.2,11506
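
The three sorted views above can be condensed into one lookup of the lowest pitch count per season. This is a sketch on toy frames that hold only the three smallest Pitches values from each output above. With the real data the dictionary values would be the full FanGraphs frames.

```python
import pandas as pd

# Three smallest pitch counts per season, copied from the sorted outputs above.
fangraphs_by_year = {
    2015: pd.DataFrame({"Pitches": [142, 156, 158]}),
    2016: pd.DataFrame({"Pitches": [127, 144, 145]}),
    2017: pd.DataFrame({"Pitches": [138, 156, 157]}),
}

# Lowest pitch count per season among pitchers who met the 10 IP minimum.
min_pitches = {year: int(df["Pitches"].min()) for year, df in fangraphs_by_year.items()}
print(min_pitches)
```

Because the minimum moves from season to season, a single pitch-count cutoff has to be picked by judgment rather than read off the data.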


In [26]:
# the number of pitches needed to reach 10 IP isn't uniform, so i'm going to make the cutoff 250 pitches, which is ~15 IP
# some relevant pitchers may have thrown fewer than that, but the sample size is so small that we'll drop anyone under 250
statcast_pitching_2015 = statcast_pitching_2015[statcast_pitching_2015["total_pitches"] >= 250]
statcast_pitching_2016 = statcast_pitching_2016[statcast_pitching_2016["total_pitches"] >= 250]
statcast_pitching_2017 = statcast_pitching_2017[statcast_pitching_2017["total_pitches"] >= 250]

In [27]:
# will do same with fangraphs data
fangraphs_pitching_2015 = fangraphs_pitching_2015[fangraphs_pitching_2015["Pitches"] >= 250]
fangraphs_pitching_2016 = fangraphs_pitching_2016[fangraphs_pitching_2016["Pitches"] >= 250]
fangraphs_pitching_2017 = fangraphs_pitching_2017[fangraphs_pitching_2017["Pitches"] >= 250]

In [28]:
# double check to see what statcast data looks like now
statcast_pitching_2017.sort_values(by='total_pitches')

Unnamed: 0,player_id,player_name,total_pitches,ba,iso,babip,slg,woba,xwoba,xba,hits,abs,launch_speed,launch_angle,spin_rate,velocity,effective_speed,swings,release_extension
294,430589,Chad Qualls,252,0.309,0.255,0.269,0.564,0.363,0.353,0.297,17,55,87.3,2.2,2055,89.3,89.02,56,6.05
725,593334,Domingo German,253,0.333,0.182,0.303,0.515,0.344,0.260,0.246,11,33,82.5,8.5,2544,92.1,91.60,35,6.04
349,643329,Brad Goldberg,262,0.326,0.233,0.293,0.558,0.377,0.348,0.318,14,43,87.0,8.7,2139,93.2,93.23,43,6.26
117,502009,Mat Latos,263,0.373,0.412,0.304,0.784,0.484,0.385,0.320,19,51,88.9,12.8,1968,89.8,89.39,51,6.01
585,456379,Al Alburquerque,263,0.204,0.041,0.204,0.245,0.198,0.262,0.254,10,49,85.3,5.5,2366,91.4,90.92,49,5.88
275,506693,Henderson Alvarez,265,0.280,0.220,0.250,0.500,0.331,0.428,0.344,14,50,87.5,7.1,1774,88.8,88.33,50,6.00
492,545348,Austin Maddox,267,0.255,0.078,0.240,0.333,0.258,0.298,0.259,13,51,86.0,27.7,2281,90.7,89.68,51,5.83
317,594867,Myles Jaye,268,0.346,0.231,0.308,0.577,0.380,0.361,0.330,18,52,87.2,5.1,2113,87.2,86.70,54,5.91
60,500765,Dayan Diaz,268,0.500,0.441,0.452,0.941,0.609,0.484,0.429,17,34,90.3,7.0,2315,90.4,90.62,34,6.30
41,592869,Tyler Wilson,269,0.400,0.255,0.358,0.655,0.444,0.432,0.377,22,55,91.3,9.4,2255,87.7,88.12,56,6.53


In [30]:
# double check to see what fangraphs data looks like now
# surprisingly, the two sources don't report exactly the same pitch counts for some pitchers; not sure how that is possible, but it's a potential issue later
fangraphs_pitching_2017.sort_values(by='Pitches')

Unnamed: 0,player_name,Team,Pitches,IP,W,L,SV,G,GS,Kper9,BBper9,HRper9,BABIP,LOBperc,GBperc,LDperc,FBperc,HRperFB,WHIP,ERA,FIP,xFIP,SIERA,WAR,playerid
456,Chad Qualls,Rockies,252,16.2,1,1,0,19,0,5.94,2.70,1.62,0.264,0.618,0.574,0.111,0.315,0.176,1.32,5.40,5.08,4.55,4.30,0.0,2170
121,Domingo German,Yankees,253,14.1,0,1,0,7,0,11.30,5.65,0.63,0.294,0.753,0.545,0.212,0.242,0.125,1.40,3.14,3.44,3.52,3.93,0.1,17149
296,Henderson Alvarez,Phillies,261,14.2,0,1,0,3,3,3.68,6.75,1.23,0.250,0.811,0.460,0.200,0.340,0.118,1.70,4.30,6.36,6.65,7.14,-0.1,5669
532,Mat Latos,Blue Jays,262,15.0,0,1,0,3,3,6.00,4.80,3.00,0.304,0.810,0.440,0.140,0.420,0.238,1.80,6.60,7.96,6.11,5.77,-0.3,3815
587,Brad Goldberg,White Sox,262,12.0,0,0,0,11,0,2.25,10.50,1.50,0.293,0.699,0.488,0.256,0.256,0.182,2.33,8.25,8.82,8.29,8.37,-0.4,14776
40,Al Alburquerque,- - -,263,18.0,0,2,0,21,0,7.00,4.00,0.00,0.204,0.722,0.551,0.204,0.245,0.000,1.00,2.50,2.94,4.12,4.32,0.3,6324
547,Tyler Wilson,Orioles,265,15.1,2,2,0,9,1,5.28,2.35,1.76,0.358,0.570,0.382,0.309,0.309,0.176,1.70,7.04,5.51,4.93,4.95,-0.1,12691
2,Austin Maddox,Red Sox,267,17.1,0,0,0,13,0,7.27,1.04,0.52,0.240,1.000,0.260,0.140,0.600,0.033,0.87,0.52,2.64,4.97,4.20,0.3,14241
597,Dayan Diaz,Astros,268,13.0,1,1,0,10,1,13.85,2.77,2.08,0.452,0.417,0.412,0.324,0.265,0.333,1.62,9.00,4.00,2.24,2.56,0.0,11543
604,Myles Jaye,Tigers,268,12.2,1,2,0,5,2,2.84,7.11,1.42,0.308,0.461,0.491,0.264,0.245,0.154,2.21,12.08,7.66,7.43,6.77,-0.3,11769
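
To put a number on how far the two sources disagree, the pitch counts can be compared side by side after a merge on name. This is a sketch using four pitchers copied from the two 2017 outputs above. The column name `pitch_diff` is my own.

```python
import pandas as pd

# Pitch counts for the same pitchers, copied from the 2017 outputs above.
names = ["Chad Qualls", "Henderson Alvarez", "Mat Latos", "Tyler Wilson"]
fangraphs = pd.DataFrame({"player_name": names, "Pitches": [252, 261, 262, 265]})
statcast = pd.DataFrame({"player_name": names, "total_pitches": [252, 265, 263, 269]})

# Inner merge on name, then subtract to see where the sources disagree.
merged = fangraphs.merge(statcast, on="player_name")
merged["pitch_diff"] = merged["total_pitches"] - merged["Pitches"]
disagreements = merged[merged["pitch_diff"] != 0]
print(disagreements)
```

The differences are small here, but a pitcher sitting right at the 250 cutoff could in principle survive the filter in one source and not the other.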


In [31]:
# and now for the counts
print(len(fangraphs_pitching_2017))

571


In [32]:
# statcast 2017
print(len(statcast_pitching_2017))

571


In [33]:
# fangraphs 2016
print(len(fangraphs_pitching_2016))

554


In [34]:
# statcast 2016
print(len(statcast_pitching_2016))

554


In [35]:
# fangraphs 2015
print(len(fangraphs_pitching_2015))

556


In [37]:
# statcast 2015: thankfully, the savant and fangraphs counts match for 2015, 2016, and 2017
print(len(statcast_pitching_2015))

556


In [38]:
# merge fangraphs and savant data to see where we have mismatched names
common_2017 = fangraphs_pitching_2017.merge(statcast_pitching_2017, on='player_name')
fangraphs_pitching_2017[~fangraphs_pitching_2017.player_name.isin(common_2017.player_name)]

Unnamed: 0,player_name,Team,Pitches,IP,W,L,SV,G,GS,Kper9,BBper9,HRper9,BABIP,LOBperc,GBperc,LDperc,FBperc,HRperFB,WHIP,ERA,FIP,xFIP,SIERA,WAR,playerid
51,Samuel Tuivailala,Cardinals,643,42.1,3,3,0,37,0,7.23,2.34,0.85,0.258,0.849,0.488,0.195,0.317,0.103,1.09,2.55,3.7,4.11,3.89,0.2,13485
156,Jake Faria,Rays,1445,86.2,5,4,0,16,14,8.72,3.22,1.14,0.265,0.786,0.383,0.217,0.4,0.117,1.18,3.43,4.12,4.39,4.26,1.3,13699
199,Chase Bradford,Mets,507,33.2,2,0,0,28,0,7.22,3.48,0.8,0.27,0.67,0.559,0.147,0.294,0.1,1.28,3.74,3.87,4.3,4.17,0.2,12452
238,A.J. Ramos,- - -,1104,58.2,2,4,27,61,0,11.05,5.22,1.07,0.294,0.771,0.401,0.204,0.395,0.121,1.41,3.99,4.1,4.3,4.01,0.4,8350
263,Nathan Karns,Royals,749,45.1,2,2,0,9,8,10.13,2.58,1.79,0.283,0.807,0.496,0.124,0.38,0.196,1.19,4.17,4.48,3.71,3.59,0.4,12638
272,Jorge de la Rosa,Diamondbacks,852,51.1,3,1,0,65,0,7.89,3.68,1.23,0.273,0.764,0.452,0.164,0.384,0.125,1.31,4.21,4.58,4.75,4.33,0.0,2047
286,Lance McCullers,Astros,2014,118.2,7,4,0,22,22,10.01,3.03,0.61,0.33,0.676,0.613,0.192,0.195,0.127,1.3,4.25,3.1,3.17,3.41,3.0,14120
297,Jakob Junis,Royals,1525,98.1,9,3,0,20,16,7.32,2.29,1.37,0.294,0.728,0.401,0.197,0.401,0.123,1.28,4.3,4.55,4.77,4.49,0.9,13619
442,Matt Boyd,Tigers,2356,135.0,6,11,0,26,25,7.33,3.53,1.2,0.33,0.687,0.381,0.223,0.395,0.106,1.56,5.27,4.51,5.01,4.94,1.9,15440
472,Luke Sims,Braves,985,57.2,3,6,0,14,10,6.87,3.59,1.4,0.314,0.689,0.381,0.227,0.392,0.13,1.51,5.62,5.07,5.16,5.05,0.2,13470


In [39]:
# doing the inverse of that shows what the 2017 names need to be changed to
statcast_pitching_2017[~statcast_pitching_2017.player_name.isin(common_2017.player_name)]

Unnamed: 0,player_id,player_name,total_pitches,ba,iso,babip,slg,woba,xwoba,xba,hits,abs,launch_speed,launch_angle,spin_rate,velocity,effective_speed,swings,release_extension
120,596001,Jake Junis,1529,0.334,0.222,0.297,0.556,0.374,0.386,0.33,101,302,88.9,14.5,2001,87.7,87.22,308,5.85
204,501992,Nate Karns,752,0.336,0.295,0.283,0.631,0.409,0.396,0.312,41,122,88.0,10.0,2217,89.1,89.87,122,6.77
263,607188,Jacob Faria,1445,0.303,0.208,0.263,0.511,0.34,0.352,0.312,70,231,87.5,17.8,1949,87.3,86.25,236,5.74
422,621121,Lance McCullers Jr.,2015,0.351,0.172,0.332,0.523,0.373,0.34,0.315,114,325,86.5,0.3,2531,89.2,88.5,329,5.84
465,592815,Sam Tuivailala,654,0.285,0.154,0.261,0.439,0.309,0.349,0.315,35,123,86.2,11.0,2240,90.1,89.23,124,5.82
479,607473,Chasen Bradford,511,0.294,0.157,0.27,0.451,0.318,0.319,0.29,30,102,86.2,6.1,2331,88.8,88.66,103,6.18
535,571510,Matthew Boyd,2368,0.365,0.228,0.333,0.593,0.402,0.363,0.316,157,430,85.7,14.8,2179,86.8,86.4,439,6.05
622,407822,Jorge De La Rosa,868,0.313,0.204,0.277,0.517,0.348,0.316,0.286,46,147,84.9,11.1,1747,89.6,88.37,150,5.55
623,608371,Lucas Sims,993,0.36,0.27,0.324,0.629,0.407,0.352,0.311,64,178,84.9,13.4,2349,87.1,85.47,184,5.19
625,573109,AJ Ramos,1108,0.331,0.209,0.296,0.541,0.369,0.345,0.299,49,148,84.9,12.7,2238,86.0,84.84,150,5.65
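
Both directions of the mismatch check can also come out of a single outer merge using the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. This is a sketch on three pitchers taken from the outputs above, not a replacement for the cells in this notebook.

```python
import pandas as pd

# A few names from the 2017 outputs above: one match and two mismatches.
fangraphs = pd.DataFrame({
    "player_name": ["Chad Qualls", "Jake Faria", "Matt Boyd"],
    "Pitches": [252, 1445, 2356],
})
statcast = pd.DataFrame({
    "player_name": ["Chad Qualls", "Jacob Faria", "Matthew Boyd"],
    "total_pitches": [252, 1445, 2368],
})

# Outer merge keeps unmatched rows from both sides; _merge says where each came from.
both = fangraphs.merge(statcast, on="player_name", how="outer", indicator=True)
only_fangraphs = sorted(both.loc[both["_merge"] == "left_only", "player_name"])
only_statcast = sorted(both.loc[both["_merge"] == "right_only", "player_name"])
print(only_fangraphs)
print(only_statcast)
```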


In [40]:
# print the 2016 differences: some familiar names and some new ones, that's good
common_2016 = fangraphs_pitching_2016.merge(statcast_pitching_2016, on='player_name')
fangraphs_pitching_2016[~fangraphs_pitching_2016.player_name.isin(common_2016.player_name)]

Unnamed: 0,player_name,Team,Pitches,IP,W,L,SV,G,GS,Kper9,BBper9,HRper9,BABIP,LOBperc,GBperc,LDperc,FBperc,HRperFB,WHIP,ERA,FIP,xFIP,SIERA,WAR,playerid
88,A.J. Ramos,Marlins,1117,64.0,1,4,40,67,0,10.27,4.92,0.14,0.309,0.781,0.364,0.259,0.377,0.016,1.36,2.81,2.9,4.28,3.94,1.4,8350
143,Lance McCullers,Astros,1342,81.0,6,5,0,14,14,11.78,5.0,0.56,0.383,0.814,0.573,0.216,0.211,0.119,1.54,3.22,3.0,3.06,3.68,2.1,14120
313,Rubby de la Rosa,Diamondbacks,857,50.2,4,5,0,13,10,9.59,3.55,1.42,0.257,0.735,0.514,0.183,0.303,0.186,1.24,4.26,4.49,3.85,3.86,0.4,3862
353,Matt Boyd,Tigers,1683,97.1,6,5,0,20,18,7.58,2.68,1.57,0.286,0.744,0.381,0.17,0.449,0.129,1.29,4.53,4.75,4.74,4.38,0.9,15440
429,Nathan Karns,Mariners,1679,94.1,6,2,1,22,15,9.64,4.29,1.05,0.327,0.69,0.403,0.232,0.365,0.115,1.48,5.15,4.05,4.23,4.23,1.2,12638
462,Jon Niese,- - -,2010,121.0,8,7,0,29,20,6.55,3.5,1.86,0.313,0.738,0.511,0.204,0.285,0.221,1.59,5.5,5.62,4.49,4.65,-0.7,4424
463,Jorge de la Rosa,Rockies,2343,134.0,8,9,0,27,24,7.25,4.23,1.54,0.325,0.69,0.473,0.21,0.317,0.169,1.64,5.51,5.36,4.81,4.85,0.4,2047


In [41]:
# what the 2016 names need to be changed to
statcast_pitching_2016[~statcast_pitching_2016.player_name.isin(common_2016.player_name)]

Unnamed: 0,player_id,player_name,total_pitches,ba,iso,babip,slg,woba,xwoba,xba,hits,abs,launch_speed,launch_angle,spin_rate,velocity,effective_speed,swings,release_extension
213,501992,Nate Karns,1679,0.357,0.218,0.327,0.575,0.397,0.385,0.341,95,266,88.8,11.4,2194,88.6,88.86,268,6.65
287,407822,Jorge De La Rosa,2343,0.367,0.243,0.328,0.61,0.411,0.379,0.337,157,428,88.2,9.3,1725,85.4,84.84,435,5.63
313,477003,Jonathon Niese,2011,0.362,0.277,0.317,0.638,0.418,0.388,0.337,145,401,88.1,6.4,2022,87.2,86.99,408,6.3
321,523989,Rubby De La Rosa,857,0.303,0.239,0.259,0.542,0.354,0.356,0.305,43,142,88.0,6.0,2231,92.4,92.69,144,6.29
330,621121,Lance McCullers Jr.,1342,0.398,0.184,0.383,0.582,0.424,0.355,0.326,80,201,88.0,2.3,2556,89.8,89.28,201,5.92
442,571510,Matthew Boyd,1683,0.33,0.245,0.286,0.575,0.382,0.356,0.305,97,294,87.4,17.5,2285,87.4,87.25,297,6.36
615,573109,AJ Ramos,1117,0.325,0.069,0.313,0.394,0.305,0.357,0.329,52,160,85.9,13.9,2313,86.9,85.65,166,5.76


In [42]:
# and for the 2015 data
common_2015 = fangraphs_pitching_2015.merge(statcast_pitching_2015, on='player_name')
fangraphs_pitching_2015[~fangraphs_pitching_2015.player_name.isin(common_2015.player_name)]

Unnamed: 0,player_name,Team,Pitches,IP,W,L,SV,G,GS,Kper9,BBper9,HRper9,BABIP,LOBperc,GBperc,LDperc,FBperc,HRperFB,WHIP,ERA,FIP,xFIP,SIERA,WAR,playerid
49,A.J. Ramos,Marlins,1087,70.1,2,4,32,71,0,11.13,3.33,0.77,0.252,0.854,0.434,0.164,0.403,0.094,1.01,2.3,3.01,3.24,2.73,1.2,8350
130,Samuel Tuivailala,Cardinals,265,14.2,0,1,0,14,0,12.27,4.91,1.23,0.314,0.879,0.486,0.189,0.324,0.167,1.43,3.07,3.82,3.25,3.12,0.1,13485
155,Lance McCullers,Astros,2109,125.2,6,7,0,22,22,9.24,3.08,0.72,0.288,0.75,0.465,0.218,0.318,0.093,1.19,3.22,3.26,3.5,3.57,2.7,14120
233,Mitchell Harris,Cardinals,436,27.0,2,1,0,26,0,5.0,4.33,1.33,0.289,0.775,0.451,0.143,0.407,0.108,1.59,3.67,5.39,5.49,5.14,-0.3,4530
236,Nathan Karns,Rays,2441,147.0,7,5,0,27,26,8.88,3.43,1.16,0.285,0.787,0.419,0.216,0.365,0.128,1.28,3.67,4.09,3.9,3.9,1.5,12638
327,Jon Niese,Mets,2701,176.2,9,10,0,33,29,5.76,2.8,1.02,0.3,0.715,0.545,0.208,0.247,0.143,1.4,4.13,4.41,4.11,4.27,0.9,4424
337,Jorge de la Rosa,Rockies,2459,149.0,9,7,0,26,26,8.09,3.93,1.03,0.288,0.729,0.52,0.207,0.273,0.148,1.36,4.17,4.19,3.84,4.1,1.7,2047
412,Rubby de la Rosa,Diamondbacks,3014,188.2,14,9,0,32,32,7.16,3.01,1.53,0.288,0.73,0.491,0.181,0.328,0.168,1.36,4.67,4.81,4.1,4.19,0.5,3862
487,Justin De Fratus,Phillies,1445,80.0,0,2,0,61,0,7.65,3.6,1.01,0.335,0.662,0.44,0.204,0.356,0.101,1.55,5.51,4.28,4.46,4.03,-0.1,4955
575,Sugar Marimon,Braves,418,25.2,0,1,0,16,0,4.91,4.91,1.05,0.314,0.578,0.368,0.195,0.437,0.079,1.71,7.36,5.2,5.86,5.48,-0.3,6027


In [43]:
# savant rows with no fangraphs match: these are the spellings the 2015 fangraphs names need to be changed to
statcast_pitching_2015[~statcast_pitching_2015.player_name.isin(common_2015.player_name)]

Unnamed: 0,player_id,player_name,total_pitches,ba,iso,babip,slg,woba,xwoba,xba,hits,abs,launch_speed,launch_angle,spin_rate,velocity,effective_speed,swings,release_extension
158,543334,TJ House,277,0.412,0.118,0.392,0.529,0.405,0.401,0.368,21,51,88.9,-1.0,1838,86.0,84.6,52,5.52
177,518603,Justin DeFratus,1445,0.371,0.218,0.34,0.589,0.397,0.347,0.305,92,248,88.7,11.2,1918,88.4,88.4,257,6.54
193,571510,Matthew Boyd,1003,0.386,0.37,0.318,0.755,0.471,0.396,0.312,71,184,88.6,20.8,2169,86.3,85.95,188,6.21
259,621121,Lance McCullers Jr.,2110,0.312,0.188,0.288,0.5,0.345,0.358,0.318,106,340,88.2,9.1,2414,91.1,90.56,343,6.04
289,501992,Nate Karns,2441,0.324,0.201,0.288,0.525,0.358,0.372,0.327,132,408,88.0,12.9,2235,88.2,88.7,415,6.68
363,516970,Sugar Ray Marimon,418,0.349,0.279,0.318,0.628,0.399,0.324,0.286,30,86,87.5,15.0,2086,89.0,87.45,89,5.55
366,523989,Rubby De La Rosa,3014,0.333,0.242,0.292,0.575,0.379,0.343,0.307,193,579,87.5,8.4,2145,91.5,90.96,592,5.98
522,518771,Mitch Harris,436,0.323,0.14,0.292,0.462,0.338,0.335,0.295,30,93,86.5,11.7,2079,92.1,92.25,94,6.35
555,477003,Jonathon Niese,2701,0.335,0.173,0.309,0.508,0.352,0.334,0.308,192,573,86.3,3.4,1932,85.7,85.26,593,6.23
582,502010,Jake Brigham,308,0.459,0.197,0.435,0.656,0.467,0.373,0.367,28,61,86.0,9.3,2251,89.4,88.56,63,5.91
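
The two lookups above can also be done in one pass. This is a minimal sketch, assuming toy stand-in frames (`fangraphs` and `statcast` here are hypothetical, with a few names borrowed from the outputs in this notebook): an outer merge with `indicator=True` adds a `_merge` column that says which source each name came from.

```python
import pandas as pd

# Toy stand-ins for the two sources, not the real 2015 data.
fangraphs = pd.DataFrame({"player_name": ["Jon Niese", "Nathan Karns", "Zack Greinke"]})
statcast = pd.DataFrame({"player_name": ["Jonathon Niese", "Nate Karns", "Zack Greinke"]})

# The outer merge keeps every name; `_merge` is 'left_only', 'right_only' or 'both'.
both = fangraphs.merge(statcast, on="player_name", how="outer", indicator=True)

# Names that appear in only one source are the ones that need a mapping entry.
only_fangraphs = sorted(both.loc[both["_merge"] == "left_only", "player_name"])
only_statcast = sorted(both.loc[both["_merge"] == "right_only", "player_name"])

print(only_fangraphs)
print(only_statcast)
```

Reading the two lists side by side shows which fangraphs spelling should map to which savant spelling.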


In [44]:
# change those that don't match from fangraphs names to savant names
fangraphs_pitching_2015['player_name'] = fangraphs_pitching_2015['player_name'].replace({'A.J. Ramos' : 'AJ Ramos', 
                                         'Jacob Brigham': 'Jake Brigham', 'Jon Niese': 'Jonathon Niese', 
                                         'Jorge de la Rosa': 'Jorge De La Rosa', 'Justin De Fratus': 'Justin DeFratus', 
                                         'Lance McCullers': 'Lance McCullers Jr.', 'Matt Boyd': 'Matthew Boyd', 
                                         'Mitchell Harris': 'Mitch Harris', 'Nathan Karns': 'Nate Karns',
                                         'Rubby de la Rosa': 'Rubby De La Rosa',
                                         'Samuel Tuivailala': 'Sam Tuivailala', 'Sugar Marimon': 'Sugar Ray Marimon', 
                                         'T.J. House': 'TJ House'})
fangraphs_pitching_2016['player_name'] = fangraphs_pitching_2016['player_name'].replace({'A.J. Ramos' : 'AJ Ramos', 
                                         'Jon Niese': 'Jonathon Niese', 'Jorge de la Rosa': 'Jorge De La Rosa', 
                                         'Lance McCullers': 'Lance McCullers Jr.', 'Matt Boyd': 'Matthew Boyd', 
                                         'Nathaniel Karns': 'Nate Karns', 'Rubby de la Rosa': 'Rubby De La Rosa'})
fangraphs_pitching_2017['player_name'] = fangraphs_pitching_2017['player_name'].replace({'A.J. Ramos' : 'AJ Ramos', 
                                         'Chase Bradford': 'Chasen Bradford', 'Jake Faria': 'Jacob Faria',
                                         'Jakob Junis': 'Jake Junis', 'Jorge de la Rosa': 'Jorge De La Rosa', 
                                         'Lance McCullers': 'Lance McCullers Jr.', 'Luke Sims': 'Lucas Sims', 
                                         'Matt Boyd': 'Matthew Boyd', 'Nathaniel Karns': 'Nate Karns',
                                         'Samuel Tuivailala': 'Sam Tuivailala'})
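
Many of the mismatches above are only punctuation, capitalization, or spacing differences. As a rough sketch (the helper `name_key` is hypothetical and not part of this notebook), a normalized key could catch those automatically, leaving only nicknames and suffixes for the manual mapping.

```python
import re

def name_key(name):
    """Hypothetical helper: lowercase the name and keep letters only."""
    return re.sub(r"[^a-z]", "", name.lower())

# Punctuation, case, and spacing variants collapse to the same key.
print(name_key("A.J. Ramos") == name_key("AJ Ramos"))
print(name_key("Justin De Fratus") == name_key("Justin DeFratus"))
print(name_key("Rubby de la Rosa") == name_key("Rubby De La Rosa"))

# Nicknames and suffixes still differ, so they still need a manual entry.
print(name_key("Jon Niese") == name_key("Jonathon Niese"))
print(name_key("Lance McCullers") == name_key("Lance McCullers Jr."))
```

A key like this would only be used for matching; the displayed names can stay as they are.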

In [45]:
# sort fangraphs and savant data by name
fangraphs_pitching_2015 = fangraphs_pitching_2015.sort_values(by='player_name')
fangraphs_pitching_2016 = fangraphs_pitching_2016.sort_values(by='player_name')
fangraphs_pitching_2017 = fangraphs_pitching_2017.sort_values(by='player_name')
statcast_pitching_2015 = statcast_pitching_2015.sort_values(by='player_name')
statcast_pitching_2016 = statcast_pitching_2016.sort_values(by='player_name')
statcast_pitching_2017 = statcast_pitching_2017.sort_values(by='player_name')

In [46]:
# reset index for all datasets
fangraphs_pitching_2015 = fangraphs_pitching_2015.reset_index(drop=True)
fangraphs_pitching_2016 = fangraphs_pitching_2016.reset_index(drop=True)
fangraphs_pitching_2017 = fangraphs_pitching_2017.reset_index(drop=True)
statcast_pitching_2015 = statcast_pitching_2015.reset_index(drop=True)
statcast_pitching_2016 = statcast_pitching_2016.reset_index(drop=True)
statcast_pitching_2017 = statcast_pitching_2017.reset_index(drop=True)

In [47]:
# let's look at the tails to see if we have consistency among the two datasets
fangraphs_pitching_2017.tail()

Unnamed: 0,player_name,Team,Pitches,IP,W,L,SV,G,GS,Kper9,BBper9,HRper9,BABIP,LOBperc,GBperc,LDperc,FBperc,HRperFB,WHIP,ERA,FIP,xFIP,SIERA,WAR,playerid
566,Zach Eflin,Phillies,1020,64.1,1,5,0,11,11,4.9,1.68,2.24,0.297,0.693,0.441,0.176,0.383,0.188,1.41,6.16,6.1,5.21,5.08,-0.2,13774
567,Zach McAllister,Indians,975,62.0,2,2,0,50,0,9.58,3.05,1.16,0.294,0.893,0.358,0.214,0.428,0.118,1.19,2.61,3.77,4.04,3.58,0.4,2895
568,Zack Godley,Diamondbacks,2436,155.0,8,9,0,26,25,9.58,3.08,0.87,0.28,0.752,0.553,0.185,0.262,0.147,1.14,3.37,3.41,3.32,3.67,3.5,14862
569,Zack Greinke,Diamondbacks,3163,202.1,17,7,0,32,32,9.56,2.0,1.11,0.285,0.753,0.468,0.18,0.352,0.134,1.07,3.2,3.31,3.34,3.48,5.1,1943
570,Zack Wheeler,Mets,1561,86.1,3,7,0,17,17,8.44,4.17,1.56,0.332,0.731,0.475,0.226,0.3,0.195,1.59,5.21,5.03,4.36,4.64,0.4,10310


In [48]:
# row counts and names line up, so the two datasets look consistent (pitch totals differ slightly for a few players)
statcast_pitching_2017.tail()

Unnamed: 0,player_id,player_name,total_pitches,ba,iso,babip,slg,woba,xwoba,xba,hits,abs,launch_speed,launch_angle,spin_rate,velocity,effective_speed,swings,release_extension
566,621107,Zach Eflin,1020,0.357,0.281,0.3,0.638,0.411,0.378,0.319,79,221,86.9,10.9,2142,89.8,89.87,228,6.41
567,502083,Zach McAllister,975,0.331,0.188,0.296,0.519,0.364,0.386,0.345,53,160,89.3,16.5,2216,93.3,94.82,161,7.05
568,643327,Zack Godley,2442,0.313,0.202,0.285,0.515,0.347,0.346,0.312,124,396,86.2,3.1,2202,88.4,87.62,404,5.69
569,425844,Zack Greinke,3163,0.322,0.221,0.287,0.543,0.365,0.358,0.306,172,534,86.5,9.1,2333,86.4,85.67,541,5.74
570,554430,Zack Wheeler,1565,0.369,0.238,0.329,0.608,0.417,0.379,0.331,96,260,87.0,8.3,2329,90.8,91.85,261,6.85


In [49]:
# merge fangraphs and savant pitching data on index (relies on both frames being sorted by player_name with a fresh index)
merged_pitching_2015 = pd.merge(fangraphs_pitching_2015, statcast_pitching_2015, right_index=True, left_index=True)
merged_pitching_2016 = pd.merge(fangraphs_pitching_2016, statcast_pitching_2016, right_index=True, left_index=True)
merged_pitching_2017 = pd.merge(fangraphs_pitching_2017, statcast_pitching_2017, right_index=True, left_index=True)
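
The index merge works only while both frames stay in exactly the same row order. A possible alternative, sketched here on toy frames (`fangraphs` and `statcast` are hypothetical stand-ins, with values copied from the 2017 tails above), is to merge on `player_name` and let pandas check that the match is one-to-one.

```python
import pandas as pd

# Toy stand-ins; note the rows are deliberately in a different order.
fangraphs = pd.DataFrame({"player_name": ["Zack Godley", "Zack Greinke"],
                          "IP": [155.0, 202.1]})
statcast = pd.DataFrame({"player_name": ["Zack Greinke", "Zack Godley"],
                         "velocity": [86.4, 88.4]})

# validate='one_to_one' raises MergeError if a name is duplicated on either side.
merged = fangraphs.merge(statcast, on="player_name", how="inner",
                         validate="one_to_one")

# An inner merge silently drops unmatched names, so check nothing was lost.
assert len(merged) == len(fangraphs)
print(merged)
```

With this approach the result has a single `player_name` column instead of `player_name_x` and `player_name_y`, so the later drop and rename steps would need adjusting.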

In [50]:
# check in to see what our merged data looks like
merged_pitching_2015.info()
print("-------------------------------------")
merged_pitching_2016.info()
print("-------------------------------------")
merged_pitching_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 556 entries, 0 to 555
Data columns (total 44 columns):
player_name_x        556 non-null object
Team                 556 non-null object
Pitches              556 non-null int64
IP                   556 non-null float64
W                    556 non-null int64
L                    556 non-null int64
SV                   556 non-null int64
G                    556 non-null int64
GS                   556 non-null int64
Kper9                556 non-null float64
BBper9               556 non-null float64
HRper9               556 non-null float64
BABIP                556 non-null float64
LOBperc              556 non-null float64
GBperc               556 non-null float64
LDperc               556 non-null float64
FBperc               556 non-null float64
HRperFB              556 non-null float64
WHIP                 556 non-null float64
ERA                  556 non-null float64
FIP                  556 non-null float64
xFIP                 556 no

In [51]:
# looking at player_name_x (from fangraphs) and player_name_y (from savant), looks like we had a successful merge
merged_pitching_2017

Unnamed: 0,player_name_x,Team,Pitches,IP,W,L,SV,G,GS,Kper9,BBper9,HRper9,BABIP,LOBperc,GBperc,LDperc,FBperc,HRperFB,WHIP,ERA,FIP,xFIP,SIERA,WAR,playerid,player_id,player_name_y,total_pitches,ba,iso,babip,slg,woba,xwoba,xba,hits,abs,launch_speed,launch_angle,spin_rate,velocity,effective_speed,swings,release_extension
0,A.J. Cole,Nationals,937,52.0,3,5,0,11,8,7.62,4.67,1.38,0.293,0.831,0.440,0.167,0.393,0.136,1.50,3.81,5.20,5.21,5.12,0.1,11467,595918,A.J. Cole,937,0.338,0.232,0.299,0.570,0.378,0.367,0.322,51,151,86.2,12.6,2082,89.6,89.77,155,6.49
1,A.J. Griffin,Rangers,1322,77.1,6,6,0,18,15,7.10,3.26,2.33,0.251,0.707,0.285,0.132,0.583,0.142,1.34,5.94,6.26,6.14,5.30,-0.1,11132,456167,A.J. Griffin,1322,0.318,0.343,0.251,0.661,0.404,0.351,0.280,76,239,86.5,23.4,2261,80.4,80.58,243,6.59
2,A.J. Schugel,Pirates,496,32.0,4,0,0,32,0,7.59,3.94,0.84,0.304,0.907,0.527,0.194,0.280,0.115,1.41,1.97,4.00,4.23,4.28,0.1,11432,519263,A.J. Schugel,504,0.333,0.172,0.308,0.505,0.354,0.330,0.293,31,93,87.8,7.0,2050,88.1,86.59,95,5.45
3,AJ Ramos,- - -,1104,58.2,2,4,27,61,0,11.05,5.22,1.07,0.294,0.771,0.401,0.204,0.395,0.121,1.41,3.99,4.10,4.30,4.01,0.4,8350,573109,AJ Ramos,1108,0.331,0.209,0.296,0.541,0.369,0.345,0.299,49,148,84.9,12.7,2238,86.0,84.84,150,5.65
4,Aaron Bummer,White Sox,374,22.0,1,3,0,30,0,6.95,6.14,1.64,0.167,0.769,0.544,0.140,0.316,0.222,1.27,4.50,6.16,5.25,5.42,-0.5,16258,607481,Aaron Bummer,378,0.232,0.250,0.170,0.482,0.290,0.315,0.276,13,56,86.0,3.6,1940,89.8,89.23,58,5.93
5,Aaron Loup,Blue Jays,1008,57.2,2,3,0,70,0,9.99,4.53,0.62,0.340,0.758,0.535,0.201,0.264,0.095,1.53,3.75,3.66,4.05,3.91,0.6,10343,571901,Aaron Loup,1028,0.366,0.137,0.350,0.503,0.367,0.307,0.294,59,161,82.7,3.8,2213,88.4,87.24,166,5.87
6,Aaron Nola,Phillies,2665,168.0,12,11,0,27,27,9.86,2.63,0.96,0.309,0.768,0.498,0.191,0.311,0.127,1.21,3.54,3.27,3.38,3.60,4.3,16149,605400,Aaron Nola,2673,0.338,0.200,0.311,0.537,0.374,0.340,0.306,154,456,85.5,8.8,2094,86.0,86.31,458,6.50
7,Aaron Sanchez,Blue Jays,613,36.0,1,3,0,8,8,6.00,5.00,1.50,0.310,0.714,0.475,0.238,0.287,0.171,1.72,4.25,5.74,5.30,5.62,0.0,11490,592717,Aaron Sanchez,613,0.344,0.205,0.310,0.549,0.384,0.374,0.324,42,122,87.1,7.3,2384,91.5,91.15,122,6.24
8,Adalberto Mejia,Twins,1752,98.0,4,7,0,21,21,7.81,4.04,1.19,0.328,0.760,0.393,0.226,0.380,0.112,1.57,4.50,4.65,5.03,4.95,1.0,13188,606167,Adalberto Mejia,1760,0.359,0.232,0.328,0.592,0.403,0.382,0.334,110,306,87.3,12.7,2389,89.3,88.56,309,5.87
9,Adam Conley,Marlins,1723,102.2,8,8,0,22,20,6.31,3.68,1.67,0.295,0.655,0.396,0.186,0.418,0.139,1.52,6.14,5.62,5.59,5.32,-0.1,14457,543045,Adam Conley,1739,0.344,0.254,0.303,0.598,0.390,0.370,0.317,114,331,86.5,13.4,2213,87.5,86.12,341,5.69


In [52]:
# drop columns we no longer need: team, the duplicate savant name (already verified to match), both id columns,
# and savant's total_pitches and babip (fangraphs already provides Pitches and BABIP)
merged_pitching_2015.drop(['Team', 'player_name_y', 'playerid', 'player_id', 'total_pitches', 'babip'], axis=1, inplace=True)
merged_pitching_2016.drop(['Team', 'player_name_y', 'playerid', 'player_id', 'total_pitches', 'babip'], axis=1, inplace=True)
merged_pitching_2017.drop(['Team', 'player_name_y', 'playerid', 'player_id', 'total_pitches', 'babip'], axis=1, inplace=True)

In [53]:
# i'd like to create one metric, SwingsperAB: swings divided by at bats for each pitcher
merged_pitching_2015['SwingsperAB'] = merged_pitching_2015['swings'] / merged_pitching_2015['abs']
merged_pitching_2016['SwingsperAB'] = merged_pitching_2016['swings'] / merged_pitching_2016['abs']
merged_pitching_2017['SwingsperAB'] = merged_pitching_2017['swings'] / merged_pitching_2017['abs']
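
As a quick worked example of the metric, here is a small sketch using the `swings` and `abs` values for two pitchers from the 2017 tail above (the frame `sample` is a hypothetical stand-in for the merged data).

```python
import pandas as pd

# swings and abs copied from the 2017 savant tail shown earlier.
sample = pd.DataFrame(
    {"swings": [541, 404], "abs": [534, 396]},
    index=["Zack Greinke", "Zack Godley"],
)

# Same calculation as the notebook cell: element-wise division of two columns.
sample["SwingsperAB"] = sample["swings"] / sample["abs"]

print(sample.round(3))
```

Both pitchers come out just above one swing per at bat (about 1.013 and 1.020).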

In [54]:
# making all column names consistent for readability moving forward
# (savant's 'babip' was dropped above, so it needs no entry here)
column_names = {'player_name_x':'Player_Name', 'Pitches':'Total_Pitches', 'ba':'BA',
                'iso':'ISO', 'slg':'SLG', 'woba':'wOBA', 'xwoba':'xwOBA', 'xba':'xBA',
                'hits':'Hits', 'abs':'ABs', 'launch_speed':'Launch_Speed', 'launch_angle':'Launch_Angle',
                'spin_rate':'Spin_Rate', 'velocity':'Velocity', 'effective_speed':'Effective_Speed',
                'swings':'Swings', 'release_extension':'Release_Extension'}
for merged in (merged_pitching_2015, merged_pitching_2016, merged_pitching_2017):
    merged.rename(columns=column_names, inplace=True)

In [94]:
# save merged pitching files as csv for further analysis
merged_pitching_2015.to_csv("C:/Users/avitosky/Documents/Baseball Project/merged_pitching_2015.csv")
merged_pitching_2016.to_csv("C:/Users/avitosky/Documents/Baseball Project/merged_pitching_2016.csv")
merged_pitching_2017.to_csv("C:/Users/avitosky/Documents/Baseball Project/merged_pitching_2017.csv")