# Project Title: Fantasy Baseball 2018 Advanced Metrics Data Exploration
Project goal: Uncover undervalued players, make predictions for 2018 statistics

MLB first collected these advanced metrics in 2015, so I'll use data from the 2015, 2016, and 2017 seasons.

I initially set a minimum of 50 ABs, but most of the players under 75 ABs were pitchers, so I raised the minimum to 75 ABs.

This notebook cleans and merges the 2015, 2016, and 2017 data from MLB Baseball Savant and FanGraphs.

MLB Savant Data Dictionary:

pitches - number of pitches seen by the player in that specific season

player_id - unique player id, will be useful for year to year comparisons

player_name - Player's name

total_pitches - number of pitches seen by the player in that specific season (identical to pitches in this data)

pitch_percent - percentage of pitches seen? not exactly sure, but the column is '100' for every player

ba - batting average

iso - isolated power

babip - batting average on balls in play

slg - slugging percentage

woba - weighted on-base average

xwoba - expected weighted on-base average

xba - expected batting average

hits - total base hits

abs - total at bats

launch_speed - average launch speed off of bat

launch_angle - average launch angle

spin_rate - average spin rate of pitches seen

velocity - average velocity of pitches seen (pitch speed; exit velocity is launch_speed)

effective_speed - average effective speed of pitches seen

whiffs - how many swings and misses

swings - number of swings

takes - number of pitches taken

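eff_min_vel - appears to be effective_speed minus velocity (the 2017 means, 88.148 - 88.639, roughly match the column mean of -0.489)

release_extension - average release extension of the pitches seen (a pitcher-side measurement, not a hitting one)

posX_int_start_distance - starting depth of the fielder at position X (X = 3 through 9)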

In [1335]:
# import packages
import pandas as pd

In [1336]:
# read in 2015, 2016, and 2017 baseball savant data
savant_2015 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/savant_data_2015.csv")
savant_2016 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/savant_data_2016.csv")
savant_2017 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/savant_data_2017.csv")

In [1337]:
# sort players by alphabetical order for indexing
savant_2015 = savant_2015.sort_values(by='player_name')
savant_2016 = savant_2016.sort_values(by='player_name')
savant_2017 = savant_2017.sort_values(by='player_name')

In [1338]:
# reset index in savant data for merging later
savant_2015 = savant_2015.reset_index(drop=True)
savant_2016 = savant_2016.reset_index(drop=True)
savant_2017 = savant_2017.reset_index(drop=True)

In [1339]:
# look at background info, mostly floats and ints with one object (player_name)
savant_2015.info()
print("-------------------------------------")
savant_2016.info()
print("-------------------------------------")
savant_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467 entries, 0 to 466
Data columns (total 31 columns):
pitches                    467 non-null int64
player_id                  467 non-null int64
player_name                467 non-null object
total_pitches              467 non-null int64
pitch_percent              467 non-null float64
ba                         467 non-null float64
iso                        467 non-null float64
babip                      467 non-null float64
slg                        467 non-null float64
woba                       467 non-null float64
xwoba                      467 non-null float64
xba                        467 non-null float64
hits                       467 non-null int64
abs                        467 non-null int64
launch_speed               467 non-null float64
launch_angle               467 non-null float64
spin_rate                  467 non-null int64
velocity                   467 non-null float64
effective_speed            467 non-null floa

In [1340]:
# describe the 2015 data
savant_2015.describe()

Unnamed: 0,pitches,player_id,total_pitches,pitch_percent,ba,iso,babip,slg,woba,xwoba,...,takes,eff_min_vel,release_extension,pos3_int_start_distance,pos4_int_start_distance,pos5_int_start_distance,pos6_int_start_distance,pos7_int_start_distance,pos8_int_start_distance,pos9_int_start_distance
count,467.0,467.0,467.0,467.0,467.0,467.0,467.0,467.0,467.0,467.0,...,467.0,467.0,467.0,467.0,467.0,467.0,467.0,467.0,467.0,467.0
mean,1422.134904,496932.550321,1422.134904,100.0,0.249657,0.148099,0.296062,0.39773,0.310516,0.304615,...,743.873662,-0.621413,6.014518,109.1606,148.817987,109.513919,146.278373,291.475375,311.468951,291.321199
std,737.701078,81031.11449,737.701078,0.0,0.037963,0.059756,0.041732,0.080086,0.046327,0.044904,...,402.578256,0.210588,0.0872,3.311106,3.424705,8.100583,1.834231,7.941642,5.26799,6.671286
min,278.0,116338.0,278.0,100.0,0.125,0.0,0.181,0.152,0.177,0.177,...,128.0,-1.2,5.78,96.0,139.0,85.0,139.0,266.0,295.0,271.0
25%,772.0,452454.5,772.0,100.0,0.228,0.104,0.267,0.3445,0.2835,0.277,...,391.0,-0.8,5.96,107.0,147.0,103.5,145.0,287.0,308.0,287.0
50%,1415.0,500208.0,1415.0,100.0,0.254,0.147,0.299,0.4,0.313,0.302,...,709.0,-0.6,6.01,109.0,149.0,111.0,146.0,293.0,312.0,292.0
75%,2049.0,545839.5,2049.0,100.0,0.276,0.188,0.324,0.449,0.339,0.333,...,1033.0,-0.5,6.08,112.0,150.0,115.0,147.5,297.0,315.0,296.0
max,3021.0,656941.0,3021.0,100.0,0.338,0.356,0.412,0.649,0.467,0.446,...,1848.0,0.0,6.29,117.0,169.0,141.0,152.0,310.0,326.0,309.0


In [1341]:
# describe the 2016 data
savant_2016.describe()

Unnamed: 0,pitches,player_id,total_pitches,pitch_percent,ba,iso,babip,slg,woba,xwoba,...,takes,eff_min_vel,release_extension,pos3_int_start_distance,pos4_int_start_distance,pos5_int_start_distance,pos6_int_start_distance,pos7_int_start_distance,pos8_int_start_distance,pos9_int_start_distance
count,459.0,459.0,459.0,459.0,459.0,459.0,459.0,459.0,459.0,459.0,...,459.0,459.0,459.0,459.0,459.0,459.0,459.0,459.0,459.0,459.0
mean,1472.324619,512616.934641,1472.324619,100.0,0.25093,0.15619,0.296978,0.407109,0.316893,0.311919,...,776.413943,-0.537473,6.05342,108.657952,149.771242,110.655773,146.12854,294.265795,315.760349,292.583878
std,758.356172,78968.790969,758.356172,0.0,0.036263,0.058502,0.041438,0.076767,0.043414,0.040891,...,415.71431,0.212263,0.060611,3.513946,3.895182,8.323378,1.91262,7.326087,4.788915,6.207039
min,283.0,120074.0,283.0,100.0,0.094,0.008,0.135,0.153,0.111,0.185,...,111.0,-1.3,5.85,98.0,140.0,88.0,140.0,269.0,302.0,276.0
25%,811.0,456273.0,811.0,100.0,0.228,0.1145,0.271,0.352,0.286,0.282,...,417.0,-0.7,6.01,106.0,148.0,104.5,145.0,290.0,313.0,288.0
50%,1372.0,518692.0,1372.0,100.0,0.253,0.152,0.298,0.411,0.32,0.312,...,715.0,-0.5,6.05,108.0,149.0,112.0,146.0,296.0,316.0,293.0
75%,2108.0,572156.5,2108.0,100.0,0.275,0.194,0.3225,0.4585,0.346,0.339,...,1092.5,-0.4,6.095,112.0,151.0,115.0,148.0,299.5,319.0,297.0
max,3014.0,666560.0,3014.0,100.0,0.348,0.358,0.411,0.657,0.43,0.459,...,1813.0,0.2,6.23,118.0,171.0,142.0,151.0,309.0,329.0,310.0


In [1342]:
# describe the 2017 data
savant_2017.describe()

Unnamed: 0,pitches,player_id,total_pitches,pitch_percent,ba,iso,babip,slg,woba,xwoba,...,takes,eff_min_vel,release_extension,pos3_int_start_distance,pos4_int_start_distance,pos5_int_start_distance,pos6_int_start_distance,pos7_int_start_distance,pos8_int_start_distance,pos9_int_start_distance
count,456.0,456.0,456.0,456.0,456.0,456.0,456.0,456.0,456.0,456.0,...,456.0,456.0,456.0,456.0,456.0,456.0,456.0,456.0,456.0,456.0
mean,1497.885965,533524.054825,1497.885965,100.0,0.25211,0.165803,0.298447,0.417908,0.322906,0.310399,...,791.5,-0.489474,6.023575,109.572368,150.484649,111.054825,147.125,295.041667,317.682018,293.769737
std,735.232199,74635.920683,735.232199,0.0,0.036177,0.061619,0.042386,0.080516,0.044641,0.040224,...,405.669567,0.14608,0.069959,3.637295,3.787746,7.402647,2.064815,7.036515,4.969112,6.122346
min,277.0,134181.0,277.0,100.0,0.134,0.027,0.145,0.203,0.181,0.173,...,115.0,-1.0,5.75,99.0,142.0,90.0,142.0,271.0,303.0,274.0
25%,839.0,462042.0,839.0,100.0,0.231,0.12075,0.274,0.36675,0.295,0.283,...,432.25,-0.6,5.98,107.0,148.0,106.0,146.0,290.0,314.0,289.0
50%,1465.0,543388.5,1465.0,100.0,0.255,0.162,0.299,0.4175,0.323,0.311,...,766.5,-0.5,6.02,108.0,150.0,112.0,147.0,296.0,318.0,294.0
75%,2117.75,594784.5,2117.75,100.0,0.27625,0.20725,0.328,0.471,0.35,0.336,...,1114.75,-0.4,6.07,113.0,151.0,115.0,149.0,300.0,321.0,298.0
max,3028.0,664056.0,3028.0,100.0,0.346,0.392,0.42,0.69,0.451,0.446,...,1832.0,-0.1,6.21,119.0,167.0,134.0,153.0,311.0,331.0,309.0


All three datasets appear to have the same columns, so I'll explore the columns using the 2017 dataset.

I know that pitches, player_id, ba, iso, babip, slg, woba, xwoba, xba, hits, abs, whiffs, swings, and takes are all relevant columns, but I'm not sure about the others, so I will explore them now.

From the data above, total_pitches is a mirror of pitches and pitch_percent only has values of 100, so I'll focus on the others.
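
The observation above (one column mirrors another, one column is constant) can also be checked programmatically rather than by reading the describe() tables. This is only a minimal sketch on a made-up toy frame; `redundant_columns` is a hypothetical helper, not part of pandas or of this notebook.

```python
import pandas as pd

# Toy stand-in for one season of Savant data (values are illustrative only).
toy = pd.DataFrame({
    "pitches": [278, 1415, 3021],
    "total_pitches": [278, 1415, 3021],
    "pitch_percent": [100.0, 100.0, 100.0],
    "woba": [0.177, 0.313, 0.467],
})

def redundant_columns(df, reference="pitches"):
    """Hypothetical helper: flag columns that are constant or copy `reference`."""
    flagged = {}
    for col in df.columns:
        if col == reference:
            continue
        if df[col].nunique(dropna=False) == 1:
            # a single distinct value carries no information
            flagged[col] = "constant"
        elif (df[col] == df[reference]).all():
            # every row matches the reference column
            flagged[col] = "duplicate of " + reference
    return flagged

print(redundant_columns(toy))
```

On the real data this would be called as `redundant_columns(savant_2017)`, assuming the column names shown in the info() output.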

In [1343]:
# launch speed may or may not be relevant; the std is small relative to the mean
savant_2017['launch_speed'].describe()

count    456.000000
mean      81.481140
std        2.117513
min       74.000000
25%       80.100000
50%       81.700000
75%       83.000000
max       87.100000
Name: launch_speed, dtype: float64

In [1344]:
# simple correlation shows a moderate correlation with woba (the most "catch-all" stat in the savant data), so keep the variable for now
savant_2017['launch_speed'].corr(savant_2017['woba'])

0.47280094010323959

In [1345]:
# launch angle helps describe how consistently balls are well struck so we'll leave it; a higher launch angle is generally better, up to a point
savant_2017['launch_angle'].describe()

count    456.000000
mean      15.921053
std        4.018572
min       -1.600000
25%       13.075000
50%       15.900000
75%       18.500000
max       33.100000
Name: launch_angle, dtype: float64

In [1346]:
# not much correlation, but still a good variable for us to keep
savant_2017['launch_angle'].corr(savant_2017['woba'])

0.1817071200478004

In [1347]:
# spin rate has a small std relative to its mean, so it is a relatively stable variable; more useful in pitching metrics than hitting
savant_2017['spin_rate'].describe()

count     456.000000
mean     2217.614035
std        30.625568
min      2110.000000
25%      2196.000000
50%      2219.000000
75%      2241.250000
max      2294.000000
Name: spin_rate, dtype: float64

In [1348]:
# with near-zero correlation to woba, we can get rid of it
savant_2017['spin_rate'].corr(savant_2017['woba'])

0.0017834618157080977

In [1349]:
# same story with velocity as with spin_rate
savant_2017['velocity'].describe()

count    456.000000
mean      88.639254
std        0.509832
min       86.800000
25%       88.300000
50%       88.600000
75%       88.925000
max       90.200000
Name: velocity, dtype: float64

In [1350]:
# again near-zero correlation so we'll drop it
savant_2017['velocity'].corr(savant_2017['woba'])

-0.0050709668877704211

In [1351]:
# again same story with effective_speed as with spin_rate and velocity
savant_2017['effective_speed'].describe()

count    456.000000
mean      88.148202
std        0.544963
min       86.120000
25%       87.770000
50%       88.150000
75%       88.490000
max       89.710000
Name: effective_speed, dtype: float64

In [1352]:
# near-zero correlation so we'll drop it
savant_2017['effective_speed'].corr(savant_2017['woba'])

-0.0015943003117315719

In [1353]:
# and again same story with eff_min_vel as with spin_rate, velocity, and effective_speed
savant_2017['eff_min_vel'].describe()

count    456.000000
mean      -0.489474
std        0.146080
min       -1.000000
25%       -0.600000
50%       -0.500000
75%       -0.400000
max       -0.100000
Name: eff_min_vel, dtype: float64

In [1354]:
# near-zero correlation so we'll drop it
savant_2017['eff_min_vel'].corr(savant_2017['woba'])

8.514397583133259e-05

In [1355]:
# and one more time, same story with release_extension as with spin_rate, velocity, effective_speed, and eff_min_vel
savant_2017['release_extension'].describe()

count    456.000000
mean       6.023575
std        0.069959
min        5.750000
25%        5.980000
50%        6.020000
75%        6.070000
max        6.210000
Name: release_extension, dtype: float64

In [1356]:
# near-zero correlation so we'll drop it
savant_2017['release_extension'].corr(savant_2017['woba'])

-0.018583341838097341

In [1357]:
# the posX_int_start_distance variables describe the depth of the fielders' positioning on defense so we'll drop those too
# kept - player_id, player_name, xwoba, xba, hits, launch_speed, launch_angle, whiffs, swings, takes
# deleted - total_pitches, pitch_percent, spin_rate, velocity, effective_speed, eff_min_vel, release_extension, posX_int_start_distance variables
# also deleted - pitches, abs, ba, iso, babip, slg, woba (AB, AVG, ISO, BABIP, SLG, wOBA come from the fangraphs data instead)
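
The column-by-column correlation checks above could be run in one pass. The sketch below uses made-up toy values, and both the helper name `screen_by_correlation` and the 0.1 cutoff are my own assumptions for illustration, not a rule from the source. Pearson correlation only captures linear association, so a low value (as with launch_angle) is a hint rather than proof that a column is useless.

```python
import pandas as pd

# Toy data (made up): launch_speed tracks woba, spin_rate does not.
toy = pd.DataFrame({
    "woba":         [0.280, 0.300, 0.320, 0.340, 0.360, 0.380],
    "launch_speed": [79.0, 80.5, 81.0, 82.5, 83.0, 85.0],
    "spin_rate":    [2220, 2190, 2230, 2200, 2230, 2200],
})

def screen_by_correlation(df, target="woba", threshold=0.1):
    """Hypothetical helper: correlate every other column with `target`."""
    corr = df.drop(columns=[target]).corrwith(df[target])
    table = pd.DataFrame({"corr_with_target": corr})
    # threshold is an assumed cutoff on the absolute correlation
    table["keep"] = corr.abs() >= threshold
    # strongest absolute correlation first
    return table.sort_values("corr_with_target", key=lambda s: s.abs(), ascending=False)

print(screen_by_correlation(toy))
```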

In [1358]:
# drop specified columns from each season's data
savant_drop_cols = ['total_pitches', 'pitches', 'pitch_percent', 'abs', 'spin_rate', 'velocity',
                    'effective_speed', 'eff_min_vel', 'release_extension', 'pos3_int_start_distance',
                    'pos4_int_start_distance', 'pos5_int_start_distance', 'pos6_int_start_distance',
                    'pos7_int_start_distance', 'pos8_int_start_distance', 'pos9_int_start_distance',
                    'ba', 'iso', 'babip', 'slg', 'woba']
for season_df in (savant_2015, savant_2016, savant_2017):
    season_df.drop(savant_drop_cols, axis=1, inplace=True)

In [1359]:
# show updated dataset
savant_2017.head()

Unnamed: 0,player_id,player_name,xwoba,xba,hits,launch_speed,launch_angle,whiffs,swings,takes
0,454560,A.J. Ellis,0.273,0.197,30,80.8,16.5,52,270,365
1,572041,A.J. Pollock,0.331,0.265,113,82.9,10.7,139,737,995
2,571437,Aaron Altherr,0.33,0.244,101,83.3,14.3,212,708,878
3,543305,Aaron Hicks,0.335,0.233,80,83.0,16.5,154,579,896
4,592450,Aaron Judge,0.446,0.278,154,85.1,17.5,429,1228,1756


To get a more holistic profile for each player, we'll need to import data from FanGraphs, drop what columns are redundant and then merge the two datasets together.

In [1360]:
# read in 2015, 2016, and 2017 fangraphs data
fangraphs_2015 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/fangraphs_2015.csv")
fangraphs_2016 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/fangraphs_2016.csv")
fangraphs_2017 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/fangraphs_2017.csv")

In [1361]:
# savant data has a minimum of 75 ABs; the fangraphs export can only filter on PAs (set to 70), so ABs were included in the export
# drop players with < 75 ABs from each fangraphs dataset
fangraphs_2015 = fangraphs_2015[fangraphs_2015["AB"] >= 75]
fangraphs_2016 = fangraphs_2016[fangraphs_2016["AB"] >= 75]
fangraphs_2017 = fangraphs_2017[fangraphs_2017["AB"] >= 75]

In [1362]:
# change fangraphs 'Name' column to 'player_name' to merge with savant data
fangraphs_2015.rename(columns={'Name':'player_name'}, inplace=True)
fangraphs_2016.rename(columns={'Name':'player_name'}, inplace=True)
fangraphs_2017.rename(columns={'Name':'player_name'}, inplace=True)

In [1363]:
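# align fangraphs name spellings with the savant spellings so both datasets sort into the same order
# note: the 2016 key ' Ivan De Jesus' has a leading space, so it only matches if the raw fangraphs value does too
fangraphs_2015['player_name'] = fangraphs_2015['player_name'].replace({'Eric Young' : 'Eric Young Jr.',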
                                        'Gregory Bird': 'Greg Bird', 'Ivan De Jesus': 'Ivan De Jesus Jr.', 
                                        'John Mayberry': 'John Mayberry Jr.', 'Nick Castellanos': 'Nicholas Castellanos', 
                                        'Nori Aoki': 'Norichika Aoki'})
fangraphs_2016['player_name'] = fangraphs_2016['player_name'].replace({'Byung-ho Park': 'ByungHo Park',
                                        ' Ivan De Jesus': 'Ivan De Jesus Jr.', 'Nick Castellanos': 'Nicholas Castellanos', 
                                        'Nori Aoki': 'Norichika Aoki','Yulieski Gurriel': 'Yuli Gurriel'})
fangraphs_2017['player_name'] = fangraphs_2017['player_name'].replace({'Cam Perkins': 'Cameron Perkins',
                                        'Eric Young': 'Eric Young Jr.', 'Gregory Bird' : 'Greg Bird', 
                                        'J.T. Riddle': 'JT Riddle', 'Nick Castellanos': 'Nicholas Castellanos', 
                                        'Nick Delmonico': 'Nicky Delmonico', 'Nori Aoki': 'Norichika Aoki', 
                                        'Yulieski Gurriel': 'Yuli Gurriel'})
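
Name differences like the ones replaced above can be found with a set comparison rather than by eye. This is a small sketch on toy frames; `name_mismatches` is a hypothetical helper, and the toy names are taken from the replacement dictionaries in this notebook.

```python
import pandas as pd

# Toy frames: two spellings differ between the sources, one matches.
savant_toy = pd.DataFrame({"player_name": ["Greg Bird", "Norichika Aoki", "Aaron Judge"]})
fangraphs_toy = pd.DataFrame({"player_name": ["Gregory Bird", "Nori Aoki", "Aaron Judge"]})

def name_mismatches(left, right, col="player_name"):
    """Hypothetical helper: names present in one frame but not the other."""
    left_names, right_names = set(left[col]), set(right[col])
    return sorted(left_names - right_names), sorted(right_names - left_names)

only_savant, only_fangraphs = name_mismatches(savant_toy, fangraphs_toy)
print(only_savant)     # names that need a fangraphs-side replacement
print(only_fangraphs)  # the fangraphs spellings to replace

# After applying the replacements, both lists should come back empty.
fixed = fangraphs_toy.assign(
    player_name=fangraphs_toy["player_name"].replace(
        {"Gregory Bird": "Greg Bird", "Nori Aoki": "Norichika Aoki"}))
after = name_mismatches(savant_toy, fixed)
print(after)
```

An empty result for both lists is the condition the later index-based merge quietly depends on.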

In [1364]:
# sort fangraphs data by name
fangraphs_2015 = fangraphs_2015.sort_values(by='player_name')
fangraphs_2016 = fangraphs_2016.sort_values(by='player_name')
fangraphs_2017 = fangraphs_2017.sort_values(by='player_name')

In [1365]:
# reset fangraphs index
fangraphs_2015 = fangraphs_2015.reset_index(drop=True)
fangraphs_2016 = fangraphs_2016.reset_index(drop=True)
fangraphs_2017 = fangraphs_2017.reset_index(drop=True)

In [1366]:
# look at background info, mostly floats and ints with four objects (player_name, Team, BB%, K%)
fangraphs_2015.info()
print("-------------------------------------")
fangraphs_2016.info()
print("-------------------------------------")
fangraphs_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467 entries, 0 to 466
Data columns (total 23 columns):
player_name    467 non-null object
Team           467 non-null object
G              467 non-null int64
AB             467 non-null int64
PA             467 non-null int64
HR             467 non-null int64
R              467 non-null int64
RBI            467 non-null int64
SB             467 non-null int64
BB%            467 non-null object
K%             467 non-null object
ISO            467 non-null float64
BABIP          467 non-null float64
AVG            467 non-null float64
OBP            467 non-null float64
SLG            467 non-null float64
wOBA           467 non-null float64
wRC+           467 non-null int64
BsR            467 non-null float64
Off            467 non-null float64
Def            467 non-null float64
WAR            467 non-null float64
playerid       467 non-null int64
dtypes: float64(10), int64(9), object(4)
memory usage: 84.0+ KB
-------------------------

In [1367]:
fangraphs_2017.head()

Unnamed: 0,player_name,Team,G,AB,PA,HR,R,RBI,SB,BB%,...,AVG,OBP,SLG,wOBA,wRC+,BsR,Off,Def,WAR,playerid
0,A.J. Ellis,Marlins,51,143,163,6,17,14,0,7.4 %,...,0.21,0.298,0.371,0.294,80,-1.8,-6.0,2.8,0.2,5677
1,A.J. Pollock,Diamondbacks,112,425,466,14,73,49,20,7.5 %,...,0.266,0.33,0.471,0.34,103,2.6,4.2,1.8,2.1,9256
2,Aaron Altherr,Phillies,107,372,412,19,58,65,5,7.8 %,...,0.272,0.34,0.516,0.359,120,-2.0,8.7,-8.6,1.3,11270
3,Aaron Hicks,Yankees,88,301,361,15,54,52,10,14.1 %,...,0.266,0.372,0.475,0.363,127,2.5,14.4,6.4,3.3,5297
4,Aaron Judge,Yankees,155,542,678,52,128,114,9,18.7 %,...,0.284,0.422,0.627,0.43,173,0.0,60.8,-1.3,8.2,15640


In [1368]:
# need to convert BB% and K% objects (e.g. '7.4 %') to floats on a 0-1 scale
fangraphs_2015['BB%'] = fangraphs_2015['BB%'].replace('%', '', regex=True).astype('float') / 100
fangraphs_2016['BB%'] = fangraphs_2016['BB%'].replace('%', '', regex=True).astype('float') / 100
fangraphs_2017['BB%'] = fangraphs_2017['BB%'].replace('%', '', regex=True).astype('float') / 100
fangraphs_2015['K%'] = fangraphs_2015['K%'].replace('%', '', regex=True).astype('float') / 100
fangraphs_2016['K%'] = fangraphs_2016['K%'].replace('%', '', regex=True).astype('float') / 100
fangraphs_2017['K%'] = fangraphs_2017['K%'].replace('%', '', regex=True).astype('float') / 100
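
The same percent-string conversion can be wrapped in one function so it is applied identically to every column and year. This is a sketch, not the code this notebook ran; `percent_to_fraction` is a hypothetical name, and using `pd.to_numeric` with `errors="coerce"` (which turns unparseable values into NaN instead of raising) is my own choice.

```python
import pandas as pd

def percent_to_fraction(series):
    """Hypothetical helper: turn strings like '7.4 %' into 0.074."""
    cleaned = series.astype(str).str.replace("%", "", regex=False).str.strip()
    # unparseable entries become NaN rather than raising an error
    return pd.to_numeric(cleaned, errors="coerce") / 100

# Values in the format shown in the fangraphs_2017.head() output
toy = pd.Series(["7.4 %", "14.1 %", "18.7 %"])
converted = percent_to_fraction(toy)
print(converted.round(3).tolist())
```

A possible use would be looping over the three fangraphs frames and the two columns 'BB%' and 'K%', assigning the result back to each column.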

In [1369]:
fangraphs_2017.head()

Unnamed: 0,player_name,Team,G,AB,PA,HR,R,RBI,SB,BB%,...,AVG,OBP,SLG,wOBA,wRC+,BsR,Off,Def,WAR,playerid
0,A.J. Ellis,Marlins,51,143,163,6,17,14,0,0.074,...,0.21,0.298,0.371,0.294,80,-1.8,-6.0,2.8,0.2,5677
1,A.J. Pollock,Diamondbacks,112,425,466,14,73,49,20,0.075,...,0.266,0.33,0.471,0.34,103,2.6,4.2,1.8,2.1,9256
2,Aaron Altherr,Phillies,107,372,412,19,58,65,5,0.078,...,0.272,0.34,0.516,0.359,120,-2.0,8.7,-8.6,1.3,11270
3,Aaron Hicks,Yankees,88,301,361,15,54,52,10,0.141,...,0.266,0.372,0.475,0.363,127,2.5,14.4,6.4,3.3,5297
4,Aaron Judge,Yankees,155,542,678,52,128,114,9,0.187,...,0.284,0.422,0.627,0.43,173,0.0,60.8,-1.3,8.2,15640


In [1370]:
# from fangraphs, drop Team, G, BsR, Off, and Def (Off and Def are cumulative stats, not rate stats)
# also drop playerid because it is not consistent from site to site
# drop specified columns from each year's fangraphs data
fangraphs_2015.drop(['Team', 'G', 'Off', 'Def', 'BsR', 'playerid'], axis=1, inplace=True)
fangraphs_2016.drop(['Team', 'G', 'Off', 'Def', 'BsR', 'playerid'], axis=1, inplace=True)
fangraphs_2017.drop(['Team', 'G', 'Off', 'Def', 'BsR', 'playerid'], axis=1, inplace=True)

In [1371]:
# merge fangraphs and savant data on index for each year (relies on both being sorted by player_name with matching names and row counts)
merged_2015 = pd.merge(fangraphs_2015, savant_2015, right_index=True, left_index=True)
merged_2016 = pd.merge(fangraphs_2016, savant_2016, right_index=True, left_index=True)
merged_2017 = pd.merge(fangraphs_2017, savant_2017, right_index=True, left_index=True)
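
Merging on the index only lines up correctly if both tables list exactly the same players in the same order. As an alternative, here is a hedged sketch of merging on `player_name` itself, with pandas' `validate` and `indicator` options reporting anything that fails to match. The toy values for Ellis and Judge come from the outputs in this notebook; the Aoki numbers are made up, and `merge_on_name` is a hypothetical helper.

```python
import pandas as pd

fangraphs_toy = pd.DataFrame({
    "player_name": ["A.J. Ellis", "Aaron Judge", "Nori Aoki"],
    "wOBA": [0.294, 0.430, 0.300],   # last value is made up
})
savant_toy = pd.DataFrame({
    "player_name": ["A.J. Ellis", "Aaron Judge", "Norichika Aoki"],
    "xwoba": [0.273, 0.446, 0.310],  # last value is made up
})

def merge_on_name(fangraphs, savant):
    """Hypothetical helper: key-based merge that reports unmatched rows."""
    merged = pd.merge(
        fangraphs, savant,
        on="player_name",
        how="outer",             # keep rows from both sides so misses are visible
        validate="one_to_one",   # raises MergeError if a name repeats on either side
        indicator=True,          # adds a _merge column: both / left_only / right_only
    )
    unmatched = merged[merged["_merge"] != "both"]
    matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
    return matched.reset_index(drop=True), unmatched

matched, unmatched = merge_on_name(fangraphs_toy, savant_toy)
print(matched)
print(unmatched[["player_name", "_merge"]])
```

With a key-based merge, a spelling difference shows up as an unmatched row instead of silently pairing one player's FanGraphs stats with another player's Savant stats.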

In [1372]:
# check in to see what our merged data looks like
merged_2015.info()
print("-------------------------------------")
merged_2016.info()
print("-------------------------------------")
merged_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467 entries, 0 to 466
Data columns (total 27 columns):
player_name_x    467 non-null object
AB               467 non-null int64
PA               467 non-null int64
HR               467 non-null int64
R                467 non-null int64
RBI              467 non-null int64
SB               467 non-null int64
BB%              467 non-null float64
K%               467 non-null float64
ISO              467 non-null float64
BABIP            467 non-null float64
AVG              467 non-null float64
OBP              467 non-null float64
SLG              467 non-null float64
wOBA             467 non-null float64
wRC+             467 non-null int64
WAR              467 non-null float64
player_id        467 non-null int64
player_name_y    467 non-null object
xwoba            467 non-null float64
xba              467 non-null float64
hits             467 non-null int64
launch_speed     467 non-null float64
launch_angle     467 non-null float64
whi

In [1373]:
# we have the same number of rows in each dataset before and after the merge, but just to double check, we'll look at each one
savant_2017

Unnamed: 0,player_id,player_name,xwoba,xba,hits,launch_speed,launch_angle,whiffs,swings,takes
0,454560,A.J. Ellis,0.273,0.197,30,80.8,16.5,52,270,365
1,572041,A.J. Pollock,0.331,0.265,113,82.9,10.7,139,737,995
2,571437,Aaron Altherr,0.330,0.244,101,83.3,14.3,212,708,878
3,543305,Aaron Hicks,0.335,0.233,80,83.0,16.5,154,579,896
4,592450,Aaron Judge,0.446,0.278,154,85.1,17.5,429,1228,1756
5,501659,Abraham Almonte,0.301,0.231,40,83.1,8.3,97,363,403
6,594807,Adam Duvall,0.295,0.217,146,81.5,22.8,312,1196,1278
7,594809,Adam Eaton,0.319,0.228,27,81.9,14.3,39,188,252
8,641553,Adam Engel,0.223,0.164,50,76.4,17.8,218,628,672
9,624428,Adam Frazier,0.325,0.275,112,80.6,14.6,108,803,940


In [1374]:
# Zack Granite comes in at #455 in both datasets so looks like the merge was successful
fangraphs_2017

Unnamed: 0,player_name,AB,PA,HR,R,RBI,SB,BB%,K%,ISO,BABIP,AVG,OBP,SLG,wOBA,wRC+,WAR
0,A.J. Ellis,143,163,6,17,14,0,0.074,0.178,0.161,0.222,0.210,0.298,0.371,0.294,80,0.2
1,A.J. Pollock,425,466,14,73,49,20,0.075,0.152,0.205,0.291,0.266,0.330,0.471,0.340,103,2.1
2,Aaron Altherr,372,412,19,58,65,5,0.078,0.252,0.245,0.328,0.272,0.340,0.516,0.359,120,1.3
3,Aaron Hicks,301,361,15,54,52,10,0.141,0.186,0.209,0.290,0.266,0.372,0.475,0.363,127,3.3
4,Aaron Judge,542,678,52,128,114,9,0.187,0.307,0.343,0.357,0.284,0.422,0.627,0.430,173,8.2
5,Abraham Almonte,172,195,3,26,14,2,0.103,0.236,0.134,0.298,0.233,0.314,0.366,0.298,81,-0.1
6,Adam Duvall,587,647,31,78,99,5,0.060,0.263,0.232,0.290,0.249,0.301,0.480,0.327,98,1.8
7,Adam Eaton,91,107,2,24,13,3,0.131,0.168,0.165,0.347,0.297,0.393,0.462,0.369,126,0.5
8,Adam Engel,301,336,6,34,21,8,0.057,0.348,0.116,0.247,0.166,0.235,0.282,0.230,37,-0.7
9,Adam Frazier,406,454,6,55,53,9,0.079,0.126,0.123,0.306,0.276,0.344,0.399,0.322,97,1.1


In [1375]:
# two player name columns after merging (short/full name differences); keep the fangraphs one and drop the savant one, and drop player_id too since rows are tracked by the index
merged_2015.drop(['player_name_y', 'player_id'], axis=1, inplace=True)
merged_2016.drop(['player_name_y', 'player_id'], axis=1, inplace=True)
merged_2017.drop(['player_name_y', 'player_id'], axis=1, inplace=True)

In [1376]:
# check in to see what our merged data looks like
merged_2015.info()
print("-------------------------------------")
merged_2016.info()
print("-------------------------------------")
merged_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467 entries, 0 to 466
Data columns (total 25 columns):
player_name_x    467 non-null object
AB               467 non-null int64
PA               467 non-null int64
HR               467 non-null int64
R                467 non-null int64
RBI              467 non-null int64
SB               467 non-null int64
BB%              467 non-null float64
K%               467 non-null float64
ISO              467 non-null float64
BABIP            467 non-null float64
AVG              467 non-null float64
OBP              467 non-null float64
SLG              467 non-null float64
wOBA             467 non-null float64
wRC+             467 non-null int64
WAR              467 non-null float64
xwoba            467 non-null float64
xba              467 non-null float64
hits             467 non-null int64
launch_speed     467 non-null float64
launch_angle     467 non-null float64
whiffs           467 non-null int64
swings           467 non-null int64
take

In [1377]:
# making all column names consistent for readability moving forward
merged_2015.rename(columns={'player_name_x':'Player_Name', 'AVG':'BA', 'pitches':'Pitches', 'BB%':'BB/PA', 'K%':'K/PA',
                            'iso':'ISO', 'babip':'BABIP', 'slg':'SLG', 'woba':'wOBA', 'xwoba':'xwOBA', 'xba':'xBA', 
                            'hits':'Hits', 'launch_speed':'Launch_Speed', 'launch_angle':'Launch_Angle', 'whiffs':'Whiffs', 
                            'swings':'Swings', 'takes':'Takes'}, inplace=True)
merged_2016.rename(columns={'player_name_x':'Player_Name', 'AVG':'BA', 'pitches':'Pitches', 'BB%':'BB/PA', 'K%':'K/PA',
                            'iso':'ISO', 'babip':'BABIP', 'slg':'SLG', 'woba':'wOBA', 'xwoba':'xwOBA', 'xba':'xBA', 
                            'hits':'Hits', 'launch_speed':'Launch_Speed', 'launch_angle':'Launch_Angle', 'whiffs':'Whiffs', 
                            'swings':'Swings', 'takes':'Takes'}, inplace=True)
merged_2017.rename(columns={'player_name_x':'Player_Name', 'AVG':'BA', 'pitches':'Pitches', 'BB%':'BB/PA', 'K%':'K/PA',
                            'iso':'ISO', 'babip':'BABIP', 'slg':'SLG', 'woba':'wOBA', 'xwoba':'xwOBA', 'xba':'xBA', 
                            'hits':'Hits', 'launch_speed':'Launch_Speed', 'launch_angle':'Launch_Angle', 'whiffs':'Whiffs', 
                            'swings':'Swings', 'takes':'Takes'}, inplace=True)

In [1378]:
# reorder columns for more logical ordering
merged_2015 = merged_2015[['Player_Name', 'PA', 'AB', 'Hits', 'R', 'HR', 'RBI', 
                           'SB', 'BA', 'xBA', 'OBP', 'BABIP', 'ISO', 'SLG', 'wOBA', 'xwOBA', 
                           'BB/PA', 'K/PA', 'Launch_Speed', 'Launch_Angle', 'Whiffs', 
                           'Swings', 'Takes', 'wRC+', 'WAR']]
merged_2016 = merged_2016[['Player_Name', 'PA', 'AB', 'Hits', 'R', 'HR', 'RBI', 
                           'SB', 'BA', 'xBA', 'OBP', 'BABIP', 'ISO', 'SLG', 'wOBA', 'xwOBA', 
                           'BB/PA', 'K/PA', 'Launch_Speed', 'Launch_Angle', 'Whiffs', 
                           'Swings', 'Takes', 'wRC+', 'WAR']]
merged_2017 = merged_2017[['Player_Name', 'PA', 'AB', 'Hits', 'R', 'HR', 'RBI', 
                           'SB', 'BA', 'xBA', 'OBP', 'BABIP', 'ISO', 'SLG', 'wOBA', 'xwOBA', 
                           'BB/PA', 'K/PA', 'Launch_Speed', 'Launch_Angle', 'Whiffs', 
                           'Swings', 'Takes', 'wRC+', 'WAR']]

In [1379]:
# save merged files as csv for further analysis
merged_2015.to_csv("C:/Users/avitosky/Documents/Baseball Project/merged_2015.csv")
merged_2016.to_csv("C:/Users/avitosky/Documents/Baseball Project/merged_2016.csv")
merged_2017.to_csv("C:/Users/avitosky/Documents/Baseball Project/merged_2017.csv")