# Statcast + FanGraphs 2018 "Breakout" Analysis - xwOBA

In this notebook, I attempt to determine which players have a chance to "break out" in 2018 based on their 2017 stats. I'll then apply the same methodology to 2015 and 2016 stats to see whether the flagged "breakout" players did in fact break out. I'm using "breakout" to mean that a player has a higher likelihood of performing above expectation for the 2018 season. I realize that is somewhat vague, and I'd like to quantify it later on.

Since AB ranges from 75 to 662 in the merged_2017 dataset that I created in a previous notebook, simply looking at raw totals such as H, HR, and RBI is not sufficient, since those are counting stats that grow with playing time. Because of this, I will create additional metrics on a per-PA (plate appearance) basis in an attempt to normalize players somewhat. Once I get down to a final list, I will compare that list to the ADP (average draft position) from the NFBC (National Fantasy Baseball Championship), which is the most widely used ADP in fantasy baseball. At that point, I will take out "qualified" players (> 502 ABs for that specific year), as those players are already full-timers and my goal is to find more under-the-radar players.

wOBA - A rate statistic that attempts to credit a hitter for the value of each outcome (single, double, etc.) rather than treating all hits or times on base equally. Batting average and OBP assume all hits are created equal; slugging weights hits, but not accurately, and ignores other ways of reaching base. The wOBA formula changes from year to year because the weights are based on each outcome's relative contribution to run scoring, and run scoring varies from year to year.

The 2017 formula for wOBA = (0.693×uBB + 0.723×HBP + 0.877×1B + 1.232×2B + 1.552×3B + 1.979×HR) / (AB + BB – IBB + SF + HBP)
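As a rough illustration of the formula above, here is a minimal sketch in Python. The function name `woba_2017` and the sample stat line are made up for this example (they are not taken from the merged datasets), and uBB is taken to mean unintentional walks, i.e. BB minus IBB.

```python
# 2017 wOBA weights exactly as quoted in the formula above
WEIGHTS_2017 = {"uBB": 0.693, "HBP": 0.723, "1B": 0.877,
                "2B": 1.232, "3B": 1.552, "HR": 1.979}


def woba_2017(ab, bb, ibb, hbp, sf, singles, doubles, triples, hr):
    """Compute wOBA from raw counts using the 2017 weights (hypothetical helper)."""
    ubb = bb - ibb  # unintentional walks
    numerator = (WEIGHTS_2017["uBB"] * ubb
                 + WEIGHTS_2017["HBP"] * hbp
                 + WEIGHTS_2017["1B"] * singles
                 + WEIGHTS_2017["2B"] * doubles
                 + WEIGHTS_2017["3B"] * triples
                 + WEIGHTS_2017["HR"] * hr)
    denominator = ab + bb - ibb + sf + hbp
    return numerator / denominator


# an invented stat line, purely for illustration
example = woba_2017(ab=500, bb=60, ibb=5, hbp=5, sf=5,
                    singles=90, doubles=30, triples=3, hr=25)
print(round(example, 3))
```

Because the weights are refit every season, a helper like this would need a separate weight table for 2015 and 2016.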

xwOBA - A metric that utilizes launch angle and exit velocity to assign a hit value to every batted ball and then translates that into "expected" wOBA. 

In [449]:
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib.pyplot as plt
pd.options.display.max_columns = None
%matplotlib inline
pd.set_option("display.max_rows",500)

In [450]:
# read in merged statcast and fangraphs data
merged_2015 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/merged_2015.csv", index_col=0)
merged_2016 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/merged_2016.csv", index_col=0)
merged_2017 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/merged_2017.csv", index_col=0)

In [451]:
# a quick reminder of what our merged data looks like
merged_2017.head()

Unnamed: 0,Player_Name,PA,AB,Hits,R,HR,RBI,SB,BA,xBA,OBP,BABIP,ISO,SLG,wOBA,xwOBA,BB/PA,K/PA,Exit_Velocity,Launch_Angle,Whiffs,Swings,Takes,wRC+,WAR,playerid
0,A.J. Ellis,163,143,30,17,6,14,0,0.21,0.197,0.298,0.222,0.161,0.371,0.294,0.273,0.074,0.178,80.8,16.5,52,270,365,80,0.2,5677
1,A.J. Pollock,466,425,113,73,14,49,20,0.266,0.265,0.33,0.291,0.205,0.471,0.34,0.331,0.075,0.152,82.9,10.7,139,737,995,103,2.1,9256
2,Aaron Altherr,412,372,101,58,19,65,5,0.272,0.244,0.34,0.328,0.245,0.516,0.359,0.33,0.078,0.252,83.3,14.3,212,708,878,120,1.3,11270
3,Aaron Hicks,361,301,80,54,15,52,10,0.266,0.233,0.372,0.29,0.209,0.475,0.363,0.335,0.141,0.186,83.0,16.5,154,579,896,127,3.3,5297
4,Aaron Judge,678,542,154,128,52,114,9,0.284,0.278,0.422,0.357,0.343,0.627,0.43,0.446,0.187,0.307,85.1,17.5,429,1228,1756,173,8.2,15640


In [452]:
# let's look at top 10 in xwOBA to see what kind of players we are looking for
merged_2017.sort_values(by='xwOBA', ascending=False).head(10)

Unnamed: 0,Player_Name,PA,AB,Hits,R,HR,RBI,SB,BA,xBA,OBP,BABIP,ISO,SLG,wOBA,xwOBA,BB/PA,K/PA,Exit_Velocity,Launch_Angle,Whiffs,Swings,Takes,wRC+,WAR,playerid
4,Aaron Judge,678,542,154,128,52,114,9,0.284,0.278,0.422,0.357,0.343,0.627,0.43,0.446,0.187,0.307,85.1,17.5,429,1228,1756,173,8.2,15640
230,Joey Votto,707,559,179,106,36,100,5,0.32,0.298,0.454,0.321,0.258,0.578,0.428,0.424,0.19,0.117,81.8,17.7,176,1144,1577,165,6.6,4314
334,Mike Trout,507,402,123,92,33,72,22,0.306,0.28,0.442,0.318,0.323,0.629,0.437,0.423,0.185,0.178,82.2,20.8,146,801,1319,181,6.9,10155
193,J.D. Martinez,489,432,131,85,45,104,4,0.303,0.288,0.376,0.327,0.387,0.69,0.43,0.423,0.108,0.262,83.8,16.0,307,983,928,166,3.8,6184
249,Jose Martinez,307,272,84,47,14,46,4,0.309,0.322,0.379,0.35,0.21,0.518,0.379,0.411,0.104,0.195,84.5,13.0,107,515,726,135,1.6,7996
167,Freddie Freeman,514,440,135,84,28,71,8,0.307,0.283,0.403,0.335,0.28,0.586,0.407,0.403,0.126,0.185,84.3,19.6,254,1020,902,152,4.5,5361
341,Nelson Cruz,645,556,160,91,39,119,1,0.288,0.282,0.375,0.315,0.261,0.549,0.385,0.402,0.109,0.217,85.0,15.3,356,1167,1295,146,3.8,2434
23,Alex Avila,376,311,82,41,14,49,0,0.264,0.272,0.387,0.382,0.183,0.447,0.362,0.401,0.165,0.319,86.4,16.0,216,593,1016,124,2.5,7476
369,Rhys Hoskins,212,170,44,37,18,48,2,0.259,0.255,0.396,0.241,0.359,0.618,0.417,0.399,0.175,0.217,83.6,21.6,78,378,604,158,2.2,16472
173,Giancarlo Stanton,692,597,168,123,59,132,2,0.281,0.266,0.376,0.288,0.35,0.631,0.41,0.398,0.123,0.236,85.6,16.8,373,1174,1553,156,6.9,4949


In [453]:
# let's also look at top 10 in wOBA to see how these players actually performed in 2017
merged_2017.sort_values(by='wOBA', ascending=False).head(10)

Unnamed: 0,Player_Name,PA,AB,Hits,R,HR,RBI,SB,BA,xBA,OBP,BABIP,ISO,SLG,wOBA,xwOBA,BB/PA,K/PA,Exit_Velocity,Launch_Angle,Whiffs,Swings,Takes,wRC+,WAR,playerid
334,Mike Trout,507,402,123,92,33,72,22,0.306,0.28,0.442,0.318,0.323,0.629,0.437,0.423,0.185,0.178,82.2,20.8,146,801,1319,181,6.9,10155
193,J.D. Martinez,489,432,131,85,45,104,4,0.303,0.288,0.376,0.327,0.387,0.69,0.43,0.423,0.108,0.262,83.8,16.0,307,983,928,166,3.8,6184
4,Aaron Judge,678,542,154,128,52,114,9,0.284,0.278,0.422,0.357,0.343,0.627,0.43,0.446,0.187,0.307,85.1,17.5,429,1228,1756,173,8.2,15640
230,Joey Votto,707,559,179,106,36,100,5,0.32,0.298,0.454,0.321,0.258,0.578,0.428,0.424,0.19,0.117,81.8,17.7,176,1144,1577,165,6.6,4314
369,Rhys Hoskins,212,170,44,37,18,48,2,0.259,0.255,0.396,0.241,0.359,0.618,0.417,0.399,0.175,0.217,83.6,21.6,78,378,604,158,2.2,16472
70,Bryce Harper,492,420,134,95,29,87,4,0.319,0.283,0.413,0.356,0.276,0.595,0.416,0.39,0.138,0.201,83.3,16.8,259,964,999,156,4.8,11579
87,Charlie Blackmon,725,644,213,137,37,104,14,0.331,0.278,0.399,0.371,0.27,0.601,0.414,0.364,0.09,0.186,82.1,16.3,252,1334,1537,141,6.5,7859
316,Matt Olson,216,189,49,33,24,45,0,0.259,0.251,0.352,0.238,0.392,0.651,0.411,0.38,0.102,0.278,85.4,21.5,131,410,498,162,2.0,14344
173,Giancarlo Stanton,692,597,168,123,59,132,2,0.281,0.266,0.376,0.288,0.35,0.631,0.41,0.398,0.123,0.236,85.6,16.8,373,1174,1553,156,6.9,4949
167,Freddie Freeman,514,440,135,84,28,71,8,0.307,0.283,0.403,0.335,0.28,0.586,0.407,0.403,0.126,0.185,84.3,19.6,254,1020,902,152,4.5,5361


Next, let's create a simple wOBA - xwOBA column called 'wOBA_diff' and see who our initial underperformers and overperformers were for 2017 based on wOBA and xwOBA.

In [454]:
# top 10 underperformers (wOBA furthest below xwOBA) - Cabrera sticks out here
merged_2017['wOBA_diff'] = merged_2017['wOBA'] - merged_2017['xwOBA']
merged_2017.sort_values(by='wOBA_diff').head(10)

Unnamed: 0,Player_Name,PA,AB,Hits,R,HR,RBI,SB,BA,xBA,OBP,BABIP,ISO,SLG,wOBA,xwOBA,BB/PA,K/PA,Exit_Velocity,Launch_Angle,Whiffs,Swings,Takes,wRC+,WAR,playerid,wOBA_diff
292,Luis Torrens,139,123,20,7,0,7,0,0.163,0.217,0.243,0.215,0.041,0.203,0.197,0.267,0.086,0.216,79.5,9.5,68,261,271,18,-0.8,15905,-0.07
327,Miguel Cabrera,529,469,117,50,16,60,0,0.249,0.29,0.329,0.292,0.149,0.399,0.313,0.382,0.102,0.208,84.5,17.5,237,1028,1014,91,-0.2,1744,-0.069
248,Jose Lobaton,158,141,24,11,4,11,0,0.17,0.225,0.248,0.194,0.106,0.277,0.232,0.301,0.089,0.222,79.8,13.3,76,303,322,36,-0.6,4243,-0.069
372,Rob Refsnyder,98,88,15,8,0,0,4,0.17,0.221,0.247,0.211,0.045,0.216,0.209,0.274,0.082,0.173,82.3,9.4,33,148,205,22,-0.7,13770,-0.065
361,Paulo Orlando,90,86,17,9,2,6,1,0.198,0.255,0.225,0.234,0.105,0.302,0.228,0.292,0.011,0.222,82.8,15.7,52,173,119,34,-0.4,8628,-0.064
344,Nick Franklin,119,106,19,9,2,12,2,0.179,0.248,0.269,0.207,0.104,0.283,0.25,0.312,0.084,0.185,79.1,11.7,46,197,273,48,-0.2,10166,-0.062
188,Hyun Soo Kim,239,212,49,20,1,14,0,0.231,0.262,0.307,0.287,0.061,0.292,0.268,0.321,0.092,0.192,82.0,14.4,77,414,567,61,-1.1,18718,-0.053
379,Ruben Tejada,124,113,26,17,0,5,0,0.23,0.266,0.293,0.265,0.053,0.283,0.259,0.311,0.065,0.121,81.8,16.3,33,220,226,56,-0.2,5519,-0.052
44,Austin Romine,252,229,50,19,2,21,0,0.218,0.254,0.272,0.277,0.074,0.293,0.25,0.297,0.063,0.226,79.8,14.6,123,467,481,49,-0.6,5491,-0.047
261,Juan Graterol,87,84,17,5,0,10,0,0.202,0.246,0.207,0.233,0.048,0.25,0.196,0.243,0.011,0.149,78.4,7.9,37,162,115,16,-0.4,5398,-0.047


In [455]:
# top 10 overperformers - Marwin Gonzalez and Cozart stick out; xwOBA doesn't adjust for speed, hence the fast players (Gordon, DeShields, Smith) below
merged_2017.sort_values(by='wOBA_diff', ascending=False).head(10)

Unnamed: 0,Player_Name,PA,AB,Hits,R,HR,RBI,SB,BA,xBA,OBP,BABIP,ISO,SLG,wOBA,xwOBA,BB/PA,K/PA,Exit_Velocity,Launch_Angle,Whiffs,Swings,Takes,wRC+,WAR,playerid,wOBA_diff
357,Pat Valaika,195,182,47,28,13,40,0,0.258,0.211,0.284,0.291,0.275,0.533,0.338,0.255,0.036,0.272,79.1,18.7,123,382,303,92,0.6,14885,0.083
157,Eric Young Jr.,125,110,29,24,4,16,12,0.264,0.198,0.336,0.333,0.155,0.418,0.329,0.251,0.04,0.248,78.4,7.2,66,220,204,108,0.8,7158,0.078
147,Eduardo Nunez,491,467,146,60,12,58,24,0.313,0.246,0.341,0.333,0.148,0.46,0.342,0.275,0.037,0.11,80.5,13.8,150,911,802,112,2.2,6848,0.067
297,Mallex Smith,282,256,69,33,2,12,16,0.27,0.199,0.329,0.347,0.086,0.355,0.301,0.239,0.082,0.22,74.0,9.3,152,501,543,88,0.8,13608,0.062
308,Marwin Gonzalez,515,455,138,67,23,90,8,0.303,0.244,0.377,0.343,0.226,0.53,0.382,0.32,0.095,0.192,82.6,13.2,182,891,1211,144,4.1,5497,0.062
454,Zack Cozart,507,438,130,80,24,63,3,0.297,0.255,0.385,0.312,0.251,0.548,0.392,0.332,0.122,0.154,81.7,16.3,139,868,1249,141,5.0,2616,0.06
128,Delino DeShields,440,376,101,75,6,22,29,0.269,0.191,0.347,0.358,0.098,0.367,0.315,0.255,0.1,0.248,76.7,12.5,165,707,1082,90,2.3,11379,0.06
274,Kevin Kiermaier,421,380,105,56,15,39,16,0.276,0.215,0.338,0.337,0.174,0.45,0.337,0.279,0.074,0.235,79.8,14.5,212,792,853,112,3.0,11038,0.058
127,Dee Gordon,695,653,201,114,2,33,60,0.308,0.237,0.341,0.354,0.067,0.375,0.312,0.254,0.036,0.134,75.7,6.9,186,1256,1087,92,3.3,8203,0.058
245,Jose Altuve,662,590,204,112,24,81,32,0.346,0.274,0.41,0.37,0.202,0.547,0.405,0.349,0.088,0.127,81.0,13.4,185,1099,1195,160,7.5,5417,0.056


This gives us a baseline of what we're trying to replicate with our potential breakout list. Next, I would like to get rid of the counting stats and replace them with rate metrics in an attempt to normalize the statistics. Yes, in an ideal world I'd have a sufficient number of PA for every player to give a 'truer' stat line, but I'm working with what I have.

In [456]:
# creating new per-PA rate metrics for each season
for df in (merged_2015, merged_2016, merged_2017):
    for stat in ['HR', 'R', 'RBI', 'SB', 'Whiffs', 'Swings', 'Takes']:
        # e.g. df['HR/PA'] = df['HR'] / df['PA']
        df[stat + '/PA'] = df[stat] / df['PA']

In [457]:
# taking out counting stats and creating datasets with just rate stats for deeper analysis, keep PA as a reference point
# most interested in how xwOBA and wOBA correlate so putting them towards the front

rate_cols = ['Player_Name', 'PA', 'xwOBA', 'wOBA', 'BA', 'xBA', 'OBP', 'BB/PA', 'K/PA', 'Exit_Velocity',
             'Launch_Angle', 'BABIP', 'ISO', 'SLG', 'HR/PA', 'R/PA', 'RBI/PA', 'SB/PA',
             'Whiffs/PA', 'Swings/PA', 'Takes/PA']
rate_stats_2015 = merged_2015.filter(rate_cols, axis=1)
rate_stats_2016 = merged_2016.filter(rate_cols, axis=1)
rate_stats_2017 = merged_2017.filter(rate_cols, axis=1)

In [458]:
# let's look at our new dataset for 2017
rate_stats_2017.head()

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,BA,xBA,OBP,BB/PA,K/PA,Exit_Velocity,Launch_Angle,BABIP,ISO,SLG,HR/PA,R/PA,RBI/PA,SB/PA,Whiffs/PA,Swings/PA,Takes/PA
0,A.J. Ellis,163,0.273,0.294,0.21,0.197,0.298,0.074,0.178,80.8,16.5,0.222,0.161,0.371,0.03681,0.104294,0.08589,0.0,0.319018,1.656442,2.239264
1,A.J. Pollock,466,0.331,0.34,0.266,0.265,0.33,0.075,0.152,82.9,10.7,0.291,0.205,0.471,0.030043,0.156652,0.10515,0.042918,0.298283,1.581545,2.135193
2,Aaron Altherr,412,0.33,0.359,0.272,0.244,0.34,0.078,0.252,83.3,14.3,0.328,0.245,0.516,0.046117,0.140777,0.157767,0.012136,0.514563,1.718447,2.131068
3,Aaron Hicks,361,0.335,0.363,0.266,0.233,0.372,0.141,0.186,83.0,16.5,0.29,0.209,0.475,0.041551,0.149584,0.144044,0.027701,0.426593,1.603878,2.481994
4,Aaron Judge,678,0.446,0.43,0.284,0.278,0.422,0.187,0.307,85.1,17.5,0.357,0.343,0.627,0.076696,0.188791,0.168142,0.013274,0.632743,1.811209,2.589971


In [459]:
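# a look at correlation between all rate stats, sorted by xwOBA
# numeric_only=True skips the Player_Name text column (recent pandas versions raise an error without it)
rate_stats_2017.corr(numeric_only=True).sort_values(by='xwOBA', ascending=False)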

Unnamed: 0,PA,xwOBA,wOBA,BA,xBA,OBP,BB/PA,K/PA,Exit_Velocity,Launch_Angle,BABIP,ISO,SLG,HR/PA,R/PA,RBI/PA,SB/PA,Whiffs/PA,Swings/PA,Takes/PA
xwOBA,0.5182,1.0,0.823467,0.53368,0.766802,0.751825,0.554829,-0.195193,0.645139,0.224597,0.236109,0.677389,0.7582,0.607224,0.467414,0.591037,-0.187462,-0.096309,-0.098816,0.256429
wOBA,0.537792,0.823467,1.0,0.789975,0.630058,0.894772,0.445814,-0.147622,0.460735,0.185048,0.547763,0.748893,0.927742,0.645413,0.678424,0.634759,0.009776,-0.04435,-0.03919,0.182542
xBA,0.503943,0.766802,0.630058,0.74533,1.0,0.62019,0.089368,-0.599747,0.432762,-0.091323,0.354041,0.297335,0.562181,0.21045,0.315083,0.396814,-0.084176,-0.417623,-0.116456,-0.078738
SLG,0.522764,0.7582,0.927742,0.685917,0.562181,0.669816,0.231445,-0.040718,0.506109,0.295744,0.410699,0.904453,1.0,0.819635,0.629829,0.772791,-0.045754,0.11879,0.144225,0.014837
OBP,0.468414,0.751825,0.894772,0.787141,0.62019,1.0,0.60987,-0.283864,0.31758,0.000944,0.612258,0.413369,0.669816,0.301659,0.596937,0.355923,0.074175,-0.25615,-0.258285,0.324185
ISO,0.399265,0.677389,0.748893,0.310031,0.297335,0.413369,0.288615,0.206795,0.530718,0.471265,0.088253,1.0,0.904453,0.952514,0.518901,0.779297,-0.143605,0.317958,0.213543,0.091102
Exit_Velocity,0.358465,0.645139,0.460735,0.221798,0.432762,0.31758,0.276684,-0.008786,1.0,0.239353,0.03454,0.530718,0.506109,0.497111,0.222907,0.50237,-0.315203,0.047826,-0.104862,0.204214
HR/PA,0.345566,0.607224,0.645413,0.202257,0.21045,0.301659,0.238867,0.276418,0.497111,0.480974,-0.01261,0.952514,0.819635,1.0,0.416015,0.771772,-0.187473,0.388562,0.269415,0.052674
RBI/PA,0.328848,0.591037,0.634759,0.39341,0.396814,0.355923,0.116789,0.041924,0.50237,0.338213,0.133475,0.779297,0.772791,0.771772,0.341698,1.0,-0.244776,0.199138,0.204345,-0.033105
BB/PA,0.171868,0.554829,0.445814,0.022529,0.089368,0.60987,1.0,0.085075,0.276684,0.182623,0.020766,0.288615,0.231445,0.238867,0.30969,0.116789,-0.084228,-0.084457,-0.37785,0.722242


In [460]:
# 2016 xwOBA correlations
rate_stats_2016.corr(numeric_only=True).sort_values('xwOBA', ascending=False)

Unnamed: 0,PA,xwOBA,wOBA,BA,xBA,OBP,BB/PA,K/PA,Exit_Velocity,Launch_Angle,BABIP,ISO,SLG,HR/PA,R/PA,RBI/PA,SB/PA,Whiffs/PA,Swings/PA,Takes/PA
xwOBA,0.522343,1.0,0.799264,0.519971,0.770199,0.725085,0.493599,-0.15624,0.698957,0.177131,0.264013,0.641972,0.734246,0.598881,0.408658,0.596144,-0.186578,-0.052088,-0.111752,0.289609
wOBA,0.565675,0.799264,1.0,0.788777,0.626954,0.889832,0.394794,-0.12076,0.498414,0.163817,0.613564,0.728449,0.926953,0.597112,0.666003,0.621101,0.042403,-0.009047,-0.032023,0.211742
xBA,0.510603,0.770199,0.626954,0.753677,1.0,0.618923,0.020446,-0.591028,0.490512,-0.089401,0.382849,0.262848,0.555758,0.18667,0.311453,0.381355,-0.060254,-0.421649,-0.143506,-0.06483
SLG,0.54881,0.734246,0.926953,0.680817,0.555758,0.659979,0.183282,-0.005492,0.561737,0.298468,0.473555,0.891218,1.0,0.773106,0.631388,0.75383,-0.034514,0.153099,0.154774,0.0417
OBP,0.488093,0.725085,0.889832,0.791485,0.618923,1.0,0.556912,-0.276325,0.325465,-0.03948,0.670902,0.376217,0.659979,0.245456,0.572589,0.350051,0.128962,-0.230657,-0.248829,0.339116
Exit_Velocity,0.387655,0.698957,0.498414,0.251817,0.490512,0.325465,0.233332,0.054296,1.0,0.171958,0.076478,0.581611,0.561737,0.601924,0.18376,0.538245,-0.29531,0.147421,-0.029886,0.169715
ISO,0.402498,0.641972,0.728449,0.274586,0.262848,0.376217,0.263933,0.281886,0.581611,0.459128,0.128541,1.0,0.891218,0.939497,0.494311,0.769579,-0.165784,0.387563,0.221815,0.133329
HR/PA,0.356836,0.598881,0.597112,0.121866,0.18667,0.245456,0.254784,0.345182,0.601924,0.466338,-0.03303,0.939497,0.773106,1.0,0.381373,0.747733,-0.221501,0.439815,0.236106,0.146599
RBI/PA,0.405472,0.596144,0.621101,0.354885,0.381355,0.350051,0.124183,0.065959,0.538245,0.35101,0.129866,0.769579,0.75383,0.747733,0.327705,1.0,-0.226616,0.233628,0.218242,-0.017145
PA,1.0,0.522343,0.565675,0.513671,0.510603,0.488093,0.122484,-0.276738,0.387655,0.081651,0.251597,0.402498,0.54881,0.356836,0.450066,0.405472,0.106208,-0.155702,-0.040901,0.008948


In [461]:
# and finally 2015 xwOBA correlations
rate_stats_2015.corr(numeric_only=True).sort_values('xwOBA', ascending=False)

Unnamed: 0,PA,xwOBA,wOBA,BA,xBA,OBP,BB/PA,K/PA,Exit_Velocity,Launch_Angle,BABIP,ISO,SLG,HR/PA,R/PA,RBI/PA,SB/PA,Whiffs/PA,Swings/PA,Takes/PA
xwOBA,0.45735,1.0,0.839536,0.57204,0.808166,0.781818,0.576344,-0.163452,0.733438,0.273972,0.363072,0.679738,0.778572,0.630086,0.448491,0.649832,-0.162701,-0.009867,-0.043205,0.28729
wOBA,0.470927,0.839536,1.0,0.801373,0.698272,0.902726,0.445687,-0.182994,0.604984,0.165851,0.621261,0.756187,0.944439,0.640588,0.632337,0.667723,0.019146,-0.032415,-0.0011,0.223007
xBA,0.520299,0.808166,0.698272,0.783199,1.0,0.703219,0.152586,-0.544486,0.473702,-0.013551,0.476799,0.339876,0.624943,0.27405,0.340622,0.46724,-0.04118,-0.344109,-0.069487,-0.013969
OBP,0.476116,0.781818,0.902726,0.82099,0.703219,1.0,0.578034,-0.339381,0.440806,0.027487,0.676057,0.440285,0.71772,0.319806,0.533422,0.419519,0.105577,-0.230209,-0.186453,0.3077
SLG,0.426829,0.778572,0.944439,0.70357,0.624943,0.71772,0.276394,-0.063935,0.649262,0.245328,0.511379,0.892547,1.0,0.790894,0.622678,0.775551,-0.046247,0.104638,0.137374,0.103359
Exit_Velocity,0.287837,0.733438,0.604984,0.296095,0.473702,0.440806,0.392284,0.179781,1.0,0.328752,0.224378,0.68131,0.649262,0.681807,0.267789,0.65308,-0.319412,0.291767,0.130291,0.22704
ISO,0.250868,0.679738,0.756187,0.307559,0.339876,0.440285,0.348893,0.223047,0.68131,0.417095,0.167016,1.0,0.892547,0.940113,0.538592,0.788925,-0.167161,0.333227,0.186744,0.186918
RBI/PA,0.334756,0.649832,0.667723,0.393378,0.46724,0.419519,0.207794,0.011145,0.65308,0.286905,0.17298,0.788925,0.775551,0.783028,0.379766,1.0,-0.245832,0.185497,0.183407,0.050427
HR/PA,0.22087,0.630086,0.640588,0.187311,0.27405,0.319806,0.307524,0.286526,0.681807,0.440168,0.037083,0.940113,0.790894,1.0,0.432308,0.783028,-0.255771,0.396043,0.217788,0.166091
BB/PA,0.133585,0.576344,0.445687,0.034015,0.152586,0.578034,1.0,0.093212,0.392284,0.24044,0.034552,0.348893,0.276394,0.307524,0.256915,0.207794,-0.068895,0.023369,-0.347475,0.686347


It's important to step back here and digest these correlations. Obviously wOBA is highly correlated with xwOBA, because xwOBA is the "expected" version of wOBA, so we will not use wOBA as a selection variable. The same can be said for xBA, as expected BA is already a part of xwOBA. SLG and ISO both take an old-school approach to accounting for power, meaning that their weights don't fluctuate with the season, and for that reason I'm not going to use them either. That leaves 'OBP' and 'Exit_Velocity' as the next two most highly correlated variables. I'm a little surprised/frustrated that Launch_Angle is not very highly correlated with xwOBA (about 0.22 in 2017). I definitely thought I'd use a combination of 'Exit_Velocity', 'Launch_Angle' and some other variables, so let's look at the top 15 players for 2017 in 'Launch_Angle' and see what types of players we are dealing with.
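To make that comparison easier to read than a full correlation matrix, one option is to pull out only the xwOBA column and drop the variables ruled out above. This is a sketch, not the method used in the rest of the notebook: `xwoba_correlations` is a hypothetical helper and the five-row `toy` frame is invented so the snippet runs on its own.

```python
import pandas as pd


def xwoba_correlations(rate_stats, exclude=("wOBA", "xBA", "SLG", "ISO")):
    """Correlation of each numeric column with xwOBA, minus excluded columns."""
    # keep numeric columns only so text columns such as Player_Name are ignored
    corr = rate_stats.select_dtypes("number").corr()["xwOBA"]
    # drop xwOBA itself plus the variables we decided not to use
    corr = corr.drop(labels=["xwOBA", *exclude], errors="ignore")
    return corr.sort_values(ascending=False)


# invented numbers, purely to show the shape of the output
toy = pd.DataFrame({
    "Player_Name": ["A", "B", "C", "D", "E"],
    "xwOBA": [0.30, 0.32, 0.35, 0.40, 0.28],
    "wOBA": [0.31, 0.30, 0.36, 0.41, 0.27],
    "OBP": [0.31, 0.33, 0.36, 0.41, 0.29],
    "Exit_Velocity": [80.0, 84.0, 82.0, 86.0, 79.0],
})
ranked = xwoba_correlations(toy)
print(ranked)
```

On the real data this would be called as `xwoba_correlations(rate_stats_2017)`, and likewise for 2015 and 2016.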

In [462]:
# top 15 in 'Launch_Angle' for 2017 contains some decent players but no superstars at all
rate_stats_2017.sort_values(by='Launch_Angle', ascending=False).head(15)

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,BA,xBA,OBP,BB/PA,K/PA,Exit_Velocity,Launch_Angle,BABIP,ISO,SLG,HR/PA,R/PA,RBI/PA,SB/PA,Whiffs/PA,Swings/PA,Takes/PA
385,Ryan Schimpf,197,0.291,0.304,0.158,0.156,0.284,0.137,0.355,82.5,33.1,0.145,0.267,0.424,0.071066,0.121827,0.126904,0.0,0.451777,1.639594,2.624365
333,Mike Napoli,485,0.311,0.302,0.193,0.199,0.285,0.101,0.336,83.3,26.4,0.225,0.235,0.428,0.059794,0.123711,0.136082,0.002062,0.65567,1.898969,2.482474
311,Matt Chapman,326,0.314,0.332,0.234,0.216,0.313,0.098,0.282,83.5,25.0,0.29,0.238,0.472,0.042945,0.119632,0.122699,0.0,0.533742,1.760736,2.349693
176,Greg Bird,170,0.324,0.303,0.19,0.206,0.288,0.112,0.247,84.6,24.6,0.194,0.231,0.422,0.052941,0.117647,0.164706,0.0,0.558824,1.829412,2.423529
228,Joey Gallo,532,0.374,0.364,0.209,0.218,0.333,0.141,0.368,85.0,24.4,0.25,0.327,0.537,0.077068,0.159774,0.150376,0.013158,0.860902,1.984962,2.212406
310,Matt Carpenter,622,0.376,0.361,0.241,0.249,0.384,0.175,0.201,83.5,24.1,0.274,0.209,0.451,0.036977,0.146302,0.110932,0.003215,0.299035,1.516077,2.893891
332,Mike Moustakas,598,0.339,0.345,0.272,0.266,0.314,0.057,0.157,81.8,24.0,0.263,0.249,0.521,0.063545,0.125418,0.14214,0.0,0.436455,2.083612,1.652174
282,Kyle Schwarber,486,0.341,0.333,0.211,0.219,0.315,0.121,0.309,83.8,23.9,0.244,0.256,0.467,0.061728,0.13786,0.121399,0.002058,0.574074,1.927984,2.395062
427,Tyler Collins,169,0.28,0.269,0.193,0.196,0.278,0.107,0.325,83.9,23.6,0.264,0.14,0.333,0.029586,0.106509,0.08284,0.0,0.591716,1.863905,2.260355
132,Derek Norris,198,0.291,0.271,0.201,0.223,0.258,0.061,0.242,82.5,23.4,0.214,0.179,0.38,0.045455,0.106061,0.121212,0.005051,0.5,1.868687,1.90404


Instead of Launch_Angle, let's look at another power variable like 'HR/PA' and see if maybe this would be a better variable to include. It has a much higher correlation with xwOBA after all.

In [463]:
# not perfect but Martinez, Stanton, Judge, Bellinger, Donaldson, Trout are the types of players we want, let's use it instead
rate_stats_2017.sort_values(by='HR/PA', ascending=False).head(15)

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,BA,xBA,OBP,BB/PA,K/PA,Exit_Velocity,Launch_Angle,BABIP,ISO,SLG,HR/PA,R/PA,RBI/PA,SB/PA,Whiffs/PA,Swings/PA,Takes/PA
316,Matt Olson,216,0.38,0.411,0.259,0.251,0.352,0.102,0.278,85.4,21.5,0.238,0.392,0.651,0.111111,0.152778,0.208333,0.0,0.606481,1.898148,2.305556
193,J.D. Martinez,489,0.423,0.43,0.303,0.288,0.376,0.108,0.262,83.8,16.0,0.327,0.387,0.69,0.092025,0.173824,0.212679,0.00818,0.627812,2.010225,1.897751
173,Giancarlo Stanton,692,0.398,0.41,0.281,0.266,0.376,0.123,0.236,85.6,16.8,0.288,0.35,0.631,0.08526,0.177746,0.190751,0.00289,0.539017,1.696532,2.24422
369,Rhys Hoskins,212,0.399,0.417,0.259,0.255,0.396,0.175,0.217,83.6,21.6,0.241,0.359,0.618,0.084906,0.174528,0.226415,0.009434,0.367925,1.783019,2.849057
408,Teoscar Hernandez,95,0.317,0.371,0.261,0.218,0.305,0.063,0.379,76.8,18.7,0.333,0.341,0.602,0.084211,0.168421,0.210526,0.0,0.8,2.263158,2.221053
228,Joey Gallo,532,0.374,0.364,0.209,0.218,0.333,0.141,0.368,85.0,24.4,0.25,0.327,0.537,0.077068,0.159774,0.150376,0.013158,0.860902,1.984962,2.212406
4,Aaron Judge,678,0.446,0.43,0.284,0.278,0.422,0.187,0.307,85.1,17.5,0.357,0.343,0.627,0.076696,0.188791,0.168142,0.013274,0.632743,1.811209,2.589971
106,Cody Bellinger,548,0.357,0.38,0.267,0.24,0.352,0.117,0.266,84.2,21.2,0.299,0.315,0.581,0.071168,0.158759,0.177007,0.018248,0.580292,1.784672,2.224453
385,Ryan Schimpf,197,0.291,0.304,0.158,0.156,0.284,0.137,0.355,82.5,33.1,0.145,0.267,0.424,0.071066,0.121827,0.126904,0.0,0.451777,1.639594,2.624365
107,Colby Rasmus,129,0.343,0.365,0.281,0.245,0.318,0.054,0.349,83.5,15.4,0.368,0.298,0.579,0.069767,0.131783,0.178295,0.007752,0.782946,2.015504,1.674419


For my model, I'm going to include HR/PA instead of RBI/PA simply because RBIs are not created equal. There is an argument that not all HRs are created equal either, but batting order position matters significantly more for RBI opportunity than it does for HR opportunity. Also, I'm going to use OBP instead of BB/PA because OBP encompasses all ways of getting on base, whereas BB/PA only takes walks into account. After those variables, correlations with xwOBA start to fall off dramatically, so let's move forward with 'Exit_Velocity', 'OBP', and 'HR/PA' in our model and see what we come up with.

In [464]:
# creating datasets with the variables we care about (for now at least) to analyze
rate_stats_2015 = merged_2015.filter(['Player_Name', 'PA', 'xwOBA', 'wOBA', 'OBP', 'Exit_Velocity', 'HR/PA'], axis=1)
rate_stats_2016 = merged_2016.filter(['Player_Name', 'PA', 'xwOBA', 'wOBA', 'OBP', 'Exit_Velocity', 'HR/PA'], axis=1)
rate_stats_2017 = merged_2017.filter(['Player_Name', 'PA', 'xwOBA', 'wOBA', 'OBP', 'Exit_Velocity', 'HR/PA'], axis=1)

In [465]:
# let's start with the top 15 in Exit_Velocity - Correa, Abreu, Stanton are familiar whereas Diaz, Avila, Olson are not
rate_stats_2017.sort_values(by='Exit_Velocity', ascending=False).head(15)

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
443,Yandy Diaz,179,0.331,0.306,0.352,87.1,0.0
23,Alex Avila,376,0.401,0.362,0.387,86.4,0.037234
386,Ryan Zimmerman,576,0.375,0.387,0.358,85.8,0.0625
80,Carlos Correa,481,0.393,0.394,0.391,85.8,0.049896
244,Jose Abreu,675,0.364,0.377,0.354,85.8,0.048889
173,Giancarlo Stanton,692,0.398,0.41,0.376,85.6,0.08526
316,Matt Olson,216,0.38,0.411,0.352,85.4,0.111111
269,Kendrys Morales,608,0.358,0.32,0.308,85.4,0.046053
330,Miguel Sano,483,0.348,0.361,0.352,85.4,0.057971
11,Adam Lind,301,0.379,0.363,0.362,85.4,0.046512


At this point, I'm going to start segmenting the data to narrow it down to breakout candidates. I will start by looking at the top 50% of players in 'Exit_Velocity' from the 2017 data and go from there.
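The cells that follow read each cutoff off `describe()` and type it in by hand. As an alternative sketch, the same kind of cut can be computed directly with `quantile`, which avoids copying numbers between cells. The helper name `top_fraction` and the six-player `toy` frame are assumptions made for illustration only.

```python
import pandas as pd


def top_fraction(df, columns, q=0.5):
    """Keep rows strictly above the q-th quantile in every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        # quantile(0.5) is the same value describe() reports as '50%'
        mask = mask & (df[col] > df[col].quantile(q))
    return df[mask]


# invented players and values, purely for illustration
toy = pd.DataFrame({
    "Player_Name": ["A", "B", "C", "D", "E", "F"],
    "Exit_Velocity": [80.0, 82.0, 84.0, 86.0, 88.0, 90.0],
    "OBP": [0.30, 0.31, 0.32, 0.33, 0.34, 0.29],
    "HR/PA": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})
survivors = top_fraction(toy, ["Exit_Velocity", "OBP", "HR/PA"], q=0.5)
print(survivors)
```

Passing `q=0.75` would give the top-25% version of the same filter. Note that the strict `>` drops players sitting exactly on the threshold, which matches the comparisons used in this notebook.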

In [466]:
# breaking down 'Exit_Velocity' for 2017
rate_stats_2017['Exit_Velocity'].describe()

count    456.000000
mean      81.481140
std        2.117513
min       74.000000
25%       80.100000
50%       81.700000
75%       83.000000
max       87.100000
Name: Exit_Velocity, dtype: float64

In [467]:
# the 50% threshold for Exit_Velocity > 81.7, pretty good starting point with this group
# Yandy Diaz, highest Exit_Velocity and zero HRs, what the heck?!?
rate_stats_2017[rate_stats_2017.Exit_Velocity > 81.7].sort_values(by='Exit_Velocity', ascending=False)

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
443,Yandy Diaz,179,0.331,0.306,0.352,87.1,0.0
23,Alex Avila,376,0.401,0.362,0.387,86.4,0.037234
244,Jose Abreu,675,0.364,0.377,0.354,85.8,0.048889
386,Ryan Zimmerman,576,0.375,0.387,0.358,85.8,0.0625
80,Carlos Correa,481,0.393,0.394,0.391,85.8,0.049896
173,Giancarlo Stanton,692,0.398,0.41,0.376,85.6,0.08526
269,Kendrys Morales,608,0.358,0.32,0.308,85.4,0.046053
330,Miguel Sano,483,0.348,0.361,0.352,85.4,0.057971
316,Matt Olson,216,0.38,0.411,0.352,85.4,0.111111
11,Adam Lind,301,0.379,0.363,0.362,85.4,0.046512


In [468]:
# our initial list starts off at 221, let's work on narrowing it down some
print(len(rate_stats_2017[rate_stats_2017.Exit_Velocity > 81.7]))

221


In [469]:
# and 2017 'OBP'
rate_stats_2017['OBP'].describe()

count    456.000000
mean       0.321586
std        0.041196
min        0.176000
25%        0.294000
50%        0.323000
75%        0.350000
max        0.454000
Name: OBP, dtype: float64

In [470]:
# look at top 50% in OBP
rate_stats_2017[rate_stats_2017.OBP > 0.323].sort_values(by='OBP', ascending=False)

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
230,Joey Votto,707,0.424,0.428,0.454,81.8,0.050919
334,Mike Trout,507,0.423,0.437,0.442,82.2,0.065089
4,Aaron Judge,678,0.446,0.43,0.422,85.1,0.076696
265,Justin Turner,543,0.397,0.4,0.415,83.7,0.038674
70,Bryce Harper,492,0.39,0.416,0.413,83.3,0.058943
415,Tommy Pham,530,0.366,0.398,0.411,83.1,0.043396
245,Jose Altuve,662,0.349,0.405,0.41,81.0,0.036254
280,Kris Bryant,665,0.367,0.399,0.409,81.3,0.043609
41,Austin Barnes,262,0.37,0.386,0.408,83.4,0.030534
360,Paul Goldschmidt,665,0.397,0.4,0.404,85.3,0.054135


In [471]:
# 225 for top 50% in OBP
print(len(rate_stats_2017[rate_stats_2017.OBP > 0.323]))

225


In [472]:
# and 2017 'HR/PA'
rate_stats_2017['HR/PA'].describe()

count    456.000000
mean       0.031360
std        0.017441
min        0.000000
25%        0.018881
50%        0.029571
75%        0.043173
max        0.111111
Name: HR/PA, dtype: float64

In [473]:
# look at top 50% in HR/PA
rate_stats_2017[rate_stats_2017['HR/PA'] > 0.029571].sort_values(by='HR/PA', ascending=False)

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
316,Matt Olson,216,0.38,0.411,0.352,85.4,0.111111
193,J.D. Martinez,489,0.423,0.43,0.376,83.8,0.092025
173,Giancarlo Stanton,692,0.398,0.41,0.376,85.6,0.08526
369,Rhys Hoskins,212,0.399,0.417,0.396,83.6,0.084906
408,Teoscar Hernandez,95,0.317,0.371,0.305,76.8,0.084211
228,Joey Gallo,532,0.374,0.364,0.333,85.0,0.077068
4,Aaron Judge,678,0.446,0.43,0.422,85.1,0.076696
106,Cody Bellinger,548,0.357,0.38,0.352,84.2,0.071168
385,Ryan Schimpf,197,0.291,0.304,0.284,82.5,0.071066
107,Colby Rasmus,129,0.343,0.365,0.318,83.5,0.069767


In [474]:
# 228 for top 50% in HR/PA
print(len(rate_stats_2017[rate_stats_2017['HR/PA'] > 0.029571]))

228


In [475]:
# let's combine our top 50% threshold for each category and see what kind of a list we come up with
rate_stats_2017_v2 = rate_stats_2017[(rate_stats_2017.Exit_Velocity > 81.7) & (rate_stats_2017.OBP > 0.323) 
                                     & (rate_stats_2017['HR/PA'] > 0.029571)]

In [476]:
# sorting by 'PA' ascending, a lot of superstars at the bottom, some trendy breakout candidates, some not so much, let's continue
rate_stats_2017_v2.sort_values(by='PA')

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
240,Jorge Alfaro,114,0.322,0.369,0.36,82.2,0.04386
220,Jesse Winker,137,0.35,0.384,0.375,83.1,0.051095
414,Tommy La Stella,151,0.341,0.368,0.389,82.5,0.033113
369,Rhys Hoskins,212,0.399,0.417,0.396,83.6,0.084906
316,Matt Olson,216,0.38,0.411,0.352,85.4,0.111111
363,Rafael Devers,240,0.296,0.344,0.338,82.5,0.041667
41,Austin Barnes,262,0.37,0.386,0.408,83.4,0.030534
11,Adam Lind,301,0.379,0.363,0.362,85.4,0.046512
249,Jose Martinez,307,0.411,0.379,0.379,84.5,0.045603
376,Robinson Chirinos,309,0.334,0.369,0.36,83.4,0.055016


In [477]:
# combining our top 50% for each category narrows our list down to 89
print(len(rate_stats_2017_v2))

89


In [478]:
# 89 is still quite a large list to sift through so what if we narrowed it down to top 25% in each category
rate_stats_2017_v3 = rate_stats_2017[(rate_stats_2017.Exit_Velocity > 83.0) & (rate_stats_2017['HR/PA'] > 0.043173) & 
                                     (rate_stats_2017.OBP > 0.350)]

In [479]:
# combining all three together and sorting by PA gives us a pretty nice group
rate_stats_2017_v3.sort_values(by='PA')

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
220,Jesse Winker,137,0.35,0.384,0.375,83.1,0.051095
369,Rhys Hoskins,212,0.399,0.417,0.396,83.6,0.084906
316,Matt Olson,216,0.38,0.411,0.352,85.4,0.111111
11,Adam Lind,301,0.379,0.363,0.362,85.4,0.046512
249,Jose Martinez,307,0.411,0.379,0.379,84.5,0.045603
376,Robinson Chirinos,309,0.334,0.369,0.36,83.4,0.055016
263,Justin Bour,429,0.374,0.374,0.366,83.1,0.058275
324,Michael Conforto,440,0.376,0.392,0.384,83.8,0.061364
80,Carlos Correa,481,0.393,0.394,0.391,85.8,0.049896
330,Miguel Sano,483,0.348,0.361,0.352,85.4,0.057971


In [480]:
# 27 is a lot more manageable, however a lot of these players on this list are already stars or at least full time players
print(len(rate_stats_2017_v3))

27


Because our list contains a lot of stars, we need to pause here and put it into perspective. Is Carlos Correa really a potential breakout candidate? Is Jose Martinez? From the table above alone, the difference between the two is not clear; you would assume they are basically the same player. To account for that, let's import the NFBC draft data and compare the players again.

In [481]:
# read in nfbc adp 2018 data
nfbc_adp_2018 = pd.read_csv("C:/Users/avitosky/Documents/Baseball Project/nfbc_adp_2018.csv")

In [482]:
nfbc_adp_2018.head()

Unnamed: 0,Rank,Player,Team,Position(s),ADP,Min Pick,Max Pick,Difference,# Picks,Team.1,Team Pick
0,1,"Trout, Mike",LAA,OF,1.07,1,2,,81,,
1,2,"Altuve, Jose",HOU,2B,2.14,1,4,,81,,
2,3,"Goldschmidt, Paul",ARZ,1B,4.3,2,7,,81,,
3,4,"Turner, Trea",WAS,SS,5.17,2,12,,81,,
4,5,"Arenado, Nolan",COL,3B,5.36,2,12,,81,,


In [483]:
# create 'Player_Name' column that is same as our current data so it can be merged together
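# split "Last, First" once on the comma; expand=True returns the two parts as columns
# (tuple-unpacking the .str accessor no longer works in recent pandas versions)
nfbc_adp_2018[['Last_Name', 'First_Name']] = nfbc_adp_2018['Player'].str.split(',', n=1, expand=True)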
nfbc_adp_2018['Player_Name'] = nfbc_adp_2018['First_Name'].map(str) + ' ' + nfbc_adp_2018['Last_Name']
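
NFBC lists names as "Last, First" while the Statcast data uses "First Last". A small standalone sketch of the conversion, using three names from the `head()` output above; the variable names `players`, `parts` and `player_name` are just for illustration.

```python
import pandas as pd

players = pd.Series(['Trout, Mike', 'Altuve, Jose', 'Goldschmidt, Paul'])

# split once on the first comma: column 0 is the last name, column 1 the first name
parts = players.str.split(',', n=1, expand=True)

# strip the space left behind after the comma, then join as "First Last"
player_name = parts[1].str.strip() + ' ' + parts[0].str.strip()
print(player_name.tolist())
```

Stripping each part before joining avoids the leading space that otherwise has to be cleaned up in a later step.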

In [484]:
# only keep relevant columns, we don't care about Difference, # Picks, Team.1, Team Pick
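# .copy() so later column assignments act on an independent frame (avoids SettingWithCopyWarning)
nfbc_adp_2018 = nfbc_adp_2018[['Rank', 'Player_Name', 'Team', 'Position(s)', 'ADP', 'Min Pick', 'Max Pick']].copy()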

In [485]:
# strip out any extra white space in Player_Name column before continuing
nfbc_adp_2018['Player_Name'] = nfbc_adp_2018['Player_Name'].str.strip()

In [486]:
# before we merge, and since this dataset comes from a different source than statcast, let's see if we have any name mismatches
rate_stats_2017_v3[(~rate_stats_2017_v3.Player_Name.isin(nfbc_adp_2018.Player_Name))]

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
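
The empty table means every player in the filtered list has a matching name in the NFBC data. An alternative sketch of the same check uses a left merge with `indicator=True`, which adds a `_merge` column saying where each row was found. The frames `stats` and `adp` below are made-up stand-ins, not the real datasets.

```python
import pandas as pd

# made-up stand-ins for the stats and ADP tables
stats = pd.DataFrame({'Player_Name': ['Player A', 'Player B', 'Player C']})
adp = pd.DataFrame({'Player_Name': ['Player A', 'Player C'], 'ADP': [10.5, 200.0]})

# how='left' keeps every stats row; _merge is 'left_only' when no ADP row matched
checked = stats.merge(adp, on='Player_Name', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Player_Name'].tolist()
print(unmatched)
```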


In [487]:
# no name mismatches so let's merge datasets on Player_Name
common_2018_v1 = pd.merge(nfbc_adp_2018, rate_stats_2017_v3, on='Player_Name')

In [488]:
# show our merged dataset
common_2018_v1.sort_values(by='ADP')

Unnamed: 0,Rank,Player_Name,Team,Position(s),ADP,Min Pick,Max Pick,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
0,3,Paul Goldschmidt,ARZ,1B,4.3,2,7,665,0.397,0.4,0.404,85.3,0.054135
1,5,Nolan Arenado,COL,3B,5.36,2,12,680,0.363,0.395,0.373,83.6,0.054412
2,8,Bryce Harper,WAS,OF,8.4,2,14,492,0.39,0.416,0.413,83.3,0.058943
3,10,Giancarlo Stanton,NYY,OF,9.0,3,17,692,0.398,0.41,0.376,85.6,0.08526
4,14,Carlos Correa,HOU,SS,13.95,4,22,481,0.393,0.394,0.391,85.8,0.049896
5,16,Aaron Judge,NYY,OF,17.64,4,28,678,0.446,0.43,0.422,85.1,0.076696
6,19,Jose Ramirez,CLE,"2B, 3B",19.59,7,29,645,0.355,0.396,0.374,83.2,0.044961
8,22,Freddie Freeman,ATL,1B,22.16,13,32,514,0.403,0.407,0.403,84.3,0.054475
9,23,J.D. Martinez,ARZ,OF,24.16,13,34,489,0.423,0.43,0.376,83.8,0.092025
10,24,Cody Bellinger,LAD,"1B, OF",24.57,12,39,548,0.357,0.38,0.352,84.2,0.071168


In [489]:
# let's drop the second Jose Ramirez that snuck into our data
common_2018_v1 = common_2018_v1.drop(common_2018_v1.index[7])
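
Dropping by position (`index[7]`) works here but depends on the row order staying the same. A hedged sketch of catching the shared-name problem before it reaches the merged table: passing `validate='one_to_one'` makes `pd.merge` raise `MergeError` when a key repeats on either side, and `duplicated(..., keep=False)` lists the offending rows. The team codes and the `Player X` row are invented for the example.

```python
import pandas as pd
from pandas.errors import MergeError

# invented rows: two different players sharing one name in the ADP table
adp = pd.DataFrame({
    'Player_Name': ['Jose Ramirez', 'Jose Ramirez', 'Player X'],
    'Team': ['AAA', 'BBB', 'CCC'],
})
stats = pd.DataFrame({'Player_Name': ['Jose Ramirez', 'Player X'], 'PA': [645, 300]})

try:
    pd.merge(adp, stats, on='Player_Name', validate='one_to_one')
    duplicate_found = False
except MergeError:
    # raised because 'Jose Ramirez' is not unique in the left table
    duplicate_found = True

# show every row whose name appears more than once so the right one can be kept by hand
dupes = adp[adp.duplicated(subset='Player_Name', keep=False)]
print(duplicate_found, len(dupes))
```

Which of the duplicate rows to keep is still a judgment call (team or position would be the natural tie-breaker); the check only makes sure the collision is not missed.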

In [490]:
# the merge added one extra row because two players share the exact same name; after dropping it we are back to 27
print(len(common_2018_v1))

27


Let's pause here again. Our potential breakout list contains many players who are being drafted fairly high by ADP. These are not potential breakouts because they have already broken out. Perhaps we should look at players who were not full-time players in 2017. The most logical way to define a full-time player is a PA cutoff of 502: under MLB rules, players with at least 502 PA qualify for the batting title, whereas those with fewer do not.
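
For reference, the 502 figure comes from the qualification rule of 3.1 plate appearances per scheduled team game over a 162-game season. A tiny sketch of that arithmetic; the names `QUALIFYING_PA` and `is_full_time` are hypothetical helpers, not something defined elsewhere in this notebook.

```python
PA_PER_TEAM_GAME = 3.1
GAMES_PER_SEASON = 162

# 3.1 * 162 = 502.2, which is applied as a 502 PA minimum
QUALIFYING_PA = int(PA_PER_TEAM_GAME * GAMES_PER_SEASON)

def is_full_time(pa, cutoff=QUALIFYING_PA):
    """True when a season's plate appearances reach the qualifying cutoff."""
    return pa >= cutoff

# 492 and 514 are two PA totals from the merged table above
print(QUALIFYING_PA, is_full_time(492), is_full_time(514))
```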

In [491]:
# first let's see who had PA < 502 from 2017
rate_stats_2017_v3 = common_2018_v1[(common_2018_v1.Exit_Velocity > 83.0) & (common_2018_v1['HR/PA'] > 0.043173) & 
                                     (common_2018_v1.OBP > 0.350) & (common_2018_v1.PA < 502)]

In [492]:
# our refined list after taking out full time players, but wait we still have some on our list!
rate_stats_2017_v3

Unnamed: 0,Rank,Player_Name,Team,Position(s),ADP,Min Pick,Max Pick,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
2,8,Bryce Harper,WAS,OF,8.4,2,14,492,0.39,0.416,0.413,83.3,0.058943
4,14,Carlos Correa,HOU,SS,13.95,4,22,481,0.393,0.394,0.391,85.8,0.049896
9,23,J.D. Martinez,ARZ,OF,24.16,13,34,489,0.423,0.43,0.376,83.8,0.092025
12,29,Josh Donaldson,TOR,3B,28.51,16,40,496,0.384,0.396,0.385,83.3,0.066532
14,48,Rhys Hoskins,PHI,"1B, OF",50.94,31,79,212,0.399,0.417,0.396,83.6,0.084906
18,91,Miguel Sano,MIN,3B,97.54,63,182,483,0.348,0.361,0.352,85.4,0.057971
19,122,Matt Olson,OAK,1B,120.61,71,164,216,0.38,0.411,0.352,85.4,0.111111
22,164,Michael Conforto,NYM,OF,170.0,59,279,440,0.376,0.392,0.384,83.8,0.061364
23,190,Justin Bour,MIA,1B,191.88,117,282,429,0.374,0.374,0.366,83.1,0.058275
24,262,Robinson Chirinos,TEX,C,267.19,201,330,309,0.334,0.369,0.36,83.4,0.055016


In [493]:
# current list is at 13
print(len(rate_stats_2017_v3))

13


We have an issue here: Harper, Correa, J.D. Martinez, and Donaldson are all already superstars (as shown by their ADP) but are still on our breakout list. Each was hurt for a portion of 2017, so they fell short of our 502 PA cutoff for that season. Let's do a quick check to see how many players on our breakout list reached the 502 PA threshold in 2015 or 2016. I'm going to remove anyone who had >= 502 PA in either of those seasons.

In [494]:
# create 2016 PA check dataset
PA_check_2016 = rate_stats_2016[(rate_stats_2016.PA >= 502)]

In [495]:
# 2018 "breakout" players who had >= 502 PA in 2016, we will exclude them in our final dataset
common_2016 = rate_stats_2017_v3.merge(PA_check_2016, on='Player_Name')
rate_stats_2017_v3[(rate_stats_2017_v3.Player_Name.isin(common_2016.Player_Name))]

Unnamed: 0,Rank,Player_Name,Team,Position(s),ADP,Min Pick,Max Pick,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
2,8,Bryce Harper,WAS,OF,8.4,2,14,492,0.39,0.416,0.413,83.3,0.058943
4,14,Carlos Correa,HOU,SS,13.95,4,22,481,0.393,0.394,0.391,85.8,0.049896
9,23,J.D. Martinez,ARZ,OF,24.16,13,34,489,0.423,0.43,0.376,83.8,0.092025
12,29,Josh Donaldson,TOR,3B,28.51,16,40,496,0.384,0.396,0.385,83.3,0.066532


In [496]:
# create 2015 PA check dataset
PA_check_2015 = rate_stats_2015[(rate_stats_2015.PA >= 502)]

In [497]:
# 2018 "breakout" players who had >= 502 PA in 2015, we will exclude them in our final dataset
common_2015 = rate_stats_2017_v3.merge(PA_check_2015, on='Player_Name')
rate_stats_2017_v3[(rate_stats_2017_v3.Player_Name.isin(common_2015.Player_Name))]

Unnamed: 0,Rank,Player_Name,Team,Position(s),ADP,Min Pick,Max Pick,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
2,8,Bryce Harper,WAS,OF,8.4,2,14,492,0.39,0.416,0.413,83.3,0.058943
9,23,J.D. Martinez,ARZ,OF,24.16,13,34,489,0.423,0.43,0.376,83.8,0.092025
12,29,Josh Donaldson,TOR,3B,28.51,16,40,496,0.384,0.396,0.385,83.3,0.066532
27,493,Adam Lind,WAS,"1B, OF",492.83,374,608,301,0.379,0.363,0.362,85.4,0.046512


In [521]:
# let's look at our list again without the already full time players
hitting_breakouts_2018_v1 = rate_stats_2017_v3[(~rate_stats_2017_v3.Player_Name.isin(common_2016.Player_Name)) & 
                   (~rate_stats_2017_v3.Player_Name.isin(common_2015.Player_Name)) ]
hitting_breakouts_2018_v1.sort_values(by='ADP')

Unnamed: 0,Rank,Player_Name,Team,Position(s),ADP,Min Pick,Max Pick,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
14,48,Rhys Hoskins,PHI,"1B, OF",50.94,31,79,212,0.399,0.417,0.396,83.6,0.084906
18,91,Miguel Sano,MIN,3B,97.54,63,182,483,0.348,0.361,0.352,85.4,0.057971
19,122,Matt Olson,OAK,1B,120.61,71,164,216,0.38,0.411,0.352,85.4,0.111111
22,164,Michael Conforto,NYM,OF,170.0,59,279,440,0.376,0.392,0.384,83.8,0.061364
23,190,Justin Bour,MIA,1B,191.88,117,282,429,0.374,0.374,0.366,83.1,0.058275
24,262,Robinson Chirinos,TEX,C,267.19,201,330,309,0.334,0.369,0.36,83.4,0.055016
25,273,Jose Martinez,STL,"1B, OF",278.88,218,420,307,0.411,0.379,0.379,84.5,0.045603
26,320,Jesse Winker,CIN,OF,323.79,232,500,137,0.35,0.384,0.375,83.1,0.051095
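
The two prior-season checks can also be folded into one helper. This is only a sketch, assuming each season's frame has `Player_Name` and `PA` columns as `rate_stats_2015` and `rate_stats_2016` appear to; `exclude_prior_full_timers` and the toy frames are hypothetical. It builds the set of full-time names straight from the prior seasons rather than from a merge with the candidate list.

```python
import pandas as pd

def exclude_prior_full_timers(candidates, prior_seasons, cutoff=502):
    """Drop candidates who reached `cutoff` PA in any of the prior seasons."""
    full_timers = set()
    for season in prior_seasons:
        # collect every name that met the cutoff in this season
        full_timers.update(season.loc[season['PA'] >= cutoff, 'Player_Name'])
    return candidates[~candidates['Player_Name'].isin(full_timers)]

# toy frames for illustration only
candidates = pd.DataFrame({'Player_Name': ['P1', 'P2', 'P3'], 'PA': [212, 492, 301]})
season_2016 = pd.DataFrame({'Player_Name': ['P2', 'P3'], 'PA': [627, 400]})
season_2015 = pd.DataFrame({'Player_Name': ['P3'], 'PA': [575]})

remaining = exclude_prior_full_timers(candidates, [season_2015, season_2016])
print(remaining['Player_Name'].tolist())
```

Here P2 is removed for its 2016 season and P3 for its 2015 season, leaving only P1.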


In [499]:
# final list count
print(len(hitting_breakouts_2018_v1))

8


In [500]:
# lastly, let's create a "names to keep in mind" list of players who hit the ball hard (top 25%) but for whatever
# reason did not make our breakout list, most likely due to small sample size or inability to lift the ball
# it is much easier to learn to elevate the ball than to hit the ball harder, so we'll drop the HR/PA filter here
# note: common_2016 and common_2015 only hold names from the 13-player list above, so these two filters do not
# remove every player with >= 502 PA in 2015 or 2016 (PA_check_2016 / PA_check_2015 would be needed for that)
hitting_breakouts_2018_exit_velocity = rate_stats_2017[(rate_stats_2017.Exit_Velocity > 83.0) & (rate_stats_2017.OBP > 0.350) 
                                      & (rate_stats_2017.PA < 502) & (~rate_stats_2017.Player_Name.isin(common_2016.Player_Name))
                                      & (~rate_stats_2017.Player_Name.isin(common_2015.Player_Name))
                                      & (~rate_stats_2017.Player_Name.isin(hitting_breakouts_2018_v1.Player_Name))]

In [501]:
# to recap: these are players who hit the ball hard and get on base but didn't hit many home runs in 2017
# we excluded the players flagged by the 2015 and 2016 PA >= 502 checks above, then excluded our original breakout list
# welcome back Yandy!
hitting_breakouts_2018_exit_velocity.sort_values(by='Exit_Velocity', ascending=False)

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
443,Yandy Diaz,179,0.331,0.306,0.352,87.1,0.0
23,Alex Avila,376,0.401,0.362,0.387,86.4,0.037234
287,Logan Forsythe,439,0.329,0.307,0.351,84.0,0.013667
219,Jeimer Candelario,142,0.316,0.342,0.359,83.8,0.021127
323,Michael Brantley,375,0.344,0.342,0.357,83.8,0.024
41,Austin Barnes,262,0.37,0.386,0.408,83.4,0.030534


In [502]:
# let's also merge this with NFBC to see where these players are being drafted
hitting_breakouts_2018_exit_velocity = pd.merge(nfbc_adp_2018, hitting_breakouts_2018_exit_velocity, on='Player_Name')

In [503]:
# Barnes is really the only one here getting any love, maybe just put the rest of the names on your radar
hitting_breakouts_2018_exit_velocity

Unnamed: 0,Rank,Player_Name,Team,Position(s),ADP,Min Pick,Max Pick,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
0,182,Austin Barnes,LAD,"C, 2B",186.5,137,291,262,0.37,0.386,0.408,83.4,0.030534
1,247,Michael Brantley,CLE,OF,250.83,163,372,375,0.344,0.342,0.357,83.8,0.024
2,337,Jeimer Candelario,DET,3B,337.45,189,451,142,0.316,0.342,0.359,83.8,0.021127
3,418,Alex Avila,ARZ,C,418.61,203,520,376,0.401,0.362,0.387,86.4,0.037234
4,429,Logan Forsythe,LAD,"2B, 3B",427.5,328,531,439,0.329,0.307,0.351,84.0,0.013667
5,489,Yandy Diaz,CLE,3B,488.95,285,619,179,0.331,0.306,0.352,87.1,0.0


In [504]:
# saving our final potential breakout list along with our exit velocity list to .csv
hitting_breakouts_2018_v1.to_csv("C:/Users/avitosky/Documents/Baseball Project/hitting_breakouts_2018_v1.csv")
hitting_breakouts_2018_exit_velocity.to_csv("C:/Users/avitosky/Documents/Baseball Project/hitting_breakouts_2018_exit_velocity.csv")

Now that we have determined which under-the-radar players displayed the ability, albeit some in a limited sample, to potentially break out in 2018, let's look back at the 2016 and 2015 data for context and to see whether players flagged the same way in those seasons actually did break out.

In [505]:
# all variables were similarly correlated in 2015, 2016, and 2017 so let's look at 2016 variables now
rate_stats_2016['Exit_Velocity'].describe()

count    459.000000
mean      83.332462
std        2.248195
min       76.400000
25%       81.800000
50%       83.600000
75%       84.900000
max       88.600000
Name: Exit_Velocity, dtype: float64

In [506]:
# 2016 HR/PA
rate_stats_2016['HR/PA'].describe()

count    459.000000
mean       0.028691
std        0.016414
min        0.000000
25%        0.016251
50%        0.028169
75%        0.040196
max        0.087336
Name: HR/PA, dtype: float64

In [507]:
# 2016 OBP
rate_stats_2016['OBP'].describe()

count    459.000000
mean       0.317536
std        0.040320
min        0.105000
25%        0.295000
50%        0.318000
75%        0.346000
max        0.441000
Name: OBP, dtype: float64

In [508]:
# creating the top 25% cutoff for each category, this time for 2016
rate_stats_2016_v2 = rate_stats_2016[(rate_stats_2016.Exit_Velocity > 84.9) & (rate_stats_2016['HR/PA'] > 0.040196) 
                                     & (rate_stats_2016.OBP > 0.346) & (rate_stats_2016.PA < 502)]

In [509]:
# Wright and Rodriguez battled injuries in 2017, Pearce is a perennial part-timer, but the rest enjoyed nice 2017 seasons
rate_stats_2016_v2.sort_values(by='xwOBA', ascending=False)

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
173,Gary Sanchez,229,0.395,0.425,0.376,86.3,0.087336
263,Justin Bour,321,0.382,0.343,0.349,87.2,0.046729
137,David Wright,164,0.381,0.344,0.35,85.2,0.042683
312,Matt Joyce,293,0.37,0.375,0.403,85.1,0.044369
406,Steve Pearce,302,0.365,0.371,0.374,85.8,0.043046
395,Sean Rodriguez,342,0.344,0.363,0.349,85.1,0.052632


In [510]:
# lastly for 2016 let's recreate our top Exit Velocity part time players
hitting_breakouts_2016_exit_velocity = rate_stats_2016[(rate_stats_2016.Exit_Velocity > 84.9) & (rate_stats_2016.OBP > 0.346) 
                                      & (rate_stats_2016.PA < 502) & (~rate_stats_2016.Player_Name.isin(common_2015.Player_Name))
                                      & (~rate_stats_2016.Player_Name.isin(rate_stats_2016_v2.Player_Name))]

In [511]:
# players who hit the ball hard in 2016 but were not elite home run hitters 
hitting_breakouts_2016_exit_velocity

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
22,Aledmys Diaz,460,0.322,0.37,0.369,85.2,0.036957
24,Alex Avila,209,0.327,0.329,0.359,85.8,0.033493
40,Anthony Recker,112,0.417,0.361,0.394,85.9,0.017857
188,Hunter Pence,442,0.334,0.349,0.357,86.2,0.029412
189,Hyun Soo Kim,346,0.349,0.352,0.382,86.2,0.017341
399,Shin-Soo Choo,210,0.385,0.334,0.357,86.4,0.033333
424,Trea Turner,324,0.34,0.395,0.37,85.2,0.040123
433,Tyler Flowers,325,0.347,0.338,0.357,86.0,0.024615


In [512]:
# now let's take a look at 2015 stats and see how they stack up
rate_stats_2015['Exit_Velocity'].describe()

count    467.000000
mean      86.553747
std        2.722148
min       78.000000
25%       84.800000
50%       86.900000
75%       88.500000
max       95.400000
Name: Exit_Velocity, dtype: float64

In [513]:
# HR/PA for 2015
rate_stats_2015['HR/PA'].describe()

count    467.000000
mean       0.025877
std        0.016149
min        0.000000
25%        0.014063
50%        0.024927
75%        0.035604
max        0.088496
Name: HR/PA, dtype: float64

In [514]:
# OBP for 2015
rate_stats_2015['OBP'].describe()

count    467.000000
mean       0.312143
std        0.042466
min        0.163000
25%        0.289500
50%        0.314000
75%        0.339000
max        0.460000
Name: OBP, dtype: float64

In [515]:
# and applying the top 25% cutoffs for 2015, again keeping only players with PA < 502
rate_stats_2015_v1 = rate_stats_2015[(rate_stats_2015.Exit_Velocity > 88.5) & (rate_stats_2015['HR/PA'] > 0.035604) & 
                                     (rate_stats_2015.OBP > 0.339) & (rate_stats_2015.PA < 502)]
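
The 2015 exit velocities run roughly 3 mph higher than 2016's (mean 86.55 vs 83.33 in the `describe()` outputs above), so each season needs its own cutoff. One way to avoid retyping cutoffs per year is to rank players within their own season. A sketch, assuming the seasons are stacked into one frame with a `Season` column (that column and the toy values are hypothetical):

```python
import pandas as pd

# toy values for illustration only; the two seasons sit on different scales
seasons = pd.DataFrame({
    'Season': [2015] * 4 + [2016] * 4,
    'Player_Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Exit_Velocity': [84.0, 86.0, 88.0, 90.0, 81.0, 83.0, 85.0, 87.0],
})

# percentile rank within each season, from 0 to 1 (1.0 = hardest hitter that year)
seasons['EV_pct'] = seasons.groupby('Season')['Exit_Velocity'].rank(pct=True)

# top 25% of each season, regardless of that season's scale
top_quarter = seasons[seasons['EV_pct'] > 0.75]
print(top_quarter['Player_Name'].tolist())
```

A rank-based percentile can land a player or two differently from the interpolated 75% value that `describe()` reports, so the lists may not match the hand-typed cutoffs exactly.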

In [516]:
# 25% threshold for 2015
rate_stats_2015_v1.sort_values(by='xwOBA', ascending=False)

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
186,Freddie Freeman,481,0.416,0.364,0.37,89.3,0.037422
192,Giancarlo Stanton,318,0.413,0.394,0.346,95.4,0.084906
353,Miguel Sano,335,0.384,0.392,0.385,92.6,0.053731
185,Franklin Gutierrez,189,0.38,0.41,0.354,89.4,0.079365
360,Mikie Mahtook,115,0.38,0.411,0.351,88.9,0.078261
196,Greg Bird,178,0.379,0.372,0.343,91.2,0.061798
325,Mark Teixeira,462,0.378,0.381,0.357,88.9,0.0671
404,Ryan Raburn,201,0.374,0.397,0.393,89.2,0.039801
306,Kyle Schwarber,273,0.357,0.364,0.355,91.2,0.058608
79,Carlos Correa,432,0.354,0.365,0.345,88.6,0.050926


In [517]:
# a very interesting list, some broke out in 2016, some in 2017, while others never really did
print(len(rate_stats_2015_v1))

13


In [518]:
# lastly for 2015 let's recreate our top Exit Velocity part time players
hitting_breakouts_2015_exit_velocity = rate_stats_2015[(rate_stats_2015.Exit_Velocity > 88.5) & (rate_stats_2015.OBP > 0.339) 
                                      & (rate_stats_2015.PA < 502) & (~rate_stats_2015.Player_Name.isin(common_2015.Player_Name))
                                      & (~rate_stats_2015.Player_Name.isin(rate_stats_2015_v1.Player_Name))]

In [519]:
# players who hit the ball hard in 2015 but were not elite home run hitters 
hitting_breakouts_2015_exit_velocity

Unnamed: 0,Player_Name,PA,xwOBA,wOBA,OBP,Exit_Velocity,HR/PA
117,Corey Seager,113,0.412,0.421,0.425,89.1,0.035398
171,Enrique Hernandez,218,0.316,0.359,0.346,90.3,0.03211
189,George Springer,451,0.359,0.36,0.367,88.7,0.035477
233,Jason Rogers,169,0.329,0.354,0.367,89.3,0.023669
255,John Jaso,216,0.392,0.364,0.38,88.9,0.023148
281,Jung Ho Kang,467,0.356,0.356,0.355,89.0,0.03212
334,Matt Holliday,277,0.354,0.351,0.394,89.2,0.01444
431,Tommy Pham,173,0.352,0.352,0.347,90.8,0.028902


Looking at the 2016 top 25% list, Sanchez broke out in 2017, but were the others really breakouts? The 2016 Exit Velocity list contained Turner, which was great, but the rest were kind of meh. The 2015 breakout list contained Stanton, Freeman, Sano, and Correa, but some duds as well. The 2015 Exit Velocity list had Seager, Springer, and Pham, but again with some duds.
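
One hedged way to put a number on "did they break out" is to join a candidate list to the following season and count how many reached the 502 PA cutoff. Treating "reached full-time PA" as the definition of a breakout is an assumption made only for this sketch, and `full_time_rate` plus the toy frames are hypothetical.

```python
import pandas as pd

def full_time_rate(candidates, next_season, cutoff=502):
    """Share of candidates who reached `cutoff` PA in the following season."""
    joined = candidates.merge(next_season, on='Player_Name', how='left',
                              suffixes=('_prev', '_next'))
    # candidates missing from the next season count as not reaching the cutoff
    reached = joined['PA_next'].fillna(0) >= cutoff
    return reached.mean()

# toy frames for illustration only (made-up PA totals)
candidates = pd.DataFrame({'Player_Name': ['P1', 'P2', 'P3', 'P4'],
                           'PA': [200, 300, 150, 280]})
next_season = pd.DataFrame({'Player_Name': ['P1', 'P2', 'P4'],
                            'PA': [520, 430, 540]})

rate = full_time_rate(candidates, next_season)
print(rate)
```

The same join could compare next-season wOBA against the prior season's xwOBA instead of playing time, which may be closer to "performing above expectation".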

It's not possible to create a list where everyone will break out, but it's important to do thorough analysis on the top 25% and then skim through the Exit Velocity list to see if any others are worth further research. I initially also created 50% threshold lists, but those were super long to comb through and contained a lot more duds! Perhaps going back through and adding an "Age" column would help determine potential breakouts, along with paying attention to the players who hit the ball the hardest. Yes, it is possible for older players to break out (thanks to advanced metrics!), but it's more likely for younger players to do so.