# Explanation
First, let's configure the notebook with some built-in pieces that we'll need. Then we'll get to the details.

This notebook is available at: https://github.com/dcbove/merlin-notebooks/blob/master/explanation.ipynb.

The notebooks that went into this are available at:
* [pitchertest](https://github.com/dcbove/merlin-notebooks/blob/master/pitchertest.ipynb)
* [kmeans_pitch_type](https://github.com/dcbove/merlin-notebooks/blob/master/kmeans_pitch_type.ipynb)
* [pitcher_batter_woba_clustering](https://github.com/dcbove/merlin-notebooks/blob/master/pitcher_batter_woba_clustering.ipynb)

## Configure Notebook
Let's set up some styling, some imports, and functions that we'll use throughout.

In [2]:
%%html
<style>
.rendered_html tr, .rendered_html th, .rendered_html td {
  text-align: left;
}
/* .rendered_html :first-child {
  text-align: left;
}
.rendered_html :last-child {
  text-align: left;
} */
</style>

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import io

import boto3
from sagemaker import get_execution_role
import sagemaker.amazon.common as smac

In [55]:
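def _df_from_s3(key, bucket='appleforge-merlin-develop-datalake', prefix='pitchtype'):
    """Read the CSV at s3://<bucket>/<prefix>/<key> into a DataFrame."""
    data_location = 's3://{}/{}/{}'.format(bucket, prefix, key)
    # pandas reads s3:// URLs directly when the s3fs package is installed.
    return pd.read_csv(data_location, low_memory=False)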

## Source Data
The source data was extracted from the MLB Baseball Savant system and stored in my personal bucket at `appleforge-merlin-develop-datalake`. This bucket should be available for public read. Note that some of the notebooks write output at the end; without write access to the bucket, those cells should fail.

The sample data that I have covers every single pitch thrown from 2017 to 2019 (about 700K pitches per year). For each pitch, three things of note are recorded:
* The exact scenario, including pitcher, batter, count, runners on base, number of outs.
* The relevant [Statcast Data](https://en.wikipedia.org/wiki/Statcast), which includes detailed information about the pitch including release point, break, velocity, and acceleration in three dimensions.
* The outcome of the pitch, including called ball or strike, swing, contact, out or not, fielding data, and runs scored.

Here's a quick sample of the file from April 26, 2019:

In [62]:
df = _df_from_s3(key='2019-04-26.csv', prefix='savant')
df.iloc[1:10, list(range(1, 8)) + list(range(27, 37))]

Unnamed: 0,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,pfx_x,pfx_z,plate_x,plate_z,on_3b,on_2b,on_1b,outs_when_up,inning,inning_topbot
1,2019-04-26,93.9,-1.8995,5.6741,Mitch Haniger,571745,664871,-0.6183,1.299,0.3479,2.3829,553882.0,592387.0,543829.0,1,11,Bot
2,2019-04-26,95.4,-2.0647,5.6099,Mitch Haniger,571745,664871,-0.633,1.3012,-0.7073,3.4248,553882.0,592387.0,543829.0,1,11,Bot
3,2019-04-26,85.5,-1.8516,5.659,Mitch Haniger,571745,664871,0.2276,0.3022,-0.0876,3.7103,553882.0,592387.0,543829.0,1,11,Bot
4,2019-04-26,95.0,-1.7544,5.7495,Mitch Haniger,571745,664871,-0.6222,1.2687,1.1685,1.8897,553882.0,592387.0,543829.0,1,11,Bot
5,2019-04-26,94.7,-1.8927,5.7213,Dee Gordon,543829,664871,-0.8637,1.1412,-1.6806,4.0445,,553882.0,592387.0,1,11,Bot
6,2019-04-26,84.5,-2.0504,5.6017,Dee Gordon,543829,664871,-1.2474,0.8703,-1.3326,3.5872,,553882.0,592387.0,1,11,Bot
7,2019-04-26,95.4,-1.9251,5.6904,Dee Gordon,543829,664871,-0.5815,1.2713,-1.4566,3.6794,,553882.0,592387.0,1,11,Bot
8,2019-04-26,92.5,-2.1065,5.6755,Dee Gordon,543829,664871,-1.1648,1.0778,-1.0048,2.6523,,553882.0,592387.0,1,11,Bot
9,2019-04-26,94.3,-1.9588,5.7363,Dee Gordon,543829,664871,-0.9079,1.2382,-1.3828,3.618,,553882.0,592387.0,1,11,Bot


## Goal
### How DFS Works
In many cases (and traditionally), the variability of outcomes in baseball is such that projections regarding the performance of a pitcher or batter require hundreds of events in order to reach some level of confidence.  Because of this, most effort in projecting the performance of a particular hitter attempts to do so over a full season.

However, in Daily Fantasy Sports (DFS), the goal is to choose an optimal lineup for a particular day. Typically a contestant chooses a lineup (one first baseman, one second baseman, two pitchers, three outfielders, etc.), drawing from a pool that includes every single player in Major League Baseball (MLB). The stats (e.g., hits, home runs, strikeouts) accumulated __that day__ by the selected lineup are summed according to the rules of the DFS game, and the lineup with the most points wins. In addition, each player in the pool is assigned a 'salary' by the DFS game (with better players having higher salaries). The total salary of the contestant's selected team cannot exceed some limit. This prevents contestants from just choosing all of the best players.

So, given this, the problem is: navigate the 'salary' constraints and construct the lineup likely to score the most points that day.  
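
To make the constraint concrete, here is a minimal sketch of the selection problem as a brute-force search. The player pool, salaries, projections, roster size, and cap are all made up for illustration, and the sketch ignores positional requirements (one first baseman, two pitchers, and so on), which a real DFS lineup would also have to satisfy.

```python
from itertools import combinations

# Hypothetical player pool: (name, salary, projected_points). Not real data.
POOL = [
    ("A", 5500, 11.0),
    ("B", 4800, 9.5),
    ("C", 4200, 9.0),
    ("D", 3900, 7.0),
    ("E", 3000, 6.5),
    ("F", 2500, 4.0),
]

def best_lineup(pool, roster_size, salary_cap):
    """Exhaustively search for the highest-projected lineup under the cap."""
    best, best_points = None, float("-inf")
    for lineup in combinations(pool, roster_size):
        salary = sum(player[1] for player in lineup)
        points = sum(player[2] for player in lineup)
        if salary <= salary_cap and points > best_points:
            best, best_points = lineup, points
    return best, best_points

lineup, points = best_lineup(POOL, roster_size=3, salary_cap=12000)
print([player[0] for player in lineup], points)
```

Exhaustive search only works for a toy pool like this one; with a full MLB slate the number of combinations is far too large, so a real solution would likely need an integer-programming or heuristic approach.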

### How to Choose a Good Team
Multiple factors go into choosing a good team.  There are a handful that I ultimately want to consider:

* The matchup. If the best batter in the league is facing the best pitcher in the league while the second-best batter is facing the worst pitcher, it seems obvious that the second-best batter is the superior selection.
* Non-game factors: weather, umpire, and game start time. It is proven that umpires call different strike zones, which affects the game. Batted balls with the same launch angle and velocity fly farther when it is warm and humid. Batters swing more frequently on Sundays.
* Variability. The most popular formats with the largest prizes have tens of thousands of contestants. The goal isn't to find the lineup with the highest median point score; the goal is to pick the lineup that is most likely to have a bunch of outliers so that you finish first and win the big prize.


### My Goal
Ok, so a lot of this is vague and beyond me.  But I'm starting somewhere, and I decided to start with matchups. My goal is to try to exploit "pitcher-vs-batter" or "PvB" stats.  In most cases, these stats are not statistically significant because the two players have only faced each other a handful of times. Knowing that pitcher A has struck out batter B in their two previous matchups really isn't valuable. Most contestants disregard these statistics.

My hypothesis is that if I can generate a set of "pitcher archetypes", then I can build "pitcher-archetype-vs-batter" statistics that might have some actual predictive value.  This is all based on a theory that, given enough data points, a batter would have the same wOBA against all pitchers belonging to a single "pitcher archetype".

## Work So Far
Ok, none of this is "operationalized". I'm just messing around. You'll also notice that I've only used data from 2019 even though I have three years of data.

### The `pitchertest` Notebook
The first thing that I did is build a matrix where each row is a pitcher and the columns contain:
* the pitcher's id
* the handedness
* and then for each of 9 different pitches (e.g., Two-Seam Fastball, Curveball) there are 17 different attributes (e.g., release point, velocity, acceleration in each of x, y, and z dimensions). These values have all been standardized.
* and also for each of the 9 different pitches, the percentage of times that the pitcher throws that pitch (pct_usage).

A sample is here:

In [82]:
df = _df_from_s3(key='df_rescaled_pitch_info.csv', prefix='pitchtype')
df.iloc[1:10, range(0, 20)]

Unnamed: 0,pitcher,p_throws,2f_ax,2f_ay,2f_az,2f_pct_usage,2f_pfx_x,2f_pfx_z,2f_plate_x,2f_plate_z,2f_release_extension,2f_release_pos_x,2f_release_pos_z,2f_release_speed,2f_release_spin_rate,2f_sz_bot,2f_sz_top,2f_vx0,2f_vy0,2f_vz0
1,407845,R,-0.545069,-0.007018,0.675023,0.632901,-0.529787,0.626222,-0.13174,0.097922,-0.231211,-0.653028,0.473424,0.266002,-0.961867,-0.443407,-0.37418,0.66785,-0.254029,-0.569486
2,424144,L,,,,0.0,,,,,,,,,,,,,,
3,425772,R,,,,0.0,,,,,,,,,,,,,,
4,425794,R,,,,0.0,,,,,,,,,,,,,,
5,425844,R,-0.422223,-0.632582,0.263444,0.07139,-0.511866,0.444389,-0.557522,-0.51321,-0.511994,-0.302382,0.962265,-0.797717,0.322804,-0.064249,-0.075216,0.272078,0.792688,-0.81178
6,429719,R,-0.503285,0.181685,0.41205,0.187117,-0.504599,0.270413,-1.241893,-0.832632,0.168175,-0.527747,0.173925,0.315547,-0.525451,0.767188,-0.054382,0.34003,-0.327743,-0.851837
7,429722,R,-0.084525,-0.783462,0.536415,0.145658,-0.142044,0.862211,0.162699,0.897642,-0.540207,-0.07488,1.119819,-0.902993,-1.057198,-0.251112,0.614662,0.114335,0.881266,-0.192906
8,430935,L,1.654101,-0.925863,-0.136663,0.161025,1.64574,-0.101097,0.968342,-0.556205,0.352573,1.869637,0.841727,-0.558367,-0.997691,-0.053925,-0.675213,-1.830028,0.571584,-0.706534
9,431145,R,-0.391609,-1.679932,-0.034612,0.031746,-0.546508,1.132442,1.901356,1.848738,-2.219725,-0.97932,-1.012337,-2.595667,0.367462,-4.992289,-4.981752,1.139804,2.67467,2.295489


In [73]:
df.iloc[:, 2:].describe()

Unnamed: 0,2f_ax,2f_ay,2f_az,2f_pct_usage,2f_pfx_x,2f_pfx_z,2f_plate_x,2f_plate_z,2f_release_extension,2f_release_pos_x,...,sl_release_extension,sl_release_pos_x,sl_release_pos_z,sl_release_speed,sl_release_spin_rate,sl_sz_bot,sl_sz_top,sl_vx0,sl_vy0,sl_vz0
count,351.0,351.0,351.0,830.0,351.0,351.0,351.0,351.0,351.0,351.0,...,615.0,615.0,615.0,615.0,615.0,615.0,615.0,615.0,615.0,615.0
mean,-2.426043e-16,-1.28071e-15,-4.516805e-15,0.101683,3.1630290000000005e-17,1.436806e-15,-2.055969e-16,2.457673e-15,3.647921e-15,3.140887e-16,...,1.650622e-15,-1.473076e-16,-1.240099e-15,-1.412059e-14,4.907005e-15,3.056272e-15,-4.836493e-14,-2.379307e-16,2.000568e-15,5.153962e-17
std,1.0,1.0,1.0,0.192076,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-1.254903,-5.25992,-4.31839,0.0,-1.254063,-4.587505,-3.712563,-4.761365,-3.444183,-1.453906,...,-3.816871,-2.078393,-7.737255,-7.878803,-4.705402,-6.633944,-7.087466,-2.584539,-2.221391,-2.868824
25%,-0.6606724,-0.5885088,-0.6818167,0.0,-0.658729,-0.6536657,-0.6609,-0.5132586,-0.5958885,-0.6569986,...,-0.6457735,-0.6827021,-0.4689018,-0.4304528,-0.5680838,-0.3391919,-0.4211584,-0.4896648,-0.6320932,-0.6329724
50%,-0.502204,-0.01518246,0.05582217,0.0,-0.5091856,0.07233543,-0.1040908,0.001465171,0.05013719,-0.3825327,...,-0.04417682,-0.4122334,0.1059336,0.1026844,-0.03246931,0.06008126,0.02436256,0.3546451,-0.1073305,-0.04686597
75%,0.678421,0.6367385,0.7692587,0.108521,0.6818494,0.7595312,0.5644856,0.5421363,0.6074803,0.6178869,...,0.6506722,0.4569475,0.6197562,0.6305354,0.5489975,0.430176,0.374267,0.7246888,0.4207579,0.5046594
max,2.143318,3.703675,2.101626,1.0,2.139596,2.065287,4.806255,3.45214,2.655701,2.228879,...,3.210598,2.799978,2.079095,2.190743,3.379572,6.188519,5.852005,2.02395,7.876003,5.548234
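
For reference, a matrix with this shape could be built roughly as follows. This is a sketch under assumptions, not the code from the `pitchertest` notebook: the toy data is made up, only one attribute is used, the `pitch_type` codes and the lowercase column prefixes are my own naming, and I assume each attribute is standardized within a pitch type across pitchers (which would explain the mean of about 0 and standard deviation of 1 above).

```python
import pandas as pd

# Toy pitch-level data; values are made up for illustration only.
pitches = pd.DataFrame({
    "pitcher":       [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "pitch_type":    ["FF", "FF", "SL", "FF", "FF", "FF", "SL", "SL", "SL"],
    "release_speed": [95.0, 96.0, 85.0, 91.0, 92.0, 93.0, 88.0, 87.0, 89.0],
})

def build_pitcher_matrix(pitches, attrs):
    # Mean of each attribute per (pitcher, pitch type).
    means = pitches.groupby(["pitcher", "pitch_type"])[attrs].mean()
    # Standardize each attribute within a pitch type, across pitchers.
    by_type = means.groupby(level="pitch_type")
    z = (means - by_type.transform("mean")) / by_type.transform("std")
    # Share of each pitcher's pitches that are of each type.
    counts = pitches.groupby(["pitcher", "pitch_type"]).size()
    z["pct_usage"] = counts / counts.groupby(level="pitcher").transform("sum")
    # One row per pitcher, columns named like "ff_release_speed".
    wide = z.unstack("pitch_type")
    wide.columns = [f"{ptype.lower()}_{attr}" for attr, ptype in wide.columns]
    # A pitch the pitcher never throws gets 0 usage; its attributes stay NaN.
    usage_cols = [c for c in wide.columns if c.endswith("_pct_usage")]
    wide[usage_cols] = wide[usage_cols].fillna(0.0)
    return wide.sort_index(axis=1)

matrix = build_pitcher_matrix(pitches, attrs=["release_speed"])
print(matrix)
```

Pitcher 2 in the toy data never throws a slider, so that row ends up with `sl_pct_usage` of 0.0 and a missing `sl_release_speed`, the same pattern as the empty rows in the sample above.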


### The `kmeans_pitch_type` notebook
The goal of this notebook is to determine "pitcher archetypes" that I can use to make pitcher-vs-batter statistics more statistically significant. The `n` value will increase because a batter's plate appearances against every pitcher in an archetype get pooled into a single "pitcher-archetype-vs-batter" statistic.

So, in this notebook, in order to come up with pitcher archetypes, I basically just:
* picked k=10
* built a __super-naive__ model with the k-means algorithm that considers every feature
* deployed it
* made some predictions via a SageMaker endpoint

So, what I've done is build a dataframe that contains every pitcher and their "cluster" membership. Cluster membership is pitcher archetype membership. Here's an example set of results for one job.

In [75]:
jobname='pitch-test-kmeans-20200427011209'
df = _df_from_s3(key='pitcher_cluster_membership.csv', prefix=f'pitchtype/{jobname}')
df.iloc[:, 1:].head()

Unnamed: 0,pitcher,cluster
0,282332,2.0
1,407845,5.0
2,424144,2.0
3,425772,9.0
4,425794,0.0
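
The notebook itself trains and hosts the model on SageMaker. For quick local experiments, the same idea can be sketched with a plain NumPy implementation of Lloyd's k-means; this is a stand-in for illustration, not the SageMaker algorithm. The toy matrix is made up, and filling missing standardized attributes with 0 (the column mean) is my assumption about how to handle pitchers who don't throw a given pitch.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every row to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centroids

# Toy stand-in for the standardized pitcher matrix (made-up values).
X = np.array([
    [-1.0, np.nan], [-1.1, -0.9], [-0.9, -1.0],
    [1.0, 1.1], [1.2, np.nan], [0.9, 1.0],
])
# Assumption: treat a missing standardized attribute as the mean (0).
X_filled = np.nan_to_num(X, nan=0.0)
labels, centroids = kmeans(X_filled, k=2)
print(labels)
```

Which cluster gets which number depends on the random initialization, so the cluster ids are only meaningful within a single training run.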


### The `pitcher_batter_woba_clustering` notebook
In order to simplify the eventual calculations, I wanted to generate a single number that represented how well a batter has done against a pitcher. I used wOBA. It stands for "weighted on base average", but basically it is just a generally accepted, decent representation of a batter's overall offensive production. The higher the number, the better the batter has done.

wOBA is a rate statistic, essentially like a batting average, only giving more weight to a home run than to a single.
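
As a rough illustration of the "weighted rate" idea, here is a minimal sketch. The weights below are illustrative assumptions (real wOBA weights are recalculated every season), the event names are my own, and the denominator is simplified to plate appearances; the usual published formula uses AB + BB - IBB + SF + HBP.

```python
# Illustrative linear weights (assumption): look up the real values for
# the season and source you are working with.
WOBA_WEIGHTS = {
    "walk": 0.69, "hit_by_pitch": 0.72, "single": 0.89,
    "double": 1.27, "triple": 1.62, "home_run": 2.10,
}

def woba(event_counts, plate_appearances, weights=WOBA_WEIGHTS):
    """Weighted sum of on-base events divided by plate appearances.

    Events missing from `weights` (e.g. strikeouts) contribute 0.
    """
    if plate_appearances == 0:
        return float("nan")
    total = sum(weights.get(event, 0.0) * n for event, n in event_counts.items())
    return total / plate_appearances

# A batter with one home run and one strikeout in two plate appearances.
print(woba({"home_run": 1, "strikeout": 1}, plate_appearances=2))
```

With only two or three plate appearances, a single home run swings the value enormously, which is why the raw pitcher-vs-batter numbers are so noisy.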

Here's what I've got.  Essentially, for every single pitcher-batter tuple, I've got the batter's wOBA and the number of events (plate appearances featuring the two) that occurred.

In [78]:
df = _df_from_s3(key='df_pitcher_batter_woba.csv')
df.head()

Unnamed: 0,pitcher,batter,woba,event_count
0,282332,405395,0.534,5.0
1,282332,425844,0.0,2.0
2,282332,429665,0.526667,3.0
3,282332,430945,0.23,3.0
4,282332,443558,1.05,2.0
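
Combining this table with the cluster membership table gives the "pitcher-archetype-vs-batter" numbers. Here is a minimal sketch of that join, assuming the column names shown in the two samples above; the toy values are made up, and weighting each matchup by `event_count` is my choice so that a 2-event matchup doesn't count as much as a 6-event one.

```python
import pandas as pd

# Toy frames shaped like the two CSVs shown above; values are made up.
woba_df = pd.DataFrame({
    "pitcher":     [10, 11, 12, 10, 12],
    "batter":      [100, 100, 100, 200, 200],
    "woba":        [0.300, 0.500, 0.900, 0.000, 0.450],
    "event_count": [4, 6, 2, 3, 5],
})
clusters = pd.DataFrame({"pitcher": [10, 11, 12], "cluster": [0, 0, 1]})

def archetype_woba(woba_df, clusters):
    """Event-weighted wOBA for every (batter, pitcher archetype) pair."""
    merged = woba_df.merge(clusters, on="pitcher", how="inner")
    # Weight each matchup's wOBA by its number of plate appearances.
    merged["weighted_sum"] = merged["woba"] * merged["event_count"]
    grouped = merged.groupby(["batter", "cluster"], as_index=False).agg(
        weighted_sum=("weighted_sum", "sum"),
        event_count=("event_count", "sum"),
    )
    grouped["archetype_woba"] = grouped["weighted_sum"] / grouped["event_count"]
    return grouped[["batter", "cluster", "archetype_woba", "event_count"]]

result = archetype_woba(woba_df, clusters)
print(result)
```

In the toy data, batter 100 faced two pitchers from cluster 0, so those 4 and 6 events pool into a single 10-event statistic.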


## Next Steps
Ok, this is where I desperately need help.

In a perfect world, a batter should perform similarly against pitchers of the same archetype (i.e., belonging to the same cluster as generated by my k-means clustering).

So, for each batter, I was planning on calculating their wOBA against each pitcher archetype. Then I would generate some sort of RMSE that compared for each batter:
* the calculated wOBA versus each pitcher that was part of that "pitcher archetype" or "cluster"
* as compared to the single wOBA value versus all pitchers of that particular archetype
* and then somehow combine all those RMSEs together to make a bigger RMSE?
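
One way the RMSE idea above could be sketched is below. The column names follow the samples earlier in this notebook and the toy values are made up. Two choices here are my assumptions rather than anything settled: errors are weighted by `event_count`, and the "bigger RMSE" is computed by pooling squared errors across all matchups instead of averaging the per-batter RMSEs.

```python
import numpy as np
import pandas as pd

def archetype_rmse(woba_df, clusters):
    """RMSE between per-pitcher wOBA and the batter's archetype-level wOBA."""
    m = woba_df.merge(clusters, on="pitcher", how="inner")
    m["weighted"] = m["woba"] * m["event_count"]
    grp = m.groupby(["batter", "cluster"])
    # Event-weighted wOBA for each (batter, archetype), broadcast to rows.
    m["archetype_woba"] = (
        grp["weighted"].transform("sum") / grp["event_count"].transform("sum")
    )
    # Squared error of each matchup, weighted by its plate appearances.
    m["weighted_sq_err"] = (m["woba"] - m["archetype_woba"]) ** 2 * m["event_count"]
    overall = np.sqrt(m["weighted_sq_err"].sum() / m["event_count"].sum())
    by_batter = m.groupby("batter")
    per_batter = np.sqrt(
        by_batter["weighted_sq_err"].sum() / by_batter["event_count"].sum()
    )
    return overall, per_batter

# Toy data (made up): batter 100 faced two pitchers from cluster 0.
woba_df = pd.DataFrame({
    "pitcher":     [10, 11, 12, 10, 12],
    "batter":      [100, 100, 100, 200, 200],
    "woba":        [0.300, 0.500, 0.900, 0.000, 0.450],
    "event_count": [4, 6, 2, 3, 5],
})
clusters = pd.DataFrame({"pitcher": [10, 11, 12], "cluster": [0, 0, 1]})
overall, per_batter = archetype_rmse(woba_df, clusters)
print(overall)
print(per_batter)
```

One caveat with this metric: a batter who has faced only one pitcher in a cluster contributes zero error by construction, so the in-sample RMSE tends to shrink as k grows whether or not the archetypes are meaningful.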

So, I could do that, but what would I do next?  My naive next steps were:
* Repeatedly re-run k-means with different k values and different feature selections.
* Try to minimize the RMSE 
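
The naive loop in those two bullets could be organized something like this. It is only a skeleton: `fit_clusters` and `score` are hypothetical callables that the caller supplies (a k-means training run and the RMSE calculation, respectively), and the stand-in functions at the bottom exist only so the sketch runs.

```python
from itertools import combinations

def grid_search(features, k_values, fit_clusters, score, min_features=1):
    """Try every (k, feature subset) pair and return results sorted by score.

    fit_clusters(k, feature_subset) -> cluster assignment (any object)
    score(assignment) -> float, lower is better (e.g. the RMSE)
    Both callables are placeholders supplied by the caller.
    """
    results = []
    for k in k_values:
        for size in range(min_features, len(features) + 1):
            for subset in combinations(features, size):
                assignment = fit_clusters(k, subset)
                results.append((score(assignment), k, subset))
    return sorted(results, key=lambda r: r[0])

# Stand-in callables so the sketch runs; replace them with a real
# k-means training job and the RMSE calculation.
def fake_fit(k, subset):
    return {"k": k, "n_features": len(subset)}

def fake_score(assignment):
    return abs(assignment["k"] - 8) + 0.1 * assignment["n_features"]

ranked = grid_search(
    ["release_speed", "pfx_x", "pfx_z"], [5, 8, 10], fake_fit, fake_score
)
best_score, best_k, best_subset = ranked[0]
print(best_k, best_subset, best_score)
```

Trying every subset is only feasible for a handful of features, since the number of subsets doubles with each feature added; with 9 pitches times 18 columns each, the search would have to be restricted to groups of features (for example, all columns for one pitch type at a time).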

Is there something better to do? Or a systematic or algorithmic way to cycle through the various k-means runs?

