**How to identify position players knowing the opposing pitcher**

Problem: In daily fantasy baseball, a participant sets a lineup against another participant in their league. Some position players are stars and should play nearly every day, but often a position is shared among several players on the roster. How do you choose which platoon player to start, especially when the batter has only a small sample against that pitcher?

Strategy: If you know the opposing pitcher, use statistical data, including Statcast data, to predict how well each batter will do against that pitcher, and use the prediction to guide the decision. If the sample for a particular batter-pitcher matchup is small, use similar players as a guide. (Similar players can be found by using an Erdos score.)

- How do I know whether my strategy improves on other models? Compare its picks with the player one would choose from a simple player ranking.

- Validation: historical data
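
The small-sample fallback in the strategy could be sketched roughly as follows. This is only an illustration: the function name `matchup_average`, the `min_ab` threshold of 20 at-bats, and the plain pooling of similar batters are assumptions made here, not a method this notebook has settled on.

```python
def matchup_average(direct_hits, direct_ab, similar_hits, similar_ab, min_ab=20):
    """Return (batting average, source) for a batter-pitcher matchup.

    Uses the direct head-to-head record when it has at least `min_ab`
    at-bats; otherwise pools in the record of similar batters against
    the same pitcher. `min_ab` is a hypothetical threshold.
    """
    if direct_ab >= min_ab:
        return direct_hits / direct_ab, "direct"
    pooled_ab = direct_ab + similar_ab
    if pooled_ab == 0:
        return None, "no data"
    return (direct_hits + similar_hits) / pooled_ab, "pooled"


# Six at-bats is too few to trust on its own, so similar batters are pooled in.
# The counts here are made-up numbers for illustration.
estimate, source = matchup_average(3, 6, similar_hits=27, similar_ab=114)
print(round(estimate, 3), source)
```

A weighted blend (for example, shrinking the direct average toward the pooled one) would be a natural refinement, but the hard threshold keeps the idea easy to see.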

This notebook was created within my "insight" virtual environment.

*Input:* The roster of two opponents
<br>
*Output:* Player recommendation for each position with a predicted score

Order in which to incorporate features into the product:
1. Roster’s “basic” baseball statistics from the last 50 games (including the previous season); this is the minimum viable product level
2. Specific matchup against that pitcher, using “basic” baseball statistics
3. Roster’s Statcast statistics from the last 50 games (including the previous season)
4. Use of player similarity
5. Weather
6. Limit to more recent performance (last 7 games)
7. Computer vision?

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

from termcolor import colored

# Web/database stuff
import urllib.request
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# Multiprocessing/threading
import multiprocess
import threading
from threading import Thread

In [3]:
# Web/database stuff (beyond what was imported in the first cell)
import sqlalchemy
import sqlalchemy_utils
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2

In [4]:
# Check versioning
print('numpy: ', np.__version__)
print('pandas: ', pd.__version__)
print('matplotlib: ', matplotlib.__version__)
print('seaborn: ', sns.__version__)
print('sklearn: ', sklearn.__version__)

print('psycopg2: ', psycopg2.__version__)
print('sqlalchemy: ', sqlalchemy.__version__)
print('sqlalchemy_utils: ', sqlalchemy_utils.__version__)
print('multiprocess: ', multiprocess.__version__)

numpy:  1.17.4
pandas:  0.25.3
matplotlib:  3.1.1
seaborn:  0.9.0
sklearn:  0.22
psycopg2:  2.8.4 (dt dec pq3 ext lo64)
sqlalchemy:  1.3.11
sqlalchemy_utils:  0.36.1
multiprocess:  0.70.9


In [5]:
from pybaseball import pitching_stats
from pybaseball import batting_stats
from pybaseball import statcast_pitcher
from pybaseball import playerid_lookup
from pybaseball import statcast

# Build a database

In [6]:
!conda env list

# conda environments:
#
base                     /Users/lacar/anaconda
insight               *  /Users/lacar/anaconda/envs/insight



In [6]:
# Define a database name 
# Set your postgres username
dbname = 'baseball'
username = 'lacar' # change this to your username

In [7]:
## 'engine' is a connection to a database
## Here, we're using postgres, but sqlalchemy can connect to other things too.
engine = create_engine('postgres://%s@localhost/%s'%(username,dbname))
print(engine.url)

postgres://lacar@localhost/baseball


In [8]:
## create a database (if it doesn't exist)
if not database_exists(engine.url):
    create_database(engine.url)
print(database_exists(engine.url))

True


In [17]:
# Get statcast data and put into database
# df_sc1 = statcast('2019-04-01', '2019-04-01', team='SD')
# df_sc1.shape

## Statcast data to database

In [9]:
date_list = [('2019-03-20', '2019-09-29'),
             ('2017-04-02', '2017-10-01'),
             ('2018-03-29', '2018-10-01')]

# The first date range replaces any existing table; later ranges append to it
for i, date_pair in enumerate(date_list):
    mode = 'replace' if i == 0 else 'append'
    df_sc = statcast(date_pair[0], date_pair[1])
    df_sc.to_sql('statcast', engine, if_exists=mode)
    print(date_pair, mode, 'mode')
    

This is a large query, it may take a moment to complete
Completed sub-query from 2019-03-20 to 2019-03-25
Completed sub-query from 2019-03-26 to 2019-03-31
Completed sub-query from 2019-04-01 to 2019-04-06
Completed sub-query from 2019-04-07 to 2019-04-12
Completed sub-query from 2019-04-13 to 2019-04-18
Completed sub-query from 2019-04-19 to 2019-04-24
Completed sub-query from 2019-04-25 to 2019-04-30
Completed sub-query from 2019-05-01 to 2019-05-06
Completed sub-query from 2019-05-07 to 2019-05-12
Completed sub-query from 2019-05-13 to 2019-05-18
Completed sub-query from 2019-05-19 to 2019-05-24
Completed sub-query from 2019-05-25 to 2019-05-30
Completed sub-query from 2019-05-31 to 2019-06-05
Completed sub-query from 2019-06-06 to 2019-06-11
Completed sub-query from 2019-06-12 to 2019-06-17
Completed sub-query from 2019-06-18 to 2019-06-23
Completed sub-query from 2019-06-24 to 2019-06-29
Completed sub-query from 2019-06-30 to 2019-07-05
Completed sub-query from 2019-07-06 to 2019-

The load above gave **2174906** rows in the database.

In [47]:
# date_list = [('2019-06-01', '2019-06-02'),
#              ('2019-06-03', '2019-09-29'),
#              ('2017-04-02', '2017-10-01'),
#              ('2018-03-29', '2018-10-01')]

# for i, date_pair in enumerate(date_list):
#     df_sc = statcast(date_pair[0], date_pair[1])
#     df_sc.to_sql('statcast', engine, if_exists='append')
#     print(date, 'append mode')

2019-04-03 append mode
This is a large query, it may take a moment to complete
Completed sub-query from 2019-06-03 to 2019-06-08
Completed sub-query from 2019-06-09 to 2019-06-14
Completed sub-query from 2019-06-15 to 2019-06-20
Completed sub-query from 2019-06-21 to 2019-06-26
Completed sub-query from 2019-06-27 to 2019-07-02
Completed sub-query from 2019-07-03 to 2019-07-08
Completed sub-query from 2019-07-09 to 2019-07-14
Completed sub-query from 2019-07-15 to 2019-07-20
Completed sub-query from 2019-07-21 to 2019-07-26
Completed sub-query from 2019-07-27 to 2019-08-01
Completed sub-query from 2019-08-02 to 2019-08-07
Completed sub-query from 2019-08-08 to 2019-08-13
Completed sub-query from 2019-08-14 to 2019-08-19
Completed sub-query from 2019-08-20 to 2019-08-25
Completed sub-query from 2019-08-26 to 2019-08-31
Completed sub-query from 2019-09-01 to 2019-09-06
Completed sub-query from 2019-09-07 to 2019-09-12
Completed sub-query from 2019-09-13 to 2019-09-18
Completed sub-query f

KeyboardInterrupt: 

## Playerid data to database

In [15]:
## Working with PostgreSQL in Python
# Connect to make queries using psycopg2
con = None
con = psycopg2.connect(database = dbname, user = username)


In [38]:
# Make a query to get all unique players

# pitchers and batters
sql_query = """
(SELECT DISTINCT pitcher FROM statcast)
UNION
(SELECT DISTINCT batter FROM statcast)
;
"""

# Note: the parentheses around each SELECT keep any ORDER BY or LIMIT clause
# scoped to that subquery instead of applying to the whole UNION

p_from_sql = pd.read_sql_query(sql_query,con)
p_list = p_from_sql.iloc[:, 0].tolist()
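
As a quick check on what the `UNION` query returns, here is a minimal sketch using an in-memory SQLite table as a stand-in for the PostgreSQL `statcast` table. The four rows are toy data, and `con_demo` is a name introduced only for this example. SQLite's compound-select syntax does not take the parentheses used in the PostgreSQL query, so they are omitted here.

```python
import sqlite3

# In-memory stand-in for the statcast table (toy rows, not real pitch data)
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE statcast (pitcher INTEGER, batter INTEGER)")
con_demo.executemany(
    "INSERT INTO statcast VALUES (?, ?)",
    [
        (477132, 592518),
        (477132, 592518),   # repeated pitcher-batter pair (one row per pitch)
        (543037, 592230),
        (543037, 477132),   # an id that appears as pitcher and as batter
    ],
)

# UNION (unlike UNION ALL) removes duplicates across both SELECTs,
# so a player who both pitched and batted is listed once.
rows = con_demo.execute(
    "SELECT DISTINCT pitcher FROM statcast "
    "UNION "
    "SELECT DISTINCT batter FROM statcast"
).fetchall()
player_ids = sorted(row[0] for row in rows)
print(player_ids)
```

Because `UNION` already deduplicates, the `DISTINCT` keywords are not strictly required; they are kept to mirror the query above.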

In [40]:
# Number of unique players
len(p_list)

1966

In [33]:
from pybaseball import playerid_reverse_lookup

In [42]:
# find the names of the players in player_ids, along with their ids from other data sources
df_pid = playerid_reverse_lookup(p_list, key_type='mlbam')

Gathering player lookup table. This may take a moment.


In [44]:
df_pid.head()

Unnamed: 0,name_last,name_first,key_mlbam,key_retro,key_bbref,key_fangraphs,mlb_played_first,mlb_played_last
0,abad,fernando,472551,abadf001,abadfe01,4994,2010.0,2019.0
1,abreu,bryan,650556,abreb002,abreubr01,16609,2019.0,2019.0
2,abreu,jose,547989,abrej003,abreujo02,15676,2014.0,2019.0
3,acuna,ronald,660670,acunr001,acunaro01,18401,2018.0,2019.0
4,adam,jason,592094,adamj002,adamja01,11861,2018.0,2019.0


In [45]:
# Lock down - comment out to avoid removing table

# df_pid.to_sql('player_id', engine, if_exists='replace')

In [46]:
## Working with PostgreSQL in Python

# Connect to make queries using psycopg2
# con = None
# con = psycopg2.connect(database = dbname, user = username)

# # query:
# sql_query = """
# SELECT * FROM statcast LIMIT 5;
# """

# pd.read_sql_query(sql_query,con)


## Import batting and pitching statistics (may need to limit to the year)

Send to SQL

In [47]:
df_batting_stats = batting_stats(2017, end_season=2019, league='all', qual=1, ind=1)

In [73]:
# Change '%' to '_perc' due to the error described here: https://github.com/pandas-dev/pandas/issues/11896
df_batting_stats.columns = df_batting_stats.columns.str.replace('%', '_perc', regex=False)
# Change parentheses in column names due to https://stackoverflow.com/questions/27833213/how-to-set-parenthesis-in-column-name-in-create-table-sql-query
# regex=False treats '(' and ')' as literal characters rather than regex syntax
df_batting_stats.columns = df_batting_stats.columns.str.replace('(', '_', regex=False)
df_batting_stats.columns = df_batting_stats.columns.str.replace(')', '', regex=False)
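
To see what the three replacements above do to a single column name, here is a small sketch in plain Python. The helper name `sanitize_column` is made up for this illustration; the sample names are FanGraphs-style columns like those that appear in the batting table.

```python
def sanitize_column(name):
    """Mirror the three replacements above with literal string substitution."""
    return name.replace("%", "_perc").replace("(", "_").replace(")", "")


for raw in ["O-Swing% (pi)", "wSL/C (pi)", "Pace (pi)"]:
    print(repr(raw), "->", repr(sanitize_column(raw)))
```

The space before the opening parenthesis is kept, which is why the cleaned names contain a space followed by `_pi`.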

In [50]:
df_pitching_stats = pitching_stats(2017, end_season=2019, league='all', qual=1, ind=1)

In [81]:
# Change '%' to '_perc' due to the error described here: https://github.com/pandas-dev/pandas/issues/11896
df_pitching_stats.columns = df_pitching_stats.columns.str.replace('%', '_perc', regex=False)
# Change parentheses in column names due to https://stackoverflow.com/questions/27833213/how-to-set-parenthesis-in-column-name-in-create-table-sql-query
# regex=False treats '(' and ')' as literal characters rather than regex syntax
df_pitching_stats.columns = df_pitching_stats.columns.str.replace('(', '_', regex=False)
df_pitching_stats.columns = df_pitching_stats.columns.str.replace(')', '', regex=False)

In [82]:
#df_pitching_stats.to_sql('pitching_stats', engine, if_exists='replace')

In [83]:
df_pitching_stats.head()

Unnamed: 0,Season,Name,Team,Age,W,L,ERA,WAR,G,GS,...,wSL/C _pi,wXX/C _pi,O-Swing_perc _pi,Z-Swing_perc _pi,Swing_perc _pi,O-Contact_perc _pi,Z-Contact_perc _pi,Contact_perc _pi,Zone_perc _pi,Pace _pi
131,2018.0,Jacob deGrom,Mets,30.0,10.0,9.0,1.7,9.0,32.0,32.0,...,2.25,,0.367,0.661,0.518,0.524,0.804,0.708,0.513,21.9
376,2017.0,Chris Sale,Red Sox,28.0,17.0,8.0,2.9,7.6,32.0,32.0,...,0.67,,0.373,0.619,0.499,0.54,0.798,0.704,0.513,20.9
257,2018.0,Max Scherzer,Nationals,33.0,18.0,7.0,2.53,7.5,33.0,33.0,...,1.85,,0.355,0.666,0.519,0.513,0.774,0.69,0.528,24.2
251,2019.0,Gerrit Cole,Astros,28.0,20.0,5.0,2.5,7.4,33.0,33.0,...,1.88,,0.34,0.646,0.498,0.481,0.751,0.662,0.516,22.9
195,2017.0,Corey Kluber,Indians,31.0,18.0,4.0,2.25,7.2,29.0,29.0,...,4.65,9.31,0.388,0.595,0.489,0.433,0.853,0.681,0.485,23.5


In [85]:
df_sc.head()

Unnamed: 0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
0,0,FC,2018-10-01,92.2,-1.969,6.2644,Kenley Jansen,467827.0,445276.0,strikeout,...,5.0,2.0,2.0,5.0,2.0,5.0,2.0,5.0,Standard,Standard
1,1,FC,2018-10-01,93.0,-1.7689,6.2976,Kenley Jansen,467827.0,445276.0,,...,5.0,2.0,2.0,5.0,2.0,5.0,2.0,5.0,Standard,Standard
2,2,FC,2018-10-01,91.6,-1.7451,6.2154,Kenley Jansen,467827.0,445276.0,,...,5.0,2.0,2.0,5.0,2.0,5.0,2.0,5.0,Standard,Standard
3,3,FF,2018-10-01,93.1,-1.425,6.1929,Kenley Jansen,467827.0,445276.0,,...,5.0,2.0,2.0,5.0,2.0,5.0,2.0,5.0,Standard,Standard
4,4,FC,2018-10-01,91.4,-1.9144,6.2641,Kenley Jansen,435622.0,445276.0,strikeout,...,5.0,2.0,2.0,5.0,2.0,5.0,2.0,5.0,Standard,Standard


In [86]:
df_sc.loc[:, df_sc.columns.str.contains('fielder_')]

Unnamed: 0,fielder_2,fielder_2.1,fielder_3,fielder_4,fielder_5,fielder_6,fielder_7,fielder_8,fielder_9
0,518735.0,518735.0,641355.0,571771.0,457759.0,592518.0,592626.0,621035.0,624577.0
1,518735.0,518735.0,641355.0,571771.0,457759.0,592518.0,592626.0,621035.0,624577.0
2,518735.0,518735.0,641355.0,571771.0,457759.0,592518.0,592626.0,621035.0,624577.0
3,518735.0,518735.0,641355.0,571771.0,457759.0,592518.0,592626.0,621035.0,624577.0
4,518735.0,518735.0,641355.0,571771.0,457759.0,592518.0,592626.0,621035.0,624577.0
...,...,...,...,...,...,...,...,...,...
721185,467092.0,467092.0,543068.0,621002.0,622110.0,588751.0,452655.0,595281.0,460576.0
721186,467092.0,467092.0,543068.0,621002.0,622110.0,588751.0,452655.0,595281.0,460576.0
721187,467092.0,467092.0,543068.0,621002.0,622110.0,588751.0,452655.0,595281.0,460576.0
721188,467092.0,467092.0,543068.0,621002.0,622110.0,588751.0,452655.0,595281.0,460576.0


# Inputs: Team rosters and opposing pitcher

In [111]:
# query style 1
sql_query = """
SELECT * FROM statcast LIMIT 5;
"""

pd.read_sql_query(sql_query,con)

Unnamed: 0,level_0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
0,3675,2635,FF,2019-09-29,93.6,1.965,5.6573,Eduardo Rodriguez,542340.0,593958.0,...,2.0,3.0,3.0,2.0,3.0,2.0,3.0,2.0,Standard,Standard
1,3676,2650,FF,2019-09-29,93.4,1.9252,5.7313,Eduardo Rodriguez,542340.0,593958.0,...,2.0,3.0,3.0,2.0,3.0,2.0,3.0,2.0,Standard,Standard
2,3677,2671,CH,2019-09-29,86.8,1.9107,5.8085,Eduardo Rodriguez,542340.0,593958.0,...,2.0,3.0,3.0,2.0,3.0,2.0,3.0,2.0,Standard,Standard
3,3678,2688,FF,2019-09-29,92.8,1.6768,5.9208,Eduardo Rodriguez,542340.0,593958.0,...,2.0,3.0,3.0,2.0,3.0,2.0,3.0,2.0,Standard,Standard
4,3679,2706,CU,2019-09-29,82.6,-1.7676,5.7952,Chandler Shepherd,502110.0,605469.0,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,Standard,Strategic


In [112]:
# query style 2
table = 'statcast'
sql_query = "SELECT * FROM " + table + " LIMIT 5;"

pd.read_sql_query(sql_query,con)

Unnamed: 0,level_0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
0,3675,2635,FF,2019-09-29,93.6,1.965,5.6573,Eduardo Rodriguez,542340.0,593958.0,...,2.0,3.0,3.0,2.0,3.0,2.0,3.0,2.0,Standard,Standard
1,3676,2650,FF,2019-09-29,93.4,1.9252,5.7313,Eduardo Rodriguez,542340.0,593958.0,...,2.0,3.0,3.0,2.0,3.0,2.0,3.0,2.0,Standard,Standard
2,3677,2671,CH,2019-09-29,86.8,1.9107,5.8085,Eduardo Rodriguez,542340.0,593958.0,...,2.0,3.0,3.0,2.0,3.0,2.0,3.0,2.0,Standard,Standard
3,3678,2688,FF,2019-09-29,92.8,1.6768,5.9208,Eduardo Rodriguez,542340.0,593958.0,...,2.0,3.0,3.0,2.0,3.0,2.0,3.0,2.0,Standard,Standard
4,3679,2706,CU,2019-09-29,82.6,-1.7676,5.7952,Chandler Shepherd,502110.0,605469.0,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,Standard,Strategic


In [88]:
df_pid.head()

Unnamed: 0,name_last,name_first,key_mlbam,key_retro,key_bbref,key_fangraphs,mlb_played_first,mlb_played_last
0,abad,fernando,472551,abadf001,abadfe01,4994,2010.0,2019.0
1,abreu,bryan,650556,abreb002,abreubr01,16609,2019.0,2019.0
2,abreu,jose,547989,abrej003,abreujo02,15676,2014.0,2019.0
3,acuna,ronald,660670,acunr001,acunaro01,18401,2018.0,2019.0
4,adam,jason,592094,adamj002,adamja01,11861,2018.0,2019.0


In [87]:
df_sc.head()

Unnamed: 0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
0,0,FC,2018-10-01,92.2,-1.969,6.2644,Kenley Jansen,467827.0,445276.0,strikeout,...,5.0,2.0,2.0,5.0,2.0,5.0,2.0,5.0,Standard,Standard
1,1,FC,2018-10-01,93.0,-1.7689,6.2976,Kenley Jansen,467827.0,445276.0,,...,5.0,2.0,2.0,5.0,2.0,5.0,2.0,5.0,Standard,Standard
2,2,FC,2018-10-01,91.6,-1.7451,6.2154,Kenley Jansen,467827.0,445276.0,,...,5.0,2.0,2.0,5.0,2.0,5.0,2.0,5.0,Standard,Standard
3,3,FF,2018-10-01,93.1,-1.425,6.1929,Kenley Jansen,467827.0,445276.0,,...,5.0,2.0,2.0,5.0,2.0,5.0,2.0,5.0,Standard,Standard
4,4,FC,2018-10-01,91.4,-1.9144,6.2641,Kenley Jansen,435622.0,445276.0,strikeout,...,5.0,2.0,2.0,5.0,2.0,5.0,2.0,5.0,Standard,Standard


In [99]:
for p_key in df_pid['key_mlbam'].iloc[:4]:
    print(p_key)

472551
650556
547989
660670


In [103]:
p_key = 592518

In [91]:
position_list = ['fielder_' + str(i+1) for i in range(9)]
position_list

['fielder_1',
 'fielder_2',
 'fielder_3',
 'fielder_4',
 'fielder_5',
 'fielder_6',
 'fielder_7',
 'fielder_8',
 'fielder_9']

In [110]:
df_sc['game_date'].unique()

array(['2018-10-01T00:00:00.000000000', '2018-09-30T00:00:00.000000000',
       '2018-09-29T00:00:00.000000000', '2018-09-28T00:00:00.000000000',
       '2018-09-27T00:00:00.000000000', '2018-09-26T00:00:00.000000000',
       '2018-09-25T00:00:00.000000000', '2018-09-24T00:00:00.000000000',
       '2018-09-23T00:00:00.000000000', '2018-09-22T00:00:00.000000000',
       '2018-09-21T00:00:00.000000000', '2018-09-20T00:00:00.000000000',
       '2018-09-19T00:00:00.000000000', '2018-09-18T00:00:00.000000000',
       '2018-09-17T00:00:00.000000000', '2018-09-16T00:00:00.000000000',
       '2018-09-15T00:00:00.000000000', '2018-09-14T00:00:00.000000000',
       '2018-09-13T00:00:00.000000000', '2018-09-12T00:00:00.000000000',
       '2018-09-11T00:00:00.000000000', '2018-09-10T00:00:00.000000000',
       '2018-09-09T00:00:00.000000000', '2018-09-08T00:00:00.000000000',
       '2018-09-07T00:00:00.000000000', '2018-09-06T00:00:00.000000000',
       '2018-09-05T00:00:00.000000000', '2018-09-04

In [104]:
for position in position_list:
    # Skip position columns that are missing from df_sc
    try:
        games_played = df_sc.loc[df_sc[position] == p_key, 'game_date'].nunique()
        print(position, games_played)
    except KeyError:
        continue

fielder_2 0
fielder_3 0
fielder_4 0
fielder_5 16
fielder_6 145
fielder_7 0
fielder_8 0
fielder_9 0


In [108]:
len(df_sc[df_sc['fielder_5']==p_key].loc[:, 'game_date'].unique())

16

In [None]:
## Determine each player's primary position by counting games played at each position
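
One way this step might look is sketched below, under the assumption that a player's primary position is the fielder column in which they appear on the most distinct game dates. The helper names `games_by_position` and `primary_position` and the toy DataFrame are made up for illustration.

```python
import pandas as pd


def games_by_position(df, player_id, position_cols):
    """Count distinct game dates on which player_id appears in each fielder column."""
    counts = {}
    for col in position_cols:
        if col not in df.columns:  # skip columns the data does not contain
            continue
        counts[col] = df.loc[df[col] == player_id, "game_date"].nunique()
    return counts


def primary_position(df, player_id, position_cols):
    """Return the column with the most games played, or None if the player never fielded."""
    counts = games_by_position(df, player_id, position_cols)
    played = {col: n for col, n in counts.items() if n > 0}
    return max(played, key=played.get) if played else None


# Toy pitch-level rows: the player is at fielder_5 in one game and fielder_6 in two
toy = pd.DataFrame({
    "game_date": ["2019-04-01", "2019-04-01", "2019-04-02", "2019-04-03"],
    "fielder_5": [592518, 592518, 111111, 111111],
    "fielder_6": [222222, 222222, 592518, 592518],
})
cols = ["fielder_" + str(i + 1) for i in range(9)]
print(games_by_position(toy, 592518, cols))
print(primary_position(toy, 592518, cols))
```

Ties are broken by column order here; a real version would probably want an explicit rule for players who split time evenly.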



In [None]:
# Randomly generate two teams

# Choose a roster by position and WAR 

In [12]:
# teamA = somedf  # need position, left, right and other stats... 
# teamB = somedf

opp_pitcher = 'Clayton Kershaw'

In [29]:
pitcher_info = playerid_lookup(opp_pitcher.split()[1], opp_pitcher.split()[0])
pitcher_info

Gathering player lookup table. This may take a moment.


Unnamed: 0,name_last,name_first,key_mlbam,key_retro,key_bbref,key_fangraphs,mlb_played_first,mlb_played_last
0,kershaw,clayton,477132,kersc001,kershcl01,2036,2008.0,2019.0


In [13]:
batter = playerid_lookup('Machado', 'Manny')
batter

Gathering player lookup table. This may take a moment.


Unnamed: 0,name_last,name_first,key_mlbam,key_retro,key_bbref,key_fangraphs,mlb_played_first,mlb_played_last
0,machado,manny,592518,machm001,machama01,11493,2012.0,2019.0


In [None]:
# Retrieve aggregate player statistics for 2016 through 2019
# (no_years seasons back from 2019, both end years included, so four seasons)
no_years = 3
df_batting = batting_stats(2019 - no_years, 2019, ind=0)

In [8]:
df_batting.head()

Unnamed: 0,Name,Team,Age,G,AB,PA,H,1B,2B,3B,...,wSL/C (pi),wXX/C (pi),O-Swing% (pi),Z-Swing% (pi),Swing% (pi),O-Contact% (pi),Z-Contact% (pi),Contact% (pi),Zone% (pi),Pace (pi)
34,Mike Trout,Angels,26.0,547.0,1892.0,2396.0,580.0,312.0,108.0,14.0,...,1.57,-3.95,0.21,0.565,0.378,0.687,0.883,0.825,0.473,23.5
59,Mookie Betts,Red Sox,25.0,597.0,2417.0,2762.0,736.0,428.0,175.0,17.0,...,0.98,-7.75,0.22,0.543,0.378,0.697,0.935,0.865,0.49,23.0
50,Christian Yelich,- - -,26.0,588.0,2243.0,2585.0,690.0,419.0,137.0,15.0,...,0.99,-70.25,0.266,0.626,0.429,0.554,0.885,0.773,0.453,24.2
64,Anthony Rendon,Nationals,28.0,585.0,2149.0,2495.0,643.0,365.0,167.0,8.0,...,0.02,-0.23,0.256,0.618,0.434,0.785,0.906,0.87,0.492,25.4
120,Francisco Lindor,Indians,24.0,618.0,2514.0,2806.0,713.0,428.0,156.0,11.0,...,0.47,5.02,0.326,0.651,0.48,0.727,0.917,0.849,0.473,22.8


## Get statcast data for the matchup

**Make a database**

In [17]:
# Statcast data
df_sc = statcast('2019-04-01', '2019-05-30', team='SD')

This is a large query, it may take a moment to complete
Completed sub-query from 2019-04-01 to 2019-04-06
Completed sub-query from 2019-04-07 to 2019-04-12
Completed sub-query from 2019-04-13 to 2019-04-18
Completed sub-query from 2019-04-19 to 2019-04-24
Completed sub-query from 2019-04-25 to 2019-04-30
Completed sub-query from 2019-05-01 to 2019-05-06
Completed sub-query from 2019-05-07 to 2019-05-12
Completed sub-query from 2019-05-13 to 2019-05-18
Completed sub-query from 2019-05-19 to 2019-05-24
Completed sub-query from 2019-05-25 to 2019-05-30


In [34]:
# Take the id from the first lookup row explicitly rather than calling int() on a Series
batter_id = int(batter['key_mlbam'].iloc[0])
df_sc_bat = df_sc[df_sc['batter'] == batter_id]

In [35]:
df_sc_bat.head()

Unnamed: 0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
11,3252,FF,2019-05-29,90.0,1.6173,5.2746,Nestor Cortes Jr.,592518.0,641482.0,walk,...,7.0,0.0,0.0,7.0,0.0,7.0,0.0,7.0,Standard,Standard
12,3269,SL,2019-05-29,81.4,1.9681,5.2551,Nestor Cortes Jr.,592518.0,641482.0,,...,7.0,0.0,0.0,7.0,0.0,7.0,0.0,7.0,Standard,Standard
13,3274,SL,2019-05-29,82.9,1.7562,5.3481,Nestor Cortes Jr.,592518.0,641482.0,,...,7.0,0.0,0.0,7.0,0.0,7.0,0.0,7.0,Standard,Standard
14,3290,FF,2019-05-29,90.0,1.462,5.3794,Nestor Cortes Jr.,592518.0,641482.0,,...,7.0,0.0,0.0,7.0,0.0,7.0,0.0,7.0,Standard,Standard
15,3296,FF,2019-05-29,90.5,1.8463,5.3143,Nestor Cortes Jr.,592518.0,641482.0,,...,7.0,0.0,0.0,7.0,0.0,7.0,0.0,7.0,Standard,Standard


In [37]:
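# Take the id from the first lookup row explicitly rather than calling int() on a Series
pitcher_id = int(pitcher_info['key_mlbam'].iloc[0])
df_sc_bat_pitch = df_sc_bat[df_sc_bat['pitcher'] == pitcher_id]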
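
Since the Statcast rows were also loaded into a database earlier, the same batter-pitcher filter could be pushed into SQL with bound parameters instead of string concatenation. The sketch below uses an in-memory SQLite table with toy rows as a stand-in; `con_demo` and `matchup_events` are names introduced here. SQLite uses `?` placeholders, while psycopg2 uses `%s`, and `pd.read_sql_query` accepts the values through its `params` argument.

```python
import sqlite3

# In-memory stand-in for the statcast table (toy rows for illustration only)
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE statcast (game_date TEXT, batter INTEGER, pitcher INTEGER, events TEXT)"
)
con_demo.executemany(
    "INSERT INTO statcast VALUES (?, ?, ?, ?)",
    [
        ("2019-05-14", 592518, 477132, "single"),
        ("2019-05-14", 592518, 477132, None),       # pitch that did not end the plate appearance
        ("2019-05-29", 592518, 641482, "walk"),     # same batter, different pitcher
        ("2019-05-14", 467827, 477132, "strikeout"),  # same pitcher, different batter
    ],
)


def matchup_events(con, batter_id, pitcher_id):
    """Return (game_date, events) rows that ended a plate appearance for one matchup."""
    query = """
        SELECT game_date, events
        FROM statcast
        WHERE batter = ? AND pitcher = ? AND events IS NOT NULL
        ORDER BY game_date
    """
    return con.execute(query, (batter_id, pitcher_id)).fetchall()


print(matchup_events(con_demo, 592518, 477132))
```

Filtering in the database avoids pulling every pitch into memory just to keep a handful of rows.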

## Basic year long matchup

In [48]:
df_batting.columns[df_batting.columns.str.contains('id')]

Index([], dtype='object')

## Basic results of matchup

In [44]:
df_sc_bat_pitch[df_sc_bat_pitch['events'].notna()]

Unnamed: 0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
4066,17337,SL,2019-05-14,86.0,1.7186,6.3372,Clayton Kershaw,592518.0,477132.0,single,...,6.0,3.0,3.0,6.0,3.0,6.0,3.0,6.0,Infield shift,Standard
4142,18400,FF,2019-05-14,89.7,1.1706,6.3895,Clayton Kershaw,592518.0,477132.0,home_run,...,4.0,1.0,1.0,4.0,1.0,4.0,1.0,4.0,Infield shift,Standard
4237,19733,SL,2019-05-14,87.0,1.6145,6.4089,Clayton Kershaw,592518.0,477132.0,strikeout,...,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,Infield shift,Standard
6712,14244,FF,2019-05-03,88.7,1.595,6.2012,Clayton Kershaw,592518.0,477132.0,field_out,...,3.0,2.0,3.0,2.0,2.0,3.0,3.0,2.0,Infield shift,Standard
6805,15534,FF,2019-05-03,91.8,1.3058,6.2756,Clayton Kershaw,592518.0,477132.0,field_out,...,3.0,0.0,3.0,0.0,0.0,3.0,3.0,0.0,Infield shift,Standard
6861,16322,SL,2019-05-03,88.0,1.2369,6.3348,Clayton Kershaw,592518.0,477132.0,home_run,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Standard,Standard


In [45]:
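# Caution: missing events are NaN, not the string 'NaN', so this comparison keeps every row
# (note the empty events in the output); use .notna() as in the previous cell to drop them
df_sc_bat_pitch[df_sc_bat_pitch['events']!='NaN'].head()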

Unnamed: 0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
4066,17337,SL,2019-05-14,86.0,1.7186,6.3372,Clayton Kershaw,592518.0,477132.0,single,...,6.0,3.0,3.0,6.0,3.0,6.0,3.0,6.0,Infield shift,Standard
4067,17351,FF,2019-05-14,89.3,1.5819,6.271,Clayton Kershaw,592518.0,477132.0,,...,6.0,3.0,3.0,6.0,3.0,6.0,3.0,6.0,Infield shift,Standard
4068,17367,SL,2019-05-14,86.5,1.7675,6.2919,Clayton Kershaw,592518.0,477132.0,,...,6.0,3.0,3.0,6.0,3.0,6.0,3.0,6.0,Infield shift,Standard
4142,18400,FF,2019-05-14,89.7,1.1706,6.3895,Clayton Kershaw,592518.0,477132.0,home_run,...,4.0,1.0,1.0,4.0,1.0,4.0,1.0,4.0,Infield shift,Standard
4143,18410,CU,2019-05-14,75.4,1.0129,6.493,Clayton Kershaw,592518.0,477132.0,,...,4.0,1.0,1.0,4.0,1.0,4.0,1.0,4.0,Infield shift,Standard


Get basic statistics - see who does better

Which batting metrics determine whether someone wins: HR, RBI, batting average?
Standard categories are AVG, HR, R, RBI, and SB. Other popular ones are OPS and SP%. Points leagues also count K’s and award points for each base (1 for a single, 2 for a double, etc.).
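
A points-league score for a matchup could be tallied from the Statcast `events` values along these lines. This is a sketch only: the per-base values follow the description above (with a home run assumed to be worth 4), while the strikeout penalty of -1 is a hypothetical default, since scoring rules vary by league.

```python
# Points per base, as described above; the home run value of 4 is an assumption
BASE_POINTS = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def fantasy_points(events, strikeout_points=-1):
    """Sum points over plate-appearance outcomes; strikeout_points is a league-specific assumption."""
    total = 0
    for event in events:
        if event == "strikeout":
            total += strikeout_points
        else:
            total += BASE_POINTS.get(event, 0)  # outs and other events score 0 here
    return total


# Plate-appearance outcomes from the Machado vs. Kershaw table above
events = ["single", "home_run", "strikeout", "field_out", "field_out", "home_run"]
print(fantasy_points(events))
print(fantasy_points(events, strikeout_points=0))
```

Walks, runs, RBI, and stolen bases are left out; they would need either more event types or other Statcast columns.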



In [33]:
df_sc_bat['pitcher'].unique()

array([], dtype=float64)

# Jupyter tricks

In [11]:
help(statcast)

Help on function statcast in module pybaseball.statcast:

statcast(start_dt=None, end_dt=None, team=None, verbose=True)
    Pulls statcast play-level data from Baseball Savant for a given date range.
    
    INPUTS:
    start_dt: YYYY-MM-DD : the first date for which you want statcast data
    end_dt: YYYY-MM-DD : the last date for which you want statcast data
    team: optional (defaults to None) : city abbreviation of the team you want data for (e.g. SEA or BOS)
    
    If no arguments are provided, this will return yesterday's statcast data. If one date is provided, it will return that date's statcast data.



# Evaluate pybaseball scraping serially

In [None]:
# Input

In [3]:
# Pitch result designation dictionary
# A similar question and approach here https://www.reddit.com/r/Sabermetrics/comments/e130el/how_to_calculate_whiff_rate/
# Note that bunts are included
contact_desc = (['pitchout_hit_into_play_score', 'hit_into_play_score', 'hit_into_play',
                 'hit_by_pitch', 'hit_into_play_no_out', 'pitchout_hit_into_play_no_out',
                 'pitchout_hit_into_play'])
foul_desc = ['foul_pitchout', 'foul_bunt', 'bunt_foul_tip', 'foul', 'foul_tip']
takeball_desc = ['intent_ball', 'blocked_ball', 'pitchout', 'ball']
takestrike_desc = ['called_strike']
unknownstrike_desc = ['unknown_strike']
whiff_desc = ['swinging_strike_blocked', 'swinging_strike', 'missed_bunt']

desc_dict = {'contact':contact_desc, 'foul':foul_desc, 'take_ball':takeball_desc,
             'take_strike':takestrike_desc, 'unknown_strike':unknownstrike_desc,
             'whiff':whiff_desc}

bat_stand_list = ['L', 'R']
zone_list = list(range(1, 13))
sw_types = list(desc_dict.keys())
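
As an illustration of how these categories feed a whiff rate, the sketch below divides whiffs by swings, treating a swing as any pitch labeled whiff, foul, or contact. The function name `whiff_rate`, the abbreviated `demo_dict`, and the sample pitch list are made up so that the example runs on its own.

```python
def whiff_rate(descriptions, desc_dict):
    """Whiffs divided by swings, where a swing is a whiff, a foul, or contact."""
    # Invert the category -> descriptions mapping for quick lookup
    lookup = {desc: cat for cat, descs in desc_dict.items() for desc in descs}
    categories = [lookup.get(desc) for desc in descriptions]
    swings = sum(cat in ("whiff", "foul", "contact") for cat in categories)
    whiffs = sum(cat == "whiff" for cat in categories)
    return whiffs / swings if swings else None


# Abbreviated copy of desc_dict so the sketch is self-contained
demo_dict = {
    "contact": ["hit_into_play"],
    "foul": ["foul", "foul_tip"],
    "take_ball": ["ball"],
    "take_strike": ["called_strike"],
    "whiff": ["swinging_strike", "swinging_strike_blocked"],
}
pitches = ["swinging_strike", "ball", "foul", "foul",
           "swinging_strike", "called_strike", "hit_into_play"]
print(whiff_rate(pitches, demo_dict))
```

Note that `contact_desc` above includes `hit_by_pitch`, which this definition would count as a swing; whether that is intended is worth checking before relying on the rate.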

In [13]:
def get_pitcher_pb_sc_data(pitcher):
    """Look up a pitcher by name; return (id table, 2019 Statcast pitches with swing labels)."""
    print(colored(pitcher, 'blue'))
    pitcher_for_lookup = pitcher.split()

    # Account for two-word first names such as Chi Chi Gonzalez, since the name is split on spaces
    if len(pitcher_for_lookup) > 2:
        p_id = playerid_lookup(pitcher_for_lookup[2], pitcher_for_lookup[0] + ' ' + pitcher_for_lookup[1])
    else:
        p_id = playerid_lookup(pitcher_for_lookup[1], pitcher_for_lookup[0])

    # Account for common names by keeping only players whose last MLB season is after 2008
    if p_id.shape[0] > 1:
        p_id = p_id[p_id['mlb_played_last'] > 2008]
    # Use the first remaining match explicitly rather than calling int() on a Series
    player_id = int(p_id['key_mlbam'].iloc[0])
    df_pitcher_sc = statcast_pitcher('2019-03-28', '2019-09-29', player_id=player_id)

    # Swing designation info: label each pitch with its category from desc_dict
    df_pitcher_sc['sw_type'] = None
    for key, value in desc_dict.items():
        df_pitcher_sc.loc[df_pitcher_sc['description'].isin(value), 'sw_type'] = key
    # sw_true flags pitches the batter swung at (whiff, contact, or foul)
    df_pitcher_sc['sw_true'] = 0
    df_pitcher_sc.loc[df_pitcher_sc['sw_type'].isin(['whiff', 'contact', 'foul']), 'sw_true'] = 1

    return p_id, df_pitcher_sc

In [5]:
# Example
pitcher = 'Gerrit Cole'
# The function returns a tuple: (player id table, Statcast DataFrame)
p_id_cole, df_pitcher_sc_cole = get_pitcher_pb_sc_data(pitcher)

Gerrit Cole
Gathering player lookup table. This may take a moment.
Gathering Player Data


In [18]:
%%timeit

pitcher_list = (['Gerrit Cole',
                 'Justin Verlander',
                 'Caleb Smith',
                 'Chris Paddack',
                 'Reynaldo Lopez',
                 'Robbie Ray',
                 'Zach Eflin',
                 'Dylan Bundy',
                 'Zach Plesac',
                 'Merrill Kelly'])

92.4 ns ± 6.55 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [6]:

# Make a list of 10 pitchers as a test set
pitcher_list = (['Gerrit Cole',
                 'Justin Verlander',
                 'Caleb Smith',
                 'Chris Paddack',
                 'Reynaldo Lopez',
                 'Robbie Ray',
                 'Zach Eflin',
                 'Dylan Bundy',
                 'Zach Plesac',
                 'Merrill Kelly'])

# Make a list of 10 pitchers that includes a name not in the database as a test set that can throw an error
pitcher_list_werror = (['Gerrit Cole',
                 'Justin Verlander',
                 'Caleb Smith',
                 'Chris Paddack',
                 'Ben Lacar',
                 'Robbie Ray',
                 'Zach Eflin',
                 'Dylan Bundy',
                 'Zach Plesac',
                 'Merrill Kelly'])

In [7]:
%%time
# %%time is a cell magic, so it has to be the first line of the cell
# Done serially (two pitchers), takes 21.4 s
for pitcher in pitcher_list[0:2]:
    print(pitcher)
    p_id, df_pitcher_sc = get_pitcher_pb_sc_data(pitcher)

Gerrit Cole
Gerrit Cole
Gathering player lookup table. This may take a moment.
Gathering Player Data
Justin Verlander
Justin Verlander
Gathering player lookup table. This may take a moment.
Gathering Player Data


# Evaluate pybaseball scraping in parallel (threads)

I mainly modified functions based on [this StackOverflow post](https://stackoverflow.com/questions/16982569/making-multiple-api-calls-in-parallel-using-python-ipython).

- What happens if # of threads > # items to download?
- What happens if # items to download > # of threads?
- What happens if you max out the threads? (How do you know the max?)
- What if there is an error in an item you download?
- How do you do a loop with each pass through the loop doing a batch download? (e.g., scrape 10 items 2 at a time, so 5 passes through the loop)
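
Two of these questions (handling an error in one item, and downloading in batches) can be explored with a small stand-alone sketch. It uses the standard library's `concurrent.futures.ThreadPoolExecutor` rather than the hand-rolled `Thread` approach from the StackOverflow post, and `fetch` is a stand-in for the real scraping call, so the names and return values are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(name):
    """Stand-in for a scraping call; raises for one name to simulate a failed lookup."""
    if name == "Unknown Player":
        raise ValueError("no id found for " + name)
    return len(name)


def fetch_in_batches(names, batch_size=2, max_workers=2):
    """Download names batch by batch, collecting results and errors separately."""
    results, errors = {}, {}
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(fetch, name): name for name in batch}
            for future in as_completed(futures):
                name = futures[future]
                try:
                    results[name] = future.result()
                except Exception as exc:  # one failure does not stop the batch
                    errors[name] = exc
    return results, errors


names = ["Gerrit Cole", "Justin Verlander", "Unknown Player", "Robbie Ray", "Zach Eflin"]
results, errors = fetch_in_batches(names)
print(sorted(results))
print(list(errors))
```

With five names and a batch size of two, the loop makes three passes. A pool with more workers than items in a batch simply leaves the extra workers unused.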


In [8]:
def get_pitcher_pb_sc_data_range(pitcher_range, store=None):
    """Process a list of pitcher names, storing each result in a dict keyed by name."""
    if store is None:
        store = {}
    for pitcher in pitcher_range:
        # Easy way to skip a pitcher whose lookup or download raises an error
        try:
            store[pitcher] = get_pitcher_pb_sc_data(pitcher)
        except Exception:
            continue
    return store

In [9]:
from threading import Thread

In [10]:
# def threaded_process_range(nthreads, id_range):
#     """process the id range in a specified number of threads"""
#     store = {}
#     threads = []
#     # create the threads
#     for i in range(nthreads):
#         ids = id_range[i::nthreads]
#         t = Thread(target=process_range, args=(ids,store))
#         threads.append(t)

#     # start the threads
#     [ t.start() for t in threads ]
#     # wait for the threads to finish
#     [ t.join() for t in threads ]
#     return store

In [11]:
def threaded_process_range(nthreads, pitcher_list):
    """Process the pitcher list in a specified number of threads."""
    store = {}
    threads = []
    # Create the threads; thread i takes every nthreads-th pitcher starting at index i
    for i in range(nthreads):
        ids = pitcher_list[i::nthreads]
        t = Thread(target=get_pitcher_pb_sc_data_range, args=(ids, store))
        threads.append(t)

    # Start the threads
    for t in threads:
        t.start()
    # Wait for the threads to finish
    for t in threads:
        t.join()
    return store
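
The `pitcher_list[i::nthreads]` slicing in `threaded_process_range` decides which thread gets which pitchers. The short sketch below (with a made-up helper name, `split_round_robin`) shows the resulting partitions, including the case where there are more threads than items.

```python
def split_round_robin(items, nthreads):
    """Return the per-thread slices produced by items[i::nthreads]."""
    return [items[i::nthreads] for i in range(nthreads)]


pitchers = ["Gerrit Cole", "Justin Verlander", "Caleb Smith",
            "Chris Paddack", "Reynaldo Lopez"]

# Fewer threads than items: each thread works through several pitchers in turn
print(split_round_robin(pitchers, 2))

# More threads than items: the extra threads receive empty lists and finish immediately
print(split_round_robin(pitchers, 7))
```

So asking for more threads than pitchers does not raise an error; the surplus threads just have nothing to do.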

In [12]:
my_store = threaded_process_range(4, pitcher_list)

Gerrit Cole
Gathering player lookup table. This may take a moment.
Justin Verlander
Gathering player lookup table. This may take a moment.
Caleb Smith
Gathering player lookup table. This may take a moment.
Chris Paddack
Gathering player lookup table. This may take a moment.
Gathering Player Data
Gathering Player Data
Dylan Bundy
Gathering player lookup table. This may take a moment.
Reynaldo Lopez
Gathering player lookup table. This may take a moment.
Gathering Player Data
Gathering Player Data
Zach Eflin
Gathering player lookup table. This may take a moment.
Robbie Ray
Gathering player lookup table. This may take a moment.
Gathering Player Data
Gathering Player Data
Zach Plesac
Gathering player lookup table. This may take a moment.
Gathering Player Data
Gathering Player Data
Gathering Player Data
Merrill Kelly
Gathering player lookup table. This may take a moment.
Gathering Player Data


In [63]:
my_store = threaded_process_range(4, pitcher_list_werror)

Gerrit Cole
Gathering player lookup table. This may take a moment.
Justin Verlander
Gathering player lookup table. This may take a moment.Caleb Smith

Gathering player lookup table. This may take a moment.
Ben Lacar
Gathering player lookup table. This may take a moment.
Gathering Player Data
Gathering Player Data
Gathering Player Data


In [13]:
my_store.keys()

dict_keys(['Chris Paddack', 'Gerrit Cole', 'Caleb Smith', 'Justin Verlander', 'Dylan Bundy', 'Reynaldo Lopez', 'Zach Eflin', 'Robbie Ray', 'Zach Plesac', 'Merrill Kelly'])

In [14]:
my_store['Gerrit Cole'][1]

Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,...,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,sw_type,sw_true
0,FF,2019-09-29,98.4,-1.9530,5.7396,Gerrit Cole,592230,543037,strikeout,swinging_strike,...,1,8,8,1,1,8,Infield shift,Standard,whiff,1
1,FF,2019-09-29,99.0,-1.8961,5.7786,Gerrit Cole,592230,543037,,ball,...,1,8,8,1,1,8,Infield shift,Standard,take_ball,0
2,KC,2019-09-29,86.2,-1.9253,5.7561,Gerrit Cole,592230,543037,,foul,...,1,8,8,1,1,8,Infield shift,Standard,foul,1
3,FF,2019-09-29,99.4,-1.9853,5.7987,Gerrit Cole,592230,543037,,foul,...,1,8,8,1,1,8,Infield shift,Standard,foul,1
4,FF,2019-09-29,98.5,-1.9848,5.7508,Gerrit Cole,592230,543037,,swinging_strike,...,1,8,8,1,1,8,Infield shift,Standard,whiff,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3357,SL,2019-03-29,89.8,-2.3465,5.9531,Gerrit Cole,640457,543037,,foul,...,0,0,0,0,0,0,Infield shift,Standard,foul,1
3358,FF,2019-03-29,96.3,-2.1579,5.8887,Gerrit Cole,640457,543037,,foul,...,0,0,0,0,0,0,Infield shift,Standard,foul,1
3359,SL,2019-03-29,88.5,-2.3414,5.9628,Gerrit Cole,640457,543037,,ball,...,0,0,0,0,0,0,Infield shift,Standard,take_ball,0
3360,FT,2019-03-29,96.7,-2.3392,5.7665,Gerrit Cole,640457,543037,,ball,...,0,0,0,0,0,0,Infield shift,Standard,take_ball,0


# What are threads? What are processes and what is going on?
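
As a partial answer, here is a small sketch of one thing that distinguishes threads from processes: threads started from the same program run inside one process, so they report the same process id and can write to the same Python objects. Separate processes would each have their own id and their own copy of the data. The names `shared` and `record` are illustrative only.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

shared = []  # lives in the single process that all the threads belong to


def record(label):
    # Each worker appends to the same list and reports its process and thread ids
    shared.append(label)
    return os.getpid(), threading.get_ident()


with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(record, ["a", "b", "c"]))

pids = {pid for pid, _ in results}
print(pids == {os.getpid()})  # every thread ran inside this process
print(sorted(shared))         # and every thread wrote to the same list
```

This sharing is why `threaded_process_range` can pass one `store` dict to every thread and read the combined results afterwards.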

In [57]:
my_list = ['Kawhi Leonard', 'Kevin Durant', 'Klay Thompson']

In [59]:
for player in my_list:
    print(player.lower().replace(' ', '-'))
    

kawhi-leonard
kevin-durant
klay-thompson


In [None]:
# https://www.foxsports.com/nba/kevin-durant-player-injuries
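
Combining the slug from the loop above with the URL in the comment, a hedged sketch of a URL builder might look like the following. The pattern is inferred from that single example URL, so it is an assumption that other players follow it; names with punctuation or accents may need extra handling. `injury_url` is a hypothetical helper name.

```python
def injury_url(player_name):
    """Build an injury-page URL from a player's name, assuming the pattern above."""
    slug = player_name.lower().replace(" ", "-")
    return f"https://www.foxsports.com/nba/{slug}-player-injuries"


for player in ["Kawhi Leonard", "Kevin Durant", "Klay Thompson"]:
    print(injury_url(player))
```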

In [43]:
%%bash
mkdir -p test_dir
ls
rm -r test_dir
ls

DS_dev_setup_part_1.html
DS_dev_setup_part_2.ipynb
DS_sql_setup_part_3.ipynb
Qs_for_Eric.txt
births2012_downsampled.csv
daily_fantasy_baseball_player_predictor.ipynb
edu_data_explore
explore_data_insight_env.ipynb
multithreading_wpybaseball.ipynb
test_dir
test_insight_env_wpostgreSQL.ipynb
DS_dev_setup_part_1.html
DS_dev_setup_part_2.ipynb
DS_sql_setup_part_3.ipynb
Qs_for_Eric.txt
births2012_downsampled.csv
daily_fantasy_baseball_player_predictor.ipynb
edu_data_explore
explore_data_insight_env.ipynb
multithreading_wpybaseball.ipynb
test_insight_env_wpostgreSQL.ipynb


In [16]:
[print(i) for i in data.columns]

Name
Team
Age
G
AB
PA
H
1B
2B
3B
HR
R
RBI
BB
IBB
SO
HBP
SF
SH
GDP
SB
CS
AVG
GB
FB
LD
IFFB
Pitches
Balls
Strikes
IFH
BU
BUH
BB%
K%
BB/K
OBP
SLG
OPS
ISO
BABIP
GB/FB
LD%
GB%
FB%
IFFB%
HR/FB
IFH%
BUH%
wOBA
wRAA
wRC
Bat
Fld
Rep
Pos
RAR
WAR
Dol
Spd
wRC+
WPA
-WPA
+WPA
RE24
REW
pLI
phLI
PH
WPA/LI
Clutch
FB% (Pitch)
FBv
SL%
SLv
CT%
CTv
CB%
CBv
CH%
CHv
SF%
SFv
KN%
KNv
XX%
PO%
wFB
wSL
wCT
wCB
wCH
wSF
wKN
wFB/C
wSL/C
wCT/C
wCB/C
wCH/C
wSF/C
wKN/C
O-Swing%
Z-Swing%
Swing%
O-Contact%
Z-Contact%
Contact%
Zone%
F-Strike%
SwStr%
BsR
FA% (pfx)
FT% (pfx)
FC% (pfx)
FS% (pfx)
FO% (pfx)
SI% (pfx)
SL% (pfx)
CU% (pfx)
KC% (pfx)
EP% (pfx)
CH% (pfx)
SC% (pfx)
KN% (pfx)
UN% (pfx)
vFA (pfx)
vFT (pfx)
vFC (pfx)
vFS (pfx)
vFO (pfx)
vSI (pfx)
vSL (pfx)
vCU (pfx)
vKC (pfx)
vEP (pfx)
vCH (pfx)
vSC (pfx)
vKN (pfx)
FA-X (pfx)
FT-X (pfx)
FC-X (pfx)
FS-X (pfx)
FO-X (pfx)
SI-X (pfx)
SL-X (pfx)
CU-X (pfx)
KC-X (pfx)
EP-X (pfx)
CH-X (pfx)
SC-X (pfx)
KN-X (pfx)
FA-Z (pfx)
FT-Z (pfx)
FC-Z (pfx)
FS-Z (pfx)
FO-Z (pfx)
SI-Z (pfx)

[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,