## Baseball Savant Data Scraping

#### Data Details

This code pulls data from the Baseball Savant [statcast search tool](https://baseballsavant.mlb.com/statcast_search). It is equivalent to downloading CSVs from the website. The website limits query time and returns only the first 1,000 observations per query, so the code loops through query options and pulls the 2008-2017 data one small query at a time.

The code loops through the following characteristics so that each query stays within the website's limits:
   - Years
   - Teams (the batting team)
   - Home/Away
   - Outs
   - Innings (1-9 and 10+)
   - Pitcher Handedness (R for righty and L for lefty)
   - Batter Handedness
   - Opposing Team 
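
To get a feel for how many queries this grid produces, the combinations can be enumerated with `itertools.product`. This is only a sketch: the lists mirror the ones defined in the code further down, while `query_grid` is a hypothetical helper name that does not appear in the original code.

```python
from itertools import product

teams = ['LAA', 'HOU', 'OAK', 'TOR', 'ATL', 'MIL', 'STL',
         'CHC', 'ARI', 'LAD', 'SF', 'CLE', 'SEA', 'MIA',
         'NYM', 'WSH', 'BAL', 'SD', 'PHI', 'PIT', 'TEX',
         'TB', 'BOS', 'CIN', 'COL', 'KC', 'DET', 'MIN',
         'CWS', 'NYY']
loc = ['Home', 'Road']
outl = ['0', '1', '2%7C3']
hands = ['R', 'L']


def query_grid(years):
    """Yield (year, team, home_away, inning, outs, throws, stands, oteam) tuples.

    Hypothetical helper: it walks the same combinations as the nested loops.
    """
    return product(years, teams, loc, range(1, 11), outl, hands, hands, teams)


# Number of queries issued for a single season
total = sum(1 for _ in query_grid([2008]))

# Same count after dropping combinations where a team would face itself
distinct = sum(1 for q in query_grid([2008]) if q[1] != q[7])

print(total, distinct)
```

A team never plays itself, so the combinations where the batting team equals the opposing team cannot match any game. Skipping them would save one query in thirty.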
   
#### Output

Each combination of loop values produces a different CSV link, which is read into a dataframe. The full data set is large, so each dataframe is appended to a SQLite table (`statcast` in `BaseballSavant.db`) as soon as it is read. See [metadata](metadata.md).
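
As an illustration of how the loop values end up in the link, the query string could also be assembled with `urllib.parse.urlencode` instead of string concatenation. This is a sketch under assumptions: `build_query_url` is a hypothetical helper, the parameter names are copied from the link used in the code below, and only the non-empty parameters are included. Whether the endpoint accepts this reduced parameter set has not been verified.

```python
from urllib.parse import urlencode

BASE_URL = "https://baseballsavant.mlb.com/statcast_search/csv"


def build_query_url(year, team, home_away, inning, outs, throws, stands, oteam):
    """Return a CSV query URL for one combination of loop values.

    `outs` is a pipe-separated string such as "0" or "2|3";
    urlencode turns each "|" into "%7C".
    """
    params = {
        "all": "true",
        "hfGT": "R|",
        "hfSea": f"{year}|",
        "player_type": "batter",
        "hfOuts": f"{outs}|",
        "opponent": oteam,
        "pitcher_throws": throws,
        "batter_stands": stands,
        "team": team,
        "home_road": home_away,
        "hfInn": f"{inning}|",
        "type": "details",
    }
    return BASE_URL + "?" + urlencode(params)


url = build_query_url(2008, "LAA", "Home", 1, "2|3", "R", "L", "HOU")
print(url)
```

Letting `urlencode` handle the escaping means the loop lists can hold plain values such as `"2|3"` rather than pre-encoded strings.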

[Return to Main](README.md)

In [None]:
import pandas as pd
import sqlite3
from tqdm import tqdm
import time
# HTTPError is caught in the retry loop below
from urllib.error import HTTPError

In [None]:
# Connect to database
savant = sqlite3.connect('BaseballSavant.db')

# List of teams
teams = ['LAA', 'HOU', 'OAK', 'TOR', 'ATL', 'MIL', 'STL', 
         'CHC', 'ARI', 'LAD', 'SF', 'CLE', 'SEA', 'MIA', 
         'NYM', 'WSH', 'BAL', 'SD', 'PHI', 'PIT', 'TEX', 
         'TB', 'BOS', 'CIN', 'COL', 'KC', 'DET', 'MIN', 
         'CWS', 'NYY']

# List of Home/Road
loc = ['Home', 'Road']

# List of out combinations (%7C is a URL-encoded '|', so the last entry selects 2 or 3 outs)
outl = ['0', '1', '2%7C3']

In [None]:
# Year loop (range(2008, 2009) covers 2008 only; widen to range(2008, 2018) for 2008-2017)
for year in tqdm(range(2008, 2009), desc = 'Years'):
    # Team loop
    for team in tqdm(teams, desc = 'Teams', leave = False):
        # Home/Away loop
        for home_away in tqdm(loc, desc = 'Location', leave = False):
            # Inning loop (1-9, with 10 standing for the 10+ option)
            for inning in tqdm(range(1, 11), desc='Innings', leave=False):
                # Outs loop
                for outs in tqdm(outl, desc = 'Outs', leave = False):
                    # Pitcher handedness
                    for throws in ['R', 'L']:
                        # Batter handedness
                        for stands in ['R', 'L']:
                            # Opposing team 
                            for oteam in teams:
                                # Query link is based on loop input
                                link = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea=' + str(year) + '%7C&hfSit=&player_type=batter&hfOuts=' + outs + '%7C&opponent=' + oteam + '&pitcher_throws=' + throws + '&batter_stands=' + stands + '&hfSA=&game_date_gt=&game_date_lt=&team=' + team + '&position=&hfRO=&home_road=' + home_away + '&hfFlag=&metric_1=&hfInn=' + str(inning) + '%7C&min_pitches=0&min_results=0&group_by=name-event&sort_col=pitches&player_event_sort=api_p_release_speed&sort_order=desc&min_abs=0&type=details&'
                                successful = False
                                backoff_time = 30
                                while not successful:
                                    try:
                                        # Read in query CSV as dataframe
                                        data = pd.read_csv(link)
                                        # Rename player_name to denote that it is the batter
                                        data.rename(columns={'player_name': 'batter_name'}, inplace=True)
                                        # Append the dataframe to the statcast table
                                        data.to_sql(name='statcast', con=savant, if_exists='append')
                                        successful = True
                                    except (HTTPError, sqlite3.OperationalError) as e:
                                        # On error, wait backoff_time seconds, then double the wait (capped at one hour) and retry
                                        for i in tqdm(range(backoff_time), desc="Backing off " + str(backoff_time) + " seconds for error " + str(e) + " at " + str(year) + " " + outs + " " + team + " " + home_away + " " + str(inning), leave=False):
                                            time.sleep(1)
                                        backoff_time = min(backoff_time * 2, 60*60)
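
The retry loop above starts at a 30-second wait and doubles it after every failed attempt, up to a ceiling of one hour. The sketch below prints that schedule on its own; `backoff_delays` is a hypothetical generator introduced only for illustration and is not part of the scraping code.

```python
from itertools import islice


def backoff_delays(start=30, cap=60 * 60):
    """Yield wait times in seconds: start, 2*start, 4*start, ... capped at `cap`."""
    delay = start
    while True:
        yield delay
        delay = min(delay * 2, cap)


# First nine waits for one query that keeps failing
schedule = list(islice(backoff_delays(), 9))
print(schedule)
# [30, 60, 120, 240, 480, 960, 1920, 3600, 3600]
```

Because `backoff_time` is reset to 30 for every new query, one slow stretch does not inflate the waits for later queries.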

In [None]:
# Commit and close connection
savant.commit() 
savant.close()