## Baseball Savant Data Scraping

#### Data Details

This code pulls data from the Baseball Savant [statcast search tool](https://baseballsavant.mlb.com/statcast_search). It is the equivalent of downloading CSVs from the website. The website seems to have changed its limits on query cost and on the number of records returned per query, so the script loops through query options and pulls the 2008-2017 data in smaller pieces.

The script loops through the following characteristics so that each query stays within the website's limitations:
   - Years
   - Teams (the batting team)
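
As a rough illustration of how the year and team loops combine into individual queries, here is a minimal sketch. The helper `build_query_url` is hypothetical, and it sends only a handful of the parameters that appear in the notebook's full link; whether the site accepts such a reduced parameter set is an assumption that has not been tested here.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = 'https://baseballsavant.mlb.com/statcast_search/csv'

TEAMS = ['LAA', 'HOU', 'OAK', 'TOR', 'ATL', 'MIL', 'STL',
         'CHC', 'ARI', 'LAD', 'SF', 'CLE', 'SEA', 'MIA',
         'NYM', 'WSH', 'BAL', 'SD', 'PHI', 'PIT', 'TEX',
         'TB', 'BOS', 'CIN', 'COL', 'KC', 'DET', 'MIN',
         'CWS', 'NYY']


def build_query_url(year, team):
    """Hypothetical helper: build one CSV link for a season and a team."""
    params = {
        'all': 'true',
        'hfSea': f'{year}|',      # urlencode turns the trailing '|' into %7C
        'player_type': 'pitcher',
        'team': team,
        'type': 'details',
    }
    return BASE_URL + '?' + urlencode(params)


# range() excludes its end value, so 2018 is needed to cover 2008-2017
queries = [build_query_url(year, team)
           for year, team in product(range(2008, 2018), TEAMS)]

print(len(queries))  # 300 (10 seasons x 30 teams)
print(queries[0])
```

Splitting by season and team keeps each request small, at the cost of 300 separate downloads.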
   
#### Output

Each loop iteration builds a new link to a CSV, which is read into a dataframe. The full data set is large by Excel standards, so each dataframe is appended to a SQLite table. See [metadata](metadata.md).
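
The append step can be tried in isolation with an in-memory database. The sketch below uses made-up rows rather than real Statcast data, and it passes `index=False`, which the notebook itself does not do, so the table here has no index column.

```python
import sqlite3

import pandas as pd

# In-memory database so the example leaves no file behind
con = sqlite3.connect(':memory:')

# Two small dataframes standing in for two downloaded CSVs (made-up values)
chunk_a = pd.DataFrame({'pitcher_name': ['Pitcher A', 'Pitcher B'],
                        'release_speed': [95.1, 88.4]})
chunk_b = pd.DataFrame({'pitcher_name': ['Pitcher C'],
                        'release_speed': [91.7]})

# if_exists='append' creates the table on the first call and adds rows afterwards
for chunk in (chunk_a, chunk_b):
    chunk.to_sql('statcast', con=con, if_exists='append', index=False)

row_count = pd.read_sql('SELECT COUNT(*) AS n FROM statcast', con)['n'].iloc[0]
print(row_count)

con.close()
```

Because every dataframe goes into the same table, the full data set never has to fit in memory at once.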

[Return to Main](README.md)

In [None]:
import pandas as pd
import sqlite3
from tqdm import tqdm, tqdm_notebook
import time
from urllib.error import HTTPError

In [None]:
savant = sqlite3.connect('BaseballSavant.db')

teams = ['LAA', 'HOU', 'OAK', 'TOR', 'ATL', 'MIL', 'STL', 
         'CHC', 'ARI', 'LAD', 'SF', 'CLE', 'SEA', 'MIA', 
         'NYM', 'WSH', 'BAL', 'SD', 'PHI', 'PIT', 'TEX', 
         'TB', 'BOS', 'CIN', 'COL', 'KC', 'DET', 'MIN', 
         'CWS', 'NYY']

In [None]:
# range() excludes its end value, so 2018 is needed to include the 2017 season
for year in tqdm_notebook(range(2008, 2018), desc = 'Years'):
    for team in tqdm_notebook(teams, desc = 'Teams', leave = False):
        # Query link is based on loop
        link = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=&hfC=&hfSea=' + str(year) + '%7C&hfSit=&player_type=pitcher&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&team=' + team + '&position=&hfRO=&home_road=&hfFlag=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name-event&sort_col=pitches&player_event_sort=api_p_release_speed&sort_order=desc&min_abs=0&type=details&'
        successful = False
        backoff_time = 30
        while not successful:
            try:
                data = pd.read_csv(link)
                # Rename player_name to denote that it is the pitcher
                data.rename(columns={'player_name': 'pitcher_name'}, inplace=True)
                # Insert to table
                data.to_sql(name='statcast', con=savant, if_exists='append')
                successful = True
            except (HTTPError, sqlite3.OperationalError):
                # On an error, wait backoff_time seconds and then retry the same query
                for i in tqdm_notebook(range(backoff_time), desc="Backing off " + str(backoff_time) + " seconds", leave=False):
                    time.sleep(1)
                # Double the wait after each failure, capped at one hour
                backoff_time = min(backoff_time * 2, 60*60)
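
The retry logic in the loop above can be separated into a small helper, which makes the backoff schedule easier to check. This is a minimal sketch, not the notebook's code: `retry_with_backoff` and `flaky_download` are hypothetical names, and `ConnectionError` stands in for the `HTTPError` the notebook catches. The `sleep` argument is injectable so the example runs without actually waiting.

```python
import time


def retry_with_backoff(action, initial_delay=30, max_delay=60 * 60,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    delay = initial_delay
    while True:
        try:
            return action()
        except retry_on:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait at max_delay


# Demonstration with a fake download that fails three times, then succeeds
attempts = {'count': 0}
waits = []


def flaky_download():
    attempts['count'] += 1
    if attempts['count'] <= 3:
        raise ConnectionError('simulated failure')
    return 'ok'


result = retry_with_backoff(flaky_download, retry_on=(ConnectionError,),
                            sleep=waits.append)  # record waits instead of sleeping
print(result)  # ok
print(waits)   # [30, 60, 120]
```

With the defaults shown, the wait reaches the one-hour cap after seven consecutive failures (30, 60, 120, 240, 480, 960, 1920, then 3600).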

In [None]:
savant.commit() 
savant.close()