# Data Ingestion 

In this portion of the project, I use the [baseball_scraper](https://github.com/spilchen/baseball_scraper) package to collect Statcast data for every pitch thrown in each season from 2015 through 2020. baseball_scraper uses Selenium to scrape [Baseball Savant's Statcast website](https://baseballsavant.mlb.com/csv-docs#pitch_type). For each season, I looked up the start and end dates and passed them to the package's `statcast` function.
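The per-season calls in this notebook all follow the same pattern, so they could also be driven from a small lookup table. The sketch below is one possible way to do that, not how the notebook was actually run. The dates are copied from the `statcast` calls in this notebook; `SEASON_DATES`, `output_path`, and `ingest_seasons` are hypothetical names, and `fetch` stands in for any function that accepts `start_dt` and `end_dt` and returns an object with a `to_csv` method.

```python
from datetime import date

# Start and end dates copied from the statcast() calls in this notebook.
SEASON_DATES = {
    2015: ("2015-04-01", "2015-11-05"),
    2016: ("2016-04-01", "2016-11-07"),
    2017: ("2017-04-01", "2017-11-07"),
    2018: ("2018-03-29", "2018-11-07"),
    2019: ("2019-03-20", "2019-11-07"),
    2020: ("2020-07-23", "2020-10-27"),
}


def output_path(season, data_dir="./data/statcast_data"):
    """Build the CSV path used in this notebook for a given season."""
    return f"{data_dir}/sc_{season}.csv"


def ingest_seasons(fetch, season_dates=SEASON_DATES):
    """Call fetch(start_dt=..., end_dt=...) once per season and save each result.

    `fetch` is an assumption: any callable with that signature whose return
    value has a to_csv(path, index=...) method.
    """
    saved = []
    for season, (start_dt, end_dt) in sorted(season_dates.items()):
        # Fail early on a mistyped or reversed date range.
        if date.fromisoformat(start_dt) > date.fromisoformat(end_dt):
            raise ValueError(f"{season}: start date is after end date")
        frame = fetch(start_dt=start_dt, end_dt=end_dt)
        path = output_path(season)
        frame.to_csv(path, index=False)
        saved.append(path)
    return saved
```

With the package installed, this would presumably be invoked as `ingest_seasons(statcast)` after `from baseball_scraper import statcast`. Keeping the dates in one dictionary makes it easier to spot a wrong date and to add a season later.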

Each season contains roughly 700,000 pitches, so the data is too large to push to GitHub. Instead, I saved it in an Amazon S3 bucket for future use.
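The upload step itself is not shown in this notebook. As a rough sketch of what it might look like, the code below assumes the `boto3` library with AWS credentials already configured. The bucket name, the `statcast_data` key prefix, and the helper names `s3_key_for` and `upload_season_files` are all hypothetical.

```python
from pathlib import Path


def s3_key_for(local_path, prefix="statcast_data"):
    """Map a local CSV path to an S3 object key (the prefix is an assumption)."""
    return f"{prefix}/{Path(local_path).name}"


def upload_season_files(local_paths, bucket, client=None):
    """Upload each local file to the given bucket and return the keys used.

    If no client is passed, a boto3 S3 client is created. boto3 is imported
    lazily so the key-building logic can be used without it installed.
    """
    if client is None:
        import boto3  # assumption: boto3 is installed and credentials are set up

        client = boto3.client("s3")
    uploaded = []
    for local_path in local_paths:
        key = s3_key_for(local_path)
        # upload_file(Filename, Bucket, Key) streams the file from disk.
        client.upload_file(str(local_path), bucket, key)
        uploaded.append(key)
    return uploaded
```

A call such as `upload_season_files(["./data/statcast_data/sc_2015.csv"], "my-statcast-bucket")` would then store the file under `statcast_data/sc_2015.csv`, where `my-statcast-bucket` is a placeholder name.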

In [14]:
import pandas as pd

from baseball_scraper import playerid_lookup
from baseball_scraper import statcast_pitcher
from baseball_scraper import batting_stats_range
from baseball_scraper import statcast
from baseball_scraper import statcast_batter

In [15]:
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 400)

## 2015 Data

In [17]:
sc_2015 = statcast(start_dt='2015-04-01', end_dt='2015-11-05')

This is a large query, it may take a moment to complete
Completed sub-query from 2015-04-01 to 2015-04-06
Completed sub-query from 2015-04-07 to 2015-04-12
Completed sub-query from 2015-04-13 to 2015-04-18
Completed sub-query from 2015-04-19 to 2015-04-24
Completed sub-query from 2015-04-25 to 2015-04-30
Completed sub-query from 2015-05-01 to 2015-05-06
Completed sub-query from 2015-05-07 to 2015-05-12
Completed sub-query from 2015-05-13 to 2015-05-18
Completed sub-query from 2015-05-19 to 2015-05-24
Completed sub-query from 2015-05-25 to 2015-05-30
Completed sub-query from 2015-05-31 to 2015-06-05
Completed sub-query from 2015-06-06 to 2015-06-11
Completed sub-query from 2015-06-12 to 2015-06-17
Completed sub-query from 2015-06-18 to 2015-06-23
Completed sub-query from 2015-06-24 to 2015-06-29
Completed sub-query from 2015-06-30 to 2015-07-05
Completed sub-query from 2015-07-06 to 2015-07-11
Completed sub-query from 2015-07-12 to 2015-07-17
Completed sub-query from 2015-07-18 to 2015-

In [47]:
sc_2015.to_csv('./data/statcast_data/sc_2015.csv', index=False)

## 2016 Data

In [19]:
sc_2016 = statcast(start_dt='2016-04-01', end_dt='2016-11-07')

This is a large query, it may take a moment to complete
Completed sub-query from 2016-04-01 to 2016-04-06
Completed sub-query from 2016-04-07 to 2016-04-12
Completed sub-query from 2016-04-13 to 2016-04-18
Completed sub-query from 2016-04-19 to 2016-04-24
Completed sub-query from 2016-04-25 to 2016-04-30
Completed sub-query from 2016-05-01 to 2016-05-06
Completed sub-query from 2016-05-07 to 2016-05-12
Completed sub-query from 2016-05-13 to 2016-05-18
Completed sub-query from 2016-05-19 to 2016-05-24
Completed sub-query from 2016-05-25 to 2016-05-30
Completed sub-query from 2016-05-31 to 2016-06-05
Completed sub-query from 2016-06-06 to 2016-06-11
Completed sub-query from 2016-06-12 to 2016-06-17
Completed sub-query from 2016-06-18 to 2016-06-23
Completed sub-query from 2016-06-24 to 2016-06-29
Completed sub-query from 2016-06-30 to 2016-07-05
Completed sub-query from 2016-07-06 to 2016-07-11
Completed sub-query from 2016-07-12 to 2016-07-17
Completed sub-query from 2016-07-18 to 2016-

In [49]:
sc_2016.to_csv('./data/statcast_data/sc_2016.csv', index=False)

## 2017 Data

In [21]:
sc_2017 = statcast(start_dt='2017-04-01', end_dt='2017-11-07')

This is a large query, it may take a moment to complete
Completed sub-query from 2017-04-01 to 2017-04-06
Completed sub-query from 2017-04-07 to 2017-04-12
Completed sub-query from 2017-04-13 to 2017-04-18
Completed sub-query from 2017-04-19 to 2017-04-24
Completed sub-query from 2017-04-25 to 2017-04-30
Completed sub-query from 2017-05-01 to 2017-05-06
Completed sub-query from 2017-05-07 to 2017-05-12
Completed sub-query from 2017-05-13 to 2017-05-18
Completed sub-query from 2017-05-19 to 2017-05-24
Completed sub-query from 2017-05-25 to 2017-05-30
Completed sub-query from 2017-05-31 to 2017-06-05
Completed sub-query from 2017-06-06 to 2017-06-11
Completed sub-query from 2017-06-12 to 2017-06-17
Completed sub-query from 2017-06-18 to 2017-06-23
Completed sub-query from 2017-06-24 to 2017-06-29
Completed sub-query from 2017-06-30 to 2017-07-05
Completed sub-query from 2017-07-06 to 2017-07-11
Completed sub-query from 2017-07-12 to 2017-07-17
Completed sub-query from 2017-07-18 to 2017-

In [51]:
sc_2017.to_csv('./data/statcast_data/sc_2017.csv', index=False)

## 2018 Data

In [23]:
sc_2018 = statcast(start_dt='2018-03-29', end_dt='2018-11-07')

This is a large query, it may take a moment to complete
Completed sub-query from 2018-03-29 to 2018-04-03
Completed sub-query from 2018-04-04 to 2018-04-09
Completed sub-query from 2018-04-10 to 2018-04-15
Completed sub-query from 2018-04-16 to 2018-04-21
Completed sub-query from 2018-04-22 to 2018-04-27
Completed sub-query from 2018-04-28 to 2018-05-03
Completed sub-query from 2018-05-04 to 2018-05-09
Completed sub-query from 2018-05-10 to 2018-05-15
Completed sub-query from 2018-05-16 to 2018-05-21
Completed sub-query from 2018-05-22 to 2018-05-27
Completed sub-query from 2018-05-28 to 2018-06-02
Completed sub-query from 2018-06-03 to 2018-06-08
Completed sub-query from 2018-06-09 to 2018-06-14
Completed sub-query from 2018-06-15 to 2018-06-20
Completed sub-query from 2018-06-21 to 2018-06-26
Completed sub-query from 2018-06-27 to 2018-07-02
Completed sub-query from 2018-07-03 to 2018-07-08
Completed sub-query from 2018-07-09 to 2018-07-14
Completed sub-query from 2018-07-15 to 2018-

In [44]:
sc_2018.to_csv('./data/statcast_data/sc_2018.csv', index=False)

## 2019 Data

In [25]:
sc_2019 = statcast(start_dt='2019-03-20', end_dt='2019-11-07')

This is a large query, it may take a moment to complete
Completed sub-query from 2019-03-20 to 2019-03-25
Completed sub-query from 2019-03-26 to 2019-03-31
Completed sub-query from 2019-04-01 to 2019-04-06
Completed sub-query from 2019-04-07 to 2019-04-12
Completed sub-query from 2019-04-13 to 2019-04-18
Completed sub-query from 2019-04-19 to 2019-04-24
Completed sub-query from 2019-04-25 to 2019-04-30
Completed sub-query from 2019-05-01 to 2019-05-06
Completed sub-query from 2019-05-07 to 2019-05-12
Completed sub-query from 2019-05-13 to 2019-05-18
Completed sub-query from 2019-05-19 to 2019-05-24
Completed sub-query from 2019-05-25 to 2019-05-30
Completed sub-query from 2019-05-31 to 2019-06-05
Completed sub-query from 2019-06-06 to 2019-06-11
Completed sub-query from 2019-06-12 to 2019-06-17
Completed sub-query from 2019-06-18 to 2019-06-23
Completed sub-query from 2019-06-24 to 2019-06-29
Completed sub-query from 2019-06-30 to 2019-07-05
Completed sub-query from 2019-07-06 to 2019-

In [42]:
sc_2019.to_csv('./data/statcast_data/sc_2019.csv', index=False)

## 2020 Data

In [27]:
sc_2020 = statcast(start_dt='2020-07-23', end_dt='2020-10-27')

This is a large query, it may take a moment to complete
Completed sub-query from 2020-07-23 to 2020-07-28
Completed sub-query from 2020-07-29 to 2020-08-03
Completed sub-query from 2020-08-04 to 2020-08-09
Completed sub-query from 2020-08-10 to 2020-08-15
Completed sub-query from 2020-08-16 to 2020-08-21
Completed sub-query from 2020-08-22 to 2020-08-27
Completed sub-query from 2020-08-28 to 2020-09-02
Completed sub-query from 2020-09-03 to 2020-09-08
Completed sub-query from 2020-09-09 to 2020-09-14
Completed sub-query from 2020-09-15 to 2020-09-20
Completed sub-query from 2020-09-21 to 2020-09-26
Completed sub-query from 2020-09-27 to 2020-10-02
Completed sub-query from 2020-10-03 to 2020-10-08
Completed sub-query from 2020-10-09 to 2020-10-14
Completed sub-query from 2020-10-15 to 2020-10-20
Completed sub-query from 2020-10-21 to 2020-10-26
Completed sub-query from 2020-10-27 to 2020-10-27
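The progress log above suggests that a long date range is split into consecutive six-day windows, with a shorter final window when the range does not divide evenly. The sketch below reproduces that windowing with the standard library only. It is inferred from the log output, not taken from the package's source, and `split_date_range` is a hypothetical name.

```python
from datetime import date, timedelta


def split_date_range(start_dt, end_dt, window_days=6):
    """Split an inclusive ISO date range into consecutive inclusive windows."""
    start = date.fromisoformat(start_dt)
    end = date.fromisoformat(end_dt)
    windows = []
    while start <= end:
        # Each window covers window_days days, clipped to the overall end date.
        window_end = min(start + timedelta(days=window_days - 1), end)
        windows.append((start.isoformat(), window_end.isoformat()))
        start = window_end + timedelta(days=1)
    return windows


windows = split_date_range("2020-07-23", "2020-10-27")
for window_start, window_end in windows:
    print(window_start, "to", window_end)
```

For the 2020 range this yields 17 windows, matching the 17 sub-query lines in the log, and the last window covers only 2020-10-27.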


In [40]:
sc_2020.to_csv('./data/statcast_data/sc_2020.csv', index=False)