# Scraping Baseball Reference Data

### Introduction

This notebook describes how to scrape prior-season data from Baseball Reference, which I'll use later to predict pitcher quality starts. I'll be scraping four seasons of data (2016-2019). Each season has its own page, and that page holds the pitching data for every team. Everything I want is in the "Player Starting Pitching" table, which looks like this:

![](img/starting_pitch_table.png)

### Retrieving Page Source Code

I’m going to use Selenium because Baseball Reference doesn't fully render the page when it first loads: I need to automate scrolling to the table and clicking on it to get the data to show up. I’ll use BeautifulSoup to parse the HTML that comes back, which makes it easier to find and extract the data I care about. I’ll also load pandas so I can put the data into dataframes once I have it.

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time, os

chromedriver = "/Applications/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver

I wrote this function to get the source code for each season of data. It sleeps for 5 seconds after loading the page and again after the first click, to give the page time to render. It also uses Selenium to click a button that displays the table as CSV, which is easier to extract and process than the HTML table.

In [8]:
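def get_csv_bbref(link_list):
    soup_list = []
    for i in link_list:
        # open a fresh browser window for each season page
        driver = webdriver.Chrome(chromedriver)
        try:
            driver.get(i)
            time.sleep(5)
            # scroll down so the table's controls are in view
            driver.execute_script("window.scrollTo(0, 1500);")
            # click the first menu item above the table to reveal its options
            driver.find_element_by_xpath('//*[@id="all_players_starter_pitching"]/div[1]/div/ul/li[1]').click()
            time.sleep(5)
            # click the button inside that menu that shows the table as CSV
            driver.find_element_by_xpath('//*[@id="all_players_starter_pitching"]/div[1]/div/ul/li[1]/div/ul/li[4]/button').click()
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            # the CSV text sits in a <pre> tag; split it into lines
            crude = soup.find('pre', id = 'csv_players_starter_pitching').text.split("\n")
            soup_list.append(crude)
        finally:
            # close the browser even if one of the steps above fails
            driver.quit()
    return soup_list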
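
The fixed five-second sleeps work, but they always wait the full time even when the page is ready sooner, and they can still be too short on a slow connection. An alternative is to poll for a condition until it holds or a timeout expires, which is the idea behind Selenium's `WebDriverWait`. The sketch below shows that idea using only the standard library. `wait_until` is a hypothetical helper, not part of Selenium or this notebook, and the lambda passed to it is a stand-in for a real check on the page.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    Hypothetical helper, not part of Selenium or the original notebook.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f seconds' % timeout)
        time.sleep(poll_interval)

# Stand-in for a page that only becomes ready on the third check.
checks = iter([False, False, True])
ready = wait_until(lambda: next(checks), timeout=5.0, poll_interval=0.01)
```

In the scraping function, the condition could be a small function that looks for the CSV element and returns it, or returns `None` if it is not there yet.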
        

In [10]:
#get all pages at once
bbref_list = ['https://www.baseball-reference.com/leagues/MLB/2016-starter-pitching.shtml', 
              'https://www.baseball-reference.com/leagues/MLB/2017-starter-pitching.shtml',
              'https://www.baseball-reference.com/leagues/MLB/2018-starter-pitching.shtml',
              'https://www.baseball-reference.com/leagues/MLB/2019-starter-pitching.shtml']

In [6]:
#get individual pages one at a time
bbref_2016 = ['https://www.baseball-reference.com/leagues/MLB/2016-starter-pitching.shtml']
bbref_2017 = ['https://www.baseball-reference.com/leagues/MLB/2017-starter-pitching.shtml']
bbref_2018 = ['https://www.baseball-reference.com/leagues/MLB/2018-starter-pitching.shtml']
bbref_2019 = ['https://www.baseball-reference.com/leagues/MLB/2019-starter-pitching.shtml']
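
Since the season URLs differ only in the year, the lists above could also be generated instead of typed out. A minimal sketch, where `season_url`, `build_season_urls`, and `generated_urls` are hypothetical names that are not used elsewhere in this notebook:

```python
# Hypothetical helpers: build the season URLs from a range of years.
BASE_URL = 'https://www.baseball-reference.com/leagues/MLB/{year}-starter-pitching.shtml'

def season_url(year):
    """Return the starting-pitching page URL for one season."""
    return BASE_URL.format(year=year)

def build_season_urls(first_year, last_year):
    """Return the URLs for every season from first_year to last_year, inclusive."""
    return [season_url(year) for year in range(first_year, last_year + 1)]

# Same four URLs as the hand-written list of all seasons.
generated_urls = build_season_urls(2016, 2019)
```

A single-season list like `bbref_2016` would then be `[season_url(2016)]`.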

In [11]:
raw_csv_data_list = get_csv_bbref(bbref_list)

In [9]:
raw_csv_data_2016 = get_csv_bbref(bbref_2016)
raw_csv_data_2017 = get_csv_bbref(bbref_2017)
raw_csv_data_2018 = get_csv_bbref(bbref_2018)
raw_csv_data_2019 = get_csv_bbref(bbref_2019)

### Extracting Data from the CSV Table

Now that I have the raw CSV text for each season, I can clean it up and save it.

In [81]:
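def clean_crude(csv):
    # csv is a list of seasons; each season is a list of raw CSV lines
    # flatten all seasons into one list, dropping blank lines
    no_blank = []
    for i in csv:
        for j in i:
            if j != '':
                no_blank.append(j)
    # split each line on commas, one dataframe row per line
    season_data = pd.DataFrame([i.split(',') for i in no_blank])
    # the first row holds the column names
    header = season_data.iloc[0]
    season_data = season_data[1:]
    season_data.columns = header
    # when several seasons are combined, each one repeats the header row; drop the repeats
    season_data = season_data[season_data['Rk'] != 'Rk']
    season_data = season_data.drop(columns = ['Rk'])
    return season_data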
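
Splitting each line on `','` assumes that no field ever contains a comma. If a field were quoted and contained one, the plain split would break that row into too many pieces. As a precaution, the standard-library `csv` module could do the splitting, since it handles quoted fields. A minimal sketch on made-up sample lines: the column names and values are illustrative, not real Baseball Reference output, and `split_csv_lines` is a hypothetical helper.

```python
import csv

def split_csv_lines(lines):
    """Parse raw CSV lines into lists of fields, skipping blank lines."""
    non_blank = [line for line in lines if line.strip() != '']
    return list(csv.reader(non_blank))

# Illustrative sample only: a header, a blank line, and a quoted field with a comma.
sample = ['Rk,Name,GS', '', '1,"Doe, John",30']
rows = split_csv_lines(sample)

# A plain split breaks the quoted name into two pieces.
naive = sample[2].split(',')
```

Here `rows[1]` keeps the quoted name as a single field, while `naive` ends up with four pieces instead of three.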

In [83]:
clean_2016 = clean_crude(raw_csv_data_2016)
#clean_2016.to_csv('season_2016.csv', index = False)

In [84]:
clean_2017 = clean_crude(raw_csv_data_2017)
#clean_2017.to_csv('season_2017.csv', index = False)

In [85]:
clean_2018 = clean_crude(raw_csv_data_2018)
#clean_2018.to_csv('season_2018.csv', index = False)

In [86]:
clean_2019 = clean_crude(raw_csv_data_2019)
#clean_2019.to_csv('season_2019.csv', index = False)

In [88]:
clean_allyears = clean_crude(raw_csv_data_list)
#clean_allyears.to_csv('season_1619.csv', index = False)

### Wrap-up

Now I have four seasons of starting-pitching data. Between this and the three years of projection data from FanGraphs, I am ready to start exploratory data analysis to predict quality starts.