# Scraping Fangraphs Data

### Introduction

This notebook describes how to scrape historical baseball projection data from Fangraphs, to be used later to predict pitcher quality starts. I'll be scraping three years of projection data (2017, 2018, 2019) for each team, with each team-year of projection data on its own page. Each page of projection data has two tables with the statistics I want: "Pitchers, Counting Stats", and "Pitchers, Rates and Averages". They look like this:

![](/img/counting_stats_table.png)




I’m going to use the requests library to send HTTP requests and retrieve the page source code. I’ll use BeautifulSoup to parse the HTML that comes back, along with re, the regular expressions library, to find and extract the data I care about. I’m also going to load pandas to put it all into dataframes after getting the data I need.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

In [1]:
!pwd

/Users/angeline/Documents/GitHub/predicting_quality_starts/webscraping


### Retrieving Page Source Code 

I started by getting the list of URLs that I intended to scrape, taken from the links on this page (and the similar pages for the 2018 and 2019 projections): https://blogs.fangraphs.com/category/2017-zips-projections/.

In [2]:
#List of URLs to scrape from Fangraphs, by year of projection
list_urls_2017 = [
 'http://www.fangraphs.com/blogs/2017-zips-projections-baltimore-orioles',
 'http://www.fangraphs.com/blogs/2017-zips-projections-chicago-white-sox',
 'http://www.fangraphs.com/blogs/2017-zips-projections-houston-astros',
 'http://www.fangraphs.com/blogs/2017-zips-projections-boston-red-sox',
 'http://www.fangraphs.com/blogs/2017-zips-projections-cleveland-indians',
 'http://www.fangraphs.com/blogs/2017-zips-projections-los-angeles-angels',
 'http://www.fangraphs.com/blogs/2017-zips-projections-new-york-yankees',
 'http://www.fangraphs.com/blogs/2017-zips-projections-detroit-tigers',
 'http://www.fangraphs.com/blogs/2017-zips-projections-oakland-athletics',
 'http://www.fangraphs.com/blogs/2017-zips-projections-tampa-bay-rays',
 'http://www.fangraphs.com/blogs/2017-zips-projections-kansas-city-royals',
 'http://www.fangraphs.com/blogs/2017-zips-projections-seattle-mariners',
 'http://www.fangraphs.com/blogs/2017-zips-projections-toronto-blue-jays',
 'http://www.fangraphs.com/blogs/2017-zips-projections-minnesota-twins',
 'http://www.fangraphs.com/blogs/2017-zips-projections-texas-rangers',
 'http://www.fangraphs.com/blogs/2017-zips-projections-atlanta-braves',
 'http://www.fangraphs.com/blogs/2017-zips-projections-chicago-cubs',
 'http://www.fangraphs.com/blogs/2017-zips-projections-arizona-diamondbacks',
 'http://www.fangraphs.com/blogs/2017-zips-projections-miami-marlins',
 'http://www.fangraphs.com/blogs/2017-zips-projections-cincinnati-reds',
 'http://www.fangraphs.com/blogs/2017-zips-projections-colorado-rockies',
 'http://www.fangraphs.com/blogs/2017-zips-projections-new-york-mets',
 'http://www.fangraphs.com/blogs/2017-zips-projections-milwaukee-brewers',
 'http://www.fangraphs.com/blogs/2017-zips-projections-los-angeles-dodgers',
 'http://www.fangraphs.com/blogs/2017-zips-projections-philadelphia-phillies',
 'http://www.fangraphs.com/blogs/2017-zips-projections-pittsburgh-pirates',
 'http://www.fangraphs.com/blogs/2017-zips-projections-san-diego-padres',
 'http://www.fangraphs.com/blogs/2017-zips-projections-washington-nationals',
 'http://www.fangraphs.com/blogs/2017-zips-projections-st-louis-cardinals',
 'http://www.fangraphs.com/blogs/2017-zips-projections-san-francisco-giants']

list_urls_2018 = ['http://www.fangraphs.com/blogs/2018-zips-projections-baltimore-orioles',
 'http://www.fangraphs.com/blogs/2018-zips-projections-chicago-white-sox',
 'http://www.fangraphs.com/blogs/2018-zips-projections-houston-astros',
 'http://www.fangraphs.com/blogs/2018-zips-projections-boston-red-sox',
 'http://www.fangraphs.com/blogs/2018-zips-projections-cleveland-indians',
 'http://www.fangraphs.com/blogs/2018-zips-projections-los-angeles-angels',
 'http://www.fangraphs.com/blogs/2018-zips-projections-new-york-yankees',
 'http://www.fangraphs.com/blogs/2018-zips-projections-detroit-tigers',
 'http://www.fangraphs.com/blogs/2018-zips-projections-oakland-as',
 'http://www.fangraphs.com/blogs/2018-zips-projections-tampa-bay-rays',
 'http://www.fangraphs.com/blogs/2018-zips-projections-kansas-city-royals',
 'http://www.fangraphs.com/blogs/2018-zips-projections-seattle-mariners',
 'http://www.fangraphs.com/blogs/2018-zips-projections-toronto-blue-jays',
 'http://www.fangraphs.com/blogs/2018-zips-projections-minnesota-twins',
 'http://www.fangraphs.com/blogs/2018-zips-projections-texas-rangers',
 'http://www.fangraphs.com/blogs/2018-zips-projections-atlanta-braves',
 'http://www.fangraphs.com/blogs/2018-zips-projections-chicago-cubs',
 'http://www.fangraphs.com/blogs/2018-zips-projections-arizona-diamondbacks',
 'http://www.fangraphs.com/blogs/2018-zips-projections-miami-marlins',
 'http://www.fangraphs.com/blogs/2018-zips-projections-cincinnati-reds',
 'http://www.fangraphs.com/blogs/2018-zips-projections-colorado-rockies',
 'http://www.fangraphs.com/blogs/2018-zips-projections-new-york-mets',
 'http://www.fangraphs.com/blogs/2018-zips-projections-milwaukee-brewers',
 'http://www.fangraphs.com/blogs/2018-zips-projections-los-angeles-dodgers',
 'http://www.fangraphs.com/blogs/2018-zips-projections-philadelphia-phillies',
 'http://www.fangraphs.com/blogs/2018-zips-projections-pittsburgh-pirates',
 'http://www.fangraphs.com/blogs/2018-zips-projections-san-diego-padres',
 'http://www.fangraphs.com/blogs/2018-zips-projections-washington-nationals',
 'http://www.fangraphs.com/blogs/2018-zips-projections-st-louis-cardinals',
 'http://www.fangraphs.com/blogs/2018-zips-projections-san-francisco-giants']

list_urls_2019 = [
 'http://www.fangraphs.com/blogs/2019-zips-projections-baltimore-orioles',
 'http://www.fangraphs.com/blogs/2019-zips-projections-chicago-white-sox',
 'http://www.fangraphs.com/blogs/2019-zips-projection-houston-astros',
 'http://www.fangraphs.com/blogs/2019-zips-projections-boston-red-sox',
 'http://www.fangraphs.com/blogs/2019-zips-projections-cleveland-indians',
 'http://www.fangraphs.com/blogs/2019-zips-projections-los-angeles-angels',
 'http://www.fangraphs.com/blogs/2019-zips-projections-new-york-yankees',
 'http://www.fangraphs.com/blogs/2019-zips-projections-detroit-tigers',
 'http://www.fangraphs.com/blogs/2019-zips-projections-oakland-athletics',
 'http://www.fangraphs.com/blogs/2019-zips-projections-tampa-bay-rays',
 'http://www.fangraphs.com/blogs/2019-zips-projections-kansas-city-royals',
 'http://www.fangraphs.com/blogs/2019-zips-projections-seattle-mariners',
 'http://www.fangraphs.com/blogs/2019-zips-projections-toronto-blue-jays',
 'http://www.fangraphs.com/blogs/2019-zips-projections-minnesota-twins',
 'http://www.fangraphs.com/blogs/2019-zips-projections-texas-rangers',
 'http://www.fangraphs.com/blogs/2019-zips-projections-atlanta-braves',
 'http://www.fangraphs.com/blogs/2019-zips-projections-chicago-cubs',
 'http://www.fangraphs.com/blogs/2019-zips-projections-arizona-diamondbacks',
 'http://www.fangraphs.com/blogs/2019-zips-projections-miami-marlins',
 'http://www.fangraphs.com/blogs/2019-zips-projections-cincinnati-reds',
 'http://www.fangraphs.com/blogs/2019-zips-projections-colorado-rockies',
 'http://www.fangraphs.com/blogs/2019-zips-projections-new-york-mets',
 'http://www.fangraphs.com/blogs/2019-zips-projections-milwaukee-brewers',
 'http://www.fangraphs.com/blogs/2019-zips-projections-los-angeles-dodgers',
 'http://www.fangraphs.com/blogs/2019-zips-projections-philadelphia-phillies',
 'http://www.fangraphs.com/blogs/2019-zips-projections-pittsburgh-pirates',
 'http://www.fangraphs.com/blogs/2019-zips-projections-san-diego-padres',
 'http://www.fangraphs.com/blogs/2019-zips-projections-washington-nationals',
 'http://www.fangraphs.com/blogs/2019-zips-projections-st-louis-cardinals',
 'http://www.fangraphs.com/blogs/2019-zips-projections-san-francisco-giants']
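
The three lists follow one pattern with two exceptions (the 2018 Oakland page and the 2019 Houston page), so they could also be generated. The following is only a sketch: `BASE`, `TEAM_SLUGS`, `OVERRIDES`, and `build_urls` are names made up for illustration, and only three of the thirty team slugs are shown.

```python
# URL pattern shared by most of the pages listed above.
BASE = "http://www.fangraphs.com/blogs/{year}-zips-projections-{slug}"

# Three slugs for illustration; the full lists above contain thirty.
TEAM_SLUGS = ["baltimore-orioles", "oakland-athletics", "houston-astros"]

# Pages whose URL breaks the pattern, copied from the lists above.
OVERRIDES = {
    (2018, "oakland-athletics"):
        "http://www.fangraphs.com/blogs/2018-zips-projections-oakland-as",
    (2019, "houston-astros"):
        "http://www.fangraphs.com/blogs/2019-zips-projection-houston-astros",
}

def build_urls(year, slugs=TEAM_SLUGS):
    """Return one URL per team slug, using an override where one exists."""
    return [
        OVERRIDES.get((year, slug), BASE.format(year=year, slug=slug))
        for slug in slugs
    ]

urls_2019 = build_urls(2019)
```

Keeping the exceptions in one dictionary makes them easy to spot, at the cost of having to check every generated URL against the real pages once.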

In [None]:
# short_list = list_urls[0:4]
# short_list

For each page, I want to extract the source. This function puts it all into a list. I'll start with the 2017 data.

In [3]:
#Function to get source data for each team, appended into one long list
def get_data(urls):
    response = []
    for i in urls:
        response.append(requests.get(i))
    return response
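
get_data keeps whatever comes back, including error pages. A slightly more defensive variant, sketched below, checks each response and pauses between requests. The name `get_data_checked` and the parameters `fetch`, `delay_seconds`, and `timeout` are assumptions for illustration; `fetch` is expected to be a callable such as `requests.get`.

```python
import time

def get_data_checked(urls, fetch, delay_seconds=1.0, timeout=30):
    """Fetch each URL in order, failing loudly on HTTP errors.

    `fetch` is any callable that accepts (url, timeout=...) and returns an
    object with a raise_for_status() method, e.g. requests.get.
    """
    responses = []
    for url in urls:
        response = fetch(url, timeout=timeout)
        # For a requests.Response this raises on 4xx/5xx status codes.
        response.raise_for_status()
        responses.append(response)
        # Pause between requests to go easy on the server.
        time.sleep(delay_seconds)
    return responses
```

With the imports from the top of the notebook, a call might look like `responses = get_data_checked(list_urls_2017, requests.get)`.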

In [4]:
#start with 2017 data
responses = get_data(list_urls_2017) 

Now that I have the source for all the 2017 ZiPS projections, I can parse them with BeautifulSoup.

In [5]:
#Function to parse all source data into Beautiful Soup object
def make_soup(response_list):
    soup = []
    for i in response_list:
        soup.append(BeautifulSoup(i.text, 'html5lib'))
    return soup

In [6]:
#Turn 2017 data into Beautiful Soup
soups = make_soup(responses)

This function pulls one column out of the flat list of table cells: starting at position number, it takes every num_cols-th value and returns the values as a list. I'm sure there is a more elegant solution for this, but this works for my purposes.

In [7]:
def extract_column(number, data, upper_limit, num_cols):
    name = []
    name_list = []
    for i in range(0, upper_limit):
        name.append(number + i*num_cols)
    for i in name:
        name_list.append(data[i])
    return name_list
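
One more compact option is a stride slice. This is a sketch with a made-up name, `extract_column_sliced`, and it is not a drop-in replacement in every case: extract_column raises an IndexError if the data is shorter than expected, while slicing silently returns fewer values.

```python
def extract_column_sliced(number, data, upper_limit, num_cols):
    """Take every num_cols-th cell starting at `number`, at most upper_limit cells."""
    return data[number::num_cols][:upper_limit]

# Two rows of a three-column table, flattened the same way as the scraped cells.
cells = ["Kevin Gausman", "R", "26", "Dylan Bundy", "R", "24"]
names = extract_column_sliced(0, cells, 2, 3)   # ['Kevin Gausman', 'Dylan Bundy']
ages = extract_column_sliced(2, cells, 2, 3)    # ['26', '24']
```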

This function extracts the team name from the source, so it can be added to the resulting dataframe. 

In [8]:
#Helper function to extract team name from source html
def team_name(team_data, upper_limit):
    team_list = []
    team_clean = re.findall(r'[\sA-Za-z()]+\|', team_data)[0].replace('|', '').strip()
    for i in range(0, upper_limit):
        team_list.append(team_clean)
    return team_list
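
To see what the pattern keeps and what it drops, here is a small standalone sketch. The pattern is written as a raw string with an explicit A-Za-z range, and the sample titles are assumptions about what a page title looks like, not text copied from Fangraphs.

```python
import re

# Letters, whitespace and parentheses, followed by a pipe character.
TITLE_PATTERN = re.compile(r'[\sA-Za-z()]+\|')

def clean_team_name(title):
    """Return the run of letters and spaces that sits just before the first '|'."""
    return TITLE_PATTERN.findall(title)[0].replace('|', '').strip()

# Hypothetical titles, for illustration only.
sample_title = "2017 ZiPS Projections – Baltimore Orioles | FanGraphs Baseball"
dotted_title = "2017 ZiPS Projections – St. Louis Cardinals | FanGraphs Baseball"

print(clean_team_name(sample_title))  # Baltimore Orioles
print(clean_team_name(dotted_title))  # Louis Cardinals
```

The second call shows a limitation: the period is not in the character class, so a title of this shape would lose the "St." prefix. It is worth checking the Team column for that team after scraping.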

This function uses extract_column and team_name to generate a dataframe of counting stats for each team, then concatenates the team dataframes into a single dataframe for 2017.

In [9]:
#Function to get pitching counting stats
def extract_pitching(soup_list, text_id, num_cols):
    pitch_list = []
    for j in soup_list:
        data_crude = j.find(text = text_id).findNext().text.split("\n")
        data_crude2 = []
        for i in data_crude:
            if i != '':
                data_crude2.append(i)
        data = data_crude2[num_cols:]
        headers = data_crude2[:num_cols]
        headers.insert(1, "Team")
        team_data = j.find('title').text
        upper_bound = int(len(data)/num_cols)
        names = extract_column(0, data, upper_bound, num_cols)
        team = team_name(team_data, upper_bound)
        throws = extract_column(1, data, upper_bound, num_cols)
        age = extract_column(2, data, upper_bound, num_cols)
        games = extract_column(3, data, upper_bound, num_cols)
        games_started = extract_column(4, data, upper_bound, num_cols)
        innings_pitched = extract_column(5, data, upper_bound, num_cols)
        strikeouts = extract_column(6, data, upper_bound, num_cols)
        walks = extract_column(7, data, upper_bound, num_cols)
        homeruns = extract_column(8, data, upper_bound, num_cols)
        hits = extract_column(9, data, upper_bound, num_cols)
        runs = extract_column(10, data, upper_bound, num_cols)
        earned_runs = extract_column(11, data, upper_bound, num_cols)
        pitch_data = pd.DataFrame(list(zip(names, 
                                           team,
                                           throws,
                                           age,
                                           games,
                                           games_started,
                                           innings_pitched,
                                           strikeouts,
                                           walks,
                                           homeruns,
                                           hits,
                                           runs,
                                           earned_runs)),
                                 columns = headers)
        pitch_list.append(pitch_data)
    pitching = pd.concat(pitch_list, ignore_index = True, axis = 0)
    return pitching
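
Pulling each column out by index works, but the same flat list can also be cut into rows, which avoids naming every column. This is a minimal sketch assuming the cells arrive as header cells followed by row cells, as they do above; `cells_to_rows` is a hypothetical helper name.

```python
def cells_to_rows(cells, num_cols):
    """Split [header cells..., row 1 cells..., row 2 cells...] into (headers, rows)."""
    headers = cells[:num_cols]
    body = cells[num_cols:]
    n_rows = len(body) // num_cols  # any incomplete trailing row is ignored
    rows = [body[r * num_cols:(r + 1) * num_cols] for r in range(n_rows)]
    return headers, rows

cells = ["Player", "T", "Age",
         "Kevin Gausman", "R", "26",
         "Dylan Bundy", "R", "24"]
headers, rows = cells_to_rows(cells, 3)
```

From there, `pd.DataFrame(rows, columns=headers)` would build the team's dataframe, and the team name could be added afterwards with `DataFrame.insert(1, "Team", ...)`.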

In [10]:
#Call function to get all pitching counting stats for 2017, table has 12 columns
pitch_data = extract_pitching(soups, "Pitchers, Counting Stats", 12)

In [11]:
#Function to get pitching rates
def extract_pitching_rates(soup_list, text_id, num_cols):
    pitch_list = []
    for j in soup_list:
        data_crude = j.find(text = text_id).findNext().text.split("\n")
        data_crude2 = []
        for i in data_crude:
            if i != '':
                data_crude2.append(i)
        headers = data_crude2[:num_cols]
        data = data_crude2[num_cols:]
        upper_bound = int(len(data)/num_cols)
        names = extract_column(0, data, upper_bound, num_cols)
        IPs = extract_column(1, data, upper_bound, num_cols)
        TBFs = extract_column(2, data, upper_bound, num_cols)
        K_pcts = extract_column(3, data, upper_bound, num_cols)
        BB_pcts = extract_column(4, data, upper_bound, num_cols)
        BABIPs = extract_column(5, data, upper_bound, num_cols)
        ERAs = extract_column(6, data, upper_bound, num_cols)
        FIPs = extract_column(7, data, upper_bound, num_cols)
        ERA_minus = extract_column(8, data, upper_bound, num_cols)
        FIP_minus = extract_column(9, data, upper_bound, num_cols)
        pitch_data = pd.DataFrame(list(zip(names, 
                                           IPs,
                                           TBFs,
                                           K_pcts,
                                           BB_pcts,
                                           BABIPs,
                                           ERAs,
                                           FIPs,
                                           ERA_minus,
                                           FIP_minus)),
                                 columns = headers)
        pitch_list.append(pitch_data)
    pitching = pd.concat(pitch_list, ignore_index = True, axis = 0)
    return pitching

In [12]:
#Call function to get all pitching rate stats for 2017, table has 10 columns
pitch_rates_data = extract_pitching_rates(soups, "Pitchers, Rates and Averages", 10)

In [13]:
#Function to get other pitching stats
def extract_pitching_other(soup_list, text_id, num_cols):
    pitch_list = []
    for j in soup_list:
        data_crude = j.find(text = text_id).findNext().text.split("\n")
        data_crude2 = []
        for i in data_crude:
            if i != '':
                data_crude2.append(i)
        headers = data_crude2[:num_cols]
        data = data_crude2[num_cols:]
        upper_bound = int(len(data)/num_cols)
        names = extract_column(0, data, upper_bound, num_cols)
        IPs = extract_column(1, data, upper_bound, num_cols)
        K_9s = extract_column(2, data, upper_bound, num_cols)
        BB_9s = extract_column(3, data, upper_bound, num_cols)
        HR_9s = extract_column(4, data, upper_bound, num_cols)
        ERA_plus = extract_column(5, data, upper_bound, num_cols)
        zWARs = extract_column(6, data, upper_bound, num_cols)
        competition = extract_column(7, data, upper_bound, num_cols)
        pitch_data = pd.DataFrame(list(zip(names, 
                                           IPs,
                                           K_9s,
                                           BB_9s,
                                           HR_9s,
                                           ERA_plus,
                                           zWARs,
                                           competition)),
                                 columns = headers)
        pitch_list.append(pitch_data)
    pitching = pd.concat(pitch_list, ignore_index = True, axis = 0)
    return pitching

In [14]:
#Call function to get all other pitching stats for 2017, table has 8 columns
pitch_other_data = extract_pitching_other(soups, "Pitchers, Assorted Other", 8)

In [15]:
#combine counting, rate, and other data into one Pandas dataframe
pitch_rates_data.drop(columns = ['Player', 'IP'], inplace = True)
pitch_other_data.drop(columns = ['Player', 'IP', 'No. 1 Comp'], inplace = True)
#Start with counting and rates data
partial = pd.concat([pitch_data, pitch_rates_data], axis = 1)
#Add in other data
full = pd.concat([partial, pitch_other_data], axis = 1)
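
Concatenating with axis = 1 pairs rows purely by position, so it is only correct if all three tables list the same pitchers in the same order. A quick check along these lines, run before the Player columns are dropped, would catch a mismatch. `misaligned_rows` is a made-up helper that works on any two sequences of names, including pandas columns.

```python
def misaligned_rows(left_names, right_names):
    """Return the positions where two equally long name sequences disagree."""
    left = list(left_names)
    right = list(right_names)
    if len(left) != len(right):
        raise ValueError(f"row counts differ: {len(left)} vs {len(right)}")
    return [i for i, (a, b) in enumerate(zip(left, right)) if a != b]

# Toy example: the second and third rows are swapped in the right-hand table.
bad = misaligned_rows(["Kevin Gausman", "Dylan Bundy", "Chris Tillman"],
                      ["Kevin Gausman", "Chris Tillman", "Dylan Bundy"])
```

In the notebook this might be called as `misaligned_rows(pitch_data['Player'], pitch_rates_data['Player'])`, with an empty list meaning the rows line up.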

In [16]:
#Check columns and preview dataframe
full.columns
full.head(6)

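|  | Player | Team | T | Age | G | GS | IP | K | BB | HR | ... | BABIP | ERA | FIP | ERA- | FIP- | K/9 | BB/9 | HR/9 | ERA+ | zWAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kevin Gausman | Baltimore Orioles | R | 26 | 31 | 31 | 176.0 | 162 | 55 | 21 | ... | 0.297 | 3.94 | 3.82 | 92 | 88 | 8.28 | 2.81 | 1.07 | 104 | 2.8 |
| 1 | Dylan Bundy | Baltimore Orioles | R | 24 | 19 | 19 | 109.3 | 108 | 36 | 13 | ... | 0.298 | 3.7 | 3.75 | 86 | 86 | 8.89 | 2.96 | 1.07 | 110 | 2.1 |
| 2 | Chris Tillman | Baltimore Orioles | R | 29 | 29 | 29 | 167.0 | 133 | 60 | 21 | ... | 0.287 | 4.31 | 4.3 | 100 | 99 | 7.17 | 3.23 | 1.13 | 95 | 1.9 |
| 3 | Zach Britton | Baltimore Orioles | L | 29 | 66 | 0 | 64.0 | 73 | 18 | 5 | ... | 0.276 | 2.25 | 2.64 | 52 | 61 | 10.27 | 2.53 | 0.7 | 181 | 1.7 |
| 4 | Mychal Givens | Baltimore Orioles | R | 27 | 63 | 0 | 79.3 | 101 | 31 | 7 | ... | 0.298 | 2.95 | 3.03 | 69 | 70 | 11.46 | 3.52 | 0.79 | 138 | 1.4 |
| 5 | Brad Brach | Baltimore Orioles | R | 31 | 63 | 0 | 74.0 | 87 | 29 | 8 | ... | 0.278 | 3.04 | 3.25 | 71 | 75 | 10.58 | 3.53 | 0.97 | 134 | 1.2 |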


In [17]:
# Print full 2017 pitching data to csv, 
# commented out so datafile is not overwritten each time code is run
# full.to_csv('projected_2017.csv', index = False)

In [18]:
#Rerun code to get 2018 data.
responses = get_data(list_urls_2018) 
soups = make_soup(responses)
pitch_data = extract_pitching(soups, "Pitchers, Counting Stats", 12)
pitch_rates_data = extract_pitching_rates(soups, "Pitchers, Rates and Averages", 10)
pitch_other_data = extract_pitching_other(soups, "Pitchers, Assorted Other", 8)
pitch_rates_data.drop(columns = ['Player', 'IP'], inplace = True)
pitch_other_data.drop(columns = ['Player', 'IP', 'No. 1 Comp'], inplace = True)
partial = pd.concat([pitch_data, pitch_rates_data], axis = 1)
full = pd.concat([partial, pitch_other_data], axis = 1)
#full.to_csv('projected_2018.csv', index = False)

2019 data! The tables are laid out differently, and one team's rates table has one fewer column than the others.

In [19]:
#Get source for 2019 data.
responses = get_data(list_urls_2019) 
soups = make_soup(responses)

In [20]:
#Function to get pitching counting stats. Many more columns present than in 2017/2018.
def extract_pitching_2019(soup_list, text_id, num_cols):
    pitch_list = []
    for j in soup_list:
        data_crude = j.find(text = text_id).findNext().text.split("\n")
        data_crude2 = []
        for i in data_crude:
            if i != '':
                data_crude2.append(i)
        data = data_crude2[num_cols:]
        headers = data_crude2[:num_cols]
        headers.insert(1, "Team")
        team_data = j.find('title').text
        upper_bound = int(len(data)/num_cols)
        names = extract_column(0, data, upper_bound, num_cols)
        team = team_name(team_data, upper_bound)
        throws = extract_column(1, data, upper_bound, num_cols)
        age = extract_column(2, data, upper_bound, num_cols)
        wins = extract_column(3, data, upper_bound, num_cols)
        losses = extract_column(4, data, upper_bound, num_cols)
        ERA = extract_column(5, data, upper_bound, num_cols)
        games = extract_column(6, data, upper_bound, num_cols)
        games_started = extract_column(7, data, upper_bound, num_cols)
        innings_pitched = extract_column(8, data, upper_bound, num_cols)
        hits = extract_column(9, data, upper_bound, num_cols)
        earned_runs = extract_column(10, data, upper_bound, num_cols)
        homeruns = extract_column(11, data, upper_bound, num_cols)
        walks = extract_column(12, data, upper_bound, num_cols)
        strikeouts = extract_column(13, data, upper_bound, num_cols)
        pitch_data = pd.DataFrame(list(zip(names, 
                                           team,
                                           throws,
                                           age,
                                           wins, 
                                           losses, 
                                           ERA, 
                                           games, 
                                           games_started, 
                                           innings_pitched,
                                           hits, 
                                           earned_runs, 
                                           homeruns, 
                                           walks, 
                                           strikeouts)),
                                 columns = headers)
        pitch_list.append(pitch_data)
    pitching = pd.concat(pitch_list, ignore_index = True, axis = 0)
    return pitching

In [21]:
#Call function to get all pitching counting stats for 2019, table has 14 columns
pitch_data = extract_pitching_2019(soups, 'Pitchers – Counting Stats', 14)

In [22]:
#Function to get pitching rates. Note this doesn't work for one team's data.
def extract_pitching_rates_2019(soup_list, text_id, num_cols):
    pitch_list = []
    for j in soup_list:
        data_crude = j.find(text = text_id).parent.parent.parent.text.split("\n")
        data_crude2 = []
        for i in data_crude:
            if i != '':
                data_crude2.append(i)
        data = data_crude2[num_cols:]
        headers = data_crude2[:num_cols]
        upper_bound = int(len(data)/num_cols)
        names = extract_column(0, data, upper_bound, num_cols)
        IPs = extract_column(1, data, upper_bound, num_cols)
        TBFs = extract_column(2, data, upper_bound, num_cols)
        K_9s = extract_column(3, data, upper_bound, num_cols)
        BB_9s = extract_column(4, data, upper_bound, num_cols)
        HR_9s = extract_column(5, data, upper_bound, num_cols)
        BABIPs = extract_column(6, data, upper_bound, num_cols)
        ERA_plus = extract_column(7, data, upper_bound, num_cols)
        ERA_minus = extract_column(8, data, upper_bound, num_cols)
        FIPs = extract_column(9, data, upper_bound, num_cols)
        WAR = extract_column(10, data, upper_bound, num_cols)
        pitch_data = pd.DataFrame(list(zip(names, 
                                           IPs,
                                           TBFs,
                                           K_9s,
                                           BB_9s,
                                           HR_9s,
                                           BABIPs,
                                           ERA_plus,
                                           ERA_minus,                                           
                                           FIPs,                                        
                                           WAR)),
                                 columns = headers)
        pitch_list.append(pitch_data)
    pitching = pd.concat(pitch_list, ignore_index = True, axis = 0)
    return pitching

In [23]:
#For some reason, soups[4] has fewer cols. Leave it out of the list here and splice it back in later.
new_soup = soups[0:4] + soups[5:] 

In [24]:
#Call function to get all (but one team's) pitching rate stats for 2019, table has 11 columns
pitch_rates_data = extract_pitching_rates_2019(new_soup, 'TBF', 11)

In [25]:
#Function to get pitching rates. Note this is particular for one team's data. Doesn't
#contain ERA_minus data.
def extract_pitching_rates_2019_4(soup_list, text_id, num_cols):
    pitch_list = []
    for j in soup_list:
        data_crude = j.find(text = text_id).parent.parent.parent.text.split("\n")
        data_crude2 = []
        for i in data_crude:
            if i != '':
                data_crude2.append(i)
        data = data_crude2[num_cols:]
        headers = data_crude2[:num_cols]
        upper_bound = int(len(data)/num_cols)
        names = extract_column(0, data, upper_bound, num_cols)
        IPs = extract_column(1, data, upper_bound, num_cols)
        TBFs = extract_column(2, data, upper_bound, num_cols)
        K_9s = extract_column(3, data, upper_bound, num_cols)
        BB_9s = extract_column(4, data, upper_bound, num_cols)
        HR_9s = extract_column(5, data, upper_bound, num_cols)
        BABIPs = extract_column(6, data, upper_bound, num_cols)
        ERA_plus = extract_column(7, data, upper_bound, num_cols)
        FIPs = extract_column(8, data, upper_bound, num_cols)
        WAR = extract_column(9, data, upper_bound, num_cols)
        pitch_data = pd.DataFrame(list(zip(names, 
                                           IPs,
                                           TBFs,
                                           K_9s,
                                           BB_9s,
                                           HR_9s,
                                           BABIPs,
                                           ERA_plus,                                     
                                           FIPs,                                        
                                           WAR)),
                                 columns = headers)
        pitch_list.append(pitch_data)
    pitching = pd.concat(pitch_list, ignore_index = True, axis = 0)
    return pitching
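
The two 2019 rates functions differ only in the number of columns they expect. One possible way to avoid hard-coding 10 or 11 is to count the header cells. The sketch below assumes, without having confirmed it against the pages, that 'No. 1 Comp' is always the last header cell; `count_columns` and the shortened header rows are made up for illustration.

```python
def count_columns(cells, last_header="No. 1 Comp"):
    """Infer the table width from the position of the last header cell.

    Raises ValueError if last_header does not appear in cells.
    """
    return cells.index(last_header) + 1

# Shortened, hypothetical header rows followed by the start of a data row.
wide = ["Player", "TBF", "ERA+", "ERA-", "No. 1 Comp", "Some Pitcher", "700"]
narrow = ["Player", "TBF", "ERA+", "No. 1 Comp", "Some Pitcher", "700"]

print(count_columns(wide))    # 5
print(count_columns(narrow))  # 4
```

The result could then be passed as num_cols, so that one function handles both table layouts.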

In [26]:
#Work with just the one team's data that has a different number of columns
soups4 = [soups[4]]

In [27]:
#Call function to get one team's pitching rate stats for 2019, table has 10 columns
pitch_rates_data_4 = extract_pitching_rates_2019_4(soups4, 'TBF', 10)

In [28]:
#splicing the one team's data back in
pitch_rates_data_p1 = pitch_rates_data.iloc[0:175]
pitch_rates_data_p3 = pitch_rates_data.iloc[175:]
pitch_rates_combined = pd.concat([pitch_rates_data_p1, pitch_rates_data_4, pitch_rates_data_p3], ignore_index = True)
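
The 175 above is the number of rows that belong to the four teams before soups[4]. Rather than counting by hand, the position could be looked up from the Team column of the counting stats, assuming both tables list the same pitchers per team (the later axis = 1 concat relies on that too). `splice_offset` is a hypothetical helper, and the exact team string depends on what team_name returns for that page.

```python
def splice_offset(team_column, held_out_team):
    """Return the index of the first row that belongs to held_out_team."""
    return list(team_column).index(held_out_team)

# Toy Team column: two rows, then three rows, then the held-out team.
teams = (["Baltimore Orioles"] * 2
         + ["Chicago White Sox"] * 3
         + ["Cleveland Indians"] * 2)
offset = splice_offset(teams, "Cleveland Indians")  # 5
```

In the notebook the call might be `splice_offset(pitch_data['Team'], 'Cleveland Indians')`, with the result used in place of 175 in both iloc slices.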

In [29]:
#removing redundant columns and No.1 Comp column.
pitch_rates_combined.drop(columns = ['Player', 'No. 1 Comp'], inplace = True)

In [30]:
full_2019 = pd.concat([pitch_data, pitch_rates_combined], axis = 1)

In [31]:
#Reordering dataframe to match 2017-2018 column order
projected_2019 = full_2019[['Player', 'Team', 'T', 'Age', 'G', 'GS', 'IP', 'SO', 
                            'BB', 'HR', 'H', 'ER', 'TBF', 'BABIP', 'ERA', 'FIP', 
                            'ERA-', 'K/9', 'BB/9', 'HR/9', 'ERA+', 'WAR', 'W', 'L']]

In [32]:
projected_2019.shape

(1426, 24)

In [None]:
# Print full 2019 pitching data to csv, 
# commented out so datafile is not overwritten each time code is run
#projected_2019.to_csv('projected_2019.csv', index = False)