Baseball Prediction: 5a - Getting (Raw) Individual Pitcher Data

In the previous notebook, we compared our simple, hitting-only model to the Las Vegas odds. We concluded that incorporating the starting pitcher information would be a crucial next step to improve our model.

In this notebook we will learn how to scrape individual, game-level pitching data from Retrosheet. We will write a loop that downloads the data for each pitcher, which will let us augment our game-level dataframe with features derived from the starting pitcher's previous performance.

Let's start by going to Retrosheet and finding the stats for Corey Kluber (one of my favorite pitchers from my childhood).

www.retrosheet.org

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_columns',1000)
pd.set_option('display.max_rows',1000)

import lxml
import html5lib
from urllib.request import urlopen
import time

from bs4 import BeautifulSoup
import requests



In [2]:
url = 'https://www.retrosheet.org/boxesetc/2016/Kklubc0010062016.htm'
page = requests.get(url)

In [3]:
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "https://www.w3.org/TR/REC-html40/strict.dtd">

<html dir="LTR" lang="EN">
<pre><a href="../MISC/Kdescr.htm">Read Me</a></pre>
<head>
<title>The 2016 CLE A Regular Season Pitching Log for Corey Kluber</title>
<link href="https://www.retrosheet.org/menubar/menubar.css" rel="stylesheet" type="text/css"/>
<script src="https://www.retrosheet.org/menubar/menubar.js" type="text/javascript"></script>
</head>
<body>
<p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.

In [4]:
soup1 = list(soup.children)[-1]
soup1

<html dir="LTR" lang="EN">
<pre><a href="../MISC/Kdescr.htm">Read Me</a></pre>
<head>
<title>The 2016 CLE A Regular Season Pitching Log for Corey Kluber</title>
<link href="https://www.retrosheet.org/menubar/menubar.css" rel="stylesheet" type="text/css"/>
<script src="https://www.retrosheet.org/menubar/menubar.js" type="text/javascript"></script>
</head>
<body>
<p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
<li><a href="https://www.retrosheet.org

In [5]:
soup2 = list(soup1.children)[-1]
soup2

<body>
<p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Coaches">Coaches</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Umpires">Umpires</a>
<li><a href="https://www.retrosheet.org/transactions/index.html">Transactions</a>
</li></li></li></li></li></ul>
<li><a href="#">Games →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html">Regular season</a>
<li><a h

In [6]:
soup3 = list(soup2.children)
soup3

['\n',
 <p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>,
 '\n',
 <div class="mbcenter">
 <ul class="nav">
 <li><a href="https://www.retrosheet.org/">Home</a>
 <li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
 <li><a href="#">Games/People/Parks ↓</a>
 <ul>
 <li><a href="#">People →</a>
 <ul>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Coaches">Coaches</a>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Umpires">Umpires</a>
 <li><a href="https://www.retrosheet.org/transactions/index.html">Transactions</a>
 </li></li></li></li></li></ul>
 <li><a href="#">Games →</a>
 <ul>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html">R

In [7]:
index_num = np.where(["Opponent" in str(x) for x in soup3])[0][0]
index_num

12

In [8]:
soup4 = soup3[index_num]
soup4

<pre>   Date    #         Opponent  GS  CG SHO  GF  SV  IP     H  BFP  HR   R  ER  BB  IB  SO  SH  SF  WP HBP  BK  2B  3B GDP ROE   W   L    ERA
<a href="../2016/04052016.htm"> 4- 5-2016</a>   <a href="../2016/B04050CLE2016.htm">BOX+PBP</a> VS BOS A   1   0   0   0   0   5.1   9   27   1   4   4   2   0   5   0   0   1   0   0   1   0   0   0   0   1   6.75
<a href="../2016/04122016.htm"> 4-12-2016</a>   <a href="../2016/B04120TBA2016.htm">BOX+PBP</a> AT TB  A   1   0   0   0   0   7.2   4   28   1   3   3   2   0   6   0   0   0   1   0   1   0   1   0   0   1   4.85
<a href="../2016/04172016.htm"> 4-17-2016</a>   <a href="../2016/B04170CLE2016.htm">BOX+PBP</a> VS NY  N   1   0   0   0   0   6     9   28   0   6   6   1   0   8   0   0   0   0   0   3   1   0   0   0   1   6.16
<a href="../2016/04232016.htm"> 4-23-2016</a>   <a href="../2016/B04230DET2016.htm">BOX+PBP</a> AT DET A   1   0   0   0   0   8     2   26   1   1   1   0   0  10   0   0   0   0   0   0   0   1   1   1   0   

In [9]:
soup5 = list(soup4.children)
soup5

['   Date    #         Opponent  GS  CG SHO  GF  SV  IP     H  BFP  HR   R  ER  BB  IB  SO  SH  SF  WP HBP  BK  2B  3B GDP ROE   W   L    ERA\n',
 <a href="../2016/04052016.htm"> 4- 5-2016</a>,
 '   ',
 <a href="../2016/B04050CLE2016.htm">BOX+PBP</a>,
 ' VS BOS A   1   0   0   0   0   5.1   9   27   1   4   4   2   0   5   0   0   1   0   0   1   0   0   0   0   1   6.75\n',
 <a href="../2016/04122016.htm"> 4-12-2016</a>,
 '   ',
 <a href="../2016/B04120TBA2016.htm">BOX+PBP</a>,
 ' AT TB  A   1   0   0   0   0   7.2   4   28   1   3   3   2   0   6   0   0   0   1   0   1   0   1   0   0   1   4.85\n',
 <a href="../2016/04172016.htm"> 4-17-2016</a>,
 '   ',
 <a href="../2016/B04170CLE2016.htm">BOX+PBP</a>,
 ' VS NY  N   1   0   0   0   0   6     9   28   0   6   6   1   0   8   0   0   0   0   0   3   1   0   0   0   1   6.16\n',
 <a href="../2016/04232016.htm"> 4-23-2016</a>,
 '   ',
 <a href="../2016/B04230DET2016.htm">BOX+PBP</a>,
 ' AT DET A   1   0   0   0   0   8     2   26   1  

In [10]:
for i in range(12):
    print(soup5[i].get_text().split())

['Date', '#', 'Opponent', 'GS', 'CG', 'SHO', 'GF', 'SV', 'IP', 'H', 'BFP', 'HR', 'R', 'ER', 'BB', 'IB', 'SO', 'SH', 'SF', 'WP', 'HBP', 'BK', '2B', '3B', 'GDP', 'ROE', 'W', 'L', 'ERA']
['4-', '5-2016']
[]
['BOX+PBP']
['VS', 'BOS', 'A', '1', '0', '0', '0', '0', '5.1', '9', '27', '1', '4', '4', '2', '0', '5', '0', '0', '1', '0', '0', '1', '0', '0', '0', '0', '1', '6.75']
['4-12-2016']
[]
['BOX+PBP']
['AT', 'TB', 'A', '1', '0', '0', '0', '0', '7.2', '4', '28', '1', '3', '3', '2', '0', '6', '0', '0', '0', '1', '0', '1', '0', '1', '0', '0', '1', '4.85']
['4-17-2016']
[]
['BOX+PBP']
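
The printed children follow a repeating pattern: one header line, then groups of four items per game (date link, doubleheader marker, box score link, stats text). A minimal sketch of that grouping is below. It uses plain strings as stand-ins for the BeautifulSoup objects so it runs without a network call; the `group_games` helper and the `'header line\n'` placeholder are illustrative assumptions, not part of the notebook.

```python
# Stand-in for the children of the <pre> tag. Plain strings replace the
# bs4 objects, and the header is a placeholder (assumption for illustration).
children = [
    'header line\n',
    ' 4- 5-2016', '   ', 'BOX+PBP',
    ' VS BOS A   1   0   0   0   0   5.1   9   27   1   4   4   2   0   5   0   0   1   0   0   1   0   0   0   0   1   6.75\n',
    ' 4-12-2016', '   ', 'BOX+PBP',
    ' AT TB  A   1   0   0   0   0   7.2   4   28   1   3   3   2   0   6   0   0   0   1   0   1   0   1   0   0   1   4.85\n',
]

def group_games(children):
    """Return one (date, doubleheader_marker, stat_fields) tuple per game."""
    body = children[1:]  # drop the header line
    games = []
    # step through the remaining items four at a time
    for start in range(0, len(body) - 3, 4):
        date_text, dbl_marker, _box_link, stats_text = body[start:start + 4]
        games.append((date_text.strip(), dbl_marker.strip(), stats_text.split()))
    return games

games = group_games(children)
for date_text, dbl_marker, fields in games:
    print(date_text, repr(dbl_marker), len(fields))
```

Each stats line splits into 29 fields because the page's single Opponent column becomes three tokens (for example `VS`, `BOS`, `A`).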


In [11]:
## Given the URL for a specific pitcher and season,
## we scrape the game log and process it a bit
def get_season_pitching_data(url):
    time.sleep(1)  # pause before each request to go easy on the server
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    html = list(soup.children)[-1]
    body = list(html.children)[-1]
    sec_next = list(body.children)
    # the game log is the first child of <body> that mentions "Opponent"
    secnum = np.where(["Opponent" in str(x) for x in sec_next])[0][0]
    key_section = sec_next[secnum]
    working_part = list(key_section.children)
    p_header = working_part[0].strip().split()  # header as printed on the page (not used below)
    # the page's Opponent column splits into three fields: at_vs, Opponent, League
    mod_header = ['at_vs','Opponent','League', 'GS', 'CG', 'SHO', 'GF', 'SV', 'IP', 'H',
            'BFP', 'HR', 'R', 'ER', 'BB', 'IB', 'SO', 'SH', 'SF', 'WP', 'HBP',
            'BK', '2B', '3B', 'GDP', 'ROE', 'W', 'L', 'ERA']

    # after the header, the children repeat in groups of four:
    # date link, doubleheader marker, box score link, stats text
    date_list = []
    day_href_list = []  # collected but not used below
    for k in range(1,len(working_part),4):
        date_list.append(working_part[k].get_text().strip())
        day_href_list.append(working_part[k].attrs['href'])

    dblhead_num_list = []
    for k in range(2,len(working_part),4):
        dblhead_num_list.append(working_part[k].strip())

    game_href_list = []  # collected but not used below
    for k in range(3,len(working_part),4):
        game_href_list.append(working_part[k].attrs['href'])

    main_data_matrix = []
    for k in range(4,len(working_part),4):
        # keep the first 29 fields so each row lines up with mod_header
        main_data_row = (working_part[k].strip().split())[:29]
        main_data_matrix.append(main_data_row)

    out_df = pd.DataFrame(main_data_matrix, columns = mod_header)
    out_df['Date'] = date_list
    out_df['dblhead_num'] = dblhead_num_list
    return out_df

In [12]:
get_season_pitching_data(url)

Unnamed: 0,at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
0,VS,BOS,A,1,0,0,0,0,5.1,9,27,1,4,4,2,0,5,0,0,1,0,0,1,0,0,0,0,1,6.75,4- 5-2016,
1,AT,TB,A,1,0,0,0,0,7.2,4,28,1,3,3,2,0,6,0,0,0,1,0,1,0,1,0,0,1,4.85,4-12-2016,
2,VS,NY,N,1,0,0,0,0,6.0,9,28,0,6,6,1,0,8,0,0,0,0,0,3,1,0,0,0,1,6.16,4-17-2016,
3,AT,DET,A,1,0,0,0,0,8.0,2,26,1,1,1,0,0,10,0,0,0,0,0,0,0,1,1,1,0,4.67,4-23-2016,
4,AT,PHI,N,1,0,0,0,0,7.0,5,25,0,3,2,0,0,6,1,0,0,0,0,2,0,2,1,0,0,4.24,4-29-2016,
5,VS,DET,A,1,1,1,0,0,9.0,5,32,0,0,0,2,0,7,0,0,0,0,0,1,0,2,0,1,0,3.35,5- 4-2016,
6,AT,HOU,A,1,0,0,0,0,2.2,5,16,0,5,5,3,0,3,0,0,0,0,0,2,0,0,0,0,1,4.14,5- 9-2016,
7,VS,MIN,A,1,0,0,0,0,6.2,7,30,1,4,4,3,0,7,0,1,0,1,0,1,0,1,0,0,1,4.3,5-14-2016,
8,AT,BOS,A,1,0,0,0,0,7.0,5,28,1,2,2,2,0,6,1,0,1,0,0,2,0,1,1,1,0,4.1,5-20-2016,
9,AT,CHI,A,1,0,0,0,0,7.1,7,30,0,2,1,1,0,9,0,0,0,0,0,0,0,0,0,1,0,3.78,5-25-2016,


In [13]:
url = 'https://www.retrosheet.org/boxesetc/K/Pklubc001.htm'
page = requests.get(url)
sup = BeautifulSoup(page.content, 'html.parser')
sup

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "https://www.w3.org/TR/REC-html40/strict.dtd">

<html dir="LTR" lang="EN">
<pre><a href="../MISC/Pdescr.htm">Read Me</a></pre>
<head>
<title>Corey Kluber</title>
<link href="https://www.retrosheet.org/menubar/menubar.css" rel="stylesheet" type="text/css"/>
<script src="https://www.retrosheet.org/menubar/menubar.js" type="text/javascript"></script>
</head>
<body>
<p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Ma

In [14]:
sup2 = list(sup.children)[2]
sup2

<html dir="LTR" lang="EN">
<pre><a href="../MISC/Pdescr.htm">Read Me</a></pre>
<head>
<title>Corey Kluber</title>
<link href="https://www.retrosheet.org/menubar/menubar.css" rel="stylesheet" type="text/css"/>
<script src="https://www.retrosheet.org/menubar/menubar.js" type="text/javascript"></script>
</head>
<body>
<p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Coaches">Coaches</a>
<li><

In [15]:
sup3 = list(sup2.children)[5]
sup3

<body>
<p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Coaches">Coaches</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Umpires">Umpires</a>
<li><a href="https://www.retrosheet.org/transactions/index.html">Transactions</a>
</li></li></li></li></li></ul>
<li><a href="#">Games →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html">Regular season</a>
<li><a h

In [16]:
# Plan - find the <pre> tag that starts with 'Pitching Record' (after stripping whitespace)
# Get the href attribute for all the <a> tags with the word "Daily"

pre_tags = [x for x in sup3.find_all('pre')]
pre_tag_text = [x.get_text().strip() for x in pre_tags]
pre_tag_text

['Top Performances',
 'Batter Matchups',
 'Batting Record\nYear Team                     G    AB    R    H  2B  3B  HR  RBI   BB IBB   SO HBP  SH  SF  XI ROE GDP   SB  CS   AVG   OBP   SLG   BFW Year Team\n2011 CLE A                    3     0    0    0   0   0   0    0    0   0    0   0   0   0   0   0   0    0   0     -     -     -   0.0 2011 CLE A\n2012 CLE A                   12     0    0    0   0   0   0    0    0   0    0   0   0   0   0   0   0    0   0     -     -     -   0.0 2012 CLE A\n2013 CLE A    Daily Splits   26     2    1    0   0   0   0    0    1   0    1   0   0   0   0   0   0    0   0  .000  .333  .000   0.0 2013 CLE A\n2014 CLE A    Daily Splits   36     5    0    1   0   0   0    0    0   0    2   0   1   0   0   0   0    0   0  .200  .200  .200   0.0 2014 CLE A\n2015 CLE A    Daily Splits   32     6    0    0   0   0   0    0    0   0    3   0   0   0   0   0   1    0   0  .000  .000  .000   0.0 2015 CLE A\n2016 CLE A    Daily Splits   32     4    1    1   1   

In [17]:
np.where([x.startswith('Pitching Record') for x in pre_tag_text])[0][0]

8
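
As an alternative to `np.where(...)[0][0]`, the same lookup can be sketched with `next` and `enumerate`, which returns `None` instead of raising an `IndexError` when no block matches. The list below is a shortened stand-in for `pre_tag_text` (so the index differs from the 8 shown above), and `first_index_startswith` is a hypothetical helper name.

```python
# Shortened stand-in for pre_tag_text; the real list has more entries
# and much longer strings.
pre_tag_text = [
    'Top Performances',
    'Batter Matchups',
    'Batting Record\nYear Team ...',
    'Pitching Record\nYear Team ...',
]

def first_index_startswith(texts, prefix):
    """Return the index of the first string starting with prefix, else None."""
    return next((i for i, text in enumerate(texts) if text.startswith(prefix)), None)

ind = first_index_startswith(pre_tag_text, 'Pitching Record')
print(ind)  # 3 for this stand-in list
```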

In [18]:
ind = np.where([x.startswith('Pitching Record') for x in pre_tag_text])[0][0]
a_tags = pre_tags[ind].find_all('a')
a_tags

[<a href="../2011/Y_2011.htm">2011</a>,
 <a href="../2011/TCLE02011.htm">CLE A</a>,
 <a href="../2011/Kklubc0010012011.htm">Daily</a>,
 <a href="../2011/Lklubc0010012011.htm">Splits</a>,
 <a href="../2011/Y_2011.htm">2011</a>,
 <a href="../2011/TCLE02011.htm">CLE A</a>,
 <a href="../2012/Y_2012.htm">2012</a>,
 <a href="../2012/TCLE02012.htm">CLE A</a>,
 <a href="../2012/Kklubc0010022012.htm">Daily</a>,
 <a href="../2012/Lklubc0010022012.htm">Splits</a>,
 <a href="../2012/Y_2012.htm">2012</a>,
 <a href="../2012/TCLE02012.htm">CLE A</a>,
 <a href="../2013/Y_2013.htm">2013</a>,
 <a href="../2013/TCLE02013.htm">CLE A</a>,
 <a href="../2013/Kklubc0010032013.htm">Daily</a>,
 <a href="../2013/Lklubc0010032013.htm">Splits</a>,
 <a href="../2013/Y_2013.htm">2013</a>,
 <a href="../2013/TCLE02013.htm">CLE A</a>,
 <a href="../2014/Y_2014.htm">2014</a>,
 <a href="../2014/TCLE02014.htm">CLE A</a>,
 <a href="../2014/Kklubc0010042014.htm">Daily</a>,
 <a href="../2014/Lklubc0010042014.htm">Splits</a>,


In [19]:
links = [x.attrs['href'] for x in a_tags if x.get_text()=='Daily']
links

['../2011/Kklubc0010012011.htm',
 '../2012/Kklubc0010022012.htm',
 '../2013/Kklubc0010032013.htm',
 '../2014/Kklubc0010042014.htm',
 '../2015/Kklubc0010052015.htm',
 '../2016/Kklubc0010062016.htm',
 '../2017/Kklubc0010072017.htm',
 '../2018/Kklubc0010082018.htm',
 '../2019/Kklubc0010092019.htm',
 '../2020/Kklubc0010102020.htm',
 '../2021/Kklubc0010112021.htm',
 '../2022/Kklubc0010122022.htm']
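
These links are relative to the player page, so they need to be turned into full URLs before they can be requested. One way to do that, sketched here with the standard library's `urljoin`, is to resolve each link against the page it came from. The variable names are illustrative; the URLs are the ones shown above.

```python
from urllib.parse import urljoin

# The page the links were scraped from
player_page = 'https://www.retrosheet.org/boxesetc/K/Pklubc001.htm'

# Two of the relative links from the output above
relative_links = [
    '../2015/Kklubc0010052015.htm',
    '../2016/Kklubc0010062016.htm',
]

# urljoin resolves '..' against the directory of the base page
absolute_links = [urljoin(player_page, link) for link in relative_links]
for link in absolute_links:
    print(link)
```

Resolving against the source page avoids depending on every link starting with exactly `../`.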

In [20]:
### Get the links to the pitcher-season tables given the pitcher id
def get_daily_season_links(pitcher_id):
    letter = pitcher_id.upper()[0]
    url_prefix = 'https://www.retrosheet.org/boxesetc/'
    url = url_prefix+letter+'/P'+pitcher_id+'.htm'
    time.sleep(1)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    html=list(soup.children)
    body = list(html[2].children)[5]
    pre_texts = [x for x in body.find_all('pre')]
    secnum = np.where([x.get_text().strip().startswith('Pitching Record') for x in pre_texts])[0][0]
    a_pre_texts = pre_texts[secnum].find_all('a')
    daily_season_links = [url_prefix+x.attrs['href'][3:] for x in a_pre_texts if x.get_text()=='Daily']
    return(daily_season_links)

In [21]:
get_daily_season_links('klubc001')

['https://www.retrosheet.org/boxesetc/2011/Kklubc0010012011.htm',
 'https://www.retrosheet.org/boxesetc/2012/Kklubc0010022012.htm',
 'https://www.retrosheet.org/boxesetc/2013/Kklubc0010032013.htm',
 'https://www.retrosheet.org/boxesetc/2014/Kklubc0010042014.htm',
 'https://www.retrosheet.org/boxesetc/2015/Kklubc0010052015.htm',
 'https://www.retrosheet.org/boxesetc/2016/Kklubc0010062016.htm',
 'https://www.retrosheet.org/boxesetc/2017/Kklubc0010072017.htm',
 'https://www.retrosheet.org/boxesetc/2018/Kklubc0010082018.htm',
 'https://www.retrosheet.org/boxesetc/2019/Kklubc0010092019.htm',
 'https://www.retrosheet.org/boxesetc/2020/Kklubc0010102020.htm',
 'https://www.retrosheet.org/boxesetc/2021/Kklubc0010112021.htm',
 'https://www.retrosheet.org/boxesetc/2022/Kklubc0010122022.htm']

In [22]:
get_season_pitching_data(get_daily_season_links('klubc001')[4])

Unnamed: 0,at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
0,AT,HOU,A,1,0,0,0,0,7.1,3,26,0,2,2,2,0,7,0,0,0,0,0,0,0,0,0,0,1,2.45,4- 6-2015,
1,VS,DET,A,1,0,0,0,0,6.1,7,25,1,2,2,1,0,10,0,0,0,0,0,0,0,1,0,0,0,2.63,4-11-2015,
2,AT,MIN,A,1,0,0,0,0,8.0,3,30,0,2,2,1,1,8,2,0,2,1,0,0,0,1,2,0,0,2.49,4-17-2015,
3,AT,CHI,A,1,0,0,0,0,6.0,13,30,1,6,6,1,0,6,0,0,2,0,0,4,0,1,0,0,1,3.9,4-22-2015,
4,VS,KC,A,1,0,0,0,0,6.1,10,30,0,6,4,2,1,5,0,0,0,1,0,3,0,2,1,0,1,4.24,4-27-2015,
5,VS,TOR,A,1,0,0,0,0,5.0,8,24,1,5,4,2,0,3,0,0,0,0,0,3,0,1,0,0,1,4.62,5- 2-2015,
6,AT,KC,A,1,0,0,0,0,5.2,7,28,1,5,5,2,0,7,1,0,0,1,0,0,0,0,1,0,1,5.04,5- 7-2015,
7,VS,STL,N,1,0,0,0,0,8.0,1,26,0,0,0,0,0,18,0,0,0,1,0,0,0,0,0,1,0,4.27,5-13-2015,
8,AT,CHI,A,1,0,0,0,0,9.0,5,31,0,1,1,1,0,12,0,0,1,0,0,0,1,2,0,0,0,3.79,5-18-2015,
9,VS,CIN,N,1,0,0,0,0,8.0,9,31,0,1,1,0,0,7,0,1,0,0,0,3,0,1,0,1,0,3.49,5-23-2015,


In [23]:
# Get all the data for a particular pitcher
def get_full_pitching_data(pitcher_id):
    link_list = get_daily_season_links(pitcher_id)
    df_pitching = pd.DataFrame()
    for url in link_list:
        df_pitching = pd.concat((df_pitching, get_season_pitching_data(url)))
    return(df_pitching)

In [24]:
ck_data = get_full_pitching_data('klubc001')

In [25]:
ck_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 256 entries, 0 to 30
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   at_vs        256 non-null    object
 1   Opponent     256 non-null    object
 2   League       256 non-null    object
 3   GS           256 non-null    object
 4   CG           256 non-null    object
 5   SHO          256 non-null    object
 6   GF           256 non-null    object
 7   SV           256 non-null    object
 8   IP           256 non-null    object
 9   H            256 non-null    object
 10  BFP          256 non-null    object
 11  HR           256 non-null    object
 12  R            256 non-null    object
 13  ER           256 non-null    object
 14  BB           256 non-null    object
 15  IB           256 non-null    object
 16  SO           256 non-null    object
 17  SH           256 non-null    object
 18  SF           256 non-null    object
 19  WP           256 non-null    object
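
Every column comes back as `object` because the scraped values are still strings. Two of them need more than a plain numeric cast: `IP` uses box-score notation, where `5.1` means five and one-third innings, and `Date` contains padding spaces (`4- 5-2016`). The helpers below are a minimal sketch of how those two fields might be converted; `ip_to_outs` and `parse_game_date` are hypothetical names, and the sketch assumes the fractional digit of `IP` is always 0, 1, or 2.

```python
from datetime import datetime

def ip_to_outs(ip_text):
    """Convert innings pitched such as '5.1' into outs recorded (16)."""
    whole, _, frac = ip_text.strip().partition('.')
    extra_outs = int(frac) if frac else 0
    if extra_outs not in (0, 1, 2):
        raise ValueError(f'unexpected innings value: {ip_text!r}')
    return 3 * int(whole) + extra_outs

def parse_game_date(date_text):
    """Parse a date such as '4- 5-2016' (month-day-year, space padded)."""
    month, day, year = (int(part) for part in date_text.split('-'))
    return datetime(year, month, day)

print(ip_to_outs('5.1'), ip_to_outs('7.2'), ip_to_outs('6'))
print(parse_game_date('4- 5-2016').date())
```

Storing outs as an integer keeps later sums and averages exact, since adding `5.1` and `7.2` as floats would not give the right innings total.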

In [26]:
ck_data.sample(5)

Unnamed: 0,at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
6,VS,DET,A,1,0,0,0,0,6.0,4,24,1,2,2,0,0,8,0,0,1,2,0,0,0,0,0,0,0,4.29,5-16-2022,
5,AT,HOU,A,1,0,0,0,0,5.0,6,25,0,3,3,3,0,6,1,0,0,0,0,3,0,0,1,0,0,5.81,4-26-2019,
6,AT,DET,A,1,0,0,0,0,6.0,6,24,0,2,2,1,0,4,0,0,0,1,0,1,0,2,0,1,0,4.79,9- 3-2012,
19,AT,CIN,N,1,0,0,0,0,7.2,7,30,0,3,3,1,0,5,0,1,0,0,0,1,0,0,0,1,0,3.38,7-18-2015,
5,VS,DET,A,1,0,0,0,0,8.0,2,27,0,0,0,1,0,10,0,0,1,0,0,0,0,0,0,1,0,3.03,5- 2-2021,


Load in Game-Level Data

In [27]:
df = pd.read_csv('df_bp3.csv', low_memory=False)

In [28]:
start_pitchers_h = df.pitcher_start_id_h.unique()
start_pitchers_v = df.pitcher_start_id_v.unique()
len(start_pitchers_h), len(start_pitchers_v)

(5663, 5672)

In [29]:
start_pitchers_all = np.union1d(start_pitchers_h.astype(str), start_pitchers_v.astype(str))
len(start_pitchers_all), start_pitchers_all[:25]

(6212,
 array(['aased001', 'abadf001', 'abboc001', 'abbog001', 'abboj001',
        'abbok001', 'abbop001', 'abera101', 'abert101', 'abert102',
        'aberw101', 'ableh101', 'abrej001', 'aceva001', 'acevj001',
        'acevj002', 'ackej001', 'acket101', 'acklf101', 'acose101',
        'acosj101', 'adama002', 'adama101', 'adamb102', 'adamb104'],
       dtype='<U8'))

In [30]:
# run this for everyone in the list - may take a while to run...

for p_id in start_pitchers_all:
    print(p_id)
    try:
        df_temp = get_full_pitching_data(p_id)
    except (AttributeError, AssertionError, ValueError):
        # skip this pitcher; otherwise the previous pitcher's df_temp
        # would be saved under this pitcher's file name
        continue

    fname_out = '/Users/gilliancurtis/Desktop/beatingVegas/SP_Data/pitching_data_'+p_id+'.csv'
    df_temp.to_csv(fname_out, index=False)

aased001
abadf001
abboc001
abbog001
abboj001
abbok001
abbop001
abera101
abert101
abert102
aberw101
ableh101
abrej001
aceva001
acevj001
acevj002
ackej001
acket101
acklf101
acose101
acosj101
adama002
adama101
adamb102
adamb104
adamc002
adamd103
adamj001
adamk101
adamm101
adamr102
adamt001
adamw001
adamw101
adcon001
adenn001
adkid102
adkid103
adkig101
adkis001
adlet001
adonj001
affej001
agosj001
agrad001
aguih101
aguir001
aheap001
ainsk001
aitcr101
akerd001
akerj101
akink001
albea001
albec101
albej001
albem001
alboe101
albre101
albuv101
alcar001
alcas001
alcas101
alded101
aldrs001
aldrv101
alexa001
alexd001
alexg001
alexg102
alexj001
alexs001
alext001
allab101
allak001
alleb102
allef101
allej102
allel002
allel101
allen001
allim102
almac001
aloml101
altee101
altrn101
alvaa001
alvah001
alvaj003
alvaj004
alvat001
alvav001
alvaw001
alzoa001
amesr101
amorv101
anckw101
andea001
andeb002
andeb004
andeb101
andeb102
andec001
andec002
andec101
anded003
andef101
andei001
andej002
andej102
andel001
a

KeyboardInterrupt: 
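
Since the run above was interrupted partway through, it may help to make the loop resumable. The sketch below skips pitchers whose file already exists and records failures instead of stopping. It is an illustration under stated assumptions: `download_all` and `fake_fetch` are hypothetical names, `fetch` stands in for something like `lambda p: get_full_pitching_data(p).to_csv(index=False)`, and the demo writes to a temporary directory rather than the notebook's output folder.

```python
import tempfile
from pathlib import Path

def download_all(pitcher_ids, fetch, out_dir):
    """Save one CSV per pitcher, skipping files that already exist.

    fetch is any callable that takes a pitcher id and returns CSV text.
    Returns a list of (pitcher_id, error) pairs for pitchers that failed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for p_id in pitcher_ids:
        target = out_dir / f'pitching_data_{p_id}.csv'
        if target.exists():
            continue  # already downloaded on an earlier run
        try:
            csv_text = fetch(p_id)
        except (AttributeError, ValueError, IndexError) as err:
            failed.append((p_id, repr(err)))
            continue  # do not write a file for a failed pitcher
        target.write_text(csv_text)
    return failed

# Demo with a fake fetcher so the sketch runs without network access.
def fake_fetch(p_id):
    if p_id == 'bad00001':  # made-up id used to trigger a failure
        raise ValueError('no pitching table found')
    return 'IP,H\n5.1,9\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    failed = download_all(['aased001', 'bad00001'], fake_fetch, tmp_dir)
    written = sorted(p.name for p in Path(tmp_dir).iterdir())

print(written)
print([p_id for p_id, _ in failed])
```

Returning the failures makes it easy to inspect or retry just those pitchers after the main run finishes.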