## Getting Team and Stadium Data and Loading it into a PostgreSQL Database

This notebook does two things:

* Creates a DataFrame of "team data," which simply maps current MLB team names to their abbreviations in two baseball data sources (MLBAM/Statcast and [Baseball-Reference](https://www.baseball-reference.com/)), and uses it to create and populate a `teams` table in the `baseball` database created in the previous notebook.
* Scrapes dimension data for each team's home stadium from [Clem's Baseball](http://www.andrewclem.com/Baseball/index.html) and uses it to create and populate a `stadiums` table in the `baseball` database.

In [1]:
import pandas as pd
import re
from bs4 import BeautifulSoup, Comment
import requests

Let's start by making two dictionaries that map each team name to its MLBAM and Baseball-Reference abbreviations (the latter is not used in this project but is kept for future use, should Baseball-Reference data be added to the `baseball` database). We'll return to these later in the notebook.

In [2]:
mlbam_team_dict = {
    'Arizona Diamondbacks': 'ARI',
    'Atlanta Braves': 'ATL',
    'Baltimore Orioles': 'BAL',
    'Boston Red Sox': 'BOS',
    'Chicago Cubs': 'CHC',
    'Chicago White Sox': 'CWS',
    'Cincinnati Reds': 'CIN',
    'Cleveland Indians': 'CLE',
    'Colorado Rockies': 'COL',
    'Detroit Tigers': 'DET',
    'Houston Astros': 'HOU',
    'Kansas City Royals': 'KC',
    'Los Angeles Angels of Anaheim': 'LAA',
    'Los Angeles Dodgers': 'LAD',
    'Miami Marlins': 'MIA',
    'Milwaukee Brewers': 'MIL',
    'Minnesota Twins': 'MIN',
    'New York Mets': 'NYM',
    'New York Yankees': 'NYY',
    'Oakland Athletics': 'OAK',
    'Philadelphia Phillies': 'PHI',
    'Pittsburgh Pirates': 'PIT',
    'San Diego Padres': 'SD',
    'San Francisco Giants': 'SF',
    'Seattle Mariners': 'SEA',
    'St. Louis Cardinals': 'STL',
    'Tampa Bay Rays': 'TB',
    'Texas Rangers': 'TEX',
    'Toronto Blue Jays': 'TOR',
    'Washington Nationals': 'WSH'
}

In [3]:
bref_team_dict = {
    'Arizona Diamondbacks': 'ARI',
    'Atlanta Braves': 'ATL',
    'Baltimore Orioles': 'BAL',
    'Boston Red Sox': 'BOS',
    'Chicago Cubs': 'CHC',
    'Chicago White Sox': 'CHW',
    'Cincinnati Reds': 'CIN',
    'Cleveland Indians': 'CLE',
    'Colorado Rockies': 'COL',
    'Detroit Tigers': 'DET',
    'Houston Astros': 'HOU',
    'Kansas City Royals': 'KCR',
    'Los Angeles Angels of Anaheim': 'LAA',
    'Los Angeles Dodgers': 'LAD',
    'Miami Marlins': 'MIA',
    'Milwaukee Brewers': 'MIL',
    'Minnesota Twins': 'MIN',
    'New York Mets': 'NYM',
    'New York Yankees': 'NYY',
    'Oakland Athletics': 'OAK',
    'Philadelphia Phillies': 'PHI',
    'Pittsburgh Pirates': 'PIT',
    'San Diego Padres': 'SDP',
    'San Francisco Giants': 'SFG',
    'Seattle Mariners': 'SEA',
    'St. Louis Cardinals': 'STL',
    'Tampa Bay Rays': 'TBR',
    'Texas Rangers': 'TEX',
    'Toronto Blue Jays': 'TOR',
    'Washington Nationals': 'WSN'
}
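
Since both dictionaries are keyed by team name, a quick check can confirm that they cover the same teams and list where the two sources disagree. The following is a minimal sketch on a three-team subset copied from the dictionaries above; the helper name `differing_abbreviations` is hypothetical and not part of the original notebook.

```python
# Three-team subsets copied from the dictionaries above (for illustration only).
mlbam_sample = {
    'Boston Red Sox': 'BOS',
    'Chicago White Sox': 'CWS',
    'Kansas City Royals': 'KC',
}
bref_sample = {
    'Boston Red Sox': 'BOS',
    'Chicago White Sox': 'CHW',
    'Kansas City Royals': 'KCR',
}


def differing_abbreviations(first, second):
    """Hypothetical helper: map each shared team to its (first, second)
    abbreviations when the two sources disagree."""
    return {
        team: (first[team], second[team])
        for team in first
        if team in second and first[team] != second[team]
    }


# Both sources should cover exactly the same teams.
same_teams = set(mlbam_sample) == set(bref_sample)
mismatches = differing_abbreviations(mlbam_sample, bref_sample)
print(same_teams)
print(mismatches)
```

Passing the full `mlbam_team_dict` and `bref_team_dict` to the same helper would list every team whose abbreviation differs between the two sources.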

Now we'll set ourselves up to scrape the "Current Stadium" links from [Clem's](http://www.andrewclem.com/Baseball/Stadium_lists.html) and then the team name and dimensions data from each stadium page ([example](http://www.andrewclem.com/Baseball/AngelStadium.html)):

In [4]:
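def get_soup(url):
    """Fetch a page and return it parsed as a BeautifulSoup object."""
    response = requests.get(url)
    page = response.text
    # strip the HTML comment delimiters so markup inside comments gets parsed too
    soup = BeautifulSoup(re.sub('<!--|-->', '', page), 'html5lib')
    return soup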
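
`get_soup` deletes the HTML comment delimiters before parsing, so any markup that sits inside a comment is parsed as ordinary HTML. Here is a minimal sketch of that substitution on a made-up snippet (the HTML string below is hypothetical, not taken from Clem's Baseball):

```python
import re

# Hypothetical snippet: one table row is hidden inside an HTML comment.
page = '<table><tr><td>333</td></tr><!-- <tr><td>400</td></tr> --></table>'

# Same substitution as in get_soup: remove only the comment delimiters,
# leaving the text between them in place.
uncommented = re.sub('<!--|-->', '', page)
print(uncommented)
```

A side effect of this approach is that genuine comments, not just commented-out markup, also end up as text in the parsed page.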

In [5]:
# fetch the stadium list page and collect every link after the "Current" heading
soup = get_soup('http://www.andrewclem.com/Baseball/Stadium_lists.html')
curr_stadium_pgs = soup.find('h5', string=re.compile('Current')).find_all_next('a')

In [6]:
# keep only the href of the first 30 links
curr_stadium_links = [s['href'] for s in curr_stadium_pgs[:30]]

In [7]:
# scrape team name and dimensions data off each stadium page
stadium_dict = {}
for link in curr_stadium_links:
    sopa = get_soup('http://www.andrewclem.com/Baseball/' + link)
    # the team name follows "of the" in the page heading; drop parenthesized text and asterisks
    team_name = re.sub(r'\(.*\)', '', sopa.find('h1').text.strip().split('of the')[1]).strip()
    team_name = re.sub(r'\*', '', team_name).strip()
    tbl = sopa.find('table').find_all('tr')

    # the third table row holds the values, the second row the matching column headers
    cells = tbl[2].find_all('td')
    row = [re.sub(r'[()*]', '', td.text).strip() for td in cells[8:13] + cells[15:20]]
    head = [th.text for th in tbl[1].find_all('th')[7:17]]
    stadium_dict[team_name] = row
stadium_df = pd.DataFrame.from_dict(stadium_dict, orient='index')
stadium_df.columns = head
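
The team-name cleanup in the loop can be tried on a plain string without fetching a page. This sketch assumes a heading of the form "stadium ... of the team (notes)"; the heading text and the function name `clean_team_name` are hypothetical and only illustrate the two substitutions.

```python
import re


def clean_team_name(heading):
    """Hypothetical helper mirroring the loop's team-name cleanup."""
    # Take the piece right after the first "of the", then drop everything
    # from the first "(" through the last ")".
    name = re.sub(r'\(.*\)', '', heading.strip().split('of the')[1]).strip()
    # Remove any asterisks that remain.
    return re.sub(r'\*', '', name).strip()


# Made-up heading in the assumed format (not copied from the site).
example = 'Example Park: home of the Springfield Nine * (2001-present)'
print(clean_team_name(example))
```

Note that `split('of the')[1]` raises an `IndexError` when the heading does not contain "of the", so a page with a differently worded heading would stop the loop.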

In [8]:
stadium_df

,Fair,Foul,LF,CF,RF,Left field,Left-center,Center field,Right-center,Right field
Los Angeles Angels of Anaheim,106.7,21.5,5,8,18,333,390,400,370,333
St. Louis Cardinals,112.1,25.2,8,8,8,336,375,400,375,335
Baltimore Orioles,108.1,23.6,7,7,21,333,364,400,373,318
Arizona Diamondbacks,114.2,25.5,8,25,8,330,376,407,376,335
New York Mets,109.6,20.7,8,8,8,335,362,408,375,330
Philadelphia Phillies,105.0,24.5,11,6,13,329,360,401,355,330
Detroit Tigers,113.7,26.5,7,9,9,345,370,420,388,330
Colorado Rockies,119.2,24.9,8,8,17,347,390,415,375,350
Los Angeles Dodgers,115.8110.5,27.919.3,4,8,4,330,375,395,375,330
Boston Red Sox,105.5,18.1,37,18,5,310,335,390,378,302


In [None]:
# df.index = df.index.map(team_dict)

We see there are some oddities for the Los Angeles Dodgers ([Dodger Stadium](http://www.andrewclem.com/Baseball/DodgerStadium.html)) and Washington Nationals ([Nationals Park](http://www.andrewclem.com/Baseball/NationalsPark.html)). For now, we will take the lower value for `LF`, `CF` and `RF` for Washington (these are the heights of their corresponding OF walls) and the second numerical value for `Fair` and `Foul` area dimensions for the Dodgers (as specified on the Dodger Stadium page).

In [12]:
stadium_df.loc['Washington Nationals', 'LF'] = stadium_df.loc['Washington Nationals', 'LF'].split(', ')[1]
stadium_df.loc['Washington Nationals', 'CF'] = stadium_df.loc['Washington Nationals', 'CF'].split(', ')[1]
stadium_df.loc['Washington Nationals', 'RF'] = stadium_df.loc['Washington Nationals', 'RF'].split(', ')[1]

In [13]:
stadium_df.loc['Los Angeles Dodgers', 'Fair'] = 110.5
stadium_df.loc['Los Angeles Dodgers', 'Foul'] = 19.3
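
The fused Dodgers values (`115.8110.5` and `27.919.3`) could also be separated programmatically rather than typed in by hand. This is a minimal sketch, assuming every area value is written with exactly one decimal place, as in the table above; `split_fused_decimals` is a hypothetical helper name.

```python
import re


def split_fused_decimals(text):
    """Hypothetical helper: split run-together numbers, assuming each one
    has exactly one digit after the decimal point."""
    return [float(value) for value in re.findall(r'\d+\.\d', text)]


fair_values = split_fused_decimals('115.8110.5')
foul_values = split_fused_decimals('27.919.3')
print(fair_values[1], foul_values[1])
```

Taking index 1 of each result gives the second value, which matches the 110.5 and 19.3 assigned above.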

Let's now further clean up the stadium dimension data: convert everything to numeric, make the team names a column and not the index, and rename the columns:

In [14]:
stadium_df = stadium_df.apply(pd.to_numeric)

In [15]:
stadium_df = stadium_df.reset_index().rename(columns={'index': 'Team'})

In [16]:
stadium_df

,Team,Fair,Foul,LF,CF,RF,Left field,Left-center,Center field,Right-center,Right field
0,Los Angeles Angels of Anaheim,106.7,21.5,5,8,18,333,390,400,370,333
1,St. Louis Cardinals,112.1,25.2,8,8,8,336,375,400,375,335
2,Baltimore Orioles,108.1,23.6,7,7,21,333,364,400,373,318
3,Arizona Diamondbacks,114.2,25.5,8,25,8,330,376,407,376,335
4,New York Mets,109.6,20.7,8,8,8,335,362,408,375,330
5,Philadelphia Phillies,105.0,24.5,11,6,13,329,360,401,355,330
6,Detroit Tigers,113.7,26.5,7,9,9,345,370,420,388,330
7,Colorado Rockies,119.2,24.9,8,8,17,347,390,415,375,350
8,Los Angeles Dodgers,110.5,19.3,4,8,4,330,375,395,375,330
9,Boston Red Sox,105.5,18.1,37,18,5,310,335,390,378,302


In [48]:
stadium_df.columns = ['team', 'fair', 'foul', 'lf_ht', 'cf_ht', 'rf_ht', 'lf', 'lc', 'cf', 'rc', 'rf']

OK, now we are ready to load this DataFrame into the `baseball` database. Let's write the `CREATE TABLE` statement for the `stadiums` table, connect to the `baseball` database, and execute the statement:

In [49]:
from psycopg2 import connect

In [60]:
create_stadium_tbl = '''
    CREATE TABLE IF NOT EXISTS stadiums(
        team TEXT,
        fair FLOAT,
        foul FLOAT,
        lf_ht INT,
        cf_ht INT,
        rf_ht INT,
        lf INT,
        lc INT,
        cf INT,
        rc INT,
        rf INT
    );
'''

In [61]:
conn = connect("dbname=baseball user= password=")
cur = conn.cursor()

In [62]:
cur.execute(create_stadium_tbl)
conn.commit()
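
Because the statement uses `IF NOT EXISTS`, running it again does not fail or wipe an existing table. The sketch below illustrates this with Python's built-in `sqlite3` module as a stand-in so that it runs without a PostgreSQL server; this is a substitution for illustration only, since the notebook itself uses `psycopg2`. The table is shortened to three columns, and the `demo_` names are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for PostgreSQL.
demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()

demo_create_tbl = '''
    CREATE TABLE IF NOT EXISTS stadiums(
        team TEXT,
        fair FLOAT,
        foul FLOAT
    );
'''

demo_cur.execute(demo_create_tbl)
# sqlite3 uses "?" placeholders, whereas psycopg2 uses "%s".
demo_cur.execute("INSERT INTO stadiums (team, fair, foul) VALUES (?, ?, ?)",
                 ('Boston Red Sox', 105.5, 18.1))
# Running the same CREATE TABLE again is a no-op because the table exists.
demo_cur.execute(demo_create_tbl)
demo_conn.commit()

row_count = demo_cur.execute("SELECT COUNT(*) FROM stadiums").fetchone()[0]
print(row_count)

demo_cur.close()
demo_conn.close()
```

The row inserted before the second `CREATE TABLE` is still there afterwards, which is the behavior that makes the cell safe to re-run.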

Now we'll insert each row from this DataFrame into the `stadiums` table, similarly to what we did for the `statcast` table in the previous notebook:

In [63]:
cols = ", ".join([str(i) for i in stadium_df.columns.tolist()])
cols

'team, fair, foul, lf_ht, cf_ht, rf_ht, lf, lc, cf, rc, rf'

In [64]:
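# build the parameterized INSERT once; psycopg2 substitutes each %s with a row value
query = "INSERT INTO stadiums (" + cols + ") VALUES (" + ", ".join(["%s"] * len(stadium_df.columns)) + ")"
for _, row in stadium_df.iterrows():
    cur.execute(query, tuple(row))
# commit once after all rows have been inserted
conn.commit()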
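
The `INSERT` statement depends only on the table name and the column list, not on the row being inserted. Here is a minimal sketch of that string construction, using a hypothetical helper `build_insert_query` and no database connection:

```python
def build_insert_query(table, columns):
    """Hypothetical helper: build an INSERT statement with one %s
    placeholder per column, in the style psycopg2 expects."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return "INSERT INTO " + table + " (" + col_list + ") VALUES (" + placeholders + ")"


query = build_insert_query('stadiums', ['team', 'fair', 'foul'])
print(query)
```

Only the values go through the `%s` placeholders. The table and column names are concatenated into the string, so they should come from trusted code, as they do here. With the query built once, psycopg2's `cursor.executemany(query, rows)` is an alternative to calling `execute` for each row.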

Finally, we'll create the `teams` table in the `baseball` database. We write the `CREATE TABLE` statement for this table, merge the two team dictionaries created at the beginning of the notebook into a DataFrame, execute the `CREATE TABLE` statement, and insert the rows of the team DataFrame into the `teams` table.

In [81]:
create_team_tbl = '''
    CREATE TABLE IF NOT EXISTS teams(
        team TEXT,
        mlb_id TEXT,
        bref_id TEXT
    );
'''

In [82]:
teams_df = pd.DataFrame.from_dict(mlbam_team_dict, orient='index').merge(
    pd.DataFrame.from_dict(bref_team_dict, orient='index'),
    left_index=True, right_index=True
)
teams_df.columns = ['mlb_id', 'bref_id']
teams_df = teams_df.reset_index().rename(columns={'index': 'team'})
teams_df

,team,mlb_id,bref_id
0,Arizona Diamondbacks,ARI,ARI
1,Atlanta Braves,ATL,ATL
2,Baltimore Orioles,BAL,BAL
3,Boston Red Sox,BOS,BOS
4,Chicago Cubs,CHC,CHC
5,Chicago White Sox,CWS,CHW
6,Cincinnati Reds,CIN,CIN
7,Cleveland Indians,CLE,CLE
8,Colorado Rockies,COL,COL
9,Detroit Tigers,DET,DET
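
`merge` with `left_index=True, right_index=True` and the default `how='inner'` keeps only the team names present in both dictionaries. Below is a minimal sketch of the same join in plain Python, on a two-team subset from above plus one hypothetical extra key (`'Example Team'`, not a real entry) to show that unmatched names drop out:

```python
mlbam_subset = {
    'Chicago White Sox': 'CWS',
    'Kansas City Royals': 'KC',
    'Example Team': 'EX',  # hypothetical entry, present in only one source
}
bref_subset = {
    'Chicago White Sox': 'CHW',
    'Kansas City Royals': 'KCR',
}

# Inner join on team name: keep teams found in both dictionaries.
team_rows = [
    (team, mlbam_subset[team], bref_subset[team])
    for team in mlbam_subset
    if team in bref_subset
]
print(team_rows)
```

Since the two full dictionaries share all 30 keys, the inner join in the notebook should keep every team; a spelling mismatch in either dictionary would silently drop that team instead.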


In [83]:
cur.execute(create_team_tbl)
conn.commit()

In [84]:
cols = ", ".join([str(i) for i in teams_df.columns.tolist()])
cols

'team, mlb_id, bref_id'

In [85]:
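# build the parameterized INSERT once; psycopg2 substitutes each %s with a row value
query = "INSERT INTO teams (" + cols + ") VALUES (" + ", ".join(["%s"] * len(teams_df.columns)) + ")"
for _, row in teams_df.iterrows():
    cur.execute(query, tuple(row))
# commit once after all rows have been inserted
conn.commit()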

In [86]:
cur.close()
conn.close()