# Loading the Data

![](https://upload.wikimedia.org/wikipedia/en/5/54/Simpsons_MoneyBART_Mike_Scioscia_Promo.jpg)

Our goal is to write a few functions that let us process baseball game data from [Retrosheet](https://www.retrosheet.org/). We will work with the game log files; the meaning of each field in these files is documented [here](https://www.retrosheet.org/gamelogs/glfields.txt).

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
in_df = pd.read_table("data/GL2017.txt", sep = ",", header = None)

In [3]:
in_df.head()

,0,1,2,3,4,5,6,7,8,9,...,151,152,153,154,155,156,157,158,159,160
0,20170402,0,Sun,SFN,NL,1,ARI,NL,1,5,...,David Peralta,9,mathj001,Jeff Mathis,2,greiz001,Zack Greinke,1,,Y
1,20170402,0,Sun,CHN,NL,1,SLN,NL,1,3,...,Jedd Gyorko,4,gricr001,Randal Grichuk,7,martc006,Carlos Martinez,1,,Y
2,20170402,0,Sun,NYA,AL,1,TBA,AL,1,3,...,Tim Beckham,6,smitm007,Mallex Smith,7,norrd001,Derek Norris,2,,Y
3,20170403,0,Mon,PHI,NL,1,CIN,NL,1,4,...,Zack Cozart,6,barnt001,Tucker Barnhart,2,felds001,Scott Feldman,1,,Y
4,20170403,0,Mon,SDN,NL,1,LAN,NL,1,3,...,Yasmani Grandal,2,puigy001,Yasiel Puig,9,kersc001,Clayton Kershaw,1,,Y
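Since this notebook only uses four of the 161 fields, one option is to read just those columns. The sketch below is an illustration rather than part of the original notebook: it parses a two-line in-memory sample (the first eleven fields of the first two games shown in the outputs here) instead of `data/GL2017.txt`, and the variable names `sample` and `slim_df` are made up for the example.

```python
import io

import pandas as pd

# First eleven fields of the first two 2017 games (assumed sample, not the full file).
sample = (
    "20170402,0,Sun,SFN,NL,1,ARI,NL,1,5,6\n"
    "20170402,0,Sun,CHN,NL,1,SLN,NL,1,3,4\n"
)

# header=None keeps integer column labels; usecols keeps only the positions we need.
slim_df = pd.read_csv(io.StringIO(sample), header=None, usecols=[3, 6, 9, 10])

# Give the four positional columns readable names.
slim_df = slim_df.rename(
    columns={3: "visiting_team", 6: "home_team", 9: "runs_visitor", 10: "runs_home"}
)
print(slim_df)
```

`pd.read_csv(path, header=None)` behaves the same as `pd.read_table(path, sep=",", header=None)`, so either call can be used on the real file.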


In [5]:
in_df = pd.read_table("data/GL2017.txt", sep = ",", header = None)
in_df.rename(columns={3: 'visiting_team', 6: 'home_team', 9: 'runs_visitor', 10: 'runs_home'}, inplace = True)
in_df.loc[0:5, ['visiting_team', 'home_team', 'runs_visitor', 'runs_home']]

,visiting_team,home_team,runs_visitor,runs_home
0,SFN,ARI,5,6
1,CHN,SLN,3,4
2,NYA,TBA,3,7
3,PHI,CIN,4,3
4,SDN,LAN,3,14
5,COL,MIL,7,5


**PROBLEM**

Write a function that takes in a dataframe and returns a dataframe with the columns renamed as above.

In [6]:
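#MY code
def baseball(df):
    # rename(..., inplace=True) modifies df and returns None,
    # so return df itself rather than the result of the call
    df.rename(columns={3: 'visiting_team', 6: 'home_team', 9: 'runs_visitor', 10: 'runs_home'}, inplace = True)
    return df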
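A point worth checking when writing a renaming function is what `rename` hands back. The small sketch below uses a made-up two-row frame (`toy` and `names` are illustrative, not from the data file) to show that, in the pandas versions I am aware of, `rename` returns a new frame by default and returns `None` when `inplace=True`.

```python
import pandas as pd

# Toy frame with integer column labels, mimicking a headerless game log.
toy = pd.DataFrame({3: ["SFN", "CHN"], 6: ["ARI", "SLN"], 9: [5, 3], 10: [6, 4]})
names = {3: "visiting_team", 6: "home_team", 9: "runs_visitor", 10: "runs_home"}

# Default behaviour: a renamed copy is returned and toy keeps its integer labels.
renamed = toy.rename(columns=names)

# inplace=True: the frame passed in is modified and the call itself returns None.
scratch = toy.copy()
result = scratch.rename(columns=names, inplace=True)

print(list(toy.columns))
print(list(renamed.columns))
print(result)
```

So a function that assigns the result of an `inplace=True` call and returns it will return `None`, even though the input frame has been renamed as a side effect.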

In [10]:
new_df = pd.read_table("data/GL2017.txt", sep = ",", header = None)

In [11]:
baseball(new_df)

In [12]:
new_df['home_win'] = (new_df['runs_home'] > new_df['runs_visitor'])

In [13]:
new_df.home_win.head()

0     True
1     True
2     True
3    False
4     True
Name: home_win, dtype: bool

**PROBLEM** 

Add a new column to determine if the visiting team wins.

In [15]:
#My code
new_df['visitor_win'] = (new_df['runs_home'] < new_df['runs_visitor'])

In [16]:
new_df.visitor_win.head()

0    False
1    False
2    False
3     True
4    False
Name: visitor_win, dtype: bool

**PROBLEM**

Write a function named `new_columns` that takes in our labeled dataframe and outputs a new dataframe that contains the new columns `home_win` and `visitor_win`.

In [95]:
def new_columns(df):
    '''
    My code:  This doesn't take a 'labeled' df.  
    Instead it takes a raw baseball df and calls the 'baseball()' function first, and then
    returns a new df with the home/vis win columns
    '''
    baseball(df)
    df['visitor_win'] = (df['runs_home'] < df['runs_visitor'])
    df['home_win'] = (df['runs_home'] > df['runs_visitor'])
    df['RD_at_home'] = df.runs_home - df.runs_visitor
    df['RD_visiting'] = df.runs_visitor - df.runs_home
    return df

In [96]:
great_df = pd.read_table("data/GL2017.txt", sep = ",", header = None)

In [97]:
new_columns(great_df)

,0,1,2,visiting_team,4,5,home_team,7,8,runs_visitor,...,155,156,157,158,159,160,visitor_win,home_win,RD_at_home,RD_visiting
0,20170402,0,Sun,SFN,NL,1,ARI,NL,1,5,...,2,greiz001,Zack Greinke,1,,Y,False,True,1,-1
1,20170402,0,Sun,CHN,NL,1,SLN,NL,1,3,...,7,martc006,Carlos Martinez,1,,Y,False,True,1,-1
2,20170402,0,Sun,NYA,AL,1,TBA,AL,1,3,...,7,norrd001,Derek Norris,2,,Y,False,True,4,-4
3,20170403,0,Mon,PHI,NL,1,CIN,NL,1,4,...,2,felds001,Scott Feldman,1,,Y,True,False,-1,1
4,20170403,0,Mon,SDN,NL,1,LAN,NL,1,3,...,9,kersc001,Clayton Kershaw,1,,Y,False,True,11,-11
5,20170403,0,Mon,COL,NL,1,MIL,NL,1,7,...,6,guerj003,Junior Guerra,1,,Y,True,False,-2,2
6,20170403,0,Mon,ATL,NL,1,NYN,NL,1,0,...,2,syndn001,Noah Syndergaard,1,,Y,False,True,6,-6
7,20170403,0,Mon,MIA,NL,1,WAS,NL,1,2,...,2,stras001,Stephen Strasburg,1,,Y,False,True,2,-2
8,20170403,0,Mon,TOR,AL,1,BAL,AL,1,2,...,4,hardj003,J.J. Hardy,6,,Y,False,True,1,-1
9,20170403,0,Mon,PIT,NL,1,BOS,AL,1,3,...,5,leons001,Sandy Leon,2,,Y,False,True,2,-2


### Home Statistics

Using our dataframe above, let's work through finding the number of wins for the home team, number of runs scored by the home and visiting team, and the total number of home games for each team.
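As a compact illustration of the same idea, the sketch below computes all four per-team quantities in one `groupby(...).agg(...)` call using named aggregation. The five-game frame `games` is invented for the example, so the numbers are not from the 2017 file.

```python
import pandas as pd

# Invented five-game sample; column names match the renamed game log.
games = pd.DataFrame({
    "home_team": ["ARI", "ARI", "BOS", "BOS", "BOS"],
    "runs_visitor": [5, 2, 3, 7, 1],
    "runs_home": [6, 1, 4, 2, 5],
})
games["home_win"] = games["runs_home"] > games["runs_visitor"]

# One output column per keyword: (source column, aggregation).
summary = games.groupby("home_team").agg(
    runs_visitor=("runs_visitor", "sum"),
    runs_home=("runs_home", "sum"),
    home_win=("home_win", "sum"),      # True counts as 1, so this is the win total
    home_games=("home_win", "count"),  # number of rows per team
)
print(summary)
```

The cells that follow build the same table step by step, which makes each intermediate result easier to inspect.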

In [36]:
home_group = great_df.groupby('home_team')

In [41]:
home_group[['runs_visitor', 'runs_home', 'home_win']].apply(sum).head()

home_team,runs_visitor,runs_home,home_win
ANA,335.0,356.0,43.0
ARI,346.0,457.0,52.0
ATL,421.0,346.0,37.0
BAL,407.0,395.0,46.0
BOS,349.0,387.0,48.0


In [45]:
type(home_group.head())

pandas.core.frame.DataFrame

In [46]:
home_group.sum()

home_team,0,1,5,8,runs_visitor,runs_home,11,13,14,15,...,137,140,143,146,149,152,155,158,visitor_win,home_win
ANA,1633824494,0,6582,6758,335,356,4339,0.0,0.0,0.0,...,683,733,572,445,429,388,300,304,38.0,43.0
ARI,1633823197,0,6342,6411,346,457,4288,0.0,0.0,0.0,...,578,332,394,541,375,413,323,81,29.0,52.0
ATL,1633825877,6,6908,6800,421,346,4342,0.0,0.0,0.0,...,362,415,568,375,361,383,456,81,44.0,37.0
BAL,1633824663,0,6615,6596,407,395,4410,0.0,0.0,0.0,...,491,381,601,484,449,485,391,473,35.0,46.0
BOS,1633824784,3,6715,6739,349,387,4446,0.0,0.0,0.0,...,509,576,590,492,498,459,282,416,33.0,48.0
CHA,1633826239,6,7153,7055,398,366,4279,0.0,0.0,0.0,...,480,383,557,539,490,470,354,614,42.0,39.0
CHN,1633825364,0,6709,6660,369,436,4323,0.0,0.0,0.0,...,442,298,359,505,459,481,314,244,33.0,48.0
CIN,1633823811,0,6345,6365,417,402,4328,0.0,0.0,0.0,...,467,243,537,431,563,485,195,81,42.0,39.0
CLE,1633825159,0,6875,6812,275,406,4279,0.0,0.0,0.0,...,525,488,622,545,445,561,374,382,32.0,49.0
COL,1633824799,3,6640,6739,415,488,4241,0.0,0.0,0.0,...,352,453,468,415,560,473,207,81,35.0,46.0


In [47]:
home_df = home_group[['runs_visitor', 'runs_home', 'home_win']].sum()

In [48]:
home_group['home_win'].count().head()

home_team
ANA    81
ARI    81
ATL    81
BAL    81
BOS    81
Name: home_win, dtype: int64

In [50]:
home_df['home_games'] = home_group['home_win'].count()

In [51]:
home_df.head()

home_team,runs_visitor,runs_home,home_win,home_games
ANA,335,356,43.0,81
ARI,346,457,52.0,81
ATL,421,346,37.0,81
BAL,407,395,46.0,81
BOS,349,387,48.0,81


**PROBLEM**

Create a new column in the `home_df` dataframe called `rundf_home` that contains the run difference at home.

In [52]:
home_df['rundf_home']=home_df['runs_home']-home_df['runs_visitor']

In [53]:
home_df.head()

home_team,runs_visitor,runs_home,home_win,home_games,rundf_home
ANA,335,356,43.0,81,21
ARI,346,457,52.0,81,111
ATL,421,346,37.0,81,-75
BAL,407,395,46.0,81,-12
BOS,349,387,48.0,81,38


In [54]:
home_df.index.rename('Team', inplace=True)

In [55]:
home_df.head()

Team,runs_visitor,runs_home,home_win,home_games,rundf_home
ANA,335,356,43.0,81,21
ARI,346,457,52.0,81,111
ATL,421,346,37.0,81,-75
BAL,407,395,46.0,81,-12
BOS,349,387,48.0,81,38


In [56]:
home_df.reset_index(inplace=True)

In [104]:
in_df = pd.read_table("data/GL2017.txt", sep = ",", header = None)

**PROBLEM**

Write a function that modularizes all of this called `home_team_data` that takes in a dataframe like our `in_df` and returns a dataframe called `home_df`.

In [105]:
def home_team_data(in_df):
    new_columns(in_df)
    home_group = in_df.groupby('home_team')
    # No .head() here, so every team is kept rather than only the first five
    home_df = home_group[['runs_visitor', 'runs_home', 'home_win']].apply(sum)
    home_df['num_home_games'] = home_group.home_win.count()
    home_df['rd_at_home'] = home_df['runs_home'] - home_df['runs_visitor']
    home_df.index.rename('Team', inplace=True)

    return home_df

In [106]:
home_team_data(in_df).head()

Team,runs_visitor,runs_home,home_win,num_home_games,rd_at_home
ANA,335.0,356.0,43.0,81,21.0
ARI,346.0,457.0,52.0,81,111.0
ATL,421.0,346.0,37.0,81,-75.0
BAL,407.0,395.0,46.0,81,-12.0
BOS,349.0,387.0,48.0,81,38.0


In [107]:
new_columns(in_df)

,0,1,2,visiting_team,4,5,home_team,7,8,runs_visitor,...,155,156,157,158,159,160,visitor_win,home_win,RD_at_home,RD_visiting
0,20170402,0,Sun,SFN,NL,1,ARI,NL,1,5,...,2,greiz001,Zack Greinke,1,,Y,False,True,1,-1
1,20170402,0,Sun,CHN,NL,1,SLN,NL,1,3,...,7,martc006,Carlos Martinez,1,,Y,False,True,1,-1
2,20170402,0,Sun,NYA,AL,1,TBA,AL,1,3,...,7,norrd001,Derek Norris,2,,Y,False,True,4,-4
3,20170403,0,Mon,PHI,NL,1,CIN,NL,1,4,...,2,felds001,Scott Feldman,1,,Y,True,False,-1,1
4,20170403,0,Mon,SDN,NL,1,LAN,NL,1,3,...,9,kersc001,Clayton Kershaw,1,,Y,False,True,11,-11
5,20170403,0,Mon,COL,NL,1,MIL,NL,1,7,...,6,guerj003,Junior Guerra,1,,Y,True,False,-2,2
6,20170403,0,Mon,ATL,NL,1,NYN,NL,1,0,...,2,syndn001,Noah Syndergaard,1,,Y,False,True,6,-6
7,20170403,0,Mon,MIA,NL,1,WAS,NL,1,2,...,2,stras001,Stephen Strasburg,1,,Y,False,True,2,-2
8,20170403,0,Mon,TOR,AL,1,BAL,AL,1,2,...,4,hardj003,J.J. Hardy,6,,Y,False,True,1,-1
9,20170403,0,Mon,PIT,NL,1,BOS,AL,1,3,...,5,leons001,Sandy Leon,2,,Y,False,True,2,-2


**PROBLEM**

Repeat the above but for visiting statistics.  By the end of this, you should have two dataframes `home_df` and `visit_df`.

In [114]:
def visit_team_data(in_df):
    new_columns(in_df)
    visit_group = in_df.groupby('visiting_team')
    # No .head() here, so every team is kept rather than only the first five
    visit_df = visit_group[['runs_visitor', 'runs_home', 'visitor_win']].apply(sum)
    visit_df['num_visitor_games'] = visit_group.visitor_win.count()
    visit_df['RD_visitor'] = visit_df['runs_visitor'] - visit_df['runs_home']
    visit_df.index.rename('Team', inplace=True)

    return visit_df

In [115]:
visit_team_data(in_df).head()

Team,runs_visitor,runs_home,visitor_win,num_visitor_games,RD_visitor
ANA,354.0,374.0,37.0,81,-20.0
ARI,355.0,313.0,41.0,81,42.0
ATL,386.0,400.0,35.0,81,-14.0
BAL,348.0,434.0,29.0,81,-86.0
BOS,398.0,319.0,45.0,81,79.0


### Merging DataFrames

Using the `merge()` method, we can join our two dataframes on a shared key, here `Team`.

```python
overall_df = home_df.merge(visit_df, how='outer', left_on='Team', right_on='Team')
```
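One detail to watch when merging the home and visiting summaries is that both contain `runs_home` and `runs_visitor`, and pandas disambiguates such overlaps with the default suffixes `_x` and `_y`. The sketch below passes explicit suffixes instead. The two small frames reuse the ARI and BOS run totals from the outputs above; the names `home_toy`, `visit_toy`, and the suffix strings are illustrative choices.

```python
import pandas as pd

# Run totals for two teams, copied from the home and visiting summaries above.
home_toy = pd.DataFrame(
    {"Team": ["ARI", "BOS"], "runs_home": [457, 387], "runs_visitor": [346, 349]}
)
visit_toy = pd.DataFrame(
    {"Team": ["ARI", "BOS"], "runs_home": [313, 319], "runs_visitor": [355, 398]}
)

# on="Team" is shorthand for left_on="Team", right_on="Team";
# suffixes label which side each overlapping column came from.
merged = home_toy.merge(
    visit_toy, how="outer", on="Team", suffixes=("_at_home", "_on_road")
)
print(merged.columns.tolist())
```

With descriptive suffixes, later calculations do not need to remember which of `_x` and `_y` was the home table.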

**Run Differential**: sum of run differentials at home and away

**Win Percentage**: total wins over total games
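To make the two definitions concrete, here is a minimal sketch that applies them to an already merged frame. It assumes the column names produced by the functions earlier in this notebook (`home_win`, `num_home_games`, `rd_at_home`, `visitor_win`, `num_visitor_games`, `RD_visitor`) and uses the ARI and BOS rows from the outputs above; the helper name `add_overall_stats` is hypothetical.

```python
import pandas as pd

# ARI and BOS rows taken from the home and visiting summaries shown earlier.
merged_toy = pd.DataFrame({
    "Team": ["ARI", "BOS"],
    "home_win": [52, 48],
    "num_home_games": [81, 81],
    "rd_at_home": [111, 38],
    "visitor_win": [41, 45],
    "num_visitor_games": [81, 81],
    "RD_visitor": [42, 79],
})


def add_overall_stats(df):
    """Return a copy of df with rd and win_pct columns added (hypothetical helper)."""
    out = df.copy()
    # Run differential: home run differential plus away run differential.
    out["rd"] = out["rd_at_home"] + out["RD_visitor"]
    # Win percentage: total wins divided by total games played.
    total_wins = out["home_win"] + out["visitor_win"]
    total_games = out["num_home_games"] + out["num_visitor_games"]
    out["win_pct"] = total_wins / total_games
    return out


overall_toy = add_overall_stats(merged_toy)
print(overall_toy[["Team", "rd", "win_pct"]])
```

For ARI this gives a run differential of 111 + 42 = 153 and a win percentage of 93 / 162.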

**PROBLEM**

Write a function that takes in `home_df` and `visit_df` and returns an `overall_df` that merges the two dataframes and adds `rd` and `win_pct` columns.

**PROBLEM**

Write a function called `extract_inputs` that takes in an input dataframe like ours and applies, in order, the rename-columns function, the add-columns function, the home and visiting processors, and the overall dataframe merge.
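One possible shape for such a pipeline is sketched below. It is not the notebook's solution: the helpers `label`, `add_wins`, and `side_summary`, and the output column names (`home_wins`, `visit_rd`, and so on), are hypothetical stand-ins written so the example is self-contained, and the three-game input is invented.

```python
import pandas as pd

COLUMN_NAMES = {3: "visiting_team", 6: "home_team", 9: "runs_visitor", 10: "runs_home"}


def label(df):
    # Return a renamed copy; the input frame is left untouched.
    return df.rename(columns=COLUMN_NAMES)


def add_wins(df):
    return df.assign(
        home_win=df["runs_home"] > df["runs_visitor"],
        visitor_win=df["runs_visitor"] > df["runs_home"],
    )


def side_summary(df, team_col, win_col, runs_for, runs_against, prefix):
    # Per-team wins, games, and run differential for one side (home or visiting).
    grouped = df.groupby(team_col)
    out = pd.DataFrame({
        f"{prefix}_wins": grouped[win_col].sum(),
        f"{prefix}_games": grouped[win_col].count(),
        f"{prefix}_rd": grouped[runs_for].sum() - grouped[runs_against].sum(),
    })
    out.index.name = "Team"
    return out


def extract_inputs(raw):
    games = raw.pipe(label).pipe(add_wins)
    home = side_summary(games, "home_team", "home_win", "runs_home", "runs_visitor", "home")
    visit = side_summary(games, "visiting_team", "visitor_win", "runs_visitor", "runs_home", "visit")
    overall = home.join(visit, how="outer")  # both frames are indexed by Team
    overall["rd"] = overall["home_rd"] + overall["visit_rd"]
    wins = overall["home_wins"] + overall["visit_wins"]
    played = overall["home_games"] + overall["visit_games"]
    overall["win_pct"] = wins / played
    return overall.reset_index()


# Invented three-game log between teams "A" and "B", in raw integer-labelled form.
raw = pd.DataFrame({3: ["A", "B", "A"], 6: ["B", "A", "B"], 9: [5, 3, 4], 10: [6, 1, 2]})
overall_df = extract_inputs(raw)
print(overall_df[["Team", "rd", "win_pct"]])
```

Because every helper returns a new frame instead of modifying its argument, `extract_inputs` can be called repeatedly on the same raw dataframe without side effects.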