# Loading the Data

![](https://upload.wikimedia.org/wikipedia/en/5/54/Simpsons_MoneyBART_Mike_Scioscia_Promo.jpg)

Our goal is to write a few functions that process baseball game data from [Retrosheet](https://www.retrosheet.org/). We will examine the game log data; the fields in these files are described [here](https://www.retrosheet.org/gamelogs/glfields.txt).

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
in_df = pd.read_csv("data/GL2017.txt", header=None)

In [3]:
in_df.head()

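|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20170402 | 0 | Sun | SFN | NL | 1 | ARI | NL | 1 | 5 | ... | David Peralta | 9 | mathj001 | Jeff Mathis | 2 | greiz001 | Zack Greinke | 1 |  | Y |
| 1 | 20170402 | 0 | Sun | CHN | NL | 1 | SLN | NL | 1 | 3 | ... | Jedd Gyorko | 4 | gricr001 | Randal Grichuk | 7 | martc006 | Carlos Martinez | 1 |  | Y |
| 2 | 20170402 | 0 | Sun | NYA | AL | 1 | TBA | AL | 1 | 3 | ... | Tim Beckham | 6 | smitm007 | Mallex Smith | 7 | norrd001 | Derek Norris | 2 |  | Y |
| 3 | 20170403 | 0 | Mon | PHI | NL | 1 | CIN | NL | 1 | 4 | ... | Zack Cozart | 6 | barnt001 | Tucker Barnhart | 2 | felds001 | Scott Feldman | 1 |  | Y |
| 4 | 20170403 | 0 | Mon | SDN | NL | 1 | LAN | NL | 1 | 3 | ... | Yasmani Grandal | 2 | puigy001 | Yasiel Puig | 9 | kersc001 | Clayton Kershaw | 1 |  | Y |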


In [5]:
in_df.rename(columns={3: 'visiting_team', 6: 'home_team', 9: 'runs_visitor', 10: 'runs_home'}, inplace = True)

In [7]:
in_df.loc[0:5, ['visiting_team', 'home_team', 'runs_visitor', 'runs_home']]

|  | visiting_team | home_team | runs_visitor | runs_home |
|---|---|---|---|---|
| 0 | SFN | ARI | 5 | 6 |
| 1 | CHN | SLN | 3 | 4 |
| 2 | NYA | TBA | 3 | 7 |
| 3 | PHI | CIN | 4 | 3 |
| 4 | SDN | LAN | 3 | 14 |
| 5 | COL | MIL | 7 | 5 |


**PROBLEM**

Write a function that takes in a dataframe and returns a dataframe with the columns renamed as above.
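
One way to approach this, as a minimal sketch: wrap the `rename` call in a function that returns a renamed copy instead of modifying its argument. The function name `rename_columns`, the `COLUMN_NAMES` constant, and the one-row `toy` dataframe are illustrative assumptions, not names from the notebook.

```python
import pandas as pd

# Column positions (0-indexed) and names taken from the rename call above.
COLUMN_NAMES = {3: "visiting_team", 6: "home_team", 9: "runs_visitor", 10: "runs_home"}


def rename_columns(df):
    """Return a copy of df with the four game-log columns given readable names."""
    # Without inplace=True, rename leaves the caller's dataframe untouched.
    return df.rename(columns=COLUMN_NAMES)


# Hypothetical stand-in for the game log: one game, 11 integer-labelled columns.
toy = pd.DataFrame([[20170402, 0, "Sun", "SFN", "NL", 1, "ARI", "NL", 1, 5, 6]])
labeled = rename_columns(toy)
print(labeled[["visiting_team", "home_team", "runs_visitor", "runs_home"]])
```

Returning a copy rather than renaming in place means the function can be re-run on the raw dataframe without side effects.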

In [8]:
in_df['home_win'] = (in_df['runs_home'] > in_df['runs_visitor'])

In [10]:
in_df.home_win.head()

0     True
1     True
2     True
3    False
4     True
Name: home_win, dtype: bool

**PROBLEM** 

Add a new column, `visitor_win`, that indicates whether the visiting team won.

**PROBLEM**

Write a function named `new_columns` that takes in our labeled dataframe and outputs a new dataframe that contains the new columns `home_win` and `visitor_win`.
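
A possible sketch of such a function uses `DataFrame.assign`, which returns a new dataframe and leaves the input unchanged. The `toy` dataframe below is an assumption for illustration; it reuses three of the games shown in the earlier output.

```python
import pandas as pd


def new_columns(df):
    """Return a new dataframe with boolean home_win and visitor_win columns."""
    # assign builds a copy, so the input dataframe is not modified.
    return df.assign(
        home_win=df["runs_home"] > df["runs_visitor"],
        visitor_win=df["runs_visitor"] > df["runs_home"],
    )


# Illustrative data: three games from the output shown earlier.
toy = pd.DataFrame({
    "visiting_team": ["SFN", "CHN", "PHI"],
    "home_team": ["ARI", "SLN", "CIN"],
    "runs_visitor": [5, 3, 4],
    "runs_home": [6, 4, 3],
})
result = new_columns(toy)
print(result[["home_win", "visitor_win"]])
```

Because both comparisons are strict, a game with equal scores would count as a win for neither side in this sketch.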

### Home Statistics

Using the dataframe above, let's find, for each team's home games, the number of wins, the runs scored by the home and visiting teams, and the total number of games played.

In [11]:
home_group = in_df.groupby('home_team')

In [14]:
home_group[['runs_visitor', 'runs_home', 'home_win']].apply(sum).head()

| home_team | runs_visitor | runs_home | home_win |
|---|---|---|---|
| ANA | 335.0 | 356.0 | 43.0 |
| ARI | 346.0 | 457.0 | 52.0 |
| ATL | 421.0 | 346.0 | 37.0 |
| BAL | 407.0 | 395.0 | 46.0 |
| BOS | 349.0 | 387.0 | 48.0 |


In [18]:
home_df = home_group[['runs_visitor', 'runs_home', 'home_win']].apply(sum)

In [15]:
home_group['home_win'].count().head()

home_team
ANA    81
ARI    81
ATL    81
BAL    81
BOS    81
Name: home_win, dtype: int64

In [19]:
home_df['home_games'] = home_group['home_win'].count()

In [20]:
home_df.head()

| home_team | runs_visitor | runs_home | home_win | home_games |
|---|---|---|---|---|
| ANA | 335.0 | 356.0 | 43.0 | 81 |
| ARI | 346.0 | 457.0 | 52.0 | 81 |
| ATL | 421.0 | 346.0 | 37.0 | 81 |
| BAL | 407.0 | 395.0 | 46.0 | 81 |
| BOS | 349.0 | 387.0 | 48.0 | 81 |
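
As an alternative to summing and counting in separate steps, the same per-team totals can be sketched in a single call with pandas named aggregation. This is not the approach used in the cells above, and the three-game `toy` dataframe is made up for illustration.

```python
import pandas as pd

# Made-up games: two at ARI, one at BOS.
toy = pd.DataFrame({
    "home_team": ["ARI", "ARI", "BOS"],
    "runs_visitor": [5, 2, 4],
    "runs_home": [6, 1, 7],
    "home_win": [True, False, True],
})

# Each keyword names an output column: (input column, aggregation).
home_df = toy.groupby("home_team").agg(
    runs_visitor=("runs_visitor", "sum"),
    runs_home=("runs_home", "sum"),
    home_win=("home_win", "sum"),        # True counts as 1 when summed
    home_games=("home_win", "count"),    # number of rows per team
)
print(home_df)
```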


**PROBLEM**

Create a new column in the `home_df` dataframe, called `rundf_home`, that contains the run difference at home.

In [21]:
home_df.index.rename('Team', inplace=True)

In [22]:
home_df.head()

| Team | runs_visitor | runs_home | home_win | home_games |
|---|---|---|---|---|
| ANA | 335.0 | 356.0 | 43.0 | 81 |
| ARI | 346.0 | 457.0 | 52.0 | 81 |
| ATL | 421.0 | 346.0 | 37.0 | 81 |
| BAL | 407.0 | 395.0 | 46.0 | 81 |
| BOS | 349.0 | 387.0 | 48.0 | 81 |


In [23]:
home_df.reset_index(inplace=True)

In [24]:
home_df.head()

|  | Team | runs_visitor | runs_home | home_win | home_games |
|---|---|---|---|---|---|
| 0 | ANA | 335.0 | 356.0 | 43.0 | 81 |
| 1 | ARI | 346.0 | 457.0 | 52.0 | 81 |
| 2 | ATL | 421.0 | 346.0 | 37.0 | 81 |
| 3 | BAL | 407.0 | 395.0 | 46.0 | 81 |
| 4 | BOS | 349.0 | 387.0 | 48.0 | 81 |


**PROBLEM**

Write a function called `home_team_data` that bundles the steps above: it takes in a dataframe like our `in_df` and returns a dataframe like `home_df`.

**PROBLEM**

Repeat the above for visiting statistics. By the end, you should have two dataframes: `home_df` and `visit_df`.

### Merging DataFrames

Using the `merge()` method, we can join our dataframes on a shared column.

```python
overall_df = home_df.merge(visit_df, how='outer', left_on='Team', right_on='Team')
```
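
To see what `how='outer'` does, here is a small sketch with made-up numbers. The visiting-side column names (`visitor_win`, `visit_games`) are assumptions, since `visit_df` is left as an exercise, and the teams deliberately do not match so the effect of the outer join is visible.

```python
import pandas as pd

# Made-up values; visit_df's column names are hypothetical.
home_df = pd.DataFrame({"Team": ["ARI", "BOS"], "home_win": [5, 4], "home_games": [8, 8]})
visit_df = pd.DataFrame({"Team": ["ARI", "ATL"], "visitor_win": [3, 6], "visit_games": [8, 8]})

# on="Team" is shorthand for left_on="Team", right_on="Team".
overall_df = home_df.merge(visit_df, how="outer", on="Team")
print(overall_df)
```

An outer join keeps every team that appears in either dataframe and fills the missing side with `NaN`. With a full season of game logs, every team should appear in both, so no gaps are expected.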

**Run Differential**: sum of run differentials at home and away

**Win Percentage**: total wins over total games
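
The two definitions above translate directly into column arithmetic. The sketch below uses a made-up merged dataframe; the team codes and the visiting-side column names (`rundf_visit`, `visitor_win`, `visit_games`) are assumptions and should be adapted to whatever your `visit_df` produces.

```python
import pandas as pd

# Hypothetical merged dataframe with made-up teams and numbers.
overall_df = pd.DataFrame({
    "Team": ["AAA", "BBB"],
    "rundf_home": [10, -4],
    "rundf_visit": [-3, 6],
    "home_win": [5, 3],
    "visitor_win": [4, 5],
    "home_games": [8, 8],
    "visit_games": [8, 8],
})

# Run differential: home differential plus away differential.
overall_df["rd"] = overall_df["rundf_home"] + overall_df["rundf_visit"]

# Win percentage: total wins over total games.
total_wins = overall_df["home_win"] + overall_df["visitor_win"]
total_games = overall_df["home_games"] + overall_df["visit_games"]
overall_df["win_pct"] = total_wins / total_games

print(overall_df[["Team", "rd", "win_pct"]])
```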

**PROBLEM**

Write a function that takes in `home_df` and `visit_df` and returns an `overall_df` that merges the two dataframes and adds `rd` and `win_pct` columns.

**PROBLEM**

Write a function called `extract_inputs` that takes in an input dataframe like ours and applies, in order, the column-renaming function, the new-columns function, the home and visiting processors, and the overall dataframe merge.