# Goalies from 1950-2020 - Narrowing the Scope

In this notebook, we further process and narrow down the data collected in the previous notebook to prepare it for analysis.

In [1]:
import pandas as pd

First, let's import the datasets that were prepared in the previous notebook.

In [3]:
goalieStatsDf = pd.read_csv('GoalieStats1950-2020.csv', index_col = [0])
goalieStatsDf.head()

Unnamed: 0,Player,Season,Team,S/C,GP,GS,W,L,T,OT,...,G,A,P,PIM,Birth Year,S/P,Nationality,Draft Round,HOF,Age
1768,Aaron Dell,20162017,SJS,L,20,17,11,6,--,1,...,0,0,0,0,1989,AB,CAN,--,N,27
1390,Aaron Dell,20172018,SJS,L,29,22,15,5,--,4,...,0,1,1,0,1989,AB,CAN,--,N,28
1949,Aaron Dell,20182019,SJS,L,25,20,10,8,--,4,...,0,0,0,0,1989,AB,CAN,--,N,29
1698,Aaron Dell,20192020,SJS,L,33,30,12,15,--,3,...,0,0,0,0,1989,AB,CAN,--,N,30
3424,Aaron Dell,20202021,NJD,L,7,5,1,5,--,0,...,0,0,0,0,1989,AB,CAN,--,N,31


We do the same for the biographical information.

In [4]:
biosDf = pd.read_csv('GoaliesBio.csv', index_col = [0])
biosDf.head()

Unnamed: 0,Player,Team,S/C,DOB,Birth City,S/P,Ctry,Ntnlty,Ht,Wt,...,Round,Overall,1st Season,HOF,GP,W,L,T,OT,SO
0,Rob Zepp,--,L,1981-09-07,Scarborough,ON,CAN,CAN,74,198,...,4,110,20142015,N,10,5,2,--,0,0
1,Jeff Zatkoff,--,L,1987-06-09,Detroit,MI,USA,USA,74,186,...,3,74,20132014,N,48,18,21,--,4,1
2,Michael Zanier,--,L,1962-08-22,Trail,BC,CAN,CAN,71,189,...,--,--,19841985,N,3,1,1,1,--,0
3,Artyom Zagidulin,--,L,1995-08-08,Magnitogorsk,--,RUS,RUS,74,180,...,--,--,20202021,N,1,0,0,--,0,0
4,Matt Zaba,--,L,1983-07-14,Yorkton,SK,CAN,CAN,73,190,...,8,231,20092010,N,1,0,0,--,0,0


For our purposes, we want the career save % in the bios table, because eventually we want to compare save % with HOF status. Let's add it.

First, we need to calculate the career save % for each player in goalieStatsDf. Here it is computed as the plain mean of the player's season Sv% values, so every season counts equally regardless of games played.

In [24]:
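#Sv% is stored as a string in our DataFrame, so we first need to convert it to a float in order to calculate the mean.
#errors = 'coerce' turns any value that cannot be parsed into NaN, which mean() then skips.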
goalieStatsDf['Sv%'] = pd.to_numeric(goalieStatsDf['Sv%'], errors = 'coerce')
goaliesBySavePct = goalieStatsDf[['Sv%','GAA']].groupby(goalieStatsDf['Player']).mean().reset_index()
goaliesBySavePct.head()

Unnamed: 0,Player,Sv%,GAA
0,Aaron Dell,0.899,2.992
1,Adam Berkhoel,0.882,3.8
2,Adam Hauser,0.75,7.08
3,Adam Munro,0.8865,3.33
4,Adam Werner,0.914,3.42
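
Because the mean above treats a 2-game season the same as a 60-game season, it can help to compare it with a games-weighted average. The following is only a minimal sketch on invented toy rows (the player names and values are made up, not taken from our data), and it assumes `GP` is numeric. It also shows what `errors = 'coerce'` does to an unparseable value.

```python
import pandas as pd

# Toy season rows; the names and values are invented for illustration only.
seasons = pd.DataFrame({
    'Player': ['Goalie A', 'Goalie A', 'Goalie B'],
    'GP': [60, 2, 30],
    'Sv%': ['0.920', '0.800', '--'],
})

# '--' cannot be parsed as a number, so errors='coerce' turns it into NaN.
seasons['Sv%'] = pd.to_numeric(seasons['Sv%'], errors='coerce')

# Unweighted mean: every season counts equally.
unweighted = seasons.groupby('Player')['Sv%'].mean()

# Games-weighted mean: seasons with more games played count for more.
# Rows without a usable Sv% are dropped first so they add no weight.
valid = seasons.dropna(subset=['Sv%'])
weighted = (
    (valid['Sv%'] * valid['GP']).groupby(valid['Player']).sum()
    / valid['GP'].groupby(valid['Player']).sum()
)

print(unweighted)
print(weighted)
```

A shot-weighted average would be closer to a true career save %, but that needs saves and shots-against columns, which we are not assuming here.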


In [9]:
goaliesByGAA = goalieStatsDf['GAA'].groupby(goalieStatsDf['Player']).mean().reset_index()
goaliesByGAA.head()

Unnamed: 0,Player,GAA
0,Aaron Dell,2.992
1,Adam Berkhoel,3.8
2,Adam Hauser,7.08
3,Adam Munro,3.33
4,Adam Werner,3.42
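
Since goaliesBySavePct already holds the average GAA as well, both averages can come from a single groupby. As a hedged sketch on invented toy data, pandas named aggregation is one way to do that while giving the result columns explicit names; the names `career_sv_pct` and `career_gaa` are placeholders chosen for this example.

```python
import pandas as pd

# Toy season rows; the names and values are invented for illustration only.
stats = pd.DataFrame({
    'Player': ['Goalie A', 'Goalie A', 'Goalie B'],
    'Sv%': [0.910, 0.890, 0.875],
    'GAA': [2.50, 3.10, 3.56],
})

# One pass over the groups, one output column per (source column, function) pair.
career = (
    stats.groupby('Player', as_index=False)
         .agg(career_sv_pct=('Sv%', 'mean'), career_gaa=('GAA', 'mean'))
)

print(career)
```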


In [25]:
#Now attach the averages in goaliesBySavePct (which already holds both Sv% and GAA) to biosDf

#Look up each player's averages by name. A player with no rows in goaliesBySavePct
#gets NaN, instead of leaving the new columns shorter than biosDf.
careerAverages = goaliesBySavePct.set_index('Player')

biosDf['Career Save %'] = biosDf['Player'].map(careerAverages['Sv%'])
biosDf['GAA'] = biosDf['Player'].map(careerAverages['GAA'])
biosDf.head()

Unnamed: 0,Player,Ntnlty,Overall,1st Season,HOF,GP,W,Career Save %,GAA
0,Rob Zepp,CAN,110,20142015,N,10,5,0.888,2.89
1,Jeff Zatkoff,USA,74,20132014,N,48,18,0.91225,2.4925
2,Michael Zanier,CAN,--,19841985,N,3,1,0.88,3.89
3,Artyom Zagidulin,RUS,--,20202021,N,1,0,0.818,4.25
4,Matt Zaba,CAN,231,20092010,N,1,0,0.875,3.56
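
One thing to watch when attaching per-player averages to the bios table is a player who appears in one table but not the other. The sketch below, on invented toy data, uses a left merge with `indicator=True` to list such players and `validate='many_to_one'` to confirm that the averages table has one row per player. It is an illustration of the check, not a description of how our tables actually line up.

```python
import pandas as pd

# Toy tables; the names and values are invented for illustration only.
bios = pd.DataFrame({
    'Player': ['Goalie A', 'Goalie B', 'Goalie C'],
    'HOF': ['N', 'Y', 'N'],
})
averages = pd.DataFrame({
    'Player': ['Goalie A', 'Goalie B'],
    'Sv%': [0.900, 0.915],
    'GAA': [2.80, 2.40],
})

# how='left' keeps every bios row; validate raises if 'Player' repeats in averages;
# indicator adds a '_merge' column saying where each row found a match.
merged = bios.merge(averages, on='Player', how='left',
                    validate='many_to_one', indicator=True)

# Players present in bios but missing from averages.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Player'].tolist()

print(merged)
print(unmatched)
```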


In [26]:
#biosDf has 22 columns. Let's look at the names to decide which ones we want to keep.

biosDf.columns

Index(['Player', 'Ntnlty', 'Overall', '1st Season', 'HOF', 'GP', 'W',
       'Career Save %', 'GAA'],
      dtype='object')

In [27]:
#let's narrow down the columns to just the ones we're interested in

biosDf = biosDf[['Player','Ntnlty', 'Overall', '1st Season', 'HOF', 'GP', 'W', 'Career Save %', 'GAA']]

In [28]:
biosDf.columns

Index(['Player', 'Ntnlty', 'Overall', '1st Season', 'HOF', 'GP', 'W',
       'Career Save %', 'GAA'],
      dtype='object')

In [29]:
#Let's export the biosDf for later use

biosDf.to_csv('GoalieBios1950-2020.csv')
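
By default `to_csv` also writes the index as an unnamed first column, which is why we read our files back with `index_col = [0]`. As a small sketch on an invented toy frame, here is how the round trip behaves with and without the index; it uses an in-memory buffer rather than a real file.

```python
import io
import pandas as pd

# Toy frame; the names and values are invented for illustration only.
df = pd.DataFrame({'Player': ['Goalie A', 'Goalie B'], 'GAA': [2.80, 2.40]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)

with_index.seek(0)
reloaded_default = pd.read_csv(with_index)  # index comes back as 'Unnamed: 0'

with_index.seek(0)
reloaded_index_col = pd.read_csv(with_index, index_col=0)  # index restored

# index=False: only the data columns are written.
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
reloaded_plain = pd.read_csv(no_index)

print(list(reloaded_default.columns))
print(list(reloaded_index_col.columns))
print(list(reloaded_plain.columns))
```

Either convention works as long as the notebook that reads the file matches the one that wrote it.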

Let's also narrow down the other table into a version that contains just what we need.

In [30]:
goalieStatsDf.columns

Index(['Player', 'Season', 'GP', 'W', 'Sv%', 'GAA', 'Age'], dtype='object')

In [31]:
goalieStatsDf = goalieStatsDf[['Player','Season','GP','W','Sv%','GAA','Age']]

In [32]:
goalieStatsDf.columns

Index(['Player', 'Season', 'GP', 'W', 'Sv%', 'GAA', 'Age'], dtype='object')

In [33]:
#Let's export that one too, then move on to the next notebook!

goalieStatsDf.to_csv('GoalieStats1950-2020-Filtered.csv')