# Age Curves in the NHL (2018-2024)

## DSCI 235

### Justin Eldridge, Cody Farris


# Introduction

We wanted to know how long players at various positions can expect to remain competitive in the NHL. We were specifically interested in how quickly players reach their peak performance, the age at which they peak, and how quickly different aspects of their performance decline with age. To investigate this, we collected six seasons of skater and goalie data (2018-19 through 2023-24) from hockey-reference.com. We then selected several performance metrics and plotted each one against age with a LOESS smoother to examine how performance changes as players get older.

# Background

Each team has six players on the ice at a given time: three forwards (center, left wing, right wing), two defensemen, and a goalie.


# Data Acquisition and Cleaning:

We collected the skater and goalie data for each season as comma-separated .txt files and read them into Python. Exploratory data analysis showed that some of the player positions were labeled inconsistently. For example, some players were labeled generically as forwards (F), while others were labeled with a specific forward position such as center (C). We also found that some players filled multiple roles and therefore had multiple positions listed. To fix this, we created indicator variables for each position so that players who played multiple positions would be included in the graphs for each of those positions. With this cleaning complete, we created one overall skater data frame and one overall goalie data frame, each containing all six seasons. We then moved on to picking performance metrics and plotting.

# Performance Metrics:

Goalies:
* Save Percentage (SV%): Saves / shots against.
* Goals Against Average (GAA): Goals allowed per 60 minutes of playing time.
* Goals Saved Above Average (GSAA): Number of goals saved above what a league-average goalie would have saved on the same number of shots.
* Goalie Point Shares (GPS): Estimated number of points contributed to the team total due to play in goal.
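
As a rough illustration of the first three goalie metrics, the sketch below computes them from a single made-up season line. The function names and all of the numbers (including the league-average save percentage) are hypothetical and were chosen only to make the arithmetic easy to follow. The GSAA formula is the commonly used definition and is assumed here, not taken from our data source.

```python
def save_percentage(saves, shots_against):
    """SV% = saves / shots against."""
    return saves / shots_against


def goals_against_average(goals_against, minutes_played):
    """GAA = goals allowed per 60 minutes of playing time."""
    return goals_against * 60 / minutes_played


def goals_saved_above_average(shots_against, goals_against, league_sv_pct):
    """GSAA = goals a league-average goalie would allow on the same shots,
    minus the goals this goalie actually allowed (assumed definition)."""
    expected_goals_against = shots_against * (1 - league_sv_pct)
    return expected_goals_against - goals_against


# Hypothetical season line (not a real goalie)
shots_against = 1800
goals_against = 150
minutes_played = 3600
league_sv_pct = 0.910  # hypothetical league average

saves = shots_against - goals_against
sv_pct = save_percentage(saves, shots_against)
gaa = goals_against_average(goals_against, minutes_played)
gsaa = goals_saved_above_average(shots_against, goals_against, league_sv_pct)

print(f"SV%:  {sv_pct:.3f}")
print(f"GAA:  {gaa:.2f}")
print(f"GSAA: {gsaa:.1f}")
```

A positive GSAA means the goalie allowed fewer goals than a league-average goalie would be expected to allow on the same workload.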

Skaters:
* Relative Corsi Percentage (CF%rel): Measures a player's impact on team puck possession and on the number of shot attempts.
* Plus/Minus (+/-): Goals for minus goals against while the player is on the ice (with teams at even strength).
* Goals Plus Assists: Number of goals plus number of assists.
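
To make the Corsi columns more concrete, the sketch below recomputes CF% from the CF (shot attempts for) and CA (shot attempts against) values of three skaters that appear in the skater table printed later in this notebook. It then backs out the off-ice CF% implied by CF% rel, assuming the common definition that CF% rel is on-ice CF% minus the team's CF% while the player is off the ice. That definition and the column names `CF%_check` and `implied_off_ice_CF%` are our assumptions for illustration.

```python
import pandas as pd

# Values copied from the first rows of the 2018-19 skater table
rows = pd.DataFrame({
    "Player": ["Justin Abdelkader", "Noel Acciari", "Andrew Agozzino"],
    "CF": [348, 790, 127],
    "CA": [439, 894, 104],
    "CF%": [44.2, 46.9, 55.0],
    "CF% rel": [-1.9, -3.5, 8.0],
})

# CF% is the share of all on-ice shot attempts taken by the player's team
rows["CF%_check"] = (100 * rows["CF"] / (rows["CF"] + rows["CA"])).round(1)

# Assumed definition: CF% rel = on-ice CF% - off-ice CF%,
# so the implied off-ice CF% is CF% - CF% rel
rows["implied_off_ice_CF%"] = (rows["CF%"] - rows["CF% rel"]).round(1)

print(rows[["Player", "CF%", "CF%_check", "CF% rel", "implied_off_ice_CF%"]])
```

A negative CF% rel, as for the first two skaters, means the team controlled a smaller share of shot attempts with that player on the ice than without him.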


# Plotting and LOESS

We decided to employ LOESS (Locally Estimated Scatterplot Smoothing) to capture any trend in player performance. We used the lowess() function from the statsmodels package and then used bootstrap sampling to calculate a 95% confidence interval. The LOESS model has a smoothing parameter, which we called frac_val, that controls what fraction of the data is used for each local fit. With a large smoothing parameter, more of the surrounding data is considered as the curve is shaped. This makes the curve less sensitive to outliers, but it may not be flexible enough to capture the relationship. On the other hand, a small smoothing parameter ($\approx$ 0.1) makes the curve much more flexible, making it better equipped to capture local detail. However, this can also make it prone to overfitting, as we will see.

# Code:

In [2]:
#Import the appropriate libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Import the package for LOESS
import statsmodels.api as sm
import os

#Set the working directory
os.chdir("C:\\Users\\justi\\Desktop\\2025\\DSCI 235\\DSCI235-Project\\235_Project")

## Data Cleaning and Creation:

### Goalie Data:

In [3]:
#Load in the goalie data, one file per season
#(paths are relative to the working directory set above)
df1 = pd.read_csv("csv/Goalies 18-19.txt")
df2 = pd.read_csv("csv/Goalies 19-20.txt")
df3 = pd.read_csv("csv/Goalies 20-21.txt")
df4 = pd.read_csv("csv/Goalies 21-22.txt")
df5 = pd.read_csv("csv/Goalies 22-23.txt")
df6 = pd.read_csv("csv/Goalies 23-24.txt")

#Combine the seasons into a single data frame (rows stacked, axis=0)
goalie_df = pd.concat([df1, df2, df3, df4, df5, df6], axis=0)

#View the results
print(goalie_df.head())

#Drop the last column ('-9999'), which holds player ID codes such as dubnyde01
goalie_df = goalie_df.drop(columns=['-9999'])

#Save the results to a csv file
goalie_df.to_csv("goalie_data.csv")

   Rk             Player   Age Team Pos  GP  GS   W   L  T/O  ...   GA%-  \
0   1       Devan Dubnyk  32.0  MIN   G  67  66  31  28    6  ...   96.0   
1   2        Carey Price  31.0  MTL   G  66  64  35  24    6  ...   92.0   
2   3  Connor Hellebuyck  25.0  WPG   G  63  62  34  23    3  ...   97.0   
3   4   Sergei Bobrovsky  30.0  CBJ   G  62  61  37  24    1  ...   97.0   
4   5       Martin Jones  29.0  SJS   G  62  62  36  19    5  ...  115.0   

   GSAA  GAA/A   GPS  G  A PTS  PIM            Awards      -9999  
0   6.2   2.69  11.1  0  2   2    2               ASG  dubnyde01  
1  14.9   2.64  12.5  0  1   1    2   ASnhl-3Vezina-7  priceca01  
2   5.9   3.09  12.1  0  3   3    4               NaN  helleco01  
3   5.3   2.75  10.4  0  0   0    2  ASnhl-11Vezina-9  bobrose01  
4 -22.9   3.14   7.0  0  1   1    2               NaN  jonesma02  

[5 rows x 30 columns]


### Skater Data:

In [None]:
h1 = pd.read_csv('csv/1819_Hockey.txt')
h2 = pd.read_csv('csv/1920_Hockey.txt')
h3 = pd.read_csv('csv/2021_Hockey.txt')
h4 = pd.read_csv('csv/21-22 Season.txt')
h5 = pd.read_csv('csv/2223_Hockey.txt')
h6 = pd.read_csv('csv/23-24 Season.txt')

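#Stack the seasons row-wise (axis=0, like rbind in R).
#For reference, pd.concat([a,b], axis=1) is the equivalent of cbind in R.
hockey_df = pd.concat([h1,h2,h3,h4,h5,h6], axis=0)

#Drop the last column ('-9999'), which holds player ID codes such as abdelju01
hockey_df = hockey_df.drop(columns = ['-9999'])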






   Rk             Player  Age   Tm Pos  GP   CF   CA   CF%  CF% rel  ...  \
0   1  Justin Abdelkader   32  DET  LW  49  348  439  44.2     -1.9  ...   
1   2       Pontus Åberg   26  TOR  LW   5   31   50  38.3    -18.2  ...   
2   3     Vitaly Abramov   21  OTT  RW   2   11   17  39.3    -14.7  ...   
3   4       Noel Acciari   28  FLA   C  66  790  894  46.9     -3.5  ...   
4   5    Andrew Agozzino   29  TOT  LW  22  127  104  55.0      8.0  ...   

   oZS%  dZS%  TOI/60  TOI(EV)  TK  GV  E+/-  SAtt.  Thru%      -9999  
0  43.3  56.7   11:32     9:48  10  18  -5.5   76.0   52.6  abdelju01  
1  60.0  40.0    8:42     8:42   1   2  -1.0    5.0   80.0  abergpo01  
2  33.3  66.7    5:47     5:47   0   1  -0.3    4.0   75.0  abramvi01  
3  38.6  61.4   15:57    13:08  32  21  -7.0  189.0   57.1  acciano01  
4  53.4  46.6    7:21     6:49   5   4  -0.4    NaN    NaN  agozzan01  

[5 rows x 27 columns]


We wanted to determine how many players there were at each position. We found that the naming conventions were not consistent. Although center is a forward position, "F" and "C" are encoded as separate labels. We also found that some players fill multiple roles.

In [6]:
#List the distinct position labels
hockey_df['Pos'].unique()

#Count the number of rows with each position label
print(hockey_df['Pos'].value_counts())

Pos
D        2232
C        1995
LW       1034
RW        832
F         213
C/LW       69
C/RW       45
LW/C       21
W          16
C/W        12
D/RW       11
RW/C        7
LW/RW       2
W/C         1
Name: count, dtype: int64


To fix this, we decided to create an indicator variable for each position. This way, if a player played as both a Center and a Right Wing, their data will be included in both graphs. Since we now had indicators, we dropped the original Pos column (along with the Rk column).

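#Create a 0/1 indicator for each position based on the letters in the Pos label
#(na=False treats a missing position as "not this position")
hockey_df['F'] = np.where(hockey_df['Pos'].str.contains('F', na=False), 1, 0)
hockey_df['D'] = np.where(hockey_df['Pos'].str.contains('D', na=False), 1, 0)
hockey_df['C'] = np.where(hockey_df['Pos'].str.contains('C', na=False), 1, 0)
#'W' matches LW, RW, and W
hockey_df['W'] = np.where(hockey_df['Pos'].str.contains('W', na=False), 1, 0)

#The indicators replace Pos, and the per-season rank (Rk) is not needed
hockey_df = hockey_df.drop(columns=['Rk','Pos'])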
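
To check how the indicators behave on the mixed labels, the short sketch below applies the same `str.contains` logic to a small sample of position labels taken from the counts above, plus one missing value. The data frame name `sample` is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd

# A few of the position labels seen in the data, plus one missing value
sample = pd.DataFrame({"Pos": ["D", "C", "LW", "RW", "F", "C/LW", "D/RW", "W/C", None]})

# Same letter-matching rule as the indicators built for hockey_df
for letter in ["F", "D", "C", "W"]:
    sample[letter] = np.where(sample["Pos"].str.contains(letter, na=False), 1, 0)

print(sample)
```

A label such as C/LW sets both the C and W indicators, so that player appears in both graphs. A player labeled only F sets just the F indicator and is therefore not counted among centers or wingers.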

# Results

# Goalies:

## Save Percentage Vs. Age