# Are players younger now?

What are some of the trends we're seeing with player usage and performance by age?

Specifically, what are the distributions of playing time (PA for batters) and output (WAR) by age?  How do they compare over time?

We'll look at two metrics. The first is **mean age (weighted by PA or WAR)**, a single number that can be computed for each season (or period of seasons) to show trends over time. Once that helps us identify some time periods to examine, we'll dig into the age trends of those periods using **age distribution curves** (e.g., WAR by age or PA by age).

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

In [2]:
# Read in all batting stats data (including WAR components) since integration
batting = pd.read_parquet('../data/pybaseball/batting_1947-2019.parquet')

### Mean Age (weighted by PA or WAR)

The first step is to find a way to see trends across time.  How can we compute a single number that describes the age of a player population for a single season (or for a set of seasons)?

We could compute the mean age of a pool of players.  One can compute a simple mean that just weights all players equally.  But that leaves out a lot of the story, as players do not contribute equally to the population.  So we can weight by playing time; say, for each PA in a season, we take the age of the batter, then average those ages out -- that would be a mean age weighted by PA.  This would give us a better view of the average age of batters, based on who actually bats.

If we want this view, but based on who actually *produces*, we can do the same thing with WAR.  Well, it's not the exact same thing, because 1 WAR is not a discrete event like 1 PA is, but it ends up being the same formula.

We can then plot the mean age over time (and use a rolling average to reduce the noise) to see trends.
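As a small worked illustration of the weighting idea, here is a sketch on made-up numbers (the three batters and their PA totals are hypothetical, not taken from the dataset). It compares the unweighted mean age with the PA-weighted mean, which is just the sum of age × PA divided by total PA.

```python
import numpy as np

# Hypothetical toy season: three batters (not real data)
ages = np.array([22, 28, 36])
pa = np.array([600, 500, 100])   # plate appearances for each batter

simple_mean = ages.mean()                      # every player counts equally
weighted_mean = np.average(ages, weights=pa)   # every PA counts equally

# Same result written out by hand: sum(age * PA) / sum(PA)
by_hand = (ages * pa).sum() / pa.sum()

print(round(simple_mean, 2), round(weighted_mean, 2), round(by_hand, 2))
```

In this toy case the weighted mean comes out about three years lower than the simple mean, because the 22-year-old takes most of the plate appearances and the 36-year-old takes few.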

In [3]:
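# Get mean age, season-by-season, weighted by a particular category
def get_age_weighted_by_category(data, category):
    # Keep only rows with a positive total so the weights are non-negative,
    # sum the category by (Season, Age), then pivot to ages (rows) x seasons (columns)
    by_age = (
        data.loc[data[category] > 0, ['Season', 'Age', category]]
        .groupby(['Season', 'Age'])[category].sum()
        .unstack(level='Season')
        .fillna(0)
    )
    # For each season (column), average the ages using that season's totals as weights
    mean_age = by_age.apply(lambda col: np.average(by_age.index, weights=col))
    return mean_age.rename(f'by_{category}')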

In [4]:
mean_age_by_war = get_age_weighted_by_category(batting, 'WAR')
px.line(mean_age_by_war, labels={'value': 'mean age', 'variable': 'aggregated'})

OK, we can see some trends, but there is a lot of bouncing around from year to year. That makes sense: the pool of players ages every year and also turns over, but not uniformly. We'll use rolling averages to see the broader trends, starting with a 20-year rolling average to see the long arcs of time.
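One thing to keep in mind when reading rolling-average plots: a centered window needs data on both sides, so pandas leaves the seasons near either end of the range empty (NaN). A minimal sketch on a synthetic series (the values are made up; only the 1947-2019 season range matches the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a season-indexed mean-age series (values are made up)
seasons = pd.Index(range(1947, 2020), name='Season')
toy = pd.Series(np.linspace(29.0, 28.0, len(seasons)), index=seasons)

smoothed = toy.rolling(window=20, center=True).mean()

# Seasons without a full 20-season window come back as NaN
valid = smoothed.dropna()
print(len(toy), len(valid), int(smoothed.isna().sum()))
```

With a 20-season window, 19 of the 73 seasons have no smoothed value, so the long-window curve covers a narrower span of years than the raw season-by-season plot.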

In [5]:
# Let's write a function for computing and plotting rolling averages of mean age
def plot_rolling_weighted_age(data, stats, window, title="Mean age of batters, weighted"):
    mean_ages = [get_age_weighted_by_category(data, stat).rolling(window=window, center=True).mean() for stat in stats]
    mean_ages_df = pd.concat(mean_ages, axis=1)
    # Pass the title through; it was previously accepted but never used
    return px.line(mean_ages_df, title=title, labels={'value': 'mean age', 'variable': 'aggregated'})

# Let's start with the long arcs of time: 20-year rolling averages for mean age, weighted by PA and WAR
plot_rolling_weighted_age(batting, ['PA', 'WAR'], 20, "Mean age of batters, weighted by PA or WAR")

OK, we can generally see three eras: the game getting younger from integration through about 1970, then aging for three decades, and then getting younger again in the 21st century.

The two curves (PA and WAR) basically show the same trends.  Let's take a brief aside, and see how we can use the same concept of mean age to reveal other trends about the game.  For example, we can see what kind of value is provided by players of different ages.  'Bat', 'Def' and 'Pos' are the batting, defensive and positional components of WAR.  Notice how teams have always gotten their batting value from older players and defensive value from younger players (while we also see the same long-term arcs of mean age):

In [6]:
plot_rolling_weighted_age(batting, ['Bat', 'Def', 'Pos'], 20)


OK, let's go back to age trends over time, specifically to see how the present day compares historically. The earlier 20-year graph made it clear that the league has been getting younger since its peak. Shorter windows (e.g., 5 years) give a better view of the smaller fluctuations, and they show that the trend toward youth has been strong enough that the league is the youngest it has been since the 1970s.

In [7]:
plot_rolling_weighted_age(batting, ['PA', 'WAR'], 5, "Mean age of batters, weighted by PA or WAR")

### Age Distribution Curves

The mean age gives us a way to summarize the age of a player pool with a single number, so we can easily see trends over time.  Alternatively, we can get a more detailed view of the distribution of playing time or output by age, looking at a fixed set of years.  For example, here is the age distribution curve for PA for all years since integration:

In [8]:
# Generate an age distribution curve for a chosen stat
def get_age_curve(data, stat):
    stats = data[data[stat]>0][['Age', stat]]
    return stats.groupby(['Age']).sum()

px.line(get_age_curve(batting, 'PA'))

OK, no surprises there: most of the playing time goes to players between 24 and 32, peaking around 26-27, with tails into the teens and forties.

Where it gets interesting is when we overlay age curves from two different eras, to see how things have changed.  Looking back at the mean age curves from the last section, we can see that the aging trend peaked in the early 2000s (let's say 2000-04), so let's compare that to the most recent 5 years in the data (2015-19):

In [9]:
# Generate an age distribution curve over a specific range of years
def get_age_curve_from_years(data, stat, year_start, year_end):
    data_yrs = data[data['Season'].isin(range(year_start,year_end+1))]
    return get_age_curve(data_yrs, stat).rename(columns={stat:f'{year_start}-{year_end}'})

# Compare the distribution curves (for PA and WAR) from the early 2000s to the late 2010s
age_curves = {}
for stat in ['PA', 'WAR']:
    ranges = [(2000, 2004), (2015, 2019)]
    age_curves[stat] = [get_age_curve_from_years(batting, stat, yr_start, yr_end) for (yr_start, yr_end) in ranges]
    px.line(pd.concat(age_curves[stat], axis=1).fillna(0), title=f'Distribution of {stat} by age').show()

We can see that the curves are generally shaped the same, but clearly shifted left.  It looks like the PA curve has moved left by about a year, and the WAR curve movement is even more prominent, around two years.
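The leftward shift can also be estimated numerically rather than by eye, by taking the weighted mean age of each curve and differencing the two. A sketch under assumptions: the helper `curve_mean_age` and the toy curves below are hypothetical, standing in for the age-indexed frames built from `get_age_curve_from_years`.

```python
import numpy as np
import pandas as pd

def curve_mean_age(curve):
    """Weighted mean age for each column of an age-indexed curve (hypothetical helper)."""
    return curve.apply(lambda col: np.average(curve.index, weights=col))

# Toy age curves (made-up totals): 'later' is 'earlier' shifted one year younger
ages = pd.Index([24, 25, 26, 27, 28, 29], name='Age')
toy_curves = pd.DataFrame(
    {'earlier': [0, 10, 30, 40, 30, 10],
     'later':   [10, 30, 40, 30, 10, 0]},
    index=ages,
)

means = curve_mean_age(toy_curves)
shift = means['later'] - means['earlier']   # negative means the curve moved younger
print(means.round(2).to_dict(), round(shift, 2))
```

A single number like this hides changes in the shape of the curve, so it complements the overlay plot rather than replacing it.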

Or looking vertically, to see how much production has varied by age: it appears that a lot of the value from players 33 and older has disappeared, while the value from players in their early 20s has increased by a similar amount.  Another way to look at this is cumulative WAR by age.  How much WAR is earned by players age X and younger?

In [10]:
war_by_age = pd.concat(age_curves['WAR'], axis=1).fillna(0)
cum_war_by_age = war_by_age.cumsum()
px.line(cum_war_by_age).show()
cum_war_by_age

| Age | 2000-2004 | 2015-2019 |
|---:|---:|---:|
| 19 | 0.2 | 3.7 |
| 20 | 5.3 | 22.5 |
| 21 | 32.7 | 62.1 |
| 22 | 84.8 | 184.5 |
| 23 | 170.7 | 364.3 |
| 24 | 349.2 | 653.2 |
| 25 | 616.8 | 1008.0 |
| 26 | 956.5 | 1375.1 |
| 27 | 1291.0 | 1740.7 |
| 28 | 1659.4 | 2073.7 |


Through age 25, players in 2015-2019 earned roughly 60% more WAR than their counterparts in 2000-2004 (about 1,008 vs. 617).  The lead for young players peaks at age 27, where the recent group is about 450 WAR ahead of its predecessors.
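These figures can be checked directly against the cumulative WAR output above. The sketch below re-enters four of the displayed rows by hand (ages 25 through 28) rather than recomputing them from `batting`:

```python
import pandas as pd

# Cumulative WAR values copied from the output above (ages 25-28 only)
cum = pd.DataFrame(
    {'2000-2004': [616.8, 956.5, 1291.0, 1659.4],
     '2015-2019': [1008.0, 1375.1, 1740.7, 2073.7]},
    index=pd.Index([25, 26, 27, 28], name='Age'),
)

# Relative increase in WAR earned through age 25
ratio_at_25 = cum.loc[25, '2015-2019'] / cum.loc[25, '2000-2004']

# Absolute gap at each age, and the age where it is largest among these rows
gap = cum['2015-2019'] - cum['2000-2004']

print(f"WAR through age 25: {ratio_at_25 - 1:.0%} more in 2015-2019")
print(f"Largest gap among these rows: {gap.max():.1f} WAR at age {gap.idxmax()}")
```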

A better way to look at the value provided by older players is essentially the reverse of this: WAR remaining by age. (And no offense to the "older players"; I'm 10 years older than some of these guys.)

In [11]:
remaining_war_by_age = cum_war_by_age.max()-cum_war_by_age
px.line(remaining_war_by_age).show()
remaining_war_by_age

| Age | 2000-2004 | 2015-2019 |
|---:|---:|---:|
| 19 | 3403.4 | 3368.4 |
| 20 | 3398.3 | 3349.6 |
| 21 | 3370.9 | 3310.0 |
| 22 | 3318.8 | 3187.6 |
| 23 | 3232.9 | 3007.8 |
| 24 | 3054.4 | 2718.9 |
| 25 | 2786.8 | 2364.1 |
| 26 | 2447.1 | 1997.0 |
| 27 | 2112.6 | 1631.4 |
| 28 | 1744.2 | 1298.4 |


Modern teams are getting about half as much value from players 33 and older as their predecessors did early in the century.
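The rows for ages 33 and up are not shown in the output above, so as a sketch, here is one way that comparison could be computed from an age-indexed WAR frame shaped like `war_by_age`. The helper name `war_from_age`, the column labels, and the toy numbers are all hypothetical and are not results from the dataset.

```python
import pandas as pd

def war_from_age(war_by_age, min_age):
    """Total WAR per period from players aged min_age or older (hypothetical helper)."""
    return war_by_age[war_by_age.index >= min_age].sum()

# Toy frame with made-up totals: Age index, one column per period
toy = pd.DataFrame(
    {'earlier': [300.0, 200.0, 100.0, 60.0, 40.0],
     'later':   [320.0, 180.0, 60.0, 40.0, 20.0]},
    index=pd.Index([31.0, 32.0, 33.0, 34.0, 35.0], name='Age'),
)

older = war_from_age(toy, 33)
ratio = older['later'] / older['earlier']   # share of the earlier period's 33+ WAR
print(older.to_dict(), round(ratio, 2))
```

Applied to the real `war_by_age` frame, the same ratio would put a number on the "about half" reading of the curves.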

## Conclusion: Yes, players are younger now

In [12]:
# Let's try stacked bars for the age distribution curves: one bar per period, segmented by age.
# This might show both the curve and the cumulative/remaining WAR in one view.

px.bar(war_by_age.T, color='Age', color_discrete_sequence=px.colors.diverging.Fall)
#.plot.bar(stacked=True, legend=False, figsize=(6,10), cmap='YlOrBr')