<a href="https://www.kaggle.com/code/davideliu/5000m-data-analysis?scriptVersionId=111345305" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Italian Athletics 5000m Historical Data Analysis

What's the fastest time among all 5000m athletes every year?
What time do I need to be in the top-100?
At what age do athletes reach their peak performance?

This notebook not only leads you to the answers to the questions above, but also provides useful insights and statistics, through easy-to-read charts, on the 5000m performances of athletes running for the Italian Federation since 2005.

You can also find the original dataset and more info about it on [Kaggle](https://www.kaggle.com/datasets/davideliu/italian-athletics-historical-best-performance).

## Initialization

Load dataset and useful libraries

In [1]:
import numpy as np
import pandas as pd
import datetime

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if filename == 'athletics_IT.csv':
            df_path = os.path.join(dirname, filename)

Read the dataset and visualize raw data.

In [2]:
df_name = 'athletics_IT.csv'
df = pd.read_csv(df_path, encoding='utf8').reset_index(drop=True)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 902282 entries, 0 to 902281
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   time        902282 non-null  object 
 1   wind        154345 non-null  object 
 2   name        902282 non-null  object 
 3   birth-year  902279 non-null  float64
 4   team        902282 non-null  object 
 5   position    902282 non-null  int64  
 6   location    902282 non-null  object 
 7   date        902282 non-null  object 
 8   sex         902282 non-null  object 
 9   event       902282 non-null  object 
 10  type        902282 non-null  object 
dtypes: float64(1), int64(1), object(9)
memory usage: 75.7+ MB


## Filter dataset by event

The new dataset contains the historical yearly personal best (PB) of all 5000m athletes from 2005 to 2021.

In [3]:
analyze_event = '5000m'
df = df[df.event == analyze_event].copy()  # work on a copy so later column assignments are safe
df

| | time | wind | name | birth-year | team | position | location | date | sex | event | type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 495218 | 00:13:43.220000 | | VINCENTI Salvatore | 1972.0 | G.A. FIAMME GIALLE | 2 | Cesenatico | 2005-06-12 | M | 5000m | P |
| 495219 | 00:13:43.720000 | | ZANON Simone | 1975.0 | G.S. FIAMME ORO PADOVA | 4 | Cesenatico | 2005-06-12 | M | 5000m | P |
| 495220 | 00:13:43.830000 | | LEONE Maurizio | 1973.0 | C.S. CARABINIERI SEZ. ATLETICA | 5 | Conegliano | 2005-06-17 | M | 5000m | P |
| 495221 | 00:13:49.670000 | | LOMALA Joseph | 1982.0 | S G AMSICORA | 6 | Cesenatico | 2005-06-12 | M | 5000m | P |
| 495222 | 00:13:52.410000 | | MASCHERONI Fabio | 1977.0 | CALCESTRUZZI CORRADINI EXCELS. | 5 | Ponzano Veneto | 2005-07-01 | M | 5000m | P |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 526946 | 00:31:21.900000 | | COLABELLO Amalia maria | 1956.0 | A.P.D. PONT-SAINT-MARTIN | 2 | Saint-christophe | 2021-07-22 | F | 5000m | P |
| 526947 | 00:31:47.800000 | | MARCONE Teresa anna maria | 1954.0 | I PODISTI DI CAPITANATA | 11 | Foggia | 2021-11-21 | F | 5000m | P |
| 526948 | 00:32:05.400000 | | VERTE Luciana | 1976.0 | POLISPORTIVA MOLISE CAMPOBASSO | 2 | Campobasso | 2021-05-23 | F | 5000m | P |
| 526949 | 00:33:29.900000 | | VETRANO Maria ivana | 1959.0 | NUOVA ATLETICA COPERTINO | 18 | Lecce | 2021-06-27 | F | 5000m | P |


## Preprocess the dataset

The main operations include:
- Define a new column `age` holding each athlete's age in the year of their PB.
- Define a new column `year` (`int`) from `date` (converted with `pd.to_datetime`).
- Remove athletes younger than 16. Those rows are outliers, since the data should only cover senior categories.
- Define some `groupby` objects used to prepare the data for visualization.

In [4]:
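df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year.astype(int)
n_years = df['year'].nunique()
# Parsing a bare clock time yields a timestamp with a placeholder date; only the time part is plotted.
df['time'] = pd.to_datetime(df['time'])
# Fill the few missing birth years with the column mean (assign back rather than using inplace).
df['birth-year'] = df['birth-year'].fillna(df['birth-year'].mean())
df['age'] = (df['year'] - df['birth-year']).astype(int)
# Keep athletes aged 16 or older; .copy() avoids SettingWithCopyWarning on later column assignments.
df = df[df['age'] >= 16].copy()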
group_ys = df.groupby([df.year, df.sex])
group_as = df.groupby([df.age, df.sex])
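
To make the age derivation and the under-16 filter easier to check by hand, here is a minimal sketch on three made-up rows. The `toy` frame and its values are assumptions for illustration only; the column names follow the dataset.

```python
import pandas as pd

# Made-up rows mimicking the columns used in the preprocessing step.
toy = pd.DataFrame({
    "date": ["2005-06-12", "2021-07-22", "2021-05-23"],
    "birth-year": [1972.0, 2008.0, None],
})

toy["year"] = pd.to_datetime(toy["date"]).dt.year
# The missing birth year is replaced by the mean of the known ones (1990.0 here).
toy["birth-year"] = toy["birth-year"].fillna(toy["birth-year"].mean())
toy["age"] = (toy["year"] - toy["birth-year"]).astype(int)

# The 13-year-old row is dropped; the two remaining athletes are 33 and 31.
seniors = toy[toy["age"] >= 16].copy()
print(seniors[["year", "birth-year", "age"]])
```

Note that mean imputation gives an athlete with an unknown birth year a plausible but arbitrary age, so such rows can land in any age bucket of the later charts.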

## Chart wrappers with Plotly

In [5]:
import plotly.express as px
import plotly.graph_objects as go


def line_plot(df, x, y, hover_data=None, color=None, title=None):
    color_discrete_map = None
    if color == 'sex':
        color_discrete_map = {'M': 'blue', 'F': 'red'}
    fig = px.line(df, x=x, y=y, hover_data=hover_data, color=color,
                  color_discrete_map=color_discrete_map, title=title)
    fig.update_yaxes(tickformat="%M'%S.%L")
    fig.update_traces(mode="markers+lines")
    return fig


def scatter_plot(df, x, y, hover_data=None, color=None, title=None):
    color_discrete_map = None
    if color == 'sex':
        color_discrete_map = {'M': 'blue', 'F': 'red'}
    fig = px.scatter(df, x=x, y=y, hover_data=hover_data, color=color,
                     color_discrete_map=color_discrete_map, title=title)
    fig.update_yaxes(tickformat="%M'%S.%L")
    return fig


def histogram_plot(df, x, y=None, hover_data=None, color=None, cumulative=False, nbins=None, title=None):
    color_discrete_map = None
    if color == 'sex':
        color_discrete_map = {'M': 'blue', 'F': 'red'}
    fig = px.histogram(df, x=x, y=y, hover_data=hover_data, color=color, color_discrete_map=color_discrete_map,
                       cumulative=cumulative, nbins=nbins, title=title)
    fig.update_yaxes(tickformat="%M'%S.%L")
    #fig.update_xaxes(tickformat="%M'%S.%L")
    fig.update_layout(yaxis_title="")
    return fig


def split_violin_plot(df, x, y, split, colors=None, title=None, points='outliers'):
    assert points in ['all', 'outliers', False]
    values = df[split].unique()
    assert len(values) == 2
    if not colors:
        colors = ['blue', 'red']
    fig = go.Figure()
    for i, v in enumerate(values):
        fig.add_trace(go.Violin(x=df[x][df[split] == v],
                            y=df[y][df[split] == v],
                            legendgroup=v, scalegroup=v, name=v,
                            side='negative' if i == 0 else 'positive',
                            line_color=colors[i])
                 )
    fig.update_traces(meanline_visible=True, points=points)
    fig.update_layout(violingap=0, violinmode='overlay', title=title, legend_title=split)
    fig.update_yaxes(tickformat="%M'%S.%L")
    return fig


def multi_line_plot(x, y, hover_data=None, color=None, title=None, names=None, dash=None, legend_title=None):
    fig = go.Figure()
    if not legend_title:
        legend_title = 'category'
    for i, y_ in enumerate(y):
        fig.add_trace(go.Scatter(x=x, y=y_, name=names[i], legendrank=i+1,
                                 line=dict(color=color[i],
                                           dash=dash[i] if dash is not None else None)))
    fig.update_layout(title=title, legend_title=legend_title)
    fig.update_yaxes(tickformat="%M'%S.%L")
    return fig
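
The `%M'%S.%L` tick format used by the wrappers above is a d3 time format: minutes, an apostrophe, seconds, then milliseconds. For text labels outside Plotly, a small helper can render a duration in the same style. This is only a sketch; `format_race_time` is a hypothetical name, not part of the notebook.

```python
from datetime import timedelta


def format_race_time(delta):
    """Render a timedelta as MM'SS.mmm, e.g. 13'43.220."""
    # Work in whole milliseconds to avoid floating-point drift.
    total_ms = round(delta.total_seconds() * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    seconds, millis = divmod(rem_ms, 1000)
    return f"{minutes:02d}'{seconds:02d}.{millis:03d}"


label = format_race_time(timedelta(minutes=13, seconds=43, milliseconds=220))
print(label)
```

Unlike `%M`, which wraps at 60, this helper keeps counting minutes past 59, which suits the slowest times in the dataset.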

## Data visualization

In [6]:
top1 = group_ys.head(1)
fig = line_plot(top1, x='year', y="time", hover_data=['name', 'team'], color='sex',
                title='Historical top-1 time - {}'.format(analyze_event))
fig.show()

- The men's best time declines steadily up to 2020, when Yemaneberhan Crippa set a new national record of 13'02.
- There is no clear trend for female athletes, and Nadia Battocletti in 2021 is the only athlete since 2005 to have run under 15'00.
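
`group_ys.head(1)` returns the first row of each group, so it yields the fastest time only if the rows are already ordered by time within each year and sex, which appears to be the case in the data preview. A minimal sketch on made-up data shows the difference an explicit sort makes:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021, 2021],
    "sex": ["M", "M", "F", "M", "F"],
    "name": ["A", "B", "C", "D", "E"],
    "time": pd.to_timedelta(
        ["00:13:30", "00:13:02", "00:15:40", "00:13:10", "00:14:50"]
    ),
})

# Without sorting, head(1) just takes the first row seen in each group.
unsorted_top1 = toy.groupby(["year", "sex"]).head(1)

# Sorting by time first guarantees that head(1) is the fastest row per group.
sorted_top1 = toy.sort_values("time").groupby(["year", "sex"]).head(1)

print(sorted(unsorted_top1["name"]))
print(sorted(sorted_top1["name"]))
```

In the unsorted case athlete A is reported for men in 2020 even though B was faster; after sorting, B is selected.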

In [7]:
mean_time = group_ys['time'].mean().reset_index()
fig = split_violin_plot(df, x='year', y="time", split='sex', points=False,
                        title='Historical time distribution - {}'.format(analyze_event))
fig.show()

- For both genders the mean 5000m time is higher in recent years because more non-professional athletes are taking part in competitions. This is nonetheless a good sign for the athletics movement, because it means the sport is attracting more athletes.

In [8]:
time_top100 = group_ys.head(100)
fig = split_violin_plot(time_top100, x='year', y="time", split='sex',
                        title='Historical top-100 time distribution - {}'.format(analyze_event))
fig.show()

- Male athletes are getting more competitive in the 5000m. In fact, the mean time of the top-100 athletes is lowest in 2021, at 14'28.
- For female athletes the trend is quite steady, with the mean time always between 17'20 and 17'40.
- 2020 can be considered an outlier, since it was heavily influenced by Covid-19.

In [9]:
# nth() is zero-based: nth(100) is the 101st-ranked time, i.e. the first one outside the top 100.
thr_top100 = group_ys.nth(100)['time'].reset_index()
thr_top100_m = thr_top100[thr_top100.sex == 'M']
thr_top100_f = thr_top100[thr_top100.sex == 'F']
# Likewise, nth(200) is the first time outside the top 200.
thr_top200 = group_ys.nth(200)['time'].reset_index()
thr_top200_m = thr_top200[thr_top200.sex == 'M']
thr_top200_f = thr_top200[thr_top200.sex == 'F']
fig = multi_line_plot(x=thr_top100_m.year,
                      y=[thr_top100_m.time, thr_top100_f.time, thr_top200_m.time, thr_top200_f.time],
                      color=['blue', 'red', 'blue', 'red'],
                      dash=['dash', 'dash', 'dot', 'dot'],
                      names=['Top-100 M', 'Top-100 F', 'Top-200 M', 'Top-200 F'],
                      title='Historical top-100 and top-200 entry time - {}'.format(analyze_event))
fig.show()

- For males, 2021 was the toughest year: it is the only year in which a sub-15'00 time was required to enter the top-100, and 15'31 was the threshold to enter the top-200.
- Female athletes faced their most competitive environment in 2014 and 2015, when an 18'06 was needed to make the top-100.
- Besides 2020, women had a relatively easy year in 2006: a time of 21'38 was enough to make the top-200.
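
Because `nth` is zero-based, `nth(100)` selects the 101st row of each group, and it relies on the rows being pre-sorted by time. One way to make the rank explicit is `cumcount` after sorting. The sketch below uses made-up data, and the helper `entry_time` is hypothetical:

```python
import pandas as pd


def entry_time(frame, n):
    """Return the n-th fastest time per (year, sex); smaller groups are dropped."""
    ranked = frame.sort_values("time").copy()
    # cumcount() numbers rows within each group from 0, so rank 1 is the fastest.
    ranked["rank"] = ranked.groupby(["year", "sex"]).cumcount() + 1
    cols = ["year", "sex", "time"]
    return ranked.loc[ranked["rank"] == n, cols].reset_index(drop=True)


toy = pd.DataFrame({
    "year": [2021] * 5,
    "sex": ["M", "M", "M", "F", "F"],
    "time": pd.to_timedelta(
        ["00:14:10", "00:13:50", "00:14:30", "00:16:00", "00:15:20"]
    ),
})

top2 = entry_time(toy, 2)  # time of the 2nd-ranked athlete in each group
print(top2)
```

With this convention, `entry_time(df, 100)` would be the time of the 100th-ranked athlete, whereas `nth(100)` is the time one has to beat to displace the 101st.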

In [10]:
df['date_loc'] = df.date.astype(str) + df.location.astype(str)
races = df.groupby(df.year)['date_loc'].nunique().reset_index()
races.rename(columns={'date_loc': 'races'}, inplace=True)
fig = histogram_plot(races, x='year', y="races", nbins=n_years * 2 + 1, hover_data=['year', 'races'],
                     title='Historical number of races - {}'.format(analyze_event))
fig.show()

- Except for 2020, the number of races per year is fairly steady.
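
The cell above identifies a race by concatenating the date and location strings. A sketch of an equivalent count that deduplicates on the two columns directly, without a helper column, is shown below on made-up rows. As in the notebook, it assumes that one (date, location) pair is one race.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021, 2021],
    "date": ["2020-06-01", "2020-06-01", "2020-06-01", "2021-05-23", "2021-06-27"],
    "location": ["Roma", "Roma", "Milano", "Campobasso", "Lecce"],
})

# Keep one row per distinct (date, location) pair, then count rows per year.
races = (
    toy.drop_duplicates(subset=["date", "location"])
       .groupby("year")
       .size()
       .reset_index(name="races")
)
print(races)
```

Under this definition, several heats held at the same venue on the same day count as a single race.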

In [11]:
participants = group_ys['time'].count().reset_index()
participants.rename(columns={'time': 'athletes'}, inplace=True)
fig = histogram_plot(participants, x='year', y="athletes", color='sex', nbins=n_years * 2 + 1, hover_data=['year', 'athletes'],
                     title='Historical number of athletes - {}'.format(analyze_event))
fig.show()

- There is a sharp increase in the number of participants in 2014, especially among male athletes. Given the previous chart, the increase is not due to more races being available; it is probably due to more athletes being registered in the system.

In [12]:
mean_age = group_as['time'].mean().reset_index()
fig = scatter_plot(mean_age, x='age', y="time", color='sex', hover_data=['age'],
                   title='Mean time by age - {}'.format(analyze_event))
fig.show()

- Peak age for the 5000m is around 23 years old for both genders. At that age males have an average time of 16'26 and females 18'56.
- For females, the performance decline after peak age is slower than in males.
- Performance decline gets sharper for athletes that are more than 60 years old.
- This chart includes all athletes: professionals and amateurs.
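
The peak age can also be read off numerically rather than from the chart. The sketch below works on made-up data with times in seconds (an assumption made to keep the example simple) and picks, for each sex, the age with the lowest mean time:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [20, 20, 23, 23, 23, 40, 40],
    "sex": ["M", "F", "M", "M", "F", "M", "F"],
    "seconds": [1000.0, 1150.0, 980.0, 992.0, 1136.0, 1080.0, 1230.0],
})

# Mean time for every (sex, age) combination.
mean_by_age = toy.groupby(["sex", "age"])["seconds"].mean().reset_index()

# idxmin() returns the row label of the lowest mean time within each sex.
best_rows = mean_by_age.groupby("sex")["seconds"].idxmin()
peak = mean_by_age.loc[best_rows, ["sex", "age", "seconds"]]
print(peak)
```

On real data, ages with very few athletes give noisy means, so a minimum group size may be worth applying before taking the minimum.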

In [13]:
event_pb = df.loc[df.groupby([df.name, df.sex]).time.idxmin()]
event_pb_top100 = event_pb.sort_values(['time'], ascending=True).groupby(event_pb.sex).head(100)
fig = scatter_plot(event_pb_top100, x='age', y="time", color='sex', hover_data=['time', 'name', 'team', 'age'],
                   title='Age of top-100 athletes PB - {}'.format(analyze_event))
fig.show()

- Most of the male athletes that have a sub-13'30 PB ran it when they were 24 to 28 years old.
- Most of the female athletes that have a sub-15'30 PB ran it when they were 26 to 31 years old.
- Only 2 males ran their PB when older than 35, compared with 5 females.

In [14]:
event_pb_top300 = event_pb.sort_values(['time'], ascending=True).groupby(event_pb.sex).head(300)
n_ages = len(event_pb_top300.age.unique())
fig = histogram_plot(event_pb_top300, x='age', color='sex', nbins=n_ages * 3 + 1, hover_data=['age'],
                     title='Number of top-300 athletes PB by age - {}'.format(analyze_event))
fig.show()

- Most male elite athletes ran their PB when they were between 22 and 26 years old, a period that coincides with the 5000m peak performance for males.
- Most female elite athletes ran their PB when they were 20 to 22 years old. However, a significant number of females set their PB when they were over 30.

In [15]:
fig = histogram_plot(event_pb, x='age', color='sex', nbins=200, hover_data=['age'],
                     title='Number of athletes PB by age - {}'.format(analyze_event))
fig.show()

- Athletes under 18 usually don't take part in many 5000m events, since they are still young and focus on shorter distances.
- The majority of athletes ran their PB at 18 or 19 years old. This is because most athletes leave athletics before or during college.
- After peaking before age 20, the number of 5000m participants has another small peak after 40, due to athletes who resume or start their sporting career later in life.

In [16]:
event_pb_30 = event_pb.set_index('time').between_time('00:00', '00:25:00').reset_index()
fig = histogram_plot(event_pb_30, x='time', color='sex', cumulative=True, nbins=100, hover_data=['time'],
                     title='Number of athletes PB under different time thresholds - {}'.format(analyze_event))
fig.show()

- Some statistics for males: since 2005, only 76 athletes have run under 14'00, and 496 athletes have run under 15'00.
- Some statistics for females: since 2005, only 29 athletes have run under 16'00, 154 under 17'00, and 446 under 18'00.
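
Counts like the ones above can be computed directly instead of being read from the cumulative histogram. The sketch below uses made-up personal bests stored as durations; `count_under` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

pb = pd.DataFrame({
    "sex": ["M", "M", "M", "F", "F"],
    "time": pd.to_timedelta(
        ["00:13:50", "00:14:40", "00:15:20", "00:15:50", "00:17:30"]
    ),
})


def count_under(frame, sex, limit):
    """Count PBs of the given sex strictly faster than `limit`, e.g. "00:14:00"."""
    mask = (frame["sex"] == sex) & (frame["time"] < pd.Timedelta(limit))
    return int(mask.sum())


print(count_under(pb, "M", "00:14:00"))
print(count_under(pb, "F", "00:18:00"))
```

The comparison is strict, so an athlete who ran exactly the limit is not counted as being under it.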

In [17]:
df_sort_time = df.sort_values(['time'], ascending=True)
df_filter = df_sort_time.drop_duplicates(subset=['name'], keep="first")  # remove duplicates keep fastest time
df_team_count = df_filter.groupby([df_filter.team]).size().reset_index(name='athletes')
df_team_count = df_team_count.sort_values(['athletes'], ascending=False).head(20)
fig = histogram_plot(df_team_count, x='team', y='athletes', hover_data=['athletes'],
                     title='Number of athletes by team - {}'.format(analyze_event))
fig.show()

In [18]:
df_team_time = df_filter.groupby(df_filter.team)['time'].mean().reset_index()
df_team_time = df_team_time.sort_values(['time'], ascending=True).head(20)
fig = line_plot(df_team_time, x='team', y='time',
                title='Mean athletes PB by team - {}'.format(analyze_event))
fig.show()

- Some teams recruit professional foreign athletes, mostly from Africa, to compete for them, which results in an average time even lower than that of the professional military teams that hire local elite athletes.