# Lab Tasks

In this notebook we will analyse a dataset from an Irish triathlon by using the Pandas library. In the dataset, each row represents an athlete, described by a number of different descriptive features:

- *Number:* The athlete's race bib number
- *Place:* The place in which the athlete finished the race
- *Age:* The athlete's age
- *Gender:* The gender that the athlete declared ('M' or 'F')
- *Province:* The Irish province where the athlete comes from (Leinster, Munster, Connacht, Ulster)
- *Swim:* The time taken for the swimming segment of the event (in seconds)
- *T1:* The time taken for the first transition of the event, from swimming to cycling (in seconds)
- *Cycle:* The time taken for the cycling segment of the event (in seconds)
- *T2:* The time taken for the second transition of the event, from cycling to running (in seconds)
- *Run:* The time taken for the running segment of the event (in seconds)

In [38]:
import pandas as pd
import urllib.request
import urllib.error

## Task 1 - Data Loading and Preparation

Use Python to download a file containing the triathlon dataset in CSV format from the URL:

http://mlg.ucd.ie/modules/python/triathlon.csv

Load the dataset into a Pandas DataFrame, where the row index will be given by the athlete's bib number. Display the first 20 rows of the DataFrame.

In [39]:
url = "http://mlg.ucd.ie/modules/python/triathlon.csv"
df = pd.read_csv(url, index_col="Number")
df.head(20)

Number,Place,Gender,Age,Province,Swim,T1,Cycle,T2,Run
1,11.0,M,19,Munster,518,64.0,2293.0,42.0,1446.0
2,129.0,M,25,Leinster,1102,124.0,2662.0,47.0,1723.0
3,35.0,M,25,Leinster,693,138.0,2472.0,42.0,1466.0
4,153.0,M,26,Munster,906,318.0,3002.0,62.0,1648.0
5,34.0,M,21,Connacht,566,131.0,2518.0,65.0,1522.0
6,131.0,M,25,Connacht,940,202.0,3083.0,63.0,1381.0
7,169.0,M,23,Leinster,1118,366.0,3114.0,127.0,1674.0
8,95.0,M,22,Leinster,1002,102.0,2449.0,59.0,1756.0
9,97.0,M,29,Ulster,735,213.0,2701.0,66.0,1655.0
10,150.0,M,29,Leinster,930,196.0,2976.0,66.0,1700.0
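
The cell above passes the URL straight to `pd.read_csv`, so the imported `urllib` modules are never used. A minimal sketch of an explicit download step with basic error handling might look like the following. The helper names `download_text` and `load_triathlon`, the UTF-8 encoding, and the timeout are assumptions for illustration; the two sample rows are copied from the output above so the parsing step can be checked offline.

```python
import io
import urllib.error
import urllib.request

import pandas as pd

URL = "http://mlg.ucd.ie/modules/python/triathlon.csv"


def download_text(url, timeout=10):
    """Return the response body as text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Assumption: the file is UTF-8 encoded.
            return response.read().decode("utf-8")
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print(f"Download failed: {err}")
        return None


def load_triathlon(csv_text):
    """Parse CSV text into a DataFrame indexed by bib number."""
    return pd.read_csv(io.StringIO(csv_text), index_col="Number")


# Offline check using two rows copied from the output above.
sample_csv = (
    "Number,Place,Gender,Age,Province,Swim,T1,Cycle,T2,Run\n"
    "1,11.0,M,19,Munster,518,64.0,2293.0,42.0,1446.0\n"
    "2,129.0,M,25,Leinster,1102,124.0,2662.0,47.0,1723.0\n"
)
sample_df = load_triathlon(sample_csv)
print(sample_df)
```

With network access, `text = download_text(URL)` followed by `load_triathlon(text)` (when `text` is not `None`) could stand in for the direct `pd.read_csv(url, ...)` call.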


The dataset might contain missing values. For instance, some athletes may have registered for the race but never actually started. Other athletes might have started the race, but not completed all segments of the triathlon.

From the DataFrame, identify the number of missing values in each column. Then remove any rows which contain missing values (i.e. athletes who did not fully complete the race). How many rows remain?

In [40]:
df.isnull().sum()

Place       4
Gender      0
Age         0
Province    0
Swim        0
T1          1
Cycle       1
T2          2
Run         6
dtype: int64

In [41]:
df = df.dropna(axis=0)

In [43]:
print(f"Number of rows remaining is {len(df)}")
df.isnull().sum()

Number of rows remaining is 182


Place       0
Gender      0
Age         0
Province    0
Swim        0
T1          0
Cycle       0
T2          0
Run         0
dtype: int64
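
Because a single athlete can have gaps in several columns, the per-column counts do not necessarily add up to the number of rows that `dropna` removes. The sketch below counts affected rows directly on a small made-up frame (the `NaN` values are inserted purely for illustration and do not come from the dataset).

```python
import numpy as np
import pandas as pd

# Toy data: the NaN positions are invented for this example.
toy = pd.DataFrame(
    {
        "Swim": [518, 1102, 693],
        "T1": [64.0, np.nan, 138.0],
        "Run": [1446.0, 1723.0, np.nan],
    },
    index=pd.Index([1, 2, 3], name="Number"),
)

# Missing values per column.
missing_per_column = toy.isnull().sum()

# Rows with at least one missing value (these are the rows dropna removes).
rows_with_missing = int(toy.isnull().any(axis=1).sum())

complete = toy.dropna(axis=0)

print(missing_per_column)
print(f"{rows_with_missing} of {len(toy)} rows have at least one missing value")
print(f"{len(complete)} rows remain")
```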

Add a new column to the DataFrame, called *Finish*, which is the total time taken for the race for each athlete (i.e. Swim + T1 + Cycle + T2 + Run).

In [44]:
df["finish"] = df["Swim"] + df["T1"] + df["Cycle"] + df["T2"] + df["Run"]
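
An equivalent way to build the total is a row-wise sum over a list of segment columns, which avoids repeating `df[...]` five times. This is only a sketch on two rows copied from the output earlier in the notebook; the `segments` list is a name introduced here.

```python
import pandas as pd

segments = ["Swim", "T1", "Cycle", "T2", "Run"]

toy = pd.DataFrame(
    {
        "Swim": [518, 1102],
        "T1": [64.0, 124.0],
        "Cycle": [2293.0, 2662.0],
        "T2": [42.0, 47.0],
        "Run": [1446.0, 1723.0],
    },
    index=pd.Index([1, 2], name="Number"),
)

# skipna=False makes a missing segment produce NaN rather than a partial total.
toy["finish"] = toy[segments].sum(axis=1, skipna=False)
print(toy["finish"])
```

By default `sum` skips missing values, so `skipna=False` matters if this runs before the incomplete rows have been dropped.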

To verify the step above, sort the DataFrame, based on the *Finish* time, fastest to slowest. Display the top 10 fastest athletes overall:

In [47]:
df.sort_values(by = "finish", ascending=False)

Number,Place,Gender,Age,Province,Swim,T1,Cycle,T2,Run,finish
178,184.0,F,64,Munster,1217,567.0,3443.0,118.0,2543.0,7888.0
119,183.0,M,56,Connacht,1040,234.0,3387.0,119.0,2212.0,6992.0
167,182.0,F,49,Connacht,1283,189.0,3394.0,118.0,2001.0,6985.0
157,181.0,F,34,Leinster,1243,156.0,2937.0,119.0,2412.0,6867.0
101,179.0,M,43,Ulster,1270,208.0,3143.0,122.0,2067.0,6810.0
...,...,...,...,...,...,...,...,...,...,...
118,5.0,M,52,Munster,621,104.0,2169.0,51.0,1322.0,4267.0
92,4.0,M,49,Leinster,621,89.0,2005.0,53.0,1474.0,4242.0
54,3.0,M,35,Ulster,621,85.0,1982.0,53.0,1282.0,4023.0
48,2.0,M,32,Munster,616,87.0,2016.0,58.0,1225.0,4002.0
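
The call above uses `ascending=False`, so the output starts with the slowest athletes rather than the fastest. A sketch of the fastest-first ordering that the task describes, using four finish times copied from the output above, might look like this (on the full DataFrame, `head(10)` would give the top 10):

```python
import pandas as pd

# Bib numbers and finish times copied from the output above.
toy = pd.DataFrame(
    {"finish": [7888.0, 4002.0, 4023.0, 4242.0]},
    index=pd.Index([178, 48, 54, 92], name="Number"),
)

# ascending=True (the default) puts the smallest, i.e. fastest, time first.
fastest = toy.sort_values(by="finish", ascending=True).head(3)
print(fastest)
```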


## Task 2 - Initial Data Analysis

What is the average finishing time for athletes? What is the slowest finishing time?

In [55]:
average_finishing_time = df["finish"].mean()
slowest_time = df["finish"].max()
print(f"Average finishing time was {round(average_finishing_time,2)}, slowest time was {slowest_time}")

Average finishing time was 5359.84, slowest time was 7888.0
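
Times in seconds are hard to read at a glance. As an optional convenience, a small helper (hypothetical, not part of the original notebook) can format them as hours, minutes and seconds:

```python
def format_seconds(total_seconds):
    """Format a duration in seconds as H:MM:SS, rounded to the nearest second."""
    hours, remainder = divmod(round(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


average_formatted = format_seconds(5359.84)
slowest_formatted = format_seconds(7888.0)
print(f"Average: {average_formatted}, slowest: {slowest_formatted}")
```

For the values printed above this gives 1:29:20 for the average and 2:11:28 for the slowest finisher.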


On average, which segment of the race took the longest: swimming, cycling or running?

In [57]:
average_swim_time = df["Swim"].mean()
average_cycling_time = df["Cycle"].mean()
average_run_time = df["Run"].mean()

if average_cycling_time < average_run_time and average_swim_time < average_run_time:
    print("Running took the longest")
elif average_run_time < average_swim_time and average_cycling_time < average_swim_time:
    print("Swimming took the longest")
else:
    print("Cycling took the longest")

Cycling took the longest
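
The same comparison can be written without an `if`/`elif` chain by taking the mean of all three columns at once and asking for the label of the largest value. The sketch below uses three rows copied from the first output in this notebook.

```python
import pandas as pd

# Swim, Cycle and Run times for athletes 1 to 3, copied from the output above.
toy = pd.DataFrame(
    {
        "Swim": [518, 1102, 693],
        "Cycle": [2293.0, 2662.0, 2472.0],
        "Run": [1446.0, 1723.0, 1466.0],
    }
)

segment_means = toy[["Swim", "Cycle", "Run"]].mean()

# idxmax returns the index label (here, the column name) of the largest mean.
longest_segment = segment_means.idxmax()
print(f"{longest_segment} took the longest on average")
```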


How many female and male athletes competed in the race? How many athletes from each Irish province competed in the race? 

In [60]:
df["Gender"].value_counts()

Gender
M    134
F     48
Name: count, dtype: int64

In [61]:
df["Province"].value_counts()

Province
Leinster    75
Munster     55
Ulster      30
Connacht    22
Name: count, dtype: int64

How many female and male athletes were from each of the 4 provinces? 

(Hint: We can use a cross-tabulation)

In [63]:
pd.crosstab(df["Province"], df["Gender"])

Province,F,M
Connacht,6,16
Leinster,20,55
Munster,16,39
Ulster,6,24
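
If row and column totals are useful, `pd.crosstab` accepts `margins=True`, which adds an `All` row and column. A minimal sketch on five athletes copied from the sorted output earlier in the notebook:

```python
import pandas as pd

# Province and Gender for bib numbers 178, 119, 167, 157 and 101.
toy = pd.DataFrame(
    {
        "Province": ["Munster", "Connacht", "Connacht", "Leinster", "Ulster"],
        "Gender": ["F", "M", "F", "F", "M"],
    }
)

# margins=True appends totals under the label "All".
table = pd.crosstab(toy["Province"], toy["Gender"], margins=True)
print(table)
```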


## Task 3 - Further Data Analysis

Create a new column 'AgeCategory' that divides the athletes into age categories: 16-19, 20-29, 30-39, 40-49, 50-65.

In [66]:
bins = [16,20,30,40,50,65]
df["AgeCategory"] = pd.cut(df["Age"], bins=bins, right=False)
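
With `right=False` and a top edge of 65, the last bin is `[50, 65)`, so an athlete aged exactly 65 would be left uncategorised (`NaN`). A sketch of one way to include that age and to use the category names from the task as labels is shown below; the top edge of 66 and the sample ages are assumptions chosen for illustration.

```python
import pandas as pd

# Sample ages chosen to sit on the bin boundaries.
ages = pd.Series([16, 19, 20, 29, 49, 50, 65], name="Age")

# Left-closed bins: [16, 20), [20, 30), ..., [50, 66). The top edge is 66
# (an assumption) so that a 65-year-old falls inside the last bin.
bins = [16, 20, 30, 40, 50, 66]
labels = ["16-19", "20-29", "30-39", "40-49", "50-65"]

age_category = pd.cut(ages, bins=bins, labels=labels, right=False)
category_names = age_category.astype(str).tolist()
print(category_names)
```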

How many female and male athletes were from each age category? 

(Hint: We can use a cross-tabulation and the 'AgeCategory' column created above)

In [67]:
pd.crosstab(df["AgeCategory"], df["Gender"])

AgeCategory,F,M
"[16, 20)",0,1
"[20, 30)",12,20
"[30, 40)",19,56
"[40, 50)",13,41
"[50, 65)",4,16


What was the finishing place of each athlete, per province?

(Hint: We can perform group-by aggregation)

In [73]:
group_by_province = df.groupby("Province", observed=False)
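
The cell above only builds the group-by object and does not aggregate it, so it produces no output. The question can be read in more than one way; the sketch below shows two possible readings (the list of finishing places per province, and the best place per province) on five rows copied from the first output in this notebook. Neither is necessarily the intended answer.

```python
import pandas as pd

# Province and Place for athletes 1 to 5, copied from the output above.
toy = pd.DataFrame(
    {
        "Province": ["Munster", "Leinster", "Leinster", "Munster", "Connacht"],
        "Place": [11.0, 129.0, 35.0, 153.0, 34.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Number"),
)

# Reading 1: every finishing place, grouped by province (sorted best first).
places_by_province = (
    toy.sort_values("Place").groupby("Province")["Place"].apply(list)
)

# Reading 2: the best (lowest) finishing place in each province.
best_place = toy.groupby("Province")["Place"].min()

print(places_by_province)
print(best_place)
```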

What were the average times for the three segments, per age category?

(Hint: We can perform group-by aggregation)

In [76]:
group_by_age_category = df.groupby("AgeCategory", observed=False)
group_by_age_category[["Swim","Cycle","Run"]].mean()

AgeCategory,Swim,Cycle,Run
"[16, 20)",518.0,2293.0,1446.0
"[20, 30)",839.75,2802.03125,1624.625
"[30, 40)",836.76,2566.346667,1552.84
"[40, 50)",888.5,2576.611111,1660.277778
"[50, 65)",870.9,2693.35,1778.5
