# Analysing Trail Run Results from cycosports

There is a great deal of potential for applying data analytics to sporting performance. Apart from tracking our own exercise performance, we can also analyse the results of mass competition events as a cohort and see where we stand relative to the other participants.

Unlike self-recorded data, however, competition results may not be published in a format that is readily amenable to analysis. Getting them into shape may entail scraping the data from a website, or wrangling it afterwards.

For this analysis, the data is publicly available from https://cycosports.com/2021-jungle-cross-trail-run-april-3rd-4th/ in a downloadable PDF format. We will be analysing the 4th April data.

The data table needs to be extracted from the PDF into CSV format. https://pdftables.com/ generated a reasonable CSV, but it only offers 25 pages of free conversions. An alternative free site that gives reasonable output is https://www.pdftoexcelconverter.net/.

Even after getting the data into CSV format, a lot more preparation needs to be done before analysis can be conducted. The following steps will cover this.

# 1. Preparing the data

We will aim to transform this data into a tidy data format as defined by Hadley Wickham (https://en.wikipedia.org/wiki/Tidy_data).

First, we import the necessary libraries and load the DataFrame.

In [1]:
import pandas as pd 
import numpy as np
import session_info 

session_info.show()

In [2]:
df = pd.read_csv('results.csv', skiprows=1)

In [3]:
df.head()

Unnamed: 0,Pl,overall,Name,Club,Start,1stLap,2ndLap,Time
0,10km - OPEN 13+ YRS,,,,,,,
1,Male,,,,,,,
2,Open - Male,,,,,,,
3,1.,1.0,Malachy Kirwan (316),,7:16:17.70,21:09.20,21:55.00,43:04.20
4,2.,2.0,William Petty (267),Coached,7:16:18.70,21:43.80,22:00.40,43:44.20


## Standardising the format of the column names

In [4]:
df = df.rename(columns = {"Pl":"category_rank", "overall":"event_rank", "1stLap":"lap_1", "2ndLap":"lap_2"})
df = df.rename(str.lower, axis="columns")

In [5]:
df.head()

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time
0,10km - OPEN 13+ YRS,,,,,,,
1,Male,,,,,,,
2,Open - Male,,,,,,,
3,1.,1.0,Malachy Kirwan (316),,7:16:17.70,21:09.20,21:55.00,43:04.20
4,2.,2.0,William Petty (267),Coached,7:16:18.70,21:43.80,22:00.40,43:44.20


## Removing the rows with no usable data

Currently, the DataFrame has these dimensions:

In [6]:
df.shape

(220, 8)

* The data contains some rows with the string "Jungle Cross 2021 Trail Run Series Race 2". 
* There are also repeated header rows (with the values "Pl", "overall", "Name", etc.) within the data. This is due to the PDF repeating them for each page. 

In [7]:
df[50:55]

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time
50,8.,38.,Jason Yai (288),,8:28:49.70,30:36.60,31:39.60,1:02:16.20
51,,,Jungle Cross 2021 Trail Run Series Race 2 & 20...,,,,,1
52,Jungle Cross 2021 Trail Run Series Race 2,,,,,,,
53,Pl,overall,Name,Club,Start,1stLap,2ndLap,Time
54,9.,44.,Lee Victor (290),,8:28:49.40,32:04.70,32:50.10,1:04:54.80


* Also, the rows containing "DNS" and "DNF" (in the "category_rank" and "event_rank" data columns) do not have timing information associated with them. 

In [8]:
df.tail()

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time
215,Under 14,,,,,,,
216,DNS DNS,,Hanna Croissant (426),,,,,
217,Open,,,,,,,
218,DNS DNS,,Gabriella Faure (417),,,,,
219,,,Jungle Cross 2021 Trail Run Series Race 2 & 20...,,,,,5.0


We will remove those rows from the DataFrame.

In [9]:
df = df[~df["category_rank"].str.contains("Jungle Cross 2021", na=False, regex=False)]
df = df[~df["name"].str.contains("Jungle Cross 2021", na=False, regex=False)]
df = df[~df["time"].str.contains("Time", na=False, regex=False)]
df = df[~df["category_rank"].str.contains("DNS|DNF", na=False)]

Notes: 

The inversion operator (`~`) is used to return rows not containing the terms. Alternatively, the following can be used:

`df = df[df["column"].str.contains("substring", na=False)==False]`

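`regex=False` should not be used if the search term is a regular expression, such as `"DNS|DNF"`.

`na=False` must be used. Without it, `str.contains` returns NaN rather than True or False for values that are not strings (e.g. NaN values). With the `==False` form above, those rows are then dropped together with the matches, which causes missing data in the rest of the DataFrame. With the `~` form, the mask is no longer purely boolean and typically cannot be inverted at all.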

If we want to verify which rows are being dropped, the code can be run on the original DataFrame but without the inversion operation, e.g.:

`df["column"].str.contains("substring", na=False, regex=False)`
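
The effect of `na=False` can be seen on a small example. The following is a minimal sketch on a made-up Series (the values and variable names are illustrative and not taken from the race data):

```python
import numpy as np
import pandas as pd

# Toy column mixing strings and a missing value (hypothetical data).
s = pd.Series(["1.", "Jungle Cross 2021 Trail Run", np.nan, "DNS DNS"], dtype=object)

# Without na=False, the mask contains a missing value for the NaN row.
raw_mask = s.str.contains("Jungle Cross 2021", regex=False)

# With na=False, the NaN row is treated as "does not contain the term".
mask = s.str.contains("Jungle Cross 2021", na=False, regex=False)

# Inverting the clean boolean mask keeps every row except the match.
kept = s[~mask]

# A regex alternation needs the default regex=True.
dns_mask = s.str.contains("DNS|DNF", na=False)

print(raw_mask.isna().sum(), mask.tolist(), len(kept))
```

Only the mask built with `na=False` is purely boolean, so it is the one that can safely be inverted with `~` and used for filtering.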

Now, we check the results of the operation:

In [10]:
df[50:55]

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time
50,8.0,38.0,Jason Yai (288),,8:28:49.70,30:36.60,31:39.60,1:02:16.20
54,9.0,44.0,Lee Victor (290),,8:28:49.40,32:04.70,32:50.10,1:04:54.80
55,10.0,48.0,Daniel Newton (329),,8:28:47.80,30:55.50,34:49.10,1:05:44.60
56,11.0,49.0,Tete Selado (324),,8:27:09.60,30:12.10,35:36.80,1:05:48.90
57,12.0,62.0,Rajen Prabhu (340),Cos coaching,7:19:37.00,35:57.60,39:56.20,1:15:53.80


In [11]:
df.tail()

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time
212,4.,27.0,Janice Teo (419),,12:10:29.10,17:10.20,16:52.00,34:02.20
213,5.,29.0,Vivi Martanto (423),,12:10:29.60,17:33.90,18:12.70,35:46.60
214,6.,31.0,Jessica Timms (394),Dulwich Runners,12:09:19.50,17:36.70,21:30.10,39:06.80
215,Under 14,,,,,,,
217,Open,,,,,,,


In [12]:
df.shape

(182, 8)

## Dealing with the event and gender data in the "category_rank" column

The DataFrame has additional information in the "category_rank" column. The items are, in this order:
* race event (e.g. "10km - Masters (40+)")
* gender (Male or Female)
* age category (e.g. Open, Masters, Under 14)

Each item sits in its own header row with no other information.

This information has to be split out into separate columns to keep the "category_rank" column clean.

### Creating columns for distance, event and age

We can create new columns to specify distance, event and age from the race event information before deleting those rows. 

First, we reset the index so that the index labels match the actual row positions, and then search for rows containing "km":

In [13]:
df = df.reset_index(drop=True)
df[df["category_rank"].str.contains("km", na=False, regex=False, case=False)]

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time
0,10km - OPEN 13+ YRS,,,,,,,
91,10km - Masters (40+),,,,,,,
142,3km Adventure Race (7y +),,,,,,,


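We can see that there are only 3 race events. Their locations can be used to manually fill a new column, "race", by assigning values to slices of rows. Note that `.loc` slices include both endpoints, so each boundary row is briefly assigned to the preceding race before the next line overwrites it.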

In [14]:
df["race"] = ""
df.loc[:91, "race"] = "10km - Open - 13+"
df.loc[91:142,"race"] = "10km - Masters - 40+"
df.loc[142:,"race"] = "3km - Adventure Race - 7+"

We then remove the now-redundant header rows. The information in the new column is split into more granular columns.

In [15]:
df = df.drop([0, 91, 142])

# Split on the full " - " separator so that no stray spaces are left in the values
df[["distance", "event", "age"]] = df["race"].str.split(" - ", expand=True)
df = df.drop(columns="race")

This is the result:

In [16]:
df.head()

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age
1,Male,,,,,,,,10km,Open,13+
2,Open - Male,,,,,,,,10km,Open,13+
3,1.,1.0,Malachy Kirwan (316),,7:16:17.70,21:09.20,21:55.00,43:04.20,10km,Open,13+
4,2.,2.0,William Petty (267),Coached,7:16:18.70,21:43.80,22:00.40,43:44.20,10km,Open,13+
5,3.,4.0,Chris Timms (251),Dulwich Runners,7:16:58.30,22:15.30,23:27.00,45:42.30,10km,Open,13+
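
When the parts of a string are joined by " - ", splitting on the bare hyphen and splitting on the full separator give different results. The following is a small sketch on made-up strings in the same shape as the "race" column (`loose` and `tight` are illustrative names):

```python
import pandas as pd

race = pd.Series(["10km - Open - 13+", "3km - Adventure Race - 7+"])

# Splitting on the bare hyphen keeps the spaces around each part.
loose = race.str.split("-", expand=True)

# Splitting on the full " - " separator yields clean values.
tight = race.str.split(" - ", expand=True)

# Stripping afterwards is another way to reach the same clean result.
stripped = loose.apply(lambda col: col.str.strip())

print(repr(loose.loc[0, 0]), repr(tight.loc[0, 0]))
```

Stray spaces are easy to miss in a rendered table, but they matter later when filtering or grouping on exact values such as "10km".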


### Creating a column for gender

We can prepare the "gender" column by first making a copy of the "category_rank" column: 

In [17]:
df["gender"] = df["category_rank"].copy()

However, there are 14 header rows for gender as shown below.

In [18]:
len(df[df["gender"].str.fullmatch("Male|Female", na=False)])

14

Since there are too many header rows to fill in manually as we did above, we will instead replace all the non-gender values in the column with NaN. This allows us to forward-fill the values (down the column) from the remaining headers to complete the column.

In [19]:
# Assign through .loc to avoid chained indexing, which may not modify df itself
df.loc[~df["gender"].str.fullmatch("Male|Female", na=False), "gender"] = np.nan
df["gender"] = df["gender"].ffill()

In [20]:
df.head()

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age,gender
1,Male,,,,,,,,10km,Open,13+,Male
2,Open - Male,,,,,,,,10km,Open,13+,Male
3,1.,1.0,Malachy Kirwan (316),,7:16:17.70,21:09.20,21:55.00,43:04.20,10km,Open,13+,Male
4,2.,2.0,William Petty (267),Coached,7:16:18.70,21:43.80,22:00.40,43:44.20,10km,Open,13+,Male
5,3.,4.0,Chris Timms (251),Dulwich Runners,7:16:58.30,22:15.30,23:27.00,45:42.30,10km,Open,13+,Male
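
The header-row pattern used here (blank out everything that is not a header, then forward-fill) can be tried in isolation. This is a minimal sketch on a made-up frame, using `Series.where` as one possible way to express the masking step:

```python
import pandas as pd

# Toy data: gender header rows interleaved with rank rows (hypothetical values).
toy = pd.DataFrame({"category_rank": ["Male", "1.", "2.", "Female", "1."]})

# True only for rows that are exactly "Male" or "Female".
is_header = toy["category_rank"].str.fullmatch("Male|Female", na=False)

# where() keeps values where the condition is True and puts NaN elsewhere;
# ffill() then carries each header down to the rows below it.
toy["gender"] = toy["category_rank"].where(is_header).ffill()

print(toy)
```

Because the fill only ever moves downwards, the approach relies on every block of results being preceded by its header row, as is the case in this dataset.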


### Creating a column for category

Again, we can create the column by copying the current "category_rank" column. 

In [21]:
df["category"] = df["category_rank"].copy()

Replacing the rank numbers in the column with NaN values (leaving only the category information) allows us to forward-fill the category information to complete the column. The result is the right-most column:

In [22]:
df["category"] = df["category"].replace(to_replace=r'\d+\.\d*', value=np.nan, regex=True)
df["category"] = df["category"].ffill()

In [23]:
df.head()

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age,gender,category
1,Male,,,,,,,,10km,Open,13+,Male,Male
2,Open - Male,,,,,,,,10km,Open,13+,Male,Open - Male
3,1.,1.0,Malachy Kirwan (316),,7:16:17.70,21:09.20,21:55.00,43:04.20,10km,Open,13+,Male,Open - Male
4,2.,2.0,William Petty (267),Coached,7:16:18.70,21:43.80,22:00.40,43:44.20,10km,Open,13+,Male,Open - Male
5,3.,4.0,Chris Timms (251),Dulwich Runners,7:16:58.30,22:15.30,23:27.00,45:42.30,10km,Open,13+,Male,Open - Male


Following that, we can strip excess gender information from this column.

In [24]:
# Keep only the text before the hyphen and trim the surrounding whitespace
df["category"] = df["category"].str.split("-").str[0].str.strip()

In [25]:
df.head()

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age,gender,category
1,Male,,,,,,,,10km,Open,13+,Male,Male
2,Open - Male,,,,,,,,10km,Open,13+,Male,Open
3,1.,1.0,Malachy Kirwan (316),,7:16:17.70,21:09.20,21:55.00,43:04.20,10km,Open,13+,Male,Open
4,2.,2.0,William Petty (267),Coached,7:16:18.70,21:43.80,22:00.40,43:44.20,10km,Open,13+,Male,Open
5,3.,4.0,Chris Timms (251),Dulwich Runners,7:16:58.30,22:15.30,23:27.00,45:42.30,10km,Open,13+,Male,Open


### Removing rows with no time 

At this point, the DataFrame still has the remaining header rows under "category_rank". These rows have no values under the "time" column. 

To clean the data up, we will simply remove all rows with NaN values under "time".

After this, the "category_rank" column should be clean:

In [26]:
df = df.dropna(subset=["time"])
df = df.reset_index(drop=True)

In [27]:
df.head()

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age,gender,category
0,1.0,1.0,Malachy Kirwan (316),,7:16:17.70,21:09.20,21:55.00,43:04.20,10km,Open,13+,Male,Open
1,2.0,2.0,William Petty (267),Coached,7:16:18.70,21:43.80,22:00.40,43:44.20,10km,Open,13+,Male,Open
2,3.0,4.0,Chris Timms (251),Dulwich Runners,7:16:58.30,22:15.30,23:27.00,45:42.30,10km,Open,13+,Male,Open
3,4.0,6.0,Benoit Besnier (320),COS Coaching,7:16:17.80,23:32.70,24:02.80,47:35.50,10km,Open,13+,Male,Open
4,5.0,7.0,Daniel Rose (311),Coached Fitness,7:16:59.20,22:55.10,24:41.30,47:36.40,10km,Open,13+,Male,Open


## Dealing with the bib number

The "name" column has additional information regarding the bib number. It is possible to extract that information into a new column, "bib_number", and then remove the information from the "name" column. 

In [28]:
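# Raw string keeps the regex backslashes intact; expand=False returns a Series
df["bib_number"] = df["name"].str.extract(r"\((\d+)\)", expand=False)
# Keep only the text before the bracket and trim the trailing space
df["name"] = df["name"].str.split("(").str[0].str.strip()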

In [29]:
df[5:10]

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age,gender,category,bib_number
5,6.0,10.0,Stanislav Miroshnichenko,,7:18:44.00,24:22.60,24:23.00,48:45.60,10km,Open,13+,Male,Open,
6,7.0,12.0,Ian Stewart,Dulwich Runners,7:17:55.80,23:27.30,25:37.30,49:04.60,10km,Open,13+,Male,Open,253.0
7,8.0,14.0,Tycen Bundgaard,,7:18:43.60,25:01.40,24:47.10,49:48.50,10km,Open,13+,Male,Open,297.0
8,9.0,15.0,Samuel Iii Belandres,Filipino Runners,7:17:55.20,23:38.40,26:18.70,49:57.10,10km,Open,13+,Male,Open,
9,10.0,16.0,Anish Jha,,7:17:55.70,25:28.60,25:40.90,51:09.50,10km,Open,13+,Male,Open,299.0
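
The extraction step can be checked on a couple of sample strings. Below is a minimal sketch with made-up input covering both cases (with and without a bib number); removing the bracketed part with `str.replace` is shown as one possible alternative to splitting:

```python
import pandas as pd

names = pd.Series(["Malachy Kirwan (316)", "Stanislav Miroshnichenko"])

# Capture the digits inside the brackets; rows without a match become NaN.
bib = names.str.extract(r"\((\d+)\)", expand=False)

# Remove an optional trailing "(digits)" together with the space before it.
clean = names.str.replace(r"\s*\(\d+\)\s*$", "", regex=True)

print(bib.tolist(), clean.tolist())
```

Note that the extracted bib numbers are strings, which suits an identifier that is never used in arithmetic.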


There are some entries without a bib number (the value is NaN), so we will fill those with the string "None". 

In [30]:
df["bib_number"] = df['bib_number'].fillna("None")

In [31]:
df[5:10]

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age,gender,category,bib_number
5,6.0,10.0,Stanislav Miroshnichenko,,7:18:44.00,24:22.60,24:23.00,48:45.60,10km,Open,13+,Male,Open,
6,7.0,12.0,Ian Stewart,Dulwich Runners,7:17:55.80,23:27.30,25:37.30,49:04.60,10km,Open,13+,Male,Open,253.0
7,8.0,14.0,Tycen Bundgaard,,7:18:43.60,25:01.40,24:47.10,49:48.50,10km,Open,13+,Male,Open,297.0
8,9.0,15.0,Samuel Iii Belandres,Filipino Runners,7:17:55.20,23:38.40,26:18.70,49:57.10,10km,Open,13+,Male,Open,
9,10.0,16.0,Anish Jha,,7:17:55.70,25:28.60,25:40.90,51:09.50,10km,Open,13+,Male,Open,299.0


## Removing NaN values from the "club" column

For consistency, we will do the same for NaN values in the "club" column as we did for the "bib_number" column.

In [32]:
df['club'] = df['club'].fillna("None")
df.head()

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age,gender,category,bib_number
0,1.0,1.0,Malachy Kirwan,,7:16:17.70,21:09.20,21:55.00,43:04.20,10km,Open,13+,Male,Open,316
1,2.0,2.0,William Petty,Coached,7:16:18.70,21:43.80,22:00.40,43:44.20,10km,Open,13+,Male,Open,267
2,3.0,4.0,Chris Timms,Dulwich Runners,7:16:58.30,22:15.30,23:27.00,45:42.30,10km,Open,13+,Male,Open,251
3,4.0,6.0,Benoit Besnier,COS Coaching,7:16:17.80,23:32.70,24:02.80,47:35.50,10km,Open,13+,Male,Open,320
4,5.0,7.0,Daniel Rose,Coached Fitness,7:16:59.20,22:55.10,24:41.30,47:36.40,10km,Open,13+,Male,Open,311


Now, we have values in all cells of the DataFrame.

In [33]:
df.isna().sum().sum()

0

## Properly formatting the duration-based columns

The "lap_1", "lap_2" and "time" columns contain duration information. Analysis requires conversion to the Pandas timedelta object. However, using the columns as they are will throw errors with `pd.to_timedelta`. 

Within the same column, some entries are in `%M:%S.%f` (MM:SS.ff) format, and some are in `%-H:%M:%S.%f` (H:MM:SS.ff) format. This can be seen in the "time" column below.

In [34]:
df[60:65]

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age,gender,category,bib_number
60,1.0,27.0,Luu Phuong Anh Dang,,8:27:09.30,27:50.50,28:51.00,56:41.50,10km,Open,13+,Female,Open,
61,2.0,28.0,Emily Astiz,Team Grit OCR,8:26:19.80,27:38.60,29:23.90,57:02.50,10km,Open,13+,Female,Open,298.0
62,3.0,36.0,Michelle Ferguson,,8:26:19.60,29:21.70,31:49.20,1:01:10.90,10km,Open,13+,Female,Open,303.0
63,4.0,45.0,Véronique Gille,,7:20:35.00,32:43.00,32:26.50,1:05:09.50,10km,Open,13+,Female,Open,281.0
64,5.0,47.0,Rachel Halliday,Dulwich Runners,8:28:03.90,32:38.70,33:02.00,1:05:40.70,10km,Open,13+,Female,Open,254.0


As long as we change the `%M:%S.%f` entries to `%-H:%M:%S.%f` so that the whole column is consistent, `pd.to_timedelta` will accept it. 

However, we will change both the `%M:%S.%f` and the `%-H:%M:%S.%f` entries in the column to `%H:%M:%S.%f` (HH:MM:SS.ff) as it is more conventional.

Below are three ways to write functions that update the formats. The functions add the leading zero digits and colons to the strings where applicable.

### Method 1: Counting the number of ":" in the strings

In [35]:
def add_hours_zero(column):
    m = column.str.count(':') == 2
    column = column.mask(m, "0" + column, axis=0)
    return column

def add_hours(column):
    m = column.str.count(':') == 1
    column = column.mask(m, "00:" + column, axis=0)
    return column

### Method 2: Measuring the length of the string

In [36]:
def add_hours_zero(column):
    m = column.str.len() == 10
    column = column.mask(m, "0" + column, axis=0)
    return column

def add_hours(column):
    m = column.str.len() == 8
    column = column.mask(m, "00:" + column, axis=0)
    return column  

### Method 3: Using regex

In [37]:
def add_hours_zero(column):
    # Raw strings so that the regex backslashes are not treated as string escapes
    m = column.str.contains(r"^\d+:\d+:\d+\.\d+$", na=False)
    column = column.mask(m, "0" + column, axis=0)
    return column

def add_hours(column):
    m = column.str.contains(r"^\d+:\d+\.\d+$", na=False)
    column = column.mask(m, "00:" + column, axis=0)
    return column

### Applying the functions

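Note that `add_hours_zero` must be applied before `add_hours` to work properly on the data. Otherwise, with Methods 1 and 3, the strings produced by `add_hours` would also match the condition in `add_hours_zero` and gain an extra leading zero.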

In [38]:
df[["lap_1", "lap_2", "time"]] = df[["lap_1", "lap_2", "time"]].apply(add_hours_zero)
df[["lap_1", "lap_2", "time"]] = df[["lap_1", "lap_2", "time"]].apply(add_hours)

In [39]:
df[60:65]

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age,gender,category,bib_number
60,1.0,27.0,Luu Phuong Anh Dang,,8:27:09.30,00:27:50.50,00:28:51.00,00:56:41.50,10km,Open,13+,Female,Open,
61,2.0,28.0,Emily Astiz,Team Grit OCR,8:26:19.80,00:27:38.60,00:29:23.90,00:57:02.50,10km,Open,13+,Female,Open,298.0
62,3.0,36.0,Michelle Ferguson,,8:26:19.60,00:29:21.70,00:31:49.20,01:01:10.90,10km,Open,13+,Female,Open,303.0
63,4.0,45.0,Véronique Gille,,7:20:35.00,00:32:43.00,00:32:26.50,01:05:09.50,10km,Open,13+,Female,Open,281.0
64,5.0,47.0,Rachel Halliday,Dulwich Runners,8:28:03.90,00:32:38.70,00:33:02.00,01:05:40.70,10km,Open,13+,Female,Open,254.0
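
Once every entry carries an hours field, the strings can be handed to `pd.to_timedelta`. The following is a minimal sketch of that conversion on a few values copied from the tables above; `to_duration` is a hypothetical helper that only adds the missing hours field, which is the part the conversion needs:

```python
import pandas as pd

def to_duration(column):
    """Pad MM:SS.ff strings with an hours field, then convert to timedelta."""
    needs_hours = column.str.count(":") == 1
    padded = column.mask(needs_hours, "00:" + column)
    return pd.to_timedelta(padded)

laps = pd.DataFrame({
    "lap_1": ["21:09.20", "30:36.60"],
    "lap_2": ["21:55.00", "31:39.60"],
    "time": ["43:04.20", "1:02:16.20"],
})

durations = laps.apply(to_duration)

# Sanity check: the two laps should add up to the total time.
laps_match_total = (durations["lap_1"] + durations["lap_2"] == durations["time"]).all()
print(durations.dtypes)
print(laps_match_total)
```

Having real timedelta values makes checks like the lap-sum comparison possible, which is not practical on the raw strings.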


## Formatting the "start" column to datetime format 

The "start" column contains the race start time of the runner on that day. Parsing a time-only string fills in a placeholder date (1900-01-01), so we then replace it with the actual date of the event.

In [40]:
df["start"] = pd.to_datetime(df["start"], format="%H:%M:%S.%f")
df["start"] = df["start"].map(lambda x: x.replace(year=2021, month=4, day=4))

In [41]:
df.head()

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age,gender,category,bib_number
0,1.0,1.0,Malachy Kirwan,,2021-04-04 07:16:17.700,00:21:09.20,00:21:55.00,00:43:04.20,10km,Open,13+,Male,Open,316
1,2.0,2.0,William Petty,Coached,2021-04-04 07:16:18.700,00:21:43.80,00:22:00.40,00:43:44.20,10km,Open,13+,Male,Open,267
2,3.0,4.0,Chris Timms,Dulwich Runners,2021-04-04 07:16:58.300,00:22:15.30,00:23:27.00,00:45:42.30,10km,Open,13+,Male,Open,251
3,4.0,6.0,Benoit Besnier,COS Coaching,2021-04-04 07:16:17.800,00:23:32.70,00:24:02.80,00:47:35.50,10km,Open,13+,Male,Open,320
4,5.0,7.0,Daniel Rose,Coached Fitness,2021-04-04 07:16:59.200,00:22:55.10,00:24:41.30,00:47:36.40,10km,Open,13+,Male,Open,311
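
An alternative to parsing the time and then replacing the date is to treat each start time as an offset from midnight on race day. This is only a sketch of a different technique, not what the cell above does; `race_day` and the sample values are assumptions for illustration:

```python
import pandas as pd

# Start times as they appear in the raw data (H:MM:SS.ff).
start = pd.Series(["7:16:17.70", "12:10:29.10"])

# Midnight on the day of the event.
race_day = pd.Timestamp("2021-04-04")

# Adding the time of day as a timedelta gives full timestamps in one step.
stamped = race_day + pd.to_timedelta(start)

print(stamped)
```

This avoids the element-wise `map` call, which may matter for larger tables, though for a few hundred rows either approach is fine.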


## Casting of types for the rank columns 

The "category_rank" and "event_rank" columns have dtype `object`, and their values appear formatted as floats in the DataFrame. We can convert them to integers.

In [42]:
df["category_rank"] = pd.to_numeric(df["category_rank"]).astype(int)
df["event_rank"] = pd.to_numeric(df["event_rank"]).astype(int)

In [43]:
df.dtypes

category_rank             int32
event_rank                int32
name                     object
club                     object
start            datetime64[ns]
lap_1                    object
lap_2                    object
time                     object
distance                 object
event                    object
age                      object
gender                   object
category                 object
bib_number               object
dtype: object
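
The plain `astype(int)` cast works here because the rows without a rank were removed earlier. If such rows had been kept, one option would be pandas' nullable integer dtype. A minimal sketch on made-up values, assuming unparseable entries should become missing:

```python
import pandas as pd

# Rank strings as they come out of the PDF, plus a non-finisher (hypothetical).
ranks = pd.Series(["1.", "2.0", "DNS"])

# errors="coerce" turns anything that is not a number into NaN;
# the nullable "Int64" dtype can hold integers alongside missing values.
as_int = pd.to_numeric(ranks, errors="coerce").astype("Int64")

print(as_int)
```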

## Exporting the cleaned data to .csv 

Now, we can export the data, and re-import it to validate it.

In [44]:
df.to_csv("results_clean.csv", index=False)
df = pd.read_csv('results_clean.csv')
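
Because CSV stores everything as text, the dtypes are not preserved by the round trip, so the datetime and duration columns come back as strings unless they are converted again. The sketch below illustrates this with an in-memory buffer instead of a file and a one-row made-up frame (`clean` and `restored` are illustrative names):

```python
import io

import pandas as pd

clean = pd.DataFrame({
    "name": ["Malachy Kirwan"],
    "start": pd.to_datetime(["2021-04-04 07:16:17.700"]),
    "time": ["00:43:04.20"],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime column on import;
# the duration column is converted explicitly afterwards.
restored = pd.read_csv(buffer, parse_dates=["start"])
restored["time"] = pd.to_timedelta(restored["time"])

print(restored.dtypes)
```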

In [45]:
# Uncomment below to show entire table in Jupyter Notebook:
# pd.set_option("display.max_rows", None)
# To undo it:
# pd.reset_option('display.max_rows')

In [46]:
df

Unnamed: 0,category_rank,event_rank,name,club,start,lap_1,lap_2,time,distance,event,age,gender,category,bib_number
0,1,1,Malachy Kirwan,,2021-04-04 07:16:17.700,00:21:09.20,00:21:55.00,00:43:04.20,10km,Open,13+,Male,Open,316
1,2,2,William Petty,Coached,2021-04-04 07:16:18.700,00:21:43.80,00:22:00.40,00:43:44.20,10km,Open,13+,Male,Open,267
2,3,4,Chris Timms,Dulwich Runners,2021-04-04 07:16:58.300,00:22:15.30,00:23:27.00,00:45:42.30,10km,Open,13+,Male,Open,251
3,4,6,Benoit Besnier,COS Coaching,2021-04-04 07:16:17.800,00:23:32.70,00:24:02.80,00:47:35.50,10km,Open,13+,Male,Open,320
4,5,7,Daniel Rose,Coached Fitness,2021-04-04 07:16:59.200,00:22:55.10,00:24:41.30,00:47:36.40,10km,Open,13+,Male,Open,311
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140,2,23,Anupama Prakash,,2021-04-04 12:09:54.900,00:12:50.20,00:13:45.30,00:26:35.50,3km,Adventure Race,7+,Female,Open,407
141,3,25,Vanessa Routley,,2021-04-04 12:09:50.800,00:13:24.40,00:17:48.20,00:31:12.60,3km,Adventure Race,7+,Female,Open,408
142,4,27,Janice Teo,,2021-04-04 12:10:29.100,00:17:10.20,00:16:52.00,00:34:02.20,3km,Adventure Race,7+,Female,Open,419
143,5,29,Vivi Martanto,,2021-04-04 12:10:29.600,00:17:33.90,00:18:12.70,00:35:46.60,3km,Adventure Race,7+,Female,Open,423


In [47]:
df.isna().sum().sum()

0

This is the end of Part 1. The next part will show the analysis of the prepared data.