This notebook walks through my observations of the lap time data, including some discrepancies that will need to be dealt with.

The end goal of this exploration is to figure out how to get a more accurate average lap time for each driver from 2014 to 2021, which we can use to analyze McLaren's performance in the other notebook, ``mclaren_hybrid_era.ipynb``.

In [6]:
# Importing libraries
import pandas as pd
from IPython.display import display

In [7]:
# Read in the data - Here, we are assuming that the data is in the same directory,
# but this can be changed to point to where the files are stored locally
results = pd.read_csv('results.csv')
lap_times = pd.read_csv('lap_times.csv')

In [8]:
# Getting the lap times from the 2021 Azerbaijan Grand Prix
azerbaijan_lap_times = lap_times[lap_times['raceId'] == 1057][['raceId','driverId','lap','time', 'milliseconds']]

# Displaying lap times for lap 47 and comparing them with lap 49
display(azerbaijan_lap_times[azerbaijan_lap_times['lap'] == 47].head())
display(azerbaijan_lap_times[azerbaijan_lap_times['lap'] == 49].head())

Unnamed: 0,raceId,driverId,lap,time,milliseconds
497008,1057,844,47,2:13.338,133338
497059,1057,1,47,2:15.406,135406
497155,1057,815,47,2:16.165,136165
497206,1057,842,47,2:13.442,133442
497257,1057,832,47,2:15.333,135333


Unnamed: 0,raceId,driverId,lap,time,milliseconds
497010,1057,844,49,36:27.115,2187115
497061,1057,1,49,36:24.844,2184844
497157,1057,815,49,36:25.269,2185269
497208,1057,842,49,36:26.612,2186612
497259,1057,832,49,36:26.392,2186392


Above, we display lap times from the 47th and 49th laps of the 2021 Azerbaijan Grand Prix. There is a major discrepancy between the two for most of the drivers. The lap 47 times are in line with what we expect, but the times listed for lap 48 or 49 (depending on the driver) are extremely high, roughly 34 minutes longer, and this affected a majority of the drivers on the track.

Luckily, I watched this race! In particular, one thing that stood out while watching it is the ending, where Max Verstappen (the race leader at the time) crashed due to a tire blowout. This resulted in a red flag, which stopped the race until track conditions were safe enough to resume. The [Wikipedia article](https://en.wikipedia.org/wiki/2021_Azerbaijan_Grand_Prix) for the race says the following:

> __"*After a delay of 34 minutes, the race was restarted on lap 50 with a standing start, using the race order at the moment of the suspension of the race.*"__

The delay caused by the red flag explains the discrepancy: the recorded lap time seems to include the time spent waiting for the race to resume. This is a problem because, even though most drivers are affected in the same way, any driver who stopped racing before the affected lap has no inflated lap time recorded in the table, so their average lap time is not pushed up by it.

I have noticed similar discrepancies in other races that were stopped, along with less drastic changes in lap time (e.g. a yellow flag requiring drivers to slow down), so if we want a more accurate representation of race pace, we will need to deal with these outliers.

In [9]:
# Getting the average lap time and standard deviation of the lap times for each driver in the 2021 Azerbaijan GP
driver_average_lap = azerbaijan_lap_times.groupby(['driverId', 'raceId'])['milliseconds'].mean().reset_index()
driver_std = azerbaijan_lap_times.groupby(['driverId', 'raceId'])['milliseconds'].std().reset_index()

# Also getting the number of laps completed in the race. This could also be obtained
# from the results DataFrame loaded above
driver_laps_completed = azerbaijan_lap_times.groupby(['driverId', 'raceId'])['lap'].max().reset_index()

# Rename columns to be more descriptive
driver_average_lap.rename(columns={'milliseconds': 'driver_average_lap_time'}, inplace=True)
driver_std.rename(columns={'milliseconds': 'standard_deviation'}, inplace=True)
driver_laps_completed.rename(columns={'lap': 'laps_completed'}, inplace=True)

# Join the DataFrames
driver_average_lap = driver_average_lap.merge(driver_std, on=['driverId', 'raceId'])
driver_average_lap = driver_average_lap.merge(driver_laps_completed, on=['driverId', 'raceId'])
display(driver_average_lap)

Unnamed: 0,driverId,raceId,driver_average_lap_time,standard_deviation,laps_completed
0,1,1057,157530.941176,290540.350845,51
1,4,1057,157309.647059,290220.384483,51
2,8,1057,157372.27451,290516.964431,51
3,20,1057,157211.666667,290325.370916,51
4,815,1057,157184.509804,290636.223893,51
5,817,1057,157358.509804,290424.128217,51
6,822,1057,157405.372549,291230.694874,51
7,830,1057,114123.644444,22343.385336,45
8,832,1057,157335.666667,290511.262881,51
9,839,1057,120568.333333,9988.131173,3


Above, we have each driver who recorded lap times in the 2021 Azerbaijan Grand Prix, along with their average lap time and standard deviation (both in milliseconds). We also have the laps completed, in case we want to filter results to drivers who completed a certain percentage of the race.
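The completion filter mentioned above could look something like the sketch below. The three rows reuse `laps_completed` values from the table above; the 90% cutoff (`min_share`) and the use of the highest lap count in the race as the race length are assumptions for illustration, not something this notebook settles on.

```python
import pandas as pd

# A few rows shaped like driver_average_lap, with laps_completed copied from the table above
summary = pd.DataFrame({
    'driverId': [1, 830, 839],
    'raceId': [1057, 1057, 1057],
    'laps_completed': [51, 45, 3],
})

# Assumed cutoff: keep drivers who completed at least 90% of the longest run in the race
min_share = 0.9

# Highest lap count per race, broadcast back onto every row of that race
race_length = summary.groupby('raceId')['laps_completed'].transform('max')

finishers = summary[summary['laps_completed'] >= min_share * race_length]
print(finishers['driverId'].tolist())
```

With these rows, only driverId 1 passes, since 45 of 51 laps falls just short of 90%.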

With this, we can calculate the z-score of each lap by subtracting the driver's mean lap time from the lap time and dividing the result by the driver's standard deviation. If the absolute value of the z-score is above a specified threshold (e.g. 3 standard deviations), we can flag the lap as an outlier and drop it from the DataFrame. Let's see if we can use this to remove the inflated lap times from the red flag period.
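As a compact illustration of the z-score idea, here is a minimal sketch that computes the per-driver, per-race z-score with `groupby(...).transform` instead of merging summary tables. The helper name `drop_lap_time_outliers` and the demo data are hypothetical (only the 2184844 ms lap is taken from the output above); it assumes the same `raceId`, `driverId`, and `milliseconds` columns used in this notebook.

```python
import pandas as pd

def drop_lap_time_outliers(laps: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Keep laps whose per-driver, per-race |z-score| is within the threshold."""
    grouped = laps.groupby(['raceId', 'driverId'])['milliseconds']
    mean = grouped.transform('mean')
    std = grouped.transform('std')  # sample standard deviation, same as Series.std()
    z_score = (laps['milliseconds'] - mean) / std
    # A z-score is NaN when it cannot be computed (e.g. a single lap); keep those rows
    keep = z_score.abs().le(threshold) | z_score.isna()
    return laps[keep]

# Synthetic demo: 19 ordinary laps plus one red-flag lap for driver 1, 3 ordinary laps for driver 2
driver_1_laps = [105000 + 100 * i for i in range(19)] + [2184844]
driver_2_laps = [106000, 106500, 107000]
demo = pd.DataFrame({
    'raceId': 1057,
    'driverId': [1] * len(driver_1_laps) + [2] * len(driver_2_laps),
    'milliseconds': driver_1_laps + driver_2_laps,
})

filtered = drop_lap_time_outliers(demo)
print(len(demo), len(filtered))
```

Only the red-flag lap is dropped here; every other lap sits well within 3 standard deviations of its driver's mean.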

In [10]:
# Getting the calculations and joining them with the appropriate drivers to compare to each lap time
azerbaijan_filtered = azerbaijan_lap_times.merge(driver_average_lap, on=['raceId','driverId'])

# Calculating z-score for each lap
azerbaijan_filtered['z-score'] = (azerbaijan_filtered['milliseconds'] - azerbaijan_filtered['driver_average_lap_time']) / azerbaijan_filtered['standard_deviation']

# Displaying lap times that are more than 3 standard deviations from the average lap time for the driver
display(azerbaijan_filtered[azerbaijan_filtered['z-score'].abs() > 3])

# The below lines can be uncommented to take a look at the lap times of several drivers

# display(azerbaijan_filtered[azerbaijan_filtered['driverId'] == 830])
# display(azerbaijan_filtered[azerbaijan_filtered['driverId'] == 1])
# display(azerbaijan_filtered[azerbaijan_filtered['driverId'] == 840])

# After calculating the z-score, the below line can be uncommented if we only wanted
# lap times less than 3 standard deviations from the mean

# azerbaijan_filtered = azerbaijan_filtered[azerbaijan_filtered['z-score'].abs() < 3]

Unnamed: 0,raceId,driverId,lap,time,milliseconds,driver_average_lap_time,standard_deviation,laps_completed,z-score
48,1057,844,49,36:27.115,2187115,157259.568627,290734.757224,51,6.981812
99,1057,1,49,36:24.844,2184844,157530.941176,290540.350845,51,6.977733
133,1057,830,32,3:17.279,197279,114123.644444,22343.385336,45,3.721699
136,1057,830,35,3:05.197,185197,114123.644444,22343.385336,45,3.180957
195,1057,815,49,36:25.269,2185269,157184.509804,290636.223893,51,6.978086
246,1057,842,49,36:26.612,2186612,157238.666667,290686.902034,51,6.981303
297,1057,832,49,36:26.392,2186392,157335.666667,290511.262881,51,6.984433
348,1057,4,49,36:24.868,2184868,157309.647059,290220.384483,51,6.986271
399,1057,852,49,36:26.385,2186385,157314.392157,290605.354823,51,6.98222
450,1057,20,49,36:24.047,2184047,157211.666667,290325.370916,51,6.981255


With our results above, we are displaying every lap time that is more than 3 standard deviations from the average lap time of that driver. The extremely long lap times from the stopped session are displayed here along with other slower lap times that could have been caused by a variety of factors.

We should note the possibility that we might not want to get rid of some of the rows that are filtered out. Using the above results as an example, I have some guesses for why some of these lap times are being filtered out:
- A yellow flag is shown for incidents considered too minor to stop the race. In this race, an example is Lance Stroll's crash at around lap 30. Yellow flags and possible safety cars limit how fast the drivers can go and prevent overtaking on track, and the resulting slower lap times might end up with a higher z-score.
- In the DataFrame displayed above, we see the first lap for driverId __840 (Lance Stroll)__ as having a high enough z-score to be filtered out. His lap times improve over the race until he crashes before finishing his 30th lap, and his slower opening lap sticks out more even though there isn't a noticeable reason why we should exclude that lap from our average.

That said, what we do get from removing outliers is a more consistent look at how a driver performs throughout their race. By comparing average lap times to each other, we can deduce the performance of a team's car compared to the rest of the grid, along with performance gaps between drivers in the same car.
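As a rough check on how much a single red-flag lap shifts an average, the figures for driverId 1 in the outlier table above (mean, laps completed, and the lap 49 time) can be plugged into a short calculation. This is a back-of-the-envelope sketch that assumes one recorded row per lap, so the lap count equals `laps_completed`; the variable names are made up for illustration.

```python
# Figures copied from the driverId 1 row of the outlier table above
mean_with_red_flag_ms = 157530.941176  # average over all recorded laps
laps_recorded = 51                     # assumes one row per lap
red_flag_lap_ms = 2184844              # lap 49, recorded as 36:24.844

# Recover the total race time, remove the inflated lap, and re-average
total_ms = mean_with_red_flag_ms * laps_recorded
mean_without_red_flag_ms = (total_ms - red_flag_lap_ms) / (laps_recorded - 1)

print(round(mean_without_red_flag_ms))  # about 116985 ms, down from about 157531 ms
```

Dropping that one lap lowers the average by roughly 40 seconds, which is why drivers who retired before the red flag look misleadingly quick in the unfiltered averages.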