## Pre-Processing

Before I can model my data, a few more pre-processing and feature engineering steps are required.

In [1]:
import numpy as np 
import pandas as pd 

In [2]:
df = pd.read_csv('../csv/for_preprocessing.csv')

First, I'll check the shape and info of my current dataframe. I know I have some NaN and NaT values which will require handling.

In [3]:
df.shape

(25400, 22)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25400 entries, 0 to 25399
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         25400 non-null  int64  
 1   resultId           25400 non-null  int64  
 2   raceId             25400 non-null  int64  
 3   driverId           25400 non-null  int64  
 4   constructorId      25400 non-null  int64  
 5   grid               25400 non-null  int64  
 6   positionOrder      25400 non-null  int64  
 7   points             25400 non-null  float64
 8   laps               25400 non-null  int64  
 9   time               4881 non-null   float64
 10  milliseconds       6808 non-null   float64
 11  rank               7151 non-null   float64
 12  fastestLapTime     6953 non-null   object 
 13  fastestLapSpeed    6953 non-null   float64
 14  statusId           25400 non-null  int64  
 15  driverRef          25400 non-null  object 
 16  dob                254

First, I want to handle some of the columns of dtype object. A good place to start is dob, which should be converted to datetime. I can then carry out the filtering step mentioned previously: removing most inactive drivers from the modeling dataset by using the birthdate of the oldest active driver (Fernando Alonso), '1981-07-29', as the cutoff.

In [5]:
df['dob'] = pd.to_datetime(df['dob'])
df = df[~(df['dob'] < '1981-07-29')]

In [6]:
df.shape

(4965, 22)

It's possible that some drivers younger than Alonso who aren't regulars or have already retired, e.g. Nico Rosberg, are still present in the dataset. But we've significantly reduced the number of drivers present who shouldn't factor into predictions.
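
As a quick check of how the cutoff behaves, here is a minimal sketch on a toy frame. The rows 'older_driver' and 'unknown_dob' are hypothetical and only there for illustration. It also shows one subtlety: negating the comparison with ~ keeps rows whose dob is missing, whereas a direct >= comparison drops them.

```python
import pandas as pd

# Toy data: two real birthdates from this dataset plus two made-up rows.
toy = pd.DataFrame({
    'driverRef': ['older_driver', 'alonso', 'hamilton', 'unknown_dob'],
    'dob': ['1975-05-01', '1981-07-29', '1985-01-07', None],
})
toy['dob'] = pd.to_datetime(toy['dob'])
cutoff = pd.Timestamp('1981-07-29')

# Same form as the cell above: NaT < cutoff is False, so NaT rows survive.
kept_not_before = toy[~(toy['dob'] < cutoff)]

# Direct comparison: NaT >= cutoff is also False, so NaT rows are dropped.
kept_on_or_after = toy[toy['dob'] >= cutoff]

print(kept_not_before['driverRef'].tolist())
print(kept_on_or_after['driverRef'].tolist())
```

Which form is preferable depends on whether rows with an unknown birthdate should stay in the modeling set.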

Now it's time to convert the remaining object column that should be numerical, fastestLapTime, into datetime, and to drop some redundant features that won't be needed or helpful for modeling. Many of these are already accounted for in other categorical variables which can be encoded before modeling (e.g. 'statusId' and 'status').

In [7]:
df['fastestLapTime'] = pd.to_datetime(df['fastestLapTime'])
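
The timestamps this produces carry a calendar date (2022-07-29 in the outputs further down), which is not part of a lap time. If a plain number is preferred, one option is to convert the strings to seconds instead. This is only a sketch: it assumes the raw values look like 'm:ss.fff' (the raw strings are not shown in this notebook), and lap_to_seconds is a hypothetical helper, not something used elsewhere here.

```python
import pandas as pd

def lap_to_seconds(lap):
    """Convert an 'm:ss.fff' string to seconds; missing values stay NaN."""
    if pd.isna(lap):
        return float('nan')
    minutes, seconds = lap.split(':')
    return int(minutes) * 60 + float(seconds)

# Made-up sample values in the assumed format.
laps = pd.Series(['1:27.452', '1:28.090', None])
lap_seconds = laps.map(lap_to_seconds)
print(lap_seconds)
```

A float column like this can be compared, averaged and scaled directly, without a later cast from datetime to integer.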

In [8]:
df = df.drop(['resultId', 'driverId', 'constructorId', 'statusId'], axis=1)

In [9]:
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

### Replacing NaN with max

Now, I'd like to handle the NaN and NaT values. Since these values in 'fastestLapTime' invariably belong to drivers who failed to complete a lap and thus DNF'd the race (most likely due to a scratch in qualifying), a good approach is to fill them with the maximum lap time from the same race. That way, they always have a time as slow as the slowest car on the track.

In [10]:
race_match = df['raceId'] == 18
max(df['fastestLapTime'][race_match])

Timestamp('2022-07-29 01:32:01')

In [11]:
for race in df['raceId']:
    race_match = df['raceId'] == race
    df.fastestLapTime.fillna(max(df['fastestLapTime'][race_match]), inplace=True)
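
One thing to watch with this loop: fillna with a single value fills every gap in the column at once, so after the first pass there is nothing left to fill and every missing entry ends up with the first race's maximum. A per-race fill can be sketched with groupby and transform, as below. The toy frame and its lap times (in seconds) are made up for illustration.

```python
import pandas as pd

# Hypothetical lap times in seconds for two races, with one gap in each.
toy = pd.DataFrame({
    'raceId': [18, 18, 18, 19, 19],
    'fastestLapTime': [87.4, 92.0, None, 95.5, None],
})

# Scalar fill: every gap receives race 18's maximum, including race 19's.
race_18_max = toy.loc[toy['raceId'] == 18, 'fastestLapTime'].max()
scalar_filled = toy['fastestLapTime'].fillna(race_18_max)

# Per-race fill: transform('max') returns each row's own race maximum,
# and fillna aligns that Series with the column by index.
race_max = toy.groupby('raceId')['fastestLapTime'].transform('max')
per_race_filled = toy['fastestLapTime'].fillna(race_max)

print(scalar_filled.tolist())
print(per_race_filled.tolist())
```

The two results differ only in the last row, where race 19's gap is filled with 92.0 in the first case and 95.5 in the second.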

In [12]:
df[df['raceId'] == 18]

Unnamed: 0,raceId,grid,positionOrder,points,laps,time,milliseconds,rank,fastestLapTime,fastestLapSpeed,driverRef,dob,constructorRef,status,constructorPoints,avgPit,circuitRef
0,18,1,1,10.0,58,0.0,5690616.0,2.0,2022-07-29 01:27:27,218.3,hamilton,1985-01-07,mclaren,Finished,14.0,74893.543981,albert_park
2,18,7,3,6.0,58,8.163,5698779.0,5.0,2022-07-29 01:28:05,216.719,rosberg,1985-06-27,williams,Finished,9.0,49760.690763,albert_park
3,18,11,4,5.0,58,17.181,5707797.0,7.0,2022-07-29 01:28:36,215.464,alonso,1981-07-29,renault,Finished,5.0,69458.586705,albert_park
4,18,3,5,4.0,58,18.014,5708630.0,1.0,2022-07-29 01:27:25,218.385,kovalainen,1981-10-19,mclaren,Finished,14.0,24608.666667,albert_park
5,18,13,6,3.0,57,,,14.0,2022-07-29 01:29:38,212.974,nakajima,1985-01-11,williams,+1 Lap,9.0,,albert_park
8,18,2,9,0.0,47,,,9.0,2022-07-29 01:28:45,215.1,kubica,1984-12-07,bmw_sauber,Collision,8.0,24428.738095,albert_park
9,18,18,10,0.0,43,,,13.0,2022-07-29 01:29:33,213.166,glock,1982-03-18,toyota,Accident,,23743.946667,albert_park
11,18,20,12,0.0,30,,,16.0,2022-07-29 01:31:23,208.907,piquet_jr,1985-07-25,renault,Clutch,5.0,,albert_park
15,18,22,16,0.0,8,,,17.0,2022-07-29 01:32:01,207.461,sutil,1983-01-11,force_india,Hydraulics,,34640.107438,albert_park
19,18,9,20,0.0,0,,,,2022-07-29 01:32:01,,vettel,1987-07-03,toro_rosso,Collision,2.0,71738.981395,albert_park


Now, I'll do similar operations for 'fastestLapSpeed', 'rank', 'time', 'milliseconds', and 'avgPit'. Then I'll look at the first race in the table to verify that the NaN/NaT values in these columns have been handled.

In [13]:
for race in df['raceId']:
    race_match = df['raceId'] == race
    df.fastestLapSpeed.fillna(max(df['fastestLapSpeed'][race_match]), inplace=True)

In [14]:
for race in df['raceId']:
    race_match = df['raceId'] == race
    df['rank'].fillna(max(df['rank'][race_match]), inplace=True)

In [15]:
for race in df['raceId']:
    race_match = df['raceId'] == race
    df.time.fillna(max(df['time'][race_match]), inplace=True)

In [16]:
for race in df['raceId']:
    race_match = df['raceId'] == race
    df.milliseconds.fillna(max(df['milliseconds'][race_match]), inplace=True)

In [17]:
for race in df['raceId']:
    race_match = df['raceId'] == race
    df.avgPit.fillna(max(df['avgPit'][race_match]), inplace=True)
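
The five loops above follow the same pattern, so they could be collapsed into a single step. A minimal sketch, assuming the goal is to fill each gap with the maximum from the same race; the toy values are hypothetical and only two of the columns are shown.

```python
import pandas as pd

toy = pd.DataFrame({
    'raceId': [18, 18, 19, 19],
    'rank': [1.0, None, 3.0, None],
    'time': [0.0, 8.2, None, 5.0],
})

cols = ['rank', 'time']

# One per-race maximum for every listed column, shaped like toy[cols].
race_max = toy.groupby('raceId')[cols].transform('max')

# DataFrame.fillna aligns on both index and column labels.
filled = toy.copy()
filled[cols] = toy[cols].fillna(race_max)
print(filled)
```

Extending cols to the full list of columns would cover all of them in one pass instead of one loop per column.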

In [18]:
df[df['raceId'] == 18]

Unnamed: 0,raceId,grid,positionOrder,points,laps,time,milliseconds,rank,fastestLapTime,fastestLapSpeed,driverRef,dob,constructorRef,status,constructorPoints,avgPit,circuitRef
0,18,1,1,10.0,58,0.0,5690616.0,2.0,2022-07-29 01:27:27,218.3,hamilton,1985-01-07,mclaren,Finished,14.0,74893.543981,albert_park
2,18,7,3,6.0,58,8.163,5698779.0,5.0,2022-07-29 01:28:05,216.719,rosberg,1985-06-27,williams,Finished,9.0,49760.690763,albert_park
3,18,11,4,5.0,58,17.181,5707797.0,7.0,2022-07-29 01:28:36,215.464,alonso,1981-07-29,renault,Finished,5.0,69458.586705,albert_park
4,18,3,5,4.0,58,18.014,5708630.0,1.0,2022-07-29 01:27:25,218.385,kovalainen,1981-10-19,mclaren,Finished,14.0,24608.666667,albert_park
5,18,13,6,3.0,57,18.014,5708630.0,14.0,2022-07-29 01:29:38,212.974,nakajima,1985-01-11,williams,+1 Lap,9.0,74893.543981,albert_park
8,18,2,9,0.0,47,18.014,5708630.0,9.0,2022-07-29 01:28:45,215.1,kubica,1984-12-07,bmw_sauber,Collision,8.0,24428.738095,albert_park
9,18,18,10,0.0,43,18.014,5708630.0,13.0,2022-07-29 01:29:33,213.166,glock,1982-03-18,toyota,Accident,,23743.946667,albert_park
11,18,20,12,0.0,30,18.014,5708630.0,16.0,2022-07-29 01:31:23,208.907,piquet_jr,1985-07-25,renault,Clutch,5.0,74893.543981,albert_park
15,18,22,16,0.0,8,18.014,5708630.0,17.0,2022-07-29 01:32:01,207.461,sutil,1983-01-11,force_india,Hydraulics,,34640.107438,albert_park
19,18,9,20,0.0,0,18.014,5708630.0,17.0,2022-07-29 01:32:01,218.385,vettel,1987-07-03,toro_rosso,Collision,2.0,71738.981395,albert_park


For the 'constructorPoints' column, it makes more sense to fill NaN with zero.

In [19]:
df['constructorPoints'].fillna(0, inplace=True)

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4965 entries, 0 to 25399
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   raceId             4965 non-null   int64         
 1   grid               4965 non-null   int64         
 2   positionOrder      4965 non-null   int64         
 3   points             4965 non-null   float64       
 4   laps               4965 non-null   int64         
 5   time               4965 non-null   float64       
 6   milliseconds       4965 non-null   float64       
 7   rank               4965 non-null   float64       
 8   fastestLapTime     4965 non-null   datetime64[ns]
 9   fastestLapSpeed    4965 non-null   float64       
 10  driverRef          4965 non-null   object        
 11  dob                4965 non-null   datetime64[ns]
 12  constructorRef     4965 non-null   object        
 13  status             4965 non-null   object        
 14  constru

All NaN values have now been taken care of.

### Engineering avgFlyingLap

There are two features in the current dataset that convey useful information about how quickly drivers can get around the track, but they require some manipulation to be useful for prediction purposes.

This model is intended to make predictions at a very specific point in a Grand Prix weekend: right *after* the qualifying session on Saturday and *before* the start of the race on Sunday. This means that a driver's fastestLapTime and fastestLapSpeed for the race in question are not yet available at the point at which we are predicting the race winner.

There are two ways this can be handled. These columns can be shifted back one race to show each driver's most recent performance, or they can be averaged to give a general picture of how each driver fares in that category. For fastestLapSpeed, the latter approach makes more sense.

In [21]:
df['avgFlyingLap'] = df.groupby('driverRef')['fastestLapSpeed'].transform('mean')
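
To make the difference between the two options concrete, here is a small sketch on a toy frame. The driver names and speeds are hypothetical, and it assumes that sorting by raceId puts each driver's races in chronological order.

```python
import pandas as pd

toy = pd.DataFrame({
    'driverRef': ['driver_a', 'driver_a', 'driver_a', 'driver_b', 'driver_b'],
    'raceId': [1, 2, 3, 1, 2],
    'fastestLapSpeed': [200.0, 210.0, 220.0, 190.0, 200.0],
})
toy = toy.sort_values(['driverRef', 'raceId'])
by_driver = toy.groupby('driverRef')['fastestLapSpeed']

# Option 1: shift back one race, so each row sees the previous race's speed.
prev_speed = by_driver.shift(1)

# Option 2: one overall mean per driver, repeated on each of their rows.
avg_speed = by_driver.transform('mean')

toy['prevFlyingLap'] = prev_speed
toy['avgFlyingLap'] = avg_speed
print(toy)
```

The shifted column leaves each driver's first race empty and only looks backward, while the overall mean uses every race a driver has in the table, including ones after the race being predicted.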

Now let's drop the fastestLapTime and fastestLapSpeed columns since (a) they're not useful for our predictions, as previously discussed, and (b) the information they convey is now captured in the new avgFlyingLap column. Then I'll inspect the df to make sure it looks ok.

In [22]:
df.drop(columns=['fastestLapTime', 'fastestLapSpeed'], inplace=True)
df

Unnamed: 0,raceId,grid,positionOrder,points,laps,time,milliseconds,rank,driverRef,dob,constructorRef,status,constructorPoints,avgPit,circuitRef,avgFlyingLap
0,18,1,1,10.0,58,0.000,5690616.0,2.0,hamilton,1985-01-07,mclaren,Finished,14.0,74893.543981,albert_park,205.447097
2,18,7,3,6.0,58,8.163,5698779.0,5.0,rosberg,1985-06-27,williams,Finished,9.0,49760.690763,albert_park,200.993073
3,18,11,4,5.0,58,17.181,5707797.0,7.0,alonso,1981-07-29,renault,Finished,5.0,69458.586705,albert_park,206.055815
4,18,3,5,4.0,58,18.014,5708630.0,1.0,kovalainen,1981-10-19,mclaren,Finished,14.0,24608.666667,albert_park,198.785598
5,18,13,6,3.0,57,18.014,5708630.0,14.0,nakajima,1985-01-11,williams,+1 Lap,9.0,74893.543981,albert_park,202.032667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25395,1072,17,16,0.0,44,18.014,5708630.0,16.0,vettel,1987-07-03,aston_martin,Collision damage,77.0,71738.981395,jeddah,205.278564
25396,1072,5,17,0.0,14,18.014,5708630.0,17.0,perez,1990-01-26,red_bull,Collision,559.5,74356.862155,jeddah,204.522298
25397,1072,20,18,0.0,14,18.014,5708630.0,20.0,mazepin,1999-03-02,haas,Collision,0.0,211286.047619,jeddah,210.124091
25398,1072,14,19,0.0,14,18.014,5708630.0,19.0,russell,1998-02-15,williams,Collision,23.0,135214.130841,jeddah,210.090883


### Encoding Categoricals

There are several features that are categorical and need to be encoded.

In [39]:
dfo = df.select_dtypes(include=['object'])
df = pd.concat([df.drop(columns=dfo.columns), pd.get_dummies(dfo)], axis=1)
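
For reference, get_dummies can also be pointed at specific columns directly, which avoids the separate select, drop and concat steps. A small sketch on a toy frame with made-up rows; dtype=int is passed only so the indicator columns come out as 0/1 regardless of pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    'grid': [1, 2, 3],
    'driverRef': ['hamilton', 'alonso', 'hamilton'],
    'status': ['Finished', 'Collision', 'Finished'],
})

# Encode only the named columns; all other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=['driverRef', 'status'], dtype=int)
print(encoded.columns.tolist())
```

Each new column is named after the original column and the category it flags, which matches the circuitRef_... pattern seen in the output below.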

In [40]:
df.head()

Unnamed: 0,raceId,grid,positionOrder,points,laps,time,milliseconds,rank,dob,constructorPoints,...,circuitRef_shanghai,circuitRef_silverstone,circuitRef_sochi,circuitRef_spa,circuitRef_suzuka,circuitRef_valencia,circuitRef_villeneuve,circuitRef_yas_marina,circuitRef_yeongam,circuitRef_zandvoort
0,18,1,1,10.0,58,0.0,5690616.0,2.0,1985-01-07,14.0,...,0,0,0,0,0,0,0,0,0,0
2,18,7,3,6.0,58,8.163,5698779.0,5.0,1985-06-27,9.0,...,0,0,0,0,0,0,0,0,0,0
3,18,11,4,5.0,58,17.181,5707797.0,7.0,1981-07-29,5.0,...,0,0,0,0,0,0,0,0,0,0
4,18,3,5,4.0,58,18.014,5708630.0,1.0,1981-10-19,14.0,...,0,0,0,0,0,0,0,0,0,0
5,18,13,6,3.0,57,18.014,5708630.0,14.0,1985-01-11,9.0,...,0,0,0,0,0,0,0,0,0,0


### Scaling

There are numerical features on vastly different scales (compare time and milliseconds), so some scaling is necessary here.

In [41]:
from sklearn.preprocessing import StandardScaler

In [42]:
df = df.astype({'dob': 'int64'})

In [44]:
df

Unnamed: 0,raceId,grid,positionOrder,points,laps,time,milliseconds,rank,dob,constructorPoints,...,circuitRef_shanghai,circuitRef_silverstone,circuitRef_sochi,circuitRef_spa,circuitRef_suzuka,circuitRef_valencia,circuitRef_villeneuve,circuitRef_yas_marina,circuitRef_yeongam,circuitRef_zandvoort
0,18,1,1,10.0,58,0.000,5690616.0,2.0,473904000000000000,14.0,...,0,0,0,0,0,0,0,0,0,0
2,18,7,3,6.0,58,8.163,5698779.0,5.0,488678400000000000,9.0,...,0,0,0,0,0,0,0,0,0,0
3,18,11,4,5.0,58,17.181,5707797.0,7.0,365212800000000000,5.0,...,0,0,0,0,0,0,0,0,0,0
4,18,3,5,4.0,58,18.014,5708630.0,1.0,372297600000000000,14.0,...,0,0,0,0,0,0,0,0,0,0
5,18,13,6,3.0,57,18.014,5708630.0,14.0,474249600000000000,9.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25395,1072,17,16,0.0,44,18.014,5708630.0,16.0,552268800000000000,77.0,...,0,0,0,0,0,0,0,0,0,0
25396,1072,5,17,0.0,14,18.014,5708630.0,17.0,633312000000000000,559.5,...,0,0,0,0,0,0,0,0,0,0
25397,1072,20,18,0.0,14,18.014,5708630.0,20.0,920332800000000000,0.0,...,0,0,0,0,0,0,0,0,0,0
25398,1072,14,19,0.0,14,18.014,5708630.0,19.0,887500800000000000,23.0,...,0,0,0,0,0,0,0,0,0,0


In [45]:
scaler = StandardScaler()

names = list(df.columns)

scaled_df = scaler.fit_transform(df)
# Keep df's original row labels so scaled_df lines up with df row for row.
scaled_df = pd.DataFrame(scaled_df, columns=names, index=df.index)

### Splitting Train and Test Data

With the other preprocessing steps completed, and the scaler fit on the full dataframe, it's time to perform a train/test split. I will eventually want to create a model that predicts finishing position, so 'positionOrder' will be my designated y / target feature.

In [46]:
y = df['positionOrder']
X = df.drop(columns = ['positionOrder'])

In [47]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
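
Note that the split above is taken from df, so the values in scaled_df do not reach X_train or X_test. A common alternative, sketched here on synthetic data rather than the real dataset, is to split first and then fit the scaler on the training rows only, so that the test rows do not influence the scaling. The column names mirror the ones used here, but the values and the random_state are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataframe.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'grid': rng.integers(1, 21, size=50),
    'milliseconds': rng.normal(5.7e6, 1e4, size=50),
    'positionOrder': rng.integers(1, 21, size=50),
})

y = toy['positionOrder']
X = toy.drop(columns=['positionOrder'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Fit on the training features only, then apply the same transform to both.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)
print(X_train_scaled.shape, X_test_scaled.shape)
```

The target is left unscaled here, and passing index= keeps the scaled frames aligned with y_train and y_test.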

In [48]:
df.to_csv('../csv/preprocessed_data.csv')