# Purpose

Throughout a race the speed of an F1 car changes dramatically: it peaks at the end of each straight just before a hard braking zone, then builds smoothly again on the exit of each corner.  The speed vs. time curve is the most iconic piece of telemetry used for interpreting a driver's lap quality.  Because each track has a distinct shape and a unique combination of straights and corners, each telemetry trace tells the story of the circuit it belongs to: the signature of the track.

In this notebook I demonstrate that the circuit in question can be identified solely from this speed telemetry trace using a *Supervised Neural Network* trained on a subset of laps from the F1 2022 season, obtained using the [fastf1](https://theoehrly.github.io/Fast-F1) package.

# Method

Telemetry traces are obtained for each track using the [fastf1](https://theoehrly.github.io/Fast-F1) package.  This tool does not provide data sampled at a regular rate, so some data cleaning is required before the traces can be used in this notebook.  This has been performed in advance using the functions provided by the [f1_djsouthall/tools/fastf1_tools.py](https://github.com/djsouthall/f1_djsouthall/blob/main/tools/fastf1_tools.py) script.  The laps are resampled to a 0.1 second sampling interval, and all non-standard laps have been removed (i.e. laps with a safety car, yellow flags, or a pit stop in-lap or out-lap).  These standardized laps are loaded here and used for training the NN.
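As a rough illustration of the resampling step described above, the sketch below linearly interpolates an irregularly sampled speed trace onto a regular 0.1 second grid.  It is not the code in `fastf1_tools.py`: the helper name `resample_trace` and the choice of linear interpolation are assumptions made purely for this example.

```python
import numpy as np
import pandas as pd

def resample_trace(times_s, speeds, dt=0.1):
    """Interpolate an irregularly sampled trace onto a regular time grid.

    Hypothetical helper for illustration only (not part of fastf1 or
    fastf1_tools.py).  times_s must be increasing and given in seconds.
    """
    times_s = np.asarray(times_s, dtype=float)
    speeds = np.asarray(speeds, dtype=float)
    # Number of grid points that fit between t=0 and the last sample.
    n_points = int(np.floor(times_s[-1] / dt + 1e-9)) + 1
    grid = np.arange(n_points) * dt
    # Linear interpolation between neighbouring samples.
    resampled = np.interp(grid, times_s, speeds)
    return pd.Series(resampled, index=pd.to_timedelta(grid, unit='s'))

# Toy trace sampled at uneven intervals.
trace = resample_trace([0.0, 0.25, 0.5], [100.0, 110.0, 120.0])
print(trace)
```

Indexing the result by a timedelta mirrors the `0 days 00:00:00.100000` style index seen in the pre-processed laps loaded below.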

I am using the [tensorflow](https://www.tensorflow.org/) and [keras](https://keras.io/) APIs to construct and train this model.

In [54]:
# Perform imports and setup code

disable_gpu = False # Helpful for debugging, sometimes performs differently with GPU enabled. 

# OS imports
import os
import sys

# Common imports
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.collections import LineCollection
from matplotlib import cm
%matplotlib widget
plt.ion()

# F1 imports
import fastf1
import fastf1.plotting
fastf1.Cache.enable_cache(os.environ['f1_cache'])  
sys.path.append(os.environ['f1_install'])
from tools.fastf1_tools import loadTelemForYear, loadTelemForEvent

# ML imports
import tensorflow as tf

if disable_gpu:
    try:
        # Disable all GPUs
        tf.config.set_visible_devices([], 'GPU')
        visible_devices = tf.config.get_visible_devices()
        for device in visible_devices:
            assert device.device_type != 'GPU'
    except (ValueError, RuntimeError):
        # Invalid device or cannot modify virtual devices once initialized.
        pass
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

from tensorflow import keras
import keras_tuner
from keras import optimizers

# Check for GPU
physical_devices = tf.config.list_physical_devices('GPU')
print('Number of GPUs Available: {}'.format(physical_devices))
if physical_devices and not disable_gpu:
    # Allocate GPU memory as needed rather than reserving it all up front.
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

# Disable warnings
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

Number of GPUs Available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


Next we load in the lap telemetry and filter it down to the subset of drivers and tracks we wish to train on. 

In [55]:
# Load and filter lap telemetry.

# Specify lap telemetry info
path = os.path.join(os.environ['f1_install'], 'dataframes') # Location of pre-processed telemetry data.
year = 2022
telem_param = 'Speed'

# Get telem for entire year
refined_laps, track_index_key = loadTelemForYear(year, path=path, telem_param=telem_param, return_index_key=True) # Formatted laps generated by ml/save_lap_telemetry.py

# Here we can filter to a subset of drivers
all_drivers = np.unique([c.split('_')[0] for c in refined_laps.columns])

# For simplicity here we only consider the laps of Max Verstappen, 
# however in principle we could train the NN to distinguish using all 
# driver data.
if True:
    choose_drivers = ['VER']
else:
    choose_drivers = all_drivers

# Here we have the option to only include a subset of the tracks.
# We will use all of them.
all_track_indices = list(track_index_key.keys())

if True:
    included_track_indices = all_track_indices # Use all of the tracks
else:
    included_track_indices = all_track_indices[:5] # Use only the first 5 tracks

columns = refined_laps.columns[[np.isin(c.split('_')[0], choose_drivers) for c in refined_laps.columns]] # Select subset of columns relevant to selected drivers
columns = columns[[np.isin(int(c.split('_')[1]), included_track_indices) for c in columns]] # Select subset of columns relevant to selected tracks
refined_laps = refined_laps[columns] # Reduce to only relevant columns (laps)

Loading C:\Users\dsouthall\projects\f1\f1_djsouthall\dataframes\Speed_Bahrain_Grand_Prix_2022.pkl
Loading C:\Users\dsouthall\projects\f1\f1_djsouthall\dataframes\Speed_Saudi_Arabian_Grand_Prix_2022.pkl
Loading C:\Users\dsouthall\projects\f1\f1_djsouthall\dataframes\Speed_Australian_Grand_Prix_2022.pkl
Loading C:\Users\dsouthall\projects\f1\f1_djsouthall\dataframes\Speed_Emilia_Romagna_Grand_Prix_2022.pkl
Loading C:\Users\dsouthall\projects\f1\f1_djsouthall\dataframes\Speed_Miami_Grand_Prix_2022.pkl
Loading C:\Users\dsouthall\projects\f1\f1_djsouthall\dataframes\Speed_Spanish_Grand_Prix_2022.pkl
Loading C:\Users\dsouthall\projects\f1\f1_djsouthall\dataframes\Speed_Monaco_Grand_Prix_2022.pkl
Loading C:\Users\dsouthall\projects\f1\f1_djsouthall\dataframes\Speed_Azerbaijan_Grand_Prix_2022.pkl
Loading C:\Users\dsouthall\projects\f1\f1_djsouthall\dataframes\Speed_Canadian_Grand_Prix_2022.pkl
Loading C:\Users\dsouthall\projects\f1\f1_djsouthall\dataframes\Speed_British_Grand_Prix_2022.pkl
Loa

Here we randomize the lap order.  Though keras shuffles these as well, I choose to do so in advance here for clarity.

At this stage I also clip the laps to the shortest lap of the year to remove NaN values where lap lengths differ.

In [56]:
randomized_indices = np.random.permutation(len(refined_laps.columns)) # Permutation, so each lap appears exactly once (no repeated laps).
refined_laps = refined_laps.iloc[:, randomized_indices] # Shuffled randomly
refined_laps = refined_laps.dropna() # Will cut off laps beyond shortest lap of the season.  Quick and dirty way to normalize data. 

print('Column label format: {DRIVER}_{EVENT_ID}_{LAP_ID}')
print(refined_laps.head().T.head(4).T) #Print the head of the first 4 laps

Column label format: {DRIVER}_{EVENT_ID}_{LAP_ID}
                         VER_12_51    VER_15_8     VER_9_2   VER_12_27
0 days 00:00:00         273.000000  223.000000  322.666667  273.000000
0 days 00:00:00.100000  276.375000  225.625000  322.375000  274.000000
0 days 00:00:00.200000  278.365385  227.442308  322.038462  275.153846
0 days 00:00:00.300000  280.143750  229.168750  321.675000  276.400000
0 days 00:00:00.400000  282.055785  231.733471  321.446281  276.801653


At this point we have obtained a set of consistently formatted (sampled equally in time and of equal length) speed traces.
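The clipping performed by `dropna()` above can be seen on a toy example.  The sketch below uses made-up speed values and column names that only imitate the `{DRIVER}_{EVENT_ID}_{LAP_ID}` format; it is meant to show the mechanism, not real data.

```python
import numpy as np
import pandas as pd

# Three toy laps of different lengths; shorter laps are padded with NaN
# when placed side by side as columns of one DataFrame.
laps = pd.DataFrame({
    'VER_1_1': [200.0, 210.0, 220.0, 230.0],
    'VER_1_2': [201.0, 211.0, 221.0, np.nan],
    'VER_2_1': [180.0, 190.0, np.nan, np.nan],
})

# dropna() removes every row (time sample) containing a NaN in any column,
# which clips all laps to the length of the shortest one.
clipped = laps.dropna()
print(clipped.shape)
```

The trade-off of this approach is that the final part of every longer lap is discarded, so the model only ever sees the portion of each lap that fits within the shortest lap of the season.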

# Sort Data
Here we split the data into training and testing data.  We then process the labels with a vectorizer and prepare the datasets as tensors.


In [57]:
# Get numpy array where each row is a lap, and obtain labels. 
np_laps = refined_laps.to_numpy().T # Laps were already clipped to the shortest lap in the calendar.
print(np_laps.shape)

labels = np.asarray([track_index_key[int(c.split('_')[1])]['name'] for c in refined_laps.columns])

training_percent = 0.8

training_n = int(training_percent*len(refined_laps.columns))
training_indices = np.random.choice(range(len(refined_laps.columns)), size=training_n, replace=False)

testing_indices = np.arange(len(refined_laps.columns))[~np.isin(np.arange(len(refined_laps.columns)), training_indices)]

print('Training on {} laps, and testing on {}'.format(len(training_indices), len(testing_indices)))


(1007, 673)
Training on 805 laps, and testing on 202
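An equivalent way to build the split, sketched below under the assumption that a single shuffled index array is acceptable, is to permute the lap indices once and slice the result.  The helper name `split_indices` and its `seed` argument are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

def split_indices(n_samples, training_fraction=0.8, seed=None):
    """Return disjoint (training, testing) index arrays covering all samples."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_train = int(training_fraction * n_samples)
    # The first n_train shuffled indices train the model, the rest test it.
    return shuffled[:n_train], shuffled[n_train:]

train_idx, test_idx = split_indices(1007, training_fraction=0.8, seed=0)
print(len(train_idx), len(test_idx))

# No lap index may appear in both sets.
assert len(np.intersect1d(train_idx, test_idx)) == 0
```

Slicing a single permutation makes the two sets disjoint by construction, and passing a fixed seed makes the split reproducible between runs.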


In [58]:
# Prepare preprocessing
scaler = keras.layers.Rescaling(scale=1.0/np.nanmax(np_laps))
vectorizer = keras.layers.TextVectorization(output_mode="int")
vectorizer.adapt(np.unique(labels))
num_classes = len(np.unique(labels))

# Prepare Training Data
# The vectorizer reserves ids 0 and 1 for '' (padding) and '[UNK]', so
# subtracting 2 shifts the track labels such that they start at 0.
training_labels = vectorizer(labels[training_indices]) - 2
training_data = scaler(np_laps[training_indices])

# Preparing Testing Data
testing_labels = vectorizer(labels[testing_indices]) - 2 # Same shift as for the training labels.
testing_data = scaler(np_laps[testing_indices])

# For later human comprehension it is helpful to have a list of
# formatted GP names for each vectorized label number.  
formatted_name_dict = dict([(item['name'].replace('_','') , item['EventName'].replace('_','')) for key, item in track_index_key.items()])
label_dict = dict([(vectorizer(l).numpy()[0] - 2, formatted_name_dict[l]) for l in vectorizer.get_vocabulary()[2:]]) # [2:] skips the reserved '' and '[UNK]' vocabulary entries (ids 0 and 1). - 2 shifts the remaining ids such that labels start at 0.

# These events are NOT necessarily in the same order as they are in the
# season or as they are loaded in and defined in track_index_key.
print(label_dict)

{0: 'United States Grand Prix', 1: 'São Paulo Grand Prix', 2: 'Spanish Grand Prix', 3: 'Singapore Grand Prix', 4: 'Saudi Arabian Grand Prix', 5: 'Monaco Grand Prix', 6: 'Miami Grand Prix', 7: 'Mexico City Grand Prix', 8: 'Japanese Grand Prix', 9: 'Italian Grand Prix', 10: 'Hungarian Grand Prix', 11: 'French Grand Prix', 12: 'Emilia Romagna Grand Prix', 13: 'Dutch Grand Prix', 14: 'Canadian Grand Prix', 15: 'British Grand Prix', 16: 'Belgian Grand Prix', 17: 'Bahrain Grand Prix', 18: 'Azerbaijan Grand Prix', 19: 'Austrian Grand Prix', 20: 'Australian Grand Prix', 21: 'Abu Dhabi Grand Prix'}
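For comparison, integer labels starting at 0 can also be produced without a text vectorizer.  The sketch below is an alternative and not what this notebook uses; it relies on `np.unique` with `return_inverse=True`, and the toy track names are placeholders.  Note that the resulting ids follow alphabetical order, so they would not match the `label_dict` printed above.

```python
import numpy as np

# Toy labels standing in for the track names attached to each lap.
toy_labels = np.array(['monaco', 'bahrain', 'monaco', 'spa', 'bahrain'])

# class_names is the sorted array of unique names; class_ids holds, for each
# lap, the position of its name in class_names (so ids start at 0).
class_names, class_ids = np.unique(toy_labels, return_inverse=True)

print(class_names)
print(class_ids)

# Mapping ids back to names recovers the original labels.
assert np.array_equal(class_names[class_ids], toy_labels)
```

With this encoding no offset is needed, since there are no reserved padding or unknown-token ids.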


## Preparing Model
Now we start preparing the model.  Here I use a relatively simple sequential model built from dense layers. 

In [59]:
convolutional = False # Still working on CNN, only NN tested right now.  

if not convolutional:
    model = keras.models.Sequential([
        keras.layers.Dense(units=50, activation='tanh', input_shape=(training_data.shape[1],)),
        keras.layers.Dense(units=32, activation='tanh'),
        keras.layers.Dense(units=16, activation='tanh'),
        keras.layers.Dense(units=len(label_dict), activation='softmax')
        ])
else:
    model = keras.models.Sequential([
        keras.layers.Conv1D(filters=32, kernel_size=20, activation='relu', padding='same', input_shape=(training_data.shape[1],1,)),
        keras.layers.Dense(units=32, activation='tanh'),
        keras.layers.Dense(units=40, activation='tanh'),
        keras.layers.Dense(units=len(label_dict), activation='softmax')
        ])

# optimizer = optimizers.Adam(clipvalue=0.5)
optimizer = keras.optimizers.Adam(learning_rate=1e-5)#RMSprop
model.summary()

# Using accuracy as the metric below seems important. I originally was using categorical accuracy and it was causing
# problems, most likely because that metric expects one-hot labels whereas the labels here are integer class ids.
# Accuracy works well though. 
model.compile(optimizer=optimizer,loss="sparse_categorical_crossentropy",metrics=['accuracy']) 

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_12 (Dense)            (None, 50)                33700     
                                                                 
 dense_13 (Dense)            (None, 32)                1632      
                                                                 
 dense_14 (Dense)            (None, 16)                528       
                                                                 
 dense_15 (Dense)            (None, 22)                374       
                                                                 
Total params: 36,234
Trainable params: 36,234
Non-trainable params: 0
_________________________________________________________________
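The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit.  The short sketch below redoes that arithmetic using the layer widths from the summary and the 673 samples per lap reported earlier; the helper name `dense_param_count` is introduced only for this example.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input length followed by the width of each dense layer.
layer_sizes = [673, 50, 32, 16, 22]

per_layer = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
print(per_layer)       # [33700, 1632, 528, 374]
print(sum(per_layer))  # 36234
```

Almost all of the parameters sit in the first layer, because every one of its 50 units is connected to all 673 samples of the speed trace.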


## Training Model

In [60]:
# Perform checks on data
assert not np.any(np.isnan(training_data))
assert not np.any(np.isnan(training_labels))
assert not np.any(np.isnan(testing_data))
assert not np.any(np.isnan(testing_labels))
assert type(training_data) == type(training_labels)

print('Training on {} laps, and testing on {}, attempting to distinguish between {} tracks.'.format(training_data.shape[0], testing_data.shape[0], len(np.unique(training_labels))))
model.fit(training_data, training_labels, batch_size=20, epochs=1000, verbose=2)#, validation_split=0.2

Training on 805 laps, and testing on 202, attempting to distinguish between 22 tracks.
Epoch 1/1000
41/41 - 1s - loss: 3.0709 - accuracy: 0.0634 - 723ms/epoch - 18ms/step
Epoch 2/1000
41/41 - 0s - loss: 3.0356 - accuracy: 0.0559 - 265ms/epoch - 6ms/step
Epoch 3/1000
41/41 - 0s - loss: 3.0070 - accuracy: 0.0733 - 173ms/epoch - 4ms/step
Epoch 4/1000
41/41 - 0s - loss: 2.9815 - accuracy: 0.0957 - 168ms/epoch - 4ms/step
Epoch 5/1000
41/41 - 0s - loss: 2.9581 - accuracy: 0.0944 - 171ms/epoch - 4ms/step
Epoch 6/1000
41/41 - 0s - loss: 2.9346 - accuracy: 0.1342 - 168ms/epoch - 4ms/step
Epoch 7/1000
41/41 - 0s - loss: 2.9120 - accuracy: 0.1528 - 158ms/epoch - 4ms/step
Epoch 8/1000
41/41 - 0s - loss: 2.8890 - accuracy: 0.1416 - 176ms/epoch - 4ms/step
Epoch 9/1000
41/41 - 0s - loss: 2.8664 - accuracy: 0.1317 - 181ms/epoch - 4ms/step
Epoch 10/1000
41/41 - 0s - loss: 2.8439 - accuracy: 0.1727 - 208ms/epoch - 5ms/step
Epoch 11/1000
41/41 - 0s - loss: 2.8228 - accuracy: 0.2323 - 176ms/epoch - 4ms/st

<keras.callbacks.History at 0x1b5ab6c2a00>
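A useful sanity check on the first few epochs is to compare them against pure guessing.  The sketch below is a plain NumPy version of sparse categorical cross-entropy, written for illustration rather than taken from keras, and evaluates it for a model that assigns equal probability to all 22 tracks.

```python
import numpy as np

def sparse_categorical_crossentropy(probabilities, labels):
    """Mean negative log-probability assigned to the true class."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Pick, for each sample, the probability predicted for its true class.
    true_class_probs = probabilities[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(true_class_probs)))

num_classes = 22
labels = np.array([0, 5, 21])
# A model with no knowledge: every track is equally likely.
uniform = np.full((len(labels), num_classes), 1.0 / num_classes)

baseline_loss = sparse_categorical_crossentropy(uniform, labels)
baseline_accuracy = 1.0 / num_classes
print(round(baseline_loss, 4), round(baseline_accuracy, 4))
```

The baseline loss is ln(22), roughly 3.09, and the chance accuracy is roughly 0.045.  The first epoch in the log above (loss 3.0709, accuracy 0.0634) sits close to both, which is what one would expect from a freshly initialized network.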

# Validating Model

In [61]:
result = model.evaluate(testing_data,testing_labels,verbose=1)

# Changing the accuracy into a percentage
testing_acc = result[1]*100
# Printing the accuracy
print('Test Accuracy - ', testing_acc,'%')
print(dict(zip(model.metrics_names, result)))

predictions = model(testing_data, training=False)
print('Looping through the testing data and checking results:')
for i, (expect, predict) in enumerate(zip(testing_labels.numpy() , predictions.numpy())):
    print('{:5}/{:<5}: True: {:25} Predicted: {:25} | {:^7} | Choice probability = {:.1f}%'.format(i+1, testing_data.shape[0], label_dict[expect[0]], label_dict[predict.argmax()], ['Wrong', 'Correct'][predict.argmax() == expect[0]],  predict.max()*100))

Test Accuracy -  99.00990128517151 %
{'loss': 0.15635132789611816, 'accuracy': 0.9900990128517151}
Looping through the testing data and checking results:
    1/202  : True: Bahrain Grand Prix        Predicted: Bahrain Grand Prix        | Correct | Choice probability = 90.8%
    2/202  : True: Canadian Grand Prix       Predicted: Canadian Grand Prix       | Correct | Choice probability = 85.9%
    3/202  : True: Dutch Grand Prix          Predicted: Dutch Grand Prix          | Correct | Choice probability = 96.4%
    4/202  : True: Belgian Grand Prix        Predicted: Belgian Grand Prix        | Correct | Choice probability = 92.5%
    5/202  : True: Spanish Grand Prix        Predicted: Spanish Grand Prix        | Correct | Choice probability = 91.0%
    6/202  : True: Canadian Grand Prix       Predicted: Canadian Grand Prix       | Correct | Choice probability = 86.4%
    7/202  : True: British Grand Prix        Predicted: British Grand Prix        | Correct | Choice probability = 90.6%

# Conclusions

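Using machine learning with *keras* and *tensorflow* we have been able to create a model capable of distinguishing tracks based solely upon the speed trace of a Formula 1 driver!  In the run shown above, the model identified the correct circuit for about 99% of the 202 testing laps across the 22 tracks of the 2022 season, using only Max Verstappen's laps.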

# See Also

* I am trying to apply this same framework to distinguish between drivers' specific driving patterns.  The differences between drivers are far more subtle than the differences between tracks, so this has been a challenge.  Please check that out if you wish to see more: [f1_djsouthall/ml/guess_driver_from_telem.ipynb](https://github.com/djsouthall/f1_djsouthall/blob/main/ml/guess_driver_from_telem.ipynb)

* I have also been playing with the fastf1 tool to scrape other forms of F1 race data and store it in SQL databases.  To see that work please look at [f1_djsouthall/sql/make_tables.py](https://github.com/djsouthall/f1_djsouthall/tree/main/sql/make_tables.py) (for SQL table creation) and [f1_djsouthall/sql/example_sql_analysis.py](https://github.com/djsouthall/f1_djsouthall/tree/main/sql/example_sql_analysis.py) (for example analysis performed using that database).