<h3>Introduction</h3>

<p>Formula 1 is an international motor racing competition that pits drivers and their teams against each other over a series of grands prix. There are usually around 20 grands prix per F1 season, each taking place over the course of a weekend. Typically, each weekend starts with three free practice sessions, during which teams can run their cars on track at will to collect data and tune the car for the course. A qualifying session follows, and its result determines the starting order of the race. In short, during qualifying drivers attempt to set the fastest lap time possible, and drivers with faster lap times start higher up on the grid. The race is then run the following day.</p>

<p>With this project I wanted to try predicting a driver's ultimate position in qualifying based on how they do during the practice sessions. I used a popular Python package called fastf1 to get the data and attempted to build a useful predictor. I suspected that teams often hide their true performance during practice so as not to give away their true abilities before qualifying. I also thought that this hidden performance would be identifiable in the data that fastf1 provides.</p>

<h3>fastf1 is not installed by default in most conda environments so please make sure it is available in this notebook before proceeding.</h3>

<p>Here are some starting imports for the project:</p>

In [None]:
import os.path
import pathlib
import typing

from typing import List

import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats

# fastf1 and pathlib are used by the data helpers below
import fastf1
import pandas as pd
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

plt.rcParams['figure.figsize'] = [12, 6]

<p>First I wrote a couple of helpers to retrieve and restructure the data into a format more suitable for building a predictive model. I make use of fastf1's built-in caching and catch the errors the package raises when there isn't any data for a given session. Last, I wrote a function to select the time set by a driver in qualifying that was used to determine their starting position.</p>

In [None]:
def get_timing_data_for_session(
        session_year: int,
        session_round: int,
        session_identifier: typing.Literal['FP1', 'FP2', 'FP3', 'Q', 'SQ', 'R']) -> pd.DataFrame:
    """
    Gets the Free Practice, Qualifying, and Race timing data for a given session

    Args:
        session_year (int): Year for the session
        session_round (int): Round for the session, starting at 1
        session_identifier (str): One of FP1, FP2, FP3, Q, SQ or R, representing the specific session to request

    Returns:
        DataFrame: Pandas DataFrame containing the timing data per driver per lap

    """
    cache_path = pathlib.Path('../fastf1_cache.nosync')
    cache_path.mkdir(parents=True, exist_ok=True)

    fastf1.Cache.enable_cache(str(cache_path))

    session = fastf1.get_session(session_year, session_round, session_identifier)
    session.load(telemetry=False)
    accurate_laps = session.laps.pick_accurate()
    df = pd.DataFrame(accurate_laps)
    return df


def get_event_data_for_session(session_year: int, session_round: int):
    """
    Retrieves the event data for a given session

    Args:
        session_year (int): The year in which the session takes place
        session_round (int): The round of the session

    Returns:
        Event: fastf1 Event object

    """
    cache_path = pathlib.Path('../fastf1_cache.nosync')
    cache_path.mkdir(parents=True, exist_ok=True)

    fastf1.Cache.enable_cache(str(cache_path))

    return fastf1.get_event(session_year, session_round)


def get_event_schedule_for_year(year: int):
    """
    Gets the event schedule for an entire year, excluding testing

    Args:
        year (int): The four-digit year

    Returns:
        EventSchedule: fastf1 EventSchedule object

    """
    cache_path = pathlib.Path('../fastf1_cache.nosync')
    cache_path.mkdir(parents=True, exist_ok=True)

    fastf1.Cache.enable_cache(str(cache_path))

    return fastf1.get_event_schedule(year, include_testing=False)


def get_results_for_session(
        session_year: int,
        session_round: int,
        session_identifier: typing.Literal['FP1', 'FP2', 'FP3', 'Q', 'SQ', 'R']) -> pd.DataFrame:
    """
    Gets the Free Practice, Qualifying, and Race results for a given session

    Args:
        session_year (int): Year for the session
        session_round (int): Round for the session, starting at 1
        session_identifier (str): One of FP1, FP2, FP3, Q, SQ or R, representing the specific session to request

    Returns:
        DataFrame: Pandas DataFrame containing the session results per driver

    """
    cache_path = pathlib.Path('../fastf1_cache.nosync')
    cache_path.mkdir(parents=True, exist_ok=True)

    fastf1.Cache.enable_cache(str(cache_path))

    session = fastf1.get_session(session_year, session_round, session_identifier)
    session.load(telemetry=False)
    df = pd.DataFrame(session.results)
    return df


def get_timing_data_for_race_weekend(session_year: int, session_round: int) -> pd.DataFrame:
    """
    Retrieve the timing data for a specified race weekend, including free practice and qualifying

    Args:
        session_year (int): Year for which to get data
        session_round (int): Round number for which to get data

    Returns:
        DataFrame: DataFrame containing timing data
    """
    event = get_event_data_for_session(session_year, session_round)
    is_sprint_race_weekend = event.get_session_name(3) != 'Practice 3'

    retrieved_session_data = []

    try:
        fp1_session_data_df = get_timing_data_for_session(
            session_year=session_year,
            session_round=session_round,
            session_identifier='FP1')
        retrieved_session_data.append(fp1_session_data_df)
    except (fastf1.core.DataNotLoadedError, fastf1.core.NoLapDataError):
        print(f'No data for event: Year {session_year} round {session_round} FP1')

    if not is_sprint_race_weekend:
        # There is a second free practice session on sprint race weekends, however this occurs after the
        # traditional qualifying process used for the sprint race and will not be considered
        try:
            fp2_session_data_df = get_timing_data_for_session(
                session_year=session_year,
                session_round=session_round,
                session_identifier='FP2')
            retrieved_session_data.append(fp2_session_data_df)
        except (fastf1.core.DataNotLoadedError, fastf1.core.NoLapDataError):
            print(f'No data for event: Year {session_year} round {session_round} FP2')

        try:
            fp3_session_data_df = get_timing_data_for_session(
                session_year=session_year,
                session_round=session_round,
                session_identifier='FP3')
            retrieved_session_data.append(fp3_session_data_df)
        except (fastf1.core.DataNotLoadedError, fastf1.core.NoLapDataError):
            print(f'No data for event: Year {session_year} round {session_round} FP3')

    if len(retrieved_session_data) > 0:
        full_testing_data_df = pd.concat(retrieved_session_data, axis=0)
    else:
        return pd.DataFrame()

    qualifying_results_df = get_results_for_session(
        session_year=session_year,
        session_round=session_round,
        session_identifier='Q'
    )

    qualifying_results_df = qualifying_results_df[['DriverNumber', 'Q1', 'Q2', 'Q3']]

    def select_qualifying_time(row: pd.Series) -> pd.Series:
        """
        Selects the lap time that determines a driver's position on the starting grid. Picks the fastest lap time in
            the latest qualifying session a driver participated in, even if that particular lap time was not the
            fastest over all qualifying sessions. It is unlikely, however, that the lap time selected by this function
             will not be the fastest overall lap set by a driver across all qualifying sessions.
        Args:
            row (Series): Pandas Series object representing an individual timing result obtained from fastf1

        Returns:
            Series: Series containing the lap time used to determine driver position on the starting grid.

        """
        if not pd.isna(row['Q3']):
            return pd.Series({'QualifyingTime': row['Q3']})
        elif not pd.isna(row['Q2']):
            return pd.Series({'QualifyingTime': row['Q2']})
        else:
            return pd.Series({'QualifyingTime': row['Q1']})

    qualifying_times = qualifying_results_df \
        .apply(select_qualifying_time, axis=1)
    qualifying_times.index.name = 'DriverNumber'

    qualifying_times = qualifying_times.reset_index().astype({'DriverNumber': int})
    full_testing_data_df = full_testing_data_df.reset_index().astype({'DriverNumber': int})

    qualifying_times['QualifyingTimeSeconds'] = qualifying_times['QualifyingTime'] \
        .apply(lambda td: td.total_seconds())
    qualifying_times['QualifyingPosition'] = qualifying_times['QualifyingTime'].rank(method='dense', ascending=True)

    return full_testing_data_df.merge(qualifying_times, on='DriverNumber', how='left')


def get_all_fp_timing_data_for_year(year: int) -> pd.DataFrame:
    """
    Gets all the FP and qualifying timing data for the given year and returns the data as an aggregated df

    Args:
        year (int): Year for which to retrieve data

    Returns:
        DataFrame: Pandas DataFrame containing one row per driver per lap for all fp laps they participated in
    """

    # Call the helper defined above; there is no src package in this notebook
    event_schedule = get_event_schedule_for_year(year)
    agg_df = pd.DataFrame()

    for round_num in event_schedule['RoundNumber'].tolist():
        print(f'Processing round {round_num}')
        new_data = get_timing_data_for_race_weekend(year, round_num)
        new_data['round'] = round_num
        if len(new_data) > 0:
            agg_df = pd.concat([agg_df, new_data], axis=0)

    return agg_df


def get_timing_data(years: List[int]) -> pd.DataFrame:
    """
    Retrieves the necessary data for modeling

    Args:
        years (List[int]): List of years of data to retrieve

    Returns:
        DataFrame: DataFrame containing the formatted data

    """
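    yearly_dfs = []
    for year in years:
        year_df = get_all_fp_timing_data_for_year(year)
        # Tag each year's rows before concatenating so earlier years are not overwritten
        year_df['year'] = year
        yearly_dfs.append(year_df)
    if not yearly_dfs:
        return pd.DataFrame()
    return pd.concat(yearly_dfs, axis=0)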
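
<p>As a rough illustration of the rule used in select_qualifying_time, the sketch below applies the same "latest session wins" selection to a small made-up results table. The driver numbers and times are invented, the times are plain seconds rather than the timedeltas fastf1 returns, and the vectorised back-fill is an alternative formulation, not the code used above.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical qualifying results in seconds; NaN means the driver was eliminated earlier
results = pd.DataFrame(
    {
        'Q1': [80.1, 80.5, 81.2],
        'Q2': [79.6, 80.0, np.nan],
        'Q3': [79.9, np.nan, np.nan],
    },
    index=pd.Index(['44', '16', '20'], name='DriverNumber'),
)

# Order the columns from latest to earliest session, then back-fill along each row so
# the first column holds the time from the latest session the driver took part in
latest_time = results[['Q3', 'Q2', 'Q1']].bfill(axis=1).iloc[:, 0]

qualifying = latest_time.rename('QualifyingTimeSeconds').to_frame()
qualifying['QualifyingPosition'] = (
    qualifying['QualifyingTimeSeconds'].rank(method='dense').astype(int)
)
print(qualifying)
```

<p>Note that driver 44's selected time comes from Q3 even though their Q2 lap was quicker, which is the case the docstring of select_qualifying_time describes.</p>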

<p>Tracks change week to week in Formula 1, and each track produces substantially different values for the features we're trying to use. These track-to-track differences are not relevant to the predictor, and they mean that raw data collected at one track can't be used to build a general predictor that works at all tracks. To get around this I ended up writing a custom scaler that only scales features within a given race weekend. After scaling, a value for a feature at one track can be meaningfully compared to a value for the same feature at another track.</p>

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler


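class RaceWeekendScaler(BaseEstimator, TransformerMixin):
    """Min-max scales numerical features separately within each race weekend (year_round group)."""

    def __init__(self):
        self.scaler = MinMaxScaler()

    def _scale_round(self, round_df):
        # Refit the scaler on every race weekend; keep the row index so the original order can be restored
        features = round_df.drop(columns='year_round', errors='ignore')
        return pd.DataFrame(self.scaler.fit_transform(features), index=features.index)

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        scaled_features_df = X.groupby('year_round', group_keys=False).apply(self._scale_round)
        # groupby-apply returns rows grouped by weekend; realign them with X (assumes a unique index)
        # so they match the other ColumnTransformer outputs row for row
        return scaled_features_df.loc[X.index].values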
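
<p>To make the per-weekend scaling concrete, here is a small sketch on made-up lap times from two tracks. It uses a plain pandas groupby/transform with a hand-written min_max helper (a hypothetical name) rather than the RaceWeekendScaler class, so treat it as an illustration of the idea only.</p>

```python
import pandas as pd

# Made-up fastest laps at two tracks with very different lap lengths
laps = pd.DataFrame({
    'year_round': ['2021_1'] * 3 + ['2021_2'] * 3,
    'fastest_lap_seconds': [90.0, 91.0, 92.0, 70.0, 70.5, 71.0],
})


def min_max(column: pd.Series) -> pd.Series:
    """Scale a column to [0, 1] using its own minimum and maximum."""
    return (column - column.min()) / (column.max() - column.min())


# transform applies min_max within each weekend and returns rows in their original order
laps['scaled_lap'] = laps.groupby('year_round')['fastest_lap_seconds'].transform(min_max)
print(laps)
```

<p>After scaling, the fastest driver at each track sits at 0 and the slowest at 1, regardless of how long the lap is.</p>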

<p>I'm trying to predict a driver's qualifying position based on free practice performance, which is ultimately a ranking problem. The most commonly used error metrics aren't the most appropriate for this case because we're predicting into a discrete set of only 20 possible values at most. Instead, I opted to use Spearman's Rho coefficient as the scoring function. Spearman's Rho can be thought of as a correlation measure for ranked values. It ranges over [-1, 1] and is interpreted much like the ordinary correlation coefficient: +1 means the predicted and true orders agree exactly, and -1 means they are exactly reversed. For this project we're aiming for a value as close to +1 as possible. Unfortunately, most predictors don't support custom loss functions, so they are all trained on the default ones. However, when actually performing and reviewing testing performance I make sure to report the Spearman value.</p>
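
<p>For reference, when there are no tied ranks Spearman's Rho has the closed form rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the difference between the true and predicted rank of driver i and n is the number of drivers. The sketch below evaluates that formula on a made-up five-driver example and cross-checks it against the Pearson correlation of the two rank vectors, which is an equivalent definition.</p>

```python
import numpy as np

# Made-up true and predicted ranks for five drivers (no ties)
true_rank = np.array([1, 2, 3, 4, 5])
predicted_rank = np.array([2, 1, 3, 5, 4])

n = len(true_rank)
squared_differences = (true_rank - predicted_rank) ** 2

# Closed-form Spearman's Rho, valid when there are no tied ranks
rho_formula = 1 - 6 * squared_differences.sum() / (n * (n ** 2 - 1))

# Cross-check: Spearman's Rho is the Pearson correlation of the two rank vectors
rho_pearson_on_ranks = np.corrcoef(true_rank, predicted_rank)[0, 1]

print(rho_formula, rho_pearson_on_ranks)
```

<p>Two swapped pairs out of five drivers already pull the score down to roughly 0.8, which gives a feel for how the metric penalizes misplaced drivers.</p>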

<p>I also wrote a function called run_cross_validation that will perform many rounds of training and testing using a specified number of cross-validation folds. I use this custom function instead of one provided by sklearn because while the predictor can be trained on the entire dataset, the predictions themselves have to be performed for the entire set of drivers in a given race weekend at a time in order to actually predict their ranks. This function adds that layer of logic.</p>
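
<p>The splitting that run_cross_validation relies on can be sketched in isolation. The example below uses sklearn's GroupShuffleSplit on made-up group labels to show that every race weekend lands entirely in either the training or the validation side of a split; the labels, the dummy feature matrix, and the fixed random_state are assumptions for the demo.</p>

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Four made-up race weekends with three drivers each
groups = np.array(['2021_1'] * 3 + ['2021_2'] * 3 + ['2021_3'] * 3 + ['2021_4'] * 3)
X = np.arange(len(groups)).reshape(-1, 1)  # dummy feature matrix

splitter = GroupShuffleSplit(n_splits=3, train_size=0.75, random_state=0)
splits = list(splitter.split(X, groups=groups))

for train_idx, validation_idx in splits:
    train_weekends = set(groups[train_idx])
    validation_weekends = set(groups[validation_idx])
    # A weekend never appears on both sides of a split
    print(sorted(train_weekends), sorted(validation_weekends))
```

<p>Keeping each weekend whole is what makes it possible to rank all the drivers of a held-out weekend against each other.</p>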

<p>Last, I defined a helper function that constructs a DataFrame with the features we'll be using. We'll be trying to use all the datapoints fastf1 offers in its timing data, including various speed trap speeds, sector times, and driver and team values. I've opted to use the fastest recorded values for each across all the practice sessions as features. I also implemented my own caching layer here that uses the same save location as fastf1.</p>

In [None]:
import os.path

from typing import List
from sklearn.svm import SVR

import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

pd.options.mode.chained_assignment = None


def spearman_rho(y) -> float:
    """
    Calculates Spearman's rank correlation coefficient between the true and predicted ranks

    Args:
        y: DataFrame containing at least a true_qualifying_rank and a predicted_qualifying_rank column

    Returns:
        float: The Spearman's rank coefficient for the provided ranks
    """
    n_observations = len(y)
    rank_differences_sq = (y['true_qualifying_rank'] - y['predicted_qualifying_rank']) ** 2

    s_r = (1. - (6. * np.sum(rank_differences_sq)) / (n_observations * (n_observations ** 2. - 1.)))
    return s_r


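def run_cross_validation(df, pipeline, n_splits=5, train_size=.75):
    """
    Repeatedly fits the pipeline on a random subset of race weekends and scores it on the held-out weekends

    Args:
        df (DataFrame): Features plus the year_round and true_qualifying_rank columns
        pipeline (Pipeline): sklearn pipeline ending in a regressor
        n_splits (int): Number of random train/validation splits to run
        train_size (float): Fraction of race weekends used for training in each split

    Returns:
        list: Mean per-weekend Spearman's Rho for each split
    """
    group_k_fold = GroupShuffleSplit(n_splits=n_splits, train_size=train_size)
    scores = list()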

    def predict_by_year_round_group(group, p):
        group['predicted_qualifying_quantile'] = p.predict(group)
        group['predicted_qualifying_rank'] = group['predicted_qualifying_quantile'].rank(method='first', ascending=True)
        return group

    for training_index, validation_index in group_k_fold.split(df, groups=df['year_round']):
        k_fold_training_set = df.iloc[training_index]
        k_fold_validation_set = df.iloc[validation_index]

        pipeline.fit(k_fold_training_set, k_fold_training_set['true_qualifying_rank'])

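        # group_keys=False keeps year_round as a column only, so the groupby below is unambiguous
        k_fold_validation_set = k_fold_validation_set \
            .groupby(by=['year_round'], group_keys=False) \
            .apply(predict_by_year_round_group, p=pipeline)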

        s_r_score = np.average(k_fold_validation_set.groupby('year_round').apply(spearman_rho))
        scores.append(s_r_score)
    return scores


def get_timing_features(years_to_get: List[int], rebuild_cache=False) -> pd.DataFrame:
    """
    Retrieves all the features derived from free practice timing data for the given years. Features are the
        various speed trap speeds, sector times, and lap times.

    Args:
        years_to_get (list): List of years for which to retrieve data
        rebuild_cache (bool): If true will delete and recreate the cache

    Returns:
        DataFrame: DataFrame containing the timing features
    """

    timing_features_cache_path = '../fastf1_cache.nosync/timing_features_df.csv'
    cached_file_exists = os.path.isfile(timing_features_cache_path)

    if cached_file_exists and not rebuild_cache:
        timing_features_df = pd.read_csv(timing_features_cache_path)
        return timing_features_df

    timing_df = get_timing_data(years_to_get)
    timing_df['year_round'] = timing_df['year'].astype(str) + '_' + timing_df['round'].astype(str)
    timing_df = timing_df.reset_index(drop=True)
    timing_df['Sector1TimeSeconds'] = timing_df['Sector1Time'].apply(lambda td: td.total_seconds())
    timing_df['Sector2TimeSeconds'] = timing_df['Sector2Time'].apply(lambda td: td.total_seconds())
    timing_df['Sector3TimeSeconds'] = timing_df['Sector3Time'].apply(lambda td: td.total_seconds())
    timing_df['LapTimeSeconds'] = timing_df['LapTime'].apply(lambda td: td.total_seconds())
    timing_features_df = timing_df.groupby(by=['Driver', 'year', 'round']).agg({
        'SpeedI1': 'max',
        'SpeedI2': 'max',
        'SpeedFL': 'max',
        'SpeedST': 'max',
        'Sector1TimeSeconds': 'min',
        'Sector2TimeSeconds': 'min',
        'Sector3TimeSeconds': 'min',
        'LapTimeSeconds': 'min',
        'Driver': 'first',
        'Team': 'first',
        'QualifyingTimeSeconds': 'first',
        'year': 'first',
        'round': 'first',
        'year_round': 'first'
    }).rename(columns={
        'SpeedI1': 'speed_trap_s1_max',
        'SpeedI2': 'speed_trap_s2_max',
        'SpeedFL': 'speed_trap_fl_max',
        'SpeedST': 'speed_trap_st_max',
        'Sector1TimeSeconds': 'fastest_s1_seconds',
        'Sector2TimeSeconds': 'fastest_s2_seconds',
        'Sector3TimeSeconds': 'fastest_s3_seconds',
        'LapTimeSeconds': 'fastest_lap_seconds',
        'Driver': 'driver',
        'Team': 'team',
        'QualifyingTimeSeconds': 'qualifying_time_seconds',
    }).dropna().reset_index(drop=True)

    timing_features_df.to_csv(timing_features_cache_path, index=False)

    return timing_features_df
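
<p>The "fastest value per driver per weekend" aggregation can be seen more easily on a toy table. The sketch below uses pandas named aggregation on invented laps; the column names mirror the ones above, but the data and the reduced feature set are assumptions for illustration.</p>

```python
import pandas as pd

# Invented practice laps for two drivers in one race weekend
laps = pd.DataFrame({
    'Driver': ['HAM', 'HAM', 'HAM', 'VER', 'VER'],
    'year_round': ['2021_1'] * 5,
    'SpeedST': [310.0, 315.0, 312.0, 318.0, 316.0],
    'LapTimeSeconds': [92.4, 91.8, 92.0, 91.5, 91.9],
})

# One row per driver per weekend: highest speed trap reading and quickest lap
features = laps.groupby(['Driver', 'year_round'], as_index=False).agg(
    speed_trap_st_max=('SpeedST', 'max'),
    fastest_lap_seconds=('LapTimeSeconds', 'min'),
)
print(features)
```

<p>The best speed and the best lap time need not come from the same lap, so each feature is reduced independently.</p>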

<p>The code below gets data from 2020 and 2021 ready for processing. I define the feature names and perform some basic filtering to make sure there are no substitute drivers in the data.</p>

In [None]:
features_df = get_timing_features(years_to_get=[2020, 2021], rebuild_cache=False)

driver_appearance_counts_series = features_df['driver'].value_counts()
drivers_to_keep = driver_appearance_counts_series.loc[driver_appearance_counts_series > 5]

# Remove all substitute drivers, defined as drivers who appear in 5 or fewer race weekends
features_df = features_df.loc[features_df['driver'].isin(drivers_to_keep.index)]

features_df['true_qualifying_rank'] = \
    features_df[['year_round', 'qualifying_time_seconds']].groupby(by='year_round').rank('dense', ascending=True)

n_splits = 500
train_size = 0.75

speed_trap_features = [
    'speed_trap_s1_max',
    'speed_trap_s2_max',
    'speed_trap_fl_max',
    'speed_trap_st_max'
]

fastest_sector_features = [
    'fastest_s1_seconds',
    'fastest_s2_seconds',
    'fastest_s3_seconds',
    'fastest_lap_seconds'
]

categorical_feature_columns = [
    'driver',
    'team'
]

numerical_feature_columns = speed_trap_features + fastest_sector_features

race_weekend_scaler = RaceWeekendScaler()
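
<p>The true_qualifying_rank column above comes from a dense rank within each race weekend. A minimal sketch on made-up qualifying times shows what that produces, including how two identical times share a rank; the values are invented for the demo.</p>

```python
import pandas as pd

# Made-up qualifying times for two race weekends; the second has a tie
times = pd.DataFrame({
    'year_round': ['2021_1', '2021_1', '2021_1', '2021_2', '2021_2', '2021_2'],
    'qualifying_time_seconds': [80.0, 79.5, 81.0, 70.2, 70.2, 69.9],
})

# Rank within each weekend; 'dense' gives tied times the same rank with no gaps after them
times['true_qualifying_rank'] = (
    times.groupby('year_round')['qualifying_time_seconds'].rank(method='dense', ascending=True)
)
print(times)
```

<p>Ties like the one in the second weekend are rare in real qualifying data, but when they occur the closed-form Spearman formula is only approximate.</p>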

<p>A quick call to seaborn shows that the speed trap features don't have much relation to a driver's qualifying performance. It's a little challenging to read because all the values are unscaled, meaning that different tracks will have different ranges for the features, but one can see that within each cluster there isn't much correlation. We see mostly just vertical or horizontal lines.</p>

In [None]:
sns.pairplot(features_df[speed_trap_features + ['qualifying_time_seconds']])
plt.show()

<p>The features related to sector time are more promising. We see clear correlations between sector times and qualifying performance, though many of the features are clearly not independent.</p>

In [None]:
sns.pairplot(features_df[fastest_sector_features + ['qualifying_time_seconds']])
plt.show()

<p>I define a function called create_analysis_pipeline_base, which essentially creates an sklearn pipeline stub onto which we can stick any predictor. I also wrote a function to plot our scoring distributions.</p>

In [None]:
def create_analysis_pipeline_base(numerical_feature_columns: List, categorical_feature_columns: List):
    numerical_feature_preprocessing_pipeline = Pipeline(steps=[
        ('race_weekend_scaler', RaceWeekendScaler()),
    ])

    categorical_feature_preprocessing_pipeline = Pipeline(steps=[
        ('one_hot_encoder', OneHotEncoder())
    ])

    feature_preprocessor = ColumnTransformer(transformers=[
        ('num_features', numerical_feature_preprocessing_pipeline, numerical_feature_columns),
        ('cat_features', categorical_feature_preprocessing_pipeline, categorical_feature_columns)
    ])

    prediction_pipeline = Pipeline(steps=[
        ('feature_preprocessing', feature_preprocessor)
    ])

    return prediction_pipeline


def plot_error_dist(errors: pd.Series, plot_z_score: bool = False, error_name: str = 'Error',
                    title: str = 'Error Distribution'):
    """
    Creates a histogram and QQ plot for the provided error distribution

    Args:
        errors (Series): Pandas Series object containing error data
        error_name (str): The name of the error being plotting, which will be used for axis labeling
        plot_z_score (bool): If true, will create the histogram using z-scores instead of raw scores
        title (str): Optional parameter to override the title of the graph

    """

    fig, (ax1, ax2) = plt.subplots(2)

    fig: plt.Figure
    ax1: plt.Axes
    ax2: plt.Axes

    n_bins = 50

    std_dev = np.std(errors)
    avg = np.average(errors)
    z_scores = (errors - avg) / std_dev

    if plot_z_score:
        bin_width = (z_scores.max() - z_scores.min()) / n_bins
        gaussian_x = np.linspace(min(z_scores), max(z_scores), 100)
        gaussian_y = scipy.stats.norm.pdf(gaussian_x, 0, 1)
        gaussian_y *= (len(errors) * bin_width)
        ax1.hist(x=z_scores, edgecolor='k', linewidth=1, bins=n_bins)
        ax1.set_xlabel('Z-Score')
    else:
        bin_width = (errors.max() - errors.min()) / n_bins
        gaussian_x = np.linspace(min(errors), max(errors), 100)
        gaussian_y = scipy.stats.norm.pdf(gaussian_x, avg, std_dev)
        gaussian_y *= (len(errors) * bin_width)
        ax1.hist(x=errors, edgecolor='k', linewidth=1, bins=n_bins)
        ax1.axvline(x=avg, label=f'Mean Value ({avg:.3f})', color='k', linestyle='--')
        ax1.set_xlabel(f'{error_name}')

    ax1.plot(gaussian_x, gaussian_y, color='r', linestyle='--', label='Scaled Normal Curve')
    ax1.set_title(title)
    ax1.set_ylabel('Count')
    ax1.legend()

    n = len(errors)
    single_lap_pct_diff_normal_quantiles = scipy.stats.norm.ppf(
        (np.arange(1, n + 1)) / (n + 1),
        0,
        1)
    ax2.scatter(x=single_lap_pct_diff_normal_quantiles, y=z_scores.sort_values())
    ax2.plot(single_lap_pct_diff_normal_quantiles, single_lap_pct_diff_normal_quantiles, linestyle='--', color='k')
    ax2.set_title('QQ Plot')
    ax2.set_xlabel('Normal Theoretical Quantiles')
    ax2.set_ylabel('Observed Quantiles')

    plt.tight_layout()
    plt.show()
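
<p>The "pipeline stub plus predictor" pattern can be tried on a tiny made-up dataset. The sketch below builds a preprocessing-only Pipeline, appends a regressor to its steps, and predicts for a driver that never appeared in training. The data, the step names, and the handle_unknown='ignore' setting are assumptions for the demo; the function above uses OneHotEncoder with its defaults.</p>

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data: an already-scaled lap time, a driver label, and the true rank
train = pd.DataFrame({
    'driver': ['HAM', 'VER', 'LEC', 'HAM', 'VER', 'LEC'],
    'scaled_lap': [0.0, 0.5, 1.0, 0.1, 0.0, 1.0],
    'rank': [1, 2, 3, 2, 1, 3],
})

preprocessor = ColumnTransformer(transformers=[
    ('num', 'passthrough', ['scaled_lap']),
    # Assumption: ignore unseen categories so new drivers encode as all zeros
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['driver']),
])

# Start with a preprocessing-only stub, then attach whichever predictor is being tested
pipeline = Pipeline(steps=[('preprocess', preprocessor)])
pipeline.steps.append(('regressor', LinearRegression()))
pipeline.fit(train[['driver', 'scaled_lap']], train['rank'])

unseen = pd.DataFrame({'driver': ['RUS'], 'scaled_lap': [0.3]})
prediction = pipeline.predict(unseen)
print(prediction)
```

<p>Swapping predictors is then a matter of popping the last step and appending a different estimator.</p>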

<p>The first model I tried was a simple linear regressor with only the fastest free practice lap time, driver, and team as features. I trained and scored the model 500 times to get a fairly stable measure of the performance. There are 40 total races in the data, and each run sets aside about 10 of them as the test set.</p>

In [None]:
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Lasso

fastest_lap_time_feature_pipeline = create_analysis_pipeline_base(
    numerical_feature_columns=['year_round', 'fastest_lap_seconds'],
    categorical_feature_columns=['driver', 'team']
)

svr_regressor = SVR()
linear_regressor = LinearRegression()

fastest_lap_time_feature_pipeline.steps.append(('Linear Regression', linear_regressor))
linear_regressor_results = run_cross_validation(features_df[['year_round', 'fastest_lap_seconds', 'driver', 'team', 'true_qualifying_rank']],
                                                fastest_lap_time_feature_pipeline, n_splits=500, train_size=0.75)

plot_error_dist(pd.Series(linear_regressor_results), error_name='Score', title='Linear Regression on Fastest FP Lap and Driver + Team')

<p>Next I tried using a support vector machine with the default kernel on the same features, and found that the results were the same, likely due to the limited feature space.</p>

In [None]:
fastest_lap_time_feature_pipeline.steps.pop()
fastest_lap_time_feature_pipeline.steps.append(('SVM Regressor', svr_regressor))

svm_regressor_results = run_cross_validation(features_df[['year_round', 'fastest_lap_seconds', 'driver', 'team', 'true_qualifying_rank']],
                                             fastest_lap_time_feature_pipeline, n_splits=500, train_size=0.75)

plot_error_dist(pd.Series(svm_regressor_results), error_name='Score', title='SVR on Fastest FP Lap and Driver + Team')

<p>I tried adding the more promising lap and sector timing features using the more powerful SVM, and found that performance didn't meaningfully improve.</p>

In [None]:
fastest_sectors_feature_pipeline = create_analysis_pipeline_base(
    numerical_feature_columns=['year_round'] + fastest_sector_features,
    categorical_feature_columns=['driver', 'team']
)

fastest_sectors_feature_pipeline.steps.append(('SVR Regressor', svr_regressor))

svm_regressor_sectors_results = run_cross_validation(features_df[fastest_sector_features + ['year_round', 'driver', 'team', 'true_qualifying_rank']],
                                                     fastest_sectors_feature_pipeline, n_splits=500, train_size=0.75)

plot_error_dist(pd.Series(svm_regressor_sectors_results), error_name='Score', title='SVR on Fastest FP Sectors and Lap and Driver + Team')

<p>Finally, I tried using a LASSO regression model given the concerns about dependent features. It turns out that it was able to extract a little bit more performance than the SVM. I tried a variety of values for alpha and found that 0.05 worked about the best. The performance was about the same whether I used just the features that looked decent or all the features available from fastf1.</p>

In [None]:
lasso_regressor = Lasso(alpha=0.05)

fastest_sectors_feature_pipeline.steps.pop()
fastest_sectors_feature_pipeline.steps.append(('Lasso', lasso_regressor))

fastest_sectors_lasso_results = run_cross_validation(features_df[fastest_sector_features + ['year_round', 'driver', 'team', 'true_qualifying_rank']], 
                                                     fastest_sectors_feature_pipeline, n_splits=500, train_size=0.75)
plot_error_dist(pd.Series(fastest_sectors_lasso_results), error_name='Score', title='LASSO on Fastest FP Sectors and Lap and Driver + Team')

In [None]:
full_features_pipeline = create_analysis_pipeline_base(
    numerical_feature_columns=['year_round'] + numerical_feature_columns,
    categorical_feature_columns=['driver', 'team']
)

full_features_pipeline.steps.append(('Lasso', lasso_regressor))

lasso_results = run_cross_validation(features_df[numerical_feature_columns + ['year_round', 'driver', 'team', 'true_qualifying_rank']],
                                     full_features_pipeline, n_splits=500, train_size=0.75)

plot_error_dist(pd.Series(lasso_results), error_name='Score', title='LASSO on Speed Traps Fastest FP Lap and Sectors and Driver + Team')

<h3>Conclusion</h3>

<p>Overall the performance of the model was not impressive. To my surprise, the speed trap data did not have any meaningful bearing on qualifying performance. Additionally, the individual sector performances didn't reveal significantly more than the total lap time. To truly get after the "hidden" performance that teams save for qualifying, it's probably necessary to really dig into telemetry data and develop some features from that. However, that task is much more complex than just looking at the top-level timing data.</p>