# Predicting Formula 1 Fastest Lap Speed & Time

**Group:** V10FTW  
**Members:** Oskar Floeck s3725028 & Connor Hutchinson s3544152

## Table of contents

* [Source and Description](#desc)
* [Goals and Objectives](#goals)
* [Data Preparation](#data)
* [Data Exploration](#explore)
* [Statistical Modelling & Performance Evaluation](#model)
* [Summary & Conclusion](#conc)

## Source and Description <a name="desc"></a>

### Data Source

Formula 1 Raw Data: http://ergast.com/mrd (Ergast, 2020)  
Track Weather Data: https://www.motorsport-total.com/formel-1/ergebnisse (Motorsport-total, 2020)

Because of the way the website above presents weather data, a script was written to automatically scrape the necessary information and store it in `track_weather.csv` for analysis in this report. The full script can be [viewed here](https://github.com/floeck/f1-weather-analysis/blob/master/notebooks/scrape-weather.ipynb).

* `races.csv` and `results.csv`, obtained from Ergast: contain the relevant race data, such as fastest lap speeds, fastest lap times and race dates.

* `track_weather.csv`, scraped from motorsport-total: contains track weather information, such as temperature, humidity and wind speed.

For the purposes of this report, and because of limitations in scraping the track weather, the analysis uses the following tracks for the years 2007 to 2019. Details of these limitations can be found in the script.

<table>
<tr>
</tr>
<tr>
<td>
    
* Albert Park, Australia
* Sepang, Malaysia
* Sakhir, Bahrain
* Catalunya, Spain
* Circuit de Monaco, Monaco
* Montreal, Canada
* Magny Cours, France
* Silverstone, Britain
* Valencia, Spain
* Hungaroring, Hungary 
    
</td>
<td>
    
* Istanbul, Turkey
* Monza, Italy
* Spa-Francorchamps, Belgium
* Fuji, Japan
* Shanghai, China
* Interlagos, Brazil
* Yas Marina, Abu Dhabi
* Indianapolis, USA
* Austin, USA
* Hermanos Rodríguez, Mexico
    
</td>
</tr>
</table>

### Descriptive Features

Descriptions of data for `races.csv`

| feature | type  | units  | desc  |
|---|---|---|---|
| raceId  | int64 | digits | ID of the race |
| name | object | unknown  | Name of the race |
| date | object | date | Date of race |

Descriptions of data for `results.csv`

| feature | type  | units  | desc  |
|---|---|---|---|
| raceId  | int64 | digits | ID of the race |
| fastestLapTime | float64 | milliseconds  | Fastest lap time |
| fastestLapSpeed | float64 | kph | Fastest average lap speed |

Descriptions of data for `track_weather.csv`

| feature | type  | units  | desc  |
|---|---|---|---|
| track  | object | unknown | Track name |
| date | object | date  | Date of race |
| local_time | object | 24h-time | Time of race |
| weather | object | unknown | Track conditions |
| temp | float64 | Celsius | Ambient temp |
| track_temp | float64 | Celsius | Track temp |
| humidity | float64 | % | Ambient humidity |
| air_pressure | float64 | mBar | Ambient air pressure |
| wind_speed | float64 | m/s | Track wind speed |
| wind_direction | object | unknown | Track wind direction |


### Target Feature
The target features are `fastestLapSpeed` and `fastestLapTime`, both of which are continuous numerical features.

## Goals and Objectives <a name="goals"></a>

The primary goal of the report is to investigate which weather attributes affect Formula 1 track performance, measured by fastest lap speed and fastest lap time.

## Data Preparation <a name="data"></a>

### Preliminaries

In [163]:
# Module imports
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

In [164]:
# Suppress warnings and display all columns
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

In [165]:
# Convert a lap time string in m:ss.mmm format to milliseconds
def to_milliseconds(string):
    # Normalise '.' to ':' so the string splits into minutes, seconds and milliseconds
    string = string.replace(".", ":").split(":")
    minutes = int(string[0])
    seconds = int(string[1])
    milliseconds = int(string[2])
    return (minutes * 60000) + (seconds * 1000) + milliseconds
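
As a quick sanity check of the conversion above, the sketch below re-declares the helper so that it runs on its own and applies it to a single lap time. The sample value `1:27.452` is made up for illustration and is not taken from the dataset.

```python
def to_milliseconds(lap_time):
    """Convert an 'm:ss.mmm' lap time string to integer milliseconds."""
    minutes, seconds, milliseconds = lap_time.replace(".", ":").split(":")
    return int(minutes) * 60000 + int(seconds) * 1000 + int(milliseconds)

# Hypothetical lap time, chosen only to illustrate the arithmetic
sample = "1:27.452"
converted = to_milliseconds(sample)
print(converted)  # 1 * 60000 + 27 * 1000 + 452 = 87452
```

Note that the helper assumes the fractional part always has three digits; a value such as `1:27.45` would be read as 45 ms rather than 450 ms.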

### Reading in Data & Cleaning

In [192]:
# Read in data to notebook
races = pd.read_csv('../data/formula-1/races.csv', sep = ',')
results = pd.read_csv('../data/formula-1/results.csv', sep = ',')
track_weather = pd.read_csv('../data/weather/track_weather.csv', sep = ',')

In [200]:
track_weather.head()
#track_weather.dtypes

Unnamed: 0,track,date,local_time,weather,temp,track_temp,humidity,air_pressure,wind_speed,wind_direction
0,albert-park,2007-03-18,14:00:00,sunny,21.0,40.0,52.0,1020.0,1.5,west
1,albert-park,2008-03-16,15:30:00,sunny,36.0,49.0,11.0,1014.0,3.5,north
2,albert-park,2009-03-29,17:00:00,sunny,21.0,27.0,65.0,1020.0,5.0,south
3,albert-park,2010-03-28,17:00:00,"cloudy, occasional rain",26.0,23.0,41.0,1008.0,2.0,west
4,albert-park,2011-03-27,17:00:00,slightly cloudy,17.0,20.0,61.0,1025.0,4.0,south


In [194]:
# Drop unnecessary columns
races = races.drop(columns = ['year', 'circuitId', 'round', 'time', 'url'])
results = results.drop(columns = ['resultId', 'driverId', 'constructorId', 'number', 'grid', 'position',
                                  'positionText', 'positionOrder', 'points', 'laps', 'fastestLap', 'time',
                                  'milliseconds', 'rank'])

# Replace the '\N' missing-value marker and keep only results with a recorded fastest lap
results = results.replace(r'\\N','null', regex=True)
results = results.loc[results['fastestLapTime'] != 'null']
results = results.drop(columns = ['statusId'])
results['fastestLapSpeed'] = results['fastestLapSpeed'].astype(float)
results['fastestLapSpeed'] = results['fastestLapSpeed'].astype(float)

# Convert fastest time to milliseconds
results['fastestLapTime'] = results['fastestLapTime'].apply(to_milliseconds)

# Merge races with results, then average fastest lap time and speed over all drivers in each race
races_results = races.merge(results, on = 'raceId').groupby(['raceId', 'name', 'date']).mean()

# Join the weather data on race date
df = races_results.merge(track_weather, on = 'date')

# Add rain column
rain_desc = ["rain", "shower", "drizzle", "wet"]
df['rain'] = np.where(df['weather'].str.contains('|'.join(rain_desc)), 'wet', 'dry')

# Add year categorical column
df['year'] = df['date'].str[:4].astype(int).astype('category')

df['track'] = df['track'].astype('category')

# Round & Export dataframe to csv for submission
df = df.round(3)
df.to_csv('../data/V10FTW_Data.csv')

In [199]:
#results.sample(n = 12, random_state = 998)
df.sample(n = 5, random_state = 997)
#df.dtypes

Unnamed: 0,date,fastestLapTime,fastestLapSpeed,track,local_time,weather,temp,track_temp,humidity,air_pressure,wind_speed,wind_direction,rain,year
114,2014-03-30,106486.762,187.467,sepang,16:00:00,slightly cloudy,32.0,48.0,62.0,1002.0,2.5,north,dry,2014
138,2015-08-23,114660.0,219.946,de-spa,14:00:00,cloudy,23.0,35.0,42.0,958.0,3.5,northwest,dry,2015
129,2015-03-15,92157.077,207.176,albert-park,17:00:00,slightly cloudy,18.0,34.0,60.0,1015.0,6.0,south,dry,2015
60,2010-08-29,113446.696,222.786,de-spa,14:00:00,rainy,15.0,18.0,70.0,965.0,3.0,southwest,wet,2010
189,2018-07-29,82145.947,192.035,hungaroring,15:10:00,sunny,34.0,53.0,43.0,983.0,1.0,northwest,dry,2018
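
The `rain` column in the sample above comes from a keyword match on the free-text `weather` description. The sketch below reproduces that step in isolation, using four descriptions copied from the rows shown in this report. The `na=False` argument is an assumption added here, not part of the original cell: it makes a missing description count as `dry` instead of propagating `NaN` into `np.where`.

```python
import numpy as np
import pandas as pd

# Weather descriptions copied from the sample rows shown in this report
weather = pd.Series(["sunny", "cloudy, occasional rain", "slightly cloudy", "rainy"])

rain_desc = ["rain", "shower", "drizzle", "wet"]
pattern = "|".join(rain_desc)  # regex alternation: rain|shower|drizzle|wet

# Assumption: na=False treats a missing description as not matching (i.e. 'dry')
is_wet = weather.str.contains(pattern, na=False)
rain = np.where(is_wet, "wet", "dry")
print(rain.tolist())  # ['dry', 'wet', 'dry', 'wet']
```

Because the match is a substring search, any description containing one of the keywords is flagged as wet, including mixed conditions such as "cloudy, occasional rain".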


## Creating Models

In [128]:
# Fit a linear regression. Candidate weather features: temp, track_temp, humidity, air_pressure, wind_speed, wind_direction, rain
X = df[['track_temp', 'rain', 'wind_speed', 'track', 'year']]
Y = df['fastestLapTime']

# convert categorical into dummy/indicator variables (float dtype so statsmodels accepts the frame)
X = pd.get_dummies(data=X, drop_first=True, dtype=float)

# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

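# prediction with sklearn: example inputs for a single race
track_temp = 21.0
rain = 'dry'
wind_speed = 4.5
track = 'albert-park'
year = '2016'
# Disabled: regr was fitted on dummy-encoded columns, so these raw values must be encoded the same way first
#print ('Predicted Fastest Lap Time: \n', regr.predict([[track_temp, rain, wind_speed, track, year]]))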

# with statsmodels
X = sm.add_constant(X) # adding a constant
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)

Intercept: 
 97428.04604617717
Coefficients: 
 [ 4.53368794e+00  3.72010351e+02  3.91343516e+03 -1.34385367e+04
 -5.17327266e+03 -1.68372637e+04 -2.46052971e+04  8.00742753e+03
 -8.10613446e+03 -1.95770446e+04 -5.17327266e+03 -2.84280041e+04
 -1.34258182e+04 -1.45880232e+04 -2.16578257e+04 -2.58669472e+04
 -1.63321095e+04 -4.84924146e+03 -4.06187146e+03 -2.49478136e+03
 -1.05920280e+04  7.18170599e+02  6.92632824e+02  2.40002894e+03
  6.42861456e+03  5.90930211e+03  6.43575829e+03  5.35255021e+03
  8.96395938e+03  7.99128939e+03  7.19880162e+03  3.57742974e+03
  3.56212179e+03  3.37898146e+03]
                            OLS Regression Results                            
Dep. Variable:         fastestLapTime   R-squared:                       0.888
Model:                            OLS   Adj. R-squared:                  0.866
Method:                 Least Squares   F-statistic:                     41.85
Date:                Thu, 22 Oct 2020   Prob (F-statistic):           1.90e-68
Time
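
The prediction line in the cell above is commented out because the model was fitted on dummy-encoded columns, so raw values such as `'dry'` or `'albert-park'` cannot be passed to `regr.predict` directly. One possible way to encode a new race is sketched below. The `toy` frame and its lap times are invented stand-ins that only mimic the column layout of `df`, and aligning the new row with `reindex` is an assumption of this sketch, not the method used in the report.

```python
import pandas as pd
from sklearn import linear_model

# Toy stand-in for df: values are invented and only mimic the column layout
toy = pd.DataFrame({
    "track_temp": [40.0, 27.0, 48.0, 35.0, 18.0, 53.0],
    "rain": ["dry", "dry", "dry", "wet", "wet", "dry"],
    "wind_speed": [1.5, 5.0, 2.5, 3.5, 3.0, 1.0],
    "track": ["albert-park", "albert-park", "sepang", "de-spa", "de-spa", "sepang"],
    "year": [2007, 2008, 2007, 2008, 2007, 2008],
    "fastestLapTime": [88000.0, 89500.0, 106500.0, 114700.0, 113400.0, 105900.0],
})
toy["year"] = toy["year"].astype("category")

features = ["track_temp", "rain", "wind_speed", "track", "year"]
X = pd.get_dummies(toy[features], drop_first=True, dtype=float)
regr = linear_model.LinearRegression().fit(X, toy["fastestLapTime"])

# Describe the new race with raw values, exactly as a user would think of them
new_race = pd.DataFrame({
    "track_temp": [21.0], "rain": ["dry"], "wind_speed": [4.5],
    "track": ["albert-park"], "year": [2008],
})
new_race["year"] = new_race["year"].astype("category")

# Encode without drop_first, then align to the training columns:
# baseline levels dropped at training time are discarded, absent levels become 0
new_X = pd.get_dummies(new_race, dtype=float).reindex(columns=X.columns, fill_value=0.0)

prediction = regr.predict(new_X)
print(prediction)
```

Aligning to `X.columns` keeps the column order identical to the one used for fitting, which is what the coefficients rely on. A level that never appeared in the training data would silently fall back to the baseline, so such cases are worth checking separately.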

#### Analysis of Model
Overall, year is not a strong predictor of fastest lap time. However, the coefficients for 2014 and 2015 are statistically significant at the 10% level (p < 0.1), which may indicate that fastest lap times in 2014 and 2015 were higher than in the base year of 2007.