# F1 statistics & simulation

## Introduction
Jupyter notebook created by David Chevrier, Diggers  
Built on ATOTI, ActivePivot Python API library, Community version  
http://diggers-consulting.com  
contact@diggers-consulting.com  
April 2020

Tutorial project with ATOTI: it analyzes historical Formula 1 data to show how the different scoring systems used throughout F1 history would change championship results

GitHub repository:  
https://github.com/dch-diggers/F1-data-analysis-with-ATOTI

## Dataset
Data from https://ergast.com/mrd/db/#csv or https://www.kaggle.com/draeg82/exploration-of-f1-dataset/data  
F1 data from 1950 to 2019


## Prerequisites
### Installation
The ATOTI library must be installed in your JupyterLab environment  
See the official installation page: https://docs.atoti.io/0.3.1/installation.html

### Changes in data files
In the file drivers.csv, rename the column 'url' to 'driver_url'
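
The rename can also be scripted instead of being done by hand. The following is a minimal pandas sketch, assuming the file has a header row that contains a `url` column. The helper name `rename_url_column` and the in-memory sample row are illustrative only and are not part of the original notebook.

```python
import io

import pandas as pd


def rename_url_column(csv_source, new_name="driver_url"):
    """Read a drivers CSV and return it with the 'url' column renamed.

    Everything is read as text (dtype=str, keep_default_na=False) so that
    values are not reinterpreted when the file is written back.
    """
    df = pd.read_csv(csv_source, dtype=str, keep_default_na=False)
    return df.rename(columns={"url": new_name})


# Tiny in-memory stand-in for drivers.csv (made-up row, not real data)
sample = io.StringIO("driverId,driverRef,url\n1,some_driver,http://example.com/1\n")
drivers = rename_url_column(sample)
print(list(drivers.columns))

# With the real file (path as used later in this notebook):
# rename_url_column("./f1db_csv/drivers.csv").to_csv("./f1db_csv/drivers.csv", index=False)
```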

## 1. Initialization & creation of the ActivePivot session

In [None]:
import atoti as tt
import numpy as np
import pandas as pd

session = tt.create_session()

## 2. Creation of Stores

In [None]:
# types definition to correctly import data in stores
resultsTypes = {
    "points": tt.types.DOUBLE,
}

In [None]:
sDrivers = session.read_csv("./f1db_csv/drivers.csv", keys=["driverId"], store_name="F1 drivers")

In [None]:
#sRaces = session.read_csv("./races.csv", keys=['raceId','circuitId','year'], store_name="F1 races")
sRaces = session.read_csv("./f1db_csv/races.csv", keys=['raceId'], store_name="F1 races")

In [None]:
#sResults = session.read_csv("./results.csv", keys=['resultId','raceId','driverId'], store_name="F1 results", types=resultsTypes)
sResults = session.read_csv("./f1db_csv/results.csv", keys=['resultId'], store_name="F1 results", types=resultsTypes)

In [None]:
sDriverStandings = session.read_csv("./f1db_csv/driver_standings.csv", keys=["driverStandingsId"], store_name="F1 driver standings")

In [None]:
print('Number of results: ',sResults.shape,'\nNumber of driver_standings: ',sDriverStandings.shape)

In [None]:
sResults.join(sDrivers,mapping={"driverId":"driverId"})
sResults.join(sRaces, mapping={"raceId": "raceId"})
#sResults.head(joined_columns=True)

In [None]:
#sResults.head(joined_columns=True).columns

In [None]:
# joins between stores
sDriverStandings.join(sDrivers,mapping={"driverId":"driverId"})
sDriverStandings.join(sRaces, mapping={"raceId": "raceId"})

In [None]:
# load_all_data() is necessary; otherwise each store is loaded with 10000 lines max
session.load_all_data()

In [None]:
print('Number of results: ',sResults.shape,'\nNumber of driver_standings: ',sDriverStandings.shape)

## 3. Cube

### Cube for Race results

In [None]:
f1cube= session.create_cube(sResults,"F1Cube")

In [None]:
l = f1cube.levels
m = f1cube.measures
h = f1cube.hierarchies

In [None]:
session.url

### Cube for Driver standings

In [None]:
f1stdcube= session.create_cube(sDriverStandings,"F1StdCube")
ls = f1stdcube.levels
ms = f1stdcube.measures
hs = f1stdcube.hierarchies

## 4. First data visualization

### dataviz1
A simple data visualization showing a table with the total number of races by driver, sorted by descending 'count' field

In [None]:
f1cube.visualize('Total number of races by driver')

In [None]:
f1cube.visualize('Total number of points by driver / treemap')

## 5. Measures and first queries

In [None]:
# Definition of the measure aggregating the number of points on 2 particular levels: races and drivers
m['Total Points']=tt.agg.sum(m['points.SUM'],scope=tt.scope.origin("driverId","raceId"))

### query1
A simple query that returns a dataframe with the total number of points aggregated by driver forename and surname

In [None]:
dfq1=f1cube.query(m['Total Points'],levels=[l["forename"],l["surname"]])
dfq1
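
For readers without a running session, the same aggregation can be approximated in plain pandas. This is only a sketch of the idea (join results to drivers, then sum points per driver), not what ATOTI does internally. All rows and names below are made up.

```python
import pandas as pd

# Made-up rows standing in for results.csv and drivers.csv
results = pd.DataFrame(
    {
        "raceId": [1, 1, 2, 2],
        "driverId": [10, 20, 10, 20],
        "points": [10.0, 6.0, 4.0, 10.0],
    }
)
drivers = pd.DataFrame(
    {
        "driverId": [10, 20],
        "forename": ["Ann", "Bob"],
        "surname": ["Example", "Sample"],
    }
)

# Equivalent of the store join on driverId
joined = results.merge(drivers, on="driverId", how="left")

# Equivalent of querying 'Total Points' on the forename and surname levels
total_points = (
    joined.groupby(["forename", "surname"])["points"]
    .sum()
    .rename("Total Points")
    .to_frame()
)
print(total_points)
```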

### query2
A similar query with the addition of the 'condition' parameter, used as a filter on levels (filtering on the measure is not possible as of today)

In [None]:
dfq2=f1cube.query(m['Total Points'],levels=[l["forename"],l["surname"]],condition=l["surname"]=="Prost")
dfq2
# check data here: https://www.statsf1.com/en/alain-prost.aspx

### query3
Another query returning a dataframe aggregating the Total Points measure by driver and by year

In [None]:
#dfq3 = f1cube.query(m['Total Points'],levels=[l["driverRef"],l["driverForename"],l["driverSurname"],l["year"]])
dfq3 = f1cube.query(m['Total Points'],levels=[l["driverRef"],l["year"]])
dfq3

In [None]:
# You can then manipulate your dataframe like any other pandas dataframe, applying a filter for example
#type(dfq3)
#dfq3.keys
dfq3[dfq3['Total Points']>0]

In [None]:
# in this case the resulting dataframe is multiindexed
dfq3.loc["alesi"].loc[1990]['Total Points']
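
The chained lookup above can also be written as a single `.loc` call with a tuple key, and `.xs` selects one value of an inner index level. The sketch below uses a small hand-built frame with the same index level names as the query result. The driver names and point values are made up.

```python
import pandas as pd

# Hand-built stand-in for a query result indexed by (driverRef, year)
index = pd.MultiIndex.from_tuples(
    [("driver_a", 1990), ("driver_a", 1991), ("driver_b", 1990)],
    names=["driverRef", "year"],
)
df = pd.DataFrame({"Total Points": [13.0, 21.0, 0.0]}, index=index)

# Chained lookup, as in the cell above
chained = df.loc["driver_a"].loc[1990]["Total Points"]

# Same value with one tuple-keyed lookup
direct = df.loc[("driver_a", 1990), "Total Points"]

# Cross-section: every driver for a single year
season_1990 = df.xs(1990, level="year")
print(chained, direct)
print(season_1990)
```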

### dataviz2
Data visualization of the top-5 drivers with the highest total of points in their whole career

In [None]:
f1cube.visualize('TOP-5 drivers with highest total points in career')

# check in the widget configuration the "TopCount" filter used to select only the top-5
# check in the cell metadata the setting used to sort the data in the chart:
#     "plotly": {
#         "layout": {
#             "yaxis": {
#                 "categoryorder": "total ascending"
#             }
#         }
#     },

### preparation of the world champions dataframe 

In [None]:
dfwc = pd.DataFrame(index=range(1950,2020),columns=['driverRef','Total Points'])

In [None]:
f1stdcube.query(ms["contributors.COUNT"],levels=[ls["year"],ls["raceId"],ls["driverRef"]])

In [None]:
for i in range(1950,2020):
    dfstd=f1stdcube.query(ms["contributors.COUNT"],levels=[ls["year"],ls["raceId"],ls["driverRef"],ls["position"]],condition=(ls["position"]=="1") & (ls["year"]==str(i)))
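    # the champion is the driver in position 1 after the last race (highest raceId) of the year
    last_race = dfstd.index.get_level_values('raceId').unique().max()
    dfwc.loc[i, 'driverRef'] = dfstd.loc[i].loc[last_race].index.get_level_values('driverRef').tolist()[0]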

In [None]:
dfwc
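
The loop above takes, for each year, the driver in position 1 of the standings after the last race, where the last race is assumed to be the one with the highest raceId. The same selection can be sketched in plain pandas on a flat standings table. The rows below are made up and the variable names are illustrative.

```python
import pandas as pd

# Made-up standings: one row per (race, driver) with the championship position
standings = pd.DataFrame(
    {
        "year": [2000, 2000, 2000, 2000, 2001, 2001],
        "raceId": [1, 1, 2, 2, 3, 3],
        "driverRef": ["driver_a", "driver_b", "driver_a", "driver_b", "driver_a", "driver_b"],
        "position": [1, 2, 2, 1, 1, 2],
    }
)

# Last race of each season, assuming raceId increases within a season
last_race = standings.groupby("year")["raceId"].transform("max")

# Keep the championship leader after that last race
final = standings[(standings["raceId"] == last_race) & (standings["position"] == 1)]
champions = final.set_index("year")["driverRef"]
print(champions)
```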

## 6. Simulations
Variation of points scoring rule in F1 history  
[EN version](https://en.wikipedia.org/wiki/List_of_Formula_One_World_Championship_points_scoring_systems)  
[FR version](https://fr.wikipedia.org/wiki/Classement_des_pilotes_de_Formule_1_par_nombre_de_points#%C3%89volution_de_l'attribution_des_points_au_cours_du_temps)

### Preparation of a dataframe for the different scoring systems

In [None]:
# index = race year, 1 column for each position, values are the number of points scored for the race position
scoring_columns = np.arange(1,41)
scoring_index = ('sc1950to1959fl sc1960 sc1961to1990 sc1991to2002 sc2003to2009 sc2010to2013 sc2014lr sc2015to2018 sc2019fl').split()

In [None]:
dfscoring = pd.DataFrame(0,index=scoring_index,columns=scoring_columns)

In [None]:
# before 1991, the championship total is not a simple aggregation given that only the Nth best race results were retained...
# .loc[row, first:last] slices the column labels inclusively and avoids chained assignment
# dfscoring.loc['sc1950to1959fl', 1:5] = [8,6,4,3,2] # fastest lap bonus +1 point
# dfscoring.loc['sc1960', 1:6] = [8,6,4,3,2,1]
# dfscoring.loc['sc1961to1990', 1:6] = [9,6,4,3,2,1]
dfscoring.loc['sc1991to2002', 1:6] = [10,6,4,3,2,1]
dfscoring.loc['sc2003to2009', 1:8] = [10,8,6,5,4,3,2,1]
dfscoring.loc['sc2010to2013', 1:10] = [25,18,15,12,10,8,6,4,2,1]
# dfscoring.loc['sc2014lr', 1:10] = [25,18,15,12,10,8,6,4,2,1] # last race bonus double points
dfscoring.loc['sc2015to2018'] = dfscoring.loc['sc2010to2013']
# dfscoring.loc['sc2019fl'] = dfscoring.loc['sc2010to2013'] # fastest lap bonus +1 point

In [None]:
#dfscoring.loc['sc2015to2018'][1]
dfscoring
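
To see what such a table does, the sketch below rebuilds two of its rows and rescores one hypothetical driver's finishing positions under each system. The `finishes` series is made up. The point values are the ones listed in the cell above.

```python
import numpy as np
import pandas as pd

# Same layout as dfscoring: one row per system, one column per finishing position
positions = np.arange(1, 41)
scoring = pd.DataFrame(0, index=["sc1991to2002", "sc2003to2009"], columns=positions)
scoring.loc["sc1991to2002", 1:6] = [10, 6, 4, 3, 2, 1]
scoring.loc["sc2003to2009", 1:8] = [10, 8, 6, 5, 4, 3, 2, 1]

# Made-up finishing positions of one driver over four races
finishes = pd.Series([1, 2, 7, 3])

# Look up each position in the chosen system and sum the points
points_1991 = finishes.map(scoring.loc["sc1991to2002"]).sum()
points_2003 = finishes.map(scoring.loc["sc2003to2009"]).sum()
print(points_1991, points_2003)
```

The seventh place is worth nothing under the 1991 to 2002 system and 2 points under the 2003 to 2009 system, which is the kind of difference the simulation in the next section exposes.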

### Simulation on points

In [None]:
# Creation of simulation
pointssystem_sim = f1cube.setup_simulation('pointssystem_sim', per=[l["positionText"]], replace=[m["points.SUM"]], base_scenario_name = 'Base')

In [None]:
# Creation of the different scenarios
sc2015to2018_scenario = pointssystem_sim.scenarios['System 2015 to 2018']
sc1991to2002_scenario = pointssystem_sim.scenarios['System 1991 to 2002']
sc2003to2009_scenario = pointssystem_sim.scenarios['System 2003 to 2009']

In [None]:
# Feed of the different scenarios with points from related scoring systems
for i in range(1,11):
    x=float(dfscoring.loc["sc2015to2018"][i])
    sc2015to2018_scenario += (str(i), x, tt.simulation.Priority.CRITICAL)
    
    x=float(dfscoring.loc["sc1991to2002"][i])
    sc1991to2002_scenario += (str(i), x, tt.simulation.Priority.CRITICAL)
    
    x=float(dfscoring.loc["sc2003to2009"][i])
    sc2003to2009_scenario += (str(i), x, tt.simulation.Priority.CRITICAL)

In [None]:
sc2015to2018_scenario.head(10)
#sc1991to2002_scenario.head(10)
#sc2003to2009_scenario.head(10)

In [None]:
# query4bis
# Construction of the base query to feed the championship winners dataframe based on scenarios
dfq4b = f1cube.query(m['Total Points'],levels=[l["year"],l["pointssystem_sim"],l["driverRef"]])
#dfq4b

In [None]:
# preparation of the world champions dataframe comparison between scenarios
dfwc_comparison = pd.DataFrame(index=range(1950,2020),columns=['Base Champion','sc2015to2018 Champion','sc2015to2018','sc1991to2002 Champion','sc1991to2002','sc2003to2009 Champion','sc2003to2009'])

In [None]:
#### "BASE" CALCULATIONS ARE INCORRECT for championships before 1991 because not all results were considered...
# Example: in 1988, only the 11 best results were considered, meaning that Senna became WC, even if Prost had scored more points...
dfq4b.loc[1988].loc['Base'].sort_values(by=['Total Points'], ascending=False)
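
The "best N results" rule mentioned above can be illustrated with a small pandas sketch. It is not applied anywhere in this notebook. The helper `best_n_total` and the per-race scores are hypothetical, chosen only to show how a driver with the higher raw total can lose once only the best results count.

```python
import pandas as pd


def best_n_total(points, n):
    """Sum only the n highest per-race scores of a driver."""
    return points.nlargest(n).sum()


# Made-up per-race points for two drivers over six races
driver_a = pd.Series([9, 9, 9, 6, 6, 6])
driver_b = pd.Series([9, 9, 9, 9, 0, 0])

raw_a, raw_b = driver_a.sum(), driver_b.sum()
kept_a, kept_b = best_n_total(driver_a, 4), best_n_total(driver_b, 4)
print(raw_a, raw_b, kept_a, kept_b)
```

Here driver_a leads on raw points, but driver_b leads once only the four best results are retained.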

In [None]:
for i in range(1950,2020):
    dfwc_drv = dfwc.loc[i]['driverRef']
    xsim1 = dfq4b.loc[i].loc['System 2015 to 2018'].sort_values(by=['Total Points'], ascending=False).iloc[0].head(1)
    xsim2 = dfq4b.loc[i].loc['System 1991 to 2002'].sort_values(by=['Total Points'], ascending=False).iloc[0].head(1)
    xsim3 = dfq4b.loc[i].loc['System 2003 to 2009'].sort_values(by=['Total Points'], ascending=False).iloc[0].head(1)
    dfwc_comparison.loc[i]=[dfwc_drv,xsim1.name,dfwc_drv == xsim1.name,xsim2.name,dfwc_drv == xsim2.name,xsim3.name,dfwc_drv == xsim3.name]
dfwc_comparison
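
The per-year loop sorts each scenario by Total Points and keeps the first driver. As a sketch, the same selection can be done for all years and scenarios at once in plain pandas. The frame below imitates the shape of `dfq4b` with made-up values, and the scenario name 'System X' is a placeholder.

```python
import pandas as pd

# Made-up stand-in for dfq4b: indexed by year, scenario and driver
index = pd.MultiIndex.from_tuples(
    [
        (2000, "Base", "driver_a"),
        (2000, "Base", "driver_b"),
        (2000, "System X", "driver_a"),
        (2000, "System X", "driver_b"),
    ],
    names=["year", "pointssystem_sim", "driverRef"],
)
totals = pd.DataFrame({"Total Points": [98.0, 97.0, 80.0, 84.0]}, index=index)

# Sort by points, then keep the top row of every (year, scenario) pair
flat = totals.reset_index()
winners = flat.sort_values("Total Points", ascending=False).drop_duplicates(
    ["year", "pointssystem_sim"]
)

# One row per year, one column per scenario, values are the winning drivers
champions = winners.pivot(index="year", columns="pointssystem_sim", values="driverRef")
champions["changed"] = champions["Base"] != champions["System X"]
print(champions)
```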

## 7. Presentation of the results of the different scenarios compared to the historical results: championship winners!

### Comparison of the results using scoring system in effect between 2015 and 2018

In [None]:
# list of different WC using sc2015to2018 vs. Base
dfwc_comparison.loc[dfwc_comparison['sc2015to2018']==False, ['Base Champion','sc2015to2018 Champion']]

### Comparison of the results using scoring system in effect between 1991 and 2002

In [None]:
# list of different WC using sc1991to2002 vs. Base
dfwc_comparison.loc[dfwc_comparison['sc1991to2002']==False, ['Base Champion','sc1991to2002 Champion']]

### Comparison of the results using scoring system in effect between 2003 and 2009

In [None]:
# list of different WC using sc2003to2009 vs. Base
dfwc_comparison.loc[dfwc_comparison['sc2003to2009']==False, ['Base Champion','sc2003to2009 Champion']]

## 8. Charts

### Showing the differences for the race result of a famous GP, Brazil 2008
Where we see how the 1991-2002 scoring system makes a big difference between championship rivals Massa & Hamilton  
See the race summary here: https://www.youtube.com/watch?v=XHSeGou-pCI ;)

In [None]:
f1cube.visualize('Base vs. Simulation - Brazil 2008')

### Showing the differences for the 2008 world championship result
Would Felipe Massa have become world champ in 2008 with the 1991-2002 scoring system?? YES!

In [None]:
# chart
f1cube.visualize('Base vs. Simulation System 1991 to 2002 - 2008 Championship')

In [None]:
# 2008 championship standings table
f1cube.visualize('Base vs. Simulation - 2008 Championship table')

## THE END!