# Tutorial - F1 data analysis with Atoti

## Introduction
Jupyter notebook created by David Chevrier, Diggers  
Built on ATOTI, ActivePivot Python API library, Community version  
http://diggers-consulting.com  
contact@diggers-consulting.com  

Initial version: April 2020  
Last update: June 2020

A tutorial project with Atoti that analyzes historical Formula 1 data to understand how the different scoring systems used in F1 history affect championship results.

GitHub
https://github.com/diggers-lab/F1-data-analysis-with-ATOTI

## Dataset
Data from https://ergast.com/mrd/db/#csv or https://www.kaggle.com/draeg82/exploration-of-f1-dataset/data  
F1 data from 1950 to 2019


## Prerequisites
### Installation
The Atoti library must be installed in your JupyterLab environment.  
Check the official installation page https://docs.atoti.io/0.4.0/installation.html

### Changes in data files
In the file drivers.csv, rename the column 'url' to 'driver_url'.
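
A minimal sketch of that rename using pandas, shown on a small in-memory sample rather than the real file. The sample rows and the commented-out output path are illustrative assumptions; adapt them to wherever your CSV files live.

```python
import io
import pandas as pd

# Illustrative two-row sample standing in for f1db_csv/drivers.csv
sample_csv = io.StringIO(
    "driverId,driverRef,forename,surname,url\n"
    "1,hamilton,Lewis,Hamilton,http://en.wikipedia.org/wiki/Lewis_Hamilton\n"
    "2,heidfeld,Nick,Heidfeld,http://en.wikipedia.org/wiki/Nick_Heidfeld\n"
)

drivers = pd.read_csv(sample_csv)

# Rename only the 'url' column; every other column is left untouched
drivers = drivers.rename(columns={"url": "driver_url"})

# With the real file you would then write it back, for example:
# drivers.to_csv("./f1db_csv/drivers.csv", index=False)
print(list(drivers.columns))
```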

## Limitations

This tutorial demonstrates how to use simulations and scenarios to analyze how the different scoring systems in F1 history could have changed championship results.
Depending on the scoring system, determining the World Champion (WC) is not always as simple as summing the points each driver scored, based on finishing position, at every race of the year.
You can check that on this page https://en.wikipedia.org/wiki/List_of_Formula_One_World_Championship_points_scoring_systems, for example: 
- between 1950 and 1990, not all races counted toward the total, only the N best results of the year (e.g. the 11 best results between 1980 and 1990)
- extra points could be awarded for specific achievements (fastest lap, double points at the last race of the 2014 season...)
- half points were awarded for races stopped before three-quarters of the distance was completed
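
To see why the N-best rule is not a plain sum, here is a small sketch in plain Python. The function name `season_total` and the list of results are hypothetical, made up for illustration only.

```python
def season_total(points_per_race, best_n=None):
    """Sum a driver's points, optionally keeping only the best_n results."""
    if best_n is None:
        return sum(points_per_race)
    # Sort from best to worst and keep the first best_n results
    return sum(sorted(points_per_race, reverse=True)[:best_n])


# Hypothetical season: 13 results for one driver (not real data)
results = [9, 6, 9, 4, 0, 9, 6, 3, 9, 2, 6, 1, 9]

print(season_total(results))      # plain sum over all races
print(season_total(results, 11))  # only the 11 best results count
```

With these made-up numbers the two totals differ, which is why a simple `points.SUM` aggregation cannot reproduce the standings of those seasons.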
  
As a simplification here, we will exclude these specific rules.  
It could be the topic of another tutorial, don't hesitate if you want to help! :)  
  
As a result, we focus on races from 1991 through 2018, excluding 2014. For these seasons, the WC is simply the driver with the highest sum of points scored across all races of the year.
  
  
Reference: variation of points scoring rule in F1 history  
[EN version](https://en.wikipedia.org/wiki/List_of_Formula_One_World_Championship_points_scoring_systems)  
[FR version](https://fr.wikipedia.org/wiki/Classement_des_pilotes_de_Formule_1_par_nombre_de_points#%C3%89volution_de_l'attribution_des_points_au_cours_du_temps)

## 1. Initialization & creation of the activepivot session

In [None]:
import atoti as tt
import numpy as np
import pandas as pd

session = tt.create_session()

## 2. Creation of Stores

In [None]:
#force driverId to STRING for later usage in simulation
driversTypes = {
    "driverId": tt.types.STRING,
}

#load data in store
sDrivers = session.read_csv("./f1db_csv/drivers.csv", keys=["driverId"], store_name="F1 drivers", types = driversTypes)

In [None]:
#force raceId to STRING for later usage in simulation
racesTypes = {
    "raceId": tt.types.STRING,
}

#load data in store
sRaces = session.read_csv("./f1db_csv/races.csv", keys=['raceId'], store_name="F1 races", types=racesTypes)

In [None]:
#force points to DOUBLE for later usage in simulation
resultsTypes = {
    "points": tt.types.DOUBLE,
    "raceId": tt.types.STRING,
    "driverId": tt.types.STRING,
}

#load data in store
sResults = session.read_csv("./f1db_csv/results.csv", keys=['resultId'], store_name="F1 results", types=resultsTypes)

In [None]:
print('Number of results: ',sResults.shape)

In [None]:
sResults.join(sDrivers,mapping={"driverId":"driverId"})
sResults.join(sRaces, mapping={"raceId": "raceId"})
#sResults.head(joined_columns=True)

In [None]:
#load_all_data necessary otherwise stores are loaded with 10000 lines max
session.load_all_data()

In [None]:
print('Number of results: ',sResults.shape)

## 3. Cube

### Cube for Race results

In [None]:
#load store into multidimensional cube 
f1cube= session.create_cube(sResults,"F1Cube")

In [None]:
l = f1cube.levels
m = f1cube.measures
h = f1cube.hierarchies

In [None]:
session.url

## 4. First data visualization

### dataviz1
A simple data visualization showing a table with the total number of races by driver, sorted by descending 'count' field

In [None]:
f1cube.visualize('Total number of races by driver')

In [None]:
f1cube.visualize('Total number of points by driver / treemap')

## 5. First queries

In this first tutorial, we will focus on the points.SUM measure created by default in the cube.  
You can check the available measures by displaying the object cube.measures.  

In [None]:
m

### query1
A simple query that returns a dataframe with the total number of points aggregated by driver forename and surname

In [None]:
dfq1=f1cube.query(
    m['points.SUM'],
    levels=[l["forename"],l["surname"]]
)
dfq1

### query2
A similar query with the addition of the condition parameter, used as a filter on levels.

In [None]:
dfq2=f1cube.query(
    m['points.SUM'],
    levels=[l["forename"],l["surname"]],
    condition=l["surname"]=="Prost"
)
dfq2
# check data here: https://www.statsf1.com/en/alain-prost.aspx

### query3
Another query returning a dataframe that aggregates the points measure by driver and by year.  
You can actually run aggregations on any dimension of the cube!

In [None]:
dfq3 = f1cube.query(m['points.SUM'],levels=[l["driverRef"],l["year"]])
dfq3

In [None]:
# You can then manipulate your dataframe like any other pandas dataframe, applying a filter for example
dfq3[dfq3['points.SUM']>0]

In [None]:
# in this case the resulting dataframe is multi-indexed
dfq3.loc["alesi"].loc[1992]['points.SUM']
#check data at https://www.statsf1.com/en/1992.aspx
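
As a complement, here is a pandas-only sketch of common ways to read a dataframe indexed by (driverRef, year). The index names mirror the query above, but the drivers and values are hypothetical placeholders, not query output.

```python
import pandas as pd

# Hypothetical values shaped like a (driverRef, year) -> points.SUM result
index = pd.MultiIndex.from_tuples(
    [("driver_a", 1991), ("driver_a", 1992), ("driver_b", 1992)],
    names=["driverRef", "year"],
)
df = pd.DataFrame({"points.SUM": [21.0, 18.0, 0.0]}, index=index)

# A single cell: pass the full index tuple and the column in one .loc call
one_value = df.loc[("driver_a", 1992), "points.SUM"]

# All drivers for one year: take a cross-section on the "year" level
season_1992 = df.xs(1992, level="year")

# Keep only the rows with points
scorers = df[df["points.SUM"] > 0]

print(one_value)
print(season_1992)
print(scorers)
```

Using a single `.loc` call with a tuple avoids chaining two lookups, which tends to be easier to read and is also safe when assigning values.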

### dataviz2
Data visualization of the top-5 drivers with the highest total of points in their whole career

In [None]:
f1cube.visualize('TOP-5 drivers with highest total points in career')

## check here in the widget configuration the "TopCount" filter used to select only the top-5
## check here the cell metadata to sort the data in the chart
    #     "plotly": {
    #         "layout": {
    #             "yaxis": {
    #                 "categoryorder": "total ascending"
    #             }
    #         }
    #     },
    
# check the data at https://www.statsf1.com/en/statistiques/pilote/point/nombre.aspx

## 6. Measures

### definition of a measure to determine the WC for a given year

In [None]:
# step 1: we define a measure that will return the maximum aggregation of points.SUM for a given driver on any dimension
m["Driver Points MAX"] = tt.agg.max(
    m["points.SUM"], 
    scope=tt.scope.origin(l["driverRef"])
)


In [None]:
# we can use it to determine the maximum number of points scored by a driver per championship
f1cube.query(
    m["Driver Points MAX"], 
    levels=[l["year"]]
)

In [None]:
# let's display the maximum number of points scored by a given driver at Abu Dhabi GP venue
f1cube.query(
    m["Driver Points MAX"], 
    levels=[l["name"]], 
    condition=(l["name"]=="Abu Dhabi Grand Prix")
)
# check data for Abu Dhabi GP (Lewis Hamilton) at https://www.statsf1.com/en/lewis-hamilton/palmares.aspx

In [None]:
# step 2: we define a measure that will return the max of the previous measure between several drivers
m["Winner Points"] = tt.parent_value(
    m["Driver Points MAX"], 
    on = h["driverRef"]
)

In [None]:
# query examples: comparing the 2 previously defined measures
f1cube.query(
    m["Driver Points MAX"],
    m["Winner Points"],
    levels=[l["year"],l["driverRef"]]
)

In [None]:
# step 3: we create a new measure that returns the driver whose points.SUM total equals the 'Winner Points' result
# limitation: does not handle ties (ex aequo)!
m["Winner"] = tt.agg.single_value(
    tt.where(
        m["Winner Points"] == m["points.SUM"], 
        l["driverRef"]
    ),
    scope=tt.scope.origin(l["driverRef"]),
)

In [None]:
# as a result, we can now query the driver who scored the max points per year = the World Champion 
# (based on our simplified model as mentioned at the beginning, meaning that the results are only correct for years between 1991 and 2018, excluding 2014)
f1cube.query(
    m["Winner"],
    levels=[l["year"]]
)
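
For intuition, the same three steps can be approximated in plain pandas, outside the cube. This is only an analogue of the measures above, not how Atoti evaluates them, and the totals are hypothetical. It also shows one way to keep every tied driver instead of a single value.

```python
import pandas as pd

# Hypothetical season totals per driver (not real results)
totals = pd.DataFrame(
    {
        "year": [2001, 2001, 2002, 2002],
        "driverRef": ["driver_a", "driver_b", "driver_a", "driver_b"],
        "points": [50.0, 42.0, 38.0, 38.0],
    }
)

# Analogue of "Winner Points": best total among all drivers of the same year
totals["winner_points"] = totals.groupby("year")["points"].transform("max")

# Drivers whose own total equals the yearly maximum
leaders = totals[totals["points"] == totals["winner_points"]]

# Collect every leader per year, so a tie yields several names
winners = leaders.groupby("year")["driverRef"].apply(list)
print(winners)
```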

### dataviz3: world champions between 1991-2018 (excluding 2014)

In [None]:
f1cube.visualize('World Champions table')
#check data at https://www.statsf1.com/en/statistiques/pilote/champion/chronologie.aspx

## 7. Simulations

### Preparation of a dataframe for the different scoring systems

In [None]:
# index = scoring system (named after the years it applied), 1 column for each position, values are the number of points scored for the race position
scoring_columns = np.arange(1,41)
scoring_index = ('sc1950to1959fl sc1960 sc1961to1990 sc1991to2002 sc2003to2009 sc2010to2013 sc2014lr sc2015to2018 sc2019fl').split()

In [None]:
dfscoring = pd.DataFrame(0,index=scoring_index,columns=scoring_columns)

In [None]:
# before 1991, the calculation rule is not a simple aggregation given that only the N best race results were retained...
# .loc[row, columns] assigns directly into dfscoring (chained indexing may write to a temporary copy)
# dfscoring.loc['sc1950to1959fl', [1,2,3,4,5]] = [8,6,4,3,2] #fastest lap bonus +1 point
# dfscoring.loc['sc1960', [1,2,3,4,5,6]] = [8,6,4,3,2,1]
# dfscoring.loc['sc1961to1990', [1,2,3,4,5,6]] = [9,6,4,3,2,1]
dfscoring.loc['sc1991to2002', [1,2,3,4,5,6]] = [10,6,4,3,2,1]
dfscoring.loc['sc2003to2009', [1,2,3,4,5,6,7,8]] = [10,8,6,5,4,3,2,1]
dfscoring.loc['sc2010to2013', [1,2,3,4,5,6,7,8,9,10]] = [25,18,15,12,10,8,6,4,2,1]
# dfscoring.loc['sc2014lr', [1,2,3,4,5,6,7,8,9,10]] = [25,18,15,12,10,8,6,4,2,1] #last race bonus double points
dfscoring.loc['sc2015to2018'] = dfscoring.loc['sc2010to2013']
# dfscoring.loc['sc2019fl'] = dfscoring.loc['sc2010to2013'] #fastest lap bonus +1 point

In [None]:
#dfscoring.loc['sc2015to2018'][1]
dfscoring

### Simulation on points

In [None]:
# Creation of simulation
pointssystem_sim = f1cube.setup_simulation(
    'pointssystem_sim',
    levels=[l["positionText"], l["driverId"], l["raceId"]],
    replace=[m["points.SUM"]], 
    base_scenario = 'Base'
)

In [None]:
# Creation of the different scenarios
sc2015to2018_scenario = pointssystem_sim.scenarios['System 2015 to 2018']
sc1991to2002_scenario = pointssystem_sim.scenarios['System 1991 to 2002']
sc2003to2009_scenario = pointssystem_sim.scenarios['System 2003 to 2009']

In [None]:
# Feed the scenarios with the points from the corresponding scoring systems
for i in range(1,11):
    x=float(dfscoring.loc["sc2015to2018"][i])
    sc2015to2018_scenario += (str(i), None, None, x)
    
    x=float(dfscoring.loc["sc1991to2002"][i])
    sc1991to2002_scenario += (str(i), None, None, x)
    
    x=float(dfscoring.loc["sc2003to2009"][i])
    sc2003to2009_scenario += (str(i), None, None, x)

In [None]:
sc2015to2018_scenario.head(10)
#sc1991to2002_scenario.head(10)
#sc2003to2009_scenario.head(10)
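
Conceptually, each scenario swaps the points attached to a finishing position and then re-aggregates. A rough pandas analogue, assuming hypothetical results and using the point tables defined earlier in this tutorial, might look like the sketch below. It is not the Atoti simulation API, just the same idea on a plain dataframe.

```python
import pandas as pd

# Points per finishing position for two systems used in this tutorial
systems = {
    "System 1991 to 2002": {"1": 10, "2": 6, "3": 4, "4": 3, "5": 2, "6": 1},
    "System 2003 to 2009": {
        "1": 10, "2": 8, "3": 6, "4": 5, "5": 4, "6": 3, "7": 2, "8": 1,
    },
}

# Hypothetical results: positionText is a string, "R" marks a retirement
results = pd.DataFrame(
    {
        "driverRef": ["driver_a", "driver_b", "driver_a", "driver_b"],
        "positionText": ["1", "2", "R", "1"],
    }
)

for name, table in systems.items():
    # Positions without an entry in the table (e.g. "R") score 0 points
    results[name] = results["positionText"].map(table).fillna(0.0)

# Re-aggregate per driver under each system
totals = results.groupby("driverRef")[list(systems)].sum()
print(totals)
```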

## 8. Presentation of the results of the different scenarios compared to the historical results: championship winners!

In [None]:
f1cube.visualize('World champions comparions')

# different WC compared to Base are highlighted in red

### Results
Interestingly, the 2003-2009 and 2015-2018 scoring systems would have had the same impact:
- Damon Hill would have become a two-time WC, Schumi would still have 7 titles, but Villeneuve would have lost his crown in 1997... And Eddie Irvine would have won the title in 1999 against Mika Häkkinen (the year Michael Schumacher broke his leg and missed several races)!
  
- This also exposes the limitation of our "Winner" measure: in 2016, with the 1991-2002 scoring system, the value is empty because Rosberg and Hamilton are tied on points! In reality, Hamilton would have won the title instead of Rosberg, because he won 10 races against 9 for Rosberg!
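
One possible way to handle such a tie outside the cube is a countback on wins. The sketch below is plain Python with hypothetical totals; `champion` is an illustrative helper, and it compares wins only, whereas a complete countback would need further criteria when wins are level too.

```python
def champion(standings):
    """Pick the champion from {driver: (points, wins)}, breaking ties on wins."""
    # Tuples compare element by element: points first, then wins
    return max(standings, key=lambda driver: standings[driver])


# Hypothetical totals with two drivers level on points (not real data)
standings = {
    "driver_a": (100.0, 9),
    "driver_b": (100.0, 10),
    "driver_c": (80.0, 2),
}

print(champion(standings))
```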

### Showing the differences for the race result of a famous GP, Brazil 2008
Here we see how the 1991-2002 scoring system makes a big difference between championship rivals Massa & Hamilton.
See the race summary here: https://www.youtube.com/watch?v=XHSeGou-pCI ;)

In [None]:
f1cube.visualize('Base vs. Simulation - Brazil 2008')

### Showing the differences for the 2008 world championship result
Would Felipe Massa have become world champion in 2008 with the 1991-2002 scoring system?? YES!

In [None]:
# 2008 championship
f1cube.visualize('Base vs. Simulation System 1991 to 2002 / 2008 Championship (chart)')

In [None]:
# 2008 championship
f1cube.visualize('Base vs. Simulation System 1991 to 2002 / 2008 Championship (table)')

## THE END!