Analyze the Formula 1 dataset. The files are:

*  [circuits.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/circuits.csv)
*  [circuits_bkp.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/circuits_bkp.csv)
*  [constructorResults.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/constructorResults.csv)
*  [constructorStandings.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/constructorStandings.csv)
*  [constructors.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/constructors.csv)
*  [driverStandings.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/driverStandings.csv)
*  [drivers.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/drivers.csv)
*  [drivers_bkp.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/drivers_bkp.csv)
*  [lapTimes.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/lapTimes.csv)
*  [pitStops.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/pitStops.csv)
*  [qualifying.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/qualifying.csv)
*  [races.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/races.csv)
*  [results.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/results.csv)
*  [seasons.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/seasons.csv)
*  [status.csv](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/status.csv)

In [1]:
import pandas as pd
import numpy as np

## For each decade, find the driver born in that decade who scored the most points in his career.

There are at least two possible ways to compute the decade.
The first is to parse the `dob` column as a date, extract the year, and integer-divide it by 10, so that 1985 becomes 198.

In [5]:
drivers_data = pd.read_csv('https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/drivers.csv',
                          parse_dates = ['dob'], dayfirst=True)
drivers_data['decade'] = pd.to_datetime(drivers_data['dob']).dt.year // 10
drivers_data.head()

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url,decade
0,1,hamilton,44.0,HAM,Lewis,Hamilton,1985-01-07,British,http://en.wikipedia.org/wiki/Lewis_Hamilton,198
1,2,heidfeld,,HEI,Nick,Heidfeld,1977-05-10,German,http://en.wikipedia.org/wiki/Nick_Heidfeld,197
2,3,rosberg,6.0,ROS,Nico,Rosberg,1985-06-27,German,http://en.wikipedia.org/wiki/Nico_Rosberg,198
3,4,alonso,14.0,ALO,Fernando,Alonso,1981-07-29,Spanish,http://en.wikipedia.org/wiki/Fernando_Alonso,198
4,5,kovalainen,,KOV,Heikki,Kovalainen,1981-10-19,Finnish,http://en.wikipedia.org/wiki/Heikki_Kovalainen,198
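One possible alternative, sketched here on toy data, is to skip date parsing and read the year directly from the text of `dob`. This assumes the raw column stores dates as `DD/MM/YYYY` strings, which the `dayfirst=True` argument suggests but which is not verified here; `toy_drivers` is a hypothetical stand-in for the real file.

```python
import pandas as pd

# Toy stand-in for drivers.csv; the DD/MM/YYYY layout is an assumption.
toy_drivers = pd.DataFrame({
    'surname': ['Hamilton', 'Heidfeld', 'Fangio'],
    'dob': ['07/01/1985', '10/05/1977', '24/06/1911'],
})

# The last four characters are the year; integer division by 10 drops the last digit.
toy_drivers['decade'] = toy_drivers['dob'].str[-4:].astype(int) // 10
print(toy_drivers)
```

The string approach is shorter but breaks silently if the date layout changes, so parsing the date is usually the safer choice.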


In [6]:
# Load results data
results = pd.read_csv('https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/results.csv')

# Compute points for each driver
driver_points = results.groupby('driverId')['points'].sum()
driver_points.head()

driverId
1    2610.0
2     259.0
3    1594.5
4    1849.0
5     105.0
Name: points, dtype: float64

In [8]:
results.dtypes

resultId             int64
raceId               int64
driverId             int64
constructorId        int64
number             float64
grid                 int64
position           float64
positionText        object
positionOrder        int64
points             float64
laps                 int64
time                object
milliseconds       float64
fastestLap         float64
rank               float64
fastestLapTime      object
fastestLapSpeed     object
statusId             int64
dtype: object

In [9]:
results.head()

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
0,1,18,1,1,22.0,1,1.0,1,1,10.0,58,34:50.6,5690616.0,39.0,2.0,01:27.5,218.3,1
1,2,18,2,2,3.0,5,2.0,2,2,8.0,58,5.478,5696094.0,41.0,3.0,01:27.7,217.586,1
2,3,18,3,3,7.0,7,3.0,3,3,6.0,58,8.163,5698779.0,41.0,5.0,01:28.1,216.719,1
3,4,18,4,4,5.0,11,4.0,4,4,5.0,58,17.181,5707797.0,58.0,7.0,01:28.6,215.464,1
4,5,18,5,1,23.0,3,5.0,5,5,4.0,58,18.014,5708630.0,43.0,1.0,01:27.4,218.385,1


Now we can attach the career points to each driver and find the best driver for each decade with the `idxmax` method.

In [10]:
drivers_data_with_points = drivers_data.join(driver_points, on='driverId')
best_of_each_decade = drivers_data_with_points.groupby('decade')['points'].idxmax()
best_of_each_decade

decade
189    786
190    642
191    579
192    288
193    327
194    181
195    116
196     29
197      7
198      0
199    814
Name: points, dtype: int64

Since `idxmax` returns the index label of the row holding each group's maximum, we can use `loc` to extract the drivers.

In [11]:
drivers_data_with_points.loc[best_of_each_decade][['forename', 'surname', 'decade', 'points']]

Unnamed: 0,forename,surname,decade,points
786,Luigi,Fagioli,189,32.0
642,Nino,Farina,190,127.33
579,Juan,Fangio,191,279.14
288,Graham,Hill,192,289.0
327,Jackie,Stewart,193,360.0
181,Niki,Lauda,194,420.5
116,Alain,Prost,195,798.5
29,Michael,Schumacher,196,1566.0
7,Kimi,Räikkönen,197,1565.0
0,Lewis,Hamilton,198,2610.0
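The difference between `loc` and `iloc` is invisible on a default `0..n-1` index, where labels and positions coincide. The sketch below uses a hypothetical toy frame with a non-default index to show why the labels returned by `idxmax` belong with `loc`.

```python
import pandas as pd

# Toy frame whose index labels deliberately differ from row positions.
toy = pd.DataFrame(
    {'decade': [197, 197, 198, 198],
     'surname': ['A', 'B', 'C', 'D'],
     'points': [10.0, 25.0, 7.0, 3.0]},
    index=[100, 101, 102, 103],
)

# idxmax returns one index label per group (101 and 102 here).
best_labels = toy.groupby('decade')['points'].idxmax()

# loc looks rows up by label; iloc would treat 101 as a position and fail.
best_rows = toy.loc[best_labels, ['surname', 'decade', 'points']]
print(best_rows)
```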


## For each circuit, find the fastest lap and output it with: (1) the date it was performed, (2) the name of the driver, and (3) the lap time

In [23]:
races_data = pd.read_csv('https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/races.csv',
                        parse_dates=['date'])
laps_data = pd.read_csv('https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/lapTimes.csv')
circuits_data = pd.read_csv('https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/circuits.csv')

In [24]:
races_data.dtypes

raceId                int64
year                  int64
round                 int64
circuitId             int64
name                 object
date         datetime64[ns]
time                 object
url                  object
dtype: object

In [25]:
races_data.head()

Unnamed: 0,raceId,year,round,circuitId,name,date,time,url
0,1,2009,1,1,Australian Grand Prix,2009-03-29,06:00:00,http://en.wikipedia.org/wiki/2009_Australian_G...
1,2,2009,2,2,Malaysian Grand Prix,2009-04-05,09:00:00,http://en.wikipedia.org/wiki/2009_Malaysian_Gr...
2,3,2009,3,17,Chinese Grand Prix,2009-04-19,07:00:00,http://en.wikipedia.org/wiki/2009_Chinese_Gran...
3,4,2009,4,3,Bahrain Grand Prix,2009-04-26,12:00:00,http://en.wikipedia.org/wiki/2009_Bahrain_Gran...
4,5,2009,5,4,Spanish Grand Prix,2009-05-10,12:00:00,http://en.wikipedia.org/wiki/2009_Spanish_Gran...


In [26]:
circuits_data.dtypes

circuitId       int64
circuitRef     object
name           object
location       object
country        object
lat           float64
lng           float64
alt           float64
url            object
dtype: object

In [27]:
laps_data.dtypes

raceId           int64
driverId         int64
lap              int64
position         int64
time            object
milliseconds     int64
dtype: object

In [28]:
laps_data.loc[laps_data['milliseconds'].idxmax()]

raceId                  847
driverId                  2
lap                      25
position                  4
time            2:05:07.547
milliseconds        7507547
Name: 8372, dtype: object

First we are going to add the circuit ID to each row of the laps dataset.

In [29]:
# Add the circuit ID to each lap; no lap may be lost or duplicated by the merge
laps_data_new = pd.merge(laps_data, races_data[['raceId', 'circuitId']], on='raceId')
assert len(laps_data_new) == len(laps_data), 'Lap without a matching circuit'
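The length check above can be complemented by letting `merge` verify the key relationship itself. The following is a minimal sketch on hypothetical toy frames: `validate='many_to_one'` raises if a `raceId` appears more than once on the right side, and a left join keeps unmatched laps visible as missing values instead of dropping them.

```python
import pandas as pd

# Toy stand-ins for lapTimes.csv and races.csv.
toy_laps = pd.DataFrame({'raceId': [1, 1, 2],
                         'milliseconds': [90000, 91000, 85000]})
toy_races = pd.DataFrame({'raceId': [1, 2], 'circuitId': [10, 20]})

# Many laps map to one race; pandas raises MergeError if that does not hold.
merged = pd.merge(toy_laps, toy_races, on='raceId',
                  how='left', validate='many_to_one')

# With a left join, a lap without a race would show up as a missing circuitId.
assert merged['circuitId'].notna().all(), 'Lap without a matching circuit'
print(merged)
```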

Compute the best lap for each circuit. The values returned by `idxmin` are row labels of `laps_data_new`.

In [30]:
best_lap_for_circuit = laps_data_new.groupby('circuitId')['milliseconds'].idxmin()
best_lap_for_circuit.head()

circuitId
1    246165
2    420521
3    248125
4    268741
5    278534
Name: milliseconds, dtype: int64

Look up the rows of the best laps, then join them with the races, drivers and circuits data.

In [31]:
# The labels come from laps_data_new, so index that frame by label
laps_data_new.loc[best_lap_for_circuit, laps_data.columns].head()

Unnamed: 0,raceId,driverId,lap,position,time,milliseconds
246165,90,30,29,1,1:24.125,84125
420521,983,20,41,4,1:34.080,94080
248125,92,30,7,1,1:30.252,90252
268741,75,21,66,5,1:15.641,75641
278534,84,31,39,2,1:24.770,84770


In [32]:
# laps_data_new already carries circuitId, so only the race date is added here
drivers_best_laps = pd.merge(laps_data_new.loc[best_lap_for_circuit],
                             races_data[['raceId', 'date']],
                             on='raceId')[['driverId', 'circuitId', 'date', 'time']]
drivers_best_laps.head()

Unnamed: 0,driverId,circuitId,date,time
0,30,1,2004-03-07,1:24.125
1,20,2,2017-10-01,1:34.080
2,30,3,2004-04-04,1:30.252
3,21,4,2005-05-08,1:15.641
4,31,5,2005-08-21,1:24.770


In [33]:
best_laps_data = drivers_best_laps.merge(drivers_data[['driverId', 'forename', 'surname']],
                                         on='driverId').merge(circuits_data[['circuitId', 'name']],
                                                           on='circuitId')

# Present only the data we need
best_laps_data[['name', 'forename', 'surname', 'date', 'time']]

Unnamed: 0,name,forename,surname,date,time
0,Albert Park Grand Prix Circuit,Michael,Schumacher,2004-03-07,1:24.125
1,Bahrain International Circuit,Michael,Schumacher,2004-04-04,1:30.252
2,Circuit de Monaco,Michael,Schumacher,2004-05-23,1:14.439
3,Silverstone Circuit,Michael,Schumacher,2004-07-11,1:18.739
4,Hungaroring,Michael,Schumacher,2002-08-18,1:16.207
5,Shanghai International Circuit,Michael,Schumacher,2004-09-26,1:32.238
6,Autodromo Enzo e Dino Ferrari,Michael,Schumacher,2004-04-25,1:20.411
7,A1-Ring,Michael,Schumacher,2003-05-18,1:08.337
8,Sepang International Circuit,Sebastian,Vettel,2017-10-01,1:34.080
9,Yas Marina Circuit,Sebastian,Vettel,2009-11-01,1:40.279


## Find the driver who has spent the most time performing pit stops

In [34]:
pit_stops_data = pd.read_csv('https://github.com/gdv/foundationsCS/raw/main/students/ex-data/f1-db/pitStops.csv')
driver_id = pit_stops_data.groupby('driverId')['milliseconds'].sum().idxmax()
print(driver_id)
drivers_data[drivers_data['driverId'] == driver_id]

817


Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url,decade
816,817,ricciardo,3.0,RIC,Daniel,Ricciardo,1989-07-01,Australian,http://en.wikipedia.org/wiki/Daniel_Ricciardo,198
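The cell above reports who spent the most time in the pits but not how much. A minimal sketch, on hypothetical toy data, of how the summed milliseconds could be turned into a readable duration with `pd.to_timedelta`:

```python
import pandas as pd

# Toy stand-in for pitStops.csv.
toy_stops = pd.DataFrame({'driverId': [817, 817, 1],
                          'milliseconds': [23000, 25500, 30000]})

# Total pit-stop time per driver, in milliseconds.
total_per_driver = toy_stops.groupby('driverId')['milliseconds'].sum()

# Driver with the largest total, and that total as a Timedelta.
most_time_driver = total_per_driver.idxmax()
total_time = pd.to_timedelta(total_per_driver.max(), unit='ms')
print(most_time_driver, total_time)
```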


## For each nationality, find the driver who scored the most points in their career

In [35]:
drivers_idxs = drivers_data_with_points.groupby('nationality')['points'].idxmax()
# idxmax returns index labels, so select the rows with loc
best_of_each_nat = drivers_data_with_points.loc[drivers_idxs][['forename', 'surname', 'nationality', 'points']]
best_of_each_nat[best_of_each_nat['points'] > 0]

Unnamed: 0,forename,surname,nationality,points
206,Mario,Andretti,American,180.0
198,Carlos,Reutemann,Argentine,310.0
16,Mark,Webber,Australian,1047.5
181,Niki,Lauda,Austrian,420.5
234,Jacky,Ickx,Belgian,181.0
12,Felipe,Massa,Brazilian,1167.0
0,Lewis,Hamilton,British,2610.0
34,Jacques,Villeneuve,Canadian,235.0
193,Eliseo,Salazar,Chilean,3.0
30,Juan,Pablo Montoya,Colombian,307.0


## Find the nations that have at least one driver with at least 1000 points

In [36]:
drivers_data_with_points[drivers_data_with_points['points'] >= 1000]['nationality'].unique()

array(['British', 'German', 'Spanish', 'Finnish', 'Brazilian',
       'Australian'], dtype=object)

## Find the nations that have at least two drivers with at least 1000 points

In [37]:
drivers_data_with_points[drivers_data_with_points['points'] >= 1000]

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url,decade,points
0,1,hamilton,44.0,HAM,Lewis,Hamilton,1985-01-07,British,http://en.wikipedia.org/wiki/Lewis_Hamilton,198,2610.0
2,3,rosberg,6.0,ROS,Nico,Rosberg,1985-06-27,German,http://en.wikipedia.org/wiki/Nico_Rosberg,198,1594.5
3,4,alonso,14.0,ALO,Fernando,Alonso,1981-07-29,Spanish,http://en.wikipedia.org/wiki/Fernando_Alonso,198,1849.0
7,8,raikkonen,7.0,RAI,Kimi,Räikkönen,1979-10-17,Finnish,http://en.wikipedia.org/wiki/Kimi_R%C3%A4ikk%C...,197,1565.0
12,13,massa,19.0,MAS,Felipe,Massa,1981-04-25,Brazilian,http://en.wikipedia.org/wiki/Felipe_Massa,198,1167.0
16,17,webber,,WEB,Mark,Webber,1976-08-27,Australian,http://en.wikipedia.org/wiki/Mark_Webber,197,1047.5
17,18,button,22.0,BUT,Jenson,Button,1980-01-19,British,http://en.wikipedia.org/wiki/Jenson_Button,198,1235.0
19,20,vettel,5.0,VET,Sebastian,Vettel,1987-07-03,German,http://en.wikipedia.org/wiki/Sebastian_Vettel,198,2425.0
29,30,michael_schumacher,,MSC,Michael,Schumacher,1969-01-03,German,http://en.wikipedia.org/wiki/Michael_Schumacher,196,1566.0


In [38]:
nations = drivers_data_with_points[drivers_data_with_points['points'] >= 1000].groupby('nationality').size()
nations[nations >= 2]

nationality
British    2
German     3
dtype: int64
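The same count can be written with `value_counts`, which some readers find more direct than `groupby(...).size()`. The sketch below uses a hypothetical toy frame; the names and numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for drivers_data_with_points.
toy = pd.DataFrame({
    'nationality': ['British', 'German', 'German', 'Spanish', 'British', 'Finnish'],
    'points': [2610.0, 1594.5, 2425.0, 1849.0, 1235.0, 900.0],
})

# Count drivers with at least 1000 points per nationality.
counts = toy.loc[toy['points'] >= 1000, 'nationality'].value_counts()

# Keep the nationalities with at least two such drivers.
nations_with_two = sorted(counts[counts >= 2].index)
print(nations_with_two)
```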