<h1><center>F1 RACE PREDICTOR</center></h1>
<h1><center>1. Data Collection</center></h1>

---

## In this Notebook:

This notebook covers the data collection stage of the F1 predictor project.

It scrapes the 7 main DataFrames used in the rest of the project:

1. [Races](#1.-Races)
2. [Rounds](#2.-Rounds)
3. [Results](#3.-Results)
4. [Drivers Championship](#4.-Drivers-Championship)
5. [Constructors Championship](#5.-Constructors-Championship)
6. [Qualifying](#6.-Qualifying)
7. [Weather](#7.-Weather)

The data was scraped from either the [Ergast F1 API](https://ergast.com/mrd/) or directly from the [F1 website](https://www.formula1.com/).

The logic of the scraping algorithms is based on Veronica Nigro's GitHub project.

---

In [4]:
# Import dependencies
import pandas as pd
import numpy as np
from pprint import pprint
import requests

from selenium import webdriver
import bs4
from bs4 import BeautifulSoup
import time

# 1. Races

In [2]:
races = {'season': [],
        'round': [],
        'circuit_id': [],
        'lat': [],
        'long': [],
        'country': [],
        'date': [],
        'url': []}

for year in list(range(1950,2022)):
    
    url = 'https://ergast.com/api/f1/{}.json'
    r = requests.get(url.format(year))
    json = r.json()

    for item in json['MRData']['RaceTable']['Races']:
        try:
            races['season'].append(int(item['season']))
        except:
            races['season'].append(None)

        try:
            races['round'].append(int(item['round']))
        except:
            races['round'].append(None)

        try:
            races['circuit_id'].append(item['Circuit']['circuitId'])
        except:
            races['circuit_id'].append(None)

        try:
            races['lat'].append(float(item['Circuit']['Location']['lat']))
        except:
            races['lat'].append(None)

        try:
            races['long'].append(float(item['Circuit']['Location']['long']))
        except:
            races['long'].append(None)

        try:
            races['country'].append(item['Circuit']['Location']['country'])
        except:
            races['country'].append(None)

        try:
            races['date'].append(item['date'])
        except:
            races['date'].append(None)

        try:
            races['url'].append(item['url'])
        except:
            races['url'].append(None)
        
races = pd.DataFrame(races)
print(races.shape)

(1057, 8)
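The repeated `try`/`except` blocks above can be folded into one small helper. This is a minimal sketch, not the code used to build the dataset: the name `safe_get` and the hand-written `sample` dictionary (shaped like one entry of `json['MRData']['RaceTable']['Races']`) are assumptions for illustration.

```python
def safe_get(item, path, cast=None):
    """Walk a nested dict along `path`; return None if a key is missing or the cast fails."""
    value = item
    try:
        for key in path:
            value = value[key]
        return cast(value) if cast else value
    except (KeyError, IndexError, TypeError, ValueError):
        return None


# Hand-written stand-in for one race entry (no 'date' key on purpose)
sample = {
    'season': '2021',
    'round': '22',
    'Circuit': {
        'circuitId': 'yas_marina',
        'Location': {'lat': '24.4672', 'long': '54.6031', 'country': 'UAE'},
    },
}

row = {
    'season': safe_get(sample, ['season'], int),
    'circuit_id': safe_get(sample, ['Circuit', 'circuitId']),
    'lat': safe_get(sample, ['Circuit', 'Location', 'lat'], float),
    'date': safe_get(sample, ['date']),  # missing key -> None
}
print(row)
```

Catching only the lookup and conversion errors keeps unrelated problems (for example a failed request) visible instead of silently turning them into `None`.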


In [3]:
races.tail(3) # Check before exporting

Unnamed: 0,season,round,circuit_id,lat,long,country,date,url
1054,2021,20,losail,25.49,51.4542,Qatar,2021-11-21,http://en.wikipedia.org/wiki/2021_Qatar_Grand_...
1055,2021,21,jeddah,21.6319,39.1044,Saudi Arabia,2021-12-05,http://en.wikipedia.org/wiki/2021_Saudi_Arabia...
1056,2021,22,yas_marina,24.4672,54.6031,UAE,2021-12-12,http://en.wikipedia.org/wiki/2021_Abu_Dhabi_Gr...


In [1]:
races.to_csv('data/races.csv', index = False)


# 2. Rounds

In [5]:
race = pd.read_csv('data/races.csv') 

In [6]:
# Obtain the rounds in a list of lists where every item is a round/year with its respective races.

rounds = []
for year in np.array(race.season.unique()):
    rounds.append([year, list(race[race.season == year]['round'])])
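The same list of `[season, [rounds]]` pairs can also be built with `groupby`. A minimal sketch on a toy table; `toy_races` and `toy_rounds` are hypothetical names, and `groupby` returns the seasons in sorted order rather than in order of appearance.

```python
import pandas as pd

# Toy stand-in for the races table
toy_races = pd.DataFrame({
    'season': [1950, 1950, 1951, 1951, 1951],
    'round': [1, 2, 1, 2, 3],
})

# One [season, [rounds]] entry per season
toy_rounds = [[int(season), group['round'].tolist()]
              for season, group in toy_races.groupby('season')]
print(toy_rounds)  # [[1950, [1, 2]], [1951, [1, 2, 3]]]
```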

In [7]:
rounds[:] # 1950 to 2021 (the most up to date now)

[[1950, [1, 2, 3, 4, 5, 6, 7]],
 [1951, [1, 2, 3, 4, 5, 6, 7, 8]],
 [1952, [1, 2, 3, 4, 5, 6, 7, 8]],
 [1953, [1, 2, 3, 4, 5, 6, 7, 8, 9]],
 [1954, [1, 2, 3, 4, 5, 6, 7, 8, 9]],
 [1955, [1, 2, 3, 4, 5, 6, 7]],
 [1956, [1, 2, 3, 4, 5, 6, 7, 8]],
 [1957, [1, 2, 3, 4, 5, 6, 7, 8]],
 [1958, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]],
 [1959, [1, 2, 3, 4, 5, 6, 7, 8, 9]],
 [1960, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]],
 [1961, [1, 2, 3, 4, 5, 6, 7, 8]],
 [1962, [1, 2, 3, 4, 5, 6, 7, 8, 9]],
 [1963, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]],
 [1964, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]],
 [1965, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]],
 [1966, [1, 2, 3, 4, 5, 6, 7, 8, 9]],
 [1967, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]],
 [1968, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
 [1969, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]],
 [1970, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]],
 [1971, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]],
 [1972, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
 [1973, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]],
 [1

# 3. Results

In [8]:
results = {'season': [],
          'round':[],
           'circuit_id':[],
          'driver': [],
           'date_of_birth': [],
           'nationality': [],
          'constructor': [],
          'grid': [],
          'time': [],
          'status': [],
          'points': [],
          'podium': [],
          'url': []}

for n in list(range(len(rounds))):
    for i in rounds[n][1]:
    
        url = 'http://ergast.com/api/f1/{}/{}/results.json'
        r = requests.get(url.format(rounds[n][0], i))
        json = r.json()

        for item in json['MRData']['RaceTable']['Races'][0]['Results']:
            try:
                results['season'].append(int(json['MRData']['RaceTable']['Races'][0]['season']))
            except:
                results['season'].append(None)

            try:
                results['round'].append(int(json['MRData']['RaceTable']['Races'][0]['round']))
            except:
                results['round'].append(None)

            try:
                results['circuit_id'].append(json['MRData']['RaceTable']['Races'][0]['Circuit']['circuitId'])
            except:
                results['circuit_id'].append(None)

            try:
                results['driver'].append(item['Driver']['driverId'])
            except:
                results['driver'].append(None)
            
            try:
                results['date_of_birth'].append(item['Driver']['dateOfBirth'])
            except:
                results['date_of_birth'].append(None)
                
            try:
                results['nationality'].append(item['Driver']['nationality'])
            except:
                results['nationality'].append(None)

            try:
                results['constructor'].append(item['Constructor']['constructorId'])
            except:
                results['constructor'].append(None)

            try:
                results['grid'].append(int(item['grid']))
            except:
                results['grid'].append(None)

            try:
                results['time'].append(int(item['Time']['millis']))
            except:
                results['time'].append(None)

            try:
                results['status'].append(item['status'])
            except:
                results['status'].append(None)

            try:
                results['points'].append(float(item['points'])) # float, because half points (e.g. 12.5) would fail int()
            except:
                results['points'].append(None)

            try:
                results['podium'].append(int(item['position']))
            except:
                results['podium'].append(None)

            try:
                results['url'].append(json['MRData']['RaceTable']['Races'][0]['url'])
            except:
                results['url'].append(None)

results = pd.DataFrame(results)
print(results.shape)

(24947, 13)
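One thing worth checking here: to our knowledge the Ergast API pages its responses (30 rows by default, adjustable with the `limit` query parameter) and reports `limit`, `offset` and `total` as strings inside `MRData`. The sketch below flags a response that holds fewer rows than the query matched; `is_truncated` and the `sample` dictionary are assumptions for illustration, not part of the scraping code above.

```python
def is_truncated(mrdata):
    """Return True if the page ends before the last matching row."""
    total = int(mrdata['total'])
    offset = int(mrdata['offset'])
    limit = int(mrdata['limit'])
    return offset + limit < total


# Hand-written stand-in for json['MRData'] (values are illustrative)
sample = {'limit': '30', 'offset': '0', 'total': '35'}
print(is_truncated(sample))  # True
```

If a response is flagged, the request can be repeated with a larger `limit` or with successive `offset` values.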


In [9]:
results.to_csv('data/results.csv', index = False)

# 4. Drivers Championship

In [10]:
driver_standings = {'season': [],
                    'round':[],
                    'driver': [],
                    'driver_points': [],
                    'driver_wins': [],
                   'driver_standings_pos': []}

for n in list(range(len(rounds))):
    for i in rounds[n][1]:
    
        url = 'https://ergast.com/api/f1/{}/{}/driverStandings.json'
        r = requests.get(url.format(rounds[n][0], i))
        json = r.json()

        for item in json['MRData']['StandingsTable']['StandingsLists'][0]['DriverStandings']:
            try:
                driver_standings['season'].append(int(json['MRData']['StandingsTable']['StandingsLists'][0]['season']))
            except:
                driver_standings['season'].append(None)

            try:
                driver_standings['round'].append(int(json['MRData']['StandingsTable']['StandingsLists'][0]['round']))
            except:
                driver_standings['round'].append(None)
                                         
            try:
                driver_standings['driver'].append(item['Driver']['driverId'])
            except:
                driver_standings['driver'].append(None)
            
            try:
                driver_standings['driver_points'].append(float(item['points'])) # float, because half points would fail int()
            except:
                driver_standings['driver_points'].append(None)
            
            try:
                driver_standings['driver_wins'].append(int(item['wins']))
            except:
                driver_standings['driver_wins'].append(None)
                
            try:
                driver_standings['driver_standings_pos'].append(int(item['position']))
            except:
                driver_standings['driver_standings_pos'].append(None)
            
driver_standings = pd.DataFrame(driver_standings)
print(driver_standings.shape)

(27113, 6)


### point_shift function

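The standings returned by the API are those *after* each round, but to predict a race we need the values each driver held going *into* it. The `point_shift` function looks up the value from the previous round of the same season and stores it under the original column name, keeping the post-race value in a new `<column>_after_race` column. Rows with no previous round in that season (for example round 1) are filled with 0.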
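The effect of the shift can be seen on a toy table. This sketch uses `groupby(...).shift(1)` rather than the lookup-key merge in `point_shift`, so it assumes every driver appears in consecutive rounds; the numbers are illustrative only.

```python
import pandas as pd

# Toy standings as returned after each round (illustrative values)
toy = pd.DataFrame({
    'season': [2021, 2021, 2021, 2021],
    'round': [1, 1, 2, 2],
    'driver': ['hamilton', 'max_verstappen', 'hamilton', 'max_verstappen'],
    'driver_points': [25.0, 18.0, 44.0, 43.0],
})

# Keep the post-race value, then replace the original column with the
# value from the previous round of the same season and driver
toy = toy.sort_values(['season', 'driver', 'round'])
toy['driver_points_after_race'] = toy['driver_points']
toy['driver_points'] = (toy.groupby(['season', 'driver'])['driver_points_after_race']
                           .shift(1)
                           .fillna(0))  # no previous round -> 0

toy = toy.sort_values(['season', 'round', 'driver']).reset_index(drop=True)
print(toy)
```

In round 2 each driver now carries the points held before the race (25 and 18), while `driver_points_after_race` keeps the totals after it (44 and 43).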

In [11]:
def point_shift (df, driver_or_team, points):
    df['lookup1'] = df.season.astype(str) + df[driver_or_team] + df['round'].astype(str) # current round
    df['lookup2'] = df.season.astype(str) + df[driver_or_team] + (df['round']-1).astype(str) # previous round
    new_df = df.merge(df[['lookup1', points]], how = 'left', left_on='lookup2',right_on='lookup1') # shifted one round
    new_df.drop(['lookup1_x', 'lookup2', 'lookup1_y'], axis = 1, inplace = True) # drop unnecessary columns
    new_df.rename(columns = {points+'_x': points+'_after_race', points+'_y': points}, inplace = True) # rename before/after race
    new_df[points] = new_df[points].fillna(0) # no previous round: start from 0 (assign back; chained inplace fillna may not modify new_df)
    return new_df

In [12]:
driver_standings.tail(3)

Unnamed: 0,season,round,driver,driver_points,driver_wins,driver_standings_pos
27110,2021,22,mick_schumacher,0.0,0,19
27111,2021,22,kubica,0.0,0,20
27112,2021,22,mazepin,0.0,0,21


In [13]:
# shift the driver points

driver_standings = point_shift(driver_standings, 'driver', 'driver_points')
driver_standings = point_shift(driver_standings, 'driver', 'driver_wins')
driver_standings = point_shift(driver_standings, 'driver', 'driver_standings_pos')
driver_standings.tail(3)

Unnamed: 0,season,round,driver,driver_points_after_race,driver_wins_after_race,driver_standings_pos_after_race,driver_points,driver_wins,driver_standings_pos
27110,2021,22,mick_schumacher,0.0,0,19,0.0,0.0,19.0
27111,2021,22,kubica,0.0,0,20,0.0,0.0,20.0
27112,2021,22,mazepin,0.0,0,21,0.0,0.0,21.0


In [14]:
driver_standings.to_csv('data/driver_standings.csv', index = False)

# 5. Constructors Championship

In [15]:
# start from year 1958 because there is no data prior to that.

constructor_rounds = rounds[8:]

constructor_standings = {'season': [],
                    'round':[],
                    'constructor': [],
                    'constructor_points': [],
                    'constructor_wins': [],
                   'constructor_standings_pos': []}

for n in list(range(len(constructor_rounds))):
    for i in constructor_rounds[n][1]:
    
        url = 'https://ergast.com/api/f1/{}/{}/constructorStandings.json'
        r = requests.get(url.format(constructor_rounds[n][0], i))
        json = r.json()

        for item in json['MRData']['StandingsTable']['StandingsLists'][0]['ConstructorStandings']:
            try:
                constructor_standings['season'].append(int(json['MRData']['StandingsTable']['StandingsLists'][0]['season']))
            except:
                constructor_standings['season'].append(None)

            try:
                constructor_standings['round'].append(int(json['MRData']['StandingsTable']['StandingsLists'][0]['round']))
            except:
                constructor_standings['round'].append(None)
                                         
            try:
                constructor_standings['constructor'].append(item['Constructor']['constructorId'])
            except:
                constructor_standings['constructor'].append(None)
            
            try:
                constructor_standings['constructor_points'].append(float(item['points'])) # float, because half points would fail int()
            except:
                constructor_standings['constructor_points'].append(None)
            
            try:
                constructor_standings['constructor_wins'].append(int(item['wins']))
            except:
                constructor_standings['constructor_wins'].append(None)
                
            try:
                constructor_standings['constructor_standings_pos'].append(int(item['position']))
            except:
                constructor_standings['constructor_standings_pos'].append(None)
            
constructor_standings = pd.DataFrame(constructor_standings)
print(constructor_standings.shape)

(12711, 6)
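The slice `rounds[8:]` relies on 1958 sitting at index 8 of `rounds`. A minimal sketch of selecting by year instead, on a toy list; `toy_rounds` and `toy_constructor_rounds` are hypothetical names.

```python
# Toy stand-in for the rounds list
toy_rounds = [[1957, [1, 2]], [1958, [1, 2, 3]], [1959, [1, 2]]]

# Keep seasons from 1958 onwards, regardless of their position in the list
toy_constructor_rounds = [entry for entry in toy_rounds if entry[0] >= 1958]
print(toy_constructor_rounds)  # [[1958, [1, 2, 3]], [1959, [1, 2]]]
```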


In [16]:
# Again, using the point_shift function for constructor points

constructor_standings = point_shift(constructor_standings, 'constructor', 'constructor_points')
constructor_standings = point_shift(constructor_standings, 'constructor', 'constructor_wins')
constructor_standings = point_shift(constructor_standings, 'constructor', 'constructor_standings_pos')
constructor_standings.tail(3)

Unnamed: 0,season,round,constructor,constructor_points_after_race,constructor_wins_after_race,constructor_standings_pos_after_race,constructor_points,constructor_wins,constructor_standings_pos
12708,2021,22,williams,23.0,0,8,23.0,0.0,8.0
12709,2021,22,alfa,13.0,0,9,13.0,0.0,9.0
12710,2021,22,haas,0.0,0,10,0.0,0.0,10.0


In [17]:
constructor_standings.to_csv('data/constructor_standings.csv', index = False)

# 6. Qualifying

In [None]:
qualifying_results = pd.DataFrame()

# Qualifying times are only available from 1983; this cell only scrapes the 2020 and 2021 seasons

for year in list(range(2020,2022)):
    url = 'https://www.formula1.com/en/results.html/{}/races.html'
    r = requests.get(url.format(year))
    soup = BeautifulSoup(r.text, 'html.parser')
    
    # find links to all circuits for a given year
    
    year_links = []
    for page in soup.find_all('a', attrs = {'class':"resultsarchive-filter-item-link FilterTrigger"}):
        link = page.get('href')
        if f'/en/results.html/{year}/races/' in link: 
            year_links.append(link)
            
    # for each circuit, switch to the starting grid page and read table        

    year_df = pd.DataFrame()
    new_url = 'https://www.formula1.com{}'
    for n, link in list(enumerate(year_links)):
        link = link.replace('race-result.html', 'starting-grid.html')
        df = pd.read_html(new_url.format(link))
        df = df[0]
        df['season'] = year
        df['round'] = n+1
        for col in df:
            if 'Unnamed' in col:
                df.drop(col, axis = 1, inplace = True)
        
        # concatenate all tables from all years 
        
        year_df = pd.concat([year_df, df])

    qualifying_results = pd.concat([qualifying_results, year_df])
    
print(qualifying_results.shape)

In [None]:
qualifying_results.tail(3)

In [None]:
qualifying_results.rename(columns = {'Pos': 'grid_position', 'Driver': 'driver_name', 'Car': 'car',
                                     'Time': 'qualifying_time'}, inplace = True)

qualifying_results.drop('No', axis = 1, inplace = True) # driver number is useless because it is just an "arbitrary" number
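The column clean-up can be tried on a toy frame without hitting the website. In this sketch the frame `grid` and its values are invented to imitate a scraped starting-grid table, and the padding columns are dropped in one step with a boolean mask instead of a loop.

```python
import pandas as pd

# Toy frame imitating a scraped starting-grid table (values are invented)
grid = pd.DataFrame({
    'Unnamed: 0': [None, None],
    'Pos': [1, 2],
    'No': [33, 44],
    'Driver': ['Driver A', 'Driver B'],
    'Car': ['Team A', 'Team B'],
    'Time': ['1:22.109', '1:22.480'],
    'Unnamed: 6': [None, None],
})

# Drop every padding column whose name contains 'Unnamed'
grid = grid.loc[:, ~grid.columns.str.contains('Unnamed')]

# Same renaming and 'No' removal as in the cell above
grid = grid.rename(columns={'Pos': 'grid_position', 'Driver': 'driver_name',
                            'Car': 'car', 'Time': 'qualifying_time'})
grid = grid.drop(columns='No')
print(list(grid.columns))
```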

In [None]:
qualifying_results.to_csv('data/qualifying.csv', index = False)

# 7. Weather

In [None]:
races = pd.read_csv('data/races.csv')
races.tail(3)

In [None]:
weather = races.iloc[:,[0,1,2]].copy() # copy, so that adding the 'weather' column later does not raise SettingWithCopyWarning
weather.tail(3)

In [None]:
from selenium.webdriver.common.by import By # Selenium 4 removed the find_element_by_* helpers

info = []

for link in races.url:
    try:
        tables = pd.read_html(link) # download the page once instead of once per table
        for df in tables[:4]: # look for the infobox among the first four tables
            labels = list(df.iloc[:,0])
            if 'Weather' in labels:
                info.append(df.iloc[labels.index('Weather'),1])
                break
        else:
            # no 'Weather' row found: fall back to the Italian Wikipedia page
            driver = webdriver.Chrome()
            try:
                driver.get(link)

                # click language button
                button = driver.find_element(By.LINK_TEXT, 'Italiano')
                button.click()

                clima = driver.find_element(By.XPATH, '//*[@id="mw-content-text"]/div/table[1]/tbody/tr[9]/td').text
                info.append(clima)
            finally:
                driver.quit() # close the browser even if the lookup fails

    except Exception:
        info.append('not found')
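The table lookup can be checked offline on toy tables. A minimal sketch: `find_weather` is a hypothetical helper, and the two small DataFrames stand in for the list returned by `pd.read_html(link)`.

```python
import pandas as pd


def find_weather(tables, max_tables=4):
    """Return the value next to the 'Weather' label in the first matching table, or None."""
    for df in tables[:max_tables]:
        labels = list(df.iloc[:, 0])
        if 'Weather' in labels:
            return df.iloc[labels.index('Weather'), 1]
    return None


# Toy tables standing in for pd.read_html(link)
other = pd.DataFrame([['Pos', 'Driver'], ['1', 'Someone']])
infobox = pd.DataFrame([['Date', '12 December 2021'], ['Weather', 'Clear']])

print(find_weather([other, infobox]))  # Clear
print(find_weather([other]))           # None
```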

In [None]:
len(info)

In [None]:
weather['weather'] = info

In [None]:
weather.tail(3)

In [None]:
# Fixing some missing weather labels

weather_dict = {'weather_warm': ['soleggiato', 'clear', 'warm', 'hot', 'sunny', 'fine', 'mild', 'sereno', 'Sunny and hot', 'Hot, dry, sunny', 'Sunny and warm', 'Warm, dry, sunny', 'Sunny[2]', 'Warm and sunny', 'Sunny, hot','Very hot, dry, sunny'],
               'weather_cold': ['cold', 'fresh', 'chilly', 'cool'],
               'weather_dry': ['dry', 'asciutto', 'Clear', 'Mainly sunny', 'Fine', 'Wet at start, dry later', 'Sunny, mild, dry', 'Dry and sunny'],
               'weather_wet': ['showers', 'wet', 'rain', 'pioggia', 'damp', 'thunderstorms', 'rainy', 'Rain', 'Overcast with intermittent rain'],
               'weather_cloudy': ['overcast', 'nuvoloso', 'clouds', 'cloudy', 'grey', 'coperto', 'Mainly cloudy, dry', 'Cloudy, dry', 'Overcast, mild, dry']}

weather_df = pd.DataFrame(columns = weather_dict.keys()) # convert into a df

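for col in weather_df: # create dummy variables for each of the 5 categories
    # match either the full label exactly as scraped or any of its lowercase words
    weather_df[col] = weather['weather'].map(lambda x: 1 if str(x) in weather_dict[col] or any(i in weather_dict[col] for i in str(x).lower().split()) else 0)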
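The keyword matching can be tried on a single label. This is a sketch of a variant, not the mapping used above: `label_to_flags` and the reduced `keywords` dictionary are assumptions, and unlike a plain `split()` it strips punctuation from each word, so the 'dry' in 'Hot, dry, sunny' is still recognised.

```python
import re

# Reduced keyword sets for illustration (the full lists are in weather_dict)
keywords = {
    'weather_warm': {'sunny', 'hot', 'warm'},
    'weather_dry': {'dry'},
    'weather_wet': {'rain', 'wet', 'showers'},
}


def label_to_flags(label, keywords):
    """Return a 0/1 flag per category if any of its keywords appears as a word in the label."""
    words = set(re.findall(r'[a-z]+', str(label).lower()))
    return {col: int(bool(words & vocab)) for col, vocab in keywords.items()}


print(label_to_flags('Hot, dry, sunny', keywords))
# {'weather_warm': 1, 'weather_dry': 1, 'weather_wet': 0}
```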

In [None]:
weather_info = pd.concat([weather, weather_df], axis = 1)

In [None]:
weather_info.tail(3) # Check before exporting

In [None]:
weather_info.to_csv('data/weather.csv', index= False)

## Next step: Data Preprocessing 