# Data Analytics in Motorsport: Predicting Formula 1 Race Outcomes

## Importing initial libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Importing dataset

In [None]:
!pip install kagglehub[pandas-datasets]



In [None]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

In [None]:
dataset_name = 'rohanrao/formula-1-world-championship-1950-2020'

In [None]:
path = kagglehub.dataset_download(dataset_name)
print("Dataset downloaded to:", path)

Using Colab cache for faster access to the 'formula-1-world-championship-1950-2020' dataset.
Dataset downloaded to: /kaggle/input/formula-1-world-championship-1950-2020


In [None]:
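import os

# `path` is the folder returned by kagglehub.dataset_download above,
# so it does not need to be hard-coded again.
os.chdir(path)  # later cells read the CSV files by relative name
os.listdir()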

['races.csv',
 'constructor_results.csv',
 'drivers.csv',
 'constructors.csv',
 'lap_times.csv',
 'status.csv',
 'driver_standings.csv',
 'seasons.csv',
 'pit_stops.csv',
 'sprint_results.csv',
 'constructor_standings.csv',
 'results.csv',
 'circuits.csv',
 'qualifying.csv']

In [None]:
circuits = pd.read_csv('circuits.csv')
constructor_results = pd.read_csv('constructor_results.csv')
constructor_standings = pd.read_csv('constructor_standings.csv')
constructors = pd.read_csv('constructors.csv')
driver_standings = pd.read_csv('driver_standings.csv')
drivers = pd.read_csv('drivers.csv')
lap_times = pd.read_csv('lap_times.csv')
pit_stops = pd.read_csv('pit_stops.csv')
qualifying = pd.read_csv('qualifying.csv')
races = pd.read_csv('races.csv')
results = pd.read_csv('results.csv')
seasons = pd.read_csv('seasons.csv')
sprint_results = pd.read_csv('sprint_results.csv')
status = pd.read_csv('status.csv')
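The fourteen `read_csv` calls above can also be written as a loop. The sketch below is one possible alternative rather than a replacement for the cell above. `load_tables` is a hypothetical helper, and the default `na_values` is an assumption that the files mark missing entries with the literal string `\N`, which is worth confirming by opening one of the raw CSV files.

```python
from pathlib import Path

import pandas as pd


def load_tables(folder, na_values=("\\N",)):
    """Read every CSV in `folder` into a dict keyed by file name (no extension).

    `na_values` lists placeholder strings to treat as missing. The default
    is an assumption about this dataset, not a confirmed fact.
    """
    tables = {}
    for csv_path in sorted(Path(folder).glob("*.csv")):
        tables[csv_path.stem] = pd.read_csv(csv_path, na_values=list(na_values))
    return tables


# Example usage (the folder is the download path printed earlier):
# tables = load_tables('/kaggle/input/formula-1-world-championship-1950-2020')
# circuits = tables['circuits']
```

Keeping the frames in a dict also makes it easy to run the same check on every table in one loop.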

## Reviewing Datasets

### Circuits

#### The components of the dataset

The table has 9 columns:

*   `circuitId` - The ID of the circuit
*   `circuitRef` - A short text reference for the circuit
*   `name` - The name of the circuit
*   `location` - The city where the circuit is located
*   `country` - The country where the circuit is located
*   `lat` - Latitude
*   `lng` - Longitude
*   `alt` - Altitude
*   `url` - The Wikipedia URL for the circuit



In [None]:
circuits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   circuitId   77 non-null     int64  
 1   circuitRef  77 non-null     object 
 2   name        77 non-null     object 
 3   location    77 non-null     object 
 4   country     77 non-null     object 
 5   lat         77 non-null     float64
 6   lng         77 non-null     float64
 7   alt         77 non-null     int64  
 8   url         77 non-null     object 
dtypes: float64(2), int64(2), object(5)
memory usage: 5.5+ KB


The table contains **integer, float and object** dtypes.

#### Missing values

In [None]:
circuits.isna().sum()

circuitId,0
circuitRef,0
name,0
location,0
country,0
lat,0
lng,0
alt,0
url,0


We **don't have any missing values**

#### Duplicate and Unique values

In [None]:
circuits.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in circuits.columns:
  print(f'{col}: {circuits[col].duplicated().sum()}')

circuitId: 0
circuitRef: 0
name: 0
location: 2
country: 42
lat: 0
lng: 0
alt: 11
url: 0


Some values repeat in the `location`, `country` and `alt` columns, but the key column `circuitId` has no duplicates, so **every row is unique**.
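The same three checks (missing values, duplicated rows, uniqueness of the key column) are repeated for every table below. As a minimal sketch, they could be bundled into one function. `quality_report` and the `demo` frame are hypothetical names introduced only for illustration.

```python
import pandas as pd


def quality_report(df, key):
    """Summarise the checks used in this notebook for one table.

    `key` is the column (or list of columns) expected to identify a row.
    """
    return {
        "rows": len(df),
        "missing_cells": int(df.isna().sum().sum()),
        "duplicated_rows": int(df.duplicated().sum()),
        "key_is_unique": not df.duplicated(subset=key).any(),
    }


# Tiny made-up frame standing in for `circuits`.
demo = pd.DataFrame({
    "circuitId": [1, 2, 3],
    "country": ["Italy", "Italy", "UK"],
})
print(quality_report(demo, key="circuitId"))
```

On the real data the call would look like `quality_report(circuits, key="circuitId")`.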

### Constructor_results

#### The components of the dataset

The table has 5 columns:

*   `constructorResultsId` - The ID of the constructor result
*   `raceId` - The ID of the race
*   `constructorId` - The ID of the constructor
*   `points` - The points the constructor scored in that race
*   `status` - A status flag for the constructor's result

In [None]:
constructor_results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12625 entries, 0 to 12624
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   constructorResultsId  12625 non-null  int64  
 1   raceId                12625 non-null  int64  
 2   constructorId         12625 non-null  int64  
 3   points                12625 non-null  float64
 4   status                12625 non-null  object 
dtypes: float64(1), int64(3), object(1)
memory usage: 493.3+ KB


The table contains **integer, float and object** dtypes.

#### Missing values

In [None]:
constructor_results.isna().sum()

constructorResultsId,0
raceId,0
constructorId,0
points,0
status,0


We **don't have any missing values**

#### Duplicate and Unique values

In [None]:
constructor_results.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in constructor_results.columns:
  print(f'{col}: {constructor_results[col].duplicated().sum()}')

constructorResultsId: 0
raceId: 11565
constructorId: 12450
points: 12564
status: 12623


Values repeat in the `raceId`, `constructorId`, `points` and `status` columns, which is expected because each race and each constructor appears in many rows. The key column `constructorResultsId` has no duplicates, so **every row is unique**.

### Constructor_standings

#### The components of the dataset

The table has 7 columns:

*   `constructorStandingsId` - The ID of the constructor standing
*   `raceId` - The ID of the race
*   `constructorId` - The ID of the constructor
*   `points` - The points the constructor has accumulated in the championship after that race
*   `position` - The constructor's position in the standings
*   `positionText` - The same position stored as text
*   `wins` - The number of wins the constructor has accumulated

In [None]:
constructor_standings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13391 entries, 0 to 13390
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   constructorStandingsId  13391 non-null  int64  
 1   raceId                  13391 non-null  int64  
 2   constructorId           13391 non-null  int64  
 3   points                  13391 non-null  float64
 4   position                13391 non-null  int64  
 5   positionText            13391 non-null  object 
 6   wins                    13391 non-null  int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 732.4+ KB


The table contains **integer, float and object** dtypes.

#### Missing values

In [None]:
constructor_standings.isna().sum()

constructorStandingsId,0
raceId,0
constructorId,0
points,0
position,0
positionText,0
wins,0


We **don't have any missing values**

#### Duplicate and Unique values

In [None]:
constructor_standings.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in constructor_standings.columns:
  print(f'{col}: {constructor_standings[col].duplicated().sum()}')

constructorStandingsId: 0
raceId: 12330
constructorId: 13231
points: 12812
position: 13369
positionText: 13368
wins: 13369


Values repeat in the `raceId`, `constructorId`, `points`, `position`, `positionText` and `wins` columns, but the key column `constructorStandingsId` has no duplicates, so **every row is unique**.

### Constructors

#### The components of the dataset

The table has 5 columns:

*   `constructorId` - The ID of the constructor
*   `constructorRef` - A short text reference for the constructor
*   `name` - The name of the constructor
*   `nationality` - The nationality of the constructor
*   `url` - The Wikipedia URL for the constructor


In [None]:
constructors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 212 entries, 0 to 211
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   constructorId   212 non-null    int64 
 1   constructorRef  212 non-null    object
 2   name            212 non-null    object
 3   nationality     212 non-null    object
 4   url             212 non-null    object
dtypes: int64(1), object(4)
memory usage: 8.4+ KB


The table contains **integer and object** dtypes.

#### Missing values

In [None]:
constructors.isna().sum()

constructorId,0
constructorRef,0
name,0
nationality,0
url,0


We **don't have any missing values**

#### Duplicate and Unique values

In [None]:
constructors.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in constructors.columns:
  print(f'{col}: {constructors[col].duplicated().sum()}')

constructorId: 0
constructorRef: 0
name: 0
nationality: 188
url: 37


Values repeat in the `nationality` and `url` columns, but the key column `constructorId` has no duplicates, so **every row is unique**.

### Driver_standings

#### The components of the dataset

The table has 7 columns:

*   `driverStandingsId` - The ID of the driver standing
*   `raceId` - The ID of the race
*   `driverId` - The ID of the driver
*   `points` - The points the driver has accumulated in the championship after that race
*   `position` - The driver's position in the standings
*   `positionText` - The same position stored as text
*   `wins` - The number of wins the driver has accumulated

In [None]:
driver_standings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34863 entries, 0 to 34862
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   driverStandingsId  34863 non-null  int64  
 1   raceId             34863 non-null  int64  
 2   driverId           34863 non-null  int64  
 3   points             34863 non-null  float64
 4   position           34863 non-null  int64  
 5   positionText       34863 non-null  object 
 6   wins               34863 non-null  int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 1.9+ MB


The table contains **integer, float and object** dtypes.

#### Missing values

In [None]:
driver_standings.isna().sum()

driverStandingsId,0
raceId,0
driverId,0
points,0
position,0
positionText,0
wins,0


We **don't have any missing values**

#### Duplicate and Unique values

In [None]:
driver_standings.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in driver_standings.columns:
  print(f'{col}: {driver_standings[col].duplicated().sum()}')

driverStandingsId: 0
raceId: 33738
driverId: 34009
points: 34421
position: 34755
positionText: 34754
wins: 34843


Values repeat in the `raceId`, `driverId`, `points`, `position`, `positionText` and `wins` columns, but the key column `driverStandingsId` has no duplicates, so **every row is unique**.

### Drivers

#### The components of the dataset

The table has 9 columns:

*   `driverId` - The ID of the driver
*   `driverRef` - A short text reference for the driver
*   `number` - The driver's car number
*   `code` - The driver's code (a short abbreviation of the name)
*   `forename` - The forename of the driver
*   `surname` - The surname of the driver
*   `dob` - The driver's date of birth
*   `nationality` - The nationality of the driver
*   `url` - The Wikipedia URL for the driver

In [None]:
drivers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 861 entries, 0 to 860
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   driverId     861 non-null    int64 
 1   driverRef    861 non-null    object
 2   number       861 non-null    object
 3   code         861 non-null    object
 4   forename     861 non-null    object
 5   surname      861 non-null    object
 6   dob          861 non-null    object
 7   nationality  861 non-null    object
 8   url          861 non-null    object
dtypes: int64(1), object(8)
memory usage: 60.7+ KB


The table contains **integer and object** dtypes.

#### Missing values

In [None]:
drivers.isna().sum()

driverId,0
driverRef,0
number,0
code,0
forename,0
surname,0
dob,0
nationality,0
url,0


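`isna()` reports **no missing values**. Note, however, that `number` is stored as `object` rather than as an integer, which suggests that some entries are non-numeric placeholders that `isna()` does not count.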
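Because `number` is read as `object`, it may hold a text placeholder that `isna()` does not recognise as missing. The sketch below counts such placeholders per column. The placeholder string `\N`, the helper name `placeholder_counts` and the `demo_drivers` rows are all assumptions made for illustration and should be checked against the real file.

```python
import pandas as pd


def placeholder_counts(df, placeholder="\\N"):
    """Count the cells in each column that equal `placeholder` exactly."""
    return df.eq(placeholder).sum()


# Made-up rows using a few of the `drivers` column names.
demo_drivers = pd.DataFrame({
    "driverId": [1, 2, 3],
    "number": ["44", "\\N", "\\N"],
    "code": ["HAM", "HEI", "\\N"],
})

print(demo_drivers.isna().sum().sum())   # what isna() reports
print(placeholder_counts(demo_drivers))  # placeholder cells per column
```

Running `placeholder_counts(drivers)` on the real table would show whether this assumption holds.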

#### Duplicate and Unique values

In [None]:
drivers.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in drivers.columns:
  print(f'{col}: {drivers[col].duplicated().sum()}')

driverId: 0
driverRef: 0
number: 812
code: 763
forename: 383
surname: 59
dob: 18
nationality: 818
url: 0


Values repeat in the `number`, `code`, `forename`, `surname`, `dob` and `nationality` columns, but the key column `driverId` has no duplicates, so **every row is unique**.

### Lap_times

#### The components of the dataset

The table has 6 columns:

*   `raceId` - The ID of the race
*   `driverId` - The ID of the driver
*   `lap` - The lap number
*   `position` - The driver's position in the race on that lap
*   `time` - The lap time as text
*   `milliseconds` - The lap time in milliseconds


In [None]:
lap_times.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 589081 entries, 0 to 589080
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   raceId        589081 non-null  int64 
 1   driverId      589081 non-null  int64 
 2   lap           589081 non-null  int64 
 3   position      589081 non-null  int64 
 4   time          589081 non-null  object
 5   milliseconds  589081 non-null  int64 
dtypes: int64(5), object(1)
memory usage: 27.0+ MB


The table contains **integer and object** dtypes.

#### Missing values

In [None]:
lap_times.isna().sum()

raceId,0
driverId,0
lap,0
position,0
time,0
milliseconds,0


We **don't have any missing values**

#### Duplicate and Unique values

In [None]:
lap_times.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in lap_times.columns:
  print(f'{col}: {lap_times[col].duplicated().sum()}')

raceId: 588537
driverId: 588938
lap: 588994
position: 589057
time: 513275
milliseconds: 513275


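**Every column contains repeated values**, because this table has no single ID column. A row should instead be identified by the combination of `raceId`, `driverId` and `lap`.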
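Since no single column in `lap_times` is unique, uniqueness has to be checked on a combination of columns. The sketch below assumes that `raceId`, `driverId` and `lap` together form the key. That is a reasonable guess from the column meanings, not something the dataset documents here, and `demo_laps` is made-up data.

```python
import pandas as pd

# Made-up rows with the same column names as `lap_times`.
demo_laps = pd.DataFrame({
    "raceId": [1, 1, 1, 2],
    "driverId": [10, 10, 20, 10],
    "lap": [1, 2, 1, 1],
    "milliseconds": [90500, 89900, 90500, 91000],
})

key = ["raceId", "driverId", "lap"]  # assumed composite key

# Each key column repeats values when looked at on its own...
per_column = demo_laps[key].apply(lambda s: s.duplicated().sum())
print(per_column)

# ...but the combination of the three columns can still be unique.
n_dupes = demo_laps.duplicated(subset=key).sum()
print(n_dupes)
```

On the real table, `lap_times.duplicated(subset=key).sum()` returning zero would confirm the assumed key.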

### Pit_stops

#### The components of the dataset

The table has 7 columns:

*   `raceId` - The ID of the race
*   `driverId` - The ID of the driver
*   `stop` - The stop number for that driver in the race (1 for the first stop, 2 for the second, and so on)
*   `lap` - The lap on which the pit stop took place
*   `time` - The time of day at which the pit stop took place
*   `duration` - The duration of the pit stop
*   `milliseconds` - The duration of the pit stop in milliseconds

In [None]:
pit_stops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11371 entries, 0 to 11370
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   raceId        11371 non-null  int64 
 1   driverId      11371 non-null  int64 
 2   stop          11371 non-null  int64 
 3   lap           11371 non-null  int64 
 4   time          11371 non-null  object
 5   duration      11371 non-null  object
 6   milliseconds  11371 non-null  int64 
dtypes: int64(5), object(2)
memory usage: 622.0+ KB


The table contains **integer and object** dtypes.
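The `duration` column is stored as `object` even though it looks numeric. One plausible reason, which is an assumption here and worth checking against the data, is that long stops are written with a minutes part (for example `16:44.718`) instead of plain seconds. The hypothetical helper `duration_to_ms` below sketches how both forms could be converted to milliseconds so that they can be compared with the `milliseconds` column.

```python
import pandas as pd


def duration_to_ms(text):
    """Convert 'SS.sss' or 'MM:SS.sss' to whole milliseconds.

    Only these two formats are handled; anything else raises ValueError.
    """
    minutes, _, seconds = text.rpartition(":")
    total_seconds = float(seconds) + 60 * int(minutes or 0)
    return round(total_seconds * 1000)


# Made-up durations covering both assumed formats.
demo_stops = pd.DataFrame({"duration": ["23.456", "16:44.718"]})
demo_stops["duration_ms"] = demo_stops["duration"].map(duration_to_ms)
print(demo_stops)
```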

#### Missing values

In [None]:
pit_stops.isna().sum()

raceId,0
driverId,0
stop,0
lap,0
time,0
duration,0
milliseconds,0


We **don't have any missing values**

#### Duplicate and Unique values

In [None]:
pit_stops.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in pit_stops.columns:
  print(f'{col}: {pit_stops[col].duplicated().sum()}')

raceId: 11086
driverId: 11295
stop: 11357
lap: 11297
time: 3144
duration: 3767
milliseconds: 3767


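**Every column contains repeated values**, because this table has no single ID column. A row should instead be identified by the combination of `raceId`, `driverId` and `stop`.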

### Qualifying

#### The components of the dataset

The table has 9 columns:

*   `qualifyId` - The ID of the qualifying result
*   `raceId` - The ID of the race
*   `driverId` - The ID of the driver
*   `constructorId` - The ID of the constructor
*   `number` - The driver's car number
*   `position` - The driver's qualifying position
*   `q1` - The driver's time in Q1
*   `q2` - The driver's time in Q2
*   `q3` - The driver's time in Q3

In [None]:
qualifying.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10494 entries, 0 to 10493
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   qualifyId      10494 non-null  int64 
 1   raceId         10494 non-null  int64 
 2   driverId       10494 non-null  int64 
 3   constructorId  10494 non-null  int64 
 4   number         10494 non-null  int64 
 5   position       10494 non-null  int64 
 6   q1             10494 non-null  object
 7   q2             10472 non-null  object
 8   q3             10448 non-null  object
dtypes: int64(6), object(3)
memory usage: 738.0+ KB


The table contains **integer and object** dtypes.

#### Missing values

In [None]:
qualifying.isna().sum()

qualifyId,0
raceId,0
driverId,0
constructorId,0
number,0
position,0
q1,0
q2,22
q3,46


There are **missing values in `q2` and `q3`**. This is consistent with the F1 qualifying rules, under which **not every driver takes part in Q2 and Q3**.
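To see how far each driver got in qualifying, one option is to count the sessions that have a recorded time. The sketch below uses made-up rows in `demo_quali`. It also treats the string `\N` as "no time", which is an assumption about how the file might mark empty entries. The new column name `sessions_with_time` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows with the `qualifying` session columns.
demo_quali = pd.DataFrame({
    "driverId": [1, 2, 3],
    "q1": ["1:26.572", "1:26.103", "1:27.900"],
    "q2": ["1:25.187", "1:25.229", np.nan],
    "q3": ["1:26.714", np.nan, np.nan],
})

sessions = ["q1", "q2", "q3"]

# A time counts as recorded when it is neither NaN nor the assumed placeholder.
has_time = demo_quali[sessions].notna() & demo_quali[sessions].ne("\\N")

# Number of sessions with a recorded time for each row.
demo_quali["sessions_with_time"] = has_time.sum(axis=1)
print(demo_quali[["driverId", "sessions_with_time"]])
```

A count of 3 means a time was set in every session, while 1 means only a Q1 time exists. A driver who reached a session without setting a time would be undercounted.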

#### Duplicate and Unique values

In [None]:
qualifying.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in qualifying.columns:
  print(f'{col}: {qualifying[col].duplicated().sum()}')

qualifyId: 0
raceId: 10000
driverId: 10322
constructorId: 10447
number: 10436
position: 10466
q1: 1347
q2: 5019
q3: 7022


Values repeat in every column except `qualifyId`. Since `qualifyId` is the key column and has no duplicates, **every row is unique**.

### Races

#### The components of the dataset

The table has 18 columns:

*   `raceId` - The ID of the race
*   `year` - The year of the race
*   `round` - The round of the season
*   `circuitId` - The ID of the circuit
*   `name` - The name of the race
*   `date` - The date of the race
*   `time` - The start time of the race
*   `url` - The Wikipedia URL for the race
*   `fp1_date` - The date of practice 1
*   `fp1_time` - The time of practice 1
*   `fp2_date` - The date of practice 2
*   `fp2_time` - The time of practice 2
*   `fp3_date` - The date of practice 3
*   `fp3_time` - The time of practice 3
*   `quali_date` - The date of qualifying
*   `quali_time` - The time of qualifying
*   `sprint_date` - The date of the sprint
*   `sprint_time` - The time of the sprint

In [None]:
races.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1125 entries, 0 to 1124
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   raceId       1125 non-null   int64 
 1   year         1125 non-null   int64 
 2   round        1125 non-null   int64 
 3   circuitId    1125 non-null   int64 
 4   name         1125 non-null   object
 5   date         1125 non-null   object
 6   time         1125 non-null   object
 7   url          1125 non-null   object
 8   fp1_date     1125 non-null   object
 9   fp1_time     1125 non-null   object
 10  fp2_date     1125 non-null   object
 11  fp2_time     1125 non-null   object
 12  fp3_date     1125 non-null   object
 13  fp3_time     1125 non-null   object
 14  quali_date   1125 non-null   object
 15  quali_time   1125 non-null   object
 16  sprint_date  1125 non-null   object
 17  sprint_time  1125 non-null   object
dtypes: int64(4), object(14)
memory usage: 158.3+ KB


The table contains **integer and object** dtypes.

#### Missing values

In [None]:
races.isna().sum()

raceId,0
year,0
round,0
circuitId,0
name,0
date,0
time,0
url,0
fp1_date,0
fp1_time,0


We **don't have any missing values**

#### Duplicate and Unique values

In [None]:
races.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in races.columns:
  print(f'{col}: {races[col].duplicated().sum()}')

raceId: 0
year: 1050
round: 1101
circuitId: 1048
name: 1071
date: 0
time: 1090
url: 0
fp1_date: 1034
fp1_time: 1104
fp2_date: 1034
fp2_time: 1105
fp3_date: 1052
fp3_time: 1106
quali_date: 1034
quali_time: 1109
sprint_date: 1106
sprint_time: 1112


Values repeat in every column except `raceId`, `date` and `url`. Since `raceId` is the key column and has no duplicates, **every row is unique**.

### Results

#### The components of the dataset

The table has 18 columns:

*   `resultId` - The ID of the result
*   `raceId` - The ID of the race
*   `driverId` - The ID of the driver
*   `constructorId` - The ID of the constructor
*   `number` - The driver's car number
*   `grid` - The driver's starting position on the grid
*   `position` - The driver's finishing position
*   `positionText` - The driver's finishing position in text format
*   `positionOrder` - The finishing order stored as an integer
*   `points` - The points the driver scored
*   `laps` - The number of laps the driver completed
*   `time` - The driver's race time
*   `milliseconds` - The driver's race time in milliseconds
*   `fastestLap` - The lap number on which the driver set their fastest time
*   `rank` - The rank of the driver's fastest lap
*   `fastestLapTime` - The time of the fastest lap
*   `fastestLapSpeed` - The speed of the fastest lap
*   `statusId` - The ID of the status

In [None]:
results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26759 entries, 0 to 26758
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   resultId         26759 non-null  int64  
 1   raceId           26759 non-null  int64  
 2   driverId         26759 non-null  int64  
 3   constructorId    26759 non-null  int64  
 4   number           26759 non-null  object 
 5   grid             26759 non-null  int64  
 6   position         26759 non-null  object 
 7   positionText     26759 non-null  object 
 8   positionOrder    26759 non-null  int64  
 9   points           26759 non-null  float64
 10  laps             26759 non-null  int64  
 11  time             26759 non-null  object 
 12  milliseconds     26759 non-null  object 
 13  fastestLap       26759 non-null  object 
 14  rank             26759 non-null  object 
 15  fastestLapTime   26759 non-null  object 
 16  fastestLapSpeed  26759 non-null  object 
 17  statusId    

The table contains **integer, float and object** dtypes.

#### Missing values

In [None]:
results.isna().sum()

resultId,0
raceId,0
driverId,0
constructorId,0
number,0
grid,0
position,0
positionText,0
positionOrder,0
points,0


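`isna()` reports **no missing values**. Note, however, that numeric-looking columns such as `position` and `milliseconds` are stored as `object`, which suggests that some entries are non-numeric placeholders that `isna()` does not count.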
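Several `results` columns that look numeric (`number`, `position`, `milliseconds`, `fastestLap`, `rank`) are stored as `object`. As a minimal sketch, `pd.to_numeric` with `errors="coerce"` converts them and turns anything non-numeric into `NaN`, which makes hidden gaps visible to `isna()`. The rows in `demo_results` are made up, and the `\N` placeholder is an assumption about the file.

```python
import pandas as pd

# Made-up rows with a few `results` column names.
demo_results = pd.DataFrame({
    "resultId": [1, 2, 3],
    "position": ["1", "2", "\\N"],
    "milliseconds": ["5690616", "5696094", "\\N"],
})

numeric_like = ["position", "milliseconds"]
converted = demo_results.copy()
for col in numeric_like:
    # errors="coerce" replaces values that cannot be parsed with NaN.
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

print(converted.dtypes)
print(converted.isna().sum())
```

Working on a copy keeps the original strings available in case the placeholders carry meaning, such as a driver who did not finish.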

#### Duplicate and Unique values

In [None]:
results.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in results.columns:
  print(f'{col}: {results[col].duplicated().sum()}')

resultId: 0
raceId: 25634
driverId: 25898
constructorId: 26548
number: 26629
grid: 26724
position: 26725
positionText: 26720
positionOrder: 26720
points: 26720
laps: 26587
time: 19348
milliseconds: 19120
fastestLap: 26678
rank: 26733
fastestLapTime: 19285
fastestLapSpeed: 19034
statusId: 26622


Values repeat in every column except `resultId`. Since `resultId` is the key column and has no duplicates, **every row is unique**.

### Seasons

#### The components of the dataset

The table has 2 columns:

*   `year` - The year of the season
*   `url` - The Wikipedia URL for the season


In [None]:
seasons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    75 non-null     int64 
 1   url     75 non-null     object
dtypes: int64(1), object(1)
memory usage: 1.3+ KB


The table contains **integer and object** dtypes.

#### Missing values

In [None]:
seasons.isna().sum()

year,0
url,0


We **don't have any missing values**

#### Duplicate and Unique values

In [None]:
seasons.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in seasons.columns:
  print(f'{col}: {seasons[col].duplicated().sum()}')

year: 0
url: 0


We **don't have any duplicates.**

### Sprint_results

#### The components of the dataset

The table has 16 columns:

*   `resultId` - The ID of the result
*   `raceId` - The ID of the race
*   `driverId` - The ID of the driver
*   `constructorId` - The ID of the constructor
*   `number` - The driver's car number
*   `grid` - The driver's starting position on the grid
*   `position` - The driver's finishing position
*   `positionText` - The driver's finishing position in text format
*   `positionOrder` - The finishing order stored as an integer
*   `points` - The points the driver scored
*   `laps` - The number of laps the driver completed
*   `time` - The driver's sprint time
*   `milliseconds` - The driver's sprint time in milliseconds
*   `fastestLap` - The lap number on which the driver set their fastest time
*   `fastestLapTime` - The time of the driver's fastest lap
*   `statusId` - The ID of the status

In [None]:
sprint_results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360 entries, 0 to 359
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   resultId        360 non-null    int64 
 1   raceId          360 non-null    int64 
 2   driverId        360 non-null    int64 
 3   constructorId   360 non-null    int64 
 4   number          360 non-null    int64 
 5   grid            360 non-null    int64 
 6   position        360 non-null    object
 7   positionText    360 non-null    object
 8   positionOrder   360 non-null    int64 
 9   points          360 non-null    int64 
 10  laps            360 non-null    int64 
 11  time            360 non-null    object
 12  milliseconds    360 non-null    object
 13  fastestLap      360 non-null    object
 14  fastestLapTime  360 non-null    object
 15  statusId        360 non-null    int64 
dtypes: int64(10), object(6)
memory usage: 45.1+ KB


The table contains **integer and object** dtypes.

#### Missing values

In [None]:
sprint_results.isna().sum()

resultId,0
raceId,0
driverId,0
constructorId,0
number,0
grid,0
position,0
positionText,0
positionOrder,0
points,0


We **don't have any missing values**

#### Duplicate and Unique values

In [None]:
sprint_results.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in sprint_results.columns:
  print(f'{col}: {sprint_results[col].duplicated().sum()}')

resultId: 0
raceId: 342
driverId: 329
constructorId: 348
number: 327
grid: 339
position: 339
positionText: 337
positionOrder: 340
points: 351
laps: 346
time: 19
milliseconds: 19
fastestLap: 336
fastestLapTime: 9
statusId: 352


Values repeat in every column except `resultId`. Since `resultId` is the key column and has no duplicates, **every row is unique**.

### Status

#### The components of the dataset

The table has 2 columns:

*   `statusId` - The ID of the status
*   `status` - The text description of the status

In [None]:
status.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139 entries, 0 to 138
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   statusId  139 non-null    int64 
 1   status    139 non-null    object
dtypes: int64(1), object(1)
memory usage: 2.3+ KB


The table contains **integer and object** dtypes.

#### Missing values

In [None]:
status.isna().sum()

statusId,0
status,0


We **don't have any missing values**

#### Duplicate and Unique values

In [None]:
status.duplicated().sum()

np.int64(0)

There are **no duplicated rows**

In [None]:
for col in status.columns:
  print(f'{col}: {status[col].duplicated().sum()}')

statusId: 0
status: 0


We **don't have any duplicates.**
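A unique ID column in each table is one half of the picture. The other half is whether the ID columns that other tables use as references, such as `statusId` in `results`, always point at an existing row. The sketch below shows one way to look for unmatched references. `demo_results` and `demo_status` are made-up miniature frames, not rows from the dataset.

```python
import pandas as pd

# Made-up miniature versions of `results` and `status`.
demo_results = pd.DataFrame({
    "resultId": [1, 2, 3],
    "statusId": [1, 1, 4],
})
demo_status = pd.DataFrame({
    "statusId": [1, 2, 3],
    "status": ["Finished", "Disqualified", "Accident"],
})

# Rows whose statusId has no match in the lookup table.
is_known = demo_results["statusId"].isin(demo_status["statusId"])
orphans = demo_results[~is_known]
print(orphans)
```

The same pattern applies to the other reference columns, for example `results["raceId"]` against `races["raceId"]`.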