**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Sanaya Vakharia
- Dhyay Thakrar
- Megha Puskur
- Daniella

# Research Question

How do track characteristics, driver performance metrics, and weather conditions affect the probability of winning a Formula 1 race?


# Background and Prior Work

## Background & Introduction to Formula 1 Racing
Formula 1 racing combines high-speed competition with complex engineering, where predictive analytics play a crucial role in decision-making. Teams utilize vast amounts of data, from tire degradation to driver performance, to optimize race strategies and improve outcomes.

## Prior Work
1. **Application of Machine Learning**: Recent studies, such as the one documented by Aalto University, have explored the use of machine learning algorithms like Support Vector Machines, Random Forest, and Neural Networks to predict the timing of pit stops in F1 races. This research highlights the importance of various race factors, including tire degradation and weather conditions, in decision-making processes. ([Aaltodoc](https://aaltodoc.aalto.fi/server/api/core/bitstreams/70d5a580-c282-4278-8462-94d061471546/content))

2. **Comprehensive Data Analysis Platforms**: Platforms like the one found in the GitHub repository 'f1-analysis' use the FastF1 library to manipulate telemetry and race data for in-depth insights into race performance and strategies. This repository serves as a valuable resource for enthusiasts and analysts looking to perform statistical analyses and create data visualizations based on F1 race data. ([f1-analysis](https://github.com/lalutir/f1-analysis))

3. **Industrial Impact of Predictive Analytics**: An overview provided by sites like Penn State's page on predictive analytics in F1 discusses the industrial relevance and financial impact of predictive analytics in the sport. It outlines how teams leverage real-time data from over 300 sensors on each car to make strategic decisions that can alter race results, illustrating the blend of data science with competitive sports to enhance team performance and race strategy. ([F1 and predictive analysis](https://sites.psu.edu/aboutoliviadisanti/2023/11/14/unleashing-the-power-of-data-science-in-formula-1/))


# Hypothesis


Our hypothesis is that track characteristics, driver performance metrics, and weather conditions each significantly influence the probability of winning a Formula 1 race. 

Track features, such as the layout and corner types, can favor different car setups, potentially creating positive or negative correlations with winning likelihood depending on the car’s strengths. For instance, high-speed tracks may positively correlate with cars optimized for straight-line speed, while technical circuits could benefit those with strong downforce. Driver metrics, including qualifying position, overtaking ability, and consistency, are expected to correlate positively with success, as they reflect a driver’s ability to maximize their position and minimize errors under varying race conditions. Weather conditions add a layer of unpredictability, where rain, temperature, and humidity can shift race dynamics. Rain, for example, may benefit drivers skilled in wet conditions, while high temperatures could challenge teams with poor tire or cooling management, introducing both positive and negative correlations. 

Previous studies and data analyses, such as those referenced in existing research on machine learning applications in F1 (e.g., the study at Aalto University), suggest that these variables not only affect race strategy but also directly influence race results through their impact on vehicle handling, tire wear, and driver safety. The interaction of these variables is likely complex, given the dynamic and highly conditional nature of racing, where the optimal strategy must continuously adapt to the evolving race conditions. ([Aaltodoc](https://aaltodoc.aalto.fi/server/api/core/bitstreams/70d5a580-c282-4278-8462-94d061471546/content))
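
One possible way to make "probability of winning" concrete for the driver-metric part of this hypothesis is to compute the share of races won from each starting position. The sketch below is only an illustration: the table is made up, and the column names `grid` and `positionOrder` are assumptions that may not match the cleaned results file.

```python
import pandas as pd

# Made-up results for three races with three drivers each (illustration only).
# Column names are assumptions, not confirmed against the real dataset.
toy_results = pd.DataFrame(
    {
        "raceId": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "grid": [1, 2, 3, 1, 2, 3, 1, 2, 3],
        "positionOrder": [1, 2, 3, 2, 1, 3, 1, 3, 2],
    }
)

# Binary outcome: did the driver finish first?
toy_results["won"] = toy_results["positionOrder"] == 1

# Empirical win rate for each starting position.
win_rate_by_grid = toy_results.groupby("grid")["won"].mean()
print(win_rate_by_grid)
```

A win rate that falls as the starting position gets worse would be consistent with the expected link between qualifying position and success, although it would not by itself separate driver skill from car performance.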


# Data

## Data overview

- Master Dataset: 
  - Dataset Name: Formula 1 Dataset (1950 - 2023) [Cleaned]
  - Link to the dataset: https://www.kaggle.com/datasets/suletanmay/formula-1-dataset-1950-2023-cleaned 
  - Number of observations: 14 csv files 

- Dataset 1: 
  - Dataset Name: Cleaned Circuits
  - Link to the dataset: data/cleaned_circuits.csv
  - Number of observations: 77
  - Number of variables: 5
  
- Dataset 2: 
  - Dataset Name: Cleaned Constructor Results
  - Link to the dataset: data/cleaned_constructor_results.csv
  - Number of observations: 12290
  - Number of variables: 4

- Dataset 3: 
  - Dataset Name: Cleaned Constructor Standings
  - Link to the dataset: data/cleaned_constructor_standings.csv 
  - Number of observations: 13051
  - Number of variables: 6
  
- Dataset 4: 
  - Dataset Name: Cleaned Constructors
  - Link to the dataset: data/cleaned_constructors.csv 
  - Number of observations: 211
  - Number of variables: 4
  
- Dataset 5: 
  - Dataset Name: Cleaned Driver Standings
  - Link to the dataset: data/cleaned_driver_standings.csv
  - Number of observations: 34124
  - Number of variables: 6

- Dataset 6: 
  - Dataset Name: Cleaned Drivers
  - Link to the dataset: data/cleaned_drivers.csv
  - Number of observations: 857
  - Number of variables: 7

- Dataset 7: 
  - Dataset Name: Cleaned Lap Times 
  - Link to the dataset: data/cleaned_lap_times.csv 
  - Number of observations: 551742
  - Number of variables: 5

- Dataset 8: 
  - Dataset Name: Cleaned Pit Stops 
  - Link to the dataset: data/cleaned_pit_stops.csv 
  - Number of observations: 10089
  - Number of variables: 5
  
- Dataset 9: 
  - Dataset Name: Cleaned Qualifying 
  - Link to the dataset: data/cleaned_qualifying.csv 
  - Number of observations: 9815
  - Number of variables: 9

- Dataset 10: 
  - Dataset Name: Cleaned Races 
  - Link to the dataset: data/cleaned_races.csv 
  - Number of observations: 1101
  - Number of variables: 4
  
- Dataset 11: 
  - Dataset Name: Cleaned Results
  - Link to the dataset: data/cleaned_results.csv
  - Number of observations: 14 csv files 
  - Number of variables:
  
- Dataset 12: 
  - Dataset Name: Cleaned Seasons 
  - Link to the dataset: data/cleaned_seasons.csv 
  - Number of observations: 74
  - Number of variables: 2

- Dataset 13: 
  - Dataset Name: Cleaned Sprint Results
  - Link to the dataset: data/cleaned_sprint_results.csv 
  - Number of observations: 180
  - Number of variables: 12

- Dataset 14: 
  - Dataset Name: Cleaned Status 
  - Link to the dataset: data/cleaned_status.csv
  - Number of observations: 139
  - Number of variables: 2

All the datasets come pre-cleaned: we double-checked that they contain no missing values and that everything is formatted correctly. They do include many columns that are not relevant to our question, so we removed those. We also combined some of the datasets to make computation and EDA easier.
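
The missing-value check mentioned above can be sketched as follows. This is a minimal illustration, assuming the tables are held in a dictionary of pandas DataFrames; the helper name `missing_value_report` and the two toy tables are hypothetical and stand in for the real CSV files.

```python
import pandas as pd


def missing_value_report(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Summarize row count, column count, and missing cells for each table."""
    rows = []
    for name, df in frames.items():
        rows.append(
            {
                "table": name,
                "n_rows": df.shape[0],
                "n_cols": df.shape[1],
                "n_missing": int(df.isna().sum().sum()),
            }
        )
    return pd.DataFrame(rows).set_index("table")


# Toy tables (made up for illustration, not the real data).
toy_frames = {
    "circuits": pd.DataFrame({"circuitId": [1, 2], "circuit_name": ["A", "B"]}),
    "races": pd.DataFrame({"raceId": [10, 11, 12], "circuitId": [1, 2, None]}),
}

report = missing_value_report(toy_frames)
print(report)
```

Note that `isna()` only counts true missing values. If a file used placeholder strings for missing entries, those would need to be converted to missing values before a report like this would catch them.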

## Formula 1 Dataset (1950 - 2023) [Cleaned]

In [1]:
!pip install kagglehub



In [2]:
import os

import kagglehub
import pandas as pd

In [3]:
# Download the dataset and get the local folder that holds its files
path = kagglehub.dataset_download("suletanmay/formula-1-dataset-1950-2023-cleaned")



In [4]:
# Load each cleaned CSV file into its own DataFrame
circuits = pd.read_csv(os.path.join(path, 'cleaned_circuits.csv'))
constructor_results = pd.read_csv(os.path.join(path, 'cleaned_constructor_results.csv'))
constructor_standings = pd.read_csv(os.path.join(path, 'cleaned_constructor_standings.csv'))
constructors = pd.read_csv(os.path.join(path, 'cleaned_constructors.csv'))
driver_standings = pd.read_csv(os.path.join(path, 'cleaned_driver_standings.csv'))
drivers = pd.read_csv(os.path.join(path, 'cleaned_drivers.csv'))
lap_times = pd.read_csv(os.path.join(path, 'cleaned_lap_times.csv'))
pit_stops = pd.read_csv(os.path.join(path, 'cleaned_pit_stops.csv'))
qualifying = pd.read_csv(os.path.join(path, 'cleaned_qualifying.csv'))
races = pd.read_csv(os.path.join(path, 'cleaned_races.csv'))
results = pd.read_csv(os.path.join(path, 'cleaned_results.csv'))
seasons = pd.read_csv(os.path.join(path, 'cleaned_seasons.csv'))  # if needed
sprint_results = pd.read_csv(os.path.join(path, 'cleaned_sprint_results.csv'))  # if needed
status = pd.read_csv(os.path.join(path, 'cleaned_status.csv'))

In [5]:
# Keep only the identifier and name columns needed for the analysis
circuits = circuits[['circuitId', 'circuit_name']]
constructors = constructors[['constructorId', 'constructor_name']]
drivers = drivers[['driverId', 'driver_code', 'driver_surname']]

# Attach the circuit name to each race (circuits already holds only these two columns)
all_races = races[['raceId', 'year', 'circuitId']].merge(circuits, on='circuitId')
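
One way to guard a merge like the one in the last cell is to ask pandas to check the key relationship and to flag unmatched rows. The sketch below uses small made-up tables with the same column names as `races` and `circuits`; the toy values and the `toy_` variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up).
toy_races = pd.DataFrame(
    {"raceId": [1, 2, 3], "year": [2021, 2021, 2022], "circuitId": [10, 11, 10]}
)
toy_circuits = pd.DataFrame(
    {"circuitId": [10, 11], "circuit_name": ["Circuit A", "Circuit B"]}
)

# validate="many_to_one" raises a MergeError if circuitId is duplicated in
# toy_circuits. how="left" with indicator=True keeps every race and adds a
# "_merge" column that flags races with no matching circuit.
toy_all_races = toy_races.merge(
    toy_circuits,
    on="circuitId",
    how="left",
    validate="many_to_one",
    indicator=True,
)

unmatched = toy_all_races[toy_all_races["_merge"] != "both"]
print(toy_all_races.drop(columns="_merge"))
print(f"Races without a matching circuit: {len(unmatched)}")
```

The cell above relies on the default inner join, which drops unmatched races without warning. A left join with an indicator column makes any such rows visible before they are discarded.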

# Ethics & Privacy

Our team acknowledges the issues related to the unethical use of data. Misuse of personal data can endanger the privacy of individuals. Our data will contain information about previous race winners and race conditions, and, although it is unlikely, teams could use our model in the future to try to recreate these conditions and influence races.

# Team Expectations 

We expect everyone on our team to communicate effectively in our group text chat. We expect replies within 2 days on most days and faster replies as we get closer to deadlines. We have agreed to communicate openly: everyone should say what they feel about the project and other team members, bluntly but politely. If any of us sees that someone isn't voicing their opinions because they might be shy, we will do our best to encourage them to speak up. We would prefer decisions to be unanimous, and we will do our best to accommodate any team member who disagrees with the others. If that doesn't work out, we will go with the majority vote, because we believe everyone in our group is mature enough to handle disagreements. If a team member is not responding on time and the deadline is extremely close, we will also go with a majority vote. We have set a schedule with deadlines, as proposed below. If someone is struggling with a deliverable, we will do our best to help them and can give them an easier task for that deliverable; in return, they can contribute more when something else is due. We will publish a list of tasks in a Google spreadsheet and update it after each checkpoint in our project timeline. People can tick tasks off when they are done, and we hope to have everything completed before the deadline. We will assign tasks to everyone during our meetings.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 11/3  |  1 PM | Complete gathering the data  | Finalize the data and talk about how to go about cleaning it | 
| 11/10  |  1 PM |  Cleaning and organizing the data | Discuss the cleaned data and start with EDA/split it between team members |
| 11/16  | 1 PM  | EDA  | Validate the EDAs of everyone and interpret the findings   |
| 11/20  | 1 PM  | Come with more insights on the data | Start developing the model   |
| 11/25  | 1 PM  | Have an initial model ready | Discuss ways to refine the model |
| 11/30  | 1 PM  | Have a refined model | Start creating visualizations |
| 12/3  | 1 PM  | Done with visualizations | Start compiling the final project|
| 12/6  | 1 PM  | Done with final touch ups | Turn in Final Project |