<a href="https://colab.research.google.com/github/akjieettt/data-science-final-project/blob/main/DataScienceProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How Chemical Composition Affects Wine Quality

**Group Members**: Hrishi Kabra and Kiet Huynh

**Project Website**: https://akjieettt.github.io/data-science-final-project/

## Project Overview

This project investigates the relationship between the chemical composition of wine and its quality. We focus on identifying which physicochemical properties most strongly predict quality and how these relationships differ between red and white wines, with the aim of providing insights that could benefit winemakers and consumers.

### Research Questions

**Primary Question**: *What chemical properties most strongly predict wine quality?*

**Secondary Question**: *How do chemical compositions differ between red and white wines, and what are the optimal chemical ranges for high-quality wines?*

### Background

Wine quality assessment is traditionally based on expert sensory evaluation, but understanding the underlying chemical composition can provide objective insights into what makes a wine exceptional. The wine industry relies on physicochemical properties to guide production decisions and quality control.

The Portuguese "Vinho Verde" wine region produces both red and white wines with distinct characteristics. Each wine's quality is rated on a scale from 0 to 10 based on sensory evaluations by experts, who consider factors like aroma, taste, and overall balance.

### Motivation For This Project

Wine production is both an art and a science, and chemical composition plays a major role in quality. Understanding these relationships through data science can provide valuable insights for:
- **Winemakers**: Optimizing production processes and chemical formulations
- **Consumers**: Making informed purchasing decisions
- **Industry**: Quality control and standardization

Key factors we're investigating:
- **Chemical Balance**: How different acidity components interact
- **Wine Type Differences**: Red vs white wine chemical requirements
- **Quality Predictors**: Which properties matter most for high ratings

### Collaboration Plan

**Team Coordination:**
- Set up a private GitHub repository to coordinate all code, share datasets, and track progress
- Each member works on separate branches to implement features, which are merged via pull requests after code review to ensure consistency

**Technologies Used:**
- Version Control: Git and GitHub for source code management and collaboration
- Development Environment: Visual Studio Code Live Share, Google Colab, and Jupyter Notebooks for data analysis and prototyping
- Communication Tools: Small Family Collaboration Hub for in-person discussions, FaceTime for online discussions, and Google Docs for shared notes

**Meeting Schedule:**
- Meet in person 2-3 times per week, for 1-3 hours per session, to discuss progress, solve problems, and coordinate tasks
- Outside of scheduled meetings, we communicate asynchronously via iMessage to stay aligned and share updates

**Task Management:**
- Tasks are divided based on expertise and interest
- Progress is tracked via a shared progress table (in a spreadsheet) to ensure deadlines are met and responsibilities are clear

## Milestone 1: Initial ETL

### Data Sources

We are working with the [**Wine Quality**](https://archive.ics.uci.edu/dataset/186/wine+quality) dataset from UC Irvine's Machine Learning Repository. It consists of two files covering the red and white variants of the Portuguese "Vinho Verde" wine:

1. **winequality-red**: Data about red wines
   - **Source**: [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/186/wine+quality)
   - **Coverage**: 1,599 red wine samples
   - **Output**: Quality rating (0–10) assigned by tasters

2. **winequality-white**: Data about white wines
   - **Source**: [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/186/wine+quality)
   - **Coverage**: 4,898 white wine samples
   - **Output**: Quality rating (0–10) assigned by tasters

### Imports and Loading The Data

In [None]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load the wine datasets
df_reds = pd.read_csv("data/winequality-red.csv", sep=";")
df_whites = pd.read_csv("data/winequality-white.csv", sep=";")

# Add wine type identifiers for red and white
df_reds['type'] = 'red'
df_whites['type'] = 'white'

# Combine the red and white wine datasets into a single df
df = pd.concat([df_reds, df_whites], ignore_index=True)

# Adding a unique id to each wine
df['wine_id'] = df.index

# Printing the number of wines in each dataset
print(f"Total Number of Red Wines: {len(df_reds)}")
print(f"Total Number of White Wines: {len(df_whites)}") 
print(f"Total Number of Wines: {len(df)}")

df.head()



### Dataset Overview

The combined dataset contains 6,497 wines. Each row has the 13 columns described below, plus the `wine_id` identifier added during loading.

| Column Name | Description | Purpose |
|-------------|-------------|---------|
| **fixed acidity** | Non-volatile acids (tartaric, malic, citric) in g/dm³ | Affects wine's tartness and aging potential |
| **volatile acidity** | Acetic acid content in g/dm³ | Indicates wine spoilage; high levels create vinegar taste |
| **citric acid** | Citric acid content in g/dm³ | Adds freshness and flavor complexity |
| **residual sugar** | Remaining sugar after fermentation in g/dm³ | Determines wine sweetness level |
| **chlorides** | Salt content in g/dm³ | Influences wine's saltiness and balance |
| **free sulfur dioxide** | Free SO₂ in mg/dm³ | Acts as preservative and antioxidant |
| **total sulfur dioxide** | Total SO₂ content in mg/dm³ | Overall preservative level |
| **density** | Wine density in g/cm³ | Related to alcohol and sugar content |
| **pH** | Acidity level (0-14 scale) | Affects wine stability and taste |
| **sulphates** | Potassium sulphate in g/dm³ | Preservative and antioxidant |
| **alcohol** | Alcohol content by volume (%) | Affects body, flavor, and quality perception |
| **quality** | Expert rating score (0-10) | Target variable for quality prediction |
| **type** | Wine type (red/white) | Categorical identifier for wine classification |
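As a first pass at the primary research question, one simple option is to rank the numeric columns by their Pearson correlation with `quality`. The sketch below is illustrative only: the helper name `rank_quality_correlations` is hypothetical, the data is synthetic stand-in data rather than the real measurements, and correlation captures only linear association, so it is a starting point and not a final answer.

```python
import numpy as np
import pandas as pd


def rank_quality_correlations(df, target="quality", exclude=("wine_id",)):
    """Rank numeric columns by the strength of their Pearson correlation with `target`.

    Columns listed in `exclude` (identifiers, for example) are dropped first.
    The returned Series keeps the sign of each correlation and is ordered
    from the strongest to the weakest absolute value.
    """
    numeric = df.select_dtypes(include="number").drop(columns=list(exclude), errors="ignore")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in data (NOT the real wine measurements), built so that
# quality rises with alcohol and falls with volatile acidity.
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.5, 1.0, n)
volatile_acidity = rng.normal(0.4, 0.1, n)
ph = rng.normal(3.2, 0.15, n)
noise = rng.normal(0.0, 0.5, n)
quality = np.round(5.5 + 0.8 * (alcohol - 10.5) - 4.0 * (volatile_acidity - 0.4) + noise)

demo = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "pH": ph,
    "quality": quality.astype(int),
    "type": ["red"] * (n // 2) + ["white"] * (n // 2),
    "wine_id": np.arange(n),
})

ranking = rank_quality_correlations(demo)
print(ranking)
```

On the real data, the same call could be applied to `df` as a whole or to the red and white subsets separately, which would also speak to the secondary question.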

Some more information about the dataset:
- **Total Samples**: 6,497 wines
- **Red Wines**: 1,599 samples
- **White Wines**: 4,898 samples
- **Missing Values**: None
- **Quality Range**: 3-9 (most wines rated 5-6)
- **Source**: Portuguese "Vinho Verde" wines
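The summary figures above can be checked directly with pandas. The following is a minimal sketch under stated assumptions: `summarize_wines` is a hypothetical helper, and the tiny `toy` frame is made-up data that only mimics the `quality` and `type` columns, so its numbers are not those of the real dataset.

```python
import pandas as pd


def summarize_wines(df):
    """Collect basic sanity-check figures for a combined wine DataFrame."""
    return {
        "total": len(df),                                    # number of rows
        "per_type": df["type"].value_counts().to_dict(),     # rows per wine type
        "missing": int(df.isna().sum().sum()),               # missing cells overall
        "quality_min": int(df["quality"].min()),
        "quality_max": int(df["quality"].max()),
        # How many wines received each rating, ordered by rating
        "quality_counts": df["quality"].value_counts().sort_index().to_dict(),
    }


# Made-up rows for illustration only (NOT taken from the real dataset)
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.1, 11.2, 12.0],
    "quality": [5, 5, 6, 7, 3],
    "type": ["red", "red", "white", "white", "white"],
})

summary = summarize_wines(toy)
print(summary)
```

Calling `summarize_wines(df)` on the combined frame from the loading cell would be one way to confirm the sample counts, the absence of missing values, and the observed quality range.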



