# Data Scientist Project: Vehicle CO₂ Emissions Prediction

## 1. Context - Why this project matters

### Environmental & regulatory pressures

- Transportation is a major contributor to CO₂ and greenhouse gas emissions in most countries; in Europe, regulations force automakers to meet fleet-wide CO₂ targets.
- There is often a **gap** between laboratory/certification emissions and **real-world** emissions (e.g. "real-world fuel consumption and CO₂ emissions ... around 20% higher than official values" in some observations).
- Knowing which vehicle technical factors drive the highest emissions can inform policy, design, consumer choices, and standards.

### Scientific & engineering relevance

- The relationship between vehicle specs (mass, power, engine displacement, fuel type, aerodynamics, drivetrain) and emissions is complex, involving physics, thermodynamics, and empirical behavior.
- Modeling emissions from vehicle features is a useful regression/predictive modeling exercise, combining domain knowledge and data science methods.
- If the model works, you can simulate "what-if" scenarios (e.g. what if engine power increases 10%, or weight decreases 5%) to see CO₂ effects.

### Use cases & stakeholders

- **Regulators/policy makers**: Deciding which vehicle attributes to regulate or incentivize (e.g. lighter weight, more efficient engines)
- **Automakers/engineers**: Anticipating CO₂ emissions from planned specs early in vehicle design
- **Consumers/NGOs/advocacy**: Highlighting high-emission vehicles and informing purchase decisions or awareness efforts
- **Researchers/data scientists**: A case study in combining domain physics with statistical modeling

### Challenges & caveats (risks)

- **Data measurement error/bias**: The recorded CO₂ (and fuel consumption) values may under- or overestimate real-world usage.
- **Omitted variables**: Features such as driving style, road conditions, maintenance, aerodynamics (drag coefficient), and gear ratios may not be in the dataset.
- **Nonstationarity**: As technology advances, what drives emissions today might differ in the future (so the model may not generalize across years).
- **Multicollinearity**: Many vehicle specs move together (engine size, power, weight).
- **Model interpretability**: Highly flexible models (e.g. ensembles) may be more accurate but harder to interpret regarding causality.
- **Ethical/policy implications**: If predicting emissions influences regulation, fairness or unintended consequences may appear (e.g. penalizing certain classes of vehicle disproportionately).
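
The multicollinearity caveat above can be checked numerically. The sketch below computes variance inflation factors (VIFs) with NumPy on synthetic data. The feature names (`mass_kg`, `power_kw`, `noise`) and all numbers are made up for illustration and are not drawn from the project dataset. It relies on the identity that the VIFs are the diagonal of the inverse feature correlation matrix.

```python
import numpy as np

def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """Return one VIF per column of X (rows = vehicles, columns = features).

    Uses the identity VIF_j = [R^-1]_jj, where R is the feature
    correlation matrix. This equals 1 / (1 - R_j^2) from regressing
    feature j on all the other features.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic, made-up data: power is built to track mass closely,
# while the third column is independent of both.
rng = np.random.default_rng(0)
n = 500
mass_kg = rng.normal(1400, 250, n)
power_kw = 0.08 * mass_kg + rng.normal(0, 8, n)
noise = rng.normal(0, 1, n)

X = np.column_stack([mass_kg, power_kw, noise])
vifs = variance_inflation_factors(X)
for name, vif in zip(["mass_kg", "power_kw", "noise"], vifs):
    print(f"{name}: VIF = {vif:.2f}")
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign that coefficient estimates for those features will be unstable, which would argue for regularization or for dropping or combining correlated specs.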

## 2. Key concepts & domain knowledge to master

Here are the important technical and domain concepts we'll need to understand well:

| Concept | Why it matters here | Key points / subtleties |
|---------|---------------------|------------------------|
| **CO₂ emissions / carbon accounting** | This is our target variable (or one of them) | Need to understand how CO₂ is measured (certification tests vs on-road), units (g CO₂/km, g CO₂/mile) |
| **Fuel consumption / efficiency** | Emissions are directly linked to fuel burned | Fuel type (gasoline, diesel, hybrid, electric) matters; efficiency curve is not linear |
| **Vehicle physical attributes** | They influence energy required to move the car | Mass, frontal area / drag coefficient, rolling resistance, drivetrain losses |
| **Power, displacement, torque** | Engine performance metrics that tend to correlate with emissions | More power or larger displacement often leads to higher emissions, but efficiency of engine design matters |
| **Drivetrain / transmission / gearing** | They affect how engine power is used | Gear ratios, number of gears, transmission losses, presence of turbocharging / forced induction |
| **Aerodynamics & rolling resistance** | Aerodynamic drag dominates energy demand at highway speeds | The product of drag coefficient (Cd) and frontal area is critical; the data may or may not provide it |
| **Regulations & real-world vs test emissions** | Gives context for interpreting prediction error | Emissions under test cycles may not reflect real usage; regulatory thresholds (e.g. EU 95 g CO₂/km) |
| **Statistical modeling & machine learning** | The method to learn relationships of specs → emissions | Linear models, regularization, nonlinear models, interpretability, cross-validation, feature selection |
| **Feature engineering / interactions** | Real-world physics relationships may be nonlinear | Multiplicative features (e.g. power/weight, displacement × weight), polynomials, piecewise effects |
| **Validation, generalization, overfitting** | To ensure our model is robust | Be careful with splitting by time / year, avoid leakage, test on unseen vehicle types or years |
| **Scenario analysis / what-if simulation** | To use the model for prediction / design | Once model is trusted, one can simulate how CO₂ changes when design specs are altered |
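
To make the physics behind several rows of this table concrete, the standard road-load equation for a vehicle on a flat road is a useful reference. It is textbook vehicle dynamics rather than something derived from the project data, and the symbols are introduced here for illustration: $m$ is vehicle mass, $a$ acceleration, $g$ gravitational acceleration, $C_{rr}$ the rolling-resistance coefficient, $\rho$ air density, $C_d$ the drag coefficient, $A$ frontal area, and $v$ speed.

$$
F_{\text{traction}} = m\,a + C_{rr}\,m\,g + \tfrac{1}{2}\,\rho\,C_d\,A\,v^2
$$

Mass enters the inertial and rolling-resistance terms, while $C_d A$ enters only the drag term, which grows with $v^2$. This is one reason to expect mass to matter most in stop-and-go driving and aerodynamics to matter most at highway speeds, and why nonlinear or interaction features may help a purely statistical model.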

## 3. Scope - what will we cover (and what we won't)

To keep the project manageable and meaningful, we need to define clear scope boundaries.

### In scope

- Use the France 2013 vehicle dataset as the primary data for modeling (possibly with related European datasets for validation)
- Analyze the influence of **technical/mechanical characteristics** (mass, engine size, power, fuel type, etc.) on CO₂ emissions
- Build predictive models to estimate CO₂ emissions given vehicle specs
- Interpret which features are most influential (feature importance, partial dependence)
- Scenario experiments: simulate changes in specs (e.g. weight reduction) to see predicted CO₂ effect
- Validate the model (cross-validation, hold-out) and quantify error / uncertainty
- Report findings, limitations, and possible recommendations
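
The scenario experiments listed above could look roughly like the following sketch. It fits an ordinary least-squares model with NumPy on synthetic data and then re-predicts after a 5% mass reduction. The feature names (`mass_kg`, `power_kw`), the generating coefficients, and the 5% figure are assumptions chosen for illustration, not values from the France 2013 dataset.

```python
import numpy as np

# Synthetic stand-in data; the coefficients and feature names are
# assumptions for illustration, not estimates from the real dataset.
rng = np.random.default_rng(42)
n = 400
mass_kg = rng.normal(1400, 250, n)
power_kw = rng.normal(90, 30, n)
co2 = 40 + 0.05 * mass_kg + 0.3 * power_kw + rng.normal(0, 5, n)

# Fit an ordinary least-squares model: co2 ~ intercept + mass + power.
X = np.column_stack([np.ones(n), mass_kg, power_kw])
coef, *_ = np.linalg.lstsq(X, co2, rcond=None)

def predict(mass, power):
    """Predict CO2 (g/km) from the fitted linear model."""
    return coef[0] + coef[1] * mass + coef[2] * power

# Scenario: every vehicle becomes 5% lighter, power unchanged.
baseline = predict(mass_kg, power_kw)
scenario = predict(mass_kg * 0.95, power_kw)
mean_change = float(np.mean(scenario - baseline))
print(f"Mean predicted change: {mean_change:.2f} g CO2/km")
```

Because vehicle specs tend to move together, holding power fixed while lowering mass may describe a design that does not exist in the data. The predicted change is therefore best read as a model-based association, not a causal effect.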

### Probably out of scope (or limited)

- **Driver behavior / usage patterns** (speeding, idling, acceleration behavior) - unless such data is provided
- **Road / environment externalities** (e.g. slope, ambient temperature, traffic congestion) - likely not present in data
- **Full lifecycle emissions or well-to-wheel / upstream emissions** (manufacturing emissions, fuel production) - unless a dataset provides that
- **Policy modeling / cost-benefit analysis** - we may mention implications, but not build full economic models
- **Real-time sensor data / time series modeling** - unless data is available
- **In-depth AI fairness / socioeconomic inequity analysis** - interesting, but beyond the technical focus

## 4. Refined problem statement & success criteria

Putting it all together, here's a refined statement:

### Problem statement

Given a dataset of vehicles marketed in France in 2013, with their technical characteristics and measured CO₂ emissions, develop a statistical/machine learning model that predicts CO₂ emissions from vehicle specifications. Use the model to identify which technical features are most strongly associated with high emissions, and simulate how changes in specifications (e.g. weight, power) would affect predicted CO₂ output.

### Success criteria / evaluation metrics

- Good predictive accuracy on held-out test data (e.g. low MAE, RMSE, reasonably high R²)
- Robustness (model should generalize to new vehicle types not in training)
- Interpretability: being able to explain which features are influential
- Realistic scenario outputs (predicted changes are plausible)
- Clear reporting of limitations, possible biases, and suggestions
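
The robustness criterion (generalizing to vehicle types not seen in training) suggests a grouped hold-out instead of a purely random one. Below is a minimal sketch using scikit-learn's `GroupShuffleSplit`. The `body_type` labels, the placeholder features, and the CO₂ values are all made up, and grouping by body type is only one possible choice.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Made-up example: 12 vehicles, each tagged with a hypothetical body type.
body_type = np.array(
    ["BERLINE"] * 4 + ["BREAK"] * 3 + ["COUPE"] * 3 + ["MINIBUS"] * 2
)
X = np.arange(len(body_type)).reshape(-1, 1)  # placeholder features
y = np.linspace(100, 210, len(body_type))     # placeholder CO2 values

# Keep every vehicle of a given body type on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=body_type))

train_groups = set(body_type[train_idx])
test_groups = set(body_type[test_idx])
print("train:", sorted(train_groups))
print("test: ", sorted(test_groups))
```

Error measured on the held-out groups then estimates performance on unseen vehicle types, which is usually more pessimistic than a random split.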

### Possible next steps based on the CSV

#### Clean

- Drop unused columns (`Unnamed:*`)
- Handle missing values
- Standardize categorical variables (`cod_cbr`, `hybride`, `Carrosserie`, `gamme`)
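
The cleaning steps above can be sketched with pandas as follows. The tiny in-memory frame reuses the column names mentioned in this list, but its values and the `Unnamed: 26` column are invented for illustration. Dropping rows with a missing target and leaving missing categories as-is is one possible policy, not a decision already made for the project.

```python
import pandas as pd

# Tiny made-up frame that mimics the column names mentioned above;
# the values are illustrative, not taken from the real CSV.
raw = pd.DataFrame({
    "cod_cbr": ["GO", "es ", "GO", None],
    "hybride": ["non", "NON", "oui", "non"],
    "Carrosserie": ["BERLINE", "berline", " BREAK", "COUPE"],
    "gamme": ["MOY-SUPER", "moy-super", "LUXE", "LUXE"],
    "co2": [120.0, 150.0, None, 200.0],
    "Unnamed: 26": [None, None, None, None],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Drop unused columns whose names start with "Unnamed".
    out = df.loc[:, ~df.columns.str.startswith("Unnamed")].copy()
    # 2. Handle missing values: here, drop rows without a target value.
    out = out.dropna(subset=["co2"])
    # 3. Standardize categoricals: trim whitespace, upper-case, and store
    #    as category dtype. Missing categories stay missing.
    for col in ["cod_cbr", "hybride", "Carrosserie", "gamme"]:
        out[col] = out[col].str.strip().str.upper().astype("category")
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```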

#### Explore

- Distribution of `co2`
- Relationships between `co2` and weight, power, fuel type, consumption
- Correlation heatmap
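
A first pass at these exploration steps might look like the sketch below. It runs on synthetic data so that it is self-contained: `co2` and `cod_cbr` follow the column names used in this document, while `mass_kg` and `power_kw` are hypothetical names standing in for whatever the real CSV calls its weight and power columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned data. `co2` and `cod_cbr` follow the
# column names above; `mass_kg` and `power_kw` are hypothetical names.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "mass_kg": rng.normal(1400, 250, n),
    "power_kw": rng.normal(90, 30, n),
    "cod_cbr": rng.choice(["GO", "ES"], size=n),
})
df["co2"] = 40 + 0.05 * df["mass_kg"] + 0.3 * df["power_kw"] + rng.normal(0, 5, n)

# Distribution of the target.
print(df["co2"].describe())

# Relationship between co2 and a categorical spec (fuel code).
print(df.groupby("cod_cbr")["co2"].agg(["count", "median"]))

# Pairwise correlations among numeric columns; this matrix is what a
# heatmap would display.
corr = df.corr(numeric_only=True)
print(corr.round(2))
```

The `corr` matrix can be handed to any plotting library to draw the heatmap. The numbers printed here reflect only the synthetic generator, not real vehicles.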

#### Define modeling strategy

- Regression problem (predict `co2`)
- Try baseline linear regression, then tree/ensemble models
- Evaluate with RMSE, MAE, R²
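
The modeling strategy above can be prototyped along these lines with scikit-learn. The data is synthetic, with made-up coefficients and hypothetical `mass_kg` and `power_kw` features, and the random forest is just one example of a tree ensemble. The point is the comparison loop and the three metrics, not the specific numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made-up coefficients, hypothetical features).
rng = np.random.default_rng(7)
n = 600
mass_kg = rng.normal(1400, 250, n)
power_kw = rng.normal(90, 30, n)
X = np.column_stack([mass_kg, power_kw])
y = 40 + 0.05 * mass_kg + 0.3 * power_kw + rng.normal(0, 5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and score it on the same held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }
    print(name, {k: round(v, 2) for k, v in results[name].items()})
```

Since the synthetic target is linear by construction, the baseline is expected to do well here. On the real data the ranking may differ, which is exactly what this comparison is meant to reveal.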