# CO2 Emission Prediction

## Project Overview

This project aims to predict the **CO2 emissions** of different types of cars based on the **volume** of fuel and **weight** of the vehicle using a regression model. By analyzing the relationship between these independent variables and the target variable (CO2 emission), we can estimate CO2 emission levels for various cars.

The dataset lists each car's name and model alongside its volume and weight, which together serve as inputs for the regression model.

## About the Dataset

This dataset contains the CO2 emissions of different car models. It includes two key independent variables: **volume** and **weight** of the cars, which can be used to predict the **CO2 emission**. This dataset is perfect for practicing regression techniques, specifically linear regression, to predict a continuous variable.
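As a rough sketch, a linear regression on the two numeric variables has the form below. The coefficient symbols $\beta_0, \beta_1, \beta_2$ are notation introduced here for illustration, not values from the dataset; the notebook further down also feeds encoded `Car` and `Model` columns to the model, which would add further terms of the same kind.

$$
\widehat{\mathrm{CO2}} = \beta_0 + \beta_1 \cdot \mathrm{Volume} + \beta_2 \cdot \mathrm{Weight}
$$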

## Data Source

This dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/midhundasl/co2-emission-of-cars-dataset/data

### Dataset Specifications
- **Car**: Name of the car.
- **Model**: Name of the model of the car.
- **Volume**: Volume of fuel (in cubic centimeters).
- **Weight**: Weight of the car (in kilograms).
- **CO2**: CO2 emitted by the car (in grams per kilometer).

## Problem Statement

- **Model Training**: Train the model on the data so that it can predict the CO2 emission.
- **Model Evaluation**: Evaluate the performance of the model with evaluation metrics such as the R2 score and mean squared error.
- **Model Optimization**: Optimize the performance of the regression model using regularization techniques.
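For reference, the two regularized models used later (`Ridge` and `Lasso` from scikit-learn) minimize the objectives below, as given in the scikit-learn documentation. Here $X$ is the feature matrix, $y$ the target vector, $w$ the coefficient vector, $n$ the number of samples, and $\alpha$ the regularization strength (the `alpha` parameter).

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\qquad
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The squared penalty in Ridge shrinks coefficients toward zero without removing them, while the absolute-value penalty in Lasso can set some coefficients exactly to zero.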

### Load Libraries

In [64]:
# General
import pandas as pd
import numpy as np
import os
import warnings

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model and Evaluation Metrics
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

### Settings

In [55]:
# Warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
csv_path = os.path.join(data_path, "DATA_encoded.csv")
# csv_path = os.path.join(data_path, "DATA_cleaned.csv")

### Load Data

In [56]:
df = pd.read_csv(csv_path)

In [57]:
# Check Data
df.head()

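| Unnamed: 0 | Car | Model | Volume | Weight | CO2 |
|---|---|---|---|---|---|
| 0 | 101.633838 | 101.633838 | 1000 | 790 | 99 |
| 1 | 101.113404 | 101.113404 | 1200 | 1160 | 95 |
| 2 | 101.183204 | 101.113404 | 1000 | 929 | 95 |
| 3 | 100.462862 | 100.462862 | 900 | 865 | 90 |
| 4 | 102.414489 | 102.414489 | 1500 | 1140 | 105 |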


### Preprocessing

In [58]:
# Separate Input and Output variables
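# Note: every column except CO2 is kept as a feature; if the CSV carries a
# saved index column ("Unnamed: 0"), it is included here as well
X = df.drop("CO2", axis=1)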
y = df["CO2"]

In [59]:
# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [60]:
# Standardize the data (fit the scaler on the training split only)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
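The scaling step above can be illustrated on a few made-up rows. This is only a sketch: the toy Volume/Weight values and the `*_toy` names are assumptions for illustration, not data from the project. It shows that the training split ends up with zero mean and unit standard deviation per column, while the test row is transformed with the statistics learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy Volume/Weight rows, invented for illustration only
X_train_toy = np.array([
    [1000.0, 800.0],
    [1200.0, 1100.0],
    [1600.0, 1300.0],
    [2000.0, 1500.0],
])
X_test_toy = np.array([[1500.0, 1200.0]])

scaler_toy = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_train_toy_s = scaler_toy.fit_transform(X_train_toy)
# Reuse the training statistics for the test row
X_test_toy_s = scaler_toy.transform(X_test_toy)

train_means = X_train_toy_s.mean(axis=0)
train_stds = X_train_toy_s.std(axis=0)
print("Train column means:", train_means)
print("Train column stds: ", train_stds)
print("Scaled test row:   ", X_test_toy_s)
```

Fitting the scaler on the full dataset instead would let test rows influence the training features, which is why `fit_transform` is reserved for the training split.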

### Model Training and Evaluation

In [61]:
# Define a function to train and evaluate a specified model
def train_evaluate(model):
    # Train the model
    model.fit(X_train_s, y_train)

    # Predict with trained model
    y_train_pred = model.predict(X_train_s)
    y_test_pred = model.predict(X_test_s)

    # Print Evaluation Metrics
    print("=" * 60)
    print("EVALUATION METRICS FOR TRAIN DATA")
    print("=" * 60)
    print(f"R2 Score: {r2_score(y_train, y_train_pred)}")
    print(f"MAE: {mean_absolute_error(y_train, y_train_pred)}")
    print(f"MSE: {mean_squared_error(y_train, y_train_pred)}")
    print(f"RMSE: {np.sqrt(mean_squared_error(y_train, y_train_pred))}")
    print("=" * 60)
    print("EVALUATION METRICS FOR TEST DATA")
    print("=" * 60)
    print(f"R2 Score: {r2_score(y_test, y_test_pred)}")
    print(f"MAE: {mean_absolute_error(y_test, y_test_pred)}")
    print(f"MSE: {mean_squared_error(y_test, y_test_pred)}")
    print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred))}")
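To make the four metrics printed by `train_evaluate` concrete, the sketch below computes them by hand on a handful of values and compares the results with the scikit-learn functions. The "true" values are borrowed from the CO2 column preview purely as sample numbers, and the predictions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Sample targets and invented predictions (illustration only)
y_true = np.array([99.0, 95.0, 90.0, 105.0])
y_pred = np.array([98.0, 96.0, 91.0, 103.0])

residuals = y_true - y_pred

# Mean absolute error: average size of the residuals
mae = np.mean(np.abs(residuals))        # 1.25
# Mean squared error: average squared residual
mse = np.mean(residuals ** 2)           # 1.75
# Root mean squared error: same units as the target
rmse = np.sqrt(mse)
# R2: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE:  {mae}  (sklearn: {mean_absolute_error(y_true, y_pred)})")
print(f"MSE:  {mse}  (sklearn: {mean_squared_error(y_true, y_pred)})")
print(f"RMSE: {rmse}")
print(f"R2:   {r2}  (sklearn: {r2_score(y_true, y_pred)})")
```

Because RMSE is expressed in the target's units (grams per kilometer here), it is often the easiest of the four to interpret.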

In [62]:
# Try Ridge regression
r = Ridge()
train_evaluate(r)

EVALUATION METRICS FOR TRAIN DATA
R2 Score: 0.9982097365005999
MAE: 0.22376820146684398
MSE: 0.07032264633613006
RMSE: 0.2651841743696823
EVALUATION METRICS FOR TEST DATA
R2 Score: 0.9973676141705304
MAE: 0.4018677293698172
MSE: 0.22798929144921248
RMSE: 0.4774822420249914


### Insights

The results after applying target encoding for car name and model show a significant improvement in model performance with Ridge regression. Here's the breakdown:

#### Training Performance:

- **R2: 0.9982** (Excellent fit, explaining almost all the variance in the training data)
- **MAE: 0.22** (Very low error on average)
- **MSE: 0.07 and RMSE: 0.27** (Very low, indicating the model is fitting the training data extremely well)

#### Testing Performance:

- **R2: 0.9974** (Very high, almost as good as the training score, indicating strong generalization to unseen data)
- **MAE: 0.40** (Slight increase, but still low)
- **MSE: 0.23 and RMSE: 0.48** (Slightly higher, but still very low)

#### Analysis:

- **Minimal Overfitting:** The close alignment of training and testing performance suggests that the model is not overfitting. The small gap in metrics is typical and expected due to the inherent randomness in the test set.
- **Impact of Target Encoding:** Adding car name and model as encoded features provided the model with more useful information, leading to a dramatic improvement. These categorical variables may have had significant predictive power related to CO2 emissions.
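The notebook loads an already-encoded CSV and does not show how the encoding was computed, so the following is only a minimal sketch of one common form of target encoding: each category is replaced by the mean target value observed for it in the training rows. The toy rows and the `Car_encoded` column name are assumptions for illustration. Computing the means from training rows only matters because, if they were computed over all rows before the split, each test row would contribute its own CO2 value to its features, which could inflate test scores.

```python
import pandas as pd

# Toy rows, invented for illustration; column names follow the dataset
train = pd.DataFrame({
    "Car": ["Toyota", "Toyota", "Skoda", "Ford"],
    "CO2": [99, 101, 95, 109],
})
test = pd.DataFrame({"Car": ["Toyota", "Audi"]})

# Mean CO2 per car name, learned from the training rows only
car_means = train.groupby("Car")["CO2"].mean()
# Fallback for categories that never appear in the training rows
global_mean = train["CO2"].mean()

train["Car_encoded"] = train["Car"].map(car_means)
test["Car_encoded"] = test["Car"].map(car_means).fillna(global_mean)

print(train)
print(test)
```

In this sketch "Toyota" maps to the training mean of its two rows, while the unseen "Audi" falls back to the overall training mean.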

#### Conclusion:

The model performs exceptionally well both on training and test data, with minimal errors and high R2 values. This indicates that the Ridge regression model, with target encoding for categorical features, has captured the relationships in the dataset accurately and generalizes well to new data.

In [63]:
# Try Lasso Regression
l = Lasso()
train_evaluate(l)

EVALUATION METRICS FOR TRAIN DATA
R2 Score: 0.9745134385497239
MAE: 0.822866218764007
MSE: 1.0011277377840586
RMSE: 1.000563710007543
EVALUATION METRICS FOR TEST DATA
R2 Score: 0.967321723702364
MAE: 1.303930479071532
MSE: 2.8302450862155633
RMSE: 1.6823332268654636


### Insights

With Lasso regression, there is a noticeable drop in performance compared to Ridge regression. Here’s an analysis of the results:

#### Training Performance:

- **R2: 0.9745** (Still high, but lower than Ridge, indicating less variance explained by the model)
- **MAE: 0.82** (Higher than Ridge, meaning more error in predictions)
- **MSE: 1.00 and RMSE: 1.00** (Both significantly higher than the Ridge results)

#### Testing Performance:

- **R2: 0.9673** (Good, but lower than Ridge, showing the model doesn’t explain as much variance in the test data)
- **MAE: 1.30** (A notable increase in error compared to Ridge)
- **MSE: 2.83 and RMSE: 1.68** (Both much higher than the Ridge results, indicating a less accurate model)

#### Conclusion:

While Lasso can be effective for feature selection by driving coefficients to zero, in this case it performs worse than Ridge. Ridge appears to handle this dataset better, as it retains more of the information from the encoded features (car name and model).
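Both models above were created with default settings, and in scikit-learn both `Ridge()` and `Lasso()` default to `alpha=1.0`. Part of the gap may therefore reflect the penalty strength rather than the penalty type alone. The sketch below sweeps a few `alpha` values for both models and counts how many coefficients each one sets to zero. It runs on synthetic data from `make_regression`, not on the CO2 dataset, and the `alpha` grid and the `*_syn` names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; NOT the CO2 dataset
X_syn, y_syn = make_regression(
    n_samples=200, n_features=4, noise=5.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Same preprocessing pattern as the notebook: fit on train, apply to test
scaler_syn = StandardScaler()
X_tr_s = scaler_syn.fit_transform(X_tr)
X_te_s = scaler_syn.transform(X_te)

results = []
for alpha in [0.01, 0.1, 1.0, 10.0]:
    for name, model in [("Ridge", Ridge(alpha=alpha)), ("Lasso", Lasso(alpha=alpha))]:
        model.fit(X_tr_s, y_tr)
        test_r2 = r2_score(y_te, model.predict(X_te_s))
        # Count coefficients that were shrunk exactly to zero
        n_zero = int(np.sum(model.coef_ == 0))
        results.append((name, alpha, test_r2, n_zero))
        print(f"{name:5s} alpha={alpha:<5} test R2={test_r2:.4f} zero coefs={n_zero}")
```

Running the same loop on `X_train_s` and `X_test_s` from this notebook would show whether a smaller `alpha` narrows the difference between the two models. scikit-learn also provides `RidgeCV` and `LassoCV`, which choose `alpha` by cross-validation.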