# CO2 Emissions Prediction using Linear Regression

## Overview
This notebook implements linear regression to predict CO2 emissions from vehicle characteristics using two approaches:
1. **Gradient Descent** - Implemented from scratch
2. **Scikit-learn LinearRegression** - Using the standard machine learning library

## Dataset
- **Source**: CO2 Emissions_Canada.csv
- **Features**: 
  - Engine Size (L)
  - Cylinders
  - Fuel Consumption City (L/100 km)
  - Fuel Consumption Highway (L/100 km)
  - Fuel Consumption Combined (L/100 km)
- **Target**: CO2 Emissions (g/km)

In [None]:
import sys
print(sys.executable)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns

## Data Loading and Preparation

The dataset is loaded and the relevant features are extracted for model training.

In [None]:
dataset_path = './CO2 Emissions_Canada.csv'
dataset = pd.read_csv(dataset_path)

features = [
    "Engine Size(L)",
    "Cylinders",
    "Fuel Consumption City (L/100 km)",
    "Fuel Consumption Hwy (L/100 km)",
    "Fuel Consumption Comb (L/100 km)"
]
target = "CO2 Emissions(g/km)"

X = dataset[features].values  # shape (n_samples, 5)
y = dataset[target].values    # shape (n_samples,)

## Exploratory Data Analysis

Visualizing the relationship between each feature and the target variable (CO2 emissions) using scatter plots to understand the data distribution and correlations.
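
Scatter plots show the shape of each relationship; a correlation table puts a number on its linear part. The following is a minimal sketch on synthetic stand-in data. The column names `engine_size`, `noise`, and `co2` and all values are invented for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and values are hypothetical.
rng = np.random.default_rng(0)
engine = rng.uniform(1.0, 6.0, size=200)
demo = pd.DataFrame({
    "engine_size": engine,
    "noise": rng.normal(size=200),
    "co2": 40.0 * engine + rng.normal(scale=5.0, size=200),
})

# Pearson correlation of every column with the target, strongest first.
corr_with_target = demo.corr()["co2"].drop("co2").sort_values(ascending=False)
print(corr_with_target)
```

On the notebook's data, the same idea would presumably read `dataset[features + [target]].corr()[target]`.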

In [None]:
# Scatter plots of each feature vs target
plt.figure(figsize=(15, 10))
for i, feature in enumerate(features):
    plt.subplot(2, 3, i+1)
    sns.scatterplot(data=dataset, x=feature, y=target)
    plt.title(f'{feature} vs {target}')
plt.tight_layout()
plt.show()

## Feature Scaling

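Standardizing the features with z-score normalization, (x - mean) / std per column, puts them on a common scale so that a single learning rate suits every weight and gradient descent converges faster. The mean and standard deviation below are computed over the full dataset, before the train-validation split.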
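
As a quick check of the idea, the sketch below applies the z-score formula by hand to a small made-up matrix and compares the result with scikit-learn's `StandardScaler`. The array `X_demo` is an assumption for illustration only. Both approaches use the population standard deviation (`ddof=0`), so they should agree.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (values are made up for illustration).
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0],
                   [6.0, 30.0]])

# Manual z-score: subtract the column mean, divide by the column std.
# np.std defaults to ddof=0, the same convention StandardScaler uses.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Library equivalent.
library = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, library))  # True
print(manual.mean(axis=0))           # each column mean is ~0
print(manual.std(axis=0))            # each column std is ~1
```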

In [None]:
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_scaled = (X - X_mean) / X_std

## Train-Validation Split

Splitting the dataset into training (80%) and validation (20%) sets with random shuffling to ensure unbiased model evaluation.
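
The notebook scales first and splits afterwards. An alternative ordering, sketched here on synthetic data, splits first and computes the scaling statistics from the training rows only, so that no information from the validation rows reaches the scaler. All names and values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features
y_demo = rng.normal(size=100)                            # synthetic target

# Shuffle row indices once and apply the same order to X and y.
order = rng.permutation(X_demo.shape[0])
X_demo, y_demo = X_demo[order], y_demo[order]

# 80/20 split.
split = int(0.8 * X_demo.shape[0])
X_tr, X_va = X_demo[:split], X_demo[split:]
y_tr, y_va = y_demo[:split], y_demo[split:]

# Scaling statistics come from the training rows only.
mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr_scaled = (X_tr - mu) / sigma
X_va_scaled = (X_va - mu) / sigma

print(X_tr_scaled.shape, X_va_scaled.shape)  # (80, 3) (20, 3)
```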

In [None]:
np.random.seed(42)
indices = np.arange(X_scaled.shape[0])
np.random.shuffle(indices)

X_scaled = X_scaled[indices]
y = y[indices]

split_idx = int(0.8 * X_scaled.shape[0])
X_train, X_val = X_scaled[:split_idx], X_scaled[split_idx:]
y_train, y_val = y[:split_idx], y[split_idx:]

## Gradient Descent Implementation (From Scratch)

Implementing the core components of linear regression using gradient descent optimization.

### Linear Model Function

Defining the hypothesis function for linear regression: **f(x) = Xw + b**
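
To make the shapes concrete, here is a small worked instance of `Xw + b` with made-up numbers (the arrays are illustrative, not taken from the dataset):

```python
import numpy as np

X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])   # shape (3, 2): 3 samples, 2 features
w_demo = np.array([0.5, -1.0])    # shape (2,): one weight per feature
b_demo = 2.0                      # scalar bias, broadcast to every sample

y_hat = X_demo @ w_demo + b_demo  # shape (3,): one prediction per sample
print(y_hat)                      # values: 0.5, -0.5, -1.5
```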

In [None]:
def f_wb(X, w, b):
    """Predict y using linear model: y = Xw + b"""
    return X @ w + b

### Cost Function

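Computing the cost as half the Mean Squared Error (MSE), `(1/(2m)) * sum((y_pred - y)**2)`. The extra factor of 1/2 cancels the 2 produced by differentiating the square, which keeps the gradient expressions simple; it does not change where the minimum lies.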
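
Written out, the quantity that `compute_cost` evaluates is the following, with $m$ the number of samples and $f_{w,b}$ the linear model defined earlier:

$$
J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}\big(x^{(i)}\big) - y^{(i)} \right)^2,
\qquad f_{w,b}\big(x^{(i)}\big) = w \cdot x^{(i)} + b
$$

As a worked instance with made-up numbers, predictions $(1, 3)$ against targets $(2, 5)$ give errors $(-1, -2)$, so

$$
J = \frac{1}{2 \cdot 2}\left( (-1)^2 + (-2)^2 \right) = \frac{5}{4} = 1.25 ,
$$

which is half of the plain MSE of $2.5$.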

In [None]:
def compute_cost(X, y, w, b):
    """Compute the half-MSE cost: (1 / (2m)) * sum of squared errors"""
    m = X.shape[0]
    y_pred = f_wb(X, w, b)
    return (1/(2*m)) * np.sum((y_pred - y)**2)

### Gradient Computation

Calculating the partial derivatives of the cost function with respect to weights and bias.
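
For the half-MSE cost, the partial derivatives work out to `dw = (1/m) * X.T @ (Xw + b - y)` and `db = (1/m) * sum(Xw + b - y)`. One way to sanity-check analytic gradients of this kind is to compare them with central finite differences. The sketch below does that on synthetic data; the helper names `cost` and `grads` are stand-ins chosen for this example.

```python
import numpy as np

def cost(X, y, w, b):
    """Half mean squared error: (1 / 2m) * sum((Xw + b - y)^2)."""
    m = X.shape[0]
    return np.sum((X @ w + b - y) ** 2) / (2 * m)

def grads(X, y, w, b):
    """Analytic gradients of the half-MSE cost."""
    m = X.shape[0]
    error = X @ w + b - y
    return X.T @ error / m, np.sum(error) / m

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.normal(size=50)
w_demo = rng.normal(size=3)
b_demo = 0.7

dw, db = grads(X_demo, y_demo, w_demo, b_demo)

# Central differences: perturb one parameter at a time by +/- eps.
eps = 1e-6
dw_num = np.zeros_like(w_demo)
for j in range(w_demo.size):
    step = np.zeros_like(w_demo)
    step[j] = eps
    dw_num[j] = (cost(X_demo, y_demo, w_demo + step, b_demo)
                 - cost(X_demo, y_demo, w_demo - step, b_demo)) / (2 * eps)
db_num = (cost(X_demo, y_demo, w_demo, b_demo + eps)
          - cost(X_demo, y_demo, w_demo, b_demo - eps)) / (2 * eps)

# Both comparisons should hold if the analytic formulas are right.
print(np.allclose(dw, dw_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```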

In [None]:
def compute_gradients(X, y, w, b):
    """Compute gradients of the cost function w.r.t weights and bias"""
    m = X.shape[0]
    y_pred = f_wb(X, w, b)
    error = y_pred - y
    dw = (1/m) * (X.T @ error)
    db = (1/m) * np.sum(error)
    return dw, db

In [None]:
def gradient_descent(X, y, alpha, iterations, record_interval):
    """
    Performs Gradient Descent to train linear regression.

    Args:
        X (numpy.ndarray): Feature matrix of shape (n_samples, n_features)
        y (numpy.ndarray): Target vector of shape (n_samples,)
        alpha (float): Learning rate
        iterations (int): Number of iterations
        record_interval (int): Interval at which to record cost and weights

    Returns:
        w (numpy.ndarray): Final weights
        b (float): Final bias
        cost_history (list): Cost at recorded iterations
        w_history (list): Weights at recorded iterations
        b_history (list): Bias at recorded iterations
    """
    
    # Initialize weights and bias
    w = np.zeros(X.shape[1])
    b = 0
    
    cost_history = []
    w_history = []
    b_history = []

    for i in range(iterations):
        dw, db = compute_gradients(X, y, w, b)  # gradients at the current parameters
        w -= alpha * dw
        b -= alpha * db
        
        # Record cost, weights, and bias at intervals
        if i % record_interval == 0 or i == iterations - 1:
            cost_history.append(compute_cost(X, y, w, b))
            w_history.append(w.copy())
            b_history.append(b)

    return w, b, cost_history, w_history, b_history

### Training the Model

Running gradient descent with the following hyperparameters:
- **Learning rate (α)**: 0.01
- **Iterations**: 10,000
- **Recording interval**: 100 iterations
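
Whether these settings are enough for convergence can be checked against the closed-form least-squares solution. The sketch below does so on synthetic, roughly standardized data with the same learning rate and iteration count. The data, `true_w`, and `true_b` are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))                 # roughly standardized features
true_w, true_b = np.array([2.0, -1.0, 0.5]), 4.0   # hypothetical ground truth
y_demo = X_demo @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Plain batch gradient descent with the hyperparameters listed above.
alpha, iterations = 0.01, 10_000
w, b = np.zeros(3), 0.0
m = X_demo.shape[0]
for _ in range(iterations):
    error = X_demo @ w + b - y_demo
    w -= alpha * (X_demo.T @ error) / m
    b -= alpha * error.sum() / m

# Closed-form least squares on [X, 1] for reference.
A = np.column_stack([X_demo, np.ones(m)])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Compare gradient descent parameters with the least-squares solution.
print(np.allclose(np.append(w, b), solution, atol=1e-6))
```

This toy problem is well conditioned. On the notebook's data the three fuel-consumption columns are likely to be strongly correlated with one another, so individual weights may converge more slowly there; comparing `w_final` with `lr_model.coef_` is the direct check.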

In [None]:
alpha = 0.01
iterations = 10000
record_interval = 100

w_final, b_final, cost_history, w_history, b_history = gradient_descent(X_train, y_train, alpha, iterations, record_interval)

print("Final weights:", w_final)
print("Final bias:", b_final)


In [None]:
# x-axis: the iteration indices at which the cost was recorded (same length as cost_history)
iterations_recorded = list(range(0, iterations, record_interval))
if iterations - 1 not in iterations_recorded:
    iterations_recorded.append(iterations - 1)  # the final iteration is always recorded

# Plot
plt.figure(figsize=(8,5))
plt.plot(iterations_recorded, cost_history, marker='o', color='blue')
plt.title("Cost History during Gradient Descent")
plt.xlabel("Iteration")
plt.ylabel("Cost")
plt.grid(True)
plt.show()


In [None]:
# Convert w_history to NumPy array for easier indexing
w_history = np.array(w_history)  # shape: (num_records, n_features)

# x-axis: iteration numbers where history was recorded,
# derived from the same settings used for training
iterations_recorded = list(range(0, iterations, record_interval))
if iterations - 1 not in iterations_recorded:
    iterations_recorded.append(iterations - 1)  # include last iteration

# Plot weights history
plt.figure(figsize=(10,6))
for i in range(w_history.shape[1]):
    plt.plot(iterations_recorded, w_history[:, i], label=f'Weight w{i}')

plt.title("Weights History During Gradient Descent")
plt.xlabel("Iteration")
plt.ylabel("Weight Value")
plt.legend()
plt.grid(True)
plt.show()

### Model Evaluation

Evaluating the gradient descent model on the validation set using Mean Squared Error (MSE) and R² score metrics.
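
The two metrics can be computed by hand or with scikit-learn, and the results should match. A small check with made-up targets and predictions (the numbers are illustrative only):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([200.0, 250.0, 300.0, 350.0])  # made-up targets
y_pred = np.array([210.0, 240.0, 310.0, 330.0])  # made-up predictions

# MSE: mean of squared errors.
mse_manual = np.mean((y_pred - y_true) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, r2_manual)  # MSE is 175.0, R² is about 0.944
print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```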

In [None]:
# Predictions on validation set
y_val_pred_gd = f_wb(X_val, w_final, b_final)

# Evaluation
mse_gd = np.mean((y_val_pred_gd - y_val)**2)
ss_res = np.sum((y_val - y_val_pred_gd)**2)
ss_tot = np.sum((y_val - np.mean(y_val))**2)
r2_gd = 1 - (ss_res / ss_tot)

print("Gradient Descent MSE on validation set:", mse_gd)
print("Gradient Descent R² score on validation set:", r2_gd)

## Scikit-learn Linear Regression

Training a linear regression model using scikit-learn's built-in implementation for comparison with the gradient descent approach.
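
A common alternative to scaling by hand is to pass raw features to a scikit-learn pipeline, which fits the scaler on the training rows and reuses it at prediction time. The sketch below shows the pattern on synthetic, noise-free data; the arrays and the 96/24 split are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X_raw = rng.uniform(1.0, 10.0, size=(120, 2))  # synthetic, unscaled features
y_demo = 30.0 * X_raw[:, 0] + 5.0 * X_raw[:, 1] + 50.0

X_tr, X_va = X_raw[:96], X_raw[96:]
y_tr, y_va = y_demo[:96], y_demo[96:]

# fit() learns the scaling statistics from X_tr only, then fits the regression.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

# score() returns R² on the held-out rows; close to 1 here because the data is noise-free.
print(model.score(X_va, y_va))
```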

In [None]:
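# X_train and X_val were already standardized above, so this second scaling
# only re-centers them on the training subset's statistics.
scaler = StandardScaler()
X_train_scaled_skl = scaler.fit_transform(X_train)
X_val_scaled_skl = scaler.transform(X_val)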

lr_model = LinearRegression()
lr_model.fit(X_train_scaled_skl, y_train)

weights_skl = lr_model.coef_
bias_skl = lr_model.intercept_

y_val_pred_skl = lr_model.predict(X_val_scaled_skl)

mse_skl = mean_squared_error(y_val, y_val_pred_skl)
r2_skl_val = r2_score(y_val, y_val_pred_skl)

print("\n--- scikit-learn Linear Regression ---")
print("Trained weights:", weights_skl)
print("Trained bias:", bias_skl)
print("MSE on validation set (sklearn):", mse_skl)
print("R² score on validation set (sklearn):", r2_skl_val)

## Results Comparison

Visualizing the predictions from both models (Gradient Descent and Scikit-learn) against actual values to compare their performance.

In [None]:
plt.figure(figsize=(8,6))

# Plot actual vs predicted values
plt.scatter(y_val, y_val_pred_gd, alpha=0.5, color='blue', label='Gradient Descent')
plt.scatter(y_val, y_val_pred_skl, alpha=0.5, color='green', label='scikit-learn')

# Plot ideal line
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'r--', label='Ideal')

# Labels and title
plt.xlabel("Actual CO2 Emissions (g/km)")
plt.ylabel("Predicted CO2 Emissions (g/km)")
plt.title("Actual vs Predicted CO2 Emissions")
plt.legend()
plt.grid(True)
plt.show()