# Emissions Model Training with K-Means Clustering

This notebook trains K-Means clustering models, one per pollutant, that group boroughs by the pollution caused by different types of vehicles.


#### Objective:
The objective is to train, for each pollutant, a K-Means clustering model that assigns boroughs to clusters based on their average vehicle emissions per meter.

#### Input:
The input data consists of the following features:

| Name                   | Description                                 | Column Name           | Data Type |
|------------------------|---------------------------------------------|-----------------------|-----------|
| Borough Name           | Exact borough name                          | BoroughName_ExactCut  | Object    |
| Length (m)             | Length in meters (e.g., length of roads)    | Length (m)            | Float64   |
| Pollutant              | Pollutant type (e.g., CO2)                  | Pollutant             | Object    |
| Petrol Car             | Amount of pollution caused by petrol cars   | PetrolCar             | Float64   |
| Diesel Car             | Amount of pollution caused by diesel cars   | DieselCar             | Float64   |
| Petrol LGV             | Amount of pollution caused by petrol LGVs   | PetrolLgv             | Float64   |
| Diesel LGV             | Amount of pollution caused by diesel LGVs   | DieselLgv             | Float64   |
| Electric Car           | Amount of pollution caused by electric cars | ElectricCar           | Float64   |
| Electric LGV           | Amount of pollution caused by electric LGVs | ElectricLgv           | Float64   |
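
Before preprocessing, it can help to confirm that a loaded file actually exposes the columns listed above. The following is a minimal sketch; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the sample values are invented for illustration.

```python
import pandas as pd

# Column names taken from the input table (the list name is an assumption)
EXPECTED_COLUMNS = [
    'BoroughName_ExactCut', 'Length (m)', 'Pollutant',
    'PetrolCar', 'DieselCar', 'PetrolLgv', 'DieselLgv',
    'ElectricCar', 'ElectricLgv',
]

def missing_columns(df):
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]

# Tiny frame with invented values; the two electric columns are left out on purpose
sample = pd.DataFrame({
    'BoroughName_ExactCut': ['A'],
    'Length (m)': [100.0],
    'Pollutant': ['CO2'],
    'PetrolCar': [15.0],
    'DieselCar': [10.0],
    'PetrolLgv': [1.0],
    'DieselLgv': [2.0],
})

print(missing_columns(sample))
```

An empty list means every expected column is present.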

#### Output:
The output is a dictionary of trained K-Means models, keyed by pollutant. Each model groups boroughs into three clusters based on their average emissions per meter.


### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD

### Preprocessing Function

In [4]:
def preprocess(filepath):
    """
    Load the dataset from the given filepath and return one row per borough
    and pollutant with its average emissions per meter.
    """
    # Load dataset
    df = pd.read_csv(filepath)
    
    # Calculate total emissions for each row
    vehicle_emissions = ['PetrolCar', 'DieselCar', 'PetrolLgv', 'DieselLgv', 'ElectricCar', 'ElectricLgv']
    df['Total_Emissions'] = df[vehicle_emissions].sum(axis=1)
    
    # Normalize emissions by length (emissions per meter)
    df['Emissions_Per_M'] = df['Total_Emissions'] / df['Length (m)']
    
    # Calculate average emissions per meter for each borough and pollutant
    avg_emissions = df.groupby(['BoroughName_ExactCut', 'Pollutant'])['Emissions_Per_M'].transform('mean')
    
    # Add average emissions to the dataframe
    df['Avg_Emissions_Per_M'] = avg_emissions

    # Drop unnecessary columns
    df = df.drop(columns=['Total_Emissions', 'Length (m)', 'Emissions_Per_M'] + vehicle_emissions)
    
    # Drop duplicates
    df = df.drop_duplicates(subset=['BoroughName_ExactCut', 'Pollutant'])
    
    return df
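
To make the aggregation concrete, the sketch below applies the same steps to a small in-memory table instead of a CSV file. The values are invented for illustration, and only two vehicle columns are used to keep the arithmetic easy to follow. It also compares the `transform('mean')` plus `drop_duplicates` pattern with a direct `groupby(...).mean()`.

```python
import pandas as pd

# Toy data: values are invented purely for illustration
toy = pd.DataFrame({
    'BoroughName_ExactCut': ['A', 'A', 'B'],
    'Pollutant': ['CO2', 'CO2', 'CO2'],
    'Length (m)': [100.0, 100.0, 40.0],
    'PetrolCar': [15.0, 50.0, 5.0],
    'DieselCar': [10.0, 25.0, 5.0],
})

# Same steps as preprocess(), restricted to the two vehicle columns above
toy['Total_Emissions'] = toy[['PetrolCar', 'DieselCar']].sum(axis=1)
toy['Emissions_Per_M'] = toy['Total_Emissions'] / toy['Length (m)']
keys = ['BoroughName_ExactCut', 'Pollutant']
toy['Avg_Emissions_Per_M'] = toy.groupby(keys)['Emissions_Per_M'].transform('mean')
result = toy.drop_duplicates(subset=keys)[keys + ['Avg_Emissions_Per_M']]

# Equivalent, more direct aggregation to one row per borough and pollutant
direct = toy.groupby(keys, as_index=False)['Emissions_Per_M'].mean()

print(result)
print(direct)
```

Borough A has per-row values of 0.25 and 0.75, which average to 0.5, while borough B has a single row at 0.25. Note that this is a mean of per-row ratios, so short and long segments count equally. A length-weighted figure would divide summed emissions by summed length instead.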

### Model Training

In [None]:
# Filepath to the training dataset
train_filepath = './data/emissions_clean_train.csv'

# Preprocessing the training dataset
train_df = preprocess(train_filepath)

# Define the column transformer
column_transformer = ColumnTransformer([
    ('OneHotEncoder', OneHotEncoder(handle_unknown='ignore'), ['BoroughName_ExactCut'])
], remainder='passthrough')

# List of unique pollutants
unique_pollutants = train_df['Pollutant'].unique()

# Number of clusters
n_clusters = 3

# Dictionary to store the trained K-Means models for each pollutant
models = {}

for pollutant in unique_pollutants:
    # Filter the dataset for the current pollutant
    pollutant_df = train_df[train_df['Pollutant'] == pollutant]
    
    # Exclude 'Pollutant' column as it's not needed for the model input
    X = pollutant_df.drop(['Pollutant'], axis=1)
    
    # Apply the column transformer to prepare the dataset
    X_transformed = column_transformer.fit_transform(X)
    
    # Train the K-Means model
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=999)
    kmeans.fit(X_transformed)
    
    # Store the trained model in the dictionary
    models[pollutant] = kmeans
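
The number of clusters is fixed at 3 above. One common way to sanity-check such a choice is to compare silhouette scores across several candidate values. The sketch below does this on a single synthetic numeric feature standing in for `Avg_Emissions_Per_M`. The data are invented, and the sketch leaves out the one-hot encoded borough columns that the notebook feeds to the model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic one-feature data (invented values) with three well-separated groups
X = np.array([[0.10], [0.12], [0.11],
              [0.50], [0.52], [0.51],
              [0.90], [0.92], [0.91]])

# Fit K-Means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=999).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest score is the best-separated clustering
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On real data the scores are rarely this clear-cut, so the result is best read as one input alongside domain knowledge rather than as a definitive answer.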

### Model Evaluation

In [None]:
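# Retrieve the model trained for the "CO2" pollutant
kmeans_co2 = models["CO2"]

# Rebuild the CO2 feature matrix: after the training loop, X_transformed holds
# the data of the last pollutant processed, which is not necessarily CO2
co2_df = train_df[train_df['Pollutant'] == "CO2"]
X_co2 = column_transformer.fit_transform(co2_df.drop(['Pollutant'], axis=1))

# Apply TruncatedSVD to reduce the dimensionality of the CO2 data to 2 dimensions
svd = TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X_co2)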

# Project the centroids to the 2D space
centroids_svd = svd.transform(kmeans_co2.cluster_centers_)

# Visualize the SVD-reduced data points and centroids
plt.figure(figsize=(10, 7))
plt.scatter(X_svd[:, 0], X_svd[:, 1], alpha=0.5, c=kmeans_co2.labels_, cmap='viridis', marker='o', edgecolor='k', s=50, label='Data Points')
plt.scatter(centroids_svd[:, 0], centroids_svd[:, 1], c='red', s=200, alpha=0.75, marker='X', label='Centroids')

# Label the axes and add a title and legend
plt.xlabel('SVD Component 1')
plt.ylabel('SVD Component 2')
plt.title('K-Means Clustering Visualization with Centroids (SVD)')
plt.legend()
plt.show()
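
Alongside the plot, a table of which boroughs fall into which cluster is often easier to interpret. The sketch below shows one way to build such a table. It uses invented per-borough averages and clusters on the numeric column alone for brevity, which differs from the one-hot encoded input used in the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented per-borough averages for a single pollutant
toy = pd.DataFrame({
    'BoroughName_ExactCut': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Avg_Emissions_Per_M': [0.10, 0.11, 0.50, 0.52, 0.90, 0.91],
})

# Fit K-Means and attach the cluster label to each borough
kmeans = KMeans(n_clusters=3, n_init=10, random_state=999)
toy['Cluster'] = kmeans.fit_predict(toy[['Avg_Emissions_Per_M']])

# Cluster ids are arbitrary, so summarize each cluster by its members and mean
summary = toy.groupby('Cluster').agg(
    boroughs=('BoroughName_ExactCut', list),
    mean_emissions=('Avg_Emissions_Per_M', 'mean'),
).sort_values('mean_emissions')

print(summary)
```

Sorting by the cluster mean gives a stable low-to-high ordering even though the numeric cluster ids can change between runs or seeds.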