# Emissions Model Training with Random Forest

This notebook trains a random forest classifier that assigns boroughs to categories based on CO2 emissions intensity.

#### Objective:
The objective is to train a random forest classifier that categorizes boroughs into Low, Medium, or High Emission Areas, defined here as the bottom, middle, and top thirds of CO2 emissions per meter.

#### Input:
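The input CSV contains the following columns. During preprocessing, the vehicle columns are summed and divided by `Length (m)` to derive the emissions intensity used for labelling: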

| Name                   | Description                                    | Column Name           | Data Type |
|------------------------|------------------------------------------------|-----------------------|-----------|
| Borough Name           | Exact borough name                             | BoroughName_ExactCut  | Object    |
| Length (m)             | Length in meters (e.g., length of roads)       | Length (m)            | Float64   |
| Pollutant              | Pollutant name; rows are filtered to CO2       | Pollutant             | Object    |
| Petrol Car             | Amount of pollution caused by petrol cars      | PetrolCar             | Float64   |
| Diesel Car             | Amount of pollution caused by diesel cars      | DieselCar             | Float64   |
| Petrol LGV             | Amount of pollution caused by petrol LGVs      | PetrolLgv             | Float64   |
| Diesel LGV             | Amount of pollution caused by diesel LGVs      | DieselLgv             | Float64   |
| Electric Car           | Amount of pollution caused by electric cars    | ElectricCar           | Float64   |
| Electric LGV           | Amount of pollution caused by electric LGVs    | ElectricLgv           | Float64   |

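Before training, it can help to confirm that a CSV actually provides the columns listed in the table. The following is a minimal sketch rather than part of the original notebook: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the toy frame with made-up values stands in for the real file.

```python
import pandas as pd

# Column names taken from the input table above
EXPECTED_COLUMNS = [
    'BoroughName_ExactCut', 'Length (m)', 'Pollutant',
    'PetrolCar', 'DieselCar', 'PetrolLgv', 'DieselLgv',
    'ElectricCar', 'ElectricLgv',
]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]

# Toy frame standing in for the real CSV; all values are made up
toy = pd.DataFrame({
    'BoroughName_ExactCut': ['Camden'],
    'Length (m)': [120.0],
    'Pollutant': ['CO2'],
    'PetrolCar': [1.5],
    'DieselCar': [2.0],
})

print(missing_columns(toy))
# → ['PetrolLgv', 'DieselLgv', 'ElectricCar', 'ElectricLgv']
```

An empty list means every column that `preprocess` relies on is present.
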
#### Output:
The output is a trained random forest classifier, wrapped in a scikit-learn pipeline that one-hot encodes the borough name, which categorizes boroughs into Low, Medium, or High Emission Areas based on CO2 emissions intensity.


### Imports

In [11]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

### Preprocessing Function

In [12]:
def preprocess(filepath):
    """
    Load the dataset from the given filepath, derive an emission category
    for each row, and return the preprocessed DataFrame.
    """
    # Load dataset
    df = pd.read_csv(filepath)

    # Keep only CO2 rows; copy so the column assignments below do not
    # write to a slice of the original frame (avoids SettingWithCopyWarning)
    df = df[df['Pollutant'].str.contains("CO2")].copy()

    # Calculate total emissions for each row
    vehicle_emissions = ['PetrolCar', 'DieselCar', 'PetrolLgv', 'DieselLgv', 'ElectricCar', 'ElectricLgv']
    df['Total_Emissions'] = df[vehicle_emissions].sum(axis=1)

    # Standardize emissions by length (emissions per meter)
    df['Emissions_Per_M'] = df['Total_Emissions'] / df['Length (m)']

    # Calculate percentile ranks for Emissions_Per_M (relative to this file only)
    df['PercentileRank'] = df['Emissions_Per_M'].rank(pct=True)

    # Assign emission categories based on percentile rank (thirds)
    df['EmissionCategory'] = pd.cut(df['PercentileRank'],
                                    bins=[0, 1/3, 2/3, 1],
                                    labels=['Low', 'Medium', 'High'],
                                    include_lowest=True)

    # Drop helper columns. Note that Emissions_Per_M is kept, so it reaches the
    # model as a feature even though EmissionCategory is derived from it.
    df = df.drop(columns=['Pollutant', 'Total_Emissions', 'PercentileRank'] + vehicle_emissions)

    return df

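To make the labelling step concrete, here is a small sketch of the percentile-rank and `pd.cut` logic on five made-up values. The `toy` frame is an assumption for illustration; the bins and labels are the ones used in `preprocess`.

```python
import pandas as pd

# Made-up emissions-per-meter values for five rows
toy = pd.DataFrame({'Emissions_Per_M': [0.5, 3.0, 1.0, 6.0, 2.0]})

# Percentile rank in (0, 1]: the smallest value gets 0.2, the largest 1.0
toy['PercentileRank'] = toy['Emissions_Per_M'].rank(pct=True)

# Same bins as in preprocess: bottom, middle, and top third of the ranks
toy['EmissionCategory'] = pd.cut(toy['PercentileRank'],
                                 bins=[0, 1/3, 2/3, 1],
                                 labels=['Low', 'Medium', 'High'],
                                 include_lowest=True)

print(toy.sort_values('Emissions_Per_M'))
# Ranks 0.2 -> Low, 0.4 and 0.6 -> Medium, 0.8 and 1.0 -> High
```

Because the ranks are computed within whichever frame is passed in, the category boundaries are relative to that file rather than fixed emission thresholds.
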
### Model Training

In [13]:
# Filepath to the training dataset
train_filepath = './data/emissions_clean_train.csv'

# Preprocessing the training dataset
train_df = preprocess(train_filepath)

# Prepare features and target for training
X_training = train_df.drop('EmissionCategory', axis=1)
y_training = train_df['EmissionCategory']

# Encoding categorical variables using One-Hot Encoding
column_transformer = ColumnTransformer([
    ('OneHotEncoder', OneHotEncoder(handle_unknown='ignore'), ['BoroughName_ExactCut'])
], remainder='passthrough')

# Pipeline for encoding and classification
pipeline = Pipeline([
    ('transformation', column_transformer),
    ('classification', RandomForestClassifier(n_estimators=111, random_state=999))
])

# Training the model
pipeline.fit(X_training, y_training)

### Model Evaluation

In [14]:
# Filepath to the test dataset
test_filepath = './data/emissions_clean_test.csv'

# Preprocessing the test dataset
test_df = preprocess(test_filepath)

# Prepare features and target for testing
X_testing = test_df.drop('EmissionCategory', axis=1)
y_testing = test_df['EmissionCategory']

# Making predictions with the trained model
y_prediction = pipeline.predict(X_testing)

print(classification_report(y_testing, y_prediction))

              precision    recall  f1-score   support

        High       1.00      0.99      1.00      2443
         Low       1.00      1.00      1.00      2442
      Medium       0.99      1.00      1.00      2442

    accuracy                           1.00      7327
   macro avg       1.00      1.00      1.00      7327
weighted avg       1.00      1.00      1.00      7327

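A confusion matrix complements the classification report by showing which categories are mistaken for which. The sketch below uses made-up labels (`toy_true` and `toy_pred` are hypothetical) so that it runs without the dataset, and it derives per-class recall from the matrix to show how that column of the report is computed.

```python
from sklearn.metrics import confusion_matrix

labels = ['Low', 'Medium', 'High']

# Made-up true and predicted categories for six rows
toy_true = ['Low', 'Low', 'Medium', 'Medium', 'High', 'High']
toy_pred = ['Low', 'Low', 'Medium', 'High', 'High', 'Medium']

# Rows are true classes, columns are predicted classes, in the order of `labels`
matrix = confusion_matrix(toy_true, toy_pred, labels=labels)
print(matrix)
# → [[2 0 0]
#    [0 1 1]
#    [0 1 1]]

# Recall per class = correct predictions (diagonal) / true count (row sum)
recall = matrix.diagonal() / matrix.sum(axis=1)
print(dict(zip(labels, recall.tolist())))
# → {'Low': 1.0, 'Medium': 0.5, 'High': 0.5}
```

For the notebook's results, the equivalent call would be `confusion_matrix(y_testing, y_prediction, labels=['Low', 'Medium', 'High'])`.
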
