# Emissions Model Training with Random Forest

This notebook trains a random forest classifier that assigns boroughs to categories based on CO2 emissions intensity.

#### Objective:
The objective is to train a random forest classifier that categorises boroughs into Low, Medium, or High Emission Areas.

#### Input:
The input data consists of the following features:

| Name                   | Description                                 | Column Name           | Data Type |
|------------------------|---------------------------------------------|-----------------------|-----------|
| Borough Name           | Exact borough name                          | BoroughName_ExactCut  | Object    |
| Pollutant              | Pollutant type (e.g. PM25_Exhaust, CO2)     | Pollutant             | Object    |
| Petrol Car             | Amount of pollution caused by petrol cars   | PetrolCar             | Float64   |
| Diesel Car             | Amount of pollution caused by diesel cars   | DieselCar             | Float64   |
| Petrol LGV             | Amount of pollution caused by petrol LGVs   | PetrolLgv             | Float64   |
| Diesel LGV             | Amount of pollution caused by diesel LGVs   | DieselLgv             | Float64   |
| Electric Car           | Amount of pollution caused by electric cars | ElectricCar           | Float64   |
| Electric LGV           | Amount of pollution caused by electric LGVs | ElectricLgv           | Float64   |

#### Output:
The trained random forest classifier categorises boroughs into Low, Medium, or High Emission Areas based on CO2 emissions intensity.


### Imports

In [57]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder

In [15]:
df = pd.read_csv('.\\data\\LAEI2013_MajorRoads_EmissionsbyLink_2013.csv')

### Reading the Dataset

In [17]:
df = pd.read_csv('./data/emissions_clean_train.csv')

In [108]:
df.head(10)

Unnamed: 0,BoroughName_ExactCut,Length (m),Pollutant,PetrolCar,DieselCar,PetrolLgv,DieselLgv,ElectricCar,ElectricLgv,Total_Emissions,Emissions_Category
0,Sutton,45.1,PM25_Exhaust,0.001102,0.00241,0.0,0.001491,,,0.005002,Medium
1,Bromley,62.0,PM25_Tyre,0.001,0.001,0.0,0.001115,0.0,0.0,0.003115,Low
2,Croydon,111.9,PM25_Brake,,,,,,,0.0,Low
3,Hounslow,10.0,PM10_Tyre,0.001757,0.001422,0.0,0.001239,0.0,0.0,0.004418,Medium
4,Bromley,97.351163,PM10_Exhaust,,,,,,,0.0,Low
5,City of Westminster,78.7,PM25_Tyre,0.001,0.001274,0.0,0.001133,0.0,0.0,0.003408,Medium
6,City of Westminster,56.1,PM10_Brake,0.002,0.001,0.001,0.001,0.0,0.0,0.005,Medium
7,Bromley,317.2,PM10_Tyre,0.002,0.001,0.0,0.001196,0.0,0.0,0.004196,Medium
8,Richmond,29.0,PM10_Brake,0.002,0.001,0.0,0.001,0.0,0.0,0.004,Medium
9,Hillingdon,13.0,PM10_Brake,0.006253,0.004248,0.0,0.002403,0.0,0.0,0.012905,Medium


### One-Hot Encoding Categorical Columns

In [109]:
# Isolate the categorical columns from the rest of the dataframe
categorical_features = df[['BoroughName_ExactCut', 'Pollutant']]

# One hot encode the categorical_features
encoding = OneHotEncoder(handle_unknown='ignore')
X_encoded = encoding.fit_transform(categorical_features)

### Merge encoded dataframe

In [82]:
# Get column names of the encoded feature columns
encoded_column_names = encoding.get_feature_names_out(input_features=['BoroughName_ExactCut', 'Pollutant'])

# Convert the encoded sparse matrix into a new dataframe
encoded_df = pd.DataFrame(X_encoded.toarray(), columns=encoded_column_names)

# Merge the original dataframe with the encoded DataFrame
merged_df = pd.concat([df.drop(columns=['BoroughName_ExactCut', 'Pollutant']), encoded_df], axis=1)

In [83]:
merged_df

Unnamed: 0,Length (m),PetrolCar,DieselCar,PetrolLgv,DieselLgv,ElectricCar,ElectricLgv,Total_Emissions,Emissions_Category,BoroughName_ExactCut_Barking and Dagenham,...,Pollutant_CO2,Pollutant_NOx,Pollutant_PM10_Brake,Pollutant_PM10_Exhaust,Pollutant_PM10_Resusp,Pollutant_PM10_Tyre,Pollutant_PM25_Brake,Pollutant_PM25_Exhaust,Pollutant_PM25_Resusp,Pollutant_PM25_Tyre
0,45.100000,0.001102,0.002410,0.000,0.001491,,,0.005002,Medium,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,62.000000,0.001000,0.001000,0.000,0.001115,0.0,0.0,0.003115,Low,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,111.900000,,,,,,,0.000000,Low,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,10.000000,0.001757,0.001422,0.000,0.001239,0.0,0.0,0.004418,Medium,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,97.351163,,,,,,,0.000000,Low,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292971,98.500000,20.303000,11.170000,0.088,3.985000,,,35.546000,Medium,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
292972,24.500000,0.001093,0.001000,0.000,0.001486,,,0.003579,Low,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
292973,8.600000,0.004449,0.003059,0.001,0.002013,0.0,0.0,0.010521,Medium,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
292974,118.800000,0.005000,0.003000,0.000,0.001000,0.0,0.0,0.009000,Medium,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


### Categorising Vehicle Emissions Per Borough (Low, Medium, High)

In [105]:
vehicle_emissions = ['PetrolCar', 'DieselCar', 'PetrolLgv', 'DieselLgv', 'ElectricCar', 'ElectricLgv']

df['Total_Emissions'] = df[vehicle_emissions].sum(axis=1)

lower_threshold = df['Total_Emissions'].quantile(1/5)
upper_threshold = df['Total_Emissions'].quantile(2/5)

def Categorise_Emission_Levels(total_emissions):
    # Low: below the lower threshold
    if total_emissions < lower_threshold:
        return 'Low'
    # Medium: between the two thresholds (inclusive)
    if total_emissions <= upper_threshold:
        return 'Medium'
    # High: above the upper threshold
    return 'High'

# Apply the function to create a new column that labels each row Low/Medium/High
df['Emissions_Category'] = df['Total_Emissions'].apply(Categorise_Emission_Levels)
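
A three-way split like this can also be written without a row-by-row `apply`, which tends to be faster on a few hundred thousand rows. The sketch below is one possible alternative, not the notebook's method: it uses `np.select` on a made-up `totals` series and assumes Medium covers values from the lower to the upper threshold inclusive.

```python
import numpy as np
import pandas as pd

# Made-up totals, for illustration only
totals = pd.Series([0.0, 0.002, 0.004, 0.006, 0.008, 0.010, 0.5, 1.0, 20.0, 35.0])

# Same quantile-based thresholds as in the cell above
lower = totals.quantile(1 / 5)
upper = totals.quantile(2 / 5)

# np.select returns the choice for the first condition that is True,
# and the default where none is True
conditions = [totals < lower, totals <= upper]
categories = np.select(conditions, ["Low", "Medium"], default="High")

print(pd.Series(categories).value_counts())
```

Because the conditions are checked in order, the second one only applies to values that are not already Low, so no value can fall between categories.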

In [107]:
# Features: every column except the target label
X = merged_df.drop(columns=['Emissions_Category'])

# Target: the label column itself (one entry per row), not a list holding the column name
y = merged_df['Emissions_Category']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

rf_classifier.fit(X_train, y_train)

y_pred = rf_classifier.predict(X_test)
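
Once training runs, the predictions can be scored with the metrics imported at the top of the notebook. The sketch below shows the idea on synthetic data only: `X_demo`, `y_demo` and the other names are hypothetical stand-ins, and the scores it prints say nothing about the emissions model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one feature, label determined by its value
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 3, size=(300, 1))
y_demo = np.select(
    [X_demo[:, 0] < 1, X_demo[:, 0] < 2], ["Low", "Medium"], default="High"
)

# X and y must have the same number of rows before splitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Fix the label order so the matrix rows and columns are easy to read
labels = ["Low", "Medium", "High"]
acc = accuracy_score(y_te, y_hat)
cm = confusion_matrix(y_te, y_hat, labels=labels)  # rows: true, columns: predicted

print(f"accuracy: {acc:.3f}")
print(cm)
```

Passing `labels` explicitly keeps the confusion matrix in Low/Medium/High order rather than alphabetical order.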