# Final Assignment Notebook

This notebook contains Group D's code for the Final Assignment of CSCK503 Machine Learning in Practice. The scenario for building the ML model is described below.

You have been engaged as a contract data scientist by Athana Data Science Services (ADSS), a small company specialising in the provision of data science consultancy services to public and private sector organisations. ADSS have just been awarded a contract by a government department (the Department of Environment) to help with the development of machine learning-based models for predicting atmospheric emissions (and pollution) from data gathered by various borough and county environment monitoring units. Your team leader wants you to assist with this project, and you will be required to carry out a number of tasks using the Anaconda/Scikit-Learn Python ML framework and its components.

## Import Libraries

In [7]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

## Load in datasets and concatenate dataframes

In [2]:
df_2008_org = pd.read_excel('Major Roads/LAEI2013_MajorRoads_EmissionsbyLink_2008.xlsx')
df_2010_org = pd.read_excel('Major Roads/LAEI2013_MajorRoads_EmissionsbyLink_2010.xlsx')
df_2013_org = pd.read_excel('Major Roads/LAEI2013_MajorRoads_EmissionsbyLink_2013.xlsx')
df_2020_org = pd.read_excel('Major Roads/LAEI2013_MajorRoads_EmissionsbyLink_2020.xlsx')

main_df = pd.concat([df_2008_org, df_2010_org, df_2013_org, df_2020_org])
main_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1574746 entries, 0 to 402841
Data columns (total 35 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   GridId                1574746 non-null  int64  
 1   Toid                  1574746 non-null  int64  
 2   GRID_ExactCut_ID      1574746 non-null  int64  
 3   Location_ExactCut     1574746 non-null  object 
 4   BoroughExactCut       402842 non-null   object 
 5   Lts                   1574746 non-null  int64  
 6   Length (m)            1574746 non-null  float64
 7   Emissions             1574746 non-null  object 
 8   Year                  1574746 non-null  int64  
 9   Pollutant             1574746 non-null  object 
 10  Emissions Unit        1574746 non-null  object 
 11  Motorcycle            1574746 non-null  float64
 12  Taxi                  1574746 non-null  float64
 13  Car                   1574746 non-null  float64
 14  BusAndCoach           1574746 non-null  
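
The `Index: 1574746 entries, 0 to 402841` line in the output above indicates that the row labels of the individual yearly frames were kept, so labels repeat across years. The following is a minimal sketch, using two tiny made-up frames rather than the real workbooks, of how `ignore_index=True` in `pd.concat` renumbers the rows. Whether the notebook needs unique labels depends on later steps that are not shown here.

```python
import pandas as pd

# Two tiny stand-in frames (made-up values); the real yearly workbooks are far larger.
df_a = pd.DataFrame({"Year": [2008, 2008], "Car": [1.0, 2.0]})
df_b = pd.DataFrame({"Year": [2010, 2010], "Car": [3.0, 4.0]})

# Default behaviour: each frame keeps its own row labels, so labels repeat.
stacked = pd.concat([df_a, df_b])

# ignore_index=True discards the old labels and renumbers the rows from 0.
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```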

## Function to convert all object columns to Numeric Encoding

In [3]:
def convert_object_columns_to_numeric(df):
    """
        This function takes a dataframe, finds all object-type columns, and label-encodes each one with a LabelEncoder.
        The returned copy therefore contains no object columns; the input dataframe is left unchanged.
    """

    label_encoder = LabelEncoder()
    categorical_columns = df.select_dtypes(include=['object']).columns # Code to find all column names that are object type
    df_encoded = df.copy()

    for col in categorical_columns:
        df_encoded[col] = label_encoder.fit_transform(df_encoded[col]) # Label encode each object type column
        df_encoded[col] = df_encoded[col].astype(int) # And convert the column to integer

    df_encoded.info()
    return df_encoded
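
Label encoding sorts the distinct values of a column and numbers them from 0. The sketch below illustrates this on a made-up frame that reuses two column names from the dataset. Because `BoroughExactCut` contains missing values in the `info()` output, the sketch fills them with an explicit `"missing"` category first. That step is an assumption added for illustration and is not part of the function above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; only the column names come from the notebook.
toy = pd.DataFrame({
    "Pollutant": pd.Series(["NOx", "CO2", "NOx", "PM10"], dtype="object"),
    "BoroughExactCut": pd.Series(["Camden", np.nan, "Barnet", "Camden"], dtype="object"),
    "Length (m)": [10.0, 20.0, 30.0, 40.0],
})

encoded = toy.copy()
for col in encoded.select_dtypes(include=["object"]).columns:
    # Assumption: missing values get their own "missing" category before encoding.
    filled = encoded[col].fillna("missing").astype(str)
    # A fresh encoder per column; classes are sorted, then numbered from 0.
    encoded[col] = LabelEncoder().fit_transform(filled)

print(encoded)
```

Note that the resulting integers carry an ordering (for example `CO2` < `NOx`) that has no physical meaning, which can matter for models such as linear regression.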

In [4]:
def split_data_based_on_target_label(df, target_column, test_year):
    """
        This function takes a dataframe, a target column, and a test year, and splits the dataframe into features (X) and target (y) based on the target column.
        Rows with a Year value below test_year form the training set; rows with a Year equal to or above test_year form the test set.
        E.g. if test_year = 2020 then all rows < 2020 are training rows and all rows >= 2020 are test rows.
    """
    train_set = df[df['Year'] < test_year]
    test_set = df[df['Year'] >= test_year]

    X_train = train_set.drop(target_column, axis=1)
    y_train = train_set[target_column]

    X_test = test_set.drop(target_column, axis=1)
    y_test = test_set[target_column]

    return X_train, X_test, y_train, y_test
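
To make the year-based split concrete, here is a small sketch on made-up rows. It applies the same `Year < test_year` rule as the function above and checks that every training year precedes every test year. The numeric values are illustrative only.

```python
import pandas as pd

# Made-up rows; the column names follow the notebook, the values do not.
toy = pd.DataFrame({
    "Year": [2008, 2010, 2013, 2020, 2020],
    "Length (m)": [10.0, 12.0, 9.0, 11.0, 14.0],
    "PetrolCar": [1.0, 1.2, 0.9, 0.7, 0.8],
})

test_year = 2020
target_column = "PetrolCar"

# Rows before test_year train the model; rows from test_year onwards test it.
train_mask = toy["Year"] < test_year
X_train = toy.loc[train_mask].drop(columns=target_column)
y_train = toy.loc[train_mask, target_column]
X_test = toy.loc[~train_mask].drop(columns=target_column)
y_test = toy.loc[~train_mask, target_column]

print(len(X_train), len(X_test))  # 3 2
```

Unlike a random split, this keeps the evaluation strictly forward in time, at the cost of testing on a single year.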

In [5]:
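def train_and_analyze_model(X_train, y_train, X_test, y_test, model, model_type):
    """
        Fits the given model on the training set, prints MSE, RMSE, MAE and R2 for the test set,
        plots actual vs. predicted values, and returns the MSE and RMSE.
    """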
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    print("Mean Squared Error (MSE):", mse)

    # Root Mean Squared Error (RMSE)
    rmse = np.sqrt(mse)
    print("Root Mean Squared Error (RMSE):", rmse)

    # Mean Absolute Error (MAE)
    mae = mean_absolute_error(y_test, predictions)
    print("Mean Absolute Error (MAE):", mae)

    # R-squared (R2) Score
    r2 = r2_score(y_test, predictions)
    print("R-squared (R2) Score:", r2)

    # Plotting actual vs predicted values
    plt.scatter(y_test, predictions)
    plt.xlabel('Actual values')
    plt.ylabel('Predicted values')
    plt.title(model_type + ' Actual vs. Predicted Values')
    # Plotting the identity line; perfect predictions would lie on this line
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
    plt.show()
    
    return mse, rmse
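
As a worked example of the four metrics printed above, the sketch below computes each one by hand with NumPy on four made-up values and checks the result against the scikit-learn functions used in the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])
residuals = y_true - y_pred  # [1, 0, -2, 0]

# MSE: mean of squared residuals = (1 + 0 + 4 + 0) / 4 = 1.25
mse = np.mean(residuals ** 2)
# RMSE: square root of MSE, in the same units as the target
rmse = np.sqrt(mse)
# MAE: mean of absolute residuals = (1 + 0 + 2 + 0) / 4 = 0.75
mae = np.mean(np.abs(residuals))
# R2: 1 - (residual sum of squares / total sum of squares around the mean)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-computed values agree with scikit-learn.
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(mse, mae)  # 1.25 0.75
```

Because RMSE is a monotonic function of MSE, ranking models by either metric always selects the same model.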

In [6]:
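def split_dataset_and_train_models(df, target_column, model_list, test_year=2020):
    """
        Encodes the dataframe, splits it by year, then trains and compares every model in model_list.
        The test_year default of 2020 is an assumption taken from the example in split_data_based_on_target_label.
    """
    df_encoded = convert_object_columns_to_numeric(df)
    X_train, X_test, y_train, y_test = split_data_based_on_target_label(df_encoded, target_column, test_year)
    lowest_mse_algorithm = ""
    lowest_mse = float("inf")
    lowest_rmse_algorithm = ""
    lowest_rmse = float("inf")
    for model in model_list:
        if model == "NeuralNetwork":
            # The neural network is not evaluated by train_and_analyze_model, so skip it here
            # rather than comparing against metrics left over from the previous model.
            continue

        mse, rmse = train_and_analyze_model(X_train, y_train, X_test, y_test, model_list[model], model)

        if mse < lowest_mse:
            lowest_mse = mse
            lowest_mse_algorithm = model

        if rmse < lowest_rmse:
            lowest_rmse = rmse
            lowest_rmse_algorithm = model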

    print(f"Best performing algorithm according to Mean Squared Error is {lowest_mse_algorithm} and has a value of {lowest_mse}")
    print(f"Best performing algorithm according to Root Mean Squared Error is {lowest_rmse_algorithm} and has a value of {lowest_rmse}")
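
The comparison loop above can also be expressed by collecting each model's score in a dictionary and taking the minimum. The sketch below does this on synthetic data (a noiseless line, split in order so the test inputs lie beyond the training range), which is an assumption chosen to keep the result deterministic. It is not the emissions dataset, and the `results` dictionary is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: y = 2x + 1, with the last five points held out for testing.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
X_train, X_test = X[:15], X[15:]
y_train, y_test = y[:15], y[15:]

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
}

# Fit each model and record its test-set MSE under the model's name.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = mean_squared_error(y_test, model.predict(X_test))

# The best model is the key with the smallest recorded MSE.
best = min(results, key=results.get)
print(f"Best performing algorithm according to MSE is {best}")
```

On this data the tree cannot extrapolate beyond the largest training target, so the linear model wins. On real data the ranking may differ.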

In [None]:
# build_neural_network is expected to be defined in an earlier cell
model_list = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(),
    "NeuralNetwork": build_neural_network(),
}


split_dataset_and_train_models(main_df, "PetrolCar", model_list)