# Emissions Data Clean-Up

### Introduction

This notebook cleans the emissions data to prepare it for further analysis or modeling.

First, the required libraries are imported and the dataset is loaded. The data is then cleaned by removing unnecessary columns, handling missing and zero values, and splitting the result into training and testing sets.

### Imports

In [27]:
import pandas as pd
from sklearn.model_selection import train_test_split

### Reading the Dataset

In [28]:
# Alternative: read the data from the original Excel file (kept commented out; the CSV export is loaded in the next cell)
# df = pd.read_excel('./data/LAEI2013_MajorRoads_EmissionsbyLink_2013.xlsx')

In [29]:
df = pd.read_csv('./data/LAEI2013_MajorRoads_EmissionsbyLink_2013.csv')

### Removing Unnecessary Columns

In [30]:
# Following the identified use cases, columns that are not needed for further processing are dropped.
columns_to_drop = [
    'GridId', 'Toid', 'GRID_ExactCut_ID', 'Location_ExactCut', 'Lts',
    'Emissions', 'Year', 'Emissions Unit',
    'Motorcycle', 'Taxi', 'Car', 'BusAndCoach', 'Lgv', 'Rigid', 'Artic',
    'Rigid2Axle', 'Rigid3Axle', 'Rigid4Axle',
    'Artic3Axle', 'Artic5Axle', 'Artic6Axle', 'LtBus', 'Coach',
]
df = df.drop(columns=columns_to_drop)

### Saving the File With Required Columns Only

In [31]:
# The reduced dataset is saved as a CSV file so that later steps can work with a lighter file.
df.to_csv('./data/emissions_required_columns_only.csv', index=False)

### Reading the Light Dataset

In [32]:
# The lighter file is read back into df to start the data cleaning process.
df = pd.read_csv('./data/emissions_required_columns_only.csv')
df

Unnamed: 0,BoroughName_ExactCut,Length (m),Pollutant,PetrolCar,DieselCar,PetrolLgv,DieselLgv,ElectricCar,ElectricLgv
0,NonGLA,50.8,CO2,8.761,4.811,0.038,1.735,-,-
1,NonGLA,28.6,CO2,0.016,0.009,-,-,-,-
2,NonGLA,5.1,CO2,0.939,0.519,0.004,0.184,-,-
3,NonGLA,3.8,CO2,0.692,0.382,0.003,0.136,-,-
4,NonGLA,1.6,CO2,0.299,0.165,0.001,0.059,-,-
...,...,...,...,...,...,...,...,...,...
366215,NonGLA,68.3,PM25_Tyre,0.000,0.000,0.000,0.000,0.000,0.000
366216,NonGLA,18.4,PM25_Tyre,0.000,0.000,0.000,0.000,0.000,0.000
366217,NonGLA,150.7,PM25_Tyre,0.001,0.001,0.000,0.000,0.000,0.000
366218,NonGLA,27.5,PM25_Tyre,0.000,0.000,0.000,0.000,0.000,0.000


### Checking Data Types

In [33]:
# The data types of the columns are checked to ensure that they are correct.
df.dtypes

BoroughName_ExactCut     object
Length (m)              float64
Pollutant                object
PetrolCar                object
DieselCar                object
PetrolLgv                object
DieselLgv                object
ElectricCar              object
ElectricLgv              object
dtype: object
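
The `object` dtype on the six emission columns is consistent with the `-` placeholders visible in the preview: a single non-numeric string is enough for pandas to keep a whole column as text. The following is a minimal sketch on made-up data (the inline CSV, its borough labels, and its values are illustrative, not taken from the LAEI file). Assuming `-` is the only placeholder in use, it shows how the `na_values` argument of `read_csv` could parse such columns as floats at load time:

```python
import io

import pandas as pd

# Hypothetical three-row sample mimicking the layout of the emissions file.
sample_csv = io.StringIO(
    "BoroughName_ExactCut,Length (m),Pollutant,PetrolCar,ElectricCar\n"
    "NonGLA,50.8,CO2,8.761,-\n"
    "NonGLA,28.6,CO2,0.016,-\n"
    "Camden,5.1,CO2,-,0.000\n"
)

# Default parsing: a "-" anywhere in a column keeps that column as text.
raw = pd.read_csv(sample_csv)

# Treating "-" as missing lets pandas infer a float dtype for the same columns.
sample_csv.seek(0)
parsed = pd.read_csv(sample_csv, na_values="-")

print(raw.dtypes["PetrolCar"], parsed.dtypes["PetrolCar"])
```

If other placeholders occur in the real file, they would have to be added to `na_values` as well.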

### Statistics

In [34]:
# The first 5 rows of the data are displayed to understand the structure of the data.
df.head()

Unnamed: 0,BoroughName_ExactCut,Length (m),Pollutant,PetrolCar,DieselCar,PetrolLgv,DieselLgv,ElectricCar,ElectricLgv
0,NonGLA,50.8,CO2,8.761,4.811,0.038,1.735,-,-
1,NonGLA,28.6,CO2,0.016,0.009,-,-,-,-
2,NonGLA,5.1,CO2,0.939,0.519,0.004,0.184,-,-
3,NonGLA,3.8,CO2,0.692,0.382,0.003,0.136,-,-
4,NonGLA,1.6,CO2,0.299,0.165,0.001,0.059,-,-


In [35]:
# The last 5 rows of the data are displayed to understand the structure of the data.
df.tail()

Unnamed: 0,BoroughName_ExactCut,Length (m),Pollutant,PetrolCar,DieselCar,PetrolLgv,DieselLgv,ElectricCar,ElectricLgv
366215,NonGLA,68.3,PM25_Tyre,0.0,0.0,0.0,0.0,0.0,0.0
366216,NonGLA,18.4,PM25_Tyre,0.0,0.0,0.0,0.0,0.0,0.0
366217,NonGLA,150.7,PM25_Tyre,0.001,0.001,0.0,0.0,0.0,0.0
366218,NonGLA,27.5,PM25_Tyre,0.0,0.0,0.0,0.0,0.0,0.0
366219,NonGLA,27.4,PM25_Tyre,0.0,0.0,0.0,0.0,0.0,0.0


In [36]:
# Summary statistics are computed for all columns, including the non-numeric ones.
df.describe(include = "all")

Unnamed: 0,BoroughName_ExactCut,Length (m),Pollutant,PetrolCar,DieselCar,PetrolLgv,DieselLgv,ElectricCar,ElectricLgv
count,366220,366220.0,366220,366220.0,366220.0,366220.0,366220.0,366220.0,366220.0
unique,34,,10,20594.0,16189.0,875.0,10148.0,2.0,2.0
top,NonGLA,,CO2,0.0,0.0,0.0,0.0,0.0,0.0
freq,65380,,36622,195122.0,186255.0,312451.0,228191.0,211488.0,209388.0
mean,,64.030886,,,,,,,
std,,82.558449,,,,,,,
min,,0.0,,,,,,,
25%,,12.8,,,,,,,
50%,,38.8,,,,,,,
75%,,82.9,,,,,,,


In [37]:
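# Loop through each column and print how many cells equal the number 0.
# Columns still stored as text (object dtype) report 0 here, because a string such as '0.000' never equals the number 0.
for column in df.columns:
    zero_count = (df[column] == 0).sum()
    print(f"{column}: {zero_count}")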

BoroughName_ExactCut: 0
Length (m): 4490
Pollutant: 0
PetrolCar: 0
DieselCar: 0
PetrolLgv: 0
DieselLgv: 0
ElectricCar: 0
ElectricLgv: 0
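
Because the comparison above is made against the number 0, it cannot detect zeros or `-` placeholders in columns that are still stored as text. A minimal sketch, using a made-up column and assuming `-` marks a missing value, counts the cases separately after coercing to numeric:

```python
import pandas as pd

# Made-up values standing in for one emission column read as text.
# dtype=object mirrors the dtype reported by df.dtypes in the notebook.
col = pd.Series(["8.761", "0.000", "-", "0.016", "0.000"], dtype=object, name="PetrolCar")

# Comparing strings with the number 0 never matches.
naive_zero_count = int((col == 0).sum())

# After coercion, "-" becomes NaN and "0.000" becomes 0.0.
numeric = pd.to_numeric(col, errors="coerce")
zero_count = int((numeric == 0).sum())
missing_count = int(numeric.isna().sum())

print(f"naive zeros: {naive_zero_count}, zeros: {zero_count}, missing: {missing_count}")
```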


### Data Cleaning

In [38]:
# The following code is used to replace the 0 values in the emission columns 
# with the mean value for that borough and pollutant.

emission_columns = ['PetrolCar', 'DieselCar', 'PetrolLgv', 'DieselLgv', 'ElectricCar', 'ElectricLgv']

# For each emission column, replace 0 values with the mean value for that borough and pollutant
for column in emission_columns:
    # Convert the column to numeric; non-numeric entries such as '-' become NaN
    df[column] = pd.to_numeric(df[column], errors='coerce')
    
    # Calculate mean values for each borough and pollutant combination, excluding 0 values
    means = df[df[column] > 0].groupby(['BoroughName_ExactCut', 'Pollutant'])[column].mean()
    
    # Iterate through the multi-index (borough, pollutant) to replace 0s with the calculated mean
    for index, mean_value in means.items():
        borough, pollutant = index
        df.loc[(df['BoroughName_ExactCut'] == borough) 
               & (df['Pollutant'] == pollutant) 
               & (df[column] == 0), column] = mean_value

# Display the first few rows to verify the changes
print(df.head())

  BoroughName_ExactCut  Length (m) Pollutant  PetrolCar  DieselCar  PetrolLgv  \
0               NonGLA        50.8       CO2      8.761      4.811      0.038   
1               NonGLA        28.6       CO2      0.016      0.009        NaN   
2               NonGLA         5.1       CO2      0.939      0.519      0.004   
3               NonGLA         3.8       CO2      0.692      0.382      0.003   
4               NonGLA         1.6       CO2      0.299      0.165      0.001   

   DieselLgv  ElectricCar  ElectricLgv  
0      1.735          NaN          NaN  
1        NaN          NaN          NaN  
2      0.184          NaN          NaN  
3      0.136          NaN          NaN  
4      0.059          NaN          NaN  
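
The nested loop above builds one boolean mask per (borough, pollutant) pair. As a possible alternative, the same replacement can be sketched with `groupby(...).transform("mean")`. The helper name `fill_zeros_with_group_mean` and the toy rows are hypothetical and only serve the illustration; they are not part of the notebook:

```python
import pandas as pd

# Made-up rows; borough labels and values are for illustration only.
toy = pd.DataFrame({
    "BoroughName_ExactCut": ["A", "A", "A", "B", "B"],
    "Pollutant": ["CO2", "CO2", "CO2", "CO2", "CO2"],
    "PetrolCar": ["2.0", "0.000", "4.0", "0.000", "-"],
})


def fill_zeros_with_group_mean(frame, column, keys):
    """Replace zeros in `column` with the mean of the positive values in the same group."""
    values = pd.to_numeric(frame[column], errors="coerce")
    # Keep only positive values, so zeros and NaN do not enter the mean.
    positive = values.where(values > 0)
    group_mean = positive.groupby([frame[k] for k in keys]).transform("mean")
    # Replace a zero only when its group has a mean to offer.
    replace = (values == 0) & group_mean.notna()
    return values.mask(replace, group_mean)


toy["PetrolCar"] = fill_zeros_with_group_mean(
    toy, "PetrolCar", ["BoroughName_ExactCut", "Pollutant"]
)
print(toy)
```

As in the loop version, a zero in a group without any positive value is left unchanged, and entries that were `-` remain `NaN`, which matches the `NaN` values still visible in the output above.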


In [39]:
# The following code is used to replace the 0 values in the length column
# with the mean value for that borough.

length = ['Length (m)']

# For the length column, replace 0 values with the mean value for that borough
for column in length:
    # Calculate mean values for each borough, excluding 0 values
    means = df[df[column] > 0].groupby(['BoroughName_ExactCut'])[column].mean()
    
    # Iterate through the borough to replace 0s with the calculated mean
    for index, mean_value in means.items():
        borough = index
        df.loc[(df['BoroughName_ExactCut'] == borough) 
               & (df[column] == 0), column] = mean_value

# Display the first few rows to verify the changes
print(df.head())

  BoroughName_ExactCut  Length (m) Pollutant  PetrolCar  DieselCar  PetrolLgv  \
0               NonGLA        50.8       CO2      8.761      4.811      0.038   
1               NonGLA        28.6       CO2      0.016      0.009        NaN   
2               NonGLA         5.1       CO2      0.939      0.519      0.004   
3               NonGLA         3.8       CO2      0.692      0.382      0.003   
4               NonGLA         1.6       CO2      0.299      0.165      0.001   

   DieselLgv  ElectricCar  ElectricLgv  
0      1.735          NaN          NaN  
1        NaN          NaN          NaN  
2      0.184          NaN          NaN  
3      0.136          NaN          NaN  
4      0.059          NaN          NaN  
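
Printing `df.head()` only shows five rows, so it cannot confirm that the replacement worked everywhere. A short check along the following lines could summarise what is left after cleaning. The frame `cleaned` and its values are made up for the illustration:

```python
import pandas as pd

# Made-up frame standing in for the cleaned data; values are illustrative.
cleaned = pd.DataFrame({
    "BoroughName_ExactCut": ["A", "A", "B"],
    "Length (m)": [50.8, 39.7, 5.1],
    "PetrolCar": [8.761, 4.4, None],
})

numeric_columns = ["Length (m)", "PetrolCar"]

# Zeros that survived imputation (a group with no positive value keeps its zeros).
remaining_zeros = (cleaned[numeric_columns] == 0).sum()

# NaN left over from placeholders such as "-"; these were never imputed.
remaining_missing = cleaned[numeric_columns].isna().sum()

summary = pd.DataFrame({"zeros": remaining_zeros, "missing": remaining_missing})
print(summary)
```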


### Splitting and Saving the Cleaned Data

In [40]:
# Splitting the data into training and testing sets
# The size of the testing set is 20% of the total data
# The random state is set to 999 to ensure reproducibility
train_data, test_data = train_test_split(df, test_size=0.2, random_state=999)

# Saving the training data to a new CSV file
train_data.to_csv('./data/emissions_clean_train.csv', index=False)

# Saving the testing data to a new CSV file
test_data.to_csv('./data/emissions_clean_test.csv', index=False)
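
To illustrate what the fixed `random_state` provides, the sketch below repeats the same call on a made-up ten-row frame and checks that the two subsets are disjoint, cover every row, and come out identical on a second run. Only the call shape is taken from the notebook; the data is hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up ten-row frame; only the row count matters for this illustration.
toy = pd.DataFrame({"Length (m)": range(10), "Pollutant": ["CO2"] * 10})

# Same call shape as in the notebook: 20% test rows, fixed seed.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=999)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=999)

# The two subsets are disjoint and together cover every row.
assert set(train_a.index).isdisjoint(test_a.index)
assert len(train_a) + len(test_a) == len(toy)

# Re-running with the same seed selects the same rows.
same_split = test_a.index.equals(test_b.index)
print(len(train_a), len(test_a), same_split)
```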