<a href="https://colab.research.google.com/github/futureCodersSE/python-programming-for-data/blob/main/Projects/Bus_Data_Emissions_Pandas_Analysis_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysing the bus data challenges
---

In the previous notebook, you simplified and cleaned all the pulled data in order to create a single dataframe which could then be analysed. 

In this notebook, you will be creating a set of reusable functions that will be vital for analysing the data.

It is essential that these functions have meaningful names and are reusable (that means good use of parameters, variables etc.) so that they can be used elsewhere in the project and on different sets of bus data (e.g. from different days).
This is good practice for building robust, scalable code.

The ultimate goal of the project is to find out how much of an effect old buses (e.g. Euro III) are having on air quality, and how many harmful emissions are being produced in the AQMA.

The goal of this notebook's analysis is to build a picture of how many harmful emissions are being produced by each type of bus, and what share of the overall bus emissions in the AQMA comes from Euro III buses.

To do this, we need to find out:   
* how far each type of bus has travelled in a particular day
* what percentage of that day's buses were each type of bus
* what the total emissions produced on that day were for each type of bus
* what percentage of the total emissions for all of the buses in a day each type of bus is producing
* the same as above, but narrowed down to buses that have passed through the AQMA

For the purpose of testing, you will be using a smaller set of data which only has 1 hour of bus data, but by writing reusable functions, it will be easy to pass the full day's data through the same functions!

### Importing the data
---

**Run the code cell below** to retrieve the dataframes you created in the simplification notebook. 

Click accept when asked for Google Drive access, so it can access the bus data folder in your drive. 

In [None]:
import os
import pandas as pd
import json 
from google.colab import drive

def mount_drive(data_path):
  drive.mount('/content/drive', force_remount=True)
  project_dir = "/content/drive/MyDrive/" + data_path
  return project_dir

def unmount_drive():
  drive.flush_and_unmount()
  print('Drive Unmounted')

def get_file_names(project_dir):
  path = os.path.join(os.getcwd(),project_dir)
  print(path)
  filenames = [os.path.join(path,i) for i in os.listdir(path) if os.path.isfile(os.path.join(path,i))]
  return filenames


def read_data(filename, read):
  df1 = read(filename)
  return df1


def create_list_df(filenames, read):
  df_list = []
  for f in filenames:
    df_list.append(read_data(f, read))
  return df_list

def create_df(df_list):
  # DataFrame.append was removed in pandas 2.0, so combine the list with pd.concat
  if not df_list:
    return pd.DataFrame()
  return pd.concat(df_list)

def normalize_df(df, col):
  df1 = pd.json_normalize(df[col])
  return df1



def remove_cols(d, cols_list):
  for col in cols_list:
    d.drop(col, axis=1, inplace=True)
  return d

def remove_dups(d):
  print("old length: ", len(d))
  d.drop_duplicates(inplace=True)
  print("new_length: ", len(d))
  return d

def get_simplified_data_set(data_path):
  project_dir = mount_drive(data_path)
  filenames = get_file_names(project_dir)
  df_list = create_list_df(filenames, pd.read_json)
  df = create_df(df_list)
  df1 = normalize_df(df, "MonitoredVehicleJourney")
  df2 = remove_cols(df1, ["DirectionRef", "PublishedLineName", "OperatorRef", "OriginRef", "DestinationRef", "DestinationAimedArrivalTime", "Bearing", "BlockRef", "FramedVehicleJourneyRef.DataFrameRef", "FramedVehicleJourneyRef.DatedVehicleJourneyRef"])
  return df2



### Preparation - re-create the simplified dataset and create a new, regs, dataset

To complete the rest of this worksheet you will need two datasets:
* the full dataset for the collected live bus data for the period recorded (e.g. OneHourOfData)
* a dataframe containing data on each bus in the fleet

Set the value of data_location to the full path to the data as it is on your Google Drive

Use the function get_simplified_data_set() to get the table of bus journeys for the period

Create a new dataframe called regs by reading the bus_regs.csv data from this link: https://raw.githubusercontent.com/futureCodersSE/python-programming-for-data/main/Datasets/bus_regs.csv

Take a look at both dataframes to see what the data looks like.

In [None]:
data_location = 'futureCoders-external-projects/Air-Quality-Karen-Eco-Hub/Bus-data/Data/OneHourOfData'

simplified_buses = get_simplified_data_set(data_location)
regs = pd.read_csv("https://raw.githubusercontent.com/futureCodersSE/python-programming-for-data/main/Datasets/bus_regs.csv")

### Task 1 - creating an emissions standard filtering function
---
The simplified dataframe is saved in `simplified_buses`. To answer the next set of questions, you need a function which separates the dataset by emission class.

Write a function which takes a dataframe and an emission class as parameters and uses the *regs* dataframe to check for buses that are of the given emissions class.

* create a variable inside the function which will filter the `regs` dataframe by the emission class and create a **list** of vehicle refs (these are in the `Last Tracked` column) only of those with the required emission class
* filter the bus dataframe keeping only the rows where the `VehicleRef` is in the *regs list*
* return the filtered bus dataframe
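
One possible shape for this function is sketched below on made-up data. The column name `Emission Class` is an assumption (check `regs.columns` for the real name), and `regs` is passed in as a parameter here rather than read from a global, purely to keep the example self-contained.

```python
import pandas as pd

def filter_by_emission_class(buses, regs, emission_class, class_col="Emission Class"):
    # class_col is a hypothetical column name; the real one is in regs.columns
    refs = regs.loc[regs[class_col] == emission_class, "Last Tracked"].tolist()
    # Keep only the bus rows whose VehicleRef appears in the list of refs
    return buses[buses["VehicleRef"].isin(refs)]

# Toy data, invented for illustration only
toy_regs = pd.DataFrame({
    "Last Tracked": ["A1", "B2", "C3"],
    "Emission Class": ["EURO III", "EURO VI", "EURO III"],
})
toy_buses = pd.DataFrame({"VehicleRef": ["A1", "B2", "A1", "C3", "D4"]})

euro3 = filter_by_emission_class(toy_buses, toy_regs, "EURO III")
print(euro3)
```

Vehicle `D4` is not in the toy `regs` table, so it is dropped for every emission class.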

### Task 2 
---
Write a function which calculates the percentage of buses in the given period that were a particular emissions class

* create a function which takes the bus dataframe and emissions class as parameters
* create a variable within your function called *subset* which calls the function you created in Task 1, passing in the required emissions class
* calculate the percentage using the length of the **subset** and the length of the original bus dataframe 
* return the percentage rounded to 2 decimal places 
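
A minimal sketch of the percentage calculation, again on invented data. The helper `filter_by_emission_class` and the `Emission Class` column name are assumptions standing in for your own Task 1 function and the real `regs` columns.

```python
import pandas as pd

def filter_by_emission_class(buses, regs, emission_class, class_col="Emission Class"):
    # Stand-in for the Task 1 function; class_col is a hypothetical column name
    refs = regs.loc[regs[class_col] == emission_class, "Last Tracked"].tolist()
    return buses[buses["VehicleRef"].isin(refs)]

def percentage_of_class(buses, regs, emission_class):
    subset = filter_by_emission_class(buses, regs, emission_class)
    if len(buses) == 0:
        return 0.0  # avoid dividing by zero on an empty dataframe
    return round(len(subset) / len(buses) * 100, 2)

# Toy data, invented for illustration only
toy_regs = pd.DataFrame({
    "Last Tracked": ["A1", "B2", "C3"],
    "Emission Class": ["EURO III", "EURO VI", "EURO III"],
})
toy_buses = pd.DataFrame({"VehicleRef": ["A1", "B2", "A1", "C3", "D4"]})

euro3_share = percentage_of_class(toy_buses, toy_regs, "EURO III")
print(euro3_share)
```

Note that this counts rows, so a bus that reports its position many times in the period contributes many rows to the percentage.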

### Task 3 - creating a function to calculate total distance 
---
Later on, we will need to be able to quickly calculate the total distance that all buses of an emission standard have travelled in a day. For this you will use a dataframe that has already been filtered for a particular emission standard.

To test your function now, you can use the dataframe you generated in Task 1

* write a function which takes a dataframe (that has been filtered for an emission class) as a parameter and returns the `total_distance`

To do this:
* create a variable called `lats` which converts the dataframe's `['VehicleLocation.Latitude']` column to a list
* create a variable called `longs` which converts the dataframe's `['VehicleLocation.Longitude']` column to a list
* create a variable called `total_distance`, initialised to 0

Some theory:

Because of the Earth's curvature, it's not that simple to calculate distance using latitudes and longitudes. Luckily, there is a Python library to help us out (it has been imported for you below).

`geodesic((origin_lat, origin_long), (dest_lat, dest_long)).kilometers` will return the distance between two locations (using coordinates) 

It takes 2 parameters in the form of tuples (so a tuple for origin and a tuple for destination)

* In your function, use a for loop to iterate through the indexes of one of your lists (they're both the same length) - *hint: range(len(list))*
* save the lat and long at each index into a tuple called origin
* save the lat and long at the following index (+1) into a tuple called destination 
* using the distance finding function above, add the distance to your `total_distance` variable
* return the total_distance variable at the end of your function

*hint: since we are looking at the next item in the list simultaneously you might want to only iterate to the 2nd to last index of your list*
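
The loop described above could be sketched as follows. So that the example runs without extra packages, it uses a simple haversine (great-circle) formula as a stand-in for `geodesic`; the names `haversine_km` and `pair_distance_km` are hypothetical. In the notebook you would call `geodesic(origin, destination).kilometers` at that step instead, which gives slightly different values because it models the Earth as an ellipsoid.

```python
import math
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius used by the stand-in formula

def haversine_km(origin, destination):
    # Stand-in for geodesic(...).kilometers: great-circle distance on a sphere
    lat1, lon1 = map(math.radians, origin)
    lat2, lon2 = map(math.radians, destination)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def total_distance_km(df, pair_distance_km=haversine_km):
    lats = df["VehicleLocation.Latitude"].astype(float).tolist()
    longs = df["VehicleLocation.Longitude"].astype(float).tolist()
    total_distance = 0.0
    # Stop one short of the end so that index i + 1 always exists
    for i in range(len(lats) - 1):
        origin = (lats[i], longs[i])
        destination = (lats[i + 1], longs[i + 1])
        total_distance += pair_distance_km(origin, destination)
    return total_distance

# Toy data: three points one degree of longitude apart on the equator
toy = pd.DataFrame({
    "VehicleLocation.Latitude": [0.0, 0.0, 0.0],
    "VehicleLocation.Longitude": [0.0, 1.0, 2.0],
})
toy_total = total_distance_km(toy)
print(round(toy_total, 2))
```

Like the steps above, this treats consecutive rows as consecutive positions, so the order of the rows in the dataframe matters.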


In [None]:
from geopy.distance import geodesic



### Task 4 
---

You might remember we worked out the rough emissions of a Euro 3 bus in the Bus Data Challenges.

We are going to use those same numbers. The cell below already contains `emissions_data`, a list of dictionaries holding the emissions values for each emission standard.

Data Dictionary:
```
Field---------------------------Data Type------------------Description  
Emission Standard--------------Alphanumeric----------------Euro III, IV, V or VI	
CO2-------------------------------Float--------------------grams of CO2 emitted per kWh  
Nox-------------------------------Float--------------------grams of Nox emitted per kWh  
PM--------------------------------Float--------------------grams of particulate matter emitted per kWh  
```

**Some numbers to play with:**  
NOTE: These are NOT fact checked but give a rough idea of some numbers we might be able to use for a rough first model

*  A typical old diesel bus will get about 5 miles per gallon, which is 2.126 km per litre (divide mpg by 2.352)
*  One litre of diesel fuel has the energy content of 10.8 kWh

By that logic:

The energy used per km is 10.8 / 2.126 = 5.08 kWh (rounded to 2 decimal places)

To calculate total emissions for each euro standard:
* find the total energy consumption for the total distance: multiply the total distance in km (from Task 3's function) by the energy used per km (5.08) for each euro standard
* multiply the total energy consumption by the grams per kWh of each harmful emission for each euro standard
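
As a worked example of the arithmetic above, the sketch below reproduces the 2.126 and 5.08 figures and then applies them to a made-up distance of 100 km using the Euro III values from `emissions_data`. The variable names are illustrative only.

```python
MPG_DIVISOR = 2.352    # divisor quoted above for mpg -> km per litre
KWH_PER_LITRE = 10.8   # energy content of one litre of diesel, as quoted above

km_per_litre = round(5 / MPG_DIVISOR, 3)             # about 2.126 km per litre
kwh_per_km = round(KWH_PER_LITRE / km_per_litre, 2)  # about 5.08 kWh per km

distance_km = 100.0                                  # made-up distance for illustration
energy_kwh = distance_km * kwh_per_km                # total energy used over that distance

euro_iii = {"CO2": 2.1, "Nox": 5, "PM": 0.1}         # grams per kWh, from emissions_data
totals = {name: energy_kwh * grams for name, grams in euro_iii.items()}

print(km_per_litre, kwh_per_km)
print(totals)
```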

Task:
* write a function that takes the `emissions_standard` as a parameter  
* inside your function, call your function from Task 1, passing in the `emissions_standard` and the `simplified_buses` dataframe, and save the result in a variable
* call your distance calculation function from Task 3, passing in the dataframe variable you've just created, and save the result in a variable called `distance`
* use a for loop to iterate through the `emissions_data` list of dictionaries
* use an if statement to match up to the correct emissions standard dictionary
* calculate the total CO2, Nox and PM emissions (use the distance variable you just made)
* return a dictionary containing the standard, the distance travelled by all buses of that standard in the period and the three total emissions calculated
```
{'Standard': emissions_standard, 'Total distance':distance, 'CO2':total CO2, 'Nox':total Nox, 'PM': total PM } 
```
* call your function for each euro standard (you'll need to pass the euro standard in with speech marks eg. "EURO III") adding the results to a new list called `bus_emissions_by_standard`
* print `bus_emissions_by_standard`
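
The lookup and calculation steps could be sketched like this. To keep the example short and self-contained, the distance is passed in directly and the distances in `toy_distances` are invented; in your own function you would obtain `distance` by calling your Task 1 and Task 3 functions as described above. The function name is hypothetical.

```python
KWH_PER_KM = 5.08  # energy used per km, from the rough model above

emissions_data = [
    {"Standard": "EURO III", "CO2": 2.1, "Nox": 5, "PM": 0.1},
    {"Standard": "EURO IV", "CO2": 1.5, "Nox": 3.5, "PM": 0.02},
    {"Standard": "EURO V", "CO2": 1.5, "Nox": 2, "PM": 0.02},
    {"Standard": "EURO VI", "CO2": 1.5, "Nox": 0.4, "PM": 0.01},
]

def emissions_for_standard(emissions_standard, distance, emissions_data):
    energy_kwh = distance * KWH_PER_KM
    for entry in emissions_data:
        # Match up to the dictionary for the requested standard
        if entry["Standard"] == emissions_standard:
            return {
                "Standard": emissions_standard,
                "Total distance": distance,
                "CO2": energy_kwh * entry["CO2"],
                "Nox": energy_kwh * entry["Nox"],
                "PM": energy_kwh * entry["PM"],
            }
    raise ValueError(f"Unknown emissions standard: {emissions_standard}")

# Invented distances in km, standing in for the Task 3 results
toy_distances = {"EURO III": 120.0, "EURO IV": 80.0, "EURO V": 150.0, "EURO VI": 300.0}

bus_emissions_by_standard = [
    emissions_for_standard(standard, km, emissions_data)
    for standard, km in toy_distances.items()
]
print(bus_emissions_by_standard)
```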








In [None]:
emissions_data = [
    {"Standard":"EURO III", "CO2":2.1, "Nox":5, "PM":0.1 },
    {"Standard":"EURO IV","CO2":1.5,"Nox":3.5,"PM":0.02 },
    {"Standard":"EURO V","CO2":1.5,"Nox":2,"PM":0.02},
    {"Standard":"EURO VI","CO2":1.5,"Nox":0.4,"PM":0.01}
]
  


### Task 5 
---

Find the percentage of the total emissions produced by all the buses for each emission standard 

* Write a function which takes an emissions class as a parameter 
* Create a variable for each type of emission (CO2, Nox, PM) which holds the total emissions (use a for loop to add up the values from the `bus_emissions_by_standard` list you created in the last task)
* Using the given emissions class, calculate what percentage of the total emissions that class produced (the emissions class's emissions / total emissions * 100)
* return the percentage
* print comparisons of each standard, its percentage of emissions and the percentage of the buses that were that standard (use your function from Task 2)
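
One way to read these steps is sketched below with invented totals. Returning a dictionary with one percentage per pollutant is an assumption about what "the percentage" should contain, and the function name is hypothetical; the results list is passed in as a parameter to keep the example self-contained.

```python
POLLUTANTS = ["CO2", "Nox", "PM"]

def emission_percentages(emissions_class, results):
    # Add up the totals for each pollutant across all standards
    totals = {pollutant: 0.0 for pollutant in POLLUTANTS}
    for row in results:
        for pollutant in POLLUTANTS:
            totals[pollutant] += row[pollutant]

    # Work out the share produced by the requested class
    for row in results:
        if row["Standard"] == emissions_class:
            return {
                pollutant: round(row[pollutant] / totals[pollutant] * 100, 2)
                if totals[pollutant] else 0.0
                for pollutant in POLLUTANTS
            }
    raise ValueError(f"Unknown emissions class: {emissions_class}")

# Invented totals in grams, standing in for bus_emissions_by_standard
toy_results = [
    {"Standard": "EURO III", "CO2": 300.0, "Nox": 600.0, "PM": 9.0},
    {"Standard": "EURO VI", "CO2": 700.0, "Nox": 200.0, "PM": 1.0},
]

euro3_percentages = emission_percentages("EURO III", toy_results)
print(euro3_percentages)
```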


### Task 6 
---
Create a function to filter the dataframe to just the rows of buses within the AQMA boundary of Rainham High Street (similar to Task 9 in the simplification worksheet).

Here's a reminder:

To remove all rows that are not within the boundary of Rainham High Street (the AQMA) you will need to check the latitudes and longitudes of each row. 

The bounding box for the latitude and longitude is as follows:
```
Max Lat 51.364935                                 Max Lat 51.364935
Min Long: 0.603210 ------------------------------ Max Long 0.617510  
                   |                            |  
                   |                            |  
                   |                            |  
                   |                            |  
                   |                            |
Min Lat 51.361462  ------------------------------ Min Lat 51.361462
Min Long 0.603210                                 Max Long 0.617510
```
Therefore, to be in the boundary:
* the longitude must be between 0.603210 and 0.617510
* the latitude must be between 51.361462 and 51.364935

Use the function you made in Task 9 of the bus-data-simplification worksheet. If you need to, edit it to generalise it to take a dataframe as a parameter and run it on the simplified_buses dataframe from this worksheet.

Return and save the filtered dataframe that only contains rows which have passed through the AQMA.
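
If you no longer have your Task 9 function to hand, a generalised version might look something like the sketch below. The function name and the toy coordinates are invented, and the conversion with `pd.to_numeric` is included on the assumption that the coordinates may have been read in as strings.

```python
import pandas as pd

def filter_to_aqma(df, min_lat=51.361462, max_lat=51.364935,
                   min_long=0.603210, max_long=0.617510):
    # Convert in case the coordinates were read from JSON as strings
    lats = pd.to_numeric(df["VehicleLocation.Latitude"])
    longs = pd.to_numeric(df["VehicleLocation.Longitude"])
    # Series.between includes both boundary values by default
    inside = lats.between(min_lat, max_lat) & longs.between(min_long, max_long)
    return df[inside]

# Toy rows: one inside the box, one too far north, one too far east
toy = pd.DataFrame({
    "VehicleRef": ["A1", "B2", "C3"],
    "VehicleLocation.Latitude": ["51.3630", "51.3700", "51.3630"],
    "VehicleLocation.Longitude": ["0.6100", "0.6100", "0.6300"],
})

aqma_buses = filter_to_aqma(toy)
print(aqma_buses)
```

Keeping the boundary values as parameters with defaults means the same function could be reused for a different bounding box.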



### Task 7
---
Using the filtered dataframe from Task 6, repeat tasks 4 and 5 on just the buses in the AQMA.

