In [2]:
import pandas as pd

# Definitions
* **Bus line**:

    Defined as the pair (bus_service_number, direction).
    
    There are 2 directions for each bus service, 1 and 2.

* **Distance between Nearest MRT from a Bus Stop**:

    Defined as the Euclidean distance between the bus stop and its nearest MRT station.

    This is a proxy for walking distance.

* **Average Number of Passengers at a Bus Stop for a given Bus line**:
    * get_weekend_passengers(bus_stop,bus_line) refers to the total number of passengers on that bus line at that bus stop on the weekends of August.
    * get_weekday_passengers(bus_stop,bus_line) refers to the total number of passengers on that bus line at that bus stop on the weekdays of August.

    $$
    \frac{5}{7} \times \text{get\_weekday\_passengers}(\text{bus\_stop}, \text{bus\_line}) + \frac{2}{7} \times \text{get\_weekend\_passengers}(\text{bus\_stop}, \text{bus\_line})
    $$


* **Weighted_Sum_Distance_BusLine_From_MRT**:
    * $n$ refers to the number of bus stops in the bus line
    * $d_i$ refers to the distance from Bus Stop $i$ to its nearest MRT
    * $p_{i,busline}$ refers to the Average Number of Passengers at Bus Stop $i$ for the given Bus line
    * This formula is a proxy for how closely a bus line runs parallel to the MRT: the lower the score, the closer the line's passenger-weighted stops are to MRT stations

$$
\text{Weighted\_Sum\_Distance}_{busline} = \sum_{i=1}^{n} \left( d_{i} \cdot p_{i,busline} \right)
$$
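
As a minimal sketch of the two formulas above, the following pure-Python snippet computes the weighted average passengers per stop and then the weighted sum for one bus line. The function names `average_passengers` and `weighted_sum_distance`, and all the numbers, are made up for illustration; they are not taken from the real data.

```python
def average_passengers(weekday_passengers, weekend_passengers):
    """Weight the weekday and weekend totals by 5/7 and 2/7, as defined above."""
    return (5 / 7) * weekday_passengers + (2 / 7) * weekend_passengers


def weighted_sum_distance(stops):
    """Sum d_i * p_i over the stops of one bus line.

    `stops` is an iterable of
    (distance_to_nearest_mrt, weekday_passengers, weekend_passengers).
    """
    return sum(
        distance * average_passengers(weekday, weekend)
        for distance, weekday, weekend in stops
    )


# Hypothetical bus line with three stops (illustrative values only).
example_stops = [
    (0.2, 700, 70),   # close to an MRT, busy on weekdays
    (1.0, 1400, 0),   # far from an MRT, very busy on weekdays
    (0.5, 0, 700),    # mid-distance, weekend-only traffic
]
score = weighted_sum_distance(example_stops)
print(score)
```

In this toy example the second stop dominates the score because it is both far from an MRT and heavily used.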


# Approach
Finding the nearest MRT for each bus stop is computationally expensive because every bus stop has to be compared against many MRT stations. The same is true of computing the Average Number of Passengers at a Bus Stop for a given Bus line. Running all of this processing in a single notebook would take a long time, so we split the work into separate ETL notebooks:

* get_nearest_mrt_to_bus_stops.ipynb -> ETL to get bus_stops_with_nearest_mrt.csv

    The new csv will have at least the following 7 columns: 
     - Bus Stop Code
     - Bus Stop Lat
     - Bus Stop Long
     - MRT Station Name
     - MRT Station Lat
     - MRT Station Long
     - Distance
    
    Primary Key is Bus Stop Code.

* get_passengers_bus_stop_bus_line.ipynb -> ETL to get avg_passengers_bus_stop_bus_line_august.csv

    The new csv will have the following columns:
    - Bus Service Number
    - Direction
    - Bus Stop
    - Weekend Passengers
    - Weekday Passengers

    Primary Key is a compound key: (Bus Service Number,Direction,Bus Stop)
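
The nearest-MRT step described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper `nearest_mrt`, the station names, and the coordinates are all hypothetical, and the distance is plain Euclidean distance on the (lat, long) pairs, so it comes out in degrees rather than metres.

```python
import math


def nearest_mrt(bus_stop, mrt_stations):
    """Return (station_name, distance) for the MRT station closest to bus_stop.

    bus_stop is a (lat, long) pair; mrt_stations maps name -> (lat, long).
    Distance is the Euclidean distance between the coordinate pairs.
    """
    lat, long = bus_stop
    return min(
        (
            (name, math.hypot(lat - m_lat, long - m_long))
            for name, (m_lat, m_long) in mrt_stations.items()
        ),
        key=lambda pair: pair[1],
    )


# Hypothetical coordinates, for illustration only.
mrt_stations = {
    "Station A": (1.30, 103.80),
    "Station B": (1.35, 103.90),
}
bus_stops = {
    "00001": (1.31, 103.81),
    "00002": (1.34, 103.88),
}

# One row per bus stop, in the 7-column layout listed above:
# code, stop lat, stop long, station name, station lat, station long, distance.
rows = [
    (code, *coords, name, *mrt_stations[name], dist)
    for code, coords in bus_stops.items()
    for name, dist in [nearest_mrt(coords, mrt_stations)]
]
for row in rows:
    print(row)
```

This brute-force comparison checks every MRT station for every bus stop, which is the cost that motivates running it once as a separate ETL step.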


After generating the CSVs above, we run this notebook to output the **Weighted_Sum_Distance_BusLine_From_MRT** score for each bus line.
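
A rough sketch of that final step with pandas is shown below. It uses small in-memory DataFrames as stand-ins for the two CSVs, with made-up values; the column names follow the lists above, while `Average Passengers`, `Weighted Distance`, and `Weighted_Sum_Distance` are names assumed here for illustration.

```python
import pandas as pd

# Stand-in for bus_stops_with_nearest_mrt.csv (only the columns needed here).
nearest = pd.DataFrame({
    "Bus Stop Code": ["A", "B", "C"],
    "Distance": [0.2, 1.0, 0.5],
})

# Stand-in for avg_passengers_bus_stop_bus_line_august.csv.
passengers = pd.DataFrame({
    "Bus Service Number": ["10", "10", "10", "10"],
    "Direction": [1, 1, 2, 2],
    "Bus Stop": ["A", "B", "B", "C"],
    "Weekday Passengers": [700, 1400, 0, 700],
    "Weekend Passengers": [70, 0, 700, 0],
})

# Attach each stop's distance to its nearest MRT.
merged = passengers.merge(
    nearest, left_on="Bus Stop", right_on="Bus Stop Code", how="left"
)

# p_i: weighted average of weekday and weekend passengers.
merged["Average Passengers"] = (
    5 / 7 * merged["Weekday Passengers"] + 2 / 7 * merged["Weekend Passengers"]
)

# d_i * p_i per stop, then summed per bus line (service number, direction).
merged["Weighted Distance"] = merged["Distance"] * merged["Average Passengers"]
scores = (
    merged.groupby(["Bus Service Number", "Direction"])["Weighted Distance"]
    .sum()
    .rename("Weighted_Sum_Distance")
    .reset_index()
)
print(scores)
```

The left merge keeps every (bus line, bus stop) row even if a stop is missing from the nearest-MRT table; such rows would get a missing distance and contribute nothing to the sum, so it may be worth checking for them before trusting the scores.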



In [11]:
# Read the first 100 rows of the file, then sort by TOTAL_TRIPS in descending order

df = pd.read_csv('data/origin_destination_bus_202408.csv', nrows=100)
df.sort_values(by='TOTAL_TRIPS', inplace=True,ascending=False)
df

Unnamed: 0,YEAR_MONTH,DAY_TYPE,TIME_PER_HOUR,PT_TYPE,ORIGIN_PT_CODE,DESTINATION_PT_CODE,TOTAL_TRIPS
5,2024-08,WEEKDAY,12,BUS,10089,10079,145
89,2024-08,WEEKDAY,7,BUS,77231,65279,95
32,2024-08,WEEKDAY,22,BUS,27231,27309,94
25,2024-08,WEEKDAY,18,BUS,63059,62139,92
7,2024-08,WEEKENDS/HOLIDAY,15,BUS,50251,50301,78
...,...,...,...,...,...,...,...
16,2024-08,WEEKDAY,7,BUS,75129,64529,1
29,2024-08,WEEKDAY,19,BUS,10199,5421,1
78,2024-08,WEEKENDS/HOLIDAY,6,BUS,14101,22009,1
64,2024-08,WEEKENDS/HOLIDAY,17,BUS,90029,80141,1
