## Comparing two bus systems using GTFS data in Python

Daniel Ritter | *GEOG 6180 Geoprocessing in Python* | Final Project

Transportation agencies share transit data using the [General Transit Feed Specification](https://gtfs.org/). GTFS is a text-based standard data format used by over 10,000 agencies in over 100 countries. The following script uses the [gtfs_functions](https://pypi.org/project/gtfs-functions/) Python package to describe and compare two bus systems.
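
Because a GTFS feed is a zip archive of comma-separated text files, its contents can be inspected with the standard library alone. The sketch below builds a tiny in-memory feed (the two routes are made up for illustration) and reads `routes.txt` back. In the GTFS specification, `route_type` 3 denotes bus service.

```python
import csv
import io
import zipfile

# Build a tiny, made-up feed in memory so the example is self-contained.
routes_txt = (
    "route_id,route_short_name,route_type\n"
    "R1,9,3\n"      # route_type 3 = bus
    "R2,701,0\n"    # route_type 0 = tram / light rail
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("routes.txt", routes_txt)

# Read the feed back: each GTFS table is a CSV file inside the zip.
with zipfile.ZipFile(buffer) as zf:
    with zf.open("routes.txt") as f:
        rows = list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig")))

# csv.DictReader yields strings, so compare against "3" rather than 3.
bus_route_ids = [row["route_id"] for row in rows if row["route_type"] == "3"]
print(bus_route_ids)
```

`gtfs_functions` handles this parsing itself; the point here is only that nothing in the feed is binary or proprietary.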

In [1]:
# pip install gtfs_functions==2.0.8

import gtfs_functions as gf
import geopandas as gpd
import pandas as pd
import keplergl as kp

### Read GTFS data feed

`Feed()` creates a `Feed` object from a zipped GTFS feed. Its attributes expose the text files within the feed as dataframes and geodataframes (modified to fit the `gtfs_functions` workflow). The `time_windows` argument sets the hour breakpoints used later to count trips by time of day.

In [26]:
# Define file paths for GTFS zip files
gtfs_pathA = r"C:\Users\danie\OneDrive\Documents\Projects\GEOG6180\Project\GTFS-Feeds\GTFS-UTA.zip"
gtfs_pathB = r"C:\Users\danie\OneDrive\Documents\Projects\GEOG6180\Project\GTFS-Feeds\GTFS-VRT.zip"

# Define time windows (24-hour clock)
window = [0, 6, 9, 15, 19, 22, 24]

# Read feed
feedA = gf.Feed(gtfs_pathA, time_windows = window)
feedB = gf.Feed(gtfs_pathB, time_windows = window)

# Create objects from files for system A
agencyA = feedA.agency
routesA = feedA.routes
tripsA = feedA.trips
stopsA = feedA.stops
stoptimesA = feedA.stop_times
shapesA = feedA.shapes

# Create objects from files for system B
agencyB = feedB.agency
routesB = feedB.routes
tripsB = feedB.trips
stopsB = feedB.stops
stoptimesB = feedB.stop_times
shapesB = feedB.shapes

# Identify agencies (first row of agency.txt; .iloc is positional, so it works with any index)
agencyA_name = agencyA["agency_name"].iloc[0]
agencyB_name = agencyB["agency_name"].iloc[0]

# Filter for bus routes only (GTFS route_type 3 = bus)
def busfilter(routes):
    return routes[routes["route_type"] == 3]

busroutesA = busfilter(routesA)
busroutesB = busfilter(routesB)

INFO:root:Reading "agency.txt".
INFO:root:Reading "routes.txt".
INFO:root:accessing trips
INFO:root:Reading "trips.txt".
INFO:root:Reading "calendar.txt".
INFO:root:Reading "calendar_dates.txt".
INFO:root:The busiest date/s of this feed or your selected date range is/are:  ['2024-01-16'] with 5444 trips.
INFO:root:In that more than one busiest date was found, the first one will be considered.
INFO:root:In this case is 2024-01-16.
INFO:root:Reading "stop_times.txt".
INFO:root:_trips is defined in stop_times
INFO:root:Reading "stops.txt".
INFO:root:computing patterns
INFO:root:Reading "shapes.txt".
INFO:root:Reading "agency.txt".
INFO:root:Reading "routes.txt".
INFO:root:accessing trips
INFO:root:Reading "trips.txt".
INFO:root:Reading "calendar.txt".
INFO:root:Reading "calendar_dates.txt".
INFO:root:The busiest date/s of this feed or your selected date range is/are:  ['2023-09-04', '2023-02-16', '2023-03-23', '2023-03-24', '2024-05-28', '2023-03-27', '2023-03-28', '2023-03-29', '2023-03-

INFO:root:In that more than one busiest date was found, the first one will be considered.
INFO:root:In this case is 2023-09-04.
INFO:root:Reading "stop_times.txt".
INFO:root:_trips is defined in stop_times
INFO:root:Reading "stops.txt".
INFO:root:computing patterns
INFO:root:Reading "shapes.txt".
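
To make the `time_windows` argument concrete, the sketch below turns the list of hour breakpoints into one label per interval. The helper `window_labels` is hypothetical, and the label format is inferred from the `"15:00-19:00"` string used later in this notebook; the exact labels `gtfs_functions` produces for other windows may differ.

```python
def window_labels(breaks):
    """Turn hour breakpoints into 'start-end' labels, one per interval."""
    # Pair each breakpoint with the next one: (0, 6), (6, 9), ...
    return [f"{start}:00-{end}:00" for start, end in zip(breaks[:-1], breaks[1:])]

window = [0, 6, 9, 15, 19, 22, 24]
labels = window_labels(window)
print(labels)
```

Seven breakpoints define six windows, so each route and direction can appear up to six times in the frequency tables.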


### Calculate descriptive statistics

`Feed()` imports routes.txt as a regular dataframe without a geometry attribute. Route geometries are stored in the *shapes* object and linked to the *routes* dataframe through the *trips* dataframe (which shares a *shape_id* attribute with *shapes* and a *route_id* attribute with *routes*). The `lines_freq` and `stops_freq` attributes make this linkage and generate trip counts for each of the time windows defined earlier. Because `lines_freq` contains a row for each time window, duplicates are dropped so that only one row per route remains for the geometry.

In [27]:
# Calculate stop and route frequencies
stop_freqA = feedA.stops_freq
stop_freqB = feedB.stops_freq
line_freqA = feedA.lines_freq
line_freqB = feedB.lines_freq

# Create route objects with geometry
route_geomA = line_freqA.drop_duplicates(subset = "route_id")
route_geomB = line_freqB.drop_duplicates(subset = "route_id")
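
The routes-to-trips-to-shapes linkage described above can be sketched with plain `pandas` merges. The miniature tables are made up, and `n_points` is a hypothetical stand-in for the geometry column; this illustrates the join keys, not how `gtfs_functions` implements it internally.

```python
import pandas as pd

# Made-up miniature tables; key column names follow the GTFS specification.
routes = pd.DataFrame({"route_id": ["R1", "R2"], "route_short_name": ["9", "40"]})
trips = pd.DataFrame({
    "trip_id": ["T1", "T2", "T3"],
    "route_id": ["R1", "R1", "R2"],
    "shape_id": ["S1", "S1", "S2"],
})
shapes = pd.DataFrame({"shape_id": ["S1", "S2"], "n_points": [120, 85]})

# trips carries both keys: route_id links to routes, shape_id links to shapes.
linked = trips.merge(routes, on="route_id").merge(shapes, on="shape_id")

# Many trips share one shape, so keep a single row per route.
one_per_route = linked.drop_duplicates(subset="route_id")
print(one_per_route[["route_id", "shape_id", "n_points"]])
```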

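For route length to be meaningful, the CRS must be transformed from WGS 84, whose units are degrees, to a projected CRS with linear units. US National Atlas Equal Area (EPSG 2163) uses meters and covers the entire United States, which makes it a reasonable default choice. An equal-area projection preserves area rather than distance, so the resulting lengths are approximate; a local UTM or State Plane zone would be more accurate for a single system.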

In [115]:
# Reproject to EPSG 2163 (National Atlas Equal Area)
route_geomA = route_geomA.to_crs(2163)
route_geomB = route_geomB.to_crs(2163)

# Calculate route length and convert to miles
route_lengthA = route_geomA.length.divide(1609.34)
route_lengthB = route_geomB.length.divide(1609.34)

# Find mean and total route lengths (miles)
route_meanA = route_lengthA.mean().round(1)
route_meanB = route_lengthB.mean().round(1)
net_lenA = route_lengthA.sum().round(1)
net_lenB = route_lengthB.sum().round(1)

# Summarize daily trips by route
route_tripsA = line_freqA.groupby("route_name")["ntrips"].sum()
route_tripsB = line_freqB.groupby("route_name")["ntrips"].sum()

# Find total daily trips and the route with the most trips
rtrip_totalA = line_freqA["ntrips"].sum()
rtrip_totalB = line_freqB["ntrips"].sum()
rtrip_maxA = route_tripsA.max()
rtrip_maxB = route_tripsB.max()
rtrip_maxIDa = route_tripsA.idxmax()
rtrip_maxIDb = route_tripsB.idxmax()
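
As a rough cross-check on projected lengths, line length can also be computed directly from longitude/latitude pairs with the haversine formula. This is a swapped-in technique, not what the notebook does: the helper names and the two-point route are hypothetical, and the Earth is treated as a sphere with its mean radius.

```python
import math

EARTH_RADIUS_M = 6_371_008.8   # mean Earth radius in meters (spherical assumption)
METERS_PER_MILE = 1609.344

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def line_length_miles(coords):
    """Sum the segment lengths of a polyline given as (lon, lat) tuples."""
    total_m = sum(haversine_m(*p, *q) for p, q in zip(coords[:-1], coords[1:]))
    return total_m / METERS_PER_MILE

# Hypothetical route: one degree of latitude due north.
coords = [(-111.89, 40.0), (-111.89, 41.0)]
length_miles = line_length_miles(coords)
print(round(length_miles, 1))
```

If the haversine and projected lengths for a route disagree by more than a few percent, the projection is probably a poor fit for that region.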

The `print` statements combine previously defined variables with expressions evaluated on the dataframes. `if-elif-else` statements handle direct comparisons between the two systems: an `if-else` would cover almost every case, but the final `else` branch catches the unlikely event that both systems have the same value.

In [119]:
print("System Comparison: {} vs. {}\n".format(agencyA_name, agencyB_name))

print("{} has {} bus routes, while {} has {} bus routes.\n".format(agencyA_name, len(busroutesA), 
                                                                   agencyB_name, len(busroutesB)))

print("The average route length in the {} system is {} miles.".format(agencyA_name, route_meanA))
print("The average route length in the {} system is {} miles.".format(agencyB_name, route_meanB))

System Comparison: Utah Transit Authority vs. Valley Regional Transit

Utah Transit Authority has 81 bus routes, while Valley Regional Transit has 23 bus routes.

The average route length in the Utah Transit Authority system is 14.8 miles.
The average route length in the Valley Regional Transit system is 10.6 miles.


In [120]:
# Print length statement based on which system is longer
if net_lenA > net_lenB: 
    print("{}'s system is {}% longer than {}'s system ({} miles vs. {} miles).\n".format(agencyA_name, 
                                                                                         ((net_lenA - net_lenB) / net_lenB * 100).round(1), 
                                                                                         agencyB_name, net_lenA, net_lenB))
elif net_lenB > net_lenA: 
    print("{}'s system is {}% longer than {}'s system ({} miles vs. {} miles).\n".format(agencyB_name, 
                                                                                         ((net_lenB - net_lenA) / net_lenA * 100).round(1), 
                                                                                         agencyA_name, net_lenB, net_lenA))
else: 
    print("Both systems are the same length ({} miles).\n".format(net_lenA))


# Print stop statement based on which system has more stops
if len(stopsA) > len(stopsB): 
    print("{} has {}% more stops than {} ({} stops vs. {} stops).\n".format(agencyA_name, 
                                                                            round(((len(stopsA) - len(stopsB)) / len(stopsB) * 100), 1), 
                                                                            agencyB_name, len(stopsA), len(stopsB)))
elif len(stopsB) > len(stopsA): 
    # Percent difference is relative to the smaller system (A in this branch)
    print("{} has {}% more stops than {} ({} stops vs. {} stops).\n".format(agencyB_name, 
                                                                            round(((len(stopsB) - len(stopsA)) / len(stopsA) * 100), 1), 
                                                                            agencyA_name, len(stopsB), len(stopsA)))
else: 
    print("Both transit agencies have the same number of stops ({} stops).\n".format(len(stopsA)))


# Print trip statement based on which system has more trips
if rtrip_totalA > rtrip_totalB:  
    print("{} offers {}% more daily trips than {} ({} trips vs. {} trips).\n".format(agencyA_name, 
                                                                                     round(((rtrip_totalA - rtrip_totalB) / rtrip_totalB * 100), 1), 
                                                                                     agencyB_name, rtrip_totalA, rtrip_totalB))

elif rtrip_totalB > rtrip_totalA: 
    print("{} offers {}% more daily trips than {} ({} trips vs. {} trips).\n".format(agencyB_name, 
                                                                                     round(((rtrip_totalB - rtrip_totalA) / rtrip_totalA * 100), 1), 
                                                                                     agencyA_name, rtrip_totalB, rtrip_totalA))
else: 
    print("Both transit agencies have the same number of daily trips on their busiest day ({} trips).\n".format(rtrip_totalA))

# Print route details 
print("The {} route with the most trips per day (all modes) is {} ({} trips).".format(agencyA_name, rtrip_maxIDa, rtrip_maxA))
print("The {} route with the most trips per day (all modes) is {} ({} trips).".format(agencyB_name, rtrip_maxIDb, rtrip_maxB))

Utah Transit Authority's system is 444.6% longer than Valley Regional Transit's system (1274.4 miles vs. 234.0 miles).

Utah Transit Authority has 534.0% more stops than Valley Regional Transit (5243 stops vs. 827 stops).

Utah Transit Authority offers 689.0% more daily trips than Valley Regional Transit (5444 trips vs. 690 trips).

The Utah Transit Authority route with the most trips per day (all modes) is 830X UTAH VALLEY EXPRESS (298 trips).
The Valley Regional Transit route with the most trips per day (all modes) is 9 State Street (99 trips).
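
The three comparisons follow the same pattern, so they could be folded into one helper that orders the pair first and always divides by the smaller value. The function `compare` below is a hypothetical sketch, not part of the notebook; it assumes both values are positive, and the sample call reuses the stop counts printed above.

```python
def compare(name_a, value_a, name_b, value_b, noun):
    """Describe which of two positive values is larger, relative to the smaller."""
    if value_a == value_b:
        return f"Both systems have the same number of {noun} ({value_a})."
    # Order the pair so the larger value always comes first.
    (big_name, big), (small_name, small) = sorted(
        [(name_a, value_a), (name_b, value_b)], key=lambda pair: pair[1], reverse=True
    )
    pct = round((big - small) / small * 100, 1)
    return f"{big_name} has {pct}% more {noun} than {small_name} ({big} vs. {small})."

message = compare("Utah Transit Authority", 5243, "Valley Regional Transit", 827, "stops")
print(message)
```

Writing the percentage formula once removes the risk of the two branches drifting apart, since swapping the argument order yields the same sentence.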


### Visualize system

`gtfs_functions` contains a basic mapping function built on `folium`, but a mapping-specific Python package like `kepler.gl` is needed for more intensive mapping.

In [126]:
## Create basic map of system A
routemapA = kp.KeplerGl(height = 600, data = {"routesA": route_geomA}, show_docs = False)
routemapA

## Create basic map of system B
# routemapB = kp.KeplerGl(height = 600, data = {"routesB": route_geomB}, show_docs = False)
# routemapB

KeplerGl(data={'routesA':     route_id                       route_name  direction_id       window  \
0      1…

Once the default visualization is modified within the widget, the configuration can be saved within Jupyter and reloaded to maintain the modifications when rerunning the script.

In [128]:
## Modify visualization and save configuration for system A
# with open("routemapA_config.py", "w") as f: 
#   f.write("config = {}".format(routemapA.config))

## Modify visualization and save configuration for system B
# with open("routemapB_config.py", "w") as f: 
#   f.write("config = {}".format(routemapB.config))

# Load preset config
%run routemapA_config.py

# Create basic system map with preset config
kp.KeplerGl(height = 600, data = {"routesA": route_geomA}, config = config, show_docs = False)

KeplerGl(config={'version': 'v1', 'config': {'visState': {'filters': [], 'layers': [{'id': 'hpahzag', 'type': …
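
An alternative to writing the configuration as a `.py` file and executing it with `%run` is to store it as JSON, which can be loaded without running any code. The sketch below uses a small stand-in dictionary in place of `routemapA.config` and assumes the real configuration contains only JSON-serializable values; the file name and temporary directory are illustrative.

```python
import json
import os
import tempfile

# Stand-in for `routemapA.config`; a real kepler.gl config is a much larger nested dict.
config = {"version": "v1", "config": {"visState": {"filters": [], "layers": []}}}

# Write to a temporary directory so the example leaves the working folder untouched.
path = os.path.join(tempfile.mkdtemp(), "routemapA_config.json")

with open(path, "w") as f:
    json.dump(config, f, indent=2)

# Reload the saved configuration; no code is executed in the process.
with open(path) as f:
    loaded = json.load(f)

print(loaded == config)
```

The reloaded dictionary would then be passed to `kp.KeplerGl(..., config = loaded)` in the same way as the `config` variable above.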

### Describe service frequency

In this script, frequency is calculated from 3:00 p.m. to 7:00 p.m., which represents bus service for the afternoon/evening commute. This time period can be changed by adjusting the time windows used when reading the GTFS feed and updating `freq_window` to match one of them. Since frequency refers to how often a bus comes in a specific direction, either outbound or inbound must be selected with `freq_direction`. Both directions will generally have the same frequency, but there will be some differences depending on how buses are routed within a specific system.

In [32]:
# Define desired time period and direction
freq_window = "15:00-19:00"
freq_direction = 0

# Filter for time period and direction
line_freq_filterB = line_freqB.query(f'window == "{freq_window}" and direction_id == {freq_direction}')

# Define frequencies
bins = [0, 10, 20, 30, 40, 60, 1440]

# Count number of routes for each frequency
bincountB = pd.cut(line_freq_filterB['min_per_trip'], bins = bins).value_counts(sort = False)

# Describe frequency (percentages are shares of the routes in the filtered table)
n_routes = len(line_freq_filterB)
print("{} total routes operate during the afternoon/evening commute.".format(n_routes))
print("   {} routes ({}%) operate every 0-10 minutes.".format(bincountB.iloc[0], (bincountB.iloc[0]/n_routes*100).round(1)))
print("   {} routes ({}%) operate every 11-20 minutes.".format(bincountB.iloc[1], (bincountB.iloc[1]/n_routes*100).round(1)))
print("   {} routes ({}%) operate every 21-30 minutes.".format(bincountB.iloc[2], (bincountB.iloc[2]/n_routes*100).round(1)))
print("   {} routes ({}%) operate every 31-40 minutes.".format(bincountB.iloc[3], (bincountB.iloc[3]/n_routes*100).round(1)))
print("   {} routes ({}%) operate every 41-60 minutes.".format(bincountB.iloc[4], (bincountB.iloc[4]/n_routes*100).round(1)))
print("   {} routes ({}%) operate every 60+ minutes.".format(bincountB.iloc[5], (bincountB.iloc[5]/n_routes*100).round(1)))

21 total routes operate during the afternoon/evening commute.
   0 routes (0.0%) operate every 0-10 minutes.
   1 routes (4.8%) operate every 11-20 minutes.
   5 routes (23.8%) operate every 21-30 minutes.
   4 routes (19.0%) operate every 31-40 minutes.
   8 routes (38.1%) operate every 41-60 minutes.
   3 routes (14.3%) operate every 60+ minutes.
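
The binning step can be tried on toy data. The sketch below assumes `min_per_trip` is the window length in minutes divided by the number of trips, which is a reading of the column name rather than a confirmed definition, and the trip counts are made up. `pd.cut` builds right-closed intervals by default, so a 10-minute headway falls in the first bin.

```python
import pandas as pd

# Assumption: headway = window length in minutes / trips in that window.
window_minutes = (19 - 15) * 60
ntrips = pd.Series([24, 12, 8, 4, 2])      # made-up trip counts for five routes
min_per_trip = window_minutes / ntrips     # 10, 20, 30, 60, 120 minutes

# Same bin edges as above; intervals are (0, 10], (10, 20], ..., (60, 1440].
bins = [0, 10, 20, 30, 40, 60, 1440]
counts = pd.cut(min_per_trip, bins=bins).value_counts(sort=False)

# Dividing by the number of binned routes keeps the shares summing to 100.
share = (counts / counts.sum() * 100).round(1)
print(pd.DataFrame({"routes": counts, "percent": share}))
```

Passing `sort = False` keeps the bins in their natural order, including empty ones, which is what makes positional access with `.iloc` safe.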


The visualization options in `kepler.gl` are relatively limited compared to dedicated GIS software like ArcGIS Pro, so route frequencies are binned into quantiles instead of the predefined ranges used above.

In [10]:
## Create map of route frequencies
# routefreqmap = kp.KeplerGl(height = 600, data = {"routefreq": line_freq_filterB}, show_docs = False)
# routefreqmap

In [33]:
## Modify visualization and save configuration
# with open("routefreqmap_config.py", "w") as f: 
#   f.write("config = {}".format(routefreqmap.config))

# Load preset config
%run routefreqmap_config.py

# Create basic system map with preset config
kp.KeplerGl(height = 600, data = {"routefreq": line_freq_filterB}, config = config, show_docs = False)

KeplerGl(config={'version': 'v1', 'config': {'visState': {'filters': [], 'layers': [{'id': 'arw3jh', 'type': '…
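
To see how quantile bins differ from the fixed ranges used earlier, the same idea can be reproduced in `pandas` with `qcut`. The headway values below are made up, and this mirrors the concept of a quantile scale rather than `kepler.gl`'s exact implementation.

```python
import pandas as pd

# Made-up headways in minutes for twelve routes.
headways = pd.Series([12, 15, 20, 25, 30, 35, 40, 45, 60, 90, 120, 180])

# Quartiles: bin edges are chosen so each bin holds the same number of routes.
quartile = pd.qcut(headways, q=4)
counts = quartile.value_counts(sort=False)
print(counts)
```

Quantile bins always spread the routes evenly across colors, but their edges depend on the data, so maps of two systems binned this way are not directly comparable.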