# Files

The data comes in a zip file containing the following files:

- agency.txt
- calendar_dates.txt
- fare_attributes.txt
- routes.txt
- shapes.txt
- stop_times.txt
- stops.txt
- trips.txt

In [1]:
# import sys
import csv
import pandas as pd

In [2]:
# Constants
springfield_stop_id = "02507"
eugene_stop_id = "02121"
agate_inbound_stop_id = "09965"
agate_outbound_stop_id = "09904"

In [3]:
# Load the GTFS data into dataframes
routes_df = pd.read_csv('../data/routes.txt')
stops_df = pd.read_csv('../data/stops.txt')
calendar_dates_df = pd.read_csv('../data/calendar_dates.txt')
trips_df = pd.read_csv('../data/trips.txt')
stop_times_df = pd.read_csv('../data/stop_times.txt')
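
One thing to watch when loading with `pd.read_csv`: by default pandas infers numeric types, so an all-digit ID column loses its leading zeros and no longer compares equal to string constants such as `"02507"`. A minimal sketch, on a made-up two-row sample rather than the real feed, of how `dtype=str` avoids that:

```python
import io
import pandas as pd

# Hypothetical sample in the shape of stop_times.txt (not the real feed).
sample = io.StringIO(
    "trip_id,departure_time,stop_id\n"
    "698915,06:31:00,02507\n"
    "698916,,02121\n"
)

inferred = pd.read_csv(sample)
sample.seek(0)
as_text = pd.read_csv(sample, dtype=str)

# Type inference turns the all-digit IDs into integers (leading zero lost).
print(inferred["stop_id"].tolist())
# dtype=str keeps the IDs exactly as written; blank fields are still NaN.
print(as_text["stop_id"].tolist())
print(as_text["departure_time"].isna().tolist())
```

Whether this matters for a given column depends on its contents, so it is worth checking `df.dtypes` after loading.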

In [4]:
# Trim unnecessary columns
routes_df = routes_df[['route_id', 'route_short_name', 'route_long_name']]
stops_df = stops_df[['stop_id', 'stop_name', 'stop_lat', 'stop_lon', 'parent_station', 'platform_code']]
calendar_dates_df = calendar_dates_df[['service_id', 'date']]
trips_df = trips_df[['route_id', 'service_id', 'trip_id', 'trip_headsign', 'direction_id']]
stop_times_df = stop_times_df[['trip_id', 'departure_time', 'stop_id', 'stop_headsign']]

print(f"Loaded {len(routes_df)} routes, {len(stops_df)} stops, {len(calendar_dates_df)} calendar dates, {len(trips_df)} trips, and {len(stop_times_df)} stop times.")


Loaded 28 routes, 1187 stops, 275 calendar dates, 5240 trips, and 150718 stop times.
Removing untimed stops yields  150718 rows.


In [6]:
# Remove stop_times rows with no departure time (untimed stops)
stop_times_df = stop_times_df[stop_times_df['departure_time'].notna()]
# stop_times_df = stop_times_df[stop_times_df["departure_time"] != ""]
print("Removing untimed stops yields ", len(stop_times_df), "rows.")

Removing untimed stops yields  48235 rows.
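
The two filters in this cell are not interchangeable. A small sketch, on a made-up sample, of why: `pd.read_csv` loads a blank field as NaN, while `csv.reader` loads it as an empty string, so the `!= ""` test only removes rows from data read with the `csv` module.

```python
import csv
import io
import pandas as pd

# Hypothetical sample: one timed stop and one untimed stop.
text = "trip_id,departure_time\n698915,06:31:00\n698915,\n"

from_pandas = pd.read_csv(io.StringIO(text), dtype=str)
rows = list(csv.reader(io.StringIO(text)))
from_csv = pd.DataFrame(rows[1:], columns=rows[0])

# read_csv: blank -> NaN, so notna() drops it but != "" keeps every row.
pandas_notna = int(from_pandas["departure_time"].notna().sum())
pandas_not_empty = int((from_pandas["departure_time"] != "").sum())

# csv.reader: blank -> "", so != "" drops it but notna() keeps every row.
csv_not_empty = int((from_csv["departure_time"] != "").sum())
csv_notna = int(from_csv["departure_time"].notna().sum())

print(pandas_notna, pandas_not_empty, csv_not_empty, csv_notna)
```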


In [7]:
stop_times_df.head(20)

Unnamed: 0,trip_id,departure_time,stop_id,stop_headsign
0,698915,06:31:00,09011,92 EUGENE STATION <> 92 via LCC
1,698915,06:37:00,09240,92 EUGENE STATION <> 92 via LCC
2,698915,06:49:00,09236,92 EUGENE STATION <> 92 via LCC
4,698915,06:56:00,01806,92 EUGENE STATION <> 92 via LCC
6,698915,07:06:00,02305,92 EUGENE STATION
9,698915,07:14:00,01409,92 EUGENE STATION
13,698915,07:17:00,01320,92 EUGENE STATION
17,698915,07:25:00,escenter,92 EUGENE STATION
18,698916,07:30:00,02119,12 GATEWAY <> to CHAD DRIVE
22,698916,07:36:00,01447,12 GATEWAY <> to CHAD DRIVE


## Routes file

The file `routes.txt` is simple and short. We need very little data from it. The main purpose is to show the friendly name to the user while using the route_id in the database.

Routes are non-geographic (ex: The Yellow Line), can branch, and can describe both directions of travel.

- route_id = uniquely identifies and relates to other GTFS tables
- route_type = bus, subway, ferry, etc.

We can ignore these: `agency_id` is always "LTD", `route_desc` is always blank, and `route_type` is always "3".

`route_id` and `route_long_name` match for every route except "103", whose short name is "EmX" (which also matches its long name).
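
Observations like these can be re-checked whenever a new feed is published. A rough sketch, assuming a made-up two-row sample in the shape of `routes.txt` (the rows are placeholders, not the real feed):

```python
import io
import pandas as pd

# Hypothetical rows in the shape of routes.txt.
sample = io.StringIO(
    "route_id,agency_id,route_short_name,route_long_name,route_desc,route_type\n"
    "12,LTD,12,12,,3\n"
    "103,LTD,EmX,EmX,,3\n"
)
routes = pd.read_csv(sample, dtype=str)

# Columns holding at most one distinct value carry no information.
constant_columns = [c for c in routes.columns if routes[c].nunique() <= 1]

# Routes whose ID differs from the long name.
mismatched = routes.loc[
    routes["route_id"] != routes["route_long_name"], "route_id"
].tolist()

print(constant_columns)
print(mismatched)
```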

## Trips file

5240 entries.

A trip is a single instance of a vehicle traveling on a route. It always visits the same stops at the same time of day. No geographic info. Finer level of detail than routes.txt.

- trip_id: Unique identifier
- route_id: Relates to the route it belongs to
- shape_id: index in the shapes.txt file.
- service_id: relates to calendar.txt and calendar_dates.txt to describe when it runs
- wheelchair_accessible
- bikes_allowed

In [31]:
# Reading a file with the CSV module and then converting to a Pandas DataFrame
# Update: Pandas can read CSV files directly, so this is not necessary

trips_path = "trips.txt"

# Read CSV file
with open(trips_path, 'r') as file:
    reader = csv.reader(file)
    data = list(reader)

# Convert to Pandas DataFrame
trips_df = pd.DataFrame(data[1:], columns = data[0])

# Select desired columns
# route_id,service_id,trip_id,trip_headsign,block_id,direction_id,shape_id

print(len(trips_df), "records read")
print(trips_df.shape[0], "rows and", trips_df.shape[1], "columns")

print("Unique service IDs:", trips_df["service_id"].nunique())
print("Unique trip IDs:", trips_df["trip_id"].nunique())
print("Unique block IDs:", trips_df["block_id"].nunique())

print("Service IDs:", trips_df["service_id"].unique())


5240 records read
5240 rows and 7 columns
Unique service IDs: 9
Unique trip IDs: 5240
Unique block IDs: 210
Service IDs: ['13105100' '13105200' '13105300' '13205100' '13205200' '13205300'
 '13105304' '13205304' '13105312']


In [54]:
# Saturday trips
trips_df_sat = trips_df[trips_df["service_id"] == "13205100"]
print(len(trips_df_sat), "Saturday trips")

# Sunday trips
trips_df_sun = trips_df[trips_df["service_id"] == "13205200"]
print(len(trips_df_sun), "Sunday trips")

# Weekday trips
trips_df_mfa = trips_df[trips_df["service_id"] == "13205300"]
# mfb adds route 79x plus 1 extra dropoff for route 36
trips_df_mfb = trips_df[trips_df["service_id"] == "13205304"]
num_weekday_trips = len(trips_df_mfa) + len(trips_df_mfb)
print(f"Weekday trips: {len(trips_df_mfa)} + {len(trips_df_mfb)} = {num_weekday_trips}")

# EmX trips (Route 103)
trips_df_emx = trips_df[trips_df["route_id"] == "103"]
print(len(trips_df_emx), "total EmX trips")

# EmX trips on Saturday
trips_df_emx_sat = trips_df_sat[trips_df_sat["route_id"] == "103"]
print(len(trips_df_emx_sat), "Saturday EmX trips")

# Array of EmX Saturday trip IDs
list_of_sat_emx_trips = trips_df_emx_sat["trip_id"]


791 Saturday trips
611 Sunday trips
Weekday trips: 1134 + 43 = 1177
802 total EmX trips
129 Saturday EmX trips


## Files to ignore

`shapes.txt`: Each route has a GPS shape and is given a "shape_id". In this file, each row is a data point with its shape ID, sequence number, and GPS coords. To construct a route, you would have to find all rows with the same shape_id. 116,136 rows, 25,745 shapes, 1,815 sequences. We can ignore this file of map data. Purely geographical info and optional in GTFS.

`agency.txt` just says that LTD published it on the west coast in English. No useful data.

`fare_attributes.txt` lists the prices in dollars and says there are no transfers. No useful data.


## Calendar

`calendar.txt`: LTD omits this file and uses the calendar_dates.txt file instead.

- service_id: Uniquely identifies pattern and identifies it to other files.
- monday, tuesday..., saturday, sunday: defines whether this particular service ID runs on that weekday
- start_date = yyyymmdd
- end_date = yyyymmdd

## Calendar Dates

`calendar_dates.txt`: 276 lines in total. Many dates have 2 entries.

- "service_id": 8-digit number referencing the calendar schedule in use that day. M-F calendars use both 13205300 and 13205304. Sat=13205100. Sun=13205200.
- "date": 8-digit representation of the date as yyyymmdd. It starts on Jan 8 and continues through 20230617. **Does the schedule change next month? YES.**
- "exception_type": 1=added. 2=removed. Always 1 in this file.

The best plan is to look up the current date to find the service_id(s) in use that day. `trips.txt` can then be filtered down to only include trips running that day. This will account for holidays automatically.

Alternatively, since we know the 4 service IDs currently in use and their mapping to days of the week, we can work the other direction and handle holidays manually.
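
The date-lookup plan can be sketched as follows. This is only an illustration: the rows are made-up excerpts in the shape of `calendar_dates.txt` and `trips.txt` (the trip IDs are placeholders), and `trips_running_on` is a hypothetical helper name.

```python
import io
import pandas as pd

# Hypothetical excerpts; service IDs follow the mapping described above.
calendar_dates = pd.read_csv(io.StringIO(
    "service_id,date,exception_type\n"
    "13205300,20230616,1\n"
    "13205304,20230616,1\n"
    "13205100,20230617,1\n"
), dtype=str)
trips = pd.read_csv(io.StringIO(
    "route_id,service_id,trip_id\n"
    "103,13205300,700001\n"
    "36,13205304,700002\n"
    "103,13205100,700003\n"
), dtype=str)

def trips_running_on(date_yyyymmdd, calendar_dates, trips):
    """Return trips whose service_id is added (exception_type 1) on the date."""
    on_date = calendar_dates[
        (calendar_dates["date"] == date_yyyymmdd)
        & (calendar_dates["exception_type"] == "1")
    ]
    return trips[trips["service_id"].isin(on_date["service_id"])]

friday_trips = trips_running_on("20230616", calendar_dates, trips)
print(friday_trips["trip_id"].tolist())
```

For the current day, the date string could come from `datetime.date.today().strftime("%Y%m%d")`.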


## Stops

This will be useful in the final product for the end user to find a stop name and for us to associate it with a stop ID.

`stops.txt`: 1188 lines. 

- `stop_id`: 5-digit code starting at 1. Ending at 99911
- `stop_code`: matches `stop_id`. Skip
- `stop_name`: descriptive name. ex: "N/S of Main W of 58th"
- `stop_desc`: Always blank? Drop.
- `stop_lat` and `stop_lon`: coords on a map. May be useful.
- `location_type`: always blank. Drop.
- `parent_station`: when the stop is a bay, this refers to the station entry.
- `stop_timezone`, `wheelchair_boarding`, and `level_id` always blank. Drop.
- `platform_code`: The bay letter

Stops in a multi-bay station:

- 00704 has a parent station of 99910
- 01040 UO Bay B parent = 99905, platform code "B"
- 01550 UO South parent = 99906. No platform code
- 02101-02121 = Eugene Sta A-U. Parent = 99901
- 02156-02159 = UO C, A, D, E
- 02301-02305 = LCC A-E. Parent = 99904
- 02501-02507 = Spfld A-G, 99902.
- 02510 = Spfld special event. 99902. No bay.
- gateway station parent = 99907 bays A and B
- Santa Clara ABCEF = 99911
- 09927-8 Gateway B, C (spfld or riverbend) = 99907

Last block of numbers: 99901-11 = Eugene, Spfld, Amazon, LCC, UO, UO South, Gateway, Thurston, VRC, RR, SC. All have station type of 1.

Last 7 entries are about arrival zones. First 2 columns match parent station 4th column (stop_desc)

EmX routes don't include Eugene, Spfld, RB stations. Gateway included.
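
The parent/bay relationship makes it possible to go from a station to its individual bay stop IDs. A minimal sketch, assuming a small excerpt built from the notes above (the stop names are placeholders) and a hypothetical helper `bays_of`:

```python
import io
import pandas as pd

# Excerpt assembled from the notes above; stop_name values are placeholders.
stops = pd.read_csv(io.StringIO(
    "stop_id,stop_name,parent_station,platform_code\n"
    "02501,Springfield Station Bay A,99902,A\n"
    "02507,Springfield Station Bay G,99902,G\n"
    "02510,Springfield Station Special Event,99902,\n"
    "02121,Eugene Station Bay U,99901,U\n"
    "99902,Springfield Station,,\n"
), dtype=str)

def bays_of(parent_station_id, stops):
    """Return {platform_code: stop_id} for the lettered bays of a station."""
    children = stops[stops["parent_station"] == parent_station_id]
    # Stops without a bay letter (e.g. special-event stops) are skipped.
    lettered = children.dropna(subset=["platform_code"])
    return dict(zip(lettered["platform_code"], lettered["stop_id"]))

springfield_bays = bays_of("99902", stops)
print(springfield_bays)
```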

## Stop Times

Defines every time of day a stop is visited. Defines ordered sequence of stop visits. Arrival and departure times can be null.

`stop_times.txt` is the main source of data we are seeking. Sorted by trip_id, not by time or route. Rows are constructed as follows:

- trip_id: 6 digits. Relates to trips.txt file.
- arrival_time and departure_time: 24h format "hh:mm:ss". Blanks for untimed stops. WARNING: some may be after midnight (ex: 24:03:00)
- stop_id: 5 digits. Relates to stops.txt file. (individual stop/bay #s)
- stop_sequence: the stop's position along the route
- stop_headsign: Outside display on bus. ex: "101 EmX EUGENE STATION"
- layover: "True" or "False". Usually False.
- pickup_type, drop_off_type: always blank?
- shape_dist_traveled: Used for rendering partial shapes in a journey planner. Always blank?
- timepoint: 1 if this is a timed stop. 0 if it is an intermediate (untimed stop)

In [44]:

stop_times_path = "stop_times.txt"

# Read CSV file
with open(stop_times_path, 'r') as file:
    reader = csv.reader(file)
    data = list(reader)

# Convert to Pandas DataFrame
stop_times_df = pd.DataFrame(data[1:], columns = data[0])

# Select desired columns
desired_columns = ["trip_id", "departure_time", "stop_id", "stop_sequence", "stop_headsign"]
stop_times_df = stop_times_df[desired_columns]

print(len(stop_times_df), "records read")
print(stop_times_df.shape[0], "rows and", stop_times_df.shape[1], "columns")

150718 records read
150718 rows and 5 columns


In [45]:
# Filter out the untimed stops for now. They are difficult to deal with.

stop_times_df = stop_times_df[stop_times_df["departure_time"] != ""]
print("Removing empty times leaves us with", len(stop_times_df), "rows.")

Removing empty times leaves us with 48235 rows.


In [46]:
# Use a custom datetime format to fix hours > 24

def custom_to_datetime(time_string):
    """Convert an "HH:MM:SS" string to a pandas Timestamp, rolling hours >= 24 into the next day."""

    # Skip pd.NaT and empty string values (only occurs when we process the entire table)
    if time_string == "":
        return None # or pd.NaT. Allows for further processing of untimed stops.
    if not isinstance(time_string, str):
        print(f"Error: Expected string, got {type(time_string)}")
        return None
    
    nextday = 0
    hours, minutes, seconds = time_string.split(":")
    if int(hours) >= 24:
        hours = str(int(hours) - 24)
        nextday = 1
    newtime = pd.to_datetime(f"{hours}:{minutes}:{seconds}", format="%H:%M:%S")
    return newtime + pd.DateOffset(days=nextday)
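
An alternative worth considering, shown here only as a sketch and not as what this notebook uses: store each time as an offset from the start of the service day. A `pd.Timedelta` has no 24-hour limit, so after-midnight times need no special casing and no placeholder date. `gtfs_time_to_timedelta` is a hypothetical helper name.

```python
import pandas as pd

def gtfs_time_to_timedelta(time_string):
    """Convert "HH:MM:SS" (hours may be 24 or more) to a pandas Timedelta."""
    if not isinstance(time_string, str) or time_string == "":
        return pd.NaT  # untimed stop
    hours, minutes, seconds = (int(part) for part in time_string.split(":"))
    return pd.Timedelta(hours=hours, minutes=minutes, seconds=seconds)

# Made-up sample, including an after-midnight time and an untimed stop.
times = pd.Series(["24:03:00", "06:31:00", "", "23:59:00"])
offsets = times.apply(gtfs_time_to_timedelta)

# Sorting puts 24:03:00 after 23:59:00 and the untimed stop last.
print(offsets.sort_values())
```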
    

In [47]:
# Convert departure_time for sorting.

# Whether we do this to the full schedule or just our subset
# will depend on whether we are preparing the full schedule for the database

# OLD: df["departure_time"] = pd.to_datetime(df["departure_time"], format="%H:%M:%S", errors='coerce')

stop_times_df["departure_time"] = stop_times_df["departure_time"].apply(custom_to_datetime)

In [48]:
# Print the first 20 rows of the dataframe

stop_times_df.head(20)

Unnamed: 0,trip_id,departure_time,stop_id,stop_sequence,stop_headsign
0,698915,1900-01-01 06:31:00,09011,1,92 EUGENE STATION <> 92 via LCC
1,698915,1900-01-01 06:37:00,09240,2,92 EUGENE STATION <> 92 via LCC
2,698915,1900-01-01 06:49:00,09236,3,92 EUGENE STATION <> 92 via LCC
4,698915,1900-01-01 06:56:00,01806,5,92 EUGENE STATION <> 92 via LCC
6,698915,1900-01-01 07:06:00,02305,7,92 EUGENE STATION
9,698915,1900-01-01 07:14:00,01409,10,92 EUGENE STATION
13,698915,1900-01-01 07:17:00,01320,14,92 EUGENE STATION
17,698915,1900-01-01 07:25:00,escenter,18,92 EUGENE STATION
18,698916,1900-01-01 07:30:00,02119,1,12 GATEWAY <> to CHAD DRIVE
22,698916,1900-01-01 07:36:00,01447,5,12 GATEWAY <> to CHAD DRIVE


In [4]:
# If a departure time is missing, fill it in with the previous time + 1 second
# DEBUG: This may be failing due to multiple threads. Try running in a single thread.
# Only necessary if we are preparing the full schedule for the database.

# In case the first row is blank:

# previous_time = None

# for index, row in df.iterrows():
#     if pd.isnull(row["departure_time"]):
#         if previous_time is None:
#             # Handle the case of the first row having an empty departure time
#             next_time = pd.to_datetime("00:00:00", format="%H:%M:%S")
#         else:
#             next_time = previous_time + pd.Timedelta(seconds=1)
#         df.at[index, "departure_time"] = next_time
#         # Track the filled value; `row` is a copy and still holds the blank
#         previous_time = next_time
#     else:
#         previous_time = row["departure_time"]

# To convert all dates in table back to strings:
# df["departure_time"] = df["departure_time"].dt.strftime("%H:%M:%S")
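
The same fill can also be done without a row loop. A sketch under the assumption that the times are already pandas datetimes with NaT for untimed stops; the sample values are made up and `fill_untimed` is a hypothetical helper name:

```python
import pandas as pd

# Hypothetical departure times for one trip; NaT marks untimed stops.
times = pd.Series([
    pd.Timestamp("1900-01-01 06:31:00"),
    pd.NaT,
    pd.NaT,
    pd.Timestamp("1900-01-01 06:37:00"),
    pd.NaT,
])

def fill_untimed(times):
    """Give each untimed stop the last known time plus 1 second per blank row."""
    # Each timed row starts a new run; blanks belong to the run before them.
    run_id = times.notna().cumsum()
    # Position within the run: 0 for the timed row, then 1, 2, ... for blanks.
    gap = times.isna().astype(int).groupby(run_id).cumsum()
    return times.ffill() + pd.to_timedelta(gap, unit="s")

filled = fill_untimed(times)
print(filled.dt.strftime("%H:%M:%S").tolist())
```

A blank in the very first row stays NaT here, and on the full table the fill would need to be applied per `trip_id` so that times do not carry over from one trip to the next.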



Now let's filter our results into a new dataframe. We want to find all buses leaving from a stop we provide.

In [50]:
springfield_stop_id = "02507"
eugene_stop_id = "02121"
agate_inbound_stop_id = "09965"
agate_outbound_stop_id = "09904"

# Number of rows:
num_rows_before = stop_times_df.shape[0]

# Filter rows based on "stop_id" column
stop_df = stop_times_df.loc[stop_times_df["stop_id"] == springfield_stop_id]
num_rows_after = stop_df.shape[0]

print(f"Number of rows before filtering specific stop: {num_rows_before}")
print(f"Number of rows after filtering specific stop: {num_rows_after}")


Number of rows before filtering specific stop: 48235
Number of rows after filtering specific stop: 400


In [56]:
# Saturday EmX trips from Springfield
# print(len(list_of_sat_emx_trips))

stop_df_sat = stop_df[stop_df["trip_id"].isin(list_of_sat_emx_trips)]

Now let's sort by time of day:

In [57]:
# Sort by time of day:
# Note: Chat-GPT doesn't include the "by=" part in its code
sorted_df = stop_df_sat.sort_values(by="departure_time")
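
The filter-then-sort steps above could be wrapped into one function. A rough sketch, assuming made-up excerpts in the shape of `trips.txt` and `stop_times.txt`; `departures_from` is a hypothetical helper name:

```python
import io
import pandas as pd

# Hypothetical excerpts (not the real feed).
trips = pd.read_csv(io.StringIO(
    "route_id,service_id,trip_id\n"
    "103,13205100,702498\n"
    "103,13205100,702485\n"
    "103,13205300,702001\n"
    "12,13205100,698916\n"
), dtype=str)
stop_times = pd.read_csv(io.StringIO(
    "trip_id,departure_time,stop_id\n"
    "702485,07:26:00,02507\n"
    "702498,07:10:00,02507\n"
    "702001,07:15:00,02507\n"
    "698916,07:30:00,02119\n"
), dtype=str)

def departures_from(stop_id, service_id, route_id, trips, stop_times):
    """Return departures at a stop for one route and service, sorted by time."""
    wanted = trips[
        (trips["service_id"] == service_id) & (trips["route_id"] == route_id)
    ]
    at_stop = stop_times[stop_times["stop_id"] == stop_id]
    matched = at_stop[at_stop["trip_id"].isin(wanted["trip_id"])]
    return matched.sort_values(by="departure_time")

saturday_emx = departures_from("02507", "13205100", "103", trips, stop_times)
print(saturday_emx["departure_time"].tolist())
```

The sort here runs on the raw zero-padded "HH:MM:SS" strings, which order correctly as text, including hours of 24 or more.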


In [59]:
print(len(sorted_df))

64


In [60]:
# Show all 64 rows
sorted_df.head(64)

Unnamed: 0,trip_id,departure_time,stop_id,stop_sequence,stop_headsign
80533,702498,1900-01-01 07:10:00,02507,16,101 EmX EUGENE STATION
80081,702485,1900-01-01 07:26:00,02507,16,101 EmX EUGENE STATION
84271,702625,1900-01-01 07:40:00,02507,28,101 EmX EUGENE STATION
88356,702756,1900-01-01 07:56:00,02507,16,101 EmX EUGENE STATION
89806,702806,1900-01-01 08:08:00,02507,28,101 EmX EUGENE STATION
...,...,...,...,...,...
82266,702556,1900-01-01 21:54:00,02507,16,101 EmX EUGENE STATION
86841,702706,1900-01-01 22:09:00,02507,28,101 EmX EUGENE STATION
89775,702804,1900-01-01 22:24:00,02507,16,101 EmX EUGENE STATION
81023,702512,1900-01-01 22:53:00,02507,28,101 EmX EUGENE STATION


In [29]:
# How many unique stop sequences are there?

print(f"There are {len(sorted_df['stop_sequence'].unique())} unique stop sequences.")
print(sorted_df['stop_sequence'].unique())

There are 2 unique stop sequences.
['28' '16']
