
Research Request - refactor segment speed exports  #992

@tiffanychu90

Description


Complete the sections below when receiving a research request, and continue to add to this issue as you receive additional details and produce deliverables. Be sure to also add the appropriate project-level label to this issue (e.g., gtfs-rt, DLA).

Research Question

Single sentence description: We have various aggregation levels that need to be harmonized across exports for easy joining in time-series visualizations:

  • by day (single day vs multiple days)
  • geometry (segment vs line geometry)
  • by time (offpeak, peak, all day, time-of-day bins)

Certain columns prevent aggregation, and others are needed if we want to aggregate beyond a single day.

Detailed description:

  1. Time harmonization --> add offpeak so we publish 3 time periods: peak, offpeak, and all-day speeds (a sketch of this aggregation follows this list).
  • Shapes are not easily used across months. Moving away from them means using route_id, direction_id, and stop_pair earlier...but where?
    • Exploratory work comparing Apr and Oct 2023 dates shows that Big Blue Bus had very few joins on shape-stop_sequence, meaning we would not be able to track speeds over time for those segments.
    • Produce shape-stop values for a single day, but lose the multi-day comparison?
  2. Open data publishing: ditch the internal keys (shape_array_key, gtfs_dataset_key) in favor of natural identifiers (shape_id, route_id) and a stable agency identifier (organization_source_record_id); see the crosswalk sketch after this list.
    • Clean up redundancies: we currently save out 2 versions of the export (with and without internal keys), which is redundant and confusing.
    • Goal: we need to be able to key into our exports whenever we get user feedback, but let's add the natural identifiers a lot earlier and carry them through the aggregations. We can probably drop the internal keys in the step right before we zip up the shapefile.
    • This should resolve the inconsistency where some exports use schedule_gtfs_dataset_key while others use organization_source_record_id, and where we redundantly query several tables in mart_transit_database to get the crosswalk we want. If it isn't resolved, we need to save out a crosswalk we can use across rt_vs_schedule and speeds to better merge the dataframes prior to visualizing.
  3. New folder structure for these exports so it's clear whether we're grabbing a single day or multiple days, and what unit of analysis (segment or line).
  4. Set up a catalog.yml so we can easily find the paths for all these aggregated exports (see the catalog sketch after this list).
  5. Potentially refactor segment_speed_utils: if we want to enable certain averages at the route-direction level, we need to move some functions out of rt_segment_speeds/scripts/ into segment_speed_utils so they can be used elsewhere.
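A minimal sketch of the time harmonization in item 1, assuming a long trip-level speeds dataframe with a fine-grained time_of_day column and a speed_mph column (the column names, bin labels, and median statistic are assumptions, not the actual export schema):

```python
import pandas as pd

# Hypothetical mapping from fine-grained time-of-day bins to the published periods.
PEAK_BINS = ["AM Peak", "PM Peak"]


def add_time_period(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse fine-grained time-of-day bins into peak / offpeak."""
    return df.assign(
        time_period=df.time_of_day.map(
            lambda x: "peak" if x in PEAK_BINS else "offpeak"
        )
    )


def average_speeds(df: pd.DataFrame, group_cols: list) -> pd.DataFrame:
    """Average speeds by peak, offpeak, and all-day for the given grouping."""
    df = add_time_period(df)

    by_period = (
        df.groupby(group_cols + ["time_period"])
        .agg(p50_mph=("speed_mph", "median"))
        .reset_index()
    )
    all_day = (
        df.groupby(group_cols)
        .agg(p50_mph=("speed_mph", "median"))
        .reset_index()
        .assign(time_period="all_day")
    )
    return pd.concat([by_period, all_day], ignore_index=True)


# Usage (grouping columns are assumptions):
# speeds = average_speeds(
#     trip_speeds, ["route_id", "direction_id", "stop_pair", "service_date"]
# )
```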
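For item 2, a hedged sketch of attaching the natural identifiers early via a crosswalk and dropping the internal keys only at export time; the crosswalk and column names are assumptions about what the mart_transit_database query returns:

```python
import pandas as pd


def attach_natural_identifiers(
    speeds: pd.DataFrame, crosswalk: pd.DataFrame
) -> pd.DataFrame:
    """Merge in stable / natural identifiers early so they carry through
    every aggregation (column names here are assumptions)."""
    return speeds.merge(
        crosswalk[
            ["schedule_gtfs_dataset_key", "organization_source_record_id",
             "organization_name"]
        ].drop_duplicates(),
        on="schedule_gtfs_dataset_key",
        how="inner",
    )


def drop_internal_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Drop warehouse-internal keys in the step right before zipping the shapefile."""
    internal_cols = [
        "shape_array_key", "gtfs_dataset_key", "schedule_gtfs_dataset_key",
    ]
    return df.drop(columns=[c for c in internal_cols if c in df.columns])
```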
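Items 3 and 4 could look roughly like the sketch below. The folder and catalog entry names are hypothetical placeholders, not a decided structure; it only illustrates the single-day vs. multi-day and segment vs. route-direction split, with an intake catalog lookup replacing hardcoded paths:

```python
import intake

# Hypothetical folder layout (names are placeholders, not the final structure):
#   .../rollup_singleday/speeds_route_dir_{date}.parquet
#   .../rollup_singleday/speeds_segment_{date}.parquet
#   .../rollup_multiday/speeds_route_dir_{start}_{end}.parquet
#   .../rollup_multiday/speeds_segment_{start}_{end}.parquet

# A catalog.yml registering each of these paths as a named source lets
# downstream notebooks look up exports by name instead of hardcoding paths.
catalog = intake.open_catalog("catalog.yml")      # path is an assumption
df = catalog.speeds_route_dir_singleday.read()    # entry name is hypothetical
```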

How will this research be used?

Enable usable time-series visualizations based on joining all the exports across the speeds, schedule, and rt_vs_schedule areas.
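To illustrate the end goal, a hedged sketch of the join the harmonized exports should enable; the dataframe names and join columns are assumptions about the final schema:

```python
import pandas as pd

# Hypothetical join keys shared across the speeds, schedule, and
# rt_vs_schedule exports once identifiers are harmonized.
JOIN_COLS = [
    "organization_source_record_id", "route_id", "direction_id",
    "time_period", "service_date",
]


def join_for_time_series(
    speeds: pd.DataFrame, rt_vs_schedule: pd.DataFrame, schedule: pd.DataFrame
) -> pd.DataFrame:
    """Join the three export families on the shared natural identifiers."""
    return (
        speeds
        .merge(rt_vs_schedule, on=JOIN_COLS, how="inner")
        .merge(schedule, on=JOIN_COLS, how="inner")
    )
```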

Metadata

Labels: gtfs-rt (work related to GTFS-Realtime), research request (issues that serve as a request for research: summary and handoff)
