
Refactor gtfs schedule feeds latest #1162

Merged

Conversation

@lauriemerrell (Contributor) commented Mar 3, 2022

Overall Description

🚨 Note that I am requesting to merge this into the refactor-gtfs-views-staging branch. We need to do a bunch of PRs that should move together (see #1137). I would like to get them reviewed by merging into this placeholder branch, and then just merge that whole branch at once into main via #1157 to avoid dealing with the timing of merging a bunch of PRs one by one into main.

Edited 3/15/22: Refactored a bit based on offline conversation with @mjumbewu.

This PR addresses #1139 (not using a closing keyword because we shouldn't close the issue until we merge into main). Right now, "raw" schedule data is exposed in Metabase via the gtfs_schedule BigQuery dataset, which feeds the GTFS Schedule Feeds Latest database. That data should be downstream of the data cleaning in gtfs_views_staging. To that end, this PR:

  • Deletes the gtfs_loader.gtfs_schedule_tables_load DAG task that currently creates this data
  • Creates a task group (🆕 in Airflow 2! 🎉) in gtfs_views to produce the data instead, with one task per GTFS file (original approach; superseded per the 3/15 edit by the new DAG below)
  • Creates a new DAG gtfs_schedule to produce the data instead, with one task per GTFS file (see the sketch after this list)
  • Adds shapes_clean, which I missed in Refactor gtfs views staging cleaning #1145
  • Refactors to have validation_notices (unnested) instead of validation_report in this new version of gtfs_schedule
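For readers less familiar with the Airflow 2 task-group pattern, here is a minimal sketch of what "one task per GTFS file" could look like. Everything in it — the file list, the operator choice, the SQL, and the schedule — is an illustrative assumption, not the actual implementation in this repo:

```python
from datetime import datetime

from airflow import DAG
from airflow.utils.task_group import TaskGroup
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Hypothetical subset of the GTFS files; the real change covers all 19 tables
# plus validation_notices.
GTFS_FILES = ["agency", "routes", "stops", "trips", "stop_times", "shapes"]

with DAG(
    dag_id="gtfs_schedule",      # DAG name matches the output dataset
    schedule_interval="@daily",  # illustrative; the real schedule may differ
    start_date=datetime(2022, 3, 1),
    catchup=False,
) as dag:
    with TaskGroup(group_id="gtfs_schedule_latest"):
        for gtfs_file in GTFS_FILES:
            # One task per GTFS file: rebuild the exposed table from its
            # *_clean counterpart produced by gtfs_views_staging.
            BigQueryInsertJobOperator(
                task_id=f"{gtfs_file}_latest",
                configuration={
                    "query": {
                        "query": (
                            f"CREATE OR REPLACE TABLE gtfs_schedule.{gtfs_file} AS "
                            f"SELECT * FROM gtfs_views_staging.{gtfs_file}_clean"
                        ),
                        "useLegacySql": False,
                    }
                },
            )
```

The point of the per-file loop is just that each table gets its own task (so failures and retries are scoped to one file) instead of everything running through the single gtfs_loader.gtfs_schedule_tables_load task being deleted here.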

Questions / considerations for reviewers:
(Addressed offline with Mjumbe.)
The naming and where-to-put things considerations here are kind of tough. Right now, all 19 tables in the gtfs_schedule dataset are produced by a single DAG task: gtfs_loader.gtfs_schedule_tables_load. This is clearly not ideal. There are two primary questions:

  1. What DAG to put them in. Since they now depend on outputs from gtfs_views_staging, there are three main options:
    1. gtfs_views_staging - could just stick them in here as a task group, right next to the _clean data that they're built from (not sure that this is a good idea; since this data is exposed to consumers, I would prefer to have it in a DAG that produces exposed data)
    2. gtfs_views - this is what I did
    3. A new DAG - a new DAG called gtfs_schedule or something else would be consistent with our goal of having DAG names that align with the dataset where the data is written (see below for more on that)
      • However I wasn't sure if the new DAG name might just be confusing (gtfs_schedule is very vague)
      • I also think that these are functionally views so having them produced by views seemed appropriate in some sense
      • Note that the convention of DAG name = dataset name is already violated for this data, since it's currently produced by gtfs_loader
      • Update 3/15: Switched to this option
  2. What dataset to put them in (and, as a corollary, what to name the tables). The data is currently in the gtfs_schedule BQ dataset, which feeds the GTFS Schedule Feeds Latest Metabase database. The schema of each individual table has not changed at all; the data in the columns has just been slightly processed. Correction: all existing columns are preserved, but a few new columns are added (record hashes and keys). Again, I see three options:
    1. Leave them in gtfs_schedule - this is what I did (don't have to change any references in existing analysis, don't have to reconfigure the Metabase connection)
    2. Make a new dataset like gtfs_schedule_latest -- could go with option 1.iii above -- would break existing references and require a new Metabase connection
    3. Just stick them in views, in which case we would probably need to rename the tables too. Then they'd be accessible in Metabase via the Warehouse Views database and wouldn't need their own.

Options 2.i and 2.ii both seem compatible with the imminent future where we start breaking the views dataset apart into more specific datasets.

Would appreciate feedback/thoughts on the above from @atvaccaro / @evansiroky / perhaps @mjumbewu?

Checklist for all PRs

Airflow DAG changes checklist

  • Include this section whenever any change to a DAG in the airflow/dags folder occurs; otherwise, please omit this section.
  • Verify that all affected DAG tasks were able to run in a local environment
  • Take a screenshot of the graph view of the affected DAG in the local environment showing that all affected DAG tasks completed successfully

GTFS views staging (stop_times fails because of query limit):
[screenshot: gtfs_views_staging DAG graph view]

GTFS schedule (new DAG) - stop_times isn't going to run because of query limits in dev:
[screenshot: gtfs_schedule DAG graph view]

  • Add/update documentation in the docs/airflow folder as needed
  • Fill out the following section describing what DAG tasks were added/updated

This PR updates the gtfs_loader, gtfs_views_staging, and gtfs_views DAGs, and adds the new gtfs_schedule DAG, in order to:

  • Delete gtfs_loader.gtfs_schedule_tables_load
  • Create gtfs_views_staging.shapes_clean
  • Create the gtfs_views.gtfs_schedule task group to create the 19 tables for each GTFS file plus validation_notices (original approach; superseded by the gtfs_schedule DAG below)
  • Create the gtfs_schedule DAG to create the gtfs_schedule data

@lauriemerrell lauriemerrell self-assigned this Mar 3, 2022
@lauriemerrell lauriemerrell added the project-gtfs-schedule For issues related to gtfs-schedule project label Mar 3, 2022
@atvaccaro (Contributor) left a comment

gtfs_views makes sense

are there any uniqueness checks that could verify we don't pull dupes, e.g. from an improperly coded type2 table?

@lauriemerrell (Contributor, Author) replied:

> are there any uniqueness checks that could verify we don't pull dupes, e.g. from an improperly coded type2 table?

This is a good question but probably should be part of _clean rather than here... I guess we can check for uniqueness of ID everywhere, and this should pass 🤔

@atvaccaro (Contributor) commented Mar 3, 2022

> are there any uniqueness checks that could verify we don't pull dupes, e.g. from an improperly coded type2 table?
>
> This is a good question but probably should be part of _clean rather than here... I guess we can check for uniqueness of ID everywhere, and this should pass 🤔

Yeah, having uniqueness checks on every table is useful (as long as the data isn't too massive) just to detect accidental fanout in the pipeline where it's actually occurring.
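For illustration only, a per-table uniqueness check of the kind discussed here could be as simple as the sketch below; the helper function, table name, and key column are hypothetical, not something that exists in this repo:

```python
from google.cloud import bigquery


def assert_unique(client: bigquery.Client, table: str, key: str) -> None:
    """Raise if `key` is not unique in `table` (hypothetical helper)."""
    sql = f"""
        SELECT {key}, COUNT(*) AS n
        FROM `{table}`
        GROUP BY {key}
        HAVING COUNT(*) > 1
        LIMIT 10
    """
    dupes = list(client.query(sql).result())
    if dupes:
        raise ValueError(f"{table} has duplicate values of {key}, e.g. {dupes[:3]}")


# Illustrative usage; table and key names are assumptions:
# client = bigquery.Client()
# assert_unique(client, "gtfs_schedule.routes", "route_id")
```

Run against every table after the load, a check like this would surface accidental fanout at the step in the pipeline where it actually happens.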

@lauriemerrell (Contributor, Author) commented:

I am going to add checks as a separate PR against the refactor-gtfs-views-staging branch

@evansiroky (Member) left a comment

The proposed changes seem good. It'd be nice to have the PR description say what was decided upon so that it can reflect the changes in the files.

@lauriemerrell (Contributor, Author) replied:

> The proposed changes seem good. It'd be nice to have the PR description say what was decided upon so that it can reflect the changes in the files.

@evansiroky can you clarify what you mean? The description has been updated with the final approach, unless I missed a spot.

@lauriemerrell lauriemerrell merged commit a332e04 into refactor-gtfs-views-staging Mar 16, 2022
@lauriemerrell lauriemerrell deleted the refactor-gtfs-schedule-feeds-latest branch March 16, 2022 13:08