
Refactor gtfs schedule feeds latest #1162

Merged

Conversation

@lauriemerrell (Contributor) commented Mar 3, 2022

Overall Description

🚨 Note that I am requesting to merge this into the refactor-gtfs-views-staging branch. We need to do a bunch of PRs that should move together (see #1137). I would like to get them reviewed by merging into this placeholder branch, and then just merge that whole branch at once into main via #1157 to avoid dealing with the timing of merging a bunch of PRs one by one into main.

Edited 3/15/22: Refactored a bit based on offline conversation with @mjumbewu.

This PR addresses #1139 (not using a closing keyword because we shouldn't close the issue until we merge into main). Right now, "raw" schedule data is exposed in Metabase via the gtfs_schedule BigQuery dataset, which feeds the GTFS Schedule Feeds Latest database. That data should be downstream of the data cleaning in gtfs_views_staging. To that end, this PR:

  • Deletes the gtfs_loader.gtfs_schedule_tables_load DAG task that currently creates this data
  • Creates a task group (🆕 in Airflow 2! 🎉) in gtfs_views to produce the data instead, with one task per GTFS file (original approach; superseded per the 3/15 edit by the new DAG below)
  • Creates a new DAG gtfs_schedule to produce the data instead, with one task per GTFS file (see the sketch after this list)
  • Adds shapes_clean, which I missed in Refactor gtfs views staging cleaning #1145
  • Refactors to have validation_notices (unnested) instead of validation_report in this new version of gtfs_schedule
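For readers less familiar with the Airflow 2 task-group pattern, here is a minimal sketch of what "one task per GTFS file" could look like. Everything in it — the file list, the operator choice, the SQL, and the schedule — is an illustrative assumption, not the actual implementation in this repo:

```python
from datetime import datetime

from airflow import DAG
from airflow.utils.task_group import TaskGroup
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Hypothetical subset of the GTFS files; the real change covers all 19 tables
# plus validation_notices.
GTFS_FILES = ["agency", "routes", "stops", "trips", "stop_times", "shapes"]

with DAG(
    dag_id="gtfs_schedule",      # DAG name matches the output dataset
    schedule_interval="@daily",  # illustrative; the real schedule may differ
    start_date=datetime(2022, 3, 1),
    catchup=False,
) as dag:
    with TaskGroup(group_id="gtfs_schedule_latest"):
        for gtfs_file in GTFS_FILES:
            # One task per GTFS file: rebuild the exposed table from its
            # *_clean counterpart produced by gtfs_views_staging.
            BigQueryInsertJobOperator(
                task_id=f"{gtfs_file}_latest",
                configuration={
                    "query": {
                        "query": (
                            f"CREATE OR REPLACE TABLE gtfs_schedule.{gtfs_file} AS "
                            f"SELECT * FROM gtfs_views_staging.{gtfs_file}_clean"
                        ),
                        "useLegacySql": False,
                    }
                },
            )
```

The point of the per-file loop is just that each table gets its own task (so failures and retries are scoped to one file) instead of everything running through the single gtfs_loader.gtfs_schedule_tables_load task being deleted here.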

Questions / considerations for reviewers:
(Addressed offline with Mjumbe.)
The naming and where-to-put things considerations here are kind of tough. Right now, all 19 tables in the gtfs_schedule dataset are produced by a single DAG task: gtfs_loader.gtfs_schedule_tables_load. This is clearly not ideal. There are two primary questions:

  1. What DAG to put them in. Since they now depend on outputs from gtfs_views_staging, there are three main options:
    1. gtfs_views_staging - could just stick them in here as a task group, right next to the _clean data that they're built from (not sure that this is a good idea; since this data is exposed to consumers, I would prefer to have it in a DAG that produces exposed data)
    2. gtfs_views - this is what I did
    3. A new DAG - a new DAG called gtfs_schedule or something else would be consistent with our goal of having DAG names that align with the dataset where the data is written (see below for more on that)
      • However I wasn't sure if the new DAG name might just be confusing (gtfs_schedule is very vague)
      • I also think that these are functionally views so having them produced by views seemed appropriate in some sense
      • Note that the convention of DAG name = dataset name is already violated for this data, since it's currently produced by gtfs_loader
      • Update 3/15: Switched to this option
  2. What dataset to put them in (and, as a corollary, what to name the tables). The data is currently in the gtfs_schedule BQ dataset, which feeds the GTFS Schedule Feeds Latest Metabase database. The schema of each individual table has not changed at all; the data in the columns has just been slightly processed. Correction: all existing columns are preserved, but a few new columns are added (record hashes and keys). Again, I see three options:
    1. Leave them in gtfs_schedule - this is what I did (don't have to change any references in existing analysis, don't have to reconfigure the Metabase connection)
    2. Make a new dataset like gtfs_schedule_latest -- could go with option 1.iii above -- would break existing references and require a new Metabase connection
    3. Just stick them in views, in which case we would probably need to rename the tables too. Then they'd be accessible in Metabase via the Warehouse Views database and wouldn't need their own.

Options 2.i and 2.ii both seem compatible with the imminent future where we start breaking the views dataset apart into more specific datasets.

Would appreciate feedback/thoughts on the above from @atvaccaro / @evansiroky / perhaps @mjumbewu?

Checklist for all PRs

Airflow DAG changes checklist

  • Include this section whenever any change to a DAG in the airflow/dags folder occurs; otherwise, please omit this section.
  • Verify that all affected DAG tasks were able to run in a local environment
  • Take a screenshot of the graph view of the affected DAG in the local environment showing that all affected DAG tasks completed successfully

GTFS views staging (stop_times fails because of query limit):
[screenshot: gtfs_views_staging DAG graph view]

GTFS schedule (new DAG) - stop_times isn't going to run because of query limits in dev:
[screenshot: gtfs_schedule DAG graph view]

  • Add/update documentation in the docs/airflow folder as needed
  • Fill out the following section describing what DAG tasks were added/updated

This PR updates the gtfs_loader, gtfs_views_staging, and gtfs_views DAGs, and adds the new gtfs_schedule DAG, in order to:

  • Delete gtfs_loader.gtfs_schedule_tables_load
  • Create gtfs_views_staging.shapes_clean
  • Create the gtfs_views.gtfs_schedule task group to create the 19 tables for each GTFS file plus validation_notices (original approach; superseded by the gtfs_schedule DAG below)
  • Create the gtfs_schedule DAG to create the gtfs_schedule data

@lauriemerrell lauriemerrell self-assigned this Mar 3, 2022
@lauriemerrell lauriemerrell added the project-gtfs-schedule For issues related to gtfs-schedule project label Mar 3, 2022
@atvaccaro (Contributor) left a comment

gtfs_views makes sense

are there any uniqueness checks that could verify we don't pull dupes, e.g. from an improperly coded type2 table?

@lauriemerrell (Contributor, Author) replied:

> are there any uniqueness checks that could verify we don't pull dupes, e.g. from an improperly coded type2 table?

This is a good question but probably should be part of _clean rather than here... I guess we can check for uniqueness of ID everywhere, and this should pass 🤔

@atvaccaro (Contributor) commented Mar 3, 2022

> are there any uniqueness checks that could verify we don't pull dupes, e.g. from an improperly coded type2 table?
>
> This is a good question but probably should be part of _clean rather than here... I guess we can check for uniqueness of ID everywhere, and this should pass 🤔

Yeah, having uniqueness checks on every table is useful (as long as the data isn't too massive) just to detect accidental fanout in the pipeline where it's actually occurring.
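For illustration only, a per-table uniqueness check of the kind discussed here could be as simple as the sketch below; the helper function, table name, and key column are hypothetical, not something that exists in this repo:

```python
from google.cloud import bigquery


def assert_unique(client: bigquery.Client, table: str, key: str) -> None:
    """Raise if `key` is not unique in `table` (hypothetical helper)."""
    sql = f"""
        SELECT {key}, COUNT(*) AS n
        FROM `{table}`
        GROUP BY {key}
        HAVING COUNT(*) > 1
        LIMIT 10
    """
    dupes = list(client.query(sql).result())
    if dupes:
        raise ValueError(f"{table} has duplicate values of {key}, e.g. {dupes[:3]}")


# Illustrative usage; table and key names are assumptions:
# client = bigquery.Client()
# assert_unique(client, "gtfs_schedule.routes", "route_id")
```

Run against every table after the load, a check like this would surface accidental fanout at the step in the pipeline where it actually happens.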

@lauriemerrell (Contributor, Author) commented:

I am going to add checks as a separate PR against the refactor-gtfs-views-staging branch

@evansiroky (Member) left a comment

The proposed changes seem good. It'd be nice to have the PR description say what was decided upon so that it can reflect the changes in the files.

@lauriemerrell (Contributor, Author) replied:

> The proposed changes seem good. It'd be nice to have the PR description say what was decided upon so that it can reflect the changes in the files.

@evansiroky can you clarify what you mean? The description has been updated with the final approach, unless I missed a spot.

@lauriemerrell lauriemerrell merged commit a332e04 into refactor-gtfs-views-staging Mar 16, 2022
@lauriemerrell lauriemerrell deleted the refactor-gtfs-schedule-feeds-latest branch March 16, 2022 13:08