Refactor gtfs schedule feeds latest #1162
Conversation
`gtfs_views` makes sense.

Are there any uniqueness checks that could verify we don't pull dupes, e.g. from an improperly coded type-2 table?

This is a good question, but probably should be part of …

Yeah, having uniqueness checks on every table is useful (as long as the data isn't too massive), just to detect accidental fanout in the pipeline where it's actually occurring.

I am going to add checks as a separate PR against the …
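For illustration, the kind of uniqueness check discussed here might look something like the following minimal sketch. The table name (`stops_clean`) and key columns (`calitp_itp_id`, `calitp_url_number`, `stop_id`) are assumptions for the example, not necessarily the project's actual schema, and the real checks were deferred to a separate PR as noted above.

```python
# Hypothetical uniqueness check: fail loudly if any natural key appears more
# than once, which would indicate dupes/fanout from e.g. an improperly coded
# type-2 table. Table and key column names are assumptions.
from google.cloud import bigquery

DUPLICATE_KEY_SQL = """
SELECT calitp_itp_id, calitp_url_number, stop_id, COUNT(*) AS n
FROM `gtfs_views_staging.stops_clean`
GROUP BY calitp_itp_id, calitp_url_number, stop_id
HAVING COUNT(*) > 1
"""

def check_no_duplicate_keys() -> None:
    """Raise if the assumed natural key is not unique in the table."""
    client = bigquery.Client()
    dupes = list(client.query(DUPLICATE_KEY_SQL).result())
    if dupes:
        raise ValueError(f"{len(dupes)} duplicated keys found, e.g. {dict(dupes[0])}")
```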
The proposed changes seem good. It'd be nice to have the PR description say what was decided upon so that it reflects the changes in the files.

@evansiroky can you clarify what you mean? The description has been updated with the final approach, unless I missed a spot.
Overall Description
🚨 Note that I am requesting to merge this into the refactor-gtfs-views-staging branch. We need to do a bunch of PRs that should move together (see #1137). I would like to get them reviewed by merging into this placeholder branch, and then just merge that whole branch at once into main via #1157 to avoid dealing with the timing of merging a bunch of PRs one by one into main.
Edited 3/15/22: Refactored a bit based on offline conversation with @mjumbewu.
This PR addresses #1139 (not using a closing keyword because we shouldn't close the issue until we merge into main) -- right now there is "raw" schedule data exposed in Metabase via the `gtfs_schedule` BigQuery dataset, which feeds the GTFS Schedule Feeds Latest database in Metabase. That data should be downstream of the data cleaning in `gtfs_views_staging`. To that end, this PR:

- Removes the `gtfs_loader.gtfs_schedule_tables_load` DAG task that currently creates this data
- Creates a new `gtfs_schedule` DAG with a task group (🆕 in Airflow 2! 🎉) to produce the data instead, with one task per GTFS file (see the sketch after this list)
- Adds `shapes_clean`, which I missed in Refactor gtfs views staging cleaning #1145
- Uses `validation_notices` (unnested) instead of `validation_report` in this new version of `gtfs_schedule`
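To make the task-group structure concrete, here is a minimal, hypothetical sketch of the pattern described above. The operator choice, file list, dataset names, and SQL are illustrative assumptions, not the actual DAG code in this PR.

```python
# Hypothetical sketch of the new-in-Airflow-2 TaskGroup pattern: one BigQuery
# task per GTFS file, plus an unnested validation_notices table. All names
# and queries below are assumptions for illustration.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)
from airflow.utils.task_group import TaskGroup

GTFS_FILES = ["agency", "routes", "stops", "trips", "stop_times"]  # abbreviated

def bq_query_task(task_id: str, sql: str) -> BigQueryInsertJobOperator:
    """Small helper wrapping a standard-SQL BigQuery query job."""
    return BigQueryInsertJobOperator(
        task_id=task_id,
        configuration={"query": {"query": sql, "useLegacySql": False}},
    )

with DAG(
    dag_id="gtfs_schedule",
    start_date=datetime(2022, 3, 1),
    schedule_interval="@daily",
) as dag:
    with TaskGroup(group_id="gtfs_schedule_tables") as gtfs_schedule_tables:
        for gtfs_file in GTFS_FILES:
            # One task per GTFS file, reading from the cleaned staging table.
            bq_query_task(
                task_id=f"create_{gtfs_file}",
                sql=(
                    f"CREATE OR REPLACE TABLE gtfs_schedule.{gtfs_file} AS "
                    f"SELECT * FROM gtfs_views_staging.{gtfs_file}_clean"
                ),
            )

        # Unnested notices instead of the nested validation_report.
        bq_query_task(
            task_id="create_validation_notices",
            sql=(
                "CREATE OR REPLACE TABLE gtfs_schedule.validation_notices AS "
                "SELECT report.* EXCEPT (notices), notice "
                "FROM gtfs_views_staging.validation_report_clean AS report, "
                "UNNEST(report.notices) AS notice"
            ),
        )
```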
❓ Questions / considerations for reviewers:

(Addressed offline with Mjumbe.)
The naming and where-to-put-things considerations here are kind of tough. Right now, all 19 tables in the `gtfs_schedule` dataset are produced by a single DAG task: `gtfs_loader.gtfs_schedule_tables_load`. This is clearly not ideal. There are two primary questions:

1. Where should the DAG tasks live that build these tables downstream of `gtfs_views_staging`? There are three main options:
   i. `gtfs_views_staging` - could just stick them in here as a task group, right next to the `_clean` data that they're built from (not sure that this is a good idea; since this data is exposed to consumers, I would prefer to have it in a DAG that produces exposed data)
   ii. `gtfs_views` - this is what I did
   iii. `gtfs_schedule` or something else - would be consistent with our goal of having DAG names that align with the dataset where the data is written (see below for more on that)

   (`gtfs_schedule` is very vague, and `views` seemed appropriate in some sense.)

2. What should we call the data that replaces the `gtfs_loader`-produced `gtfs_schedule` BQ dataset which feeds the `GTFS Schedule Feeds Latest` Metabase data? ~~The data schema of each individual table has not changed at all, it's just that the data in the columns has been slightly processed.~~ Correction: all existing columns are preserved, but a few new columns are added (record hashes and keys; see the sketch after these questions). Again I see three options:
   i. Keep `gtfs_schedule` - this is what I did (don't have to change any references in existing analysis, don't have to reconfigure the Metabase connection)
   ii. `gtfs_schedule_latest` -- could go with option 1.iii above -- would break existing references and require a new Metabase connection
   iii. Write the tables into `views`, in which case we would probably need to rename the tables too. Then they'd be accessible in Metabase via the `Warehouse Views` database and wouldn't need their own database.

Options 2.i and 2.ii both seem compatible with the imminent future where we start breaking apart `views` and making it more specific.

Would appreciate feedback/thoughts on the above from @atvaccaro / @evansiroky / perhaps @mjumbewu?
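For context on the record hashes mentioned in question 2 above, a per-record hash column could hypothetically be derived as below; this is a sketch under assumed table and column names, not necessarily how this PR actually computes them.

```python
# Illustrative only: derive a per-record hash by serializing the whole row.
# TO_JSON_STRING(t) serializes the row, so identical rows hash identically.
# `calitp_hash` and the source table are assumed names.
RECORD_HASH_SQL = """
SELECT
  t.*,
  TO_HEX(MD5(TO_JSON_STRING(t))) AS calitp_hash
FROM `gtfs_views_staging.agency_clean` AS t
"""
```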
Checklist for all PRs

- Run `pre-commit run --all-files` to make sure markdown/lint passes
- Link this pull request to all issues that it will close using keywords (see GitHub docs about Linking a pull request to an issue using a keyword). Also mention any issues that are partially addressed or are related. (The issue will be closed by Refactor gtfs views staging #1157, not by this PR.)

Airflow DAG changes checklist

(Only required if a change to the `airflow/dags` folder occurs; otherwise please omit this section.)

- GTFS views staging (`stop_times` fails because of query limit): [screenshot]
- GTFS schedule (new DAG; `stop_times` isn't going to run because of query limits in dev): [screenshot]
- Update the `docs/airflow` folder as needed

This PR updates the `gtfs_loader`, `gtfs_schedule_views_staging`, and `gtfs_views` DAGs in order to:

- Remove `gtfs_loader.gtfs_schedule_tables_load`
- Add `gtfs_views_staging.shapes_clean`
- ~~Create the `gtfs_views.gtfs_schedule` task group to create the 19 tables for each GTFS file plus `validation_notices`~~
- Create the `gtfs_schedule` DAG to create the `gtfs_schedule` data