Feat initial airtable views #1063
Conversation
In the cases where I have seen code used to generate code that is then copied and pasted elsewhere to be executed, it has been a red flag that the code probably shouldn't exist in that form. In my opinion, this is an anti-pattern for the following reasons:
- It creates two codebases that need to be maintained (one for the code that generates the code, and another for the generated code itself)
- The generated code is not always easily testable in various fashions (in execution, static typing, or via automated testing)
- It requires extra build steps that would not be needed if there were just one codebase
- If the generated code has a bug that needs to be fixed, or a new feature should be introduced, this could require updating all of the generated code files (of which there are already about a dozen)
The generate_airtable_mapping_table.py file appears to be trying to generate the text for SqlToWarehouseOperator DAG tasks. A better approach would be to refactor this file into two new custom Airflow operators:
- AirtableTwoTableJoinOperator
- AirtableSelfJoinOperator
In doing this, you lose some of the automagical generation of DAG task documentation, but you also lose the burdens that come with the "executing code to generate executable code" anti-pattern. Furthermore, any additional DAG tasks will then be able to use one of these operators instead of having to manually run some code to generate another Gusty configuration.
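For concreteness, here is a minimal sketch of what one of the proposed operators might look like. Everything beyond the operator name is an assumption about shape, not this repo's actual code (and the PR ultimately went the macro route instead):

```python
from airflow.models import BaseOperator


class AirtableTwoTableJoinOperator(BaseOperator):
    """Hypothetical sketch of the proposed operator: it would build the
    mapping SQL at runtime instead of generating Gusty config files."""

    # Let Airflow's Jinja templating fill these in from the DAG definition.
    template_fields = ("table1", "table2", "col1", "col2")

    def __init__(self, table1, table2, col1, col2, **kwargs):
        super().__init__(**kwargs)
        self.table1 = table1
        self.table2 = table2
        self.col1 = col1
        self.col2 = col2

    def execute(self, context):
        # Assumed SQL shape -- the real mapping queries live in this PR's
        # generated files; this only illustrates where they would move to.
        sql = (
            f"SELECT t1.{self.col1}, t2.{self.col2} "
            f"FROM {self.table1} t1 "
            f"JOIN {self.table2} t2 ON t1.{self.col1} = t2.{self.col2}"
        )
        # A real implementation would hand this off to a warehouse hook
        # (the machinery behind SqlToWarehouseOperator) rather than log it.
        self.log.info("Would execute: %s", sql)
```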
I broadly agree with the specific idea that code generation introduces some tricky situations, though I disagree that it's necessarily an anti-pattern, and will point out that there are plenty of popular projects that rely heavily on code generation (protobuf, among others). I do agree with the copy-paste concerns, since that's a "soft" build step; nothing prevents a human developer from screwing things up. On the individual points:
- Broadly agree, though this is sometimes an upside; for example, using protos to define message formats and RPCs for microservices.
- I disagree on this; runtime generation is going to be harder to analyze statically, as well as to test.
- Broadly agree, but I think these concerns are better solved outside of Airflow generally.
I think that templating the SQL (which Laurie tells me is supported) may be a better approach than introducing two new operator types to essentially just template SQL in two different ways. Templating keeps it very clear that we're still in "sql to warehouse" land, and it avoids extra layers of indirection on top of "run SQL in the warehouse". (Speaking of anti-patterns, Airflow operator proliferation is a big one; it makes things harder to develop and test locally, and rarely provides more benefit than a Python operator that executes a callable.)

With these considerations, I'm hesitant about adding even more abstraction around the SQL operator for the very simple reason that I'm going to be pushing for dbt very soon, and this PR is already a great example of where dbt could really benefit us. This specific PR would be slightly easier in the dbt world; dbt excels at templating SQL with a "compile" step, as well as providing column-level documentation and tests. I've also discussed it with Charlie, since it can really help provide documentation out of the box, and it also improves the developer experience by helping to create user-specific test environments in the database. I'm worried that we may keep building more and more of what dbt already provides and miss out on the features that we'd get for free.
Thanks @atvaccaro and @evansiroky -- I understand that the current approach is not ideal. @mjumbewu made me aware of an existing example of SQL templating in the Payments pipeline -- stg_enriched_micropayments, which is added as a defined macro. This seems like a pretty solid option to me -- I think it addresses Evan's primary concerns about having a two-step code generation process (I understand/agree with the concerns there), while balancing Andrew's concerns about the proliferation of very limited-scope Airflow operators. I'm planning to go that route for now.
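As a rough sketch of what the macro route looks like in practice: the macro body below is hypothetical -- only the sql_airtable_mapping name and its table1/table2/col1/col2 parameters appear in this PR -- and plain jinja2 stands in for Airflow's template rendering:

```python
from jinja2 import Environment

# Hypothetical macro body; the real sql_airtable_mapping is defined in the
# repo's macros. Only the signature matches this PR's call sites.
TEMPLATE = """
{%- macro sql_airtable_mapping(table1, table2, col1, col2) -%}
SELECT
    {{ table1 }}_key,
    unnested_{{ col1 }} AS {{ col1 }}_key
FROM airtable.california_transit_{{ table1 }}
CROSS JOIN UNNEST({{ col1 }}) AS unnested_{{ col1 }}
{%- endmacro -%}

{{ sql_airtable_mapping(
    table1="gtfs_service_data", table2="",
    col1="reference_static_gtfs_service", col2="") }}
"""

# Rendering compiles the macro call into plain SQL, so there is a single
# codebase and no copy-paste step between generation and execution.
print(Environment().from_string(TEMPLATE).render())
```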
Thanks, @atvaccaro, for engaging in healthy dialogue around development patterns and methodologies. I really appreciate the points you bring up and think this adds knowledge for the whole team. I'm still fairly new to the data science world and am not entirely sure how dbt is structured or how this PR relates to it, but I'm interested to learn more. Regarding the path forward here, the templating option via macros does seem like it would involve less custom code, since it avoids the overhead that custom operators carry. That option does introduce a potential proliferation of macros, but that may not be a big deal; the macros we currently have are limited in scope. So with all that said, the macros option sounds good to me.
Seems good enough to go, but I have a few comments.
airflow/dags/airtable_views/california_transit_gtfs_service_data.sql
```sql
{{
    sql_airtable_mapping(
        table1 = "gtfs_service_data", table2 = "", col1 = "reference_static_gtfs_service", col2 = ""
    )
}}
```
oi, this brings back memories
```sql
organization_id,
name,
organization_type,
REPLACE(REPLACE(REPLACE(roles, "'", ""), "[", ""), "]", "") roles,
```
Maybe this is fine for now, but another thought is that instead of turning this into something comma-delimited, perhaps there can be another table consisting of the organizations and their roles. Also, maybe this translation doesn't even need to occur if the unnesting is possible using SQL in BigQuery.
These multi-response fields are why I did this gross string manipulation; I did it because I find unnested data very difficult to work with in BQ. I thought that for analysts it would be easier to run a query on roles LIKE "%MPO%" than to deal with the unnested data. Making a separate table is OK too; that's just going pretty far down the normalization route, and I wasn't sure how much we wanted to commit to that.
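To make the unnesting alternative concrete: if roles were loaded as a repeated field (ARRAY&lt;STRING&gt;) rather than a stringified Python list, analysts could filter without any string manipulation. A sketch, with an assumed view name following this PR's naming convention:

```python
from google.cloud import bigquery

# Assumes roles is ARRAY<STRING>; the dataset/view name is a guess based
# on the airtable_california_transit_<table name> convention in this PR.
QUERY = """
SELECT organization_id, name
FROM views.airtable_california_transit_organizations
WHERE 'MPO' IN UNNEST(roles)
"""

client = bigquery.Client()  # requires GCP credentials to be configured
for row in client.query(QUERY).result():
    print(row.organization_id, row.name)
```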
```sql
REPLACE(REPLACE(REPLACE(service_type, "'", ""), "[", ""), "]", "") service_type,
REPLACE(REPLACE(REPLACE(mode, "'", ""), "[", ""), "]", "") mode,
```
Same comment as #1063 (comment)
I'm going to go ahead and merge. I tested that the Airtable & Payments macros still work post-second-refactor.
Overall Description
Updated 2/10 to reflect refactor to macros
This PR adds the initial Airtable views that should be sufficient to answer the questions in #984 and cal-itp/data-analyses#379. 🚨 Reviewers please don't merge without confirming that the questions below have been addressed.
Specifically, this PR:
- Adds views versions of organizations, services, gtfs service data, and gtfs datasets -- these tables contain only a limited subset of columns that seemed necessary as a first pass (each in a california_transit_<table name>.sql file) -- would appreciate @edasmalchi or @e-lo check here
- Names the views airtable_california_transit_<table name> -- are we OK with those names? They are long, but until Divide views dataset by domain #1005 I think this might be our best bet? -- hope @evansiroky can weigh in
- Adds a SQL macro (originally a generate_airtable_mapping_table.py helper that generated mapping DAG tasks programmatically; refactored to a macro per the discussion above) that creates a bunch of mapping tables between the listed tables. Do we want to store the initial CSV that I used to generate this first batch in the repo? If so, where? (File attached for reference) -- also a Q for @evansiroky, @atvaccaro, or @mjumbewu
- Deletes the airtable_loader.california_transit_service_component task, because it should just be loaded in as part of the Transit Technology Stacks base -- it's copied from there and shouldn't have been included as part of the California Transit base imports.

Checklist for all PRs
- Run pre-commit run --all-files to make sure markdown/lint passes

Airflow DAG changes checklist

Complete this section if any change to the airflow/dags folder occurs; otherwise please omit it.
- Update the docs/airflow folder as needed -- should this go in the Transit Database section of the docs, the Airflow section of the docs, both, or neither? I don't have any docs info in here yet because I wanted to confirm that we feel good about this approach and ask what we think is most important to document -- cc @charlie-costanzo here

This PR creates the airtable_views DAG in order to....

Add the following DAG tasks:
Primary views:
- california_transit_gtfs_datasets.sql
- california_transit_gtfs_service_data.sql
- california_transit_organizations.sql
- california_transit_services.sql

Mapping tables:
- california_transit_map_gtfs_datasets_aggregated_to_x_self.sql
- california_transit_map_gtfs_datasets_dataset_producers_x_organizations_gtfs_datasets_produced.sql
- california_transit_map_gtfs_datasets_dataset_publisher_x_organizations_gtfs_datasets.sql
- california_transit_map_gtfs_datasets_gtfs_service_mapping_x_gtfs_service_data_gtfs_dataset.sql
- california_transit_map_gtfs_service_data_reference_static_gtfs_service_x_self.sql
- california_transit_map_organizations_mobility_services_managed_x_services_provider.sql
- california_transit_map_organizations_mobility_services_operated_x_services_operator.sql
- california_transit_map_organizations_parent_organization_x_self.sql
- california_transit_map_services_gtfs_services_association_x_gtfs_service_data_services.sql
- california_transit_map_services_paratransit_for_x_self.sql
Here's a screenshot of what one of these mapping tables looks like (airtable_california_transit_map_organizations_mobility_services_managed_x_services_provider.sql), just to illustrate:

[screenshot: sample rows from the mapping table]

It also ❌ deletes the california_transit_service_component DAG task in the airtable_loader DAG, because we shouldn't be loading that as part of california_transit.

Here's the CSV file I fed into my helper to generate the mapping DAG tasks: mapping_tables.csv