
Feat initial airtable views #1063

Merged -- merged 8 commits into main from feat-initial-airtable-views on Feb 11, 2022

Conversation

@lauriemerrell (Contributor) commented Feb 9, 2022

Overall Description

Updated 2/10 to reflect refactor to macros

This PR adds the initial Airtable views that should be sufficient to answer the questions in #984 and cal-itp/data-analyses#379. 🚨 Reviewers, please don't merge without confirming that the questions below have been addressed.

Specifically, this PR:

  1. Creates view versions of organizations, services, gtfs service data, and gtfs datasets -- these views contain only a limited subset of columns that seemed necessary as a first pass
    • ❓ Do these columns look sufficient for a first attempt? Are any really core, stable facts about these entities missing? (Columns are enumerated in each california_transit_<table name>.sql file) -- would appreciate a check from @edasmalchi or @e-lo here
    • ❓ How do we feel about the naming for the main views / tables (airtable_california_transit_<table name>)? The names are long, but until Divide views dataset by domain #1005 lands, I think this might be our best bet -- hope @evansiroky can weigh in
  2. Creates a SQL macro (replacing the original approach, a generate_airtable_mapping_table.py helper that generated mapping DAG tasks programmatically; see the 2/10 update note above)
    • ❓ Would appreciate feedback on the right place to store this -- maybe @evansiroky, @atvaccaro, or @mjumbewu can advise
    • Do we want to store the initial CSV that I used to generate this first batch in the repo? If so, where? (File attached for reference) -- also a Q for @evansiroky, @atvaccaro, or @mjumbewu
  3. Using the macro, creates a set of mapping tables between the listed tables (a sketch of what the macro might look like follows this list)
  4. Deletes the airtable_loader.california_transit_service_component task, because that table should be loaded as part of the Transit Technology Stacks base -- it's copied from there and shouldn't have been included in the California Transit base imports.
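
For orientation, here is a minimal sketch of what the macro could look like as a Jinja macro. The name sql_airtable_mapping and the table1/table2/col1/col2 parameters match the invocation shown later in this review; everything else (the id and airtable_record_id columns, the dataset naming, and the join/unnest logic) is an illustrative assumption, not the actual implementation:

{% macro sql_airtable_mapping(table1, table2, col1, col2) %}
-- Hypothetical sketch (not the actual implementation): map rows of table1 to
-- the rows referenced by its multi-value column col1, which is assumed to
-- hold arrays of Airtable record IDs. An empty table2 means a self-join.
{% set target = table2 if table2 else table1 %}
SELECT
    t1.id AS {{ table1 }}_id,
    t2.id AS {{ target }}_{{ col1 }}_id
FROM `airtable.california_transit_{{ table1 }}` AS t1,
    UNNEST(t1.{{ col1 }}) AS {{ col1 }}_record_id
JOIN `airtable.california_transit_{{ target }}` AS t2
    ON t2.airtable_record_id = {{ col1 }}_record_id
{% endmacro %}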

Checklist for all PRs

  • Run pre-commit run --all-files to make sure markdown/lint passes
  • Link this pull request to all issues that it will close using keywords (see GitHub docs about Linking a pull request to an issue using a keyword). Also mention any issues that are partially addressed or are related.

Airflow DAG changes checklist

  • Include this section whenever any change to a DAG in the airflow/dags folder occurs, otherwise please omit this section.
  • Verify that all affected DAG tasks were able to run in a local environment
  • Take a screenshot of the graph view of the affected DAG in the local environment showing that all affected DAG tasks completed successfully
    [screenshot: graph view of the affected DAG running locally]
  • Add/update documentation in the docs/airflow folder as needed
    • ❓ Do we want to add docs about these to the Transit Database section of the docs, the Airflow section of the docs, both, or neither? I haven't added any docs here yet because I wanted to confirm that we feel good about this approach and to ask what we think is most important to document -- cc @charlie-costanzo
    • ❓ Is it ok for me to add docs in a follow-up PR?
  • Fill out the following section describing what DAG tasks were added/updated

This PR creates the airtable_views DAG in order to....

Adds the following DAG tasks:

Primary views:

  • california_transit_gtfs_datasets.sql
  • california_transit_gtfs_service_data.sql
  • california_transit_organizations.sql
  • california_transit_services.sql

Mapping tables:

  • california_transit_map_gtfs_datasets_aggregated_to_x_self.sql
  • california_transit_map_gtfs_datasets_dataset_producers_x_organizations_gtfs_datasets_produced.sql
  • california_transit_map_gtfs_datasets_dataset_publisher_x_organizations_gtfs_datasets.sql
  • california_transit_map_gtfs_datasets_gtfs_service_mapping_x_gtfs_service_data_gtfs_dataset.sql
  • california_transit_map_gtfs_service_data_reference_static_gtfs_service_x_self.sql
  • california_transit_map_organizations_mobility_services_managed_x_services_provider.sql
  • california_transit_map_organizations_mobility_services_operated_x_services_operator.sql
  • california_transit_map_organizations_parent_organization_x_self.sql
  • california_transit_map_services_gtfs_services_association_x_gtfs_service_data_services.sql
  • california_transit_map_services_paratransit_for_x_self.sql

Here's a screenshot of what one of these mapping tables looks like (airtable_california_transit_map_organizations_mobility_services_managed_x_services_provider.sql), just to illustrate:
[screenshot]

It also ❌ deletes the california_transit_service_component DAG task in the airtable_loader DAG because we shouldn't be loading that as part of california_transit.

Here's the CSV file I fed into my helper to generate the mapping DAG tasks:
mapping_tables.csv

@lauriemerrell lauriemerrell added the airtable Items related to pulling data from Cal-ITP's airtable database. Evan Siroky is product owner. label Feb 9, 2022
@lauriemerrell lauriemerrell self-assigned this Feb 9, 2022
@evansiroky (Member) left a comment


In the times I have seen code used to generate code that is then copied and pasted elsewhere to be executed, it has been a red flag that the code probably shouldn't exist in that form. In my opinion, this is an anti-pattern for the following reasons:

  • It creates two codebases that need to be maintained (one for the code that generates the code, and another for the generated code itself)
  • The generated code is not always easily testable (in execution, via static typing, or via automated testing)
  • It requires extra build steps that wouldn't be needed if there were just one codebase
  • If the generated code has a bug that needs to be fixed or a new feature should be introduced, this could result in a need to update all of the generated code files (of which there are already about a dozen)

The generate_airtable_mapping_table.py file appears to be trying to make the text for SqlToWarehouseOperator DAG tasks. A better approach would be to refactor this file into two new custom Airflow operators:

  1. AirtableTwoTableJoinOperator
  2. AirtableSelfJoinOperator

In doing this, you lose some of the automagical creation of DAG task documentation, but you also lose the burdens that come with the "executing code to generate executable code" anti-pattern. Furthermore, any additional DAG tasks will then be able to use one of these operators instead of having to manually run some code to generate another Gusty configuration.

@atvaccaro (Contributor) commented Feb 10, 2022

I broadly agree with the specific idea that code generation introduces some tricky situations, though I disagree that it's necessarily an anti-pattern and will point out that there are plenty of popular projects that rely heavily on code generation (protobuf, among others). I do agree with the copy-paste concerns since that's a "soft" build step; nothing prevents a human developer from screwing things up.

> In the times I have seen code used to generate code that is then copied and pasted elsewhere to be executed, it has been a red flag that the code probably shouldn't exist in that form. In my opinion, this is an anti-pattern for the following reasons:

>   • It creates two codebases that need to be maintained (one for the code that generates the code, and another for the generated code itself)

Broadly agree, though this is sometimes an upside; for example, using protos to define message formats and RPCs for microservices.

>   • The generated code is not always easily testable (in execution, via static typing, or via automated testing)

I disagree on this one; runtime generation is going to be harder to analyze statically, as well as to test.

>   • It requires extra build steps that wouldn't be needed if there were just one codebase
>   • If the generated code has a bug that needs to be fixed or a new feature should be introduced, this could result in a need to update all of the generated code files (of which there are already about a dozen)

Broadly agree, but I think these concerns are generally better solved outside of Airflow.

> The generate_airtable_mapping_table.py file appears to be trying to make the text for SqlToWarehouseOperator DAG tasks. A better approach would be to refactor this file into two new custom Airflow operators:
>
>   1. AirtableTwoTableJoinOperator
>   2. AirtableSelfJoinOperator
>
> In doing this, you lose some of the automagical creation of DAG task documentation, but you also lose the burdens that come with the "executing code to generate executable code" anti-pattern. Furthermore, any additional DAG tasks will then be able to use one of these operators instead of having to manually run some code to generate another Gusty configuration.

I think that templating the SQL (which Laurie tells me is supported) may be a better approach than introducing two new operator types to essentially just template SQL in two different ways. Templating keeps it very clear that we're still in "sql to warehouse" land, and prevents extra layers of indirection on top of "run SQL in the warehouse". (Speaking of anti-patterns, Airflow operator proliferation is a big one; it makes things harder to develop/test locally and rarely provides more benefit than a Python operator that executes a callable.)

With these considerations in mind, I'm hesitant to add even more abstraction around the SQL operator, for the very simple reason that I'm going to be pushing for dbt very soon -- starting even now, with this PR as a great example of where dbt could really benefit us. This specific PR would be slightly easier in the dbt world; dbt excels at templating SQL via a "compile" step, as well as at providing column-level documentation and tests.

I've also discussed dbt with Charlie, since it provides documentation out of the box and improves the developer experience by creating user-specific test environments in the database. I'm worried that we may keep rebuilding more and more of what it already provides and miss out on features we would get for free.
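
(For concreteness, a rough sketch of what one of this PR's mapping tables could look like as a dbt model. The file path, source names, and columns here are hypothetical, just to illustrate dbt's compile-time templating:)

-- models/airtable/map_organizations_mobility_services_managed.sql (hypothetical)
-- dbt renders source()/ref() when the project is compiled, so all templating
-- happens before any SQL is submitted to the warehouse.
SELECT
    orgs.id AS organization_id,
    service_record_id
FROM {{ source('airtable', 'california_transit_organizations') }} AS orgs,
    UNNEST(orgs.mobility_services_managed) AS service_record_id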

@lauriemerrell (Contributor, Author)

Thanks @atvaccaro and @evansiroky -- I understand that the current approach is not ideal. @mjumbewu made me aware of this existing example of SQL templating in the Payments pipeline -- stg_enriched_micropayments, which is added as a defined macro in dags.py.

This seems like a pretty solid option to me -- I think it addresses Evan's primary concerns about having a two-step code generation process (I understand/agree with the concerns there), while balancing Andrew's concerns about proliferation of very limited-scope Airflow operators. I'm planning to go that route for now.

@evansiroky (Member)

Thanks, @atvaccaro, for engaging in healthy dialogue around development patterns and methodologies. I really appreciate the points you bring up and think this adds knowledge for the whole team. I'm still fairly new to the data science world myself and am not entirely sure how dbt is structured or how this PR relates to it, but I'm interested to learn more.

Regarding the path forward here, the templating option via macros does seem like it would involve less custom code, since it avoids the overhead that custom operators carry. That option introduces a potential problem of macro proliferation, but that may not actually be a big deal -- the macros we currently have are limited in scope. So, with all that said, the macros option sounds good to me.

@lauriemerrell (Contributor, Author)

Screenshot showing that all DAG tasks can run locally after the refactor to use the macro:
[screenshot of local DAG graph view]

@evansiroky (Member) left a comment

Seems good enough to go, but I have a few comments.

airflow/dags/dags.py (review thread -- outdated, resolved)
{{ sql_airtable_mapping(
    table1 = "gtfs_service_data", table2 = "", col1 = "reference_static_gtfs_service", col2 = ""
) }}
@evansiroky (Member):

oi, this brings back memories

[screenshot: Screen Shot 2022-02-10 at 1.42.34 PM]

organization_id,
name,
organization_type,
REPLACE(REPLACE(REPLACE(roles, "'",""), "[", ""), "]", "") roles,
@evansiroky (Member):

Maybe this is fine for now, but another thought: instead of turning this into something comma-delimited, perhaps there could be another table consisting of the organizations and their roles. Also, maybe this translation doesn't even need to happen if the unnesting is possible in BigQuery SQL.
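
(A hedged sketch of that BigQuery option, assuming roles were kept as an ARRAY<STRING> column on the view; the dataset and column names are illustrative:)

-- Hypothetical: filter on a multi-response field without flattening it.
SELECT organization_id, name
FROM views.airtable_california_transit_organizations
WHERE 'MPO' IN UNNEST(roles)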

@lauriemerrell (Contributor, Author) commented Feb 10, 2022:

These multi-response fields are why I did this gross string manipulation; the reason is that I find nested (array) data very difficult to work with in BQ. I thought it would be easier for analysts to query roles LIKE "%MPO%" than to deal with the nested data. Making a separate table is OK too, but that's going pretty far down the normalization route and I wasn't sure how much we wanted to commit to that.
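
(For comparison, the analyst query that the flattened string column enables -- again with illustrative names:)

-- Hypothetical: roles flattened to a comma-delimited string, as in this PR.
SELECT organization_id, name
FROM views.airtable_california_transit_organizations
WHERE roles LIKE '%MPO%'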

Comment on lines +30 to +31
REPLACE(REPLACE(REPLACE(service_type, "'",""), "[", ""), "]", "") service_type,
REPLACE(REPLACE(REPLACE(mode, "'",""), "[", ""), "]", "") mode,
@evansiroky (Member):

Same comment as #1063 (comment)

airflow/dags/dags.py (review thread -- outdated, resolved)
@lauriemerrell (Contributor, Author)

I'm going to go ahead and merge. I tested that the Airtable & Payments macros still work after the second refactor.

@lauriemerrell lauriemerrell merged commit 2ee7ed8 into main Feb 11, 2022
@lauriemerrell lauriemerrell deleted the feat-initial-airtable-views branch February 11, 2022 15:30