
Feat initial airtable views #1063

Merged -- merged 8 commits into main from feat-initial-airtable-views on Feb 11, 2022

Conversation

@lauriemerrell (Contributor) commented Feb 9, 2022

Overall Description

Updated 2/10 to reflect refactor to macros

This PR adds the initial Airtable views that should be sufficient to answer the questions in #984 and cal-itp/data-analyses#379. 🚨 Reviewers, please don't merge without confirming that the questions below have been addressed.

Specifically, this PR:

  1. Creates view versions of organizations, services, gtfs service data, and gtfs datasets -- these views contain only a limited subset of columns that seemed necessary as a first pass
    • ❓ Do these columns look sufficient for a first attempt? Are any really core, stable facts about these entities missing? (Columns are enumerated in each california_transit_<table name>.sql file) -- would appreciate a check from @edasmalchi or @e-lo here
    • ❓ How do we feel about the naming for the main views / tables (airtable_california_transit_<table name>)? The names are long, but until Divide views dataset by domain #1005 lands, I think this might be our best bet -- hope @evansiroky can weigh in
  2. Creates a SQL macro (replacing the original approach, a generate_airtable_mapping_table.py helper that generated mapping DAG tasks programmatically; see the 2/10 update note above)
    • ❓ Would appreciate feedback on the right place to store this -- maybe @evansiroky, @atvaccaro, or @mjumbewu can advise
    • Do we want to store the initial CSV that I used to generate this first batch in the repo? If so, where? (File attached for reference) -- also a Q for @evansiroky, @atvaccaro, or @mjumbewu
  3. Using the macro, creates a set of mapping tables between the listed tables (a sketch of what the macro might look like follows this list)
  4. Deletes the airtable_loader.california_transit_service_component task, because that table should be loaded as part of the Transit Technology Stacks base -- it's copied from there and shouldn't have been included in the California Transit base imports.
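
For orientation, here is a minimal sketch of what the macro could look like as a Jinja macro. The name sql_airtable_mapping and the table1/table2/col1/col2 parameters match the invocation shown later in this review; everything else (the id and airtable_record_id columns, the dataset naming, and the join/unnest logic) is an illustrative assumption, not the actual implementation:

{% macro sql_airtable_mapping(table1, table2, col1, col2) %}
-- Hypothetical sketch (not the actual implementation): map rows of table1 to
-- the rows referenced by its multi-value column col1, which is assumed to
-- hold arrays of Airtable record IDs. An empty table2 means a self-join.
{% set target = table2 if table2 else table1 %}
SELECT
    t1.id AS {{ table1 }}_id,
    t2.id AS {{ target }}_{{ col1 }}_id
FROM `airtable.california_transit_{{ table1 }}` AS t1,
    UNNEST(t1.{{ col1 }}) AS {{ col1 }}_record_id
JOIN `airtable.california_transit_{{ target }}` AS t2
    ON t2.airtable_record_id = {{ col1 }}_record_id
{% endmacro %}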

Checklist for all PRs

  • Run pre-commit run --all-files to make sure markdown/lint passes
  • Link this pull request to all issues that it will close using keywords (see GitHub docs about Linking a pull request to an issue using a keyword). Also mention any issues that are partially addressed or are related.

Airflow DAG changes checklist

  • Include this section whenever any change to a DAG in the airflow/dags folder occurs, otherwise please omit this section.
  • Verify that all affected DAG tasks were able to run in a local environment
  • Take a screenshot of the graph view of the affected DAG in the local environment showing that all affected DAG tasks completed successfully
    [screenshot: graph view of the affected DAG running locally]
  • Add/update documentation in the docs/airflow folder as needed
    • ❓ Do we want to add docs about these to the Transit Database section of the docs, the Airflow section of the docs, both, or neither? I haven't added any docs here yet because I wanted to confirm that we feel good about this approach and to ask what we think is most important to document -- cc @charlie-costanzo
    • ❓ Is it ok for me to add docs in a follow-up PR?
  • Fill out the following section describing what DAG tasks were added/updated

This PR creates the airtable_views DAG in order to....

Adds the following DAG tasks:

Primary views:

  • california_transit_gtfs_datasets.sql
  • california_transit_gtfs_service_data.sql
  • california_transit_organizations.sql
  • california_transit_services.sql

Mapping tables:

  • california_transit_map_gtfs_datasets_aggregated_to_x_self.sql
  • california_transit_map_gtfs_datasets_dataset_producers_x_organizations_gtfs_datasets_produced.sql
  • california_transit_map_gtfs_datasets_dataset_publisher_x_organizations_gtfs_datasets.sql
  • california_transit_map_gtfs_datasets_gtfs_service_mapping_x_gtfs_service_data_gtfs_dataset.sql
  • california_transit_map_gtfs_service_data_reference_static_gtfs_service_x_self.sql
  • california_transit_map_organizations_mobility_services_managed_x_services_provider.sql
  • california_transit_map_organizations_mobility_services_operated_x_services_operator.sql
  • california_transit_map_organizations_parent_organization_x_self.sql
  • california_transit_map_services_gtfs_services_association_x_gtfs_service_data_services.sql
  • california_transit_map_services_paratransit_for_x_self.sql

Here's a screenshot of what one of these mapping tables looks like (airtable_california_transit_map_organizations_mobility_services_managed_x_services_provider.sql), just to illustrate:
[screenshot]

It also ❌ deletes the california_transit_service_component DAG task in the airtable_loader DAG because we shouldn't be loading that as part of california_transit.

Here's the CSV file I fed into my helper to generate the mapping DAG tasks:
mapping_tables.csv

@lauriemerrell lauriemerrell added the airtable Items related to pulling data from Cal-ITP's airtable database. Evan Siroky is product owner. label Feb 9, 2022
@lauriemerrell lauriemerrell self-assigned this Feb 9, 2022
@evansiroky (Member) left a comment


In the times I have seen code used to generate code that is then copied and pasted elsewhere to be executed, it has been a red flag that the code probably shouldn't exist in that form. In my opinion, this is an anti-pattern for the following reasons:

  • It creates two codebases that need to be maintained (one for the code that generates the code, and another for the generated code itself)
  • The generated code is not always easily testable (in execution, via static typing, or via automated testing)
  • It requires extra build steps that wouldn't be needed if there were just one codebase
  • If the generated code has a bug that needs to be fixed or a new feature should be introduced, this could result in a need to update all of the generated code files (of which there are already about a dozen)

The generate_airtable_mapping_table.py file appears to be trying to make the text for SqlToWarehouseOperator DAG tasks. A better approach would be to refactor this file into two new custom Airflow operators:

  1. AirtableTwoTableJoinOperator
  2. AirtableSelfJoinOperator

In doing this, you lose some of the automagical creation of DAG task documentation, but you also lose the burdens that come with the "executing code to generate executable code" anti-pattern. Furthermore, any additional DAG tasks will then be able to use one of these operators instead of having to manually run some code to generate another Gusty configuration.

@atvaccaro (Contributor) commented Feb 10, 2022

I broadly agree with the specific idea that code generation introduces some tricky situations, though I disagree that it's necessarily an anti-pattern and will point out that there are plenty of popular projects that rely heavily on code generation (protobuf, among others). I do agree with the copy-paste concerns since that's a "soft" build step; nothing prevents a human developer from screwing things up.

> In the times I have seen code used to generate code that is then copied and pasted elsewhere to be executed, it has been a red flag that the code probably shouldn't exist in that form. In my opinion, this is an anti-pattern for the following reasons:

>   • It creates two codebases that need to be maintained (one for the code that generates the code, and another for the generated code itself)

Broadly agree, though this is sometimes an upside; for example, using protos to define message formats and RPCs for microservices.

>   • The generated code is not always easily testable (in execution, via static typing, or via automated testing)

I disagree on this one; runtime generation is going to be harder to analyze statically, as well as to test.

>   • It requires extra build steps that wouldn't be needed if there were just one codebase
>   • If the generated code has a bug that needs to be fixed or a new feature should be introduced, this could result in a need to update all of the generated code files (of which there are already about a dozen)

Broadly agree, but I think these concerns are generally better solved outside of Airflow.

> The generate_airtable_mapping_table.py file appears to be trying to make the text for SqlToWarehouseOperator DAG tasks. A better approach would be to refactor this file into two new custom Airflow operators:
>
>   1. AirtableTwoTableJoinOperator
>   2. AirtableSelfJoinOperator
>
> In doing this, you lose some of the automagical creation of DAG task documentation, but you also lose the burdens that come with the "executing code to generate executable code" anti-pattern. Furthermore, any additional DAG tasks will then be able to use one of these operators instead of having to manually run some code to generate another Gusty configuration.

I think that templating the SQL (which Laurie tells me is supported) may be a better approach than introducing two new operator types to essentially just template SQL in two different ways. Templating keeps it very clear that we're still in "sql to warehouse" land, and prevents extra layers of indirection on top of "run SQL in the warehouse". (Speaking of anti-patterns, Airflow operator proliferation is a big one; it makes things harder to develop/test locally and rarely provides more benefit than a Python operator that executes a callable.)

With these considerations in mind, I'm hesitant to add even more abstraction around the SQL operator, for the very simple reason that I'm going to be pushing for dbt very soon -- starting even now, with this PR as a great example of where dbt could really benefit us. This specific PR would be slightly easier in the dbt world; dbt excels at templating SQL via a "compile" step, as well as at providing column-level documentation and tests.

I've also discussed dbt with Charlie, since it provides documentation out of the box and improves the developer experience by creating user-specific test environments in the database. I'm worried that we may keep rebuilding more and more of what it already provides and miss out on features we would get for free.
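
(For concreteness, a rough sketch of what one of this PR's mapping tables could look like as a dbt model. The file path, source names, and columns here are hypothetical, just to illustrate dbt's compile-time templating:)

-- models/airtable/map_organizations_mobility_services_managed.sql (hypothetical)
-- dbt renders source()/ref() when the project is compiled, so all templating
-- happens before any SQL is submitted to the warehouse.
SELECT
    orgs.id AS organization_id,
    service_record_id
FROM {{ source('airtable', 'california_transit_organizations') }} AS orgs,
    UNNEST(orgs.mobility_services_managed) AS service_record_id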

@lauriemerrell (Contributor, Author)

Thanks @atvaccaro and @evansiroky -- I understand that the current approach is not ideal. @mjumbewu made me aware of this existing example of SQL templating in the Payments pipeline -- stg_enriched_micropayments, which is added as a defined macro in dags.py.

This seems like a pretty solid option to me -- I think it addresses Evan's primary concerns about having a two-step code generation process (I understand/agree with the concerns there), while balancing Andrew's concerns about proliferation of very limited-scope Airflow operators. I'm planning to go that route for now.

@evansiroky (Member)

Thanks, @atvaccaro, for engaging in healthy dialogue around development patterns and methodologies. I really appreciate the points you bring up and think this adds knowledge for the whole team. I'm still fairly new to the data science world myself and am not entirely sure how dbt is structured or how this PR relates to it, but I'm interested to learn more.

Regarding the path forward here, the templating option via macros does seem like it would involve less custom code, since it avoids the overhead that custom operators carry. That option introduces a potential problem of macro proliferation, but that may not actually be a big deal -- the macros we currently have are limited in scope. So, with all that said, the macros option sounds good to me.

@lauriemerrell (Contributor, Author)

Screenshot showing that all DAG tasks can run locally after the refactor to use the macro:
[screenshot of local DAG graph view]

@evansiroky (Member) left a comment

Seems good enough to go, but I have a few comments.

airflow/dags/dags.py (review thread -- outdated, resolved)
{{ sql_airtable_mapping(
    table1 = "gtfs_service_data", table2 = "", col1 = "reference_static_gtfs_service", col2 = ""
) }}
@evansiroky (Member):

oi, this brings back memories

[screenshot: Screen Shot 2022-02-10 at 1.42.34 PM]

organization_id,
name,
organization_type,
REPLACE(REPLACE(REPLACE(roles, "'",""), "[", ""), "]", "") roles,
@evansiroky (Member):

Maybe this is fine for now, but another thought: instead of turning this into something comma-delimited, perhaps there could be another table consisting of the organizations and their roles. Also, maybe this translation doesn't even need to happen if the unnesting is possible in BigQuery SQL.
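
(A hedged sketch of that BigQuery option, assuming roles were kept as an ARRAY<STRING> column on the view; the dataset and column names are illustrative:)

-- Hypothetical: filter on a multi-response field without flattening it.
SELECT organization_id, name
FROM views.airtable_california_transit_organizations
WHERE 'MPO' IN UNNEST(roles)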

@lauriemerrell (Contributor, Author) commented Feb 10, 2022:

These multi-response fields are why I did this gross string manipulation; the reason is that I find nested (array) data very difficult to work with in BQ. I thought it would be easier for analysts to query roles LIKE "%MPO%" than to deal with the nested data. Making a separate table is OK too, but that's going pretty far down the normalization route and I wasn't sure how much we wanted to commit to that.
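
(For comparison, the analyst query that the flattened string column enables -- again with illustrative names:)

-- Hypothetical: roles flattened to a comma-delimited string, as in this PR.
SELECT organization_id, name
FROM views.airtable_california_transit_organizations
WHERE roles LIKE '%MPO%'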

Comment on lines +30 to +31
REPLACE(REPLACE(REPLACE(service_type, "'",""), "[", ""), "]", "") service_type,
REPLACE(REPLACE(REPLACE(mode, "'",""), "[", ""), "]", "") mode,
@evansiroky (Member):

Same comment as #1063 (comment)

airflow/dags/dags.py (review thread -- outdated, resolved)
@lauriemerrell (Contributor, Author)

I'm going to go ahead and merge. I tested that the Airtable & Payments macros still work after the second refactor.

@lauriemerrell lauriemerrell merged commit 2ee7ed8 into main Feb 11, 2022
@lauriemerrell lauriemerrell deleted the feat-initial-airtable-views branch February 11, 2022 15:30