Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sql directory parser and dag generator #836

Merged
merged 13 commits into from
Oct 4, 2022

Conversation

feluelle
Copy link
Member

@feluelle feluelle commented Sep 13, 2022

Description

What is the current behavior?

Currently, we do not have a way of parsing a directory of sql files and generating a DAG out of it.

closes: #923
related: #897

What is the new behavior?

This PR adds a SQL Directory Parser as well as a DAG Generator with examples.

The current implementation adds minimal configurability. More can be handled in a separate PR.
The following assumptions have been made:

  1. The sql files directory can be passed via cli arg
  2. The start_date is datetime.now()
  3. The conn_id, schema and database can be set via sql file header
  4. The output table name equals the sql file name

Does this introduce a breaking change?

No.

Checklist

  • Created tests which fail without the change (if possible)
  • Extended the README / documentation, if necessary

@feluelle
Copy link
Member Author

This is a very early draft of how it could look like using multiple aql.transform_file operators.

cc @dimberman

@feluelle feluelle force-pushed the feature/sql-directory-parser branch 2 times, most recently from dabb6a6 to 28e63ec Compare September 14, 2022 06:35
@feluelle
Copy link
Member Author

feluelle commented Sep 14, 2022

Failed checks are unrelated. I'll add unit tests once we agree to a specific strategy.

I tested it by installing the following deps:

  • networkx (for detecting the right order in which to call the transform steps)
  • astro-sdk (for aql.transform_file calls, etc.)
  • apache-airflow (for airflow imports such as DAG, etc.)

You can then run the following examples as below:

  • AIRFLOW__CORE__ENABLE_XCOM_PICKLING=True python sql-cli/src/sql_cli/main.py sql-cli/examples/simple
  • AIRFLOW__CORE__ENABLE_XCOM_PICKLING=True python sql-cli/src/sql_cli/main.py sql-cli/examples/advanced

@feluelle
Copy link
Member Author

feluelle commented Sep 14, 2022

@dimberman @tatiana PTAL

Note that we can add meta data to the sql files and use frontmatter as done by Daniel to read it. Or we use yaml file(s) for configuration and additional meta data. But please let me know first if this approach, using jinja and networkx, could be a potential candidate OR we use what Daniel has built already (aql.render).

I like the jinja approach more as it generates a dag file from a template once (e.g. on cli call) which should in the end be faster for the airflow dag parser to process - instead of the aql.render which generates the dag/tasks every time the DAG file calling it gets parsed. Moreover the jinja approach allows someone to debug the DAG code if necessary whereas in the aql.render approach everything is dynamically being generated.

EDIT: Please see decision doc for pros and cons

@kaxil
Copy link
Collaborator

kaxil commented Sep 14, 2022

Markdown link check should be fixed by #842

@codecov
Copy link

codecov bot commented Sep 14, 2022

Codecov Report

Base: 94.46% // Head: 94.46% // No change to project coverage 👍

Coverage data is based on head (4c83951) compared to base (0438577).
Patch has no changes to coverable lines.

❗ Current head 4c83951 differs from pull request most recent head 096394a. Consider uploading reports for the commit 096394a to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #836   +/-   ##
=======================================
  Coverage   94.46%   94.46%           
=======================================
  Files          47       47           
  Lines        2113     2113           
  Branches      229      229           
=======================================
  Hits         1996     1996           
  Misses         83       83           
  Partials       34       34           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@tatiana
Copy link
Collaborator

tatiana commented Sep 20, 2022

@feluelle I like how simple and how explicit this approach is. Let's wait for other's feedback 🤞

Copy link
Collaborator

@dimberman dimberman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@feluelle given the pros and cons we described, are we only going with the "generate transform_file function" option? Or would the idea be that the user can decide whether to generate or simply render at runtime?

I think given the speeds we've seen I don't think that "it renders faster" is a huge concern here. We were able to render 1000 tasks in pretty much the same amount of time.

The debugging argument also seems a bit iffy to me. Ultimately the error either way is going to be "could not parse at ..../foo.sql.py", I don't think it really matters which line in the airflow DAG starts the task. Either way the user is going to need to debug the SQL file itself.

Also with the aql.transform_file approach the DAG would be essentially unreadable when you get too 100 or 1000 tasks. It would generate a monstrously large DAG that would be really painful to edit.

sql-cli/src/sql_cli/dag_generator.py Show resolved Hide resolved
sql-cli/src/sql_cli/dag_generator.py Outdated Show resolved Hide resolved
sql-cli/src/sql_cli/sql_directory_parser.py Outdated Show resolved Hide resolved
@feluelle
Copy link
Member Author

feluelle commented Sep 21, 2022

@feluelle given the pros and cons we described, are we only going with the "generate transform_file function" option? Or would the idea be that the user can decide whether to generate or simply render at runtime?

I think it makes sense to offer both options:

  • aql.render in python-sdk as it is a python function which can directly be added to an existing DAG
    • probably being favored by DE
  • and the jinja approach in sql-cli as it does not require the user to know airflow or python at all to generate a DAG
    • probably being favored by DA

I think given the speeds we've seen I don't think that "it renders faster" is a huge concern here. We were able to render 1000 tasks in pretty much the same amount of time.

This is correct. I have edited my comment in GitHub to link to the notion doc.

The debugging argument also seems a bit iffy to me. Ultimately the error either way is going to be "could not parse at ..../foo.sql.py", I don't think it really matters which line in the airflow DAG starts the task. Either way the user is going to need to debug the SQL file itself.

Same here. Please let us have this discussion in this comment of the docs.

Also with the aql.transform_file approach the DAG would be essentially unreadable when you get too 100 or 1000 tasks. It would generate a monstrously large DAG that would be really painful to edit.

The point is that you should not edit it as it gets overwritten every time we regenerate the DAG file. This is indeed one of the "Cons" of this approach, see this block in the docs. With the new dataset support in Airflow and astro-sdk, we can also easily split it into multiple DAGs 👍 (we should probably do that anyway, even if not meant to be edited.)

@feluelle feluelle force-pushed the feature/sql-directory-parser branch 2 times, most recently from c1544e9 to 741f8bb Compare September 23, 2022 12:40
@feluelle
Copy link
Member Author

@dimberman @tatiana I changed the code to use transform_file but with raw sql code instead of file path, because of two reasons:

  1. If you develop locally the file path will might be different to where you deploy to e.g. I generated the DAG locally but want to run it in astro-cli (in docker). I would have to copy the files or create a volume to be able to reference the files
  2. Because you can see the raw sql code being deployed - no surprises! If I remember correctly, I think this was also a request from Mike. I also don't think we can see the sql code in the "Rendered" view of the task instance if we pass a file path. With the current implementation you can see the rendered sql code after Render sql code with parameters in BaseSQLDecoratedOperator #897 has been merged.

Another note:
I can also create a transform_raw function or similar (unfortunately transform we already have, so I would have to move it somewhere else if we want to use the same name transform) to make it clearer that we transform raw sql - not sql from a file.

@feluelle feluelle marked this pull request as ready for review September 23, 2022 13:03
sql-cli/examples/_dags/advanced.py Outdated Show resolved Hide resolved
sql-cli/examples/_dags/advanced.py Outdated Show resolved Hide resolved
sql-cli/examples/_dags/advanced.py Outdated Show resolved Hide resolved
sql-cli/examples/_dags/simple.py Outdated Show resolved Hide resolved
sql-cli/examples/_dags/simple.py Outdated Show resolved Hide resolved
sql-cli/src/sql_cli/dag_generator.py Show resolved Hide resolved
sql-cli/src/sql_cli/dag_generator.py Show resolved Hide resolved
sql-cli/src/sql_cli/dag_generator.py Outdated Show resolved Hide resolved
sql-cli/src/sql_cli/sql_directory_parser.py Outdated Show resolved Hide resolved
sql-cli/src/sql_cli/sql_directory_parser.py Outdated Show resolved Hide resolved
@feluelle
Copy link
Member Author

feluelle commented Sep 26, 2022

@dimberman @tatiana please see this slack thread in #team-dag-authoring-sql-cli about why I had to remove the TaskGroup feature.

.pre-commit-config.yaml Outdated Show resolved Hide resolved
@feluelle
Copy link
Member Author

The failing checks are unrelated. I will try to address them separately.

.pre-commit-config.yaml Outdated Show resolved Hide resolved
sql-cli/pyproject.toml Outdated Show resolved Hide resolved
sql-cli/pyproject.toml Outdated Show resolved Hide resolved
Copy link
Collaborator

@dimberman dimberman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fantastic @feluelle. well-documented, clean, love it.

@@ -0,0 +1,46 @@
from __future__ import annotations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future, we may be able to rename this module to something jinja related - so we avoid generic utils blowing up. I'm happy for us to do this in a separate PR as well :)

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@feluelle minor comments, looking forward to seeing the feedback from our end-users on 0.1! 🎉

@feluelle feluelle merged commit 7b791e4 into main Oct 4, 2022
@feluelle feluelle deleted the feature/sql-directory-parser branch October 4, 2022 14:26
feluelle added a commit that referenced this pull request Oct 5, 2022
# Description

## What is the current behavior?

Currently, the tasks template creates `transform_sql` tasks, but we only
have `transform_file`.

related: #836 

## What is the new behavior?

Fix references to `transform_sql` by replacing it with `transform_file`

## Does this introduce a breaking change?

No.

### Checklist
- [ ] Created tests which fail without the change (if possible)
- [ ] Extended the README / documentation, if necessary
@feluelle feluelle added this to the sql-cli/0.1.0 milestone Oct 12, 2022
@feluelle feluelle removed this from the sql-cli/0.1.0 milestone Oct 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement the SQL DAG generation 0.1
4 participants