Add sql directory parser and dag generator #836

feluelle · 2022-09-13T16:33:30Z

Description

What is the current behavior?

Currently, we do not have a way of parsing a directory of sql files and generating a DAG out of it.

closes: #923
related: #897

What is the new behavior?

This PR adds a SQL Directory Parser as well as a DAG Generator with examples.

The current implementation adds minimal configurability. More can be handled in a separate PR.
The following assumptions have been made:

- The sql files directory can be passed via cli arg
- The start_date is datetime.now()
- The conn_id, schema and database can be set via sql file header
- The output table name equals the sql file name
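
The header-based configuration assumed above (conn_id, schema and database set in a SQL file header) could look roughly like the following. This is a minimal sketch, not the parser from this PR: the `---`-delimited `key: value` header format, the `split_header` name and the sample values are all assumptions made for illustration.

```python
from __future__ import annotations


def split_header(sql_text: str) -> tuple[dict[str, str], str]:
    """Split a SQL file into (header metadata, SQL body).

    Assumes a hypothetical header format: a block delimited by lines
    containing only '---', holding simple 'key: value' pairs.
    """
    lines = sql_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, sql_text  # no header present
    metadata: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the SQL body.
            return metadata, "\n".join(lines[index + 1:])
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    # Opening delimiter without a closing one: treat the whole text as SQL.
    return {}, sql_text


# Hypothetical file content; the values are made up for the example.
example = """---
conn_id: my_sqlite
database: analytics
---
SELECT * FROM orders"""

metadata, sql = split_header(example)
print(metadata)
print(sql)
```

A file without a header simply yields an empty metadata dict, so defaults can be applied by the caller.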

Does this introduce a breaking change?

No.

Checklist

Created tests which fail without the change (if possible)
Extended the README / documentation, if necessary

feluelle · 2022-09-13T16:36:10Z

This is a very early draft of what it could look like using multiple aql.transform_file operators.

cc @dimberman

feluelle · 2022-09-14T06:50:33Z

Failed checks are unrelated. I'll add unit tests once we agree on a specific strategy.

I tested it by installing the following deps:

- networkx (for detecting the right order in which to call the transform steps)
- astro-sdk (for aql.transform_file calls, etc.)
- apache-airflow (for airflow imports such as DAG, etc.)

You can then run the examples as follows:

AIRFLOW__CORE__ENABLE_XCOM_PICKLING=True python sql-cli/src/sql_cli/main.py sql-cli/examples/simple
AIRFLOW__CORE__ENABLE_XCOM_PICKLING=True python sql-cli/src/sql_cli/main.py sql-cli/examples/advanced
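
As an illustration of the ordering problem that networkx is used for above, here is a minimal sketch. It makes several assumptions that are not taken from this PR: the standard-library `graphlib` stands in for networkx to keep the example dependency-free, a SQL file is assumed to reference another file as `{{ name }}`, and all file contents (and all names except `union_top_and_last`) are made up.

```python
from __future__ import annotations

import re
from graphlib import TopologicalSorter

# Hypothetical SQL files keyed by file name (without extension).
sql_files = {
    "orders": "SELECT * FROM raw_orders",
    "top_orders": "SELECT * FROM {{ orders }} ORDER BY amount DESC LIMIT 5",
    "last_orders": "SELECT * FROM {{ orders }} ORDER BY created_at DESC LIMIT 5",
    "union_top_and_last": (
        "SELECT * FROM {{ top_orders }} UNION SELECT * FROM {{ last_orders }}"
    ),
}

# Assumed reference syntax: a jinja-style variable naming another SQL file.
REFERENCE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def execution_order(files: dict[str, str]) -> list[str]:
    """Return file names ordered so that every file follows its dependencies."""
    graph = {
        name: {ref for ref in REFERENCE.findall(sql) if ref in files}
        for name, sql in files.items()
    }
    # static_order() yields predecessors first and raises CycleError on cycles.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(sql_files)
print(order)
```

The relative order of `top_orders` and `last_orders` is not fixed, since neither depends on the other.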

feluelle · 2022-09-14T07:11:50Z

@dimberman @tatiana PTAL

Note that we can add metadata to the SQL files and use frontmatter, as Daniel did, to read it. Alternatively, we can use YAML file(s) for configuration and additional metadata. But please let me know first whether this approach, using jinja and networkx, is a potential candidate, or whether we should use what Daniel has already built (aql.render).

I like the jinja approach more as it generates a dag file from a template once (e.g. on cli call) which should in the end be faster for the airflow dag parser to process - instead of the aql.render which generates the dag/tasks every time the DAG file calling it gets parsed. Moreover the jinja approach allows someone to debug the DAG code if necessary whereas in the aql.render approach everything is dynamically being generated.

EDIT: Please see decision doc for pros and cons
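
To make the generate-once idea above concrete, here is a minimal sketch of rendering a DAG file as text from a template. It is not the template from this PR: `string.Template` stands in for jinja to keep the example dependency-free, the `render_dag` helper is hypothetical, and the `sql=` keyword in the generated `aql.transform_file(...)` call is an assumption rather than the real signature. The generated text is only parsed here, never executed.

```python
from __future__ import annotations

import ast
from string import Template

# string.Template stands in for jinja here purely to keep the sketch
# dependency-free; the PR itself uses jinja templates.
DAG_TEMPLATE = Template(
    "from datetime import datetime\n"
    "\n"
    "from airflow import DAG\n"
    "from astro import sql as aql\n"
    "\n"
    'with DAG(dag_id="$dag_id", start_date=datetime.now()) as dag:\n'
    "$tasks\n"
)


def render_dag(dag_id: str, sql_by_name: dict[str, str]) -> str:
    """Return Python source for a DAG file with one task per SQL statement."""
    task_lines = [
        # repr() yields a valid Python string literal for the raw SQL.
        # The keyword name `sql` is an assumption, not the real signature.
        f"    {name} = aql.transform_file(sql={sql!r})"
        for name, sql in sql_by_name.items()
    ]
    return DAG_TEMPLATE.substitute(dag_id=dag_id, tasks="\n".join(task_lines))


source = render_dag("simple", {"orders": "SELECT * FROM raw_orders"})
ast.parse(source)  # checks only that the generated text is valid Python syntax
print(source)
```

Because the output is an ordinary Python file, it can be read and debugged like any hand-written DAG, which is the trade-off discussed above.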

kaxil · 2022-09-14T11:03:09Z

Markdown link check should be fixed by #842

codecov · 2022-09-14T11:45:46Z

Codecov Report

Base: 94.46% // Head: 94.46% // No change to project coverage 👍

Coverage data is based on head (4c83951) compared to base (0438577).
Patch has no changes to coverable lines.

❗ Current head 4c83951 differs from pull request most recent head 096394a. Consider uploading reports for the commit 096394a to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #836   +/-   ##
=======================================
  Coverage   94.46%   94.46%           
=======================================
  Files          47       47           
  Lines        2113     2113           
  Branches      229      229           
=======================================
  Hits         1996     1996           
  Misses         83       83           
  Partials       34       34


sql-cli/src/sql_cli/sql_directory_parser.py

sql-cli/src/sql_cli/dag_generator.py

tatiana · 2022-09-20T12:12:54Z

@feluelle I like how simple and how explicit this approach is. Let's wait for others' feedback 🤞

dimberman

@feluelle given the pros and cons we described, are we only going with the "generate transform_file function" option? Or would the idea be that the user can decide whether to generate or simply render at runtime?

Given the speeds we've seen, I don't think that "it renders faster" is a huge concern here. We were able to render 1000 tasks in pretty much the same amount of time.

The debugging argument also seems a bit iffy to me. Ultimately the error either way is going to be "could not parse at ..../foo.sql.py"; I don't think it really matters which line in the Airflow DAG starts the task. Either way the user is going to need to debug the SQL file itself.

Also, with the aql.transform_file approach the DAG would be essentially unreadable when you get to 100 or 1000 tasks. It would generate a monstrously large DAG that would be really painful to edit.

sql-cli/src/sql_cli/dag_generator.py

sql-cli/src/sql_cli/sql_directory_parser.py

feluelle · 2022-09-21T07:03:20Z

> @feluelle given the pros and cons we described, are we only going with the "generate transform_file function" option? Or would the idea be that the user can decide whether to generate or simply render at runtime?

I think it makes sense to offer both options:

- aql.render in python-sdk, as it is a Python function that can be added directly to an existing DAG (probably favored by DE)
- the jinja approach in sql-cli, as it does not require the user to know Airflow or Python at all to generate a DAG (probably favored by DA)

> Given the speeds we've seen, I don't think that "it renders faster" is a huge concern here. We were able to render 1000 tasks in pretty much the same amount of time.

This is correct. I have edited my comment on GitHub to link to the Notion doc.

> The debugging argument also seems a bit iffy to me. Ultimately the error either way is going to be "could not parse at ..../foo.sql.py", I don't think it really matters which line in the airflow DAG starts the task. Either way the user is going to need to debug the SQL file itself.

Same here. Please let us have this discussion in this comment of the docs.

> Also, with the aql.transform_file approach the DAG would be essentially unreadable when you get to 100 or 1000 tasks. It would generate a monstrously large DAG that would be really painful to edit.

The point is that you should not edit it as it gets overwritten every time we regenerate the DAG file. This is indeed one of the "Cons" of this approach, see this block in the docs. With the new dataset support in Airflow and astro-sdk, we can also easily split it into multiple DAGs 👍 (we should probably do that anyway, even if not meant to be edited.)

feluelle · 2022-09-23T12:55:17Z

@dimberman @tatiana I changed the code to use transform_file with raw SQL code instead of a file path, for two reasons:

- When you develop locally, the file path might differ from the path where you deploy. For example, if I generate the DAG locally but want to run it in astro-cli (in Docker), I would have to copy the files or create a volume to be able to reference them.
- You can see the raw SQL code being deployed, so there are no surprises. If I remember correctly, this was also a request from Mike. I also don't think we can see the SQL code in the "Rendered" view of the task instance if we pass a file path. With the current implementation, you can see the rendered SQL code once "Render sql code with parameters in BaseSQLDecoratedOperator" (#897) has been merged.

Another note:
I can also create a transform_raw function or similar to make it clearer that we transform raw SQL, not SQL from a file. (Unfortunately the name transform is already taken, so I would have to move the existing function somewhere else if we want to reuse that name.)

sql-cli/examples/_dags/advanced.py

sql-cli/examples/_dags/simple.py

sql-cli/src/sql_cli/dag_generator.py

sql-cli/src/sql_cli/sql_directory_parser.py

feluelle · 2022-09-26T08:43:58Z

@dimberman @tatiana please see this slack thread in #team-dag-authoring-sql-cli about why I had to remove the TaskGroup feature.

.pre-commit-config.yaml

sql-cli/examples/_dags/simple.py

sql-cli/examples/advanced/mart/union_top_and_last.sql

- move exceptions to its own file
- document functions in main

feluelle · 2022-09-30T12:28:57Z

The failing checks are unrelated. I will try to address them separately.

.pre-commit-config.yaml

sql-cli/pyproject.toml

sql-cli/src/sql_cli/dag_generator.py

dimberman

This looks fantastic @feluelle. well-documented, clean, love it.

sql-cli/src/sql_cli/macros/tasks.py.jinja2

sql-cli/src/sql_cli/sql_directory_parser.py

tatiana · 2022-09-30T14:09:22Z

sql-cli/src/sql_cli/utils.py

@@ -0,0 +1,46 @@
+from __future__ import annotations


In future, we may be able to rename this module to something jinja related - so we avoid generic utils blowing up. I'm happy for us to do this in a separate PR as well :)

sql-cli/tests/_dags/sql_files.py

tatiana

@feluelle minor comments, looking forward to seeing the feedback from our end-users on 0.1! 🎉

- add generated tests dag and sql to .gitignore
- fix dependencies
- improve tests
- regenerate examples
- refactor code

- add test for symlinks

# Description

## What is the current behavior?

Currently, the tasks template creates `transform_sql` tasks, but we only have `transform_file`.

related: #836

## What is the new behavior?

Fix references to `transform_sql` by replacing it with `transform_file`

## Does this introduce a breaking change?

No.

### Checklist

- [ ] Created tests which fail without the change (if possible)
- [ ] Extended the README / documentation, if necessary

feluelle force-pushed the feature/sql-directory-parser branch 2 times, most recently from dabb6a6 to 28e63ec Compare September 14, 2022 06:35

tatiana reviewed Sep 20, 2022

sql-cli/src/sql_cli/sql_directory_parser.py (outdated)

tatiana reviewed Sep 20, 2022

sql-cli/src/sql_cli/dag_generator.py (outdated)

tatiana reviewed Sep 20, 2022

sql-cli/src/sql_cli/dag_generator.py (outdated)

dimberman requested changes Sep 20, 2022

sql-cli/src/sql_cli/dag_generator.py

sql-cli/src/sql_cli/dag_generator.py (outdated)

sql-cli/src/sql_cli/sql_directory_parser.py (outdated)

feluelle requested review from dimberman and tatiana September 21, 2022 08:06

feluelle force-pushed the feature/sql-directory-parser branch 2 times, most recently from c1544e9 to 741f8bb Compare September 23, 2022 12:40

feluelle marked this pull request as ready for review September 23, 2022 13:03

feluelle requested review from utkarsharma2, sunank200, pankajastro and pankajkoti as code owners September 23, 2022 13:03

dimberman requested changes Sep 23, 2022

feluelle force-pushed the feature/sql-directory-parser branch from 169006b to c6902aa Compare September 26, 2022 11:26

feluelle requested a review from dimberman September 26, 2022 11:32

tatiana reviewed Sep 26, 2022

.pre-commit-config.yaml (outdated)

tatiana reviewed Sep 26, 2022

sql-cli/examples/_dags/simple.py

tatiana reviewed Sep 26, 2022

sql-cli/examples/advanced/mart/union_top_and_last.sql (outdated)

feluelle and others added 4 commits September 30, 2022 09:02

Apply code suggestions

e48aae2

- move exceptions to its own file
- document functions in main

Clarify docstring for __gt__ method

c3aaa96

Merge branch 'main' into feature/sql-directory-parser

4602014

Merge branch 'main' into feature/sql-directory-parser

a19023e