Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow users to disable schema check & creation on load_file #1921

Closed
5 of 6 tasks
tatiana opened this issue May 5, 2023 · 0 comments · Fixed by #1922
Closed
5 of 6 tasks

Allow users to disable schema check & creation on load_file #1921

tatiana opened this issue May 5, 2023 · 0 comments · Fixed by #1922
Labels
feature New feature or request priority/high High priority
Milestone

Comments

@tatiana
Copy link
Collaborator

tatiana commented May 5, 2023

Please describe the feature you'd like to see
As of Astro SDK 1.6.0, the load_file operation checks if the schema exists and if it doesn't, it attempts to create it.

Recently a user reported that the cost of checking if the schema exists is very high:
"I have a task that took 1:36 minutes to run, and it was 1:30 running the information schema query"

This was reported for Snowflake, but the same issue can apply to most of the supported Databases.

Describe the solution you'd like
Users should be able to run load_file with a boolean argument schema_exists. For backwards compatibility, the default value should be False. If this argument is False, the Python SDK does not check if the schema exists and does not attempt to create it.

Are there any alternatives to this feature?

  1. Find a more efficient way to check if the schema exists:
"SELECT SCHEMA_NAME from information_schema.schemata WHERE LOWER(SCHEMA_NAME) = %(schema_name)s;",
  1. Have a more generic way of allowing users to disable "optional" queries run by the Astro SDK.

Additional context
Follow up with customer on Slack: https://astronomer.slack.com/archives/C04L0HNK9ME/p1683231202383579?thread_ts=1682346906.404539&cid=C04L0HNK9ME

Acceptance Criteria

  • All checks and tests in the CI should pass
  • Unit tests (90% code coverage or more, once available)
  • Integration tests (if the feature relates to a new database or external service)
  • Example DAG
  • Docstrings in reStructuredText for each of methods, classes, functions and module-level attributes (including Example DAG on how it should be used)
  • Improve the documentation (README, Sphinx, and any other relevant)
@tatiana tatiana added the feature New feature or request label May 5, 2023
@tatiana tatiana changed the title Allow users to disable schema creation on load_file Allow users to disable schema check & creation on load_file May 5, 2023
@tatiana tatiana added the priority/high High priority label May 5, 2023
@tatiana tatiana changed the title Allow users to disable schema check & creation on load_file Add schema_exists argument to load_file to disable schema check & creation May 5, 2023
@tatiana tatiana changed the title Add schema_exists argument to load_file to disable schema check & creation Allow users to disable schema check & creation on load_file May 5, 2023
@tatiana tatiana added this to the 1.7.0 milestone May 5, 2023
tatiana added a commit that referenced this issue May 5, 2023
Support running `load_file` without checking if the table schema exists
or trying to create it.

Recently a user reported that the cost of checking if the schema exists
is very high for Snowflake:
"I have a (`load_file`) task that took 1:36 minutes to run, and it was
1:30 running the information schema query."
This is likely happening for other databases as well.

Introduce two ways of disabling schema checks:

1. On a per-task basis, by exposing the argument `schema_exists` in
`aql.load_file`
When this argument is `True`, the SDK will not check if the schema
exists or try to create it.
It is `False` by default, and the Python SDK will behave as of 1.6
(running schema check and, if needed, trying to create the schema)

2. Globally, by exposing the Airflow configuration
`load_table_schema_exists` in the `[astro-sdk]` section. This can also
be set using the environment variable
`AIRFLOW__ASTRO_SDK__LOAD_TABLE_SCHEMA_EXISTS`. The global configuration
can be overridden per task, using [1].

Closes: #1921
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request priority/high High priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant