Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement: Performance enhancement for insert_rows method DbApiHook with fast executemany + SAP Hana support #37246

Merged
merged 10 commits into from
Feb 10, 2024

Conversation

dabla
Copy link
Contributor

@dabla dabla commented Feb 8, 2024

…st executemany method on the cursor can be used to achieve better performance when inserting in bulk. Also check if dialect is SAP Hana in _generate_insert_sql method so the UPSERT statement can be used as REPLACE INTO doesn't exist on SAP Hana.

Added executemany parameter to insert_rows method of DbApiHook so the fast executemany method on the cursor can be used to achieve better performance when inserting in bulk instead of iterating over each row and inserting it one by one which is not efficient. Also check if dialect is SAP Hana in _generate_insert_sql method so the UPSERT statement can be used as REPLACE INTO doesn't exist on SAP Hana. Those changes have been unit tested in TestDbApiHook.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

…st executemany method on the cursor can be used to achieve better performance when inserting in bulk. Also check if dialect is SAP Hana in _generate_insert_sql method so the UPSERT statement can be used as REPLACE INTO doesn't exist on SAP Hana.
@potiuk
Copy link
Member

potiuk commented Feb 8, 2024

I think it should be two-fold:

  • make a method in common.sql DBApi that you can override, returning the statement
  • in ODBCHook make it customizable (and implement the method to do so) - for example by passing the actual statement to return as keyword arg in the ODBCHook

We already have a mechanims that when you use the common SQL Operators, you can pass hook kwargs as __init__ parameter. This way you could either pass the kwargs in your DAG or write your own SAPHanaSQL* operators that will instantiate the common ones with those kwargs.

@dabla
Copy link
Contributor Author

dabla commented Feb 8, 2024

I think it should be two-fold:

  • make a method in common.sql DBApi that you can override, returning the statement
  • in ODBCHook make it customizable (and implement the method to do so) - for example by passing the actual statement to return as keyword arg in the ODBCHook

We already have a mechanims that when you use the common SQL Operators, you can pass hook kwargs as __init__ parameter. This way you could either pass the kwargs in your DAG or write your own SAPHanaSQL* operators that will instantiate the common ones with those kwargs.

Thank you Jarek, I understand now what you are proposing, will try to do it like this.

@dabla
Copy link
Contributor Author

dabla commented Feb 9, 2024

I've adapted the so that the format of the insert and replace statements are now parametrized and thus independant of any dialect. I just want to implement a test with an airflow connection to make sure the parameter in the DbApiHook is well interpreted from the extras field, once test is written the code can be reviewed then.

@potiuk
Copy link
Member

potiuk commented Feb 10, 2024

I've adapted the so that the format of the insert and replace statements are now parametrized and thus independant of any dialect. I just want to implement a test with an airflow connection to make sure the parameter in the DbApiHook is well interpreted from the extras field, once test is written the code can be reviewed then.

Yep. That's great direction.

@potiuk
Copy link
Member

potiuk commented Feb 10, 2024

Also what you could do with this change is likely to implement this: #34860 which will follow the same patterm but by Snowflake Hook

@dabla
Copy link
Contributor Author

dabla commented Feb 10, 2024

Also what you could do with this change is likely to implement this: #34860 which will follow the same patterm but by Snowflake Hook

We could think of some kind of strategy that is applied for certain hooks/or sqlalchemy dialects. We could have a default strategy, and if a specific strategy is found (base on dialect name for example), then use that one. This is something I was also thinking about now that you mention the above issue. Maybe something for another pull request, as I would like to finish this one as fast as possible as we are also impacted by it. I would also like to take the oppurtunity to add an common SQL operator which allows you to easily persist (insert/replace/upsert) a list of dicts into the database so that you don't need a python operator/decorated task with a hook, something that is now missing in Airflow I. The operator would also simplify the DAG's even more and is also benificient for the DAG processing. We tend to have as less as possible (custom) python code in our DAGs to improve governance and have fast solutions and easy to maintain DAG's which are less error prone.

@potiuk
Copy link
Member

potiuk commented Feb 10, 2024

We could think of some kind of strategy that is applied for certain hooks/or sqlalchemy dialects

Yep. why not. Nothing against doing it in separate PR.

@potiuk
Copy link
Member

potiuk commented Feb 10, 2024

OK. To merge it now and we can do further PRs/iterations to add more functionality here.

@potiuk potiuk merged commit 70fd6ad into apache:main Feb 10, 2024
55 checks passed
@@ -37,6 +37,7 @@

import sqlparse
from deprecated import deprecated
from more_itertools import chunked
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add more_itertools as a dependency for the provider?
Upon testing RC Provider common.sql: 1.11.0rc2, I am getting the below error

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/common/sql/operators/sql.py", line 28, in <module>
    from airflow.providers.common.sql.hooks.sql import DbApiHook, fetch_all_handler, return_single_query_results
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/common/sql/hooks/sql.py", line 40, in <module>
    from more_itertools import chunked
ModuleNotFoundError: No module named 'more_itertools'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm looks like dependency is missing for more-itertools

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Good point. We need to do it, we missed that one. Great catch @pankajkoti . We do have more-itertools in our CI image because it's a transitive dependency of a few other packages (hatch/ twine/salesforce) but we definitely need to add it for common.sql

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the quick fix. The (Astronomer) Providers RC automation test suite reported this failure :)

sunank200 pushed a commit to astronomer/airflow that referenced this pull request Feb 21, 2024
… with fast executemany + SAP Hana support (apache#37246)

* refactor: Added executemany parameter to insert_rows method so the fast executemany method on the cursor can be used to achieve better performance when inserting in bulk. Also check if dialect is SAP Hana in _generate_insert_sql method so the UPSERT statement can be used as REPLACE INTO doesn't exist on SAP Hana.

* refactor: Reformatted code to be static check compliant

* refactor: Refactored DbApiHook so that insert and replace statements are parametrized

# Conflicts:
#	tests/providers/common/sql/hooks/test_dbapi.py

* refactor: Fixed some formatting in DbApiHook

---------

Co-authored-by: David Blain <david.blain@infrabel.be>
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
… with fast executemany + SAP Hana support (apache#37246)

* refactor: Added executemany parameter to insert_rows method so the fast executemany method on the cursor can be used to achieve better performance when inserting in bulk. Also check if dialect is SAP Hana in _generate_insert_sql method so the UPSERT statement can be used as REPLACE INTO doesn't exist on SAP Hana.

* refactor: Reformatted code to be static check compliant

* refactor: Refactored DbApiHook so that insert and replace statements are parametrized

# Conflicts:
#	tests/providers/common/sql/hooks/test_dbapi.py

* refactor: Fixed some formatting in DbApiHook

---------

Co-authored-by: David Blain <david.blain@infrabel.be>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants