Always use the executemany method when inserting rows in DbApiHook as it's way much faster #38715

dabla · 2024-04-03T15:55:55Z

In my previous pull request I added the executemany parameter to the insert_rows method so that you could choose which strategy to apply when inserting rows, as the executemany method is much faster than the original implementation. You can see this in the picture above, where we compared performance while inserting thousands of records in bulk: the penultimate run (in red) had already taken 13 minutes without finishing, so we killed it, while the last run, using the executemany strategy, completed in just a few minutes. So I decided to create this new pull request to always apply the faster executemany strategy, since the operator using the hook had no way to change that property anyway, and nothing was foreseen to configure that parameter through the connection either. Also, why keep both strategies if one is better than the other? It's better to ditch the slower one, which makes the code less complex and easier to read.
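For readers who want to see the two strategies side by side, here is a minimal sketch. It uses the standard-library sqlite3 driver as a stand-in for any DB-API connection; the table name `items` and the helper names are made up for illustration, and this is not the actual DbApiHook code. Real timings depend heavily on the driver and the network round trip, so none are claimed here.

```python
import sqlite3


def make_db():
    # In-memory database with a hypothetical two-column table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
    return conn


def insert_row_by_row(conn, rows, commit_every=1000):
    """Original-style strategy: one execute() per row, commit every N rows."""
    cur = conn.cursor()
    for i, row in enumerate(rows, 1):
        cur.execute("INSERT INTO items (id, name) VALUES (?, ?)", row)
        if commit_every and i % commit_every == 0:
            conn.commit()
    conn.commit()
    cur.close()


def insert_with_executemany(conn, rows):
    """executemany strategy: hand the whole batch to the driver in one call."""
    cur = conn.cursor()
    cur.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)
    conn.commit()
    cur.close()


rows = [(i, f"name-{i}") for i in range(5000)]

conn_a, conn_b = make_db(), make_db()
insert_row_by_row(conn_a, rows)
insert_with_executemany(conn_b, rows)

# Both strategies must end up with the same data; only the call pattern differs.
count_a = conn_a.execute("SELECT COUNT(*) FROM items").fetchone()[0]
count_b = conn_b.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

With a remote database, the row-by-row variant pays one round trip per row, which is where a gap like the one described above would be expected to come from.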

dabla · 2024-04-04T08:26:29Z

I also deprecated the bulk_insert_rows method in Teradata, as it does almost the same as what insert_rows now does. For now it logs a deprecation warning and simply delegates the call to the insert_rows method.
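The deprecate-and-delegate pattern described here could look roughly like the sketch below. `SketchHook` is a hypothetical stand-in, not the real TeradataHook; the signatures, default values and messages are assumptions, and the built-in DeprecationWarning is used only to keep the example self-contained.

```python
import warnings


class SketchHook:
    """Hypothetical hook that records inserted rows instead of hitting a database."""

    def __init__(self):
        self.inserted = []

    def insert_rows(self, table, rows, target_fields=None, commit_every=1000):
        # Stand-in for the executemany-based insert.
        self.inserted.extend((table, tuple(row)) for row in rows)

    def bulk_insert_rows(self, table, rows, target_fields=None, commit_every=5000):
        # Tell callers to migrate, then reuse the generic implementation.
        warnings.warn(
            "bulk_insert_rows is deprecated; use insert_rows instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if not rows:
            raise ValueError("rows must not be None or empty")
        self.insert_rows(table, rows, target_fields=target_fields, commit_every=commit_every)


hook = SketchHook()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    hook.bulk_insert_rows("my_table", [(1, "a"), (2, "b")])
```

Using `stacklevel=2` makes the warning point at the caller's line rather than at the hook itself, which is usually what a user needs in order to find the call to migrate.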

potiuk

Would love other reviews, but it looks good.

NIT: why change the types in the tests to string? Will that work for all common.sql users? Also NIT2: we should let users know via a deprecation warning if they are still passing executemany as a parameter.

uranusjr · 2024-04-08T06:13:25Z

Why not just always use executemany instead? It’s unclear to me why a new argument is needed.

dabla · 2024-04-08T06:55:55Z

> Why not just always use executemany instead? It’s unclear to me why a new argument is needed.

To elaborate a bit: that's indeed what this pull request does, it always uses executemany. The previous PR gave you the option to use either the original implementation or the faster executemany one, so you had to pass the executemany parameter to get the faster implementation. That is now obsolete, as we will always use executemany. The TeradataHook also had an additional bulk_insert_rows method which used the faster executemany implementation; it now delegates to the insert_rows method, as they share the same principle. That leaves us with cleaner, shorter and thus more maintainable code.

dabla

Already applied those changes.

dabla · 2024-04-11T19:01:32Z

Wondering why I have tests failing with this error message:

Error: Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/runner/work/airflow/airflow/.github/actions/post_tests_failure'. Did you forget to run actions/checkout before running your local action?

potiuk · 2024-04-11T19:07:08Z

I re-ran it. Seems like an intermittent error caused by broken Docker on the GitHub Runner. Happens.

docker: Error response from daemon: unauthorized: authentication required.

dabla · 2024-04-11T19:07:22Z

Also wondering why these tests are failing, as they are unrelated to my changes (I think):

_______________ TestWasbBlobSensorTrigger.test_waiting_for_blob ________________
[gw1] linux -- Python 3.8.19 /usr/local/bin/python

self = <tests.providers.microsoft.azure.triggers.test_wasb.TestWasbBlobSensorTrigger object at 0x7f2742e336d0>
mock_check_for_blob = <AsyncMock name='check_for_blob_async' id='139805898015072'>
caplog = <_pytest.logging.LogCaptureFixture object at 0x7f2718e380a0>

    @pytest.mark.asyncio
    @mock.patch("airflow.providers.microsoft.azure.hooks.wasb.WasbAsyncHook.check_for_blob_async")
    async def test_waiting_for_blob(self, mock_check_for_blob, caplog):
        """Tests the WasbBlobSensorTrigger sleeps waiting for the blob to arrive."""
        mock_check_for_blob.side_effect = [False, True]
        caplog.set_level(logging.INFO)
    
        with mock.patch.object(self.TRIGGER.log, "info"):
            task = asyncio.create_task(self.TRIGGER.run().__anext__())
    
        await asyncio.sleep(POKE_INTERVAL + 0.5)
    
        if not task.done():
            message = (
                f"Blob {TEST_DATA_STORAGE_BLOB_NAME} not available yet in container {TEST_DATA_STORAGE_CONTAINER_NAME}."
                f" Sleeping for {POKE_INTERVAL} seconds"
            )
>           assert message in caplog.text
E           assert 'Blob test_blob_providers_team.txt not available yet in container test-container-providers-team. Sleeping for 5.0 seconds' in "WARNING  airflow.api_internal.internal_api_call:before.py:40 Starting call to 'airflow.api_internal.internal_api_call...ow.api_internal.internal_api_call.internal_api_call.<locals>.make_jsonrpc_request', this is the 3rd time calling it.\n"
E            +  where "WARNING  airflow.api_internal.internal_api_call:before.py:40 Starting call to 'airflow.api_internal.internal_api_call...ow.api_internal.internal_api_call.internal_api_call.<locals>.make_jsonrpc_request', this is the 3rd time calling it.\n" = <_pytest.logging.LogCaptureFixture object at 0x7f2718e380a0>.text

tests/providers/microsoft/azure/triggers/test_wasb.py:115: AssertionError

potiuk · 2024-04-11T19:30:20Z

> Also wondering why these tests are failing, as they are unrelated to my changes (I think):

I'd say it's a side effect of another test that probably interacts with caplog in a bad way. Looking at the log entry, it was likely introduced by #38910. Just a heads-up @dstandish: it seems the tenacity retry added there can have some side effects, since it logs a warning on retries.

That's pretty strange and I am not sure how it can leak to other tests, but it's likely due to the asyncio nature of the tests, combined with their xdist execution. I am afraid this one will be somewhat intermittent (it seems it did not happen in the last run).

vincbeck · 2024-04-12T17:54:56Z

Hi @dabla,

This PR made one of our system tests fail because the rows parameter of insert_rows can also be a generator. See the fix here: #38972
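To illustrate why a generator needs care here: a generator has no length and can only be consumed once, so any code that slices it or checks its size will break. One possible way to handle that is to pull fixed-size chunks with itertools.islice and pass each chunk to executemany. The sketch below uses sqlite3 and hypothetical helper names; it is not necessarily how #38972 fixes it.

```python
import sqlite3
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable, including generators."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def insert_rows(conn, rows, commit_every=1000):
    """Insert rows in batches with executemany, committing after each batch."""
    cur = conn.cursor()
    total = 0
    for chunk in chunked(rows, commit_every):
        cur.executemany("INSERT INTO items (id, name) VALUES (?, ?)", chunk)
        conn.commit()
        total += len(chunk)
    cur.close()
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")

# A generator: no len(), no slicing, single pass only.
row_generator = ((i, f"name-{i}") for i in range(2500))
inserted = insert_rows(conn, row_generator, commit_every=1000)
stored = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Because each chunk is materialized as a list, memory use stays bounded by `commit_every` rather than by the total number of rows.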

refactor: Always use the fast_executemany method when inserting rows as it's much faster than inserting each row separately and committing every once in a while in between

0312a8f

dabla requested a review from eladkal as a code owner April 3, 2024 15:55

boring-cyborg bot added area:providers provider:common-sql labels Apr 3, 2024

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

219bd43

Taragolis reviewed Apr 3, 2024

airflow/providers/common/sql/hooks/sql.py (outdated, resolved)

davidblain-infrabel and others added 6 commits April 3, 2024 18:39

refactor: Only set fast_executemany option if explicitly defined in constructor of Hook as this is not a standard supported option by all ODBC drivers

64b7a85

refactor: Fixed tests related to insert_rows for Postgres

c183404

refactor: Fixed assertions on executemany for insert rows in Postgres test

bc3895e

refactor: Deprecated bulk_insert_rows in Teradata as the insert_rows does the same by default, no need for a specialized method and thus delegate to insert_rows method

3af4c7c

refactor: Removed duplicate calls object definition in test_bulk_insert_rows_with_commit_every

ef36ed0

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

e222af5

davidblain-infrabel and others added 13 commits April 4, 2024 12:21

refactor: Use DeprecationWarning instead

d0fd763

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

4dbdff5

refactor: Use AirflowProviderDeprecationWarning instead

42b7ab7

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

58d3cfa

refactor: Ignore deprecation warnings for TestTeradataHook

04d0a47

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

6b0ce23

refactor: Re-added check on rows parameter for bulk_insert_rows method and changed some rows values to string for the TestTeradataHook

1cb5e5e

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

57318a9

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

4831868

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

6bb724d

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

35580fe

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

3a9fcbf

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

1e577c7

dabla requested a review from Taragolis April 6, 2024 10:09

potiuk reviewed Apr 7, 2024

refactor: Removed placeholder attribute from DbApiHook interface

0250d49

potiuk mentioned this pull request Apr 11, 2024

Fix update-common-sql-api-stubs pre-commit check #38915

Merged

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

e17aca5

uranusjr reviewed Apr 11, 2024

airflow/providers/common/sql/hooks/sql.py (outdated, resolved)

Revert unnecessary format change

15f76dc

uranusjr reviewed Apr 11, 2024

airflow/providers/common/sql/hooks/sql.py (outdated, resolved)

uranusjr reviewed Apr 11, 2024

airflow/providers/common/sql/hooks/sql.py (outdated, resolved)

uranusjr reviewed Apr 11, 2024

airflow/providers/common/sql/hooks/sql.py (resolved)

dabla and others added 4 commits April 11, 2024 12:38

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

7814ee3

docs: Updated docstring for excutemany parameter in insert_rows method

65f0518

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

8850f82

refactor: Renamed _closing_supporting_autocommit method to _create_autocommit_connection in DbApiHook

3b029e1

dabla commented Apr 11, 2024

dabla added 2 commits April 11, 2024 14:31

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

85536d9

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

b9ccfc1

dabla requested a review from uranusjr April 11, 2024 19:01

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

1d295c0

uranusjr reviewed Apr 11, 2024

airflow/providers/common/sql/hooks/sql.py (outdated, resolved)

Use comprehension instead of map

ad65b9b

uranusjr approved these changes Apr 11, 2024

Merge branch 'main' into feature/sql-performance-enhancement-insertmany

1c58f99

potiuk merged commit 7ab24c7 into apache:main Apr 12, 2024
41 checks passed

vincbeck mentioned this pull request Apr 12, 2024

Fix DbApiHook.insert_rows when rows is a generator #38972

Merged

eladkal mentioned this pull request May 1, 2024

Status of testing Providers that were prepared on May 01, 2024 #39346

Closed
