
Always use the executemany method when inserting rows in DbApiHook as it's much faster #38715

Merged

Conversation

@dabla (Contributor) commented Apr 3, 2024

[Screenshot: performance comparison of the two insert strategies, shared via Microsoft Teams]

In my previous pull request I added the executemany parameter to the insert_rows method so you could choose which strategy to apply when inserting rows, as the executemany method is much faster than the original implementation. You can see this in the picture above: when we ran a performance comparison with thousands of records inserted in bulk, the penultimate run in red had already taken 13 minutes without finishing (so we killed it), while the last run with the executemany strategy completed in merely a few minutes. So I decided to create this new pull request and always apply the faster executemany strategy, as the operator using the hook had no way to change that property anyway, and there was nothing foreseen to configure that parameter on the connection either. Also, why keep both strategies if one is better than the other? It's better to ditch the slower one, which makes the code less complex and easier to read.
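For readers less familiar with the DB-API, the core of the change is that rows are handed to cursor.executemany() in commit_every-sized batches instead of calling cursor.execute() once per row. Below is a minimal, hypothetical sketch of that strategy, not the actual DbApiHook code: cell serialization, replace/upsert variants and driver-specific placeholder styles are omitted, and the '%s' paramstyle is assumed.

```python
from contextlib import closing
from itertools import islice


def insert_rows_executemany(conn, table, rows, target_fields, commit_every=1000):
    """Insert rows in batches with cursor.executemany() instead of one execute() per row."""
    columns = ", ".join(target_fields)
    placeholders = ", ".join(["%s"] * len(target_fields))  # assumes '%s' paramstyle
    sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"

    rows_iter = iter(rows)
    with closing(conn.cursor()) as cur:
        while True:
            # Take the next chunk of rows; stop once the iterable is exhausted.
            chunk = list(islice(rows_iter, commit_every))
            if not chunk:
                break
            cur.executemany(sql, chunk)  # one driver round trip per chunk, not per row
            conn.commit()
```

The speed-up in the screenshot comes from exactly this batching: the driver can send a whole chunk of parameter tuples in one round trip rather than paying the per-statement overhead for every row.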



…as it's much faster than inserting each row separately and committing periodically in between
@dabla (Contributor, Author) commented Apr 4, 2024

Also deprecated the bulk_insert_rows method in the TeradataHook, as it does almost the same as what I did in insert_rows. For now it logs a deprecation warning message and just delegates the call to the insert_rows method.
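The deprecate-and-delegate pattern described here looks roughly like the following sketch; the parameter names are illustrative and may not match the real TeradataHook signature, and whether the warning goes through the hook's logger or the warnings module is an implementation detail.

```python
import warnings


def bulk_insert_rows(self, table, rows, target_fields=None, commit_every=5000):
    """Deprecated shim: warn, then delegate to the consolidated insert_rows."""
    warnings.warn(
        "bulk_insert_rows is deprecated; please use insert_rows instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return self.insert_rows(
        table=table,
        rows=rows,
        target_fields=target_fields,
        commit_every=commit_every,
    )
```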

@dabla dabla requested a review from Taragolis April 6, 2024 10:09
@potiuk (Member) left a comment


Would love other reviews, but it looks good.

NIT: why change the types in the tests to string? Will that work for all common.sql users? Also NIT2: we should let users know via a deprecation warning if they are still passing executemany as a parameter.
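On the second nit: one way to warn callers that still pass executemany would be to keep accepting the keyword but treat it as a no-op, for example as in this illustrative sketch (not necessarily what ended up in the PR):

```python
import warnings


def insert_rows(self, table, rows, target_fields=None, commit_every=1000, **kwargs):
    # Swallow the obsolete keyword so existing callers keep working, but tell
    # them it no longer changes anything: executemany is now always used.
    if "executemany" in kwargs:
        kwargs.pop("executemany")
        warnings.warn(
            "The 'executemany' parameter is obsolete; insert_rows now always uses executemany.",
            DeprecationWarning,
            stacklevel=2,
        )
    ...  # batching logic as sketched earlier in the thread
```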

@uranusjr (Member) commented Apr 8, 2024

Why not just always use executemany instead? It’s unclear to me why a new argument is needed.

@dabla (Contributor, Author) commented Apr 8, 2024

Why not just always use executemany instead? It’s unclear to me why a new argument is needed.

To elaborate a bit: that's indeed what this pull request does, always use executemany. In the previous PR you had the option to use the original implementation or the faster executemany one, so you had to pass the executemany parameter to get the faster implementation; that parameter is now obsolete as we will always use executemany. The TeradataHook also had an additional bulk_insert_rows method which used the faster executemany implementation; it now delegates to the insert_rows method since they share the same principle, so we end up with cleaner, smaller code that is easier to maintain.

@dabla (Contributor, Author) left a comment


Already applied those changes.

@dabla (Contributor, Author) commented Apr 11, 2024

Wondering why I have tests failing with this error message:

Error: Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/runner/work/airflow/airflow/.github/actions/post_tests_failure'. Did you forget to run actions/checkout before running your local action?

@dabla dabla requested a review from uranusjr April 11, 2024 19:01
@potiuk (Member) commented Apr 11, 2024

I re-ran it. Seems like an intermittent error because of broken Docker on the GitHub runner. Happens.

docker: Error response from daemon: unauthorized: authentication required.

@dabla (Contributor, Author) commented Apr 11, 2024

Also wondering why these tests are failing, as they seem unrelated to my changes (I think):

_______________ TestWasbBlobSensorTrigger.test_waiting_for_blob ________________
[gw1] linux -- Python 3.8.19 /usr/local/bin/python

self = <tests.providers.microsoft.azure.triggers.test_wasb.TestWasbBlobSensorTrigger object at 0x7f2742e336d0>
mock_check_for_blob = <AsyncMock name='check_for_blob_async' id='139805898015072'>
caplog = <_pytest.logging.LogCaptureFixture object at 0x7f2718e380a0>

    @pytest.mark.asyncio
    @mock.patch("airflow.providers.microsoft.azure.hooks.wasb.WasbAsyncHook.check_for_blob_async")
    async def test_waiting_for_blob(self, mock_check_for_blob, caplog):
        """Tests the WasbBlobSensorTrigger sleeps waiting for the blob to arrive."""
        mock_check_for_blob.side_effect = [False, True]
        caplog.set_level(logging.INFO)
    
        with mock.patch.object(self.TRIGGER.log, "info"):
            task = asyncio.create_task(self.TRIGGER.run().__anext__())
    
        await asyncio.sleep(POKE_INTERVAL + 0.5)
    
        if not task.done():
            message = (
                f"Blob {TEST_DATA_STORAGE_BLOB_NAME} not available yet in container {TEST_DATA_STORAGE_CONTAINER_NAME}."
                f" Sleeping for {POKE_INTERVAL} seconds"
            )
>           assert message in caplog.text
E           assert 'Blob test_blob_providers_team.txt not available yet in container test-container-providers-team. Sleeping for 5.0 seconds' in "WARNING  airflow.api_internal.internal_api_call:before.py:40 Starting call to 'airflow.api_internal.internal_api_call...ow.api_internal.internal_api_call.internal_api_call.<locals>.make_jsonrpc_request', this is the 3rd time calling it.\n"
E            +  where "WARNING  airflow.api_internal.internal_api_call:before.py:40 Starting call to 'airflow.api_internal.internal_api_call...ow.api_internal.internal_api_call.internal_api_call.<locals>.make_jsonrpc_request', this is the 3rd time calling it.\n" = <_pytest.logging.LogCaptureFixture object at 0x7f2718e380a0>.text

tests/providers/microsoft/azure/triggers/test_wasb.py:115: AssertionError

@potiuk (Member) commented Apr 11, 2024

Also wondering why these tests are failing as those are unrelated with my changes (I think):

I'd say it's a side effect of another test that probably interacts with caplog in a bad way, and looking at the log entry, it was likely introduced by this one: #38910. Just a watch-out @dstandish: it seems the tenacity retry added there can have some side effects, since it logs a warning on retries.

That's pretty strange and I am not sure how it can leak into other tests, but it's likely due to the asyncio nature of the tests combined with their xdist execution. I am afraid this one will be somewhat intermittent (it did not happen in the last run, it seems).

@potiuk potiuk merged commit 7ab24c7 into apache:main Apr 12, 2024
41 checks passed
@vincbeck (Contributor) commented
Hi @dabla,

This PR made one of our system tests fail because the rows parameter of insert_rows can also be a generator. See the fix here: #38972
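For context on why a generator trips this up: a generator can only be consumed once and has no len(), so any code that measures or materializes the rows up front, or iterates them twice, will see nothing the second time around. A hedged sketch of one generator-safe way to batch rows for executemany (illustrative only, not the actual fix in #38972):

```python
from itertools import islice


def chunked(rows, size):
    """Yield lists of at most `size` rows, consuming lists and generators alike exactly once."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


# Works the same whether `source` is a list or a generator expression:
# for batch in chunked((transform(r) for r in source), 1000):
#     cursor.executemany(sql, batch)
```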
