Skip to content

[AIRFLOW-6921] Fetch celery states in bulk#7542

Merged
mik-laj merged 5 commits intoapache:masterfrom
PolideaInternal:AIRFLOW-6921
May 11, 2020
Merged

[AIRFLOW-6921] Fetch celery states in bulk#7542
mik-laj merged 5 commits intoapache:masterfrom
PolideaInternal:AIRFLOW-6921

Conversation

@mik-laj
Copy link
Member

@mik-laj mik-laj commented Feb 26, 2020

Before:
For each task, we send n query
After:
When we use BaseKeyValueStoreBackend or DatabaseBackend, we send only one query
When we use other backends, we send n queries (old behavior)

This change will be particularly significant if someone is using a backend that is on a different node.


Issue link: AIRFLOW-6921

Make sure to mark the boxes below before creating PR: [x]

  • Description above provides context of the change
  • Commit message/PR title starts with [AIRFLOW-NNNN]. AIRFLOW-NNNN = JIRA ID*
  • Unit tests coverage for changes (not needed for documentation changes)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

* For document-only changes commit message can start with [AIRFLOW-XXXX].


In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Feb 26, 2020
@mik-laj mik-laj force-pushed the AIRFLOW-6921 branch 6 times, most recently from 9538e8a to 9377e21 Compare February 27, 2020 12:48
@ashb ashb requested review from KevinYang21 and saguziel February 27, 2020 13:50
@mik-laj mik-laj force-pushed the AIRFLOW-6921 branch 2 times, most recently from 490f219 to 0c2a1e1 Compare March 1, 2020 00:16
@turbaszek turbaszek mentioned this pull request Mar 18, 2020
6 tasks
@stale
Copy link

stale bot commented Apr 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Apr 23, 2020
@mik-laj mik-laj added the pinned Protect from Stalebot auto closing label Apr 23, 2020
@stale stale bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Apr 23, 2020
@mik-laj mik-laj force-pushed the AIRFLOW-6921 branch 4 times, most recently from 38b6b25 to 4a6d2ed Compare May 5, 2020 08:45
@mik-laj mik-laj changed the title [AIRFLOW-6921][WIP] Fetch celery states in bulk [AIRFLOW-6921] Fetch celery states in bulk May 6, 2020
@mik-laj
Copy link
Member Author

mik-laj commented May 6, 2020

CC: @KevinYang21 @saguziel Can you look at this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
exception_traceback = "Celery Task ID: {}\n{}".format(async_result, traceback.format_exc())
exception_traceback = f"Celery Task ID: {async_result}\n{traceback.format_exc()}"

Copy link
Member

@KevinYang21 KevinYang21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elegant solution, nice. Do we have any performance data also to share?

@mik-laj
Copy link
Member Author

mik-laj commented May 11, 2020

Backend: <class 'celery.backends.database.DatabaseBackend'> (Postgres)
My solution;

Loop time: 33.211 ms
Loop time: 16.434 ms
Loop time: 33.305 ms
Loop time: 36.204 ms
Loop time: 19.188 ms
Average time: 27.825 ms

Multiprocessing:

Loop time: 3520.833 ms
Loop time: 3015.769 ms
Loop time: 3014.007 ms
Loop time: 3008.050 ms
Loop time: 3011.210 ms
Average time: 3114.103 ms

Diff: -3087 ms (114 00%)

Backend: <class 'celery.backends.redis.RedisBackend'>
My solution:

Loop time: 4.023 ms
Loop time: 1.918 ms
Loop time: 5.565 ms
Loop time: 1.459 ms
Loop time: 1.200 ms
Average time: 3.417 ms

Multiprocessing:

Loop time: 1144.165 ms
Loop time: 686.941 ms
Loop time: 688.492 ms
Loop time: 691.988 ms
Loop time: 680.671 ms
Average time: 778.577 ms

Diff: -775 ms (258 33 %)

Details
import time

from airflow.executors.celery_executor import BulkStateFetcher, execute_command, app


def create_tasks(task_count: int):
    tasks = [["airflow", "version"] for _ in range(task_count)]
    async_tasks = []

    for task in tasks:
        async_tasks.append(execute_command.apply_async(args=[task]))
    time.sleep(1)
    return async_tasks


from perf_kit.repeat_and_time import timing, repeat
REPEAT_COUNT = 5


print("Backend: ", type(app.backend))
tasks = create_tasks(5000)


@timing(REPEAT_COUNT)
@repeat(REPEAT_COUNT)
@timing()
def case():
    fetcher = BulkStateFetcher(sync_parralelism=16)
    fetcher.get_many({})


case()

@timing(REPEAT_COUNT)
@repeat(REPEAT_COUNT)
@timing()
def case():
    fetcher = BulkStateFetcher(sync_parralelism=16)
    fetcher._get_many_using_multiprocessing(tasks)


case()

@mik-laj
Copy link
Member Author

mik-laj commented May 11, 2020

This is a test when all components are local. In a real environment, network latency has a big impact on the result because many round-trips. I mainly try to limit the number of queries to limit the number of round-trips.

@turbaszek
Copy link
Member

@mik-laj big kudos for this change! Should we merge it? :)

@mik-laj
Copy link
Member Author

mik-laj commented May 11, 2020

Yes. I did rebase today and i waited for the result.

@mik-laj mik-laj merged commit 5f3774a into apache:master May 11, 2020
@mik-laj mik-laj deleted the AIRFLOW-6921 branch May 11, 2020 08:44
@auvipy
Copy link
Contributor

auvipy commented May 11, 2020

great job!

@mik-laj
Copy link
Member Author

mik-laj commented May 11, 2020

Result for 500 tasks (A more common case):
Backend: <class 'celery.backends.database.DatabaseBackend'> (Postgres)

My solution:

Loop time: 39.800 ms
Loop time: 16.189 ms
Loop time: 15.799 ms
Loop time: 14.845 ms
Loop time: 19.363 ms
Average time: 21.314 ms

Multiprocessing:

Loop time: 500.316 ms
Loop time: 472.745 ms
Loop time: 488.694 ms
Loop time: 494.459 ms
Loop time: 495.087 ms
Average time: 490.514 ms

Diff: -469 ms (22 33%)

Backend: <class 'celery.backends.redis.RedisBackend'>
My solution:

Loop time: 4.843 ms
Loop time: 0.943 ms
Loop time: 1.421 ms
Loop time: 0.757 ms
Loop time: 1.215 ms
Average time: 2.051 ms

Multiprocessing:

Loop time: 290.430 ms
Loop time: 168.682 ms
Loop time: 269.204 ms
Loop time: 285.174 ms
Loop time: 276.070 ms
Average time: 258.048 ms

Diff: -256 ms (128 00%)

@mik-laj
Copy link
Member Author

mik-laj commented May 11, 2020

For 1 task:
Backend: <class 'celery.backends.redis.RedisBackend'>
My solution:

Loop time: 2.100 ms
Loop time: 0.722 ms
Loop time: 0.871 ms
Loop time: 0.589 ms
Loop time: 0.877 ms
Average time: 1.162 ms

Multiprocessing

Loop time: 130.874 ms
Loop time: 109.608 ms
Loop time: 120.918 ms
Loop time: 114.716 ms
Loop time: 116.159 ms
Average time: 118.662 ms

Diff: 117 ms

Backend: <class 'celery.backends.database.DatabaseBackend'> (Postgres)
My solution:

Loop time: 36.081 ms
Loop time: 17.839 ms
Loop time: 16.895 ms
Loop time: 18.450 ms
Loop time: 20.996 ms
Average time: 22.199 ms

Multiprocessing:

Loop time: 266.681 ms
Loop time: 116.485 ms
Loop time: 114.453 ms
Loop time: 111.432 ms
Loop time: 111.997 ms
Average time: 144.389 ms

Diff: -122 ms

@ashb
Copy link
Member

ashb commented May 11, 2020

Nice one, using multiprocessing pool there always felt a bit wrong.

@ashb
Copy link
Member

ashb commented May 11, 2020

@auvipy speaking as a celery dev: everything here is using a "public" API right? You don't see any long-term problems with this?

(It looks all good to my eyes, but I'm not familiar with internals of Celery)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:performance area:Scheduler including HA (high availability) scheduler pinned Protect from Stalebot auto closing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants