Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement] Async tasks should work 100% #89

Merged
merged 1 commit into from
Nov 25, 2020

Conversation

tom-pytel
Copy link
Contributor

These changes should address all remaining async-related problems not handled in #88. The spans list has been moved from an instance variable in SpanContext to a global task-local variable which allows "forking" it for new subtasks so that multiple concurrent async subtasks don't interfere with one another. spans is now a top-level module var because Python stipulates that ContextVar should be such and only one context exists at any given time so it is essentially a singleton anyway.

@kezhenxu94
Copy link
Member

kezhenxu94 commented Nov 25, 2020

The spans list has been moved from an instance variable in SpanContext to a global task-local variable which allows "forking" it for new subtasks so that multiple concurrent async subtasks don't interfere with one another.

By "interfere", I'm not perfectly sure what you mean, IMO, subtasks should share the same context as the one where it is spawned, is there other reason to duplicate the spans?

Also, I didn't succeed to make up a FAILED test case that "a single entry point executing multiple async exit requests simultaneously" (as you mentioned in #88), can you provide a snippet of code so that I can compliment the tests by adding a negative test case.

Other changes look good enough to me, thanks

@tom-pytel
Copy link
Contributor Author

By "interfere", I'm not perfectly sure what you mean, IMO, subtasks should share the same context as the one where it is spawned, is there other reason to duplicate the spans?

They do share the same context but the spans list must be different because if not then sibling tasks might wind up with a parent-child relationship since the parent for a new span is the span at the end of the current spans list. So for example if you were to have an entry span which then launches two other async spans concurrently then with the previous code the spans list might wind up as:

[EntrySpan, LocalSpan1, LocalSpan2]

With LocalSpan1 as parent to LocalSpan2, even though LocalSpan1 and LocalSpan2 should both have EntrySpan as the parent. Or worse, with ExitSpan the second would overwrite the first if I understand the code correctly and you would wind up with only one ExitSpan where two outgoing requests were made.

With this change each new async task gets its own isolated copy of a spans list which will wind up looking like:

[EntrySpan, LocalSpan1]

for the first sibling task and:

[EntrySpan, LocalSpan2]

for the second, with correct parents for everyone.

Also, I didn't succeed to make up a FAILED test case that "a single entry point executing multiple async exit requests simultaneously" (as you mentioned in #88), can you provide a snippet of code so that I can compliment the tests by adding a negative test case.

That kind of thing should no longer fail with this PR (which is the whole point of it) but should fail with #88 and below. If you were not successful in getting a fail case for #88 then that is different and I will look into it tomorrow so do let me know if you are referring to this PR with respect to not being able to make a fail case or #88.

@kezhenxu94
Copy link
Member

The explanation of "interfere" totally makes sense to me.

If you were not successful in getting a fail case for #88

That is what I meant, this kind of cases may be hard to reproduce stably but it would be nice if we can add a test case to possibly cover it as our best effort, so I need your help to reproduce it under #88, feel free to do it when you got some time. I'm merging this for now. Thanks again

@kezhenxu94 kezhenxu94 merged commit 606a005 into apache:master Nov 25, 2020
@tom-pytel
Copy link
Contributor Author

tom-pytel commented Nov 25, 2020

That is what I meant, this kind of cases may be hard to reproduce stably but it would be nice if we can add a test case to possibly cover it as our best effort, so I need your help to reproduce it under #88, feel free to do it when you got some time. I'm merging this for now. Thanks again

import time
from skywalking import agent
from skywalking.trace.context import get_context

agent.start()

async def test(num):
    # with get_context().new_local_span(f'child{num}'):
    with get_context().new_exit_span(f'child{num}', '0.0.0.0'):
        await asyncio.sleep(0.01)  # allow other tasks to tick and start spans
        if num == 1:
            get_context().active_span().raised()  # error a single child

async def main():
    with get_context().new_local_span('parent'):
        await asyncio.gather(
            test(0),
            test(1),
            test(2),
        )

asyncio.run(main())

time.sleep(1)  # allow BG daemon thread to finish updating

Compare #88 with this PR.

P.S. If you have a moment could you have a look at my question #5875 on the main skywalking issues board? I will be moving on to the NodeJS agent so would be nice to be able to test it as I work :)

@kezhenxu94
Copy link
Member

Thanks.

P.S. If you have a moment could you have a look at my question #5875 on the main skywalking issues board? I will be moving on to the NodeJS agent so would be nice to be able to test it as I work :)

I will check it soon

@tom-pytel
Copy link
Contributor Author

Out of curiosity, what purpose do Context.capture() and Context.continue() functions serve? Is it to preserve state across possibly asynchronous operation? If so then with this PR they may no longer be necessary.

@wu-sheng
Copy link
Member

Out of curiosity, what purpose do Context.capture() and Context.continue() functions serve? Is it to preserve state across possibly asynchronous operation? If so then with this PR they may no longer be necessary.

Let me give an example. Yes, those things designed for across threads, originally from Java agent core.
The reason to have this is to avoid the lock and get rid of the race condition bug in the tracing process. We are targeting 100% sampling tracing, and analyzing the metrics/topology based on tracing data.

Each language agent has its own choice whether it needs to adopt all modules in the java agent implementation.

@kezhenxu94
Copy link
Member

Out of curiosity, what purpose do Context.capture() and Context.continue() functions serve? Is it to preserve state across possibly asynchronous operation?

Yes

If so then with this PR they may no longer be necessary.

Seems like that's the case, feel free to open a pull request

@tom-pytel
Copy link
Contributor Author

I will return to this after the NodeJS agent stuff.

wu-sheng pushed a commit that referenced this pull request Feb 17, 2023
This PR is made of two tightly coupled parts:

* Total rewrite of agent startup logic from module functions -> singleton class. (some other logic was changed in meter to fix wrong forking behavior)
* Provide experimental support for os.fork(), exposed as an option.
* A demo directory to provide easier access to oap/kafka/demoservices (for contributors).

Minor changes:

* Docs: fixed some missed ones over time.
* Fixed a redis bug.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core enhancement New feature or request
Projects
None yet
4 participants