[Enhancement] Async tasks should work 100% #89

tom-pytel · 2020-11-24T17:38:10Z

These changes should address all remaining async-related problems not handled in #88. The spans list has been moved from an instance variable in SpanContext to a global task-local variable, which allows "forking" it for new subtasks so that multiple concurrent async subtasks don't interfere with one another. spans is now a top-level module variable because Python stipulates that a ContextVar should be created at module level, and only one context exists at any given time, so it is essentially a singleton anyway.

skywalking/trace/context.py

kezhenxu94 · 2020-11-25T03:23:13Z

> The spans list has been moved from an instance variable in SpanContext to a global task-local variable, which allows "forking" it for new subtasks so that multiple concurrent async subtasks don't interfere with one another.

By "interfere", I'm not entirely sure what you mean. IMO, subtasks should share the same context as the one in which they are spawned. Is there another reason to duplicate the spans?

Also, I didn't manage to construct a FAILING test case for "a single entry point executing multiple async exit requests simultaneously" (as you mentioned in #88). Can you provide a snippet of code so that I can complement the tests by adding a negative test case?

The other changes look good enough to me, thanks.

tom-pytel · 2020-11-25T03:45:47Z

> By "interfere", I'm not entirely sure what you mean. IMO, subtasks should share the same context as the one in which they are spawned. Is there another reason to duplicate the spans?

They do share the same context, but the spans list must be different. Otherwise sibling tasks might wind up with a parent-child relationship, since the parent of a new span is the span at the end of the current spans list. For example, if an entry span launches two other async spans concurrently, then with the previous code the spans list might wind up as:

[EntrySpan, LocalSpan1, LocalSpan2]

This makes LocalSpan1 the parent of LocalSpan2, even though both should have EntrySpan as their parent. Or worse, with ExitSpan the second would overwrite the first, if I understand the code correctly, and you would wind up with only one ExitSpan where two outgoing requests were made.

With this change each new async task gets its own isolated copy of the spans list, which will wind up looking like:

[EntrySpan, LocalSpan1]

for the first sibling task and:

[EntrySpan, LocalSpan2]

for the second, with correct parents for everyone.
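
To illustrate the difference, here is a minimal sketch that uses only the standard library and is not the agent's actual code. The names `_spans`, `child_shared`, `child_forked` and `demo` are hypothetical, and spans are represented by plain strings. It contrasts one list shared by all tasks with a stack held in a ContextVar, relying on the fact that asyncio gives every new Task a copy of the current context.

```python
import asyncio
import contextvars

# Hypothetical stand-in for the agent's module-level span stack. An immutable
# tuple means every set() stores a new stack instead of mutating a shared one.
_spans = contextvars.ContextVar("spans", default=())


async def child_shared(name, stack, parents):
    """Previous behaviour: every task appends to the same list object."""
    parents[name] = stack[-1]      # parent is whatever is last on the shared list
    stack.append(name)
    await asyncio.sleep(0.01)      # let the sibling task start its span
    stack.remove(name)


async def child_forked(name, parents):
    """Behaviour after this change: each task works on its own copy of the stack."""
    stack = _spans.get()
    parents[name] = stack[-1]
    token = _spans.set(stack + (name,))  # visible only in this task's context
    await asyncio.sleep(0.01)
    _spans.reset(token)


async def demo():
    shared_parents, forked_parents = {}, {}

    shared = ["EntrySpan"]
    await asyncio.gather(
        child_shared("LocalSpan1", shared, shared_parents),
        child_shared("LocalSpan2", shared, shared_parents),
    )

    token = _spans.set(("EntrySpan",))
    # gather() wraps each coroutine in a Task, and each Task copies the context.
    await asyncio.gather(
        child_forked("LocalSpan1", forked_parents),
        child_forked("LocalSpan2", forked_parents),
    )
    _spans.reset(token)
    return shared_parents, forked_parents


if __name__ == "__main__":
    shared_parents, forked_parents = asyncio.run(demo())
    print("shared list:", shared_parents)
    print("task-local: ", forked_parents)
```

In the shared-list variant LocalSpan2 records LocalSpan1 as its parent, while in the ContextVar variant both children record EntrySpan.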

> Also, I didn't manage to construct a FAILING test case for "a single entry point executing multiple async exit requests simultaneously" (as you mentioned in #88). Can you provide a snippet of code so that I can complement the tests by adding a negative test case?

That kind of thing should no longer fail with this PR (which is the whole point of it) but should fail with #88 and earlier. If you were not able to get a failing case for #88, that is a different matter and I will look into it tomorrow, so do let me know whether you mean this PR or #88 when you say you could not produce a failing case.

kezhenxu94 · 2020-11-25T03:55:29Z

The explanation of "interfere" totally makes sense to me.

> If you were not able to get a failing case for #88

That is what I meant. This kind of case may be hard to reproduce reliably, but it would be nice to add a test case that covers it on a best-effort basis, so I need your help to reproduce it under #88. Feel free to do it when you have some time. I'm merging this for now. Thanks again.

tom-pytel · 2020-11-25T13:38:34Z

> That is what I meant. This kind of case may be hard to reproduce reliably, but it would be nice to add a test case that covers it on a best-effort basis, so I need your help to reproduce it under #88. Feel free to do it when you have some time. I'm merging this for now. Thanks again.

import asyncio
import time
from skywalking import agent
from skywalking.trace.context import get_context

agent.start()

async def test(num):
    # with get_context().new_local_span(f'child{num}'):
    with get_context().new_exit_span(f'child{num}', '0.0.0.0'):
        await asyncio.sleep(0.01)  # allow other tasks to tick and start spans
        if num == 1:
            get_context().active_span().raised()  # error a single child

async def main():
    with get_context().new_local_span('parent'):
        await asyncio.gather(
            test(0),
            test(1),
            test(2),
        )

asyncio.run(main())

time.sleep(1)  # allow BG daemon thread to finish updating

Compare the result under #88 with the result under this PR.

P.S. If you have a moment, could you have a look at my question #5875 on the main skywalking issues board? I will be moving on to the NodeJS agent, so it would be nice to be able to test it as I work :)

kezhenxu94 · 2020-11-25T15:15:39Z

Thanks.

> P.S. If you have a moment, could you have a look at my question #5875 on the main skywalking issues board? I will be moving on to the NodeJS agent, so it would be nice to be able to test it as I work :)

I will check it soon.

tom-pytel · 2020-11-27T14:24:20Z

Out of curiosity, what purpose do the Context.capture() and Context.continue() functions serve? Is it to preserve state across possibly asynchronous operations? If so, then with this PR they may no longer be necessary.

wu-sheng · 2020-11-27T14:28:50Z

> Out of curiosity, what purpose do the Context.capture() and Context.continue() functions serve? Is it to preserve state across possibly asynchronous operations? If so, then with this PR they may no longer be necessary.

Let me explain. Yes, those are designed for crossing threads, and they originally come from the Java agent core.
The reason to have them is to avoid locks and get rid of race-condition bugs in the tracing process. We are targeting 100% sampling tracing, and analyzing the metrics/topology based on tracing data.

Each language agent can decide for itself whether it needs to adopt all modules of the Java agent implementation.
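
As a rough illustration of the cross-thread idea, here is a standard-library sketch. It is an analogy only and does not use the agent's capture()/continue() API. The names `current_span`, `read_span`, `run_in_thread` and `demo` are hypothetical. The parent thread takes a snapshot of its context with `contextvars.copy_context()`, and the worker thread runs inside that snapshot, so no lock is needed to hand the state over.

```python
import contextvars
import threading

# Hypothetical stand-in for the tracing state; not the agent's real data structure.
current_span = contextvars.ContextVar("current_span", default=None)


def read_span(results):
    """Record which span the calling thread can see."""
    results.append(current_span.get())


def run_in_thread(target, *args):
    """Run target(*args) in a new thread and wait for it to finish."""
    thread = threading.Thread(target=target, args=args)
    thread.start()
    thread.join()


def span_seen_with_snapshot():
    results = []
    snapshot = contextvars.copy_context()            # "capture" in the parent thread
    run_in_thread(snapshot.run, read_span, results)  # "continue" in the worker thread
    return results[0]


def span_seen_without_snapshot():
    # On a default CPython build a new thread starts with an empty context,
    # so the worker is expected to see only the ContextVar's default value.
    results = []
    run_in_thread(read_span, results)
    return results[0]


def demo():
    current_span.set("EntrySpan")
    return span_seen_with_snapshot(), span_seen_without_snapshot()


if __name__ == "__main__":
    with_snapshot, without_snapshot = demo()
    print("with snapshot:   ", with_snapshot)
    print("without snapshot:", without_snapshot)
```

Because asyncio copies the context into each new Task automatically, this explicit hand-over is only needed when work moves to another thread.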

kezhenxu94 · 2020-11-27T14:45:27Z

> Out of curiosity, what purpose do the Context.capture() and Context.continue() functions serve? Is it to preserve state across possibly asynchronous operations?

Yes.

> If so, then with this PR they may no longer be necessary.

Seems like that's the case. Feel free to open a pull request.

tom-pytel · 2020-11-27T16:01:02Z

I will return to this after the NodeJS agent stuff.

sonatype-lift bot reviewed Nov 24, 2020

[Enhancement] Async tasks should work 100%

0cb0200

tom-pytel force-pushed the master branch from afffdec to 0cb0200 Compare November 24, 2020 21:50

kezhenxu94 added the core and enhancement labels Nov 24, 2020

kezhenxu94 added this to the 0.5.0 milestone Nov 24, 2020

kezhenxu94 approved these changes Nov 25, 2020

kezhenxu94 merged commit 606a005 into apache:master Nov 25, 2020
