Prioritize blocked task messaging over idle tasks #627
Conversation
unified-scheduler-pool/src/lib.rs
Outdated
recv(idle_task_receiver) -> task => {
    if let Ok(task) = task {
        (task, &finished_idle_task_sender)
    } else {
        idle_task_receiver = never();
        continue;
    }
},
here (back ref: solana-labs#34676 (comment))
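For readers unfamiliar with the idle_task_receiver = never() arm above: this is the usual crossbeam-channel way to disable a select! arm once its sender side disconnects, rather than spinning on Err. A minimal standalone sketch with made-up channels and messages (not the scheduler's actual code):

use crossbeam_channel::{never, select, unbounded};

fn main() {
    let (idle_tx, mut idle_rx) = unbounded::<u64>();
    let (blocked_tx, blocked_rx) = unbounded::<u64>();

    idle_tx.send(1).unwrap();
    drop(idle_tx); // the idle side shuts down after a single task
    blocked_tx.send(2).unwrap();

    // Exactly three events arrive: idle task 1, the idle disconnect, and
    // blocked task 2 (in some interleaving), so three iterations drain them.
    for _ in 0..3 {
        select! {
            recv(blocked_rx) -> msg => println!("blocked: {:?}", msg.ok()),
            recv(idle_rx) -> msg => match msg {
                Ok(task) => println!("idle: {task}"),
                // Disconnected: swap in never() so this arm can never fire
                // again, instead of repeatedly returning Err in a hot loop.
                Err(_) => idle_rx = never(),
            },
        }
    }
}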
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@           Coverage Diff           @@
##           master     #627   +/-   ##
=========================================
  Coverage     81.8%    81.8%
=========================================
  Files          851      851
  Lines       230236   230326    +90
=========================================
+ Hits        188535   188609    +74
- Misses       41701    41717    +16
unified-scheduler-pool/src/lib.rs
Outdated
recv(blocked_task_receiver.for_select()) -> message => {
    if let Some(task) = blocked_task_receiver.after_select(message.unwrap()) {
        (task, &finished_blocked_task_sender)
    } else {
        continue;
    }
},
recv(idle_task_receiver) -> task => {
    if let Ok(task) = task {
        (task, &finished_idle_task_sender)
    } else {
        idle_task_receiver = never();
        continue;
    }
},
I'm not convinced that the multi-channel setup works correctly without select_biased!.
Let's say we have something along the lines of:
- blocked_task_sender => new context
- idle_task_sender => idle tasks
The idle tasks should be for the new context, but as far as I can tell, there's nothing preventing them from being randomly picked up before the new context in the handler threads.
I do believe this will work with a select_biased! call and the proper ordering, but with the random select-ing it seems like there's a random chance of the chained-channel handling getting messed up with respect to the idle tasks.
Maybe I am missing something?
If I am wrong about this, it'd be great to add a test to convince me 😄
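For what it's worth, the random pick is easy to reproduce outside the scheduler. A minimal standalone demo of the concern (plain crossbeam-channel, made-up message strings, nothing from the actual scheduler code): when both channels already hold a message, select! chooses among the ready operations at random, so the idle task can be received before the message that opens its context.

use crossbeam_channel::{select, unbounded};

fn main() {
    let (context_tx, context_rx) = unbounded::<&'static str>();
    let (idle_tx, idle_rx) = unbounded::<&'static str>();

    // Scheduler-side ordering: the new context first, then its idle task.
    context_tx.send("open new context").unwrap();
    idle_tx.send("idle task for that context").unwrap();

    // With both messages ready, select! picks an arm at random, so roughly
    // half the runs print the idle task before the context that owns it.
    select! {
        recv(context_rx) -> msg => println!("picked first: {}", msg.unwrap()),
        recv(idle_rx) -> msg => println!("picked first: {}", msg.unwrap()),
    }
}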
You're totally correct. You saved me. This is a race condition...: f02ebd8, 4e045d7

> I do believe this will work with a select_biased! call and the proper ordering

Well, this won't work even with select_biased!... It'll first try to receive blocked, then idle. After that, it'll sched_yield. Before the sched_yield, the handler thread could still see the idle task for the next context, if the task became visible to the thread just after it tried to receive blocked and missed seeing the new context. This means the scheduler thread managed to send the new context and then that context's idle task in succession between the two try_recvs.

I think this is the root cause of a mysterious panic, which I only observed once while running against mb...
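To illustrate the window being described (a standalone sketch with plain channels and made-up messages, not the PR's code): even if the handler polls blocked strictly before idle, both sends can land between the two polls.

use crossbeam_channel::unbounded;

fn main() {
    let (blocked_tx, blocked_rx) = unbounded::<&'static str>();
    let (idle_tx, idle_rx) = unbounded::<&'static str>();

    // Handler thread, poll #1: the blocked channel is empty at this instant.
    assert!(blocked_rx.try_recv().is_err());

    // Scheduler thread runs now, in the window between the two polls,
    // sending the new context and then its idle task back to back.
    blocked_tx.send("open new context").unwrap();
    idle_tx.send("idle task for that context").unwrap();

    // Handler thread, poll #2: it observes the idle task even though the
    // message opening its context still sits unread in the other channel.
    assert_eq!(idle_rx.try_recv().unwrap(), "idle task for that context");
    assert!(blocked_rx.try_recv().is_ok());
}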
see above comment about potential issue with chained channels and the idle task senders
I'm really grateful that you caught the issue.. CI is still pending, but I think I've fixed it...
Hey, I was reviewing this again and had some concerns about the ChainedChannel design in general. After digging, I do think the impl is working now, but I think the way it's written is a bit confusing. I think making the order of things more explicit can help clarify the code, wdyt?

loop {
    let Ok(NewTaskPayload::OpenSubchannel(new_context)) = new_task_receiver.recv() else {
        // handle logical error.
        break;
    };
    loop {
        // your select! loop here, but with the OpenSubchannel stuff being removed.
    }
}

This code makes it more clear that we expect to open a subchannel, loop until we are finished, then eventually open another new one.
Thanks as always for taking a deep look.
Good point. This understanding is correct.
done: 6904938. I think this incurs an additional recv per session, but I think it's a good compromise to improve readability.
        None
    );
} else {
    unreachable!();
Like the bunch of .unwrap()s here and there, this unreachable!() will be replaced with proper thread shutdown code in later PRs.
Not sure I understand where the additional recv per session is happening, could you give an example? I thought this would just be moving the initial recv, which is effectively assumed to be an OpenSubchannel, not causing an additional one.
Yeah, that's correct. I was wrong... Actually, this commit results in fewer recvs.
* Prioritize blocked task messaging over idle tasks
* Add test_scheduler_schedule_execution_blocked
* Add commentary
* Reword commentary a bit
* Document not-chosen approach in detail
* Use {crossbeam,chained}_channel::unbounded()
* Add test for race condition around SchedulingContext
* Fix race condition around SchedulingContext
* Comment about finished channels
* Clean up the scheduling context race test
* Codify hidden assumption with extra 1 recv/session (EDIT: there's actually no extra recv)
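As an aside, one way to express the "prioritize blocked over idle" bias named in the first commit (a sketch under assumed names, not necessarily how this PR implements it): drain the prioritized channel with try_recv first, and fall back to a blocking select! over both channels only when it is empty.

use crossbeam_channel::{select, unbounded, Receiver};

// Take the next task, preferring blocked tasks whenever one is already queued
// (illustrative names; not the PR's actual types or functions).
fn next_task(blocked_rx: &Receiver<String>, idle_rx: &Receiver<String>) -> Option<String> {
    // Fast path: a blocked task is waiting right now, take it unconditionally.
    if let Ok(task) = blocked_rx.try_recv() {
        return Some(task);
    }
    // Slow path: nothing blocked is pending, so wait on either channel;
    // a disconnected channel yields None here.
    select! {
        recv(blocked_rx) -> task => task.ok(),
        recv(idle_rx) -> task => task.ok(),
    }
}

fn main() {
    let (blocked_tx, blocked_rx) = unbounded();
    let (idle_tx, idle_rx) = unbounded();

    idle_tx.send("idle".to_string()).unwrap();
    blocked_tx.send("blocked".to_string()).unwrap();

    // Even though the idle task was sent first, the blocked task wins.
    assert_eq!(next_task(&blocked_rx, &idle_rx).as_deref(), Some("blocked"));
}

Note this only biases the choice when a blocked task is already queued at the moment of the check; once both channels are empty, whichever message arrives first is handled first.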
Problem
Unified scheduler is slow. ;)
Summary of Changes
Add a very primitive (under-100-LOC PR) yet effective adaptive behavioral heuristic (i.e. no constant knob) at the scheduler layer to complement SchedulingStateMachine's buffer bloat ignorance.

EDIT: phew, just wrote the commentary. The irony is that the accompanying unit test took longer to write than the actual code changes. Moreover, the commented monologue prose took even longer to write than the unit test.. lol
Perf numbers
A consistent perf improvement of at least ~5% is observed
(see solana-labs#35286 (comment) if you want to reproduce the results)
before (just after #129 is merged):
after:
(for the record) the merged commit
before(ad316fd):
after(fb465b3):
context: extracted from #593