Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a race in counting running tasks. #16607

Merged
merged 1 commit into from Oct 22, 2020

Conversation

gbtitus
Copy link
Member

@gbtitus gbtitus commented Oct 21, 2020

In the runtime's pre-user-code hook we force the running task counts to
the proper values after module initialization is done, because (for
reasons we need not go into here) that phase can leave the counts
incorrect but we need them to be correct when we embark on the user
program. We also set up memory tracking in that hook, because we don't
want to track memory during module init but we do want to track it (if
doing so is requested) in the user code. But this has a race with the
running task count setting, due to on-stmts from non-0 nodes back to
node 0 to access the associated config const values. Here, add a
barrier between the running task count reset and the memory tracking
setup to resolve this race.

While here, improve the commentary that describes why the pre-user-code
hook is the way it is.

In the runtime's pre-user-code hook we force the running task counts to
the proper values after module initialization is done, because (for
reasons we need not go into here) that phase can leave the counts
incorrect but we need them to be correct when we embark on the user
program.  We also set up memory tracking in that hook, because we don't
want to track memory during module init but we do want to track it (if
doing so is requested) in the user code.  But this has a race with the
running task count setting, due to on-stmts from non-0 nodes back to
node 0 to access the associated config const values.  Here, add a
barrier between the running task count reset and the memory tracking
setup to resolve this race.

While here, improve the commentary that describes why the pre-user-code
hook is the way it is.

Signed-off-by: Greg Titus <gbtitus@users.noreply.github.com>
@gbtitus gbtitus requested a review from ronawho October 21, 2020 23:31
@gbtitus
Copy link
Member Author

gbtitus commented Oct 21, 2020

I think this will fix https://github.com/Cray/chapel-private/issues/1395. I chased it down after seeing some comm=ofi testing I was doing fail on some of the task counting tests, notably parallel/taskCount/runningTaskCount/innerCoforall.

@ronawho
Copy link
Contributor

ronawho commented Oct 22, 2020

Ah, nice -- I had come to the same conclusion while looking into 1395 earlier today. Great minds think alike!

I was looking to see if we could move the running task resetting code into the modules and avoid the call from the runtime. I've been wanting to do this for a while (#7194) and seemed like it might be nicer than adding yet another barrier.

I was also wondering if we could move the memory tracking setup into the module code too. This could open up the barrier for use in other setup code (and avoid some of our non-scalable internal barriers for the locale setup) -- https://github.com/ronawho/chapel/tree/use-comm-barrier-locale-init

I think we could merge this to fix the race, but I would also not mind sitting on it. I was hoping to use 1395 as a forcing function to finally get around to those cleanups and I imagine if we fix the race that'll kill my motivation to get those in. Chat about this tomorrow?

@ronawho
Copy link
Contributor

ronawho commented Oct 22, 2020

Offline we chatted about going with this for now, and I'll open a followup PR to move this into the module code.

@gbtitus gbtitus merged commit db5091a into chapel-lang:master Oct 22, 2020
@gbtitus gbtitus deleted the task-count-race branch October 22, 2020 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants