
here.runningTasks() doesn't seem to respect serial statements #17562

Open

bradcray (Member) opened this issue Apr 12, 2021 · 4 comments
Summary of Problem

While working with @strikeraryu on a recursive parallel algorithm, we're finding that here.runningTasks() seems to be giving the wrong value when used within a serial block. Specifically, it seems to count the number of tasks created rather than the number that are concurrently executing. This is problematic because it means that the idiom we usually advertise for throttling parallelism:

serial here.runningTasks() >= here.numCores() { ... }

doesn't necessarily work as intended because the number of running tasks will always be over-counted.
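For concreteness, here is a rough sketch of the kind of recursive pattern where we use this idiom. The names (`Node`, `treeSum`) are illustrative only, not our actual code:

```chapel
class Node {
  var data: int;
  var left, right: owned Node?;
}

proc treeSum(node: borrowed Node?): int {
  if node == nil then return 0;
  var l, r: int;
  // Throttle: once as many tasks are running as there are cores,
  // the 'serial' clause makes the cobegin run its branches serially.
  serial here.runningTasks() >= here.numCores() {
    cobegin with (ref l, ref r) {
      l = treeSum(node!.left);
      r = treeSum(node!.right);
    }
  }
  return node!.data + l + r;
}
```

If runningTasks() over-counts, the `serial` condition trips far too early and the recursion serializes long before the cores are saturated.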

Steps to Reproduce

Source Code:

Here's a simple program that demonstrates this:

serial true {
  cobegin {
    writeln(here.runningTasks());
    writeln("1");
    writeln("2");
    writeln("3");
  }
}

On my 4-core system, here.runningTasks() always prints 4, and the other three writeln()s always appear in sequence. If I remove the cobegin, the number of running tasks is reported as 1; if I instead remove the serial, the output emerges in random orders, as expected. This suggests that the serial statement itself is working as intended, but that the accounting for the number of running tasks is incorrect.

Compile command:
chpl testit.chpl

Execution command:
./testit

Configuration Information

  • Output of chpl --version: chpl version 1.25.0 pre-release (d54a88a62f)
  • Output of $CHPL_HOME/util/printchplenv --anonymize:
CHPL_TARGET_PLATFORM: darwin
CHPL_TARGET_COMPILER: clang
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: system *
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_REGEXP: re2
CHPL_LLVM: none *
CHPL_AUX_FILESYS: none
@ronawho (Contributor) commented Apr 12, 2021

Note that this impacts begins and coforalls as well:

serial true {
  writeln("begin");
  sync {
    begin writeln(here.runningTasks());
    begin writeln(here.runningTasks());
  }

  writeln("cobegin");
  cobegin {
    writeln(here.runningTasks());
    writeln(here.runningTasks());
  }

  writeln("coforall");
  coforall 1..2 {
    writeln(here.runningTasks());
  }
}

@damianmoz

But not forall: my 6-core machine reports the count correctly within such a loop. It reports the count incorrectly (as 4) in the first case you noted, but it does get your last two examples correct.
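For reference, the forall case I tried follows the same pattern as the examples above (this is a paraphrase of my test, not the exact code):

```chapel
serial true {
  writeln("forall");
  forall i in 1..6 {
    writeln(here.runningTasks());
  }
}
```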

@mppf (Member) commented Apr 26, 2021

This might be the wrong approach, but here is a quick fix:

diff --git a/modules/internal/ChapelBase.chpl b/modules/internal/ChapelBase.chpl
index e2337b7496..643954f847 100644
--- a/modules/internal/ChapelBase.chpl
+++ b/modules/internal/ChapelBase.chpl
@@ -1196,7 +1196,8 @@ module ChapelBase {
         e.taskCnt += 1;
       }
     }
-    if countRunningTasks {
+    var isSerial: bool = __primitive("task_get_serial");
+    if countRunningTasks && !isSerial {
       here.runningTaskCntAdd(1);  // decrement is in _waitEndCount()
       chpl_comm_task_create();    // countRunningTasks is a proxy for "is local"
                                   // here.  Comm layers are responsible for the
@@ -1210,7 +1211,8 @@ module ChapelBase {
   proc _upEndCount(e: _EndCount, param countRunningTasks=true, numTasks) {
     e.i.add(numTasks:int, memoryOrder.release);
 
-    if countRunningTasks {
+    var isSerial: bool = __primitive("task_get_serial");
+    if countRunningTasks && !isSerial {
       if numTasks > 1 {
         here.runningTaskCntAdd(numTasks:int-1);  // decrement is in _waitEndCount()
       }
@@ -1265,7 +1267,8 @@ module ChapelBase {
     // Wait for all tasks to finish
     e.i.waitFor(0, memoryOrder.acquire);
 
-    if countRunningTasks {
+    var isSerial: bool = __primitive("task_get_serial");
+    if countRunningTasks && !isSerial {
       const taskDec = if isAtomic(e.taskCnt) then e.taskCnt.read() else e.taskCnt;
       // taskDec-1 to adjust for the task that was waiting for others to finish
       here.runningTaskCntSub(taskDec-1);  // increment is in _upEndCount()
@@ -1290,7 +1293,8 @@ module ChapelBase {
     // Wait for all tasks to finish
     e.i.waitFor(0, memoryOrder.acquire);
 
-    if countRunningTasks {
+    var isSerial: bool = __primitive("task_get_serial");
+    if countRunningTasks && !isSerial {
       if numTasks > 1 {
         here.runningTaskCntSub(numTasks:int-1);
       }

It's possible that checking the serial state reads task-local storage and adds some overhead to task launch; I don't know whether that is an issue in practice. I also recall something about tracking running tasks in the compiler, but I don't remember the details, so we'll want to check whether the compiler needs changes as well.
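One way to reduce that overhead might be to nest the primitive call under the existing param condition, so the task-local-storage read is folded away entirely when countRunningTasks is false. A sketch, using the same primitives as the diff above (untested):

```chapel
// Since countRunningTasks is a param, this outer branch folds away at
// compile time, so the task_get_serial read only happens in the case
// where the running-task count would actually be adjusted.
if countRunningTasks {
  const isSerial: bool = __primitive("task_get_serial");
  if !isSerial {
    here.runningTaskCntAdd(1);  // decrement is in _waitEndCount()
    chpl_comm_task_create();
  }
}
```

In the diff as written, `var isSerial` is initialized before the condition, so the primitive runs even when countRunningTasks is statically false.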

@ronawho (Contributor) commented Apr 27, 2021

This is along the lines of what I was thinking, but I haven't had time to think through any downsides to this approach.

The compiler code is only for on-statements and I don't think it is impacted by serial statements, but I'll have to check.
