Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Asymmetry with chpl_nodeID with forall vs. coforall #16716

Closed
bradcray opened this issue Nov 16, 2020 · 17 comments · Fixed by #16737
Closed

Asymmetry with chpl_nodeID with forall vs. coforall #16716

bradcray opened this issue Nov 16, 2020 · 17 comments · Fixed by #16737

Comments

@bradcray
Copy link
Member

A colleague pointed out that for the following code running on two nodes:

writeln("coforall: ");
coforall loc in Locales {
  on loc {
    writeln("chpl_nodeID: ", chpl_nodeID, ", here.id = ", here.id);
  }
}
writeln();

writeln("forall: ");
forall loc in Locales {
  on loc {
    writeln("chpl_nodeID: ", chpl_nodeID, ", here.id = ", here.id);
  }
}

the output is inconsistent between the two loop forms:

coforall: 
chpl_nodeID: 0, here.id = 0
chpl_nodeID: 1, here.id = 1

forall: 
chpl_nodeID: 0, here.id = 0
chpl_nodeID: 0, here.id = 1

My first reaction was "chpl_nodeID isn't really intended to be a user-facing feature, and is only used for bootstrapping, so maybe we don't really need to worry about this." But while I think the first part of that statement is true, we rely on chpl_nodeID a lot in library code, which makes it slightly concerning. And the fact that I can't explain what's happening is concerning.

Here's what I (think I) know:

  • though chpl_nodeID is a fairly special SPMD-style / per-node C variable, Chapel doesn't really know this. It's declared as an extern var of type int and from what I've seen, the compiler doesn't seem to special-case it.

  • I think the forall loop is arguably doing the correct thing in that chpl_nodeID is a global integer variable, and so is subject to having a const in shadow variable being inserted for it. Since that shadow variable is inserted on locale 0, it makes sense that chpl_nodeID would be 0 when printed from either locale.

  • Putting in an explicit ref intent for the forall loop seems to confirm this, resulting in the 0 / 1 values being printed.

    forall loc in Locales with (ref chpl_nodeID) {
    ...
  • So, I'm confused by why the coforall loop doesn't seem to insert the similar shadow variable and get the same output (and, if it did, would this break all of the library code that relies on reasoning about chpl_nodeID?)

  • Also weird: I'd expect that putting a with (const in chpl_nodeID) into the coforall loop would symmetrically result in the same behavior as the forall loop, yet it doesn't.

  • I was expecting that the coforall+on optimization might be playing a role here, yet inserting a writeln() before the on-clause within the coforall doesn't change the behavior.

All of this makes me suspicious that we have some sort of bug or inconsistency in our implementation, though I'm not sure what it is. It also makes me believe that we should remove chpl_nodeID since it doesn't behave like a normal Chapel variable, and rely on something like a primitive that returns the current node ID instead. This also makes me curious to understand better when/why chpl_nodeID is used in libraries and what it would take to rewrite those to only use user-facing features.

@bradcray
Copy link
Member Author

Tagging @vasslitvinov in hopes that he has some insights due to the apparent role of task-private shadow variables; @gbtitus and @ronawho because they were part of the conversation that turned this up and are similarly stymied (so far); @e-kayrakli because of the use in strings and bytes code (which I believe was inherited, but that most of the original devs are no longer here); and @mppf because of the cycles he's spent thinking about cross-locale move/copy initialization which I think relates to the use of chpl_nodeID in BigInteger and String/Bytes (?).

@mppf
Copy link
Member

mppf commented Nov 16, 2020

It also makes me believe that we should remove chpl_nodeID since it doesn't behave like a normal Chapel variable, and rely on something like a primitive that returns the current node ID instead.

This sounds like a great approach to me. AFAIK it would be sufficient for all uses of chpl_nodeID in Chapel module code. String/Bytes could have used here but there were problems with that due to module initialization/resolution order. But these problems won't be present if we add a new primitive.

@gbtitus
Copy link
Member

gbtitus commented Nov 16, 2020

As part of dealing with this we should also look to see whether there are other variables defined by the C runtime and used by module (or worse, user) code, that have node-specific values. Whatever turns out to be the source of the chpl_nodeID issue here might affect those also, and also might be solvable with whatever technique we use for chpl_nodeID. No such variables come to my mind right now, but I think I'd be a bit surprised if there weren't any at all.

@vasslitvinov
Copy link
Member

It also makes me believe that we should remove chpl_nodeID

Is this to say that the user should be using here.id instead of chpl_nodeID ?

Given that we do not advertise chpl_nodeID, I don't think that we even need to do anything about it. For example, what would be our response if a user got in trouble while using chpl_buildStandInRTT() ?

In any case I can go either way about removing chpl_nodeID. Or, I can look at fixing the behavior of the forall case, if desired.

@bradcray
Copy link
Member Author

Is this to say that the user should be using here.id instead of chpl_nodeID ?

chpl_nodeID isn't user-facing, so I don't think users should be using it. This issue came from a developer, not a user which is why I'm wrestling with it rather than brushing it aside (particularly since code we've written relies on chpl_nodeID.

Or, I can look at fixing the behavior of the forall case, if desired.

I'd argue that the forall case is working correctly. I'm more interested in understanding why the coforall isn't behaving similarly and what that implies (e.g., a bug? or a misunderstanding on my part?)

@gbtitus
Copy link
Member

gbtitus commented Nov 17, 2020

An alternative to a compiler primitive with all its associated machinery would be a well-defined functional interface in the runtime, for example chpl_comm_getNodeID(), to replace the current global variable. Implemented as a static inline function at the top surface of the runtime, it should have no greater access cost than a variable. Arguably it might be better to use such a function instead of the global variable even within the runtime itself.

@vasslitvinov vasslitvinov self-assigned this Nov 18, 2020
@vasslitvinov
Copy link
Member

Brad, your intuition is right -- this is a bug. isOuterVar() in createTaskFunctions.cpp misses the case where the global variable is defined in a different module than the one containing the coforall. Granted, this isOuterVar() was written at the time when we had only one module. :) In any case, it should be fixed. Here is the offender:

static bool isOuterVar(Symbol* sym, FnSymbol* fn) {  // 'fn' is the coforall task function
  Symbol* symParent = sym->defPoint->parentSymbol;
  Symbol* parent = fn->defPoint->parentSymbol;       // the function containing the coforall construct

  while (true) {
    if (!isFnSymbol(parent) && !isModuleSymbol(parent))
      return false;
    if (symParent == parent)
      return true;
    if (!parent->defPoint)
      // Only happens when parent==rootModule [...]
      return false;

    // continue to the enclosing scope
    parent = parent->defPoint->parentSymbol;
  }
}

Here is a user-level single-locale reproducer:

module Lib {
  var globalLib = 10;
  proc updateLib() { globalLib += 1; }
}

module Main {
  use Lib;
  var s$: sync int;
  var globalMain = 20;
  proc updateMain() { globalMain += 1; }

  proc main {
    coforall 1..2 {
      s$ = 1; // grab the lock
      writeln("globaLib   = ", globalLib);
      writeln("globalMain = ", globalMain);
      updateLib();
      updateMain();
      s$;
    }
  }
}

currently printing

globaLib   = 10
globalMain = 20
globaLib   = 11   // should be 10
globalMain = 20

Consider this in my court.

@bradcray
Copy link
Member Author

bradcray commented Nov 18, 2020

Thanks for looking into it and for finding the bug, Vass. Just to make sure I'm understanding, once this bug is fixed, we'd need to use one of the replacements for chpl_nodeID discussed above to feel confident that current code would continue working, is that correct? (e.g., the coforall loop would begin reporting 0 for chpl_nodeID?)

[edit: Assuming so, I was interested in taking a look at that piece of the puzzle (it feels like a "Brad-sized chunk" and more interesting than the doc edits I've been stuck in recently).]

Granted, this isOuterVar() was written at the time when we had only one module. :)

That seems highly unlikely...?

@vasslitvinov
Copy link
Member

the coforall loop would begin reporting 0 for chpl_nodeID

Right, this is what I expect after a fix.

Note that chpl_nodeID will be 0 only when referenced directly from the lexical scope of a coforall. Which, arguably, neither users nor developers should do, because of this. (Of course they don't know that they shouldn't.)

It so happens that if chpl_nodeID were declared const (and still had the correct value on each locale), the code would continue printing 1 on Locale 1 even with the fix. This was done intentionally so that a reference to a global constant points to the value broadcast at start of the program. Which avoids creating a shadow variable and, at the time of writing this code, somehow avoided extra comm for coforall tasks.

@bradcray
Copy link
Member Author

It so happens that if chpl_nodeID were declared const (and still had the correct value on each locale), the code would continue printing 1 on Locale 1 even with the fix.

Interesting, I'd wondered about that, thanks for mentioning it. And I'm guessing there's no reason it couldn't be, if we wanted to leave it as-is... though Greg's idea of using a C static function (maybe even a "paren-less external function?" that's really a C variable, nudge-nudge-wink-wink) sounded intriguing and far less entwined with other language concepts, which seems appealing.

@vasslitvinov
Copy link
Member

There are several things with chpl_nodeID - currently it is extern; if we declared it const, we'd need to avoid the initial broadcast for it; perhaps use pragma "locale private" for it.

So a related concern is what to do with "locale private" variables. For example, chpl_localeTree in modules/internal/LocaleTree.chpl is considered a non-outer variable, which is technically incorrect. However, because of that it does not get a shadow variable so its local version is accessed. Which is not even about saving comm - in this case it is about correctness.

BTW there is a lengthy comment discussion in the sister isOuterVar() in flattenFunctions.cpp about locale-private variables w.r.t. RVF.

We could decide to have no shadow variables for locale-private variables. However, this would mix unrelated concepts and likely cause confusion down the road.

Suggestions?

@bradcray
Copy link
Member Author

I think that if you could focus on fixing the bug, I'll focus on switching chpl_nodeID to something that'll work (most likely a function... I'm realizing that my idea of a paren-less external function is unlikely to work given that we translate paren-less function calls to paren-ful ones).

@bradcray
Copy link
Member Author

The switch for chpl_nodeID turned out to be straightforward once I realized how little I wanted to update existing code, thanks to Chapel's support for moving between variables and paren-less functions and renaming extern symbols: #16737

This pulls the rug over the bug Vass identified for this specific case, but we should definitely still fix the bug for fear of doing the wrong thing in other cases (and perhaps it could even result in performance improvements for codes that refer to global variables within coforalls?)

@vasslitvinov
Copy link
Member

How should we handle "locale private" variables, such as chpl_localeTree? Handling them "correctly" would result in accessing a value from a remote locale via a shadow variable instead of accessing the current-locale value.

@bradcray
Copy link
Member Author

Ah, sorry Vass... I'd missed that your question was more generally applicable in other cases beyond this one.

Looking over the distinct uses of locale private variables, I'm wondering whether there's a single solution that would work for each of them. For example, chpl_localeTree seems fairly different in its use from many of the other locale private variables that seem to be more about caching something locally (?). I'm also wondering what it would take to eliminate these uses of locale private by writing them in terms of different concepts within the language itself.

Maybe we need to fork this off into a separate issue and involve those who are familiar with each of the variables (looks like there's only 8 of them?)

bradcray added a commit that referenced this issue Nov 19, 2020
Convert `chpl_nodeID` from `extern var` to `extern proc`

[reviewed by @mppf]

Declaring `chpl_nodeID` as a module-scope extern variable pointed to
an issue in which we weren't introducing a shadow variable for it as
we do for other module-scope variables.  However, it also shows a
weakness of using a module-scope extern variable for an external value
that is different on each locale: it suggests to Chapel that there's a
single extern variable when in fact there's a per-locale variable with
a distinct value on each locale.  This suggests that an extern var
isn't the best way to represent such values in Chapel.

In this PR, I change chpl_nodeID to an extern (paren-less) proc so
that the Chapel code which refers to it can remain unchanged; and I
assign it a new name in the C code so that we can implement it as a
static inline function without modifying other C code that refers to
the C-level variable.  The result is a minimal change to our code that
avoids the mismatch between describing such a value as a global
variable in Chapel that should result in a nearly identical
execution-time implementation.

Resolves #16716.
@bradcray
Copy link
Member Author

Since the original issue here is now resolved by changing how chpl__nodeID is implemented, I forked the bug that Vass found (and that caused the original behavior) into a new issue: #16743. I did not fork off a new issue for the question about "locale private" variables, though I think that could still be worthwhile.

@ronawho
Copy link
Contributor

ronawho commented Nov 19, 2020

#14143 might cover the "locale private" vars

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants