Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for modestly redefining .indices on arrays #17883

Closed
bradcray opened this issue Jun 7, 2021 · 21 comments
Closed

Proposal for modestly redefining .indices on arrays #17883

bradcray opened this issue Jun 7, 2021 · 21 comments

Comments

@bradcray
Copy link
Member

bradcray commented Jun 7, 2021

As of PR #15039 or so, we added .indices queries to most indexable types as a means of writing code for them that refers to their indices abstractly rather than assuming the indices are, say, 1..size or 0..<size. At that time, .indices became a near-synonym* for .domain for arrays. Issue #15103 proposed that we retire .domain in favor of .indices on arrays, but was received fairly negatively.

[* = a near synyonym? how so? Well, it actually returned a copy of the original domain rather than an alias to the original domain. Note that I didn't realize this when originally filing this issue in June, and only became aware when finalizing #18274 ].

[For historical purposes, this issue captures a fork in the discussion of #15103 that started at this point ]

This issue proposes another way to remove the redundancy: What if, rather than returning the array's actual domain, this query returned (or perhaps yielded in some cases?) the array's index set locally. For example, for an n x n Block-distributed array named B, B.domain would return a (const) reference to B's actual Block-distributed domain as it does today, whereas B.indices would return {1..n, 1..n} (or {0..<n, 0..<n} or whatever indices B.domain had) as a local default rectangular domain.

This would have the benefits of:

  • providing a data-structure neutral way of getting a local copy of the indices of the data structure in question
  • removing the "two ways of doing the exact same thing" case that we have today in which .domain and .indices do the same thing
  • providing a user-facing way to ask for B's index set without getting a block-distributed thing back or querying on a per-dimension basis—the only two ways we can do so today I believe.

An obvious disadvantage would be:

  • .indices vs. .domain are similar, so learning to distinguish them is subtle and could have a significant performance impact when making the wrong choice (e.g., var A: [B.indices] real; vs. var A: [B.domain] real; or forall i in B.indices vs. forall i in B.domain).

One open question is whether .indices would have to return the indices in a closed form (say as a sparse domain for a sparse array or an associative domain for an associative array) or whether it would be reasonable for it to be implemented in a closed form in some cases (say, when the storage required is O(1), as for dense rectangular arrays) but as an iterator in others (say, when it's not, as for sparse / associative arrays). See #18353 for more on this topic.

@bradcray bradcray self-assigned this Jun 7, 2021
@bradcray bradcray changed the title Proposal for .indices on arrays Proposal for modestly redefining .indices on arrays Jun 7, 2021
@bradcray
Copy link
Member Author

@gbtitus , @redhatturtle, @ben-albrecht , @e-kayrakli , @ronawho: You were all negative to skeptical to neutral on the proposal in #15103. Would you let me know whether this one resonates more positively with you?

@gbtitus
Copy link
Member

gbtitus commented Jun 14, 2021

My objection in #15103 was entirely about getting rid of .domain, because I think it's by far the most natural name for the thing that returns an array's domain. So this is an improvement on #15103 in that regard. But as you say (or imply) there isn't anything about the name .indices that makes it memorable what its subtle differences are from .domain. Personally I think I might expect something called A.indices to return a tuple of the array's indices, rather than a domain. But I'm not coming up with anything I like better off the top of my head. I considered A.localDomain but that could be taken as implying just the indices that are caller-local, and A.unDmappedDomain is completely out of the question!

@e-kayrakli
Copy link
Contributor

I think I am on the fence for this proposal. Whereas, like Greg, I was/am against retiring .domain in favor of .indices

My main worry is around performance, which may not be the best criterion for deciding on interface matters. Brad already listed this as a disadvantage. But I'll add that I can add this bullet as another potential performance pitfall:

providing a data-structure neutral way of getting a local copy of the indices of the data structure in question

In general, I like to promote type-neutral ways of writing code, and this seems like a good example for this. But somehow it scares me that a user might write forall i in myData.indices expecting that they can use a list and a (distributed) array for myData.

Another potential scenario that scares me is similar: for a single-locale application, the user uses arr.indices all around and gets good performance. But when they want to use the same code for multi-locale, the performance tanks. So, by adding arr.indices, are we creating a way to write less portable array code?

All that being said, from the interface standpoint, I see the appeal of this proposal.

One alternative would be to remove arr.indices until we have a use case where that can be helpful in the way that's proposed in this issue. Seeing a use case would definitely help me nudge more in favor of this proposal, too.

@bradcray
Copy link
Member Author

To Greg's points:

But as you say (or imply) there isn't anything about the name .indices that makes it memorable what its subtle differences are from .domain.

I agree with this, but don't worry about it a lot. Lots of queries aren't necessarily self-defining without reading the docs, and I think we can make the docs clear in this respect.

Personally I think I might expect something called A.indices to return a tuple of the array's indices, rather than a domain.

The downside of that is that it wouldn't be possible to write a generic routine across collections like this:

proc printSerially(X) {
  for i in X.indices do
    write(X[i]);
}

and have it work across arrays, lists, strings, tuples, etc. because if X was an array and X.indices returned a tuple of ranges, for a 2D array (say), iterating over X.indices would yield twice, with i being a range, which would then cause X[i] to be interpreted as a 1D slice of a 2D array and a compilation error (because iterating over a tuple iterates over the elements of that tuple). Whereas if X.indices returned a 2D domain in that case (or was an iterator in the non-rectangular case), then this code would work for an array or a list or a string or a tuple or ...

@bradcray
Copy link
Member Author

To Engin's:

Another potential scenario that scares me is similar: for a single-locale application, the user uses arr.indices all around and gets good performance. But when they want to use the same code for multi-locale, the performance tanks. So, by adding arr.indices, are we creating a way to write less portable array code?

I feel like there are a lot of things in Chapel already that can be this sort of multi-locale performance trap, so don't find this particularly compelling. We've talked about adding a --performance-gotchas flag at times before, where a case that could fire very conservatively would be: "You're applying .indices to a distributed array; maybe you want to use .domain instead?"

Seeing a use case would definitely help me nudge more in favor of this proposal, too.

I find myself wanting to query the whole value of a Block-distributed array as a DR with some regularity... (admittedly, probably usually as a domain map author...).

@e-kayrakli
Copy link
Contributor

We've talked about adding a --performance-gotchas flag at times before, where a case that could fire very conservatively would be: "You're applying .indices to a distributed array; maybe you want to use .domain instead?"

That could help, for sure. Although, I don't know whether we can do anything smart about it, and would worry that we'd have to report anything distArr.indices as potential performance issue.

I find myself wanting to query the whole value of a Block-distributed array as a DR with some regularity... (admittedly, probably usually as a domain map author...).

I totally agree with this. As a dmap author, it is quite important. I am not so sure what an end user would want. While I am still OK with proceeding with this proposal; we can ask users, too.

@bradcray
Copy link
Member Author

Thinking about this overnight, I'm realizing that in order to change the behavior of .indices in a non-breaking way for users, we'd either need to deprecate it for a release, or else play similar tricks as .size on ranges (which were annoying to implement, and I think will be annoying to users).

So my current thought is to deprecate it for this release for simplicity (maybe referring to these issues in the deprecation warning?). If nothing else, this removes the current redundancy and may cause users who have started using myArr.indices to speak up (though I'm guessing that it's a new enough feature that there aren't many such users). That doesn't resolve this issue, but it gives us more time to wrestle with it. (I'm still in favor of it).

bradcray added a commit to bradcray/chapel that referenced this issue Jun 18, 2021
See chapel-lang#17883 (comment)
for rationale

---
Signed-off-by: Brad Chamberlain <bradcray@users.noreply.github.com>
@bradcray
Copy link
Member Author

So far, I've only run into one case where .indices has been attractive: The UnitTest framework uses it in some type-neutral contexts to be general across types. Once I get a clean test, we can consider whether it's preferable to have it specialize arrays vs. other types, or to not deprecate .indices, or to do something else.

@ben-albrecht
Copy link
Member

So far, I've only run into one case where .indices has been attractive: The UnitTest framework uses it in some type-neutral contexts to be general across types

I'll add that indices has been adopted a fair amount in the Cray HPO codebase for similar benefits. Deprecating indices to change the behavior would be annoying IMO, but looking at the 1.24-updated codebase, there's only ~20 uses of .indices, so it wouldn't be terrible in our case.

@bradcray
Copy link
Member Author

That's good to know, thanks Ben. Do you have thoughts about this proposal in general (independent of how we get there?). If .indices were to go directly from returning the array's domain to returning a DR domain describing the array's indices, would HPO need to be updated? (my guess is that it only uses DR domains, so would not be affected?)

@ben-albrecht
Copy link
Member

If .indices were to go directly from returning the array's domain to returning a DR domain describing the array's indices, would HPO need to be updated? (my guess is that it only uses DR domains, so would not be affected?)

Your guess is correct. Cray HPO would be unaffected.

Do you have thoughts about this proposal in general (independent of how we get there?)

Like others, my biggest concern with the previous proposal was with deprecating .domain. I am generally OK with this approach.

@bradcray
Copy link
Member Author

OK, so then popping back to where I was before my comment above: As a proponent of this approach, if I were the only Chapel designer-developer, I'd propose that we update the behavior similar to what we did for .size on ranges, domains, and arrays:

  • we have myArr.indices generate a warning when it's applied to a non-DR array to indicate that the behavior is going to change to return a non-distributed domain
  • we support a flag to opt into the new behavior now
  • we update the new behavior for compilations that throw that flag

I think the second best option (maybe the only one now?) would be to continue to make .indices and .domain synonyms (as it is on master today), which has the tradoffs:

Advantages:

  • no changes necessary

Disadvantages:

  • we typically don't like having two ways of saying the same thing
  • no direct / convenient way to get a distributed array's indices as a DR domain (though plenty of indirect / less convenient ways)

So then, of those who've weighed in on this, I think the questions are:

  • @gbtitus: Did I address your reservations / expectations sufficiently above?
  • @ben-albrecht: Would prefer the status quo over this issue's proposal?
  • @e-kayrakli: Do your hesitations cause you to prefer the status quo over this proposal?

Thanks!

@ben-albrecht
Copy link
Member

As a proponent of this approach, if I were the only Chapel designer-developer, I'd propose that we update the behavior similar to what we did for .size on ranges, domains, and arrays:

I am in favor of what has been outlined above. It doesn't seem too disruptive to early adopters of indices and offers sufficient benefits.

@e-kayrakli
Copy link
Contributor

@e-kayrakli: Do your hesitations cause you to prefer the status quo over this proposal?

Nope. I am OK with this proposal. And prefer it over the status quo.

@redhatturtle
Copy link
Contributor

My problems with #15103 were all about removing the .domain method because i think it's good language design. Also I don't think I've used .indices yet for any type and definitely didn't miss it while dealing with arrays. Therefore my initial stance on this proposal is neutral.

The benefits of this proposal all seem quite reasonable and worthy. I also agree that while this new definition is useful the differences from .domain are subtle and that might lead to less obvious code. On the other hand once a user understands the difference .indices might be really convenient, especially for querying the whole index space of an array with no regards to the distribution. The .domain.dim(x) is not a syntax I particularly enjoy writing.

The performance pitfall on the other hand worries me considerably less. While I can imagine people making this mistake I think it's reasonably easy to explain why it causes a performance problem (you 'lost' the array distribution so this loop running on a single locale) and having good examples on the documentation can alleviate the occurrence of these problems significantly.
Also since this would be a multi-locale issue only I would expect people that eventually face it to already have decent experience with Chapel and understating of domains and distributions.

We've talked about adding a --performance-gotchas flag at times before...

I very much like this idea independently of whatever turns out from the rest of this proposal.

@bradcray
Copy link
Member Author

While I am still OK with proceeding with this proposal; we can ask users, too.

I've posted asking for comments on this proposal here: https://chapel.discourse.group/t/design-indices-queries-on-arrays-local-representation-of-indices/6730

@bradcray
Copy link
Member Author

See PR #18274 for a draft of the proposal sketched out in #17883 (comment). In the end, I made the warning for all .indices queries since even if it's a DR array and the domain being returned is local, it'll still be a new domain rather than a const ref to the existing domain, which is a change in behavior that someone could observe (if, say, they created an array over the result and then resized the domain).

bradcray added a commit that referenced this issue Sep 8, 2021
Start process of distinguishing array.indices from array.domain

[reviewed by @e-kayrakli]

In issue #17883, I proposed redefining `array.indices` to return/yield the indices of an array locally rather than simply being an alias for `array.domain` as it is today.  This proposal was generally reviewed favorably and hasn't had any significant detractors.

This PR starts taking the step of redefining it by creating a warning for current users of `array.indices`, informing them of the change, and suggesting that they either update to `array.domain` (or a copy thereof) if they want to preserve the current behavior, or to throw the `-sarrayIndicesAlwaysLocal` config param flag at compile-time if they want to opt into the new behavior now.  This provides a gentle way to ease people into the new world without breaking their current code (in the event that it depends on getting a reference to the domain in question).

It's expected that this change won't actually impact many users because `array.indices` was introduced relatively recently whereas `array.domain` is longstanding.

The overall approach taken here was fairly simple:
* added a warning to the current array.indices routine
* added the config param and added a where clause to the current routine when it's false (the default)
* created new overloads of array.indices for dense rectangular vs. irregular cases that are used when it's set to true
* updated our internal and package module code, mason, and a few tests to use `array.domain` to preserve the current behavior; in one(?) case (test/types/collections/indices.compopts), threw the config param to opt into the new behavior

One thing done here that almost certainly deserves more discussion is that for the irregular case, I used an iterator to yield the indices locally since—in general—if we have a distributed sparse or associative array, there's no guarantee that its domain's indices could be stored in a single local sparse/associative domain.  So it seemed more appropriate to use an iterator for this case (where the user could then use that iterator to populate a local sparse/associative domain, or array, or whatever as they wish) rather than to presume a local storage format for the indices.  However, I've forked off issue #18353 to discuss this further (or even question the use of a procedure returning a domain in the regular case).

A few cases in this PR that perhaps deserve special mention:

* the UnitTest.chpl code was one of the few places that was relying on using `.indices` across a multitude of types, but since it's a package module rather than a test, we wouldn't want it to generate a warning; meanwhile, the code itself seemed a bit questionable in that it assumed that if two things were being compared, they'd necessarily have the same indices (which  might be true, but it wasn't obvious to me why it would be).  So instead of splitting this code to handle arrays and other types distinctly, I switched it to use a zippered loop instead (which could have performance implications, but since it's currently used a serial loop for the comparison anyway, I'm guessing performance isn't the biggest concern here).  The one other change I needed to be made is that the code to print out an array when there is an issue needed to be updated to use `.domain`, but could not apply `.domain` to other types, so I switched the string (!) argument that indicates what type is being compared to a param to make it safe.

* test/types/collections/arrIndicesAreLoc.chpl: We used to have an `arrIndicesAreDist.chpl` test which make sure that querying `.indices` on an array would return a distributed result.  Since this is no longer true, I renamed the test and expanded it to make sure that the indices received back are local to wherever we're running.

* test/types/collections/indices.good: Because I've changed `.indices` on sparse domains to yield indices, the output format changes

* test/studies/shootout/reverse-complement/bradc/revcomp-dowhile-bug.chpl: Here, I could've changed from `buf.indices` to `buf.domain` to avoid the warning, but the domain is declared just a few statements above the use-case, so it seemed simpler just to use that.
@lydia-duncan
Copy link
Member

@vasslitvinov - I see in your range spreadsheet and in the 2.0 scheduling issue that we think this is below the bar and so should just get marked unstable. Do you or @ShreyasKhandekar have a task for doing that already?

@bradcray
Copy link
Member Author

I actually think this could be considered stable and potentially even closed. I believe we did end up agreeing on and implementing the proposal in the OP for rectangular domains which led to #18274. There's a bit of a question in #18353 about behavior on sparse and associative, and I'm imagining I left this open due to that. But since they're unstable for 2.0 and that issue has its own issue now, I don't know that there's more to do here for 2.0.

@lydia-duncan
Copy link
Member

That'd be great! I just came across #17909, which looks related. If we remove the 2.0 label from this issue, would we close that one?

@vasslitvinov
Copy link
Member

I am game stabilizing the current behavior and closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants