Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically map paths in Args#map_each #21952

Closed
wants to merge 11 commits into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Apr 10, 2024

When path mapping is enabled, File objects accessed in a user-supplied callback passed to Args#map_each automatically have their paths mapped.

Automatic rewriting is preferable to e.g. passing a rewriter object into the callback: All paths emitted into command lines must be rewritten with path mapping as inputs and outputs are staged at the mapped locations, so the user would need to manually map all paths - there is no choice. As an added benefit, the automatic rewriting ensures that existing rules relying on map_each do not need to be modified to use path mapping.

This is a reland of 955b31e, which got rolled back due to a 7% CPU time increase on a benchmark caused by frequent comparisons of equal but not reference equal StarlarkSemantics used as keys in the StarlarkClassDescriptor cache in CallUtils. Instead of overriding equals and hashCode for PathMapper, this change subclasses StarlarkSemantics to provide a different, reference equal instance as the cache key. This is safe since the value associated with the path_mapper key does not affect the availability of any Starlark field or method, just the behavior of their implementations.

Work towards #6526

@fmeum fmeum requested review from a team, brandjon and tetromino as code owners April 10, 2024 08:08
@fmeum fmeum requested review from aranguyen and removed request for a team, tetromino and brandjon April 10, 2024 08:09
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Apr 10, 2024
@fmeum fmeum requested a review from a team as a code owner April 10, 2024 08:09
@fmeum fmeum added team-Configurability platforms, toolchains, cquery, select(), config transitions and removed team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Apr 10, 2024
@fmeum fmeum removed the request for review from a team April 10, 2024 08:09
*/
@CheckReturnValue
default StarlarkSemantics storeIn(StarlarkSemantics semantics) {
return new StarlarkSemantics(semantics.toBuilder().set(SEMANTICS_KEY, this).build()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to avoid creating new instance of StarlarkSemantics for NOOP case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that, but could you confirm that the benchmarks are good before we make that change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark is still running. I will let you know asap

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the result

metric        median              Δ                  1-pval         
   cpu: 16852.300s  ±505.3s  +442.1s, +2.7% 0.92 (weakly significant)
memory:     11110MB   ±1.6MB  -2.0MB, -0.0% 0.13 (not significant)   
system:  1078.910s   ±55.8s    -5.8s, -0.5% 0.13 (not significant)   
  wall:  1402.026s   ±53.6s    +3.5s, +0.3% 0.13 (not significant)  

Wondering if preventing new instances of StarlarkSemantics for NOOP case would help reduce the regression even further?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a flamegraph that can attribute the additional CPU time? We can try with the default semantics, but if that avoids the regression, we should also be able to do better for the path mapping case (we may have missed another cache).

@fmeum
Copy link
Collaborator Author

fmeum commented Apr 10, 2024

Unfortunately map_each support opens us up for some tricky action key staleness issues (see e.g. https://github.com/bazelbuild/bazel/pull/21954/files#diff-7c8f74b8e12ce2ab6535801dce30a69f90a28c990ceef6780fc4ae457b6b7b9bR3667).

Since Starlark map_each fingerprinting currently works by just evaluating the map_each and hashing the result, we need to ensure that the path mapper used for fingerprinting is the same one used during execution - but that would require knowing whether there are any collisions between stripped paths, which in turn requires flattening a depset.

I don't yet know what to do about this.

@fmeum
Copy link
Collaborator Author

fmeum commented Apr 15, 2024

@aranguyen I decided to first submit a fix for the action key staleness issue for existing functionality in #21999. When that has been merged and we are happy with the benchmarks here, I will add more logic and tests here to ensure that we don't run into the same kind of issue.

When a command line item is subject to special default stringification behavior (currently, `Label` and `File` instances are), it does not suffice to fingerprint the information that computes the custom stringification; it is also necessary to track whether each individual item is subject to this special formatting. Otherwise, as demonstrated by numerous new test cases, command line items have identical fingerprints to their naive stringification, but may result in distinct command lines.

This is fixed by tracking element types via additional UUID markers, either individually or for entire `NestedSet`s based on the `Depset` type.
@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Apr 30, 2024
@fmeum fmeum deleted the fix-semantics branch May 2, 2024 15:10
@fmeum
Copy link
Collaborator Author

fmeum commented May 2, 2024

@bazel-io fork 7.2.0

@fmeum fmeum restored the fix-semantics branch May 2, 2024 15:11
iancha1992 pushed a commit to iancha1992/bazel that referenced this pull request May 2, 2024
When path mapping is enabled, `File` objects accessed in a user-supplied callback passed to `Args#map_each` automatically have their paths mapped.

Automatic rewriting is preferable to e.g. passing a rewriter object into the callback: All paths emitted into command lines must be rewritten with path mapping as inputs and outputs are staged at the mapped locations, so the user would need to manually map all paths - there is no choice. As an added benefit, the automatic rewriting ensures that existing rules relying on `map_each` do not need to be modified to use path mapping.

This is a reland of 955b31e, which got rolled back due to a 7% CPU time increase on a benchmark caused by frequent comparisons of equal but not reference equal `StarlarkSemantics` used as keys in the `StarlarkClassDescriptor` cache in `CallUtils`. Instead of overriding `equals` and `hashCode` for `PathMapper`, this change subclasses `StarlarkSemantics` to provide a different, reference equal instance as the cache key. This is safe since the value associated with the `path_mapper` key does not affect the availability of any Starlark field or method, just the behavior of their implementations.

Work towards bazelbuild#6526

Closes bazelbuild#21952.

PiperOrigin-RevId: 629546010
Change-Id: Ib21fa2371a28a02f0c868523b410c5a40c2c6c82
@keertk
Copy link
Member

keertk commented May 9, 2024

Hi all, is this still needed for 7.2?
If so, could you please send a PR to cherry-pick? Looks like there are a bunch of merge conflicts that need to be resolved. We're aiming to have the first RC out on Monday 5/13.

fmeum added a commit to fmeum/bazel that referenced this pull request May 10, 2024
When path mapping is enabled, `File` objects accessed in a user-supplied callback passed to `Args#map_each` automatically have their paths mapped.

Automatic rewriting is preferable to e.g. passing a rewriter object into the callback: All paths emitted into command lines must be rewritten with path mapping as inputs and outputs are staged at the mapped locations, so the user would need to manually map all paths - there is no choice. As an added benefit, the automatic rewriting ensures that existing rules relying on `map_each` do not need to be modified to use path mapping.

This is a reland of 955b31e, which got rolled back due to a 7% CPU time increase on a benchmark caused by frequent comparisons of equal but not reference equal `StarlarkSemantics` used as keys in the `StarlarkClassDescriptor` cache in `CallUtils`. Instead of overriding `equals` and `hashCode` for `PathMapper`, this change subclasses `StarlarkSemantics` to provide a different, reference equal instance as the cache key. This is safe since the value associated with the `path_mapper` key does not affect the availability of any Starlark field or method, just the behavior of their implementations.

Work towards bazelbuild#6526

Closes bazelbuild#21952.

PiperOrigin-RevId: 629546010
Change-Id: Ib21fa2371a28a02f0c868523b410c5a40c2c6c82

Closes bazelbuild#22221
@fmeum
Copy link
Collaborator Author

fmeum commented May 10, 2024

@keertk Submitted in #22322

fmeum added a commit to fmeum/bazel that referenced this pull request May 10, 2024
When path mapping is enabled, `File` objects accessed in a user-supplied callback passed to `Args#map_each` automatically have their paths mapped.

Automatic rewriting is preferable to e.g. passing a rewriter object into the callback: All paths emitted into command lines must be rewritten with path mapping as inputs and outputs are staged at the mapped locations, so the user would need to manually map all paths - there is no choice. As an added benefit, the automatic rewriting ensures that existing rules relying on `map_each` do not need to be modified to use path mapping.

This is a reland of 955b31e, which got rolled back due to a 7% CPU time increase on a benchmark caused by frequent comparisons of equal but not reference equal `StarlarkSemantics` used as keys in the `StarlarkClassDescriptor` cache in `CallUtils`. Instead of overriding `equals` and `hashCode` for `PathMapper`, this change subclasses `StarlarkSemantics` to provide a different, reference equal instance as the cache key. This is safe since the value associated with the `path_mapper` key does not affect the availability of any Starlark field or method, just the behavior of their implementations.

Work towards bazelbuild#6526

Closes bazelbuild#21952.

PiperOrigin-RevId: 629546010
Change-Id: Ib21fa2371a28a02f0c868523b410c5a40c2c6c82

Closes bazelbuild#22221
github-merge-queue bot pushed a commit that referenced this pull request May 10, 2024
When path mapping is enabled, `File` objects accessed in a user-supplied
callback passed to `Args#map_each` automatically have their paths
mapped.

Automatic rewriting is preferable to e.g. passing a rewriter object into
the callback: All paths emitted into command lines must be rewritten
with path mapping as inputs and outputs are staged at the mapped
locations, so the user would need to manually map all paths - there is
no choice. As an added benefit, the automatic rewriting ensures that
existing rules relying on `map_each` do not need to be modified to use
path mapping.

This is a reland of 955b31e, which got
rolled back due to a 7% CPU time increase on a benchmark caused by
frequent comparisons of equal but not reference equal
`StarlarkSemantics` used as keys in the `StarlarkClassDescriptor` cache
in `CallUtils`. Instead of overriding `equals` and `hashCode` for
`PathMapper`, this change subclasses `StarlarkSemantics` to provide a
different, reference equal instance as the cache key. This is safe since
the value associated with the `path_mapper` key does not affect the
availability of any Starlark field or method, just the behavior of their
implementations.

Work towards #6526

Closes #21952.

PiperOrigin-RevId: 629546010
Change-Id: Ib21fa2371a28a02f0c868523b410c5a40c2c6c82

Closes #22221
Kila2 pushed a commit to Kila2/bazel that referenced this pull request May 13, 2024
When path mapping is enabled, `File` objects accessed in a user-supplied callback passed to `Args#map_each` automatically have their paths mapped.

Automatic rewriting is preferable to e.g. passing a rewriter object into the callback: All paths emitted into command lines must be rewritten with path mapping as inputs and outputs are staged at the mapped locations, so the user would need to manually map all paths - there is no choice. As an added benefit, the automatic rewriting ensures that existing rules relying on `map_each` do not need to be modified to use path mapping.

This is a reland of 955b31e, which got rolled back due to a 7% CPU time increase on a benchmark caused by frequent comparisons of equal but not reference equal `StarlarkSemantics` used as keys in the `StarlarkClassDescriptor` cache in `CallUtils`. Instead of overriding `equals` and `hashCode` for `PathMapper`, this change subclasses `StarlarkSemantics` to provide a different, reference equal instance as the cache key. This is safe since the value associated with the `path_mapper` key does not affect the availability of any Starlark field or method, just the behavior of their implementations.

Work towards bazelbuild#6526

Closes bazelbuild#21952.

PiperOrigin-RevId: 629546010
Change-Id: Ib21fa2371a28a02f0c868523b410c5a40c2c6c82
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Configurability platforms, toolchains, cquery, select(), config transitions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants