Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genquery + transitions: build error when query succeeds #14169

Closed
robfig opened this issue Oct 26, 2021 · 35 comments
Closed

genquery + transitions: build error when query succeeds #14169

robfig opened this issue Oct 26, 2021 · 35 comments
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug

Comments

@robfig
Copy link

robfig commented Oct 26, 2021

Description of the problem / feature request:

Using Bazel 4.2.0, a genquery expression fails when the corresponding bazel query succeeds, apparently due to not handling an important transition used by rules_go to implement the "nogo" static analysis tool.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

To reproduce:

git clone git@github.com:robfig/rules_go-issue2991.git
cd rules_go-issue2991
bazel build //:foo_go_proto # works
bazel query "kind('proto_library', deps(//:foo_go_proto))" # works
bazel build //:deps # fails

The error message:

ERROR: /private/var/tmp/_bazel_robfig/cf4a500e949de5c5d412d75252cf434f/external/io_bazel_rules_go/BUILD.bazel:73:16: in go_context_data rule @io_bazel_rules_go//:go_context_data: cycle in dependency graph:
    //:deps
    //:foo_go_proto
    @io_bazel_rules_go//proto:go_proto
    @org_golang_google_protobuf//types/known/structpb:structpb
    @org_golang_google_protobuf//reflect/protoreflect:protoreflect
    @org_golang_google_protobuf//encoding/protowire:protowire
    @org_golang_google_protobuf//internal/errors:errors
    @org_golang_google_protobuf//internal/detrand:detrand
.-> @io_bazel_rules_go//:go_context_data
|   @io_bazel_rules_nogo//:nogo
|   //:default_nogo
|   //:default_nogo_actual
|   @io_bazel_rules_go//go/tools/builders:nogo_srcs
|   @org_golang_x_tools//go/gcexportdata:gcexportdata
|   @org_golang_x_tools//go/internal/gcimporter:gcimporter
`-- @io_bazel_rules_go//:go_context_data
ERROR: Analysis of target '//:deps' failed; build aborted

This was discovered after rules_go moved to transitions for static analysis support in this commit:
bazelbuild/rules_go@63dfd99

I don't see any way to fix it within rules_go.

What operating system are you running Bazel on?

MacOS Catalina

What's the output of bazel info release?

release 4.2.0

Have you found anything relevant by searching the web?

I found no relevant results for the terms: genquery transitions

@oquenchil oquenchil added team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug untriaged labels Nov 2, 2021
@gregestren
Copy link
Contributor

gregestren commented Nov 3, 2021

Do you know what causes the cycle in the build?

I'll try to dig up a similar issue someone reported recently. Short story is that genquery cannot alway resolve cycles as query can. genquery is itself a target, so any deps in its scope attribute are transitively evaluated just like in a real build, which includes the same cycle detection as real builds.

query never does this.

@robfig
Copy link
Author

robfig commented Nov 3, 2021

Note that a "real build" does not suffer the same cyclic dependency:

$ bazel build "//:foo_go_proto" # works

@gregestren
Copy link
Contributor

You're right - apologies.

I poked around more and I'm stumped. I get the same error with:

genquery(
    name = "deps",
    expression = "not a valid expression",
    scope = ["//:foo_go_proto"],
)

This tells me the cycle is getting reported through routine analysis of the deps from scope (i.e. this has nothing to do with the query logic genquery runs - it never gets a chance to even try parsing the expression).

If I try building something else, like a cc_binary with a data dep on //:foo_go_proto, that succeeds.

Do you have any toy reproduction scenarios not specific to the Go rules?

@gregestren
Copy link
Contributor

I think this toy example replicates:

config_setting(
    name = "fastbuild",
    values = {"compilation_mode": "fastbuild"},
)

config_setting(
    name = "opt",
    values = {"compilation_mode": "opt"},
)

genrule(
    name = "g",
    srcs = [],
    outs = ["g.out"],
    cmd = "echo hi > $@",
    tools = select({
        ":fastbuild": [":g"],
        ":opt": [],
    }),
)

genquery(
    name = "gq",
    expression = "abc",
    scope = [":g"],
)

cc_binary(
    name = "cc",
    srcs = ["cc.cc"],
    data = [":g"],
)

It exploits the facts that genrule#tools switches to the host configuration, and the target configuration's default --compilation_mode is fastbuild while the host's is opt.

@gregestren
Copy link
Contributor

Okay, now I remember. This is a known discrepancy between genquery and query. There was a similar discussion within Google a while back. I'll paste my comments from there:

With query, TransitiveTargetKey indeed invokes the cycle pretty much the same way as in genquery. But all of ParallelEvaluator and SkyframeExecutor.SkyframeTransitivePackageLoader are called within SkyframeLabelVisitor, which has logic to ignore these errors. And this happens within BazelQueryenvironment.preloadTransitiveClosure, which as I understand is just a performance optimization - not on the main logic path that query uses to actually get the deps. So if if fails, query can say "so what?".

genquery, on the other hand, is called within Bazel's core analysis phase, i.e. when constructing the actual configured target graph. This means the TransitiveTargetKey failure fails the GenQuery rule itself, which means real nodes in the build's configured target graph have null values. Skyframe then automatically examines these null results as a post-process to find cycles. Unlike query this isn't just a nice optimization - these are core results the build relies on. I don't know an easy way to avoid this as long as genquery relies on TransitiveTargetKey.

So the short story is both query and genquery identify a cycle, but query ignores it while genquery doesn't. And this isn't easy to reconcile.

The most straightforward paths would be one of:

  • refactor your code to avoid this cycle
  • use an aspect instead of genquery

Happy to talk through more the different options.

@robfig
Copy link
Author

robfig commented Nov 7, 2021

Thanks for digging in. I'm not sure how to resolve though:

  • The cycle is fundamental here - we want to compile the static analysis binary (nogo) using a config setting that specifies no static analysis, via a transition coming from compiling the user's program with the static analysis tool built and activated. I believe there is no way to remove the cycle, without giving up and requiring a prebuilt nogo binary.

  • The genquery rule is used in Uber's repo, so I can't speak to what it's used for or what it might be replaced by (cc @linzhp). But from the point of view of rules_go, we would ideally like to ship rules that allow users to use all of the bazel builtin rules, including genquery. If that's not feasible, then I suppose we would probably document that limitation and any workarounds.

@linzhp
Copy link
Contributor

linzhp commented Nov 8, 2021

There is a test in Uber to verify that a service doesn't have any transitive dependency on some IDL files. The test uses a genquery as its data and iterate through all the service's dependencies.

I am not sure if there is any workaround. It's probably not a good idea to run either bazel query or bazel build for some aspect in test runtime.

@gregestren
Copy link
Contributor

Okay. To be clear, the problem is this quirk genquery has, not the cycle. I see no design issue with the cycle you've built into the Go rules.

@linzhp, your test is part of a big bazel test ... run? Is it possible to inject into some other part of your CI that doesn't run bazel test, but instead runs bazel build with the aspect approach? That would also technically be more accurate than genquery since it reflects the reality of selective dependencies from select(), etc.

@gregestren
Copy link
Contributor

@haxorz - who has expertise on the core Bazel logic genquery runs within - this is another case of genquery loading phase cycles triggering errors while bazel query ignores the cycles.

Was there any other best practice recommendation aside from eliminating the cycles or prefering aspects?

@haxorz
Copy link
Contributor

haxorz commented Nov 10, 2021

Was there any other best practice recommendation aside from eliminating the cycles or prefering aspects?

No. The last major time this came up internally at Google, "eliminate the cycle" was viable and was the approach taken.

@gregestren gregestren added P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) and removed untriaged labels Nov 16, 2021
@linzhp
Copy link
Contributor

linzhp commented Nov 22, 2021

We were able to write an aspect and a rule that depend on the aspect to collect transitive dependencies for the test to check. genquery is no longer blocking us. @robfig you can close this now.

@robfig
Copy link
Author

robfig commented Nov 22, 2021

Thanks all!

@sluongng
Copy link
Contributor

sluongng commented Mar 1, 2022

I would suggest to keep this issue open instead.

It helps highlighting the problem where genquery and transition does not play well with each others.

The 2 ways we should try to solve this:

  • Document that the official recommendation is to rewrite the use of genquery into aspect (if thats possible)
  • Fix so that genquery properly detect transition during cycle dependency detection and (potentially) attempt to traverse the cycle N times before giving up.

@gregestren
Copy link
Contributor

I support the first item (documentation) and suggest a dedicated issue for it. Since that's something we can more actionably address.

For the second, I hear your point. This gets into the philosophy of what it means for issues to be open / closed, prioritized at whatever level, etc. I don't see any scenario where fixing this principally will happen any time soon barring heroic work by an enthusiastic contributor. This is really about redesigning genquery as a whole. genquery is uniquely "awkward" in that it embeds a loading phrase graph traversal into the full post-analysis (i.e. with transitions and configs) graph. That creates multiple design quirks, of which this is one.

So there's real design work to try to figure out how or if genquery can be refactored to better fit into Bazel's core design patterns.

I'm not sure if your "traverse N times" idea is a less complicated adjustment to work around the problem? If so, could you sketch up a prototype to demonstrate the effect?

Anyway, I don't know what the right call is re: issue status. I hear your point. But I wouldn't want keeping the issue open to imply a stronger promise to address it than we're making. And we can still access closed issues, of course. How about if I run this past one of our Bazel PMs to get some more thought out perspective on that?

@sluongng
Copy link
Contributor

sluongng commented Mar 1, 2022

I support the first item (documentation) and suggest a dedicated issue for it. Since that's something we can more actionably address.
...
Anyway, I don't know what the right call is re: issue status. I hear your point. But I wouldn't want keeping the issue open to imply a stronger promise to address it than we're making. And we can still access closed issues, of course. How about if I run this past one of our Bazel PMs to get some more thought out perspective on that?

Thanks for the quick reply @gregestren

If re-opening the issue is not ideal, then we can create 2 new issues: 1 for documenting the suggestion and 1 for designing solutions and tradeoffs. We can leave the latter to be low priority until the problem becomes more painful and worth the ROI to solve.


Personally I prefer redesigning rules_go's staticanalysis framework to use something like Validation Actions instead. May be if we can refactor rules_go toward that direction, then we can eliminate the need for cycle dependencies in rules_go and downgrade the genquery issue even further.

@pqn
Copy link

pqn commented Aug 4, 2023

Does anyone have an example aspect they can point to for collecting transitive deps? Maybe @linzhp?

@linzhp
Copy link
Contributor

linzhp commented Aug 4, 2023

Something like this:

DEPS_ATTRS = [
    "deps",
    "embed",
    "proto",
]

DepsetInfo = provider(
    fields = {
        "depset": "transitive dependencies",
    },
)

def _transitive_deps_aspect_impl(target, ctx):
    deps = []
    for attr in DEPS_ATTRS:
        if hasattr(ctx.rule.attr, attr):
            value = getattr(ctx.rule.attr, attr)
            if type(value) == "list":
                deps.extend([dep[DepsetInfo].depset for dep in value])
            else:
                deps.append(value[DepsetInfo].depset)

    return [DepsetInfo(depset = depset(direct = [(ctx.label, ctx.rule.kind)], transitive = deps))]

transitive_deps_aspect = aspect(
    implementation = _transitive_deps_aspect_impl,
    provides = [DepsetInfo],
    attr_aspects = DEPS_ATTRS,
)

@sluongng
Copy link
Contributor

sluongng commented Aug 4, 2023

https://github.com/bazelbuild/rules_license/ is a pretty good example of how to use aspect to collect dependencies information ;)

@pqn
Copy link

pqn commented Aug 4, 2023

Thank you both! This is useful.

@fmeum
Copy link
Collaborator

fmeum commented Apr 8, 2024

This has come up again in bazelbuild/rules_go#3883, so I will reopen.

@gregestren @haxorz Do you see a way to make genquery behave in some reasonable way here? Happy to work on a fix, but I wouldn't want to break any google3 requirements.

@gregestren
Copy link
Contributor

The Google discussion on genquery + cycles happened at b/141613846 and got committed effort in 2022 / 2023.

Does fa05a10 help? This was part of that.

@fmeum
Copy link
Collaborator

fmeum commented Apr 10, 2024

Does fa05a10 help? This was part of that.

It fixes the problem and returns reasonable results on the reproducer. What's the status of that flag? Are there any known blockers to flipping it?

@gregestren
Copy link
Contributor

gregestren commented Apr 10, 2024

It's been enabled in Google for a year.

According to b/271868450 it increases genquery's memory cost. This triggered a regression alert in Google but that was for a particularly heavy use of genquery that I imagine is unique to Google.

One of the bug comments: "strongly suggesting that this regression is limited to builds with many forbidden dependency tests". The author of the fix, @anakanemison, left Google a while ago so we can't expect them to follow up. But they did an awesome job addressing this (thanks again, if you see this!). @justinhorvitz might have thoughts as a stakeholder in memory optimization.

The Google flip added this to the common blazerc to help:

build --skyframe_high_water_mark_minor_gc_drops_per_invocation=10
build --skyframe_high_water_mark_full_gc_drops_per_invocation=10

Given all this, I suggest we flip this for Bazel.

@justinhorvitz
Copy link
Contributor

I'm familiar with Mark's work to address the genquery cycle issue. Yes, we should make --experimental_skip_ttvs_for_genquery the default - it was developed specifically for this problem.

Setting the max skyframe drops per invocation was in response to --experimental_skip_ttvs_for_genquery causing some builds with extreme numbers of genquery rules to time out after hours instead of OOMing "gracefully". Internally, we like fast OOMs much more than builds that take hours because we have automation to split up and retry OOMs. It probably makes sense to set the drop limits the same for bazel, but if we change the defaults, we would also affect <redacted name of internal dependency service that doesn't read the common blazercs>. So we need to be careful about how we are setting these.

A note on the aspect approach to collect transitive deps - we had an effort to investigate the feasibility of that (b/146629125) but it stalled due to some fundamental differences between aspect propagation and query.

@fmeum
Copy link
Collaborator

fmeum commented Apr 11, 2024

Even without better automation for retrying OOMs, I think that the average Bazel user would be better with an OOMing build than one that drags on for hours due to intermediate results getting dropped repeatedly.

I would probably flip the Skyframe flags first and independently of the new genquery implementation. @justinhorvitz Given the need to deal with , can I just submit a PR to change the defaults to 10 or do you need to take care of this in some other way?

@justinhorvitz
Copy link
Contributor

@haxorz we need to consider depserver before changing the default values of the flags --skyframe_high_water_mark_minor_gc_drops_per_invocation and --skyframe_high_water_mark_full_gc_drops_per_invocation. Do you think we should pass the current defaults (unlimited) to depserver before changing the default values?

@haxorz
Copy link
Contributor

haxorz commented Apr 12, 2024

Thanks for thinking of this, Justin. Answer is "sorta". Chat me internally.

@justinhorvitz
Copy link
Contributor

Update: we've begun the process of passing the current default values as flags where needed and need to wait for that to propagate to all environments. I will update once it's safe to proceed with changing the defaults.

@justinhorvitz
Copy link
Contributor

@fmeum at last the changes have propagated universally so it is now safe to proceed with changing the default flag values.

@fmeum
Copy link
Collaborator

fmeum commented May 31, 2024

Thanks, I sent #22597.

@eric-higgins-ai
Copy link

It seems like --experimental_skip_ttvs_for_genquery doesn't work when the genquery is built under the tool of a genrule. Here's a minimal repro:

WORKSPACE

workspace(
    name = "repro",
)

load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# Load up our golang toolchain
http_archive(
    name = "io_bazel_rules_go",
    sha256 = "b2038e2de2cace18f032249cb4bb0048abf583a36369fa98f687af1b3f880b26",
    urls = [
        "https://github.com/bazelbuild/rules_go/releases/download/v0.48.1/rules_go-v0.48.1.zip",
    ],
)

load("@io_bazel_rules_go//go:deps.bzl", "go_register_toolchains", "go_rules_dependencies")

go_rules_dependencies()

go_register_toolchains(
    version = "1.22.2",
)

go_register_toolchains(nogo = "@//:nogo")

BUILD

load("@io_bazel_rules_go//go:def.bzl", "go_binary")
load("@io_bazel_rules_go//go:def.bzl", "nogo")

go_binary(
    name = "binary",
    srcs = ["src.go"]
)

genquery(
    name = "query",
    expression = ":binary",
    scope = [":binary"],
)

go_binary(
    name = "tool",
    data = [":query"],
    srcs = ["src.go"]
)

genrule(
    name = "genrule",
    cmd = "echo 1 > $(OUTS)",
    outs = ["genrule.txt"],
    tools = [":tool"],
)

nogo(
    name = "nogo",
    visibility = ["//visibility:public"],

bazel build --experimental_skip_ttvs_for_genquery //:tool passes, but bazel build --experimental_skip_ttvs_for_genquery //:genrule fails with

 rule @@io_bazel_rules_go//:go_context_data: cycle in dependency graph:
    //:genrule (e49f463c1e255f268defb20a97c4dc9f3eda12d37d969dea0e80f6195b9f66c9)
    //:genrule (bb86b10670a93e53819986452f7aef0cfb2ec0904f63d1c41fb7ffbdfb553fbb)
    //:tool (46922a77fd26eedc8d54139b48893f6e1691166b5bcea58b3d50051cb7e5ec14)
    //:query (46922a77fd26eedc8d54139b48893f6e1691166b5bcea58b3d50051cb7e5ec14)
    //:binary
.-> @@io_bazel_rules_go//:go_context_data
|   @@io_bazel_rules_nogo//:nogo
|   //:nogo
|   //:nogo_actual
|   @@io_bazel_rules_go//go/tools/builders:nogo_srcs
|   @@org_golang_x_tools//internal/facts:facts
`-- @@io_bazel_rules_go//:go_context_data

@fmeum
Copy link
Collaborator

fmeum commented Jun 25, 2024

@erichiggins0 Could you check whether this PR fixed the issue? #22892

@eric-higgins-ai
Copy link

eric-higgins-ai commented Jun 25, 2024

hmm @fmeum I'm happy to check but tbh I'm not sure how to do it 😅 Could you help me out a bit? Alternatively, the files I shared above should reproduce the issue - the only thing I didn't include is src.go, which is an empty go file that just contains a main function

@fmeum
Copy link
Collaborator

fmeum commented Jul 18, 2024

@erichiggins0 Sorry for the delay, I was OOO. I verified that the issue isn't present in last_green, but only in Bazel 7. I sent a dedicated fix in #23032.

github-merge-queue bot pushed a commit that referenced this issue Jul 18, 2024
)

This is not a cherry-pick, but a dedicated fix for Bazel 7.x. Bazel 8 is
not affected as the switch to the Starlarkified exec configuration has
already resolved this problem.

Fixes #14169
@iancha1992
Copy link
Member

A fix for this issue has been included in Bazel 7.3.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.3.0rc1. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug
Projects
None yet