The Builtin Actor Debugging Story #592

Open
vyzo opened this issue May 26, 2022 · 8 comments

@vyzo
Contributor

vyzo commented May 26, 2022

The Debug Problem

It has become quite apparent, especially as we are getting closer to the nv16 upgrade, that we need a mechanism to debug builtin actors.

This goes beyond enabling the actor_debugging context option (which enables the debug syscall), as debug code has to be compiled in and will diverge from the mainline code, at the very least in gas (e.g. debug prints, assertions, possibly different code paths for testing and so on).

At the root of the problem we have two issues:

  • the debug code CID is different from the mainnet code CID.
  • the gas/execution of the debug code will produce different results, diverging from mainnet.

Debugging Strategy

Here we propose a strategy for resolving the issue in a way that supports our debugging needs, without forking from mainnet.

The strategy is twofold:

  • at the fvm side, support code redirects for dual execution.
  • at the builtin-actors side, provide a special mainnet debug bundle, with debugging features enabled.

Dual Code Execution

More specifically, we can provide a debug manifest at fvm instantiation.
Upon message execution time, we concurrently execute the mainnet code (for the result and gas) and the debug code, redirecting mainnet code to the debug code in the debug execution, which has actor_debugging enabled. The result of message execution will be the result of the mainnet code execution. The debug code will be executed for side effects, which can be observed through stdout/stderr or by collecting and side-returning the debug result/trace.
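A minimal sketch of the dual execution described above, under stated assumptions: `Cid` is a plain string stand-in, `Receipt` and `dual_execute` are hypothetical names, and the executor is modelled as a closure rather than the real FVM API. It also runs the two executions one after the other instead of concurrently, to keep the example short.

```rust
use std::collections::HashMap;

/// Stand-in for a real CID type; a plain string keeps the sketch dependency-free.
type Cid = String;

/// Hypothetical shape of an execution outcome.
#[derive(Debug, Clone, PartialEq)]
pub struct Receipt {
    pub exit_code: u32,
    pub gas_used: u64,
    pub logs: Vec<String>,
}

/// Runs the mainnet code for the authoritative result, then (if a redirect
/// exists) runs the debug code with actor debugging enabled, for side effects only.
///
/// `exec(code, actor_debugging)` is an assumed executor callback.
pub fn dual_execute<F>(
    exec: F,
    code: &Cid,
    remap: &HashMap<Cid, Cid>,
) -> (Receipt, Option<Receipt>)
where
    F: Fn(&Cid, bool) -> Receipt,
{
    // The result and gas of the message always come from the mainnet code.
    let mainnet = exec(code, false);
    // The debug run only happens for code that has a redirect.
    let debug = remap.get(code).map(|debug_code| exec(debug_code, true));
    (mainnet, debug)
}
```

The caller would return the first element as the message result and only print or side-return the second.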

Debug Bundles

At the very least we need to enable tracing/debug for logs, through a build feature.
The same feature test can also be used to enable debug logic (assertions, experimental features and so on).
The bundle can either be built by the developer for testing, or be part of the release workflow matrix which would build a builtin-actors-mainnet-debug.car bundle for consumption by lotus.
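As a rough illustration of the build feature idea, here is a sketch that gates both logging and extra assertions on a Cargo feature. The feature name `debug-bundle`, the `actor_debug!` macro and `apply_reward` are all made up for this example, and `eprintln!` stands in for whatever the debug syscall would be used for in a real actor.

```rust
/// Emits a log line only when the (hypothetical) `debug-bundle` feature is on.
/// `eprintln!` is a stand-in for a real debug-log call inside an actor.
macro_rules! actor_debug {
    ($($arg:tt)*) => {
        if cfg!(feature = "debug-bundle") {
            eprintln!($($arg)*);
        }
    };
}

/// Toy actor logic: the mainline behaviour is identical in both builds,
/// only the logging and the extra invariant check differ.
pub fn apply_reward(balance: u64, reward: u64) -> u64 {
    actor_debug!("apply_reward: balance={} reward={}", balance, reward);
    let updated = balance.saturating_add(reward);
    // Debug-only assertion, compiled in with the same feature test.
    if cfg!(feature = "debug-bundle") {
        assert!(updated >= balance, "balance must never decrease");
    }
    updated
}
```

Building with the feature enabled would produce different Wasm, and so a different code CID, which is exactly why the redirect is needed.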

Lotus Integration

This is the easiest part, as we could simply load the debug bundle in the blockstore and pass the manifest CID to the fvm when enabling debug. This is probably best done by using an environment variable.

@raulk
Member

raulk commented May 26, 2022

This issue summarises the problem and the core of the solution well, but there are still many open questions. Let's flesh those out to build a mature solution.

Lotus Integration

This is the easiest part, as we could simply load the debug bundle in the blockstore [...]

Let's think through this in more detail. Can you propose some details on what the DX will look like from the Lotus perspective? i.e. can we think through how core devs would go about providing a remapped bundle? Config, command, env var? I think we need commands, see below.

pass the manifest CID to the fvm when enabling debug.

This won't be enough. The FVM doesn't use the manifest to route calls to actors, it just uses it to resolve certain syscalls. This will likely need to be a remap_code: Map<Cid, Cid>


A few more things:

  1. I should be able to replace actors at runtime, I don't want to have to restart Lotus every time as this slows down my debugging experience. This means that we need to provide JSON-RPCs. It's acceptable to lose the state on a restart (i.e. no need to persist what is remapped).
  2. This pathway should work for M2 actors as well, i.e. for arbitrary CodeCIDs with arbitrary bundles. What do the bundles look like in that case?
  3. This pathway should support the debugging workflows that tools will eventually require (e.g. the FilHack tool by Bloxico, cc @TheDivic). Debugging workflows are progressive. This means that it should be possible to replace the remap directives themselves (i.e. A points to B, and later A points to C), as well as attach remap rules dynamically (i.e. I've remapped A, but A calls to B, and something weird happens there, so let me replace B now).
  4. Can you think through everywhere in the SDK we should add debug tracepoints that are activated with the debug feature? i.e. when receiving the invocation, dumping params, dumping return values, dumping ipld_reads, dumping state, etc.

@raulk
Member

raulk commented May 26, 2022

(BTW -- I realise that this issue is in ref-fvm and most of my considerations have to do with client integration, but my point here is that we need to take the full picture into account to avoid building local solutions)

@arajasek
Contributor

This is really good to read, thanks @vyzo and @raulk. Some thoughts:

  • We will definitely want debug bundles for testnets as well -- I don't think we should publish them as part of the release but support for this dual-mode on, say, calibrationnet will be pretty important.

  • I should be able to replace actors at runtime, I don't want to have to restart Lotus every time as this slows down my debugging experience. This means that we need to provide JSON-RPCs. It's acceptable to lose the state on a restart (i.e. no need to persist what is remapped).

    I agree with the need here, but I'm not sure why we need to provide RPCs for this. I would propose using a local bundle option (eg. specifying a local path for where to find the bundles), and performing a load-check every time you create an FVM (when debug mode is enabled) -- in Lotus this would mean calling LoadBundle each time. I might be missing something here.

  • This will likely need to be a remap_code: Map<Cid, Cid>

    Just passing a second bundle could do the trick, if we have the remap_code logic live entirely in the FVM. The advantage is that we now aren't passing some map from the client to the FVM, we can fully reuse the existing manifest structure. The disadvantage is that it won't work for M2 actors.

So, just to be a little more concrete for what this would look like in the FVM and in Lotus:

  • FVM: Have an optional debug_manifest CID
  • FVM: If debug_manifest is set, build a mapping from real-manifest CID to debug-manifest CID, use this mapping for the debug execution (idk how easy that is)
  • FVM: If debug_manifest is set, fill in tracing / log info from the debug execution (described by Vyzo above)
  • Lotus: Have a VM_DEBUG envvar
  • Lotus: Support having [v8-debug] in addition to [v8] in the bundles.toml (or put it under [v8]).
  • Lotus: When creating an FVM, if VM_DEBUG is set, fetch bundles (so as to always pull in changes).
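To make the "build a mapping from real-manifest CID to debug-manifest CID" step concrete, here is a small sketch. It assumes a manifest can be viewed as a map from actor name to code CID; the `Manifest` alias, the string `Cid` stand-in and `build_remap` are hypothetical and not the actual manifest types.

```rust
use std::collections::{BTreeMap, HashMap};

/// Stand-in for a real CID type.
type Cid = String;

/// Assumed simplified view of a manifest: actor name -> code CID.
type Manifest = BTreeMap<String, Cid>;

/// Pairs up actors by name and maps each mainnet code CID to its debug code CID.
/// Actors missing from the debug manifest, or unchanged in it, get no redirect.
pub fn build_remap(mainnet: &Manifest, debug: &Manifest) -> HashMap<Cid, Cid> {
    mainnet
        .iter()
        .filter_map(|(name, code)| {
            debug
                .get(name)
                .filter(|debug_code| *debug_code != code)
                .map(|debug_code| (code.clone(), debug_code.clone()))
        })
        .collect()
}
```

Because the result is just a Cid->Cid map, the same structure could later carry redirects for non-builtin actors too.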

This is already a decent bit of work (I suspect complications will arise / it'll take a couple iterations to get right) for something that is fairly urgently needed for the M1 milestone -- it's not in our feature freeze, but we would all be comfortable going into v16 with this working, and it helps the testing work we have planned over the next 2-3 weeks. I would urge against doing anything more involved (unless strictly necessary) right now in the interest of protecting our timelines. I suspect we will have lots of scope for future improvement that we should capture, but not prioritize.

@anorth
Member

anorth commented May 26, 2022

This is a good conversation start, but I would highlight that it's only one part of the full story of debugging [builtin] actors – tracing/logging/printlining. It's good, we should do it, and well. But it's not enough.

In addition to this, we need a good story for executing actor code in an IDE, stepping execution, inspecting memory etc. I think we could get a very good 80/20 for this in pure Rust, skipping the WASM compilation and all the complications of inspecting and sourcemapping the WASM code and environment. Obviously it would miss some things like gas and issues that result from the compilation process, but for what I would expect to be the majority of debugging efforts around bugs in the actual actor code, this would be a far more powerful tool.

This dev/debug environment should be independent of any node implementation, with everything behind the FVM API under direct control of the developer/debugger. It's closely related to what I think we should have for an actor integration testing harness: a pure-Rust VM that integrates the lowest FVM syscall level.

@ZenGround0

Good enough for now

The basics: to debug I go into builtin actors, make a code change, generate a debug bundle with make bundle-XXX debug. Then I add a line to bundles.toml debug_path = {XXX = <path>} (a slight variation on @arajasek's [v8-debug] above) and rebuild. I then get dual execution, different behavior and debug logs / traces during message exec.

Under the hood this works as @vyzo and @arajasek describe above.

One thing missing from the discussion here is integration that tracks debug gas. One major use of debug specs-actor code in the past has been testing out and measuring non-state-breaking but performance-affecting changes (which can be security critical, i.e. network v4) against live network messages. Up until now we've collected timing information in these instances, but the dual execution model makes this more challenging: the old lotus commands will time both runs, and we won't have a comparison benchmark to determine the time/gas diff without a lot of operational setup or annoying context switching. IMO some way to measure debug gas and debug execution time (to debug situations where time/gas mismatches are showing up) is high priority.
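One possible shape for that measurement, as a sketch only: time each run separately and report the gas and time difference side by side. `RunStats`, `Comparison`, `timed` and `compare` are hypothetical names, and the sketch assumes the debug run still meters gas, which is not a given.

```rust
use std::time::{Duration, Instant};

/// Gas and wall-clock time for a single execution (hypothetical shape).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RunStats {
    pub gas_used: u64,
    pub elapsed: Duration,
}

/// Debug-minus-mainnet differences; negative means the debug code was cheaper.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Comparison {
    pub gas_delta: i128,
    pub time_delta_nanos: i128,
}

/// Times one execution; `run` is an assumed callback returning the gas it used.
pub fn timed<F: FnOnce() -> u64>(run: F) -> RunStats {
    let start = Instant::now();
    let gas_used = run();
    RunStats { gas_used, elapsed: start.elapsed() }
}

/// Compares the debug run against the mainnet run of the same message.
pub fn compare(mainnet: RunStats, debug: RunStats) -> Comparison {
    Comparison {
        gas_delta: debug.gas_used as i128 - mainnet.gas_used as i128,
        time_delta_nanos: debug.elapsed.as_nanos() as i128
            - mainnet.elapsed.as_nanos() as i128,
    }
}
```

Keeping the two timings separate avoids the problem of an outer command timing both runs as one.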

The perfect world

As above you build actors changes but you can load them into lotus at runtime with ./lotus fvm debug-code.

Mapping code cids to debug code cids is identically handled in lotus and fvm for builtin and user actors. I am guessing that the future of the manifest is that it will grow to include user code ids (though maybe system actors are partitioned separately in chain state) and the fvm will do the exact same code cache thing for user actor code.

This pathway should work for M2 actors as well, i.e. for arbitrary CodeCIDs with arbitrary bundles. What do the bundles look like in that case?

This will likely need to be a remap_code: Map<Cid, Cid>

Strong agree that debug bundles should be a map of actor code-id to actor debug-code-id/debug code. When the fvm loads actor code id for the first time it checks for existence in debug set, if found marks this code as debug which then forces dual execution. For M1 this would mean you could theoretically only debug compile a subset of system actors instead of all 11. But probably we wouldn't add this to builtin-actors debug-build for simplicity. Post M2 you could have a single debug bundle with some system and some user actor debug code.

As a side note in the perfect world we have a tool that adds / removes compiled wasm code into debug bundles.

I think the idea that EVM contracts can leverage full WASM debugging assumes that there are debugging directives in EVM that cross compilers can pick up on and put into WASM. Seems reasonable but no idea if this is true.

Debugging workflows are progressive. This means that it should be possible to replace the remap directives themselves (i.e. A points to B, and later A points to C), as well as attach remap rules dynamically (i.e. I've remapped A, but A calls to B, and something weird happens there, so let me replace B now).

I think sensible overwrite semantics for ./lotus fvm debug-code + runtime load of debug code should get us this. Any code cids being replaced in a debug bundle specified with debug-code will supersede previous code cids. Any code cids in previous debug bundles but not in new ones keep the same mapping. Let me know if I'm missing some subtlety.
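Those overwrite semantics could look roughly like the following sketch. `apply_debug_bundle` is a hypothetical helper, `Cid` is a string stand-in, and a debug bundle is reduced to a list of (code CID, debug code CID) pairs.

```rust
use std::collections::HashMap;

/// Stand-in for a real CID type.
type Cid = String;

/// Merges a newly loaded debug bundle into the active redirects.
/// Entries in the new bundle win; entries it does not mention are kept.
/// Returns the code CIDs whose previous redirect was superseded.
pub fn apply_debug_bundle(
    active: &mut HashMap<Cid, Cid>,
    bundle: &[(Cid, Cid)],
) -> Vec<Cid> {
    let mut superseded = Vec::new();
    for (code, debug_code) in bundle {
        // `insert` returns the old target if this code CID was already remapped.
        if let Some(old) = active.insert(code.clone(), debug_code.clone()) {
            if &old != debug_code {
                superseded.push(code.clone());
            }
        }
    }
    superseded
}
```

Loading a second bundle that only mentions A changes A's target and leaves every other redirect alone, which covers both the "A points to B, and later A points to C" case and attaching a rule for B afterwards.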

More thoughts

No debug bundle releases

A debug bundle released and shipped with lotus doesn’t seem worth the effort. If I’m debugging actors I will probably want to write some code changes specific to my use to build into the bundle. I guess that even those just curious about reading more logs and using the debug defaults without changing code will be comfortable building a bundle from source and linking it to lotus with bundles.toml.

Lotus: When creating an FVM, if VM_DEBUG is set, fetch bundles (so as to always pull in changes).

I don't think that this is a problem that needs to be solved since the bundle should only be changing on dev branches / next network version where pulling changes is less critical and more under debugger's control. If I'm missing a reason that this is a legitimate problem then releasing debug bundles would make more sense.

Runtime loading

I should be able to replace actors at runtime, I don't want to have to restart Lotus every time as this slows down my debugging experience.

Strong agree. Note this is a strict improvement of the current workflow: modify specs-actors => go mod replace in lotus => rebuild => restart node. So while it's important in the long run it's no problem to live without it for a while.

I agree with the need here, but I'm not sure why we need to provide RPCs for this. I would propose using a local bundle option (eg. specifying a local path for where to find the bundles), and performing a load-check every time you create an FVM (when debug mode is enabled)

If the lotus fvm object is created for every block (like the current lotus VM) then this is reasonable. If it's created by DI when making a new node, this means you have to bring your node up and down, which is better than rebuilding but not ideal.

Questions

Can you think through everywhere in the SDK we should add debug tracepoints that are activated with the debug feature? i.e. when receiving the invocation, dumping params, dumping return values, dumping ipld_reads, dumping state, etc.

@raulk this is confusing me because I don't know what side of the fvm you are on. Is this about debuggable wasm actor code or are you talking about making the native ref-fvm debuggable? Is there some interaction between the two such that the native ref-fvm debug endpoints can be triggered by debug compiled wasm?

I think we could get a very good 80/20 for this in pure Rust, skipping the WASM compilation and all the complications of inspecting and sourcemapping the WASM code and environment.

@anorth doesn't this restrict debugging only to wasm code that was compiled from rust? I see the benefits in debugging builtin actors for sure but it seems like this forces our work to not apply to the many user contracts expected to be developed from solidity/EVM. And making this work well in general is high leverage.

@vyzo
Contributor Author

vyzo commented May 27, 2022

Yes, debugging in a debugger would be really nice to have, but it caters to different use cases and it is rather complex to do; it could also be outsourced with a grant.

With regards to dual execution, let me summarize the result of the sync discussion yesterday:

  • We have two forms of dual execution: debug execution and alternate (replayed) execution.
    • With debug execution, we execute a remapped version of the code, with debugging enabled and gas accounting turned off (practically just give it max gas). Debug execution is executed for side effects, and can produce a different state root which could be collected, diffed, etc. This is the common case we want to support for our debugging efforts, and will drastically improve our debugging experience. The dual execution doesn't even have to live in the FVM, we can do two concurrent executions from outside (e.g. lotus) and the FVM only needs to support remapping.
    • With alternate (replayed) execution, we can support use cases where we want to observe the chain and execution and collect data and metrics. In this case, the FVM itself performs the first execution, collects all syscalls (modulo debug calls) and then performs the alternate execution replaying the results of syscalls. The state trees are not allowed to diverge in this case.
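The alternate (replayed) execution can be sketched as a recorder around the live syscalls plus a replayer that feeds the recorded results back. Everything here is illustrative: `SyscallRecord`, `Recorder`, `Replayer` and the syscall names are assumptions, syscalls are reduced to a name and a byte result, and "debug calls" are recognised by a simple name prefix.

```rust
use std::collections::VecDeque;

/// One recorded syscall: its name and the bytes it returned.
#[derive(Debug, Clone, PartialEq)]
pub struct SyscallRecord {
    pub name: String,
    pub result: Vec<u8>,
}

/// Assumed convention: debug syscalls share a common name prefix.
fn is_debug(name: &str) -> bool {
    name.starts_with("debug")
}

/// First execution: forwards to the live syscall handler and records results,
/// skipping debug calls.
pub struct Recorder<F> {
    live: F,
    pub log: Vec<SyscallRecord>,
}

impl<F: FnMut(&str) -> Vec<u8>> Recorder<F> {
    pub fn new(live: F) -> Self {
        Recorder { live, log: Vec::new() }
    }

    pub fn call(&mut self, name: &str) -> Vec<u8> {
        let result = (self.live)(name);
        if !is_debug(name) {
            self.log.push(SyscallRecord { name: name.to_string(), result: result.clone() });
        }
        result
    }
}

/// Alternate execution: answers syscalls from the recorded log, in order.
pub struct Replayer {
    log: VecDeque<SyscallRecord>,
}

impl Replayer {
    pub fn new(log: Vec<SyscallRecord>) -> Self {
        Replayer { log: log.into() }
    }

    pub fn call(&mut self, name: &str) -> Result<Vec<u8>, String> {
        if is_debug(name) {
            // Debug calls are not part of the log and have no observable result.
            return Ok(Vec::new());
        }
        match self.log.pop_front() {
            Some(rec) if rec.name == name => Ok(rec.result),
            Some(rec) => Err(format!("diverged: expected {}, got {}", rec.name, name)),
            None => Err(format!("diverged: unexpected extra call {}", name)),
        }
    }
}
```

Returning an error on the first mismatching call is one way to enforce that the alternate execution is not allowed to diverge.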

Initially for M1 we want to support debug execution; this will unblock our debugging efforts, allowing us to quickly test changes and see debug output while the system is running on mainnet.

The support will require the following (minimal) changes in FVM:

  • Support code remapping, using a Cid->Cid mapping
  • Provide a mode where the machine is instantiated with debugging on, gas accounting off, and remapping.
  • Optionally we want to collect the output (state tree and debug logs) besides printing in stdout/stderr and return it to the user.
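Put together, the three requirements above might translate into a machine mode along these lines. This is a sketch with made-up names (`MachineMode`, `debug`, `resolve_code`), not the FVM's actual configuration types; "gas accounting off" is approximated by the maximum gas limit, as suggested earlier in this comment.

```rust
use std::collections::HashMap;

/// Stand-in for a real CID type.
type Cid = String;

/// Hypothetical settings for a machine instantiated for debug execution.
#[derive(Debug, Clone, PartialEq)]
pub struct MachineMode {
    /// Enables the debug syscall for actors.
    pub actor_debugging: bool,
    /// Gas accounting "off" is approximated by handing out the maximum gas.
    pub gas_limit: u64,
    /// Code CID -> debug code CID redirects.
    pub remap_code: HashMap<Cid, Cid>,
    /// Whether to side-return the state tree and debug logs to the caller.
    pub collect_output: bool,
}

impl MachineMode {
    /// Debugging on, gas effectively unlimited, remapping installed.
    pub fn debug(remap_code: HashMap<Cid, Cid>, collect_output: bool) -> Self {
        MachineMode {
            actor_debugging: true,
            gas_limit: u64::MAX,
            remap_code,
            collect_output,
        }
    }

    /// Returns the code CID that should actually be loaded for `code`.
    pub fn resolve_code<'a>(&'a self, code: &'a Cid) -> &'a Cid {
        self.remap_code.get(code).unwrap_or(code)
    }
}
```

With something like this in place, the dual execution itself can stay outside the FVM: the client runs the message once on a normal machine and once on a machine built with the debug mode.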

@anorth
Member

anorth commented May 29, 2022

@anorth doesn't this restrict debugging only to wasm code that was compiled from rust? I see the benefits in debugging builtin actors for sure but it seems like this forces our work to not apply to the many user contracts expected to be developed from solidity/EVM. And making this work well in general is high leverage.

It's a good point, I was only addressing the issue as titled.

However, I think much of the point still stands: for development-time debugging, involving Lotus or any other full node is a whole lotta complication that's not needed. So for debugging any WASM actor, we want a WASM execution environment under full and direct developer control. And again, this is a thing we'd want for FVM integration tests, which should not depend on a node.

@vyzo
Contributor Author

vyzo commented May 30, 2022

@anorth i am totally in agreement, but this is orthogonal. We still need to be able to debug actors as running on mainnet, it is a different part of the journey.
