
feat: add cycle detection debug trace for LookupResources and LookupSubjects #3036

Closed
Jdepp007004 wants to merge 8 commits into authzed:main from Jdepp007004:feat/lookup-explain-cycle-debug-2056

@Jdepp007004
Contributor

Fixes #2056

What this does

Adds opt-in debug trace output to LookupResources and LookupSubjects that tracks
node visits during graph traversal and flags any node visited more than once as cyclic.
Mirrors the philosophy of CheckDebugTrace used by CheckPermission.

Changes

  • dispatch.proto — new LookupDebugTrace message, enable_debug_trace on all 3 Lookup requests, debug_trace on all 3 Lookup responses
  • internal/graph/debug.go — shared traversalTracker, nodeKey, trackVisit, buildLookupDebugTrace (zero-cost when disabled)
  • lookupresources2.go, lookupresources3.go, lookupsubjects.go — trackVisit wired at each recursion site, EnableDebugTrace propagated to nested dispatch calls
  • internal/services/v1/permissions.go — debug trace enabled via same requestmeta.RequestDebugInformation gRPC header as Check
  • Unit tests: internal/graph/cycle_detection_test.go
  • Integration tests: internal/dispatch/graph/cycle_detection_test.go

How to test

Enable tracing by setting the x-spicedb-debug: 1 gRPC metadata header on any
LookupResources or LookupSubjects request. Cyclic nodes will appear in the
LookupDebugTrace with is_cyclic = true and traversal_count > 1.

Notes

  • Zero cost when enable_debug_trace = false — no change to hot path behaviour
  • No changes to the public v1 API proto shape
  • buf generate confirmed working on Windows (buf 1.67.0)

…ubjects

Fixes authzed#2056

- Add LookupDebugTrace proto message to all 3 Lookup dispatch responses
- Add enable_debug_trace bool to all 3 Lookup dispatch requests
- Add internal/graph/debug.go with zero-cost traversalTracker
- Wire trackVisit in lookupresources2.go, lookupresources3.go, lookupsubjects.go
- Wire debug trace enable via requestmeta header in permissions.go
- Add unit tests in internal/graph/cycle_detection_test.go
- Add integration tests in internal/dispatch/graph/cycle_detection_test.go
@Jdepp007004 requested a review from a team as a code owner April 11, 2026 13:50
@github-actions Bot added the area/api v1, area/tooling, and area/dispatch labels Apr 11, 2026
@github-actions

github-actions Bot commented Apr 11, 2026

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@Jdepp007004
Contributor Author

I have read the CLA Document and I hereby sign the CLA

authzedbot added a commit to authzed/cla that referenced this pull request Apr 11, 2026
@tstirrat15
Contributor

How would these actually be consumed if you're not changing the public proto interface?

The dispatch paths are for internal communication between SpiceDB nodes - a client never invokes them directly and wouldn't receive their output unless it's propagated through the external interface.

@Jdepp007004
Contributor Author

@tstirrat15 fair point, I missed this. The trace is populated in the dispatch layer but never reaches the client: it is dropped at the permissions.go stream handler and never sent back.
The fix is to read the traversal state after each DispatchLookup* call and write it as a gRPC response trailer, using the same responsemeta.SetResponseTrailerMetadata path already used for x-spicedb-dispatched-operations-count. That keeps it off the public proto surface entirely.
Two things I need to add:
  • a Snapshot() on traversalTracker that builds the LookupDebugTrace proto
  • wiring in permissions.go to attach it as a trailer post-dispatch
Will push those before the next review round.

@Jdepp007004
Contributor Author

@tstirrat15 - Just pushed the fixes for the trailer surfacing we talked about.

As discussed:

  • Built the Snapshot() logic to assemble the trace internally, keeping it off the public proto.
  • Wired up permissions.go to attach it to the x-spicedb-lookup-debug-trace trailer post-dispatch.

While in there, I also cleaned up a few edge cases:

  • Wrapped the trailer writing in a defer block so the trace still surfaces even if the lookup terminates early (for example, on a max depth error).
  • Replaced the string parsing in the tracker map with a proper Go struct key so nodes are not silently dropped if an object ID happens to contain a colon.
  • Added a full gRPC stream test to verify the trailer wiring.

Take a look when you have a sec and let me know if there's anything else you need changed.
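To illustrate the struct-key change mentioned above, here is a small sketch contrasting a colon-delimited string key with a comparable struct key. The visitKey type and parseStringKey helper are hypothetical, and the ID containing a colon is purely illustrative; whether such an ID can occur depends on the object ID validation in force.

```go
package main

import (
	"fmt"
	"strings"
)

// visitKey is a comparable struct, so it can be used directly as a map key.
// The field names are assumptions for this sketch.
type visitKey struct {
	ObjectType string
	ObjectID   string
}

// parseStringKey shows the fragile approach: splitting "type:id" on every
// colon rejects any key whose object ID itself contains a colon.
func parseStringKey(s string) (visitKey, bool) {
	parts := strings.Split(s, ":")
	if len(parts) != 2 {
		return visitKey{}, false
	}
	return visitKey{ObjectType: parts[0], ObjectID: parts[1]}, true
}

func main() {
	id := "tenant:42" // illustrative ID that contains the delimiter

	_, ok := parseStringKey("document:" + id)
	fmt.Println("string key parsed:", ok)

	// With a struct key there is nothing to parse, so no delimiter can collide.
	visits := map[visitKey]int{}
	visits[visitKey{ObjectType: "document", ObjectID: id}]++
	fmt.Println("struct key visits:", visits[visitKey{ObjectType: "document", ObjectID: id}])
}
```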

@tstirrat15
Contributor

Ah I'm sorry - this isn't going to work either. We originally put CheckDebugTrace data in gRPC trailers and then ran into issues with the amount of data that can fit into a single gRPC trailer, which is why we moved them down into the public API in the first place. You'd need to do something similar, which will involve an update to the authzed/api repo to add an appropriate message and then bring it in via the authzed-go package.

I think we also recently had a discussion about changing how debug information is propagated, which might mean that we'd want a different shape here. Let me see if I can dig that up.

- remove all trailer-based debug tracing (responsemeta, SerializeLookupDebugTrace)
- fix streaming logic (buffer lastResp, no double send)
- attach DebugInformationV2 only on final streamed response
- update tests to validate DebugInformationV2 instead of trailers
- clean up unused imports and legacy code

Note: build errors expected until authzed/api and authzed-go updates land
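
The "buffer lastResp" item above can be sketched as a one-behind buffer: each result is held back until the next one arrives, so only the final response carries the debug payload and nothing is sent twice. The response type, its Debug field and the send callback are stand-ins invented for this sketch, not the real v1 API types.

```go
package main

import "fmt"

// response stands in for a streamed lookup response; Debug is a hypothetical
// field where debug information would be attached.
type response struct {
	ResourceID string
	Debug      string
}

// streamWithFinalDebug forwards results through send, holding each one back
// until the next arrives so that only the last response carries debug info.
func streamWithFinalDebug(results []string, debug string, send func(response) error) error {
	var last *response
	for _, id := range results {
		if last != nil {
			if err := send(*last); err != nil {
				return err
			}
		}
		last = &response{ResourceID: id}
	}
	if last == nil {
		return nil // nothing was streamed, so there is no response to annotate
	}
	last.Debug = debug
	return send(*last)
}

func main() {
	var sent []response
	err := streamWithFinalDebug([]string{"doc1", "doc2", "doc3"}, "trace", func(r response) error {
		sent = append(sent, r)
		return nil
	})
	fmt.Println("error:", err)
	for _, r := range sent {
		fmt.Printf("%s debug=%q\n", r.ResourceID, r.Debug)
	}
}
```

One consequence of this shape is that an empty result stream has no response to carry the debug information, so that case would need separate handling.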
@Jdepp007004 force-pushed the feat/lookup-explain-cycle-debug-2056 branch from 1eaa4fa to 50c858b April 15, 2026 15:26
@Jdepp007004
Contributor Author

Updated PR:

  • removed debug_diff artifacts from commit
  • cleaned up implementation
  • fully migrated lookup debug tracing to DebugInformationV2
  • removed all trailer-based logic
  • fixed streaming behavior

Build errors are expected until authzed/api changes propagate to authzed-go.

- add root node to DebugTree for clearer traversal structure
- ensure stable node identifiers for debug annotations
- minor test validation improvement
- no behavioral changes

- replace traversal map with stack-based trace
- capture per-edge traversal frames (resource, relation, permission)
- emit trace on MaxRecursionDepthExceeded when debug is enabled
- ensure concurrency safety via CloneTraversalStack
- remove batch-based sampling and fix incorrect traversal representation
- update tests for stack-based validation

Note: current implementation prioritizes trace correctness over batching by
splitting dispatches per ID. This may introduce additional overhead and can
be optimized in a follow-up by preserving batching while maintaining per-edge trace accuracy.
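
A rough sketch of why cloning the traversal stack matters for concurrency: when a slice has spare capacity, two branches appending to it would write into the same backing array. The frame fields follow the commit message (resource, relation, permission); the cloneTraversalStack body shown here is an assumption about what CloneTraversalStack might do, not the PR's code.

```go
package main

import (
	"fmt"
	"sync"
)

// frame is one traversed edge. The fields follow the commit message; the
// exact types are assumptions.
type frame struct {
	Resource   string
	Relation   string
	Permission string
}

// cloneTraversalStack returns an independent copy so that a branch can
// append frames without writing into a sibling's backing array.
func cloneTraversalStack(stack []frame) []frame {
	out := make([]frame, len(stack), len(stack)+1)
	copy(out, stack)
	return out
}

func main() {
	// Spare capacity is what makes a naive append share memory across branches.
	base := make([]frame, 1, 8)
	base[0] = frame{Resource: "document:1", Relation: "parent", Permission: "view"}

	branches := []string{"folder:a", "folder:b"}
	results := make([][]frame, len(branches))
	var wg sync.WaitGroup
	for i, res := range branches {
		wg.Add(1)
		go func(i int, res string) {
			defer wg.Done()
			// Each goroutine extends its own clone, never the shared base.
			next := frame{Resource: res, Relation: "parent", Permission: "view"}
			results[i] = append(cloneTraversalStack(base), next)
		}(i, res)
	}
	wg.Wait()
	fmt.Println(results[0][1].Resource, results[1][1].Resource)
}
```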
- restore batched dispatch behavior (remove per-ID dispatch)
- decouple traversal trace from execution
- maintain append-only traversal stack to preserve trace at error boundary
- attach trace via gRPC error details on MaxRecursionDepthExceeded
- fix stack being cleared before snapshot
- ensure trace is non-empty and visible via status.FromError(err).Details()
- update tests to validate error detail trace extraction

preserves execution performance while enabling debug trace visibility

- remove placeholder identifiers (*batch*, id1,...)
- ensure frames represent real traversal edges
- add depth field for deterministic ordering
- improve trace readability and structure
- strengthen tests for ordering and validity
- verify error details extraction

no changes to execution behavior
tstirrat15 added a commit that referenced this pull request Apr 29, 2026
… set (#3070)

## Description
Opening as an alternative/continuation to #3036. This takes a slightly
different approach, which is to wait until a max depth error is
encountered and then build the trace as the dispatch stack is unwound.
It tracks across dispatch boundaries as well.

## Notes
This code is a no-op if the request flag isn't present or if a max
recursion depth error isn't encountered.

## Changes
Will annotate.

## Testing
Review. See that tests pass. See authzed/zed#682 and its testing section
for an example of code that drives this.

---------

Co-authored-by: Jdepp007004 <johnnydepp050403@gmail.com>
Co-authored-by: Maria Ines Parnisari <maria.ines.parnisari@authzed.com>
@tstirrat15
Contributor

Closing in favor of #3070

@tstirrat15 closed this Apr 29, 2026
@github-actions Bot locked and limited conversation to collaborators Apr 29, 2026
@Jdepp007004 deleted the feat/lookup-explain-cycle-debug-2056 branch May 4, 2026 13:37

Labels

  • area/api v1: Affects the v1 API
  • area/dispatch: Affects dispatching of requests
  • area/tooling: Affects the dev or user toolchain (e.g. tests, ci, build tools)

Development

Successfully merging this pull request may close these issues.

Tooling to help debug cyclic data issues in LookupResources and LookupSubjects

2 participants