Skip to content

ref(spans): introduce spans buffer store abstraction#116382

Open
lvthanh03 wants to merge 7 commits into
masterfrom
tony/spans-store-abstraction
Open

ref(spans): introduce spans buffer store abstraction#116382
lvthanh03 wants to merge 7 commits into
masterfrom
tony/spans-store-abstraction

Conversation

@lvthanh03
Copy link
Copy Markdown
Member

@lvthanh03 lvthanh03 commented May 28, 2026

Refs STREAM-1002

SpansBuffer had orchestration, Redis command details, Lua result mapping, key construction, and observability all mixed together. This PR keeps the high-level buffer flow readable while moving low-level Redis mechanics behind SpansBufferStore.

This PR Introduces SpansBufferStore as the redis store class for the SpansBuffer.

This moves Redis-specific mechanics out of SpansBuffer and into the store, including:

  • Redis key construction
  • payload writes
  • add-buffer.lua script loading and EVALSHA execution
  • Redis result mapping back into InsertedSubsegment
  • loading flush candidates from queue shards
  • acquiring flush locks and mapping lock results back to FlushCandidate
  • loading payload keys, payload bytes, and segment ingest metadata (span count, byte count)
  • reading the current queue deadline for expiration/loss metrics
flowchart TB
    Relay["Relay span payloads"] --> Process["process_spans"]
    SpanFlusher["SpanFlusher"] --> Flush["flush_segments"]
    SpanFlusher --> Done["done_flush_segments"]

    subgraph Buffer["SpansBuffer"]
        Process --> BuildSubsegments["build Subsegment objects"]
        BuildSubsegments --> ProcessStore
        ProcessStore --> ProcessObs["process metrics and logs"]

        Flush --> FlushStore
        FlushStore --> RecordLoss["record loss metrics"]
        RecordLoss --> BuildFlushed["build FlushedSegment objects"]
        BuildFlushed --> FlushObs["flush metrics and logs"]

        Done --> CleanupStore
        CleanupStore --> DoneObs["done_flush_segments metrics"]

        subgraph Store["self.store: SpansBufferStore"]
            ProcessStore["store_payloads, insert_subsegments, update_queue"]
            FlushStore["load_flush_candidates, acquire_flush_locks, load_segments, queue deadline"]
            CleanupStore["cleanup_flushed_segments"]
        end
    end

    ProcessStore -. "Redis commands and Lua result mapping" .-> Redis[(Redis)]
    FlushStore -. "Redis commands and result mapping" .-> Redis
    CleanupStore -. "Redis cleanup" .-> Redis

    BuildFlushed -->|"FlushedSegment objects"| SpanFlusher
Loading

lvthanh03 added 2 commits May 27, 2026 17:37
Introduce LoadedSegment to keep a flush candidate, loaded payloads, payload
keys, and ingest metadata together through the flush pipeline. This replaces
parallel maps keyed by segment key and makes the next Redis store abstraction
smaller and safer.
@lvthanh03 lvthanh03 requested review from a team as code owners May 28, 2026 14:34
@github-actions github-actions Bot added the Scope: Backend Automatically applied to PRs that change backend components label May 28, 2026
@untitaker
Copy link
Copy Markdown
Member

I'm very suspicious about this layer of abstraction. I feel that SpanBuffer itself becomes a very shallow and useless abstraction this way. A lot of the "orchestration" is deep inside add-buffer.lua already, and in order to replicate this logic without redis, we would have to reimplement a lot of it.

Redis isn't slow, so I don't see a need to mock it out. It's about as fast as in-memory operations.

@evanh
Copy link
Copy Markdown
Member

evanh commented May 28, 2026

Redis isn't slow, so I don't see a need to mock it out.

@untitaker I agree with this, but I do feel like this makes the code easier to follow.

@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 28, 2026

STREAM-1002

Base automatically changed from tony/loaded-segments-datamodel to master May 28, 2026 16:55
@lvthanh03
Copy link
Copy Markdown
Member Author

lvthanh03 commented May 28, 2026

I'm very suspicious about this layer of abstraction. I feel that SpanBuffer itself becomes a very shallow and useless abstraction this way.

SpanBuffer is still doing all the orchestration and other smaller operations that does not interact with redis (i.e. group spans by parent).

A lot of the "orchestration" is deep inside add-buffer.lua already, and in order to replicate this logic without redis, we would have to reimplement a lot of it.

I agree, maybe we shouldn't word this as adding an "abstraction" over redis but we're really just adding a class that owns any operation that talks to redis. I would say this addition is mainly for readability.

Copy link
Copy Markdown
Member

@untitaker untitaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't mean to block this. refactoring for readability is always good. i'm just not sure about using this as a testing strategy (i.e. writing most of our tests without redis)

fpacifici

This comment was marked as outdated.

@fpacifici fpacifici dismissed their stale review May 28, 2026 21:59

missed a point

@lvthanh03
Copy link
Copy Markdown
Member Author

lvthanh03 commented May 28, 2026

i'm just not sure about using this as a testing strategy (i.e. writing most of our tests without redis)

That makes sense now, I misunderstood the initial comment. I thought we were talking about how separating spans buffer redis operations out of the main orchestration class would impact speed.

Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 27826fc. Configure here.

Comment thread src/sentry/spans/buffer.py Outdated
Copy link
Copy Markdown
Contributor

@fpacifici fpacifici left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly high leve suggestion for follow up. But there is one blocker: why are you not using the real redis in the test_buffer_store module?
Mocking redis is just removing test coverage here.

Comment thread tests/sentry/spans/test_buffer.py
Comment thread src/sentry/spans/buffer.py Outdated
Comment thread src/sentry/spans/buffer.py
Comment thread src/sentry/spans/buffer.py
Comment thread src/sentry/spans/buffer.py Outdated
Comment thread src/sentry/spans/buffer_store.py
Comment thread src/sentry/spans/buffer_store.py Outdated
Comment thread src/sentry/spans/buffer_store.py Outdated
Comment thread tests/sentry/spans/test_buffer_store.py Outdated
lvthanh03 added 2 commits May 29, 2026 11:57
- remove SpansBuffer key-helper passthroughs
- move payload preparation into SpansBufferStore
- add docstrings for public store methods
- simplify insert_subsegments and flush candidate mapping
@lvthanh03 lvthanh03 requested a review from fpacifici May 29, 2026 16:31

pytestmark = [pytest.mark.django_db]

# Keep these tests in their own Redis keyspace. CI runs test files in parallel,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the units of concurrency for running tests have their own redis each. you can use flushdb and in fact it's running automatically after every test.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With flushdb, I ran into the problem locally where I ran pytest tests/sentry/spans/test_buffer* and I would have one singular test failing (test_deep2), but the test ran file in isolation. I will look into it more.

Copy link
Copy Markdown
Contributor

@fpacifici fpacifici left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the comment inline.

Re: deployment, while this code is reasonably well tested, it is a large change we generally ship behind feature flags. Please be extra careful to s4s2 (both correctness and performance) when you ship it. You'd have to catch issues before it goes out to prod.

For the future:

  • please contain the size of your changes. That makes it not only easier to review but also easier to validate. The key is not necessarily the number of lines (copying a file in a different place makes a lot of lines but it is trivial), but the actual logic change. Here the risk is passing the wrong parameter to a function for example.
  • consider moving some test coverage from test_buffer to test_buffer_store. Now that is a smaller system so we can test more corner cases. You do not have to necessarily remove tests from test_buffer.

if compression_level == -1:
self._zstd_compressor = None
else:
self._zstd_compressor = zstandard.ZstdCompressor(level=compression_level)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were doing this once per call to process_spans before. Now you are doing this at every batch.
Is this an expensive operation that had to be done rarely ? I'd consider moving into store_payloads unless we are sure this is cheap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Scope: Backend Automatically applied to PRs that change backend components

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants