feat(postgres): Change the Postgres Adapter to be Partition Aware#591
Conversation
This adds a Postgres storage adapter for the taskbroker, as well as a way to choose between the adapters in the configuration. The adapter also works with AlloyDB. In Postgres, the keyword `offset` is reserved, so that column is named `kafka_offset` in the PG tables and mapped back to `offset` in application code. The tests were updated to run against both the SQLite and Postgres adapters using the rstest crate. The `create_test_store` function was updated to be the standard for all tests and to allow choosing between a SQLite and a Postgres DB. A `remove_db` function was added to the trait and the existing adapters, since the tests create a unique PG database on every run that should be cleaned up.
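The reserved-word workaround can be sketched like this (a minimal sketch: the table, column, and function names here are assumptions for illustration, not taskbroker's actual schema):

```rust
// Hedged sketch of the reserved-word workaround described above. `offset` is
// reserved in Postgres, so the stored column is `kafka_offset` and is aliased
// back to `offset` when reading rows, keeping the application-side name stable
// across the SQLite and Postgres adapters.
fn fetch_activation_sql() -> &'static str {
    r#"SELECT id, partition, kafka_offset AS "offset"
       FROM inflight_taskactivations
       WHERE partition = $1"#
}
```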
Have the Postgres adapter fetch and do upkeep only on activations that belong to the partitions the consumer is assigned. The broker can still update tasks outside its partitions, in case a worker is connected to a broker that is then rebalanced. The consumer now passes the partitions to the store whenever partitions are assigned. This was originally prototyped with PARTITION BY, but that requires manually keeping track of the partition tables, which isn't desired behaviour.
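The consumer-to-store handoff described above can be sketched roughly as follows (trait, struct, and field names are illustrative; taskbroker's real store trait differs):

```rust
// Minimal sketch of the consumer passing assigned partitions to the store.
trait ActivationStore {
    /// Called by the consumer whenever Kafka assigns or revokes partitions.
    fn assign_partitions(&mut self, partitions: Vec<i32>);
}

struct PgActivationStore {
    /// Partitions this broker is currently responsible for; partition-scoped
    /// fetch and upkeep queries filter on this list.
    partitions: Vec<i32>,
}

impl ActivationStore for PgActivationStore {
    fn assign_partitions(&mut self, partitions: Vec<i32>) {
        self.partitions = partitions;
    }
}
```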
force-pushed the …eorge/push-taskbroker/partition-store-by-partition branch from 896acfe to 895e119
}
query_builder.push(")");
}
}
Empty partition list causes queries to match all rows
High Severity
add_partition_condition silently becomes a no-op when partitions is empty, causing all partition-scoped queries (upkeep, counts, fetches) to operate on ALL rows instead of NO rows. Since the upkeep task in upkeep.rs runs independently of the consumer lifecycle and shares the same store, after a partition revocation (which calls assign_partitions(vec![])) the upkeep loop continues running and will modify/delete/count tasks belonging to other brokers' partitions. This directly undermines the PR's goal of partition isolation.
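One possible fix for this finding can be sketched as follows (a hedged sketch: a plain `String` stands in for the real query builder, and the column name is assumed):

```rust
// Emit a clause that matches NO rows when no partitions are assigned, instead
// of silently dropping the filter, so unscoped upkeep queries cannot touch
// other brokers' partitions.
fn add_partition_condition(query: &mut String, partitions: &[i32]) {
    if partitions.is_empty() {
        // No assigned partitions should mean no matching rows.
        query.push_str(" AND FALSE");
        return;
    }
    query.push_str(" AND partition IN (");
    for (i, p) in partitions.iter().enumerate() {
        if i > 0 {
            query.push_str(", ");
        }
        query.push_str(&p.to_string());
    }
    query.push(')');
}
```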
Additional Locations (1)
Reviewed by Cursor Bugbot for commit f97d2ad. Configure here.
We get the same behavior before partitions are assigned, since the list will be empty in that case too. Is this desired?
I don't think this will break anything, just slow the queries down briefly. I'm inclined to say it's OK. Another option would be to do checks for the partitions and not run the queries if there are no partitions assigned but I don't know if that would be better.
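The alternative floated in this comment, skipping the queries entirely while no partitions are assigned, could look roughly like this (the guard function is hypothetical, not code from the PR):

```rust
// Sketch of the alternative: gate each upkeep pass on having at least one
// assigned partition, rather than issuing a query with no partition filter.
fn should_run_upkeep(partitions: &[i32]) -> bool {
    !partitions.is_empty()
}
```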
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 51c1889.
tpl == revoked,
"Revoked TPL should be equal to the subset of TPL we're consuming from"
);
activation_store.assign_partitions(vec![]).unwrap();
Partitions cleared before actors shutdown causes unscoped queries
Medium Severity
assign_partitions(vec![]) is called before handles.shutdown() in both the Revoke and Shutdown handlers. Since add_partition_condition is a no-op when the partition list is empty, any upkeep queries still running during the up-to-4-second shutdown window will execute without partition filtering, operating on ALL rows in the shared database. This defeats partition isolation and could cause issues like duplicate deadletter messages from handle_failed_tasks if another broker is concurrently assigned those partitions. Swapping the order — shutting down actors first, then clearing partitions — would eliminate this race window.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 51c1889.
Evan thinks this is fine.
tpl == revoked,
"Revoked TPL should be equal to the subset of TPL we're consuming from"
);
activation_store.assign_partitions(vec![]).unwrap();
handles.shutdown(CALLBACK_DURATION).await;
metrics::gauge!("arroyo.consumer.current_partitions").set(0);
Bug: Clearing the partition filter via assign_partitions(vec![]) before handles.shutdown() creates a race condition where upkeep tasks can operate on all partitions, not just assigned ones.
Severity: HIGH
Suggested Fix
The partition filter should be cleared after the actor handles have been successfully shut down. Move the activation_store.assign_partitions(vec![]).unwrap(); call to after the handles.shutdown(CALLBACK_DURATION).await; call. This ensures that no upkeep operations can run in the intermediate state where the broker has no assigned partitions but its tasks are still active.
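The suggested ordering can be sketched as below (a hedged sketch: `Handles`, `Store`, and `on_revoke` are stand-in types, and the real `shutdown` is an async call taking `CALLBACK_DURATION`):

```rust
// Drain the actor handles first, then clear the partition scope, so no
// upkeep query can run with an empty (and therefore unscoped) partition list.
struct Handles {
    drained: bool,
}

impl Handles {
    fn shutdown(&mut self) {
        // Stand-in for `handles.shutdown(CALLBACK_DURATION).await`.
        self.drained = true;
    }
}

struct Store {
    partitions: Vec<i32>,
}

impl Store {
    fn assign_partitions(&mut self, partitions: Vec<i32>) {
        self.partitions = partitions;
    }
}

fn on_revoke(handles: &mut Handles, store: &mut Store) {
    handles.shutdown();              // 1. stop in-flight upkeep first
    store.assign_partitions(vec![]); // 2. then drop the partition filter
}
```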
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.
Location: src/kafka/consumer.rs#L396-L401
Potential issue: During partition revocation, the code calls `activation_store.assign_partitions(vec![])` to clear the partition list before waiting for actor handles to shut down via `handles.shutdown(CALLBACK_DURATION)`. An independent upkeep task runs periodically. If this task executes during the shutdown window, it will operate with an empty partition list. Because the `add_partition_condition()` method omits the partition filter when the list is empty, upkeep queries like `handle_claim_expiration` will run against all partitions in the database. This can cause a broker to incorrectly modify tasks that have been reassigned to another broker, violating partition isolation.
Evan thinks this is fine.


Linear
Completes STREAM-868
Description
Currently, taskworkers pull tasks from taskbrokers via RPC. This approach works, but has some drawbacks. Therefore, we want taskbrokers to push tasks to taskworkers instead. Read this page on Notion for more information.
We are also moving from SQLite to Postgres. With Postgres, each broker will share a single store with multiple other brokers. To prevent contention between brokers and unexpected results from conflicting upkeep tasks, we will make each broker only touch tasks that came from the Kafka partitions it's currently responsible for.