feat: raw cache transport#1569
Conversation
There was a problem hiding this comment.
4 issues found across 16 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="pkg/cache/server.go">
<violation number="1" location="pkg/cache/server.go:320">
P1: Enabling `newCacheMuxListener` can allow idle connections to block indefinitely in `dispatch` (no read deadline), creating a potential resource-exhaustion path.</violation>
</file>
<file name="pkg/cache/prefetcher.go">
<violation number="1" location="pkg/cache/prefetcher.go:57">
P2: Validate and cap `workers`/`maxParts` before using them for channel capacity and goroutine fan-out; unbounded values can exhaust memory or crash the process.</violation>
</file>
<file name="pkg/cache/storage.go">
<violation number="1" location="pkg/cache/storage.go:680">
P2: Disk reads no longer repopulate the memory cache, so repeated accesses to the same page will stay on the slower disk path.</violation>
</file>
<file name="pkg/cache/cachefs_node.go">
<violation number="1" location="pkg/cache/cachefs_node.go:75">
P1: Evicting and closing cached FDs is unsafe with `ReadResultFd`: a later read can close an FD that is still in use by an in-flight read result. This can cause intermittent read failures under concurrent/readahead access.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
There was a problem hiding this comment.
3 issues found across 17 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="pkg/cache/client.go">
<violation number="1" location="pkg/cache/client.go:110">
P3: Starting the global cache-path stats logger with a client-scoped context can permanently disable stats logging after that client is canceled, because the logger is initialized via `sync.Once` and cannot be restarted.</violation>
</file>
<file name="pkg/cache/path_stats.go">
<violation number="1" location="pkg/cache/path_stats.go:110">
P2: `cacheFSStoreRetryErrors` is tracked but never logged in the cache path summary, so store-retry failures are silently missing from observability output.</violation>
<violation number="2" location="pkg/cache/path_stats.go:117">
P2: `serverStreamBytes` is never included in the summary log, so stream throughput data is collected but not reported.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
There was a problem hiding this comment.
6 issues found across 27 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="pkg/cache/server.go">
<violation number="1" location="pkg/cache/server.go:320">
P1: Enabling `newCacheMuxListener` can allow idle connections to block indefinitely in `dispatch` (no read deadline), creating a potential resource-exhaustion path.</violation>
</file>
<file name="pkg/cache/prefetcher.go">
<violation number="1" location="pkg/cache/prefetcher.go:57">
P2: Validate and cap `workers`/`maxParts` before using them for channel capacity and goroutine fan-out; unbounded values can exhaust memory or crash the process.</violation>
</file>
<file name="pkg/cache/storage.go">
<violation number="1" location="pkg/cache/storage.go:680">
P2: Disk reads no longer repopulate the memory cache, so repeated accesses to the same page will stay on the slower disk path.</violation>
</file>
<file name="pkg/cache/cachefs_node.go">
<violation number="1" location="pkg/cache/cachefs_node.go:75">
P1: Evicting and closing cached FDs is unsafe with `ReadResultFd`: a later read can close an FD that is still in use by an in-flight read result. This can cause intermittent read failures under concurrent/readahead access.</violation>
</file>
<file name="pkg/cache/client.go">
<violation number="1" location="pkg/cache/client.go:110">
P3: Starting the global cache-path stats logger with a client-scoped context can permanently disable stats logging after that client is canceled, because the logger is initialized via `sync.Once` and cannot be restarted.</violation>
</file>
<file name="pkg/cache/path_stats.go">
<violation number="1" location="pkg/cache/path_stats.go:110">
P2: `cacheFSStoreRetryErrors` is tracked but never logged in the cache path summary, so store-retry failures are silently missing from observability output.</violation>
<violation number="2" location="pkg/cache/path_stats.go:117">
P2: `serverStreamBytes` is never included in the summary log, so stream throughput data is collected but not reported.</violation>
</file>
<file name="pkg/abstractions/endpoint/autoscaler.go">
<violation number="1" location="pkg/abstractions/endpoint/autoscaler.go:91">
P2: Avoid blocking autoscaler decisions on telemetry I/O; this new synchronous event recording can stall the scale loop when repos are slow.</violation>
</file>
<file name="pkg/worker/worker.go">
<violation number="1" location="pkg/worker/worker.go:701">
P2: The orphan-state stop path does a synchronous trace RPC with `context.Background()` before sending the kill event, which can delay forced cleanup when tracing is slow.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
There was a problem hiding this comment.
9 issues found across 41 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="go.mod">
<violation number="1" location="go.mod:352">
P1: Committing local-path `replace` directives makes module resolution depend on your local folder structure and can break CI/other developers’ builds.</violation>
</file>
<file name="pkg/worker/logger.go">
<violation number="1" location="pkg/worker/logger.go:151">
P2: `msg.RunnerEvent` is not cleared between decode iterations, so stale runner events can be re-emitted for later log objects that don’t include `beta9_event`.</violation>
</file>
<file name="pkg/gateway/services/task.go">
<violation number="1" location="pkg/gateway/services/task.go:230">
P3: This adds a duplicated implementation of task event/phase emission logic that already exists in the function service, increasing maintenance drift risk.</violation>
</file>
<file name="pkg/api/v1/events.go">
<violation number="1" location="pkg/api/v1/events.go:187">
P2: Task-based batch responses can omit `task_id` on success, which breaks reliable mapping to requested `task_ids`.</violation>
<violation number="2" location="pkg/api/v1/events.go:261">
P2: Cluster-admin access is unintentionally blocked by an unconditional workspace equality check in `loadContainerEvents`.</violation>
</file>
<file name="pkg/worker/image_content_cache.go">
<violation number="1" location="pkg/worker/image_content_cache.go:112">
P1: The chunk-counting goroutine can block indefinitely if `StoreContent` fails early, because it keeps sending to `countingChunks` after the downstream consumer exits.</violation>
</file>
<file name="pkg/repository/events.go">
<violation number="1" location="pkg/repository/events.go:94">
P2: Returning `ErrEventReadUnsupported` for the no-reader case causes container event reads to surface as HTTP 500 in non-S2 deployments. This should be handled as an unsupported/unavailable feature path instead of an internal server error.</violation>
</file>
<file name="pkg/repository/events_s2.go">
<violation number="1" location="pkg/repository/events_s2.go:165">
P2: Task-scoped event queries are not filtered by `query.TaskID`, so `/tasks/:taskId` can include events from other tasks in the same container.</violation>
<violation number="2" location="pkg/repository/events_s2.go:246">
P1: `record.Timestamp` is in milliseconds but is being stored in `StoredAtNs` and later interpreted as nanoseconds, causing incorrect event times and ordering.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
There was a problem hiding this comment.
2 issues found across 10 files (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
There was a problem hiding this comment.
3 issues found across 15 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="pkg/cache/sendfile_linux.go">
<violation number="1" location="pkg/cache/sendfile_linux.go:72">
P1: The new no-progress retry path can loop forever when `sendfile` makes no progress without returning an error (e.g., source EOF before `length` bytes), causing a hang/spin.</violation>
</file>
<file name="pkg/gateway/services/object.go">
<violation number="1" location="pkg/gateway/services/object.go:182">
P2: Endpoint equality misses implicit default ports, so equivalent URLs like `https://host` and `https://host:443` are treated as different.</violation>
</file>
<file name="pkg/storage/geese.go">
<violation number="1" location="pkg/storage/geese.go:99">
P2: `StagedWriteModeEnabled` is now hardcoded to `false`, so the configuration flag is ignored and cannot enable staged writes anymore.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
There was a problem hiding this comment.
5 issues found across 26 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="pkg/cache/cachefs_node.go">
<violation number="1" location="pkg/cache/cachefs_node.go:69">
P2: When the FD cache reaches capacity, this returns an error instead of evicting an old entry, which prevents caching any new page FDs and degrades reads to non-zero-copy after the first `fdCacheSize` unique pages.</violation>
</file>
<file name="Makefile">
<violation number="1" location="Makefile:41">
P1: `cache-benchmark` now calls a non-existent script path, so the target fails immediately.</violation>
</file>
<file name="pkg/abstractions/endpoint/autoscaler.go">
<violation number="1" location="pkg/abstractions/endpoint/autoscaler.go:104">
P2: This async call can create unbounded overlapping goroutines (one per autoscaler tick) when the serve lock is missing, increasing DB/event pressure under load.</violation>
</file>
<file name="pkg/task/container_events.go">
<violation number="1" location="pkg/task/container_events.go:59">
P2: Set phase event `Source` on the top-level schema field instead of only embedding it inside `Attrs`.</violation>
</file>
<file name="pkg/worker/image_events.go">
<violation number="1" location="pkg/worker/image_events.go:420">
P2: When a stale runtime PID mapping is detected during parent traversal, the function returns immediately instead of continuing to the parent. This can drop valid container attribution for child processes if an intermediate PID was reused.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
There was a problem hiding this comment.
1 issue found across 47 files (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
<!-- This is an auto-generated description by cubic. --> ## Summary by cubic Removed Cedana-based checkpoint/restore to fix macOS build errors and simplify CRIU usage. CRIU now supports only Nvidia CUDA mode. - **Refactors** - Dropped Cedana CRIU integration and related types; removed `CRIUConfigModeCedana` and `pkg/worker/criu_cedana.go`. - Removed Cedana settings from `config.default.yaml`. - `InitializeCRIUManager` now only handles `nvidia` and returns “unsupported CRIU mode” for others. - Trimmed dependencies: removed `github.com/cedana/cedana` and `buf.build` Cedana protos; cleaned `go.mod`/`go.sum`. - **Migration** - Set `worker.criu.mode` to `nvidia` or disable CRIU if not needed. - Remove any `worker.criu.cedana` config blocks. Cedana is no longer supported. <sup>Written for commit 1ceca3e. Summary will update on new commits. <a href="https://cubic.dev/pr/beam-cloud/beta9/pull/1570?utm_source=github">Review in cubic</a></sup> <!-- End of auto-generated description by cubic. --> Co-authored-by: Luke Lombardi <luke@beam.cloud>
|
You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment |
Summary by cubic
Adds a same‑port raw TCP cache read path with Linux zero‑copy and a gRPC codec v2, plus a Redis-backed, gateway-managed cache coordinator and an S2-powered Events system with new
/api/v1/events. Also ships a structuredbin/benchrunner, updated defaults (4 MiB pages, Go 1.23,github.com/beam-cloud/clipv2), and hardening for raw transport and Linuxsendfile.New Features
sendfile; gRPC codec v2 (pkg/cache/cachegrpc) that splits large payloads; client‑local page‑file views and FUSE FD reuse; 4 MiB pages with striped locks; tunable prefetch (aheadBytes,partLengthBytes,workers,maxPartsPerRead); path‑level counters and expanded metrics; HostMap only updates on meaningful changes; hardened mux andsendfileretry/edge‑case handling.RegisterCacheHost,UnregisterCacheHost,ListCacheHosts); registrations deduped; old coordinator RPCs removed fromcache.proto; local FS/lock metadata moved behindCacheMetadataStore.github.com/s2-streamstore/s2-sdk-go) replaces Fluent Bit; autoscaler/gateway/scheduler/workers emit lifecycle/metrics/decisions; new HTTP/api/v1/eventsbatch, summaries, and top‑lifecycle endpoints; startup phase timings recorded and included in reports.bin/benchrunner with YAML suites, validators, and JSONL metrics across cache/fs/clip/startup/sandbox/full, plus startup summaries.http://127.0.0.1:4566); GeeseFS defaults tuned (lower parallelism, longer HTTP timeout); upgraded to Go 1.23 andgithub.com/beam-cloud/clipv2; addedGONOSUMDBdefaults.Migration
cache.server.readTransport.*,cache.client.readTransport.*,cache.client.prefetch.*,cache.client.pageFDCacheSize, andcache.global.grpcPayloadCodec*. Page size defaults to 4 MiB.database.s2(apiKey,basin,streamPrefix). LocalStack presign defaults tohttp://127.0.0.1:4566. Images default to Clip v2.nvidiamode or disable CRIU.Written for commit 4281224. Summary will update on new commits. Review in cubic