Conversation
Add uber fx, zap, opentelemetry, jaeger exporter, otelgrpc, otelhttp, and elasticsearch client libraries.
Rewrite internal/logger to wrap go.uber.org/zap. Outputs JSON in production (default) and colored text in dev (LOG_FORMAT=text). Same public API preserved so all callers work unchanged.
Update Storage interface and PostgresStorage to accept ctx from callers instead of creating context.Background() internally. Fix redirect handler to use r.Context(). Add timeout to ClickHouse startup ping.
DeleteURL now deletes from Postgres before clearing cache. Add ListPaginated/ListByUserIDPaginated with SQL LIMIT/OFFSET and COUNT(*) instead of fetching all rows and slicing in memory.
- CORS: configurable allowed origins via CORS_ALLOWED_ORIGINS env var - Rate limiter: parse X-Forwarded-For properly, validate IPs - JWT: remove hardcoded fallback secret, require JWT_SECRET env var - HTTP handlers: add 1MB request body size limits
Rewrite all 7 cmd/*/main.go files to use go.uber.org/fx for declarative dependency injection with proper lifecycle management. - Graceful shutdown via fx.Lifecycle OnStop hooks for all services - HTTP servers use configurable timeouts (Read/Write/Idle) - Health endpoints check actual DB and Redis connectivity - Redirect service gets /health endpoint (fixes K8s probe) - Workers use WaitGroup for clean goroutine drain on shutdown - Remove all manual signal handling (FX handles SIGINT/SIGTERM)
- URL search index (CRUD + full-text search across long_url/short_code) - Click event indexing with bulk API for pipeline-worker - Log shipping via zap WriteSyncer with batched bulk writes - ES client as optional dependency (nil-safe, services work without ES) - /api/search endpoint on api-gateway backed by ES - ElasticsearchConfig in config with ES_ENABLED toggle - Logger supports extra write syncer for multi-output (stdout + ES)
- OTel tracer provider with OTLP/HTTP exporter for Jaeger - Configurable sampling rate (TRACING_SAMPLE_RATE) and toggle (TRACING_ENABLED) - HTTP tracing middleware with X-Trace-ID response header - gRPC client and server instrumentation via otelgrpc StatsHandler - Tracer provider lifecycle management (init on start, flush on shutdown) - All 7 services instrumented with tracer provider + graceful shutdown
- Add .dockerignore to exclude .git, docs, deployments, test files - Pin alpine:latest to alpine:3.21 across all 8 Dockerfiles - Fix api-gateway HEALTHCHECK: replace broken binary flag with wget health probe
- Add container-level securityContext to all app deployments (runAsNonRoot, readOnlyRootFilesystem, drop ALL capabilities) - Change imagePullPolicy from Never to IfNotPresent across all deployments - Add dev-only warning comment to secrets.yaml - Add NetworkPolicy rules restricting inter-service communication - Add Elasticsearch single-node StatefulSet + Service manifest - Add Jaeger all-in-one Deployment + Service manifest - Update kustomization.yaml with new resources
- GitHub Actions workflow with lint (golangci-lint), test (race detector + coverage), vulnerability scanning (govulncheck), and Go build verification - Docker image builds for all 7 services using BuildKit with GHA cache - Push to GHCR on main branch merge, build-only on PRs
|
Warning Review limit reached
More reviews will be available in 14 minutes and 9 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (26)
📝 WalkthroughWalkthroughThis PR refactors the URL shortener service to use Uber fx for dependency injection while introducing OpenTelemetry observability, Elasticsearch integration, and hardening infrastructure. Configuration system expands to support Elasticsearch, tracing, CORS, and JWT settings. Storage interface now accepts context throughout. All service entrypoints migrate from manual initialization to fx-managed lifecycles. Docker images are pinned to Alpine 3.21, Kubernetes deployments gain security contexts, and new observability infrastructure (Jaeger, Elasticsearch) is deployed alongside network policies. ChangesDependency Injection, Observability, and Elasticsearch
🎯 4 (Complex) | ⏱️ ~60 minutes
✨ Finishing Touches🧪 Generate unit tests (beta)
|
- Pin golangci-lint to v2.1.6 (supports Go 1.25) - Make vulnerability check informational (continue-on-error) since current findings are Go stdlib issues requiring Go version upgrade
Pre-built binary was compiled with Go 1.24 and can't lint Go 1.25 code. Using goinstall mode builds it with the project's Go 1.25.3 toolchain.
No golangci-lint release supports Go 1.25 yet. Using go vet as the lint step until golangci-lint catches up with the Go 1.25 toolchain.
There was a problem hiding this comment.
Actionable comments posted: 2
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (9)
cmd/api-gateway/main.go (1)
215-231:⚠️ Potential issue | 🟠 Major | ⚡ Quick winProtect the analytics catch-all route.
/api/analytics/clicksis authenticated, but/api/analytics/{id}/stats|timeline|geo|devices|referrersis registered naked here, so those analytics endpoints are public.Suggested fix
- mux.HandleFunc("/api/analytics/", func(w http.ResponseWriter, r *http.Request) { + mux.HandleFunc("/api/analytics/", authMiddleware.RequireAuth(func(w http.ResponseWriter, r *http.Request) { path := r.URL.Path switch { case strings.HasSuffix(path, "/stats"): analyticsHandler.GetStats(w, r) @@ default: http.NotFound(w, r) } - }) + }))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/api-gateway/main.go` around lines 215 - 231, The analytics catch-all handler in main.go is exposing dynamic analytics subroutes without auth. Update the /api/analytics/ registration so the paths handled by analyticsHandler.GetStats, GetTimeline, GetGeoStats, GetDeviceStats, and GetReferrers are wrapped with the same authentication/authorization middleware used by the protected analytics routes, or move them under the existing secured route group to ensure /api/analytics/{id}/stats|timeline|geo|devices|referrers are not publicly accessible.internal/storage/postgres.go (2)
91-93:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReturn a sentinel
not founderror instead of formatted text.This PR already forces callers into
strings.Contains(err.Error(), "not found"), andIncrementClicksnow maps a missing row tocodes.Internalbecause there is no stable error contract. Export a package-levelErrURLNotFoundand wrap it with%wso callers can useerrors.Is.Also applies to: 143-145
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/storage/postgres.go` around lines 91 - 93, Create and export a package-level sentinel error named ErrURLNotFound and replace the plain formatted error returns when a DB update/delete affects zero rows (the branches using cmdTag.RowsAffected() == 0) with a wrapped error using %w so callers can use errors.Is; e.g., return fmt.Errorf("URL with short code %s not found: %w", shortCode, ErrURLNotFound). Apply the same change to the other occurrence around lines 143-145 (and any similar “not found” branches such as in IncrementClicks) so all missing-row cases return the sentinel ErrURLNotFound wrapped for reliable detection.
25-31:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't drop the storage-level deadline without replacing it upstream.
These calls now pass the incoming context straight into pgx, but the request contexts shown in this PR are bare handler contexts with no local timeout. A slow or wedged database can now pin request goroutines indefinitely; keep a bounded timeout here or enforce one before every storage call.
Also applies to: 49-58, 78-86, 98-106, 137-139, 149-163, 183-197
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/storage/postgres.go` around lines 25 - 31, The storage methods (e.g., PostgresStorage.Save and other PostgresStorage DB callers in this file) currently pass the incoming ctx straight into pgx which can be a plain handler context with no timeout; wrap the incoming ctx with a bounded timeout when there is no deadline before calling s.db.Write().Exec / Query methods: check ctx.Deadline(), and if absent create a context.WithTimeout (choose a sensible default timeout used across storage, e.g., 2s–5s) and defer its cancel, then pass that wrapped ctx into the DB calls; apply the same pattern to every DB call in this file so storage never relies on an unbounded request context.internal/service/url_service.go (2)
229-249:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep the search index in sync when clicks change.
IncrementClicksupdates PostgreSQL only. Search responses come from Elasticsearch, soclickswill stay at the original indexed value forever unlessUpdateClicksis called on the ES document here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/service/url_service.go` around lines 229 - 249, IncrementClicks currently updates PostgreSQL only (s.store.IncrementClicks) then reads the URL (s.store.GetByShortCode) but never updates the search index; call the search-index update method (e.g. s.search.UpdateClicks or s.searchClient.UpdateClicks) after you obtain the updated url.Clicks and pass the short code and new clicks count to keep Elasticsearch in sync, and handle the update error (log and/or return a gRPC status error) before returning the IncrementClicksResponse.
336-349:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCustom aliases never get indexed into Elasticsearch.
The normal create flow writes a URL document to Elasticsearch, but the custom-alias flow stops after PostgreSQL and cache. Those URLs won't show up in
/api/searchat all.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/service/url_service.go` around lines 336 - 349, The custom-alias path stops after postgresStore.CreateCustomURL and cache.Set so those records never get indexed into Elasticsearch; after successfully creating the custom URL (the block that calls postgresStore.CreateCustomURL and sets s.cache.Set), call the same indexing routine used by the normal create flow (the service's Elasticsearch index method—e.g., the function used elsewhere to write URL documents to ES) with the same document payload (alias/shortURL/longURL/expiresAt/userID/createdAt) and handle/indexing errors similarly before returning the CreateURLResult so custom aliases are searchable via /api/search.cmd/pipeline-worker/main.go (2)
156-168:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThe outer ticker is throttling stream consumption.
processBatchalready blocks inXReadGroupforw.blockTime, so waiting onpollIntervalbefore every call delays the first read and adds idle time between batches while backlog exists. Loop directly onprocessBatchand keeppollIntervalonly as an error/backoff delay.♻️ Suggested shape
func (w *PipelineWorker) Start(ctx context.Context, log *logger.Logger) { - ticker := time.NewTicker(w.pollInterval) - defer ticker.Stop() - for { - select { - case <-ctx.Done(): - return - case <-ticker.C: - if err := w.processBatch(ctx, log); err != nil { - log.Error("Failed to process batch: %v", err) - } + if ctx.Err() != nil { + return + } + if err := w.processBatch(ctx, log); err != nil { + log.Error("Failed to process batch: %v", err) + if ctx.Err() == nil { + time.Sleep(w.pollInterval) + } } } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/pipeline-worker/main.go` around lines 156 - 168, The current Start loop uses a time.Ticker that waits pollInterval between calls, which throttles reads even though processBatch already blocks in XReadGroup for w.blockTime; remove the outer ticker and change PipelineWorker.Start to continuously call w.processBatch(ctx, log) inside a for loop that checks ctx.Done(), and only sleep for w.pollInterval (or apply backoff) when processBatch returns an error (or when you need to pause), thereby preserving w.blockTime as the primary blocking read and using pollInterval strictly as an error/backoff delay; keep references to PipelineWorker.Start, w.processBatch, w.pollInterval and w.blockTime when making the change.
198-209:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftBad/invalid events can stay in Redis stream PEL indefinitely
cmd/pipeline-worker/main.goskips messages that failw.enrichEvent(...)(they never get appended tomessageIDs), and if all messages in the batch fail it returns before anyXAck; otherwise it onlyXAcks the successfully enrichedmsg.IDs (lines 198-209, 243-247).- Since the consumer reads with
XREADGROUP ... Streams: []string{w.streamName, ">"}, it won’t redeliver already-pending entries from the PEL, and there’s no pending-entry recovery (XPENDING/XCLAIM/XAUTOCLAIM) or dead-letter/ack path in code—so malformed events can accumulate permanently unless handled externally. https://redis.io/docs/latest/commands/xreadgroup/🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/pipeline-worker/main.go` around lines 198 - 209, The loop in the consumer (where w.enrichEvent(msg.Values) is called) currently skips and never acknowledges malformed messages so they remain in the Redis stream PEL indefinitely; modify the workflow in cmd/pipeline-worker/main.go so that every processed msg.ID is acknowledged or moved to a dead-letter path regardless of enrichEvent success: collect failed IDs as well as successful ones, call XACK (or XCLAIM/XAUTOCLAIM handling for pending recovery) for all msg.IDs before returning (including the branch where len(clickEvents)==0), and for failed enrichments push the original message to a dead-letter stream or log/metric via the same code paths that reference messageIDs, clickEvents, w.enrichEvent and w.streamName so malformed events are removed from the PEL and not permanently stuck.cmd/cleanup-worker/main.go (1)
123-137:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftExpired URLs are removed from Postgres but left live in cache.
runCleanupignoresurlCache, while the cache is configured with a fixedcfg.Cache.L2TTL. That means an already-expired short code can continue resolving from cache until TTL expiry even after the row is deleted from storage. This cleanup path needs the deleted codes so it can evict both cache tiers when rows are removed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/cleanup-worker/main.go` around lines 123 - 137, runCleanup currently ignores urlCache so rows deleted by store.DeleteExpiredURLs keep resolving from cache; change the storage cleanup to return the keys removed (e.g. modify DeleteExpiredURLs to return ([]string, error) or add a new GetAndDeleteExpiredURLs that returns deleted short codes) and update runCleanup to iterate those codes and evict them from the cache (use the cache API on urlCache — e.g. urlCache.Delete/Remove/Evict for each short code and any L1/L2-specific eviction method your cache exposes) before logging the deletedCount; keep the existing logging but derive deletedCount from the returned slice length.cmd/analytics-worker/main.go (1)
151-165:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftUnacknowledged analytics events are left in the Redis consumer-group pending list with no retry/recovery path.
In
cmd/analytics-worker/main.go, malformed events (missing/incorrectmsg.Values["short_code"]type) arecontinued before being appended tomessageIDs, so they never reachXAck. IfupdateClickCounts(...)returns an error, the codecontinues and skipsXAckfor the whole batch (so those message IDs stay pending). SinceXReadGroupusesStreams: []string{params.StreamName, ">"}, it only returns never-delivered messages, and there’s noXPENDING/XCLAIM/XAUTOCLAIMlogic elsewhere to reclaim pending entries.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/analytics-worker/main.go` around lines 151 - 165, Malformed messages are being skipped before their IDs are collected and batches are sometimes not acknowledged when updateClickCounts fails, leaving entries in the Redis consumer-group pending list; change the message loop to always append msg.ID to messageIDs (move messageIDs = append(messageIDs, msg.ID) before the short_code type-check) so every consumed message is tracked, and ensure XAck is invoked for all messageIDs regardless of malformed events or updateClickCounts errors (e.g., call the XAck/acknowledgement path in a defer/finally or after handling errors, and optionally route malformed events to a dead-letter/logging path instead of skipping them) so no messages remain permanently pending.
🟠 Major comments (24)
internal/clickhouse/client.go-40-41 (1)
40-41:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClose the ClickHouse connection on ping failure.
On Line 40 / Line 41, a failed
Pingreturns early without closing the previously opened connection, which can leak resources during startup retries.Proposed fix
if err := conn.Ping(pingCtx); err != nil { + _ = conn.Close() return nil, fmt.Errorf("failed to ping clickhouse: %w", err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/clickhouse/client.go` around lines 40 - 41, The Ping error path in the ClickHouse connection setup (inside the function that calls conn.Ping(pingCtx)) returns early without closing the opened connection (variable conn), leaking resources; modify the error branch so you first close the connection (e.g., call conn.Close(...) with an appropriate context) and handle/ignore any close error, then return the wrapped ping error (fmt.Errorf("failed to ping clickhouse: %w", err)); update the code around conn.Ping(pingCtx) to ensure conn is closed on failure before returning..github/workflows/ci.yaml-18-18 (1)
18-18:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDisable credential persistence on all checkout steps.
Each checkout currently leaves credentials in local git config. Set
persist-credentials: falseto reduce token exposure risk.Suggested change pattern
- - uses: actions/checkout@v4 + - uses: actions/checkout@v4 + with: + persist-credentials: falseAlso applies to: 30-30, 44-44, 55-55, 76-76
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci.yaml at line 18, Update every checkout step that uses the actions/checkout@v4 action to explicitly set persist-credentials: false so credentials are not left in the local git config; locate each job step with uses: actions/checkout@v4 (multiple occurrences) and add the persist-credentials: false input under that step's configuration, preserving any existing settings..github/workflows/ci.yaml-9-11 (1)
9-11:⚠️ Potential issue | 🟠 Major | ⚡ Quick winScope
packages: writeto the Docker job only.Line 11 grants package write permission to every job. That’s broader than necessary and weakens least-privilege posture.
Suggested change
permissions: contents: read - packages: write jobs: + # ... lint/test/vuln/build unchanged ... docker: name: Docker Build runs-on: ubuntu-latest needs: [lint, test, build] + permissions: + contents: read + packages: writeAlso applies to: 61-92
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci.yaml around lines 9 - 11, The workflow currently grants packages: write under the top-level permissions block which applies to every job; narrow this by removing or setting packages to read at the top-level permissions and instead add packages: write only to the Docker-related job (refer to the top-level permissions section and the Docker job definition, e.g., the job named "docker" or any job in the 61-92 range), ensuring other jobs keep least-privilege (packages: read) while the Docker job retains packages: write..github/workflows/ci.yaml-18-22 (1)
18-22:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPin GitHub Actions to immutable commit SHAs
.github/workflows/ci.yamlstill uses mutable action tags (not full commit SHAs), e.g.actions/checkout@v4,actions/setup-go@v5,golangci/golangci-lint-action@v6,actions/upload-artifact@v4,docker/setup-buildx-action@v3,docker/login-action@v3,docker/build-push-action@v6(lines 18-22, 30-31, 35, 44-45, 55-56, 76-78, 84).Suggested change pattern
- - uses: actions/checkout@v4 + - uses: actions/checkout@<full-commit-sha>Apply similarly to:
actions/setup-gogolangci/golangci-lint-actionactions/upload-artifactdocker/setup-buildx-actiondocker/login-actiondocker/build-push-action🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci.yaml around lines 18 - 22, The workflow uses mutable action tags (e.g., actions/checkout@v4, actions/setup-go@v5, golangci/golangci-lint-action@v6 and others) which should be pinned to immutable commit SHAs; update each step reference in .github/workflows/ci.yaml (the steps using actions/checkout, actions/setup-go, golangci/golangci-lint-action, actions/upload-artifact, docker/setup-buildx-action, docker/login-action, docker/build-push-action) to the corresponding full commit SHA for the release you want to follow, replacing the `@vX` tags with @<commit-sha> and verify the actions still work in CI.internal/middleware/tracing.go-13-18 (1)
13-18:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix Trace ID extraction + move
X-Trace-IDheader write before response is served (internal/middleware/tracing.go)
trace.SpanFromContext(r.Context())runs afterotelhttp.NewHandler(...).ServeHTTP; the OTel handler adds the created span to its local request context (viar.WithContext), so the outerr.Context()won’t contain that span.- Setting
X-Trace-IDafterServeHTTPcan be too late if the downstream handler has already written/committed the response.Suggested fix
func Tracing(serviceName string) func(http.Handler) http.Handler { return func(next http.Handler) http.Handler { - handler := otelhttp.NewHandler(next, serviceName) - return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { - handler.ServeHTTP(w, r) - span := trace.SpanFromContext(r.Context()) - if span.SpanContext().HasTraceID() { - w.Header().Set("X-Trace-ID", span.SpanContext().TraceID().String()) - } - }) + instrumented := otelhttp.NewHandler(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + span := trace.SpanFromContext(r.Context()) + if span.SpanContext().HasTraceID() { + w.Header().Set("X-Trace-ID", span.SpanContext().TraceID().String()) + } + next.ServeHTTP(w, r) + }), serviceName) + return instrumented } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/middleware/tracing.go` around lines 13 - 18, The middleware is reading the span from the wrong context and writing the X-Trace-ID after the response may already be committed; wrap the downstream handler so you extract the span from the request context that the downstream actually receives and set the "X-Trace-ID" header before that handler writes the response. Concretely, replace the current inline call that does handler.ServeHTTP(w, r) then trace.SpanFromContext(r.Context()) with a wrapper http.HandlerFunc that, given the incoming *http.Request r, reads span := trace.SpanFromContext(r.Context()) and if span.SpanContext().HasTraceID() sets w.Header().Set("X-Trace-ID", span.SpanContext().TraceID().String()) immediately, then calls the original handler.ServeHTTP(w, r); update the code paths around ServeHTTP/trace.SpanFromContext to use this wrapper so header is written from the same request context the handler uses.cmd/api-gateway/main.go-190-199 (1)
190-199:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep backend error details out of
/health.Returning raw
errtext here exposes internal DB and Redis details to any caller. Log the full error, but send a generic response body.Suggested fix
- http.Error(w, fmt.Sprintf("DB unavailable: %v", err), http.StatusServiceUnavailable) + http.Error(w, "DB unavailable", http.StatusServiceUnavailable) @@ - http.Error(w, fmt.Sprintf("Redis unavailable: %v", err), http.StatusServiceUnavailable) + http.Error(w, "Redis unavailable", http.StatusServiceUnavailable)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/api-gateway/main.go` around lines 190 - 199, The health handler is returning raw backend error text in HTTP responses; instead, keep logging the full error via log.Error but replace the response body with a generic message. In the block using dbManager.Primary().Ping(ctx) and the block using redisClient.GetClient().Ping(ctx), stop including err in fmt.Sprintf when calling http.Error — send a fixed, non-sensitive string like "DB unavailable" or "Redis unavailable" (no error details) while still logging the original err with log.Error for diagnostics.cmd/url-service/main.go-114-143 (1)
114-143:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClose the Elasticsearch client during shutdown.
provideESClientcan return a live client, butOnStoponly closes Redis and Postgres. That leaves transport resources open across restarts.Suggested fix
func registerLifecycle( lc fx.Lifecycle, grpcServer *grpc.Server, urlService *service.URLService, listener net.Listener, tp *sdktrace.TracerProvider, redisClient *redis.RedisClient, dbManager *database.DBManager, + esClient *es.Client, log *logger.Logger, ) { @@ OnStop: func(ctx context.Context) error { log.Info("Shutting down url-service...") grpcServer.GracefulStop() tracing.ShutdownTracer(ctx, tp) redisClient.Close() dbManager.Close() + if esClient != nil { + esClient.Close() + } return nil }, }) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/url-service/main.go` around lines 114 - 143, registerLifecycle currently doesn't close the Elasticsearch client returned by provideESClient; add an esClient parameter (the value returned from provideESClient) to registerLifecycle and ensure OnStop closes its transport/resources—e.g., call the client's Close() if available or at minimum invoke esClient.Transport.CloseIdleConnections() (and any other client-specific shutdown API) inside the OnStop block so Elasticsearch connections are released during shutdown.cmd/api-gateway/main.go-273-280 (1)
273-280:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail fast on TCP bind errors during fx startup.
OnStartspawnsserver.ListenAndServe()in a goroutine and then returnsnil, so a “port already in use” bind failure is only logged and fx may still report a healthy startup.Suggested fix
OnStart: func(ctx context.Context) error { log.Info("Listening on %s", server.Addr) + ln, err := net.Listen("tcp", server.Addr) + if err != nil { + return err + } go func() { - if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed { + if err := server.Serve(ln); err != nil && err != http.ErrServerClosed { log.Error("Server error: %v", err) } }() return nil },Add the missing
netimport to make the change compile.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/api-gateway/main.go` around lines 273 - 280, OnStart currently launches server.ListenAndServe() in a goroutine and returns nil so bind errors are only logged; change it to perform a TCP bind first (use net.Listen on server.Addr), return the bind error immediately if it fails, then start the HTTP server with server.Serve(listener) in a goroutine and keep the listener reference for shutdown in OnStop; update the OnStart/OnStop handlers around the existing server variable and ensure you add the missing net import so the code compiles.cmd/user-service/main.go-99-104 (1)
99-104:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBound gRPC shutdown by the lifecycle context.
grpcServer.GracefulStop()does not accept a context and will block indefinitely on stuck RPCs, causing the fx shutdown to exceed its timeout and force-kill the application.Suggested fix
OnStop: func(ctx context.Context) error { log.Info("Shutting down user-service...") - grpcServer.GracefulStop() + timedOut := false + stopped := make(chan struct{}) + go func() { + grpcServer.GracefulStop() + close(stopped) + }() + select { + case <-stopped: + case <-ctx.Done(): + grpcServer.Stop() + timedOut = true + } tracing.ShutdownTracer(ctx, tp) dbManager.Close() + if timedOut { + return ctx.Err() + } return nil },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/user-service/main.go` around lines 99 - 104, OnStop currently calls grpcServer.GracefulStop() directly which can block forever; instead start GracefulStop in a goroutine and wait for either that goroutine to finish or the provided lifecycle context (ctx) to be done, and if ctx is done first call grpcServer.Stop() to force close; keep tracing.ShutdownTracer(ctx, tp) and dbManager.Close() after the server shutdown sequence so they still run during teardown. Locate the OnStop func and replace the direct grpcServer.GracefulStop() call with a goroutine + done channel + select on ctx.Done() to call grpcServer.Stop() if needed, ensuring graceful finish otherwise.cmd/redirect-service/main.go-106-113 (1)
106-113:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPropagate port binding errors to Fx startup.
ListenAndServe()binds the socket inside the goroutine, soOnStartreturnsnilbefore the bind completes. Port binding failures (already in use, permission denied) are only logged—Fx considers startup successful even when the server never accepts traffic.Suggested fix
OnStart: func(ctx context.Context) error { log.Info("Listening on %s", server.Addr) + ln, err := net.Listen("tcp", server.Addr) + if err != nil { + return err + } go func() { - if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed { + if err := server.Serve(ln); err != nil && err != http.ErrServerClosed { log.Error("Server error: %v", err) } }() return nil },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/redirect-service/main.go` around lines 106 - 113, OnStart currently starts server.ListenAndServe inside a goroutine so port bind errors happen asynchronously and OnStart returns nil; change it to perform the bind synchronously and return any bind error. Specifically, replace the goroutine call to server.ListenAndServe with: call net.Listen("tcp", server.Addr) and if it returns an error return that error from OnStart, then start a goroutine that calls server.Serve(listener) (checking for err != http.ErrServerClosed and logging). Use listener.Addr() in your log instead of server.Addr so the actual bound address is reported; keep the existing error handling for server.Serve.cmd/url-service/main.go-136-142 (1)
136-142:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBound gRPC shutdown by the lifecycle context.
grpcServer.GracefulStop()ignores thectxparameter and blocks indefinitely if any in-flight RPC is stuck, causing thefxshutdown to hang beyond the configuredStopTimeout. Implement a timeout safety net as recommended by the gRPC documentation to forcefully stop the server if graceful shutdown does not complete in time.Suggested fix
OnStop: func(ctx context.Context) error { log.Info("Shutting down url-service...") - grpcServer.GracefulStop() + timedOut := false + stopped := make(chan struct{}) + go func() { + grpcServer.GracefulStop() + close(stopped) + }() + select { + case <-stopped: + case <-ctx.Done(): + grpcServer.Stop() + timedOut = true + } tracing.ShutdownTracer(ctx, tp) redisClient.Close() dbManager.Close() + if timedOut { + return ctx.Err() + } return nil },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/url-service/main.go` around lines 136 - 142, OnStop's current call to grpcServer.GracefulStop() can block indefinitely; wrap the graceful stop in a goroutine and wait for it with the lifecycle context (ctx) as a timeout safety net, selecting between a done channel and ctx.Done() (or a derived timeout) and call grpcServer.Stop() to forcefully terminate if the graceful shutdown doesn't finish in time. Modify the OnStop func to spawn the goroutine that runs grpcServer.GracefulStop(), wait on a done signal vs ctx.Done(), and only after either graceful completion or forced grpcServer.Stop() proceed to call tracing.ShutdownTracer(ctx, tp), redisClient.Close(), and dbManager.Close() so shutdown completes reliably within fx's StopTimeout.deployments/k8s/base/secrets.yaml-1-2 (1)
1-2:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftA warning comment doesn't stop weak secrets from shipping.
Because this manifest still lives in
base, anyone applying the base kustomization can roll out the placeholder DSNs and JWT secret by accident. Move dev defaults out ofbaseand require an overlay-specific generator or externally managed Secret instead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deployments/k8s/base/secrets.yaml` around lines 1 - 2, The base secrets.yaml contains dev placeholder secrets that can be accidentally applied; remove these defaults from the base and instead require overlay-specific secrets or an external secret manager. Concretely: delete the dev DSNs/JWT values from deployments/k8s/base/secrets.yaml, update the base kustomization to no longer include a Secret generator for these values, and add overlay kustomizations (e.g., deployments/k8s/overlays/dev) that provide a secretGenerator or a reference to an externally-managed Secret (Vault/SOPS/SealedSecret) for the DSNs and JWT secret; ensure code paths/configs referencing the secret name remain unchanged so overlays supply the actual values at apply time.deployments/k8s/base/analytics-worker/deployment.yaml-29-29 (1)
29-29:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't pair
image: ...:latestwithimagePullPolicy: IfNotPresentIn
deployments/k8s/base/analytics-worker/deployment.yaml, line 29 usesimage: ghcr.io/varun5711/tiny-analytics-worker:latestwithimagePullPolicy: IfNotPresent, which can leave nodes running a stale cached image after rebuilds/rollouts. The same:latest+IfNotPresentcombo appears in:
deployments/k8s/base/api-gateway/deployment.yamldeployments/k8s/base/cleanup-worker/cronjob.yamldeployments/k8s/base/jaeger/deployment.yamldeployments/k8s/base/pipeline-worker/deployment.yamldeployments/k8s/base/redirect-service/deployment.yamldeployments/k8s/base/url-service/deployment.yamldeployments/k8s/base/user-service/deployment.yamlSwitch to immutable image tags/digests (preferred) or use
imagePullPolicy: Alwaysuntil tags are versioned.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deployments/k8s/base/analytics-worker/deployment.yaml` at line 29, The deployment uses image: ghcr.io/varun5711/tiny-analytics-worker:latest together with imagePullPolicy: IfNotPresent (in deployments/k8s/base/analytics-worker/deployment.yaml), which can serve stale cached images; update the manifest so the container image is immutable (preferred) by replacing the :latest tag with a specific version tag or digest, or if you must keep non-versioned tags temporarily, change imagePullPolicy from IfNotPresent to Always; apply the same fix pattern to the other manifests listed (api-gateway, cleanup-worker, jaeger, pipeline-worker, redirect-service, url-service, user-service).deployments/k8s/base/elasticsearch/statefulset.yaml-20-61 (1)
20-61:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd the missing pod/container security context.
This StatefulSet is still using the default root-capable security posture while the rest of this cohort was hardened. Please align it with the same
runAsNonRoot,allowPrivilegeEscalation: false, andcapabilities.drop: ["ALL"]pattern before shipping this.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deployments/k8s/base/elasticsearch/statefulset.yaml` around lines 20 - 61, The StatefulSet is missing hardened pod/container security context; add a pod-level securityContext under spec.template.spec with runAsNonRoot: true (and optionally runAsUser to a non-root uid) and then add a container-level securityContext inside the container with name: elasticsearch setting allowPrivilegeEscalation: false and capabilities.drop: ["ALL"] so the elasticsearch container and pod follow the same non-root, no-privilege-escalation policy as the rest of the cohort.deployments/k8s/base/network-policies.yaml-40-64 (1)
40-64:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAllow OTLP egress to Jaeger.
These egress rules never permit traffic to
app: jaegeron port4318, so the services/workers you just instrumented for tracing will be unable to export spans once these policies are enforced.Also applies to: 88-97, 113-121, 137-153, 169-185, 201-217
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deployments/k8s/base/network-policies.yaml` around lines 40 - 64, The egress rules are missing an allowance for OTLP/Jaeger traffic, so add an egress entry that permits traffic to podSelector matchLabels: app: jaeger on port 4318/TCP; update each egress list (the blocks around the existing podSelector entries such as the ones currently allowing url-service, user-service, redis, postgresql, clickhouse, elasticsearch and the other similar blocks at ranges 88-97, 113-121, 137-153, 169-185, 201-217) to include a to: podSelector: matchLabels: app: jaeger with a ports: - port: 4318 protocol: TCP so instrumented services can export OTLP spans to Jaeger.deployments/k8s/base/jaeger/deployment.yaml-19-56 (1)
19-56:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHarden the Jaeger pod like the other workloads.
This deployment still runs with the default security context, so it misses the non-root / no-privilege-escalation / dropped-capabilities controls added elsewhere in this PR.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deployments/k8s/base/jaeger/deployment.yaml` around lines 19 - 56, The Jaeger container spec (containers: - name: jaeger) is missing hardened securityContext settings; update the pod and container security contexts near that containers block so the pod runs non-root and cannot escalate privileges — add a pod-level spec.securityContext with runAsNonRoot: true and runAsUser (e.g. 1000) and a container-level securityContext for the jaeger container that sets allowPrivilegeEscalation: false, capabilities.drop: ["ALL"], readOnlyRootFilesystem: true (and optionally seccompProfile: { type: RuntimeDefault }) so the jaeger container has the same non-root / no-privilege-escalation / dropped-capabilities controls as the other workloads referenced in this PR.internal/elasticsearch/click_index.go-96-110 (1)
96-110:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBulk response is not inspected for per-item failures.
res.IsError()only catches HTTP-level errors. The_bulkAPI returns HTTP 200 even when individual documents fail (mapping conflicts, version conflicts, etc.). With the current check, partial failures are silently swallowed and the caller treats the batch as fully indexed, leading to silent data loss. Decode the response and inspect the top-levelerrorsflag / per-itemstatus.🛠️ Sketch of partial-failure handling
if res.IsError() { return fmt.Errorf("elasticsearch bulk index error: %s", res.String()) } + var bulkResp struct { + Errors bool `json:"errors"` + Items []map[string]struct { + Status int `json:"status"` + Error json.RawMessage `json:"error,omitempty"` + } `json:"items"` + } + if err := json.NewDecoder(res.Body).Decode(&bulkResp); err != nil { + return fmt.Errorf("failed to decode bulk response: %w", err) + } + if bulkResp.Errors { + return fmt.Errorf("elasticsearch bulk index had item-level failures") + } + return nil🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/elasticsearch/click_index.go` around lines 96 - 110, The bulk response handling in the Bulk call using c.es.Bulk must be extended to decode and inspect the Elasticsearch bulk response JSON (not just res.IsError()) — read and json.Decode res.Body into a struct that includes the top-level "errors" bool and the per-item "items" array (e.g., map[string]map[string]interface{}), check if response.Errors is true and iterate c.items to collect any item-level failures (non-2xx statuses or presence of an "error" object), and return a descriptive error (or include in processLogger) that lists failed item statuses and error reasons; adjust the existing function around c.es.Bulk / c.clickIndex() to close the body, decode the JSON, and fail when any per-item error is found instead of treating HTTP 200 as success.internal/elasticsearch/client.go-34-42 (1)
34-42:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTimeout context isn’t applied to the Elasticsearch connectivity check
The 10s timeout context is discarded (
_ = ctx), soes.Info()runs without it and won’t respect the intended bound. Pass the context into the call (es.Info(es.Info.WithContext(ctx))).🛠️ Proposed fix
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) defer cancel() - _ = ctx - res, err := es.Info() + res, err := es.Info(es.Info.WithContext(ctx)) if err != nil { return nil, fmt.Errorf("failed to connect to elasticsearch: %w", err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/elasticsearch/client.go` around lines 34 - 42, The timeout context created by context.WithTimeout is being discarded (_ = ctx) so the Elasticsearch info check doesn't respect the 10s bound; pass the context into the call by using the context created (ctx) with es.Info (e.g., call es.Info with the WithContext option) and keep the defer cancel() so the request uses the timeout; update the code around ctx, cancel, and es.Info to call es.Info(es.Info.WithContext(ctx)) instead of es.Info().internal/elasticsearch/url_index.go-57-71 (1)
57-71:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFilter expired URLs out of Elasticsearch search results.
Every SQL list/get path in this PR excludes expired URLs, but this query does not. Once a URL expires,
/api/searchcan keep returning it until some separate cleanup removes the document.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/elasticsearch/url_index.go` around lines 57 - 71, The Elasticsearch query in SearchURLs currently returns expired documents; update the searchBody in Client.SearchURLs to wrap the existing "multi_match" under a "bool" query and add a "filter" that only allows URLs whose "expires_at" is greater than "now" or have no "expires_at" field (i.e. not expired). Concretely, replace the top-level "query": {"multi_match": ...} with "query": {"bool": {"must": {the multi_match map}, "filter": [{"bool":{"should":[{"range":{"expires_at":{"gt":"now"}}},{"bool":{"must_not":{"exists":{"field":"expires_at"}}}}]}}]}} so searches exclude expired documents.internal/elasticsearch/url_index.go-13-25 (1)
13-25:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't expose
user_idin the public search payload.
HTTPHandler.SearchURLswritesURLSearchResultstraight to the response, so reusingURLDocumenthere leaks internal user identifiers to API callers. Split the index model from the response DTO or omitUserIDbefore returning search results.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/elasticsearch/url_index.go` around lines 13 - 25, The URLDocument struct currently contains UserID and is being serialized directly by HTTPHandler.SearchURLs, leaking internal user identifiers; create a separate response DTO (e.g., PublicURLDocument without the UserID field) or a sanitized projection for URLSearchResult and map each URLDocument -> PublicURLDocument (or strip UserID) before marshalling in HTTPHandler.SearchURLs so responses never include user_id; update URLSearchResult to hold the public DTO (or ensure omission) and adjust any consumers accordingly.internal/service/url_service.go-88-97 (1)
88-97:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't silently ignore Elasticsearch write failures.
If PostgreSQL succeeds and the Elasticsearch side-effect fails here,
/api/searchdrifts with no signal or retry path. At minimum this needs structured logging plus a retry/backfill path; otherwise the request should fail when search consistency matters.Also applies to: 217-219
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/service/url_service.go` around lines 88 - 97, The Elasticsearch IndexURL call (s.esClient.IndexURL in internal/service/url_service.go) currently ignores errors; change this to capture the error result and handle it: at minimum call structured logging (include shortCode, req.UserId, createdAt, expiresAt and the error) instead of discarding, and implement a retry/backfill strategy — e.g., push a retry task or enqueue a backfill record when IndexURL returns an error so a background worker can reattempt, or propagate the error to the API caller when search consistency is required; apply the same fix for the other silent IndexURL invocation (around the second occurrence referenced in the comment).cmd/analytics-worker/main.go-83-115 (1)
83-115:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix fx lifecycle shutdown hook registration and Redis stream pending handling
cmd/analytics-worker/main.goregistersOnStopby callinglc.Append(...)insideOnStart(nested lifecycle append). Fx expects all lifecycle hooks to be appended during initialization, so thisOnStopcleanup isn’t reliably part of Fx’s shutdown sequence.processEventsusesXReadGroup(..., Streams: []string{params.StreamName, ">"})(new messages only). WhenupdateClickCountsfails (or a message is malformed), the codecontinues withoutXACK, leaving those entries in the consumer-group PEL indefinitely—there’s noXPENDING/XCLAIM/pending-read path to re-process them.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/analytics-worker/main.go` around lines 83 - 115, The OnStop hook is being appended inside OnStart (lc.Append within the fx.Hook's OnStart) which breaks Fx lifecycle expectations; move the OnStop registration out of the nested append by registering a single fx.Hook (or appending both hooks at init) that declares OnStart and OnStop together, capture cancel and wg in variables scoped so OnStop can call cancel(), wg.Wait(), tracing.ShutdownTracer, redisClient.Close and dbManager.Close; additionally, update processEvents to handle pending stream entries: before reading ">" call XPending/XPendingExt to list pending IDs for params.ConsumerGroup and use XClaim (or XReadGroup with specific IDs) to claim and re-process pending messages, ensure updateClickCounts always XACKs on success and on unrecoverable/malformed messages either XACK after moving to a DLQ or increment a retry counter and XCLAIM/XPENDING-based retry policy so entries aren’t left indefinitely in the PEL (refer to processEvents, updateClickCounts, XReadGroup, XACK, XPENDING, XCLAIM).cmd/pipeline-worker/main.go-108-140 (1)
108-140:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix Fx lifecycle hook registration + simplify batch polling + prevent stuck pending messages
cmd/pipeline-worker/main.goregisters theOnStophook by callinglc.Append(fx.Hook{OnStop: ...})insideOnStart; register both callbacks in the samelc.Append(...)block to match Fx lifecycle expectations: https://uber-go.github.io/fx/lifecycle.html.Startwaits forpollIntervalviatime.NewTicker(...)before callingprocessBatch, even thoughprocessBatchalready blocks onXReadGroupwithBlock: w.blockTime; the outer ticker adds avoidable latency/rate-limits effective blocking.- If enrichment fails for all messages,
processBatchreturns beforeXAck(becausemessageIDsis only populated for successfully enriched events), leaving those entries in the consumer-group PEL; there’s noXPENDING/XCLAIM-based reclaim logic in code to retry them.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/pipeline-worker/main.go` around lines 108 - 140, The Fx hooks are being registered incorrectly (OnStop is appended inside OnStart) and the worker loop and acking logic can leave messages stuck; change the lifecycle registration to append a single lc.Append(...) with both OnStart and OnStop hooks (use the existing lc.Append call that currently wraps OnStart and add OnStop there), remove the outer time.Ticker/pollInterval in worker.Start so it relies on XReadGroup's Block (use w.blockTime in XReadGroup and call processBatch immediately when XReadGroup returns), and ensure processBatch always XAck every message ID read (populate messageIDs for all messages or ack before enrichment) so failed enrichments don't leave entries in the PEL; additionally implement reclaim logic using Redis XPENDING/XCLAIM (or a reclaimPending function invoked periodically from Start) to claim and reprocess stale pending messages.cmd/cleanup-worker/main.go-78-104 (1)
78-104:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix Fx lifecycle hook registration order and invalidate expired-URL cache entries (cmd/cleanup-worker/main.go: 78-104, 123-137)
OnStopis registered insideOnStart(lc.Append(fx.Hook{ ... OnStart ... lc.Append(fx.Hook{OnStop...}) ...})); registerOnStopin the same (outer)lc.Appendinstead of mutating lifecycle hooks duringOnStart. See https://uber-go.github.io/fx/lifecycle.htmlrunCleanupdeletes expired URLs (store.DeleteExpiredURLs) but never invalidatesurlCachekeys;redirect_handlerservescache.Get("url:"+shortCode)directly on cache hit without anyexpires_atcheck, so expired redirects can remain until L1 eviction and/or L2 TTL.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/cleanup-worker/main.go` around lines 78 - 104, The OnStop hook must not be appended from inside the OnStart hook; move the lc.Append(fx.Hook{OnStop:...}) registration out so both OnStart and OnStop are registered on the same outer lc.Append and keep OnStart creating workerCtx/cancel and starting runCleanupLoop(workerCtx, store, urlCache, log) while OnStop calls cancel(), wg.Wait(), tracing.ShutdownTracer, redisClient.Close(), dbManager.Close(). Also update runCleanupLoop (and the code path that calls store.DeleteExpiredURLs) to remove expired entries from the in-memory cache by deleting/invalidation using urlCache.Delete or urlCache.Invalidate for keys built the same way redirect_handler reads them (e.g., "url:"+shortCode) immediately after deleting from the store so expired redirects are not served from cache.
🟡 Minor comments (3)
internal/middleware/cors.go-17-21 (1)
17-21:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winSet
Vary: Originunconditionally.Since the
Access-Control-Allow-Originvalue depends on the requestOrigin,Vary: Originmust be emitted on every response, not only when the origin matches. Otherwise a shared cache may serve a response containing an allowed origin's CORS headers to a disallowed origin.🔧 Proposed fix
+ w.Header().Set("Vary", "Origin") if origin != "" && originSet[origin] { w.Header().Set("Access-Control-Allow-Origin", origin) w.Header().Set("Access-Control-Allow-Credentials", "true") - w.Header().Set("Vary", "Origin") }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/middleware/cors.go` around lines 17 - 21, The Vary header must be sent on every response; update the middleware so w.Header().Set("Vary", "Origin") is executed unconditionally (move it out of the if-block) while leaving the conditional setting of Access-Control-Allow-Origin and Access-Control-Allow-Credentials that use origin and originSet intact; ensure the change touches the block that currently checks if origin != "" && originSet[origin] so that Vary: Origin is always added regardless of that condition.deployments/k8s/base/secrets.yaml-1-2 (1)
1-2:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winNormalize this file to LF line endings.
YAMLlint is already failing here with
wrong new line character: expected \n, so this file will keep breaking CI until the CRLFs are removed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deployments/k8s/base/secrets.yaml` around lines 1 - 2, The file deployments/k8s/base/secrets.yaml contains CRLF line endings causing yamllint errors; convert the file to use LF only (e.g., run a dos2unix conversion or your editor’s "normalize to LF" option), add a .gitattributes entry (e.g., *.yaml text eol=lf) to enforce LF on commit, verify with git status and re-run yamllint, then commit the normalized secrets.yaml so CI stops failing on "wrong new line character".internal/logger/logger.go-56-58 (1)
56-58:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse the same JSON schema for the mirrored sink.
LogSyncer.Writeonly understands a stringtimestamp, but this encoder emits Zap's defaulttsnumeric field. That makes every mirrored log fall back to ingestion time, so Elasticsearch ordering/filtering drifts from the real event time.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/logger/logger.go` around lines 56 - 58, The mirrored JSON encoder (created via zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig())) emits Zap's numeric "ts" field, which LogSyncer.Write doesn't understand; update the extra sink encoder to use the same encoder config/schema as the primary logger so it emits a string "timestamp" and the same time encoding. Concretely, reuse the existing encoder config (or construct one with TimeKey set to "timestamp" and TimeEncoder set to an ISO8601/string encoder) instead of calling zap.NewProductionEncoderConfig() directly when creating jsonEnc for extraSyncer (see jsonEnc, extraSyncer, zapcore.NewJSONEncoder, zapcore.NewCore).
🧹 Nitpick comments (3)
.github/workflows/ci.yaml (1)
24-24: ⚡ Quick winAvoid floating
latestversions for CI tooling.Line 24 and Line 48 use
latest, which makes CI non-reproducible and can introduce surprise failures. Pin explicit versions.Suggested change
- uses: golangci/golangci-lint-action@v6 with: - version: latest + version: v1.59.1 # example: pin exact version in repo policy - - run: go install golang.org/x/vuln/cmd/govulncheck@latest + - run: go install golang.org/x/vuln/cmd/govulncheck@v1.1.3 # example: pin exact versionAlso applies to: 48-48
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci.yaml at line 24, The workflow currently uses floating "version: latest" pins (shown as the literal token version: latest) which makes CI non-reproducible; locate every occurrence of the string "version: latest" in the .github/workflows/ci.yaml (including the second occurrence) and replace it with an explicit, tested version tag or SHA for the tool/action (e.g., a specific semver or commit SHA) appropriate for that step; ensure the chosen version is documented in the workflow comment or repo changelog so future updates are intentional.go.mod (1)
7-7: Dependency security posture looks OK for the pinned versions (no returned advisory ranges match).GitHub advisory data flags vulnerabilities only in earlier version ranges that are below what this
go.modpins:
- OpenTelemetry core at
go.opentelemetry.io/otel+go.opentelemetry.io/otel/sdkv1.44.0 (shown issues were patched at 1.41.0/1.43.0, with vulnerable ranges ≤1.40/≤1.42)- OpenTelemetry contrib at
otelgrpcv0.69.0 andotelhttpv0.49.0 (shown issues patched at 0.46.0/0.44.0; vulnerable ranges <0.46/<0.44)golang.org/x/cryptov0.51.0 (shown issues were vulnerable only before 0.45.0)google.golang.org/grpcv1.81.1 (shown issues were vulnerable only before 1.79.3 and before 1.58.3)google.golang.org/protobufv1.36.11 (shown issues were vulnerable only before 1.33.0 / in a narrow 1.29.0–<1.29.1 window)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@go.mod` at line 7, Confirm and lock the verified safe dependency versions in the module file by ensuring the listed modules (e.g., github.com/elastic/go-elasticsearch/v8 v8.19.6, go.opentelemetry.io/otel v1.44.0, go.opentelemetry.io/otel/sdk v1.44.0, golang.org/x/crypto v0.51.0, google.golang.org/grpc v1.81.1, google.golang.org/protobuf v1.36.11) are present and up-to-date in go.mod, run go mod tidy to refresh go.sum, commit the updated go.mod/go.sum, and add a brief note to the repo (SECURITY.md or a comment) summarizing that these pinned versions were audited and no advisory ranges matched.deployments/k8s/base/jaeger/deployment.yaml (1)
21-23: ⚡ Quick winPin Jaeger to a tested version or digest.
jaegertracing/all-in-one:latestmakes rollouts non-deterministic, and paired withIfNotPresentit can leave different nodes running different binaries after reschedules. Please pin an explicit tag or digest here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deployments/k8s/base/jaeger/deployment.yaml` around lines 21 - 23, The jaeger container currently uses the non-deterministic image "jaegertracing/all-in-one:latest" (container name: jaeger, image field: jaegertracing/all-in-one:latest, imagePullPolicy: IfNotPresent); change the image field to a specific, tested tag or an immutable digest (for example replace ":latest" with a pinned tag like ":1.45.0" or "sha256:...") so deployments are deterministic, and adjust imagePullPolicy if needed to match your update strategy.
| github.com/elastic/go-elasticsearch/v8 v8.19.6 | ||
| github.com/golang-jwt/jwt/v5 v5.3.0 | ||
| github.com/google/uuid v1.6.0 | ||
| github.com/jackc/pgx/v5 v5.7.6 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check GitHub security advisories for pgx and identify the latest secure version
echo "=== Checking pgx security advisories ==="
gh api graphql -f query='
{
securityVulnerabilities(first: 10, ecosystem: GO, package: "github.com/jackc/pgx/v5") {
nodes {
advisory {
summary
severity
publishedAt
identifiers {
type
value
}
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}'
echo ""
echo "=== Checking latest pgx/v5 release ==="
gh api repos/jackc/pgx/releases/latest --jq '.tag_name'Repository: Varun5711/tiny
Length of output: 1654
go.mod: upgrade github.com/jackc/pgx/v5 (v5.7.6) to a patched release
- Line 10:
github.com/jackc/pgx/v5 v5.7.6is affected by GHSA-9jj7-4m8r-rfcm / CVE-2026-33816 (CRITICAL, memory-safety), patched in 5.9.0. - Line 10: also affected by GHSA-j88v-2chj-qfwx (SQL injection via placeholder confusion), patched in 5.9.2.
Upgrade togithub.com/jackc/pgx/v5 v5.9.2or newer.
🧰 Tools
🪛 OSV Scanner (2.3.8)
[CRITICAL] 10-10: github.com/jackc/pgx/v5 5.7.6: CVE-2026-33815 in github.com/jackc/pgx
(GO-2026-4771)
[CRITICAL] 10-10: github.com/jackc/pgx/v5 5.7.6: CVE-2026-33816 in github.com/jackc/pgx
(GO-2026-4772)
[CRITICAL] 10-10: github.com/jackc/pgx/v5 5.7.6: Memory-safety vulnerability in github.com/jackc/pgx/v5.
[CRITICAL] 10-10: github.com/jackc/pgx/v5 5.7.6: pgx: SQL Injection via placeholder confusion with dollar quoted string literals
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@go.mod` at line 10, Update the pgx dependency version in go.mod to a patched
release: replace the current module version "github.com/jackc/pgx/v5 v5.7.6"
with "github.com/jackc/pgx/v5 v5.9.2" (or newer) to address GHSA-9jj7-4m8r-rfcm
/ CVE-2026-33816 and GHSA-j88v-2chj-qfwx; after updating the version string run
`go mod tidy` to refresh go.sum and ensure the new release is used by the build.
| func (ls *LogSyncer) Flush() { | ||
| ls.mu.Lock() | ||
| if len(ls.buffer) == 0 { | ||
| ls.mu.Unlock() | ||
| return | ||
| } | ||
| entries := ls.buffer | ||
| ls.buffer = make([]LogEntry, 0, ls.flushSize) | ||
| ls.mu.Unlock() | ||
|
|
||
| index := ls.logIndex() | ||
|
|
||
| var buf bytes.Buffer | ||
| for _, entry := range entries { | ||
| meta := map[string]interface{}{ | ||
| "index": map[string]interface{}{ | ||
| "_index": index, | ||
| }, | ||
| } | ||
| metaLine, _ := json.Marshal(meta) | ||
| buf.Write(metaLine) | ||
| buf.WriteByte('\n') | ||
|
|
||
| dataLine, _ := json.Marshal(entry) | ||
| buf.Write(dataLine) | ||
| buf.WriteByte('\n') | ||
| } | ||
|
|
||
| ls.client.es.Bulk(bytes.NewReader(buf.Bytes())) | ||
| } |
There was a problem hiding this comment.
Close the bulk response body and stop swallowing transport failures.
Bulk returns an HTTP response that must be closed. Ignoring both err and res.Body leaks connections on every flush, and bulk indexing failures are currently silent.
Suggested fix
func (ls *LogSyncer) Flush() {
ls.mu.Lock()
if len(ls.buffer) == 0 {
ls.mu.Unlock()
return
}
@@
- ls.client.es.Bulk(bytes.NewReader(buf.Bytes()))
+ res, err := ls.client.es.Bulk(bytes.NewReader(buf.Bytes()))
+ if err != nil {
+ return
+ }
+ defer res.Body.Close()
+ if res.IsError() {
+ return
+ }
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (ls *LogSyncer) Flush() { | |
| ls.mu.Lock() | |
| if len(ls.buffer) == 0 { | |
| ls.mu.Unlock() | |
| return | |
| } | |
| entries := ls.buffer | |
| ls.buffer = make([]LogEntry, 0, ls.flushSize) | |
| ls.mu.Unlock() | |
| index := ls.logIndex() | |
| var buf bytes.Buffer | |
| for _, entry := range entries { | |
| meta := map[string]interface{}{ | |
| "index": map[string]interface{}{ | |
| "_index": index, | |
| }, | |
| } | |
| metaLine, _ := json.Marshal(meta) | |
| buf.Write(metaLine) | |
| buf.WriteByte('\n') | |
| dataLine, _ := json.Marshal(entry) | |
| buf.Write(dataLine) | |
| buf.WriteByte('\n') | |
| } | |
| ls.client.es.Bulk(bytes.NewReader(buf.Bytes())) | |
| } | |
| func (ls *LogSyncer) Flush() { | |
| ls.mu.Lock() | |
| if len(ls.buffer) == 0 { | |
| ls.mu.Unlock() | |
| return | |
| } | |
| entries := ls.buffer | |
| ls.buffer = make([]LogEntry, 0, ls.flushSize) | |
| ls.mu.Unlock() | |
| index := ls.logIndex() | |
| var buf bytes.Buffer | |
| for _, entry := range entries { | |
| meta := map[string]interface{}{ | |
| "index": map[string]interface{}{ | |
| "_index": index, | |
| }, | |
| } | |
| metaLine, _ := json.Marshal(meta) | |
| buf.Write(metaLine) | |
| buf.WriteByte('\n') | |
| dataLine, _ := json.Marshal(entry) | |
| buf.Write(dataLine) | |
| buf.WriteByte('\n') | |
| } | |
| res, err := ls.client.es.Bulk(bytes.NewReader(buf.Bytes())) | |
| if err != nil { | |
| return | |
| } | |
| defer res.Body.Close() | |
| if res.IsError() { | |
| return | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/elasticsearch/log_index.go` around lines 107 - 136, The Flush method
on LogSyncer currently ignores the result of ls.client.es.Bulk, leaking response
bodies and swallowing transport errors; update LogSyncer.Flush to capture the
(res, err) returned by ls.client.es.Bulk, immediately handle non-nil err
(log/emit/return), ensure res.Body is closed (defer res.Body.Close() right after
checking res != nil), and check the HTTP status code or response body for errors
so bulk failures are surfaced (e.g. treat non-2xx as an error and log the
response body). Make these changes in the LogSyncer.Flush implementation around
the ls.client.es.Bulk call so failures and resource cleanup are not ignored.
Handle unchecked error returns across 15 files, replace deprecated grpc.Dial with grpc.NewClient, fix nil pointer dereference in test, and restore golangci-lint in CI with v2.1.6 goinstall mode.
Fix all defer Close/Shutdown patterns and json.Encode calls that the linter flags for unchecked error returns.
GITHUB_REPOSITORY can contain uppercase characters which Docker rejects in image tags. Use bash lowercase expansion to fix.
Summary by CodeRabbit
New Features
/api/searchendpoint for querying URLsInfrastructure & Deployment
Improvements