Production-ready proxy: working Postgres scaling, race fixes, ops, tests, non-Docker docs (#54, #52)#59
Conversation
- swap lib/pq for jackc/pgx/v5 (driver was never registered) - use dialect-correct placeholders and ON CONFLICT upserts - add atomic Assign to prevent multi-proxy split-brain - add SQLite DB layer tests and gated Postgres integration tests Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- guard static resource map and use BackendState snapshots (no data races) - extract testable chooseBackend; use atomic Assign for new pads; fall back to redirect when a stored static backend is down - http.Server with signal-driven graceful shutdown and DB close - /healthz, /readyz and /metrics on the management port - structured zap logging; record metrics on routing and DB outcomes - add routing and backend-availability tests Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
Review Summary by QodoProduction-ready proxy: Postgres scaling, atomic assignment, concurrency safety, operability, and comprehensive testing
WalkthroughsDescription• **Postgres scaling now works**: Replaced lib/pq with jackc/pgx/v5, implemented dialect-correct SQL placeholders ($N for Postgres, ? for SQLite), and fixed ON CONFLICT upserts for both backends • **Atomic pad assignment prevents split-brain**: Implemented Assign method using `INSERT ... ON CONFLICT DO NOTHING` + read-back pattern so the first proxy to win decides the backend and others read the winner • **Concurrency hardening**: Replaced bare global AvailableBackends with thread-safe BackendState using sync.RWMutex; guarded static-resource map; snapshot-based cleanup loop to avoid holding locks during I/O • **Production operability**: Added graceful shutdown (SIGINT/SIGTERM drain via http.Server/signal.NotifyContext), health endpoints (/healthz, /readyz), Prometheus metrics (/metrics), configurable managementPort, and structured zap logging • **Comprehensive testing**: Added 10+ test files covering DB layer (SQLite and Postgres integration tests), routing decisions, BackendState concurrency, config validation, and availability checks • **CI/CD and deployment**: New test.yml workflow with go vet, go test -race against Postgres service container; .goreleaser.yaml for cross-platform binary releases; release.yml for tag-triggered automation; systemd service unit for Linux deployment • **Non-Docker installation docs**: README section with prebuilt binary download, source build, configuration, and database provisioning examples • **Dependency updates**: Upgraded to Go 1.25, added jackc/pgx/v5 and prometheus/client_golang Diagramflowchart LR
A["lib/pq driver<br/>SQLite-only SQL<br/>Racy assignment"] -->|"Replace driver"| B["jackc/pgx/v5<br/>Dialect-correct SQL<br/>Atomic Assign"]
C["Bare global<br/>AvailableBackends<br/>Data races"] -->|"Thread-safe wrapper"| D["RWMutex-guarded<br/>BackendState<br/>Snapshots"]
E["No graceful shutdown<br/>No observability<br/>No validation"] -->|"Add operability"| F["Signal handling<br/>Health endpoints<br/>Prometheus metrics<br/>Config validation"]
B --> G["Production-ready<br/>proxy"]
D --> G
F --> G
File Changes1. proxyHandler.go
|
Code Review by Qodo
Context used 1. Broken config template
|
| ### Configure | ||
|
|
||
| 1. Copy the template and edit it: | ||
| ```bash | ||
| cp settings.json.template settings.json | ||
| ``` | ||
| 2. Set `port` (proxy listen port), optionally `managementPort` | ||
| (default `8081`, serves `/pads`, `/metrics`, `/healthz`, `/readyz`), your | ||
| `backends`, and a database (see **Database** below). | ||
| 3. The settings path defaults to `./settings.json`; override it with the | ||
| `SETTINGS_FILE` environment variable. |
There was a problem hiding this comment.
1. Broken config template 🐞 Bug ≡ Correctness
README now instructs copying settings.json.template, but the shipped templates are invalid JSON (trailing commas) and use tokenURL while the code expects tokenUrl, so startup will fatally fail at json.Unmarshal/Validate for common OAuth configs.
Agent Prompt
## Issue description
The project recommends copying `settings.json.template`, but the template(s) are not parseable by `encoding/json` (trailing commas) and use the key `tokenURL` while the code unmarshals `tokenUrl` and then enforces it via `Settings.Validate()`. This makes the documented setup path fail at startup.
## Issue Context
- `main.go` uses `encoding/json.Unmarshal` and then calls `Settings.Validate()`.
- `models.Backend.TokenURL` is tagged `json:"tokenUrl"` and validation requires it when OAuth is configured.
- Both `settings.json.template` and `settings.json.sqlite.template` currently use `tokenURL` and include trailing commas.
## Fix Focus Areas
- README.md[49-60]
- settings.json.template[1-27]
- settings.json.sqlite.template[1-27]
- models/Settings.go[22-60]
- main.go[21-33]
## Suggested fix
1. Make both template files valid JSON (remove trailing commas).
2. Decide on one key name and make it consistent across:
- templates
- README OAuth section (currently says `tokenURL`)
- code (`models.Backend.TokenURL` tag and validation message)
3. Strongly consider backward compatibility for existing configs that likely followed the README (`tokenURL`): implement custom unmarshalling or accept both `tokenURL` and `tokenUrl` and normalize into one field before validation/usage.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| func ScrapeJSFiles(settings models.Settings, static *staticResources, logger *zap.SugaredLogger) { | ||
| go func() { | ||
| for { | ||
| for key, backend := range backends.Backends { | ||
| for key, backend := range settings.Backends { | ||
| response, err := http.Get("http://" + backend.Host + ":" + strconv.Itoa(backend.Port) + "/p/test") | ||
| if err != nil { | ||
| log.Println("Error while scraping JS files: ", err) | ||
| logger.Warnf("Error while scraping JS files: %v", err) | ||
| continue | ||
| } | ||
|
|
||
| doc, err := goquery.NewDocumentFromReader(response.Body) |
There was a problem hiding this comment.
2. No http timeouts 🐞 Bug ☼ Reliability
ScrapeJSFiles and the cleanup loop perform outbound HTTP calls without any client/request timeout, so one stuck backend connection can hang those goroutines indefinitely and prevent ongoing routing/cleanup state refresh.
Agent Prompt
## Issue description
Several background loops do outbound HTTP I/O using `http.Get()` / `&http.Client{}` with no timeout. If a backend accepts a TCP connection but never responds (or stalls mid-response), the goroutine can block indefinitely.
## Issue Context
This impacts:
- Static resource scraping (`ScrapeJSFiles`) used for routing padbootstrap resources.
- Pad reconciliation (`cleanUpEtherpads`) which calls Etherpad’s `listAllPads` API.
- Backend availability checks (`checkAvailability`) that drive `BackendState` updates.
## Fix Focus Areas
- proxyHandler.go[81-110]
- runtime.go[55-90]
- checkAvailability.go[22-51]
## Suggested fix
1. Create a shared `http.Client` with a reasonable `Timeout` (e.g., 5–10s) and reuse it.
2. Prefer per-request context timeouts (`context.WithTimeout`) for clearer cancellation.
3. Ensure bodies are closed in all paths (including non-2xx responses) and consider checking `StatusCode` before parsing.
4. In `ScrapeJSFiles`, consider using a ticker and making the HTTP calls cancellable on shutdown (optional but improves operability).
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| func validPort(p int) bool { return p > 0 && p <= 65535 } | ||
|
|
||
| // Validate checks that the settings are internally consistent and usable. | ||
| func (s Settings) Validate() error { | ||
| if !validPort(s.Port) { | ||
| return fmt.Errorf("port must be between 1 and 65535, got %d", s.Port) | ||
| } | ||
| if s.ManagementPort != 0 && !validPort(s.ManagementPort) { | ||
| return fmt.Errorf("managementPort must be between 1 and 65535, got %d", s.ManagementPort) | ||
| } | ||
| if len(s.Backends) == 0 { | ||
| return errors.New("at least one backend must be configured") | ||
| } | ||
| for name, b := range s.Backends { | ||
| if b.Host == "" { | ||
| return fmt.Errorf("backend %q is missing host", name) | ||
| } | ||
| if !validPort(b.Port) { | ||
| return fmt.Errorf("backend %q has invalid port %d", name, b.Port) | ||
| } | ||
| if (b.Username != nil) != (b.Password != nil) { | ||
| return fmt.Errorf("backend %q must set both username and password", name) | ||
| } | ||
| hasOAuth := b.ClientId != nil && b.ClientSecret != nil && b.TokenURL != nil | ||
| if (b.ClientId != nil || b.ClientSecret != nil || b.TokenURL != nil) && !hasOAuth { | ||
| return fmt.Errorf("backend %q must set clientId, clientSecret and tokenUrl together", name) | ||
| } | ||
| } | ||
| hasFile := s.DBSettings.Filename != "" | ||
| hasConn := s.DBSettings.Connstr != "" | ||
| if hasFile == hasConn { | ||
| return errors.New("exactly one of dbSettings.filename or dbSettings.postgresConnstr must be set") | ||
| } | ||
| return nil |
There was a problem hiding this comment.
3. Validation misses invariants 🐞 Bug ☼ Reliability
Settings.Validate doesn’t reject default-zero values like checkInterval<=0 or maxPadsPerInstance<=0, nor does it prevent managementPort/port collisions, which can cause a busy-loop availability checker, no available backends, or bind failures at runtime.
Agent Prompt
## Issue description
`Settings.Validate()` was added, but it misses checks that prevent severe runtime behavior when settings fields are omitted or set to unsafe values.
## Issue Context
- `runtime.checkAvailabilityLoop` sleeps `time.Duration(settings.CheckInterval) * time.Millisecond`; if `CheckInterval` is 0, it can spin.
- `checkAvailability` compares `ActivePads >= settings.MaxPadsPerInstance`; if `MaxPadsPerInstance` is 0, every backend becomes “full” and routing for new pads fails.
- Management server and proxy server can be configured to use the same port (or `port` can equal the default management port when `managementPort` is left as 0), causing `ListenAndServe` to fail.
## Fix Focus Areas
- models/Settings.go[35-66]
- runtime.go[28-40]
- runtime.go[174-181]
- checkAvailability.go[48-50]
## Suggested fix
1. In `Settings.Validate()` enforce:
- `CheckInterval > 0`
- `MaxPadsPerInstance > 0`
2. Validate port collisions:
- If `ManagementPort != 0`, ensure `ManagementPort != Port`.
- If `ManagementPort == 0`, ensure `Port != 8081` (or move this check to `StartServer` after defaulting `managementPort`).
3. Consider validating `CheckInterval` upper bounds (optional) to prevent extremely slow availability refresh in production.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
Summary
Makes the proxy reliable for multi-instance production use and documents non-Docker installation. Closes #54 and #52.
#54 — scaling now actually works. PR #57 shipped Postgres support that was non-functional:
lib/pqdriver was never registered, sosql.Open("postgres", …)failed at runtime.INSERT OR REPLACE,?placeholders).Fixes:
lib/pq→jackc/pgx/v5(via thestdlibadapter); droplib/pq.$N/?) andON CONFLICTupserts for both backends.Assign(INSERT … ON CONFLICT DO NOTHING+ read-back) so the first proxy to win decides the backend and the rest read the winner — no split-brain.Concurrency hardening
RWMutex-guardedBackendStatereplaces the bare global that was read without locking.Operability
http.Server/signal.NotifyContext./healthz,/readyz(ready ⇔ ≥1 backend up), and/metrics(Prometheus) on the management port.managementPort; startup config validation; structured zap logging.#52 — installation without Docker
support/etherpad-proxy.servicesystemd unit..goreleaser.yaml+ tag-triggered release workflow for prebuilt cross-platform binaries.Quality
BackendState, config validation, and availability checks.test.ymlCI workflow:go vet+go test -racewith a Postgres service container.Design and step-by-step plan live in
docs/superpowers/.Test Plan
go build ./...,go vet ./...,go test ./...all pass locally/healthz200,/readyz503 (backend down),/metricsexposesetherpad_proxy_*,/pads200go test -race ./...against the Postgres service (couldn't run-racelocally — no C compiler on the dev box)vX.Y.Z) to exercise the GoReleaser workflow🤖 Generated with Claude Code