
Production-ready proxy: working Postgres scaling, race fixes, ops, tests, non-Docker docs (#54, #52)#59

Merged
SamTV12345 merged 10 commits into main from production-ready-proxy
May 28, 2026

@SamTV12345
Member

Summary

Makes the proxy reliable for multi-instance production use and documents non-Docker installation. Closes #54 and #52.

#54 — scaling now actually works. PR #57 shipped Postgres support that was non-functional:

  • The lib/pq driver was never registered, so sql.Open("postgres", …) failed at runtime.
  • Postgres queries used SQLite-only syntax (INSERT OR REPLACE, ? placeholders).
  • New-pad assignment was racy across proxies (random pick + last-write-wins → two proxies route the same pad to different backends).

Fixes:

  • Swap lib/pq for jackc/pgx/v5 (via the stdlib adapter); drop lib/pq.
  • Dialect-correct placeholders ($N / ?) and ON CONFLICT upserts for both backends.
  • Atomic Assign (INSERT … ON CONFLICT DO NOTHING + read-back) so the first proxy to win decides the backend and the rest read the winner — no split-brain.

Concurrency hardening

  • RWMutex-guarded BackendState replaces the bare global that was read without locking.
  • Guarded the static-resource map (was a data race between the scraper goroutine and the router).
  • Cleanup loop snapshots the up-list under lock, then does HTTP I/O without holding it.
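A minimal sketch of the snapshot approach, assuming string backend keys. The method names `SetState`, `SnapshotUp`, and `IsUp` follow the review summary further down; the field names and slice representation are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// BackendState holds which backends are reachable ("up") and which still have
// capacity ("available"). Field names are assumptions for this sketch.
type BackendState struct {
	mu        sync.RWMutex
	up        []string
	available []string
}

// SetState swaps both lists in one critical section, copying the inputs so the
// caller cannot mutate the stored slices afterwards.
func (s *BackendState) SetState(up, available []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.up = append([]string(nil), up...)
	s.available = append([]string(nil), available...)
}

// SnapshotUp returns a copy, so callers can do slow work without the lock.
func (s *BackendState) SnapshotUp() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return append([]string(nil), s.up...)
}

// IsUp reports whether the named backend is in the up-list.
func (s *BackendState) IsUp(name string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, b := range s.up {
		if b == name {
			return true
		}
	}
	return false
}

func main() {
	var state BackendState
	state.SetState([]string{"backend1", "backend2"}, []string{"backend1"})

	snap := state.SnapshotUp() // the lock is released once this returns
	snap[0] = "mutated"        // does not affect the shared state
	fmt.Println(state.IsUp("backend1"), state.SnapshotUp())
}
```

A cleanup loop written this way would iterate over `snap` and perform its HTTP calls while other goroutines remain free to read or replace the state.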

Operability

  • Graceful shutdown (SIGINT/SIGTERM drain + DB close) via http.Server/signal.NotifyContext.
  • /healthz, /readyz (ready ⇔ ≥1 backend up), and /metrics (Prometheus) on the management port.
  • Configurable managementPort; startup config validation; structured zap logging.
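The liveness/readiness split could look roughly like this. It is a sketch only: `newManagementMux`, `probe`, and the `upCount` callback are hypothetical names, and the real handlers may return bodies or headers not shown here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newManagementMux wires liveness and readiness probes. upCount is a
// hypothetical callback returning how many backends are currently up.
func newManagementMux(upCount func() int) *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: the process is running and serving HTTP.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: only ready while at least one backend is up.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if upCount() < 1 {
			http.Error(w, "no backend up", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe runs one request against the handler and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	up := 0
	mux := newManagementMux(func() int { return up })
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/readyz"))
	up = 1
	fmt.Println(probe(mux, "/readyz"))
}
```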

#52 — installation without Docker

  • README section: build from source, configure, run, SQLite-vs-Postgres provisioning.
  • Sample support/etherpad-proxy.service systemd unit.
  • .goreleaser.yaml + tag-triggered release workflow for prebuilt cross-platform binaries.

Quality

  • From 0 tests to coverage of the DB layer, routing decisions, BackendState, config validation, and availability checks.
  • New test.yml CI workflow: go vet + go test -race with a Postgres service container.

Design and step-by-step plan live in docs/superpowers/.

Note: go.mod now requires Go 1.25 (a transitive dependency forces it); CI and docs updated to match.

Test Plan

  • go build ./..., go vet ./..., go test ./... all pass locally
  • Live smoke test: management endpoints return /healthz 200, /readyz 503 (backend down), /metrics exposes etherpad_proxy_*, /pads 200
  • CI runs go test -race ./... against the Postgres service (couldn't run -race locally — no C compiler on the dev box)
  • Verify graceful shutdown drains under SIGTERM on Linux (force-killed locally on Windows)
  • Tag a release (vX.Y.Z) to exercise the GoReleaser workflow

🤖 Generated with Claude Code

SamTV12345 and others added 10 commits May 28, 2026 22:38
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- swap lib/pq for jackc/pgx/v5 (driver was never registered)
- use dialect-correct placeholders and ON CONFLICT upserts
- add atomic Assign to prevent multi-proxy split-brain
- add SQLite DB layer tests and gated Postgres integration tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- guard static resource map and use BackendState snapshots (no data races)
- extract testable chooseBackend; use atomic Assign for new pads;
  fall back to redirect when a stored static backend is down
- http.Server with signal-driven graceful shutdown and DB close
- /healthz, /readyz and /metrics on the management port
- structured zap logging; record metrics on routing and DB outcomes
- add routing and backend-availability tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@qodo-free-for-open-source-projects

Review Summary by Qodo

Production-ready proxy: Postgres scaling, atomic assignment, concurrency safety, operability, and comprehensive testing

✨ Enhancement 🧪 Tests 📝 Documentation

Walkthroughs

Description
• **Postgres scaling now works**: Replaced lib/pq with jackc/pgx/v5, implemented dialect-correct
  SQL placeholders ($N for Postgres, ? for SQLite), and fixed ON CONFLICT upserts for both
  backends
• **Atomic pad assignment prevents split-brain**: Implemented Assign method using `INSERT ... ON
  CONFLICT DO NOTHING` + read-back pattern so the first proxy to win decides the backend and others
  read the winner
• **Concurrency hardening**: Replaced bare global AvailableBackends with thread-safe
  BackendState using sync.RWMutex; guarded static-resource map; snapshot-based cleanup loop to
  avoid holding locks during I/O
• **Production operability**: Added graceful shutdown (SIGINT/SIGTERM drain via
  http.Server/signal.NotifyContext), health endpoints (/healthz, /readyz), Prometheus metrics
  (/metrics), configurable managementPort, and structured zap logging
• **Comprehensive testing**: Added 10+ test files covering DB layer (SQLite and Postgres integration
  tests), routing decisions, BackendState concurrency, config validation, and availability checks
• **CI/CD and deployment**: New test.yml workflow with go vet, go test -race against Postgres
  service container; .goreleaser.yaml for cross-platform binary releases; release.yml for
  tag-triggered automation; systemd service unit for Linux deployment
• **Non-Docker installation docs**: README section with prebuilt binary download, source build,
  configuration, and database provisioning examples
• **Dependency updates**: Upgraded to Go 1.25, added jackc/pgx/v5 and prometheus/client_golang
Diagram
flowchart LR
  A["lib/pq driver<br/>SQLite-only SQL<br/>Racy assignment"] -->|"Replace driver"| B["jackc/pgx/v5<br/>Dialect-correct SQL<br/>Atomic Assign"]
  C["Bare global<br/>AvailableBackends<br/>Data races"] -->|"Thread-safe wrapper"| D["RWMutex-guarded<br/>BackendState<br/>Snapshots"]
  E["No graceful shutdown<br/>No observability<br/>No validation"] -->|"Add operability"| F["Signal handling<br/>Health endpoints<br/>Prometheus metrics<br/>Config validation"]
  B --> G["Production-ready<br/>proxy"]
  D --> G
  F --> G

File Changes

1. proxyHandler.go ✨ Enhancement +137/-127

Concurrency-safe routing with metrics and refactored backend selection

• Refactored createRoute to chooseBackend returning backend key instead of proxy, eliminating
 pointer issues
• Replaced global AvailableBackends with thread-safe BackendState snapshots for concurrent
 access
• Wrapped StaticResourceMap in a concurrency-safe staticResources type with mutex-guarded
 methods
• Added metrics instrumentation for requests, pad assignments, clashes, and DB errors
• Improved logging with structured zap logger and fixed race conditions in backend selection logic

proxyHandler.go


2. runtime.go ✨ Enhancement +92/-66

Graceful shutdown, health endpoints, and operability improvements

• Implemented graceful shutdown with signal handling (SIGINT/SIGTERM) and context-based server
 draining
• Added health endpoints /healthz (always 200) and /readyz (200 if backends up, else 503)
• Introduced configurable managementPort with default fallback to 8081
• Separated proxy and management servers into explicit http.Server instances
• Added Prometheus metrics handler at /metrics and improved error handling with proper cleanup

runtime.go


3. databases/sqlite/sqlite_db.go ✨ Enhancement +46/-37

Atomic assignment and dialect-correct SQL for SQLite

• Added Assign method implementing atomic pad-to-backend assignment via `INSERT ... ON CONFLICT DO
 NOTHING` + read-back
• Replaced INSERT OR REPLACE with dialect-correct ON CONFLICT upserts for both pad and
 clashes tables
• Stored StatementBuilder with sq.Question placeholder format in DB struct for consistent query
 building
• Improved error handling and code cleanup (removed blank lines, consolidated variable declarations)

databases/sqlite/sqlite_db.go


4. databases/postgres/postgres_db.go ✨ Enhancement +45/-34

Postgres driver upgrade and atomic assignment implementation

• Switched driver from lib/pq to jackc/pgx/v5 via stdlib adapter with "pgx" driver name
• Added Assign method with atomic INSERT ... ON CONFLICT DO NOTHING + read-back pattern
• Replaced INSERT OR REPLACE with Postgres-compatible ON CONFLICT upserts
• Stored StatementBuilder with sq.Dollar placeholder format for correct parameterized queries

databases/postgres/postgres_db.go
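To make the placeholder difference concrete: SQLite accepts `?` while Postgres expects numbered `$N` parameters. The sketch below shows the rewrite in plain Go. `toDollar` is a hypothetical helper, not the query builder's code, and the table and column names are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toDollar rewrites each ? in query to $1, $2, ... in order.
// It is deliberately naive: a ? inside a quoted SQL literal is rewritten too.
func toDollar(query string) string {
	var b strings.Builder
	n := 0
	for _, r := range query {
		if r == '?' {
			n++
			b.WriteByte('$')
			b.WriteString(strconv.Itoa(n))
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	sqliteQuery := "INSERT INTO pad (id, backend) VALUES (?, ?) ON CONFLICT (id) DO NOTHING"
	fmt.Println(toDollar(sqliteQuery))
}
```

In practice a builder configured with the right placeholder format per backend, as the bullets above describe, avoids hand-rolled rewriting like this.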


5. proxyHandler_test.go 🧪 Tests +135/-0

Unit tests for routing decisions and backend selection

• Added comprehensive unit tests for chooseBackend routing logic covering new pad assignment,
 stored backend reuse, reassignment when down, and clash detection
• Implemented fake DB for testing with Assign method and pad/clash tracking
• Created test helper newTestHandler to initialize ProxyHandler with seeded BackendState

proxyHandler_test.go


6. databases/sqlite/sqlite_db_test.go 🧪 Tests +108/-0

SQLite database layer integration tests

• Added integration tests for SQLite DB layer covering Set/Get, upsert behavior, atomic
 Assign, clash recording, and cleanup operations
• Tests verify atomicity of Assign (second call returns first winner) and proper error handling
 for missing rows

databases/sqlite/sqlite_db_test.go


7. databases/postgres/postgres_db_test.go 🧪 Tests +80/-0

Postgres database layer integration tests

• Added Postgres integration tests gated on PG_TEST_DSN environment variable with automatic skip
 when unset
• Tests cover atomic Assign, Set/Get upsert behavior, and clash recording with table
 truncation for deterministic reruns

databases/postgres/postgres_db_test.go


8. models/Settings.go ✨ Enhancement +42/-0

Settings validation and management port configuration

• Added ManagementPort field to Settings struct for configurable management server port
• Implemented comprehensive Validate() method checking port ranges, backend configuration, auth
 coherence, and database settings
• Added helper validPort() to ensure ports are in valid range (1-65535)

models/Settings.go


9. models/settings_test.go 🧪 Tests +75/-0

Settings validation unit tests

• Added table-driven tests for Settings.Validate() covering valid config, invalid ports, missing
 backends, database configuration errors, and partial auth configs

models/settings_test.go


10. checkAvailability_test.go 🧪 Tests +59/-0

Availability check integration tests

• Added integration tests for checkAvailability function using httptest backends with crafted
 /stats responses
• Tests verify correct classification of backends as up/available based on capacity and reachability

checkAvailability_test.go


11. metrics/metrics.go ✨ Enhancement +45/-0

Prometheus metrics package for observability

• Created new metrics package with Prometheus instrumentation for request outcomes, backend status,
 pad assignments, clashes, and DB errors
• Defined six metrics: RequestsTotal, BackendsUp, BackendsAvailable, PadAssignmentsTotal,
 ClashesTotal, DBErrorsTotal

metrics/metrics.go


12. models/AvailableBackends.go ✨ Enhancement +39/-5

Thread-safe backend state management with RWMutex

• Replaced bare AvailableBackends struct with thread-safe BackendState type using sync.RWMutex
• Implemented snapshot methods (SnapshotAvailable, SnapshotUp) returning copies to prevent
 external mutation
• Added SetState for atomic replacement and IsUp helper for backend status checks

models/AvailableBackends.go


13. models/backendstate_test.go 🧪 Tests +38/-0

BackendState concurrency and isolation tests

• Added unit tests for BackendState verifying snapshot isolation, IsUp checks, and concurrent
 access safety

models/backendstate_test.go


14. databases/interfaces/iDB.go ✨ Enhancement +4/-0

IDB interface extension for atomic assignment

• Added Assign method to IDB interface for atomic pad-to-backend assignment with race-safe
 semantics

databases/interfaces/iDB.go


15. main.go ✨ Enhancement +4/-0

Configuration validation at startup

• Added Settings.Validate() call before StartServer to fail fast on configuration errors

main.go


16. go.mod Dependencies +19/-5

Dependencies: pgx driver, Prometheus, and Go 1.25 upgrade

• Upgraded Go version requirement from 1.24.0 to 1.25.0
• Replaced github.com/lib/pq with github.com/jackc/pgx/v5 for Postgres support
• Added github.com/prometheus/client_golang for metrics
• Updated transitive dependencies for oauth2, net, sys, text, and tools

go.mod


17. .github/workflows/test.yml ⚙️ Configuration changes +33/-0

CI workflow for testing with race detector and Postgres

• Created new CI workflow running go vet and go test -race on every push and PR
• Includes Postgres 16 service container with health checks for integration testing
• Sets PG_TEST_DSN environment variable for Postgres tests

.github/workflows/test.yml


18. .goreleaser.yaml ⚙️ Configuration changes +32/-0

GoReleaser configuration for prebuilt binary releases

• Added GoReleaser configuration for cross-platform binary builds (linux/darwin/windows ×
 amd64/arm64)
• Configured archives with README, LICENSE, settings template, and systemd service file
• Enabled checksum generation for release artifacts

.goreleaser.yaml


19. .github/workflows/release.yml ⚙️ Configuration changes +23/-0

GitHub Actions workflow for automated releases

• Created tag-triggered release workflow that runs GoReleaser on v* tags
• Configured with GitHub token permissions for publishing to Releases

.github/workflows/release.yml


20. support/etherpad-proxy.service ⚙️ Configuration changes +21/-0

Systemd service unit for Linux deployment

• Added systemd unit file for running etherpad-proxy as a service with dedicated user
• Configured working directory, environment variables, restart policy, and security hardening
 options

support/etherpad-proxy.service


21. docs/superpowers/specs/2026-05-28-production-ready-proxy-design.md 📝 Documentation +218/-0

Production-ready proxy design specification and architecture

• Comprehensive design document detailing fixes for scaling correctness, concurrency, operability,
 and quality
• Covers atomic assignment via Assign, thread-safe BackendState, graceful shutdown, health
 endpoints, config validation, and metrics
• Includes test strategy, CI/CD approach, and non-Docker installation documentation

docs/superpowers/specs/2026-05-28-production-ready-proxy-design.md


22. README.md 📝 Documentation +88/-0

Non-Docker installation and deployment documentation

• Added "Installation without Docker" section with prebuilt binary download and source build
 instructions
• Documented configuration steps, database choices (SQLite vs Postgres), and systemd service setup
• Included provisioning examples and environment variable usage for non-Docker deployments

README.md


23. docs/superpowers/plans/2026-05-28-production-ready-proxy.md 📝 Documentation +2105/-0

Production-ready proxy implementation plan with 11 tasks

• Comprehensive implementation plan for production-ready etherpad-proxy with 11 tasks covering
 database layer, concurrency safety, observability, and deployment
• Task 1: Swap lib/pq to jackc/pgx/v5, implement dialect-correct SQL placeholders and upserts,
 add atomic Assign method to prevent multi-proxy split-brain
• Tasks 2–8: Add Postgres integration tests (gated on PG_TEST_DSN), RWMutex-guarded
 BackendState, Prometheus metrics package, config validation with ManagementPort, refactored
 routing with testable chooseBackend, graceful shutdown with signal handling,
 health/readiness/metrics endpoints
• Tasks 9–11: CI workflow with go vet and race tests against Postgres service, non-Docker
 installation docs with systemd unit, GoReleaser cross-platform binary releases

docs/superpowers/plans/2026-05-28-production-ready-proxy.md


24. go.sum Additional files +58/-17

...

go.sum



qodo-free-for-open-source-projects Bot commented May 28, 2026

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (0) 📎 Requirement gaps (1)


Action required

1. Broken config template 🐞 Bug ≡ Correctness
Description
README now instructs copying settings.json.template, but the shipped templates are invalid JSON
(trailing commas) and use tokenURL while the code expects tokenUrl, so startup will fatally fail
at json.Unmarshal/Validate for common OAuth configs.
Code

README.md[R49-59]

Evidence
README instructs copying the template; the program unmarshals JSON and validates; the templates
contain trailing commas and tokenURL, while the struct tag/validation requires tokenUrl, causing
startup failure with the documented/template configuration.

README.md[49-60]
main.go[21-33]
models/Settings.go[22-60]
settings.json.template[1-27]
settings.json.sqlite.template[1-27]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The project recommends copying `settings.json.template`, but the template(s) are not parseable by `encoding/json` (trailing commas) and use the key `tokenURL` while the code unmarshals `tokenUrl` and then enforces it via `Settings.Validate()`. This makes the documented setup path fail at startup.

## Issue Context
- `main.go` uses `encoding/json.Unmarshal` and then calls `Settings.Validate()`.
- `models.Backend.TokenURL` is tagged `json:"tokenUrl"` and validation requires it when OAuth is configured.
- Both `settings.json.template` and `settings.json.sqlite.template` currently use `tokenURL` and include trailing commas.

## Fix Focus Areas
- README.md[49-60]
- settings.json.template[1-27]
- settings.json.sqlite.template[1-27]
- models/Settings.go[22-60]
- main.go[21-33]

## Suggested fix
1. Make both template files valid JSON (remove trailing commas).
2. Decide on one key name and make it consistent across:
  - templates
  - README OAuth section (currently says `tokenURL`)
  - code (`models.Backend.TokenURL` tag and validation message)
3. Strongly consider backward compatibility for existing configs that likely followed the README (`tokenURL`): implement custom unmarshalling or accept both `tokenURL` and `tokenUrl` and normalize into one field before validation/usage.
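One detail that may be worth confirming before adding custom unmarshalling: `encoding/json` is documented to prefer an exact key match but also accept a case-insensitive one, so a `tokenURL` key should still populate a field tagged `tokenUrl`, whereas a trailing comma is rejected outright. A minimal sketch, using a hypothetical `backend` struct reduced to the one field and a made-up URL:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// backend mirrors only the field under discussion; the tag matches the one
// quoted in the review (`json:"tokenUrl"`).
type backend struct {
	TokenURL *string `json:"tokenUrl"`
}

func parse(raw string) (*backend, error) {
	var b backend
	if err := json.Unmarshal([]byte(raw), &b); err != nil {
		return nil, err
	}
	return &b, nil
}

func main() {
	// The key differs from the tag only by letter case.
	b, err := parse(`{"tokenURL": "https://idp.example/token"}`)
	fmt.Println(b != nil && b.TokenURL != nil, err)

	// A trailing comma is a syntax error for encoding/json.
	_, err = parse(`{"tokenUrl": "https://idp.example/token",}`)
	fmt.Println(err != nil)
}
```

If that holds for the project's Go version, removing the trailing commas is the part that unblocks startup, and aligning the key name is a consistency fix.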



2. No HTTP timeouts 🐞 Bug ☼ Reliability
Description
ScrapeJSFiles and the cleanup loop perform outbound HTTP calls without any client/request timeout,
so one stuck backend connection can hang those goroutines indefinitely and prevent ongoing
routing/cleanup state refresh.
Code

proxyHandler.go[R81-90]

Evidence
The scraper uses http.Get directly, and cleanup uses a zero-timeout http.Client; similarly,
availability uses http.Get. With no timeout, these calls can block indefinitely, preventing the
loops from progressing.

proxyHandler.go[81-110]
runtime.go[55-90]
checkAvailability.go[22-35]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Several background loops do outbound HTTP I/O using `http.Get()` / `&http.Client{}` with no timeout. If a backend accepts a TCP connection but never responds (or stalls mid-response), the goroutine can block indefinitely.

## Issue Context
This impacts:
- Static resource scraping (`ScrapeJSFiles`) used for routing padbootstrap resources.
- Pad reconciliation (`cleanUpEtherpads`) which calls Etherpad’s `listAllPads` API.
- Backend availability checks (`checkAvailability`) that drive `BackendState` updates.

## Fix Focus Areas
- proxyHandler.go[81-110]
- runtime.go[55-90]
- checkAvailability.go[22-51]

## Suggested fix
1. Create a shared `http.Client` with a reasonable `Timeout` (e.g., 5–10s) and reuse it.
2. Prefer per-request context timeouts (`context.WithTimeout`) for clearer cancellation.
3. Ensure bodies are closed in all paths (including non-2xx responses) and consider checking `StatusCode` before parsing.
4. In `ScrapeJSFiles`, consider using a ticker and making the HTTP calls cancellable on shutdown (optional but improves operability).



3. Validation misses invariants 🐞 Bug ☼ Reliability
Description
Settings.Validate doesn’t reject default-zero values like checkInterval<=0 or
maxPadsPerInstance<=0, nor does it prevent managementPort/port collisions, which can cause a
busy-loop availability checker, no available backends, or bind failures at runtime.
Code

models/Settings.go[R33-66]

Evidence
Validate currently checks ports are in range but not checkInterval/maxPads invariants; runtime uses
checkInterval directly to compute sleep duration, and availability logic uses maxPadsPerInstance to
decide availability, making default-zero configs pathological.

models/Settings.go[35-66]
runtime.go[30-39]
runtime.go[178-181]
checkAvailability.go[48-50]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`Settings.Validate()` was added, but it misses checks that prevent severe runtime behavior when settings fields are omitted or set to unsafe values.

## Issue Context
- `runtime.checkAvailabilityLoop` sleeps `time.Duration(settings.CheckInterval) * time.Millisecond`; if `CheckInterval` is 0, it can spin.
- `checkAvailability` compares `ActivePads >= settings.MaxPadsPerInstance`; if `MaxPadsPerInstance` is 0, every backend becomes “full” and routing for new pads fails.
- Management server and proxy server can be configured to use the same port (or `port` can equal the default management port when `managementPort` is left as 0), causing `ListenAndServe` to fail.

## Fix Focus Areas
- models/Settings.go[35-66]
- runtime.go[28-40]
- runtime.go[174-181]
- checkAvailability.go[48-50]

## Suggested fix
1. In `Settings.Validate()` enforce:
  - `CheckInterval > 0`
  - `MaxPadsPerInstance > 0`
2. Validate port collisions:
  - If `ManagementPort != 0`, ensure `ManagementPort != Port`.
  - If `ManagementPort == 0`, ensure `Port != 8081` (or move this check to `StartServer` after defaulting `managementPort`).
3. Consider validating `CheckInterval` upper bounds (optional) to prevent extremely slow availability refresh in production.
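The additional checks could be expressed roughly as follows. The `settings` struct here is a hypothetical reduction carrying only the four fields involved, `validateInvariants` is an illustrative name, and the error wording is an assumption; the 8081 default comes from the PR description.

```go
package main

import (
	"errors"
	"fmt"
)

const defaultManagementPort = 8081 // default named in the PR description

// settings carries only the fields the extra checks need.
type settings struct {
	Port               int
	ManagementPort     int // 0 means "use the default"
	CheckInterval      int // milliseconds
	MaxPadsPerInstance int
}

// validateInvariants rejects zero or negative intervals and capacities, and
// port collisions after the management port default has been applied.
func (s settings) validateInvariants() error {
	if s.CheckInterval <= 0 {
		return errors.New("checkInterval must be greater than 0")
	}
	if s.MaxPadsPerInstance <= 0 {
		return errors.New("maxPadsPerInstance must be greater than 0")
	}
	mgmt := s.ManagementPort
	if mgmt == 0 {
		mgmt = defaultManagementPort
	}
	if mgmt == s.Port {
		return fmt.Errorf("port and managementPort must differ, both resolve to %d", mgmt)
	}
	return nil
}

func main() {
	omitted := settings{Port: 9000} // checkInterval and maxPadsPerInstance left at zero
	fmt.Println(omitted.validateInvariants())

	clash := settings{Port: 8081, CheckInterval: 1000, MaxPadsPerInstance: 5}
	fmt.Println(clash.validateInvariants())
}
```

Resolving the default before comparing handles both collision cases from point 2 with a single check.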




Remediation recommended

4. README missing non-Docker verify step 📎 Requirement gap ⚙ Maintainability
Description
The new non-Docker installation guide explains how to download/build, configure, and run the proxy,
but it does not include explicit steps to verify the service is working after startup (e.g.,
checking /healthz, /readyz, and /metrics). This can leave users unsure whether a non-Docker install
succeeded, failing the step-by-step requirement.
Code

README.md[R61-69]

Evidence
PR Compliance ID 2 requires non-Docker instructions to include how to run and verify the service. In
README.md, the new section includes "Run" commands and describes ports/endpoints, but provides no
concrete verification steps (e.g., curl checks) after startup.

Provide step-by-step installation instructions for non-Docker deployments
README.md[61-69]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The README's "Installation without Docker" guide lacks an explicit "verify" step after "Run", so users don't have clear commands to confirm a successful startup.

## Issue Context
The compliance checklist requires a sequential non-Docker installation guide that includes how to run *and verify* the service.

## Fix Focus Areas
- README.md[61-69]



5. ReadAll error ignored 🐞 Bug ≡ Correctness
Description
checkAvailability discards the error from io.ReadAll(response.Body) and proceeds as if the body
was valid, which can misclassify a backend as up/available based on partial or empty data.
Code

runtime.go[R30-39]

Evidence
The runtime loop calls checkAvailability continuously to set backend state; in
checkAvailability, the io.ReadAll error is not checked before attempting to parse the response
JSON, so partial reads can be interpreted as successful stats.

runtime.go[30-39]
checkAvailability.go[29-46]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`checkAvailability` captures the error returned by `io.ReadAll(...)` but never checks it; it then overwrites `err` with the result of `response.Body.Close()` and continues to `json.Unmarshal`. This can treat a failed or partial read as valid input.

## Issue Context
`runtime.checkAvailabilityLoop` depends on `checkAvailability` output to update the in-memory `BackendState`, so misclassification directly affects routing decisions.

## Fix Focus Areas
- checkAvailability.go[29-46]
- runtime.go[30-39]

## Suggested fix
1. Check the `io.ReadAll` error immediately; if non-nil, close the body and mark the backend down/unavailable.
2. Consider checking `response.StatusCode == 200` before reading/parsing.
3. Avoid reusing `err` for close errors in a way that masks the primary failure (use a separate variable for close errors).




@SamTV12345 SamTV12345 merged commit d5ef1b2 into main May 28, 2026
1 check passed
@SamTV12345 SamTV12345 deleted the production-ready-proxy branch May 28, 2026 21:14
Comment thread README.md
Comment on lines +49 to +59
### Configure

1. Copy the template and edit it:
```bash
cp settings.json.template settings.json
```
2. Set `port` (proxy listen port), optionally `managementPort`
(default `8081`, serves `/pads`, `/metrics`, `/healthz`, `/readyz`), your
`backends`, and a database (see **Database** below).
3. The settings path defaults to `./settings.json`; override it with the
`SETTINGS_FILE` environment variable.

Comment thread proxyHandler.go
Comment on lines +81 to 90
 func ScrapeJSFiles(settings models.Settings, static *staticResources, logger *zap.SugaredLogger) {
 	go func() {
 		for {
-			for key, backend := range backends.Backends {
+			for key, backend := range settings.Backends {
 				response, err := http.Get("http://" + backend.Host + ":" + strconv.Itoa(backend.Port) + "/p/test")
 				if err != nil {
-					log.Println("Error while scraping JS files: ", err)
+					logger.Warnf("Error while scraping JS files: %v", err)
 					continue
 				}
 
 				doc, err := goquery.NewDocumentFromReader(response.Body)

Comment thread models/Settings.go
Comment on lines +33 to +66
func validPort(p int) bool { return p > 0 && p <= 65535 }

// Validate checks that the settings are internally consistent and usable.
func (s Settings) Validate() error {
	if !validPort(s.Port) {
		return fmt.Errorf("port must be between 1 and 65535, got %d", s.Port)
	}
	if s.ManagementPort != 0 && !validPort(s.ManagementPort) {
		return fmt.Errorf("managementPort must be between 1 and 65535, got %d", s.ManagementPort)
	}
	if len(s.Backends) == 0 {
		return errors.New("at least one backend must be configured")
	}
	for name, b := range s.Backends {
		if b.Host == "" {
			return fmt.Errorf("backend %q is missing host", name)
		}
		if !validPort(b.Port) {
			return fmt.Errorf("backend %q has invalid port %d", name, b.Port)
		}
		if (b.Username != nil) != (b.Password != nil) {
			return fmt.Errorf("backend %q must set both username and password", name)
		}
		hasOAuth := b.ClientId != nil && b.ClientSecret != nil && b.TokenURL != nil
		if (b.ClientId != nil || b.ClientSecret != nil || b.TokenURL != nil) && !hasOAuth {
			return fmt.Errorf("backend %q must set clientId, clientSecret and tokenUrl together", name)
		}
	}
	hasFile := s.DBSettings.Filename != ""
	hasConn := s.DBSettings.Connstr != ""
	if hasFile == hasConn {
		return errors.New("exactly one of dbSettings.filename or dbSettings.postgresConnstr must be set")
	}
	return nil

Development

Successfully merging this pull request may close these issues.

does the etherpad-proxy itself scale?

1 participant