RFC: Rate Limiting & Brute Force Protection

## RFC: Rate Limiting & Brute Force Protection

**Phase**: 1 — Security Hardening & Enterprise Foundation  
**Priority**: P0 — Critical  
**Estimated Effort**: Medium  

---

### Problem Statement

Authorizer currently has **zero protection** against credential stuffing, brute force attacks, or API abuse. Competing products (WorkOS Radar, Clerk Bot Protection, Keycloak Brute Force Detector) ship rate limiting or brute force detection as a core feature. Without this, Authorizer cannot be recommended for production enterprise use.

The current middleware chain (`LoggerMiddleware → ContextMiddleware → CORSMiddleware → ClientCheckMiddleware`) has no rate limiting layer.

---

### Current Architecture Context

- HTTP framework: **Gin** (`gin-gonic/gin`)
- Middleware chain defined in `internal/server/http_routes.go`
- Memory store layer exists with Redis and DB-backed implementations (`internal/memory_store/`)
- Session tokens already use key patterns like `{userId}:{token_type}_{nonce}` in the memory store
- Config parsed via Cobra CLI flags in `cmd/root.go`
- No rate limiting library currently in `go.mod`

---

### Proposed Solution

#### 1. Rate Limiter Middleware

**Algorithm**: Token bucket via `golang.org/x/time/rate` for the in-memory backend; sliding window log (a Redis sorted set of request timestamps) for the Redis-backed backend.

**Why a token bucket + sliding window hybrid**: A token bucket is simple and efficient for single-instance deployments, where each process can keep its own state. For distributed deployments (multiple Authorizer instances behind a load balancer), per-process buckets would let a client spend the full limit on every instance, so the counters must live in Redis and be updated atomically across instances. The memory store abstraction already supports this pattern.

**New middleware**: `internal/http_handlers/rate_limit_middleware.go`

```go
type RateLimitConfig struct {
    // Per-IP limits
    RequestsPerWindow int           // default: 100
    WindowDuration    time.Duration // default: 60s
    
    // Auth-specific limits (stricter)
    AuthRequestsPerWindow int           // default: 20
    AuthWindowDuration    time.Duration // default: 60s
    
    // Enabled flag
    Enabled bool
}
```

**Implementation approach**:
- Add `RateLimitMiddleware` to the Gin middleware chain, placed **after** `LoggerMiddleware` and **before** `CORSMiddleware`
- Use `c.ClientIP()` (Gin's built-in, respects `X-Forwarded-For` with trusted proxies) as the rate limit key
- For authenticated endpoints, use `{user_id}:{client_ip}` composite key
- Auth endpoints (`/oauth/token`, `/graphql` mutations `login`, `signup`, `verify_otp`, `magic_link_login`, `forgot_password`) get stricter limits
- Return `429 Too Many Requests` with `Retry-After` header and JSON error body
- Store counters in memory store (Redis when available, in-memory with DB fallback)

**Redis sliding window implementation** (atomic via Lua script):
```lua
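-- KEYS[1] = rate limit key
-- ARGV[1] = window size in ms, ARGV[2] = current timestamp ms, ARGV[3] = max requests
-- ARGV[4] = unique request ID supplied by the caller (assumed addition, used as the member)
local window = tonumber(ARGV[1])
local now = tonumber(ARGV[2])
-- Drop entries that have slid out of the window
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - window)
local count = redis.call('ZCARD', KEYS[1])
if count < tonumber(ARGV[3]) then
    -- math.random() is avoided for the member: Redis may seed it identically on each
    -- script run, so two requests in the same millisecond could collide and count once
    redis.call('ZADD', KEYS[1], now, ARGV[4])
    redis.call('PEXPIRE', KEYS[1], window)
    return 0  -- allowed
end
return 1  -- blocked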
```

#### 2. Login Attempts Table & Account Lockout (Sliding Window)

Instead of adding `failed_login_count`/`locked_until` columns to the User table, we use a **dedicated `LoginAttempt` table** that tracks every login attempt with full metadata. Lockout is determined by counting failures within a sliding time window — no explicit lock/unlock state needed.

**Why this approach over per-user columns:**
- **Sliding window is more accurate** — 10 failures in 15 minutes is suspicious; 10 failures over 6 months isn't. A simple counter can't distinguish these.
- **Multi-dimensional detection** — same IP hitting many accounts (credential stuffing) vs many IPs hitting one account (distributed brute force). Per-user columns can't do this.
- **Natural retry after lockout** — no need to "reset" anything. Once old attempts fall outside the window, the user is automatically unlocked.
- **Full audit/forensics** — every attempt is preserved with IP, user agent, method, and failure reason. Feeds directly into the Audit Log system (Phase 1.3).
- **Clean User schema** — no security-state pollution on the User table.

**New schema**: `internal/storage/schemas/login_attempt.go`

```go
type LoginAttempt struct {
    ID            string `json:"id" gorm:"primaryKey;type:char(36)"`
    UserID        string `json:"user_id" gorm:"type:char(36);index:idx_login_attempt_user_time"`         // empty for non-existent users, which are tracked by email
    Email         string `json:"email" gorm:"type:varchar(256);index:idx_login_attempt_email_time"`       // always populated
    IPAddress     string `json:"ip_address" gorm:"type:varchar(45);index:idx_login_attempt_ip_time"`      // supports IPv6
    UserAgent     string `json:"user_agent" gorm:"type:text"`
    Method        string `json:"method" gorm:"type:varchar(50)"`                                          // password, otp, magic_link, totp, social
    Success       bool   `json:"success" gorm:"type:bool;default:false"`
    FailureReason string `json:"failure_reason" gorm:"type:varchar(100)"`                                 // invalid_password, account_not_found, mfa_failed, account_locked, etc.
    // CreatedAt shares each index name so the indexes become composite; with equal
    // priority GORM orders index columns by field position, so created_at comes second.
    CreatedAt     int64  `json:"created_at" gorm:"autoCreateTime;index:idx_login_attempt_user_time;index:idx_login_attempt_email_time;index:idx_login_attempt_ip_time"`
}
```

**Composite indexes for query performance:**
- `(user_id, created_at)` — per-user lockout checks
- `(email, created_at)` — lockout checks when user_id is unknown
- `(ip_address, created_at)` — per-IP credential stuffing detection

**New storage interface methods:**
```go
// AddLoginAttempt records a login attempt (success or failure)
AddLoginAttempt(ctx context.Context, attempt *schemas.LoginAttempt) error
// CountFailedAttempts counts failed login attempts for a user/email within a time window
CountFailedAttempts(ctx context.Context, email string, since int64) (int64, error)
// CountFailedAttemptsByIP counts failed attempts from an IP within a time window
CountFailedAttemptsByIP(ctx context.Context, ip string, since int64) (int64, error)
// ListLoginAttempts returns login attempts for a user (for admin/audit views)
ListLoginAttempts(ctx context.Context, userID string, pagination *model.Pagination) ([]*schemas.LoginAttempt, *model.Pagination, error)
// DeleteLoginAttemptsBefore removes attempts older than a timestamp (retention cleanup)
DeleteLoginAttemptsBefore(ctx context.Context, before int64) error
```

**Lockout logic in login flow** (`internal/graphql/login.go` and related auth handlers):

```go
// Before password/OTP verification:
now := time.Now()
windowStart := now.Add(-lockoutWindow).Unix()
failedCount, err := store.CountFailedAttempts(ctx, email, windowStart)
if err != nil {
    return err // do not silently skip the lockout check on a storage error
}

if failedCount >= lockoutThreshold {
    // The lock lifts when the oldest failure in the window ages out, i.e. at
    // oldestAttemptInWindow + lockoutWindow. oldestAttemptInWindow (Unix seconds)
    // still has to be fetched from storage; no proposed interface method returns it yet.
    retryAfter := oldestAttemptInWindow + int64(lockoutWindow.Seconds()) - now.Unix()
    return Error("account_temporarily_locked", retryAfter)
}

// After verification:
attempt := &schemas.LoginAttempt{
    UserID:        user.ID, // empty if user not found (guard against a nil user)
    Email:         email,
    IPAddress:     clientIP,
    UserAgent:     userAgent,
    Method:        "password",
    Success:       passwordValid,
    FailureReason: failureReason, // "" on success
}
store.AddLoginAttempt(ctx, attempt)
```

**How retry after lockout works:**
- Lockout window = 15 minutes (configurable)
- Threshold = 10 failed attempts (configurable)
- If a user has 10 failures in the last 15 minutes → locked
- As time passes, old failures slide out of the window → naturally unlocked
- No state to reset — the window handles everything
- Example: 10 failures between 10:00–10:05, window=15min → locked until 10:15 (when the first failure at 10:00 falls outside the window). At 10:15 only 9 failures remain in window → unlocked.

**Admin override** — `_unlock_user(user_id: ID!): Response`  
Deletes recent failed attempts for the user, immediately bringing count below threshold. Used for emergency unlocks without waiting for the window to expire.

**Credential stuffing detection** (per-IP):
```go
ipFailedCount, _ := store.CountFailedAttemptsByIP(ctx, clientIP, windowStart)
if ipFailedCount >= ipThreshold {  // e.g., 50 failed attempts from same IP
    // Block IP temporarily, or trigger CAPTCHA challenge
}
```

#### 3. IP Blocking/Allowlisting

**New schema**: `internal/storage/schemas/ip_rule.go`
```go
type IPRule struct {
    ID        string `json:"id" gorm:"primaryKey;type:char(36)"`
    IP        string `json:"ip" gorm:"type:varchar(49);uniqueIndex"` // IPv4/IPv6 address or CIDR; 45 chars for the longest IPv6 text form plus up to 4 for a "/128" suffix
    Type      string `json:"type" gorm:"type:varchar(10)"`           // "block" or "allow"
    Reason    string `json:"reason" gorm:"type:text"`
    ExpiresAt int64  `json:"expires_at"`                             // 0 = permanent
    CreatedAt int64  `json:"created_at"`
}
```

**New storage interface methods:**
```go
AddIPRule(ctx context.Context, rule *schemas.IPRule) (*schemas.IPRule, error)
DeleteIPRule(ctx context.Context, id string) error
ListIPRules(ctx context.Context, ruleType string, pagination *model.Pagination) ([]*schemas.IPRule, *model.Pagination, error)
GetIPRuleByIP(ctx context.Context, ip string) (*schemas.IPRule, error)
```

**Middleware check**: Early in request pipeline, check IP against cached allowlist/blocklist. Use memory store for caching (refresh every 60s from DB).

**Automatic IP blocking**: When credential stuffing is detected (high failed attempts from single IP across multiple accounts), automatically create a temporary IP block rule.

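**Admin GraphQL operations** (`_list_ip_rules` is read-only and would normally be exposed as a query; the other two are mutations):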
- `_add_ip_rule(ip: String!, type: String!, reason: String, expires_at: Int64): IPRule`
- `_remove_ip_rule(id: ID!): Response`
- `_list_ip_rules(params: PaginatedInput, type: String): IPRules`

#### 4. Leaked Password Detection

**Integration**: Have I Been Pwned k-Anonymity API (https://api.pwnedpasswords.com/range/{SHA1_PREFIX})

**How it works** (privacy-preserving):
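1. Compute the SHA-1 hash of the password and hex-encode it (40 characters)
2. Send only the first 5 hex characters of the hash to the HIBP API
3. Compare the remaining 35-character suffix against the list of suffixes the API returns
4. Neither the password nor its full hash ever leaves the server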

**Implementation**: New utility `internal/utils/password_check.go`
```go
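// Needs: crypto/sha1, encoding/hex, fmt, io, net/http, strings.
func IsPasswordLeaked(password string) (bool, error) {
    hash := sha1.Sum([]byte(password))
    hexHash := strings.ToUpper(hex.EncodeToString(hash[:]))
    prefix, suffix := hexHash[:5], hexHash[5:]
    resp, err := http.Get("https://api.pwnedpasswords.com/range/" + prefix)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    if err != nil || resp.StatusCode != http.StatusOK {
        return false, fmt.Errorf("hibp range lookup failed (status %d): %v", resp.StatusCode, err)
    }
    // Each response line is "<35-char hash suffix>:<count>".
    return strings.Contains(string(body), suffix+":"), nil
}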
```

**Integration points**: Called during `signup` and `reset_password` mutations when `--check-leaked-passwords=true`.

#### 5. Retention & Cleanup

**Login attempts grow over time** — automatic cleanup is essential:

- CLI flag: `--login-attempt-retention-days=90`
- Background goroutine runs `DeleteLoginAttemptsBefore()` daily
- Aligns with audit log retention (Phase 1.3) — same cleanup pattern
- Expired IP rules also cleaned up in the same sweep

---

### CLI Configuration Flags

```
--enable-rate-limit=true                    # Enable/disable rate limiting
--rate-limit-requests=100                   # Requests per window (general)
--rate-limit-window=60s                     # Window duration (general)
--rate-limit-auth-requests=20               # Requests per window (auth endpoints)
--rate-limit-auth-window=60s                # Window duration (auth endpoints)
--account-lockout-threshold=10              # Failed attempts before lockout
--account-lockout-window=15m                # Sliding window duration
--account-lockout-ip-threshold=50           # Per-IP failed attempts before blocking
--check-leaked-passwords=false              # Enable HIBP password check
--login-attempt-retention-days=90           # Days to keep login attempt records
```

---

### Migration Strategy

1. Create `login_attempts` table/collection across all 13+ DB providers with composite indexes
2. Create `ip_rules` table/collection across all DB providers
3. Add memory store methods for rate limit counters
4. No changes to User schema
5. Rate limiting defaults to **enabled** for new deployments, documented flag to disable

---

### Testing Plan

- Unit tests for token bucket and sliding window algorithms
- Integration tests for sliding window lockout flow:
  - Fail N times → locked → wait for window to slide → retry succeeds
  - Fail N times → admin unlock (delete attempts) → retry succeeds immediately
- Integration tests for IP blocking middleware
- Integration tests for credential stuffing detection (per-IP threshold)
- Load tests to verify rate limiting under concurrent requests
- Test with Redis and in-memory memory store backends
- Test HIBP API integration with known-leaked passwords
- Test retention cleanup removes old records correctly

---

### References

- [OWASP Brute Force Prevention](https://cheatsheetseries.owasp.org/cheatsheets/Authentication_Cheat_Sheet.html#account-lockout)
- [RFC 6585 — 429 Too Many Requests](https://tools.ietf.org/html/rfc6585#section-4)
- [Have I Been Pwned API](https://haveibeenpwned.com/API/v3#SearchingPwnedPasswordsByRange)
- [Token Bucket Algorithm](https://en.wikipedia.org/wiki/Token_bucket)
- [Keycloak Brute Force Detection](https://www.keycloak.org/docs/latest/server_admin/#password-guess-brute-force-attacks)
- [OWASP Credential Stuffing Prevention](https://cheatsheetseries.owasp.org/cheatsheets/Credential_Stuffing_Prevention_Cheat_Sheet.html)

RFC: Rate Limiting & Brute Force Protection #501