
RFC-0005: Implement ClickHouse Storage Schema #5

@copyleftdev

Description


📋 RFC Reference

Document: RFC-0005: ClickHouse Storage Schema
Version: 0.2.0
Phase: Phase 4 - Storage & Reliability
Depends On: #1 (RFC-0001), #4 (RFC-0004)

🎯 Objective

Implement the ClickHouse storage schema for long-term session telemetry, including replication, a backup strategy, disaster recovery, and validated, realistic compression ratios.

📦 Acceptance Criteria

ClickHouse Schema

  • scrybe.sessions table created with ReplicatedMergeTree
  • Primary key: toStartOfHour(timestamp), session_id
  • Partitioning by toYYYYMMDD(timestamp)
  • TTL configured: 90 days
  • All columns defined per RFC-0005
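
A minimal DDL sketch of what these items imply, assuming the handful of columns named in this issue (the full column set comes from RFC-0005) and the usual {shard}/{replica} path convention; the authoritative version belongs in deployment/clickhouse/schema.sql:

```sql
-- Sketch only: the column list is illustrative, not the full RFC-0005 definition.
CREATE DATABASE IF NOT EXISTS scrybe;

CREATE TABLE IF NOT EXISTS scrybe.sessions
(
    timestamp        DateTime64(3),
    session_id       UUID,
    fingerprint_hash String,
    ip               String,      -- String so the token bloom filter index applies
    bot_probability  Float32
    -- ... remaining columns per RFC-0005
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/scrybe.sessions', '{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (toStartOfHour(timestamp), session_id)   -- doubles as the primary key
TTL toDateTime(timestamp) + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;
```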

Indexes

  • Token bloom filter on fingerprint_hash
  • Token bloom filter on ip
  • Min-max index on bot_probability
  • Index granularity: 8192
  • Indexes validated with EXPLAIN queries
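
A sketch of the skip indexes, assuming fingerprint_hash and ip are String columns (tokenbf_v1 tokenizes strings); the tokenbf_v1 sizing parameters and GRANULARITY values are placeholders to tune:

```sql
ALTER TABLE scrybe.sessions
    ADD INDEX idx_fingerprint fingerprint_hash TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 4,
    ADD INDEX idx_ip          ip               TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 4,
    ADD INDEX idx_bot_prob    bot_probability  TYPE minmax GRANULARITY 4;

-- Build the indexes for parts that already exist:
ALTER TABLE scrybe.sessions MATERIALIZE INDEX idx_fingerprint;

-- Confirm an index is actually used:
EXPLAIN indexes = 1
SELECT count() FROM scrybe.sessions WHERE fingerprint_hash = 'abc123';
```

Note that GRANULARITY here counts index_granularity blocks (8192 rows each), not rows.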

Materialized Views

  • scrybe.hourly_stats view created
  • scrybe.fingerprint_clusters view created
  • POPULATE directive used (backfill existing data)
  • Aggregation logic validated
  • Query performance tested
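
One way scrybe.hourly_stats could be written; the aggregates shown are illustrative, and scrybe.fingerprint_clusters would follow the same pattern with its own grouping key. POPULATE backfills existing rows, but rows inserted while the backfill runs can be missed, so run it during a quiet window or reconcile afterwards:

```sql
CREATE MATERIALIZED VIEW IF NOT EXISTS scrybe.hourly_stats
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY hour
POPULATE
AS
SELECT
    toStartOfHour(timestamp)    AS hour,
    countState()                AS sessions,
    uniqState(fingerprint_hash) AS unique_fingerprints,
    avgState(bot_probability)   AS avg_bot_probability
FROM scrybe.sessions
GROUP BY hour;

-- Reading the view requires the matching -Merge combinators:
SELECT
    hour,
    countMerge(sessions)           AS sessions,
    uniqMerge(unique_fingerprints) AS unique_fingerprints,
    avgMerge(avg_bot_probability)  AS avg_bot_probability
FROM scrybe.hourly_stats
GROUP BY hour
ORDER BY hour DESC
LIMIT 24;
```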

Replication Setup

  • ZooKeeper cluster configured (3 nodes)
  • ReplicatedMergeTree configured
  • Shard and replica macros defined
  • 2 replicas minimum per shard
  • Replication lag monitored
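
The {shard} and {replica} macros used in the ReplicatedMergeTree path are defined per node in the server config (the <macros> section); replication lag can be watched directly from system.replicas, for example:

```sql
-- One row per replicated table on this node; alert when absolute_delay > 10.
SELECT
    database,
    table,
    is_leader,
    absolute_delay,     -- seconds this replica is behind
    queue_size,
    inserts_in_queue
FROM system.replicas
WHERE database = 'scrybe';
```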

Backup Strategy

  • clickhouse-backup installed
  • S3 backup destination configured
  • Daily incremental backups scheduled
  • Weekly full backups scheduled
  • Backup retention: 30 days remote, 3 days local
  • Backup validation script created
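
A minimal backup-config.yml sketch for clickhouse-backup. The bucket, region, and credential wiring are placeholders, key names should be checked against the installed clickhouse-backup version, and the keep-counts only approximate the retention windows above because the tool is run once per day (scheduling itself lives in cron or a systemd timer, not in this file):

```yaml
general:
  remote_storage: s3
  backups_to_keep_local: 3     # ~3 days locally with one backup per day
  backups_to_keep_remote: 30   # ~30 days in S3
s3:
  bucket: scrybe-clickhouse-backups   # placeholder bucket name
  region: us-east-1                   # placeholder region
  path: backups
  access_key: "<from secrets manager>"
  secret_key: "<from secrets manager>"
```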

Disaster Recovery

  • DR runbook created and tested
  • RTO < 30 minutes validated
  • RPO < 1 hour validated
  • Point-in-time recovery tested
  • Full cluster failure recovery tested
  • DNS failover procedure documented

Compression Configuration

  • ZSTD compression level 3 configured
  • Compression ratio 10-20:1 validated
  • Storage calculations verified
  • Min part size configured (10MB)
  • Compression benchmarks run
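
The codec can be set per column (or cluster-wide in config.xml, whose <compression> section is also where the 10MB min_part_size threshold lives); a per-column example plus a ratio check against system.parts:

```sql
ALTER TABLE scrybe.sessions
    MODIFY COLUMN fingerprint_hash String CODEC(ZSTD(3));

-- Validate the achieved ratio against the 10-20:1 target:
SELECT
    table,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 1) AS ratio
FROM system.parts
WHERE database = 'scrybe' AND table = 'sessions' AND active
GROUP BY table;
```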

Performance Tuning

  • Write throughput > 100k sessions/sec
  • Query latency < 100ms (p99)
  • Memory limits configured (50GB)
  • Background pool sized (16 threads)
  • Merge settings optimized

Rust Integration

  • clickhouse-rs client configured
  • Connection pooling implemented
  • Batch insert logic (1000 rows)
  • Error handling with retries
  • Health check queries
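
A rough shape for the writer and health check, assuming the clickhouse-rs Pool/Block API (exact method names should be verified against the pinned crate version); the URL and column set are placeholders, retry policy is left to the caller, and nothing here unwraps or panics:

```rust
use clickhouse_rs::{errors::Error, types::Block, Pool};

/// Batched writer for scrybe.sessions. Only two illustrative columns are shown;
/// the real writer covers every column in the table. Callers buffer up to
/// ~1,000 rows, call insert_batch, and retry on Err.
pub struct SessionWriter {
    pool: Pool,
}

impl SessionWriter {
    /// `url` is a placeholder, e.g. "tcp://clickhouse:9000".
    pub fn new(url: &str) -> Self {
        // Pool manages connection pooling internally.
        Self { pool: Pool::new(url) }
    }

    /// Insert one batch of (fingerprint_hash, ip) column values.
    pub async fn insert_batch(
        &self,
        fingerprint_hashes: Vec<String>,
        ips: Vec<String>,
    ) -> Result<(), Error> {
        let block = Block::new()
            .column("fingerprint_hash", fingerprint_hashes)
            .column("ip", ips);

        let mut handle = self.pool.get_handle().await?;
        handle.insert("scrybe.sessions", block).await?;
        Ok(())
    }

    /// Cheap health-check query for readiness probes.
    pub async fn health_check(&self) -> Result<(), Error> {
        let mut handle = self.pool.get_handle().await?;
        handle.query("SELECT 1").fetch_all().await?;
        Ok(())
    }
}
```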

Monitoring

  • Prometheus metrics exposed
  • Grafana dashboard created
  • Alerts configured:
    • Replication lag > 10s
    • Query latency > 500ms
    • Disk usage > 80%
    • Failed backups
  • Log aggregation configured
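
The alert conditions above map onto simple probes of ClickHouse system tables that an exporter or scripted check could run (query_log must be enabled for the latency probe):

```sql
-- Replication lag (alert when > 10s):
SELECT max(absolute_delay) AS max_lag_seconds FROM system.replicas;

-- p99 query latency over the last 5 minutes (alert when > 500ms):
SELECT quantile(0.99)(query_duration_ms) AS p99_ms
FROM system.query_log
WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 5 MINUTE;

-- Disk usage (alert when > 80%):
SELECT
    name,
    round((1 - free_space / total_space) * 100, 1) AS used_pct
FROM system.disks;
```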

Testing

  • Unit tests for Rust client (>90% coverage)
  • Integration tests for schema
  • Load test: 100k inserts/sec sustained
  • Query performance tests
  • Backup/restore tests
  • Replication failover tests

Documentation

  • Schema documentation
  • Query examples documented
  • Backup/restore procedures
  • Disaster recovery runbook
  • Performance tuning guide
  • Monitoring setup guide

📚 Key Files to Create

crates/scrybe-storage/
├── Cargo.toml
├── src/
│   ├── lib.rs              # Public API
│   ├── client.rs           # ClickHouse client
│   ├── schema.rs           # Table definitions
│   ├── writer.rs           # Batch writer
│   ├── reader.rs           # Query interface
│   └── health.rs           # Health checks
└── tests/
    └── integration_test.rs

deployment/clickhouse/
├── schema.sql              # Table DDL
├── views.sql               # Materialized views
├── config.xml              # ClickHouse config
├── backup-config.yml       # Backup configuration
└── scripts/
    ├── backup-validation.sh
    └── disaster-recovery.sh

🔗 Dependencies

Depends On:

  • #1 (RFC-0001)
  • #4 (RFC-0004)

Blocks:

  • None (can work in parallel with other components)

🧪 Testing Checklist

  • Schema creation successful
  • Insert 1M test rows successfully
  • Query performance meets targets
  • Backup/restore successful
  • Replication working
  • Failover successful
  • Test coverage > 90%

📊 Success Metrics

  • Write throughput > 100k/sec
  • Query latency < 100ms (p99)
  • Compression ratio 10-20:1
  • Replication lag < 1 second
  • Backup success rate 100%
  • RTO < 30 minutes
  • RPO < 1 hour

💾 Storage Calculations

At 10k req/sec sustained:

  • 864M sessions/day × 5KB = 4.3TB/day uncompressed
  • 90 days = 387TB uncompressed
  • At 15:1 compression = 25.8TB stored
  • Storage cost: ~$2,064/month (25.8TB at an assumed ~$80/TB-month)

🛡️ Reliability Checklist

  • Replication enabled (2+ replicas)
  • Backups automated (daily incremental)
  • Backup validation weekly
  • DR plan tested
  • Monitoring active
  • Alerts configured

🦉 TigerStyle Compliance

  • No unwrap/expect/panic
  • All errors via Result
  • Explicit error handling
  • Public APIs documented
  • Test coverage > 90%

Estimated Effort: 2 weeks
Assigned To: TBD
Related RFCs: RFC-0005 v0.2.0

Labels

  • component:storage - ClickHouse Storage component
  • infrastructure - Infrastructure setup (Redis, ClickHouse, etc.)
  • phase-4:storage - Phase 4: Storage & Reliability
  • priority:high - High priority
  • rfc - RFC implementation task
  • rust - Rust implementation
  • tigerstyle - TigerStyle compliance required
