Labels
- component:storage (ClickHouse Storage component)
- infrastructure (Infrastructure setup: Redis, ClickHouse, etc.)
- phase-4:storage (Phase 4: Storage & Reliability)
- priority:high (High priority)
- rfc (RFC implementation task)
- rust (Rust implementation)
- tigerstyle (TigerStyle compliance required)
Description
📋 RFC Reference
Document: RFC-0005: ClickHouse Storage Schema
Version: 0.2.0
Phase: Phase 4 - Storage & Reliability
Depends On: #1 (RFC-0001), #4 (RFC-0004)
🎯 Objective
Implement ClickHouse storage schema with replication, backup strategy, disaster recovery, and realistic compression ratios for long-term session telemetry storage.
📦 Acceptance Criteria
ClickHouse Schema
- `scrybe.sessions` table created with ReplicatedMergeTree
- Primary key: `toStartOfHour(timestamp), session_id`
- Partitioning by `toYYYYMMDD(timestamp)`
- TTL configured: 90 days
- All columns defined per RFC-0005
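The criteria above can be sketched as DDL. This is a hedged sketch, not the RFC-0005 schema itself: only `timestamp` and `session_id` appear in the checklist, so the other column names and the ZooKeeper path are assumptions.

```rust
// Illustrative DDL for the replicated sessions table. Column names beyond
// timestamp/session_id, and the /clickhouse/tables/... path, are assumptions;
// the authoritative column list lives in RFC-0005.
const CREATE_SESSIONS: &str = r#"
CREATE TABLE scrybe.sessions
(
    timestamp        DateTime,
    session_id       UUID,
    fingerprint_hash String CODEC(ZSTD(3)),
    ip               String,
    bot_probability  Float32
    -- remaining columns per RFC-0005
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/sessions', '{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (toStartOfHour(timestamp), session_id)
TTL timestamp + INTERVAL 90 DAY
SETTINGS index_granularity = 8192
"#;

fn main() {
    // Sanity-check that the clauses named in the acceptance criteria are present.
    assert!(CREATE_SESSIONS.contains("ENGINE = ReplicatedMergeTree"));
    assert!(CREATE_SESSIONS.contains("TTL timestamp + INTERVAL 90 DAY"));
    println!("sessions DDL sketch ok");
}
```

The `{shard}`/`{replica}` macros are filled in from the per-node macro configuration covered under Replication Setup below.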
Indexes
- Token bloom filter on `fingerprint_hash`
- Token bloom filter on `ip`
- Min-max index on `bot_probability`
- Index granularity: 8192
- Indexes validated with EXPLAIN queries
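A possible shape for the skip indexes, again as a hedged sketch: the `tokenbf_v1` sizing parameters (filter bytes, hash-function count, seed) and the skip-index `GRANULARITY` are assumptions to be tuned, not values from the RFC.

```rust
// Illustrative skip-index DDL. Note that GRANULARITY here counts index
// granules per skip-index block; the table-level index_granularity = 8192
// from the acceptance criteria is a separate table setting.
const ADD_INDEXES: &str = r#"
ALTER TABLE scrybe.sessions
    ADD INDEX idx_fingerprint fingerprint_hash TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4,
    ADD INDEX idx_ip          ip               TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4,
    ADD INDEX idx_bot_prob    bot_probability  TYPE minmax GRANULARITY 4
"#;

fn main() {
    assert!(ADD_INDEXES.contains("tokenbf_v1"));
    assert!(ADD_INDEXES.contains("TYPE minmax"));
    println!("index DDL sketch ok");
}
```

`EXPLAIN indexes = 1 SELECT ...` against representative queries is one way to confirm the indexes actually prune granules.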
Materialized Views
- `scrybe.hourly_stats` view created
- `scrybe.fingerprint_clusters` view created
- POPULATE directive used (backfill existing data)
- Aggregation logic validated
- Query performance tested
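A minimal sketch of what `scrybe.hourly_stats` could look like; the aggregated columns are assumptions, since the RFC's exact view definition is not reproduced in this issue. Sums are used instead of averages because SummingMergeTree folds rows by summing, so `avg()` would not survive background merges.

```rust
// Illustrative materialized-view DDL (column choices are assumptions).
// avg(bot_probability) is recoverable as bot_probability_sum / sessions.
const CREATE_HOURLY_STATS: &str = r#"
CREATE MATERIALIZED VIEW scrybe.hourly_stats
ENGINE = SummingMergeTree
ORDER BY hour
POPULATE AS
SELECT
    toStartOfHour(timestamp) AS hour,
    count()                  AS sessions,
    sum(bot_probability)     AS bot_probability_sum
FROM scrybe.sessions
GROUP BY hour
"#;

fn main() {
    assert!(CREATE_HOURLY_STATS.contains("POPULATE"));
    println!("hourly_stats DDL sketch ok");
}
```

One caveat worth capturing in the runbook: POPULATE backfills existing rows, but rows inserted while the view is being created are missed, so backfill during a quiet window or reconcile afterwards.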
Replication Setup
- ZooKeeper cluster configured (3 nodes)
- ReplicatedMergeTree configured
- Shard and replica macros defined
- 2 replicas minimum per shard
- Replication lag monitored
Backup Strategy
- clickhouse-backup installed
- S3 backup destination configured
- Daily incremental backups scheduled
- Weekly full backups scheduled
- Backup retention: 30 days remote, 3 days local
- Backup validation script created
Disaster Recovery
- DR runbook created and tested
- RTO < 30 minutes validated
- RPO < 1 hour validated
- Point-in-time recovery tested
- Full cluster failure recovery tested
- DNS failover procedure documented
Compression Configuration
- ZSTD compression level 3 configured
- Compression ratio 10-20:1 validated
- Storage calculations verified
- Min part size configured (10MB)
- Compression benchmarks run
Performance Tuning
- Write throughput > 100k sessions/sec
- Query latency < 100ms (p99)
- Memory limits configured (50GB)
- Background pool sized (16 threads)
- Merge settings optimized
Rust Integration
- `clickhouse-rs` client configured
- Connection pooling implemented
- Batch insert logic (1000 rows)
- Error handling with retries
- Health check queries
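The batching and retry criteria can be sketched independently of the network layer. In this hypothetical sketch the sink is injected as a closure standing in for the actual `clickhouse-rs` insert call, so the 1000-row flush and exponential-backoff logic can be exercised without a running server; the struct and its names are illustrative, not the crate's API.

```rust
use std::time::Duration;

/// Sketch of a batch writer: buffers rows, flushes every `batch_size`,
/// retries a failed flush with exponential backoff, and returns the last
/// error via Result (no unwrap/panic in the library path, per TigerStyle).
struct BatchWriter<F: FnMut(&[String]) -> Result<(), String>> {
    buffer: Vec<String>,
    batch_size: usize,
    max_retries: u32,
    sink: F,
}

impl<F: FnMut(&[String]) -> Result<(), String>> BatchWriter<F> {
    fn new(batch_size: usize, max_retries: u32, sink: F) -> Self {
        Self { buffer: Vec::with_capacity(batch_size), batch_size, max_retries, sink }
    }

    fn push(&mut self, row: String) -> Result<(), String> {
        self.buffer.push(row);
        if self.buffer.len() >= self.batch_size { self.flush() } else { Ok(()) }
    }

    fn flush(&mut self) -> Result<(), String> {
        if self.buffer.is_empty() {
            return Ok(());
        }
        let mut backoff = Duration::from_millis(100);
        let mut attempt = 0;
        loop {
            match (self.sink)(&self.buffer) {
                Ok(()) => {
                    self.buffer.clear();
                    return Ok(());
                }
                Err(e) => {
                    if attempt >= self.max_retries {
                        return Err(e); // give up after max_retries attempts
                    }
                    attempt += 1;
                    std::thread::sleep(backoff);
                    backoff *= 2; // exponential backoff between retries
                }
            }
        }
    }
}

fn main() {
    // Demo sink: count flushes and check the batch size instead of inserting.
    let flushes = std::cell::Cell::new(0usize);
    let mut writer = BatchWriter::new(1000, 3, |batch: &[String]| {
        flushes.set(flushes.get() + 1);
        assert_eq!(batch.len(), 1000);
        Ok(())
    });
    for i in 0..2000 {
        if writer.push(format!("row-{i}")).is_err() {
            eprintln!("flush failed");
        }
    }
    println!("flushed {} batches of 1000", flushes.get());
}
```

The real writer would serialize typed rows and hand the batch to the pooled `clickhouse-rs` connection inside the sink; the buffering and retry policy above stay the same.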
Monitoring
- Prometheus metrics exposed
- Grafana dashboard created
- Alerts configured:
- Replication lag > 10s
- Query latency > 500ms
- Disk usage > 80%
- Failed backups
- Log aggregation configured
Testing
- Unit tests for Rust client (>90% coverage)
- Integration tests for schema
- Load test: 100k inserts/sec sustained
- Query performance tests
- Backup/restore tests
- Replication failover tests
Documentation
- Schema documentation
- Query examples documented
- Backup/restore procedures
- Disaster recovery runbook
- Performance tuning guide
- Monitoring setup guide
📚 Key Files to Create
crates/scrybe-storage/
├── Cargo.toml
├── src/
│ ├── lib.rs # Public API
│ ├── client.rs # ClickHouse client
│ ├── schema.rs # Table definitions
│ ├── writer.rs # Batch writer
│ ├── reader.rs # Query interface
│ └── health.rs # Health checks
└── tests/
└── integration_test.rs
deployment/clickhouse/
├── schema.sql # Table DDL
├── views.sql # Materialized views
├── config.xml # ClickHouse config
├── backup-config.yml # Backup configuration
└── scripts/
├── backup-validation.sh
└── disaster-recovery.sh
🔗 Dependencies
Depends On:
- RFC-0001: Implement Core Architecture #1 (RFC-0001: Core Architecture)
- RFC-0004: Implement Fingerprinting & Enrichment Pipeline #4 (RFC-0004: Enrichment Pipeline)
Blocks:
- None (can work in parallel with other components)
🧪 Testing Checklist
- Schema creation successful
- Insert 1M test rows successfully
- Query performance meets targets
- Backup/restore successful
- Replication working
- Failover successful
- Test coverage > 90%
📊 Success Metrics
- Write throughput > 100k/sec
- Query latency < 100ms (p99)
- Compression ratio 10-20:1
- Replication lag < 1 second
- Backup success rate 100%
- RTO < 30 minutes
- RPO < 1 hour
💾 Storage Calculations
At 10k req/sec sustained:
- 864M sessions/day × 5KB = 4.3TB/day uncompressed
- 90 days = 387TB uncompressed
- At 15:1 compression = 25.8TB stored
- Storage cost: $2,064/month
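The arithmetic above can be reproduced exactly (decimal units, 1 TB = 1e12 bytes). The issue rounds 4.32 TB/day to 4.3, which yields its 387 TB / 25.8 TB / $2,064 figures; the per-GB rate is not stated, so the $0.08/GB-month used here is inferred from the $2,064 total, not a quoted price.

```rust
// Reproduces the storage calculation; usd_per_gb_month is an inference
// from the issue's $2,064/month figure, not a stated rate.
fn main() {
    let req_per_sec: f64 = 10_000.0;
    let bytes_per_session: f64 = 5_000.0; // 5 KB
    let retention_days: f64 = 90.0;
    let compression_ratio: f64 = 15.0;
    let usd_per_gb_month: f64 = 0.08; // inferred, see lead-in

    let sessions_per_day = req_per_sec * 86_400.0; // 864M
    let tb_per_day = sessions_per_day * bytes_per_session / 1e12; // 4.32 TB
    let tb_raw = tb_per_day * retention_days; // ~389 TB uncompressed
    let tb_stored = tb_raw / compression_ratio; // ~26 TB at 15:1
    let monthly_cost = tb_stored * 1_000.0 * usd_per_gb_month;

    println!("{sessions_per_day:.0} sessions/day, {tb_per_day:.2} TB/day");
    println!("{tb_raw:.1} TB raw, {tb_stored:.2} TB stored, ${monthly_cost:.0}/month");
}
```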
🛡️ Reliability Checklist
- Replication enabled (2+ replicas)
- Backups automated (daily incremental)
- Backup validation weekly
- DR plan tested
- Monitoring active
- Alerts configured
🦉 TigerStyle Compliance
- No unwrap/expect/panic
- All errors via Result
- Explicit error handling
- Public APIs documented
- Test coverage > 90%
Estimated Effort: 2 weeks
Assigned To: TBD
Related RFCs: RFC-0005 v0.2.0