Streaming client improvements and snowflake loader features #15
Merged
- Load labels from CSV files with automatic type detection
- Support hex string to binary conversion for Ethereum addresses
- Thread-safe label storage and retrieval
- Add LabelJoinConfig type for configuring joins
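For reference, a minimal sketch of the hex-to-binary conversion described above; the function name is illustrative, not the actual label manager API.

```python
def hex_to_binary(address: str) -> bytes:
    """Convert a 0x-prefixed hex address string to its 20 raw bytes."""
    stripped = address[2:] if address.lower().startswith("0x") else address
    return bytes.fromhex(stripped)

# Ethereum addresses are 20 bytes once the hex prefix is stripped.
assert len(hex_to_binary("0xdAC17F958D2ee523a2206206994597C13D831ec7")) == 20
```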
- StreamStateStore interface with in-memory, null, and DB-backed implementations
- Block range tracking with gap detection
- Reorg invalidation support

Key features:
- Resume from last processed position after crashes
- Exactly-once semantics via batch deduplication
- Gap detection and intelligent backfill
- Support for multiple networks and tables
- Exponential backoff with jitter for transient failures
- Adaptive rate limiting with automatic adjustment
- Back pressure detection and mitigation
- Error classification (transient vs permanent)
- Configurable retry policies

Features:
- Auto-detects rate limits and slows down requests
- Detects timeouts and adjusts batch sizes
- Production-tested configurations included
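A minimal sketch of exponential backoff with full jitter, the pattern this commit describes; the function, exception class, and parameter names are illustrative rather than the actual resilience.py API.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for the transient (retryable) side of error classification."""

def retry_transient(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a callable on transient failures with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                                        # retries exhausted
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))               # full jitter up to the cap
```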
- Integrate state management for resume and deduplication - Add label joining support with automatic type conversion - Implement resilience features (retry, backpressure, rate limiting) - Add metadata columns (_amp_batch_id) for reorg handling - Support streaming with block ranges and reorg detection - Separate _try_load_batch() for better error handling
- Add resume optimization that adjusts min_block based on persistent state
- Implement gap-aware partitioning for intelligent backfill
- Add pre-flight table creation to avoid locking issues
- Improve error handling and logging for state operations
- Support label joining in parallel workers

Key features:
- Auto-detects processed ranges and skips already-loaded partitions
- Prioritizes gap filling before processing new data
- Efficient partition creation avoiding redundant work
- Visible logging for resume operations and adjustments

Resume workflow:
1. Query state store for max processed block
2. Adjust min_block to skip processed ranges
3. Detect gaps in processed data
4. Create partitions prioritizing gaps first
5. Process remaining historical data
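The resume workflow above can be illustrated with a small, self-contained sketch; the function and argument names are made up for illustration and do not reflect the actual parallel.py code.

```python
def plan_partitions(processed, min_block, max_block, partition_size):
    """Return (start, end) partitions to load: gaps in processed data first, then new blocks."""
    processed = sorted(processed)

    # Steps 1-2: resume past the highest block already recorded in state.
    resume_from = max((end for _, end in processed), default=min_block - 1) + 1

    # Step 3: detect gaps between consecutive processed ranges.
    gaps = []
    for (_, prev_end), (next_start, _) in zip(processed, processed[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))

    # Steps 4-5: gaps are prioritized, then the remaining history is split into partitions.
    partitions = list(gaps)
    for start in range(max(resume_from, min_block), max_block + 1, partition_size):
        partitions.append((start, min(start + partition_size - 1, max_block)))
    return partitions

# plan_partitions([(0, 99), (200, 299)], 0, 499, 100)
# -> [(100, 199), (300, 399), (400, 499)]
```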
Add label management to Client class:
- Initialize LabelManager with configurable label directory
- Support loading labels from CSV files
- Pass label_manager to all loader instances
- Enable label joining in streaming queries via load() method

Updates:
- Client now supports label enrichment out of the box
- Loaders inherit label_manager from client
- Add pyarrow.csv dependency for label loading
- PostgreSQL: Add reorg support with DELETE/UPDATE, metadata columns
- Redis: Add streaming metadata and batch ID support
- DeltaLake: Support new metadata columns
- Iceberg: Update for base class changes
- LMDB: Add metadata column support

All loaders now support:
- State-backed resume and deduplication
- Label joining via base class
- Resilience features (retry, backpressure)
- Reorg-aware streaming with metadata tracking
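To illustrate the PostgreSQL reorg handling mentioned above, a hedged sketch of deleting rows that belong to invalidated batches via the _amp_batch_id metadata column; the table name, cursor handling, and statement are assumptions, not the real loader code.

```python
def delete_reorged_batches(cursor, table: str, invalid_batch_ids: list) -> None:
    """Remove rows written by batches that were invalidated by a reorg."""
    cursor.execute(
        f"DELETE FROM {table} WHERE _amp_batch_id = ANY(%s)",  # psycopg-style placeholder
        (invalid_batch_ids,),
    )
```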
Add unit tests for all new streaming features:
- test_label_joining.py - Label enrichment with type conversion
- test_label_manager.py - CSV loading and label storage
- test_resilience.py - Retry, backoff, rate limiting
- test_resume_optimization.py - Resume position calculation
- test_stream_state.py - State store implementations
- test_streaming_helpers.py - Utility functions and batch ID generation
- test_streaming_types.py - BlockRange, ResumeWatermark types
- Add Snowflake-backed persistent state store (amp_stream_state table)
- Implement SnowflakeStreamStateStore with overlap detection
- Support multiple loading methods: stage, insert, pandas, snowpipe_streaming
- Add connection pooling for parallel workers
- Implement reorg history tracking with simplified schema
- Support Parquet stage loading for better performance

State management features:
- Block-level overlap detection for different partition sizes
- MERGE-based upsert to prevent duplicate state entries
- Resume position calculation with gap detection
- Deduplication across runs

Performance improvements:
- Parallel stage loading with connection pool
- Optimized Parquet format for stage loads
- Efficient batch processing with metadata columns
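The block-level overlap detection listed above matters because partition sizes can differ between runs; a small conceptual sketch (names and data shapes are illustrative, not the SnowflakeStreamStateStore implementation):

```python
def overlaps(new_range, recorded_ranges):
    """Return every previously recorded (start, end) range that shares blocks with new_range."""
    new_start, new_end = new_range
    return [
        (start, end)
        for start, end in recorded_ranges
        if start <= new_end and end >= new_start
    ]

# A 250-block batch overlaps two previously recorded 100-block partitions:
print(overlaps((150, 399), [(0, 99), (100, 199), (300, 399), (500, 599)]))
# -> [(100, 199), (300, 399)]
```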
Add comprehensive demo applications for Snowflake loading:

1. snowflake_parallel_loader.py - Full-featured parallel loader
   - Configurable block ranges, workers, and partition sizes
   - Label joining with CSV files
   - State management with resume capability
   - Support for all Snowflake loading methods
   - Reorg history tracking
   - Clean formatted output with progress indicators

2. test_erc20_parallel_load.py - Simple ERC20 transfer loader
   - Basic parallel loading example
   - Good starting point for new users

3. test_erc20_labeled_parallel.py - Label-enriched example
   - Demonstrates label joining with token metadata
   - Shows how to enrich blockchain data

4. Query templates in apps/queries/
   - erc20_transfers.sql - Decode ERC20 Transfer events
   - README.md - Query documentation
New tests:
- test_resilient_streaming.py - Resilience with real databases
- Enhanced Snowflake loader tests with state management
- Enhanced PostgreSQL tests with reorg handling
- Updated Redis, DeltaLake, Iceberg, LMDB loader tests

Integration test features:
- Real database containers (PostgreSQL, Redis, Snowflake)
- State persistence and resume testing
- Label joining with actual data
- Reorg detection and invalidation
- Parallel loading with multiple workers
- Error injection and recovery

Tests require Docker for database containers.
Add containerization and orchestration support:
- General-purpose Dockerfile for amp-python
- Snowflake-specific Dockerfile with parallel loader
- GitHub Actions workflow for automated Docker publishing to ghcr.io
- Kubernetes deployment manifest for GKE with resource limits
- Comprehensive .dockerignore and .gitignore

Docker images:
- amp-python: Base image with all loaders
- amp-snowflake: Optimized for Snowflake parallel loading
  - Includes snowflake_parallel_loader.py as entrypoint
  - Pre-configured with Snowflake connector and dependencies
- All loading methods comparison (stage, insert, pandas, streaming)
- State management and resume capability
- Label joining for data enrichment
- Performance tuning and optimization
- Parallel loading configuration
- Reorg handling strategies
- Troubleshooting common issues
Users should now mount label CSV files at runtime using volume mounts (Docker) or init containers with cloud storage (Kubernetes).

Changes:
- Removed COPY data/ line from both Dockerfiles
- The /data directory is still created (mkdir -p /app /data) but empty
- Updated .gitignore to ignore entire data/ directory
- Removed data/** trigger from docker-publish workflow
- Added comprehensive docs/label_manager.md with:
  * Docker volume mount examples
  * Kubernetes init container pattern (recommended for large files)
  * ConfigMap examples (for small files <1MB)
  * PersistentVolume examples (for shared access)
  * Performance considerations and troubleshooting
When data_structure='string', batch IDs are stored inside JSON values rather than as hash fields. The reorg handler now checks the data structure and uses GET+JSON parse for strings, HGET for hashes.
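A sketch of that data-structure-aware lookup using redis-py; the key and field names are illustrative, not the actual Redis loader code.

```python
import json
import redis  # redis-py

def get_batch_id(r: redis.Redis, key: str, data_structure: str):
    """Read the batch ID for a stored record, honoring the configured data structure."""
    if data_structure == "string":
        raw = r.get(key)                          # whole record stored as JSON text
        return json.loads(raw)["_amp_batch_id"] if raw else None
    return r.hget(key, "_amp_batch_id")           # hash: batch ID lives in its own field
```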
craigtutterow approved these changes on Nov 12, 2025.
I apologize ahead of time for the size of this PR 😬. I'm now adding a detailed description here with tips for reviewing to hopefully make it more digestible.
Overview
This PR adds streaming infrastructure features to amp-python for comprehensive state management, resilience features, and a significantly enhanced Snowflake loader. The changes enable resumable, fault-tolerant streaming with automatic gap detection, reorg handling, and label enrichment.
Review Strategy
I recommend reviewing in this order:
1. Core Concepts (20 min)
Start here to understand the foundation:
- src/amp/streaming/state.py - Unified state management
- src/amp/streaming/types.py - Core types: BlockRange, ResponseBatch, BatchIdentifier
- docs/resilience.md - Architecture overview

2. Key Features (30 min)
Review the main capabilities:
- src/amp/loaders/base.py (lines 200-400) - Streaming support in base loader
- src/amp/streaming/parallel.py (lines 350-450) - Gap detection & resume optimization
- src/amp/config/label_manager.py - CSV label enrichment
- src/amp/streaming/resilience.py - Retry, backpressure, rate limiting

3. Snowflake Implementation (20 min)
The largest single file change:
- src/amp/loaders/implementations/snowflake_loader.py - Persistent state, COPY INTO, parallel loading

4. Tests & Validation (15 min)
Verify comprehensive coverage:
- tests/unit/test_stream_state.py - State management tests
- tests/integration/test_checkpoint_resume.py - End-to-end resume scenarios
- tests/integration/test_resilient_streaming.py - Resilience with real databases

5. Other stuff (10 min)
Key Features
1. Unified Stream State Management
Why: Simplifies resumability + idempotency into one system
Files:
src/amp/streaming/state.py, sql/snowflake_stream_state.sql

- StreamStateStore interface with in-memory, null, and DB-backed implementations
- BatchIdentifier with hash-based uniqueness for reorg detection

Example:
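A minimal conceptual sketch of what a stream state store tracks (batch identity, dedup, reorg invalidation); the class and method names below are illustrative, not the actual StreamStateStore interface.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BatchId:
    network: str
    table: str
    start_block: int
    end_block: int
    content_hash: str                  # distinguishes reorged variants of the same range

@dataclass
class InMemoryStateStore:
    processed: set = field(default_factory=set)

    def is_processed(self, batch_id: BatchId) -> bool:
        return batch_id in self.processed          # dedup: skip already-loaded batches

    def mark_processed(self, batch_id: BatchId) -> None:
        self.processed.add(batch_id)

    def invalidate_from(self, block: int) -> None:
        # Reorg: drop state for every batch that reaches the reorg point or beyond.
        self.processed = {b for b in self.processed if b.end_block < block}

store = InMemoryStateStore()
b = BatchId("ethereum", "transfers", 100, 199, "ab12")
store.mark_processed(b)
assert store.is_processed(b)
store.invalidate_from(150)
assert not store.is_processed(b)
```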
2. Label Enrichment System
Why: Join blockchain data with metadata (token info, labels, etc.)
Files:
src/amp/config/label_manager.py, docs/label_manager.md

Example:
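A conceptual sketch of label enrichment using pyarrow's Table.join; the real LabelManager reads labels from CSV via pyarrow.csv, and the column names here are made up for illustration.

```python
import pyarrow as pa

# Transfer rows joined with token metadata on the contract address.
transfers = pa.table({
    "token_address": ["0xdac1...", "0xa0b8..."],
    "amount": [1000, 2500],
})
labels = pa.table({
    "token_address": ["0xdac1...", "0xa0b8..."],
    "symbol": ["USDT", "USDC"],
    "decimals": [6, 6],
})

enriched = transfers.join(labels, keys="token_address")  # left outer join by default
print(enriched.to_pydict())
```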
3. Resilience Features
Why: Production workloads need fault tolerance
Files:
src/amp/streaming/resilience.py, docs/resilience.md

4. Parallel Execution Enhancements
Why: Fill historical gaps while maintaining streaming
Files:
src/amp/streaming/parallel.py

Example:
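A conceptual sketch of gap-first parallel execution with a thread pool; load_partition and the partition lists are stand-ins, not the actual parallel.py API.

```python
from concurrent.futures import ThreadPoolExecutor

def load_partition(block_range):
    start, end = block_range
    return f"loaded blocks {start}-{end}"      # placeholder for a real partition load

gap_partitions = [(100, 199)]                  # detected gaps, filled first
new_partitions = [(300, 399), (400, 499)]      # remaining historical work

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(load_partition, gap_partitions + new_partitions))
print(results)
```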
5. Enhanced Snowflake Loader
Why: Enterprise-grade streaming to Snowflake
Files:
src/amp/loaders/implementations/snowflake_loader.py

Testing summary
Coverage
Key Test Files
- tests/unit/test_stream_state.py - State management
- tests/unit/test_resilience.py - Retry, backpressure, rate limiting
- tests/integration/test_checkpoint_resume.py - Resume scenarios
- tests/integration/test_resilient_streaming.py - Fault injection

Run Tests
Architecture Decisions
Why Unified State Management?
Previous design (replaced by this) had separate systems for checkpoints (resume) and processed ranges (idempotency). This created:
Solution:
StreamStateStore provides both capabilities with a single source of truth.

Why Hash-based Batch IDs?
Block ranges alone aren't unique during reorgs (same range, different chain state); hash-based IDs distinguish the same range across different chain states.
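As an illustration of the idea (the exact fields hashed by BatchIdentifier are an assumption here), a content hash makes the same block range produce different IDs before and after a reorg:

```python
import hashlib

def batch_id(start_block: int, end_block: int, block_hashes: list) -> str:
    """Combine the block range with a digest of the block hashes it covered."""
    digest = hashlib.sha256(
        "|".join([str(start_block), str(end_block), *block_hashes]).encode()
    )
    return f"{start_block}-{end_block}-{digest.hexdigest()[:12]}"

before = batch_id(100, 101, ["0xaaa", "0xbbb"])
after = batch_id(100, 101, ["0xaaa", "0xccc"])   # same range, reorged second block
assert before != after
```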
New Files
Core Infrastructure
- src/amp/streaming/state.py - Unified state management
- src/amp/streaming/resilience.py - Retry, backpressure, rate limiting
- src/amp/config/label_manager.py - CSV label enrichment
- sql/snowflake_stream_state.sql - Snowflake state schema

Applications
- apps/snowflake_parallel_loader.py - Production-ready parallel loader
- apps/test_erc20_labeled_parallel.py - Example with label enrichment
- apps/queries/erc20_transfers.sql - Sample query

Documentation
- docs/resilience.md - Complete resilience architecture guide
- docs/label_manager.md - Label enrichment guide
- apps/SNOWFLAKE_LOADER_GUIDE.md - Snowflake loader tutorial
- apps/queries/README.md - Query examples

Infrastructure
- Dockerfile, Dockerfile.snowflake - Containerized deployment
- k8s/deployment.yaml - Kubernetes deployment
- .github/workflows/docker-publish.yml - Docker CI/CD

Key Commit Groups
Foundation (Commits 1-3)
Core Streaming (Commits 4-6)
Loader Updates (Commit 7)
- _amp_batch_id metadata column

Snowflake (Commits 8-9)
Testing (Commits 10-11)
Polish (Commits 12-18)
Performance
See performance_benchmarks.json for detailed metrics.

Highlights: