Quant Data Pipeline
An introductory, non-production-grade market data pipeline built to demonstrate quantitative data engineering skills.
Project Goals
This project demonstrates:

- Market Microstructure Knowledge - understanding OHLCV data and exchange mechanics
- Dirty Data Handling - managing gaps, duplicates, and API failures gracefully
- Data Engineering Best Practices - Parquet storage, partitioning, efficient queries
- Production-Style Practices - error handling, logging, scheduling, monitoring
Features:

- Automated data collection from the Binance API
- Data cleaning and validation
- Efficient Parquet storage with partitioning
- Daily incremental updates
- Comprehensive logging and error handling
Learning Notes & Key Concepts
What is OHLCV Data?
OHLCV stands for Open, High, Low, Close, Volume - the fundamental data format for financial time series:
- Open - first trade price in the period. Why it matters: shows where trading started.
- High - highest price reached. Why it matters: shows maximum buying pressure.
- Low - lowest price reached. Why it matters: shows maximum selling pressure.
- Close - last trade price in the period. Why it matters: the most important field, used in most calculations.
- Volume - total units traded. Why it matters: shows market activity and liquidity.
Example: A 1-hour candle for BTC/USDT might show:
- Open: $65,000 (price at start of hour)
- High: $65,500 (peak during hour)
- Low: $64,800 (trough during hour)
- Close: $65,200 (price at end of hour)
- Volume: 1,234.56 BTC (total traded)
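In code, the same candle arrives from the exchange as a flat list. A minimal sketch using ccxt and pandas (the libraries used elsewhere in this project); the column labels are chosen here for illustration:

```python
# Each ccxt OHLCV row is [timestamp_ms, open, high, low, close, volume].
import ccxt
import pandas as pd

exchange = ccxt.binance()
raw = exchange.fetch_ohlcv("BTC/USDT", timeframe="1h", limit=3)

df = pd.DataFrame(raw, columns=["timestamp", "open", "high", "low", "close", "volume"])
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
print(df)
```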
What is a "Candle" or "Candlestick"? A candle is a visual representation of OHLCV data: the "body" shows the open-to-close range, and the "wicks" show the high/low extremes. It is called a candle because it literally looks like one on a chart!
Timeframes

Data comes in different granularities:

- 1m = 1-minute candles (525,600 candles/year)
- 5m = 5-minute candles (105,120 candles/year)
- 1h = 1-hour candles (8,760 candles/year)
- 1d = 1-day candles (365 candles/year)
Trade-off: Smaller timeframes = more data = more storage = more API calls
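A quick back-of-the-envelope calculation makes the trade-off concrete. This sketch assumes roughly 1,000 candles per API request, a typical per-request cap:

```python
# Rough cost of one year of one symbol at each timeframe.
MINUTES_PER_YEAR = 60 * 24 * 365

for label, minutes in [("1m", 1), ("5m", 5), ("1h", 60), ("1d", 60 * 24)]:
    candles = MINUTES_PER_YEAR // minutes
    requests = -(-candles // 1000)  # ceiling division
    print(f"{label}: {candles:,} candles/year, ~{requests} requests to backfill")
```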
Why Parquet over CSV?
|              | CSV                    | Parquet                         |
|--------------|------------------------|---------------------------------|
| Compression  | None                   | ~80% smaller                    |
| Type safety  | Everything is a string | Preserves int, float, datetime  |
| Read speed   | Must read entire file  | Can read specific columns       |
| Write speed  | Fast                   | Slightly slower                 |
| Industry use | Spreadsheets           | Production data systems         |
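A minimal sketch of the difference in practice with pandas (requires pyarrow; file names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=365, freq="D", tz="UTC"),
    "close": 65000.0,
    "volume": 1.0,
})

df.to_csv("candles.csv", index=False)          # types flattened to text
df.to_parquet("candles.parquet", index=False)  # dtypes preserved, compressed

# Column pruning: Parquet can read just the columns you need.
closes = pd.read_parquet("candles.parquet", columns=["timestamp", "close"])
print(closes.dtypes)
```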
Rate Limiting
APIs protect themselves from abuse. Binance allows ~1200 requests/minute for public endpoints. If you exceed this, you get temporarily banned (IP block).
Solution: Add time.sleep() between calls. Better to be slow than banned.
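A minimal sketch of a polite fetch loop with retries; the function name is hypothetical and the delay/retry values mirror the configuration defaults shown later:

```python
import time
import ccxt

exchange = ccxt.binance()
RATE_LIMIT_DELAY = 1.0   # seconds between calls
MAX_RETRIES = 3

def fetch_with_retries(symbol, timeframe, since=None, limit=1000):
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            candles = exchange.fetch_ohlcv(symbol, timeframe, since=since, limit=limit)
            time.sleep(RATE_LIMIT_DELAY)  # stay well under the published limits
            return candles
        except ccxt.NetworkError:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(RATE_LIMIT_DELAY * 2 ** attempt)  # simple exponential backoff
```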
Data Partitioning
Instead of one giant file, we split data by date:
```
data/processed/
├── symbol=BTCUSDT/
│   ├── year=2024/
│   │   ├── month=01/
│   │   │   └── data.parquet
│   │   ├── month=02/
│   │   │   └── data.parquet
```
Why? If you need January 2024 data, you load ONE small file, not years of data.
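A minimal sketch of partitioned writes and selective reads with pandas and pyarrow (paths and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=60, freq="D", tz="UTC"),
    "close": 65000.0,
    "symbol": "BTCUSDT",
})
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month

# Writes data/processed/symbol=BTCUSDT/year=2024/month=.../part-*.parquet
df.to_parquet("data/processed", index=False, partition_cols=["symbol", "year", "month"])

# Later: load only January 2024 for BTCUSDT, not the whole history.
jan = pd.read_parquet(
    "data/processed",
    filters=[("symbol", "=", "BTCUSDT"), ("year", "=", 2024), ("month", "=", 1)],
)
```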
Architecture
```
[Binance API]
      |
[Extract (ccxt)]      --> rate-limited, paginated
      |
[Transform (pandas)]  --> clean & validate, deduplicate
      |
[Load (Parquet)]      --> partitioned by date/symbol
```
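A condensed sketch of that flow in a single script; the real project splits these stages across fetcher.py, transformer.py, and storage.py:

```python
import ccxt
import pandas as pd

# Extract: ccxt throttles requests when enableRateLimit is set
exchange = ccxt.binance({"enableRateLimit": True})
raw = exchange.fetch_ohlcv("BTC/USDT", timeframe="1h", limit=500)

# Transform: label columns, fix types, deduplicate, sort
df = pd.DataFrame(raw, columns=["timestamp", "open", "high", "low", "close", "volume"])
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
df = df.drop_duplicates(subset="timestamp").sort_values("timestamp")

# Load: partitioned Parquet
df["symbol"] = "BTCUSDT"
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df.to_parquet("data/processed", index=False, partition_cols=["symbol", "year", "month"])
```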
Project Structure
```
quant-data-pipeline/
├── src/                  # Source code
│   ├── __init__.py
│   ├── config.py         # Configuration settings
│   ├── fetcher.py        # API connection & data fetching
│   ├── transformer.py    # Data cleaning & validation
│   ├── storage.py        # Parquet read/write operations
│   ├── backfill.py       # Historical data download
│   └── update.py         # Daily incremental updates
├── data/
│   ├── raw/              # Untouched API responses (for debugging)
│   ├── processed/        # Clean Parquet files (partitioned)
│   └── backups/          # Safety copies
├── config/
│   └── settings.yaml     # User configuration
├── logs/                 # Application logs
├── tests/                # Unit tests
├── docs/                 # Additional documentation
├── notebooks/            # Jupyter exploration (optional)
├── requirements.txt      # Dependencies
├── setup.py              # Package setup
└── README.md             # This file
```
First run

- Test the connection to Binance: python -m src.fetcher --test
- Run a historical backfill: python -m src.backfill --symbol BTCUSDT --start 2024-01-01
- Check the stored data: python -m src.storage --info
Configuration
'config/settings.yaml':

```yaml
# Data collection settings
symbols:
  - BTCUSDT
  - ETHUSDT
timeframe: 1h
start_date: "2024-01-01"

# API settings
rate_limit_delay: 1.0   # seconds between calls
max_retries: 3

# Storage settings
data_dir: ./data/processed
partition_by: [symbol, year, month]
```
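A minimal sketch of how the pipeline might load these settings (assumes PyYAML and the flat key layout shown above):

```python
import yaml

with open("config/settings.yaml") as fh:
    settings = yaml.safe_load(fh)

print(settings["symbols"])           # ['BTCUSDT', 'ETHUSDT']
print(settings["rate_limit_delay"])  # 1.0
```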
Testing

Run all tests: pytest tests/

Run with coverage: pytest tests/ --cov=src
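An illustrative unit test in the style pytest would collect; the deduplicate helper here is a hypothetical stand-in for the project's transformer logic:

```python
# tests/test_transformer.py (illustrative)
import pandas as pd

def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the transformer: keep one row per timestamp, sorted."""
    return df.drop_duplicates(subset="timestamp").sort_values("timestamp")

def test_deduplicate_drops_repeated_candles():
    df = pd.DataFrame({
        "timestamp": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 00:00", "2024-01-01 01:00"], utc=True
        ),
        "close": [65000.0, 65000.0, 65200.0],
    })
    out = deduplicate(df)
    assert len(out) == 2
    assert out["timestamp"].is_unique
```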
Skills Demonstrated
- ETL Pipelines - Extract, Transform, Load methodology
- API Integration - rate limiting, pagination, error handling
- Data Quality - validation, deduplication, type enforcement
- Storage Optimization - columnar formats, partitioning strategies
- Production Practices - logging, configuration, testing
License
MIT License - Use freely for learning and portfolio purposes.
Contributing
This is a learning project. Improvements are welcome!