Quant Data Pipeline

An introductory, non-production-grade market data pipeline built to demonstrate quantitative data engineering skills.

Project Goals

This project demonstrates:

  • Market Microstructure Knowledge - understanding OHLCV data and exchange mechanics
  • Dirty Data Handling - managing gaps, duplicates, and API failures gracefully
  • Data Engineering Best Practices - Parquet storage, partitioning, efficient queries
  • Production-Ready Code - error handling, logging, scheduling, monitoring

Features:

  • Automated data collection from the Binance API
  • Data cleaning and validation
  • Efficient Parquet storage with partitioning
  • Daily incremental updates
  • Comprehensive logging and error handling

Learning Notes & Key Concepts

What is OHLCV Data?

OHLCV stands for Open, High, Low, Close, Volume - the fundamental data format for financial time series:

  • Open - first trade price in the period. Why? It shows where trading started.
  • High - highest price reached. Why? It shows maximum buying pressure.
  • Low - lowest price reached. Why? It shows maximum selling pressure.
  • Close - last trade price in the period. Why? It is the most important value, used for most calculations.
  • Volume - total units traded. Why? It shows market activity/liquidity.

Example: A 1-hour candle for BTC/USDT might show:

  • Open: $65,000 (price at start of hour)
  • High: $65,500 (peak during hour)
  • Low: $64,800 (trough during hour)
  • Close: $65,200 (price at end of hour)
  • Volume: 1,234.56 BTC (total traded)
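
As a minimal sketch (assuming pandas, with column names chosen for this README rather than taken from the repo), that candle could be held as one row of a DataFrame keyed by its open time:

    import pandas as pd

    # One 1-hour BTC/USDT candle, matching the example values above.
    # The index is the candle's open time in UTC.
    candle = pd.DataFrame(
        [[65000.0, 65500.0, 64800.0, 65200.0, 1234.56]],
        columns=["open", "high", "low", "close", "volume"],
        index=pd.to_datetime(["2024-01-01 00:00:00"], utc=True),
    )
    candle.index.name = "timestamp"
    print(candle)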

What is a "Candle" or "Candlestick"?

A candle is a visual representation of OHLCV data: the "body" shows the open-to-close range, and the "wicks" mark the high and low extremes. It is called a candle because it literally looks like one.

Timeframes

Data comes in different granularities:

  • 1m = 1-minute candles (525,600 candles/year)
  • 5m = 5-minute candles
  • 1h = 1-hour candles (8,760 candles/year)
  • 1d = 1-day candles (365 candles/year)

Trade-off: Smaller timeframes = more data = more storage = more API calls
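
A quick sanity check of those candle counts in plain Python (no dependencies; a 365-day year is assumed):

    # Approximate candles per year for each timeframe.
    minutes_per_year = 60 * 24 * 365            # 525,600
    for label, minutes in [("1m", 1), ("5m", 5), ("1h", 60), ("1d", 60 * 24)]:
        print(label, minutes_per_year // minutes)
    # 1m -> 525600, 5m -> 105120, 1h -> 8760, 1d -> 365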

Why Parquet over CSV?

  • Compression - CSV: none; Parquet: ~80% smaller
  • Type safety - CSV: everything is a string; Parquet: preserves int, float, and datetime
  • Read speed - CSV: must read the entire file; Parquet: can read specific columns
  • Write speed - CSV: fast; Parquet: slightly slower
  • Industry use - CSV: spreadsheets; Parquet: production data systems
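
A minimal sketch of the difference in practice, assuming pandas with pyarrow installed (data and file names are illustrative):

    import pandas as pd

    df = pd.DataFrame({
        "timestamp": pd.date_range("2024-01-01", periods=3, freq="1h", tz="UTC"),
        "close": [65000.0, 65200.0, 65150.0],
        "volume": [1234.56, 987.65, 1100.00],
    })

    # CSV: no compression, and dtypes are lost on round-trip (timestamps come back as strings).
    df.to_csv("candles.csv", index=False)
    print(pd.read_csv("candles.csv").dtypes)                              # timestamp -> object

    # Parquet: compressed, dtypes preserved, and you can read just the columns you need.
    df.to_parquet("candles.parquet", index=False)
    print(pd.read_parquet("candles.parquet", columns=["close"]).dtypes)   # close -> float64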

Rate Limiting

APIs protect themselves from abuse. Binance allows ~1200 requests/minute for public endpoints. If you exceed this, you get temporarily banned (IP block).

Solution: Add time.sleep() between calls. Better to be slow than banned.
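
A minimal sketch of a rate-limited, paginated fetch with ccxt (the symbol, timeframe, and 1-second delay are illustrative choices, not necessarily what src/fetcher.py does):

    import time
    import ccxt

    exchange = ccxt.binance()
    since = exchange.parse8601("2024-01-01T00:00:00Z")
    candles = []

    while True:
        # Each row is [timestamp_ms, open, high, low, close, volume].
        batch = exchange.fetch_ohlcv("BTC/USDT", timeframe="1h", since=since, limit=1000)
        if not batch:
            break                                # reached the present
        candles.extend(batch)
        since = batch[-1][0] + 1                 # resume just after the last candle received
        time.sleep(1.0)                          # stay well under the rate limit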

Data Partitioning

Instead of one giant file, we split data by date:

data/processed/
├── symbol=BTCUSDT/
│   ├── year=2024/
│   │   ├── month=01/
│   │   │   └── data.parquet
│   │   ├── month=02/
│   │   │   └── data.parquet


Why? If you need January 2024 data, you load ONE small file, not years of data.
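
A minimal sketch of writing and reading that layout with pandas + pyarrow (partition_cols produces the key=value directories; note that pyarrow names them month=1 rather than month=01):

    import pandas as pd

    df = pd.DataFrame({
        "symbol": ["BTCUSDT", "BTCUSDT"],
        "year": [2024, 2024],
        "month": [1, 2],
        "close": [65200.0, 61800.0],
    })

    # Writes data/processed/symbol=BTCUSDT/year=2024/month=1/<part>.parquet, etc.
    df.to_parquet("data/processed", partition_cols=["symbol", "year", "month"], index=False)

    # Reading January 2024 touches only that one partition.
    jan = pd.read_parquet(
        "data/processed",
        filters=[("symbol", "=", "BTCUSDT"), ("year", "=", 2024), ("month", "=", 1)],
    )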


Architecture

    [Binance API]
          |
    [Extract (ccxt)]        -->  rate limited, paginated
          |
    [Transform (pandas)]    -->  clean & validate, deduplicate
          |
    [Load (Parquet)]        -->  partitioned by date/symbol
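
A compressed sketch of how the three stages could fit together (the function boundaries mirror the module names listed below, but the actual signatures in src/ are assumptions; requires ccxt, pandas, and pyarrow):

    import time
    import ccxt
    import pandas as pd

    COLUMNS = ["timestamp", "open", "high", "low", "close", "volume"]

    def extract(symbol: str, timeframe: str, since_ms: int) -> list:
        # Single paginated call; a real fetcher would loop and handle retries.
        exchange = ccxt.binance()
        batch = exchange.fetch_ohlcv(symbol, timeframe=timeframe, since=since_ms, limit=1000)
        time.sleep(1.0)                                   # respect the rate limit
        return batch

    def transform(raw: list) -> pd.DataFrame:
        df = pd.DataFrame(raw, columns=COLUMNS)
        df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
        return df.drop_duplicates(subset="timestamp").sort_values("timestamp")

    def load(df: pd.DataFrame, symbol: str, root: str = "data/processed") -> None:
        df = df.assign(symbol=symbol.replace("/", ""),
                       year=df["timestamp"].dt.year,
                       month=df["timestamp"].dt.month)
        df.to_parquet(root, partition_cols=["symbol", "year", "month"], index=False)

    since = ccxt.binance().parse8601("2024-01-01T00:00:00Z")
    load(transform(extract("BTC/USDT", "1h", since)), "BTC/USDT")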






Project Structure

quant-data-pipeline/
├── src/                    # Source code
│   ├── __init__.py
│   ├── config.py           # Configuration settings
│   ├── fetcher.py          # API connection & data fetching
│   ├── transformer.py      # Data cleaning & validation
│   ├── storage.py          # Parquet read/write operations
│   ├── backfill.py         # Historical data download
│   └── update.py           # Daily incremental updates
├── data/
│   ├── raw/                # Untouched API responses (for debugging)
│   ├── processed/          # Clean Parquet files (partitioned)
│   └── backups/            # Safety copies
├── config/
│   └── settings.yaml       # User configuration
├── logs/                   # Application logs
├── tests/                  # Unit tests
├── docs/                   # Additional documentation
├── notebooks/              # Jupyter exploration (optional)
├── requirements.txt        # Dependencies
├── setup.py                # Package setup
└── README.md               # This file

First run - test the connection to Binance:

    python -m src.fetcher --test

Run a historical backfill:

    python -m src.backfill --symbol BTCUSDT --start 2024-01-01

Check the stored data:

    python -m src.storage --info

Configuration

config/settings.yaml:

    # Data collection settings
    symbols:
      - BTCUSDT
      - ETHUSDT
    timeframe: 1h
    start_date: "2024-01-01"

    # API settings
    rate_limit_delay: 1.0     # seconds between calls
    max_retries: 3

    # Storage settings
    data_dir: ./data/processed
    partition_by: [symbol, year, month]
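
A small sketch of how src/config.py might read these settings with PyYAML (the repo's actual loader may differ):

    import yaml

    def load_settings(path: str = "config/settings.yaml") -> dict:
        """Read the user configuration into a plain dict."""
        with open(path) as f:
            return yaml.safe_load(f)

    settings = load_settings()
    print(settings["symbols"], settings["timeframe"], settings["rate_limit_delay"])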

Testing

Run all tests:

    pytest tests/

Run with coverage:

    pytest tests/ --cov=src
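
As an illustration of the kind of unit test that belongs in tests/, here is a self-contained example against a stand-in cleaning function (the real src/transformer.py may expose a different API):

    import pandas as pd

    COLUMNS = ["timestamp", "open", "high", "low", "close", "volume"]

    def transform(raw):
        # Stand-in for the cleaning step: parse timestamps, drop duplicates, sort.
        df = pd.DataFrame(raw, columns=COLUMNS)
        df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
        return df.drop_duplicates(subset="timestamp").sort_values("timestamp").reset_index(drop=True)

    def test_transform_deduplicates_and_sorts():
        raw = [
            [1704067200000, 65000.0, 65500.0, 64800.0, 65200.0, 1234.56],
            [1704067200000, 65000.0, 65500.0, 64800.0, 65200.0, 1234.56],   # exact duplicate
            [1704063600000, 64900.0, 65100.0, 64700.0, 65000.0, 900.00],    # out of order
        ]
        df = transform(raw)
        assert len(df) == 2
        assert df["timestamp"].is_monotonic_increasing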

Skills Demonstrated

  • ETL Pipelines - extract, transform, load methodology
  • API Integration - rate limiting, pagination, error handling
  • Data Quality - validation, deduplication, type enforcement
  • Storage Optimization - columnar formats, partitioning strategies
  • Production Practices - logging, configuration, testing

License

MIT License - Use freely for learning and portfolio purposes.

Contributing

This is a learning project. Improvements are welcome!
