beiqisu10/crypto_data_engineering

Crypto Sentiment & Price Analytics Pipeline

1. Project Overview

This project addresses the challenge of quantifying market psychology in the highly volatile cryptocurrency sector.

  • The Problem: Investors often struggle to determine if the "Fear & Greed Index" is a lagging indicator or a valid predictive signal for short-term price reversals.

  • The Solution: An end-to-end data pipeline that integrates Binance transaction data with market sentiment indices. By modeling this data, the project identifies "Fear-driven" bottoms and validates the "Buy the Fear" strategy through historical backtesting.


2. Tech Stack

  • Cloud Platform: Google Cloud Platform (GCP)

  • Infrastructure as Code: Terraform (Used to provision GCS Buckets and BigQuery Datasets)

  • Orchestration: Apache Airflow (Workflow management)

  • Data Lake: Google Cloud Storage (GCS)

  • Data Warehouse: BigQuery (Partitioned and Clustered)

  • Transformation: dbt (Data Build Tool)

  • Visualization: Looker Studio


3. Data Pipeline Architecture

The pipeline follows a Medallion-style architecture:

  • Ingestion: Python scripts fetch Binance Ticker data and Fear & Greed API data.

  • Landing: Raw CSVs are uploaded to GCS (Google Cloud Storage) as the landing zone.

  • Staging: Data is loaded into BigQuery using External or Native tables to prepare for transformation.

  • Production: dbt transforms raw data into a fct_crypto_daily_metrics table, applying window functions for volatility and returns.
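The ingestion step can be sketched as follows. This is a minimal, hedged example of pulling the Fear & Greed index from the public alternative.me endpoint; the function names (`parse_fng`, `fetch_fng`) and the exact payload handling are illustrative, not the project's actual scripts:

```python
import json
import urllib.request

# Public Fear & Greed endpoint (alternative.me); URL assumed current.
FNG_URL = "https://api.alternative.me/fng/"

def parse_fng(payload: dict) -> list[dict]:
    """Flatten the API payload into rows ready for a raw CSV landing file."""
    return [
        {
            "timestamp": int(item["timestamp"]),
            "value": int(item["value"]),
            "sentiment_label": item["value_classification"],
        }
        for item in payload.get("data", [])
    ]

def fetch_fng(limit: int = 30) -> list[dict]:
    """Fetch the last `limit` days of the index (network call)."""
    with urllib.request.urlopen(f"{FNG_URL}?limit={limit}") as resp:
        return parse_fng(json.load(resp))
```

Keeping the parsing separate from the network call makes the flattening logic unit-testable without hitting the API.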

Data Lineage (dbt)

Below is the dbt lineage graph, showing how the three distinct sources converge into the final metrics table and illustrating the transformation from the stg_ models to fct_crypto_daily_metrics.

dbt Lineage Graph

4. Reproducibility & Setup

This project uses a Makefile to encapsulate complex commands, ensuring consistent execution across different environments.

Prerequisites

  • Google Cloud Account: A project with Billing and BigQuery/GCS APIs enabled.
  • GCP Credentials: A Service Account JSON key stored locally (referenced in profiles.yml).
  • Docker & Docker-compose: For the Airflow orchestration layer.
  • Terraform: For infrastructure as code.

🚀 Quick Start (Automated Environment Setup)

The following command provisions the cloud resources and starts the Airflow environment:

```shell
make setup
```

This handles Infrastructure (Terraform) and Orchestration (Docker/Airflow) in one go.
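A `setup` target that chains the two steps described above might look like this (a sketch only; the actual recipe bodies in the repository's Makefile may differ):

```make
.PHONY: setup infra-up docker-up

setup: infra-up docker-up        ## one-shot environment bootstrap

infra-up:                        ## Terraform: GCS bucket + BigQuery datasets
	cd terraform && terraform init && terraform apply -auto-approve

docker-up:                       ## start the Airflow containers
	docker compose up -d --build
```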

🔧 Step-by-Step Manual Execution

If you prefer to run components individually:

Step 1: Infrastructure (IaC)

Provision the required GCS buckets and BigQuery datasets:

```shell
make infra-up
```
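The Terraform definitions behind this target would resemble the following sketch (resource names, bucket name, and dataset id are assumptions, not the project's actual values):

```hcl
resource "google_storage_bucket" "data_lake" {
  name          = "crypto-data-lake"     # assumed bucket name
  location      = "US"
  force_destroy = true
}

resource "google_bigquery_dataset" "crypto_analytics" {
  dataset_id = "crypto_analytics"        # assumed dataset id
  location   = "US"
}
```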

Step 2: Orchestration (Airflow)

Start the Airflow containers and access the UI at localhost:8080:

```shell
make docker-up
```
  • Note: Ensure the google_cloud_default connection is configured in the Airflow UI to point to your GCP project.

  • Note: The market_metadata_ingestion_v1 DAG includes an ExternalTaskSensor to maintain strict data lineage with upstream Binance ingestion.

Step 3: Transformation (dbt)

Once raw data is available in BigQuery, run the dbt models to transform landing data into production-ready metrics:

```shell
make dbt-run
```

Note: This command executes dbt deps and dbt run within the Airflow container to ensure proper permissions and dependencies.

Data Quality & Maintenance

To ensure the pipeline remains healthy, use the built-in linting and cleaning tools:

  • make clean: Removes Docker volumes and temporary dbt artifacts.

  • make dbt-test: Runs only the data integrity tests (unique, not_null, relationships).
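These tests live in the dbt schema.yml. A minimal sketch, assuming the file path and model name (the actual layout in the repository may differ):

```yaml
# models/marts/schema.yml (sketch)
version: 2

models:
  - name: fct_crypto_daily_metrics
    columns:
      - name: date_day
        tests:
          - unique
          - not_null
```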


5. Data Warehouse Optimization

The fct_crypto_daily_metrics table in BigQuery is optimized using a combined Partitioning and Clustering strategy:

Partitioning by date_day

  • Strategy: partition_by = {"field": "date_day", "data_type": "date", "granularity": "day"}

  • Reasoning: Crypto market data is inherently time-series. Most analytical queries filter by a specific date range.

  • Benefit: BigQuery scans only the relevant daily partitions instead of the entire table, significantly reducing query costs and improving response time.

Clustering by sentiment_label

  • Strategy: cluster_by = ["sentiment_label"]

  • Reasoning: A primary use case is to compare performance across different sentiment categories (e.g., "Extreme Fear" vs. "Greed").

  • Benefit: Clustering organizes the data physically, making filter operations like WHERE sentiment_label = 'Extreme Fear' much more efficient.
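In dbt, the combined strategy corresponds to a model config block like the following (a sketch with the model body elided; the file path is an assumption):

```sql
-- models/marts/fct_crypto_daily_metrics.sql
{{ config(
    materialized = 'table',
    partition_by = {"field": "date_day", "data_type": "date", "granularity": "day"},
    cluster_by   = ["sentiment_label"]
) }}
```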


6. Dashboard & Insights

The final analysis is presented in a Looker Studio dashboard featuring two core tiles:

  • Tile 1: Fear-Return Scatter Plot: Shows the correlation between low sentiment scores and positive 7-day forward returns.

  • Tile 2: Trend Divergence Chart: A dual-axis time series comparing BTC Price vs. 7-day Moving Average Sentiment.


Project Observations & Future Roadmap

As the current ingestion pipeline is in its initial phase, the following observations and planned enhancements have been identified:

  • Current Status (Proof of Concept): The pipeline successfully demonstrates the end-to-end integration of Binance trade data, CoinGecko prices, and Alternative.me sentiment indices.

  • Data Limitation: Due to the limited 30-day window and API rate limits during the initial run, current visualizations serve as a functional demonstration rather than a statistically significant backtest.

  • Future Roadmap:

    • Historical Backfilling: Implement a dedicated Spark job to backfill 2+ years of historical trade data to validate the "Buy the Fear" hypothesis at scale.

    • Advanced Modeling: Transition from simple moving averages to real-time anomaly detection for price volatility vs. sentiment shifts.

    • Expanded Scope: Ingest data for additional assets (ETH, SOL) to compare sentiment sensitivity across different market caps.


7. Data Quality & Testing

  • dbt Tests: Basic data integrity is ensured using dbt tests defined in schema.yml.

    • unique and not_null tests are applied to the date_day primary key.
  • Validation: Running dbt test before updating production tables prevents duplicate records or missing data from affecting the final dashboard.

  • Error Handling & Resilience:

    • Retry Logic: All Airflow tasks are configured with retries: 2 and a retry_delay of 5 minutes to handle transient network issues during API calls.
    • Alerting Framework: The DAGs include on_failure_callback hooks. While the production Slack/Email API keys are excluded for security, the logic is pre-integrated for easy enterprise deployment.
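The retry and alerting configuration described above maps onto a shared Airflow default_args dictionary along these lines (a sketch; the callback name is hypothetical, and the real Slack/Email wiring is excluded as noted):

```python
from datetime import timedelta

def notify_on_failure(context):
    """Placeholder callback: wire up Slack/Email here once API keys exist."""
    print(f"Task failed: {context.get('task_instance')}")

# Shared across DAGs, e.g.
# DAG("market_metadata_ingestion_v1", default_args=default_args, ...)
default_args = {
    "retries": 2,                          # re-run transient API/network failures
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_on_failure,
}
```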

About

Medallion ETL pipeline for crypto metrics. Tech: Python, Airflow, Terraform, GCP (GCS/BigQuery), and dbt Core.
