Modern ELT Pipeline with Python, DBT, Postgres, and Power BI

This project demonstrates a modern Extract, Load, Transform (ELT) pipeline for processing COVID-19 data, leveraging best practices for reproducibility, modularity, and automation. It replaces a previous Python-only ETL process with a more robust ELT architecture, utilizing specialized tools for each stage of the data pipeline.

ELT Pipeline Architecture


Previous ETL Pipeline Architecture

Link to previous ETL project: View on GitHub


Visualization

Link to Report: View on Power BI Cloud

Overview

The pipeline extracts COVID-19 data, loads it into a PostgreSQL database, transforms it using DBT, and visualizes it with Power BI. Docker ensures reproducibility across environments, and a scheduler automates daily runs at 12:00 AM.

Technologies Used:

  • Python: Extracts data from a source and loads it into Postgres.
  • PostgreSQL: Stores raw and transformed data.
  • DBT: Transforms data.
  • Power BI: Visualizes transformed data in dashboards.
  • Docker: Containerizes each service for consistent deployment; a shared Docker network lets the containers communicate (see the compose sketch after this list).
  • Scheduler: Automates daily pipeline execution using cron jobs.
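
A minimal docker-compose.yml sketch of this layout (service names follow this README; image tags, ports, and other details are assumptions and may differ from the actual file):

# Sketch only: the real docker-compose.yml in this repository may differ.
services:
  postgres:
    image: postgres:15          # assumed version
    container_name: cde_postgres
    env_file: .env
    ports:
      - "5432:5432"
  python:
    build: ./python             # runs extract_load.py once, then exits
    env_file: .env
    depends_on:
      - postgres
  dbt:
    build: ./dbt                # runs the DBT models once, then exits
    env_file: .env
    depends_on:
      - python
  scheduler:
    build: ./scheduler          # cron-based daily trigger
    depends_on:
      - dbt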

Objectives

  • Build an efficient, reproducible ELT pipeline.
  • Share container images on Docker Hub for easy reuse without rebuilding.
  • Provide clear documentation for setup and execution by any user.

Project Structure

cde_python_postgres_dbt_project/
│
├── README.md              # Project documentation and setup instructions
├── .env                   # Environment variables for database credentials and configuration
├── docker-compose.yml     # Orchestrates Postgres, Python, DBT, and scheduler services
│
├── python/                # Python ETL service
│   ├── requirements.txt   # Python dependencies for the ETL script
│   ├── Dockerfile         # Builds the Python ETL container
│   └── extract_load.py    # Extracts and loads data into staging_covid_data table
│
├── dbt/                   # DBT transformation service
│   ├── Dockerfile         # Builds the DBT container
│   ├── entrypoint.sh      # Sets up DBT environment and runs DBT commands
│   ├── dbt_project.yml    # Configures DBT project and models
│   └── models/            # DBT models and schema definitions
│       ├── schema.yml     # Defines source tables (e.g., staging_covid_data)
│       └── transform_model.sql  # Transforms data into dw_covid_data
│
└── scheduler/             # Scheduler service for automation
    ├── Dockerfile         # Builds the scheduler container
    └── cronjob.sh         # Defines cron jobs that trigger the Python and DBT containers
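
For illustration, a minimal sketch of what extract_load.py might contain (the source URL is a placeholder and the actual script in this repository may differ):

# extract_load.py (sketch): extract COVID-19 data and load it unchanged into
# the staging table; all transformation is left to DBT.
import os

import pandas as pd
from sqlalchemy import create_engine

SOURCE_URL = "https://example.com/covid.csv"  # placeholder, not the real source

def main() -> None:
    # Extract: read the raw data into a DataFrame.
    df = pd.read_csv(SOURCE_URL)

    # Load: connect with the credentials from .env and replace the staging table.
    engine = create_engine(
        "postgresql://{u}:{p}@{h}:{port}/{db}".format(
            u=os.environ["POSTGRES_USER"],
            p=os.environ["POSTGRES_PASSWORD"],
            h=os.environ["POSTGRES_HOST"],
            port=os.environ.get("POSTGRES_PORT", "5432"),
            db=os.environ["POSTGRES_DB"],
        )
    )
    df.to_sql("staging_covid_data", engine, schema="public", if_exists="replace", index=False)

if __name__ == "__main__":
    main()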

Prerequisites

  • Docker: Install Docker Desktop or Docker CLI.

  • Bash Shell: Available on Linux/macOS or Git Bash/WSL on Windows.

  • Internet Connection: Required for pulling/building Docker images.

  • PostgreSQL Client (optional): For querying the database (e.g., psql or DBeaver).

  • A visualization tool (optional): For building dashboards (e.g., Power BI Desktop).


Setup Instructions

  1. Clone the Repository
git clone https://github.com/gBEN1-bit/cde_python_postgres_dbt_project.git
cd cde_python_postgres_dbt_project
  2. Configure Environment Variables: assign values to the variables in the .env file:
POSTGRES_USER=
POSTGRES_PASSWORD=
POSTGRES_DB=
POSTGRES_HOST=
POSTGRES_PORT=5432
DOCKERHUB_USERNAME=
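For example (illustrative values only; choose your own credentials — these match the database and user used in the verification commands below, and POSTGRES_HOST is assumed to be the Postgres service name on the Docker network):
POSTGRES_USER=postgres
POSTGRES_PASSWORD=change_me
POSTGRES_DB=cde_db
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
DOCKERHUB_USERNAME=your_dockerhub_username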
  3. Start Docker: ensure Docker Desktop is running (Windows/macOS) or the Docker daemon is active (Linux).

  4. Run the Pipeline:

docker-compose up -d

This starts the postgres, python, dbt, and scheduler containers. The python service loads data into staging_covid_data, the dbt service transforms it, and the scheduler re-runs the pipeline daily at 12:00 AM. Note that the python and dbt containers exit once they finish their work.
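
The daily trigger lives in scheduler/cronjob.sh. A sketch of such a crontab entry (the container names and restart approach are assumptions; the actual script may differ):

# Crontab entry (sketch): at 12:00 AM every day, re-run the exited extract/load
# and transform containers in order; `docker start -a` waits for each to finish.
0 0 * * * docker start -a cde_python && docker start -a cde_dbt

This assumes the scheduler container can reach the Docker daemon (e.g., via a mounted Docker socket).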

  5. Verify the Output
  • Check container logs:
docker-compose logs --tail=100
  • Confirm that the staging_covid_data and dw_covid_data tables exist in Postgres:
docker exec -it cde_postgres psql -U postgres -d cde_db -c "\dt public.*"
docker exec -it cde_postgres psql -U postgres -d cde_db -c "SELECT * FROM public.staging_covid_data LIMIT 5;"
docker exec -it cde_postgres psql -U postgres -d cde_db -c "SELECT * FROM public.dw_covid_data LIMIT 5;"
  6. Visualize with a tool of your choice (e.g., Power BI)
  • Connect the visualization tool to the Postgres database using the credentials from .env.
  • Use the public.dw_covid_data table to build dashboards.
  • Example connection settings:
      • Host: localhost (or the container's IP if remote)
      • Port: 5432
      • Database: (from .env)
      • User: (from .env)
      • Password: (from .env)
  7. Push Images to Docker Hub (Optional)
  • Log in to Docker Hub:
docker login
  • Build and push images, replacing dockerhub_USERNAME with your Docker Hub username:
docker build -t dockerhub_USERNAME/cde-python:latest ./python
docker push dockerhub_USERNAME/cde-python:latest

docker build -t dockerhub_USERNAME/cde-dbt:latest ./dbt
docker push dockerhub_USERNAME/cde-dbt:latest

docker build -t dockerhub_USERNAME/cde-scheduler:latest ./scheduler
docker push dockerhub_USERNAME/cde-scheduler:latest

To Pull the Images Without Rebuilding the Project

  • Pull images from Docker Hub: if you prefer not to build the images locally, pull them (replace dockerhub_USERNAME with the actual username):
docker pull dockerhub_USERNAME/cde-python:latest
docker pull dockerhub_USERNAME/cde-dbt:latest
docker pull dockerhub_USERNAME/cde-scheduler:latest
  • Then, run the pipeline:
docker-compose up -d

Key Best Practices Used in the Project

  • Environment Variables: Secure configuration using .env files.
  • Healthchecks: Ensure Postgres is ready before the Python/DBT steps run (see the compose sketch after this list).
  • Dynamic DBT Configuration: Generate profiles.yml dynamically in entrypoint.sh (see the entrypoint sketch after this list).
  • Containerization: Use Docker for consistent, reproducible environments.
  • Separation of Concerns: Dedicated containers for extraction (Python), transformation (DBT), and scheduling.
  • Logging: Comprehensive logs for debugging and monitoring.
  • Schema Management: Explicit schema definitions to avoid issues like public_public.
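
Two of these practices, sketched concretely (key names, file paths, and the profile name are assumptions; the repository's actual files may differ).

A Postgres healthcheck that gates downstream services in docker-compose.yml:

services:
  postgres:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 5s
      timeout: 5s
      retries: 10
  python:
    depends_on:
      postgres:
        condition: service_healthy

And a dynamically generated profiles.yml in entrypoint.sh:

#!/bin/sh
# entrypoint.sh (sketch): render profiles.yml from the .env variables, then run
# the models. The profile name "cde_dbt_project" is an assumption.
mkdir -p /root/.dbt
cat > /root/.dbt/profiles.yml <<EOF
cde_dbt_project:
  target: prod
  outputs:
    prod:
      type: postgres
      host: ${POSTGRES_HOST}
      port: ${POSTGRES_PORT}
      user: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}
      dbname: ${POSTGRES_DB}
      schema: public
      threads: 1
EOF
dbt run --profiles-dir /root/.dbt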

Troubleshooting

  • Table Not Found: verify that public.dw_covid_data exists:
docker exec -it cde_postgres psql -U postgres -d cde_db -c "\dt public.*"
  • Check DBT logs for errors:
docker logs cde_dbt
  • Schema Issues (e.g., public_public): Ensure dbt_project.yml and transform_model.sql do not override the schema unnecessarily (see the macro sketch below).

  • Drop incorrect tables:

docker exec -it cde_postgres psql -U postgres -d cde_db -c "DROP TABLE IF EXISTS public_public.dw_covid_data; 
DROP SCHEMA IF EXISTS public_public CASCADE;"
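
The public_public pattern usually comes from dbt's default generate_schema_name macro, which renders a custom schema as <target_schema>_<custom_schema>; with a target schema of public and a model that also sets schema: public, the result is public_public. One common fix, sketched below, is to override the macro (this repository may or may not include such a file):

-- macros/generate_schema_name.sql (sketch): use a custom schema verbatim
-- instead of dbt's default "<target>_<custom>" concatenation.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
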
  • Container Failures: rebuild with:
docker-compose build --no-cache

Data Source

The pipeline processes COVID-19 data (e.g., from an API), loaded into staging_covid_data with the columns Date, Country, Confirmed, Recovered, and Deaths. The dw_covid_data table holds the transformed, analysis-ready version of this data.
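
A sketch of what models/transform_model.sql could look like, assuming the staging columns listed above (the source name, aliases, and casts are assumptions; the actual model may differ):

-- transform_model.sql (sketch): standardize names and types from the staging
-- table into the analysis-ready warehouse table.
{{ config(materialized='table', alias='dw_covid_data') }}

select
    cast("Date" as date)        as report_date,
    "Country"                   as country,
    cast("Confirmed" as bigint) as confirmed,
    cast("Recovered" as bigint) as recovered,
    cast("Deaths" as bigint)    as deaths
from {{ source('covid', 'staging_covid_data') }}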

Contributing

Contributions are welcome! Please submit a pull request or open an issue on the GitHub repository.
