Modern ELT Pipeline with Python, DBT, Postgres, and Power BI

This project demonstrates a modern Extract, Load, Transform (ELT) pipeline for processing COVID-19 data, leveraging best practices for reproducibility, modularity, and automation. It replaces a previous Python-only ETL process with a more robust ELT architecture, utilizing specialized tools for each stage of the data pipeline.

ELT Pipeline Architecture


Previous ETL Pipeline Architecture

Link to previous ETL project: View on GitHub


Visualization

Link to Report: View on Power BI Cloud

Overview

The pipeline extracts COVID-19 data, loads it into a PostgreSQL database, transforms it using DBT, and visualizes it with Power BI. Docker ensures reproducibility across environments, and a scheduler automates daily runs at 12:00 AM.

Technologies Used:

  • Python: Extracts data from a source and loads it into Postgres.
  • PostgreSQL: Stores raw and transformed data.
  • DBT: Transforms data.
  • Power BI: Visualizes transformed data in dashboards.
  • Docker: Containerizes each service for consistent deployment; a shared Docker network lets the containers communicate (see the compose sketch after this list).
  • Scheduler: Automates daily pipeline execution using cron jobs.
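
A minimal docker-compose.yml sketch of this layout (service names follow this README; image tags, ports, and other details are assumptions and may differ from the actual file):

# Sketch only: the real docker-compose.yml in this repository may differ.
services:
  postgres:
    image: postgres:15          # assumed version
    container_name: cde_postgres
    env_file: .env
    ports:
      - "5432:5432"
  python:
    build: ./python             # runs extract_load.py once, then exits
    env_file: .env
    depends_on:
      - postgres
  dbt:
    build: ./dbt                # runs the DBT models once, then exits
    env_file: .env
    depends_on:
      - python
  scheduler:
    build: ./scheduler          # cron-based daily trigger
    depends_on:
      - dbt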

Objectives

  • Build an efficient, reproducible ELT pipeline.
  • Share container images on Docker Hub for easy reuse without rebuilding.
  • Provide clear documentation for setup and execution by any user.

Project Structure

cde_python_postgres_dbt_project/
│
├── README.md              # Project documentation and setup instructions
├── .env                   # Environment variables for database credentials and configuration
├── docker-compose.yml     # Orchestrates Postgres, Python, DBT, and scheduler services
│
├── python/                # Python ETL service
│   ├── requirements.txt   # Python dependencies for the ETL script
│   ├── Dockerfile         # Builds the Python ETL container
│   └── extract_load.py    # Extracts and loads data into staging_covid_data table
│
├── dbt/                   # DBT transformation service
│   ├── Dockerfile         # Builds the DBT container
│   ├── entrypoint.sh      # Sets up DBT environment and runs DBT commands
│   ├── dbt_project.yml    # Configures DBT project and models
│   └── models/            # DBT models and schema definitions
│       ├── schema.yml     # Defines source tables (e.g., staging_covid_data)
│       └── transform_model.sql  # Transforms data into dw_covid_data
│
└── scheduler/             # Scheduler service for automation
    ├── Dockerfile         # Builds the scheduler container
    └── cronjob.sh         # Defines cron jobs that trigger the Python and DBT containers
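
For illustration, a minimal sketch of what extract_load.py might contain (the source URL is a placeholder and the actual script in this repository may differ):

# extract_load.py (sketch): extract COVID-19 data and load it unchanged into
# the staging table; all transformation is left to DBT.
import os

import pandas as pd
from sqlalchemy import create_engine

SOURCE_URL = "https://example.com/covid.csv"  # placeholder, not the real source

def main() -> None:
    # Extract: read the raw data into a DataFrame.
    df = pd.read_csv(SOURCE_URL)

    # Load: connect with the credentials from .env and replace the staging table.
    engine = create_engine(
        "postgresql://{u}:{p}@{h}:{port}/{db}".format(
            u=os.environ["POSTGRES_USER"],
            p=os.environ["POSTGRES_PASSWORD"],
            h=os.environ["POSTGRES_HOST"],
            port=os.environ.get("POSTGRES_PORT", "5432"),
            db=os.environ["POSTGRES_DB"],
        )
    )
    df.to_sql("staging_covid_data", engine, schema="public", if_exists="replace", index=False)

if __name__ == "__main__":
    main()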

Prerequisites

  • Docker: Install Docker Desktop or Docker CLI.

  • Bash Shell: Available on Linux/macOS or Git Bash/WSL on Windows.

  • Internet Connection: Required for pulling/building Docker images.

  • PostgreSQL Client (optional): For querying the database (e.g., psql or DBeaver).

  • A visualization tool (optional): For building dashboards (e.g., Power BI Desktop).


Setup Instructions

  1. Clone the Repository
git clone https://github.com/gBEN1-bit/cde_python_postgres_dbt_project.git
cd cde_python_postgres_dbt_project
  2. Configure Environment Variables: assign values to the variables in the .env file:
POSTGRES_USER=
POSTGRES_PASSWORD=
POSTGRES_DB=
POSTGRES_HOST=
POSTGRES_PORT=5432
DOCKERHUB_USERNAME=
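For example (illustrative values only; choose your own credentials — these match the database and user used in the verification commands below, and POSTGRES_HOST is assumed to be the Postgres service name on the Docker network):
POSTGRES_USER=postgres
POSTGRES_PASSWORD=change_me
POSTGRES_DB=cde_db
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
DOCKERHUB_USERNAME=your_dockerhub_username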
  3. Start Docker: ensure Docker Desktop is running (Windows/macOS) or the Docker daemon is active (Linux).

  4. Run the Pipeline:

docker-compose up -d

This starts the postgres, python, dbt, and scheduler containers. The python service loads data into staging_covid_data, the dbt service transforms it, and the scheduler re-runs the pipeline daily at 12:00 AM. Note that the python and dbt containers exit once they finish their work.
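
The daily trigger lives in scheduler/cronjob.sh. A sketch of such a crontab entry (the container names and restart approach are assumptions; the actual script may differ):

# Crontab entry (sketch): at 12:00 AM every day, re-run the exited extract/load
# and transform containers in order; `docker start -a` waits for each to finish.
0 0 * * * docker start -a cde_python && docker start -a cde_dbt

This assumes the scheduler container can reach the Docker daemon (e.g., via a mounted Docker socket).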

  5. Verify the Output
  • Check container logs:
docker-compose logs --tail=100
  • Confirm that the staging_covid_data and dw_covid_data tables exist in Postgres:
docker exec -it cde_postgres psql -U postgres -d cde_db -c "\dt public.*"
docker exec -it cde_postgres psql -U postgres -d cde_db -c "SELECT * FROM public.staging_covid_data LIMIT 5;"
docker exec -it cde_postgres psql -U postgres -d cde_db -c "SELECT * FROM public.dw_covid_data LIMIT 5;"
  6. Visualize with a tool of your choice (e.g., Power BI)
  • Connect the visualization tool to the Postgres database using the credentials from .env.
  • Use the public.dw_covid_data table to build dashboards.
  • Example connection settings:
      • Host: localhost (or the container's IP if remote)
      • Port: 5432
      • Database: (from .env)
      • User: (from .env)
      • Password: (from .env)
  7. Push Images to Docker Hub (Optional)
  • Log in to Docker Hub:
docker login
  • Build and push images, replacing dockerhub_USERNAME with your Docker Hub username:
docker build -t dockerhub_USERNAME/cde-python:latest ./python
docker push dockerhub_USERNAME/cde-python:latest

docker build -t dockerhub_USERNAME/cde-dbt:latest ./dbt
docker push dockerhub_USERNAME/cde-dbt:latest

docker build -t dockerhub_USERNAME/cde-scheduler:latest ./scheduler
docker push dockerhub_USERNAME/cde-scheduler:latest

To Pull the Images Without Rebuilding the Project

  • Pull images from Docker Hub: if you prefer not to build the images locally, pull them (replace dockerhub_USERNAME with the actual username):
docker pull dockerhub_USERNAME/cde-python:latest
docker pull dockerhub_USERNAME/cde-dbt:latest
docker pull dockerhub_USERNAME/cde-scheduler:latest
  • Then, run the pipeline:
docker-compose up -d

Key Best Practices Used in the Project

  • Environment Variables: Secure configuration using .env files.
  • Healthchecks: Ensure Postgres is ready before the Python/DBT steps run (see the compose sketch after this list).
  • Dynamic DBT Configuration: Generate profiles.yml dynamically in entrypoint.sh (see the entrypoint sketch after this list).
  • Containerization: Use Docker for consistent, reproducible environments.
  • Separation of Concerns: Dedicated containers for extraction (Python), transformation (DBT), and scheduling.
  • Logging: Comprehensive logs for debugging and monitoring.
  • Schema Management: Explicit schema definitions to avoid issues like public_public.
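
Two of these practices, sketched concretely (key names, file paths, and the profile name are assumptions; the repository's actual files may differ).

A Postgres healthcheck that gates downstream services in docker-compose.yml:

services:
  postgres:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 5s
      timeout: 5s
      retries: 10
  python:
    depends_on:
      postgres:
        condition: service_healthy

And a dynamically generated profiles.yml in entrypoint.sh:

#!/bin/sh
# entrypoint.sh (sketch): render profiles.yml from the .env variables, then run
# the models. The profile name "cde_dbt_project" is an assumption.
mkdir -p /root/.dbt
cat > /root/.dbt/profiles.yml <<EOF
cde_dbt_project:
  target: prod
  outputs:
    prod:
      type: postgres
      host: ${POSTGRES_HOST}
      port: ${POSTGRES_PORT}
      user: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}
      dbname: ${POSTGRES_DB}
      schema: public
      threads: 1
EOF
dbt run --profiles-dir /root/.dbt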

Troubleshooting

  • Table Not Found: verify that public.dw_covid_data exists:
docker exec -it cde_postgres psql -U postgres -d cde_db -c "\dt public.*"
  • Check DBT logs for errors:
docker logs cde_dbt
  • Schema Issues (e.g., public_public): Ensure dbt_project.yml and transform_model.sql do not override the schema unnecessarily (see the macro sketch below).

  • Drop incorrect tables:

docker exec -it cde_postgres psql -U postgres -d cde_db -c "DROP TABLE IF EXISTS public_public.dw_covid_data; 
DROP SCHEMA IF EXISTS public_public CASCADE;"
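
The public_public pattern usually comes from dbt's default generate_schema_name macro, which renders a custom schema as <target_schema>_<custom_schema>; with a target schema of public and a model that also sets schema: public, the result is public_public. One common fix, sketched below, is to override the macro (this repository may or may not include such a file):

-- macros/generate_schema_name.sql (sketch): use a custom schema verbatim
-- instead of dbt's default "<target>_<custom>" concatenation.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
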
  • Container Failures: rebuild with:
docker-compose build --no-cache

Data Source

The pipeline processes COVID-19 data (e.g., from an API), loaded into staging_covid_data with the columns Date, Country, Confirmed, Recovered, and Deaths. The dw_covid_data table holds the transformed, analysis-ready version of this data.
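
A sketch of what models/transform_model.sql could look like, assuming the staging columns listed above (the source name, aliases, and casts are assumptions; the actual model may differ):

-- transform_model.sql (sketch): standardize names and types from the staging
-- table into the analysis-ready warehouse table.
{{ config(materialized='table', alias='dw_covid_data') }}

select
    cast("Date" as date)        as report_date,
    "Country"                   as country,
    cast("Confirmed" as bigint) as confirmed,
    cast("Recovered" as bigint) as recovered,
    cast("Deaths" as bigint)    as deaths
from {{ source('covid', 'staging_covid_data') }}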

Contributing

Contributions are welcome! Please submit a pull request or open an issue on the GitHub repository.
