This project demonstrates a modern Extract, Load, Transform (ELT) pipeline for processing COVID-19 data, leveraging best practices for reproducibility, modularity, and automation. It replaces a previous Python-only ETL process with a more robust ELT architecture, utilizing specialized tools for each stage of the data pipeline.
Link to previous ETL project: View on GitHub
Link to Report: View on PowerBI Cloud
The pipeline extracts COVID-19 data, loads it into a PostgreSQL database, transforms it using DBT, and visualizes it with Power BI. Docker ensures reproducibility across environments, and a scheduler automates daily runs at 12:00 AM.
Technologies Used:
- Python: Extracts data from a source and loads it into Postgres.
- PostgreSQL: Stores raw and transformed data.
- DBT: Transforms data.
- Power BI: Visualizes transformed data in dashboards.
- Docker: Containerizes services for consistent deployment; a Docker network enables communication between containers.
- Scheduler: Automates daily pipeline execution using cron jobs.
Project Objectives:
- Build an efficient, reproducible ELT pipeline.
- Share container images on Docker Hub for easy reuse without rebuilding.
- Provide clear documentation for setup and execution by any user.
Project Structure:
cde_python_dbt_project/
│
├── README.md # Project documentation and setup instructions
├── .env # Environment variables for database credentials and configuration
├── docker-compose.yml # Orchestrates Postgres, Python, DBT, and scheduler services
│
├── python/ # Python ETL service
│ ├── requirements.txt # Python dependencies for the ETL script
│ ├── Dockerfile # Builds the Python ETL container
│ └── extract_load.py # Extracts and loads data into staging_covid_data table
│
├── dbt/ # DBT transformation service
│ ├── Dockerfile # Builds the DBT container
│ ├── entrypoint.sh # Sets up DBT environment and runs DBT commands
│ ├── dbt_project.yml # Configures DBT project and models
│ └── models/ # DBT models and schema definitions
│ ├── schema.yml # Defines source tables (e.g., staging_covid_data)
│ └── transform_model.sql # Transforms data into dw_covid_data
│
└── scheduler/ # Scheduler service for automation
├── Dockerfile # Builds the scheduler container
└── cronjob.sh # Defines cron jobs to trigger Python and DBT containers

Prerequisites:
- Docker: Install Docker Desktop or Docker CLI.
- Bash Shell: Available on Linux/macOS or Git Bash/WSL on Windows.
- Internet Connection: Required for pulling/building Docker images.
- PostgreSQL Client (optional): For querying the database (e.g., psql or DBeaver).
- A Visualization Tool (optional): For building visualizations (e.g., Power BI Desktop).
Setup and Execution:
- Clone the Repository:
git clone https://github.com/gBEN1-bit/cde_python_postgres_dbt_project.git
cd cde_python_postgres_dbt_project
- Configure Environment Variables: Assign values to the variables in the .env file:
POSTGRES_USER=
POSTGRES_PASSWORD=
POSTGRES_DB=
POSTGRES_HOST=
POSTGRES_PORT=5432
DOCKERHUB_USERNAME=
- Start Docker Desktop: Ensure Docker Desktop is running (Windows/macOS) or the Docker daemon is active (Linux).
- Run the Pipeline:
docker-compose up -d
This starts the postgres, python, dbt, and scheduler containers. The python service loads data into staging_covid_data, the dbt service transforms it, and the scheduler runs the pipeline daily at 12:00 AM. Note that the python and dbt containers exit once they finish their respective processes.
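For orientation, the extraction step run by the python service can be pictured as a small script along the following lines. This is a minimal, hypothetical sketch assuming a CSV source URL, pandas, and SQLAlchemy; the actual extract_load.py in python/ may differ:

```python
# Hypothetical sketch of python/extract_load.py; the real script may differ.
import os

import pandas as pd
from sqlalchemy import create_engine

# Database credentials come from the .env file injected by docker-compose.
user = os.environ["POSTGRES_USER"]
password = os.environ["POSTGRES_PASSWORD"]
host = os.environ.get("POSTGRES_HOST", "postgres")
port = os.environ.get("POSTGRES_PORT", "5432")
db = os.environ["POSTGRES_DB"]

# Placeholder source URL, not necessarily the dataset used by this project.
SOURCE_URL = "https://example.com/covid_data.csv"

def extract_load() -> None:
    df = pd.read_csv(SOURCE_URL)  # extract raw COVID-19 records
    engine = create_engine(f"postgresql://{user}:{password}@{host}:{port}/{db}")
    # Load raw rows into the staging table; DBT performs the transformation later.
    df.to_sql("staging_covid_data", engine, schema="public",
              if_exists="replace", index=False)

if __name__ == "__main__":
    extract_load()
```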
- Verify the Output
- Check container logs:
docker-compose logs --tail=100
- Confirm the tables staging_covid_data and dw_covid_data exist in Postgres:
docker exec -it cde_postgres psql -U postgres -d cde_db -c "\dt public.*"
docker exec -it cde_postgres psql -U postgres -d cde_db -c "SELECT * FROM public.staging_covid_data LIMIT 5;"
docker exec -it cde_postgres psql -U postgres -d cde_db -c "SELECT * FROM public.dw_covid_data LIMIT 5;"
- Visualize with any visualization tool of your choice (e.g., Power BI)
- Connect the visualization tool (e.g., Power BI) to the Postgres database using the credentials from .env (a quick connection check is sketched after the settings below).
- Use the public.dw_covid_data table to build dashboards.
- Example connection settings:
- Host: localhost (or container IP if remote)
- Port: 5432
- Database: (from .env)
- User: (from .env)
- Password: (from .env)
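If you want to sanity-check these settings before opening Power BI, a quick connectivity test from Python might look like the following. This is a hypothetical sketch assuming psycopg2 is installed on the host and the postgres container publishes port 5432:

```python
# Hypothetical connection check using the same settings the dashboard will use.
import os

import psycopg2

conn = psycopg2.connect(
    host="localhost",                           # or the container IP if remote
    port=int(os.environ.get("POSTGRES_PORT", "5432")),
    dbname=os.environ["POSTGRES_DB"],           # values taken from your .env file
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM public.dw_covid_data;")
    print("Rows available for dashboards:", cur.fetchone()[0])
conn.close()
```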
- Push Images to Docker Hub (Optional)
- Log in to Docker Hub:
docker login
- Build and push images, replacing dockerhub_USERNAME with your Docker Hub username:
docker build -t dockerhub_USERNAME/cde-python:latest ./python
docker push dockerhub_USERNAME/cde-python:latest
docker build -t dockerhub_USERNAME/cde-dbt:latest ./dbt
docker push dockerhub_USERNAME/cde-dbt:latest
docker build -t dockerhub_USERNAME/cde-scheduler:latest ./scheduler
docker push dockerhub_USERNAME/cde-scheduler:latest
- Pull Images from Docker Hub: If you prefer not to build the images locally, pull them from Docker Hub (replace dockerhub_USERNAME with the actual username):
docker pull dockerhub_USERNAME/cde-python:latest
docker pull dockerhub_USERNAME/cde-dbt:latest
docker pull dockerhub_USERNAME/cde-scheduler:latest
- Then, run the pipeline:
docker-compose up -d

Best Practices:
- Environment Variables: Secure configuration using .env files.
- Healthchecks: Ensure Postgres is ready before Python/DBT execution (see the readiness-check sketch after this list).
- Dynamic DBT Configuration: Generate profiles.yml dynamically in entrypoint.sh.
- Containerization: Use Docker for consistent, reproducible environments.
- Separation of Concerns: Dedicated containers for extraction (Python), transformation (DBT), and scheduling.
- Logging: Comprehensive logs for debugging and monitoring.
- Schema Management: Explicit schema definitions to avoid issues like public_public.
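The healthcheck practice above keeps the Python and DBT services from starting before Postgres accepts connections. As an illustration of the same idea at the application level, the Python service could guard its run with a small retry loop. This is a hedged sketch using psycopg2, not the repo's actual code:

```python
# Hypothetical application-level readiness check, analogous to a compose healthcheck.
import os
import time

import psycopg2

def wait_for_postgres(max_attempts: int = 30, delay: float = 2.0) -> None:
    """Retry until Postgres accepts connections, then return."""
    dsn = (
        f"dbname={os.environ['POSTGRES_DB']} "
        f"user={os.environ['POSTGRES_USER']} "
        f"password={os.environ['POSTGRES_PASSWORD']} "
        f"host={os.environ.get('POSTGRES_HOST', 'postgres')} "
        f"port={os.environ.get('POSTGRES_PORT', '5432')}"
    )
    for attempt in range(1, max_attempts + 1):
        try:
            psycopg2.connect(dsn).close()
            print(f"Postgres ready after {attempt} attempt(s).")
            return
        except psycopg2.OperationalError:
            time.sleep(delay)
    raise RuntimeError("Postgres did not become ready in time.")
```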
Troubleshooting:
- Table Not Found: Verify that public.dw_covid_data exists:
docker exec -it cde_postgres psql -U postgres -d cde_db -c "\dt public.*"
- Check DBT logs for errors:
docker logs cde_dbt
- Schema Issues (e.g., public_public): Ensure dbt_project.yml and transform_model.sql do not override the schema unnecessarily. Drop incorrect tables:
docker exec -it cde_postgres psql -U postgres -d cde_db -c "DROP TABLE IF EXISTS public_public.dw_covid_data; DROP SCHEMA IF EXISTS public_public CASCADE;"
- Container Failures: Rebuild with:
docker-compose build --no-cache

Sample Data:
The pipeline processes COVID-19 data (e.g., from an API), loaded into staging_covid_data with the columns Date, Country, Confirmed, Recovered, and Deaths. The dw_covid_data table holds the transformed data for analysis.
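Once the pipeline has run, a quick way to inspect both tables from a notebook or script is shown below. This sketch assumes pandas and SQLAlchemy on the host and uses placeholder credentials; substitute the values from your .env file:

```python
# Hypothetical spot-check of the staging and warehouse tables with pandas.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials; replace with the values from your .env file.
engine = create_engine("postgresql://postgres:postgres@localhost:5432/cde_db")

staging = pd.read_sql("SELECT * FROM public.staging_covid_data", engine)
print(staging.shape)   # raw row/column count from the Python load
print(staging.head())  # expect Date, Country, Confirmed, Recovered, Deaths columns

transformed = pd.read_sql("SELECT * FROM public.dw_covid_data LIMIT 5", engine)
print(transformed)     # spot-check the DBT output that feeds the dashboards
```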
Contributions are welcome! Please submit a pull request or open an issue on the GitHub repository.


