Welcome to your data engineering project! The goal is to simulate a real-world commercial task by building a complete ETL (Extract, Transform, Load) pipeline.
You will be responsible for designing, building, and running a Python application that reads data from multiple sources (files and a database), processes it, and generates a final analysis report.
This project will challenge you to apply everything you've learned about Python, OOP, SOLID, Pandas, NumPy, and PostgreSQL.
- OOP (Object-Oriented Programming): You will build a modular application using classes.
- SOLID Principles: Your code should be maintainable and extensible.
- Pandas: For loading, merging, and aggregating the data.
- NumPy: For efficient, vectorized numerical calculations.
- PostgreSQL: For connecting to, and reading from, a relational database.
- Pydantic Settings: For settings handling, using .env file for settings.
You are a data engineer at a new e-commerce company. Your data is fragmented:
- Business Data: Your company's
productcatalog andcustomerinformation live in a production PostgreSQL database. - Event Data: All user activity (clicks, purchases) is dumped as JSON event logs into a complex, nested zip file structure.
Your mission: Create an automated pipeline that will be runnned daily. It must read all the event data, enrich it with data from the database, and produce a final CSV report summarizing sales performance by product category and customer segment.
You will be working with two distinct data sources.
You must generate this data using the data_generator.py script.
- Structure: The script creates master zip files (e.g.,
data/events_week_42.zip). - Nesting:
- Inside the master zip are daily zip files (e.g.,
events_2023-10-23.zip). - Inside each daily zip are JSON part-files (e.g.,
part-001.json).
- Inside the master zip are daily zip files (e.g.,
- Event JSON Format:
[ { "timestamp": "...", "customer_id": "c123", "event_type": "view_product", "product_id": "p789" }, { "timestamp": "...", "customer_id": "c456", "event_type": "purchase", "product_id": "p101", "quantity": 2 } ]
You will run a Docker container that automatically creates and populates this database using the sql/init.sql script.
-
customerstable: Contains information on all 100 customers.customer_id join_date segment c001 2024-12-05 Regular c002 2025-07-21 VIP ... ... ... c100 2025-03-14 New -
productstable: Contains information on all 50 products.product_id product_name category price p001 Product Gamma 1 Electronics 149.99 p002 Product Alpha 2 Books 24.50 ... ... ... ... p050 Product Delta 50 Electronics 299.95
Follow these steps to set up your environment.
Let's create a virtual environment and install the dependencies.
# Synchronize packages via uv
uv sync
# Activate virtual environment if needed
source .venv/bin/activateThis project uses Docker to run the PostgreSQL database. The docker-compose.yml file is already configured.
# This command will start the database in the background.
# It will automatically find the `sql/init.sql` file and
# run it to create your tables and data.
docker compose up -dYour database is now running. You can connect to it with these credentials:
- Host:
localhost - Port:
5432 - User:
myuser - Password:
mypassword - Database:
ecommerce_db
Now, run the Python script to generate the raw event logs.
# This will create 50 weekly archives in a new 'data/' folder
python data_generator.py -c 50Follow these steps to set up your environment on Windows.
Important: Make sure you have Docker Desktop installed and running before you start.
Open your terminal (Command Prompt) to set up the uv environment.
# Synchronize packages via uv
uv sync
# Activate virtual environment if needed
.venv\Scripts\activate.bat
# Or .venv\Scripts\Activate.ps1 for powershellThis project uses Docker to run the PostgreSQL database. The docker-compose.yml file is already configured.
# This command will start the database in the background.
# It will automatically find the `sql/init.sql` file and
# run it to create your tables and data.
docker-compose up -dNow, run the Python script to generate the raw event logs.
# This will create 50 weekly archives in a new 'data/' folder
python data_generator.py -c 50You are now ready to build! Good luck =)
Your main task is to create the core pipeline logic (e.g., in a pipeline/ directory). Your pipeline must perform these Extract, Transform and Load steps:
- Extract (Files): Create a class that can navigate the nested zip structure (
data/*.zip->*.zip->*.json) and load all events into a single Pandas DataFrame. Also add functionality for batch processing for weak machines. - Extract (DB): Create a class that connects to the PostgreSQL database and loads the
customersandproductstables into two separate DataFrames (hint: usepd.read_sql). - Transform:
- Filter the events DataFrame to get only
purchaseevents. - Join the
purchaseevents with theproductsDataFrame onproduct_id. - Join the result with the
customersDataFrame oncustomer_id. - Feature Engineering: Create a new column
total_revenue=quantity*price. - Aggregate:
groupby()the DataFrame bycategoryandcustomer_segment. - Count: Calculate the
sumoftotal_revenue,sumofquantity(asunits_sold), and thenunique(count distinct) ofcustomer_id.
- Load: Save this final, aggregated DataFrame to a new file (e.g.,
reports/sales_report.csv).
| category | customer_segment | total_revenue | units_sold | unique_customers |
|---|---|---|---|---|
| Electronics | VIP | 14999.50 | 120 | 45 |
| Electronics | New | 8500.00 | 70 | 60 |
| Clothing | Regular | 5200.25 | 210 | 115 |
| Books | Lapsed | 1500.75 | 80 | 30 |
| ... | ... | ... | ... | ... |
If you finish the main task, try these:
- Unit Testing: Write
pytesttests for yourDataTransformerclass. - Logging: Add a proper
loggingmodule to your pipeline to log info and error messages. - Separate Reports: Create a separate
report.csvfor each product category.