atomicDataDev/ecommerce_proj

🚀 E-commerce Analytics Pipeline

Welcome to your data engineering project! The goal is to simulate a real-world commercial task by building a complete ETL (Extract, Transform, Load) pipeline.

You will be responsible for designing, building, and running a Python application that reads data from multiple sources (files and a database), processes it, and generates a final analysis report.

This project will challenge you to apply everything you've learned about Python, OOP, SOLID, Pandas, NumPy, and PostgreSQL.

🎯 Core Concepts to Apply

  • OOP (Object-Oriented Programming): You will build a modular application using classes.
  • SOLID Principles: Your code should be maintainable and extensible.
  • Pandas: For loading, merging, and aggregating the data.
  • NumPy: For efficient, vectorized numerical calculations.
  • PostgreSQL: For connecting to, and reading from, a relational database.
  • Pydantic Settings: For loading configuration (such as database credentials) from a .env file.

📖 The Task: Your Scenario

You are a data engineer at a new e-commerce company. Your data is fragmented:

  1. Business Data: Your company's product catalog and customer information live in a production PostgreSQL database.
  2. Event Data: All user activity (clicks, purchases) is dumped as JSON event logs into a complex, nested zip file structure.

Your mission: Create an automated pipeline that will run daily. It must read all the event data, enrich it with data from the database, and produce a final CSV report summarizing sales performance by product category and customer segment.


🗃️ The Data Sources

You will be working with two distinct data sources.

1. File System: Event Logs

You must generate this data using the data_generator.py script.

  • Structure: The script creates master zip files (e.g., data/events_week_42.zip).
  • Nesting:
    • Inside the master zip are daily zip files (e.g., events_2023-10-23.zip).
    • Inside each daily zip are JSON part-files (e.g., part-001.json).
  • Event JSON Format:
    [
      {
        "timestamp": "...",
        "customer_id": "c123",
        "event_type": "view_product",
        "product_id": "p789"
      },
      {
        "timestamp": "...",
        "customer_id": "c456",
        "event_type": "purchase",
        "product_id": "p101",
        "quantity": 2
      }
    ]
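The nesting described above can be read without unpacking anything to disk. The following is a minimal sketch, not the reference solution: the function name `iter_event_batches` is hypothetical, and it assumes each daily zip is small enough to hold in memory. It yields one part-file's events at a time, which is one possible way to keep memory use bounded.

```python
import io
import json
import zipfile
from pathlib import Path
from typing import Iterator


def iter_event_batches(master_zip: Path) -> Iterator[list[dict]]:
    """Yield the events of one JSON part-file at a time from a master zip."""
    with zipfile.ZipFile(master_zip) as master:
        for daily_name in sorted(master.namelist()):
            if not daily_name.endswith(".zip"):
                continue
            # Assumption: a daily zip fits in memory. Wrap its raw bytes in
            # BytesIO so ZipFile can seek inside it without a temp file.
            daily_bytes = io.BytesIO(master.read(daily_name))
            with zipfile.ZipFile(daily_bytes) as daily:
                for part_name in sorted(daily.namelist()):
                    if part_name.endswith(".json"):
                        # Each part-file holds a JSON array of event objects.
                        yield json.loads(daily.read(part_name))
```

Each yielded batch is a list of dicts, so `pd.DataFrame(batch)` turns it into a DataFrame that can be filtered before the next batch is read.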

2. PostgreSQL Database

You will run a Docker container that automatically creates and populates this database using the sql/init.sql script.

  • customers table: Contains information on all 100 customers.

    customer_id  join_date   segment
    c001         2024-12-05  Regular
    c002         2025-07-21  VIP
    ...          ...         ...
    c100         2025-03-14  New
  • products table: Contains information on all 50 products.

    product_id  product_name      category     price
    p001        Product Gamma 1   Electronics  149.99
    p002        Product Alpha 2   Books        24.50
    ...         ...               ...          ...
    p050        Product Delta 50  Electronics  299.95
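As a rough illustration of loading both tables with `pd.read_sql`, here is a sketch under stated assumptions. The class name `DatabaseExtractor` is hypothetical, the column types are guessed from the samples above (the real ones are defined in sql/init.sql), and an in-memory SQLite database stands in for PostgreSQL so the example runs without Docker. Against the real database you would more likely pass a SQLAlchemy engine, which is the connection type pandas supports for PostgreSQL.

```python
import sqlite3

import pandas as pd


class DatabaseExtractor:
    """Loads whole tables into DataFrames through an open DB connection."""

    # Table names are interpolated into the SQL text, so only this fixed
    # allow-list is accepted.
    TABLES = ("customers", "products")

    def __init__(self, connection) -> None:
        self._connection = connection

    def load_table(self, name: str) -> pd.DataFrame:
        if name not in self.TABLES:
            raise ValueError(f"unknown table: {name}")
        return pd.read_sql(f"SELECT * FROM {name}", self._connection)


# Stand-in database (assumption: column types guessed from the samples).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, join_date TEXT, segment TEXT)")
conn.execute("INSERT INTO customers VALUES ('c001', '2024-12-05', 'Regular')")
conn.execute(
    "CREATE TABLE products "
    "(product_id TEXT, product_name TEXT, category TEXT, price REAL)"
)
conn.execute("INSERT INTO products VALUES ('p001', 'Product Gamma 1', 'Electronics', 149.99)")

extractor = DatabaseExtractor(conn)
customers = extractor.load_table("customers")
products = extractor.load_table("products")
```

Passing the connection into the constructor keeps the class independent of how the connection is created, which also makes it easy to test.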

🛠️ 🐧 Setup: How to Get Started (Linux)

Follow these steps to set up your environment.

Step 1: Set up the Python Environment

Let's create a virtual environment and install the dependencies.

# Synchronize packages via uv
uv sync

# Activate virtual environment if needed
source .venv/bin/activate

Step 2: Start the Database

This project uses Docker to run the PostgreSQL database. The docker-compose.yml file is already configured.

# This command will start the database in the background.
# It will automatically find the `sql/init.sql` file and
# run it to create your tables and data.

docker compose up -d

Your database is now running. You can connect to it with these credentials:

  • Host: localhost
  • Port: 5432
  • User: myuser
  • Password: mypassword
  • Database: ecommerce_db

Step 3: Generate the Event Data

Now, run the Python script to generate the raw event logs.

# This will create 50 weekly archives in a new 'data/' folder
python data_generator.py -c 50

🛠️ 🪟 Setup: How to Get Started (Windows)

Follow these steps to set up your environment on Windows.

Important: Make sure you have Docker Desktop installed and running before you start.

Step 1: Set up the Python Environment

Open your terminal (Command Prompt) to set up the uv environment.

# Synchronize packages via uv
uv sync

# Activate virtual environment if needed
.venv\Scripts\activate.bat
# Or .venv\Scripts\Activate.ps1 for PowerShell

Step 2: Start the Database

This project uses Docker to run the PostgreSQL database. The docker-compose.yml file is already configured.

# This command will start the database in the background.
# It will automatically find the `sql/init.sql` file and
# run it to create your tables and data.

docker compose up -d

Step 3: Generate the Event Data

Now, run the Python script to generate the raw event logs.

# This will create 50 weekly archives in a new 'data/' folder
python data_generator.py -c 50

You are now ready to build! Good luck =)

📋 Your Mission: The Pipeline

Your main task is to create the core pipeline logic (e.g., in a pipeline/ directory). Your pipeline must perform these Extract, Transform and Load steps:

  1. Extract (Files): Create a class that navigates the nested zip structure (data/*.zip -> *.zip -> *.json) and loads all events into a single Pandas DataFrame. Also support batch processing so the pipeline can run on machines with limited memory.
  2. Extract (DB): Create a class that connects to the PostgreSQL database and loads the customers and products tables into two separate DataFrames (hint: use pd.read_sql).
  3. Transform:
    • Filter the events DataFrame to keep only purchase events.
    • Join the purchase events with the products DataFrame on product_id.
    • Join the result with the customers DataFrame on customer_id.
    • Feature Engineering: Create a new column total_revenue = quantity * price.
    • Aggregate: groupby() the DataFrame by category and customer segment (the segment column of the customers table, reported as customer_segment).
    • Calculate: For each group, compute the sum of total_revenue, the sum of quantity (as units_sold), and the nunique (count distinct) of customer_id (as unique_customers).
  4. Load: Save this final, aggregated DataFrame to a new file (e.g., reports/sales_report.csv).
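The Transform step can be sketched roughly as follows. The function name `build_sales_report` and the tiny sample frames are made up for illustration. The sketch assumes inner joins (purchases with an unknown product or customer are dropped) and that the `segment` column is renamed to `customer_segment` to match the target report.

```python
import numpy as np
import pandas as pd


def build_sales_report(
    events: pd.DataFrame, products: pd.DataFrame, customers: pd.DataFrame
) -> pd.DataFrame:
    purchases = events[events["event_type"] == "purchase"]
    enriched = (
        purchases
        .merge(products, on="product_id", how="inner")
        .merge(customers, on="customer_id", how="inner")
        .rename(columns={"segment": "customer_segment"})
    )
    # Vectorized NumPy multiplication over the whole column at once.
    enriched["total_revenue"] = np.multiply(
        enriched["quantity"].to_numpy(dtype=float),
        enriched["price"].to_numpy(dtype=float),
    )
    report = enriched.groupby(["category", "customer_segment"], as_index=False).agg(
        total_revenue=("total_revenue", "sum"),
        units_sold=("quantity", "sum"),
        unique_customers=("customer_id", "nunique"),
    )
    # quantity is float because non-purchase events have no quantity (NaN).
    return report.astype({"units_sold": "int64"})


# Made-up sample data, only to exercise the function.
events = pd.DataFrame([
    {"customer_id": "c001", "event_type": "view_product", "product_id": "p001"},
    {"customer_id": "c001", "event_type": "purchase", "product_id": "p001", "quantity": 2},
    {"customer_id": "c002", "event_type": "purchase", "product_id": "p001", "quantity": 1},
    {"customer_id": "c001", "event_type": "purchase", "product_id": "p002", "quantity": 3},
])
products = pd.DataFrame({
    "product_id": ["p001", "p002"],
    "category": ["Electronics", "Books"],
    "price": [100.0, 10.0],
})
customers = pd.DataFrame({"customer_id": ["c001", "c002"], "segment": ["Regular", "VIP"]})

report = build_sales_report(events, products, customers)
print(report)
```

For the Load step, `report.to_csv("reports/sales_report.csv", index=False)` would write the result once the `reports/` directory exists.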

🏁 Final Report (The Target)

category     customer_segment  total_revenue  units_sold  unique_customers
Electronics  VIP               14999.50       120         45
Electronics  New               8500.00        70          60
Clothing     Regular           5200.25        210         115
Books        Lapsed            1500.75        80          30
...          ...               ...            ...         ...

⭐ Bonus Challenges

If you finish the main task, try these:

  • Unit Testing: Write pytest tests for your DataTransformer class.
  • Logging: Use Python's logging module in your pipeline to log info and error messages.
  • Separate Reports: Create a separate report.csv for each product category.
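The logging and separate-reports challenges could be combined along these lines. This is a sketch only: the function name `write_category_reports`, the logger name, and the `report_<category>.csv` file naming are hypothetical, and the input is assumed to be the aggregated report with a `category` column.

```python
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger("pipeline.load")


def write_category_reports(report: pd.DataFrame, out_dir: Path) -> list[Path]:
    """Write one CSV per product category and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for category, group in report.groupby("category"):
        # Hypothetical naming scheme: report_electronics.csv, report_books.csv, ...
        path = out_dir / f"report_{str(category).lower()}.csv"
        group.to_csv(path, index=False)
        logger.info("Wrote %d rows to %s", len(group), path)
        written.append(path)
    return written
```

Calling `logging.basicConfig(level=logging.INFO)` once at the pipeline's entry point makes these messages visible on the console.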

About

Educational task on building ETL pipeline
