Part 1: Python Scripting, ETL, and Data Modeling

In [1]:

import pandas as pd
import sqlalchemy
import numpy as np

# --- Step 1: Extract ---
# Load sales and customer data
sales_df = pd.read_csv("../data/sales_data.csv")
customers_df = pd.read_csv("../data/customers_data.csv")

# --- Step 2: Transform ---
# Add TotalSales = QuantitySold * Price
sales_df["TotalSales"] = sales_df["QuantitySold"] * sales_df["Price"]

print("Transformed Data:")
display(sales_df)



Transformed Data:


Unnamed: 0,Date,ProductID,ProductName,QuantitySold,Price,Category,CustomerID,TotalSales
0,2023-02-01,9,Gadget I,10,10.0,Gadgets,102,100.0
1,2023-02-02,10,Gizmo J,3,11.0,Gizmos,103,33.0
2,2023-02-03,1,Widget A,6,2.5,Gadgets,104,15.0
3,2023-02-04,2,Gadget B,9,3.0,Gadgets,105,27.0
4,2023-02-05,3,Widget C,8,4.0,Widgets,101,32.0
5,2023-02-06,4,Gizmo D,6,5.0,Gizmos,102,30.0
6,2023-02-07,5,Widget E,7,6.0,Widgets,103,42.0
7,2023-02-08,6,Gadget F,5,7.0,Gadgets,104,35.0
8,2023-02-09,7,Gizmo G,2,8.0,Gizmos,105,16.0
9,2023-02-10,8,Widget H,11,9.0,Widgets,101,99.0
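The derived column can be sanity-checked before anything is loaded. The following is a minimal sketch, not part of the original notebook: `validate_sales` is a hypothetical helper, and the three-row sample frame only reuses column names and values from the output above.

```python
import pandas as pd


def validate_sales(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems found in a sales frame."""
    required = ["QuantitySold", "Price", "TotalSales"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        # Without the measure columns the remaining checks cannot run
        return [f"missing columns: {missing}"]

    problems = []
    if df[required].isna().any().any():
        problems.append("null values in measure columns")
    if (df["QuantitySold"] < 0).any() or (df["Price"] < 0).any():
        problems.append("negative quantity or price")
    expected = df["QuantitySold"] * df["Price"]
    if ((df["TotalSales"] - expected).abs() > 1e-9).any():
        problems.append("TotalSales does not equal QuantitySold * Price")
    return problems


# Hypothetical sample built from the first three rows of the output above
sample = pd.DataFrame({
    "QuantitySold": [10, 3, 6],
    "Price": [10.0, 11.0, 2.5],
})
sample["TotalSales"] = sample["QuantitySold"] * sample["Price"]

problems = validate_sales(sample)
print(problems or "no problems found")
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a single pass; whether that suits a given pipeline is a design choice.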


In [2]:


# Generate dim_date for 2023 to support the DateID foreign key

date_range = pd.date_range(start="2023-01-01", end="2023-12-31", freq="D")
dim_date = pd.DataFrame({
    "DateID": date_range.date,
    "Day": date_range.day,
    "Month": date_range.month,
    "Quarter": date_range.quarter,
    "Year": date_range.year
})

display(dim_date)

Unnamed: 0,DateID,Day,Month,Quarter,Year
0,2023-01-01,1,1,1,2023
1,2023-01-02,2,1,1,2023
2,2023-01-03,3,1,1,2023
3,2023-01-04,4,1,1,2023
4,2023-01-05,5,1,1,2023
...,...,...,...,...,...
360,2023-12-27,27,12,4,2023
361,2023-12-28,28,12,4,2023
362,2023-12-29,29,12,4,2023
363,2023-12-30,30,12,4,2023
364,2023-12-31,31,12,4,2023
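The cell above uses the calendar date itself as the key. A common alternative in dimensional modeling is an integer key in YYYYMMDD form. The sketch below shows one way that could look. The `DateKey` column name and the `dim_date_int` frame are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

date_range = pd.date_range(start="2023-01-01", end="2023-12-31", freq="D")

dim_date_int = pd.DataFrame({
    # Hypothetical integer surrogate key, e.g. 2023-02-01 -> 20230201
    "DateKey": date_range.year * 10000 + date_range.month * 100 + date_range.day,
    "Date": date_range.date,
    "Day": date_range.day,
    "Month": date_range.month,
    "Quarter": date_range.quarter,
    "Year": date_range.year,
})

# A key column must identify each row uniquely
assert dim_date_int["DateKey"].is_unique
print(dim_date_int.head())
```

If this variant were adopted, the fact table would need the same integer key in place of the date value, so both sides of the join stay the same type.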


In [3]:
# --- Step 3: Load ---
# Create SQLite engine 
from sqlalchemy import create_engine
engine = create_engine("sqlite:///sales_dw.db", echo=True)


# Convert 'Date' column to datetime.date objects for SQLite compatibility
sales_df["DateID"] = pd.to_datetime(sales_df["Date"]).dt.date
# Write tables to SQL database
sales_df.to_sql("fact_sales", con=engine, if_exists="replace", index=False,
                dtype={
                    "DateID": sqlalchemy.types.Date(),
                    "ProductID": sqlalchemy.types.Integer(),
                    "CustomerID": sqlalchemy.types.Integer(),
                    "ProductName": sqlalchemy.types.String(),
                    "QuantitySold": sqlalchemy.types.Integer(),
                    "Price": sqlalchemy.types.Float(),
                    "TotalSales": sqlalchemy.types.Float()
                })

# Load only distinct products into dim_product
dim_product_df = sales_df[["ProductID", "ProductName", "Category"]].drop_duplicates()

# Write the de-duplicated product frame (not the full sales_df) to dim_product
dim_product_df.to_sql("dim_product", con=engine, if_exists="replace", index=False,
                      dtype={
                          "ProductID": sqlalchemy.types.Integer(),
                          "ProductName": sqlalchemy.types.String(),
                          "Category": sqlalchemy.types.String(),
                      })

customers_df.to_sql("dim_customer", con=engine, if_exists="replace", index=False,
                    dtype={
                        "CustomerID": sqlalchemy.types.Integer(),
                        "CustomerName": sqlalchemy.types.String(),
                        "CustomerEmail": sqlalchemy.types.String(),
                        "CustomerLocation": sqlalchemy.types.String()
                    })

# DateID holds datetime.date values, so it is declared as Date to match
# fact_sales.DateID; dim_date has no separate Date column
dim_date.to_sql("dim_date", con=engine, if_exists="replace", index=False,
                dtype={
                    "DateID": sqlalchemy.types.Date(),
                    "Day": sqlalchemy.types.Integer(),
                    "Month": sqlalchemy.types.Integer(),
                    "Quarter": sqlalchemy.types.Integer(),
                    "Year": sqlalchemy.types.Integer()
                })

print("Data loaded successfully into database.")


2025-09-20 22:43:48,148 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-09-20 22:43:48,153 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("fact_sales")
2025-09-20 22:43:48,154 INFO sqlalchemy.engine.Engine [raw sql] ()
2025-09-20 22:43:48,156 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("fact_sales")
2025-09-20 22:43:48,158 INFO sqlalchemy.engine.Engine [raw sql] ()
2025-09-20 22:43:48,159 INFO sqlalchemy.engine.Engine SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite~_%' ESCAPE '~' ORDER BY name
2025-09-20 22:43:48,160 INFO sqlalchemy.engine.Engine [raw sql] ()
2025-09-20 22:43:48,161 INFO sqlalchemy.engine.Engine SELECT name FROM sqlite_master WHERE type='view' AND name NOT LIKE 'sqlite~_%' ESCAPE '~' ORDER BY name
2025-09-20 22:43:48,162 INFO sqlalchemy.engine.Engine [raw sql] ()
2025-09-20 22:43:48,164 INFO sqlalchemy.engine.Engine PRAGMA main.table_xinfo("fact_sales")
2025-09-20 22:43:48,165 INFO sqlalchemy.engine.Engine [raw sql] ()
...
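A success message alone does not show that the fact rows line up with the dimensions. The sketch below illustrates one possible post-load check: it lists key values in the fact table that have no matching dimension row. It runs against an in-memory SQLite database with made-up rows rather than `sales_dw.db`, and `orphan_keys` plus the customer names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for sales_dw.db
conn.executescript("""
CREATE TABLE dim_customer (CustomerID INTEGER, CustomerName TEXT);
CREATE TABLE fact_sales (DateID TEXT, ProductID INTEGER,
                         CustomerID INTEGER, TotalSales REAL);
INSERT INTO dim_customer VALUES (101, 'Customer A'), (102, 'Customer B');
INSERT INTO fact_sales VALUES
    ('2023-02-01', 9, 102, 100.0),
    ('2023-02-05', 3, 101, 32.0),
    ('2023-02-02', 10, 103, 33.0);  -- 103 has no dimension row in this sample
""")


def orphan_keys(conn, fact, dim, key):
    """Return fact-table key values with no matching row in the dimension.

    Table and column names are interpolated directly into the SQL, so they
    must come from trusted code, never from user input.
    """
    sql = f"""
        SELECT DISTINCT f.{key}
        FROM {fact} AS f
        LEFT JOIN {dim} AS d ON f.{key} = d.{key}
        WHERE d.{key} IS NULL
        ORDER BY f.{key}
    """
    return [row[0] for row in conn.execute(sql)]


print(orphan_keys(conn, "fact_sales", "dim_customer", "CustomerID"))  # → [103]
```

An empty list for every fact/dimension pair would indicate that each key in the fact table resolves to a dimension row.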

Star schema

The warehouse is built as a dimensional model in the classic Kimball style, which suits analytics and reporting.

fact_sales
Keys: DateID, ProductID, CustomerID (foreign keys to the dimension tables)
Measures: QuantitySold, Price, TotalSales

Dimension tables store descriptive attributes that provide context to the facts:

dim_customer: who bought the product.
dim_product: what was sold.
dim_date: when it happened.

Why Use a Star Schema?

Simplicity – Easy for business analysts and data scientists to understand: one fact table joined directly to each dimension.
Performance – Aggregations (SUM, COUNT, AVG, etc.) run over a narrow fact table and need only one join per dimension.
Scalability – The fact table can grow to many rows while the dimension tables stay comparatively small.
Flexibility – New dimensions can be added without redesigning the existing tables.
Compatibility – BI tools such as Power BI, Tableau, and Looker work well with star schemas.
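To make the aggregation point concrete, the sketch below runs a typical star-schema query: the fact table is joined to two dimensions and the measures are summed per category and month. It uses an in-memory SQLite database seeded with three rows taken from the sales output earlier, not the notebook's `sales_dw.db`. The simplified table definitions are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (ProductID INTEGER PRIMARY KEY,
                          ProductName TEXT, Category TEXT);
CREATE TABLE dim_date (DateID TEXT PRIMARY KEY,
                       Month INTEGER, Quarter INTEGER, Year INTEGER);
CREATE TABLE fact_sales (DateID TEXT, ProductID INTEGER, CustomerID INTEGER,
                         QuantitySold INTEGER, Price REAL, TotalSales REAL);

INSERT INTO dim_product VALUES
    (3, 'Widget C', 'Widgets'), (4, 'Gizmo D', 'Gizmos'), (5, 'Widget E', 'Widgets');
INSERT INTO dim_date VALUES
    ('2023-02-05', 2, 1, 2023), ('2023-02-06', 2, 1, 2023), ('2023-02-07', 2, 1, 2023);
INSERT INTO fact_sales VALUES
    ('2023-02-05', 3, 101, 8, 4.0, 32.0),
    ('2023-02-06', 4, 102, 6, 5.0, 30.0),
    ('2023-02-07', 5, 103, 7, 6.0, 42.0);
""")

# One join per dimension, then aggregate the measures from the fact table
query = """
SELECT p.Category, d.Month,
       SUM(f.TotalSales)   AS Revenue,
       SUM(f.QuantitySold) AS Units
FROM fact_sales AS f
JOIN dim_product AS p ON p.ProductID = f.ProductID
JOIN dim_date    AS d ON d.DateID = f.DateID
GROUP BY p.Category, d.Month
ORDER BY p.Category
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# ('Gizmos', 2, 30.0, 6)
# ('Widgets', 2, 74.0, 15)
```

Slicing by a different attribute, such as quarter or product name, only changes the GROUP BY columns; the join shape stays the same.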

Entity model:

                                      dim_customer
                                  +--------------------+
                                  | CustomerID (PK)    |
                                  | CustomerName       |
                                  | CustomerEmail      |
                                  | CustomerLocation   |
                                  +--------------------+
                                            |
                                            |
                                       fact_sales
                              +--------------------------+
                              |  DateID (FK)             |
                              |  ProductID (FK)          |
                              |  CustomerID (FK)         |
                              |  QuantitySold            |
                              |  Price                   |
                              |  TotalSales              |
                              +--------------------------+
                                    |                   |
                                    |                   |
                              +-------------+       +----------------+
                              | dim_date    |       | dim_product    |
                              |-------------|       |----------------|
                              | DateID (PK) |       | ProductID (PK) |
                              | Day         |       | ProductName    |
                              | Month       |       | Category       |
                              | Quarter     |       +----------------+
                              | Year        |
                              +-------------+
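The PK and FK labels in the diagram are not enforced by the database when tables are written with `DataFrame.to_sql`, which, as far as I know, creates plain tables without primary or foreign key constraints. The sketch below shows one way the same model could be declared with explicit constraints in SQLite. The DDL, the TEXT type chosen for `DateID`, and the sample rows are assumptions for illustration, not the notebook's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this is enabled per connection
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE dim_customer (
    CustomerID       INTEGER PRIMARY KEY,
    CustomerName     TEXT,
    CustomerEmail    TEXT,
    CustomerLocation TEXT
);
CREATE TABLE dim_product (
    ProductID   INTEGER PRIMARY KEY,
    ProductName TEXT,
    Category    TEXT
);
CREATE TABLE dim_date (
    DateID  TEXT PRIMARY KEY,  -- ISO date string, e.g. '2023-02-05'
    Day     INTEGER,
    Month   INTEGER,
    Quarter INTEGER,
    Year    INTEGER
);
CREATE TABLE fact_sales (
    DateID       TEXT    NOT NULL REFERENCES dim_date (DateID),
    ProductID    INTEGER NOT NULL REFERENCES dim_product (ProductID),
    CustomerID   INTEGER NOT NULL REFERENCES dim_customer (CustomerID),
    QuantitySold INTEGER,
    Price        REAL,
    TotalSales   REAL
);
""")

# Hypothetical dimension rows, followed by one valid fact row
conn.execute("INSERT INTO dim_customer VALUES (101, 'Customer A', 'a@example.com', 'Somewhere')")
conn.execute("INSERT INTO dim_product VALUES (3, 'Widget C', 'Widgets')")
conn.execute("INSERT INTO dim_date VALUES ('2023-02-05', 5, 2, 1, 2023)")
conn.execute("INSERT INTO fact_sales VALUES ('2023-02-05', 3, 101, 8, 4.0, 32.0)")

# A fact row pointing at a non-existent customer should be rejected
try:
    conn.execute("INSERT INTO fact_sales VALUES ('2023-02-05', 3, 999, 8, 4.0, 32.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("orphan row rejected:", rejected)
```

With constraints like these in place, rows would be appended to the pre-created tables (for example with `if_exists="append"`) rather than letting `to_sql` replace them, since replacing a table drops its constraints.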
