# Supermarket Sales Pipeline

Notebook-first implementation of a small data pipeline:
- Extract latest Kaggle dataset (via Kaggle API)
- Load to SQLite staging (bronze)
- Transform into two dimension tables and one fact table
- Run an example report query (joins + window functions)

References:
- Architecture diagram: `../docs/architecture_diagram.md`
- SQL scripts: `../sql/`

## 0) Setup expectations

This notebook is the *orchestrator* for the code in `src/`.

It expects a `.env` file at the repo root with at least:
- `KAGGLE_USERNAME`
- `KAGGLE_KEY`

Outputs:
- Raw files go under `./data/`
- SQLite DB goes under `./db/`
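
The credential expectation above can be checked before anything touches the Kaggle API. The following is a minimal sketch, not the project's own validation: `missing_env_vars` is a hypothetical helper, and it assumes the `.env` values have already been loaded into the process environment (how `load_settings()` does that is not shown in this notebook).

```python
import os

# The two variables this notebook expects in `.env`
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing credentials:", ", ".join(missing))
```

Passing a plain dict as `env` makes the check easy to exercise without touching real credentials.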

In [None]:
import sys
from pathlib import Path

# Put the repo root on sys.path so `src` is importable from this notebook.
# Assumes the notebook's working directory sits one level below the repo root.
repo_root = Path.cwd().parent
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

In [None]:
from src.config import load_settings
from src.logging_utils import configure_logging

settings = load_settings()
configure_logging(settings.log_level)
settings

## 1) Run the pipeline

This calls `run_pipeline()` from `src/runner.py`, which performs these steps:
1. Kaggle extract (latest)
2. Create SQLite tables
3. Load staging (bronze)
4. Build dimensions (`silver_dim_product_line` as SCD Type 1, `silver_dim_branch` as SCD Type 2)
5. Load `silver_fact_sales` idempotently
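
Steps 4 and 5 rest on two patterns that are easier to see in isolation. The sketch below is an illustration only and is not taken from `src/runner.py`: the tables `demo_dim_branch` and `demo_fact_sales`, their columns, and the sample values are all hypothetical. It assumes an SCD Type 2 dimension tracked with `valid_from` / `valid_to` / `is_current` columns, and an idempotent fact load implemented with a primary key plus `INSERT OR IGNORE`. The real pipeline may do either differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE demo_dim_branch (
        branch_key  INTEGER PRIMARY KEY AUTOINCREMENT,
        branch_code TEXT NOT NULL,
        city        TEXT NOT NULL,
        valid_from  TEXT NOT NULL,
        valid_to    TEXT,              -- NULL while the row is current
        is_current  INTEGER NOT NULL
    );
    CREATE TABLE demo_fact_sales (
        invoice_id TEXT PRIMARY KEY,   -- unique key makes reloads safe
        total      REAL NOT NULL
    );
    """
)


def apply_branch_scd2(conn, branch_code, city, as_of):
    """Close the current row and open a new version when `city` changes."""
    current = conn.execute(
        "SELECT city FROM demo_dim_branch WHERE branch_code = ? AND is_current = 1",
        (branch_code,),
    ).fetchone()
    if current is not None and current[0] == city:
        return  # unchanged, so no new version is written
    with conn:  # commit both statements together, roll back on error
        conn.execute(
            "UPDATE demo_dim_branch SET valid_to = ?, is_current = 0 "
            "WHERE branch_code = ? AND is_current = 1",
            (as_of, branch_code),
        )
        conn.execute(
            "INSERT INTO demo_dim_branch "
            "(branch_code, city, valid_from, valid_to, is_current) "
            "VALUES (?, ?, ?, NULL, 1)",
            (branch_code, city, as_of),
        )


def load_fact(conn, rows):
    """Insert rows, skipping any invoice_id that is already present."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO demo_fact_sales (invoice_id, total) VALUES (?, ?)",
            rows,
        )


apply_branch_scd2(conn, "A", "Old City", "2024-01-01")
apply_branch_scd2(conn, "A", "Old City", "2024-03-01")  # no change, no new row
apply_branch_scd2(conn, "A", "New City", "2024-06-01")  # change, new version

sales = [("INV-1", 10.0), ("INV-2", 25.5)]
load_fact(conn, sales)
load_fact(conn, sales)  # rerunning the load inserts nothing

branch_rows = conn.execute(
    "SELECT city, valid_to, is_current FROM demo_dim_branch ORDER BY branch_key"
).fetchall()
fact_count = conn.execute("SELECT COUNT(*) FROM demo_fact_sales").fetchone()[0]
conn.close()

print(branch_rows)
print(fact_count)
```

A Type 1 dimension would instead overwrite `city` in place and keep no history, which is why it needs none of the validity columns.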

In [None]:
from src.runner import run_pipeline

run_pipeline()

## 2) Quick validations

Print row counts for the staging, dimension, and fact tables.

In [None]:
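import sqlite3
from contextlib import closing

tables = ["bronze_sales_raw", "silver_dim_product_line", "silver_dim_branch", "silver_fact_sales"]

# closing() releases the connection even if one of the queries fails
with closing(sqlite3.connect(settings.sqlite_db_path)) as conn:
    for table in tables:
        # Table names come from the fixed list above, so the f-string is safe here
        count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        print(table, count)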

## 3) Example analytical report

This runs one of the SQL files under `../sql/`, a report query that combines joins with window functions.
Swap in a different file once you have settled on the final questions to present.
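
For readers unfamiliar with the window-function part, here is a small self-contained sketch of a "top 3 per group" ranking. It is not the contents of the SQL file used below: the `demo_sales` table, its columns, and the sample rows are invented for illustration, and the join against the dimension tables is left out to keep the example short. It assumes a SQLite build with window-function support (3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_sales (branch TEXT, product_line TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO demo_sales VALUES (?, ?, ?)",
    [
        ("A", "Food", 100.0),
        ("A", "Food", 50.0),
        ("A", "Health", 120.0),
        ("A", "Sports", 90.0),
        ("A", "Fashion", 30.0),
        ("B", "Food", 70.0),
        ("B", "Fashion", 80.0),
    ],
)

query = """
WITH revenue_by_line AS (
    -- Aggregate first: one row per branch and product line
    SELECT branch, product_line, SUM(revenue) AS total_revenue
    FROM demo_sales
    GROUP BY branch, product_line
),
ranked AS (
    -- Rank product lines inside each branch, highest revenue first
    SELECT branch, product_line, total_revenue,
           RANK() OVER (PARTITION BY branch ORDER BY total_revenue DESC) AS revenue_rank
    FROM revenue_by_line
)
SELECT branch, product_line, total_revenue, revenue_rank
FROM ranked
WHERE revenue_rank <= 3
ORDER BY branch, revenue_rank
"""

top_rows = conn.execute(query).fetchall()
conn.close()

for row in top_rows:
    print(row)
```

The filter on `revenue_rank` has to sit in an outer query because a window function cannot be referenced in the `WHERE` clause of the `SELECT` that computes it. With `RANK()`, ties share a rank, so a branch can return more than three rows; `ROW_NUMBER()` would cap it at exactly three.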

In [None]:
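import sqlite3
from contextlib import closing

import pandas as pd

report_path = repo_root / "sql" / "14.Top 3 Product Lines per Branch (Revenue Rank).sql"
report_sql = report_path.read_text(encoding="utf-8")

# closing() releases the connection even if the query raises
with closing(sqlite3.connect(settings.sqlite_db_path)) as conn:
    df_report = pd.read_sql_query(report_sql, conn)

df_report.head(20)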