# 01 – Storage and Organization (Filesystem and Optional Database)

This notebook comes **after** `00_data_acquisition.ipynb`.

It documents the storage and organization strategy used in the project and
(optionally) demonstrates how the processed data can be loaded into a
relational database.

Focus:

1. **Filesystem-based storage model**
   - How data are organized under `data/` (raw vs processed).
   - Naming conventions for CSV files and other artifacts.

2. **Optional RDBMS loading example (SQLite)**
   - How to load the integrated CSV into a SQLite database.
   - A simple SQL-style query to illustrate usage.

3. **Connection to extraction & enrichment**
   - How the integrated dataset produced by `scripts/integrate_data.py`
     / `02_integration.ipynb` fits into this storage layout.

In [20]:
from pathlib import Path
import sqlite3

import pandas as pd

PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROCESSED_DIR = DATA_DIR / "processed"

PROJECT_ROOT, DATA_DIR, RAW_DIR, PROCESSED_DIR

(PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal'),
 PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data'),
 PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data/raw'),
 PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data/processed'))

## 1. Filesystem structure

The project follows a **filesystem-based, tabular storage strategy** instead
of a full relational database. All data are stored as CSV files in a
structured folder layout inside the Git repository.

High-level layout (relative to the project root):

```text
.
├── data/
│   ├── raw/          # Raw input data downloaded from Kaggle
│   ├── processed/    # Cleaned, integrated, and derived CSV files
│   └── README.md     # Instructions for data acquisition & integrity checks
├── figures/          # Output plots and visualizations
├── notebooks/        # Jupyter notebooks (00_data_acquisition, 02_integration, etc.)
├── scripts/          # Python scripts (get_data.py, integrate_data.py, ...)
├── ProjectPlan.md
├── StatusReport.md
└── README.md         # Final project report
```

In [21]:
def print_tree(start_path: Path, max_depth: int = 3, prefix: str = ""):
    """
    Simple recursive directory tree printer (limited depth for readability).
    """
    if max_depth < 0:
        return
    
    items = sorted(start_path.iterdir(), key=lambda p: (p.is_file(), p.name))
    for i, item in enumerate(items):
        connector = "└── " if i == len(items) - 1 else "├── "
        print(prefix + connector + item.name)
        if item.is_dir():
            extension = "    " if i == len(items) - 1 else "│   "
            print_tree(item, max_depth=max_depth - 1, prefix=prefix + extension)

print(f"Project root: {PROJECT_ROOT}")
print("Directory tree under data/:")
print_tree(DATA_DIR, max_depth=2)

Project root: /Users/ujjwal/Downloads/IS-477-Project-Ujjwal
Directory tree under data/:
├── processed
│   ├── coffee_integrated.csv
│   ├── coffee_sales_clean.csv
│   └── coffee_shop_clean.csv
├── raw
│   ├── .DS_Store
│   ├── coffee-sales.zip
│   ├── coffee-shop.zip
│   ├── coffee_sales.csv
│   └── coffee_shop.csv
├── .DS_Store
├── README.md
├── checksums.sha256
└── coffee_project.sqlite


### 1.1 Naming conventions

Within `data/`, the project uses consistent naming conventions:

- **Raw data** (downloaded from Kaggle via `scripts/get_data.py`):
  - `data/raw/coffee_sales.csv`
  - `data/raw/coffee_shop.csv`

- **Cleaned data** (produced by the profiling and cleaning notebooks and scripts):
  - `data/processed/coffee_sales_clean.csv`
  - `data/processed/coffee_shop_clean.csv`

- **Integrated data** (produced by integration/enrichment):
  - `data/processed/coffee_integrated.csv`

Processed files follow a `<name>_<stage>.csv` pattern, where:
- `<name>` identifies the dataset (`coffee_sales`, `coffee_shop`, or just `coffee` for the merged output),
- `<stage>` is the processing stage (`clean`, `integrated`, etc.).

Raw files keep the bare dataset name with no stage suffix.

This layout keeps the data lifecycle visible as **raw → clean → integrated**.
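
As a minimal sketch of how the naming convention could be applied in code, the helper below splits a file name into a dataset name and a processing stage. The names `parse_data_filename` and `KNOWN_STAGES` are hypothetical and not part of the project's scripts; the sketch assumes that any file without a recognized stage suffix is a raw file.

```python
from pathlib import Path

# Hypothetical set of stage suffixes, taken from the file names listed above.
KNOWN_STAGES = {"clean", "integrated"}


def parse_data_filename(path):
    """Split a data file name into (name, stage); stage is None for raw files."""
    stem = Path(path).stem
    # Split on the last underscore only, so "coffee_sales_clean" keeps
    # "coffee_sales" together as the dataset name.
    name, sep, stage = stem.rpartition("_")
    if sep and stage in KNOWN_STAGES:
        return name, stage
    return stem, None


for example in ["coffee_sales.csv", "coffee_sales_clean.csv", "coffee_integrated.csv"]:
    print(example, "->", parse_data_filename(example))
```

A helper like this could be used to group the files under `data/` by stage without hard-coding each path.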

In [22]:
# glob() does not guarantee an order, so sort for a stable listing
processed_files = sorted(PROCESSED_DIR.glob("*.csv"))
processed_files

[PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data/processed/coffee_integrated.csv'),
 PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data/processed/coffee_sales_clean.csv'),
 PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data/processed/coffee_shop_clean.csv')]

In [23]:
sales_clean_path = PROCESSED_DIR / "coffee_sales_clean.csv"
shop_clean_path = PROCESSED_DIR / "coffee_shop_clean.csv"
integrated_path = PROCESSED_DIR / "coffee_integrated.csv"

sales_clean = pd.read_csv(sales_clean_path)
shop_clean = pd.read_csv(shop_clean_path)
integrated = pd.read_csv(integrated_path)

print("coffee_sales_clean.csv shape:", sales_clean.shape)
print("coffee_shop_clean.csv shape: ", shop_clean.shape)
print("coffee_integrated.csv shape: ", integrated.shape)

coffee_sales_clean.csv shape: (149116, 11)
coffee_shop_clean.csv shape:  (3547, 11)
coffee_integrated.csv shape:  (149116, 20)


## 2. Storage strategy recap

In summary:

- The project **does not rely on an external database** for its core workflow.
- Instead, it uses:
  - Raw CSVs in `data/raw/` (recreated by `scripts/get_data.py` and
    documented in `00_data_acquisition.ipynb` and `data/README.md`).
  - Cleaned & integrated CSVs in `data/processed/` (recreated by
    profiling and integration steps).

This approach:

- Keeps the project lightweight and easy to run on any machine.
- Aligns with the size/complexity of the datasets.
- Enhances reproducibility by making each stage visible as files
  (`raw` → `clean` → `integrated`).
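
Because each stage is visible as a file, a quick existence check can tell which stage of the pipeline still needs to be run. The sketch below is one possible way to do that; `EXPECTED_FILES` and `missing_stage_files` are hypothetical names, and the demo runs against a temporary directory that only contains the raw files rather than against the real `data/` folder.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the naming conventions described in this notebook.
EXPECTED_FILES = {
    "raw": ["raw/coffee_sales.csv", "raw/coffee_shop.csv"],
    "clean": ["processed/coffee_sales_clean.csv", "processed/coffee_shop_clean.csv"],
    "integrated": ["processed/coffee_integrated.csv"],
}


def missing_stage_files(data_dir):
    """Return {stage: [missing relative paths]} for stages with missing files."""
    data_dir = Path(data_dir)
    missing = {}
    for stage, rel_paths in EXPECTED_FILES.items():
        absent = [rel for rel in rel_paths if not (data_dir / rel).is_file()]
        if absent:
            missing[stage] = absent
    return missing


# Demo: a throwaway directory where only the raw stage has been produced.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "raw").mkdir()
    for name in ("coffee_sales.csv", "coffee_shop.csv"):
        (demo_dir / "raw" / name).touch()
    demo_result = missing_stage_files(demo_dir)

print(demo_result)
```

In this notebook, calling `missing_stage_files(DATA_DIR)` would be expected to return an empty dict once acquisition, cleaning, and integration have all been run.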

The next section shows an **optional** example of loading the integrated
data into a relational database (SQLite), to satisfy the "RDBMS" part of
the assignment rubric.

## 3. Optional: loading the integrated data into a relational database (SQLite)

The assignment mentions:

> Script(s) used to load data into a relational database (if you use a RDBMS).

The main pipeline uses a filesystem-based model, but to demonstrate how the
data could be used with a relational database, we will:

1. Create a SQLite database file under `data/`.
2. Load `coffee_integrated.csv` into a SQL table using `DataFrame.to_sql`.
3. Run a simple SQL query to summarize the data.

This section is **illustrative** and not required for the core
filesystem-based analysis.

In [24]:
# Path to a SQLite database file (will be created if not present)
sqlite_path = DATA_DIR / "coffee_project.sqlite"

# Connect to SQLite
conn = sqlite3.connect(sqlite_path)

# Make a copy of the integrated DataFrame and rename potentially conflicting columns
integrated_db = integrated.copy()

# SQLite column names are case-insensitive, so `weekday` and `Weekday` would
# collide if both are present; give them distinct names first
rename_map = {}
if "weekday" in integrated_db.columns:
    rename_map["weekday"] = "weekday_sales"
if "Weekday" in integrated_db.columns:
    rename_map["Weekday"] = "weekday_shop"

if rename_map:
    integrated_db = integrated_db.rename(columns=rename_map)

# Ensure a revenue column exists for the later SQL example
if "revenue" not in integrated_db.columns and {"transaction_qty", "unit_price"}.issubset(integrated_db.columns):
    integrated_db["revenue"] = integrated_db["transaction_qty"] * integrated_db["unit_price"]

# Write the DataFrame to a SQL table named 'coffee_integrated'
integrated_db.to_sql("coffee_integrated", conn, if_exists="replace", index=False)

sqlite_path

PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data/coffee_project.sqlite')

In [25]:
# Example query: average revenue per store_location, top 10

query = """
SELECT store_location,
       AVG(revenue) AS avg_revenue
FROM coffee_integrated
GROUP BY store_location
ORDER BY avg_revenue DESC
LIMIT 10;
"""

avg_rev_by_store = pd.read_sql(query, conn)
avg_rev_by_store

    store_location  avg_revenue
0  Lower Manhattan     4.814726
1   Hell's Kitchen     4.661696
2          Astoria     4.589891


In [26]:
conn.close()
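
The cells above open and close the connection in separate steps. As an alternative sketch, the same load-and-query pattern can be wrapped in `contextlib.closing`, so the connection is closed even if a query raises. Note that using a `sqlite3` connection directly in a `with` statement only manages the transaction and does not close the connection. The data below are made-up toy rows in an in-memory database, not project data, and the `toy_*` names are hypothetical.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Made-up rows for illustration only; not taken from the project data.
toy_df = pd.DataFrame(
    {
        "store_location": ["A", "A", "B"],
        "revenue": [2.0, 4.0, 5.0],
    }
)

toy_query = """
SELECT store_location,
       AVG(revenue) AS avg_revenue
FROM coffee_integrated
GROUP BY store_location
ORDER BY avg_revenue DESC;
"""

# closing() calls toy_conn.close() when the block exits, even on error.
with closing(sqlite3.connect(":memory:")) as toy_conn:
    toy_df.to_sql("coffee_integrated", toy_conn, if_exists="replace", index=False)
    toy_result = pd.read_sql_query(toy_query, toy_conn)

print(toy_result)
```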

## 4. Connection to extraction and enrichment

The integrated dataset used above (`coffee_integrated.csv`) is produced by
the extraction/enrichment steps in the project:

- **Script**: `scripts/integrate_data.py`
  - Derives `hour_of_day` from `transaction_time` in the sales data.
  - Aggregates the coffee shop data by `hour_of_day` to create an hourly profile.
  - Merges the two datasets on `hour_of_day` to create `coffee_integrated.csv`.

- **Notebook**: `02_integration.ipynb`
  - Documents and visualizes the same integration logic with sanity checks
    and basic summaries.
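
The three integration steps listed above can be sketched roughly as follows. This is not the code in `scripts/integrate_data.py`; it is a toy illustration under stated assumptions. The column names `transaction_time` and `hour_of_day` come from the description above, while the shop column `money`, the output columns `shop_avg_money` and `shop_n_rows`, and all data values are hypothetical.

```python
import pandas as pd

# Toy sales data; values are made up for illustration.
toy_sales = pd.DataFrame(
    {
        "transaction_time": ["07:15:00", "07:45:10", "08:05:30"],
        "transaction_qty": [1, 2, 1],
        "unit_price": [3.0, 2.5, 4.0],
    }
)

# Toy shop data; `money` is a hypothetical column name.
toy_shop = pd.DataFrame(
    {
        "hour_of_day": [7, 7, 8, 9],
        "money": [30.0, 34.0, 28.0, 25.0],
    }
)

# Step 1: derive hour_of_day from transaction_time in the sales data.
toy_sales["hour_of_day"] = (
    pd.to_datetime(toy_sales["transaction_time"], format="%H:%M:%S")
    .dt.hour.astype("int64")
)

# Step 2: aggregate the shop data into one row per hour (the hourly profile).
hourly_profile = toy_shop.groupby("hour_of_day", as_index=False).agg(
    shop_avg_money=("money", "mean"),
    shop_n_rows=("money", "size"),
)

# Step 3: merge on hour_of_day. A left join keeps every sales row, and
# validate="many_to_one" raises if the profile has duplicate hours.
integrated_sketch = toy_sales.merge(
    hourly_profile, on="hour_of_day", how="left", validate="many_to_one"
)

print(integrated_sketch)
```

A left, many-to-one merge is one plausible reading of the description, and it is consistent with the integrated file having the same number of rows as the cleaned sales file.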

From a storage perspective:

- The integrated CSV is stored under `data/processed/` following the
  project’s naming conventions.
- It can be:
  - Used directly as a CSV for analysis (e.g., in a profiling/EDA notebook),
  - Or loaded into a relational database (as shown above) for SQL-style
    querying or integration with other systems.
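
To illustrate the first option, the per-store average from the SQL example can also be computed directly in pandas, without a database. This is a minimal sketch on made-up toy rows; with the real data, `toy_df` would be replaced by the DataFrame read from `coffee_integrated.csv`, assuming it has `store_location` and `revenue` columns.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the project data.
toy_df = pd.DataFrame(
    {
        "store_location": ["A", "A", "B"],
        "revenue": [2.0, 4.0, 5.0],
    }
)

# pandas counterpart of:
#   SELECT store_location, AVG(revenue) ... GROUP BY ... ORDER BY ... DESC LIMIT 10
avg_rev_pandas = (
    toy_df.groupby("store_location", as_index=False)["revenue"]
    .mean()
    .rename(columns={"revenue": "avg_revenue"})
    .sort_values("avg_revenue", ascending=False)
    .head(10)
    .reset_index(drop=True)
)

print(avg_rev_pandas)
```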

This notebook therefore completes the documentation for:

- **Storage and organization** – filesystem layout and naming conventions.
- **RDBMS loading (optional)** – how the data can be represented in a
  relational database.
- The link between storage and **extraction & enrichment**, by showing
  where the integrated dataset lives and how it can be consumed.