# 🌿 Graves Greenery — In-Memory SQL (DuckDB + `%%sql`)

**Goal:** Run SQL directly in a notebook cell by putting `%%sql` on the first line and a query such as `SELECT * FROM dim_customers LIMIT 5;` below it, with **no external DB**. We load CSVs from your GitHub repo into an **in-memory** DuckDB database and name each table after its file stem (snake_case).


In [None]:
!pip -q install --upgrade duckdb duckdb-engine "sqlalchemy>=2.0" jupysql

import os, subprocess

REPO_USER = "danielsgraves"                 # GitHub repo owner
REPO_NAME = "Graves_Greenery_Analysis"
REPO_DIR  = f"/content/{REPO_NAME}"

# Clone on the first run; otherwise fast-forward the existing checkout.
# Passing arguments as a list avoids shell quoting issues.
if not os.path.exists(REPO_DIR):
    subprocess.run(
        ["git", "clone", "--depth", "1",
         f"https://github.com/{REPO_USER}/{REPO_NAME}.git", REPO_DIR],
        check=True,
    )
else:
    subprocess.run(["git", "-C", REPO_DIR, "pull", "--ff-only"], check=True)

print("Repo ready at:", REPO_DIR)
print("CSV root:", f"{REPO_DIR}/data")

## Connect one in-memory DuckDB session for `%%sql`
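- No `duckdb.connect()` calls, so there is no second connection with its own configuration (avoids dual-config conflicts)
- One `%sql` connection living entirely in memory; tables are lost when the kernel restarts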


In [None]:
%reload_ext sql
%sql duckdb:///:memory:
print("Connected %sql to in-memory DuckDB.")

## Load all CSVs with table names matching file stems
- File `data/sales/dim_customers.csv` → table **`dim_customers`**
- Runs of non-alphanumeric characters become a single `_`, names that start with a digit get a `t_` prefix, and the result is lowercased
- Shows a mapping preview before creating tables
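
To make the naming rules above concrete, here is a minimal sketch that applies them to a few file stems. The stems are hypothetical and chosen only for illustration; the function mirrors the rules listed above rather than adding new ones.

```python
import re

def to_snake(name: str) -> str:
    # Collapse each run of non-alphanumeric characters into one "_",
    # trim leading/trailing "_", prefix names that start with a digit,
    # then lowercase.
    s = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")
    if s and s[0].isdigit():
        s = "t_" + s
    return s.lower()

# Hypothetical file stems, for illustration only
examples = ["dim_customers", "Fact Sales (2023)", "2024-orders"]
mapping = {stem: to_snake(stem) for stem in examples}

for stem, table in mapping.items():
    print(f"{stem!r} -> {table!r}")
```

A stem that is already snake_case passes through unchanged, so `dim_customers.csv` keeps the table name `dim_customers`.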


In [None]:
import os, re, glob
from pathlib import Path

CSV_GLOB = "data/**/*.[cC][sS][vV]"   # case-insensitive .csv
INCLUDE_PARENT_PREFIX = False          # set True to prefix parent folder: e.g., sales_dim_customers

def to_snake(name: str) -> str:
    s = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")
    s = re.sub(r"_+", "_", s)
    if s and s[0].isdigit():
        s = "t_" + s
    return s.lower()

def table_name_for(csv_path: Path) -> str:
    stem = csv_path.stem
    # Prefix with the parent folder unless the file sits directly in the CSV root (data/).
    if INCLUDE_PARENT_PREFIX and csv_path.parent != Path(REPO_DIR) / "data":
        return to_snake(csv_path.parent.name + "_" + stem)
    return to_snake(stem)

files = [Path(p) for p in glob.glob(os.path.join(REPO_DIR, CSV_GLOB), recursive=True)]
files = [p for p in files if p.is_file()]
print(f"Found {len(files)} CSV(s). Showing first 15 mappings…")
preview = [(str(p.relative_to(REPO_DIR)), table_name_for(p)) for p in files[:15]]
for rel, tbl in preview:
    print(f"  {rel}  →  {tbl}")

# Create tables via the same %sql connection (no secondary connections)
for p in files:
    tbl = table_name_for(p)
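    # Minimal quoting: identifier in double quotes; path as a single-quoted
    # literal with any embedded single quotes doubled.
    path_sql = str(p).replace("'", "''")
    # Note: ignore_errors=True silently skips rows that fail to parse.
    q = f"""
    CREATE OR REPLACE TABLE "{tbl}" AS
    SELECT * FROM read_csv_auto('{path_sql}', header=True, sample_size=-1, ignore_errors=True);
    """
    get_ipython().run_cell_magic('sql', '', q)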

print("Loaded tables (first few):", [table_name_for(p) for p in files[:8]])

## Verify tables and run a simple test query
- If your repo has `data/**/dim_customers.csv`, the table will be **`dim_customers`**
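
Because table names come from file stems alone, two files with the same stem in different folders map to the same table, and `CREATE OR REPLACE` keeps whichever one loaded last. The sketch below shows one way to spot such clashes before loading. The helper `find_collisions` and the sample paths are hypothetical, not part of the notebook.

```python
import re
from collections import defaultdict
from pathlib import Path

def to_snake(name: str) -> str:
    # Same naming rules as the loader cell.
    s = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")
    if s and s[0].isdigit():
        s = "t_" + s
    return s.lower()

def find_collisions(paths):
    """Group paths by derived table name; keep names claimed by 2+ files."""
    by_table = defaultdict(list)
    for p in paths:
        by_table[to_snake(Path(p).stem)].append(str(p))
    return {tbl: ps for tbl, ps in by_table.items() if len(ps) > 1}

# Hypothetical paths, for illustration only
sample = [
    "data/sales/dim_customers.csv",
    "data/archive/Dim-Customers.csv",
    "data/sales/fact_orders.csv",
]
collisions = find_collisions(sample)

for table, paths in collisions.items():
    print(f"{table}: {paths}")
```

In the notebook, passing the `files` list from the loader cell would report any clashes; setting `INCLUDE_PARENT_PREFIX = True` is one way to keep such tables apart.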


In [None]:
%%sql
SELECT table_name
FROM information_schema.tables
WHERE table_schema='main'
ORDER BY table_name;

In [None]:
%%sql
-- Test query (adjust name if your file is different)
SELECT * FROM dim_customers LIMIT 5;