# 🌿 Graves Greenery — single-connection DuckDB + `%sql`

This notebook avoids DuckDB's “two connections / different configuration” error, which is raised when the same database file is opened twice with different settings, by:
- **Never** calling `duckdb.connect()` directly
- Letting `%sql` open **one** connection to the DuckDB file
- Loading CSVs **through the same `%sql` connection**


In [None]:
# JupySQL provides the %sql magic. ipython-sql ships the same `sql` package, so install only one of them.
!pip -q install --upgrade duckdb duckdb-engine "sqlalchemy>=2.0" jupysql

import os, subprocess
REPO_USER = "danielsgraves"          # <-- correct owner
REPO_NAME = "Graves_Greenery_Analysis"
REPO_DIR  = f"/content/{REPO_NAME}"

if not os.path.exists(REPO_DIR):
    subprocess.run(
        f"git clone --depth 1 https://github.com/{REPO_USER}/{REPO_NAME}.git {REPO_DIR}",
        shell=True, check=True
    )
else:
    subprocess.run(f"git -C {REPO_DIR} pull --ff-only", shell=True, check=True)

print("Repo ready at:", REPO_DIR)
print("CSV root:", f"{REPO_DIR}/data")

## DB path (we’ll rebuild from CSV each run)
- We remove any stale DB file to prevent config conflicts, then reconnect cleanly with `%sql`.
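
As a minimal sketch of that cleanup step, the helper below deletes a database file and, if present, the write-ahead log that DuckDB keeps next to it as `<name>.wal`. The name `remove_stale_db` is hypothetical and not part of this notebook; the demo runs in a temporary directory so no real database is touched.

```python
import tempfile
from pathlib import Path

def remove_stale_db(db_path):
    """Delete db_path and its '<name>.wal' sibling; return the paths actually removed."""
    db = Path(db_path)
    removed = []
    for candidate in (db, db.with_name(db.name + ".wal")):
        if candidate.exists():
            candidate.unlink()
            removed.append(candidate)
    return removed

# Demo in a throwaway directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    demo_db = Path(tmp) / "demo.duckdb"
    demo_db.write_bytes(b"")                  # stand-in for a stale DB file
    first_pass = remove_stale_db(demo_db)     # removes the stand-in file
    second_pass = remove_stale_db(demo_db)    # nothing left to remove
    print([p.name for p in first_pass], second_pass)
```

Calling the helper a second time is harmless, which suits a notebook that may be re-run from the top.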


In [None]:
import os
DB_PATH = "/content/graves_greenery.duckdb"  # single DB file used everywhere

# Remove any stale DB file (and its .wal sidecar) to avoid the "same file, different configuration" error
for stale in (DB_PATH, DB_PATH + ".wal"):
    if os.path.exists(stale):
        os.remove(stale)
print("DB file set:", DB_PATH, "(any existing file was removed)")

## Connect once via `%sql` (no hand-made DBAPI connections or engines)
We let `%sql` open the only connection to the DB file.
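
The string handed to `%sql` is a SQLAlchemy-style URL, and for an absolute path it ends up with four slashes. The sketch below derives that URL from a path so the slash count cannot drift from the path itself. `duckdb_url` is a hypothetical helper, and it assumes the `duckdb:///` prefix used by `duckdb-engine`.

```python
from pathlib import PurePosixPath

def duckdb_url(db_path):
    """Build a SQLAlchemy-style URL for an absolute DuckDB file path."""
    path = PurePosixPath(db_path)
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {db_path!r}")
    # 'duckdb:///' plus the path's own leading '/' yields four slashes in total.
    return f"duckdb:///{path}"

url = duckdb_url("/content/graves_greenery.duckdb")
print(url)
```

Rejecting relative paths is a deliberate choice here: a relative path would resolve against the kernel's working directory, which can differ between runs.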

In [None]:
%reload_ext sql

# SQLAlchemy-style URL: `duckdb:///` plus the absolute path's leading `/` gives four slashes
%sql duckdb:////content/graves_greenery.duckdb
print("Connected %sql to:", DB_PATH)

In [None]:
%%sql
SELECT * FROM pragma_database_list();

## Load CSVs using the same `%sql` connection
We build `CREATE TABLE ... AS SELECT * FROM read_csv_auto(...)` statements and execute them through `%sql` so we never open a second connection.
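
A rough sketch of how such a statement can be assembled safely is shown below. `build_load_statement` is a hypothetical helper, and the CSV path in the demo is made up; the point is only that quotes in the table name and file path are doubled before they are embedded in the SQL text.

```python
def build_load_statement(table, csv_path):
    """Return a CREATE OR REPLACE TABLE statement that loads one CSV file."""
    safe_table = table.replace('"', '""')      # escape quotes inside the identifier
    safe_path = csv_path.replace("'", "''")    # escape quotes inside the string literal
    return (
        f'CREATE OR REPLACE TABLE "{safe_table}" AS\n'
        f"SELECT * FROM read_csv_auto('{safe_path}', header=True);"
    )

# Hypothetical path containing a single quote, to show the escaping.
stmt = build_load_statement("customers", "/content/repo/data/o'brien.csv")
print(stmt)
```

The returned string can then be passed to the `%sql` magic like any other statement, so no second connection is involved.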


In [None]:
import os, re, glob
from pathlib import Path

CSV_GLOB = "data/**/*.[cC][sS][vV]"  # case-insensitive .csv
INCLUDE_PARENT_IN_TABLE = False

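def slugify_table_name(path, include_parent=False):
    """Turn a CSV path into a lowercase, SQL-friendly table name."""
    p = Path(path)
    # Lowercase the stem and collapse anything outside [a-z0-9_] into underscores
    stem = re.sub(r'[^a-z0-9_]+','_', p.stem.lower()).strip('_')
    # Optionally prefix the parent folder name (skipped when there is no real parent directory)
    if include_parent and p.parent != p.parent.parent:
        parent = re.sub(r'[^a-z0-9_]+','_', p.parent.name.lower()).strip('_')
        stem = f"{parent}_{stem}"
    # Prefix names that start with a digit so they read as ordinary identifiers
    if re.match(r'^\d', stem):
        stem = 't_' + stem
    return stem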

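# Sort for a deterministic load order
files = sorted(glob.glob(os.path.join(REPO_DIR, CSV_GLOB), recursive=True))
print(f"Found {len(files)} CSV(s). Loading…")

# Note: files that share a stem in different folders overwrite each other
# unless INCLUDE_PARENT_IN_TABLE is True.
loaded = []
for f in files:
    tbl = slugify_table_name(f, INCLUDE_PARENT_IN_TABLE)
    safe_path = f.replace("'", "''")  # escape single quotes for the SQL string literal
    # ignore_errors=True skips rows that fail to parse, so a table may hold fewer rows than its file
    q = f"""
    CREATE OR REPLACE TABLE "{tbl}" AS
    SELECT * FROM read_csv_auto('{safe_path}', header=True, sample_size=-1, ignore_errors=True);
    """
    # Execute via the bound %sql connection
    get_ipython().run_cell_magic('sql', '', q)
    loaded.append((tbl, f))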

print("Loaded tables (first few):", [t for t,_ in loaded[:8]])

## Verify & query
List tables, then run a simple test query. If your customers file is named differently, pick a real table name from the list.

In [None]:
%%sql
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'main'
ORDER BY table_name;

In [None]:
%%sql
-- Replace 'customers' with any table name shown above if different
SELECT * FROM customers LIMIT 5;