## CGDA — LLM Pipeline Debug Notebook (System → Process → Output)

This notebook lets you **inspect what happens to your grievance data** end-to-end:

- **System (input)**: raw CSV as provided by the client
- **Process (preprocessing)**: header mapping + date parsing + canonical record payload sent to the LLM
- **Output (LLM)**: raw Gemini JSON + validated/fallback-filled fields stored by the portal

It imports and uses the **same code** as the CGDA backend (`cgda/backend/services`).



In [11]:
import os
import sys
from pathlib import Path

# Auto-detect repo root whether the notebook is run from cgda/ or cgda/notebooks/
_cwd = Path.cwd().resolve()
if (_cwd / "backend").exists():
    REPO_ROOT = _cwd
elif (_cwd.parent / "backend").exists():
    REPO_ROOT = _cwd.parent
else:
    REPO_ROOT = _cwd

# Point imports to the backend folder
BACKEND_DIR = REPO_ROOT / "backend"
if str(BACKEND_DIR) not in sys.path:
    sys.path.insert(0, str(BACKEND_DIR))

# Load .env (Gemini key + model names) if available.
# Prefer python-dotenv, but fall back to a minimal parser so the notebook works everywhere.
try:
    from dotenv import load_dotenv

    load_dotenv(REPO_ROOT / ".env")
except Exception:
    env_path = REPO_ROOT / ".env"
    if env_path.exists():
        for line in env_path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            k, v = line.split("=", 1)
            k = k.strip()
            v = v.strip().strip('"').strip("'")
            os.environ.setdefault(k, v)

print("CWD:", _cwd)
print("Repo root:", REPO_ROOT)
print("Backend import path added:", BACKEND_DIR)
print("GEMINI_API_KEY set:", bool(os.getenv("GEMINI_API_KEY")))
print("GEMINI_API_KEY length:", len(os.getenv("GEMINI_API_KEY", "")))
print("GEMINI_MODEL_DEFAULT:", os.getenv("GEMINI_MODEL_DEFAULT", ""))
print("GEMINI_MODEL_FALLBACK:", os.getenv("GEMINI_MODEL_FALLBACK", ""))



CWD: /Users/apple/Documents/nmmc jan 2026/cgda/notebooks
Repo root: /Users/apple/Documents/nmmc jan 2026/cgda
Backend import path added: /Users/apple/Documents/nmmc jan 2026/cgda/backend
GEMINI_API_KEY set: True
GEMINI_API_KEY length: 39
GEMINI_MODEL_DEFAULT: gemini-2.0-flash
GEMINI_MODEL_FALLBACK: gemini-2.0-flash-lite
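
The fallback branch of the setup cell (used when `python-dotenv` is not installed) can be checked in isolation. The sketch below repeats the same line-by-line rules in a standalone function; `parse_env_text` is a hypothetical name used only here, and unlike the setup cell it returns a dict instead of writing to `os.environ`.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and lines without '='."""
    parsed: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Same quote handling as the setup cell: strip double quotes, then single quotes.
        parsed[key.strip()] = value.strip().strip('"').strip("'")
    return parsed


sample = (
    'GEMINI_MODEL_DEFAULT="gemini-2.0-flash"\n'
    "# a comment line\n"
    "GEMINI_MODEL_FALLBACK=gemini-2.0-flash-lite\n"
)
print(parse_env_text(sample))
```

These rules are deliberately minimal: a line such as `export KEY=value` would yield the key `export KEY`, and a trailing `# comment` on a value line would stay part of the value.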


In [12]:
# Dataframe tools (required by the inspection cells below)
try:
    import pandas as pd
except ImportError as e:
    raise RuntimeError(
        "pandas is required for dataframe inspection in this notebook.\n"
        "Install with: pip install pandas\n\n"
        f"Import error: {e}"
    ) from e

pd.set_option("display.max_colwidth", 120)
pd.set_option("display.width", 160)



## 1) System (Input): Load CSV

Pick any CSV you want to analyze. For client-like data, use:
- `data/raw/nmmc_client_sample.csv`

For the original seeded dataset, use:
- `data/raw/sample_grievances.csv`



In [13]:
CSV_PATH = REPO_ROOT / "data" / "raw" / "nmmc_client_sample.csv"  # change this

if not CSV_PATH.exists():
    raise FileNotFoundError(f"CSV not found: {CSV_PATH}")

df_raw = pd.read_csv(CSV_PATH)
print("Loaded:", CSV_PATH)
print("Rows:", len(df_raw))
df_raw.head(10)



Loaded: /Users/apple/Documents/nmmc jan 2026/cgda/data/raw/nmmc_client_sample.csv
Rows: 24


Unnamed: 0,Sr.no,Grievance Id,Date,Complainant Name,Subject,Department,Ward,Status,Closed date
0,1,NMMC/25/10690,22-12-2025 09:52 AM,Vikas Borhade,Cleaning,Solid Waste Management,Turbhe,CLOSED,24-12-2025 03:41 PM
1,2,NMMC/25/10689,22-12-2025 09:40 AM,Prakash Dilpak,गतिरोधक बसविण्याबाबत,City Engineer,Airoli,CLOSED,22-12-2025 03:48 PM
2,3,NMMC/25/10688,22-12-2025 08:59 AM,Shahbaz Sayyed,Drainage cap on road damaged,City Engineer,Vashi,CLOSED,26-12-2025 12:01 PM
3,4,NMMC/25/10687,22-12-2025 01:43 AM,Akshay Sakpal,Street Light not working,Electrical,Koparkhairane,CLOSED,25-12-2025 02:42 PM
4,5,NMMC/25/10681,21-12-2025 08:54 PM,Amar Kulkarni,Lots of vendors on footpath,Encroachment,Turbhe,CLOSED,22-12-2025 02:15 PM
5,6,NMMC/25/10679,21-12-2025 08:02 PM,Ankush Khapre,Illegal parking on ground,Encroachment,Koparkhairane,CLOSED,29-12-2025 11:27 AM
6,7,NMMC/25/10678,21-12-2025 07:31 PM,Vijay Devadiga,Abandoned senior citizen on footpath opposite society,Encroachment,Turbhe,CLOSED,22-12-2025 02:16 PM
7,8,NMMC/25/10676,21-12-2025 05:44 PM,Nirbhay Mhatre,Sanitary Chamber Leakage,Public Health Engineering,Koparkhairane,CLOSED,22-12-2025 06:42 PM
8,9,NMMC/25/10675,21-12-2025 05:11 PM,Rishabh Singh,Illegal and unscientific tree trimming and pruning,Garden,Belapur,CLOSED,26-12-2025 12:18 PM
9,10,NMMC/25/10673,21-12-2025 01:50 PM,Akshay Jagtap,इमारतीखाली साचणाऱ्या कचऱ्याबाबत तक्रार,Solid Waste Management,Nerul,CLOSED,30-12-2025 10:38 AM


## 2) Process (Preprocessing): how CGDA maps headers + parses dates

This uses the backend’s actual preprocessing helpers from `backend/services/data_service.py`:
- header normalization
- candidate-column picking
- date parsing (supports `dd-mm-yyyy hh:mm AM/PM`)
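
To make the first two steps concrete, here is a minimal sketch of header normalization and candidate-column picking. It is an assumption about how helpers like `_normalize_headers` and `_pick` could behave, not a copy of the backend code; `normalize_headers_sketch` and `pick_sketch` are hypothetical names.

```python
import re
from typing import Optional


def normalize_headers_sketch(headers: list[str]) -> dict[str, str]:
    """Map a normalized key (lowercase, underscores) to the original header."""
    norm_map: dict[str, str] = {}
    for header in headers:
        key = re.sub(r"[^a-z0-9]+", "_", header.strip().lower()).strip("_")
        # Keep the first header seen for a key so duplicates do not overwrite it.
        norm_map.setdefault(key, header)
    return norm_map


def pick_sketch(norm_map: dict[str, str], *candidates: str) -> Optional[str]:
    """Return the original header for the first candidate key that is present."""
    for candidate in candidates:
        if candidate in norm_map:
            return norm_map[candidate]
    return None


# Headers of the client sample CSV shown in section 1.
headers = ["Sr.no", "Grievance Id", "Date", "Complainant Name", "Subject",
           "Department", "Ward", "Status", "Closed date"]
norm_map = normalize_headers_sketch(headers)
print(pick_sketch(norm_map, "grievance_id", "complaint_id", "id", "ticket_id"))
print(pick_sketch(norm_map, "feedback_star", "rating"))
```

Candidate order matters in this scheme: the first name that matches wins, so more specific names should come before generic ones such as `id` or `date`.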



In [14]:
from services.data_service import _normalize_headers, _pick, _parse_date, _parse_float

headers = list(df_raw.columns)
norm_map = _normalize_headers(headers)

col_gid = _pick(norm_map, "grievance_id", "complaint_id", "id", "ticket_id")
col_text = _pick(norm_map, "grievance_text", "complaint_text", "description", "details", "text", "subject", "complaint_subject")
col_created = _pick(
    norm_map,
    "created_date",
    "lodged_date",
    "registered_date",
    "date_lodged",
    "date",
    "created_on",
    "created_datetime",
    "created_at",
)
col_closed = _pick(
    norm_map,
    "closed_date",
    "resolved_date",
    "date_closed",
    "closed_on",
    "closed_datetime",
    "closed_at",
)
col_ward = _pick(norm_map, "ward", "ward_name", "ward_no", "ward_number")
col_dept = _pick(norm_map, "department", "dept", "service", "category_department")
col_rating = _pick(norm_map, "feedback_star", "star_rating", "citizen_feedback_rating", "rating", "feedback_rating")

mapping = {
    "grievance_id": col_gid,
    "grievance_text": col_text,
    "created_date": col_created,
    "closed_date": col_closed,
    "ward": col_ward,
    "department": col_dept,
    "feedback_star": col_rating,
}

pd.DataFrame([mapping]).T.rename(columns={0: "CSV column used"})



Unnamed: 0,CSV column used
grievance_id,Grievance Id
grievance_text,Subject
created_date,Date
closed_date,Closed date
ward,Ward
department,Department
feedback_star,


In [15]:
def build_llm_payload(row: dict) -> dict:
    """Build the exact record payload the backend sends to Gemini."""
    gid = str(row.get(col_gid, "") or "").strip()
    text = str(row.get(col_text, "") or "").strip()

    # Parse dates and the rating only when the CSV has a matching column.
    created = _parse_date(str(row.get(col_created))) if col_created else None
    closed = _parse_date(str(row.get(col_closed))) if col_closed else None

    ward = str(row.get(col_ward, "") or "").strip() if col_ward else ""
    dept = str(row.get(col_dept, "") or "").strip() if col_dept else ""

    rating = _parse_float(str(row.get(col_rating))) if col_rating else None

    return {
        "grievance_id": gid,
        "grievance_text": text,
        "ward": ward or None,
        "department": dept or None,
        "created_date": created.isoformat() if created else None,
        "closed_date": closed.isoformat() if closed else None,
        "feedback_star": rating,
    }

rows = df_raw.to_dict(orient="records")
payloads = [build_llm_payload(r) for r in rows]

# Show first 5 payloads that will be sent to LLM
pd.DataFrame(payloads).head(5)



Unnamed: 0,grievance_id,grievance_text,ward,department,created_date,closed_date,feedback_star
0,NMMC/25/10690,Cleaning,Turbhe,Solid Waste Management,2025-12-22,2025-12-24,
1,NMMC/25/10689,गतिरोधक बसविण्याबाबत,Airoli,City Engineer,2025-12-22,2025-12-22,
2,NMMC/25/10688,Drainage cap on road damaged,Vashi,City Engineer,2025-12-22,2025-12-26,
3,NMMC/25/10687,Street Light not working,Koparkhairane,Electrical,2025-12-22,2025-12-25,
4,NMMC/25/10681,Lots of vendors on footpath,Turbhe,Encroachment,2025-12-21,2025-12-22,
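
The `created_date` and `closed_date` values above are date-only ISO strings derived from inputs such as `22-12-2025 09:52 AM`. The sketch below shows one way such parsing could work with the standard library. It is not the backend's `_parse_date`: the function name, the list of formats, and the treatment of `"nan"` are assumptions, and only the first format is confirmed by the sample CSV.

```python
from datetime import date, datetime
from typing import Optional

# Formats tried in order; only the first one is confirmed by the sample CSV.
ASSUMED_FORMATS = ("%d-%m-%Y %I:%M %p", "%d-%m-%Y", "%Y-%m-%d")


def parse_date_sketch(value: Optional[str]) -> Optional[date]:
    """Return a date for a recognised string, or None for blank or unparseable input."""
    if value is None:
        return None
    text = value.strip()
    # str() applied to a missing pandas cell yields "nan", so treat it as blank.
    if not text or text.lower() in {"nan", "none"}:
        return None
    for fmt in ASSUMED_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


print(parse_date_sketch("22-12-2025 09:52 AM").isoformat())
print(parse_date_sketch("nan"))
```

Dropping the time of day matches the payload table, but it also means that two grievances lodged and closed on the same day are indistinguishable by duration.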


## 3) Output (LLM): run Gemini structuring and inspect outputs

This calls the backend’s `AIService.structure_grievance()`.

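- It **tries `gemini-2.0-flash`** first, then falls back to **`gemini-2.0-flash-lite`**; these match the `GEMINI_MODEL_DEFAULT` and `GEMINI_MODEL_FALLBACK` values printed by the setup cell.
- It is designed **not to raise**: if Gemini fails, the structured fields fall back to `"Unknown"`.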

We’ll run it on a small sample first.
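
The try-then-fall-back behaviour can be pictured with a small standalone sketch. Everything in it is illustrative: `structure_with_fallback`, `fake_call`, and `UNKNOWN_RESULT` are hypothetical names, and the choices to default `repeat_flag` to `False` and to set `raw_ok` to `False` when every model fails are assumptions, not confirmed backend behaviour.

```python
from typing import Callable, Sequence

# Assumed defaults used when no model returns a usable answer.
UNKNOWN_RESULT = {
    "category": "Unknown",
    "sub_issue": "Unknown",
    "sentiment": "Unknown",
    "severity": "Unknown",
    "repeat_flag": False,
    "delay_risk": "Unknown",
    "dissatisfaction_reason": "Unknown",
}


def structure_with_fallback(
    call_model: Callable[[str, dict], dict],
    models: Sequence[str],
    record: dict,
) -> dict:
    """Try each model in order; return all-"Unknown" defaults if every call fails."""
    for model in models:
        try:
            fields = call_model(model, record)
            return {**UNKNOWN_RESULT, **fields, "ai_model": model, "raw_ok": True}
        except Exception:
            # Any failure (network, quota, bad JSON) moves on to the next model.
            continue
    return {**UNKNOWN_RESULT, "ai_model": None, "raw_ok": False}


def fake_call(model: str, record: dict) -> dict:
    """Stand-in for a Gemini call: the primary model always fails here."""
    if model == "gemini-2.0-flash":
        raise TimeoutError("simulated outage")
    return {"category": record["department"], "sentiment": "Negative"}


result = structure_with_fallback(
    fake_call,
    ["gemini-2.0-flash", "gemini-2.0-flash-lite"],
    {"department": "Electrical"},
)
print(result["ai_model"], result["category"], result["sub_issue"])
```

An `"Unknown"` value does not by itself indicate a failed call: the model can also return `"Unknown"` for a field it cannot infer, so `raw_ok` and `ai_model` are worth checking alongside the fields.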



In [16]:
from services.ai_service import AIService

ai = AIService()

SAMPLE_N = 5  # keep small to avoid rate limits
sample_payloads = payloads[:SAMPLE_N]

outputs = []
for rec in sample_payloads:
    out = ai.structure_grievance(rec)
    outputs.append(
        {
            **rec,
            "category": out.category,
            "sub_issue": out.sub_issue,
            "sentiment": out.sentiment,
            "severity": out.severity,
            "repeat_flag": out.repeat_flag,
            "delay_risk": out.delay_risk,
            "dissatisfaction_reason": out.dissatisfaction_reason,
            "ai_provider": out.ai_provider,
            "ai_engine": out.ai_engine,
            "ai_model": out.ai_model,
            "raw_ok": out.raw_ok,
        }
    )

pd.DataFrame(outputs)



Unnamed: 0,grievance_id,grievance_text,ward,department,created_date,closed_date,feedback_star,category,sub_issue,sentiment,severity,repeat_flag,delay_risk,dissatisfaction_reason,ai_provider,ai_engine,ai_model,raw_ok
0,NMMC/25/10690,Cleaning,Turbhe,Solid Waste Management,2025-12-22,2025-12-24,,Solid Waste Management,Unknown,Neutral,Low,False,Low,Unknown,caseA,Gemini,gemini-2.0-flash,True
1,NMMC/25/10689,गतिरोधक बसविण्याबाबत,Airoli,City Engineer,2025-12-22,2025-12-22,,City Engineer,Unknown,Unknown,Unknown,False,Unknown,Unknown,caseA,Gemini,gemini-2.0-flash,True
2,NMMC/25/10688,Drainage cap on road damaged,Vashi,City Engineer,2025-12-22,2025-12-26,,City Engineer,Damaged drainage infrastructure,Negative,Medium,False,Medium,Unknown,caseA,Gemini,gemini-2.0-flash,True
3,NMMC/25/10687,Street Light not working,Koparkhairane,Electrical,2025-12-22,2025-12-25,,Electrical,Street light not working,Negative,Medium,False,Low,Unknown,caseA,Gemini,gemini-2.0-flash,True
4,NMMC/25/10681,Lots of vendors on footpath,Turbhe,Encroachment,2025-12-21,2025-12-22,,Encroachment,Vendors on footpath,Negative,Medium,False,Medium,Unknown,caseA,Gemini,gemini-2.0-flash,True


## 4) (Optional) Raw Gemini JSON for one record

If you want to see the **raw JSON string** returned by Gemini (before validation), run this.



In [17]:
import json

# WARNING: this uses a private method for transparency/debugging
rec = payloads[0]

prompt_tpl = (REPO_ROOT / "backend" / "prompts" / "grievance_structuring.txt").read_text(encoding="utf-8")
prompt = prompt_tpl.replace("{{INPUT_JSON}}", json.dumps(rec, ensure_ascii=False))

if not os.getenv("GEMINI_API_KEY"):
    print("GEMINI_API_KEY not set; skipping raw Gemini call.")
else:
    try:
        raw_text = ai._call_gemini(model=os.getenv("GEMINI_MODEL_DEFAULT", "gemini-2.0-flash"), prompt=prompt)
        print(raw_text)
    except Exception as e:
        print("Raw Gemini call failed:", type(e).__name__, str(e)[:800])



{
  "category": "Solid Waste Management",
  "sub_issue": "Unknown",
  "sentiment": "Neutral",
  "severity": "Low",
  "repeat_flag": false,
  "delay_risk": "Low",
  "dissatisfaction_reason": "Unknown"
}
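
Between this raw JSON and the stored fields sits a validation step. The sketch below shows one plausible shape for it, assuming each field is checked against a fixed set of values and anything missing or unexpected becomes `"Unknown"`. The allowed sets are guesses based on values seen in this notebook (`"Positive"` does not appear in the sample at all), and `validate_fields` is a hypothetical name; the real rules live in the backend.

```python
import json

# Hypothetical allowed values, inferred from outputs shown in this notebook.
ALLOWED = {
    "sentiment": {"Positive", "Neutral", "Negative"},
    "severity": {"Low", "Medium", "High"},
    "delay_risk": {"Low", "Medium", "High"},
}
TEXT_FIELDS = ("category", "sub_issue", "dissatisfaction_reason")


def validate_fields(raw_text: str) -> dict:
    """Parse raw model text and replace missing or unexpected values with "Unknown"."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}

    clean = {}
    for field in TEXT_FIELDS:
        value = data.get(field)
        clean[field] = value.strip() if isinstance(value, str) and value.strip() else "Unknown"
    for field, allowed in ALLOWED.items():
        value = data.get(field)
        clean[field] = value if isinstance(value, str) and value in allowed else "Unknown"
    # Only a real JSON boolean counts; strings such as "false" are rejected.
    clean["repeat_flag"] = data.get("repeat_flag") is True
    return clean


raw = """{
  "category": "Solid Waste Management",
  "sub_issue": "Unknown",
  "sentiment": "Neutral",
  "severity": "Low",
  "repeat_flag": false,
  "delay_risk": "Low",
  "dissatisfaction_reason": "Unknown"
}"""
print(validate_fields(raw))
```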


## 5) Full-run (optional): run on more rows and see distributions

Increase `N` carefully to avoid rate limits. The portal processes in batches and continues in the background.
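
For larger runs inside the notebook, one simple way to stay under rate limits is to process records in small batches with a pause between them. The sketch below is a foreground illustration only and does not reproduce the portal's background batching; `iter_batches`, `run_in_batches`, and the default batch size and pause are hypothetical.

```python
import time
from typing import Callable, Iterator, Sequence


def iter_batches(items: Sequence[dict], batch_size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(
    items: Sequence[dict],
    process: Callable[[dict], object],
    batch_size: int = 5,
    pause_seconds: float = 0.0,
) -> list:
    """Apply process() to every item, sleeping between batches to ease rate limits."""
    results = []
    for index, batch in enumerate(iter_batches(items, batch_size)):
        if index > 0 and pause_seconds > 0:
            time.sleep(pause_seconds)
        results.extend(process(item) for item in batch)
    return results


demo = [{"grievance_id": f"demo-{i}"} for i in range(12)]
print([len(batch) for batch in iter_batches(demo, 5)])
```

In this notebook, `process` could be `ai.structure_grievance` and `items` could be `payloads[:N]`, with `pause_seconds` chosen to suit the quota of the API key in use.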



In [18]:
N = 10  # try 10 first
outs = []
for rec in payloads[:N]:
    out = ai.structure_grievance(rec)
    outs.append({"grievance_id": rec.get("grievance_id"), "dept": rec.get("department"), "ward": rec.get("ward"), "text": rec.get("grievance_text"),
                 "category": out.category, "sub_issue": out.sub_issue, "sentiment": out.sentiment, "severity": out.severity,
                 "repeat_flag": out.repeat_flag, "delay_risk": out.delay_risk, "dissatisfaction_reason": out.dissatisfaction_reason,
                 "ai_model": out.ai_model, "raw_ok": out.raw_ok})

df_out = pd.DataFrame(outs)
display(df_out)

print("\nCategory distribution (sample):")
display(df_out["category"].value_counts(dropna=False).head(15))

print("\nUnknown-rate by field (sample):")
unknown_cols = ["category", "sub_issue", "sentiment", "severity", "delay_risk", "dissatisfaction_reason"]
unknown_rate = (df_out[unknown_cols] == "Unknown").mean().sort_values(ascending=False)
display(unknown_rate)



Unnamed: 0,grievance_id,dept,ward,text,category,sub_issue,sentiment,severity,repeat_flag,delay_risk,dissatisfaction_reason,ai_model,raw_ok
0,NMMC/25/10690,Solid Waste Management,Turbhe,Cleaning,Solid Waste Management,Unknown,Neutral,Low,False,Low,Unknown,gemini-2.0-flash,True
1,NMMC/25/10689,City Engineer,Airoli,गतिरोधक बसविण्याबाबत,City Engineer,Unknown,Unknown,Unknown,False,Unknown,Unknown,gemini-2.0-flash,True
2,NMMC/25/10688,City Engineer,Vashi,Drainage cap on road damaged,City Engineer,Damaged drainage infrastructure,Negative,Medium,False,Medium,Unknown,gemini-2.0-flash,True
3,NMMC/25/10687,Electrical,Koparkhairane,Street Light not working,Electrical,Street light malfunction,Negative,Medium,False,Medium,Unknown,gemini-2.0-flash,True
4,NMMC/25/10681,Encroachment,Turbhe,Lots of vendors on footpath,Encroachment,Vendors on footpath,Negative,Medium,False,Medium,Unknown,gemini-2.0-flash,True
5,NMMC/25/10679,Encroachment,Koparkhairane,Illegal parking on ground,Encroachment,Illegal parking,Negative,Medium,False,Medium,Unknown,gemini-2.0-flash,True
6,NMMC/25/10678,Encroachment,Turbhe,Abandoned senior citizen on footpath opposite society,Encroachment,Abandoned person,Negative,High,False,High,Unknown,gemini-2.0-flash,True
7,NMMC/25/10676,Public Health Engineering,Koparkhairane,Sanitary Chamber Leakage,Public Health Engineering,Sanitary Chamber Leakage,Negative,Medium,False,Medium,Unknown,gemini-2.0-flash,True
8,NMMC/25/10675,Garden,Belapur,Illegal and unscientific tree trimming and pruning,Garden,Illegal tree trimming,Negative,Medium,False,Medium,Unknown,gemini-2.0-flash,True
9,NMMC/25/10673,Solid Waste Management,Nerul,इमारतीखाली साचणाऱ्या कचऱ्याबाबत तक्रार,Solid Waste Management,Garbage accumulation,Negative,Medium,False,Medium,Unknown,gemini-2.0-flash,True



Category distribution (sample):


category
Encroachment                 3
Solid Waste Management       2
City Engineer                2
Electrical                   1
Public Health Engineering    1
Garden                       1
Name: count, dtype: int64


Unknown-rate by field (sample):


dissatisfaction_reason    1.0
sub_issue                 0.2
sentiment                 0.1
severity                  0.1
delay_risk                0.1
category                  0.0
dtype: float64
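
The rates above say how often a field is `"Unknown"`, but not which records are responsible. A row-level view helps with that. The sketch below is plain Python so it runs without pandas; `unknown_fields` is a hypothetical helper, and the two records are copied from the output table above.

```python
FIELDS = ("category", "sub_issue", "sentiment", "severity", "delay_risk", "dissatisfaction_reason")


def unknown_fields(record: dict) -> list[str]:
    """Return the names of fields whose value is exactly "Unknown"."""
    return [name for name in FIELDS if record.get(name) == "Unknown"]


# Two rows copied from the output table above.
records = [
    {"grievance_id": "NMMC/25/10689", "category": "City Engineer", "sub_issue": "Unknown",
     "sentiment": "Unknown", "severity": "Unknown", "delay_risk": "Unknown",
     "dissatisfaction_reason": "Unknown"},
    {"grievance_id": "NMMC/25/10688", "category": "City Engineer",
     "sub_issue": "Damaged drainage infrastructure", "sentiment": "Negative",
     "severity": "Medium", "delay_risk": "Medium", "dissatisfaction_reason": "Unknown"},
]

# Ignore dissatisfaction_reason, which is "Unknown" for every row in this sample.
for rec in records:
    missing = [f for f in unknown_fields(rec) if f != "dissatisfaction_reason"]
    if missing:
        print(rec["grievance_id"], "->", missing)
```

In this sample, the 0.1 rates for `sentiment`, `severity`, and `delay_risk` all trace back to the single record `NMMC/25/10689`. Its subject is in Marathi, but the other Marathi record (`NMMC/25/10673`) was structured normally, so language alone does not explain the gap.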

## 0) Gemini Smoke Test (Starter)

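Run this right after the setup cell at the top of the notebook, and *before* the pipeline sections (1 to 5), to confirm that Gemini is responding. It relies on the setup cell for the backend import path and the `.env` values.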



In [19]:
import os
import json
from services.ai_service import AIService

ai = AIService()

if not os.getenv("GEMINI_API_KEY"):
    raise RuntimeError("GEMINI_API_KEY is not set. Put it in cgda/.env and re-run the first setup cell.")

# Minimal prompt that forces JSON output
prompt = "Return JSON only: {\"ok\": true, \"model\": \"" + os.getenv("GEMINI_MODEL_DEFAULT", "gemini-2.0-flash") + "\"}"

try:
    txt = ai._call_gemini(model=os.getenv("GEMINI_MODEL_DEFAULT", "gemini-2.0-flash"), prompt=prompt)
    print("Gemini raw response:\n", txt)
    parsed = json.loads(txt) if txt.strip().startswith("{") else None
    print("\nParsed JSON:", parsed)
except Exception as e:
    print("Gemini smoke test FAILED:")
    print(type(e).__name__, str(e)[:1200])



Gemini raw response:
 {"ok": true, "model": "gemini-2.0-flash"}

Parsed JSON: {'ok': True, 'model': 'gemini-2.0-flash'}
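
The smoke test only parses responses that begin with `{`. Models sometimes wrap JSON in a Markdown fence or add a sentence before it, in which case `parsed` would be `None` even though the call succeeded. A more tolerant extraction could look like the sketch below; `extract_json_object` is a hypothetical helper and is not part of `AIService`.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Return the JSON object spanning the first '{' to the last '}', or None."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


fence = "`" * 3  # a Markdown code fence, built here to keep this example readable
wrapped = fence + 'json\n{"ok": true, "model": "gemini-2.0-flash"}\n' + fence
print(extract_json_object(wrapped))
print(extract_json_object("no JSON in this reply"))
```

This approach assumes the reply contains a single JSON object; a response with several objects or stray braces in surrounding prose would fail to parse and return `None`.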
