# AADT Confidence Interval


---

## FHWA Links
* Guidelines for Obtaining AADT Estimates from Non-Traditional Sources:
    * https://www.fhwa.dot.gov/policyinformation/travel_monitoring/pubs/aadtnt/Guidelines_for_AADT_Estimates_Final.pdf

---
  
## AADT Analysis Locations
* 10 locations were used in the analysis
* Locations were chosen based on where Traffic Operations cameras are installed and recording
    * For additional information, contact Zhenyu Zhu with Traffic Operations

## Traffic Census Data
* https://dot.ca.gov/programs/traffic-operations/census/traffic-volumes
* Back AADT, Peak Month, and Peak Hour usually represent traffic South or West of the count location.  
* Ahead AADT, Peak Month, and Peak Hour usually represent traffic North or East of the count location. Listing of routes with their designated  

* Because both the Back and Ahead counts are included at each location in the Traffic Census Data (e.g., "IRWINDALE, ARROW HIGHWAY"), only one [OBJECTID*] per location was pulled; this analysis uses the North Bound Nodes.
    * For more information, see the diagram: https://traffic.onramp.dot.ca.gov/downloads/traffic/files/performance/census/Back_and_Ahead_Leg_Traffic_Count_Diagram.pdf

## StreetLight Analysis Data
* Analysis Type == Network Performance
* Segment Metrics
* 2022 was used to match currently available Traffic Census Data (as of 8/27/2025)
* Pulled a variety of Day Types, but plan to look only at "All Day Types"
* Pulled a variety of Day Parts, but plan to look only at "All Day Parts"

---


## How this notebook estimates StreetLight vs. Traffic Census differences

**What we’re trying to answer:**  
Across selected corridor locations, is the Non-Traditional AADT generally higher or lower than Traffic Census (aka Traditional) AADT, and by how much? We also show how certain we are about that average difference.

---

### The data we use
- **Traffic Census (TC):** The official counts by location (`objectid`) with two directions: *ahead* and *back*.
- **StreetLight (STL):** Volume by road segment (“**zonename**”) with tags like **daytype** (e.g., All Days) and **daypart** (e.g., All Day).
- **Location mapping:** For each TC location, a list of the STL zonenames that represent the *ahead* side and the *behind* side of that location.

---

### How we build one number per location (AADT)
1) **Pick the TC value** that matches the counter’s direction:  
   - Even `objectid` → use the TC *back* value  
   - Odd `objectid` → use the TC *ahead* value  
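   *(This mirrors the direction convention previously reviewed. In `traditional_aadt_by_location` it corresponds to `use_parity=True`; the default, `use_parity=False`, instead averages the *ahead* and *back* values for each kept `objectid`.)*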

2) **Filter StreetLight to the same conditions** you care about (usually **All Days** and **All Day**).

3) **For each STL zonename**, take the average volume within that filter.  
   *(This gives one “typical” value per segment under the chosen daytype/daypart.)*

4) **Sum the STL segments for this location**:  
   - Add up the “ahead” segments.  
   - Add up the “behind” segments.  
   - Then add those two sides together.  
   *(Result = StreetLight AADT for that location.)*

Now each location has:
- **TC AADT** (the benchmark)  
- **STL AADT** (the estimate from StreetLight)
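
The steps above can be sketched on a toy table. This is a minimal illustration, not the notebook's actual implementation: the segment names, the `volume` column, the weekday label, and both helper functions are hypothetical, and only the parity rule and the filter, average, then sum order come from the description above.

```python
import pandas as pd

# Hypothetical StreetLight rows; "volume" stands in for the real volume column.
df_stl_demo = pd.DataFrame({
    "zonename": ["seg_a", "seg_a", "seg_b", "seg_c", "seg_c"],
    "daytype":  ["0: All Days (M-Su)"] * 4 + ["1: Weekday (M-Th)"],
    "daypart":  ["0: All Day (12am-12am)"] * 5,
    "volume":   [1000, 1200, 2000, 3000, 9999],
})


def stl_aadt_for_location(df, ahead_zones, behind_zones,
                          daytype="0: All Days (M-Su)",
                          daypart="0: All Day (12am-12am)"):
    """Filter to one daytype/daypart, average per zonename, then sum both sides."""
    keep = (df["daytype"] == daytype) & (df["daypart"] == daypart)
    zone_mean = df.loc[keep].groupby("zonename")["volume"].mean()
    # reindex() yields NaN for zones absent from the data; sum() skips NaN.
    ahead = zone_mean.reindex(ahead_zones).sum()
    behind = zone_mean.reindex(behind_zones).sum()
    return float(ahead + behind)


def tc_value_by_parity(objectid, ahead_aadt, back_aadt):
    """Even objectid -> TC back value; odd objectid -> TC ahead value."""
    return back_aadt if int(objectid) % 2 == 0 else ahead_aadt


stl_aadt = stl_aadt_for_location(df_stl_demo, ["seg_a", "seg_b"], ["seg_c"])
tc_aadt = tc_value_by_parity(objectid=101, ahead_aadt=6000, back_aadt=5800)
print(stl_aadt, tc_aadt)  # 6100.0 6000
```

The weekday row for `seg_c` is dropped by the filter, so it does not inflate the segment average.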

---

### Turn those into apples-to-apples differences (TCE, in %)
For every location with both numbers:
- **Traffic Count Error (TCE)** = the percent difference between STL and TC.  
  - Negative TCE → STL is lower than TC.  
  - Positive TCE → STL is higher than TC.

We collect one TCE value per location.
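
As a small sketch of the calculation, the function below assumes the percent difference is taken relative to the Traffic Census value, which is consistent with the sign convention above but is an assumption about the denominator. The name `tce_percent` and the input values are hypothetical.

```python
def tce_percent(stl_aadt, tc_aadt):
    """Percent difference of STL relative to TC (assumed denominator: TC AADT)."""
    if tc_aadt == 0:
        raise ValueError("TC AADT must be non-zero to compute a percent difference")
    return 100.0 * (stl_aadt - tc_aadt) / tc_aadt


low = tce_percent(9000, 10000)    # STL below TC -> negative TCE
high = tce_percent(11000, 10000)  # STL above TC -> positive TCE
print(low, high)  # -10.0 10.0
```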

---

### Summarize and add a confidence band (CI)
- **Average TCE**: the typical over/under across all locations.  
- **95% Confidence Interval**: a “margin of error” around that average, based on how much the location-level TCEs vary and how many locations you have.  
  - If the interval **crosses 0%**, the average difference isn’t statistically clear (could be slightly above or below zero).  
  - If the interval is **entirely below 0%**, STL tends to be lower than TC.  
  - If it’s **entirely above 0%**, STL tends to be higher.

We also show a **t-statistic** and **p-value** for the “is the average difference basically zero?” question; lower p-values mean a clearer difference.
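
A minimal sketch of this summary, assuming a two-sided interval from the Student's t distribution and a one-sample t-test against zero. The TCE values are made up for illustration, and the notebook's own code may compute these quantities differently.

```python
import numpy as np
from scipy import stats

# Hypothetical location-level TCE values, in percent.
tce = np.array([-4.0, -2.0, 1.0, -3.0, -2.0])

n = tce.size
mean_tce = tce.mean()
sem = tce.std(ddof=1) / np.sqrt(n)      # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value
ci_low = mean_tce - t_crit * sem
ci_high = mean_tce + t_crit * sem

# One-sample t-test of "is the average TCE zero?"
t_stat, p_value = stats.ttest_1samp(tce, popmean=0.0)

print(f"mean TCE = {mean_tce:.2f}%")
print(f"95% CI   = ({ci_low:.2f}%, {ci_high:.2f}%)")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

With these made-up values the mean is negative but the interval spans 0%, so the average difference would not be statistically clear.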

---

### What to look for
- **The average TCE** (direction and size).  
- **Whether the 95% CI includes 0%.**  
- **Any locations with missing segments or mismatched data** (these are flagged so you can QA them).

## Import packages

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats import t as student_t  # if SciPy is not available, use a small lookup table
import csv
import re

from pathlib import Path


In [2]:
# pull in the coordinates from the utils docs
#from osow_frp_o_d_utils_v3 import origin_intersections, destination_intersections
import shs_ct_tc_locations_utils as tc_locs

### Identify the corridor

In [3]:
# Available corridors
    # "interstate_605_d7_tc_aadt_locations"
    # "sr_99_d3_tc_aadt_locations"

In [4]:
# Identify the corridor to be analyzed
CORRIDOR_VAR_NAME = "sr_99_d3_tc_aadt_locations"

In [5]:
# Resolve the object from the module by name
try:
    aadt_locations = getattr(tc_locs, CORRIDOR_VAR_NAME)
except AttributeError:
    raise KeyError(
        f"'{CORRIDOR_VAR_NAME}' not found in shs_ct_tc_locations_utils. "
        "Double-check the variable name."
    )

## Step 0, Pull in the Data

In [6]:
# This function will pull in the data and clean the column headers in a way that will make them easier to work with
def getdata_and_cleanheaders(path):
    # Read the CSV file
    df = pd.read_csv(path)

    # Clean column headers: remove spaces, convert to lowercase, and strip trailing asterisks
    cleaned_columns = []
    for column in df.columns:
        cleaned_column = column.replace(" ", "").lower().rstrip("*")
        cleaned_columns.append(cleaned_column)

    df.columns = cleaned_columns
    return df
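
To make the effect of the header cleaning concrete, the short sketch below applies the same three string operations to a few example headers. The raw header strings are assumptions chosen for illustration, not taken from the actual CSV files.

```python
# Hypothetical raw headers as they might appear in the source CSVs.
raw_headers = [
    "OBJECTID*",
    "BACK_AADT",
    "Zone Name",
    "Average Daily Segment Traffic (StL Volume)",
]

# Same cleaning as getdata_and_cleanheaders: drop spaces, lowercase, strip trailing "*".
cleaned = [h.replace(" ", "").lower().rstrip("*") for h in raw_headers]
print(cleaned)
# ['objectid', 'back_aadt', 'zonename', 'averagedailysegmenttraffic(stlvolume)']
```

Only spaces and trailing asterisks are removed; underscores and parentheses are kept, which is why the downstream column names retain them.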

### import option 1: Identify the Google Cloud Storage path

In [7]:
# # Identify the GCS path to the data
# gcs_path = "gs://calitp-analytics-data/data-analyses/big_data/compare_traffic_counts/0_2022/"

In [8]:
# pull in the data & create dataframes
#df_tc = getdata_and_cleanheaders(f"{gcs_path}caltrans_traffic_census_2022.csv")  # Traffic Census

In [9]:
# # Identify the StreetLight Analysis to be used in the AADT comparison
# df_stl = getdata_and_cleanheaders(f"{gcs_path}streetlight_605_d7_all_vehicles_np_2022.csv")  # StreetLight

### import option 2: Identify the local data path

In [10]:
# Base data folder: aadt_confidence_interval/aadt_data/2022
LOCAL_DATA_DIR = Path.cwd() / "aadt_data" / "2022"
if not LOCAL_DATA_DIR.exists():
    raise FileNotFoundError(f"Data folder not found: {LOCAL_DATA_DIR.resolve()}")

In [11]:
# Traffic Census (traditional) — local CSV
df_tc = getdata_and_cleanheaders(LOCAL_DATA_DIR / "caltrans_traffic_census_2022.csv")

In [12]:
# StreetLight (non-traditional) — local CSV
df_stl = getdata_and_cleanheaders(LOCAL_DATA_DIR / "streetlight_99_d3_all_vehicles_2022_np.csv")

### Export to a CSV for viewing/validation

In [13]:
# comparing
df_tc.to_csv("df_tc.csv", index=False)

In [14]:
# comparing
df_stl.to_csv("df_stl.csv", index=False)

## Step 00: Normalizer

In [15]:
def _ensure_list(x):
    if x is None: return []
    if isinstance(x, (list, tuple, set)): return list(x)
    return [x]

def explode_locations_to_objectids(aadt_locs):
    """
    Returns a list of dicts where each item is ONE objectid with:
      name, daytype, objectids [list[str]], ahead_zones [list[str]], behind_zones [list[str]]
    This shape is accepted by your existing traditional/non_traditional builders.
    """
    rows = []

    # Case A: "flat" list like interstate_605_aadt_locations
    if isinstance(aadt_locs, list) and aadt_locs and isinstance(aadt_locs[0], dict) and "objectid" in aadt_locs[0]:
        for loc in aadt_locs:
            oid = str(loc.get("objectid"))
            nm  = f"{loc.get('location_description','UNKNOWN')} [{oid}]"
            day = loc.get("daytype", "0: All Days (M-Su)")

            ahead, behind = [], []
            for k, v in loc.items():
                if not (isinstance(k, str) and k.startswith("zonename_")):
                    continue
                try:
                    idx = int(k.split("_")[1])
                except ValueError:
                    continue  # skip keys without a numeric suffix (e.g. "zonename_ahead")
                # assume even indexes (0,2) are "ahead"/NB and odd (1,3) are "behind"/SB (matches your list)
                if idx % 2 == 0: ahead.append(v)
                else:            behind.append(v)

            rows.append({
                "name": nm,
                "daytype": day,
                "objectids": [oid],
                "ahead_zones": [z for z in ahead if z],
                "behind_zones": [z for z in behind if z],
            })
        return rows

    # Case B: nested dict(s) like sr_99_d3_tc_aadt_locations
    def _gather_objectids(node):
        ids = []
        if "objectid"  in node: ids.extend(_ensure_list(node["objectid"]))
        if "objectids" in node: ids.extend(_ensure_list(node["objectids"]))
        return [str(i) for i in ids if i is not None and str(i).strip() != ""]

    if isinstance(aadt_locs, list):
        iterable = []
        for item in aadt_locs:
            if isinstance(item, dict):
                iterable.append(item)
    elif isinstance(aadt_locs, dict):
        iterable = [aadt_locs]
    else:
        iterable = []

    for block in iterable:
        for base_name, loc in block.items():
            day = loc.get("daytype", "0: All Days (M-Su)")
            nodes = loc.get("nodes", {}) or {}
            for node_name, node in nodes.items():
                oids = _gather_objectids(node)
                if not oids: continue
                nm = f"{base_name} [{','.join(oids)}]"

                ahead = _ensure_list(node.get("zonename_ahead", []))
                behind = _ensure_list(node.get("zonename_behind", []))

                rows.append({
                    "name": nm,
                    "daytype": day,
                    "objectids": oids,
                    "ahead_zones": [z for z in ahead if z],
                    "behind_zones": [z for z in behind if z],
                })
    return rows
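
The two input shapes handled by the normalizer can be illustrated with toy data. Everything below is hypothetical (location name, objectid, segment names, and the `north_bound` node key); it only sketches the structure implied by the keys the function reads.

```python
# Case A: flat list, one dict per objectid, zones stored in numbered keys.
flat_example = [
    {
        "objectid": 101,
        "location_description": "EXAMPLE LOCATION",
        "daytype": "0: All Days (M-Su)",
        "zonename_0": "example nb segment 1",  # even index -> ahead
        "zonename_1": "example sb segment 1",  # odd index  -> behind
    }
]

# Case B: nested dict keyed by location name, with one entry per node.
nested_example = {
    "EXAMPLE LOCATION": {
        "daytype": "0: All Days (M-Su)",
        "nodes": {
            "north_bound": {
                "objectid": 101,
                "zonename_ahead": ["example nb segment 1"],
                "zonename_behind": ["example sb segment 1"],
            }
        },
    }
}

# Either shape is expected to normalize to one record per objectid/node like this:
expected_row = {
    "name": "EXAMPLE LOCATION [101]",
    "daytype": "0: All Days (M-Su)",
    "objectids": ["101"],
    "ahead_zones": ["example nb segment 1"],
    "behind_zones": ["example sb segment 1"],
}
```

Passing either `flat_example` or `nested_example` to `explode_locations_to_objectids` should return a one-item list matching `expected_row`.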

## Step 1, Build a per-location summary of Traffic Census locations

In [16]:
def traditional_aadt_by_location(aadt_locations, df_tc, as_df=True, use_parity=False):
    """
    Build a per-location summary of *traditional* (Traffic Census) AADT.

    Policy:
      • If multiple objectids exist for a location, prefer those whose numeric value is ODD.
        If no odd ids exist, fall back to all ids found.
      • Default behavior (use_parity=False): for each kept objectid, compute (ahead_aadt + back_aadt)/2,
        then average across kept objectids.
      • If use_parity=True: even oid -> back_aadt; odd oid -> ahead_aadt (legacy behavior).

    Output columns:
      location, daytype, objectids, n_objectids, n_found_in_tc, missing_objectids,
      traditional_ahead_mean, traditional_behind_mean, traditional_aadt
    """
    # Requires: import pandas as pd; import numpy as np

    def _ensure_list(x):
        if x is None: return []
        if isinstance(x, (list, tuple, set)): return list(x)
        return [x]

    def _gather_objectids(node_dict):
        ids = []
        if not isinstance(node_dict, dict): return ids
        if "objectid"  in node_dict: ids.extend(_ensure_list(node_dict["objectid"]))
        if "objectids" in node_dict: ids.extend(_ensure_list(node_dict["objectids"]))
        return [str(i).strip() for i in ids if i is not None and str(i).strip() != ""]

    def _dedup(seq):
        seen=set(); out=[]
        for x in seq:
            if x not in seen:
                out.append(x); seen.add(x)
        return out

    def _keep_odd_objectids(ids):
        odds = [i for i in ids if i.isdigit() and (int(i) % 2 == 1)]
        return odds if odds else ids

    def _normalize_one_location(name, loc, include_oid_in_name=True):
        nodes = (loc.get("nodes") if isinstance(loc, dict) else None) or {}
        all_ids=[]
        for _, node in nodes.items():
            all_ids.extend(_gather_objectids(node))
        if not all_ids and isinstance(loc, dict) and "objectid" in loc:
            all_ids = [str(loc["objectid"])]

        all_ids = _dedup([i for i in all_ids if i])
        kept_ids = _keep_odd_objectids(all_ids)

        name_out = name
        if include_oid_in_name and kept_ids:
            name_out = f"{name} [{','.join(kept_ids)}]"

        return {
            "name": name_out,
            "daytype": (loc.get("daytype") if isinstance(loc, dict) else None) or "0: All Days (M-Su)",
            "objectids": kept_ids,
        }

    def _normalize_input(aadt_locs):
        if isinstance(aadt_locs, pd.DataFrame) and {"name","daytype","objectids"}.issubset(aadt_locs.columns):
            recs = aadt_locs.to_dict(orient="records")
            for r in recs:
                r["objectids"] = _keep_odd_objectids(_ensure_list(r.get("objectids")))
            return recs
        if isinstance(aadt_locs, list) and aadt_locs and isinstance(aadt_locs[0], dict) and \
           {"name","daytype","objectids"}.issubset(aadt_locs[0].keys()):
            recs = []
            for r in aadt_locs:
                r = dict(r)
                r["objectids"] = _keep_odd_objectids(_ensure_list(r.get("objectids")))
                recs.append(r)
            return recs

        recs = []
        if isinstance(aadt_locs, dict):
            for nm, loc in aadt_locs.items():
                recs.append(_normalize_one_location(nm, loc))
            return recs

        if isinstance(aadt_locs, list):
            for item in aadt_locs:
                if not isinstance(item, dict):
                    continue
                if "nodes" in item:
                    nm = item.get("location_description") or item.get("name") or "UNKNOWN"
                    recs.append(_normalize_one_location(nm, item))
                elif "objectid" in item:
                    oid = str(item.get("objectid")).strip()
                    nm  = item.get("location_description") or item.get("name") or "UNKNOWN"
                    kept = _keep_odd_objectids([oid])
                    recs.append({
                        "name": f"{nm} [{','.join(kept)}]" if kept else nm,
                        "daytype": item.get("daytype", "0: All Days (M-Su)"),
                        "objectids": kept,
                    })
                else:
                    for nm, loc in item.items():
                        recs.append(_normalize_one_location(nm, loc))
        return recs

    def _traditional_aadt_for_ids(df_tc_in, obj_ids):
        """
        Default (use_parity=False): per-oid average of (ahead_aadt, back_aadt), then mean across oids.
        If use_parity=True: even->back_aadt, odd->ahead_aadt.
        """
        obj_ids = [str(x).strip() for x in (obj_ids or []) if str(x).strip()]
        if not obj_ids:
            return np.nan, np.nan, np.nan, 0

        sub = df_tc_in[df_tc_in["objectid"].astype(str).str.strip().isin(obj_ids)].copy()
        if sub.empty:
            return np.nan, np.nan, np.nan, 0

        if use_parity:
            vals = []
            for oid in obj_ids:
                row = sub[sub["objectid"].astype(str).str.strip() == oid]
                if row.empty:
                    continue
                v = row.iloc[0]["back_aadt"] if (oid.isdigit() and int(oid) % 2 == 0) else row.iloc[0]["ahead_aadt"]
                vals.append(pd.to_numeric(v, errors="coerce"))
            vals = pd.Series(vals, dtype="float64").dropna()
            if vals.empty: return np.nan, np.nan, np.nan, 0
            overall = float(vals.mean())
            return overall, np.nan, np.nan, int(vals.shape[0])

        # --- average ahead/back per objectid, then average across objectids ---
        sub["ahead_aadt"] = pd.to_numeric(sub.get("ahead_aadt"), errors="coerce")
        sub["back_aadt"]  = pd.to_numeric(sub.get("back_aadt"),  errors="coerce")

        # per-oid average: mean of available sides (ignore NaN)
        per_oid_avg = sub[["ahead_aadt","back_aadt"]].mean(axis=1, skipna=True)
        per_oid_avg = per_oid_avg.dropna()

        if per_oid_avg.empty:
            return np.nan, np.nan, np.nan, 0

        overall = float(per_oid_avg.mean())

        # side means (for reporting only)
        ahead_vals = sub["ahead_aadt"].dropna()
        back_vals  = sub["back_aadt"].dropna()
        mean_ahead = float(ahead_vals.mean()) if not ahead_vals.empty else np.nan
        mean_back  = float(back_vals.mean())  if not back_vals.empty  else np.nan
        count_used = int(per_oid_avg.shape[0])

        return overall, mean_ahead, mean_back, count_used

    # ---- main ----
    norm = _normalize_input(aadt_locations)
    tc_ids_all = set(df_tc["objectid"].astype(str).str.strip().unique())

    rows = []
    for loc in norm:
        obj_ids = [str(x).strip() for x in (loc.get("objectids") or []) if str(x).strip()]
        overall, mean_ahead, mean_back, n_found = _traditional_aadt_for_ids(df_tc, obj_ids)
        missing = [x for x in obj_ids if x not in tc_ids_all]

        rows.append({
            "location": loc.get("name"),
            "daytype":  loc.get("daytype"),
            "objectids": "|".join(obj_ids),
            "n_objectids": len(obj_ids),
            "n_found_in_tc": int(n_found),
            "missing_objectids": "|".join(missing) if missing else "",
            "traditional_ahead_mean": mean_ahead,
            "traditional_behind_mean": mean_back,
            "traditional_aadt": overall,
        })

    return pd.DataFrame(rows) if as_df else rows


In [17]:
# run step 1 - traditional aadt counts
trad_df = traditional_aadt_by_location(aadt_locations, df_tc, as_df=True)

In [18]:
#trad_df.head()

In [19]:
# Export Step 1 as a CSV to take a look
trad_df.to_csv("step_1_traditional_aadt_by_location.csv", index=False)

## Step 2 Identify Traffic Census location names for the StreetLight segments

In [20]:
import pandas as pd
import numpy as np

# def non_traditional_aadt_by_location(
#     aadt_locations,
#     df_stl,
#     daytype_filter="0: All Days (M-Su)",
#     daypart_filter="0: All Day (12am-12am)",
#     modeoftravel_filter=None,
#     zonename_col="zonename",
#     stl_volume_col="averagedailysegmenttraffic(stlvolume)",
#     as_df=True,
#     agg="sum",  # "sum" mirrors reviewed pipeline; "mean" = average across segments
#     segment_count_mode="unique",  # "unique" counts deduped zonenames; set "all" to count before dedup
# ):
#     """
#     Build a per-location summary of *non-traditional* (StreetLight) AADT.

#     Output columns (one row per location):
#       location, daytype_expected, daytype_used, daypart_used, modeoftravel_used,
#       ahead_zones, behind_zones,
#       non_trad_ahead_mean, non_trad_behind_mean, non_trad_aadt,
#       stl_ahead_rows, stl_behind_rows, missing_ahead_zones, missing_behind_zones

#     NEW columns added to align with 'count each segment listed':
#       listed_ahead_segments, listed_behind_segments,
#       present_ahead_segments, present_behind_segments
#     """
#     # ---- helpers ----
#     def _ensure_list(x):
#         if x is None: return []
#         if isinstance(x, (list, tuple, set)): return list(x)
#         return [x]

#     def _gather_zones(node_dict):
#         ahead  = _ensure_list(node_dict.get("zonename_ahead", []))
#         behind = _ensure_list(node_dict.get("zonename_behind", []))
#         return ahead, behind

#     def _gather_objectids(node_dict):
#         ids = []
#         if not isinstance(node_dict, dict): return ids
#         if "objectid"  in node_dict: ids.extend(_ensure_list(node_dict["objectid"]))
#         if "objectids" in node_dict: ids.extend(_ensure_list(node_dict["objectids"]))
#         return [str(i) for i in ids if i is not None and str(i).strip() != ""]

#     def _dedup(seq):
#         seen=set(); out=[]
#         for x in seq:
#             if x not in seen:
#                 out.append(x); seen.add(x)
#         return out

#     def _normalize_one_location(name, loc, include_oid_in_name=True):
#         """Nested 'nodes' format -> collect objectids and zonenames; append [oids] to name for merge alignment."""
#         nodes = loc.get("nodes", {}) or {}
#         ahead_all, behind_all, all_oids = [], [], []
#         for _, node in nodes.items():
#             a, b = _gather_zones(node)
#             ahead_all.extend([z for z in a if z])
#             behind_all.extend([z for z in b if z])
#             all_oids.extend(_gather_objectids(node))

#         name_out = name
#         if include_oid_in_name and all_oids:
#             name_out = f"{name} [{','.join(_dedup(all_oids))}]"

#         # record both pre-dedup and dedup lists so we can count segments the way you prefer
#         ahead_dedup  = _dedup(ahead_all)
#         behind_dedup = _dedup(behind_all)

#         return {
#             "name": name_out,
#             "daytype": loc.get("daytype", "0: All Days (M-Su)"),
#             "ahead_zones_all": ahead_all,
#             "behind_zones_all": behind_all,
#             "ahead_zones": ahead_dedup,
#             "behind_zones": behind_dedup,
#         }

#     def _normalize_input(aadt_locs):
#         # Already normalized DataFrame?
#         if isinstance(aadt_locs, pd.DataFrame) and \
#            {"name","daytype","ahead_zones","behind_zones"}.issubset(aadt_locs.columns):
#             recs = aadt_locs.to_dict(orient="records")
#             # ensure *_zones_all exist for counting
#             for r in recs:
#                 r.setdefault("ahead_zones_all", r.get("ahead_zones", []))
#                 r.setdefault("behind_zones_all", r.get("behind_zones", []))
#             return recs

#         # Already normalized list[dict]?
#         if isinstance(aadt_locs, list) and aadt_locs and isinstance(aadt_locs[0], dict) and \
#            {"name","daytype","ahead_zones","behind_zones"}.issubset(aadt_locs[0].keys()):
#             recs = aadt_locs
#             for r in recs:
#                 r.setdefault("ahead_zones_all", r.get("ahead_zones", []))
#                 r.setdefault("behind_zones_all", r.get("behind_zones", []))
#             return recs

#         recs = []
#         # Dict keyed by name (nested format)
#         if isinstance(aadt_locs, dict):
#             for nm, loc in aadt_locs.items():
#                 recs.append(_normalize_one_location(nm, loc))
#             return recs

#         # List of locations (mixed formats)
#         if isinstance(aadt_locs, list):
#             for item in aadt_locs:
#                 if not isinstance(item, dict):
#                     continue
#                 if "nodes" in item:
#                     nm = item.get("location_description") or item.get("name") or "UNKNOWN"
#                     recs.append(_normalize_one_location(nm, item))
#                 elif "objectid" in item:
#                     # flat I-605 row (objectid + zonename_0..3)
#                     oid = str(item.get("objectid"))
#                     nm  = item.get("location_description") or item.get("name") or "UNKNOWN"
#                     day = item.get("daytype", "0: All Days (M-Su)")

#                     ahead_all, behind_all = [], []
#                     for k, v in item.items():
#                         if not (isinstance(k, str) and k.startswith("zonename_")):
#                             continue
#                         try:
#                             idx = int(k.split("_")[1])
#                         except Exception:
#                             idx = None
#                         # Convention: even 0/2 -> ahead; odd 1/3 -> behind
#                         if idx is not None and idx % 2 == 0:
#                             ahead_all.append(v)
#                         else:
#                             behind_all.append(v)

#                     recs.append({
#                         "name": f"{nm} [{oid}]",
#                         "daytype": day,
#                         "ahead_zones_all": ahead_all,
#                         "behind_zones_all": behind_all,
#                         "ahead_zones": _dedup([z for z in ahead_all if z]),
#                         "behind_zones": _dedup([z for z in behind_all if z]),
#                     })
#                 else:
#                     for nm, loc in item.items():
#                         recs.append(_normalize_one_location(nm, loc))
#         return recs

#     # ---- filter & precompute per-zone means ----
#     must_cols = [zonename_col, stl_volume_col, "daytype", "daypart"]
#     for c in must_cols:
#         if c not in df_stl.columns:
#             raise KeyError(f"df_stl is missing required column: {c}")

#     filt = (df_stl["daytype"] == daytype_filter) & (df_stl["daypart"] == daypart_filter)
#     if modeoftravel_filter and ("modeoftravel" in df_stl.columns):
#         filt = filt & (df_stl["modeoftravel"] == modeoftravel_filter)

#     stl_filtered = df_stl.loc[filt, [zonename_col, stl_volume_col]].copy()

#     # Clean types
#     stl_filtered[zonename_col] = stl_filtered[zonename_col].astype(str).str.strip()
#     stl_filtered[stl_volume_col] = pd.to_numeric(stl_filtered[stl_volume_col], errors="coerce")

#     # Compute per-zonename averages (handles duplicates safely)
#     zone_group = stl_filtered.groupby(zonename_col)[stl_volume_col]
#     zone_mean = zone_group.mean()   # pd.Series: index=zonename, value=mean volume
#     zone_rows = zone_group.size()   # pd.Series: index=zonename, value=row count backing the mean
#     present_zones = set(zone_mean.index)

#     def _zone_stats(zones_list, agg_local="sum"):
#         """
#         Return:
#           aggregated_value,
#           backing_row_count_sum,
#           missing_list,
#           present_segment_count,
#           listed_segment_count
#         """
#         # choose counting base (dedup vs all)
#         zones_all = [z for z in _ensure_list(zones_list) if z and str(z).strip() != ""]
#         zones_all = [str(z).strip() for z in zones_all]
#         zones_for_agg = _dedup(zones_all) if segment_count_mode == "unique" else zones_all

#         if not zones_for_agg:
#             return np.nan, 0, [], 0, 0

#         present = [z for z in zones_for_agg if z in present_zones]
#         missing = [z for z in zones_for_agg if z not in present_zones]

#         vals = zone_mean.reindex(present).dropna()
#         if agg_local == "sum":
#             val = float(vals.sum()) if len(vals) else np.nan
#         else:  # "mean"
#             val = float(vals.mean()) if len(vals) else np.nan

#         n_rows = int(zone_rows.reindex(present).fillna(0).sum())
#         present_seg_ct = len(present)
#         listed_seg_ct = len(zones_for_agg)
#         return val, n_rows, missing, present_seg_ct, listed_seg_ct

#     # ---- build rows ----
#     norm = _normalize_input(aadt_locations)
#     rows = []
#     for loc in norm:
#         ahead_all  = loc.get("ahead_zones_all", [])
#         behind_all = loc.get("behind_zones_all", [])
#         ahead_ded  = loc.get("ahead_zones", [])
#         behind_ded = loc.get("behind_zones", [])

#         # choose which list drives counts/aggregation
#         ahead_for_counts  = ahead_ded if segment_count_mode == "unique" else ahead_all
#         behind_for_counts = behind_ded if segment_count_mode == "unique" else behind_all

#         val_ahead, ahead_n, miss_a, present_ahead_ct, listed_ahead_ct = _zone_stats(ahead_for_counts, agg_local=agg)
#         val_behind, behind_n, miss_b, present_behind_ct, listed_behind_ct = _zone_stats(behind_for_counts, agg_local=agg)

#         # Sum ahead + behind to mirror reviewed pipeline
#         overall = np.nansum([val_ahead, val_behind])

#         rows.append({
#             "location": loc.get("name"),
#             "daytype_expected": loc.get("daytype"),
#             "daytype_used": daytype_filter,
#             "daypart_used": daypart_filter,
#             "modeoftravel_used": modeoftravel_filter if modeoftravel_filter else "",

#             # keep original zone strings (deduped for readability)
#             "ahead_zones": "|".join(ahead_ded),
#             "behind_zones": "|".join(behind_ded),

#             # values
#             "non_trad_ahead_mean": val_ahead,
#             "non_trad_behind_mean": val_behind,
#             "non_trad_aadt": overall,

#             # backing row counts in df_stl (unchanged)
#             "stl_ahead_rows": ahead_n,
#             "stl_behind_rows": behind_n,

#             # NEW: segment counts (how many segments were listed vs present in df_stl)
#             "listed_ahead_segments": listed_ahead_ct,
#             "listed_behind_segments": listed_behind_ct,
#             "present_ahead_segments": present_ahead_ct,
#             "present_behind_segments": present_behind_ct,

#             # missing lists
#             "missing_ahead_zones": "|".join(miss_a) if miss_a else "",
#             "missing_behind_zones": "|".join(miss_b) if miss_b else "",
#         })

#     return pd.DataFrame(rows) if as_df else rows




# def non_traditional_aadt_by_location(
#     aadt_locations,
#     df_stl,
#     daytype_filter="0: All Days (M-Su)",
#     daypart_filter="0: All Day (12am-12am)",
#     modeoftravel_filter=None,
#     zonename_col="zonename",
#     stl_volume_col="averagedailysegmenttraffic(stlvolume)",
#     as_df=True,
#     agg="sum",                 # "sum" mirrors reviewed pipeline; "mean" = average across segments
#     segment_count_mode="unique",  # "unique" counts deduped zonenames; "all" counts before dedup
# ):
#     """
#     Build a per-location summary of *non-traditional* (StreetLight) AADT.

#     Output columns (one row per location):
#       location, daytype_expected, daytype_used, daypart_used, modeoftravel_used,
#       ahead_zones, behind_zones,
#       non_trad_ahead_mean, non_trad_behind_mean, non_trad_aadt,
#       stl_ahead_rows, stl_behind_rows, missing_ahead_zones, missing_behind_zones

#     NEW columns added to align with 'count each segment listed':
#       listed_ahead_segments, listed_behind_segments,
#       present_ahead_segments, present_behind_segments

#     Notes:
#       - non_trad_aadt = average of (non_trad_ahead_mean, non_trad_behind_mean) [NaN-safe]
#     """
#     # ---- helpers ----
#     def _ensure_list(x):
#         if x is None: return []
#         if isinstance(x, (list, tuple, set)): return list(x)
#         return [x]

#     def _gather_zones(node_dict):
#         ahead  = _ensure_list(node_dict.get("zonename_ahead", []))
#         behind = _ensure_list(node_dict.get("zonename_behind", []))
#         return ahead, behind

#     def _gather_objectids(node_dict):
#         ids = []
#         if not isinstance(node_dict, dict): return ids
#         if "objectid"  in node_dict: ids.extend(_ensure_list(node_dict["objectid"]))
#         if "objectids" in node_dict: ids.extend(_ensure_list(node_dict["objectids"]))
#         return [str(i) for i in ids if i is not None and str(i).strip() != ""]

#     def _dedup(seq):
#         seen=set(); out=[]
#         for x in seq:
#             if x not in seen:
#                 out.append(x); seen.add(x)
#         return out

#     def _normalize_one_location(name, loc, include_oid_in_name=True):
#         """Nested 'nodes' -> collect objectids and zonenames; append [oids] to name for merge alignment."""
#         nodes = loc.get("nodes", {}) or {}
#         ahead_all, behind_all, all_oids = [], [], []
#         for _, node in nodes.items():
#             a, b = _gather_zones(node)
#             ahead_all.extend([z for z in a if z])
#             behind_all.extend([z for z in b if z])
#             all_oids.extend(_gather_objectids(node))

#         name_out = name
#         if include_oid_in_name and all_oids:
#             name_out = f"{name} [{','.join(_dedup(all_oids))}]"

#         # keep both pre-dedup and dedup lists so we can count segments as requested
#         ahead_ded  = _dedup(ahead_all)
#         behind_ded = _dedup(behind_all)

#         return {
#             "name": name_out,
#             "daytype": loc.get("daytype", "0: All Days (M-Su)"),
#             "ahead_zones_all": ahead_all,
#             "behind_zones_all": behind_all,
#             "ahead_zones": ahead_ded,
#             "behind_zones": behind_ded,
#         }

#     def _normalize_input(aadt_locs):
#         # Already normalized DataFrame?
#         if isinstance(aadt_locs, pd.DataFrame) and \
#            {"name","daytype","ahead_zones","behind_zones"}.issubset(aadt_locs.columns):
#             recs = aadt_locs.to_dict(orient="records")
#             for r in recs:
#                 r.setdefault("ahead_zones_all", r.get("ahead_zones", []))
#                 r.setdefault("behind_zones_all", r.get("behind_zones", []))
#             return recs

#         # Already normalized list[dict]?
#         if isinstance(aadt_locs, list) and aadt_locs and isinstance(aadt_locs[0], dict) and \
#            {"name","daytype","ahead_zones","behind_zones"}.issubset(aadt_locs[0].keys()):
#             recs = aadt_locs
#             for r in recs:
#                 r.setdefault("ahead_zones_all", r.get("ahead_zones", []))
#                 r.setdefault("behind_zones_all", r.get("behind_zones", []))
#             return recs

#         recs = []
#         # Dict keyed by name (nested format)
#         if isinstance(aadt_locs, dict):
#             for nm, loc in aadt_locs.items():
#                 recs.append(_normalize_one_location(nm, loc))
#             return recs

#         # List of locations (mixed formats)
#         if isinstance(aadt_locs, list):
#             for item in aadt_locs:
#                 if not isinstance(item, dict):
#                     continue
#                 if "nodes" in item:
#                     nm = item.get("location_description") or item.get("name") or "UNKNOWN"
#                     recs.append(_normalize_one_location(nm, item))
#                 elif "objectid" in item:
#                     # flat row (objectid + zonename_0..3)
#                     oid = str(item.get("objectid"))
#                     nm  = item.get("location_description") or item.get("name") or "UNKNOWN"
#                     day = item.get("daytype", "0: All Days (M-Su)")

#                     ahead_all, behind_all = [], []
#                     for k, v in item.items():
#                         if not (isinstance(k, str) and k.startswith("zonename_")):
#                             continue
#                         try:
#                             idx = int(k.split("_")[1])
#                         except Exception:
#                             idx = None
#                         # even 0/2 -> ahead; odd 1/3 -> behind
#                         if idx is not None and idx % 2 == 0:
#                             ahead_all.append(v)
#                         else:
#                             behind_all.append(v)

#                     recs.append({
#                         "name": f"{nm} [{oid}]",
#                         "daytype": day,
#                         "ahead_zones_all": ahead_all,
#                         "behind_zones_all": behind_all,
#                         "ahead_zones": _dedup([z for z in ahead_all if z]),
#                         "behind_zones": _dedup([z for z in behind_all if z]),
#                     })
#                 else:
#                     for nm, loc in item.items():
#                         recs.append(_normalize_one_location(nm, loc))
#         return recs

#     # ---- filter & per-zone means ----
#     must_cols = [zonename_col, stl_volume_col, "daytype", "daypart"]
#     for c in must_cols:
#         if c not in df_stl.columns:
#             raise KeyError(f"df_stl is missing required column: {c}")

#     filt = (df_stl["daytype"] == daytype_filter) & (df_stl["daypart"] == daypart_filter)
#     if modeoftravel_filter and ("modeoftravel" in df_stl.columns):
#         filt = filt & (df_stl["modeoftravel"] == modeoftravel_filter)

#     stl_filtered = df_stl.loc[filt, [zonename_col, stl_volume_col]].copy()
#     stl_filtered[zonename_col] = stl_filtered[zonename_col].astype(str).str.strip()
#     stl_filtered[stl_volume_col] = pd.to_numeric(stl_filtered[stl_volume_col], errors="coerce")

#     zone_group = stl_filtered.groupby(zonename_col)[stl_volume_col]
#     zone_mean = zone_group.mean()    # index=zonename, value=mean volume
#     zone_rows = zone_group.size()    # index=zonename, value=row count
#     present_zones = set(zone_mean.index)

#     def _zone_stats(zones_list, agg_local="sum"):
#         """
#         Return:
#           aggregated_value,
#           backing_row_count_sum,
#           missing_list,
#           present_segment_count,
#           listed_segment_count
#         """
#         zones_all = [z for z in _ensure_list(zones_list) if z and str(z).strip() != ""]
#         zones_all = [str(z).strip() for z in zones_all]
#         zones_for_agg = _dedup(zones_all) if segment_count_mode == "unique" else zones_all

#         if not zones_for_agg:
#             return np.nan, 0, [], 0, 0

#         present = [z for z in zones_for_agg if z in present_zones]
#         missing = [z for z in zones_for_agg if z not in present_zones]

#         vals = zone_mean.reindex(present).dropna()
#         if agg_local == "sum":
#             val = float(vals.sum()) if len(vals) else np.nan
#         else:  # "mean"
#             val = float(vals.mean()) if len(vals) else np.nan

#         n_rows = int(zone_rows.reindex(present).fillna(0).sum())
#         present_seg_ct = len(present)
#         listed_seg_ct = len(zones_for_agg)
#         return val, n_rows, missing, present_seg_ct, listed_seg_ct

#     # ---- build rows ----
#     norm = _normalize_input(aadt_locations)
#     rows = []
#     for loc in norm:
#         ahead_all  = loc.get("ahead_zones_all", [])
#         behind_all = loc.get("behind_zones_all", [])
#         ahead_ded  = loc.get("ahead_zones", [])
#         behind_ded = loc.get("behind_zones", [])

#         ahead_for_counts  = ahead_ded if segment_count_mode == "unique" else ahead_all
#         behind_for_counts = behind_ded if segment_count_mode == "unique" else behind_all

#         val_ahead,  ahead_n,  miss_a, present_ahead_ct,  listed_ahead_ct  = _zone_stats(ahead_for_counts,  agg_local=agg)
#         val_behind, behind_n, miss_b, present_behind_ct, listed_behind_ct = _zone_stats(behind_for_counts, agg_local=agg)

#         # >>> CHANGE HERE: average the two direction aggregates (NaN-safe)
        
#         overall = np.nanmean([val_ahead, val_behind])
        
        
#         rows.append({
#             "location": loc.get("name"),
#             "daytype_expected": loc.get("daytype"),
#             "daytype_used": daytype_filter,
#             "daypart_used": daypart_filter,
#             "modeoftravel_used": modeoftravel_filter if modeoftravel_filter else "",

#             "ahead_zones": "|".join(ahead_ded),
#             "behind_zones": "|".join(behind_ded),

#             "non_trad_ahead_mean": val_ahead,
#             "non_trad_behind_mean": val_behind,
#             "non_trad_aadt": overall,

#             "stl_ahead_rows": ahead_n,
#             "stl_behind_rows": behind_n,

#             "listed_ahead_segments": listed_ahead_ct,
#             "listed_behind_segments": listed_behind_ct,
#             "present_ahead_segments": present_ahead_ct,
#             "present_behind_segments": present_behind_ct,

#             "missing_ahead_zones": "|".join(miss_a) if miss_a else "",
#             "missing_behind_zones": "|".join(miss_b) if miss_b else "",
#         })

#     return pd.DataFrame(rows) if as_df else rows



def non_traditional_aadt_by_location(
    aadt_locations,
    df_stl,
    daytype_filter="0: All Days (M-Su)",
    daypart_filter="0: All Day (12am-12am)",
    modeoftravel_filter=None,
    zonename_col="zonename",
    stl_volume_col="averagedailysegmenttraffic(stlvolume)",
    as_df=True,
    agg="sum",                # "sum" mirrors reviewed pipeline at the per-side level
    segment_count_mode="unique",  # "unique" counts deduped zonenames; "all" counts before dedup
):
    """
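    Build a per-location summary of *non-traditional* (StreetLight) AADT.

    Rows of df_stl are filtered by day type, day part, and (optionally) mode of
    travel, averaged per zonename, then aggregated per side (ahead/behind) with `agg`.
    Despite their names, non_trad_ahead_mean and non_trad_behind_mean hold per-side
    sums when agg="sum". non_trad_aadt is the average of the two sides, or the
    lone side when only one has data.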

    Output columns (one row per location):
      location, daytype_expected, daytype_used, daypart_used, modeoftravel_used,
      ahead_zones, behind_zones,
      non_trad_ahead_mean, non_trad_behind_mean, non_trad_aadt,
      stl_ahead_rows, stl_behind_rows, missing_ahead_zones, missing_behind_zones,
      listed_ahead_segments, listed_behind_segments, present_ahead_segments, present_behind_segments
    """

    # ---- helpers ----
    def _ensure_list(x):
        if x is None:
            return []
        if isinstance(x, (list, tuple, set)):
            return list(x)
        return [x]

    def _gather_zones(node_dict):
        ahead  = _ensure_list(node_dict.get("zonename_ahead", []))
        behind = _ensure_list(node_dict.get("zonename_behind", []))
        return ahead, behind

    def _gather_objectids(node_dict):
        ids = []
        if not isinstance(node_dict, dict):
            return ids
        if "objectid"  in node_dict: ids.extend(_ensure_list(node_dict["objectid"]))
        if "objectids" in node_dict: ids.extend(_ensure_list(node_dict["objectids"]))
        return [str(i) for i in ids if i is not None and str(i).strip() != ""]

    def _dedup(seq):
        seen=set(); out=[]
        for x in seq:
            if x not in seen:
                out.append(x); seen.add(x)
        return out

    def _normalize_one_location(name, loc, include_oid_in_name=True):
        """Collect objectids and zonenames; append [oids] to name for merge alignment."""
        nodes = loc.get("nodes", {}) or {}
        ahead_all, behind_all, all_oids = [], [], []
        for _, node in nodes.items():
            a, b = _gather_zones(node)
            ahead_all.extend([z for z in a if z])
            behind_all.extend([z for z in b if z])
            all_oids.extend(_gather_objectids(node))

        name_out = name
        if include_oid_in_name and all_oids:
            name_out = f"{name} [{','.join(_dedup(all_oids))}]"

        return {
            "name": name_out,
            "daytype": loc.get("daytype", "0: All Days (M-Su)"),
            "ahead_zones_all": ahead_all,
            "behind_zones_all": behind_all,
            "ahead_zones": _dedup(ahead_all),
            "behind_zones": _dedup(behind_all),
        }

    def _normalize_input(aadt_locs):
        # Already normalized DataFrame?
        if isinstance(aadt_locs, pd.DataFrame) and \
           {"name","daytype","ahead_zones","behind_zones"}.issubset(aadt_locs.columns):
            recs = aadt_locs.to_dict(orient="records")
            for r in recs:
                r.setdefault("ahead_zones_all", r.get("ahead_zones", []))
                r.setdefault("behind_zones_all", r.get("behind_zones", []))
            return recs

        # Already normalized list[dict]?
        if isinstance(aadt_locs, list) and aadt_locs and isinstance(aadt_locs[0], dict) and \
           {"name","daytype","ahead_zones","behind_zones"}.issubset(aadt_locs[0].keys()):
            recs = aadt_locs
            for r in recs:
                r.setdefault("ahead_zones_all", r.get("ahead_zones", []))
                r.setdefault("behind_zones_all", r.get("behind_zones", []))
            return recs

        recs = []
        # Dict keyed by name (nested format)
        if isinstance(aadt_locs, dict):
            for nm, loc in aadt_locs.items():
                recs.append(_normalize_one_location(nm, loc))
            return recs

        # List of locations (mixed formats)
        if isinstance(aadt_locs, list):
            for item in aadt_locs:
                if not isinstance(item, dict):
                    continue
                if "nodes" in item:
                    nm = item.get("location_description") or item.get("name") or "UNKNOWN"
                    recs.append(_normalize_one_location(nm, item))
                elif "objectid" in item:
                    # flat row (objectid + zonename_0..3)
                    oid = str(item.get("objectid"))
                    nm  = item.get("location_description") or item.get("name") or "UNKNOWN"
                    day = item.get("daytype", "0: All Days (M-Su)")

                    ahead_all, behind_all = [], []
                    for k, v in item.items():
                        if not (isinstance(k, str) and k.startswith("zonename_")):
                            continue
                        try:
                            idx = int(k.split("_")[1])
                        except (IndexError, ValueError):
                            # no numeric suffix (e.g. "zonename_ahead"): cannot apply the parity rule
                            continue
                        # even 0/2 -> ahead; odd 1/3 -> behind
                        if idx % 2 == 0:
                            ahead_all.append(v)
                        else:
                            behind_all.append(v)

                    recs.append({
                        "name": f"{nm} [{oid}]",
                        "daytype": day,
                        "ahead_zones_all": ahead_all,
                        "behind_zones_all": behind_all,
                        "ahead_zones": _dedup([z for z in ahead_all if z]),
                        "behind_zones": _dedup([z for z in behind_all if z]),
                    })
                else:
                    for nm, loc in item.items():
                        recs.append(_normalize_one_location(nm, loc))
        return recs

    # ---- filter & precompute per-zone means ----
    must_cols = [zonename_col, stl_volume_col, "daytype", "daypart"]
    for c in must_cols:
        if c not in df_stl.columns:
            raise KeyError(f"df_stl is missing required column: {c}")

    filt = (df_stl["daytype"] == daytype_filter) & (df_stl["daypart"] == daypart_filter)
    if modeoftravel_filter and ("modeoftravel" in df_stl.columns):
        filt = filt & (df_stl["modeoftravel"] == modeoftravel_filter)

    stl_filtered = df_stl.loc[filt, [zonename_col, stl_volume_col]].copy()

    # Clean types
    stl_filtered[zonename_col] = stl_filtered[zonename_col].astype(str).str.strip()
    stl_filtered[stl_volume_col] = pd.to_numeric(stl_filtered[stl_volume_col], errors="coerce")

    # Per-zonename averages (handles duplicates safely)
    zone_group = stl_filtered.groupby(zonename_col)[stl_volume_col]
    zone_mean = zone_group.mean()   # pd.Series: index=zonename, value=mean volume
    zone_rows = zone_group.size()   # pd.Series: index=zonename, value=row count backing the mean
    present_zones = set(zone_mean.index)

    def _zone_stats(zones_list, agg_local="sum"):
        """
        Return:
          aggregated_value,
          backing_row_count_sum,
          missing_list,
          present_segment_count,
          listed_segment_count
        """
        zones_all = [z for z in _ensure_list(zones_list) if z and str(z).strip() != ""]
        zones_all = [str(z).strip() for z in zones_all]
        zones_for_agg = _dedup(zones_all) if segment_count_mode == "unique" else zones_all

        if not zones_for_agg:
            return np.nan, 0, [], 0, 0

        present = [z for z in zones_for_agg if z in present_zones]
        missing = [z for z in zones_for_agg if z not in present_zones]

        vals = zone_mean.reindex(present).dropna()
        if agg_local == "sum":
            val = float(vals.sum()) if len(vals) else np.nan
        else:  # "mean" across segments within a side
            val = float(vals.mean()) if len(vals) else np.nan

        n_rows = int(zone_rows.reindex(present).fillna(0).sum())
        present_seg_ct = len(present)
        listed_seg_ct = len(zones_for_agg)
        return val, n_rows, missing, present_seg_ct, listed_seg_ct

    def _combine_dirs(a, b):
        """Average the two directions if both exist; otherwise use the one that exists."""
        a = np.nan if a is None else a
        b = np.nan if b is None else b
        if pd.notna(a) and pd.notna(b):
            return (float(a) + float(b)) / 2.0
        if pd.notna(a):
            return float(a)
        if pd.notna(b):
            return float(b)
        return np.nan

    # ---- build rows ----
    norm = _normalize_input(aadt_locations)
    rows = []
    for loc in norm:
        ahead_all  = loc.get("ahead_zones_all", [])
        behind_all = loc.get("behind_zones_all", [])
        ahead_ded  = loc.get("ahead_zones", [])
        behind_ded = loc.get("behind_zones", [])

        # choose which list drives counts/aggregation
        ahead_for_counts  = ahead_ded if segment_count_mode == "unique" else ahead_all
        behind_for_counts = behind_ded if segment_count_mode == "unique" else behind_all

        val_ahead, ahead_n, miss_a, present_ahead_ct, listed_ahead_ct = _zone_stats(ahead_for_counts, agg_local=agg)
        val_behind, behind_n, miss_b, present_behind_ct, listed_behind_ct = _zone_stats(behind_for_counts, agg_local=agg)

        # Combine the two sides: average when both exist, otherwise use the lone side
        overall = _combine_dirs(val_ahead, val_behind)

        rows.append({
            "location": loc.get("name"),
            "daytype_expected": loc.get("daytype"),
            "daytype_used": daytype_filter,
            "daypart_used": daypart_filter,
            "modeoftravel_used": modeoftravel_filter if modeoftravel_filter else "",

            # Keep deduped zone strings for readability
            "ahead_zones": "|".join(ahead_ded),
            "behind_zones": "|".join(behind_ded),

            # Values
            "non_trad_ahead_mean": val_ahead,
            "non_trad_behind_mean": val_behind,
            "non_trad_aadt": overall,

            # Backing row counts in df_stl
            "stl_ahead_rows": ahead_n,
            "stl_behind_rows": behind_n,

            # Segment counts
            "listed_ahead_segments": listed_ahead_ct,
            "listed_behind_segments": listed_behind_ct,
            "present_ahead_segments": present_ahead_ct,
            "present_behind_segments": present_behind_ct,

            # Missing lists
            "missing_ahead_zones": "|".join(miss_a) if miss_a else "",
            "missing_behind_zones": "|".join(miss_b) if miss_b else "",
        })

    return pd.DataFrame(rows) if as_df else rows
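
The aggregation order in `non_traditional_aadt_by_location` can be easier to follow on toy numbers. The sketch below is a standalone, pure-Python illustration of the same three steps (mean per zonename, sum per side, then combine the two sides). The zone names, volumes, and helper names (`zone_means`, `side_value`, `combine_dirs`) are hypothetical and are not part of the notebook's data or API.

```python
import math

# Hypothetical toy rows: (zonename, StreetLight volume). All values are made up.
rows = [
    ("zone_A", 1000.0),
    ("zone_A", 1200.0),   # duplicate zone row, averaged with the row above
    ("zone_B", 500.0),
    ("zone_C", 2000.0),
]

def zone_means(rows):
    """Mean volume per zonename (stands in for the groupby(...).mean() step)."""
    by_zone = {}
    for zone, vol in rows:
        by_zone.setdefault(zone, []).append(vol)
    return {zone: sum(v) / len(v) for zone, v in by_zone.items()}

def side_value(zones, means, agg="sum"):
    """Aggregate the deduplicated zones listed for one side; NaN when none are present."""
    vals = [means[z] for z in dict.fromkeys(zones) if z in means]
    if not vals:
        return math.nan
    return sum(vals) if agg == "sum" else sum(vals) / len(vals)

def combine_dirs(a, b):
    """Average both sides when both exist; otherwise return the side that exists."""
    present = [x for x in (a, b) if not math.isnan(x)]
    return sum(present) / len(present) if present else math.nan

means = zone_means(rows)
ahead = side_value(["zone_A", "zone_B"], means)          # 1100 + 500
behind = side_value(["zone_C", "zone_missing"], means)   # only zone_C is present
both = combine_dirs(ahead, behind)                       # average of the two sides
lone = combine_dirs(ahead, math.nan)                     # falls back to the ahead side
print(ahead, behind, both, lone)  # 1600.0 2000.0 1800.0 1600.0
```

Note that a zone listed for a side but absent from the filtered StreetLight rows is simply left out of the sum, which is why the function also reports the `missing_*_zones` and `present_*_segments` columns.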


In [21]:
# Run non_traditional_aadt_by_location on the raw nested aadt_locations structure:
stl_df = non_traditional_aadt_by_location(
    aadt_locations,
    df_stl,
    daytype_filter="0: All Days (M-Su)",
    daypart_filter="0: All Day (12am-12am)",
    modeoftravel_filter="All Vehicles - StL All Vehicles Volume",  # or None
    zonename_col="zonename",
    stl_volume_col="averagedailysegmenttraffic(stlvolume)",
    as_df=True
)



In [22]:
# Export step 2 to a CSV
stl_df.to_csv("step_2_non_traditional_aadt_by_location.csv", index=False)

### Step 3, Build the per-location comparison DataFrame
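
As a minimal sketch of the percent-difference metric this step reports as `tce_percent`: the helper below assumes the same sign convention as the notebook's formula (non-traditional minus traditional, divided by traditional) and returns NaN when the baseline is missing or zero. The function is a standalone illustration and the volumes are made up, not taken from the analysis data.

```python
import math

def tce_percent(traditional_aadt, non_trad_aadt):
    """Percent difference of the non-traditional estimate relative to the traditional count."""
    t, n = traditional_aadt, non_trad_aadt
    if t is None or n is None or math.isnan(t) or math.isnan(n) or t == 0:
        return math.nan
    return 100.0 * (n - t) / t

# Hypothetical values for illustration only.
over = tce_percent(20000.0, 21000.0)    # estimate is above the count
under = tce_percent(20000.0, 19000.0)   # estimate is below the count
undefined = tce_percent(0.0, 19000.0)   # zero baseline gives NaN instead of a division error
print(over, under, undefined)  # 5.0 -5.0 nan
```

A positive value means the StreetLight estimate is higher than the Traffic Census count at that location, and a negative value means it is lower.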

In [23]:
# ------------------------------------------------------
# 3) Build the per-location comparison DataFrame
# ------------------------------------------------------

def _base_location(s: str) -> str:
    if not isinstance(s, str): return ""
    return re.sub(r"\s*\[.*\]\s*$", "", s).strip()

def _pick_col(df: pd.DataFrame, base: str):
    if base in df.columns: 
        return base
    for suf in ("_trad", "_nt", "_x", "_y"):
        c = f"{base}{suf}"
        if c in df.columns:
            return c
    return None

def _uniq_join(series, sep="|"):
    vals = []
    for x in series.dropna().astype(str):
        if not x:
            continue
        vals.extend([t.strip() for t in x.split(sep) if t.strip()])
    # preserve order while deduping
    seen=set(); out=[]
    for v in vals:
        if v not in seen:
            out.append(v); seen.add(v)
    return sep.join(out)

def _uniq_join_commas(series):
    # for objectids like "7817" etc.
    vals = []
    for x in series.dropna().astype(str):
        for t in re.split(r"[,\s]+", x.strip()):
            if t:
                vals.append(t)
    seen=set(); out=[]
    for v in vals:
        if v not in seen:
            out.append(v); seen.add(v)
    return ",".join(out)

# def build_aadt_comparison_df(
#     aadt_locations,
#     df_tc,
#     df_stl,
#     daytype_filter="0: All Days (M-Su)",
#     daypart_filter="0: All Day (12am-12am)",
#     modeoftravel_filter=None,
#     zonename_col="zonename",
#     stl_volume_col="averagedailysegmenttraffic(stlvolume)",
#     agg="sum",
#     segment_count_mode="unique",
#     collapse_to_one_per_location=True  # NEW
# ) -> pd.DataFrame:

#     # --- 1) Traditional side ---
#     trad_df = traditional_aadt_by_location(
#         aadt_locations=aadt_locations, df_tc=df_tc, as_df=True
#     ).copy()
#     trad_df["location_base"] = trad_df["location"].apply(_base_location)

#     # --- 2) Non-traditional side ---
#     nt_df = non_traditional_aadt_by_location(
#         aadt_locations=aadt_locations,
#         df_stl=df_stl,
#         daytype_filter=daytype_filter,
#         daypart_filter=daypart_filter,
#         modeoftravel_filter=modeoftravel_filter,
#         zonename_col=zonename_col,
#         stl_volume_col=stl_volume_col,
#         as_df=True,
#         agg=agg,
#         segment_count_mode=segment_count_mode
#     ).copy()
#     nt_df["location_base"] = nt_df["location"].apply(_base_location)

#     # --- 3) Merge on base name ---
#     merged = pd.merge(trad_df, nt_df, how="inner", on="location_base", suffixes=("_trad", "_nt"))

#     # Helper: safely fetch a possibly-suffixed column from `merged`
#     def G(name):
#         c = _pick_col(merged, name)
#         return merged[c] if c is not None else pd.Series([np.nan]*len(merged), index=merged.index)
      
#     if collapse_to_one_per_location:
#         grp = merged.groupby("location_base", dropna=False)

#         def _first(series):
#             return series.iloc[0] if len(series) else np.nan

#         out = pd.DataFrame({
#             "location": grp["location_base"].first(),

#             # IDs & zones
#             "objectids": grp[_pick_col(merged, "objectids_trad") or _pick_col(merged, "objectids")].apply(
#                 lambda s: ",".join(dict.fromkeys([x for v in s.dropna().astype(str) for x in v.replace("|",",").split(",") if x]))
#             ),
#             "ahead_zones": grp[_pick_col(merged, "ahead_zones_nt") or _pick_col(merged, "ahead_zones")].apply(
#                 lambda s: "|".join(dict.fromkeys([x.strip() for v in s.dropna().astype(str) for x in v.split("|") if x.strip()]))
#             ),
#             "behind_zones": grp[_pick_col(merged, "behind_zones_nt") or _pick_col(merged, "behind_zones")].apply(
#                 lambda s: "|".join(dict.fromkeys([x.strip() for v in s.dropna().astype(str) for x in v.split("|") if x.strip()]))
#             ),

#             # Traditional metrics: duplicates are identical -> mean == first
#             "traditional_ahead_mean": grp[_pick_col(merged, "traditional_ahead_mean")].mean(),
#             "traditional_behind_mean": grp[_pick_col(merged, "traditional_behind_mean")].mean(),
#             "traditional_aadt":       grp[_pick_col(merged, "traditional_aadt")].mean(),

#             # Non-traditional metrics: DO NOT sum across duplicate merge rows
#             # Use mean (same as first, since NT is repeated on each duplicate row)
#             "non_trad_ahead_mean": grp[_pick_col(merged, "non_trad_ahead_mean")].mean(),
#             "non_trad_behind_mean":grp[_pick_col(merged, "non_trad_behind_mean")].mean(),
#             "non_trad_aadt":       grp[_pick_col(merged, "non_trad_aadt")].mean(),

#             # Counts / diagnostics: use max (not sum) to avoid doubling
#             "stl_ahead_rows":          grp[_pick_col(merged, "stl_ahead_rows")].max(),
#             "stl_behind_rows":         grp[_pick_col(merged, "stl_behind_rows")].max(),
#             "listed_ahead_segments":   grp[_pick_col(merged, "listed_ahead_segments")].max(),
#             "listed_behind_segments":  grp[_pick_col(merged, "listed_behind_segments")].max(),
#             "present_ahead_segments":  grp[_pick_col(merged, "present_ahead_segments")].max(),
#             "present_behind_segments": grp[_pick_col(merged, "present_behind_segments")].max(),

#             # Missing lists: unique-join
#             "missing_ahead_zones": grp[_pick_col(merged, "missing_ahead_zones")].apply(
#                 lambda s: "|".join(dict.fromkeys([x.strip() for v in s.dropna().astype(str) for x in v.split("|") if x.strip()]))
#             ),
#             "missing_behind_zones": grp[_pick_col(merged, "missing_behind_zones")].apply(
#                 lambda s: "|".join(dict.fromkeys([x.strip() for v in s.dropna().astype(str) for x in v.split("|") if x.strip()]))
#             ),

#             # Filters / metadata: stable representative
#             "daytype":           grp[_pick_col(merged, "daytype_trad") or _pick_col(merged, "daytype")].first(),
#             "daytype_expected":  grp[_pick_col(merged, "daytype_expected")].first(),
#             "daytype_used":      grp[_pick_col(merged, "daytype_used")].first(),
#             "daypart_used":      grp[_pick_col(merged, "daypart_used")].first(),
#             "modeoftravel_used": grp[_pick_col(merged, "modeoftravel_used")].first(),
#         }).reset_index(drop=True)

#         # recompute counts and TCE
#         out["n_objectids"] = out["objectids"].apply(lambda s: 0 if not isinstance(s, str) or not s.strip()
#                                                     else len([x for x in s.split(",") if x]))
#         # n_found_in_tc: take max (avoid doubling)
#         if _pick_col(merged, "n_found_in_tc"):
#             out["n_found_in_tc"] = grp[_pick_col(merged, "n_found_in_tc")].max().values

#         def _tce_row(row):
#             t, n = row.get("traditional_aadt"), row.get("non_trad_aadt")
#             return 100.0 * (n - t) / t if pd.notna(t) and t != 0 and pd.notna(n) else np.nan
#         out["tce_percent"] = out.apply(_tce_row, axis=1)

#         preferred_cols = [
#             "location","objectids","n_objectids","n_found_in_tc",
#             "ahead_zones","behind_zones",
#             "traditional_ahead_mean","traditional_behind_mean","traditional_aadt",
#             "non_trad_ahead_mean","non_trad_behind_mean","non_trad_aadt",
#             "tce_percent",
#             "daytype","daytype_expected","daytype_used","daypart_used","modeoftravel_used",
#             "stl_ahead_rows","stl_behind_rows",
#             "missing_ahead_zones","missing_behind_zones",
#             "listed_ahead_segments","listed_behind_segments",
#             "present_ahead_segments","present_behind_segments",
#         ]
#         cols = [c for c in preferred_cols if c in out.columns]
#         return out[cols].copy()


def build_aadt_comparison_df(
    aadt_locations,
    df_tc,
    df_stl,
    daytype_filter="0: All Days (M-Su)",
    daypart_filter="0: All Day (12am-12am)",
    modeoftravel_filter=None,
    zonename_col="zonename",
    stl_volume_col="averagedailysegmenttraffic(stlvolume)"
) -> pd.DataFrame:
    """
    Robust build that:
      • Works with nested locations OR exploded objectid rows
      • Merges Traditional + Non-Traditional
      • Collapses duplicates to one row per physical location
      • Sums STL per direction across dup rows, then sets non_trad_aadt = avg(ahead_sum, behind_sum)
      • Keeps objectids as a '|'-joined string to avoid numeric/CSV formatting issues
    """

    # --- 1) Build the two sides ---
    trad_df = traditional_aadt_by_location(
        aadt_locations=aadt_locations,
        df_tc=df_tc,
        as_df=True
    )
    nt_df = non_traditional_aadt_by_location(
        aadt_locations=aadt_locations,
        df_stl=df_stl,
        daytype_filter=daytype_filter,
        daypart_filter=daypart_filter,
        modeoftravel_filter=modeoftravel_filter,
        zonename_col=zonename_col,
        stl_volume_col=stl_volume_col,
        as_df=True
    )

    merged = pd.merge(
        trad_df,
        nt_df,
        how="inner",
        on="location",
        suffixes=("_trad", "_nt")
    )

    # --- 2) Normalize to a single location name (strip trailing " [objectids]") ---
    merged["location_base"] = merged["location"].str.replace(r"\s*\[[^\]]+\]\s*$", "", regex=True)

    # Helpers
    def uniq_join(series):
        seen=set(); out=[]
        for s in series.dropna().astype(str):
            for tok in str(s).split("|"):
                tok = tok.strip()
                if tok and tok not in seen:
                    seen.add(tok); out.append(tok)
        return "|".join(out)

    def join_objectids(series):
        toks=[]
        # drop missing values first; astype(str) would otherwise turn NaN into "nan"
        for s in series.dropna().astype(str):
            # keep the digit runs, whatever delimiter separates them
            toks.extend(re.findall(r"\d+", s))
        # stable-unique
        out=[]; seen=set()
        for t in toks:
            if t not in seen:
                seen.add(t); out.append(t)
        return "|".join(out)

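    # --- 3) Collapse duplicates BY location_base ---
    def nan_sum(series):
        # Sum that stays NaN when every value is NaN. A plain "sum" returns 0 for an
        # all-NaN group, which would make a missing direction look like zero traffic
        # and halve non_trad_aadt when the two sides are averaged in step 4.
        return series.sum(min_count=1)

    agg = {
        # Traditional (should agree across dup rows; mean is fine)
        "traditional_ahead_mean": "mean",
        "traditional_behind_mean": "mean",
        "traditional_aadt": "mean",

        # StreetLight directions: NaN-preserving sum across dup rows
        "non_trad_ahead_mean": nan_sum,
        "non_trad_behind_mean": nan_sum,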

        # Segment/row counts: sum
        "stl_ahead_rows": "sum",
        "stl_behind_rows": "sum",
        "listed_ahead_segments": "sum",
        "listed_behind_segments": "sum",
        "present_ahead_segments": "sum",
        "present_behind_segments": "sum",

        # String unions
        "ahead_zones": uniq_join,
        "behind_zones": uniq_join,
        "missing_ahead_zones": uniq_join,
        "missing_behind_zones": uniq_join,

        # IDs / metadata
        "objectids": join_objectids,
        "n_objectids": "sum",
        "n_found_in_tc": "sum",
        "daytype": "first",
        "daytype_expected": "first",
        "daytype_used": "first",
        "daypart_used": "first",
        "modeoftravel_used": "first",
    }
    # only keep aggregations for columns we actually have
    agg = {k:v for k,v in agg.items() if k in merged.columns}

    out = (merged
           .groupby("location_base", as_index=False)
           .agg(agg)
           .rename(columns={"location_base":"location"}))

    # Harden objectids as strings; recompute counts from the pipe-joined string
    if "objectids" in out.columns:
        out["objectids"] = out["objectids"].astype(str)
        out["n_objectids"] = out["objectids"].str.split(r"\|").apply(lambda xs: len([t for t in xs if t]))

    # --- 4) Recompute STL combined AADT as the avg of the summed directions (or lone side if missing) ---
    if {"non_trad_ahead_mean","non_trad_behind_mean"}.issubset(out.columns):
        a = pd.to_numeric(out["non_trad_ahead_mean"], errors="coerce")
        b = pd.to_numeric(out["non_trad_behind_mean"], errors="coerce")
        out["non_trad_aadt"] = np.where(
            a.notna() & b.notna(),
            (a + b) / 2.0,
            np.where(a.notna(), a, b)
        )

    # --- 5) Recompute TCE ---
    if {"traditional_aadt","non_trad_aadt"}.issubset(out.columns):
        T = pd.to_numeric(out["traditional_aadt"], errors="coerce")
        N = pd.to_numeric(out["non_trad_aadt"], errors="coerce")
        out["tce_percent"] = np.where(T.notna() & (T != 0) & N.notna(), 100.0*(N - T)/T, np.nan)

    # Keep a clean, stable column order (only columns that exist)
    preferred_cols = [
        "location",
        "objectids","n_objectids","n_found_in_tc",
        "ahead_zones","behind_zones",
        "traditional_ahead_mean","traditional_behind_mean","traditional_aadt",
        "non_trad_ahead_mean","non_trad_behind_mean","non_trad_aadt",
        "tce_percent",
        "daytype","daytype_expected","daytype_used","daypart_used","modeoftravel_used",
        "stl_ahead_rows","stl_behind_rows",
        "listed_ahead_segments","listed_behind_segments",
        "present_ahead_segments","present_behind_segments",
        "missing_ahead_zones","missing_behind_zones",
    ]
    cols = [c for c in preferred_cols if c in out.columns]
    return out[cols].copy()


    
    


In [24]:
# 3.1) Build the combined comparison DataFrame
cmp_df = build_aadt_comparison_df(
    aadt_locations=aadt_locations,  # your dict/list structure
    df_tc=df_tc,                                 # Traffic Census dataframe
    df_stl=df_stl,                               # StreetLight dataframe
    daytype_filter="0: All Days (M-Su)",
    daypart_filter="0: All Day (12am-12am)",
    modeoftravel_filter=None,                    # e.g., "0: All Modes" if you need it
    zonename_col="zonename",
    stl_volume_col="averagedailysegmenttraffic(stlvolume)"
)

In [25]:
# 3.2) Quick peek
#cmp_df.head()

In [26]:
# 3.3) (Optional) sort by absolute TCE to see big deltas first
cmp_df = cmp_df.sort_values("tce_percent", key=lambda s: s.abs(), ascending=False)

In [27]:
# 3.4) Export to CSV 
cmp_df.to_csv("step_3_comparison_dataframe.csv", index=False)

### Step 3.5 Collapse to one row per location


In [28]:
def collapse_to_one_row_per_location(cmp_df: pd.DataFrame) -> pd.DataFrame:
    df = cmp_df.copy()

    # strip trailing " [.....]" tag
    df["location_clean"] = df["location"].str.replace(r"\s*\[[^\]]+\]\s*$", "", regex=True)

    # union helper for pipe-joined strings
    def uniq_join(series):
        seen = set(); out = []
        for s in series.dropna().astype(str):
            for tok in s.split("|"):
                tok = tok.strip()
                if tok and tok not in seen:
                    seen.add(tok); out.append(tok)
        return "|".join(out)

    # join objectids with a PIPE to avoid CSV/number formatting issues
    def join_objectids(series):
        toks = []
        # drop missing values first; astype(str) would otherwise turn NaN into the token "nan"
        for s in series.dropna().astype(str):
            # split on commas, pipes, or whitespace; keep non-empty tokens
            for t in re.split(r"[,\|\s]+", s):
                t = t.strip()
                if t:
                    toks.append(t)
        # unique but stable-ish
        uniq = []
        seen = set()
        for t in toks:
            if t not in seen:
                seen.add(t); uniq.append(t)
        return "|".join(uniq)

    agg = {
        # TC side (rows should agree; mean is fine)
        "traditional_ahead_mean": "mean",
        "traditional_behind_mean": "mean",
        "traditional_aadt": "mean",

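        # StreetLight directions: sum across duplicate rows.
        # min_count=1 keeps an all-missing direction as NaN (a plain "sum" returns 0),
        # so the lone-side fallback below can still apply.
        "non_trad_ahead_mean": lambda s: s.sum(min_count=1),
        "non_trad_behind_mean": lambda s: s.sum(min_count=1),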

        # Counts should sum
        "stl_ahead_rows": "sum",
        "stl_behind_rows": "sum",
        "listed_ahead_segments": "sum",
        "listed_behind_segments": "sum",
        "present_ahead_segments": "sum",
        "present_behind_segments": "sum",

        # Strings/metadata
        "ahead_zones": uniq_join,
        "behind_zones": uniq_join,
        "missing_ahead_zones": uniq_join,
        "missing_behind_zones": uniq_join,
        "daytype": "first",
        "daytype_expected": "first",
        "daytype_used": "first",
        "daypart_used": "first",
        "modeoftravel_used": "first",

        # IDs
        "objectids": join_objectids,
        "n_objectids": "sum",
        "n_found_in_tc": "sum",
    }
    agg = {k: v for k, v in agg.items() if k in df.columns}

    out = df.groupby("location_clean", as_index=False).agg(agg).rename(columns={"location_clean":"location"})

    # Harden objectids: keep as strings and recompute counts
    if "objectids" in out.columns:
        out["objectids"] = out["objectids"].astype(str)
        out["n_objectids"] = out["objectids"].str.split(r"\|").apply(lambda xs: len([t for t in xs if t]))

    # Recompute non_trad_aadt as the avg of the two summed directions
    if {"non_trad_ahead_mean","non_trad_behind_mean"}.issubset(out.columns):
        a = pd.to_numeric(out["non_trad_ahead_mean"], errors="coerce")
        b = pd.to_numeric(out["non_trad_behind_mean"], errors="coerce")
        out["non_trad_aadt"] = np.where(a.notna() & b.notna(), (a+b)/2.0, np.where(a.notna(), a, b))

    # Recompute TCE with updated non_trad_aadt
    if {"traditional_aadt","non_trad_aadt"}.issubset(out.columns):
        T = pd.to_numeric(out["traditional_aadt"], errors="coerce")
        N = pd.to_numeric(out["non_trad_aadt"], errors="coerce")
        out["tce_percent"] = np.where(T.notna() & (T!=0) & N.notna(), 100.0*(N-T)/T, np.nan)

    return out

In [29]:
# run the collapse to one row function
cmp_df = collapse_to_one_row_per_location(cmp_df)

In [30]:
# 3.5) Export the collapsed table to CSV
cmp_df.to_csv("step_3_5_collapse_to_one_row.csv", index=False)
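
The collapse step can be illustrated on a toy table. This is a minimal sketch with made-up values and a hypothetical second location ("EXAMPLE JCT"), not the notebook's data. It assumes the same column names as `cmp_df` and, as a choice of this sketch, sums with `min_count=1` so that a direction with no StreetLight data stays missing instead of becoming 0.

```python
import numpy as np
import pandas as pd

# Made-up rows: one location split across two objectid-tagged rows,
# plus a hypothetical location with no "behind" StreetLight data.
toy = pd.DataFrame({
    "location": [
        "IRWINDALE, ARROW HIGHWAY [101]",
        "IRWINDALE, ARROW HIGHWAY [102]",
        "EXAMPLE JCT [200]",
    ],
    "traditional_aadt": [1000.0, 1000.0, 500.0],
    "non_trad_ahead_mean": [400.0, 500.0, 450.0],
    "non_trad_behind_mean": [450.0, 550.0, np.nan],
})

# Strip the trailing " [objectid]" tag so split rows share one key.
toy["location_clean"] = toy["location"].str.replace(r"\s*\[[^\]]+\]\s*$", "", regex=True)

collapsed = toy.groupby("location_clean", as_index=False).agg(
    traditional_aadt=("traditional_aadt", "mean"),
    non_trad_ahead_mean=("non_trad_ahead_mean", lambda s: s.sum(min_count=1)),
    non_trad_behind_mean=("non_trad_behind_mean", lambda s: s.sum(min_count=1)),
)

# Average the two summed directions, or fall back to the lone available side.
a, b = collapsed["non_trad_ahead_mean"], collapsed["non_trad_behind_mean"]
collapsed["non_trad_aadt"] = np.where(
    a.notna() & b.notna(), (a + b) / 2.0, np.where(a.notna(), a, b)
)

# Percent difference of StreetLight relative to Traffic Census.
T, N = collapsed["traditional_aadt"], collapsed["non_trad_aadt"]
collapsed["tce_percent"] = np.where(
    T.notna() & (T != 0) & N.notna(), 100.0 * (N - T) / T, np.nan
)
print(collapsed[["location_clean", "non_trad_aadt", "tce_percent"]])
```

In this toy case the two Irwindale rows combine into one row, and the location with a missing direction keeps its single available value rather than being averaged with zero.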

## Step 4 Confidence Interval over TCE

In [34]:
# # ------------------------------------------------------
# # 4) Confidence interval over TCE
# # ------------------------------------------------------

def _prep_tces(detail_df, tce_col="tce_percent", cap_abs=None, winsor_pct=None):
    # Extract, coerce, and clean
    s = pd.to_numeric(detail_df[tce_col], errors="coerce").replace([np.inf, -np.inf], np.nan).dropna()

    dropped = 0
    if cap_abs is not None:
        mask = s.abs() <= float(cap_abs)
        dropped = int((~mask).sum())
        s = s[mask]

    # Optional winsorization
    if winsor_pct is not None and 0 < winsor_pct < 0.5 and len(s) > 0:
        lo = s.quantile(winsor_pct)
        hi = s.quantile(1 - winsor_pct)
        s = s.clip(lower=lo, upper=hi)

    return s.astype(float), dropped

def tce_confidence_interval(
    detail_df,
    confidence=0.95,
    tce_col="tce_percent",
    cap_abs=None,          # e.g., 500 trims extreme %s
    winsor_pct=None        # e.g., 0.01 winsorizes 1% tails
):
    """
    One-sample t CI on TCE (%) vs 0.
    Returns: (mean_tce, ci_lo, ci_hi, tcrit, t_stat)
    """
    tces, dropped = _prep_tces(detail_df, tce_col=tce_col, cap_abs=cap_abs, winsor_pct=winsor_pct)
    n = int(tces.shape[0])
    if n == 0:
        return None, None, None, None, None

    mean_tce = float(tces.mean())

    if n > 1:
        std_tce = float(tces.std(ddof=1))
        se = std_tce / np.sqrt(n) if std_tce > 0 else 0.0
        if se > 0:
            dof = n - 1
            tcrit = float(stats.t.ppf((1 + confidence) / 2.0, dof))
            ci_lo = mean_tce - tcrit * se
            ci_hi = mean_tce + tcrit * se
            t_stat = mean_tce / se
        else:
            tcrit = ci_lo = ci_hi = t_stat = None
    else:
        tcrit = ci_lo = ci_hi = t_stat = None

    return mean_tce, ci_lo, ci_hi, tcrit, t_stat

def tce_confidence_interval_df(
    detail_df,
    confidence=0.95,
    tce_col="tce_percent",
    cap_abs=None,          # drop rows with |tce| > cap_abs
    winsor_pct=None        # winsorize tails by this fraction
) -> pd.DataFrame:
    """
    Same as tce_confidence_interval, with a one-row DataFrame and diagnostics.
    """
    tces, dropped = _prep_tces(detail_df, tce_col=tce_col, cap_abs=cap_abs, winsor_pct=winsor_pct)
    n = int(tces.shape[0])

    if n == 0:
        return pd.DataFrame([{
            "confidence": confidence,
            "tce_col": tce_col,
            "n": 0,
            "dof": None,
            "mean_tce": None,
            "std_tce": None,
            "se": None,
            "t_critical": None,
            "margin_of_error": None,
            "ci_lower": None,
            "ci_upper": None,
            "t_statistic": None,
            "p_value_two_sided": None,
            "cohens_d": None,
            "count_dropped": int(dropped),
            "cap_abs": cap_abs,
            "winsor_pct": winsor_pct
        }])

    mean_tce = float(tces.mean())

    if n > 1:
        std_tce = float(tces.std(ddof=1))
        se = std_tce / np.sqrt(n) if std_tce > 0 else 0.0
        dof = n - 1

        if se > 0:
            tcrit = float(stats.t.ppf((1 + confidence) / 2.0, dof))
            moe = tcrit * se
            ci_lo = mean_tce - moe
            ci_hi = mean_tce + moe
            t_stat = mean_tce / se
            # survival function avoids the precision loss of 1 - cdf for large |t|
            p_val = float(2 * stats.t.sf(abs(t_stat), dof))
            cohens_d = mean_tce / std_tce if std_tce > 0 else None
        else:
            tcrit = moe = ci_lo = ci_hi = t_stat = p_val = cohens_d = None
    else:
        std_tce = None
        se = None
        dof = None
        tcrit = moe = ci_lo = ci_hi = t_stat = p_val = cohens_d = None

    return pd.DataFrame([{
        "confidence": confidence,
        "tce_col": tce_col,
        "n": n,
        "dof": dof,
        "mean_tce": mean_tce,
        "std_tce": std_tce,
        "se": se,
        "t_critical": tcrit,
        "margin_of_error": moe,
        "ci_lower": ci_lo,
        "ci_upper": ci_hi,
        "t_statistic": t_stat,
        "p_value_two_sided": p_val,
        "cohens_d": cohens_d,
        "count_dropped": int(dropped),
        "cap_abs": cap_abs,
        "winsor_pct": winsor_pct
    }])
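
The two optional cleaning knobs, `cap_abs` and `winsor_pct`, behave differently: capping drops extreme rows, while winsorizing keeps every row and pulls the tails in. The sketch below uses made-up TCE values and applies each option separately for comparison (in `_prep_tces` the cap, when set, runs first and winsorization then acts on the surviving rows).

```python
import pandas as pd

# Made-up TCE values (%), including one extreme outlier.
tce = pd.Series([-40.0, -10.0, -5.0, 0.0, 5.0, 10.0, 600.0])

# Capping: drop rows whose absolute value exceeds the cap.
cap_abs = 500
capped = tce[tce.abs() <= cap_abs]

# Winsorizing: keep every row, but clip values to the chosen quantiles.
winsor_pct = 0.10
lo = tce.quantile(winsor_pct)
hi = tce.quantile(1 - winsor_pct)
winsorized = tce.clip(lower=lo, upper=hi)

print("capped n:", len(capped), "mean:", capped.mean())
print("winsorized n:", len(winsorized), "mean:", winsorized.mean())
```

Capping changes `n` (and therefore the degrees of freedom), whereas winsorizing leaves `n` unchanged but shrinks the standard deviation.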









In [35]:
# 4.0) Normalize to objectid rows (works for either of your location formats)
norm_rows = explode_locations_to_objectids(aadt_locations)  # or sr_605_d7_tc_aadt_locations

# 4.1) Build comparison
cmp_df = build_aadt_comparison_df(
    aadt_locations=norm_rows,
    df_tc=df_tc,
    df_stl=df_stl,
    daytype_filter="0: All Days (M-Su)",
    daypart_filter="0: All Day (12am-12am)",
    modeoftravel_filter=None,
    zonename_col="zonename",
    stl_volume_col="averagedailysegmenttraffic(stlvolume)"
)

In [36]:
# 4.2) Get the CI summary as a DataFrame
tce_summary_df = tce_confidence_interval_df(cmp_df, confidence=0.95)

In [37]:
# 4.3) Quick peek
print(tce_summary_df)

   confidence      tce_col   n  dof  mean_tce    std_tce        se  \
0        0.95  tce_percent  49   48 -3.703764  23.388965  3.341281   

   t_critical  margin_of_error   ci_lower  ci_upper  t_statistic  \
0    2.010635         6.718095 -10.421859  3.014331    -1.108486   

   p_value_two_sided  cohens_d  count_dropped cap_abs winsor_pct  
0           0.273175 -0.158355              0    None       None  


In [38]:
# 4.4) Export the CI summary to CSV
tce_summary_df.to_csv("step_4_summary.csv", index=False)

In [39]:
mean_tce, ci_lower, ci_upper, t_critical, t_statistic = tce_confidence_interval(
    cmp_df, confidence=0.95
)

print("Mean TCE:", mean_tce)
print("95% Confidence Interval:", (ci_lower, ci_upper))
print("t-test statistic:", t_statistic)
print("t-critical:", t_critical)

Mean TCE: -3.703764130718093
95% Confidence Interval: (-10.42185922932066, 3.014330967884474)
t-test statistic: -1.1084863768619073
t-critical: 2.010634757624232


### Mean TCE: -3.70
Traffic Census Error (TCE)
* A negative TCE of -3.70% means that, on average, the StreetLight AADT estimates are about 3.70% lower than the official Caltrans Traffic Census counts.
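
As a compact statement of what the code computes (the symbols are shorthand introduced for this note: $N_i$ is `non_trad_aadt` and $T_i$ is `traditional_aadt` for row $i$, and $n$ is the number of rows with a valid TCE):

$$
\mathrm{TCE}_i = 100 \times \frac{N_i - T_i}{T_i},
\qquad
\overline{\mathrm{TCE}} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{TCE}_i
$$

Rows where $T_i$ is zero or either value is missing are excluded, which matches the `np.where` guard in the TCE calculation.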

### 95% Confidence Interval (-10.42%, 3.01%)
* Based on the sample (n = 49), we are 95% confident that the true average TCE (i.e., the average percent difference between StreetLight and Traffic Census across the entire population) falls between -10.42% and +3.01%.
    * Since this interval includes zero, the true average error may be zero, meaning StreetLight might not be significantly over- or underestimating on average.
    * The interval is also wide (about 13.4 percentage points), which reflects the variability in the data (standard deviation of about 23.4 percentage points) relative to the sample size.
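
The interval can be reproduced by hand from the printed summary row. The sketch below hard-codes `n`, the mean, and the standard deviation copied from that output rather than recomputing them from `cmp_df`.

```python
import numpy as np
from scipy import stats

# Values copied from the printed summary (n, mean_tce, std_tce).
n, mean_tce, std_tce = 49, -3.703764, 23.388965
confidence = 0.95

se = std_tce / np.sqrt(n)                            # standard error of the mean
tcrit = stats.t.ppf((1 + confidence) / 2.0, n - 1)   # two-sided critical value
moe = tcrit * se                                     # margin of error
ci_lower, ci_upper = mean_tce - moe, mean_tce + moe

print(f"se={se:.3f}, t_critical={tcrit:.3f}, margin={moe:.3f}")
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")
```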

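### T-Test Statistic
* **-1.108**: The observed sample mean is about **1.108 standard errors** below zero, the value expected if StreetLight and Traffic Census agreed on average. Because its absolute value is smaller than the critical value (2.011 for 48 degrees of freedom), the result is **not significant** at the 95% level (two-sided p ≈ 0.27).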
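
The significance check can be sketched directly from the t-statistic and degrees of freedom in the printed output. The values are hard-coded here for illustration.

```python
from scipy import stats

# Values copied from the printed output above.
t_stat, dof = -1.108486, 48

tcrit = stats.t.ppf(0.975, dof)             # two-sided 95% critical value
p_value = 2 * stats.t.sf(abs(t_stat), dof)  # two-sided p-value
significant = abs(t_stat) > tcrit

print(f"|t|={abs(t_stat):.3f}, t_critical={tcrit:.3f}, p={p_value:.3f}, significant={significant}")
```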

### Summary
* On average, StreetLight data underestimates AADT by about 3.7% at this subset of locations.
* With 95% confidence, the true average error lies between about 10.4% under and 3.0% over the Traffic Census value.
* Because zero is in that range, you cannot conclude that StreetLight is underestimating; the difference is not statistically significant.


# AADT Confidence Interval - Interstate 605, District 7

## FHWA Links
* Guidelines for Obtaining AADT Estimates from Non-Traditional Sources:
    * https://www.fhwa.dot.gov/policyinformation/travel_monitoring/pubs/aadtnt/Guidelines_for_AADT_Estimates_Final.pdf

## AADT Analysis Locations
* Locations were determined based on the locations of installed and recording Traffic Operations cameras
    * for additional information contact Zhenyu Zhu with Traffic Operations

## Traffic Census Data
* https://dot.ca.gov/programs/traffic-operations/census/traffic-volumes
* Back AADT, Peak Month, and Peak Hour usually represent traffic South or West of the count location.  
* Ahead AADT, Peak Month, and Peak Hour usually represent traffic North or East of the count location. Listing of routes with their designated  

* Because the Back and Ahead counts are both included at each location in the Traffic Census Data (e.g., "IRWINDALE, ARROW HIGHWAY"), only one [OBJECTID*] per location was pulled; the northbound nodes were used for this analysis.
    * for more information see the diagram: https://traffic.onramp.dot.ca.gov/downloads/traffic/files/performance/census/Back_and_Ahead_Leg_Traffic_Count_Diagram.pdf

## StreetLight Analysis Data
* Analysis Type == Network Performance
* Segment Metrics
* 2022 was used to match currently available Traffic Census Data (as of 8/27/2025)
* pulled a variety of Day Types, but plan to look only at "All Day Types"
* pulled a variety of Day Parts, but plan to look only at "All Day Parts"


