# Complaints → Reduction, Deduplication, and Loading into Neo4j/Qdrant

This document is a reproducible summary of everything the notebook ran to:
1) reduce **complaints** embeddings with **IPCA**,  
2) select representatives with **LSH** and **MiniBatchKMeans**,  
3) consolidate consistent CSVs,  
4) upload the **representatives** to **Qdrant** (with deterministic UUID v5 ids), and  
5) **load and relate** complaints in **Neo4j**.

---

## 📁 1) Mounting Drive and base paths

```python
from google.colab import drive
drive.mount('/content/drive')
```

**Main paths**

* `BASE_DIR`: `/content/drive/MyDrive/NHTSA/embeddings/complaints_e5_mlg_instruct`
* `REDUCED_DIR`: `${BASE_DIR}/reduced_shards`
* `IPCA_PATH`: `${BASE_DIR}/ipca_model.npz` (previously trained IPCA model)

---

## 🧭 2) Reduction with IPCA (idempotent)

**Goal**: transform the raw embeddings (`embeddings_shard_XXXX.npy`) into reduced `.npy` files with a valid header, plus row-aligned metadata in `reduced_meta_XXXX.parquet`.

**Key points**

* `BLOCK_ROWS=100_000` rows per block during the transform.
* If `reduced_XXXX.npy` and `reduced_meta_XXXX.parquet` both exist and have the same number of rows → **safe skip**.
* `maybe_repair_reduced` rewrites the `.npy` header when a reduced file was saved "raw" (bytes only, no header).
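
To make the "raw" versus header-carrying distinction concrete, here is a minimal sketch on a tiny array. The file names and the 4×3 shape are illustrative assumptions, not values from the notebook; the repair step mirrors the idea behind `maybe_repair_reduced` rather than reproducing it.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

rows, cols = 4, 3  # illustrative shape (assumption)
data = np.arange(rows * cols, dtype=np.float32).reshape(rows, cols)

tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "raw.npy")      # hypothetical file names
fixed_path = os.path.join(tmp_dir, "fixed.npy")

# A "raw" dump holds only the bytes, with no .npy header,
# so np.load cannot parse it.
data.tofile(raw_path)
try:
    np.load(raw_path)
    raw_is_valid = True
except Exception:
    raw_is_valid = False

# Repair: reinterpret the bytes with a known shape, then copy them into
# a header-carrying .npy created with open_memmap.
raw = np.memmap(raw_path, mode="r", dtype=np.float32, shape=(rows, cols))
fixed = open_memmap(fixed_path, mode="w+", dtype=np.float32, shape=(rows, cols))
fixed[:] = raw
fixed.flush()
del fixed, raw

restored = np.load(fixed_path, mmap_mode="r")
```

The repair only works when the shape is known from elsewhere, which is why the notebook passes row and component hints taken from the metadata and the IPCA model.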

**Functions**

* `load_ipca_npz(path)`: rebuilds the `IncrementalPCA` object from the `.npz`.
* `list_base_shards(base_dir)`: pairs each `embeddings_shard_XXXX.npy` with its `meta_shard_XXXX.parquet`.
* `transform_to_reduced(ipca, idx, npy_path, meta_path)`: applies `ipca.transform` in blocks and saves `reduced_XXXX.npy` + `reduced_meta_XXXX.parquet`.
* `ensure_reduced_all(ipca)`: walks every base shard and leaves `REDUCED_DIR` in a consistent state.

**Run**

```python
ipca = load_ipca_npz(IPCA_PATH)
ready_ids = ensure_reduced_all(ipca)
```
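
The notebook only shows how the IPCA model is loaded. A possible counterpart for saving it could look like the sketch below; `save_ipca_npz` is a hypothetical helper (it does not appear in the notebook), and its key names are chosen to match what `load_ipca_npz` reads. The toy data and sizes are assumptions.

```python
import os
import tempfile

import numpy as np
from sklearn.decomposition import IncrementalPCA


def save_ipca_npz(ipca: IncrementalPCA, path: str) -> None:
    """Hypothetical helper: dump the fitted arrays that load_ipca_npz expects."""
    np.savez(
        path,
        components_=ipca.components_,
        mean_=ipca.mean_,
        var_=ipca.var_,
        singular_values_=ipca.singular_values_,
        n_samples_seen_=np.int64(ipca.n_samples_seen_),
        n_features_in_=np.int64(ipca.n_features_in_),
        explained_variance_=ipca.explained_variance_,
        explained_variance_ratio_=ipca.explained_variance_ratio_,
    )


# Toy data standing in for one shard of embeddings (assumption).
rng = np.random.RandomState(0)
X = rng.randn(200, 16).astype(np.float32)

ipca = IncrementalPCA(n_components=4)
for start in range(0, len(X), 50):
    ipca.partial_fit(X[start:start + 50])

npz_path = os.path.join(tempfile.mkdtemp(), "ipca_model.npz")
save_ipca_npz(ipca, npz_path)

z = np.load(npz_path, allow_pickle=False)
saved_keys = sorted(z.files)

# Without whitening, transform() is (X - mean_) @ components_.T,
# so the saved arrays are enough to reproduce the projection.
manual = (X - z["mean_"]) @ z["components_"].T
```

Storing plain arrays in `.npz` keeps `allow_pickle=False` usable on load, which is the mode the notebook's loader relies on.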

---

## 🔎 3) LSH (fast representatives, resumable by group)

**Goal**: pick one representative per similarity "bucket", working in **groups of 5 shards** (set by `GROUP_SIZE`).

**Key points**

* SimHash-like: `LSH_BANDS=20`, `LSH_ROWS_PER_BAND=12`, `LSH_SEED=42`.
* Produces `lsh_group_XXXX_reps.csv` files with the key columns `id`, `_h`, `shard_idx`, `row_in_shard`.
* `repair_group_csvs_if_needed()` repairs older CSVs that lack the locator columns (merge against the reduced metas, with join keys cast to string).
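
As an illustration of the banding idea, the sketch below hashes three toy vectors with random sign projections and counts how many bands two rows share. The helper names (`sign_signature`, `band_keys`) and the toy vectors are assumptions for this example; only the 20×12 band layout comes from the parameters above.

```python
import numpy as np


def sign_signature(X: np.ndarray, n_bits: int, seed: int = 42) -> np.ndarray:
    """One bit per random hyperplane: 1 if the projection is positive."""
    rng = np.random.RandomState(seed)
    proj = rng.randn(X.shape[1], n_bits).astype(np.float32)
    return (X @ proj > 0).astype(np.uint8)


def band_keys(signs: np.ndarray, bands: int, rows: int) -> np.ndarray:
    """Pack each band of `rows` bits into one integer key per row."""
    weights = np.uint64(1) << np.arange(rows, dtype=np.uint64)
    keys = []
    for b in range(bands):
        chunk = signs[:, b * rows:(b + 1) * rows].astype(np.uint64)
        keys.append((chunk * weights).sum(axis=1))
    return np.stack(keys, axis=1)  # shape: (n_rows, bands)


BANDS, ROWS = 20, 12
rng = np.random.RandomState(0)
base = rng.randn(32).astype(np.float32)
# Row 1 is a positive rescaling of row 0; row 2 points the opposite way.
X = np.stack([base, base * 2.0, -base])

keys = band_keys(sign_signature(X, BANDS * ROWS), BANDS, ROWS)
shared_01 = int((keys[0] == keys[1]).sum())  # bands where rows 0 and 1 collide
shared_02 = int((keys[0] == keys[2]).sum())  # bands where rows 0 and 2 collide
```

Sign hashing depends only on direction, so a positive rescaling lands in the same bucket in every band, while the opposite vector flips every bit and shares none. Two rows become candidates for the same representative as soon as they collide in at least one band.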

**Functions**

* `sign_hash_blocks(X)`, `lsh_buckets(signs)`: hashing and bucketing.
* `process_group(group_idx, shard_ids, ncomp)`: produces the group's representatives → `lsh_group_XXXX_reps.csv`.
* `run_lsh_groups(valid_ids, ncomp, group_size=5)`: iterates over the groups.
* `repair_group_csvs_if_needed()`: fills in `shard_idx/row_in_shard` when missing.
* `consolidate_groups_csv(out_path)`: writes `representantes_global.csv`.

**Run**

```python
run_lsh_groups(ready_ids, ncomp=ipca.components_.shape[0])
repair_group_csvs_if_needed()
consolidate_groups_csv(REDUCED_DIR / "representantes_global.csv")
```

---

## 🧩 4) Global deduplication with MiniBatchKMeans (two passes)

**Goal**: a **memory-safe** deduplication in two phases:

1. Streaming `partial_fit` over every `reduced_XXXX.npy`.
2. Assignment → pick the point **closest** to each cluster's centroid (the final representative).

**Parameters**

* `K_TARGET = 20000` clusters (≈ number of representatives).
* `BATCH_ROWS = 100_000` rows per block.
* `MB_SIZE = 4096`, passed as `batch_size`.
* Outputs (with `OUT_DIR = REDUCED_DIR/kmeans_global`):

  * `OUT_DIR/mbkm.pkl` and `OUT_DIR/centers.npy`
  * `REPS_CSV_PATH = .../kmeans_global/representantes_global_kmeans.csv`

**Run (summary)**

```python
# Streaming fit (idempotent when the model and centers already exist)
# Assign + nearest-to-centroid selection (saves the representatives)
```
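
The two passes can be sketched on toy data as follows. This is a simplified illustration under stated assumptions: three synthetic 2-D blobs replace the reduced shards, `K` and `BLOCK` are tiny, and the per-block winner is found with a sort instead of the notebook's row-by-row loop.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Toy stand-in for the reduced shards: three well-separated blobs (assumption).
rng = np.random.RandomState(0)
blob_centers = np.array([[0, 0], [10, 10], [-10, 10]], dtype=np.float32)
X = np.vstack([c + rng.randn(200, 2).astype(np.float32) for c in blob_centers])
rng.shuffle(X)
K, BLOCK = 3, 150

# Pass 1: streaming fit, one partial_fit call per block.
mbkm = MiniBatchKMeans(n_clusters=K, random_state=42, n_init=3)
for a in range(0, len(X), BLOCK):
    mbkm.partial_fit(X[a:a + BLOCK])
centers = mbkm.cluster_centers_.astype(np.float32)

# Pass 2: assign each block and keep the closest row per cluster.
best_dist = np.full(K, np.inf, dtype=np.float32)
best_row = np.full(K, -1, dtype=np.int64)
for a in range(0, len(X), BLOCK):
    Xb = X[a:a + BLOCK]
    labels = mbkm.predict(Xb)
    diff = Xb - centers[labels]
    d2 = np.einsum("ij,ij->i", diff, diff)  # squared distance to own center
    # Sort by (label, distance); the first row of each label run is the
    # block-local winner for that cluster.
    order = np.lexsort((d2, labels))
    sorted_labels = labels[order]
    first = np.r_[True, sorted_labels[1:] != sorted_labels[:-1]]
    for i in order[first]:
        k = int(labels[i])
        if d2[i] < best_dist[k]:
            best_dist[k] = d2[i]
            best_row[k] = a + i  # global row index of the representative
```

Only `K` candidates per block reach the Python loop, so memory stays bounded by the block size plus two arrays of length `K`.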

**Final representatives CSV (KMeans)**

* Main columns: `id`, `make`, `model`, `year`, `component`, `_h`,
  `shard_idx`, `row_in_shard`, `cluster`, `dist2`.

---

## 🕸️ 5) Neo4j: Loading complaints and relationships

**Goal**: create `(:Complaint {qid})` nodes with properties and normalized relationships.

**Important**

* Use the **same namespace** as Qdrant so that `qid` matches the point `id`.
* Uniqueness is enforced **only** on `Complaint.qid`.
* `Make/Model/Component`: **non-unique** indexes plus normalization (UPPER/TRIM), so the load does not fail on duplicates that already exist.

**Configuration**

* `CSV_REPS`: the final CSV (KMeans, or the consolidated LSH one).
* Environment variables:

  * `NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD` (e.g. Aura: `neo4j+s://...`).

**DDL executed**

```cypher
CREATE CONSTRAINT complaint_qid IF NOT EXISTS FOR (c:Complaint) REQUIRE c.qid IS UNIQUE;
CREATE INDEX make_name_idx IF NOT EXISTS FOR (m:Make) ON (m.name);
CREATE INDEX model_name_idx IF NOT EXISTS FOR (m:Model) ON (m.name);
CREATE INDEX component_name_idx IF NOT EXISTS FOR (x:Component) ON (x.name);
CREATE INDEX complaint_make IF NOT EXISTS FOR (c:Complaint) ON (c.make);
CREATE INDEX complaint_model IF NOT EXISTS FOR (c:Complaint) ON (c.model);
```

**Upsert (Cypher)**

```cypher
UNWIND $rows AS r
MERGE (c:Complaint {qid: r.qid})
  ON CREATE SET
    c.comp_id      = r.comp_id,
    c._h           = r._h,
    c.shard_idx    = r.shard_idx,
    c.row_in_shard = r.row_in_shard,
    c.cluster      = r.cluster,
    c.dist2        = r.dist2,
    c.make         = r.make,
    c.model        = r.model,
    c.year         = r.year,
    c.component    = r.component
  ON MATCH SET
    c.comp_id      = coalesce(r.comp_id, c.comp_id),
    c._h           = coalesce(r._h, c._h),
    c.shard_idx    = coalesce(r.shard_idx, c.shard_idx),
    c.row_in_shard = coalesce(r.row_in_shard, c.row_in_shard),
    c.cluster      = coalesce(r.cluster, c.cluster),
    c.dist2        = coalesce(r.dist2, c.dist2),
    c.make         = coalesce(r.make, c.make),
    c.model        = coalesce(r.model, c.model),
    c.year         = coalesce(r.year, c.year),
    c.component    = coalesce(r.component, c.component)

FOREACH (_ IN CASE WHEN r.make IS NOT NULL THEN [1] ELSE [] END |
  MERGE (m:Make {name: r.make})
  MERGE (c)-[:OF_MAKE]->(m)
)

FOREACH (_ IN CASE WHEN r.model IS NOT NULL THEN [1] ELSE [] END |
  MERGE (md:Model {name: r.model})
  MERGE (c)-[:OF_MODEL]->(md)
)

FOREACH (_ IN CASE WHEN r.component IS NOT NULL THEN [1] ELSE [] END |
  MERGE (x:Component {name: r.component})
  MERGE (c)-[:MENTIONS]->(x)
)
```

**Normalization before loading (Python)**

```python
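def _norm(s):
    # Treat None/NaN as missing (str(NaN) would otherwise become "NAN");
    # assumes pandas is imported as pd, as it is for df.
    if pd.isna(s): return None
    s = str(s).strip()
    return s.upper() if s != "" else None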

for col in ("make","model","component"):
    if col in df.columns:
        df[col] = df[col].apply(_norm)
```
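
The loading loop itself is not shown in this summary. A minimal sketch of how the rows could be cleaned and sent in batches is given below. The helper names (`clean_value`, `batched`, `upsert_rows`) are hypothetical, and the batch size of 5000 is an assumption inferred from the sample log; the `neo4j` driver import is kept inside the function so the pure helpers can run without a database.

```python
import math
from typing import Any, Dict, Iterator, List


def clean_value(v: Any) -> Any:
    """Map NaN to None so that coalesce() in the upsert treats it as missing."""
    if isinstance(v, float) and math.isnan(v):
        return None
    return v


def batched(rows: List[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def upsert_rows(uri: str, user: str, password: str, query: str,
                rows: List[Dict[str, Any]], batch_size: int = 5000) -> None:
    """Send the rows to Neo4j in batches, passing each batch as $rows."""
    from neo4j import GraphDatabase  # lazy import: needs the neo4j package

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            done = 0
            for chunk in batched(rows, batch_size):
                session.run(query, rows=chunk).consume()
                done += len(chunk)
                print(f"Upsert Neo4j: {done}/{len(rows)}")


# Offline demonstration of the pure helpers; in the notebook the rows
# would come from something like df.to_dict("records").
sample = [
    {"qid": "a", "make": "FORD", "year": float("nan")},
    {"qid": "b", "make": None, "year": 2015.0},
    {"qid": "c", "make": "KIA", "year": 2019.0},
]
clean = [{k: clean_value(v) for k, v in row.items()} for row in sample]
chunks = list(batched(clean, 2))
```

Passing a list of maps as a single `$rows` parameter is what lets the `UNWIND $rows AS r` statement above process a whole batch in one round trip.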

**Expected result**

* Logs like:

  ```
  Namespace (string): https://nhtsa.example/complaints
  Namespace (UUID)  : 82cc465c-bfae-5901-868b-5e87923a97f9
  Neo4j URI : neo4j+s://<tu-aura>.databases.neo4j.io
  Neo4j USER: neo4j
  Upsert Neo4j: 5000/20000
  ...
  ✅ Complaints cargados/actualizados en Neo4j.
  ```

---

## 🔗 6) Integration with semantic search (Qdrant → Neo4j)

**Searching complaints**

* In Qdrant: collection `nhtsa_complaints`.
* `hits[i]["id"]` = **qid** (UUID v5).
* In Neo4j: `MATCH (c:Complaint {qid: qid})`.

**Example from `ask()` (complaints route)**

```python
qids = [h["id"] for h in hits if h.get("id")]
details = cypher("""
    UNWIND $ids AS qid
    MATCH (c:Complaint {qid: qid})
    OPTIONAL MATCH (c)-[:OF_MAKE]->(mk:Make)
    OPTIONAL MATCH (c)-[:OF_MODEL]->(md:Model)
    OPTIONAL MATCH (c)-[:MENTIONS]->(x:Component)
    RETURN
      c.qid AS qid,
      c.comp_id AS comp_id,
      c.make AS make,
      c.model AS model,
      c.year AS year,
      c.component AS component,
      collect(DISTINCT mk.name) AS makes_related,
      collect(DISTINCT md.name) AS models_related,
      collect(DISTINCT x.name)  AS mentions
""", {"ids": qids})
```
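
The snippet above assumes `hits` is a list of dicts. With `client.query_points`, the Qdrant client returns a response whose `.points` are objects with `id`, `score`, and `payload` attributes, so a small adapter may be needed. The sketch below is one way to do it; `points_to_hits` and `search_complaints` are hypothetical names, and the stand-in objects only mimic the attributes used here.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def points_to_hits(points) -> List[Dict[str, Any]]:
    """Convert scored points into the dict shape that the ask() snippet reads."""
    return [
        {"id": str(p.id), "score": float(p.score), "payload": p.payload or {}}
        for p in points
    ]


def search_complaints(client, query_vector, limit: int = 10) -> List[Dict[str, Any]]:
    """`client` is expected to be a qdrant_client.QdrantClient."""
    response = client.query_points(
        collection_name="nhtsa_complaints",
        query=query_vector,
        limit=limit,
        with_payload=True,
    )
    return points_to_hits(response.points)


# Offline check with stand-in objects instead of a live Qdrant instance.
fake_points = [
    SimpleNamespace(id="00000000-0000-5000-8000-000000000001", score=0.9, payload=None),
    SimpleNamespace(id="00000000-0000-5000-8000-000000000002", score=0.7,
                    payload={"make": "FORD"}),
]
fake_client = SimpleNamespace(
    query_points=lambda **kwargs: SimpleNamespace(points=fake_points)
)
hits = search_complaints(fake_client, query_vector=[0.1, 0.2, 0.3], limit=2)
qids = [h["id"] for h in hits if h.get("id")]
```

Casting the id to `str` keeps it comparable with the string `qid` stored on the `Complaint` nodes.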

---

## ✅ 7) Quick validations (Neo4j)

**Count**

```cypher
MATCH (c:Complaint) RETURN count(c) AS complaints;
```

**Sample**

```cypher
MATCH (c:Complaint)
RETURN c.qid, c.comp_id, c.make, c.model, c.year, c.component
LIMIT 5;
```

**Relationships created**

```cypher
MATCH (c:Complaint)-[r]->(x)
RETURN type(r) AS rel, count(*) AS n
ORDER BY n DESC;
```

---

## 🧹 8) (Optional) Consolidate existing duplicates and enforce uniqueness

If you find duplicates in `Make/Model/Component`, merge them before adding unique constraints.

**Detect**

```cypher
MATCH (m:Model)
WITH toUpper(trim(m.name)) AS k, collect(m) AS nodes
WHERE k IS NOT NULL AND size(nodes) > 1
RETURN k, size(nodes) AS cnt
ORDER BY cnt DESC LIMIT 50;
```

**Consolidate (requires APOC)**

```cypher
MATCH (m:Model)
WITH toUpper(trim(m.name)) AS k, collect(m) AS nodes
WHERE k IS NOT NULL AND size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {properties:'discard', mergeRels:true}) YIELD node
RETURN node;
```

**Enforce uniqueness**

```cypher
CREATE CONSTRAINT model_name IF NOT EXISTS FOR (m:Model) REQUIRE m.name IS UNIQUE;
```

> Repeat for `Make` and `Component`.

---

## 🧯 10) Troubleshooting

* **Qdrant deprecation**: `client.search` → use `client.query_points` (same filters and payloads).
* **Neo4j Constraint Creation Failed**: **duplicate data** exists.
  Fix: delete or merge the duplicates **before** creating the unique constraint.
* **Reduced/meta misalignment**: check that `reduced_XXXX.npy` and `reduced_meta_XXXX.parquet` have the **same `n`** (row count). `maybe_repair_reduced(...)` fixes a missing `.npy` header; when the row counts differ, rerunning `transform_to_reduced` rewrites the pair.
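
A quick way to spot misaligned pairs is to compare the row count in each `.npy` header with the metadata length. The sketch below is an assumption-laden helper, not notebook code: `find_misaligned` takes the metadata row count as a plain integer (in the notebook it would be `len(pd.read_parquet(meta_path))`), and the temporary files are only there to make the example self-contained.

```python
import os
import tempfile
from typing import Iterable, List, Tuple

import numpy as np


def npy_rows(path: str) -> int:
    """Row count read through a memory map, without loading the data."""
    return int(np.load(path, mmap_mode="r").shape[0])


def find_misaligned(
    pairs: Iterable[Tuple[int, str, int]]
) -> List[Tuple[int, int, int]]:
    """pairs: (shard_idx, npy_path, n_meta_rows). Returns the mismatches."""
    bad = []
    for idx, path, n_meta in pairs:
        n_npy = npy_rows(path)
        if n_npy != n_meta:
            bad.append((idx, n_npy, n_meta))
    return bad


# Self-contained demo with two tiny files (illustrative names and sizes).
tmp_dir = tempfile.mkdtemp()
p0 = os.path.join(tmp_dir, "reduced_0000.npy")
p1 = os.path.join(tmp_dir, "reduced_0001.npy")
np.save(p0, np.zeros((4, 2), dtype=np.float32))
np.save(p1, np.zeros((5, 2), dtype=np.float32))

# Shard 0 matches its metadata (4 rows); shard 1 does not (5 vs 4).
misaligned = find_misaligned([(0, p0, 4), (1, p1, 4)])
```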

---

## 📦 11) Key variables and conventions

* **UUID v5 namespace** (critical):
  `NAMESPACE_STR = "https://nhtsa.example/complaints"`
* **Deterministic point ID**:
  `qid = uuid5(NAMESPACE, f"rep_{shard_idx:04d}_{row_in_shard:06d}")`
* **Qdrant collection**: `nhtsa_complaints`
* **Labels and relationships (Neo4j)**:

  * `(:Complaint {qid})`
  * `(:Make {name})`, `(:Model {name})`, `(:Component {name})`
  * `[:OF_MAKE]`, `[:OF_MODEL]`, `[:MENTIONS]`
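
The deterministic id convention can be sketched with the standard `uuid` module. How the notebook turns `NAMESPACE_STR` into the namespace UUID is not shown in this summary, so deriving it with `uuid.NAMESPACE_URL` below is an assumption, and `make_qid` is a hypothetical helper name.

```python
import uuid

NAMESPACE_STR = "https://nhtsa.example/complaints"
# Assumption: the namespace UUID is derived from the string under the
# standard URL namespace; the notebook may derive it differently.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, NAMESPACE_STR)


def make_qid(shard_idx: int, row_in_shard: int) -> str:
    """Deterministic id shared by the Qdrant point and the Neo4j node."""
    name = f"rep_{shard_idx:04d}_{row_in_shard:06d}"
    return str(uuid.uuid5(NAMESPACE, name))


qid_a = make_qid(3, 42)
qid_b = make_qid(3, 42)   # same locators, same id on every run
qid_c = make_qid(3, 43)   # a different row yields a different id
```

Because the id depends only on the namespace and the locators, both stores must use the same namespace and the same `shard_idx`/`row_in_shard` values for a given representative.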

---

### ✅ Final status

* IPCA reduction → OK
* LSH by groups, consolidated → OK
* Global KMeans dedup → OK (representatives saved)
* Qdrant (representatives with UUID v5) → **Uploaded**
* Neo4j (batched upsert with normalization) → **Loaded/Updated**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# ===============================================
# Complaints → reduced IPCA + resumable LSH (CPU)
# - Reuses the saved IPCA
# - Saves reduced .npy files with a valid header
# - LSH in batches of 5 shards
# - Optional repair of old CSVs (consistent dtypes)
# - Final consolidation with the key columns
# ===============================================
import os, gc, re
from pathlib import Path
from typing import List, Tuple, Dict

import numpy as np
import pandas as pd
from sklearn.decomposition import IncrementalPCA
from tqdm import tqdm

# -------------------- Config --------------------
BASE_DIR    = Path("/content/drive/MyDrive/NHTSA/embeddings/complaints_e5_mlg_instruct")
REDUCED_DIR = BASE_DIR / "reduced_shards"
IPCA_PATH   = BASE_DIR / "ipca_model.npz"   # previously trained model
BLOCK_ROWS  = 100_000                      # block size for transform()
GROUP_SIZE  = 5                            # LSH in batches of 5 shards (≈125k rows)

# LSH parameters (SimHash-like)
LSH_BANDS          = 20
LSH_ROWS_PER_BAND  = 12
LSH_SEED           = 42

os.makedirs(REDUCED_DIR, exist_ok=True)

# -------------------- IPCA utils --------------------
def load_ipca_npz(path: Path) -> IncrementalPCA:
    if not path.exists():
        raise FileNotFoundError(f"No existe el modelo IPCA: {path}")
    z = np.load(path, allow_pickle=False)
    n_comp = z["components_"].shape[0]
    ipca = IncrementalPCA(n_components=n_comp)
    ipca.components_       = z["components_"]
    ipca.mean_             = z["mean_"]
    ipca.var_              = z["var_"]
    ipca.singular_values_  = z["singular_values_"]
    ipca.n_samples_seen_   = int(z["n_samples_seen_"])
    if "n_features_in_" in z and z["n_features_in_"] is not None:
        ipca.n_features_in_ = int(z["n_features_in_"])
    # Derived attributes (for compatibility)
    if "explained_variance_" in z:
        ipca.explained_variance_ = z["explained_variance_"]
    else:
        ipca.explained_variance_ = (ipca.singular_values_ ** 2) / max(ipca.n_samples_seen_ - 1, 1)
    if "explained_variance_ratio_" in z:
        ipca.explained_variance_ratio_ = z["explained_variance_ratio_"]
    else:
        total_var = np.sum(getattr(ipca, "var_", ipca.explained_variance_))
        ipca.explained_variance_ratio_ = (
            np.zeros_like(ipca.explained_variance_) if total_var <= 0
            else ipca.explained_variance_ / total_var
        )
    return ipca

# -------------------- Base shards --------------------
def list_base_shards(base_dir: Path) -> List[Tuple[int, Path, Path]]:
    npys  = sorted(base_dir.glob("embeddings_shard_*.npy"))
    metas = sorted(base_dir.glob("meta_shard_*.parquet"))
    idx_npy  = {int(re.findall(r'(\d{4})', p.stem)[0]): p for p in npys if re.findall(r'(\d{4})', p.stem)}
    idx_meta = {int(re.findall(r'(\d{4})', p.stem)[0]): p for p in metas if re.findall(r'(\d{4})', p.stem)}
    pairs = []
    for idx in sorted(set(idx_npy) & set(idx_meta)):
        pairs.append((idx, idx_npy[idx], idx_meta[idx]))
    return pairs

# -------------------- Writing .npy with a header --------------------
def open_npy_memmap(path: Path, shape, dtype=np.float32):
    """Create a .npy file with a valid header using open_memmap."""
    from numpy.lib.format import open_memmap
    return open_memmap(str(path), mode="w+", dtype=dtype, shape=shape)

# -------------------- IPCA reduction (idempotent) --------------------
def transform_to_reduced(ipca: IncrementalPCA, idx: int, npy_path: Path, meta_path: Path):
    out_npy  = REDUCED_DIR / f"reduced_{idx:04d}.npy"
    out_meta = REDUCED_DIR / f"reduced_meta_{idx:04d}.parquet"

    # Skip if both files already exist and are consistent
    if out_npy.exists() and out_meta.exists():
        try:
            X_test = np.load(out_npy, mmap_mode="r")
            n_npy  = int(X_test.shape[0])
            n_meta = len(pd.read_parquet(out_meta))
            if n_npy == n_meta and n_npy > 0:
                print(f"↪︎ Reducido {idx:04d} OK. SKIP.")
                return
        except Exception:
            pass  # rewrite if anything is wrong

    # Open the source (may be large)
    try:
        Xsrc = np.load(npy_path, mmap_mode="r")
        n_src, d_src = int(Xsrc.shape[0]), int(Xsrc.shape[1])
    except Exception as e:
        print(f"[skip] base {idx:04d} ilegible: {e}")
        return

    try:
        meta = pd.read_parquet(meta_path)
    except Exception as e:
        print(f"[skip] meta {idx:04d} ilegible: {e}")
        return

    n = min(n_src, len(meta))
    if n == 0:
        print(f"[skip] {idx:04d} n=0")
        return
    if n != n_src or n != len(meta):
        print(f"[warn] {idx:04d} filas npy={n_src} vs meta={len(meta)} → n={n}")

    # Create the destination .npy WITH a header
    n_comp = int(ipca.components_.shape[0])
    tmp_npy = out_npy.with_suffix(".npy.tmp")
    fp = open_npy_memmap(tmp_npy, shape=(n, n_comp), dtype=np.float32)

    wrote = 0
    for i in tqdm(range(0, n, BLOCK_ROWS), desc=f"IPCA {idx:04d}", leave=False):
        j = min(n, i + BLOCK_ROWS)
        # asarray avoids a copy when possible and converts otherwise
        # (np.array(..., copy=False) raises in NumPy 2 if a copy is needed)
        X = np.asarray(Xsrc[i:j], dtype=np.float32)
        Z = ipca.transform(X)
        fp[wrote:wrote + (j - i), :] = Z.astype(np.float32, copy=False)
        wrote += (j - i)
        del X, Z
        gc.collect()

    # Flush, close, and move atomically.
    fp.flush()
    del fp
    os.replace(tmp_npy, out_npy)
    meta.iloc[:n].reset_index(drop=True).to_parquet(out_meta, index=False)
    print(f"✅ Reducido {idx:04d}: {n} filas → {out_npy.name}")

def ensure_reduced_all(ipca: IncrementalPCA) -> List[int]:
    pairs = list_base_shards(BASE_DIR)
    print(f"Shards base detectados: {len(pairs)}")
    ready = []
    for idx, pX, pM in pairs:
        transform_to_reduced(ipca, idx, pX, pM)
        if (REDUCED_DIR / f"reduced_{idx:04d}.npy").exists() and (REDUCED_DIR / f"reduced_meta_{idx:04d}.parquet").exists():
            ready.append(idx)
    return sorted(ready)

# -------------------- Repairing "raw" reduced files --------------------
def maybe_repair_reduced(idx: int, nrows_hint: int, ncomp_hint: int) -> bool:
    """
    If reduced_XXXX.npy has no .npy header (it was written 'raw'),
    rewrite it as a valid .npy. Returns True if repaired or already valid.
    """
    p = REDUCED_DIR / f"reduced_{idx:04d}.npy"
    if not p.exists():
        return False
    # Try np.load first
    try:
        _ = np.load(p, mmap_mode="r")
        return True  # already valid
    except Exception:
        pass

    # Try opening as a raw memmap using the hinted shape
    try:
        raw = np.memmap(p, mode="r", dtype=np.float32, shape=(nrows_hint, ncomp_hint))
        tmp = p.with_suffix(".npy.fix")
        fp = open_npy_memmap(tmp, shape=(nrows_hint, ncomp_hint), dtype=np.float32)
        fp[:] = raw  # copy the data into the header-carrying file
        fp.flush()
        del fp, raw
        os.replace(tmp, p)
        print(f"🔧 Reparado reduced {idx:04d} → cabecera .npy escrita.")
        return True
    except Exception as e:
        print(f"[repair fail] reduced {idx:04d}: {e}")
        return False

# -------------------- LSH helpers --------------------
def sign_hash_blocks(X: np.ndarray, n_bits=128, seed=LSH_SEED):
    rng = np.random.RandomState(seed)
    proj = rng.randn(X.shape[1], n_bits).astype(np.float32)
    return (X @ proj > 0).astype(np.uint8)

def lsh_buckets(signs: np.ndarray, bands=LSH_BANDS, rows=LSH_ROWS_PER_BAND) -> Dict[Tuple[int,int], List[int]]:
    assert bands * rows <= signs.shape[1], "bands*rows excede la firma"
    buckets = {}
    pos = 0
    for b in range(bands):
        slice_bits = signs[:, pos:pos+rows]
        if rows <= 60:
            vals = (slice_bits * (1 << np.arange(rows, dtype=np.uint64))).sum(axis=1).astype(np.uint64)
        else:
            vals = np.array([hash(tuple(r)) for r in slice_bits], dtype=np.int64)
        for i, v in enumerate(vals.tolist()):
            key = (b, int(v))
            buckets.setdefault(key, []).append(i)
        pos += rows
    return buckets

def load_reduced_flex(idx: int, nrows: int, ncomp: int) -> np.ndarray:
    """
    Load reduced_XXXX.npy:
    - try np.load (valid header)
    - if that fails, open as a raw memmap with shape=(nrows, ncomp)
    """
    p = REDUCED_DIR / f"reduced_{idx:04d}.npy"
    if not p.exists():
        raise FileNotFoundError(p)
    try:
        return np.load(p, mmap_mode="r")
    except Exception:
        return np.memmap(p, mode="r", dtype=np.float32, shape=(nrows, ncomp))

def process_group(group_idx: int, shard_ids: List[int], ncomp: int):
    out_csv = REDUCED_DIR / f"lsh_group_{group_idx:04d}_reps.csv"
    if out_csv.exists():
        print(f"↪︎ Grupo {group_idx:04d} ya procesado. SKIP.")
        return

    X_list, meta_list = [], []
    total = 0
    for sid in shard_ids:
        pX = REDUCED_DIR / f"reduced_{sid:04d}.npy"
        pM = REDUCED_DIR / f"reduced_meta_{sid:04d}.parquet"
        if not (pX.exists() and pM.exists()):
            print(f"[skip grupo] falta reducido/meta {sid:04d}")
            continue
        try:
            meta = pd.read_parquet(pM)
            nrows = len(meta)
            if nrows == 0:
                continue
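            # repair the header if needed
            maybe_repair_reduced(sid, nrows, ncomp)
            X = load_reduced_flex(sid, nrows, ncomp)
            n_eff = min(nrows, int(X.shape[0]))
            X_list.append(np.asarray(X[:n_eff], dtype=np.float32))
            # Normalize meta dtypes (id and _h as strings)
            meta = meta.iloc[:n_eff].copy()
            # Record the true locators per shard here; they cannot be
            # recovered once the group's shards are concatenated.
            if "shard_idx" not in meta.columns:
                meta["shard_idx"] = int(sid)
            if "row_in_shard" not in meta.columns:
                meta["row_in_shard"] = np.arange(n_eff, dtype=np.int64)
            if "id" in meta.columns: meta["id"] = meta["id"].astype("string")
            if "_h" in meta.columns:  meta["_h"] = meta["_h"].astype("string")
            meta_list.append(meta)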
            total += n_eff
        except Exception as e:
            print(f"[skip grupo] {sid:04d} no legible: {e}")

    if total == 0:
        print(f"[skip grupo] {group_idx:04d} sin datos útiles.")
        return

    X = np.vstack(X_list)
    meta = pd.concat(meta_list, ignore_index=True)
    assert X.shape[0] == len(meta)

    signs = sign_hash_blocks(X, n_bits=max(LSH_BANDS*LSH_ROWS_PER_BAND, 128), seed=LSH_SEED)
    buckets = lsh_buckets(signs, bands=LSH_BANDS, rows=LSH_ROWS_PER_BAND)

    seen = np.zeros(len(meta), dtype=bool)
    reps = []
    for _, idxs in buckets.items():
        for i in idxs:
            if not seen[i]:
                reps.append(i)
                for j in idxs:
                    seen[j] = True
                break
    reps = sorted(set(reps))

    # Ensure the key columns: (id, _h) + locators
    out = meta.iloc[reps].reset_index(drop=True).copy()
    if "row_in_shard" not in out.columns:
        # Fallback only: this is the position among the representatives,
        # not the original row inside the shard.
        out["row_in_shard"] = out.index.astype("int64")
    if "shard_idx" not in out.columns:
        # Fallback: assign the group's first shard (better than nothing).
        # Note: it is preferable that reduced_meta_X already carries shard_idx.
        out["shard_idx"] = int(shard_ids[0])
    # consistent dtypes
    if "id" in out.columns: out["id"] = out["id"].astype("string")
    if "_h" in out.columns: out["_h"] = out["_h"].astype("string")

    out.to_csv(out_csv, index=False)
    print(f"✅ Grupo {group_idx:04d}: {len(reps)}/{len(meta)} reps → {out_csv.name}")

def run_lsh_groups(valid_ids: List[int], ncomp: int, group_size=GROUP_SIZE):
    ids = sorted(valid_ids)
    groups = [ids[i:i+group_size] for i in range(0, len(ids), group_size)]
    for gidx, g in enumerate(groups):
        process_group(gidx, g, ncomp)

# --------- Repairing old group CSVs (optional) ----------
def repair_group_csvs_if_needed():
    """
    If any lsh_group_*_reps.csv exists without (shard_idx, row_in_shard),
    try to rebuild them by joining with reduced_meta_XXXX.parquet on ('id','_h').
    Forces the dtypes to string to avoid int64 vs object merge errors.
    """
    csvs = sorted(REDUCED_DIR.glob("lsh_group_*_reps.csv"))
    if not csvs:
        return

    # Load the metas and normalize dtypes
    metas = []
    for p in sorted(REDUCED_DIR.glob("reduced_meta_*.parquet")):
        try:
            sid = int(re.findall(r'(\d{4})', p.stem)[0])
            m = pd.read_parquet(p)
            if "row_in_shard" not in m.columns:
                m = m.reset_index(drop=True)
                m["row_in_shard"] = np.arange(len(m), dtype=np.int64)
            m["shard_idx"] = sid
            # normalize the join keys to string
            if "id" in m.columns:
                m["id"] = m["id"].astype("string")
            if "_h" in m.columns:
                m["_h"] = m["_h"].astype("string")
            metas.append(m[["id","_h","shard_idx","row_in_shard"]].copy() if {"id","_h"}.issubset(m.columns) else None)
        except Exception:
            continue
    metas = [m for m in metas if m is not None]
    meta_all = pd.concat(metas, ignore_index=True) if metas else pd.DataFrame()

    for p in csvs:
        try:
            df = pd.read_csv(p)
        except Exception:
            continue
        need_fix = not {"shard_idx","row_in_shard"}.issubset(df.columns)
        if not need_fix:
            continue
        if meta_all.empty or not {"id","_h"}.issubset(df.columns):
            print(f"⚠️ No pude reparar {p.name}: faltan claves de cruce (id,_h) o metas vacías.")
            continue

        # Force string dtypes before the merge (avoids int64 vs object errors)
        df["id"] = df["id"].astype("string")
        df["_h"] = df["_h"].astype("string")

        df2 = df.merge(meta_all, on=["id","_h"], how="left")
        if {"shard_idx","row_in_shard"}.issubset(df2.columns) and df2["shard_idx"].notna().any():
            df2.to_csv(p, index=False)
            print(f"🔧 Reparado {p.name} añadiendo (shard_idx,row_in_shard).")
        else:
            print(f"⚠️ Reparación fallida para {p.name}: no se pudieron inferir los localizadores.")

# --------- Final consolidation with consistent dtypes ----------
def consolidate_groups_csv(out_path: Path):
    outs = sorted(REDUCED_DIR.glob("lsh_group_*_reps.csv"))
    if not outs:
        print("No hay CSV de grupos para consolidar.")
        return None

    dfs = []
    for p in outs:
        dfp = pd.read_csv(
            p,
            dtype={"id": "string", "_h": "string", "shard_idx": "Int64", "row_in_shard": "Int64"},
            keep_default_na=False
        )
        dfs.append(dfp)

    reps = pd.concat(dfs, ignore_index=True)

    # Order the columns (those that exist)
    pref = [c for c in ["id","make","model","year","component","_h"] if c in reps.columns]
    tail = [c for c in ["shard_idx","row_in_shard"] if c in reps.columns]
    cols = pref + [c for c in reps.columns if c not in pref + tail] + tail
    reps = reps[cols]

    reps.to_csv(out_path, index=False)
    print(f"📦 Consolidado: {len(reps):,} representantes → {out_path}")
    return reps

# -------------------- MAIN --------------------
if __name__ == "__main__":
    ipca = load_ipca_npz(IPCA_PATH)
    ncomp = int(ipca.components_.shape[0])

    print("== A) Transformación IPCA → reducidos (idempotente) ==")
    ready_ids = ensure_reduced_all(ipca)

    print("== B) LSH por lotes de 5 shards reducidos (reanuda por grupo) ==")
    run_lsh_groups(ready_ids, ncomp)

    # Try to repair old group CSVs if needed (dtype-safe)
    repair_group_csvs_if_needed()

    # Global consolidation, ensuring key columns and consistent dtypes
    consolidate_groups_csv(REDUCED_DIR / "representantes_global.csv")


== A) Transformación IPCA → reducidos (idempotente) ==
Shards base detectados: 21
↪︎ Reducido 0000 OK. SKIP.
↪︎ Reducido 0001 OK. SKIP.
↪︎ Reducido 0002 OK. SKIP.
↪︎ Reducido 0003 OK. SKIP.
↪︎ Reducido 0004 OK. SKIP.
↪︎ Reducido 0005 OK. SKIP.
↪︎ Reducido 0006 OK. SKIP.
↪︎ Reducido 0007 OK. SKIP.
↪︎ Reducido 0008 OK. SKIP.
↪︎ Reducido 0009 OK. SKIP.
↪︎ Reducido 0010 OK. SKIP.
↪︎ Reducido 0011 OK. SKIP.
↪︎ Reducido 0012 OK. SKIP.
↪︎ Reducido 0013 OK. SKIP.
↪︎ Reducido 0014 OK. SKIP.
↪︎ Reducido 0015 OK. SKIP.
↪︎ Reducido 0016 OK. SKIP.
↪︎ Reducido 0017 OK. SKIP.
↪︎ Reducido 0018 OK. SKIP.
↪︎ Reducido 0019 OK. SKIP.
↪︎ Reducido 0020 OK. SKIP.
== B) LSH por lotes de 5 shards reducidos (reanuda por grupo) ==
↪︎ Grupo 0000 ya procesado. SKIP.
↪︎ Grupo 0001 ya procesado. SKIP.
↪︎ Grupo 0002 ya procesado. SKIP.
↪︎ Grupo 0003 ya procesado. SKIP.
↪︎ Grupo 0004 ya procesado. SKIP.
🔧 Reparado lsh_group_0005_reps.csv añadiendo (shard_idx,row_in_shard).
🔧 Reparado lsh_group_0006_reps.csv añadiendo 

In [None]:
# ===============================================
# GLOBAL DEDUP (memory-safe) with two-pass MiniBatchKMeans
# - Pass 1: partial_fit over reduced_XXXX.npy (streaming)
# - Pass 2: assignment and selection of the closest point per cluster
# - Output: representantes_global_kmeans.csv
# ===============================================
import os, gc, re, pickle, math
from pathlib import Path
from typing import List, Tuple

import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.cluster import MiniBatchKMeans

# -------------------- Config --------------------
BASE_DIR    = Path("/content/drive/MyDrive/NHTSA/embeddings/complaints_e5_mlg_instruct")
REDUCED_DIR = BASE_DIR / "reduced_shards"
OUT_DIR     = REDUCED_DIR / "kmeans_global"
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Sizes (adjust to your RAM and target)
K_TARGET      = 20000        # number of clusters (≈ number of final representatives)
BATCH_ROWS    = 100_000      # rows per block when fitting/assigning
MB_SIZE       = 4096         # batch_size for MiniBatchKMeans
MAX_SHARDS    = None         # None = all; or cap at e.g. 20 for testing

MODEL_PATH    = OUT_DIR / "mbkm.pkl"
CENTERS_PATH  = OUT_DIR / "centers.npy"
REPS_CSV_PATH = OUT_DIR / "representantes_global_kmeans.csv"

# "Meta" columns expected in reduced_meta_XXXX.parquet (uses whichever exist)
META_PREF = ["id","make","model","year","component","_h"]

# -------------------- Utilities --------------------
def list_reduced_pairs(reduced_dir: Path) -> List[Tuple[int, Path, Path]]:
    npys  = sorted(reduced_dir.glob("reduced_*.npy"))
    metas = sorted(reduced_dir.glob("reduced_meta_*.parquet"))
    idx_npy  = {int(re.findall(r'(\d{4})', p.stem)[0]): p for p in npys  if re.findall(r'(\d{4})', p.stem)}
    idx_meta = {int(re.findall(r'(\d{4})', p.stem)[0]): p for p in metas if re.findall(r'(\d{4})', p.stem)}
    ids = sorted(set(idx_npy) & set(idx_meta))
    pairs = [(i, idx_npy[i], idx_meta[i]) for i in ids]
    return pairs

def iterate_blocks(X_mem, start=0, stop=None, block=BATCH_ROWS):
    n = X_mem.shape[0]
    a = start
    b = n if stop is None else min(stop, n)
    while a < b:
        c = min(a + block, b)
        yield a, c
        a = c

def ensure_float32(arr):
    return np.asarray(arr, dtype=np.float32, order="C")

# -------------------- Pass 1: FIT (partial_fit) --------------------
pairs = list_reduced_pairs(REDUCED_DIR)
if MAX_SHARDS is not None:
    pairs = pairs[:MAX_SHARDS]

if not pairs:
    raise RuntimeError(f"No se encontraron reducidos en {REDUCED_DIR}")

# Infer the dimension d
first_npy = np.load(pairs[0][1], mmap_mode="r")
d = int(first_npy.shape[1])
del first_npy

if MODEL_PATH.exists() and CENTERS_PATH.exists():
    # Reuse the model and centroids if they already exist (idempotent)
    with open(MODEL_PATH, "rb") as f:
        mbkm = pickle.load(f)
    centers = np.load(CENTERS_PATH)
    print(f"↪︎ Modelo MBKM reutilizado: K={mbkm.n_clusters}, d={d}")
else:
    print(f"== Paso 1/2: Entrenando MiniBatchKMeans (streaming) ==")
    mbkm = MiniBatchKMeans(
        n_clusters=K_TARGET,
        init="k-means++",
        max_iter=100,          # only used by fit(); partial_fit ignores it
        batch_size=MB_SIZE,    # partial_fit updates on the whole block it receives
        reassignment_ratio=0.01,
        random_state=42,
        verbose=0
    )
    # Incremental training
    total_rows = 0
    for sid, pX, pM in pairs:
        try:
            X_mem = np.load(pX, mmap_mode="r")
            n = X_mem.shape[0]
            if n == 0:
                continue
            for a, c in tqdm(iterate_blocks(X_mem, 0, n), desc=f"FIT {sid:04d}", leave=False):
                batch = ensure_float32(X_mem[a:c])
                mbkm.partial_fit(batch)
                total_rows += (c - a)
                del batch
            del X_mem
            gc.collect()
        except Exception as e:
            print(f"[skip] {sid:04d} durante fit: {e}")

    centers = mbkm.cluster_centers_.astype(np.float32, copy=False)
    with open(MODEL_PATH, "wb") as f:
        pickle.dump(mbkm, f)
    np.save(CENTERS_PATH, centers)
    print(f"✅ Fit completo sobre ~{total_rows:,} filas. Guardado modelo+centros.")

# -------------------- Pass 2: ASSIGN + REPRESENTATIVES --------------------
# Structures holding the best (minimum-distance) point per cluster
best_dist   = np.full((K_TARGET,), np.inf, dtype=np.float32)
best_record = [None] * K_TARGET   # will hold a dict with meta + locators

def update_best(cluster_idx: int, dist: np.float32, meta_row: dict):
    if dist < best_dist[cluster_idx]:
        best_dist[cluster_idx] = float(dist)
        best_record[cluster_idx] = meta_row

print("== Paso 2/2: Asignando y seleccionando representantes ==")
for sid, pX, pM in pairs:
    # load the meta (only the available columns)
    try:
        meta = pd.read_parquet(pM)
    except Exception as e:
        print(f"[skip] {sid:04d} meta ilegible: {e}")
        continue
    keep_cols = [c for c in META_PREF if c in meta.columns]
    meta = meta[keep_cols].copy()
    # local row index, in case it does not exist
    meta = meta.reset_index(drop=True)
    meta["row_in_shard"] = np.arange(len(meta), dtype=np.int64)
    meta["shard_idx"] = sid

    try:
        X_mem = np.load(pX, mmap_mode="r")
        n = X_mem.shape[0]
        if n == 0:
            continue
        for a, c in tqdm(iterate_blocks(X_mem, 0, n), desc=f"ASSIGN {sid:04d}", leave=False):
            Xb = ensure_float32(X_mem[a:c])       # [B,d]
            # predicted labels, then squared distance to the assigned center
            labels = mbkm.predict(Xb)              # [B]
            # d^2 = ||x - mu||^2, computed only against each point's own center
            mu = centers[labels]                   # [B,d]
            d2 = np.einsum("ij,ij->i", Xb - mu, Xb - mu, optimize=True)  # [B]
            # update the per-cluster best; build the meta dict only on improvement
            for i in range(len(labels)):
                clu = int(labels[i])
                dist = float(d2[i])
                if dist >= best_dist[clu]:
                    continue  # not an improvement, skip the costly row lookup
                mr = meta.iloc[a + i].to_dict()
                mr["cluster"] = clu
                mr["dist2"]   = dist
                update_best(clu, dist, mr)
            del Xb, mu, labels, d2
        del X_mem
        gc.collect()
    except Exception as e:
        print(f"[skip] {sid:04d} durante assign: {e}")

# Build the representatives DataFrame (one row per cluster that received data)
reps = [r for r in best_record if r is not None]
if not reps:
    raise RuntimeError("No representative could be selected. Check the logs.")

reps_df = pd.DataFrame(reps)

# Sort by cluster, then by squared distance (optional; no rows are dropped here)
reps_df = reps_df.sort_values(["cluster","dist2"]).reset_index(drop=True)

# Reorder columns: identifiers first, locators and cluster info last
front = [c for c in ["id","make","model","year","component","_h"] if c in reps_df.columns]
tail  = [c for c in ["shard_idx","row_in_shard","cluster","dist2"] if c in reps_df.columns]
cols  = front + [c for c in reps_df.columns if c not in front + tail] + tail
reps_df = reps_df[cols]

# Save
reps_df.to_csv(REPS_CSV_PATH, index=False)
print(f"📦 Representantes KMeans: {len(reps_df):,} → {REPS_CSV_PATH}")


== Paso 1/2: Entrenando MiniBatchKMeans (streaming) ==




✅ Fit completo sobre ~525,000 filas. Guardado modelo+centros.
== Paso 2/2: Asignando y seleccionando representantes ==




📦 Representantes KMeans: 20,000 → /content/drive/MyDrive/NHTSA/embeddings/complaints_e5_mlg_instruct/reduced_shards/kmeans_global/representantes_global_kmeans.csv
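
The assignment step in the cell above walks every row in a Python loop to find the point closest to each centroid. As a minimal sketch, and assuming a block's `labels` and squared distances `d2` are already available as NumPy arrays, the per-cluster winner inside one block can also be found with a sort. The helper name `nearest_per_cluster` is hypothetical and not part of the notebook; the per-block results would still need to be compared against the running `best_dist` across blocks.

```python
import numpy as np

def nearest_per_cluster(labels: np.ndarray, d2: np.ndarray, k: int):
    """Return (best_row, best_d2) per cluster; best_row is -1 for clusters with no rows."""
    # Sort rows by (label, distance): the first row of each label run is its closest point.
    order = np.lexsort((d2, labels))
    sorted_labels = labels[order]
    first = np.ones(len(order), dtype=bool)
    first[1:] = sorted_labels[1:] != sorted_labels[:-1]
    winners = order[first]

    best_row = np.full(k, -1, dtype=np.int64)
    best_d2 = np.full(k, np.inf, dtype=np.float32)
    best_row[labels[winners]] = winners
    best_d2[labels[winners]] = d2[winners]
    return best_row, best_d2

# Toy block: 5 rows, 3 clusters, cluster 1 receives no rows
labels = np.array([0, 2, 0, 2, 2])
d2 = np.array([0.9, 0.4, 0.1, 0.7, 0.2], dtype=np.float32)
rows, dists = nearest_per_cluster(labels, d2, k=3)
print(rows.tolist())  # [2, -1, 4]
```

With this shape, only the winning rows of each block need a metadata lookup, which keeps the number of `meta.iloc` calls bounded by the number of clusters per block.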


In [None]:
!pip install neo4j

Collecting neo4j
  Downloading neo4j-6.0.2-py3-none-any.whl.metadata (5.2 kB)
Downloading neo4j-6.0.2-py3-none-any.whl (325 kB)
Installing collected packages: neo4j
Successfully installed neo4j-6.0.2


In [None]:
from os import getenv
NEO4J_URI = getenv("NEO4J_URI", "")
NEO4J_USER = getenv("NEO4J_USER", "")
NEO4J_PASSWORD = getenv("NEO4J_PASSWORD", "")
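
The cell above reads the connection settings with `getenv` and falls back to empty strings, so a missing variable only surfaces later when the driver is created. A minimal sketch of a fail-fast check follows; the helper `require_env` and its `env` parameter are assumptions introduced for illustration, not part of the notebook.

```python
import os
from typing import Iterable, Mapping

def require_env(names: Iterable[str], env: Mapping[str, str] = os.environ) -> dict:
    """Return {name: value} for each variable; raise if any is missing or blank."""
    values = {n: (env.get(n) or "").strip() for n in names}
    missing = [n for n, v in values.items() if not v]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values

# Demo on a plain dict so no real credentials are touched (values are placeholders)
demo_env = {
    "NEO4J_URI": "neo4j+s://example.invalid",
    "NEO4J_USER": "neo4j",
    "NEO4J_PASSWORD": "",
}
try:
    require_env(["NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD"], env=demo_env)
except RuntimeError as err:
    print(err)  # Missing environment variables: NEO4J_PASSWORD
```

Called without `env`, the helper reads from `os.environ`, so it could replace the three separate `getenv` calls if that behavior is wanted.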

In [None]:
# ============================================================
# Complaints → Neo4j (batched upsert, with normalization)
#  - Uses UUID v5 (qid) with the SAME namespace as Qdrant
#  - Enforces uniqueness ONLY on Complaint.qid
#  - Make/Model/Component: (non-unique) indexes + normalized names
#  - Avoids creating further duplicates; existing ones can be cleaned up later and then made unique
# ============================================================

import os
import uuid
import math
import pandas as pd
from neo4j import GraphDatabase

# -------------------- CONFIG --------------------

# CSV of representatives (adjust if you use the consolidated file)
CSV_REPS = "/content/drive/MyDrive/NHTSA/embeddings/complaints_e5_mlg_instruct/reduced_shards/kmeans_global/representantes_global_kmeans.csv"
# Alternative:
# CSV_REPS = "/content/drive/MyDrive/NHTSA/embeddings/complaints_e5_mlg_instruct/reduced_shards/representantes_global.csv"

# ⚠️ IMPORTANT:
# This must be the SAME namespace used to generate the UUIDs when uploading to Qdrant.
# If that was "https://nhtsa.example/complaints", keep this string. Otherwise, change it here.
NAMESPACE_STR = "https://nhtsa.example/complaints"
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, NAMESPACE_STR)

print("Namespace (string):", NAMESPACE_STR)
print("Namespace (UUID)  :", NAMESPACE)

# Environment variables (only applied if they are not already defined in Colab)
# Replace <<<...>>> with your real values, or comment these lines out if they already exist in your environment.
os.environ.setdefault("NEO4J_URI", "neo4j+s://<<<TU_HOST_DE_AURA>>>.databases.neo4j.io")
os.environ.setdefault("NEO4J_USER", "neo4j")
os.environ.setdefault("NEO4J_PASSWORD", "<<<TU_PASSWORD_AQUI>>>")

# -------------------- UTILITIES --------------------

def make_qid(shard_idx: int, row_in_shard: int) -> str:
    """Deterministic UUID v5 built with the agreed namespace."""
    raw = f"rep_{int(shard_idx):04d}_{int(row_in_shard):06d}"
    return str(uuid.uuid5(NAMESPACE, raw))

def _norm(s):
    """Normalize a name used as a key to avoid duplicates: trim + upper; missing values become None."""
    # pd.isna also catches NaN/pd.NA coming from read_csv, which str() would otherwise turn into "NAN"
    if s is None or pd.isna(s):
        return None
    s = str(s).strip()
    return s.upper() if s != "" else None

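def _mask(s, keep=4):
    """Mask a secret for logging. Note: secrets longer than `keep` still show their first `keep` characters."""
    if not s: return "<empty>"
    return s[:keep] + "…" if len(s) > keep else "****"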

# -------------------- LOAD CSV --------------------

# Read the CSV
df = pd.read_csv(CSV_REPS, dtype={"id":"string","_h":"string"})

# Minimal validation
required_cols = {"shard_idx","row_in_shard"}
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise ValueError(f"CSV is missing required columns: {missing}. Fix it before loading into Neo4j.")

# Coerce types, drop rows without a valid locator, and build qid
df["shard_idx"]    = pd.to_numeric(df["shard_idx"], errors="coerce").astype("Int64")
df["row_in_shard"] = pd.to_numeric(df["row_in_shard"], errors="coerce").astype("Int64")
df = df[df["shard_idx"].notna() & df["row_in_shard"].notna()].copy()

df["qid"] = [make_qid(int(s), int(r)) for s, r in zip(df["shard_idx"], df["row_in_shard"])]

# Optional comp_id, taken from 'id' when that column exists
df["comp_id"] = df["id"] if "id" in df.columns else None

# Normalize text keys to avoid creating new duplicates in the graph
for col in ("make","model","component"):
    if col in df.columns:
        df[col] = df[col].apply(_norm)

# -------------------- NEO4J DRIVER --------------------

NEO4J_URI = os.getenv("NEO4J_URI", "")
NEO4J_USER = os.getenv("NEO4J_USER", "")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "")

print("Neo4j URI :", NEO4J_URI or "<empty>")
print("Neo4j USER:", NEO4J_USER or "<empty>")
print("Neo4j PASS:", _mask(NEO4J_PASSWORD))

assert NEO4J_URI.startswith(("neo4j://","neo4j+s://","neo4j+ssc://","bolt://","bolt+s://","bolt+ssc://")), \
    "NEO4J_URI must start with neo4j://, neo4j+s://, bolt://, etc."
assert NEO4J_USER and NEO4J_PASSWORD, "NEO4J_USER or NEO4J_PASSWORD is missing."

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# -------------------- DDL (constraints / indexes) --------------------
# Note: uniqueness is enforced ONLY on Complaint.qid (it will not fail on pre-existing duplicates).
#       Make/Model/Component get NON-unique indexes so the load can proceed,
#       since duplicates were reported in the existing data.

DDL = [
    "CREATE CONSTRAINT complaint_qid IF NOT EXISTS FOR (c:Complaint) REQUIRE c.qid IS UNIQUE",

    "CREATE INDEX make_name_idx IF NOT EXISTS FOR (m:Make) ON (m.name)",
    "CREATE INDEX model_name_idx IF NOT EXISTS FOR (m:Model) ON (m.name)",
    "CREATE INDEX component_name_idx IF NOT EXISTS FOR (x:Component) ON (x.name)",

    "CREATE INDEX complaint_make IF NOT EXISTS FOR (c:Complaint) ON (c.make)",
    "CREATE INDEX complaint_model IF NOT EXISTS FOR (c:Complaint) ON (c.model)"
]

with driver.session() as s:
    for stmt in DDL:
        s.run(stmt)

# -------------------- UPSERT (Cypher) --------------------
# Important: names are already normalized (upper/trim) in the DataFrame.
# To normalize in Cypher instead, you could use toUpper(trim(r.make)), etc.

CYPHER = """
UNWIND $rows AS r
MERGE (c:Complaint {qid: r.qid})
  ON CREATE SET
    c.comp_id      = r.comp_id,
    c._h           = r._h,
    c.shard_idx    = r.shard_idx,
    c.row_in_shard = r.row_in_shard,
    c.cluster      = r.cluster,
    c.dist2        = r.dist2,
    c.make         = r.make,
    c.model        = r.model,
    c.year         = r.year,
    c.component    = r.component
  ON MATCH SET
    c.comp_id      = coalesce(r.comp_id, c.comp_id),
    c._h           = coalesce(r._h, c._h),
    c.shard_idx    = coalesce(r.shard_idx, c.shard_idx),
    c.row_in_shard = coalesce(r.row_in_shard, c.row_in_shard),
    c.cluster      = coalesce(r.cluster, c.cluster),
    c.dist2        = coalesce(r.dist2, c.dist2),
    c.make         = coalesce(r.make, c.make),
    c.model        = coalesce(r.model, c.model),
    c.year         = coalesce(r.year, c.year),
    c.component    = coalesce(r.component, c.component)

FOREACH (_ IN CASE WHEN r.make IS NOT NULL THEN [1] ELSE [] END |
  MERGE (m:Make {name: r.make})
  MERGE (c)-[:OF_MAKE]->(m)
)

FOREACH (_ IN CASE WHEN r.model IS NOT NULL THEN [1] ELSE [] END |
  MERGE (md:Model {name: r.model})
  MERGE (c)-[:OF_MODEL]->(md)
)

FOREACH (_ IN CASE WHEN r.component IS NOT NULL THEN [1] ELSE [] END |
  MERGE (x:Component {name: r.component})
  MERGE (c)-[:MENTIONS]->(x)
)
"""

def to_row(rec: dict) -> dict:
    """Return only the keys the Cypher query expects, with plain Python types."""
    def opt(key, cast=None):
        # Missing values (None, NaN, pd.NA) become None so they are sent to Neo4j as null
        v = rec.get(key)
        if v is None or pd.isna(v):
            return None
        return cast(v) if cast is not None else v

    return {
        "qid":          opt("qid"),
        "comp_id":      opt("comp_id"),
        "_h":           opt("_h"),
        "shard_idx":    opt("shard_idx", int),
        "row_in_shard": opt("row_in_shard", int),
        "cluster":      opt("cluster", int),
        "dist2":        opt("dist2", float),
        "make":         opt("make"),          # already normalized (upper/trim) in the DataFrame
        "model":        opt("model"),
        "year":         opt("year", int),
        "component":    opt("component"),
    }

rows = [to_row(r) for r in df.to_dict(orient="records")]

BATCH = 5_000
with driver.session() as session:
    for i in range(0, len(rows), BATCH):
        chunk = rows[i:i+BATCH]
        session.run(CYPHER, rows=chunk)
        print(f"Upsert Neo4j: {i+len(chunk)}/{len(rows)}")

print("✅ Complaints cargados/actualizados en Neo4j.")

# -------------------- (OPTIONAL) Cleaning up existing duplicates --------------------
# If you later want to enforce UNIQUENESS on Make/Model/Component,
# you must consolidate the duplicates first. With APOC this is straightforward:
#
# // Detect duplicates:
# MATCH (m:Model)
# WITH toUpper(trim(m.name)) AS k, collect(m) AS nodes
# WHERE k IS NOT NULL AND size(nodes) > 1
# RETURN k, size(nodes) AS cnt
# ORDER BY cnt DESC LIMIT 50;
#
# // Consolidate (requires APOC):
# MATCH (m:Model)
# WITH toUpper(trim(m.name)) AS k, collect(m) AS nodes
# WHERE k IS NOT NULL AND size(nodes) > 1
# CALL apoc.refactor.mergeNodes(nodes, {properties:'discard', mergeRels:true}) YIELD node
# RETURN node;
#
# // Only then can you create the unique constraint:
# CREATE CONSTRAINT model_name IF NOT EXISTS FOR (m:Model) REQUIRE m.name IS UNIQUE;
# (Repeat for Make and Component)


Namespace (string): https://nhtsa.example/complaints
Namespace (UUID)  : 82cc465c-bfae-5901-868b-5e87923a97f9
Neo4j URI : neo4j+s://66024f48.databases.neo4j.io
Neo4j USER: neo4j
Neo4j PASS: kDp5…
Upsert Neo4j: 5000/20000
Upsert Neo4j: 10000/20000
Upsert Neo4j: 15000/20000
Upsert Neo4j: 20000/20000
✅ Complaints cargados/actualizados en Neo4j.
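
The load above relies on `qid` being reproducible: the same `(shard_idx, row_in_shard)` pair must map to the same UUID in Neo4j and in Qdrant, which only holds when both sides use the same namespace. The sketch below illustrates that property with the standard `uuid` module. It reuses the namespace string and key format from the cell above; the extra `namespace` parameter and the second namespace URL are assumptions added purely for the demonstration.

```python
import uuid

NAMESPACE_STR = "https://nhtsa.example/complaints"
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, NAMESPACE_STR)

def make_qid(shard_idx: int, row_in_shard: int, namespace: uuid.UUID = NAMESPACE) -> str:
    """UUID v5 for a representative; `namespace` is overridable only for this demo."""
    raw = f"rep_{int(shard_idx):04d}_{int(row_in_shard):06d}"
    return str(uuid.uuid5(namespace, raw))

a = make_qid(3, 42)
b = make_qid(3, 42)

# Hypothetical second namespace, standing in for a mismatched configuration
other_ns = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/other")
c = make_qid(3, 42, namespace=other_ns)

print(a == b, a == c)  # True False
```

If the namespace used for the Qdrant upload differs from `NAMESPACE_STR`, every `qid` will differ as well, and the Complaint nodes will not line up with the Qdrant point IDs.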
