## Ingestion and Graph Construction Stage for *Investigations* (NHTSA → Neo4j AuraDB)

### 1. Overall objective

The goal of this stage was to **integrate the NHTSA-ODI defect investigations dataset (Investigations)** into a graph database (Neo4j AuraDB) for later relational and semantic analysis.  
The process aimed to preserve the traceability of case files, keep the typing consistent with the entities already integrated (*Recalls* and *Complaints*), and enable complex queries over technical and temporal relationships.

---

### 2. Context and preprocessing

The original dataset (`FLAT_INV.zip`) was cleaned and standardized beforehand through a reproducible Colab pipeline:

- The official headers were enforced according to the `INV.txt` dictionary.
- Dates (`ODATE`, `CDATE`) and years (`YEAR`) were normalized.
- Text descriptions (`SUBJECT`, `SUMMARY`, `COMPNAME`) were cleaned.
- Hierarchical columns `COMP_L1`–`COMP_L5` were derived by splitting `COMPNAME`.
- Intermediate exports were written as `Parquet`, `JSONL`, and a normalized `CSV` (`investigations_neo4j_ready.csv`).

The final CSV contained the columns needed for graph ingestion:
```

[action_no, make, model, year, component, open_date, close_date, subject, summary, campaign_no, comp_l1, comp_l2, comp_l3, comp_l4, comp_l5]

```

---

### 3. Graph model design

The Neo4j data model was designed to capture the relationships among **investigations**, **vehicles**, **components**, **manufacturers**, and **recall campaigns**.  
Each investigation is represented as an `(:Investigation)` node connected through semantic and hierarchical relationships:

- **Component hierarchy** (`:Component`)
```

(Component:L5) -[:SUB_OF]-> (Component:L4) -[:SUB_OF]-> (Component:L3) ...

```
Applied dynamically according to the levels available (`COMP_L1–L5`).

- **Explicit relationships:**
  - `(Investigation)-[:MENTIONS]->(Component)`
  - `(Investigation)-[:OF_MAKE]->(Make)`
  - `(Investigation)-[:OF_MODEL]->(Model)`
  - `(Investigation)-[:RELATES_TO]->(Recall)` (direct link, created when `CAMPNO` is present and matches an existing `Recall`)

This model is **isomorphic to the one used for *Recalls***, which keeps the datasets interoperable.
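
The mapping from level columns to `SUB_OF` edges and a leaf node can be illustrated with a minimal Python sketch. The function name `hierarchy_edges` and the dict-shaped row are assumptions made for illustration and are not part of the notebook. As a simplification, the walk stops at the first empty level.

```python
from typing import Dict, List, Optional, Tuple

LEVEL_KEYS = ["comp_l1", "comp_l2", "comp_l3", "comp_l4", "comp_l5"]

def hierarchy_edges(row: Dict[str, str]) -> Tuple[List[Tuple[str, str]], Optional[str]]:
    """Return ([(child, parent), ...], leaf) for one CSV row.

    Hypothetical helper: levels are read in order and the walk stops at the
    first empty one, so only populated levels produce SUB_OF edges.
    """
    levels: List[str] = []
    for key in LEVEL_KEYS:
        value = (row.get(key) or "").strip()
        if not value:
            break
        levels.append(value)
    # Each populated level points to the level directly above it.
    edges = [(child, parent) for parent, child in zip(levels, levels[1:])]
    leaf = levels[-1] if levels else None
    return edges, leaf

example = {"comp_l1": "SEATS", "comp_l2": "FRONT ASSEMBLY", "comp_l3": "RECLINER",
           "comp_l4": "", "comp_l5": ""}
edges, leaf = hierarchy_edges(example)
print(edges)  # [('FRONT ASSEMBLY', 'SEATS'), ('RECLINER', 'FRONT ASSEMBLY')]
print(leaf)   # RECLINER
```

The leaf is the node that would receive the `MENTIONS` relationship from the investigation.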

---

### 4. Ingestion logic (Cypher)

The `CYPHER_UPSERT_INV` script was run from Python with the `neo4j` driver and loads the data **in batches of 1000 rows**; because every write goes through `MERGE`, re-running it is idempotent.  
Each `MERGE` block ensures that neither nodes nor relationships are duplicated:

#### a. Creating the main nodes
```cypher
MERGE (i:Investigation {id: row.action_no})
SET i.subject = coalesce(row.subject,''),
    i.summary = coalesce(row.summary,''),
    i.year = CASE WHEN row.year='' THEN NULL ELSE toInteger(row.year) END
```

#### b. Conditional creation of Make/Model

```cypher
FOREACH (_ IN CASE WHEN row.make <> '' THEN [1] ELSE [] END |
  MERGE (mk:Make {name: row.make})
  MERGE (i)-[:OF_MAKE]->(mk)
)
```

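#### c. Component hierarchy (no `WITH` or `MATCH` inside `FOREACH`)

Every level is merged as a node (an empty level falls back to the name of the level above it, so it resolves to the same node), while the `SUB_OF` relationship is created only when the level is populated. Cypher allows only updating clauses inside `FOREACH`, so the condition is expressed with `CASE`: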

```cypher
MERGE (c1:Component {name: row.comp_l1})
MERGE (c2:Component {name: CASE WHEN row.comp_l2 <> '' THEN row.comp_l2 ELSE row.comp_l1 END})
FOREACH (_ IN CASE WHEN row.comp_l2 <> '' THEN [1] ELSE [] END |
  MERGE (c1)<-[:SUB_OF]-(c2)
)
...
MERGE (i)-[:MENTIONS]->(leaf)
```

#### d. Linking to recall campaigns

The link to `Recall` is created **outside `FOREACH`**, because it needs a `MATCH`, which is not allowed inside `FOREACH`:

```cypher
WITH row, i
WHERE row.campaign_no <> ''
OPTIONAL MATCH (r:Recall {id: row.campaign_no})
WITH i, r
WHERE r IS NOT NULL
MERGE (i)-[:RELATES_TO]->(r);
```
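
The batching described in this section can be sketched in plain Python. The generator name `iter_batches` is an assumption introduced for illustration; the notebook itself slices a DataFrame with `iloc`.

```python
from typing import Dict, Iterator, List

def iter_batches(rows: List[Dict[str, str]], batch_size: int = 1000) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of at most `batch_size` rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Synthetic rows, only to show the batch sizes produced.
rows = [{"action_no": f"X{i:05d}"} for i in range(2500)]
sizes = [len(batch) for batch in iter_batches(rows)]
print(sizes)  # [1000, 1000, 500]
```

Each batch would be passed as the `$rows` parameter of the query, for example `session.run(CYPHER_UPSERT_INV, rows=batch)`. Since every write is a `MERGE`, re-sending a batch after a failure should not create duplicates.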

---

### 5. Execution results

* **Total records processed:** 14,340 investigations (the export holds 152,193 rows; rows that share an `action_no` are merged into a single node)
* **Nodes created:**

  * 14,340 `(:Investigation)`
  * ≈ 500 `(:Component)` (deduplicated hierarchy)
  * Makes and models detected automatically
* **Relationships generated:**

  * `MENTIONS` (investigation → component)
  * `OF_MAKE`, `OF_MODEL`
  * `SUB_OF` (hierarchy between components)
  * `RELATES_TO` (link to *Recall*)

The full ingestion ran without syntax errors, which confirms that the query respects the Cypher restriction on `FOREACH` (no `WITH` or `MATCH` inside it).
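
Because `Investigation` nodes are merged on `action_no`, the node count should equal the number of distinct action numbers in the CSV rather than the number of rows. A minimal standard-library sketch of that reconciliation follows. The helper name `count_rows_and_ids` is an assumption, and the inline sample reuses a few rows from the export preview.

```python
import csv
import io
from collections import Counter

def count_rows_and_ids(handle, id_column="action_no"):
    """Return (rows with an ID, distinct IDs) for a CSV file-like object."""
    reader = csv.DictReader(handle)
    per_id = Counter(row[id_column] for row in reader if row.get(id_column))
    return sum(per_id.values()), len(per_id)

sample = io.StringIO(
    "action_no,make,model\n"
    "AQ09001,CAPCEN,9005\n"
    "DP05005,FORD,E SERIES SUPER DUTY\n"
    "DP05005,FORD,MUSTANG GT\n"
    "AQ09001,EASTONE,9005-8K\n"
)
n_rows, n_ids = count_rows_and_ids(sample)
print(n_rows, n_ids)  # 4 2
```

Running the same function on an open handle to `investigations_neo4j_ready.csv` would give the expected number of `(:Investigation)` nodes to compare against the graph.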

---

### 6. Post-ingestion validation

Basic Cypher checks in AuraDB:

```cypher
// Entity counts (DISTINCT keeps each investigation from being counted once per matched row)
MATCH (i:Investigation)-[:MENTIONS]->(c:Component)
RETURN count(DISTINCT i) AS investigations, count(DISTINCT c) AS components;

// Hierarchy inspection
MATCH (i:Investigation)-[:MENTIONS]->(leaf:Component)
OPTIONAL MATCH path=(leaf)-[:SUB_OF*0..4]->(root:Component)
RETURN i.id, leaf.name, [n IN nodes(path) | n.name] AS hierarchy
LIMIT 5;
```

Both queries confirmed that the nodes and component hierarchies were materialized correctly.

---

### 7. Technical rationale for decisions

| Aspect                                              | Decision                         | Rationale                                              |
| --------------------------------------------------- | -------------------------------- | ------------------------------------------------------ |
| **Ingestion mode**                                  | Batches of 1000 rows             | Minimizes transaction overhead on AuraDB Free          |
| **Use of `MERGE`**                                  | On all nodes/relationships       | Avoids duplicates and allows idempotent re-runs        |
| **Avoiding `WITH` and `MATCH` inside `FOREACH`**    | Rewritten as separate blocks     | Cypher accepts only updating clauses inside `FOREACH`  |
| **Conditional component hierarchy**                 | Nested `CASE WHEN ...`           | Creates levels only when data exists                   |
| **Typed relationships (`:MENTIONS`, `:RELATES_TO`)** | Explicit semantics               | Improves the ontological modeling of the graph         |
| **Post-ingestion structural check**                 | `MATCH` + `count()` queries      | Verifies the integrity of the model                    |

---

### 8. Conclusions

Ingesting *Investigations* consolidated the vehicle-safety evidence layer in Neo4j, extending the knowledge network already established with *Complaints* and *Recalls*.
The result is a **coherent relational and semantic graph** in which each investigation is linked to its component hierarchy, make, model, and associated campaigns.

The resulting structure will support advanced analyses such as:

* detecting defect patterns by subcomponent,
* correlating investigations with recalls,
* visualizing openings and closings over time (`ODATE`, `CDATE`),
* semantic clustering based on embeddings (later phase).

---

### 9. Relevant files

| Description                   | File                                                                        |
| ----------------------------- | --------------------------------------------------------------------------- |
| Normalized dataset (input)    | `/content/drive/MyDrive/NHTSA/neo4j/exports/investigations_neo4j_ready.csv` |
| Final Cypher script           | `CYPHER_UPSERT_INV`                                                         |
| Ingestion notebook            | `NHTSA_Aura_ingest.ipynb`                                                   |
| Post-ingestion validations    | Cypher verification blocks                                                  |

---



In [1]:
# Mount Drive and install dependencies
from google.colab import drive
drive.mount('/content/drive')

!pip -q install neo4j pandas pyarrow python-dotenv

from pathlib import Path
import pandas as pd, numpy as np, re, csv, io, zipfile, json, os


Mounted at /content/drive

In [4]:
# Directories under the NHTSA master folder
BASE      = Path('/content/drive/MyDrive/NHTSA')
SRC_DIR   = BASE / 'source'
PROC_DIR  = BASE / 'processed'
EXP_DIR   = BASE / 'neo4j' / 'exports'
PROC_DIR.mkdir(parents=True, exist_ok=True)
EXP_DIR.mkdir(parents=True, exist_ok=True)

# Official ZIP that was just uploaded
ZIP_INV  = SRC_DIR / 'FLAT_INV.zip'
assert ZIP_INV.exists(), f'No existe {ZIP_INV}'


In [5]:
INV_SCHEMA = [
    "NHTSA ACTION NUMBER","MAKE","MODEL","YEAR","COMPNAME","MFR_NAME",
    "ODATE","CDATE","CAMPNO","SUBJECT","SUMMARY"
]

def read_investigations_zip_with_schema(zip_path: Path, encoding='utf-8'):
    zf = zipfile.ZipFile(zip_path, 'r')
    frames, warns, plan_stats = [], [], {'A':0,'B':0}
    for name in zf.namelist():
        if not re.search(r'\.(txt|tsv|csv)$', name, flags=re.I):
            continue
        raw = zf.read(name)
        # Plan A: tab + quotechar
        try:
            df = pd.read_csv(
                io.BytesIO(raw),
                sep='\t', engine='python', quotechar='"',
                encoding=encoding, header=None, names=INV_SCHEMA,
                on_bad_lines='error', dtype=str
            )
            plan_stats['A'] += 1
        except Exception as eA:
            warns.append(f'{name} [Plan A] {type(eA).__name__}: {eA}')
            # Plan B: tolerant parse with quoting disabled
            df = pd.read_csv(
                io.BytesIO(raw),
                sep='\t', engine='python', quoting=csv.QUOTE_NONE,
                escapechar='\\', encoding=encoding,
                header=None, names=INV_SCHEMA,
                on_bad_lines='warn', dtype=str
            )
            plan_stats['B'] += 1
        df['__SOURCE_FILE__'] = name
        frames.append(df)
    if not frames:
        raise RuntimeError('No se encontraron .txt/.tsv/.csv dentro del ZIP.')
    return pd.concat(frames, ignore_index=True), plan_stats, warns

df_raw, plans, warns = read_investigations_zip_with_schema(ZIP_INV)
print('Leídos:', len(df_raw), 'filas | archivos Plan A/B:', plans, '| warns:', len(warns))


Leídos: 153551 filas | archivos Plan A/B: {'A': 1, 'B': 0} | warns: 0


2) Canonical normalization

In [6]:
df = df_raw.copy()

# Rename columns to canonical names (only NHTSA ACTION NUMBER changes; the rest map to themselves)
canon = {
    "NHTSA ACTION NUMBER":"ACTIONNUMBER",
    "MAKE":"MAKE",
    "MODEL":"MODEL",
    "YEAR":"YEAR",
    "COMPNAME":"COMPNAME",
    "MFR_NAME":"MFR_NAME",
    "ODATE":"ODATE",
    "CDATE":"CDATE",
    "CAMPNO":"CAMPNO",
    "SUBJECT":"SUBJECT",
    "SUMMARY":"SUMMARY",
}
df.rename(columns=canon, inplace=True)

# Dates YYYYMMDD -> datetime
for col in ['ODATE','CDATE']:
    if col in df.columns:
        s = df[col].astype('string')
        eight = s.str.extract(r'(\d{8})', expand=False)
        parsed = pd.to_datetime(eight, format='%Y%m%d', errors='coerce')
        # Lenient fallback for values that did not parse as YYYYMMDD
        miss = parsed.isna()
        if miss.any():
            parsed.loc[miss] = pd.to_datetime(s[miss], errors='coerce')
        df[col] = parsed

# YEAR as an integer within 1949–2035 (9999 -> NA)
if 'YEAR' in df.columns:
    y = pd.to_numeric(df['YEAR'], errors='coerce')
    y = y.mask(y.eq(9999))
    y = y.where((y>=1949) & (y<=2035))
    df['YEAR'] = y.astype('Int64')

# Uppercase and collapse whitespace in key string fields
def norm_text(x):
    if pd.isna(x): return ''
    s = str(x).upper().strip()
    s = re.sub(r'\s+', ' ', s)
    return s

for c in ['MAKE','MODEL','COMPNAME','MFR_NAME','SUBJECT','SUMMARY','CAMPNO']:
    if c in df.columns:
        df[c] = df[c].map(norm_text)

# Text-length metrics (useful for QC)
for c in ['SUBJECT','SUMMARY','COMPNAME']:
    if c in df.columns:
        df[f'LEN_{c}'] = df[c].str.len()

print('QC rápido →',
      'ODATE cobre:', round(df['ODATE'].notna().mean()*100,2), '% |',
      'CDATE cobre:', round(df['CDATE'].notna().mean()*100,2), '% |',
      'YEAR rango:', (df['YEAR'].dropna().min(), df['YEAR'].dropna().max()))


QC rápido → ODATE cobre: 99.9 % | CDATE cobre: 50.73 % | YEAR rango: (np.int64(1965), np.int64(2026))
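
The date and year rules applied above can be restated for single values with a standard-library sketch. The function names `parse_yyyymmdd` and `clean_year` are assumptions for illustration; this is a simplified stand-in that does not reproduce the lenient pandas fallback.

```python
import re
from datetime import date, datetime
from typing import Optional

def parse_yyyymmdd(value: Optional[str]) -> Optional[date]:
    """Parse the first 8-digit run as YYYYMMDD; return None when it is not a valid date."""
    match = re.search(r"\d{8}", value or "")
    if not match:
        return None
    try:
        return datetime.strptime(match.group(0), "%Y%m%d").date()
    except ValueError:
        return None

def clean_year(value: Optional[str], lo: int = 1949, hi: int = 2035) -> Optional[int]:
    """Keep a year only if it is numeric and inside [lo, hi]; 9999 falls outside and becomes None."""
    try:
        year = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None
    return year if lo <= year <= hi else None

print(parse_yyyymmdd("20090326"))  # 2009-03-26
print(parse_yyyymmdd("20091340"))  # None
print(clean_year("2001"))          # 2001
print(clean_year("9999"))          # None
```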


3) Hierarchical decomposition of COMPNAME

In [7]:
# Safe split into at most 5 levels
parts = df['COMPNAME'].fillna('').astype(str).str.split(':', n=4, expand=True)
for i in range(5):
    col = f'COMP_L{i+1}'
    df[col] = parts[i].str.strip() if i in parts.columns else ''

# Cleanup: missing values -> ''
for col in [f'COMP_L{i}' for i in range(1,6)]:
    df[col] = df[col].fillna('')

df[['COMPNAME'] + [f'COMP_L{i}' for i in range(1,6)]].head(8)


Unnamed: 0,COMPNAME,COMP_L1,COMP_L2,COMP_L3,COMP_L4,COMP_L5
0,EXTERIOR LIGHTING,EXTERIOR LIGHTING,,,,
1,ENGINE AND ENGINE COOLING,ENGINE AND ENGINE COOLING,,,,
2,ENGINE AND ENGINE COOLING,ENGINE AND ENGINE COOLING,,,,
3,EXTERIOR LIGHTING:HEADLIGHTS,EXTERIOR LIGHTING,HEADLIGHTS,,,
4,EXTERIOR LIGHTING,EXTERIOR LIGHTING,,,,
5,SEATS:FRONT ASSEMBLY:RECLINER,SEATS,FRONT ASSEMBLY,RECLINER,,
6,CHILD SEAT,CHILD SEAT,,,,
7,ENGINE AND ENGINE COOLING:ENGINE,ENGINE AND ENGINE COOLING,ENGINE,,,


4) Persisting processed data and the Neo4j-ready export

In [8]:
# "Processed" outputs
df.to_parquet(PROC_DIR / 'investigations.parquet', index=False)
with open(PROC_DIR / 'investigations.jsonl', 'w', encoding='utf-8') as f:
    for _, row in df.iterrows():
        f.write(json.dumps({k:(None if pd.isna(v) else v) for k,v in row.to_dict().items()}, default=str) + '\n')

# Minimal CSV for Neo4j (analogous to recalls)
inv_cols = {
    'ACTIONNUMBER':'action_no',
    'MAKE':'make',
    'MODEL':'model',
    'YEAR':'year',
    'COMPNAME':'component',
    'ODATE':'open_date',
    'CDATE':'close_date',
    'SUBJECT':'subject',
    'SUMMARY':'summary',
    'CAMPNO':'campaign_no',
    'COMP_L1':'comp_l1',
    'COMP_L2':'comp_l2',
    'COMP_L3':'comp_l3',
    'COMP_L4':'comp_l4',
    'COMP_L5':'comp_l5',
}
present = [c for c in inv_cols if c in df.columns]
inv = df[present].rename(columns=inv_cols).copy()

# Minimal validity rules (no guessing): require an ID and at least COMPONENT
need = ['action_no','component']
for k in need:
    assert k in inv.columns, f'Falta columna {k} en el ZIP'
mask_ok = inv['action_no'].astype(str).str.len().gt(0) & inv['component'].astype(str).str.len().gt(0)

# (Optional) to require Make/Model for MMY relationships, uncomment this line:
# mask_ok &= inv['make'].astype(str).str.len().gt(0) & inv['model'].astype(str).str.len().gt(0)

good = inv[mask_ok].copy()

# ISO dates (string)
for c in ['open_date','close_date']:
    if c in good.columns:
        good[c] = pd.to_datetime(good[c], errors='coerce').dt.strftime('%Y-%m-%d').fillna('')

# year as Int64 (may remain NA)
if 'year' in good.columns:
    good['year'] = pd.to_numeric(good['year'], errors='coerce').astype('Int64')

out_csv = EXP_DIR / 'investigations_neo4j_ready.csv'
good.to_csv(out_csv, index=False)
print('==================================')
print('Export listo para Aura →', out_csv)
print('Filas:', len(good))
display(good.head(8))


Export listo para Aura → /content/drive/MyDrive/NHTSA/neo4j/exports/investigations_neo4j_ready.csv
Filas: 152193


Unnamed: 0,action_no,make,model,year,component,open_date,close_date,subject,summary,campaign_no,comp_l1,comp_l2,comp_l3,comp_l4,comp_l5
0,AQ09001,CAPCEN,9005,,EXTERIOR LIGHTING,2009-03-26,2009-07-06,HID REPLACEMENT KIT RECALL CAMPAIGNS,RMD IDENTIFIED SEVERAL HID REPLACEMENT LIGHTIN...,06E027000,EXTERIOR LIGHTING,,,,
1,DP05005,FORD,E SERIES SUPER DUTY,2001.0,ENGINE AND ENGINE COOLING,2005-09-22,2006-01-04,SPARK PLUG EJECTION FROM CYLINDER HEAD,"ON SEPTEMBER 6, 2005, ODI RECEIVED A PETITION ...",,ENGINE AND ENGINE COOLING,,,,
2,DP05005,FORD,MUSTANG GT,2000.0,ENGINE AND ENGINE COOLING,2005-09-22,2006-01-04,SPARK PLUG EJECTION FROM CYLINDER HEAD,"ON SEPTEMBER 6, 2005, ODI RECEIVED A PETITION ...",,ENGINE AND ENGINE COOLING,,,,
3,AQ09001,EASTONE,9005-8K,,EXTERIOR LIGHTING:HEADLIGHTS,2009-03-26,2009-07-06,HID REPLACEMENT KIT RECALL CAMPAIGNS,RMD IDENTIFIED SEVERAL HID REPLACEMENT LIGHTIN...,08E044000,EXTERIOR LIGHTING,HEADLIGHTS,,,
4,AQ09001,EASTONE,9006-8K,,EXTERIOR LIGHTING,2009-03-26,2009-07-06,HID REPLACEMENT KIT RECALL CAMPAIGNS,RMD IDENTIFIED SEVERAL HID REPLACEMENT LIGHTIN...,07E073000,EXTERIOR LIGHTING,,,,
5,DP03006,DODGE,RAM 1500,1997.0,SEATS:FRONT ASSEMBLY:RECLINER,2003-09-23,2004-04-08,SEAT BACK FAILURE,ODI HAS IDENTIFIED 30 INCIDENTS WHERE IT IS AL...,,SEATS,FRONT ASSEMBLY,RECLINER,,
6,AQ18003,DIONO,PACIFICA,,CHILD SEAT,2018-07-19,2021-12-06,RECALL ADMINISTRATION CONCERNS,"ON SEPTEMBER 14, 2017, DIONO LLC (DIONO) SUBMI...",,CHILD SEAT,,,,
7,DP05005,LINCOLN,TOWN CAR,2004.0,ENGINE AND ENGINE COOLING:ENGINE,2005-09-22,2006-01-04,SPARK PLUG EJECTION FROM CYLINDER HEAD,"ON SEPTEMBER 6, 2005, ODI RECEIVED A PETITION ...",,ENGINE AND ENGINE COOLING,ENGINE,,,


5) Ingestion into Neo4j Aura (upsert + hierarchy + link to Recall via CAMPNO)

In [9]:
from neo4j import GraphDatabase
import os

NEO4J_URI  = os.getenv('NEO4J_URI',  'neo4j+s://66024f48.databases.neo4j.io')
NEO4J_USER = os.getenv('NEO4J_USER', 'neo4j')
NEO4J_PASS = os.environ['NEO4J_PASS']  # required: read the secret from the environment, never hard-code it

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASS))
print('Conectado a Aura:', bool(driver))

CONSTRAINTS = [
    "CREATE CONSTRAINT IF NOT EXISTS FOR (i:Investigation) REQUIRE i.id IS UNIQUE",
    # Constraints for Recall/Make/Model/Component already exist from the previous phase.
]
with driver.session() as s:
    for q in CONSTRAINTS:
        s.run(q)
        print('OK:', q)
print('Constraints listos ✅')


Conectado a Aura: True
OK: CREATE CONSTRAINT IF NOT EXISTS FOR (i:Investigation) REQUIRE i.id IS UNIQUE
Constraints listos ✅
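
A minimal sketch of loading the connection settings from environment variables and failing fast when one is missing. The helper name `neo4j_settings` and the placeholder values in the usage lines are hypothetical; only the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASS` come from the cell above.

```python
import os

def neo4j_settings(env=os.environ):
    """Return (uri, (user, password)) or raise if a required variable is unset or empty."""
    required = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASS")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["NEO4J_URI"], (env["NEO4J_USER"], env["NEO4J_PASS"])

# Placeholder values for illustration only; real values belong in the environment.
uri, auth = neo4j_settings({
    "NEO4J_URI": "neo4j+s://example.databases.neo4j.io",
    "NEO4J_USER": "neo4j",
    "NEO4J_PASS": "change-me",
})
print(uri)  # neo4j+s://example.databases.neo4j.io
```

The returned pair can then be handed to the driver as `GraphDatabase.driver(uri, auth=auth)`.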


In [14]:
CYPHER_UPSERT_INV = """
UNWIND $rows AS row

// --- Investigation ---
MERGE (i:Investigation {id: row.action_no})
  SET i.open_date  = coalesce(row.open_date, ''),
      i.close_date = coalesce(row.close_date, ''),
      i.subject    = coalesce(row.subject, ''),
      i.summary    = coalesce(row.summary, ''),
      i.year       = CASE WHEN row.year IS NULL OR row.year = '' THEN NULL ELSE toInteger(row.year) END,
      i.component  = row.component

// --- Optional Make / Model ---
FOREACH (_ IN CASE WHEN row.make <> '' THEN [1] ELSE [] END |
  MERGE (mk:Make {name: row.make})
  MERGE (i)-[:OF_MAKE]->(mk)
)
FOREACH (_ IN CASE WHEN row.model <> '' AND row.make <> '' THEN [1] ELSE [] END |
  MERGE (md:Model {name: row.model, make: row.make})
  MERGE (i)-[:OF_MODEL]->(md)
)

// --- Hierarchical components ---
MERGE (c1:Component {name: row.comp_l1})
  ON CREATE SET c1.name_lower = toLower(row.comp_l1)
  ON MATCH  SET c1.name_lower = coalesce(c1.name_lower, toLower(row.comp_l1))

MERGE (c2:Component {name: CASE WHEN row.comp_l2 <> '' THEN row.comp_l2 ELSE row.comp_l1 END})
  ON CREATE SET c2.name_lower = toLower(CASE WHEN row.comp_l2 <> '' THEN row.comp_l2 ELSE row.comp_l1 END)
  ON MATCH  SET c2.name_lower = coalesce(c2.name_lower, toLower(CASE WHEN row.comp_l2 <> '' THEN row.comp_l2 ELSE row.comp_l1 END))
FOREACH (_ IN CASE WHEN row.comp_l2 <> '' THEN [1] ELSE [] END |
  MERGE (c1)<-[:SUB_OF]-(c2)
)

MERGE (c3:Component {name: CASE WHEN row.comp_l3 <> '' THEN row.comp_l3 ELSE c2.name END})
  ON CREATE SET c3.name_lower = toLower(CASE WHEN row.comp_l3 <> '' THEN row.comp_l3 ELSE c2.name END)
  ON MATCH  SET c3.name_lower = coalesce(c3.name_lower, toLower(CASE WHEN row.comp_l3 <> '' THEN row.comp_l3 ELSE c2.name END))
FOREACH (_ IN CASE WHEN row.comp_l3 <> '' THEN [1] ELSE [] END |
  MERGE (c2)<-[:SUB_OF]-(c3)
)

MERGE (c4:Component {name: CASE WHEN row.comp_l4 <> '' THEN row.comp_l4 ELSE c3.name END})
  ON CREATE SET c4.name_lower = toLower(CASE WHEN row.comp_l4 <> '' THEN row.comp_l4 ELSE c3.name END)
  ON MATCH  SET c4.name_lower = coalesce(c4.name_lower, toLower(CASE WHEN row.comp_l4 <> '' THEN row.comp_l4 ELSE c3.name END))
FOREACH (_ IN CASE WHEN row.comp_l4 <> '' THEN [1] ELSE [] END |
  MERGE (c3)<-[:SUB_OF]-(c4)
)

MERGE (c5:Component {name: CASE WHEN row.comp_l5 <> '' THEN row.comp_l5 ELSE c4.name END})
  ON CREATE SET c5.name_lower = toLower(CASE WHEN row.comp_l5 <> '' THEN row.comp_l5 ELSE c4.name END)
  ON MATCH  SET c5.name_lower = coalesce(c5.name_lower, toLower(CASE WHEN row.comp_l5 <> '' THEN row.comp_l5 ELSE c4.name END))
FOREACH (_ IN CASE WHEN row.comp_l5 <> '' THEN [1] ELSE [] END |
  MERGE (c4)<-[:SUB_OF]-(c5)
)

// Leaf and MENTIONS relationship
WITH row, i, c1, c2, c3, c4, c5
WITH i,
     CASE
       WHEN row.comp_l5 <> '' THEN c5
       WHEN row.comp_l4 <> '' THEN c4
       WHEN row.comp_l3 <> '' THEN c3
       WHEN row.comp_l2 <> '' THEN c2
       ELSE c1
     END AS leaf,
     row
MERGE (i)-[:MENTIONS]->(leaf)

// --- Link to Recall (no MATCH inside FOREACH) ---
WITH row, i
WHERE row.campaign_no <> ''
OPTIONAL MATCH (r:Recall {id: row.campaign_no})
WITH i, r
WHERE r IS NOT NULL
MERGE (i)-[:RELATES_TO]->(r);
"""


Batch ingestion

In [15]:
def ingest_csv_in_batches(csv_path: Path, cypher: str, batch_size: int = 1000):
    df = pd.read_csv(csv_path, dtype=str, keep_default_na=False)
    # guarantee the minimal set of columns
    for c in ['action_no','make','model','year','component','open_date','close_date',
              'subject','summary','campaign_no','comp_l1','comp_l2','comp_l3','comp_l4','comp_l5']:
        if c not in df.columns: df[c] = ''
    # Trim
    for c in df.columns:
        df[c] = df[c].astype(str).str.strip()
    total = len(df)
    i = 0
    with driver.session() as s:
        while i < total:
            rows = df.iloc[i:i+batch_size].to_dict('records')
            s.run(cypher, rows=rows)
            i += batch_size
            print(f'→ {min(i,total)}/{total}')
    print('Ingesta completa ✅')

csv_inv = EXP_DIR / 'investigations_neo4j_ready.csv'
assert csv_inv.exists(), f'No existe {csv_inv}'
ingest_csv_in_batches(csv_inv, CYPHER_UPSERT_INV, batch_size=1000)

# Quick check
with driver.session() as s:
    n_inv = s.run('MATCH (i:Investigation) RETURN count(i) AS n').single()['n']
    comps = s.run('MATCH (c:Component) RETURN count(c) AS n').single()['n']
    rels  = s.run('MATCH ()-[r]->() RETURN count(r) AS n').single()['n']
    print('Investigations:', n_inv, '| Components:', comps, '| Rels:', rels)
    sample = s.run('MATCH (i:Investigation)-[:MENTIONS]->(c:Component) RETURN i.id AS inv, c.name AS comp LIMIT 5').data()
    print(sample)


→ 1000/152193
→ 2000/152193
→ 3000/152193
→ 4000/152193
→ 5000/152193
→ 6000/152193
→ 7000/152193
→ 8000/152193
→ 9000/152193
→ 10000/152193
→ 11000/152193
→ 12000/152193
→ 13000/152193
→ 14000/152193
→ 15000/152193
→ 16000/152193
→ 17000/152193
→ 18000/152193
→ 19000/152193
→ 20000/152193
→ 21000/152193
→ 22000/152193
→ 23000/152193
→ 24000/152193
→ 25000/152193
→ 26000/152193
→ 27000/152193
→ 28000/152193
→ 29000/152193
→ 30000/152193
→ 31000/152193
→ 32000/152193
→ 33000/152193
→ 34000/152193
→ 35000/152193
→ 36000/152193
→ 37000/152193
→ 38000/152193
→ 39000/152193
→ 40000/152193
→ 41000/152193
→ 42000/152193
→ 43000/152193
→ 44000/152193
→ 45000/152193
→ 46000/152193
→ 47000/152193
→ 48000/152193
→ 49000/152193
→ 50000/152193
→ 51000/152193
→ 52000/152193
→ 53000/152193
→ 54000/152193
→ 55000/152193
→ 56000/152193
→ 57000/152193
→ 58000/152193
→ 59000/152193
→ 60000/152193
→ 61000/152193
→ 62000/152193
→ 63000/152193
→ 64000/152193
→ 65000/152193
→ 66000/152193
→ 67000/152193
→ 68