# Integrating Neo4j, Qdrant, and the Multilingual E5 Model for the NHTSA Recall Semantic Network

---

## **1. General Objective**

The goal of the project is to build a **unified knowledge base** of vehicle safety reports from the *National Highway Traffic Safety Administration (NHTSA)*, integrating structured and unstructured information from several sources (Complaints, Recalls, and Investigations).
The system supports semantic search, links entities by similarity, and visualizes communities of related recalls through interactive graphs.

---

## **2. System Architecture**

### **2.1 Main Components**

| Component                                                           | Description                                                                                                                         |
| ------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| **Qdrant Cloud**                                                    | Vector engine for semantic search and embedding storage.                                                                            |
| **Neo4j AuraDB Free**                                               | Graph database that stores nodes and relationships between recalls, models, components, and manufacturers.                          |
| **Multilingual E5 model (intfloat/multilingual-e5-large-instruct)** | *sentence-transformers* model used to encode the descriptive text of recalls, which enables semantic queries in multiple languages. |
| **Google Colab**                                                    | Execution environment that combines ingestion, search, and visualization of results in a reproducible workflow.                     |

---

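## **3. Technical Implementation**

### **3.1 Environment Configuration**

Environment variables were set up to connect to the external services:

```python
import os
# Read connection settings from the environment instead of hard-coding them.
NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD = (os.getenv(k) for k in ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD"))
QDRANT_URL, QDRANT_API_KEY = (os.getenv(k) for k in ("QDRANT_URL", "QDRANT_API_KEY"))
E5_MODEL_NAME = "intfloat/multilingual-e5-large-instruct"
```

The clients are instantiated lazily (lazy loading), so each connection and the model are created only on first use, which saves memory and start-up time.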

---

### **3.2 Text Encoding and Semantic Search**

The **E5-multilingual-large-instruct** model was used to turn textual descriptions into normalized 1024-dimensional embeddings.

```python
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

@torch.no_grad()
def e5_query(text: str) -> np.ndarray:
    # Prefix the text with "query: " and return an L2-normalized float32 vector.
    vec = encoder.encode([f"query: {text}"], normalize_embeddings=True)
    return np.asarray(vec[0], dtype=np.float32)
```

These vectors are queried against the Qdrant collections with `query_points`, which allows results to be restricted by entity type (`recall`, `investigation`, `complaint`).
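Because the embeddings are L2-normalized, the cosine similarity between two of them is just their dot product. The sketch below illustrates this with toy 3-dimensional vectors. `l2_normalize`, `dot`, and `cosine` are hypothetical helpers written for this example, not part of the project code, and the vectors are made up.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (the effect of normalize_embeddings=True)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for a query embedding and a document embedding.
query = [3.0, 4.0, 0.0]
doc = [4.0, 3.0, 0.0]

q_unit, d_unit = l2_normalize(query), l2_normalize(doc)

# For unit vectors the dot product and the cosine similarity coincide.
assert math.isclose(dot(q_unit, d_unit), cosine(query, doc))
```

If the collections are configured with cosine or dot-product distance (an assumption, since the collection setup is not shown here), the `score` values returned by the search can be read on this same scale.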

---

### **3.3 Enriching Results with Neo4j**

The `ask()` function combines both stores:

1. It generates the embedding of the question.
2. It retrieves the `k` most relevant results from Qdrant.
3. It looks up their metadata in Neo4j with Cypher.

Example query:

```python
ask("airbag sensor failure", k=10, type_in="recall")
```

The result is enriched with:

* `camp_no`, `make`, `model`, `year`
* `component`, `summary`, and related `investigations`.
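In the notebook, `ask()` returns the vector hits and the graph rows as two separate lists. One way to line them up afterwards is to join them on the shared `id`, as in the minimal sketch below. `merge_hits_with_details` is a hypothetical helper, and the sample records are illustrative (only the id `22V240000` and its make come from the output shown later).

```python
def merge_hits_with_details(hits, details, id_key="id"):
    """Attach each Neo4j row to the Qdrant hit whose payload carries the same id."""
    by_id = {row[id_key]: row for row in details}
    merged = []
    for hit in hits:
        rid = (hit.get("payload") or {}).get(id_key)
        merged.append({
            "id": rid,
            "score": hit.get("score"),
            "details": by_id.get(rid),  # None when the graph has no matching node
        })
    return merged

# Illustrative inputs shaped like the output of qdrant_search() and cypher().
hits = [
    {"score": 0.86, "payload": {"id": "22V240000"}},
    {"score": 0.80, "payload": {"id": "UNKNOWN"}},
]
details = [{"id": "22V240000", "make": "BMW"}]

merged = merge_hits_with_details(hits, details)
for row in merged:
    print(row["id"], row["score"], row["details"])
```

Keeping unmatched hits with `details=None` makes it visible when a point exists in the vector store but not in the graph.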

---

## **4. Modeling SIMILAR_TO Relationships**

### **4.1 Similarity Criteria**

A similarity network between recalls was generated with partial-match rules:

* Same `make` and `model` with a year difference ≤ 1
* Same `component` with a year difference ≤ 1

These rules were implemented with parameterized Cypher and controlled retries:

```cypher
MERGE (a)-[r:SIMILAR_TO]-(b)
SET r.reason = coalesce(r.reason, "MAKE_MODEL_YEAR±1")
```
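The two rules can also be read as a plain predicate over a pair of recalls. The following is a minimal sketch under stated assumptions: `similar_reason` is a hypothetical function, the records are made up, and the label `"COMPONENT_YEAR±1"` is invented here for the second rule (the source only shows `"MAKE_MODEL_YEAR±1"`).

```python
def similar_reason(a, b, max_year_gap=1):
    """Return the name of the matching rule, or None when the recalls are not similar."""
    if a["year"] is None or b["year"] is None:
        return None
    if abs(a["year"] - b["year"]) > max_year_gap:
        return None
    if a["make"] == b["make"] and a["model"] == b["model"]:
        return "MAKE_MODEL_YEAR±1"
    if a["component"] == b["component"]:
        return "COMPONENT_YEAR±1"  # hypothetical label for the component rule
    return None

# Made-up records for illustration only.
r1 = {"make": "TOYOTA", "model": "CAMRY", "year": 2010, "component": "AIR BAGS"}
r2 = {"make": "TOYOTA", "model": "CAMRY", "year": 2011, "component": "BRAKES"}
r3 = {"make": "FORD", "model": "F-150", "year": 2011, "component": "AIR BAGS"}
r4 = {"make": "TOYOTA", "model": "CAMRY", "year": 2015, "component": "AIR BAGS"}

print(similar_reason(r1, r2))  # same make and model, one year apart
print(similar_reason(r1, r3))  # same component, one year apart
print(similar_reason(r1, r4))  # five years apart, so no rule applies
```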

### **4.2 Process Statistics**

```
SKIP 0 → rels creadas/actualizadas: 60981 (acum: 60981)
SKIP 3000 → rels creadas/actualizadas: 50831 (acum: 111812)
SKIP 6000 → rels creadas/actualizadas: 43139 (acum: 154951)
SKIP 9000 → rels creadas/actualizadas: 30581 (acum: 185532)
SKIP 12000 → rels creadas/actualizadas: 1915 (acum: 187447)
✅ SIMILAR_TO completado.
```

Total: **187,447 relationships** created or updated between recalls.
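The `SKIP` offsets in the log suggest pages of 3,000 recalls processed one after another. A minimal sketch of such a paged loop with bounded retries is shown below. `run_in_batches` and its parameters are hypothetical, the real Cypher call is replaced by a stand-in that replays the per-batch counts from the log, and `total=15000` is only chosen to cover the five pages, not a figure taken from the source.

```python
import time

def run_in_batches(run_batch, total, batch_size=3000, max_retries=3, wait_s=0.0):
    """Call run_batch(skip, limit) page by page, retrying each page on failure."""
    cumulative = 0
    log = []
    for skip in range(0, total, batch_size):
        for attempt in range(1, max_retries + 1):
            try:
                created = run_batch(skip, batch_size)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last allowed attempt
                time.sleep(wait_s)  # pause before retrying the same page
        cumulative += created
        log.append((skip, created, cumulative))
    return log

# Stand-in for the real Cypher call: replays the per-batch counts from the log.
counts = {0: 60981, 3000: 50831, 6000: 43139, 9000: 30581, 12000: 1915}
log = run_in_batches(lambda skip, limit: counts[skip], total=15000)

for skip, created, cumulative in log:
    print(f"SKIP {skip}: {created} (cumulative: {cumulative})")
```

The running total after the last page matches the reported 187,447 relationships.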

---

## **5. Interactive Visualization**

### **5.1 Subgraph Sampling**

Two query modes were implemented:

* **By make (`seed_make`)**: shows a subset of recalls filtered by `make`, `component`, or year range.
* **By specific node (`seed_id`)**: builds a first-order network around a single recall.

Example configuration:

```python
seed_make = "TOYOTA"
limit_query_nodes = 250
max_nodes_render  = 220
```

### **5.2 Building the Graph with PyVis**

Nodes are colored by **make** and edges by main **component** (e.g., AIR BAGS, BRAKES, ENGINE).
Node size reflects its **degree** (the number of similar recalls it is connected to).

```python
from pyvis.network import Network
net = Network(height="700px", width="100%", directed=False, cdn_resources="in_line")
net.force_atlas_2based(gravity=-50, spring_length=120, damping=0.4)
```
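The sizing and coloring rules described above can be computed from the edge list before any node is added to the network. The sketch below assumes a list of `(source, target)` pairs. `node_degrees`, `color_by_make`, the palette, and the size formula are all illustrative choices, not the notebook's actual values.

```python
from collections import Counter

def node_degrees(edges):
    """Count how many SIMILAR_TO edges touch each node (undirected)."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

def color_by_make(makes, palette=("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728")):
    """Assign one palette color per make, cycling when makes outnumber colors."""
    return {make: palette[i % len(palette)] for i, make in enumerate(sorted(set(makes)))}

# Made-up edge list and makes for illustration.
edges = [("R1", "R2"), ("R1", "R3"), ("R2", "R3"), ("R3", "R4")]
deg = node_degrees(edges)
sizes = {node: 10 + 4 * d for node, d in deg.items()}  # base size plus a per-edge bonus
colors = color_by_make(["TOYOTA", "FORD", "TOYOTA"])

print(sizes)
print(colors)
```

The resulting `sizes` and `colors` dictionaries could then be passed as the `size` and `color` arguments when adding nodes.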

The result is saved and rendered inline in Colab:

```python
from IPython.display import HTML, display
net.write_html("/content/pyvis_similar_recalls.html", notebook=False)
display(HTML(open("/content/pyvis_similar_recalls.html").read()))
```

✅ **Output:** A navigable interactive graph in which each node shows its `camp_no`, `make/model`, `year`, `component`, and recall summary.

---

## **6. Results and Preliminary Analysis**

* **Communities of recalls** emerge, grouped by affected system (e.g., airbags, brakes, engine).
* Makes with a high number of recalls in the network (Toyota, Ford, GM) form dense clusters.
* Bridge nodes (recalls connected to multiple components) point to possible **cross-cutting risks** or shared manufacturing defects.

---

## **7. Conclusions and Next Steps**

### **7.1 Conclusions**

* The **Qdrant + Neo4j + E5** combination enables integrated semantic and relational search.
* The **multilingual** model supports descriptions in both English and Spanish.
* The **PyVis** visualization offers an intuitive tool for exploring similarities between recalls.

### **7.2 Next Steps**

1. **Add automatic community detection:**
   Apply `gds.labelPropagation` or `gds.louvain` to the SIMILAR_TO network to detect functional groups.

2. **Extend multi-source queries:**
   Incorporate `COMPLAINT` → `INVESTIGATION` → `RECALL` links.

3. **Analytical dashboard (Plotly / Dash):**
   Show statistics by manufacturer, year, and critical component.

4. **Incremental embeddings:**
   Update Qdrant dynamically with new recalls (2025 onward).
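To make the first next step more concrete, the idea behind label propagation can be tried on a toy edge list before reaching for the graph library. The following is a dependency-free sketch, not the `gds.labelPropagation` implementation: `label_propagation` is a hypothetical function, ties are broken by taking the smallest label so that runs are reproducible, and the edges are made up.

```python
from collections import Counter, defaultdict

def label_propagation(edges, max_iters=20):
    """Asynchronous label propagation with deterministic tie-breaking."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    labels = {node: node for node in neighbors}  # every node starts in its own community
    for _ in range(max_iters):
        changed = False
        for node in sorted(neighbors):
            counts = Counter(labels[n] for n in neighbors[node])
            best = max(counts.values())
            # Among the most frequent neighbor labels, take the smallest one.
            new_label = min(lbl for lbl, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # labels are stable, so the communities have converged
    return labels

# Two disconnected triangles: each should end up as its own community.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("X", "Y"), ("Y", "Z"), ("X", "Z")]
labels = label_propagation(edges)
print(labels)
```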

---

## **8. References**

* [Qdrant Documentation](https://qdrant.tech/documentation/)
* [Neo4j AuraDB Free](https://neo4j.com/cloud/platform/aura-graph-database/)
* [Sentence Transformers: Multilingual E5](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
* [NHTSA Vehicle Recall Database](https://www.nhtsa.gov/recalls)

---

# Code

In [1]:
!pip -q install neo4j qdrant-client sentence-transformers torch fastapi uvicorn python-dotenv langchain langchain-community

In [131]:
!pip -q install jinja2==3.1.4 pyvis==0.3.2

In [1]:
import os

# Neo4j AuraDB
os.environ["NEO4J_URI"] = "neo4j+s://66024f48.databases.neo4j.io"
os.environ["NEO4J_USER"] = "neo4j"
os.environ["NEO4J_PASS"] = "<NEO4J_PASSWORD>"  # secret redacted; supply it at runtime

# Qdrant Cloud
os.environ["QDRANT_URL"] = "https://6f99e241-2505-45c5-86f7-ad2aa70e3bb2.us-east-1-1.aws.cloud.qdrant.io"
os.environ["QDRANT_KEY"] = "<QDRANT_API_KEY>"  # secret redacted; supply it at runtime

# Model used for queries
os.environ["E5_MODEL_NAME"] = "intfloat/multilingual-e5-large-instruct"
os.environ["E5_MAX_LEN"] = "256"

In [2]:
import os
import numpy as np
from neo4j import GraphDatabase
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer
import torch

In [3]:
# Path to the IPCA projector already used to reduce the complaint embeddings
IPCA_PATH_COMPLAINTS = r"..\data\embeddings\complaints_e5_mlg_instruct\ipca_model.npz"


In [4]:
# ---------- Read and validate the environment ----------

def get_env():
    env = {
        "NEO4J_URI": os.getenv("NEO4J_URI", ""),
        "NEO4J_USER": os.getenv("NEO4J_USER", ""),
        "NEO4J_PASSWORD": os.getenv("NEO4J_PASSWORD") or os.getenv("NEO4J_PASS", ""),
        "QDRANT_URL": os.getenv("QDRANT_URL", ""),
        "QDRANT_API_KEY": os.getenv("QDRANT_API_KEY") or os.getenv("QDRANT_KEY", ""),
        "E5_MODEL_NAME": os.getenv("E5_MODEL_NAME", "intfloat/multilingual-e5-large-instruct"),
        "E5_MAX_LEN": int(os.getenv("E5_MAX_LEN", "256")),
    }
    assert env["NEO4J_URI"] and env["NEO4J_USER"] and env["NEO4J_PASSWORD"], "Incomplete Neo4j config."
    assert env["QDRANT_URL"] and env["QDRANT_API_KEY"], "Incomplete Qdrant config (QDRANT_URL/API_KEY)."
    return env

ENV = get_env()

In [5]:
# ---------- Lazy clients ----------
_neo_driver = None
_qdrant_client = None
_e5_encoder = None
_ipca_proj_complaints = None  # (components_, mean_) cache

def get_neo_driver():
    global _neo_driver
    if _neo_driver is None:
        _neo_driver = GraphDatabase.driver(
            ENV["NEO4J_URI"],
            auth=(ENV["NEO4J_USER"], ENV["NEO4J_PASSWORD"])
        )
    return _neo_driver

def get_qdrant_client():
    global _qdrant_client
    if _qdrant_client is None:
        _qdrant_client = QdrantClient(
            url=ENV["QDRANT_URL"],
            api_key=ENV["QDRANT_API_KEY"],
            timeout=180
        )
    return _qdrant_client

def get_encoder():
    global _e5_encoder
    if _e5_encoder is not None:
        return _e5_encoder

    model_name = ENV["E5_MODEL_NAME"]
    max_len    = ENV["E5_MAX_LEN"]
    try:
        _e5_encoder = SentenceTransformer(model_name, device="cuda" if torch.cuda.is_available() else "cpu")
        if torch.cuda.is_available():
            try:
                _e5_encoder._first_module().auto_model.to(dtype=torch.float16)
            except Exception:
                pass
    except Exception:
        _e5_encoder = SentenceTransformer(model_name, device="cpu")
    _e5_encoder.max_seq_length = max_len
    return _e5_encoder

@torch.no_grad()
def e5_query(text: str) -> np.ndarray:
    enc = get_encoder()
    vec = enc.encode(
        [f"query: {text}"],
        normalize_embeddings=True,
        convert_to_numpy=True,
        batch_size=4,
        show_progress_bar=False
    )[0]
    return np.asarray(vec, dtype=np.float32)   # typically 1024-D

# ---------- IPCA projection (same as the complaints pipeline) ----------
def load_ipca_npz(path: str):
    z = np.load(path, allow_pickle=False)
    components_ = z["components_"].astype(np.float32)   # [n_comp, d_orig]
    mean_       = z["mean_"].astype(np.float32)         # [d_orig]
    return components_, mean_

def ensure_ipca_complaints():
    global _ipca_proj_complaints
    if _ipca_proj_complaints is None:
        if not os.path.exists(IPCA_PATH_COMPLAINTS):
            raise RuntimeError(
                f"IPCA file for complaints not found: {IPCA_PATH_COMPLAINTS}. "
                "Without this projection, a reduced (256-D) collection cannot be queried."
            )
        _ipca_proj_complaints = load_ipca_npz(IPCA_PATH_COMPLAINTS)
    return _ipca_proj_complaints  # (components_, mean_)

def ipca_transform_row(vec_1d: np.ndarray, components_: np.ndarray, mean_: np.ndarray) -> np.ndarray:
    # Z = (x - mean) @ components_.T
    x = np.asarray(vec_1d, dtype=np.float32)
    x_center = x - mean_
    return x_center @ components_.T


In [None]:
def qdrant_collection_dim(collection_name: str) -> int:
    """
    Return the vector dimension of a Qdrant collection, tolerating
    differences between client versions (attributes vs. dicts).
    """
    client = get_qdrant_client()
    info = client.get_collection(collection_name)

    # 1) Attribute lookups (the layout varies across versions)
    for attr_path in [
        ("vectors_config", "params", "size"),   # some 1.7.x+ releases
        ("config", "params", "size"),           # other versions
        ("result", "config", "params", "size"), # when the response is wrapped
        ("vectors", "params", "size"),          # legacy fallback
        ("vectors_size",),                      # some models expose it directly
    ]:
        obj = info
        ok = True
        for a in attr_path:
            try:
                obj = getattr(obj, a)
            except Exception:
                ok = False
                break
        if ok and isinstance(obj, (int, float)):
            return int(obj)

    # 2) Try as a dict (Pydantic v1 and v2)
    d = None
    try:
        # Pydantic v1
        if hasattr(info, "dict"):
            d = info.dict()
        # Pydantic v2
        elif hasattr(info, "model_dump"):
            d = info.model_dump()
    except Exception:
        d = None

    def _dig(dct, path):
        cur = dct
        for p in path:
            if isinstance(cur, dict) and p in cur:
                cur = cur[p]
            else:
                return None
        return cur

    if isinstance(d, dict):
        for key_path in [
            ["vectors_config", "params", "size"],
            ["config", "params", "size"],
            ["result", "config", "params", "size"],
            ["vectors", "params", "size"],
            ["vectors_size"],
        ]:
            val = _dig(d, key_path)
            if isinstance(val, (int, float)):
                return int(val)

        # Last resort: look for any "size" under "params"
        def _search_size(obj):
            if isinstance(obj, dict):
                if "params" in obj and isinstance(obj["params"], dict) and "size" in obj["params"]:
                    v = obj["params"]["size"]
                    if isinstance(v, (int, float)):
                        return int(v)
                for v in obj.values():
                    r = _search_size(v)
                    if r is not None:
                        return r
            elif isinstance(obj, list):
                for v in obj:
                    r = _search_size(v)
                    if r is not None:
                        return r
            return None
        val = _search_size(d)
        if val is not None:
            return val

    # 3) Nothing worked: fail loudly instead of guessing a dimension
    raise RuntimeError(f"Could not determine the dimension of collection '{collection_name}'")


In [8]:
def qdrant_search(qv, k=15, mmy=None, collection=None, type_in=None):
    client = get_qdrant_client()

    # Resolve the collection from the type
    if collection is None:
        mapping = {
            "recall": "nhtsa_recalls",
            "investigation": "nhtsa_investigations",
            "complaint": "nhtsa_complaints",
        }
        collection = mapping.get((type_in or "recall").lower(), "nhtsa_recalls")

    # Match the query dimension to the collection
    expected_dim = qdrant_collection_dim(collection)
    qv = np.asarray(qv, dtype=np.float32)
    if qv.shape[0] != expected_dim:
        # For complaints with a reduced index (e.g. 256), project with IPCA
        if type_in == "complaint":
            components_, mean_ = ensure_ipca_complaints()
            proj = ipca_transform_row(qv, components_, mean_)  # [n_comp]
            if proj.shape[0] != expected_dim:
                raise RuntimeError(
                    f"Dimension after IPCA ({proj.shape[0]}) != dimension expected by Qdrant ({expected_dim}). "
                    "Check that this is the same IPCA used to build the collection."
                )
            qv = proj.astype(np.float32, copy=False)
        else:
            raise RuntimeError(
                f"Vector dim {qv.shape[0]} != {expected_dim} for collection '{collection}'. "
                "Either recreate the collection at 1024-D or apply the same reduction used at indexing time."
            )

    # Optional filter
    qdrant_filter = None
    if mmy:
        conds = [models.FieldCondition(key=kf, match=models.MatchValue(value=vf))
                 for kf, vf in mmy.items()]
        qdrant_filter = models.Filter(must=conds)

    qv_list = qv.tolist()

    # Call (new/old signature)
    try:
        resp = client.query_points(
            collection_name=collection,
            query=qv_list,
            limit=k,
            with_payload=True,
            with_vectors=False,
            query_filter=qdrant_filter
        )
    except TypeError:
        resp = client.query_points(
            collection_name=collection,
            query=qv_list,
            limit=k,
            with_payload=True,
            with_vectors=False,
            filter=qdrant_filter
        )
    if hasattr(resp, "points"):
        pts = resp.points
    elif isinstance(resp, list):
        pts = resp
    else:
        pts = []

    out = []
    
    for pt in pts:
        if isinstance(pt, dict):
            score = pt.get("score")
            payload = pt.get("payload") or {}
            pid = pt.get("id")
        else:
            score = getattr(pt, "score", None)
            payload = getattr(pt, "payload", {}) or {}
            pid = getattr(pt, "id", None)
        out.append({"id": pid, "score": score, "payload": payload})
    return out

def cypher(query, params=None):
    with get_neo_driver().session() as s:
        return s.run(query, **(params or {})).data()

def _extract_complaint_id(payload: dict):
    if not payload:
        return None
    for key in ("qid", "id", "comp_id", "QID", "CompID", "compId"):
        v = payload.get(key)
        if v is not None:
            return str(v)
    return None

# ---------- High-level orchestrator ----------
def ask(question, mmy=None, k=15, type_in="recall"):
    """
    Search Qdrant (collection chosen by type_in) and enrich the hits with Neo4j.
      - recall/investigation: payload['id']
      - complaint: payload['qid'] (falls back to payload['id'])
    """
    # 1) E5 vector (1024-D)
    qv = e5_query(question)

    # 2) Qdrant
    hits = qdrant_search(qv, k=k, mmy=mmy, type_in=type_in)

    # 3) IDs used to join against Neo4j
    if type_in == "complaint":
        ids = []
        for h in hits:
            p = h.get("payload") or {}
            ids.append(p.get("qid") or p.get("id"))
        ids = [x for x in ids if x]
    else:
        ids = []
        for h in hits:
            p = h.get("payload") or {}
            ids.append(p.get("id"))
        ids = [x for x in ids if x]

    if not ids:
        return {"query": question, "hits": hits, f"{type_in}s": []}

    # 4) Neo4j lookup by type
    if type_in == "recall":
        query = """
            UNWIND $ids AS rid
            MATCH (r:Recall {id: rid})
            OPTIONAL MATCH (r)-[:OF_MAKE]->(m:Make)
            OPTIONAL MATCH (r)-[:OF_MODEL]->(md:Model)
            OPTIONAL MATCH (r)-[:MENTIONS]->(x:Component)
            RETURN
                r.id AS id,
                r.camp_no AS camp_no,
                coalesce(r.subject, r.summary) AS title,   // <- there is no r.title or r.name
                r.summary AS summary,
                r.make AS make,
                r.model AS model,
                r.recall_date AS recall_date,
                r.component AS component,
                r.consequence AS consequence,
                collect(DISTINCT m.name)  AS makes,
                collect(DISTINCT md.name) AS models,
                collect(DISTINCT x.name)  AS components
            """
        details = cypher(query, {"ids": ids})
        return {"query": question, "hits": hits, "recalls": details}

    elif type_in == "investigation":
        query = """
            UNWIND $ids AS iid
            MATCH (i:Investigation {id: iid})
            OPTIONAL MATCH (i)-[:RELATES_TO]->(r:Recall)
            RETURN
                i.id AS id,
                coalesce(i.subject, i.summary) AS title,   // <- there is no i.title
                i.subject AS subject,
                i.summary AS summary,
                i.open_date AS open_date,
                i.close_date AS close_date,
                collect(DISTINCT r.id) AS recalls
            """
        details = cypher(query, {"ids": ids})
        return {"query": question, "hits": hits, "investigations": details}

    elif type_in == "complaint":
        # 1) IDs from the hits (robust to different key names)
        cand_ids = []
        for h in hits:
            cid = _extract_complaint_id(h.get("payload") or {})
            if cid:
                cand_ids.append(cid)

        if not cand_ids:
            return {"query": question, "hits": hits, "complaints": []}

        # 2) Cypher tolerant of property name and type (string/int)
        query = """
        UNWIND $ids AS raw
        WITH toString(raw) AS cid
        MATCH (c:Complaint)
        WHERE  toString(c.qid)     = cid
        OR  toString(c.id)      = cid
        OR  toString(c.comp_id) = cid
        RETURN c.qid AS qid, c.id AS id, c.comp_id AS comp_id,
            c.make AS make, c.model AS model, c.year AS year,
            c.component AS component, c._h AS h
        """
        details = cypher(query, {"ids": cand_ids})
        return {"query": question, "hits": hits, "complaints": details}

In [9]:
def qdrant_collection_dim(collection_name: str) -> int:
    client = get_qdrant_client()
    info = client.get_collection(collection_name)
    # Case 1: single unnamed vector
    try:
        return int(info.config.params.vectors.size)
    except Exception:
        pass
    # Case 2: named vectors (pick the right one, or the first)
    try:
        vecs = info.config.params.vectors  # dict-like: {name: VectorParams}
        # If you know the exact name, use it here (e.g. "reduced" or "text")
        # return int(vecs["reduced"].size)
        # Otherwise use the first one and *say so*
        name, vp = next(iter(vecs.items()))
        print(f"[qdrant] Collection '{collection_name}' uses named vector '{name}' with size={vp.size}")
        return int(vp.size)
    except Exception:
        pass
    # Reaching this point means the schema is unknown: fail with a clear error
    raise RuntimeError(f"Could not determine the dimension of collection '{collection_name}'. Check the schema in Qdrant.")


In [150]:
print("complaints dim =", qdrant_collection_dim("nhtsa_complaints"))
print("recalls dim    =", qdrant_collection_dim("nhtsa_recalls"))
print("invest dim     =", qdrant_collection_dim("nhtsa_investigations"))
print("len(e5_query)  =", len(e5_query("airbag sensor failure")))

complaints dim = 256
recalls dim    = 1024
invest dim     = 1024
len(e5_query)  = 1024


In [10]:
# == LOAD IPCA FROM .NPZ (cached) ==
# Path to the .npz used when reducing the complaints
IPCA_PATH_COMPLAINTS = r"..\data\embeddings\complaints_e5_mlg_instruct\ipca_model.npz"

_ipca_cache = {"loaded": False, "components_": None, "mean_": None, "n_in": None, "n_out": None}

def ensure_ipca_complaints():
    """
    Load the IPCA (once) from an .npz file with the keys:
      - components_ (shape = [n_out, n_in])
      - mean_       (shape = [n_in])
    It may also contain var_, singular_values_, etc. (not needed for the transform).
    """
    import os, numpy as np
    if _ipca_cache["loaded"]:
        return _ipca_cache["components_"], _ipca_cache["mean_"]

    assert os.path.exists(IPCA_PATH_COMPLAINTS), f"IPCA .npz file not found: {IPCA_PATH_COMPLAINTS}"
    z = np.load(IPCA_PATH_COMPLAINTS, allow_pickle=False)

    components_ = z["components_"]
    mean_       = z["mean_"]

    assert components_.ndim == 2, f"components_ must be 2D, got {components_.ndim}D"
    assert mean_.ndim == 1, f"mean_ must be 1D, got {mean_.ndim}D"
    n_out, n_in = components_.shape
    assert mean_.shape[0] == n_in, f"mean_.shape={mean_.shape} and components_.shape[1]={n_in} do not match"

    _ipca_cache.update(dict(loaded=True, components_=components_.astype("float32"),
                            mean_=mean_.astype("float32"), n_in=n_in, n_out=n_out))
    print(f"[IPCA] cargado {n_in}→{n_out} desde {IPCA_PATH_COMPLAINTS}")
    return _ipca_cache["components_"], _ipca_cache["mean_"]

def ipca_transform_row(vec, components_, mean_):
    """ Projection equivalent to IPCA.transform for a single vector. """
    import numpy as np
    x = np.asarray(vec, dtype=np.float32)
    # sanity
    d_in = components_.shape[1]
    assert x.shape[0] == d_in, f"Input vector ({x.shape[0]}D) does not match the IPCA input ({d_in}D)"
    return (x - mean_) @ components_.T  # (n_in,) @ (n_out,n_in)^T = (n_out,)
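
The projection above is a centering step followed by one dot product per component row. A dependency-free sketch with a toy projector makes the shapes explicit. `project` is a hypothetical stand-in for `ipca_transform_row`, and the 4-to-2 projector is made up (the real one maps 1024 to 256 dimensions).

```python
def project(vec, components, mean):
    """Center the vector, then take its dot product with every component row."""
    if len(vec) != len(mean):
        raise ValueError(f"expected {len(mean)} dimensions, got {len(vec)}")
    centered = [x - m for x, m in zip(vec, mean)]
    return [sum(c * x for c, x in zip(row, centered)) for row in components]

# Toy projector: 4 input dimensions reduced to 2.
mean = [1.0, 1.0, 1.0, 1.0]
components = [
    [1.0, 0.0, 0.0, 0.0],   # keeps the first centered coordinate
    [0.0, 0.5, 0.5, 0.0],   # averages the second and third centered coordinates
]

reduced = project([3.0, 2.0, 4.0, 9.0], components, mean)
print(len(reduced), reduced)
```

The output has one value per row of `components`, which is why the query vector must pass through the same projector that was used when the collection was indexed.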


In [11]:
def qdrant_search(qv, k=15, mmy=None, collection=None, type_in=None):
    client = get_qdrant_client()

    # 1) Resolve the collection from the type
    if collection is None:
        mapping = {
            "recall": "nhtsa_recalls",
            "investigation": "nhtsa_investigations",
            "complaint": "nhtsa_complaints",
        }
        collection = mapping.get((type_in or "recall").lower(), "nhtsa_recalls")

    # 2) Match the query dimension to the collection
    expected_dim = qdrant_collection_dim(collection)  # now strict
    import numpy as np
    qv = np.asarray(qv, dtype=np.float32)

    if qv.shape[0] != expected_dim:
        if type_in == "complaint":
            # project with the SAME IPCA used when indexing complaints
            components_, mean_ = ensure_ipca_complaints()
            proj = ipca_transform_row(qv, components_, mean_)  # (expected_dim,)
            if proj.shape[0] != expected_dim:
                raise RuntimeError(
                    f"IPCA produced {proj.shape[0]}D but Qdrant expects {expected_dim}D "
                    f"for '{collection}'. Check that the .npz is the one used at indexing time."
                )
            qv = proj.astype(np.float32, copy=False)
        else:
            # recalls & investigations are 1024-D in this setup; on a mismatch, fail (better than returning bad results)
            raise RuntimeError(
                f"Vector dim {qv.shape[0]} != {expected_dim} for '{collection}'. "
                "Querying a reduced collection requires the exact projector; "
                "otherwise, reindex that collection at 1024-D."
            )

    # 3) Optional filter
    qdrant_filter = None
    if mmy:
        conds = [models.FieldCondition(key=kf, match=models.MatchValue(value=vf))
                 for kf, vf in mmy.items()]
        qdrant_filter = models.Filter(must=conds)

    # 4) Query
    try:
        resp = client.query_points(
            collection_name=collection,
            query=qv.tolist(),
            limit=k,
            with_payload=True,
            with_vectors=False,
            query_filter=qdrant_filter
        )
    except TypeError:
        resp = client.query_points(
            collection_name=collection,
            query=qv.tolist(),
            limit=k,
            with_payload=True,
            with_vectors=False,
            filter=qdrant_filter
        )

    # 5) Normalize hits -> list of dicts
    if hasattr(resp, "points"):
        pts = resp.points
    elif isinstance(resp, list):
        pts = resp
    else:
        pts = []

    out = []
    for pt in pts:
        if isinstance(pt, dict):
            score = pt.get("score")
            payload = pt.get("payload") or {}
            pid = pt.get("id")
        else:
            score = getattr(pt, "score", None)
            payload = getattr(pt, "payload", {}) or {}
            pid = getattr(pt, "id", None)
        out.append({"id": pid, "score": score, "payload": payload})
    return out


In [153]:
print("Dims:", 
      "complaints", qdrant_collection_dim("nhtsa_complaints"),
      "recalls", qdrant_collection_dim("nhtsa_recalls"),
      "invest", qdrant_collection_dim("nhtsa_investigations"),
      "| len(e5_query) =", len(e5_query("airbag sensor failure")))

print("— complaint —")
print(len(ask("airbag sensor failure", k=5, type_in="complaint")["hits"]))

print("— recall —")
print(len(ask("airbag sensor failure", k=5, type_in="recall")["hits"]))

print("— investigation —")
print(len(ask("airbag sensor failure", k=5, type_in="investigation")["hits"]))


Dims: complaints 256 recalls 1024 invest 1024 | len(e5_query) = 1024
— complaint —
[IPCA] cargado 1024→256 desde ..\data\embeddings\complaints_e5_mlg_instruct\ipca_model.npz
5
— recall —
5
— investigation —
5


In [115]:
# Search for recalls related to "airbag sensor failure"
ask("airbag sensor failure", k=10, type_in="recall")

{'query': 'airbag sensor failure',
 'hits': [{'id': 9680,
   'score': 0.8614178,
   'payload': {'id': '22V240000',
    'text': 'BMW of North America, LLC (BMW) is recalling certain 2022-2023 iX xDrive40, iX XDrive50, and iX M60 vehicles. The air bag malfunction indicator light and display message may not illuminate in the event of a problem with the air bag control or pedestrian protection systems, due to incorrect software.. An air bag malfunction indicator light that fails to warn the driver of a problem increases the risk of injury in a crash.. Dealers will reprogram the air bag control unit software, free of charge. Owner notification letters were mailed June 13, 2022. Owners may contact BMW customer service at 1-800-525-7417.. Owners may also contact the National Highway Traffic Safety Administration Vehicle Safety Hotline at 1-888-327-4236 (TTY 1-800-424-9153), or go to www.nhtsa.gov.',
    'chunk_id': 0,
    'chunk_id_str': '22V240000::ch0',
    'make': 'BMW',
    'model': 'intf

In [None]:
# Search investigations related to "airbag sensor failure"
ask("airbag sensor failure", k=10, type_in="investigation")

{'query': 'airbag sensor failure',
 'hits': [{'id': 3100,
   'score': 0.9264505,
   'payload': {'chunk_id': 'PE02038::ch0',
    'id': 'PE02038',
    'make': 'NISSAN',
    'model': 'intfloat/multilingual-e5-large-instruct',
    'year': None,
    'component': 'AIR BAGS:FRONTAL',
    'camp_no': 'NONE',
    'chunk_idx': 0,
    'text': "DRIVER'S AIR BAG SYSTEM MALFUNCTION nan",
    'device': 'cuda',
    'dim': 1024,
    'ts': '2025-10-12T08:13:21.694679Z'}},
  {'id': 586,
   'score': 0.91803133,
   'payload': {'chunk_id': 'DP91005::ch0',
    'id': 'DP91005',
    'make': 'STERLING',
    'model': 'intfloat/multilingual-e5-large-instruct',
    'year': None,
    'component': 'SEAT BELTS',
    'camp_no': 'NONE',
    'chunk_idx': 0,
    'text': 'AUTOMATIC SEAT BELT FAILURE AUTOMATIC SEAT BELT FAILURE',
    'device': 'cuda',
    'dim': 1024,
    'ts': '2025-10-12T08:13:21.694679Z'}},
  {'id': 606,
   'score': 0.9129369,
   'payload': {'chunk_id': 'DP92011::ch0',
    'id': 'DP92011',
    'make': 'C

In [265]:
# Search complaints with a Spanish query ("bolsa de aire falla", roughly "airbag failure")
ask("bolsa de aire falla", k=10, type_in="complaint")

{'query': 'bolsa de aire falla',
 'hits': [{'id': 'bea7123a-5219-5d0d-80c8-fdd8dbb5efa2',
   'score': 0.5490877,
   'payload': {'id': '16569',
    'make': 'GULF STREAM',
    'model': 'MOTORHOME',
    'year': 1994,
    'component': 'ENGINE AND ENGINE COOLING:ENGINE:GASOLINE:BELTS AND ASSOCIATED PULLEYS',
    '_h': '-8874416744072004341',
    'shard_idx': 0,
    'row_in_shard': 16568,
    'cluster': 6881,
    'dist2': 0.0130619490519166}},
  {'id': '02046686-d252-5b61-a31c-76a1bc38a139',
   'score': 0.54188716,
   'payload': {'id': '108631',
    'make': 'HYUNDAI',
    'model': 'ELANTRA',
    'year': 1995,
    'component': 'AIR BAGS',
    '_h': '8418913386829335213',
    'shard_idx': 4,
    'row_in_shard': 8630,
    'cluster': 19688,
    'dist2': 0.0137858092784881}},
  {'id': '1a6b60be-16e8-55ee-92b1-1100002fb6e3',
   'score': 0.52552146,
   'payload': {'id': '85319',
    'make': 'FORD',
    'model': 'PROBE',
    'year': 1993,
    'component': 'AIR BAGS:FRONTAL:SENSOR/CONTROL MODULE',
  

In [168]:
import time, random
from neo4j import GraphDatabase, unit_of_work
from neo4j.exceptions import SessionExpired, ServiceUnavailable, TransientError

driver = get_neo_driver()  # your existing helper

def cypher_retry(query: str, params=None, retries: int = 5, timeout_s: float = 90.0, access_mode: str = "READ"):
    """
    Run a Cypher query with retries and exponential backoff plus jitter.
    - access_mode: "READ" or "WRITE"
    - retries: total number of attempts before a transient error is re-raised
    """
    params = params or {}
    delay = 1.0

    # In the v5 driver, keyword arguments to tx.run() are sent as query parameters,
    # so the transaction timeout is set with the unit_of_work decorator instead.
    @unit_of_work(timeout=timeout_s)
    def _work(tx):
        res = tx.run(query, parameters=params)
        # .data() loads every row into memory; avoid it for very large result sets
        return res.data()

    for attempt in range(retries):
        try:
            if access_mode.upper() == "READ":
                with driver.session(default_access_mode="READ") as session:
                    return session.execute_read(_work)
            else:
                with driver.session(default_access_mode="WRITE") as session:
                    return session.execute_write(_work)

        except (SessionExpired, ServiceUnavailable, TransientError) as e:
            # Last attempt: re-raise
            if attempt == retries - 1:
                raise
            # Exponential backoff with jitter (0–250 ms), capped at 8 s
            jitter = random.uniform(0, 0.25)
            wait = min(delay + jitter, 8.0)
            print(f"[retry {attempt+1}/{retries}] {e.__class__.__name__}: retrying in {wait:.2f}s…")
            time.sleep(wait)
            delay *= 2.0
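
The retry loop above sleeps between attempts with a delay that doubles each time, plus a small random jitter, and never exceeds 8 seconds. The sketch below isolates that schedule so it can be inspected without a database; `backoff_waits` and its parameters are hypothetical names introduced only for illustration.

```python
import random

def backoff_waits(retries=5, base=1.0, cap=8.0, max_jitter=0.25, rng=random.random):
    """Return the sleep times used between attempts (retries - 1 waits in total)."""
    waits, delay = [], base
    for _ in range(retries - 1):            # no sleep after the final attempt
        jitter = rng() * max_jitter
        waits.append(min(delay + jitter, cap))
        delay *= 2.0
    return waits

# With jitter disabled the schedule is deterministic
print(backoff_waits(rng=lambda: 0.0))       # [1.0, 2.0, 4.0, 8.0]
```

With the defaults, a query that fails on every attempt spends at most 15.75 s sleeping before the last error is raised.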


In [264]:
cypher_retry("""
MATCH ()-[r:SIMILAR_TO]->() RETURN count(r) AS rels
""")


[{'rels': 186359}]

In [216]:
# =========================
# SIMILAR_TO visualization
# =========================
# Dependencies (Colab)
import math
import os
import time
from collections import defaultdict
from neo4j import Query
from pyvis.network import Network
from IPython.display import IFrame, display

# --- Neo4j helpers (credentials come from environment variables) ---
try:
    driver  # reuse the driver if an earlier cell already created it
except NameError:
    from neo4j import GraphDatabase
    NEO4J_URI = os.getenv("NEO4J_URI", "")
    NEO4J_USER = os.getenv("NEO4J_USER", "")
    NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "")
    assert NEO4J_URI and NEO4J_USER and NEO4J_PASSWORD, "Missing Neo4j credentials in environment variables"
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

def cypher_retry(query, params=None, retries=4, timeout_s=90):
    # Simplified variant that overrides the earlier cypher_retry: retries on any exception
    params = params or {}
    delay = 1.0
    for t in range(retries):
        try:
            with driver.session() as s:
                # The timeout travels on the Query object; keyword arguments to run()
                # would be sent to the server as query parameters.
                return s.run(Query(query, timeout=timeout_s), parameters=params).data()
        except Exception:
            if t == retries - 1:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 8.0)

# ------------- Config -------------
# Pick ONE of these sampling modes:
seed_id        = None            # e.g. "22V240000" for an ego network
seed_make      = "TOYOTA"        # or filter by MAKE (use uppercase if you normalized that way)
seed_component = None            # or filter by COMPONENT (e.g. "AIR BAGS")
year_min, year_max = None, None  # year bounds (e.g. 2015, 2022)

# Sampling and trimming sizes
limit_neighbors    = 100         # row limit for the ego-network query when seed_id is set
limit_query_nodes  = 250         # cap on nodes selected by the filters (make/component)
max_nodes_render   = 220         # final cap on rendered nodes (performance)

# ------------- Query builder -------------
if seed_id:
    # 1-hop ego network around the given recall
    QUERY = """
    MATCH (r:Recall {id:$id})-[:SIMILAR_TO]-(n:Recall)
    WHERE ($ymin IS NULL OR n.year >= $ymin) AND ($ymax IS NULL OR n.year <= $ymax)
    WITH collect(r) + collect(n) AS nodes
    UNWIND nodes AS m
    WITH DISTINCT m
        MATCH (m)-[r:SIMILAR_TO]-(m2:Recall)
        RETURN DISTINCT
        m.id        AS id,
        m.camp_no   AS camp_no,
        m.make      AS make,
        m.model     AS model,
        m.year      AS year,
        m.component AS component,
        m.summary   AS summary,
        m2.id       AS id2,
        r.reason    AS reason
    LIMIT $lim
    """
    params = {"id": seed_id, "lim": limit_neighbors, "ymin": year_min, "ymax": year_max}
else:
    # Filter by make/component (and optional year range), then collect the SIMILAR_TO edges of the selected recalls
    # Note: $lim is applied by slicing the collected id list rather than with a LIMIT clause.
    QUERY = """
    MATCH (r:Recall)
    WHERE ($mk IS NULL OR r.make = $mk)
        AND ($comp IS NULL OR r.component = $comp)
        AND ($ymin IS NULL OR r.year >= $ymin)
        AND ($ymax IS NULL OR r.year <= $ymax)
    WITH r
    ORDER BY r.id
    WITH collect(r.id) AS all_ids
    WITH all_ids[..$lim] AS ids
    UNWIND ids AS id
    MATCH (a:Recall {id:id})-[r:SIMILAR_TO]-(b:Recall)
    RETURN DISTINCT
        a.id AS id,  a.camp_no AS camp_no,  a.make AS make,
        a.model AS model, a.year AS year, a.component AS component,
        a.summary AS summary,
        b.id AS id2, b.camp_no AS camp_no_b, b.make AS make_b,
        b.model AS model_b, b.year AS year_b, b.component AS component_b,
        b.summary AS summary_b,
        r.reason AS reason
    """
    params = {
        "mk": seed_make,
        "comp": seed_component,
        "ymin": year_min,
        "ymax": year_max,
        "lim": limit_query_nodes
    }

rows = cypher_retry(QUERY, params)

# ------------- In-memory graph construction -------------
nodes = {}                  # id -> props
edges = set()               # undirected (u, v) pairs
edge_color = {}             # (u, v) -> color by reason
deg   = defaultdict(int)

def edge_color_by_reason(reason: str) -> str:
    reason = (reason or "").upper()
    if "MAKE_MODEL" in reason:
        return "#2ca02c"    # green: make+model±1
    if "COMPONENT" in reason:
        return "#d62728"    # red: component±1
    return "#999999"        # generic / empty / mixed

for r in rows:
    u = r["id"]
    v = r["id2"]
    if u is None or v is None or u == v:
        continue

    # store the props of node u
    if u not in nodes:
        nodes[u] = {
            "id": u,
            "camp_no": r.get("camp_no"),
            "make": r.get("make"),
            "model": r.get("model"),
            "year": r.get("year"),
            "component": r.get("component"),
            "summary": r.get("summary"),
        }
    # also register v so both endpoints get tooltips (its props are None in the ego-network query)
    if v not in nodes:
        nodes[v] = {
            "id": v,
            "camp_no": r.get("camp_no_b"),
            "make": r.get("make_b"),
            "model": r.get("model_b"),
            "year": r.get("year_b"),
            "component": r.get("component_b"),
            "summary": r.get("summary_b"),
        }

    # undirected edge (endpoints sorted so each pair is stored once)
    a, b = sorted([u, v])
    if (a, b) not in edges:
        edges.add((a, b))
        edge_color[(a, b)] = edge_color_by_reason(r.get("reason"))
        deg[a] += 1
        deg[b] += 1

# Keep only the highest-degree nodes so the render stays responsive
if len(nodes) > max_nodes_render:
    top_nodes = set(sorted(nodes.keys(), key=lambda x: deg.get(x, 0), reverse=True)[:max_nodes_render])
    edges = {(u, v) for (u, v) in edges if u in top_nodes and v in top_nodes}
    nodes = {nid: nodes[nid] for nid in top_nodes}

# ------------- PyVis: style and render -------------
# (the output path is set in the export step at the end of this cell)
net = Network(height="700px", width="100%", directed=False, notebook=False, cdn_resources="in_line")
net.force_atlas_2based(gravity=-50, central_gravity=0.005, spring_length=120, damping=0.4)

def color_for_make(make: str) -> str:
    palette = [
        "#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd",
        "#8c564b","#e377c2","#7f7f7f","#bcbd22","#17becf"
    ]
    if not make:
        return "#999999"
    return palette[hash(make) % len(palette)]

def color_for_component(comp: str) -> str:
    if comp and "AIR" in comp.upper():
        return "#17becf"
    if comp and "BRAKE" in comp.upper():
        return "#d62728"
    if comp and "ENGINE" in comp.upper():
        return "#ff7f0e"
    if comp and "ELECTR" in comp.upper():
        return "#9467bd"
    return "#7f7f7f"

for nid, props in nodes.items():
    make = props.get("make")
    comp = props.get("component")
    degree = deg.get(nid, 1)
    size = 8 + 3 * math.log2(1 + degree)
    title = (
        f"<b>{props.get('camp_no') or nid}</b><br>"
        f"<b>Make/Model:</b> {make or '-'} / {props.get('model') or '-'}<br>"
        f"<b>Year:</b> {props.get('year') or '-'}<br>"
        f"<b>Component:</b> {comp or '-'}<br><hr>"
        f"{(props.get('summary') or '')[:400]}{'…' if (props.get('summary') and len(props.get('summary'))>400) else ''}"
    )
    node_color = color_for_make(make)
    net.add_node(
        nid,
        label=(props.get("camp_no") or nid),
        title=title,
        color=node_color,
        borderWidthSelected=3,
        shape="dot",
        size=size
    )

for (u, v) in edges:
    net.add_edge(u, v, color=edge_color.get((u, v), "#cccccc"), width=1)

# ---------- Export (standalone HTML, no IFrame/localhost) ----------
import os

# Save the HTML in the notebook's current directory
html_path = os.path.abspath("pyvis_similar_recalls.html")
net.write_html(html_path, notebook=False)

print(f"✅ Grafo guardado en: {html_path}  (si no se ve, ábrelo en el navegador)")


✅ Grafo guardado en: c:\Users\moral\tec_final\notebooks\pyvis_similar_recalls.html  (si no se ve, ábrelo en el navegador)
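
`color_for_make` in the cell above indexes the palette with Python's built-in `hash()`. For strings that value is salted per process unless `PYTHONHASHSEED` is fixed, so the same make can get a different color after a kernel restart. A minimal sketch of a deterministic alternative follows, using the same MD5 approach as the `hash_color` helper in later cells; the name `stable_color` is hypothetical.

```python
import hashlib

PALETTE = [
    "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
    "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
]

def stable_color(key, palette=PALETTE, default="#999999"):
    """Map a label to a palette color that is identical across Python processes."""
    if not key:
        return default
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return palette[int(digest, 16) % len(palette)]

for make in ["TOYOTA", "FORD", None]:
    print(make, stable_color(make))
```

Different makes can still collide on one color, because the palette has only ten entries.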


In [None]:
import pandas as pd
from collections import Counter

QUERY = "airbag sensor failure"
K = 200

res = ask(QUERY, k=K, type_in="complaint")
C = res.get("complaints", []) or []
df = pd.DataFrame(C)

if df.empty:
    print("Still empty. Check that :Complaint nodes exist with id/qid/comp_id properties.")
else:
    for col in ["make","model","component"]:
        if col in df.columns:
            df[col] = df[col].fillna("").astype(str).str.strip().str.upper()

    def top_counts(col, n=10):
        if col not in df.columns: 
            return pd.DataFrame({col: [], "count": []})
        s = df[col][df[col]!=""]
        return pd.DataFrame(Counter(s).most_common(n), columns=[col,"count"])

    print(f"Complaints devueltas: {len(df)}  |  query: {QUERY}\n")
    print("Top MAKE");      display(top_counts("make"))
    print("Top MODEL");     display(top_counts("model"))
    print("Top COMPONENT"); display(top_counts("component"))

    if {"make","component"}.issubset(df.columns):
        pairs = (df.loc[(df["make"]!="") & (df["component"]!=""), ["make","component"]]
                   .value_counts().rename("count").reset_index().head(15))
        print("Top MAKE–COMPONENT"); display(pairs)


Complaints devueltas: 200  |  query: airbag sensor failure

Top MAKE


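|   | make       | count |
| - | ---------- | ----- |
| 0 | FORD       | 35    |
| 1 | CHEVROLET  | 29    |
| 2 | DODGE      | 28    |
| 3 | JEEP       | 10    |
| 4 | MAZDA      | 9     |
| 5 | PLYMOUTH   | 8     |
| 6 | MERCURY    | 8     |
| 7 | GMC        | 6     |
| 8 | CHRYSLER   | 6     |
| 9 | OLDSMOBILE | 6     |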


Top MODEL


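|   | model          | count |
| - | -------------- | ----- |
| 0 | CARAVAN        | 9     |
| 1 | TAURUS         | 8     |
| 2 | SUBURBAN       | 5     |
| 3 | VOYAGER        | 5     |
| 4 | RAM            | 4     |
| 5 | EXPLORER       | 4     |
| 6 | GRAND CHEROKEE | 4     |
| 7 | 626            | 3     |
| 8 | BLAZER         | 3     |
| 9 | CAPRICE        | 3     |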


Top COMPONENT


|   | component                                          | count |
| - | -------------------------------------------------- | ----- |
| 0 | SERVICE BRAKES, HYDRAULIC:ANTILOCK/TRACTION CO...  | 27    |
| 1 | SEAT BELTS:FRONT:BUCKLE ASSEMBLY                   | 14    |
| 2 | AIR BAGS:FRONTAL                                   | 11    |
| 3 | ENGINE AND ENGINE COOLING:EXHAUST SYSTEM:EMISS...  | 11    |
| 4 | POWER TRAIN:AUTOMATIC TRANSMISSION                 | 8     |
| 5 | SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS    | 8     |
| 6 | AIR BAGS                                           | 7     |
| 7 | AIR BAGS:FRONTAL:SENSOR/CONTROL MODULE             | 6     |
| 8 | SEAT BELTS:FRONT:ANCHORAGE                         | 6     |
| 9 | VEHICLE SPEED CONTROL                              | 6     |


Top MAKE–COMPONENT


|   | make      | component                                          | count |
| - | --------- | -------------------------------------------------- | ----- |
| 0 | CHEVROLET | SERVICE BRAKES, HYDRAULIC:ANTILOCK/TRACTION CO...  | 7     |
| 1 | GMC       | SERVICE BRAKES, HYDRAULIC:ANTILOCK/TRACTION CO...  | 4     |
| 2 | FORD      | SERVICE BRAKES, HYDRAULIC:ANTILOCK/TRACTION CO...  | 4     |
| 3 | FORD      | AIR BAGS:FRONTAL:SENSOR/CONTROL MODULE             | 4     |
| 4 | MAZDA     | SEAT BELTS:FRONT:BUCKLE ASSEMBLY                   | 3     |
| 5 | DODGE     | ENGINE AND ENGINE COOLING:EXHAUST SYSTEM:EMISS...  | 3     |
| 6 | PLYMOUTH  | SERVICE BRAKES, HYDRAULIC:ANTILOCK/TRACTION CO...  | 3     |
| 7 | FORD      | SEAT BELTS:FRONT:BUCKLE ASSEMBLY                   | 3     |
| 8 | DODGE     | SERVICE BRAKES, HYDRAULIC:ANTILOCK/TRACTION CO...  | 3     |
| 9 | DODGE     | VEHICLE SPEED CONTROL                              | 2     |


In [193]:
C = ask("airbag sensor failure", k=100, type_in="complaint")["complaints"]
pairs = [(c["make"], c["component"]) for c in C if c.get("make") and c.get("component")]

from collections import Counter
edges = Counter(pairs).most_common(30)  # (make, component) -> weight

from pyvis.network import Network
net = Network(height="650px", width="100%", directed=False, notebook=True, cdn_resources="remote")

seen = set()
for (mk, comp), w in edges:
    make_id, comp_id = "M:" + mk, "C:" + comp
    if make_id not in seen:
        net.add_node(make_id, label=mk, color="#1f77b4", shape="box")
        seen.add(make_id)
    if comp_id not in seen:
        net.add_node(comp_id, label=comp, color="#d62728", shape="dot")
        seen.add(comp_id)
    net.add_edge(make_id, comp_id, width=min(8, max(1, w // 2)))
net.show("bip_make_component.html")



bip_make_component.html
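
The bipartite graph above weights each make–component edge by how often the pair appears among the retrieved complaints, then clamps the drawn width to the range 1 to 8. A small sketch of that weighting on toy data follows; the function name `weighted_pairs` and the sample pairs are made up for illustration.

```python
from collections import Counter

def weighted_pairs(pairs, top=30):
    """Count (make, component) pairs and map each count to an edge width in [1, 8]."""
    out = []
    for (make, comp), w in Counter(pairs).most_common(top):
        out.append((make, comp, w, min(8, max(1, w // 2))))
    return out

# Toy input, not real NHTSA data
sample = ([("FORD", "AIR BAGS")] * 5
          + [("DODGE", "SEAT BELTS")] * 1
          + [("GMC", "BRAKES")] * 40)

for row in weighted_pairs(sample):
    print(row)
```

Because of the integer division, pairs seen one to three times all get width 1, and anything seen 16 or more times saturates at width 8.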


In [194]:
R = ask("airbag sensor failure", k=100, type_in="recall")["recalls"]
yrs = [int(str(r.get("recall_date"))[:4]) for r in R if r.get("recall_date")]
from collections import Counter
print(sorted(Counter(yrs).items()))  # (año, #recalls)

[(2010, 4), (2011, 1), (2012, 2), (2013, 3), (2014, 3), (2015, 2), (2016, 7), (2017, 18), (2018, 13), (2019, 12), (2020, 4), (2021, 13), (2022, 5), (2023, 2), (2024, 7), (2025, 4)]


In [199]:
# Build (make, model) pairs from the complaints of P1
def norm(s):
    # Uppercase, turn hyphens into spaces, and collapse any run of whitespace
    return " ".join(str(s).upper().replace("-", " ").split())

pairs = sorted({(norm(c.get("make")), norm(c.get("model")))
                for c in (C or []) if c.get("make") and c.get("model")})
pairs = [{"make": mk, "model": md} for (mk, md) in pairs]

q_strict = """
UNWIND $pairs AS mm
MATCH (r:Recall)
WITH r, mm, coalesce(r.year, toInteger(left(r.recall_date,4))) AS ryear
WHERE toUpper(r.make) = mm.make
  AND toUpper(r.model) = mm.model
  AND ryear IS NOT NULL

MATCH (c:Complaint)
WHERE toUpper(c.make) = mm.make
  AND toUpper(c.model) = mm.model
  AND c.year IS NOT NULL
  AND abs(ryear - c.year) <= 1

RETURN r.id AS recall_id, c.qid AS complaint_qid,
       ryear AS recall_year, c.year AS complaint_year,
       mm.make AS make, mm.model AS model
LIMIT 300
"""

rows_strict = cypher(q_strict, {"pairs": pairs})
print("Matches (estricto, ±1):", len(rows_strict))
rows_strict[:10]


Matches (estricto, ±1): 228


[{'recall_id': '17V820000',
  'complaint_qid': '7da610cd-f95a-594b-8cd9-a0a6a44d2cbd',
  'recall_year': 2005,
  'complaint_year': 2004,
  'make': 'DODGE',
  'model': 'DAKOTA'},
 {'recall_id': '17V820000',
  'complaint_qid': '6aa8e9a9-95aa-5e2f-9bba-114e5323d396',
  'recall_year': 2005,
  'complaint_year': 2004,
  'make': 'DODGE',
  'model': 'DAKOTA'},
 {'recall_id': '14V796000',
  'complaint_qid': '07a70f0f-8847-59fd-b04f-4a7be15359bc',
  'recall_year': 2005,
  'complaint_year': 2004,
  'make': 'DODGE',
  'model': 'RAM 1500'},
 {'recall_id': '15V312000',
  'complaint_qid': 'abe226a6-683f-5e88-8215-6f4cafe28399',
  'recall_year': 2003,
  'complaint_year': 2003,
  'make': 'DODGE',
  'model': 'RAM 1500'},
 {'recall_id': '15V312000',
  'complaint_qid': 'dfe9cbca-2136-54d3-804c-691dd0d86a27',
  'recall_year': 2003,
  'complaint_year': 2002,
  'make': 'DODGE',
  'model': 'RAM 1500'},
 {'recall_id': '15V312000',
  'complaint_qid': '894ab862-5d22-5b8b-a35a-917b00c8d15f',
  'recall_year': 2003,

In [202]:
import os, hashlib
from pyvis.network import Network


assert 'rows_strict' in globals() and rows_strict, "rows_strict is empty. Run step 3 first."

# --- helpers ---
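def hash_color(s):
    # Stable 10-color palette keyed by make (same palette as the other cells)
    palette = ["#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd",
               "#8c564b","#e377c2","#7f7f7f","#bcbd22","#17becf"]
    h = int(hashlib.md5(str(s).encode()).hexdigest(), 16)
    return palette[h % len(palette)]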

def edge_color_by_yeardiff(d):
    if d == 0: return "#2ca02c"   # green
    if abs(d) == 1: return "#ff7f0e"  # orange
    return "#d62728"              # red (in case another scenario brings in ±2)

# --- copy the records and optionally cap their number ---
rows = rows_strict[:]  # copy
MAX_EDGES = 220        # raise or lower if rendering gets heavy
if len(rows) > MAX_EDGES:
    rows = rows[:MAX_EDGES]

# --- build the graph ---
net = Network(height="720px", width="100%", directed=False, notebook=True, cdn_resources="remote")
net.force_atlas_2based(gravity=-30, central_gravity=0.01, spring_length=140, damping=0.6)

seen = set()

for r in rows:
    rid  = r.get("recall_id")
    qid  = r.get("complaint_qid")
    ry   = r.get("recall_year")
    cy   = r.get("complaint_year")
    mk   = (r.get("make") or "").strip()
    md   = (r.get("model") or "").strip()

    if not rid or not qid: 
        continue

    # Recall node (box, colored by make)
    rec_node_id = f"R:{rid}"
    if rec_node_id not in seen:
        color = hash_color(mk or "NA")
        title = (f"<b>RECALL</b><br>"
                 f"<b>ID:</b> {rid}<br>"
                 f"<b>Make/Model:</b> {mk or '-'} / {md or '-'}<br>"
                 f"<b>Year:</b> {ry or '-'}")
        net.add_node(rec_node_id, label=rid, color=color, shape="box",
                     borderWidthSelected=3, size=16, title=title)
        seen.add(rec_node_id)

    # Complaint node (gray circle)
    comp_node_id = f"C:{qid}"
    if comp_node_id not in seen:
        title = (f"<b>COMPLAINT</b><br>"
                 f"<b>QID:</b> {qid}<br>"
                 f"<b>Year (vehicle):</b> {cy or '-'}")
        net.add_node(comp_node_id, label=str(qid), color="#8c8c8c",
                     shape="dot", borderWidthSelected=2, size=10, title=title)
        seen.add(comp_node_id)

    # Edge (colored by year difference)
    if ry is not None and cy is not None:
        d = int(ry) - int(cy)
    else:
        d = 1  # missing year: falls back to the ±1 (orange) color
    net.add_edge(rec_node_id, comp_node_id, color=edge_color_by_yeardiff(d), width=1)

# --- save and show ---
out_html = os.path.abspath("graph_recall_complaint.html")
net.show(out_html)
print(f"✅ Grafo guardado y mostrado: {out_html}  (si no se embebe, ábrelo en el navegador)")

c:\Users\moral\tec_final\notebooks\graph_recall_complaint.html
✅ Grafo guardado y mostrado: c:\Users\moral\tec_final\notebooks\graph_recall_complaint.html  (si no se embebe, ábrelo en el navegador)


In [266]:
# === Complaint↔Recall MATCH using only what comes from C (ignores R) ===
def norm(s):
    # Uppercase, turn separators into spaces, and collapse any run of whitespace
    return " ".join(str(s).upper()
                    .replace("-", " ").replace("_", " ").replace("/", " ")
                    .split())

C = ask(QUERY, k=600, type_in="complaint").get("complaints", []) or []
# R and I are kept if they were already loaded, but the match does not need them

pairs_mm = sorted({(norm(c.get("make")), norm(c.get("model")))
                   for c in C if c.get("make") and c.get("model")})
pairs_mm = [{"make": mk, "model": md} for (mk, md) in pairs_mm]

print("Familias make/model (desde C):", len(pairs_mm))

q_match_allR = """
UNWIND $pairs AS mm
// Recalls per make/model family
MATCH (r:Recall)
WITH r, mm, coalesce(r.year, toInteger(left(r.recall_date,4))) AS ryear
WHERE ryear IS NOT NULL
  AND toUpper(r.make)  = mm.make
  AND toUpper(r.model) = mm.model

// Complaints in the same family
MATCH (c:Complaint)
WHERE c.year IS NOT NULL
  AND toUpper(c.make)  = mm.make
  AND toUpper(c.model) = mm.model

// More tolerant time window
AND abs(ryear - c.year) <= 2

RETURN DISTINCT
  r.id AS rid, ryear AS ryear,
  toString(coalesce(c.qid, c.id)) AS cqid, c.year AS cyear,
  mm.make AS make, mm.model AS model
LIMIT 8000
"""

rowsM = cypher(q_match_allR, {"pairs": pairs_mm})
edges_match = [("R:"+r["rid"], "C:"+r["cqid"], "MATCH_CR", int(r["ryear"]), int(r["cyear"]))
               for r in rowsM if r.get("rid") and r.get("cqid")]
print("Matches (±2) =", len(edges_match))


Familias make/model (desde C): 222
Matches (±2) = 470
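
The Cypher query above pairs each recall with complaints from the same make/model family whose years differ by at most two. The same rule can be sketched in plain Python, which is handy for testing the logic on a few records before querying Neo4j. The function names, the toy records, and the choice to normalize both sides with `norm` are assumptions; the Cypher version only applies `toUpper` on the database side.

```python
def norm(s):
    """Uppercase, treat -, _ and / as spaces, and collapse repeated whitespace."""
    return " ".join(
        str(s).upper().replace("-", " ").replace("_", " ").replace("/", " ").split()
    )

def match_by_family(recalls, complaints, window=2):
    """Pair recalls and complaints that share make/model and lie within `window` years."""
    by_family = {}
    for c in complaints:
        if c.get("year") is None or not c.get("make") or not c.get("model"):
            continue
        by_family.setdefault((norm(c["make"]), norm(c["model"])), []).append(c)

    matches = []
    for r in recalls:
        if r.get("year") is None or not r.get("make") or not r.get("model"):
            continue
        for c in by_family.get((norm(r["make"]), norm(r["model"])), []):
            diff = r["year"] - c["year"]
            if abs(diff) <= window:
                matches.append((r["id"], c["id"], diff))
    return matches

# Toy records, not real NHTSA data
recalls = [{"id": "R1", "make": "Dodge", "model": "Ram-1500", "year": 2003}]
complaints = [
    {"id": "C1", "make": "DODGE", "model": "RAM 1500", "year": 2002},
    {"id": "C2", "make": "DODGE", "model": "RAM 1500", "year": 2006},
    {"id": "C3", "make": "DODGE", "model": "DAKOTA", "year": 2003},
]
print(match_by_family(recalls, complaints))   # [('R1', 'C1', 1)]
```

Grouping complaints by family first keeps the comparison count proportional to the number of candidates per family rather than to the full cross product.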


In [252]:
# ============================================
# Unified subgraph: Complaints / Recalls / Investigations
# + Make / Model / Component
# ============================================
from pyvis.network import Network
import os, hashlib

# -------- Config --------
QUERY = "airbag sensor failure"
K_COMPLAINTS = 200
K_RECALLS    = 200
K_INVEST     = 200
INV_CAND_LIMIT = 1000   # cap on INV_CAND edges to include

# -------- Helpers -------
def norm(s):
    # Uppercase, turn separators into spaces, and collapse any run of whitespace
    return " ".join(str(s).upper()
                    .replace("-", " ").replace("_", " ").replace("/", " ")
                    .split())

def hash_color(s):
    palette = ["#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd",
               "#8c564b","#e377c2","#7f7f7f","#bcbd22","#17becf"]
    h = int(hashlib.md5(str(s).encode()).hexdigest(), 16)
    return palette[h % len(palette)]

def edge_color(kind):
    return {
        "OF_MAKE":    "#2ca02c",
        "OF_MODEL":   "#17becf",
        "MENTIONS":   "#e377c2",
        "RELATES_TO": "#1f77b4",
    }.get(kind, "#999999")

# ======================================================
# 1) Retrieve the result sets with ask()
# ======================================================
C = ask(QUERY, k=K_COMPLAINTS, type_in="complaint").get("complaints", []) or []
R = ask(QUERY, k=K_RECALLS,    type_in="recall").get("recalls", []) or []
I = ask(QUERY, k=K_INVEST,     type_in="investigation").get("investigations", []) or []

cids = sorted({str(c.get("qid") or c.get("id")) for c in C if (c.get("qid") or c.get("id"))})
rids = sorted({r.get("id") for r in R if r.get("id")})
iids = sorted({i.get("id") for i in I if i.get("id")})
print(f"Complaints: {len(cids)} | Recalls: {len(rids)} | Investigations: {len(iids)}")

# ======================================================
# 2) Edges already stored in Neo4j (OF_MAKE / OF_MODEL / MENTIONS / RELATES_TO)
# ======================================================
edges_existing = []

# Complaints → Make/Model/Component
q_c = """
UNWIND $cids AS cid
MATCH (c:Complaint)
WHERE toString(c.qid)=cid OR toString(c.id)=cid OR toString(c.comp_id)=cid
OPTIONAL MATCH (c)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (c)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (c)-[:MENTIONS]->(cp:Component)
// Return the matched input id so edges line up with the "C:"+cid node keys (c.qid may be null)
RETURN cid AS cqid, mk.name AS mk, md.name AS md, cp.name AS cp
"""
rowsC = cypher(q_c, {"cids": cids})
for r in rowsC:
    cqid = r["cqid"]
    if r.get("mk"): edges_existing.append(("C:"+cqid, "M:"+r["mk"], "OF_MAKE"))
    if r.get("md"): edges_existing.append(("C:"+cqid, "D:"+r["md"], "OF_MODEL"))
    if r.get("cp"): edges_existing.append(("C:"+cqid, "X:"+r["cp"], "MENTIONS"))

# Recalls → Make/Model/Component (+ props base)
q_r = """
UNWIND $rids AS rid
MATCH (r:Recall {id:rid})
OPTIONAL MATCH (r)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (r)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (r)-[:MENTIONS]->(cp:Component)
RETURN r.id AS rid, mk.name AS mk, md.name AS md, cp.name AS cp,
       r.make AS r_make, r.model AS r_model, coalesce(r.year, toInteger(left(r.recall_date,4))) AS r_year
"""
rowsR = cypher(q_r, {"rids": rids})
recall_props = {r["rid"]: {"make": r.get("r_make"), "model": r.get("r_model"), "year": r.get("r_year")} for r in rowsR}
for r in rowsR:
    rid = r["rid"]
    if r.get("mk"): edges_existing.append(("R:"+rid, "M:"+r["mk"], "OF_MAKE"))
    if r.get("md"): edges_existing.append(("R:"+rid, "D:"+r["md"], "OF_MODEL"))
    if r.get("cp"): edges_existing.append(("R:"+rid, "X:"+r["cp"], "MENTIONS"))

# Investigations → (RELATES_TO / OF_MAKE / OF_MODEL / MENTIONS)
q_i_full = """
UNWIND $iids AS iid
MATCH (i:Investigation {id:iid})
OPTIONAL MATCH (i)-[:RELATES_TO]->(r:Recall)
OPTIONAL MATCH (i)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (i)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (i)-[:MENTIONS]->(cp:Component)
RETURN i.id AS iid,
       collect(DISTINCT r.id)       AS rids,
       collect(DISTINCT mk.name)    AS makes,
       collect(DISTINCT md.name)    AS models,
       collect(DISTINCT cp.name)    AS components
"""
rowsI = cypher(q_i_full, {"iids": iids})
for row in rowsI:
    iid = row["iid"]
    for rid in row.get("rids", []) or []:
        edges_existing.append(("I:"+iid, "R:"+rid, "RELATES_TO"))
    for mk in row.get("makes", []) or []:
        edges_existing.append(("I:"+iid, "M:"+mk, "OF_MAKE"))
    for md in row.get("models", []) or []:
        edges_existing.append(("I:"+iid, "D:"+md, "OF_MODEL"))
    for cp in row.get("components", []) or []:
        edges_existing.append(("I:"+iid, "X:"+cp, "MENTIONS"))

# ======================================================
# 3) Complaint ↔ Recall matches (make/model family; ±2 window; does NOT depend on rids)
# ======================================================
pairs_mm = sorted({(norm(c.get("make")), norm(c.get("model")))
                   for c in C if c.get("make") and c.get("model")})
pairs_mm = [{"make": mk, "model": md} for (mk, md) in pairs_mm]
print("Familias make/model (desde C):", len(pairs_mm))

q_match_allR = """
UNWIND $pairs AS mm
// Recalls per make/model family
MATCH (r:Recall)
WITH r, mm, coalesce(r.year, toInteger(left(r.recall_date,4))) AS ryear
WHERE ryear IS NOT NULL
  AND toUpper(r.make)  = mm.make
  AND toUpper(r.model) = mm.model

// Complaints in the same family
MATCH (c:Complaint)
WHERE c.year IS NOT NULL
  AND toUpper(c.make)  = mm.make
  AND toUpper(c.model) = mm.model

AND abs(ryear - c.year) <= 2

RETURN DISTINCT
  r.id AS rid, ryear AS ryear,
  toString(coalesce(c.qid, c.id)) AS cqid, c.year AS cyear
LIMIT 8000
"""
rowsM = cypher(q_match_allR, {"pairs": pairs_mm})
edges_match = [("R:"+r["rid"], "C:"+r["cqid"], ("MATCH_CR", int(r["ryear"])-int(r["cyear"])))
               for r in rowsM if r.get("rid") and r.get("cqid")]
print("Matches (±2) =", len(edges_match))

# ======================================================
# 4) Make sure ALL referenced recalls get a node (from investigations and matches)
# ======================================================
rids_from_inv   = { rid for row in rowsI for rid in (row.get("rids") or []) if rid }
rids_from_match = { r["rid"] for r in rowsM }
rids_all = sorted(set(rids) | rids_from_inv | rids_from_match)

# Fetch/complete props for the additional recalls
q_r_props = """
UNWIND $rids AS rid
MATCH (r:Recall {id:rid})
RETURN r.id AS rid,
       r.make AS r_make,
       r.model AS r_model,
       coalesce(r.year, toInteger(left(r.recall_date,4))) AS r_year
"""
rowsR_all = cypher(q_r_props, {"rids": rids_all})
for r in rowsR_all:
    recall_props[r["rid"]] = {
        "make":  r.get("r_make"),
        "model": r.get("r_model"),
        "year":  r.get("r_year"),
    }
print(f"Recalls a dibujar (rids_all): {len(rids_all)} | top-K R: {len(rids)} | inv: {len(rids_from_inv)} | match: {len(rids_from_match)}")

# ======================================================
# 5) Investigation→Recall candidates by family (INV_CAND)
# ======================================================
edges_inv_candidates = []
rids_set = set(rids_all)  # uses the expanded universe
# quick lookup of R props (for family matching)
R_by_id = {r.get("id"): r for r in R if r.get("id")}

for row in rowsI:
    iid = row["iid"]
    makes  = {m for m in (row.get("makes") or []) if m}
    models = {m for m in (row.get("models") or []) if m}
    comps  = {c for c in (row.get("components") or []) if c}

    for rid in rids_all:
        # if the recall's props are not in R, that is fine: look them up in recall_props
        rp = recall_props.get(rid, {})
        ok = False
        if rp.get("make")  in makes:  ok = True
        if rp.get("model") in models: ok = True
        # component (only if you stored it in recall_props; otherwise leave this line out)
        # if rp.get("component") in comps: ok = True
        if ok:
            edges_inv_candidates.append(("I:"+iid, "R:"+rid, "INV_CAND"))

# ======================================================
# 6) Build NODES
# ======================================================
nodes = {}
def add_node(node_id, label, ntype, meta=None):
    if node_id in nodes: return
    meta = meta or {}
    nodes[node_id] = {"label": label, "type": ntype, **meta}

# Complaints, Recalls, Investigations (investigations use iids_all)
iids_from_ask     = set(iids)
iids_from_rowsI   = { row["iid"] for row in rowsI }
iids_from_invCand = { u[2:] for (u,_,_) in edges_inv_candidates if u.startswith("I:") }
iids_all          = sorted(iids_from_ask | iids_from_rowsI | iids_from_invCand)

for cid in cids:      add_node("C:"+cid, cid, "Complaint")
for rid in rids_all:  add_node("R:"+rid, rid, "Recall", recall_props.get(rid, {}))
for iid in iids_all:  add_node("I:"+iid, iid, "Investigation")

# Make/Model/Component nodes derived from the existing edges
for (_, dst, kind) in edges_existing:
    if dst.startswith("M:"): add_node(dst, dst[2:], "Make")
    if dst.startswith("D:"): add_node(dst, dst[2:], "Model")
    if dst.startswith("X:"): add_node(dst, dst[2:], "Component")

# ======================================================
# 7) Build EDGES (unified) and PRUNE isolated nodes
# ======================================================
edges_all = []
edges_all += [(u, v, k)            for (u, v, k) in edges_existing]
edges_all += [(u, v, k)            for (u, v, k) in edges_match]
edges_all += [(u, v, "INV_CAND")   for (u, v, _) in edges_inv_candidates[:INV_CAND_LIMIT]]

# compute degree and drop isolated nodes
deg = {}
for (u, v, k) in edges_all:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1

nodes = {nid: props for nid, props in nodes.items() if deg.get(nid, 0) > 0}
edges_all = [e for e in edges_all if e[0] in nodes and e[1] in nodes]
print("🔎 Nodos render:", len(nodes), "| Aristas render:", len(edges_all))

# ======================================================
# 8) Render with PyVis and save the HTML
# ======================================================
net = Network(height="780px", width="100%", directed=False, notebook=True, cdn_resources="remote")
net.force_atlas_2based(gravity=-30, central_gravity=0.01, spring_length=140, damping=0.6)

def style_for(nid, props):
    t = props["type"]
    if t == "Recall":
        mk = (props.get("make") or "").strip()
        color = "#7e57c2"  # alternative: hash_color(mk or "NA") to color by make
        return {"shape":"box", "color":color, "size":18}
    if t == "Complaint":
        return {"shape":"dot", "color":"#8c8c8c", "size":10}
    if t == "Investigation":
        return {"shape":"diamond", "color":"#ff9800", "size":14}
    if t == "Make":
        return {"shape":"box", "color":"#2ca02c", "size":12}
    if t == "Model":
        return {"shape":"box", "color":"#17becf", "size":12}
    if t == "Component":
        return {"shape":"dot", "color":"#e377c2", "size":12}
    return {"shape":"ellipse", "color":"#999999", "size":10}

def tooltip(nid, p):
    t = p["type"]
    if t == "Recall":
        return (f"<b>RECALL</b><br><b>ID:</b> {p['label']}<br>"
                f"<b>Make/Model:</b> {p.get('make') or '-'} / {p.get('model') or '-'}<br>"
                f"<b>Year:</b> {p.get('year') or '-'}")
    if t == "Complaint":
        return f"<b>COMPLAINT</b><br><b>QID:</b> {p['label']}"
    if t == "Investigation":
        return f"<b>INVESTIGATION</b><br><b>ID:</b> {p['label']}"
    return f"<b>{t.upper()}</b><br>{p['label']}"

for nid, props in nodes.items():
    st = style_for(nid, props)
    net.add_node(nid, label=str(props["label"]), title=tooltip(nid, props),
                 color=st["color"], shape=st["shape"], size=st["size"], borderWidthSelected=3)

for (u, v, k) in edges_all:
    if isinstance(k, tuple) and k[0] == "MATCH_CR":
        dy = k[1]
        col = "#2ca02c" if dy == 0 else ("#ff7f0e" if abs(dy)==1 else "#d62728")
        net.add_edge(u, v, color=col, width=2, title=f"MATCH_CR Δy={dy}")
    elif k == "INV_CAND":
        net.add_edge(u, v, color="#cccccc", width=1, title="INV_CAND (make/model family)")
    else:
        net.add_edge(u, v, color=edge_color(k), width=1.5, title=k)

html_path = os.path.abspath("graph_unificado_arc.html")
net.write_html(html_path, notebook=False)
print(html_path)
print("✅ Visual listo:", html_path)


Complaints: 200 | Recalls: 200 | Investigations: 159
Familias make/model (desde C): 124
Matches (±2) = 452
Recalls a dibujar (rids_all): 227 | top-K R: 200 | inv: 1 | match: 26
🔎 Nodos render: 1212 | Aristas render: 3352
c:\Users\moral\tec_final\notebooks\graph_unificado_arc.html
✅ Visual listo: c:\Users\moral\tec_final\notebooks\graph_unificado_arc.html

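The pruning in step 7 can be isolated into a small helper, which makes its behavior easier to check on toy data. The sketch below is only an illustration: `prune_isolated` and the toy ids are hypothetical names, not part of the notebook.

```python
def prune_isolated(nodes, edges):
    """Keep nodes that appear in at least one edge, then keep edges whose
    two endpoints both survived. `edges` holds (src, dst, kind) tuples."""
    degree = {}
    for u, v, _kind in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    kept_nodes = {nid: props for nid, props in nodes.items() if degree.get(nid, 0) > 0}
    kept_edges = [e for e in edges if e[0] in kept_nodes and e[1] in kept_nodes]
    return kept_nodes, kept_edges


# Hypothetical toy data: "I:7" has no edges, and "M:FORD" was never registered as a node.
toy_nodes = {
    "C:1": {"type": "Complaint"},
    "R:9": {"type": "Recall"},
    "I:7": {"type": "Investigation"},
}
toy_edges = [
    ("R:9", "C:1", ("MATCH_CR", 0)),
    ("C:1", "M:FORD", "OF_MAKE"),
]
kept_nodes, kept_edges = prune_isolated(toy_nodes, toy_edges)
print(sorted(kept_nodes), kept_edges)
```

Note that this is a single pass, as in the cells: a node whose only edges point to unregistered endpoints passes the degree filter but ends up with no rendered edges.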

In [269]:
# ============================================
# Unified subgraph: Complaints / Recalls / Investigations
# + Make / Model / Component
# ============================================
from pyvis.network import Network
import os, hashlib

# -------- Config --------
QUERY = "brake fluid leak"
K_COMPLAINTS = 200
K_RECALLS    = 200
K_INVEST     = 200
INV_CAND_LIMIT = 1000   # cap on the number of INV_CAND edges to include

# -------- Helpers -------
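def norm(s):
    # Uppercase, treat "-", "_" and "/" as spaces, then collapse every run of
    # whitespace (a single .replace("  ", " ") pass leaves longer runs behind).
    s = str(s).upper().replace("-", " ").replace("_", " ").replace("/", " ")
    return " ".join(s.split())

def hash_color(s):
    # Deterministic palette color derived from the MD5 hash of the string.
    palette = ["#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd",
               "#8c564b","#e377c2","#7f7f7f","#bcbd22","#17becf"]
    h = int(hashlib.md5(str(s).encode()).hexdigest(), 16)
    return palette[h % len(palette)]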

def edge_color(kind):
    return {
        "OF_MAKE":    "#2ca02c",
        "OF_MODEL":   "#17becf",
        "MENTIONS":   "#e377c2",
        "RELATES_TO": "#1f77b4",
    }.get(kind, "#999999")

# ======================================================
# 1) Retrieve the result sets with ask()
# ======================================================
C = ask(QUERY, k=K_COMPLAINTS, type_in="complaint").get("complaints", []) or []
R = ask(QUERY, k=K_RECALLS,    type_in="recall").get("recalls", []) or []
I = ask(QUERY, k=K_INVEST,     type_in="investigation").get("investigations", []) or []

cids = sorted({str(c.get("qid") or c.get("id")) for c in C if (c.get("qid") or c.get("id"))})
rids = sorted({r.get("id") for r in R if r.get("id")})
iids = sorted({i.get("id") for i in I if i.get("id")})
print(f"Complaints: {len(cids)} | Recalls: {len(rids)} | Investigations: {len(iids)}")

# ======================================================
# 2) Existing edges in Neo4j (OF_MAKE / OF_MODEL / MENTIONS / RELATES_TO)
# ======================================================
edges_existing = []

# Complaints → Make/Model/Component
q_c = """
UNWIND $cids AS cid
MATCH (c:Complaint)
WHERE toString(c.qid)=cid OR toString(c.id)=cid OR toString(c.comp_id)=cid
OPTIONAL MATCH (c)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (c)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (c)-[:MENTIONS]->(cp:Component)
RETURN toString(c.qid) AS cqid, mk.name AS mk, md.name AS md, cp.name AS cp
"""
rowsC = cypher(q_c, {"cids": cids})
for r in rowsC:
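    cqid = r.get("cqid")
    if not cqid: continue  # matched via id/comp_id but qid is null: skip instead of failing on "C:"+None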
    if r.get("mk"): edges_existing.append(("C:"+cqid, "M:"+r["mk"], "OF_MAKE"))
    if r.get("md"): edges_existing.append(("C:"+cqid, "D:"+r["md"], "OF_MODEL"))
    if r.get("cp"): edges_existing.append(("C:"+cqid, "X:"+r["cp"], "MENTIONS"))

# Recalls → Make/Model/Component (+ props base)
q_r = """
UNWIND $rids AS rid
MATCH (r:Recall {id:rid})
OPTIONAL MATCH (r)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (r)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (r)-[:MENTIONS]->(cp:Component)
RETURN r.id AS rid, mk.name AS mk, md.name AS md, cp.name AS cp,
       r.make AS r_make, r.model AS r_model, coalesce(r.year, toInteger(left(r.recall_date,4))) AS r_year
"""
rowsR = cypher(q_r, {"rids": rids})
recall_props = {r["rid"]: {"make": r.get("r_make"), "model": r.get("r_model"), "year": r.get("r_year")} for r in rowsR}
for r in rowsR:
    rid = r["rid"]
    if r.get("mk"): edges_existing.append(("R:"+rid, "M:"+r["mk"], "OF_MAKE"))
    if r.get("md"): edges_existing.append(("R:"+rid, "D:"+r["md"], "OF_MODEL"))
    if r.get("cp"): edges_existing.append(("R:"+rid, "X:"+r["cp"], "MENTIONS"))

# Investigations → (RELATES_TO / OF_MAKE / OF_MODEL / MENTIONS)
q_i_full = """
UNWIND $iids AS iid
MATCH (i:Investigation {id:iid})
OPTIONAL MATCH (i)-[:RELATES_TO]->(r:Recall)
OPTIONAL MATCH (i)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (i)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (i)-[:MENTIONS]->(cp:Component)
RETURN i.id AS iid,
       collect(DISTINCT r.id)       AS rids,
       collect(DISTINCT mk.name)    AS makes,
       collect(DISTINCT md.name)    AS models,
       collect(DISTINCT cp.name)    AS components
"""
rowsI = cypher(q_i_full, {"iids": iids})
for row in rowsI:
    iid = row["iid"]
    for rid in row.get("rids", []) or []:
        edges_existing.append(("I:"+iid, "R:"+rid, "RELATES_TO"))
    for mk in row.get("makes", []) or []:
        edges_existing.append(("I:"+iid, "M:"+mk, "OF_MAKE"))
    for md in row.get("models", []) or []:
        edges_existing.append(("I:"+iid, "D:"+md, "OF_MODEL"))
    for cp in row.get("components", []) or []:
        edges_existing.append(("I:"+iid, "X:"+cp, "MENTIONS"))

# ======================================================
# 3) Complaint ↔ Recall matches (make/model family; ±2-year window; does NOT depend on rids)
# ======================================================
pairs_mm = sorted({(norm(c.get("make")), norm(c.get("model")))
                   for c in C if c.get("make") and c.get("model")})
pairs_mm = [{"make": mk, "model": md} for (mk, md) in pairs_mm]
print("Familias make/model (desde C):", len(pairs_mm))

q_match_allR = """
UNWIND $pairs AS mm
// Recalls in the family
// Note: mm.make / mm.model were normalized in Python by norm(); here only
// toUpper() is applied, so names containing "-", "_" or "/" will not match.
MATCH (r:Recall)
WITH r, mm, coalesce(r.year, toInteger(left(r.recall_date,4))) AS ryear
WHERE ryear IS NOT NULL
  AND toUpper(r.make)  = mm.make
  AND toUpper(r.model) = mm.model

// Complaints in the same family, within ±2 years of the recall
MATCH (c:Complaint)
WHERE c.year IS NOT NULL
  AND toUpper(c.make)  = mm.make
  AND toUpper(c.model) = mm.model
  AND abs(ryear - c.year) <= 2

RETURN DISTINCT
  r.id AS rid, ryear AS ryear,
  toString(coalesce(c.qid, c.id)) AS cqid, c.year AS cyear
LIMIT 8000
"""
rowsM = cypher(q_match_allR, {"pairs": pairs_mm})
edges_match = [("R:"+r["rid"], "C:"+r["cqid"], ("MATCH_CR", int(r["ryear"])-int(r["cyear"])))
               for r in rowsM if r.get("rid") and r.get("cqid")]
print("Matches (±2) =", len(edges_match))

# ======================================================
# 4) Ensure nodes for ALL referenced recalls (investigations and matches)
# ======================================================
rids_from_inv   = { rid for row in rowsI for rid in (row.get("rids") or []) if rid }
rids_from_match = { r["rid"] for r in rowsM }
rids_all = sorted(set(rids) | rids_from_inv | rids_from_match)

# Fetch/complete props for the additional recalls
q_r_props = """
UNWIND $rids AS rid
MATCH (r:Recall {id:rid})
RETURN r.id AS rid,
       r.make AS r_make,
       r.model AS r_model,
       coalesce(r.year, toInteger(left(r.recall_date,4))) AS r_year
"""
rowsR_all = cypher(q_r_props, {"rids": rids_all})
for r in rowsR_all:
    recall_props[r["rid"]] = {
        "make":  r.get("r_make"),
        "model": r.get("r_model"),
        "year":  r.get("r_year"),
    }
print(f"Recalls a dibujar (rids_all): {len(rids_all)} | top-K R: {len(rids)} | inv: {len(rids_from_inv)} | match: {len(rids_from_match)}")

# ======================================================
# 5) Investigation→Recall candidates by family (INV_CAND)
# ======================================================
edges_inv_candidates = []

for row in rowsI:
    iid = row["iid"]
    makes  = {m for m in (row.get("makes") or []) if m}
    models = {m for m in (row.get("models") or []) if m}
    comps  = {c for c in (row.get("components") or []) if c}  # only used by the optional check below

    for rid in rids_all:  # extended universe: top-K + investigations + matches
        # recall_props covers every id in rids_all, including recalls not returned by ask()
        rp = recall_props.get(rid, {})
        ok = False
        if rp.get("make")  in makes:  ok = True
        if rp.get("model") in models: ok = True
        # component (only if you stored it in recall_props; otherwise leave this line commented out)
        # if rp.get("component") in comps: ok = True
        if ok:
            edges_inv_candidates.append(("I:"+iid, "R:"+rid, "INV_CAND"))

# ======================================================
# 6) Build NODES
# ======================================================
nodes = {}
def add_node(node_id, label, ntype, meta=None):
    # Register a node once; later calls with the same id are ignored.
    if node_id in nodes: return
    meta = meta or {}
    nodes[node_id] = {"label": label, "type": ntype, **meta}

# Complaints, Recalls, Investigations (for investigations, use iids_all)
iids_from_ask     = set(iids)
iids_from_rowsI   = { row["iid"] for row in rowsI }
iids_from_invCand = { u[2:] for (u,_,_) in edges_inv_candidates if u.startswith("I:") }
iids_all          = sorted(iids_from_ask | iids_from_rowsI | iids_from_invCand)

for cid in cids:      add_node("C:"+cid, cid, "Complaint")
for rid in rids_all:  add_node("R:"+rid, rid, "Recall", recall_props.get(rid, {}))
for iid in iids_all:  add_node("I:"+iid, iid, "Investigation")

# Make/Model/Component nodes derived from the existing edges
for (_, dst, kind) in edges_existing:
    if dst.startswith("M:"): add_node(dst, dst[2:], "Make")
    if dst.startswith("D:"): add_node(dst, dst[2:], "Model")
    if dst.startswith("X:"): add_node(dst, dst[2:], "Component")

# ======================================================
# 7) Build EDGES (unified) and PRUNE isolated nodes
# ======================================================
edges_all = []
edges_all += [(u, v, k)            for (u, v, k) in edges_existing]
edges_all += [(u, v, k)            for (u, v, k) in edges_match]
edges_all += [(u, v, "INV_CAND")   for (u, v, _) in edges_inv_candidates[:INV_CAND_LIMIT]]

# compute each node's degree and drop isolated nodes
deg = {}
for (u, v, k) in edges_all:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1

nodes = {nid: props for nid, props in nodes.items() if deg.get(nid, 0) > 0}
edges_all = [e for e in edges_all if e[0] in nodes and e[1] in nodes]
print("🔎 Nodos render:", len(nodes), "| Aristas render:", len(edges_all))

# ======================================================
# 8) Render with PyVis and save the HTML
# ======================================================
net = Network(height="780px", width="100%", directed=False, notebook=True, cdn_resources="remote")
net.force_atlas_2based(gravity=-30, central_gravity=0.01, spring_length=140, damping=0.6)

def style_for(nid, props):
    t = props["type"]
    if t == "Recall":
        mk = (props.get("make") or "").strip()
        color = "#7e57c2"  # alternative: hash_color(mk or "NA") to color by make
        return {"shape":"box", "color":color, "size":18}
    if t == "Complaint":
        return {"shape":"dot", "color":"#8c8c8c", "size":10}
    if t == "Investigation":
        return {"shape":"diamond", "color":"#ff9800", "size":14}
    if t == "Make":
        return {"shape":"box", "color":"#2ca02c", "size":12}
    if t == "Model":
        return {"shape":"box", "color":"#17becf", "size":12}
    if t == "Component":
        return {"shape":"dot", "color":"#e377c2", "size":12}
    return {"shape":"ellipse", "color":"#999999", "size":10}

def tooltip(nid, p):
    t = p["type"]
    if t == "Recall":
        return (f"<b>RECALL</b><br><b>ID:</b> {p['label']}<br>"
                f"<b>Make/Model:</b> {p.get('make') or '-'} / {p.get('model') or '-'}<br>"
                f"<b>Year:</b> {p.get('year') or '-'}")
    if t == "Complaint":
        return f"<b>COMPLAINT</b><br><b>QID:</b> {p['label']}"
    if t == "Investigation":
        return f"<b>INVESTIGATION</b><br><b>ID:</b> {p['label']}"
    return f"<b>{t.upper()}</b><br>{p['label']}"

for nid, props in nodes.items():
    st = style_for(nid, props)
    net.add_node(nid, label=str(props["label"]), title=tooltip(nid, props),
                 color=st["color"], shape=st["shape"], size=st["size"], borderWidthSelected=3)

for (u, v, k) in edges_all:
    if isinstance(k, tuple) and k[0] == "MATCH_CR":
        dy = k[1]
        col = "#2ca02c" if dy == 0 else ("#ff7f0e" if abs(dy)==1 else "#d62728")
        net.add_edge(u, v, color=col, width=2, title=f"MATCH_CR Δy={dy}")
    elif k == "INV_CAND":
        net.add_edge(u, v, color="#cccccc", width=1, title="INV_CAND (make/model family)")
    else:
        net.add_edge(u, v, color=edge_color(k), width=1.5, title=k)

html_path = os.path.abspath("graph_unificado_brake.html")
net.write_html(html_path, notebook=False)
print(html_path)
print("✅ Visual listo:", html_path)

Complaints: 200 | Recalls: 200 | Investigations: 136
Familias make/model (desde C): 110
Matches (±2) = 420
Recalls a dibujar (rids_all): 223 | top-K R: 200 | inv: 0 | match: 23
🔎 Nodos render: 1173 | Aristas render: 3204
c:\Users\moral\tec_final\notebooks\graph_unificado_brake.html
✅ Visual listo: c:\Users\moral\tec_final\notebooks\graph_unificado_brake.html

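Step 3 builds the make/model "families" by normalizing the names attached to each complaint. The sketch below shows one way that normalization could be written so that repeated separators collapse to a single space. `normalize_name`, `family_pairs` and the sample records are hypothetical and only illustrate the idea. They are not the notebook's own helpers.

```python
import re


def normalize_name(s):
    """Uppercase, treat '-', '_' and '/' as spaces, and collapse whitespace runs."""
    text = re.sub(r"[-_/]", " ", str(s).upper())
    return " ".join(text.split())


def family_pairs(complaints):
    """Unique, sorted (make, model) pairs from complaint dicts that carry both fields."""
    pairs = {
        (normalize_name(c["make"]), normalize_name(c["model"]))
        for c in complaints
        if c.get("make") and c.get("model")
    }
    return [{"make": mk, "model": md} for mk, md in sorted(pairs)]


# Made-up sample records; the last one is skipped because it has no model.
sample = [
    {"make": "ford", "model": "f-150"},
    {"make": "Ford", "model": "F 150"},
    {"make": "mercedes-benz", "model": "c_class / sedan"},
    {"make": "toyota", "model": None},
]
pairs = family_pairs(sample)
print(pairs)
```

For the pairs to match in the graph, the same normalization would also have to be applied to the values stored in Neo4j, either at ingestion time or inside the Cypher query.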

In [12]:
# ============================================
# Unified subgraph: Complaints / Recalls / Investigations
# + Make / Model / Component
# ============================================
from pyvis.network import Network
import os, hashlib

# -------- Config --------
QUERY = "derrame de líquido"
K_COMPLAINTS = 200
K_RECALLS    = 200
K_INVEST     = 200
INV_CAND_LIMIT = 1000   # cap on the number of INV_CAND edges to include

# -------- Helpers -------
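def norm(s):
    # Uppercase, treat "-", "_" and "/" as spaces, then collapse every run of
    # whitespace (a single .replace("  ", " ") pass leaves longer runs behind).
    s = str(s).upper().replace("-", " ").replace("_", " ").replace("/", " ")
    return " ".join(s.split())

def hash_color(s):
    # Deterministic palette color derived from the MD5 hash of the string.
    palette = ["#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd",
               "#8c564b","#e377c2","#7f7f7f","#bcbd22","#17becf"]
    h = int(hashlib.md5(str(s).encode()).hexdigest(), 16)
    return palette[h % len(palette)]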

def edge_color(kind):
    return {
        "OF_MAKE":    "#2ca02c",
        "OF_MODEL":   "#17becf",
        "MENTIONS":   "#e377c2",
        "RELATES_TO": "#1f77b4",
    }.get(kind, "#999999")

# ======================================================
# 1) Retrieve the result sets with ask()
# ======================================================
C = ask(QUERY, k=K_COMPLAINTS, type_in="complaint").get("complaints", []) or []
R = ask(QUERY, k=K_RECALLS,    type_in="recall").get("recalls", []) or []
I = ask(QUERY, k=K_INVEST,     type_in="investigation").get("investigations", []) or []

cids = sorted({str(c.get("qid") or c.get("id")) for c in C if (c.get("qid") or c.get("id"))})
rids = sorted({r.get("id") for r in R if r.get("id")})
iids = sorted({i.get("id") for i in I if i.get("id")})
print(f"Complaints: {len(cids)} | Recalls: {len(rids)} | Investigations: {len(iids)}")

# ======================================================
# 2) Existing edges in Neo4j (OF_MAKE / OF_MODEL / MENTIONS / RELATES_TO)
# ======================================================
edges_existing = []

# Complaints → Make/Model/Component
q_c = """
UNWIND $cids AS cid
MATCH (c:Complaint)
WHERE toString(c.qid)=cid OR toString(c.id)=cid OR toString(c.comp_id)=cid
OPTIONAL MATCH (c)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (c)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (c)-[:MENTIONS]->(cp:Component)
RETURN toString(c.qid) AS cqid, mk.name AS mk, md.name AS md, cp.name AS cp
"""
rowsC = cypher(q_c, {"cids": cids})
for r in rowsC:
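    cqid = r.get("cqid")
    if not cqid: continue  # matched via id/comp_id but qid is null: skip instead of failing on "C:"+None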
    if r.get("mk"): edges_existing.append(("C:"+cqid, "M:"+r["mk"], "OF_MAKE"))
    if r.get("md"): edges_existing.append(("C:"+cqid, "D:"+r["md"], "OF_MODEL"))
    if r.get("cp"): edges_existing.append(("C:"+cqid, "X:"+r["cp"], "MENTIONS"))

# Recalls → Make/Model/Component (+ props base)
q_r = """
UNWIND $rids AS rid
MATCH (r:Recall {id:rid})
OPTIONAL MATCH (r)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (r)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (r)-[:MENTIONS]->(cp:Component)
RETURN r.id AS rid, mk.name AS mk, md.name AS md, cp.name AS cp,
       r.make AS r_make, r.model AS r_model, coalesce(r.year, toInteger(left(r.recall_date,4))) AS r_year
"""
rowsR = cypher(q_r, {"rids": rids})
recall_props = {r["rid"]: {"make": r.get("r_make"), "model": r.get("r_model"), "year": r.get("r_year")} for r in rowsR}
for r in rowsR:
    rid = r["rid"]
    if r.get("mk"): edges_existing.append(("R:"+rid, "M:"+r["mk"], "OF_MAKE"))
    if r.get("md"): edges_existing.append(("R:"+rid, "D:"+r["md"], "OF_MODEL"))
    if r.get("cp"): edges_existing.append(("R:"+rid, "X:"+r["cp"], "MENTIONS"))

# Investigations → (RELATES_TO / OF_MAKE / OF_MODEL / MENTIONS)
q_i_full = """
UNWIND $iids AS iid
MATCH (i:Investigation {id:iid})
OPTIONAL MATCH (i)-[:RELATES_TO]->(r:Recall)
OPTIONAL MATCH (i)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (i)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (i)-[:MENTIONS]->(cp:Component)
RETURN i.id AS iid,
       collect(DISTINCT r.id)       AS rids,
       collect(DISTINCT mk.name)    AS makes,
       collect(DISTINCT md.name)    AS models,
       collect(DISTINCT cp.name)    AS components
"""
rowsI = cypher(q_i_full, {"iids": iids})
for row in rowsI:
    iid = row["iid"]
    for rid in row.get("rids", []) or []:
        edges_existing.append(("I:"+iid, "R:"+rid, "RELATES_TO"))
    for mk in row.get("makes", []) or []:
        edges_existing.append(("I:"+iid, "M:"+mk, "OF_MAKE"))
    for md in row.get("models", []) or []:
        edges_existing.append(("I:"+iid, "D:"+md, "OF_MODEL"))
    for cp in row.get("components", []) or []:
        edges_existing.append(("I:"+iid, "X:"+cp, "MENTIONS"))

# ======================================================
# 3) Complaint ↔ Recall matches (make/model family; ±2-year window; does NOT depend on rids)
# ======================================================
pairs_mm = sorted({(norm(c.get("make")), norm(c.get("model")))
                   for c in C if c.get("make") and c.get("model")})
pairs_mm = [{"make": mk, "model": md} for (mk, md) in pairs_mm]
print("Familias make/model (desde C):", len(pairs_mm))

q_match_allR = """
UNWIND $pairs AS mm
// Recalls in the family
// Note: mm.make / mm.model were normalized in Python by norm(); here only
// toUpper() is applied, so names containing "-", "_" or "/" will not match.
MATCH (r:Recall)
WITH r, mm, coalesce(r.year, toInteger(left(r.recall_date,4))) AS ryear
WHERE ryear IS NOT NULL
  AND toUpper(r.make)  = mm.make
  AND toUpper(r.model) = mm.model

// Complaints in the same family, within ±2 years of the recall
MATCH (c:Complaint)
WHERE c.year IS NOT NULL
  AND toUpper(c.make)  = mm.make
  AND toUpper(c.model) = mm.model
  AND abs(ryear - c.year) <= 2

RETURN DISTINCT
  r.id AS rid, ryear AS ryear,
  toString(coalesce(c.qid, c.id)) AS cqid, c.year AS cyear
LIMIT 8000
"""
rowsM = cypher(q_match_allR, {"pairs": pairs_mm})
edges_match = [("R:"+r["rid"], "C:"+r["cqid"], ("MATCH_CR", int(r["ryear"])-int(r["cyear"])))
               for r in rowsM if r.get("rid") and r.get("cqid")]
print("Matches (±2) =", len(edges_match))

# ======================================================
# 4) Ensure nodes for ALL referenced recalls (investigations and matches)
# ======================================================
rids_from_inv   = { rid for row in rowsI for rid in (row.get("rids") or []) if rid }
rids_from_match = { r["rid"] for r in rowsM }
rids_all = sorted(set(rids) | rids_from_inv | rids_from_match)

# Fetch/complete props for the additional recalls
q_r_props = """
UNWIND $rids AS rid
MATCH (r:Recall {id:rid})
RETURN r.id AS rid,
       r.make AS r_make,
       r.model AS r_model,
       coalesce(r.year, toInteger(left(r.recall_date,4))) AS r_year
"""
rowsR_all = cypher(q_r_props, {"rids": rids_all})
for r in rowsR_all:
    recall_props[r["rid"]] = {
        "make":  r.get("r_make"),
        "model": r.get("r_model"),
        "year":  r.get("r_year"),
    }
print(f"Recalls a dibujar (rids_all): {len(rids_all)} | top-K R: {len(rids)} | inv: {len(rids_from_inv)} | match: {len(rids_from_match)}")

# ======================================================
# 5) Investigation→Recall candidates by family (INV_CAND)
# ======================================================
edges_inv_candidates = []

for row in rowsI:
    iid = row["iid"]
    makes  = {m for m in (row.get("makes") or []) if m}
    models = {m for m in (row.get("models") or []) if m}
    comps  = {c for c in (row.get("components") or []) if c}  # only used by the optional check below

    for rid in rids_all:  # extended universe: top-K + investigations + matches
        # recall_props covers every id in rids_all, including recalls not returned by ask()
        rp = recall_props.get(rid, {})
        ok = False
        if rp.get("make")  in makes:  ok = True
        if rp.get("model") in models: ok = True
        # component (only if you stored it in recall_props; otherwise leave this line commented out)
        # if rp.get("component") in comps: ok = True
        if ok:
            edges_inv_candidates.append(("I:"+iid, "R:"+rid, "INV_CAND"))

# ======================================================
# 6) Build NODES
# ======================================================
nodes = {}
def add_node(node_id, label, ntype, meta=None):
    # Register a node once; later calls with the same id are ignored.
    if node_id in nodes: return
    meta = meta or {}
    nodes[node_id] = {"label": label, "type": ntype, **meta}

# Complaints, Recalls, Investigations (for investigations, use iids_all)
iids_from_ask     = set(iids)
iids_from_rowsI   = { row["iid"] for row in rowsI }
iids_from_invCand = { u[2:] for (u,_,_) in edges_inv_candidates if u.startswith("I:") }
iids_all          = sorted(iids_from_ask | iids_from_rowsI | iids_from_invCand)

for cid in cids:      add_node("C:"+cid, cid, "Complaint")
for rid in rids_all:  add_node("R:"+rid, rid, "Recall", recall_props.get(rid, {}))
for iid in iids_all:  add_node("I:"+iid, iid, "Investigation")

# Make/Model/Component nodes derived from the existing edges
for (_, dst, kind) in edges_existing:
    if dst.startswith("M:"): add_node(dst, dst[2:], "Make")
    if dst.startswith("D:"): add_node(dst, dst[2:], "Model")
    if dst.startswith("X:"): add_node(dst, dst[2:], "Component")

# ======================================================
# 7) Build EDGES (unified) and PRUNE isolated nodes
# ======================================================
edges_all = []
edges_all += [(u, v, k)            for (u, v, k) in edges_existing]
edges_all += [(u, v, k)            for (u, v, k) in edges_match]
edges_all += [(u, v, "INV_CAND")   for (u, v, _) in edges_inv_candidates[:INV_CAND_LIMIT]]

# compute each node's degree and drop isolated nodes
deg = {}
for (u, v, k) in edges_all:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1

nodes = {nid: props for nid, props in nodes.items() if deg.get(nid, 0) > 0}
edges_all = [e for e in edges_all if e[0] in nodes and e[1] in nodes]
print("🔎 Nodos render:", len(nodes), "| Aristas render:", len(edges_all))

# ======================================================
# 8) Render with PyVis and save the HTML
# ======================================================
net = Network(height="780px", width="100%", directed=False, notebook=True, cdn_resources="remote")
net.force_atlas_2based(gravity=-30, central_gravity=0.01, spring_length=140, damping=0.6)

def style_for(nid, props):
    t = props["type"]
    if t == "Recall":
        mk = (props.get("make") or "").strip()
        color = "#7e57c2"  # alternative: hash_color(mk or "NA") to color by make
        return {"shape":"box", "color":color, "size":18}
    if t == "Complaint":
        return {"shape":"dot", "color":"#8c8c8c", "size":10}
    if t == "Investigation":
        return {"shape":"diamond", "color":"#ff9800", "size":14}
    if t == "Make":
        return {"shape":"box", "color":"#2ca02c", "size":12}
    if t == "Model":
        return {"shape":"box", "color":"#17becf", "size":12}
    if t == "Component":
        return {"shape":"dot", "color":"#e377c2", "size":12}
    return {"shape":"ellipse", "color":"#999999", "size":10}

def tooltip(nid, p):
    t = p["type"]
    if t == "Recall":
        return (f"<b>RECALL</b><br><b>ID:</b> {p['label']}<br>"
                f"<b>Make/Model:</b> {p.get('make') or '-'} / {p.get('model') or '-'}<br>"
                f"<b>Year:</b> {p.get('year') or '-'}")
    if t == "Complaint":
        return f"<b>COMPLAINT</b><br><b>QID:</b> {p['label']}"
    if t == "Investigation":
        return f"<b>INVESTIGATION</b><br><b>ID:</b> {p['label']}"
    return f"<b>{t.upper()}</b><br>{p['label']}"

for nid, props in nodes.items():
    st = style_for(nid, props)
    net.add_node(nid, label=str(props["label"]), title=tooltip(nid, props),
                 color=st["color"], shape=st["shape"], size=st["size"], borderWidthSelected=3)

for (u, v, k) in edges_all:
    if isinstance(k, tuple) and k[0] == "MATCH_CR":
        dy = k[1]
        col = "#2ca02c" if dy == 0 else ("#ff7f0e" if abs(dy)==1 else "#d62728")
        net.add_edge(u, v, color=col, width=2, title=f"MATCH_CR Δy={dy}")
    elif k == "INV_CAND":
        net.add_edge(u, v, color="#cccccc", width=1, title="INV_CAND (make/model family)")
    else:
        net.add_edge(u, v, color=edge_color(k), width=1.5, title=k)

html_path = os.path.abspath("graph_unificado_liquido.html")
net.write_html(html_path, notebook=False)
print(html_path)
print("✅ Visual listo:", html_path)

[IPCA] cargado 1024→256 desde ..\data\embeddings\complaints_e5_mlg_instruct\ipca_model.npz
Complaints: 200 | Recalls: 200 | Investigations: 163
Familias make/model (desde C): 123
Matches (±2) = 428
Recalls a dibujar (rids_all): 221 | top-K R: 200 | inv: 0 | match: 21
🔎 Nodos render: 1281 | Aristas render: 2606
c:\Users\moral\tec_final\notebooks\graph_unificado_liquido.html
✅ Visual listo: c:\Users\moral\tec_final\notebooks\graph_unificado_liquido.html

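The `MATCH_CR` edges pair a recall with a complaint of the same make/model family when their years differ by at most two, and the renderer colors each edge by that difference. The following pure-Python sketch mirrors that logic on in-memory records, which can help when checking the Cypher query against a small sample. `match_edges`, `delta_color` and the record ids are hypothetical, and the records are assumed to be already normalized.

```python
def delta_color(dy):
    """Green for the same year, orange for a one-year gap, red otherwise."""
    if dy == 0:
        return "#2ca02c"
    return "#ff7f0e" if abs(dy) == 1 else "#d62728"


def match_edges(recalls, complaints, window=2):
    """Pair recalls and complaints that share (make, model) and whose years
    differ by at most `window`; records without a year are ignored."""
    by_family = {}
    for c in complaints:
        if c.get("year") is not None:
            by_family.setdefault((c["make"], c["model"]), []).append(c)

    edges = []
    for r in recalls:
        if r.get("year") is None:
            continue
        for c in by_family.get((r["make"], r["model"]), []):
            dy = int(r["year"]) - int(c["year"])
            if abs(dy) <= window:
                edges.append(("R:" + r["id"], "C:" + c["qid"], ("MATCH_CR", dy)))
    return edges


# Made-up records for illustration only.
recalls = [{"id": "R-TEST-1", "make": "FORD", "model": "F 150", "year": 2020}]
complaints = [
    {"qid": "100", "make": "FORD", "model": "F 150", "year": 2019},  # within the window
    {"qid": "101", "make": "FORD", "model": "F 150", "year": 2016},  # too far apart
    {"qid": "102", "make": "FORD", "model": "FOCUS", "year": 2020},  # different family
]
edges = match_edges(recalls, complaints)
print(edges, [delta_color(kind[1]) for _, _, kind in edges])
```

The sign of the difference is kept (recall year minus complaint year), so a positive value means the complaint predates the recall.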

In [268]:
# ============================================
# Unified subgraph: Complaints / Recalls / Investigations
# + Make / Model / Component
# ============================================
from pyvis.network import Network
import os, hashlib

# -------- Config --------
QUERY = "problemas con sensores"
K_COMPLAINTS = 200
K_RECALLS    = 200
K_INVEST     = 200
INV_CAND_LIMIT = 1000   # cap on the number of INV_CAND edges to include

# -------- Helpers -------
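def norm(s):
    # Uppercase, treat "-", "_" and "/" as spaces, then collapse every run of
    # whitespace (a single .replace("  ", " ") pass leaves longer runs behind).
    s = str(s).upper().replace("-", " ").replace("_", " ").replace("/", " ")
    return " ".join(s.split())

def hash_color(s):
    # Deterministic palette color derived from the MD5 hash of the string.
    palette = ["#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd",
               "#8c564b","#e377c2","#7f7f7f","#bcbd22","#17becf"]
    h = int(hashlib.md5(str(s).encode()).hexdigest(), 16)
    return palette[h % len(palette)]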

def edge_color(kind):
    return {
        "OF_MAKE":    "#2ca02c",
        "OF_MODEL":   "#17becf",
        "MENTIONS":   "#e377c2",
        "RELATES_TO": "#1f77b4",
    }.get(kind, "#999999")

# ======================================================
# 1) Retrieve the result sets with ask()
# ======================================================
C = ask(QUERY, k=K_COMPLAINTS, type_in="complaint").get("complaints", []) or []
R = ask(QUERY, k=K_RECALLS,    type_in="recall").get("recalls", []) or []
I = ask(QUERY, k=K_INVEST,     type_in="investigation").get("investigations", []) or []

cids = sorted({str(c.get("qid") or c.get("id")) for c in C if (c.get("qid") or c.get("id"))})
rids = sorted({r.get("id") for r in R if r.get("id")})
iids = sorted({i.get("id") for i in I if i.get("id")})
print(f"Complaints: {len(cids)} | Recalls: {len(rids)} | Investigations: {len(iids)}")

# ======================================================
# 2) Existing edges in Neo4j (OF_MAKE / OF_MODEL / MENTIONS / RELATES_TO)
# ======================================================
edges_existing = []

# Complaints → Make/Model/Component
q_c = """
UNWIND $cids AS cid
MATCH (c:Complaint)
WHERE toString(c.qid)=cid OR toString(c.id)=cid OR toString(c.comp_id)=cid
OPTIONAL MATCH (c)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (c)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (c)-[:MENTIONS]->(cp:Component)
RETURN toString(c.qid) AS cqid, mk.name AS mk, md.name AS md, cp.name AS cp
"""
rowsC = cypher(q_c, {"cids": cids})
for r in rowsC:
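    cqid = r.get("cqid")
    if not cqid: continue  # matched via id/comp_id but qid is null: skip instead of failing on "C:"+None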
    if r.get("mk"): edges_existing.append(("C:"+cqid, "M:"+r["mk"], "OF_MAKE"))
    if r.get("md"): edges_existing.append(("C:"+cqid, "D:"+r["md"], "OF_MODEL"))
    if r.get("cp"): edges_existing.append(("C:"+cqid, "X:"+r["cp"], "MENTIONS"))

# Recalls → Make/Model/Component (+ props base)
q_r = """
UNWIND $rids AS rid
MATCH (r:Recall {id:rid})
OPTIONAL MATCH (r)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (r)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (r)-[:MENTIONS]->(cp:Component)
RETURN r.id AS rid, mk.name AS mk, md.name AS md, cp.name AS cp,
       r.make AS r_make, r.model AS r_model, coalesce(r.year, toInteger(left(r.recall_date,4))) AS r_year
"""
rowsR = cypher(q_r, {"rids": rids})
recall_props = {r["rid"]: {"make": r.get("r_make"), "model": r.get("r_model"), "year": r.get("r_year")} for r in rowsR}
for r in rowsR:
    rid = r["rid"]
    if r.get("mk"): edges_existing.append(("R:"+rid, "M:"+r["mk"], "OF_MAKE"))
    if r.get("md"): edges_existing.append(("R:"+rid, "D:"+r["md"], "OF_MODEL"))
    if r.get("cp"): edges_existing.append(("R:"+rid, "X:"+r["cp"], "MENTIONS"))

# Investigations → (RELATES_TO / OF_MAKE / OF_MODEL / MENTIONS)
q_i_full = """
UNWIND $iids AS iid
MATCH (i:Investigation {id:iid})
OPTIONAL MATCH (i)-[:RELATES_TO]->(r:Recall)
OPTIONAL MATCH (i)-[:OF_MAKE]->(mk:Make)
OPTIONAL MATCH (i)-[:OF_MODEL]->(md:Model)
OPTIONAL MATCH (i)-[:MENTIONS]->(cp:Component)
RETURN i.id AS iid,
       collect(DISTINCT r.id)       AS rids,
       collect(DISTINCT mk.name)    AS makes,
       collect(DISTINCT md.name)    AS models,
       collect(DISTINCT cp.name)    AS components
"""
rowsI = cypher(q_i_full, {"iids": iids})
for row in rowsI:
    iid = row["iid"]
    for rid in row.get("rids", []) or []:
        edges_existing.append(("I:"+iid, "R:"+rid, "RELATES_TO"))
    for mk in row.get("makes", []) or []:
        edges_existing.append(("I:"+iid, "M:"+mk, "OF_MAKE"))
    for md in row.get("models", []) or []:
        edges_existing.append(("I:"+iid, "D:"+md, "OF_MODEL"))
    for cp in row.get("components", []) or []:
        edges_existing.append(("I:"+iid, "X:"+cp, "MENTIONS"))

# ======================================================
# 3) Complaint ↔ Recall matches (make/model family; ±2-year window; does NOT depend on rids)
# ======================================================
pairs_mm = sorted({(norm(c.get("make")), norm(c.get("model")))
                   for c in C if c.get("make") and c.get("model")})
pairs_mm = [{"make": mk, "model": md} for (mk, md) in pairs_mm]
print("Familias make/model (desde C):", len(pairs_mm))

q_match_allR = """
UNWIND $pairs AS mm
// Recalls in the make/model family
MATCH (r:Recall)
WITH r, mm, coalesce(r.year, toInteger(left(r.recall_date,4))) AS ryear
WHERE ryear IS NOT NULL
  AND toUpper(r.make)  = mm.make
  AND toUpper(r.model) = mm.model

// Complaints in the same family, within ±2 years of the recall
MATCH (c:Complaint)
WHERE c.year IS NOT NULL
  AND toUpper(c.make)  = mm.make
  AND toUpper(c.model) = mm.model
  AND abs(ryear - c.year) <= 2

RETURN DISTINCT
  r.id AS rid, ryear AS ryear,
  toString(coalesce(c.qid, c.id)) AS cqid, c.year AS cyear
LIMIT 8000
"""
rowsM = cypher(q_match_allR, {"pairs": pairs_mm})
edges_match = [("R:"+r["rid"], "C:"+r["cqid"], ("MATCH_CR", int(r["ryear"])-int(r["cyear"])))
               for r in rowsM if r.get("rid") and r.get("cqid")]
print("Matches (±2) =", len(edges_match))

# ======================================================
# 4) Ensure nodes exist for ALL referenced recalls (investigations and matches)
# ======================================================
rids_from_inv   = { rid for row in rowsI for rid in (row.get("rids") or []) if rid }
rids_from_match = { r["rid"] for r in rowsM if r.get("rid") }  # skip null ids so sorted() below cannot fail
rids_all = sorted(set(rids) | rids_from_inv | rids_from_match)

# Fetch/complete properties for the additional recalls
q_r_props = """
UNWIND $rids AS rid
MATCH (r:Recall {id:rid})
RETURN r.id AS rid,
       r.make AS r_make,
       r.model AS r_model,
       coalesce(r.year, toInteger(left(r.recall_date,4))) AS r_year
"""
rowsR_all = cypher(q_r_props, {"rids": rids_all})
for r in rowsR_all:
    recall_props[r["rid"]] = {
        "make":  r.get("r_make"),
        "model": r.get("r_model"),
        "year":  r.get("r_year"),
    }
print(f"Recalls a dibujar (rids_all): {len(rids_all)} | top-K R: {len(rids)} | inv: {len(rids_from_inv)} | match: {len(rids_from_match)}")

# ======================================================
# 5) Investigation→Recall candidates by family (INV_CAND)
# ======================================================
edges_inv_candidates = []
rids_set = set(rids_all)  # the expanded recall universe
# Quick lookup of R properties by id (for family checks)
R_by_id = {r.get("id"): r for r in R if r.get("id")}

for row in rowsI:
    iid = row["iid"]
    makes  = {m for m in (row.get("makes") or []) if m}
    models = {m for m in (row.get("models") or []) if m}
    comps  = {c for c in (row.get("components") or []) if c}

    for rid in rids_all:
        # Family data comes from recall_props, so the recall does not need to be present in R.
        rp = recall_props.get(rid, {})
        # A recall is a candidate when it shares the make OR the model with the investigation.
        ok = rp.get("make") in makes or rp.get("model") in models
        # Component check (enable only if recall_props stores a component):
        # ok = ok or rp.get("component") in comps
        if ok:
            edges_inv_candidates.append(("I:"+iid, "R:"+rid, "INV_CAND"))

# ======================================================
# 6) Build NODES
# ======================================================
nodes = {}
def add_node(node_id, label, ntype, meta=None):
    if node_id in nodes: return
    meta = meta or {}
    nodes[node_id] = {"label": label, "type": ntype, **meta}

# Complaints, Recalls, Investigations (investigations use iids_all)
iids_from_ask     = set(iids)
iids_from_rowsI   = { row["iid"] for row in rowsI }
iids_from_invCand = { u[2:] for (u,_,_) in edges_inv_candidates if u.startswith("I:") }
iids_all          = sorted(iids_from_ask | iids_from_rowsI | iids_from_invCand)

for cid in cids:      add_node("C:"+cid, cid, "Complaint")
for rid in rids_all:  add_node("R:"+rid, rid, "Recall", recall_props.get(rid, {}))
for iid in iids_all:  add_node("I:"+iid, iid, "Investigation")

# Make/Model/Component nodes derived from the existing edges
for (_, dst, kind) in edges_existing:
    if dst.startswith("M:"): add_node(dst, dst[2:], "Make")
    if dst.startswith("D:"): add_node(dst, dst[2:], "Model")
    if dst.startswith("X:"): add_node(dst, dst[2:], "Component")

# ======================================================
# 7) Build the unified EDGE list and prune isolated nodes
# ======================================================
edges_all = []
edges_all += [(u, v, k)            for (u, v, k) in edges_existing]
edges_all += [(u, v, k)            for (u, v, k) in edges_match]
edges_all += [(u, v, "INV_CAND")   for (u, v, _) in edges_inv_candidates[:INV_CAND_LIMIT]]

# Compute each node's degree and drop isolated nodes
deg = {}
for (u, v, k) in edges_all:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1

nodes = {nid: props for nid, props in nodes.items() if deg.get(nid, 0) > 0}
edges_all = [e for e in edges_all if e[0] in nodes and e[1] in nodes]
print("🔎 Nodos render:", len(nodes), "| Aristas render:", len(edges_all))

# ======================================================
# 8) Render with PyVis and save the HTML
# ======================================================
net = Network(height="780px", width="100%", directed=False, notebook=True, cdn_resources="remote")
net.force_atlas_2based(gravity=-30, central_gravity=0.01, spring_length=140, damping=0.6)

def style_for(nid, props):
    t = props["type"]
    if t == "Recall":
        mk = (props.get("make") or "").strip()
        color = "#7e57c2"  # alternative: hash_color(mk or "NA") to color by make
        return {"shape":"box", "color":color, "size":18}
    if t == "Complaint":
        return {"shape":"dot", "color":"#8c8c8c", "size":10}
    if t == "Investigation":
        return {"shape":"diamond", "color":"#ff9800", "size":14}
    if t == "Make":
        return {"shape":"box", "color":"#2ca02c", "size":12}
    if t == "Model":
        return {"shape":"box", "color":"#17becf", "size":12}
    if t == "Component":
        return {"shape":"dot", "color":"#e377c2", "size":12}
    return {"shape":"ellipse", "color":"#999999", "size":10}

def tooltip(nid, p):
    t = p["type"]
    if t == "Recall":
        return (f"<b>RECALL</b><br><b>ID:</b> {p['label']}<br>"
                f"<b>Make/Model:</b> {p.get('make') or '-'} / {p.get('model') or '-'}<br>"
                f"<b>Year:</b> {p.get('year') or '-'}")
    if t == "Complaint":
        return f"<b>COMPLAINT</b><br><b>QID:</b> {p['label']}"
    if t == "Investigation":
        return f"<b>INVESTIGATION</b><br><b>ID:</b> {p['label']}"
    return f"<b>{t.upper()}</b><br>{p['label']}"

for nid, props in nodes.items():
    st = style_for(nid, props)
    net.add_node(nid, label=str(props["label"]), title=tooltip(nid, props),
                 color=st["color"], shape=st["shape"], size=st["size"], borderWidthSelected=3)

for (u, v, k) in edges_all:
    if isinstance(k, tuple) and k[0] == "MATCH_CR":
        dy = k[1]
        col = "#2ca02c" if dy == 0 else ("#ff7f0e" if abs(dy)==1 else "#d62728")
        net.add_edge(u, v, color=col, width=2, title=f"MATCH_CR Δy={dy}")
    elif k == "INV_CAND":
        net.add_edge(u, v, color="#cccccc", width=1, title="INV_CAND (make/model family)")
    else:
        net.add_edge(u, v, color=edge_color(k), width=1.5, title=k)

html_path = os.path.abspath("graph_unificado_sensores.html")
net.write_html(html_path, notebook=False)
print(html_path)
print("✅ Visual listo:", html_path)

Complaints: 200 | Recalls: 200 | Investigations: 166
Familias make/model (desde C): 110
Matches (±2) = 337
Recalls a dibujar (rids_all): 219 | top-K R: 200 | inv: 0 | match: 19
🔎 Nodos render: 1308 | Aristas render: 2910
c:\Users\moral\tec_final\notebooks\graph_unificado_sensores.html
✅ Visual listo: c:\Users\moral\tec_final\notebooks\graph_unificado_sensores.html
