## Embedding *Investigations* (ODI/NHTSA)

### 1. Overview

This stage produced the **semantic embedding** of the **vehicle investigations (Investigations)** dataset from the *Office of Defects Investigation (ODI)*, part of the *National Highway Traffic Safety Administration (NHTSA)*.
The goal was to turn the descriptive text of each investigation into high-dimensional vector representations suitable for downstream **semantic search**, **alignment across heterogeneous sources (Complaints, Recalls, Investigations)**, and **querying through language agents**.

The process ran on the processed and normalized version of the original dataset (≈153,550 raw records).
After cleaning and deduplicating by action number (`NHTSA ACTION NUMBER`), the consolidated corpus came to **5,309 base records**, which yielded **6,354 text segments (chunks)**.
Each chunk is a unit of semantic analysis derived from the concatenated field `SUBJECT + SUMMARY`. Texts of up to 1,200 characters are kept as a single chunk; longer texts are split with a 1,600-character sliding window and a 300-character overlap.
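
The windowing rule can be illustrated with a minimal sketch. It mirrors the parameters of `make_chunks_dynamic` from the notebook cells further down, but the function name `sliding_chunks` is hypothetical and it works on a single string rather than a DataFrame.

```python
def sliding_chunks(text, short_max_len=1200, width=1600, overlap=300):
    """Split `text` into character windows of `width` that share `overlap`
    characters with the previous window; short texts stay whole."""
    if not text:
        return []
    if len(text) <= short_max_len:
        return [text]
    step = max(1, width - overlap)  # distance between window starts
    chunks, start = [], 0
    while start < len(text):
        end = min(start + width, len(text))
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start += step
    return chunks


# A 4,000-character text yields windows starting at 0, 1300 and 2600.
sample = "".join(chr(97 + i % 26) for i in range(4000))
pieces = sliding_chunks(sample)
```

With these defaults the window start advances 1,300 characters at a time, so each chunk repeats the last 300 characters of its predecessor.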

---

### 2. Model and Configuration

The model used was **`intfloat/multilingual-e5-large-instruct`**, hosted on Hugging Face and based on *XLM-RoBERTa-large*.
It was selected for three main reasons:

1. **Multilingual coverage:** it captures technical nuance and English terminology, and it is robust to institutional text.
2. **Instruction tuning:** it is optimized for *retrieval* and semantic alignment, which suits heterogeneous datasets.
3. **Empirical performance:** it was previously tested on *Complaints*, where it reached `Recall@10 ≈ 0.81` and `nDCG@10 ≈ 0.62`.
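
For reference, the two retrieval metrics quoted above can be computed as in the sketch below. This is an assumption about the metric definitions (binary relevance, one ranked list per query); the source does not state how the *Complaints* evaluation was actually implemented, and the function names are illustrative.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant items that appear among the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Both functions assume that `ranked_ids` contains no repeated ids; averaging them over all evaluation queries gives corpus-level figures.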

The run used an **NVIDIA T4 GPU (CUDA, Colab)** with the following parameters:

| Parameter        | Value                                        |
| ---------------- | -------------------------------------------- |
| `batch_size`     | Adaptive (initial = 256, minimum = 32)       |
| `max_seq_length` | 480 tokens                                   |
| Precision        | `fp16 autocast` (PyTorch 2.5, `torch.amp`)   |
| Normalization    | Global L2, applied once after encoding       |
| Failure handling | Automatic batch-size reduction on `CUDA OOM` |
| Model load time  | 56.6 s                                       |
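
The adaptive batch policy in the table can be sketched independently of any GPU library. In this illustration `encode_fn` is a hypothetical callable and `MemoryError` stands in for a CUDA out-of-memory error; the notebook's own implementation additionally grows the batch back when memory frees up, which is omitted here.

```python
def encode_with_backoff(items, encode_fn, init_bs=256, min_bs=32):
    """Encode `items` batch by batch, halving the batch size whenever
    `encode_fn` raises MemoryError (stand-in for a CUDA OOM error)."""
    out, i, bs = [], 0, init_bs
    while i < len(items):
        batch = items[i:i + bs]
        try:
            out.extend(encode_fn(batch))
        except MemoryError:
            if bs <= min_bs:
                raise  # already at the floor: give up
            bs = max(min_bs, bs // 2)
            continue  # retry the same position with a smaller batch
        i += len(batch)
    return out


def toy_encoder(batch):
    """Hypothetical encoder that cannot handle more than 64 items at once."""
    if len(batch) > 64:
        raise MemoryError("batch too large")
    return [len(text) for text in batch]


lengths = encode_with_backoff(["x" * n for n in range(300)], toy_encoder)
```

The key property is that a failed batch is retried from the same index, so no item is skipped or encoded twice.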

---

### 3. Optimization Strategy

Before embedding, **exact-content deduplication** was applied using MD5 hashing:
each chunk text (derived from `SUBJECT + SUMMARY`) was reduced to a hash key, and only distinct entries were encoded.
The resulting vectors were then expanded back to the full set of chunks, preserving all associated metadata (`make`, `model`, `year`, `component`, `camp_no`, etc.).

This reduced the number of texts to embed from **6,354 to 5,615**, an **11.6 % reduction** (88.4 % of chunks are unique).
Total inference time was **112.9 s**.
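
A minimal sketch of the dedupe-then-expand pattern is shown below. The toy two-dimensional vectors are a stand-in for the real encoder output, and the helper name `dedup_texts` is illustrative rather than taken from the notebook.

```python
import hashlib

import numpy as np


def dedup_texts(texts):
    """Return (unique_texts, index) where index[i] is the position of
    texts[i] inside unique_texts (first occurrence wins)."""
    seen, unique, index = {}, [], []
    for t in texts:
        h = hashlib.md5(t.encode("utf-8")).hexdigest()
        if h not in seen:
            seen[h] = len(unique)
            unique.append(t)
        index.append(seen[h])
    return unique, np.asarray(index, dtype=np.int64)


texts = ["brake failure", "airbag fault", "brake failure"]
unique, index = dedup_texts(texts)

# Stand-in for the encoder: one toy 2-d vector per unique text.
unique_vecs = np.array([[len(t), 1.0] for t in unique], dtype=np.float32)

# Fancy indexing expands the unique vectors back to one row per input text.
all_vecs = unique_vecs[index]

# Share of encoder calls saved for the figures reported in this section.
reduction_pct = round(100 * (1 - 5615 / 6354), 1)
```

Because the expansion is a pure index lookup, repeated texts receive exactly the same vector and the row order of the original chunk table is preserved.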

---

### 4. Finding: Structural Text Redundancy

The distribution analysis shows **notable textual repetition** across investigations (11.6 % of chunks repeat a text that already appears elsewhere), which stems from how the ODI produces its reports:

| Metric                    | Value              |
| ------------------------- | ------------------ |
| Base rows                 | 5,309              |
| Total chunks              | 6,354              |
| Unique texts embedded     | 5,615              |
| Average chunks per record | 1.20               |
| Average text length       | ≈ 2,400 characters |
| Max. observed length      | ≈ 5,900 characters |

#### Observed causes

1. **Replication across Make–Model–Year combinations:** a single investigation is associated with several vehicles.
2. **Standard institutional text:** `SUMMARY` repeats boilerplate phrasing and defect descriptions.
3. **No `NARRATIVE` field:** the 2023 version of the dataset does not include the investigator's free-text narrative, which limits lexical diversity.
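
The ratios in the table can be derived from any chunk table with `id` and `text` columns. The helper below is a hedged sketch (the name `redundancy_report` and the toy data are illustrative, not part of the original pipeline).

```python
import pandas as pd


def redundancy_report(chunks: pd.DataFrame) -> dict:
    """Summarize repetition in a chunk table with columns 'id' and 'text'."""
    n_chunks = len(chunks)
    n_records = chunks["id"].nunique()
    n_unique = chunks["text"].nunique()
    return {
        "records": n_records,
        "chunks": n_chunks,
        "unique_texts": n_unique,
        "chunks_per_record": n_chunks / n_records,
        "duplicate_share": 1 - n_unique / n_chunks,
    }


toy = pd.DataFrame({
    "id":   ["A", "A", "B", "C"],
    "text": ["x", "y", "x", "z"],
})
report = redundancy_report(toy)
```

With the counts reported for this corpus (5,309 records, 6,354 chunks, 5,615 unique texts), the same ratios are 6,354 / 5,309 ≈ 1.197 chunks per record and 1 − 5,615 / 6,354 ≈ 0.116 duplicate share.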

---

### 5. Implications and Validity

* **One embedding per distinct text is sufficient**, because the duplicates are identical texts (same MD5 hash), so a single vector represents all of them.
* **The content-hash deduplication is methodologically sound**, since every chunk keeps its own metadata row (`make`, `model`, `component`). Note that the earlier deduplication by action number retains a single Make–Model–Year row per investigation (the one with the longest text).
* **Compute cost is reduced** without loss of textual information.

The redundancy detected is itself a structural finding that helps characterize the ODI dataset.

---

### 6. Embedding Results

| Attribute                | Value                                     |
| ------------------------ | ----------------------------------------- |
| Model                    | `intfloat/multilingual-e5-large-instruct` |
| Dimension                | 1,024                                     |
| Unique texts embedded    | 5,615                                     |
| Total chunks (expanded)  | 6,354                                     |
| Final embedding size     | ≈ 24.8 MB                                 |
| Device                   | CUDA GPU (fp16)                           |
| Normalization            | Global L2                                 |
| Total embedding time     | 112.9 s                                   |

The results were stored at:

* **Embeddings:**
  `/content/drive/MyDrive/NHTSA/embeddings/investigations_e5_mlg_instruct/invest_embeddings.npy`

* **Metadata:**
  `/content/drive/MyDrive/NHTSA/embeddings/investigations_e5_mlg_instruct/invest_chunks_meta.parquet`

* **Reproducible configuration:**
  `/content/drive/MyDrive/NHTSA/embeddings/investigations_e5_mlg_instruct/config.json`
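
As a sanity check on these artifacts, the sketch below recomputes the expected matrix size and verifies row alignment and unit norms. The helper names are hypothetical, and the random matrix only stands in for the real file, which would be read with `np.load` on `invest_embeddings.npy`.

```python
import numpy as np


def float32_megabytes(n_rows: int, dim: int) -> float:
    """Size in MiB of an (n_rows, dim) float32 matrix (4 bytes per value)."""
    return n_rows * dim * 4 / 1024**2


def check_embeddings(vectors: np.ndarray, n_meta_rows: int, tol: float = 1e-3) -> None:
    """Raise AssertionError if the matrix is misaligned or not unit-normalized."""
    assert vectors.ndim == 2, "expected a 2-D matrix"
    assert vectors.shape[0] == n_meta_rows, "expected one vector per metadata row"
    norms = np.linalg.norm(vectors, axis=1)
    assert np.allclose(norms, 1.0, atol=tol), "rows should have unit L2 norm"


expected_mb = round(float32_megabytes(6354, 1024), 2)

# Stand-in for the saved matrix: random rows normalized to unit length.
rng = np.random.default_rng(0)
demo = rng.normal(size=(5, 8)).astype(np.float32)
demo /= np.linalg.norm(demo, axis=1, keepdims=True)
check_embeddings(demo, n_meta_rows=5)
```

For 6,354 rows of dimension 1,024 the computed size matches the ≈ 24.8 MB reported in the table.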

---

### 7. Conclusions

The *Investigations* embedding stage consolidates a reproducible, stable pipeline aligned with the *Complaints* and *Recalls* phases.
The main achievements are:

1. **Efficient execution with no semantic loss** (only 5,615 texts had to be encoded).
2. **Traceability preserved** through synchronized metadata and a reproducible configuration.
3. **Quantified textual redundancy**, which points to future improvements in capturing descriptive fields (`NARRATIVE`).

Taken together, this result provides a solid foundation for the next stages:

* **Vector indexing with FAISS or Neo4j**,
* **Cross-dataset semantic search**, and
* **NHTSA knowledge-graph modeling**.
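
Before a dedicated index is in place, semantic search over the stored matrix can be prototyped with a brute-force dot product. The sketch below uses plain NumPy rather than FAISS or Neo4j, and the function name `top_k_cosine` is an assumption made for illustration.

```python
import numpy as np


def top_k_cosine(query_vec, matrix, k=10):
    """Return (indices, scores) of the k rows of `matrix` most similar to
    `query_vec`. Both are assumed L2-normalized, so dot product = cosine."""
    scores = matrix @ query_vec
    k = min(k, len(scores))
    if k == 0:
        return np.empty(0, dtype=np.int64), np.empty(0, dtype=scores.dtype)
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered top-k candidates
    idx = idx[np.argsort(-scores[idx])]        # order them by descending score
    return idx, scores[idx]


# Toy corpus: 100 random unit vectors; the query is row 17 itself.
rng = np.random.default_rng(42)
corpus = rng.normal(size=(100, 8)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
hits, sims = top_k_cosine(corpus[17], corpus, k=5)
```

In real use the query vector would come from the same embedding model. The model card for `intfloat/multilingual-e5-large-instruct` describes prefixing queries with a task instruction while documents are encoded as-is, so the exact query format is worth checking there.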

---



In [2]:
# ⬇️ CELL 1: Paths, libraries, and Drive mount
from google.colab import drive
drive.mount('/content/drive')

!pip -q install -U sentence-transformers==3.0.1

import os, re, json, textwrap, hashlib, time, math
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm import tqdm

pd.set_option("display.max_colwidth", 160)


Mounted at /content/drive

In [6]:
# ⬇️ CELL 2: Utilities for corpus loading and dynamic chunking
# Input/output paths
BASE         = Path("/content/drive/MyDrive/NHTSA")
PROC_DIR     = BASE / "processed"
EMB_DIR      = BASE / "embeddings" / "investigations_e5_mlg_instruct"
EMB_DIR.mkdir(parents=True, exist_ok=True)

PARQ_IN  = PROC_DIR / "investigations.parquet"       # already-normalized table, if available
JSONL_IN = PROC_DIR / "investigations_corpus.jsonl"  # equivalent alternative

# Output files (note: the embedding cell later writes invest_embeddings.npy and invest_chunks_meta.parquet instead)
VEC_NPY   = EMB_DIR / "investigations_embeddings.npy"
META_OUT  = EMB_DIR / "investigations_chunks_meta.parquet"
MAP_CSV   = EMB_DIR / "id_map.csv"
CONF_JSON = EMB_DIR / "config.json"

def strip_html_basic(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = re.sub(r"<br\s*/?>", " ", s, flags=re.I)
    s = re.sub(r"<[^>]+>", " ", s)
    return re.sub(r"\s+", " ", s).strip()

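def normalize_year(x):
    # Accept integer-like values in [1950, 2035]; anything else becomes NaN.
    # Note: float-like strings such as "2008.0" fail int() and also map to NaN.
    try:
        v = int(str(x))
        if 1950 <= v <= 2035:
            return v
    except ValueError:
        pass
    return np.nan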

def first_nonnull(d, keys, default=None):
    for k in keys:
        if k in d and pd.notna(d[k]) and str(d[k]).strip() != "":
            return d[k]
    return default

def load_investigations_corpus(parquet_path: Path, jsonl_path: Path, dedup_by_id: bool = True) -> pd.DataFrame:
    """
    Return a base DataFrame with canonical columns:
      id        (NHTSA ACTION NUMBER)
      text      (SUBJECT + ' ' + SUMMARY)  [or JSONL['text']]
      make, model, year, component, camp_no, dateopen, dateclose (when present)
    If dedup_by_id=True, keep ONE row per id (the one with the longest text) so that
    chunks are not duplicated when an id has several Make-Model-Year rows.
    """
    if parquet_path.exists():
        df = pd.read_parquet(parquet_path).copy()
        # Normalize column names to the UPPERCASE official ODI schema
        df.columns = [c.upper() for c in df.columns]

        # ---- Investigations-specific mapping (ODI) ----
        assert "NHTSA ACTION NUMBER" in df.columns, "Falta 'NHTSA ACTION NUMBER' en investigations.parquet"
        cand_id = "NHTSA ACTION NUMBER"

        # Base text = SUBJECT + SUMMARY (Investigations has no NARRATIVE field)
        summary   = df["SUMMARY"]  if "SUMMARY"  in df.columns else pd.Series("", index=df.index)
        subject   = df["SUBJECT"]  if "SUBJECT"  in df.columns else pd.Series("", index=df.index)

        # Reuses the module-level strip_html_basic defined above
        text = (subject.fillna("") + " " + summary.fillna("")).map(strip_html_basic)

        base = pd.DataFrame({
            "id":        df[cand_id].astype(str),
            "text":      text.astype(str),
            "make":      df["MAKE"].astype(str)      if "MAKE"     in df.columns else None,
            "model":     df["MODEL"].astype(str)     if "MODEL"    in df.columns else None,
            "year":      df["YEAR"]                  if "YEAR"     in df.columns else np.nan,
            "component": df["COMPNAME"].astype(str)  if "COMPNAME" in df.columns else None,
            "camp_no":   df["CAMPNO"].astype(str)    if "CAMPNO"   in df.columns else None,
            "dateopen":  df["ODATE"]                 if "ODATE"    in df.columns else None,
            "dateclose": df["CDATE"]                 if "CDATE"    in df.columns else None,
        })

        # Light cleanup and normalization (reuses the module-level normalize_year)
        if "year" in base.columns:
            base["year"] = base["year"].map(normalize_year)

        # Note: astype(str) turns missing values into literal strings ("NONE" / "NAN" after upper())
        for c in ["make","model","component","camp_no"]:
            if c in base.columns and base[c] is not None:
                base[c] = base[c].astype(str).str.strip().str.upper()

        # ---- DEDUP by id (key to avoid duplicating chunks!) ----
        if dedup_by_id:
            base["__len"] = base["text"].str.len()
            base = (base
                    .sort_values(["id","__len"], ascending=[True, False])
                    .drop_duplicates(subset=["id"], keep="first")
                    .drop(columns="__len"))

        return base

    elif jsonl_path.exists():
        rows = []
        with open(jsonl_path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    obj = json.loads(line)
                except json.JSONDecodeError:
                    continue
                _id  = obj.get("id")
                _txt = obj.get("text")
                meta = obj.get("metadata", {}) or {}
                if not _id or not _txt:
                    continue
                rows.append({
                    "id":        str(_id),
                    "text":      strip_html_basic(str(_txt)),
                    "make":      first_nonnull(meta, ["MAKE","MAKETXT","make"]),
                    "model":     first_nonnull(meta, ["MODEL","MODELTXT","model"]),
                    "year":      normalize_year(first_nonnull(meta, ["YEAR","YEARTXT","year"])),
                    "component": first_nonnull(meta, ["COMPNAME","COMPONENT","component"]),
                    "camp_no":   first_nonnull(meta, ["CAMPNO","camp_no"]),
                    "dateopen":  first_nonnull(meta, ["ODATE","DATEOPEN","dateopen"]),
                    "dateclose": first_nonnull(meta, ["CDATE","DATECLOSE","dateclose"]),
                })
        if not rows:
            raise AssertionError("JSONL no aportó filas válidas.")
        base = pd.DataFrame(rows)
        for c in ["make","model","component","camp_no"]:
            if c in base.columns:
                base[c] = base[c].astype(str).str.strip().str.upper()
        if dedup_by_id:
            base["__len"] = base["text"].str.len()
            base = (base
                    .sort_values(["id","__len"], ascending=[True, False])
                    .drop_duplicates(subset=["id"], keep="first")
                    .drop(columns="__len"))
        return base

    else:
        raise FileNotFoundError("No se encontró ni investigations.parquet ni investigations_corpus.jsonl")


def make_chunks_dynamic(
    df_base: pd.DataFrame,
    short_max_len: int = 1200,   # short texts stay as a single chunk
    long_width:   int = 1600,    # window size for long texts
    stride:       int = 300,     # overlap between windows (the step is long_width - stride)
    show_progress: bool = True
) -> pd.DataFrame:
    """
    Build chunks with a sliding window and overlap for long texts.
    Output: ['chunk_id','id','make','model','year','component','camp_no','chunk_idx','text']
    """
    assert {"id","text"}.issubset(df_base.columns), "df_base debe incluir 'id' y 'text'."
    rows = []
    it = tqdm(df_base.itertuples(index=False), total=len(df_base), desc="Chunking dinámico") if show_progress else df_base.itertuples(index=False)

    # Tolerant column mapping (lookup by lowercase name)
    cols = {c.lower(): c for c in df_base.columns}
    c_id   = cols.get("id")
    c_txt  = cols.get("text")
    c_make = cols.get("make")
    c_model= cols.get("model")
    c_year = cols.get("year")
    c_comp = cols.get("component")
    c_camp = cols.get("camp_no")

    step = max(1, long_width - stride)

    for r in it:
        rid   = str(getattr(r, c_id))
        text  = str(getattr(r, c_txt) or "")
        make  = (getattr(r, c_make)  if c_make  else None)
        model = (getattr(r, c_model) if c_model else None)
        year  = (getattr(r, c_year)  if c_year  else None)
        comp  = (getattr(r, c_comp)  if c_comp  else None)
        camp  = (getattr(r, c_camp)  if c_camp  else None)

        L = len(text)
        if L == 0:
            continue

        if L <= short_max_len:
            pieces = [text]
        else:
            pieces, start = [], 0
            while start < L:
                end = min(start + long_width, L)
                pieces.append(text[start:end])
                if end >= L:
                    break
                start += step

        for j, ch in enumerate(pieces):
            rows.append({
                "chunk_id":  f"{rid}::ch{j}",
                "id":        rid,
                "make":      make,
                "model":     model,
                "year":      year,
                "component": comp,
                "camp_no":   camp,
                "chunk_idx": j,
                "text":      ch.strip()
            })
    return pd.DataFrame.from_records(rows)


In [7]:
from pathlib import Path
import pandas as pd
import numpy as np
import re

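# --- helpers already used above ---
def normalize_year(x):
    # Same accepted range as the other year helpers in this notebook: [1950, 2035]
    try:
        v = int(str(x))
        if 1950 <= v <= 2035:
            return v
    except ValueError:
        pass
    return np.nan

def strip_html_basic(s: str) -> str:
    if pd.isna(s):
        return s
    s = re.sub(r"<br\s*/?>", " ", str(s), flags=re.I)
    s = re.sub(r"<[^>]+>", " ", s)  # replace tags with a space so adjacent words are not glued together
    s = re.sub(r"\s+", " ", s).strip()
    return s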

def _safe_series(df, col_name):
    """Return the column as a string Series if it exists; otherwise an empty-string Series with one entry per row."""
    if col_name in df.columns:
        return df[col_name].fillna("").astype(str)
    else:
        return pd.Series([""] * len(df), index=df.index, dtype=str)

def load_investigations_corpus(parquet_path: Path, jsonl_path: Path) -> pd.DataFrame:
    """
    Standardize to a minimal set of canonical columns and keep ONE row per id:
      - id   = 'NHTSA ACTION NUMBER'
      - text = SUBJECT + ' ' + SUMMARY
      - make, model, year, component, camp_no (when present) -> informational only here
    """
    if parquet_path.exists():
        df = pd.read_parquet(parquet_path).copy()
        df.columns = [c.upper().strip() for c in df.columns]
        print("Columnas detectadas:", df.columns.tolist())

        # 1) Correct ID column (not CAMPNO)
        id_candidates = [
            "NHTSA ACTION NUMBER", "NHTSA_ACTION_NUMBER",
            "ODINUMBER", "ACTIONNUMBER", "ACTION_NUMBER",
            "INV_ID", "INVNUM", "INV_NO", "ID"
        ]
        cand_id = next((c for c in id_candidates if c in df.columns), None)
        assert cand_id is not None, "No se encontró columna de ID compatible (p.ej. 'NHTSA ACTION NUMBER')."

        # 2) Base text: SUBJECT + SUMMARY (Investigations has no NARRATIVE field)
        subj = df["SUBJECT"].fillna("").astype(str) if "SUBJECT" in df.columns else pd.Series("", index=df.index)
        summ = df["SUMMARY"].fillna("").astype(str) if "SUMMARY" in df.columns else pd.Series("", index=df.index)

        def _strip(s: str) -> str:
            s = re.sub(r"<br\s*/?>", " ", s, flags=re.I)
            s = re.sub(r"<[^>]+>", " ", s)
            return re.sub(r"\s+", " ", s).strip()

        text = (subj + " " + summ).map(_strip)

        base = pd.DataFrame({
            "id":        df[cand_id].astype(str),
            "text":      text.astype(str),
            "make":      df["MAKE"].astype(str)      if "MAKE"     in df.columns else None,
            "model":     df["MODEL"].astype(str)     if "MODEL"    in df.columns else None,
            "year":      df["YEAR"]                  if "YEAR"     in df.columns else np.nan,
            "component": df["COMPNAME"].astype(str)  if "COMPNAME" in df.columns else None,
            "camp_no":   df["CAMPNO"].astype(str)    if "CAMPNO"   in df.columns else None,
            "dateopen":  df["ODATE"]                 if "ODATE"    in df.columns else None,
            "dateclose": df["CDATE"]                 if "CDATE"    in df.columns else None,
        })

        # Light normalization
        def _norm_year(x):
            try:
                v = int(str(x))
            except ValueError:
                return np.nan
            return v if 1950 <= v <= 2035 else np.nan

        if "year" in base.columns:
            base["year"] = base["year"].map(_norm_year)

        for c in ["make","model","component","camp_no"]:
            if c in base.columns and base[c] is not None:
                base[c] = base[c].astype(str).str.strip().str.upper()

        base["text"] = base["text"].astype(str).str.replace(r"\s+", " ", regex=True).str.strip()
        base = base[base["text"].str.len() > 0].reset_index(drop=True)

        # 3) DEDUP by id: keep ONE row per id (longest text)
        base["__len"] = base["text"].str.len()
        base = (base
                .sort_values(["id","__len"], ascending=[True, False])
                .drop_duplicates(subset=["id"], keep="first")
                .drop(columns="__len")
                .reset_index(drop=True))
        return base

    elif jsonl_path.exists():
        import json
        rows = []
        with open(jsonl_path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    obj = json.loads(line)
                except json.JSONDecodeError:
                    continue
                _id  = obj.get("id")
                _txt = obj.get("text")
                meta = obj.get("metadata", {}) or {}
                if not _id or not _txt:
                    continue
                rows.append({
                    "id": str(_id),
                    "text": strip_html_basic(str(_txt)),
                    "make": (meta.get("MAKE") or meta.get("MAKETXT") or meta.get("make")),
                    "model": (meta.get("MODEL") or meta.get("MODELTXT") or meta.get("model")),
                    "year": normalize_year(meta.get("YEAR") or meta.get("YEARTXT") or meta.get("year")),
                    "component": (meta.get("COMPNAME") or meta.get("COMPONENT") or meta.get("component")),
                    "camp_no": (meta.get("CAMPNO") or meta.get("camp_no")),
                    "dateopen": (meta.get("ODATE") or meta.get("DATEOPEN") or meta.get("dateopen")),
                    "dateclose": (meta.get("CDATE") or meta.get("DATECLOSE") or meta.get("dateclose")),
                })
        assert rows, "JSONL no aportó filas válidas."
        base = pd.DataFrame(rows)
        for c in ["make","model","component","camp_no"]:
            if c in base.columns:
                base[c] = base[c].astype(str).str.strip().str.upper()
        base = base[base["text"].str.len() > 0].reset_index(drop=True)

        # DEDUP by id (same criterion: longest text)
        base["__len"] = base["text"].str.len()
        base = (base
                .sort_values(["id","__len"], ascending=[True, False])
                .drop_duplicates(subset=["id"], keep="first")
                .drop(columns="__len")
                .reset_index(drop=True))
        return base

    else:
        raise FileNotFoundError("No se encontró ni investigations.parquet ni investigations_corpus.jsonl")



In [9]:
# --- Imports and paths ---
from pathlib import Path
import pandas as pd

# (uses make_chunks_dynamic as defined above, unchanged)

# ⬇️ Deduplicated load + chunking
PARQ_IN  = Path("/content/drive/MyDrive/NHTSA/processed/investigations.parquet")
JSONL_IN = Path("/content/drive/MyDrive/NHTSA/processed/investigations_corpus.jsonl")

# 1) Load the base table deduplicated by id (SUBJECT + SUMMARY) with the corrected load_investigations_corpus
invest_base = load_investigations_corpus(PARQ_IN, JSONL_IN)
assert invest_base["id"].is_unique, "Hay IDs repetidos: revisa load_investigations_corpus (dedup)."
print("Corpus base (investigations):", invest_base.shape)
display(invest_base.head(3)[["id","text"]])

# 2) Chunk (on the already-deduplicated base)
invest_ch = make_chunks_dynamic(
    invest_base,
    short_max_len=1200,   # single chunk if the text is reasonably short
    long_width=1600,      # window for long texts
    stride=300,           # overlap
    show_progress=True
)
print("Chunks generados:", invest_ch.shape)

# 3) Sanity checks on chunks and ordering
assert invest_ch["chunk_id"].is_unique, "chunk_id duplicados: revisa unicidad de id antes de chunkear."
invest_ch = invest_ch.sort_values(["id","chunk_idx"]).reset_index(drop=True)
display(invest_ch.head(5))

# 4) Save to disk (overwrites files with the same names)
OUT_DIR_INV = Path("/content/drive/MyDrive/NHTSA/processed")
OUT_DIR_INV.mkdir(parents=True, exist_ok=True)

CHUNK_PARQ = OUT_DIR_INV / "investigations_chunks.parquet"
CHUNK_MAP  = OUT_DIR_INV / "investigations_chunks_map.csv"

# For a minimal parquet, keep only these columns:
# invest_ch_save = invest_ch[["chunk_id","id","chunk_idx","text"]]
# invest_ch_save.to_parquet(CHUNK_PARQ, index=False)
# invest_ch_save[["chunk_id","id","text"]].to_csv(CHUNK_MAP, index=False)

# To keep the metadata carried by the row selected in invest_base:
invest_ch.to_parquet(CHUNK_PARQ, index=False)
invest_ch[["chunk_id","id","text"]].to_csv(CHUNK_MAP, index=False)

print("✅ Guardado:", CHUNK_PARQ, "|", CHUNK_MAP)


Columnas detectadas: ['NHTSA ACTION NUMBER', 'MAKE', 'MODEL', 'YEAR', 'COMPNAME', 'MFR_NAME', 'ODATE', 'CDATE', 'CAMPNO', 'SUBJECT', 'SUMMARY', '__SOURCE_FILE__', 'SUBJECT_LEN', 'SUMMARY_LEN', 'COMPNAME_LEN', 'TEXT_TOTAL_LEN']
Corpus base (investigations): (5309, 9)


|   | id | text |
| --- | --- | --- |
| 0 | AQ08001 | PACE AMERICAN 573 RETRACTION BY LETTER DATED NOVEMBER 8, 2007, PACE AMERICAN NOTIFIED NHTSA THAT A DEFECT WHICH RELATED TO MOTOR VEHICLE SAFETY EXISTED IN 1... |
| 1 | AQ09001 | HID REPLACEMENT KIT RECALL CAMPAIGNS RMD IDENTIFIED SEVERAL HID REPLACEMENT LIGHTING RECALL CAMPAIGNS THAT APPEARED NOT TO HAVE BEEN CONDUCTED OR CONDUCTED ... |
| 2 | AQ09002 | Monaco RV Recalls Responsiiblity NHTSA opened this investigation to review issues in connection with recalls initiated by Monaco Coach Corporation (Monaco C... |


Chunking: 100%|██████████| 5309/5309 [00:00<00:00, 155251.45it/s]

Chunks generados: (6354, 9)





|   | chunk_id | id | make | model | year | component | camp_no | chunk_idx | text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AQ08001::ch0 | AQ08001 | PACE AMERICAN | TRAILER |  | WHEELS | NONE | 0 | PACE AMERICAN 573 RETRACTION BY LETTER DATED NOVEMBER 8, 2007, PACE AMERICAN NOTIFIED NHTSA THAT A DEFECT WHICH RELATED TO MOTOR VEHICLE SAFETY EXISTED IN 1... |
| 1 | AQ09001::ch0 | AQ09001 | CAPCEN | 9005 |  | EXTERIOR LIGHTING | 06E027000 | 0 | HID REPLACEMENT KIT RECALL CAMPAIGNS RMD IDENTIFIED SEVERAL HID REPLACEMENT LIGHTING RECALL CAMPAIGNS THAT APPEARED NOT TO HAVE BEEN CONDUCTED OR CONDUCTED ... |
| 2 | AQ09002::ch0 | AQ09002 | HOLIDAY RAMBLER | ATLANTIS SE |  | ELECTRICAL SYSTEM:WIRING:FUSES AND CIRCUIT BREAKERS | NONE | 0 | Monaco RV Recalls Responsiiblity NHTSA opened this investigation to review issues in connection with recalls initiated by Monaco Coach Corporation (Monaco C... |
| 3 | AQ09002::ch1 | AQ09002 | HOLIDAY RAMBLER | ATLANTIS SE |  | ELECTRICAL SYSTEM:WIRING:FUSES AND CIRCUIT BREAKERS | NONE | 1 | fies that a manufacturer?s recall obligations ?shall be treated as a claim of the United States Government against such manufacturer . . . , and given prior... |
| 4 | AQ10001::ch0 | AQ10001 | DODGE | DURANGO |  | UNKNOWN OR OTHER | NONE | 0 | Fleet Vehicle Recall Completion Audit On November 18, 2010, NHTSA's Recall Management Division (RMD) opened an audit query (AQ10-001) to investigate recall ... |


✅ Guardado: /content/drive/MyDrive/NHTSA/processed/investigations_chunks.parquet | /content/drive/MyDrive/NHTSA/processed/investigations_chunks_map.csv


In [10]:
# ─────────────────────────────────────────────────────────────────────────────
# CELL · Efficient embedding (dedupe + fp16 + adaptive batch + shards)
# ─────────────────────────────────────────────────────────────────────────────
import os, hashlib, math, json, gc, time, re
import numpy as np
import pandas as pd
from contextlib import nullcontext  # no-op context manager used when running on CPU
from pathlib import Path
from tqdm import tqdm

# IMPORTANT: set before the first CUDA use to reduce memory fragmentation
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True,max_split_size_mb:128")

import torch
from sentence_transformers import SentenceTransformer

# === Tunable parameters ===
MODEL_NAME   = "intfloat/multilingual-e5-large-instruct"
DEVICE       = "cuda" if torch.cuda.is_available() else "cpu"
MAX_TOKENS   = 480            # model-level token truncation
NORMALIZE    = True
INIT_BS      = 256 if DEVICE=="cuda" else 32   # initial batch size
MIN_BS       = 32                                # floor for the adaptive batch size
SHARD_EVERY  = 50_000
OUT_DIR      = Path("/content/drive/MyDrive/NHTSA/embeddings/investigations_e5_mlg_instruct")
OUT_DIR.mkdir(parents=True, exist_ok=True)

VEC_NPY   = OUT_DIR/"invest_embeddings.npy"
META_PARQ = OUT_DIR/"invest_chunks_meta.parquet"
CONF_JSON = OUT_DIR/"config.json"

# Minimal sanity check
assert "text" in invest_ch.columns and "id" in invest_ch.columns, "Faltan columnas 'id' y 'text' en invest_ch."
invest_ch = invest_ch.sort_values(["id","chunk_idx"]).reset_index(drop=True)

# === 1) Load model ===
t0 = time.time()
model = SentenceTransformer(MODEL_NAME, device=DEVICE)
if DEVICE == "cuda":
    try:
        torch.set_float32_matmul_precision("high")
    except Exception:
        pass
model.max_seq_length = MAX_TOKENS
model.eval()
load_s = time.time() - t0
print(f"Modelo cargado en {load_s:.2f}s | device={DEVICE} | max_seq_len={model.max_seq_length}")

# === 2) Prepare texts + deduplicate by content ===
texts = invest_ch["text"].astype(str).tolist()
n = len(texts)
print(f"Total de chunks a encajar: {n:,}")

def fast_hash(s: str) -> str:
    return hashlib.md5(s.encode("utf-8", errors="ignore")).hexdigest()

t_hash = time.time()
text_hashes = [fast_hash(t) for t in texts]
uniq_map = {}
uniq_texts = []
for i, h in enumerate(text_hashes):
    if h not in uniq_map:
        uniq_map[h] = len(uniq_texts)
        uniq_texts.append(texts[i])
n_uniq = len(uniq_texts)
print(f"Textos únicos: {n_uniq:,} ({n_uniq/n:.1%} del total) | hashing: {time.time()-t_hash:.1f}s")

# === 3) Embedding with adaptive batch size + fp16 autocast on CUDA ===
def batched_encode_adaptive(text_list, init_bs=INIT_BS, min_bs=MIN_BS):
    vecs = []
    i = 0
    cur_bs = init_bs
    pbar = tqdm(total=len(text_list), desc="Embedding únicos", unit="row", mininterval=0.5)
    with torch.inference_mode():
        while i < len(text_list):
            bs = min(cur_bs, len(text_list) - i)
            batch = text_list[i:i+bs]
            try:
                ctx = (torch.amp.autocast("cuda", dtype=torch.float16) if DEVICE=="cuda" else nullcontext())
                with ctx:
                    v = model.encode(
                        batch,
                        batch_size=bs,               # use the current batch size
                        convert_to_numpy=True,
                        show_progress_bar=False,
                        normalize_embeddings=False   # normalize once at the end
                    ).astype("float32", copy=False)
                vecs.append(v)
                i += bs
                pbar.update(bs)

                # memory hygiene
                del batch, v
                gc.collect()
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()

                # if VRAM is plentiful, try stepping the batch size back up (capped at init_bs)
                if DEVICE == "cuda":
                    free, total = torch.cuda.mem_get_info()
                    if free/total > 0.50 and cur_bs < init_bs:
                        cur_bs = min(init_bs, cur_bs*2)

            except RuntimeError as e:
                msg = str(e).lower()
                if "out of memory" in msg or "cuda" in msg:
                    # shrink the batch and retry the SAME index
                    if torch.cuda.is_available():
                        torch.cuda.empty_cache()
                    new_bs = max(min_bs, bs//2)
                    if new_bs < bs:
                        cur_bs = new_bs
                        continue
                    else:
                        raise
                else:
                    raise
    pbar.close()
    V = np.vstack(vecs) if vecs else np.empty((0, model.get_sentence_embedding_dimension()), dtype="float32")
    if NORMALIZE and len(V):
        norms = np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
        V = (V / norms).astype("float32", copy=False)
    return V

# nullcontext for CPU (must be imported before the first call to batched_encode_adaptive)
from contextlib import nullcontext

t1 = time.time()
uniq_vecs = batched_encode_adaptive(uniq_texts, init_bs=INIT_BS, min_bs=MIN_BS)
embed_s = time.time() - t1
dim = 0 if uniq_vecs.ndim==1 else uniq_vecs.shape[1]
print(f"Encaje únicos → shape={uniq_vecs.shape}, dim={dim}, t={embed_s:.1f}s")

# === 4) Expand unique vectors → all chunks ===
idx_of_hash = np.fromiter((uniq_map[h] for h in text_hashes), dtype=np.int32, count=len(text_hashes))
X = uniq_vecs[idx_of_hash]
assert X.shape[0] == n, "El mapeo de únicos no coincide con el total de filas."

# === 5) Persistence (sharded if needed) ===
if n > SHARD_EVERY:
    parts = []
    n_shards = math.ceil(n / SHARD_EVERY)
    for s in range(n_shards):
        a, b = s*SHARD_EVERY, min((s+1)*SHARD_EVERY, n)
        shard_path = OUT_DIR/f"invest_embeddings_part{s+1:03d}.npy"
        np.save(shard_path, X[a:b])
        parts.append(str(shard_path))
    with open(OUT_DIR/"shards.json", "w") as f:
        json.dump({"parts": parts, "dim": int(dim), "total": int(n)}, f, indent=2)
    print(f"Guardado en {len(parts)} shards. Índice → {OUT_DIR/'shards.json'}")
else:
    np.save(VEC_NPY, X)
    print(f"Guardado → {VEC_NPY} (≈ {X.nbytes/1024**2:.2f} MB)")

# === 6) Metadata + config ===
meta_df = invest_ch.copy()
# Store the embedding model in its own column so the vehicle "model" column is not overwritten
meta_df["embed_model"] = MODEL_NAME
meta_df["device"] = DEVICE
meta_df["dim"]    = int(dim)
meta_df["ts"]     = pd.Timestamp.utcnow()
meta_df.to_parquet(META_PARQ, index=False)

with open(CONF_JSON, "w") as f:
    json.dump({
        "model": MODEL_NAME,
        "device": DEVICE,
        "init_batch_size": INIT_BS,
        "min_batch_size": MIN_BS,
        "normalize": NORMALIZE,
        "max_seq_len": MAX_TOKENS,
        "n_chunks": int(n),
        "n_unique_texts": int(n_uniq),
        "load_time_s": round(load_s, 2),
        "embed_time_s": round(embed_s, 2),
        "vectors_file": (str(VEC_NPY) if n <= SHARD_EVERY else "sharded"),
        "meta_file": str(META_PARQ)
    }, f, indent=2)

print("Meta parquet:", META_PARQ)
print("Config     :", CONF_JSON)


Modelo cargado en 56.61s | device=cuda | max_seq_len=480
Total de chunks a encajar: 6,354
Textos únicos: 5,615 (88.4% del total) | hashing: 0.0s


Embedding únicos: 100%|██████████| 5615/5615 [01:52<00:00, 49.76row/s]


Encaje únicos → shape=(5615, 1024), dim=1024, t=112.9s
Guardado → /content/drive/MyDrive/NHTSA/embeddings/investigations_e5_mlg_instruct/invest_embeddings.npy (≈ 24.82 MB)
Meta parquet: /content/drive/MyDrive/NHTSA/embeddings/investigations_e5_mlg_instruct/invest_chunks_meta.parquet
Config     : /content/drive/MyDrive/NHTSA/embeddings/investigations_e5_mlg_instruct/config.json
