## Ingestion and Normalization Stage for NHTSA *Complaints* Data

### 1. Overview

This stage covered the **ingestion, cleaning, and normalization of the consumer complaints dataset** (*Complaints*) maintained by the **Office of Defects Investigation (ODI)** of the *National Highway Traffic Safety Administration* (NHTSA).
The main goal was to give the historical dataset (model years 1949–2025) a consistent, reproducible structure, so that it can feed a **knowledge graph** and semantic analysis tasks based on *embeddings* and language models.

The source file, `FLAT_CMPL.zip`, was downloaded from the official NHTSA portal and contains the consolidated database of complaints about vehicles, tires, child safety seats, and related equipment. Each record is a single complaint filed by a consumer and may include technical vehicle information, a description of the incident, the affected component, occurrence and receipt dates, and geographic metadata.

---

### 2. Ingestion Procedure

The file was read with a **controlled error-tolerance scheme (Plan A/B)**.

* **Plan A** attempted a read faithful to the original tab-delimited format (`sep='\t'`, `quotechar='"'`), preserving quotes and escapes.
* **Plan B**, triggered automatically when Plan A raised a parsing error (for example, unbalanced quotes or badly closed fields), used `quoting=csv.QUOTE_NONE` and `on_bad_lines="warn"`, trading syntactic fidelity for robustness.

The file was processed entirely under **Plan B**, loading all **2,137,711 records** across the **49 official fields** defined in *Appendix A* of the NHTSA documentation, plus **4 additional fields** for traceability and metrics. No records were discarded.
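
As a minimal sketch of why Plan B tolerates stray quotes, the snippet below parses a made-up two-line TSV sample in which one field contains an unbalanced `"`. The sample data and the column names `id` and `text` are illustrative assumptions, not rows from `FLAT_CMPL.txt`.

```python
import csv
import io

import pandas as pd

# Made-up sample: the first row carries an unbalanced double quote.
sample = b'1\tDRIVER SAID "STOP\n2\tNO ISSUE\n'

# Plan B settings: with QUOTE_NONE, quotes are treated as ordinary characters.
df = pd.read_csv(
    io.BytesIO(sample),
    sep="\t",
    engine="python",
    quoting=csv.QUOTE_NONE,
    header=None,
    names=["id", "text"],   # hypothetical column names
    dtype=str,
    on_bad_lines="warn",
)

print(df.shape)
print(df.loc[0, "text"])
```

With quoting disabled, the quote character stays in the field value instead of opening a quoted region, so the row boundaries are decided by tabs and newlines alone.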

---

### 3. Applying the Official Schema

During ingestion, the **official headers** defined in the 2021 specification (49 variables in exact order) were **forced** by reading with `header=None` and an explicit `names` list. This prevents the parser from consuming the first data row as a header, a common problem in earlier versions of the pipeline.
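
The effect of forcing the header can be illustrated with a small sketch. The two-row sample below is made up, and only the first two official field names are used.

```python
import io

import pandas as pd

# Made-up headerless sample with two data rows.
sample = b"1\t958241\n2\t958130\n"

# Default behaviour: the first row is consumed as the header.
inferred = pd.read_csv(io.BytesIO(sample), sep="\t", dtype=str)

# Forced schema: every row is kept as data.
forced = pd.read_csv(
    io.BytesIO(sample),
    sep="\t",
    header=None,
    names=["CMPLID", "ODINO"],
    dtype=str,
)

print(len(inferred), list(inferred.columns))
print(len(forced), list(forced.columns))
```

The inferred read ends up with one row and the first record's values as column names, while the forced read keeps both rows.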

Key variables include:

* **Identifiers:** `CMPLID`, `ODINO`, `VIN`
* **Vehicle characteristics:** `MAKETXT`, `MODELTXT`, `YEARTXT`, `DRIVE_TRAIN`, `FUEL_TYPE`
* **Incident data:** `CRASH`, `FIRE`, `FAILDATE`, `DATEA`, `LDATE`
* **Narrative description:** `CDESCR` (free text)
* **Technical component:** `COMPDESC` (hierarchical structure with subcomponents)
* **Location:** `CITY`, `STATE`
* **Additional metadata:** `__SOURCE_FILE__` and the text-length metrics (`CDESCR_LEN`, `COMPDESC_LEN`, `TEXT_TOTAL_LEN`)

The column order and names were checked against *Appendix A*, which keeps the output compatible with external analysis scripts and other NHTSA datasets (e.g., *Recalls*, *Investigations*).

---

### 4. Normalization and Typing

A set of systematic transformations was applied:

* **Dates (`YYYYMMDD`)**: converted to `datetime` in `FAILDATE`, `DATEA`, `LDATE`, `PURCH_DT`, and `MANUF_DT`; unparseable values become null.
* **Boolean fields (`Y/N`)**: standardized to binary values (`1` = *Yes*, `0` = *No*) in `CRASH`, `FIRE`, `POLICE_RPT_YN`, `ANTI_BRAKES_YN`, among others.
* **Model years (`YEARTXT`)**: converted to integer with a plausible-range check (1949–2035); `9999` is replaced by null.
* **Numeric fields**: safe coercion to numeric values in `INJURED`, `DEATHS`, `OCCURENCES`, `MILES`, etc.
* **Narrative text**: basic removal of HTML tags and redundant whitespace in `CDESCR` and `COMPDESC`, plus length metrics (`_LEN` columns and `TEXT_TOTAL_LEN`).
* **VIN**: uppercased, with non-alphanumeric characters removed.
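
The transformations above can be sketched on a toy frame as follows. All values are made up. The VIN step here keeps missing values as missing by using the `.str` accessor directly; that is an assumption about the desired behaviour, not a copy of the notebook's function.

```python
import pandas as pd

# Toy frame using the raw string conventions of the flat file (illustrative values only).
raw = pd.DataFrame({
    "FAILDATE": ["19941222", "00000000", None],
    "CRASH": ["Y", "N", None],
    "YEARTXT": ["1992", "9999", "1987"],
    "VIN": ["1fapp6045nh", None, "2B4-GK4535MR"],
})

out = pd.DataFrame()

# Dates: strict YYYYMMDD parsing; invalid or missing values become NaT.
out["FAILDATE"] = pd.to_datetime(raw["FAILDATE"], format="%Y%m%d", errors="coerce")

# Y/N flags: map to a nullable integer so that missing stays missing.
out["CRASH"] = raw["CRASH"].str.strip().str.upper().map({"Y": 1, "N": 0}).astype("Int64")

# Model year: numeric coercion plus a plausible-range filter (9999 falls outside it).
year = pd.to_numeric(raw["YEARTXT"], errors="coerce")
out["YEARTXT"] = year.where(year.between(1949, 2035)).astype("Int64")

# VIN: uppercase and strip non-alphanumerics, leaving missing values untouched.
out["VIN"] = raw["VIN"].str.upper().str.replace(r"[^A-Z0-9]", "", regex=True)

print(out)
```

Using nullable dtypes (`Int64`, `datetime64`) lets later quality checks count missing values with `notna()` instead of comparing against sentinel strings.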

---

### 5. Quality Control

The consistency checks included:

* **Full coverage of narrative fields:** 100 % of records have a non-null `CDESCR`.
* **Integrity of key dates:** `DATEA` and `LDATE` have 100 % coverage; `FAILDATE` has more than 97 %.
* **Validity of model years:** effective range 1949–2025.
* **Plan usage:** Plan B for 100 % of the files processed (a single file, `FLAT_CMPL.txt`), with one non-critical warning.
* **Structural consistency:** no duplicated headers and no anomalous columns (e.g., “UNNAMED: n”).

The final results were stored in two formats:

* **Parquet:** `/content/drive/MyDrive/NHTSA/processed/complaints.parquet`
* **JSONL:** `/content/drive/MyDrive/NHTSA/processed/complaints.jsonl`

Both formats carry the same content and are suitable for indexing in semantic search engines or knowledge graphs.
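
For a frame of this size, writing JSONL row by row can be slow. A possible vectorized alternative, shown here only as a sketch on a made-up two-row frame, is `DataFrame.to_json` with `orient="records"` and `lines=True`. Note that `date_format="iso"` produces ISO 8601 timestamps, which is not the same string format a plain `str()` conversion of a timestamp would give.

```python
import json

import pandas as pd

# Toy frame with a missing date and a missing nullable integer (illustrative values only).
df = pd.DataFrame({
    "CMPLID": ["1", "2"],
    "FAILDATE": pd.to_datetime(["1994-12-22", None]),
    "YEARTXT": pd.array([1992, None], dtype="Int64"),
})

# One JSON object per line; missing values are serialized as null.
jsonl_text = df.to_json(orient="records", lines=True, date_format="iso", force_ascii=False)

# Round-trip check: each non-empty line parses as a standalone JSON object.
records = [json.loads(line) for line in jsonl_text.splitlines() if line]
print(records)
```

Passing a file path as the first argument of `to_json` writes the same text to disk instead of returning it.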

---

### 6. Deriving Thematic Views

Complementary views were built from the normalized dataset for integration into Neo4j:

1. **`complaints_MMY.csv`**
   Contains the unique *Make–Model–Year* combinations with a canonical identifier `MMY_ID`, standardized to uppercase and free of duplicates.

2. **`complaints_components.csv`**
   Splits the `COMPDESC` field into component hierarchy levels (`COMP_L1`, `COMP_L2`, `COMP_L3`), which allows vehicle subsystems to be modeled structurally.

3. **`complaints_ids.csv`**
   Minimal set of identifiers (`CMPLID`, `ODINO`, `VIN`) for linking to other entities in the graph.

4. **`complaints_corpus.jsonl`** *(optional)*
   Text corpus that combines `CDESCR` with technical metadata (`MAKETXT`, `MODELTXT`, `COMPDESC`, etc.), designed for generating *embeddings* and for supervised semantic analysis.
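
The first two views can be sketched on a toy frame as shown below. The rows are made up, and the `MAKE|MODEL|YEAR` layout of `MMY_ID` is a hypothetical choice for illustration; the report does not state the exact key format.

```python
import pandas as pd

# Toy rows; column names follow the normalized dataset, values are illustrative.
df = pd.DataFrame({
    "MAKETXT": ["Ford", "FORD ", "DODGE"],
    "MODELTXT": ["Thunderbird", "THUNDERBIRD", "600"],
    "YEARTXT": pd.array([1992, 1992, 1987], dtype="Int64"),
    "COMPDESC": [
        "FUEL SYSTEM, GASOLINE:DELIVERY",
        "SEATS",
        "FUEL SYSTEM, GASOLINE:STORAGE:TANK ASSEMBLY",
    ],
})

# --- Make-Model-Year view ---
mmy = df[["MAKETXT", "MODELTXT", "YEARTXT"]].dropna().copy()
for col in ["MAKETXT", "MODELTXT"]:
    mmy[col] = mmy[col].str.strip().str.upper()
mmy = mmy.drop_duplicates().reset_index(drop=True)
# Hypothetical key format: MAKE|MODEL|YEAR.
mmy["MMY_ID"] = mmy["MAKETXT"] + "|" + mmy["MODELTXT"] + "|" + mmy["YEARTXT"].astype(str)

# --- Component hierarchy view: split on ":" into at most three levels ---
parts = df["COMPDESC"].str.split(":", n=2, expand=True).reindex(columns=range(3))
parts.columns = ["COMP_L1", "COMP_L2", "COMP_L3"]

print(mmy)
print(parts)
```

Normalizing case and whitespace before `drop_duplicates` is what collapses spelling variants of the same make and model into a single row.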

---

### 7. Conclusions

The *Complaints* processing stage produced a **reproducible, robust pipeline aligned with the official NHTSA specifications**, covering the full dataset with no remaining header or column anomalies.

The result is a **consistent, typed** base that is suitable both for descriptive and exploratory analysis and for inclusion in a **multimodal graph model**, in which each complaint can be linked to vehicles, components, investigations, and recall campaigns (*Recalls*).
The process is **traceable, reproducible, and compatible** with the subsequent knowledge-analysis and predictive-modeling stages.

---


In [1]:
# === Mount Drive (if you will use files stored in your Google Drive) ===
from google.colab import drive
drive.mount('/content/drive')

# === Libraries ===
import io, os, re, csv, zipfile, json, textwrap, unicodedata
from pathlib import Path
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 150)
pd.set_option("display.width", 220)

# === Parsing and cleaning utilities ===
def strip_html_basic(s: str) -> str:
    if pd.isna(s):
        return s
    s = re.sub(r"<br\s*/?>", " ", str(s), flags=re.I)
    s = re.sub(r"<[^>]+>", "", s)        # simple tag stripper
    s = re.sub(r"\s+", " ", s).strip()
    return s

def normalize_spaces(s: str) -> str:
    if pd.isna(s):
        return s
    s = unicodedata.normalize("NFKC", str(s))
    s = re.sub(r"\s+", " ", s).strip()
    return s

def coerce_int_with_range(x, lo=None, hi=None):
    try:
        v = int(float(str(x)))  # handles "1999.0"
        if lo is not None and v < lo: return np.nan
        if hi is not None and v > hi: return np.nan
        return v
    except (ValueError, OverflowError):
        # Non-numeric text, NaN, or infinity: treat as missing
        return np.nan

def to_datetime_safe(s, dayfirst=False):
    return pd.to_datetime(s, errors="coerce", dayfirst=dayfirst)

def to_numeric_safe(s):
    return pd.to_numeric(s, errors="coerce")

def compute_len(s):
    if pd.isna(s): return 0
    return len(str(s))


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
def read_tab_robust(raw_bytes: bytes, expected_cols=None, encoding="utf-8"):
    """
    Tries Plan A first (engine='python', quotechar='"').
    If that fails (for example, on unescaped quotes), falls back to Plan B (QUOTE_NONE).
    Returns (df, plan_used, warn_msgs).
    """
    warn_msgs = []

    # --- Plan A ---
    try:
        df = pd.read_csv(io.BytesIO(raw_bytes),
                         sep="\t",
                         engine="python",     # important: 'python' ignores low_memory
                         quotechar='"',
                         encoding=encoding,
                         on_bad_lines="error")
        plan = "A"
        return df, plan, warn_msgs
    except Exception as eA:
        warn_msgs.append(f"[Plan A] Failed: {type(eA).__name__}: {eA}")

    # --- Plan B ---
    try:
        df = pd.read_csv(io.BytesIO(raw_bytes),
                         sep="\t",
                         engine="python",           # keep the python engine
                         quoting=csv.QUOTE_NONE,    # no quote handling
                         escapechar="\\",           # allows escaping tabs/quotes
                         encoding=encoding,
                         on_bad_lines="warn")       # does not abort the read
        plan = "B"
        return df, plan, warn_msgs
    except Exception as eB:
        warn_msgs.append(f"[Plan B] Failed: {type(eB).__name__}: {eB}")
        raise RuntimeError("Both plans failed. Check the source file.\n" + "\n".join(warn_msgs)) from eB


def read_all_tabs_from_zip(zip_path: str, encoding="utf-8"):
    """
    Reads every *.txt / *.tsv file inside a ZIP with the robust reader.
    Concatenates the frames row-wise, keeping the union of columns so that
    extra columns are preserved.
    """
    frames = []
    plan_stats = {"A": 0, "B": 0}
    warnings = []

    # The context manager closes the archive once all members are read
    with zipfile.ZipFile(zip_path, "r") as zf:
        for name in zf.namelist():
            if not re.search(r"\.(txt|tsv)$", name, flags=re.I):
                continue
            raw = zf.read(name)
            df, plan, warn_msgs = read_tab_robust(raw, encoding=encoding)
            plan_stats[plan] = plan_stats.get(plan, 0) + 1
            warnings.extend([f"{name}: {m}" for m in warn_msgs])
            df["__SOURCE_FILE__"] = name
            frames.append(df)

    if not frames:
        raise RuntimeError("No .txt/.tsv files were found in the ZIP.")
    # Union of columns so that no extra column is lost
    wide = pd.concat(frames, ignore_index=True, sort=False)
    return wide, plan_stats, warnings


In [None]:
COMPLAINTS_SCHEMA_ORDER_OFFICIAL = [
    # Appendix A (49 fields), in order:
    "CMPLID","ODINO","MFR_NAME","MAKETXT","MODELTXT","YEARTXT","CRASH","FAILDATE",
    "FIRE","INJURED","DEATHS","COMPDESC","CITY","STATE","VIN","DATEA","LDATE",
    "MILES","OCCURENCES","CDESCR","CMPL_TYPE","POLICE_RPT_YN","PURCH_DT","ORIG_OWNER_YN",
    "ANTI_BRAKES_YN","CRUISE_CONT_YN","NUM_CYLS","DRIVE_TRAIN","FUEL_SYS","FUEL_TYPE",
    "TRANS_TYPE","VEH_SPEED","DOT","TIRE_SIZE","LOC_OF_TIRE","TIRE_FAIL_TYPE",
    "ORIG_EQUIP_YN","MANUF_DT","SEAT_TYPE","RESTRAINT_TYPE","DEALER_NAME","DEALER_TEL",
    "DEALER_CITY","DEALER_STATE","DEALER_ZIP","PROD_TYPE","REPAIRED_YN","MEDICAL_ATTN",
    "VEHICLES_TOWED_YN"
]

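# NOTE: VEHICLES_TOWED_YN is not listed here, so it keeps its raw Y/N values
# (visible as "N" in the head() output further down).
YN_FIELDS = [
    "CRASH","FIRE","POLICE_RPT_YN","ORIG_OWNER_YN","ANTI_BRAKES_YN","CRUISE_CONT_YN",
    "ORIG_EQUIP_YN","REPAIRED_YN","MEDICAL_ATTN"
]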

DATE_YYYYMMDD_FIELDS = ["FAILDATE","DATEA","LDATE","PURCH_DT","MANUF_DT"]

def _parse_yyyymmdd(series):
    return pd.to_datetime(series, format="%Y%m%d", errors="coerce")

def _yn_to_int(series):
    # Map common yes/no spellings to 1/0; anything else (including missing) becomes <NA>.
    mapping = {"Y": 1, "YES": 1, "T": 1, "TRUE": 1, "1": 1,
               "N": 0, "NO": 0, "F": 0, "FALSE": 0, "0": 0}
    return (series.astype(str).str.strip().str.upper()
            .map(mapping).astype("Int64"))

def normalize_complaints(df_raw: pd.DataFrame) -> pd.DataFrame:
    df = df_raw.copy()
    df.columns = [c.strip().upper() for c in df.columns]

    # ---- Aliases to align with Appendix A ----
    alias_map = {
        # Identifiers
        "ODI": "ODINO", "ODI_NUMBER": "ODINO", "ODINUMBER": "ODINO",
        # Manufacturer / model
        "MAKE": "MAKETXT", "MODEL": "MODELTXT", "MODEL_YEAR": "YEARTXT", "YEAR": "YEARTXT",
        "MFRNAME": "MFR_NAME", "MFR": "MFR_NAME",
        # Descriptions
        "DESCRIPTION": "CDESCR", "DESCR": "CDESCR", "NARRATIVE": "CDESCR",
        "COMPONENT": "COMPDESC", "COMPDESC.": "COMPDESC",
        # Possible date-name variants
        "FAIL_DATE": "FAILDATE", "DATE_A": "DATEA", "LDATE_RCVD": "LDATE",
        "PURCHDATE": "PURCH_DT", "MANUFDATE": "MANUF_DT",
        # Other
        "MILEAGE": "MILES", "VEH_SPEED_MPH": "VEH_SPEED",
    }
    for a, b in alias_map.items():
        if a in df.columns and b not in df.columns:
            df.rename(columns={a: b}, inplace=True)

    # ---- Typing of key fields ----
    if "YEARTXT" in df.columns:
        # 9999 (unknown) -> NaN, plus a plausible-range filter
        def _coerce_year(x):
            try:
                v = int(str(x))
                return np.nan if v == 9999 or v < 1949 or v > 2035 else v
            except (ValueError, TypeError):
                return np.nan
        df["YEARTXT"] = df["YEARTXT"].map(_coerce_year).astype("Int64")

    # VIN: light normalization
    # NOTE: astype(str) turns missing VINs into the literal string "NAN"
    # (visible in the head() output further down), so they are not null afterwards.
    if "VIN" in df.columns:
        df["VIN"] = (df["VIN"].astype(str).str.upper()
                     .str.replace(r"[^A-Z0-9]", "", regex=True).str.strip())

    # YYYYMMDD dates (5 fields in the catalog)
    for col in DATE_YYYYMMDD_FIELDS:
        if col in df.columns:
            df[col] = _parse_yyyymmdd(df[col])

    # Y/N -> 0/1
    for col in YN_FIELDS:
        if col in df.columns:
            df[col] = _yn_to_int(df[col])

    # Numeric fields
    for col in ["INJURED","DEATHS","OCCURENCES","NUM_CYLS","VEH_SPEED","MILES"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    # Long text fields and minimal cleaning
    text_fields = []
    if "CDESCR" in df.columns: text_fields.append("CDESCR")
    if "COMPDESC" in df.columns: text_fields.append("COMPDESC")

    for col in text_fields:
        df[col] = df[col].map(strip_html_basic).map(normalize_spaces)
        df[col + "_LEN"] = df[col].map(compute_len)

    # Official column order first, then any extras
    ordered = [c for c in COMPLAINTS_SCHEMA_ORDER_OFFICIAL if c in df.columns]
    extras = [c for c in df.columns if c not in ordered]
    df = df[ordered + extras]

    # Total text length metric
    len_cols = [c for c in df.columns if c.endswith("_LEN")]
    if len_cols:
        df["TEXT_TOTAL_LEN"] = df[len_cols].sum(axis=1)

    return df


In [None]:
INV_SCHEMA_ORDER = [
    # Adjust/reorder to match your input:
    "ODINUMBER", "ACTIONNUMBER", "MAKETXT", "MODELTXT", "YEARTXT",
    "COMPONENT", "DATEOPEN", "DATECLOSE", "SUMMARY", "NARRATIVE",
    "PRIORTYP", "STATUS", "CAMPNO"  # CAMPNO can link to recalls
]
TEXT_FIELDS_INV = ["SUMMARY", "NARRATIVE"]

def normalize_investigations(df_raw: pd.DataFrame) -> pd.DataFrame:
    df = df_raw.copy()
    df.columns = [c.strip().upper() for c in df.columns]

    alias_map = {
        "ODI": "ODINUMBER",
        "ODI_NUMBER": "ODINUMBER",
        "MAKE": "MAKETXT",
        "MODEL": "MODELTXT",
        "YEAR": "YEARTXT",
        "OPEN_DATE": "DATEOPEN",
        "CLOSE_DATE": "DATECLOSE",
        "DESCRIPTION": "SUMMARY"
    }
    for a, b in alias_map.items():
        if a in df.columns and b not in df.columns:
            df.rename(columns={a: b}, inplace=True)

    if "YEARTXT" in df.columns:
        df["YEARTXT"] = df["YEARTXT"].map(lambda x: coerce_int_with_range(x, 1950, 2035))

    for col in ["DATEOPEN", "DATECLOSE"]:
        if col in df.columns:
            df[col] = to_datetime_safe(df[col], dayfirst=False)

    for col in TEXT_FIELDS_INV:
        if col in df.columns:
            df[col] = df[col].map(strip_html_basic).map(normalize_spaces)
            df[col + "_LEN"] = df[col].map(compute_len)

    if "COMPONENT" in df.columns:
        parts = df["COMPONENT"].astype(str).str.split(":", n=2, expand=True)
        for i in range(3):
            if i < parts.shape[1]:
                df[f"COMP_L{i+1}"] = parts[i].map(lambda s: s.strip() if isinstance(s, str) else s)

    ordered = [c for c in INV_SCHEMA_ORDER if c in df.columns]
    extras = [c for c in df.columns if c not in ordered]
    df = df[ordered + extras]

    df["TEXT_TOTAL_LEN"] = df[[c for c in df.columns if c.endswith("_LEN")]].sum(axis=1)
    return df


In [None]:
# === Example paths (adjust as needed) ===
ZIP_PATH_COMPLAINTS = "/content/drive/MyDrive/Proyecto final/FLAT_CMPL.zip"

from pathlib import Path
import io, re, csv, zipfile, json
import pandas as pd
import numpy as np

OUT_DIR = Path("/content/drive/MyDrive/NHTSA/processed")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# --- Official complaints catalog (49 fields) ---
COMPLAINTS_SCHEMA_ORDER_OFFICIAL = [
    "CMPLID","ODINO","MFR_NAME","MAKETXT","MODELTXT","YEARTXT","CRASH","FAILDATE",
    "FIRE","INJURED","DEATHS","COMPDESC","CITY","STATE","VIN","DATEA","LDATE",
    "MILES","OCCURENCES","CDESCR","CMPL_TYPE","POLICE_RPT_YN","PURCH_DT","ORIG_OWNER_YN",
    "ANTI_BRAKES_YN","CRUISE_CONT_YN","NUM_CYLS","DRIVE_TRAIN","FUEL_SYS","FUEL_TYPE",
    "TRANS_TYPE","VEH_SPEED","DOT","TIRE_SIZE","LOC_OF_TIRE","TIRE_FAIL_TYPE",
    "ORIG_EQUIP_YN","MANUF_DT","SEAT_TYPE","RESTRAINT_TYPE","DEALER_NAME","DEALER_TEL",
    "DEALER_CITY","DEALER_STATE","DEALER_ZIP","PROD_TYPE","REPAIRED_YN","MEDICAL_ATTN",
    "VEHICLES_TOWED_YN"
]

# ========= Robust readers =========

def read_tab_robust_with_schema(raw_bytes: bytes, names, encoding="utf-8"):
    """
    Same Plan A/B logic as read_tab_robust, but forcing header=None + names=...
    so that the first row is never taken as the header.
    """
    # --- Plan A: well-formed quotes ---
    try:
        df = pd.read_csv(
            io.BytesIO(raw_bytes),
            sep="\t",
            engine="python",
            quotechar='"',
            encoding=encoding,
            header=None,
            names=names,
            on_bad_lines="error",
            dtype=str,         # everything as string: more stable for concatenation
        )
        return df, "A", []
    except Exception as eA:
        warns = [f"[Plan A] {type(eA).__name__}: {eA}"]

    # --- Plan B: no quote handling (tolerant) ---
    try:
        df = pd.read_csv(
            io.BytesIO(raw_bytes),
            sep="\t",
            engine="python",
            quoting=csv.QUOTE_NONE,
            escapechar="\\",
            encoding=encoding,
            header=None,
            names=names,
            on_bad_lines="warn",
            dtype=str,
        )
        return df, "B", warns
    except Exception as eB:
        warns.append(f"[Plan B] {type(eB).__name__}: {eB}")
        raise RuntimeError("Both plans failed while reading the file.\n" + "\n".join(warns)) from eB


def read_complaints_zip_with_schema(zip_path: str, encoding="utf-8"):
    """
    Reads ALL .txt/.tsv files in the complaints ZIP, forcing the official schema (49 columns).
    Concatenates the frames row-wise; no extra columns are expected here.
    """
    frames, warns = [], []
    plan_stats = {"A": 0, "B": 0}

    # The context manager closes the archive once all members are read
    with zipfile.ZipFile(zip_path, "r") as zf:
        for name in zf.namelist():
            if not re.search(r"\.(txt|tsv)$", name, flags=re.I):
                continue
            raw = zf.read(name)
            df, plan, w = read_tab_robust_with_schema(raw, COMPLAINTS_SCHEMA_ORDER_OFFICIAL, encoding=encoding)
            plan_stats[plan] = plan_stats.get(plan, 0) + 1
            warns.extend([f"{name}: {msg}" for msg in w])
            df["__SOURCE_FILE__"] = name
            frames.append(df)

    if not frames:
        raise RuntimeError("No .txt/.tsv files were found in the ZIP.")

    wide = pd.concat(frames, ignore_index=True, sort=False)
    return wide, plan_stats, warns


def read_all_tabs_from_zip(zip_path: str, encoding="utf-8"):
    """
    Generic version (for investigations). Keep your earlier version if you already have one.
    """
    frames, warns = [], []
    plan_stats = {"A": 0, "B": 0}

    # The context manager closes the archive once all members are read
    with zipfile.ZipFile(zip_path, "r") as zf:
        for name in zf.namelist():
            if not re.search(r"\.(txt|tsv)$", name, flags=re.I):
                continue
            raw = zf.read(name)
            # Same Plan A/B logic as the generic reader; it does NOT force a schema.
            try:
                df = pd.read_csv(io.BytesIO(raw), sep="\t", engine="python", quotechar='"',
                                 encoding=encoding, on_bad_lines="error")
                plan_stats["A"] += 1
            except Exception as eA:
                warns.append(f"{name} [Plan A] {type(eA).__name__}: {eA}")
                df = pd.read_csv(io.BytesIO(raw), sep="\t", engine="python", quoting=csv.QUOTE_NONE,
                                 escapechar="\\", encoding=encoding, on_bad_lines="warn")
                plan_stats["B"] += 1
            df["__SOURCE_FILE__"] = name
            frames.append(df)

    if not frames:
        raise RuntimeError("No .txt/.tsv files were found in the ZIP.")

    wide = pd.concat(frames, ignore_index=True, sort=False)
    return wide, plan_stats, warns


In [None]:
def run_pipeline_from_zip(zip_path: str, kind: str):
    """
    kind in {"complaints", "investigations"}
    Returns df_norm, the QC metrics, and the reader warnings.
    """
    if kind == "complaints":
        df_raw, plan_stats, warns = read_complaints_zip_with_schema(zip_path)
        df_norm = normalize_complaints(df_raw)   # Appendix A normalizer defined above
        stem = "complaints"
        key_text_cols = [c for c in ["CDESCR","COMPDESC"] if c in df_norm.columns]
        date_cols = [c for c in ["FAILDATE","DATEA","LDATE","PURCH_DT","MANUF_DT"] if c in df_norm.columns]

    elif kind == "investigations":
        df_raw, plan_stats, warns = read_all_tabs_from_zip(zip_path)
        df_norm = normalize_investigations(df_raw)  # investigations normalizer defined above
        stem = "investigations"
        key_text_cols = [c for c in ["SUMMARY","NARRATIVE"] if c in df_norm.columns]
        date_cols = [c for c in ["DATEOPEN","DATECLOSE"] if c in df_norm.columns]
    else:
        raise ValueError("kind must be 'complaints' or 'investigations'.")

    # === Basic quality control ===
    qc = {}
    qc["n_rows"] = len(df_norm)
    qc["n_cols"] = df_norm.shape[1]
    qc["plans_usage"] = plan_stats
    qc["warnings_count"] = len(warns)

    # Date coverage
    cov = {}
    for c in date_cols:
        cov[c] = float(df_norm[c].notna().mean()) if c in df_norm.columns else None
    qc["date_coverage"] = cov

    # Text coverage
    text_cov = {}
    for c in key_text_cols:
        text_cov[c] = {
            "coverage": float(df_norm[c].notna().mean()),
            "median_len": float(df_norm[c + "_LEN"].median()) if (c + "_LEN") in df_norm.columns else None
        }
    qc["text_coverage"] = text_cov

    # Model-year range
    if "YEARTXT" in df_norm.columns:
        yrs = df_norm["YEARTXT"].dropna()
        qc["year_range"] = (int(yrs.min()) if not yrs.empty else None,
                            int(yrs.max()) if not yrs.empty else None)
    else:
        qc["year_range"] = (None, None)

    # === Persistence ===
    parquet_path = OUT_DIR / f"{stem}.parquet"
    jsonl_path = OUT_DIR / f"{stem}.jsonl"

    df_norm.to_parquet(parquet_path, index=False)
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for _, row in df_norm.iterrows():
            f.write(json.dumps({k: (None if pd.isna(v) else v) for k, v in row.to_dict().items()},
                               default=str, ensure_ascii=False) + "\n")

    # Summary
    summary = {
        "dataset": kind,
        "rows": qc["n_rows"],
        "cols": qc["n_cols"],
        "plans_usage": qc["plans_usage"],
        "warnings": min(qc["warnings_count"], 15),
        "out_parquet": str(parquet_path),
        "out_jsonl": str(jsonl_path),
        "first_columns": list(df_norm.columns[:30]),  # should now be valid field names
    }
    print(json.dumps(summary, indent=2, ensure_ascii=False))

    return df_norm, qc, warns


In [None]:
# Complaints (with the header forced to the official catalog)
dfC, qcC, warnsC = run_pipeline_from_zip(ZIP_PATH_COMPLAINTS, "complaints")



{
  "dataset": "complaints",
  "rows": 2137711,
  "cols": 53,
  "plans_usage": {
    "A": 0,
    "B": 1
  },
  "out_parquet": "/content/drive/MyDrive/NHTSA/processed/complaints.parquet",
  "out_jsonl": "/content/drive/MyDrive/NHTSA/processed/complaints.jsonl",
  "first_columns": [
    "CMPLID",
    "ODINO",
    "MFR_NAME",
    "MAKETXT",
    "MODELTXT",
    "YEARTXT",
    "CRASH",
    "FAILDATE",
    "FIRE",
    "INJURED",
    "DEATHS",
    "COMPDESC",
    "CITY",
    "STATE",
    "VIN",
    "DATEA",
    "LDATE",
    "MILES",
    "OCCURENCES",
    "CDESCR",
    "CMPL_TYPE",
    "POLICE_RPT_YN",
    "PURCH_DT",
    "ORIG_OWNER_YN",
    "ANTI_BRAKES_YN",
    "CRUISE_CONT_YN",
    "NUM_CYLS",
    "DRIVE_TRAIN",
    "FUEL_SYS",
    "FUEL_TYPE"
  ]
}


In [None]:
dfC.head()


Unnamed: 0,CMPLID,ODINO,MFR_NAME,MAKETXT,MODELTXT,YEARTXT,CRASH,FAILDATE,FIRE,INJURED,DEATHS,COMPDESC,CITY,STATE,VIN,DATEA,LDATE,MILES,OCCURENCES,CDESCR,CMPL_TYPE,POLICE_RPT_YN,PURCH_DT,ORIG_OWNER_YN,ANTI_BRAKES_YN,CRUISE_CONT_YN,NUM_CYLS,DRIVE_TRAIN,FUEL_SYS,FUEL_TYPE,TRANS_TYPE,VEH_SPEED,DOT,TIRE_SIZE,LOC_OF_TIRE,TIRE_FAIL_TYPE,ORIG_EQUIP_YN,MANUF_DT,SEAT_TYPE,RESTRAINT_TYPE,DEALER_NAME,DEALER_TEL,DEALER_CITY,DEALER_STATE,DEALER_ZIP,PROD_TYPE,REPAIRED_YN,MEDICAL_ATTN,VEHICLES_TOWED_YN,__SOURCE_FILE__,CDESCR_LEN,COMPDESC_LEN,TEXT_TOTAL_LEN
0,1,958241,"Volvo Car USA, LLC",VOLVO,760,1987,0,NaT,0,0,0,ENGINE AND ENGINE COOLING:COOLING SYSTEM:RADIA...,EL CAJON,CA,NAN,1995-01-03,1995-01-03,,,RADIATOR FAILED @ HIGHWAY SPEED OBSTRUCTING DR...,EVOQ,0,NaT,0,0,0,,,,,,,,,,,,NaT,,,,,,,,V,,0,N,FLAT_CMPL.txt,97,58,155
1,2,958130,Ford Motor Company,FORD,THUNDERBIRD,1992,0,1994-12-22,0,0,0,"FUEL SYSTEM, GASOLINE:DELIVERY",CLINTONTOWN,MI,1FAPP6045NH,1995-01-03,1995-01-03,,1.0,"FUEL LEAKED FROM FUEL TANK AREA, EMITTING STRO...",EVOQ,0,NaT,0,0,0,,,,,,,,,,,,NaT,,,,,,,,V,,0,N,FLAT_CMPL.txt,69,30,99
2,3,958132,"Kia America, Inc.",KIA,SEPHIA,1994,1,1994-12-30,0,0,0,POWER TRAIN:AUTOMATIC TRANSMISSION,SAN FRANCISC,CA,NAN,1995-01-03,1995-01-03,,,SHIFTED INTO REVERSE VEHICLE JERKED VIOLENTLY....,EVOQ,0,NaT,0,0,0,,,,,,,,,,,,NaT,,,,,,,,V,,0,N,FLAT_CMPL.txt,80,34,114
3,4,958133,"Chrysler (FCA US, LLC)",DODGE,600,1987,0,1994-12-31,0,0,0,"FUEL SYSTEM, GASOLINE:STORAGE:TANK ASSEMBLY",MUSKEGON,MI,1B3BE36D4HC,1995-01-03,1995-01-03,,,FUEL TANK ; LEAKS BECAUSE OF RUST GAS LEAK BY ...,EVOQ,0,NaT,0,0,0,,,,,,,,,,,,NaT,,,,,,,,V,,0,N,FLAT_CMPL.txt,68,43,111
4,5,958137,"Chrysler (FCA US, LLC)",DODGE,CARAVAN,1991,0,1994-12-18,0,0,0,SEATS,MESQUITE,TX,2B4GK4535MR,1995-01-03,1995-01-03,,1.0,"DRIVER SIDE SEAT FRAME BROKE IN TWO, CAUSING S...",EVOQ,0,NaT,0,0,0,,,,,,,,,,,,NaT,,,,,,,,V,,0,N,FLAT_CMPL.txt,98,5,103


In [2]:
from pathlib import Path
import pandas as pd
import numpy as np

IN_PARQUET = "/content/drive/MyDrive/NHTSA/processed/complaints.parquet"
OUT_DIR = Path("/content/drive/MyDrive/NHTSA/processed")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Full load
dfC = pd.read_parquet(IN_PARQUET)
print(dfC.shape)
dfC.head(2)


(2137711, 53)


Unnamed: 0,CMPLID,ODINO,MFR_NAME,MAKETXT,MODELTXT,YEARTXT,CRASH,FAILDATE,FIRE,INJURED,DEATHS,COMPDESC,CITY,STATE,VIN,DATEA,LDATE,MILES,OCCURENCES,CDESCR,CMPL_TYPE,POLICE_RPT_YN,PURCH_DT,ORIG_OWNER_YN,ANTI_BRAKES_YN,CRUISE_CONT_YN,NUM_CYLS,DRIVE_TRAIN,FUEL_SYS,FUEL_TYPE,TRANS_TYPE,VEH_SPEED,DOT,TIRE_SIZE,LOC_OF_TIRE,TIRE_FAIL_TYPE,ORIG_EQUIP_YN,MANUF_DT,SEAT_TYPE,RESTRAINT_TYPE,DEALER_NAME,DEALER_TEL,DEALER_CITY,DEALER_STATE,DEALER_ZIP,PROD_TYPE,REPAIRED_YN,MEDICAL_ATTN,VEHICLES_TOWED_YN,__SOURCE_FILE__,CDESCR_LEN,COMPDESC_LEN,TEXT_TOTAL_LEN
0,1,958241,"Volvo Car USA, LLC",VOLVO,760,1987,0,NaT,0,0,0,ENGINE AND ENGINE COOLING:COOLING SYSTEM:RADIA...,EL CAJON,CA,NAN,1995-01-03,1995-01-03,,,RADIATOR FAILED @ HIGHWAY SPEED OBSTRUCTING DR...,EVOQ,0,NaT,0,0,0,,,,,,,,,,,,NaT,,,,,,,,V,,0,N,FLAT_CMPL.txt,97,58,155
1,2,958130,Ford Motor Company,FORD,THUNDERBIRD,1992,0,1994-12-22,0,0,0,"FUEL SYSTEM, GASOLINE:DELIVERY",CLINTONTOWN,MI,1FAPP6045NH,1995-01-03,1995-01-03,,1.0,"FUEL LEAKED FROM FUEL TANK AREA, EMITTING STRO...",EVOQ,0,NaT,0,0,0,,,,,,,,,,,,NaT,,,,,,,,V,,0,N,FLAT_CMPL.txt,69,30,99


In [None]:
# Split COMPDESC into levels on ":" (up to 3 levels)
if "COMPDESC" in dfC.columns:
    parts = dfC["COMPDESC"].fillna("").astype(str).str.split(":", n=2, expand=True)
    comp_df = pd.DataFrame({
        "COMP_L1": parts[0].str.strip() if 0 in parts.columns else None,
        "COMP_L2": parts[1].str.strip() if 1 in parts.columns else None,
        "COMP_L3": parts[2].str.strip() if 2 in parts.columns else None,
    })
    comp_df = comp_df.replace({"": np.nan}).dropna(how="all")
    comp_df = comp_df.drop_duplicates().sort_values(["COMP_L1","COMP_L2","COMP_L3"])
    out_comp = OUT_DIR / "complaints_components.csv"
    comp_df.to_csv(out_comp, index=False)
    print("OK ->", out_comp, "rows:", len(comp_df))
    comp_df.head(10)
else:
    print("No COMPDESC column found.")


OK -> /content/drive/MyDrive/NHTSA/processed/complaints_components.csv rows: 757


In [None]:
id_cols = [c for c in ["CMPLID","ODINO","VIN"] if c in dfC.columns]
ids_df = dfC[id_cols].dropna(how="all").drop_duplicates()

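# Light VIN cleanup (if the column exists)
# NOTE: astype(str) renders missing VINs as the literal string "NAN", as the output below shows.
if "VIN" in ids_df.columns:
    ids_df["VIN"] = ids_df["VIN"].astype(str).str.upper().str.replace(r"[^A-Z0-9]", "", regex=True)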

out_ids = OUT_DIR / "complaints_ids.csv"
ids_df.to_csv(out_ids, index=False)
print("OK ->", out_ids, "rows:", len(ids_df))
ids_df.head(10)


OK -> /content/drive/MyDrive/NHTSA/processed/complaints_ids.csv rows: 2137711


Unnamed: 0,CMPLID,ODINO,VIN
0,1,958241,NAN
1,2,958130,1FAPP6045NH
2,3,958132,NAN
3,4,958133,1B3BE36D4HC
4,5,958137,2B4GK4535MR
5,6,958246,NAN
6,7,958248,NAN
7,8,958244,1P4FH5435KX
8,9,958244,1P4FH5435KX
9,10,958244,1P4FH5435KX


In [3]:
import json

text_col = "CDESCR"
meta_cols = [c for c in ["CMPLID","ODINO","MAKETXT","MODELTXT","YEARTXT","COMPDESC","VIN","DATEA","LDATE","CITY","STATE"] if c in dfC.columns]

# Keep only rows that have narrative text
corpus = dfC[dfC[text_col].notna()].copy()

out_jsonl = OUT_DIR / "complaints_corpus.jsonl"
with open(out_jsonl, "w", encoding="utf-8") as f:
    for _, row in corpus.iterrows():
        rec = {
            "id": str(row["CMPLID"]) if "CMPLID" in row else None,
            "text": str(row[text_col]),
            "metadata": {k: (None if pd.isna(row[k]) else str(row[k])) for k in meta_cols}
        }
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

print("OK ->", out_jsonl)


OK -> /content/drive/MyDrive/NHTSA/processed/complaints_corpus.jsonl
