# Execution Report: Loading Vectors into Qdrant

## Summary

This document describes the ingestion of three vector sets (**Recalls**, **Investigations**, and **Complaints (representatives)**) into **Qdrant Cloud** from **Google Colab**. Dependency versions were pinned, collection creation and validation were standardized, batched loading was implemented and, where applicable, **deterministic UUID v5** identifiers were used. Paths and collection names were corrected to:

* `nhtsa_investigations`
* `nhtsa_recalls`
* `nhtsa_complaints`

---

## 1) Environment setup

* **Drive mount:** `drive.mount('/content/drive')`
* **Credentials:**
  `QDRANT_URL="https://…qdrant.io"`
  `QDRANT_API_KEY="…"`
* **Dependencies (pinned):**
  `qdrant-client==1.9.2`, `pandas==2.2.2`, `numpy==1.26.4`, `pyarrow==17.0.0`.
* **Base path on Drive:**

  ```
  BASE = /content/drive/MyDrive/NHTSA
  ```
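The pinned versions can be checked against the running environment before any upload. The block below is a minimal sketch: `PINS` mirrors the dependency list above, while `check_pins` and its `get_version` parameter are hypothetical names introduced here for illustration and are not part of the notebook.

```python
from importlib import metadata

# Pinned versions copied from the dependency list above.
PINS = {
    "qdrant-client": "1.9.2",
    "pandas": "2.2.2",
    "numpy": "1.26.4",
    "pyarrow": "17.0.0",
}

def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that does not match its pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `check_pins(PINS)` in the Colab runtime returns an empty dict when every installed version matches its pin.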

---

## 2) Helper functions (summary)

* `ensure_collection(client, name, dim, distance)`: creates the collection if it does not exist and **recreates** it if the dimension does not match.
* `df_to_payloads(df)`: converts `NaN` to `None` and `numpy` types to native Python types.
* `upsert_batches(...)`: batched `upsert` with `wait=True`.
* **(Complaints)** `detect_rep_columns`, `load_reps_and_vectors`, `make_point_uuid(...)` (UUID v5).
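The payload cleaning can also be done per record, after the DataFrame has been turned into dicts, which keeps the `NaN` handling independent of column dtypes. This is a sketch under that assumption, not the notebook's implementation; `clean_value` and `clean_records` are hypothetical names.

```python
import math

def clean_value(v):
    """Map NaN to None and unwrap NumPy-style scalars into native Python values."""
    # NumPy scalars expose a callable .item(); plain Python values do not.
    item = getattr(v, "item", None)
    if callable(item):
        v = item()
    if isinstance(v, float) and math.isnan(v):
        return None
    return v

def clean_records(records):
    """Apply clean_value to every field of every record (a list of dicts)."""
    return [{k: clean_value(val) for k, val in rec.items()} for rec in records]
```

With pandas, the input would typically be `df.to_dict("records")`; the result is a list of JSON-safe dicts that can be used as Qdrant payloads.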

---

## 3) Load processes

### 3.1 Recalls

**Target collection:** `nhtsa_recalls`
**Paths:**

```
RCL_EMB_DIR = BASE/embeddings/recalls_e5_mlg_instruct
RCL_META    = RCL_EMB_DIR/recalls_chunks_meta.parquet
RCL_EMB     = RCL_EMB_DIR/recalls_embeddings.npy
```

**Flow:**

1. Load `recalls_embeddings.npy` and `recalls_chunks_meta.parquet`.
2. `assert len(vecs) == len(meta)`.
3. `ensure_collection(client, "nhtsa_recalls", dim=vecs.shape[1])`.
4. `upsert_batches(..., batch_size=512–1024)` with **sequential integer IDs** (restarted if the collection is recreated).

> **Robustness note:** if the collection is recreated or a clean reload is forced, **do not reuse the checkpoint**; start IDs at `0`.
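The robustness note can be enforced in code by dropping the relevant progress key before a clean reload. The sketch below assumes the checkpoint is a JSON object that maps a progress key (for example `recalls_progress`) to the number of rows already uploaded, as the notebook's checkpoint helpers do; `reset_progress` is a hypothetical helper name.

```python
import json
from pathlib import Path

def reset_progress(chkpt_file: Path, progress_key: str) -> dict:
    """Drop one progress key from the JSON checkpoint so the next upload starts at row 0."""
    state = {}
    if chkpt_file.exists():
        try:
            state = json.loads(chkpt_file.read_text())
        except json.JSONDecodeError:
            state = {}  # unreadable checkpoint: treat it as empty
    state.pop(progress_key, None)
    chkpt_file.parent.mkdir(parents=True, exist_ok=True)
    chkpt_file.write_text(json.dumps(state, indent=2))
    return state
```

Calling it right after a collection is recreated keeps the checkpoint and the collection contents consistent, while progress keys for other collections are left untouched.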

---

### 3.2 Investigations

**Target collection:** `nhtsa_investigations`
**Paths:**

```
INVDIR = BASE/embeddings/investigations_e5_mlg_instruct
META   = INVDIR/invest_chunks_meta.parquet
NPY    = INVDIR/invest_embeddings.npy    # (6354, 1024)
```

**Flow:**

1. Verify correspondence: `invest_embeddings.npy` (**6,354** × **1,024**) ↔ `invest_chunks_meta.parquet` (**6,354** rows).
2. `ensure_collection(client, "nhtsa_investigations", dim=1024)`.
3. `upsert_batches(..., batch_size=512)`.

**Post-load verification:**

* `client.count("nhtsa_investigations", exact=True).count` → **6354**.
* `scroll` with `with_payload=True` for sampling.

---

### 3.3 Complaints (representatives)

**Target collection:** `nhtsa_complaints`
**Paths:**

```
REDUCED  = BASE/embeddings/complaints_e5_mlg_instruct/reduced_shards
CSV_REPS = REDUCED/kmeans_global/representantes_global_kmeans.csv
# Fallback:
# CSV_REPS = REDUCED/representantes_global.csv
# Reduced vectors per shard: REDUCED/reduced_0000.npy, reduced_0001.npy, ...
```

**Key characteristics:**

* The **vectors** live **per shard** (`reduced_####.npy`); the CSV gives `shard_idx` and `row_in_shard` for each representative.
* **Deterministic IDs:** `UUID v5` derived from `rep_{shard_idx}_{row_in_shard}` with a **fixed project namespace**.
* **Per-shard loading:** for each `shard_idx`, only the required rows (indices) are extracted and uploaded in batches (`UPSERT_BATCH` is 2000 in the notebook; lower it if the free tier is saturated).
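The deterministic ID scheme can be sketched with the standard library. The name format `rep_{shard:04d}_{row:06d}` comes from this report; the namespace value below is a placeholder assumption (the project's actual namespace is not shown here), and the signature of `make_point_uuid` is likewise assumed.

```python
import uuid

# Hypothetical project namespace; replace it with the fixed namespace used by the project.
PROJECT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "nhtsa-complaints.example")

def make_point_uuid(shard_idx: int, row_in_shard: int,
                    namespace: uuid.UUID = PROJECT_NAMESPACE) -> str:
    """Deterministic point ID derived from the representative's shard and row."""
    name = f"rep_{shard_idx:04d}_{row_in_shard:06d}"
    return str(uuid.uuid5(namespace, name))
```

Because UUID v5 is a hash of the namespace and the name, re-running the upload produces the same IDs, so an `upsert` overwrites existing points instead of duplicating them.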

**Flow:**

1. `detect_rep_columns` maps the columns (`shard_idx`, `row_in_shard`, `make`, `model`, `year`, etc.); if `shard_idx` or `row_in_shard` is missing, it is **inferred** from an `id` of the form `rep_0005_012345`.
2. `load_reps_and_vectors` loads the CSV and the per-shard vectors (memory-mapped).
3. `ensure_collection(client, "nhtsa_complaints", dim=<reduced dim>)`.
4. Build `payload_df` (lightweight).
5. Iterate over `shard_idx`:

   * `point_ids = uuid5(namespace, f"rep_{shard:04d}_{row:06d}")`
   * `vecs = stack(arr[row] for row in local_rows)`
   * `upsert_batches(client, "nhtsa_complaints", vecs, payload, point_ids, batch_size=UPSERT_BATCH)`.
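The per-shard iteration in step 5 amounts to grouping the representatives by shard and then picking only the needed rows from each shard. The sketch below uses plain Python containers so it runs without the data; `group_rows_by_shard` and `gather_vectors` are hypothetical names, and in the notebook the shard would be a memory-mapped NumPy array rather than a list.

```python
from collections import defaultdict

def group_rows_by_shard(reps):
    """Map each shard index to the local rows needed from it, keeping the CSV order."""
    rows_by_shard = defaultdict(list)
    for rep in reps:
        rows_by_shard[int(rep["shard_idx"])].append(int(rep["row_in_shard"]))
    # Visit shards in ascending order; rows keep their original order so that
    # vectors, payloads and IDs stay aligned.
    return dict(sorted(rows_by_shard.items()))

def gather_vectors(shard_array, rows):
    """Select only the requested rows from one shard (any indexable sequence of vectors)."""
    return [shard_array[r] for r in rows]
```

Keeping the row order unchanged within each shard matters: the payload rows and point IDs must be built in the same order as the gathered vectors.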

**Post-load verification:**

* `client.count("nhtsa_complaints", exact=True).count`.
* `scroll` a sample with payload.

---

## 4) Qdrant configuration (Cloud Free)

* **Recommended timeout:** `timeout=180`.
* **Distance:** `models.Distance.COSINE`.
* **Batch size:** 512–1024 (or 2000 for the UUID-keyed representatives if there are no timeouts).
* **Safe recreation:** if the dimension does not match, **delete and recreate** the collection before loading.
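The effect of the batch size on the number of requests is simple arithmetic: `ceil(n / batch_size)` upserts. A small sketch of the slicing follows; `batch_ranges` is a hypothetical helper, not a function from the notebook.

```python
def batch_ranges(n: int, batch_size: int):
    """Yield (start, end) half-open ranges that cover n items in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n, batch_size):
        yield start, min(start + batch_size, n)
```

For the 12,871 recall vectors at a batch size of 1024 this yields 13 ranges, the last one being `(12288, 12871)`; for the 6,354 investigation vectors at 512 it also yields 13, ending with `(6144, 6354)`. Both match the progress lines recorded in the notebook output.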

---

## 5) Validation and integrity checks

* **Pre-load assertions:** `#vectors == #metadata rows`; constant dimension.
* **Payload types:** `NaN` to `None`; `numpy.*` to native Python types.
* **IDs:**

  * Recalls/Investigations: sequential integers (restart on a clean reload).
  * Complaints representatives: **UUID v5** (stable across runs).
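The pre-load assertions can be gathered in one function that works on shapes alone. This is a sketch with a hypothetical name, `check_alignment`; it takes the embedding matrix shape (for example `vecs.shape`) and the number of metadata rows.

```python
def check_alignment(vec_shape, n_meta, expected_dim=None):
    """Validate that embeddings of shape (n, dim) line up with n_meta metadata rows."""
    if len(vec_shape) != 2:
        raise ValueError(f"expected a 2-D embedding matrix, got shape {vec_shape}")
    n_vecs, dim = vec_shape
    if n_vecs != n_meta:
        raise ValueError(f"misaligned: {n_vecs} vectors vs {n_meta} metadata rows")
    if expected_dim is not None and dim != expected_dim:
        raise ValueError(f"unexpected dimension: {dim} vs {expected_dim}")
    return n_vecs, dim
```

With the investigations data, `check_alignment((6354, 1024), 6354, expected_dim=1024)` passes, whereas pairing the 356,139-vector sharded embeddings with the 6,354-row metadata raises an error.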

---

## 6) Conclusion

Qdrant was populated with:

* **Investigations:** `nhtsa_investigations` (**6,354** points, **dim 1024**).
* **Recalls:** `nhtsa_recalls` (**12,871** points, **dim 1024**).
* **Complaints (representatives):** `nhtsa_complaints` (deterministic **UUID v5** IDs).

**Batched loading**, **conditional recreation** of collections, and **stable identifiers** support reproducibility and consistency for **semantic search**, **cross-source alignment**, and later **serving**.

---

### Appendix A: Path summary

```
BASE = /content/drive/MyDrive/NHTSA

# Investigations
BASE/embeddings/investigations_e5_mlg_instruct/invest_embeddings.npy
BASE/embeddings/investigations_e5_mlg_instruct/invest_chunks_meta.parquet
→ collection: nhtsa_investigations

# Recalls
BASE/embeddings/recalls_e5_mlg_instruct/recalls_embeddings.npy
BASE/embeddings/recalls_e5_mlg_instruct/recalls_chunks_meta.parquet
→ collection: nhtsa_recalls

# Complaints (global representatives)
BASE/embeddings/complaints_e5_mlg_instruct/reduced_shards/kmeans_global/representantes_global_kmeans.csv
# or: BASE/embeddings/complaints_e5_mlg_instruct/reduced_shards/representantes_global.csv
# vectors per shard:
# BASE/embeddings/complaints_e5_mlg_instruct/reduced_shards/reduced_0000.npy, reduced_0001.npy, ...
→ collection: nhtsa_complaints
```

### Appendix B: Quick checks

```python
# Exact count
client.count("nhtsa_investigations", exact=True).count
client.count("nhtsa_recalls", exact=True).count
client.count("nhtsa_complaints", exact=True).count

# Payload sample
scrolled, _ = client.scroll("nhtsa_investigations", limit=3, with_payload=True, with_vectors=False)
for p in scrolled:
    print(p.id, p.payload)
```

---


In [None]:
!apt install tree



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!tree drive/MyDrive/NHTSA/

drive/MyDrive/NHTSA/
├── embeddings
│   ├── complaints_e5_mlg_instruct
│   │   ├── checkpoint.json
│   │   ├── clustered
│   │   ├── embeddings_shard_0000.npy
│   │   ├── embeddings_shard_0001.npy
│   │   ├── embeddings_shard_0002.npy
│   │   ├── embeddings_shard_0003.npy
│   │   ├── embeddings_shard_0004.npy
│   │   ├── embeddings_shard_0005.npy
│   │   ├── embeddings_shard_0006.npy
│   │   ├── embeddings_shard_0007.npy
│   │   ├── embeddings_shard_0008.npy
│   │   ├── embeddings_shard_0009.npy
│   │   ├── embeddings_shard_0010.npy
│   │   ├── embeddings_shard_0011.npy
│   │   ├── embeddings_shard_0012.npy
│   │   ├── embeddings_shard_0013.npy
│   │   ├── embeddings_shard_0014.npy
│   │   ├── embeddings_shard_0015.npy
│   │   ├── embeddings_shard_0016.npy
│   │   ├── embedd

## Credentials

In [None]:
QDRANT_URL = "https://6f99e241-2505-45c5-86f7-ad2aa70e3bb2.us-east-1-1.aws.cloud.qdrant.io"
QDRANT_API_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3MiOiJtIn0.N4plF_VhgfphnbtCni7VAPvcxer5PNkPSG1xYUc0kLE"
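As an alternative to literal values in the notebook, the two settings can be read from environment variables. This is a sketch, not the notebook's approach: `load_qdrant_credentials` is a hypothetical helper, and it assumes the variables `QDRANT_URL` and `QDRANT_API_KEY` have been set in the runtime beforehand.

```python
import os

def load_qdrant_credentials(env=os.environ):
    """Read the Qdrant URL and API key from environment variables."""
    url = env.get("QDRANT_URL")
    api_key = env.get("QDRANT_API_KEY")
    missing = [k for k, v in (("QDRANT_URL", url), ("QDRANT_API_KEY", api_key)) if not v]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return url, api_key
```

The returned pair can be passed straight to `QdrantClient(url=..., api_key=..., timeout=180)`, which keeps the key out of the saved notebook and its exports.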

In [None]:
from pathlib import Path

# --- Paths to the source files ---

BASE = Path("/content/drive/MyDrive/NHTSA")

# --- INVESTIGATIONS ---
INV_EMB_DIR = BASE / "embeddings" / "investigations_e5_mlg_instruct"
INV_META    = INV_EMB_DIR / "invest_chunks_meta.parquet"
INV_EMB     = INV_EMB_DIR / "invest_embeddings.npy" # Assumes the single, complete file is used
INV_PARTS   = sorted(INV_EMB_DIR.glob("invest_embeddings_part*.npy"))

# --- RECALLS ---
RCL_EMB_DIR = BASE / "embeddings" / "recalls_e5_mlg_instruct"
RCL_META    = RCL_EMB_DIR / "recalls_chunks_meta.parquet"
RCL_EMB     = RCL_EMB_DIR / "recalls_embeddings.npy"

# --- COMPLAINTS (added for later) ---
# (Adjust these paths when you reach the complaints section)
CPL_EMB_DIR = BASE / "embeddings" / "complaints_e5_mlg_instruct"


# --- Qdrant collection names ---

INV_COLLECTION = "nhtsa_investigations"
RCL_COLLECTION = "nhtsa_recalls"
CPL_COLLECTION = "nhtsa_complaints" # For when the complaints data is loaded


# --- Checkpoint file ---
CHKPT_FILE = BASE / "embeddings" / "qdrant_upload_checkpoint.json"

print("✅ Rutas y nombres de colecciones definidos.")

✅ Rutas y nombres de colecciones definidos.


In [None]:
!pip install qdrant_client



In [None]:
import json, math
import numpy as np
import pandas as pd
from qdrant_client import QdrantClient, models

def get_client():
    return QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY, timeout=180)

def recreate_if_needed(client, collection_name, dim, distance=models.Distance.COSINE):
    # Create the collection, or recreate it when its dimension differs (default HNSW index)
    vectors_config = models.VectorParams(size=dim, distance=distance)
    if collection_name in [c.name for c in client.get_collections().collections]:
        try:
            cfg = client.get_collection(collection_name).config
            current_dim = cfg.params.vectors.get("size", None) if isinstance(cfg.params.vectors, dict) else cfg.params.vectors.size
            if current_dim != dim:
                print(f"[!] Dimensión distinta ({current_dim} vs {dim}). Recreando colección {collection_name}...")
                # recreate_collection drops the existing collection before creating it again
                client.recreate_collection(collection_name=collection_name, vectors_config=vectors_config)
            else:
                print(f"[OK] Colección {collection_name} ya existe con dim={dim}.")
        except Exception:
            # If the size cannot be read, recreate the collection to guarantee consistency
            print(f"[i] Recreando colección {collection_name} para asegurar dim={dim} ...")
            client.recreate_collection(collection_name=collection_name, vectors_config=vectors_config)
    else:
        print(f"[i] Creando colección {collection_name} (dim={dim}) ...")
        client.recreate_collection(collection_name=collection_name, vectors_config=vectors_config)

def df_to_payloads(df: pd.DataFrame):
    # Replace NaN with None and convert to JSON-safe types
    clean = df.where(pd.notnull(df), None)
    # Convert numpy types to native Python types (just in case)
    def to_native(v):
        if isinstance(v, (np.generic,)):
            return v.item()
        return v
    return [{k: to_native(v) for k, v in row.items()} for row in clean.to_dict(orient="records")]

def load_checkpoint():
    if CHKPT_FILE.exists():
        try:
            return json.loads(CHKPT_FILE.read_text())
        except Exception:
            return {}
    return {}

def save_checkpoint(state: dict):
    CHKPT_FILE.parent.mkdir(parents=True, exist_ok=True)
    CHKPT_FILE.write_text(json.dumps(state, indent=2))

def upload_batches(client, collection, embeddings_np, meta_df, global_start_id=0, batch_size=1024, progress_key=None):
    """
    Upload vectors + payloads in batches. Assumes embeddings_np.shape[0] == len(meta_df).
    IDs are consecutive integers starting at global_start_id (for efficient scrolling).
    """
    n = embeddings_np.shape[0]
    dim = embeddings_np.shape[1]
    print(f"[i] Subiendo {n} vectores (dim={dim}) a {collection}...")
    state = load_checkpoint()
    done_upto = state.get(progress_key, 0) if progress_key else 0

    start = done_upto
    while start < n:
        end = min(start + batch_size, n)
        vecs = embeddings_np[start:end].astype(np.float32, copy=False)
        payloads = df_to_payloads(meta_df.iloc[start:end].reset_index(drop=True))
        ids = list(range(global_start_id + start, global_start_id + end))

        client.upsert(
            collection_name=collection,
            points=models.Batch(ids=ids, vectors=vecs.tolist(), payloads=payloads),
            wait=True,
        )
        start = end

        if progress_key:
            state[progress_key] = start
            save_checkpoint(state)

        print(f"  – Subidos: {start}/{n}")
    print("[OK] Carga completada.")
    return global_start_id + n  # next available ID


4) Upload recalls (a single .npy)

In [None]:
# ====== RECALLS (original version + diagnostics) ======

client = get_client()

# Load embeddings and metadata
recalls_vecs = np.load(RCL_EMB)
recalls_meta = pd.read_parquet(RCL_META)

# --- CRUCIAL VERIFICATION LINES ---
print("--- Verificando archivos cargados ---")
print(f"Dimensiones de los vectores: {recalls_vecs.shape}")
print(f"Filas en los metadatos: {len(recalls_meta)}")
print("------------------------------------")
# ----------------------------------------

assert len(recalls_vecs) == len(recalls_meta), f"Desalineado: {len(recalls_vecs)} vs {len(recalls_meta)}"

# Use the variable from the configuration cell
print(f"Preparando la colección '{RCL_COLLECTION}'...")
recreate_if_needed(client, RCL_COLLECTION, dim=recalls_vecs.shape[1], distance=models.Distance.COSINE)

next_id = 0

# Batched upload (resumable)
next_id = upload_batches(
    client,
    RCL_COLLECTION,
    recalls_vecs,
    recalls_meta,
    global_start_id=next_id,
    batch_size=1024,
    progress_key="recalls_progress"
)

print(f"\nProceso finalizado. Se intentaron subir {next_id} puntos a '{RCL_COLLECTION}'.")

--- Verificando archivos cargados ---
Dimensiones de los vectores: (12871, 1024)
Filas en los metadatos: 12871
------------------------------------
Preparando la colección 'nhtsa_recalls'...
[OK] Colección nhtsa_recalls ya existe con dim=1024.
[i] Subiendo 12871 vectores (dim=1024) a nhtsa_recalls...
  – Subidos: 1024/12871
  – Subidos: 2048/12871
  – Subidos: 3072/12871
  – Subidos: 4096/12871
  – Subidos: 5120/12871
  – Subidos: 6144/12871
  – Subidos: 7168/12871
  – Subidos: 8192/12871
  – Subidos: 9216/12871
  – Subidos: 10240/12871
  – Subidos: 11264/12871
  – Subidos: 12288/12871
  – Subidos: 12871/12871
[OK] Carga completada.

Proceso finalizado. Se intentaron subir 12871 puntos a 'nhtsa_recalls'.


5) Upload investigations (multiple parts invest_embeddings_partXXX.npy)

In [None]:
from pathlib import Path
import numpy as np, pandas as pd, json, re, os

BASE = Path("/content/drive/MyDrive/NHTSA")
INV_DIR = BASE / "embeddings" / "investigations_e5_mlg_instruct"
PROC    = BASE / "processed"

CANDIDATE_META = [
    INV_DIR / "invest_chunks_meta.parquet",         # metadata stored next to the embeddings
    PROC    / "investigations_chunks.parquet",      # metadata from the "processed" phase
]

CANDIDATE_EMB = {
    "single": INV_DIR / "invest_embeddings.npy",    # "compact" run
    "shards": sorted(INV_DIR.glob("invest_embeddings_part*.npy")),  # "large" run
}

CHKPT = BASE / "embeddings" / "qdrant_upload_checkpoint.json"

def count_meta(meta_path):
    try:
        df = pd.read_parquet(meta_path)
        return len(df)
    except Exception as e:
        return None

def count_emb_single(npy):
    try:
        arr = np.load(npy, mmap_mode="r")
        return arr.shape
    except Exception:
        return None

def count_emb_shards(parts):
    total = 0
    dimset = set()
    for p in parts:
        arr = np.load(p, mmap_mode="r")
        total += arr.shape[0]
        dimset.add(arr.shape[1])
    return (total, dimset)

print("== META candidates ==")
meta_info = []
for mp in CANDIDATE_META:
    n = count_meta(mp) if mp.exists() else None
    print(f"  {mp}: {'NO' if n is None else n}")
    if n is not None: meta_info.append((mp, n))

print("\n== EMBEDDINGS candidates ==")
emb_info = []
if CANDIDATE_EMB["single"].exists():
    s = count_emb_single(CANDIDATE_EMB["single"])
    print(f"  single {CANDIDATE_EMB['single'].name}: {s}")
    if s: emb_info.append(("single", CANDIDATE_EMB["single"], s[0], s[1]))
else:
    print("  single: NO")

if CANDIDATE_EMB["shards"]:
    tot, dims = count_emb_shards(CANDIDATE_EMB["shards"])
    print(f"  shards x{len(CANDIDATE_EMB['shards'])}: total={tot}, dims={dims}")
    emb_info.append(("shards", CANDIDATE_EMB["shards"], tot, list(dims)[0] if len(dims)==1 else None))
else:
    print("  shards: NO")

# Match metadata and embeddings by row count
matches = []
for mp, nmeta in meta_info:
    for kind, emb, nvec, dim in emb_info:
        if nmeta == nvec:
            matches.append((mp, kind, emb, nmeta, dim))

print("\n== MATCHES por conteo (meta filas == embeddings vectores) ==")
if matches:
    for mp, kind, emb, nmeta, dim in matches:
        print(f"  ✅ {mp.name}  ↔  {kind} ({nmeta} vectores, dim={dim})")
else:
    print("  ❌ Ninguno. Meta y embeddings vienen de corridas distintas.")

# Clear the checkpoint to avoid cursors that are ahead of the data
if CHKPT.exists():
    print("\n[i] Borrando checkpoint obsoleto:", CHKPT)
    CHKPT.unlink()


== META candidates ==
  /content/drive/MyDrive/NHTSA/embeddings/investigations_e5_mlg_instruct/invest_chunks_meta.parquet: 6354
  /content/drive/MyDrive/NHTSA/processed/investigations_chunks.parquet: 6354

== EMBEDDINGS candidates ==
  single invest_embeddings.npy: (6354, 1024)
  shards x8: total=356139, dims={1024}

== MATCHES por conteo (meta filas == embeddings vectores) ==
  ✅ invest_chunks_meta.parquet  ↔  single (6354 vectores, dim=1024)
  ✅ investigations_chunks.parquet  ↔  single (6354 vectores, dim=1024)


In [None]:
# ====== Upload INVESTIGATIONS → Qdrant (collection: nhtsa_investigations) ======
!pip -q install qdrant-client==1.9.2 pyarrow==17.0.0 pandas==2.2.2 numpy==1.26.4

from pathlib import Path
import json
import numpy as np, pandas as pd
from qdrant_client import QdrantClient, models

# --- Qdrant config (make sure QDRANT_URL and QDRANT_API_KEY are defined above) ---
INV_COLLECTION = "nhtsa_investigations"

# --- Paths ---
BASE   = Path("/content/drive/MyDrive/NHTSA")
INVDIR = BASE / "embeddings" / "investigations_e5_mlg_instruct"
META   = INVDIR / "invest_chunks_meta.parquet"   # 6,354 rows
NPY    = INVDIR / "invest_embeddings.npy"        # (6354, 1024)

# --- Helpers ---
def collection_exists(client: QdrantClient, name: str) -> bool:
    # Some older SDKs do not provide .collection_exists(); this helper is compatible with them
    try:
        client.get_collection(name)
        return True
    except Exception:
        return False

def ensure_collection(client: QdrantClient, collection_name: str, dim: int, distance=models.Distance.COSINE):
    """
    Create the collection if it does not exist, or recreate it if the dimension does not match.
    Returns True when the collection was created or recreated, False otherwise.
    """
    if not collection_exists(client, collection_name):
        print(f"[i] Creando colección '{collection_name}' (dim={dim}) ...")
        client.create_collection(
            collection_name=collection_name,
            vectors_config=models.VectorParams(size=dim, distance=distance),
        )
        return True  # newly created
    # The collection exists: validate its dimension
    cfg = client.get_collection(collection_name).config
    current_dim = cfg.params.vectors.size if hasattr(cfg.params.vectors, "size") else cfg.params.vectors.get("size")
    if current_dim != dim:
        print(f"[!] Dimensión distinta ({current_dim} vs {dim}). Recreando '{collection_name}'...")
        client.delete_collection(collection_name)
        client.create_collection(
            collection_name=collection_name,
            vectors_config=models.VectorParams(size=dim, distance=distance),
        )
        return True  # recreated
    print(f"[OK] Colección '{collection_name}' existe con dim={dim}.")
    return False  # not recreated

def df_to_payloads(df: pd.DataFrame):
    clean = df.where(pd.notnull(df), None)
    def nat(v):
        return v.item() if isinstance(v, np.generic) else v
    return [{k: nat(v) for k, v in row.items()} for row in clean.to_dict("records")]

def upsert_batches(client, collection, vecs_np, meta_df, id_offset=0, batch_size=512):
    """
    Upload vectors + payloads in batches. A batch of 512 is recommended for the Qdrant Cloud free tier.
    """
    n = vecs_np.shape[0]
    i = 0
    print(f"[i] Subiendo {n} vectores (dim={vecs_np.shape[1]}) a '{collection}' en lotes de {batch_size} ...")
    while i < n:
        j = min(i + batch_size, n)
        client.upsert(
            collection_name=collection,
            points=models.Batch(
                ids=list(range(id_offset + i, id_offset + j)),
                vectors=vecs_np[i:j].astype("float32", copy=False).tolist(),
                payloads=df_to_payloads(meta_df.iloc[i:j].reset_index(drop=True)),
            ),
            wait=True,
        )
        i = j
        print(f"  – Subidos {i}/{n}")
    print("[OK] Carga completada.")

# --- Load and pre-upload checks ---
meta = pd.read_parquet(META)
vecs = np.load(NPY)

print("--- Verificación local ---")
print("Embeddings:", vecs.shape, "| Meta filas:", len(meta))
assert vecs.shape == (6354, 1024), f"Embeddings con forma inesperada: {vecs.shape}"
assert len(meta) == 6354, f"Metadatos con filas inesperadas: {len(meta)}"
print("--------------------------")

# --- Qdrant client ---
client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY, timeout=180)

# --- Ensure the collection has the correct dimension ---
recreated = ensure_collection(client, INV_COLLECTION, dim=vecs.shape[1], distance=models.Distance.COSINE)

# To force a load "from scratch", make sure IDs are not reused:
# if the collection existed and was not recreated, count its points in case data is already present
try:
    current_count = client.count(INV_COLLECTION, exact=True).count
except Exception:
    current_count = 0

# This upload starts from 0 and overwrites the collection completely:
if not recreated and current_count > 0:
    print(f"[i] La colección tenía {current_count} puntos. Se borrará para recargar limpia.")
    client.delete_collection(INV_COLLECTION)
    client.create_collection(
        collection_name=INV_COLLECTION,
        vectors_config=models.VectorParams(size=vecs.shape[1], distance=models.Distance.COSINE),
    )
    current_count = 0

# --- Upload ---
upsert_batches(client, INV_COLLECTION, vecs, meta, id_offset=0, batch_size=512)

# --- Post-load verification ---
cnt = client.count(INV_COLLECTION, exact=True).count
print(f"[✓] Conteo exacto en '{INV_COLLECTION}':", cnt)

# --- Sample 3 points (scroll) ---
scrolled, next_page = client.scroll(INV_COLLECTION, limit=3, with_payload=True, with_vectors=False)
print("[Vista rápida de payloads]")
for p in scrolled:
    print("-", p.id, {k: p.payload.get(k) for k in list(p.payload.keys())[:5]})  # first 5 keys of each payload


[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26



[i] Creando colección 'nhtsa_investigations' (dim=1024) ...
[i] Subiendo 6354 vectores (dim=1024) a 'nhtsa_investigations' en lotes de 512 ...
  – Subidos 512/6354
  – Subidos 1024/6354
  – Subidos 1536/6354
  – Subidos 2048/6354
  – Subidos 2560/6354
  – Subidos 3072/6354
  – Subidos 3584/6354
  – Subidos 4096/6354
  – Subidos 4608/6354
  – Subidos 5120/6354
  – Subidos 5632/6354
  – Subidos 6144/6354
  – Subidos 6354/6354
[OK] Carga completada.
[✓] Conteo exacto en 'nhtsa_investigations': 6354
[Vista rápida de payloads]
- 0 {'chunk_id': 'AQ08001::ch0', 'id': 'AQ08001', 'make': 'PACE AMERICAN', 'model': 'intfloat/multilingual-e5-large-instruct', 'year': None}
- 1 {'chunk_id': 'AQ09001::ch0', 'id': 'AQ09001', 'make': 'CAPCEN', 'model': 'intfloat/multilingual-e5-large-instruct', 'year': None}
- 2 {'chunk_id': 'AQ09002::ch0', 'id': 'AQ09002', 'make': 'HOLIDAY RAMBLER', 'model': 'intfloat/multilingual-e5-large-instruct', 'year': None}


In [None]:
import pandas as pd, pathlib as pl
csv = pl.Path("/content/drive/MyDrive/NHTSA/embeddings/complaints_e5_mlg_instruct/reduced_shards/kmeans_global/representantes_global_kmeans.csv")
reps = pd.read_csv(csv)
len(reps), reps.head(3)


(20000,
       id   make  model    year  \
 0   9364  DODGE   COLT  1993.0   
 1  14695  BUICK  BUICK  1990.0   
 2  15395  LEXUS  ES300  1995.0   
 
                                          component                   _h  \
 0  SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS -7992261938585274089   
 1                                            TIRES  1890294604798842119   
 2                 SEAT BELTS:FRONT:BUCKLE ASSEMBLY -1669910696480997720   
 
    shard_idx  row_in_shard  cluster     dist2  
 0          0          9363        0  0.000000  
 1          0         14694        1  0.000000  
 2          0         15394        2  0.017362  )

In [None]:
# =======================================================
# Qdrant: upload of representatives (complaints → vectors)
# with deterministic UUID IDs (uuid5)
# (target collection: nhtsa_complaints)
# =======================================================
!pip -q install qdrant-client==1.9.2 pyarrow==17.0.0 pandas==2.2.2 numpy==1.26.4

from pathlib import Path
import numpy as np, pandas as pd, json, re, gc, uuid
from qdrant_client import QdrantClient, models
from tqdm import tqdm

# ------------- Paths (adjust if they differ) ----------
BASE     = Path("/content/drive/MyDrive/NHTSA")
REDUCED  = BASE / "embeddings" / "complaints_e5_mlg_instruct" / "reduced_shards"
CSV_REPS = REDUCED / "kmeans_global" / "representantes_global_kmeans.csv"
if not CSV_REPS.exists():
    CSV_REPS = REDUCED / "representantes_global.csv"   # fallback

# ------------- Load parameters -------------------
UPSERT_BATCH = 2000         # upsert batch size (lower it if the free tier struggles)
SNIPPET_LEN  = 0            # 0 = do not upload a snippet; >0 = maximum snippet length

# ------------- Target collection ---------------------
CPL_COLLECTION = "nhtsa_complaints"

# ============ Qdrant helpers ============
def collection_exists(client: QdrantClient, name: str) -> bool:
    """Compat: some SDKs do not expose collection_exists()."""
    try:
        client.get_collection(name)
        return True
    except Exception:
        return False

def ensure_collection(client: QdrantClient, collection_name: str, dim: int, distance=models.Distance.COSINE):
    """Create the collection, or validate its dimension and recreate it on a mismatch."""
    if not collection_exists(client, collection_name):
        client.create_collection(
            collection_name=collection_name,
            vectors_config=models.VectorParams(size=dim, distance=distance),
        )
    else:
        cfg = client.get_collection(collection_name).config
        current_dim = cfg.params.vectors.size if hasattr(cfg.params.vectors, "size") else cfg.params.vectors.get("size")
        if current_dim != dim:
            client.delete_collection(collection_name)
            client.create_collection(
                collection_name=collection_name,
                vectors_config=models.VectorParams(size=dim, distance=distance),
            )

def df_to_payloads(df: pd.DataFrame):
    """Convert a DataFrame into a list of dicts with native types."""
    clean = df.where(pd.notnull(df), None)
    import numpy as _np
    def nat(v): return v.item() if isinstance(v, _np.generic) else v
    return [{k: nat(v) for k,v in r.items()} for r in clean.to_dict("records")]

def upsert_batches(client, collection, vecs_np, meta_df, ids, batch_size=UPSERT_BATCH):
    """Batched upsert. ids may be a list of UUIDs (strings) or ints."""
    n = vecs_np.shape[0]; i = 0
    while i < n:
        j = min(i+batch_size, n)
        client.upsert(
            collection_name=collection,
            points=models.Batch(
                ids=ids[i:j],
                vectors=vecs_np[i:j].astype("float32").tolist(),
                payloads=df_to_payloads(meta_df.iloc[i:j].reset_index(drop=True))
            ),
            wait=True
        )
        i = j
        print(f"  – Subidos {i}/{n}")

# --------- Column detection in reps ---------
def detect_rep_columns(df: pd.DataFrame):
    cols = {c.lower(): c for c in df.columns}
    def pick(cands):
        for c in cands:
            if c in cols: return cols[c]
        return None
    mapping = {
        "id":        pick(["id","odinumber","cmplid","qid","qdrant_id"]),
        "make":      pick(["make","maketxt"]),
        "model":     pick(["model","modeltxt"]),
        "year":      pick(["year","yeartxt","model_year"]),
        "component": pick(["component","compdesc","compname"]),
        "hash":      pick(["_h","hash","text_hash"]),
        "shard":     pick(["shard_idx","shard","reduced_shard","sidx"]),
        "row":       pick(["row_in_shard","local_idx","offset","row","rid"]),
        "cluster":   pick(["cluster"]),
        "dist2":     pick(["dist2","distance","dist"]),
    }
    # Try to infer shard/row from an id of the form rep_0005_012345
    if mapping["shard"] is None or mapping["row"] is None:
        idc = mapping["id"]
        if idc:
            m = df[idc].astype(str).str.extract(r"rep_(\d{4})_(\d+)", expand=True)
            if m.notna().any().any():
                if mapping["shard"] is None:
                    df["__infer_shard"] = pd.to_numeric(m[0], errors="coerce").astype("Int64")
                    mapping["shard"] = "__infer_shard"
                if mapping["row"] is None:
                    df["__infer_row"]   = pd.to_numeric(m[1], errors="coerce").astype("Int64")
                    mapping["row"] = "__infer_row"
    if mapping["shard"] is None:
        raise ValueError("The reps CSV has no 'shard_idx' column and it could not be inferred from 'id'.")
    if mapping["row"] is None:
        raise ValueError("The reps CSV has no 'row_in_shard' column and it could not be inferred from 'id'.")
    return mapping

# --------- Load reps + vectors per shard ----------
def load_reps_and_vectors(csv_reps: Path, reduced_dir: Path):
    if not csv_reps.exists():
        raise FileNotFoundError(f"Reps CSV not found: {csv_reps}")
    reps = pd.read_csv(csv_reps)
    mapping = detect_rep_columns(reps)

    # Normalize the minimal column types and drop rows with a missing shard or row
    reps["__shard_idx__"] = pd.to_numeric(reps[mapping["shard"]], errors="coerce").astype("Int64")
    reps["__row_in__"]    = pd.to_numeric(reps[mapping["row"]], errors="coerce").astype("Int64")
    reps = reps[reps["__shard_idx__"].notna() & reps["__row_in__"].notna()].copy()
    reps["__shard_idx__"] = reps["__shard_idx__"].astype(int)
    reps["__row_in__"]    = reps["__row_in__"].astype(int)

    # Build a stable ID when the CSV does not provide one
    if mapping["id"] is None:
        reps["__id__"] = (
            "rep_" + reps["__shard_idx__"].map("{:04d}".format)
            + "_" + reps["__row_in__"].map("{:06d}".format)
        )
        mapping["id"] = "__id__"

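    # Cache the vectors of every required shard and check they share one dimension
    vec_cache = {}
    dim = None
    needed = sorted(reps["__shard_idx__"].unique().tolist())
    for s in needed:
        f = reduced_dir / f"reduced_{s:04d}.npy"
        if not f.exists():
            raise FileNotFoundError(f"Missing reduced vectors file: {f}")
        arr = np.load(f, mmap_mode="r")  # memory-mapped; expected float32, shape [n, dim]
        vec_cache[s] = arr
        if dim is None:
            dim = int(arr.shape[1])
        elif int(arr.shape[1]) != dim:
            raise ValueError(f"Shard {s:04d} has dim={arr.shape[1]}, expected {dim}")
    return reps, mapping, vec_cache, dim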

# ---------- Base payload (optional text snippets disabled by default) ----------
def build_payload_df(reps: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    cols = {}
    for k in ["id","make","model","year","component","hash","cluster","dist2"]:
        c = mapping.get(k)
        if c and c in reps.columns:
            cols[k] = c
    out = pd.DataFrame({
        "id": reps[cols.get("id")].astype(str) if "id" in cols else reps.index.astype(str),
        "make": reps[cols["make"]].astype(str).str.strip() if "make" in cols else None,
        "model": reps[cols["model"]].astype(str).str.strip() if "model" in cols else None,
        "year": pd.to_numeric(reps[cols["year"]], errors="coerce").astype("Int64") if "year" in cols else None,
        "component": reps[cols["component"]].astype(str).str.strip() if "component" in cols else None,
        "_h": reps[cols["hash"]].astype(str) if "hash" in cols else None,
        "shard_idx": reps["__shard_idx__"].astype(int),
        "row_in_shard": reps["__row_in__"].astype(int),
    })
    if "cluster" in cols:
        out["cluster"] = pd.to_numeric(reps[cols["cluster"]], errors="coerce").astype("Int64")
    if "dist2" in cols:
        out["dist2"] = pd.to_numeric(reps[cols["dist2"]], errors="coerce")
    return out

# ---------- IDs → deterministic UUIDs ----------
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://nhtsa.example/complaints")  # fixed project namespace

def make_point_uuid(shard_idx: int, row_in_shard: int) -> str:
    """Map 'rep_{shard}_{row}' to a deterministic UUID v5 string."""
    raw = f"rep_{int(shard_idx):04d}_{int(row_in_shard):06d}"
    return str(uuid.uuid5(NAMESPACE, raw))
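
# Optional sketch, not part of the original notebook: the point UUID depends only on
# (shard_idx, row_in_shard), so repeated pairs in the reps CSV would collapse into a
# single point on upsert. `find_duplicate_ids` is a hypothetical helper (an assumption,
# not an existing function of this project) to surface such collisions before uploading.
from collections import Counter

def find_duplicate_ids(point_ids):
    """Return the IDs that occur more than once, in first-seen order."""
    return [pid for pid, count in Counter(point_ids).items() if count > 1]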

# ====================== PIPELINE ======================
# 1) Load reps + vectors per shard
reps, mapping, vec_cache, dim = load_reps_and_vectors(CSV_REPS, REDUCED)
print(f"[i] Representantes: {len(reps):,} | dim={dim} | shards={sorted(reps['__shard_idx__'].unique())[:10]}...")

# 2) Set up the client and collection (compatibility check disabled because of a version warning)
client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY, timeout=180, check_compatibility=False)
ensure_collection(client, CPL_COLLECTION, dim=dim, distance=models.Distance.COSINE)

# 3) Base payload (lightweight)
payload_df = build_payload_df(reps, mapping)

# 4) Upsert per shard (deterministic UUIDs)
total = len(reps)
done = 0
for sidx, block in reps.groupby("__shard_idx__"):
    arr = vec_cache[sidx]  # [n, dim]
    local_rows = block["__row_in__"].tolist()
    block_payload = payload_df.loc[block.index].reset_index(drop=True)

    # Deterministic UUID point IDs
    point_ids = [make_point_uuid(sidx, r) for r in local_rows]

    # Vectors, in the same order as the IDs (one fancy-indexed read from the memmap)
    vecs = np.asarray(arr[local_rows], dtype="float32")

    print(f"[i] Subiendo shard {sidx:04d}: {len(block)} puntos …")
    upsert_batches(client, CPL_COLLECTION, vecs, block_payload, point_ids, batch_size=UPSERT_BATCH)
    done += len(block)
    print(f"    – Progreso: {done}/{total}")

print("[OK] Complaints (representantes) cargados en Qdrant → colección:", CPL_COLLECTION)


[i] Representantes: 20,000 | dim=256 | shards=[np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9)]...
[i] Subiendo shard 0000: 12755 puntos …
  – Subidos 2000/12755
  – Subidos 4000/12755
  – Subidos 6000/12755
  – Subidos 8000/12755
  – Subidos 10000/12755
  – Subidos 12000/12755
  – Subidos 12755/12755
    – Progreso: 12755/20000
[i] Subiendo shard 0001: 1989 puntos …
  – Subidos 1989/1989
    – Progreso: 14744/20000
[i] Subiendo shard 0002: 997 puntos …
  – Subidos 997/997
    – Progreso: 15741/20000
[i] Subiendo shard 0003: 563 puntos …
  – Subidos 563/563
    – Progreso: 16304/20000
[i] Subiendo shard 0004: 458 puntos …
  – Subidos 458/458
    – Progreso: 16762/20000
[i] Subiendo shard 0005: 366 puntos …
  – Subidos 366/366
    – Progreso: 17128/20000
[i] Subiendo shard 0006: 396 puntos …
  – Subidos 396/396
    – Progreso: 17524/20000
[i] Subiendo shard 0007: 364 puntos …
  – Subidos 364/364
    – Progr