# Exercise 8: Vector Databases

Vector databases store and retrieve information represented as vectors in high-dimensional spaces. We start by reviewing the mathematical foundations they build on.

## 1. Vector Spaces

Each document, image, or query is represented as a real-valued vector in $\mathbb{R}^n$:

$$ \vec{d} = [d_1, d_2, \dots, d_n] \in \mathbb{R}^n $$

where $n$ is typically 384, 768, or 1536, depending on the embedding model used. The `word2vec-google-news-300` model loaded in this exercise produces vectors with $n = 300$.

In [2]:
# Load the 20 Newsgroups corpus.
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

In [3]:
newgroups = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
newgroupsdocs = newgroups.data

In [4]:
newgroupsdocs_df = pd.DataFrame(newgroupsdocs, columns=['raw'])
newgroupsdocs_df

Unnamed: 0,raw
0,\n\nI am sure some bashers of Pens fans are pr...
1,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...
4,1) I have an old Jasmine drive which I cann...
...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...
18842,\nNot in isolated ground recepticles (usually ...
18843,I just installed a DX2-66 CPU in a clone mothe...
18844,\nWouldn't this require a hyper-sphere. In 3-...


In [5]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [6]:
from nltk.tokenize import word_tokenize

In [7]:
stop_words = set(stopwords.words('english'))
def preprocess_doc(doc):
  words = word_tokenize(doc)
  word_filtered = [w for w in words if w not in stop_words and w.isalpha()]
  return ' '.join(word_filtered)

In [8]:
newgroupsdocs_df['preprocessed'] = newgroupsdocs_df['raw'].apply(preprocess_doc)
newgroupsdocs_df

Unnamed: 0,raw,preprocessed
0,\n\nI am sure some bashers of Pens fans are pr...,I sure bashers Pens fans pretty confused lack ...
1,My brother is in the market for a high-perform...,My brother market video card supports VESA loc...
2,\n\n\n\n\tFinally you said what you dream abou...,Finally said dream Mediterranean That new The ...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,Think It SCSI card DMA transfers NOT disks The...
4,1) I have an old Jasmine drive which I cann...,I old Jasmine drive I use new system My unders...
...,...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...,DN From nyeda David Nye DN A neurology DN cons...
18842,\nNot in isolated ground recepticles (usually ...,Not isolated ground recepticles usually unusua...
18843,I just installed a DX2-66 CPU in a clone mothe...,I installed CPU clone motherboard tried mounti...
18844,\nWouldn't this require a hyper-sphere. In 3-...,Would require In points specifies sphere far I...


In [1]:
!pip install gensim



In [11]:
import gensim.downloader as api
model = api.load("word2vec-google-news-300")



In [12]:
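import numpy as np
def embedding_doc(doc):
  # Document vector = mean of the word2vec vectors of its in-vocabulary tokens
  words = word_tokenize(doc)
  vec_words = [model[word] for word in words if word in model]
  # If no token is in the vocabulary, np.mean([]) yields NaN plus a RuntimeWarning
  return np.mean(vec_words, axis=0)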

In [13]:
import numpy as np
newgroupsdocs_df['embedding'] = newgroupsdocs_df['preprocessed'].apply(embedding_doc)
newgroupsdocs_df

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


Unnamed: 0,raw,preprocessed,embedding
0,\n\nI am sure some bashers of Pens fans are pr...,I sure bashers Pens fans pretty confused lack ...,"[0.044840634, 0.042145394, -0.0068237716, 0.11..."
1,My brother is in the market for a high-perform...,My brother market video card supports VESA loc...,"[-0.055559713, -0.015373476, 0.02511609, 0.018..."
2,\n\n\n\n\tFinally you said what you dream abou...,Finally said dream Mediterranean That new The ...,"[0.04519834, -0.0076894974, 0.1272826, 0.09746..."
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,Think It SCSI card DMA transfers NOT disks The...,"[0.012831859, 0.0052521536, 0.0058194674, 0.01..."
4,1) I have an old Jasmine drive which I cann...,I old Jasmine drive I use new system My unders...,"[0.049170937, 0.0056467694, 0.04843648, 0.0792..."
...,...,...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...,DN From nyeda David Nye DN A neurology DN cons...,"[-0.011498247, 0.085058756, 0.00591256, 0.0744..."
18842,\nNot in isolated ground recepticles (usually ...,Not isolated ground recepticles usually unusua...,"[0.030495537, -0.00055185956, -0.038831923, 0...."
18843,I just installed a DX2-66 CPU in a clone mothe...,I installed CPU clone motherboard tried mounti...,"[0.004587446, -0.004250663, 0.017289298, 0.091..."
18844,\nWouldn't this require a hyper-sphere. In 3-...,Would require In points specifies sphere far I...,"[-0.015062968, -0.004964193, 0.06576029, 0.107..."


In [14]:
query = 'chicken'
query_emb = embedding_doc(query)
query_emb

array([-0.13476562, -0.02404785,  0.04418945,  0.27539062, -0.03271484,
        0.13183594,  0.17382812, -0.00095367, -0.0625    ,  0.20996094,
       -0.03051758, -0.3046875 , -0.10742188, -0.08203125, -0.43359375,
        0.03637695, -0.11474609,  0.01092529, -0.34375   , -0.02929688,
        0.30273438, -0.08203125,  0.22265625,  0.11083984, -0.14257812,
       -0.04443359, -0.01745605,  0.01531982,  0.0018692 ,  0.23828125,
       -0.26367188, -0.15136719,  0.13183594, -0.12792969,  0.0703125 ,
        0.24609375,  0.17871094,  0.12353516,  0.06396484,  0.265625  ,
       -0.12158203, -0.22558594,  0.13867188,  0.125     , -0.03588867,
       -0.20019531, -0.08837891, -0.00234985,  0.20703125,  0.21679688,
       -0.12695312,  0.23730469,  0.0234375 ,  0.1328125 , -0.06835938,
       -0.09179688,  0.17089844, -0.07617188,  0.22070312,  0.03735352,
       -0.04492188,  0.20410156, -0.12011719, -0.00543213,  0.23632812,
       -0.28125   , -0.1484375 , -0.07373047,  0.21972656, -0.08

## 2. Similarity Measures

The core principle of a vector database is to find items whose vector lies "close" to the query vector. There are several ways to measure this closeness:

### a. Euclidean Distance (L2)

$$ \text{dist}(\vec{q}, \vec{d}) = \sqrt{\sum_{i=1}^n (q_i - d_i)^2} $$

Used when the vectors are not normalized. L2 is the default metric in `FAISS`, where `IndexFlatL2` implements it.

### b. Cosine Similarity

$$ \cos(\theta) = \frac{\vec{q} \cdot \vec{d}}{\|\vec{q}\| \cdot \|\vec{d}\|} $$

This metric is the right choice when the angle between vectors (their direction) matters rather than their magnitude. It is available in `ChromaDB`, and it can also be emulated in FAISS if the vectors are normalized.

When both vectors are normalized to unit length, the two measures are related by:

$$ \text{dist}_{\text{L2}}^2 = 2 - 2 \cdot \cos(\theta) $$
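
This identity follows from expanding the squared norm. A short derivation, assuming $\|\vec{q}\| = \|\vec{d}\| = 1$ so that $\vec{q} \cdot \vec{d} = \cos(\theta)$:

$$
\begin{aligned}
\|\vec{q} - \vec{d}\|^2 &= (\vec{q} - \vec{d}) \cdot (\vec{q} - \vec{d}) \\
&= \|\vec{q}\|^2 - 2\,\vec{q} \cdot \vec{d} + \|\vec{d}\|^2 \\
&= 1 - 2\cos(\theta) + 1 \\
&= 2 - 2\cos(\theta)
\end{aligned}
$$

Since the right-hand side decreases as $\cos(\theta)$ grows, ranking unit vectors by ascending L2 distance gives the same order as ranking them by descending cosine similarity.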

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

In [19]:
# Keep only documents that produced a valid embedding; documents with no
# in-vocabulary words yield NaN instead of a 300-dimensional vector.
valid = newgroupsdocs_df['embedding'].apply(lambda e: isinstance(e, np.ndarray))
doc_matrix = np.vstack(newgroupsdocs_df.loc[valid, 'embedding'].tolist())

# cosine_similarity expects 2-D inputs: (n_docs, 300) against (1, 300)
results = cosine_similarity(doc_matrix, query_emb.reshape(1, -1)).ravel()
results

In [None]:
# Sanity check: similarity between the query and the first document only
sim = cosine_similarity([query_emb], [newgroupsdocs_df['embedding'][0]])
sim

array([[0.18173103]], dtype=float32)

In [None]:
# Pair each valid document with its score and list the most similar ones first
ranked = newgroupsdocs_df.loc[valid, ['preprocessed']].assign(similarity=results)
ranked.sort_values('similarity', ascending=False).head(10)


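In [None]:
from numpy.linalg import norm

# Toy 3-D vectors. Their definitions are not shown in the notebook; these are
# assumed values, chosen to be consistent with the results printed below.
query = np.array([0.1, 0.3, 0.4])
doc1 = np.array([0.2, 0.1, 0.5])
doc2 = np.array([0.0, 0.1, 0.5])

dist1 = norm(query - doc1)
dist2 = norm(query - doc2)

print("Euclidean distance to doc1:", dist1)
print("Euclidean distance to doc2:", dist2)

# Named cosine_sim so it does not shadow sklearn's cosine_similarity
def cosine_sim(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

sim1 = cosine_sim(query, doc1)
sim2 = cosine_sim(query, doc2)

print("Cosine similarity with doc1:", sim1)
print("Cosine similarity with doc2:", sim2)

Euclidean distance to doc1: 0.2449489742783178
Euclidean distance to doc2: 0.24494897427831785
Cosine similarity with doc1: 0.8951435925492911
Cosine similarity with doc2: 0.8846153846153845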


## 3. Vector Normalization

Many systems normalize vectors so that their norm equals 1:

$$ \hat{\vec{v}} = \frac{\vec{v}}{\|\vec{v}\|} $$

This makes the Euclidean distance a direct function of the cosine similarity, so both metrics produce the same ranking, which keeps searches efficient and comparable.

In [None]:
def normalize(v):
    return v / norm(v)

q_norm = normalize(query)
d1_norm = normalize(doc1)
d2_norm = normalize(doc2)

print("Normalized vector q:", q_norm)
print("Cosine similarity after normalization (dot):", np.dot(q_norm, d1_norm), np.dot(q_norm, d2_norm))

# Theoretical relation: dist² = 2 - 2cos(θ)
dot = np.dot(q_norm, d1_norm)
euclidean_sq = norm(q_norm - d1_norm)**2
print("2 - 2cos(theta):", 2 - 2 * dot)
print("Squared Euclidean distance:", euclidean_sq)

Normalized vector q: [0.19611614 0.58834841 0.78446454]
Cosine similarity after normalization (dot): 0.8951435925492911 0.8846153846153845
2 - 2cos(theta): 0.2097128149014178
Squared Euclidean distance: 0.2097128149014178


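## 4. Indexing and Acceleration

Searching millions of vectors directly is expensive: a brute-force scan costs $O(n \cdot d)$ per query, where here $n$ counts the stored vectors and $d$ is their dimension. Approximate index structures are used to speed this up.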
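
As a baseline for the index structures in this section, the exhaustive scan can be sketched in a few lines of NumPy. This is only an illustration: `brute_force_search` is a hypothetical helper, and the random data stands in for real embeddings.

```python
import numpy as np

def brute_force_search(vectors, q, k=3):
    """Exact k-nearest-neighbor search by L2 distance (hypothetical helper).

    Every one of the n stored vectors and all d coordinates are touched,
    which is where the O(n * d) cost per query comes from.
    """
    dists = np.linalg.norm(vectors - q, axis=1)  # one distance per stored vector
    top = np.argsort(dists)[:k]                  # indices of the k smallest distances
    return top, dists[top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64)).astype(np.float32)  # illustrative random "embeddings"
q = vectors[42] + 0.01                                    # a point very close to vector 42

idx, dist = brute_force_search(vectors, q, k=3)
print(idx, dist)
```

The result is exact, but the work grows linearly with the size of the collection, which is what motivates approximate indexes.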

### a. IVF (Inverted File Index)
- Applies clustering (k-means) to the vectors.
- At search time, only a subset of the clusters is scanned.
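
The two steps above can be illustrated with a toy NumPy version. This is a minimal sketch under stated assumptions, not how FAISS implements IVF: `build_ivf` and `ivf_search` are hypothetical helpers, the k-means is a plain Lloyd loop, and `nprobe` is used here as the name for the number of clusters visited per query.

```python
import numpy as np

def nearest_centroid(vectors, centroids):
    """Index of the closest centroid for every vector."""
    d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def build_ivf(vectors, n_clusters=8, n_iter=10, seed=0):
    """Run a few k-means iterations and build the inverted lists
    (cluster id -> indices of the vectors assigned to it)."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        assign = nearest_centroid(vectors, centroids)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members) > 0:          # leave empty clusters where they are
                centroids[c] = members.mean(axis=0)
    assign = nearest_centroid(vectors, centroids)  # final assignment
    lists = [np.where(assign == c)[0] for c in range(n_clusters)]
    return centroids, lists

def ivf_search(vectors, centroids, lists, q, k=3, nprobe=2):
    """Scan only the vectors stored in the nprobe clusters closest to q."""
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    candidates = np.concatenate([lists[c] for c in order])
    d = np.linalg.norm(vectors[candidates] - q, axis=1)
    best = np.argsort(d)[:k]
    return candidates[best], d[best]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(2000, 16))   # illustrative random data
q = rng.normal(size=16)

centroids, lists = build_ivf(vectors)
idx, dist = ivf_search(vectors, centroids, lists, q, k=3, nprobe=2)
print(idx, dist)
```

A small `nprobe` scans fewer vectors but may miss true neighbors stored in clusters that were not visited; setting `nprobe` to the number of clusters falls back to an exhaustive scan.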

### b. HNSW (Hierarchical Navigable Small World)
- Builds a hierarchical graph of nearest neighbors.
- Enables efficient search whose cost grows roughly logarithmically with the number of vectors.
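
The graph idea can be illustrated with a greedy walk over a single k-nearest-neighbor graph. This is a deliberately simplified sketch, not HNSW itself: HNSW stacks several such graphs of different density and uses a wider candidate list instead of a purely greedy walk. `build_knn_graph` and `greedy_search` are hypothetical helpers, and the graph is built by brute force only to keep the example short.

```python
import numpy as np

def build_knn_graph(vectors, m=8):
    """Link every vector to its m nearest neighbors (brute force, for illustration)."""
    d = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a node is not its own neighbor
    return np.argsort(d, axis=1)[:, :m]  # shape (n, m): neighbor indices per node

def greedy_search(vectors, graph, q, entry=0):
    """Walk the graph, always moving to the neighbor closest to q,
    and stop when no neighbor improves on the current node."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - q)
    n_evals = 1                          # number of distance computations so far
    while True:
        neighbors = graph[current]
        dists = np.linalg.norm(vectors[neighbors] - q, axis=1)
        n_evals += len(neighbors)
        best = dists.argmin()
        if dists[best] >= current_dist:  # local minimum reached
            return current, current_dist, n_evals
        current, current_dist = neighbors[best], dists[best]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 8))      # illustrative random data
q = rng.normal(size=8)

graph = build_knn_graph(vectors, m=8)
node, node_dist, n_evals = greedy_search(vectors, graph, q)
print(node, node_dist, n_evals)
```

The walk evaluates only the neighbors of the nodes it passes through, so it usually computes far fewer distances than a full scan, but it can stop at a local minimum that is not the true nearest neighbor. The hierarchy in HNSW is designed to reduce that risk.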