# Exercise 6: Dense Retrieval and Introduction to FAISS

## Objective of the exercise

Generate embeddings with sentence-transformers (SBERT, E5) and index documents with FAISS.

## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with `sklearn.datasets.fetch_20newsgroups`.
2. Limit the corpus to the first 2000 documents to keep processing manageable.

In [32]:
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import faiss

In [33]:
# Fetch the 20 Newsgroups dataset (the default subset is the training split)
newsgroups = fetch_20newsgroups(subset="train")
newsgroups.filenames.shape

(11314,)

In [34]:
# Fetch the first 2000 documents from the dataset
docs_2000 = newsgroups.data[:2000]
print(len(docs_2000))

2000


## Part 2: Generating Embeddings
### Activity

1. Use two sentence-transformers models, for example `'all-MiniLM-L6-v2'` (SBERT) and `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate the embedding vectors for all documents with each selected model.
3. Store the embeddings in a NumPy array for later indexing.
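
The prefixing rule for E5 can be captured in a small helper. This is a minimal sketch: the function name `add_e5_prefix` and its `kind` parameter are hypothetical and are not part of sentence-transformers.

```python
def add_e5_prefix(texts, kind="passage"):
    """Prepend the E5 role prefix to each text.

    `kind` is "passage" for documents and "query" for search queries.
    Hypothetical helper for illustration; not a library function.
    """
    if kind not in ("passage", "query"):
        raise ValueError(f"kind must be 'passage' or 'query', got {kind!r}")
    return [f"{kind}: {text}" for text in texts]


# Toy inputs standing in for real newsgroup posts.
docs = ["Oil change intervals", "Shuttle launch delayed"]
print(add_e5_prefix(docs))
print(add_e5_prefix(["car maintenance"], kind="query"))
```

Keeping both prefixes in one function makes it harder to encode documents and queries with mismatched roles.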

In [35]:
# Load a pretrained Sentence Transformer model
model_sbert = SentenceTransformer("all-MiniLM-L6-v2")

# Calculate embeddings by calling model.encode()
embeddings_sbert = model_sbert.encode(docs_2000, show_progress_bar=True)

# Store the embeddings in a NumPy array
embeddings_sbert_np = np.array(embeddings_sbert)

Batches: 100%|██████████| 63/63 [00:12<00:00,  5.01it/s]


In [36]:
# Load a pretrained E5 model
model_e5 = SentenceTransformer("intfloat/e5-base")

# Prepare the documents for E5 model by adding a prefix ("passage: ")
docs_2000_e5 = ["passage: " + doc for doc in docs_2000]

# Calculate embeddings for the E5 model
embeddings_e5 = model_e5.encode(docs_2000_e5, show_progress_bar=True, batch_size=64)

# Store the embeddings in a numpy array
embeddings_e5_np = np.array(embeddings_e5)

Batches: 100%|██████████| 32/32 [02:51<00:00,  5.35s/it]
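
Before indexing, it can help to confirm that the stored array has the layout FAISS expects: a 2-D, C-contiguous `float32` matrix with one row per document. The helper below is a sketch under that assumption; `as_faiss_matrix` is a hypothetical name, and the random array stands in for real embeddings.

```python
import numpy as np


def as_faiss_matrix(embeddings):
    """Return embeddings as a 2-D, C-contiguous float32 array."""
    matrix = np.ascontiguousarray(embeddings, dtype=np.float32)
    if matrix.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {matrix.shape}")
    return matrix


# Stand-in data: 4 "documents" with 3-dimensional float64 vectors.
fake_embeddings = np.random.default_rng(0).random((4, 3))
matrix = as_faiss_matrix(fake_embeddings)
print(matrix.dtype, matrix.shape, matrix.flags["C_CONTIGUOUS"])
```

`SentenceTransformer.encode` typically returns `float32` arrays already, in which case the conversion costs almost nothing. The check matters more when embeddings are loaded from disk or produced by another library.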


## Part 3: Indexing with FAISS
### Activity

1. Create a flat index with `faiss.IndexFlatL2` for Euclidean-distance search.
2. Make sure to use the correct dimension `(embedding_dim = doc_embeddings.shape[1])`.
3. Add the document vectors to the index.
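
To make clear what a flat L2 index computes, the following NumPy sketch performs the same kind of exhaustive search by hand: it measures the squared Euclidean distance from each query to every stored vector and keeps the `k` smallest. It illustrates the idea only and is not how FAISS is implemented; `flat_l2_search` and the toy vectors are hypothetical.

```python
import numpy as np


def flat_l2_search(doc_vectors, query_vectors, k):
    """Brute-force nearest-neighbor search by squared L2 distance.

    Returns (distances, indices), each of shape (n_queries, k),
    sorted from nearest to farthest.
    """
    docs = np.asarray(doc_vectors, dtype=np.float32)
    queries = np.asarray(query_vectors, dtype=np.float32)
    if docs.shape[1] != queries.shape[1]:
        raise ValueError("documents and queries must share one dimension")

    # Pairwise differences, shape (n_queries, n_docs, dim).
    diffs = queries[:, None, :] - docs[None, :, :]
    squared = np.sum(diffs ** 2, axis=2)

    # Sort each row and keep the k closest documents.
    indices = np.argsort(squared, axis=1, kind="stable")[:, :k]
    distances = np.take_along_axis(squared, indices, axis=1)
    return distances, indices


# Toy data: three 2-D "documents" and one query.
toy_docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
toy_query = [[0.9, 0.0]]
distances, indices = flat_l2_search(toy_docs, toy_query, k=2)
print(indices)
print(distances)
```

Like `IndexFlatL2`, this sketch reports squared distances rather than their square roots. The ranking is the same either way.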

In [38]:
# Calculate dimensions of the embeddings
embeddings_sbert_dim = embeddings_sbert_np.shape[1]

# Create a FAISS index
index_sbert = faiss.IndexFlatL2(embeddings_sbert_dim)

# Check whether the index needs training before vectors can be added
# (IndexFlatL2 does not, but other index types, such as IVF indexes, do)
print(index_sbert.is_trained)

# Add the embeddings to the index
index_sbert.add(embeddings_sbert_np)
print(f"Number of vectors in the index: {index_sbert.ntotal}")

True
Number of vectors in the index: 2000


In [39]:
# Calculate dimensions of the E5 embeddings
embeddings_e5_dim = embeddings_e5_np.shape[1]

# Create a FAISS index for E5 embeddings
index_e5 = faiss.IndexFlatL2(embeddings_e5_dim)

# Add the E5 embeddings to the index
index_e5.add(embeddings_e5_np)
print(f"Number of vectors in the E5 index: {index_e5.ntotal}")

Number of vectors in the E5 index: 2000


## Part 4: Semantic Query
### Activity

1. Write a query in natural language. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model used for the documents. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents with `index.search(...)`.
4. Display the text of the retrieved documents (showing only the first 500 characters of each is enough).
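
When reading the distances returned by the search, it helps to know that `IndexFlatL2` reports squared Euclidean distances. If the embeddings are unit-length (an assumption; check with `np.linalg.norm` on your own vectors), squared distance and cosine similarity are related by `d^2 = 2 - 2 * cos`, so a smaller distance means a higher similarity. The sketch below verifies this on two hand-made unit vectors; the function name is hypothetical.

```python
import math


def cosine_from_squared_l2(squared_distance):
    """Cosine similarity of two unit vectors given their squared L2 distance."""
    return 1.0 - squared_distance / 2.0


# Two unit-length toy vectors (not real embeddings).
a = (1.0, 0.0)
b = (0.6, 0.8)

squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
dot_product = sum(x * y for x, y in zip(a, b))

# Both routes should agree up to floating-point error.
print(cosine_from_squared_l2(squared_distance), dot_product)
assert math.isclose(cosine_from_squared_l2(squared_distance), dot_product)
```

The conversion does not hold for vectors that are not normalized, so distances from different models should be compared with care.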

In [40]:
# Insert a query
query_sbert = "car maintenance"

# Encode the query using the same model
query_embedding = model_sbert.encode([query_sbert])

# Retrieve the top 5 most similar documents
k = 5
D, I = index_sbert.search(query_embedding, k)
# D holds the squared L2 distances, I the indices of the nearest neighbors

# Print the results, showing only the first 500 characters of each document
for i in range(k):
    print(f"Document {I[0][i]} (Distance: {D[0][i]:.4f}):")
    print(docs_2000[I[0][i]][:500])  # Print only the first 500 characters
    print("-" * 80)

Document 1214 (Distance: 1.1100):
From: welty@cabot.balltown.cma.COM (richard welty)
Subject: rec.autos: Welcome to to the new reader
Keywords: Monthly Posting
Reply-To: welty@balltown.cma.com
Organization: New York State Institute for Sebastian Cabot Studies
Expires: Thu, 20 May 1993 04:00:05 GMT
Lines: 269

Archive-name: rec-autos/part1

[most recent changes, 15 March 1993: addition of alt.autos.karting -- rpw]

               === Welcome to Rec.Autos.* ===

This article is sent out automatically each month, and contains a gen
--------------------------------------------------------------------------------
Document 1766 (Distance: 1.3416):
From: harmons@.WV.TEK.COM (Harmon Sommer)
Subject: Re: BMW MOA members read this!
Lines: 22

Sender: 
Reply-To: harmons@gyro.WV.TEK.COM (Harmon Sommer)
Distribution: 
Organization: /usr/ens/etc/organization
Keywords: 


>>: As a new BMW owner I was thinking about signing up for the MOA, but
>>: right now it is beginning to look suspiciously like th

In [42]:
# Insert a query for E5
query_e5 = "car maintenance"

# Encode the query using the E5 model
# Note: E5 model requires a prefix "query: "
query_embedding_e5 = model_e5.encode(["query: " + query_e5])

# Retrieve the top 5 most similar documents from the E5 index
D_e5, I_e5 = index_e5.search(query_embedding_e5, k)

# Print the results for E5
for i in range(k):
    print(f"Document {I_e5[0][i]} (Distance: {D_e5[0][i]:.4f}):")
    print(docs_2000[I_e5[0][i]][:500])  # Print only the first 500 characters
    print("-" * 80)

Document 526 (Distance: 0.3950):
From: al@qiclab.scn.rain.com (Alan Peterman)
Subject: Re: "ELECTRONIC" ODOMETER
Organization: SCN Research/Qic Laboratories of Tigard, Oregon.
Lines: 24

In article <C5Fp8B.2Co@megatest.com> alung@megatest.com (Aaron Lung) writes:
>If I'm not mistaken, altering the odometer is *illegal*.  Furthermore,
>I surmise it'll be tough to alter BMW's odometer if you got at it.
>Some of the newer BMW's have electronic odometers making it even
>more tamperproof.

On the cars mentioned - 3 series from the l
--------------------------------------------------------------------------------
Document 64 (Distance: 0.3956):
From: sjp@hpuerca.atl.hp.com (Steve Phillips)
Subject: Re: Ford and the automobile
Organization: Hewlett-Packard NARC Atlanta
X-Newsreader: Tin 1.1.3 PL5
Lines: 14

: Ford and his automobile.  I need information on whether Ford is
: partially responsible for all of the car accidents and the depletion of
: the ozone layer.  Also, any other additional i