# Ejercicio 6: Dense Retrieval e Introducción a FAISS

## Objetivo de la práctica

Generar embeddings con sentence-transformers (SBERT, E5), e indexar documentos con FAISS

## Parte 0: Carga del Corpus
### Actividad

1. Carga el corpus 20 Newsgroups desde sklearn.datasets.fetch_20newsgroups.
2. Limita el corpus a los primeros 2000 documentos para facilitar el procesamiento.

In [19]:
# Carga del corpus 20 Newsgroups.
from sklearn.datasets import fetch_20newsgroups

In [20]:
# Limitar el corpus a los primeros 2000 documentos.
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data[:2000]

## Parte 2: Generación de Embeddings
### Actividad

1. Usa dos modelos de sentence-transformers. Puedes usar: `'all-MiniLM-L6-v2'` (SBERT), o `'intfloat/e5-base'` (E5). Cuando uses E5, antepon `"passage: "` a cada documento antes de codificar.
2. Genera los vectores de embeddings para todos los documentos usando el modelo seleccionado.
3. Guarda los embeddings en un array de NumPy para su posterior indexación.

In [21]:
# Usar el modelo sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# Genera los vectores de embeddings para todos los documentos usando el modelo seleccionado.
embeddings = model.encode(newsgroupsdocs, show_progress_bar=True)

Batches:   0%|          | 0/63 [00:00<?, ?it/s]

In [22]:
import numpy as np
import pandas as pd
# Generar una estructura de datos con id, newsgroupsdocs, doc_embedding
data = []
for idx, (doc, embedding) in enumerate(zip(newsgroupsdocs, embeddings)):
    data.append((idx, doc, embedding))
# Convertir a DataFrame para visualización clara
df = pd.DataFrame(data, columns=['id', 'doc', 'doc-embedding'])
print(df.head(10))

   id                                                doc  \
0   0  \n\nI am sure some bashers of Pens fans are pr...   
1   1  My brother is in the market for a high-perform...   
2   2  \n\n\n\n\tFinally you said what you dream abou...   
3   3  \nThink!\n\nIt's the SCSI card doing the DMA t...   
4   4  1)    I have an old Jasmine drive which I cann...   
5   5  \n\nBack in high school I worked as a lab assi...   
6   6  \n\nAE is in Dallas...try 214/241-6060 or 214/...   
7   7  \n[stuff deleted]\n\nOk, here's the solution t...   
8   8  \n\n\nYeah, it's the second one.  And I believ...   
9   9  \nIf a Christian means someone who believes in...   

                                       doc-embedding  
0  [0.0020779956, 0.023450442, 0.024808843, -0.01...  
1  [0.050060306, 0.02698094, -0.008864783, -0.035...  
2  [0.016404718, 0.08100051, -0.049535986, -0.008...  
3  [-0.019391451, 0.011494355, -0.014787266, -0.0...  
4  [-0.0392871, -0.055402845, -0.07453618, -0.012...  
5  [0.021

In [23]:
# Guardar los embeddings en un array para luego ser indexados
doc_embeddings = np.array(embeddings)

## Parte 3: Indexación con FAISS
### Actividad

1. Crea un índice plano con faiss.IndexFlatL2 para búsquedas por distancia euclidiana.
2. Asegúrate de usar la dimensión correcta `(embedding_dim = doc_embeddings.shape[1])`.
3. Agrega los vectores de documentos al índice.

In [25]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl (31.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.3/31.3 MB[0m [31m17.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0


In [28]:
import faiss
# Vectores de documentos.
embedding_dim = doc_embeddings.shape[1]
# Índice de  para búsquedas por distancia euclidiana
index = faiss.IndexFlatL2(embedding_dim)

## Parte 4: Consulta Semántica
### Actividad

1. Escribe una consulta en lenguaje natural. Ejemplos:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Codifica la consulta utilizando el mismo modelo de embeddings. Cuando uses E5, antepon `"query: "` a la consulta.
3. Recupera los 5 documentos más relevantes con `index.search(...)`.
4. Muestra los textos de los documentos recuperados (puedes mostrar solo los primeros 500 caracteres de cada uno).

In [29]:
query = "God, religion, and spirituality"
# Codifica la consulta utilizando el mismo modelo de embeddings.
query_embedding = model.encode([query])

In [30]:
# Recuperación de los 5 documentos más relevantes.
k = 5
distances, indices = index.search(query_embedding, k)

In [31]:
# Muestra los textos de los documentos recuperados, mostrando los primeros 500 caracteres de cada uno
for idx in indices[0]:
    print(newsgroupsdocs[idx][:500])

I know that the placebo effect is where a patient feels better or 
even gets better because of his/her belief in the medicine and 
the doctor administering it.  Is there also an anti-placebo 
effect where the patient dislikes/distrusts doctors and medicine 
and therefore doesn't get better or feel better in spite of the 
medicine?

Is there an effect where the doctor believes so strongly in a 
medicine that he/she sees improvement where the is none or sees 
more improvement than there is?  If so
I know that the placebo effect is where a patient feels better or 
even gets better because of his/her belief in the medicine and 
the doctor administering it.  Is there also an anti-placebo 
effect where the patient dislikes/distrusts doctors and medicine 
and therefore doesn't get better or feel better in spite of the 
medicine?

Is there an effect where the doctor believes so strongly in a 
medicine that he/she sees improvement where the is none or sees 
more improvement than there is?  If s