# Exercise 6: Dense Retrieval and Introduction to FAISS

## Objective

Generate embeddings with sentence-transformers (SBERT, E5) and index documents with FAISS.

## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with `sklearn.datasets.fetch_20newsgroups`.
2. Limit the corpus to the first 2000 documents to keep processing time manageable.

In [1]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

In [2]:
corpus_limitado = newsgroupsdocs[:2000]

## Part 2: Generating Embeddings
### Activity

1. Use two sentence-transformers models. You can use `'all-MiniLM-L6-v2'` (SBERT) or `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate the embedding vectors for all documents with the selected model.
3. Store the embeddings in a NumPy array for later indexing.
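
Step 3 can be illustrated with a small round trip through `np.save` and `np.load`. This is only a sketch: the array below is random stand-in data rather than real model output, the file name `embeddings_demo.npy` is hypothetical, and the 384 columns simply mirror the output size of `all-MiniLM-L6-v2`. The cast to `float32` is there because FAISS expects that dtype.

```python
import os
import tempfile

import numpy as np

# Stand-in for the output of model.encode(...): 4 documents, 384 dimensions.
# Random values for illustration only, not real embeddings.
rng = np.random.default_rng(0)
demo_embeddings = rng.standard_normal((4, 384)).astype(np.float32)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings_demo.npy")  # hypothetical file name
    np.save(path, demo_embeddings)
    reloaded = np.load(path)

# The reloaded array should match the original in shape, dtype and values.
same_shape = reloaded.shape == demo_embeddings.shape
same_values = np.array_equal(reloaded, demo_embeddings)
print(same_shape, same_values, reloaded.dtype)
```

Saving the array once avoids re-encoding the corpus every time the notebook is restarted, which matters given how long the encoding cells take to run.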

In [5]:
import numpy as np
from sentence_transformers import SentenceTransformer
model_sbert = SentenceTransformer('all-MiniLM-L6-v2')
model_e5 = SentenceTransformer('intfloat/e5-base')


SBERT

In [6]:
# Generate the embeddings
embeddings_sbert = model_sbert.encode(
    corpus_limitado,
    show_progress_bar=True,
    convert_to_numpy=True
)
embeddings_sbert_np = embeddings_sbert

Batches: 100%|██████████| 63/63 [01:17<00:00,  1.22s/it]


E5

In [7]:
corpus_e5 = [f"passage: {doc}" for doc in corpus_limitado]
# Generate the embeddings from the prefixed corpus
embeddings_e5 = model_e5.encode(
    corpus_e5,
    show_progress_bar=True,
    convert_to_numpy=True
)
embeddings_e5_np = embeddings_e5

Batches: 100%|██████████| 63/63 [07:27<00:00,  7.11s/it]
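
E5 models expect a role prefix on every input: `"passage: "` for documents and `"query: "` for queries. A small helper could keep the two consistent. The name `add_e5_prefix` is hypothetical (it is not part of sentence-transformers) and the sample texts are made up for illustration.

```python
def add_e5_prefix(texts, role):
    """Prepend the E5 role prefix ("query: " or "passage: ") to each text."""
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    return [f"{role}: {text}" for text in texts]


# Made-up sample texts, only to show the resulting strings.
demo_passages = add_e5_prefix(["Rockets need fuel.", "Change the oil regularly."], "passage")
demo_queries = add_e5_prefix(["space exploration"], "query")
print(demo_passages)
print(demo_queries)
```

Rejecting unknown roles makes a mistyped prefix fail loudly instead of silently degrading retrieval quality.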


## Part 3: Querying
### Activity

1. Write a natural-language query. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model used for the documents. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents by cosine similarity.
4. Display the text of the retrieved documents (showing only the first 500 characters of each is enough).
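
Step 3 can be checked by hand on toy vectors. Cosine similarity is the dot product of L2-normalized vectors, so ranking by cosine only needs a normalization followed by a matrix-vector product. The sketch below uses NumPy alone; the function name `cosine_top_k` and the 2-D vectors are illustrative assumptions, and it assumes no vector has zero norm.

```python
import numpy as np


def cosine_top_k(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows of doc_matrix closest to query_vec."""
    q = np.asarray(query_vec, dtype=np.float32)
    docs = np.asarray(doc_matrix, dtype=np.float32)
    # L2-normalize so that a plain dot product equals the cosine similarity.
    # Assumes no all-zero vectors (they would produce NaN scores).
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q
    k = min(k, len(scores))
    # Sort by descending score and keep the first k positions.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy 2-D "embeddings" (made-up values, not model output).
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.0])
top_idx, top_scores = cosine_top_k(toy_query, toy_docs, k=2)
print(top_idx, top_scores)
```

The query points along the first axis, so document 0 (same direction) ranks first and document 2 (at 45 degrees) ranks second. The length of the query vector does not affect the ranking.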

In [8]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

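def retrieve_top_k(query_embedding, doc_embeddings, corpus, k=5):
    """Return the k documents most similar to the query by cosine similarity."""
    # query_embedding must have shape (1, dim); take the single row of scores.
    similarity_scores = cosine_similarity(query_embedding, doc_embeddings)[0]
    # Indices sorted by descending score, truncated to the top k.
    top_indices = np.argsort(similarity_scores)[::-1][:k]
    results = []
    for i in top_indices:
        results.append({
            'index': i,
            'score': similarity_scores[i],
            'text': corpus[i]
        })
    return results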
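
The same exhaustive search could also be expressed with FAISS, which this exercise names as its indexing tool. The following is a minimal sketch under stated assumptions, not the method this notebook uses: it builds a flat inner-product index (`faiss.IndexFlatIP`) over L2-normalized vectors, which yields cosine scores, and it falls back to plain NumPy when the `faiss` package is not installed. The function name `build_and_search` and the toy vectors are hypothetical.

```python
import numpy as np

try:
    import faiss  # optional dependency (for example, the faiss-cpu package)
except ImportError:
    faiss = None


def build_and_search(doc_embeddings, query_embedding, k=5):
    """Exact top-k cosine search; uses FAISS when available, NumPy otherwise."""
    # FAISS expects contiguous float32 arrays; copy so the inputs are not modified.
    docs = np.ascontiguousarray(doc_embeddings, dtype=np.float32).copy()
    query = np.ascontiguousarray(query_embedding, dtype=np.float32).reshape(1, -1).copy()
    # Normalize rows so that inner product equals cosine similarity.
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    query /= np.linalg.norm(query, axis=1, keepdims=True)
    k = min(k, docs.shape[0])
    if faiss is not None:
        index = faiss.IndexFlatIP(docs.shape[1])  # flat (exhaustive) inner-product index
        index.add(docs)
        scores, indices = index.search(query, k)
        return indices[0], scores[0]
    # Fallback: brute-force inner product with NumPy.
    all_scores = (docs @ query.T).ravel()
    order = np.argsort(-all_scores)[:k]
    return order, all_scores[order]


# Toy 2-D vectors (made-up values, not model output).
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.0])
faiss_idx, faiss_scores = build_and_search(toy_docs, toy_query, k=2)
print(faiss_idx, faiss_scores)
```

A flat index performs the same brute-force comparison as `retrieve_top_k`, so on 2000 documents the results should agree. FAISS starts to pay off with larger corpora or with its approximate index types.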


In [12]:
QUERY = "God, religion, and spirituality"

In [13]:
# SBERT
query_embedding_sbert = model_sbert.encode(QUERY, convert_to_numpy=True).reshape(1, -1)
results_sbert = retrieve_top_k(query_embedding_sbert, embeddings_sbert_np, corpus_limitado, k=5)
for rank, res in enumerate(results_sbert):
    print(f"\n[{rank+1}. Documento #{res['index']}] (Similitud: {res['score']:.4f})")
    print("-" * 20)
    print(res['text'][:500].strip() + "...")


[1. Documento #996] (Similitud: 0.4150)
--------------------
Humanist, or sub-humanist? :-)...

[2. Documento #282] (Similitud: 0.3307)
--------------------
I didn't know God was a secular humanist...

Kent...

[3. Documento #677] (Similitud: 0.3013)
--------------------
(Deletion)
 
For me, it is a "I believe no gods exist" and a "I don't believe gods exist".
 
In other words, I think that statements like gods are or somehow interfere
with this world are false or meaningless. In Ontology, one can fairly
conclude that when "A exist" is meaningless A does not exist. Under the
Pragmatic definition of truth, "A exists" is meaningless makes A exist
even logically false.
 
A problem with such statements is that one can't disprove a subjective god
by definition, and...

[4. Documento #943] (Similitud: 0.2878)
--------------------
Atoms are not objective.  They aren't even real.  What scientists call
an atom is nothing more than a mathematical model that describes 
certain physical, observab

In [14]:
# Search with E5 ('intfloat/e5-base')
QUERY_E5_PREFIXED = f"query: {QUERY}"
query_embedding_e5 = model_e5.encode(QUERY_E5_PREFIXED, convert_to_numpy=True).reshape(1, -1)
results_e5 = retrieve_top_k(query_embedding_e5, embeddings_e5_np, corpus_limitado, k=5)
for rank, res in enumerate(results_e5):
    print(f"\n[{rank+1}. Documento #{res['index']}] (Similitud: {res['score']:.4f})")
    print("-" * 20)
    print(res['text'][:500].strip() + "...")


[1. Documento #171] (Similitud: 0.8287)
--------------------
But no one (or at least, not many people) are trying to pass off God
as a scientific fact.  Not so with Kirlian photography.  I'll admit that
it is possible that some superior intelligence exists elsewhere, and if
people want to label that intelligence "God", I'm not going to stop
them.  Anyway, let's _not_ turn this into a theological debate.  ;-)


    Read alt.fan.robert.mcelwaine sometime.  I've never been so
closed-minded before subscribing to that group.  :)...

[2. Documento #1883] (Similitud: 0.8223)
--------------------
...


Seems to me if you learned to differentiate between illusion and
reality on your own you wouldn't need to rely on doctrines that
need to be updated.  My experience of Christianity (25+ years) is
that most Christians seek answers from clergymen who have little
or no direct experience of spiritual matters, and that most of
these questions can be answered by simple introspection.  Most
people susp