# Final Exam - Information Retrieval  
**Date:** July 28, 2025  
**Autor:** Wilson Inga  

---

## General Objective

Design and implement an Information Retrieval (IR) system that:

a) Ingests and preprocesses a corpus of scientific documents (a 1% subset of arXiv).  
b) Implements retrieval using:
- The **TF–IDF** model
- The **BM25** model
- A vector index built with **FAISS** or **ChromaDB**

c) Integrates a **RAG (Retrieval-Augmented Generation)** module with a language model.  
d) Evaluates retrieval quality by comparing the results of each model.



# 1. Loading the Corpus

In [2]:
import os
import pandas as pd
import kaggle

# Kaggle dataset identifier and local target directory
dataset = "Cornell-University/arxiv"
path = "../data/arxiv"

# Download the dataset only if the target directory does not exist yet
if not os.path.exists(path):
    os.makedirs(path)

    kaggle.api.dataset_download_files(dataset, path=path, unzip=True)
    print("Dataset downloaded and unzipped.")
else:
    print("The dataset already exists at the specified path.")

The dataset already exists at the specified path.


In [10]:
import json
import random

input_file = os.path.join(path, "arxiv-metadata-oai-snapshot.json")
output_file = os.path.join(path, "arxiv_sample_1_percent.jsonl")

# Count the total number of lines (this may take a while)
with open(input_file, 'r', encoding='utf-8') as f:
    total_lines = sum(1 for _ in f)

# Sample size (1%)
sample_size = int(total_lines * 0.01)

# Randomly choose the indices of the lines to keep
sample_indices = set(random.sample(range(total_lines), sample_size))

# Extract only those lines and write them to the output file
with open(input_file, 'r', encoding='utf-8') as fin, open(output_file, 'w', encoding='utf-8') as fout:
    for i, line in enumerate(fin):
        if i in sample_indices:
            fout.write(line)

print(f"Saved 1% ({sample_size} lines) to '{output_file}'")


Saved 1% (27923 lines) to '../data/arxiv\arxiv_sample_1_percent.jsonl'
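The sampling cell above reads the file twice (once to count lines, once to extract them) and draws a different sample on every run. A minimal sketch of a single-pass alternative follows, assuming an approximate 1% is acceptable: each line is kept independently with probability `fraction`. The function name `bernoulli_sample` and the fixed `seed` are illustrative choices, not part of the original notebook.

```python
import random

def bernoulli_sample(lines, fraction=0.01, seed=42):
    """Yield each line independently with probability `fraction` (one pass)."""
    rng = random.Random(seed)  # fixed seed (an assumption) makes the draw repeatable
    for line in lines:
        if rng.random() < fraction:
            yield line

# Toy usage on an in-memory stand-in for the JSONL file
lines = [f"record-{n}\n" for n in range(10_000)]
picked = list(bernoulli_sample(lines))
print(len(picked))  # close to 100, but not exactly 1% in general
```

The trade-off is that the sample size is only approximately 1%, whereas the notebook's two-pass approach returns exactly `int(total_lines * 0.01)` lines.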


In [13]:
df_sample = pd.read_json(output_file, lines=True)
df_sample

| | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0704.0015 | Christian Stahn | Christian Stahn | Fermionic superstring loop amplitudes in the p... | 22 pages; signs and coefficients adjusted for ... | JHEP 0705:034,2007 | 10.1088/1126-6708/2007/05/034 | | hep-th | | The pure spinor formulation of the ten-dimen... | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | 2009-11-13 | [[Stahn, Christian, ]] |
| 1 | 0704.0046 | Denes Petz | I. Csiszar, F. Hiai and D. Petz | A limit relation for entropy and channel capac... | LATEX file, 11 pages | J. Math. Phys. 48(2007), 092102. | 10.1063/1.2779138 | | quant-ph cs.IT math.IT | | In a quantum mechanical model, Diosi, Feldma... | [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... | 2009-11-13 | [[Csiszar, I., ], [Hiai, F., ], [Petz, D., ]] |
| 2 | 0704.0060 | Carlos Bertulani | C.A. Bertulani, G. Cardella, M. De Napoli, G. ... | Coulomb excitation of unstable nuclei at inter... | 12 pages, 2 figures, accepted for publication ... | Phys.Lett.B650:233-238,2007 | 10.1016/j.physletb.2007.05.029 | | nucl-th | | We investigate the Coulomb excitation of low... | [{'version': 'v1', 'created': 'Sat, 31 Mar 200... | 2008-11-26 | [[Bertulani, C. A., ], [Cardella, G., ], [De N... |
| 3 | 0704.0069 | John Robertson | John W. Robertson | Dynamical Objects for Cohomologically Expandin... | 38 pages | | | | math.DS | | The goal of this paper is to construct invar... | [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... | 2010-01-08 | [[Robertson, John W., ]] |
| 4 | 0704.0079 | Stephen C. Power | Stephen C. Power (Lancaster University), Baruc... | Operator algebras associated with unitary comm... | 38 pages | | | | math.OA | | We define nonselfadjoint operator algebras w... | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | 2007-05-23 | [[Power, Stephen C., , Lancaster University], ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27918 | solv-int/9712008 | Ziad Maassarani | Z. Maassarani (Laval university) | The XXC Models | 6 pages, LaTeX | Phys. Lett. A 244 (1998) 160-164 | 10.1016/S0375-9601(98)00322-3 | LAVAL-PHY-27/97 | solv-int cond-mat math.QA nlin.SI q-alg | | A class of recently introduced multi-states ... | [{'version': 'v1', 'created': 'Thu, 11 Dec 199... | 2009-10-30 | [[Maassarani, Z., , Laval university]] |
| 27919 | solv-int/9801026 | Yasuhiro Fujii | Yasuhiro Fujii and Miki Wadati | Correlation Functions of Finite XXZ model with... | 16pages, LaTeX2e file, errors corrected | | | | solv-int cond-mat hep-th nlin.SI | | The finite XXZ model with boundaries is cons... | [{'version': 'v1', 'created': 'Wed, 28 Jan 199... | 2007-05-23 | [[Fujii, Yasuhiro, ], [Wadati, Miki, ]] |
| 27920 | solv-int/9807004 | I. A. B. Strachan | I.A.B.Strachan | Degenerate Frobenius manifolds and the bi-Hami... | 28 pages, LaTeX | J. Math. Phys. 40, 5058 (1999); | 10.1063/1.533015 | | solv-int nlin.SI | | The bi-Hamiltonian structure of certain mult... | [{'version': 'v1', 'created': 'Wed, 8 Jul 1998... | 2020-12-16 | [[Strachan, I. A. B., ]] |
| 27921 | solv-int/9901001 | Jeremy Schiff | Michael Fisher, Jeremy Schiff | The Camassa-Holm Equation: Conserved Quantitie... | 8 pages, LaTeX | | 10.1016/S0375-9601(99)00466-1 | | solv-int nlin.SI | | Using a Miura-Gardner-Kruskal type construct... | [{'version': 'v1', 'created': 'Mon, 4 Jan 1999... | 2009-10-31 | [[Fisher, Michael, ], [Schiff, Jeremy, ]] |


# Architecture Implementation

# 1. Corpus Preprocessing

In [32]:
import json
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

# Download NLTK resources
nltk.download('stopwords')
stop_words = set(stopwords.words("english"))

# Preprocessing: lowercase, delete punctuation, and drop English stopwords
def preprocess_text(title, abstract):
    text = f"{title} {abstract}".lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in stop_words]
    return ' '.join(tokens)

# Load the corpus from a JSONL file (one JSON record per line)
def load_corpus(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        corpus = [json.loads(line) for line in f]
    return corpus

# Load the data
file_path = "../data/arxiv/arxiv_sample_1_percent.jsonl"
corpus = load_corpus(file_path)
df_sample = pd.DataFrame(corpus)

# Apply preprocessing to title + abstract
df_sample['Preprocesado'] = df_sample.apply(
    lambda row: preprocess_text(row['title'], row['abstract']), axis=1
)

# Add a column with the original combined text (not preprocessed)
df_sample['Original'] = df_sample['title'] + " " + df_sample['abstract']

# Show the first rows, original next to preprocessed
df_sample[['Original', 'Preprocesado']].head(10)



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\wil_s\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | Original | Preprocesado |
|---|---|---|
| 0 | Fermionic superstring loop amplitudes in the p... | fermionic superstring loop amplitudes pure spi... |
| 1 | A limit relation for entropy and channel capac... | limit relation entropy channel capacity per un... |
| 2 | Coulomb excitation of unstable nuclei at inter... | coulomb excitation unstable nuclei intermediat... |
| 3 | Dynamical Objects for Cohomologically Expandin... | dynamical objects cohomologically expanding ma... |
| 4 | Operator algebras associated with unitary comm... | operator algebras associated unitary commutati... |
| 5 | Interface dynamics of microscopic cavities in ... | interface dynamics microscopic cavities water ... |
| 6 | Specific heat and bimodality in canonical and ... | specific heat bimodality canonical grand canon... |
| 7 | Isospin breaking in the yield of heavy meson p... | isospin breaking yield heavy meson pairs ee an... |
| 8 | Remnant evolution after a carbon-oxygen white ... | remnant evolution carbonoxygen white dwarf mer... |
| 9 | Direct Theorems in the Theory of Approximation... | direct theorems theory approximation banach sp... |
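Rows 7 and 8 above show a side effect of deleting punctuation outright: hyphenated terms are fused into a single token (`carbonoxygen`) and symbols collapse (`ee`). The sketch below contrasts that behavior with a variant that replaces punctuation with a space. It is self-contained for illustration only: `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list, and the sample sentence is made up.

```python
import re

# Tiny illustrative stopword list (an assumption; the notebook uses NLTK's English list)
STOP_WORDS = {"a", "the", "of", "in"}

def preprocess_strip(text):
    """Same steps as preprocess_text: punctuation is deleted outright."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

def preprocess_space(text):
    """Variant: punctuation becomes a space, so hyphenated words split apart."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

sample = "Evolution of a carbon-oxygen white dwarf"  # made-up example sentence
print(preprocess_strip(sample))  # evolution carbonoxygen white dwarf
print(preprocess_space(sample))  # evolution carbon oxygen white dwarf
```

Which variant is preferable depends on the queries: fused tokens only match queries that are fused the same way, while split tokens also match the individual words.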


# 2. Indexing

## TF-IDF

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_tfidf = TfidfVectorizer()
tfidf_matrix = vectorizer_tfidf.fit_transform(df_sample['Preprocesado'])

print("TF–IDF index built. Matrix shape:", tfidf_matrix.shape)


TF–IDF index built. Matrix shape: (27923, 123456)
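For reference, with the default `TfidfVectorizer()` settings used in this cell (raw term counts, smoothed IDF, L2 row normalization), the weight of term $t$ in document $d$ can be written as follows, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\bigr)^{2}}}
$$

Because every row has unit L2 norm, the dot product between a document row and a query vector transformed by the same vectorizer equals their cosine similarity.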


## BM25

In [35]:
from rank_bm25 import BM25Okapi

# Tokenize the preprocessed corpus
tokenized_corpus = [doc.split() for doc in df_sample['Preprocesado']]

# Build the BM25 index
bm25 = BM25Okapi(tokenized_corpus)

print("BM25 index built. Number of documents:", len(tokenized_corpus))


BM25 index built. Number of documents: 27923
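To make the scoring behind the BM25 index concrete, here is a minimal pure-Python sketch of textbook Okapi BM25. It is not the `rank_bm25` implementation: the values `k1=1.5` and `b=0.75` are common defaults assumed here, and the IDF uses the non-negative `ln(1 + ...)` variant, which may differ from what the library computes. The three toy documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with textbook Okapi BM25."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: number of documents in which each term occurs
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative IDF variant: ln(1 + (N - df + 0.5) / (df + 0.5))
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up toy corpus, already tokenized like `tokenized_corpus`
docs = [
    "higgs boson decay channels".split(),
    "top quark production cross section".split(),
    "higgs boson production".split(),
]
scores = bm25_scores("higgs decay".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 0: the only document containing both query terms
```

Compared with TF-IDF, the term-frequency contribution saturates (controlled by `k1`) and document length is normalized (controlled by `b`).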


## FAISS

In [36]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load the embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Compute the embeddings (pass a plain list of strings to encode)
embeddings = model.encode(df_sample['Preprocesado'].tolist(), convert_to_numpy=True)

# Build the FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1])  # exact search, L2 distance
index.add(embeddings)

print("FAISS index built. Total vectors:", index.ntotal)


FAISS index built. Total vectors: 27923
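`IndexFlatL2` performs an exhaustive (exact) search: the query is compared against every stored vector and the `k` smallest squared L2 distances are returned. The following pure-Python sketch reproduces that idea on toy 2-D vectors. The helper name `flat_l2_search` and the vectors are illustrative; the sketch makes no attempt to match FAISS's performance or API.

```python
import heapq

def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbor search by squared L2 distance."""
    dists = [
        (sum((v_i - q_i) ** 2 for v_i, q_i in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    nearest = heapq.nsmallest(k, dists)  # k smallest (distance, index) pairs
    distances = [d for d, _ in nearest]
    indices = [i for _, i in nearest]
    return distances, indices

# Made-up 2-D "embeddings"
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(vectors, query=[0.9, 0.1], k=2)
print(indices)  # [1, 0]
```

The cost grows linearly with the number of vectors, which is manageable for the 27,923 documents indexed here.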


In [37]:
# Store the document IDs
doc_ids = df_sample['id'].tolist()

# 3. Retrieval

## 3.1. Search with TF-IDF

In [38]:
def search_tfidf(query, top_k=10):
    query_proc = preprocess_text(query, "")
    q_vec = vectorizer_tfidf.transform([query_proc])
    scores = tfidf_matrix @ q_vec.T
    top_indices = scores.toarray().ravel().argsort()[-top_k:][::-1]
    return df_sample.iloc[top_indices][['id', 'title', 'abstract']]

## 3.2. Search with BM25

In [39]:
def search_bm25(query, top_k=10):
    query_proc = preprocess_text(query, "")
    query_tokens = query_proc.split()
    scores = bm25.get_scores(query_tokens)
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return df_sample.iloc[top_indices][['id', 'title', 'abstract']]

## 3.3. Search with FAISS

In [40]:
def search_faiss(query, top_k=10):
    query_proc = preprocess_text(query, "")
    query_embed = model.encode([query_proc])
    _, top_indices = index.search(query_embed, top_k)
    return df_sample.iloc[top_indices[0]][['id', 'title', 'abstract']]

## 3.4. Model Comparison

In [53]:
# Read queries.txt (one query per line), skipping blank lines
with open("../data/arxiv/queries.txt", "r", encoding="utf-8") as f:
    queries = [line.strip() for line in f if line.strip()]

### Comparison Table

In [55]:
import pandas as pd

def rank_overlap(list1, list2):
    return len(set(list1) & set(list2))

# Collect one row of overlap counts per query
results_comparison = []

# Evaluate each query
for query in queries:
    tfidf_ids = search_tfidf(query)['id'].tolist()
    bm25_ids = search_bm25(query)['id'].tolist()
    faiss_ids = search_faiss(query)['id'].tolist()

    results_comparison.append({
        "Query": query,
        "TF–IDF vs BM25": rank_overlap(tfidf_ids, bm25_ids),
        "TF–IDF vs FAISS": rank_overlap(tfidf_ids, faiss_ids),
        "BM25 vs FAISS": rank_overlap(bm25_ids, faiss_ids)
    })

# Build the comparison DataFrame
df_comparativa = pd.DataFrame(results_comparison)
# Display the comparison table
display(df_comparativa)
# Save the comparison DataFrame
df_comparativa.to_csv("../data/arxiv/comparative_results.csv", index=False)


| | Query | TF–IDF vs BM25 | TF–IDF vs FAISS | BM25 vs FAISS |
|---|---|---|---|---|
| 0 | diphoton production cross sections | 5 | 4 | 3 |
| 1 | quantum chromodynamics | 3 | 1 | 2 |
| 2 | higgs boson decay | 6 | 3 | 5 |
| 3 | machine learning for particle physics | 2 | 1 | 2 |
| 4 | top quark production | 7 | 6 | 5 |
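The counts in the table are raw intersections of two top-10 lists. A normalized alternative is the Jaccard index, which divides the intersection by the union so that values are comparable across different `top_k` settings. The sketch below is illustrative: `jaccard` is a hypothetical helper and the document IDs are made up.

```python
def jaccard(ids_a, ids_b):
    """Jaccard index of two ID lists: |A ∩ B| / |A ∪ B| (order is ignored)."""
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Two made-up top-10 lists sharing 5 IDs: 5 / (10 + 10 - 5)
a = [f"doc{i}" for i in range(10)]
b = [f"doc{i}" for i in range(5, 15)]
print(round(jaccard(a, b), 3))  # 0.333
```

Under this measure, an overlap of 5 between two lists of 10 distinct IDs (as in the first row, TF–IDF vs BM25) corresponds to 5/15, or about 0.33.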


# 4. RAG

## 4.1. Installation and Model Loading

In [80]:
import os
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()

# Read the API key from the environment
api_key = os.getenv("GEMINI_API_KEY")

In [84]:
import google.generativeai as genai

# Configure the client with the API key loaded in the previous cell
genai.configure(api_key=api_key)

# List the available models
for m in genai.list_models():
    print(m.name)

models/embedding-gecko-001
models/gemini-1.0-pro-vision-latest
models/gemini-pro-vision
models/gemini-1.5-pro-latest
models/gemini-1.5-pro-002
models/gemini-1.5-pro
models/gemini-1.5-flash-latest
models/gemini-1.5-flash
models/gemini-1.5-flash-002
models/gemini-1.5-flash-8b
models/gemini-1.5-flash-8b-001
models/gemini-1.5-flash-8b-latest
models/gemini-2.5-pro-preview-03-25
models/gemini-2.5-flash-preview-05-20
models/gemini-2.5-flash
models/gemini-2.5-flash-lite-preview-06-17
models/gemini-2.5-pro-preview-05-06
models/gemini-2.5-pro-preview-06-05
models/gemini-2.5-pro
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-preview-image-generation
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-preview
models/gemini-2.0-pro-exp
models/gemini-2.0-pro-exp-02-05
models/gemini-exp-1206
models/gemini-2.0-flash-thin

## 4.2. RAG Function

In [103]:
# Embedding model (must match the one used to build the FAISS index)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Generative model (Gemini)
genai_model = genai.GenerativeModel("gemini-1.5-flash")


In [104]:
def generate_rag_with_gemini(query, top_k=3):
    query_proc = preprocess_text(query, "")
    
    # Encode the query with the same embedding model used for the index
    query_embed = embedding_model.encode([query_proc])
    _, top_indices = index.search(query_embed, top_k)
    results = df_sample.iloc[top_indices[0]]

    # Build the context from the retrieved titles and abstracts
    context = ""
    for _, row in results.iterrows():
        context += f"Title: {row['title']}\nAbstract: {row['abstract']}\n\n"

    # Build the prompt for Gemini
    prompt = (
        f"Context:\n{context}\n"
        f"Question: Based on the above scientific articles, summarize the main ideas and explain why they are relevant to the query: \"{query}\""
    )

    # Generate the answer with Gemini
    response = genai_model.generate_content(prompt)

    return response.text, results[['id', 'title']]
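The function above concatenates every retrieved abstract into the prompt regardless of length. If `top_k` is raised, it may be useful to cap the context size. The sketch below separates prompt construction into a helper with a character budget. `build_prompt` and `max_context_chars` are hypothetical names, and a character count is only a crude proxy for the model's token limit.

```python
def build_prompt(docs, query, max_context_chars=4000):
    """Assemble a RAG prompt, stopping before the context exceeds a character budget."""
    parts = []
    used = 0
    for doc in docs:
        block = f"Title: {doc['title']}\nAbstract: {doc['abstract']}\n\n"
        if used + len(block) > max_context_chars:
            break  # drop this and all remaining, lower-ranked documents
        parts.append(block)
        used += len(block)
    context = "".join(parts)
    return (
        f"Context:\n{context}\n"
        f"Question: Based on the above scientific articles, summarize the main ideas "
        f"and explain why they are relevant to the query: \"{query}\""
    )

# Made-up documents; with a budget of 100 characters only the first one fits
docs = [
    {"title": "A", "abstract": "x" * 50},
    {"title": "B", "abstract": "y" * 50},
]
prompt = build_prompt(docs, "toy query", max_context_chars=100)
print("Title: A" in prompt, "Title: B" in prompt)  # True False
```

Keeping prompt construction in its own function also makes it testable without calling the Gemini API.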


In [105]:
respuesta, docs = generate_rag_with_gemini("machine learning for particle physics")
print("Generated answer:")
print(respuesta)
display(docs)

Generated answer:
The three abstracts highlight different applications of machine learning (ML) within particle physics, demonstrating its growing relevance:

* **Abstract 1:** Focuses on using neural networks to improve Monte Carlo integration in phase space, a crucial step in simulating particle collisions. This is relevant because efficient phase space integration is essential for accurately predicting experimental outcomes and comparing them to theoretical models.  The ML approach offers a speed and efficiency advantage over traditional methods like VEGAS, particularly for complex scenarios.

* **Abstract 2:** Provides a broad overview of ML applications in neutrino physics.  This is significant because neutrinos are notoriously difficult to detect and analyze, and ML techniques are proving vital for tackling challenges like background noise reduction, signal extraction, and dealing with limited statistics. This abstract emphasizes the importance of understanding both the benefit

| | id | title |
|---|---|---|
| 10453 | 1810.11509 | Neural Network-Based Approach to Phase Space I... |
| 13213 | 2008.01242 | A Review on Machine Learning for Neutrino Expe... |
| 13093 | 2007.04506 | Bayesian Neural Networks for Fast SUSY Predict... |


# 5. Evaluation

## 5.1. Comparing the documents retrieved by the three models

In [106]:
def compare_models(query, top_k=10):
    tfidf_docs = search_tfidf(query, top_k)['id'].tolist()
    bm25_docs = search_bm25(query, top_k)['id'].tolist()
    faiss_docs = search_faiss(query, top_k)['id'].tolist()

    common_docs = set(tfidf_docs) & set(bm25_docs) & set(faiss_docs)
    
    print("Query:", query)
    print(f"\nDocuments common to all three models ({len(common_docs)}):")
    print(common_docs)

    print("\nRanked IDs:")
    print("TF–IDF:", tfidf_docs)
    print("BM25  :", bm25_docs)
    print("FAISS :", faiss_docs)

    return tfidf_docs, bm25_docs, faiss_docs

## 5.2. Measuring similarity between rankings

In [107]:
def count_rank_overlap(a, b):
    return len(set(a) & set(b))

def evaluate_rank_overlap(query):
    tfidf_docs, bm25_docs, faiss_docs = compare_models(query)

    print("\nTop-10 overlap:")
    print(f"TF–IDF vs BM25 : {count_rank_overlap(tfidf_docs, bm25_docs)}")
    print(f"TF–IDF vs FAISS: {count_rank_overlap(tfidf_docs, faiss_docs)}")
    print(f"BM25   vs FAISS: {count_rank_overlap(bm25_docs, faiss_docs)}")
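`count_rank_overlap` treats each ranking as a set, so two lists with the same documents in opposite order score the same as two identical lists. A rank-aware complement is the average overlap: the mean, over depths `d = 1..k`, of the fraction of shared documents in the two top-`d` prefixes. The helper below is a sketch with a hypothetical name and made-up IDs.

```python
def average_overlap(rank_a, rank_b, depth=10):
    """Mean of |A[:d] ∩ B[:d]| / d for d = 1..depth; 1.0 means identical rankings."""
    total = 0.0
    for d in range(1, depth + 1):
        total += len(set(rank_a[:d]) & set(rank_b[:d])) / d
    return total / depth

same_order = average_overlap(["a", "b", "c"], ["a", "b", "c"], depth=3)
reversed_order = average_overlap(["a", "b", "c"], ["c", "b", "a"], depth=3)
print(same_order, round(reversed_order, 3))  # 1.0 0.5
```

Agreement near the top of the rankings counts more than agreement near the bottom, because early matches contribute to every subsequent prefix.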

In [108]:
print(f"embedding_model type: {type(embedding_model)}")
print(f"genai_model type: {type(genai_model)}")


embedding_model type: <class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>
genai_model type: <class 'google.generativeai.generative_models.GenerativeModel'>


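## 5.3. Evaluating the RAG-generated answer

In [109]:
def evaluate_rag(query):
    response, supporting_docs = generate_rag_with_gemini(query)

    print("RAG-generated answer:")
    print(response)

    print("\nDocuments used:")
    display(supporting_docs)


In [114]:
# Evaluate RAG. The query is set explicitly instead of relying on a leftover
# variable; it is assumed to be the same query as in section 4.2, since the
# retrieved documents below are identical.
query = "machine learning for particle physics"
evaluate_rag(query)

RAG-generated answer: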
These three abstracts highlight different applications of machine learning (ML) within particle physics, demonstrating its growing importance in addressing complex computational challenges:

* **Abstract 1:** Focuses on using neural networks to improve Monte Carlo simulations for phase space integration.  This is crucial for accurately calculating probabilities of particle interactions and decay rates, a fundamental aspect of particle physics analyses.  The relevance to the query is the demonstration of ML's ability to significantly speed up and improve the efficiency of a core computational task in the field.

* **Abstract 2:** Provides a broader overview of ML's impact on neutrino physics.  It emphasizes how ML tackles challenges like large backgrounds, elusive signals, and limited data – common problems in experimental particle physics.  The relevance is in showcasing the widespread adoption of ML across various experimental aspects of neutrino physics, a

| | id | title |
|---|---|---|
| 10453 | 1810.11509 | Neural Network-Based Approach to Phase Space I... |
| 13213 | 2008.01242 | A Review on Machine Learning for Neutrino Expe... |
| 13093 | 2007.04506 | Bayesian Neural Networks for Fast SUSY Predict... |
