# TFM - Comparison of retrieval techniques on FiQA (exploratory phase)

This notebook records the first steps of the experimental work with FiQA (part of the BEIR benchmark). It prepares the dataset for evaluating BM25, dense retrieval with embeddings, and a hybrid method.

---

## Contents

1. Installing dependencies  
2. Mounting Google Drive  
3. Downloading and saving the FiQA dataset  
4. Conversion to DataFrames and exploratory analysis  
5. BM25 implementation  
6. Dense retrieval with embeddings  
7. Hybrid method  
8. Evaluation and comparison  
9. Conclusions of the exploratory phase

---



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
import pandas as pd
import json
import numpy as np

In [4]:
base_path = "/content/drive/MyDrive/TFM_QA_RAG"
os.makedirs(base_path, exist_ok=True)  # create the project folder if it is missing
print(f"Folder ready at: {base_path}")

Folder ready at: /content/drive/MyDrive/TFM_QA_RAG


In [None]:
# Load corpus.jsonl
#corpus_path = f"{base_path}/corpus.jsonl"
#with open(corpus_path, "r", encoding="utf-8") as f:
#    corpus = [json.loads(line) for line in f]

#df_corpus = pd.DataFrame(corpus)

# Load queries.jsonl
#queries_path = f"{base_path}/queries.jsonl"
#with open(queries_path, "r", encoding="utf-8") as f:
#    queries = [json.loads(line) for line in f]

#df_queries = pd.DataFrame(queries)

# Load qrels/test.tsv
# Note: if the TSV has its own header row, names= without header=0 keeps it
# as the first data row; the reload cell further down checks for that row and drops it.
#qrels_path = f"{base_path}/qrels/test.tsv"
#df_qrels = pd.read_csv(qrels_path, sep="\t", names=["query_id", "doc_id", "relevance"])

The three components of the FiQA dataset are loaded from the original files: `corpus.jsonl`, `queries.jsonl` and `qrels/test.tsv`. Each one is converted to a DataFrame and is then ready to be explored or saved.
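
As a small illustration of the loading step described above, the sketch below parses a JSONL corpus and a tab-separated qrels file from in-memory strings. The rows are made up for the example and the `*_demo` names are hypothetical. Passing `header=0` together with `names=` is one way to rename the qrels columns without keeping the file's header row as data; it assumes the file starts with a header line.

```python
import io
import json

import pandas as pd

# In-memory stand-ins for corpus.jsonl and qrels/test.tsv (made-up rows).
corpus_jsonl = io.StringIO(
    '{"_id": "3", "title": "", "text": "first passage", "metadata": {}}\n'
    '{"_id": "31", "title": "", "text": "second passage", "metadata": {}}\n'
)
qrels_tsv = io.StringIO("query-id\tcorpus-id\tscore\n0\t3\t1\n4\t31\t1\n")

# One JSON object per line; IDs stay strings because they are JSON strings.
records = [json.loads(line) for line in corpus_jsonl if line.strip()]
df_corpus_demo = pd.DataFrame(records)

# header=0 consumes the file's own header row; names= supplies the new column names.
df_qrels_demo = pd.read_csv(
    qrels_tsv,
    sep="\t",
    header=0,
    names=["query_id", "doc_id", "relevance"],
    dtype={"query_id": str, "doc_id": str},
)

print(df_corpus_demo[["_id", "text"]])
print(df_qrels_demo)
```

Reading the IDs as strings from the start avoids a later cast and keeps query, document and qrels identifiers directly comparable.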

In [None]:
# Save the DataFrames as CSV in Google Drive

#df_corpus.to_csv(f"{base_path}/fiqa_corpus.csv", index=False)
#df_queries.to_csv(f"{base_path}/fiqa_queries.csv", index=False)
#df_qrels.to_csv(f"{base_path}/fiqa_qrels.csv", index=False)

#print("CSV files saved to Google Drive.")

CSV files saved to Google Drive.


---
## 4. Conversion to DataFrames and exploratory analysis


In [6]:
print("df_queries columns:", df_queries.columns)
print(df_queries.head())

print("df_corpus columns:", df_corpus.columns)
print(df_corpus.head())

print("df_qrels columns:", df_qrels.columns)
print(df_qrels.head())

df_queries columns: Index(['_id', 'text', 'metadata'], dtype='object')
   _id                                               text metadata
0    0  What is considered a business expense on a bus...       {}
1    4  Business Expense - Car Insurance Deductible Fo...       {}
2    5                     Starting a new online business       {}
3    6            “Business day” and “due date” for bills       {}
4    7  New business owner - How do taxes work for the...       {}
df_corpus columns: Index(['_id', 'title', 'text', 'metadata'], dtype='object')
   _id  title                                               text metadata
0    3    NaN  I'm not saying I don't like the idea of on-the...       {}
1   31    NaN  So nothing preventing false ratings besides ad...       {}
2   56    NaN  You can never use a health FSA for individual ...       {}
3   59    NaN  Samsung created the LCD and other flat screen ...       {}
4   63    NaN  Here are the SEC requirements: The federal sec...       {}
df_q
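
Beyond printing the first rows, a few summary statistics can make the exploratory step more informative. The sketch below is only an illustration on made-up data: `demo_corpus` and `demo_qrels` are hypothetical stand-ins that reuse the column names shown above, and whitespace splitting is a rough proxy for token counts.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins with the same column names as the notebook's DataFrames.
demo_corpus = pd.DataFrame(
    {
        "_id": ["3", "31", "56"],
        "title": [np.nan, np.nan, "FSA rules"],
        "text": ["one two three", "four five", "six"],
    }
)
demo_qrels = pd.DataFrame(
    {
        "query_id": ["0", "0", "4"],
        "doc_id": ["3", "31", "56"],
        "relevance": [1, 1, 1],
    }
)

# Approximate document length as the number of whitespace-separated tokens.
doc_lengths = demo_corpus["text"].fillna("").str.split().str.len()

# Number of distinct judged documents per query.
qrels_per_query = demo_qrels.groupby("query_id")["doc_id"].nunique()

# Fraction of documents without a title.
missing_title_share = demo_corpus["title"].isna().mean()

print(doc_lengths.describe())
print(qrels_per_query.describe())
print(f"Share of documents without title: {missing_title_share:.2f}")
```

On the real data, the same three quantities would be computed from `df_corpus` and `df_qrels`.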

In [7]:
# Load the CSV files saved earlier
df_queries = pd.read_csv(f"{base_path}/fiqa_queries.csv")
df_corpus = pd.read_csv(f"{base_path}/fiqa_corpus.csv")
df_qrels = pd.read_csv(f"{base_path}/fiqa_qrels.csv")

# Drop the original TSV header if it was stored as the first data row
if df_qrels.iloc[0].tolist() == ['query-id', 'corpus-id', 'score']:
    df_qrels = df_qrels.iloc[1:].copy()

# Cast all ID columns to str so that queries, docs and qrels compare consistently
df_queries['_id'] = df_queries['_id'].astype(str)
df_corpus['_id'] = df_corpus['_id'].astype(str)
df_qrels['query_id'] = df_qrels['query_id'].astype(str)
df_qrels['doc_id'] = df_qrels['doc_id'].astype(str)

# Sampling parameters
N_QUERIES = 300
N_DOCS = 3000
RANDOM_STATE = 42

# Random sampling (queries and docs are drawn independently of each other)
subset_queries = df_queries.sample(n=N_QUERIES, random_state=RANDOM_STATE)
subset_docs = df_corpus.sample(n=N_DOCS, random_state=RANDOM_STATE)

q_ids = set(subset_queries['_id'])
d_ids = set(subset_docs['_id'])

# Keep only the qrels whose query and doc are both in the subset
subset_qrels = df_qrels[
    df_qrels['query_id'].isin(q_ids) & df_qrels['doc_id'].isin(d_ids)
].copy()

print(f"Queries in subset: {len(subset_queries)}")
print(f"Docs in subset: {len(subset_docs)}")
print(f"Qrels in subset: {len(subset_qrels)}")

# Save
subset_queries.to_csv(f"{base_path}/subset_queries.csv", index=False)
subset_docs.to_csv(f"{base_path}/subset_corpus.csv", index=False)
subset_qrels.to_csv(f"{base_path}/subset_qrels.csv", index=False)

print("Subset files saved to Drive.")

Queries in subset: 300
Docs in subset: 3000
Qrels in subset: 4
Subset files saved to Drive.
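
Only 4 relevance judgments survive the filtering above. The most likely reason is that queries and documents are sampled independently, so a sampled query's relevant documents rarely land in the sampled corpus. The sketch below shows one possible alternative, which is not the sampling used in this notebook: draw queries only from those that have judgments, keep all of their relevant documents, and fill the rest of the document budget with random distractors. The function `sample_subset` and the toy data are hypothetical.

```python
import pandas as pd


def sample_subset(df_queries, df_corpus, df_qrels, n_queries, n_docs, random_state=42):
    """Qrels-aware subsampling (illustrative sketch, not the notebook's method)."""
    # Sample only among queries that have at least one judgment.
    judged = df_queries[df_queries["_id"].isin(df_qrels["query_id"])]
    sub_q = judged.sample(n=min(n_queries, len(judged)), random_state=random_state)

    # Keep the judgments of the sampled queries whose document exists in the corpus.
    sub_qrels = df_qrels[
        df_qrels["query_id"].isin(sub_q["_id"]) & df_qrels["doc_id"].isin(df_corpus["_id"])
    ].copy()

    # Always include the relevant documents, then pad with random distractors.
    is_relevant = df_corpus["_id"].isin(sub_qrels["doc_id"])
    relevant = df_corpus[is_relevant]
    others = df_corpus[~is_relevant]
    n_extra = max(0, min(n_docs - len(relevant), len(others)))
    distractors = others.sample(n=n_extra, random_state=random_state)

    sub_docs = pd.concat([relevant, distractors], ignore_index=True)
    return sub_q.reset_index(drop=True), sub_docs, sub_qrels.reset_index(drop=True)


# Toy data: 10 queries, 20 docs, and judgments for the first 5 queries only.
toy_queries = pd.DataFrame({"_id": [f"q{i}" for i in range(10)]})
toy_corpus = pd.DataFrame({"_id": [f"d{i}" for i in range(20)]})
toy_qrels = pd.DataFrame(
    {
        "query_id": [f"q{i}" for i in range(5)],
        "doc_id": [f"d{i}" for i in range(5)],
        "relevance": [1] * 5,
    }
)

sub_q, sub_docs, sub_qrels = sample_subset(toy_queries, toy_corpus, toy_qrels, n_queries=3, n_docs=8)
print(len(sub_q), len(sub_docs), len(sub_qrels))
```

With this scheme every sampled query keeps its judgments, which makes retrieval metrics on the subset meaningful. The trade-off is that the subset has a higher density of relevant documents than the full corpus, so absolute scores are not comparable with full-benchmark results.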
