<a href="https://colab.research.google.com/github/ecazar/NoteBooks/blob/main/cazar_examen2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0) Dataset Import

In [2]:
import kagglehub
path = kagglehub.dataset_download("Cornell-University/arxiv")

Using Colab cache for faster access to the 'arxiv' dataset.


In [3]:
path

'/kaggle/input/arxiv'

In [11]:
import pandas as pd
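# Read only the first 10,000 records; the chunked reader avoids loading the whole snapshot.
# dtype={'id': str} keeps arXiv IDs such as '0704.0001' as strings
# (otherwise pandas may infer them as floats and drop the leading zero).
reader = pd.read_json(path+'/arxiv-metadata-oai-snapshot.json', lines=True, chunksize=10000, dtype={'id': str})
for chunk in reader:
    df = chunk
    break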
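
The cell above keeps only the first chunk of the snapshot. As a rough illustration of the same idea without pandas, the sketch below takes the first `n` records from a JSON Lines stream. The `read_first_records` helper and the two sample records are hypothetical and only stand in for the real file.

```python
import io
import json
from itertools import islice

def read_first_records(stream, n):
    """Parse at most n JSON Lines records from a text stream."""
    records = []
    for line in islice(stream, n):
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records

# Hypothetical two-line sample in the JSON Lines layout (one object per line).
sample = io.StringIO(
    '{"id": "0704.0001", "title": "First paper"}\n'
    '{"id": "0704.0002", "title": "Second paper"}\n'
)
first = read_first_records(sample, 1)
print(first[0]["id"])  # the ID stays a string, leading zero included
```

Because only the requested lines are consumed, the rest of the stream is never parsed, which is the property the chunked reader relies on.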

In [12]:
df_docs = df[['id', 'title', 'categories', 'abstract']].copy()
df_docs['text'] = df_docs['title'] + ' ' + df_docs['abstract']
df_docs = df_docs.rename(columns={'id': 'doc_id'})

In [13]:
df_docs.columns

Index(['doc_id', 'title', 'categories', 'abstract', 'text'], dtype='object')

## 1) Data Preprocessing

In [14]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import re

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def preprocess(text):
    text = str(text).lower()
    tokens = re.findall(r'\b[a-z]+\b', text)
    clean_tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(clean_tokens)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


We preprocess the text.

In [15]:
df_docs['clean_text'] = df_docs['text'].apply(preprocess)
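
To make the effect of the tokenization step concrete, here is a minimal sketch that reuses the same regular expression but leaves out stemming. `SAMPLE_STOP_WORDS` and `tokenize_and_filter` are hypothetical names introduced for illustration; the notebook itself uses NLTK's English stop-word list and the Snowball stemmer.

```python
import re

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list instead.
SAMPLE_STOP_WORDS = {"the", "of", "for", "and", "a", "in"}

def tokenize_and_filter(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase, keep purely alphabetic tokens, and drop stop words (no stemming)."""
    tokens = re.findall(r"\b[a-z]+\b", str(text).lower())
    return [t for t in tokens if t not in stop_words]

print(tokenize_and_filter("The Role of BERT-2 in NLP"))
print(tokenize_and_filter("3D object detection"))
```

Note that the pattern drops any token that mixes letters and digits (for example `3d`), since the word boundaries must enclose letters only.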

## 2) Embedding Representation

In [19]:
!pip install faiss-cpu



In [20]:
!pip install sentence-transformers



In [21]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load an efficient model
model = SentenceTransformer('all-MiniLM-L6-v2')


In [22]:
# Generate embeddings
doc_embeddings = model.encode(df_docs['clean_text'].tolist(), show_progress_bar=True)


In [23]:
# Create the FAISS index
d = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(doc_embeddings)
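
`IndexFlatL2` performs an exact, exhaustive search and reports squared Euclidean distances. The sketch below mimics that behaviour in plain Python on three toy vectors; `flat_l2_search` is a hypothetical helper written for illustration, not part of FAISS.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search by squared Euclidean distance."""
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first; ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-dimensional "embeddings" (assumed values, for illustration only).
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
distances, indices = flat_l2_search(vectors, [0.9, 0.1], k=2)
print(indices)
```

The cost grows linearly with the number of stored vectors, which is acceptable for the 10,000 documents indexed here.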

## 3) Initial Retrieval (First-Stage Retrieval)

In [24]:
def search_initial(query, k=50):
    query_clean = preprocess(query)
    query_emb = model.encode([query_clean])
    D, I = index.search(query_emb, k)
    results = []
    for idx in I[0]:
        if 0 <= idx < len(df_docs):  # FAISS pads missing results with -1
            item = df_docs.iloc[idx].to_dict()
            results.append(item)
    return results

## 4) Result Re-ranking

In [25]:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


In [26]:
def rerank_results(query, initial_results):
    if not initial_results:
        return []

    # Build (query, document) pairs
    pairs = [[query, doc['text']] for doc in initial_results]

    # Predict relevance scores
    scores = cross_encoder.predict(pairs)

    # Attach each score to its document
    for i, doc in enumerate(initial_results):
        doc['score'] = scores[i]

    # Sort in descending order of score
    ranked_results = sorted(initial_results, key=lambda x: x['score'], reverse=True)
    return ranked_results
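
The re-ranking pattern itself does not depend on the cross-encoder. A minimal sketch, assuming a hypothetical word-overlap scorer (`overlap_score`) in place of the model, shows the score-then-sort step; unlike the function above, `rerank_with` works on copies and leaves the input dictionaries untouched.

```python
def rerank_with(score_fn, query, docs):
    """Return copies of docs sorted by score_fn(query, text), highest first."""
    scored = [dict(doc, score=score_fn(query, doc["text"])) for doc in docs]
    return sorted(scored, key=lambda d: d["score"], reverse=True)

def overlap_score(query, text):
    """Hypothetical stand-in scorer: number of query words present in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# Toy documents (assumed content, for illustration only).
docs = [
    {"doc_id": "a", "text": "kalman filters for data assimilation"},
    {"doc_id": "b", "text": "language models based on attention"},
]
ranked = rerank_with(overlap_score, "attention models for language", docs)
print([d["doc_id"] for d in ranked])
```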

## 5) Query Simulation

We create the topics.

In [27]:
topics = [
    {
        'qid': '1',
        'query': 'transformer models for natural language processing',
        'keywords': ['transformer', 'nlp', 'language', 'attention']
    },
    {
        'qid': '2',
        'query': 'graph neural networks for social network analysis',
        'keywords': ['graph', 'gnn', 'network', 'social']
    },
    {
        'qid': '3',
        'query': 'federated learning privacy and security',
        'keywords': ['federated', 'privacy', 'security', 'distributed']
    },
    {
        'qid': '4',
        'query': 'blockchain consensus protocols and scalability',
        'keywords': ['blockchain', 'consensus', 'scalability']
    },
    {
        'qid': '5',
        'query': 'computer vision object detection using deep learning',
        'keywords': ['vision', 'object', 'detection', 'cnn']
    }
]


In [28]:
def get_ground_truth(doc_df, keywords):
    relevant_ids = set()
    for idx, row in doc_df.iterrows():
        # If any keyword appears in the title (as a substring), the document counts as relevant
        if any(k in row['title'].lower() for k in keywords):
            relevant_ids.add(row['doc_id'])
    return relevant_ids
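
Substring matching also fires inside longer words, so `'graph'` matches a title containing "paragraph". One possible refinement, sketched here under the assumption that whole-word matches are what is wanted, uses word boundaries; `title_matches` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

def title_matches(title, keywords):
    """True if any keyword occurs in the title as a whole word (case-insensitive)."""
    if not keywords:
        return False
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None

print(title_matches("A paragraph on typography", ["graph"]))  # substring only
print(title_matches("Graph colouring bounds", ["graph"]))     # whole word
```

The trade-off is that whole-word matching misses inflected forms such as "graphs", so the keyword list would need to include them explicitly.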

## 6) System Evaluation

In [32]:
print("Query evaluation on the arXiv dataset\n")
print(f"{'QID':<3} {'Query':<40}  P@10_ini  R@10_ini  P@10_rnk  R@10_rnk")
print("-" * 85)

results_metrics = []

for topic in topics:
    qid = topic['qid']
    query = topic['query']

    # Relevant documents estimated from the keywords
    relevant_docs = get_ground_truth(df_docs, topic['keywords'])
    total_relevant = len(relevant_docs)

    # Skip the query if there are no relevant documents
    if total_relevant == 0:
        continue

    # Initial retrieval using embeddings
    initial_results = search_initial(query, k=50)
    top10_initial = [doc['doc_id'] for doc in initial_results[:10]]

    # Re-rank the results with the cross-encoder
    reranked_results = rerank_results(query, initial_results)
    top10_reranked = [doc['doc_id'] for doc in reranked_results[:10]]

    # Count hits
    hits_initial = len(set(top10_initial) & relevant_docs)
    hits_reranked = len(set(top10_reranked) & relevant_docs)

    # Metrics
    p10_ini = hits_initial / 10
    r10_ini = hits_initial / total_relevant

    p10_rnk = hits_reranked / 10
    r10_rnk = hits_reranked / total_relevant

    # Print results
    print(f"{qid:<3} {query[:38]:<40}  "
          f"{p10_ini:.2f}      {r10_ini:.2f}      "
          f"{p10_rnk:.2f}      {r10_rnk:.2f}")

    results_metrics.append({
        'query': query,
        'p10_initial': p10_ini,
        'r10_initial': r10_ini,
        'p10_rerank': p10_rnk,
        'r10_rerank': r10_rnk
    })


Query evaluation on the arXiv dataset

QID Query                                     P@10_ini  R@10_ini  P@10_rnk  R@10_rnk
-------------------------------------------------------------------------------------
1   transformer models for natural languag    0.00      0.00      0.00      0.00
2   graph neural networks for social netwo    1.00      0.02      1.00      0.02
3   federated learning privacy and securit    0.20      0.09      0.20      0.09
4   blockchain consensus protocols and sca    0.20      0.50      0.20      0.50
5   computer vision object detection using    0.50      0.05      0.50      0.05
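
The two metrics in the table can be factored into small helpers. This is a minimal sketch with hypothetical document IDs: 2 of the top 10 results are relevant out of 4 relevant documents overall, which reproduces the arithmetic behind a row reading 0.20 and 0.50.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k slots filled with relevant documents."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant documents found in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical IDs: 2 of the top 10 are relevant, out of 4 relevant documents overall.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d2", "d7", "d11", "d12"}
print(precision_at_k(ranked, relevant), recall_at_k(ranked, relevant))
```

Because recall divides by the total number of relevant documents, it is capped at 10 divided by that total, so broad keyword sets that mark many titles as relevant keep R@10 low even when P@10 is high.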


## 7) Results Analysis

In [33]:
sample_topic = topics[0]
query = sample_topic['query']

print("\nQualitative analysis of results")
print(f"Evaluated query: {query}\n")

initial = search_initial(query, k=10)
reranked = rerank_results(query, initial)

print("Top 3 initial results:")
for i, doc in enumerate(initial[:3], start=1):
    print(f"{i}. {doc['title']}")

print("\nTop 3 results after re-ranking:")
for i, doc in enumerate(reranked[:3], start=1):
    print(f"{i}. {doc['title']}  (score={doc['score']:.3f})")



Qualitative analysis of results
Evaluated query: transformer models for natural language processing

Top 3 initial results:
1. Morphing Ensemble Kalman Filters
2. Equivalence of different descriptions for $\eta$ Particle in Simplest
  Little Higgs Model
3. IDF revisited: A simple new derivation within the Robertson-Sp\"arck
  Jones probabilistic model

Top 3 results after re-ranking:
1. Morphing Ensemble Kalman Filters  (score=-5.229)
2. On the generalized Freedman-Townsend model  (score=-9.532)
3. Gauge transformations and symmetries of integrable systems  (score=-9.911)
