# GFS Setup: Creating the File Search Store

This notebook sets up Google Generative File Search (GFS) for the RAG comparison.

**Objectives**:
1. Initialize the GFS client with the API key
2. Create the file search store
3. Upload documents from `data/raw/`
4. Verify that indexing has completed
5. Run basic test queries

In [1]:
import sys
from pathlib import Path
import json
import time

# Add src to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / "src"))

# Force reload of modules
import importlib
if 'gfs_client' in sys.modules:
    importlib.reload(sys.modules['gfs_client'])
if 'data_loader' in sys.modules:
    importlib.reload(sys.modules['data_loader'])
if 'utils' in sys.modules:
    importlib.reload(sys.modules['utils'])

from gfs_client import GFSClient
from data_loader import scan_documents, check_gfs_compatibility
from utils import load_api_key

import polars as pl

print("Imports successful")


Imports successful


## 1. Initialize the GFS Client

In [2]:
# Load API key from .env
api_key = load_api_key("GOOGLE_API_KEY", str(project_root / ".env"))

# Initialize client
gfs = GFSClient(api_key=api_key, model_id="gemini-2.5-flash")

print("GFS client initialized")

GFS client initialized


## 2. Check Existing Stores

In [3]:
# List existing stores
existing_stores = gfs.list_stores()

print(f"Existing stores: {len(existing_stores)}")
for store in existing_stores:
    print(f"  - {store.display_name}: {store.name}")

Existing stores: 19
  - RAG Comparison Document Store: fileSearchStores/rag-comparison-document-sto-bl56voi03s98
  - RAG Comparison Document Store: fileSearchStores/rag-comparison-document-sto-vjeb3eavqtut
  - RAG Comparison Document Store: fileSearchStores/rag-comparison-document-sto-v68rdv4540e0
  - RAG Comparison Document Store: fileSearchStores/rag-comparison-document-sto-7040d5gdzax3
  - RAG Comparison Document Store: fileSearchStores/rag-comparison-document-sto-apaqaior2z2o
  - RAG Comparison Document Store: fileSearchStores/rag-comparison-document-sto-mmidib7ivotz
  - RAG Comparison Document Store: fileSearchStores/rag-comparison-document-sto-wb9kd8me89wn
  - RAG Comparison Document Store: fileSearchStores/rag-comparison-document-sto-5ail0a2oviht
  - RAG Comparison Document Store: fileSearchStores/rag-comparison-document-sto-blmaj7gsmtit
  - RAG Comparison Document Store: fileSearchStores/rag-comparison-document-sto-0lorkz3md71l
  - RAG Comparison Document Store: fileSearchStore

## 3. Create a New File Search Store

In [4]:
# Use existing store or create new one
store_display_name = "RAG Comparison Document Store"

# Check if we already have stores
if len(existing_stores) > 0:
    # Use the first existing store
    store = existing_stores[0]
    print(f"Using existing store: {store.display_name}")
    print(f"Store name: {store.name}")
else:
    # Create new store
    store = gfs.create_file_search_store(display_name=store_display_name)
    print(f"Created store: {store.display_name}")
    print(f"Store name: {store.name}")

# Save store metadata
store_metadata = {
    "display_name": store.display_name,
    "store_name": store.name,
}

# Create directory if it doesn't exist
metadata_dir = project_root / "models" / "gfs_stores"
metadata_dir.mkdir(parents=True, exist_ok=True)

metadata_path = metadata_dir / "metadata.json"
with open(metadata_path, "w") as f:
    json.dump(store_metadata, f, indent=2)

print(f"\nMetadata saved to: {metadata_path}")


Using existing store: RAG Comparison Document Store
Store name: fileSearchStores/rag-comparison-document-sto-bl56voi03s98

Metadata saved to: /Users/ggoni/docencia-repos/rag-with-gfs/models/gfs_stores/metadata.json
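
The cell above reuses whichever store happens to be first in the list, and the listing in section 2 shows many stores sharing the same display name. A minimal sketch of selecting by display name instead is shown below. The helper `pick_store` and the sample store names are hypothetical and not part of `gfs_client`; the only assumption about real store objects is that they expose `display_name` and `name`, as the cells above already use.

```python
from types import SimpleNamespace


def pick_store(stores, display_name):
    """Return the first store whose display_name matches, or None if there is none."""
    matches = [s for s in stores if s.display_name == display_name]
    return matches[0] if matches else None


# Stand-in objects for illustration; real entries would come from gfs.list_stores().
sample_stores = [
    SimpleNamespace(display_name="Other Store", name="fileSearchStores/other-1"),
    SimpleNamespace(display_name="RAG Comparison Document Store", name="fileSearchStores/rag-1"),
    SimpleNamespace(display_name="RAG Comparison Document Store", name="fileSearchStores/rag-2"),
]

chosen = pick_store(sample_stores, "RAG Comparison Document Store")
print(chosen.name if chosen else "no matching store; create one")
```

Returning `None` when nothing matches lets the caller fall through to the existing `create_file_search_store` branch, so an unrelated store is never reused by accident.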


## 4. Scan and Upload Documents

In [5]:
# Scan documents
data_dir = project_root / "data" / "raw"
df = scan_documents(data_dir)

print(f"Total files found: {len(df)}")

if len(df) == 0:
    print("\nNo files found in data/raw/")
    print("Add documents to continue.")
else:
    # Check compatibility
    df_compat = check_gfs_compatibility(df)
    compatible_files = df_compat.filter(pl.col("gfs_compatible"))
    
    print(f"Compatible files: {len(compatible_files)}")
    print(f"Incompatible files: {len(df) - len(compatible_files)}")

Total files found: 4
Compatible files: 4
Incompatible files: 0


In [6]:
# Upload compatible files to store
if len(df) > 0 and len(compatible_files) > 0:
    upload_results = []
    
    for i, row in enumerate(compatible_files.iter_rows(named=True)):
        file_path = Path(row["file_path"])
        print(f"\nUploading {i+1}/{len(compatible_files)}: {file_path.name}")
        
        try:
            start_time = time.time()
            operation = gfs.upload_to_store(
                store_name=store.name,
                file_path=file_path,
                wait_for_completion=True
            )
            elapsed = time.time() - start_time
            
            upload_results.append({
                "file_name": file_path.name,
                "status": "success",
                "upload_time_seconds": elapsed
            })
            
            print(f"  ✓ Uploaded successfully ({elapsed:.1f}s)")
            
        except Exception as e:
            upload_results.append({
                "file_name": file_path.name,
                "status": "failed",
                "error": str(e)
            })
            print(f"  ✗ Failed: {e}")
    
    # Save upload results
    results_path = project_root / "models" / "gfs_stores" / "upload_results.json"
    with open(results_path, "w") as f:
        json.dump(upload_results, f, indent=2)
    
    print(f"\nUpload results saved to: {results_path}")
else:
    print("No compatible files to upload")


Uploading 1/4: data_science_workflow.txt
  ✓ Uploaded successfully (6.0s)

Uploading 2/4: nlp_overview.txt
  ✓ Uploaded successfully (5.7s)

Uploading 3/4: deep_learning_guide.txt
  ✓ Uploaded successfully (5.3s)

Uploading 4/4: ml_fundamentals.txt
  ✓ Uploaded successfully (5.6s)

Upload results saved to: /Users/ggoni/docencia-repos/rag-with-gfs/models/gfs_stores/upload_results.json
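
The records written to `upload_results.json` can be condensed into a short summary before moving on. The sketch below assumes the record shape built in the upload cell (`file_name`, `status`, and either `upload_time_seconds` or `error`). The helper `summarize_uploads` is hypothetical, the timings are the rounded values printed above, and the failed record is made up purely to exercise the failure path.

```python
def summarize_uploads(results):
    """Count successes, list failed file names, and average the upload time."""
    ok = [r for r in results if r["status"] == "success"]
    failed = [r["file_name"] for r in results if r["status"] != "success"]
    mean_seconds = (
        sum(r["upload_time_seconds"] for r in ok) / len(ok) if ok else None
    )
    return {
        "succeeded": len(ok),
        "failed": failed,
        "mean_upload_seconds": mean_seconds,
    }


# Illustrative records in the same shape as upload_results.
sample_results = [
    {"file_name": "data_science_workflow.txt", "status": "success", "upload_time_seconds": 6.0},
    {"file_name": "nlp_overview.txt", "status": "success", "upload_time_seconds": 5.7},
    {"file_name": "deep_learning_guide.txt", "status": "success", "upload_time_seconds": 5.3},
    {"file_name": "ml_fundamentals.txt", "status": "success", "upload_time_seconds": 5.6},
    {"file_name": "example_broken.txt", "status": "failed", "error": "hypothetical error"},
]

summary = summarize_uploads(sample_results)
print(summary)
```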


## 5. Check Store Status

In [7]:
# Get updated store info
store_info = gfs.get_store_info(store.name)

print("Store Status:")
print(f"  Display name: {store_info.display_name}")

# Handle optional attributes safely
if hasattr(store_info, 'size_bytes') and store_info.size_bytes:
    print(f"  Size: {store_info.size_bytes / (1024*1024):.2f} MB")
else:
    print("  Size: N/A")

if hasattr(store_info, 'active_documents_count'):
    print(f"  Active documents: {store_info.active_documents_count}")
else:
    print("  Active documents: N/A")

if hasattr(store_info, 'pending_documents_count'):
    print(f"  Pending documents: {store_info.pending_documents_count}")
else:
    print("  Pending documents: N/A")

if hasattr(store_info, 'failed_documents_count'):
    print(f"  Failed documents: {store_info.failed_documents_count}")
else:
    print("  Failed documents: N/A")

if hasattr(store_info, 'update_time') and store_info.update_time:
    print(f"  Last update: {store_info.update_time}")
else:
    print("  Last update: N/A")

Store Status:
  Display name: RAG Comparison Document Store
  Size: 0.12 MB
  Active documents: 20
  Pending documents: None
  Failed documents: None
  Last update: 2025-12-01 00:02:12.733951+00:00
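
The status cell reads the counters once. If indexing might still be running, one option is to poll until no documents are pending. The sketch below is a generic polling loop, not part of `gfs_client`: `wait_until_indexed` is a hypothetical helper, and it assumes that a `pending_documents_count` of `None` (as printed above) or a missing attribute means nothing is pending. In the notebook, `get_info` could be `lambda: gfs.get_store_info(store.name)`.

```python
import time
from types import SimpleNamespace


def wait_until_indexed(get_info, timeout_s=120.0, poll_s=2.0, sleep=time.sleep):
    """Call get_info() repeatedly until no documents are pending, or time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        info = get_info()
        # Assumption: None or a missing attribute is treated as zero pending documents.
        pending = getattr(info, "pending_documents_count", None) or 0
        if pending == 0:
            return info
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{pending} document(s) still pending after {timeout_s}s")
        sleep(poll_s)


# Illustration with fake status objects: two polls with pending work, then done.
fake_statuses = iter([
    SimpleNamespace(pending_documents_count=2),
    SimpleNamespace(pending_documents_count=1),
    SimpleNamespace(pending_documents_count=None),
])
final_info = wait_until_indexed(lambda: next(fake_statuses), sleep=lambda _s: None)
print("Indexing finished; pending =", final_info.pending_documents_count)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in normal use the default `time.sleep` applies.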


## 6. Test Query

In [8]:
# Test query (only if documents are uploaded)
if len(df) > 0 and len(compatible_files) > 0:
    test_query = "What are the main topics covered in these documents?"
    
    print(f"Test query: {test_query}")
    print("=" * 60)
    
    response = gfs.query_with_file_search(
        query=test_query,
        store_names=[store.name],
        temperature=0.0
    )
    
    print(f"\nResponse:\n{response.text}")
    
    # Check for citations
    citations = gfs.extract_citations(response)
    if citations:
        print("\n[Sources cited from documents]")
    else:
        print("\n[No citations found]")
else:
    print("Skipping test query - no documents uploaded")

Test query: What are the main topics covered in these documents?

Response:
The documents primarily cover the field of Natural Language Processing (NLP), detailing its fundamental concepts, methods, and core tasks.

The main topics include:
*   **Introduction to NLP**: Defining NLP as a branch of artificial intelligence that enables computers to understand, interpret, and generate human language, combining computational linguistics, machine learning, and deep learning.
*   **Fundamental Concepts**:
    *   **Text Preprocessing**: Essential steps before modeling, such as tokenization, lowercasing, stop word removal, stemming/lemmatization, and handling punctuation, special characters, numbers, and dates.
    *   **Text Representation**: Both traditional methods like Bag of Words (BoW), TF-IDF, N-grams, Word2Vec, GloVe, and FastText, as well as modern methods such as contextual embeddings (BERT, ELMo, GPT), sentence embeddings (Sentence-BERT, Universal Sentence Encoder), and subword toke

## Summary

**Completed**:
- Created or reused the GFS file search store
- Uploaded compatible documents
- Checked the indexing status
- Tested basic query functionality

**Next Steps**:
- Proceed to `03_gfs_experiments.ipynb` for detailed RAG experiments
- Try several query patterns
- Measure latency and retrieval quality