### Loading PDF

Markdown loader

In [1]:
from langchain_pymupdf4llm import PyMuPDF4LLMLoader

path = "/home/cristian/projects/rag_pae/data/pdfs/amazonica/A1.pdf"

def load_pdf(path: str):
    loader = PyMuPDF4LLMLoader(path)
    doc = loader.load()
    return doc

doc = load_pdf(path)

In [2]:
for page in doc:
    print(f"Page {page}")
    print(page.page_content)
    print("\n" + "=" * 80 + "\n")

Page page_content='## Estado, militares y conflicto en la frontera amazónica colombiana: referentes históricos para la interpretación regional del conflicto

Carlos G. Zárate B. [1 ]

**Resumen**

El actual conflicto colombiano ha tenido en la región amazónica (zona de frontera con Brasil,
Perú y Ecuador) una expresión particular poco estudiada y analizada. Las erróneas concepciones
sobre el papel y las posibilidades de la región y sus fronteras, así como el diseño de políticas
públicas del Estado colombiano (que han sido orientadas casi exclusivamente al ejercicio de la
soberanía y al control territorial, con un énfasis excesivo en sus aspectos militares), arrojan muy
pobres resultados, desde que empezaron a implementarse en los años treinta del siglo pasado,
en términos de integración social, económica y política de la Amazonia al resto de la nación.
La violencia y los conflictos sociales que han padecido la región y sus zonas de frontera en las
últimas décadas tienen mucho que ver c

### Chunking

Semantic Chunking based on similarity

In [3]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings
from langchain.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  
    breakpoint_threshold_amount=95,  
    number_of_chunks=None,  
    buffer_size=1
)

split_docs = semantic_splitter.split_documents(doc)
vectorstore = Chroma.from_documents(documents=split_docs, embedding= embeddings)
retriever = vectorstore.as_retriever()


### Structured Output

In [4]:
from pydantic import BaseModel, Field
from typing import List, Optional, Literal

class AcademicPaper(BaseModel):
    title: str = Field(description="The complete and exact title of the article.")
    authors: List[str] = Field(description="The list of authors of the article, in the order they appear in the publication.")
    publication_year: Optional[int] = Field(description="The year in which the article was published.")
    journal: Optional[str] = Field(description="The name of the journal or conference proceedings where it was published.")
    abstract: Optional[str] = Field(description="The complete and accurate abstract as it appears in the article.")
    keywords: Optional[List[str]] = Field(description="A list of keywords associated with the article, if available.")
    regions: Optional[List[str]] = Field(description="The list of Colombian geographical regions where the study is focused in order of relevance, if applicable.")
    language: str = Field(default="Spanish", description="The primary language in which the article is written, default is Spanish.")
    typology: Optional[Literal["Border", "Transboundary"]] = Field(description="describes the type of territorial coverage presented by the source analyzed.")

In [5]:
system_prompt = (
    "You are an expert in extracting metadata from academic papers.\n"
    "Your task is to extract the following metadata from the provided text:\n"
    "CRITICAL RULES:\n"
    "- ONLY extract information that is explicitly present in the provided text.\n"
    "- If information is not clearly present, leave the field empty or null.\n"
    "- DO NOT invent, infer, or guess any information.\n"
    "- DO NOT change the language of the metadata, keep it in the original language of the paper.\n"
    "- Assume Spanish as the primary language for metadata extraction.\n"
    "- DO NOT take into account references or citations when extracting metadata.\n"
    "- Extract information exactly as it appears in the source text.\n"

    "EDGE CASES:\n"
    "- If there are multiple titles, extract ONLY the main article title.\n"
    "- If the paper is in multiple languages, extract metadata in the primary language.\n"
    "- Ignore bibliographic references when extracting metadata.\n"
    "- If the year appears multiple times, use the publication year, not the submission or acceptance year.\n"
    "- If there are abstracts in multiple languages, prioritize the Spanish version.\n"

    "The metadata fields you need to extract are:\n"
    "- Title: The complete and exact title of the article, not section headings.\n"
    "- Authors: List all primary authors, not editors or reviewers.\n"
    "- Publication Year: The year of publication, not submission or acceptance.\n"
    "- Journal: The name of the journal or conference proceedings where it was published, not the publisher.\n"
    "- Abstract: The complete abstract content, excluding the word 'Abstract.'.\n"
    "- Keywords: A list of keywords associated with the article, if available.\n"
    "- Regions: A list of Colombian geographical regions where the study is focused, if applicable.\n"
    "- Language: The primary language in which the article is written, default is Spanish.\n"

    "You will be provided with the text of an academic paper. Based on the following academic paper text: {context}\n"
)

FIELD_QUERIES = {
    "title": ["title", "article title", "research title", "paper title", "titulo", "título del artículo"],
    "authors": ["written by", "researchers", "corresponding author", "author list", "authors", "autores", "autor", "investigadores"],
    "publication_year": ["published", "publication year", "copyright", "year", "año de publicación", "publicado en"],
    "journal": ["journal", "published in", "proceedings", "conference", "publication venue", "revista", "congreso"],
    "abstract": [
        "abstract", "resumen", "resumo", "summary", "síntesis",
        "introduction abstract", "paper abstract", "study abstract",
        "research abstract", "article summary", "objetivo metodología resultados",
        "objetivo método resultado", "background objective method"
    ],
    "keywords": ["keywords", "key terms", "palabras clave", "termos-chave", "tags", "términos clave"],
    "regions": ["region", "geographical area", "geographic region", "area of study", "geographical scope", "región geográfica", "área geográfica"],
    "language": ["language", "idioma", "linguagem", "written in", "escrito en"],
    "typology": ["tipology", "tipo de cobertura territorial", "border", "transboundary", "cobertura territorial"]
}

FIELD_PROMPTS = {
    "title": (
        "Extract ONLY the main title of this academic paper.\n"
        "REQUIREMENTS:\n"
        "- Must be the complete and exact title as it appears in the document\n"
        "- Ignore section headings, references, or secondary titles\n"
        "- Do NOT modify or translate the title\n"
        "- If multiple titles exist, choose the main article title (usually at the beginning)\n"
        "- If no clear title is found, leave empty\n"
        "WHAT TO AVOID:\n"
        "- Section names like 'Introduction', 'Methods', 'Results'\n"
        "- Titles of referenced papers\n"
        "- Journal names or publication information"
    ),
    
    "authors": (
        "Extract ALL primary author names EXACTLY as they appear in the publication.\n"
        "CRITICAL REQUIREMENTS:\n"
        "- ONLY extract authors from the article header/title section\n"
        "- Include ALL authors in the order they appear in the publication\n"
        "- Extract only author names, NOT affiliations, emails, or institutions\n"
        "- Include only primary authors who contributed to THIS specific article\n"
        "- Maintain original name format and language\n"
        
        "WHERE TO LOOK FOR AUTHORS:\n"
        "- Immediately after or below the article title\n"
        "- In the first page of the document\n"
        "- Before the abstract section\n"
        "- In author information sections\n"
        
        "WHAT TO INCLUDE:\n"
        "- First name and last name combinations\n"
        "- Names with initials (e.g., 'J. García', 'María J. López')\n"
        "- Names with multiple surnames (e.g., 'García-Martínez')\n"
        "- Complete author names as they appear in the byline\n"
        
        "STRICT EXCLUSIONS - DO NOT INCLUDE:\n"
        "- Authors mentioned in references or bibliography sections\n"
        "- Authors cited within the text (e.g., 'According to Smith et al.')\n"
        "- Editors, reviewers, or journal staff\n"
        "- Email addresses, affiliations, or institutional information\n"
        "- Academic titles like 'Dr.', 'Prof.', 'PhD'\n"
        "- Authors from cited papers or external studies\n"
        "- Names that appear only in reference lists\n"
        "- Names mentioned in acknowledgments only\n"
        
        "VALIDATION RULES:\n"
        "- Authors must appear in the document header area\n"
        "- Authors must be clearly identified as contributors to THIS article\n"
        "- If uncertain whether a name is an author or cited reference, exclude it\n"
        "- Authors typically appear before the abstract and after the title\n"
        "- Look for formatting that indicates authorship (e.g., superscript numbers for affiliations)"
    ),
    
    "abstract": (
        "Extract the COMPLETE abstract section exactly as it appears.\n"
        "REQUIREMENTS:\n"
        "- Include the full abstract content in its original language\n"
        "- Exclude the heading words 'Abstract', 'Resumen', or 'Resumo'\n"
        "- Must be a coherent, complete text block\n"
        "- If multiple abstracts exist, prioritize Spanish version\n"
        "STRUCTURE TO LOOK FOR:\n"
        "- Objective/purpose statement\n"
        "- Methodology description\n"
        "- Main results or findings\n"
        "- Conclusions or implications\n"
        "WHAT TO EXCLUDE:\n"
        "- Author names or affiliations\n"
        "- Keywords section\n"
        "- Section headings\n"
        "- References or citations\n"
        "- Fragmented or incomplete sentences"
    ),
    
    "publication_year": (
        "Extract ONLY the publication year of this specific article.\n"
        "REQUIREMENTS:\n"
        "- Must be a 4-digit year (e.g., 2023)\n"
        "- Must be the actual publication year, not submission or acceptance\n"
        "- Look for phrases like 'published in', 'copyright', or journal publication info\n"
        "WHAT TO AVOID:\n"
        "- Submission dates\n"
        "- Acceptance dates\n"
        "- Years mentioned in the content or references\n"
        "- Conference presentation years if different from publication"
    ),
    
    "journal": (
        "Extract the EXACT name of the journal or conference proceedings.\n"
        "REQUIREMENTS:\n"
        "- Must be the complete and official name\n"
        "- Include subtitle if it's part of the official name\n"
        "- Maintain original language and formatting\n"
        "WHAT TO INCLUDE:\n"
        "- Journal names (e.g., 'Revista Colombiana de Ciencias')\n"
        "- Conference proceedings names\n"
        "- Book series names if applicable\n"
        "WHAT TO EXCLUDE:\n"
        "- Publisher names (e.g., 'Elsevier', 'Springer')\n"
        "- Volume or issue numbers\n"
        "- ISSN numbers\n"
        "- Editorial information"
    ),
    
    "keywords": (
        "Extract ALL keywords exactly as they appear in the document.\n"
        "REQUIREMENTS:\n"
        "- Must be explicitly listed as keywords in the document\n"
        "- Include all keywords in their original language\n"
        "- Maintain original formatting and order\n"
        "COMMON LOCATIONS:\n"
        "- After the abstract section\n"
        "- Before the introduction\n"
        "- In document metadata section\n"
        "WHAT TO EXCLUDE:\n"
        "- Terms that are not explicitly marked as keywords\n"
        "- Subject classifications\n"
        "- Terms inferred from content\n"
        "- JEL codes or similar classifications"
    ),
    
    "regions": (
        "Extract ONLY Colombian geographical regions explicitly mentioned as study focus areas.\n"
        "VALID COLOMBIAN REGIONS (extract only these if mentioned):\n"
        "- Amazonía (Amazon region)\n"
        "- Andina (Andean region)\n"
        "- Atlántica or Caribe (Atlantic/Caribbean region)\n"
        "- Insular (Insular region)\n"
        "- Orinoquía (Orinoco region)\n"
        "- Pacífica (Pacific region)\n"
        "REQUIREMENTS:\n"
        "- Must be explicitly mentioned as study area or geographic focus\n"
        "- Must be one of the official Colombian regions listed above\n"
        "- Extract in order of relevance/prominence in the text\n"
        "WHAT TO EXCLUDE:\n"
        "- Specific cities, departments, or municipalities\n"
        "- Other countries or international regions\n"
        "- 'Colombia' as a country name\n"
        "- Geographic features that are not official regions\n"
        "- Regions mentioned only in passing or in references"
    ),
    
    "language": (
        "Identify the PRIMARY language in which the article is written.\n"
        "REQUIREMENTS:\n"
        "- Determine the main language of the document content\n"
        "- Default to 'Spanish' if unclear\n"
        "- Use standard language names in English\n"
        "COMMON VALUES:\n"
        "- 'Spanish' (most common)\n"
        "- 'English'\n"
        "- 'Portuguese'\n"
        "DETERMINATION CRITERIA:\n"
        "- Language of the title and abstract\n"
        "- Language of the main body text\n"
        "- If multilingual, choose the predominant language"
    ),

    "typology": (
        "Identify the territorial coverage type of the source analyzed.\n"
        "REQUIREMENTS:\n"
        "- ONLY answer with one of these two values: 'Border' or 'Transboundary'.\n"
        "- 'Border' means it addresses only one side of the border.\n"
        "- 'Transboundary' means it includes information from both sides of the border.\n"
        "ANSWER FORMAT:\n"
        "- Respond ONLY with 'Border' or 'Transboundary'.\n"
        "- If not clear, leave empty or null.\n"
        )
}

In [6]:
from langchain_ollama import ChatOllama
llm = ChatOllama(model="mistral:instruct")
llm_structured = llm.with_structured_output(AcademicPaper)

### Extraction chain

In [76]:
import pprint
import json

def extract_single_field(field_name, retriever, llm_structured):
    # 1. Obtener queries para este campo específico
    field_queries = FIELD_QUERIES[field_name]
    
    # 2. Hacer retrieval con cada query del campo
    retrieved_docs = []
    for query in field_queries:
        docs = retriever.invoke(query)
        retrieved_docs.extend(docs)
    
    # 3. Combinar todos los chunks relevantes
    combined_text = "\n".join([doc.page_content for doc in retrieved_docs])
    
    # 4. Crear prompt específico para este campo
    field_instruction = FIELD_PROMPTS[field_name]
    full_prompt = system_prompt + "\n" + field_instruction + "\n"
    
    # 5. Extraer usando LLM estructurado
    result = llm_structured.invoke(full_prompt.format(context=combined_text))
    
    # 6. Devolver resultado
    return result

def extract_metadata(retriever, llm_structured):
    metadata = {}
    for field in FIELD_QUERIES.keys():

        # Skip fields for optimization - See title optimization below
        if field in ["keywords", 'abstract']:
            continue

        result = extract_single_field(field, retriever, llm_structured)

        #Optimization: Title field offers good information about others fields
        if field == "title":
            result = result.model_dump()
            metadata["title"] = result['title']
            metadata["abstract"] = result['abstract']
            metadata["keywords"] = result['keywords']
            continue

        metadata[field] = result.model_dump()[field]
    return metadata

result = extract_metadata(retriever, llm_structured)
pprint.pprint(result)

{'abstract': 'El actual conflicto colombiano ha tenido en la región amazónica '
             '(zona de frontera con Brasil, Perú y Ecuador) una expresión '
             'particular poco estudiada y analizada. Las erróneas concepciones '
             'sobre el papel y las posibilidades de la región y sus fronteras, '
             'así como el diseño de políticas públicas del Estado colombiano '
             '(que han sido orientadas casi exclusivamente al ejercicio de la '
             'soberanía y al control territorial, con un énfasis excesivo en '
             'sus aspectos militares), arrojan muy pobres resultados, desde '
             'que empezaron a implementarse en los años treinta del siglo '
             'pasado, en términos de integración social, económica y política '
             'de la Amazonia al resto de la nación. La violencia y los '
             'conflictos sociales que han padecido la región y sus zonas de '
             'frontera en las últimas décadas tienen mucho 

Extraction focused on articles

In [7]:
import pprint
import json

def extract_single_field(field_name, retriever, llm_structured, use_first_pages_only=False, doc=None):
    """
    Extrae un campo específico usando retrieval o primeras páginas
    """
    if use_first_pages_only and doc:
        # Para campos que típicamente están en las primeras páginas
        total_pages = len(doc)
        first_pages_count = max(1, int(total_pages * 0.2))  # 10% de las páginas
        first_pages = doc[:first_pages_count]
        combined_text = "\n".join([page.page_content for page in first_pages])
    else:
        # Usar retrieval normal
        field_queries = FIELD_QUERIES[field_name]
        
        # Hacer retrieval con cada query del campo
        retrieved_docs = []
        for query in field_queries:
            docs = retriever.invoke(query)
            retrieved_docs.extend(docs)
        
        # Remover duplicados manteniendo orden
        seen = set()
        unique_docs = []
        for doc_item in retrieved_docs:
            if doc_item.page_content not in seen:
                seen.add(doc_item.page_content)
                unique_docs.append(doc_item)
        
        combined_text = "\n".join([doc_item.page_content for doc_item in unique_docs])
    
    # Crear prompt específico para este campo
    field_instruction = FIELD_PROMPTS[field_name]
    full_prompt = system_prompt + "\n" + field_instruction + "\n"
    
    # Extraer usando LLM estructurado
    result = llm_structured.invoke(full_prompt.format(context=combined_text))
    
    return result

def extract_metadata_optimized(retriever, llm_structured, doc):
    """
    Versión optimizada que minimiza llamadas al LLM y mejora precisión
    """
    metadata = {}
    
    # Fase 1: Extraer información básica de las primeras páginas (title, authors, etc.)
    # Usar solo las primeras páginas para campos que típicamente están ahí
    first_page_fields = ["title", "authors", "publication_year", "journal", "language"]
    
    for field in first_page_fields:
        print(f"Procesando campo: {field}")
        result = extract_single_field(field, retriever, llm_structured, 
                                    use_first_pages_only=True, doc=doc)
        
        # Si es el campo title, extraer todos los campos básicos en una sola llamada
        if field == "title":
            result_dict = result.model_dump()
            metadata["title"] = result_dict['title']
            metadata["authors"] = result_dict['authors']
            metadata["publication_year"] = result_dict['publication_year']
            metadata["journal"] = result_dict['journal']
            metadata["language"] = result_dict['language']
            # También intentar extraer abstract y keywords si están disponibles
            if result_dict.get('abstract'):
                metadata["abstract"] = result_dict['abstract']
            if result_dict.get('keywords'):
                metadata["keywords"] = result_dict['keywords']
            break
        else:
            metadata[field] = result.model_dump()[field]
    
    # Fase 2: Extraer campos más específicos si no se obtuvieron en la fase 1
    remaining_fields = []
    
    for field in FIELD_QUERIES.keys():
        if field not in metadata:
            remaining_fields.append(field)
        elif metadata.get(field) is None or (isinstance(metadata[field], str) and len(metadata[field]) < 3):
            remaining_fields.append(field)
    
    for field in remaining_fields:
        if field not in metadata:
            print(f"Procesando campo restante: {field}")
            result = extract_single_field(field, retriever, llm_structured)
            metadata[field] = result.model_dump()[field]
    
    return metadata

result_optimized = extract_metadata_optimized(retriever, llm_structured, doc)
pprint.pprint(result_optimized)

Procesando campo: title
Procesando campo restante: regions
Procesando campo restante: typology
{'abstract': 'El actual conflicto colombiano ha tenido en la región amazónica '
             '(zona de frontera con Brasil y los países andinos) una expresión '
             'particular poco estudiada y analizada... Los relatores y '
             'especialistas de dicha comisión, en representación de diversas '
             'posturas ideológicas y académicas, parecen concordar en la '
             'existencia de ciertas ‘fallas geológicas’ en la construcción del '
             'Estado-nación colombiano, aludiendo a sus orígenes, sus formas '
             'específicas y, en general, su singularidad regional... Todas '
             'estas condiciones están presentes en las zonas de contacto '
             'fronterizo amazónico de Colombia con Ecuador, Perú, Brasil y '
             'Venezuela, por lo que el conocimiento de su historia, su '
             'dinámica y sus particularidades, puede ap

### Testing

In [8]:

ground_truth_data = {
    "title": """Estado, militares y conflicto en la frontera amazónica colombiana: referentes históricos para la interpretación regional del conflicto""",  
    "authors": ['Carlos G. Zárate B.'],        
    "publication_year": 2015,                          
    "journal": "MUNDO AMAZÓNCO",

    "abstract": """El actual conflicto colombiano ha tenido en la región amazónica (zona de frontera con Brasil, Perú y Ecuador) una expresión particular poco estudiada y analizada. Las erróneas concepciones sobre el papel y las posibilidades de la región y sus fronteras, así como el diseño de políticas públicas del Estado colombiano (que han sido orientadas casi exclusivamente al ejercicio de la soberanía y al control territorial, con un énfasis excesivo en sus aspectos militares), arrojan muy pobres resultados, desde que empezaron a implementarse en los años treinta del siglo pasado,en términos de integración social, económica y política de la Amazonia al resto de la nación.La violencia y los conflictos sociales que han padecido la región y sus zonas de frontera en lasúltimas décadas tienen mucho que ver con esta situación. Un eventual acuerdo en el actual proceso de paz podría ayudar a resolverlos, siempre y cuando se cumplan otras condiciones y se implementen simultáneamente reformas que han venido aplazándose durante décadas.""",  

    "keywords": ("Amazonia; fronteras; conflicto; relaciones internacionales; guerrillas").split("; "), 
    "regions": ["Amazonia"],                          
    "language": "Spanish"                             
}

In [None]:
# ...existing code...
# Instalar primero: pip install deepeval

import deepeval
from deepeval import evaluate
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, ContextualPrecisionMetric
from langchain_ollama import ChatOllama
from typing import List, Dict, Any
import json

# 1. Crear wrapper para usar Ollama con DeepEval
class OllamaLLM(DeepEvalBaseLLM):
    def __init__(self, model_name: str = "llama3.2"):
        self.model = ChatOllama(model=model_name, temperature=0)
    
    def load_model(self):
        return self.model
    
    def generate(self, prompt: str) -> str:
        response = self.model.invoke(prompt)
        return response.content
    
    async def a_generate(self, prompt: str) -> str:
        response = await self.model.ainvoke(prompt)
        return response.content
    
    def get_model_name(self):
        return "ollama-llama3.2"

# 2. Configurar el modelo para DeepEval
deepeval_llm = OllamaLLM("llama3.2")

# 3. Función para crear casos de prueba
def create_test_cases(ground_truth_data: Dict, extracted_metadata: Dict, 
                     retriever, queries: List[str]) -> List[LLMTestCase]:
    """
    Crear casos de prueba para DeepEval
    """
    test_cases = []
    
    for field, gt_value in ground_truth_data.items():
        if field in extracted_metadata:
            # Obtener contexto relevante para este campo
            field_queries = FIELD_QUERIES.get(field, [field])
            retrieved_docs = []
            
            for query in field_queries[:2]:  # Limitar a 2 queries por campo
                docs = retriever.invoke(query)
                retrieved_docs.extend(docs[:3])  # Top 3 docs por query
            
            # Combinar contexto
            context = [doc.page_content for doc in retrieved_docs[:5]]  # Max 5 docs
            
            # Crear caso de prueba
            test_case = LLMTestCase(
                input=f"Extract {field} from the academic paper",
                actual_output=str(extracted_metadata[field]),
                expected_output=str(gt_value),
                retrieval_context=context
            )
            test_cases.append(test_case)
    
    return test_cases

# 4. Función principal de evaluación
def evaluate_metadata_extraction_deepeval(ground_truth_data: Dict, extracted_metadata: Dict, 
                                         retriever) -> Dict[str, float]:
    """
    Evaluar extracción de metadata usando DeepEval
    """
    
    # Crear queries de evaluación
    evaluation_queries = []
    for field, queries in FIELD_QUERIES.items():
        evaluation_queries.extend(queries[:2])
    
    # Crear casos de prueba
    test_cases = create_test_cases(ground_truth_data, extracted_metadata, 
                                 retriever, evaluation_queries)
    
    # Definir métricas
    metrics = [
        AnswerRelevancyMetric(threshold=0.7, model=deepeval_llm),
        FaithfulnessMetric(threshold=0.7, model=deepeval_llm),
        ContextualPrecisionMetric(threshold=0.7, model=deepeval_llm)
    ]
    
    # Ejecutar evaluación
    results = {}
    
    for i, test_case in enumerate(test_cases):
        field_name = list(ground_truth_data.keys())[i % len(ground_truth_data)]
        
        print(f"Evaluando campo: {field_name}")
        
        # Evaluar cada métrica
        for metric in metrics:
            try:
                metric.measure(test_case)
                score = metric.score
                metric_name = metric.__class__.__name__.replace("Metric", "")
                results[f"{field_name}_{metric_name}"] = score
                print(f"  {metric_name}: {score:.3f}")
            except Exception as e:
                print(f"  Error en {metric.__class__.__name__}: {e}")
                results[f"{field_name}_{metric.__class__.__name__}"] = 0.0
    
    # Calcular promedios por métrica
    metric_names = ["AnswerRelevancy", "Faithfulness", "ContextualPrecision"]
    for metric_name in metric_names:
        metric_scores = [v for k, v in results.items() if metric_name in k]
        if metric_scores:
            results[f"avg_{metric_name}"] = sum(metric_scores) / len(metric_scores)
    
    # Score general
    all_scores = [v for k, v in results.items() if not k.startswith("avg_")]
    if all_scores:
        results["overall_score"] = sum(all_scores) / len(all_scores)
    
    return results

# 5. Evaluación personalizada adicional
def custom_metadata_evaluation(ground_truth_data: Dict, extracted_metadata: Dict) -> Dict[str, float]:
    """
    Evaluación personalizada específica para metadata
    """
    scores = {}
    
    for field, gt_value in ground_truth_data.items():
        if field not in extracted_metadata:
            scores[f"{field}_missing"] = 0.0
            continue
            
        ext_value = extracted_metadata[field]
        
        # Evaluar según tipo de campo
        if field == "title":
            # Para título, buscar coincidencia exacta o parcial
            gt_clean = str(gt_value).lower().strip()
            ext_clean = str(ext_value).lower().strip()
            
            if gt_clean == ext_clean:
                scores[f"{field}_exact"] = 1.0
            elif gt_clean in ext_clean or ext_clean in gt_clean:
                scores[f"{field}_partial"] = 0.8
            else:
                scores[f"{field}_mismatch"] = 0.0
                
        elif field == "authors":
            # Para autores, contar coincidencias
            gt_authors = gt_value if isinstance(gt_value, list) else [gt_value]
            ext_authors = ext_value if isinstance(ext_value, list) else [ext_value]
            
            matches = 0
            for gt_author in gt_authors:
                for ext_author in ext_authors:
                    if str(gt_author).lower() in str(ext_author).lower() or \
                       str(ext_author).lower() in str(gt_author).lower():
                        matches += 1
                        break
            
            scores[f"{field}_accuracy"] = matches / max(len(gt_authors), 1)
            
        elif field == "publication_year":
            # Para año, coincidencia exacta
            scores[f"{field}_exact"] = 1.0 if str(gt_value) == str(ext_value) else 0.0
            
        elif field in ["keywords", "regions"]:
            # Para listas, calcular Jaccard similarity
            gt_set = set(str(item).lower() for item in (gt_value if isinstance(gt_value, list) else [gt_value]))
            ext_set = set(str(item).lower() for item in (ext_value if isinstance(ext_value, list) else [ext_value]))
            
            if gt_set or ext_set:
                intersection = len(gt_set.intersection(ext_set))
                union = len(gt_set.union(ext_set))
                scores[f"{field}_jaccard"] = intersection / union if union > 0 else 0.0
            else:
                scores[f"{field}_jaccard"] = 1.0
                
        else:
            # Para otros campos, similitud de strings
            gt_str = str(gt_value).lower().strip()
            ext_str = str(ext_value).lower().strip()
            
            if gt_str == ext_str:
                scores[f"{field}_exact"] = 1.0
            elif gt_str in ext_str or ext_str in gt_str:
                scores[f"{field}_partial"] = 0.7
            else:
                scores[f"{field}_mismatch"] = 0.0
    
    # Calcular score promedio
    field_scores = [v for k, v in scores.items() if not k.endswith(('_missing', '_mismatch'))]
    scores["custom_avg_score"] = sum(field_scores) / len(field_scores) if field_scores else 0.0
    
    return scores

# 6. Función combinada de evaluación
def comprehensive_evaluation(ground_truth_data: Dict, extracted_metadata: Dict, 
                           retriever) -> Dict[str, Any]:
    """
    Evaluación comprehensiva combinando DeepEval y métricas personalizadas
    """
    print("=== Iniciando Evaluación Comprehensiva ===\n")
    
    # Evaluación con DeepEval
    print("1. Ejecutando evaluación con DeepEval...")
    try:
        deepeval_results = evaluate_metadata_extraction_deepeval(
            ground_truth_data, extracted_metadata, retriever
        )
        print("✓ DeepEval completado\n")
    except Exception as e:
        print(f"✗ Error en DeepEval: {e}\n")
        deepeval_results = {}
    
    # Evaluación personalizada
    print("2. Ejecutando evaluación personalizada...")
    custom_results = custom_metadata_evaluation(ground_truth_data, extracted_metadata)
    print("✓ Evaluación personalizada completada\n")
    
    # Combinar resultados
    combined_results = {
        "deepeval_metrics": deepeval_results,
        "custom_metrics": custom_results,
        "summary": {
            "deepeval_overall": deepeval_results.get("overall_score", 0.0),
            "custom_overall": custom_results.get("custom_avg_score", 0.0)
        }
    }
    
    # Calcular score final
    scores = [
        deepeval_results.get("overall_score", 0.0),
        custom_results.get("custom_avg_score", 0.0)
    ]
    combined_results["summary"]["final_score"] = sum(s for s in scores if s > 0) / max(len([s for s in scores if s > 0]), 1)
    
    return combined_results

# 7. Ejecutar evaluación
print("Iniciando evaluación con DeepEval...")
evaluation_results = comprehensive_evaluation(
    ground_truth_data, 
    result_optimized, 
    retriever
)

# 8. Mostrar resultados
print("\n" + "="*60)
print("RESULTADOS DE EVALUACIÓN")
print("="*60)

print(f"\n📊 RESUMEN GENERAL:")
print(f"Score Final: {evaluation_results['summary']['final_score']:.3f}")
print(f"DeepEval Overall: {evaluation_results['summary']['deepeval_overall']:.3f}")
print(f"Custom Overall: {evaluation_results['summary']['custom_overall']:.3f}")

print(f"\n🔍 MÉTRICAS DEEPEVAL:")
for metric, score in evaluation_results["deepeval_metrics"].items():
    if metric.startswith("avg_") or metric == "overall_score":
        print(f"  {metric}: {score:.3f}")

print(f"\n📝 MÉTRICAS PERSONALIZADAS:")
for metric, score in evaluation_results["custom_metrics"].items():
    if not metric.endswith(('_missing', '_mismatch')) or score > 0:
        print(f"  {metric}: {score:.3f}")

print(f"\n💾 Guardando resultados completos...")
with open("evaluation_results.json", "w", encoding="utf-8") as f:
    json.dump(evaluation_results, f, indent=2, ensure_ascii=False)

print("✓ Evaluación completada. Resultados guardados en 'evaluation_results.json'")
# ...existing code...

Iniciando evaluación con DeepEval...
=== Iniciando Evaluación Comprehensiva ===

1. Ejecutando evaluación con DeepEval...


Evaluando campo: title


  AnswerRelevancy: 0.500


  Error en FaithfulnessMetric: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.


  ContextualPrecision: 0.500
Evaluando campo: authors


  AnswerRelevancy: 0.500


  Error en FaithfulnessMetric: 'reason'


  ContextualPrecision: 1.000
Evaluando campo: publication_year


  AnswerRelevancy: 1.000


  Error en FaithfulnessMetric: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.


  ContextualPrecision: 0.750
Evaluando campo: journal


  AnswerRelevancy: 0.600


  Error en FaithfulnessMetric: 'reason'


  ContextualPrecision: 0.756
Evaluando campo: abstract


  AnswerRelevancy: 0.667


  Error en FaithfulnessMetric: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.


  ContextualPrecision: 1.000
Evaluando campo: keywords


  AnswerRelevancy: 0.600


  Error en FaithfulnessMetric: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.


  ContextualPrecision: 1.000
Evaluando campo: regions


  AnswerRelevancy: 1.000


  Faithfulness: 0.667


  ContextualPrecision: 0.756
Evaluando campo: language


  AnswerRelevancy: 0.600


  Error en FaithfulnessMetric: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.


  ContextualPrecision: 0.756
✓ DeepEval completado

2. Ejecutando evaluación personalizada...
✓ Evaluación personalizada completada


RESULTADOS DE EVALUACIÓN

📊 RESUMEN GENERAL:
Score Final: 0.689
DeepEval Overall: 0.527
Custom Overall: 0.852

🔍 MÉTRICAS DEEPEVAL:
  avg_AnswerRelevancy: 0.683
  avg_Faithfulness: 0.083
  avg_ContextualPrecision: 0.815
  overall_score: 0.527

📝 MÉTRICAS PERSONALIZADAS:
  title_exact: 1.000
  authors_accuracy: 1.000
  publication_year_exact: 1.000
  keywords_jaccard: 0.111
  regions_jaccard: 1.000
  language_exact: 1.000
  custom_avg_score: 0.852

💾 Guardando resultados completos...
✓ Evaluación completada. Resultados guardados en 'evaluation_results.json'
