# Mappr

> Scale up evaluation report mapping against evaluation frameworks using agentic workflows


::: {.callout-warning}
This notebook is a work in progress.
:::

## Approach

**Problem**: Manually mapping evaluation reports against IOM's Strategic Results Framework (SRF) is time-consuming and resource-intensive: each report has to be checked against ~150 outputs.

**Solution**: A three-stage pipeline that uses the Global Compact for Migration (GCM) as a pruning mechanism for SRF outputs:

### Stage 1: SRF Enablers & Cross-cutting Analysis
- **Parallel analysis** of Enablers (7 categories) and Cross-cutting Priorities (4 categories)
- **Purpose**: Identify whether the report is primarily a meta-evaluation or transversal in nature
- **Fast processing**: 11 items in total; the results provide context for the subsequent stages

### Stage 2: Informed GCM Analysis
- **GCM Objectives analysis** (23 items) informed by Stage 1 results
- **Condensed representations**: the UN General Assembly Resolution wording of each objective is simplified for retrieval efficiency

### Stage 3: Targeted SRF Analysis
- **SRF Filtering**: Use GCM results + `gcm_srf_lut` lookup table to prune ~150 SRF outputs to ~20-50 relevant ones
- **Deep analysis**: Full hierarchy context (objective → outcome → output → indicators)
- **Parallel processing**: Final targeted analysis of pruned SRF outputs
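
The pruning step of Stage 3 can be sketched as a union over lookup-table entries. This is a minimal illustration rather than the notebook's implementation: the shape of `gcm_srf_lut` (a dict from GCM objective id to a list of SRF output ids), the helper name `prune_srf_outputs`, and every id below are assumptions made for the example.

```python
def prune_srf_outputs(gcm_hits, gcm_srf_lut):
    "Return the sorted SRF output ids linked to any matched GCM objective."
    selected = set()
    for gcm_id in gcm_hits:
        # Unknown GCM ids contribute nothing rather than raising
        selected.update(gcm_srf_lut.get(gcm_id, []))
    return sorted(selected)

# Hypothetical lookup table: GCM objective id -> SRF output ids (made-up ids)
toy_lut = {
    "7": ["1a11", "1a12", "2b21"],
    "21": ["2b21", "3c31"],
    "23": ["3d41"],
}

pruned = prune_srf_outputs(["7", "21"], toy_lut)
print(pruned)
```

Taking the union rather than the intersection is in line with the preference for over-inclusion stated under Key Features: an SRF output is kept as soon as one matched GCM objective points to it.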

::: {.column-body}
![Overall Pipeline Diagram](img/mapping-pipeline.png){fig-align="center" width="800px"}
:::


### Key Features
- **Agentic workflow (ReAct pattern)**: LLM navigates document headings, explores sections iteratively until confident
- **DSPy signatures**: Structured reasoning with built-in tracing for evaluation
- **Rate-limited parallelization**: Respects API constraints (15 RPM) using fastcore
- **False positive bias**: Better to over-include than miss relevant mappings
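
As an illustration of rate-limited parallelization, the sketch below spaces out the start of each call while still running calls in worker threads. It uses only the Python standard library as a stand-in, not the fastcore-based code this notebook relies on; `RateLimiter` and `rate_limited_map` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    "Space call starts at least 60/rpm seconds apart."
    def __init__(self, rpm):
        self.interval = 60.0 / rpm
        self._lock = threading.Lock()
        self._next_start = 0.0

    def wait(self):
        # Reserve the next free slot under the lock, then sleep outside it
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_start)
            self._next_start = start + self.interval
        time.sleep(max(0.0, start - now))

def rate_limited_map(fn, items, rpm=15, n_workers=4):
    "Apply `fn` to `items` in worker threads, preserving input order."
    limiter = RateLimiter(rpm)
    def call(item):
        limiter.wait()
        return fn(item)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(call, items))

# A high `rpm` keeps the demo fast; at 15 RPM calls start 4 seconds apart
doubled = rate_limited_map(lambda x: x * 2, [1, 2, 3], rpm=6000)
print(doubled)
```

With `rpm=15` the limiter reserves one start slot every 4 seconds, so slow calls can overlap in the thread pool while new calls are still released at the allowed pace.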

In [None]:
#| default_exp mappr

In [None]:
#| exports
from pathlib import Path
from functools import reduce
from toolslm.md_hier import *
from rich import print
import json
from fastcore.all import *

from typing import List
import dspy

from evaluatr.frameworks import IOMEvalData

In [None]:
#| exports
from dotenv import load_dotenv
import os

load_dotenv()
GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')

In [None]:
#| exports
cfg = AttrDict({
    'lm': 'gemini/gemini-2.0-flash-exp',
    'api_key': GEMINI_API_KEY,
    'max_tokens': 8192,
    'track_usage': False,
})

In [None]:
#| exports
lm = dspy.LM(cfg.lm, api_key=cfg.api_key, max_tokens=cfg.max_tokens)  # apply the token limit set in cfg
dspy.configure(lm=lm)

In [None]:
#| eval: false
doc = Path("../_data/md_library/49d2fba781b6a7c0d94577479636ee6f/abridged_evaluation_report_final_olta_ndoja_pdf/enriched")
pages = doc.ls(file_exts=".md").sorted(key=lambda p: int(p.stem.split('_')[1]))
report = '\n\n---\n\n'.join(page.read_text() for page in pages)
print(report[:1000])

## Language Model Tools

### Hierarchical report navigation

Thanks to `toolslm.md_hier` and the clean heading structure of the `report` markdown, we can build a nested dictionary of sections, subsections, and so on:

In [None]:
#| eval: false
hdgs = create_heading_dict(report); hdgs

{'PPMi .... page 1': {},
 'CONTENTS .... page 3': {},
 '1. Introduction .... page 4': {},
 '2. Background of the JI-HoA .... page 5': {'2.1. Context and design of the JI-HoA .... page 5': {},
  '2.2. External factors affecting the implementation of the JI .... page 7': {}},
 '3. Methodology .... page 8': {},
 '4. Findings .... page 10': {'4.1. Relevance .... page 10': {'4.1.1. Relevance of programme activities for migrants, returnees, and communities .... page 10': {}},
  'Overall performance score for relevance: $3.9 / 5$ <br> Robustness score for the evidence: $4.5 / 5$': {'4.1.1.1 Needs of migrants .... page 10': {},
   '4.1.1.2 Needs of returnees .... page 10': {},
   '4.1.1.3 Needs of community members .... page 12': {},
   "4.1.2. Programme's relevance to the needs of stakeholders .... page 12": {'4.1.2.1 Needs of governments .... page 12': {},
    '4.1.2.2 Needs of other stakeholders .... page 13': {}},
   '4.2. Coherence .... page 13': {"4.2.1. The JI-HoA's alignment with the o

In [None]:
#| exports
def find_section_path(
    hdgs: dict, # The nested dictionary structure
    target_section: str # The section name to find
):
    "Find the nested key path for a given section name (first depth-first match)"
    def search_recursive(current_dict, path=()):
        for key, value in current_dict.items():
            current_path = [*path, key]  # new list per level, no shared mutable default
            if key == target_section:
                return current_path
            if isinstance(value, dict):
                result = search_recursive(value, current_path)
                if result:
                    return result
        return None
    
    return search_recursive(hdgs)

We can then retrieve the path to a subsection, that is, the list of nested headings leading to it, from the nested `hdgs` dict:

In [None]:
#| eval: false
path = find_section_path(hdgs, "4.1.1.1 Needs of migrants .... page 10"); path

['4. Findings .... page 10',
 'Overall performance score for relevance: $3.9 / 5$ <br> Robustness score for the evidence: $4.5 / 5$',
 '4.1.1.1 Needs of migrants .... page 10']

Then retrieve the specific subsection content:

In [None]:
#| exports
def get_content_tool(hdgs, keys_list):
    "Navigate through nested levels using the exact key strings"
    return reduce(lambda current, key: current[key], keys_list, hdgs).text

In [None]:
#| eval: false
content = get_content_tool(hdgs, path)
print(content[:500])
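
To make the navigation concrete, here is a self-contained toy version. It assumes, as `get_content_tool` does, that each node is dict-like and exposes its section text through a `.text` attribute; the `Node` class and the headings are stand-ins invented for the example, not `toolslm` objects.

```python
from functools import reduce

class Node(dict):
    "Hypothetical heading node: a dict of child headings plus a `text` attribute."
    def __init__(self, text="", children=None):
        super().__init__(children or {})
        self.text = text

toy = Node(children={
    "Findings": Node("Findings intro", {
        "Relevance": Node("Relevance text"),
        "Coherence": Node("Coherence text"),
    }),
})

def get_content(hdgs, keys_list):
    # Each reduce step descends one level: hdgs[k0][k1]...[kn]
    return reduce(lambda current, key: current[key], keys_list, hdgs).text

print(get_content(toy, ["Findings", "Relevance"]))
```

A key that is not present raises `KeyError`, which is why it helps to resolve the path with `find_section_path` before fetching content.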

## ReAct (Reasoning & Acting)

**Why build a custom iterative analyzer instead of using DSPy ReAct?**

We could have leveraged DSPy's built-in [`ReAct` module](https://dspy.ai/api/modules/ReAct), which provides an agent-based approach where the LLM automatically decides when and how to use exploration tools. The "ReAct" concept was introduced in [this paper](https://arxiv.org/pdf/2210.03629). However, we chose to implement our own iterative analyzer from scratch for three reasons:

**Open-ended vs. Structured Nature**: DSPy's ReAct is designed for open-ended problem solving where the agent explores freely using available tools. Our use case requires a more structured, methodical approach to document analysis with predictable exploration patterns.

**Document-Specific Control**: Our approach is tailored specifically for structured document exploration with hierarchical headings, allowing us to implement domain-specific logic for section navigation and content retrieval.

**Evaluator Requirements**: Since traces will be reviewed by human evaluators for error analysis, we needed explicit, step-by-step decision logging rather than the more implicit reasoning chains that ReAct provides.
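
One possible shape for such explicit decision logs is sketched below. The schema (`TraceStep`, `Trace`, and their field names) is an assumption made for illustration and is not the format used by this notebook, which currently prints each decision; the logged strings are invented examples.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class TraceStep:
    "One logged decision (hypothetical schema)."
    step: int
    stage: str        # e.g. "overview", "exploration", "assessment"
    decision: str
    reasoning: str

@dataclass
class Trace:
    theme: str
    steps: list = field(default_factory=list)

    def log(self, stage, decision, reasoning):
        # Steps are numbered in the order they are recorded
        self.steps.append(TraceStep(len(self.steps) + 1, stage, decision, reasoning))

    def to_jsonl(self):
        # One JSON object per line, easy to diff and to load in a review tool
        return "\n".join(json.dumps(asdict(s)) for s in self.steps)

trace = Trace(theme="Data and evidence")
trace.log("exploration", "3. Methodology .... page 8", "Likely to describe data collection")
trace.log("exploration", "DONE", "Evidence judged sufficient")
print(trace.to_jsonl())
```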

### Formatters

Here we define a set of functions that format both the evaluation framework themes to analyze (SRF enablers, objectives, GCM objectives, ...) and the traces.

In [None]:
#| exports
def format_enabler_theme(theme):
    "Format an SRF enabler as markdown: a title heading followed by its description"
    parts = [
        f'## Enabler {theme.id}: {theme.title}',
        '### Description', 
        theme.description
    ]
    return '\n'.join(parts)

For instance:

In [None]:
#| eval: false
eval_data = IOMEvalData()
data_evidence = eval_data.srf_enablers[3]  # "Data and evidence" is at index 3
print(format_enabler_theme(data_evidence))
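
The cell above needs the IOM framework data. The shape of the output can be seen with a stand-in object: in this sketch the function body is copied from above, while the `id` value and the description are placeholders, not IOM's actual wording.

```python
from types import SimpleNamespace

def format_enabler_theme(theme):
    parts = [
        f'## Enabler {theme.id}: {theme.title}',
        '### Description',
        theme.description
    ]
    return '\n'.join(parts)

# Stand-in for an item of `eval_data.srf_enablers` (placeholder values)
toy_theme = SimpleNamespace(id=4, title="Data and evidence",
                            description="Placeholder description.")
formatted = format_enabler_theme(toy_theme)
print(formatted)
# ## Enabler 4: Data and evidence
# ### Description
# Placeholder description.
```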

In [None]:
#| exports
def format_evidence(theme):
    parts = [
        f'## Enabler {theme.id}: {theme.title}',
        '### Description', 
        theme.description
    ]
    return '\n'.join(parts)

### Signatures

A [DSPy signature](https://dspy.ai/learn/programming/signatures) is a declarative specification of input/output behavior of a DSPy module. Signatures allow you to tell the LM (Language Model) what it needs to do, rather than specify how we should ask the LM to do it.

In [None]:
#| exports
class Overview(dspy.Signature):
    """Identify sections relevant to enabler/cross-cutting category"""
    theme: str = dspy.InputField(desc="Theme being analyzed")
    all_headings: str = dspy.InputField(desc="Complete document structure")
    priority_sections: List[str] = dspy.OutputField(desc="Ordered list of section keys to explore first")
    strategy: str = dspy.OutputField(desc="Reasoning for this exploration strategy")


For instance, for the "Data and evidence" SRF Enabler:

In [None]:
#| eval: false
overview_analyzer = dspy.ChainOfThought(Overview)
result_overview = overview_analyzer(
    theme = format_enabler_theme(data_evidence),
    all_headings=str(hdgs),
)

print(f'Priority sections: {result_overview.priority_sections}')
print(f'Strategy: {result_overview.strategy}')

In [None]:
#| exports
class Exploration(dspy.Signature):
    """Decide next exploration step for enabler/cross-cutting analysis"""
    theme: str = dspy.InputField(desc="Theme being analyzed")
    current_findings: str = dspy.InputField(desc="Evidence found so far")
    available_sections: str = dspy.InputField(desc="Remaining sections to explore")
    next_section: str = dspy.OutputField(desc="Next section key to explore, or 'DONE' if sufficient")
    reasoning: str = dspy.OutputField(desc="Why this section or why stopping")

In [None]:
#| eval: false
exploration = dspy.ChainOfThought(Exploration)

result_exploration = exploration(
    theme = format_enabler_theme(data_evidence),
    current_findings="No evidence collected yet",
    available_sections=str(result_overview.priority_sections)
)

print("Next section:", result_exploration.next_section)
print("Reasoning:", result_exploration.reasoning)

In [None]:
#| exports
class Assessment(dspy.Signature):
    """Assess if current evidence is sufficient for enabler analysis"""
    theme: str = dspy.InputField(desc="Theme being analyzed")
    evidence_so_far: str = dspy.InputField(desc="All evidence collected")
    sections_explored: str = dspy.InputField(desc="Sections already checked")
    sufficient: bool = dspy.OutputField(desc="Is evidence sufficient to make conclusion?")
    confidence_score: float = dspy.OutputField(desc="Confidence in current findings (0-1)")
    next_priority: str = dspy.OutputField(desc="If continuing, what type of section to prioritize")

In [None]:
#| exports
class Synthesis(dspy.Signature):
    """Provide detailed rationale and synthesis of enabler analysis"""
    theme: str = dspy.InputField(desc="Theme being analyzed")
    all_evidence: str = dspy.InputField(desc="All collected evidence")
    sections_explored: str = dspy.InputField(desc="List of sections explored")
    theme_covered: bool = dspy.OutputField(desc="Final decision on theme coverage")
    confidence_explanation: str = dspy.OutputField(desc="Detailed explanation of confidence score")
    evidence_summary: str = dspy.OutputField(desc="Key evidence supporting the conclusion")
    gaps_identified: str = dspy.OutputField(desc="Any gaps or missing aspects")

### Evidence Collection Pipeline

In [None]:
#| exports
class ThemeAnalyzer(dspy.Module):
    "Full pipeline for theme analysis"
    def __init__(self, overview_sig, exploration_sig, assessment_sig, synthesis_sig, max_iter=10):
        super().__init__()  # let dspy.Module set up its own state first
        self.overview = dspy.ChainOfThought(overview_sig)
        self.explore = dspy.ChainOfThought(exploration_sig)
        self.assess = dspy.ChainOfThought(assessment_sig)
        self.synthesize = dspy.ChainOfThought(synthesis_sig)
        self.max_iter = max_iter

In [None]:
#| exports
@patch
def forward(self:ThemeAnalyzer, theme, headings, get_content_fn=get_content_tool):
    priority_sections = self.get_overview(theme, headings)
    evidence = self.explore_iteratively(theme, priority_sections, headings, get_content_fn)
    return self.synthesize_findings(theme, evidence)

In [None]:
#| exports
@patch
def get_overview(self:ThemeAnalyzer, theme, headings):
    overview = self.overview(theme=theme, all_headings=str(headings))
    print("Overview priority sections:", overview.priority_sections)
    print("Overview strategy:", overview.strategy)
    return overview.priority_sections


In [None]:
#| exports
@patch
def explore_iteratively(self:ThemeAnalyzer, theme, priority_sections, headings, get_content_fn):
    evidence_collected = []
    sections_explored = []
    available_sections = priority_sections.copy()
    
    for i in range(self.max_iter):
        print(f"\n--- Iteration {i+1} ---")
        if not available_sections:
            print("No more sections to explore, stopping")
            break
        if self.should_stop_exploring(theme, evidence_collected, sections_explored):
            break
        decision = self.make_exploration_decision(theme, evidence_collected, available_sections)
        if decision.next_section == 'DONE':
            print("Decision says DONE, breaking")
            break
        evidence_collected, sections_explored = self.process_section(decision, 
                                                                     headings, 
                                                                     get_content_fn, 
                                                                     evidence_collected, 
                                                                     sections_explored, 
                                                                     available_sections)
    
    return {"evidence": evidence_collected, "sections": sections_explored}
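
The control flow of this loop can be exercised without any LLM call. The sketch below swaps the `Exploration` and `Assessment` modules for plain callables (`pick_next`, `is_sufficient`) and the document lookup for `fetch`; all three names and the toy document are assumptions made for illustration. As a simplification, the sketch drops every picked section from the remaining list so that each iteration makes progress.

```python
def explore(priority_sections, pick_next, is_sufficient, fetch, max_iter=10):
    "Exploration loop with the LLM calls replaced by plain callables."
    evidence, explored = [], []
    available = list(priority_sections)
    for _ in range(max_iter):
        if not available:
            break                      # nothing left to read
        if evidence and is_sufficient(evidence):
            break                      # assessment says we can stop
        section = pick_next(evidence, available)
        if section == 'DONE':
            break                      # explorer chose to stop
        content = fetch(section)
        if content is not None:
            evidence.append(f"# Section: {section}\n## Content\n{content}")
            explored.append(section)
        if section in available:
            available.remove(section)
    return {"evidence": evidence, "sections": explored}

# Deterministic stubs standing in for the Exploration and Assessment modules
toy_doc = {"Methodology": "Surveys and interviews.", "Findings": "Data quality was high."}
result = explore(
    priority_sections=["Methodology", "Findings", "Annex"],
    pick_next=lambda ev, avail: avail[0],
    is_sufficient=lambda ev: len(ev) >= 2,
    fetch=toy_doc.get,
)
print(result["sections"])
```

Here the loop reads two sections, the stub assessment then reports sufficiency, and "Annex" is never fetched.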


In [None]:
#| exports
@patch
def make_exploration_decision(self:ThemeAnalyzer, theme, evidence_collected, available_sections):
    decision = self.explore(
        theme=theme,
        current_findings="\n\n".join(evidence_collected) if evidence_collected else "No evidence collected yet",
        available_sections=str(available_sections)
    )
    print("Decision:", decision.next_section)
    print("Reasoning:", decision.reasoning)
    return decision


In [None]:
#| exports
@patch
def should_stop_exploring(self:ThemeAnalyzer, theme, evidence_collected, sections_explored):
    if not evidence_collected:
        return False
        
    assessment = self.assess(
        theme=theme,
        evidence_so_far="\n\n".join(evidence_collected),
        sections_explored=str(sections_explored)
    )
    print("Assessment - Sufficient:", assessment.sufficient, "Confidence:", assessment.confidence_score)
    
    return assessment.sufficient and assessment.confidence_score > 0.8
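
The stopping rule itself is a plain conjunction, shown here in isolation with a hypothetical `should_stop` helper and the same 0.8 threshold. Requiring both conditions keeps the loop reading when the model is only moderately confident.

```python
def should_stop(sufficient, confidence_score, threshold=0.8):
    "Stop only if the assessment is positive AND confidence strictly exceeds the threshold."
    return bool(sufficient) and confidence_score > threshold

print(should_stop(True, 0.9))    # True
print(should_stop(True, 0.8))    # False: the comparison is strict
print(should_stop(False, 0.95))  # False: high confidence alone is not enough
```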


In [None]:
#| exports
@patch
def process_section(self:ThemeAnalyzer, decision, headings, get_content_fn, evidence_collected, sections_explored, available_sections):
    path = find_section_path(headings, decision.next_section)
    print("Path found:", path)
    
    if path:
        content = get_content_fn(headings, path)
        print("Content length:", len(content))
        evidence_collected.append(f"# Section: {decision.next_section}\n## Content\n{content}")
        sections_explored.append(decision.next_section)
    else:
        print("No path found for section!")
    # Drop the section whether or not it resolved, so an unresolvable key is not proposed again
    if decision.next_section in available_sections:
        available_sections.remove(decision.next_section)
    
    return evidence_collected, sections_explored


In [None]:
#| exports
@patch
def synthesize_findings(self:ThemeAnalyzer, theme, evidence):
    synthesis = self.synthesize(
        theme=theme,
        all_evidence="\n\n".join(evidence["evidence"]),
        sections_explored=str(evidence["sections"])
    )
    print("Synthesis result:", synthesis.theme_covered)
    print("Synthesis reasoning:", synthesis.confidence_explanation)
    print("Synthesis evidence:", synthesis.evidence_summary)
    print("Synthesis gaps:", synthesis.gaps_identified)
    return synthesis


For instance:

- create the analyzer

In [None]:
#| eval: false
analyzer = ThemeAnalyzer(Overview, Exploration, Assessment, Synthesis)

- pick a theme

In [None]:
#| eval: false
theme = format_enabler_theme(eval_data.srf_enablers[3])  # "Data and evidence"
print(theme)

- run the analysis

In [None]:
#| eval: false
result = analyzer(theme, hdgs, get_content_tool)

### In progress ...

In [None]:
#| eval: false
condensed_gcm = {
    "7": {
        "title": "Address and reduce vulnerabilities in migration",
        "core_theme": "Protect migrants in vulnerable situations through comprehensive support and rights-based approaches",
        "key_principles": ["Human rights-based approach", "Best interests of the child", "Gender-responsive policies", "Non-discrimination"],
        "target_groups": ["Unaccompanied children", "Women at risk", "Trafficking victims", "Workers facing exploitation", "Persons with disabilities"],
        "main_activities": ["Identification and assistance", "Legal protection and remedies", "Child protection systems", "Status regularization procedures", "Crisis response"]
    },
    "21": {
        "title": "Cooperate in facilitating safe and dignified return and readmission, as well as sustainable reintegration",
        "core_theme": "Safe and dignified return, readmission, and sustainable reintegration of migrants",
        "key_principles": ["Due process and individual assessment", "Prohibition of collective expulsion", "Non-refoulement", "Human right to return"],
        "target_groups": ["Returning migrants", "Children in return processes", "Communities of origin"],
        "main_activities": ["Cooperation frameworks", "Travel documents and identification", "Consular assistance", "Reintegration support", "Monitoring mechanisms"]
    }
}
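
By analogy with `format_enabler_theme`, a condensed GCM entry could be rendered as a markdown theme before being passed to the analyzer. The helper `format_gcm_theme` below is a hypothetical sketch, not part of the exported module; the sample entry is objective 21 from `condensed_gcm`, copied so the block runs on its own.

```python
def format_gcm_theme(gcm_id, entry):
    "Render one condensed GCM objective as a markdown theme (hypothetical helper)."
    parts = [
        f"## GCM Objective {gcm_id}: {entry['title']}",
        "### Core theme",
        entry["core_theme"],
    ]
    sections = [("key_principles", "Key principles"),
                ("target_groups", "Target groups"),
                ("main_activities", "Main activities")]
    for key, label in sections:
        parts.append(f"### {label}")
        # One bullet per list item
        parts.extend(f"- {item}" for item in entry[key])
    return "\n".join(parts)

sample = {
    "title": "Cooperate in facilitating safe and dignified return and readmission, as well as sustainable reintegration",
    "core_theme": "Safe and dignified return, readmission, and sustainable reintegration of migrants",
    "key_principles": ["Due process and individual assessment", "Prohibition of collective expulsion", "Non-refoulement", "Human right to return"],
    "target_groups": ["Returning migrants", "Children in return processes", "Communities of origin"],
    "main_activities": ["Cooperation frameworks", "Travel documents and identification", "Consular assistance", "Reintegration support", "Monitoring mechanisms"],
}

gcm_theme = format_gcm_theme("21", sample)
print(gcm_theme)
```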