# Extract

> Report Sections Extraction


This module provides tools for automatically identifying and extracting core sections from evaluation reports. When working with large reports (50-200+ pages), we focus on key sections (executive summaries, introductions, conclusions, and recommendations) to support tagging and mapping exercises against evaluation frameworks (e.g., SRF, GCM).

Focusing on these sections helps:

- Reduce noise from tangential content
- Limit confirmation bias, i.e. the tendency to flag any passing mention as a theme

The approach uses an LLM to parse a report's table of contents, identify which sections contain substantive thematic content, and extract just those sections for further processing.



In [None]:
#| default_exp extract

In [None]:
#| export
from fastcore.all import *
from operator import getitem
from pydantic import BaseModel
from lisette.core import completion, mk_msg
from toolslm.md_hier import create_heading_dict
import json
from iomeval.core import load_prompt

In [None]:
#| export
class CoreSectionsOutput(BaseModel):
    "Identify the core sections of the report"
    section_paths: list[list[str]]
    reasoning: str

For instance, given the following markdown report:

In [None]:
sample_md = """# Report Title ... page 1

## Executive Summary ... page 1

This is a summary of key findings.

## 1. Introduction ... page 2

Background information here.

### 1.1 Objectives ... page 2

The objectives are...

## 2. Findings ... page 3

Detailed findings.

## 3. Conclusions ... page 5

Main conclusions.

## 4. Recommendations ... page 6

Key recommendations."""


In [None]:
hdgs = create_heading_dict(sample_md)
hdgs

{'Report Title ... page 1': {'Executive Summary ... page 1': {},
  '1. Introduction ... page 2': {'1.1 Objectives ... page 2': {}},
  '2. Findings ... page 3': {},
  '3. Conclusions ... page 5': {},
  '4. Recommendations ... page 6': {}}}

## Navigating Nested Headings

Reports have hierarchical structure (sections, subsections, etc.). We represent this as a nested dictionary using `create_heading_dict` from `toolslm.md_hier`. To extract text from a specific section, we need to navigate through this hierarchy using a path of keys.


In [None]:
#| export
def get_text(ks:list[str], # List of exact key strings forming path through nested dict
             hdgs:dict # Nested dictionary of headings created by `create_heading_dict`
            ) -> str: # Extracted markdown text for the section
    "Navigate through nested heading levels and return the text content"
    return L(ks).reduce(getitem, hdgs).text

In [None]:
path = ['Report Title ... page 1', '3. Conclusions ... page 5']
print(get_text(path, hdgs))

## 3. Conclusions ... page 5

Main conclusions.
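
The path lookup above is a left fold of `getitem` over the list of keys. The following is a minimal sketch of that mechanic on a plain nested dict. The `toc` dict and the `walk` helper are hypothetical stand-ins for illustration: the objects returned by `create_heading_dict` also expose a `.text` attribute, which plain dicts lack.

```python
from functools import reduce
from operator import getitem

# Plain-dict stand-in for the heading tree (hypothetical; no `.text` attribute here)
toc = {'Report Title': {'1. Introduction': {'1.1 Objectives': {}},
                        '3. Conclusions': {}}}

def walk(keys, tree):
    "Follow `keys` one level at a time; equivalent to tree[k0][k1]..."
    return reduce(getitem, keys, tree)

node = walk(['Report Title', '1. Introduction'], toc)
print(list(node))  # ['1.1 Objectives']

# Keys must match exactly: a near-miss heading raises KeyError
try:
    walk(['Report Title', 'Conclusions'], toc)
    missing = False
except KeyError:
    missing = True
print(missing)  # True
```

Because the lookup is an exact key match, a path only resolves if every element is copied verbatim from the heading dictionary, including numbering and any trailing page markers.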


## Handling Nested Selections

When the LLM identifies core sections, it might select both a parent section and its children (e.g., "Introduction" and "Introduction > Objectives"). To avoid duplicate content, we filter out any paths that are children of other selected paths.

In [None]:
#| export
def rm_nested(paths:list[list[str]] # List of section paths, where each path is a list of keys
             ) -> list[list[str]]: # Filtered list with nested paths removed
    "Remove paths that are children of other paths in the list"
    paths = sorted(paths, key=len)
    keep = []
    for p in paths:
        if not any(p[:len(k)] == k for k in keep): keep.append(p)
    return keep

In [None]:
nested_paths = [
    ['Report Title ... page 1', '1. Introduction ... page 2'],
    ['Report Title ... page 1', '1. Introduction ... page 2', '1.1 Objectives ... page 2'],
    ['Report Title ... page 1', '3. Conclusions ... page 5']
]
rm_nested(nested_paths)

[['Report Title ... page 1', '1. Introduction ... page 2'],
 ['Report Title ... page 1', '3. Conclusions ... page 5']]
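
Two further properties follow from the sort-then-filter logic and can be checked with a small sketch. The function body is repeated here only so the example is self-contained, and the short path names are made up for illustration.

```python
def rm_nested(paths):
    "Same logic as the exported function, repeated so the example runs on its own"
    keep = []
    for p in sorted(paths, key=len):
        # Drop `p` if an already-kept path is a prefix of it (or equal to it)
        if not any(p[:len(k)] == k for k in keep): keep.append(p)
    return keep

child  = ['Report', 'Intro', 'Objectives']
parent = ['Report', 'Intro']
other  = ['Report', 'Conclusions']

# Child listed before its parent, plus an exact duplicate of `other`
result = rm_nested([child, other, parent, other])
print(result)  # [['Report', 'Conclusions'], ['Report', 'Intro']]
```

The child is removed even though it appears before its parent in the input, and exact duplicates collapse to one entry. Note that the result is ordered by path length (ties keep their input order), so it does not necessarily follow the order of the sections in the document.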

## LLM-Based Section Identification

Rather than relying on rigid pattern matching, we use an LLM to identify core sections. This handles multilingual reports, varied naming conventions, and unusual structures. The LLM receives the table of contents as a nested dictionary and returns, for each selected section, a path of heading keys starting from the root.

In [None]:
#| export
def identify_core_sections(
    hdgs:dict, # Nested dictionary of report headings from `create_heading_dict`
    sp:str=None, # System prompt for section identification
    response_format:type[BaseModel]=CoreSectionsOutput, # Pydantic model for structured output
    model:str='claude-sonnet-4-5' # LLM model to use for identification
) -> dict: # Dictionary with 'section_paths' and 'reasoning' keys
    "Use LLM to identify core sections (exec summary, intro, conclusions, recommendations) from ToC"
    if sp is None: sp = load_prompt('select_sections', 'files/prompts')
    res = completion(model=model, messages=[mk_msg(f"Here is the table of contents as a nested dictionary:\n\n{hdgs}")], 
                     system=[{"type": "text", "text": sp}], response_format=response_format)
    return json.loads(res.choices[0].message.content)

In [None]:
#| eval: false
sections = identify_core_sections(hdgs)
sections

{'section_paths': [['Report Title ... page 1', 'Executive Summary ... page 1'],
  ['Report Title ... page 1', '1. Introduction ... page 2'],
  ['Report Title ... page 1',
   '1. Introduction ... page 2',
   '1.1 Objectives ... page 2'],
  ['Report Title ... page 1', '3. Conclusions ... page 5'],
  ['Report Title ... page 1', '4. Recommendations ... page 6']],
 'reasoning': "Selected Executive Summary (p1) as it provides high-level overview of core themes. Included Introduction (p2) and Objectives (p2) to understand evaluation purpose and scope. Added Conclusions (p5) where authors synthesize key findings and themes. Included Recommendations (p6) as they reveal what evaluators deemed most important for action. Excluded Findings section as it's typically more descriptive detail covered by conclusions. Total estimated coverage: ~5 pages, which is appropriate given the report's compact structure and will capture the core thematic content for synthesis."}
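
Since the returned paths are later used for exact key lookups, a path whose heading text differs even slightly from the table of contents would fail to resolve. One possible safeguard, sketched below, is to check each path against the heading dictionary before extraction. The `path_exists` helper and the sample data are hypothetical and not part of this module, and the sketch assumes the heading tree supports `in` and `[]` like a plain dict.

```python
def path_exists(path, tree):
    "True if every key in `path` is found while descending `tree`"
    node = tree
    for key in path:
        if key not in node: return False
        node = node[key]
    return True

# Plain-dict stand-in for the heading tree (hypothetical)
toc = {'Report Title': {'1. Introduction': {'1.1 Objectives': {}},
                        '3. Conclusions': {}}}

# Made-up model output: the second path drops the section number
llm_paths = [['Report Title', '3. Conclusions'],
             ['Report Title', 'Conclusions']]

valid   = [p for p in llm_paths if path_exists(p, toc)]
invalid = [p for p in llm_paths if not path_exists(p, toc)]
print(valid)    # [['Report Title', '3. Conclusions']]
print(invalid)  # [['Report Title', 'Conclusions']]
```

Whether to drop invalid paths silently, log them, or re-prompt the model is a design choice this sketch leaves open.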

## Putting It All Together

The main entry point combines all the pieces: parse the report structure, identify core sections, remove nested duplicates, and extract the text.


In [None]:
#| export

@delegates(identify_core_sections)
def extract_sections(
    md:str, # Markdown text of full report
    **kwargs # Additional keyword arguments passed to `identify_core_sections`
) -> str: # Concatenated text of all core sections
    "Extract and concatenate core sections (exec summary, intro, conclusions, recommendations) from report markdown"
    hdgs = create_heading_dict(md)
    sections = identify_core_sections(hdgs, **kwargs)
    paths = rm_nested(sections['section_paths'])
    return '\n'.join([get_text(p, hdgs) for p in paths])

In [None]:
#| eval: false
text = extract_sections(sample_md, model='claude-sonnet-4-5')
print(text[:200])

## Executive Summary ... page 1

This is a summary of key findings.
## 1. Introduction ... page 2

Background information here.

### 1.1 Objectives ... page 2

The objectives are...
## 3. Conclusions 
