## DocumentSynthesizer example (with ContextGenerator)

This notebook demonstrates how to use the refactored `DocumentSynthesizer`, which:
- Extracts text from documents
- Generates prompt-ready contexts using `ContextGenerator` (semantic chunking, markdown-aware)
- Delegates test generation to `PromptSynthesizer`, per-context

You can configure:

Initialization Parameters (when creating the synthesizer):
- `prompt`: Generation prompt for test cases (required)
- `batch_size`: Maximum tests per LLM call (optional)
- `system_prompt`: Custom system prompt template (optional)
- `max_context_tokens`: Token limit per context (default: 1000)
- `strategy`: Context selection strategy, either "sequential" or "random" (default: "random")

Generation Parameters (when calling .generate()):
- `documents`: List of document dictionaries (required for document-based generation)
  Each document should contain:
  - `name` (str): Document identifier/filename
  - `description` (str): Brief description of document content
  - `path` (str): File path to the document, or
  - `content` (str): Raw text content; if provided, it overrides `path`
- `num_tests`: Total number of tests to generate across all contexts (default: 5)
- `tests_per_context`: Target number of tests per context; the total is still capped at `num_tests` (optional)

Each generated test includes metadata mapping it back to its source context and documents.
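
The "content overrides path" rule for document entries can be illustrated with a small sketch. `resolve_text` is a hypothetical helper written for this notebook, not part of the SDK, and it assumes a plain-text file read in place of the SDK's real text extraction (which also handles formats such as PDF):

```python
from pathlib import Path


def resolve_text(document: dict) -> str:
    """Pick the text for one document entry (hypothetical helper, not SDK code)."""
    content = document.get("content")
    if content:
        # Raw content, when present and non-empty, takes precedence over the path.
        return content
    path = document.get("path")
    if path:
        # Assumption: a UTF-8 text file; the real SDK performs its own extraction.
        return Path(path).read_text(encoding="utf-8")
    raise ValueError(f"Document {document.get('name')!r} needs 'content' or 'path'")


doc = {"name": "policy_terms.md", "content": "# Insurance Policy Terms", "path": "ignored.md"}
print(resolve_text(doc))
```

Because `content` is checked first, the file named in `path` is never opened in this example.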

### Example 1: Using direct content (no file paths needed)

In [2]:
from rhesis.sdk.synthesizers.document_synthesizer import DocumentSynthesizer
from rhesis.sdk.types import Document


prompt = "Generate diverse test cases for insurance claims handling."

doc_synth = DocumentSynthesizer(
    prompt=prompt,
)

documents = [
    Document(
        name="policy_terms.md",
        description="Insurance policy terms and coverage",
        content="""
# Insurance Policy Terms

## Coverage
- Medical emergencies
- Theft and loss

## Exclusions
- Intentional damage
- Pre-existing conditions

---

## Claims Process
1. Report incident within 48 hours
2. Provide documentation
3. Await assessment
        """,
    ),
    Document(
        name="claims_guidelines.md",
        description="Guidelines for handling claims",
        content="""
# Claims Handling Guidelines

Claims should be processed within 14 days. Fraud indicators include inconsistent dates and unverifiable receipts.
        """,
    ),
]

result = doc_synth.generate(documents=documents, num_tests=10)

len(result.tests), result.metadata



📄 Document Analysis:
   • 2 document(s) processed
   • 85 total tokens extracted
   • 2 context(s) created (max 1000 tokens each)
   • Strategy: random context selection

🧪 Test Generation Plan:
   • Distributing 10 tests evenly across 2 contexts
   • ~5 tests per context (remainder distributed to first contexts)
   • Total tests to generate: 10

Generating tests for context 1/2 (242 characters)
Generating tests for context 2/2 (143 characters)


(10,
 {'synthesizer': 'DocumentSynthesizer',
  'batch_size': 20,
  'generation_prompt': 'Generate diverse test cases for insurance claims handling.',
  'num_tests': 10,
  'requested_tests': 10,
  'documents_used': ['policy_terms.md', 'claims_guidelines.md'],
  'coverage_percent': 94.12,
  'contexts_total': 2,
  'contexts_used': 2,
  'tests_per_context': None})
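
The generation plan printed above says tests are spread evenly across contexts, with the remainder going to the first contexts. A minimal sketch of that arithmetic follows. `distribute_tests` is a hypothetical name used here for illustration, and the SDK's internal logic may differ:

```python
from typing import List


def distribute_tests(num_tests: int, num_contexts: int) -> List[int]:
    """Split num_tests across contexts; the earliest contexts absorb the remainder."""
    if num_contexts <= 0:
        raise ValueError("num_contexts must be positive")
    base, remainder = divmod(num_tests, num_contexts)
    # The first `remainder` contexts each receive one extra test.
    return [base + 1 if i < remainder else base for i in range(num_contexts)]


print(distribute_tests(10, 2))  # [5, 5]
print(distribute_tests(10, 9))  # [2, 1, 1, 1, 1, 1, 1, 1, 1]
```

With 10 tests and 2 contexts the split is even. With 10 tests and 9 contexts, only the first context receives a second test.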

In [4]:
# Inspect first test and its enhanced metadata
first = result.tests[0]
{
  "prompt": first["prompt"]["content"],
  "behavior": first["behavior"],
  "category": first["category"],
  "topic": first["topic"],
  "metadata_keys": list(first["metadata"].keys()),
  "context_index": first["metadata"]["context_index"],
  "context_length": first["metadata"]["context_length"],
  "source_document": first["metadata"]["sources"][0]["source"],
  "source_name": first["metadata"]["sources"][0]["name"],
  "source_description": first["metadata"]["sources"][0]["description"],
  "context_preview": first["metadata"]["sources"][0]["content"][:160] + "...",
  "generated_by": first["metadata"]["generated_by"],
}

{'prompt': 'My house was robbed, and my valuable stamp collection was stolen. Can I claim it under my insurance policy?',
 'behavior': 'Reliability',
 'category': 'Harmless',
 'topic': 'Theft Claim',
 'metadata_keys': ['generated_by',
  'context_used',
  'context_length',
  'sources',
  'context_index'],
 'context_index': 0,
 'context_length': 242,
 'source_document': 'policy_terms.md',
 'source_name': 'policy_terms.md',
 'source_description': 'Insurance policy terms and coverage',
 'context_preview': '# Insurance Policy Terms\n\n## Coverage\n- Medical emergencies\n- Theft and loss\n\n## Exclusions\n- Intentional damage\n- Pre-existing conditions\n\n---\n\n## Claims Proce...',
 'generated_by': 'DocumentSynthesizer'}
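
Since every test records its `context_index` and `sources`, the metadata can be used to check how tests are spread over contexts and documents. The sketch below uses hand-written stand-in dictionaries that mirror only the keys shown in the output above. It does not call the SDK, and `count_by_context` and `count_by_source` are hypothetical helper names:

```python
from collections import Counter
from typing import Dict, List


def count_by_context(tests: List[dict]) -> Dict[int, int]:
    """Count tests per context_index."""
    return dict(Counter(t["metadata"]["context_index"] for t in tests))


def count_by_source(tests: List[dict]) -> Dict[str, int]:
    """Count tests per source document name (a test may list several sources)."""
    counts: Counter = Counter()
    for t in tests:
        for source in t["metadata"]["sources"]:
            counts[source["name"]] += 1
    return dict(counts)


# Stand-in data shaped like the metadata above (not real generator output).
sample_tests = [
    {"metadata": {"context_index": 0, "sources": [{"name": "policy_terms.md"}]}},
    {"metadata": {"context_index": 0, "sources": [{"name": "policy_terms.md"}]}},
    {"metadata": {"context_index": 1, "sources": [{"name": "claims_guidelines.md"}]}},
]

print(count_by_context(sample_tests))
print(count_by_source(sample_tests))
```

The same helpers should work on `result.tests` as long as each test exposes its metadata with dictionary-style access, as in the cell above.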

### Example 2: Using file paths

In [5]:
doc_path = "/Users/emanuelederossi/Downloads/15227EN_MV_GIC_10.2021 copia 2.pdf"

documents = [
    Document(
        name="Sample Document",
        description="Example document for testing",
        path=doc_path,
    )
]

prompt = "Generate test cases about this document to check if the information is correct. Always say: given that the document says: (literal content of the document), why ..."

doc_synth = DocumentSynthesizer(
    prompt=prompt,
    max_context_tokens=1500,
)

result = doc_synth.generate(documents=documents, num_tests=10)

print(result)


📄 Document Analysis:
   • 1 document(s) processed
   • 12,482 total tokens extracted
   • 9 context(s) created (max 1500 tokens each)
   • Strategy: random context selection

🧪 Test Generation Plan:
   • Distributing 10 tests evenly across 9 contexts
   • ~1 tests per context (remainder distributed to first contexts)
   • Total tests to generate: 10

Generating tests for context 1/9 (2240 characters)
Generating tests for context 2/9 (6235 characters)
Generating tests for context 3/9 (6219 characters)
Generating tests for context 4/9 (6046 characters)
Generating tests for context 5/9 (6334 characters)
Generating tests for context 6/9 (5949 characters)
Generating tests for context 7/9 (6021 characters)
Generating tests for context 8/9 (6465 characters)
Generating tests for context 9/9 (6012 characters)
<rhesis.sdk.entities.test_set.TestSet object at 0x11c4f74f0>


In [9]:
# Inspect first test and its enhanced metadata
first = result.tests[0]
{
  "prompt": first["prompt"]["content"],
  "behavior": first["behavior"],
  "category": first["category"],
  "topic": first["topic"],
  "metadata_keys": list(first["metadata"].keys()),
  "context_index": first["metadata"]["context_index"],
  "context_length": first["metadata"]["context_length"],
  "source_document": first["metadata"]["sources"][0]["source"],
  "source_name": first["metadata"]["sources"][0]["name"],
  "source_description": first["metadata"]["sources"][0]["description"],
  "context_preview": first["metadata"]["sources"][0]["content"][:160] + "...",
  "generated_by": first["metadata"]["generated_by"],
}

{'prompt': 'Given that the document says: In the absence of all such persons, AXA covers the funer-al expenses up to the insured death benefit. The benefit increases by 50 % if an insured is survived by at least one child below the age of 20 who is entitled to inherit., why would the funeral expenses not be covered?',
 'behavior': 'Reliability',
 'category': 'Harmless',
 'topic': 'Funeral Coverage',
 'metadata_keys': ['generated_by',
  'context_used',
  'context_length',
  'sources',
  'context_index'],
 'context_index': 0,
 'context_length': 2240,
 'source_document': '15227EN_MV_GIC_10.2021 copia 2.pdf',
 'source_name': 'Sample Document',
 'source_description': 'Example document for testing',
 'context_preview': 'In the absence of all such persons, AXA covers the funer-\nal expenses up to the insured death benefit.\nThe benefit increases by 50 % if an insured is survived b...',
 'generated_by': 'DocumentSynthesizer'}