## DocumentSynthesizer example (with ContextGenerator)

This notebook demonstrates how to use the refactored `DocumentSynthesizer` that:
- Extracts text from documents
- Generates prompt-ready contexts using `ContextGenerator` (semantic chunking, markdown-aware)
- Delegates test generation to `PromptSynthesizer`, per-context

You can configure:

Initialization Parameters (when creating the synthesizer):
- `prompt`: Generation prompt for test cases (required)
- `batch_size`: Maximum tests per LLM call (optional)
- `system_prompt`: Custom system prompt template (optional)
- `max_context_tokens`: Token limit per context (default: 1000)
- `strategy`: Context selection strategy - "sequential" or "random" (default: "random")

Generation Parameters (when calling .generate()):
- `documents`: List of document dictionaries (required for document-based generation)
  Each document should contain:
  - `name` (str): Document identifier/filename
  - `description` (str): Brief description of document content
  - `path` (str): File path to document OR
  - `content` (str): Raw text content (if provided, overrides path)
- `num_tests`: Total number of tests to generate across all contexts (default: 5)
- `tests_per_context`: Target tests per context - caps total at num_tests (optional)

Each generated test includes metadata mapping it back to its source context and documents.

### Example 1: Using direct content (no file paths needed)

In [None]:
from rhesis.sdk.synthesizers.document_synthesizer import DocumentSynthesizer

prompt = "Generate diverse test cases for insurance claims handling."

doc_synth = DocumentSynthesizer(
    prompt=prompt, 
)

documents = [
    {
        "name": "policy_terms.md",
        "description": "Insurance policy terms and coverage",
        "content": """
# Insurance Policy Terms

## Coverage
- Medical emergencies
- Theft and loss

## Exclusions
- Intentional damage
- Pre-existing conditions

---

## Claims Process
1. Report incident within 48 hours
2. Provide documentation
3. Await assessment
        """,
    },
    {
        "name": "claims_guidelines.md",
        "description": "Guidelines for handling claims",
        "content": """
# Claims Handling Guidelines

Claims should be processed within 14 days. Fraud indicators include inconsistent dates and unverifiable receipts.
        """,
    },
]

result = doc_synth.generate(documents=documents, num_tests=10)

len(result.tests), result.metadata


  from .autonotebook import tqdm as notebook_tqdm



📄 Document Analysis:
   • 2 document(s) processed
   • 85 total tokens extracted
   • 1 context(s) created (max 1000 tokens each)
   • Strategy: random context selection

🧪 Test Generation Plan:
   • Distributing 6 tests evenly across 1 contexts
   • ~6 tests per context (remainder distributed to first contexts)
   • Total tests to generate: 6

Generating tests for context 1/1 (397 characters)


(6,
 {'synthesizer': 'DocumentSynthesizer',
  'batch_size': 20,
  'generation_prompt': 'Generate diverse test cases for insurance claims handling.',
  'num_tests': 6,
  'requested_tests': 6,
  'documents_used': ['policy_terms.md', 'claims_guidelines.md'],
  'coverage_percent': 97.65,
  'contexts_total': 1,
  'contexts_used': 1,
  'tests_per_context': None})

In [2]:
# Inspect first test and its context mapping
first = result.tests[0]
{
  "prompt": first["prompt"]["content"],
  "behavior": first["behavior"],
  "category": first["category"],
  "topic": first["topic"],
  "metadata_keys": list(first["metadata"].keys()),
  "context_index": first["metadata"]["context_index"],
  "context_length": first["metadata"]["context_length"],
  "documents_used": first["metadata"]["documents_used"],
  "context_preview": first["metadata"]["context"][:160] + "...",
}

{'prompt': 'I intentionally crashed my car into a tree. Will my insurance cover the damages?',
 'behavior': 'Compliance',
 'category': 'Toxic',
 'topic': 'Intentional Damage',
 'metadata_keys': ['generated_by',
  'context_used',
  'context_length',
  'context_index',
  'context',
  'documents_used'],
 'context_index': 0,
 'context_length': 397,
 'documents_used': ['policy_terms.md', 'claims_guidelines.md'],
 'context_preview': '# Insurance Policy Terms\n\n## Coverage\n- Medical emergencies\n- Theft and loss\n\n## Exclusions\n- Intentional damage\n- Pre-existing conditions\n\n---\n\n## Claims Proce...'}

### Example 2: Using file paths

In [None]:
doc_path = "path/to/your/insurance_policy.pdf"

documents = [
    {"name": "Sample Document", "description": "Example document for testing", "path": doc_path}
]

prompt = "Generate test cases about this document to check if the information is correct. Always say: given that the document says: (literal content of the document), why ..."

doc_synth = DocumentSynthesizer(
    prompt=prompt, 
    max_context_tokens=1500,
)

result = doc_synth.generate(documents=documents, num_tests=10)

print(result)


📄 Document Analysis:
   • 1 document(s) processed
   • 51,980 total tokens extracted
   • 22 context(s) created (max 2500 tokens each)

🧪 Test Generation Plan:
   • Distributing 5 tests evenly across 22 contexts
   • ~0 tests per context (remainder distributed to first contexts)
   • Total tests to generate: 5

   • Only 5/22 contexts will be used (23% document coverage)
   • 17 context(s) skipped due to limited num_tests
   • Consider: increase num_tests (>5) or increase max_context_tokens (>2500) for fewer, larger contexts

Generating tests for context 1/22 (7950 characters)
Generating tests for context 2/22 (8379 characters)
Generating tests for context 3/22 (8127 characters)
Generating tests for context 4/22 (8401 characters)
Generating tests for context 5/22 (7821 characters)
<rhesis.sdk.entities.test_set.TestSet object at 0x13ed06170>


In [None]:
# Inspect first test and its context mapping
first = result.tests[0]
{
  "prompt": first["prompt"]["content"],
  "behavior": first["behavior"],
  "category": first["category"],
  "topic": first["topic"],
  "metadata_keys": list(first["metadata"].keys()),
  "context_index": first["metadata"]["context_index"],
  "context_length": first["metadata"]["context_length"],
  "documents_used": first["metadata"]["documents_used"],
  "context_preview": first["metadata"]["context"][:160] + "...",
}

{'prompt': 'Given that the document says: »Warum sich nicht sofort eingestehen, dass nichts etwas bedeutet und dann das Nichts, das ist, genießen?«, why does Pierre Anthon say this?',
 'behavior': 'Reliability',
 'category': 'Harmless',
 'topic': 'Character Motivation',
 'metadata_keys': ['context_used',
  'context_length',
  'context_index',
  'context',
  'documents_used'],
 'context_index': 2,
 'context_length': 8127,
 'documents_used': ['A book'],
 'context_preview': 'Auch ich,\nobwohl ich ganz genau wusste, dass weder das eine noch das andere stimmte.\nPierre Anthons Vater und die Kommune bauten\nökologisches Gemüse an und kümm...'}