# Cite-Right Guide

This notebook demonstrates all features, options, and extras of the cite-right library.

## Installation Options

Cite-Right has several optional extras you can install:

```bash
# Basic installation
pip install cite-right

# With embeddings (sentence-transformers)
pip install "cite-right[embeddings]"

# With tiktoken tokenizer (OpenAI's tokenizer)
pip install "cite-right[tiktoken]"

# With HuggingFace tokenizers
pip install "cite-right[huggingface]"

# With spaCy for better sentence segmentation and claim decomposition
pip install "cite-right[spacy]"
python -m spacy download en_core_web_sm

# With pySBD for fast sentence boundary detection
pip install "cite-right[pysbd]"

# With LangChain integration
pip install "cite-right[langchain]"

# With LlamaIndex integration
pip install "cite-right[llamaindex]"

# Install everything!
pip install "cite-right[embeddings,tiktoken,huggingface,spacy,pysbd,langchain,llamaindex]"
```

---
## Part 1: Core Citation Alignment
---

### 1.1 Basic Usage with `align_citations()`

`align_citations()` is the core function: it aligns each span of the answer to supporting evidence in the source documents.

In [1]:
from cite_right import SourceDocument, align_citations

# Define source documents
sources = [
    SourceDocument(
        id="climate_report",
        text="Global temperatures have risen by 1.1°C since pre-industrial times. "
        "The rate of warming has accelerated in recent decades.",
        metadata={"year": 2024, "type": "report"},  # Optional metadata
    ),
    SourceDocument(
        id="renewable_energy",
        text="Solar and wind power now account for 12% of global electricity generation. "
        "Investment in renewable energy reached $500 billion in 2023.",
    ),
]

# An AI-generated answer
answer = (
    "Global temperatures have increased by 1.1°C since the pre-industrial era. "
    "Renewable energy sources like solar and wind now generate 12% of global electricity."
)

# Align citations
results = align_citations(answer, sources)

# Display results
for span_citations in results:
    print(f"Answer span: {span_citations.answer_span.text!r}")
    print(f"Status: {span_citations.status}")
    print(
        f"Char range: [{span_citations.answer_span.char_start}:{span_citations.answer_span.char_end}]"
    )
    for citation in span_citations.citations:
        print(f"  Source: {citation.source_id} (index {citation.source_index})")
        print(f"  Evidence: {citation.evidence!r}")
        print(f"  Evidence offsets: [{citation.char_start}:{citation.char_end}]")
        print(f"  Score: {citation.score:.3f}")
        print(f"  Answer coverage: {citation.components.get('answer_coverage', 0):.1%}")
    print()

Answer span: 'Global temperatures have increased by 1.1°C since the pre-industrial era.'
Status: supported
Char range: [0:73]
  Source: climate_report (index 0)
  Evidence: 'Global temperatures have risen by 1.1°C since pre-industrial'
  Evidence offsets: [0:60]
  Score: 1.760
  Answer coverage: 72.7%

Answer span: 'Renewable energy sources like solar and wind now generate 12% of global electricity.'
Status: supported
Char range: [74:158]
  Source: renewable_energy (index 1)
  Evidence: 'Solar and wind power now account for 12% of global electricity'
  Evidence offsets: [0:62]
  Score: 1.532
  Answer coverage: 64.3%



### 1.2 Using Plain Strings as Sources

You can also pass plain strings instead of `SourceDocument` objects.

In [2]:
from cite_right import align_citations

# Sources as plain strings (IDs will be "0", "1", etc.)
sources = [
    "Revenue grew 15% in Q4 2024, driven by strong holiday sales.",
    "The company announced plans to expand into 5 new markets next year.",
]

answer = "Revenue increased by 15% in the fourth quarter."

results = align_citations(answer, sources)
for sc in results:
    print(f"Status: {sc.status}")
    if sc.citations:
        print(f"Source ID: {sc.citations[0].source_id}")

Status: partial
Source ID: 0


### 1.3 Using `SourceChunk` for Pre-chunked Documents

Use `SourceChunk` when you have pre-chunked documents from a RAG pipeline and want citation offsets relative to the original document.

In [3]:
from cite_right import SourceChunk, align_citations

# Original full document
full_document = (
    "Introduction: This report covers Q4 2024 performance. "
    "Revenue: Total revenue reached $50M, up 15% YoY. "
    "Outlook: We expect continued growth in 2025."
)

# Pre-chunked sections (e.g., from a RAG retrieval)
chunks = [
    SourceChunk(
        source_id="report",
        text="Revenue: Total revenue reached $50M, up 15% YoY.",
        doc_char_start=54,  # Where this chunk starts in full document
        doc_char_end=102,  # Where this chunk ends
        document_text=full_document,  # Optional: enables absolute offset computation
        source_index=0,  # Optional: custom source index
    ),
    SourceChunk(
        source_id="report",
        text="Outlook: We expect continued growth in 2025.",
        doc_char_start=103,
        doc_char_end=147,
        document_text=full_document,
    ),
]

answer = "Revenue was $50M, a 15% increase year-over-year."

results = align_citations(answer, chunks)

for sc in results:
    print(f"Answer: {sc.answer_span.text!r}")
    for citation in sc.citations:
        print(f"  Evidence: {citation.evidence!r}")
        print(f"  Offsets in original doc: [{citation.char_start}:{citation.char_end}]")
        # Verify the offset works
        print(
            f"  Verification: {full_document[citation.char_start : citation.char_end]!r}"
        )

Answer: 'Revenue was $50M, a 15% increase year-over-year.'
  Evidence: 'revenue reached $50M, up 15%'
  Offsets in original doc: [69:97]
  Verification: 'revenue reached $50M, up 15%'
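
Hand-counted values for `doc_char_start` and `doc_char_end` are easy to get wrong by a character or two. The following is a minimal sketch, using only `str.find` and no cite-right calls, of deriving them from the chunk text itself. `locate_chunk` is a hypothetical helper, not part of the library.

```python
def locate_chunk(document: str, chunk_text: str) -> tuple[int, int]:
    """Return (start, end) such that document[start:end] == chunk_text."""
    start = document.find(chunk_text)
    if start == -1:
        raise ValueError("chunk text does not occur in the document")
    return start, start + len(chunk_text)


full_document = (
    "Introduction: This report covers Q4 2024 performance. "
    "Revenue: Total revenue reached $50M, up 15% YoY. "
    "Outlook: We expect continued growth in 2025."
)
chunk_text = "Revenue: Total revenue reached $50M, up 15% YoY."

start, end = locate_chunk(full_document, chunk_text)
# The slice round-trips, so the pair is safe to use as
# doc_char_start / doc_char_end for this chunk.
assert full_document[start:end] == chunk_text
print(start, end)
```

`str.find` returns the first occurrence only, so a document with repeated passages would need a search that resumes from the end of the previous chunk.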


---
## Part 2: Configuration Options
---

### 2.1 Configuration Presets

`CitationConfig` provides presets for common use cases.

In [4]:
from cite_right import CitationConfig, SourceDocument, align_citations

sources = [
    SourceDocument(
        id="doc",
        text="The product launch was successful with 10,000 units sold in the first week.",
    )
]
answer = "The launch was a success."

# Balanced (default) - good for general use
balanced = CitationConfig.balanced()
print(
    f"Balanced: top_k={balanced.top_k}, min_answer_coverage={balanced.min_answer_coverage}"
)

# Strict - high precision, minimize false positives
strict = CitationConfig.strict()
print(
    f"Strict: top_k={strict.top_k}, min_answer_coverage={strict.min_answer_coverage}, supported_threshold={strict.supported_answer_coverage}"
)

# Permissive - lenient, good for paraphrased content
permissive = CitationConfig.permissive()
print(
    f"Permissive: top_k={permissive.top_k}, allow_embedding_only={permissive.allow_embedding_only}"
)

# Fast - optimized for speed
fast = CitationConfig.fast()
print(f"Fast: top_k={fast.top_k}, max_candidates_total={fast.max_candidates_total}")

# Compare results
print("\n--- Results with different presets ---")
for name, config in [
    ("balanced", balanced),
    ("strict", strict),
    ("permissive", permissive),
    ("fast", fast),
]:
    results = align_citations(answer, sources, config=config)
    status = results[0].status if results else "no results"
    num_citations = len(results[0].citations) if results else 0
    print(f"{name:12s}: status={status}, citations={num_citations}")

Balanced: top_k=3, min_answer_coverage=0.2
Strict: top_k=2, min_answer_coverage=0.4, supported_threshold=0.7
Permissive: top_k=5, allow_embedding_only=True
Fast: top_k=1, max_candidates_total=100

--- Results with different presets ---
balanced    : status=supported, citations=1
strict      : status=partial, citations=1
permissive  : status=supported, citations=1
fast        : status=supported, citations=1


### 2.2 Key CitationConfig Options

The most commonly customized options (see docs for the full list).

In [5]:
from cite_right import CitationConfig, CitationWeights

config = CitationConfig(
    # Result limits
    top_k=3,  # Max citations per answer span
    max_citations_per_source=2,  # Max citations from same source
    # Quality thresholds
    min_answer_coverage=0.2,  # Min fraction of answer tokens matched
    supported_answer_coverage=0.6,  # Threshold for "supported" status
    # Embedding options (when using an embedder)
    allow_embedding_only=False,  # Allow citations with only semantic match
    # Multi-span evidence
    multi_span_evidence=False,  # Enable non-contiguous evidence spans
    # Score weights
    weights=CitationWeights(
        alignment=1.0,
        answer_coverage=1.0,
        lexical=0.5,
        embedding=0.5,
    ),
)

print(f"top_k={config.top_k}, supported_threshold={config.supported_answer_coverage}")

top_k=3, supported_threshold=0.6
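
The two coverage thresholds can be read as a three-way decision. The sketch below assumes the status depends on answer coverage alone. That is a simplification for illustration, and the library may combine further signals. `classify_span` is a hypothetical name.

```python
def classify_span(
    answer_coverage: float,
    min_answer_coverage: float = 0.2,
    supported_answer_coverage: float = 0.6,
) -> str:
    """Map an answer-coverage fraction to a status label (assumed rule)."""
    if answer_coverage >= supported_answer_coverage:
        return "supported"
    if answer_coverage >= min_answer_coverage:
        return "partial"
    return "unsupported"


for coverage in (0.727, 0.45, 0.1):
    print(f"{coverage:.1%} -> {classify_span(coverage)}")
```

Under this reading, raising `supported_answer_coverage` moves borderline spans from "supported" to "partial" without changing which citations are found.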


### 2.3 Multi-Span Evidence

When enabled, citations can include multiple non-contiguous evidence spans.

In [6]:
from cite_right import CitationConfig, SourceDocument, align_citations

# Source with relevant info spread across the text
sources = [
    SourceDocument(
        id="report",
        text="Revenue grew 15% this year. The company hired 500 new employees. Profits also increased by 20%.",
    )
]

answer = "Revenue grew 15% and profits increased by 20%."

# Enable multi-span evidence
config = CitationConfig(
    multi_span_evidence=True,
    multi_span_merge_gap_chars=16,  # Merge spans within 16 chars
    multi_span_max_spans=5,  # Allow up to 5 spans
)

results = align_citations(answer, sources, config=config)

for sc in results:
    print(f"Answer: {sc.answer_span.text!r}")
    for citation in sc.citations:
        print(f"  Number of evidence spans: {len(citation.evidence_spans)}")
        for i, span in enumerate(citation.evidence_spans):
            print(
                f"    Span {i + 1}: [{span.char_start}:{span.char_end}] {span.evidence!r}"
            )

Answer: 'Revenue grew 15% and profits increased by 20%.'
  Number of evidence spans: 1
    Span 1: [65:94] 'Profits also increased by 20%'
  Number of evidence spans: 1
    Span 1: [0:16] 'Revenue grew 15%'
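
To make the merge-gap idea concrete, here is a minimal interval-merging sketch. It assumes that two spans are merged when the gap between them is at most `max_gap` characters. The library's exact rule for `multi_span_merge_gap_chars` may differ, and `merge_spans` is a hypothetical helper.

```python
def merge_spans(spans: list[tuple[int, int]], max_gap: int) -> list[tuple[int, int]]:
    """Merge (start, end) spans whose gap is at most max_gap characters."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


far_apart = merge_spans([(0, 16), (65, 94)], max_gap=16)  # gap of 49 characters
close = merge_spans([(0, 16), (28, 40)], max_gap=16)      # gap of 12 characters
print(far_apart)  # [(0, 16), (65, 94)]
print(close)      # [(0, 40)]
```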


### 2.4 Backend Selection: Python vs Rust

Choose between pure Python or the faster Rust backend for alignment.

In [7]:
import time

from cite_right import SourceDocument, align_citations

sources = [
    SourceDocument(id="doc", text="The quick brown fox jumps over the lazy dog. " * 100)
]
answer = "The brown fox jumped over a lazy dog."

# Auto (default) - uses Rust if available, falls back to Python
start = time.perf_counter()
results_auto = align_citations(answer, sources, backend="auto")
print(f"Auto backend: {(time.perf_counter() - start) * 1000:.2f}ms")

# Force Python backend
start = time.perf_counter()
results_python = align_citations(answer, sources, backend="python")
print(f"Python backend: {(time.perf_counter() - start) * 1000:.2f}ms")

# Force Rust backend (will error if not installed)
try:
    start = time.perf_counter()
    results_rust = align_citations(answer, sources, backend="rust")
    print(f"Rust backend: {(time.perf_counter() - start) * 1000:.2f}ms")
except RuntimeError as e:
    print(f"Rust backend not available: {e}")

Auto backend: 3.02ms
Python backend: 4.59ms
Rust backend: 2.76ms
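
Single-shot timings like the ones above can include one-time costs such as lazy imports or cache warm-up, so whichever backend runs first may look slower than it is. A small sketch of a best-of-N timer using only the standard library follows. `best_of` is a hypothetical helper and the workload is a stand-in.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time in milliseconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return min(timings)


# Stand-in workload; in the notebook you would pass something like
# lambda: align_citations(answer, sources, backend="python")
elapsed_ms = best_of(lambda: sum(i * i for i in range(10_000)))
print(f"best of 5: {elapsed_ms:.2f}ms")
```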


### 2.5 Metrics Callback for Observability

Use `on_metrics` to receive detailed timing and statistics.

In [8]:
from cite_right import AlignmentMetrics, SourceDocument, align_citations

sources = [
    SourceDocument(id="doc1", text="Revenue grew 15% in Q4 2024."),
    SourceDocument(id="doc2", text="The company expanded into new markets."),
]
answer = "Revenue increased by 15%. The company expanded globally."


# Define a callback to receive metrics
def metrics_callback(metrics: AlignmentMetrics) -> None:
    print("=== Alignment Metrics ===")
    print(f"Total time: {metrics.total_time_ms:.2f}ms")
    print(f"Number of answer spans: {metrics.num_answer_spans}")
    print(f"Number of candidates: {metrics.num_candidates}")
    print(f"Number of alignments: {metrics.num_alignments}")
    print(f"Embedding time: {metrics.embedding_time_ms:.2f}ms")
    print(f"Alignment time: {metrics.alignment_time_ms:.2f}ms")


results = align_citations(answer, sources, on_metrics=metrics_callback)

=== Alignment Metrics ===
Total time: 0.14ms
Number of answer spans: 2
Number of candidates: 2
Number of alignments: 2
Embedding time: 0.00ms
Alignment time: 0.04ms
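
Printing inside the callback is fine for a demo. For monitoring, it is usually more useful to collect the metrics objects and aggregate them later. The sketch below assumes `on_metrics` accepts any callable, which the notebook only demonstrates for a plain function. `MetricsCollector` is a hypothetical class, and `SimpleNamespace` objects stand in for `AlignmentMetrics`.

```python
from types import SimpleNamespace


class MetricsCollector:
    """Callable that stores every metrics object it receives."""

    def __init__(self) -> None:
        self.records = []

    def __call__(self, metrics) -> None:
        self.records.append(metrics)

    def mean_total_ms(self) -> float:
        if not self.records:
            return 0.0
        return sum(m.total_time_ms for m in self.records) / len(self.records)


collector = MetricsCollector()
# Stand-in objects exposing the same attribute name as AlignmentMetrics;
# in practice you would pass on_metrics=collector to align_citations().
collector(SimpleNamespace(total_time_ms=0.14))
collector(SimpleNamespace(total_time_ms=0.26))
print(f"{collector.mean_total_ms():.2f}ms over {len(collector.records)} calls")
```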


---
## Part 3: Tokenizers
---

### 3.1 SimpleTokenizer (Default)

Rule-based tokenizer with number/currency/percent normalization.

In [9]:
from cite_right import SimpleTokenizer, TokenizerConfig

# Default configuration
tokenizer = SimpleTokenizer()
result = tokenizer.tokenize("Revenue grew 15% to $50M.")
print(f"Text: {result.text}")
print(f"Token IDs: {result.token_ids}")
print(f"Token spans: {result.token_spans}")

# Custom configuration - disable normalizations
config = TokenizerConfig(
    normalize_numbers=False,  # Keep "1,000" as-is instead of "1000"
    normalize_percent=False,  # Keep "%" instead of "percent"
    normalize_currency=False,  # Keep "$" instead of "dollar"
)
tokenizer_raw = SimpleTokenizer(config)
result_raw = tokenizer_raw.tokenize("Revenue grew 15% to $50M.")
print("\nWith normalization disabled:")
print(f"Token IDs: {result_raw.token_ids}")

Text: Revenue grew 15% to $50M.
Token IDs: [1, 2, 3, 4, 5, 6, 7, 8]
Token spans: [(0, 7), (8, 12), (13, 15), (15, 16), (17, 19), (20, 21), (21, 23), (23, 24)]

With normalization disabled:
Token IDs: [1, 2, 3, 4, 5, 6, 7, 8]
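
The token spans are character offsets into the original text, so slicing the text with each pair recovers the surface form of every token. This short check uses only the spans printed above and no library calls.

```python
text = "Revenue grew 15% to $50M."
# Spans copied from the "Token spans" output above.
token_spans = [(0, 7), (8, 12), (13, 15), (15, 16), (17, 19), (20, 21), (21, 23), (23, 24)]

tokens = [text[start:end] for start, end in token_spans]
print(tokens)  # ['Revenue', 'grew', '15', '%', 'to', '$', '50', 'M']
```

Note that the trailing period at offset 24 is not covered by any span in this output.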


### 3.2 TiktokenTokenizer (OpenAI)

Uses OpenAI's tiktoken for tokenization. Install with `pip install "cite-right[tiktoken]"`.

In [10]:
try:
    from cite_right.text.tokenizer_tiktoken import TiktokenTokenizer

    tokenizer = TiktokenTokenizer("cl100k_base")
    result = tokenizer.tokenize("Hello, world! This is a test.")
    print("Encoding: cl100k_base")
    print(f"Token IDs: {result.token_ids}")
    print(f"Token spans: {result.token_spans}")

except ImportError:
    print(
        "TiktokenTokenizer not available. Install with: pip install cite-right[tiktoken]"
    )

Encoding: cl100k_base
Token IDs: [9906, 11, 1917, 0, 1115, 374, 264, 1296, 13]
Token spans: [(0, 5), (5, 6), (6, 12), (12, 13), (13, 18), (18, 21), (21, 23), (23, 28), (28, 29)]


### 3.3 HuggingFaceTokenizer

Use any HuggingFace tokenizer. Install with `pip install "cite-right[huggingface]"`.

In [11]:
try:
    from cite_right import SourceDocument, align_citations
    from cite_right.text.tokenizer_huggingface import HuggingFaceTokenizer

    # Load tokenizer from HuggingFace Hub using from_pretrained()
    # Note: Don't pass a string to the constructor - use from_pretrained()!
    tokenizer = HuggingFaceTokenizer.from_pretrained(
        "Qwen/Qwen3-235B-A22B-Thinking-2507-FP8"
    )
    result = tokenizer.tokenize("Hello, world! This is a test.")
    print("Model: Qwen/Qwen3-235B-A22B-Thinking-2507-FP8")
    print(f"Token IDs: {result.token_ids}")
    print(f"Token spans: {result.token_spans}")

    # Works with any HuggingFace model
    # tokenizer = HuggingFaceTokenizer.from_pretrained("gpt2")
    # tokenizer = HuggingFaceTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Use with align_citations
    sources = [SourceDocument(id="doc", text="The quick brown fox jumps.")]
    answer = "The brown fox jumped."
    results = align_citations(answer, sources, tokenizer=tokenizer)
    print(f"\nAlignment status: {results[0].status}")

except ImportError:
    print(
        "HuggingFaceTokenizer not available. Install with: pip install cite-right[huggingface]"
    )

Model: Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
Token IDs: [9707, 11, 1879, 0, 1096, 374, 264, 1273, 13]
Token spans: [(0, 5), (5, 6), (6, 12), (12, 13), (13, 18), (18, 21), (21, 23), (23, 28), (28, 29)]

Alignment status: supported


---
## Part 4: Segmenters
---

### 4.1 SimpleSegmenter (Default)

Rule-based sentence segmentation.

In [12]:
from cite_right.text.segmenter_simple import SimpleSegmenter

segmenter = SimpleSegmenter()
text = "Dr. Smith went to the store. He bought milk. The price was $3.50."

segments = segmenter.segment(text)
for seg in segments:
    print(f"[{seg.doc_char_start}:{seg.doc_char_end}] {seg.text!r}")

[0:3] 'Dr.'
[4:28] 'Smith went to the store.'
[29:44] 'He bought milk.'
[45:65] 'The price was $3.50.'
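
The stray `'Dr.'` segment above is characteristic of splitting on punctuation alone. The following sketch is a naive regex splitter, written for illustration and not the library's implementation, that shows why an abbreviation followed by a space looks like a sentence end while the decimal point in `$3.50` does not.

```python
import re


def naive_segment(text: str) -> list[tuple[int, int, str]]:
    """Split after '.', '!' or '?' when followed by whitespace.

    Returns (char_start, char_end, sentence) triples.
    """
    segments = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s)", text):
        end = match.end()
        segments.append((start, end, text[start:end]))
        # Skip the whitespace that separates sentences.
        start = end
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):
        segments.append((start, len(text), text[start:]))
    return segments


text = "Dr. Smith went to the store. He bought milk. The price was $3.50."
for char_start, char_end, sentence in naive_segment(text):
    print(f"[{char_start}:{char_end}] {sentence!r}")
```

Handling abbreviations correctly requires either an abbreviation list or a statistical model, which is what the other segmenters in this part provide.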


### 4.2 PySBDSegmenter

Fast, rule-based sentence boundary detection. Install with `pip install "cite-right[pysbd]"`.

In [13]:
try:
    from cite_right import PySBDSegmenter

    # English (default)
    segmenter = PySBDSegmenter(language="en")
    text = "Dr. Smith went to the store. He bought milk. The price was $3.50."

    segments = segmenter.segment(text)
    print("PySBD Segmentation:")
    for seg in segments:
        print(f"  [{seg.doc_char_start}:{seg.doc_char_end}] {seg.text!r}")

    # Multi-language support
    # german_segmenter = PySBDSegmenter(language="de")
    # spanish_segmenter = PySBDSegmenter(language="es")

except RuntimeError:
    print("PySBDSegmenter not available. Install with: pip install cite-right[pysbd]")

PySBD Segmentation:
  [0:28] 'Dr. Smith went to the store.'
  [29:44] 'He bought milk.'
  [45:65] 'The price was $3.50.'


### 4.3 SpacySegmenter

spaCy-based segmentation with clause splitting. Install with `pip install "cite-right[spacy]"`.

In [14]:
try:
    from cite_right import PySBDSegmenter, SpacySegmenter

    # Text with a conjunction that can be split at clause level
    text = "Dr. Smith went to the store. He bought milk, and he paid $3.50."

    # PySBD: Sentence-level only (no clause splitting)
    pysbd = PySBDSegmenter()
    segments = pysbd.segment(text)
    print("PySBD (sentences only - no clause splitting):")
    for seg in segments:
        print(f"  [{seg.doc_char_start}:{seg.doc_char_end}] {seg.text!r}")

    # SpacySegmenter: Splits at clause conjunctions ("and", "or", "but")
    spacy_seg = SpacySegmenter(model="en_core_web_sm")
    segments = spacy_seg.segment(text)
    print("\nspaCy (with clause splitting at conjunctions):")
    for seg in segments:
        print(f"  [{seg.doc_char_start}:{seg.doc_char_end}] {seg.text!r}")

    print("\n→ Notice SpacySegmenter splits 'He bought milk, and he paid $3.50'")
    print("  into two clauses at the 'and' conjunction!")

except RuntimeError as e:
    print(f"Segmenter not available: {e}")

PySBD (sentences only - no clause splitting):
  [0:28] 'Dr. Smith went to the store.'
  [29:63] 'He bought milk, and he paid $3.50.'

spaCy (with clause splitting at conjunctions):
  [0:28] 'Dr. Smith went to the store.'
  [29:44] 'He bought milk,'
  [49:63] 'he paid $3.50.'

→ Notice SpacySegmenter splits 'He bought milk, and he paid $3.50'
  into two clauses at the 'and' conjunction!


### 4.4 Answer Segmenters

Control how the answer text is split into spans.

In [15]:
from cite_right.text.answer_segmenter import SimpleAnswerSegmenter

# Text with conjunctions to demonstrate clause splitting
answer = """Revenue grew 15% in Q4, and profits also increased.

The company expanded into new markets, but they faced challenges."""

# SimpleAnswerSegmenter: sentence-level only
simple = SimpleAnswerSegmenter()
spans = simple.segment(answer)
print("SimpleAnswerSegmenter (sentence-level):")
for span in spans:
    print(f"  [{span.sentence_index}] {span.text!r}")

# SpacyAnswerSegmenter with clause splitting: breaks at "and", "but"
try:
    from cite_right import SpacyAnswerSegmenter

    spacy_clause = SpacyAnswerSegmenter(model="en_core_web_sm", split_clauses=True)
    spans = spacy_clause.segment(answer)
    print("\nSpacyAnswerSegmenter (with clause splitting):")
    for span in spans:
        print(f"  [{span.sentence_index}] {span.text!r}")

    print("\n→ Notice SpacyAnswerSegmenter splits at 'and' and 'but' conjunctions!")

except RuntimeError:
    print("\nSpacyAnswerSegmenter not available")

SimpleAnswerSegmenter (sentence-level):
  [0] 'Revenue grew 15% in Q4, and profits also increased.'
  [1] 'The company expanded into new markets, but they faced challenges.'

SpacyAnswerSegmenter (with clause splitting):
  [0] 'Revenue grew 15% in Q4,'
  [1] 'profits also increased.'
  [2] 'The company expanded into new markets,'
  [3] 'they faced challenges.'

→ Notice SpacyAnswerSegmenter splits at 'and' and 'but' conjunctions!


---
## Part 5: Embeddings
---

### 5.1 SentenceTransformerEmbedder

Embeddings help find citations when the answer paraphrases the source. Install with `pip install "cite-right[embeddings]"`.

In [16]:
try:
    from cite_right import (
        CitationConfig,
        SentenceTransformerEmbedder,
        SourceDocument,
        align_citations,
    )

    embedder = SentenceTransformerEmbedder("all-MiniLM-L6-v2")

    # Semantically similar but lexically different (paraphrased)
    answer = "The method is computationally efficient and uses minimal resources."
    sources = [
        SourceDocument(
            id="paper",
            text="Our approach requires very few computing cycles and has low memory overhead.",
        )
    ]

    # Without embeddings - low lexical overlap means no match
    results_no_embed = align_citations(answer, sources)
    print(f"Without embeddings: status={results_no_embed[0].status}")
    print(f"  Citations found: {len(results_no_embed[0].citations)}")

    # With embeddings + allow_embedding_only - semantic similarity finds the match
    config = CitationConfig(allow_embedding_only=True, min_embedding_similarity=0.3)
    results_with_embed = align_citations(
        answer, sources, embedder=embedder, config=config
    )
    print(f"\nWith embeddings: status={results_with_embed[0].status}")
    if results_with_embed[0].citations:
        c = results_with_embed[0].citations[0]
        print(f"  Embedding score: {c.components.get('embedding_score', 0):.3f}")
        print(f"  Evidence: {c.evidence!r}")

except ImportError:
    print("Embeddings not available. Install with: pip install cite-right[embeddings]")

Without embeddings: status=unsupported
  Citations found: 0

With embeddings: status=partial
  Embedding score: 0.591
  Evidence: 'Our approach requires very few computing cycles and has low memory overhead.'
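
For intuition about a threshold such as `min_embedding_similarity`, the sketch below computes cosine similarity between two toy vectors in plain Python. It assumes the embedding score is a cosine similarity, which the notebook does not state, and the vectors are invented for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; real sentence embeddings have hundreds of dimensions.
answer_vec = [1.0, 2.0, 0.0]
source_vec = [2.0, 1.0, 1.0]
similarity = cosine_similarity(answer_vec, source_vec)

min_embedding_similarity = 0.3
print(f"{similarity:.3f}", similarity >= min_embedding_similarity)  # 0.730 True
```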


---
## Part 6: Convenience Functions
---

### 6.1 Quick Groundedness Checks

In [17]:
from cite_right import is_grounded, is_hallucinated

sources = ["Global temperatures have risen by 1.1°C since pre-industrial times."]

# Well-grounded answer
grounded = "Global temperatures have risen by 1.1°C."
print(f"'{grounded}'")
print(f"  is_grounded(threshold=0.5): {is_grounded(grounded, sources, threshold=0.5)}")
print(
    f"  is_hallucinated(threshold=0.3): {is_hallucinated(grounded, sources, threshold=0.3)}"
)

# Potentially hallucinated answer
hallucinated = "The ice caps will completely melt by 2030."
print(f"\n'{hallucinated}'")
print(
    f"  is_grounded(threshold=0.5): {is_grounded(hallucinated, sources, threshold=0.5)}"
)
print(
    f"  is_hallucinated(threshold=0.3): {is_hallucinated(hallucinated, sources, threshold=0.3)}"
)

'Global temperatures have risen by 1.1°C.'
  is_grounded(threshold=0.5): True
  is_hallucinated(threshold=0.3): False

'The ice caps will completely melt by 2030.'
  is_grounded(threshold=0.5): False
  is_hallucinated(threshold=0.3): True


### 6.2 Detailed Groundedness Metrics with `check_groundedness()`

In [18]:
from cite_right import HallucinationConfig, check_groundedness

answer = (
    "Revenue grew 15% in Q4 2024. "
    "The company plans to expand into 5 new markets. "
    "Profit margins improved significantly. "
    "The CEO will retire next year."
)

sources = [
    "Annual report: Revenue grew 15% in Q4 2024, driven by strong holiday sales.",
    "Press release: The company announced plans to expand into 5 new markets next year.",
]

# Custom hallucination config
hallucination_config = HallucinationConfig(
    weak_citation_threshold=0.4,  # Below this is "weak" evidence
    include_partial_in_grounded=True,  # Count partial matches in groundedness
)

metrics = check_groundedness(answer, sources, hallucination_config=hallucination_config)

print("=== Groundedness Metrics ===")
print(f"Groundedness Score: {metrics.groundedness_score:.1%}")
print(f"Hallucination Rate: {metrics.hallucination_rate:.1%}")
print(f"Average Confidence: {metrics.avg_confidence:.1%}")
print(f"Minimum Confidence: {metrics.min_confidence:.1%}")

print("\nSpan breakdown:")
print(f"  Total spans: {metrics.num_spans}")
print(f"  Supported: {metrics.num_supported} ({metrics.supported_ratio:.1%})")
print(f"  Partial: {metrics.num_partial} ({metrics.partial_ratio:.1%})")
print(f"  Unsupported: {metrics.num_unsupported} ({metrics.unsupported_ratio:.1%})")
print(f"  Weak citations: {metrics.num_weak_citations}")

if metrics.unsupported_spans:
    print("\nUnsupported spans:")
    for span in metrics.unsupported_spans:
        print(f"  - {span.text!r}")

print("\nPer-span details:")
for sc in metrics.span_confidences:
    print(f"  [{sc.status:11s}] conf={sc.confidence:.2f} {sc.span.text[:50]!r}...")

=== Groundedness Metrics ===
Groundedness Score: 59.4%
Hallucination Rate: 40.6%
Average Confidence: 58.3%
Minimum Confidence: 0.0%

Span breakdown:
  Total spans: 4
  Supported: 2 (52.4%)
  Partial: 1 (21.0%)
  Unsupported: 1 (26.6%)
  Weak citations: 1

Unsupported spans:
  - 'Profit margins improved significantly.'

Per-span details:
  [supported  ] conf=1.00 'Revenue grew 15% in Q4 2024.'...
  [supported  ] conf=1.00 'The company plans to expand into 5 new markets.'...
  [unsupported] conf=0.00 'Profit margins improved significantly.'...
  [partial    ] conf=0.33 'The CEO will retire next year.'...
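
The ratios above are not simple span counts: 2 of 4 spans are supported, yet the supported ratio is 52.4%. The printed figures are consistent with weighting each span by its character length. The sketch below reproduces them under that assumption, which is inferred from this output and not confirmed by the library's documentation.

```python
# (text, status, confidence) copied from the per-span details above;
# the partial span's confidence prints as 0.33 and is taken to be 1/3.
spans = [
    ("Revenue grew 15% in Q4 2024.", "supported", 1.0),
    ("The company plans to expand into 5 new markets.", "supported", 1.0),
    ("Profit margins improved significantly.", "unsupported", 0.0),
    ("The CEO will retire next year.", "partial", 1 / 3),
]

total_chars = sum(len(text) for text, _, _ in spans)


def ratio(status: str) -> float:
    """Share of answer characters that belong to spans with this status."""
    return sum(len(t) for t, s, _ in spans if s == status) / total_chars


# Length-weighted mean confidence, and the plain mean over spans.
groundedness = sum(len(t) * c for t, _, c in spans) / total_chars
avg_confidence = sum(c for _, _, c in spans) / len(spans)

print(f"supported={ratio('supported'):.1%} partial={ratio('partial'):.1%}")
print(f"groundedness={groundedness:.1%} hallucination={1 - groundedness:.1%}")
print(f"avg_confidence={avg_confidence:.1%}")
```

One practical consequence of length weighting is that a long unsupported sentence lowers the score more than a short one.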


### 6.3 Annotating Answers with Citations

In [19]:
from cite_right import SourceDocument, annotate_answer

sources = [
    SourceDocument(id="report", text="Revenue grew 15% in Q4 2024."),
    SourceDocument(id="press", text="The company will expand into 5 new markets."),
]

answer = (
    "Revenue grew 15% in Q4. "
    "The company will expand into 5 new markets. "
    "Stock prices are expected to rise."
)

# Different annotation formats
print("Markdown format (default):")
print(annotate_answer(answer, sources, format="markdown"))

print("\nSuperscript format:")
print(annotate_answer(answer, sources, format="superscript"))

print("\nFootnote format:")
print(annotate_answer(answer, sources, format="footnote"))

print("\nWithout unsupported markers:")
print(annotate_answer(answer, sources, format="markdown", include_unsupported=False))

Markdown format (default):
Revenue grew 15% in Q4.[1] The company will expand into 5 new markets.[2] Stock prices are expected to rise.[?]

Superscript format:
Revenue grew 15% in Q4.^1 The company will expand into 5 new markets.^2 Stock prices are expected to rise.[?]

Footnote format:
Revenue grew 15% in Q4.[^1] The company will expand into 5 new markets.[^2] Stock prices are expected to rise.[?]

Without unsupported markers:
Revenue grew 15% in Q4.[1] The company will expand into 5 new markets.[2] Stock prices are expected to rise.
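
If you need a marker style that is not built in, the same result can be assembled from span end offsets. This is a minimal sketch of the insertion step only, in plain Python. `insert_markers` is a hypothetical helper, and the offsets are located by string search here where real code would take them from `answer_span.char_end`.

```python
def insert_markers(answer: str, markers: list[tuple[int, str]]) -> str:
    """Insert each marker at its character offset in `answer`."""
    annotated = answer
    # Work right to left so offsets not yet processed stay valid.
    for offset, marker in sorted(markers, reverse=True):
        annotated = annotated[:offset] + marker + annotated[offset:]
    return annotated


answer = (
    "Revenue grew 15% in Q4. "
    "The company will expand into 5 new markets. "
    "Stock prices are expected to rise."
)
sentences = [
    ("Revenue grew 15% in Q4.", "[1]"),
    ("The company will expand into 5 new markets.", "[2]"),
    ("Stock prices are expected to rise.", "[?]"),
]
# End offset of each sentence, located by plain string search.
markers = [(answer.index(s) + len(s), m) for s, m in sentences]
annotated = insert_markers(answer, markers)
print(annotated)
```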


### 6.4 Citation Summary

In [20]:
from cite_right import SourceDocument, align_citations, get_citation_summary

sources = [
    SourceDocument(
        id="report_q4", text="Q4 revenue reached $50M, up 15% from last year."
    ),
    SourceDocument(
        id="press_release", text="The company announced 3 new product lines."
    ),
]

answer = (
    "Revenue in Q4 was $50M, a 15% increase. "
    "Three new products were announced. "
    "The CEO expressed optimism about the future."
)

results = align_citations(answer, sources)
summary = get_citation_summary(results)
print(summary)

Citation Summary:
- 0 of 3 spans fully supported
- 2 spans partially supported
- 1 spans unsupported
- Sources cited: press_release, report_q4


---
## Part 7: Fact Verification
---

### 7.1 Basic Fact Verification with `verify_facts()`

Decompose answers into atomic claims and verify each one.

In [21]:
from cite_right import FactVerificationConfig, SourceDocument, verify_facts

sources = [
    SourceDocument(
        id="report", text="Revenue grew 15% in Q4 2024. Profits increased by 20%."
    ),
    SourceDocument(id="press", text="The company will expand into Europe next year."),
]

answer = (
    "Revenue grew 15% in Q4. "
    "Profits increased by 20%. "
    "The company plans to enter the Asian market."
)

# Configure verification thresholds
config = FactVerificationConfig(
    verified_coverage_threshold=0.6,  # Coverage needed for "verified" status
    partial_coverage_threshold=0.3,  # Coverage needed for "partial" status
)

metrics = verify_facts(answer, sources, config=config)

print("=== Fact Verification Results ===")
print(f"Total claims: {metrics.num_claims}")
print(f"Verified: {metrics.num_verified}")
print(f"Partial: {metrics.num_partial}")
print(f"Unverified: {metrics.num_unverified}")
print(f"Verification rate: {metrics.verification_rate:.1%}")
print(f"Average confidence: {metrics.avg_confidence:.1%}")

print("\nPer-claim details:")
for v in metrics.claim_verifications:
    print(f'  [{v.status:10s}] conf={v.confidence:.2f} "{v.claim.text}"')
    if v.source_ids:
        print(f"              sources: {v.source_ids}")

=== Fact Verification Results ===
Total claims: 3
Verified: 2
Partial: 0
Unverified: 1
Verification rate: 66.7%
Average confidence: 75.0%

Per-claim details:
  [verified  ] conf=1.00 "Revenue grew 15% in Q4."
              sources: ['report']
  [verified  ] conf=1.00 "Profits increased by 20%."
              sources: ['report']
  [unverified] conf=0.25 "The company plans to enter the Asian market."
              sources: ['press']
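
The summary figures follow from the per-claim details by simple unweighted arithmetic, as this short check shows. It uses only the values printed above. Treating both figures as plain means over claims is an inference from this output.

```python
# (status, confidence) pairs copied from the per-claim details above.
claims = [("verified", 1.00), ("verified", 1.00), ("unverified", 0.25)]

num_verified = sum(1 for status, _ in claims if status == "verified")
verification_rate = num_verified / len(claims)
avg_confidence = sum(conf for _, conf in claims) / len(claims)

print(f"Verification rate: {verification_rate:.1%}")   # 66.7%
print(f"Average confidence: {avg_confidence:.1%}")     # 75.0%
```

Note that the third claim still lists `press` as a source even though it is unverified: a weak match was found, but its confidence of 0.25 is below the 0.3 configured as `partial_coverage_threshold`.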


### 7.2 Claim Decomposition

Break sentences into atomic claims for finer-grained verification.

In [22]:
from cite_right import SimpleClaimDecomposer
from cite_right.core.results import AnswerSpan

# Simple decomposer (treats each span as one claim)
simple_decomposer = SimpleClaimDecomposer()
span = AnswerSpan(
    text="Revenue grew 15% and profits increased by 20%",
    char_start=0,
    char_end=45,
)

claims = simple_decomposer.decompose(span)
print(f"SimpleClaimDecomposer: {len(claims)} claim(s)")
for c in claims:
    print(f"  [{c.char_start}:{c.char_end}] {c.text!r}")

# spaCy-based decomposer (splits on conjunctions)
try:
    from cite_right import SpacyClaimDecomposer

    spacy_decomposer = SpacyClaimDecomposer(
        model="en_core_web_sm",
        min_claim_tokens=2,  # Minimum tokens for a valid claim
    )

    claims = spacy_decomposer.decompose(span)
    print(f"\nSpacyClaimDecomposer: {len(claims)} claim(s)")
    for c in claims:
        print(f"  [{c.char_start}:{c.char_end}] {c.text!r}")

except RuntimeError as e:
    print(f"\nSpacyClaimDecomposer not available: {e}")

SimpleClaimDecomposer: 1 claim(s)
  [0:45] 'Revenue grew 15% and profits increased by 20%'

SpacyClaimDecomposer: 2 claim(s)
  [0:16] 'Revenue grew 15%'
  [21:45] 'profits increased by 20%'
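
To see where those offsets come from, the sketch below splits on coordinating conjunctions with a regular expression and keeps character offsets. It is a deliberately naive stand-in, not how `SpacyClaimDecomposer` works, and `split_on_conjunctions` is a hypothetical name.

```python
import re


def split_on_conjunctions(
    text: str, conjunctions: tuple[str, ...] = ("and", "but", "or")
) -> list[tuple[int, int, str]]:
    """Split at whitespace-delimited conjunctions; return (start, end, claim)."""
    pattern = r"\s+(?:" + "|".join(map(re.escape, conjunctions)) + r")\s+"
    claims = []
    start = 0
    for match in re.finditer(pattern, text):
        claims.append((start, match.start(), text[start:match.start()]))
        start = match.end()
    claims.append((start, len(text), text[start:]))
    return claims


text = "Revenue grew 15% and profits increased by 20%"
claims = split_on_conjunctions(text)
for char_start, char_end, claim in claims:
    print(f"[{char_start}:{char_end}] {claim!r}")
```

A purely lexical split like this also breaks noun phrases such as "research and development", which is one reason to prefer a parser-based decomposer for real text.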


---
## Part 8: Framework Integrations
---

### 8.1 LangChain Integration

In [23]:
from cite_right import (
    align_citations,
    from_langchain_chunks,
    from_langchain_documents,
    is_langchain_available,
)

print(f"LangChain available: {is_langchain_available()}")

if is_langchain_available():
    from langchain_core.documents import Document

    # Simulate LangChain documents from a retriever
    lc_docs = [
        Document(
            page_content="Revenue grew 15% in Q4.", metadata={"source": "report.pdf"}
        ),
        Document(
            page_content="The company expanded globally.",
            metadata={"source": "news.txt"},
        ),
    ]

    # Convert to cite-right format
    sources = from_langchain_documents(lc_docs, id_key="source")
    print(f"Converted {len(sources)} documents")

    # Use with align_citations
    results = align_citations("Revenue increased by 15%.", sources)
    print(f"Status: {results[0].status}")

    # For chunked documents with offsets
    lc_chunks = [
        Document(
            page_content="Revenue grew 15%.",
            metadata={"source": "report.pdf", "start_index": 100, "end_index": 117},
        ),
    ]
    chunk_sources = from_langchain_chunks(
        lc_chunks,
        id_key="source",
        start_key="start_index",
        end_key="end_index",
    )
    print(f"Chunk source_id: {chunk_sources[0].source_id}")
    print(f"Chunk doc_char_start: {chunk_sources[0].doc_char_start}")
else:
    print('Install with: pip install "cite-right[langchain]"')

LangChain available: True
Converted 2 documents
Status: partial
Chunk source_id: report.pdf
Chunk doc_char_start: 100
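
The `doc_char_start` value printed above indicates that a chunk source records where the chunk begins in its parent document. A minimal sketch of the offset arithmetic this makes possible, in plain Python with no cite-right or LangChain calls: the helper `to_document_offsets` is hypothetical, and the sketch assumes offsets are plain character indices into the parent document.

```python
def to_document_offsets(
    doc_char_start: int, local_start: int, local_end: int
) -> tuple[int, int]:
    """Shift a span measured inside a chunk into parent-document coordinates.

    Hypothetical helper for illustration only; not part of cite-right.
    """
    return doc_char_start + local_start, doc_char_start + local_end


chunk_text = "Revenue grew 15%."
doc_char_start = 100  # same value as "start_index" in the chunk metadata above

# The chunk is 17 characters long, so it ends at 117, matching "end_index".
chunk_end = doc_char_start + len(chunk_text)
print(f"Chunk covers [{doc_char_start}:{chunk_end}]")

# A span over "Revenue grew 15%" (chunk-local [0:16]) in document coordinates.
doc_span = to_document_offsets(doc_char_start, 0, 16)
print(f"Document span: {doc_span}")
```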


### 8.2 LlamaIndex Integration

In [24]:
from cite_right import (
    align_citations,
    from_llamaindex_chunks,
    from_llamaindex_nodes,
    is_llamaindex_available,
)

print(f"LlamaIndex available: {is_llamaindex_available()}")

if is_llamaindex_available():
    from llama_index.core.schema import TextNode

    # Simulate LlamaIndex nodes from a retriever
    nodes = [
        TextNode(text="Revenue grew 15% in Q4.", metadata={"file_name": "report.pdf"}),
        TextNode(
            text="The company expanded globally.", metadata={"file_name": "news.txt"}
        ),
    ]

    # Convert to cite-right format
    sources = from_llamaindex_nodes(nodes, id_key="file_name")
    print(f"Converted {len(sources)} nodes")

    # Use with align_citations
    results = align_citations("Revenue increased by 15%.", sources)
    print(f"Status: {results[0].status}")

    # For nodes with character offsets
    chunked_nodes = [
        TextNode(
            text="Revenue grew 15%.",
            metadata={
                "file_name": "report.pdf",
                "start_char_idx": 100,
                "end_char_idx": 117,
            },
        ),
    ]
    chunk_sources = from_llamaindex_chunks(
        chunked_nodes,
        id_key="file_name",
        start_key="start_char_idx",
        end_key="end_char_idx",
    )
    print(f"Chunk source_id: {chunk_sources[0].source_id}")
else:
    print('Install with: pip install "cite-right[llamaindex]"')

LlamaIndex available: True
Converted 2 nodes
Status: partial
Chunk source_id: report.pdf


---
## Part 9: Understanding Citation Scores
---

### 9.1 Score Components Explained

In [25]:
from cite_right import SourceDocument, align_citations

sources = [
    SourceDocument(id="exact", text="Revenue increased by exactly 15.3% in Q4 2024."),
    SourceDocument(
        id="partial", text="Company financials showed growth in the fourth quarter."
    ),
]

answer = "Revenue increased by 15.3% in Q4 2024."
results = align_citations(answer, sources)

print(f"Answer: {results[0].answer_span.text!r}\n")

for citation in results[0].citations:
    print(f"Citation from '{citation.source_id}':")
    print(f"  Final score: {citation.score:.3f}")
    print(f"  Evidence: {citation.evidence!r}")
    print("  \nScore components:")

    c = citation.components
    print(
        f"    alignment_score: {c.get('alignment_score', 0):.0f} (raw Smith-Waterman score)"
    )
    print(
        f"    normalized_alignment: {c.get('normalized_alignment', 0):.3f} (alignment / max possible)"
    )
    print(f"    matches: {c.get('matches', 0):.0f} (number of matched tokens)")
    print(
        f"    answer_coverage: {c.get('answer_coverage', 0):.1%} (fraction of answer matched)"
    )
    print(
        f"    evidence_coverage: {c.get('evidence_coverage', 0):.1%} (fraction of evidence matched)"
    )
    print(f"    lexical_score: {c.get('lexical_score', 0):.3f} (IDF-weighted overlap)")
    print(f"    embedding_score: {c.get('embedding_score', 0):.3f} (cosine similarity)")
    print(
        f"    embedding_only: {c.get('embedding_only', 0):.0f} (1 if no lexical match)"
    )
    print()

Answer: 'Revenue increased by 15.3% in Q4 2024.'

Citation from 'exact':
  Final score: 2.438
  Evidence: 'Revenue increased by exactly 15.3% in Q4 2024'
  
Score components:
    alignment_score: 15 (raw Smith-Waterman score)
    normalized_alignment: 0.938 (alignment / max possible)
    matches: 8 (number of matched tokens)
    answer_coverage: 100.0% (fraction of answer matched)
    evidence_coverage: 88.9% (fraction of evidence matched)
    lexical_score: 1.000 (IDF-weighted overlap)
    embedding_score: 0.000 (cosine similarity)
    embedding_only: 0 (1 if no lexical match)
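
The printed components can be tied back to the final score with a little arithmetic. The sketch below is a reconstruction, not cite-right's actual implementation: it assumes the final score is a plain weighted sum of `normalized_alignment`, `answer_coverage`, `lexical_score`, and `embedding_score`, with weights 1.0, 1.0, 0.5, and 0.5, and with `evidence_coverage` given no weight. The helper `weighted_score` is hypothetical. Under those assumptions the values printed for the `exact` citation add up to the reported 2.438.

```python
def weighted_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Sum each component multiplied by its weight.

    Hypothetical helper for illustration only; not part of cite-right.
    """
    return sum(weights[name] * components[name] for name in weights)


# Component values for the 'exact' citation, copied from the output above.
components = {
    "normalized_alignment": 0.938,
    "answer_coverage": 1.0,
    "evidence_coverage": 0.889,  # printed above, but unweighted in this sketch
    "lexical_score": 1.0,
    "embedding_score": 0.0,
}

# Assumed weights; evidence_coverage is deliberately left out.
weights = {
    "normalized_alignment": 1.0,
    "answer_coverage": 1.0,
    "lexical_score": 0.5,
    "embedding_score": 0.5,
}

score = weighted_score(components, weights)
print(f"Reconstructed score: {score:.3f}")
```

Because `embedding_score` is 0.000 here, the embedding weight has no effect on this particular result.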



### 9.2 Customizing Score Weights

In [26]:
from cite_right import CitationConfig, CitationWeights, SourceDocument, align_citations

sources = [
    SourceDocument(id="doc", text="The quick brown fox jumps over the lazy dog.")
]
answer = "The brown fox jumped over a lazy dog."

# Default weights: alignment=1.0, answer_coverage=1.0, lexical=0.5, embedding=0.5
default_weights = CitationWeights()

# Emphasize lexical matching
lexical_weights = CitationWeights(
    alignment=0.5,
    answer_coverage=0.5,
    evidence_coverage=0.0,
    lexical=2.0,  # Raise the lexical weight to 4x its 0.5 default
    embedding=0.0,
)

# Emphasize coverage
coverage_weights = CitationWeights(
    alignment=0.5,
    answer_coverage=2.0,  # Double weight for coverage
    evidence_coverage=1.0,
    lexical=0.5,
    embedding=0.5,
)

for name, weights in [
    ("default", default_weights),
    ("lexical", lexical_weights),
    ("coverage", coverage_weights),
]:
    config = CitationConfig(weights=weights)
    results = align_citations(answer, sources, config=config)
    if results[0].citations:
        score = results[0].citations[0].score
        print(f"{name:10s} weights: final_score={score:.3f}")

default    weights: final_score=1.688
lexical    weights: final_score=2.156
coverage   weights: final_score=2.823


---
## Part 10: Full Example - RAG Post-Processing Pipeline
---

In [27]:
from cite_right import (
    CitationConfig,
    SourceDocument,
    align_citations,
    annotate_answer,
    check_groundedness,
)

# Simulate a RAG pipeline output
retrieved_sources = [
    SourceDocument(
        id="gepa_intro",
        text="We introduce GEPA (Genetic-Pareto), a reflective prompt optimizer "
        "that merges textual reflection with multi-objective evolutionary search.",
    ),
    SourceDocument(
        id="grpo_comparison",
        text="On Qwen3 8B, GEPA outperforms GRPO by up to 19% while requiring "
        "up to 35x fewer rollouts.",
    ),
    SourceDocument(
        id="mipro_comparison",
        text="GEPA surpasses MIPROv2 on every benchmark, obtaining aggregate "
        "optimization gains of +14%, more than doubling MIPROv2's +7%.",
    ),
]

llm_answer = (
    "GEPA is a new prompt optimization method that combines reflection with evolutionary search. "
    "It achieves 19% better performance than GRPO while using 35x fewer samples. "
    "Compared to MIPROv2, GEPA achieves +14% optimization gains. "
    "The authors expect it to revolutionize the field."
)

print("=" * 60)
print("RAG POST-PROCESSING PIPELINE")
print("=" * 60)

# Step 1: Check overall groundedness
print("\n📊 Step 1: Groundedness Check")
metrics = check_groundedness(llm_answer, retrieved_sources)
print(f"   Groundedness: {metrics.groundedness_score:.1%}")
print(f"   Hallucination rate: {metrics.hallucination_rate:.1%}")
print(
    f"   Supported/Partial/Unsupported: {metrics.num_supported}/{metrics.num_partial}/{metrics.num_unsupported}"
)

# Step 2: Identify problematic spans
print("\n⚠️  Step 2: Problematic Spans")
if metrics.unsupported_spans:
    for span in metrics.unsupported_spans:
        print(f"   UNSUPPORTED: {span.text!r}")
if metrics.weakly_supported_spans:
    for span in metrics.weakly_supported_spans:
        print(f"   WEAK: {span.text!r}")

# Step 3: Generate annotated answer
print("\n📝 Step 3: Annotated Answer")
annotated = annotate_answer(llm_answer, retrieved_sources)
print(f"   {annotated}")

# Step 4: Detailed citation breakdown
print("\n📋 Step 4: Citation Details")
results = align_citations(llm_answer, retrieved_sources)
for i, sc in enumerate(results):
    print(f"\n   Sentence {i + 1}: {sc.answer_span.text[:60]}...")
    print(f"   Status: {sc.status}")
    if sc.citations:
        best = sc.citations[0]
        print(f"   Best source: {best.source_id}")
        print(f"   Coverage: {best.components.get('answer_coverage', 0):.1%}")

RAG POST-PROCESSING PIPELINE

📊 Step 1: Groundedness Check
   Groundedness: 31.1%
   Hallucination rate: 68.9%
   Supported/Partial/Unsupported: 1/2/1

⚠️  Step 2: Problematic Spans
   UNSUPPORTED: 'The authors expect it to revolutionize the field.'
   WEAK: 'It achieves 19% better performance than GRPO while using 35x fewer samples.'
   WEAK: 'Compared to MIPROv2, GEPA achieves +14% optimization gains.'

📝 Step 3: Annotated Answer
   GEPA is a new prompt optimization method that combines reflection with evolutionary search.[1] It achieves 19% better performance than GRPO while using 35x fewer samples.[2] Compared to MIPROv2, GEPA achieves +14% optimization gains.[3] The authors expect it to revolutionize the field.[?]

📋 Step 4: Citation Details

   Sentence 1: GEPA is a new prompt optimization method that combines refle...
   Status: supported
   Best source: gepa_intro
   Coverage: 61.5%

   Sentence 2: It achieves 19% better performance than GRPO while using 35x...
   