# Page Number Cleanup & Citation Scheme

**Decision**: Remove fabricated page numbers (which were just `char_position // 2000 + 1`).
Replace with a proper citation scheme: **Chapter → Section → Paragraph ID**.

**Why**: The EPUB has zero embedded page markers. The old "page 42" meant nothing: it was an arbitrary character-count estimate that cannot be verified against the physical book. A citation system must be grounded in real structure.

**Citation format**: `CLB.7.§3.p12` = Clear Light of Bliss, Chapter 7, Section 3, Paragraph 12

This works for all 23 books without requiring anything the EPUBs don't contain.
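
To make the format concrete, here is a minimal sketch of parsing a citation string back into its parts. The names `Citation`, `CITATION_RE`, `parse_citation`, and `format_citation` are illustrative assumptions and are not part of the notebook's pipeline; the pattern also assumes book abbreviations contain only letters, digits, and underscores.

```python
import re
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    """Hypothetical container for the parts of a citation string."""
    book: str
    chapter: int
    section: Optional[int]  # None for paragraphs before the first section heading
    paragraph: int


# Assumed grammar: BOOK.chapter[.§section].pN
CITATION_RE = re.compile(
    r"^(?P<book>[A-Za-z0-9_]+)"
    r"\.(?P<chapter>\d+)"
    r"(?:\.§(?P<section>\d+))?"
    r"\.p(?P<paragraph>\d+)$"
)


def parse_citation(text: str) -> Citation:
    """Split a citation such as 'CLB.7.§3.p12' into typed fields."""
    match = CITATION_RE.match(text)
    if match is None:
        raise ValueError(f"Not a valid citation: {text!r}")
    section = match.group("section")
    return Citation(
        book=match.group("book"),
        chapter=int(match.group("chapter")),
        section=int(section) if section is not None else None,
        paragraph=int(match.group("paragraph")),
    )


def format_citation(c: Citation) -> str:
    """Inverse of parse_citation: rebuild the string form."""
    if c.section is None:
        return f"{c.book}.{c.chapter}.p{c.paragraph}"
    return f"{c.book}.{c.chapter}.§{c.section}.p{c.paragraph}"


print(parse_citation("CLB.7.§3.p12"))
print(parse_citation("CLB.7.p5"))
```

Because the section part is optional in the pattern, the same parser handles both the full form and the section-less form used before a chapter's first heading.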

In [1]:
import json
import os

INPUT_FILE = "06_document_structure_layer1.json"

# Read explicitly as UTF-8: the file is later saved with ensure_ascii=False
with open(INPUT_FILE, encoding='utf-8') as f:
    data = json.load(f)

print(f"Loaded: {data['total_chapters']} chapters, {data['total_paragraphs']} paragraphs")
print(f"Current total_pages value: {data.get('total_pages', 'N/A')}")

# Count how many paragraphs have page_number set
has_page = sum(
    1 for ch in data['chapters'] 
    for p in ch['paragraphs'] 
    if p.get('page_number') is not None
)
print(f"Paragraphs with page_number: {has_page}/{data['total_paragraphs']}")

Loaded: 33 chapters, 3449 paragraphs
Current total_pages value: 49
Paragraphs with page_number: 3449/3449


## Build Citation IDs

Each paragraph gets a human-readable citation string:
- `book_abbrev.chapter_index.§section_index.p{paragraph_index}`
- Example: `CLB.7.§3.p12` is Clear Light of Bliss, Ch 7, 3rd section heading, paragraph 12

For paragraphs before any section heading: `CLB.7.p5` (no section marker).

Also assigns each section a sequential index within its chapter for stable referencing.
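
As a worked example of the rule above, the following sketch applies it to a toy chapter. The function `cite_chapter` and its inputs are hypothetical and exist only for illustration; `section_starts` is assumed to map the paragraph index of each section heading to that section's 1-based index.

```python
def cite_chapter(book_abbrev, chapter_index, paragraph_indices, section_starts):
    """Return one citation string per paragraph in a single chapter.

    A paragraph inherits the most recent section heading at or before it;
    paragraphs before the first heading get no section marker.
    """
    citations = []
    current_section = None
    for p_idx in paragraph_indices:
        # A heading paragraph opens a new section, and belongs to it
        if p_idx in section_starts:
            current_section = section_starts[p_idx]
        if current_section is None:
            citations.append(f"{book_abbrev}.{chapter_index}.p{p_idx}")
        else:
            citations.append(
                f"{book_abbrev}.{chapter_index}.§{current_section}.p{p_idx}"
            )
    return citations


# Toy chapter: 5 paragraphs, headings at paragraphs 2 and 4
toy_citations = cite_chapter("CLB", 7, range(5), {2: 1, 4: 2})
print(toy_citations)
# ['CLB.7.p0', 'CLB.7.p1', 'CLB.7.§1.p2', 'CLB.7.§1.p3', 'CLB.7.§2.p4']
```

Note that the paragraph index keeps counting across sections, so it is unique within the chapter even without the section marker.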

In [2]:
# Book abbreviation map (for all 23 books eventually)
BOOK_ABBREVS = {
    'Clear_Light_of_Bliss': 'CLB',
}

book_abbrev = BOOK_ABBREVS.get(data['book_id'], data['book_id'][:3].upper())

# ── Build citation IDs ──
for ch in data['chapters']:
    ch_idx = ch['chapter_index']
    
    # Index sections sequentially within the chapter
    for sec_idx, section in enumerate(ch.get('sections', [])):
        section['section_index'] = sec_idx + 1  # 1-based
    
    # Track the section the current paragraph falls under (None before the first heading)
    current_section_idx = None
    
    # Build a lookup: paragraph_index -> which section it falls under
    section_starts = {s['paragraph_index']: s.get('section_index', i+1) 
                      for i, s in enumerate(ch.get('sections', []))}
    
    for para in ch['paragraphs']:
        p_idx = para['paragraph_index']
        
        # Update current section if this paragraph starts one
        if p_idx in section_starts:
            current_section_idx = section_starts[p_idx]
        
        # Build citation string
        if current_section_idx is not None:
            citation = f"{book_abbrev}.{ch_idx}.§{current_section_idx}.p{p_idx}"
        else:
            citation = f"{book_abbrev}.{ch_idx}.p{p_idx}"
        
        # Add to paragraph
        para['citation'] = citation
        para['section_index'] = current_section_idx
        
        # Remove fake page_number
        if 'page_number' in para:
            del para['page_number']

# Remove page-related fields from chapters
for ch in data['chapters']:
    if 'pages' in ch:
        del ch['pages']

# Remove top-level page count
if 'total_pages' in data:
    del data['total_pages']

# Add citation metadata
data['citation_scheme'] = {
    'format': '{book_abbrev}.{chapter_index}.§{section_index}.p{paragraph_index}',
    'book_abbreviation': book_abbrev,
    'example': f'{book_abbrev}.7.§3.p12',
    'note': 'section_index omitted for paragraphs before first section heading in a chapter',
    'page_numbers': 'Not available: EPUB contains no embedded page markers',
}

print(f"✓ Citation IDs assigned")
print(f"\nSample citations:")
for ch in data['chapters'][7:9]:  # Show teaching chapters
    print(f"\n  Chapter {ch['chapter_index']}: \"{ch['chapter_title']}\"")
    for p in ch['paragraphs'][:3]:
        preview = p['text'][:70] + "..." if len(p['text']) > 70 else p['text']
        print(f"    {p['citation']:25s} [{p['structural_role']:12s}] {preview}")
    if len(ch['paragraphs']) > 6:
        mid = len(ch['paragraphs']) // 2
        p = ch['paragraphs'][mid]
        preview = p['text'][:70] + "..." if len(p['text']) > 70 else p['text']
        print(f"    {p['citation']:25s} [{p['structural_role']:12s}] {preview}")

✓ Citation IDs assigned

Sample citations:

  Chapter 7: "Introduction and Preliminaries"
    CLB.7.p0                  [BODY_FIRST  ] It is very pleasing to have this opportunity to explain the method for...
    CLB.7.p1                  [LIST_ITEM   ] 1	An introduction to the general paths
    CLB.7.p2                  [LIST_ITEM   ] 2	The source of the lineage from which these instructions are derived
    CLB.7.§2.p58              [VERSE       ] Losang Trinlay

  Chapter 8: "Channels, Winds and Drops"
    CLB.8.§1.p0               [SECTION_HEAD] THE ACTUAL PRACTICE
    CLB.8.§1.p1               [BODY_FIRST  ] As explained in the introduction to the general paths of Secret Mantra...
    CLB.8.§1.p2               [LIST_ITEM   ] 1	How to practise the Mahamudra that is theunion of bliss and emptines...
    CLB.8.§6.p45              [BODY        ] From each of these eight petals or channel spokes of the heart, three ...


## Save Updated Structure

In [3]:
with open(INPUT_FILE, 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

file_size = os.path.getsize(INPUT_FILE) / (1024 * 1024)
print(f"✓ Saved to {INPUT_FILE}")
print(f"  File size: {file_size:.1f} MB")

✓ Saved to 06_document_structure_layer1.json
  File size: 2.3 MB


## 🚦 Validation

In [4]:
print("=" * 70)
print("🚦 VALIDATION: Citation Scheme")
print("=" * 70)

# Reload to verify
with open(INPUT_FILE, encoding='utf-8') as f:
    saved = json.load(f)

checks = []

# Check 1: No page_number fields remain
has_page = sum(1 for ch in saved['chapters'] for p in ch['paragraphs'] if 'page_number' in p)
checks.append(('No page_number fields remain', has_page == 0, f"{has_page} found"))

# Check 2: All paragraphs have citations
has_citation = sum(1 for ch in saved['chapters'] for p in ch['paragraphs'] if 'citation' in p)
total_p = sum(len(ch['paragraphs']) for ch in saved['chapters'])
checks.append(('All paragraphs have citation', has_citation == total_p,
               f"{has_citation}/{total_p}"))

# Check 3: Citations are unique
all_citations = [p['citation'] for ch in saved['chapters'] for p in ch['paragraphs']]
unique_citations = set(all_citations)
checks.append(('All citations are unique', len(unique_citations) == len(all_citations),
               f"{len(unique_citations)} unique / {len(all_citations)} total"))

# Check 4: No total_pages field
checks.append(('No total_pages in metadata', 'total_pages' not in saved, ''))

# Check 5: No pages in chapters
has_chapter_pages = sum(1 for ch in saved['chapters'] if 'pages' in ch)
checks.append(('No pages in chapter metadata', has_chapter_pages == 0,
               f"{has_chapter_pages} chapters still have pages"))

# Check 6: Citation scheme documented
checks.append(('Citation scheme documented', 'citation_scheme' in saved, ''))

# Check 7: Sections have section_index
sections_with_idx = sum(1 for ch in saved['chapters'] for s in ch.get('sections', []) if 'section_index' in s)
total_sections = sum(len(ch.get('sections', [])) for ch in saved['chapters'])
checks.append(('Sections have section_index', sections_with_idx == total_sections,
               f"{sections_with_idx}/{total_sections}"))

all_pass = True
for desc, passed, detail in checks:
    status = "✓" if passed else "✗"
    if not passed:
        all_pass = False
    detail_str = f" ({detail})" if detail else ""
    print(f"  {status} {desc}{detail_str}")

if all_pass:
    print(f"\n  ✅ ALL CHECKS PASSED")
    print(f"  Fake page numbers removed. Citation scheme ready for all 23 books.")
else:
    print(f"\n  ⚠️  SOME CHECKS FAILED")

🚦 VALIDATION: Citation Scheme
  ✓ No page_number fields remain (0 found)
  ✓ All paragraphs have citation (3449/3449)
  ✓ All citations are unique (3449 unique / 3449 total)
  ✓ No total_pages in metadata
  ✓ No pages in chapter metadata (0 chapters still have pages)
  ✓ Citation scheme documented
  ✓ Sections have section_index (150/150)

  ✅ ALL CHECKS PASSED
  Fake page numbers removed. Citation scheme ready for all 23 books.
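
Since the uniqueness check passes, a citation string can serve as a lookup key. The sketch below shows one possible way to resolve a citation back to its paragraph; `build_citation_index` is a hypothetical helper, and the `toy` structure with its paragraph texts is made-up stand-in data that only mirrors the field names used in this notebook.

```python
def build_citation_index(structure: dict) -> dict:
    """Map each citation string to its paragraph dict.

    Assumes citations are unique across the whole book, as verified
    by the validation cell.
    """
    return {
        para["citation"]: para
        for ch in structure["chapters"]
        for para in ch["paragraphs"]
    }


# Toy stand-in for the saved JSON (texts are placeholders, not book content)
toy = {
    "chapters": [
        {
            "chapter_index": 7,
            "paragraphs": [
                {"paragraph_index": 0, "citation": "CLB.7.p0", "text": "Opening paragraph."},
                {"paragraph_index": 1, "citation": "CLB.7.§1.p1", "text": "First section heading."},
                {"paragraph_index": 2, "citation": "CLB.7.§1.p2", "text": "Body under the heading."},
            ],
        }
    ]
}

index = build_citation_index(toy)
print(index["CLB.7.§1.p2"]["text"])
# Body under the heading.
```

With the real file, the same function could be called on the `saved` dict loaded in the validation cell.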
