# Page Number Diagnostic: Finding Real Page Markers in the EPUB

**Problem**: Our re-extraction yields only 49 "pages" when it splits on the `ebook-page-break` CSS class, but the print book has ~270 pages. The old extraction had a `position_to_page` mapping covering 270 pages. Where did it come from?

**EPUBs can encode page numbers in several ways:**
1. `<span epub:type="pagebreak" id="page42"/>` — EPUB3 page break markers
2. Adobe page-map XML file
3. `<a id="page42"/>` or `<a name="page42"/>` anchor tags  
4. CSS class markers (we found `ebook-page-break` but only 48 of them)
5. NCX/OPF page-list navigation

This notebook searches for ALL of these patterns.
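
To make the list above concrete, here is a minimal stdlib-only sketch of what the in-content marker styles look like and how they can be told apart. The `SAMPLE` markup and the `PageMarkerCollector` class are illustrative assumptions, not taken from this EPUB; the notebook itself uses BeautifulSoup for the real search.

```python
from html.parser import HTMLParser

# Made-up XHTML fragment showing four in-content marker styles.
SAMPLE = """
<p>End of a page.<span epub:type="pagebreak" id="page42" title="42"/></p>
<p>Next.<span role="doc-pagebreak" id="pg43" aria-label="43"></span></p>
<p><a id="page44"></a>Anchor-style marker.</p>
<div class="ebook-page-break"></div>
"""

class PageMarkerCollector(HTMLParser):
    """Collect (marker_style, id) tuples from start tags."""

    def __init__(self):
        super().__init__()
        self.markers = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("epub:type") == "pagebreak":
            self.markers.append(("epub:type", a.get("id")))
        elif a.get("role") == "doc-pagebreak":
            self.markers.append(("role", a.get("id")))
        elif tag == "a" and (a.get("id") or "").startswith("page"):
            self.markers.append(("anchor", a.get("id")))
        elif "ebook-page-break" in (a.get("class") or "").split():
            self.markers.append(("css-class", None))

collector = PageMarkerCollector()
collector.feed(SAMPLE)
for style, marker_id in collector.markers:
    print(style, marker_id)
```

Only the first three styles carry a page number. A CSS class marker records that a break exists but not which print page it belongs to.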

In [1]:
import os
import re
import json
from collections import defaultdict

import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
from lxml import etree

EPUB_DIR = os.path.expanduser("~/Documents/gesha_la_rag/epub_directory/")
EPUB_FILE = "Clear_Light_of_Bliss.epub"
EPUB_PATH = os.path.join(EPUB_DIR, EPUB_FILE)

book = epub.read_epub(EPUB_PATH)
items = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))
print(f"✓ Loaded: {len(items)} sections")

✓ Loaded: 89 sections


## Method 1: Search for pagebreak elements in HTML

Look for EPUB3 `epub:type="pagebreak"`, `role="doc-pagebreak"`, and anchor-based page markers.

In [2]:
# ── Method 1: Search ALL HTML for page break patterns ──
page_markers = []
page_patterns_found = defaultdict(int)

for idx, item in enumerate(items):
    content = item.get_content()
    html_str = content.decode('utf-8', errors='replace')
    soup = BeautifulSoup(content, 'html.parser')
    
    # Pattern A: epub:type="pagebreak" (EPUB3 standard)
    for el in soup.find_all(attrs={"epub:type": "pagebreak"}):
        page_id = el.get('id', el.get('title', '?'))
        page_markers.append(('epub:type=pagebreak', page_id, idx, item.get_name()))
        page_patterns_found['epub:type=pagebreak'] += 1
    
    # Pattern B: role="doc-pagebreak" (ARIA)
    for el in soup.find_all(attrs={"role": "doc-pagebreak"}):
        page_id = el.get('id', el.get('title', el.get('aria-label', '?')))
        page_markers.append(('role=doc-pagebreak', page_id, idx, item.get_name()))
        page_patterns_found['role=doc-pagebreak'] += 1
    
    # Pattern C: <a id="pageNNN"> or <a id="pNNN"> or <span id="pageNNN">
    # Anchored at the start so ids such as "step12" or "chap3" are not matched.
    for el in soup.find_all(['a', 'span'], id=re.compile(r'(?i)^(page|pg|p)[-_]?\d+')):
        page_markers.append(('id=page*', el.get('id'), idx, item.get_name()))
        page_patterns_found['id=page*'] += 1
    
    # Pattern D: <a name="pageNNN"> (same anchored pattern as Pattern C)
    for el in soup.find_all('a', attrs={'name': re.compile(r'(?i)^(page|pg|p)[-_]?\d+')}):
        page_markers.append(('name=page*', el.get('name'), idx, item.get_name()))
        page_patterns_found['name=page*'] += 1
    
    # Pattern E: Any element whose class list contains "page"
    for el in soup.find_all(class_=re.compile(r'(?i)page')):
        el_id = el.get('id', '')
        classes = el.get('class', [])
        el_class = ' '.join(classes)
        # Skip ebook-page-break (already found), even when combined with other classes
        if 'ebook-page-break' not in classes:
            page_markers.append(('class=*page*', f"{el_class} id={el_id}", idx, item.get_name()))
            page_patterns_found['class=*page*'] += 1
    
    # Pattern F: Raw regex for page-number-like patterns in HTML source
    # Look for id attributes with numbers that could be page refs
    raw_page_ids = re.findall(r'id=["\'](page[-_]?\d+|p\d+)["\'\s/>]', html_str, re.I)
    for pid in raw_page_ids:
        if ('id=page*', pid, idx, item.get_name()) not in page_markers:
            page_markers.append(('raw_regex_id', pid, idx, item.get_name()))
            page_patterns_found['raw_regex_id'] += 1

print(f"Page markers found: {len(page_markers)}")
print(f"\nBy pattern type:")
for pattern, count in sorted(page_patterns_found.items(), key=lambda x: -x[1]):
    print(f"  {pattern}: {count}")

if page_markers:
    print(f"\nFirst 10 markers:")
    for ptype, pid, sidx, sname in page_markers[:10]:
        print(f"  [{ptype}] {pid}  (section {sidx}: {sname})")
    if len(page_markers) > 10:
        print(f"  ... and {len(page_markers) - 10} more")
        print(f"\nLast 5 markers:")
        for ptype, pid, sidx, sname in page_markers[-5:]:
            print(f"  [{ptype}] {pid}  (section {sidx}: {sname})")
else:
    print("\n⚠️  No page markers found in HTML content!")

Page markers found: 4

By pattern type:
  class=*page*: 4

First 10 markers:
  [class=*page*] Book-title-title-page id=  (section 4: Clear_Light_of_Bliss_Text_2019-08-3.xhtml)
  [class=*page*] Chapter-title-TOC-Level-1-no-new-page id=_idParaDest-4  (section 8: Clear_Light_of_Bliss_Text_2019-08-7.xhtml)
  [class=*page*] line-drawings-no-page-before-and-after _idGenObjectStyle-Disabled id=_idContainer017  (section 8: Clear_Light_of_Bliss_Text_2019-08-7.xhtml)
  [class=*page*] Chapter-title-TOC-Level-1-no-new-page id=_idParaDest-28  (section 82: Clear_Light_of_Bliss_Text_2019-08-81.xhtml)


## Method 2: Check EPUB package/OPF for page-list

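Some EPUBs define page navigation outside the content documents: EPUB 2 uses a `pageList` of `pageTarget` entries in the NCX, and EPUB 3 uses a `<nav epub:type="page-list">` element in the navigation document.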

In [3]:
# ── Method 2: Check OPF/NCX for page-list ──
print("Checking EPUB package metadata for page lists...")
print("=" * 70)

# Check all non-document items (OPF, NCX, etc.)
all_items = list(book.get_items())
print(f"Total items in EPUB: {len(all_items)}")

for item in all_items:
    item_type = item.get_type()
    name = item.get_name()
    
    # Look at navigation and metadata items
    if item_type in (ebooklib.ITEM_NAVIGATION, ebooklib.ITEM_UNKNOWN):
        print(f"\n{'─' * 70}")
        print(f"Item: {name} (type={item_type})")
        content = item.get_content()
        if content:
            text = content.decode('utf-8', errors='replace')
            # Look for page-list references
            if 'page-list' in text.lower() or 'pagelist' in text.lower() or 'page_list' in text.lower():
                print(f"  ✓ Contains 'page-list' reference!")
                # Show relevant section
                for line in text.split('\n'):
                    if 'page' in line.lower():
                        print(f"    {line.strip()[:150]}")
            elif 'page' in text.lower():
                print(f"  Contains 'page' references")
                page_lines = [l for l in text.split('\n') if 'page' in l.lower()]
                for line in page_lines[:5]:
                    print(f"    {line.strip()[:150]}")

# Also check the NCX (EPUB2 navigation)
print(f"\n{'─' * 70}")
print("Checking NCX navigation...")
try:
    ncx_items = list(book.get_items_of_type(ebooklib.ITEM_NAVIGATION))
    for ncx in ncx_items:
        content = ncx.get_content().decode('utf-8', errors='replace')
        if 'pageTarget' in content or 'page-list' in content:
            print(f"  ✓ NCX contains page targets!")
            # Count opening tags only, so closing </pageTarget> tags are not double-counted
            page_targets = re.findall(r'<(?:\w+:)?pageTarget\b', content)
            print(f"  Found {len(page_targets)} pageTarget entries")
        else:
            print(f"  No pageTarget entries in NCX")
except Exception as e:
    print(f"  Error reading NCX: {e}")

Checking EPUB package metadata for page lists...
Total items in EPUB: 154

──────────────────────────────────────────────────────────────────────
Item: toc.ncx (type=4)
  Contains 'page' references
    <meta name="dtb:totalPageCount" content="0" />
    <meta name="dtb:maxPageNumber" content="0" />

──────────────────────────────────────────────────────────────────────
Checking NCX navigation...
  No pageTarget entries in NCX
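
This NCX has no page targets, but for comparison, here is a sketch of how a populated NCX `pageList` could be parsed with the standard library. The `SAMPLE_NCX` content and the `parse_page_targets` helper are hypothetical; only the NCX namespace and the `pageList` / `pageTarget` element names come from the NCX format itself.

```python
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Made-up NCX fragment with two page targets.
SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <pageList>
    <pageTarget id="pt1" type="normal" value="1" playOrder="1">
      <navLabel><text>1</text></navLabel>
      <content src="chapter1.xhtml#page1"/>
    </pageTarget>
    <pageTarget id="pt2" type="normal" value="2" playOrder="2">
      <navLabel><text>2</text></navLabel>
      <content src="chapter1.xhtml#page2"/>
    </pageTarget>
  </pageList>
</ncx>"""

def parse_page_targets(ncx_xml):
    """Return [(page_label, target_href), ...] from an NCX pageList."""
    root = ET.fromstring(ncx_xml)
    targets = []
    for pt in root.findall("ncx:pageList/ncx:pageTarget", NCX_NS):
        label = pt.findtext("ncx:navLabel/ncx:text", default="", namespaces=NCX_NS)
        content = pt.find("ncx:content", NCX_NS)
        src = content.get("src") if content is not None else None
        targets.append((label.strip(), src))
    return targets

page_targets = parse_page_targets(SAMPLE_NCX)
print(page_targets)
```

Each `src` points at a file and fragment id, so an EPUB with this structure would also show matching anchors in Method 1.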


## Method 3: Inspect the raw EPUB ZIP for page-map files

Some publishers include a separate page-map.xml file.

In [4]:
import zipfile

print("Inspecting EPUB as ZIP archive...")
print("=" * 70)

with zipfile.ZipFile(EPUB_PATH, 'r') as zf:
    all_files = zf.namelist()
    
    # Look for page-map or navigation files
    interesting = [f for f in all_files if any(x in f.lower() for x in 
                   ['page', 'nav', 'ncx', 'opf', 'toc', 'map'])]
    
    print(f"Total files in EPUB: {len(all_files)}")
    print(f"\nPotentially relevant files:")
    for f in interesting:
        info = zf.getinfo(f)
        print(f"  {f} ({info.file_size:,} bytes)")
    
    # Read each interesting file and look for page data
    for f in interesting:
        content = zf.read(f).decode('utf-8', errors='replace')
        if 'page' in content.lower():
            page_refs = re.findall(r'(?i)page[-_]?\d+', content)
            if page_refs:
                unique_refs = sorted(set(page_refs))
                print(f"\n{'─' * 70}")
                print(f"File: {f}")
                print(f"  Contains {len(page_refs)} page references ({len(unique_refs)} unique)")
                print(f"  First 10: {unique_refs[:10]}")
                if len(unique_refs) > 10:
                    print(f"  Last 5: {unique_refs[-5:]}")
                    
                # Try to extract page number range
                numbers = []
                for ref in unique_refs:
                    m = re.search(r'(\d+)', ref)
                    if m:
                        numbers.append(int(m.group(1)))
                if numbers:
                    print(f"  Page range: {min(numbers)} to {max(numbers)}")

Inspecting EPUB as ZIP archive...
Total files in EPUB: 158

Potentially relevant files:
  OEBPS/toc.ncx (6,505 bytes)
  OEBPS/toc.xhtml (4,217 bytes)
  OEBPS/content.opf (24,373 bytes)
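
The archive listing shows no page-map file in this EPUB. For an EPUB that does ship one, the sketch below shows how it might be read. It assumes the Adobe-style layout (a `page-map` root in the OPF namespace with `page` elements carrying `name` and `href` attributes) and a hypothetical member path `OEBPS/page-map.xml`; real files may use a different path or name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OPF_NS = "{http://www.idpf.org/2007/opf}"

def read_page_map(epub_file, member="OEBPS/page-map.xml"):
    """Return [(page_name, href), ...] from a page-map file, or [] if absent."""
    with zipfile.ZipFile(epub_file) as zf:
        if member not in zf.namelist():
            return []
        root = ET.fromstring(zf.read(member))
    return [(p.get("name"), p.get("href")) for p in root.iter(f"{OPF_NS}page")]

# Build a tiny in-memory archive to exercise the function (made-up content).
SAMPLE_PAGE_MAP = (
    '<page-map xmlns="http://www.idpf.org/2007/opf">'
    '<page name="1" href="chapter1.xhtml#page1"/>'
    '<page name="2" href="chapter1.xhtml#page2"/>'
    '</page-map>'
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("OEBPS/page-map.xml", SAMPLE_PAGE_MAP)
buf.seek(0)

pages = read_page_map(buf)
print(pages)
```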


## Method 4: Check the OLD extraction JSON

If it exists, inspect where `position_to_page` came from.

In [5]:
# Look for the old extraction JSON that had position_to_page
search_paths = [
    os.path.expanduser("~/Documents/gesha_la_rag/Clear_Light_of_Bliss.json"),
    os.path.expanduser("~/Documents/gesha_la_rag/extracted_text/Clear_Light_of_Bliss.json"),
    os.path.expanduser("~/Documents/gesha_la_rag/checkpoints/Clear_Light_of_Bliss.json"),
]

# Also search recursively
import glob
found_jsons = glob.glob(os.path.expanduser("~/Documents/gesha_la_rag/**/*Clear_Light*.json"), recursive=True)
search_paths.extend(found_jsons)

print("Searching for old extraction JSON with position_to_page...")
print("=" * 70)

for path in set(search_paths):
    if os.path.exists(path):
        try:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            
            has_p2p = 'position_to_page' in data
            print(f"\n✓ Found: {path}")
            print(f"  Has position_to_page: {has_p2p}")
            
            if has_p2p:
                p2p = data['position_to_page']
                pages = sorted(set(p2p.values()))
                positions = sorted(int(k) for k in p2p.keys())
                print(f"  Entries: {len(p2p)}")
                print(f"  Page range: {min(pages)} to {max(pages)}")
                print(f"  Position range: {min(positions)} to {max(positions)}")
                print(f"  ✓ THIS is our page number source!")
                
                # Show first 10 entries
                print(f"\n  First 10 position→page mappings:")
                for pos in positions[:10]:
                    print(f"    char {pos} → page {p2p[str(pos)]}")
            
            # Also check what other keys are in this file
            print(f"  Top-level keys: {list(data.keys())[:10]}")
            
        except Exception as e:
            print(f"  ✗ Error reading {path}: {e}")

if not any(os.path.exists(p) for p in search_paths):
    print("\n⚠️  No old extraction JSON found")
    print("  The position_to_page data may need to be regenerated")

Searching for old extraction JSON with position_to_page...

✓ Found: /home/matt/Documents/gesha_la_rag/extracted_text/Clear_Light_of_Bliss.json
  Has position_to_page: True
  Entries: 330
  Page range: 1 to 270
  Position range: 0 to 538522
  ✓ THIS is our page number source!

  First 10 position→page mappings:
    char 0 → page 1
    char 20 → page 1
    char 473 → page 1
    char 1169 → page 1
    char 1314 → page 1
    char 2572 → page 2
    char 3454 → page 2
    char 4121 → page 3
    char 6050 → page 4
    char 7107 → page 4
  Top-level keys: ['book_id', 'book_title', 'creator', 'chapters', 'position_to_page', 'total_length']
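
The mapping is sparse (330 entries for roughly 538,000 characters), so looking up a page for an arbitrary character offset needs a "nearest entry at or before" search. A minimal sketch, assuming each entry marks the page in effect from that position onward; `build_page_lookup` is a hypothetical helper, and `sample_p2p` holds only the ten entries printed above.

```python
from bisect import bisect_right

# The ten position -> page entries shown in the output above (JSON keys are strings).
sample_p2p = {
    "0": 1, "20": 1, "473": 1, "1169": 1, "1314": 1,
    "2572": 2, "3454": 2, "4121": 3, "6050": 4, "7107": 4,
}

def build_page_lookup(p2p):
    """Return a function mapping a character offset to a page number."""
    positions = sorted(int(k) for k in p2p)
    pages = [p2p[str(pos)] for pos in positions]

    def page_for(offset):
        # Index of the last recorded position that is <= offset.
        i = bisect_right(positions, offset) - 1
        return pages[i] if i >= 0 else None

    return page_for

page_for = build_page_lookup(sample_p2p)
print(page_for(2600))   # falls between char 2572 and char 3454
print(page_for(6049))   # just before the entry at char 6050
```

Offsets before the first recorded position return `None`, so a caller can tell missing data from a real page number.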


## Summary & Recommendation

In [6]:
print("=" * 70)
print("PAGE NUMBER DIAGNOSTIC SUMMARY")
print("=" * 70)

print(f"\nMethods checked:")
print(f"  1. HTML pagebreak elements:  {len(page_markers)} found")
print(f"  2. OPF/NCX page-list:        (see output above)")
print(f"  3. ZIP page-map files:        (see output above)")
print(f"  4. Old extraction JSON:       (see output above)")

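def has_position_to_page(path):
    """Return True if `path` is a readable JSON file with a position_to_page key."""
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return 'position_to_page' in json.load(f)
    except (OSError, ValueError, TypeError):
        return False

if len(page_markers) >= 200:
    print(f"\n✅ EPUB contains embedded page markers!")
    print(f"   We can extract real page numbers from these.")
# Require the key itself, not just a matching file on disk
elif any(has_position_to_page(p) for p in search_paths):
    print(f"\n✅ Old extraction has position_to_page data!")
    print(f"   We can reuse this mapping in the new extraction.")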
else:
    print(f"\n⚠️  No reliable page number source found.")
    print(f"   Options:")
    print(f"   a) Use the 48 ebook-page-break markers as section breaks (not pages)")
    print(f"   b) Estimate pages from character count (~2000 chars/page)")
    print(f"   c) Cross-reference with the physical book's page numbers")

print(f"\n→ Paste this output into Claude and we'll integrate the fix.")

PAGE NUMBER DIAGNOSTIC SUMMARY

Methods checked:
  1. HTML pagebreak elements:  4 found
  2. OPF/NCX page-list:        (see output above)
  3. ZIP page-map files:        (see output above)
  4. Old extraction JSON:       (see output above)

✅ Old extraction has position_to_page data!
   We can reuse this mapping in the new extraction.

→ Paste this output into Claude and we'll integrate the fix.
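
Option (b) in the summary code, estimating pages from character count, can be sketched as below. The same function also works as a sanity check on the old mapping: the ten entries printed in Method 4 all satisfy `position // 2000 + 1 == page`, and the maximum position 538522 maps to page 270 under that formula. That sample does not prove how the old mapping was built, but it suggests running the check on the full `position_to_page` data before treating it as real print pagination. The helper names here are hypothetical.

```python
def estimated_page(position, chars_per_page=2000):
    """Estimate a 1-based page number from a character offset."""
    return position // chars_per_page + 1

def matches_char_estimate(p2p, chars_per_page=2000):
    """True if every position -> page entry equals the character-count estimate."""
    return all(
        estimated_page(int(pos), chars_per_page) == page
        for pos, page in p2p.items()
    )

# The ten entries printed by Method 4 (not the full 330-entry mapping).
sample_p2p = {
    "0": 1, "20": 1, "473": 1, "1169": 1, "1314": 1,
    "2572": 2, "3454": 2, "4121": 3, "6050": 4, "7107": 4,
}

print(matches_char_estimate(sample_p2p))
print(estimated_page(538522))
```

If the full mapping also passes, the 270 "pages" are evenly sized character buckets, and citations built on them would only approximate the print edition's page numbers.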
