EPUB Structure Exploration - Verify Paragraph Boundaries Exist

This script examines the raw EPUB structure to verify:
1. Do paragraph boundaries exist in the source EPUB?
2. What format are they in (HTML <p> tags, newlines, etc.)?
3. How should we extract them properly?

Run this BEFORE building the full extraction pipeline.

In [48]:
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
import re

print("Exploratory EPUB Analysis")
print("=" * 70)

Exploratory EPUB Analysis


In [49]:
# Configuration
EPUB_PATH = r"C:\Users\DELL\Documents\gesha_la_rag\epub_directory\epub_directory\Clear_Light_of_Bliss.epub"

print(f"\nExamining: Clear_Light_of_Bliss.epub")
print("=" * 70)


Examining: Clear_Light_of_Bliss.epub


In [50]:
# Load EPUB
book = epub.read_epub(EPUB_PATH)

print("\n✓ EPUB loaded successfully")

# Get metadata
try:
    title = book.get_metadata('DC', 'title')
    print(f"Title: {title[0][0] if title else 'Unknown'}")
except Exception:
    # A bare except would also swallow KeyboardInterrupt and SystemExit
    print("Title: Unable to extract")


✓ EPUB loaded successfully
Title: Clear Light of Bliss


In [51]:
# Get all document items
items = [item for item in book.get_items() if item.get_type() == ebooklib.ITEM_DOCUMENT]

print(f"\nTotal document sections: {len(items)}")
print("\nFirst 5 sections:")
for i, item in enumerate(items[:5]):
    print(f"  {i+1}. {item.get_name()}")


Total document sections: 89

First 5 sections:
  1. cover.xhtml
  2. Clear_Light_of_Bliss_Text_2019-08.xhtml
  3. Clear_Light_of_Bliss_Text_2019-08-1.xhtml
  4. Clear_Light_of_Bliss_Text_2019-08-2.xhtml
  5. Clear_Light_of_Bliss_Text_2019-08-3.xhtml


In [52]:
# Examine the FIRST content section in detail
print("\n" + "=" * 70)
print("DETAILED EXAMINATION OF FIRST CONTENT SECTION")
print("=" * 70)

# Skip title pages, get to actual content (usually around item 8-12)
test_item = items[10] if len(items) > 10 else items[0]

print(f"\nExamining: {test_item.get_name()}")

# Get raw HTML
raw_html = test_item.get_content().decode('utf-8', errors='replace')

print(f"\nRaw HTML length: {len(raw_html)} characters")
print("\n" + "-" * 70)
print("RAW HTML SAMPLE (first 1000 characters):")
print("-" * 70)
print(raw_html[:1000])


DETAILED EXAMINATION OF FIRST CONTENT SECTION

Examining: Clear_Light_of_Bliss_Text_2019-08-9.xhtml

Raw HTML length: 4117 characters

----------------------------------------------------------------------
RAW HTML SAMPLE (first 1000 characters):
----------------------------------------------------------------------
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" epub:prefix="z3998: http://www.daisy.org/z3998/2012/vocab/structure/#" lang="en" xml:lang="en">
  <head/>
  <body><div>
			<p id="_idParaDest-6" class="Chapter-title-TOC-Level-1"><a id="_idTextAnchor005"/>Preface</p>
			<p class="Text-1st-para">I have written this book primarily for the benefit of Western Dharma practitioners with the hope that indirectly it will prove beneficial for all living beings.</p>
			<p class="Text-2nd-para">As for how it was composed, I have based it on the slight experience I have gained through the kindness

In [53]:
# Parse with BeautifulSoup
soup = BeautifulSoup(raw_html, 'html.parser')

print("\n" + "=" * 70)
print("PARSED HTML STRUCTURE")
print("=" * 70)

# Check for paragraph tags
paragraphs = soup.find_all('p')
print(f"\nNumber of <p> tags found: {len(paragraphs)}")

if paragraphs:
    print("\nFirst 3 paragraphs:")
    for i, p in enumerate(paragraphs[:3], 1):
        text = p.get_text()
        print(f"\n[Paragraph {i}]")
        print(f"Length: {len(text)} characters")
        print(f"Text: {text[:200]}...")
        print("Has <p> tag: YES")


PARSED HTML STRUCTURE

Number of <p> tags found: 8

First 3 paragraphs:

[Paragraph 1]
Length: 7 characters
Text: Preface...
Has <p> tag: YES

[Paragraph 2]
Length: 160 characters
Text: I have written this book primarily for the benefit of Western Dharma practitioners with the hope that indirectly it will prove beneficial for all living beings....
Has <p> tag: YES

[Paragraph 3]
Length: 894 characters
Text: As for how it was composed, I have based it on the slight experience I have gained through the kindness of my holy Spiritual Guide from whom I received instructions on the generation stage and complet...
Has <p> tag: YES


In [54]:
# Extract text WITHOUT collapsing whitespace (the key test)
print("\n" + "=" * 70)
print("TEXT EXTRACTION TEST - PRESERVING WHITESPACE")
print("=" * 70)

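# Method 1: get_text() concatenates every text node, so any newlines in the
# result come from whitespace in the XHTML source, not from the <p> tags themselves
text_with_structure = soup.get_text()

print(f"\nExtracted text length: {len(text_with_structure)} characters")

# Check for newlines. newline_count includes every '\n'; str.count('\n\n')
# counts non-overlapping pairs, so a run of five newlines adds 2
newline_count = text_with_structure.count('\n')
double_newline_count = text_with_structure.count('\n\n')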

print(f"\nWhitespace analysis:")
print(f"  Single newlines (\\n): {newline_count}")
print(f"  Double newlines (\\n\\n): {double_newline_count}")

# Show first 1500 characters with newlines visible
print("\n" + "-" * 70)
print("TEXT SAMPLE (with whitespace preserved):")
print("-" * 70)
print(repr(text_with_structure[:1500]))


TEXT EXTRACTION TEST - PRESERVING WHITESPACE

Extracted text length: 2824 characters

Whitespace analysis:
  Single newlines (\n): 16
  Double newlines (\n\n): 4

----------------------------------------------------------------------
TEXT SAMPLE (with whitespace preserved):
----------------------------------------------------------------------
'\n\n\n\n\nPreface\nI have written this book primarily for the benefit of Western Dharma practitioners with the hope that indirectly it will prove beneficial for all living beings.\nAs for how it was composed, I have based it on the slight experience I have gained through the kindness of my holy Spiritual Guide from whom I received instructions on the generation stage and completion stage of Secret Mantra. In addition, I have drawn material from Je Tsongkhapa’s Lamp Thoroughly Illuminating the Five Stages, which contains the quintessence of Je Tsongkhapa’s Tantric teachings, and also from Je Tsongkhapa’s commentary to the Six Yogas of Naropa. I 

In [55]:
# Compare: What happens if we collapse whitespace (the old way)?
print("\n" + "=" * 70)
print("COMPARISON: COLLAPSED vs PRESERVED WHITESPACE")
print("=" * 70)

# Old method (destroys paragraphs)
text_collapsed = re.sub(r'\s+', ' ', text_with_structure).strip()

print(f"\nOriginal (preserved):  {len(text_with_structure)} chars, {double_newline_count} paragraph breaks")
print(f"Collapsed (old way):   {len(text_collapsed)} chars, {text_collapsed.count(chr(10)+chr(10))} paragraph breaks")

print("\n" + "-" * 70)
print("Collapsed version sample:")
print("-" * 70)
print(text_collapsed[:500])


COMPARISON: COLLAPSED vs PRESERVED WHITESPACE

Original (preserved):  2824 chars, 4 paragraph breaks
Collapsed (old way):   2815 chars, 0 paragraph breaks

----------------------------------------------------------------------
Collapsed version sample:
----------------------------------------------------------------------
Preface I have written this book primarily for the benefit of Western Dharma practitioners with the hope that indirectly it will prove beneficial for all living beings. As for how it was composed, I have based it on the slight experience I have gained through the kindness of my holy Spiritual Guide from whom I received instructions on the generation stage and completion stage of Secret Mantra. In addition, I have drawn material from Je Tsongkhapa’s Lamp Thoroughly Illuminating the Five Stages, w
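
A caveat about the counts above: str.count('\n\n') counts non-overlapping matches, so the run of five newlines at the start of the get_text() output alone accounts for two of the four "paragraph breaks" reported, while adjacent <p> elements are separated by a single '\n'. The sketch below illustrates this on a shortened stand-in string (an assumption for the example, not the real chapter text):

```python
import re

# Shortened stand-in for the get_text() output shown above (not the full chapter)
sample = "\n\n\n\n\nPreface\nFirst paragraph.\nSecond paragraph.\n"

# str.count() counts non-overlapping matches: five leading newlines give 2
double_newlines = sample.count("\n\n")

# Counting non-empty lines reflects the three paragraphs in the sample
non_empty_lines = [line for line in sample.split("\n") if line.strip()]

# Collapsing all whitespace removes every boundary
collapsed = re.sub(r"\s+", " ", sample).strip()

print(double_newlines)       # 2
print(len(non_empty_lines))  # 3
print(collapsed)             # Preface First paragraph. Second paragraph.
```

For this source, the '\n\n' count is therefore a weak proxy for paragraph boundaries; the <p> tag count is the more direct signal.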


In [56]:
# Test extraction method that preserves paragraphs
print("\n" + "=" * 70)
print("PROPOSED EXTRACTION METHOD")
print("=" * 70)

def extract_paragraphs_properly(soup):
    """
    Extract text preserving paragraph structure.
    
    Strategy:
    1. Find all <p> tags (if they exist)
    2. Extract text from each <p>
    3. Join with double newlines
    4. Clean up excessive whitespace WITHIN paragraphs only
    """
    paragraphs = soup.find_all('p')
    
    if paragraphs:
        # Method A: HTML has <p> tags
        para_texts = []
        for p in paragraphs:
            text = p.get_text()
            # Clean whitespace WITHIN paragraph, but preserve paragraph boundaries
            text = re.sub(r'\s+', ' ', text).strip()
            if text:
                para_texts.append(text)
        
        # Join with double newlines
        result = '\n\n'.join(para_texts)
        return result, "HTML_P_TAGS"
    else:
        # Method B: No <p> tags, look for other structure
        text = soup.get_text()
        # Treat any blank line (even one holding spaces or tabs) as a paragraph break
        text = re.sub(r'\n\s*\n', '\n\n', text)        # Normalize blank-line runs to a double newline
        text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)   # Remaining single newlines become spaces
        return text.strip(), "NATURAL_BREAKS"

extracted, method = extract_paragraphs_properly(soup)

print(f"\nMethod used: {method}")
print(f"Extracted length: {len(extracted)} characters")
print(f"Paragraph breaks (\\n\\n): {extracted.count(chr(10)+chr(10))}")

print("\n" + "-" * 70)
print("Sample of properly extracted text:")
print("-" * 70)
print(extracted[:800])


PROPOSED EXTRACTION METHOD

Method used: HTML_P_TAGS
Extracted length: 2822 characters
Paragraph breaks (\n\n): 7

----------------------------------------------------------------------
Sample of properly extracted text:
----------------------------------------------------------------------
Preface

I have written this book primarily for the benefit of Western Dharma practitioners with the hope that indirectly it will prove beneficial for all living beings.

As for how it was composed, I have based it on the slight experience I have gained through the kindness of my holy Spiritual Guide from whom I received instructions on the generation stage and completion stage of Secret Mantra. In addition, I have drawn material from Je Tsongkhapa’s Lamp Thoroughly Illuminating the Five Stages, which contains the quintessence of Je Tsongkhapa’s Tantric teachings, and also from Je Tsongkhapa’s commentary to the Six Yogas of Naropa. I have also consulted the first Panchen Lama’s root text on the Mah
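
The raw XHTML shown earlier carries a class attribute on each <p> (Chapter-title-TOC-Level-1, Text-1st-para, Text-2nd-para), which the joined string discards. The sketch below keeps the class next to each paragraph so that headings could later be told apart from body text. It is an illustration only: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra dependencies, the helper name paragraphs_with_class is hypothetical, and the sample markup is a shortened copy of the Preface.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Shortened copy of the Preface markup shown earlier (second paragraph is cut short)
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head/>
  <body><div>
    <p id="_idParaDest-6" class="Chapter-title-TOC-Level-1"><a id="_idTextAnchor005"/>Preface</p>
    <p class="Text-1st-para">I have written this book primarily for the benefit of
       Western Dharma practitioners.</p>
  </div></body>
</html>"""


def paragraphs_with_class(xhtml):
    """Return (class, text) pairs for every non-empty <p> element."""
    root = ET.fromstring(xhtml)
    pairs = []
    for p in root.iter(XHTML_NS + "p"):
        # itertext() yields the text of the element and all its descendants
        text = re.sub(r"\s+", " ", "".join(p.itertext())).strip()
        if text:
            pairs.append((p.get("class", ""), text))
    return pairs


for css_class, text in paragraphs_with_class(SAMPLE):
    print(f"{css_class}: {text}")
```

Whether the class names are consistent across all sections of this EPUB has not been checked here, so any filtering based on them would need the same kind of verification as the paragraph tags.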

In [57]:
# Verify on a TEACHING chapter (not front matter)
print("\n" + "=" * 70)
print("VERIFICATION: ACTUAL TEACHING CONTENT")
print("=" * 70)

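# Guess at a teaching chapter by taking the middle of the book. This is only a
# positional heuristic and may land on a divider or other near-empty section.
teaching_item = items[len(items)//2] if len(items) > 10 else items[-1]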

print(f"\nExamining teaching chapter: {teaching_item.get_name()}")

raw_html = teaching_item.get_content().decode('utf-8', errors='replace')
soup = BeautifulSoup(raw_html, 'html.parser')

extracted, method = extract_paragraphs_properly(soup)

para_count = extracted.count('\n\n') + 1  # Double newlines + 1 = paragraph count

print(f"\nMethod: {method}")
print(f"Estimated paragraphs: {para_count}")
print(f"Total characters: {len(extracted)}")

print("\n" + "-" * 70)
print("First 3 paragraphs of teaching content:")
print("-" * 70)

paragraphs = extracted.split('\n\n')
for i, para in enumerate(paragraphs[:3], 1):
    print(f"\n[Paragraph {i}] ({len(para)} chars)")
    print((para[:300] + "...") if len(para) > 300 else para)


VERIFICATION: ACTUAL TEACHING CONTENT

Examining teaching chapter: Clear_Light_of_Bliss_Text_2019-08-43.xhtml

Method: HTML_P_TAGS
Estimated paragraphs: 2
Total characters: 26

----------------------------------------------------------------------
First 3 paragraphs of teaching content:
----------------------------------------------------------------------

[Paragraph 1] (10 chars)
page break

[Paragraph 2] (14 chars)
Losang Trinlay
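
The midpoint guess landed on a 26-character section rather than teaching content. One possible way to avoid that is to scan forward from the midpoint until a section has enough extracted text. The helper name first_substantial and the 1,000-character threshold below are assumptions made for this sketch, not part of the notebook:

```python
def first_substantial(texts, start, min_chars=1000):
    """Return the index of the first text at or after `start` with at least
    `min_chars` characters, wrapping around once; None if nothing qualifies."""
    n = len(texts)
    for offset in range(n):
        idx = (start + offset) % n
        if len(texts[idx]) >= min_chars:
            return idx
    return None


# Stand-in data: the second entry mirrors the near-empty section found above
texts = ["x" * 2822, "page break\n\nLosang Trinlay", "y" * 5000]
print(first_substantial(texts, start=len(texts) // 2))  # 2
```

In the notebook, texts would be the first element of extract_paragraphs_properly(...) for each entry in items, which costs one parse per section.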


In [58]:
# Final verdict (method and para_count come from the last section examined above)
print("\n" + "=" * 70)
print("EXPLORATION SUMMARY")
print("=" * 70)

print("\n✓ VERIFIED: Paragraph structure EXISTS in EPUB")
print(f"✓ Method: {method}")
print(f"✓ Paragraphs detected: {para_count}")
print(f"✓ Previous extraction DESTROYED this structure with re.sub(r'\\s+', ' ')")

print("\n" + "=" * 70)
print("RECOMMENDATION")
print("=" * 70)
print("\n1. ✓ EPUBs have paragraph structure")
print("2. ✓ We can extract it properly")
print("3. ✓ Proceed with full re-extraction using the corrected method")
print("\nReady to create full extraction + embedding pipeline.")
print("=" * 70)


EXPLORATION SUMMARY

✓ VERIFIED: Paragraph structure EXISTS in EPUB
✓ Method: HTML_P_TAGS
✓ Paragraphs detected: 2
✓ Previous extraction DESTROYED this structure with re.sub(r'\s+', ' ')

RECOMMENDATION

1. ✓ EPUBs have paragraph structure
2. ✓ We can extract it properly
3. ✓ Proceed with full re-extraction using the corrected method

Ready to create full extraction + embedding pipeline.
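
The summary above rests on the last section examined, which held only two short paragraphs. A broader check could tally structure across every section before declaring the result. The sketch below assumes a list of per-section dictionaries with name, p_tags and char_count keys; the helper name summarize_structure and the sample rows are illustrative, not measured values.

```python
def summarize_structure(stats):
    """Count sections with <p> tags and list sections that have text but no <p> tags."""
    with_p = sum(1 for s in stats if s["p_tags"] > 0)
    text_without_p = [s["name"] for s in stats
                      if s["p_tags"] == 0 and s["char_count"] > 0]
    return {
        "sections": len(stats),
        "with_p_tags": with_p,
        "text_without_p_tags": text_without_p,
    }


# Illustrative rows only (names and numbers are made up for the example)
stats = [
    {"name": "a.xhtml", "p_tags": 8, "char_count": 2824},
    {"name": "b.xhtml", "p_tags": 2, "char_count": 26},
    {"name": "c.xhtml", "p_tags": 0, "char_count": 1200},
]
summary = summarize_structure(stats)
print(summary)
```

Any section listed under text_without_p_tags would fall through to the NATURAL_BREAKS branch and deserves a manual look before full re-extraction.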


In [59]:
# COMPREHENSIVE DIAGNOSTIC - Multiple Chapters
print("\n" + "=" * 70)
print("COMPREHENSIVE DIAGNOSTIC - SCAN ALL CHAPTERS")
print("=" * 70)

chapter_stats = []

for i, item in enumerate(items):
    try:
        raw_html = item.get_content().decode('utf-8', errors='replace')
        soup = BeautifulSoup(raw_html, 'html.parser')
        
        # Count different structural elements
        p_tags = len(soup.find_all('p'))
        div_tags = len(soup.find_all('div'))
        br_tags = len(soup.find_all('br'))
        
        # Get text and count natural breaks
        text = soup.get_text()
        double_newlines = text.count('\n\n')
        char_count = len(text)
        
        chapter_stats.append({
            'index': i,
            'name': item.get_name(),
            'p_tags': p_tags,
            'div_tags': div_tags,
            'br_tags': br_tags,
            'double_newlines': double_newlines,
            'char_count': char_count
        })
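    except Exception as exc:
        # Report the failure instead of silently dropping the section
        print(f"  Skipped {item.get_name()}: {exc}")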

print(f"\nAnalyzed {len(chapter_stats)} sections")
print("\nTop 10 sections by content size:")
print("-" * 70)

# Backslashes are not allowed inside f-string expressions before Python 3.12,
# so build the label first
newline_label = "\\n\\n"
print(f"{'Idx':<5} {'<p>':<6} {'Chars':<8} {newline_label:<6} {'Name':<40}")
print("-" * 70)

# Sort by character count to find substantial teaching chapters
sorted_stats = sorted(chapter_stats, key=lambda x: x['char_count'], reverse=True)[:10]
for stat in sorted_stats:
    print(f"{stat['index']:<5} {stat['p_tags']:<6} {stat['char_count']:<8} {stat['double_newlines']:<6} {stat['name'][:40]:<40}")

print("\n" + "=" * 70)
print("DETAILED LOOK AT LARGEST TEACHING CHAPTER")
print("=" * 70)

# Get the chapter with most content
largest = sorted_stats[0]
largest_item = items[largest['index']]

print(f"\nChapter: {largest['name']}")
print(f"Characters: {largest['char_count']}")
print(f"<p> tags: {largest['p_tags']}")

raw_html = largest_item.get_content().decode('utf-8', errors='replace')
soup = BeautifulSoup(raw_html, 'html.parser')

# Show structure
paragraphs = soup.find_all('p')
if paragraphs:
    print(f"\nFirst 5 paragraphs from this chapter:")
    print("-" * 70)
    for i, p in enumerate(paragraphs[:5], 1):
        text = p.get_text().strip()
        print(f"\n[Para {i}] ({len(text)} chars)")
        print((text[:200] + "...") if len(text) > 200 else text)
else:
    print("\n⚠️ NO <p> TAGS FOUND - checking text structure...")
    text = soup.get_text()
    print(f"Raw text sample:\n{repr(text[:500])}")



COMPREHENSIVE DIAGNOSTIC - SCAN ALL CHAPTERS

Analyzed 89 sections

Top 10 sections by content size:
----------------------------------------------------------------------
Idx   <p>    Chars    \n\n   Name                                    
----------------------------------------------------------------------
43    94     45846    5      Clear_Light_of_Bliss_Text_2019-08-42.xht
31    77     44875    5      Clear_Light_of_Bliss_Text_2019-08-30.xht
80    154    37713    4      Clear_Light_of_Bliss_Text_2019-08-79.xht
39    46     29830    5      Clear_Light_of_Bliss_Text_2019-08-38.xht
59    51     27623    4      Clear_Light_of_Bliss_Text_2019-08-58.xht
84    1109   23952    6      Clear_Light_of_Bliss_Text_2019-08-83.xht
65    59     23709    4      Clear_Light_of_Bliss_Text_2019-08-64.xht
63    42     19505    4      Clear_Light_of_Bliss_Text_2019-08-62.xht
35    72     18123    4      Clear_Light_of_Bliss_Text_2019-08-34.xht
12    82     16767    4      Clear_Light_of_Bliss_Text_2