# EPUB DRM Detection

**Purpose:** Verify that previously downloaded EPUBs are still readable and not DRM-protected.

**Context:** Before building the Buddhist RAG system on Geshe Kelsang Gyatso's teachings,
we need to confirm our EPUB files can actually be read programmatically.

**What This Checks:**
1. Can we open the EPUB as a ZIP file?
2. Are there DRM indicators (`META-INF/encryption.xml`, `META-INF/rights.xml`)?
3. Can ebooklib parse it?
4. Can we actually extract and read content?

**Why This Matters:** The sacred nature of these texts means we need to ensure
we can properly preserve their structure. DRM would prevent this.
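Check 1 can be made slightly stricter than "opens as a ZIP". The sketch below assumes the usual EPUB container layout: a `mimetype` entry containing `application/epub+zip` and a `META-INF/container.xml` entry. The helper name `looks_like_epub_container` is hypothetical and is not used elsewhere in this notebook.

```python
import io
import zipfile

def looks_like_epub_container(source):
    """Return (ok, reason) for a path or a file-like object.

    Hypothetical helper: checks the ZIP layout only, not the book content.
    """
    try:
        with zipfile.ZipFile(source, "r") as zf:
            names = zf.namelist()
            if "mimetype" not in names:
                return False, "missing mimetype entry"
            mimetype = zf.read("mimetype").decode("ascii", errors="replace").strip()
            if mimetype != "application/epub+zip":
                return False, f"unexpected mimetype: {mimetype!r}"
            if "META-INF/container.xml" not in names:
                return False, "missing META-INF/container.xml"
    except zipfile.BadZipFile:
        return False, "not a valid ZIP file"
    return True, "ok"

# Demonstrate on a tiny in-memory archive instead of a real book
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("mimetype", "application/epub+zip")
    zf.writestr("META-INF/container.xml", "<container/>")
demo.seek(0)
print(looks_like_epub_container(demo))
```

A file that passes this check can still be encrypted, so it complements the DRM checks rather than replacing them.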

In [9]:
import os
import zipfile
from pathlib import Path
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
from IPython.display import display, HTML

print("✓ Libraries loaded")

✓ Libraries loaded


## Configuration

Set the path to your EPUB directory or specific file:

In [10]:
# CONFIGURE THIS PATH
EPUB_PATH = r"/home/matt/Documents/gesha_la_rag/epub_directory"

# OR check a single file:
# EPUB_PATH = r"C:\Users\DELL\Documents\gesha_la_rag\epub_directory\epub_directory\Clear_Light_of_Bliss.epub"

print(f"Target: {EPUB_PATH}")

Target: /home/matt/Documents/gesha_la_rag/epub_directory


## Core Functions

In [11]:
def check_single_epub(filepath):
    """
    Comprehensive DRM check on a single EPUB file.
    
    Returns: (status, details)
        status: 'ACCESSIBLE', 'DRM_PROTECTED', 'CORRUPTED', 'UNKNOWN'
        details: dict with specific findings
    """
    details = {
        'can_open_as_zip': False,
        'has_encryption_xml': False,
        'has_rights_xml': False,
        'can_open_with_ebooklib': False,
        'can_read_content': False,
        'content_sample': None,
        'paragraph_count': 0,
        'error': None
    }
    
    # Test 1: Can we open it as a ZIP? (EPUBs are ZIP files)
    try:
        with zipfile.ZipFile(filepath, 'r') as zip_ref:
            details['can_open_as_zip'] = True
            namelist = zip_ref.namelist()
            
            # Check for DRM indicators
            details['has_encryption_xml'] = 'META-INF/encryption.xml' in namelist
            details['has_rights_xml'] = 'META-INF/rights.xml' in namelist
            
    except zipfile.BadZipFile:
        details['error'] = "Not a valid ZIP file"
        return 'CORRUPTED', details
    except Exception as e:
        details['error'] = f"ZIP error: {str(e)}"
        return 'UNKNOWN', details
    
    # Test 2: Can ebooklib open it?
    try:
        book = epub.read_epub(filepath)
        details['can_open_with_ebooklib'] = True
        
        # Test 3: Can we actually read content?
        items = [item for item in book.get_items() if item.get_type() == ebooklib.ITEM_DOCUMENT]
        
        if items:
            # Try to read first substantial content item (skip cover pages)
            for item in items[:15]:  # Check first 15 sections to skip title pages
                try:
                    raw_html = item.get_content().decode('utf-8', errors='replace')
                    soup = BeautifulSoup(raw_html, 'html.parser')
                    
                    # Count paragraphs
                    paragraphs = soup.find_all('p')
                    if paragraphs:
                        details['paragraph_count'] = len(paragraphs)
                    
                    text = soup.get_text().strip()
                    
                    # Look for actual content (not just titles)
                    if len(text) > 100:
                        details['can_read_content'] = True
                        details['content_sample'] = text[:300]
                        break
                        
                except Exception as e:
                    continue
        
        if not details['can_read_content']:
            details['error'] = "Could open file but couldn't read actual content"
            return 'DRM_PROTECTED', details
            
    except Exception as e:
        details['error'] = f"ebooklib error: {str(e)}"
        
        # If we can open as ZIP but not with ebooklib, likely DRM
        if details['can_open_as_zip']:
            return 'DRM_PROTECTED', details
        else:
            return 'CORRUPTED', details
    
    # Determine final status
    if details['has_encryption_xml'] or details['has_rights_xml']:
        return 'DRM_PROTECTED', details
    elif details['can_read_content']:
        return 'ACCESSIBLE', details
    else:
        return 'UNKNOWN', details

print("✓ Check function defined")

✓ Check function defined
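One caveat about the `encryption.xml` test: the file's presence alone does not prove DRM, because EPUBs also use it to list obfuscated embedded fonts. A book whose content is readable but which is still flagged may fall into that category. The sketch below is one way to tell the cases apart, assuming the two algorithm identifiers commonly used for font obfuscation (IDPF and Adobe). The function name `classify_encryption_xml`, its return labels, and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML Encryption namespace used by elements inside encryption.xml
ENC_NS = "{http://www.w3.org/2001/04/xmlenc#}"

# Assumption: these identifiers mark font obfuscation, not content DRM
FONT_OBFUSCATION_ALGORITHMS = {
    "http://www.idpf.org/2008/embedding",
    "http://ns.adobe.com/pdf/enc#RC",
}

def classify_encryption_xml(xml_bytes):
    """Classify the contents of META-INF/encryption.xml (hypothetical labels)."""
    root = ET.fromstring(xml_bytes)
    algorithms = {
        method.get("Algorithm")
        for method in root.iter(f"{ENC_NS}EncryptionMethod")
    }
    if not algorithms:
        return "no_encrypted_entries"
    if algorithms <= FONT_OBFUSCATION_ALGORITHMS:
        return "font_obfuscation_only"
    return "content_encryption"

# Made-up sample that declares a single obfuscated font
FONT_ONLY_SAMPLE = b"""<?xml version="1.0"?>
<encryption xmlns="urn:oasis:names:tc:opendocument:xmlns:container"
            xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
  <enc:EncryptedData>
    <enc:EncryptionMethod Algorithm="http://www.idpf.org/2008/embedding"/>
    <enc:CipherData>
      <enc:CipherReference URI="OEBPS/fonts/example.otf"/>
    </enc:CipherData>
  </enc:EncryptedData>
</encryption>"""

print(classify_encryption_xml(FONT_ONLY_SAMPLE))
```

In `check_single_epub`, the bytes could be obtained with `zip_ref.read('META-INF/encryption.xml')` inside the existing `with` block. Treating `font_obfuscation_only` as accessible would be a judgment call and is not something the function above does.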


## Run the Check

In [12]:
path = Path(EPUB_PATH)

# Determine if we're checking a file or directory
if path.is_file():
    epub_files = [path]
elif path.is_dir():
    epub_files = sorted(path.glob("**/*.epub"))
else:
    print(f"❌ Invalid path: {EPUB_PATH}")
    epub_files = []

if not epub_files:
    print("❌ No EPUB files found")
else:
    print("="*80)
    print("EPUB DRM DETECTION SCAN")
    print("="*80)
    print(f"\nFound {len(epub_files)} EPUB file(s)\n")
    
    # Track results
    results = {
        'ACCESSIBLE': [],
        'DRM_PROTECTED': [],
        'CORRUPTED': [],
        'UNKNOWN': []
    }
    
    # Check each file
    for i, filepath in enumerate(epub_files, 1):
        filename = filepath.name
        print(f"[{i}/{len(epub_files)}] Checking: {filename}")
        
        status, details = check_single_epub(filepath)
        results[status].append((filename, details))
        
        # Print result
        if status == 'ACCESSIBLE':
            print(f"  ✓ ACCESSIBLE - Can read content")
            if details['paragraph_count'] > 0:
                print(f"    Found {details['paragraph_count']} paragraphs in first chapter")
        # details['error'] always exists (it may be None), so .get() defaults never apply; use `or`
        elif status == 'DRM_PROTECTED':
            print(f"  ✗ DRM PROTECTED - {details['error'] or 'Encrypted'}")
        elif status == 'CORRUPTED':
            print(f"  ✗ CORRUPTED - {details['error'] or 'Unknown'}")
        else:
            print(f"  ? UNKNOWN - {details['error'] or 'Unclear status'}")
        
        print()
    
    # Summary
    print("="*80)
    print("RESULTS SUMMARY")
    print("="*80)
    print(f"\n✓ Accessible:     {len(results['ACCESSIBLE'])} files")
    print(f"✗ DRM Protected:  {len(results['DRM_PROTECTED'])} files")
    print(f"✗ Corrupted:      {len(results['CORRUPTED'])} files")
    print(f"? Unknown:        {len(results['UNKNOWN'])} files")
    
    # Show problematic files
    if results['DRM_PROTECTED']:
        print("\n" + "-"*80)
        print("DRM PROTECTED FILES:")
        print("-"*80)
        for filename, details in results['DRM_PROTECTED']:
            print(f"\n  • {filename}")
            if details['has_encryption_xml']:
                print(f"    - Has encryption.xml (encryption manifest: DRM or font obfuscation)")
            if details['has_rights_xml']:
                print(f"    - Has rights.xml (typically Adobe DRM)")
            if details['error']:
                print(f"    - Error: {details['error']}")
    
    if results['CORRUPTED']:
        print("\n" + "-"*80)
        print("CORRUPTED FILES:")
        print("-"*80)
        for filename, details in results['CORRUPTED']:
            print(f"\n  • {filename}")
            print(f"    - Error: {details['error']}")
    
    # Final verdict
    print("\n" + "="*80)
    if len(results['ACCESSIBLE']) == len(epub_files):
        print("🎉 ALL EPUBs ARE ACCESSIBLE!")
        print("You can proceed with building the RAG system.")
    elif results['DRM_PROTECTED']:
        print("⚠️  WARNING: Some EPUBs are DRM protected")
        print("You may need to request DRM-free versions from Tharpa Publications.")
    else:
        # No DRM found, but some files were corrupted or could not be classified
        print("⚠️  WARNING: Some EPUBs are corrupted or could not be classified")
    print("="*80)

EPUB DRM DETECTION SCAN

Found 26 EPUB file(s)

[1/26] Checking: Clear_Light_of_Bliss.epub
  ✓ ACCESSIBLE - Can read content
    Found 3 paragraphs in first chapter

[2/26] Checking: Essence-of-Vajrayana.epub
  ✓ ACCESSIBLE - Can read content
    Found 3 paragraphs in first chapter

[3/26] Checking: Great-Treasury-of-Merit.epub
  ✓ ACCESSIBLE - Can read content
    Found 3 paragraphs in first chapter

[4/26] Checking: Guide_to_Bodhisattva_s_Way_of_Life_2020.epub
  ✓ ACCESSIBLE - Can read content
    Found 3 paragraphs in first chapter

[5/26] Checking: Heart_Jewel.epub
  ✓ ACCESSIBLE - Can read content
    Found 4 paragraphs in first chapter

[6/26] Checking: How to Understand the Mind.epub
  ✓ ACCESSIBLE - Can read content
    Found 30 paragraphs in first chapter

[7/26] Checking: How-to-Solve-Our-Human-Problems-US.epub
  ✗ DRM PROTECTED - None

[8/26] Checking: How_to_Transform_Your_Life-US.epub
  ✗ DRM PROTECTED - None

[9/26] Checking: Introduction_to_Buddhism_US-20
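For a later ingestion step it may help to keep the scan results in a machine-readable form so that flagged files can be skipped. The following is a minimal sketch, assuming the `results` structure built above (status mapped to a list of `(filename, details)` tuples). The helper name `results_to_report` and the example data are hypothetical.

```python
import json

def results_to_report(results):
    """Flatten {status: [(filename, details), ...]} into JSON-friendly rows."""
    rows = []
    for status, entries in results.items():
        for filename, details in entries:
            rows.append({
                "file": filename,
                "status": status,
                "has_encryption_xml": details.get("has_encryption_xml", False),
                "has_rights_xml": details.get("has_rights_xml", False),
                "error": details.get("error"),
            })
    return sorted(rows, key=lambda row: row["file"])

# Made-up example data in the same shape as `results`
example_results = {
    "ACCESSIBLE": [("a.epub", {"has_encryption_xml": False, "has_rights_xml": False, "error": None})],
    "DRM_PROTECTED": [("b.epub", {"has_encryption_xml": True, "has_rights_xml": False, "error": None})],
    "CORRUPTED": [],
    "UNKNOWN": [],
}

report = results_to_report(example_results)
print(json.dumps(report, indent=2))
```

The JSON string could be written to disk with `Path.write_text`; the destination is left open here.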

## View Content Sample

For accessible files, let's look at a sample of the actual content:

In [13]:
# Show content sample from first accessible file
if results['ACCESSIBLE']:
    filename, details = results['ACCESSIBLE'][0]
    
    print(f"Sample from: {filename}")
    print("="*80)
    print()
    print(details['content_sample'])
    print()
    print("="*80)
    print("\n✓ Content is readable - paragraph structure preserved")
else:
    print("No accessible files to sample")

Sample from: Clear_Light_of_Bliss.epub

About the Author





Geshe Kelsang Gyatso is a fully accomplished meditation master and internationally renowned teacher of Buddhism who has pioneered the introduction of modern Buddhism into contemporary society. He is the author of 23 highly acclaimed books that perfectly transmit the ancient wis


✓ Content is readable - paragraph structure preserved
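The sample above comes from `get_text()`, which flattens the markup, so it shows that the text is readable but not where paragraph boundaries fall. To check paragraph structure directly, each `<p>` element can be collected on its own. The sketch below uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra packages. The class name `ParagraphCollector` and the sample markup are hypothetical, and the code assumes well-formed markup with explicitly closed `<p>` tags.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element as a separate string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_paragraph = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._in_paragraph = False
            # Collapse runs of whitespace and skip empty paragraphs
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_paragraph:
            self._chunks.append(data)

# Made-up markup standing in for one EPUB document item
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <em>paragraph</em>.</p>"
    "<p>  Second\n paragraph. </p>"
    "<p></p>"
    "</body></html>"
)

collector = ParagraphCollector()
collector.feed(sample_html)
for index, paragraph in enumerate(collector.paragraphs, 1):
    print(index, paragraph)
```

With BeautifulSoup, which this notebook already imports, the equivalent would be roughly `[p.get_text(strip=True) for p in soup.find_all('p')]`.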
