# PDF Extraction from Webpages

This notebook extracts all PDF document references from a single webpage, including:
- Standard HTML links (`<a href="...pdf">`)
- Embedded documents (`<iframe>`, `<embed>`, `<object>`)
- PDFs referenced in JavaScript code (configuration objects, viewer plugins, etc.)

It captures both the PDF URLs and their associated contextual labels where available.
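
As a rough illustration of the output shape, the sketch below builds two records with the same four fields the extractor emits (`url`, `source`, `anchor_text`, `element`) and serializes them to JSON and CSV. The `example.com` URLs and the label text are made up for illustration and do not come from a real page.

```python
import csv
import json
from io import StringIO

# Hypothetical records; the field names mirror the extractor's output.
records = [
    {
        "url": "https://example.com/docs/report.pdf",
        "source": "HTML Link",
        "anchor_text": "Annual report",
        "element": "a",
    },
    {
        "url": "https://example.com/files/brochure.pdf",
        "source": "JavaScript Code",
        "anchor_text": "JavaScript-embedded PDF",
        "element": "script",
    },
]

# JSON export: one list of objects.
as_json = json.dumps(records, indent=2)

# CSV export: one header row, then one row per record.
buffer = StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "source", "anchor_text", "element"])
writer.writeheader()
writer.writerows(records)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```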

## Installation and Imports

In [None]:
import subprocess
import sys

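# Install required packages. The pip name and the import name can differ
# (beautifulsoup4 is imported as bs4), so map one to the other.
packages = {'playwright': 'playwright', 'beautifulsoup4': 'bs4', 'requests': 'requests'}
for package, module_name in packages.items():
    try:
        __import__(module_name)
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])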

# Install the Chromium browser that Playwright drives
subprocess.check_call([sys.executable, "-m", "playwright", "install", "chromium"])


Installing playwright...
Installing beautifulsoup4...
Installing requests...


FileNotFoundError: [WinError 2] The system cannot find the file specified

In [None]:
import re
import json
from typing import List, Dict, Tuple, Optional
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

## PDF Extraction Functions

In [None]:
class PDFExtractor:
    """Extract PDF references from a webpage using multiple detection methods."""
    
    def __init__(self, url: str):
        self.url = url
        self.base_url = self._get_base_url(url)
        self.html_content = None
        self.page_object = None
        self.pdfs = []
    
    def _get_base_url(self, url: str) -> str:
        """Extract base URL for resolving relative paths."""
        parsed = urlparse(url)
        return f"{parsed.scheme}://{parsed.netloc}"
    
    def _normalize_url(self, url: str) -> Optional[str]:
        """Normalize and validate URLs, handling relative paths and escaped characters."""
        if not url:
            return None
        
        # Unescape common escape sequences
        url = url.replace('\\/', '/')
        url = url.replace('\\\\', '\\')
        
        # Convert relative URLs to absolute
        if url.startswith('/'):
            url = urljoin(self.base_url, url)
        elif not url.startswith('http'):
            url = urljoin(self.url, url)
        
        # Keep only URLs that contain ".pdf" (this also accepts query
        # strings such as file.pdf?v=2)
        if '.pdf' in url.lower():
            return url
        
        return None
    
    def fetch_page(self) -> bool:
        """Fetch the webpage using Playwright."""
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                page = browser.new_page()
                page.goto(self.url, wait_until='networkidle', timeout=30000)
                
                # Wait for JavaScript to execute
                page.wait_for_timeout(2000)
                
                # Get the rendered HTML. The page object is not kept because
                # it becomes unusable once the browser is closed.
                self.html_content = page.content()
                
                browser.close()
                return True
        except Exception as e:
            print(f"Error fetching page: {e}")
            return False
    
    def extract_from_html_links(self) -> List[Dict]:
        """Extract PDFs from standard HTML <a> tags."""
        pdfs = []
        soup = BeautifulSoup(self.html_content, 'html.parser')
        
        for link in soup.find_all('a', href=True):
            href = link.get('href', '')
            if '.pdf' in href.lower():
                normalized_url = self._normalize_url(href)
                if normalized_url:
                    # Get anchor text
                    anchor_text = link.get_text(strip=True)
                    
                    # Get nearby heading if no anchor text
                    if not anchor_text:
                        parent = link.find_parent(['div', 'section', 'article'])
                        if parent:
                            heading = parent.find(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
                            if heading:
                                anchor_text = heading.get_text(strip=True)
                    
                    pdfs.append({
                        'url': normalized_url,
                        'source': 'HTML Link',
                        'anchor_text': anchor_text or 'No text',
                        'element': 'a'
                    })
        
        return pdfs
    
    def extract_from_embedded_documents(self) -> List[Dict]:
        """Extract PDFs from <iframe>, <embed>, and <object> tags."""
        pdfs = []
        soup = BeautifulSoup(self.html_content, 'html.parser')
        
        # Extract from iframes
        for iframe in soup.find_all('iframe'):
            src = iframe.get('src', '')
            if src and '.pdf' in src.lower():
                normalized_url = self._normalize_url(src)
                if normalized_url:
                    title = iframe.get('title', '') or iframe.get('name', '')
                    pdfs.append({
                        'url': normalized_url,
                        'source': 'Embedded Document (iframe)',
                        'anchor_text': title or 'Embedded PDF',
                        'element': 'iframe'
                    })
        
        # Extract from embed tags
        for embed in soup.find_all('embed'):
            src = embed.get('src', '')
            if src and '.pdf' in src.lower():
                normalized_url = self._normalize_url(src)
                if normalized_url:
                    pdfs.append({
                        'url': normalized_url,
                        'source': 'Embedded Document (embed)',
                        'anchor_text': embed.get('title', '') or 'Embedded PDF',
                        'element': 'embed'
                    })
        
        # Extract from object tags
        for obj in soup.find_all('object'):
            data = obj.get('data', '')
            if data and '.pdf' in data.lower():
                normalized_url = self._normalize_url(data)
                if normalized_url:
                    pdfs.append({
                        'url': normalized_url,
                        'source': 'Embedded Document (object)',
                        'anchor_text': obj.get('title', '') or 'Embedded PDF',
                        'element': 'object'
                    })
        
        return pdfs
    
    def extract_from_javascript(self) -> List[Dict]:
        """Extract PDFs from JavaScript code, including configuration objects."""
        pdfs = []
        
        # Patterns for common PDF references in JavaScript.
        # The dot in ".pdf" is escaped so it matches a literal dot only.
        patterns = [
            r'["\']([^"\']*/[^"\']*\.pdf)["\']',  # String literals with .pdf
            r'pdfUrl["\']?\s*:\s*["\']([^"\']*\.pdf)["\']',  # pdfUrl property
            r'pdf["\']?\s*:\s*["\']([^"\']*\.pdf)["\']',  # pdf property
            r'url["\']?\s*:\s*["\']([^"\']*\.pdf)["\']',  # url property
            r'src["\']?\s*:\s*["\']([^"\']*\.pdf)["\']',  # src property
            r'file["\']?\s*:\s*["\']([^"\']*\.pdf)["\']',  # file property
            r'document["\']?\s*:\s*["\']([^"\']*\.pdf)["\']',  # document property
            r'flipbookOptions.*?pdfUrl["\']?\s*:\s*["\']([^"\']*\.pdf)["\']',  # Flipbook specific
        ]
        
        found_urls = set()
        
        for pattern in patterns:
            matches = re.finditer(pattern, self.html_content, re.IGNORECASE | re.DOTALL)
            for match in matches:
                url = match.group(1)
                normalized_url = self._normalize_url(url)
                if normalized_url and normalized_url not in found_urls:
                    found_urls.add(normalized_url)
                    pdfs.append({
                        'url': normalized_url,
                        'source': 'JavaScript Code',
                        'anchor_text': 'JavaScript-embedded PDF',
                        'element': 'script'
                    })
        
        # Extract from JSON-like structures in script tags
        soup = BeautifulSoup(self.html_content, 'html.parser')
        # Scan every <script> tag, including those with an explicit type attribute
        for script in soup.find_all('script'):
            if script.string:
                # Look for PDF URLs in JSON objects (literal ".pdf" only)
                json_pattern = r'[{,]\s*["\']?(?:pdf|url|src|file)["\']?\s*:\s*["\']([^"\']*\.pdf)["\']'
                matches = re.finditer(json_pattern, script.string, re.IGNORECASE)
                for match in matches:
                    url = match.group(1)
                    normalized_url = self._normalize_url(url)
                    if normalized_url and normalized_url not in found_urls:
                        found_urls.add(normalized_url)
                        # Try to extract context from script id or nearby elements
                        script_id = script.get('id', '')
                        pdfs.append({
                            'url': normalized_url,
                            'source': 'JavaScript Configuration',
                            'anchor_text': f'Script: {script_id}' if script_id else 'JavaScript Configuration',
                            'element': 'script'
                        })
        
        return pdfs
    
    def extract_all(self) -> List[Dict]:
        """Extract all PDF references from the webpage."""
        print(f"Fetching page: {self.url}")
        if not self.fetch_page():
            return []
        
        print("Extracting PDFs from HTML links...")
        self.pdfs.extend(self.extract_from_html_links())
        
        print("Extracting PDFs from embedded documents...")
        self.pdfs.extend(self.extract_from_embedded_documents())
        
        print("Extracting PDFs from JavaScript code...")
        self.pdfs.extend(self.extract_from_javascript())
        
        # Remove duplicates
        seen_urls = set()
        unique_pdfs = []
        for pdf in self.pdfs:
            if pdf['url'] not in seen_urls:
                seen_urls.add(pdf['url'])
                unique_pdfs.append(pdf)
        
        self.pdfs = unique_pdfs
        return self.pdfs
    
    def display_results(self):
        """Display extraction results in a formatted manner."""
        print(f"\n{'='*80}")
        print(f"PDF Extraction Results for: {self.url}")
        print(f"{'='*80}")
        print(f"Total PDFs found: {len(self.pdfs)}\n")
        
        if not self.pdfs:
            print("No PDFs found on this page.")
            return
        
        for i, pdf in enumerate(self.pdfs, 1):
            print(f"{i}. PDF URL: {pdf['url']}")
            print(f"   Source: {pdf['source']}")
            print(f"   Label: {pdf['anchor_text']}")
            print(f"   Element: {pdf['element']}")
            print()
    
    def to_json(self) -> str:
        """Export results as JSON."""
        return json.dumps(self.pdfs, indent=2)
    
    def to_csv(self) -> str:
        """Export results as CSV."""
        import csv
        from io import StringIO
        
        output = StringIO()
        writer = csv.DictWriter(output, fieldnames=['url', 'source', 'anchor_text', 'element'])
        writer.writeheader()
        writer.writerows(self.pdfs)
        return output.getvalue()

## Usage Example

In [None]:
# Example: Extract PDFs from the Siddhartha accreditations page
url = "https://siddhartha.org.in/accreditations/"

extractor = PDFExtractor(url)
pdfs = extractor.extract_all()

# Display results
extractor.display_results()

## Export Results

In [None]:
# Export as JSON
json_output = extractor.to_json()
print("JSON Output:")
print(json_output)

# Save to file
with open('pdf_extraction_results.json', 'w') as f:
    f.write(json_output)
print("\nResults saved to 'pdf_extraction_results.json'")

In [None]:
# Export as CSV
csv_output = extractor.to_csv()
print("CSV Output:")
print(csv_output)

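# Save to file. newline='' stops the csv module's \r\n line endings from
# being translated a second time on Windows.
with open('pdf_extraction_results.csv', 'w', newline='', encoding='utf-8') as f:
    f.write(csv_output)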
print("Results saved to 'pdf_extraction_results.csv'")

## Advanced Usage: Custom URL

In [None]:
# You can use this with any URL
custom_url = input("Enter the URL to extract PDFs from: ")
custom_extractor = PDFExtractor(custom_url)
custom_pdfs = custom_extractor.extract_all()
custom_extractor.display_results()

## Summary of Extraction Methods

This notebook extracts PDFs using the following methods:

1. **HTML Links**: Scans all `<a>` tags whose `href` attribute contains `.pdf`
2. **Embedded Documents**: Checks `<iframe>`, `<embed>`, and `<object>` tags for PDF references
3. **JavaScript Code**: Uses regex patterns to find PDF URLs in:
   - String literals anywhere in the rendered HTML, including inline scripts
   - JSON-like configuration objects (e.g., flipbook options)
   - Common property names (pdfUrl, pdf, url, src, file, document)
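
The regex-based JavaScript method can be sketched on a static string, without a browser. The page fragment below is hypothetical (the `flipbookOptions` object and the `example.com` URLs are made up), and `find_pdf_strings` is an illustrative helper rather than part of the notebook's class.

```python
import re
from typing import List

# Hypothetical page fragment with escaped slashes, as often seen in inline JSON.
SAMPLE_HTML = r'''
<script>
  var flipbookOptions = { pdfUrl: "https:\/\/example.com\/files\/brochure.pdf", height: 600 };
  var viewer = { file: '/docs/report.PDF', title: "notpdf" };
</script>
'''

# Property names taken from the list above; the dot is escaped so that only
# a literal ".pdf" suffix matches.
PROPERTY_PATTERN = re.compile(
    r'''(?:pdfUrl|pdf|url|src|file|document)["']?\s*:\s*["']([^"']*\.pdf)["']''',
    re.IGNORECASE,
)


def find_pdf_strings(text: str) -> List[str]:
    """Return PDF URL strings in first-seen order, without duplicates."""
    seen: List[str] = []
    for match in PROPERTY_PATTERN.finditer(text):
        # Undo the backslash-escaped slashes used inside JavaScript strings.
        url = match.group(1).replace("\\/", "/")
        if url not in seen:
            seen.append(url)
    return seen


print(find_pdf_strings(SAMPLE_HTML))
```

The value `"notpdf"` is skipped because it has no literal `.pdf` suffix; with an unescaped dot in the pattern, strings like that could match by accident.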

### Features:
- **URL Normalization**: Converts relative URLs to absolute URLs
- **Escape Handling**: Unescapes `\/` and `\\` sequences that appear in URLs inside JavaScript strings
- **Context Extraction**: Captures anchor text, labels, and nearby headings
- **Deduplication**: Removes duplicate PDF URLs
- **Multiple Export Formats**: JSON and CSV output options
- **Headless Browser**: Uses Playwright for JavaScript execution
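
URL normalization and deduplication can be sketched with the standard library alone. The helper names `normalize_pdf_url` and `dedupe` are hypothetical, and the sketch checks that the URL *path* ends in `.pdf`, which is a stricter test than the notebook's "contains `.pdf`" check; treat that as an assumption, not the notebook's behavior.

```python
from typing import Iterable, List, Optional
from urllib.parse import urljoin, urlparse


def normalize_pdf_url(raw: str, page_url: str) -> Optional[str]:
    """Resolve a candidate URL against the page URL; return None if it is not a PDF."""
    if not raw:
        return None
    # Undo backslash-escaped slashes from JavaScript/JSON strings.
    cleaned = raw.replace("\\/", "/")
    # urljoin resolves relative paths and leaves absolute URLs unchanged.
    absolute = urljoin(page_url, cleaned)
    # Assumption: only the path is inspected, so "file.pdf?v=2" still qualifies.
    if urlparse(absolute).path.lower().endswith(".pdf"):
        return absolute
    return None


def dedupe(urls: Iterable[Optional[str]]) -> List[str]:
    """Drop None values and repeated URLs while keeping first-seen order."""
    seen = set()
    unique: List[str] = []
    for url in urls:
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


page = "https://example.com/accreditations/"  # hypothetical page URL
candidates = [
    "/docs/a.pdf",
    "files/b.pdf?v=2",
    "https:\\/\\/example.com\\/docs\\/a.pdf",  # escaped duplicate of the first
    "about.html",
]
results = dedupe(normalize_pdf_url(c, page) for c in candidates)
print(results)
```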

### Limitations:
- Does not perform recursive crawling (single page only)
- JavaScript-only PDFs may lack contextual labels
- Some obfuscated or dynamically generated URLs may not be detected
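
One possible way to catch some dynamically generated URLs is to watch network responses while the page loads instead of searching the HTML. The sketch below is not what the notebook does; `looks_like_pdf_response` and `collect_pdf_responses` are hypothetical helpers, and the approach assumes the page actually requests the PDF during load.

```python
from typing import Dict, List
from urllib.parse import urlparse


def looks_like_pdf_response(url: str, headers: Dict[str, str]) -> bool:
    """Decide from the Content-Type header or the URL path whether a response is a PDF."""
    content_type = headers.get("content-type", "").lower()
    if "application/pdf" in content_type:
        return True
    return urlparse(url).path.lower().endswith(".pdf")


def collect_pdf_responses(url: str, timeout_ms: int = 30000) -> List[str]:
    """Load the page in headless Chromium and record responses that look like PDFs."""
    # Imported lazily so the pure helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    found: List[str] = []

    def on_response(response) -> None:
        # response.url and response.headers are properties in Playwright's Python API.
        if looks_like_pdf_response(response.url, response.headers):
            found.append(response.url)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.on("response", on_response)
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
        finally:
            browser.close()
    return found
```

PDFs that are only fetched after a user action, such as clicking a viewer button, would still be missed by this approach.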