# Text Extraction Fundamentals

In this notebook, we'll explore how to extract text from various document formats using LlamaIndex readers.

## Learning Objectives

By the end of this notebook, you will be able to:
- Extract text from files (PDF, DOCX, CSV, JSON, Markdown), web pages, and databases
- Clean and normalize extracted text
- Extract metadata from documents
- Create a universal document processing pipeline

## Why Text Extraction Matters

Before we can build AI applications that work with documents, we need to extract and process their text. Different file formats require different extraction approaches, and extraction quality directly affects downstream tasks: text that is dropped or garbled at this stage cannot be recovered later by chunking, embedding, or retrieval.

In [None]:
## Extracting from Common Document Formats

from llama_index.readers.file import PDFReader, DocxReader
from llama_index.readers.web import SimpleWebPageReader
import pathlib

# Extract from PDF
pdf_reader = PDFReader()
pdf_docs = pdf_reader.load_data(file=pathlib.Path("samples/pdf-report.pdf"))

# Extract from DOCX
docx_reader = DocxReader()
docx_docs = docx_reader.load_data(file=pathlib.Path("samples/docx-report.docx"))

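# Extract from Web
# html_to_text=True converts the page's HTML to plain text (requires the html2text package);
# without it the reader returns the raw HTML markup
web_reader = SimpleWebPageReader(html_to_text=True)
web_docs = web_reader.load_data(urls=["https://example.com"])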

# Display extracted text samples
print("Extracted Text Samples:\n")
print("=" * 80)
print(f"\nPDF extract ({len(pdf_docs[0].text)} chars):")
print(f"  {pdf_docs[0].text[:150]}...\n")

print(f"DOCX extract ({len(docx_docs[0].text)} chars):")
print(f"  {docx_docs[0].text[:150]}...\n")

print(f"Web extract ({len(web_docs[0].text)} chars):")
print(f"  {web_docs[0].text[:150]}...")

print("\n✓ Successfully extracted text from PDF, DOCX, and web sources")
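
The samples above only look at `docs[0]`, but a reader can return several `Document` objects for one file (`PDFReader`, for instance, typically yields one per page). Below is a minimal sketch of joining them into a single string. The helper name `join_document_text` is hypothetical, and `SimpleNamespace` objects stand in for real `Document` instances so the example runs without any sample files.

```python
from types import SimpleNamespace


def join_document_text(docs, separator="\n\n"):
    """Concatenate the text of every item, skipping empty ones."""
    parts = [d.text.strip() for d in docs if d.text and d.text.strip()]
    return separator.join(parts)


# Stand-ins for the per-page Document objects a reader might return.
pages = [
    SimpleNamespace(text="Page one text.\n"),
    SimpleNamespace(text="   "),  # a blank page
    SimpleNamespace(text="Page two text."),
]

full_text = join_document_text(pages)
print(f"{len(pages)} items -> {len(full_text)} chars")
```

The same call should work on `pdf_docs`, since the helper only relies on the `.text` attribute.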

In [None]:
## Extracting from Data Formats

from llama_index.readers.file import CSVReader, MarkdownReader
from llama_index.readers.json import JSONReader
from llama_index.readers.database import DatabaseReader
import pathlib

# CSV files
csv_reader = CSVReader()
csv_docs = csv_reader.load_data(file=pathlib.Path("samples/csv-data.csv"))

# JSON files
json_reader = JSONReader()
json_docs = json_reader.load_data(input_file="samples/json-data.json")

# Markdown files (pass a Path, consistent with the other file readers)
md_reader = MarkdownReader()
md_docs = md_reader.load_data(file=pathlib.Path("samples/README.md"))

# Databases
db_reader = DatabaseReader(uri="sqlite:///samples/database.db")
db_docs = db_reader.load_data(query="SELECT * FROM orders")

# Display extracted text samples
print("Structured Data Extraction:\n")
print("=" * 80)

print(f"\nCSV extract ({len(csv_docs[0].text)} chars):")
print(f"  {csv_docs[0].text[:150]}...\n")

print(f"JSON extract ({len(json_docs[0].text)} chars):")
print(f"  {json_docs[0].text[:150]}...\n")

print(f"Markdown extract ({len(md_docs[0].text)} chars):")
print(f"  {md_docs[0].text[:150]}...\n")

print(f"Database extract ({len(db_docs[0].text)} chars):")
print(f"  {db_docs[0].text[:150]}...")

print("\n✓ Successfully extracted text from CSV, JSON, Markdown, and database")
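
When rows need to be readable on their own (for example, one row per chunk), it can help to render each row as labelled `column: value` text. The sketch below uses only the standard library and is not a description of how `CSVReader` works internally. The function name `rows_to_text` and the sample columns are assumptions made for illustration.

```python
import csv
import io


def rows_to_text(csv_source):
    """Render each CSV row as a single 'column: value, ...' string."""
    reader = csv.DictReader(csv_source)
    return [
        ", ".join(f"{col}: {val}" for col, val in row.items())
        for row in reader
    ]


# In-memory sample data; a real file would be opened with open(path, newline="").
sample = io.StringIO("order_id,customer,total\n1,Ada,19.99\n2,Grace,5.00\n")
row_texts = rows_to_text(sample)
for line in row_texts:
    print(line)
```

Because every row carries its own column names, a single row still makes sense after it is separated from the header.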

In [None]:
## Text Cleaning and Normalization

import re
from llama_index.core.schema import Document

# Get raw text from the first extracted Document (for a PDF this is typically the first page)
raw_text = pdf_docs[0].text

def clean_text(text):
    """
    Clean and normalize extracted text.
    
    Args:
        text (str): Raw text to clean
        
    Returns:
        str: Cleaned text
    """
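    # Fix known OCR errors first, while the telltale characters are still present
    # (illustrative example; extend with patterns seen in your own data)
    text = text.replace('l<eyword', 'keyword')

    # Keep only word characters, whitespace, and basic punctuation.
    # Note: this also strips symbols such as %, $, /, ? and !
    text = re.sub(r'[^\w\s.,;:\-()\[\]{}"\']', '', text)

    # Collapse every whitespace run (including newlines and tabs) to one space
    text = re.sub(r'\s+', ' ', text)

    return text.strip()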

# Clean the text
cleaned_text = clean_text(raw_text)

# Compare original vs cleaned
print("Text Cleaning Comparison:\n")
print("=" * 80)
print(f"\nOriginal (first 100 chars):")
print(f"  {raw_text[:100]}\n")

print(f"Cleaned (first 100 chars):")
print(f"  {cleaned_text[:100]}\n")

print(f"Length comparison:")
print(f"  Original: {len(raw_text):,} characters")
print(f"  Cleaned:  {len(cleaned_text):,} characters")
print(f"  Removed:  {len(raw_text) - len(cleaned_text):,} characters")

print("\n✓ Text cleaned and normalized")
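
`clean_text` flattens the text to a single line. When paragraph boundaries matter downstream, a gentler variant may be preferable. The sketch below is one possible alternative, not a replacement prescribed by LlamaIndex: it applies Unicode NFKC normalization (which folds ligatures and non-breaking spaces that often appear in PDF output), rejoins words hyphenated across line breaks, and keeps blank lines between paragraphs. The name `normalize_text` is hypothetical.

```python
import re
import unicodedata


def normalize_text(text):
    """Normalize Unicode and whitespace while keeping paragraph breaks."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature (U+FB01)
    # becomes the two letters "fi" and a non-breaking space becomes a space.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words split by a hyphen at a line break ("extrac-\ntion").
    # Trade-off: a genuine compound broken at its hyphen is joined too.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, leaving newlines untouched.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim the spaces that sit directly around a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Text  extrac-\ntion is the \ufb01rst\u00a0step.\n\n\n\nNext   paragraph."
normalized = normalize_text(sample)
print(normalized)
```

Running this prints two paragraphs separated by one blank line, with the ligature and the line-break hyphen resolved.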

In [None]:
## Metadata Extraction

def extract_metadata(text, filename):
    """
    Extract metadata from document text.
    
    Args:
        text (str): Document text
        filename (str): Source filename
        
    Returns:
        dict: Extracted metadata
    """
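    metadata = {
        "source": filename,
        # Path.suffix is empty when the name has no extension
        "file_type": pathlib.Path(filename).suffix.lstrip('.').lower(),
    }

    # Extract title: use the first non-empty line if it is short enough
    first_line = next((ln.strip() for ln in text.split('\n') if ln.strip()), "")
    if first_line and len(first_line) < 100:  # Simple heuristic for title
        metadata["title"] = first_line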

    # Try to extract date with regex
    date_match = re.search(r'\d{1,2}[\/\-\.]\d{1,2}[\/\-\.]\d{2,4}', text)
    if date_match:
        metadata["date"] = date_match.group(0)

    return metadata

# Extract metadata
metadata = extract_metadata(raw_text, "samples/pdf-report.pdf")

# Create a document with cleaned text and metadata
processed_doc = Document(
    text=cleaned_text,
    metadata=metadata
)

print("Extracted Metadata:\n")
print("=" * 80)
for key, value in metadata.items():
    print(f"  {key:<12}: {value}")

print("\n✓ Metadata extracted and attached to document")
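
The date regex stores whatever string it matched, so `03/04/2024` and `3.4.24` end up as different metadata values. A minimal sketch of normalizing such matches to ISO 8601 follows. The helper `to_iso_date` and the `DATE_FORMATS` list are hypothetical, and the day-first ordering is an assumption that should be swapped for month-first sources.

```python
from datetime import datetime

# Candidate layouts, tried in order. Day-first is an assumption here.
DATE_FORMATS = (
    "%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y",
    "%d/%m/%y", "%d-%m-%y", "%d.%m.%y",
)


def to_iso_date(raw):
    """Return raw as YYYY-MM-DD, or None if no layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None


print(to_iso_date("31/12/2023"))
print(to_iso_date("5.1.24"))
print(to_iso_date("99/99/9999"))
```

Returning `None` for unparseable matches lets the caller decide whether to keep the raw string or drop the field.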

In [None]:
## Universal Document Processing Pipeline

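from llama_index.readers.file import HTMLTagReader

def process_document(file_path):
    """
    Process a document with appropriate reader and cleaning.
    
    Args:
        file_path (str): Path to the document
        
    Returns:
        Document | None: Processed LlamaIndex Document object, or None if
        no text could be extracted
    """
    path = pathlib.Path(file_path)

    # Determine file type from the extension
    file_type = path.suffix.lstrip('.').lower()

    # Select appropriate reader
    if file_type == 'pdf':
        reader = PDFReader()
    elif file_type == 'docx':
        # Legacy .doc files use a different format and must be converted first
        reader = DocxReader()
    elif file_type in ['html', 'htm']:
        # SimpleWebPageReader fetches URLs; local HTML files need a file reader.
        # HTMLTagReader(tag="body") is one option (requires beautifulsoup4).
        reader = HTMLTagReader(tag="body")
    else:
        reader = None

    if reader is not None:
        # Join every Document the reader returns (PDFs yield one per page)
        docs = reader.load_data(file=path)
        raw_text = "\n".join(doc.text for doc in docs)
    else:
        # Default to simple text reading
        with open(path, 'r', encoding='utf-8') as f:
            raw_text = f.read()

    if not raw_text.strip():
        return None

    # Clean the text
    cleaned_text = clean_text(raw_text)

    # Extract metadata from the raw text, before cleaning removes line breaks
    metadata = extract_metadata(raw_text, str(path))

    # Create processed document
    return Document(text=cleaned_text, metadata=metadata)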

# Test the pipeline
processed_doc = process_document("samples/pdf-report.pdf")

print("Document Processing Pipeline:\n")
print("=" * 80)
print(f"\nProcessed document:")
print(f"  Characters: {len(processed_doc.text):,}")
print(f"  Metadata:   {processed_doc.metadata}")

print("\n✓ Universal processing pipeline ready for use")
print("\nThis pipeline selects a reader from the file extension (PDF, DOCX, HTML),")
print("falls back to plain-text reading for other files, and returns a")
print("Document ready for downstream use.")
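
As more formats are added, the `if`/`elif` chain in `process_document` can be replaced by a lookup table keyed on file extension. The sketch below illustrates that pattern with standard-library loaders only. `LOADERS`, `load_text`, and the two loader functions are hypothetical names, and in the notebook the table entries would wrap the LlamaIndex readers instead.

```python
import json
import tempfile
from pathlib import Path


def load_plain_text(path):
    """Read the file as UTF-8, replacing undecodable bytes."""
    return path.read_text(encoding="utf-8", errors="replace")


def load_json_text(path):
    """Re-serialize JSON with sorted keys so the text is stable."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return json.dumps(data, indent=2, sort_keys=True)


# Lowercase extension -> loader callable. Adding a format is one new entry.
LOADERS = {
    ".txt": load_plain_text,
    ".md": load_plain_text,
    ".json": load_json_text,
}


def load_text(path):
    """Dispatch to the loader registered for the path's extension."""
    path = Path(path)
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported file type: {path.suffix or '(none)'}")
    return loader(path)


# Demonstrate with temporary files so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "NOTE.TXT"  # upper-case extension still matches
    note.write_text("hello", encoding="utf-8")
    data = Path(tmp) / "data.json"
    data.write_text('{"b": 1, "a": 2}', encoding="utf-8")
    note_text = load_text(note)
    json_text = load_text(data)

print(note_text)
print(json_text)
```

Raising on unknown extensions, rather than silently reading them as text, makes unsupported inputs visible early. Whether that or a plain-text fallback is preferable depends on the data.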

## Summary

We've covered the complete text extraction pipeline:

### Key Components

1. **Format-Specific Readers** - LlamaIndex provides readers for common formats:
   - Documents: PDF, DOCX
   - Web: HTML pages
   - Data: CSV, JSON, Markdown
   - Databases: SQL queries

2. **Text Cleaning** - Normalize extracted text by:
   - Removing excessive whitespace
   - Filtering special characters
   - Fixing common OCR errors

3. **Metadata Extraction** - Enrich documents with:
   - Source information
   - File type
   - Titles and dates
   - Custom metadata

4. **Universal Pipeline** - A single function that:
   - Detects file type from the file extension
   - Selects appropriate reader
   - Applies consistent cleaning
   - Extracts metadata

### Best Practices

- **Choose the right reader** for each format
- **Clean consistently** to avoid downstream issues
- **Extract metadata** to enrich your documents
- **Build pipelines** for reusable, maintainable code
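
To illustrate the last point, here is a minimal sketch of applying a per-file function to a whole folder while isolating failures, so one unreadable file does not stop the run. `process_folder` is a hypothetical helper, and `count_chars` stands in for `process_document` so the example runs without LlamaIndex or sample files.

```python
import tempfile
from pathlib import Path


def process_folder(folder, process_fn, pattern="*"):
    """Apply process_fn to each matching file; collect results and errors."""
    results, failures = {}, {}
    for path in sorted(Path(folder).glob(pattern)):
        if not path.is_file():
            continue
        try:
            results[path.name] = process_fn(path)
        except Exception as exc:  # keep going and report the failure afterwards
            failures[path.name] = f"{type(exc).__name__}: {exc}"
    return results, failures


def count_chars(path):
    """Stand-in for process_document: returns the character count."""
    return len(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("hello", encoding="utf-8")
    (Path(tmp) / "b.txt").write_bytes(b"\xff\xfe\xfa")  # not valid UTF-8
    results, failures = process_folder(tmp, count_chars, pattern="*.txt")

print(results)
print(sorted(failures))
```

Keeping failures in a separate mapping makes it easy to log them or retry with a different reader.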

This foundation enables building robust AI applications that work with real-world documents.