# Featherless.ai - PDF to Podcast Pipeline
This notebook demonstrates a complete workflow for:
1. **Extracting content** from PDF documents using PyMuPDF
2. **Processing and cleaning** the extracted text with the [Featherless.ai](http://featherless.ai) API
3. **Generating podcast-style content** from the processed text

The pipeline handles:
- PDF validation and extraction
- Text chunking for efficient processing
- API calls to Featherless AI for text cleaning
- Output management and preview

## Setup and Configuration

First, we need to install required libraries and set up our configuration.

## Configuration Parameters

- `pdf_path`: Path to the PDF file to be processed
- `DEFAULT_MODEL`: The Featherless AI model to use for text processing
- `BASE_URL`: The Featherless API endpoint
- `FEATHERLESS_API_KEY`: Your API key for authentication
- `CHUNK_SIZE`: The target size for text chunks (in characters)
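
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming you export the key in an environment variable named `FEATHERLESS_API_KEY` (the variable name and the `load_api_key` helper are illustrative and not part of any Featherless library), the key could be read at runtime instead:

```python
import os

def load_api_key(env_var: str = "FEATHERLESS_API_KEY", default: str = "") -> str:
    """Return the API key from the environment, falling back to `default`.

    Hypothetical helper: the environment variable name is an assumption.
    """
    key = os.environ.get(env_var, default).strip()
    if not key:
        # Fail early with a clear message instead of sending an empty token
        raise ValueError(f"Set the {env_var} environment variable before running the notebook")
    return key
```

With this in place, the configuration cell could assign `FEATHERLESS_API_KEY = load_api_key()` rather than a literal string.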

In [None]:
!pip install pymupdf4llm requests tqdm matplotlib torch
!pip install rich ipywidgets

In [2]:
# Configuration
pdf_path = '/path_to_your_pdf.pdf'  # Replace with the path to your PDF
DEFAULT_MODEL = "Qwen/Qwen2.5-72B-Instruct"  # Browse the model catalog at https://featherless.ai/models
BASE_URL = "https://api.featherless.ai/v1"
FEATHERLESS_API_KEY = "YOUR_FEATHERLESS_API_KEY"  # Available at https://featherless.ai/account/api-keys


In [3]:
import pymupdf4llm
from typing import Optional
import os
import torch
import requests
from tqdm.notebook import tqdm
import warnings

warnings.filterwarnings('ignore')

## PDF Handling Functions

These functions handle PDF validation, metadata extraction, and text extraction.

In [4]:
def validate_pdf(file_path: str) -> bool:
    if not os.path.exists(file_path):
        print(f"Error: File not found at path: {file_path}")
        return False
    if not file_path.lower().endswith('.pdf'):
        print("Error: File is not a PDF")
        return False
    return True
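
The check above only looks at the file name. As a complementary sketch, assuming a hypothetical helper `has_pdf_header` that is not part of the notebook, the first bytes of the file can be inspected instead, since PDF files conventionally begin with the marker `%PDF-`:

```python
def has_pdf_header(file_path: str) -> bool:
    """Return True if the file starts with the PDF marker bytes `%PDF-`.

    Hypothetical helper for illustration. Files with stray bytes before
    the marker are rejected here, even though some readers tolerate them.
    """
    try:
        with open(file_path, "rb") as f:
            return f.read(5) == b"%PDF-"
    except OSError:
        # Missing file, directory, or permission problem
        return False
```

A stricter `validate_pdf` could call this after the extension check, so that a renamed text file is caught before it reaches the extraction step.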

In [5]:
def extract_text_from_pdf(file_path: str, max_chars: int = 60000) -> Optional[str]:
    if not validate_pdf(file_path):
        return None
    
    try:
        # Convert PDF to markdown text
        markdown_text = pymupdf4llm.to_markdown(file_path)
        
        # Truncate if exceeds max_chars
        if len(markdown_text) > max_chars:
            print(f"Truncating text to {max_chars} characters")
            markdown_text = markdown_text[:max_chars]
        
        print(f"\nExtraction complete! Total characters: {len(markdown_text)}")
        return markdown_text
            
    except Exception as e:
        print(f"An unexpected error occurred: {str(e)}")
        return None

In [6]:
# Get PDF metadata
def get_pdf_metadata(file_path: str) -> Optional[dict]:
    if not validate_pdf(file_path):
        return None
    
    try:
        # Get metadata using page chunks feature
        data = pymupdf4llm.to_markdown(file_path, page_chunks=True)
        
        metadata = {
            'num_pages': len(data),
            'metadata': {
                'pages': [page.get('metadata', {}) for page in data]
            }
        }
        return metadata
    except Exception as e:
        print(f"Error extracting metadata: {str(e)}")
        return None

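## Extraction Run and Text Processing Configuration

The next cell runs the metadata and text extraction on `pdf_path`, previews the result, and saves it to `extracted_text.txt`. The cell after it defines the system prompt that tells the Featherless AI model how to clean the PDF text.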

In [None]:
# Extract metadata first
print("Extracting metadata...")
metadata = get_pdf_metadata(pdf_path)
if metadata:
    print("\nPDF Metadata:")
    print(f"Number of pages: {metadata['num_pages']}")
    print("Document info:")
    for key, value in metadata['metadata'].items():
        print(f"{key}: {value}")

# Extract text
print("\nExtracting text...")
extracted_text = extract_text_from_pdf(pdf_path)

# Display first 500 characters of extracted text as preview
if extracted_text:
    print("\nPreview of extracted text (first 500 characters):")
    print("-" * 50)
    print(extracted_text[:500])
    print("-" * 50)
    print(f"\nTotal characters extracted: {len(extracted_text)}")

# Optional: Save the extracted text to a file
if extracted_text:
    output_file = 'extracted_text.txt'
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(extracted_text)
    print(f"\nExtracted text has been saved to {output_file}")

In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"

SYS_PROMPT = """
You are a world class text pre-processor, here is the raw data from a PDF, please parse and return it in a way that is crispy and usable to send to a podcast writer.

The raw data is messed up with new lines, Latex math and you will see fluff that we can remove completely. Basically take away any details that you think might be useless in a podcast author's transcript.

Remember, the podcast could be on any topic whatsoever so the issues listed above are not exhaustive

Please be smart with what you remove and be creative ok?

Remember DO NOT START SUMMARIZING THIS, YOU ARE ONLY CLEANING UP THE TEXT AND RE-WRITING WHEN NEEDED

Be very smart and aggressive with removing details, you will get a running portion of the text and keep returning the processed text.

PLEASE DO NOT ADD MARKDOWN FORMATTING, STOP ADDING SPECIAL CHARACTERS THAT MARKDOWN CAPITALISATION ETC LIKES

ALWAYS start your response directly with processed text and NO ACKNOWLEDGEMENTS about my questions ok?
Here is the text:
"""

## Text Chunking Functions

The following function splits the extracted text into manageable chunks for processing.

In [9]:
def create_word_bounded_chunks(text, target_chunk_size):
    """
    Split text into chunks at word boundaries close to the target chunk size.
    
    Parameters:
    -----------
    text : str
        The input text to be split into chunks
    target_chunk_size : int
        Approximate size for each chunk in characters
        
    Returns:
    --------
    list of str
        Text split into chunks that don't break words
    """
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    
    for word in words:
        word_length = len(word) + 1  # +1 for the space
        if current_length + word_length > target_chunk_size and current_chunk:
            # Join the current chunk and add it to chunks
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
            current_length = word_length
        else:
            current_chunk.append(word)
            current_length += word_length
    
    # Add the last chunk if it exists
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks
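
To make the splitting behaviour concrete, here is a small self-contained sketch. `chunk_words` is a condensed restatement of `create_word_bounded_chunks`, renamed so the example runs on its own; the sample sentence and the target size of 15 are illustrative.

```python
from typing import List

def chunk_words(text: str, target_chunk_size: int) -> List[str]:
    """Condensed restatement of create_word_bounded_chunks for illustration."""
    chunks, current, length = [], [], 0
    for word in text.split():
        word_length = len(word) + 1  # +1 for the joining space
        if length + word_length > target_chunk_size and current:
            # Adding this word would overshoot: close the current chunk
            chunks.append(" ".join(current))
            current, length = [word], word_length
        else:
            current.append(word)
            length += word_length
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "the quick brown fox jumps over the lazy dog"
print(chunk_words(sample, 15))
# ['the quick', 'brown fox', 'jumps over the', 'lazy dog']

# Splitting on whitespace also collapses line breaks into single spaces
print(chunk_words("first line\n\nsecond line", 100))
# ['first line second line']
```

Two consequences follow from this design: a single word longer than the target size still becomes its own oversized chunk, and paragraph boundaries from the PDF are not preserved inside the chunks.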

## API Call Function

This function sends each text chunk to the Featherless AI API for processing.

In [10]:
def process_chunk(text_chunk, chunk_num):
    """Process a chunk of text using Featherless API"""
    messages = [
        {"role": "system", "content": SYS_PROMPT},
        {"role": "user", "content": text_chunk},
    ]
    
    try:
        response = requests.post(
            f"{BASE_URL}/chat/completions",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {FEATHERLESS_API_KEY}"
            },
            json={
                "model": DEFAULT_MODEL,
                "messages": messages
            },
            timeout=120  # seconds; prevents hanging on a stalled connection (adjust as needed)
        )
        response.raise_for_status()
        processed_text = response.json()["choices"][0]["message"]["content"]
        
        # Print chunk information for monitoring
        print(f"INPUT TEXT:\n{text_chunk[:500]}...")  # Show first 500 chars of input
        print(f"\nPROCESSED TEXT:\n{processed_text[:500]}...")  # Show first 500 chars of output
        print(f"{'='*90}\n")
        
        return processed_text
        
    except Exception as e:
        print(f"Error processing chunk {chunk_num}: {str(e)}")
        return text_chunk  # Return original text in case of error
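
Transient network or rate-limit errors are common when many chunks are sent in a row. One possible mitigation is retrying with exponential backoff. The sketch below is generic and makes no assumptions about the Featherless API; `call_with_retries`, its parameters, and the default delays are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on any exception with exponential backoff.

    The delay doubles after each failure: base_delay, 2*base_delay, ...
    The last exception is re-raised once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Because `process_chunk` catches every exception and returns the original chunk, a wrapper like this would need to wrap the `requests.post` call itself rather than `process_chunk`; otherwise no exception ever reaches the retry logic.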

## Processing Workflow

The following cells implement the complete workflow:
1. Read the file
2. Create chunks
3. Process each chunk
4. Save and preview the results
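
Steps 3 and 4 can also be wrapped in a function that takes the processing step as a parameter, which allows a dry run without spending API calls. This is a minimal sketch; `run_chunks` and `fake_clean` are hypothetical names that do not appear in the notebook.

```python
import os
import tempfile
from typing import Callable, List

def run_chunks(chunks: List[str], process_fn: Callable[[str, int], str], output_path: str) -> str:
    """Process each chunk in order, writing results to disk as they arrive."""
    processed = ""
    with open(output_path, "w", encoding="utf-8") as out_file:
        for chunk_num, chunk in enumerate(chunks):
            cleaned = process_fn(chunk, chunk_num)
            processed += cleaned + "\n"
            out_file.write(cleaned + "\n")
            out_file.flush()  # keep partial results if a later chunk fails
    return processed

def fake_clean(chunk: str, chunk_num: int) -> str:
    """Stand-in for process_chunk: no network call, just trims whitespace."""
    return chunk.strip()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clean_demo.txt")
    result = run_chunks(["  first chunk ", "second chunk  "], fake_clean, path)
    with open(path, encoding="utf-8") as f:
        file_contents = f.read()

print(repr(result))
```

Passing `process_chunk` in place of `fake_clean` would give the same behaviour as the cells that follow.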

In [11]:
INPUT_FILE = "./extracted_text.txt"  # Replace with your file path
CHUNK_SIZE = 1000  # Adjust chunk size if needed

# Read the file content first
with open(INPUT_FILE, 'r', encoding='utf-8') as file:
    text_content = file.read()

# Create chunks from the actual content
chunks = create_word_bounded_chunks(text_content, CHUNK_SIZE)
num_chunks = len(chunks)

In [None]:
num_chunks

In [13]:
# Initialize the accumulator for the cleaned output
processed_text = ""
# Keep the original text for the length comparison at the end
text = text_content

# Number of chunks actually produced by the word-bounded splitter
num_chunks = len(chunks)

# Create output file name
output_file = f"clean_{os.path.basename(INPUT_FILE)}"

In [None]:
with open(output_file, 'w', encoding='utf-8') as out_file:
    for chunk_num, chunk in enumerate(tqdm(chunks, desc="Processing chunks")):
        # Process chunk and append to complete text
        processed_chunk = process_chunk(chunk, chunk_num)
        processed_text += processed_chunk + "\n"
        
        # Write chunk immediately to file
        out_file.write(processed_chunk + "\n")
        out_file.flush()

In [None]:
print("\nProcessing complete!")
print(f"Input file: {INPUT_FILE}")
print(f"Output file: {output_file}")
print(f"Total chunks processed: {num_chunks}")

# Preview the beginning and end of the complete processed text
print("\nPreview of final processed text:")
print("\nBEGINNING:")
print(processed_text[:1000])
print("\n...\n\nEND:")
print(processed_text[-1000:])

In [None]:
import matplotlib.pyplot as plt

# Calculate text statistics
original_length = len(text)
processed_length = len(processed_text)
reduction_percent = ((original_length - processed_length) / original_length) * 100

# Create bar chart
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(['Original Text', 'Processed Text'], 
        [original_length, processed_length],
        color=['#1f77b4', '#2ca02c'])

# Add labels and title
ax.set_ylabel('Character Count')
ax.set_title('Text Processing Results')

# Add text on top of bars
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.1,
            f'{height:,}',
            ha='center', va='bottom')

# Add reduction percentage
plt.figtext(0.5, 0.01, f"Text reduction: {reduction_percent:.1f}%", 
           ha="center", fontsize=12, bbox={"facecolor":"orange", "alpha":0.5, "pad":5})

plt.tight_layout()
plt.show()
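
The reduction figure shown in the chart is `(original - processed) / original * 100`, which fails with a division by zero if the input file is empty. A small sketch of a guarded version follows; `reduction_percent_safe` is a hypothetical helper name, and returning `0.0` for empty input is an assumption about the desired behaviour.

```python
def reduction_percent_safe(original_length: int, processed_length: int) -> float:
    """Percentage by which the text shrank; negative if it grew."""
    if original_length <= 0:
        # Nothing to compare against, so report no reduction
        return 0.0
    return (original_length - processed_length) / original_length * 100

print(reduction_percent_safe(1000, 750))  # 25.0
```

A negative result means the model returned more text than it received, which is worth checking because the system prompt asks for cleanup, not expansion.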