BriefCast-AI: Local PDF-to-Podcast with Ollama

Turn PDF documents into a friendly two-speaker podcast fully on your machine.

This project ingests one or more PDFs, renders each page as an image, uses a multimodal Ollama model to understand text + charts + infographics, builds a grounded fact pack, writes a conversation between two hosts in plain language, and finally synthesizes the dialogue into audio with locally generated voices.

It is designed for document-heavy workflows such as:

economic reports
policy briefs
technical notes
slide decks exported as PDF
reports with charts, tables, diagrams, and infographics

What the app does

Upload one or more PDFs.
Render each page to a high-resolution image.
Extract page text.
Send page image + page text to a multimodal Ollama model.
Build a structured page-level fact pack.
Merge all page facts into a document brief.
Generate a two-person podcast script in easy, friendly language.
Convert each speaker turn to speech with local TTS.
Merge all turns into a final WAV file.
Preview the result in the app.

Why this architecture

A PDF with infographics is not just text. A text-only parser will miss chart meaning, labels, and layout cues. This project therefore uses a multimodal pipeline:

PyMuPDF renders each PDF page to PNG and extracts text.
Ollama vision model reads each page image plus extracted text.
Ollama text model writes the final dialogue.
Kokoro generates the audio locally.
Streamlit provides the local UI.

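This staged design tends to be more reliable than asking one model to do everything in a single step, because each stage produces an intermediate result that you can inspect, cache, and re-run on its own.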

Recommended stack

Core libraries

Python 3.10+
Streamlit
PyMuPDF
Pillow
requests
pydantic
numpy
soundfile
Kokoro

Local model runtime

Ollama

Recommended Ollama models

Page understanding: qwen2.5vl:7b
Script generation: qwen3.5:9b
Optional embeddings for retrieval: embeddinggemma

You can swap the models later if you want faster or higher-quality variants.

Project structure

podcast_local/
├── app.py
├── tts.py
├── voices.json
├── requirements.txt
├── data/
│   ├── uploads/
│   ├── rendered_pages/
│   ├── facts/
│   └── output/
└── README.md

Folder purpose

uploads/: original PDF files
rendered_pages/: PNG version of each page
facts/: extracted structured summaries per page or document
output/: generated WAV/MP3 files
voices.json: voice selector mapping shown in the UI

End-to-end workflow

Stage 1. PDF ingestion

For each uploaded PDF, the app:

opens the PDF with PyMuPDF
extracts page text
renders each page at a higher DPI for better chart and infographic readability

Stage 2. Page-level multimodal understanding

Each page is processed with a multimodal Ollama model.

Input sent to the model:

page image
extracted text
prompt instructing the model to identify:
- title
- main findings
- important numbers
- chart takeaways
- jargon to explain
- caveats / uncertain readings

Output expected from the model:

structured JSON
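
The fields listed above could be expressed as a JSON schema and passed to Ollama through the `format` parameter. The sketch below is one possible shape, not necessarily the schema `app.py` uses: the key names (`main_findings`, `chart_takeaways`, and so on) and the helper `missing_fields` are assumptions chosen for illustration.

```python
import json

# Hypothetical schema mirroring the field list above; key names are illustrative.
PAGE_FACT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "main_findings": {"type": "array", "items": {"type": "string"}},
        "important_numbers": {"type": "array", "items": {"type": "string"}},
        "chart_takeaways": {"type": "array", "items": {"type": "string"}},
        "jargon": {"type": "array", "items": {"type": "string"}},
        "caveats": {"type": "array", "items": {"type": "string"}},
    },
    "required": [
        "title", "main_findings", "important_numbers",
        "chart_takeaways", "jargon", "caveats",
    ],
}


def missing_fields(raw_json: str, schema: dict = PAGE_FACT_SCHEMA) -> list[str]:
    """Return the required keys that are absent from a model response."""
    data = json.loads(raw_json)
    return [key for key in schema["required"] if key not in data]


reply = '{"title": "GDP outlook", "main_findings": ["Growth slows in 2025"]}'
print(missing_fields(reply))
# ['important_numbers', 'chart_takeaways', 'jargon', 'caveats']
```

A check like this is a cheap way to decide whether a page should be retried before its facts are merged into the document brief.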

Stage 3. Document brief

The page-level fact packs are merged into a single grounded brief that includes:

executive summary
major themes
key statistics
policy relevance
terms that need explanation
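
The mechanical part of this merge can be sketched in plain Python. The example below assumes each page fact pack is a dict with a `page` number plus the hypothetical keys `title`, `main_findings`, `important_numbers`, and `jargon`; the executive summary, themes, and policy relevance would still need a model call and are not covered here.

```python
def merge_fact_packs(page_facts: list[dict]) -> dict:
    """Combine per-page fact packs into one brief, keeping page provenance."""
    brief = {"titles": [], "findings": [], "numbers": [], "terms": []}
    seen_terms = {}  # lower-cased term -> first spelling seen
    for facts in page_facts:
        page = facts["page"]
        if facts.get("title"):
            brief["titles"].append(facts["title"])
        for finding in facts.get("main_findings", []):
            brief["findings"].append({"page": page, "text": finding})
        for number in facts.get("important_numbers", []):
            brief["numbers"].append({"page": page, "text": number})
        for term in facts.get("jargon", []):
            # De-duplicate case-insensitively, keeping the first spelling
            seen_terms.setdefault(term.lower(), term)
    brief["terms"] = list(seen_terms.values())
    return brief


pages = [
    {"page": 1, "title": "Overview", "main_findings": ["Inflation eased"],
     "important_numbers": ["3.2%"], "jargon": ["CPI"]},
    {"page": 2, "title": "", "main_findings": ["Wages rose"],
     "jargon": ["cpi", "Real wages"]},
]
brief = merge_fact_packs(pages)
print(brief["terms"])
# ['CPI', 'Real wages']
```

Keeping the page number next to each finding makes it possible to cite sources by page later on.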

Stage 4. Podcast script generation

A second Ollama model turns the grounded document brief into:

episode title
intro hook
turn-by-turn script for Host A and Host B
closing summary

The writing style should be:

friendly
clear
plain language
accurate
non-robotic
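
Before sending a generated script to TTS, it may help to check that it has the expected parts. The sketch below assumes the script arrives as a dict with the keys `episode_title`, `intro_hook`, `turns`, and `closing_summary`; those key names and the function `check_script` are illustrative, while the `speaker` / `text` shape of each turn follows the usage examples later in this README.

```python
ALLOWED_SPEAKERS = {"Host A", "Host B"}
# Assumed key names for the four script parts listed above
REQUIRED_KEYS = ("episode_title", "intro_hook", "turns", "closing_summary")


def check_script(script: dict) -> list[str]:
    """Return a list of problems; an empty list means the script looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in script]
    for index, turn in enumerate(script.get("turns", [])):
        if turn.get("speaker") not in ALLOWED_SPEAKERS:
            problems.append(f"turn {index}: unknown speaker {turn.get('speaker')!r}")
        if not turn.get("text", "").strip():
            problems.append(f"turn {index}: empty text")
    return problems


draft = {
    "episode_title": "What the report says",
    "intro_hook": "Three numbers you should know.",
    "turns": [
        {"speaker": "Host A", "text": "Welcome."},
        {"speaker": "Narrator", "text": ""},
    ],
    "closing_summary": "That is the short version.",
}
print(check_script(draft))
# ["turn 1: unknown speaker 'Narrator'", 'turn 1: empty text']
```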

Stage 5. Local TTS

Each line is synthesized locally:

Host A uses Voice A
Host B uses Voice B

Then all audio chunks are concatenated into one final WAV file.
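
The project concatenates NumPy arrays before writing the file. As an alternative illustration of the same step, the sketch below joins per-turn WAV files using only the standard-library `wave` module and adds a short pause between turns. It assumes every chunk is 16-bit PCM with the same sample rate and channel count; the function name `concat_wavs` and the 0.3 second default pause are hypothetical.

```python
import wave
from pathlib import Path


def concat_wavs(chunk_paths: list[Path], out_path: Path, pause_seconds: float = 0.3) -> int:
    """Join WAV chunks that share one format, inserting silence between them.

    Returns the number of frames written.
    """
    with wave.open(str(chunk_paths[0]), "rb") as first:
        channels = first.getnchannels()
        width = first.getsampwidth()
        rate = first.getframerate()

    silence_frames = int(rate * pause_seconds)
    # Zero bytes are silence for signed 16-bit PCM (assumed format)
    silence = b"\x00" * (silence_frames * width * channels)

    total = 0
    with wave.open(str(out_path), "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(width)
        out.setframerate(rate)
        for position, path in enumerate(chunk_paths):
            with wave.open(str(path), "rb") as chunk:
                fmt = (chunk.getnchannels(), chunk.getsampwidth(), chunk.getframerate())
                if fmt != (channels, width, rate):
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(chunk.readframes(chunk.getnframes()))
                total += chunk.getnframes()
            if position < len(chunk_paths) - 1:
                out.writeframes(silence)
                total += silence_frames
    return total
```

A brief pause between speakers usually makes the dialogue easier to follow than back-to-back turns.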

Installation

1. Install system dependencies

macOS

brew install ollama ffmpeg espeak-ng

Ubuntu / Debian

sudo apt update
sudo apt install -y ffmpeg espeak-ng

Install Ollama from its official installer if it is not already available.

2. Create a Python environment

python -m venv .venv
source .venv/bin/activate

3. Install Python packages

pip install --upgrade pip
pip install streamlit pymupdf pillow requests pydantic numpy soundfile kokoro

If you plan to use additional language support in Kokoro:

pip install "misaki[ja]"   # Japanese
pip install "misaki[zh]"   # Mandarin Chinese

4. Pull Ollama models

ollama pull qwen2.5vl:7b
ollama pull qwen3.5:9b
ollama pull embeddinggemma

Run the app

streamlit run app.py

Then open the local Streamlit URL shown in your terminal.

Example configuration for `voices.json`

{
  "Warm Female": "af_heart",
  "Clear Female": "af_bella",
  "Warm Male": "am_adam",
  "Calm Male": "bm_george"
}

The exact voice IDs available in your environment may vary depending on the Kokoro package version and your voice assets.

Minimal example: render a PDF page

import fitz
from pathlib import Path

def render_pdf(pdf_path: str, out_dir: str, dpi: int = 240):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    pages = []

    # The context manager closes the document even if rendering fails
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            text = page.get_text("text")
            pix = page.get_pixmap(dpi=dpi, alpha=False)
            img_path = out_dir / f"page_{i+1}.png"
            img_path.write_bytes(pix.tobytes("png"))
            pages.append({
                "page": i + 1,
                "text": text,
                "image_path": str(img_path),
            })

    return pages

Minimal example: call Ollama for structured JSON

import json
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "main_points": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["title", "main_points"]
}

payload = {
    "model": "qwen3.5:9b",
    "messages": [
        {"role": "user", "content": "Summarize this page as JSON."}
    ],
    "stream": False,
    "format": schema
}

response = requests.post(OLLAMA_URL, json=payload, timeout=120)
response.raise_for_status()
data = response.json()
parsed = json.loads(data["message"]["content"])
print(parsed)

Minimal example: Kokoro TTS

from kokoro import KPipeline
import numpy as np
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # American English
text = "Welcome to today's episode. We are discussing what this report means in simple language."

segments = []
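for _, _, audio in pipeline(text, voice='af_heart', speed=1.0):
    # Kokoro can yield a PyTorch tensor, which has no .astype(); convert it first
    if hasattr(audio, "detach"):
        audio = audio.detach().cpu().numpy()
    segments.append(np.asarray(audio, dtype=np.float32))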

final_audio = np.concatenate(segments)
sf.write("sample.wav", final_audio, 24000)

Recommended prompt rules for page extraction

Use prompts that force the model to stay grounded. Good rules include:

use both the image and the extracted text
do not invent facts not visible in the page
identify exact numbers when possible
explain chart meaning in simple language
mention uncertainty if a graphic is hard to read
return only JSON matching the schema
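
One way to apply these rules is to bake them into the request body for each page. The sketch below only assembles the payload for Ollama's `/api/chat` endpoint and makes no network call; the function name `build_page_payload` and the exact prompt wording are assumptions, not the prompt used by `app.py`.

```python
import base64

EXTRACTION_RULES = [
    "Use both the image and the extracted text.",
    "Do not invent facts that are not visible on the page.",
    "Identify exact numbers when possible.",
    "Explain chart meaning in simple language.",
    "Mention uncertainty if a graphic is hard to read.",
    "Return only JSON matching the schema.",
]


def build_page_payload(image_bytes: bytes, page_text: str, schema: dict,
                       model: str = "qwen2.5vl:7b") -> dict:
    """Assemble an /api/chat request body for one page (no network call here)."""
    prompt = "Rules:\n" + "\n".join(f"- {rule}" for rule in EXTRACTION_RULES)
    prompt += "\n\nExtracted page text:\n" + page_text
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            # Ollama expects images as base64-encoded strings
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
        "format": schema,
    }


payload = build_page_payload(b"fake-png-bytes", "Revenue rose 4%.", {"type": "object"})
print(payload["messages"][0]["content"])
```

The returned dict can be posted with `requests.post(OLLAMA_URL, json=payload)` in the same way as the structured JSON example above.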

Recommended prompt rules for script generation

For the second-stage writer model, specify:

two hosts
friendly tone
easy language for non-specialists
short speaking turns
explain jargon immediately
only use facts from the document brief
mention uncertainty when the source is ambiguous
close with a short summary of takeaways
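
These rules can be turned into a reusable prompt builder. The following is a minimal sketch under stated assumptions: the function name `build_script_prompt`, the wording of the prompt, and the 40-word turn limit are all illustrative choices rather than values taken from the project.

```python
import json

SCRIPT_RULES = [
    "Write for two hosts, Host A and Host B.",
    "Keep the tone friendly.",
    "Use easy language for non-specialists.",
    "Keep speaking turns short.",
    "Explain jargon immediately.",
    "Only use facts from the document brief.",
    "Mention uncertainty when the source is ambiguous.",
    "Close with a short summary of takeaways.",
]


def build_script_prompt(brief: dict, max_turn_words: int = 40) -> str:
    """Render the rules and the grounded brief into a single prompt string."""
    rules = "\n".join(f"{n}. {rule}" for n, rule in enumerate(SCRIPT_RULES, start=1))
    return (
        "You are writing a podcast script.\n"
        f"Rules:\n{rules}\n"
        f"Keep each turn under {max_turn_words} words.\n\n"
        "Document brief (the only permitted source of facts):\n"
        + json.dumps(brief, indent=2, ensure_ascii=False)
    )


prompt = build_script_prompt({"summary": "Inflation eased to 3.2%."})
print(prompt)
```

Embedding the brief as JSON keeps the boundary between instructions and source material explicit, which makes the "only use facts from the brief" rule easier for the model to follow.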

Performance notes

Rendering pages at 220–300 DPI improves chart reading but increases processing time.
Use a smaller vision model if you want faster page extraction.
Use a larger writer model if you want better conversational quality.
If a document is very large, cache the page-level JSON so you do not re-run the entire extraction every time.
For multiple documents, add embeddings and retrieval so the script is grounded only on the most relevant passages.
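
The caching suggestion above can be sketched as follows. The idea is to derive a key from everything that can change the extraction result (PDF bytes, page number, model, DPI) and store each page's JSON under that key. The function names and the choice of SHA-256 are assumptions for illustration; `extract` stands in for whatever function actually calls the vision model.

```python
import hashlib
import json
from pathlib import Path


def cache_key(pdf_bytes: bytes, page: int, model: str, dpi: int) -> str:
    """Hash every input that can change the extraction result."""
    digest = hashlib.sha256(pdf_bytes)
    digest.update(f"|{page}|{model}|{dpi}".encode("utf-8"))
    return digest.hexdigest()


def cached_extract(cache_dir: Path, key: str, extract) -> dict:
    """Return cached facts if present; otherwise call extract() and store the result."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    facts = extract()
    path.write_text(json.dumps(facts, ensure_ascii=False), encoding="utf-8")
    return facts
```

Usage might look like `cached_extract(Path("data/facts"), cache_key(pdf_bytes, 3, "qwen2.5vl:7b", 240), lambda: run_model(page))`, where `run_model` is a placeholder for your own extraction call. Changing the model or the DPI produces a new key, so stale results are not reused.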

Troubleshooting

Ollama is not responding

Check that the server is running:

ollama serve

Then test locally:

curl http://localhost:11434/api/tags
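
The same check can be done from Python, which is convenient for showing a clear error in the app before a long extraction run starts. This sketch uses only the standard library; the helper names `list_ollama_models` and `missing_models` are hypothetical. It relies on `/api/tags` returning a `models` list with a `name` per entry, and on Ollama reporting untagged pulls such as `embeddinggemma` as `embeddinggemma:latest`.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url: str = "http://localhost:11434", timeout: float = 3.0):
    """Return installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    return [model["name"] for model in payload.get("models", [])]


def _with_tag(name: str) -> str:
    """Treat a name without a tag as ':latest', as Ollama does."""
    return name if ":" in name else f"{name}:latest"


def missing_models(installed: list[str], required: list[str]) -> list[str]:
    """Return the required models that are not installed."""
    have = {_with_tag(name) for name in installed}
    return [name for name in required if _with_tag(name) not in have]


if __name__ == "__main__":
    installed = list_ollama_models()
    if installed is None:
        print("Ollama is not reachable; try `ollama serve`.")
    else:
        print("Missing:", missing_models(installed, ["qwen2.5vl:7b", "qwen3.5:9b"]))
```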

The model misses chart details

Try one or more of these:

increase render DPI to 300
crop important chart regions and re-run extraction
reduce the amount of text sent with the page
add a second pass focused only on visuals

Kokoro fails on macOS Apple Silicon

Try running with:

PYTORCH_ENABLE_MPS_FALLBACK=1 streamlit run app.py

TTS sounds unnatural

Try:

shorter turns
more punctuation in the script
a different voice pair
slower speaking speed
cleaning up long numbers, acronyms, and formulas before synthesis
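
Two of these fixes, shorter turns and acronym cleanup, are easy to automate before synthesis. The helpers below are a rough sketch: the names, the 200-character limit, and the rule of spelling out every all-caps token of two to five letters are assumptions, and that last rule will also split acronyms that are normally read as words (for example NATO), so treat it as a starting point.

```python
import re


def split_long_turn(text: str, max_chars: int = 200) -> list[str]:
    """Break a turn at sentence boundaries so each chunk stays under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def spell_out_acronyms(text: str) -> str:
    """Insert spaces into all-caps tokens so they are read letter by letter."""
    return re.sub(r"\b[A-Z]{2,5}\b", lambda match: " ".join(match.group()), text)


print(split_long_turn("One. Two. Three.", max_chars=10))
# ['One. Two.', 'Three.']
print(spell_out_acronyms("The GDP of the EU rose."))
# The G D P of the E U rose.
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence.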

Spanish pronunciation is weak

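Use the Kokoro language pipeline for Spanish (lang_code="e") and pair it with a Spanish voice. An English voice run through the Spanish pipeline, or a Spanish voice run through lang_code="a", will sound off, so make sure the voice and language code match. Which Spanish voice IDs are available depends on your Kokoro version and voice assets.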

Suggested roadmap

v1

upload PDFs
multimodal extraction
structured fact pack
two-host script
two selectable voices
final WAV output

v2

edit script before synthesis
add source citations by page number
save projects
export MP3
chapter markers
retry per page

v3

multi-document retrieval with embeddings
bilingual podcasts
optional narrator mode
per-speaker speed and pause control
glossary mode for technical terms

Privacy

This project is designed to run locally:

local PDFs
local Ollama inference
local TTS
local Streamlit interface

That makes it suitable for sensitive internal reports, drafts, and working papers.

License

Choose the license that fits your project, for example MIT.

A good first milestone

Before building the full UI, validate the following in order:

Render a PDF page correctly.
Extract a good JSON summary from one page.
Generate a short script from two or three page summaries.
Generate clean speech from two different Kokoro voices.
Only then connect the full Streamlit workflow.

That sequence will save a lot of debugging time.

Kokoro TTS setup and `tts.py` integration

This section replaces the earlier TTS snippet and aligns the project with Kokoro's current behavior, where generated audio can arrive as a PyTorch tensor. The code now converts audio safely before concatenating and writing files.

Install TTS dependencies

On macOS:

brew install ffmpeg espeak-ng
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install kokoro soundfile numpy

Create `voices.json`

Put this file in the project root:

{
  "Warm Female": "af_heart",
  "Clear Female": "af_bella",
  "Warm Male": "am_adam",
  "Calm Male": "bm_george"
}

You can change the voice IDs later after testing which ones you prefer.

Add `tts.py`

Place the provided tts.py file in the project root. It does four things:

Loads the user-facing voice labels from voices.json
Maps Host A and Host B to Kokoro voice IDs
Synthesizes each turn of the script with Kokoro
Saves a WAV file and optionally converts it to MP3 with ffmpeg
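
The provided tts.py is not reproduced here. As a rough illustration of the first two steps only, the sketch below loads voices.json and maps the two hosts to voice IDs. The function name `resolve_voices` is hypothetical and is not taken from tts.py; only the file format and the "Host A" / "Host B" speaker names come from this README.

```python
import json
from pathlib import Path


def resolve_voices(voices_json_path: str, host_a_label: str, host_b_label: str) -> dict:
    """Map speaker names to Kokoro voice IDs using the labels in voices.json."""
    voices = json.loads(Path(voices_json_path).read_text(encoding="utf-8"))
    unknown = [label for label in (host_a_label, host_b_label) if label not in voices]
    if unknown:
        raise KeyError(f"labels not found in {voices_json_path}: {unknown}")
    return {"Host A": voices[host_a_label], "Host B": voices[host_b_label]}
```

With the example voices.json above, `resolve_voices("voices.json", "Warm Female", "Calm Male")` would return `{"Host A": "af_heart", "Host B": "bm_george"}`. Failing early on an unknown label gives a clearer error than a failed voice lookup deep inside synthesis.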

Minimal usage example

from pathlib import Path
from tts import synthesize_dialogue_from_labels

turns = [
    {"speaker": "Host A", "text": "Welcome to our podcast."},
    {"speaker": "Host B", "text": "Today we explain this report in plain language."},
]

result = synthesize_dialogue_from_labels(
    turns=turns,
    host_a_label="Warm Female",
    host_b_label="Calm Male",
    voices_json_path="voices.json",
    output_wav=Path("data/output/podcast.wav"),
    lang_code="a",
    speed=1.0,
    output_mp3=Path("data/output/podcast.mp3"),
)

print(result)

Expected output:

{
    "wav": "data/output/podcast.wav",
    "mp3": "data/output/podcast.mp3"
}

Streamlit integration example

import json
import streamlit as st
from pathlib import Path
from tts import synthesize_dialogue_from_labels

with open("voices.json", "r", encoding="utf-8") as f:
    voices = json.load(f)

voice_a_label = st.selectbox("Voice for Host A", list(voices.keys()), index=0)
voice_b_label = st.selectbox("Voice for Host B", list(voices.keys()), index=1)

script = {
    "turns": [
        {"speaker": "Host A", "text": "Welcome to the episode."},
        {"speaker": "Host B", "text": "We will break this document down simply."},
    ]
}

if st.button("Generate audio"):
    result = synthesize_dialogue_from_labels(
        turns=script["turns"],
        host_a_label=voice_a_label,
        host_b_label=voice_b_label,
        voices_json_path="voices.json",
        output_wav=Path("data/output/podcast.wav"),
        lang_code="a",
        speed=1.0,
        output_mp3=Path("data/output/podcast.mp3"),
    )

    st.audio(result["wav"], format="audio/wav")
    st.success(f"Saved: {result}")

Apple Silicon note

If you hit MPS issues on an M-series Mac, try:

PYTORCH_ENABLE_MPS_FALLBACK=1 streamlit run app.py

or for a direct test:

PYTORCH_ENABLE_MPS_FALLBACK=1 python test_kokoro.py

Why the earlier code failed

An earlier version of the TTS snippet called:

audio.astype(np.float32)

That fails when audio is a PyTorch tensor. The updated tts.py fixes this by converting tensors safely with:

if hasattr(audio, "detach"):
    audio = audio.detach().cpu().numpy()

before calling np.asarray(audio, dtype=np.float32).
