Turn PDF documents into a friendly two-speaker podcast fully on your machine.
This project ingests one or more PDFs, renders each page as an image, uses a multimodal Ollama model to understand text + charts + infographics, builds a grounded fact pack, writes a conversation between two hosts in plain language, and finally synthesizes the dialogue into audio with locally generated voices.
It is designed for document-heavy workflows such as:
- economic reports
- policy briefs
- technical notes
- slide decks exported as PDF
- reports with charts, tables, diagrams, and infographics
- Upload one or more PDFs.
- Render each page to a high-resolution image.
- Extract page text.
- Send page image + page text to a multimodal Ollama model.
- Build a structured page-level fact pack.
- Merge all page facts into a document brief.
- Generate a two-person podcast script in easy, friendly language.
- Convert each speaker turn to speech with local TTS.
- Merge all turns into a final WAV file.
- Preview the result in the app.
A PDF with infographics is not just text. A text-only parser will miss chart meaning, labels, and layout cues. This project therefore uses a multimodal pipeline:
- PyMuPDF renders each PDF page to PNG and extracts text.
- Ollama vision model reads each page image plus extracted text.
- Ollama text model writes the final dialogue.
- Kokoro generates the audio locally.
- Streamlit provides the local UI.
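Splitting the work this way tends to be more reliable than asking one model to do everything in a single step, because each stage has a narrow task and its output can be checked before the next stage runs.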
- Python 3.10+
- Streamlit
- PyMuPDF
- Pillow
- requests
- pydantic
- numpy
- soundfile
- Kokoro
- Ollama
- Page understanding: qwen2.5vl:7b
- Script generation: qwen3.5:9b
- Optional embeddings for retrieval: embeddinggemma
You can swap the models later if you want faster or higher-quality variants.
podcast_local/
├── app.py
├── voices.json
├── requirements.txt
├── data/
│ ├── uploads/
│ ├── rendered_pages/
│ ├── facts/
│ └── output/
└── README.md
- uploads/: original PDF files
- rendered_pages/: PNG version of each page
- facts/: extracted structured summaries per page or document
- output/: generated WAV/MP3 files
- voices.json: voice selector mapping shown in the UI
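A minimal sketch for creating these data folders at startup, assuming the layout above. `ensure_data_dirs` is a hypothetical helper name, not something the project is known to ship.

```python
from pathlib import Path

# Subfolder names taken from the project layout above.
DATA_SUBDIRS = ("uploads", "rendered_pages", "facts", "output")


def ensure_data_dirs(root: str = "data") -> dict[str, Path]:
    """Create the data subfolders if missing and return their paths by name."""
    paths = {}
    for name in DATA_SUBDIRS:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        paths[name] = path
    return paths
```

Calling it once before handling the first upload avoids "directory not found" errors later in the pipeline.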
For each uploaded PDF, the app:
- opens the PDF with PyMuPDF
- extracts page text
- renders each page at a higher DPI for better chart and infographic readability
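To get a feel for what "higher DPI" costs, the arithmetic can be sketched as follows. PDF page sizes are measured in points (72 per inch), so the pixel size grows linearly with DPI and the pixel count grows with its square. `rendered_size` is a hypothetical helper, and the renderer's actual output may differ by a pixel because of rounding.

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Return the approximate pixel size of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# US Letter is 612 x 792 points.
print(rendered_size(612, 792, 72))   # (612, 792)
print(rendered_size(612, 792, 240))  # (2040, 2640)
```

Going from 72 to 240 DPI multiplies the pixel count by roughly 11, which is why higher DPI helps chart legibility but slows down page processing.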
Each page is processed with a multimodal Ollama model.
Input sent to the model:
- page image
- extracted text
- prompt instructing the model to identify:
- title
- main findings
- important numbers
- chart takeaways
- jargon to explain
- caveats / uncertain readings
Output expected from the model:
- structured JSON
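The request described above could be assembled roughly like this. It is a sketch, not the project's actual code: the schema field names and the helper `build_page_payload` are assumptions chosen to mirror the list above. It relies on two features of Ollama's `/api/chat` endpoint: base64-encoded `images` on a message and a JSON schema passed as `format`.

```python
import base64
from pathlib import Path

# Assumed schema: field names mirror the list above but are illustrative.
_STRING_LIST = {"type": "array", "items": {"type": "string"}}
PAGE_FACTS_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "main_findings": _STRING_LIST,
        "important_numbers": _STRING_LIST,
        "chart_takeaways": _STRING_LIST,
        "jargon": _STRING_LIST,
        "caveats": _STRING_LIST,
    },
    "required": ["title", "main_findings"],
}


def build_page_payload(image_path: str, page_text: str, model: str = "qwen2.5vl:7b") -> dict:
    """Build an Ollama /api/chat request body for one rendered page."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    prompt = (
        "Use both the page image and the extracted text. "
        "Do not invent facts that are not visible on the page. "
        "Mention uncertainty if a graphic is hard to read. "
        "Return only JSON matching the schema.\n\n"
        f"Extracted text:\n{page_text}"
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [image_b64]}],
        "stream": False,
        "format": PAGE_FACTS_SCHEMA,
    }
```

The returned dictionary can be posted to `http://localhost:11434/api/chat` like any other chat request; the JSON string in the response's `message.content` field then needs to be parsed.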
The page-level fact packs are merged into a single grounded brief that includes:
- executive summary
- major themes
- key statistics
- policy relevance
- terms that need explanation
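The mechanical part of this merge can be sketched in plain Python. The executive summary, themes, and policy relevance would still need a model call, so this hypothetical `merge_page_facts` helper only collects findings, numbers, and terms, and keeps the page number with each item. The input field names are assumptions.

```python
def merge_page_facts(page_facts: list[dict]) -> dict:
    """Merge per-page fact dicts into one brief, tagging items with their page."""
    brief = {"titles": [], "findings": [], "numbers": [], "terms": []}
    seen_terms = set()
    for facts in page_facts:
        page = facts["page"]
        if facts.get("title"):
            brief["titles"].append({"page": page, "text": facts["title"]})
        for finding in facts.get("main_findings", []):
            brief["findings"].append({"page": page, "text": finding})
        for number in facts.get("important_numbers", []):
            brief["numbers"].append({"page": page, "text": number})
        for term in facts.get("jargon", []):
            key = term.strip().lower()
            if key and key not in seen_terms:  # keep the first mention only
                seen_terms.add(key)
                brief["terms"].append(term.strip())
    return brief
```

Keeping page numbers in the brief makes it easier to trace a statement in the script back to its source page.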
A second Ollama model turns the grounded document brief into:
- episode title
- intro hook
- turn-by-turn script for Host A and Host B
- closing summary
The writing style should be:
- friendly
- clear
- plain language
- accurate
- non-robotic
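Because the writer model's output feeds straight into speech synthesis, it is worth checking its structure first. The sketch below assumes the script arrives as a dictionary with `episode_title`, `intro_hook`, `turns`, and `closing_summary` keys; those key names and the `validate_script` helper are illustrative, not a fixed project format.

```python
ALLOWED_SPEAKERS = {"Host A", "Host B"}


def validate_script(script: dict) -> list[str]:
    """Return a list of problems found in a generated script (empty if none)."""
    problems = []
    for key in ("episode_title", "intro_hook", "closing_summary"):
        if not isinstance(script.get(key), str) or not script[key].strip():
            problems.append(f"missing or empty field: {key}")
    turns = script.get("turns")
    if not isinstance(turns, list) or not turns:
        problems.append("turns must be a non-empty list")
        return problems
    for i, turn in enumerate(turns):
        if turn.get("speaker") not in ALLOWED_SPEAKERS:
            problems.append(f"turn {i}: unknown speaker {turn.get('speaker')!r}")
        if not str(turn.get("text", "")).strip():
            problems.append(f"turn {i}: empty text")
    return problems
```

If the list is non-empty, one option is to re-prompt the model with the problems appended rather than synthesizing a broken script.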
Each line is synthesized locally:
- Host A uses Voice A
- Host B uses Voice B
Then all audio chunks are concatenated into one final WAV file.
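One way to do the concatenation, sketched with only the standard library: it assumes each turn was first saved as its own 16-bit PCM WAV file with the same sample rate and channel count. `merge_wav_files` and the pause length are illustrative choices, not the project's actual merge code.

```python
import wave


def merge_wav_files(chunk_paths: list[str], out_path: str, pause_seconds: float = 0.3) -> int:
    """Concatenate WAV files with a short silence between them.

    Returns the total number of frames written. All inputs must share the
    same channel count, sample width, and sample rate.
    """
    if not chunk_paths:
        raise ValueError("no audio chunks to merge")
    params = None
    frames_written = 0
    with wave.open(out_path, "wb") as out:
        for index, path in enumerate(chunk_paths):
            with wave.open(path, "rb") as chunk:
                fmt = (chunk.getnchannels(), chunk.getsampwidth(), chunk.getframerate())
                if params is None:
                    params = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != params:
                    raise ValueError(f"{path} has a different audio format")
                if index > 0:
                    # Zero bytes are silence for 16-bit signed PCM.
                    gap = int(pause_seconds * fmt[2])
                    out.writeframes(b"\x00" * (gap * fmt[0] * fmt[1]))
                    frames_written += gap
                out.writeframes(chunk.readframes(chunk.getnframes()))
                frames_written += chunk.getnframes()
    return frames_written
```

A short pause between turns usually makes speaker changes easier to follow; set `pause_seconds=0` for a plain concatenation.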
On macOS:

    brew install ollama ffmpeg espeak-ng

On Debian/Ubuntu:

    sudo apt update
    sudo apt install -y ffmpeg espeak-ng

Install Ollama from its official installer if it is not already available.

Create a virtual environment and install the Python packages:

    python -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip
    pip install streamlit pymupdf pillow requests pydantic numpy soundfile kokoro

If you plan to use additional language support in Kokoro:

    pip install "misaki[ja]"  # Japanese
    pip install "misaki[zh]"  # Mandarin Chinese

Pull the models:

    ollama pull qwen2.5vl:7b
    ollama pull qwen3.5:9b
    ollama pull embeddinggemma

Start the app:

    streamlit run app.py

Then open the local Streamlit URL shown in your terminal.
    {
      "Warm Female": "af_heart",
      "Clear Female": "af_bella",
      "Warm Male": "am_adam",
      "Calm Male": "bm_george"
    }

The exact voice IDs available in your environment may vary depending on the Kokoro package version and your voice assets.
    import fitz  # PyMuPDF
    from pathlib import Path


    def render_pdf(pdf_path: str, out_dir: str, dpi: int = 240):
        """Render each page to PNG and collect its extracted text."""
        out_dir = Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        doc = fitz.open(pdf_path)
        pages = []
        for i, page in enumerate(doc):
            text = page.get_text("text")
            pix = page.get_pixmap(dpi=dpi, alpha=False)
            img_path = out_dir / f"page_{i+1}.png"
            img_path.write_bytes(pix.tobytes("png"))
            pages.append({
                "page": i + 1,  # 1-based page number
                "text": text,
                "image_path": str(img_path),
            })
        return pages

    import json
    import requests

    OLLAMA_URL = "http://localhost:11434/api/chat"

    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "main_points": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["title", "main_points"]
    }

    payload = {
        "model": "qwen3.5:9b",
        "messages": [
            {"role": "user", "content": "Summarize this page as JSON."}
        ],
        "stream": False,
        "format": schema
    }

    response = requests.post(OLLAMA_URL, json=payload, timeout=120)
    response.raise_for_status()
    data = response.json()
    parsed = json.loads(data["message"]["content"])
    print(parsed)

    from kokoro import KPipeline
    import numpy as np
    import soundfile as sf

    pipeline = KPipeline(lang_code='a')  # American English
    text = "Welcome to today's episode. We are discussing what this report means in simple language."

    segments = []
    for _, _, audio in pipeline(text, voice='af_heart', speed=1.0):
        # astype() only works when audio is a NumPy array; see the later
        # note on converting PyTorch tensors before concatenating.
        segments.append(audio.astype('float32'))

    final_audio = np.concatenate(segments)
    sf.write("sample.wav", final_audio, 24000)  # Kokoro outputs 24 kHz audio

Use prompts that force the model to stay grounded. Good rules include:
- use both the image and the extracted text
- do not invent facts not visible in the page
- identify exact numbers when possible
- explain chart meaning in simple language
- mention uncertainty if a graphic is hard to read
- return only JSON matching the schema
For the second-stage writer model, specify:
- two hosts
- friendly tone
- easy language for non-specialists
- short speaking turns
- explain jargon immediately
- only use facts from the document brief
- mention uncertainty when the source is ambiguous
- close with a short summary of takeaways
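As an illustration, the writer rules above can be kept in one list and combined with the document brief when building the prompt. The wording of the rules and the helper name `build_writer_prompt` are assumptions; adjust both to suit the model you use.

```python
import json

WRITER_RULES = [
    "Write a conversation between two hosts, Host A and Host B.",
    "Keep the tone friendly and the language easy for non-specialists.",
    "Keep speaking turns short.",
    "Explain jargon immediately after using it.",
    "Only use facts from the document brief.",
    "Mention uncertainty when the source is ambiguous.",
    "Close with a short summary of takeaways.",
]


def build_writer_prompt(brief: dict) -> str:
    """Combine the writer rules with the document brief into one prompt."""
    rules = "\n".join(f"- {rule}" for rule in WRITER_RULES)
    brief_json = json.dumps(brief, indent=2, ensure_ascii=False)
    return f"Rules:\n{rules}\n\nDocument brief (JSON):\n{brief_json}"
```

Keeping the rules in a list makes it easy to add or remove one while experimenting, without editing a long prompt string.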
- Rendering pages at 220–300 DPI improves chart reading but increases processing time.
- Use a smaller vision model if you want faster page extraction.
- Use a larger writer model if you want better conversational quality.
- If a document is very large, cache the page-level JSON so you do not re-run the entire extraction every time.
- For multiple documents, add embeddings and retrieval so the script is grounded only on the most relevant passages.
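The caching tip can be sketched as follows, assuming page facts are stored as JSON under `data/facts/` and keyed by a hash of the PDF contents plus the page number. Both helper names are hypothetical, and `extract` stands in for whatever function performs the model call for that page.

```python
import hashlib
import json
from pathlib import Path


def pdf_fingerprint(pdf_path: str) -> str:
    """Short content hash, so an edited PDF does not reuse stale results."""
    return hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()[:16]


def cached_page_facts(pdf_path, page_number, extract, cache_dir="data/facts"):
    """Return cached facts for a page, calling extract() only on a cache miss."""
    name = f"{pdf_fingerprint(pdf_path)}_page_{page_number}.json"
    cache_file = Path(cache_dir) / name
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    facts = extract()  # e.g. the model call for this page
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(facts, ensure_ascii=False), encoding="utf-8")
    return facts
```

Hashing the file contents rather than the file name means a re-uploaded, modified report gets fresh extraction while an unchanged one is served from disk.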
If the app cannot reach Ollama, start the server if it is not already running:

    ollama serve

Then test locally:

    curl http://localhost:11434/api/tags

If charts or infographics are read poorly, try one or more of these:
- increase render DPI to 300
- crop important chart regions and re-run extraction
- reduce the amount of text sent with the page
- add a second pass focused only on visuals
If Kokoro hits MPS errors on an Apple Silicon Mac, try running with:

    PYTORCH_ENABLE_MPS_FALLBACK=1 streamlit run app.py

If the speech sounds unnatural, try:
- shorter turns
- more punctuation in the script
- a different voice pair
- slower speaking speed
- cleaning up long numbers, acronyms, and formulas before synthesis
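The last item can be sketched as a small text-cleaning pass that runs before synthesis. The replacements below (dropping thousands separators, spelling out percent signs, spacing out acronyms) are assumptions about what reads well; whether they help depends on the voice and on how the TTS front end normalizes text, so treat `clean_for_tts` as a starting point.

```python
import re

# Assumed replacements; extend for the acronyms in your own documents.
ACRONYMS = {"GDP": "G D P", "IMF": "I M F"}


def clean_for_tts(text: str) -> str:
    """Rewrite symbols and acronyms that TTS engines often read poorly."""
    # 1,250,000 -> 1250000 (only commas between digit groups of three)
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    # 3.5% -> 3.5 percent
    text = re.sub(r"(\d+(?:\.\d+)?)\s*%", r"\1 percent", text)
    for acronym, spoken in ACRONYMS.items():
        text = re.sub(rf"\b{re.escape(acronym)}\b", spoken, text)
    # Collapse any repeated whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()


print(clean_for_tts("GDP grew 3.5% to 1,250,000 units."))
# G D P grew 3.5 percent to 1250000 units.
```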
For Spanish audio, use Kokoro's Spanish language pipeline and make sure the voice and the language code match.
- upload PDFs
- multimodal extraction
- structured fact pack
- two-host script
- two selectable voices
- final WAV output
- edit script before synthesis
- add source citations by page number
- save projects
- export MP3
- chapter markers
- retry per page
- multi-document retrieval with embeddings
- bilingual podcasts
- optional narrator mode
- per-speaker speed and pause control
- glossary mode for technical terms
This project is designed to run locally:
- local PDFs
- local Ollama inference
- local TTS
- local Streamlit interface
That makes it suitable for sensitive internal reports, drafts, and working papers.
Choose the license that fits your project, for example MIT.
Before building the full UI, validate the following in order:
- Render a PDF page correctly.
- Extract a good JSON summary from one page.
- Generate a short script from two or three page summaries.
- Generate clean speech from two different Kokoro voices.
- Only then connect the full Streamlit workflow.
That sequence will save a lot of debugging time.
This section replaces the earlier TTS snippet and aligns the project with Kokoro's current behavior, where generated audio can arrive as a PyTorch tensor. The code now converts audio safely before concatenating and writing files.
On macOS:

    brew install ffmpeg espeak-ng
    python -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip
    pip install kokoro soundfile numpy

Put this voices.json file in the project root:
    {
      "Warm Female": "af_heart",
      "Clear Female": "af_bella",
      "Warm Male": "am_adam",
      "Calm Male": "bm_george"
    }

You can change the voice IDs later after testing which ones you prefer.
Place the provided tts.py file in the project root. It does four things:
- Loads the user-facing voice labels from voices.json
- Maps Host A and Host B to Kokoro voice IDs
- Synthesizes each turn of the script with Kokoro
- Saves a WAV file and optionally converts it to MP3 with ffmpeg
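To make the first two steps concrete, here is a rough sketch of how labels could be resolved to voice IDs. This is not the code of the provided tts.py, whose internals are not shown here; `resolve_voices` is a hypothetical helper written only to illustrate the mapping.

```python
import json


def resolve_voices(voices_json_path: str, host_a_label: str, host_b_label: str) -> dict[str, str]:
    """Map each host to a Kokoro voice ID using the labels in voices.json."""
    with open(voices_json_path, "r", encoding="utf-8") as f:
        voices = json.load(f)
    resolved = {}
    for speaker, label in (("Host A", host_a_label), ("Host B", host_b_label)):
        if label not in voices:
            raise KeyError(f"Unknown voice label {label!r}; choose from {sorted(voices)}")
        resolved[speaker] = voices[label]
    return resolved
```

Failing early on an unknown label is cheaper than discovering the problem after the first turns have already been synthesized.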
    from pathlib import Path
    from tts import synthesize_dialogue_from_labels

    turns = [
        {"speaker": "Host A", "text": "Welcome to our podcast."},
        {"speaker": "Host B", "text": "Today we explain this report in plain language."},
    ]

    result = synthesize_dialogue_from_labels(
        turns=turns,
        host_a_label="Warm Female",
        host_b_label="Calm Male",
        voices_json_path="voices.json",
        output_wav=Path("data/output/podcast.wav"),
        lang_code="a",
        speed=1.0,
        output_mp3=Path("data/output/podcast.mp3"),
    )
    print(result)

Expected output:
    {
      "wav": "data/output/podcast.wav",
      "mp3": "data/output/podcast.mp3"
    }

To wire this into the Streamlit app:

    import json
    import streamlit as st
    from pathlib import Path
    from tts import synthesize_dialogue_from_labels

    # Load the label-to-voice mapping shown in the UI.
    with open("voices.json", "r", encoding="utf-8") as f:
        voices = json.load(f)

    voice_a_label = st.selectbox("Voice for Host A", list(voices.keys()), index=0)
    voice_b_label = st.selectbox("Voice for Host B", list(voices.keys()), index=1)

    script = {
        "turns": [
            {"speaker": "Host A", "text": "Welcome to the episode."},
            {"speaker": "Host B", "text": "We will break this document down simply."},
        ]
    }

    if st.button("Generate audio"):
        result = synthesize_dialogue_from_labels(
            turns=script["turns"],
            host_a_label=voice_a_label,
            host_b_label=voice_b_label,
            voices_json_path="voices.json",
            output_wav=Path("data/output/podcast.wav"),
            lang_code="a",
            speed=1.0,
            output_mp3=Path("data/output/podcast.mp3"),
        )
        st.audio(result["wav"], format="audio/wav")
        st.success(f"Saved: {result}")

If you hit MPS issues on an M-series Mac, try:
    PYTORCH_ENABLE_MPS_FALLBACK=1 streamlit run app.py

or for a direct test:

    PYTORCH_ENABLE_MPS_FALLBACK=1 python test_kokoro.py

The earlier TTS snippet used:

    audio.astype('float32')

That fails when audio is a PyTorch tensor. The updated tts.py fixes this by converting tensors safely with:

    if hasattr(audio, "detach"):
        audio = audio.detach().cpu().numpy()

before calling np.asarray(audio, dtype=np.float32).