<a href="https://colab.research.google.com/github/duper203/upstage_cookbook/blob/main/from_togetherai/again_PDF_to_Podcast.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Implementation of Notebook LM's PDF to Podcast

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/PDF_to_Podcast.ipynb)

### Introduction

This cookbook is inspired by [Notebook LM's](https://notebooklm.google/) podcast generation feature and a recent open source implementation, [Open Notebook LM](https://github.com/gabrielchua/open-notebooklm). It walks through how you can build a PDF to podcast pipeline.

Given any PDF we will generate a conversation between a host and a guest discussing and explaining the contents of the PDF.

In doing so we will learn the following:
1. How we can use JSON mode and structured generation with models like Upstage's `solar-pro` to extract a script for the podcast given text from the PDF.
2. How we can use TTS models to bring this script to life as a conversation.


In [None]:
!apt install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg
!pip install ffmpeg-python
!pip install PyAudio
!pip install pypdf #to read PDF content
!pip install cartesia #to access TTS model
!pip install -qU langchain-upstage langchain

In [29]:
# Standard library imports
import os
from pathlib import Path
from tempfile import NamedTemporaryFile
from typing import Any, Dict, List, Literal, Optional, Tuple, Union

# Third-party imports
from cartesia import Cartesia
from google.colab import userdata
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_upstage import ChatUpstage
from pydantic import BaseModel, ValidationError
from pypdf import PdfReader


In [11]:
os.environ["CARTESIA_API_KEY"] = userdata.get("CARTESIA_API_KEY")

client_cartesia = Cartesia(api_key=os.environ.get("CARTESIA_API_KEY"))

os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")
# client_together = Together(api_key=os.environ.get("TOGETHER_API_KEY"))

### Define Dialogue Schema with Pydantic

We need a way of telling the LLM what the structure of the podcast script between the guest and host will look like. We will do this using `pydantic` models.

Below we define the required classes.

- The overall conversation consists of lines said by either the host or the guest. The `DialogueItem` class specifies the structure of these lines.
- The full script is a combination of multiple lines performed by the speakers. Here we also include a `scratchpad` field so the LLM can ideate and brainstorm the overall flow of the script before generating the lines. The `Dialogue` class specifies this.

In [12]:
class DialogueItem(BaseModel):
    """A single dialogue item."""

    speaker: Literal["Host (Jane)", "Guest"]
    text: str


class Dialogue(BaseModel):
    """The dialogue between the host and guest."""

    scratchpad: str
    name_of_guest: str
    dialogue: List[DialogueItem]
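
As a minimal sketch of what these models buy us, the snippet below validates a raw JSON string against the schema. It redefines the two classes so it runs on its own; the sample payload and the helper `is_valid_script` are hypothetical and only for illustration, and pydantic v2 is assumed.

```python
from typing import List, Literal

from pydantic import BaseModel, ValidationError


class DialogueItem(BaseModel):
    speaker: Literal["Host (Jane)", "Guest"]
    text: str


class Dialogue(BaseModel):
    scratchpad: str
    name_of_guest: str
    dialogue: List[DialogueItem]


# Hypothetical payload shaped the way we want the LLM to answer
sample = (
    '{"scratchpad": "Open with a hook.", "name_of_guest": "Dr. Example", '
    '"dialogue": [{"speaker": "Host (Jane)", "text": "Welcome to the show!"}, '
    '{"speaker": "Guest", "text": "Thanks for having me."}]}'
)


def is_valid_script(raw_json: str) -> bool:
    """Return True when the JSON string satisfies the Dialogue schema."""
    try:
        Dialogue.model_validate_json(raw_json)
        return True
    except ValidationError:
        return False


script = Dialogue.model_validate_json(sample)
print(len(script.dialogue), is_valid_script(sample))

# "Jane" is not one of the allowed speaker literals and two fields are missing
print(is_valid_script('{"dialogue": [{"speaker": "Jane", "text": "Hi"}]}'))
```

The `Literal` type on `speaker` is what rejects labels such as `"Jane"`, so a reply that drifts from the expected speaker names fails validation instead of reaching the TTS step.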

In [13]:
# Adapted and modified from https://github.com/gabrielchua/open-notebooklm
SYSTEM_PROMPT = """
You are a world-class podcast producer tasked with transforming the provided input text {text} into an engaging and informative podcast script. The input may be unstructured or messy, sourced from PDFs or web pages. Your goal is to extract the most interesting and insightful content for a compelling podcast discussion.

# Steps to Follow:

1. **Analyze the Input:**
   Carefully examine the text, identifying key topics, points, and interesting facts or anecdotes that could drive an engaging podcast conversation. Disregard irrelevant information or formatting issues.

2. **Brainstorm Ideas:**
   In the `<scratchpad>`, creatively brainstorm ways to present the key points engagingly. Consider:
   - Analogies, storytelling techniques, or hypothetical scenarios to make content relatable
   - Ways to make complex topics accessible to a general audience
   - Thought-provoking questions to explore during the podcast
   - Creative approaches to fill any gaps in the information

3. **Craft the Dialogue:**
   Develop a natural, conversational flow between the host (Jane) and the guest speaker (the author or an expert on the topic). Incorporate:
   - The best ideas from your brainstorming session
   - Clear explanations of complex topics
   - An engaging and lively tone to captivate listeners
   - A balance of information and entertainment

   Rules for the dialogue:
   - The host (Jane) always initiates the conversation and interviews the guest
   - Include thoughtful questions from the host to guide the discussion
   - Incorporate natural speech patterns, including occasional verbal fillers (e.g., "Uhh", "Hmmm", "um," "well," "you know")
   - Allow for natural interruptions and back-and-forth between host and guest - this is very important to make the conversation feel authentic
   - Ensure the guest's responses are substantiated by the input text, avoiding unsupported claims
   - Maintain a PG-rated conversation appropriate for all audiences
   - Avoid any marketing or self-promotional content from the guest
   - The host concludes the conversation

4. **Summarize Key Insights:**
   Naturally weave a summary of key points into the closing part of the dialogue. This should feel like a casual conversation rather than a formal recap, reinforcing the main takeaways before signing off.

5. **Maintain Authenticity:**
   Throughout the script, strive for authenticity in the conversation. Include:
   - Moments of genuine curiosity or surprise from the host
   - Instances where the guest might briefly struggle to articulate a complex idea
   - Light-hearted moments or humor when appropriate
   - Brief personal anecdotes or examples that relate to the topic (within the bounds of the input text)

6. **Consider Pacing and Structure:**
   Ensure the dialogue has a natural ebb and flow:
   - Start with a strong hook to grab the listener's attention
   - Gradually build complexity as the conversation progresses
   - Include brief "breather" moments for listeners to absorb complex information
   - For complicated concepts, reasking similar questions framed from a different perspective is recommended
   - End on a high note, perhaps with a thought-provoking question or a call-to-action for listeners

IMPORTANT RULE: Each line of dialogue should be no more than 100 characters (e.g., can finish within 5-8 seconds)

Remember: Always reply in valid JSON format, without code blocks. Begin directly with the JSON output.
"""

### Call the LLM to Generate Podcast Script

Below we call Upstage's `solar-pro` model to generate a script for our podcast. We will also be able to read its `scratchpad` and see how it structured the overall conversation.

### Load in PDF of Choice

Here we will load in an academic paper, SOLAR 10.7B, which presents depth up-scaling (DUS), a method for scaling large language models.

In [14]:
from langchain_upstage import UpstageDocumentParseLoader

def get_PDF_text(file):
    """Parse a PDF with Upstage Document Parse and return its text."""
    loader = UpstageDocumentParseLoader(file, output_format='text')

    # Concatenate the text of every parsed page
    pages = loader.load()
    return ''.join(page.page_content for page in pages)

In [15]:
text = get_PDF_text('solar.pdf')
text

'SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective\nDepth Up-ScalingDahyun Kim∗, Chanjun Park∗†, Sanghoon Kim∗†, Wonsung Lee∗†, Wonho Song∗\nYunsu Kim∗, Hyeonwoo Kim∗, Yungi Kim, Hyeonju Lee, Jihoo Kim\nChangbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim\nMikyoung Cha, Hwalsuk Lee†, Sunghun Kim†Upstage AI, South Korea{kdahyun, chanjun.park, limerobot, wonsung.lee, hwalsuk.lee, hunkim}@upstage.aiAbstractWe introduce SOLAR 10.7B, a large language\nmodel (LLM) with 10.7 billion parameters,\ndemonstrating superior performance in various\nnatural language processing (NLP) tasks. In-\nspired by recent efforts to efficiently up-scale\nLLMs, we present a method for scaling LLMs\ncalled depth up-scaling (DUS), which encom-\npasses depthwise scaling and continued pre-\ntraining. In contrast to other LLM up-scaling\nmethods that use mixture-of-experts, DUS does\nnot require complex changes to train and infer-\nence efficiently. We show experimentally that\n

### Generate Script

Below we generate the script and print out the lines.


In [24]:
# Attempt 1: calling call_llm(SYSTEM_PROMPT, text, Dialogue) fails with a token-limit error.
# Note: SYSTEM_PROMPT itself contains {text}, so this template inserts the PDF text twice.
def call_llm(system_prompt: str, text: str, dialogue_format):
    """Call the LLM with the given prompt and dialogue format."""
    llm = ChatUpstage(model='solar-pro', extra_body={"response_format": {"type": "json_object", "schema":dialogue_format.model_json_schema()}})


    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "{text}")
    ])

    # Create the chain
    chain = chat_prompt | llm

    # Call the chain with the input text
    response = chain.invoke({"text": text})
    return response


In [42]:
# Attempt 2: no error, but the dialogue is not generated properly.
def call_llm(system_prompt: str, text: str, dialogue_format):
    """Call the LLM with the given prompt and dialogue format."""
    llm = ChatUpstage(model='solar-pro', extra_body={"response_format": {"type": "json_object", "schema":dialogue_format.model_json_schema()}})

    chat_prompt = ChatPromptTemplate.from_messages([
        SystemMessage(content=system_prompt),
        HumanMessage(content="{text}")
    ])

    chain = chat_prompt | llm
    response = chain.invoke({"text": text})
    return response
call_llm(SYSTEM_PROMPT, text, Dialogue)

AIMessage(content='{\n  "response": {\n    "status": "success",\n    "message": "Here\'s your podcast script based on the provided input text. I hope you find it engaging and informative!",\n    "script": [\n      {\n        "speaker": "Jane",\n        "line": "Welcome to another episode of \'Insightful Talks\'! Today, we\'re diving into the fascinating world of... Hmmm, where should I start?",\n        "tags": ["host_intro"]\n      },\n      {\n        "speaker": "Guest",\n        "line": "Hi Jane, I\'m excited to be here! I think it\'s important to understand the roots of...",\n        "tags": ["guest_intro"]\n      },\n      {\n        "speaker": "Jane",\n        "line": "Absolutely! Can you tell us about a key event or discovery that sparked your interest in this topic?",\n        "tags": ["host_question_1"]\n      },\n      {\n        "speaker": "Guest",\n        "line": "Well, I\'ll never forget when I first learned about... It was like a lightbulb went off in my head!",\n       

In [51]:
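# Attempt 3: no error, but the PDF content is not reflected in the output.
# Likely cause: ChatPromptTemplate passes message objects through unchanged,
# so the human message is the literal string "{text}" rather than the PDF text.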
def call_llm(system_prompt: str, text: str, dialogue_format):
    """Call the LLM with the given prompt and dialogue format."""
    llm = ChatUpstage(model='solar-pro', extra_body={"response_format": {"type": "json_object", "schema":dialogue_format.model_json_schema()}})

    chat_prompt = ChatPromptTemplate.from_messages([
        SystemMessage(content=system_prompt),
        HumanMessage(content="{text}")
    ])

    # Create the chain
    chain = chat_prompt | llm

    # Call the chain with the input text
    response = chain.invoke({"text": text})
    return response

call_llm(SYSTEM_PROMPT, text, Dialogue)

AIMessage(content='{\n  "dialogue": [\n    {\n      "speaker": "Jane",\n      "text": "Hello everyone, welcome to our podcast! Today, we have a fascinating topic: \'The Science of Dreams\'. Let\'s dive right in. Our guest, Dr. [Guest], has some intriguing insights on this subject. Dr., can you begin by sharing what dreams are and why we have them?"\n    },\n    {\n      "speaker": "Guest",\n      "text": "Uhh, absolutely, Jane. Dreams are essentially stories our minds create while we sleep. They\'re a combination of memories, emotions, and imagination. As for why we have them, there are several theories. One suggests dreams help us process emotions and experiences from our daily lives."\n    },\n    {\n      "speaker": "Jane",\n      "text": "Hmm, interesting. So, dreams are like our brain\'s own soap opera, huh? Can you tell us about any common dream themes or symbols?"\n    },\n    {\n      "speaker": "Guest",\n      "text": "Well, many people report dreaming about falling, flying, o

In [None]:
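def call_llm(system_prompt: str, text: str, dialogue_format):
    """Call the LLM with the given prompt and dialogue format."""
    # Pass the JSON schema of the model, not the pydantic class itself
    llm = ChatUpstage(model='solar-pro', extra_body={"response_format": {"type": "json_object", "schema": dialogue_format.model_json_schema()}})

    # Use the system_prompt argument rather than the global SYSTEM_PROMPT
    chat_prompt = ChatPromptTemplate([
        ("system", system_prompt),
        ("human", "{text}")
    ])

    # Create the chain
    chain = chat_prompt | llm

    # Call the chain with the input text
    response = chain.invoke({"text": text})
    return response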

In [41]:
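# Attempt 4: a shorter inline prompt. Still fails with the token-limit error shown below.
# Note: this variant takes only the text, unlike call_llm(system_prompt, text, dialogue_format) above.
def call_llm(
    text: str,
    llm: Optional[ChatUpstage] = None,
) -> Dict[str, Any]:
    """
    Call the LLM with a compact prompt and parse the reply as JSON.
    """
    # Build the default model lazily instead of at function-definition time
    if llm is None:
        llm = ChatUpstage(model='solar-pro')
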
    # Create the prompt template
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                """
                You are a podcast producer transforming the given text into an engaging podcast script.
                Follow the format:
                {{
                    "scratchpad": "Identify main topics and key insights first.",
                    "name_of_guest": "<string>",
                    "dialogue": [
                        {{"speaker": "Host (Jane)", "text": "<string>"}},
                        {{"speaker": "Guest", "text": "<string>"}}
                    ]
                }}
                Ensure each line is under 100 characters.
                """,
            ),
            (
                "human",
                "Transform the following text into podcast dialogue:\n{text}",
            ),
            (
                "human",
                "Respond with a JSON object containing 'scratchpad', 'name_of_guest', and 'dialogue'.",
            ),
        ]
    )

    # Create the output parser
    output_parser = JsonOutputParser()

    # Create the chain
    chain = prompt | llm | output_parser

    # Run the chain; 'text' is the only input variable the prompt expects
    result = chain.invoke({"text": text})

    return result

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 4096 tokens. However, your messages resulted in 18562 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
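
The error above reports 18562 tokens against a 4096-token limit, so the PDF text has to be shortened or split before it is sent. The sketch below splits text into chunks that fit a rough token budget. It assumes about 4 characters per token, which is a crude heuristic and not the model's real tokenizer, and `split_into_chunks` is a hypothetical helper. The budget you pass should also leave room for the system prompt and the reply.

```python
from typing import List


def split_into_chunks(text: str, max_tokens: int, chars_per_token: float = 4.0) -> List[str]:
    """Split text into pieces of roughly max_tokens each, breaking at spaces."""
    max_chars = int(max_tokens * chars_per_token)
    if max_chars < 1:
        raise ValueError("max_tokens is too small")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to break at a space so words stay intact
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        start = end
    # Drop chunks that were only whitespace
    return [chunk for chunk in chunks if chunk]


# Made-up input: 1000 short words, split with a budget of 100 "tokens"
long_text = "word " * 1000
chunks = split_into_chunks(long_text, max_tokens=100)
print(len(chunks), max(len(chunk) for chunk in chunks))
```

Each chunk could then be turned into its own script segment, or only the first chunk could be used when a short episode is enough.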

In [43]:
def generate_script(system_prompt: str, input_text: str, output_model):
    """Get the dialogue from the LLM, retrying once if validation fails."""
    try:
        response = call_llm(system_prompt, input_text, output_model)
        # response.content is already a JSON string, so validate it directly
        dialogue = output_model.model_validate_json(response.content)
    except ValidationError as e:
        # Retry once, feeding the validation error back to the model.
        # Braces are escaped in case call_llm treats the prompt as a template.
        error_message = f"Failed to parse dialogue JSON: {e}".replace("{", "{{").replace("}", "}}")
        system_prompt_with_error = f"{system_prompt}\n\nPlease return a VALID JSON object. This was the earlier error: {error_message}"
        response = call_llm(system_prompt_with_error, input_text, output_model)
        dialogue = output_model.model_validate_json(response.content)
    return dialogue
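
The retry-with-feedback idea in `generate_script` can be shown in isolation. The sketch below is a simplified stand-in rather than the notebook's implementation: it uses plain `json` parsing instead of pydantic validation, and `generate_with_retry` and `fake_model` are hypothetical names used so the example runs without an API key.

```python
import json
from typing import Callable, List


def generate_with_retry(call: Callable[[str], str], prompt: str, max_attempts: int = 2) -> dict:
    """Call the model and, on invalid JSON, retry with the error appended to the prompt."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        raw = call(current_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            last_error = e
            current_prompt = (
                f"{prompt}\n\nPlease return a VALID JSON object. "
                f"This was the earlier error: {e}"
            )
    raise ValueError(f"No valid JSON after {max_attempts} attempts: {last_error}")


# Stand-in for the LLM: the first reply is broken, the second one is valid
replies = iter(["not json", '{"dialogue": []}'])
seen_prompts: List[str] = []


def fake_model(prompt: str) -> str:
    seen_prompts.append(prompt)
    return next(replies)


result = generate_with_retry(fake_model, "Write a script.")
print(result, len(seen_prompts))
```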

In [52]:
script = generate_script(SYSTEM_PROMPT, text, Dialogue)

ValidationError: 3 validation errors for Dialogue
scratchpad
  Field required [type=missing, input_value={'response': {'dialogue':...'s been a pleasure!"}]}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
name_of_guest
  Field required [type=missing, input_value={'response': {'dialogue':...'s been a pleasure!"}]}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
dialogue
  Field required [type=missing, input_value={'response': {'dialogue':...'s been a pleasure!"}]}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
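
The `input_value` in the error shows that the model wrapped its answer in an extra `response` key, so the required fields are not at the top level. One possible workaround, sketched below, is to search the parsed JSON for the first object that carries all required keys before validating it. `find_payload` is a hypothetical helper, and it only helps when the nested object really contains every field.

```python
import json
from typing import Any, Optional, Set

REQUIRED_KEYS = {"scratchpad", "name_of_guest", "dialogue"}


def find_payload(obj: Any, required: Set[str] = REQUIRED_KEYS) -> Optional[dict]:
    """Depth-first search for the first dict that holds all required keys."""
    if isinstance(obj, dict):
        if required.issubset(obj.keys()):
            return obj
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None

    for child in children:
        found = find_payload(child, required)
        if found is not None:
            return found
    return None


# Made-up reply that mimics the wrapping seen in the error above
wrapped = '{"response": {"scratchpad": "Plan the flow.", "name_of_guest": "Dr. Example", "dialogue": []}}'
payload = find_payload(json.loads(wrapped))
print(payload)
```

The unwrapped dict could then be passed to `Dialogue.model_validate` in place of the raw string.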

In [None]:
len(script.dialogue)

13

In [None]:
script.dialogue

[DialogueItem(speaker='Host (Jane)', text="Welcome to [Podcast Name], I'm your host, Jane. Today, we're exploring the world of meditation with Dr. Sarah Thompson, a renowned psychologist and mindfulness expert. Welcome, Sarah!"),
 DialogueItem(speaker='Guest', text="Thanks for having me, Jane. It's great to be here."),
 DialogueItem(speaker='Host (Jane)', text="So, Sarah, let's dive right in. Can you explain what meditation is and its benefits?"),
 DialogueItem(speaker='Guest', text='Absolutely. Meditation is a practice where you focus your mind, often on your breath or a mantra. It has numerous benefits, like improved focus, better sleep, and enhanced emotional well-being.'),
 DialogueItem(speaker='Host (Jane)', text='Interesting. How does meditation help with stress reduction?'),
 DialogueItem(speaker='Guest', text='Meditation helps us become more aware of our thoughts and emotions, allowing us to respond to stressors more calmly and effectively.'),
 DialogueItem(speaker='Host (Jane)

### Generate Podcast Using TTS

Below we read through the script and choose the TTS voice depending on the speaker. We define a host voice ID and a guest voice ID.
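
The speaker-to-voice choice can also be written as a lookup table. This is a small sketch that reuses the two voice IDs from the cell below; `pick_voice` is a hypothetical helper, and falling back to the host voice for unknown speakers is an assumption that mirrors the `else` branch of that cell.

```python
HOST_VOICE_ID = "694f9389-aac1-45b6-b726-9d9369183238"   # Jane - host
GUEST_VOICE_ID = "a0e99841-438c-4a64-b679-ae501e7d6091"  # Guest

VOICE_BY_SPEAKER = {
    "Host (Jane)": HOST_VOICE_ID,
    "Guest": GUEST_VOICE_ID,
}


def pick_voice(speaker: str) -> str:
    """Return the voice ID for a speaker, defaulting to the host voice."""
    return VOICE_BY_SPEAKER.get(speaker, HOST_VOICE_ID)


print(pick_voice("Guest"))
```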

We can loop through the lines in the script and generate each one with a call to the TTS model, using the matching voice. The audio for every line is appended to the same raw PCM file, and once the script finishes we convert it to a `wav` file, ready to be played.
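
To make the PCM-to-WAV step concrete, here is a sketch that does the conversion with the standard library instead of ffmpeg. It is an alternative for illustration, not what the notebook runs: it assumes mono audio, converts the 32-bit float samples to 16-bit integers because the `wave` module writes integer PCM, and uses a generated test tone in place of real TTS output.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44100  # same value as sample_rate in output_format


def f32le_to_wav_bytes(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Convert mono float32 little-endian PCM into 16-bit WAV bytes."""
    count = len(pcm) // 4  # 4 bytes per float32 sample
    samples = struct.unpack(f"<{count}f", pcm[: count * 4])
    # Clamp to [-1, 1], then scale to signed 16-bit integers
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]

    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono audio
        wav.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{count}h", *ints))
    return out.getvalue()


# Test tone standing in for TTS output: 0.1 s of a 440 Hz sine wave
tone_samples = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 10)]
tone = struct.pack(f"<{len(tone_samples)}f", *tone_samples)
wav_bytes = f32le_to_wav_bytes(tone)
print(len(wav_bytes))
```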



In [None]:
import subprocess
import ffmpeg

host_id = "694f9389-aac1-45b6-b726-9d9369183238" # Jane - host
guest_id = "a0e99841-438c-4a64-b679-ae501e7d6091" # Guest

model_id = "sonic-english" # The Sonic Cartesia model for English TTS

output_format = {
    "container": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 44100,
    }

# Set up a WebSocket connection.
ws = client_cartesia.tts.websocket()

# Open a file to write the raw PCM audio bytes to.
f = open("podcast.pcm", "wb")

# Generate and stream audio.
for line in script.dialogue:
    if line.speaker == "Guest":
        voice_id = guest_id
    else:
        voice_id = host_id

    for output in ws.send(
        model_id=model_id,
        transcript='-' + line.text, # the "-" adds a pause between speakers
        voice_id=voice_id,
        stream=True,
        output_format=output_format,
    ):
        buffer = output["audio"]  # buffer contains raw PCM audio bytes
        f.write(buffer)

# Close the connection to release resources
ws.close()
f.close()

# Convert the raw PCM bytes to a WAV file.
ffmpeg.input("podcast.pcm", format="f32le").output("podcast.wav").run()

# Play the file
subprocess.run(["ffplay", "-autoexit", "-nodisp", "podcast.wav"])

CompletedProcess(args=['ffplay', '-autoexit', '-nodisp', 'podcast.wav'], returncode=0)

In [None]:
# Play the podcast
from IPython.display import Audio
Audio("podcast.wav")