<a href="https://colab.research.google.com/github/duper203/official_cookbook_upstage/blob/pdf-to-podcast/PDF_to_Podcast_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Implementation of Notebook LM's PDF to Podcast

## Overview

This cookbook is inspired by [Notebook LM's](https://notebooklm.google/) podcast generation feature and by [Open Notebook LM](https://github.com/gabrielchua/open-notebooklm), a recent open-source implementation of it. It walks through building a PDF-to-podcast pipeline.

## Purpose of the Exercise

The purpose of this exercise is to guide users through the process of building an automated pipeline that transforms a PDF document into a podcast-ready script and audio output. Specifically, it integrates PDF parsing, question generation, Retrieval-Augmented Generation (RAG) for contextual answers, and text-to-speech (TTS) synthesis to create a complete podcast production workflow.

## 1. Install Dependencies / Import Necessary Libraries



In [None]:
!apt install -qU libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg
!pip install -qU ffmpeg-python
!pip install -qU PyAudio
!pip install -qU cartesia #to access TTS model
!pip install -qU langchain-upstage langchain langchain_community
!pip install -qU faiss-cpu

In [None]:
import os
from google.colab import userdata

from pathlib import Path
from tempfile import NamedTemporaryFile
from typing import List, Literal, Tuple, Optional, Dict, Union, Any

import json
from pydantic import BaseModel

from cartesia import Cartesia
from pydantic import ValidationError

from langchain_upstage import ChatUpstage, UpstageEmbeddings, UpstageDocumentParseLoader
from langchain_core.prompts import ChatPromptTemplate

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

In [None]:
# @title set API key
from pprint import pprint
import os

import warnings

warnings.filterwarnings("ignore")

if "google.colab" in str(get_ipython()):
    # Running in Google Colab. Please set UPSTAGE_API_KEY and CARTESIA_API_KEY in the Colab Secrets
    from google.colab import userdata

    os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")
    os.environ["CARTESIA_API_KEY"] = userdata.get("CARTESIA_API_KEY")

else:
    # Running locally. Please set UPSTAGE_API_KEY and CARTESIA_API_KEY in the .env file
    from dotenv import load_dotenv

    load_dotenv()

assert (
    "UPSTAGE_API_KEY" in os.environ
), "Please set the UPSTAGE_API_KEY environment variable"
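The assertion above checks only `UPSTAGE_API_KEY`, while the TTS step later reads `CARTESIA_API_KEY` as well. A minimal sketch of a check that reports every missing key at once follows; `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced here, not part of either SDK.

```python
import os
from typing import Iterable, List, Mapping

# Keys this notebook reads from the environment (assumed complete for this walkthrough).
REQUIRED_KEYS = ("UPSTAGE_API_KEY", "CARTESIA_API_KEY")


def missing_keys(
    required: Iterable[str] = REQUIRED_KEYS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


# Demonstration with a fake environment so the result is deterministic.
print(missing_keys(env={"UPSTAGE_API_KEY": "dummy"}))  # ['CARTESIA_API_KEY']
```

Calling `missing_keys()` with no arguments inspects the real environment, so a non-empty result can be turned into a single, complete error message before any API call is made.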

## 2. Generate QnA context for the podcast using RAG

### [2-1] Generate 7 Questions to Ask from the PDF

In this step, we use Upstage Solar to generate insightful and engaging questions based on the content of the provided PDF. The goal is a set of questions that covers different aspects of the document and suits a podcast interview format. The questions should provoke thought, encourage in-depth discussion, and highlight key points from the PDF.


*   Upstage DocParse
*   Upstage Solar


In [None]:
# Load the PDF of your choice
def get_PDF_text(file):
    """Parse the PDF with Upstage Document Parse and return its text as one string."""
    text = ''
    loader = UpstageDocumentParseLoader(file, output_format='text')

    pages = loader.load()
    for page in pages:
        text += page.page_content

    return text

text = get_PDF_text('pdfs/solar_paper.pdf')

In [None]:
# Generate Questions Using LLM

QUESTION_PROMPT = """
You are an AI assistant tasked with generating a list of engaging questions for a podcast interview.
Based on the given text, create 7 questions that would be relevant for a podcast discussion.
The questions should be thought-provoking, insightful, and aimed at extracting key information.
Ensure the questions are diverse and cover different aspects of the text content.

Return a JSON object with a single key "questions" whose value is an array of the question strings.
"""

def generate_questions(system_prompt: str, text: str):

    llm = ChatUpstage(extra_body={"response_format": {"type": "json_object"}})
    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "{text}")
    ])

    chain = chat_prompt | llm

    response = chain.invoke({"text": text})
    print(response.content)

    try:
        response_dict = json.loads(response.content)
        questions = response_dict.get("questions", [])
        if not isinstance(questions, list) or len(questions) == 0:
            raise ValueError("Invalid response format or no questions generated")
    except (json.JSONDecodeError, ValueError) as e:
        print(f"Error generating questions: {e}")
        return []
    return questions

questions=generate_questions(QUESTION_PROMPT, text)
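A model does not always return exactly the shape or the number of questions that the prompt asks for. The sketch below shows one defensive way to parse such a reply using only the standard library; `parse_questions` is a hypothetical helper, and accepting a bare JSON array in addition to a `{"questions": [...]}` object is an assumption about what the model might send back.

```python
import json
from typing import List


def parse_questions(raw: str, limit: int = 7) -> List[str]:
    """Extract up to `limit` question strings from a JSON reply.

    Accepts {"questions": [...]} or a bare JSON array; returns [] otherwise.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if isinstance(data, dict):
        data = data.get("questions", [])
    if not isinstance(data, list):
        return []
    # Keep non-empty strings only, then cap the count.
    cleaned = [q.strip() for q in data if isinstance(q, str) and q.strip()]
    return cleaned[:limit]
```

Capping the list keeps the later RAG loop bounded even when the model returns more questions than requested.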

In [None]:
questions

['What is the main contribution of the SOLAR 10.7B model?',
 'How does the depth up-scaling (DUS) method differ from other up-scaling methods like mixture-of-experts (MoE)?',
 'What are the advantages of using DUS over other up-scaling methods?',
 'What are the key components of the SOLAR 10.7B model?',
 'How does the SOLAR 10.7B model outperform existing models in various NLP tasks?',
 'What are the different stages of fine-tuning for the SOLAR 10.7B-Instruct model?',
 "What is the role of the alignment tuning stage in enhancing the SOLAR 10.7B-Instruct model's performance?",
 'How does the SOLAR 10.7B-Instruct model compare to other top-performing models in terms of performance metrics?',
 'What are the limitations and considerations of the depth up-scaling (DUS) method?',
 'How does the SOLAR 10.7B-Instruct model address ethical concerns in its operation?']

### [2-2] Retrieve and Generate Answers for Each Question Using RAG

Once the questions are generated, we use a Retrieval-Augmented Generation (RAG) approach to obtain contextually relevant answers. This involves retrieving the most relevant sections from the PDF content, which has been embedded into a vector store, and then using the language model to generate detailed and informative answers. This ensures that the responses are backed by the original document, making them accurate and well-supported for podcast narration.


*   Upstage Embedding Model
*   Faiss

In [None]:
# Embed PDF Content and Create Vector Store
def vectorstore_embed(file_path: str) -> FAISS:
    """Parse the PDF, split it into chunks, embed them, and return a FAISS vector store."""
    # Use the path passed in rather than a hard-coded file name
    loader = UpstageDocumentParseLoader(file_path, output_format='text')
    documents = loader.load()

    # Split into overlapping chunks of at most 1000 characters
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200, length_function=len
    )

    texts = text_splitter.split_documents(documents)

    # Normalize tabs to spaces before embedding
    for doc in texts:
        doc.page_content = doc.page_content.replace('\t', ' ')

    embeddings = UpstageEmbeddings(model="solar-embedding-1-large")
    vectorstore = FAISS.from_documents(texts, embeddings)

    return vectorstore

vectorstore=vectorstore_embed('pdfs/solar_paper.pdf')
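To make the `chunk_size=1000` and `chunk_overlap=200` settings concrete, here is a simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); `sliding_window_chunks` is a hypothetical function that only illustrates how size and overlap interact.

```python
from typing import List


def sliding_window_chunks(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reaches the end of the text
    return chunks


# A 2500-character text yields windows starting at 0, 800, and 1600.
print([len(c) for c in sliding_window_chunks("x" * 2500)])  # [1000, 1000, 900]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.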

In [None]:
# Retrieve Contexts for Questions
def retrieve_contents(vectorstore: FAISS, question: str):
  """Return the single most relevant chunk (k=1) for the question."""
  retriever_store = vectorstore.as_retriever(search_kwargs={"k": 1})

  docs = retriever_store.invoke(question)

  return docs
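Conceptually, the retriever embeds the question and returns the `k` stored chunks whose vectors are closest to it. The toy sketch below ranks by cosine similarity over hand-written 2-D vectors; this is an illustration only, since the FAISS index may rank by a different metric (such as L2 distance), and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int = 1
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" for three chunks and one query.
chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best_index, _ = top_k([1.0, 0.1], chunks, k=1)[0]
print(best_index)  # 0
```

With `k=1`, each answer is grounded in a single chunk of at most 1000 characters; raising `k` gives the model more context per question at the cost of a longer prompt.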

In [None]:
# Generate Answers Using LLM
def generate_answer(question: str) -> str:
    """Generate an answer to a given question using the retrieved context."""

    # Join the retrieved chunks into plain text instead of passing Document objects
    docs = retrieve_contents(vectorstore, question)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Pass the question as a template variable so braces in it cannot break the prompt
    prompt = (
        "You are a guest on a podcast interview, answering as a professional. "
        "Answer the following question based only on the provided document, "
        "in the tone of a podcast interview: {question}"
    )

    llm = ChatUpstage()
    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", prompt),
        ("human", "{context}")
    ])

    chain = chat_prompt | llm

    response = chain.invoke({"question": question, "context": context})
    print(response.content)

    return response.content


In [None]:
#Create QA Script
def create_qa_script(questions, pdf_text):
    qa_script = []
    for question in questions:
        answer = generate_answer(question)
        qa_script.append({"speaker": "Host (Jane)", "text": question})
        qa_script.append({"speaker": "Guest", "text": answer})
    return qa_script

qa_script = create_qa_script(questions, text)

The main goal of the study presented in this paper is to investigate the advantages and limitations of dental, pharmacy, and public health education.
The proposed DUS method differs from other LLM up-scaling methods by focusing on depth up-scaling, which involves scaling the number of layers in the base model and continually pretraining the scaled model. Unlike some other methods that use Mixture of Experts (MoE) to scale the model, DUS uses a depthwise scaling method similar to Tan and Le (2019) adapted for the LLM architecture. This approach makes DUS more straightforward to use and immediately compatible with easy-to-use LLM frameworks like Hugging Face (Wolf et al., 2019) without requiring any changes to the existing framework.
The key components of the DUS method are not explicitly mentioned in the provided document. However, based on the context, it seems that the DUS method refers to the "Depth Up-Scaling" approach mentioned in the text. The limitations and considerations discus

In [None]:
qa_script

[{'speaker': 'Host (Jane)',
  'text': 'What is the main contribution of the SOLAR 10.7B model?'},
 {'speaker': 'Guest',
  'text': 'The main contribution of the SOLAR 10.7B model is the introduction of a depth-wise scaled and continually pretrained model that is available under the Apache 2.0 license for commercial use. This model outperforms other benchmarks in various fields, bridging the gap between academic research and practical applications.'},
 {'speaker': 'Host (Jane)',
  'text': 'How does the depth up-scaling (DUS) method differ from other up-scaling methods like mixture-of-experts (MoE)?'},
 {'speaker': 'Guest',
  'text': 'Depth up-scaling (DUS) differs from other up-scaling methods like mixture-of-experts (MoE) in several ways. Firstly, DUS focuses on increasing the number of layers in the base model, while MoE introduces a Mixture-of-Experts architecture to scale the model. Secondly, DUS uses a depthwise scaling method similar to Tan and Le (2019), which is adapted for the L

## 3. Generate the Complete Podcast Script from the QnA Script Above

This section generates the full podcast script from the Q&A content above. The LLM turns the structured question-answer pairs into a conversational format suited to a podcast, with an engaging and natural dialogue between the host and the guest.
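The Q&A data is a list of `{"speaker": ..., "text": ...}` dictionaries. One option, sketched here as an assumption rather than as part of the original pipeline, is to render it as a plain transcript before handing it to the LLM, so the prompt contains readable dialogue instead of the Python representation of a list. `format_qa_script` is a hypothetical helper.

```python
from typing import Dict, List


def format_qa_script(qa_script: List[Dict[str, str]]) -> str:
    """Render speaker/text dictionaries as one 'Speaker: text' line per turn."""
    return "\n".join(f'{turn["speaker"]}: {turn["text"]}' for turn in qa_script)


sample = [
    {"speaker": "Host (Jane)", "text": "What is DUS?"},
    {"speaker": "Guest", "text": "Depth up-scaling."},
]
print(format_qa_script(sample))
# Host (Jane): What is DUS?
# Guest: Depth up-scaling.
```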



In [None]:
class DialogueItem(BaseModel):
    """A single dialogue item."""

    speaker: Literal["Host (Jane)", "Guest"]
    text: str


class Dialogue(BaseModel):
    """The dialogue between the host and guest."""

    name_of_guest: str
    dialogue: List[DialogueItem]

In [None]:
# Adapted and modified from https://github.com/gabrielchua/open-notebooklm
SYSTEM_PROMPT = """
You are a world-class podcast producer tasked with transforming the provided input text {text} into an engaging and informative podcast script.
Ensure the response adheres to this format:

{{
"name_of_guest": "<string>",
"dialogue": [
    {{
      "speaker": "Host (Jane)",
      "text": "<string>"
    }},
    {{
      "speaker": "Guest",
      "text": "<string>"
    }},
    ...
  ]
}}

# Steps to Follow:

0. The value of "name_of_guest" should be a realistic person's name.

1. **Craft the Dialogue:**
   Develop a natural, conversational flow between the host (Jane) and the guest speaker (the author or an expert on the topic).

   Dialogue content:
   The input text {text} is Q&A content and is the main context for the podcast.
   Include all the questions and answers from {text} in the podcast script.

   Incorporate:
   - Clear explanations of complex topics
   - An engaging and lively tone to captivate listeners
   - A balance of information and entertainment

   Rules for the dialogue:
   - The host (Jane) always initiates the conversation and interviews the guest
   - Include thoughtful questions from the host to guide the discussion
   - Incorporate natural speech patterns, including occasional verbal fillers (e.g., "Uhh", "Hmmm", "um," "well," "you know")
   - Allow for natural interruptions and back-and-forth between host and guest - this is very important to make the conversation feel authentic
   - Ensure the guest's responses are substantiated by the input text, avoiding unsupported claims
   - Maintain a PG-rated conversation appropriate for all audiences
   - Avoid any marketing or self-promotional content from the guest
   - The host concludes the conversation


2. **Maintain Authenticity:**
   Throughout the script, strive for authenticity in the conversation. Include:
   - Moments of genuine curiosity or surprise from the host
   - Instances where the guest might briefly struggle to articulate a complex idea
   - Light-hearted moments or humor when appropriate
   - Brief personal anecdotes or examples that relate to the topic (within the bounds of the input text)

3. **Consider Pacing and Structure:**
   Ensure the dialogue has a natural ebb and flow:
   - Start with a strong hook to grab the listener's attention
   - Gradually build complexity as the conversation progresses
   - Include brief "breather" moments for listeners to absorb complex information
   - For complicated concepts, re-asking a similar question from a different perspective is recommended
   - End on a high note, perhaps with a thought-provoking question or a call-to-action for listeners

IMPORTANT RULE: Each line of dialogue should be no more than 100 characters (e.g., can finish within 5-8 seconds)

Remember: Always reply in valid JSON format, without code blocks. Begin directly with the JSON output.
"""

In [None]:
def call_llm(system_prompt: str, text, dialogue_format):
    """Call the LLM with the given prompt and dialogue format."""
    llm = ChatUpstage(extra_body={"response_format": {"type": "json_object", "schema":dialogue_format.model_json_schema()}})


    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "{text}")
    ])

    # Create the chain
    chain = chat_prompt | llm

    # Call the chain with the input text
    response = chain.invoke({"text": text})
    return response

In [None]:
def generate_script(system_prompt: str, input_text, output_model):
    """Get the dialogue from the LLM."""
    # Load as python object
    try:
        response = call_llm(system_prompt, input_text, output_model)
        dialogue = output_model.model_validate_json(response.content)
    except ValidationError as e:
        error_message = f"Failed to parse dialogue JSON: {e}"
        system_prompt_with_error = f"{system_prompt}\n\nPlease return a VALID JSON object. This was the earlier error: {error_message}"
        response = call_llm(system_prompt_with_error, input_text, output_model)
        dialogue = output_model.model_validate_json(response.content)
    return dialogue
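The function above retries once, feeding the validation error back into the prompt. The same idea can be generalized to a bounded number of attempts. The sketch below is standalone and uses only `json` with a stubbed model call so it runs without any API key; `generate_with_retry` and its `call` parameter are hypothetical, and checking only that the reply is a JSON object stands in for the Pydantic validation used above.

```python
import json
from typing import Callable, Optional


def generate_with_retry(
    call: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model until it returns a JSON object, feeding errors back."""
    last_error: Optional[Exception] = None
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call(current_prompt)
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            return parsed
        except ValueError as e:  # json.JSONDecodeError is a subclass of ValueError
            last_error = e
            # Rebuild from the original prompt so error notes do not pile up.
            current_prompt = (
                f"{prompt}\n\nPlease return a VALID JSON object. "
                f"This was the earlier error: {e}"
            )
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error


# Stub that fails once, then succeeds.
replies = iter(["not json", '{"name_of_guest": "A", "dialogue": []}'])
result = generate_with_retry(lambda p: next(replies), "Write the script.")
print(result["name_of_guest"])  # A
```

Bounding the attempts keeps a persistently malformed reply from looping forever, and chaining the last error into the final exception preserves the reason for the failure.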

### Generate script

In [None]:
script = generate_script(SYSTEM_PROMPT, qa_script, Dialogue)

In [None]:
script

Dialogue(name_of_guest='Dr. Alice Chan', dialogue=[DialogueItem(speaker='Host (Jane)', text="Welcome to our podcast, Dr. Alice Chan. Today, we'll be discussing the SOLAR 10.7B model and its contributions to the field of natural language processing. Let's start with the main contribution of this model. Can you explain what it is, Dr. Chan?"), DialogueItem(speaker='Guest', text='Of course, Jane. The main contribution of the SOLAR 10.7B model is the introduction of a depth-wise scaled and continually pretrained model that is available under the Apache 2.0 license for commercial use. This model outperforms other benchmarks in various fields, bridging the gap between academic research and practical applications.'), DialogueItem(speaker='Host (Jane)', text="Interesting! Now, let's talk about the depth up-scaling (DUS) method. How does it differ from other up-scaling methods like mixture-of-experts (MoE)?"), DialogueItem(speaker='Guest', text='Depth up-scaling (DUS) differs from other up-scal
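The system prompt asks for dialogue lines of at most 100 characters, but a model may not follow that rule. A small check, sketched here with a hypothetical `overlong_lines` helper, makes violations visible before any audio is generated.

```python
from typing import List, Tuple


def overlong_lines(texts: List[str], max_chars: int = 100) -> List[Tuple[int, int]]:
    """Return (index, length) for every line longer than `max_chars`."""
    return [(i, len(t)) for i, t in enumerate(texts) if len(t) > max_chars]


lines = ["Welcome to the show!", "x" * 120, "Thanks for joining us."]
print(overlong_lines(lines))  # [(1, 120)]
```

With the objects from this notebook, the input could be built as `[item.text for item in script.dialogue]`; what to do with an overlong line (split it, regenerate, or accept it) is left as a design choice.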

## 4. Generate Podcast Using TTS

Below, we step through the script and choose a TTS voice based on the speaker, using one voice ID for the host and one for the guest.

We loop through the lines of the script and send each one to the TTS model with the matching voice. The audio for every line is appended to the same raw PCM file; once the script finishes, we convert that file to a `wav` file, ready to be played.



In [None]:
import subprocess
import ffmpeg

host_id = "694f9389-aac1-45b6-b726-9d9369183238" # Jane - host
guest_id = "a0e99841-438c-4a64-b679-ae501e7d6091" # Guest

model_id = "sonic-english" # The Sonic Cartesia model for English TTS

output_format = {
    "container": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 44100,
}

client_cartesia = Cartesia(api_key=os.environ.get("CARTESIA_API_KEY"))


# Set up a WebSocket connection.
ws = client_cartesia.tts.websocket()

# Open a file to write the raw PCM audio bytes to.
f = open("podcast.pcm", "wb")

# Generate and stream audio.
for line in script.dialogue:
    if line.speaker == "Guest":
        voice_id = guest_id
    else:
        voice_id = host_id

    for output in ws.send(
        model_id=model_id,
        transcript='-' + line.text, # the "-" adds a pause between speakers
        voice_id=voice_id,
        stream=True,
        output_format=output_format,
    ):
        buffer = output["audio"]  # buffer contains raw PCM audio bytes
        f.write(buffer)

# Close the connection to release resources
ws.close()
f.close()

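# Convert the raw PCM bytes to a WAV file.
# The sample rate is passed explicitly to match output_format; mono audio is assumed.
ffmpeg.input(
    "podcast.pcm", format="f32le", ar=output_format["sample_rate"], ac=1
).output("podcast.wav").run(overwrite_output=True)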

# Play the file
subprocess.run(["ffplay", "-autoexit", "-nodisp", "podcast.wav"])

CompletedProcess(args=['ffplay', '-autoexit', '-nodisp', 'podcast.wav'], returncode=0)
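If `ffmpeg` is not available, the raw stream can also be converted with the standard library alone. The sketch below is an alternative to the `ffmpeg` step, not what the notebook does: it assumes mono audio, converts the 32-bit float samples to 16-bit integers (because the `wave` module writes integer PCM), and introduces two hypothetical helpers, `f32le_to_wav_bytes` and `pcm_duration_seconds`.

```python
import array
import io
import sys
import wave


def pcm_duration_seconds(pcm: bytes, sample_rate: int = 44100) -> float:
    """Duration of mono float32 PCM: 4 bytes per sample."""
    return len(pcm) / (4 * sample_rate)


def f32le_to_wav_bytes(pcm: bytes, sample_rate: int = 44100) -> bytes:
    """Convert mono little-endian float32 PCM to a 16-bit WAV byte string."""
    floats = array.array("f")
    floats.frombytes(pcm[: len(pcm) - len(pcm) % 4])  # drop any trailing partial sample
    if sys.byteorder == "big":
        floats.byteswap()  # the input is little-endian by definition
    # Clip to [-1.0, 1.0] and scale to the signed 16-bit range.
    ints = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in floats))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(ints.tobytes())  # wave expects native-endian sample data
    return buf.getvalue()
```

For the file written above, usage would look like `open("podcast.wav", "wb").write(f32le_to_wav_bytes(open("podcast.pcm", "rb").read()))`. The 16-bit conversion loses some precision compared with the float stream, which is usually acceptable for speech.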

In [None]:
# Play the podcast
from IPython.display import Audio
Audio("podcast.wav")