# PDF2Podcast Test Notebook

This notebook demonstrates the features of the pdf2podcast library, including:
- Text extraction from PDF
- Text chunking
- Semantic search
- Podcast generation

In [1]:
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Add the parent directory to sys.path so the local modules can be imported
module_path = str(Path(os.getcwd()).parent)
if module_path not in sys.path:
    sys.path.append(module_path)

# Direct imports from the local modules
from pdf2podcast import PodcastGenerator
from pdf2podcast.core.rag import AdvancedPDFProcessor as SimplePDFProcessor
from pdf2podcast.core.llm import GeminiLLM
from pdf2podcast.core.tts import AWSPollyTTS, GoogleTTS
from pdf2podcast.core.prompts import PodcastPromptBuilder
from pdf2podcast.core.processing import SimpleChunker, SemanticRetriever


## Setup

Initial setup and configuration.

In [2]:
# Load the environment variables
load_dotenv("../.env")

# Check for the required keys
api_key = os.getenv("GENAI_API_KEY")
if not api_key:
    raise ValueError("GENAI_API_KEY not found in the .env file")

# Check that the test PDF exists
PDF_PATH = "./test2.pdf"
if not os.path.exists(PDF_PATH):
    raise FileNotFoundError(f"PDF file not found: {PDF_PATH}")
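
The key check above can be extended to several variables at once. The following is a minimal sketch, assuming a hypothetical helper named `require_env` (it is not part of pdf2podcast) and assuming that an empty value should count as missing:

```python
import os


def require_env(names, environ=None):
    """Return the requested variables as a dict, raising if any is missing or empty.

    Hypothetical helper for illustration only; not part of pdf2podcast.
    """
    environ = os.environ if environ is None else environ
    # Collect every missing name so the error reports all of them together.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

In this notebook it could be called after `load_dotenv` as `api_key = require_env(["GENAI_API_KEY"])["GENAI_API_KEY"]`.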

## Basic Test: Text Extraction

In [4]:
# Initialize the PDF processor (SimplePDFProcessor is an alias for AdvancedPDFProcessor, see the imports)
processor = SimplePDFProcessor(
    max_chars_per_chunk=4000,
    extract_images=True,
    metadata=True
)

# Extract the text
text = processor.process_document(PDF_PATH)
print("Extracted text length:", len(text))
print("\nFirst 500 characters:\n")
print(text[:500])

Extracted text length: 39685

First 500 characters:

Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗† University of Toronto aidan@cs.toronto.edu Łukasz K


## Chunking Test

In [6]:
# Initialize the chunker
chunker = SimpleChunker()

# Split the text into chunks
chunks = chunker.chunk_text(text, chunk_size=4000)

print(f"Number of chunks: {len(chunks)}")
print("\nFirst chunk:\n")
print(chunks[0])

Number of chunks: 411

First chunk:

Understanding Deep Learning Simon J.D. Prince November 21, 2024 The most recent version of this document can be found at http://udlbook.com. Copyright in this work has been licensed exclusively to The MIT Press, https://mitpress.mit.edu, which released the final version to the public in December 2023. Inquiries regarding rights should be addressed to the MIT Press, Rights & Permissions Department. This work is subject to a Creative Commons CC-BY-NC-ND license. I would really appreciate help improving this document. No detail too small! Please contact me with suggestions, factual inaccuracies, ambiguities, questions, and errata via github or by e-mail at udlbookmail@gmail.com.
This book is dedicated to Blair, Calvert, Coppola, Ellison, Faulkner, Kerpatenko, Morris, Robinson, Sträussler, Wallace, Waymon, Wojnarowicz, and all the others whose work is even more important and interesting than deep learning.
Contents Preface ix Acknowledgements xi 1 Intro
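
The notebook does not show how `SimpleChunker` decides where to split. As a rough illustration of size-bounded chunking in general, the sketch below packs whitespace-separated words greedily into chunks of at most `chunk_size` characters. The function name `chunk_by_size` and the word-boundary rule are assumptions made for this example, not the library's behavior.

```python
def chunk_by_size(text, chunk_size=4000):
    """Greedily pack whitespace-separated words into chunks of at most chunk_size characters.

    Illustrative only: SimpleChunker's actual splitting rules are not shown in this notebook.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks, current, current_len = [], [], 0
    for word in text.split():
        # A word longer than chunk_size is cut into hard slices so the bound always holds.
        pieces = [word[i:i + chunk_size] for i in range(0, len(word), chunk_size)]
        for piece in pieces:
            # The +1 accounts for the joining space when the chunk is non-empty.
            extra = len(piece) + (1 if current else 0)
            if current and current_len + extra > chunk_size:
                chunks.append(" ".join(current))
                current, current_len = [], 0
                extra = len(piece)
            current.append(piece)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on whitespace keeps words intact but collapses newlines into single spaces, which is acceptable for text that is later read aloud and less so when paragraph structure matters.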

## Semantic Search Test

In [7]:
# Initialize the retriever
retriever = SemanticRetriever()

# Add the chunks to the retriever
retriever.add_texts(chunks)

# Try a query
query = "Explain the main concepts of the paper"
relevant_chunks = retriever.get_relevant_chunks(query, k=5)

print("Most relevant chunks for the query:", query)
for i, chunk in enumerate(relevant_chunks, 1):
    print(f"\nChunk {i}:\n{chunk}")

Most relevant chunks for the query: Explain the main concepts of the paper

Chunk 1:
16 1 Introduction is Reinforcement Learning: An Introduction (Sutton & Barto, 2018). A good initial resource is Foundations of Deep Reinforcement Learning (Graesser & Keng, 2019). 1.7 How to read this book Most remaining chapters in this book contain a main body of text, a notes section, and a set of problems. The main body of the text is intended to be self-contained and can be read without recourse to the other parts of the chapter. As much as possible, background mathematics is incorporated into the main body of the text. However, for larger topics that would be a distraction to the main thread of the argument, the background material is appendicized, and a reference is provided in the margin. Most notation in this book is Appendix A Notation standard. However, some conventions are less widely used, and the reader is encouraged to consult appendix A before proceeding. The main body of text includes m
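
How `SemanticRetriever` embeds and ranks chunks is not shown in this notebook. The sketch below illustrates the general top-k idea behind a call like `get_relevant_chunks(query, k=5)`: score every chunk against the query and keep the best `k`. It substitutes bag-of-words counts and cosine similarity for a real embedding model, and the names `top_k_chunks`, `_vectorize`, and `_cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def _vectorize(text):
    """Lowercased bag-of-words term counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query, chunks, k=5):
    """Return the k chunks most similar to the query, best first."""
    query_vec = _vectorize(query)
    scored = [(_cosine(query_vec, _vectorize(chunk)), i) for i, chunk in enumerate(chunks)]
    # Sort by descending score; ties keep the original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[i] for _, i in scored[:k]]
```

A word-overlap score like this one only rewards shared vocabulary, so a generic query such as "Explain the main concepts of the paper" tends to match chunks that happen to contain words like "main" or "concepts"; embedding-based retrievers are meant to reduce that effect.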

## Full Test: Podcast Generation

In [3]:
# Initialize the components shared by the processor and the generator
chunker = SimpleChunker()
retriever = SemanticRetriever()
prompt_builder = PodcastPromptBuilder()

# Configure the PDF processor
processor = SimplePDFProcessor(
    chunker=chunker,
    retriever=retriever,
    extract_images=True,
    max_chars_per_chunk=6000,
    metadata=True
)

# Create the generator with the new manager-based configuration
generator = PodcastGenerator(
    rag_system=processor,
    llm_type="gemini",
    tts_type="google",  # Using Google TTS for testing
    llm_config={
        "api_key": api_key,
        "max_output_tokens": 8000,
        "temperature": 0.1,
        "prompt_builder": prompt_builder
    },
    tts_config={
        "language": "en",
        "tld": "com",
        "slow": False
    },
    chunker=chunker,
    retriever=retriever,
)

# Generate the podcast with a specific query
result = generator.generate(
    pdf_path=PDF_PATH,
    output_path="output.mp3",
    complexity="advanced",
    audience="experts",
    query="Explain the main concepts of the paper and the main results"
)

# Show the results
print("Generated script:\n")
print(result["script"])

print("\nAudio details:")
print(f"File: {result['audio']['path']}")
print(f"Size: {result['audio']['size']} bytes")

len(result["script"])

Generated script:

The ideal of a perfect legal system, a system devoid of bias, error, and inefficiency, remains a perpetually elusive aspiration.  The inherent complexities of human interaction, the subjective interpretation of laws, and the limitations of even the most meticulously crafted legal frameworks conspire to create a persistent gap between this ideal and the reality of legal practice. This discrepancy, far from being a mere philosophical abstraction, represents a significant challenge with profound societal implications.  Consider, for instance, the disparities in sentencing for similar crimes, often influenced by factors extraneous to the legal merits of the case – socioeconomic background, racial bias, or even the perceived demeanor of the defendant.  These inconsistencies highlight the inherent limitations of a system striving for objectivity within a fundamentally subjective human context.  The pursuit of justice, therefore, necessitates a constant critical examination 

6824

In [5]:
len(result["script"])

6360