# Using OpenAI's tts-1 model to turn a PDF into spoken audio

## Module and API set-up

To use the tts-1 model, the OpenAI client needs access to our personal API key.

To use this repo, you must have a valid API key stored in the OPENAI_API_KEY environment variable.

For more information, visit [OpenAI's developer quickstart guide](https://platform.openai.com/docs/quickstart/developer-quickstart).

In [3]:
import pymupdf 
from openai import OpenAI, OpenAIError
from pathlib import Path 
import os
import re

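# OpenAI() reads OPENAI_API_KEY from the environment by default, so fail early if it is missing
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("The OPENAI_API_KEY environment variable is not set")
client = OpenAI()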

## PDF extraction and sanitation

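Scanned or image-based PDFs contain pictures of pages rather than characters, so their text is only available if Optical Character Recognition (OCR) has produced a text layer.

We can use pymupdf to open the given file via pymupdf.open(), then iterate over the pages and extract the text of each one via page.get_text(). Note that page.get_text() reads the text already stored in the PDF; it does not run OCR itself.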

In [17]:
def scrape_pdf(path):
    corpus = ""
    doc = pymupdf.open(path)
    for page in doc:
        corpus += page.get_text()
    
    print(corpus[0:750]) #truncated for readability 

scrape_pdf("tts-test.pdf")

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse non sem sem. Aenean sit amet bibendum sem. Nunc
placerat placerat scelerisque. Integer bibendum ligula odio, consequat consectetur arcu faucibus id. Maecenas bibendum est
augue, in molestie purus porttitor commodo. Orci varius natoque penatibus et magnis dis parturient montes, nascetur
ridiculus mus. Donec gravida lectus massa, a malesuada turpis pulvinar id. Phasellus gravida semper mi, quis semper nunc
porta eu. Vestibulum nec blandit mi, vitae suscipit sapien. Ut auctor consequat libero, quis malesuada diam tincidunt quis.
Pellentesque vulputate ex venenatis lorem rhoncus tempus. In faucibus nisl augue. Ut diam turpis, fermentum sit amet
finibus iaculis, luctus eu velit


As seen in the output, the text has been extracted with no lost or incorrect characters.

However, this is not always the case. Given the nature of PDFs, it is not guaranteed that the extracted text will be accurate, especially for scanned documents whose text comes from OCR.
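One rough way to spot problem pages before generating audio is to flag pages that yield little or no text. The sketch below is only an illustration: the helper name `pages_without_text` and the `min_chars` threshold are hypothetical, and it relies on nothing more than the `page.get_text()` call already used in this notebook.

```python
def pages_without_text(pages, min_chars=20):
    """Return the 0-based indices of pages whose extracted text is shorter than min_chars.

    `pages` is any iterable of objects with a get_text() method,
    such as the document returned by pymupdf.open(path).
    The threshold of 20 characters is an arbitrary assumption.
    """
    return [
        index
        for index, page in enumerate(pages)
        if len(page.get_text().strip()) < min_chars
    ]
```

For example, `pages_without_text(pymupdf.open("scanned-tts-test.pdf"))` would list the pages that likely have no usable text layer. A page that passes this check can still contain OCR mistakes, so it is a coarse filter only.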

If we were to convert extracted text into audio as-is, all characters, even incorrect ones, would be 'read' by the model, resulting in poor audio output.

We can demonstrate this by extracting the text of a scanned textbook. The output has broken characters and inconsistent line breaks.

In [28]:
def scrape_pdf(path):
    corpus = ""
    doc = pymupdf.open(path)
    for page in doc:
        corpus += page.get_text()
    print(corpus[0:1200]) #truncated for readability 
    return corpus

corpus = scrape_pdf("scanned-tts-test.pdf")

Individuals 
Is Politics Really 
About People? 
Barrie Axford 
Introduction 
In the Introduction to this volume we learned that the concepts and ideas 
which are at the heart of political study are themselves intellectual battle-
grounds. Key terms like 'power' and 'freedom' are linguistic and often moral 
minefields over which the student of politics has to pass with great care, and it 
is easy to confuse or else conflate the normative with the empirical, or to 
transgress seeming rules about the basis of scientific inquiry. So in the study of 
policies 
very little can be taken for granted, and this caveat extends to what we 
study and how we study it. 
In this chapter we will begin our examination of the nature of political 
inquiry and the scope and content of politics by looking at the place of the 
individual in political life. The term 'individual' is in common use, so much 
so that we tend to take its meaning for 'granted. We are all individuals in the 
sense that we are single

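The text output can be sanitised with regular expressions. Whilst this will not fully repair broken characters such as "t~", the audio output will sound far more natural.

- The first expression removes standalone whole numbers, which removes page numbers, numbered captions, etc.
- The second expression removes everything from a TOC keyword (Contents, Chapter or Section) to the end of its line. Because the match is case-insensitive, it also drops the rest of any body line that contains one of these words.
- The third expression replaces every character that is not a word character, whitespace or basic punctuation (. , ! ? -) with a space
- The fourth expression collapses consecutive whitespace characters (e.g. line and paragraph breaks) into a single space to prevent unnatural and choppy speech, and the text is then stripped of leading and trailing whitespace
- The text is then split into sentences, and each sentence is capitalised (its first letter is uppercased and the rest is lowercased)
- The last expression inserts a full stop between a lowercase letter and a following uppercase letter to offset punctuation lost in the OCR process. In this order it has no effect, because after the previous step uppercase letters only remain at the start of a sentence.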

In [25]:
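def clean_tts(text):
    # Remove standalone whole numbers (page numbers, numbered captions, etc.)
    text = re.sub(r'\b\d+\b', '', text)
    # Remove everything from a TOC keyword to the end of its line
    text = re.sub(r'\b(CONTENTS|Chapter|Section)\b.*?\n', '', text, flags=re.IGNORECASE)
    # Replace characters other than word characters, whitespace and basic punctuation with a space
    text = re.sub(r'[^\w\s.,!?-]', ' ', text)
    # Collapse runs of whitespace (including line breaks) into one space, then trim the ends
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    # Capitalise the first letter of each sentence; the rest of the sentence is lowercased
    text = '. '.join(sentence.capitalize() for sentence in text.split('. '))
    # Insert a full stop between a lowercase letter and a following uppercase letter
    text = re.sub(r'([a-z])\s([A-Z])', r'\1. \2', text)
    return text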

In [32]:
print(clean_tts(corpus)[0:500])

Individuals is politics really about people? barrie axford introduction in the introduction to this volume we learned that the concepts and ideas which are at the heart of political study are themselves intellectual battle- grounds. Key terms like power and freedom are linguistic and often moral minefields over which the student of politics has to pass with great care, and it is easy to confuse or else conflate the normative with the empirical, or to transgress seeming rules about the basis of s
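The cleaned output above still contains "battle- grounds", a word that was hyphenated across a line break in the source. A possible extra step, not part of the original clean_tts, is to join such words before the whitespace is collapsed. The helper name `join_hyphenated_breaks` is hypothetical, and the sketch assumes that a hyphen at the end of a line always marks a split word, which is not true for genuine compounds such as "well-\nknown".

```python
import re

def join_hyphenated_breaks(text):
    """Join words that were split with a hyphen at the end of a line."""
    # A word character, a hyphen, optional spaces, a newline, optional spaces, a word character
    return re.sub(r'(\w)-[ \t]*\n[ \t]*(\w)', r'\1\2', text)

sample = "are themselves intellectual battle-\ngrounds. Key terms"
print(join_hyphenated_breaks(sample))
# prints: are themselves intellectual battlegrounds. Key terms
```

It would have to run on the raw corpus, for example `clean_tts(join_hyphenated_breaks(corpus))`, because clean_tts replaces the line breaks that the pattern depends on.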


## TTS generation

OpenAI's Audio API has a speech endpoint that returns TTS audio data.

We can call the endpoint as described in [OpenAI's TTS documentation](https://platform.openai.com/docs/guides/text-to-speech/overview).

In [None]:
def tts(oai_model, oai_voice, oai_input):
    # Path(__file__) works when run as a script; in a notebook, use Path.cwd() instead
    speech_file_path = Path(__file__).parent / "speech.wav"
    try:
        response = client.audio.speech.create(
            model=oai_model,
            voice=oai_voice,
            input=oai_input,
            response_format="wav",  # the default is mp3, which would not match the .wav extension
        )
        response.stream_to_file(speech_file_path)
        return True
    except OpenAIError as e:
        print(f"Error: An API error occurred: {str(e)}")
    except Exception as e:
        print(f"Error: An unexpected error occurred: {str(e)}")
    return False

The endpoint takes three inputs, which are wrapped as parameters in this case to allow user choice:
- 'model', allows the use of the following models:
  - Lower quality, lower latency and cheaper tts-1 model ($15.00 / 1M characters)
  - Higher quality, higher latency and more expensive tts-1-hd model ($30.00 / 1M characters)
- 'voice', the API has 6 different voices
  - alloy, echo, fable, onyx, nova, and shimmer
- 'input', which in this case refers to the extracted corpus
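As a rough illustration of the per-character pricing listed above, the cost of a request can be estimated from the length of the input text. The helper `estimate_cost` and the price table are hypothetical names introduced for this sketch, and the prices are simply the figures quoted in this notebook, which may change over time.

```python
# Prices in US dollars per 1M characters, as quoted in this notebook (they may change)
PRICE_PER_MILLION_CHARS = {"tts-1": 15.00, "tts-1-hd": 30.00}

def estimate_cost(char_count, model):
    """Return the estimated cost in US dollars for char_count input characters."""
    return char_count * PRICE_PER_MILLION_CHARS[model] / 1_000_000

print(f"${estimate_cost(2_000, 'tts-1'):.2f}")     # prints $0.03
print(f"${estimate_cost(2_000, 'tts-1-hd'):.2f}")  # prints $0.06
```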

The response.stream_to_file() method then writes the TTS audio data to the 'speech.wav' file.

In case the API raises an error, we wrap the call in a try/except block so that the error is reported in a human-readable format.
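A whole PDF is usually longer than a single request can accept. OpenAI's documentation has listed a maximum input length of 4096 characters per request; treat that figure as an assumption and check the current API reference. The sketch below, with the hypothetical helper `split_for_tts`, shows one way to split the corpus into chunks that stay under a limit while preferring sentence boundaries. It is not part of the original notebook.

```python
import re

def split_for_tts(text, limit=4096):
    """Split text into chunks of at most `limit` characters, preferring sentence boundaries."""
    # Split after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut into limit-sized pieces
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to the endpoint in turn. Note that the tts function above always writes to the same file, so each chunk would need its own output path and the resulting audio files would still have to be joined.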

## Entry point/frontend

Listen to the .wav file in the repo to hear the output.

In [None]:
def main():
    pdf_path = input("Enter the path to your PDF file: ")
    text_inp = clean_tts(scrape_pdf(pdf_path))  # sanitise the text before sending it to the model
    char_count = len(text_inp)  # pricing is per character, not per token
    print("Character count:", char_count)
    print(f"\nEstimated cost of TTS (tts-1): ${char_count * 0.000015:.4f}")
    print(f"\nEstimated cost of TTS (tts-1-hd): ${char_count * 0.00003:.4f}")

    while True:
        model_choice = input("\nChoose the TTS model (1 for tts-1, 2 for tts-1-hd): ")
        if model_choice in ['1', '2']:
            break
        print("Invalid choice. Please enter 1 or 2.")

    oai_model = "tts-1" if model_choice == '1' else "tts-1-hd"
    
    voice_options = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]
    print("\nAvailable voices:", ", ".join(voice_options))
    print("For more information about the voices, visit:\nhttps://platform.openai.com/docs/guides/text-to-speech/quickstart")

    while True:
        oai_voice = input("\nChoose a voice from the options above: ").lower()
        if oai_voice in voice_options:
            break
        print("Invalid voice. Please choose from the available options.")

    print(f"\nGenerating TTS with model {oai_model} and voice {oai_voice}...")
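    # tts() returns False when the API call fails, so only report success when it returns True
    if tts(oai_model, oai_voice, text_inp):
        print("TTS generation complete. Audio saved as 'speech.wav'.")
    else:
        print("TTS generation failed. See the error message above.")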

if __name__ == "__main__":
    main()