# Using Google Cloud's TTS module to turn a *long form* PDF into audio

## GCP set-up & module import

In [57]:
import pymupdf
import re
import os
from google.cloud import texttospeech

To use the Google Cloud Platform (GCP) TTS module, we must first set up a GCP project. <br>
You will need the [Google Cloud CLI](https://cloud.google.com/sdk/docs/install) and a valid GCP account with the Text-to-Speech API enabled. <br></br>
For more information, visit the following links:<br>
https://cloud.google.com/text-to-speech/docs/before-you-begin<br>
https://cloud.google.com/text-to-speech?hl=en

After setting up a GCP project, create a service account and a key for it, then expose the key file to your shell, where:
- `gcp-auth.json` is the service account key file, stored in JSON format
- `pdf-to-tts` is the service account name
- `PROJECT_ID` is your project ID, which can be found in the GCP console

In [58]:
# gcloud iam service-accounts create pdf-to-tts
# gcloud iam service-accounts keys create gcp-auth.json --iam-account pdf-to-tts@PROJECT_ID.iam.gserviceaccount.com
# export GOOGLE_APPLICATION_CREDENTIALS='/path/to/your/gcp-auth.json'

In [59]:
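# Point the client library at the key file created above (adjust the path if it lives elsewhere)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "gcp-auth.json"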
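
Before the client is created, it can help to confirm that the variable really points at a readable file. The following is a minimal sketch and not part of the original notebook: the helper name `credentials_path` is hypothetical, and it only checks that the file exists, not that the key inside it is valid.

```python
import os
from pathlib import Path


def credentials_path(env=os.environ):
    """Return the key file path from GOOGLE_APPLICATION_CREDENTIALS, or raise."""
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Credential file not found: {path}")
    return path
```

Calling `credentials_path()` right after setting the variable surfaces a wrong path immediately, rather than later as an authentication error from the API.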

## PDF extraction and sanitation

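Scanned or image-based PDFs need Optical Character Recognition (OCR) before their text can be read programmatically.

We can use pymupdf to open the given file via `pymupdf.open()`, then iterate over its pages and extract the text of each one via `page.get_text()`. Note that `get_text()` reads the text layer already embedded in the page; for a scanned document, this is whatever text an earlier OCR pass stored in the file.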

In [60]:
def scrape_pdf(path):
    corpus = ""
    doc = pymupdf.open(path)
    for page in doc:
        corpus += page.get_text()
    
    print(corpus[0:750]) #truncated for readability 

scrape_pdf("input/tts-test.pdf")

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse non sem sem. Aenean sit amet bibendum sem. Nunc
placerat placerat scelerisque. Integer bibendum ligula odio, consequat consectetur arcu faucibus id. Maecenas bibendum est
augue, in molestie purus porttitor commodo. Orci varius natoque penatibus et magnis dis parturient montes, nascetur
ridiculus mus. Donec gravida lectus massa, a malesuada turpis pulvinar id. Phasellus gravida semper mi, quis semper nunc
porta eu. Vestibulum nec blandit mi, vitae suscipit sapien. Ut auctor consequat libero, quis malesuada diam tincidunt quis.
Pellentesque vulputate ex venenatis lorem rhoncus tempus. In faucibus nisl augue. Ut diam turpis, fermentum sit amet
finibus iaculis, luctus eu velit


As seen in the output, the text has been extracted with no lost or incorrect characters.

However, this is not always the case. Given the nature of PDFs, it is not guaranteed that the extracted text will be perfect, especially for scanned documents.

If we convert such output into audio, every character, including the incorrect ones, will be 'read' by the model, resulting in poor audio quality.

We can demonstrate this by extracting the text of a scanned textbook.<br>
The output has broken characters and inconsistent line breaks.

In [61]:
def scrape_pdf(path):
    corpus = ""
    doc = pymupdf.open(path)
    for page in doc:
        corpus += page.get_text()
    return corpus

corpus = scrape_pdf("input/scanned-tts-test.pdf")
print(corpus[0:1200]) #truncated for readability 

Individuals 
Is Politics Really 
About People? 
Barrie Axford 
Introduction 
In the Introduction to this volume we learned that the concepts and ideas 
which are at the heart of political study are themselves intellectual battle-
grounds. Key terms like 'power' and 'freedom' are linguistic and often moral 
minefields over which the student of politics has to pass with great care, and it 
is easy to confuse or else conflate the normative with the empirical, or to 
transgress seeming rules about the basis of scientific inquiry. So in the study of 
policies 
very little can be taken for granted, and this caveat extends to what we 
study and how we study it. 
In this chapter we will begin our examination of the nature of political 
inquiry and the scope and content of politics by looking at the place of the 
individual in political life. The term 'individual' is in common use, so much 
so that we tend to take its meaning for 'granted. We are all individuals in the 
sense that we are single
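
One rough way to judge how noisy an extraction is, before deciding how much cleaning it needs, is to measure the share of characters that would not normally appear in running prose. The sketch below is an illustration only: the function name `noise_ratio` and the choice of "expected" characters are assumptions, not part of the original notebook.

```python
import re

# Characters treated as "expected" in running prose: word characters, whitespace
# and common punctuation. Anything else is counted as likely extraction noise.
_EXPECTED = re.compile(r"[\w\s.,;:!?'\"()-]")


def noise_ratio(text):
    """Return the fraction of characters that fall outside the expected set."""
    if not text:
        return 0.0
    noisy = sum(1 for ch in text if not _EXPECTED.match(ch))
    return noisy / len(text)
```

For example, `noise_ratio(corpus)` gives a single number per document. Any threshold for "too noisy" would have to be chosen empirically.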

The text output can be sanitised using regular expressions. Whilst this will not fully fix broken characters such as "t~", the audio output will sound far more natural.

- The first expression removes standalone whole numbers, such as page numbers and numbered captions.
- The second expression removes table-of-contents style references: on any line containing Contents, Chapter or Section (case-insensitive), everything from that word to the end of the line is deleted.
- The third expression replaces every character that is not a letter, digit, underscore, whitespace or basic punctuation (`. , ! ? -`) with a space.
- The fourth expression collapses consecutive whitespace characters (e.g. line and paragraph breaks) into a single space, and `strip()` trims both ends, to prevent unnatural and choppy speech.
- The `capitalize()` step upper-cases the first letter of each sentence and lower-cases the rest.
- The last expression is meant to insert a full stop between a lowercase letter and a following uppercase letter, to offset punctuation lost during OCR. Because it runs after `capitalize()`, no such pairs remain, so it has no effect as written.

In [62]:
def clean_tts(text):
    text = re.sub(r'\b\d+\b', '', text)
    text = re.sub(r'\b(CONTENTS|Chapter|Section)\b.*?\n', '', text, flags=re.IGNORECASE)
    text = re.sub(r'[^\w\s.,!?-]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    text = '. '.join(sentence.capitalize() for sentence in text.split('. '))
    text = re.sub(r'([a-z])\s([A-Z])', r'\1. \2', text)
    return text

In [63]:
print(clean_tts(corpus)[0:500])

Individuals is politics really about people? barrie axford introduction in the introduction to this volume we learned that the concepts and ideas which are at the heart of political study are themselves intellectual battle- grounds. Key terms like power and freedom are linguistic and often moral minefields over which the student of politics has to pass with great care, and it is easy to confuse or else conflate the normative with the empirical, or to transgress seeming rules about the basis of s
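
In the output above, the heading lines run into the body text without any full stops, and "battle- grounds" is still split. The first follows from the order of operations in `clean_tts`: `capitalize()` lower-cases the rest of each sentence before the final expression looks for a lowercase-to-uppercase boundary. A minimal sketch of one possible reordering is shown below. It is an assumption about what was intended, not the notebook's method: the name `clean_tts_reordered` is hypothetical, the hyphen-joining step is an addition, and boundaries are only inserted at line breaks, which is a heuristic that can misfire on multi-line headings.

```python
import re


def clean_tts_reordered(text):
    """Hypothetical variant of clean_tts that repairs boundaries before capitalising."""
    # Join words hyphenated across a line break, e.g. "battle-\ngrounds".
    # Assumption: such hyphens are line-wrap artefacts, not real compounds.
    text = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', text)
    # Same number, TOC and punctuation filters as the original function.
    text = re.sub(r'\b\d+\b', '', text)
    text = re.sub(r'\b(CONTENTS|Chapter|Section)\b.*?\n', '', text, flags=re.IGNORECASE)
    text = re.sub(r'[^\w\s.,!?-]', ' ', text)
    # Treat a line break between a lowercase and an uppercase letter as lost punctuation.
    text = re.sub(r'([a-z])[ \t]*\n\s*([A-Z])', r'\1. \2', text)
    # Collapse whitespace only after the line breaks have been used.
    text = re.sub(r'\s+', ' ', text).strip()
    # Capitalise last, so it cannot hide the boundaries detected above.
    return '. '.join(sentence.capitalize() for sentence in text.split('. '))
```

Running `clean_tts_reordered(corpus)` on the scanned sample should place a full stop after short heading lines such as "Introduction", at the cost of occasionally splitting a heading that wraps over two lines.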


## TTS generation

We can use the `texttospeech` module from the `google.cloud` package:
- The `TextToSpeechClient` object is used to interact with the Google Cloud Text-to-Speech API
- The `SynthesisInput` object encapsulates the input text that needs to be synthesized into speech
- The `VoiceSelectionParams` object encapsulates the language code, voice name, and SSML gender
- The `AudioConfig` object specifies the audio encoding format (MP3 in this notebook)
  
To generate the audio, we call the `synthesize_speech()` method on the client object.

In [64]:
def gcp_tts(text, output_file, language_code='en-gb', voice_name='en-GB-Standard-A', ssml_gender='FEMALE'):
    client = texttospeech.TextToSpeechClient()
    input_text = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code=language_code,
        name=voice_name,
        ssml_gender=texttospeech.SsmlVoiceGender[ssml_gender]
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    try:
        response = client.synthesize_speech(
            request={"input": input_text, "voice": voice, "audio_config": audio_config}
        )
        with open(output_file, "wb") as out:
            out.write(response.audio_content)
        print(f'Audio content written to file "{output_file}"')
        return True
    except Exception as e:
        print(f"Error: An unexpected error occurred: {str(e)}")
        return False


The function takes several parameters:
- text: The input text scraped from the PDF that needs to be converted into speech.
- output_file: The file path where the generated audio content will be saved.
  - The default audio output type is MP3
- language_code: The language code for the desired voice. The default value is 'en-gb'. The codes offered by the entry-point menu, with their menu numbers, are:
  - en-US (1)
  - en-GB (2)
  - fr-FR (3)
  - de-DE (4)
  - es-ES (5)
- voice_name: The name of the voice to be used. The default value is 'en-GB-Standard-A'.
  - To view other voices, [call the `voices:list` endpoint](https://cloud.google.com/text-to-speech/docs/list-voices), where `PROJECT_ID` refers to your project ID <br>
  - `curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "x-goog-user-project: PROJECT_ID" \
    -H "Content-Type: application/json; charset=utf-8" \
    "https://texttospeech.googleapis.com/v1/voices"`
- ssml_gender: The gender of the voice. The default value is 'FEMALE'.
  - One of 'MALE', 'FEMALE' or 'NEUTRAL'. The value is looked up by name in `texttospeech.SsmlVoiceGender`, so it must be upper case.
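
For long-form documents, the whole text may not fit into a single `synthesize_speech()` request. The sketch below splits text into chunks at sentence boundaries so that each one can be sent separately. It rests on assumptions: the per-request input limit is taken to be 5,000 bytes (check the current Text-to-Speech quota page before relying on this), the default of 4,500 leaves some headroom, and the helper name `chunk_text` is hypothetical.

```python
import re


def chunk_text(text, max_bytes=4500):
    """Split text into pieces whose UTF-8 encoding stays within max_bytes."""
    chunks, current = [], ""
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r'(?<=[.!?])\s+', text.strip()):
        candidate = f"{current} {sentence}".strip()
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
            continue
        # The sentence does not fit: close the current chunk and start a new one.
        if current:
            chunks.append(current)
        current = ""
        # Add the sentence word by word, so an over-long sentence is split on spaces.
        # A single word longer than max_bytes is still emitted as its own chunk.
        for word in sentence.split():
            candidate = f"{current} {word}".strip()
            if len(candidate.encode("utf-8")) > max_bytes and current:
                chunks.append(current)
                current = word
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `gcp_tts` with its own output file name. How the resulting audio segments are joined into one file is left open here.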

## Entry point

See `gcp_tts_output.mp3` in the repo for the generated audio.

In [65]:
def main():
    text_inp = scrape_pdf("input/tts-test.pdf")
    char_count = len(text_inp)  # len() counts characters, not tokens
    print("Character count:", char_count)
    print("\nGoogle Cloud TTS selected.")
    language_options = {
        '1': ('en-US', 'English (US)'),
        '2': ('en-GB', 'English (UK)'),
        '3': ('fr-FR', 'French'),
        '4': ('de-DE', 'German'),
        '5': ('es-ES', 'Spanish')
    }
    print("\nAvailable languages:")
    for key, (code, name) in language_options.items():
        print(f"{key}: {name}")

    while True:
        lang_choice = input("Choose a language (1-5): ")
        if lang_choice in language_options:
            language_code = language_options[lang_choice][0]
            break
        print("Invalid choice. Please enter a number between 1 and 5.")

    voice_name = f'{language_code}-Standard-A'

    while True:
        gender_choice = input("\nChoose voice gender (M for Male, F for Female): ").upper()
        if gender_choice in ['M', 'F']:
            ssml_gender = 'MALE' if gender_choice == 'M' else 'FEMALE'
            break
        print("Invalid choice. Please enter M or F.")

    output_file = "gcp_tts_output.mp3"

    success = gcp_tts(text_inp, output_file, language_code, voice_name, ssml_gender)
    if success:
        print(f"TTS conversion completed successfully. Output saved to {output_file}")
    else:
        print("TTS conversion failed.")

if __name__ == "__main__":
    main()

Character count: 4007

Google Cloud TTS selected.

Available languages:
1: English (US)
2: English (UK)
3: French
4: German
5: Spanish
Audio content written to file "gcp_tts_output.mp3"
TTS conversion completed successfully. Output saved to gcp_tts_output.mp3
