# Building a PDF to audiobook converter in Python + Copilot
We're going to run through this iteratively, making improvements as we go.  To start, we'll write a program that will:
1. Convert a number of pages of a PDF file to a text file
1. Convert that text file into an mp3 file using Microsoft Edge's excellent text-to-speech AI service

Incrementally, we'll make improvements to:
1. Ignore headers and other text which we don't want read to us in the audiobook
1. Convert the entire PDF to an audiobook
1. Properly format the text file so that paragraphs aren't broken by newlines
1. Provide indication of progress being made in the audiobook conversion (as this can take a while)

Let's get started!

## Version 01

In [None]:
# Activate the pdf_audiobook environment using mamba
#! mamba activate pdf_audiobook

# Change to the pdf_audiobook directory
#! cd ~/projects/pdf_audiobook

# Install the modules using mamba and pip
#! mamba install -c conda-forge pypdf2
#! ~/mambaforge/envs/pdf_audiobook/bin/pip install edge-tts

In [None]:
# Extracts text from PDF files and converts it to an audiobook
# Uses PyPDF2 and edge-tts

# Import modules
import PyPDF2
import edge_tts

# Define some constants
PDF_FILE = "Plato - Apology.pdf"
AUDIO_FILE = "Plato - Apology.mp3"
TEXT_FILE = 'extracted_text.txt'

In [None]:
'''
A function to extract text from a PDF file and save it as a text file
Takes the following arguments:
    * PDF file to extract from
    * Start page (defaults to 0)
    * End page (defaults to -1, which means the last page)
'''
def extract_text_from_pdf(pdf_file, start_page=0, end_page=-1):
    # Create a PDF reader object directly from the file path
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)

    # If the end page is -1, set it to the last page
    if end_page == -1:
        end_page = num_pages

    # Create a text file to save the text to; the with-block closes it even on error
    with open(TEXT_FILE, "w", encoding="utf-8") as text_file:
        # Loop through the requested pages
        for page_num in range(start_page, end_page):
            # Get the page object
            page_obj = pdf_reader.pages[page_num]

            # Extract the text from the page
            page_text = page_obj.extract_text()

            # Write the text to the text file
            text_file.write(page_text)

In [None]:
# Call the function to extract text from the PDF file defined above
extract_text_from_pdf(pdf_file=PDF_FILE, start_page=2, end_page=7)

In [None]:
!edge-tts --voice en-AU-NatashaNeural --file '{TEXT_FILE}' --write-media '{AUDIO_FILE}'
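The `!` shell escape only works inside a notebook. As a minimal sketch of the same call from a plain Python script, the block below builds the argument list and hands it to `subprocess`. It assumes the `edge-tts` executable is on the PATH; `build_edge_tts_command` and `convert_to_audio` are hypothetical helper names, and only the flags already used above are passed.

```python
import shlex
import subprocess


def build_edge_tts_command(text_file, audio_file, voice="en-AU-NatashaNeural"):
    """Build the argument list for the edge-tts command-line tool."""
    return [
        "edge-tts",
        "--voice", voice,
        "--file", text_file,
        "--write-media", audio_file,
    ]


def convert_to_audio(text_file, audio_file, voice="en-AU-NatashaNeural"):
    """Run edge-tts (needs edge-tts installed and network access)."""
    command = build_edge_tts_command(text_file, audio_file, voice)
    # check=True raises CalledProcessError if edge-tts exits with an error
    subprocess.run(command, check=True)


command = build_edge_tts_command("extracted_text.txt", "Plato - Apology.mp3")
# Show the command as it would be typed in a shell (file names are quoted as needed)
print(shlex.join(command))
```

Passing a list rather than a single string avoids quoting problems with file names that contain spaces, such as the ones used here.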

-----

## Version 02 -- Refining the pipeline ...
Let's have a go at trying to exclude the header and footer using a "visitor function".

### [Using a visitor](https://pypdf2.readthedocs.io/en/stable/user/extract-text.html#using-a-visitor)
You can use visitor-functions to control which part of a page you want to process and extract. The visitor-functions you provide will get called for each operator or for each text fragment.

The function provided in argument visitor_text of function extract_text has five arguments: text, current transformation matrix, text matrix, font-dictionary and font-size. In most cases the x and y coordinates of the current position are in index 4 and 5 of the current transformation matrix.

The font-dictionary may be None in case of unknown fonts. If not None it may e.g. contain key “/BaseFont” with value “/Arial,Bold”.

Caveat: In complicated documents the calculated positions might be wrong.

The function provided in argument visitor_operand_before has four arguments: operand, operand-arguments, current transformation matrix and text matrix.

#### [Example 1: Ignore header and footer](https://pypdf2.readthedocs.io/en/stable/user/extract-text.html#example-1-ignore-header-and-footer)
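The following example reads the text of page 4 of this PDF document, but ignores the header (y > 720) and the footer (y < 50), keeping only text whose y coordinate lies between the two. PDF coordinates are measured from the bottom of the page, so the header has the larger y values.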

```python
from pypdf import PdfReader

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]

parts = []


def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 50 and y < 720:
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)
```
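
To see how such a visitor behaves without opening a PDF, here is a rough sketch that wraps the y-band check in a small factory and feeds it hand-made matrices. `make_band_visitor` is a hypothetical helper (not part of PyPDF2 or pypdf), and the matrices are invented stand-ins for what `extract_text` would pass in.

```python
def make_band_visitor(y_min, y_max):
    """Return (visitor, parts); the visitor keeps text whose y lies strictly between y_min and y_max."""
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm is the text matrix; index 5 is read as the y coordinate, as in the example above
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor, parts


# Simulated calls standing in for the fragments of one page
visitor, parts = make_band_visitor(y_min=50, y_max=720)
identity = [1, 0, 0, 1, 0, 0]
visitor("Running header", identity, [1, 0, 0, 1, 72, 750], None, 10)  # above the band
visitor("Body text. ", identity, [1, 0, 0, 1, 72, 400], None, 12)     # inside the band
visitor("Page 4", identity, [1, 0, 0, 1, 300, 30], None, 10)          # below the band

body = "".join(parts)
print(body)
```

With a real page, the same `visitor` would be passed as `visitor_text=visitor` and the thresholds tuned per book.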

In [1]:
from PyPDF2 import PdfReader

# Define some constants
PDF_FILE = "Plato - Apology.pdf"
AUDIO_FILE = "Plato - Apology.mp3"
TEXT_FILE = 'extracted_text.txt'

In [25]:
'''
A function to extract text from a PDF file and save it as a text file
Takes the following arguments:
    * PDF file to extract from
    * Start page (defaults to 0)
    * End page (defaults to -1, which means the last page)
'''
def extract_text_from_pdf(pdf_file, start_page=0, end_page=-1):
    # Create a PDF reader object
    pdf_reader = PdfReader(pdf_file)

    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)

    # If the end page is -1, set it to the last page
    if end_page == -1:
        end_page = num_pages

    # A visitor function which ignores the header and footer, keeping only the body text
    def ignore_header_footer(text, cm, tm, fontDict, fontSize):
        y = tm[5]
        # Need to fiddle around with these numbers to adapt to the headers of the specific book
        if 60 < y < 720:
            page_parts.append(text)

    # Create a text file to save the text to; the with-block closes it even on error
    with open(TEXT_FILE, "w", encoding="utf-8") as text_file:
        # Loop through all the pages
        for page_num in range(start_page, end_page):
            # List to hold page text elements collected by the visitor function
            page_parts = []

            # Get the page object
            page_obj = pdf_reader.pages[page_num]

            # Extract the text from the page; the visitor fills page_parts as a side effect
            page_obj.extract_text(visitor_text=ignore_header_footer)

            # Write the text to the text file
            text_file.write("".join(page_parts))

In [27]:
# Call the function to extract text from the PDF file defined above
extract_text_from_pdf(pdf_file=PDF_FILE, start_page=2, end_page=-1)

In [28]:
!edge-tts --voice en-AU-NatashaNeural --file '{TEXT_FILE}' --write-media '{AUDIO_FILE}'

## TODO:
* Determine whether the token being used by edge_tts is able to be used simultaneously, or if you get booted
    * If booted, find the token your browser uses and submit a PR to modify edge_tts to parameterize this
    * If not booted, implement chapters / chunking of books
    * **[CONFIRMED]** Not booted when running in parallel ... so chunk it up!
* Explore reimplementing with the [pdfplumber](https://github.com/jsvine/pdfplumber) package.  Any value?
* Find a better way to ignore headers rather than the manual adjustment of the y-coordinates.  (pdfplumber helps?)
* Explore approaches to properly format the text file so that paragraphs aren't broken by newlines
    * Determine whether a different boundary in the AI TTS service (e.g. word vs sentence) could sort this out
    * Remove all newline characters which aren't preceded by periods within the prior 3 characters (allowing for punctuation / quotes / spaces)
        * Pro:  quick and easy
        * Con:  doesn't distinguish headings
    * Analyze the layout of the PDF document and try to focus only on the text itself
        * Pro:  should allow for distinguishing the headings & paragraphs
        * Con:  likely still results in text that needs newline processing
    * Remove / trim silence (as PocketCasts does) in the mp3 itself
        * Pro:  quickly improves the quality of the listening experience
        * Con:  likely compresses content in a suboptimal way
    * Pass the PDF converter a list of strings or regex to remove in the output
        * Pro:  removes the duplicate / repeated headers, footers, page numbers, reference markers [1], etc ...
        * Con:  doesn't fully address the problem
* Explore using GitHub [Dependabot](https://github.com/apps/dependabot) for vulnerability management
* Explore using [this guy's implementation of the Edge TTS service](https://github.com/alekssamos/msspeech)
    * Are there others that are better?
    * You found this one by:
        * Searching Google for the chrome-extension://jdiccldimpdaibmpdkjnbmckianbfold code that the edge_tts library uses in its headers in the Communicate class
        * Clicking on the [first result, which was this gist](https://gist.github.com/Dobby233Liu/1daafa6ea07f780725250cdf1082bc2e)
        * Reading the chat on the gist and noting that [this user seems to have successfully created his own library](https://gist.github.com/alekssamos)
* Have a look at the raw JavaScript that is the Edge TTS service in the Edge browser directly:
    * view-source:chrome-extension://jdiccldimpdaibmpdkjnbmckianbfold/_generated_background_page.html
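
Two of the text-cleanup ideas in the list above (dropping unwanted strings by regex, and removing newlines that are not preceded by a period within the prior 3 characters) can be sketched with the standard library alone. This is a rough sketch, not a tested solution for any particular book: `strip_patterns` and `unwrap_paragraphs` are hypothetical names, the sample text is made up, and the replaced newlines become spaces so that words do not run together.

```python
import re


def strip_patterns(text, patterns):
    """Remove every match of each regex in `patterns` from the text."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text


def unwrap_paragraphs(text, lookback=3):
    """Replace a newline with a space unless a period occurs in the `lookback` characters before it."""
    out = []
    for i, ch in enumerate(text):
        if ch == "\n" and "." not in text[max(0, i - lookback):i]:
            # Mid-sentence line break: join the two lines
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


raw = (
    'The first sentence runs over\n'
    'two lines [1] and ends here.\n'
    'A second paragraph starts\n'
    'here and ends with a quote."\n'
)

# Drop reference markers such as [1], then rejoin the broken lines
cleaned = unwrap_paragraphs(strip_patterns(raw, [r"\s*\[\d+\]"]))
print(cleaned)
```

As noted in the list, this approach cannot tell headings from body text: a heading without a trailing period is merged into the paragraph that follows it.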

-----

## Feature Roadmap
* Break the book out into separate files which are named for the chapter of the book to enable chapters in the audiobook
    * This will also hopefully prevent the connection issues that we're having on longer conversions
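
Finding chapter boundaries depends on the book, so as a starting point the sketch below swaps in a simpler technique: it splits the extracted text into size-limited chunks of whole lines and reports progress per chunk. `chunk_paragraphs`, the `max_chars` limit, and the `chunk_NNN.txt` naming are all assumptions made for illustration; each chunk could then be written out and converted with its own edge-tts call.

```python
def chunk_paragraphs(text, max_chars=5000):
    """Split text on newlines into chunks of at most max_chars characters.

    A single line longer than max_chars is kept whole in its own chunk.
    """
    chunks = []
    current = ""
    for paragraph in text.split("\n"):
        candidate = paragraph if not current else current + "\n" + paragraph
        if current and len(candidate) > max_chars:
            # Adding this line would overflow the chunk, so start a new one
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\nbbbb\ncccc"
chunks = chunk_paragraphs(sample, max_chars=9)

# Report progress per chunk; a real run would write and convert each chunk here
for index, chunk in enumerate(chunks, start=1):
    chunk_file = f"chunk_{index:03d}.txt"
    print(f"[{index}/{len(chunks)}] {chunk_file}: {len(chunk)} characters")
```

Smaller chunks also keep each conversion short, which may help with the connection issues seen on longer runs.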