# Building a PDF to audiobook converter in Python + Copilot
We're going to run through this iteratively, making improvements as we go.  To start, we'll write a program that will:
1. Convert a number of pages of a PDF file to a text file
1. Convert that text file into an mp3 file using Microsoft Edge's excellent text-to-speech AI service

Incrementally, we'll make improvements to:
1. Ignore headers and other text which we don't want read to us in the audiobook
1. Convert the entire PDF to an audiobook
1. Properly format the text file so that paragraphs aren't broken by newlines
1. Provide indication of progress being made in the audiobook conversion (as this can take a while)

Let's get started!

## Version 01

In [None]:
# Activate the pdf_audiobook environment using mamba
#! mamba activate pdf_audiobook

# Change to the pdf_audiobook directory
#! cd ~/projects/pdf_audiobook

# Install the modules using mamba and pip
#! mamba install -c conda-forge pypdf2
#! ~/mambaforge/envs/pdf_audiobook/bin/pip install edge-tts

In [None]:
# Extracts text from PDF files and converts it to an audiobook
# Uses PyPDF2 and edge-tts

# Import modules
import PyPDF2
import edge_tts

# Define some constants
PDF_FILE = "Plato - Apology.pdf"
AUDIO_FILE = "Plato - Apology.mp3"
TEXT_FILE = 'extracted_text.txt'

In [None]:
'''
A function to extract text from a PDF file and save it as a text file
Takes the following arguments:
    * PDF file to extract from
    * Start page (defaults to 0)
    * End page (defaults to -1, which means the last page)
'''
def extract_text_from_pdf(pdf_file, start_page=0, end_page=-1):
    # Open the PDF file and the output text file; both are closed automatically on exit
    with open(pdf_file, "rb") as pdf_handle, open(TEXT_FILE, "w") as text_file:

        # Create a PDF reader object
        # (PdfFileReader, numPages, getPage and extractText are deprecated and were removed in PyPDF2 3.0)
        pdf_reader = PyPDF2.PdfReader(pdf_handle)

        # Get the number of pages in the PDF file
        num_pages = len(pdf_reader.pages)

        # If the end page is -1, set it to the last page
        if end_page == -1:
            end_page = num_pages

        # Loop through the requested pages
        for page_num in range(start_page, end_page):
            # Get the page object
            page_obj = pdf_reader.pages[page_num]

            # Extract the text from the page
            page_text = page_obj.extract_text()

            # Write the text to the text file
            text_file.write(page_text)

In [None]:
# Call the function to extract text from the PDF file defined above
extract_text_from_pdf(pdf_file=PDF_FILE, start_page=2, end_page=7)

In [None]:
!edge-tts --voice en-AU-NatashaNeural --file '{TEXT_FILE}' --write-media '{AUDIO_FILE}'

-----

## Version 02 -- Refining the pipeline ...
Let's have a go at trying to exclude the header and footer using a "visitor function".

### [Using a visitor](https://pypdf2.readthedocs.io/en/stable/user/extract-text.html#using-a-visitor)
You can use visitor-functions to control which part of a page you want to process and extract. The visitor-functions you provide will get called for each operator or for each text fragment.

The function provided in argument visitor_text of function extract_text has five arguments: text, current transformation matrix, text matrix, font-dictionary and font-size. In most cases the x and y coordinates of the current position are in index 4 and 5 of the text matrix.

The font-dictionary may be None in case of unknown fonts. If not None it may e.g. contain key “/BaseFont” with value “/Arial,Bold”.

Caveat: In complicated documents the calculated positions might be wrong.

The function provided in argument visitor_operand_before has four arguments: operand, operand-arguments, current transformation matrix and text matrix.

#### [Example 1: Ignore header and footer](https://pypdf2.readthedocs.io/en/stable/user/extract-text.html#example-1-ignore-header-and-footer)
The following example reads the text of page 4 of this PDF document, but ignores the header (y > 720) and footer (y < 50). Here y is measured in points from the bottom of the page.

```python
from pypdf import PdfReader

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]

parts = []


def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 50 and y < 720:
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)
```

In [None]:
from PyPDF2 import PdfReader

# Define some constants
PDF_FILE = "Plato - Apology.pdf"
AUDIO_FILE = "Plato - Apology.mp3"
TEXT_FILE = 'extracted_text.txt'

In [None]:
'''
A function to extract text from a PDF file and save it as a text file
Takes the following arguments:
    * PDF file to extract from
    * Start page (defaults to 0)
    * End page (defaults to -1, which means the last page)
'''
def extract_text_from_pdf(pdf_file, start_page=0, end_page=-1):
    # Create a PDF reader object
    pdf_reader = PdfReader(pdf_file)

    # Get the number of pages in the PDF file (numPages is deprecated)
    num_pages = len(pdf_reader.pages)

    # If the end page is -1, set it to the last page
    if end_page == -1:
        end_page = num_pages

    # List to hold the text elements collected by the visitor function for the current page
    page_parts = []

    # A visitor function which ignores the header and footer, keeping body text only
    def ignore_header_footer(text, cm, tm, fontDict, fontSize):
        y = tm[5]
        # Need to fiddle around with these numbers to adapt to the headers of the specific book
        if 60 < y < 720:
            page_parts.append(text)

    # Create a text file to save the text to (closed automatically on exit)
    with open(TEXT_FILE, "w") as text_file:

        # Loop through the requested pages
        for page_num in range(start_page, end_page):

            # Start each page with an empty list
            page_parts.clear()

            # Get the page object
            page_obj = pdf_reader.pages[page_num]

            # Extract the text from the page; the visitor fills page_parts
            page_obj.extract_text(visitor_text=ignore_header_footer)

            # Write the text to the text file
            text_file.write(''.join(page_parts))

In [None]:
# Call the function to extract text from the PDF file defined above
extract_text_from_pdf(pdf_file=PDF_FILE, start_page=2, end_page=-1)

In [None]:
!edge-tts --voice en-AU-NatashaNeural --file '{TEXT_FILE}' --write-media '{AUDIO_FILE}'

## Version 03 -- Reimplementing in pdfplumber
It turns out that the [pdfplumber](https://github.com/jsvine/pdfplumber) library is far more feature-rich and widely used (as measured by GitHub stars) than PyPDF2.  Let's reimplement in pdfplumber, and also do a bit of refactoring to tidy things up while we're at it.

In [None]:
# !mamba search pdfplumber
# Install pdfplumber in the project environment
# !~/mambaforge/envs/pdf_audiobook/bin/pip install pdfplumber

In [None]:
import pdfplumber
print(pdfplumber.__version__)

# define some constants
PDF_FILE = "Plato - Apology.pdf"
AUDIO_FILE = "Plato - Apology.mp3"
TEXT_FILE = 'extracted_text.txt'

In [None]:
# load the PDF file and examine a page
pdf = pdfplumber.open(PDF_FILE)

# get the first page
page = pdf.pages[0]

# examine an image of the first page
im = page.to_image()

Annoying ... it turns out that this is supposed to be solved by editing the ImageMagick policy as per [this StackOverflow question](https://stackoverflow.com/questions/52998331/imagemagick-security-policy-pdf-blocking-conversion).  From the same, here's a one-line solution:

```bash
sed -i '/disable ghostscript format types/,+6d' /etc/ImageMagick-6/policy.xml
```

In [None]:
# Add to bottom of policy
# !sudo sed -i.backup '/<\/policymap>/i \ \ <policy domain="coder" rights="read|write" pattern="PDF" />' /etc/ImageMagick-6/policy.xml
# !sudo sed -i.backup '/<\/policymap>/i \ \ <policy domain="coder" rights="read|write" pattern="LABEL" />' /etc/ImageMagick-6/policy.xml

# Add to top of policy
# !sudo sed -i.backup '/<policymap>/a \ \ <policy domain="coder" rights="read | write" pattern="PDF" />' /etc/ImageMagick-6/policy.xml

In [None]:
# load the PDF file and examine a page
pdf = pdfplumber.open(PDF_FILE)

# get the first page
page = pdf.pages[0]

# examine an image of the first page
im = page.to_image()

This is bloody annoying, and needs more work.  We'll move on for now without visual debugging.

In [None]:
# load the PDF file and examine a page
pdf = pdfplumber.open(PDF_FILE)

# get the third page
page = pdf.pages[2]

# extract the text from the page
text = page.extract_text()
print(text)

In [None]:
# Adjust the tolerance so that we get paragraphs of text
text = page.extract_text(y_tolerance=10)
print(text)

In [None]:
type(text)

Let's capture only paragraphs of text (i.e. exclude the header and footer) by cropping to a bounding box.  Bounding boxes are described using 4 values:
* x0: the distance of the left side of the box from the left side of the page
* top: the distance of the top of the box from the top of the page
* x1: the distance of the right side of the box from the **left** side of the page
* bottom: the distance of the bottom of the box from the **top** of the page

NB all distances are measured from the **left** side and **top** of the page.

These are defined in the [documentation here](https://github.com/jsvine/pdfplumber#char-properties).  We've measured these distances in the document using Acrobat's measurement tool.  This will need to be adjusted for each document.
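
To make the coordinate convention concrete, here's a minimal sketch that checks whether a word falls inside a bounding box. It assumes word dictionaries shaped like those returned by pdfplumber's `extract_words()` (with `x0`, `top`, `x1` and `bottom` keys); the helper name `inside_bbox`, the sample words and the sample box are made up for illustration.

```python
def inside_bbox(word, bbox):
    """Return True if the word lies entirely within bbox = (x0, top, x1, bottom).

    All distances are measured from the left and top edges of the page.
    """
    x0, top, x1, bottom = bbox
    return (
        word["x0"] >= x0
        and word["x1"] <= x1
        and word["top"] >= top
        and word["bottom"] <= bottom
    )


# Hypothetical words: one in a page header, one in the body of the page
header_word = {"text": "Apology", "x0": 280, "x1": 330, "top": 40, "bottom": 52}
body_word = {"text": "Socrates", "x0": 72, "x1": 120, "top": 100, "bottom": 112}

demo_bbox = (70, 75, 540, 717)
print(inside_bbox(header_word, demo_bbox))  # False: top=40 sits above the box
print(inside_bbox(body_word, demo_bbox))    # True
```

Note that a larger `top` or `bottom` value means further *down* the page, which is the opposite of the bottom-up y-axis used by the PyPDF2 visitor example earlier.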

In [None]:
# Measured these distances in our demo file
x0 = 70
top = 75
x1 = 540
bottom = 717

# create tuple for parameterizing crop()
bounding_box = (x0, top, x1, bottom)

In [None]:
page = pdf.pages[2]
page = page.crop(bbox=bounding_box)
text = page.extract_text()
print(text)

In [None]:
with open(file=TEXT_FILE, mode='w') as text_file:
    text_file.write(text)

That works!  We still have the annoying issue of the newline characters being inserted at the end of every line.
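
One way to tackle the stray newlines is to join wrapped lines back into paragraphs after extraction. The sketch below is a heuristic, not a pdfplumber feature: it assumes blank lines separate paragraphs and that a line ending in a hyphen continues a word on the next line (which will wrongly merge genuinely hyphenated compounds). The function name `unwrap_lines` and the sample text are hypothetical.

```python
def unwrap_lines(text):
    """Join hard-wrapped lines into one line per paragraph (heuristic)."""
    paragraphs = []
    current = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current paragraph
            if current:
                paragraphs.append(current)
                current = ""
        elif current.endswith("-"):
            # Assume a trailing hyphen marks a word split across two lines
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)


sample = "The quick brown fox jumps over the la-\nzy dog and keeps\nrunning.\n\nA second paragraph."
print(unwrap_lines(sample))
```

Whether extracted text actually contains blank lines between paragraphs depends on the PDF and the extraction settings, so this may need adapting per document.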

In [None]:
text = page.extract_text(layout=True)
print(text)

In [None]:
text = page.extract_words(split_at_punctuation=True)
print(text)

In [None]:
len(text)

In [None]:
text[0]

In [None]:
text[1]

In [None]:
text = page.extract_words()
for i in range(5):
    print(text[i])

In [None]:
text = page.extract_text(y_tolerance=14)
print(text[0:500])

In [None]:
text = page.extract_text()
print(text[0:500])
#print(text)

In [None]:
page.extract_text()

Let's muck around with pdfminer.six directly and explore the hierarchy of each page ...

In [None]:
from pdfminer.high_level import extract_pages

In [None]:
pages = extract_pages(PDF_FILE)
pages

In [None]:
for page in pages:
    for element in page:
        print(element)

This has proved frustrating.  Let's go another direction.  We'll have a look at the [PyMuPDF package](https://github.com/pymupdf/PyMuPDF).  

[Reading the docs here](https://pymupdf.readthedocs.io/en/latest/tutorial.html#extracting-text-and-images).

In [None]:
# !~/mambaforge/envs/pdf_audiobook/bin/pip install pymupdf

In [None]:
# import and create some constants
import fitz

PDF_FILE = "Plato - Apology.pdf"
AUDIO_FILE = "Plato - Apology.mp3"
TEXT_FILE = 'extracted_text.txt'

In [None]:
# open the PDF
doc = fitz.open(PDF_FILE)

In [None]:
page = doc.load_page(3)
text = page.get_text()
text[0:2000]

In [None]:
text

In [None]:
text = page.get_text('blocks')

In [None]:
text

### PyMuPDF Approach

Remove newline characters so that text reads normally
* Extract blocks of text
* Filter out header and footer
* With each block, remove newline characters
* Output each block as a continuous string without embedded newlines
* Follow each block with a newline character

Identify chapters and extract them individually


PyMuPDF will [extractBLOCKS()](https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractBLOCKS) for us, which contain useful information for processing text.

extractBLOCKS() -- which can be called via the convenience method ```page.get_text("blocks")``` -- returns a list of tuples, each laid out as follows:
```
(x0, y0, x1, y1, block text, block number, block type)
```

Where block number is the block sequence number on the page, and block type = 0 for text, and 1 for image.
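
With that tuple layout, the approach listed earlier (filter out header and footer, strip newlines, one line per block) can be sketched on plain tuples before touching a real PDF. The function name `clean_blocks`, the thresholds and the sample blocks are illustrative only, and the rule used here (drop a block only if it sits entirely inside the header or footer band) is one possible choice.

```python
def clean_blocks(blocks, header_y, footer_y):
    """Return body text, one line per block, skipping header and footer blocks.

    Each block is (x0, y0, x1, y1, text, block_no, block_type), with y measured
    from the top of the page.
    """
    lines = []
    for x0, y0, x1, y1, text, block_no, block_type in blocks:
        if block_type != 0:
            continue  # skip image blocks
        if y1 < header_y or y0 > footer_y:
            continue  # block sits entirely in the header or footer band
        # Collapse the newlines (and any repeated spaces) inside the block
        lines.append(" ".join(text.split()))
    # Follow each block with a newline character
    return "".join(line + "\n" for line in lines)


# Made-up blocks for a 792-point-tall page with 54-point header/footer bands
sample_blocks = [
    (72, 30, 300, 42, "RUNNING HEADER\n", 0, 0),
    (72, 80, 540, 120, "First line of a paragraph\nsecond line of it.\n", 1, 0),
    (72, 130, 300, 300, "<image>", 2, 1),
    (72, 760, 100, 772, "17\n", 3, 0),
]

print(clean_blocks(sample_blocks, header_y=54, footer_y=738))
```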

1 inch = 72 points in PDF terms.  Typical margins are:
* Top & bottom:  0.75 inches
* Right & left:  1.25 inches

Let's define some constants to help us work in these dimensions.  Recall, also, that PDFs are measured from the left and top sides of the page.  First, we'll get the dimensions of the page itself.

In [None]:
page.bound()

In [None]:
HEADER_INCHES = 0.75
LEFT_MARGIN_INCHES = 1.25

PAGE_LENGTH = page.bound()[3]
PAGE_WIDTH = page.bound()[2]

# Define the header and footer boundaries (in points, measured from the top of the page)
HEADER_FROM_TOP = round(HEADER_INCHES * 72)  # this can be used to filter out the header
FOOTER_FROM_TOP = PAGE_LENGTH - HEADER_FROM_TOP # this can be used to filter out the footer

In [None]:
'''
A function which iterates over the pages, blocks, lines and spans of a PDF file
The function returns a dictionary containing:
    * Each font size in the PDF
    * The first sentence of each font
    * The count of words in each font

Parameters
----------
pdf_file : str
    The path to the PDF file

Returns
-------
dict
    A dictionary containing the font sizes, first sentences and word counts

'''

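def get_font_info(pdf_file):
    doc = fitz.open(pdf_file)
    font_info = {}

    for page in doc:
        # 'dict' output exposes blocks -> lines -> spans; the 'blocks' tuples carry no font data
        for block in page.get_text('dict')['blocks']:
            # Image blocks have no 'lines' key, so fall back to an empty list
            for line in block.get('lines', []):
                for span in line['spans']:
                    font_size = span['size']
                    word_count = len(span['text'].split())
                    if font_size not in font_info:
                        font_info[font_size] = {'first_sentence': span['text'].split('.')[0], 'word_count': word_count}
                    else:
                        font_info[font_size]['word_count'] += word_count

    return font_info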

In [None]:
# A dictionary containing the font sizes, first sentences and word counts
font_info = get_font_info(PDF_FILE)

In [None]:
'''
A function which analyzes the size and counts of each font in a PDF file and returns a dictionary of the results.
The function also captures an example of the first line in each font and stores it in the dictionary.

Parameters
----------
pdf_file : str
    The path to the PDF file to analyze.

Returns
-------
font_dict : dict
    A dictionary of the font sizes and counts in the PDF file.

'''

def analyze_fonts(pdf_file):
    
    # Open the PDF file
    doc = fitz.open(pdf_file)
    
    # Create a dictionary to hold the results
    font_dict = {}
    
    # Loop through all the pages in the PDF file
    for page in doc:
        
        # Loop through all the blocks on the page ('dict' output exposes lines and spans)
        for block in page.get_text('dict')['blocks']:
            
            # Loop through all the text lines in the block (image blocks have no 'lines')
            for line in block.get('lines', []):
                
                for span in line['spans']:
                
                    # Get the font size
                    font_size = span['size']
                    
                    # Get the font name
                    font_name = span['font']
                    
                    # Get the text of the span as an example of text in this font
                    first_line = span['text']
                    
                    # If the font size is already in the dictionary, increment the count
                    if font_size in font_dict:
                        font_dict[font_size]['count'] += 1
                        
                        # If the font name is not already in the list of font names, add it
                        if font_name not in font_dict[font_size]['font_names']:
                            font_dict[font_size]['font_names'].append(font_name)
                            
                        # If the first line is not already in the list of first lines, add it
                        if first_line not in font_dict[font_size]['first_lines']:
                            font_dict[font_size]['first_lines'].append(first_line)
                        
                    # If the font size is not already in the dictionary
                    else:
                        font_dict[font_size] = {}
                        font_dict[font_size]['count'] = 1
                        font_dict[font_size]['font_names'] = [font_name]
                        font_dict[font_size]['first_lines'] = [first_line]

    return font_dict

In [None]:
# Create a dictionary of the fonts in the PDF file
font_dict = analyze_fonts(PDF_FILE)

### Thinking out loud
What's an algorithm for grouping all of the characters by text size and then identifying which are useful for the audiobook?
* What types of characters do you expect?
    * Paragraphs
    * Chapter headers
    * Header
    * Footer
    * Annotations / references
* Analyze the entire document, grouping text by size
* Produce an image of each page which contains the first example of a size of text, with that text highlighted / redboxed
* Ask the user to label each of the groups of text size, using labels you predefine for audiobook layout (e.g. chapter heading, paragraph, etc ...)
* Generate an audiobook in accordance with the predefined layout

Start simply!
* Allow for chapter headers and paragraphs, only.
* Don't prompt the user -- make assumptions.
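
As a first cut at "make assumptions", the sketch below labels font sizes without prompting the user. It assumes a dictionary mapping font size to a word count (the shape is hypothetical, as are the sample numbers), treats the size with the most words as paragraph text, anything larger as a chapter header, and ignores anything smaller.

```python
def label_font_sizes(word_counts):
    """Map each font size to 'paragraph', 'chapter_header' or 'ignore'."""
    if not word_counts:
        return {}
    # Assume the body text is whichever size accounts for the most words
    body_size = max(word_counts, key=word_counts.get)
    labels = {}
    for size in word_counts:
        if size == body_size:
            labels[size] = "paragraph"
        elif size > body_size:
            labels[size] = "chapter_header"
        else:
            labels[size] = "ignore"  # e.g. footnotes or page numbers
    return labels


# Made-up counts: font size in points -> number of words set in that size
sample_counts = {12.0: 9500, 18.0: 40, 9.0: 300}
print(label_font_sizes(sample_counts))
```

This will mislabel documents where headers are set in the body size (bold, say) or where footnotes outnumber body text, so treat it as a starting point.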

## Version 04 -- Keep it Simple & Use PyMuPDF
Keep it simple!  This approach:
* Extract text blocks using PyMuPDF
* Filter out blocks that are in headers / footers
* Remove newlines within each block
* Push block text to file and add newline after it
* Every N pages, create a new file (rather than fiddle with chapters)
* Create mp3 of text file

In [None]:
import fitz
from pathlib import Path

PROJECT_PATH = Path('~/projects/pdf_audiobook/')
PROJECT_PATH = PROJECT_PATH.expanduser()

PDF_PATH = PROJECT_PATH / 'PDFs'
TEXT_OUTPUT_PATH = PROJECT_PATH / 'output/txt'
AUDIO_OUTPUT_PATH = PROJECT_PATH / 'output/mp3'

PDF_FILE = PDF_PATH / "Plato - Apology.pdf"
AUDIO_FILE = AUDIO_OUTPUT_PATH / "Plato - Apology.mp3"
TEXT_FILE = TEXT_OUTPUT_PATH / 'extracted_text.txt'

AUDIO_FILE_PREFIX = 'Plato - Apology'
TEXT_FILE_PREFIX = 'Plato - Apology'

pdf = fitz.open(PDF_FILE)

PAGE_DIMENSIONS = pdf.load_page(0).bound()

HEADER_INCHES = 0.75
LEFT_MARGIN_INCHES = 1.25

PAGE_LENGTH = PAGE_DIMENSIONS[3]
PAGE_WIDTH = PAGE_DIMENSIONS[2]

# Define the header, footer and margin boundaries (in points, measured from the top and left of the page)
HEADER_FROM_TOP = round(HEADER_INCHES * 72)  # this can be used to filter out the header
FOOTER_FROM_TOP = PAGE_LENGTH - HEADER_FROM_TOP # this can be used to filter out the footer
LEFT_MARGIN_FROM_LEFT = LEFT_MARGIN_INCHES * 72
RIGHT_MARGIN_FROM_LEFT = PAGE_WIDTH - LEFT_MARGIN_FROM_LEFT

PyMuPDF will [extractBLOCKS()](https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractBLOCKS) for us, which contain useful information for processing text.

extractBLOCKS() -- which can be called via the convenience method ```page.get_text("blocks")``` -- returns a list of tuples where each tuple contains:
```
(x0, y0, x1, y1, block text, block number, block type)
```

... where block number is the block sequence number on the page, and block type = 0 for text, and 1 for image.

1. Extract text blocks using PyMuPDF
1. Filter out blocks that are in headers / footers
1. Remove newlines within each block
1. Push block text to file and add newline after it
1. Every N pages, create a new file (rather than fiddle with chapters)
1. Create mp3 of text file
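
The "every N pages, create a new file" step can be worked out ahead of the main loop. This is a sketch only: `plan_files` is a hypothetical helper that splits a 0-indexed, end-exclusive page range into chunks and pairs each chunk with a zero-padded file label.

```python
def plan_files(page_start, page_stop, pages_per_file):
    """Yield (label, first_page, stop_page) for each output file.

    Pages are 0-indexed and stop_page is exclusive, like range().
    """
    for label, first in enumerate(range(page_start, page_stop, pages_per_file)):
        yield f"{label:02}", first, min(first + pages_per_file, page_stop)


# Example: pages 2..26 of a hypothetical 27-page PDF, 10 pages per file
for label, first, stop in plan_files(page_start=2, page_stop=27, pages_per_file=10):
    print(f"file _{label}: pages {first} to {stop - 1}")
```

Planning the chunks up front keeps the page counting in one place, which makes the 0-index versus 1-index question easier to reason about.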

In [None]:
import os

PAGE_START = 2
#PAGE_STOP = 6
PAGE_STOP = pdf.page_count

PAGES_PER_FILE = 10 #TODO: reconcile this 1-index with the 0-index of the page count

page_count = 0
file_label = 0

# open the text file with zero-padded file count in the filename
text_file = open(file=TEXT_OUTPUT_PATH / (TEXT_FILE_PREFIX + f'_{file_label:02}' + '.txt'), mode='w')

# iterate through each page in the document
for page in pdf.pages(start=PAGE_START, stop=PAGE_STOP):

    # if the page count is a multiple of PAGES_PER_FILE, increment the file counter and open a new text file
    if (page_count % PAGES_PER_FILE == 0 and page_count > 0):
        # create an mp3 of the text file using edge-tts
        #!edge-tts --voice en-AU-NatashaNeural --file '{TEXT_FILE}' --write-media '{AUDIO_FILE}'
        #!edge-tts --voice en-AU-NatashaNeural --file '{text_file.name}' --write-media '{AUDIO_FILE_PREFIX}_{file_label:02}.mp3'
        
        print(f'Closing {text_file.name}')
        text_file.close()

        print(f'Writing {AUDIO_OUTPUT_PATH}/{AUDIO_FILE_PREFIX}_{file_label:02}.mp3')
        os.system(f"edge-tts --voice en-AU-NatashaNeural --file '{text_file.name}' --write-media '{AUDIO_OUTPUT_PATH}/{AUDIO_FILE_PREFIX}_{file_label:02}.mp3'")
        #os.system(f'edge-tts voice en-AU-NatashaNeural --file '{text_file.name}' --write-media '{text_file.name[:-4]}.mp3'")

        # increment the file counter
        file_label = file_label + 1
        
        # open a new text file with zero-padded file count in the filename
        print(f'Opening {TEXT_OUTPUT_PATH}/{TEXT_FILE_PREFIX}_{file_label:02}.txt')
        text_file = open(file=TEXT_OUTPUT_PATH / (TEXT_FILE_PREFIX + f'_{file_label:02}' + '.txt'), mode='w')

        
    # iterate through each block of text on the page
    for block in page.get_text('blocks'):
        print(f'Processing block #{block[5]} of page #{page.number}')
        # get the y coordinate of the bottom right corner of the block
        block_end_y = block[3]

        # if the block is not in the header or footer
        if not (block_end_y < HEADER_FROM_TOP or block_end_y > FOOTER_FROM_TOP):

            # remove the newline characters from the block text
            block_text = block[4].replace('\n', ' ')
            # write the block text to the text file and append a newline character
            print(f'Writing block #{block[5]} of page #{page.number} to {text_file.name}')
            text_file.write(block_text + '\n')
    
    # increment page counter
    #import pdb; pdb.set_trace()
    print(f'Incrementing page counter from {page_count} to {page_count + 1}')
    page_count = page_count + 1

    # if we've processed all of the pages in the range, close the text file and create an mp3 of the text file using edge-tts
    if page_count == PAGE_STOP - PAGE_START:
        print(f'Closing {text_file.name}')
        text_file.close()

        print(f'Writing {AUDIO_OUTPUT_PATH}/{AUDIO_FILE_PREFIX}_{file_label:02}.mp3')
        os.system(f"edge-tts --voice en-AU-NatashaNeural --file '{text_file.name}' --write-media '{AUDIO_OUTPUT_PATH}/{AUDIO_FILE_PREFIX}_{file_label:02}.mp3'")


### TODO:
* Separate the 2 loops:  text and audio file creation (this will also make it clear where there are errors in creation)
* Print some timestamps in audio file creation
* Handle audio file creation directly within Python using edge-tts so you can handle errors properly
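
As a step toward the first two items, here's a sketch that keeps audio creation in its own function and prints timestamps. It still shells out to the `edge-tts` command used above, but via `subprocess.run` rather than `os.system`, so a non-zero exit raises an exception that can be caught. The function names are hypothetical, and it assumes `edge-tts` is on the PATH.

```python
import subprocess
from datetime import datetime


def build_tts_command(text_path, audio_path, voice="en-AU-NatashaNeural"):
    """Build the edge-tts argument list (a list needs no shell quoting)."""
    return ["edge-tts", "--voice", voice, "--file", str(text_path), "--write-media", str(audio_path)]


def synthesize(text_path, audio_path):
    """Run edge-tts on one text file; return True on success, False on failure."""
    print(f"{datetime.now():%H:%M:%S} Writing {audio_path}")
    try:
        subprocess.run(build_tts_command(text_path, audio_path), check=True)
    except (subprocess.CalledProcessError, FileNotFoundError) as error:
        # CalledProcessError: edge-tts exited with an error
        # FileNotFoundError: the edge-tts executable could not be found
        print(f"{datetime.now():%H:%M:%S} Failed on {text_path}: {error}")
        return False
    print(f"{datetime.now():%H:%M:%S} Finished {audio_path}")
    return True
```

Looping over the text files once they have all been written, and calling `synthesize()` on each, would separate the two loops and show which file (if any) failed.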