# Building a PDF to audiobook converter in Python + Copilot

## Version 01
We're going to run through this iteratively, making improvements as we go.  To start, we'll write a program that will:
1. Convert a number of pages of a PDF file to a text file
1. Convert that text file into an mp3 file using Microsoft Edge's excellent text-to-speech service

Incrementally, we'll make improvements to:
1. Ignore headers and other text which we don't want read to us in the audiobook
1. Convert the entire PDF to an audiobook
1. Properly format the text file so that paragraphs aren't broken by newlines
1. Provide indication of progress being made in the audiobook conversion (as this can take a while)

Let's get started!


In [None]:
# Activate the pdf_audiobook environment using mamba
#! mamba activate pdf_audiobook

# Change to the pdf_audiobook directory
#! cd ~/projects/pdf_audiobook

# Install the modules using mamba and pip
#! mamba install -c conda-forge pypdf2
#! ~/mambaforge/envs/pdf_audiobook/bin/pip install edge-tts
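Before moving on, it can help to confirm that both packages actually landed in the active environment. The following is a minimal sketch, assuming Python 3.8+ and that the distributions are registered as `PyPDF2` and `edge-tts`; the helper name `installed_version` is hypothetical. Knowing the PyPDF2 version is useful because its reader API changed between major releases.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report on the two packages this notebook relies on
for dist_name in ("PyPDF2", "edge-tts"):
    version = installed_version(dist_name)
    if version is None:
        print(f"{dist_name}: not installed in this environment")
    else:
        print(f"{dist_name}: version {version}")
```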

In [5]:
# Extracts text from PDF files and converts it to an audiobook
# Uses PyPDF2 and edge-tts

# Import modules
import PyPDF2
import edge_tts

# Define some constants
PDF_FILE = "Plato - Apology.pdf"
AUDIO_FILE = "Plato - Apology.mp3"
TEXT_FILE = 'extracted_text.txt'

In [6]:
def extract_text_from_pdf(pdf_file, start_page=0, end_page=-1):
    '''
    Extract text from a PDF file and save it to TEXT_FILE.
    Takes the following arguments:
        * PDF file to extract from
        * Start page (zero-based, defaults to 0)
        * End page (exclusive; defaults to -1, which means through the last page)
    '''
    # Open the PDF and the output text file; both are closed automatically
    with open(pdf_file, "rb") as pdf, open(TEXT_FILE, "w", encoding="utf-8") as text_file:
        # Create a PDF reader object (PdfFileReader was removed in PyPDF2 3.0)
        pdf_reader = PyPDF2.PdfReader(pdf)

        # Get the number of pages in the PDF file
        num_pages = len(pdf_reader.pages)

        # If the end page is -1, set it to the last page
        if end_page == -1:
            end_page = num_pages

        # Loop through the requested pages
        for page_num in range(start_page, end_page):
            # Get the page object
            page_obj = pdf_reader.pages[page_num]

            # Extract the text from the page
            page_text = page_obj.extract_text()

            # Write the text to the text file
            text_file.write(page_text)

In [7]:
# Call the function to extract text from the PDF file defined above
extract_text_from_pdf(pdf_file=PDF_FILE, start_page=2, end_page=7)
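The page arguments in the call above are zero-based, and the end page is exclusive because the loop uses `range(start_page, end_page)`. The sketch below shows which pages that selects. The helper `resolve_page_range` and the page count of 30 are hypothetical, used only for illustration.

```python
def resolve_page_range(start_page, end_page, num_pages):
    """Return the zero-based page indices the extraction loop would visit."""
    # -1 is the sentinel for "through the last page"
    if end_page == -1:
        end_page = num_pages
    return list(range(start_page, end_page))


indices = resolve_page_range(2, 7, 30)
print(indices)                   # [2, 3, 4, 5, 6]
print([i + 1 for i in indices])  # [3, 4, 5, 6, 7] when counting pages from 1
```

So `start_page=2, end_page=7` extracts five pages: the third through the seventh page of the file.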

In [8]:
!edge-tts --voice en-AU-NatashaNeural --file '{TEXT_FILE}' --write-media '{AUDIO_FILE}'
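The same conversion can also be done from Python instead of the command line, which may be handy later when we want progress reporting. This is a rough sketch using the `edge_tts.Communicate` class and its `save` coroutine; the function name `text_to_mp3` and the `VOICE` constant are assumptions introduced here, and the call needs network access to reach the speech service.

```python
import asyncio

VOICE = "en-AU-NatashaNeural"  # same voice as the CLI call above


async def text_to_mp3(text_path, audio_path, voice=VOICE):
    """Read a text file and synthesize it to an mp3 file with edge-tts."""
    # Imported lazily so this cell can be defined even if edge-tts is missing
    import edge_tts

    # Read the whole extracted text into memory
    with open(text_path, encoding="utf-8") as f:
        text = f.read()

    # Stream the synthesized audio from the service into the output file
    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(audio_path)


# In a notebook cell:  await text_to_mp3("extracted_text.txt", "Plato - Apology.mp3")
# In a plain script:   asyncio.run(text_to_mp3("extracted_text.txt", "Plato - Apology.mp3"))
```

Jupyter already runs an event loop, so a notebook cell can `await` the coroutine directly, while a standalone script needs `asyncio.run`.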