#**Splitting and Merging PDFs with Python**

[Splitting and Merging PDFs with Python](https://www.blog.pythonlibrary.org/2018/04/11/splitting-and-merging-pdfs-with-python/)

In [None]:
# Install the PyPDF2 library (the cells below use the PyPDF2 1.26.0 API: PdfFileReader, PdfFileWriter, PdfFileMerger)
!pip install pypdf2 

Collecting pypdf2
  Downloading PyPDF2-1.26.0.tar.gz (77 kB)
Building wheels for collected packages: pypdf2
  Building wheel for pypdf2 (setup.py) ... done
  Created wheel for pypdf2: filename=PyPDF2-1.26.0-py3-none-any.whl size=61101 sha256=4cd0d2ca3b38ee5343ca67ffcd07cafdf1a10872e346b6c9d604ae1bf070fc2d
  Stored in directory: /root/.cache/pip/wheels/80/1a/24/648467ade3a77ed20f35cfd2badd32134e96dd25ca811e64b3
Successfu

In [None]:
# pdf_splitter.py
import os

from PyPDF2 import PdfFileReader, PdfFileWriter


def pdf_splitter(path):
    # Base name without the extension, used to build the output file names
    fname = os.path.splitext(os.path.basename(path))[0]
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        # One writer per page, so each page lands in its own file
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))
        # Page indices are zero-based; file names are numbered from 1
        output_filename = '{}_page_{}.pdf'.format(fname, page + 1)
        with open(output_filename, 'wb') as out:
            pdf_writer.write(out)
        print('Created: {}'.format(output_filename))


if __name__ == '__main__':
    path = 'fw9.pdf'
    pdf_splitter(path)

Created: fw9_page_1.pdf
Created: fw9_page_2.pdf
Created: fw9_page_3.pdf
Created: fw9_page_4.pdf
Created: fw9_page_5.pdf
Created: fw9_page_6.pdf


In [None]:
# pdf_merger.py

import glob
from PyPDF2 import PdfFileWriter, PdfFileReader

def merger(output_path, input_paths):
    pdf_writer = PdfFileWriter()

    for path in input_paths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))

    with open(output_path, 'wb') as fh:
        pdf_writer.write(fh)


if __name__ == '__main__':
    paths = glob.glob('fw9_*.pdf')
    paths.sort()
    merger('pdf_merger.pdf', paths)
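
One thing to watch in the merger above: `paths.sort()` compares file names as plain strings, so once a document has ten or more pages, `fw9_page_10.pdf` sorts before `fw9_page_2.pdf` and the merged pages come out in the wrong order. The sketch below shows one possible fix, a numeric-aware sort key. The helper name `natural_key` and the sample list are made up for illustration and are not part of PyPDF2 or the original article.

```python
import re


def natural_key(path):
    """Split a name into text and integer chunks so 'page_10' sorts after 'page_2'."""
    # re.split with a capture group alternates text, digits, text, ...
    # so the same list positions always hold the same type.
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', path)]


# Hypothetical file names, as the splitter would produce for a longer PDF
paths = ['fw9_page_10.pdf', 'fw9_page_2.pdf', 'fw9_page_1.pdf']

print(sorted(paths))                   # plain string sort: page 10 lands before page 2
print(sorted(paths, key=natural_key))  # numeric-aware sort: 1, 2, 10
```

In the merger script this would mean replacing `paths.sort()` with `paths.sort(key=natural_key)`.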

In [None]:
# Merge specific pages: signature and docstring of the PdfFileMerger.merge method
def merge(self, position, fileobj, bookmark=None, pages=None, import_bookmarks=True):
        """
        Merges the pages from the given file into the output file at the
        specified page number.

        :param int position: The *page number* to insert this file. File will
            be inserted after the given number.

        :param fileobj: A File Object or an object that supports the standard read
            and seek methods similar to a File Object. Could also be a
            string representing a path to a PDF file.

        :param str bookmark: Optionally, you may specify a bookmark to be applied at
            the beginning of the included file by supplying the text of the bookmark.

        :param pages: can be a :ref:`Page Range <page-range>` or a ``(start, stop[, step])`` tuple
            to merge only the specified range of pages from the source
            document into the output document.

        :param bool import_bookmarks: You may prevent the source document's bookmarks
            from being imported by specifying this as ``False``.
        """
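
To make the `pages` parameter more concrete, here is a minimal sketch. It assumes that a `(start, stop[, step])` tuple is expanded the same way as Python's `range()`, that is zero-based with `stop` excluded, which is my reading of PyPDF2 1.26 and worth checking against the version you have installed. Both helper names and all their arguments are hypothetical.

```python
def selected_page_indices(pages, num_pages):
    """Return the zero-based page indices a ``pages`` tuple would select."""
    if pages is None:
        pages = (0, num_pages)  # no range given: take every page
    return list(range(*pages))


def merge_page_range(output_path, base_path, extra_path, position, pages):
    """Append all of base_path, then insert the chosen pages of extra_path."""
    # Imported here so the helper above can be tried without PyPDF2 installed.
    from PyPDF2 import PdfFileMerger

    merger = PdfFileMerger()
    merger.append(base_path)                         # every page of the base file
    merger.merge(position, extra_path, pages=pages)  # only the selected pages
    with open(output_path, 'wb') as fh:
        merger.write(fh)


print(selected_page_indices((0, 3), 6))     # first three pages
print(selected_page_indices((1, 6, 2), 6))  # every second page, starting at index 1
```

A call such as `merge_page_range('out.pdf', 'a.pdf', 'fw9.pdf', 1, (0, 3))` would then insert the first three pages of `fw9.pdf` after the first page of `a.pdf`; the file names here are placeholders.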

#**Extracting Metadata from PDF file**

You can use PyPDF2 to extract a fair amount of useful data from any PDF. For example, you can learn the author of the document, its title and subject, and how many pages it has. Let’s find out how. The source article uses the book sample from Leanpub at https://leanpub.com/reportlab, called “reportlab-sample.pdf”; the cell below runs on `fw9.pdf`, the same file used in the earlier cells, and any PDF will work.

Here’s the code:

In [None]:
# get_doc_info.py

from PyPDF2 import PdfFileReader


def get_info(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        info = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
    
    print(info)

    author = info.author
    creator = info.creator
    producer = info.producer
    subject = info.subject
    title = info.title

if __name__ == '__main__':
    path = 'fw9.pdf'
    get_info(path)

{'/Author': 'SE:W:CAR:MP', '/CreationDate': "D:20181024094543-05'00'", '/Creator': 'Adobe LiveCycle Designer ES 9.0', '/Keywords': 'Fillable', '/ModDate': "D:20181024094543-05'00'", '/Producer': 'Adobe LiveCycle Designer ES 9.0', '/SPDF': '1112', '/Subject': 'Request for Taxpayer Identification Number and Certification', '/Title': 'Form W-9 (Rev. October 2018)'}


Here we import the PdfFileReader class from PyPDF2. This class lets us read a PDF and extract data from it through various accessor methods. We first define our own get_info function, which accepts a PDF file path as its only argument. Inside it we open the file in read-only binary mode, then pass that file handle to PdfFileReader to create a reader instance.

Now we can extract some information from the PDF by using the getDocumentInfo method. This will return an instance of PyPDF2.pdf.DocumentInformation, which has the following useful attributes, among others:

    author
    creator
    producer
    subject
    title

If you print the DocumentInformation object, you get the dictionary-style output shown above.

We can also get the number of pages in the PDF by calling the `getNumPages` method. 
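
The `/CreationDate` and `/ModDate` values in the output above are plain strings in the PDF date format. As a rough sketch, they can be turned into `datetime` objects with the standard library alone. The function name `parse_pdf_date` is my own, and the pattern only handles the full-length form seen in this output (`D:YYYYMMDDHHmmSS` plus an optional UTC offset), not the shorter variants a PDF may also contain.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional offset such as -05'00'
PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+-])(\d{2})'?(\d{2})?'?)?"
)


def parse_pdf_date(value):
    """Convert a full-length PDF date string into a datetime."""
    match = PDF_DATE.match(value)
    if match is None:
        raise ValueError('Unrecognised PDF date: {!r}'.format(value))
    year, month, day, hour, minute, second = map(int, match.groups()[:6])
    sign, tz_hours, tz_minutes = match.groups()[6:]
    tzinfo = None  # stays naive when the string carries no offset
    if sign:
        offset = timedelta(hours=int(tz_hours), minutes=int(tz_minutes or 0))
        tzinfo = timezone(-offset if sign == '-' else offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tzinfo)


# The /CreationDate value printed for fw9.pdf above
created = parse_pdf_date("D:20181024094543-05'00'")
print(created.isoformat())  # 2018-10-24T09:45:43-05:00
```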


#**Extracting Text from PDFs**

PyPDF2 has limited support for extracting text from PDFs. It doesn’t have built-in support for extracting images, unfortunately. I have seen some recipes on StackOverflow that use PyPDF2 to extract images, but the code examples seem to be pretty hit or miss.

Let’s try to extract the text from a page of the PDF that we used in the previous section:

In [None]:
# extracting_text.py

from PyPDF2 import PdfFileReader


def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)

        # get the second page (getPage uses zero-based indices)
        page = pdf.getPage(1)
        print(page)
        print('Page type: {}'.format(str(type(page))))

        text = page.extractText()
        print(text)


if __name__ == '__main__':
    path = 'fw9.pdf'
    text_extractor(path)

You will note that this code starts out in much the same way as our previous example. We still need to create an instance of PdfFileReader, but this time we grab a page using the getPage method. PyPDF2 page numbers are zero-based, much like most things in Python, so when you pass it a one, it actually grabs the second page. In the sample PDF this example was written for, the first page is just an image, so it wouldn’t have any text.

Interestingly, if you run this example you will find that it doesn’t return any text. Instead all I got was a series of line break characters. Unfortunately, PyPDF2 has pretty limited support for extracting text. Even if it is able to extract text, it may not be in the order you expect and the spacing may be different as well.
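
Since the extracted text may be nothing but line breaks, or may come back with odd spacing, a little post-processing helps before using it. The sketch below relies only on the standard library; both function names are invented for this example and are not part of PyPDF2.

```python
import re


def has_real_text(text):
    """Return True if the extracted text holds anything besides whitespace."""
    return bool(text.strip())


def normalize_extracted_text(text):
    """Collapse runs of spaces and tabs and drop the empty lines left by layout gaps."""
    lines = [re.sub(r'[ \t]+', ' ', line).strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)


# Made-up sample resembling irregularly spaced extractor output
raw = 'Form   W-9\n\n\n  Request \tfor  Taxpayer\n'

print(has_real_text('\n\n \n'))  # False: only line breaks came back
print(normalize_extracted_text(raw))
```

This only tidies spacing. It cannot restore the reading order if the extractor returned the text out of sequence.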

#**Another method with PyPDF2**

When is method 1 suitable?

    When you have only a few files
    When the files to be merged do not share a common filename pattern

How does this method work?

It follows this sequence:

    1. Import the PyPDF2 toolkit, which has the tools we need for working with PDFs
    2. Open each file by entering its file name
    3. Read each file opened in Step 2 using PdfFileReader
    4. Create a blank PDF using PdfFileWriter, where the merged output will be stored
    5. Loop through every page of every file read in Step 3 with a for loop and copy each page
    6. Name the output file and write everything copied in Step 5 into it
    7. Close all the files


In [None]:
# Method 1: merging two files
import PyPDF2 
 
# Open the files that have to be merged one by one
pdf1File = open('FirstInputFile.pdf', 'rb')
pdf2File = open('SecondInputFile.pdf', 'rb')
 
# Read the files that you have opened
pdf1Reader = PyPDF2.PdfFileReader(pdf1File)
pdf2Reader = PyPDF2.PdfFileReader(pdf2File)
 
# Create a new PdfFileWriter object which represents a blank PDF document
pdfWriter = PyPDF2.PdfFileWriter()
 
# Loop through all the pagenumbers for the first document
for pageNum in range(pdf1Reader.numPages):
    pageObj = pdf1Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)
 
# Loop through all the pagenumbers for the second document
for pageNum in range(pdf2Reader.numPages):
    pageObj = pdf2Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)
 
# Now that you have copied all the pages from both documents, write them into a new document
pdfOutputFile = open('MergedFiles.pdf', 'wb')
pdfWriter.write(pdfOutputFile)
 
# Close all the files - Created as well as opened
pdfOutputFile.close()
pdf1File.close()
pdf2File.close()

When is method 2 suitable?

    When you have a lot of PDF files (I mean a lot: hundreds of PDF files or even more)
    When all the PDF files that you want to merge follow a naming convention for their file names

How does this method work?

It follows this sequence:

    1. Import the PdfFileMerger and PdfFileReader tools
    2. Loop through all the files that have to be merged and append them
    3. Write the appended files into an output document and specify a name for it


In [None]:
#Method2
from PyPDF2 import PdfFileMerger, PdfFileReader
 
# Call the PdfFileMerger
mergedObject = PdfFileMerger()
 
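# I had 116 files in the folder that had to be merged into a single document
# Loop through all of them and append their pages
for fileNumber in range(1, 117):
    # PdfFileReader accepts a file path directly; its second argument is
    # `strict`, not a file mode, so 'rb' should not be passed here
    mergedObject.append(PdfFileReader('6_yuddhakanda_' + str(fileNumber) + '.pdf'))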
 
# Write all the files into a file which is named as shown below
mergedObject.write("mergedfilesoutput.pdf")
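
With hundreds of numbered files, a single missing file stops the loop partway through with an error. A possible pre-check, sketched with the standard library only, builds the expected names first and reports any that are absent. Both helper names are hypothetical; the prefix matches the one used in the cell above.

```python
import os


def numbered_paths(prefix, first, last):
    """Build file names such as prefix1.pdf ... prefixN.pdf (both ends included)."""
    return ['{}{}.pdf'.format(prefix, number) for number in range(first, last + 1)]


def missing_paths(paths):
    """Return the paths that do not exist as files on disk."""
    return [path for path in paths if not os.path.isfile(path)]


paths = numbered_paths('6_yuddhakanda_', 1, 116)
print(paths[0], paths[-1])
print('Missing files:', len(missing_paths(paths)))
```

If the list returned by `missing_paths` is empty, the same `paths` list can be fed to the append loop in place of the `range(1, 117)` counter.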

#**pdfminer.six for extracting text from PDF files**


In [None]:
!pip install pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer.six-20201018-py3-none-any.whl (5.6 MB)
Collecting cryptography
  Downloading cryptography-3.4.8-cp36-abi3-manylinux_2_24_x86_64.whl (3.0 MB)
Installing collected packages: cryptography, pdfminer.six
Successfully installed cryptography-3.4.8 pdfminer.six-20201018


In [None]:
from pdfminer.high_level import extract_text

text = extract_text('fw9_page_1.pdf')
print(text)
