# Splitting a .pdf, or removing unnecessary pages from a .pdf

## Install and import necessary things

This notebook, like all the other Jupyter notebooks in this repository, starts by installing and/or importing all the packages needed to perform the actions.

In [11]:
%%capture

# Import the packages used in this notebook.
# PyPDF2 must already be installed (for example via pip). PdfFileReader and PdfFileWriter
# are the class names used by PyPDF2 versions before 3.0.
# The '%%capture' at the top of this cell suppresses the cell output (which can be quite long when installing packages).
# You can remove or comment it out if you prefer to see the output.

import os
from PyPDF2 import PdfFileReader, PdfFileWriter

## Define the function

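First, define a function that takes a .pdf filename and two page numbers, and saves a smaller .pdf that contains only the pages from the first page number to the second (inclusive, counting from 1). A value of 0 for the start means "from the first page" and a value of 0 for the end means "to the last page".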

This code is adapted slightly from the original here -> https://gist.github.com/khanfarhan10/464d44086327369953327a7320716100
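
To make the page-number convention concrete, here is a minimal sketch of the index arithmetic the function relies on. The helper name `page_indices` is hypothetical (it is not part of the original gist or of PyPDF2) and it only mirrors the start/end handling, without touching any files.

```python
def page_indices(start, end, num_pages):
    """Return the 0-based page indices that pdf_split would copy."""
    # A start of 0 or None means "from the first page"
    first = start - 1 if start else 0
    # An end of 0 or None, or one past the last page, means "to the last page"
    last = end if end and end <= num_pages else num_pages
    return list(range(first, last))


print(page_indices(2, 5, 10))  # pages 2 to 5 of a 10-page file -> [1, 2, 3, 4]
print(page_indices(0, 0, 3))   # whole 3-page file -> [0, 1, 2]
print(page_indices(2, 99, 4))  # end is clamped to the last page -> [1, 2, 3]
```

Because the end page is used directly as the (exclusive) upper bound of the range, the end page itself is included in the output.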


In [12]:
def pdf_split(fname, start, end=None, out_dir=None):
    """Save pages start..end (1-based, inclusive) of fname as a new .pdf."""
    print('pdf_split', fname, start, end)

    # Default to the notebook-level 'outfolder' variable set in the loops below
    if out_dir is None:
        out_dir = outfolder

    # Keep the input open until the output is written, because pages are read lazily
    with open(fname, "rb") as infile:
        inputpdf = PdfFileReader(infile)
        output = PdfFileWriter()
        num_pages = inputpdf.numPages

        # Turn 1-based page numbers into 0-based indices, e.g. 1,4 -> 0,1,2,3
        # A start of 0 or None means "from the first page"
        start = start - 1 if start else 0
        # An end of 0 or None, or one past the last page, means "to the last page"
        if not end or end > num_pages:
            end = num_pages

        for i in range(start, end):
            output.addPage(inputpdf.getPage(i))

        # Save under the same file name in the output folder
        out_filename = os.path.join(out_dir, os.path.basename(fname))
        with open(out_filename, "wb") as output_stream:
            output.write(output_stream)
    print('saved', out_filename)

## Run the code

### Test run

First off, we check the contents of the test folder. 

Then, we create a list of lists that holds the arguments needed to run the function. The start and end numbers have no real significance; they are just values used for testing. Importantly, they should be within the bounds of the .pdf, which requires you to know how many pages it has.

Finally, a little for-loop runs the function on each item in the defined list of lists.
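
Since the page numbers need to be within the bounds of each file, it can help to check the argument list before running the loop. The sketch below is one possible way to do that. The helper name `check_rows` and the page counts are made up for illustration; in practice the counts could be read with `PdfFileReader(...).numPages`, as the function above does.

```python
def check_rows(rows, page_counts):
    """Return a list of human-readable problems found in the argument list."""
    problems = []
    for fname, start, end in rows:
        num_pages = page_counts.get(fname)
        if num_pages is None:
            problems.append(f"{fname}: page count unknown")
            continue
        if start and start > num_pages:
            problems.append(f"{fname}: start {start} is past the last page ({num_pages})")
        if end and end > num_pages:
            problems.append(f"{fname}: end {end} is past the last page ({num_pages})")
        if start and end and start > end:
            problems.append(f"{fname}: start {start} comes after end {end}")
    return problems


# Made-up file names and page counts, for illustration only
example_counts = {"a.pdf": 10, "b.pdf": 3}
example_rows = [["a.pdf", 2, 5], ["b.pdf", 2, 7], ["c.pdf", 1, 0]]
for problem in check_rows(example_rows, example_counts):
    print(problem)
```

An empty list means no obvious problems were found; it does not replace opening the output files and looking at them.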

In [13]:
os.listdir("..\\raw_pdfs\\Test")  # list the contents of the test folder

['input_pdf_1.pdf', 'input_pdf_2.pdf', 'input_pdf_3.pdf']

In [14]:
to_split_test = [["input_pdf_1.pdf", 0, 0],
 ["input_pdf_2.pdf", 1, 0],
 ["input_pdf_3.pdf", 2, 5]]

In [15]:
for row in to_split_test:
    folder = "..\\raw_pdfs\\Test\\"
    outfolder =  "..\\input_pdfs\\Test\\"
    fname = folder + row[0]
    pdf_split(fname, row[1], row[2])

pdf_split ..\raw_pdfs\Test\input_pdf_1.pdf 0 0
saved ..\input_pdfs\Test\input_pdf_1.pdf
pdf_split ..\raw_pdfs\Test\input_pdf_2.pdf 1 0
saved ..\input_pdfs\Test\input_pdf_2.pdf
pdf_split ..\raw_pdfs\Test\input_pdf_3.pdf 2 5
saved ..\input_pdfs\Test\input_pdf_3.pdf


The output of the for-loop suggests it all went well, but you may want to inspect the files to double-check that everything went as you intended.

### Run on files of interest

Assuming there were no major problems, the process is repeated for the files of interest:

* check the contents of the relevant folder,
* define a list of lists to hold the function arguments, and
* run the for-loop to apply the function to the defined list of lists.


As before, defining the start and end numbers is a bit of a manual process. I had to open each file and identify the first and last pages that I wanted to keep, paying special attention to the actual .pdf page number rather than the page numbers printed within the document.

I also strongly recommend a lot of manual checking to be sure it came out correctly. I had to edit the list a couple of times and re-run the code to get everything correct.


In [16]:
os.listdir("..\\..\\PIFL\\raw_pdfs\\ESHG")  # list the contents of the folder holding the files of interest

['ESHG2001abstractICHG.pdf',
 'ESHG2002Abstracts.pdf',
 'ESHG2003Abstracts.pdf',
 'ESHG2004Abstracts.pdf',
 'ESHG2005Abstracts.pdf',
 'ESHG2006Abstracts.pdf',
 'ESHG2007Abstracts.pdf',
 'ESHG2008Abstracts.pdf',
 'ESHG2009Abstracts.pdf',
 'ESHG2010Abstracts.pdf',
 'ESHG2011Abstracts.pdf',
 'ESHG2012Abstracts.pdf',
 'ESHG2013Abstracts.pdf',
 'ESHG2014Abstracts.pdf',
 'ESHG2015Abstracts.pdf',
 'ESHG2016Abstracts.pdf',
 'ESHG2017 electronic posters.pdf',
 'ESHG2017 oral presentations.pdf',
 'ESHG2017 posters.pdf',
 'ESHG2018 electronic posters.pdf',
 'ESHG2018 EMPAG.pdf',
 'ESHG2018 oral presentation.pdf',
 'ESHG2018 posters.pdf',
 'ESHG2019 oral presentation.pdf',
 'ESHG2019 posters.pdf',
 'ESHG2019 posters2.pdf',
 'ESHG2020 eposters.pdf',
 'ESHG2020 interactive eposter.pdf',
 'ESHG2020 oral presentation.pdf',
 'ESHG2021 eposters.pdf',
 'ESHG2021 oral presentations.pdf']

In the process of manually checking the ESHG files, I noticed that only the pre-2017 files needed to be trimmed. The more recent files do not contain cover pages, indices, adverts, etc. Winning!

In [7]:
to_split = [["ESHG2001abstractICHG.pdf", 64, 434],
 ["ESHG2002Abstracts.pdf", 54, 327],
 ["ESHG2003Abstracts.pdf", 44, 261],
 ["ESHG2004Abstracts.pdf", 58, 373],
 ["ESHG2005Abstracts.pdf", 55, 388],
 ["ESHG2006Abstracts.pdf", 74, 410],
 ["ESHG2007Abstracts.pdf", 5, 351],
 ["ESHG2008Abstracts.pdf", 5, 469],
 ["ESHG2009Abstracts.pdf", 6, 401],
 ["ESHG2010Abstracts.pdf", 6, 400],
 ["ESHG2011Abstracts.pdf", 5, 484],
 ["ESHG2012Abstracts.pdf", 6, 438],
 ["ESHG2013Abstracts.pdf", 6, 611],
 ["ESHG2014Abstracts.pdf", 6, 518],
 ["ESHG2015Abstracts.pdf", 6, 485],
 ["ESHG2016Abstracts.pdf", 6, 506]]
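
One quick sanity check for the manual inspection is to work out how many pages each trimmed file should contain. This is a small sketch; the helper name `expected_pages` is hypothetical, and it assumes both numbers are given and the end page does not exceed the length of the file.

```python
def expected_pages(start, end):
    """Number of pages kept when both start and end are given (1-based, inclusive)."""
    return end - start + 1


# First three rows of the to_split list above
rows = [["ESHG2001abstractICHG.pdf", 64, 434],
        ["ESHG2002Abstracts.pdf", 54, 327],
        ["ESHG2003Abstracts.pdf", 44, 261]]

for fname, start, end in rows:
    print(fname, expected_pages(start, end))
# ESHG2001abstractICHG.pdf 371
# ESHG2002Abstracts.pdf 274
# ESHG2003Abstracts.pdf 218
```

If a saved file has a different page count from the one computed here, the start or end number for that row is worth revisiting.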

In [10]:
for row in to_split:
    folder = "..\\raw_pdfs\\ESHG\\"
    outfolder =  "..\\input_pdfs\\ESHG\\"
    fname = folder + row[0]
    pdf_split(fname, row[1], row[2])

pdf_split ..\raw_pdfs\ESHG\ESHG2001abstractICHG.pdf 64 434


NameError: name 'PdfFileReader' is not defined

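The NameError above most likely means this cell was run before the import cell at the top of the notebook (note the cell numbers); re-running the import cell first should resolve it.

As a note, when the loop does run there is a warning (PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected.) after the 2004 file.

Online advice suggests this is an encoding problem that comes from how the .pdf file was created, which lets it be read but not handled in the standard way. The solution seems to be to open it in Adobe and save it under a new file name.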
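
To see which file triggers a warning, one option is to record warnings per call. The sketch below uses only Python's standard `warnings` module. The helper name `call_and_collect_warnings` and the stand-in function `noisy_step` are hypothetical, and whether this catches the PdfReadWarning depends on how the installed PyPDF2 version emits it, which has not been checked here.

```python
import warnings


def call_and_collect_warnings(func, *args):
    """Call func(*args) and return (result, list of warning messages raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        result = func(*args)
    return result, [str(w.message) for w in caught]


# Stand-in for pdf_split, so the sketch runs without any .pdf files
def noisy_step(name):
    warnings.warn(f"something odd in {name}")
    return name.upper()


result, messages = call_and_collect_warnings(noisy_step, "example.pdf")
print(result, messages)
# EXAMPLE.PDF ['something odd in example.pdf']
```

In the loop above, the same wrapper could be called as `call_and_collect_warnings(pdf_split, fname, row[1], row[2])`, printing the file name whenever the returned list is not empty.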