# Converting raw .pdf files to .txt files

This .ipynb contains all of the .pdf-to-.txt work done by JK in the initial run in 2022. Although this process was broadly successful, JK felt that a more professional team could do a much better job of it in far less time than JK would need to improve on it. This code and logic are preserved mostly for transparency's sake and for potential re-use by JK or others in future projects. 

## 1 - Install and import necessary things for all of the processes in this .ipynb

Start off by installing the PyPDF2, pdfplumber, and regex modules (if you don't already have them installed) and importing them, along with a few standard modules, so that they can be used in this notebook. I did this in two separate code chunks, one for installing and one for importing, because I wanted to keep the output manageable. 

In [1]:
%%capture
# installing necessary pdf conversion package via pip
# the '%%capture' at the top of this cell suppresses the output (which is normally quite long and annoying looking). 
# You can remove or comment it out if you prefer to see the output. 
!pip install PyPDF2       
!pip install pdfplumber
!pip install regex

In [2]:
# importing required modules (for file handling, displaying screenshots, converting .pdfs, and cleaning text)

import os
from IPython.display import HTML, display
import PyPDF2 
from PyPDF2 import PdfReader, PdfWriter
import pdfplumber
import csv
import datetime
import pandas as pd
import regex
import re

date = datetime.date.today()    # This is a way to make sure that work done on one day does not overwrite previous work

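As a minimal sketch of how a date value like the one above could keep one day's output from overwriting another's, the date can be built into the output file name. The helper name `dated_filename` and the example stem are hypothetical and not part of the original notebook.

```python
import datetime

def dated_filename(stem, extension, day=None):
    """Return a name such as 'results_2022-05-17.csv' (hypothetical helper)."""
    # default to today's date, as the `date` variable in the notebook does
    day = day or datetime.date.today()
    # isoformat() gives YYYY-MM-DD, which also sorts correctly in a file listing
    return f"{stem}_{day.isoformat()}.{extension}"

print(dated_filename("results", "csv", datetime.date(2022, 5, 17)))
# results_2022-05-17.csv
```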
## 2 - Define the functions used for all of the processes in this .ipynb

First, define a function that takes a .pdf filename and two page numbers and returns a smaller .pdf that only contains the pages between the provided page numbers. 

This code is adapted slightly from the original here -> https://gist.github.com/khanfarhan10/464d44086327369953327a7320716100


In [11]:
def pdf_split(fname, start, end=None):
    # Save pages start..end (1-based, inclusive) of fname as a new .pdf in outfolder.
    # Note: outfolder is a global variable set by the calling loop, not an argument.
    print('pdf_split', fname, start, end)

    # keep the input file open until the output is written; 'with' then closes it
    with open(fname, "rb") as infile:
        inputpdf = PdfReader(infile)
        output = PdfWriter()

        # turn 1,4 to 0,3 (a start of 0 or None means 'from the first page')
        num_pages = len(inputpdf.pages)
        start = start - 1 if start else 0
        # an end of 0 or None, or one beyond the last page, means 'to the last page'
        if not end or end > num_pages:
            end = num_pages

        for i in range(start, end):
            output.add_page(inputpdf.pages[i])

        # the output keeps the input's file name (without its folder) but is saved in outfolder
        out_filename = outfolder + re.split(r'[\\/]', fname)[-1]
        with open(out_filename, "wb") as outputStream:
            output.write(outputStream)
    print('saved', out_filename)

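The page-number handling in the function above can be looked at on its own, without any .pdf files. This is a small sketch of the same 1-based to 0-based conversion; the helper name `page_indices` is hypothetical and only used for illustration.

```python
def page_indices(start, end, num_pages):
    """Return the 0-based page indices for a 1-based, inclusive page range."""
    # a start of 0 or None means 'from the first page'
    first = start - 1 if start else 0
    # an end of 0 or None, or one beyond the last page, means 'to the last page'
    last = num_pages if (not end or end > num_pages) else end
    return list(range(first, last))

print(page_indices(2, 5, 10))   # [1, 2, 3, 4]
print(page_indices(0, 0, 3))    # [0, 1, 2]
print(page_indices(1, 99, 3))   # [0, 1, 2]
```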
Then, define the function that opens files and converts the contents by various steps. 

Specifically, this function opens each of the files in the folder given as an argument and for each one removes the '.pdf' suffix from the file name, removes the 'ESHG' bit of the filename, finds how many pages there are in the file, opens each page, extracts the text, appends that text to a string and then writes that string out to a new .txt file with the name (the short version, without '.pdf' or 'ESHG' in it). 

In [3]:
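def convert_pdfs(input_folder, output_folder):
    # Convert every .pdf in input_folder to a .txt file with a shortened name in output_folder.
    for filename in os.listdir(input_folder):
        # shorten the name: 'ESHG2004Abstracts.pdf' becomes '2004Abstracts'
        name = filename.replace('.pdf', "").replace('ESHG', "")
        page_texts = []
        with pdfplumber.open(os.path.join(input_folder, filename)) as pdf:
            for page in pdf.pages:
                # depending on the pdfplumber version, extract_text() may return None
                # for a page with no text, so fall back to an empty string
                page_texts.append(page.extract_text() or "")
        # join pages with a space so words at page breaks do not run together,
        # then collapse every run of whitespace (including newlines) to a single space
        pdf_contents = " ".join(" ".join(page_texts).split())
        with open(os.path.join(output_folder, name + ".txt"), "w", newline='', encoding='utf-8') as f:
            f.write(pdf_contents)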

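The two pure-text steps in the conversion function (shortening the file name and tidying the whitespace) can be tried out without any .pdf files. The helper names `shorten_name` and `collapse_whitespace` are hypothetical; they only isolate what the function does inline.

```python
def shorten_name(filename):
    """Drop the '.pdf' suffix and the 'ESHG' prefix from a file name."""
    return filename.replace(".pdf", "").replace("ESHG", "")

def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with a single space."""
    return " ".join(text.split())

print(shorten_name("ESHG2004Abstracts.pdf"))                # 2004Abstracts
print(shorten_name("input_pdf_1.pdf"))                      # input_pdf_1
print(collapse_whitespace("Two  column\n\ntext \t here"))   # Two column text here
```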
## 3 - Split the .pdfs 

###  Run a splitting .pdf test

The next 3 code cells are optional if you want a low-stakes test or walkthrough of the process, which has 3 parts. 
* 1 - Check the contents of the test folder. 
* 2 - Create a list of lists that holds the arguments needed to run the function. The start and end numbers have no real significance; they are just numbers used to test. Importantly, they must be within the bounds of the .pdf, which requires you to know how many pages are in it. (A 0 means 'from the first page' for the start and 'to the last page' for the end.) 
* 3 - A for-loop runs the function on each item in the defined list of lists. 

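If the page counts are already known, the rows can be sanity-checked before any splitting is done. This is only a sketch: the helper `check_rows`, the file names, and the page counts below are all made up for illustration and do not describe the real test files.

```python
def check_rows(rows, page_counts):
    """Return a list of problems found; an empty list means every row looks fine."""
    problems = []
    for fname, start, end in rows:
        total = page_counts.get(fname)
        if total is None:
            problems.append(f"{fname}: page count unknown")
        elif start > total:
            problems.append(f"{fname}: start {start} is past the last page ({total})")
        elif end and end < start:
            # an end of 0 means 'to the last page', so it is never flagged here
            problems.append(f"{fname}: end {end} is before start {start}")
    return problems

# hypothetical rows and page counts, for illustration only
rows = [["example_a.pdf", 2, 5], ["example_b.pdf", 9, 0], ["example_c.pdf", 4, 2]]
counts = {"example_a.pdf": 10, "example_b.pdf": 3, "example_c.pdf": 8}
for problem in check_rows(rows, counts):
    print(problem)
```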
In [4]:
os.listdir("..\\raw_pdfs\\Test") # This is how to see the contents of any folders shown in the last contents check
                            # For example, 'input_pdfs' which is likely to contain things we want to import

['input_pdf_1.pdf', 'input_pdf_2.pdf', 'input_pdf_3.pdf']

In [5]:
to_split_test = [["input_pdf_1.pdf", 0, 0],
 ["input_pdf_2.pdf", 1, 0],
 ["input_pdf_3.pdf", 2, 5]]

In [12]:
for row in to_split_test:
    folder = "..\\raw_pdfs\\Test\\"
    outfolder =  "..\\input_pdfs\\Test\\"
    fname = folder + row[0]
    pdf_split(fname, row[1], row[2])

pdf_split ..\raw_pdfs\Test\input_pdf_1.pdf 0 0
saved ..\input_pdfs\Test\input_pdf_1.pdf
pdf_split ..\raw_pdfs\Test\input_pdf_2.pdf 1 0
saved ..\input_pdfs\Test\input_pdf_2.pdf
pdf_split ..\raw_pdfs\Test\input_pdf_3.pdf 2 5
saved ..\input_pdfs\Test\input_pdf_3.pdf


The output of the for-loop suggests it all went well, but you may want to inspect the files to double check everything went as you intended. 

### Run the splitting process on the files of interest

Assuming there were no major problems, the process is repeated for the files of interest:

* check the contents of the relevant folder,
* define a list of lists to hold the function arguments, and
* run the for-loop to apply the function to the defined list of lists. 

As before, defining the start and end numbers is a bit of a manual process. I had to open the files, identify the first page that I wanted to keep and the last page that I wanted to keep (paying special attention to the actual .pdf page number rather than the page numbers within the document). 

I also strongly recommend a lot of manual checking to be sure it came out correctly. I had to edit the list a couple of times and re-run the code to get everything correct. In the process of manually checking the ESHG files, I noticed that only the pre-2017 files needed to be trimmed. The more recent files do not contain cover pages, indices, adverts, etc. Winning!

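The pre-2017 rule could be applied automatically by reading the year out of each file name. This is a rough sketch, assuming every file name contains a four-digit year; the helper `needs_trimming` and the 2017 file name are hypothetical.

```python
import re

def needs_trimming(filename, first_clean_year=2017):
    """Return True if the year in the file name is before first_clean_year."""
    # look for the first four-digit year starting with 19 or 20
    match = re.search(r"(19|20)\d{2}", filename)
    if match is None:
        raise ValueError(f"no year found in {filename!r}")
    return int(match.group(0)) < first_clean_year

print(needs_trimming("ESHG2016Abstracts.pdf"))  # True
print(needs_trimming("ESHG2017Abstracts.pdf"))  # False (hypothetical file name)
```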
Unfortunately, the raw .pdfs were too voluminous for github to host, so the next three code cells do not work. You will have to create/edit the folders and files as needed for your actual project. 

In [13]:
os.listdir("..\\..\\PIFL\\raw_pdfs\\ESHG") # This is how to see the contents of any folders shown in the last contents check
                            # For example, 'input_pdfs' which is likely to contain things we want to import

FileNotFoundError: [WinError 3] The system cannot find the path specified: '..\\..\\PIFL\\raw_pdfs\\ESHG'

In [None]:
to_split = [["ESHG2001abstractICHG.pdf", 64, 434],
 ["ESHG2002Abstracts.pdf", 54, 327],
 ["ESHG2003Abstracts.pdf", 44, 261],
 ["ESHG2004Abstracts.pdf", 58, 373],
 ["ESHG2005Abstracts.pdf", 55, 388],
 ["ESHG2006Abstracts.pdf", 74, 410],
 ["ESHG2007Abstracts.pdf", 5, 351],
 ["ESHG2008Abstracts.pdf", 5, 469],
 ["ESHG2009Abstracts.pdf", 6, 401],
 ["ESHG2010Abstracts.pdf", 6, 400],
 ["ESHG2011Abstracts.pdf", 5, 484],
 ["ESHG2012Abstracts.pdf", 6, 438],
 ["ESHG2013Abstracts.pdf", 6, 611],
 ["ESHG2014Abstracts.pdf", 6, 518],
 ["ESHG2015Abstracts.pdf", 6, 485],
 ["ESHG2016Abstracts.pdf", 6, 506]]

In [None]:
for row in to_split:
    folder = "..\\..\\PIFL\\raw_pdfs\\ESHG\\"
    outfolder =  "..\\input_pdfs\\ESHG\\"
    fname = folder + row[0]
    pdf_split(fname, row[1], row[2])

If you get the above code to work, you may notice a warning message. I got the message (PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected.) after the 2004 file. 

Online advice suggests this is an encoding error coming from how the .pdf file was created that lets it be read but not handled in the standard way. The solution seems to be to open it in Adobe and save it under a new file name, but that did not work for me. 

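To find out which file triggers a message like this, one option is to record warnings per file. The sketch below uses only Python's standard `warnings` module and a stand-in function instead of the real splitting function. It assumes the library reports the problem through `warnings`, which may depend on the PyPDF2 version (some versions use logging instead). The names `run_and_collect_warnings` and `noisy_step` are hypothetical.

```python
import warnings

def run_and_collect_warnings(func, *args, **kwargs):
    """Call func and return (result, list of warning messages it raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        result = func(*args, **kwargs)
    return result, [str(w.message) for w in caught]

def noisy_step(name):
    # stand-in for the real splitting function: warns for one particular file
    if "2004" in name:
        warnings.warn("Xref table not zero-indexed.")
    return name

for fname in ["ESHG2003Abstracts.pdf", "ESHG2004Abstracts.pdf"]:
    _, messages = run_and_collect_warnings(noisy_step, fname)
    print(fname, messages)
```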
## 4 - Create and/or check the input .pdf

I first created 2 test files in Word and then printed each of them to .pdf. I specifically put a few key features into these files to test how the text would be converted, such as a heading, multiple blank lines between text, an image with a caption, multiple pages, and two-column text. 

I then saved these new .pdfs (and screenshots of both) into the same location as my .ipynb so that I can 
* 1 - paste in the images to show how the .pdfs looked to begin with (see next cell) and
* 2 - import the .pdfs so that PyPDF2 can convert them (see cell after next). 

In [None]:
display(HTML("<table><tr><td><img src='../images/Input_pdf_image_1.png'></td><td><img src='../images/Input_pdf_image_2.png'></td></tr></table>"))

The following code cell tells you all the files in a particular folder (in this case, the raw_pdfs\Test folder). 

Running it shows the two test .pdfs shown above as well as a third .pdf that I created later with many more pages (I created it to test the pdf_split function).

In [None]:
os.listdir("..\\raw_pdfs\\Test") # This is how to see the contents of any folders shown in the last contents check
                            # For example, 'input_pdfs' which is likely to contain things we want to import

In [None]:
# creating .pdf file objects from the existing .pdfs in the input_pdfs\Test folder
# this imports the .pdfs and creates accessible objects for the PyPDF2 module to work with
# (the r'...' raw strings stop Python treating the backslashes as escape characters)

pdfFileObj_1 = open(r'..\input_pdfs\Test\input_pdf_1.pdf', 'rb') 
pdfFileObj_2 = open(r'..\input_pdfs\Test\input_pdf_2.pdf', 'rb') 
pdfFileObj_3 = open(r'..\input_pdfs\Test\input_pdf_3.pdf', 'rb') 



## 5 - Convert and check the .pdf objects

The first step is to create a PyPDF2 object from the imported .pdf. This allows you to do things like:
* check how many pages are in the original .pdf, 
* convert some or all of those pages to page objects, and
* then extract the text from those page objects (optionally saving the text for later analysis). 

Of course, good coding etiquette suggests you should always close any opened files when you are done with them. 

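One common way to make sure files always get closed is a `with` block, which closes the file when the block ends even if something goes wrong inside it. This is a minimal sketch using a throwaway temporary file as a stand-in for a .pdf; it is not taken from the original notebook.

```python
import os
import tempfile

# create a small throwaway file to stand in for a .pdf (illustration only)
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4 placeholder bytes")
    path = tmp.name

# the 'with' block closes the file on exit, even if an error occurs inside it
with open(path, "rb") as file_obj:
    header = file_obj.read(8)

print(header)            # b'%PDF-1.4'
print(file_obj.closed)   # True
os.remove(path)          # tidy up the throwaway file
```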

PyPDF2 is not the only option for converting .pdf files, and in the convert_multiple_pdfs notebook I use pdfplumber instead. But still, PyPDF2 is straightforward and a good way to get to grips with the basic steps and to see what a converted .pdf would look like as a string of text. 

In [None]:
# creating .pdf reader objects, which the module will use for the actual conversion work
# (PdfReader, imported at the top, replaces the PdfFileReader name used by older PyPDF2 versions)
pdfReader_1 = PdfReader(pdfFileObj_1) 
pdfReader_2 = PdfReader(pdfFileObj_2) 
pdfReader_3 = PdfReader(pdfFileObj_3) 

In [None]:
# printing number of pages in each .pdf file 
print(len(pdfReader_1.pages)) 
print(len(pdfReader_2.pages)) 
print(len(pdfReader_3.pages)) 

So far so good. We have the .pdfs imported, converted to module-specific objects, and the module has correctly identified the number of pages in each. But really, we need to know how well the module can recognise the text within those objects because I deliberately made that text a bit tricky. 

In [None]:
# creating .pdf file objects from the existing .pdfs in the input_pdfs\Test folder
# this imports the .pdfs and creates accessible objects for the PyPDF2 module to work with

pdfFileObj_1 = open(r'..\input_pdfs\Test\input_pdf_1.pdf', 'rb') 
pdfFileObj_2 = open(r'..\input_pdfs\Test\input_pdf_2.pdf', 'rb') 
pdfFileObj_3 = open(r'..\input_pdfs\Test\input_pdf_3.pdf', 'rb') 

## 6 - Get individual pages, extract text, save it as strings, etc. 

Now, we get down to the real work. We want to 
* convert some or all of those pages to page objects, 
* then extract the text from those page objects, 
* (optionally) save that text as string objects for later analysis, and 
* tidy up (good coding etiquette suggests you should always close any opened files when you are done with them). 


In [None]:
# creating page objects for each page
# something to note here - the pages start counting from 0 so you get page 1 of our first test .pdf
#                          by asking for pages[0]. 
#                          In turn, when we want to get both pages from the second test .pdf, we ask for 
#                          pages[0] and also pages[1]
pageObj_1_1 = pdfReader_1.pages[0] 
pageObj_2_1 = pdfReader_2.pages[0]
pageObj_2_2 = pdfReader_2.pages[1] 
pageObj_3_3 = pdfReader_3.pages[2] 

In [None]:
# extracting text from page to print on screen
print(pageObj_2_1.extract_text()) 

In [None]:
# extracting text from page to print on screen
print(pageObj_2_2.extract_text()) 

So, good news. This .pdf conversion module has successfully recognised that input_pdf_2 was structured in two columns and it has converted the text appropriately (ish) with the text flowing properly from the end of one line in a column to the start of the next line *in the same column* rather than reading on to the equivalent line in the next column. 

It might be better if the lines were not cut short to replicate the actual number of words in each column as they appear in the text, but that is a step for later on. 

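As a sketch of that later step, the short lines within a paragraph can be rejoined while the blank lines between paragraphs are kept. The helper name `unwrap_lines` and the sample text are made up for illustration. Note that it does not repair words hyphenated across a line break.

```python
def unwrap_lines(text):
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs = []
    for block in text.split("\n\n"):
        words = block.split()   # splits on spaces and single newlines alike
        if words:
            paragraphs.append(" ".join(words))
    return "\n\n".join(paragraphs)

sample = "The text in the\nfirst column is\ncut short.\n\nA new paragraph\nstarts here."
print(unwrap_lines(sample))
```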

In [None]:
# extracting text from page to save for later use, in this case as a string object
test_file_1 = pageObj_1_1.extract_text() 
type(test_file_1)

## 7 - Converting a single file to inspect it for problems

### Inspect the problem file

In the split_pdfs notebook, we got an error message with the 2004 file. Inspecting it, it seemed fine - the cover pages, indices, adverts, etc. were cut off leaving only the pages with abstracts remaining. 

However, proceeding through the other notebooks in order (convert_multiple_pdfs, preliminary_clean, and analysis) it seemed like there were problems with the 2004 file. It seems that the encoding issues ran a bit deeper, so I had to go back and convert that file again, with some extra steps to deal with the encoding issues. 

As above with the test files, we start by loading a reader object from the file and check how many pages it has. 

In [None]:
pdf2004 = PdfReader(r'..\input_pdfs\ESHG\ESHG2004abstracts.pdf')
print(len(pdf2004.pages)) 

Then, we take a look at a few pages. After playing around for a bit, I found a page that shows the encoding problems. 

In [None]:
page_50 = pdf2004.pages[50]   # index 50 is the 51st page, because counting starts from 0
print(page_50.extract_text()) 

I tried with pdfplumber too, just to check whether the encoding problem could be solved by importing and extracting the text in a different way. 


Scroll down to see the problem. Before too long, all the text starts to appear as a sequence in the following format:

(cid:73)(cid:68)(cid:87)(cid:68)(cid:86)(cid:72)(cid:3)(cid:39)(cid:72)(cid:191)(cid:70)(cid:76)(cid:72)(cid:81)(cid:70)(cid:92)(cid:29)(cid:3)(cid:48)(cid:82)(cid:79)(cid:72)(cid:70)(cid:88)(cid:79)(cid:68)(cid:85)(cid:3)(cid:71)(cid:72)(cid:73)(cid:72)(cid:70)(cid:87)(cid:3)

In [None]:
with pdfplumber.open(r'..\input_pdfs\ESHG\ESHG2004abstracts.pdf') as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

I don't include it here, but there is some code at the end of the convert_pdf_multiple notebook that corrects the problems with this file. I hope it corrects the problems anyway. 

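Rather than scrolling, a rough way to spot where the text goes wrong is to measure how much of a page's text is made up of `(cid:N)` tokens. This is a sketch only; the helper `cid_fraction` is hypothetical and works on any string, such as the text extracted from a single page.

```python
import re

CID_TOKEN = re.compile(r"\(cid:\d+\)")

def cid_fraction(text):
    """Return the fraction of characters in text that belong to '(cid:N)' tokens."""
    if not text:
        return 0.0
    cid_chars = sum(len(token) for token in CID_TOKEN.findall(text))
    return cid_chars / len(text)

clean = "Molecular defect"
garbled = "(cid:73)(cid:68)(cid:87)"
print(cid_fraction(clean))    # 0.0
print(cid_fraction(garbled))  # 1.0
```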
## 8 - Run on Test folder

Let's just test it on the Test folder to make sure it all goes to plan. It will remove the '.pdf' from the names but looking for and replacing 'ESHG' will have no effect. 

Be sure to look in the output_texts folder, open and inspect a file or two to make sure it worked as expected. 

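A quick way to do that inspection from inside the notebook is to print the first few characters of every .txt file in a folder. This is a sketch under assumptions: the helper `preview_texts` is hypothetical, and the demonstration uses a temporary folder rather than the real output_texts folder.

```python
import os
import tempfile

def preview_texts(folder, n_chars=80):
    """Return {filename: first n_chars characters} for every .txt file in folder."""
    previews = {}
    for filename in sorted(os.listdir(folder)):
        if filename.lower().endswith(".txt"):
            with open(os.path.join(folder, filename), "r", encoding="utf-8") as f:
                previews[filename] = f.read(n_chars)
    return previews

# demonstration on a temporary folder with one made-up file
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "input_pdf_1.txt"), "w", encoding="utf-8") as f:
        f.write("A heading followed by some body text.")
    demo = preview_texts(folder, n_chars=9)

print(demo)  # {'input_pdf_1.txt': 'A heading'}
```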

In [None]:
os.listdir(r"..\input_pdfs\Test") # This is how to see the contents of any folders shown in the last contents check
                            # For example, 'input_pdfs' which is likely to contain things we want to import

In [None]:
convert_pdfs(r'..\input_pdfs\Test', r'..\output_texts\Test')

## 9 - Run on folder of interest 

Now let's do the same for the target folder. This time, the name shortening lines in the function will take full effect. 

In [None]:
os.listdir(r"..\input_pdfs\ESHG") 

In [None]:
convert_pdfs(r'..\input_pdfs\ESHG', r'..\output_texts\ESHG')     # This takes a long time to run. At least it did for me.
                                                                 # Go do something else for a while or 
                                                                 # do this last thing before you go in the evening. 

If you remember, the 2004 file threw up a warning when we were splitting it to chop off the irrelevant bits. 

It looks fine when viewed in a .pdf reader but I noticed that running the preliminary clean and analysis notebooks showed that it was not behaving like the other files. I double checked it using the code in the convert_pdf_single notebook and saw that it started off looking normal but quickly went very squiffy. It seems that particular .txt file is full of squiffy text due to some weird encoding. 

As such, we need to do a bit of work to recode the 2004 file, giving it a unique name to distinguish it from the file converted by the previous code. 

In [None]:
CID_PATTERN = re.compile(r'\(cid:(\d+)\)')

def cidToChar(match):
    # each '(cid:N)' token stands for one character; adding 29 to N gave readable text for this file
    return chr(int(match.group(1)) + 29)

# open the output file once, so that each line is added to it rather than overwriting the last
with open('..\\output_texts\\ESHG\\2004Abstracts.txt', "r", newline='', encoding='utf-8') as infile, \
     open('..\\output_texts\\ESHG\\2004.txt', "w", newline='', encoding='utf-8') as outfile:
    for line in infile:
        # swap every '(cid:N)' token in the line for its character and write the line out
        outfile.write(CID_PATTERN.sub(cidToChar, line))

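The fixed offset of 29 handles ordinary letters, but some tokens may not follow it. In the sample sequence shown earlier, `(cid:191)` sits where the "fi" of "Deficiency" would be expected, and the offset turns it into "Ü" instead. The sketch below adds an override table for such cases. The mapping of 191 to "fi" is an assumption inferred from that one sample, not something confirmed for the whole file, and the helper `decode_cids` is hypothetical.

```python
import re

CID_TOKEN = re.compile(r"\(cid:(\d+)\)")
# ASSUMPTION: cid 191 stands for the 'fi' ligature (inferred from one sample, not confirmed)
OVERRIDES = {191: "fi"}

def decode_cids(text, offset=29, overrides=OVERRIDES):
    """Replace each '(cid:N)' token with overrides[N] if present, else chr(N + offset)."""
    def replace(match):
        cid = int(match.group(1))
        return overrides.get(cid, chr(cid + offset))
    return CID_TOKEN.sub(replace, text)

sample = "(cid:39)(cid:72)(cid:191)(cid:70)(cid:76)(cid:72)(cid:81)(cid:70)(cid:92)"
print(decode_cids(sample, overrides={}))  # DeÜciency
print(decode_cids(sample))                # Deficiency
```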
Aaaaaaaaaaaaand, let's just have a look at the new (and uniquely named) recoded file to see what it looks like. At this point, you may want to save the original (and squiffy) file somewhere else and only continue working with the better encoded 2004.txt file. Or you can just delete the original one as you could always get it back again by re-running the conversion process on all the files (or just on the one file). 

In [None]:
with open('..\\output_texts\\ESHG\\2004.txt','r', encoding='utf-8') as f:
    contents = f.read()
    print(contents)