# Converting .pdf files to text files

## Install and import necessary things

Start by installing the PyPDF2 module (if you don't already have it) and importing it so that it can be used in this notebook.

In [2]:
# installing necessary pdf conversion package via pip
!pip install PyPDF2             





In [37]:
# importing required modules (first for displaying screenshots, second for converting pdfs)
from IPython.display import HTML, display
import PyPDF2 

## Create and/or check the input .pdf

I first created two test files in Word and then printed each of them to .pdf. I deliberately included a few features to test how the text would be converted: a heading, multiple blank lines between text, an image with a caption, multiple pages, and two-column text.

I then saved these new .pdfs (and screenshots of both) in the same location as my .ipynb so that I can
* display the screenshots to show how the .pdfs looked to begin with (see next cell), and
* import the .pdfs so that PyPDF2 can convert them (see cell after next).

In [39]:
display(HTML("<table><tr><td><img src='Input_pdf_image_1.png'></td><td><img src='Input_pdf_image_2.png'></td></tr></table>"))

In [49]:
# creating .pdf file objects from existing .pdfs in the same folder as this .ipynb code
# this imports the .pdfs and creates accessible objects for the PyPDF2 module to work with
pdfFileObj_1 = open('input_pdf_1.pdf', 'rb') 
pdfFileObj_2 = open('input_pdf_2.pdf', 'rb') 

## Convert and check the .pdf objects

The first step is to create a PyPDF2 reader object from the imported .pdf. This allows you to do things like:
* check how many pages are in the original .pdf,
* convert some or all of those pages to page objects, and
* then extract the text from those page objects (optionally saving the text for later analysis).

Of course, good coding etiquette suggests you should always close any opened files when you are done with them. 
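
One way to make sure a file is closed, even if a later cell raises an error, is Python's `with` statement. The sketch below is an illustration only: it uses a throwaway temporary file (the name `demo.pdf` is a placeholder and its contents are not a real .pdf) to show that the file is closed as soon as the block ends.

```python
import tempfile
from pathlib import Path

# Illustration only: a throwaway file stands in for a real .pdf
tmp_dir = tempfile.mkdtemp()
demo_path = Path(tmp_dir) / "demo.pdf"      # hypothetical file name
demo_path.write_bytes(b"placeholder bytes, not a real PDF")

# 'with' closes the file automatically when the block ends,
# even if the code inside the block raises an exception
with open(demo_path, "rb") as pdf_file:
    first_bytes = pdf_file.read(11)
    print(pdf_file.closed)   # still open inside the block

print(pdf_file.closed)       # closed once the block has ended
```

If this pattern were used in the notebook, the PyPDF2 work would need to happen inside the `with` block, because the reader object keeps reading from the open file.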

In [50]:
# creating .pdf reader objects, which the module will use for the actual conversion work
pdfReader_1 = PyPDF2.PdfFileReader(pdfFileObj_1) 
pdfReader_2 = PyPDF2.PdfFileReader(pdfFileObj_2) 

In [51]:
# printing number of pages in each .pdf file 
print(pdfReader_1.numPages) 
print(pdfReader_2.numPages) 

1
2


So far so good. We have the .pdfs imported and converted to module-specific objects, and the module has correctly identified the number of pages in each. But really, we need to know how well the module can recognise the text within those objects, because I deliberately made that text a bit tricky.
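
A note on versions: the names used in this notebook (`PdfFileReader`, `numPages`, `getPage`, `extractText`) are the older PyPDF2 spellings. As far as I am aware, they were deprecated and then removed in PyPDF2 3.0 in favour of `PdfReader`, `len(reader.pages)`, `reader.pages[i]` and `extract_text()`. The two helper functions below are a hypothetical sketch (they are not part of PyPDF2) that tries the newer names first and falls back to the older ones.

```python
def page_count(reader):
    """Number of pages, whichever PyPDF2 naming style the reader uses."""
    if hasattr(reader, "pages"):
        return len(reader.pages)      # newer style
    return reader.numPages            # older style, as used in this notebook


def page_text(reader, index):
    """Text of the page at a zero-based index, for either naming style."""
    page = reader.pages[index] if hasattr(reader, "pages") else reader.getPage(index)
    # prefer the newer method name, fall back to the older one
    extract = getattr(page, "extract_text", None) or page.extractText
    return extract()
```

With the objects created above, `page_count(pdfReader_2)` would play the role of `pdfReader_2.numPages`, and `page_text(pdfReader_2, 0)` would return the text of its first page.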

## Get individual pages, extract text, save it as strings, etc. 

Now we get down to the real work. We want to
* convert some or all of those pages to page objects,
* extract the text from those page objects,
* (optionally) save that text as string objects for later analysis, and
* tidy up (good coding etiquette suggests you should always close any opened files when you are done with them).


In [56]:
# creating page objects for each page
# note that page numbering starts at 0, so getPage(0) returns page 1 of our first test .pdf;
# to get both pages of the second test .pdf, we ask for getPage(0) and getPage(1)
pageObj_1_1 = pdfReader_1.getPage(0) 
pageObj_2_1 = pdfReader_2.getPage(0)
pageObj_2_2 = pdfReader_2.getPage(1) 
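
Asking for each page by hand gets tedious once a .pdf has more than a couple of pages. Below is a minimal sketch of a loop that does the same job, assuming the same older PyPDF2 names used in this notebook (`numPages`, `getPage`, `extractText`). The function name `extract_all_pages` is made up for illustration.

```python
def extract_all_pages(reader):
    """Return a list with the extracted text of every page, in order."""
    texts = []
    # range(n) counts 0, 1, ..., n - 1, which matches getPage's zero-based numbering
    for index in range(reader.numPages):
        page = reader.getPage(index)
        texts.append(page.extractText())
    return texts
```

For example, `pages_2 = extract_all_pages(pdfReader_2)` would give a list in which `pages_2[0]` holds the text of page 1 and `pages_2[1]` the text of page 2.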

In [55]:
# extracting text from page to print on screen
print(pageObj_1_1.extractText()) 

 Input .pdf for testing 
Test text.  
One.  
Two.  
Three.  
Figure 1 - A plane with caption to test caption conversion. 



In [57]:
# extracting text from page to print on screen
print(pageObj_2_1.extractText()) 

"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci 

velit..." 

"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is 

pain..." 

What is Lorem Ipsum? 

Lorem Ipsum is simply dummy text of the 

printing  and  typesetting  industry.  Lorem 

Ipsum  has  been  the  industry's  standard 

dummy text ever since the 1500s, when an 

unknown printer took a galley of type and 

scrambled  it  to  make  a  type  specimen 

book.  It  has  survived  not  only  five 

centuries, but also the leap into electronic 

typesetting,  remaining  essentially 

unchanged.  It  was  popularised  in  the 

1960s  with  the  release of  Letraset  sheets 

containing  Lorem  Ipsum  passages,  and 

more  recently  with  desktop  publishing 

software  like  Aldus  PageMaker  including 

versions of Lorem Ipsum. 

Why do we use it? 

It  is  a  long  established  fact  that  a  reader 

will be distracted by the readable co

In [58]:
# extracting text from page to print on screen
print(pageObj_2_2.extractText()) 

which don't look even slightly believable. If 

you are going to use  a passage  of  Lorem 

Ipsum,  you  need  to  be  sure  there  isn't 

anything  embarrassing  hidden  in  the 

middle  of  text.  All  the  Lorem  Ipsum 

generators on the Internet tend to repeat 

predefined  chunks  as  necessary,  making 

this the first true generator on the Internet. 

It uses a dictionary of over 200 Latin words, 

combined  with  a  handful  of  model 

sentence  structures,  to  generate  Lorem 

Ipsum  which  looks  reasonable.  The 

generated  Lorem  Ipsum  is  therefore 

always  free  from  repetition,  injected 

humour, or non-characteristic words etc. 


So, good news. This .pdf conversion module has recognised that input_pdf_2 was structured in two columns and has converted the text appropriately (ish): the text flows from the end of one line in a column to the start of the next line *in the same column*, rather than reading across to the equivalent line in the next column.

One thing to check, though: in the output shown above, page 1 stops mid-word ("readable co") and page 2 resumes at "which don't look even slightly believable", so some text appears to have been lost along the way.

It might be better if the extracted lines were not cut short to match the narrow column widths of the original, but that is a step for later on.
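
As a rough idea of what that later step could look like, the sketch below collapses every run of whitespace (including the extra line breaks and double spaces seen in the output above) into a single space. The function name `tidy_extracted_text` is hypothetical, and this is a deliberately blunt approach: it also merges headings and paragraph breaks into the running text, so a real version would probably need to be more careful.

```python
import re


def tidy_extracted_text(raw_text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r"\s+", " ", raw_text).strip()


# a short sample copied from the extracted output above
sample = (
    "printing  and  typesetting  industry.  Lorem \n\n"
    "Ipsum  has  been  the  industry's  standard \n"
)
print(tidy_extracted_text(sample))
# prints: printing and typesetting industry. Lorem Ipsum has been the industry's standard
```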


In [60]:
# extracting text from page to save for later use, in this case as a string object
test_file_1 = (pageObj_1_1.extractText()) 
type(test_file_1)

str
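
Since the end goal is a text file, the string can also be written to disk. This is a minimal sketch using only the standard library; the function name `save_text` and the output file name in the usage note are assumptions, not something from the original notebook.

```python
from pathlib import Path


def save_text(text, txt_path):
    """Write a string to a UTF-8 text file and return the path written."""
    txt_path = Path(txt_path)
    txt_path.write_text(text, encoding="utf-8")
    return txt_path
```

For example, `save_text(test_file_1, "input_pdf_1.txt")` would save the text extracted above next to the notebook.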

In [16]:
# closing the pdf file objects
pdfFileObj_1.close()
pdfFileObj_2.close()

## Next steps

This is a great start, but it is not the whole project. Prior to this step, we need to create one or more scripts to web-scrape the necessary .pdf files from the medical journals (e.g. from https://www.nature.com/ejhg/volumes/30).

We would also need to ensure the scraped .pdf files have a consistent naming structure and are stored in an accessible place. This could easily be built into the webscraping script, but an extra step to check is never a bad thing. 

Then, we would need to embed the steps from this .ipynb into a loop that 
* imports a .pdf from the designated folder, 
* converts it to a pdfReader object,
* counts the pages in that object, 
* creates page objects from each page, 
* extracts the text from each page, 
* appends that text to a saved string object with an appropriate name based off the original file name, 
* closes the .pdf file object and any other relevant objects, and
* proceeds to the next .pdf. 
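
The loop described above could be sketched roughly as follows. This is an outline under stated assumptions, not a finished script: the function names are made up, `pdf_to_text` uses the same older PyPDF2 names as the rest of this notebook, and the texts are collected in a dictionary keyed by the original file name (without the `.pdf` ending) rather than in separately named string variables.

```python
from pathlib import Path


def pdf_to_text(pdf_path):
    """Extract all text from one .pdf, using the older PyPDF2 names from this notebook."""
    import PyPDF2  # imported here so the rest of the sketch can run without it

    with open(pdf_path, "rb") as pdf_file:          # closed automatically afterwards
        reader = PyPDF2.PdfFileReader(pdf_file)
        pages = [reader.getPage(i).extractText() for i in range(reader.numPages)]
    return "\n".join(pages)


def convert_folder(pdf_folder, extract=pdf_to_text):
    """Return {file name without .pdf: extracted text} for every .pdf in a folder."""
    texts = {}
    for pdf_path in sorted(Path(pdf_folder).glob("*.pdf")):
        texts[pdf_path.stem] = extract(pdf_path)    # e.g. 'paper_01.pdf' -> key 'paper_01'
    return texts
```

Passing the extraction function in as the `extract` argument is a design choice that makes it easy to swap in a different converter later, or to try the loop out without any real .pdf files.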

After that, we would proceed to the actual text-mining steps (cleaning the text, stemming or lemmatising it, extracting the relevant person-first and identity-first language and any contexts that seem relevant, etc.).
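
To give a flavour of the extraction step, here is a very rough sketch that finds phrases matching a pattern and returns each one with some surrounding text. Everything in it is an assumption for illustration: the two patterns and the words in them are hypothetical examples, and the real term lists would need to be decided for the project.

```python
import re

# Hypothetical example patterns, for illustration only
PERSON_FIRST = r"\b(?:person|people|child|children|adults?)\s+with\s+\w+"
IDENTITY_FIRST = r"\b(?:autistic|disabled|diabetic)\s+(?:person|people|child|children|adults?)\b"


def find_with_context(text, pattern, width=30):
    """Return (match, surrounding text) pairs for every match of pattern."""
    results = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(m.start() - width, 0)          # do not run off the start of the text
        end = min(m.end() + width, len(text))      # or off the end
        results.append((m.group(0), text[start:end]))
    return results


sample = "We interviewed children with autism and autistic adults about school."
print(find_with_context(sample, PERSON_FIRST))
print(find_with_context(sample, IDENTITY_FIRST))
```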

