# Converting .pdf files to text files

## Install and import necessary things

Start off by installing the PyPDF2 module (if you don't already have it installed) and importing that module so that it can be used in this notebook. 

In [4]:
# installing necessary pdf conversion package via pip
!pip install PyPDF2             
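
If you would rather not re-run the install every time, one option is to check first whether Python can find the module. The following is a minimal sketch using only the standard library; the helper name `is_installed` is my own and is not part of PyPDF2 or pip.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# only a missing module needs the pip install cell
if is_installed("PyPDF2"):
    print("PyPDF2 is already installed")
else:
    print("PyPDF2 not found - run the pip install cell")
```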





In [5]:
# importing required modules (os and csv from the standard library, IPython.display for displaying screenshots, PyPDF2 for converting .pdfs)
import os
from IPython.display import HTML, display
import PyPDF2 
import csv

## Create and/or check the input .pdf

I first created two test files in Word and then printed each of them to .pdf. I deliberately included a few features to test how the text would be converted: a heading, multiple blank lines between text, an image with a caption, multiple pages, and two-column text.

I then saved these new .pdfs (and screenshots of both) into folders next to the one holding my .ipynb (`..\images` and `..\input_pdfs\Test`) so that I can
* paste in the images to show how the .pdfs looked to begin with (see next cell), and
* import the .pdfs so that PyPDF2 can convert them (see cell after next).

In [7]:
# forward slashes avoid backslashes being read as escape sequences in the Python string
display(HTML("<table><tr><td><img src='../images/Input_pdf_image_1.png'></td><td><img src='../images/Input_pdf_image_2.png'></td></tr></table>"))
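
If there are more screenshots to show, the table HTML can be built from a list of paths rather than typed by hand. This is a rough sketch: `image_table_html` is a hypothetical helper of mine, and it writes the paths with forward slashes, which browsers also accept on Windows.

```python
from pathlib import PureWindowsPath


def image_table_html(image_paths):
    """Build a one-row HTML table with one <img> cell per path."""
    cells = []
    for path in image_paths:
        # PureWindowsPath accepts either slash style; as_posix() writes forward slashes
        src = PureWindowsPath(path).as_posix()
        cells.append(f"<td><img src='{src}'></td>")
    return "<table><tr>" + "".join(cells) + "</tr></table>"


html = image_table_html([r"..\images\Input_pdf_image_1.png",
                         r"..\images\Input_pdf_image_2.png"])
print(html)
```

In the notebook, the resulting string would be passed on as `display(HTML(html))`.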

In [6]:
# creating .pdf file objects from existing .pdfs (paths are relative to this .ipynb)
# this imports the .pdfs and creates accessible objects for the PyPDF2 module to work with
# raw strings (r'...') stop the Windows backslashes from being read as escape sequences

pdfFileObj_1 = open(r'..\input_pdfs\Test\input_pdf_1.pdf', 'rb')
pdfFileObj_2 = open(r'..\input_pdfs\Test\input_pdf_2.pdf', 'rb')
eshg = open(r'..\input_pdfs\ESHG\ESHG2001abstractICHG.pdf', 'rb')

print(eshg)

<_io.BufferedReader name='..\\input_pdfs\\ESHG\\ESHG2001abstractICHG.pdf'>


## Convert and check the .pdf objects

The first step is to create a PyPDF2 reader object from the imported .pdf. This allows you to do things like:
* check how many pages are in the original .pdf,
* convert some or all of those pages to page objects, and
* then extract the text from those page objects (optionally saving the text for later analysis).

Of course, good coding etiquette suggests you should always close any opened files when you are done with them. 
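
One way to make sure a file is always closed is to open it in a `with` block, which closes it automatically, even if an error occurs part-way through. The sketch below uses a throwaway temporary file (not one of the real .pdfs) purely so that it runs anywhere; the file name and its contents are made up.

```python
import os
import tempfile

# create a small stand-in file so the example runs without the real .pdfs
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "stand_in.pdf")
with open(tmp_path, "wb") as f:
    f.write(b"%PDF-1.4 stand-in bytes, not a real document")

# the file is only open inside the with block
with open(tmp_path, "rb") as pdf_file:
    first_bytes = pdf_file.read(8)
    print(pdf_file.closed)   # False while inside the block

print(pdf_file.closed)       # True once the block has ended
```

As far as I know, a PyPDF2 reader keeps reading from the open file, so the page and text extraction steps would need to happen inside the `with` block.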

In [7]:
# creating .pdf reader objects, which the module will use for the actual conversion work
pdfReader_1 = PyPDF2.PdfFileReader(pdfFileObj_1) 
pdfReader_2 = PyPDF2.PdfFileReader(pdfFileObj_2) 
eshg_reader = PyPDF2.PdfFileReader(eshg) 
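
One thing to be aware of: `PdfFileReader`, `numPages`, `getPage` and `extractText` are the older PyPDF2 names, and PyPDF2 3.0 removed them in favour of `PdfReader`, `len(reader.pages)`, `reader.pages[i]` and `extract_text()`. If the cells in this notebook fail on a newer install, small wrappers like the ones below might help. This is only a sketch: the helper names (`page_count`, `page_text`) are my own, and the stand-in classes are made up so that the example runs without a real .pdf.

```python
def page_count(reader):
    """Number of pages, trying the newer attribute first."""
    if hasattr(reader, "pages"):
        return len(reader.pages)
    return reader.numPages


def page_text(page):
    """Extracted text, trying the newer method name first."""
    if hasattr(page, "extract_text"):
        return page.extract_text()
    return page.extractText()


# made-up stand-ins that mimic only the names used above
class OldStylePage:
    def extractText(self):
        return "old-style text"


class NewStylePage:
    def extract_text(self):
        return "new-style text"


class OldStyleReader:
    numPages = 2


class NewStyleReader:
    pages = [NewStylePage(), NewStylePage(), NewStylePage()]


print(page_count(OldStyleReader()), page_count(NewStyleReader()))
print(page_text(OldStylePage()), "|", page_text(NewStylePage()))
```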

In [9]:
# printing number of pages in each .pdf file 
print(pdfReader_1.numPages) 
print(pdfReader_2.numPages) 
print(eshg_reader.numPages)

1
2
485


So far so good. We have the .pdfs imported and converted to module-specific objects, and the module has correctly identified the number of pages in each. But what we really need to know is how well the module can recognise the text within those objects, because I deliberately made that text a bit tricky.

## Get individual pages, extract text, save it as strings, etc. 

Now we get down to the real work. We want to
* convert some or all of those pages to page objects,
* extract the text from those page objects,
* (optionally) save that text as string objects for later analysis, and
* tidy up (good coding etiquette suggests you should always close any opened files when you are done with them).


In [8]:
# creating page objects for each page
# note: page numbering starts at 0, so getPage(0) returns page 1 of our first test .pdf;
#       to get both pages of the second test .pdf, we ask for getPage(0) and getPage(1)
pageObj_1_1 = pdfReader_1.getPage(0) 
pageObj_2_1 = pdfReader_2.getPage(0)
pageObj_2_2 = pdfReader_2.getPage(1) 
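
Naming one variable per page works for one or two pages, but it would not scale to a 485-page .pdf like the ESHG one. A loop over the page numbers does the same job. In this sketch `StandInReader` and `StandInPage` are made-up classes that mimic just the `numPages`, `getPage` and `extractText` names used in this notebook, so the example runs without a real .pdf; with a real reader object only the (also hypothetical) `extract_all_pages` function would be needed.

```python
class StandInPage:
    """Made-up page object holding some text."""
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class StandInReader:
    """Made-up reader exposing numPages and getPage like the notebook's readers."""
    def __init__(self, texts):
        self._pages = [StandInPage(t) for t in texts]
        self.numPages = len(self._pages)

    def getPage(self, page_number):
        return self._pages[page_number]


def extract_all_pages(reader):
    """Return a list with the extracted text of every page, in page order."""
    texts = []
    for page_number in range(reader.numPages):   # 0, 1, ..., numPages - 1
        page = reader.getPage(page_number)
        texts.append(page.extractText())
    return texts


reader = StandInReader(["text of page 1", "text of page 2"])
all_text = extract_all_pages(reader)
print(len(all_text))
print(all_text[0])
```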

In [12]:
# creating page objects for the first two pages of the ESHG .pdf
# (same zero-based numbering: getPage(0) is page 1 and getPage(1) is page 2)
eshg_reader_1_0 = eshg_reader.getPage(0) 
eshg_reader_1_1 = eshg_reader.getPage(1) 


In [11]:
# extracting text from page to print on screen
print(eshg_reader_1_0.extractText()) 

Lost your way mapping the genome? Let the 
Nature Publishing Group guide you to the latest advances.
Natureis the perfect medium to give you the latest
science news and results every week.  In Nature’s
Articles and Letters sections you’ll find important orig-
inal research such as the recent human genome project map and sequence.  
With conceptual advances in areas such as genomics and post-genomics and
developments in other disciplines which affect all areas of science,Natureis
your essential resource.Renowned for quality, global coverage,
Nature Geneticsassures you access to significant
genetics research.  News and Views articles, Brief
Communications, Letters and Reviews cover topics
across the field, including the genetic basis of 
disease, genomic analysis, epigenetics and chromosome biology.  Available 
full-text online, with web specials such as the Microarray site,Nature Genetics
is consistently cited more often than any other journal in the field.
Nature Reviews Geneticsis one

In [13]:
# extracting text from page to print on screen
print(eshg_reader_1_1.extractText()) 

www.agilent.com/chem/dna
©2001, Agilent Technologies Inc. Ago-4220WW
Resolver is a trademark of Rosetta Inpharmatics, Inc.Now you can design microarrays around experiments,
rather than adapting research around microarrays.
Agilent provides a complete system for fast, flexible
gene expression analysis–probe design services, 
in-situ oligonucleotide and cDNA microarrays,
reagents, protocols, automated scanner, and Rosetta
Resolver
tm
expression data analysis system. So you
can go where your research leads you.
Dreams made real.A Flexible New Solution

For gene expression analysis
 Demand greater BIOACTIVITY in your plasmid preparations
Comparative Purification and In vivo Data
   - Example of a high copy number plasmid purification


1% Agarose GelPristineDNA
      Double CsCl
            1 kb ladder

Features:
1.  Fast, low endotoxin, high yield
2.  Proprietary EndoSep
TM 
Suspension/Lysis
3.  Affinity binding fast flow column
4.  NaOH cleanable (10x)
 Benefits:
1.  Easy to use and affo

In [11]:
# extracting text from page to print on screen
print(pageObj_2_2.extractText()) 

which don't look even slightly believable. If 

you are going to use  a passage  of  Lorem 

Ipsum,  you  need  to  be  sure  there  isn't 

anything  embarrassing  hidden  in  the 

middle  of  text.  All  the  Lorem  Ipsum 

generators on the Internet tend to repeat 

predefined  chunks  as  necessary,  making 

this the first true generator on the Internet. 

It uses a dictionary of over 200 Latin words, 

combined  with  a  handful  of  model 

sentence  structures,  to  generate  Lorem 

Ipsum  which  looks  reasonable.  The 

generated  Lorem  Ipsum  is  therefore 

always  free  from  repetition,  injected 

humour, or non-characteristic words etc. 


So, good news. The module has recognised that input_pdf_2 is laid out in two columns and has converted the text appropriately (ish): the text flows from the end of one line in a column to the start of the next line *in the same column*, rather than reading across to the equivalent line in the next column.

It might be better if the lines were not broken at the same points as the narrow columns in the .pdf, but that is a step for later on.
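
As a first idea for that later step, the short lines could be joined back into running text by collapsing every run of whitespace (including line breaks) into a single space. This is a minimal sketch of that one approach, not something PyPDF2 provides; `join_lines` is a name I made up, and it would also merge separate paragraphs, which may or may not be wanted.

```python
def join_lines(text):
    """Collapse all runs of whitespace, including newlines, into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# a short sample in the style of the extracted two-column text
sample = "you are going to use  a passage  of  Lorem \n\nIpsum,  you  need  to  be  sure  there  isn't \n"
print(join_lines(sample))
```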


In [12]:
# extracting text from page to save for later use, in this case as a string object
test_file_1 = pageObj_1_1.extractText()
type(test_file_1)

str
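
Since the aim is to end up with text files, a string like this can then be written out with plain Python. A small sketch, assuming UTF-8 is a suitable encoding; the output file name and the sample string are placeholders of mine, and a temporary folder is used so the example does not touch the project folders.

```python
import os
import tempfile

# placeholder for a string such as the one extracted from a page
extracted_text = "Heading\n\nSome text extracted from a page."

# hypothetical output location; in the notebook this could be any folder you like
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "test_file_1.txt")

# explicit encoding so non-ASCII characters are written consistently
with open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write(extracted_text)

# read the file back to confirm the round trip
with open(out_path, encoding="utf-8") as in_file:
    saved_text = in_file.read()

print(saved_text == extracted_text)
```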