# Converting .pdf files to text files

# 1 - Install and import necessary things

Start off by installing the required packages (if you don't already have them) and then importing them. 

In [6]:
%%capture

# installing necessary pdf conversion packages via pip
# the '%%capture' at the top of this cell suppresses the output (which is normally quite long and annoying looking). 
# You can remove or comment it out if you prefer to see the output. 

!pip install PyPDF2
!pip install pdfplumber

In [7]:
# importing required modules 
import os                        
from IPython.display import HTML, display
import csv
import datetime
import pdfplumber
import pandas as pd
import regex
import re

date = datetime.date.today()

# 2 - Define the conversion function 

This bit of code does a fair bit. It goes through each file in the folder given as the first argument and, for each one, builds a short name by removing the '.pdf' suffix and the 'ESHG' bit of the filename. It then opens the file, goes through its pages one at a time, extracts the text from each page and appends it to a single string. Finally, it collapses all the whitespace in that string to single spaces and writes it out to a new .txt file (in the folder given as the second argument) under the short name. 

In [3]:
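def convert_pdfs(input_dir, output_dir):
    # Convert every PDF in input_dir to a .txt file with a shortened name in output_dir.
    for filename in os.listdir(input_dir):
        if not filename.lower().endswith(".pdf"):
            continue                                  # skip anything that isn't a PDF
        # Short output name: drop the '.pdf' suffix and the 'ESHG' prefix
        name = filename.replace(".pdf", "").replace("ESHG", "")
        pdf_contents = ""
        with pdfplumber.open(os.path.join(input_dir, filename)) as pdf:
            for page in pdf.pages:
                # Guard against pages that yield no text (e.g. no text layer)
                text = page.extract_text() or ""
                # The space stops the last word of one page fusing with the first word of the next
                pdf_contents = pdf_contents + " " + text
        # Collapse all runs of whitespace (including line breaks) into single spaces
        pdf_contents = " ".join(pdf_contents.split())
        # The 'with' block closes the file for us, so no explicit close() is needed
        with open(os.path.join(output_dir, name + ".txt"), "w", newline='', encoding='utf-8') as f:
            f.write(pdf_contents)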
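
One step in the function that's easy to miss is the whitespace collapse done by `" ".join(pdf_contents.split())`. Here is a small sketch of what it does, using a made-up snippet of text (not taken from any of the real files). Note that it removes line breaks too, so each resulting .txt file ends up as one long line.

```python
# A made-up snippet standing in for text extracted from a PDF.
sample = "ESHG Plenary Lectures\n\nL01.   Multiple Sulfatase\tDeficiency\n"

# split() with no arguments splits on any run of whitespace (spaces, tabs, newlines),
# and " ".join(...) glues the pieces back together with single spaces.
collapsed = " ".join(sample.split())

print(collapsed)
print("\n" in collapsed)  # False: no line breaks are left
```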
                

## Run on Test folder

Let's test it on the Test folder first to make sure it all goes to plan. It will remove the '.pdf' from the names, but looking for and replacing 'ESHG' will have no effect because none of the test file names contain it. 

Be sure to look in the output_texts folder, open and inspect a file or two to make sure it worked as expected. 
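
If you'd rather not check by eye alone, something like the following could list any text files that are missing. This is only a sketch: `missing_texts` is a hypothetical helper (not part of the notebook), and it assumes the output names follow the same '.pdf' and 'ESHG' removal rules as the conversion function.

```python
import os

def missing_texts(input_dir, output_dir, prefix="ESHG"):
    """Return the .txt names we would expect in output_dir but that are not there."""
    expected = {
        name.replace(".pdf", "").replace(prefix, "") + ".txt"
        for name in os.listdir(input_dir)
        if name.lower().endswith(".pdf")
    }
    present = set(os.listdir(output_dir))
    return sorted(expected - present)

# Example call (paths as used in this notebook):
# missing_texts(r'..\input_pdfs\Test', r'..\output_texts\Test')
```

An empty list means every PDF has a matching text file. It says nothing about whether the text inside is any good, so opening a file or two is still worth doing.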


In [5]:
os.listdir(r"..\input_pdfs\Test") # This is how to see the contents of a folder, here the Test folder inside 'input_pdfs'.
                                  # The r prefix makes it a raw string, so the backslashes in the path are taken literally.

['input_pdf_1.pdf', 'input_pdf_2.pdf', 'input_pdf_3.pdf']

In [6]:
convert_pdfs(r'..\input_pdfs\Test', r'..\output_texts\Test')

## Run on folder of interest 

Now let's do the same for the target folder. This time, the name shortening lines in the function will take full effect. 
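
To see what the name shortening does before committing to the long run, here is a small sketch that applies the same two replacements to a few file names. `short_name` is a hypothetical helper written just for this illustration.

```python
def short_name(filename):
    # Same two replacements as in convert_pdfs: drop '.pdf', then drop 'ESHG'.
    return filename.replace(".pdf", "").replace("ESHG", "")

for filename in ["input_pdf_1.pdf", "ESHG2004Abstracts.pdf", "ESHG2018 EMPAG.pdf"]:
    print(filename, "->", short_name(filename) + ".txt")

# input_pdf_1.pdf -> input_pdf_1.txt
# ESHG2004Abstracts.pdf -> 2004Abstracts.txt
# ESHG2018 EMPAG.pdf -> 2018 EMPAG.txt
```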

In [8]:
os.listdir(r"..\input_pdfs\ESHG") 

['ESHG2001abstractICHG.pdf',
 'ESHG2002Abstracts.pdf',
 'ESHG2003Abstracts.pdf',
 'ESHG2004Abstracts.pdf',
 'ESHG2005Abstracts.pdf',
 'ESHG2006Abstracts.pdf',
 'ESHG2007Abstracts.pdf',
 'ESHG2008Abstracts.pdf',
 'ESHG2009Abstracts.pdf',
 'ESHG2010Abstracts.pdf',
 'ESHG2011Abstracts.pdf',
 'ESHG2012Abstracts.pdf',
 'ESHG2013Abstracts.pdf',
 'ESHG2014Abstracts.pdf',
 'ESHG2015Abstracts.pdf',
 'ESHG2016Abstracts.pdf',
 'ESHG2017 electronic posters.pdf',
 'ESHG2017 oral presentations.pdf',
 'ESHG2017 posters.pdf',
 'ESHG2018 electronic posters.pdf',
 'ESHG2018 EMPAG.pdf',
 'ESHG2018 oral presentation.pdf',
 'ESHG2018 posters.pdf',
 'ESHG2019 oral presentation.pdf',
 'ESHG2019 posters.pdf',
 'ESHG2019 posters2.pdf',
 'ESHG2020 eposters.pdf',
 'ESHG2020 interactive eposter.pdf',
 'ESHG2020 oral presentation.pdf',
 'ESHG2021 eposters.pdf',
 'ESHG2021 oral presentations.pdf']

In [None]:
convert_pdfs(r'..\input_pdfs\ESHG', r'..\output_texts\ESHG')     # This takes a long time to run. At least it did for me.
                                                                 # Go do something else for a while or 
                                                                 # do this last thing before you go in the evening. 

If you remember, the 2004 file threw up an error when we were splitting it to chop off the irrelevant bits. 

It looks fine when viewed in a .pdf reader, but running the preliminary clean and analysis notebooks showed that it was not behaving like the other files. I double checked it using the code in the convert_pdf_single notebook and saw that it started off looking normal but quickly went very squiffy. It seems that particular .txt file is full of squiffy text, with '(cid:NN)' codes where the characters should be, due to some weird encoding in the PDF. 

As such, we need to do a bit of work to recode the 2004 file, giving it a unique name to distinguish it from the file converted by the previous code. 

In [None]:
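def cidToChar(cidx):
    # Turn a '(cid:NN)' token into the character with code point NN + 29
    return chr(int(re.findall(r'\(cid\:(\d+)\)',cidx)[0]) + 29)

with open('..\\output_texts\\ESHG\\2004Abstracts.txt', "r", newline='', encoding='utf-8') as file:
    # Open the output file once, outside the loop, so that later lines
    # don't overwrite earlier ones
    with open('..\\output_texts\\ESHG\\2004.txt', "w", newline='', encoding='utf-8') as f:
        for item in file:
            # Swap every cid token on this line for its decoded character
            for cid in re.findall(r'\(cid\:\d+\)',item):
                item = item.replace(cid, cidToChar(cid))
            # Write the text as-is; wrapping it in repr() would add stray quotes and backslashes
            f.write(item)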
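
To get a feel for what the recoding does, here is a small sketch run on a made-up line rather than the real file. `decode_cids` is a hypothetical helper, and the offset of 29 is simply the one used in the cell above. It seems to suit this particular file, but there's no reason to assume it would suit any other.

```python
import re

def decode_cids(text, offset=29):
    # Replace every '(cid:NN)' token with the character at code point NN + offset.
    return re.sub(r"\(cid:(\d+)\)", lambda m: chr(int(m.group(1)) + offset), text)

sample = "(cid:40)(cid:54)(cid:43)(cid:42) Plenary Lectures"  # made-up example line
print(decode_cids(sample))  # ESHG Plenary Lectures
```

Text without any cid tokens passes through unchanged, so running this over a file that is only partly squiffy shouldn't damage the readable parts.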

Aaaaaaaaaaaaand, let's just have a look at the new (and uniquely named) recoded file to see what it looks like. At this point, you may want to save the original (and squiffy) file somewhere else and only continue working with the better encoded 2004.txt file. Or you can just delete the original one as you could always get it back again by re-running the conversion process on all the files (or just on the one file). 
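
Alongside eyeballing the file, a quick count of any leftover '(cid:NN)' tokens can confirm the recoding caught everything. This is a sketch with a hypothetical helper, `count_cid_tokens`, shown here on made-up strings. Once the file contents have been read into a string, the same function can be applied to that.

```python
import re

def count_cid_tokens(text):
    # Count the '(cid:NN)' tokens still present in a piece of text.
    return len(re.findall(r"\(cid:\d+\)", text))

print(count_cid_tokens("(cid:40)(cid:54)(cid:43)(cid:42) Plenary Lectures"))  # 4 tokens still to recode
print(count_cid_tokens("ESHG Plenary Lectures"))                              # 0, nothing left to recode
```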

In [11]:
with open('..\\output_texts\\ESHG\\2004.txt','r', encoding='utf-8') as f:
    contents = f.read()
    print(contents)

ESHG Plenary Lectures 57 Abstracts such phenotypes can be recognized most easily in the integument. There are two main types of mosaicism: epigenetic or genomic mosaicism. Recent research indicates that both X-linked and L01. Multiple Sulfatase Deﬁ ciency: Molecular defect and properties of the autosomal forms of epigenetic mosaicism can be caused by missing enzyme. retrotransposon activity. K. von Figura, M. Mariappan, J. Peng, A. Preußer, B. Schmidt; X-linked epigenetic mosaicism: Different patterns of lyonization Biochemie II, Georg August Universität Göttingen, Göttingen, Germany. include Blaschko lines (many syndromes), checkerboard pattern (X- Based on puriﬁ cation and peptide sequencing of the missing linked hypertrichosis), and lateralization (CHILD syndrome). enzyme (1) and complementation cloning using minicell mediated Autosomal epigenetic mosaicism: This concept may explain the chromosome transfer (2) the gene defective in multiple sulfatase exceptional familial aggregation