### Convert PDF to text

Source: https://filingdb.com/b/pdf-text-extraction

At its core, the PDF format consists of a stream of instructions describing how to draw on a page. In particular, text data isn’t stored as paragraphs, or even words, but as characters painted at certain locations on the page. As a result, most of the content semantics are lost when a text or Word document is converted to PDF: all the implied text structure becomes an almost amorphous soup of characters floating on pages.
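
To make the "soup of characters" concrete, here is a minimal sketch of how positioned glyphs could be regrouped into lines. The `glyphs` records and the `glyphs_to_lines` helper are hypothetical illustrations; this is not how pdftotext, pdfminer or Tika work internally.

```python
from collections import defaultdict

# Hypothetical glyph records: (x, y, character), in arbitrary drawing order.
glyphs = [
    (30, 700, "l"), (10, 700, "T"), (20, 700, "e"),
    (10, 680, "1"), (20, 680, "1"), (30, 680, "2"),
    (40, 700, "."),
]

def glyphs_to_lines(glyphs, y_tolerance=2):
    """Group glyphs into lines by y coordinate, then order each line by x."""
    rows = defaultdict(list)
    for x, y, ch in glyphs:
        # Snap y to a coarse grid so slightly different baselines share a row.
        rows[round(y / y_tolerance)].append((x, ch))
    # PDF y coordinates grow upwards, so larger y means higher on the page.
    return ["".join(ch for _, ch in sorted(row))
            for _, row in sorted(rows.items(), reverse=True)]

print(glyphs_to_lines(glyphs))
```

Grouping by position like this is also where multi-column layouts can go wrong: glyphs from two columns that share a baseline end up in the same row.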

In [6]:
import pdftotext
import time

pdftotext.PDF returns an iterable in which each element is the text of one PDF page. Joining the pages and removing the zero-width characters takes ~0.03 s.

In [71]:
start = time.time()
with open('phone_numbers/Infoblatt_154.pdf','rb') as pdf_file:
    pdf = pdftotext.PDF(pdf_file) # returns an iterable 
text = '\n\n'.join(pdf).replace('\u200b','')
stop = time.time()
print(stop-start)

0.02790975570678711


Loading the text as a single string with pdfminer takes ~0.3 s.

In [44]:
from pdfminer.high_level import extract_text
import time

In [72]:
start = time.time()
text_miner = extract_text('phone_numbers/Infoblatt_154.pdf').replace('\u200b','')
stop = time.time()
print(stop-start)

0.26982736587524414


Apache Tika works on .pdf files, the most recent OOXML Microsoft Office file types, and older binary formats such as .doc, .ppt and .xls. It returns the extracted text in ~0.16 s.

In [49]:
from tika import parser

In [73]:
start = time.time()
text_tika = parser.from_file('phone_numbers/Infoblatt_154.pdf')['content'].replace('\u200b','')
stop = time.time()
print(stop-start)

0.16096258163452148
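
The three timings above are single measurements, so they may include one-off startup costs. A small helper makes it easier to repeat them. This is a sketch: `timed` is a hypothetical name, and the lambda is only a stand-in for one of the extraction calls.

```python
import time

def timed(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would be one of the extraction calls.
result, seconds = timed(lambda s: s.replace('\u200b', ''), 'a\u200bb')
print(result, f"{seconds:.6f} s")
```

`time.perf_counter` is used here instead of `time.time` because it is intended for measuring short durations.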


### Compare resulting texts

In [65]:
import difflib

In [117]:
sequence_matcher = difflib.SequenceMatcher(isjunk = lambda x: x=="\n", a= text_miner, b=text_tika)

In [118]:
sequence_matcher.ratio()

0.8634219554030875

In [115]:
sequence_matcher = difflib.SequenceMatcher(isjunk = lambda x: x=="\n", a= text, b=text_tika)

In [116]:
sequence_matcher.ratio()

0.39246276972951066

In [113]:
sequence_matcher = difflib.SequenceMatcher(isjunk = lambda x: x=="\n", a= text, b=text_miner)

In [114]:
sequence_matcher.ratio()

0.36456764085567944
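
For intuition about these ratios, here is a minimal sketch on short strings. `similarity` is a hypothetical wrapper around the same `SequenceMatcher` call used above. `ratio()` returns 2*M/T, where M is the number of matched characters and T the combined length of both inputs.

```python
import difflib

def similarity(a, b):
    """Similarity in [0, 1]; newlines are treated as junk, as in the cells above."""
    matcher = difflib.SequenceMatcher(isjunk=lambda ch: ch == "\n", a=a, b=b)
    return matcher.ratio()

# "bcd" is the longest common block: 2 * 3 / (4 + 4) = 0.75
print(similarity("abcd", "bcde"))
```

Note that for sequences of 200 or more elements `SequenceMatcher` also applies its default `autojunk` heuristic, which may influence the ratios on full document texts.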

In [277]:
text_tika = text_tika.encode().replace(b'\xe2\x80\x93',b'-').replace(b'\xef\xbf\xbd',b'Tel.').decode()
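
The byte sequences in the cell above are the UTF-8 encodings of the en dash (U+2013) and the Unicode replacement character (U+FFFD). The same cleanup can be sketched on `str` directly; `normalise` is a hypothetical helper, not part of the notebook.

```python
# The byte sequences are UTF-8 encodings of two code points.
assert '\u2013'.encode() == b'\xe2\x80\x93'   # EN DASH
assert '\ufffd'.encode() == b'\xef\xbf\xbd'   # REPLACEMENT CHARACTER

def normalise(text, replacement=' '):
    """Strip zero-width spaces, map en dashes to '-', U+FFFD to `replacement`."""
    table = str.maketrans({'\u200b': None, '\u2013': '-', '\ufffd': replacement})
    return text.translate(table)

print(normalise('56\u200b \u2013 38629'))
```

Passing `replacement='Tel.'` would mirror the Tika-specific substitution used in the cell above.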

Choice: pdftotext

Reason: pdftotext sometimes loses the reading order of the text. For example, in "von 7.00 bis 8.00 Uhr sowie von[                   Wohnungsübergabe.\n] 14.30 bis 15.00" the bracketed fragment comes from a neighbouring column and should not appear inside the sentence. Tika does not have this problem, i.e. it deals better with texts that are divided into multiple columns. Tika, however, does not encode the numbers correctly; 070219_KV_BR_Telefonverzei.pdf contains many examples of this.

### Extracting phone numbers

In [148]:
import re

In [631]:
numbers_string = r"(\d+ ?(-|/)? ?)"  # ({0})+(\s+(und|oder)\s+({0}))
telefon_string = "telefonnummer|tel|telephone|telefon"

In [632]:
classical_telephone_number = r"\+[(]{0,1}[0-9]{1,4}[)]{0,1}[-\s\./0-9]*"

In [633]:
other_telephone_numbers = "(({0}) ({0})+)".format(numbers_string)

In [636]:
TEL_NUMBERS_REGES = re.compile(
    r'((({1})\.?:? ?(({0})+)((und|oder) +({0})+)?)|({2})|({3}))'.format(
        numbers_string, telefon_string, classical_telephone_number, other_telephone_numbers),
    re.IGNORECASE)

In [637]:
[i[0].strip() for i in TEL_NUMBERS_REGES.findall(text_tika)]

['Tel. 19292',
 'Tel. 110',
 'Tel. 112',
 'Tel. 56-7272',
 'Tel. 56-7161',
 'Tel. 181122',
 'Tel. 37433',
 'Tel. 56 - 38629',
 'Tel. 56 - 37591',
 '56 - 1987']
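
The combined pattern is hard to read, so here is a simplified sketch of just its labelled branch, written with `re.VERBOSE`. It is not equivalent to `TEL_NUMBERS_REGES`; the names `LABELLED_NUMBER` and `find_labelled_numbers` and the sample sentence are illustrative assumptions.

```python
import re

LABELLED_NUMBER = re.compile(r"""
    \btel(?:efon(?:nummer)?|ephone)?   # label: tel, telefon, telefonnummer, telephone
    \.?:?\s*                           # optional dot, optional colon, optional whitespace
    (\d+(?:\s?[-/]\s?\d+)*)            # digit groups joined by a hyphen or a slash
""", re.IGNORECASE | re.VERBOSE)

def find_labelled_numbers(text):
    """Return the numbers that directly follow a telephone label."""
    return LABELLED_NUMBER.findall(text)

sample = "Notruf Tel. 112, Zentrale Tel. 56 - 38629 oder Telefon: 56-7272"
print(find_labelled_numbers(sample))
```

Unlike the notebook pattern, this sketch returns only the number without its label and ignores unlabelled numbers and the "und"/"oder" continuation.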

In [639]:
import os

In [818]:
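def find_telephone_numbers(file_path):
    # Extract the text of every page and join the pages into one string
    with open(file_path, 'rb') as f:
        text = '\n\n'.join(pdftotext.PDF(f))
    # Remove zero-width spaces, turn en dashes (U+2013) into hyphens and
    # replacement characters (U+FFFD) into spaces
    text = (text.replace('\u200b', '')
                .replace('\u2013', '-')
                .replace('\ufffd', ' '))
    # NOTE: `file` is the global loop variable of the calling cell, not file_path;
    # outside that loop this prints whatever name was processed last
    print(file)
    return [i[0].strip() for i in TEL_NUMBERS_REGES.findall(text)]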

In [820]:
numbers = []
with open("tel_nums.txt", 'w') as wf:
    for file in os.listdir('phone_numbers/'):
        # Tika alternative: parser.from_file(os.path.join('phone_numbers', file))['content']
        telephone_numbers = find_telephone_numbers(os.path.join('phone_numbers', file))
        numbers.append(telephone_numbers)
        wf.write(file + '\n\n')
        wf.write('\n'.join(telephone_numbers) + '\n')

Infoblatt_154.pdf
52_pdf_wegweiser_schlaganfall.pdf
FL_SYB_BetriebsaerztlicherDienst_ID8414.pdf
070219_KV_BR_Telefonverzei.pdf


In [641]:
text = parser.from_file(os.path.join('phone_numbers','070219_KV_BR_Telefonverzei.pdf'))['content']\
                       .replace('\u200b','')\
                       .encode()\
                       .replace(b'\xe2\x80\x93',b'-')\
                       .replace(b'\xef\xbf\xbd',b' ')\
                       .decode()

### Unit conversion

Convert all values given in milliliters to liters using only functions of the Python regex module.

In [651]:
with open('unit_conversion/si.txt','r') as f:
    data = f.readlines()

In [810]:
# Harder test data: more digits after the decimal point, plus single- and double-digit values to test the zero padding
data = ['1337 ml\n',
 '2,500 ml\n',
 '1 milliliters\n',
 '12 milliliters\n',
 '26 l\n',
 '18,421.902 ml\n',
 '8321ml\n',
 '32 m\n']

Find the milliliter values: the first group captures the number (with an optional thousands comma and decimal point), the second group the unit.

In [811]:
ML_REGEX = re.compile(r"(\d+,?\d*\.?\d*)\s*(ml|milliliters)")
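
A quick way to see what the two groups capture is to run the pattern on a few sample lines. This is a self-contained sketch: the pattern is repeated here as a raw string, and the sample lines are taken from the test data above.

```python
import re

ML_REGEX = re.compile(r"(\d+,?\d*\.?\d*)\s*(ml|milliliters)")

for line in ['18,421.902 ml\n', '8321ml\n', '1 milliliters\n', '26 l\n']:
    match = ML_REGEX.search(line)
    # groups() is (number, unit); lines in other units do not match at all
    print(repr(line), '->', match.groups() if match else None)
```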

Find the ending: the first inner group holds the last (up to three) digits before the decimal point, together with an optional preceding comma; the second inner group holds the decimal point and all digits after it.

In [812]:
LAST_LETTERS = re.compile(r"((,?\d{1,3})(\.\d*)?$)")  # last 3 digits before the decimal point + everything after it
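
The following sketch shows which tail of each number this pattern selects and how that tail becomes the new fractional part. `liters_fraction` is a hypothetical helper that mirrors the padding step of the conversion loop; the pattern is repeated so the block runs on its own.

```python
import re

LAST_LETTERS = re.compile(r"((,?\d{1,3})(\.\d*)?$)")

def liters_fraction(value):
    """Return the fractional part that replaces the matched tail of `value`."""
    tail = LAST_LETTERS.search(value)[0]
    padding = '0' * (3 - len(tail))          # a negative count yields ''
    return '.' + padding + re.sub(r'[,.]', '', tail)

for value in ['1337', '2,500', '12', '18,421.902']:
    print(value, '->', repr(LAST_LETTERS.search(value)[0]), '->', liters_fraction(value))
```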

Convert the data: if fewer than three digits precede the decimal point, pad with zeros so the conversion stays correct. Digits beyond the third decimal place of the result are truncated.

In [813]:
TRUNCATE_DIGITS = re.compile(r"(\d{1,3})(\d*)$")

In [814]:
for i in data:
    number = ML_REGEX.findall(i)
    if len(number) == 0:
        print("No milliliters: ", i)
        continue
    print('Converting: ', number[0][0])
    found = LAST_LETTERS.search(number[0][0])
    # Move the decimal point three places to the left, zero-padding short values
    new_floating_point = '.' + '0' * (3 - len(found.groups()[0])) + re.sub(r'[,.]', '', found[0])
    # re.escape keeps the '.' in the matched tail from acting as a wildcard
    converted = re.sub(re.escape(found[0]) + '$', new_floating_point, number[0][0])
    # Digits beyond the third decimal place, if any, are cut off below
    truncate = TRUNCATE_DIGITS.search(converted).groups()[1]
    print('last: ', truncate)
    if truncate != '':
        print('Before truncation: ', converted)
        converted = re.sub(truncate + '$', '', converted)

    print(converted)

Converting:  1337
last:  
1.337
Converting:  2,500
last:  
2.500
Converting:  1
last:  
.001
Converting:  12
last:  
.012
No milliliters:  26 l

Converting:  18,421.902
last:  902
Before truncation:  18.421902
18.421
Converting:  8321
last:  
8.321
No milliliters:  32 m
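
As a cross-check of the results above, the same values can be converted arithmetically. This sketch deliberately steps outside the regex-only constraint of the exercise and assumes that the comma is a thousands separator, as the test data suggest; `ml_to_liters` is a hypothetical helper.

```python
from decimal import Decimal

def ml_to_liters(value):
    """Parse strings such as '18,421.902' (comma = thousands separator) and divide by 1000."""
    return Decimal(value.replace(',', '')) / Decimal(1000)

for value in ['1337', '2,500', '1', '12', '18,421.902', '8321']:
    print(value, 'ml =', ml_to_liters(value), 'l')
```

The only value where the two approaches differ is 18,421.902 ml: the exact result is 18.421902 l, while the regex version truncates it to 18.421.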



### Testing algorithm on scans

Surprisingly, pdftotext actually finds some of the numbers. The double_ocr.pdf file even yields some decent-looking numbers. This is because PDF (unlike the OpenXML format) stores data not as text or images but by "describing" how the elements should be rendered, so the difference between converting a text written in docx and converting a scan shouldn't be extremely big.

In [821]:
find_telephone_numbers('scans/single_ocr.pdf')

070219_KV_BR_Telefonverzei.pdf


['364 1733',
 '13 31',
 '322 60',
 '0178895 72 64',
 '38 69',
 '3 26 20',
 '30 74',
 '1 2',
 '30 76',
 '30 80 00',
 '2 56 81',
 '3 75 74',
 '3 20 73',
 '39 70',
 '3 36',
 '30 7545',
 '0171281 97 36',
 '30 88 66',
 '333 76',
 '19 31814',
 '5 16',
 '39 47-0',
 '3 34',
 '39 75 66',
 '3 2',
 '30 97-0',
 '015158 778269',
 '333 15',
 '332 14',
 '30 2',
 '0181 22008708',
 '39 26',
 '3 6',
 '3 35 70',
 '3 1',
 '325 43',
 '39 43',
 '372 34',
 '39 7497',
 '323 20',
 '5 33923',
 '372 94',
 '34 81',
 '8 2',
 '380 84',
 '345 61',
 '348 73',
 '2 0',
 '0157 73565090',
 '3075 38',
 '0177598 7162',
 '33 19',
 '28 70 81',
 '331 46',
 '2 2',
 '22 20 5',
 '28 95 65',
 '255 31 19',
 '232 80 74',
 '0176 232327 74',
 '86 72 35',
 '30 83 13',
 '340 85',
 '44 86',
 '255 3048',
 '318 97',
 '3 36 43',
 '3 7 46',
 '30 76 98',
 '30 80 39']

In [822]:
find_telephone_numbers('scans/double_ocr.pdf')

070219_KV_BR_Telefonverzei.pdf


['0 62 21',
 '36 4 33 45-0',
 '19 33836 70',
 '0 62 21 / 585 09 30',
 '585 09 40',
 '16 2828',
 '97 99-0',
 '2 72 26',
 '33 85',
 '5 69123',
 '62 21',
 '98 05-0',
 '60 72-0',
 '60 72 17',
 '16 30 61',
 '97 82-0',
 '1 38 36 16',
 '6 57 70-0',
 '41 49-0',
 '138 36-20',
 '8 93 61 60',
 '062 21',
 '60 43-0',
 '0 62 21',
 '60 43 60',
 '0 62 21',
 '14 07 14',
 '062 21',
 '97 37-0',
 '98 13-0',
 '062 02',
 '859 43-0',
 '53 70-0',
 '91 29-0',
 '0 62 21',
 '4 18 55 58',
 '91 4050',
 'Tel. 0 62 21 /50 25 95-95',
 '40 90 26',
 '98 12-0',
 '2 32 54/2',
 '47 82',
 '1922   834014',
 '14 14-0',
 '36 37 30',
 '36 37 333',
 '2 23 33',
 '2 9108']
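
Both scan results contain many short fragments such as '1 2' or '3 6'. One possible post-filter, sketched here under the assumption that a plausible number has at least five digits, drops them. The threshold and the `plausible` helper are hypothetical, and such a filter would also discard genuine short numbers like 110 or 112.

```python
import re

def plausible(candidates, min_digits=5):
    """Keep candidates containing at least `min_digits` digits (hypothetical threshold)."""
    return [c for c in candidates if len(re.findall(r'\d', c)) >= min_digits]

# A few candidates taken from the single_ocr.pdf output above
print(plausible(['364 1733', '1 2', '0178895 72 64', '3 6', '39 47-0']))
```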