## Exercise 1: Parsing HTML
Answer: No
HTML is strictly hierarchical, so converting the HTML text into an "object-oriented" (tree) representation makes more sense. Such a form is better suited to parsing. Depending on the goal, it might be beneficial to search with regular expressions in a second step. Of course, it always depends on the use case.
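
A minimal sketch of this two-step idea, assuming Python's standard-library `html.parser` as the parser. The `TextCollector` class, the sample markup and the simple digit pattern are made up for illustration and are not part of the exercise:

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text inside <p> elements, including nested inline tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while we are inside a <p> element
        self.paragraphs = []    # one string per <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.paragraphs[-1] += data


# Step 1: let a real parser deal with the hierarchy.
html_doc = "<html><body><p>Call <b>0621 123456</b> today.</p><div>no number</div></body></html>"
collector = TextCollector()
collector.feed(html_doc)
collector.close()

# Step 2: run a regular expression on the extracted text only.
numbers = [re.findall(r"\d[\d ]+\d", paragraph) for paragraph in collector.paragraphs]
print(collector.paragraphs, numbers)
```

The regular expression never sees a tag here, which is the point: the parser handles nesting and the regex only handles flat text.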

## Exercise 2: PDFs

In [138]:
import os
print(os.listdir("scans"))
from difflib import SequenceMatcher

['double_ocr.pdf', 'single_ocr.pdf']


In [139]:
def process_textract(filename):
    import textract
    return textract.process(filename).decode("utf-8")

def process_mymupdf(filename):
    text = ""
    import fitz  # PyMuPDF
    with fitz.open(filename) as doc:
        for page in doc:
            # getText() was renamed to get_text() in newer PyMuPDF releases
            text += page.get_text()
    return text

def process_pdftotext(filename):
    import pdftotext
    with open(filename, "rb") as f:
        pdf = pdftotext.PDF(f)
        pdf = ("\n\n".join(pdf))
        return pdf

def process_tika(filename):  # did not work out of the box
    import tika
    from tika import parser  # the parser submodule has to be imported explicitly
    tika.initVM()
    raw = parser.from_file(filename)
    return raw['content']

fn_list = [process_mymupdf, process_textract, process_pdftotext]

In [140]:
filename = "scans/single_ocr.pdf"
second_filename = "scans/double_ocr.pdf"
results = []
results_secondfile = []
for fn in fn_list:
    results.append(fn(filename))
    results_secondfile.append(fn(second_filename))

### Qualitative analysis
After analysing the output of the three methods we arrived at the following conclusions:
- tika inserts a lot of line breaks between names and addresses.
- textract is similar, but seems to have more problems with uncommon symbols, which results in output like "ItCharleg o;;eansnum ".
- pdftotext works by far the best. The majority of line breaks match the actual document, which makes the output far easier to read.
Names and addresses are usually on the same line, connected by ....

For the double_ocr.pdf file the first method also performs worst. textract results in more intuitive blocks of addresses. However, pdftotext leads to the best overall results.

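### Quantitative analysis

The SequenceMatcher ratios show that, for single_ocr.pdf, method 0 is considerably more similar to both method 1 and method 2 (about 0.41 and 0.44) than methods 1 and 2 are to each other (about 0.23). The indices refer to the order in `fn_list`.

For the double_ocr.pdf file the three ratios are close to each other and much lower (about 0.11 to 0.13), so the pattern seen for single_ocr.pdf does not show up there.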
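
For reference, `SequenceMatcher.ratio()` is defined in the `difflib` documentation as 2·M/T, where M is the number of matched elements and T is the total length of both sequences. The small sketch below checks this on two made-up address strings (the strings are assumptions, not taken from the scans) and shows how much whitespace differences alone can lower the ratio:

```python
from difflib import SequenceMatcher

# Two hypothetical extractions of the same address line.
a = "Main Street 12"
b = "Main  Street\n12"

matcher = SequenceMatcher(None, a, b)
matched = sum(block.size for block in matcher.get_matching_blocks())
ratio = matcher.ratio()
expected = 2 * matched / (len(a) + len(b))  # the 2*M/T definition

# After normalising whitespace the two strings are identical.
normalised_ratio = SequenceMatcher(None, " ".join(a.split()), " ".join(b.split())).ratio()

print(ratio, expected, normalised_ratio)
```

One more point that may matter for texts of this size: for sequences of 200 or more elements `SequenceMatcher` applies its "autojunk" heuristic by default, which ignores very frequent characters when searching for matches. Passing `autojunk=False` disables it, at the cost of a longer run time.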

In [141]:
print("single ocr file")
print(SequenceMatcher(None, results[0], results[1]).ratio())
print(SequenceMatcher(None, results[1], results[2]).ratio())
print(SequenceMatcher(None, results[0], results[2]).ratio())
print()

print("double ocr file")
print(SequenceMatcher(None, results_secondfile[0], results_secondfile[1]).ratio())
print(SequenceMatcher(None, results_secondfile[1], results_secondfile[2]).ratio())
print(SequenceMatcher(None, results_secondfile[0], results_secondfile[2]).ratio())

single ocr file
0.4126882818483533
0.22705626628171305
0.4404858032727452

double ocr file
0.12592163035374038
0.11796194775878749
0.11129923649490464


In [142]:
num_symbols = [len(result) for result in results]
for fn, num in zip(fn_list, num_symbols):
    print(fn.__name__, num)

process_mymupdf 15876
process_textract 15460
process_pdftotext 44807


Counting the number of characters for each method shows that pdftotext outputs roughly three times as many characters as the other two methods.
However, after removing runs of multiple spaces it becomes the shortest output.

Here the methods behave similarly on both files.

In [143]:
num_symbols_filtered = [len(result.replace("  ", "")) for result in results]
for fn, num in zip(fn_list, num_symbols_filtered):
    print(fn.__name__, num)

process_mymupdf 15796
process_textract 15460
process_pdftotext 14971
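
One caveat about the filter used in the cell above: `replace("  ", "")` deletes every pair of spaces instead of collapsing it, so two words separated by exactly two spaces end up glued together. A possible alternative, shown here only as a sketch on a made-up sample line, is to collapse each run of spaces into a single space with `re.sub`:

```python
import re

# Hypothetical line in the style of the pdftotext output.
sample = "Smith, John   ....  12 Main St"

removed = sample.replace("  ", "")          # approach used in the cell above
collapsed = re.sub(r" {2,}", " ", sample)   # alternative: keep one space per run

print(repr(removed))
print(repr(collapsed))
```

With the first variant the dots and the house number are merged into `....12`, while the second variant keeps them separated. The character counts reported above were produced with the first variant.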


It seems that the weaker methods produce many line breaks. Therefore a low number of line breaks might be a useful indicator of higher accuracy.
Here the methods behave similarly on both files.

In [144]:
num_linebreaks = [result.count("\n") for result in results]
for fn, num in zip(fn_list, num_linebreaks):
    print(fn.__name__, num)

process_mymupdf 631
process_textract 956
process_pdftotext 160
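
Raw line-break counts depend on how long the extracted text is. A possible refinement, given here only as a sketch, is to relate the count to the amount of visible text. The function name `linebreak_density` and the two sample strings are assumptions made for illustration:

```python
def linebreak_density(text):
    """Line breaks per 1000 non-whitespace characters (hypothetical metric)."""
    visible = sum(1 for ch in text if not ch.isspace())
    if visible == 0:
        return 0.0
    return 1000 * text.count("\n") / visible


# The same content, once fragmented and once kept on a single line.
fragmented = "Smith,\nJohn\n12\nMain St\n"
compact = "Smith, John 12 Main St\n"

print(linebreak_density(fragmented), linebreak_density(compact))
```

Applied to `results`, this would allow a comparison that is independent of the padding with spaces that pdftotext adds.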


### Summary
pdftotext works best. Its downside is the large number of repeated spaces, so these are removed before the text is saved.
The methods behave similarly on both files.

In [149]:
with open("converted_single_ocr.txt", "w+") as outfile:
    outfile.write(results[2].replace("  ", ""))
with open("converted_double_ocr.txt", "w+") as outfile:
    outfile.write(results_secondfile[2].replace("  ", ""))

## Exercise 3: Why is PDF conversion hard?
- PDF allows a large variety of layouts. Worse still, any given layout can be produced in many different ways. In plain text, layout can only be expressed through spaces, line breaks and tabs, so the layout information has to be translated when converting to plain text. Because the layout information in a PDF is highly ambiguous, this is difficult.
- Even text that looks like a single "block" is often split into many small fragments by the conversion. This indicates that it is not stored as a single block inside the PDF, even though it looks like one. This makes decoding difficult.
- Images and graphics overlapping with text are common in PDFs and hard to decode.
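
The fragment problem can be illustrated with a small sketch. It assumes that a converter only sees positioned text fragments, given here as invented `(x, y, text)` tuples, and has to guess which of them belong on the same line. The coordinates and the `y_tolerance` value are assumptions, and real converters use far more elaborate heuristics:

```python
# Hypothetical fragments in arbitrary storage order: (x, y, text).
fragments = [
    (200.0, 100.2, "12 Main St"),
    (10.0, 100.0, "Smith,"),
    (48.0, 100.1, "John"),
    (10.0, 115.0, "Miller, Ann"),
]


def rebuild_lines(fragments, y_tolerance=2.0):
    """Group fragments with similar y into lines, then order them by x."""
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: f[1]):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(text for _, text in sorted(parts)) for _, parts in lines]


lines = rebuild_lines(fragments)
print(lines)
```

A tolerance that is too small splits one visual line into several, and one that is too large merges neighbouring lines. This is one way to end up with the differing line-break counts observed in Exercise 2.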

## Exercise 4.1: Phone numbers

In [562]:
import glob
import re
phone_files = list(glob.iglob("phone_numbers/*.pdf"))
# raw string, so that \+ and \s reach the regex engine unchanged
candidate = r"(0|\+|Tel.{0,5})([\s+\–]{0,3}[1-9])([0-9\-\s\–]{3,18})"
normalize_regex = "[0-9]+"
plaintexts = []
for file in phone_files:
    plaintexts.append(
        process_pdftotext(file).replace("  ","").replace("\n\n", ""))


In [572]:
total_matches = []
for plaintext in plaintexts:
    matches = re.findall(candidate, plaintext)
    print("found ", len(matches), "phone numbers")
    matches = [''.join(match) for match in matches]
    matches = [re.sub(r"[a-zA-Z+\s\n\:\.\-\–]*", r"", match) for match in matches]
    total_matches.extend(matches)


found  310 phone numbers
found  461 phone numbers
found  6 phone numbers
found  3 phone numbers
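
To make the behaviour of the pattern easier to follow, here is the same candidate expression and normalisation applied to a single made-up line (the sample text is an assumption and does not come from the PDFs):

```python
import re

# Same pattern as in the cell above, written as a raw string.
candidate = r"(0|\+|Tel.{0,5})([\s+\–]{0,3}[1-9])([0-9\-\s\–]{3,18})"

sample = "Tel.: 0621 123-456, Fax +49 621 654321"

raw_matches = ["".join(groups) for groups in re.findall(candidate, sample)]
# Strip letters, '+', whitespace and punctuation so that only digits remain.
normalised = [re.sub(r"[a-zA-Z+\s\n\:\.\-\–]*", "", match) for match in raw_matches]

print(raw_matches)
print(normalised)
```

Note that the normalisation also removes the leading `+`, so an international prefix such as `+49` can no longer be told apart from a national number that starts with the same digits.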


In [571]:
with open("phone_numbers.txt", "w+") as outfile:
    for number in total_matches:
        outfile.write(number + "\n")