# Automation of PDF analytics   

I often need to process PDF documents in bulk and extract information from them for analysis.
Sometimes they are exports from some software; sometimes it is a collection of documents that has accumulated over time.

Often there are hundreds or thousands of documents that need to be:
- separated by document type:
    - When you work on a migration or a process change, you often end up with a "clean up the documents" task for a folder
      or some other place where everyone has been dumping files. The file name, the size, or even the page count is not reliable,
      but documents often have a header that can be extracted, and from it you can infer the document type.
- renamed according to a naming convention:
    - Sometimes you have hundreds of files like NDA.pdf, NDA1.pdf, NDA (1).pdf, NDA-1999.pdf, NDA-Google.pdf.
      You need to verify that each one is the right kind of file and, if it contains for example a customer ID number or a date,
      extract that information and rename the file to something like NDA-1234586-20240807.pdf. If the file is not an NDA,
      rename it to "Unknown - %old name%.pdf".
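
The naming convention above can be sketched as a small pure function. This is only an illustration: `build_name` and its parameters are hypothetical names, and leaving out a missing customer ID or date (instead of rejecting the file) is my assumption, not a rule from any real project.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def build_name(old_name: str, doc_type: Optional[str],
               customer_id: Optional[str] = None,
               doc_date: Optional[date] = None) -> str:
    """Return the new file name for a document (hypothetical helper)."""
    if doc_type != "NDA":
        # Not an NDA: keep the old name (without extension) behind "Unknown - "
        return f"Unknown - {Path(old_name).stem}.pdf"
    parts = ["NDA"]
    if customer_id:
        parts.append(customer_id)
    if doc_date:
        parts.append(f"{doc_date:%Y%m%d}")  # e.g. 20240807
    return "-".join(parts) + ".pdf"


print(build_name("NDA (1).pdf", "NDA", "1234586", date(2024, 8, 7)))
print(build_name("scan0042.pdf", None))
```

Keeping the rule in one function makes it easy to do a dry run over the whole folder before touching any file.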

[pymupdf](https://pypi.org/project/PyMuPDF/) has a really nice feature that is rarely described: it can convert a PDF to JSON!
After that you can work with the file as structured data and manipulate everything from code!

Let's get a PDF! In this case we will generate one from a URL using another great tool for converting HTML to PDF, [pdfkit](https://pypi.org/project/pdfkit/),
and its built-in function ".from_url":

In [2]:
#!/usr/bin/env python3

import pdfkit #pip install pdfkit; also needs the wkhtmltopdf binary: apt-get install wkhtmltopdf
pdfkit.from_url('https://en.wikipedia.org/wiki/Main_Page', 'out.pdf')

True

So, we have a PDF with some fields in it that might be interesting for us:
1. we want to make sure that it is actually the Wikipedia home page, based on the "Welcome to Wikipedia" text
2. we want to know the date of this extraction (based on the date mentioned on the second page of the file)
3. we want to know how many articles Wikipedia had on that day

![image](Code_FAFCREvtn9.png)
![image](Code_ukgsW7FH9s.png)

Now that we have a PDF, let's see how it looks in JSON format.

For that, let's use pymupdf and [.get_text](https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_text), which accepts "json" as its output option!

In this example, I take only the first page (doc[0]) and print the formatted JSON:

In [3]:
#!/usr/bin/python3

import pymupdf #pip install pymupdf
import json

doc = pymupdf.open("out.pdf") 
jpage = json.loads(doc[0].get_text("json"))

print(json.dumps(jpage, indent=4))

{
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {
            "number": 0,
            "type": 1,
            "bbox": [
                20.69948959350586,
                144.0253143310547,
                90.43045043945312,
                237.9604034423828
            ],
            "width": 121,
            "height": 163,
            "ext": "jpeg",
            "colorspace": 3,
            "xres": 96,
            "yres": 96,
            "bpc": 8,
            "transform": [
                69.73095703125,
                0.0,
                -0.0,
                93.93508911132812,
                20.69948959350586,
                144.0253143310547
            ],
            "size": 7184,
            "image": "/9j/4AAQSkZJRgABAQEAZABkAAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYHBwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcIDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAARCACjAHkDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAA

When run as a script, the beginning of the output looks like this:

        $ ./parse_pdf.py | head
        {
            "width": 595.0,
            "height": 842.0,
            "blocks": [
                {
                    "number": 0,
                    "type": 1,
                    "bbox": [
                        20.69948959350586,
                        144.0253143310547,
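
Blocks with "type": 1 are images and carry a long base64 string, while blocks with "type": 0 contain text. To keep the output readable, the image blocks can be filtered out before printing. The sketch below works on a hand-made `sample_page` dictionary that only mimics the structure shown above; `text_blocks` and `block_text` are hypothetical helper names.

```python
def text_blocks(page: dict) -> list:
    """Keep only text blocks (type 0) and drop image blocks (type 1)."""
    return [b for b in page.get("blocks", []) if b.get("type") == 0]


def block_text(block: dict) -> str:
    """Join all spans of every line in a text block into one string."""
    return "\n".join(
        "".join(span["text"] for span in line["spans"])
        for line in block["lines"]
    )


# Hand-made stand-in for json.loads(doc[0].get_text("json"))
sample_page = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {"number": 0, "type": 1, "ext": "jpeg", "image": "/9j/4AAQ"},
        {"number": 1, "type": 0,
         "lines": [{"spans": [{"text": "Welcome to Wikipedia"}]}]},
    ],
}

for block in text_blocks(sample_page):
    print(block["number"], repr(block_text(block)))
```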

Now let's find "Welcome to Wikipedia" in it and make sure that it's there:

        $ ./parse_pdf.py | less -S
        ...
            {
            "number": 9,
            "type": 0,
            "bbox": [
                200.50161743164062,
                52.73360824584961,
                393.5578308105469,
                70.3363037109375
            ],
            "lines": [
                {
                    "spans": [
                        {
                            "size": 14.983510971069336,
                            "flags": 20,
                            "font": "DejaVuSerif-Bold",
                            "color": 0,
                            "ascender": 0.93896484375,
                            "descender": -0.23583984375,
                            "text": "Welcome to Wikipedia",
                            "origin": [
                                200.50161743164062,
                                66.80259704589844
                            ],
        ...

So, we found that text in block 9, in its first line and first span. You can verify it with jq or any other tool:

        $ ./parse_pdf.py | jq '.blocks[9].lines[0].spans[0].text'
        "Welcome to Wikipedia"
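
Instead of scrolling through the JSON, the position of a piece of text can also be found programmatically. Here is a minimal sketch, assuming the page dictionary has the blocks/lines/spans layout shown above; `find_text` is a hypothetical helper and `sample_page` is hand-made test data, not real output.

```python
def find_text(page: dict, needle: str):
    """Yield (block, line, span) indices of every span containing needle."""
    for i_block, block in enumerate(page.get("blocks", [])):
        # Image blocks have no "lines" key, so default to an empty list
        for i_line, line in enumerate(block.get("lines", [])):
            for i_span, span in enumerate(line.get("spans", [])):
                if needle in span.get("text", ""):
                    yield i_block, i_line, i_span


# Hand-made stand-in for json.loads(doc[0].get_text("json"))
sample_page = {
    "blocks": [
        {"number": 0, "type": 1, "image": "/9j/4AAQ"},
        {"number": 1, "type": 0,
         "lines": [{"spans": [{"text": "Welcome to Wikipedia"}]}]},
        {"number": 2, "type": 0,
         "lines": [{"spans": [{"text": "the free encyclopedia"},
                              {"text": "6,864,586 articles in English"}]}]},
    ],
}

print(list(find_text(sample_page, "Welcome to Wikipedia")))
print(list(find_text(sample_page, "articles in English")))
```

The indices it returns can be used directly wherever a block, line, and span number are needed.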

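Or you can do the same directly in Python, together with the other elements we need. Block indices depend on the particular render of the page; in the notebook run below, the heading sits at index 8 rather than 9: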

In [11]:
print(jpage["blocks"][8]["lines"][0]["spans"][0]["text"])
#Welcome to Wikipedia

print(jpage["blocks"][10]["lines"][0]["spans"][0]["text"])
#6,864,586 articles in English

print(jpage["blocks"][17]["lines"][0]["spans"][0]["text"])
#August 8

Welcome to Wikipedia
6,864,586 articles in English
August 8


Very often PDFs share the same structure (blocks -> lines -> spans -> text), so I have a small function to deal with it,
which also cleans up the returned text:

In [13]:
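def get_value_by_key(page: dict, block:int, line:int, span:int) -> str:
    """Return the cleaned text of one span, or "" if that span does not exist."""
    try:
        #replace non-breaking spaces and trim surrounding whitespace
        return page["blocks"][block]["lines"][line]["spans"][span]["text"].replace('\xa0', ' ').strip()
    except (KeyError, IndexError, TypeError):
        return ""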

With it, the same code looks like this:

In [12]:
is_wiki =           get_value_by_key(jpage,8,0,0) #Welcome to Wikipedia
count_of_articles = get_value_by_key(jpage,10,0,0) #6,864,586 articles in English
date_of_extract =   get_value_by_key(jpage,17,0,0) #August 8

print(f"{is_wiki=} {count_of_articles=} {date_of_extract=}")

is_wiki='Welcome to Wikipedia' count_of_articles='6,864,586 articles in English' date_of_extract='August 8'


Now let's add some data cleanup, logic, and formatting:

In [25]:
import datetime

is_wiki="yes" if get_value_by_key(jpage,8,0,0) == "Welcome to Wikipedia" else "no"
count_of_articles=get_value_by_key(jpage,10,0,0).split()[0].replace(',', '') #6,864,586 articles in English => 6864586
date_of_extract=datetime.datetime.strptime(get_value_by_key(jpage,17,0,0), "%B %d").replace(year=2024)  #"August 8" => 2024-08-08

print(f"{is_wiki=} {count_of_articles=} {date_of_extract=}")

is_wiki='yes' count_of_articles='6864586' date_of_extract=datetime.datetime(2024, 8, 8, 0, 0)
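
The one-liners above raise an exception when a value is missing, because the helper then returns an empty string. A slightly more defensive variant could look like the sketch below. `parse_article_count` and `parse_month_day` are hypothetical helpers; passing the year explicitly and returning `None` on failure are my assumptions, and `%B` expects English month names under the default locale.

```python
import re
from datetime import date, datetime
from typing import Optional


def parse_article_count(text: str) -> Optional[int]:
    """'6,864,586 articles in English' -> 6864586, or None if no number."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


def parse_month_day(text: str, year: int) -> Optional[date]:
    """'August 8' plus a year -> date(year, 8, 8), or None if unparsable."""
    try:
        # Parsing the year together with the day lets "February 29"
        # work in leap years
        return datetime.strptime(f"{text} {year}", "%B %d %Y").date()
    except ValueError:
        return None


print(parse_article_count("6,864,586 articles in English"))
print(parse_month_day("August 8", 2024))
print(parse_month_day("", 2024))
```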


And let's apply it to all PDF files in the directory and collect the data in a DataFrame so we can work with it later:

In [26]:
#!/usr/bin/python3

import datetime
import glob
import json

import pandas as pd
import pymupdf #pip install pymupdf
from sorcery import dict_of #pip install sorcery

def get_value_by_key(page: dict, block:int, line:int, span:int) -> str:
    """Return the cleaned text of one span, or "" if that span does not exist."""
    try:
        return page["blocks"][block]["lines"][line]["spans"][span]["text"].replace('\xa0', ' ').strip()
    except (KeyError, IndexError, TypeError):
        return ""

def extracts_info_from_pdf(file) -> dict:
    out={}

    #only the first page is needed; the document is closed when the block ends
    with pymupdf.open(file) as doc:
        jpage = json.loads(doc[0].get_text("json"))

    is_wiki="yes" if get_value_by_key(jpage,8,0,0) == "Welcome to Wikipedia" else "no"
    count_of_articles=get_value_by_key(jpage,10,0,0).split()[0].replace(',', '') #6,864,586 articles in English => 6864586
    date_of_extract=datetime.datetime.strptime(get_value_by_key(jpage,17,0,0), "%B %d").replace(year=2024)  #"August 8" => 2024-08-08

    #dict_of builds {"is_wiki": is_wiki, ...} from the variable names
    out.update(dict_of(is_wiki, 
                       count_of_articles, 
                       date_of_extract,
                       file))

    return out    

if __name__ == "__main__":
    path = "*.pdf" #PDF files in the current directory only
    extracts=[] 
    for file in glob.glob(path): #recursive=True would only matter with a "**" pattern
        extracts.append(extracts_info_from_pdf(file))

    df_extracts = pd.DataFrame(extracts)
    print(df_extracts)
    #  is_wiki count_of_articles date_of_extract    file
    #0     yes           6864586      2024-08-08 out.pdf
    
    #optionally to save into CSV: df_extracts.to_csv("extracts.csv")

  is_wiki count_of_articles date_of_extract         file
0     yes           6864586      2024-08-08  out-old.pdf
1     yes           6864586      2024-08-08      out.pdf
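
With thousands of files, a single unexpected document (for example one without a date, where the parsing raises an exception) would stop the whole loop. One possible way around that, sketched here with hypothetical names (`extract_all`, `fake_extractor`) and a stand-in extractor instead of the real PDF function, is to record the error per file and keep going:

```python
from typing import Callable, Iterable


def extract_all(files: Iterable[str],
                extractor: Callable[[str], dict]) -> list:
    """Run extractor on every file; store failures instead of raising."""
    rows = []
    for file in files:
        try:
            row = {**extractor(file), "error": ""}
        except Exception as exc:  # keep the batch alive, remember the reason
            row = {"file": file, "error": f"{type(exc).__name__}: {exc}"}
        rows.append(row)
    return rows


def fake_extractor(file: str) -> dict:
    """Stand-in for the real PDF extractor, used only for this example."""
    if "broken" in file:
        raise ValueError("no date found")
    return {"file": file, "is_wiki": "yes"}


rows = extract_all(["out.pdf", "broken.pdf"], fake_extractor)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to pd.DataFrame in the same way as before, and the "error" column shows which files need a manual look.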
