# Running the 'data to document' step of the pdf2text pipeline

This notebook converts a folder of intermediate outputs (folders produced by Adobe Extract, or XML files produced by pdfalto) into `Document` objects, then saves each one as a `.json` and a `.txt` file in a specified output folder.

It's left as a notebook for now so that pipeline elements can still be developed experimentally, without committing prematurely to how they are implemented in the product pipeline.

In [2]:
import sys
sys.path.append("..")

from pathlib import Path
from typing import List

from tqdm.auto import tqdm

from extract.extract import DocumentEmbeddedTextExtractor, AdobeAPIExtractor
from extract.document import Document
from extract.utils import get_md5_hash


In [3]:
# prototype pdfs
PDF_FOLDER = Path("../../../data/cclw-en-pdf-docs")
INTERMEDIATE_FOLDER = Path("../../../data/pdf2text/intermediate-final/")
OUTPUT_FOLDER = Path("../../../data/pdf2text/pipeline-output-md5/")

# pdfs loaded since prototype (these assignments override the prototype paths above)
PDF_FOLDER = Path("../../../data/_new_pdfs/en_for_adobe/done_fixed/")
INTERMEDIATE_FOLDER = Path("../../../data/_new_pdfs/en_for_adobe/intermediate_fixed/")
OUTPUT_FOLDER = Path("../../../data/_new_pdfs/en_for_adobe/output_md5/")

# pdfalto
PDFALTO_PATH = Path("../../../misc/pdfalto/pdfalto")


In [70]:
embedded_extractor = DocumentEmbeddedTextExtractor(pdfalto_path=PDFALTO_PATH)
adobe_extractor = AdobeAPIExtractor(credentials_path=".")

## Fixing intermediate and done folders
This step was needed because the naming convention for files processed since the prototype included underscores and file type extensions. It shouldn't need to be run again, but it is left here just in case.

## Run data-to-document

In [73]:
# TODO: it could be useful to restructure the intermediate directory so each PDF parsed has its own folder, rather than many folders or XML files
# To do this we'd have to modify pdf2text, but this is probably better done after the merge.
# For now we identify folders and files belonging to each PDF using the method below

def group_intermediate_dir_by_pdf():
    """
    This assumes a flat structure for the intermediate dir, containing both Adobe and pdfalto outputs.
    It groups related files or folders in the directory by the stem of their PDF filename.
    
    E.g. directory structure: 
    ```
    - cclw-1055-bf17ca3b41b943fe83f0bd5c5ff36823_0
        - structuredData.json
    - cclw-1055-bf17ca3b41b943fe83f0bd5c5ff36823_1
        - structuredData.json
        - tables/
    - cclw-1055-bf17ca3b41b943fe83f0bd5c5ff36823_2
        - structuredData.json
    - cclw-8482-7a59b4bc5d7841cd9d8a0010215c97ec_metadata.xml
    - cclw-8482-7a59b4bc5d7841cd9d8a0010215c97ec_outline.xml
    - cclw-8482-7a59b4bc5d7841cd9d8a0010215c97ec.xml
    ```
    
    output:
    ```
    {
        "cclw-1055-bf17ca3b41b943fe83f0bd5c5ff36823": [
            "cclw-1055-bf17ca3b41b943fe83f0bd5c5ff36823_0",
            "cclw-1055-bf17ca3b41b943fe83f0bd5c5ff36823_1",
            "cclw-1055-bf17ca3b41b943fe83f0bd5c5ff36823_2",
        ],
        "cclw-8482-7a59b4bc5d7841cd9d8a0010215c97ec": [
            "cclw-8482-7a59b4bc5d7841cd9d8a0010215c97ec_metadata.xml",
            "cclw-8482-7a59b4bc5d7841cd9d8a0010215c97ec_outline.xml",
            "cclw-8482-7a59b4bc5d7841cd9d8a0010215c97ec.xml"
        ]
    }
    ```
    """
    # List the directory once and reuse the result for every PDF stem
    intermediate_paths = list(INTERMEDIATE_FOLDER.iterdir())
    pdf_stems = {p.stem.split("_")[0] for p in intermediate_paths}

    return {
        pdf_stem: sorted(p for p in intermediate_paths if p.name.startswith(pdf_stem))
        for pdf_stem in pdf_stems
    }

pdf_intermediate_mapping = group_intermediate_dir_by_pdf()
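
One thing worth checking about the grouping above: `sorted` compares paths lexicographically, so if any PDF was ever split into more than ten Adobe chunks, `..._10` would sort before `..._2`. Whether that occurs in this dataset isn't known; the sketch below only illustrates the effect and one possible fix. `chunk_index` is a hypothetical helper, not part of `extract`, and the shortened folder names are made up for the example.

```python
from pathlib import Path


def chunk_index(path: Path) -> int:
    """Return the integer after the final underscore in a name, or -1 if there is none."""
    _, sep, suffix = path.name.rpartition("_")
    return int(suffix) if sep and suffix.isdigit() else -1


# Hypothetical chunk folders for a single PDF
folders = [Path(f"cclw-1055-bf17_{i}") for i in (0, 1, 2, 10)]

# Default ordering compares the names character by character
lexicographic = [p.name for p in sorted(folders)]

# Ordering by the numeric suffix keeps chunk 10 after chunk 2
numeric = [p.name for p in sorted(folders, key=chunk_index)]
```

If chunk counts never reach ten, the plain `sorted` call is already sufficient.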

In [74]:
def parse_adobe_folders(folders: List[Path], pdf_filename: str) -> Document:
    """Parse list of adobe folders into one Document object."""
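    pages = []
    # The PDF path is the same for every chunk, so compute it once up front.
    # This also keeps it defined for the md5 hash below when `folders` is empty.
    pdf_path = PDF_FOLDER / f"{pdf_filename}.pdf"
    # Folders are sorted here to ensure the correct order in parsing
    json_paths = [p / "structuredData.json" for p in sorted(folders)]
    curr_page_offset = 0
    
    for _path in json_paths: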
        temp_doc = adobe_extractor.data_to_document(
            data_path=_path, 
            pdf_path=pdf_path,
            page_offset=curr_page_offset,
        )

        pages += temp_doc.pages
        if pages:
            curr_page_offset = pages[-1].page_id + 1
        
    return Document(
        pages=pages,
        filename=pdf_filename,
        md5hash=get_md5_hash(pdf_path)
    )

# ----------------------------

pdf_document_objects = dict()

for pdf_stem, related_paths in tqdm(pdf_intermediate_mapping.items()):
    try:
        if all([p.is_dir() for p in related_paths]):
            document = parse_adobe_folders(related_paths, pdf_stem)
            pdf_document_objects[pdf_stem] = document

        elif valid_paths := [p for p in related_paths if p.name == f"{pdf_stem}.xml"]:
            # Finding the correctly named XML file could also mean that folders are 
            # present, but these are from Adobe failures
            if len(valid_paths) == 1:
                pdf_path = PDF_FOLDER/f"{pdf_stem}.pdf"
                document = embedded_extractor.data_to_document(data_path=valid_paths[0], pdf_path=pdf_path)
                pdf_document_objects[pdf_stem] = document
            else:
                print(f"Too many paths for {pdf_stem}")
        else:
            # TODO: handle adobe split failures which have fallen back to embedded text extractor
            print(pdf_stem, "?")

    except Exception as e:
        print(f"Failed for {pdf_stem}: {e}")

Failed for 2601-National Climate Change Act 2021: [Errno 2] No such file or directory: '../../../data/_new_pdfs/en_for_adobe/intermediate_changed/2601-National Climate Change Act 2021/structuredData.json'
Failed for 2774-Decision No: [Errno 2] No such file or directory: '../../../data/_new_pdfs/en_for_adobe/done_changed/2774-Decision No.pdf'
Failed for 2188-Second National Biodiversity Strategy and Action Plan 2017-2026: [Errno 2] No such file or directory: '../../../data/_new_pdfs/en_for_adobe/intermediate_changed/2188-Second National Biodiversity Strategy and Action Plan 2017-2026/structuredData.json'
2719-Infrastructure Investment and Jobs Act ?
Failed for 1976-National Security Policy 2017-2022 and National Security Strategy: [Errno 2] No such file or directory: '../../../data/_new_pdfs/en_for_adobe/done_changed/1976-National Security Policy 2017-2022 and National Security Strategy.pdf'
.DS ?
100%|██████████| 129/129 [01:20<00:00,  1.60it/s]
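
Two of the messages above look like side effects of how the grouping key is derived rather than genuine extraction failures. `Path.stem` treats everything after the last dot as a suffix, so a folder whose name contains a dot is truncated there (which might explain `2774-Decision No`, though the real folder name isn't shown), and `.DS_Store` becomes `.DS` once split on the underscore. Below is a minimal sketch of a more defensive key, assuming `.xml` is the only extension that ever needs stripping. `intermediate_key` and the example names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def intermediate_key(path: Path) -> Optional[str]:
    """Return a grouping key for an intermediate file or folder, or None to skip it."""
    name = path.name
    if name.startswith("."):
        return None  # hidden files such as .DS_Store are not pipeline outputs
    if name.endswith(".xml"):
        name = name[: -len(".xml")]  # strip only the extension added by pdfalto
    return name.split("_")[0]


# Hypothetical folder name containing a dot, for illustration only
dotted = Path("2774-Decision No. 12_0")

key_from_stem = dotted.stem.split("_")[0]  # truncated at the dot
key_from_helper = intermediate_key(dotted)  # keeps the full name
```

Like the original, this still splits on the first underscore, so it relies on PDF names themselves containing none.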


In [75]:
# Check that no stored Document is empty or None - list should be empty.
# PDFs that failed above were never added to pdf_document_objects, so this check does not cover them.
[k for k,v in pdf_document_objects.items() if not v]

[]
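
To also see which PDFs produced no entry at all, the keys of the two dictionaries can be compared. This is a sketch only: `find_missing_stems` is a hypothetical helper, and the toy dictionaries stand in for `pdf_intermediate_mapping` and `pdf_document_objects`.

```python
from typing import Iterable, List, Mapping


def find_missing_stems(expected: Iterable[str], produced: Mapping[str, object]) -> List[str]:
    """Return the stems that have no entry, or only an empty entry, in `produced`."""
    return sorted(stem for stem in expected if not produced.get(stem))


# Toy stand-ins: three stems were grouped, but only one yielded a usable document
toy_mapping = {"doc-a": ["doc-a_0"], "doc-b": ["doc-b.xml"], "doc-c": ["doc-c_0"]}
toy_documents = {"doc-a": object(), "doc-c": None}

missing = find_missing_stems(toy_mapping, toy_documents)
```

In the notebook this would be called as `find_missing_stems(pdf_intermediate_mapping, pdf_document_objects)`.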

In [76]:
# Serialise results to JSON and txt
for pdf_stem, document in tqdm(pdf_document_objects.items()):
    document.save_json(OUTPUT_FOLDER / f"{pdf_stem}.json")
    document.save_text(OUTPUT_FOLDER / f"{pdf_stem}.txt")

100%|██████████| 123/123 [00:15<00:00,  7.98it/s]
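
Whether `save_json` and `save_text` create missing parent directories isn't shown here, so it may be safer to create `OUTPUT_FOLDER` before the loop. The sketch below uses a temporary directory and a plain `write_text` call in place of the `Document` methods. The folder and file names are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical nested output location that does not exist yet
    output_folder = Path(tmp) / "pdf2text" / "output_md5"

    # parents=True creates intermediate folders; exist_ok=True makes reruns safe
    output_folder.mkdir(parents=True, exist_ok=True)
    output_folder.mkdir(parents=True, exist_ok=True)  # second call is a no-op

    out_path = output_folder / "example.txt"
    out_path.write_text("placeholder text", encoding="utf-8")
    created = out_path.exists()
```

In the notebook the equivalent line would be `OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)` placed just before the serialisation loop.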
