from_file broken, len(images) != len(layout) #51

@hyperknot

The DocumentLayout.from_file method in inference.layout is broken.

First, the load_pdf call is duplicated, and I guess it is probably the hot path.

Second, the loop breaks for me because len(images) != len(layouts). The loop iterates over the layouts and loads each image by the layout's index, but for me the two lists have different lengths. Actually len(images) == 0. Is this because I have some dependency missing which is failing silently?

Is len(images) supposed to equal len(layouts)? I'm running it on the official loremipsum.pdf as well. Platform is macOS.

layout = DocumentLayout.from_file("sample-docs/loremipsum.pdf")

def from_file(cls, filename: str, model: Optional[Detectron2LayoutModel] = None):
    """Creates a DocumentLayout from a pdf file."""
    # NOTE(alan): For now the model is a Detectron2LayoutModel but in the future it should
    # be an abstract class that supports some standard interface and can accomodate either
    # a locally instantiated model or an API. Maybe even just a callable that accepts an
    # image and returns a dict, or something.
    logger.info(f"Reading PDF for file: {filename} ...")
    layouts, images = load_pdf(filename, load_images=True)
    layouts, images = load_pdf(filename, load_images=True)
    pages: List[PageLayout] = list()
    for i, layout in enumerate(layouts):
        image = images[i]
        # NOTE(robinson) - In the future, maybe we detect the page number and default
        # to the index if it is not detected
        page = PageLayout(number=i, image=image, layout=layout, model=model)
        page.get_elements()
        pages.append(page)
    return cls.from_pages(pages)
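
For illustration, here is a minimal sketch of a guard that would report the mismatch up front instead of failing at images[i] inside the loop. It assumes load_pdf is meant to return two parallel lists with one entry per page, which is exactly what I'm unsure about. The helper name pair_layouts_with_images is hypothetical and not part of the library.

```python
from typing import List, Sequence, Tuple, TypeVar

LayoutT = TypeVar("LayoutT")
ImageT = TypeVar("ImageT")


def pair_layouts_with_images(
    layouts: Sequence[LayoutT], images: Sequence[ImageT]
) -> List[Tuple[int, LayoutT, ImageT]]:
    """Hypothetical helper: pair each page layout with its image by page index.

    Assumes one image per layout; raises ValueError if the lengths differ.
    """
    if len(layouts) != len(images):
        raise ValueError(
            f"Expected one image per page layout, got "
            f"{len(layouts)} layouts and {len(images)} images"
        )
    # zip keeps the two sequences in lockstep; enumerate supplies the page index.
    return [(i, layout, image) for i, (layout, image) in enumerate(zip(layouts, images))]
```

With something like this, the loop in from_file could iterate over the returned (i, layout, image) tuples, and an empty images list would produce a clear error message naming both lengths.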
