Closed
Description
The from_file method of inference.layout.DocumentLayout is broken.
First, the load_pdf call is duplicated, and I'd guess it is the hot path, so the PDF gets parsed twice.
Second, the loop breaks for me because len(images) != len(layouts). We loop over the layouts and then look up each image by the layout's index, but for me the two lists have different lengths. In fact len(images) == 0. Is this because some dependency is missing and failing silently?
Is len(images) supposed to equal len(layouts)? I'm running it on the official sample-docs/loremipsum.pdf. Platform is macOS.
layout = DocumentLayout.from_file("sample-docs/loremipsum.pdf")
unstructured-inference/unstructured_inference/inference/layout.py
Lines 74 to 91 in 6d2205b
    def from_file(cls, filename: str, model: Optional[Detectron2LayoutModel] = None):
        """Creates a DocumentLayout from a pdf file."""
        # NOTE(alan): For now the model is a Detectron2LayoutModel but in the future it should
        # be an abstract class that supports some standard interface and can accomodate either
        # a locally instantiated model or an API. Maybe even just a callable that accepts an
        # image and returns a dict, or something.
        logger.info(f"Reading PDF for file: {filename} ...")
        layouts, images = load_pdf(filename, load_images=True)
        layouts, images = load_pdf(filename, load_images=True)
        pages: List[PageLayout] = list()
        for i, layout in enumerate(layouts):
            image = images[i]
            # NOTE(robinson) - In the future, maybe we detect the page number and default
            # to the index if it is not detected
            page = PageLayout(number=i, image=image, layout=layout, model=model)
            page.get_elements()
            pages.append(page)
        return cls.from_pages(pages)
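For discussion, here is a rough sketch of how the loop could load the PDF once and fail with a clear message when the counts disagree, rather than raising an IndexError on images[i]. This is not the library's code. The helper name build_pages and its load_pdf and make_page parameters are hypothetical stand-ins, passed in so the sketch runs without unstructured-inference installed. make_page stands in for the PageLayout(number=..., image=..., layout=..., model=model) constructor, and the page.get_elements() call is left out.

```python
from typing import Any, Callable, List, Sequence, Tuple


def build_pages(
    filename: str,
    load_pdf: Callable[..., Tuple[Sequence[Any], Sequence[Any]]],
    make_page: Callable[..., Any],
) -> List[Any]:
    """Load the PDF once, check that the counts agree, then build one page per layout.

    `load_pdf` and `make_page` are hypothetical stand-ins for the library's
    load_pdf function and PageLayout constructor.
    """
    # Single call: the second, identical load_pdf call from the snippet is dropped.
    layouts, images = load_pdf(filename, load_images=True)

    # Fail loudly here instead of hitting an IndexError on images[i] in the loop.
    if len(layouts) != len(images):
        raise ValueError(
            f"{filename}: got {len(layouts)} layouts but {len(images)} images; "
            "image extraction may have failed"
        )

    pages: List[Any] = []
    for i, (layout, image) in enumerate(zip(layouts, images)):
        pages.append(make_page(number=i, image=image, layout=layout))
    return pages
```

With a check like this, my case (one or more layouts, zero images) would surface as a ValueError naming both counts, which at least points at the image-loading step rather than at the loop.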