from_file broken, len(images) != len(layout)

The from_file method of inference.layout.DocumentLayout is broken.

First, the load_pdf call is duplicated, and I guess that is probably the hot path.

Second, the loop breaks for me because len(images) != len(layouts). We loop through the layouts and load each image by the layout's index, but for me the two lists differ in length. In fact len(images) = 0. Is this because I have a missing dependency that is failing silently?

Is len(images) supposed to equal len(layouts)? I'm running it on the official loremipsum.pdf. Platform is macOS.

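```
from unstructured_inference.inference.layout import DocumentLayout

# Reproduces the failure: the loop inside from_file indexes into an empty images list
layout = DocumentLayout.from_file("sample-docs/loremipsum.pdf")
```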
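
To test the silent-dependency theory, a quick check like the one below might help. It is only a sketch and rests on an assumption that is not confirmed anywhere in this issue: that page images are rendered through pdf2image, which in turn needs poppler's pdftoppm binary on the PATH. The helper name check_image_dependencies is made up for illustration.

```python
import importlib.util
import shutil


def check_image_dependencies():
    """Report whether the assumed image-rendering dependencies are discoverable.

    ASSUMPTION: images come from pdf2image + poppler (pdftoppm). If the library
    renders pages some other way, this check says nothing useful.
    """
    return {
        # True if the pdf2image package can be imported in this environment
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
        # True if poppler's pdftoppm executable is found on the PATH
        "pdftoppm": shutil.which("pdftoppm") is not None,
    }


print(check_image_dependencies())
```

If either value comes back False, that would be consistent with images being empty while layouts is populated, though it would not prove it.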

https://github.com/Unstructured-IO/unstructured-inference/blob/6d2205b3e6fc6aa9b5bc7f4a1839fd66fd40413e/unstructured_inference/inference/layout.py#L74-L91



	def from_file(cls, filename: str, model: Optional[Detectron2LayoutModel] = None):
	    """Creates a DocumentLayout from a pdf file."""
	    # NOTE(alan): For now the model is a Detectron2LayoutModel but in the future it should
	    # be an abstract class that supports some standard interface and can accomodate either
	    # a locally instantiated model or an API. Maybe even just a callable that accepts an
	    # image and returns a dict, or something.
	    logger.info(f"Reading PDF for file: {filename} ...")
	    layouts, images = load_pdf(filename, load_images=True)
	    layouts, images = load_pdf(filename, load_images=True)
	    pages: List[PageLayout] = list()
	    for i, layout in enumerate(layouts):
	        image = images[i]
	        # NOTE(robinson) - In the future, maybe we detect the page number and default
	        # to the index if it is not detected
	        page = PageLayout(number=i, image=image, layout=layout, model=model)
	        page.get_elements()
	        pages.append(page)
	    return cls.from_pages(pages)
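
For what it's worth, here is a rough sketch of how the loop could fail loudly instead of hitting an IndexError on images[i]. It is not the library's code: pair_layouts_with_images is a hypothetical helper, and the strings stand in for real layout and image objects. The idea is to call load_pdf once, validate the two lists, and only then build the pages.

```python
from typing import List, Sequence, Tuple, TypeVar

LayoutT = TypeVar("LayoutT")
ImageT = TypeVar("ImageT")


def pair_layouts_with_images(
    layouts: Sequence[LayoutT], images: Sequence[ImageT]
) -> List[Tuple[int, LayoutT, ImageT]]:
    """Pair each page layout with its image, or raise if the counts differ.

    Hypothetical helper: it replaces the bare images[i] lookup with an
    explicit length check so a missing image list is reported clearly.
    """
    if len(layouts) != len(images):
        raise ValueError(
            f"Expected one image per page layout, got {len(layouts)} layouts "
            f"and {len(images)} images."
        )
    return [
        (index, layout, image)
        for index, (layout, image) in enumerate(zip(layouts, images))
    ]


# Stand-in data; in from_file these would come from a single load_pdf call.
for number, layout, image in pair_layouts_with_images(["layout-0"], ["image-0"]):
    print(number, layout, image)
```

With one layout and zero images, as in my case, this raises a ValueError naming both counts rather than an IndexError.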

from_file broken, len(images) != len(layout) #51
