@qued qued commented Mar 29, 2023

Fixes a bug where coordinates expressed in points were not being converted to pixel coordinates according to the specified DPI.
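
For readers unfamiliar with the conversion, here is a minimal sketch of the arithmetic involved. It is not the code from this PR: the function names `points_to_pixels` and `convert_box` are hypothetical, and the only fact relied on is that one PDF point is 1/72 inch, so a length in points scales by `dpi / 72` to get pixels.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def points_to_pixels(value_pt: float, dpi: int) -> float:
    """Convert a length in points to pixels at the given DPI (hypothetical helper)."""
    return value_pt * dpi / POINTS_PER_INCH


def convert_box(box_pt, dpi):
    """Convert an (x1, y1, x2, y2) box from points to pixels (hypothetical helper)."""
    return tuple(points_to_pixels(v, dpi) for v in box_pt)


# A US Letter page is 612 x 792 points; 200 DPI is an assumed example value.
print(convert_box((0, 0, 612, 792), dpi=200))  # -> (0.0, 0.0, 1700.0, 2200.0)
```

If the scaling step is skipped, boxes stay in point units while the page image is in pixels, so detected regions and extracted text would no longer line up.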

Testing:

Parsing layout-parser-paper-fast.pdf should properly extract the text from the title block.

@qued qued requested a review from cragwolfe March 29, 2023 22:47
@cragwolfe cragwolfe left a comment

works for me in an unstructured project:

$ PYTHONPATH=. python -c "from unstructured.partition.pdf import partition_pdf; elems = partition_pdf('example-docs/layout-parser-paper-fast.pdf'); from rich.pretty import pprint; pprint(elems[0].to_dict()); pprint(elems[7].to_dict())"
{
│   'element_id': '015301d4f56aa4b20ec10ac889d2343f',
│   'coordinates': (
│   │   (438.1653747558594, 312.04864501953125),
│   │   (438.1653747558594, 413.18951416015625),
│   │   (1274.8990478515625, 413.18951416015625),
│   │   (1274.8990478515625, 312.04864501953125)
│   ),
│   'text': 'LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis',
│   'type': 'Title',
│   'metadata': {
│   │   'filename': 'example-docs/layout-parser-paper-fast.pdf',
│   │   'page_number': 1
│   }
}
{
│   'element_id': '44b6b64d159ac94e2f6910fd5055a68f',
│   'coordinates': (
│   │   (574.2962646484375, 557.7698364257812),
│   │   (574.2962646484375, 870.4000244140625),
│   │   (1111.1734619140625, 870.4000244140625),
│   │   (1111.1734619140625, 557.7698364257812)
│   ),
│   'text': 'Universityo',
│   'type': 'ListItem',
│   'metadata': {
│   │   'filename': 'example-docs/layout-parser-paper-fast.pdf',
│   │   'page_number': 1
│   }
}
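
As a quick sanity check on output like the above, the four corner points can be collapsed into a bounding box to see the element's size in pixels. This is only an illustrative sketch: `bounding_box` is a hypothetical helper, not part of unstructured, and the corner values are copied from the Title element printed above.

```python
def bounding_box(corners):
    """Return (x_min, y_min, x_max, y_max) for a sequence of (x, y) corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


# Corner coordinates of the Title element from the output above.
title_corners = (
    (438.1653747558594, 312.04864501953125),
    (438.1653747558594, 413.18951416015625),
    (1274.8990478515625, 413.18951416015625),
    (1274.8990478515625, 312.04864501953125),
)

x1, y1, x2, y2 = bounding_box(title_corners)
print(f"width={x2 - x1:.1f}px height={y2 - y1:.1f}px")
```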

@qued qued merged commit db173d0 into main Mar 29, 2023
@qued qued deleted the fix/convert-to-pixels branch March 29, 2023 23:56