@qued qued commented Jan 8, 2023

Some PDF documents have text blocks in their layout whose text contains, or consists entirely of, unrecognized characters. These appear in the extracted text as (cid:n), where n is an integer (ref, ref).

The current logic accepts whatever text is present and applies OCR only when the layout text is None. This PR changes the logic to also apply OCR when more than 50% of the characters are unrecognized.
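
For readers following along, here is a minimal sketch of the check described above. It is not the code from this PR's diff: only the `cid_ratio(text: str) -> float` signature appears in the diff, so the body, the `CID_PATTERN` constant, and the `needs_ocr` helper are assumptions made for illustration. The sketch also assumes each (cid:n) marker counts as a single unrecognized character.

```python
import re
from typing import Optional

# Matches pdfminer-style placeholders such as "(cid:12)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are (cid:n) markers.

    Assumption: each marker counts as one character.
    """
    if not text:
        return 0.0
    # Strip the markers and count how many were removed.
    remainder, n_cid = CID_PATTERN.subn("", text)
    total = n_cid + len(remainder)
    return n_cid / total


def needs_ocr(text: Optional[str], threshold: float = 0.5) -> bool:
    """Hypothetical helper: decide whether a text block should be OCR'd."""
    if text is None:
        # Previous behavior: OCR only when there is no layout text at all.
        return True
    # New behavior: also OCR when too many characters are unrecognized.
    return cid_ratio(text) > threshold
```

With a strict greater-than comparison, a block that is exactly half unrecognized keeps its extracted text; whether the boundary should be inclusive is a design choice this sketch does not settle.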

@qued qued requested a review from MthwRobinson January 8, 2023 06:28

@MthwRobinson MthwRobinson left a comment

Tested this on a document that was previously returning cids, and everything looks like it's working on this branch. Just one question from my end; I'll approve once you take a look at Crag's comments.

return layout


def cid_ratio(text: str) -> float:

Do you know if the cid pattern for unknown characters is specific to pdfminer? Or is that a universal convention?

@MthwRobinson MthwRobinson left a comment

LGTM

@qued qued merged commit 1b6aadd into main Jan 9, 2023
@qued qued deleted the alan/ocr-when-too-many-chars-unrecognized branch January 9, 2023 18:22