feat: ocr when too many chars unrecognized #8
MthwRobinson
left a comment
Tested this on a document that was previously returning cids and everything looks like it's working on this branch. Just one question from my end, I'll approve once you take a look at Crag's comments.
    return layout


def cid_ratio(text: str) -> float:
Do you know if the cid pattern for unknown characters is specific to pdfminer? Or is that a universal convention?
Looks like it's something pdfminer.six is doing: https://github.com/pdfminer/pdfminer.six/blob/20221105/pdfminer/converter.py#L235
MthwRobinson
left a comment
LGTM
Certain PDF documents have text blocks in their layout whose text consists partly or entirely of unrecognized characters, represented in the text as (cid:n), where n is an integer (ref, ref).
The current logic accepts whatever text is present and applies OCR only when the layout text is None. This PR changes the logic to also apply OCR when more than 50% of the characters are unrecognized.
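For readers skimming the thread, here is a minimal sketch of how the check described above could work. Only the cid_ratio signature appears in the diff excerpt; the function body, the CID_PATTERN regex, the needs_ocr helper, and the choice to count each (cid:n) token as a single character are all assumptions made for illustration, not the code merged in this PR.

```python
import re
from typing import Optional

# Assumed pattern for the "(cid:n)" placeholders that pdfminer.six emits
# for characters it cannot map to Unicode.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are (cid:n) placeholders.

    Assumption: each placeholder counts as one unrecognized character.
    """
    if not text:
        return 0.0
    cid_count = len(CID_PATTERN.findall(text))
    # Characters left over once every placeholder is stripped out.
    recognized = len(CID_PATTERN.sub("", text))
    return cid_count / (cid_count + recognized)


def needs_ocr(text: Optional[str], threshold: float = 0.5) -> bool:
    """Hypothetical helper: decide whether a layout element should be OCR'd.

    OCR is applied when there is no text at all, or when more than
    `threshold` of the characters are unrecognized.
    """
    if text is None:
        return True
    return cid_ratio(text) > threshold
```

With this sketch, a block such as "(cid:1)(cid:2)a" would be sent to OCR because two of its three characters are placeholders, while ordinary text and text at exactly 50% would keep the extracted layout text.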