feat: ocr when too many chars unrecognized #8

qued · 2023-01-08T06:28:10Z

Some PDF documents have text blocks in their layout whose text consists partly or entirely of unrecognized characters, represented in the text as (cid:n), where n is an integer (ref, ref).

The current logic accepts whatever text is present and applies OCR only when the layout text is None. This PR changes the logic to also apply OCR when more than 50% of the characters are unrecognized.
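The rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `cid_ratio` and `is_cid_present` are taken from the commit messages, while `needs_ocr`, the `threshold` parameter, and the choice to count each `(cid:n)` placeholder as one character are assumptions made for the example.

```python
import re
from typing import Optional

# Placeholder format described above: "(cid:n)" where n is an integer.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def is_cid_present(text: str) -> bool:
    """Cheap precheck that avoids running the regex on most text."""
    # The shortest possible placeholder, "(cid:0)", is 7 characters long.
    if len(text) < len("(cid:0)"):
        return False
    return "(cid:" in text


def cid_ratio(text: str) -> float:
    """Fraction of characters that are unrecognized.

    Assumption: each "(cid:n)" placeholder counts as a single character.
    """
    if not is_cid_present(text):
        return 0.0
    # re.subn returns the text with placeholders removed and the match count.
    remaining, n_cid = CID_PATTERN.subn("", text)
    # total is never zero here: the precheck guarantees at least 7 characters,
    # which are either matched placeholders or remaining text.
    total = n_cid + len(remaining)
    return n_cid / total


def needs_ocr(text: Optional[str], threshold: float = 0.5) -> bool:
    """Hypothetical helper: OCR when text is missing or mostly unrecognized."""
    if text is None:
        return True
    return cid_ratio(text) > threshold
```

With this sketch, `needs_ocr(None)` and `needs_ocr("(cid:1)(cid:2)a")` are both true, while ordinary text such as `"hello world"` is accepted without OCR. A block that is exactly half placeholders is also accepted, because the comparison is strictly greater than the threshold.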

unstructured_inference/inference/layout.py

MthwRobinson

Tested this on a document that was previously returning cids, and everything looks like it's working on this branch. Just one question from my end; I'll approve once you take a look at Crag's comments.

MthwRobinson · 2023-01-09T14:32:11Z

unstructured_inference/inference/layout.py

    return layout
+
+
+def cid_ratio(text: str) -> float:


Do you know if the cid pattern for unknown characters is specific to pdfminer? Or is that a universal convention?

Looks like it's something pdfminer.six is doing: https://github.com/pdfminer/pdfminer.six/blob/20221105/pdfminer/converter.py#L235

MthwRobinson

LGTM

qued added 5 commits January 6, 2023 16:16

ocr when cid ratio is too high

4540ed4

Separate out interpretation of text blocks

608b7fb

Test TextBlock interpretation when unknown symbols are in text

7aa6aa9

Merge branch 'alan/test-cid' into alan/ocr-when-too-many-chars-unreco…

52b2d22

…gnized

Update version and changelog

b34f32a

qued requested a review from MthwRobinson January 8, 2023 06:28

cragwolfe reviewed Jan 8, 2023

unstructured_inference/inference/layout.py

MthwRobinson reviewed Jan 9, 2023

qued added 4 commits January 9, 2023 09:39

Add prechecks that are cheaper computationally

d88582f

test_cid_ratio stub

6bd1f1b

No more need for div0 case

cb1318e

Add tests for cid_ratio and is_cid_present functions

a130b60

cragwolfe approved these changes Jan 9, 2023

MthwRobinson approved these changes Jan 9, 2023

Merge branch 'main' into alan/ocr-when-too-many-chars-unrecognized

4ef4050

qued merged commit 1b6aadd into main Jan 9, 2023

qued deleted the alan/ocr-when-too-many-chars-unrecognized branch January 9, 2023 18:22

benjats07 pushed a commit that referenced this pull request Jan 10, 2023

feat: Update model load code to allow for different models (#8)

a412353
