@ajjimeno
Contributor

Issue:

In some installations, PDFMiner identifies an image document as a full-page image, and in others it does not. It is difficult to determine what makes PDFMiner behave one way or the other; in both cases tested, the version is pdfminer.six v20221105. The solution is to ignore any annotation coming from Chipper when the full-page clearing code is activated, so that Chipper output is not removed. Not sure whether this is relevant to other models.
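
A minimal sketch of the idea described above, not the code from this PR. The `Element` class, the `source` labels, and `apply_full_page_clearing` are hypothetical names introduced for illustration, and the sketch assumes the clearing step would otherwise drop every element on the page:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical layout element; not the project's real class."""

    text: str
    source: str  # assumed labels, e.g. "chipper" or "pdfminer"


def apply_full_page_clearing(
    elements: List[Element], full_page_image: bool
) -> List[Element]:
    """Clear elements when a full-page image is reported, except Chipper's."""
    if not full_page_image:
        # Clearing is not activated, so every element is kept.
        return list(elements)
    # Clearing is activated: elements produced by Chipper are ignored by
    # the clearing step and therefore survive it.
    return [el for el in elements if el.source == "chipper"]
```

With this guard, the result for Chipper no longer depends on whether a given installation reports the document as a full-page image.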

@LaverdeS
Contributor

LGTM.

One remark: the output is complete, but it is very different between several calls of Chipper on the same sample. This is expected from a generative model like Chipper; nevertheless, it was not this noticeable before, which makes me think that the latest changes introduced in this PR could increase those differences, meaning Chipper could be more inconsistent between generations and could hallucinate more.

@LaverdeS left a comment
Contributor

LGTM

@ajjimeno
Contributor Author

Added a random seed to avoid the differences across runs.
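
As a rough illustration of why a fixed seed removes run-to-run differences (a toy sketch, not the change made in this PR): `VOCAB` and `sample_tokens` are hypothetical stand-ins for a sampling-based generator, using only the Python standard library.

```python
import random
from typing import List

# Toy vocabulary, purely illustrative.
VOCAB = ["title", "text", "table", "list"]


def sample_tokens(n: int, seed: int) -> List[str]:
    """Toy stand-in for sampled generation; seeds a private RNG first."""
    rng = random.Random(seed)  # same seed -> same sequence of draws
    return [rng.choice(VOCAB) for _ in range(n)]


# Two calls with the same seed produce identical output.
first = sample_tokens(8, seed=0)
second = sample_tokens(8, seed=0)
print(first == second)
```

For a PyTorch-based model the analogous step would typically be a call such as `torch.manual_seed` before generation; which call this PR actually uses is not stated in the conversation.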

@ajjimeno ajjimeno merged commit c305d10 into main Oct 13, 2023
@ajjimeno ajjimeno deleted the bug/chipper-removed-by-pdfminer branch October 13, 2023 11:55