Fix ALTO regexp to correctly match TextBlock/Page/etc entities #25

Merged
2 commits merged into dbmdz:master on Apr 29, 2019

2 participants
@mbennett-uoe
Contributor

commented Apr 29, 2019

The ">?" at the end of the regexp in the getTextFromXml return block made the closing ">" optional, so inline entities were not matched and removed correctly. Removing the "?" quantifier fixes this.

Results with error: (screenshot "regexerr")

With patch: (screenshot "regexfix")

XML used for testing (snipped from a larger file; the whole file can be provided if required):

<Page WIDTH="2092" HEIGHT="3850" PHYSICAL_IMG_NR="6" ID="page_6"> <TextBlock ID="block_12" HPOS="83" VPOS="3284" WIDTH="1703" HEIGHT="100"><TextLine ID="line_43" HPOS="182" VPOS="3284" WIDTH="1604" HEIGHT="49"><String ID="string_493" HPOS="706" VPOS="3285" WIDTH="222" HEIGHT="37" WC="0.72" CONTENT="Permanent"/><SP WIDTH="31" VPOS="3285" HPOS="928"/><String ID="string_494" HPOS="959" VPOS="3285" WIDTH="216" HEIGHT="36" WC="0.95" CONTENT="Committee"/><SP WIDTH="30" VPOS="3285" HPOS="1175"/></TextLine></TextBlock></Page>
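
The effect of an optional closing ">" can be reproduced in isolation. The sketch below is a minimal illustration, not the project's code: the two patterns, the class name `TagStripDemo`, and the sample fragment are hypothetical stand-ins, and the fragment assumes the word text has already been pulled out of the CONTENT attributes. It only shows the general mechanism, which is that a reluctant quantifier followed by an optional terminator can succeed after consuming almost nothing.

```java
import java.util.regex.Pattern;

public class TagStripDemo {
  // Hypothetical patterns for illustration only; NOT the regexp used in getTextFromXml.
  // With ">?" the closing bracket is optional, so the reluctant "[^>]*?" can stop
  // after zero characters and the whole match is just "<".
  static final Pattern WITH_OPTIONAL_CLOSE = Pattern.compile("<[^>]*?>?");
  // With a mandatory ">" the reluctant part must expand until the tag is closed.
  static final Pattern WITH_REQUIRED_CLOSE = Pattern.compile("<[^>]*?>");

  // Removes everything the given pattern matches from the fragment.
  static String strip(Pattern tagPattern, String fragment) {
    return tagPattern.matcher(fragment).replaceAll("");
  }

  public static void main(String[] args) {
    // Hypothetical fragment, loosely modelled on the ALTO snippet above.
    String fragment =
        "<TextLine ID=\"line_43\">Permanent<SP WIDTH=\"31\"/>Committee</TextLine>";

    // Only the "<" characters are removed; the rest of each tag is left behind.
    System.out.println(strip(WITH_OPTIONAL_CLOSE, fragment));
    // The complete tags are removed, leaving only the text.
    System.out.println(strip(WITH_REQUIRED_CLOSE, fragment));
  }
}
```

Under these assumptions the first call leaves tag remnants in the output, while the second returns only `PermanentCommittee`, which matches the behaviour described for the patch.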
@jbaiter

Member

commented Apr 29, 2019

Thanks a lot for debugging and fixing this issue! :-)

Before I merge this, could you maybe add a small integration test to https://github.com/dbmdz/solr-ocrhighlighting/blob/master/src/test/java/org/mdz/search/solrocr/solr/AltoEscapedTest.java? You can just add your test page XML to the end of the fixture document ( https://github.com/dbmdz/solr-ocrhighlighting/blob/master/src/test/resources/data/alto_escaped.xml ) and do your asserts on that.

@mbennett-uoe

Contributor Author

commented Apr 29, 2019

I have added a test with a single assert. If it needs to be a bit more specific for safety, let me know and I'll add a few more conditions.

@jbaiter

Member

commented Apr 29, 2019

Seems the CI is currently broken for non-team PRs, but the tests are passing fine on my machine, so I'll just merge this anyway. Thank you again!

@jbaiter jbaiter merged commit 7a18cbc into dbmdz:master Apr 29, 2019
