Filter out empty/whitespace-only regions (fixes #105)
jbaiter authored and morpheus-87 committed May 11, 2020
1 parent ca3456d commit 408a15c
Showing 2 changed files with 2 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/changes.md
@@ -33,6 +33,7 @@ This is a major release with a focus on compatibility and performance.
- Log warnings during source pointer parsing
- Filter out empty files during indexing
- Add new documentation section on performance tuning
- Empty regions or regions with only whitespace are no longer included in the output

## 0.3.1 (2019-07-26)

@@ -198,6 +198,7 @@ protected OcrSnippet parseFragment(String ocrFragment, OcrPage page) {
String highlightedText = getTextFromXml(ocrFragment);
List<OcrBox> snippetRegions = byColumns.stream()
.map(this::determineSnippetRegion)
.filter(r -> !r.getText().isEmpty() && !r.getText().trim().isEmpty())
.collect(Collectors.toList());
Set<String> snippetPageIds = snippetRegions.stream()
.map(OcrBox::getPageId).collect(Collectors.toSet());
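
The filter added in this hunk can be tried in isolation. The following is a minimal sketch, not the plugin's code: `RegionFilterSketch`, `Region`, and `dropBlankRegions` are hypothetical names, and `Region` stands in for `OcrBox` by exposing only `getText()`. The null check is an extra assumption that the committed predicate does not make. Since an empty string is still empty after `trim()`, the sketch uses the `trim().isEmpty()` check alone, which should cover both conditions in the committed filter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RegionFilterSketch {
  /** Hypothetical stand-in for OcrBox; it only carries the region text. */
  static final class Region {
    private final String text;

    Region(String text) {
      this.text = text;
    }

    String getText() {
      return text;
    }
  }

  /** Keeps only regions whose text has at least one non-whitespace character. */
  static List<Region> dropBlankRegions(List<Region> regions) {
    return regions.stream()
        // Null check is an assumption of this sketch, not part of the commit.
        .filter(r -> r.getText() != null && !r.getText().trim().isEmpty())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Region> regions = Arrays.asList(
        new Region("Lorem ipsum"),
        new Region(""),        // empty: dropped
        new Region("  \n\t"),  // whitespace only: dropped
        new Region(" dolor ")); // kept, text is left untrimmed

    // prints "[Lorem ipsum]" and "[ dolor ]"
    dropBlankRegions(regions).forEach(r -> System.out.println("[" + r.getText() + "]"));
  }
}
```

Note that `String.trim()` only strips characters up to U+0020, so a region consisting solely of, say, non-breaking spaces (U+00A0) would still pass this predicate.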