ingest-attachment PDF processing should be configurable #36890

Open
janLo opened this issue Dec 20, 2018 · 2 comments
Labels
:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP)
Team:Data Management (Meta label for data/management team)

Comments


janLo commented Dec 20, 2018

Describe the feature:

The ingest-attachment plugin uses Apache Tika for document content extraction. Tika's PDF parser has several configuration options, which can be set using a properties file or programmatically (see http://tika.apache.org/1.19.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html).

The plugin only uses the default values and offers no way to change them: https://github.com/elastic/elasticsearch/blob/master/plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/TikaImpl.java#L82

This causes problems with certain documents where the PDF-producing software (e.g. the Scanbot app) tries to place each character of the OCRed text exactly behind the corresponding character in the scanned image. With the default setup, the parser then inserts many unexpected spaces into the extracted text. This is especially surprising because these spaces do not appear when the text is copied from a PDF reader. As a result, the plugin is useless in this case, since a document can only be found if the user knows where the spaces were inserted.
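
To illustrate why glyph placement like this can produce extra spaces, here is a simplified toy model. It is not Tika's or PDFBox's actual code: it assumes a space is emitted whenever the horizontal gap between two glyphs exceeds the smaller of two thresholds, one scaled by a spacing tolerance and one by an average-character tolerance. The class name, the `extract` method, and all coordinates are invented for the example; the values 0.5 and 0.3 in the first call are believed to match the library defaults, but treat that as an assumption.

```java
/** Toy model of gap-based word-break detection; NOT the real PDFBox/Tika logic. */
class SpacingToleranceDemo {

    /**
     * Joins glyphs into text, inserting a space whenever the gap to the
     * previous glyph exceeds min(spaceWidth * spacingTolerance,
     * averageCharWidth * averageCharTolerance).
     *
     * @param chars  the glyph characters in reading order
     * @param xs     left edge of each glyph (arbitrary page units)
     * @param widths width of each glyph (same units)
     */
    static String extract(String chars, float[] xs, float[] widths, float spaceWidth,
                          float spacingTolerance, float averageCharTolerance) {
        StringBuilder out = new StringBuilder();
        float widthSum = 0f;
        float prevEnd = 0f;
        for (int i = 0; i < chars.length(); i++) {
            if (i > 0) {
                // Running average of the glyph widths seen so far.
                float averageCharWidth = widthSum / i;
                float threshold = Math.min(spaceWidth * spacingTolerance,
                                           averageCharWidth * averageCharTolerance);
                if (xs[i] - prevEnd > threshold) {
                    out.append(' ');
                }
            }
            out.append(chars.charAt(i));
            widthSum += widths[i];
            prevEnd = xs[i] + widths[i];
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // OCR-style placement: each glyph sits under its scanned counterpart,
        // leaving a gap of 2 units between the letters of the word "scan".
        float[] xs = {0f, 6f, 12f, 18f};
        float[] widths = {4f, 4f, 4f, 4f};

        // Tight tolerances: threshold is min(1.5, 1.2), so every gap becomes a space.
        System.out.println(extract("scan", xs, widths, 3.0f, 0.5f, 0.3f)); // prints "s c a n"

        // Looser tolerances: threshold is min(3.0, 4.0), so the gaps are ignored.
        System.out.println(extract("scan", xs, widths, 3.0f, 1.0f, 1.0f)); // prints "scan"
    }
}
```

In this model, raising the two tolerances is enough to keep the word intact, which is why being able to tune them per deployment would matter.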

An example of this problem is described here: nextcloud/files_fulltextsearch#29

The situation might be resolved if the plugin allowed the user to set properties like AverageCharTolerance and SpacingTolerance via a configuration mechanism.
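
One possible shape for such a mechanism, as a minimal sketch only. The option names `pdf_average_char_tolerance` and `pdf_spacing_tolerance`, the `PdfToleranceOptions` class, and the rule that values must be positive are all hypothetical; the plugin defines nothing like this today. The sketch reads the values from a processor-style configuration map and leaves them unset when absent, so the current defaults would still apply.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for PDF tolerance options read from a processor config map. */
class PdfToleranceOptions {
    // Hypothetical option names, invented for this sketch.
    static final String AVERAGE_CHAR_TOLERANCE = "pdf_average_char_tolerance";
    static final String SPACING_TOLERANCE = "pdf_spacing_tolerance";

    final Float averageCharTolerance; // null means "keep the parser default"
    final Float spacingTolerance;     // null means "keep the parser default"

    PdfToleranceOptions(Float averageCharTolerance, Float spacingTolerance) {
        this.averageCharTolerance = averageCharTolerance;
        this.spacingTolerance = spacingTolerance;
    }

    /** Builds the options from a config map, validating each value that is present. */
    static PdfToleranceOptions fromConfig(Map<String, Object> config) {
        return new PdfToleranceOptions(
                readTolerance(config, AVERAGE_CHAR_TOLERANCE),
                readTolerance(config, SPACING_TOLERANCE));
    }

    private static Float readTolerance(Map<String, Object> config, String key) {
        Object raw = config.get(key);
        if (raw == null) {
            return null; // option not set
        }
        if (!(raw instanceof Number)) {
            throw new IllegalArgumentException("[" + key + "] must be a number");
        }
        float value = ((Number) raw).floatValue();
        // Assumed validation rule: reject NaN, infinite, zero, and negative values.
        if (!(value > 0f) || Float.isInfinite(value)) {
            throw new IllegalArgumentException("[" + key + "] must be a positive finite number");
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        config.put(SPACING_TOLERANCE, 1.0);
        PdfToleranceOptions options = fromConfig(config);
        System.out.println(options.averageCharTolerance + " " + options.spacingTolerance);
        // With Tika on the classpath, non-null values would then be passed to
        // PDFParserConfig#setAverageCharTolerance and #setSpacingTolerance, and the
        // config object handed to the parser through the ParseContext.
    }
}
```

Keeping unset options as `null` means existing pipelines would behave exactly as before unless a user opts in.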

@colings86 added the :Data Management/Ingest Node label Dec 20, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@clawoflight

Please look into this, it's deal-breaking for the primary use case of file indexing!

@rjernst added the Team:Data Management label May 4, 2020