Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added tika text extraction support for lucene indexing #1277

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

mitjale
Copy link

@mitjale mitjale commented Feb 3, 2018

I've added text extraction support for Lucene indexing from Apache Tika parsers, thus enabling search on pdf and office documents

Copy link
Member

@flaix flaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure that not every line of the class LuceneService changed. It looks like you may have changed indentation. This makes it impossible to review changes.

Never change indentation, line endings, whitespace and such when making functional changes. Always keep functional changes separate and reviewable.

Copy link
Contributor

@TomaszSzt TomaszSzt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see You noticed that passing byte[] data can crash server if large file is to be processed. Good.

Yet still You do return String from extractText which still, as You don't limit the size of input file to be processed, opens a way for users to abuse it and use it to crash server by supplying large files enough to cause OutOfMemory exception. There is no problem in supplying 100MB or more PDF.

Either use streaming tika or restrict the number of bytes passsed to tika by using an input stream wrapper. Make the limit configurable if You decide on it.

I'm too missing this functionality, because it is a critical step which prevents GitBlit from being Knowledge Management System for non-coders, but this implementation will crash server with OutOfMemory exception in a very unpredictable way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants