Added tika text extraction support for lucene indexing #1277

mitjale · 2018-02-03T01:35:43Z

I've added text extraction to Lucene indexing using Apache Tika parsers, which enables search on PDF and office documents.

The default query search operator is now AND. I added properties settings and maintained compatibility with the default settings from the previous version.

flaix

I am pretty sure that not every line of the class LuceneService changed. It looks like you may have changed indentation. This makes it impossible to review changes.

Never change indentation, line endings, whitespace and such when making functional changes. Always keep functional changes separate and reviewable.

TomaszSzt

I can see you noticed that passing byte[] data can crash the server when a large file is processed. Good.

Yet you still return a String from extractText, and since you don't limit the size of the input file, users can still abuse it to crash the server by supplying files large enough to cause an OutOfMemoryError. Supplying a PDF of 100MB or more is no problem.

Either use streaming Tika or restrict the number of bytes passed to Tika by using an input stream wrapper. If you decide on a limit, make it configurable.

I miss this functionality too, because its absence is a critical gap that keeps GitBlit from being a knowledge management system for non-coders, but this implementation will crash the server with an OutOfMemoryError in a very unpredictable way.
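
For illustration, the input stream wrapper suggested above could look roughly like the following. This is a minimal sketch using only `java.io`, not code from this PR: the class names `LimitedInputStream` and `LimitedInputStreamDemo`, the helper `countBytes`, and the limit value are all hypothetical, and in GitBlit the limit would presumably come from a configurable setting.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper (not part of this PR): reports end of stream once maxBytes were read. */
class LimitedInputStream extends FilterInputStream {
    private long remaining;

    LimitedInputStream(InputStream in, long maxBytes) {
        super(in);
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be >= 0");
        }
        this.remaining = maxBytes;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // limit reached: behave like end of stream
        }
        int b = super.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1;
        }
        // Never ask the underlying stream for more than the remaining budget.
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public int available() throws IOException {
        return (int) Math.min(super.available(), remaining);
    }

    // mark/reset are disabled so the byte budget cannot be bypassed by rewinding.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public synchronized void mark(int readLimit) {
        // intentionally a no-op
    }

    @Override
    public synchronized void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }
}

class LimitedInputStreamDemo {
    /** Drains the stream and returns how many bytes it delivered. */
    static long countBytes(InputStream in) throws IOException {
        byte[] buf = new byte[64];
        long total = 0;
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] blob = new byte[1000]; // stands in for a large repository file
        InputStream limited = new LimitedInputStream(new ByteArrayInputStream(blob), 100);
        System.out.println(countBytes(limited));
    }
}
```

The sketch truncates silently at the limit. A truncated file may no longer parse (a PDF, for example, keeps its cross-reference data at the end of the file), so throwing an `IOException` when the limit is exceeded and skipping the file may be the better choice. Note also that bounding the input does not by itself bound the size of the extracted text, so the returned String may need its own cap.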

mitjale and others added 12 commits February 3, 2018 02:29

Added tika support for docx, doc, pdf, xls and xlsx

d411af4

Default query search operator AND. Added properties settings and maintained compatibility with default settings from previous version

Added dependency to lucene-join

54dbe92

Added missed AND operation

f1f86bf

Additional checks

a64e4b1

Added zip extraction

05b5128

Catch exceptions when extracting from documents to prevent indexing loop

9a00b31

Split archive in separate indexed documents

211c013

Added real path in archive path

66f7a90

Added real path in archive path

57c157d

Merge branch 'master' of https://github.com/mitjale/gitblit

e1af03b

Higher version of tika

adf9bd8

Switched to streaming data for tika parsing

1ae61f3

flaix requested changes Nov 11, 2020

TomaszSzt suggested changes Nov 17, 2021
