SOLR-7632 TikaServer as pluggable backend to existing extraction handler #3670
base: main
…ika API
Refactor some tests to LocalTikaExtractionBackendTest
Exciting!
Status:
TBD:
Anyone, please feel free to hack away on this if it looks exciting, committing directly to the PR branch. Question: would it bring value to isolate the refactoring in one PR and then add the tikaserver implementation in another?
Cleanup TestContainer
Refactor ExtractionMetadata
Add returnType to ExtractionRequest
Remove static initializers
cc3d43f to a3794ce
Any luck with the security manager? I had many difficulties.
Testcontainers and docker don't love the SecurityManager. I had claude try to run the tests and add additional permissions to
Yea, that’s annoying. Perhaps we could disable JSM for this test or for tests in the entire module?
I had a similar experience when I was upgrading Kafka. And then I stopped.
Java Security Manager and Testcontainers do not play nicely together. We prefer Testcontainers, so disable JSM
When I first saw |
Add common metadata
Adjust some tests with dc:title instead of title
Support passwords in TikaServer backend
solr/modules/extraction/src/test-files/extraction/solr/collection1/conf/solrconfig.xml
Love the way you fixed it. Does this mean in practice that folks might see different results depending on which backend they use and the specific document? On the other hand, that also seems totally okay, in the sense that they are different backends...
…" config)
Move pdf-with-image test to local test
Add recursive test to TikaServer test case
The last commit adds recursive parsing as an option. All tests are now green. However, there is still a thread leak in the tikaserver test; I think some HttpClient resources are not being released. Other TODOs: moved to the PR description. That concludes the "POC", proving that a drop-in replacement for users is doable.
We now have a separate GitHub workflow testing the extraction code with Testcontainers. It exists only for the sake of this PR and is not intended for merge :) The thread leaks definitely look related to ordinary Solr objects.
@epugh and others - I'll be on holiday for a week from today. Feel free to commit anything you like directly to this branch without asking, if you want to play around or move things closer to perfection. Normal review comments are of course welcome too, but commits eat comments for breakfast :) Any phased merge can be done later, as the interface boundaries are fairly clean, hopefully.
 * @deprecated Will be replaced with something similar that calls out to a separate Tika Server
 * process running in its own JVM.
 */
@Deprecated(since = "9.10.0")
@epugh I undeprecated this and the Loader, and instead deprecated the Local backend. This part needs to be backported before the 9.10 release. Also, perhaps the wording in major-changes...
totally!
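For readers following this thread, here is a rough sketch of why the deprecation can move from the handler to the local backend once extraction sits behind an interface. Only the type names `ExtractionBackend`, `ExtractionRequest`, `ExtractionResult`, and `LocalTikaExtractionBackend` come from this PR. Every field and method signature, the `TikaServerBackendSketch` class, the `select` helper, and the `local` backend name are assumptions made for illustration, not the PR's actual code. Both backends return stub text instead of calling Tika.

```java
import java.util.Locale;

// Hypothetical sketch: all signatures below are illustrative assumptions,
// not the real Solr extraction module API.
final class BackendSketch {

  /** Assumed request shape: a resource name plus the raw document bytes. */
  static final class ExtractionRequest {
    final String resourceName;
    final byte[] content;

    ExtractionRequest(String resourceName, byte[] content) {
      this.resourceName = resourceName;
      this.content = content;
    }
  }

  /** Assumed result shape: just the extracted text. */
  static final class ExtractionResult {
    final String text;

    ExtractionResult(String text) {
      this.text = text;
    }
  }

  /** The pluggable boundary: the handler depends only on this interface. */
  interface ExtractionBackend {
    String name();

    ExtractionResult extract(ExtractionRequest request);
  }

  /** In-process backend; this is the piece that carries the deprecation. */
  @Deprecated(since = "9.10.0")
  static final class LocalTikaExtractionBackend implements ExtractionBackend {
    @Override
    public String name() {
      return "local"; // assumed backend name
    }

    @Override
    public ExtractionResult extract(ExtractionRequest request) {
      // Stub: a real implementation would parse in-process with Tika.
      return new ExtractionResult("[local] " + request.resourceName);
    }
  }

  /** Stand-in for a backend that would call a separate Tika Server process. */
  static final class TikaServerBackendSketch implements ExtractionBackend {
    @Override
    public String name() {
      return "tikaserver";
    }

    @Override
    public ExtractionResult extract(ExtractionRequest request) {
      // Stub: a real implementation would send the bytes over HTTP.
      return new ExtractionResult("[tikaserver] " + request.resourceName);
    }
  }

  /** Picks a backend from an extraction.backend-style value (hypothetical helper). */
  static ExtractionBackend select(String backendName) {
    String key = backendName == null ? "local" : backendName.toLowerCase(Locale.ROOT);
    switch (key) {
      case "local":
        return new LocalTikaExtractionBackend();
      case "tikaserver":
        return new TikaServerBackendSketch();
      default:
        throw new IllegalArgumentException("Unknown extraction.backend: " + backendName);
    }
  }

  public static void main(String[] args) {
    ExtractionBackend backend = select("tikaserver");
    ExtractionResult result = backend.extract(new ExtractionRequest("sample.pdf", new byte[0]));
    System.out.println(backend.name() + " -> " + result.text);
  }
}
```

With a boundary like this, the handler itself needs no deprecation marker. Only the in-process implementation does, and callers can switch backends by changing one parameter value.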
Add some thread names to filter
Validate path of tikaConfigLoc
https://issues.apache.org/jira/browse/SOLR-7632
This work builds on the one in #3361, but instead of making a new module, we add it as a capability to the existing extraction handler by specifying `extraction.backend=tikaserver`.
This first required refactoring the extraction handler to detach it from the Tika v1 API. There is a new interface `ExtractionBackend` that takes a generic `ExtractionRequest` object in and returns an `ExtractionResult` bean, and a new `LocalTikaExtractionBackend` implementation that encapsulates all Tika v1 API handling. This implementation can be deprecated, and in Solr 10 the `tikaserver` one can be made default.
All existing tests pass, and most of the existing extraction tests now also pass when running the `tikaserver` backend (running in TestContainers). Unfortunately, Docker is not available in Crave, so a new GH workflow is made to run only the extraction tests.
TODOs: