Skip to content

Conversation

janhoy
Copy link
Contributor

@janhoy janhoy commented Sep 19, 2025

https://issues.apache.org/jira/browse/SOLR-7632

This work builds on the one in #3361 but instead of making a new module, we add it as a capability to the existing extraction handler through specifying extraction.backend=tikaserver.

This first required refactoring extraction handler to detach it from the Tika-v1 API. There is a new interface ExtractionBackend that takes generic ExtractionRequest object in and returns an ExtractionResult bean, and a new LocalTikaExtractionBackend implementation that encapsulates all Tikav1 api handling. This implementation can be deprecated, and in Solr 10, the tikaserver one can be made default.

All existing tests pass, and most of the existing extraction tests now also pass when running the tikaserver backend (running in TestContainers). Unfortunately docker is not available in Crave, so a new GH workflow is made to run only the extraction tests.

TODO's:

  • Lots of code from GenAI, which needs review and rewrite / simplification.
  • There may be debug print and TODOs left here and there
  • The metadata back-compat map is AI generated and by no means complete or even correct 🤣
  • Lack of JavaDoc everywhere
  • Harden exception handling, retrying, timeout values etc
  • Should probably use Jetty HTTP client instead of JDK one.
  • Optimize throughput with async methods? The request thread is blocked on Tika response
  • Explicit HTTPS / self-signed cert support?
  • Complete the RefGuide docs

@janhoy janhoy marked this pull request as draft September 19, 2025 15:14
@janhoy janhoy requested a review from epugh September 19, 2025 15:14
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Sep 19, 2025
@epugh
Copy link
Contributor

epugh commented Sep 19, 2025

Exciting!

@janhoy
Copy link
Contributor Author

janhoy commented Sep 20, 2025

Status:

  • Parses docs using TikaServer
  • Can switch between xml (html) and text format of the content field
  • Randomized the choice of backend for the main test class
  • ExtractOnly not fully implemented for tikaserver, some tests fail

TBD:

  • The whole xpath / SAX parsing of XML response is missing
  • We use JDK HTTP client, could perhaps use Jetty client. See other POC for example, including making timeouts configurable
  • Must make sure that tikaserver.url is only configurable on requesthandler config in solrconfig, not as a request parameter (security)
  • RefGuide docs, especially how to start TikaServer etc
  • Remove the DummyExtractionBackend

Anyone, please feel free to hack away on this if it looks exciting, committing directly to the PR branch.

Question: Would it bring value to isolate the refactoring in one PR and then another one to add the tikaserver impl?

Cleanup TestContainer
Refactor ExtractionMetadata
Add returnType to ExtractionRequest
Remove static initializers
@janhoy janhoy force-pushed the refactor-extraction-handler branch from cc3d43f to a3794ce Compare September 20, 2025 01:24
@epugh
Copy link
Contributor

epugh commented Sep 21, 2025

Any luck with security manager?? I had many difficulties

@epugh
Copy link
Contributor

epugh commented Sep 22, 2025

Testcontainers and docker don't love the SecurityManager. I had claude try to run the tests and add additional permissions to solr-tests.policy, and after an hour or so, I had a lot more, but no love:

// Needed for testcontainers
  permission java.io.FilePermission "/Users/epugh/.testcontainers.properties", "read";
  permission java.io.FilePermission "/Users/epugh/.docker-java.properties", "read";
  permission java.io.FilePermission "/Users/epugh/.docker/-", "read";
  permission java.io.FilePermission "/usr/local/opt/go@1.22/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/opt/go@1.20/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.asdf/installs/nodejs/20.18.3/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.asdf/shims/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.nvm/versions/node/v14.21.2/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.rbenv/shims/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/sbin/docker-machine", "read";
  permission java.io.FilePermission "/System/Cryptexes/App/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/sbin/docker-machine", "read";
  permission java.io.FilePermission "/sbin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/MacGPG2/bin/docker-machine", "read";
  permission java.io.FilePermission "/Library/Apple/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin/docker-machine", "read";


@janhoy
Copy link
Contributor Author

janhoy commented Sep 22, 2025

Yea, that’s annoying. Perhaps we could disable JSM for this test or for tests in the entire module?

@iamsanjay
Copy link
Contributor

I had the similar experience as I was upgrading kafka. And then I stopped.

Java Security Manager and Testcontainers do not play nicely together.  We prefer Testcontainers, so disable JSM
@epugh
Copy link
Contributor

epugh commented Sep 22, 2025

When I first saw DummyExtractionBackend my first thought was that it should be in the test class hierarchy. However, would there be value in keeping it? If you wanted to test your set up in Solr (and not worry about the Tika side), could it be useful for that? "I send a doc and I get something back"....

Add common metadata
Adjust some tests with dc:title instead of title
Support passwords in TikaServer backend
@epugh
Copy link
Contributor

epugh commented Sep 26, 2025

Love the way you fixed it. Does this mean in practice that folks might see different resutls depending on which backend they use and the specfici document? On the other hand, that also seems totally okay in the sense that they are different backends...

…" config)

Move pdf-with-image test to local test
Add recursive test to TikaServer test case
@janhoy
Copy link
Contributor Author

janhoy commented Sep 26, 2025

Last commit adds recursive parsing as an option &recursive=true for backend tikaserver. I moved the failing image test to the Local test class, since there is a difference in how PDFs with images are parsed in 1.x and 3.x. In 1.x embedded images would leave traces in the extracted text by default. Not anymore. But I added a test to the tikaserver backend test class that extracts the same pdf, with recursive enabled and also a special HTTP header to TikaServer X-Tika-PDFextractInlineImages=true which indeed extracts the images an adds their names to the text. This recursive business is highly experimental, it's a different endpoint /rmeta, returning JSON which needs to be buffered entirely in memory both on TikaServer side and on Solr side. Also all the embedded documents are returned concatenated together in the content, and I believe that the metadata from the main object is the only retained?

All tests are now green. However, there is still a thread leak in the tikaserver test. I think there are some HttpClient stuff not released.

Other TODO: Moved to PR description

That concludes the "POC", proving that it is doable to do a drop-in replacement for users.

@janhoy
Copy link
Contributor Author

janhoy commented Sep 26, 2025

We now have a separate github workflow testing extraction code, with TestContainers. It is only for the sake of this PR, not intendend for merge :)

The thread leaks definitely looks related to ordinary Solr objects.

> Task :solr:modules:extraction:test
ExtractingRequestHandlerTikaServerTest > classMethod FAILED
    java.lang.AssertionError: ObjectTracker found 11 object(s) that were not released!!! [MockDirectoryWrapper, Http2SolrClient, Http2SolrClient, Http2SolrClient, MDCAwareThreadPoolExecutor, SolrIndexSearcher, MockDirectoryWrapper, SolrCore, Http2SolrClient, LBHttp2SolrClient, MockDirectoryWrapper]
    org.apache.lucene.tests.store.MockDirectoryWrapper:org.apache.solr.common.util.ObjectReleaseTracker$ObjectTrackerException: org.apache.lucene.tests.store.MockDirectoryWrapper
    	at org.apache.solr.common.util.ObjectReleaseTracker.track(ObjectReleaseTracker.java:54)
...

@janhoy
Copy link
Contributor Author

janhoy commented Sep 26, 2025

@epugh and others - I'll be on holiday for a week from today. Feel free to commit anything you like directly to this branch without asking, if you want to play around or move things closer to perfection. Normal review comments are of course welcome too, but commits eats comments for breakfast :)

Any phased merge can be done later, as the interface boundaries are fairly clean, hopefully.

* @deprecated Will be replaced with something similar that calls out to a separate Tika Server
* process running in its own JVM.
*/
@Deprecated(since = "9.10.0")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@epugh I undeprecated this and the Loader, and instead deprecated the Local backend. This part needs to be backported before 9.10 release. Also perhaps wording in major-changes...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Dependency upgrades documentation Improvements or additions to documentation module:extraction test-framework tests tool:build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants