SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory #1040

gerlowskija · 2022-09-22T20:29:05Z

Description

IgnoreLargeDocumentProcessorFactory only has a single way to handle documents that exceed its configurable size limit. The first violation throws a SolrException: in effect, short-circuiting any remaining documents in the "batch" and returning a 400 to the user.

This is great for end users whose clients are built to handle the resulting 400 response, and who can modify and resubmit the batch. But it's not ideal for every use-case, especially where "best-effort" indexing is good enough.

Solution

This PR includes adding a new "permissive" mode of handling too-large documents to ILDPF. Under this new mode "too-large" documents will be logged (and not indexed), but won't cause the entire batch to be aborted/error-out.

Tests

Tests and documentations are still needed.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide

Prior to this commit, IgnoreLargeDocumentProcessorFactory only had a single way to handle documents that exceeded its configurable size limit. The first violation would throw a SolrException: in effect, short-circuiting any other documents in the "batch" being processed. This approach is very dependent on the ordering of docs within a batch. A 100-doc batch with only 1 "size offender" might index 99 docs or none at all, depending on where the offender is in the list. This is great for end users whose clients are built to handle the resulting 400 response, and resubmit their whole batch. But it's not ideal for every use case. This commit introduces an alternate approach for handling these violations: quietly log out the ID and size of the offending doc but don't throw any exception that will short-circuit the remainder of the batch. The desired error-handling can be chosen using the URP's new config parameter 'onlyLogErrors'. When false (the default), the legacy behavior of short-circuiting the batch and surfacing a 400 error is used. Otherwise, the new "just log things out and continue" behavior is used.

solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java

dsmiley · 2022-09-23T20:50:07Z

solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java

+
+      private void handleViolatingDoc(AddUpdateCommand cmd, long estimatedSizeBytes) {
+        if (onlyLogErrors) {
+          log.warn("Skipping doc {} bc size {} exceeds limit {}", cmd.getPrintableId(), estimatedSizeBytes / 1024, maxDocumentSize);


Are both numbers logged in the same units? As a reader of this log message; I'm not sure what the unit is.

Good call; I've updated the var names and log message to make this clearer.

I actually ended up keeping the estimated size in bytes here, with the "limit" value still in KB.

"Limit" makes sense to keep in kb, because that's the unit that users specify when configuring this URP. As for the estimated size, I thought about using "kb" for that as well, but I was worried that users might be confused in cases where the integer-division/rounding needed to convert bytes to kb might result in displaying a message where the estimated size and limit are equal.

In short I didn't want to have us printing log messages that looked like:

Skipping doc asdf bc size 2kb exceeds limit 2kb

But I've added explicit units to the variables and log-messages involved here, so hopefully that's good enough to address the core of your issue.

dsmiley · 2022-09-23T20:50:35Z

solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java

-              BAD_REQUEST,
-              "Size of the document " + cmd.getPrintableId() + " is too large, around:" + docSize);
+                  BAD_REQUEST,
+                  "Size of the document " + cmd.getPrintableId() + " is too large, around:" + estimatedSizeBytes);


If necessary, convert to the same units as that we'd log?

sonatype-lift · 2022-09-26T19:22:43Z

solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java


  @Override
  public void init(NamedList<?> args) {
-    maxDocumentSize = args.toSolrParams().required().getLong(LIMIT_SIZE_PARAM);
+    final SolrParams params = args.toSolrParams();
+    maxDocumentSizeKb = params.required().getLong(LIMIT_SIZE_PARAM);


NULL_DEREFERENCE: object returned by params.required().getLong("limit") could be null and is dereferenced at line 58.

ℹ️ Learn about @sonatype-lift commands

You can reply with the following commands. For example, reply with @sonatype-lift ignoreall to leave out all findings.

Command Usage

@sonatype-lift ignore Leave out the above finding from this PR

@sonatype-lift ignoreall Leave out all the existing findings from this PR

@sonatype-lift exclude <file|issue|path|tool> Exclude specified file|issue|path|tool from Lift findings by updating your config.toml file

Note: When talking to LiftBot, you need to refresh the page to see its response.
_{Click here to add LiftBot to another repo.}

Was this a good recommendation?
[ 🙁 Not relevant ] - [ 😕 Won't fix ] - [ 😑 Not critical, will fix ] - [ 🙂 Critical, will fix ] - [ 😊 Critical, fixing now ]

sonatype-lift · 2022-09-26T19:22:46Z

⚠️ 313 God Classes were detected by Lift in this project. Visit the Lift web console for more details.

…tory (#1040) Prior to this commit, IgnoreLargeDocumentProcessorFactory only had a single way to handle documents that exceeded its configurable size limit. The first violation would throw a SolrException: in effect, short-circuiting any other documents in the "batch" being processed. This approach is very dependent on the ordering of docs within a batch. A 100-doc batch with only 1 "size offender" might index 99 docs or none at all, depending on where the offender is in the list. This is great for end users whose clients are built to handle the resulting 400 response, and resubmit their whole batch. But it's not ideal for every use case. This commit introduces an alternate approach for handling these violations: quietly log out the ID and size of the offending doc but don't throw any exception that will short-circuit the remainder of the batch. The desired error-handling can be chosen using the URP's new config parameter 'permissiveMode'. When false (the default), the legacy behavior of short-circuiting the batch and surfacing a 400 error is used. Otherwise, the new "just log things out and continue" behavior is used.

sonatype-lift bot reviewed Sep 22, 2022

View reviewed changes

solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java Outdated Show resolved Hide resolved

dsmiley reviewed Sep 23, 2022

View reviewed changes

gerlowskija added 3 commits September 26, 2022 14:14

Review feedback, tests, and ref-guide docs

a0c4297

Merge branch 'main' into ildpf-more-permissive-mode

1433b83

Small tweak

f00d9c3

dsmiley approved these changes Sep 26, 2022

View reviewed changes

sonatype-lift bot reviewed Sep 26, 2022

View reviewed changes

gerlowskija added 2 commits September 28, 2022 08:19

Merge branch 'main' into ildpf-more-permissive-mode

4a3dd62

Add CHANGES.txt entry for new permissive mode

722820d

gerlowskija merged commit 708c461 into apache:main Sep 28, 2022

gerlowskija deleted the ildpf-more-permissive-mode branch September 28, 2022 15:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory #1040

SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory #1040

gerlowskija commented Sep 22, 2022

dsmiley Sep 23, 2022

gerlowskija Sep 26, 2022

gerlowskija Sep 26, 2022

dsmiley Sep 23, 2022

sonatype-lift bot Sep 26, 2022

sonatype-lift bot commented Sep 26, 2022

Command	Usage
`@sonatype-lift ignore`	Leave out the above finding from this PR
`@sonatype-lift ignoreall`	Leave out all the existing findings from this PR
`@sonatype-lift exclude <file\|issue\|path\|tool>`	Exclude specified `file\|issue\|path\|tool` from Lift findings by updating your config.toml file

SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory #1040

SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory #1040

Conversation

gerlowskija commented Sep 22, 2022

Description

Solution

Tests

Checklist

dsmiley Sep 23, 2022

Choose a reason for hiding this comment

gerlowskija Sep 26, 2022

Choose a reason for hiding this comment

gerlowskija Sep 26, 2022

Choose a reason for hiding this comment

dsmiley Sep 23, 2022

Choose a reason for hiding this comment

sonatype-lift bot Sep 26, 2022

Choose a reason for hiding this comment

sonatype-lift bot commented Sep 26, 2022