-
Notifications
You must be signed in to change notification settings - Fork 607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory #1040
SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory #1040
Conversation
Prior to this commit, IgnoreLargeDocumentProcessorFactory only had a single way to handle documents that exceeded its configurable size limit. The first violation would throw a SolrException: in effect, short-circuiting any other documents in the "batch" being processed. This approach is very dependent on the ordering of docs within a batch. A 100-doc batch with only 1 "size offender" might index 99 docs or none at all, depending on where the offender is in the list. This is great for end users whose clients are built to handle the resulting 400 response, and resubmit their whole batch. But it's not ideal for every use case. This commit introduces an alternate approach for handling these violations: quietly log out the ID and size of the offending doc but don't throw any exception that will short-circuit the remainder of the batch. The desired error-handling can be chosen using the URP's new config parameter 'onlyLogErrors'. When false (the default), the legacy behavior of short-circuiting the batch and surfacing a 400 error is used. Otherwise, the new "just log things out and continue" behavior is used.
solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/update/processor/IgnoreLargeDocumentProcessorFactory.java
Outdated
Show resolved
Hide resolved
|
||
private void handleViolatingDoc(AddUpdateCommand cmd, long estimatedSizeBytes) { | ||
if (onlyLogErrors) { | ||
log.warn("Skipping doc {} bc size {} exceeds limit {}", cmd.getPrintableId(), estimatedSizeBytes / 1024, maxDocumentSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are both numbers logged in the same units? As a reader of this log message; I'm not sure what the unit is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call; I've updated the var names and log message to make this clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually ended up keeping the estimated size in bytes here, with the "limit" value still in KB.
"Limit" makes sense to keep in kb, because that's the unit that users specify when configuring this URP. As for the estimated size, I thought about using "kb" for that as well, but I was worried that users might be confused in cases where the integer-division/rounding needed to convert bytes to kb might result in displaying a message where the estimated size and limit are equal.
In short I didn't want to have us printing log messages that looked like:
Skipping doc asdf bc size 2kb exceeds limit 2kb
But I've added explicit units to the variables and log-messages involved here, so hopefully that's good enough to address the core of your issue.
BAD_REQUEST, | ||
"Size of the document " + cmd.getPrintableId() + " is too large, around:" + docSize); | ||
BAD_REQUEST, | ||
"Size of the document " + cmd.getPrintableId() + " is too large, around:" + estimatedSizeBytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If necessary, convert to the same units as that we'd log?
|
||
@Override | ||
public void init(NamedList<?> args) { | ||
maxDocumentSize = args.toSolrParams().required().getLong(LIMIT_SIZE_PARAM); | ||
final SolrParams params = args.toSolrParams(); | ||
maxDocumentSizeKb = params.required().getLong(LIMIT_SIZE_PARAM); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULL_DEREFERENCE: object returned by params.required().getLong("limit")
could be null and is dereferenced at line 58.
ℹ️ Learn about @sonatype-lift commands
You can reply with the following commands. For example, reply with @sonatype-lift ignoreall to leave out all findings.
Command | Usage |
---|---|
@sonatype-lift ignore |
Leave out the above finding from this PR |
@sonatype-lift ignoreall |
Leave out all the existing findings from this PR |
@sonatype-lift exclude <file|issue|path|tool> |
Exclude specified file|issue|path|tool from Lift findings by updating your config.toml file |
Note: When talking to LiftBot, you need to refresh the page to see its response.
Click here to add LiftBot to another repo.
Was this a good recommendation?
[ 🙁 Not relevant ] - [ 😕 Won't fix ] - [ 😑 Not critical, will fix ] - [ 🙂 Critical, will fix ] - [ 😊 Critical, fixing now ]
|
…tory (#1040) Prior to this commit, IgnoreLargeDocumentProcessorFactory only had a single way to handle documents that exceeded its configurable size limit. The first violation would throw a SolrException: in effect, short-circuiting any other documents in the "batch" being processed. This approach is very dependent on the ordering of docs within a batch. A 100-doc batch with only 1 "size offender" might index 99 docs or none at all, depending on where the offender is in the list. This is great for end users whose clients are built to handle the resulting 400 response, and resubmit their whole batch. But it's not ideal for every use case. This commit introduces an alternate approach for handling these violations: quietly log out the ID and size of the offending doc but don't throw any exception that will short-circuit the remainder of the batch. The desired error-handling can be chosen using the URP's new config parameter 'permissiveMode'. When false (the default), the legacy behavior of short-circuiting the batch and surfacing a 400 error is used. Otherwise, the new "just log things out and continue" behavior is used.
Description
IgnoreLargeDocumentProcessorFactory only has a single way to handle documents that exceed its configurable size limit. The first violation throws a SolrException: in effect, short-circuiting any remaining documents in the "batch" and returning a 400 to the user.
This is great for end users whose clients are built to handle the resulting 400 response, and who can modify and resubmit the batch. But it's not ideal for every use-case, especially where "best-effort" indexing is good enough.
Solution
This PR includes adding a new "permissive" mode of handling too-large documents to ILDPF. Under this new mode "too-large" documents will be logged (and not indexed), but won't cause the entire batch to be aborted/error-out.
Tests
Tests and documentations are still needed.
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.