Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-16428: Add "permissive" mode to IgnoreLargeDocumentsProcessorFactory #1040

Merged
merged 6 commits into from
Sep 28, 2022

Conversation

gerlowskija
Copy link
Contributor

Description

IgnoreLargeDocumentProcessorFactory only has a single way to handle documents that exceed its configurable size limit. The first violation throws a SolrException: in effect, short-circuiting any remaining documents in the "batch" and returning a 400 to the user.

This is great for end users whose clients are built to handle the resulting 400 response, and who can modify and resubmit the batch. But it's not ideal for every use-case, especially where "best-effort" indexing is good enough.

Solution

This PR includes adding a new "permissive" mode of handling too-large documents to ILDPF. Under this new mode "too-large" documents will be logged (and not indexed), but won't cause the entire batch to be aborted/error-out.

Tests

Tests and documentations are still needed.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

Prior to this commit, IgnoreLargeDocumentProcessorFactory only had a
single way to handle documents that exceeded its configurable size
limit.  The first violation would throw a SolrException: in effect,
short-circuiting any other documents in the "batch" being processed.

This approach is very dependent on the ordering of docs within a
batch.  A 100-doc batch with only 1 "size offender" might index 99 docs
or none at all, depending on where the offender is in the list.  This is
great for end users whose clients are built to handle the resulting 400
response, and resubmit their whole batch.  But it's not ideal for every
use case.

This commit introduces an alternate approach for handling these
violations: quietly log out the ID and size of the offending doc but
don't throw any exception that will short-circuit the remainder of the
batch.

The desired error-handling can be chosen using the URP's new config
parameter 'onlyLogErrors'.  When false (the default), the legacy
behavior of short-circuiting the batch and surfacing a 400 error is
used.  Otherwise, the new "just log things out and continue" behavior is
used.

private void handleViolatingDoc(AddUpdateCommand cmd, long estimatedSizeBytes) {
if (onlyLogErrors) {
log.warn("Skipping doc {} bc size {} exceeds limit {}", cmd.getPrintableId(), estimatedSizeBytes / 1024, maxDocumentSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are both numbers logged in the same units? As a reader of this log message; I'm not sure what the unit is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call; I've updated the var names and log message to make this clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually ended up keeping the estimated size in bytes here, with the "limit" value still in KB.

"Limit" makes sense to keep in kb, because that's the unit that users specify when configuring this URP. As for the estimated size, I thought about using "kb" for that as well, but I was worried that users might be confused in cases where the integer-division/rounding needed to convert bytes to kb might result in displaying a message where the estimated size and limit are equal.

In short I didn't want to have us printing log messages that looked like:

Skipping doc asdf bc size 2kb exceeds limit 2kb

But I've added explicit units to the variables and log-messages involved here, so hopefully that's good enough to address the core of your issue.

BAD_REQUEST,
"Size of the document " + cmd.getPrintableId() + " is too large, around:" + docSize);
BAD_REQUEST,
"Size of the document " + cmd.getPrintableId() + " is too large, around:" + estimatedSizeBytes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If necessary, convert to the same units as that we'd log?


@Override
public void init(NamedList<?> args) {
maxDocumentSize = args.toSolrParams().required().getLong(LIMIT_SIZE_PARAM);
final SolrParams params = args.toSolrParams();
maxDocumentSizeKb = params.required().getLong(LIMIT_SIZE_PARAM);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NULL_DEREFERENCE: object returned by params.required().getLong("limit") could be null and is dereferenced at line 58.


ℹ️ Learn about @sonatype-lift commands

You can reply with the following commands. For example, reply with @sonatype-lift ignoreall to leave out all findings.

Command Usage
@sonatype-lift ignore Leave out the above finding from this PR
@sonatype-lift ignoreall Leave out all the existing findings from this PR
@sonatype-lift exclude <file|issue|path|tool> Exclude specified file|issue|path|tool from Lift findings by updating your config.toml file

Note: When talking to LiftBot, you need to refresh the page to see its response.
Click here to add LiftBot to another repo.


Was this a good recommendation?
[ 🙁 Not relevant ] - [ 😕 Won't fix ] - [ 😑 Not critical, will fix ] - [ 🙂 Critical, will fix ] - [ 😊 Critical, fixing now ]

@sonatype-lift
Copy link

sonatype-lift bot commented Sep 26, 2022

⚠️ 313 God Classes were detected by Lift in this project. Visit the Lift web console for more details.

@gerlowskija gerlowskija merged commit 708c461 into apache:main Sep 28, 2022
@gerlowskija gerlowskija deleted the ildpf-more-permissive-mode branch September 28, 2022 15:19
gerlowskija added a commit that referenced this pull request Sep 28, 2022
…tory (#1040)

Prior to this commit, IgnoreLargeDocumentProcessorFactory only had a
single way to handle documents that exceeded its configurable size
limit.  The first violation would throw a SolrException: in effect,
short-circuiting any other documents in the "batch" being processed.

This approach is very dependent on the ordering of docs within a
batch.  A 100-doc batch with only 1 "size offender" might index 99 docs
or none at all, depending on where the offender is in the list.  This is
great for end users whose clients are built to handle the resulting 400
response, and resubmit their whole batch.  But it's not ideal for every
use case.

This commit introduces an alternate approach for handling these
violations: quietly log out the ID and size of the offending doc but
don't throw any exception that will short-circuit the remainder of the
batch.

The desired error-handling can be chosen using the URP's new config
parameter 'permissiveMode'.  When false (the default), the legacy
behavior of short-circuiting the batch and surfacing a 400 error is
used.  Otherwise, the new "just log things out and continue" behavior is
used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants