Ingestion task fails with NullPointerException during BUILD_SEGMENTS phase #8835
Comments
Here is the log-line output for the attempt to fetch the file that fails:
Looking at git history, I see that the FileUtils.copyLarge method that has the NPE inside it was added in August 2019. This is the line that throws the NPE:
#8257 seems to be the PR that introduced these changes.
One observed behaviour related to this: it is not always reproducible. The same task succeeded in the same environment at other times, which suggests something environment-dependent. Going through the Druid code, it seems there are retries configured for fetching the data (default 3 retries) within a timeout (default 60s). An environment issue in our setup could be causing those fetches not to complete within that duration. But the NullPointerException should still be improved on Druid's side; in its current form the exception makes it hard to pinpoint the problem.
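For context, the fetch timeout and retry count mentioned above are tunable on the firehose. A minimal sketch of an ioConfig fragment raising them, assuming the `static-s3` firehose with its `fetchTimeout` (milliseconds) and `maxFetchRetry` properties; the bucket and prefix are placeholders:

```json
"ioConfig": {
  "type": "index",
  "firehose": {
    "type": "static-s3",
    "prefixes": ["s3://my-bucket/path/"],
    "fetchTimeout": 120000,
    "maxFetchRetry": 5
  }
}
```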
Another observation: switching from prefixes to uris decreased the number of failures. Maybe those 60 seconds are being used to list the files as well as fetch them? That would make uris more reliable than prefixes.
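For comparison, a sketch of the same firehose with explicit `uris` instead of `prefixes`, which skips the S3 listing step entirely; the bucket and object names are hypothetical:

```json
"firehose": {
  "type": "static-s3",
  "uris": [
    "s3://my-bucket/path/part-00000.json.gz",
    "s3://my-bucket/path/part-00001.json.gz"
  ]
}
```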
@mlubavin-vg @anshbansal thank you for the report! I'm looking at the bug, but unfortunately it's not obvious what could be null here. The NPE happens at the line below: try (InputStream inputStream = objectOpenFunction.open(object);
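One thing worth noting about that line: in try-with-resources, a null resource does not NPE at the declaration itself; close() is simply skipped for null, so the NPE surfaces only at the first read inside the copy loop. The sketch below is hypothetical (`open()` stands in for `objectOpenFunction.open(object)` returning null after a failed fetch, and `copyLarge` is a minimal stand-in, not Druid's actual FileUtils.copyLarge), but it illustrates where the exception would actually fire:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class NullStreamDemo {
    // Hypothetical stand-in for objectOpenFunction.open(object):
    // returns null instead of a stream, e.g. after a failed fetch.
    static InputStream open() {
        return null;
    }

    // Minimal copy loop in the spirit of a copyLarge helper:
    // reading from a null stream is where the NPE actually surfaces.
    static long copyLarge(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) { // NPE here if 'in' is null
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        // try-with-resources tolerates a null resource: close() is skipped,
        // so the failure is deferred to the first read in the copy loop.
        try (InputStream in = open();
             OutputStream out = new ByteArrayOutputStream()) {
            copyLarge(in, out);
        } catch (NullPointerException e) {
            System.out.println("NPE during copy, not at open()");
        } catch (IOException e) {
            System.out.println("IO error: " + e.getMessage());
        }
    }
}
```

If that is what happens here, a null-check with a descriptive IOException right after open() would make the failure much easier to diagnose than the bare NPE.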
I forgot to say that the new inputSource and inputFormat are available since 0.17, which we are about to release.
Thanks for the info @jihoonson!
That's very interesting. @mlubavin-vg thank you for sharing!
An update on this: we ran more batch reprocessing over the last couple of days and started seeing this problem again, even with our "fetchTimeout" and "uris" changes. Just now I ran again with your suggested change of setting "maxFetchCapacityBytes" to 0, and the failure did not occur. I will update if I have more information.
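For anyone else hitting this: as I understand it, setting maxFetchCapacityBytes to 0 disables local prefetching, so objects are streamed directly rather than fetched to disk first. A sketch of the firehose fragment we ended up with, with a hypothetical bucket and object name:

```json
"firehose": {
  "type": "static-s3",
  "uris": ["s3://my-bucket/path/part-00000.json.gz"],
  "maxFetchCapacityBytes": 0,
  "fetchTimeout": 120000
}
```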
Affected Version
0.16.0-incubating
I am fairly sure this did not happen with 0.12.3 (we are currently upgrading, and have upgraded our test environment so far).
Description
I am using native index tasks to ingest data into Druid (they overwrite data already in that interval). I submit about 30 tasks all at once, and they get queued up and processed on the MiddleManagers and peons.
Every time I run this, several of the index tasks fail (their status in the UI is FAILED), and I find this stacktrace in the MiddleManager logs:
Info:
In this test environment, I have a single MiddleManager with a task capacity of 2, and also a realtime kafka ingestion task running. In my production environment, I have 2 middle managers, 2 historicals, 2 coordinator/overlords, and 2 brokers.
I am using S3 for deep storage.
The tasks that I submit look like this: