
Ingestion task fails with NullPointerException during BUILD_SEGMENTS phase #8835

Open
mlubavin-vg opened this issue Nov 6, 2019 · 10 comments


@mlubavin-vg

Affected Version

0.16.0-incubating

I am fairly sure this did not happen with 0.12.3 (we are currently upgrading, and have upgraded our test environment so far).

Description

I am using native index tasks to ingest data into Druid (they override data already in that interval). I submit about 30 tasks all at once, and they get queued up and processed in the middle managers and peons.

Every time I run this, several of the index tasks fail (their status in the UI is FAILED), and I find this stacktrace in the MiddleManager logs:

2019-11-01T06:46:32,759 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.io.IOException: java.lang.NullPointerException
	at org.apache.druid.data.input.impl.prefetch.Fetcher.checkFetchException(Fetcher.java:199) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.next(Fetcher.java:170) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.PrefetchableTextFilesFirehoseFactory$2.next(PrefetchableTextFilesFirehoseFactory.java:242) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.PrefetchableTextFilesFirehoseFactory$2.next(PrefetchableTextFilesFirehoseFactory.java:228) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.FileIteratingFirehose.getNextLineIterator(FileIteratingFirehose.java:107) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.FileIteratingFirehose.hasMore(FileIteratingFirehose.java:68) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.FiniteFirehoseProcessor.process(FiniteFirehoseProcessor.java:98) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:859) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.IndexTask.runTask(IndexTask.java:467) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:137) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:419) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:391) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: java.lang.NullPointerException
	at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_222]
	at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_222]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.checkFetchException(Fetcher.java:190) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	... 15 more
Caused by: java.io.IOException: java.lang.NullPointerException
	at org.apache.druid.java.util.common.FileUtils.copyLarge(FileUtils.java:305) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.FileFetcher.download(FileFetcher.java:89) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.fetch(Fetcher.java:134) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.lambda$fetchIfNeeded$0(Fetcher.java:110) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	... 4 more
Caused by: java.lang.NullPointerException
	at org.apache.druid.java.util.common.FileUtils.lambda$copyLarge$1(FileUtils.java:293) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:86) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:125) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.java.util.common.FileUtils.copyLarge(FileUtils.java:291) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.FileFetcher.download(FileFetcher.java:89) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.fetch(Fetcher.java:134) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.lambda$fetchIfNeeded$0(Fetcher.java:110) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	... 4 more

Info:
In this test environment, I have a single MiddleManager with a task capacity of 2, and also a realtime kafka ingestion task running. In my production environment, I have 2 middle managers, 2 historicals, 2 coordinator/overlords, and 2 brokers.

I am using S3 for deep storage.

The tasks that I submit look like this:

{
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "redacted",
            "metricsSpec": metrics_spec,
            "granularitySpec": {
                "segmentGranularity": "HOUR",
                "queryGranularity": "NONE",
                "intervals": intervals
            },
            "parser": parser
        },
        "ioConfig": {
            "type": "index",
            "firehose": {
                "type": "static-s3",
                "prefixes": s3_prefixes
            }
        }
    }
}
@mlubavin-vg
Author

Here is the log line for the fetch attempt on the file that fails:
2019-11-01T06:43:37,098 INFO [firehose_fetch_0] org.apache.druid.data.input.impl.prefetch.Fetcher - Fetching [0]th object[s3://redacted/data/redacted/v0005/2019/01/10/12/data_0_0_0.json.gz], fetchedBytes[0]

@mlubavin-vg
Author

mlubavin-vg commented Nov 6, 2019

Looking at the git history, I see that the FileUtils.copyLarge method containing the NPE was added in August 2019. This is the line throwing the NPE:
try (InputStream inputStream = objectOpenFunction.open(object);

@anshbansal
Contributor

This seems to be the PR that introduced these changes: #8257

@anshbansal
Contributor

One observed behaviour related to this: it is not always reproducible. The same task succeeded in the same environment at other times.

That suggests something environment-dependent. Going through the Druid code, it seems fetches are retried (default 3 retries) within a timeout (default 60s). This could be an environment issue causing those fetches not to complete within that window. But the NullPointerException should still be improved on Druid's side; in its current form, this exception makes the problem hard to pinpoint.
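For reference, the retry and timeout behaviour mentioned above is configurable on the prefetching firehose. A sketch of an ioConfig with those knobs made explicit (parameter names as documented for the static-s3 firehose; the values shown are the documented defaults, and the prefixes value is a placeholder):

```json
"ioConfig": {
    "type": "index",
    "firehose": {
        "type": "static-s3",
        "prefixes": ["s3://bucket/prefix/"],
        "fetchTimeout": 60000,
        "maxFetchRetry": 3
    }
}
```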

@anshbansal
Contributor

Another observation: changing from prefixes to uris decreased the number of failures. Maybe the 60-second timeout covers listing the files as well as fetching them? That would make uris more reliable than prefixes.

@jon-wei jon-wei added this to the 0.17.0 milestone Dec 10, 2019
@jon-wei jon-wei removed this from the 0.17.0 milestone Dec 18, 2019
@jihoonson
Contributor

@mlubavin-vg @anshbansal thank you for the report! I'm looking at the bug, but unfortunately it's not obvious what could be null here. The NPE happens at the line below:

            try (InputStream inputStream = objectOpenFunction.open(object);

objectOpenFunction cannot be null since it is always created before use. If object were null, the NPE would happen at a different line, inside open(). I think I need more time to look at this. In the meantime, you can try using the new inputSource and inputFormat; they are more intuitive than the existing firehose and more flexible to extend. If you still want to use the firehose, you can turn off prefetch by setting maxFetchCapacityBytes to 0 (prefetch isn't that useful, and it isn't supported by the new inputSource implementation).
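To illustrate the reasoning above with a standalone sketch (this uses a hypothetical stand-in for the ObjectOpenFunction interface, not Druid's actual class): when the argument passed to open() is null, the NPE's top stack frame lands inside open()'s implementation, not on the try-with-resources line itself.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class NpeDemo {
    // Hypothetical stand-in for Druid's ObjectOpenFunction, for illustration only.
    interface ObjectOpenFunction<T> {
        InputStream open(T object) throws IOException;
    }

    public static void main(String[] args) throws IOException {
        // open() dereferences its argument, as a file- or S3-backed implementation would.
        ObjectOpenFunction<File> fn = file -> new FileInputStream(file);

        // The function itself is non-null, so a null *argument* yields an NPE whose
        // top stack frame is inside open()'s implementation (FileInputStream's
        // constructor here), not the try-with-resources call site.
        try (InputStream in = fn.open(null)) {
            System.out.println("opened");
        } catch (NullPointerException e) {
            System.out.println("top frame: " + e.getStackTrace()[0].getClassName());
        }
    }
}
```

Running this prints `top frame: java.io.FileInputStream`, which is why a null object does not match the stack trace reported above (its top frame is the copyLarge lambda itself).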

@jihoonson
Contributor

I forgot to say that the new inputSource and inputFormat are available since 0.17, which we are about to release.

@mlubavin-vg
Author

thanks for the info @jihoonson !
We're OK now: changing the ioConfig from "prefixes" to "uris" and increasing "fetchTimeout" to 600000 has eliminated the problem for us.
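For anyone hitting the same thing, the change described above would look roughly like this in the ioConfig (a sketch; the uris value is a placeholder, and fetchTimeout is in milliseconds):

```json
"ioConfig": {
    "type": "index",
    "firehose": {
        "type": "static-s3",
        "uris": ["s3://bucket/prefix/data_0_0_0.json.gz"],
        "fetchTimeout": 600000
    }
}
```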

@jihoonson
Contributor

That's very interesting. @mlubavin-vg, thank you for sharing!

@mlubavin-vg
Author

An update on this - we were running more batch reprocessing in the last couple of days and started seeing this problem again, even with our "fetchTimeout" and "uris" changes.

Just now, we ran again with your suggested change of setting "maxFetchCapacityBytes" to 0, and the failure did not happen. I will update if I have more information.
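For reference, disabling prefetch as suggested would look roughly like this (a sketch; the uris value is a placeholder):

```json
"ioConfig": {
    "type": "index",
    "firehose": {
        "type": "static-s3",
        "uris": ["s3://bucket/prefix/data_0_0_0.json.gz"],
        "maxFetchCapacityBytes": 0
    }
}
```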
