
Ingestion task fails with NullPointerException during BUILD_SEGMENTS phase #8835

Open
mlubavin-vg opened this issue Nov 6, 2019 · 10 comments


@mlubavin-vg

Affected Version

0.16.0-incubating

I am fairly sure this did not happen with 0.12.3 (we are currently upgrading, and have upgraded our test environment so far).

Description

I am using native index tasks to ingest data into Druid (they override data already in that interval). I submit about 30 tasks all at once, and they get queued up and processed in the middle managers and peons.

Every time I run this, several of the index tasks fail (their status in the UI is FAILED), and I find this stacktrace in the MiddleManager logs:

2019-11-01T06:46:32,759 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.io.IOException: java.lang.NullPointerException
	at org.apache.druid.data.input.impl.prefetch.Fetcher.checkFetchException(Fetcher.java:199) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.next(Fetcher.java:170) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.PrefetchableTextFilesFirehoseFactory$2.next(PrefetchableTextFilesFirehoseFactory.java:242) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.PrefetchableTextFilesFirehoseFactory$2.next(PrefetchableTextFilesFirehoseFactory.java:228) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.FileIteratingFirehose.getNextLineIterator(FileIteratingFirehose.java:107) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.FileIteratingFirehose.hasMore(FileIteratingFirehose.java:68) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.FiniteFirehoseProcessor.process(FiniteFirehoseProcessor.java:98) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:859) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.IndexTask.runTask(IndexTask.java:467) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:137) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:419) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:391) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: java.lang.NullPointerException
	at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_222]
	at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_222]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.checkFetchException(Fetcher.java:190) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	... 15 more
Caused by: java.io.IOException: java.lang.NullPointerException
	at org.apache.druid.java.util.common.FileUtils.copyLarge(FileUtils.java:305) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.FileFetcher.download(FileFetcher.java:89) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.fetch(Fetcher.java:134) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.lambda$fetchIfNeeded$0(Fetcher.java:110) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	... 4 more
Caused by: java.lang.NullPointerException
	at org.apache.druid.java.util.common.FileUtils.lambda$copyLarge$1(FileUtils.java:293) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:86) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:125) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.java.util.common.FileUtils.copyLarge(FileUtils.java:291) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.FileFetcher.download(FileFetcher.java:89) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.fetch(Fetcher.java:134) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.data.input.impl.prefetch.Fetcher.lambda$fetchIfNeeded$0(Fetcher.java:110) ~[druid-core-0.16.0-incubating.jar:0.16.0-incubating]
	... 4 more

Info:
In this test environment, I have a single MiddleManager with a task capacity of 2, and also a realtime kafka ingestion task running. In my production environment, I have 2 middle managers, 2 historicals, 2 coordinator/overlords, and 2 brokers.

I am using S3 for deep storage.

The tasks that I submit look like this:

{
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "redacted",
            "metricsSpec": metrics_spec,
            "granularitySpec": {
                "segmentGranularity": "HOUR",
                "queryGranularity": "NONE",
                "intervals": intervals
            },
            "parser": parser
        },
        "ioConfig": {
            "type": "index",
            "firehose": {
                "type": "static-s3",
                "prefixes": s3_prefixes
            }
        }
    }
}
@mlubavin-vg
Author

Here is the log line for the fetch attempt on the file that fails:
2019-11-01T06:43:37,098 INFO [firehose_fetch_0] org.apache.druid.data.input.impl.prefetch.Fetcher - Fetching [0]th object[s3://redacted/data/redacted/v0005/2019/01/10/12/data_0_0_0.json.gz], fetchedBytes[0]

@mlubavin-vg
Author

mlubavin-vg commented Nov 6, 2019

Looking at the git history, I see that the FileUtils.copyLarge method containing the NPE was added in August 2019. This is the line throwing the NPE:
try (InputStream inputStream = objectOpenFunction.open(object);

@anshbansal
Contributor

This seems to be the PR that introduced these changes: #8257

@anshbansal
Contributor

One observed behaviour related to this: it is not always reproducible. The same task succeeded in the same environment at other times.

That suggests something environment-dependent. Going through the Druid code, it seems fetches are retried (default 3 retries) within a timeout (default 60s). This could be an environment issue causing those fetches not to complete within that window. But the NullPointerException should still be improved on Druid's side; in its current form, this exception makes the problem hard to pinpoint.
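For reference, the retry and timeout behaviour mentioned above is configurable on the prefetching firehose. A sketch of an ioConfig with those knobs made explicit (parameter names as documented for the static-s3 firehose; the values shown are the documented defaults, and the prefixes value is a placeholder):

```json
"ioConfig": {
    "type": "index",
    "firehose": {
        "type": "static-s3",
        "prefixes": ["s3://bucket/prefix/"],
        "fetchTimeout": 60000,
        "maxFetchRetry": 3
    }
}
```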

@anshbansal
Contributor

Another observation: changing from prefixes to uris decreased the number of failures. Maybe the 60-second timeout covers listing the files as well as fetching them? That would make uris more reliable than prefixes.

@jon-wei jon-wei added this to the 0.17.0 milestone Dec 10, 2019
@jon-wei jon-wei removed this from the 0.17.0 milestone Dec 18, 2019
@jihoonson
Contributor

@mlubavin-vg @anshbansal thank you for the report! I'm looking at the bug, but unfortunately it's not obvious what could be null here. The NPE happens at the line below:

            try (InputStream inputStream = objectOpenFunction.open(object);

objectOpenFunction cannot be null since it is always created before use. If object were null, the NPE would happen at a different line, inside open(). I think I need more time to look at this. In the meantime, you can try using the new inputSource and inputFormat; they are more intuitive than the existing firehose and more flexible to extend. If you still want to use the firehose, you can turn off prefetch by setting maxFetchCapacityBytes to 0 (prefetch isn't that useful, and it isn't supported by the new inputSource implementation).
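To illustrate the reasoning above with a standalone sketch (this uses a hypothetical stand-in for the ObjectOpenFunction interface, not Druid's actual class): when the argument passed to open() is null, the NPE's top stack frame lands inside open()'s implementation, not on the try-with-resources line itself.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class NpeDemo {
    // Hypothetical stand-in for Druid's ObjectOpenFunction, for illustration only.
    interface ObjectOpenFunction<T> {
        InputStream open(T object) throws IOException;
    }

    public static void main(String[] args) throws IOException {
        // open() dereferences its argument, as a file- or S3-backed implementation would.
        ObjectOpenFunction<File> fn = file -> new FileInputStream(file);

        // The function itself is non-null, so a null *argument* yields an NPE whose
        // top stack frame is inside open()'s implementation (FileInputStream's
        // constructor here), not the try-with-resources call site.
        try (InputStream in = fn.open(null)) {
            System.out.println("opened");
        } catch (NullPointerException e) {
            System.out.println("top frame: " + e.getStackTrace()[0].getClassName());
        }
    }
}
```

Running this prints `top frame: java.io.FileInputStream`, which is why a null object does not match the stack trace reported above (its top frame is the copyLarge lambda itself).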

@jihoonson
Contributor

I forgot to say that the new inputSource and inputFormat are available since 0.17, which we are about to release.

@mlubavin-vg
Author

thanks for the info @jihoonson !
We're OK now: changing the ioConfig from "prefixes" to "uris" and increasing "fetchTimeout" to 600000 has eliminated the problem for us.
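For anyone hitting the same thing, the change described above would look roughly like this in the ioConfig (a sketch; the uris value is a placeholder, and fetchTimeout is in milliseconds):

```json
"ioConfig": {
    "type": "index",
    "firehose": {
        "type": "static-s3",
        "uris": ["s3://bucket/prefix/data_0_0_0.json.gz"],
        "fetchTimeout": 600000
    }
}
```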

@jihoonson
Contributor

That's very interesting. @mlubavin-vg, thank you for sharing!

@mlubavin-vg
Author

An update on this - we were running more batch reprocessing in the last couple of days and started seeing this problem again, even with our "fetchTimeout" and "uris" changes.

Just now, we ran again with your suggested change of setting "maxFetchCapacityBytes" to 0, and the failure did not happen. I will update if I have more information.
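For reference, disabling prefetch as suggested would look roughly like this (a sketch; the uris value is a placeholder):

```json
"ioConfig": {
    "type": "index",
    "firehose": {
        "type": "static-s3",
        "uris": ["s3://bucket/prefix/data_0_0_0.json.gz"],
        "maxFetchCapacityBytes": 0
    }
}
```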
