Querying a realtime datasource may throw NullPointerException while a segment is being unannounced. #12168

Closed
binlijin opened this issue Jan 19, 2022 · 1 comment · Fixed by #15260 or #15373

Comments

@binlijin
Contributor


Affected Version

0.22.1

Description

When the Broker processes a query, it dispatches subqueries to different nodes. If a peon is processing a subquery for a segment while that segment is being unannounced, the subquery may fail with a NullPointerException.

2022-01-16T00:12:42,443 INFO [[index_kafka_monitor_alert_7321a5cf7c99960_aoekboeg]-appenderator-persist] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[monitor_alert_2022-01-16T08:00:00.000Z_2022-01-16T09:00:00.000Z_2022-01-16T00:00:00.158Z_112] at path[/druid/segments/9.138.162.20:8106_indexer-executor__default_tier_2022-01-15T23:22:41.747Z_c7cd5c7591a24f4cb29aef61d58c107d0]
2022-01-16T00:12:42,467 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-01-16T00:12:42,649 ERROR [processing-0] org.apache.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2 - Exception with one of the sequences!
java.lang.NullPointerException: null
at org.apache.druid.segment.realtime.FireHydrant.getSegmentForQuery(FireHydrant.java:180) ~[druid-server-0.22.0.jar:0.22.0]
at org.apache.druid.segment.realtime.appenderator.SinkQuerySegmentWalker.lambda$null$3(SinkQuerySegmentWalker.java:216) ~[druid-server-0.22.0.jar:0.22.0]
at com.google.common.collect.Iterators$8.transform(Iterators.java:794) ~[guava-16.0.1.jar:?]
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48) ~[guava-16.0.1.jar:?]
at org.apache.druid.query.SinkQueryRunners$1.next(SinkQueryRunners.java:56) ~[druid-processing-0.22.0.jar:0.22.0]
at org.apache.druid.query.SinkQueryRunners$1.next(SinkQueryRunners.java:46) ~[druid-processing-0.22.0.jar:0.22.0]
at com.google.common.collect.Iterators$7.computeNext(Iterators.java:646) ~[guava-16.0.1.jar:?]
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.1.jar:?]
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.1.jar:?]
at com.google.common.collect.TransformedIterator.hasNext(TransformedIterator.java:43) ~[guava-16.0.1.jar:?]
at com.google.common.collect.Iterators.addAll(Iterators.java:356) ~[guava-16.0.1.jar:?]
at com.google.common.collect.Lists.newArrayList(Lists.java:147) ~[guava-16.0.1.jar:?]
at com.google.common.collect.Lists.newArrayList(Lists.java:129) ~[guava-16.0.1.jar:?]
at org.apache.druid.query.ChainedExecutionQueryRunner$1.make(ChainedExecutionQueryRunner.java:92) ~[druid-processing-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.BaseSequence.accumulate(BaseSequence.java:39) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.LazySequence.accumulate(LazySequence.java:40) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.LazySequence.accumulate(LazySequence.java:40) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.query.CPUTimeMetricQueryRunner$1.wrap(CPUTimeMetricQueryRunner.java:78) ~[druid-processing-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.query.spec.SpecificSegmentQueryRunner$1.accumulate(SpecificSegmentQueryRunner.java:86) ~[druid-processing-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.query.spec.SpecificSegmentQueryRunner.doNamed(SpecificSegmentQueryRunner.java:170) ~[druid-processing-0.22.0.jar:0.22.0]
at org.apache.druid.query.spec.SpecificSegmentQueryRunner.access$100(SpecificSegmentQueryRunner.java:43) ~[druid-processing-0.22.0.jar:0.22.0]
at org.apache.druid.query.spec.SpecificSegmentQueryRunner$2.wrap(SpecificSegmentQueryRunner.java:152) ~[druid-processing-0.22.0.jar:0.22.0]
at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45) ~[druid-core-0.22.0.jar:0.22.0]
at org.apache.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2$1$1$1.call(GroupByMergingQueryRunnerV2.java:245) [druid-processing-0.22.0.jar:0.22.0]
at org.apache.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2$1$1$1.call(GroupByMergingQueryRunnerV2.java:232) [druid-processing-0.22.0.jar:0.22.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_272]
at org.apache.druid.query.PrioritizedListenableFutureTask.run(PrioritizedExecutorService.java:247) [druid-processing-0.22.0.jar:0.22.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_272]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_272]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_272]
2022-01-16T00:12:42,683 ERROR [qtp191953464-298[groupBy_[monitor_alert]_59354191-cf20-40fc-b6e9-a1a322a54a7f]] org.apache.druid.server.QueryLifecycle - Exception while processing queryId [59354191-cf20-40fc-b6e9-a1a322a54a7f] (java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.NullPointerException)
2022-01-16T00:12:42,802 ERROR [qtp191953464-298[groupBy_[monitor_alert]_59354191-cf20-40fc-b6e9-a1a322a54a7f]] org.apache.druid.server.QueryResource - Exception handling request: {class=org.apache.druid.server.QueryResource, exceptionType=class java.lang.RuntimeException, exceptionMessage=java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.NullPointerException, query={"queryType":"groupBy","dataSource":{"type":"table","name":"monitor_alert_to_analysis"},"intervals":{"type":"segments","segments":[{"itvl":"2022-01-16T07:00:00.000Z/2022-01-16T08:00:00.000Z","ver":"2022-01-15T23:00:17.052Z","part":1499},{"itvl":"2022-01-16T08:00:00.000Z/2022-01-16T09:00:00.000Z","ver":"2022-01-16T00:00:00.158Z","part":112},{"itvl":"2022-01-16T08:00:00.000Z/2022-01-16T09:00:00.000Z","ver":"2022-01-16T00:00:00.158Z","part":268}]},"virtualColumns":[{"type":"expression","name":"v0","expression":"(("data_time" + 28800) * 1000)","outputType":"LONG"}],"filter":{"type":"and","fields":[{"type":"selector","dimension":"app_mark","value":"895_4455_cos_53","extractionFn":null},{"type":"selector","dimension":"metric","value":"total_req","extractionFn":null},{"type":"selector","dimension":"tag12","value":"[云][COS]","extractionFn":null},{"type":"selector","dimension":"tag13","value":"[COS]","extractionFn":null},{"type":"selector","dimension":"tag14","value":"[coshttpsvr]","extractionFn":null},{"type":"bound","dimension":"v0","lower":"1642313400000","upper":"1642320600000","lowerStrict":false,"upperStrict":false,"extractionFn":null,"ordering":{"type":"numeric"}}]},"granularity":{"type":"all"},"dimensions":[{"type":"default","dimension":"tag20","outputName":"d0","outputType":"STRING"}],"aggregations":[],"postAggregations":[],"having":null,"limitSpec":{"type":"NoopLimitSpec"},"context":{"applyLimitPushDown":false,"defaultTimeout":300000,"finalize":false,"fudgeTimestamp":"-4611686018427387904","groupByOutermost":false,"groupByStrategy":"v2","maxQueuedBytes":41841,"maxScatterGatherBytes":9223372036854775807,"queryFailTime":1642292259982,"queryId":"59354191-cf20-40fc-b6e9-a1a322a54a7f","resultAsArray":true,"sqlQueryId":"12b20021-2583-4763-864a-36d27086ab51","timeout":299544},"descending":false}, peer=9.138.162.166} (java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.NullPointerException)

@binlijin
Contributor Author

A simple workaround is to sleep for some time after unannouncing the segment and before actually dropping it.
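
To make the suggested workaround concrete, here is a minimal Java sketch of the ordering it implies: unannounce, wait for a grace period, then drop. `DelayedDropSketch`, `Announcer`, `Dropper`, and `graceMillis` are hypothetical names used only for illustration; they are not Druid classes or configuration options.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins for illustration only; these are not Druid classes.
final class DelayedDropSketch {
  interface Announcer { void unannounce(String segmentId); }
  interface Dropper { void drop(String segmentId); }

  // Unannounce first, wait for a grace period, and only then drop the segment.
  static void unannounceThenDrop(String segmentId, Announcer announcer, Dropper dropper, long graceMillis) {
    announcer.unannounce(segmentId);
    try {
      // Gives queries dispatched before the unannouncement some time to finish.
      TimeUnit.MILLISECONDS.sleep(graceMillis);
    } catch (InterruptedException e) {
      // Restore the interrupt flag and leave the segment in place.
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted before dropping " + segmentId, e);
    }
    dropper.drop(segmentId);
  }
}
```

Note that a fixed delay only narrows the window: a query that is still running when the grace period ends can hit the same condition.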

gianm added a commit to gianm/druid that referenced this issue Oct 26, 2023
This can happen if the segment is removed while a query is in progress.
Returning empty causes the server to use ReportTimelineMissingSegmentQueryRunner,
which causes the Broker to look for the segment somewhere else.

Fixes apache#12168.
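
The idea in this commit message can be sketched roughly as follows. This is a simplified illustration, not the code from the patch: `HydrantSketch`, `swapToNull`, `segmentForQuery`, and the `"MISSING_SEGMENT"` marker are hypothetical stand-ins for FireHydrant and ReportTimelineMissingSegmentQueryRunner. The point is that the accessor reads the reference once and returns empty when the segment has been swapped to null, so the caller can report the segment as missing instead of dereferencing null.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in for a hydrant; not Druid's FireHydrant.
final class HydrantSketch {
  private final AtomicReference<String> segment;

  HydrantSketch(String segment) {
    this.segment = new AtomicReference<>(segment);
  }

  // Simulates the segment being swapped to null while it is being removed.
  void swapToNull() {
    segment.set(null);
  }

  // Reads the reference once and returns empty rather than exposing null.
  Optional<String> segmentForQuery() {
    return Optional.ofNullable(segment.get());
  }

  // Caller side: an empty result becomes a "missing segment" report, not an exception.
  static String runQuery(HydrantSketch hydrant) {
    return hydrant.segmentForQuery()
        .map(s -> "RESULTS(" + s + ")")
        .orElse("MISSING_SEGMENT");
  }
}
```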
gianm added a commit to gianm/druid that referenced this issue Oct 26, 2023
…-segment retry bug.

Fixes apache#12168, by returning empty from FireHydrant when the segment is
swapped to null. This causes the SinkQuerySegmentWalker to use
ReportTimelineMissingSegmentQueryRunner, which causes the Broker to look
for the segment somewhere else.

In addition, this patch changes SinkQuerySegmentWalker to acquire references
to all hydrants (subsegments of a sink) at once, and return a
ReportTimelineMissingSegmentQueryRunner if *any* of them could not be acquired.
I suspect, although have not confirmed, that the prior behavior could lead to
segments being reported as missing even though results from some hydrants were
still included.
gianm added a commit that referenced this issue Nov 20, 2023
…g-segment retry bug. (#15260)

* Fix NPE caused by realtime segment closing race, fix possible missing-segment retry bug.

Fixes #12168, by returning empty from FireHydrant when the segment is
swapped to null. This causes the SinkQuerySegmentWalker to use
ReportTimelineMissingSegmentQueryRunner, which causes the Broker to look
for the segment somewhere else.

In addition, this patch changes SinkQuerySegmentWalker to acquire references
to all hydrants (subsegments of a sink) at once, and return a
ReportTimelineMissingSegmentQueryRunner if *any* of them could not be acquired.
I suspect, although have not confirmed, that the prior behavior could lead to
segments being reported as missing even though results from some hydrants were
still included.

* Some more test coverage.
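
The "acquire all hydrants at once" behavior described in this commit message could look something like the sketch below. It is an assumption-laden illustration rather than the patch itself: `AcquireAllSketch`, `Ref`, and `acquireAll` are hypothetical names, and each hydrant is modeled as a `Supplier` that may fail to hand out a reference. If any acquisition fails, the references already held are released and the result is empty, so the whole segment is reported as missing and no partial results are included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical illustration of all-or-nothing acquisition; not Druid's SinkQuerySegmentWalker.
final class AcquireAllSketch {
  // A held reference to one hydrant, with a release hook.
  static final class Ref {
    final String name;
    boolean released = false;

    Ref(String name) {
      this.name = name;
    }

    void release() {
      released = true;
    }
  }

  // Returns all references, or empty (after releasing those already held) if any one is unavailable.
  static Optional<List<Ref>> acquireAll(List<Supplier<Optional<Ref>>> hydrants) {
    List<Ref> acquired = new ArrayList<>();
    for (Supplier<Optional<Ref>> hydrant : hydrants) {
      Optional<Ref> ref = hydrant.get();
      if (!ref.isPresent()) {
        for (Ref held : acquired) {
          held.release();
        }
        return Optional.empty();
      }
      acquired.add(ref.get());
    }
    return Optional.of(acquired);
  }
}
```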