NPE due to delete-by-query with parent/child when upgrading from 1.1.1 to 1.3.x #8031

Closed
polyfractal opened this Issue Oct 8, 2014 · 6 comments

polyfractal commented Oct 8, 2014

An NPE was encountered when upgrading from 1.1.1 to 1.3.4. During the rolling upgrade, a background cron tried to execute a delete-by-query which included a parent/child query. This was allowed in 1.1.1, but disabled in later versions.

This left a delete-by-query operation queued in the translog of a 1.1.1 node. Before the translog was cleared, the shard tried to move to a 1.3.4 node, and replaying the operation there caused an NPE. The shards repeatedly failed recovery and kept bouncing around the cluster. Because allocation filtering was being used to migrate data from old to new nodes, the cluster only tried to recover the shards on 1.3.4 nodes, so the failure repeated continuously.

The situation eventually resolved itself, likely because a background flush cleared out the translog and allowed the recovery to finally proceed normally.

Stack trace (sanitized to remove sensitive names/ips):


[2014-10-08 21:43:26,881][WARN ][indices.cluster          ] [prod-1.3.4] [my_index][6] failed to start shard
org.elasticsearch.indices.recovery.RecoveryFailedException: [my_index][6]: Recovery failed from [prod-1.1.1][YhcqkTzLTGSF8dyKAQPRBQ][prod-1.1.1.localdomain][inet[...]]{aws_availability_zone=us-east-1e, max_local_storage_nodes=1} into [prod-1.3.4][0cRcLbzTTAm15PMu_R_U2w][prod-1.3.4.localdomain][inet[prod-1.3.4.localdomain/...]]{aws_availability_zone=us-east-1e, max_local_storage_nodes=1}
    at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:306)
    at org.elasticsearch.indices.recovery.RecoveryTarget.access$200(RecoveryTarget.java:65)
    at org.elasticsearch.indices.recovery.RecoveryTarget$2.run(RecoveryTarget.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [prod-1.1.1][inet[/...]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [my_index][6] Phase[2] Execution failed
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1109)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:627)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:117)
    at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:61)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:323)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [prod-1.3.4][inet[/...]][index/shard/recovery/translogOps]
Caused by: org.elasticsearch.index.query.QueryParsingException: [my_index] Failed to parse
    at org.elasticsearch.index.query.IndexQueryParserService.parseQuery(IndexQueryParserService.java:330)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareDeleteByQuery(InternalIndexShard.java:449)
    at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryOperation(InternalIndexShard.java:780)
    at org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:431)
    at org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:410)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.elasticsearch.index.query.QueryParserUtils.ensureNotDeleteByQuery(QueryParserUtils.java:36)
    at org.elasticsearch.index.query.HasParentFilterParser.parse(HasParentFilterParser.java:52)
    at org.elasticsearch.index.query.QueryParseContext.executeFilterParser(QueryParseContext.java:302)
    at org.elasticsearch.index.query.QueryParseContext.parseInnerFilter(QueryParseContext.java:283)
    at org.elasticsearch.index.query.NotFilterParser.parse(NotFilterParser.java:63)
    at org.elasticsearch.index.query.QueryParseContext.executeFilterParser(QueryParseContext.java:302)
    at org.elasticsearch.index.query.QueryParseContext.parseInnerFilter(QueryParseContext.java:283)
    at org.elasticsearch.index.query.FilteredQueryParser.parse(FilteredQueryParser.java:74)
    at org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:239)
    at org.elasticsearch.index.query.IndexQueryParserService.innerParse(IndexQueryParserService.java:342)
    at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:268)
    at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:263)
    at org.elasticsearch.index.query.IndexQueryParserService.parseQuery(IndexQueryParserService.java:314)
    ... 8 more

martijnvg commented Oct 9, 2014

This is bad. First of all, the actual exception should be a QueryParsingException with a message saying that p/c queries are unsupported in the delete by query API. Second, I think the translog should just skip an operation if it fails with a QueryParsingException.

s1monw commented Oct 9, 2014

@martijnvg can we somehow reproduce this with a bwc test? Just curious. I think we should work on something with @dakrone to be able to skip individual operations in the translog, maybe even as a standalone tool? @dakrone any ideas?

martijnvg commented Oct 9, 2014

@s1monw I'm sure that this can be reproduced in a bwc test :)

clintongormley commented Oct 15, 2014

@martijnvg assigned this to you, but perhaps @dakrone is the person best placed to look at this?

martijnvg commented Oct 21, 2014

This issue is less severe than I initially thought. What it boils down to is that any delete by query translog operation with a p/c query is simply ignored, while all other translog operations are executed successfully and the shard gets assigned.

The NPE is annoying (and I will fix it), but it gets wrapped in a QueryParsingException (in IndexQueryParserService#parseQuery(...), line 370), and because of this LocalIndexShardGateway#recover(...) at line 276 ignores the delete by query operation. The status of a QueryParsingException is treated as a bad request, so the idea here is to ignore it.
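
To make the behaviour described above concrete, here is a minimal, self-contained sketch of a replay loop that skips an operation when it fails with a "bad request" style exception and keeps applying the rest. It is not the Elasticsearch code: `TranslogReplaySketch`, `BadRequestException`, `Operation`, `replay` and `demoOperations` are all hypothetical names invented for illustration. Letting every other exception type propagate is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TranslogReplaySketch {

    /** Hypothetical stand-in for an exception that maps to a "bad request" status. */
    static class BadRequestException extends RuntimeException {
        BadRequestException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for one translog operation: a name plus the work to replay. */
    static final class Operation {
        final String name;
        final Runnable body;

        Operation(String name, Runnable body) {
            this.name = name;
            this.body = body;
        }
    }

    /**
     * Replays operations in order and returns the names of those that were applied.
     * Operations failing with BadRequestException are skipped; any other exception
     * propagates and aborts the replay (an assumption of this sketch).
     */
    static List<String> replay(List<Operation> operations) {
        List<String> applied = new ArrayList<>();
        for (Operation op : operations) {
            try {
                op.body.run();
                applied.add(op.name);
            } catch (BadRequestException e) {
                // The operation cannot succeed on this node, so drop it and continue.
                System.out.println("ignoring " + op.name + ": " + e.getMessage());
            }
        }
        return applied;
    }

    /** Three operations; the middle one mimics an NPE wrapped in a parse failure. */
    static List<Operation> demoOperations() {
        return Arrays.asList(
            new Operation("index-1", () -> { }),
            new Operation("delete-by-query", () -> {
                throw new BadRequestException("Failed to parse", new NullPointerException());
            }),
            new Operation("index-2", () -> { }));
    }

    public static void main(String[] args) {
        System.out.println("applied: " + replay(demoOperations()));
    }
}
```

In this sketch only the delete by query entry is dropped; the two surrounding operations are still applied, which mirrors the "only that operation is ignored" outcome described in the comment.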

martijnvg commented Oct 21, 2014

I opened this PR for the NPE during recovery: #8177

martijnvg added a commit that referenced this issue Oct 22, 2014

Parent/child: Check if there is a search context, otherwise throw a query parse exception.

Also added a bwc test that runs a delete by query with a has_child query and verifies that only that operation is ignored when recovering from disk during an upgrade.

Closes #8031
Closes #8177
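
The idea in the commit title (check for a search context first, otherwise throw a query parse exception) can be sketched as below. This is an illustration under stated assumptions, not the actual patch: `DeleteByQueryGuardSketch`, `CURRENT_SOURCE`, `QueryParseException` and both method names are hypothetical. A `ThreadLocal<String>` stands in for whatever per-thread search context the real code consults, and the exception messages are invented.

```java
public class DeleteByQueryGuardSketch {

    /** Hypothetical per-thread context; null when no search is running (e.g. during recovery). */
    static final ThreadLocal<String> CURRENT_SOURCE = new ThreadLocal<>();

    /** Hypothetical marker value identifying a delete by query request. */
    static final String DELETE_BY_QUERY = "delete_by_query";

    /** Hypothetical stand-in for a query parse exception. */
    static class QueryParseException extends RuntimeException {
        QueryParseException(String message) {
            super(message);
        }
    }

    /** Unguarded shape: dereferences the context without checking it. */
    static void ensureNotDeleteByQueryUnchecked(String queryName) {
        String source = CURRENT_SOURCE.get();
        // Throws NullPointerException when no context is set for this thread.
        if (source.equals(DELETE_BY_QUERY)) {
            throw new QueryParseException("[" + queryName + "] unsupported in delete_by_query api");
        }
    }

    /** Guarded shape: a missing context becomes a parse exception instead of an NPE. */
    static void ensureNotDeleteByQuery(String queryName) {
        String source = CURRENT_SOURCE.get();
        if (source == null) {
            throw new QueryParseException("[" + queryName + "] query and filter requires a search context");
        }
        if (DELETE_BY_QUERY.equals(source)) {
            throw new QueryParseException("[" + queryName + "] unsupported in delete_by_query api");
        }
    }

    public static void main(String[] args) {
        try {
            ensureNotDeleteByQuery("has_parent");
        } catch (QueryParseException e) {
            System.out.println("parse failure: " + e.getMessage());
        }
    }
}
```

With the guard in place, a caller that already treats parse failures as skippable sees a descriptive parse exception rather than a bare NullPointerException.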

@martijnvg martijnvg closed this in 319878e Oct 22, 2014

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015

Parent/child: Check if there is a search context, otherwise throw a query parse exception.

Also added a bwc test that runs a delete by query with a has_child query and verifies that only that operation is ignored when recovering from disk during an upgrade.

Closes elastic#8031
Closes elastic#8177
