NPE due to delete-by-query with parent/child when upgrading from 1.1.1 to 1.3.x #8031
An NPE was encountered when upgrading from 1.1.1 to 1.3.4. During the rolling upgrade, a background cron tried to execute a delete-by-query which included a parent/child query. This was allowed in 1.1.1, but disabled in later versions.
This caused a delete-by-query to queue up in the translog of a 1.1.1 node. Before the translog was cleared, the shard tried to move to a 1.3.4 node, which caused an NPE. The shards repeatedly failed recovery and kept bouncing around the cluster. Because allocation filtering was being used to migrate data from old to new nodes, the cluster tried to recover the shards only on 1.3.4 nodes, leading to continuous failure.
The situation eventually resolved itself, likely because a background flush cleared out the translog and allowed the recovery to finally proceed normally.
Stack trace (sanitized to remove sensitive names/ips):
This is bad. First of all, the actual exception should be a
This issue is less severe than I initially thought. What it boils down to is that any delete-by-query translog operation with a p/c query is simply ignored, while all other translog operations are executed successfully and the shard gets assigned.
The NPE is annoying (and I will fix it), but it gets wrapped in a QueryParsingException (in IndexQueryParserService#parseQuery(...), line 370), and because of this LocalIndexShardGateway#recover(...) ignores the delete-by-query operation at line 276. The status of a QueryParsingException is treated as a bad request, so the idea here is to ignore the operation.
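To illustrate the behaviour described above, here is a minimal, self-contained sketch of a replay loop that skips an operation whose failure maps to a bad-request status and carries on with the rest. This is not the actual Elasticsearch code: `TranslogReplaySketch`, `StatusException`, `Operation`, and the numeric status `400` are all hypothetical stand-ins chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names exist in Elasticsearch.
class TranslogReplaySketch {

    // Stand-in for an exception that carries an HTTP-style status code.
    static class StatusException extends RuntimeException {
        final int status;

        StatusException(String message, int status, Throwable cause) {
            super(message, cause);
            this.status = status;
        }
    }

    // Stand-in for a translog operation. The flag marks an operation whose
    // query can no longer be parsed (e.g. a delete-by-query with a p/c query).
    static class Operation {
        final String name;
        final boolean unparseable;

        Operation(String name, boolean unparseable) {
            this.name = name;
            this.unparseable = unparseable;
        }
    }

    // Applies one operation. A parser failure (simulated here as an NPE)
    // is wrapped in an exception whose status is "bad request" (assumed 400).
    static void apply(Operation op, List<String> applied) {
        try {
            if (op.unparseable) {
                throw new NullPointerException("simulated parser failure");
            }
            applied.add(op.name);
        } catch (NullPointerException e) {
            throw new StatusException("failed to parse query", 400, e);
        }
    }

    // Replays all operations; bad-request failures are skipped,
    // any other failure aborts the replay.
    static List<String> replay(List<Operation> ops) {
        List<String> applied = new ArrayList<>();
        for (Operation op : ops) {
            try {
                apply(op, applied);
            } catch (StatusException e) {
                if (e.status == 400) {
                    continue; // ignore this operation, keep recovering
                }
                throw e;
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        List<Operation> ops = Arrays.asList(
                new Operation("index-1", false),
                new Operation("delete-by-query-pc", true),
                new Operation("index-2", false));
        System.out.println(replay(ops));
    }
}
```

In this sketch the delete-by-query is silently dropped while the surrounding operations are applied, which mirrors why the shard can still be assigned once recovery proceeds.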