
log unsuccessful shards in failed scrolls #1261

Open
tommyzli opened this issue May 28, 2020 · 4 comments

@tommyzli

Elasticsearch version (bin/elasticsearch --version): 7.6.1

elasticsearch-py version (elasticsearch.__versionstr__): 7.5.1

Description of the problem including expected versus actual behavior:

The scan() helper function only logs the number of successful vs. failed shards. It would be helpful to also log which shards failed, so I can quickly jump onto the affected node and grab the relevant server logs. That data is part of the response, but the client throws it away.

Steps to reproduce:
A call to scan(client, query, raise_on_error=True) fails and throws
ScanError("Scroll request has only succeeded on 9 (+0 skiped) shards out of 10.")

Proposed error:
ScanError("Scroll request has only succeeded on 9 (+0 skipped) shards out of 10. First failure: node 'foo', shard 'bar', reason 'reason'")

@bartier
Contributor

bartier commented Jul 6, 2020

@tommyzli Unfortunately I could not find any information about which shards/nodes were unsuccessful in the scroll API response. I may be forgetting something, but only the successful/total shard counts are present in the raw response.

The only case where I could see the scroll API return information about nodes that did not respond successfully is when the initial _scroll_id request succeeds on all shards (for example 3/3) and then, while consuming the scroll, shards become unavailable because a node is unreachable (2/3 primary shards available in the example below). Is that what you are referring to? A client-side sketch of reading these failures follows the two examples below.

  1. Request _scroll_id with all shards available (3/3 in this example)
POST /twitter/_search?scroll=1m&pretty HTTP/1.1
{
    "size": 3,
    "query": {
        "match_all" : {}
    }
}
# Response 3/3 shards
{
  "_scroll_id" : "FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxRjQTM2SVhNQk1hMC15cHkyd2o4egAAAAAAAAAYFkZzZ2tJN2JrVEMtc1RUbGcxcWl2TmcUelRyNklYTUJQQUNTcE1WendnVXcAAAAAAAAA_BYzYXp1WXJ1LVRwV1JSd004dlV2YmNRFFlldjZJWE1CY3ZEeDdNb253aDR4AAAAAAAAAAgWdnNSSTJPSVpRd0dJbUxvN3RZX3I2QQ==",
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 25,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [...]
  }
}
  2. Some shards become unavailable when consuming the _scroll_id
POST /_search/scroll?pretty HTTP/1.1
{
    "scroll" : "1m", 
    "scroll_id" : "FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxRjQTM2SVhNQk1hMC15cHkyd2o4egAAAAAAAAAYFkZzZ2tJN2JrVEMtc1RUbGcxcWl2TmcUelRyNklYTUJQQUNTcE1WendnVXcAAAAAAAAA_BYzYXp1WXJ1LVRwV1JSd004dlV2YmNRFFlldjZJWE1CY3ZEeDdNb253aDR4AAAAAAAAAAgWdnNSSTJPSVpRd0dJbUxvN3RZX3I2QQ==" 
}
# Response shows node unreachable, then 2/3 shards available
{
  "_scroll_id" : "FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxRjQTM2SVhNQk1hMC15cHkyd2o4egAAAAAAAAAYFkZzZ2tJN2JrVEMtc1RUbGcxcWl2TmcUelRyNklYTUJQQUNTcE1WendnVXcAAAAAAAAA_BYzYXp1WXJ1LVRwV1JSd004dlV2YmNRFFlldjZJWE1CY3ZEeDdNb253aDR4AAAAAAAAAAgWdnNSSTJPSVpRd0dJbUxvN3RZX3I2QQ==",
  "took" : 6,
  "timed_out" : false,
  "terminated_early" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 1,
    "failures" : [
      {
        "shard" : -1,
        "index" : null,
        "reason" : {
          "type" : "illegal_state_exception",
          "reason" : "node [FsgkI7bkTC-sTTlg1qivNg] is not available"
        }
      }
    ]
  },
  "hits" : {
    "total" : {
      "value" : 15,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [...]
  }
}
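The same check can be done from elasticsearch-py today by driving the scroll manually and inspecting _shards on every page. A minimal workaround sketch, assuming a local client and the twitter index from the example above:

from elasticsearch import Elasticsearch

# Workaround sketch: consume the scroll manually and surface any shard
# failures reported in each page's _shards section.
client = Elasticsearch()
resp = client.search(index="twitter", body={"query": {"match_all": {}}},
                     scroll="1m", size=3)
scroll_id = resp["_scroll_id"]
try:
    while resp["hits"]["hits"]:
        for failure in resp["_shards"].get("failures", []):
            print("shard failure:", failure)
        resp = client.scroll(scroll_id=scroll_id, scroll="1m")
        scroll_id = resp["_scroll_id"]
finally:
    client.clear_scroll(scroll_id=scroll_id)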

@tommyzli
Author

tommyzli commented Jul 6, 2020

@bartier yeah, the case I saw was that a shard failed after already scrolling through a few pages. I'm thinking the code should check if error messages were included in the response and log them if so.
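A minimal sketch of that idea (hypothetical, not the actual helpers.scan() source): given one page of a scroll response, log whatever "failures" the server reported before the helper raises or carries on:

import logging

logger = logging.getLogger("elasticsearch.helpers")

def log_shard_failures(resp):
    # Hypothetical helper: log any shard failures included in a scroll page.
    for failure in resp.get("_shards", {}).get("failures", []):
        logger.warning("Shard failure during scroll: %r", failure)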

@Amirilw

Amirilw commented Aug 12, 2021

Did this ever get resolved? I’m running into the same issue.

@Amirilw

Amirilw commented Aug 13, 2021

OK, after debugging this issue for a few days, splitting shards, and adding nodes, we found out that the main issue was the JVM heap size.

It was using the default of 1GB instead of 32GB like the rest of the nodes.

When we first saw it:
The issue started after new nodes joined the cluster with the same hardware spec and Elasticsearch config.

Debugging:
The Python log didn’t give us any useful information about the issue, just the error about the shards. Our monitoring system didn’t report anything either, since RAM consumption appeared to stay in normal ranges; after investigation we saw that RAM consumption was off for the new nodes (disk I/O, disk utilization, and CPU were as expected).

Cluster version : 7.8.0
Python elastic version: 5.4.x/7.8/7.13

Solution:

Configured the heap size under the JVM options to 32GB of RAM and reloaded the Elasticsearch service.
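For reference, a minimal sketch of that change, assuming the stock config/jvm.options file (the values are an example; adjust to your environment):

# config/jvm.options (example values; Xms and Xmx should be set to the same size)
-Xms32g
-Xmx32g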
