
log unsuccessful shards in failed scrolls #1261

Open
tommyzli opened this issue May 28, 2020 · 4 comments

@tommyzli

Elasticsearch version (bin/elasticsearch --version): 7.6.1

elasticsearch-py version (elasticsearch.__versionstr__): 7.5.1

Description of the problem including expected versus actual behavior:

The scan() helper function only logs the number of successful vs. failed shards. It would be helpful to also log which shards failed, so I can quickly jump onto the affected node and grab the relevant server logs. That data is part of the response, but the client throws it away.

Steps to reproduce:
A call to scan(client, query, raise_on_error=True) fails and throws
ScanError("Scroll request has only succeeded on 9 (+0 skiped) shards out of 10.")

Proposed error:
ScanError("Scroll request has only succeeded on 9 (+0 skipped) shards out of 10. First failure: node 'foo', shard 'bar', reason 'reason'")

@bartier
Contributor

bartier commented Jul 6, 2020

@tommyzli Unfortunately I could not find any information about which shards/nodes were unsuccessful in the scroll API response. I may be forgetting something, but only the successful/total shard counts are present in the raw response.

The only case where I could see the scroll API return information about nodes that did not respond successfully is when the initial _scroll_id request succeeds on all shards (for example 3/3) and then, while consuming the scroll, shards become unavailable because a node is unreachable (2/3 primary shards available in the example below). Is that what you are referring to? A client-side sketch of reading these failures follows the two examples below.

  1. Request _scroll_id with all shards available (3/3 in this example)
POST /twitter/_search?scroll=1m&pretty HTTP/1.1
{
    "size": 3,
    "query": {
        "match_all" : {}
    }
}
# Response 3/3 shards
{
  "_scroll_id" : "FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxRjQTM2SVhNQk1hMC15cHkyd2o4egAAAAAAAAAYFkZzZ2tJN2JrVEMtc1RUbGcxcWl2TmcUelRyNklYTUJQQUNTcE1WendnVXcAAAAAAAAA_BYzYXp1WXJ1LVRwV1JSd004dlV2YmNRFFlldjZJWE1CY3ZEeDdNb253aDR4AAAAAAAAAAgWdnNSSTJPSVpRd0dJbUxvN3RZX3I2QQ==",
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 25,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [...]
  }
}
  2. Some shards become unavailable when consuming the _scroll_id
POST /_search/scroll?pretty HTTP/1.1
{
    "scroll" : "1m", 
    "scroll_id" : "FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxRjQTM2SVhNQk1hMC15cHkyd2o4egAAAAAAAAAYFkZzZ2tJN2JrVEMtc1RUbGcxcWl2TmcUelRyNklYTUJQQUNTcE1WendnVXcAAAAAAAAA_BYzYXp1WXJ1LVRwV1JSd004dlV2YmNRFFlldjZJWE1CY3ZEeDdNb253aDR4AAAAAAAAAAgWdnNSSTJPSVpRd0dJbUxvN3RZX3I2QQ==" 
}
# Response shows node unreachable, then 2/3 shards available
{
  "_scroll_id" : "FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxRjQTM2SVhNQk1hMC15cHkyd2o4egAAAAAAAAAYFkZzZ2tJN2JrVEMtc1RUbGcxcWl2TmcUelRyNklYTUJQQUNTcE1WendnVXcAAAAAAAAA_BYzYXp1WXJ1LVRwV1JSd004dlV2YmNRFFlldjZJWE1CY3ZEeDdNb253aDR4AAAAAAAAAAgWdnNSSTJPSVpRd0dJbUxvN3RZX3I2QQ==",
  "took" : 6,
  "timed_out" : false,
  "terminated_early" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 1,
    "failures" : [
      {
        "shard" : -1,
        "index" : null,
        "reason" : {
          "type" : "illegal_state_exception",
          "reason" : "node [FsgkI7bkTC-sTTlg1qivNg] is not available"
        }
      }
    ]
  },
  "hits" : {
    "total" : {
      "value" : 15,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [...]
  }
}
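The same check can be done from elasticsearch-py today by driving the scroll manually and inspecting _shards on every page. A minimal workaround sketch, assuming a local client and the twitter index from the example above:

from elasticsearch import Elasticsearch

# Workaround sketch: consume the scroll manually and surface any shard
# failures reported in each page's _shards section.
client = Elasticsearch()
resp = client.search(index="twitter", body={"query": {"match_all": {}}},
                     scroll="1m", size=3)
scroll_id = resp["_scroll_id"]
try:
    while resp["hits"]["hits"]:
        for failure in resp["_shards"].get("failures", []):
            print("shard failure:", failure)
        resp = client.scroll(scroll_id=scroll_id, scroll="1m")
        scroll_id = resp["_scroll_id"]
finally:
    client.clear_scroll(scroll_id=scroll_id)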

@tommyzli
Author

tommyzli commented Jul 6, 2020

@bartier yeah, the case I saw was that a shard failed after already scrolling through a few pages. I'm thinking the code should check if error messages were included in the response and log them if so.
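A minimal sketch of that idea (hypothetical, not the actual helpers.scan() source): given one page of a scroll response, log whatever "failures" the server reported before the helper raises or carries on:

import logging

logger = logging.getLogger("elasticsearch.helpers")

def log_shard_failures(resp):
    # Hypothetical helper: log any shard failures included in a scroll page.
    for failure in resp.get("_shards", {}).get("failures", []):
        logger.warning("Shard failure during scroll: %r", failure)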

@Amirilw

Amirilw commented Aug 12, 2021

Did this ever get resolved? I’m running into the same issue.

@Amirilw

Amirilw commented Aug 13, 2021

OK, after debugging this issue for a few days, splitting shards, and adding nodes, we found out that the main issue was the JVM heap size.

It was using the default of 1GB instead of 32GB like the rest of the nodes.

When we first saw it:
The issue started after new nodes joined the cluster with the same hardware spec and Elasticsearch config.

Debugging:
The Python log didn’t give us any useful information about the issue, just the error about the shards. Our monitoring system didn’t report anything either, since RAM consumption appeared to stay in normal ranges; after investigation we saw that RAM consumption was off for the new nodes (disk I/O, disk utilization, and CPU were as expected).

Cluster version : 7.8.0
Python elastic version: 5.4.x/7.8/7.13

Solution:

Configured the heap size under the JVM options to 32GB of RAM and reloaded the Elasticsearch service.
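For reference, a minimal sketch of that change, assuming the stock config/jvm.options file (the values are an example; adjust to your environment):

# config/jvm.options (example values; Xms and Xmx should be set to the same size)
-Xms32g
-Xmx32g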
