Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Productionize: GetShardRegionStats returns empty shard set on ask timeout #27100

Closed
migesok opened this issue Jun 6, 2019 · 4 comments
Closed
Assignees
Labels
1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted 3 - in progress Someone is working on this ticket t:cluster-sharding
Milestone

Comments

@migesok
Copy link

migesok commented Jun 6, 2019

Akka version: 2.5.22

When some of the shards are too busy GetShardRegionStats starts to return an empty shard set even though I see that some shards are handling requests. Consequently the same behaviour is in cluster/shards Akka Management HTTP API as it uses GetShardRegionStats.
Also it can be seen in the implementation code:

  def replyToRegionStatsQuery(ref: ActorRef): Unit = {
    askAllShards[Shard.ShardStats](Shard.GetShardStats)
      .map { shardStats =>
        ShardRegionStats(shardStats.map {
          case (shardId, stats) => (shardId, stats.entityCount)
        }.toMap)
      }
      .recover {
        case x: AskTimeoutException => ShardRegionStats(Map.empty)
      }
      .pipeTo(ref)
  }

And the ask timeout is hardcoded and there is no way to override it:

  def askAllShards[T: ClassTag](msg: Any): Future[Seq[(ShardId, T)]] = {
    implicit val timeout: Timeout = 3.seconds
    Future.sequence(shards.toSeq.map {
      case (shardId, ref) => (ref ? msg).mapTo[T].map(t => (shardId, t))
    })
  }

Judging by the code the same applies to other sharding state inspection methods like GetShardRegionState.
I haven't found this behaviour documented anywhere, it took me some time to figure out that if the management API says I have 0 shards allocated it doesn't always mean it.
Would be great to at least to document it, but best to change it somehow so inability to collect the data for some shards is presented to the sharding inspection API user.
Also it would be great if the ask timeout was configurable.

@johanandren
Copy link
Member

Both not producing an answer on failure and making the timeout configurable sounds good to me. I think the reason it is hardcoded is that it initially wasn't meant as a message protocol for production but tests and then grew to be used in production scenarios.

Would you be up to do a PR with the two changes?

@johanandren johanandren added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted help wanted Issues that the core team will likely not have time to work on t:cluster-tools labels Jun 16, 2019
@helena
Copy link
Member

helena commented Jun 28, 2019

I'll take this. My understanding of what we want to do is

  1. Handle the case when a GetShardRegionStats is issued, and some of the shards are too busy, fix the current return of case x: AskTimeoutException => ShardRegionStats(Map.empty) because they are not empty, just busy handling requests.
  2. Investigate if additional, similar modification to GetShardRegionState response is needed for the same reason
  3. Make the timeout configurable for askAllShards

@helena helena added 3 - in progress Someone is working on this ticket and removed help wanted Issues that the core team will likely not have time to work on labels Jun 28, 2019
@helena helena self-assigned this Jun 28, 2019
@helena helena changed the title GetShardRegionStats returns empty shard set on ask timeout Productionize: GetShardRegionStats returns empty shard set on ask timeout Jun 28, 2019
@helena
Copy link
Member

helena commented Jul 1, 2019

Regarding the return of all stats or an empty Map - from the scaladoc

If the timeout is reached without answers from all shard regions the reply will contain an empty map of regions.

So we are doing this as a flag to determine if the sharded cluster is "ready" yet it is a problem when used against an already running sharded cluster, where some shards are busy and can't respond within the overarching timeout.

I'll update this as well.

@helena
Copy link
Member

helena commented Jul 1, 2019

Once this is merged I will push a PR against Akka Management for making routeGetShardInfo route use the new configurable timeout.

helena pushed a commit to helena/akka that referenced this issue Jul 15, 2019
…akka#27299 - precursor to work coming in: Productionize: GetShardRegionStats returns empty shard set on ask timeout akka#27100
helena pushed a commit to helena/akka that referenced this issue Jul 15, 2019
…akka#27299 - precursor to work coming in: Productionize: GetShardRegionStats returns empty shard set on ask timeout akka#27100
helena pushed a commit to helena/akka that referenced this issue Jul 16, 2019
helena pushed a commit to helena/akka that referenced this issue Jul 17, 2019
helena pushed a commit to helena/akka that referenced this issue Jul 17, 2019
helena pushed a commit to helena/akka that referenced this issue Jul 24, 2019
helena pushed a commit to helena/akka that referenced this issue Jul 24, 2019
helena pushed a commit to helena/akka that referenced this issue Jul 24, 2019
@helena helena added this to the 2.6.0-M5 milestone Jul 25, 2019
@helena helena closed this as completed Jul 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted 3 - in progress Someone is working on this ticket t:cluster-sharding
Projects
None yet
Development

No branches or pull requests

4 participants