
Partitioned Cache + stats + multiple nodes causes failure #77

Closed
zachdaniel opened this issue Aug 6, 2020 · 17 comments


@zachdaniel

The partitioned cache doesn't define a module for the local adapter, so when it tries to RPC to another node to call a function on that local adapter, it fails. How is it supposed to work if it's not defining a module?

@zachdaniel
Author

It looks like the real problem has to do with local caches on each node trying to use the stats ref passed down by the parent cache.

@zachdaniel
Author

I get an ArgumentError whenever it uses the RPC.

@cabol
Owner

cabol commented Aug 7, 2020

One of the changes was precisely not to depend on declaring a local cache explicitly; instead, the partitioned adapter handles it. Now, I'm confused about your issue and I think I'll need more details: does it happen only when you set stats: true, or also without the :stats option? Could you please provide the cache module and config so I can try to reproduce the issue?

@zachdaniel
Author

Totally :) I was in a bit of a rush yesterday, but I'll get the issue reproduced in a minimal app and let you know. Specifically, this happens when I enable stats: true in the partitioned cache. Disabling that eliminated the error.
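For reference, a minimal setup along these lines should reproduce it (the module and app names are just placeholders, not the actual project code):

defmodule MyApp.Cache do
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Partitioned
end

# config/config.exs
config :my_app, MyApp.Cache,
  stats: true

Then start two connected nodes running the app and perform a cache operation whose key is routed to the other node.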

@cabol
Owner

cabol commented Aug 7, 2020

Ok, got it, that info is helpful. I'll try to reproduce it with that (a partitioned cache with stats: true). Thanks 👍

@cabol
Owner

cabol commented Aug 10, 2020

I found the issue. I need to find another way of handling the counters in a distributed fashion, because when the RPC is executed, the command on the remote node is performed using the counter reference from the calling node. Even if that is solved, stats aggregation across all cluster nodes still needs to happen; otherwise, by the time the stats are consulted, the result will contain only the values from that node. It may take me some time to sort it out, but I'll try to push a fix as soon as I can, thanks!
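To illustrate the first part, here is a minimal sketch (not Nebulex code) of why the RPC fails: an Erlang :counters reference is bound to the node that created it, so using it on another node errors out.

ref = :counters.new(1, [:write_concurrency])

# Works on the node that created the ref:
:counters.add(ref, 1, 1)

# Fails when the ref is shipped to another node and used there
# (assuming :"other@host" is a connected node); the remote call
# errors with :badarg, which surfaces as the ArgumentError above.
:rpc.call(:"other@host", :counters, :add, [ref, 1, 1])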

@zachdaniel
Author

@cabol one thing you might consider: I'm not so sure you want stats to aggregate across nodes. We report our metrics to Datadog, for example, and it understands which hosts our metrics are coming from. If every node sent the sum total of all nodes, it would be extra work to disambiguate that. Perhaps it would be possible to just enable stats for the primary local cache that the distributed cache creates automatically? I can see reasons you might want stats to aggregate across the cluster, so I'm thinking of making it optional with something like:

config :my_app, MyApp.Cache,
  stats: true,       # enables cluster-wide aggregate stats
  primary: [
    stats: true,     # enables node-specific stats only
    adapter: Nebulex.Adapters.Local,
    backend: :shards
  ]

With this you could have both, neither, or one enabled.

@zachdaniel changed the title from "Confused about how the partitioned cache is supposed to work in 2.0.0+" to "Partitioned Cache + stats + multiple nodes causes failure" on Aug 12, 2020
@cabol
Owner

cabol commented Aug 14, 2020

You're right, I will fix it so that stats are enabled for the primary local cache that the distributed cache creates automatically (as you mention). Besides, getting the aggregated stats should be easy: there is a function to retrieve the cluster nodes (Cache.nodes()), and then you can call Cache.stats_info() on each of them.
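Something along these lines should do it (a rough sketch; it assumes Cache.stats_info() returns a flat map of counters and that each node runs the same cache module):

defmodule MyApp.CacheStats do
  # Sums the per-node stats into a single cluster-wide map.
  def aggregated do
    MyApp.Cache.nodes()
    |> Enum.map(&:rpc.call(&1, MyApp.Cache, :stats_info, []))
    |> Enum.reduce(%{}, fn node_stats, acc ->
      Map.merge(acc, node_stats, fn _key, a, b -> a + b end)
    end)
  end
end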

@Ivor

Ivor commented Sep 23, 2020

Experienced the same issue on ReplicatedCache. Appreciate the hard work. 👍

@cabol
Owner

cabol commented Sep 23, 2020

Yeah, I've been working on some fixes and improvements that should solve this issue. I'll push them soon!

cabol added a commit that referenced this issue Sep 24, 2020
- Refactor replicated, partitioned, and multilevel adapters to call a cache instead of an adapter directly
- This should fix issue #77
@cabol
Owner

cabol commented Sep 24, 2020

I've pushed some fixes and improvements to master which I think should solve this issue. However, I had to do some refactoring of the adapters, so there are minor changes in the config; I highly recommend taking a look at the adapters you're using. Please try it out and let me know if the problem persists. Stay tuned, thanks!

cabol added a commit that referenced this issue Sep 26, 2020
- Refactor replicated, partitioned, and multilevel adapters to call a cache instead of an adapter directly
- This should fix issue #77
@Ivor

Ivor commented Sep 28, 2020

Thanks, will give it a shot and report what we find!

@franc

franc commented Sep 29, 2020

Thanks @cabol, I can confirm that it is working for a replicated cache.

Looking at the Phoenix LiveDashboard output on 2 different nodes, I can see that the hits and misses differ per node, but the writes are the same. This makes sense since the cache is replicated. The cache size is a bit smaller than the number of writes, and I assume this is because of some concurrent caching/race conditions that meant some writes happened on both primaries. (This was before any evictions/expires.)
I expected expires and evictions to also be mirrored on both nodes like writes are, but it seems they are not, or at least not counted.
I haven't looked at the Stats code much, but it may be useful to have a counter to distinguish between local writes and writes via replication.

@cabol
Owner

cabol commented Sep 30, 2020

Thanks @cabol, I can confirm that it is working for a replicated cache.

Glad to hear it is working now 😄

Looking at the Phoenix LiveDashboard output on 2 different nodes, I can see that the hits and misses differ per node, but the writes are the same. This makes sense since the cache is replicated. The cache size is a bit smaller than the number of writes, and I assume this is because of some concurrent caching/race conditions that meant some writes happened on both primaries. (This was before any evictions/expires.)
I expected expires and evictions to also be mirrored on both nodes like writes are, but it seems they are not, or at least not counted.

As you mentioned, since it is a replicated cache, the writes will be the same, because each write is replicated to all nodes in the cluster. On the other hand, reads are resolved locally, so it is normal to have different values for hits and misses per node. The idea is that, with Telemetry and for example the StatsD reporter, you can aggregate the values for those metrics: if you have two nodes, one with 3 hits and 2 misses and the other with 5 hits and 3 misses, in the end you may want to see hits: 8 and misses: 5. If you dispatch the stats via Telemetry and use something like StatsD, you should see the values that way. However, if you are not using Telemetry and StatsD, it is still possible to aggregate the metrics (except writes, since they are the same for all nodes) by calling Cache.stats_info() on each node of the cluster; you can get the nodes by calling Cache.nodes(). Perhaps it would be good to add a node tag when dispatching the telemetry event, so we can also filter/aggregate stats by node; I'll check it out.
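For the node tag idea, something like this could work with Telemetry.Metrics (just a sketch; the my_app.cache.stats.* event/measurement names are placeholders, not the events Nebulex dispatches today):

import Telemetry.Metrics

# Tag each stats measurement with the emitting node so the StatsD/Datadog
# side can either aggregate across the cluster or split per host.
metrics = [
  last_value("my_app.cache.stats.hits",
    tags: [:node],
    tag_values: &Map.put(&1, :node, node())
  ),
  last_value("my_app.cache.stats.misses",
    tags: [:node],
    tag_values: &Map.put(&1, :node, node())
  )
]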

I haven't looked at the Stats code much, but it may be useful to have a counter to distinguish between writes to local and writes via replication.

Yeah, agreed, it would be useful too. Let me take a look and see if there is something not too complicated we can do; otherwise, I can create a separate ticket and tag it as an improvement.

Thanks 👍

@franc

franc commented Oct 1, 2020

Yep, I'm using Telemetry and StatsD across 2 nodes and then looking at it in Datadog; it is easy to aggregate or distinguish in Datadog based on host.
I understand writes being the same while hits and misses would be different. What is curious to me is that evictions and expires are different too; my expectation would be that they are mirrored across the cluster to eventually be the same.

@cabol
Owner

cabol commented Oct 1, 2020

I understand writes being the same while hits and misses would be different. What is curious to me is that evictions and expires are different too; my expectation would be that they are mirrored across the cluster to eventually be the same.

Good point. Let me check it out: when the GC runs or a new generation is created, it should be applied on all the nodes, so perhaps it is related to that; if so, it would require some sort of sync in the cluster. I'll check!

@cabol
Owner

cabol commented Oct 29, 2020

I think we can close this issue; feel free to re-open it if the problem persists!

BTW, I also added an example of how to add the node to the event so it can be used in the metric tags: https://github.com/cabol/nebulex/blob/master/guides/telemetry.md#adding-other-custom-metrics.

P.S.: @franc, could you create a separate ticket for "What is curious to me is that evictions and expires are different too; my expectation would be that they are mirrored across the cluster to eventually be the same"?
