
Per key metadata sharing across clients in clustered ehcache #1015

Closed
rkavanap opened this issue May 1, 2016 · 19 comments

Comments

@rkavanap
Contributor

rkavanap commented May 1, 2016

For a multi-tiered clustered cache manager with a cache that has either TTI or TTL configured, there are two challenges with respect to per-key metadata that need to be addressed:

When a client does a get on a key locally from its caching tier, other clients doing a get on the same key later must somehow see the correct access time for this key in order to accurately compute 'time to idle' and decide whether the entry needs to be expired or not.

For both TTL and TTI computations, clients may have different clocks, so the computations must ensure that clocks are effectively synchronized across VMs/systems: either by using a single time reference (say, the server's) for absolute time, with clients computing time relative to that reference, or by synchronizing the clocks through internal or external clock-synchronization mechanisms.
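For illustration only, a minimal sketch of the single-time-reference option, assuming the client captures the server's clock once (say, when a mapping is faulted in) and anchors all later expiry math to it. The `ServerClock` class and its methods are hypothetical, not part of Ehcache:

```java
// Hypothetical helper: all times are expressed on the server's clock.
// The client captures the server time once together with its own
// System.nanoTime(), then derives "server now" from elapsed monotonic time.
public final class ServerClock {

  private final long serverTimeAtSyncMillis; // server wall-clock at sync point
  private final long localNanosAtSync;       // local monotonic clock at sync point

  public ServerClock(long serverTimeAtSyncMillis) {
    this.serverTimeAtSyncMillis = serverTimeAtSyncMillis;
    this.localNanosAtSync = System.nanoTime();
  }

  /** Approximate current time on the server's clock. */
  public long serverNowMillis() {
    long elapsedMillis = (System.nanoTime() - localNanosAtSync) / 1_000_000L;
    return serverTimeAtSyncMillis + elapsedMillis;
  }

  /** TTI check: lastAccessTime is expressed in server time. */
  public boolean isIdleExpired(long lastAccessServerMillis, long ttiMillis) {
    return serverNowMillis() - lastAccessServerMillis >= ttiMillis;
  }

  /** TTL check: creationTime is expressed in server time. */
  public boolean isLiveExpired(long creationServerMillis, long ttlMillis) {
    return serverNowMillis() - creationServerMillis >= ttlMillis;
  }
}
```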

@rkavanap
Contributor Author

rkavanap commented May 1, 2016

A solution proposal for this issue is illustrated here to solicit comments and stimulate discussions.

Timestamped links, metadata appends and fault-before-remove

One potential solution is to timestamp every link in the blob chain on the server; the client can then use the timestamp on each link to recompute the last access time and creation time for the key.

Sub-optimally on every get that hits the caching tier (or, optimally, only when expiry is nearing), a metadata append operation can be sent asynchronously to the cluster tier so that the server can add a link to the chain. Ideally this append carries just the key and operation type in the blob, since the server can put the timestamp on the link when it is appended to the blob chain. These metadata links, together with the other operations in the chain, let other clients accurately recompute metadata such as the last access time and creation time.
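As a rough sketch of what such an asynchronous meta-append could look like on the client side (`MetaAppendingStore`, `ServerStoreProxy` and `MetaUpdateCodec` are made-up stand-ins, not the actual Ehcache/Terracotta interfaces):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.Executor;

// Sketch only: hypothetical client-side store wrapper. On a get satisfied by
// the caching tier, a tiny "metadata" append is queued for the key; the
// server is expected to timestamp the link itself, so no time is shipped.
final class MetaAppendingStore<K, V> {

  interface CachingTier<K, V> { V get(K key); }
  interface ServerStoreProxy { void append(long keyHash, ByteBuffer payload); }
  interface MetaUpdateCodec<K> { ByteBuffer encodeMetaUpdate(K key); }

  private final CachingTier<K, V> cachingTier;
  private final ServerStoreProxy serverStore;
  private final MetaUpdateCodec<K> codec;
  private final Executor metaUpdateExecutor;

  MetaAppendingStore(CachingTier<K, V> cachingTier, ServerStoreProxy serverStore,
                     MetaUpdateCodec<K> codec, Executor metaUpdateExecutor) {
    this.cachingTier = cachingTier;
    this.serverStore = serverStore;
    this.codec = codec;
    this.metaUpdateExecutor = metaUpdateExecutor;
  }

  V get(K key) {
    V value = cachingTier.get(key);  // local hit
    if (value != null) {
      // Fire-and-forget: the payload carries only the key and operation
      // type; the server stamps the link with its own time when appending.
      metaUpdateExecutor.execute(() ->
          serverStore.append(key.hashCode(), codec.encodeMetaUpdate(key)));
    }
    return value;
  }
}
```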

If we want expiry computations to be more accurate in a clustered environment even in corner cases where there are frequent gets from an interested client and infrequent gets from another, bored client, this solution can be expanded as follows:

When a TTI is configured, a semi-interested or bored client that 'wrongly sees' an expiry for a key in its caching tier (due to its own infrequent hits on that tier) can first trigger a 'fault', resolve and recompute the last access time, and issue a 'remove' if and only if the expiry is confirmed on this second pass.
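A minimal sketch of that second pass, assuming the client can re-fault the chain from the authority and resolve the newest server-stamped access time from it (all type names below are hypothetical):

```java
import java.nio.ByteBuffer;

// Sketch only: the second pass a "bored" client makes before trusting a
// locally-computed TTI expiry. All types here are hypothetical stand-ins.
final class FaultBeforeRemove<K> {

  interface Chain { }
  interface ServerStoreProxy {
    Chain getAndFault(long keyHash);
    void append(long keyHash, ByteBuffer payload);
  }
  interface ChainResolver<K> { long lastAccessTimeOf(Chain chain, K key); }
  interface RemoveCodec<K> { ByteBuffer encodeRemove(K key); }
  interface TimeSource { long currentTimeMillis(); }

  private final ServerStoreProxy serverStore;
  private final ChainResolver<K> resolver;
  private final RemoveCodec<K> codec;
  private final TimeSource timeSource;

  FaultBeforeRemove(ServerStoreProxy serverStore, ChainResolver<K> resolver,
                    RemoveCodec<K> codec, TimeSource timeSource) {
    this.serverStore = serverStore;
    this.resolver = resolver;
    this.codec = codec;
    this.timeSource = timeSource;
  }

  /** Returns true only if the entry is still idle-expired after re-checking the authority. */
  boolean expireIfStillIdle(K key, long ttiMillis) {
    Chain chain = serverStore.getAndFault(key.hashCode());         // fresh chain from the server
    long lastAccess = resolver.lastAccessTimeOf(chain, key);       // newest server-stamped access
    if (timeSource.currentTimeMillis() - lastAccess >= ttiMillis) {
      serverStore.append(key.hashCode(), codec.encodeRemove(key)); // expiry confirmed: remove
      return true;
    }
    return false;                                                  // another client kept it fresh
  }
}
```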

The above solution will be illustrated with sequence diagrams etc. in another comment or doc.

Pros and Cons of this approach:

  1. This approach can be extended easily to ensure that the absolute time reference is always that of the server (and not the client). This means that only the active and its passive need to be time-synchronized, and client VMs need not be (at least for this requirement).
  2. May increase metadata chatter. However, the approach can be optimized to reduce metadata chatter between clients through the server. Also, since clients do not have to share their own metadata with other clients, the metadata exchange is leaner (it saves at least 4 longs per key per operation).
  3. In alignment with the current design of an opaque structure of a chain of binary blobs stored against each hash on the server.
  4. Controlled load patterns for clients in an active/standby setup: suppose two clients use a cache with a TTI configured and millions of entries, where one client is active and the other is standby. When the active client dies, the standby takes over without millions of these cache entries being wrongly expired. If those entries were wrongly expired, it would increase load on both the underlying database and Ehcache, which would now have to expire millions of entries.

@albinsuresh
Contributor

Here is an illustration of the issue:

[Diagram: clusteredtieredaccess]

As you can see in the picture, the local updates to the caching tier on client1 are not reflected on the server. So when a second client (client2 in this diagram) accesses the same data at a later point in time, it could get an old version of the data from the clustered tier with stale metadata. This could be problematic for TTI-based expiry calculations.

@albinsuresh
Contributor

Here is the pictorial representation of the proposed solution:

[Diagram: newtieredaccess]

Here each get request on the caching tier sends an append request to the clustered tier with a meta-update operation. The meta-update is a lightweight operation used just as an indication to the server that the metadata needs to be updated. Since all operations appended to the chain are timestamped by the server, this meta-update also gets timestamped, and that timestamp indicates a data access at the caching tier. When some other client retrieves the data from the server (client2 here), it gets the chain with the metadata links as well. Resolving that chain yields information on all the accesses to that data by other clients.
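To make the resolution step concrete, here is a simplified sketch of how a client might fold the server-stamped links into per-key metadata; the `Element`/`OpType` shapes are assumptions, not the real chain entity API:

```java
import java.util.List;

// Sketch only: walk the chain and recover per-key metadata from the
// server-applied timestamps. Element/OpType are simplified stand-ins.
final class MetadataResolver {

  enum OpType { PUT, META_UPDATE, REMOVE }

  static final class Element {
    final OpType type;
    final long serverTimestampMillis; // stamped by the server on append
    Element(OpType type, long serverTimestampMillis) {
      this.type = type;
      this.serverTimestampMillis = serverTimestampMillis;
    }
  }

  static final class Metadata {
    long creationTime = -1L;
    long lastAccessTime = -1L;
  }

  /** Elements are assumed to already be filtered down to a single key. */
  static Metadata resolve(List<Element> chainForKey) {
    Metadata meta = new Metadata();
    for (Element e : chainForKey) {
      switch (e.type) {
        case PUT:          // a (re)write establishes creation and access time
          meta.creationTime = e.serverTimestampMillis;
          meta.lastAccessTime = e.serverTimestampMillis;
          break;
        case META_UPDATE:  // a caching-tier hit reported by some client
          meta.lastAccessTime = Math.max(meta.lastAccessTime, e.serverTimestampMillis);
          break;
        case REMOVE:       // mapping gone: reset
          meta.creationTime = -1L;
          meta.lastAccessTime = -1L;
          break;
      }
    }
    return meta;
  }
}
```

A real resolver would of course also have to deal with key filtering and chain compaction; this only shows the timestamp bookkeeping.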

@lorban
Contributor

lorban commented May 2, 2016

This approach could work, but there are a few things that must be sorted out before we can decide whether this design is viable:

  • Updating the metadata on every get will put unacceptable load on the server. Even if done asynchronously, a hot key served from heap could generate well over 2 million updates per second. We must find the right place for the metadata to be updated so that clients have reasonably accurate timestamps without putting too much strain on resources.
  • Reading this proposal as it stands, expiration is still resolved by the client, so a slight clock drift (think: a few seconds) between clients and/or the server(s) wouldn't hurt, but a large gap (think: 1+ hour) would have very adverse effects. We must also decide now whether we leave the reasonable time-sync effort in the hands of the user or introduce some form of server time source that the clients must use to resolve the timestamps.
  • We must also figure out what kind of side-channel must be added to the Montreal design to be able to implement this without compromising the entire tiering architecture.

@lorban
Contributor

lorban commented May 2, 2016

Quick followup: updating the metadata when expiry is nearing won't cut the mustard: a TTI eviction policy would continuously push the metadata update into the future for as long as the key is being accessed.

@rkavanap
Contributor Author

rkavanap commented May 2, 2016

Let me clarify. In the Montreal design, an entry first enters the caching tier only when a get is done on the authoritative tier. If we recompute and note server time during the resolution of the chain, we can determine whether we are closing in on expiry with reference to this first get from the server. Like you said, we should never move the bar for the 'close to expiry' computation on subsequent gets that only hit the caching tier.

We have also described another case, where a bored client has warmed the cache with a get and does not issue any additional gets until expiry, while the other, more interested client is continuously accessing the key from its caching tier. Even if these metadata updates reach the server, the bored client may not see them. To handle this case, we could issue a 'fault' on expiry, fetch the chain from the server again, and append a 'remove' only if the entry is still expired after that get. This may be a bit of over-engineering, but it also handles application use cases where a 'standby' client warms its cache (say for write-once, read-many or static data) and only starts accessing it operationally when it takes over from a failed active client.

@rkavanap
Contributor Author

rkavanap commented May 2, 2016

Just one more clarification. The metadata update timer on a client will start only on the first get from the caching tier, but the time-to-expiry check that decides whether to queue a metadata update request to the server will always be based on the first get from the authoritative tier.

Moreover, this is only a first cut at the approach. If the general opinion is that we should take this forward, the design can be fine-tuned and made completely hole-free at the right time in our sprints (when we are adding TieredStore functionality for clustered Ehcache). If we hit showstopper issues during the design, we can always refine the approach or even change direction if need be.

@lorban
Contributor

lorban commented May 3, 2016

After some more thought and a conversation with @rkavanap, it looks like this is all doable in a reasonable way.

In a nutshell, the caching tier would need to store an extra "refresh" timestamp in its value holder, set to the original expiration time when the mapping is faulted in. When that time is reached, a metadata update would be sent to the authority and a new refresh timestamp would be calculated for the caching tier's value holder. This logic should be light enough to always be active, and not be specific to a cluster tier being present.

Keeping the timestamps absolute on the clients and servers would keep the required changes to a minimum and would keep the code as simple as possible (especially in the clustered vs non-clustered case).

The only open question left is: do we need to build in some form of time sync between clients and servers, or do we rely on the sysadmin's good will to keep all machines in reasonable time sync? I'm more in favor of the latter option, but adding some clock-sync checks that log when the machines' clocks have drifted too far apart wouldn't hurt.
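A minimal sketch of the refresh-timestamp logic described above, assuming a hypothetical caching-tier value holder (the real ValueHolder contract may differ, and "now + TTI" as the next refresh point is an assumption):

```java
// Sketch only: a caching-tier holder that remembers when the authority's
// metadata should next be refreshed. Names are hypothetical.
final class CachingTierValueHolder<V> {

  private final V value;
  private volatile long authorityRefreshTimeMillis; // initially the original expiration time

  CachingTierValueHolder(V value, long originalExpirationTimeMillis) {
    this.value = value;
    this.authorityRefreshTimeMillis = originalExpirationTimeMillis;
  }

  V value() {
    return value;
  }

  /**
   * Called on a caching-tier hit. Returns true when a metadata update should
   * be sent to the authority, and pushes the next refresh point forward.
   */
  boolean onAccess(long nowMillis, long ttiMillis) {
    if (nowMillis >= authorityRefreshTimeMillis) {
      authorityRefreshTimeMillis = nowMillis + ttiMillis; // assumed next refresh deadline
      return true;                                        // caller appends a meta-update
    }
    return false;
  }
}
```

With something along these lines the refresh check stays a cheap local comparison, and only the occasional meta-update leaves the client, which fits the "light enough to always be active" requirement above.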

@rkavanap
Contributor Author

rkavanap commented May 3, 2016

A few more comments (now that we assume server and client timestamps are equivalent and some separate mechanism handles time sync):

You could still timestamp every link on the server side for every append, and when the metadata update is sent you could send just one long containing the time the key was last accessed in the caching tier. Why flood the network with metadata? This adds up when there are millions of keys, and since the dev effort is small there is hardly any tradeoff. The append operation (API) can take a boolean that tells the server whether or not to timestamp an element; this can efficiently handle non-TTI caches as well.
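For instance, the append entry point could carry such a flag; this is a hypothetical signature, not the current ServerStore API:

```java
import java.nio.ByteBuffer;

// Sketch only: a hypothetical server-side store interface where the client
// decides per append whether the server should timestamp the new link.
// Non-TTI caches would simply pass timestampLink = false and skip the cost.
interface TimestampingServerStore {

  /**
   * Appends payload to the chain for keyHash. When timestampLink is true the
   * server records its own clock on the new link, so clients never need to
   * ship access times over the wire.
   */
  void append(long keyHash, ByteBuffer payload, boolean timestampLink);
}
```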

Also, are you planning to do the 'fault' before 'remove' for the bored but warm client case? Of course this is another abstraction, different from the metadata-append abstraction. Maybe another story that can be prioritized later?

@rkavanap
Contributor Author

rkavanap commented May 4, 2016

One more thing for appends:

We need a way to distinguish between a mutative operation and a metadata operation on the server, as a mutative operation triggers invalidation. Currently I do not see a distinction between the two.

@lorban
Contributor

lorban commented May 4, 2016

There is no need to have the server do the timestamping, as we agreed to assume clients and servers will have their clocks synced.

The mappings stored by the authority (which the clustered store will be yet another impl of) already contain some metadata; check ValueHolder, it comes with:

  • creationTime
  • expirationTime
  • lastAccessTime

What we're missing is an authorityRefreshTime timestamp in the CachingTier's ValueHolder to act as the timer that will trigger a metadata refresh of the authority.

I'm not planning anything special for the 'warm but bored' client: those clients will naturally expire their caching tier's contents and re-fetch from the authority on-demand. It's IMHO another feature's responsibility to keep an idle cache warm (think: refresh-ahead).

Strictly speaking, there's no need for any kind of distinction between metadata / non-metadata operations. A metadata update from one client will trigger eviction in the other clients' caching tiers like any normal mutative operation. But I do agree that we could/should consider filtering those metadata update operations at the client level and handling them a bit more intelligently, to avoid evicting perfectly valid and warm data from other clients' caching tiers. This has some impact on #919 though, so we should sync with @ljacomet before making any final decision.

Note that implementing a caching tier metadata refreshment mechanism would solve the 'warm but bored' client problem as a side effect.
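A sketch of what filtering metadata updates at the client level might look like, assuming invalidation messages could carry the originating operation type (all types below are hypothetical):

```java
// Sketch only: instead of blindly evicting on every server invalidation,
// the client inspects the operation type and treats metadata-only appends
// as a refresh rather than an eviction.
final class InvalidationListener<K> {

  enum OpType { MUTATIVE, META_UPDATE }

  interface CachingTier<K> {
    void invalidate(K key);
    void refreshLastAccess(K key, long lastAccessMillis);
  }

  private final CachingTier<K> cachingTier;

  InvalidationListener(CachingTier<K> cachingTier) {
    this.cachingTier = cachingTier;
  }

  void onInvalidation(K key, OpType type, long serverTimestampMillis) {
    if (type == OpType.MUTATIVE) {
      cachingTier.invalidate(key);                               // real change: drop local copy
    } else {
      cachingTier.refreshLastAccess(key, serverTimestampMillis); // keep warm data, update metadata
    }
  }
}
```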

@rkavanap
Contributor Author

rkavanap commented May 4, 2016

I now understand your philosophy for the 'bored but warm' client and it makes complete sense. I originally thought the caching tier might forcefully remove from the authority on expiry, but now I understand that only a flush happens and there is no forceful eviction from the authority tier on expiry.

Some questions though:

The server side of the authoritative tier is just a chain of blobs, so why blobify the value holder? Simply blobify the key and value and re-create the value holder on the client side of the authoritative tier during the resolve; timestamping the elements (links) allows this. Also, the metadata update is simply a blobified key. With your simplified approach of no background threads, we do not even have to send the access time, since the metadata append goes out only on an explicit get hitting the caching tier when we are closing in on the refresh time. And we agreed that we can assume timestamps are synchronized between clients and server. There is hardly any additional dev effort for saving this network bandwidth, so why not save it?

I did not understand one aspect: why send invalidations for a metadata update? Invalidations from the server are expensive, even if every client blocks them after it has demarshalled the eviction request. The server-side store does not know whether it is a metadata update unless we have a different append operation.

@rkavanap
Contributor Author

rkavanap commented May 4, 2016

Ok. After seeing your response on issue #919 I understand. You are planning to convert the invalidation coming from the server into a refresh metadata update. That works. But isn't that expensive? Isn't it simpler for the client to issue a 'fault before flush' to the authoritative tier in case of local expiry? That makes it more on-demand and more efficient, and reduces broadcast messages.

I probably know the answer (it changes the Montreal design, etc.), but I feel broadcasting is a bit expensive.

@lorban
Contributor

lorban commented May 4, 2016

About your "blobifying the value holder" thought: what you're suggesting is an optimization that we may or may not want to include. It requires more engineering effort, which is why I'm not keen on doing it right away.

About "sending invalidations for metadata updates": it's a trade-off. Either we pay the broadcast cost of metadata updates to keep caching tiers warm, or we pay the price of all warm clients pounding on the server when it's time to refresh. It's unclear to me at the moment which of the two would be more valuable, so again I prefer to go with the solution requiring the smallest engineering effort.

lorban added a commit to lorban/ehcache3 that referenced this issue May 4, 2016
@lorban
Contributor

lorban commented May 4, 2016

Check my early prototype branch here to get a grasp of what this could look like:
https://github.com/lorban/ehcache3/tree/issue-1015

lorban added a commit to lorban/ehcache3 that referenced this issue May 4, 2016
@rkavanap
Contributor Author

rkavanap commented May 8, 2016

Ludovic, I fail to understand this. Intuitively, how can broadcasting to thousands of clients that are not even interested in that particular mapping be better than a client pulling into its caching tier when it is interested? And when a client pulls and does not use the entry for a long time, it just has to check once more with the server before expiring it.

Once we track clients per mapping, what you are saying will make more sense to me. But even in that model, intuitively, pulling on demand at expiry looks better than multicasting metadata changes. Then again, your argument of engineering effort vs. additional benefit seems much stronger in the multicasting model.

Better to wait for the broadcast to be changed to multi-cast before this is merged, right?

Other than that, I really see the simplicity of your change.

@lorban
Contributor

lorban commented May 9, 2016

As I said, it's a tradeoff: how can you know whether the 1000 clients are interested or not? If most of them aren't, then indeed the broadcast is going to be more expensive than waiting for the clients to re-fetch. But if most of them are interested, then broadcasting would be a lot more efficient, as it would prevent a thundering herd of server requests while avoiding a perf dip in the clients and some outliers in the latency percentiles.

As I said, I'd leave the final decision of which model to go with for later and take the least-engineering-effort route for now. It doesn't matter too much, as the let-the-client-refetch and broadcast-the-metadata-updates implementations can be swapped without much pain.

@lorban lorban removed their assignment Jun 7, 2016
@chrisdennis
Member

TTI is deprecated (#1097). People running with egregiously unsynchronized clocks when using TTL deserve everything they get. Screw those bastards.

@lorban
Contributor

lorban commented Sep 28, 2022

Lovely! 🤪
