
Fragment cache sometimes returns null despite cached data loading initially #72

Closed
RudiBoshoff opened this issue Aug 30, 2021 · 20 comments

@RudiBoshoff

Hi there,

ENV:
We are currently using your gem in our Rails API, which passes data through to an Apollo Angular app. We originally had a pure Rails solution but got better performance results with your gem.

Issue:
The problem is that the cache sometimes returns null. This doesn't happen locally, but it does happen when we use Redis as the cache store for staging and production. The cached data is returned most of the time but sometimes comes back null. We have investigated as much as we could: we first thought it might be the front end, but we have narrowed it down to the cache returning null.

How to reproduce:
This only reproduces some of the time:

  1. load the site the first time to populate the cache (works)
  2. hard refresh and reload the site (network results show that cache is working)
    (screenshot)
  3. wait a while (5-10 minutes) and load the page again (returns null). The cache is set to expire after 2 hours; checking Redis shows that the key is still present when this failure occurs.
    (screenshot)

Question:
Do you have any ideas on what might be causing this or what we might be able to do? Thank you in advance.

@DmitryTsepelev
Owner

Hi @RudiBoshoff!

That sounds really strange, because the gem does the reading in a dead simple way:

# If read returns nil but the key exists, treat it as a deliberately cached nil
FragmentCache.cache_store.read(cache_key).tap do |cached|
  return NIL_IN_CACHE if cached.nil? && FragmentCache.cache_store.exist?(cache_key)
end

In this case FragmentCache.cache_store is a Redis adapter, and if you see that the key is still there, the value should be loaded without any issues. Since you know the key, could you please call FragmentCache.cache_store.read(cache_key) from the console and make sure the value is not nil?

Reading happens here; it looks like it also does nothing more than a nil check.

@DmitryTsepelev
Owner

Closing this for now, feel free to reopen 🙂

@gsdean
Contributor

gsdean commented Jan 11, 2023

FWIW: we're seeing similar behavior.

Isn't there a race condition?

  1. FragmentCache.cache_store.read(cache_key) returns nil for an uncached value.
  2. Someone else populates the cache key.
  3. FragmentCache.cache_store.exist?(cache_key) returns true (even though the key wasn't cached at step 1).

Thoughts? @DmitryTsepelev
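
That interleaving can be sketched with a plain Hash standing in for the Redis-backed store, with the three steps run sequentially to make the race deterministic (NIL_IN_CACHE here is a local stand-in for the gem's constant):

```ruby
# Plain Hash standing in for FragmentCache.cache_store; "post-1" was
# never cached, so the correct outcome here is a plain miss.
NIL_IN_CACHE = Object.new # stand-in for the gem's sentinel constant

store = {}

# Step 1: read misses because the key is absent
cached = store["post-1"]

# Step 2: another process populates the key between the two calls
store["post-1"] = "serialized fragment"

# Step 3: exist? now sees the key, so the nil from step 1 is
# misread as a deliberately cached nil
result = cached.nil? && store.key?("post-1") ? NIL_IN_CACHE : cached
result.equal?(NIL_IN_CACHE) # => true, even though a real value is cached
```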

@gsdean
Contributor

gsdean commented Jan 12, 2023

We've been doing some testing in production with the following:

def value_from_cache
  # Bail out early when the key is absent: a concurrent write arriving later
  # can no longer turn this miss into a spurious NIL_IN_CACHE
  return nil unless FragmentCache.cache_store.exist?(cache_key)

  FragmentCache.cache_store.read(cache_key).tap do |cached|
    return NIL_IN_CACHE if cached.nil? && FragmentCache.cache_store.exist?(cache_key)
  end
end

It seems to have addressed the race condition.
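
Replaying the earlier interleaving against the patched order of checks shows why the leading exist? helps (again a plain Hash stands in for the store):

```ruby
store = {} # key absent: the value was never cached

# The leading exist? check decides "miss" before any concurrent write lands
miss = !store.key?("post-1")

# A write arriving afterwards can no longer flip that decision: the reader
# already chose to re-resolve instead of reporting a cached nil
store["post-1"] = "serialized fragment"

miss # => true: the reader re-resolves, which is the correct outcome
```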

@RudiBoshoff @DmitryTsepelev I think we should reopen this.

@tsugitta

tsugitta commented Feb 7, 2023

@gsdean
Same here. Can you explain how the workaround works? I understand the problem derives from fetching the cache multiple times, but is it safe to fetch the cache multiple times in that way?

@gsdean
Contributor

gsdean commented Feb 7, 2023

The workaround isn't perfect. There is still a chance a race condition could cause a misread from the cache (specifically when an eviction occurs after the initial read, and then the cache is repopulated). Basically it reduces the risk of value_from_cache incorrectly returning nil from probable to less probable. I'll let you decide if that is safe or not.

For what it's worth, we've tested both with and without this patch in a high-load, high-concurrency production environment. Without this patch we see numerous intermittent errors where nil is returned incorrectly; with the patch we have yet to see an error.

DmitryTsepelev reopened this Feb 7, 2023
@DmitryTsepelev
Owner

Sorry guys, I don't know how, but GitHub does not send me events from some projects. I'll take a look tomorrow morning.

@DmitryTsepelev
Owner

If I understand correctly, the problem is that the race condition makes the cache return nil even if the value is not nil anymore. This is only possible when it was nil before the cache was rewritten. I think this behavior is somewhat acceptable, because the user does not get wrong data, just outdated data, and the tradeoff of a cache is fast data vs. fresh data.

I guess we could add it to the docs and call it a day, what do you think?

@gsdean
Contributor

gsdean commented Feb 8, 2023

The cache returns a wrong value, not an outdated one. I do not believe documentation alone is sufficient; in a moderately concurrent deployment, this is a serious issue.

@tsugitta

tsugitta commented Feb 8, 2023

If I understand correctly, the problem is that race condition makes cache to return nil even if it's not nil anymore.

In my situation, the problem seems to be that the race condition makes the cache return nil even though the value has never been nil.

In reality, there are only two possibilities: the cache key exists and holds a value, or the key does not exist. However, it behaves as if the key exists and the value is nil.

@DmitryTsepelev
Owner

The only way to synchronize threads here is using some kind of lock: we could make it configurable and let users opt in. It would fix the issue completely but would slow things down a bit. WDYT?

@DmitryTsepelev
Owner

Also, it sounds like we could change the behavior to store something other than nil in the cache. In that case we would have no problem differentiating the scenarios "cache contains nil" and "cache does not have this key".
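
A minimal sketch of that sentinel idea, assuming a Hash-like store; the names CACHED_NIL, write_fragment, and read_fragment are illustrative, not the gem's API:

```ruby
# Marker written in place of nil so a single read can tell
# "cached nil" apart from "no such key"
CACHED_NIL = "__fragment_cache_nil__"

def write_fragment(store, key, value)
  store[key] = value.nil? ? CACHED_NIL : value
end

def read_fragment(store, key)
  raw = store[key]
  return :miss if raw.nil?         # key absent: caller runs the resolver
  raw == CACHED_NIL ? nil : raw    # cached nil vs. a real value
end

store = {}
write_fragment(store, "post-1", nil)
read_fragment(store, "post-1") # => nil (cached nil, no re-resolution needed)
read_fragment(store, "post-2") # => :miss (caller runs the resolver)
```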

@gsdean
Contributor

gsdean commented Feb 8, 2023

My vote would be to either:

  1. Just take the hit on the extra read (as we've done in our workaround). This is a fairly pragmatic solution. Maybe start here and add an issue to revisit the extra read.
  2. Avoid caching nil by default (as you've suggested). Perhaps we could also add a parameter cache_nil: true to preserve compatibility and use the extra read from 1.

I would actively discourage locks. As you've said, they would be slow, but they are also very hard to manage in a multi-server deployment such as the one we have in production (a load balancer in front of multiple web servers).

@tsugitta

tsugitta commented Feb 8, 2023

I'm not familiar with the cache API or the instrumentation API, but it seems that the read method records whether or not there was a cache hit, so by using that, can't we call read only once, without calling exist? at all?

https://github.com/rails/rails/blob/v7.0.4.2/activesupport/lib/active_support/cache.rb#L379
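
That information can be mimicked with a self-contained stand-in: a read that reports both the value and whether the key was present, which is what the :hit key of the cache_read.active_support notification payload exposes. The read_with_hit helper below is illustrative, not Rails code:

```ruby
# One logical lookup answers both questions: was there a hit, and what
# was the value? A hit plus a nil value means nil was cached deliberately.
def read_with_hit(store, key)
  [store[key], store.key?(key)] # a real store reports the hit from one read
end

store = { "post-1" => nil } # nil cached on purpose

value, hit = read_with_hit(store, "post-1")
# hit == true, value == nil -> cached nil, don't re-resolve

value, hit = read_with_hit(store, "post-2")
# hit == false -> genuine miss, run the resolver
```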

@gsdean
Contributor

gsdean commented Feb 8, 2023

@tsugitta I agree, this is the most correct fix, but it seemed like a big change 🤷

@DmitryTsepelev
Owner

A big change is not an issue; we can make a major release. The problem I'm trying to solve is that read returns nil both when there's no such key at all (so we need to run the resolver method) and when there is a key holding nil (so we want to return nil without running the resolution again). We could also add an option to force re-resolution when there's nil for any reason (e.g., resolve_when_nil: true).

Do you want to create the PR for the workaround?

@tsugitta

tsugitta commented Feb 9, 2023

The problem I'm trying to solve is that read returns nil both when there's no such key at all (so we need to run the resolver method) and when there is a key holding nil (so we want to return nil without running the resolution again).

Yes, I'm also trying to solve the same problem. As I commented above, read seems to notify us via instrumentation whether there was a cache hit, so by looking at that, we can determine whether the cache doesn't exist or the cache exists and the value is nil.
In addition, for the case where nil is regarded as a valid cache value, the two requests (read and exist?) can be reduced to one.

Aside from that, the workaround seems great. I think it would be nice if I could submit a PR, but I haven't figured out the structure of the library yet, so I don't know where to start and it may take some time. So I don't think I'm the right person, at least for now.

@tsugitta

tsugitta commented Feb 10, 2023

I had overlooked that the version I was using was outdated. After upgrading this library from 1.9.1 to 1.18.1 (and Rails from 6.1.4.4 to 7.0.4), this problem, which used to happen dozens of times a day in our application that accesses the cache frequently, never happened again (it has not been reported to Sentry since).

@DmitryTsepelev
Owner

Interesting 🤔 I don't remember us addressing it here in the gem; maybe something in Rails changed for the better.

@gsdean
Contributor

gsdean commented Jun 3, 2023

As far as I can tell this is still an issue. Any thoughts on the PR? It would be really great to either merge that or leave this open until it’s resolved.
