
use lock in InMemoryNormalizedCache to avoid race conditions #552

Merged: 2 commits into master from InMemoryNormalizedCache_queue, Jan 25, 2020

Conversation

@RolandasRazma (Contributor) commented May 28, 2019

We're seeing some crashes when accessing RecordSet while doing .clear() on InMemoryNormalizedCache:

* thread #35, queue = 'com.apollographql.DataLoader', stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
  frame #0: 0x00000001c71bf530 libobjc.A.dylib`objc_msgSend + 16
  frame #1: 0x00000001f60ee364 libswiftCore.dylib`Swift.Dictionary._Variant.subscript.getter : (A) -> Swift.Optional<B> + 148
  frame #2: 0x0000000106044da8 Apollo`RecordSet.subscript.getter(key="User_12", self=Apollo.RecordSet @ 0x00000001701b20d8) at RecordSet.swift:24:19
  frame #3: 0x000000010602fc64 Apollo`closure #1 in InMemoryNormalizedCache.loadRecords($0="User_12", self=0x0000000282b1ef60) at InMemoryNormalizedCache.swift:9:42
  frame #4: 0x000000010602fce8 Apollo`thunk for @callee_guaranteed (@guaranteed String) -> (@owned Record?, @error @owned Error) at <compiler-generated>:0
  frame #5: 0x000000010602fdcc Apollo`partial apply for thunk for @callee_guaranteed (@guaranteed String) -> (@owned Record?, @error @owned Error) at <compiler-generated>:0
  frame #6: 0x00000001f6074ef4 libswiftCore.dylib`(extension in Swift):Swift.Collection.map<A>((A.Element) throws -> A1) throws -> Swift.Array<A1> + 748
  frame #7: 0x000000010602fb1c Apollo`InMemoryNormalizedCache.loadRecords(keys=1 value, self=0x0000000282b1ef60) at InMemoryNormalizedCache.swift:9:24
  frame #8: 0x0000000106030228 Apollo`protocol witness for NormalizedCache.loadRecords(forKeys:) in conformance InMemoryNormalizedCache at <compiler-generated>:0
  frame #9: 0x0000000106001c20 Apollo`partial apply at <compiler-generated>:0
  frame #10: 0x00000001060099c4 Apollo`closure #1 in DataLoader.dispatch(self=0x00000002831abd00) at DataLoader.swift:53:12
  frame #11: 0x0000000105ff0d50 Apollo`thunk for @escaping @callee_guaranteed () -> () at <compiler-generated>:0
  frame #12: 0x00000001084176f0 libdispatch.dylib`_dispatch_call_block_and_release + 24
  frame #13: 0x0000000108418c74 libdispatch.dylib`_dispatch_client_callout + 16
  frame #14: 0x0000000108420bf4 libdispatch.dylib`_dispatch_lane_serial_drain + 712
  frame #15: 0x00000001084218b4 libdispatch.dylib`_dispatch_lane_invoke + 456
  frame #16: 0x000000010842b77c libdispatch.dylib`_dispatch_workloop_worker_thread + 1148
  frame #17: 0x00000001c7bed114 libsystem_pthread.dylib`_pthread_wqthread + 304
  frame #18: 0x00000001c7befcd4 libsystem_pthread.dylib`start_wqthread + 4

sentry.io log (screenshot: Screenshot 2019-07-29 at 13 18 32)

This PR wraps access to InMemoryNormalizedCache.records in an internal queue to avoid race conditions.
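
A minimal sketch of the idea, for readers skimming the thread: wrap all access to the record storage in a single internal queue. This is an illustration only, with simplified synchronous signatures and a plain dictionary standing in for RecordSet; the actual NormalizedCache methods are Promise-based.

import Foundation

// Sketch only: a concurrent queue where every access uses a barrier behaves
// like a serial queue, so reads never observe a half-finished mutation.
final class QueueGuardedRecords {
  private var records: [String: Any] = [:]   // stand-in for Apollo's RecordSet
  private let queue = DispatchQueue(label: "com.apollographql.InMemoryNormalizedCache",
                                    attributes: .concurrent)

  func record(forKey key: String) -> Any? {
    return queue.sync(flags: .barrier) { records[key] }
  }

  func merge(_ newRecords: [String: Any]) {
    queue.async(flags: .barrier) { self.records.merge(newRecords) { _, latest in latest } }
  }

  func clear() {
    queue.async(flags: .barrier) { self.records.removeAll() }
  }
}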

@RolandasRazma (Contributor, Author)

I realised that it isn't the queue that fixes the race conditions but the actual usage of Promises. Will open a smaller PR for that.

@RolandasRazma (Contributor, Author)

Actually, promises alone don't fix it ;/ Will open the PR again...

@RolandasRazma reopened this May 28, 2019
@RolandasRazma deleted the InMemoryNormalizedCache_queue branch May 31, 2019 13:54
@RolandasRazma restored the InMemoryNormalizedCache_queue branch May 31, 2019 13:54
@RolandasRazma reopened this Jun 5, 2019
@@ -1,21 +1,37 @@
import Dispatch
Contributor:

I don't think you need to import this explicitly - it should be part of Foundation

@designatednerd (Contributor) left a comment

This seems like a pretty significant change, although it could be a very helpful one (I know we've got some TSAN issues I'm going to be looking at with #600).

Is there a way to add tests for this change?

@RolandasRazma (Contributor, Author) commented Jul 12, 2019

@designatednerd It should be possible; I can replicate the crash very easily (not sure if it's easy to test).
We have a setup where multiple clients use the same store, and as soon as you start doing work on multiple threads it crashes. We could put everything on one thread just because we know about the bug, but other people might be facing the same problem without realising the reason.

@designatednerd (Contributor)

@RolandasRazma I mean add an automated test so we can make sure a) This is doing what it needs to and b) This doesn't regress in the future 😄

@RolandasRazma (Contributor, Author)

@designatednerd I got that :) I was just saying that it's easy to replicate by hand, but I'm not sure how easy it is to replicate in a test.

@designatednerd (Contributor)

Maybe you could set up two clients that use the same store for the test and try to have them access/mutate things in a certain sequence?

@RolandasRazma (Contributor, Author)

@designatednerd So I tried to write a test to replicate the bug (I wanted to create a separate PR), but because this is a race condition the test doesn't always fail, it slows down all the other tests, and it crashes everything when it does trigger. Not sure if it should be merged in with all the other tests. I will drop it here for now, let me know what you want to do... (Without the changes in this PR I manage to replicate the crash with this test every 3rd run or so.)

  func testThreadedCache() throws {
    let cache = InMemoryNormalizedCache()
    
    let networkTransport = MockNetworkTransport(body: [
      "data": [
        "hero": [
          "name": "Luke Skywalker",
          "__typename": "Human"
        ]
      ]
    ])
    
    let loadingClient = ApolloClient(networkTransport: networkTransport, store: ApolloStore(cache: cache))
    let clearingClient = ApolloClient(networkTransport: networkTransport, store: ApolloStore(cache: cache))
    
    let group = DispatchGroup()

    // Hammer the shared cache from both clients on concurrent global queues
    // to provoke the race between loadRecords(forKeys:) and clear().
    for _ in 0...10000 {
      
      group.enter()
      DispatchQueue.global().async {
        loadingClient.fetch(query: HeroNameQuery(), cachePolicy: .returnCacheDataElseFetch) { (result, error) in
          group.leave()
        }
      }
      
      group.enter()
      DispatchQueue.global().async {
        _ = clearingClient.clearCache()
        group.leave()
      }
      
    }
    
    let expectation = self.expectation(description: "Fetching query")
    group.notify(queue: .main) {
      expectation.fulfill()
    }
    self.wait(for: [expectation], timeout: 5)
    
  }

@designatednerd (Contributor)

I'm really not sure why that would affect other tests, it certainly shouldn't. I'll try to take a look tomorrow.

@RolandasRazma (Contributor, Author)

@designatednerd it affects them in the sense that it has a tight loop which will delay the other tests; also, on a crash it will take everything down and the other tests won't have a chance to run.

@designatednerd (Contributor)

Ah I think I missed that the expectation was within the giant for loop when I was looking at this last night.

It also doesn't seem like we're doing any asserting on the results - in theory we should be able to check after the client's been cleared that it wouldn't have anything in the cache.

@RolandasRazma (Contributor, Author) commented Jul 16, 2019

I don't think it is about clearing; the problem is that the ivar can be accessed simultaneously. It usually breaks in records.merge, as that takes longer than clear.

@designatednerd (Contributor)

OK, coming back to this, I'm a bit confused: Is the test passing when it's run with these changes in place, or is it still failing?

Or are the failures happening on another branch that does not have these changes in place?

@RolandasRazma (Contributor, Author)

The test in the comments will crash on dev (without this PR), but it won't do it consistently (as this is a race condition). I manage to crash it every 3rd run or so.
As far as I can see it will not crash if this PR is applied.

@designatednerd (Contributor)

Yeah then I'd say add it to this PR - that'll at least let us know if we've borked anything in the future.

@RolandasRazma (Contributor, Author)

@designatednerd do you know a better way to test for race conditions? This doesn't feel right...

@RolandasRazma (Contributor, Author)

@designatednerd maybe just add a check for whether the ivar is accessed from multiple threads (as that will lead to a crash sooner or later)?

@designatednerd (Contributor)

Race conditions are really hard to test for - the main thing I'd do is probably check the results of queries from both clients performed in order. It could be something where you just run everything in sequence - and theoretically you should get results back sequentially. This is the sequence I'd probably be checking (a rough sketch of it follows this list):

1. Set Luke Skywalker with the loading client
2. Fetch data with the clearing client -> should be Luke Skywalker, because the set should finish before the fetch starts
3. Clear with the clearing client
4. Fetch data with the loading client -> should be cleared, because the clear should finish before the fetch starts
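
A rough sketch of that sequence as a test. It assumes the MockNetworkTransport, HeroNameQuery, and (result, error) fetch completion from the test posted earlier in the thread; note the clearCache() call here is fire-and-forget, so a real test would need to wait for the clear to finish before the final fetch.

  func testSequencedAccessFromTwoClients() {
    let cache = InMemoryNormalizedCache()
    let store = ApolloStore(cache: cache)
    let networkTransport = MockNetworkTransport(body: [
      "data": ["hero": ["name": "Luke Skywalker", "__typename": "Human"]]
    ])
    let loadingClient = ApolloClient(networkTransport: networkTransport, store: store)
    let clearingClient = ApolloClient(networkTransport: networkTransport, store: store)

    let expectation = self.expectation(description: "Sequenced fetches finished")

    // 1. Populate the cache via the loading client.
    loadingClient.fetch(query: HeroNameQuery(), cachePolicy: .fetchIgnoringCacheData) { _, _ in
      // 2. A cache-only fetch from the other client should now see Luke Skywalker.
      clearingClient.fetch(query: HeroNameQuery(), cachePolicy: .returnCacheDataDontFetch) { result, _ in
        XCTAssertEqual(result?.data?.hero?.name, "Luke Skywalker")
        // 3. Clear the cache, then 4. a cache-only fetch should come back empty.
        _ = clearingClient.clearCache()
        loadingClient.fetch(query: HeroNameQuery(), cachePolicy: .returnCacheDataDontFetch) { result, _ in
          XCTAssertNil(result?.data?.hero)
          expectation.fulfill()
        }
      }
    }

    self.wait(for: [expectation], timeout: 5)
  }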

public init(records: RecordSet = RecordSet()) {
self.records = records
self.queue = DispatchQueue(label: "com.apollographql.InMemoryNormalizedCache", attributes: .concurrent)
Contributor:

Shouldn't we be using a serial queue to ensure that we don't have two blocks accessing it at the same time? Wouldn't concurrent still allow for race conditions?

Contributor Author:

I was thinking the same, but a serial queue would slow things down and I can't replicate the crash with a concurrent one... Maybe the fact that it is accessed on the same thread is enough...
We have had this change in production for the past couple of months (way before I created the PR) and see 0 crashes.

Contributor:

Okay, you're right that a serial queue is less performant, but this concurrent queue doesn't seem to necessarily address the issue. Of course it's up to @designatednerd in the end, and I'm glad that this workaround has worked for your team in production. But I wouldn't feel comfortable pulling in a workaround for an issue that we can't consistently reproduce and that doesn't seem to make perfect sense.

It seems that this may happen to avoid the race conditions present in your code base currently, but it doesn't completely remove the possibility of them occurring.

I'd feel much more comfortable going with NSRecursiveLock as a solution here than dispatching to another queue.

Additionally, though I understand it may not be possible, it would be great if we could work a bit more on trying to create a test case that consistently crashes.

I'd be willing to spend some time looking into this and try to reproduce a crashing test, but I'm not sure I'll get much further than you did, @RolandasRazma.

Contributor Author:

Completely agree it needs better testing. But the only thing I can think of is adding a check for whether the ivar is accessed from multiple threads.

@jimisaacs (Contributor), Jul 27, 2019

Hello, I've been looking around to see all the work being done on this library recently and saw this PR and this thread, so I hope you don't mind me chiming in. The reason why concurrent works is because you're using it with barriers. If you only ever access a concurrent queue with barriers, you've essentially made it function exactly like a serial queue. A common approach to synchronizing with a queue is to use a concurrent queue, then use async barrier writes and sync non-barrier reads. That keeps the happy path fast and the write path safe.
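
For reference, the generic shape of the pattern described above (illustrative only, not this PR's code): synchronous non-barrier reads and asynchronous barrier writes on one concurrent queue.

import Foundation

final class SynchronizedBox<Value> {
  private var value: Value
  private let queue = DispatchQueue(label: "synchronized-box", attributes: .concurrent)

  init(_ value: Value) { self.value = value }

  // Reads run concurrently with each other but never overlap a write.
  func read<T>(_ body: (Value) throws -> T) rethrows -> T {
    return try queue.sync { try body(value) }
  }

  // Writes wait for in-flight reads to drain, then run exclusively.
  func write(_ body: @escaping (inout Value) -> Void) {
    queue.async(flags: .barrier) { body(&self.value) }
  }
}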

Contributor:

IIRC, I think async reads would work with the pattern described too, but that would definitely need testing because I'm not 100% sure of any side effects inside the task blocks here... if any. Otherwise, might as well take the first suggestion: if it's all "barrier'ed up", you might as well make it serial and remove the barriers.

Contributor:

Yeah, if you are always using barriers, then serial is the same! Thanks @jimisaacs!

@AnthonyMDev (Contributor)

Just sent this as a message to @designatednerd, but probably should just put it here for discussion.

It looks like we are using both a DispatchQueue with .barrier AND a ReadWrite cache lock. That seems like overkill. Adding an additional queue on the cache itself seems like a crazy amount of unneeded locking.

If there are still crashes being reported currently, then that indicates to me that we aren't locking in the correct places. But if we are going to add new locks, I'd like to kill some of the old ones and just figure out the correct places to lock, rather than continuing to add more and more locks on top of each other.

Does this seem to be right, or am I missing something about the locks and queues that are currently in place?

@AnthonyMDev (Contributor)

Alright, so I've written a modified test. This one does not crash, but it does fail the equality assertion a random number of times, and it fails consistently. It's failing because the cache is being cleared between the write and the loadRecords() calls.

When I implement the PR from @RolandasRazma, this test passes! It's not showing the EXC_BAD_ACCESS crash, but I think I'm getting at the same underlying behavior.

func testMultipleStoresOneCacheConcurrentAccess() throws {
    let initialRecords: RecordSet = [
      "2001": ["name": "R2-D2", "__typename": "Droid", "primaryFunction": "Protocol"]
    ]

    let cache = InMemoryNormalizedCache(records: initialRecords)

    let store1 = ApolloStore(cache: cache)
    let store2 = ApolloStore(cache: cache)

    let expectation = self.expectation(description: "Read updated query")

    var data = HeroDetails(unsafeResultMap: ["name": "R2-D2",
                                             "__typename": "Droid",
                                             "primaryFunction": "Protocol"])

    // Interleave read/write transactions on store1 with cache clears from store2.
    for i in 1...1000 {
      DispatchQueue.global().async {
        _ = store1.withinReadWriteTransaction { transaction1 in
          data.name = "R4-D\(i)"
          try! transaction1.write(object: data, withKey: "2001")

          let heroName = try? transaction1.loadRecords(forKeys: ["2001"]).await().first??["name"] as? String

          XCTAssertEqual(heroName, "R4-D\(i)")
          if i == 1000 {
            expectation.fulfill()
          }

        }
      }

      DispatchQueue.global().async {
        store2.clearCache()
      }
    }

    self.wait(for: [expectation], timeout: 5)
  }

However, the more I think about this, the more I think that the ApolloStore and the cache itself shouldn't both have their own locks. It seems like a lot of overkill and causes performance issues. Also, every future implementation of a new cache type would need to handle its own locking.
I understand using the same cache with multiple clients, but I think that in that case you should always use one shared ApolloStore. Since the store is already handling thread safety here, using the same cache in different stores should be considered a developer error and not supported.

If I do both the clearing of the cache and the read/write on the same store, this test passes without the new code from this PR.
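
In code, the recommended usage looks like this (the transport names are placeholders): two clients sharing a single ApolloStore rather than two stores wrapping one cache.

let cache = InMemoryNormalizedCache()
let sharedStore = ApolloStore(cache: cache)

// Both clients go through the same store, so its synchronization
// covers every read, write, and clear.
let clientA = ApolloClient(networkTransport: transportA, store: sharedStore)
let clientB = ApolloClient(networkTransport: transportB, store: sharedStore)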

@AnthonyMDev (Contributor)

I've just confirmed that in the test case @RolandasRazma provided, if you use one single store with both clients, the crash also goes away.

@RolandasRazma (Contributor, Author)

I think I can reproduce it even with one store and cache; I need to look into it.

@AnthonyMDev (Contributor)

Sorry I've been MIA on this for a while. I think the safest and most performant way to handle this is to handle locking as close to the source as possible. If certain types of caches don't even need the locking then we shouldn't force them to use it.
Having the lock at any level higher than the cache itself lends itself to more issues like the one we are currently seeing, where multiple stores access the same cache.
Different caches may be able to implement locking differently to improve performance based on their implementation details as well.

I strongly support locking at the cache level.

@designatednerd (Contributor)

@AnthonyMDev Makes sense - what do you think about having the locking work in a protocol extension?

@AnthonyMDev (Contributor)

I'm not a fan of that, just because it limits the configuration of how the locking is implemented too much, in my opinion. Writing locking code isn't fun, but it's not the most difficult thing in the world to do. When you are writing a caching layer, that's something that's expected. Locking for thread safety is too important and too particular to have a shared implementation for, I think.

@designatednerd (Contributor)

I think Anthony makes some great points here - locking at the cache level rather than at any higher level is probably the least-risky way to go. I think he's also probably right that I'm trying too hard to throw Protocol Extensions at the problem. :)

Is it worth continuing this within this particular PR? Do any of you have some time to work on this?

@AnthonyMDev (Contributor)

Haha, to be fair, I LOVE extracting as much logic as possible into protocol extensions. It's definitely the right mindset to be in by default. I think this is just a rare exception to that. :)

@AnthonyMDev (Contributor)

As for your questions:
I think the direction this PR took isn't what we are going to go with, so I don't think we need to continue with it. Thanks so, so, so MUCH @RolandasRazma for bringing this up, researching it, coming up with the test case, and providing so much information! If you or @designatednerd have time to implement the locking in the cache and kill the locking in the store, go for it.
I'm not sure when I'll be able to find time to take this on, but if no one else can, I will try to do it next week.

If either of you works on this, let me know so we don't work over each other. I'd also like to provide my test cases to you so that you can use them to help verify that the new locking works. I riffed off of @RolandasRazma's tests, and even though they still don't fail 100% of the time, they fail more consistently now. :)

@designatednerd (Contributor)

Unfortunately this isn't my highest priority at the moment but I'm happy to provide support for whoever has time to work on it!

@AnthonyMDev (Contributor)

Roger that. It's not my highest priority right now either haha ;) But I'll try to make it happen when I get a chance.

@RolandasRazma (Contributor, Author)

Should we keep this open until it can be addressed properly, so other people know about the problem?

@designatednerd (Contributor)

We can either keep it open or open a new issue to summarize what we've figured out and what needs to happen - I'll leave that up to you @RolandasRazma.

@AnthonyMDev (Contributor) commented Sep 25, 2019 via email

@RolandasRazma (Contributor, Author) commented Jan 13, 2020

Was there any progress on this? I reverted to the stock implementation (latest from GitHub) and we're getting around 80 crashes a day. The stack is not very useful, but it goes something like...

__hidden
__hidden
RecordSet.subscript.getter
redacted
_hidden
_hidden
InMemoryNormalizedCache.loadRecords

@RolandasRazma (Contributor, Author)

Looks like a simple NSLock around the self.records access is enough to stop the crashing, with the latest changes in ApolloStore and InMemoryNormalizedCache.

@designatednerd what do you think about such a fix? Can I get into a deadlock somehow?
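
A minimal sketch of that idea, with simplified synchronous signatures and the lock inlined; the real cache methods return Promises, and per the review below the merged change ended up using NSRecursiveLock.

import Foundation

final class LockGuardedCache {
  private var records = RecordSet()
  private let recordsLock = NSLock()

  func record(forKey key: String) -> Record? {
    recordsLock.lock()
    defer { recordsLock.unlock() }
    return records[key]
  }

  func clear() {
    recordsLock.lock()
    defer { recordsLock.unlock() }
    records = RecordSet()   // reset the storage rather than mutating it in place
  }
}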

@designatednerd (Contributor)

I mean, you can always get into a deadlock, but I suspect that if it doesn't deadlock in the tests, which are doing a lot of different accesses of the code, it won't in practice. Probably worth adding a test which explicitly checks accessing it from two different threads.

Would love to see this update get over the line!

@RolandasRazma (Contributor, Author)

I was testing with the test @AnthonyMDev ended up with. Looks fine. I tried a mutex as well but didn't notice any speed difference, and NSLock is easier to use. We pushed it to production (live dangerously and stuff), but it's unlikely we'll hear anything from users, as people don't report deadlocks :) Will open a PR so we can continue the discussion.

@designatednerd (Contributor)

Awesome, looking forward to it! Heads up that I'm on vacation for the next week and a half so responses may be slow, but I'll try to get a look as I can.

@RolandasRazma (Contributor, Author)

@designatednerd pushed the changes, and also included @AnthonyMDev's test, as it now crashes 100% of the time without the lock (after the locks were removed from ApolloStore).

@@ -2,6 +2,7 @@ import Foundation

public final class InMemoryNormalizedCache: NormalizedCache {
private var records: RecordSet
private let recordsLock = NSLock()
Contributor:

Is there a reason why we aren't using NSRecursiveLock here? If the implementation of accessing the records ever changes to need to access the cache recursively or anything, then this is going to deadlock.

If there is a good reason I'm missing to use a regular NSLock, then great, but I prefer to default to recursive if possible for safety.

Contributor Author:

no good reason, switched

@RolandasRazma changed the title from "use queue in InMemoryNormalizedCache to avoid race conditions" to "use lock in InMemoryNormalizedCache to avoid race conditions" on Jan 21, 2020
@designatednerd (Contributor) left a comment

Looks reasonable to me, and a far simpler change than earlier. I'm interested in eventually abstracting some of this logic into a property wrapper, but for now let's get this going!

FYI this won't ship as part of a release until I'm back on the ground to support it, but I'll merge it now if you want to use it off master.

@lukelabonte

> Looks reasonable to me, and a far simpler change than earlier. I'm interested in eventually abstracting some of this logic into a property wrapper, but for now let's get this going!
>
> FYI this won't ship as part of a release until I'm back on the ground to support it, but I'll merge it now if you want to use it off master.

@designatednerd do you have an ETA on when you'll be back? I'm wondering when we should expect a new release that would include this fix.

@designatednerd (Contributor)

This shipped a year ago as part of 0.22.0. There have been substantial other changes around how we're handling this, I'd appreciate you filing a new issue with updated details if you're having problems. Thank you!
