
DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions; add helpers for updating artificial testing latencies #11312

Merged
39 commits merged into main, May 4, 2023

Conversation

david-mcafee (Member) commented Apr 26, 2023

Description of changes

Adding unit tests to replicate the behavior in a long-standing P1, to assist in better diagnosing the problem.

This PR tests:

  1. Variable connection latencies (fast vs slow connections)
  2. Whether or not we wait for the outbox after each update
  3. Whether or not we wait for the initial record creation to exit the outbox
  4. The observed updates from DataStore.observe()
  5. Queried updates
  6. The records, requests, and running mutations on the fake service
  7. That the fake GraphQL service has not stopped subscriptions, and that all subscription messages have been sent.

Though the original issue seemed to be related to poor connection speeds, I discovered that this can also be reproduced by rapid consecutive saves. Additionally, there is a problem with attempting to update a newly created record that has not yet left the outbox. This PR contains many permutations on how we make updates in order to test how the outbox is handling the merging of multiple outgoing requests.
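To make the merging behavior under test concrete, here is a toy sketch of an outbox that folds consecutive updates to the same record into a single pending mutation. This is illustrative only; the names and shapes are made up and the real outbox in packages/datastore is considerably more involved.

```typescript
// Toy sketch of outbox merging: a second pending update to the same record
// is merged into the existing pending mutation rather than queued separately.
// (Illustrative only; not the real outbox implementation.)
type PendingUpdate = { modelId: string; fields: Record<string, unknown> };

function enqueue(outbox: PendingUpdate[], update: PendingUpdate): PendingUpdate[] {
	const last = outbox[outbox.length - 1];
	if (last && last.modelId === update.modelId) {
		// Merge the field sets; later values win.
		const merged = {
			modelId: update.modelId,
			fields: { ...last.fields, ...update.fields },
		};
		return [...outbox.slice(0, -1), merged];
	}
	return [...outbox, update];
}
```

Rapid consecutive saves exercise exactly this path: whether the second save lands before or after the first leaves the queue determines whether a merge happens at all, which is why small latency changes flip the behavior.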

Additionally, I've added a few utils to the fake GraphQL service to allow for adjusting the artificial latencies.

Lastly, the failing tests are skipped because the problematic behavior still exists. However, I have added a TODO to the outbox's syncOutboxVersionsOnDequeue, as this is the source of the issue, and I am currently working on a fix. Essentially, when we are merging mutations in the outbox, incoming data from AppSync contains all the fields in the record, whereas outgoing data contains only the updated fields, resulting in an error when comparing mutations that should be equal.
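The field-set mismatch described above can be shown in miniature (hypothetical record shapes and field names; this is not the real syncOutboxVersionsOnDequeue code):

```typescript
// Incoming data from AppSync carries every field on the record, while an
// outgoing update mutation carries only the changed fields. A naive
// whole-object comparison then reports two logically equal mutations as
// different. (Hypothetical sketch; values are illustrative.)
const incoming = { id: '1', title: 'post title 2', _version: 3 };
const outgoing = { title: 'post title 2' };

// Naive comparison fails purely because of the missing fields:
const naiveEqual = JSON.stringify(incoming) === JSON.stringify(outgoing);

// Comparing only the fields present on the partial payload avoids that:
function fieldsMatch(
	full: Record<string, unknown>,
	partial: Record<string, unknown>
): boolean {
	return Object.keys(partial).every(k => full[k] === partial[k]);
}
```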

Question before you review!

All 6 of these tests are essentially the same test, with variations on 1) connection latencies, 2) whether or not we wait for the outbox on each mutation, and 3) the expected end values of observed updates, queried updates, and what we observe on the service. I am torn on whether to create a single test function that accepts a few params and then adjusts the test assertions, latencies, and whether to wait on the outbox. Personally, I like the readability of the current approach (especially because of the test-specific comments in-line), but I'm open to suggestions.
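For reference, the parameterized alternative would look roughly like this: a table of scenario configs driving one shared test body. All names and expected values below are made up for illustration; with Jest, the table could drive test.each.

```typescript
// Sketch of a table-driven layout for the scenario permutations. Each entry
// captures the knobs that vary: latency, outbox-waiting behavior, and the
// expected [title, _version] subscription log. (Illustrative values only.)
type ScenarioConfig = {
	name: string;
	latencyMs: number; // artificial connection latency
	waitForOutboxAfterEachUpdate: boolean;
	expectedSubscriptionLog: [string, number][];
};

const scenarios: ScenarioConfig[] = [
	{
		name: 'fast connection, awaiting outbox after each update',
		latencyMs: 10,
		waitForOutboxAfterEachUpdate: true,
		expectedSubscriptionLog: [
			['post title 0', 1],
			['post title 1', 2],
			['post title 2', 3],
		],
	},
	{
		name: 'slow connection, not awaiting outbox',
		latencyMs: 500,
		waitForOutboxAfterEachUpdate: false,
		expectedSubscriptionLog: [['post title 2', 2]],
	},
	// ...remaining scenarios elided
];
```

The trade-off is exactly the one raised above: the table makes the permutations scannable, but the per-scenario inline comments explaining why each version sequence is expected become harder to place.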

Issue #, if available

Description of how you validated changes

Checklist

  • PR description included
  • yarn test passes
  • Tests are changed or added
  • Relevant documentation is changed or added (and PR referenced)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@david-mcafee david-mcafee changed the title Datastore consistency testing DataStore unit tests for observed rapid single-field mutations with variable connection latencies Apr 27, 2023
@david-mcafee david-mcafee self-assigned this Apr 27, 2023
@david-mcafee david-mcafee added Testing DataStore Related to DataStore category labels Apr 27, 2023
@david-mcafee david-mcafee marked this pull request as ready for review April 27, 2023 23:48
@david-mcafee david-mcafee requested review from a team as code owners April 27, 2023 23:48
svidgen (Contributor) left a comment

It would probably be helpful to create a util method or two for these tests so it's easier to parse out the "narrative" of each test. Maybe a util for the observer-logger bit, and almost undoubtedly a method for the graphqlservice awaiting — though I don't actually understand why we're doing that part!

Either way, these tests are marked as skipped, which leads me to believe this is the desired behavior that we're not meeting yet. That being the case, I want to make sure I understand why we're expecting the particular observe messages we're identifying in each case. Can we walk through these together? (Ideally after factoring out some utils to make each test narrative more "top-level".) Maybe drag @iartemiev and/or @manueliglesias into that discussion.

I think it might also be good to get a branch of this PR that just demonstrates today's behavior -- regardless of whether it's "correct". (And un-mark the tests as "skipped".) Then, when we're settled on what the correct behavior is, we can see the fix and the diffs to the expected observe() messages alongside each other, etc.

packages/datastore/src/sync/outbox.ts (review thread resolved)
packages/datastore/__tests__/connectivityHandling.test.ts (review thread resolved)
Comment on lines 1155 to 1157
// Skipping because currently, DataStore returns `undefined` versions on each update.
// Note: may need to fine tune the assertions once the issue is fixed.
test.skip('rapid mutations on poor connection when initial create is pending', async () => {
Contributor:

nit: Can we throw blank lines between the tests so they don't visually bleed into each other?

david-mcafee (Member Author):

Definitely! I'm surprised I didn't do that here, as I'm a big fan of spacing :)

Comment on lines 995 to 999
expect(subscriptionLog).toEqual([
['post title 0', 1],
['post title 1', 1],
['post title 2', 3],
]);
Contributor:

It's not actually clear to me why these are the only observe() messages we expect to see. Shouldn't we expect to see messages from the initial saves, and then 1 to 2 updates echoed back from each message that actually hits the service?

Maybe we should expand these assertions to explain where each update is expected to come from.

Alternatively ... maybe we shouldn't assert on the individual observer messages at all in these tests. I'm tempted to say what we really want to assert on is final state + the last observe() message. I'm eager to discuss that though, and would be interested to loop @iartemiev and/or @manueliglesias into that discussion.

david-mcafee (Member Author) commented Apr 28, 2023

I wasn't concerned with the subscription message related to the initial creation of the record in these tests, as I was focused on the updates specifically. However, that's a really valid point, and I'll instantiate the subscription prior to creating the record. I will also expand on these assertions for clarity - I think that's also a really great point.

One of the challenges I faced when writing these tests was replicating the problematic behavior, as I wasn't hitting the right code paths in the outbox. For instance, a slight adjustment in latencies / pauses could result in completely different behavior, given the fast execution of the tests. There are a lot of test permutations here because there are several ways this can ultimately fail.

I created assertions on all sub messages because this sheds light on whether my test case is actually testing the outbox code path I expect it to. An argument could be made for only checking the number of requests received by the service, but then I don't know if the subscription message I'm asserting on is the final subscription message. I also want to know whether the sequence of version updates is what I expect it to be. If I only assert on final values, it's possible that the tests aren't exercising outbox merging at all, and then we completely lose the value of these tests.

@david-mcafee david-mcafee requested review from a team and removed request for a team April 28, 2023 20:01
@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies DataStore unit tests for observed rapid single-field mutations with variable connection latencies, including single and multiple concurrent client updates Apr 28, 2023
@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies, including single and multiple concurrent client updates DataStore unit tests for observed rapid single-field mutations with variable connection latencies, including updates from both single and multiple concurrent clients Apr 28, 2023
david-mcafee (Member Author) commented Apr 28, 2023

It would probably be helpful to create a util method or two for these tests so it's easier to parse out the "narrative" of each test. Maybe a util for the observer-logger bit, and almost undoubtedly a method for the graphqlservice awaiting — though I don't actually understand why we're doing that part!

I'm not sure I see that much logic to extract to a util for the DataStore.observe messages, as we are simply creating a subscription, logging each result, and then making final assertions. If you feel strongly about this, perhaps we can sync offline to discuss exactly what you are proposing?
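For what it's worth, the observer-logger util being discussed might be as small as the following. This is a sketch against a generic observable shape, not the real DataStore.observe() types; the names are illustrative.

```typescript
// Collects a projection of each observable message into a log, so a test can
// make final assertions on the full sequence. (Sketch only; the real
// DataStore.observe() emits richer message objects.)
type Subscription = { unsubscribe(): void };
type ObservableLike<T> = { subscribe(next: (value: T) => void): Subscription };

function collectMessages<T, R>(
	source: ObservableLike<T>,
	project: (value: T) => R
): { log: R[]; stop: () => void } {
	const log: R[] = [];
	const sub = source.subscribe(value => log.push(project(value)));
	return { log, stop: () => sub.unsubscribe() };
}
```

In a test this would reduce the observer-logger bit to one line, e.g. collecting `[title, _version]` pairs for the final `toEqual` assertion.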

Re: a util for the fake GraphQL service, I definitely see the benefit of extracting that out. As to why we are checking the service and making assertions on it - I consistently ran into two issues while writing these tests. First, as I began updating the latency values, as well as the update speed, I noticed that the service hadn't always received all the updates before I made my final assertions. In other words, it would eventually break, but not at the time I was making assertions. In this case, the mutations were still in the outbox, or were being delayed by latency. Second, and related, the service may have received an update but still be processing a mutation when I make a final assertion.

In both cases, this ultimately meant that I was asserting that the final state of things was successful, but the service was either 1) not receiving the number of updates I expected, or 2) still processing them. It's important that we assert the number of updates received, because we want to know just how many merges were performed by the outbox. A slight tweak in a pause or latency would result in different outbox behavior, and all of a sudden we are no longer testing the code paths we were hoping to test. By tracking received requests (existing behavior) and updating the service to track running mutations, as I've done in this PR, we can make our final test assertions with much greater confidence.

Either way, these tests are marked as skipped, which leads me to believe this is the desired behavior that we're not meeting yet. That being the case, I want to make sure I understand why we're expecting the particular observe messages we're identifying in each case. Can we walk through these together? (Ideally after factoring out some utils to make each test narrative more "top-level".) Maybe drag @iartemiev and/or @manueliglesias into that discussion.

You're correct that these test the desired behavior - I was torn here, as it seemed a bit odd to commit tests that can't pass against the broken code, but in this case, I think it makes sense since we're going to fix it. As to the "why" of testing DataStore.observe messages - this gives us a clear picture of what messages are being merged by the outbox, as well as the sequential success of all subscription messages (now we know what values are coming through for updated fields, as well as how _version is being incremented). Essentially, I want to know with absolute certainty what is happening from beginning to end, not just the end result. I want to know if the number of messages received by DataStore.observe matches the number of requests being processed by the service. If I don't check this, either something is broken, or a response is in flight but delayed by the fake latency - in which case I may assert against a subscription message even though the final subscription message, which I have not yet received, will revert the value. Happy to discuss offline to explain these further!

I think it might also be good to get a branch of this PR that just demonstrates today's behavior -- regardless of whether it's "correct". (And un-mark the tests as "skipped".) Then, when we're settled on what the correct behavior is, we can see the fix and the diffs to the expected observe() messages alongside each other, etc.

Sounds good to me!

svidgen (Contributor) commented Apr 28, 2023

In this case, the mutations were still in the outbox, or being delayed by latency ... A slight tweak in a pause or latency would result in different outbox behavior, and all of a sudden we are no longer testing the code paths we were hoping to test.

That's all a bit concerning, actually. But, I'll dig on that independently. I'm pretty confident the Hub events that waitForEmptyOutbox() is looking for should only be fired once the mutations are "processed" (Http response). But, maybe the extra steps you're performing are controlling more for subscriber messages too. Does that sound right?

Or, are we somehow getting Hub events that are prematurely signaling that the outbox is empty?

-- Edit --

I believe this is where the mutation processor initiates the signal that becomes the Hub event.

https://github.com/aws-amplify/amplify-js/blob/main/packages/datastore/src/sync/processors/mutation.ts#L291-L296

If I'm reading it right, it should only occur once the message has been sent, succeeded, and the mutation removed from the outbox.

Maybe this isn't a blocker for this PR though. You're performing some additional checks — I guess I'd be curious to see a 1-pointer to spike on this to make sure we don't have unexpected, premature, or rogue outbox completed messages, along with a more precise explanation for why we need to check the graphql service fake.

@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies, including updates from both single and multiple concurrent clients DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions May 2, 2023
@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions, and ability to update latencies May 2, 2023
@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions, and ability to update latencies DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions; add ability to update artificial testing latencies May 2, 2023
@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions; add ability to update artificial testing latencies DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions; add helpers for updating artificial testing latencies May 2, 2023
david-mcafee (Member Author):

Maybe this isn't a blocker for this PR though. You're performing some additional checks — I guess I'd be curious to see a 1-pointer to spike on this to make sure we don't have unexpected, premature, or rogue outbox completed messages, along with a more precise explanation for why we need to check the graphql service fake.

I've added a 1-pointer for further investigation. If there is still any confusion with the updated comments throughout, please let me know and I'll add further clarification.

@david-mcafee david-mcafee requested a review from svidgen May 2, 2023 21:11
svidgen (Contributor) left a comment

Looks good!

There are some repeated blocks in the tests that might warrant a test util or something down the road. But, I'm not asking for that now. Just planting the seed!

Thanks David!

@david-mcafee david-mcafee merged commit 3ca1913 into main May 4, 2023
@david-mcafee david-mcafee deleted the datastore-consistency-testing branch May 4, 2023 15:21