
DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions; add helpers for updating artificial testing latencies #11312

Merged
39 commits merged into main, May 4, 2023

Conversation

david-mcafee (Member) commented Apr 26, 2023

Description of changes

Adding unit tests to replicate the behavior in a long-standing P1, to assist in better diagnosing the problem.

This PR tests:

  1. Variable connection latencies (fast vs slow connections)
  2. Whether or not we wait for the outbox after each update
  3. Whether or not we wait for the initial record creation to exit the outbox
  4. The observed updates from DataStore.observe()
  5. Queried updates
  6. The records, requests, and running mutations on the fake service
  7. That the fake GraphQL service has not stopped subscriptions, and that all subscription messages have been sent.

Though the original issue seemed to be related to poor connection speeds, I discovered that this can also be reproduced by rapid consecutive saves. Additionally, there is a problem with attempting to update a newly created record that has not yet left the outbox. This PR contains many permutations on how we make updates in order to test how the outbox is handling the merging of multiple outgoing requests.
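To make the merging behavior under test concrete, here is a toy sketch of an outbox that folds consecutive updates to the same record into a single pending mutation. This is illustrative only; the names and shapes are made up and the real outbox in packages/datastore is considerably more involved.

```typescript
// Toy sketch of outbox merging: a second pending update to the same record
// is merged into the existing pending mutation rather than queued separately.
// (Illustrative only; not the real outbox implementation.)
type PendingUpdate = { modelId: string; fields: Record<string, unknown> };

function enqueue(outbox: PendingUpdate[], update: PendingUpdate): PendingUpdate[] {
	const last = outbox[outbox.length - 1];
	if (last && last.modelId === update.modelId) {
		// Merge the field sets; later values win.
		const merged = {
			modelId: update.modelId,
			fields: { ...last.fields, ...update.fields },
		};
		return [...outbox.slice(0, -1), merged];
	}
	return [...outbox, update];
}
```

Rapid consecutive saves exercise exactly this path: whether the second save lands before or after the first leaves the queue determines whether a merge happens at all, which is why small latency changes flip the behavior.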

Additionally, I've added a few utils to the fake GraphQL service to allow for adjusting the artificial latencies.

Lastly, the failing tests are skipped because the problematic behavior still exists. However, I have added a TODO to the outbox's syncOutboxVersionsOnDequeue, as this is the source of the issue, and I am currently working on a fix. Essentially, when we are merging mutations in the outbox, incoming data from AppSync contains all the fields in the record, whereas outgoing data contains only the updated fields, resulting in an error when comparing mutations that should be equal.
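The field-set mismatch described above can be shown in miniature (hypothetical record shapes and field names; this is not the real syncOutboxVersionsOnDequeue code):

```typescript
// Incoming data from AppSync carries every field on the record, while an
// outgoing update mutation carries only the changed fields. A naive
// whole-object comparison then reports two logically equal mutations as
// different. (Hypothetical sketch; values are illustrative.)
const incoming = { id: '1', title: 'post title 2', _version: 3 };
const outgoing = { title: 'post title 2' };

// Naive comparison fails purely because of the missing fields:
const naiveEqual = JSON.stringify(incoming) === JSON.stringify(outgoing);

// Comparing only the fields present on the partial payload avoids that:
function fieldsMatch(
	full: Record<string, unknown>,
	partial: Record<string, unknown>
): boolean {
	return Object.keys(partial).every(k => full[k] === partial[k]);
}
```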

Question before you review!

All 6 of these tests are essentially the same test, with variations on 1) connection latencies, 2) whether or not we wait for the outbox on each mutation, and 3) the expected end values of observed updates, queried updates, and what we observe on the service. I am torn on whether to create a single test function that accepts a few params and then adjusts the test assertions, latencies, and whether to wait on the outbox. Personally, I like the readability of the current approach (especially because of the test-specific comments in-line), but I'm open to suggestions.
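For reference, the parameterized alternative would look roughly like this: a table of scenario configs driving one shared test body. All names and expected values below are made up for illustration; with Jest, the table could drive test.each.

```typescript
// Sketch of a table-driven layout for the scenario permutations. Each entry
// captures the knobs that vary: latency, outbox-waiting behavior, and the
// expected [title, _version] subscription log. (Illustrative values only.)
type ScenarioConfig = {
	name: string;
	latencyMs: number; // artificial connection latency
	waitForOutboxAfterEachUpdate: boolean;
	expectedSubscriptionLog: [string, number][];
};

const scenarios: ScenarioConfig[] = [
	{
		name: 'fast connection, awaiting outbox after each update',
		latencyMs: 10,
		waitForOutboxAfterEachUpdate: true,
		expectedSubscriptionLog: [
			['post title 0', 1],
			['post title 1', 2],
			['post title 2', 3],
		],
	},
	{
		name: 'slow connection, not awaiting outbox',
		latencyMs: 500,
		waitForOutboxAfterEachUpdate: false,
		expectedSubscriptionLog: [['post title 2', 2]],
	},
	// ...remaining scenarios elided
];
```

The trade-off is exactly the one raised above: the table makes the permutations scannable, but the per-scenario inline comments explaining why each version sequence is expected become harder to place.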

Issue #, if available

Description of how you validated changes

Checklist

  • PR description included
  • yarn test passes
  • Tests are changed or added
  • Relevant documentation is changed or added (and PR referenced)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@david-mcafee david-mcafee changed the title Datastore consistency testing DataStore unit tests for observed rapid single-field mutations with variable connection latencies Apr 27, 2023
@david-mcafee david-mcafee self-assigned this Apr 27, 2023
@david-mcafee david-mcafee added Testing DataStore Related to DataStore category labels Apr 27, 2023
@david-mcafee david-mcafee marked this pull request as ready for review April 27, 2023 23:48
@david-mcafee david-mcafee requested review from a team as code owners April 27, 2023 23:48
svidgen (Contributor) left a comment

It would probably be helpful to create a util method or two for these tests so it's easier to parse out the "narrative" of each test. Maybe a util for the observer-logger bit, and almost undoubtedly a method for the graphqlservice awaiting — though I don't actually understand why we're doing that part!

Either way, these tests are marked as skipped, which leads me to believe this is the desired behavior that we're not meeting yet. That being the case, I want to make sure I understand why we're expecting the particular observe messages we're identifying in each case. Can we walk through these together? (Ideally after factoring out some utils to make each test narrative more "top-level".) Maybe drag @iartemiev and/or @manueliglesias into that discussion.

I think it might also be good to get a branch of this PR that just demonstrates today's behavior -- regardless of whether it's "correct". (And un-mark the tests as "skipped".) Then, when we're settled on what the correct behavior is, we can see the fix and the diffs to the expected observe() messages alongside each other, etc.

packages/datastore/src/sync/outbox.ts (review thread resolved)
packages/datastore/__tests__/connectivityHandling.test.ts (review thread resolved)
Comment on lines 1155 to 1157
// Skipping because currently, DataStore returns `undefined` versions on each update.
// Note: may need to fine tune the assertions once the issue is fixed.
test.skip('rapid mutations on poor connection when initial create is pending', async () => {
Contributor:

nit: Can we throw blank lines between the tests so they don't visually bleed into each other?

david-mcafee (Member Author):

Definitely! I'm surprised I didn't do that here, as I'm a big fan of spacing :)

Comment on lines 995 to 999
expect(subscriptionLog).toEqual([
['post title 0', 1],
['post title 1', 1],
['post title 2', 3],
]);
Contributor:

It's not actually clear to me why these are the only observe() messages we expect to see. Shouldn't we expect to see messages from the initial saves, and then 1 to 2 updates echoed back from each message that actually hits the service?

Maybe we should expand these assertions to explain where each update is expected to come from.

Alternatively ... maybe we shouldn't assert on the individual observer messages at all in these tests. I'm tempted to say what we really want to assert on is final state + the last observe() message. I'm eager to discuss that though, and would be interested to loop @iartemiev and/or @manueliglesias into that discussion.

david-mcafee (Member Author) commented Apr 28, 2023

I wasn't concerned with the subscription message related to the initial creation of the record in these tests, as I was focused on the updates specifically. However, that's a really valid point, and I'll instantiate the subscription prior to creating the record. I will also expand on these assertions for clarity - I think that's also a really great point.

One of the challenges I faced when writing these tests was replicating the problematic behavior, as I wasn't hitting the right code paths in the outbox. For instance, a slight adjustment in latencies / pauses could result in completely different behavior, given the fast execution of the tests. There are a lot of test permutations here because there are several ways this can ultimately fail.

I created assertions on all sub messages because this sheds light on whether my test case is actually testing the outbox code path I expect it to. An argument could be made for only checking the number of requests received by the service, but then I don't know if the subscription message I'm asserting on is the final subscription message. I also want to know whether the sequence of version updates is what I expect it to be. If I only assert on final values, it's possible that the tests aren't exercising outbox merging at all, and then we completely lose the value of these tests.

@david-mcafee david-mcafee requested review from a team and removed request for a team April 28, 2023 20:01
@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies DataStore unit tests for observed rapid single-field mutations with variable connection latencies, including single and multiple concurrent client updates Apr 28, 2023
@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies, including single and multiple concurrent client updates DataStore unit tests for observed rapid single-field mutations with variable connection latencies, including updates from both single and multiple concurrent clients Apr 28, 2023
david-mcafee (Member Author) commented Apr 28, 2023

It would probably be helpful to create a util method or two for these tests so it's easier to parse out the "narrative" of each test. Maybe a util for the observer-logger bit, and almost undoubtedly a method for the graphqlservice awaiting — though I don't actually understand why we're doing that part!

I'm not sure I see that much logic to extract to a util for the DataStore.observe messages, as we are simply creating a subscription, logging each result, and then making final assertions. If you feel strongly about this, perhaps we can sync offline to discuss exactly what you are proposing?
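For what it's worth, the observer-logger util being discussed might be as small as the following. This is a sketch against a generic observable shape, not the real DataStore.observe() types; the names are illustrative.

```typescript
// Collects a projection of each observable message into a log, so a test can
// make final assertions on the full sequence. (Sketch only; the real
// DataStore.observe() emits richer message objects.)
type Subscription = { unsubscribe(): void };
type ObservableLike<T> = { subscribe(next: (value: T) => void): Subscription };

function collectMessages<T, R>(
	source: ObservableLike<T>,
	project: (value: T) => R
): { log: R[]; stop: () => void } {
	const log: R[] = [];
	const sub = source.subscribe(value => log.push(project(value)));
	return { log, stop: () => sub.unsubscribe() };
}
```

In a test this would reduce the observer-logger bit to one line, e.g. collecting `[title, _version]` pairs for the final `toEqual` assertion.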

Re: a util for the fake GraphQL service, I definitely see the benefit of extracting that out. As to why we are checking the service and making assertions on it - I consistently ran into two issues while writing these tests. First, as I began updating the latency values, as well as the update speed, I noticed that the service hadn't always received all the updates before I made my final assertions. In other words, it would eventually break, but not at the time I was making assertions. In this case, the mutations were still in the outbox, or were being delayed by latency. Second, and related, the service may have received an update but still be processing a mutation when I make a final assertion.

In both cases, this ultimately meant that I was asserting that the final state of things was successful, but the service was either 1) not receiving the number of updates I expected, or 2) still processing them. It's important that we assert the number of updates received, because we want to know just how many merges were performed by the outbox. A slight tweak in a pause or latency would result in different outbox behavior, and all of a sudden we are no longer testing the code paths we were hoping to test. By tracking received requests (existing behavior) and updating the service to track running mutations, as I've done in this PR, we can make our final test assertions with much greater confidence.

Either way, these tests are marked as skipped, which leads me to believe this is the desired behavior that we're not meeting yet. That being the case, I want to make sure I understand why we're expecting the particular observe messages we're identifying in each case. Can we walk through these together? (Ideally after factoring out some utils to make each test narrative more "top-level".) Maybe drag @iartemiev and/or @manueliglesias into that discussion.

You're correct that these test the desired behavior - I was torn here, as it seemed a bit odd to commit tests that can't pass against the broken code, but in this case, I think it makes sense since we're going to fix it. As to the "why" of testing DataStore.observe messages - this gives us a clear picture of what messages are being merged by the outbox, as well as the sequential success of all subscription messages (now we know what values are coming through for updated fields, as well as how _version is being incremented). Essentially, I want to know with absolute certainty what is happening from beginning to end, not just the end result. I want to know if the number of messages received by DataStore.observe matches the number of requests being processed by the service. If I don't check this, either something is broken, or a response is in flight but delayed by the fake latency - in which case I may assert against a subscription message even though the final subscription message, which I have not yet received, will revert the value. Happy to discuss offline to explain these further!

I think it might also be good to get a branch of this PR that just demonstrates today's behavior -- regardless of whether it's "correct". (And un-mark the tests as "skipped".) Then, when we're settled on what the correct behavior is, we can see the fix and the diffs to the expected observe() messages alongside each other, etc.

Sounds good to me!

svidgen (Contributor) commented Apr 28, 2023

In this case, the mutations were still in the outbox, or being delayed by latency ... A slight tweak in a pause or latency would result in different outbox behavior, and all of a sudden we are no longer testing the code paths we were hoping to test.

That's all a bit concerning, actually. But, I'll dig on that independently. I'm pretty confident the Hub events that waitForEmptyOutbox() is looking for should only be fired once the mutations are "processed" (Http response). But, maybe the extra steps you're performing are controlling more for subscriber messages too. Does that sound right?

Or, are we somehow getting Hub events that are prematurely signaling that the outbox is empty?

-- Edit --

I believe this is where the mutation processor initiates the signal that becomes the Hub event.

https://github.com/aws-amplify/amplify-js/blob/main/packages/datastore/src/sync/processors/mutation.ts#L291-L296

If I'm reading it right, it should only occur once the message has been sent, succeeded, and the mutation removed from the outbox.

Maybe this isn't a blocker for this PR though. You're performing some additional checks — I guess I'd be curious to see a 1-pointer to spike on this to make sure we don't have unexpected, premature, or rogue outbox completed messages, along with a more precise explanation for why we need to check the graphql service fake.

@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies, including updates from both single and multiple concurrent clients DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions May 2, 2023
@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions, and ability to update latencies May 2, 2023
@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions, and ability to update latencies DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions; add ability to update artificial testing latencies May 2, 2023
@david-mcafee david-mcafee changed the title DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions; add ability to update artificial testing latencies DataStore unit tests for observed rapid single-field mutations with variable connection latencies; add monitoring to fake GraphQL service for in-flight mutations and subscriptions; add helpers for updating artificial testing latencies May 2, 2023
david-mcafee (Member Author):

Maybe this isn't a blocker for this PR though. You're performing some additional checks — I guess I'd be curious to see a 1-pointer to spike on this to make sure we don't have unexpected, premature, or rogue outbox completed messages, along with a more precise explanation for why we need to check the graphql service fake.

I've added a 1-pointer for further investigation. If there is still any confusion with the updated comments throughout, please let me know and I'll add further clarification.

@david-mcafee david-mcafee requested a review from svidgen May 2, 2023 21:11
svidgen (Contributor) left a comment

Looks good!

There are some repeated blocks in the tests that might warrant a test util or something down the road. But, I'm not asking for that now. Just planting the seed!

Thanks David!

@david-mcafee david-mcafee merged commit 3ca1913 into main May 4, 2023
@david-mcafee david-mcafee deleted the datastore-consistency-testing branch May 4, 2023 15:21