Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(pubsub): Automated reconnect for IoT and API #10235

Merged

Conversation

stocaaro
Copy link
Contributor

@stocaaro stocaaro commented Aug 19, 2022

Description of changes

Automated reconnect for IoT and API.

This change doesn't alter the interface for either IoT or API, however when the connection is lost or disrupted it will automatically attempt to re-form the connection and resubscribe for new events.

This changes what behavior developers can expect and may conflict with workarounds. Based on this, the change will be merged into the next tag for inclusion in the next major version.

Expected behavior

  • When an existing connection is severed by sleep/backgrounding/airplane mode, it will reconnect 5 seconds after the app/site is resumed.
  • When the network is unavailable past socket expiration and JS can detected network reachability, when the network recovers, then the reconnection will be restarted 5 seconds later.
  • When the network is unavailable past socket expiration and JS cannot detected network reachability, then reconnection will be attempted every 60 seconds until the network recovers and it can succeed.
  • When the connection is started while the network is unavailable, the reconnection behavior will be used to recover when the network becomes available.

Bugs fixed

  • Sending messages through MQTT would initialize and cache the connection, but didn't integrate with the observables that started the connection in ways that caused issues and/or interfered with the reconnection process. Refactored to allow publish to use the connection, but not to create a missing connection.
  • MQTT connection caching retaining failed, errored connections. Refactored to more proactively clear the cache of failed connections.

Issue #, if available

#10124
#9824
#9749
#8513
#4459
#3039
#2692
#2372
#1016

Description of how you validated changes

  • Manually tested and retested in both Safari, Chrome and Firefox for the Mac then also Chrome and Brave on Android
  • Unit tests added
  • Integ testing needed

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@lgtm-com

This comment was marked as resolved.

Copy link
Contributor

@elorzafe elorzafe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More questions than feedback.

We can chat more offline.

Thanks for doing this @stocaaro !

@lgtm-com

This comment was marked as resolved.

@codecov-commenter
Copy link

codecov-commenter commented Aug 31, 2022

Codecov Report

Merging #10235 (9d4259a) into next-major-version/5 (7b64ddd) will increase coverage by 0.04%.
The diff coverage is 96.55%.

@@                   Coverage Diff                    @@
##           next-major-version/5   #10235      +/-   ##
========================================================
+ Coverage                 84.43%   84.47%   +0.04%     
========================================================
  Files                       258      259       +1     
  Lines                     18694    18772      +78     
  Branches                   4019     4050      +31     
========================================================
+ Hits                      15784    15858      +74     
- Misses                     2820     2824       +4     
  Partials                     90       90              
Impacted Files Coverage Δ
...ackages/pubsub/src/Providers/MqttOverWSProvider.ts 92.59% <91.83%> (+0.06%) ⬆️
packages/pubsub/src/utils/ReconnectionMonitor.ts 96.77% <96.77%> (ø)
packages/core/src/Util/Retry.ts 94.59% <100.00%> (ø)
packages/pubsub/__tests__/helpers.ts 89.23% <100.00%> (+0.34%) ⬆️
.../src/Providers/AWSAppSyncRealTimeProvider/index.ts 95.82% <100.00%> (-0.05%) ⬇️
packages/pubsub/src/Providers/constants.ts 96.96% <100.00%> (+0.19%) ⬆️
...ackages/pubsub/src/utils/ConnectionStateMonitor.ts 100.00% <100.00%> (ø)
packages/pubsub/src/PubSub.ts 93.05% <0.00%> (-1.39%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

…ases to allow reconnection across failure modes
@lgtm-com

This comment was marked as resolved.

@lgtm-com

This comment was marked as duplicate.

@lgtm-com
Copy link

lgtm-com bot commented Sep 13, 2022

This pull request fixes 1 alert when merging 86f55ee into 7b64ddd - view on LGTM.com

fixed alerts:

  • 1 for Useless conditional

@stocaaro

This comment was marked as resolved.

@lgtm-com
Copy link

lgtm-com bot commented Sep 14, 2022

This pull request fixes 1 alert when merging 1a605e9 into 7b64ddd - view on LGTM.com

fixed alerts:

  • 1 for Useless conditional

@lgtm-com
Copy link

lgtm-com bot commented Sep 14, 2022

This pull request fixes 1 alert when merging a44e9f3 into 7b64ddd - view on LGTM.com

fixed alerts:

  • 1 for Useless conditional

Copy link
Contributor

@elorzafe elorzafe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review (didnt review the tests) more questions than changes

Comment on lines +98 to +99
// If no initial network state was discovered, stop trying
this._initialNetworkStateSubscription?.unsubscribe();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having trouble understanding this ☕

L78 tries to get the initial state and then after it gets the status is unsubscribe, I think we chatted about this and that it didn't return the status immediately? After looking on ReachabilityMonitor code https://github.com/aws-amplify/amplify-js/blob/main/packages/core/src/Util/Reachability.ts#L21 it seems it does.

I am probably missing something here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From your guidance in the connection state monitoring PR, we don't want to leave reachability running for every provider in perpetuity, but only want to monitor when there are active subscriptions. So now, the monitor is turned on and off, however, this means we don't turn it on at the beginning to get the state, which we otherwise are defaulting to being on.

Now in this code, we need to know that the network is off before the first subscription is formed to get the right reconnection states for cold start. To accomplish this, we are turning it on, reading the current state and then turning it off again to be later restarted when the first subscription begins.

@@ -0,0 +1,64 @@
import { Observer } from 'zen-observable-ts';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having a hard time to take this dependency out from core, is it necessary or can be defined "by hand"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is imported 19 times across Pubsub, Core, DataStore, Storage and ApiGraphQL. I'm sure there are alternatives, but I don't think it is in the scope of this change to remove this dependency.

/**
* Given a reconnect event, start the appropriate behavior
*/
record(event: ReconnectEvent) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this gets called multiple times, can be a race condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a bug with this earlier and this pay enough attention to the retained reconnectSetTimeoutId. Fixing this, but assuming the thread isn't being dropped between the conditional block and assignment, this should be safe. Interested to hear your thoughts.

HALT_RECONNECT = 'HALT_RECONNECT',
}

export class ReconnectionMonitor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would explain a little more why do we need this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Adding a comment above this class to make it clear.

// Trigger reconnection when the connection is disrupted
if (connectionState === ConnectionState.ConnectionDisrupted) {
this.reconnectionMonitor.record(ReconnectEvent.START_RECONNECT);
} else if (connectionState !== ConnectionState.Connecting) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer here to make an exhaustive check just in case we are missing a state in the future, maybe explicitly check the states that goes to halt state for reconnection and if we add another state in the future we could have a compilation error because it was not handle properly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. Updating to have an exhaustive list for halt as requested.

@@ -676,9 +713,6 @@ export class AWSAppSyncRealTimeProvider extends AbstractPubSubProvider {
logger.debug(`WebSocket connection error`);
};
newSocket.onclose = () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was remove because of Reachability monitor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was triggering failed too early and getting in the way of reconnection.

It gets re-thrown on 815, retried then finally rejecting the promise on 676 with ultimately logged in the top level promise catch block I just refactored into its own method. Connection is recorded closed on 373 after I post the update.

Comment on lines 342 to 370
Promise.resolve(
this.connectionStateMonitor.record(CONNECTION_CHANGE.CLOSED)
);

// Notify concurrent unsubscription
if (typeof subscriptionFailedCallback === 'function') {
subscriptionFailedCallback();
if (
this.connectionState !==
ConnectionState.ConnectionDisruptedPendingNetwork
) {
if (isNonRetryableError(err)) {
observer.error({
errors: [
{
...new GraphQLError(
`${CONTROL_MSG.CONNECTION_FAILED}: ${message}`
),
},
],
});
} else {
logger.debug(`${CONTROL_MSG.CONNECTION_FAILED}: ${message}`);
}

const { subscriptionFailedCallback } =
this.subscriptionObserverMap.get(subscriptionId) || {};

// Notify concurrent unsubscription
if (typeof subscriptionFailedCallback === 'function') {
subscriptionFailedCallback();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I would export some of this into a separate function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good call. This bit has gotten extra out of hand.

});

startSubscription();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could have a race condition? with L220?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could, although in the case where it gets run twice subscriptionStartActive causes the second call to abort if it is run before the first call reaches its finally block.

Hypothetically of both calls execute their if (!subscriptionStartActive) { before the assignment on the next line, then they could be running in parallel, however I've read that this isn't how the JS GIL operates, although if this last week has taught me anything its that there is diversity in the ecosystem.

Sufficiently safe or is there a better locking mechanism to prevent any chance of double execution?

@lgtm-com
Copy link

lgtm-com bot commented Sep 22, 2022

This pull request fixes 1 alert when merging 9d4259a into 7b64ddd - view on LGTM.com

fixed alerts:

  • 1 for Useless conditional

@stocaaro
Copy link
Contributor Author

I spent several hours chasing the DataStore with API cold start issue and have confirmation that they aren't compatible. AppSync appears to only honor one subscription per unique query, last subscriber wins. As my demo app adds subscriptions normally, through luck the datastore subscribers are setup first and when I start the API subscriptions they replace the datastore ones leaving everything apparently working.

For the cold start case, Datastore doesn't attempt to start its subscriptions until the network exists while API sets up its subscriptions and internally waits for the network to connect. In this situation when the network returns the API subscriptions are overridden the the datastore subscriptions.

Long term, if deduplicate on queries and send messages to every subscription that is watching a given query, then it won't matter who's subscription identifier the payload was sent to. Until we support multiple subscribers in this, or some other, way, the behavior our tests are observing is expected.

@stocaaro stocaaro requested review from a team as code owners September 23, 2022 13:38
@lgtm-com
Copy link

lgtm-com bot commented Sep 23, 2022

This pull request fixes 1 alert when merging 6bc69dc into 92b672b - view on LGTM.com

fixed alerts:

  • 1 for Useless conditional

@lgtm-com
Copy link

lgtm-com bot commented Sep 23, 2022

This pull request fixes 1 alert when merging e6bae74 into 92b672b - view on LGTM.com

fixed alerts:

  • 1 for Useless conditional

@lgtm-com
Copy link

lgtm-com bot commented Sep 28, 2022

This pull request fixes 1 alert when merging 3c49956 into 92b672b - view on LGTM.com

fixed alerts:

  • 1 for Useless conditional

Copy link
Contributor

@katiegoines katiegoines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking the time to go through my questions with me! LGTM

@stocaaro stocaaro merged commit a677cd8 into aws-amplify:next-major-version/5 Sep 28, 2022
stocaaro added a commit that referenced this pull request Oct 4, 2022
…10421)

<!--
Please make sure to read the Pull Request Guidelines:

https://github.com/aws-amplify/amplify-js/blob/main/CONTRIBUTING.md#pull-requests
-->

#### Description of changes
<!--
Thank you for your Pull Request! Please provide a description above and
review
the requirements below.
-->
Follow up to #10235.
In previous change, AppSyncRealTimeProvider blocks the GraphQL
connection error to the subscribers. This breaks in the integration test
[when subscription is
disabled](https://app.circleci.com/pipelines/github/aws-amplify/amplify-js/14982/workflows/93104c1c-ba16-432a-8d71-19831b335d37/jobs/155531).
This fixes the unintended breaking change.

#### Issue #, if available
<!-- Also, please reference any associated PRs for documentation
updates. -->



#### Description of how you validated changes
Validated the change in the Cypress integ test locally.



#### Checklist
<!-- Remove items that do not apply. For completed items, change [ ] to
[x]. -->

- [x] PR description included
- [x] `yarn test` passes
- [ ] Tests are [changed or
added](https://github.com/aws-amplify/amplify-js/blob/main/CONTRIBUTING.md#steps-towards-contributions)
- [ ] Relevant documentation is changed or added (and PR referenced)

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants