feat(pubsub): Automated reconnect for IoT and API #10235

stocaaro · 2022-08-19T19:36:19Z

Description of changes

Automated reconnect for IoT and API.

This change doesn't alter the interface for either IoT or API. However, when the connection is lost or disrupted, it will automatically attempt to re-establish the connection and resubscribe for new events.

This changes what behavior developers can expect and may conflict with workarounds. Based on this, the change will be merged into the next tag for inclusion in the next major version.

Expected behavior

When an existing connection is severed by sleep/backgrounding/airplane mode, it will reconnect 5 seconds after the app/site is resumed.
When the network is unavailable past socket expiration and JS can detect network reachability, reconnection will restart 5 seconds after the network recovers.
When the network is unavailable past socket expiration and JS cannot detect network reachability, reconnection will be attempted every 60 seconds until the network recovers and it can succeed.
When the connection is started while the network is unavailable, the reconnection behavior will be used to recover when the network becomes available.
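
The timing rules above can be sketched roughly as follows. This is a hypothetical illustration rather than the code in this PR: `planReconnect`, `ReconnectPlan`, `RESUME_DELAY_MS` and `RETRY_INTERVAL_MS` are made-up names, and only the 5 second and 60 second values come from the description.

```typescript
// Hypothetical constants; only the 5 s and 60 s values come from the PR description.
const RESUME_DELAY_MS = 5000;
const RETRY_INTERVAL_MS = 60000;

type ReconnectPlan =
  | { kind: 'once'; delayMs: number }
  | { kind: 'repeat'; intervalMs: number };

// Pick a plan based on whether the runtime can report network reachability.
function planReconnect(canDetectReachability: boolean): ReconnectPlan {
  return canDetectReachability
    ? // Reachability is known: reconnect once, shortly after the network or app returns.
      { kind: 'once', delayMs: RESUME_DELAY_MS }
    : // Reachability is unknown: keep retrying on a fixed interval until it succeeds.
      { kind: 'repeat', intervalMs: RETRY_INTERVAL_MS };
}
```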

Bugs fixed

Sending messages through MQTT would initialize and cache the connection without integrating with the observables that started the connection, which caused issues and/or interfered with the reconnection process. Refactored so that publish uses an existing connection but does not create a missing one.
MQTT connection caching retained failed, errored connections. Refactored to clear failed connections from the cache more proactively.
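
A minimal sketch of the two fixes, under stated assumptions: `ClientLike`, `ClientCache` and `publish` are hypothetical stand-ins, not the provider's real types. The description does not say what happens to a message when no connection exists, so this sketch simply reports `false`.

```typescript
// Hypothetical client shape; the real MQTT client used by the provider differs.
interface ClientLike {
  connected: boolean;
  send(topic: string, payload: string): void;
}

class ClientCache {
  private clients = new Map<string, ClientLike>();

  set(clientId: string, client: ClientLike): void {
    this.clients.set(clientId, client);
  }

  // Return a usable client, evicting any cached entry that is no longer connected.
  get(clientId: string): ClientLike | undefined {
    const client = this.clients.get(clientId);
    if (client && !client.connected) {
      this.clients.delete(clientId);
      return undefined;
    }
    return client;
  }
}

// Publish reuses an existing connection but never creates a missing one.
function publish(
  cache: ClientCache,
  clientId: string,
  topic: string,
  payload: string
): boolean {
  const client = cache.get(clientId);
  if (!client) return false; // no live connection: nothing is sent
  client.send(topic, payload);
  return true;
}
```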

Issue #, if available

#10124
#9824
#9749
#8513
#4459
#3039
#2692
#2372
#1016

Description of how you validated changes

Manually tested and retested in Safari, Chrome, and Firefox on Mac, and in Chrome and Brave on Android
Unit tests added
Integ testing needed

Checklist

PR description included
yarn test passes
Tests are changed or added

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

elorzafe

Codecov Report

Merging #10235 (9d4259a) into next-major-version/5 (7b64ddd) will increase coverage by 0.04%.
The diff coverage is 96.55%.

@@                   Coverage Diff                    @@
##           next-major-version/5   #10235      +/-   ##
========================================================
+ Coverage                 84.43%   84.47%   +0.04%     
========================================================
  Files                       258      259       +1     
  Lines                     18694    18772      +78     
  Branches                   4019     4050      +31     
========================================================
+ Hits                      15784    15858      +74     
- Misses                     2820     2824       +4     
  Partials                     90       90

Impacted Files	Coverage Δ
...ackages/pubsub/src/Providers/MqttOverWSProvider.ts	`92.59% <91.83%> (+0.06%)`	⬆️
packages/pubsub/src/utils/ReconnectionMonitor.ts	`96.77% <96.77%> (ø)`
packages/core/src/Util/Retry.ts	`94.59% <100.00%> (ø)`
packages/pubsub/__tests__/helpers.ts	`89.23% <100.00%> (+0.34%)`	⬆️
.../src/Providers/AWSAppSyncRealTimeProvider/index.ts	`95.82% <100.00%> (-0.05%)`	⬇️
packages/pubsub/src/Providers/constants.ts	`96.96% <100.00%> (+0.19%)`	⬆️
...ackages/pubsub/src/utils/ConnectionStateMonitor.ts	`100.00% <100.00%> (ø)`
packages/pubsub/src/PubSub.ts	`93.05% <0.00%> (-1.39%)`	⬇️

packages/pubsub/src/utils/ReconnectionMonitor.ts

lgtm-com · 2022-09-13T15:50:06Z

This pull request fixes 1 alert when merging 86f55ee into 7b64ddd - view on LGTM.com

fixed alerts:

1 for Useless conditional

packages/pubsub/src/utils/ReconnectionMonitor.ts

packages/pubsub/src/Providers/MqttOverWSProvider.ts

packages/pubsub/src/Providers/AWSAppSyncRealTimeProvider/index.ts

lgtm-com · 2022-09-14T03:03:24Z

This pull request fixes 1 alert when merging 1a605e9 into 7b64ddd - view on LGTM.com

fixed alerts:

1 for Useless conditional

lgtm-com · 2022-09-14T19:43:09Z

This pull request fixes 1 alert when merging a44e9f3 into 7b64ddd - view on LGTM.com

fixed alerts:

1 for Useless conditional

elorzafe

Partial review (didn't review the tests); more questions than changes.

elorzafe · 2022-09-16T17:13:33Z

packages/pubsub/src/utils/ConnectionStateMonitor.ts

+		// If no initial network state was discovered, stop trying
+		this._initialNetworkStateSubscription?.unsubscribe();


I am having trouble understanding this ☕

L78 tries to get the initial state and then unsubscribes after it gets the status. I think we chatted about this and that it didn't return the status immediately? After looking at the ReachabilityMonitor code https://github.com/aws-amplify/amplify-js/blob/main/packages/core/src/Util/Reachability.ts#L21 it seems it does.

I am probably missing something here

From your guidance in the connection state monitoring PR, we don't want to leave reachability running for every provider in perpetuity, but only want to monitor when there are active subscriptions. So now, the monitor is turned on and off, however, this means we don't turn it on at the beginning to get the state, which we otherwise are defaulting to being on.

Now in this code, we need to know that the network is off before the first subscription is formed to get the right reconnection states for cold start. To accomplish this, we are turning it on, reading the current state and then turning it off again to be later restarted when the first subscription begins.
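
The "turn it on, read the current state, turn it off again" step could look roughly like this. It is a sketch, not the monitor's actual code: `FakeReachability` and `readInitialNetworkState` are hypothetical names, and the fake source is assumed to report its current status to each new listener.

```typescript
type Listener = (online: boolean) => void;

// Hypothetical reachability source: reports the current status to each new listener.
class FakeReachability {
  readonly listeners = new Set<Listener>();
  private online: boolean;

  constructor(online: boolean) {
    this.online = online;
  }

  subscribe(listener: Listener): { unsubscribe(): void } {
    this.listeners.add(listener);
    listener(this.online); // emit the current status immediately
    return {
      unsubscribe: () => {
        this.listeners.delete(listener);
      },
    };
  }
}

// Subscribe, capture the first status, then stop monitoring again.
// Monitoring would be restarted later when the first subscription begins.
function readInitialNetworkState(source: FakeReachability): boolean | undefined {
  let initial: boolean | undefined;
  const subscription = source.subscribe(online => {
    if (initial === undefined) initial = online;
  });
  // If no initial state was discovered, this also stops trying.
  subscription.unsubscribe();
  return initial;
}
```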

elorzafe · 2022-09-16T17:15:07Z

packages/pubsub/src/utils/ReconnectionMonitor.ts

@@ -0,0 +1,64 @@
+import { Observer } from 'zen-observable-ts';


I am having a hard time taking this dependency out of core. Is it necessary, or can it be defined "by hand"?

This is imported 19 times across Pubsub, Core, DataStore, Storage and ApiGraphQL. I'm sure there are alternatives, but I don't think it is in the scope of this change to remove this dependency.

elorzafe · 2022-09-16T17:17:35Z

packages/pubsub/src/utils/ReconnectionMonitor.ts

+	/**
+	 * Given a reconnect event, start the appropriate behavior
+	 */
+	record(event: ReconnectEvent) {


What happens if this gets called multiple times? Could there be a race condition?

I had a bug with this earlier: it didn't pay enough attention to the retained reconnectSetTimeoutId. I'm fixing this, but assuming the thread isn't dropped between the conditional block and the assignment, this should be safe. Interested to hear your thoughts.
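
For illustration, a simplified guard around a retained timeout id might look like the following. This is a hypothetical sketch: `ReconnectTimerSketch` is not the `ReconnectionMonitor` from this PR, and a string union stands in for the `ReconnectEvent` enum.

```typescript
// A string union stands in for the ReconnectEvent enum in this sketch.
type ReconnectEventName = 'START_RECONNECT' | 'HALT_RECONNECT';

class ReconnectTimerSketch {
  private timerId: ReturnType<typeof setTimeout> | undefined;
  private readonly onReconnect: () => void;
  private readonly delayMs: number;

  constructor(onReconnect: () => void, delayMs: number) {
    this.onReconnect = onReconnect;
    this.delayMs = delayMs;
  }

  record(event: ReconnectEventName): void {
    if (event === 'START_RECONNECT') {
      // Repeated starts are ignored while a timer is already pending.
      // The check and the assignment run in the same synchronous turn.
      if (this.timerId === undefined) {
        this.timerId = setTimeout(() => {
          this.timerId = undefined;
          this.onReconnect();
        }, this.delayMs);
      }
    } else if (this.timerId !== undefined) {
      // HALT_RECONNECT: cancel the pending timer and forget its id.
      clearTimeout(this.timerId);
      this.timerId = undefined;
    }
  }

  get pending(): boolean {
    return this.timerId !== undefined;
  }
}
```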

elorzafe · 2022-09-16T17:20:22Z

packages/pubsub/src/utils/ReconnectionMonitor.ts

+	HALT_RECONNECT = 'HALT_RECONNECT',
+}
+
+export class ReconnectionMonitor {


I would explain a little more about why we need this.

Sure. Adding a comment above this class to make it clear.

elorzafe · 2022-09-16T17:35:15Z

packages/pubsub/src/Providers/AWSAppSyncRealTimeProvider/index.ts

+					// Trigger reconnection when the connection is disrupted
+					if (connectionState === ConnectionState.ConnectionDisrupted) {
+						this.reconnectionMonitor.record(ReconnectEvent.START_RECONNECT);
+					} else if (connectionState !== ConnectionState.Connecting) {


I would prefer to make an exhaustive check here, in case we miss a state in the future. Maybe explicitly check the states that lead to the halt state for reconnection; then, if we add another state later, we would get a compilation error because it was not handled properly.

Maybe. Updating to have an exhaustive list for halt as requested.
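
One common way to get that compile-time error is a `never` check in the default branch. The sketch below is an assumption-laden illustration: it uses a string union instead of the `ConnectionState` enum, only `Connecting`, `ConnectionDisrupted` and `ConnectionDisruptedPendingNetwork` are quoted in this thread, and the remaining members and the `reconnectActionFor` helper are placeholders.

```typescript
// A string union stands in for ConnectionState; 'Connected' and 'Disconnected'
// are included here only as illustrative extra members.
type ConnectionStateName =
  | 'Connecting'
  | 'ConnectionDisrupted'
  | 'ConnectionDisruptedPendingNetwork'
  | 'Connected'
  | 'Disconnected';

type ReconnectAction = 'START_RECONNECT' | 'HALT_RECONNECT' | 'NONE';

function reconnectActionFor(state: ConnectionStateName): ReconnectAction {
  switch (state) {
    case 'ConnectionDisrupted':
      return 'START_RECONNECT';
    case 'Connecting':
      return 'NONE';
    // Explicit list of the states that halt reconnection.
    case 'ConnectionDisruptedPendingNetwork':
    case 'Connected':
    case 'Disconnected':
      return 'HALT_RECONNECT';
    default: {
      // If a member is added to the union and not handled above,
      // this assignment no longer type-checks.
      const unhandled: never = state;
      throw new Error(`Unhandled connection state: ${String(unhandled)}`);
    }
  }
}
```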

elorzafe · 2022-09-16T17:43:00Z

packages/pubsub/src/Providers/AWSAppSyncRealTimeProvider/index.ts

@@ -676,9 +713,6 @@ export class AWSAppSyncRealTimeProvider extends AbstractPubSubProvider {
 						logger.debug(`WebSocket connection error`);
 					};
 					newSocket.onclose = () => {


Was this removed because of the Reachability monitor?

This was triggering failed too early and getting in the way of reconnection.

It gets re-thrown on 815, retried, and then finally rejects the promise on 676, which is ultimately logged in the top-level promise catch block that I just refactored into its own method. The connection is recorded as closed on 373 after I post the update.

elorzafe · 2022-09-16T17:44:26Z

packages/pubsub/src/Providers/AWSAppSyncRealTimeProvider/index.ts

+			Promise.resolve(
+				this.connectionStateMonitor.record(CONNECTION_CHANGE.CLOSED)
+			);

-			// Notify concurrent unsubscription
-			if (typeof subscriptionFailedCallback === 'function') {
-				subscriptionFailedCallback();
+			if (
+				this.connectionState !==
+				ConnectionState.ConnectionDisruptedPendingNetwork
+			) {
+				if (isNonRetryableError(err)) {
+					observer.error({
+						errors: [
+							{
+								...new GraphQLError(
+									`${CONTROL_MSG.CONNECTION_FAILED}: ${message}`
+								),
+							},
+						],
+					});
+				} else {
+					logger.debug(`${CONTROL_MSG.CONNECTION_FAILED}: ${message}`);
+				}
+
+				const { subscriptionFailedCallback } =
+					this.subscriptionObserverMap.get(subscriptionId) || {};
+
+				// Notify concurrent unsubscription
+				if (typeof subscriptionFailedCallback === 'function') {
+					subscriptionFailedCallback();
+				}


Maybe I would extract some of this into a separate function.

Yeah good call. This bit has gotten extra out of hand.

elorzafe · 2022-09-16T17:45:16Z

packages/pubsub/src/Providers/AWSAppSyncRealTimeProvider/index.ts

 				});

+				startSubscription();


Could this have a race condition with L220?

You could, although in the case where it gets run twice, subscriptionStartActive causes the second call to abort if it is run before the first call reaches its finally block.

Hypothetically, if both calls execute their if (!subscriptionStartActive) { before the assignment on the next line, then they could be running in parallel. However, I've read that this isn't how the JS GIL operates, although if this last week has taught me anything, it's that there is diversity in the ecosystem.

Sufficiently safe or is there a better locking mechanism to prevent any chance of double execution?
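
For reference, a flag guard of this shape can be sketched as below. Within a single JS realm, synchronous code runs to completion, so nothing else can execute between the check and the assignment; interleaving is only possible at an `await`. The name `startSubscriptionOnce` and the `work` parameter are hypothetical, and only `subscriptionStartActive` is taken from the discussion.

```typescript
let subscriptionStartActive = false;

// Run `work` unless a previous call is still in flight.
async function startSubscriptionOnce(
  work: () => Promise<void>
): Promise<boolean> {
  // Check and set happen in one synchronous turn, before any await.
  if (subscriptionStartActive) return false;
  subscriptionStartActive = true;
  try {
    await work();
    return true;
  } finally {
    // Release the guard whether `work` resolved or threw.
    subscriptionStartActive = false;
  }
}
```

A second call made before the first reaches its finally block returns `false` without running `work`.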

lgtm-com · 2022-09-22T03:52:47Z

This pull request fixes 1 alert when merging 9d4259a into 7b64ddd - view on LGTM.com

fixed alerts:

1 for Useless conditional

stocaaro · 2022-09-22T03:55:56Z

I spent several hours chasing the DataStore with API cold start issue and have confirmation that they aren't compatible. AppSync appears to only honor one subscription per unique query, last subscriber wins. As my demo app adds subscriptions normally, through luck the DataStore subscribers are set up first, and when I start the API subscriptions they replace the DataStore ones, leaving everything apparently working.

For the cold start case, DataStore doesn't attempt to start its subscriptions until the network exists, while API sets up its subscriptions and internally waits for the network to connect. In this situation, when the network returns, the API subscriptions are overridden by the DataStore subscriptions.

Long term, if we deduplicate on queries and send messages to every subscription that is watching a given query, then it won't matter whose subscription identifier the payload was sent to. Until we support multiple subscribers in this, or some other, way, the behavior our tests are observing is expected.
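
The long-term idea could be sketched like this. It is purely illustrative and not part of this PR: `QueryFanOut` and its methods are hypothetical, and keying on the raw query string is an assumption about what "unique query" means.

```typescript
type Handler = (payload: string) => void;

// Keep one entry per unique query and fan payloads out to all local subscribers.
class QueryFanOut {
  private handlersByQuery = new Map<string, Set<Handler>>();

  // Returns true for the first subscriber of a query, i.e. when a single
  // server-side subscription would need to be opened.
  subscribe(query: string, handler: Handler): boolean {
    let handlers = this.handlersByQuery.get(query);
    const isFirst = handlers === undefined;
    if (!handlers) {
      handlers = new Set<Handler>();
      this.handlersByQuery.set(query, handlers);
    }
    handlers.add(handler);
    return isFirst;
  }

  // Deliver a payload to every subscriber watching the query,
  // regardless of which subscription identifier it arrived on.
  dispatch(query: string, payload: string): number {
    const handlers = this.handlersByQuery.get(query);
    if (!handlers) return 0;
    handlers.forEach(handler => handler(payload));
    return handlers.size;
  }
}
```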

lgtm-com · 2022-09-23T13:45:28Z

This pull request fixes 1 alert when merging 6bc69dc into 92b672b - view on LGTM.com

fixed alerts:

1 for Useless conditional

lgtm-com · 2022-09-23T14:05:18Z

This pull request fixes 1 alert when merging e6bae74 into 92b672b - view on LGTM.com

fixed alerts:

1 for Useless conditional

lgtm-com · 2022-09-28T19:07:08Z

This pull request fixes 1 alert when merging 3c49956 into 92b672b - view on LGTM.com

fixed alerts:

1 for Useless conditional

katiegoines

Thank you for taking the time to go through my questions with me! LGTM

…10421)

#### Description of changes

Follow up to #10235. In previous change, AppSyncRealTimeProvider blocks the GraphQL connection error to the subscribers. This breaks in the integration test [when subscription is disabled](https://app.circleci.com/pipelines/github/aws-amplify/amplify-js/14982/workflows/93104c1c-ba16-432a-8d71-19831b335d37/jobs/155531). This fixes the unintended breaking change.

#### Issue #, if available

#### Description of how you validated changes

Validated the change in the Cypress integ test locally.

#### Checklist

- [x] PR description included
- [x] `yarn test` passes
- [ ] Tests are [changed or added](https://github.com/aws-amplify/amplify-js/blob/main/CONTRIBUTING.md#steps-towards-contributions)
- [ ] Relevant documentation is changed or added (and PR referenced)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

stocaaro added 3 commits August 19, 2022 09:26

chore: Rebase onto latest change

5a7eb71

fix: Cleaning up PR

8df7264

fix: More cleaning up PR

908254e

stocaaro requested review from elorzafe and ashika01 as code owners August 19, 2022 19:36

elorzafe reviewed Aug 19, 2022

This was referenced Aug 22, 2022

How to catch PubSub errors if MQTT connection lost or broken? #2692

Closed

PubSub Javascript: How to detect websocket closed and reconnect? #4459

Closed

stocaaro added 6 commits August 31, 2022 10:58

feat: Add reconnection delay. Factor out reconnect event management from providers

dcbbfbc

feat: When loaded offline, the reconnect logic triggers once the network is available

56518e0

Fix comment accuracy

a920333

Comment for clarity

fe6d33d

Rolling back unnecessary string casting

e52c180

Moving the initial startSubscription to after the reconnection setup

85e0725

stocaaro and others added 3 commits August 31, 2022 17:08

Merge branch 'next-major-version/5' into aws-amplifyGH-9824/pubsub/reconnect-pr

9dfe416

Merge branch 'aws-amplifyGH-9824/pubsub/reconnect-pr' of github.com:stocaaro/amplify-js into aws-amplifyGH-9824/pubsub/reconnect-pr

fe7cd5e

Fix LGTM issues

fe2ead6

stocaaro requested a review from elorzafe September 1, 2022 14:17

This was referenced Sep 2, 2022

PubSub reconnect #3039

Closed

Unable to reconnect AppSync after connection interruption (web) or app going into background (ios/ android) #9749

Closed

AppSync Subscription connection closing and unable to reestablish #10124

Closed

feat: Add retry interval around reconnect and remove observer error cases to allow reconnection across failure modes

f24b276

stocaaro added 2 commits September 9, 2022 14:28

fix: Stop reconnecting unless in Connecting or ConnectionDisrupted

8953703

fix: Remove unused variable

4db1fd7

fix: Missed variable rename

e41f1e3