-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(pubsub): Automated reconnect for IoT and API #10235
feat(pubsub): Automated reconnect for IoT and API #10235
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
packages/pubsub/src/Providers/AWSAppSyncRealTimeProvider/index.ts
Outdated
Show resolved
Hide resolved
packages/pubsub/src/Providers/AWSAppSyncRealTimeProvider/index.ts
Outdated
Show resolved
Hide resolved
This comment was marked as resolved.
This comment was marked as resolved.
…tocaaro/amplify-js into aws-amplifyGH-9824/pubsub/reconnect-pr
Codecov Report
@@ Coverage Diff @@
## next-major-version/5 #10235 +/- ##
========================================================
+ Coverage 84.43% 84.47% +0.04%
========================================================
Files 258 259 +1
Lines 18694 18772 +78
Branches 4019 4050 +31
========================================================
+ Hits 15784 15858 +74
- Misses 2820 2824 +4
Partials 90 90
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
…ases to allow reconnection across failure modes
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as duplicate.
This comment was marked as duplicate.
This pull request fixes 1 alert when merging 86f55ee into 7b64ddd - view on LGTM.com fixed alerts:
|
This comment was marked as resolved.
This comment was marked as resolved.
This pull request fixes 1 alert when merging 1a605e9 into 7b64ddd - view on LGTM.com fixed alerts:
|
This pull request fixes 1 alert when merging a44e9f3 into 7b64ddd - view on LGTM.com fixed alerts:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review (didnt review the tests) more questions than changes
// If no initial network state was discovered, stop trying | ||
this._initialNetworkStateSubscription?.unsubscribe(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am having trouble understanding this ☕
L78 tries to get the initial state and then after it gets the status is unsubscribe, I think we chatted about this and that it didn't return the status immediately? After looking on ReachabilityMonitor code https://github.com/aws-amplify/amplify-js/blob/main/packages/core/src/Util/Reachability.ts#L21 it seems it does.
I am probably missing something here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From your guidance in the connection state monitoring PR, we don't want to leave reachability running for every provider in perpetuity, but only want to monitor when there are active subscriptions. So now, the monitor is turned on and off, however, this means we don't turn it on at the beginning to get the state, which we otherwise are defaulting to being on.
Now in this code, we need to know that the network is off before the first subscription is formed to get the right reconnection states for cold start. To accomplish this, we are turning it on, reading the current state and then turning it off again to be later restarted when the first subscription begins.
@@ -0,0 +1,64 @@ | |||
import { Observer } from 'zen-observable-ts'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am having a hard time to take this dependency out from core, is it necessary or can be defined "by hand"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is imported 19 times across Pubsub, Core, DataStore, Storage and ApiGraphQL. I'm sure there are alternatives, but I don't think it is in the scope of this change to remove this dependency.
/** | ||
* Given a reconnect event, start the appropriate behavior | ||
*/ | ||
record(event: ReconnectEvent) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if this gets called multiple times, can be a race condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a bug with this earlier and this pay enough attention to the retained reconnectSetTimeoutId
. Fixing this, but assuming the thread isn't being dropped between the conditional block and assignment, this should be safe. Interested to hear your thoughts.
HALT_RECONNECT = 'HALT_RECONNECT', | ||
} | ||
|
||
export class ReconnectionMonitor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would explain a little more why do we need this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Adding a comment above this class to make it clear.
// Trigger reconnection when the connection is disrupted | ||
if (connectionState === ConnectionState.ConnectionDisrupted) { | ||
this.reconnectionMonitor.record(ReconnectEvent.START_RECONNECT); | ||
} else if (connectionState !== ConnectionState.Connecting) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer here to make an exhaustive check just in case we are missing a state in the future, maybe explicitly check the states that goes to halt state for reconnection and if we add another state in the future we could have a compilation error because it was not handle properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. Updating to have an exhaustive list for halt as requested.
@@ -676,9 +713,6 @@ export class AWSAppSyncRealTimeProvider extends AbstractPubSubProvider { | |||
logger.debug(`WebSocket connection error`); | |||
}; | |||
newSocket.onclose = () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was remove because of Reachability monitor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was triggering failed too early and getting in the way of reconnection.
It gets re-thrown on 815, retried then finally rejecting the promise on 676 with ultimately logged in the top level promise catch block I just refactored into its own method. Connection is recorded closed on 373 after I post the update.
Promise.resolve( | ||
this.connectionStateMonitor.record(CONNECTION_CHANGE.CLOSED) | ||
); | ||
|
||
// Notify concurrent unsubscription | ||
if (typeof subscriptionFailedCallback === 'function') { | ||
subscriptionFailedCallback(); | ||
if ( | ||
this.connectionState !== | ||
ConnectionState.ConnectionDisruptedPendingNetwork | ||
) { | ||
if (isNonRetryableError(err)) { | ||
observer.error({ | ||
errors: [ | ||
{ | ||
...new GraphQLError( | ||
`${CONTROL_MSG.CONNECTION_FAILED}: ${message}` | ||
), | ||
}, | ||
], | ||
}); | ||
} else { | ||
logger.debug(`${CONTROL_MSG.CONNECTION_FAILED}: ${message}`); | ||
} | ||
|
||
const { subscriptionFailedCallback } = | ||
this.subscriptionObserverMap.get(subscriptionId) || {}; | ||
|
||
// Notify concurrent unsubscription | ||
if (typeof subscriptionFailedCallback === 'function') { | ||
subscriptionFailedCallback(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I would export some of this into a separate function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good call. This bit has gotten extra out of hand.
}); | ||
|
||
startSubscription(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could have a race condition? with L220?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could, although in the case where it gets run twice subscriptionStartActive
causes the second call to abort if it is run before the first call reaches its finally
block.
Hypothetically of both calls execute their if (!subscriptionStartActive) {
before the assignment on the next line, then they could be running in parallel, however I've read that this isn't how the JS GIL operates, although if this last week has taught me anything its that there is diversity in the ecosystem.
Sufficiently safe or is there a better locking mechanism to prevent any chance of double execution?
This pull request fixes 1 alert when merging 9d4259a into 7b64ddd - view on LGTM.com fixed alerts:
|
I spent several hours chasing the DataStore with API cold start issue and have confirmation that they aren't compatible. AppSync appears to only honor one subscription per unique query, last subscriber wins. As my demo app adds subscriptions normally, through luck the datastore subscribers are setup first and when I start the API subscriptions they replace the datastore ones leaving everything apparently working. For the cold start case, Datastore doesn't attempt to start its subscriptions until the network exists while API sets up its subscriptions and internally waits for the network to connect. In this situation when the network returns the API subscriptions are overridden the the datastore subscriptions. Long term, if deduplicate on queries and send messages to every subscription that is watching a given query, then it won't matter who's subscription identifier the payload was sent to. Until we support multiple subscribers in this, or some other, way, the behavior our tests are observing is expected. |
This pull request fixes 1 alert when merging 6bc69dc into 92b672b - view on LGTM.com fixed alerts:
|
This pull request fixes 1 alert when merging e6bae74 into 92b672b - view on LGTM.com fixed alerts:
|
This pull request fixes 1 alert when merging 3c49956 into 92b672b - view on LGTM.com fixed alerts:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for taking the time to go through my questions with me! LGTM
…10421) <!-- Please make sure to read the Pull Request Guidelines: https://github.com/aws-amplify/amplify-js/blob/main/CONTRIBUTING.md#pull-requests --> #### Description of changes <!-- Thank you for your Pull Request! Please provide a description above and review the requirements below. --> Follow up to #10235. In previous change, AppSyncRealTimeProvider blocks the GraphQL connection error to the subscribers. This breaks in the integration test [when subscription is disabled](https://app.circleci.com/pipelines/github/aws-amplify/amplify-js/14982/workflows/93104c1c-ba16-432a-8d71-19831b335d37/jobs/155531). This fixes the unintended breaking change. #### Issue #, if available <!-- Also, please reference any associated PRs for documentation updates. --> #### Description of how you validated changes Validated the change in the Cypress integ test locally. #### Checklist <!-- Remove items that do not apply. For completed items, change [ ] to [x]. --> - [x] PR description included - [x] `yarn test` passes - [ ] Tests are [changed or added](https://github.com/aws-amplify/amplify-js/blob/main/CONTRIBUTING.md#steps-towards-contributions) - [ ] Relevant documentation is changed or added (and PR referenced) By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Description of changes
Automated reconnect for IoT and API.
This change doesn't alter the interface for either IoT or API, however when the connection is lost or disrupted it will automatically attempt to re-form the connection and resubscribe for new events.
This changes what behavior developers can expect and may conflict with workarounds. Based on this, the change will be merged into the
next
tag for inclusion in the next major version.Expected behavior
Bugs fixed
Issue #, if available
#10124
#9824
#9749
#8513
#4459
#3039
#2692
#2372
#1016
Description of how you validated changes
Checklist
yarn test
passesBy submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.