Start ReminderService initial load in the background #1520

Merged
1 commit merged into dotnet:master from the reminder-service-start-background branch on Mar 5, 2016

Conversation

@jdom (Member) commented Mar 3, 2016

This way the service will not block the silo from fully initializing until all the reminders are loaded.
It also avoids reading and starting all the reminders if the silo is stopped before the first load finishes.

@jdom (Member, Author) commented Mar 3, 2016

Ping @fsimonazzi

@gabikliot (Contributor) commented:

Can you provide some background on the problem that motivated this change?

@jdom (Member, Author) commented Mar 3, 2016

The goal is to reduce silo startup time when there are A LOT of reminders to load (regardless of what's being discussed in #947 about separating the load into quantums; the assumption here is that all reminders fit in memory, it just takes a very long time to scan the reminders table).
A concrete hosting scenario where this can be an issue:
If you are hosting in an Azure worker role, you currently must start the silo in the Run method, as opposed to the OnStart method, because Azure needs to open the appropriate firewall ports before Orleans can start talking to other instances. If you are performing an upgrade, Azure updates images one update domain at a time, but as soon as the OnStart method returns, Azure calls the Run method, considers the role healthy, and can move on to the next update domain.
If starting Orleans then takes an extra 40 seconds just to retrieve the reminders, there is an increased chance of downtime for the service, since there may be moments when no silo instances are fully started.

Doing the initial load of the reminders in the background, after the silo is considered functional, helps reduce that downtime with no considerable side effects. The only minor side effect is potentially timing out on a call to RegisterOrUpdateReminder, as it waits for the initial load to finish, but that's not very different from getting other errors on that call (such as a write failure).
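To make the mechanism concrete, here is a rough sketch of the idea (illustrative only, not the code from this PR; ReadAndUpdateReminders, startedTask, and RegisterOrUpdateReminder are names that appear in the PR, while everything else, including the Start method shown, is made up):

```csharp
using System;
using System.Threading.Tasks;

class ReminderServiceSketch
{
    // Completed once the first scan of the reminder table finishes.
    private readonly TaskCompletionSource<bool> startedTask = new TaskCompletionSource<bool>();

    public Task Start()
    {
        // Kick off the initial load without awaiting it, so silo startup does not
        // have to wait for the (potentially long) reminder table scan.
        _ = StartInBackground();
        return Task.CompletedTask;
    }

    private async Task StartInBackground()
    {
        await ReadAndUpdateReminders();  // the slow part: scan the table, start the timers
        startedTask.TrySetResult(true);
    }

    public async Task RegisterOrUpdateReminder(string reminderName)
    {
        // Grain calls wait here only until the initial load completes; this is where
        // the "potential timeout" side effect mentioned above comes from.
        await startedTask.Task;
        // ... write the reminder to the table and (re)start its timer ...
    }

    private Task ReadAndUpdateReminders() => Task.Delay(TimeSpan.FromSeconds(5)); // stand-in
}
```

With this shape, the silo finishes starting as soon as Start returns; only reminder operations wait on the initial load.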

@@ -242,6 +238,28 @@ public void RangeChangeNotification(IRingRange old, IRingRange now, bool increas

#region Internal implementation methods

private async Task StartInBackground()
{
    await ReadAndUpdateReminders();
Contributor review comment on the diff above:

If that throws, the rest of the code will not execute.
Just wrapping it in a try/finally might not be good either: you have not really started yet, the reminders were not read into memory for the first time, so if you start accepting register requests it may lead to wrong behaviour.
Instead, I think you need to carefully make sure you can start the timer first and check all the cases.

@gabikliot (Contributor) commented:

OK, in general this sounds good, BUT:

  1. You need to be careful with this change and do it correctly. Previously, the code implicitly assumed that if the first read fails, we fail the silo, so the code was "failing fast". That is no longer the case, but you did not protect against it: if the first read fails, you will not start a timer (and thus never recover), and in general the logic may now be broken.

  2. Indeed, we still do table.Init, and if that fails, we fail the silo. That is good. BUT there are more potential failure cases in which we used to fail silo startup. For example, if the read from the table fails because the format changed or whatnot, the old code clearly indicated to the application that we cannot start the silo. Not any more: now we will just keep retrying, maybe forever, and the silo will be left in a state without a reminder service. This may be the semantics you want, but you need to be aware of this case. The failure semantics change from "fail fast" to "degrade to a partially functional silo without a reminder service".
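To spell out the semantic change these two points describe, here is a rough before/after sketch (illustrative only; apart from table.Init and ReadAndUpdateReminders, which the discussion mentions, the names are made up, and this is not the actual old or new Orleans code):

```csharp
using System.Threading.Tasks;

class StartupSemanticsSketch
{
    // Before (fail fast): silo startup awaited the first read directly, so any
    // exception from it propagated and failed the silo.
    public async Task StartOld()
    {
        await InitTable();               // failure here fails the silo (unchanged by the PR)
        await ReadAndUpdateReminders();  // a failure here also used to fail the silo
    }

    // After (background load): Start returns before the first read happens, so a failure
    // there no longer fails the silo; unless it is handled explicitly, the silo keeps
    // running with a reminder service that never loaded its reminders.
    public async Task StartNew()
    {
        await InitTable();               // failure here still fails the silo
        _ = StartInBackground();         // the first read now happens here, unobserved by Start
    }

    private async Task StartInBackground() => await ReadAndUpdateReminders();

    private Task InitTable() => Task.CompletedTask;               // stand-in for table.Init
    private Task ReadAndUpdateReminders() => Task.CompletedTask;  // stand-in for the table scan
}
```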

@jdom (Member, Author) commented Mar 3, 2016

Completely agree on the error handling for the first read. I actually did the work and rebased, but I guess I didn't force-push and didn't realize the PR wasn't updated. I'll push it tomorrow morning.
Regarding the change in semantics, I was aware of it, and I added the code comment because of it. There is technically the possibility of unrecoverable failures after init, but they are probably so unlikely that I think it should be acceptable to degrade in that case. Do you agree?

@gabikliot (Contributor) commented:

> There is technically the possibility for unrecoverable failures after init, but they are probably so unlikely that I think it should be acceptable to degrade in that case, do you agree?

No, I don't agree. That was exactly my point. Init only checks and creates the table; it does not attempt to read. The first read may fail for many reasons, for example a format change. What is very unlikely is for the 2nd read to fail after the 1st succeeded, but the 1st read failing is not unlikely. So if you go with the semantics change (which I generally support), you may want to indicate that error more clearly, for example with a specific error message for register requests if you decide not to serve them.

The bigger problem: let's say there are no new registers and the first read failed. The reminders will just not tick, silently, without any visible indication, maybe only an error in the log. That is a big semantic change.

@jdom (Member, Author) commented Mar 3, 2016

Thanks @gabikliot for the feedback. I added (an updated version of) the retry logic: it retries a number of times, and after that it stops the service and leaves startedTask faulted, which bubbles up to app code if it tries to use reminders. I also added some timeout logic for grain code trying to update reminders while initialization is taking very long, so that grains are not stuck waiting for a long time.
I also made sure we log warnings and errors while retrying.

I hardcoded some details related to retrying (retry count, delay time, and max wait time for callers). We could add config knobs for them, but I wasn't sure that's what we want, and I also wanted to validate that you agree with the approach first.
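A minimal sketch of the retry behaviour described here (the retry count and delay are made-up placeholders rather than the hardcoded values in the PR, and the class is a stand-in, not the actual Orleans reminder service):

```csharp
using System;
using System.Threading.Tasks;

class InitialLoadRetrySketch
{
    private const int MaxAttempts = 3;                                      // hypothetical value
    private static readonly TimeSpan RetryDelay = TimeSpan.FromSeconds(5);  // hypothetical value

    // Completed when the first load succeeds; faulted when all attempts fail, so the
    // failure bubbles up to grain code that later tries to use reminders.
    private readonly TaskCompletionSource<bool> startedTask = new TaskCompletionSource<bool>();

    private async Task StartInBackground()
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await ReadAndUpdateReminders();
                startedTask.TrySetResult(true);   // service is now fully started
                return;
            }
            catch (Exception) when (attempt < MaxAttempts)
            {
                // log a warning and retry the initial load after a delay
                await Task.Delay(RetryDelay);
            }
            catch (Exception ex)
            {
                // log an error, stop the service, and fault the started task
                startedTask.TrySetException(ex);
                return;
            }
        }
    }

    private Task ReadAndUpdateReminders() => Task.CompletedTask; // stand-in for the table scan
}
```

With this shape, a failed initial load surfaces to grains the next time they touch the reminder API, instead of silently leaving reminders not ticking.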

@@ -400,7 +477,7 @@ private bool TryStopPreviousTimer(GrainReference grainRef, string reminderName)
 private async Task DoResponsibilitySanityCheck(GrainReference grainRef, string debugInfo)
 {
     if (status != ReminderServiceStatus.Started)
-        await startedTask.Task;
+        await startedTask.Task.WithTimeout(InitialLoadMaxWaitTimeForUpdates);
Contributor review comment on the diff above:

So the callers will get this timeout if they call while the service is still trying to init.
It would be better to do what you do now, await startedTask.Task.WithTimeout(InitialLoadMaxWaitTimeForUpdates),
but then catch the timeout and rethrow a "service is still trying to init" error.

Member Author review comment:

But that could potentially leave the grain stuck for 10 minutes. I would prefer to bail out early.

Contributor review comment:

No, it would not.
As I wrote, you still put the WithTimeout (await startedTask.WithTimeout(...)); you just rethrow an explicit exception instead of a TimeoutException.

Member Author review comment:

Oh, I misunderstood. I thought you were suggesting that I wait longer, until no retries are left. Good, I will change the exception.
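For illustration, a sketch of the pattern agreed on in this thread (the WithTimeout helper below is a local stand-in for the extension method used in the PR, and the exception type and message are made up, not the ones actually added):

```csharp
using System;
using System.Threading.Tasks;

static class InitialLoadWaitSketch
{
    // Name taken from the diff above; the value here is made up.
    private static readonly TimeSpan InitialLoadMaxWaitTimeForUpdates = TimeSpan.FromSeconds(30);

    public static async Task WaitForInitialLoad(Task startedTask)
    {
        try
        {
            // Bail out early instead of waiting for all the retries to play out...
            await WithTimeout(startedTask, InitialLoadMaxWaitTimeForUpdates);
        }
        catch (TimeoutException)
        {
            // ...but surface an explicit error rather than a bare timeout.
            throw new InvalidOperationException(
                "The reminder service is still initializing (the initial load of reminders has not completed yet).");
        }
    }

    // Local stand-in for the WithTimeout extension used in the PR.
    private static async Task WithTimeout(Task task, TimeSpan timeout)
    {
        var completed = await Task.WhenAny(task, Task.Delay(timeout));
        if (completed != task)
            throw new TimeoutException();
        await task; // propagate any failure from the awaited task (e.g. a faulted startedTask)
    }
}
```

This keeps the early bail-out the author wanted while giving callers a clearer reason than a bare timeout.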

@jdom (Member, Author) commented Mar 4, 2016

@gabikliot, I addressed all the comments. I kept separate commits so that it is easier for you to review what changed from iteration to iteration. Let me know if it's good for merging and I'll squash and rebase on top of the current master.

@sergeybykov (Contributor) commented:

When this one is ready, we need to merge it and include it in the 1.1.3 release.

@gabikliot (Contributor) commented:

OK, you can squash.

Squashed commit message:
This way the service will not block the silo from fully initializing before all the reminders are loaded.
Avoid reading and starting all the reminders if the Silo is stopped before finishing the first load.
Give an explicit exception when trying to register a reminder and the service is still booting
Retry forever, but fail fast to update reminders after a few retries

@jdom force-pushed the reminder-service-start-background branch from 91993c9 to d1286c3 on March 5, 2016 at 17:59
@jdom (Member, Author) commented Mar 5, 2016

Done, thanks @gabikliot. This was rebased and squashed, and it's ready to merge unless someone else has more feedback.

gabikliot pushed a commit that referenced this pull request Mar 5, 2016
Start ReminderService initial load in the background
@gabikliot merged commit 62c65ca into dotnet:master on Mar 5, 2016
@jdom deleted the reminder-service-start-background branch on March 7, 2016 at 17:55
@giglesias commented:

Thanks for introducing this change in the 1.1.3 release. We were experiencing a delay of nearly 40 seconds during silo startup due to the initial reminders load from Azure table storage. We've upgraded to the latest Orleans bits and the overall silo startup time has decreased considerably. Even with a lot of reminders, it is now in line with what we saw in past tests where we had no reminders at all in Azure table storage.

Just wanted to share our findings.

@sergeybykov (Contributor) commented:

@giglesias Thanks for confirming!

@jdom (Member, Author) commented Mar 15, 2016

Awesome @giglesias, thanks for validating it.
