Start ReminderService initial load in the background #1520

Merged
1 commit merged into dotnet:master from the reminder-service-start-background branch on Mar 5, 2016

Conversation

@jdom (Member) commented Mar 3, 2016

This way the service will not block the silo from fully initializing until all the reminders are loaded.
It also avoids reading and starting all the reminders if the silo is stopped before the first load finishes.

@jdom (Member, Author) commented Mar 3, 2016

Ping @fsimonazzi

@gabikliot (Contributor) commented:

Can you provide some background on the problem that motivated this change?

@jdom (Member, Author) commented Mar 3, 2016

The goal is to reduce silo startup time when there are A LOT of reminders to load (regardless of what's being discussed in #947 about separating the load into quantums; the assumption here is that all reminders fit in memory, it just takes a very long time to scan the reminders table).
A concrete hosting scenario where this can be an issue:
If you are hosting in an Azure worker role, you currently must start the silo in the Run method, as opposed to the OnStart method, because Azure needs to open the appropriate firewall ports before Orleans can start talking to other instances. If you are performing an upgrade, Azure updates images one update domain at a time, but as soon as the OnStart method returns, Azure calls the Run method, considers the role healthy, and can move on to the next update domain.
If starting Orleans then takes an extra 40 seconds just to retrieve the reminders, there is an increased chance of downtime for the service, since there may be moments when no silo instances are fully started.

Doing the initial load of the reminders in the background, after the silo is considered functional, helps reduce that downtime with no considerable side effects. The only minor side effect is potentially timing out on a call to RegisterOrUpdateReminder, as it waits for the initial load to finish, but that's not very different from getting other errors on that call (such as a write failure).
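To make the mechanism concrete, here is a rough sketch of the idea (illustrative only, not the code from this PR; ReadAndUpdateReminders, startedTask, and RegisterOrUpdateReminder are names that appear in the PR, while everything else, including the Start method shown, is made up):

```csharp
using System;
using System.Threading.Tasks;

class ReminderServiceSketch
{
    // Completed once the first scan of the reminder table finishes.
    private readonly TaskCompletionSource<bool> startedTask = new TaskCompletionSource<bool>();

    public Task Start()
    {
        // Kick off the initial load without awaiting it, so silo startup does not
        // have to wait for the (potentially long) reminder table scan.
        _ = StartInBackground();
        return Task.CompletedTask;
    }

    private async Task StartInBackground()
    {
        await ReadAndUpdateReminders();  // the slow part: scan the table, start the timers
        startedTask.TrySetResult(true);
    }

    public async Task RegisterOrUpdateReminder(string reminderName)
    {
        // Grain calls wait here only until the initial load completes; this is where
        // the "potential timeout" side effect mentioned above comes from.
        await startedTask.Task;
        // ... write the reminder to the table and (re)start its timer ...
    }

    private Task ReadAndUpdateReminders() => Task.Delay(TimeSpan.FromSeconds(5)); // stand-in
}
```

With this shape, the silo finishes starting as soon as Start returns; only reminder operations wait on the initial load.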

@@ -242,6 +238,28 @@ public void RangeChangeNotification(IRingRange old, IRingRange now, bool increas

#region Internal implementation methods

private async Task StartInBackground()
{
    await ReadAndUpdateReminders();
Contributor review comment on the diff above:

If that throws, the rest of the code will not execute.
Just wrapping it in a try/finally might not be good either: you have not really started yet, the reminders were not read into memory for the first time, so if you start accepting register requests it may lead to wrong behaviour.
Instead, I think you need to carefully make sure you can start the timer first and check all the cases.

@gabikliot (Contributor) commented:

OK, in general this sounds good, BUT:

  1. You need to be careful with this change and do it correctly. Previously, the code implicitly assumed that if the first read fails, we fail the silo, so the code was "failing fast". That is no longer the case, but you did not protect against it: if the first read fails, you will not start a timer (and thus never recover), and in general the logic may now be broken.

  2. Indeed, we still do table.Init, and if that fails, we fail the silo. That is good. BUT there are more potential failure cases in which we used to fail silo startup. For example, if the read from the table fails because the format changed or whatnot, the old code clearly indicated to the application that we cannot start the silo. Not any more: now we will just keep retrying, maybe forever, and the silo will be left in a state without a reminder service. This may be the semantics you want, but you need to be aware of this case. The failure semantics change from "fail fast" to "degrade to a partially functional silo without a reminder service".
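To spell out the semantic change these two points describe, here is a rough before/after sketch (illustrative only; apart from table.Init and ReadAndUpdateReminders, which the discussion mentions, the names are made up, and this is not the actual old or new Orleans code):

```csharp
using System.Threading.Tasks;

class StartupSemanticsSketch
{
    // Before (fail fast): silo startup awaited the first read directly, so any
    // exception from it propagated and failed the silo.
    public async Task StartOld()
    {
        await InitTable();               // failure here fails the silo (unchanged by the PR)
        await ReadAndUpdateReminders();  // a failure here also used to fail the silo
    }

    // After (background load): Start returns before the first read happens, so a failure
    // there no longer fails the silo; unless it is handled explicitly, the silo keeps
    // running with a reminder service that never loaded its reminders.
    public async Task StartNew()
    {
        await InitTable();               // failure here still fails the silo
        _ = StartInBackground();         // the first read now happens here, unobserved by Start
    }

    private async Task StartInBackground() => await ReadAndUpdateReminders();

    private Task InitTable() => Task.CompletedTask;               // stand-in for table.Init
    private Task ReadAndUpdateReminders() => Task.CompletedTask;  // stand-in for the table scan
}
```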

@jdom (Member, Author) commented Mar 3, 2016

Completely agree on the error handling for the first read. I actually did the work and rebased, but I guess I didn't force-push and didn't realize the PR wasn't updated. I'll push it tomorrow morning.
Regarding the change in semantics, I was aware of it, and I added the code comment because of it. There is technically the possibility of unrecoverable failures after init, but they are probably so unlikely that I think it should be acceptable to degrade in that case. Do you agree?

@gabikliot (Contributor) commented:

> There is technically the possibility for unrecoverable failures after init, but they are probably so unlikely that I think it should be acceptable to degrade in that case, do you agree?

No, I don't agree. That was exactly my point. Init only checks and creates the table; it does not attempt to read. The first read may fail for many reasons, for example a format change. What is very unlikely is for the 2nd read to fail after the 1st succeeded, but the 1st read failing is not unlikely. So if you go with the semantics change (which I generally support), you may want to indicate that error more clearly, for example with a specific error message for register requests if you decide not to serve them.

The bigger problem: let's say there are no new registers and the first read failed. The reminders will just not tick, silently, without any visible indication, maybe only an error in the log. That is a big semantic change.

@jdom (Member, Author) commented Mar 3, 2016

Thanks @gabikliot for the feedback. I added (an updated version of) the retry logic: it retries a number of times, and after that it stops the service and leaves startedTask faulted, which bubbles up to app code if it tries to use reminders. I also added some timeout logic for grain code trying to update reminders while initialization is taking very long, so that grains are not stuck waiting for a long time.
I also made sure we log warnings and errors while retrying.

I hardcoded some details related to retrying (retry count, delay time, and max wait time for callers). We could add config knobs for them, but I wasn't sure that's what we want, and I also wanted to validate that you agree with the approach first.
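A minimal sketch of the retry behaviour described here (the retry count and delay are made-up placeholders rather than the hardcoded values in the PR, and the class is a stand-in, not the actual Orleans reminder service):

```csharp
using System;
using System.Threading.Tasks;

class InitialLoadRetrySketch
{
    private const int MaxAttempts = 3;                                      // hypothetical value
    private static readonly TimeSpan RetryDelay = TimeSpan.FromSeconds(5);  // hypothetical value

    // Completed when the first load succeeds; faulted when all attempts fail, so the
    // failure bubbles up to grain code that later tries to use reminders.
    private readonly TaskCompletionSource<bool> startedTask = new TaskCompletionSource<bool>();

    private async Task StartInBackground()
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await ReadAndUpdateReminders();
                startedTask.TrySetResult(true);   // service is now fully started
                return;
            }
            catch (Exception) when (attempt < MaxAttempts)
            {
                // log a warning and retry the initial load after a delay
                await Task.Delay(RetryDelay);
            }
            catch (Exception ex)
            {
                // log an error, stop the service, and fault the started task
                startedTask.TrySetException(ex);
                return;
            }
        }
    }

    private Task ReadAndUpdateReminders() => Task.CompletedTask; // stand-in for the table scan
}
```

With this shape, a failed initial load surfaces to grains the next time they touch the reminder API, instead of silently leaving reminders not ticking.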

@@ -400,7 +477,7 @@ private bool TryStopPreviousTimer(GrainReference grainRef, string reminderName)
 private async Task DoResponsibilitySanityCheck(GrainReference grainRef, string debugInfo)
 {
     if (status != ReminderServiceStatus.Started)
-        await startedTask.Task;
+        await startedTask.Task.WithTimeout(InitialLoadMaxWaitTimeForUpdates);
Contributor review comment on the diff above:

So the callers will get this timeout if they call while the service is still trying to init.
It would be better to do what you do now, await startedTask.Task.WithTimeout(InitialLoadMaxWaitTimeForUpdates),
but then catch the timeout and rethrow a "service is still trying to init" error.

Member Author review comment:

But that could potentially leave the grain stuck for 10 minutes. I would prefer to bail out early.

Contributor review comment:

No, it would not.
As I wrote, you still put the WithTimeout (await startedTask.WithTimeout(...)); you just rethrow an explicit exception instead of a TimeoutException.

Member Author review comment:

Oh, I misunderstood. I thought you were suggesting that I wait longer, until no retries are left. Good, I will change the exception.
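For illustration, a sketch of the pattern agreed on in this thread (the WithTimeout helper below is a local stand-in for the extension method used in the PR, and the exception type and message are made up, not the ones actually added):

```csharp
using System;
using System.Threading.Tasks;

static class InitialLoadWaitSketch
{
    // Name taken from the diff above; the value here is made up.
    private static readonly TimeSpan InitialLoadMaxWaitTimeForUpdates = TimeSpan.FromSeconds(30);

    public static async Task WaitForInitialLoad(Task startedTask)
    {
        try
        {
            // Bail out early instead of waiting for all the retries to play out...
            await WithTimeout(startedTask, InitialLoadMaxWaitTimeForUpdates);
        }
        catch (TimeoutException)
        {
            // ...but surface an explicit error rather than a bare timeout.
            throw new InvalidOperationException(
                "The reminder service is still initializing (the initial load of reminders has not completed yet).");
        }
    }

    // Local stand-in for the WithTimeout extension used in the PR.
    private static async Task WithTimeout(Task task, TimeSpan timeout)
    {
        var completed = await Task.WhenAny(task, Task.Delay(timeout));
        if (completed != task)
            throw new TimeoutException();
        await task; // propagate any failure from the awaited task (e.g. a faulted startedTask)
    }
}
```

This keeps the early bail-out the author wanted while giving callers a clearer reason than a bare timeout.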

@jdom (Member, Author) commented Mar 4, 2016

@gabikliot, I addressed all the comments. I kept separate commits so that it is easier for you to review what changed from iteration to iteration. Let me know if it's good for merging and I'll squash and rebase on top of the current master.

@sergeybykov (Contributor) commented:

When this one is ready, we need to merge it and include it in the 1.1.3 release.

@gabikliot (Contributor) commented:

OK, you can squash.

Squashed commit message:
This way the service will not block the silo from fully initializing before all the reminders are loaded.
Avoid reading and starting all the reminders if the Silo is stopped before finishing the first load.
Give an explicit exception when trying to register a reminder and the service is still booting
Retry forever, but fail fast to update reminders after a few retries

@jdom force-pushed the reminder-service-start-background branch from 91993c9 to d1286c3 on March 5, 2016 at 17:59
@jdom (Member, Author) commented Mar 5, 2016

Done, thanks @gabikliot. This was rebased and squashed, and it's ready to merge unless someone else has more feedback.

gabikliot pushed a commit that referenced this pull request Mar 5, 2016
Start ReminderService initial load in the background
@gabikliot merged commit 62c65ca into dotnet:master on Mar 5, 2016
@jdom deleted the reminder-service-start-background branch on March 7, 2016 at 17:55
@giglesias commented:

Thanks for introducing this change in the 1.1.3 release. We were experiencing a delay of nearly 40 seconds during silo startup due to the initial reminders load from Azure table storage. We've upgraded to the latest Orleans bits and the overall silo startup time has decreased considerably. Even with a lot of reminders, it is now in line with what we saw in past tests where we had no reminders at all in Azure table storage.

Just wanted to share our findings.

@sergeybykov (Contributor) commented:

@giglesias Thanks for confirming!

@jdom (Member, Author) commented Mar 15, 2016

Awesome @giglesias, thanks for validating it.
