
Reminders Stored in SQL Are Not Always Firing After Cluster Restart #1746

Closed
zeus82 opened this issue May 10, 2016 · 29 comments

@zeus82
Contributor

zeus82 commented May 10, 2016

Reminders stored in SQL are not always firing after a cluster restart. When they do fire after a restart, the grain's OnActivateAsync does not get called.

This repo has a solution I created to illustrate the problem: OrleansReminderProblem

The connection string to the database is in the OrleansHostWrapper.

Run the test host a few times, letting each run go for at least 1 minute, to see that the reminder doesn't fire.

To see that the test reminder is firing, look for an Info-level message in the logs that says "Reminder got called".

Also, to see that the OnActivateAsync method did not get called, look for an Error message in the logs that says: "********** OnActivateAsync was not called ********************"
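
For reference, a minimal sketch of what such a repro grain could look like (illustrative names only, not the code from the linked OrleansReminderProblem solution): it registers a persistent "heartbeat" reminder and logs the two messages mentioned above.

using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime;

public interface IHeartbeatGrain : IGrainWithGuidKey
{
    Task Start();
}

public class HeartbeatGrain : Grain, IHeartbeatGrain, IRemindable
{
    private bool activated;

    public override Task OnActivateAsync()
    {
        activated = true;
        return base.OnActivateAsync();
    }

    public async Task Start()
    {
        // Persistent reminder: it should keep firing across silo and cluster restarts.
        await RegisterOrUpdateReminder("heartbeat", TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1));
    }

    public Task ReceiveReminder(string reminderName, TickStatus status)
    {
        GetLogger().Info("Reminder got called");
        if (!activated)
        {
            GetLogger().Error(0, "********** OnActivateAsync was not called ********************");
        }
        return TaskDone.Done;
    }
}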

@zeus82
Contributor Author

zeus82 commented May 11, 2016

I've been digging, and it looks like the reminder is getting loaded from the database.

In LocalReminderService.cs, it looks like the StartTimer method is getting called, but soon afterwards the StopReminder method gets called, which seems to be preventing the timer from ticking.

@gabikliot
Contributor

StopReminder is supposed to stop the timer from ticking. StopReminder is triggered by application code. So if you stopped the reminder, well... it is supposed to not tick any more.
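
"Triggered by application code" normally means the grain unregistering its own reminder, roughly like this (an illustrative fragment, assuming a reminder registered under the name "heartbeat"):

// Inside an async method of the grain that owns the reminder:
IGrainReminder reminder = await GetReminder("heartbeat");   // look the reminder up by name
if (reminder != null)
{
    await UnregisterReminder(reminder);                      // after this, it should stop ticking
}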

@zeus82
Contributor Author

zeus82 commented May 11, 2016

The StopReminder call is not happening from my code... Here is the stack trace:

OrleansRuntime.dll!Orleans.Runtime.ReminderService.LocalReminderService.LocalReminderData.StopReminder(Orleans.Runtime.TraceLogger logger) Line 560 C#
OrleansRuntime.dll!Orleans.Runtime.ReminderService.LocalReminderService.ReadTableAndStartTimers(Orleans.Runtime.IRingRange range) Line 401 C#
OrleansRuntime.dll!Orleans.Runtime.ReminderService.LocalReminderService.ReadAndUpdateReminders() Line 212 C#
OrleansRuntime.dll!Orleans.Runtime.ReminderService.LocalReminderService.DoInitialReadAndUpdateReminders() Line 289 C#
OrleansRuntime.dll!Orleans.Runtime.ReminderService.LocalReminderService.StartInBackground() Line 247 C#
OrleansRuntime.dll!Orleans.Runtime.ReminderService.LocalReminderService.Start() Line 86 C#
OrleansRuntime.dll!Orleans.Runtime.Scheduler.SchedulerExtensions.QueueTask.AnonymousMethod__0() Line 38 C#
OrleansRuntime.dll!Orleans.Runtime.Scheduler.SchedulerExtensions.QueueTask.AnonymousMethod__1() Line 46 C#
OrleansRuntime.dll!Orleans.Runtime.Scheduler.ClosureWorkItem.Execute() Line 46 C#
OrleansRuntime.dll!Orleans.Runtime.Scheduler.TaskSchedulerUtils.RunWorkItemTask(Orleans.Runtime.Scheduler.IWorkItem todo, System.Threading.Tasks.TaskScheduler sched) Line 19 C#
OrleansRuntime.dll!Orleans.Runtime.Scheduler.TaskSchedulerUtils.WrapWorkItemAsTask.AnonymousMethod__0(object state) Line 10 C#
[External Code]
OrleansRuntime.dll!Orleans.Runtime.Scheduler.ActivationTaskScheduler.RunTask(System.Threading.Tasks.Task task) Line 45 C#
OrleansRuntime.dll!Orleans.Runtime.Scheduler.WorkItemGroup.Execute() Line 348 C#
OrleansRuntime.dll!Orleans.Runtime.Scheduler.WorkerPoolThread.Run() Line 176 C#
Orleans.dll!Orleans.Runtime.AsynchAgent.AgentThreadProc(object obj) Line 173 C#

In the method that calls StopReminder, there is a log message that says "Not in table, In local, Old, so removing."... It's a Verbose log.

@veikkoeeva
Contributor

veikkoeeva commented May 11, 2016

@gabikliot, @sergeybykov, @jdom Do you remember how the hash ranges are allocated when using GrainBasedMembershipTable? Could that affect the hash range between runs? That is, the reminders are in SQL Server, while the rest of the configuration is the in-memory developer one.

@zeus82 The message Not in table, In local, Old, so removing does not appear to be in the logs.

@zeus82
Contributor Author

zeus82 commented May 11, 2016

@veikkoeeva Yeah... for some reason I can't get verbose logging to work... I added what's below, but it looks like logging is still at the Info level. Am I doing something wrong?

siloHost.Config.Defaults.DefaultTraceLevel = Orleans.Runtime.Severity.Verbose3;
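
One possible gotcha (an unverified assumption, not something confirmed against the repro): the logging infrastructure picks up the configured severity during silo initialization, so the level has to be set before InitializeOrleansSilo(). With the SiloHost-based host from the samples, that ordering looks roughly like:

var siloHost = new Orleans.Runtime.Host.SiloHost("primary");
siloHost.ConfigFileName = "OrleansConfiguration.xml";
siloHost.LoadOrleansConfig();

// Set the trace level before the silo is initialized, not after.
siloHost.Config.Defaults.DefaultTraceLevel = Orleans.Runtime.Severity.Verbose2;

siloHost.InitializeOrleansSilo();
siloHost.StartOrleansSilo();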

@zeus82
Contributor Author

zeus82 commented May 11, 2016

Ok, so I managed to get Verbose2-level logs... They are in this repo (the same repo as the one from the original comment).

You can now find Not in table, In local, Old, so removing in the 2nd and 3rd logs.

@sergeybykov
Contributor

sergeybykov commented May 11, 2016

Looks like this snippet points to the root of it:

[2016-05-11 15:39:08.478 GMT 11 VERBOSE 100000 ReminderService 127.0.0.1:22222] For range <(x9FF1752E xAF387035], Size=x0F46FB07, %Ring=5.968%>, I read in 1 reminders from table. LocalTableSequence 18, CachedSequence 18
[2016-05-11 15:39:08.481 GMT 11 VERBOSE-2 100000 ReminderService 127.0.0.1:22222] In table, Not in local, <GrainRef=GrainReference:*grn/E7C24DEC/00000000 ReminderName=heartbeat Period=00:01:00>
[2016-05-11 15:39:08.489 GMT 11 VERBOSE 101404 asynTask.SafeTimerBase 127.0.0.1:22222] Creating timer asynTask.SafeTimerBase with dueTime=-00:00:00.0010000 period=-00:00:00.0010000
[2016-05-11 15:39:08.492 GMT 11 VERBOSE 100000 ReminderService 127.0.0.1:22222] Reminder [heartbeat, GrainReference:_grn/E7C24DEC/0000000000000000000000000000000003ffffffe7c24dec-0xA85C4D19, 00:01:00, 2016-05-11 15:35:47.897 GMT, 0, 19, Ticking], Due time00:00:39.4078298
[2016-05-11 15:39:08.493 GMT 11 VERBOSE 102936 ReminderService 127.0.0.1:22222] Started reminder <GrainRef=GrainReference:_grn/E7C24DEC/00000000 ReminderName=heartbeat Period=00:01:00>.
[2016-05-11 15:39:08.495 GMT 11 VERBOSE-2 100000 ReminderService 127.0.0.1:22222] Reading rows for range <(xAF387035 xB99F7445], Size=x0A670410, %Ring=4.063%>
[2016-05-11 15:39:08.495 GMT 11 VERBOSE 100000 ReminderService 127.0.0.1:22222] Reading rows from <(xAF387035 xB99F7445], Size=x0A670410, %Ring=4.063%>
[2016-05-11 15:39:08.497 GMT 11 VERBOSE 100000 ReminderService 127.0.0.1:22222] For range <(xAF387035 xB99F7445], Size=x0A670410, %Ring=4.063%>, I read in 0 reminders from table. LocalTableSequence 20, CachedSequence 20
[2016-05-11 15:39:08.497 GMT 11 VERBOSE-2 100000 ReminderService 127.0.0.1:22222] Not in table, In local, Old, so removing. [heartbeat, GrainReference:*grn/E7C24DEC/0000000000000000000000000000000003ffffffe7c24dec-0xA85C4D19, 00:01:00, 2016-05-11 15:35:47.897 GMT, 0, 19, Ticking]
[2016-05-11 15:39:08.500 GMT 11 VERBOSE 101408 asynTask.SafeTimerBase 127.0.0.1:22222] Disposing timer asynTask.SafeTimerBase

A reminder is read from the table as part of one range, gets started, only to be stopped because it is considered part of another range.

Can this be related to the range/hash calculation and storage issue that @shayhatsor addressed in #1374?
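
To make that failure mode concrete, here is a small self-contained toy model of it (this is not the actual LocalReminderService code; plain dictionaries and uint hashes stand in for the reminder table, the ring ranges, and the timers):

using System;
using System.Collections.Generic;
using System.Linq;

class ToyReminderRangeBug
{
    // (low, high] ring ranges; grain hashes are plain uints in this toy model.
    static bool InRange(uint hash, (uint Low, uint High) range) => hash > range.Low && hash <= range.High;

    static void Main()
    {
        // Reminder table keyed by grain hash; both rows belong to this one silo.
        var table = new Dictionary<uint, string> { [0x10] = "heartbeat", [0x90] = "cleanup" };
        var local = new Dictionary<uint, string>();                    // reminders with running local timers
        var myRanges = new[] { (Low: 0x00u, High: 0x7Fu), (Low: 0x7Fu, High: 0xFFu) };

        foreach (var range in myRanges)
        {
            var inTable = table.Where(r => InRange(r.Key, range)).ToList();

            // Buggy step: the "not in table" set starts from ALL local reminders,
            // not only those whose grain hash falls inside the current range.
            var notInTable = new Dictionary<uint, string>(local);

            foreach (var row in inTable)
            {
                local[row.Key] = row.Value;                            // start (or keep) the timer
                notInTable.Remove(row.Key);
                Console.WriteLine("Started " + row.Value);
            }

            foreach (var stale in notInTable)
            {
                // "Not in table, In local, Old, so removing." - but it IS in the table,
                // just under the other range owned by this same silo.
                Console.WriteLine("Stopped " + stale.Value);
                local.Remove(stale.Key);
            }
        }

        // "heartbeat" is started during the first range's pass and then wrongly stopped
        // during the second range's pass - the behaviour visible in the verbose log.
        Console.WriteLine("Still ticking: " + string.Join(", ", local.Values));
    }
}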

@shayhatsor
Member

shayhatsor commented May 11, 2016

Well, the same reminder range tests pass on SQL Server, MySQL, and Azure... so I don't know what, or whether anything, is wrong here in terms of range queries.

@veikkoeeva
Contributor

@zeus82 Can you switch to the Azure Storage Emulator? You shouldn't need to do more than switch the connection string and start the emulator. You can check the connection string in the samples, and instructions for starting the emulator can be found here. I wonder if this could be reproduced by creating a test.

@sergeybykov
Contributor

@shayhatsor Yeah, this is very strange.
@zeus82 You used the latest and greatest SQL scripts for this project, right?

@shayhatsor
Member

shayhatsor commented May 11, 2016

Just a thought, maybe this is related to this: Start ReminderService initial load in the background.

@sergeybykov
Contributor

@shayhatsor It's definitely not impossible that #1520 is related, although it shouldn't have impacted anything, as it only deferred initialization.

@gabikliot
Contributor

From the posted log snippet, the range changed, and now this reminder is not in this silo's range, so it stopped it. Legit so far. Another silo should have picked up that reminder and started ticking. You need to look at all the silo logs.

@sergeybykov
Contributor

sergeybykov commented May 11, 2016

@gabikliot Looks like it was a single silo though. Just above those log entries there is this:

[2016-05-11 15:39:08.347 GMT 11 VERBOSE-2 100000 ReminderService 127.0.0.1:22222] My range= MultiRange: Size=x100000000, %Ring=100.000%

@sergeybykov
Contributor

MembershipOracle 127.0.0.1:22222] -ReadAll (called from BecomeActive) Membership table 1 silos, 1 are Active

@zeus82
Contributor Author

zeus82 commented May 11, 2016

@sergeybykov I am using the latest SQL - 2014 Standard

Here is what the reminder row looks like:
00000000-0000-0000-0000-000000000000 GrainReference=0000000000000000000000000000000003ffffffe7c24dec heartbeat 60000 -1470345959 0 2016-05-11 15:35:47.897

@gabikliot
Contributor

I will be able to look at the log in the evening.

@zeus82
Contributor Author

zeus82 commented May 11, 2016

@veikkoeeva Ok, so it looks like my reminder is getting called when I use Azure. So the problem might just be with SQL then?

@jdom
Member

jdom commented May 11, 2016

Did not dig a lot into the logs, but I do see in your first log (for the 2nd run) that OnActivateAsync was called... Is the link incorrect?

@zeus82
Contributor Author

zeus82 commented May 11, 2016

@jdom I thought at first the OnActivateAsync wasn't called, but it was just way farther up in the log file! My original comment was wrong... I shouldn't have deleted it :(

@shayhatsor
Member

shayhatsor commented May 11, 2016

I think I've found the problem. I believe there's a bug here. If a silo handles more than one hash range, we might miss reminders. I believe the fix should be something like this:

var remindersNotInTable = 
        (from localReminder in localReminders
         let grainHash = localReminder.Key.GrainRef.GetUniformHashCode()
         where srange.InRange(grainHash)
         select localReminder).ToDictionary(localReminder => localReminder.Key,
                                            localReminder => localReminder.Value); // shallow copy

[Edit] After applying this fix, @zeus82's example solution works as expected!

@zeus82
Contributor Author

zeus82 commented May 12, 2016

@shayhatsor Do you know why it happened 100% of the time with SQL and never with Azure?

@gabikliot
Contributor

I think @shayhatsor is correct with the bug. The fix looks OK. Just please rewrite it with simple LINQ, and not with embedded SQL.
And while you are there, please also delete the line at https://github.com/dotnet/orleans/blob/master/src/OrleansRuntime/ReminderService/LocalReminderService.cs#L211.
It duplicates the line at https://github.com/dotnet/orleans/blob/master/src/OrleansRuntime/ReminderService/LocalReminderService.cs#L315.

As for why it happens only with SQL and not Azure: based on the log, I suspect that the execution in the SQL case is serial: we read range by range, while in Azure we read ranges in parallel. The calls are made in parallel, but awaited all together. I suspect that in SQL the implementation is actually not truly async, and thus the calls execute in serial order.
As a result, with SQL it always reproduces, but with Azure there is a race and we are essentially lucky most of the time. But the bug is still there, of course. @shayhatsor, you might want to look into this serial execution in SQL.

As for why there are multiple ranges: indeed, as @sergeybykov wrote, it is one silo, but we still use multiple ranges (30 by default), since that significantly improves load balancing of ranges (the well-known "virtual buckets" optimization for consistent hashing). It causes all silos (in the multi-silo case) to own aggregated ranges of roughly the same size; if there were one range per silo, the range sizes would vary quite a lot. Another advantage of multiple ranges is splitting the read from the table into multiple parallel reads.
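
As a rough illustration of the virtual-buckets point (a self-contained toy, not Orleans code: it hashes a few silo/bucket names onto a 32-bit ring and compares the aggregated share each silo ends up owning):

using System;
using System.Linq;

class VirtualBucketsDemo
{
    // FNV-1a string hash, used here only to place points on the 32-bit ring.
    static uint Hash(string s) => s.Aggregate(2166136261u, (h, c) => (h ^ (uint)c) * 16777619u);

    static void Show(int bucketsPerSilo)
    {
        string[] silos = { "siloA", "siloB", "siloC" };

        // Place every (silo, bucket) point on the ring and sort them.
        var points = silos
            .SelectMany(s => Enumerable.Range(0, bucketsPerSilo).Select(i => (Silo: s, Pos: Hash(s + "/" + i))))
            .OrderBy(p => p.Pos)
            .ToArray();

        // Each point owns the arc from the previous point up to itself (wrapping around zero).
        var owned = silos.ToDictionary(s => s, s => 0.0);
        for (int i = 0; i < points.Length; i++)
        {
            uint prev = points[(i + points.Length - 1) % points.Length].Pos;
            uint arc = points[i].Pos - prev;          // unsigned wrap-around handles the arc that crosses zero
            owned[points[i].Silo] += arc / (double)uint.MaxValue;
        }

        Console.WriteLine(bucketsPerSilo + " bucket(s) per silo: " +
            string.Join(", ", owned.Select(kv => kv.Key + "=" + kv.Value.ToString("P1"))));
    }

    static void Main()
    {
        Show(1);    // with a single range per silo, the shares are usually lopsided
        Show(30);   // with 30 virtual buckets per silo, the aggregated shares come out much closer to even
    }
}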

gabikliot added the bug label May 12, 2016
@sergeybykov
Contributor

@shayhatsor @gabikliot Do I understand you correctly that you see the bug being caused by interleaving of multiple ReadTableAndStartTimers() calls due to await at https://github.com/dotnet/orleans/blob/master/src/OrleansRuntime/ReminderService/LocalReminderService.cs#L322?

I see that the proposed fix would prevent removal of reminders that fall into ranges other than the range passed to each individual call. I wonder, though (and it's not clear to me yet), whether it would inadvertently cause reminders that belong to ranges the silo used to own, but handed over to other silos due to repartitioning, to not get removed. Testing with a single silo won't reveal such an issue, because the silo owns all the ranges.

@shayhatsor
Member

About @gabikliot's comment:

Just please rewrite it with simple LINQ, and not with embedded SQL.

I used the let keyword; it doesn't have a simple equivalent in LINQ method syntax. I can convert it to a foreach.
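
For what it's worth, if "simple LINQ" means method syntax, the let can simply be inlined (a sketch assuming the same localReminders and srange variables as in the snippet above):

// Method-syntax equivalent of the query above; the let is inlined into the Where clause.
var remindersNotInTable = localReminders
    .Where(localReminder => srange.InRange(localReminder.Key.GrainRef.GetUniformHashCode()))
    .ToDictionary(localReminder => localReminder.Key,
                  localReminder => localReminder.Value);   // shallow copy, as before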

Based on the log I suspect that the execution in SQL case is serial: we read rage by range, while in Azure we read ranges in parallel. The calls are made in parallel, but awaited all together. I suspect that in SQL the implementation is actually not truly async, and thus the calls execute in the serial order.

I think it shows something else: the Azure emulator is very slow compared to a real SQL Server running locally. The results come back before the next parallel call is invoked, so the execution appears serial in the log. The SQL Server implementation was tested by @veikkoeeva to be truly async, so I trust it.

@shayhatsor
Member

shayhatsor commented May 12, 2016

About @sergeybykov's comment:

I wonder though, and it's not clear to me yet, if that would inadvertently cause reminders that belong to ranges that the silo used to own but handed over to other silos due to repartitioning to not get removed? Testing with a single silo won't reveal such an issue because the silo owns all the ranges.

I see your point. We need to clean up out-of-range reminders from localReminders in the ReadAndUpdateReminders method; that cleanup doesn't relate to any specific sub-range.
What do you think?
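
A rough sketch of that cleanup idea (assumed names, based only on what is visible in this thread; not necessarily what the eventual fix does):

// Run once per ReadAndUpdateReminders pass, before the per-range reads:
// stop and drop any cached reminder whose grain hash is no longer inside
// any range this silo currently owns.
private void RemoveOutOfRangeReminders(IRingRange myRange)
{
    var outOfRange = localReminders
        .Where(r => !myRange.InRange(r.Key.GrainRef.GetUniformHashCode()))
        .ToList();

    foreach (var reminder in outOfRange)
    {
        reminder.Value.StopReminder(logger);   // this range has been handed over to another silo
        localReminders.Remove(reminder.Key);
    }
}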

@sergeybykov
Contributor

I think that's probably the right solution - to clean up in ReadAndUpdateReminders before we run foreach at https://github.com/dotnet/orleans/blob/master/src/OrleansRuntime/ReminderService/LocalReminderService.cs#L209.

The current code calls ring.GetMyRange() at https://github.com/dotnet/orleans/blob/master/src/OrleansRuntime/ReminderService/LocalReminderService.cs#L351 and https://github.com/dotnet/orleans/blob/master/src/OrleansRuntime/ReminderService/LocalReminderService.cs#L382, which makes me nervous because of potential races between changes to the ring and this logic.

I think we shouldn't do that, and instead check only against the range we are executing ReadTableAndStartTimers() for, even though that range may already be outdated due to a ring change.

I'm not sure yet if that is enough to be correct. Maybe we should serialize all calls to ReadAndUpdateReminders(), so that we never get more than one of them running interleaved.
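
A minimal sketch of what serializing those calls could look like (an assumption, not the actual change: it just funnels the existing method through a single-entry async lock):

private readonly SemaphoreSlim readAndUpdateLock = new SemaphoreSlim(1, 1);

private async Task ReadAndUpdateRemindersSerialized()
{
    await readAndUpdateLock.WaitAsync();     // only one pass may run at a time
    try
    {
        await ReadAndUpdateReminders();      // the existing method, now never interleaved
    }
    finally
    {
        readAndUpdateLock.Release();
    }
}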

@jdom
Member

jdom commented May 12, 2016

I agree that we should be serializing the calls to ReadAndUpdateReminders(), and only update the myRange field from there.

@gabikliot
Contributor

Fixed via #1757.

ghost locked as resolved and limited conversation to collaborators Sep 29, 2021