Reminders Stored in SQL Are Not Always Firing After Cluster Restart #1746
Comments
I've been digging, and it looks like the reminder is getting loaded from the database. In LocalReminderService.cs, it looks like the StartTimer method is getting called, but then soon after the StopReminder method gets called which seems to be preventing the timer from ticking. |
StopReminder is supposed to stop the timer from ticking. StopReminder is triggered by application code. So if you stopped the reminder, well... it is not supposed to tick any more. |
The StopReminder call is not happening from my code... Here is the stack trace:
Then, in the method that calls StopReminder, there is a log message that says "Not in table, In local, Old, so removing."... It's a Verbose log. |
@gabikliot, @sergeybykov, @jdom Do you remember to tell how are the hash ranges allocated when using @zeus82 The message |
@veikkoeeva Yeah... for some reason I can't get verbose logging to work... I added what's below, but it looks like logging is still at the Info level. Am I doing something wrong?
|
Ok, so I managed to get Verbose2 level logs... They are in the In this repo (same repo as one from original comment) You can now find |
Looks like this snippet points to the root of it:
A reminder is read from the table as part of one range, gets started, only to be stopped because it is considered part of another range. Can this be related to the range/hash calculation and storage issue that @shayhatsor addressed in #1374? |
Well, the same reminder range tests pass on SQL Server, MySQL and Azure... so I don't know what, if anything, is wrong here in terms of range queries. |
@shayhatsor Yeah, this is very strange. |
Just a thought, maybe this is related to this : Start ReminderService initial load in the background. |
@shayhatsor Definitely not impossible that #1520 is related. Although it shouldn't have impacted anything as it only deferred initialization. |
From the posted log snippet, the range changed, and now this reminder is not in this silo's range, so the silo stopped it. Legit so far. Another silo should have picked up that reminder and started ticking. You need to look at all silo logs. |
@gabikliot Looks like it was a single silo though. Just above those log entries there is this:
|
|
@sergeybykov I am using the latest SQL Server, 2014 Standard. Here is what the reminder row looks like: |
I will be able to look at the log in the evening. |
@veikkoeeva Ok, so it looks like my reminder is getting called when I use azure. So problem might just be with Sql then? |
Did not dig a lot into the logs, but I do see in your first log (for the 2nd run) that OnActivateAsync was called... Is the link incorrect? |
@jdom I thought at first the |
I think I've found the problem. I believe there's a bug here. If a silo handles more than one hash range, we might miss reminders. I believe the fix should be something like this:
var remindersNotInTable =
    (from localReminder in localReminders
     let grainHash = localReminder.Key.GrainRef.GetUniformHashCode()
     where srange.InRange(grainHash)
     select localReminder).ToDictionary(localReminder => localReminder.Key,
                                        localReminder => localReminder.Value); // shallow copy
[Edit] After applying this fix, @zeus82's example solution works as expected! |
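For readers who want to see the failure mode in isolation, here is a minimal sketch in Python. It is a simplified model and not the Orleans code: `HashRange`, `read_and_update`, `refresh_all` and the `check_range` flag are hypothetical names, hashes are small integers, and the "Old" age check mentioned in the log message is ignored. It assumes the per-range refresh removes every local reminder that the table did not return for that range, which is what the discussion above suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashRange:
    """Interval (begin, end] on a tiny hash ring. Hypothetical model type."""
    begin: int
    end: int

    def in_range(self, h: int) -> bool:
        return self.begin < h <= self.end


def read_and_update(local, table, srange, check_range):
    """Simulate one per-range refresh against the reminder table."""
    # Rows the table returns for this range only.
    rows = {k: h for k, h in table.items() if srange.in_range(h)}
    # Start (or keep) every reminder returned for this range.
    local.update(rows)
    # Local reminders missing from this range's rows. Without the range
    # check, reminders that belong to *other* ranges are treated as stale.
    stale = [
        k for k, h in local.items()
        if k not in rows and (not check_range or srange.in_range(h))
    ]
    for k in stale:
        del local[k]


def refresh_all(table, ranges, check_range):
    """Refresh every range one after another, starting with no local reminders."""
    local = {}
    for r in ranges:
        read_and_update(local, table, r, check_range)
    return local


table = {"reminder-a": 5, "reminder-b": 15}      # reminder name -> grain hash
ranges = [HashRange(0, 10), HashRange(10, 20)]   # a single silo owning two ranges

print(refresh_all(table, ranges, check_range=False))  # without the range check
print(refresh_all(table, ranges, check_range=True))   # with the range check
```

Without the range check, `reminder-a` is started while the first range is processed and then removed while the second range is processed, which matches the "started, then stopped" pattern in the logs. With the check, each pass only touches reminders whose hash falls inside the range it just read.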
@shayhatsor Do you know why it happened 100% of the time with SQL and never with Azure? |
I think @shayhatsor is correct about the bug. The fix looks OK. Just please rewrite it with simple LINQ, and not with embedded SQL. As for why it happens only in SQL and not Azure: based on the log, I suspect that the execution in the SQL case is serial. We read range by range, while in Azure we read ranges in parallel. The calls are made in parallel, but awaited all together. I suspect that in SQL the implementation is actually not truly async, and thus the calls execute in serial order. As for why there are multiple ranges: indeed, as @sergeybykov wrote, it is one silo, but we still use multiple (30 by default) ranges, since that significantly improves load balancing of ranges (the well-known so-called "virtual buckets" optimization for consistent hashing). That causes all silos (in the multi-silo case) to have aggregated ranges of similar size. If there was one range per silo, the range sizes would vary quite a lot. Another advantage of multiple ranges is splitting the read from the table into multiple parallel reads. |
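The load-balancing argument for virtual buckets can be sketched numerically. The snippet below is an illustration in Python under stated assumptions, not how Orleans computes its ring: the silo names, the use of SHA-256 to place buckets, and the helpers `position` and `ownership_shares` are all hypothetical. Only the figure of 30 ranges per silo comes from the comment above.

```python
import hashlib

RING = 2 ** 32  # size of the hypothetical hash ring


def position(silo: str, bucket: int) -> int:
    """Deterministic ring position for one virtual bucket of a silo."""
    digest = hashlib.sha256(f"{silo}#{bucket}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


def ownership_shares(silos, buckets_per_silo):
    """Fraction of the ring owned by each silo, summed over its buckets."""
    points = sorted(
        (position(s, b), s) for s in silos for b in range(buckets_per_silo)
    )
    arcs = {s: 0 for s in silos}
    for i, (pos, silo) in enumerate(points):
        prev_pos = points[i - 1][0]            # index -1 wraps around the ring
        arcs[silo] += (pos - prev_pos) % RING  # arc ending at this bucket
    return {s: arc / RING for s, arc in arcs.items()}


silos = ["silo-1", "silo-2", "silo-3", "silo-4"]
for buckets in (1, 30):
    shares = ownership_shares(silos, buckets)
    print(buckets, {s: round(v, 3) for s, v in shares.items()})
```

With one bucket per silo, the shares are simply the gaps between four random points and tend to be quite uneven. With 30 buckets per silo, each silo's share is a sum of many small arcs, so the totals typically end up much closer to one quarter each. The flip side, relevant to this issue, is that even a single silo then owns many separate ranges.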
@shayhatsor @gabikliot Do I understand you correctly that you see the bug being caused by interleaving of multiple I see that the proposed fix would prevent removal of reminders that fall into ranges other than the range passed to each individual call. I wonder though, and it's not clear to me yet, if that would inadvertently cause reminders that belong to ranges that the silo used to own but handed over to other silos due to repartitioning to not get removed? Testing with a single silo won't reveal such an issue because the silo owns all the ranges. |
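The concern about handed-over ranges can be made concrete with a small model. This is a Python sketch under assumptions, not Orleans code: ranges are `(begin, end]` integer tuples, and `refresh_with_fix` and `drop_out_of_range` are hypothetical helpers. The second helper is only one possible remedy, shown for illustration.

```python
def in_range(rng, h):
    """True if hash h lies in the interval (begin, end]."""
    begin, end = rng
    return begin < h <= end


def refresh_with_fix(local, table, owned_ranges):
    """Per-range refresh that only removes reminders inside the range just read."""
    for rng in owned_ranges:
        rows = {k: h for k, h in table.items() if in_range(rng, h)}
        local.update(rows)
        stale = [k for k, h in local.items() if k not in rows and in_range(rng, h)]
        for k in stale:
            del local[k]
    return local


def drop_out_of_range(local, owned_ranges):
    """Hypothetical extra pass: keep only reminders in a range the silo still owns."""
    return {
        k: h for k, h in local.items()
        if any(in_range(r, h) for r in owned_ranges)
    }


table = {"reminder-a": 5, "reminder-b": 15}
local = dict(table)   # the silo used to own (0, 10] and (10, 20]
owned = [(10, 20)]    # after repartitioning, (0, 10] belongs to another silo

after_refresh = refresh_with_fix(dict(local), table, owned)
print(after_refresh)                            # reminder-a is still held locally
print(drop_out_of_range(after_refresh, owned))  # only reminder-b remains
```

In this model no owned range ever covers hash 5, so the range-checked removal never looks at `reminder-a` and it would keep ticking on the old silo. A single-silo test cannot show this because the silo owns every range.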
About @gabikliot's comment:
I used the
I think it shows something else, the fact that the Azure emulator is very slow compared to a real SQL Server that runs locally. The results get back before another parallel call is invoked, so it appears serial in the log. The implementation of SQL Server was tested by @veikkoeeva to be truly async, so I trust it. |
About @sergeybykov's comment:
I see your point. We need to cleanup out of range reminders from |
I think that's probably the right solution - to clean up in The current code calls I think we shouldn't do that, and instead check only against the range we are executing I'm not sure yet if that is enough to be correct. Maybe we should serialize all calls to |
I agree that we should be serializing the calls to |
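As an illustration of what serializing the per-range calls means, here is a small Python `asyncio` analogy. It is not the Orleans implementation and says nothing about how #1757 was written: `run_refreshes`, `read_and_update` and `refresh` are hypothetical names, and the sleep stands in for a table read. The sketch counts how many refreshes are in flight at once, with and without a lock.

```python
import asyncio


async def run_refreshes(range_names, serialize):
    """Run one refresh per range; return the peak number running at once."""
    lock = asyncio.Lock()
    state = {"active": 0, "max_active": 0}

    async def read_and_update(name):
        state["active"] += 1
        state["max_active"] = max(state["max_active"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the asynchronous table read
        state["active"] -= 1

    async def refresh(name):
        if serialize:
            # Only one refresh may read and mutate local state at a time.
            async with lock:
                await read_and_update(name)
        else:
            await read_and_update(name)

    # All refreshes are started together and awaited together.
    await asyncio.gather(*(refresh(n) for n in range_names))
    return state["max_active"]


names = ["range-0", "range-1", "range-2"]
print(asyncio.run(run_refreshes(names, serialize=False)))  # prints 3
print(asyncio.run(run_refreshes(names, serialize=True)))   # prints 1
```

Without the lock, all three refreshes overlap while waiting on the read, so their updates to shared local state can interleave. With the lock, each refresh finishes its read and its local update before the next one starts, at the cost of losing the parallel reads.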
Fixed via #1757. |
When they do fire after a restart, the grain's OnActivateAsync does not get called.
This repo has a solution I created to illustrate the problem: OrleansReminderProblem
The connection string to the database is in the OrleansHostWrapper
Run the test host a few times for at least 1 minute to make sure the reminder doesn't fire.
To see that the test reminder is firing look for an Info level message in the logs that says "Reminder got called"
Also, to see that the OnActivateAsync method did not get called, look for an Error message in the logs that says: "********** OnActivateAsync was not called ********************"