
Reminders scalability issues #947

Open · Tracked by #7477
MSFT-AshleyIngram opened this issue Oct 27, 2015 · 40 comments

Comments

@MSFT-AshleyIngram

Orleans currently loads the entire reminders table into memory on Silo startup, partitioning reminders across available silos in the cluster, regardless of the time that reminders are set for.

This causes a limitation on the number of reminders that can be set at any given time, because of memory constraints across the cluster.

We propose paging reminders, loading only the reminders which will trigger within a configurable time period in the future.

@sergeybykov
Contributor

That's the general direction in which we wanted to refactor the reminder service - to keep in memory only the fraction of the overall data that is relevant within the next, relatively small window of time.

@MSFT-AshleyIngram
Author

Proposal

In order to improve the scalability of Orleans reminders, we should page them into memory based on when they are due, rather than loading up all (partitioned) reminders at silo startup.

This work can effectively be sub-divided into 2 parts:

  1. Creating a custom IReminderService implementation, which loads reminders based on when they are due.
  2. Creating a new interface (extending IReminderTable) and building a set of implementations for persisting reminder data, including the time period each reminder is due in.

Reminder Service

Much like the existing implementation, the custom IReminderService will load reminders into memory, and trigger them on the appropriate grains when necessary. Unlike the existing implementation, it will load them based on the time period ('quantum') the original reminder was set for. The quantum-length will be configurable, allowing reminders to be loaded in at a configurable granularity (e.g. only reminders that will fire in the next 5 minutes, or the next hour).

For the most part, this can be handled by the existing GrainTimer. A timer can be set up to load reminders from the data store periodically, based on the configured quantum size. When the timer fires, reminders will be fetched from storage that match both the current time window and the consistent-hash range the silo is responsible for.
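As a rough illustration, here is a minimal sketch (the helper name and approach are assumptions, not part of the proposal or Orleans itself) of how the current quantum window might be derived before each storage query:

static (DateTime Start, DateTime End) CurrentQuantum(DateTime utcNow, TimeSpan quantum)
{
    // Round down to the start of the quantum containing utcNow.
    long startTicks = (utcNow.Ticks / quantum.Ticks) * quantum.Ticks;
    var start = new DateTime(startTicks, DateTimeKind.Utc);
    return (start, start + quantum);
}

The service would re-query storage for this window each time its timer fires, using the configured quantum length (e.g. 5 minutes or 1 hour).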

Challenges

The most significant challenge with this approach is reminder recurrence. Reminders can be registered to recur on a regular period, like this:

RegisterOrUpdateReminder("reminderName", TimeSpan.Zero, TimeSpan.FromMinutes(5));

When all reminders are loaded into memory, the 'due-time' of a reminder can be calculated based on the recurrence interval and the first time the reminder was fired. However, when we're loading reminders into memory based only on their due time, we can't calculate this recurrence interval. We also can't store recurring reminders infinitely into the future (e.g. adding a database record for every 5 minutes, ad-infinitum).

Our proposal is to introduce a 'clean-up' task which will schedule the subsequent occurrences of recurring reminders (a minimal sketch of the next-occurrence calculation follows the list below). Reminders will be marked for clean-up, and periodically a task will run which calculates the next time each reminder should run (based on its recurrence period) and explicitly schedules it in the future.
A reminder can be marked for clean-up in one of 2 ways:

  1. After being triggered
  2. When the clean-up process is executed (including on silo start-up) it will query for any past reminders, which may have been missed (for example, if the silo experiences any prolonged downtime). This will ensure that no reminders are lost.
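A minimal sketch (illustrative only; the helper name is hypothetical) of how the clean-up task might compute the next occurrence of a recurring reminder from its first due time and period:

// Returns the next occurrence of a recurring reminder strictly after utcNow,
// given the reminder's first due time and its recurrence period.
static DateTime NextOccurrence(DateTime firstDueUtc, TimeSpan period, DateTime utcNow)
{
    if (utcNow < firstDueUtc) return firstDueUtc;
    long elapsedPeriods = (utcNow - firstDueUtc).Ticks / period.Ticks;
    return firstDueUtc.AddTicks((elapsedPeriods + 1) * period.Ticks);
}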

Custom Storage Providers

As with the existing implementation, the storage provider for reminders will be configurable. A new interface, 'IQuantumReminderTable', will be introduced:

public interface IQuantumReminderTable : IReminderTable
{
    // Reads the reminders, within the given consistent-hash range, that are due in the current quantum.
    Task<ReminderTableData> ReadRowsForQuantum(uint begin, uint end, DateTime now);

    // Determines whether a reminder entry falls within the quantum containing utcNow.
    bool IsInCurrentQuantum(ReminderEntry entry, DateTime utcNow);

    // Reads reminders, within the given hash range, whose due time has already passed (used by the clean-up task).
    Task<IEnumerable<ReminderTableData>> GetPastReminders(uint begin, uint end, DateTime now);
}

These operations are required to support querying based on a specified time period, and cleaning up past reminders (as discussed above).
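For illustration, a hedged sketch of how the reminder service's periodic load might use this interface. The method name and the 'beginHash'/'endHash' parameters are assumptions standing in for the silo's consistent-hash range, and ReminderTableData.Reminders is assumed to expose the entries as in the existing table contract:

async Task LoadCurrentQuantumAsync(IQuantumReminderTable table, uint beginHash, uint endHash)
{
    DateTime now = DateTime.UtcNow;

    // Page in only the reminders for this silo's hash range that are due in the current quantum.
    ReminderTableData due = await table.ReadRowsForQuantum(beginHash, endHash, now);
    foreach (ReminderEntry entry in due.Reminders)
    {
        if (table.IsInCurrentQuantum(entry, now))
        {
            // schedule a local timer / fire the reminder tick for 'entry' (omitted)
        }
    }

    // Separately, the clean-up pass would use GetPastReminders(beginHash, endHash, now)
    // to reschedule any recurring reminders that were missed.
}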

We aim to provide storage providers reaching parity with the existing implementations in the Core Orleans project (Azure Table Storage, SQL, and a Mock implementation).

Challenges

The primary challenge we have encountered is with the Azure Table Storage provider. Specifically, it is not clear how to create a partitioning scheme which supports both querying by quantum (necessary for paging in Reminders), and querying by grain (necessary for operations like deleting reminders) in a way which does not cause a full-table scan.

It may be that we have to introduce 2 Azure tables, one which will effectively act as an index for the other. The first table could be the same as the existing Reminders table, with the second providing a mapping from time/quantum to reminder ID (which can be queried in the other table). This potentially poses some issues with keeping the tables consistent, as transactions can only be performed on entities in the same table and partition.

Thanks for reading! I'd appreciate any feedback.

@gabikliot
Contributor

I agree that the storage mapping is the hardest problem here.
So let's say, just for the sake of exploration, that we go with the two-tables approach. Let's ignore (or assume we can solve later on) the consistency problem. What would the key schemes in both tables look like? I think this direction of thinking could help us towards a solution. Let's brainstorm it.

@giglesias

Hi guys, is there an ETA for this one or any update that you can share?

@MSFT-AshleyIngram
Author

Apologies - I didn't see Gabi's response to this issue!

From the perspective of the table schemas - I'd expect one of the tables to be identical to the current reminders table.
The second table would be an 'index', with the Quantum as PK, a unique Id as Row Key (in case a reminder triggers more than once in a given quantum) and additional columns as required to look up the entry in the original reminder table. This would allow lookup based on Quantum, so those reminders could be paged in from the existing reminders table.
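For illustration, a hedged sketch (class and property names are hypothetical) of such an index entity using the classic Azure Table SDK, with the Quantum as the partition key and a unique id as the row key:

using Microsoft.WindowsAzure.Storage.Table;

public class ReminderQuantumIndexEntry : TableEntity
{
    // PartitionKey: the quantum the reminder next ticks in (e.g. a rounded timestamp).
    // RowKey:       a unique id, in case a reminder ticks more than once in a given quantum.

    // Columns needed to look up the full entry in the existing reminders table.
    public string GrainReference { get; set; }
    public string ReminderName { get; set; }
}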

An alternative implementation would be to use an Azure Queue in order to schedule the reminders (using the built-in visibility timeout mechanism).

In terms of our progress/an ETA: whilst I started implementing this, our work has been re-prioritized slightly, so I'm not currently actively working on it.

@gabikliot
Contributor

If the PK is Quantum, then in the situation where all your billion reminders have the same Quantum - 12 hours, let's say - you would put all the reminders in the index table under the same PK, which would not scale. Did I understand it right?

There might not be a large number of different quanta. We need to use a PK on something that is naturally very well partitioned, like next tick time (assuming all reminders were not created at the same time), but I don't know how to build such an index without re-keying (reshuffling the reminder after every tick).

@MSFT-AshleyIngram
Author

Yeah that's correct. The choice of quantum size is a trade-off - if you make it too big then you're paging lots of reminders into memory (effectively the issue with the current reminders system, which has a 'quantum' of infinite time). If you choose too small a quantum then no (or few) reminders will be in the same quantum, so you'll be doing lots of unnecessary queries to table storage.

Ideally, you'd probably want around 1000 reminders per quantum, as that's the max you can retrieve in 1 table storage request. The appropriate quantum size is therefore application specific.

If you set the quantum to be 1 tick (thereby partitioning by tick), it's fairly unlikely you'd have more than 1 reminder per partition. That'd mean you'd be doing 1 query per reminder, which is quite a large overhead.

@gabikliot
Contributor

OK, now I see!
We were thinking about two very different things for quantum. I thought you referred to the "reminder period", which is the parameter given to the reminder at creation time and which controls the time between ticks. It is in full application control and the system (the reminders implementation) cannot change it.

But you referred to something very different. You referred to an internal runtime tuning parameter, which controls how reminders are grouped in the table and how much is read together in one read. It is basically the "index shard/partition". I get it now.

I think the most important point that you did not mention is that in your solution the index is "dynamic" and changes based on reminder ticks (and not only when reminders are created/deleted): it needs to be updated after every reminder tick. Once a reminder ticks, it needs to be reinserted into the index table in a different place. That is, the index table is organized in chronological order of next time to tick, just rounded up into buckets/shards/partitions - the next partition to page in.

Did I now understand it correctly?
It's so hard to page back in after a 5-month delay.

@gabikliot
Contributor

@MSFT-AshleyIngram any feedback? Did I now understand it correctly?

@MSFT-AshleyIngram
Author

Yeah that seems correct.

There are effectively 2 challenges - the first is keeping the 2 tables in sync, especially given we will have to update the index after a reminder triggers/ticks in order to handle recurring reminders. The second is what to do if the reminders table needs to be re-partitioned (e.g. if the quantum size changes).

@veikkoeeva
Contributor

If I understand correctly, for OrleansRemindersTable in relational storage this essentially means calculating, using StartTime and Period, the rows that belong to a given quantum. Possibly also predefining an index to support this.

@gabikliot
Contributor

@MSFT-AshleyIngram, Why would the quantum ever change?

Also, the index table is constantly being changed - after every reminder tick this reminder is removed from the index and re-inserted into a new place. Correct? So this is an opportunity to re-partition the table, if one wants to. But why would one want to?
Essentially, the index table is a sorted list of reminders, sorted by next time to tick.

@veikkoeeva, indeed, with relational storage this will be much easier and more natural.

@sergeybykov
Contributor

I'm concerned about a design where every reminder tick would cause a write (potentially multiple writes) to the index table. That would make it not only more expensive to run, but also more brittle in case of storage outages/throttling.

The original design we discussed more than a year ago didn't require writes upon ticks by imposing certain limitations on reminder periods and accuracy.

@sergeybykov
Contributor

Here's a copy of the original proposal.

Constraints & Assumptions

In the interest of time, there are things not considered in this prototype which are assumed to be solvable in a future iteration (terms are defined later in this document):

• Caching policy for reminder bucket agents to keep more frequent reminders in memory.
• Immediate cancellation of reminders (as opposed to cancellation guarantee within a quantum's worth of time).
• Integration with existing framework (we want to think of and validate things that will change in this redesign, not things that would stay the same).

A menu of reminder periods will be supported:

• Smallest supported period is 5 minutes.
• Largest supported period is 24 hours.
• All periods must be factors of 24 hours and multiples of 5 minutes. e.g.:
o 15 minutes and 3 hours are acceptable periods.
o 7 minutes and 11 hours are not acceptable periods.
o This ensures that if a reminder is scheduled to trigger 3 times a day, it will trigger at the same times every day, which is necessary to keep the reminder sequence calculation reliable (calculated by subtracting the first tick's timestamp from the current time and dividing by the period; a minimal sketch follows this list).
• We will consider two examples of valid period in this description: 10 minutes and 10 hours.
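A minimal sketch of the sequence calculation referenced above (the helper name and signature are illustrative): the number of whole periods elapsed since the first tick.

static long ReminderSequence(DateTime firstTickUtc, DateTime utcNow, TimeSpan period)
    => (utcNow - firstTickUtc).Ticks / period.Ticks;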

Semantics & Topology

Reminders will be divided between buckets, each bucket containing reminders of a given period.

• The number of buckets is predetermined and finite because the number of periods supported is predetermined and finite.
• For the purposes of prototyping, we will say that the number of buckets is expected to not exceed 24 (24 hours in the day, not all quantities of hours supported, support for 10, 15, and 30 minute periods).
• Therefore, we consider two buckets in this document: reminders that fire every 10 minutes and reminders that fire every 10 hours.

Each bucket will be managed by a reminder bucket agent: a grain that reads upcoming reminders from the database and triggers them.

• One reminder bucket agent exists per bucket per silo.
• Reminder bucket agents use a timer that triggers periodically.
o This period of the timer is called a quantum and represents the smallest possible period that the system can support.
o We will use a quantum of 5 minutes, meaning that all reminder bucket agents will do work every 5 minutes.
o The timer's start delay is a random percentage of the quantum to ensure that all bucket agent timers are distributed as evenly as possible.
o For example, a 10-minute reminder bucket agent will be responsible for triggering reminders for all 10-minute reminders owned by the silo it resides upon.
• Each time a reminder bucket agent is triggered, it performs two tasks which may be executed in parallel:
o First, it must retrieve the reminders that will be triggered in the subsequent quantum from the database.
• These reminders will be a subset of all reminders associated with the bucket agent as defined later in this document.
• These reminders will all have the same period.
• These reminders will be kept for the next quantum.
o Second, it must trigger the set of reminders retrieved from the database in the previous quantum.
o If it takes more than the length of the quantum to complete these two tasks, the reminder bucket agent is overloaded.

Quantum Affinity Space

The quantum affinity space (QAS) is defined as the number of quanta needed to equal the length of the period:

QAS(rp, q) = rp / q

where:
• rp is the reminder period, which is the same for all reminders in a given bucket.
• q is the length of the quantum, which is the same for all reminders in the system.
e.g.:
• The QAS of a 10 minute period is 2 quanta.
• The QAS of a 10 hour period is 120 quanta.

The QAS cannot be fractional, so the quantum must be a factor of all available bucket periods.

Quantum Affinity

An additional value, called the quantum affinity, must be calculated for each reminder (a sketch of this calculation follows the definitions below).

QA(gi, rn, qas) = UD(gi ++ rn) % qas

where:

• QA is a function that calculates the quantum affinity.
• gi is a grain identity that the reminder is associated with. Currently, we use a GrainReference object to serve this purpose.
• rn is a string containing the reminder name.
• QAS is the quantum affinity space as defined earlier in this document.
• ++ is the concatenation function for string and/or binary data.
• % is the modulus function.
• UD is a uniform distribution function (uniform hash, round-robin/counter, uniform random, etc.)
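As a hedged sketch of the two formulas above (the SHA-256 hash here stands in for the uniform distribution function UD and is an assumption, not prescribed by the proposal; class and method names are illustrative):

using System;
using System.Security.Cryptography;
using System.Text;

static class QuantumAffinityCalculator
{
    // QAS(rp, q) = rp / q
    public static uint QuantumAffinitySpace(TimeSpan reminderPeriod, TimeSpan quantum)
        => (uint)(reminderPeriod.Ticks / quantum.Ticks);

    // QA(gi, rn, qas) = UD(gi ++ rn) % qas
    public static uint QuantumAffinity(string grainIdentity, string reminderName, uint qas)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(grainIdentity + reminderName));
            return BitConverter.ToUInt32(hash, 0) % qas;
        }
    }
}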

WRT Azure Table Storage

The quantum affinity will be stored in a new column to make the information accessible.
The partition key and row key need to be changed to make common queries cheap. The most common is "Give me all reminders…"

• used by a specified service id.
• owned by a specified ring range,
• from a specified bucket
• and with a specified quantum affinity.

Azure Table performs best when it doesn't have to do a full table scan. Therefore, it's best if the partition key contains information constructed in such a way that most queries do not result in range scans.

The following fields will be concatenated in order to create a partition key:

  1. the service ID (identical for all queries in the system).
  2. the bucket (1:1 relationship with reminder manager making the query)
  3. the quantum affinity (one quantum affinity is queried at a time)
  4. the consistent hash of the grain ID (different ranges are queried by different silos).

Fields 1-3 are queried by identity, not as a range. Field 4, however, is queried using a range. Placing it last in the key means that if we know fields 1, 2, and 3, we can query ranges of hash values, because the partition keys are sorted according to hash value and bucketed according to the other values. For example, the following sorted partition keys are separated by implicit buckets:

// BUCKET__HASH

Serviceid0_category0_quantum0__00000000
Serviceid0_category0_quantum0__00000001
Serviceid0_category0_quantum0__00000002

Serviceid0_category0_quantum1__00000002
Serviceid0_category0_quantum1__00000005

Serviceid0_category1_quantum0__00000000
Serviceid0_category1_quantum0__00000002

Serviceid0_category1_quantum1__00000004

Serviceid1_category0_quantum0__00000000

Within each implicit bucket (e.g. Serviceid0_category0_quantum1), consistent hashes can be queried by range (0-1, 2-5, etc.).
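A hedged sketch of assembling such a partition key from the four fields in the order described (the helper name, separators, and zero-padding are illustrative, not prescribed by the proposal):

// serviceId, bucket, quantumAffinity, grainHash correspond to fields 1-4 above.
static string BuildPartitionKey(string serviceId, string bucket, uint quantumAffinity, uint grainHash)
    => $"{serviceId}_{bucket}_quantum{quantumAffinity}__{grainHash:X8}";

// Within a fixed "{serviceId}_{bucket}_quantum{qa}__" prefix, a silo's hash range can then be
// queried as a partition-key range from prefix + lowHash to prefix + highHash, avoiding a table scan.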

Algorithm Steps

Activation

When a reminder bucket agent is activated, it:

  1. Initializes the quantum affinity index (QAI) to 0.
  2. Calculates and caches the size of the quantum affinity space (QAS) of the period serviced by the agent.
  3. Retrieves the reminders that will be triggered during the first quantum (see step 2 of the timer tick algorithm, given QAI(n+1) = 0).

Timer Tick

When a reminder agent's quantum timer ticks, it performs the following steps (a sketch of this loop follows the list):

  1. Increments the QAI using the following function:

QAI(n+1) = (QAI(n) + 1) % QAS

Where:
o QAI refers to the quantum affinity index.
o n is the quantum sequence, i.e. the nth quantum to be processed by the agent.
o QAS is the quantum affinity space.

  2. Retrieves the reminders that will be triggered in the subsequent quantum from the database (may be performed in parallel with step 3).

a. Construct the partition key that the agent is interested in by combining the grain reference descriptor, bucket, and QAI(n+1).
b. Read the entire partition from Azure Table Storage.
c. Cache the reminder entities for the next quantum.

  3. Triggers all reminders whose quantum affinity matches QAI(n) (may be performed in parallel with step 2).

a. Reminders may be triggered in parallel, if it works out in practice.
b. Deletes the current quantum's cache of reminder entities.
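A hedged sketch of this tick loop as a class fragment on the bucket agent. ReadQuantumAsync and TriggerAsync are assumed helpers (not Orleans APIs) that query the agent's hash range for a given quantum affinity and fire a single reminder, respectively:

private uint qai;                                               // quantum affinity index, QAI
private uint qas;                                               // quantum affinity space, QAS
private List<ReminderEntry> cached = new List<ReminderEntry>(); // reminders read on the previous tick

private async Task OnQuantumTimerTick()
{
    qai = (qai + 1) % qas;                                            // step 1: QAI(n+1) = (QAI(n) + 1) % QAS
    Task<List<ReminderEntry>> read = ReadQuantumAsync(qai);           // step 2: prefetch reminders for QAI(n+1)
    Task trigger = Task.WhenAll(cached.Select(r => TriggerAsync(r))); // step 3: fire reminders cached for QAI(n)
    await Task.WhenAll(read, trigger);
    cached = await read;                                              // keep the new set for the next tick
}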

Hand Waving

For simplicity, creation and deletion of reminders are handled through direct writes to the database. This is, in fact, the only time writes need to be made to the database.

We'll need to change this strategy once we wish to support either caching or cancellation guarantees sooner than one quantum's worth of time.

@MSFT-AshleyIngram
Author

I agree that writing to table storage every time we trigger a reminder is problematic. We originally thought of this to resolve the issue of recurrent reminders.

If my understanding of your proposal is correct Sergey, this would be roughly analogous to giving a reminder a label based on (say) the number of the hour in a week it occurs (from 1 to 168) and using that as a Partition Key. Assuming that a reminder only recurs once a week, it never needs to be updated (unless it is deleted/edited).

If a reminder occurred more often than once a week, it would require duplicate entries in the index (say, if you wanted it to occur at 3am every day, you'd need entries for 3, 27, 51, etc).

If a reminder occurred less often than once a week (say every fortnight) you'd be unable to represent it within this system.

If your reminder happened more frequently than once an hour, you'd be unable to represent it either (at least in table storage) as the PK and RK would have to be unique.

The 2 values (1 hour and 1 week) would therefore have to be configurable in order to ensure you can accurately represent the full range of reminders you need within your application.

Is this correct?

@sergeybykov
Contributor

First of all, this is not my proposal. It was produced by a member of the team at the time - Michael.

My interpretation of it is that it simply breaks a day into 5-minute buckets, and a reminder can be scheduled only within those buckets. That's why it says:

• Smallest supported period is 5 minutes.
• Largest supported period is 24 hours.

So a reminder that you need to do something once a week would need to be scheduled to fire every day, and then the grain may decide to do nothing 6 out of 7 times.

@MSFT-AshleyIngram
Author

I agree that this is better than our original proposal. It's much better than having to shuffle reminders around to handle recurrence.

Is there any reason you wouldn't make those 2 periods (5 mins and 24hrs) configurable, with some sensible defaults? That way, as in my example above you could change it to 1 hour and 1 week if you wanted to schedule longer distance reminders with less granularity (or whatever trade-off makes sense for that application).

This seems to me slightly better than having the grain itself handle skipping days (which seems pretty similar to the scheduling that the ReminderService handles).

@sergeybykov
Contributor

Is there any reason you wouldn't make those 2 periods (5 mins and 24hrs) configurable, with some sensible defaults? That way, as in my example above you could change it to 1 hour and 1 week if you wanted to schedule longer distance reminders with less granularity (or whatever trade-off makes sense for that application).

It's an interesting question. I think there will have to be an obvious shared starting point. For 24 hours it's easy - midnight UTC. If it's a week instead of 24 hours - midnight on Monday of some fixed year? A month is already a problem, with its unequal number of days.

But what if you say it's 4 days - when would each 4-day period start? What would be the 'anchor' point? I guess we could choose an arbitrary day in the past (1/1/2000) and start counting days from that date.

So I don't see any sensible settings other than 24 hours (or factors of it) or a week or N days. My hunch is that the 24 hours limit would be least confusing, but I see how N days could work.

@MSFT-AshleyIngram
Author

Yeah I think having the "upper-bound" in terms of a "number of days" would make most sense.

For the "lower band" I think minutes would be suitable (I think there are many use-cases for a reminder which triggers on a higher frequency than 1m).

@Eldar1205

I'd like to add a point regarding the "lower bound" of reminder periods:

Currently the reminder service supports a minimum interval of 1 minute. This should remain for backward compatibility, as part of a customizable reminder service if not the default one. I think SOLID is important here: the current implementation is suitable for a 'small' number of 'high frequency' reminders and can support intervals of less than 1 minute (its 1-minute lower bound is artificial and could be removed), so the suggestion above, which helps with scaling to millions of reminders, should in my mind be added as a separate reminder service, in open-closed fashion.

@shlomiw
Contributor

shlomiw commented Sep 25, 2017

Persistent reminders are a great tool in Orleans' belt. They can also be very useful for keeping specific grains alive across silos, without requiring a direct invocation for reactivation (in my case I need them for maintenance clean-up logic after a period of time).
This limitation should be noted in the docs.
I'm surfacing this issue again so it doesn't get lost.

@sergeybykov
Contributor

No worries, it won't get lost. It's in the backlog, and contributions are more than welcome. :-)

@SebFerraro

To add to this: we have roughly 20-30k instances of a specific type of grain that all call RegisterOrUpdateReminder during grain activation. We've just implemented this and are seeing some degraded performance across our nodes. Response times have gone up, and our database shows that the Update call on the clustered index in the reminder table is the most CPU-intensive transaction being made for a good 5 minutes after startup. I've briefly read some of the suggestions and it seems they may help our situation?

@veikkoeeva
Contributor

@NoMercy82 Just to be clear on this, are you using ADO.NET for the reminders? If so, they're almost as slow as possible currently. Adding indexes will help, but the layout and a bit of the query would need to be redone. There's a lot of room for improvement when it comes to ADO.NET.

@shlomiw
Contributor

shlomiw commented May 1, 2018

@veikkoeeva - we're also using ADO.NET for the reminders. Until it is redone, what indexes would you suggest adding? Thanks in advance.

@veikkoeeva
Contributor

veikkoeeva commented May 3, 2018

@shlomiw Apologies for the delay. I'm not sure what would be the most useful index to create without testing. If you look at https://github.com/dotnet/orleans/blob/master/src/AdoNet/Orleans.Reminders.AdoNet/SQLServer-Reminders.sql#L102, the range queries are most often used (I assume), so maybe

CREATE CLUSTERED INDEX IX_RemindersRange ON OrleansRemindersTable(GrainHash, ServiceId);

The order matters: the most discriminating column should be on the left. It might make sense to see if sorting helps (https://www.tutorialgateway.org/clustered-index-in-sql-server/) or to not make the index clustered.

You can also test these by capturing (or creating) a query, copying the table with its contents, and just running queries and checking the query plan characteristics. There is more information at https://gitter.im/dotnet/orleans?at=5ae8c72797e5506e048ca6cc.

@ifle

ifle commented May 4, 2018

@veikkoeeva I looked at the reminder SQL scripts and don't understand why the null checks on the parameters are done in the query rather than in the code. If one of the parameters is null, don't run the query:

SELECT
		GrainId,
		ReminderName,
		StartTime,
		Period,
		Version
	FROM OrleansRemindersTable
	WHERE
		ServiceId = @ServiceId AND @ServiceId IS NOT NULL
		AND ((GrainHash > @BeginHash AND @BeginHash IS NOT NULL)
		OR (GrainHash <= @EndHash AND @EndHash IS NOT NULL));

@veikkoeeva
Contributor

veikkoeeva commented May 4, 2018

@ifle Good question. Until recently everything was in one script, so the rationale was more clearly presented. If you look at https://github.com/dotnet/orleans/blob/master/src/AdoNet/Shared/SQLServer-Main.sql, you see the idea is that the database boundary is like an interface: Orleans sends parameters in a certain format along with a script to run, i.e. the names and types matter. From that perspective it is a sanity check - defensive programming that also documents the intention. It shouldn't hurt performance, it adds a bit of robustness, and, perhaps a bit opinionatedly, it avoids removing checks just because code somewhere else checks the invariant, since a later change could expose a bug with potentially serious ramifications.

Just in case you - or someone else reading this - wonder why this is like it is: it matters to be able to change the queries or the layout in deployment-specific ways, either statically or dynamically. Like filegroups or schemas, or even introducing dynamic partitioning of tables (e.g. by creating new tables on the fly and starting to use them). Additionally, not all features are available on all vendors or versions (e.g. in-memory tables with natively compiled procedures might make sense, and one might even add them to this default implementation via a flag if available, or adjust on one's own). Though this is documented elsewhere.

@shlomiw
Contributor

shlomiw commented May 6, 2018

@veikkoeeva - thank you very much for the lead. I'll keep track and monitor, and will add the indices as needed.
I believe the indexes should be part of the initial SQL script that creates the schema.

@veikkoeeva
Contributor

@shlomiw Did you add the index, by the way? If so, what was the effect?

@shlomiw
Contributor

shlomiw commented May 11, 2018

@veikkoeeva - since my relevant grains have a short lifespan, the reminders are cleaned up and the table is kept small. When we have more traffic coming in, it might grow enough to warrant adding the index.
I'm keeping track and I will let you know the impact if I do so.

Anyway, I think it's important to add all the 'hot-spot' indices to the SQL script; it can make a big performance difference.

@veikkoeeva
Contributor

@shlomiw True. Let's do that unless a bigger refactoring takes place. :)

@jason-bragg jason-bragg added the P2 label Jul 5, 2018
@claycass17

claycass17 commented Oct 30, 2019

We currently have a silo running on version 2.3.4 which uses ADO.NET reminders, and we have about 53,000 reminders which tick every minute or less (the silo is set up in a cluster of 4 silos). All worked well until we updated to Orleans 2.4.3, at which point the performance of our MariaDB server, where reminders are persisted, degraded badly. CPU and memory usage are maxing out, and latency on the grains where ReceiveReminder runs has increased from 500ms/req to roughly 20,000ms/req. We've tried several things to investigate the issue, such as deploying only the csproj files with the updated NuGet packages. We've turned on in-memory reminders instead of ADO.NET and the silo works fine. We've attempted a previous version, 2.4.2, but no luck. We've tried updating all Orleans dlls to 2.4.3 except Microsoft.Orleans.Reminders.AdoNet - still no luck. We've updated MySql.Data to the latest 8.0.18, yet we still get these performance issues. We've tried adding the index on the OrleansRemindersTable, which made the CPU go slightly down, but we still get latency on grains, which has a ripple effect of everything failing. Can you kindly shed some light on this please?

@sergeybykov
Contributor

@claycass17 This sounds unrelated to the limitation that this issue is tracking. Can you open a separate issue and share logs? Between 2.3.4 and 2.4.3, I don't see any change that could have impacted behavior of reminders. So, my hunch is the problem is somewhere else. Logs might help to find that out.

@veikkoeeva
Contributor

@claycass17 Can you share which kind of index you added?

Do you have the opportunity to switch back to version 2.3.4 to double-check that there is this dramatic difference between 2.3.4 and 2.4.2/2.4.3? Can you say whether the call frequency to the database changed with the initial version switch? Or what is the frequency of reminder calls now?

@claycass17

I can easily switch from 2.4.3 to 2.3.4, and the environment goes back to normal immediately on 2.3.4. As for query frequency, when testing locally using UseLocalhostClustering with about 1000 reminders, I get the same number of hits (29 hits) on both versions for:

SELECT
GrainId,
ReminderName,
StartTime,
Period,
Version
FROM OrleansRemindersTable
WHERE
ServiceId = 'xxx' AND 'xxx' IS NOT NULL
AND GrainHash > 31334654 AND 31334654 IS NOT NULL
AND GrainHash <= 179297290 AND 179297290 IS NOT NULL

However, on a dev environment with 53,000 reminders and using UseConsulClustering (with version 2.4.3), we get thousands of hits of the same query (with different GrainHash filters).
I will get exact logging/frequency numbers tomorrow. As soon as we revert to 2.3.4 the hits drop back to less than 100.

@veikkoeeva
Copy link
Contributor

veikkoeeva commented Oct 31, 2019

@claycass17 Sorry if I have missed this, but how do you host your deployment? E.g. Kubernetes, Linux/Windows, or something else?

How easy or difficult would it be to create a project that reproduces the problem?

Using https://github.com/dotnet/orleans/tree/master/Samples/OneBoxDeployment might be one option: you just add reminder grains, plus a test or an API function that creates a lot of grains artificially (like the one about state), so that one can observe the case under the debugger.

I'm myself unfortunately very blocked until the second week of December (one has to sleep sometimes), but my thinking here is that if this case can be replicated easily like this, maybe someone, maybe the core team, can have an easier time troubleshooting. I have a .NET Core 3 update in progress at https://github.com/veikkoeeva/orleans/tree/update-oneboxdeployment-to-core3 if it matters, but I'm not sure when I'll get that done - December at the latest, or maybe earlier if I get a few other, very minor things solved.

@claycass17

claycass17 commented Oct 31, 2019

The index we added was simply on the GrainHash (create index orleansreminderstable_GrainHash_index on orleansreminderstable (GrainHash);), knowing that ServiceId is part of a composite key. However, the issue here is not the performance of the select, but rather its frequency. We've captured the frequency of the SELECT quoted in the previous post on both versions over a 40-minute time span, and these are the results... drum roll... Orleans v2.3.4 = 2,144 hits, v2.4.3 = 251,133 hits, on approx 53,000 reminders.
Our project is hosted on four Windows Servers.
We will give OneBoxDeployment a go and keep you updated. We will also monitor logging and come up with some stats.

@claycass17

Digging deeper into this, we've noticed that the 'Added Server..' log is printed more than 13,000 times in 40 minutes. The actual log comes from AddServer in ConsistentRingProvider when it's called from SiloStatusChangeNotification, and furthermore from NotifyObservers in SiloStatusListenerManager via ProcessMembershipUpdates. Eventually, deep down, these calls end up running the Orleans query with key ReadRangeRows2Key in large numbers. Our membership is handled via Consul. What could be changing the state of our silos? This issue is visible on version 2.4.3 but not on 2.3.4.

@claycass17

Apologies for cluttering this post. A new issue has been reported here #6089
