
Improved membership and reminder tables tests #1488

Merged
merged 4 commits into from
Feb 27, 2016

Conversation

shayhatsor
Member

This is what I came up with while refactoring the SQL queries to be as similar and simple as possible.
The new code is backwards compatible with the old SQL queries.
I added some new tests; those revealed two minor bugs in the MySQL script and one minor bug in the MSSQL script. @veikkoeeva, please take a look. Also, these changes shouldn't conflict with #1461.

@gabikliot gabikliot changed the title refactored sql queries refactored sql queries Feb 22, 2016
@veikkoeeva
Contributor

The diff is too large to be viewed in the browser, so I'll be back in about 12 hours. Can you give me a quick rundown of what to look for? Say, does this affect how I should handle using storage in #1461?

@shayhatsor
Member Author

@veikkoeeva, this shouldn't affect #1461. I think the best way to approach this change is by comparing the new MySql with the current one, then comparing the new MySql with the new MsSql.
Most of the changes are in the new MsSql script:

  • Minimized the code as much as possible without losing correctness. While doing that, I found one minor bug in InsertMembership (which also existed in MySql) that throws an exception instead of returning FALSE. Not really an issue, just not 100% per the contract.
  • Removed explicit rollbacks. Instead, implicit rollbacks are used by setting XACT_ABORT ON.
  • Using @@ROWCOUNT instead of @@TRANCOUNT (see the sketch after this list).
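
For illustration, a minimal sketch of the pattern being described, using an invented temp table and etag column rather than the actual Orleans script:

SET XACT_ABORT ON;  -- any run-time error now rolls back the whole transaction

DECLARE @id INT = 1, @versionEtag INT = 0, @payload NVARCHAR(100) = N'new';

CREATE TABLE #Example (Id INT PRIMARY KEY, Version INT NOT NULL, Payload NVARCHAR(100));
INSERT INTO #Example VALUES (1, 0, N'old');

BEGIN TRANSACTION;
UPDATE #Example
SET Payload = @payload, Version = Version + 1
WHERE Id = @id AND Version = @versionEtag;
-- Returns 1 (TRUE) if the etag matched, 0 (FALSE) otherwise; no explicit
-- ROLLBACK branch is needed because XACT_ABORT undoes everything on error.
SELECT CAST(@@ROWCOUNT AS BIT);
COMMIT TRANSACTION;

DROP TABLE #Example;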

@shayhatsor
Member Author

This seems to be a good time to mention that I've noticed many tests being added to the test projects, which is great. But there is a lack of consistency: when a test is added, there's no "right" place for it. There are many good tests that only run on Azure and SQL Server, yet all tests should run on all officially supported platforms. Also, the recently added tests use a configuration different from the tests @veikkoeeva and I wrote.
In this PR, I've refactored the current tests for the membership and reminder tables. Liveness is still being tested only on Azure, AFAIK.
An effort should be made to ensure new and existing tests are added for all supported platforms, and test configurations should be aligned.

@veikkoeeva
Contributor

@shayhatsor I see the changes are:

  1. Rearranging the table and query definitions so they are closer to each other, and dropping the angle brackets from the SQL Server version. I'm OK with this; it makes the scripts easier to diff.
  2. About XACT_ABORT: it's a construct of SQL Server 2005 and newer. I didn't spot a version for SQL Server 2000 in the refactored version. If you are running tests on LocalDb, you can try setting the compatibility level (I don't remember if one can go back several versions) and see whether the queries are rejected or not. I think using XACT_ABORT is all the better, but do we still account for SQL Server 2000? By the way, I tried to look up some information about it and found this. I have contemplated setting up a test script that does some of this, or perhaps a test script to test the scripts in a DB (like this, but one could arrange such a setting with plain SQL too and return just a bool for pass/fail to the unit test project). As we're all constrained for time, I haven't seen that much value in it. It would allow quickly testing concurrent queries at very high velocity, though (think of multiple concurrent loops, something like this).
  3. About exceptions: in general it's not possible to stop them from occurring without trapping them somehow in the DB; on SQL Server 2005 and newer it is easier (the try-catch construct), but the most fatal ones can't be trapped and can leak anyhow. I think Orleans logs exceptions, considers the operation failed, and retries later in that case. The less chance for exceptions, the better, though.
  4. In the new script you write SELECT @@ROWCOUNT and rely on an implicit cast. I personally would prefer SELECT CAST(@@ROWCOUNT AS BIT); for being explicit about it. That likely saves a few bytes on the wire (not much of a saving, I know). It looks like the code still reads the transmitted value as bool.

About 3. and 4., it looks like you removed SET NOCOUNT ON;. The purpose of this is to reduce network chatter. MSDN explains:

The @@rowcount function is updated even when SET NOCOUNT is ON.
SET NOCOUNT ON prevents the sending of DONE_IN_PROC messages to the client for each statement in a stored procedure. For stored procedures that contain several statements that do not return much actual data, or for procedures that contain Transact-SQL loops, setting SET NOCOUNT to ON can provide a significant performance boost, because network traffic is greatly reduced.

I reason it would be good to still have it there, as it won't affect @@ROWCOUNT. The impact is debatable, but it shouldn't hurt. It also applies to "plain queries" (because they are calls to a stored procedure under the hood, a fact also exploited in the scripts). Some further reasoning: when I initially wrote the queries, I specifically did not want to depend on checking affected rows, since it's a rather implicit (indirect) way to see if something has actually happened, and it might be difficult to predict how the value might vary.
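
For a quick, self-contained check of that point -- NOCOUNT suppresses the chatter while leaving @@ROWCOUNT intact (standard T-SQL, runnable as-is):

SET NOCOUNT ON;  -- suppresses the DONE_IN_PROC "(n row(s) affected)" messages

CREATE TABLE #T (Id INT);
INSERT INTO #T VALUES (1), (2), (3);
SELECT @@ROWCOUNT AS RowsInserted;  -- still 3: NOCOUNT does not affect @@ROWCOUNT

DROP TABLE #T;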

It's becoming late, but I hope the comments are clear enough to make sense or to see if they are bogus.

@veikkoeeva
Contributor

@shayhatsor About tests. It has come up in Gitter on a few occasions. I believe it was @centur with whom we have briefly chatted (probably with many others too) about this. It would be nice if the tests were cleaner, refactored, and "always green" when one runs them locally. Not everyone has all the pieces, such as backing stores, but perhaps those tests could be marked as skipped with the reason that no such store could be used. Otherwise, about the tests being applicable across all stores: you are not alone with that idea. We just need someone to write down a plan and see who gets around to doing it first. :)

About selecting configuration. I think the non-functioning relational tests are the last batch migrated from VSO and used (still use?) the SQL Server specific .mdf files (basically the database files). The initial system used those too, but we got the tests migrated to deploy the actual scripts to a brand new database. Maybe in the future we could/should test backwards compatibility when doing migrations too. Although it looks like the core parts are fairly stable already, sans the ones missing.

I might as well record here what I have in mind about configuration selection and storage: check environment variables, so that if on a CI server, some configuration applicable there is used. If not, use sensible local default values. Especially with regard to the relational store, one could test if MySQL has been installed (like this) and/or try to reach it. If there's no success, do not run the MySQL tests. LocalDb is installed by default on Windows and with VS, but it's a different matter when Orleans really goes cross-platform.

Maybe we could write a set of abstractions that set up a one-box deployment, as discussed in #411, and use that in testing as well as for running Orleans as a Windows service and whatnot. It looks to me like it doesn't need to be much more complicated than this (it has some hardwired configuration etc., but as an idea: have silos and put them into a bunch to create a system).

@veikkoeeva
Contributor

@shayhatsor Ah, I might not have been clear about the refactoring. Apart from some observations (e.g. NOCOUNT, I'd personally prefer an explicit cast, SQL Server 2000...), the refactoring looks like a sensible improvement, especially considering making comparisons between the scripts. Good job.

@shayhatsor
Member Author

@veikkoeeva, first of all, thanks for your thorough review! Now, some comments about your comments:

About XACT_ABORT, it's a construct of SQL Server 2005 and newer

https://technet.microsoft.com/en-us/library/aa259192%28v=sql.80%29.aspx

I think Orleans logs exceptions and considers the operation failed and retry later in case one.

It does.

In the new script you write SELECT @@rowcount and rely on implicit cast

I don't; check out the diff of RelationalOrleansQueries.cs.

When I initially wrote the queries, I specifically did not not want to depend on checking affected rows

That's a really good point; I'll add SET NOCOUNT ON back.

@centur
Contributor

centur commented Feb 22, 2016

@veikkoeeva Umm, it wasn't me. At least reading through this issue, it's not something I remember.

@veikkoeeva
Contributor

@shayhatsor Ah, one lives, one learns about XACT_ABORT. :)

About the new InsertMembershipKey query. It looks like there is now a situation wherein the INSERT into OrleansMembershipTable can succeed but the UPDATE to OrleansMembershipVersionTable can fail (due to conditions other than @@ROWCOUNT), yet the transaction will not be rolled back, so the partial update (considering the whole two-table operation, the insert into OrleansMembershipTable) remains. It is plausible to think DeploymentId = @deploymentId is satisfied in these cases, but I'm not sure if there can be an edge case for Version = @versionEtag wherein it is not true (maybe something about starting, restarting, adding or removing silos, voting dead, or some such).
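
A simplified, hypothetical sketch of the shape of this concern (stand-in temp tables, not the actual Orleans schema): with XACT_ABORT ON, only run-time errors trigger the implicit rollback, so an UPDATE that merely matches zero rows still needs an explicit row-count guard:

SET XACT_ABORT ON;
DECLARE @versionEtag INT = 42;  -- assume this etag is stale

CREATE TABLE #Membership (SiloId INT PRIMARY KEY);
CREATE TABLE #MembershipVersion (DeploymentId INT PRIMARY KEY, Version INT NOT NULL);
INSERT INTO #MembershipVersion VALUES (1, 43);

BEGIN TRANSACTION;
INSERT INTO #Membership VALUES (7);                 -- succeeds
UPDATE #MembershipVersion
SET Version = Version + 1
WHERE DeploymentId = 1 AND Version = @versionEtag;  -- matches 0 rows: not an error
IF @@ROWCOUNT = 0
    ROLLBACK TRANSACTION;                           -- explicit guard still required
ELSE
    COMMIT TRANSACTION;

DROP TABLE #Membership, #MembershipVersion;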

@shayhatsor
Member Author

@veikkoeeva, I see your point; this will cause unnecessary contention on the table, and might just not work per the contract in certain edge cases.
The question that comes to mind is: can we remove all two-table operations?
I believe we can, by removing OrleansMembershipVersionTable altogether.
For a specific deployment, the table version is the SUM() of all the Version fields of OrleansMembershipTable for that deployment. I think this is the way to go. Am I missing something?
As long as SUM() returns a snapshot sum of the rows, we're following the Orleans Extended Membership Protocol to the letter. Even if SUM(), in a certain DB configuration, sums "dirty" rows too, we're still fine, because we're then following the Orleans Basic Membership Protocol. Also, we can make the DB sum a snapshot by locking or another snapshot-enabling mechanism if we really want to stick with the Extended protocol.
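
A minimal sketch of the idea, assuming a per-row Version column on a simplified stand-in table (not the actual schema):

CREATE TABLE #Membership (
    DeploymentId NVARCHAR(150) NOT NULL,
    SiloId INT NOT NULL,
    Version INT NOT NULL,
    PRIMARY KEY (DeploymentId, SiloId)
);
INSERT INTO #Membership VALUES (N'depl-1', 1, 3), (N'depl-1', 2, 5);

-- The table version for a deployment is the sum of its row versions;
-- any single-row update bumps the sum, so no second table is needed.
SELECT SUM(Version) AS TableVersion
FROM #Membership
WHERE DeploymentId = N'depl-1';

DROP TABLE #Membership;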

@veikkoeeva
Contributor

@shayhatsor Summing brings some trouble, or complexity, to this. The scripts have a part that says SET @snapshotSettings = N'ALTER DATABASE ' + @current + N' SET READ_COMMITTED_SNAPSHOT ON; ALTER DATABASE ' + @current + N' SET ALLOW_SNAPSHOT_ISOLATION ON;';. This is a setting I find used very often these days. I included it there because it makes a performance difference and not everyone (people with only a little experience with SQL Server) will turn it on, and to show the DB should work with it -- and if it were not set, someone surely would contemplate turning it on. This setting likely makes a lot of difference in ES and state provider scenarios too.

Basically, the problem here is that this isolation level sees a snapshot of the rows as of the beginning of a given transaction, and so might miss one inserted during it. I think this situation could be remedied by adding a READCOMMITTEDLOCK hint. Reading Read Committed Snapshot Isolation makes such a point:

When run under RCSI, this statement cannot see any committed database modifications that occur after the statement starts executing. While we will not encounter the problems of missed or multiply-encountered rows possible under the locking implementation, a concurrent transaction might add a payment that ought to prevent a customer from being sent a stern warning letter about an overdue payment after the statement above starts executing.

(as an aside, a real world process would not rely on anything like this). The follow-up article is great too.
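
Concretely, the suggested hint might look something like this (a sketch, assuming the per-row Version column discussed above; @deploymentId stands for the usual query parameter):

-- The hint forces locking read-committed semantics for this one statement
-- even when READ_COMMITTED_SNAPSHOT is ON for the database, so a row
-- committed by a concurrent writer mid-statement cannot be missed.
SELECT SUM(Version) AS TableVersion
FROM OrleansMembershipTable WITH (READCOMMITTEDLOCK)
WHERE DeploymentId = @deploymentId;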

Would this also mean the DeploymentId and Timestamp columns would be replicated on every row? If so, maybe it would make sense to do a bigger redesign of the database anyhow, as discussed elsewhere with regard to other constructs. I don't know which would be better: a lot of refactoring at once, or these little and not necessarily backwards-compatible changes little by little.

@shayhatsor
Member Author

I've read about snapshot reads; it's the default behavior for MySql and MsSql, and probably the default for all other relational DBs. So the SUM() solution seems very good to me.
About your comment:

Would this also mean the DeploymentId and Timestamp columns would be replicated on every row too?

We'll only need to add one new column to OrleansMembershipTable: a LastUpdated timestamp column. MAX() on that column for a DeploymentId gives the old OrleansMembershipVersionTable.Timestamp.
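
A sketch of how the single-table replacement might read (LastUpdated is the hypothetical new column; @deploymentId stands for the usual parameter):

SELECT SUM(Version)     AS TableVersion,  -- replaces OrleansMembershipVersionTable.Version
       MAX(LastUpdated) AS Timestamp      -- replaces OrleansMembershipVersionTable.Timestamp
FROM OrleansMembershipTable
WHERE DeploymentId = @deploymentId;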

@veikkoeeva
Contributor

@shayhatsor Where did you find that READ_COMMITTED_SNAPSHOT and ALLOW_SNAPSHOT_ISOLATION would be on by default in SQL Server? They don't seem to be on by default, for what I know and can check. The documents, such as Isolation Levels in the Database Engine, explicitly say:

Read committed (Database Engine default level)

and these snapshot isolations are by default off.
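
For reference, a standard way to check these settings on any given server (sys.databases is a built-in SQL Server catalog view):

SELECT name,
       is_read_committed_snapshot_on,   -- 0 (off) by default
       snapshot_isolation_state_desc    -- 'OFF' by default
FROM sys.databases;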

This also won't account for existing server installations (in SQL Server, an instance contains databases, and inside these databases there can be schemas, whereas in MySQL these individual databases are considered schemas, I think). It looks like for MySQL the default is REPEATABLE READ (from 14.2.2 The InnoDB Transaction Model and Locking), and at this level it looks like (I'm not sure) it behaves largely the same as SQL Server with the snapshot levels turned on, reading 14.2.2.4 Consistent Nonlocking Reads.

I don't think there will be a problem as such using the SUM if READCOMMITTEDLOCK is applied. But I quote here in full the point from the previous link:

A question of timing

INSERT dbo.OverdueInvoices
SELECT I.InvoiceNumber
FROM dbo.Invoices AS I
WHERE I.TotalDue >
(
    SELECT SUM(P.Amount)
    FROM dbo.Payments AS P
    WHERE P.InvoiceNumber = I.InvoiceNumber
);

When run under RCSI, this statement cannot see any committed database modifications that occur after the statement starts executing. While we will not encounter the problems of missed or multiply-encountered rows possible under the locking implementation, a concurrent transaction might add a payment that ought to prevent a customer from being sent a stern warning letter about an overdue payment after the statement above starts executing.

You can probably think of many other potential problems that might occur in this scenario, or in others that are conceptually similar. The longer the statement runs for, the more out-of-date its view of the database becomes, and the greater the scope for possibly-unintended consequences.

Of course, there are plenty of mitigating factors in this specific example. The behaviour might well be seen as perfectly acceptable. After all, sending a reminder letter because a payment arrived a few seconds too late is an easily defended action. The principle remains however.

This issue of out-of-date data applies to all RCSI statements in principle, no matter how quickly they might complete. How ever small the time window is, there is always a chance that a concurrent operation might modify the data set we are working with, without us being aware of that change. Let us look again at one of the simple examples we used before when exploring the behaviour of locking read committed:

The way I believe this works is that under these settings the SUM won't see newly added rows (silos) if one is being added concurrently -- when the table is read while a new silo is inserted. I don't know if there is a situation that matters wherein Orleans queries for the silo data while one is also being added at much the same moment and there's "something off" with the version number. Though applying a more limiting lock for this particular operation seems to remove these concerns at the cost of some performance. Unless there will be hundreds or thousands of these operations per second on this table, I think the performance shouldn't matter.

Does this make sense?

Another question could be whether we should do more refactoring and introduce these changes all at once. On the other hand, the most harm I see is that one needs to restart silos (since silo data isn't backwards compatible without separate user action) when updating the relational store.

@shayhatsor
Member Author

@veikkoeeva, I see what you mean. I wrongly used the term "snapshot" where I should have used "read committed", which is enough for our purposes. When a silo reads from the membership table, it can read an "old" state that was changed during the read, just not dirty reads. Also, the data must be eventually consistent.
When a silo reads from the membership table, each silo state entry comes with two versions:

  • Row version - the number of changes made to the silo state entry
  • Table version - the number of changes made to the whole table

Each silo holds a local cache of the table. Each silo state entry in the cache must have the following property:
The tuple (Row Version, Table Version) of the cached silo state entry was at one time in the past "true" for that silo in the DB.
To summarize, we need to provide:
(1) Eventual consistency of the reads from the table - "read committed" provides that
(2) Optimistic concurrency mechanism enforced at a row and table level - Version and SUM(Version) provide that, given (1); see the sketch below
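
A hypothetical sketch of how (2) could look in a single statement (column names invented for illustration, not the actual script):

UPDATE OrleansMembershipTable
SET Payload = @payload,
    Version = Version + 1
WHERE DeploymentId = @deploymentId
  AND SiloId = @siloId
  AND Version = @rowVersionEtag                          -- row-level optimistic check
  AND @tableVersionEtag = (SELECT SUM(Version)
                           FROM OrleansMembershipTable
                           WHERE DeploymentId = @deploymentId);  -- table-level check

SELECT CAST(@@ROWCOUNT AS BIT);  -- FALSE when either etag is stale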

I am working on a solution that wouldn't require any migration script.

@veikkoeeva
Contributor

@shayhatsor No problem here. I think we're on the same page provided READCOMMITTEDLOCK will be used.

It occurred to me that there's already one way of doing it, and one option would be to leave it as-is, then do a separate PR that refactors many more of the issues, such as statistics and maybe other structures, such as suspecting*. I'm not sure what comes out of the geo-distribution PR (#948) either, and whether it'd require "difficult changes".

@veikkoeeva
Contributor

It would be useful to know if there is already critical usage of the relational backend, and whether it's important to keep the data or OK to wipe it and restart silos. As there isn't yet stream or state provider data on relational storage, it looks plausible that migrating silo data isn't that critical. What would @sergeybykov say? Whom should we ask? :)

@sergeybykov
Contributor

@veikkoeeva My impression is that the Gigya folks were the first who were serious about using a SQL-based system store. There could be other users that never surfaced and are running Orleans in "stealth" mode.

So far we've erred on the "move fast" side, not making backward compatibility between releases a requirement. I think we can flag this in the release notes.

@shayhatsor
Member Author

@sergeybykov, from Gigya's perspective, making breaking changes in the relational implementation is currently OK. Current systems that are in production use reminders only as a backup.
About this PR: I'll re-push only the tests and open a new PR for the relational storage changes.
About this PR, I'll re-push only the tests and open a new PR for the relational storage changes.

@sergeybykov
Contributor

Good. Is this ready to be merged then?

@shayhatsor shayhatsor closed this Feb 27, 2016
@shayhatsor shayhatsor reopened this Feb 27, 2016
@shayhatsor shayhatsor changed the title refactored sql queries Improved membership and reminder tables tests Feb 27, 2016
@shayhatsor
Member Author

I've added many test cases not covered previously and refactored the base test classes. I also added the same tests for Azure, in order to have a good base case. This will help with refactoring and simplifying relational storage, including the SQL scripts. @sergeybykov, this PR only contains refactored and improved tests; it is ready for review and merge.

sergeybykov added a commit that referenced this pull request Feb 27, 2016
Improved membership and reminder tables tests
@sergeybykov sergeybykov merged commit f9eef93 into dotnet:master Feb 27, 2016
@sergeybykov
Contributor

Thank you, @shayhatsor! The tests look much cleaner and are much better organized now.
