sql: remove interleaved tables #52009
Comments
This isn't quite true. Even if writes aren't coalesced into a single consensus round or even a 1PC transaction, there is still a large benefit to having all of a transaction's intents on the same range. In that case, intent resolution is heavily optimized. The KV layer is able to resolve all of a transaction's intents and clean up its transaction record in a single consensus round. This reduces the amount of load on the system and reduces the transaction's contention footprint.
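As a concrete illustration (schema hypothetical, a sketch rather than anything from the issue): a transaction that writes a parent row and an interleaved child row puts both of its intents in the same range, so the commit-time cleanup described above is a single consensus round:

```sql
-- Hypothetical parent/child pair; with `orders` interleaved into `customers`,
-- both intents below land in the range holding customer 1, so the KV layer
-- can resolve them and clean up the transaction record together.
BEGIN;
INSERT INTO customers (id, name) VALUES (1, 'alice');
INSERT INTO orders (customer, id, total) VALUES (1, 7, 99.99);
COMMIT;
```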
This is a great point. My counterpoint leans more on the future and how changes on the horizon may mitigate the importance of this optimization. Today it is quite important that we issue intent resolution rapidly after transactions commit in order to reduce their contention footprint. One proposal for reducing the contention footprint is the idea that we could separate the in-memory process of intent resolution from the durable process of intent resolution. In that world we could imagine waiting much longer to build larger batches of intents for resolution, mitigating the overhead of individual command evaluation and replication. The above point still does depend on RPCs on the order of the number of ranges involved, but we might, in that world, under relatively high load (the case where this matters), be able to issue even fewer intent resolution batches.
I'd be curious to see some join speed numbers for different join operators for interleaved tables vs non-interleaved tables with non-collocated leaseholders.
My reading here is that this only works if you don't split.
This seems like a requirement before moving forward here.
@awoods187 have you gotten any fresh insights on user perspectives here?
Two users are fine with removing this provided we give them enough time, we show them a path off of it, and performance doesn't suffer. I have one big remaining user to talk to; I'm waiting on time to be scheduled.
Updating the issue--I'd like to see a series of performance investigations that seek to disprove our hypothesis that the same performance can be achieved without interleaving before we remove them. We need to take into account active workloads, inserts, deletes, etc.
Closing the loop, we've benchmarked interleaved tables and found that they indeed provide only negligible performance improvements. We'll be moving forward with the plan to remove interleaved tables in the coming releases. |
Fixes: cockroachdb#52009. Previously, interleaved tables were only deprecated, so users were still allowed to create them after flipping an internal setting. This was inadequate because we want to completely drop support for this feature; as part of the next release we will be disabling support for them entirely. To address this, this patch blocks upgrade to the next version if interleaved tables exist. Release note: Upgrade to the next version will be blocked if interleaved tables/indexes exist. Users should convert existing interleaved tables/indexes to non-interleaved ones or drop any interleaved tables/indexes before upgrading to the next version.
Fixes: cockroachdb#52009. Previously, interleaved tables were only deprecated, so users were still allowed to create them after flipping an internal setting. This was inadequate because we want to completely drop support for this feature; as part of the next release we will be disabling support for them entirely. To address this, this patch blocks upgrade to the next version if interleaved tables exist. Additionally, test cases for interleaved table/index functionality are removed. Release note: Upgrade to the next version will be blocked if interleaved tables/indexes exist. Users should convert existing interleaved tables/indexes to non-interleaved ones or drop any interleaved tables/indexes before upgrading to the next version.
67668: sql: remove unused field in txnKVFetcher r=andreimatei a=andreimatei

Release note: None

68074: sql: add version gate to stop upgrade if interleaved tables exist r=fqazi a=fqazi

Fixes: #52009. Previously, interleaved tables were only deprecated, so users were still allowed to create them after flipping an internal setting. This was inadequate because we want to completely drop support for this feature; as part of the next release we will be disabling support for them entirely. To address this, this patch blocks upgrade to the next version if interleaved tables exist.

Release note: Upgrade to the next version will be blocked if interleaved tables/indexes exist. Users should convert existing interleaved tables/indexes to non-interleaved ones or drop any interleaved tables/indexes before upgrading to the next version.

68391: sql: add deprecation path for pg incompatible database privileges r=RichardJCai a=RichardJCai

Release note (sql change): Granting SELECT, UPDATE, INSERT, DELETE on databases is being deprecated. The syntax is still supported; however, it is automatically converted to the equivalent ALTER DEFAULT PRIVILEGES FOR ALL ROLES command. The user is given a notice that the privilege is incompatible and is automatically being converted to an ALTER DEFAULT PRIVILEGES FOR ALL ROLES command. Example:

> GRANT SELECT, UPDATE ON DATABASE defaultdb TO demo;
NOTICE: granting SELECT, UPDATE is going to be deprecated in 22.1, please use ALTER DEFAULT PRIVILEGES FOR ALL ROLES instead in the future
the incompatible privileges were not granted; "ALTER DEFAULT PRIVILEGES FOR ALL ROLES GRANT SELECT, UPDATE ON TABLES TO demo" was automatically executed instead
GRANT

Resolves #67378. Refer only to the last commit: sql: add deprecation path for pg incompatible database privileges.

68909: sql,jobs,ccl: add ability to specify a REASON string with PAUSE JOB r=rhu713 a=rhu713

Add the ability to specify a REASON string with PAUSE JOB, and persist this string in the job's payload. See individual commits.

69006: sql: add telemetry for ALTER DATABASE PLACEMENT r=arulajmani a=pawalt

This PR adds telemetry for ALTER DATABASE PLACEMENT operations. Release note: None. Resolves #68597.

69017: ci: add ci bazel jobs for maybe_stress{,race} r=rail a=rickystewart

* Rip benchmark support out of `github-pull-request-make` (it's not currently used).
* Add support for Bazel builds in `github-pull-request-make`.
* Add scripts to invoke `github-pull-request-make` appropriately.

Closes #68388. Closes #68389. Release note: None

69095: sql: skip long test under short r=andreimatei a=andreimatei

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: richardjcai <caioftherichard@gmail.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: Peyton Walters <peyton.walters@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
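For anyone hitting the new version gate: a minimal sketch of checking for remaining interleaved indexes before upgrading, assuming the `crdb_internal.interleaved` introspection table that shipped alongside the deprecation (if your version lacks it, inspect `SHOW CREATE TABLE` output instead):

```sql
-- List any remaining interleaved tables/indexes; the upgrade gate will
-- refuse to finalize the version bump while this returns rows.
SELECT * FROM crdb_internal.interleaved;
```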
Respectfully, I am very disappointed this feature was removed. It's probably going to be a deal-breaker for me using cockroachdb for my use case. I suppose it's good this happened before implementation. The actual problems listed are not inherent design flaws; they're implementation issues. Meanwhile, the benchmarks in issue #53455 are far too small to test anything useful. 10M rows is, to put it mildly, not a large number. 1B rows, on a normal Postgres instance or what have you, is still not a large number.

Where this issue is going to be a serious problem is when you have FK joins across large tables. Interleaving effectively removed the FK join, because consistency could be guaranteed within a key range like a NoSQL database, while keeping the niceties of SQL typing. Removal of this feature makes CockroachDB an entirely different database from what was advertised. Interleaving allowed data integrity and schemas to be colocated in data space. That this was not the case (of which I'm still not convinced, because the benchmark is 3 orders of magnitude too small to be relevant) is unfortunate. Features like this one are precisely the reason one would use a reliable distributed database like CockroachDB.

If we are being honest, Postgres is far more performant than CockroachDB, but it has issues scaling once you go > 1TB. Even there, solutions like Aurora will let you scale out, and they still have the FK join problem because the low-level storage model can't colocate tuples across tables. On large node counts, the FK join becomes an even larger problem, and I don't see that addressed. The Google Doc is locked, so I am unable to see the results if that was tested. To summarize:
Let's work together to digest your concerns. There are a few things here that I don't really understand. I hope I can convince you that the thing that was removed doesn't prevent cockroach from providing the wonderful things you seem to be looking for, and also that it wasn't really adding much.
Since this issue was filed, we've worked with a number of customers who had used this feature to de-interleave. None have encountered any performance or scalability problems. Most saw improvements. All that being said, of course access locality is a good thing and of course improving it would be nice. The feature was removed largely because its implementation was very heavy for both customers and implementers while providing not a lot of benefit. There were multiple problems with the interleaving implementation:
So, my pitch is that if we're going to do something magical to co-locate data accessed together, then we should do it in a way that transparently benefits lots of use cases and doesn't infect tons of layers with its concerns. The design I have in mind presently for that is #65726.
I appreciate the response. I will explain, and I hope I am wrong.
Specifically, let's say you have a few tables. Let's say each of those tables is 1TB. For my purposes, this would be at least an order of magnitude larger. But let's start there. It would also be more than a dozen typed tables with joins, not just 3. From my understanding, if I want to have FK joins on those 3 tables, it has to take a lock, go fetch results from 3 different ranges, and then combine them. Notably, these 3 different 256MB ranges (or index ranges) do not need to be on the same server. With interleaving, they clearly are (absent the exception where the interleaved rows exist over a range boundary). With the proposed changes, if I have 12 joins, I have 12 hops. With interleaving, at least as I understand it, if I wanted to hydrate all data over the "ancestor" PK the foreign keys are all joined on, which is

This gets into my comment about "data integrity" being compromised. It's not that cockroachdb is compromised; it's that if I want to do anything sophisticated to speed up that 12-hop scenario over this tree of data, I have to do my own error-prone fetching. I can't have FKs in practice because they're not colocated. If I want it to be consistent, I have to use my own locks and block on write. Maybe here, just doing a
I have my own experience, but I can explain why this is problematic and why almost no one uses FKs at scale. Which is why it's so neat that CockroachDB does! Aurora works at the physical storage level of disk. It is not a fully logically sharded database like Citus or Vitess. If I have a table
I agree with this. For my use case, when I tested interleaved columns with cockroachdb and a real-world schema/problem over a year ago, it was very nice for reads though. It was much nicer than what would be required with something like Dynamo. I'm going to explore what the performance around this is, but I think my intuition on FK scalability here is still correct. The more tables you have in FKs without interleaved storage, the more locks you're going to need around ranges. Without telling the storage to colocate tuples, there's no guarantee that some of those locks won't be terribly inefficient. If I have varying cardinalities among some of these ranges sharing PK prefixes, my intuition is that the low-cardinality items' ranges and indexes are going to hold up queries for everyone else. I will look at your proposal. 👍🏻
I mostly don't see how scale is all that relevant. I see two ways it might come up.
Both of these are legitimate concerns. However, the old interleaved implementation would also be issuing the same 12 requests. It does have the advantage that they'll all be going to the same node and reading off the same disks, so their likelihoods of being slow are likely to be correlated.
Just to re-iterate, it's not 12 sequential hops, it's 12 concurrent requests.
Reads in cockroach are fully optimistic (no locking required). In the case where you're equi-joining these three tables on a parameter, the database will scan all three in parallel. If it was not an equi-join on a query parameter, then the interleave would not have helped you.
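A sketch of the kind of equi-join being described here, with hypothetical table and column names; because every scan is constrained by the same query parameter, the three table scans can be issued concurrently:

```sql
-- All three scans are keyed on the same customer id ($1), so the execution
-- engine can fetch from customers, orders, and items in parallel.
SELECT *
FROM customers AS c
JOIN orders    AS o ON o.customer_id = c.id
JOIN items     AS i ON i.customer_id = c.id
WHERE c.id = $1;
```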
Both of your assumptions are true for my problem. In my experience, this tail latency is highly non-linear. I am aware it's 12 concurrent requests, but the fanout element is important. I'll break down some real numbers on this problem, which is ecommerce. In ecommerce, it is common for merchants to have 10s of thousands to millions of products. For these products, it is common to have 100-500 "variants", and to sell them in 3-10 stores or channels. This means on a single
So this, I suppose, is where my confusion lies. I do not understand the CockroachDB data model well enough, but it seems to me that the join part of interleaved tables should almost always be on the same server, because it's almost always within the same range. This would require only a lock on a single server, concurrent index scans on that server, and returning the results. Without interleaving, it seems highly unlikely the ranges for the joins are on the same server, which would require a lock that's open for ~3 orders of magnitude longer over the network. Furthermore, it doesn't seem likely that the lock efficiency (edit: or fanout efficiency, even with optimistic locking) is linear w.r.t. either (1) table join count or (2) server count. The following scenarios:

(a) 3 table joins on 3 nodes

Scenario (d) has ~16x as much concurrent fanout during the period of the locks, and 4x as much time spent in network locks. This is why I am skeptical that the data size for the benchmark tests is representative enough to surface the contention and blocking of ranges in a real production system. (I have seen a real production system, costing millions of dollars, fail on this problem at only about ~3x fanout and only about 1TB of data.)
This is the most precise way of describing why I believe the benchmark is flawed. Query time can't be independent of data size or join count, or even linear. If I have 8 cores, I cannot dispatch 12 concurrent requests. Whether it's just because of cardinality differences across ranges, or the number of requests that need to be concurrent with limited core count, or the way increasing node count increases fragmentation of (definitionally non-independent!!!) joins across servers, this is at least a quadratic problem.
Can you teach me more about what you mean when you talk about "network locks"?
Can you lay out for me a bit more concretely what becomes quadratic? Anyway, on the whole, I do agree with you that if we collocated the data for the overlapping portions of the indexes, then we could schedule the join processor by the data, and that would be good. Collocation is good. We would like to do something about it, eventually. The more data we can produce about the lack of locality being a problem, the more inclined we'd be to do something about it.
Let me just start by thanking you for that. What follows is not me trying to be disagreeable, but describing a problem I see with this and why interleaved tables seemed like Santa Claus to me. Today I found out 🎄 was not happening.
"locks" the way I'm using it is abused and imprecise. I understand the optimistic locking of cockroachdb to a degree. Most of these problems don't have to do with worrying about something like pessimistic 2PC over the network. The easiest way to summarize it though would be that under modest load, you're (1) going to get serialized behavior, (2) that serialized behavior becomes much worse with join count, and (3) it becomes far worse even than that with node count because of both network latency and needing to ultimately ensure write consistency. If I have 2 user requests a second, 4 joins per query, and 4 cores on a server, during the duration of the request I already have 8 concurrent reads (whether dispatched sync or async) and we're down to 50% or less speed of what's theoretically possible. Namely, it seems like on an interleaved table, for range scans or something like it, a query planner could decide to forgo concurrency on the local range entirely in lieu of scanning the parent PK alone. When you have 10K request/s and 32-128 nodes or something, this is material. Amdahl's law will make parallelization assumptions wrong in the real world. Worse, with non-interleaved tables distributed across the network, even optimistic locking is going to have issues with modest write loads. To completely make up a term, the "intersection" of possible write conflicts is The way I normally get around this problem with relational databases is partitioning, but most of them are still not good with partition-wise joins. Citus was nice because you get shard locality and partition-wise joins, so your "intersection" size is divided by shard count and then divided by partition count. My understanding of what CockroachDB's interleaving was is that it's even more local than that: a join within a single record. It seemed to be Dynamo but with schemas and no locking downsides. Dynamo, having been developed for the use case of ecommerce I mentioned, was the pattern I thought this emulated the best. |
I just don't think this is true unless you're operating under overload. There are some assumptions here about lack of parallelism that I don't buy. I agree with you that if you have more load incident on the system than CPU resources available in the system, the lack of locality will lead to outsized effects on latency because of the likelihood that you encounter these resource bottlenecks. However, that's not a common operating mode of cockroach, and, to deal with it, we're investing in admission control. As you back away from perfect CPU saturation, the picture improves dramatically. Nobody should be running cockroach at sustained, full CPU saturation.
This would be, more or less, true, if it were the case that all of the work to be done to service these requests involved on-core CPU time. However, that's not at all the case. Some of the work involved relates to network IO and some relates to disk IO. In both of those cases, the way cockroach is written, we'll achieve parallelism.
I think this one is the point you raise that I'm most interested in. Read-write conflicts are not a probabilistic thing; they happen because of workload behavior. The likelihood of a given statement hitting a write lock is roughly exactly the same whether the table is interleaved or not. There's a good point in here, but I'm not sure it's as dire as you're making it out to be. I think what you're saying is that if there's a write-read conflict, when the reader discovers the conflict, work needs to be done to resolve it (generally by waiting for the writer). What you're pointing at is that if all of the writer's locks are in one place and the reader is reading in just one place, then there's just one request to deal with that conflict. Now, imagine that our writer has laid down

However, if every reader is encountering locks (the implication of

Also, this problem is solved by colocating overlapping index data into a single replicated state machine, as proposed in #65726. Let me pose my response differently: what benchmark most worries you? A 12-way equi-join with what data sizes for each table and what sort of read-write mix? Maybe we can assuage your fears best with a benchmark.
My most recent comments were to give my experiences and a use case in the event it is helpful to you. I was a software engineer who had to maintain code, and I feel your pain on needing to remove something if it isn't working. I wish you the best with efforts on the data colocation front, and I will still be using cockroachdb for my project. The use of colocated tables with something like Dynamo query patterns + schemas is one I am not aware of being done with any quality. If it could be done, I would be very excited. With respect to benchmarks though, I just don't think I've seen a benchmark yet (not just here) that tests joins in a way that would give me confidence. CockroachDB is a beast on the TPC-C benchmark with point lookups/inserts, and it does have some joins, but it's not the complexity that would convince me naive joins are theoretically close to what is achievable with an interleaved table.
This is the part that confuses me as well, so perhaps it's worth me explaining in more detail. Whether or not the probability is the same today, I don't believe it should be the same if interleaving is done as I have seen it done before. The interleaving docs show something that hints at a logical storage model very similar to leveldb. If we're looking at the interleaved tables, and we assume all the data for a "customer" (let's just use an example) is 100MB, it almost certainly is all in that one range on one server. Assuming a worst-case "lock" of the whole key range, it's still theoretically

I don't see how this is possible at all if there are joins. In my real 12-way-join scenario, I'm going to need to consult up to 12 ranges and do coordination. If I have an

To put it in a more fundamental language, joins are objectively "complex" in the Rich Hickey connected vs unconnected sort of way. They cannot be "independent" mathematically. When you interleave, they can be, because you make the data under the PK prefix "simple" and completely owned. Instead of set intersection (<-- complex), you have what amount to fields or properties. If CockroachDB has join costs w.r.t. data size, number of servers, and number of ranges that are the same as a range scan over a single key prefix, it would have accomplished a feat no other database has ever managed, even ones with no network coordination. Postgres is the best I am aware of by far, and here are benchmarks for it. MySQL is quadratic and quite bad, and here are some examples for it. As an anecdote, last year I had to work on a single, relatively simple 3-way join query on MySQL where each table was roughly 1B rows. That query went from 1 hour to over 12 hours with around a 50% increase in table size, and it was executed on a beastly 128-core machine.
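For readers unfamiliar with the feature under discussion, a sketch of the old syntax and the key layout it produced, loosely following the pre-removal docs (simplified; the customers/orders schema is hypothetical):

```sql
CREATE TABLE customers (
    id   INT PRIMARY KEY,
    name STRING
);

-- The child's primary key had to start with the parent's primary key.
-- Interleaving made child rows sort directly under the parent's key prefix,
-- roughly: /customers/1, /customers/1/#/orders/7, /customers/2, ...
-- so a customer and its orders landed in the same range.
CREATE TABLE orders (
    customer INT,
    id       INT,
    total    DECIMAL(20, 5),
    PRIMARY KEY (customer, id),
    CONSTRAINT fk_customer FOREIGN KEY (customer) REFERENCES customers (id)
) INTERLEAVE IN PARENT customers (customer);
```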
On this we'll just have to agree to disagree minus a benchmark. I'm aware of how IO-bound vs CPU-bound workloads work, but there are only so many threads that can be executing at once, and only so many IO streams as well. My example was just to say that it's literally impossible to have 12-way parallelism on an 8-core machine. The CPU cannot do it. I agree you can send a lot of work to I/O to come back later, but even there, naive parallelism is going to saturate you quickly. Back to the interleaved example though: an interleaved range scan requires no parallelism at all. The reason it requires no parallelism is that it has no objective complexity and it is not a join (or should not be, in a sufficient implementation). This was the same reason for Dynamo's flat keyspace and single-table query design, although they made different tradeoffs w.r.t. consistency.
Do you agree or disagree that we would not want to interleave
I agree with this statement. My above statement where I put

I'll give a different example where we could theoretically interleave products even if we didn't:

I don't want to do that. I like relational databases and want to use relational modeling wherever it makes sense. But this is where interleaving provides real power. It creates branch cuts in your relational graph to simplify it. For me, all of

Much like I would like to isolate accounts/tenants via traditional sharding to a certain database via
FWIW, I awkwardly had this exact same experience, despite the feature having already been long removed by the time I even learned of its existence: I was trying to evaluate whether I should bother with CockroachDB, and in the process read an article (one about online index updates) that mentioned this feature in passing. As I had been spending the day carefully planning out a manual data layout for my product in a more traditional key-value store, I immediately "got it" and I was like "OMG I AM SOLD: THESE PEOPLE THOUGHT OF EVERYTHING". And then I put it on my todo list to figure out how to use the feature, and thankfully noticed it wasn't there anymore only a mere few (excited) hours later.
For me, it comes down to this, FWIW: I have a large amount of data and I don't want to pay to keep almost any of it in RAM constantly. I thereby want to be able to ensure that the data "inside" of my objects (ones which I don't need to do range queries on, of course) is stored in the same SST as the parent, allowing me to do a single disk I/O to read everything I need for my display. This was the kind of careful balancing act I was doing with my manual data layout, why I was concerned that CockroachDB's SQL approach was going to abstract me away from being able to solve it in an efficient manner, why I was so excited to see that the product had considered this use case and provided a syntax for it, and why I am so disappointed to see the feature has been removed :(. I thereby agree that unless your test is being done under an actual load situation--which you seem to be calling "overload" but I am going to merely call "cost effective" (as, if you make the whole thing more expensive to operate by requiring more RAM, or slower to use with more I/O, then whatever level of over-capacity one needs will scale up together)--it isn't a valid test of the performance.
I was really feeling this also <3. :( (FWIW, my memory is that Cassandra actually has something similar to this with their "super columns" feature. I know they have been downplaying it a lot since the move to CQL, but I learned it back in the early days, and if you know the join is going to happen and the data you are interleaving is small, they served a purpose.) (I added some thoughts in #53455, btw, as I am curious about the discrepancy with prior benchmarks that showed performance benefits.)
Interleaved tables are a mixed bag. Their theoretical pros are:
In reality, they have a large cons list:
We should remove interleaved tables altogether from CockroachDB, and that's what this issue tracks.
Primary Key Changes gives us an offramp for people who use interleaved tables. The procedure is that you should alter the primary key of all of the child interleaved tables to no longer be interleaved, one by one, until there are no interleaved children left.
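A minimal sketch of that offramp, reusing the hypothetical customers/orders schema from the earlier block: re-declaring the same primary key without an INTERLEAVE IN PARENT clause rewrites the child as an ordinary table.

```sql
-- Rewrites `orders` as a non-interleaved table; the key columns themselves
-- are unchanged, only the interleaved storage layout is dropped.
ALTER TABLE orders ALTER PRIMARY KEY USING COLUMNS (customer, id);
```

Repeat for each interleaved child, leaf-first, until no interleaved children remain.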
Here's a proposal, using input from the schema team and others:
Epic: CRDB-1582