perf: scaling cockroach clusters while running under cayley, slows performance #17108
@aselus-hub The slight dip in performance when expanding the cluster from 1 to 3 nodes is expected: you've gone from 1 replica to 3 replicas and there is a small amount of overhead in doing so. That performance dropped further when you expanded to 5 nodes is very surprising. I don't have any explanation for that other than the statement that "it shouldn't happen". Can you provide reproduction instructions? @tschottdorf Any thoughts about the `ReadWithinUncertaintyIntervalError`s?
@petermattis The only reproduction instructions I can currently give are the ones above: generate a graph, split it out into nquads, and write those quads in batches through cayley into cdb (example quads as above). I'm going to write an example app that feeds it, but that might take a little bit (as I can't reuse the work code that caused the problem originally). I will link the GitHub repo once the example app is up.
@petermattis the first data point to look at for …
@petermattis Here are the numbers I got using this to test (which follows the pattern; both local and remote numbers), based on cdb nodes on AWS i2.xlarges (15 GiB SQL memory). From AWS, localized (5 runs each):
3 cdb nodes with lb
5 cdb nodes with lb
<10% CPU used on node.
@aselus-hub Thanks for putting that example together. Will take a look tomorrow.
@aselus-hub Wanted to let you know this is still on my radar. Apologies for not being able to take a look at it sooner. Other work has interfered.
@aselus-hub Finally taking a look at this. I'm not sure what the cause of the slowdown moving to 5 nodes is yet. But examining the transactions I noticed that they are executing a lot of serial operations. For example:
That's 5 separate INSERT statements. Due to its distributed nature, Cockroach operations have higher latencies. You'd be better off structuring this as 2 INSERTS:
I'm not seeing anything in https://github.com/cayleygraph/cayley/blob/d9a72b0288ed17c0601adbc92eb7cb79e5687729/graph/sql/cockroach.go that would prohibit this. For an additional optimization, you can use the `RETURNING NOTHING` clause:
If cockroach sees a query like that, it will run the inserts in parallel.

Now, time for me to look at the 3 node vs 5 node performance.
On a GCE test cluster I see:
I think the fall-off in performance between 3 and 5 nodes is because of the relatively small amount of data in the test. That small amount of data resides in a single range, and with 3 nodes there is a 1/3 chance of the leaseholder for that range being local on the node that receives the queries. With 5 nodes there is a 1/5 chance of the leaseholder being local. If the data set were larger, so that it spanned multiple ranges, I would expect performance to increase with additional nodes. @tschottdorf I can easily reproduce the `ReadWithinUncertaintyIntervalError`s.
@tschottdorf I can reproduce the `ReadWithinUncertaintyIntervalError`s as well.
What's the statement that does the read? I'm seeing mostly inserts up here. One way in which this error can still pop up with perfectly synchronized clocks is simply when the "beams cross": two transactions start at nearly the same time, and the slower one then reads a key the faster one already wrote at a slightly higher timestamp, which falls inside the slower transaction's uncertainty interval.
Inserts do a conditional put. Isn't that a read? The "beams cross" scenario requires contention, right? I'm not quite understanding what this test app and cayley are doing (yet).
@petermattis I will try that optimization and add no return, though sadly Go's cdb interface doesn't support multi-value inserts to my knowledge, so I'll build a string for it. Unless there's some other way to do that in cockroach? In the real dataset example we were inserting about 500k nquads with the same sort of scaling results. If the target is to get that up to at least 3000 inserts/second, for example, are there any steps we can take? Or is that kind of scaling not possible with cockroach and this data model?
Huh, the errors seem to be occurring when the same row is inserted more than once.
Yes, it would require the first (but slower) txn to read something the second one wrote.
Yes, you'll need to build a string, though you can still use placeholders.
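A minimal sketch of what "build a string, but keep the placeholders" could look like. The table and column names are placeholders for illustration, not cayley's actual schema; the generated statement is then passed to `db.Exec` along with the flattened argument slice:

```go
package main

import (
	"fmt"
	"strings"
)

// buildMultiInsert builds a single multi-row INSERT statement with
// numbered placeholders, e.g. for 2 rows of 2 columns:
//   INSERT INTO nodes (hash, value) VALUES ($1, $2), ($3, $4)
func buildMultiInsert(table string, cols []string, rows int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "INSERT INTO %s (%s) VALUES ", table, strings.Join(cols, ", "))
	n := 1 // placeholders are numbered across all rows
	for r := 0; r < rows; r++ {
		if r > 0 {
			b.WriteString(", ")
		}
		ph := make([]string, len(cols))
		for c := range cols {
			ph[c] = fmt.Sprintf("$%d", n)
			n++
		}
		fmt.Fprintf(&b, "(%s)", strings.Join(ph, ", "))
	}
	return b.String()
}

func main() {
	fmt.Println(buildMultiInsert("nodes", []string{"hash", "value"}, 2))
}
```

The values themselves never touch the SQL text, so this keeps the injection-safety of placeholders while sending one statement instead of many.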
Yes, that scaling is possible, especially if you batch sufficiently.
I think this may be what is happening. The errors are all occurring on INSERTs for which there appears to be an existing row.
Ok, that seems worth fixing then. Just to be sure, you're also seeing these with old-fashioned (non-1PC) transactions?
@tschottdorf These are definitely not 1PC transactions. I haven't actually looked down at what is happening at the KV level. I'm seeing INSERTs into the `nodes` and `quads` tables.
@petermattis Essentially each nquad is a relationship representation of subject → predicate → object. The reason it tries to re-insert the vertex is that it in and of itself does not know that the predicate already exists, so it just does a forcible/ignorable write... I could potentially create an LRU or ARC cache that keeps track of which nodes have already been inserted and reduce the number of collisions that happen in the nodes list. Would that help [on top of the RETURNING NOTHING]?
Yes, avoiding reinserting the node/vertex would likely help, though the batching of inserts is probably the bigger win.
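The LRU cache floated above could look roughly like this. A minimal sketch, not cayley's actual code: the writer consults the cache keyed by node hash and skips the INSERT when the node was seen recently:

```go
package main

import (
	"container/list"
	"fmt"
)

// seenCache is a tiny LRU for remembering node hashes that have already
// been inserted, so the writer can skip re-INSERTing them.
type seenCache struct {
	cap   int
	ll    *list.List               // front = most recently used
	items map[string]*list.Element // hash -> list element
}

func newSeenCache(cap int) *seenCache {
	return &seenCache{cap: cap, ll: list.New(), items: make(map[string]*list.Element)}
}

// Seen marks hash as seen and reports whether it was already present.
func (c *seenCache) Seen(hash string) bool {
	if e, ok := c.items[hash]; ok {
		c.ll.MoveToFront(e)
		return true
	}
	c.items[hash] = c.ll.PushFront(hash)
	if c.ll.Len() > c.cap {
		old := c.ll.Back() // evict least recently used
		c.ll.Remove(old)
		delete(c.items, old.Value.(string))
	}
	return false
}

func main() {
	c := newSeenCache(2)
	// "a" repeats -> second lookup hits; "c" evicts "a" at capacity 2.
	fmt.Println(c.Seen("a"), c.Seen("a"), c.Seen("b"), c.Seen("c"), c.Seen("a"))
}
```

Note this only reduces redundant inserts; a false miss (evicted entry) just degrades to the current behavior of re-inserting.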
@petermattis I made the modifications as prescribed, with the following results on my local machine: "RETURNING NOTHING" did not affect performance [my guess is because this is on a local machine where return latency is low?]. Will do the scaling test next, running it with more data for longer. Any recommendation on how much data I should use for the test?
@aselus-hub Glad to hear about the progress. Not sure why more than 15 inserts is showing degraded performance. Definitely not expected. It might be related to the indexes cayley uses. We'd be happy to investigate. In order to see a benefit from more machines, you'll need enough data to occupy multiple ranges. A range is 64MB in size and it splits when it becomes larger. You'll want to test with a data set significantly larger than 64MB (e.g. 1GB). PS Can you point me towards your edits to cayley? We'll need them to investigate the performance oddity with batches larger than 15.
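The sizing advice above can be turned into a quick back-of-envelope calculation. A sketch in Go; the 512-byte average encoded row size is an assumed figure for illustration only, not a number measured in this thread:

```go
package main

import "fmt"

// rowsFor estimates how many rows of avgRowBytes each are needed to
// reach targetBytes of data (e.g. to span multiple 64 MB ranges).
func rowsFor(targetBytes, avgRowBytes int) int {
	return targetBytes / avgRowBytes
}

func main() {
	const avgRow = 512 // assumption: average encoded row size in bytes
	fmt.Println(rowsFor(64<<20, avgRow)) // rows to fill one 64 MB range
	fmt.Println(rowsFor(1<<30, avgRow))  // rows for a ~1 GB test data set
}
```

With that assumption, a ~1 GB test set is on the order of a couple million rows, comfortably past the single-range regime.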
@petermattis I uploaded the update to both; the cockroach driver updates are in: https://github.com/aselus-hub/cayley
@aselus-hub Perusing https://github.com/aselus-hub/cayley/blob/master/graph/sql/cockroach.go#L295-L301 I noticed that you're preparing a statement and then immediately executing it. The prepare requires a roundtrip to the server. You'd be better off skipping the prepare in which case the driver can pipeline the prepare and exec. Something like:
In order to get the 1 round-trip behavior, I think you might also need to set `binary_parameters=yes` in the connection string.
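A sketch of the shape of this change, assuming the lib/pq driver (whose `binary_parameters` connection parameter allows un-prepared `Exec` calls to be pipelined); the base DSN below is made up for illustration:

```go
package main

import (
	"fmt"
	"net/url"
)

// withBinaryParams appends binary_parameters=yes to a postgres-style DSN.
func withBinaryParams(dsn string) (string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("binary_parameters", "yes")
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	dsn, _ := withBinaryParams("postgres://root@localhost:26257/cayley?sslmode=disable")
	fmt.Println(dsn)
	// Then, instead of the two-round-trip pattern
	//   stmt, _ := db.Prepare(query); stmt.Exec(args...)
	// call db.Exec(query, args...) directly so the driver can pipeline
	// the prepare/bind/exec into a single round trip.
}
```

The query-string rewrite is the mechanical part; the substantive change is dropping the explicit `Prepare` and passing arguments straight to `Exec`.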
@petermattis Made the modification; I'm going to try to run a prolonged test overnight to see what kind of performance numbers come from it. As such I also added a status report every n inserts to the test app. I tried smashing the two inserts into one exec as you had recommended before, but it told me this was not possible, as it detected two inserts in one prepared statement (even after I removed the prepare).
Yes, you can't use multiple statements with prepared statements. I think that's a limitation of the postgres wire protocol.
@petermattis
I have the output files for this if a slope is needed. So it seems that as the data grew, the insert rate sadly dropped more and more, so the increase in dataset/ranges did not speed up the ingestion rate. This was with the same code that does 2k/second locally [during quick tests], and … Any other thoughts, or should I close the experiments? Thanks again for all the help.
@aselus-hub On the admin UI, can you take a look at the "Ranges", "Replicas per Store" and "Leaseholders per Store" graphs? I'm curious to know what the count of ranges is and how balanced the replicas and leaseholders are.
@petermattis
Replicas per node:
Leaseholders per Store
Bonus Node summary information:
FYI, I'm interested in seeing something improved here. I hacked up the equivalent of what @petermattis suggested in #17108 (comment) into the latest version of Cayley's code (which has unfortunately moved on a fair bit from the base of @aselus-hub's fork), and I saw a 4-5x speedup on doing a large number of insertions.
@dsymonds I can provide additional guidance on performance changes to Cayley, but you will either have to shepherd those upstream or convince the Cayley folks it is worth fixing themselves. As I mentioned in that comment, round-trips affect Cockroach performance more than they do traditional databases. There are usually ways to structure an application's logic to reduce the round-trips. Is there something else you're looking for here?
No, I'm looking into pushing the improvements upstream to Cayley. I just wanted to note the magnitude of the speedup that I observed. It'd be nice if CockroachDB did it automatically (that is, coalescing value insertions inside a transaction that have conflict resolution and don't return values), but it's not as necessary for my specific use case.
Agreed that fixing this in Cockroach would be ideal. There are a few ideas and experiments in this area that we'll be investigating for the 2.1 release (scheduled for October).
I've filed cayleygraph/cayley#691 to chase this upstream. |
Because CockroachDB is a distributed database, round trips to it are much slower than comparable round trips in, say, Postgres, so different tradeoffs are required for optimal insertions. In this case, a single INSERT statement with many values is much faster for CockroachDB to handle than a sequence of single value INSERT statements. In one test case involving loading a large number of nodes and quads, this change produces a 4-5x speedup. See cockroachdb/cockroach#17108 for a more detailed discussion of this matter. Fixes #691.
FYI, I fixed cayleygraph/cayley#691 based on the ideas here (multi-value insert statements, and avoiding returning data when not needed). @petermattis If you have any other suggestions for the code affected there (see b49c06e), I'd be happy to try it out and see if I can get further performance improvements.
@dsymonds Very nice. I took a quick look at your change. A definite improvement. Some comments below:
This shouldn't be necessary. A semicolon is only needed to terminate a statement if the query contains multiple statements.
The use of placeholders triggers different code paths within the server. Yet another area to experiment with is to avoid using placeholders at all. This is somewhat more dangerous, as you'd have to guarantee you're properly quoting the values in the query. The advantage is that you can send multiple semicolon-separated statements in a single call to `Exec`.
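A minimal sketch of the placeholder-free variant. The quoting helper here is deliberately naive (it only doubles single quotes); a real implementation would need to worry about encoding and driver-specific escaping, and should prefer a driver-provided quoting helper. Table and column names are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// quoteLiteral is a naive sketch of SQL string-literal quoting: it wraps
// the value in single quotes and doubles any embedded single quotes.
// Do not rely on this for untrusted input in production code.
func quoteLiteral(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

func main() {
	// Without placeholders, multiple statements can be concatenated with
	// semicolons and sent in a single Exec call.
	batch := fmt.Sprintf(
		"INSERT INTO nodes (value) VALUES (%s); INSERT INTO nodes (value) VALUES (%s);",
		quoteLiteral("alice"), quoteLiteral("bob's"),
	)
	fmt.Println(batch)
}
```

The trade-off is exactly as stated above: one round trip for many statements, in exchange for taking on the quoting burden yourself.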
This was at the behest of CockroachDB engineers who say this is a much better driver (cockroachdb/cockroach#17108 (comment)). This extends work done in #691.
@petermattis Anything concrete left to do here? Based on my quick skim of the thread, this might be worth considering:
I've created #28461. I read through the rest of this issue and don't see anything else that is actionable. |
Reference cayley code that does writes to cdb: https://github.com/cayleygraph/cayley/blob/d9a72b0288ed17c0601adbc92eb7cb79e5687729/graph/sql/cockroach.go
Essentially the use case is as follows: write 4 quads into cayley, which creates 1 transaction into cockroachdb. The items in the quads are linked to one another (…).
Example quads (the application is generating thousands of these per second):
NOTE: Application example to come later as per request in gitter.
NOTE: All numbers are averages over 3 runs.
When ingesting these quad sets, we measure how many sets are ingested into cockroach through the cayley library. When done on a local machine (OS X laptop, latest) we got numbers between 300~600 sets/second ingested (each set with 1~5 quads).
When we took the same thing up to AWS, with 1 cockroach node and 1 'ingestion/generation' node, we got 260 sets/second (attributable to latency; no errors; cockroach was taking 350% of CPU via top, ingestion 75% [on separate nodes]).
We then expanded cockroach to 3 nodes, placed it behind an ELB, and ran the test again; performance dropped to 230 sets/second (~300% CPU on the first node, 95% on the other two).
We then thought it might be the generator, so we scaled that to two generators (which gave us the same performance, 226 sets/second) with the generators each taking <50% CPU.
We then cleaned cockroach and added two more nodes (same size), and got 198 sets/second.
AWS cockroach node info:
i2.xlarge
cockroach running on the 800GB SSD as its storage device
tried with default (15GiB) and 50GiB of RAM configured for each node
NTP synced at stratum 2, sub-50ms on all nodes
NOTE: the runs which had lower performance numbers had a significantly higher amount of errors:
INFO - {go-log} (*GoLog).Write: 2017/07/18 23:04:12 ERROR: couldn't exec INSERT statement: pq: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1500419052.641330403,0 encountered previous write with future timestamp 1500419052.641409756,0 within uncertainty interval
What did you expect to see?
The performance to improve as more nodes are added.
What did you see instead?
Performance reduced.