
perf: scaling cockroach clusters while running under cayley, slows performance #17108

Closed
aselus-hub opened this issue Jul 19, 2017 · 41 comments
Labels: C-performance Perf of queries or internals. Solution not expected to change functional behavior.

@aselus-hub

Reference cayley code that does writes to cdb: https://github.com/cayleygraph/cayley/blob/d9a72b0288ed17c0601adbc92eb7cb79e5687729/graph/sql/cockroach.go

Essentially the use case is as follows: write 4 quads into Cayley, which creates 1 transaction into CockroachDB. The items in the quads are linked to one another. Example quads (the application is generating thousands of these per second):

<NodeA> <related_to> <NodeB> .
<NodeB> <related_to> <NodeC> .
<NodeC> <related_to> <NodeD> .
<NodeD> <related_to> <NodeE> .

NOTE: Application example to come later as per request in gitter.
NOTE: All numbers are averages over 3 runs.

When ingesting these quad sets we measure how many sets per second are ingested into Cockroach through the Cayley library. On a local machine (OS X laptop, latest) we got between 300~600 sets ingested per second (each set containing 1~5 quads).

When we took the same thing up to AWS, with 1 Cockroach node and 1 'ingestion/generation' node, we got 260 sets/second (attributable to latency; no errors, Cockroach was taking 350% of CPU via top and ingestion 75% [on separate nodes]).

We then expanded Cockroach to 3 nodes, placed it behind an ELB, and ran the test again; performance dropped to 230 sets/second (~300% CPU on the first node and 95% on the other two).

We then thought it might be the generator, so we scaled that to two generators (which gave us the same performance, 226 sets/second), with the generators each taking <50% CPU.

We then cleaned Cockroach and added two more nodes (same size), and got 198 sets/second.

AWS Cockroach node info:
i2.xlarge
Cockroach running on the 800 GB SSD as its storage device
Tried with the default (15 GiB) and with 50 GiB of RAM configured for each node
NTP synced at stratum 2, sub-50ms, on all nodes

NOTE: the runs with lower performance numbers had a significantly higher number of errors:
INFO - {go-log} (*GoLog).Write: 2017/07/18 23:04:12 ERROR: couldn't exec INSERT statement: pq: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1500419052.641330403,0 encountered previous write with future timestamp 1500419052.641409756,0 within uncertainty interval

  • What did you expect to see?
    The performance to improve as more nodes are added.

  • What did you see instead?
    Performance reduced.

@petermattis
Collaborator

@aselus-hub The slight dip in performance when expanding the cluster from 1 to 3 nodes is expected. You've gone from 1 replica to 3 replicas and there is a small amount of overhead to doing so. That performance dropped further when you expanded to 5 nodes is very surprising. I don't have any explanation for that other than the statement that "it shouldn't happen". Can you provide reproduction instructions?

@tschottdorf Any thoughts about the ReadWithinUncertaintyIntervalErrors?

@aselus-hub
Author

@petermattis the only reproduction instructions I can currently give are the ones above: generating a graph, splitting it out into N-Quads, and writing those quads in batches through Cayley into cdb (example quads as above). I'm going to write an example app that feeds it, but that might take a little bit (as I can't reuse the work code that originally caused the problem). I will link the GitHub repo once the example app is up.

@tbg
Member

tbg commented Jul 19, 2017

@petermattis the first data point to look at for ReadWithinUncertaintyIntervalErrors is the clock offset. I believe there are clock_offset measurements in /_status/vars. If a node is trailing and contending with other nodes, it will often catch that error. Even with well-synchronized clocks it can happen sometimes: txn1 starts at ts1, waits, txn2 starts at ts2 > ts1, writes, txn1 tries to write and sees the value at ts2. But it doesn't seem here that the overlapping transactions have a very heterogeneously sized read phase.

@aselus-hub
Author

aselus-hub commented Jul 19, 2017

@petermattis
Example branch used to generate the numbers below (apologies for the mess, I wrote it in haste):
https://github.com/aselus-hub/cayley-cdb-appscaling

Here are the numbers I got using this to test (which follows the pattern above; both local and remote numbers), based on cdb nodes on AWS i2.xlarges (15 GiB SQL memory):

From remote:
1 cdb node: 80 sets/second
3 cdb nodes (with lb): 74~83 sets/second; if talking through the lb: 55
5 cdb nodes: 67~69 if talking to one node; if talking through the lb: 49

From within AWS (5 runs each):
1 cdb node with lb: 239~254 (median example, over 10 seconds):

time elapsed: 10.266989253s
total processed: 2477
sets/second: 241.2586532391883

3 cdb nodes with lb: 193~201 (median example, over 10 seconds)

time elapsed: 10.169260542s
total processed: 2025
sets/second: 199.12952290253162

5 cdb nodes with lb: 170~185 (median example, over 10 seconds)

time elapsed: 10.398752349s
total processed: 1907
sets/second: 183.38738494752087

<10% CPU used on the node.

@petermattis
Collaborator

@aselus-hub Thanks for putting that example together. Will take a look tomorrow.

@petermattis
Collaborator

@aselus-hub Wanted to let you know this is still on my radar. Apologies for not being able to take a look at it sooner. Other work has interfered.

@petermattis
Collaborator

@aselus-hub Finally taking a look at this. I'm not sure what the cause of the slowdown moving to 5 nodes is yet. But examining the transactions I noticed that they are executing a lot of serial operations. For example:

BEGIN
SAVEPOINT cockroach_restart
INSERT INTO nodes(hash, value_string) VALUES ($1, $2) ON CONFLICT (hash) DO NOTHING;
INSERT INTO nodes(hash, value_string) VALUES ($1, $2) ON CONFLICT (hash) DO NOTHING;
INSERT INTO nodes(hash, value_string) VALUES ($1, $2) ON CONFLICT (hash) DO NOTHING;
INSERT INTO quads(subject_hash, predicate_hash, object_hash, label_hash, id, ts) VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT (subject_hash, predicate_hash, object_hash) DO NOTHING;
INSERT INTO quads(subject_hash, predicate_hash, object_hash, label_hash, id, ts) VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT (subject_hash, predicate_hash, object_hash) DO NOTHING;
RELEASE SAVEPOINT cockroach_restart
COMMIT

That's 5 separate INSERT statements. Due to its distributed nature, Cockroach operations have higher latencies. You'd be better off structuring this as 2 INSERTS:

BEGIN
SAVEPOINT cockroach_restart
INSERT INTO nodes(hash, value_string) VALUES ($1, $2), ($3, $4), ($5, $6) ON CONFLICT (hash) DO NOTHING;
INSERT INTO quads(subject_hash, predicate_hash, object_hash, label_hash, id, ts) VALUES ($1, $2, $3, $4, $5, $6), ($7, $8, $9, $10, $11, $12) ON CONFLICT (subject_hash, predicate_hash, object_hash) DO NOTHING;
RELEASE SAVEPOINT cockroach_restart
COMMIT

I'm not seeing anything in https://github.com/cayleygraph/cayley/blob/d9a72b0288ed17c0601adbc92eb7cb79e5687729/graph/sql/cockroach.go that would prohibit this.

For an additional optimization, you can use the RETURNING NOTHING syntax (https://www.cockroachlabs.com/docs/stable/insert.html) and send the INSERTs for the nodes and quads tables in the same query. This would look something like:

INSERT INTO quads ... RETURNING NOTHING; INSERT INTO nodes ... RETURNING NOTHING;

If cockroach sees a query like that it will run the INSERTs in parallel.

Now, time for me to look at the 3 node vs 5 node performance.

@petermattis
Collaborator

On a GCE test cluster I see:

1-node:  346 ops/sec
3-nodes: 135 ops/sec
5-nodes: 130 ops/sec

I think the fall off in performance between 3 and 5 nodes is because of the relatively small amount of data in the test. That small amount of data resides in a single range and with 3 nodes there is a 1/3 chance of the leaseholder for that range being local on the node that receives the queries. With 5 nodes there is a 1/5 chance of the leaseholder being local. If the data set were larger so that it spanned multiple ranges I would expect the performance to increase with additional nodes.

@tschottdorf I can easily reproduce the ReadWithinUncertaintyIntervalError error on denim which has fairly tight clock synchronization according to the node exporter metrics. Especially if you crank up the concurrency in the test app that @aselus-hub provided. I think there is something to investigate here.

@petermattis
Collaborator

@tschottdorf I can reproduce the ReadWithinUncertaintyIntervalError problem on a local 5-node cluster. That is, all nodes running on the same machine.

~ rm -fr cockroach-data && roachdemo -n 5 -- --insecure --logtostderr
./cayley-cdb-appscaling
2017/07/28 16:06:23 ERROR: couldn't exec INSERT statement: pq: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1501272383.609559945,0 encountered previous write with future timestamp 1501272383.617180736,0 within uncertainty interval
2017/07/28 16:06:23 ERROR: couldn't exec INSERT statement: pq: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1501272383.607930617,0 encountered previous write with future timestamp 1501272383.617180736,0 within uncertainty interval

@tbg
Member

tbg commented Jul 28, 2017

What's the statement that does the read? I'm seeing mostly inserts up here.

One way in which this error can still pop up with perfectly synchronized clocks is simply when the "beams cross":

  • node1 begins transaction, picks timestamp
  • node2 begins transaction, picks (higher, because perfectly synchronized) timestamp, reads & inserts, commits
  • node 1 reads, catches uncertainty error

@petermattis
Collaborator

Inserts do a conditional put. Isn't that a read?

The "beams cross" scenario requires contention, right? I'm not quite understanding what this test app and cayley are doing (yet).

@aselus-hub
Author

@petermattis I will try that optimization and add the RETURNING NOTHING, though sadly Go's cdb interface doesn't support multi-value inserts to my knowledge, so I'll build a string for it. Unless there's some other way to do that in Cockroach?

In the real dataset example we were inserting about 500k nquads with the same sort of scaling results. If the target is to be able to get that up to at least 3000 inserts/second, for example, are there any steps we can take? Or is that kind of scaling not possible with Cockroach and this data model?

@petermattis
Collaborator

Huh, the errors seem to be occurring when the same nodes.hash is inserted concurrently. I need to step out for a bit, but that seems odd.

@tbg
Member

tbg commented Jul 28, 2017

Yes, it would require the first (but slower) txn to read something the second one wrote.
I would have to check what a CPut would do, but it's possible that it would treat this as a ReadWithinUncertaintyIntervalError. Even though that would be somewhat silly: it could still look at the future value and if it doesn't match the assumption of the CPut, it may as well return a ConditionFailedError right there (I think).

@petermattis
Collaborator

@petermattis I will try that optimization and add the RETURNING NOTHING, though sadly Go's cdb interface doesn't support multi-value inserts to my knowledge, so I'll build a string for it. Unless there's some other way to do that in Cockroach?

Yes, you'll need to build a string, though you can still use placeholders.
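A minimal sketch of how that string-building might look in Go with database/sql (the table and column names follow the statements quoted earlier in this thread; the function name, package name, and input type are hypothetical, not part of Cayley):

package ingest

import (
	"bytes"
	"database/sql"
	"fmt"
)

// insertNodes sends all node values in a single multi-value INSERT.
// nodes is assumed to be a slice of (hash, value) pairs.
func insertNodes(tx *sql.Tx, nodes [][2]string) error {
	if len(nodes) == 0 {
		return nil
	}
	var query bytes.Buffer
	query.WriteString("INSERT INTO nodes(hash, value_string) VALUES ")
	args := make([]interface{}, 0, 2*len(nodes))
	for i, n := range nodes {
		if i > 0 {
			query.WriteString(", ")
		}
		// Placeholders are 1-indexed: ($1, $2), ($3, $4), ...
		fmt.Fprintf(&query, "($%d, $%d)", 2*i+1, 2*i+2)
		args = append(args, n[0], n[1])
	}
	query.WriteString(" ON CONFLICT (hash) DO NOTHING")
	_, err := tx.Exec(query.String(), args...)
	return err
}

The quads INSERT can be built the same way, so each transaction carries one INSERT per table regardless of how many quads are in the set.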

In the real dataset example we were inserting about 500k nquads with the same sort of scaling results. If the target is to be able to get that up to at least 3000 inserts/second, for example, are there any steps we can take? Or is that kind of scaling not possible with Cockroach and this data model?

Yes, that scaling is possible, especially if you batch sufficiently.

@petermattis
Collaborator

Yes, it would require the first (but slower) txn to read something the second one wrote.
I would have to check what a CPut would do, but it's possible that it would treat this as a ReadWithinUncertaintyIntervalError. Even though that would be somewhat silly: it could still look at the future value and if it doesn't match the assumption of the CPut, it may as well return a ConditionFailedError right there (I think).

I think this may be what is happening. The errors are all occurring on INSERTS for which there appears to be an existing row. These are INSERT ... ON CONFLICT DO NOTHING. But the lack of ConditionFailedError is causing the insert to fail on conflict.

@tbg
Member

tbg commented Jul 28, 2017

Ok, that seems worth fixing then. Just to be sure, you're also seeing old-fashioned WriteIntentErrors, right? Those would be expected if the "faster" value is in fact an intent. Or perhaps you're not seeing those because 1PC optimization?

@petermattis
Collaborator

@tschottdorf These are definitely not 1PC transactions. I haven't actually looked down at what is happening at the storage layer yet.

I'm seeing INSERTS into the nodes table fail where the hash is ded76379665e2a11c953fbfeca46541832498e61. That is the node for the related_through predicate (literally SHA1("related_through")). @aselus-hub I'm not terribly familiar with cayley, but over a 10 second run where I inserted 2622 nodes, we performed 2218 inserts of the "related_through" node. That seems sub-optimal.

@aselus-hub
Author

@petermattis
Cayley is a graph DB that uses Cockroach as a storage layer (not of my writing ;-), though I am going to try to write/test/submit your advice into its Cockroach connector).

Essentially each nquad is a relationship representation of:
parent/subject ---(edge/predicate)---> child/object [tag/label]
Each of the items in the nquad relationship is represented as a node in the nodes table, and the relationship is represented in the quads table.

The reason it tries to re-insert the vertex is that it does not itself know that the predicate already exists, so it just does a forcible/ignorable write... I could potentially create an LRU or ARC cache that keeps track of which nodes have already been inserted and reduces the number of collisions that happen in the nodes list. Would that help [on top of the RETURNING NOTHING]?

@petermattis
Collaborator

The reason it tries to re-insert the vertex is that it does not itself know that the predicate already exists, so it just does a forcible/ignorable write... I could potentially create an LRU or ARC cache that keeps track of which nodes have already been inserted and reduces the number of collisions that happen in the nodes list. Would that help [on top of the RETURNING NOTHING]?

Yes, avoiding reinserting the node/vertex would likely help, though the batching of INSERTs will probably help more. Note that my suggestion regarding the batching of INSERTs will help any SQL backend. It just has a more noticeable effect on Cockroach, where the latencies are higher.
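For illustration only, a tiny sketch of the kind of cache discussed above: a bounded "seen set" rather than a real LRU/ARC, just to show how a writer could skip node INSERTs for hashes it has already sent (the names and eviction policy here are made up for this sketch):

package ingest

// seenHashes is a naive bounded cache of node hashes already inserted.
// Not a true LRU: when full it is simply cleared.
type seenHashes struct {
	max  int
	seen map[string]struct{}
}

func newSeenHashes(max int) *seenHashes {
	return &seenHashes{max: max, seen: make(map[string]struct{}, max)}
}

// add reports whether hash is new (true) or was already recorded (false).
func (s *seenHashes) add(hash string) bool {
	if _, ok := s.seen[hash]; ok {
		return false
	}
	if len(s.seen) >= s.max {
		s.seen = make(map[string]struct{}, s.max) // crude eviction
	}
	s.seen[hash] = struct{}{}
	return true
}

Before appending a node to the batched INSERT, the writer would call add(hash) and drop hashes that are already known; hot predicate nodes such as related_through would then be inserted far less often. Entries should only be recorded once the transaction that wrote them has committed, so that a rolled-back write is not mistaken for a persisted node.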

@aselus-hub
Author

aselus-hub commented Jul 31, 2017

@petermattis made the modifications as prescribed, with the following results on my local machine:
Creating bulk inserts: doing 15+ inserts at a time upped performance to ~1700/second (more than that degraded performance, which I found odd since the documentation said ~500 was ideal; if I cranked it up to anywhere near there, the performance dropped significantly [to ~100 sets]). Is this as expected?

"RETURNING NOTHING" : did not effect performance [my guess is because this is on local machine where return latency is low?].

Will do the scaling test next, running it with more data for longer. Any recommendation on how much data I should use for the test?

@petermattis
Collaborator

@aselus-hub Glad to hear about the progress. Not sure why more than 15 inserts is showing degraded performance. Definitely not expected. It might be related to the indexes Cayley uses. We'd be happy to investigate.

In order to see a benefit from more machines, you'll need enough data to occupy multiple ranges. A range is 64MB in size and it splits when it becomes larger. You'll want to test with a data set significantly larger than 64MB (e.g. 1GB).

PS Can you point me towards your edits to cayley? We'll need them to investigate the performance oddity with batches larger than 15.

@aselus-hub
Author

aselus-hub commented Jul 31, 2017

@petermattis I uploaded the updates to both; the cockroach driver updates are in: https://github.com/aselus-hub/cayley
The new changes in cdb_graph_gen extend the batch processing to cover more than just one set of nquads (using the 'batchSize' const as the apply-tx threshold).

@petermattis
Collaborator

@aselus-hub Perusing https://github.com/aselus-hub/cayley/blob/master/graph/sql/cockroach.go#L295-L301 I noticed that you're preparing a statement and then immediately executing it. The prepare requires a roundtrip to the server. You'd be better off skipping the prepare in which case the driver can pipeline the prepare and exec. Something like:

		_, err = tx.Exec(insertQuadStr, quadValues...)

In order to get the 1 round-trip behavior, I think you might also need to set binary_parameters=true in your postgres URL.
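For reference, a minimal sketch of what opening the connection with that flag might look like (the host, port, database, and user here are placeholders; binary_parameters is written exactly as suggested above):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // postgres driver used by the test app
)

func main() {
	// Hypothetical URL: adjust host, port, database, user, and TLS settings.
	db, err := sql.Open("postgres",
		"postgresql://root@cockroach-host:26257/cayley?sslmode=disable&binary_parameters=true")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}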

@aselus-hub
Author

@petermattis made the modification. I'm going to try to run a prolonged test overnight to see what kind of performance numbers come from it. I also added a regular every-'n' status report to the test app.

I tried combining the two inserts into one exec as you had recommended before, but it told me this was not possible, as it detected two inserts in one prepared statement (even after I removed the prepare).

@petermattis
Collaborator

I tried combining the two inserts into one exec as you had recommended before, but it told me this was not possible, as it detected two inserts in one prepared statement (even after I removed the prepare).

Yes, you can't use multiple statements with prepared statements. I think that's a limitation of the postgres wire protocol.

@aselus-hub
Author

@petermattis
finished running a 14-hour test set with 3 of the example applications and 5 CockroachDB instances + lb

Starting performance: ~200/sec per node (total of ~600)
Inserts/second within the first hour (according to the UI): 100/sec

Ending performance: ~65/sec per node (total of ~195)
Ending total data ingested: 60.8 GB (out of a total of 3.6 TB)
Inserts/second towards the end of the 14 hours (according to the UI): 40/sec

Average over all time: ~80/sec per node

I have the output files for this if a slope is needed.

So it seems that as the data grew, the insert rate sadly dropped more and more for writes; increasing the dataset/ranges did not speed up the ingestion rate. This was with the same code that does 2k/second locally [during quick tests], and 1000 msgs/sec and 800 msgs/sec in 1 writer - 1 cdb setups. So the scalability experiment is still a failure for this kind of table count.

Any other thoughts or should I close the experiments?

Thanks again for all the help.

@petermattis
Collaborator

@aselus-hub On the admin UI, can you take a look at the "Ranges", "Replicas per Store" and "Leaseholders per Store" graphs? I'm curious to know what the count of ranges is and how balanced the replicas and leaseholders are.

@aselus-hub
Author

@petermattis
Snapshot info taken at the end of the run; all in all the numbers aren't that far off, though a bit smaller than I would have expected:
Ranges:

Ranges: 350
Leaders: 350
Lease Holders: 349

Replicas per node:

Node 1 - 214
Node 2 - 208
Node 3 - 203
Node 4 - 210
Node 5 - 215

Leaseholders per Store

Node 1 - 69
Node 2 - 72
Node 3 - 66
Node 4 - 75
Node 5 - 67

Bonus Node summary information:

ID | ADDRESS | UPTIME | BYTES | REPLICAS | MEM USAGE | LOGS
1 | ip- xx | 18 hours | 7.7 GiB | 217 | 11.7 GiB | Logs
2 | ip- xx | 18 hours | 7.9 GiB | 211 | 11.8 GiB | Logs
3 | ip- xx | 18 hours | 7.5 GiB | 203 | 12.9 GiB | Logs
4 | ip- xx | 18 hours | 7.4 GiB | 213 | 12.6 GiB | Logs
5 | ip- xx | 18 hours | 7.6 GiB | 215 | 11.8 GiB | Logs
Total live nodes: 5
Total bytes: 38.1 GiB
Total replicas: 1059
Total memory usage: 60.8 GiB

@dsymonds

FYI, I'm interested in seeing something improved here. I hacked up the equivalent of what @petermattis suggested in #17108 (comment) into the latest version of Cayley's code (which has unfortunately moved on a fair bit from the base of @aselus-hub's fork), and I saw a 4-5x speedup on doing a large number of insertions.

@petermattis
Collaborator

@dsymonds I can provide additional guidance on performance changes to Cayley, but you will either have to shepherd those upstream or convince the Cayley folks it is worth fixing themselves. As I mentioned in that comment, round-trips affect Cockroach performance more than traditional databases. There are usually ways to structure an application's logic to reduce the round-trips. Is there something else you're looking for here?

@dsymonds

No, I'm looking into pushing the improvements upstream to Cayley. I just wanted to note the magnitude of the speedup that I observed. It'd be nice if CockroachDB did it automatically (that is, coalescing value insertions inside a transaction that have conflict resolution and don't return values), but it's not as necessary for my specific use case.

@petermattis
Collaborator

It'd be nice if CockroachDB did it automatically (that is, coalescing value insertions inside a transaction that have conflict resolution and don't return values), but it's not as necessary for my specific use case.

Agreed that fixing this in Cockroach would be ideal. There are a few ideas and experiments in this area that we'll be investigating for the 2.1 release (scheduled for October).

@dsymonds

I've filed cayleygraph/cayley#691 to chase this upstream.

dennwc pushed a commit to cayleygraph/cayley that referenced this issue Feb 22, 2018
Because CockroachDB is a distributed database, round trips to it are
much slower than comparable round trips in, say, Postgres, so different
tradeoffs are required for optimal insertions. In this case, a single
INSERT statement with many values is much faster for CockroachDB to
handle than a sequence of single value INSERT statements.

In one test case involving loading a large number of nodes and quads,
this change produces a 4-5x speedup.

See cockroachdb/cockroach#17108 for a more
detailed discussion of this matter.

Fixes #691.
@dsymonds

FYI, I fixed cayleygraph/cayley#691 based on the ideas here (multi-value insert statements, and avoiding returning data when not needed). @petermattis If you have any other suggestions for the code affected there (see b49c06e), I'd be happy to try it out and see if I can get further performance improvements.

@petermattis
Collaborator

@dsymonds Very nice. I took a quick look at your change. A definite improvement. Some comments below:

  • fmt.Fprint(&query, ";")

This shouldn't be necessary. A semicolon is only necessary to terminate a statement if the query contains multiple statements.

  • _, err := tx.Exec(query.String(), allValues...)
    

The use of placeholders triggers different code paths within the lib/pq driver. I can't recall if lib/pq pipelines the low-level PREPARE/BIND/EXEC operations or if it sends them one at a time. There is another Go postgres driver which the knowledgeable Cockroach engineers say is much better: https://github.com/jackc/pgx. I'm not sure if it would have an effect here, but it is worth a small bit of time exploring. The other option is to specify binary_parameters=true in the postgresql:// URL. A brief reading of the lib/pq source shows that this sends the PREPARE/BIND/EXEC operations before waiting for any results.

Yet another area to experiment with is to avoid using placeholders at all. This is somewhat more dangerous as you'd have to guarantee you're properly quoting the values in the query. The advantage is that you can send multiple semicolon separated statements in a single call to tx.Exec(). Ideally, this shouldn't have any benefit given you're specifying RETURNING NOTHING.
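A hedged sketch of that last option, with values inlined instead of passed as placeholders so that both statements can travel in a single Exec (the quoting helper is deliberately simplistic and assumes hex-encoded hashes; the remaining quads columns are omitted for brevity, and none of these names come from Cayley itself):

package ingest

import (
	"database/sql"
	"strings"
)

// quote wraps a value in single quotes, doubling any embedded quote.
// Fine for hex hashes; use a real quoting routine for arbitrary strings.
func quote(s string) string {
	return "'" + strings.Replace(s, "'", "''", -1) + "'"
}

// insertNodeAndQuad sends two semicolon-separated INSERTs in one round trip.
func insertNodeAndQuad(tx *sql.Tx, nodeHash, nodeValue, subj, pred, obj string) error {
	q := "INSERT INTO nodes(hash, value_string) VALUES (" +
		quote(nodeHash) + ", " + quote(nodeValue) +
		") ON CONFLICT (hash) DO NOTHING RETURNING NOTHING; " +
		"INSERT INTO quads(subject_hash, predicate_hash, object_hash) VALUES (" +
		quote(subj) + ", " + quote(pred) + ", " + quote(obj) +
		") ON CONFLICT (subject_hash, predicate_hash, object_hash) DO NOTHING RETURNING NOTHING"
	// No placeholder args, so the driver can send multiple statements in one query.
	_, err := tx.Exec(q)
	return err
}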

dennwc pushed a commit to cayleygraph/cayley that referenced this issue Feb 26, 2018
This was at the behest of CockroachDB engineers who say this is a much
better driver
(cockroachdb/cockroach#17108 (comment)).

This extends work done in #691.
@petermattis petermattis added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label May 18, 2018
@petermattis petermattis changed the title Scaling cockroach clusters while running under cayley, slows performance. perf: scaling cockroach clusters while running under cayley, slows performance May 18, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label May 29, 2018
@tbg tbg added this to Backlog in KV May 29, 2018
@tbg
Member

tbg commented Jul 19, 2018

@petermattis anything concrete left to do here? Based on my quick skim of the thread, this might be worth considering:

#17108 (comment)

@tbg tbg modified the milestones: 2.1, 2.2 Jul 19, 2018
@tbg tbg added A-coreperf and removed A-disaster-recovery, A-kv-transactions, A-kv-distribution, A-kv-client, A-storage, A-kv-replication labels Jul 31, 2018
@petermattis
Collaborator

I've created #28461. I read through the rest of this issue and don't see anything else that is actionable.

KV automation moved this from Backlog to Finished (milestone 3, ends 7/20) Aug 10, 2018