Description
Running v1.1-beta.20170907 on Debian stretch with an ext4 filesystem.
Nodes use --cache=20% --max-sql-memory=10GB
3 nodes in the same datacenter, all with the following hardware:
CPU: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz (8 threads)
RAM: 64GiB
Disks: 2x 512GiB NVMe in software RAID1
Test benchmark table:
CREATE TABLE bench (s string primary key, n int);
A simple load-generating script written in Go creates 100 concurrent connections and runs the following query on one of the nodes, where $1 is a random string of length 3:
INSERT INTO bench (s, n) VALUES($1, 1) ON CONFLICT (s) DO UPDATE SET n = bench.n + 1
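The script itself isn't included here, but for context this is roughly what it does. A minimal sketch, assuming the standard database/sql package with the lib/pq driver and a node listening on localhost:26257 (the real script's connection details may differ):

```go
package main

import (
	"database/sql"
	"log"
	"math/rand"

	_ "github.com/lib/pq"
)

const upsert = `INSERT INTO bench (s, n) VALUES ($1, 1)
	ON CONFLICT (s) DO UPDATE SET n = bench.n + 1`

// randKey returns a random uppercase string of length n.
func randKey(n int) string {
	const letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
	b := make([]byte, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

func main() {
	// Connection string is an assumption; adjust user/database/port as needed.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(100) // 100 concurrent connections, as described above

	for i := 0; i < 100; i++ {
		go func() {
			for {
				if _, err := db.Exec(upsert, randKey(3)); err != nil {
					log.Fatal(err)
				}
			}
		}()
	}
	select {} // run until interrupted
}
```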
Base performance without any further modifications is around 300 qps, which is extremely slow. This is likely because the small data size results in just one range, and a range seems to be a unit of concurrency in the system. The default maximum range size is 64 MiB.
Then I dropped and recreated the table and changed the range size to a min/max of 64/128 KiB to force it to split ranges very quickly. Inserts start slowly again at under 300 qps, but once there are something like 8 ranges, qps rises to around 700.
Once the system reaches 64 ranges after a few minutes, performance stabilizes around 2500 qps.
This is still quite low as some other databases can do nearly an order of magnitude more qps on the same hardware.
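For reference, the range-size change above was applied through the table's zone config. From memory, the v1.1 invocation looked something like the following (exact flags and values may differ):

```sh
# 64 KiB min / 128 KiB max, applied to the bench table's zone (from memory).
echo 'range_min_bytes: 65536
range_max_bytes: 131072' | cockroach zone set test.bench -f -
```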
At the stable 2.5k qps, each node uses less than 50% of the available CPU, has plenty of RAM free, and network throughput is about 1 MiB/s up and down per node on a gigabit network.
The disk I/O, though, is quite worrying: around 26 MiB/s of writes and 8% of CPU spent in iowait, as indicated by dstat.
The data being updated is very small (one integer). Granted, CockroachDB keeps all past values, so let's assume each update is like an insert. The string is 3 bytes plus a 4-byte integer plus overhead for metadata and encoding; let's assume a generous 64 bytes per entry. At 2500 qps, that is roughly 160 KiB/s. LSM storage engines have write amplification; I'm not sure how many levels were generated in this test, but I'd assume not too many, so let's say each row is actually written 4 times as time goes by. That's around 640 KiB/s, and even rounding up to 1 MiB/s, the observed writes are still off by a factor of 26. I'm not sure where all this disk I/O comes from, but it seems excessive.
Batching, as suggested on Gitter, didn't help. I tried writing 10 rows per query, and qps dropped by a factor of 10 accordingly; KV operations seemed stable, so the same number of rows is being written. Additionally, one has to be very careful because the same primary key can't appear twice in a single INSERT, so the batch items have to be pre-processed (deduplicated) before executing the query, otherwise it fails with an error (see the sketch below).
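For reference, the batched variant was built along these lines. This is a sketch, not the exact code; merging duplicate keys by summing their increments is just one way to do the pre-processing, with excluded.n used so the merged increment is applied on conflict:

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatch collapses duplicate keys (the same primary key may not appear
// twice in a single INSERT) and builds one multi-row upsert statement plus
// its arguments, merging duplicates by summing their increments.
func buildBatch(keys []string) (string, []interface{}) {
	counts := make(map[string]int)
	order := make([]string, 0, len(keys))
	for _, k := range keys {
		if _, seen := counts[k]; !seen {
			order = append(order, k)
		}
		counts[k]++
	}

	placeholders := make([]string, 0, len(order))
	args := make([]interface{}, 0, 2*len(order))
	for i, k := range order {
		placeholders = append(placeholders, fmt.Sprintf("($%d, $%d)", 2*i+1, 2*i+2))
		args = append(args, k, counts[k])
	}
	stmt := "INSERT INTO bench (s, n) VALUES " + strings.Join(placeholders, ", ") +
		" ON CONFLICT (s) DO UPDATE SET n = bench.n + excluded.n"
	return stmt, args
}

func main() {
	// "ABC" appears twice, so it is collapsed into a single row with n=2.
	stmt, args := buildBatch([]string{"ABC", "XYZ", "ABC"})
	fmt.Println(stmt, args)
}
```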
@tschottdorf asked on Gitter to see a SHOW KV TRACE of an example query. Please see below.
This was run while the load generator was still running.
root@:26257/test> SHOW KV TRACE FOR INSERT INTO bench(s, n) VALUES ('ABC', 1) ON CONFLICT (s) DO UPDATE SET n = bench.n + 1;
+----------------------------------+---------------+---------------------------------------------------------------+------------------------------------------------------+------------------+-------+
| timestamp | age | message | context | operation | span |
+----------------------------------+---------------+---------------------------------------------------------------+------------------------------------------------------+------------------+-------+
| 2017-09-21 09:24:35.115653+00:00 | 0s | output row: [] | [client=127.0.0.1:36494,user=root,n1] output row: [] | consuming rows | (0,2) |
| 2017-09-21 09:24:35.115674+00:00 | 21µs28ns | querying next range at /Table/56/1/"ABC" | [client=127.0.0.1:36494,user=root,n1] | sql txn implicit | (0,0) |
| 2017-09-21 09:24:35.115696+00:00 | 42µs735ns | r2522: sending batch 1 Scan to (n3,s3):1 | [client=127.0.0.1:36494,user=root,n1] | sql txn implicit | (0,0) |
| 2017-09-21 09:24:35.117193+00:00 | 1ms539µs422ns | fetched: /bench/primary/'ABC'/n -> /305 | [client=127.0.0.1:36494,user=root,n1] | sql txn implicit | (0,0) |
| 2017-09-21 09:24:35.117218+00:00 | 1ms565µs47ns | Put /Table/56/1/"ABC"/0 -> /TUPLE/2:2:Int/306 | [client=127.0.0.1:36494,user=root,n1] | sql txn implicit | (0,0) |
| 2017-09-21 09:24:35.11724+00:00 | 1ms586µs925ns | querying next range at /Table/56/1/"ABC"/0 | [client=127.0.0.1:36494,user=root,n1] | sql txn implicit | (0,0) |
| 2017-09-21 09:24:35.117266+00:00 | 1ms612µs181ns | r2522: sending batch 1 Put, 1 BeginTxn, 1 EndTxn to (n3,s3):1 | [client=127.0.0.1:36494,user=root,n1] | sql txn implicit | (0,0) |
+----------------------------------+---------------+---------------------------------------------------------------+------------------------------------------------------+------------------+-------+
(7 rows)
I couldn't observe any benefit from larger ranges. I think if a table started out with a small range size that was automatically increased as the table grows, performance could be greatly improved. At the very least, the default of 64 MiB seems way too high.
Side observations:
- When using a shorter length for the random primary key string, like 2, which creates a lot more conflicts, the load generator quickly dies with this error:
  ERROR: TransactionStatusError: does not exist (SQLSTATE XX000)
  I am not sure what this error indicates; it might warrant its own issue.
- Doing a TRUNCATE TABLE bench; while inserts are running results in the table not being displayed in the admin UI. It re-appears once the TRUNCATE is finished.
- Changing the queries to pure SELECTs for a single row results in around 2200 qps.
- Changing the queries to ON CONFLICT DO NOTHING results in around 7100 qps.
- Refreshing the table overview in the admin UI takes several seconds because nearly 900 KiB (3.46 MiB uncompressed) of JavaScript is downloaded each time. The servers are not close to me, so this causes quite some lag. CockroachDB prevents the browser from caching the assets, and I think that should be changed. It should at least support ETags so the browser can cache the assets as long as the file hasn't changed; an alternative would be a URL that contains the hash or mtime of the binary (see the sketch after this list).
- Increasing the range size after over 1000 ranges were created didn't seem to result in a lower number of ranges. Are ranges ever merged?
- The admin UI seems sensitive to whether the machine running the browser has a synchronized clock. I saw nodes being randomly reported as suspect and couldn't figure out what was wrong until I noticed my laptop's clock was off by a bit. It also causes the queries-per-second value to show 0 every now and then.
- The database size in the admin UI might be off. For one table it shows a size of 9.3 GiB, while the cluster overview shows a usage of 3.6 GiB, which also matches the 1.2 GiB size of the cockroach-data directory on each node.
- The number of indices on the admin UI tables page seems wrong. I have a table with a primary key over 3 columns and it lists 3 indices while it should be 1.
- I shut down node 3 via "cockroach quit", which made the load generator get stuck without any errors. After restarting the load generator, it quickly becomes stuck again. Once I brought node 3 back up, queries continued. That's a real problem for a production setup. Note that the load generator only connects to node 1. The admin UI correctly identified node 3 as dead. This also probably warrants its own issue.
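Regarding the admin UI caching point above, here is a minimal sketch of the ETag approach using only Go's standard net/http package. It is illustrative only: this is not CockroachDB's actual asset-serving code, and the handler name and paths are made up.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net/http"
)

// assetHandler serves a static asset and derives its ETag from a hash of the
// content, so the browser can cache it until the content actually changes.
func assetHandler(content []byte, contentType string) http.HandlerFunc {
	etag := fmt.Sprintf(`"%x"`, sha256.Sum256(content))
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("ETag", etag)
		w.Header().Set("Content-Type", contentType)
		if r.Header.Get("If-None-Match") == etag {
			// The browser's cached copy is still valid; skip the body.
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Write(content)
	}
}

func main() {
	// Stand-in for the ~900 KiB admin UI bundle.
	bundle := []byte("console.log('admin ui bundle')")
	http.Handle("/assets/bundle.js", assetHandler(bundle, "application/javascript"))
	http.ListenAndServe(":8080", nil)
}
```

The hash-in-URL alternative mentioned above works the same way, except the hash goes into the asset path and the asset can then be served with a long max-age.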