high memory usage in cluster #15702

Closed
BramGruneir opened this issue May 4, 2017 · 69 comments

Comments

@BramGruneir
Member

Originally reported in the forum:

Version: binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)

The cluster should be idle, but a lot of the nodes are showing load... It's not busy from clients, but it never seems to go idle, even after leaving it inactive for days at a time (though it's only been hours on rc.1).

$ dogroup "grep -F '[config]' /data/crdb/logs/cockroach.log ; uptime" crdb+

Processing crdb1a
I170502 23:48:09.159014 1 util/log/clog.go:990 [config] file created at: 2017/05/02 23:48:09
I170502 23:48:09.159014 1 util/log/clog.go:990 [config] running on machine: crdb1a
I170502 23:48:09.159014 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170502 23:48:09.159014 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ohq]
01:41:36 up 14 days, 7:13, 5 users, load average: 1.18, 1.28, 1.31
Connection to crdb1a closed.

Processing crdb2a
I170502 23:48:09.477202 1 util/log/clog.go:990 [config] file created at: 2017/05/02 23:48:09
I170502 23:48:09.477202 1 util/log/clog.go:990 [config] running on machine: crdb2a
I170502 23:48:09.477202 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170502 23:48:09.477202 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ohq]
01:41:36 up 14 days, 7:15, 5 users, load average: 0.23, 0.14, 0.14
Connection to crdb2a closed.

Processing crdb3a
I170503 23:25:06.941979 1 util/log/clog.go:990 [config] file created at: 2017/05/03 23:25:06
I170503 23:25:06.941979 1 util/log/clog.go:990 [config] running on machine: crdb3a
I170503 23:25:06.941979 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170503 23:25:06.941979 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ohq]
01:41:36 up 14 days, 7:15, 12 users, load average: 0.94, 0.62, 0.54
Connection to crdb3a closed.

Processing crdb4a
I170503 22:17:01.035850 69 util/log/clog.go:887 [config] file created at: 2017/05/03 22:17:01
I170503 22:17:01.035850 69 util/log/clog.go:887 [config] running on machine: crdb4a
I170503 22:17:01.035850 69 util/log/clog.go:887 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170503 22:17:01.035850 69 util/log/clog.go:887 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ohq]
01:41:36 up 8 days, 4:20, 1 user, load average: 2.46, 2.56, 2.53
Connection to crdb4a closed.

Processing crdb5a
I170502 23:48:09.701236 1 util/log/clog.go:990 [config] file created at: 2017/05/02 23:48:09
I170502 23:48:09.701236 1 util/log/clog.go:990 [config] running on machine: crdb5a
I170502 23:48:09.701236 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170502 23:48:09.701236 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ohq]
01:41:36 up 8 days, 4:19, 2 users, load average: 1.21, 1.28, 1.40
Connection to crdb5a closed.

Processing crdb1c
I170504 01:35:23.501695 2577759 util/log/clog.go:887 [config] file created at: 2017/05/04 01:35:23
I170504 01:35:23.501695 2577759 util/log/clog.go:887 [config] running on machine: crdb1c
I170504 01:35:23.501695 2577759 util/log/clog.go:887 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170504 01:35:23.501695 2577759 util/log/clog.go:887 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ods]
01:41:37 up 8 days, 1:36, 1 user, load average: 0.76, 0.49, 0.47
Connection to crdb1c closed.

Processing crdb1d
I170503 23:57:57.642682 1 util/log/clog.go:990 [config] file created at: 2017/05/03 23:57:57
I170503 23:57:57.642682 1 util/log/clog.go:990 [config] running on machine: crdb1d
I170503 23:57:57.642682 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170503 23:57:57.642682 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=dfw]
01:41:37 up 2:09, 2 users, load average: 1.68, 1.20, 1.01
Connection to crdb1d closed.

Processing crdb1e
I170503 23:36:48.702122 1 util/log/clog.go:990 [config] file created at: 2017/05/03 23:36:48
I170503 23:36:48.702122 1 util/log/clog.go:990 [config] running on machine: crdb1e
I170503 23:36:48.702122 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170503 23:36:48.702122 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=nyc]
01:41:38 up 2:07, 1 user, load average: 0.33, 0.26, 0.24
Connection to crdb1e closed.

Processing crdb1f
I170504 01:40:56.496240 138 util/log/clog.go:887 [config] file created at: 2017/05/04 01:40:56
I170504 01:40:56.496240 138 util/log/clog.go:887 [config] running on machine: crdb1f
I170504 01:40:56.496240 138 util/log/clog.go:887 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170504 01:40:56.496240 138 util/log/clog.go:887 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=atl]
01:41:38 up 8 days, 4:20, 1 user, load average: 1.26, 1.55, 1.77
Connection to crdb1f closed.

Processing crdb1h
I170502 23:48:12.067698 1 util/log/clog.go:990 [config] file created at: 2017/05/02 23:48:12
I170502 23:48:12.067698 1 util/log/clog.go:990 [config] running on machine: crdb1h
I170502 23:48:12.067698 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170502 23:48:12.067698 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=slc]
01:41:39 up 8 days, 4:20, 1 user, load average: 0.06, 0.09, 0.13
Connection to crdb1h closed.

Processing crdb2h
I170502 23:48:12.919771 1 util/log/clog.go:990 [config] file created at: 2017/05/02 23:48:12
I170502 23:48:12.919771 1 util/log/clog.go:990 [config] running on machine: crdb2h
I170502 23:48:12.919771 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170502 23:48:12.919771 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=slc]
01:41:41 up 8 days, 1:02, 1 user, load average: 0.50, 0.27, 0.31
Connection to crdb2h closed.

Processing crdb3h
I170503 00:40:25.760618 92 util/log/clog.go:887 [config] file created at: 2017/05/03 00:40:25
I170503 00:40:25.760618 92 util/log/clog.go:887 [config] running on machine: crdb3h
I170503 00:40:25.760618 92 util/log/clog.go:887 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170503 00:40:25.760618 92 util/log/clog.go:887 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=slc]
01:41:42 up 8 days, 4:19, 2 users, load average: 1.18, 1.27, 1.30
Connection to crdb3h closed.

$ dogroup "cockroach node status" crdb1a

Processing crdb1a 
+----+---------------------------+-----------------+---------------------+---------------------+------------+-----------+-------------+--------------+--------------+------------------+-----------------------+--------+--------------------+------------------------+
| id | address | build | updated_at | started_at | live_bytes | key_bytes | value_bytes | intent_bytes | system_bytes | replicas_leaders | replicas_leaseholders | ranges | ranges_unavailable | ranges_underreplicated |
+----+---------------------------+-----------------+---------------------+---------------------+------------+-----------+-------------+--------------+--------------+------------------+-----------------------+--------+--------------------+------------------------+
| 1 | crdb1a.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:12 | 2017-05-02 23:48:32 | 1159428044 | 211872481 | 1185351247 | 128670 | 2261085 | 121 | 119 | 137 | 0 | 0 |
| 2 | crdb2a.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:13 | 2017-05-02 23:48:32 | 1555609906 | 132124895 | 1584360723 | 0 | 1403991 | 30 | 30 | 30 | 0 | 0 |
| 3 | crdb3a.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:08 | 2017-05-03 23:25:07 | 489122968 | 147093469 | 527490655 | 0 | 7143859 | 41 | 41 | 70 | 0 | 0 |
| 4 | crdb4a.ces.cvnt.net:26257 | beta-20170420 | 2017-04-26 00:42:15 | 2017-04-26 00:32:54 | 414197579 | 263132629 | 443540902 | 56 | 322549 | 65 | 65 | 80 | 0 | 0 |
| 5 | crdb5a.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:07 | 2017-05-02 23:48:32 | 1254336608 | 154373631 | 1288792963 | 0 | 1483634 | 110 | 109 | 110 | 0 | 0 |
| 6 | crdb1c.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:10 | 2017-05-02 23:48:11 | 1258763227 | 236980813 | 1302208729 | 7200 | 1623267 | 31 | 30 | 31 | 0 | 0 | 
| 7 | crdb1d.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:43:49 | 2017-05-03 23:57:59 | 853076729 | 112511399 | 858024729 | 0 | 1068483 | 30 | 23 | 44 | 0 | 0 | 
| 8 | crdb1f.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:07 | 2017-05-02 23:48:56 | 1343106828 | 204336461 | 1358733187 | 0 | 7534479 | 90 | 89 | 90 | 0 | 0 | 
| 9 | crdb1h.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:07 | 2017-05-02 23:48:32 | 1045219625 | 170285764 | 1108588917 | 0 | 1492912 | 47 | 47 | 47 | 0 | 0 | 
| 10 | crdb4a.ces.cvnt.net:26257 | beta-20170420 | 2017-04-26 00:37:06 | 2017-04-26 00:16:06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 11 | crdb3h.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:10 | 2017-05-02 23:48:15 | 1697832015 | 269914723 | 1681637990 | 0 | 1783742 | 88 | 86 | 89 | 0 | 1 | 
| 12 | crdb1e.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:10 | 2017-05-03 23:36:50 | 837064696 | 58057801 | 818900087 | 0 | 1015956 | 28 | 27 | 50 | 0 | 0 | 
| 13 | crdb4a.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:10 | 2017-05-02 23:48:11 | 691725075 | 220581452 | 677119482 | 135870 | 2472971 | 104 | 103 | 104 | 0 | 0 | 
| 14 | crdb2h.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:10 | 2017-05-02 23:48:13 | 893348467 | 144365663 | 911573214 | 135870 | 7976365 | 40 | 16 | 40 | 0 | 0 | 
+----+---------------------------+-----------------+---------------------+---------------------+------------+-----------+-------------+--------------+--------------+------------------+-----------------------+--------+--------------------+------------------------+ 
(14 rows) 
Connection to crdb1a closed.
@BramGruneir
Member Author

5 nodes at ohq
1 node at ods
1 node at dfw
1 node at nyc
1 node at atl
3 nodes at slc

@jlauro

jlauro commented May 4, 2017

heap.zip

All nodes originally had 8GB of RAM, but nodes crdb1d and crdb1e were upgraded to 16GB of RAM because they kept using more than 8GB, and that was not resolved by restarting.

@a-robinson
Contributor

Thanks for the heap profiles, @jlauro! You said that these heap profiles are from after the cluster was idle for some time? If not, could you expand on what workload you were running? Did you run a restore from a backup?

crdb1d and crdb1e are both showing the same memory bloat from raftpb.(*Entry).Unmarshal as in #15681. cc @danhhz @petermattis

@a-robinson
Contributor

What's odd to me is that you say the memory usage wasn't resolved by restarting. That would imply that it isn't just a memory leak staying around forever, but something happening every time the node restarts, which is much stranger.

@jlauro

jlauro commented May 4, 2017

Yes, the cluster is idle... at least in terms of clients sending any sort of traffic to it. However, most of the nodes are actively exercising CPU/network. As to what they are working on moving around, I have no idea... but I should be able to copy the entire dataset to all the nodes from all the nodes (n x n) in far less time than it's been idle... Load isn't bad on some nodes, while others are using 150+% CPU (dual-CPU VM).

How much background load should be expected when "idle"?

@jlauro

jlauro commented May 4, 2017

image
image
I don't know why... but a couple of nodes (4a and 5a) with 8GB decided to start swapping... all nodes were idle from external activity (I may have done a show database command on a couple of nodes), and I collected the heap stats (I did that on all nodes, but not all nodes showed an increase in memory).

@a-robinson
Contributor

Would you mind sharing some of the graphs from the admin UI over the same time period?

@jlauro

jlauro commented May 4, 2017

I think restarting with only 8GB does resolve it short term, but it did seem to be the same nodes that were prone to using more memory despite being idle. I can wipe all the data and recreate it with a fresh import if you think that might help clean out any bad data from beta versions or from crashes / ungraceful kills (that would also clear out the 2 dead nodes).

@jlauro

jlauro commented May 4, 2017

Which graphs would you like? And from the cluster, or from specific nodes?

@a-robinson
Contributor

There could be raft snapshots getting sent to the two nodes. It's also possible that very uneven latencies between the datacenters (some being much closer together than others) could be making for a lot of lease transfers, which would explain seeing a bunch of raft leadership transfers. I'm not sure why the memory wouldn't be getting GC'ed faster, but long enough delays between GC cycles might explain things.

@a-robinson
Contributor

I'd be curious to see the replication queue and raftlog queue from the Queues page. Leaseholders per store on the Replication page would also be interesting. The memory usage graph from the runtime page and the SQL byte traffic and SQL queries graphs from the SQL page would also be helpful.

@BramGruneir
Member Author

I set up a cluster locally with the same locality settings and put some load on it, then let it sit. So without the latencies between nodes, I didn't see any lease or replica rebalancing thrashing.

To start, how about the cluster-wide overview, runtime and replication views, and the same for the nodes with high memory usage?

@a-robinson
Contributor

If I could connect to the admin UI myself, that would speed things up a bit, but I really appreciate your quick turnaround times!

@a-robinson
Contributor

a-robinson commented May 4, 2017

To be a little more clear, I'm worried you might be running into #15369 if the latencies between your datacenters differ by a lot, and am hoping to determine how many lease transfers and snapshots are happening in the cluster. If you want to partially test that hypothesis out, you could try running SET CLUSTER SETTING kv.allocator.load_based_lease_rebalancing.enabled = false; from a SQL shell and see if things improve.

@a-robinson
Contributor

Knowing the latencies between your datacenters would help me try to repro. I'm not really sure where "ohq" or "ods" might be. I'm sorry we don't have better tooling yet for extracting all the relevant info (cc #13984).

If you do just want to do a single data dump to avoid more interruptions to your day, running cockroach debug zip would go a long way towards helping me understand what's happening with minimal effort on your part.

@jlauro

jlauro commented May 4, 2017

Ok, I had to set up a temporary tunnel to reach it from the internet... https://208.184.77.179:8080
It's forwarded to the GUI on crdb1e. No real data on the boxes...

@jlauro

jlauro commented May 4, 2017

I think one of the nodes died after it decided to start taking too much memory... Want me to restart cockroach on crdb4a and crdb5a, and/or increase the memory on them or on all nodes?

@a-robinson
Contributor

Thanks! It doesn't look like what's going on is related to lease transfers.

I don't care too much what you do with crdb4a and crdb5a, but could you possibly enable remote debugging by running SET CLUSTER SETTING server.remote_debugging.mode = any;?

@jlauro

jlauro commented May 4, 2017

crdb4a crashed hard... had to reboot the node, even the console wasn't responding... Increased its RAM to 16GB and it's back up.
crdb5a is having trouble stopping... Had to do a killall after issuing a quit... cockroach is back up (still at 8GB on that node).

I ran cockroach debug zip, but it created a 23MB file and uploads here are capped at 10MB...
I put in the SET CLUSTER SETTING server.remote_debugging.mode = 'any'; Do you still need the zip?

@jlauro

jlauro commented May 4, 2017

Here is full-mesh latency of 100 packets per check:
Processing crdb1a
crdb1a to 1a rtt min/avg/max/mdev = 0.002/0.004/0.015/0.002 ms, ipg/ewma 0.025/0.004 ms
crdb1a to 1c rtt min/avg/max/mdev = 0.213/0.281/1.048/0.091 ms, ipg/ewma 0.641/0.309 ms
crdb1a to 1d rtt min/avg/max/mdev = 46.212/46.958/53.371/1.129 ms, pipe 2, ipg/ewma 47.406/47.519 ms
crdb1a to 1e rtt min/avg/max/mdev = 27.852/28.523/31.381/0.740 ms, pipe 2, ipg/ewma 28.790/28.512 ms
crdb1a to 1f rtt min/avg/max/mdev = 34.181/35.731/48.213/2.257 ms, pipe 2, ipg/ewma 35.885/35.099 ms
crdb1a to 1h rtt min/avg/max/mdev = 67.795/69.631/83.190/2.212 ms, pipe 2, ipg/ewma 69.955/68.466 ms
crdb1a to 2a rtt min/avg/max/mdev = 0.118/0.153/0.530/0.048 ms, ipg/ewma 0.627/0.140 ms
crdb1a to 2h rtt min/avg/max/mdev = 67.813/69.434/85.625/2.594 ms, pipe 2, ipg/ewma 69.561/68.528 ms
crdb1a to 3a rtt min/avg/max/mdev = 0.048/0.076/0.138/0.016 ms, ipg/ewma 0.186/0.076 ms
crdb1a to 3h rtt min/avg/max/mdev = 67.803/69.304/76.788/1.632 ms, pipe 2, ipg/ewma 69.626/68.496 ms
crdb1a to 4a rtt min/avg/max/mdev = 0.079/0.112/0.216/0.029 ms, ipg/ewma 0.411/0.105 ms
crdb1a to 5a rtt min/avg/max/mdev = 0.069/0.094/0.199/0.022 ms, ipg/ewma 0.272/0.094 ms
Connection to crdb1a closed.

Processing crdb2a
crdb2a to 1a rtt min/avg/max/mdev = 0.069/0.115/1.120/0.102 ms, ipg/ewma 0.179/0.102 ms
crdb2a to 1c rtt min/avg/max/mdev = 0.253/0.314/0.540/0.038 ms, ipg/ewma 0.389/0.309 ms
crdb2a to 1d rtt min/avg/max/mdev = 46.184/46.758/49.385/0.610 ms, pipe 2, ipg/ewma 47.340/47.072 ms
crdb2a to 1e rtt min/avg/max/mdev = 27.958/28.318/30.107/0.340 ms, pipe 2, ipg/ewma 28.721/28.364 ms
crdb2a to 1f rtt min/avg/max/mdev = 34.077/36.133/50.380/2.795 ms, pipe 2, ipg/ewma 36.598/34.951 ms
crdb2a to 1h rtt min/avg/max/mdev = 67.851/68.563/75.777/1.314 ms, pipe 2, ipg/ewma 68.656/69.016 ms
crdb2a to 2a rtt min/avg/max/mdev = 0.002/0.003/0.010/0.002 ms, ipg/ewma 0.024/0.003 ms
crdb2a to 2h rtt min/avg/max/mdev = 67.837/68.576/79.117/1.552 ms, pipe 2, ipg/ewma 68.699/68.908 ms
crdb2a to 3a rtt min/avg/max/mdev = 0.044/0.079/0.144/0.023 ms, ipg/ewma 0.152/0.098 ms
crdb2a to 3h rtt min/avg/max/mdev = 67.860/68.565/75.681/1.239 ms, pipe 2, ipg/ewma 68.717/68.755 ms
crdb2a to 4a rtt min/avg/max/mdev = 0.078/0.116/0.205/0.024 ms, ipg/ewma 0.211/0.113 ms
crdb2a to 5a rtt min/avg/max/mdev = 0.081/0.115/0.187/0.026 ms, ipg/ewma 0.210/0.109 ms
Connection to crdb2a closed.

Processing crdb3a
crdb3a to 1a rtt min/avg/max/mdev = 0.055/0.065/0.112/0.012 ms, ipg/ewma 0.249/0.059 ms
crdb3a to 1c rtt min/avg/max/mdev = 0.197/0.247/0.360/0.038 ms, ipg/ewma 0.338/0.250 ms
crdb3a to 1d rtt min/avg/max/mdev = 46.149/47.143/73.766/2.927 ms, pipe 2, ipg/ewma 47.269/47.213 ms
crdb3a to 1e rtt min/avg/max/mdev = 27.817/28.416/33.146/0.890 ms, pipe 2, ipg/ewma 28.557/28.151 ms
crdb3a to 1f rtt min/avg/max/mdev = 34.068/35.913/46.453/2.551 ms, pipe 2, ipg/ewma 35.914/35.558 ms
crdb3a to 1h rtt min/avg/max/mdev = 67.789/68.663/90.981/2.712 ms, pipe 2, ipg/ewma 68.817/68.302 ms
crdb3a to 2a rtt min/avg/max/mdev = 0.064/0.083/0.160/0.017 ms, ipg/ewma 0.166/0.080 ms
crdb3a to 2h rtt min/avg/max/mdev = 67.786/68.612/82.621/2.248 ms, pipe 2, ipg/ewma 68.806/68.164 ms
crdb3a to 3a rtt min/avg/max/mdev = 0.001/0.002/0.010/0.002 ms, ipg/ewma 0.013/0.002 ms
crdb3a to 3h rtt min/avg/max/mdev = 67.759/68.619/82.745/2.266 ms, pipe 2, ipg/ewma 68.804/68.084 ms
crdb3a to 4a rtt min/avg/max/mdev = 0.060/0.073/0.318/0.030 ms, ipg/ewma 0.108/0.079 ms
crdb3a to 5a rtt min/avg/max/mdev = 0.050/0.061/0.095/0.012 ms, ipg/ewma 0.105/0.064 ms
Connection to crdb3a closed.

Processing crdb4a
crdb4a to 1a rtt min/avg/max/mdev = 0.077/0.103/0.384/0.041 ms, ipg/ewma 0.257/0.097 ms
crdb4a to 1c rtt min/avg/max/mdev = 0.227/0.297/0.500/0.052 ms, ipg/ewma 0.935/0.291 ms
crdb4a to 1d rtt min/avg/max/mdev = 46.343/113.943/3824.315/468.884 ms, pipe 4, ipg/ewma 373.310/47.235 ms
crdb4a to 1e rtt min/avg/max/mdev = 27.968/28.550/32.256/0.730 ms, pipe 2, ipg/ewma 28.937/28.644 ms
crdb4a to 1f rtt min/avg/max/mdev = 34.430/38.957/70.372/5.623 ms, pipe 2, ipg/ewma 39.487/38.325 ms
crdb4a to 1h rtt min/avg/max/mdev = 67.896/69.741/84.099/2.240 ms, pipe 2, ipg/ewma 71.167/69.062 ms
crdb4a to 2a rtt min/avg/max/mdev = 0.103/0.138/0.593/0.078 ms, ipg/ewma 0.205/0.121 ms
crdb4a to 2h rtt min/avg/max/mdev = 67.924/69.759/74.998/1.833 ms, pipe 2, ipg/ewma 70.379/68.791 ms
crdb4a to 3a rtt min/avg/max/mdev = 0.049/0.075/0.237/0.024 ms, ipg/ewma 0.468/0.077 ms
crdb4a to 3h rtt min/avg/max/mdev = 67.800/69.751/84.183/2.265 ms, pipe 2, ipg/ewma 71.215/68.655 ms
crdb4a to 4a rtt min/avg/max/mdev = 0.002/0.004/0.017/0.002 ms, ipg/ewma 0.035/0.003 ms
crdb4a to 5a rtt min/avg/max/mdev = 0.068/0.092/0.189/0.021 ms, ipg/ewma 0.175/0.097 ms
Connection to crdb4a closed.

Processing crdb5a
crdb5a to 1a rtt min/avg/max/mdev = 0.085/0.117/0.253/0.022 ms, ipg/ewma 0.567/0.126 ms
crdb5a to 1c rtt min/avg/max/mdev = 0.256/0.344/1.218/0.105 ms, ipg/ewma 0.928/0.319 ms
crdb5a to 1d rtt min/avg/max/mdev = 46.112/46.819/49.778/0.666 ms, pipe 2, ipg/ewma 47.826/46.644 ms
crdb5a to 1e rtt min/avg/max/mdev = 27.903/28.419/30.363/0.498 ms, pipe 2, ipg/ewma 28.538/28.448 ms
crdb5a to 1f rtt min/avg/max/mdev = 34.102/35.970/44.359/2.015 ms, pipe 2, ipg/ewma 36.238/36.454 ms
crdb5a to 1h rtt min/avg/max/mdev = 67.843/69.039/85.161/2.886 ms, pipe 2, ipg/ewma 70.117/68.824 ms
crdb5a to 2a rtt min/avg/max/mdev = 0.094/0.151/0.279/0.036 ms, ipg/ewma 0.560/0.158 ms
crdb5a to 2h rtt min/avg/max/mdev = 67.824/68.964/85.272/2.878 ms, pipe 2, ipg/ewma 69.932/68.543 ms
crdb5a to 3a rtt min/avg/max/mdev = 0.056/0.093/0.221/0.021 ms, ipg/ewma 0.424/0.092 ms
crdb5a to 3h rtt min/avg/max/mdev = 67.825/68.996/85.190/2.893 ms, pipe 2, ipg/ewma 70.145/68.677 ms
crdb5a to 4a rtt min/avg/max/mdev = 0.085/0.128/0.270/0.031 ms, ipg/ewma 0.622/0.139 ms
crdb5a to 5a rtt min/avg/max/mdev = 0.002/0.003/0.012/0.002 ms, ipg/ewma 0.151/0.003 ms
Connection to crdb5a closed.

Processing crdb1c
crdb1c to 1a rtt min/avg/max/mdev = 0.221/0.297/0.773/0.075 ms, ipg/ewma 0.635/0.268 ms
crdb1c to 1c rtt min/avg/max/mdev = 0.001/0.001/0.027/0.003 ms, ipg/ewma 0.009/0.001 ms
crdb1c to 1d rtt min/avg/max/mdev = 46.144/49.104/172.583/15.010 ms, pipe 4, ipg/ewma 48.202/59.294 ms
crdb1c to 1e rtt min/avg/max/mdev = 27.898/28.288/29.980/0.421 ms, pipe 2, ipg/ewma 28.412/28.520 ms
crdb1c to 1f rtt min/avg/max/mdev = 34.227/36.328/46.970/2.496 ms, pipe 2, ipg/ewma 36.628/37.767 ms
crdb1c to 1h rtt min/avg/max/mdev = 67.778/68.478/82.702/1.803 ms, pipe 2, ipg/ewma 68.412/69.919 ms
crdb1c to 2a rtt min/avg/max/mdev = 0.244/0.335/0.519/0.065 ms, ipg/ewma 0.570/0.356 ms
crdb1c to 2h rtt min/avg/max/mdev = 67.806/68.492/75.988/1.210 ms, pipe 2, ipg/ewma 68.755/68.726 ms
crdb1c to 3a rtt min/avg/max/mdev = 0.217/0.291/0.488/0.053 ms, ipg/ewma 0.353/0.270 ms
crdb1c to 3h rtt min/avg/max/mdev = 67.752/68.312/72.480/0.777 ms, pipe 2, ipg/ewma 68.655/68.602 ms
crdb1c to 4a rtt min/avg/max/mdev = 0.257/0.339/0.539/0.061 ms, ipg/ewma 0.639/0.319 ms
crdb1c to 5a rtt min/avg/max/mdev = 0.241/0.321/0.610/0.060 ms, ipg/ewma 0.418/0.287 ms
Connection to crdb1c closed.

Processing crdb1d
crdb1d to 1a rtt min/avg/max/mdev = 46.067/47.051/57.269/1.619 ms, pipe 2, ipg/ewma 52.063/46.887 ms
crdb1d to 1c rtt min/avg/max/mdev = 46.116/47.004/57.309/1.628 ms, pipe 2, ipg/ewma 52.659/46.934 ms
crdb1d to 1d rtt min/avg/max/mdev = 0.009/0.015/0.042/0.008 ms, ipg/ewma 0.358/0.019 ms
crdb1d to 1e rtt min/avg/max/mdev = 42.979/43.727/54.702/1.556 ms, pipe 2, ipg/ewma 48.125/43.572 ms
crdb1d to 1f rtt min/avg/max/mdev = 26.520/30.171/43.099/4.436 ms, pipe 2, ipg/ewma 35.350/27.881 ms
crdb1d to 1h rtt min/avg/max/mdev = 37.001/38.454/60.040/2.593 ms, pipe 2, ipg/ewma 43.359/37.531 ms
crdb1d to 2a rtt min/avg/max/mdev = 46.260/47.176/55.403/1.424 ms, pipe 2, ipg/ewma 52.338/47.170 ms
crdb1d to 2h rtt min/avg/max/mdev = 37.084/38.455/59.391/2.508 ms, pipe 2, ipg/ewma 43.383/37.576 ms
crdb1d to 3a rtt min/avg/max/mdev = 46.152/47.145/57.231/1.724 ms, pipe 2, ipg/ewma 51.879/47.078 ms
crdb1d to 3h rtt min/avg/max/mdev = 37.070/38.419/59.923/2.442 ms, pipe 2, ipg/ewma 43.693/37.458 ms
crdb1d to 4a rtt min/avg/max/mdev = 46.224/47.204/57.359/1.649 ms, pipe 2, ipg/ewma 52.151/47.020 ms
crdb1d to 5a rtt min/avg/max/mdev = 46.104/47.073/55.340/1.356 ms, pipe 2, ipg/ewma 51.700/47.143 ms
Connection to crdb1d closed.

Processing crdb1e
crdb1e to 1a rtt min/avg/max/mdev = 27.836/28.185/30.265/0.460 ms, pipe 2, ipg/ewma 28.174/28.159 ms
crdb1e to 1c rtt min/avg/max/mdev = 27.820/28.069/28.825/0.296 ms, pipe 2, ipg/ewma 28.071/28.055 ms
crdb1e to 1d rtt min/avg/max/mdev = 42.871/43.510/56.765/1.604 ms, pipe 2, ipg/ewma 43.407/44.602 ms
crdb1e to 1e rtt min/avg/max/mdev = 0.002/0.002/0.015/0.002 ms, ipg/ewma 0.015/0.003 ms
crdb1e to 1f rtt min/avg/max/mdev = 19.883/24.026/39.281/4.211 ms, pipe 2, ipg/ewma 23.911/22.934 ms
crdb1e to 1h rtt min/avg/max/mdev = 49.424/50.151/63.533/2.064 ms, pipe 2, ipg/ewma 50.261/49.816 ms
crdb1e to 2a rtt min/avg/max/mdev = 27.869/28.214/29.723/0.380 ms, pipe 2, ipg/ewma 28.263/28.174 ms
crdb1e to 2h rtt min/avg/max/mdev = 49.384/50.074/62.645/1.987 ms, pipe 2, ipg/ewma 50.230/49.861 ms
crdb1e to 3a rtt min/avg/max/mdev = 27.836/28.131/29.913/0.373 ms, pipe 2, ipg/ewma 28.101/28.088 ms
crdb1e to 3h rtt min/avg/max/mdev = 49.374/50.121/63.593/2.057 ms, pipe 2, ipg/ewma 50.264/49.821 ms
crdb1e to 4a rtt min/avg/max/mdev = 27.932/28.304/31.235/0.413 ms, pipe 2, ipg/ewma 28.205/28.434 ms
crdb1e to 5a rtt min/avg/max/mdev = 27.845/28.208/29.460/0.332 ms, pipe 2, ipg/ewma 28.109/28.249 ms
Connection to crdb1e closed.

Processing crdb1f
crdb1f to 1a rtt min/avg/max/mdev = 33.949/35.967/45.261/2.410 ms, pipe 2, ipg/ewma 41.592/34.802 ms
crdb1f to 1c rtt min/avg/max/mdev = 34.014/36.011/46.030/2.424 ms, pipe 2, ipg/ewma 41.650/34.665 ms
crdb1f to 1d rtt min/avg/max/mdev = 26.524/29.557/51.015/4.451 ms, pipe 2, ipg/ewma 35.581/27.102 ms
crdb1f to 1e rtt min/avg/max/mdev = 19.814/22.185/31.048/2.842 ms, pipe 2, ipg/ewma 27.545/20.381 ms
crdb1f to 1f rtt min/avg/max/mdev = 0.005/0.009/0.260/0.025 ms, ipg/ewma 0.053/0.006 ms
crdb1f to 1h rtt min/avg/max/mdev = 61.737/63.960/73.065/2.810 ms, pipe 2, ipg/ewma 69.409/66.086 ms
crdb1f to 2a rtt min/avg/max/mdev = 34.162/36.273/46.964/2.578 ms, pipe 2, ipg/ewma 41.997/34.972 ms
crdb1f to 2h rtt min/avg/max/mdev = 61.773/64.098/78.777/2.836 ms, pipe 2, ipg/ewma 69.291/65.236 ms
crdb1f to 3a rtt min/avg/max/mdev = 33.988/36.053/46.311/2.662 ms, pipe 2, ipg/ewma 41.123/34.732 ms
crdb1f to 3h rtt min/avg/max/mdev = 61.820/63.919/73.285/2.416 ms, pipe 2, ipg/ewma 67.652/64.966 ms
crdb1f to 4a rtt min/avg/max/mdev = 34.199/36.061/43.788/2.018 ms, pipe 2, ipg/ewma 42.023/34.922 ms
crdb1f to 5a rtt min/avg/max/mdev = 34.088/36.061/46.296/2.559 ms, pipe 2, ipg/ewma 41.631/34.934 ms
Connection to crdb1f closed.

Processing crdb1h
Connection to crdb1h closed.
crdb1h to 1a rtt min/avg/max/mdev = 67.807/69.040/81.870/2.149 ms, pipe 2, ipg/ewma 68.959/69.010 ms
crdb1h to 1c rtt min/avg/max/mdev = 67.820/69.327/94.351/3.658 ms, pipe 2, ipg/ewma 69.280/68.334 ms
crdb1h to 1d rtt min/avg/max/mdev = 36.990/38.371/56.651/2.843 ms, pipe 2, ipg/ewma 38.512/37.794 ms
crdb1h to 1e rtt min/avg/max/mdev = 49.391/50.463/68.724/2.409 ms, pipe 2, ipg/ewma 50.387/50.585 ms
crdb1h to 1f rtt min/avg/max/mdev = 61.893/66.974/85.089/5.404 ms, pipe 2, ipg/ewma 66.646/64.240 ms
crdb1h to 1h rtt min/avg/max/mdev = 0.001/0.002/0.023/0.002 ms, ipg/ewma 0.041/0.001 ms
crdb1h to 2a rtt min/avg/max/mdev = 67.863/69.767/91.404/3.845 ms, pipe 2, ipg/ewma 69.698/69.441 ms
crdb1h to 2h rtt min/avg/max/mdev = 0.038/0.055/0.104/0.015 ms, ipg/ewma 0.107/0.056 ms
crdb1h to 3a rtt min/avg/max/mdev = 67.760/69.230/92.245/3.793 ms, pipe 2, ipg/ewma 69.259/68.197 ms
crdb1h to 3h rtt min/avg/max/mdev = 0.036/0.052/0.128/0.015 ms, ipg/ewma 0.121/0.043 ms
crdb1h to 4a rtt min/avg/max/mdev = 67.872/68.799/79.694/1.591 ms, pipe 2, ipg/ewma 68.760/68.564 ms
crdb1h to 5a rtt min/avg/max/mdev = 67.787/69.195/93.098/3.321 ms, pipe 2, ipg/ewma 69.277/68.198 ms

Processing crdb2h
crdb2h to 1a rtt min/avg/max/mdev = 67.749/69.176/74.964/1.654 ms, pipe 2, ipg/ewma 69.726/68.105 ms
crdb2h to 1c rtt min/avg/max/mdev = 67.712/69.191/74.776/1.536 ms, pipe 2, ipg/ewma 69.721/68.411 ms
crdb2h to 1d rtt min/avg/max/mdev = 37.071/40.363/67.051/4.588 ms, pipe 2, ipg/ewma 40.818/40.922 ms
crdb2h to 1e rtt min/avg/max/mdev = 49.493/51.492/61.137/2.118 ms, pipe 2, ipg/ewma 52.237/50.770 ms
crdb2h to 1f rtt min/avg/max/mdev = 61.644/64.754/87.812/3.658 ms, pipe 2, ipg/ewma 65.212/66.709 ms
crdb2h to 1h rtt min/avg/max/mdev = 0.032/0.052/0.127/0.018 ms, ipg/ewma 0.078/0.043 ms
crdb2h to 2a rtt min/avg/max/mdev = 67.852/69.751/77.466/1.901 ms, pipe 2, ipg/ewma 70.376/68.385 ms
crdb2h to 2h rtt min/avg/max/mdev = 0.001/0.001/0.010/0.001 ms, ipg/ewma 0.042/0.002 ms
crdb2h to 3a rtt min/avg/max/mdev = 67.744/69.153/77.282/1.610 ms, pipe 2, ipg/ewma 69.681/68.188 ms
crdb2h to 3h rtt min/avg/max/mdev = 0.037/0.065/0.181/0.021 ms, ipg/ewma 0.153/0.062 ms
crdb2h to 4a rtt min/avg/max/mdev = 67.912/69.769/77.456/1.859 ms, pipe 2, ipg/ewma 70.317/68.360 ms
crdb2h to 5a rtt min/avg/max/mdev = 67.799/69.509/76.168/1.789 ms, pipe 2, ipg/ewma 70.128/68.977 ms
Connection to crdb2h closed.

Processing crdb3h
crdb3h to 1a rtt min/avg/max/mdev = 67.776/68.878/81.889/1.927 ms, pipe 2, ipg/ewma 72.904/69.471 ms
crdb3h to 1c rtt min/avg/max/mdev = 67.760/69.313/90.997/3.550 ms, pipe 2, ipg/ewma 73.449/69.897 ms
crdb3h to 1d rtt min/avg/max/mdev = 37.039/38.337/54.687/2.880 ms, pipe 2, ipg/ewma 43.732/38.990 ms
crdb3h to 1e rtt min/avg/max/mdev = 49.952/51.439/70.076/3.568 ms, pipe 2, ipg/ewma 55.838/50.935 ms
crdb3h to 1f rtt min/avg/max/mdev = 61.813/64.737/84.827/4.051 ms, pipe 2, ipg/ewma 69.550/65.666 ms
crdb3h to 1h rtt min/avg/max/mdev = 0.035/0.063/0.316/0.035 ms, ipg/ewma 1.854/0.050 ms
crdb3h to 2a rtt min/avg/max/mdev = 67.886/69.469/90.958/3.429 ms, pipe 2, ipg/ewma 73.959/70.191 ms
crdb3h to 2h rtt min/avg/max/mdev = 0.039/0.084/0.471/0.060 ms, ipg/ewma 1.879/0.071 ms
crdb3h to 3a rtt min/avg/max/mdev = 67.711/69.466/85.548/3.107 ms, pipe 2, ipg/ewma 73.397/71.089 ms
crdb3h to 3h rtt min/avg/max/mdev = 0.001/0.002/0.021/0.002 ms, ipg/ewma 0.154/0.002 ms
crdb3h to 4a rtt min/avg/max/mdev = 67.897/69.556/91.022/3.737 ms, pipe 2, ipg/ewma 73.960/70.044 ms
crdb3h to 5a rtt min/avg/max/mdev = 67.768/69.486/85.602/3.100 ms, pipe 2, ipg/ewma 73.273/69.914 ms
Connection to crdb3h closed.

@a-robinson
Contributor

Thanks for opening up the debug pages, @jlauro! I haven't fully figured things out, but I just wanted to let you know that I've been actively looking into it this morning, and am starting to get a handle on what's going on.

@a-robinson a-robinson self-assigned this May 5, 2017
@a-robinson
Contributor

Ok, so here's a summary of what's going on.

  • There's at least one range that keeps having raft elections without ever settling on a leader. The three replicas choose a leader, and the new leader goes through the same raft.becomeLeader stack trace that @danhhz called out in #15681 (backup: OOM during TPC-H scalefactor=10 restore running rc2). I've pasted an example stack trace with parameters below.
  • The new raft leader has to get all raft entries since the last entry that it knows has been committed on a quorum of nodes to check for configuration changes. For this range, that is entry 1547.
  • Although I'm not sure how it got to this state since I don't have log access, it turns out that this is a lot of entries to have to read. On node 12, which is what the SSH tunnel is pointing to, it's reading from 1547 to more than 49000. All these entries are getting pulled out of rocksdb via MVCCIterate and are causing a TON of MVCCMetadata and raftpb.Entry objects to get unmarshalled. I did a basic measurement using two heap profiles spaced 36 seconds apart, and in that short time there was 10.5GB of memory allocated while unmarshalling MVCCMetadata objects and 10.5GB allocated while unmarshalling raftpb.Entry objects.
  • The memory does get released, but each node is having to allocate so much memory so quickly that they effectively never clear it all out even though the garbage collector is working really hard.
  • Presumably this MVCC-iterating is taking so long that the leader fails to heartbeat its followers, leading to another election. The range is currently on raft term 115688, and it's continuing to steadily increase -- that's a lot of elections.

I don't know if something recently changed in raft or in our usage of it or if this has just always been around, but this is pretty absurd. I'll see if I can find what might have triggered this, but cc @petermattis @bdarnell in the meantime.

Debug page for the problematic range:
r137-2.pdf

github.com/cockroachdb/cockroach/pkg/storage/engine.MVCCIterate(0x7f6153125540, 0xc420362a50, 0x27c7d80, 0xc51294fc20, 0xc55a5da680, 0x11, 0x20, 0xc55a5da6a0, 0x11, 0x20, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/mvcc.go:1689 +0x6d6
github.com/cockroachdb/cockroach/pkg/storage.iterateEntries(0x7f6153125540, 0xc420362a50, 0x27c7d80, 0xc51294fc20, 0x89, 0x60c, 0xc143, 0xc5f349ba40, 0xc50fca6000, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raftstorage.go:192 +0x145
github.com/cockroachdb/cockroach/pkg/storage.entries(0x7f6153125540, 0xc420362a50, 0x27c7d80, 0xc51294fc20, 0x89, 0xc4201b89a0, 0x60c, 0xc143, 0xffffffffffffffff, 0x28, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raftstorage.go:126 +0x3a2
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Entries(0xc420697200, 0x60c, 0xc143, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raftstorage.go:78 +0x179
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raftLog).slice(0xc420456380, 0x60c, 0xc143, 0xffffffffffffffff, 0x5af2a8, 0x10, 0x19780e0, 0x1, 0xc51294fc00)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/log.go:304 +0xcc
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raftLog).entries(0xc420456380, 0x60c, 0xffffffffffffffff, 0x7f61531b84b0, 0x0, 0xc50f959b30, 0xc4f4be07f0, 0x9a0b02)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/log.go:245 +0xbb
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).becomeLeader(0xc4200ec780)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:607 +0x10d
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.stepCandidate(0xc4200ec780, 0x6, 0x7, 0x9, 0x1c23d, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:1019 +0x55c
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc4200ec780, 0x6, 0x7, 0x9, 0x1c23d, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:778 +0x10f7
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*RawNode).Step(0xc420c73340, 0x6, 0x7, 0x9, 0x1c23d, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/rawnode.go:195 +0xc9
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest.func4(0xc420c73340, 0x7f615311f4d0, 0xc420012498, 0x7f6153125540)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3101 +0xb8
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked(0xc420697200, 0x1bc5f00, 0xc4f4be1838, 0xf3f45d, 0xc4206972f8)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:489 +0x13c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroup(0xc420697200, 0xc4f4be1838, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:506 +0x99
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest(0xc420628000, 0x7f6153125540, 0xc480abe0f0, 0xc632653080, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3102 +0xc47
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc420628000, 0x89)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3385 +0x172
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc420472000, 0xc4205f46e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:228 +0x24b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x7f6153125540, 0xc4201e6cc0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:168 +0x33
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4202f2f40, 0xc4205f46e0, 0xc420227f80)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:215 +0xf7
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:216 +0xad
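
In rough form, the allocation pattern behind the stack above is something like the following simplified Go sketch (a made-up valueIterator stands in for the MVCC scan; this is not CockroachDB's actual entries code): every value in the scanned span is decoded into a fresh raftpb.Entry and appended, so reading ~47,000 entries keeps the entire span live at once while churning through gigabytes of allocations.

import (
	"fmt"

	"github.com/coreos/etcd/raft/raftpb"
)

// valueIterator is a hypothetical stand-in for the storage scan over the raft log.
type valueIterator interface {
	Next() (value []byte, ok bool)
}

// readEntries decodes every raft log entry in [lo, hi). Each Unmarshal
// allocates a new Entry (including a copy of its Data payload), and the whole
// slice stays live until the caller is done with it.
func readEntries(iter valueIterator, lo, hi uint64) ([]raftpb.Entry, error) {
	ents := make([]raftpb.Entry, 0, hi-lo)
	for idx := lo; idx < hi; idx++ {
		value, ok := iter.Next()
		if !ok {
			break
		}
		var ent raftpb.Entry
		if err := ent.Unmarshal(value); err != nil {
			return nil, fmt.Errorf("entry %d: %v", idx, err)
		}
		ents = append(ents, ent)
	}
	return ents, nil
}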

@petermattis
Collaborator

Yowzer! That's a lot of unmarshaling! I'm not sure what can be done about this. The Raft leader needs to check for ConfChange entries. We could be mildly more intelligent by providing a specific interface for counting the number of ConfChange entries so we only have to keep 1 unmarshalled entry in memory at a time. Not sure how much that would help.

I'd love to know how the Raft group got into this state.
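
For what it's worth, the streaming interface described above might look roughly like this (an illustrative sketch, reusing the hypothetical valueIterator from the sketch earlier in the thread): only one decoded entry is live at a time, and we count ConfChange entries instead of materializing the whole slice.

import "github.com/coreos/etcd/raft/raftpb"

// countConfChanges decodes one raft log entry at a time and counts config
// changes; each Entry can be garbage-collected before the next one is decoded.
func countConfChanges(iter valueIterator) (int, error) {
	n := 0
	for {
		value, ok := iter.Next()
		if !ok {
			return n, nil
		}
		var ent raftpb.Entry
		if err := ent.Unmarshal(value); err != nil {
			return 0, err
		}
		if ent.Type == raftpb.EntryConfChange {
			n++
		}
	}
}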

@a-robinson
Contributor

@jlauro - to help us figure out how the range got into this state, could you go to node 12 (crdb1e) and run cockroach debug raft-log <data-directory> 137 (137 is the problematic range ID)? If the file is too big to put here, my email is alex at cockroachlabs.com.

@jlauro

jlauro commented May 5, 2017

Just to confirm, I have to shut cockroach down on that node first?

[root@crdb1e data]# cockroach debug raft-log /data/crdb/ 137
Error: could not open rocksdb instance: IO error: lock /data/crdb//LOCK: Resource temporarily unavailable
Failed running "debug"

@petermattis
Collaborator

Just to confirm, I have to shut cockroach down on that node first?

@jlauro Yes, that's correct.

@jlauro

jlauro commented May 5, 2017

The file is 36GB. Compressed, it is 218MB. Not sure if I can e-mail that large a file. I'll try...

@jlauro

jlauro commented May 5, 2017

Sending via Gmail... looks like it's going to convert it to Google Drive and send you a link.

@a-robinson
Contributor

Even if we can't do it on a per-replica basis or dynamically, putting in a knob to increase the raft election timeout will at least make it possible for a cluster to recover from situations like this, which is better than what we have now. And adding the knob wouldn't add any real risk to 1.0/1.0.1

@bdarnell
Contributor

bdarnell commented May 8, 2017

Can we adjust the Raft election timeout on a per-Replica basis?

Not easily, or at least not as you've described here - we're holding the raftMu too long on the leader, but the election timeout is used on the followers. If we have a way of letting the followers know that they should slow down their election timeouts, the best way to implement this is probably to skip calls to RawNode.Tick(). (doing this may have surprising interactions with CheckQuorum, though)

Is there a way to heartbeat even if a command is taking too long?

Maybe. We're already taking heartbeat generation mostly out of raft's hands with coalesced heartbeats, so we might be able to keep heartbeats flowing even when raft processing is blocked. I think there are probably dragons here, though.

Something else I just noticed: numOfPendingConf is counting the number of pending config changes in the unapplied portion of the Raft log. That is, the portion of the Raft log that is committed and will eventually be applied but hasn't been applied yet.

The unapplied portion of the log may include both committed and uncommitted entries. All the committed entries will become applied at the next handleRaftReady, but the uncommitted entries stay uncommitted until there has been at least one round-trip to the followers.

@danhhz Do you have a backup of the data from #15681 that you think might be the same root cause as this issue? I'd like to look at the raft logs to see what they have in common with the one from @jlauro.

@danhhz
Contributor

danhhz commented May 8, 2017

@bdarnell they're in ~/15681.tgz on navy 3

@petermattis
Collaborator

Not easily, or at least not as you've described here - we're holding the raftMu too long on the leader, but the election timeout is used on the followers. If we have a way of letting the followers know that they should slow down their election timeouts, the best way to implement this is probably to skip calls to RawNode.Tick(). (doing this may have surprising interactions with CheckQuorum, though)

My description was terse. I was imagining that if we detect the Replica.raftMu was held for longer than the election timeout and the current replica is the leader, we destroy and recreate the raft.RawNode with a larger timeout. This wouldn't help the new leader, but presumably another replica has already called for an election at this point.

Adjusting the number of calls to RawNode.Tick() is interesting, though scary.

Maybe. We're already taking heartbeat generation mostly out of raft's hands with coalesced heartbeats, so we might be able to keep heartbeats flowing even when raft processing is blocked. I think there are probably dragons here, though.

Seems tricky to do this. We don't even know that the replica is the leader when this is occurring and we can't determine that it is until the long running call finishes.

The unapplied portion of the log may include both committed and uncommitted entries. All the committed entries will become applied at the next handleRaftReady, but the uncommitted entries stay uncommitted until there has been at least one round-trip to the followers.

Ah, that's what I was missing.

@petermattis
Collaborator

Even if we can't do it on a per-replica basis or dynamically, putting in a knob to increase the raft election timeout will at least make it possible for a cluster to recover from situations like this, which is better than what we have now. And adding the knob wouldn't add any real risk to 1.0/1.0.1

Should we revert the removal of --raft-tick-interval (#15547)?

@a-robinson
Contributor

An env var might be better since it'll only really be recommended for people that have hit this problem, but I do think we should expose either it or RaftElectionTimeoutTicks. I was about to send out a PR for the latter, but I'd be fine with either.

@a-robinson
Contributor

Just so that we have some type of escape hatch if needed.

@petermattis
Collaborator

RaftElectionTimeoutTicks is probably a better knob. I'm fine with an env var.
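
For concreteness, the env-var flavor of that knob could be shaped like this (purely illustrative; the variable name and default below are my assumptions, not the actual CockroachDB configuration):

import (
	"os"
	"strconv"
)

// raftElectionTimeoutTicks returns an assumed default unless an operator
// overrides it via an environment variable (name and default are hypothetical).
func raftElectionTimeoutTicks() int {
	const defaultTicks = 15 // assumed default, for illustration only
	if v := os.Getenv("COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return defaultTicks
}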

@bdarnell
Contributor

bdarnell commented May 8, 2017

I think #13869 is going to be the long-term solution here to keep the reproposals from spiraling out of control.

@bdarnell
Contributor

bdarnell commented May 8, 2017

Commands are reproposed if they are not committed within 1-2 election timeouts, which defaults to 3-6 seconds. To commit a 40MB command in 3s requires network throughput of at least 13MB/sec to one of the followers, or twice that if requests to two followers get routed over the same link. (enabling RPC compression would really help for these DeleteRange commands). Our current GRPC window configuration allows 2MB to be transferred per stream per RTT, so for the 70ms RTT shown above we have a GRPC-imposed limit of ~30MB/sec/stream, if the underlying network can give us that much bandwidth. That's making significant demands of the network, so in addition to the generic flow control of #13869 we may want to start considering the size of pending commands before reproposing them.
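
(Spelling out the arithmetic above: 40 MB / 3 s ≈ 13.3 MB/s of sustained throughput to at least one follower, and a 2 MB window per 70 ms round trip caps a single gRPC stream at roughly 2 MB / 0.07 s ≈ 28.6 MB/s.)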

@petermattis
Collaborator

The proposal quota in #13869 is 1000 Raft log entries. I think we'd need to figure out how to make the proposal quota based on the size of the Raft log in order to address the issue here.

Cc @irfansharif
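
A byte-based pool could be shaped roughly like the sketch below (made-up names, not the actual quota pool from #13869): proposals acquire quota proportional to their encoded size, so a handful of 40MB commands drains the pool as fast as thousands of small entries would.

import "sync"

// quotaPool tracks available proposal quota in bytes (illustrative sketch).
type quotaPool struct {
	mu        sync.Mutex
	available int64
}

// tryAcquire reserves quota for a proposal of the given encoded size; the
// caller must wait and retry if it returns false.
func (q *quotaPool) tryAcquire(bytes int64) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if bytes > q.available {
		return false
	}
	q.available -= bytes
	return true
}

// release returns quota once the corresponding entries have been committed on
// a quorum and applied.
func (q *quotaPool) release(bytes int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.available += bytes
}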

@irfansharif
Contributor

The proposal quota in #13869 is 1000 Raft log entries. I think we'd need to figure out how to make the proposal quota based on the size of the Raft log in order to address the issue here.

Will have a PR out for review for that shortly; we can discuss/incorporate ideas for this there.

@bdarnell
Contributor

bdarnell commented May 9, 2017

Yeah, it definitely needs to be based on size instead of the number of entries.

Also, pulling on the thread a little further, even if they are not reproposed, commands this large are sufficient to cause heartbeats to be delayed, perhaps to the point of causing elections. Maybe MsgApp messages should use separate GRPC streams (similar to the way that MsgSnap used to). This would need a lot of testing (we want to minimize the chance that MsgApps are reordered with respect to each other, but I think raft will tolerate reorderings of different message types reasonably well)

@jseldess
Contributor

jseldess commented May 10, 2017

Documented this known limitation here: https://github.com/cockroachdb/docs/pull/1381/files#diff-b959712af7fa5ed1c822d0310313666eR11

@petermattis, @a-robinson, if I missed or misunderstood anything, please let me know.

@jlauro

jlauro commented May 10, 2017

I managed to break it again and crash crdb1a with rc2 (the cluster was rebuilt fresh).
I loaded test data and things were stable (I think). (I did have some large drop tables and reloads, but avoided delete and truncate.)
I was testing changing things such as:
echo num_replicas: 5 | cockroach zone set .default -f -
I reset it back to the default of 3.
intent_bytes never went down to 0. Not sure what that is?

crdb1a.ces.cvnt.net:/data/crdb/logs# cockroach node status
+----+---------------------------+-----------+---------------------+---------------------+------------+-----------+-------------+--------------+--------------+------------------+-----------------
| id | address | build | updated_at | started_at | live_bytes | key_bytes | value_bytes | intent_bytes | system_bytes | replicas_leaders | replicas_leaseho
+----+---------------------------+-----------+---------------------+---------------------+------------+-----------+-------------+--------------+--------------+------------------+-----------------
| 1 | crdb1a.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:47:32 | 2017-05-10 08:21:51 | 2019638360 | 139202189 | 1991327744 | 0 | 351294 | 13 |
| 2 | crdb2a.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:47:29 | 2017-05-06 22:39:18 | 3502745800 | 221917391 | 3526421821 | 834 | 328742 | 7 |
| 3 | crdb3a.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:47:24 | 2017-05-06 22:40:24 | 3610148453 | 139552393 | 3629641253 | 834 | 340457 | 41 |
| 4 | crdb4a.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:47:29 | 2017-05-06 22:40:28 | 3348420581 | 453439546 | 3363255015 | 0 | 319347 | 21 |
| 5 | crdb5a.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:47:33 | 2017-05-06 22:40:32 | 3023314756 | 224650631 | 3075611462 | 0 | 332281 | 25 |
| 6 | crdb1c.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:46:46 | 2017-05-06 22:40:35 | 3450206423 | 324660519 | 3601338766 | 834 | 281877 | 20 |
| 7 | crdb1d.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:47:24 | 2017-05-07 14:05:53 | 3321604955 | 167530324 | 3339968507 | 0 | 322178 | 11 |
| 8 | crdb1e.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:47:32 | 2017-05-06 22:43:32 | 3900417255 | 275552475 | 3927756592 | 834 | 317118 | 32 |
| 9 | crdb1f.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:47:29 | 2017-05-06 22:44:58 | 3180572508 | 275833772 | 3055715321 | 834 | 367050 | 25 |
| 10 | crdb1h.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:47:25 | 2017-05-06 22:45:03 | 3499697971 | 189248379 | 3407460372 | 0 | 335869 | 6 |
| 11 | crdb2h.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:47:31 | 2017-05-06 22:45:11 | 3586339457 | 195195533 | 3571139197 | 834 | 334825 | 30 |
| 12 | crdb3h.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 09:47:27 | 2017-05-06 22:45:17 | 2923088328 | 343006534 | 3078007655 | 834 | 274249 | 19 |
+----+---------------------------+-----------+---------------------+---------------------+------------+-----------+-------------+--------------+--------------+------------------+-----------------
(12 rows)

I am not sure if this is the same or a different issue, but it appears to be a different way to possibly trigger it. I think there might be some higher than normal packet loss between crdb1d and the *a and *c nodes.

@a-robinson
Contributor

When you say that you "managed to break it again", what symptom(s) are you referring to?

Also, it looks like the end of that table got truncated. Did it print properly in your terminal?

@jlauro

jlauro commented May 10, 2017

Main/original symptom for this issue (#15702) of high memory usage, way beyond what it should use. 8GB RAM with 4GB of swap for the VM, and the process grew to a 24GB address space. (I think 11GB of actual active+swap for the Go process.)
May 10 07:52:18 crdb1a kernel: Out of memory: Kill process 15055 (cockroach) score 942 or sacrifice child
May 10 07:52:18 crdb1a kernel: Killed process 15055 (cockroach) total-vm:24623408kB, anon-rss:7659012kB, file-rss:0kB, shmem-rss:0kB

The table printed correctly in the terminal. It looks like spaces were eaten in the cut-and-paste so the columns don't align, but all the columns appear to have made it into the post.

@jlauro

jlauro commented May 10, 2017

Pasting the current status.

crdb1a.ces.cvnt.net:/var/log# cockroach node status
+----+---------------------------+-----------+---------------------+---------------------+------------+-----------+-------------+--------------+--------------+------------------+-----------------------+--------+--------------------+------------------------+
| id |          address          |   build   |     updated_at      |     started_at      | live_bytes | key_bytes | value_bytes | intent_bytes | system_bytes | replicas_leaders | replicas_leaseholders | ranges | ranges_unavailable | ranges_underreplicated |
+----+---------------------------+-----------+---------------------+---------------------+------------+-----------+-------------+--------------+--------------+------------------+-----------------------+--------+--------------------+------------------------+
|  1 | crdb1a.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 16:11:52 | 2017-05-10 08:21:51 | 2072154251 | 107897060 |  2038123495 |            0 |       365381 |               26 |                    18 |     26 |                  0 |                      0 |
|  2 | crdb2a.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 16:11:49 | 2017-05-06 22:39:18 | 3563171063 | 259508079 |  3512254035 |          834 |       327812 |               14 |                    13 |     16 |                  0 |                      1 |
|  3 | crdb3a.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 16:11:44 | 2017-05-06 22:40:24 | 3767280122 | 136171156 |  3753122524 |         1329 |       354598 |               17 |                    15 |     17 |                  0 |                      2 |
|  4 | crdb4a.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 15:25:29 | 2017-05-06 22:40:28 | 3342133853 | 412594792 |  3323755826 |            0 |       319914 |               43 |                    45 |     45 |                  0 |                      0 |
|  5 | crdb5a.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 16:11:53 | 2017-05-06 22:40:32 | 3110162549 | 297230702 |  3138182113 |            0 |       352732 |               37 |                    31 |     38 |                  1 |                      1 |
|  6 | crdb1c.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 10:12:06 | 2017-05-06 22:40:35 | 3463093765 | 324663937 |  3614222690 |          834 |       284228 |                6 |                     1 |      6 |                  0 |                      0 |
|  7 | crdb1d.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 16:11:44 | 2017-05-07 14:05:53 | 3517430199 | 192595960 |  3526970300 |            0 |       366309 |                3 |                     1 |      4 |                  0 |                      0 |
|  8 | crdb1e.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 16:11:42 | 2017-05-06 22:43:32 | 4071275310 | 306155311 |  4105284795 |         1329 |       349681 |               69 |                    69 |     69 |                  0 |                      0 |
|  9 | crdb1f.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 16:11:49 | 2017-05-06 22:44:58 | 3462452940 | 337402912 |  3327287565 |         1329 |       408664 |               19 |                    12 |     19 |                  0 |                      4 |
| 10 | crdb1h.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 16:11:45 | 2017-05-06 22:45:03 | 3654987444 | 180937222 |  3535998015 |            0 |       366176 |               20 |                    21 |     20 |                  0 |                      0 |
| 11 | crdb2h.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 16:11:51 | 2017-05-06 22:45:11 | 3702500902 | 196211075 |  3689000131 |         1329 |       352007 |                2 |                     1 |      2 |                  0 |                      0 |
| 12 | crdb3h.ces.cvnt.net:26257 | v1.0-rc.2 | 2017-05-10 16:11:47 | 2017-05-06 22:45:17 | 3123894924 | 375593469 |  3282480924 |          834 |       305463 |               41 |                    41 |     41 |                  0 |                      0 |
+----+---------------------------+-----------+---------------------+---------------------+------------+-----------+-------------+--------------+--------------+------------------+-----------------------+--------+--------------------+------------------------+
(12 rows)

@ggaaooppeenngg
Contributor

ggaaooppeenngg commented May 17, 2017

I also got a large Unmarshal memory cost.

I170517 15:49:55.700616 1 util/log/clog.go:1011  [config] file created at: 2017/05/17 15:49:55
I170517 15:49:55.700616 1 util/log/clog.go:1011  [config] running on machine: 10-9-141-142
I170517 15:49:55.700616 1 util/log/clog.go:1011  [config] binary: CockroachDB CCL 3c1c3fb-dirty (linux amd64, built 2017/05/05 09:21:49, devel +8db4d02 Fri Apr 28 07:27:25 2017 +0000)
I170517 15:49:55.700616 1 util/log/clog.go:1011  [config] arguments: [./cockroach start --insecure --advertise-host=10.9.141.142 --join=10.25.106.179:26257]
I170517 15:49:55.700616 1 util/log/clog.go:1011  line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid file:line msg utf8=✓
I170517 15:49:55.700615 1 cli/start.go:575  CockroachDB CCL 3c1c3fb-dirty (linux amd64, built 2017/05/05 09:21:49, devel +8db4d02 Fri Apr 28 07:27:25 2017 +0000)
I170517 15:49:55.802023 8 cli/start.go:338  starting cockroach node
W170517 15:49:55.802349 8 server/server.go:161  running in insecure mode, this is strongly discouraged. See --insecure.
W170517 15:49:55.806384 8 gossip/gossip.go:1189  [n?] no incoming or outgoing connections
I170517 15:49:55.807243 8 storage/engine/rocksdb.go:380  opening rocksdb instance at "/home/ubuntu/cockroach-data"
I170517 15:49:55.840594 13 gossip/client.go:131  [n?] started gossip client to 10.25.106.179:26257
I170517 15:49:55.875265 8 server/config.go:421  1 storage engine initialized
I170517 15:49:55.876227 8 server/server.go:678  [n?] sleeping for 429.932142ms to guarantee HLC monotonicity
I170517 15:49:56.332412 8 server/node.go:467  [n2] initialized store [n2,s2]: {Capacity:21001531392 Available:16319844352 RangeCount:40 LeaseCount:0}
I170517 15:49:56.332448 8 server/node.go:351  [n2] node ID 2 initialized
I170517 15:49:56.332573 8 gossip/gossip.go:297  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"10.9.141.142:26257" > attrs:<> locality:<>
I170517 15:49:56.333201 8 storage/stores.go:296  [n2] read 2 node addresses from persistent storage
I170517 15:49:56.333419 8 server/node.go:608  [n2] connecting to gossip network to verify cluster ID...
I170517 15:49:56.333447 8 server/node.go:633  [n2] node connected via gossip and verified as part of cluster "eb31b293-2f2a-4fa2-9913-1543f8f6825c"
I170517 15:49:56.333479 8 server/node.go:405  [n2] node=2: started with [[]=/home/ubuntu/cockroach-data] engine(s) and attributes []
I170517 15:49:56.334055 8 sql/executor.go:338  [n2] creating distSQLPlanner with address {tcp 10.9.141.142:26257}
I170517 15:49:56.342867 8 server/server.go:741  [n2] starting http server at 10-9-141-142:8080
I170517 15:49:56.342889 8 server/server.go:742  [n2] starting grpc/postgres server at 10-9-141-142:26257
I170517 15:49:56.342903 8 server/server.go:743  [n2] advertising CockroachDB node at 10.9.141.142:26257
I170517 15:49:58.559686 388 storage/raft_transport.go:436  [n2] raft transport stream to node 3 established
I170517 15:49:58.752549 392 storage/replica_raftstorage.go:592  [n2,s2,r1/2:/{Min-System/}] applying Raft snapshot at index 149779 (id=3cc1e462, encoded size=25185, 1 rocksdb batches, 22 log entries)
I170517 15:49:58.755015 392 storage/replica_raftstorage.go:598  [n2,s2,r1/2:/{Min-System/}] applied Raft snapshot in 2ms [clear=0ms batch=0ms entries=0ms commit=2ms]
I170517 15:49:59.022061 398 storage/raft_transport.go:436  [n2] raft transport stream to node 1 established
I170517 15:50:06.334408 39 storage/store.go:3995  [n2,s2] sstables (read amplification = 3):
4 [   1M  1 ]: 1M
5 [  33M  3 ]: 32M 166K 21K
6 [ 529M 18 ]: 101M[4] 32M 19M 16M[2] 13M[2] 10M 2M 1M[2] 839K 705K 649K 460K
I170517 15:50:06.335892 41 server/status/runtime.go:221  [n2] runtime stats: 81 MiB RSS, 320 goroutines, 20 MiB/608 KiB/28 MiB GO alloc/idle/total, 7.2 MiB/11 MiB CGO alloc/total, 0.00cgo/sec, 0.00/0.00 %(u/s)time, 0.00 %gc (7x)
E170517 15:50:07.069594 329 server/status.go:410  [n2,status] failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = Canceled desc = context canceled
I170517 15:50:16.335961 41 server/status/runtime.go:221  [n2] runtime stats: 2.8 GiB RSS, 431 goroutines, 1.4 GiB/904 KiB/1.5 GiB GO alloc/idle/total, 978 MiB/995 MiB CGO alloc/total, 268.60cgo/sec, 0.28/0.11 %(u/s)time, 0.00 %gc (11x)
I170517 15:50:20.440645 32 gossip/gossip.go:1203  [n2] node has connected to cluster via gossip
I170517 15:50:20.440882 32 storage/stores.go:312  [n2] wrote 2 node addresses to persistent storage
W170517 15:50:20.932825 71 vendor/github.com/coreos/etcd/raft/raft.go:793  [n2,s2,r2/2:/System/{-tsd}] 2 stepped down to follower since quorum is not active
E170517 15:50:22.261504 544 server/status.go:410  [n2,status] failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I170517 15:50:22.261626 544 vendor/google.golang.org/grpc/server.go:752  grpc: Server.processUnaryRPC failed to write status: stream error: code = DeadlineExceeded desc = "context deadline exceeded"
W170517 15:50:23.132865 79 vendor/github.com/coreos/etcd/raft/raft.go:793  [n2,s2,r1/2:/{Min-System/}] 2 stepped down to follower since quorum is not active
I170517 15:50:26.335886 41 server/status/runtime.go:221  [n2] runtime stats: 2.9 GiB RSS, 427 goroutines, 1.5 GiB/656 KiB/1.5 GiB GO alloc/idle/total, 982 MiB/997 MiB CGO alloc/total, 74.70cgo/sec, 0.02/0.04 %(u/s)time, 0.00 %gc (0x)
I170517 15:50:36.335902 41 server/status/runtime.go:221  [n2] runtime stats: 2.9 GiB RSS, 462 goroutines, 1.5 GiB/1016 KiB/1.6 GiB GO alloc/idle/total, 981 MiB/997 MiB CGO alloc/total, 84.80cgo/sec, 0.02/0.02 %(u/s)time, 0.00 %gc (0x)
I170517 15:50:44.659903 178 vendor/google.golang.org/grpc/transport/http2_server.go:323  transport: http2Server.HandleStreams failed to read frame: read tcp 10.9.141.142:26257->10.13.26.150:46920: read: connection reset by peer
I170517 15:50:46.336687 41 server/status/runtime.go:221  [n2] runtime stats: 3.5 GiB RSS, 449 goroutines, 1.4 GiB/723 MiB/2.2 GiB GO alloc/idle/total, 981 MiB/991 MiB CGO alloc/total, 263.30cgo/sec, 0.27/0.15 %(u/s)time, 0.00 %gc (4x)
E170517 15:50:47.285382 719 server/status.go:410  [n2,status] failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
W170517 15:50:56.333192 90 storage/replica.go:1104  [n2,s2,r2/2:/System/{-tsd}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.333244 89 storage/replica.go:1104  [n2,s2,r5/2:/Table/{0-11}] have been waiting 1m0s attempting to acquire lease
I170517 15:50:56.336053 40 gossip/gossip.go:445  [n2] gossip status (ok, 3 nodes)
gossip client (1/3 cur/max conns)
  1: 10.25.106.179:26257 (1m1s: infos 30/21 sent/received, bytes 4012B/7548B sent/received)
gossip server (0/3 cur/max conns, infos 30/21 sent/received, bytes 4012B/7548B sent/received)
I170517 15:50:56.336211 41 server/status/runtime.go:221  [n2] runtime stats: 3.5 GiB RSS, 463 goroutines, 1.4 GiB/721 MiB/2.2 GiB GO alloc/idle/total, 981 MiB/992 MiB CGO alloc/total, 71.30cgo/sec, 0.01/0.00 %(u/s)time, 0.00 %gc (0x)
W170517 15:50:56.377112 177 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r3 for batch: Merge [/System/tsd/cr.node.gossip.connections.incoming/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.gossip.connections.refused/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.gossip.bytes.received/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.gossip.bytes.sent/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.gossip.infos.received/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.gossip.infos.sent/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.gossip.connections.outgoing/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.distsender.batches/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.distsender.batches.partial/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.distsender.rpc.sent/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.distsender.rpc.sent.local/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.distsender.rpc.sent.sendnexttimeout/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.distsender.rpc.sent.nextreplicaerror/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.distsender.errors.notleaseholder/2/10s/2017-05-17T07:00:00Z,/Min), Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.377427 197 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r43 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.377489 194 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r16 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Merge [/System/tsd/cr.node.requests.slow.distsender/2/10s/2017-05-17T07:00:00Z,/Min), Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.377824 195 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r14 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.377886 198 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r28 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.378331 199 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r26 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.378388 196 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r30 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.379315 193 storage/replica.go:1104  [n2,s2,r16/2:/System/tsd/cr.node.{li…-ro…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.379467 214 storage/replica.go:1104  [n2,s2,r26/2:/System/tsd/cr.node.sql.mem.c…] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.382469 210 storage/replica.go:1104  [n2,s2,r43/2:/System/tsd/cr.node.sql.…] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.385394 213 storage/replica.go:1104  [n2,s2,r30/2:/System/tsd/cr.node.sql.…] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.388270 191 storage/replica.go:1104  [n2,s2,r3/2:/System/tsd{-/cr.nod…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.391153 211 storage/replica.go:1104  [n2,s2,r14/2:/System/tsd/cr.node.{ro…-sq…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.391164 219 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r52 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.392784 212 storage/replica.go:1104  [n2,s2,r28/2:/System/tsd/cr.node.sql.mem.…] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.392870 220 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r25 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.394753 222 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r13 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.394796 209 storage/replica.go:1104  [n2,s2,r25/2:/System/tsd/cr.node.sql.mem.…] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.395122 221 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r15 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.395195 223 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r24 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Merge [/System/tsd/cr.node.txn.aborts/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.commits/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.commits1PC/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.abandons/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.durations-max/2/10s/2017-05-17T07:00:00Z,/Min), ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.395296 208 storage/replica.go:1104  [n2,s2,r52/2:/System/tsd/cr.node.sql.mem.…] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.395487 224 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r50 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.396065 246 storage/replica.go:1104  [n2,s2,r50/2:/System/tsd/cr.{node.…-store…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.396193 233 storage/replica.go:1104  [n2,s2,r24/2:/System/tsd/cr.node.{sy…-tx…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.396544 244 storage/replica.go:1104  [n2,s2,r13/2:/System/tsd/cr.node.s{ql…-ys…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.396696 245 storage/replica.go:1104  [n2,s2,r15/2:/System/tsd/cr.node.sql.…] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.401143 8 kv/dist_sender.go:1189  [n2] have been waiting 1m0s sending RPC to r2 for batch: Scan [/System/"system-version/",/System/"system-version0")
W170517 15:50:56.401325 43 kv/dist_sender.go:1189  [summaries,n2] have been waiting 1m0s sending RPC to r2 for batch: Put [/System/StatusNode/2,/Min)
W170517 15:50:56.401490 257 storage/replica.go:1104  [n2,s2,r2/2:/System/{-tsd}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.401622 255 storage/replica.go:1104  [n2,s2,r2/2:/System/{-tsd}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.401670 256 storage/replica.go:1104  [n2,s2,r2/2:/System/{-tsd}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.401714 37 kv/dist_sender.go:1189  [s2,r5/2:/Table/{0-11},n2] have been waiting 1m0s sending RPC to r2 for batch: [txn: af5c33fd], BeginTransaction [/System/NodeLiveness/2,/Min), ConditionalPut [/System/NodeLiveness/2,/Min), EndTransaction [/System/NodeLiveness/2,/Min)
W170517 15:50:56.423167 258 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r48 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.423627 262 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r49 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.423679 261 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r53 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Merge [/System/tsd/cr.store.queue.replicate.removereplica/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.store.queue.replicate.removedeadreplica/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.store.queue.replicate.rebalancereplica/2/10s/2017-05-17T07:00:00Z,/Min), Merge [/System/tsd/cr.store.queue.replicate.transferlease/2/10s/2017-05-17T07:00:00Z,/Min), Noop
W170517 15:50:56.423897 263 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r45 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.424176 259 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r44 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.424579 264 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r42 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.424722 260 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r23 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Merge [/System/tsd/cr.store.queue.replicagc.removereplica/2/10s/2017-05-17T07:00:00Z,/Min)
W170517 15:50:56.425525 275 storage/replica.go:1104  [n2,s2,r42/2:/System/tsd/cr.store.ra{f…-n…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.425614 241 storage/replica.go:1104  [n2,s2,r23/2:/System/tsd/cr.store.queue.…] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.425958 270 storage/replica.go:1104  [n2,s2,r48/2:/System/tsd/cr.store.{ca…-le…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.426279 272 storage/replica.go:1104  [n2,s2,r49/2:/System/tsd/cr.store.{qu…-ra…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.426518 273 storage/replica.go:1104  [n2,s2,r45/2:/System/tsd/cr.store.raft.…] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.427645 271 storage/replica.go:1104  [n2,s2,r53/2:/System/tsd/cr.store.queue.…] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.427736 274 storage/replica.go:1104  [n2,s2,r44/2:/System/tsd/cr.store.{le…-qu…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.458109 284 storage/replica.go:1104  [n2,s2,r22/2:/System/tsd/cr.store.r{a…-o…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.458166 42 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r21 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.458284 280 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r22 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.458464 281 kv/dist_sender.go:1189  [ts-poll,n2] have been waiting 1m0s sending RPC to r51 for batch: Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, Noop, ... 374 skipped ..., Noop, Noop, Noop, Noop, Noop
W170517 15:50:56.458573 283 storage/replica.go:1104  [n2,s2,r21/2:/System/ts{d/cr.st…-e}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.458578 295 storage/replica.go:1104  [n2,s2,r51/2:/System/tsd/cr.store.{ro…-va…}] have been waiting 1m0s attempting to acquire lease
W170517 15:50:56.533584 99 storage/replica.go:1104  [timeSeriesMaintenance,n2,s2,r43/2:/System/tsd/cr.node.sql.…] have been waiting 1m0s attempting to acquire lease
W170517 15:50:57.335972 92 storage/replica.go:1104  [gc,n2,s2,r2/2:/System/{-tsd}] have been waiting 1m0s attempting to acquire lease
W170517 15:51:02.550594 96 kv/dist_sender.go:1189  [raftlog,s2,r44/2:/System/tsd/cr.store.{le…-qu…},n2] have been waiting 1m0s sending RPC to r44 for batch: TruncateLog [/System/tsd/cr.store.leases.transfers.error/1/10s/2017-05-09T16:00:00Z,/Min)
E170517 15:51:02.550731 96 storage/queue.go:634  [raftlog,n2,s2,r44/2:/System/tsd/cr.store.{le…-qu…}] failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I170517 15:51:06.336095 41 server/status/runtime.go:221  [n2] runtime stats: 3.5 GiB RSS, 452 goroutines, 1.4 GiB/695 MiB/2.2 GiB GO alloc/idle/total, 981 MiB/992 MiB CGO alloc/total, 154.70cgo/sec, 0.02/0.00 %(u/s)time, 0.00 %gc (0x)
E170517 15:51:13.196921 995 server/status.go:410  [n2,status] failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I170517 15:51:16.335836 41 server/status/runtime.go:221  [n2] runtime stats: 3.5 GiB RSS, 463 goroutines, 1.5 GiB/665 MiB/2.2 GiB GO alloc/idle/total, 981 MiB/992 MiB CGO alloc/total, 128.70cgo/sec, 0.01/0.01 %(u/s)time, 0.00 %gc (0x)

Here is the heap profile:

profile001.pdf

@ggaaooppeenngg
Contributor

ggaaooppeenngg commented May 17, 2017

@a-robinson How did you figure out that range 137 is problematic?

@a-robinson
Contributor

@ggaaooppeenngg did you run DELETE or TRUNCATE SQL commands?

To determine the problematic range, I ran a couple of small Python scripts over the output of the /_status/ranges/<nodeID> and /_status/raft debug pages -- first to find which ranges had replicas on the nodes with high memory, then to find the raft groups with the highest term numbers. If you want to find which range is causing problems in your own cluster, I'd recommend that you either check the /_status/raft page for the group with the highest term number, or keep reloading the list of all running goroutines (/debug/pprof/goroutine?debug=2) until it contains a stack trace with becomeLeader in it, then grab the hex value of the 5th argument to storage.entries. In my comment above (#15702 (comment)), that parameter was 0x89, which is 137 in decimal.
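
A minimal sketch (not from the original comment) of the second approach described above: it polls the goroutine dump until a becomeLeader stack appears and then decodes the 5th argument of storage.entries. The admin UI address, the polling interval, and the assumption that the storage.entries frame and its argument list appear on a single line of the stack trace are all placeholders/assumptions, not something the issue specifies.

```python
#!/usr/bin/env python3
# Rough sketch: keep reloading /debug/pprof/goroutine?debug=2 until a stack
# containing becomeLeader shows up, then print the storage.entries arguments
# so the range ID (the 5th argument, shown in hex) can be read off in decimal.
import re
import time
import urllib.request

ADMIN_UI = "http://localhost:8080"  # assumed admin UI address of one node

def find_range_id(poll_interval=5):
    url = ADMIN_UI + "/debug/pprof/goroutine?debug=2"
    while True:
        dump = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        # Goroutine stacks are separated by blank lines in the debug=2 dump.
        for stack in dump.split("\n\n"):
            if "becomeLeader" not in stack:
                continue
            # Assumes the frame and its arguments fit on one line, e.g.
            # "storage.entries(0xc42..., 0x1, 0x2, 0x3, 0x89, ...)".
            m = re.search(r"storage\.entries\(([^)]*)\)", stack)
            if m:
                args = [a.strip() for a in m.group(1).split(",")]
                print("storage.entries args:", args)
                if len(args) >= 5 and args[4].startswith("0x"):
                    return int(args[4], 16)  # e.g. 0x89 -> 137
        time.sleep(poll_interval)

if __name__ == "__main__":
    print("range ID:", find_range_id())
```

The final conversion is the same step mentioned above: int("0x89", 16) evaluates to 137.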
