Volume server hangs #1353

Closed
ingardm opened this issue Jun 7, 2020 · 9 comments

@ingardm
Contributor

ingardm commented Jun 7, 2020

Describe the bug
The volume server's HTTP interface is unresponsive. The process is not dead, as it is still logging lines like this:

Jun 07 12:08:13 dn-379 weed[38923]: I0607 12:08:13 38923 disk.go:11] read disk size: dir:"/mnt/data-small/data" all:100000802516992 used:2829158424576 free:97171644092416 percent_free:97.17087 percent_used:2.8291357
Jun 07 12:08:34 dn-379 weed[38923]: I0607 12:08:34 38923 disk.go:11] read disk size: dir:"/mnt/data-small/data" all:100000802516992 used:2829158424576 free:97171644092416 percent_free:97.17087 percent_used:2.8291357

System Setup
Same setup as #1346
5x masters and 5x volume servers:

root     38923 31.5  0.8 11695928 578576 ?     Ssl  Jun05 770:25 /usr/local/bin/weed -logtostderr -v=1 volume -compactionMBps=50 -port=8080 -mserver=10.1.14.110:9333,10.1.14.111:9333,10.1.14.112:9333,10.1.14.113:9333,10.1.14.114:9333 -dir=/mnt/data-small/data -index leveldb -max 0 -ip 10.1.14.112 -rack dn-379
root     39048  0.4  0.0 5329596 19068 ?       Ssl  Jun05  11:41 /usr/local/bin/weed -logtostderr master -mdir=/mnt/data-small/meta -ip=10.1.14.112 -defaultReplication=010 -volumePreallocate -volumeSizeLimitMB 30000 -garbageThreshold 0.3 -pulseSeconds 5 -peers=10.1.14.110:9333,10.1.14.111:9333,10.1.14.112:9333,10.1.14.113:9333,10.1.14.114:9333

Expected behavior
Normal operation

Additional context
The only thing we changed was pulseSeconds from 10 to 5, but now I'm constantly restarting volume servers that hang. Could it be that the volume server is waiting for replication to go through? I've seen some log lines related to replication failing. This is from roughly 30 minutes before all transfers to the cluster stopped:

Jun 05 21:01:14 dn-379 weed[38923]: I0605 21:01:14 38923 upload_content.go:214] failing to upload to http://10.1.14.110:8080/5172,0e86755ae4c78708?ts=1591383674&ttl=&type=replicate Post http://10.1.14.110:8080/5172,0e86755ae4c78708?ts=1591383674&ttl=&type=replicate: read tcp 10.1.14.112:38402->10.1.14.110:8080: read: connection reset by peer
Jun 05 21:01:14 dn-379 weed[38923]: I0605 21:01:14 38923 store_replicate.go:87] failed to write to replicas for volume 5172: [10.1.14.110:8080]: Post http://10.1.14.110:8080/5172,0e86755ae4c78708?ts=1591383674&ttl=&type=replicate: read tcp 10.1.14.112:38402->10.1.14.110:8080: read: connection reset by peer
Jun 05 21:01:14 dn-379 weed[38923]: I0605 21:01:14 38923 upload_content.go:214] failing to upload to http://10.1.14.110:8080/5172,0e867549195ab81e?ts=1591383674&ttl=&type=replicate Post http://10.1.14.110:8080/5172,0e867549195ab81e?ts=1591383674&ttl=&type=replicate: read tcp 10.1.14.112:38396->10.1.14.110:8080: read: connection reset by peer
Jun 05 21:01:14 dn-379 weed[38923]: I0605 21:01:14 38923 store_replicate.go:87] failed to write to replicas for volume 5172: [10.1.14.110:8080]: Post http://10.1.14.110:8080/5172,0e867549195ab81e?ts=1591383674&ttl=&type=replicate: read tcp 10.1.14.112:38396->10.1.14.110:8080: read: connection reset by peer

@ingardm
Contributor Author

ingardm commented Jun 7, 2020

Do we have pprof or some other means to dump all threads, to maybe get some clarity on what is causing the stalls? Is there an overview or some way to see which transfers (replication) are in flight internally?

@ingardm
Contributor Author

ingardm commented Jun 7, 2020

$ curl -v 'http://10.1.14.112:8080'
* Rebuilt URL to: http://10.1.14.112:8080/
*   Trying 10.1.14.112...
* TCP_NODELAY set
* Connected to 10.1.14.112 (10.1.14.112) port 8080 (#0)
> GET / HTTP/1.1
> Host: 10.1.14.112:8080
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 400 Bad Request
< Date: Sun, 07 Jun 2020 10:23:05 GMT
< Content-Length: 0
<
* Connection #0 to host 10.1.14.112 left intact
✔ ~
$ curl -v 'http://10.1.14.112:8080/5205,0e9c951016a11654'
*   Trying 10.1.14.112...
* TCP_NODELAY set
* Connected to 10.1.14.112 (10.1.14.112) port 8080 (#0)
> GET /5205,0e9c951016a11654 HTTP/1.1
> Host: 10.1.14.112:8080
> User-Agent: curl/7.54.0
> Accept: */*
>




hangs forever

@chrislusf
Collaborator

You can use kill -3 <pid> to see the threads.

weed volume -cpuprofile=xxx -memprofile=yyy can profile CPU and memory. See weed volume -h.
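
For example (the pid and file paths below are just illustrative):

# SIGQUIT makes the Go runtime dump every goroutine's stack to stderr before exiting
kill -3 38923

# or restart the volume server with profiling enabled, then inspect the files with go tool pprof
weed volume -cpuprofile=/tmp/volume-cpu.pprof -memprofile=/tmp/volume-mem.pprof ...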

It seems the write requests are queued up on the volume servers. Adding more volume servers or reducing the concurrency should help.

@ingardm
Contributor Author

ingardm commented Jun 8, 2020

@chrislusf log attached
log.txt

@ingardm
Contributor Author

ingardm commented Jun 8, 2020

log from another volume server
dn380.log.txt

@ingardm
Contributor Author

ingardm commented Jun 8, 2020

It happened again on the same server. This time we had patched the source to add pprof. This attachment is from the debug/pprof/goroutine?debug=2 endpoint:

pprof.log
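
For anyone wanting to do the same, a minimal sketch of exposing Go's standard net/http/pprof handlers on a side port (not our exact patch; the port is arbitrary):

// pprof_sketch.go - illustrative only: registers the standard pprof handlers
// on a separate listener so goroutine dumps can be fetched over HTTP.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
    // A full stack dump of every goroutine is then available at
    // http://<host>:6060/debug/pprof/goroutine?debug=2
    log.Fatal(http.ListenAndServe(":6060", nil))
}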

@Kimbsen
Contributor

Kimbsen commented Jun 8, 2020

It's a deadlock in func (v *Volume) syncWrite(n *needle.Needle) (offset uint64, size uint32, isUnchanged bool, err error).

The mutex is initially write-locked here: https://github.com/chrislusf/seaweedfs/blob/master/weed/storage/volume_read_write.go#L75

But if the error condition is true here: https://github.com/chrislusf/seaweedfs/blob/master/weed/storage/volume_read_write.go#L79 then v.ContentSize() attempts to read-lock the same mutex.
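
A stripped-down reproduction of the pattern (struct and field names are only an approximation of the real code):

// deadlock_sketch.go - minimal reproduction of the self-deadlock; names are
// approximations, not copied from the actual Volume implementation.
package main

import (
    "fmt"
    "sync"
)

type Volume struct {
    dataFileAccessLock sync.RWMutex
    contentSize        uint64
}

// ContentSize takes a read lock, like v.ContentSize() in the real code.
func (v *Volume) ContentSize() uint64 {
    v.dataFileAccessLock.RLock()
    defer v.dataFileAccessLock.RUnlock()
    return v.contentSize
}

// syncWrite takes the write lock and then, on the error path, calls
// ContentSize() while still holding it. sync.RWMutex is not reentrant,
// so the RLock blocks forever behind this goroutine's own write lock.
func (v *Volume) syncWrite() error {
    v.dataFileAccessLock.Lock()
    defer v.dataFileAccessLock.Unlock()

    if v.ContentSize() > 0 { // stands in for the check at volume_read_write.go#L79
        return fmt.Errorf("volume size limit exceeded")
    }
    return nil
}

func main() {
    v := &Volume{contentSize: 1}
    _ = v.syncWrite() // never returns; the runtime reports "all goroutines are asleep - deadlock!"
    fmt.Println("unreachable")
}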

chrislusf added a commit that referenced this issue Jun 8, 2020
@chrislusf
Collaborator

@Kimbsen Thanks a lot for showing the bug! I added a fix for it.

@ingardm
Copy link
Contributor Author

ingardm commented Jun 10, 2020

We've been running our tests the whole day with seemingly no problems. Closing this issue :)

@ingardm ingardm closed this as completed Jun 10, 2020