Volume server hangs #1353

Closed
ingardm opened this issue Jun 7, 2020 · 9 comments

@ingardm
Contributor

ingardm commented Jun 7, 2020

Describe the bug
The volume server's HTTP interface is unresponsive. The process is not dead, as it is still logging lines like this:

Jun 07 12:08:13 dn-379 weed[38923]: I0607 12:08:13 38923 disk.go:11] read disk size: dir:"/mnt/data-small/data" all:100000802516992 used:2829158424576 free:97171644092416 percent_free:97.17087 percent_used:2.8291357
Jun 07 12:08:34 dn-379 weed[38923]: I0607 12:08:34 38923 disk.go:11] read disk size: dir:"/mnt/data-small/data" all:100000802516992 used:2829158424576 free:97171644092416 percent_free:97.17087 percent_used:2.8291357

System Setup
Same setup as #1346
5x masters and 5x volume servers:

root     38923 31.5  0.8 11695928 578576 ?     Ssl  Jun05 770:25 /usr/local/bin/weed -logtostderr -v=1 volume -compactionMBps=50 -port=8080 -mserver=10.1.14.110:9333,10.1.14.111:9333,10.1.14.112:9333,10.1.14.113:9333,10.1.14.114:9333 -dir=/mnt/data-small/data -index leveldb -max 0 -ip 10.1.14.112 -rack dn-379
root     39048  0.4  0.0 5329596 19068 ?       Ssl  Jun05  11:41 /usr/local/bin/weed -logtostderr master -mdir=/mnt/data-small/meta -ip=10.1.14.112 -defaultReplication=010 -volumePreallocate -volumeSizeLimitMB 30000 -garbageThreshold 0.3 -pulseSeconds 5 -peers=10.1.14.110:9333,10.1.14.111:9333,10.1.14.112:9333,10.1.14.113:9333,10.1.14.114:9333

Expected behavior
Normal operation

Additional context
The only thing we changed was pulseSeconds from 10 to 5, but now I'm constantly restarting volume servers that hang. Could it be that the volume server is waiting for replication to go through? I've seen some log lines related to replication failing. This is from roughly 30 minutes before all transfers to the cluster stopped:

Jun 05 21:01:14 dn-379 weed[38923]: I0605 21:01:14 38923 upload_content.go:214] failing to upload to http://10.1.14.110:8080/5172,0e86755ae4c78708?ts=1591383674&ttl=&type=replicate Post http://10.1.14.110:8080/5172,0e86755ae4c78708?ts=1591383674&ttl=&type=replicate: read tcp 10.1.14.112:38402->10.1.14.110:8080: read: connection reset by peer
Jun 05 21:01:14 dn-379 weed[38923]: I0605 21:01:14 38923 store_replicate.go:87] failed to write to replicas for volume 5172: [10.1.14.110:8080]: Post http://10.1.14.110:8080/5172,0e86755ae4c78708?ts=1591383674&ttl=&type=replicate: read tcp 10.1.14.112:38402->10.1.14.110:8080: read: connection reset by peer
Jun 05 21:01:14 dn-379 weed[38923]: I0605 21:01:14 38923 upload_content.go:214] failing to upload to http://10.1.14.110:8080/5172,0e867549195ab81e?ts=1591383674&ttl=&type=replicate Post http://10.1.14.110:8080/5172,0e867549195ab81e?ts=1591383674&ttl=&type=replicate: read tcp 10.1.14.112:38396->10.1.14.110:8080: read: connection reset by peer
Jun 05 21:01:14 dn-379 weed[38923]: I0605 21:01:14 38923 store_replicate.go:87] failed to write to replicas for volume 5172: [10.1.14.110:8080]: Post http://10.1.14.110:8080/5172,0e867549195ab81e?ts=1591383674&ttl=&type=replicate: read tcp 10.1.14.112:38396->10.1.14.110:8080: read: connection reset by peer

@ingardm
Contributor Author

ingardm commented Jun 7, 2020

Do we have pprof or some other means to dump all threads, to maybe get some clarity on what is causing the stalls? Is there an overview or some way to see which transfers (replication) are in flight internally?

@ingardm
Contributor Author

ingardm commented Jun 7, 2020

$ curl -v 'http://10.1.14.112:8080'
* Rebuilt URL to: http://10.1.14.112:8080/
*   Trying 10.1.14.112...
* TCP_NODELAY set
* Connected to 10.1.14.112 (10.1.14.112) port 8080 (#0)
> GET / HTTP/1.1
> Host: 10.1.14.112:8080
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 400 Bad Request
< Date: Sun, 07 Jun 2020 10:23:05 GMT
< Content-Length: 0
<
* Connection #0 to host 10.1.14.112 left intact
✔ ~
$ curl -v 'http://10.1.14.112:8080/5205,0e9c951016a11654'
*   Trying 10.1.14.112...
* TCP_NODELAY set
* Connected to 10.1.14.112 (10.1.14.112) port 8080 (#0)
> GET /5205,0e9c951016a11654 HTTP/1.1
> Host: 10.1.14.112:8080
> User-Agent: curl/7.54.0
> Accept: */*
>




hangs forever

@chrislusf
Collaborator

You can use kill -3 <pid> to see the threads.

weed volume -cpuprofile=xxx -memprofile=yyy can profile CPU and memory. See weed volume -h.
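
For example (the pid and file paths below are just illustrative):

# SIGQUIT makes the Go runtime dump every goroutine's stack to stderr before exiting
kill -3 38923

# or restart the volume server with profiling enabled, then inspect the files with go tool pprof
weed volume -cpuprofile=/tmp/volume-cpu.pprof -memprofile=/tmp/volume-mem.pprof ...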

It seems the write requests are queued up on the volume servers. Adding more volume servers or reducing the concurrency should help.

@ingardm
Contributor Author

ingardm commented Jun 8, 2020

@chrislusf log attached
log.txt

@ingardm
Contributor Author

ingardm commented Jun 8, 2020

log from another volume server
dn380.log.txt

@ingardm
Contributor Author

ingardm commented Jun 8, 2020

It happened again on the same server. This time we had patched the source to add pprof. This attachment is from the debug/pprof/goroutine?debug=2 endpoint:

pprof.log
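
For anyone wanting to do the same, a minimal sketch of exposing Go's standard net/http/pprof handlers on a side port (not our exact patch; the port is arbitrary):

// pprof_sketch.go - illustrative only: registers the standard pprof handlers
// on a separate listener so goroutine dumps can be fetched over HTTP.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
    // A full stack dump of every goroutine is then available at
    // http://<host>:6060/debug/pprof/goroutine?debug=2
    log.Fatal(http.ListenAndServe(":6060", nil))
}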

@Kimbsen
Contributor

Kimbsen commented Jun 8, 2020

It's a deadlock in func (v *Volume) syncWrite(n *needle.Needle) (offset uint64, size uint32, isUnchanged bool, err error).

The mutex is initially write-locked here: https://github.com/chrislusf/seaweedfs/blob/master/weed/storage/volume_read_write.go#L75

But if the error condition is true here: https://github.com/chrislusf/seaweedfs/blob/master/weed/storage/volume_read_write.go#L79 then v.ContentSize() attempts to read-lock the same mutex.
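
A stripped-down reproduction of the pattern (struct and field names are only an approximation of the real code):

// deadlock_sketch.go - minimal reproduction of the self-deadlock; names are
// approximations, not copied from the actual Volume implementation.
package main

import (
    "fmt"
    "sync"
)

type Volume struct {
    dataFileAccessLock sync.RWMutex
    contentSize        uint64
}

// ContentSize takes a read lock, like v.ContentSize() in the real code.
func (v *Volume) ContentSize() uint64 {
    v.dataFileAccessLock.RLock()
    defer v.dataFileAccessLock.RUnlock()
    return v.contentSize
}

// syncWrite takes the write lock and then, on the error path, calls
// ContentSize() while still holding it. sync.RWMutex is not reentrant,
// so the RLock blocks forever behind this goroutine's own write lock.
func (v *Volume) syncWrite() error {
    v.dataFileAccessLock.Lock()
    defer v.dataFileAccessLock.Unlock()

    if v.ContentSize() > 0 { // stands in for the check at volume_read_write.go#L79
        return fmt.Errorf("volume size limit exceeded")
    }
    return nil
}

func main() {
    v := &Volume{contentSize: 1}
    _ = v.syncWrite() // never returns; the runtime reports "all goroutines are asleep - deadlock!"
    fmt.Println("unreachable")
}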

chrislusf added a commit that referenced this issue Jun 8, 2020
@chrislusf
Collaborator

@Kimbsen Thanks a lot for showing the bug! I added a fix for it.

@ingardm
Copy link
Contributor Author

ingardm commented Jun 10, 2020

We've been running our tests the whole day with seemingly no problems. Closing this issue :)

@ingardm ingardm closed this as completed Jun 10, 2020