weed mount stalls some read operations while writing #2263

Closed
PeterCxy opened this issue Aug 16, 2021 · 9 comments

PeterCxy (Contributor) commented Aug 16, 2021

Describe the bug

It seems that on my setup, a weed mount process will stall some chunk read requests while writing a large file. After the write operation finishes, the stalled reads resume and all pending read requests are processed. It does not appear to be a volume server performance issue, because other weed mount processes and the S3 gateway work completely fine while the one doing the writes is stuck.

To reproduce, try to read a large file while writing another large file to the mountpoint, e.g. using dd. If the file being read is not in the local cache, then the reading process may get stuck for a long time.

System Setup

SeaweedFS version: 2.62
Mount command: weed mount -filer=<redacted> -filer.path=<redacted> -collection=<redacted> -dir=<redacted> -concurrentWriters=1 -allowOthers=true -chunkSizeLimitMB=1 -cacheCapacityMB=10240 -cacheDir=<redacted>

Bandwidth is not saturated during the write operation. In fact, that is why I set concurrentWriters=1 and chunkSizeLimitMB=1: I thought the problem might be my network being unable to handle the default 32 concurrent writers uploading chunks at the same time, but these settings did not change anything.

Expected behavior

The FUSE mount should continue to process read requests while the write request is ongoing.

PeterCxy (Contributor Author) commented Aug 16, 2021

Okay, it seems that because the volume server uses an RWMutex, a single volume can never be written to and read from at the same time (I am not sure this is the root cause of my problem, but the collection where this happens does have only a small number of volumes). Is that really necessary at all times? Since volumes are normally append-only except during compaction, I believe it is reasonable to allow reads while writes are ongoing (though precautions need to be taken not to return data that is still being written or is only partially written).

PeterCxy (Contributor Author) commented:

According to the Go documentation, a blocked Lock call prevents any new RLock calls from acquiring the lock. It might be that multiple goroutines are trying to acquire the write lock on a volume at the same time, thus preventing any reads from happening.
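
A minimal, self-contained sketch of the documented behavior (not SeaweedFS code): once a writer is blocked waiting for the lock, a new RLock() call also blocks until that writer has come and gone, even though the lock is currently only held by readers.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	var wg sync.WaitGroup

	// A long-running read holds the read lock.
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.RLock()
		defer mu.RUnlock()
		time.Sleep(1 * time.Second)
	}()
	time.Sleep(50 * time.Millisecond) // let the reader grab the lock first

	// A writer queues up behind the active reader and blocks on Lock().
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.Lock()
		defer mu.Unlock()
		fmt.Println("write lock acquired")
	}()
	time.Sleep(50 * time.Millisecond) // let the writer start waiting

	// This new reader now blocks until the pending writer has acquired and
	// released the lock, even though only readers currently hold it.
	start := time.Now()
	mu.RLock()
	fmt.Printf("second RLock waited %v\n", time.Since(start))
	mu.RUnlock()

	wg.Wait()
}
```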

PeterCxy changed the title from "weed mount stalls all read operations while writing" to "weed mount stalls some read operations while writing" on Aug 17, 2021
PeterCxy (Contributor Author) commented Aug 17, 2021

This only seems to happen when there are a lot of ongoing write and read requests -- at some point in the request queue, all reads begin to get stuck and only writes are processed. Take this debug output for example (I added log output around the RLock() call in readNeedle in volume_read.go):

I0817 13:46:32 51473 volume_read.go:17] reading needle 1,0ae3122338, acquiring read lock
I0817 13:46:32 51473 volume_read.go:19] reading needle 1,0ae3122338, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,0b2e0c8014, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,0b2e0c8014, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,0b2e0c8014, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,0b2e0c8014, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,0c346954e1, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,0c346954e1, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,0c346954e1, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,0c346954e1, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,0d4157506e, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,0d4157506e, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,0d4157506e, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,0d4157506e, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,0eda71d03e, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,0eda71d03e, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,0eda71d03e, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,0eda71d03e, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,0ff5a33599, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,0ff5a33599, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,0ff5a33599, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,0ff5a33599, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,10533a013f, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,10533a013f, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,10533a013f, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,10533a013f, read lock okay
I0817 13:46:33 51473 volume_read.go:17] reading needle 1,119233bb06, acquiring read lock
I0817 13:46:33 51473 volume_read.go:19] reading needle 1,119233bb06, read lock okay
I0817 13:47:04 51473 volume_read.go:17] reading needle 1,119233bb06, acquiring read lock
I0817 13:47:04 51473 volume_read.go:19] reading needle 1,119233bb06, read lock okay

Notice the irregular 31-second gap between the last two invocations. During that gap only write requests are processed: if I add the same debug output to the Lock() calls in the write path, it keeps logging while the reads are stuck. Also note that the reads do not get stuck while acquiring the lock itself, so this may have nothing to do with the RWMutex after all.
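
For reference, the instrumentation was roughly of this shape (a stand-alone sketch, not the actual readNeedle code, which uses glog and the volume server's own types): one log line before RLock() and one after, so time spent waiting for the lock can be told apart from time spent doing the actual read.

```go
package main

import (
	"log"
	"sync"
	"time"
)

// volume is a stand-in for the volume server's volume struct,
// which guards data file access with a sync.RWMutex.
type volume struct {
	mu sync.RWMutex
}

// readNeedle mimics the debug output added around the read lock.
func (v *volume) readNeedle(id string) {
	log.Printf("reading needle %s, acquiring read lock", id)
	v.mu.RLock()
	log.Printf("reading needle %s, read lock okay", id)
	defer v.mu.RUnlock()
	time.Sleep(5 * time.Millisecond) // stand-in for the on-disk needle read
}

func main() {
	v := &volume{}
	v.readNeedle("1,0ae3122338")
}
```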

chrislusf (Collaborator) commented:
How busy is the traffic? There is a new volume server option, -concurrentDownloadLimitMB=256, in 2.62. Do you think it may be related?

PeterCxy (Contributor Author) commented Aug 17, 2021

@chrislusf I thought of that just now as well, so I tried removing all the checks on in-flight download data in volume_server_handlers.go, but the stalls still happen.

My current way to reproduce it is to write one 1 GB file, then cat it while writing another, and observe the cat process getting stuck partway through the read.

chrislusf (Collaborator) commented:
Can you reproduce it without using mount? Mount has its own caching and write strategies, so skipping it can help pinpoint the problem.

PeterCxy (Contributor Author) commented:
@chrislusf I have not tested without mount, but I believe the problem lies in the volume server implementation, because in the metrics exposed to Prometheus those stuck intervals are counted towards volume server response times (unless the response times are not collected from the volume servers themselves, in which case I may be wrong).

PeterCxy (Contributor Author) commented:
@chrislusf I have found the cause of the issue -- it's in your fork of fuse. In particular, it is this commit: seaweedfs/fuse@b6b8dbb. Removing this or drastically increasing N temporarily fixed the issue for me.

It distributes work round-robin to N per-worker queues, disregarding the fact that some requests take much longer to finish than others. As a result, a request can end up stuck behind a backlog of slow requests in its assigned queue while all of the other workers sit idle. I'd suggest either reverting the commit or implementing some sort of shared work queue or work stealing to avoid this situation (a sketch of the shared-queue idea follows below).

I'm going to run some tests locally to see if a shared work queue is a better solution, or maybe this should just be reverted, since the goroutine scheduler itself implements work queuing and stealing.
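
For context, here is what the shared-queue alternative looks like in outline (a simplified sketch under assumed request and worker types, not the fuse fork's actual code): all workers pull from one channel, so a slow request ties up only the worker handling it instead of blocking everything queued behind it on that worker's private channel.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// request is a hypothetical stand-in for a FUSE request.
type request struct {
	id   int
	cost time.Duration // how long the handler takes; slow writes vs. fast reads
}

func main() {
	const numWorkers = 4

	// One shared, unbuffered queue: any idle worker picks up the next request.
	queue := make(chan request)
	var wg sync.WaitGroup

	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for req := range queue {
				time.Sleep(req.cost)
				fmt.Printf("worker %d finished request %d\n", id, req.id)
			}
		}(w)
	}

	for id := 0; id < 16; id++ {
		cost := 10 * time.Millisecond
		if id%4 == 0 {
			cost = 200 * time.Millisecond // simulate a slow write request
		}
		queue <- request{id: id, cost: cost}
	}
	close(queue)
	wg.Wait()
}
```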

chrislusf (Collaborator) commented:
Nice! I actually wonder how you found it out so quickly. :)

PeterCxy added a commit to PeterCxy/fuse that referenced this issue Aug 17, 2021
The original round-robin job queue distribution can cause starvation
when a request is distributed to a worker which has a long backlog of
slow requests to process. This is especially prevalent when there are a
lot of concurrent read and write requests, as mentioned in issue
seaweedfs/seaweedfs#2263.

This commit uses `reflect.Select()` so that the request is only ever
sent to a channel in the ready state. As part of this change, all the
channels are now unbuffered instead of buffered, and the main FUSE
serving loop will block when no idle worker is available. There is no
point building up a job queue when no goroutine can process it -- the
filesystem operation will block anyway, and the backlog only increases
memory usage for no benefit. Just do not accept a new request when it
cannot be processed in time.

I am not sure if using the dynamic `select` provided by `reflect` is
faster than just creating a goroutine for each request, but doing this
at least allows us to still limit the maximum number of concurrent
requests to a reasonable level.

I'd also like to propose a customizable worker pool size `N` (and expose
it to the SeaweedFS side), but this is a topic for another day.