[mount/s3/filer] spurious I/O timeout when reading from volume servers #1907
Seems related to the memory leak problem?
@chrislusf: Also, when I tried to check older logs, it seems that these I/O timeouts have been happening before too, both with …
It seems that the issue is unrelated to ZeroTier, because my setup does not work properly even with ZeroTier removed. Unfortunately, I am not able to reliably reproduce the issue outside of my production setup.
Interestingly, whenever the timeout happens, the volume server spams messages like these in its logs.
This is a known issue, and it is clear how to solve it.
@kmlebedev But is it the cause of the timeouts / read failures? Or is it just a cosmetic error?
I think the problem might simply be that SeaweedFS is giving up the connection too early. It seems that SeaweedFS starts reporting the timeout error only about 5-10 seconds after I initiated the I/O operation, which means that the first timeout happened very early. In addition, this error seems to only happen with large files, which could actually take more than 5-10 seconds to download due to network latency and TCP slow start.
Yeah, I believe that Filer / Mount started timing out just ~9s after the request was initiated, while the full download would take ~16s. But I assumed that Filer / Mount is supposed to stream the file data, i.e. it should not wait until the full request completes before sending data back to the client?
@chrislusf I think this could be the issue: https://github.com/chrislusf/seaweedfs/blob/10164d0386460c1c39ed8b5ee5c434704a2b28fd/weed/util/fasthttp_util.go#L17 According to the fasthttp documentation, ReadTimeout is the maximum duration for the full response read, including the body. So if the body took longer than this to read, the request would time out (?).
Ok. We may need to remove the usage of fasthttp package. |
@chrislusf I believe just increasing the ReadTimeout / WriteTimeout will be enough. |
I changed the timeouts locally to time.Minute and it seems at least for now the timeout messages have gone away. |
@chrislusf According to valyala/fasthttp#299, to set the TCP dial timeout (instead of the timeout of the full request), one needs to provide a custom Dial function to the fasthttp client.
We have a limited chunk size.
@kmlebedev But that could be customized, and there doesn't seem to be a sane way to calculate a max timeout from a given chunk size. So we probably need to either get rid of the body read timeout or set it to a high value like 5 or 10 minutes.
This is hardly reasonable, since the chunk size can be set to 1 GB.
Or maybe just assume that the interconnect between SeaweedFS nodes will not be slower than, say, 10 Mbps (~1.25 MiB/s), and calculate the timeout based on this. If users still run into timeout issues, a customizable option could be provided to set the timeout even higher.
On a volume, the bottleneck is the disk, especially the HDD. Accordingly, the read speed may drop to zero.
I agree that smart load-balancing is nice to have, but for now, to resolve this issue, we either need to get rid of the timeout (which could result in Filer getting stuck forever) or set the timeout to some higher value. Not being able to read anything at all is a bigger problem than load-balancing.
Describe the bug
Since some time ago, my weed mount mountpoint stopped working properly and the logs showed a lot of I/O timeouts. This caused the daemon to get stuck retrying and make zero progress. This looked like an issue with the volume servers; however, when I tried to curl the failed URL manually on the exact same machine where the error was reported, it returned immediately with the expected content of the file dumped into test.bin. But the weed mount daemon still kept reporting I/O timeouts even after my manual curl had clearly succeeded. In case it was just me being lucky, I retried the command several times while the I/O timeout errors were continuing, and none of them failed.

System Setup
If there is anything non-standard about my setup, it's probably the interconnect between the nodes -- I used ZeroTier to form a virtual private network between all the nodes to save me some trouble. But I have run tests in the network and no other program seems to have any issue with ZeroTier, including curl.

Expected behavior

SeaweedFS should not time out when curl clearly didn't.

Additional context
I suspect that some TCP connection parameter in either Go's HTTP library or SeaweedFS is at play here. I don't expect this bug to be very reproducible, but any insight into the weird behavior of SeaweedFS is appreciated.
Speaking of the timeout, it seems like my Nginx reverse proxy at the master node can also get stuck halfway through when receiving HTML from the volume servers. I think it is the same issue here.