connection frozen for hours #542
Comments
Any particular reason for turning nonblock off? I think with it on the timeout behavior might be a little more predictable/reliable? |
I was getting Bad Request errors with IncompleteBody error code and followed advice on the mentioned blog post. Since then, I'm not getting these. |
Thanks for the update. |
I think I was not clear here. After adding Now, sometimes I get a frozen connection for hours. Quite infrequently but sometimes. Once a week or less for a task that runs every day. Shouldn't there be a timeout? |
Yes, I would expect a timeout. The one possible exception would be a weird edge case. Basically, the timeouts are largely set up around individual reads/writes/etc. So if you kept getting back, say, one byte per read on a very large request, it might not time out. We do it this way (rather than tracking the whole request against a single global timeout limit) because it is much simpler to implement and understand, and it has almost always behaved closely enough to what we wanted that we haven't worried about it. So it is possible that this is what is causing the problem. That said, the stacktrace makes it look a bit like this is happening while connecting rather than reading/writing, which is not a case I have seen before. I'm also not confident I agree with the analysis in that blog post. It should still use chunked encoding with nonblock disabled. In fact, disabling nonblock seems more likely to cause the hangs you mention than leaving it enabled. That said, it sounds like this solved a problem (though I'm at a bit of a loss as to how/why that would solve the issue). You may want to try updating to a newer excon; I'm by no means certain it will fix the problem, but it shouldn't hurt. Sorry for the difficulty. I'm not certain of the root cause here, but hopefully we can figure it out in time. These infrequent errors are always more difficult to pin down. |
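To make the per-read versus whole-request distinction concrete, here is a minimal sketch using only the Ruby standard library. It is not excon's code: `read_all` and `trickle` are hypothetical helpers invented for illustration, and the pipe stands in for a socket. With only `per_read_timeout` set, a peer that trickles one byte at a time never trips the timeout; adding the (assumed) `overall_timeout` bounds the whole transfer.

```ruby
require 'timeout'

# Hypothetical helper (not part of excon): read until EOF.
# per_read_timeout bounds each individual wait for data;
# overall_timeout, if given, bounds the whole transfer.
def read_all(io, per_read_timeout:, overall_timeout: nil)
  now = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
  deadline = overall_timeout && (now.call + overall_timeout)
  data = +''
  loop do
    wait = per_read_timeout
    if deadline
      remaining = deadline - now.call
      raise Timeout::Error, 'overall deadline exceeded' if remaining <= 0
      wait = [wait, remaining].min
    end
    # IO.select returns nil when nothing became readable within `wait`.
    ready = IO.select([io], nil, nil, wait)
    raise Timeout::Error, 'timed out waiting for data' unless ready
    chunk = io.read_nonblock(1024, exception: false)
    return data if chunk.nil?        # EOF
    next if chunk == :wait_readable  # woke up without data, wait again
    data << chunk
  end
end

# Hypothetical slow peer: writes one byte every `interval` seconds.
def trickle(bytes:, interval:)
  reader, writer = IO.pipe
  Thread.new do
    bytes.times do
      writer.write('x')
      sleep interval
    end
    writer.close
  end
  reader
end
```

For example, `read_all(trickle(bytes: 50, interval: 0.05), per_read_timeout: 0.5)` runs to completion because every single read arrives in time, while the same call with `overall_timeout: 0.3` raises `Timeout::Error`.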
The files in question are images, and not very large: 700 x 700, which need to be stored in S3. I would expect that after a few minutes I would get some sort of Timeout Exception and could then decide whether to retry. I have commented out |
Hmm. Are the images coming from memory or are they on disk before uploading? Just wondering if there are other places where the contention may be coming from. |
From local hard disk. |
Is it in parallel? I just wonder if there might not be issues with local disk access somewhere in here instead of over-the-wire, but could be grasping at straws. |
It's just sequential and in an underused machine with lots of cpu, memory and disk. |
K. Thanks for the update |
Finding more problems. With default configuration, commented out the line
This is a rake task that fetches remote information and images and stores it, and each item should not take more than 2 seconds. If there's a problem at any endpoint, I'd like to have a Timeout Exception in a reasonable time: 20 seconds maybe. This task must store between 10K-20K items per day. To be precise about my latest comments, sometimes there will be two or three jobs in parallel uploading images. All of them are sequential though. |
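As a stopgap for the "fail within about 20 seconds" requirement, one option is to put a hard limit around each item in the task itself. This is only a sketch using Ruby's standard `Timeout` module: `with_deadline` is a hypothetical helper, not something excon or fog provides, and `Timeout.timeout` interrupts the block from another thread, so it is a blunt tool that may leave a connection in an unknown state.

```ruby
require 'timeout'

# Hypothetical wrapper (not an excon/fog API): run one item's work under a
# hard overall limit, optionally retrying a fixed number of times.
def with_deadline(seconds, retries: 0)
  attempts = 0
  begin
    attempts += 1
    Timeout.timeout(seconds) { yield }
  rescue Timeout::Error
    retry if attempts <= retries
    raise
  end
end
```

In the rake task this could wrap each upload, for instance `with_deadline(20, retries: 1) { ... }` around whatever code stores a single image, so that a frozen connection surfaces as a `Timeout::Error` instead of hanging for hours.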
Just got another couple of freezes. This time, this was the only task connecting to S3. Default configuration. I noticed the first one after two hours. I killed it, relaunched, and got another one.
|
@jogaco hey, thanks for working through this with me and providing extra details. Sorry it is taking so long for me to sort it out. That said, I think I made progress this morning. It looks like there were a couple of selects in the nonblock case, for ssl specifically, that were hidden deep in things and that I had overlooked previously. I just pushed some changes to master to clean up a bit of code and make the timeout behavior around these more consistent, which should hopefully fix your issue. Additionally, I think this would explain why disabling non-block would prevent the issue. Any chance you could try the current master and see if that fixes the issue? All tests pass, but it would be nice to know if it helps you too before I do a release with no real effect. Thanks! |
@geemus I really appreciate that this component works most of the time: 99.99%. I'm just providing more details so this can be ironed out for me and others. On the contrary, your response is extremely timely! |
Another freeze with non-block disabled.
|
Curious. Are you certain non-block is disabled though? I think that select should be within a |
Non-block is not disabled. It is the default mode, right? As opposed to my first stack traces. This is the relevant line in my current initializer, commented out:
|
Seems to be working fine. I'll let you know of any issues. |
Cool, appreciated. Hopefully this will solve it, but since it is intermittent it may be hard to know for sure. |
Yeah. I know. Watching closely over the next few days... |
A short note to report I haven't had any issues since the fix. Thanks! |
Nice, glad to hear it! |
Closing due to inactivity (looks like it was likely fixed). |
During an upload to S3 storage, I got a frozen connection in a rake task. It stayed like this for hours and I had to kill it.
I have this in my config: