S3 write timeout (Ruby 2.0) #241
Comments
Actually it turns out that the failure is somewhat intermittent. After leaving my computer idle for a while, the above script actually works a few times before starting to fail again. Once it starts failing, it fails pretty consistently. Note that it always succeeds on Ruby 1.9, so this is definitely something 2.0-related.

Is this the entire script, or are you doing anything else (threading perhaps)?

Hi Trevor, that's the entire script. This is on a Linux system, if that's significant. I'll try to replicate it on another box when I get some time, to see if it's a local oddity.
OK, so I can replicate this on an EC2 Ubuntu Server 12.04 instance set up as follows:
It doesn't happen as frequently as it does on my local box, but it does happen. Most times it happens when I rerun after leaving the box idle for a few minutes. Edit: now I've managed to get it into a state where it's happening pretty much every time I run it.
I've been receiving this error a lot; it turned out for me it was threading (too many concurrent CPU-heavy jobs in my worker queue). I do wonder if this may have had a hand in some of the errors I logged whilst only having one concurrent job queue (one thread per process), but they certainly weren't as common as you've mentioned. I'll be doing a big upload again very soon and will carefully monitor the failures.

One of the differences I can see between some of my code and yours is that I'm passing in my own IO object (built on top of StringIO with some convenience methods to implement tempfiles if a path is requested). You could attempt to read the file into memory as a string, if space allows, and pass that in, which would remove any potential issues caused by the file from the equation. Everything else in your case is vanilla, so if it's not your file, it likely is the library. G
The error still occurs even if you write a simple string instead of a file, so it doesn't seem to be related to the file being uploaded at all. |
Also experiencing intermittent failures. Currently just using the following workaround in place of calls to S3Object's
@ampedandwired The error message you are getting,

You mentioned you were able to reproduce this issue using strings (and not just files). It also does not appear to be threading-related (your example does not create multiple threads). I'm not able to reproduce this issue locally. It would be helpful if you could enable some detailed logging so I can get more insight into what is going on. Can you try this:

```ruby
# when creating s3, enable a few loggers
s3 = AWS::S3.new(:logger => Logger.new($stdout), :http_wire_trace => true)
```

I'm hoping this will help shed some light on what is happening so we can get this fixed. Thanks!
Here's a full trace from two consecutive tests, the first one succeeded the second did not:
The code being run is as follows:

```ruby
require 'aws'

s3 = AWS::S3.new(:logger => Logger.new($stdout), :http_wire_trace => true)
emr = AWS::EMR.new
bucket_name = 'xxxxx'

puts "Bucket exists: #{s3.buckets[bucket_name].exists?}"
emr.job_flows.each { |jf| puts jf.name }
puts "Uploading file"
s3.buckets[bucket_name].objects['foo'].write("I'm a little teapot")
```
In terms of replicating the problem, did you try running it on an EC2 Ubuntu instance? I was able to replicate the problem there on a clean box (using the setup documented in my comment above), so hopefully you'll be able to duplicate my results that way.

I've just spun up a clean EC2 Ubuntu instance and I'm going to try to replicate the issue again.
I'm still unable to reproduce the issue. Here are the exact steps I took:

```ruby
require 'aws'

s3 = AWS::S3.new(:logger => Logger.new($stdout), :http_wire_trace => true)
emr = AWS::EMR.new
bucket_name = 'aws-sdk'

puts "Bucket exists: #{s3.buckets[bucket_name].exists?}"
emr.job_flows.each { |jf| puts jf.name }
puts "Uploading file"
s3.buckets[bucket_name].objects['foo'].write("I'm a little teapot")
```

I've run this 50+ times with a few pauses. I will come back to this after I've let the instance idle.
We also ran into this problem after upgrading to ruby-2.0.0-p0. We saw a significant increase in S3 upload retries immediately after the upgrade. It doesn't seem to be related to the data, as our retries eventually succeed. This test case will fail for me sometimes on ruby-2.0:

```ruby
require 'aws'

s3 = AWS::S3.new(
  :access_key_id => ENV['AWSID'],
  :secret_access_key => ENV['AWSKEY'],
)
bucket = ENV['BUCKET']
pid = Process.pid
size = 1024 * 1024 * 25

0.upto(100).each do |i|
  name = "#{pid}.#{i}"
  puts "generating #{name}"
  File.open('/dev/urandom', 'rb') do |urand|
    File.open("/tmp/#{name}", 'wb') do |out|
      out.write(urand.read(size))
    end
  end
  puts "uploading #{name}"
  File.open("/tmp/#{name}", 'rb') do |f|
    s3.buckets[bucket].objects[name].write(f,
      :multipart_threshold => 1024 * 1024 * 20,
    )
  end
  puts "finished #{name}"
end
```
Since this seems to be specific to Ruby 2.0, I wonder if those who are experiencing this issue on p0 can also reproduce it on 2.0.0-p195, which was released last week. There may have been some related changes that could have fixed this issue.
Some further notes:

I wonder what regions everybody else is using?
Hi guys. Not sure if this is related, or a separate issue: thoughtbot/paperclip#751 (comment) Those of us using Paperclip have been encountering timeouts caused by the aws-sdk gem after version 1.6.2. This is a major issue, and one that's been around for some time. I don't know if there has been any coordination on fixing it yet.

@uberllama Thank you for looping us in. I'll chime in on the Paperclip thread. After a quick scan of the other issue, they do not appear to be related, but I could be wrong.
I think I got the same issue, my test case is to upload a bunch (~40) of smaller files:
I had the error locally as well, but increasing the timeouts using

helped. The curious thing is that on Heroku the error clearly comes before the timeout runs out. I'll now test what happens if I exchange aws-sdk with some other S3 upload gem (fog).
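The exact timeout options were lost in this thread's formatting. In the v1 aws-sdk API, the relevant settings are `:http_open_timeout` and `:http_read_timeout`; a sketch with illustrative values:

```ruby
require 'aws'

# Illustrative values only; tune for your workload.
AWS.config(
  :http_open_timeout => 15,    # seconds to wait when opening a connection
  :http_read_timeout => 300    # seconds to wait for response data to arrive
)
```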
OK, if I exchange aws-sdk with fog, it works for me on Heroku + Ruby 2.0.0 as well.
We're seeing this bug as well; it's causing some of our backup scripts to fail on occasion. Like a few others have mentioned above, this is a script that has run without incident for a few years, but started failing in the past week. The failures coincide with our move to Ruby 2.0. (Also: though we use Paperclip elsewhere, this script has nothing to do with Paperclip.)

The timeout errors are sporadic, and during a given backup, one or more files may be successfully copied while a later file fails to copy due to the timeout error. They seem to happen when there are multiple files being transferred and a particular file in the succession of writes happens to take a bit longer. I've done some logging during these writes and, though my sample size is a bit small, it appears that writes where the connection is open for 20 seconds are failing, even though our timeout thresholds are well above that. Some sample output from the script with instrumentation for debugging is included below. Notable is this section:

which is where the timeout error occurs. So far (admittedly, in a small sample of trials), this same script works fine as long as the elapsed time stays under 20 seconds. I'll do some more experimenting and post an update if we glean new info, but I'm wondering if there's some reason, in Ruby 2.0, that the http_read_timeout (or is there another setting in play when doing writes?) is actually 20 seconds, even though it is configured at a value well above that. Here's the output:
Experiencing this on Ruby 2.0.0p195. Same symptoms: intermittent massive write timeout.

I'm also experiencing this issue on ruby 2.0.0p195. It was working fine on 1.9.3p194. ruby 2.0.0p195 (2013-05-14 revision 40734) [i686-linux]
@frogandcode suggested trying to establish a new connection per upload, so I switched our backup to use the method below. We haven't noticed any issues so far, and it appears some writes may even be faster than before. I wouldn't consider it the ultimate solution, but it might be helpful if you're writing important backups.

```ruby
def write(name, file)
  @bucket.objects[name].write(file)
ensure
  # Ensure the HTTP pool is emptied after each write.
  AWS.config.http_handler.pool.empty!
end
```
+1 Today I deployed a ruby 2.0.0p195 upgrade and immediately started encountering this issue with aws-sdk 1.11.3. It is an intermittent error, but I do so many S3 writes that I see it every few minutes.
I've reverted to the following nested retries until we can get a fix. Retrying 4 times catches 100% of the failures (~20k uploads per day):

```ruby
# Dirty dirty dirty (S3Bug)..
begin
  tempobject.rewind
  obj = s3_bucket.objects[uid].write(tempobject)
rescue Exception => e
  puts "Upload Failed once.."
  begin
    tempobject.rewind
    obj = s3_bucket.objects[uid].write(tempobject)
  rescue Exception => e
    puts "Upload Failed twice.."
    begin
      tempobject.rewind
      obj = s3_bucket.objects[uid].write(tempobject)
    rescue Exception => e
      puts "Upload Failed three times.."
      tempobject.rewind
      obj = s3_bucket.objects[uid].write(tempobject)
    end
  end
end
```
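The nested rescues above can be flattened into a single bounded retry loop. A minimal sketch (not from this thread; `upload_with_retries` is an invented name, and the block stands in for the `s3_bucket.objects[uid].write` call):

```ruby
# Retry an upload of a rewindable IO up to `attempts` times.
# Rewinding before every attempt matters: a partially-consumed stream is
# exactly what makes the broken Net::HTTP retry hang (see comments above).
def upload_with_retries(io, attempts: 4)
  tries = 0
  begin
    tries += 1
    io.rewind
    yield io
  rescue StandardError
    retry if tries < attempts
    raise
  end
end
```

Usage would look like `upload_with_retries(tempobject) { |io| s3_bucket.objects[uid].write(io) }`.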
@thinkgareth Could you try testing out the master branch and see if that resolves your issue (without the nested 4x retry block)? I believe it might.
I have exactly the same issue here. I tried all the different methods (extending timeouts, emptying the HTTP pool, etc.), and it seems that using File or Pathname instead of an IO stream may help a bit; the error rate decreases. However, no matter what I did, the intermittent errors never disappeared. I am using Heroku, Ruby 2.0, and the latest Sinatra, Unicorn, and AWS SDK.
@cdunn I'm reading through the code where request_transport is retried. There appears to be a bug in their retry logic: when they call req.exec the second time, no attempt is made to rewind the body stream first. Simply put, the first request succeeds in sending some bytes, but not all of them, before it encounters an error. Net::HTTP decides to retry the request but fails to rewind the body stream, causing fewer bytes to be sent than indicated by the Content-Length header. S3 keeps waiting for the additional bytes, never gets them, and then fails the request. The SDK's retry logic always rewinds the body stream before it attempts to resend a request; Net::HTTP makes no such attempt. I'm going to try to file this as a bug report and see if we can't get this fixed.
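The failure mode described above is easy to demonstrate with a plain StringIO standing in for the request body (a toy illustration, not SDK code):

```ruby
require 'stringio'

body = StringIO.new("I'm a little teapot")

# First attempt: Net::HTTP reads and sends the whole body.
first_attempt = body.read

# A naive retry that re-reads the same stream without rewinding gets
# nothing, while the Content-Length header still promises the full body;
# S3 waits for the missing bytes and eventually times the request out.
broken_retry = body.read

# Rewinding first, as the SDK's own retry logic does, sends the full body.
body.rewind
correct_retry = body.read
```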
The workaround proposed by @cdunn works for me to remove the timeout, but I still get tons of retries. They most often come from the EOF error. Oddly enough, running on a single thread produces the most retries, while they decrease as I ramp up to 5 threads. Seems like this would point to it not being congestion-related...

I ended up using the 'fog' gem to upload to S3, and I get no more timeouts or retries.
@pickerflicker Ruby's Net::HTTP changed between 1.9.3 and 2.0 in a way that causes this error to occur. In 1.9.3, when the connection fails or times out, the SDK retries the request. In Ruby 2.0.0, Net::HTTP attempts to retry the failed request itself, but in a broken way: it retries the PUT request without first rewinding the body stream. This is okay if the request payload is a string, but the Ruby SDK uses IO objects for streaming requests. Users of 1.9.3 are likely not reporting issues because their requests succeed when the SDK retries.

The timeout issue likely happens (this I have difficulty verifying) when S3 closes a long-running HTTP session. This may be due to congestion or latent network issues; it's hard to say. You are not seeing this issue with fog for probably two reasons. First, fog uses Excon for HTTP requests, not Net::HTTP. Second, it does not use persistent HTTP connections by default; a new connection is established (and closed) for each request. You might experience the infrequent delay of S3 closing a connection if you enable persistent connections, but you likely would not experience the second issue, as Excon would not retry your request without rewinding the stream.

There are a number of possible workarounds. I'm trying to get a bug report and fix put together for Net::HTTP. Until then, you can disable retries of PUT requests, following @cdunn's suggestion:

```ruby
Net::HTTP::IDEMPOTENT_METHODS_.delete("PUT")
```

Also, it would be possible to put together a simple Excon-based handler.
Thanks for clarifying @trevorrowe. I wasn't exactly sure why the fog gem worked. I'm using Ruby 2.1.0. My original issue was not an actual error, since the second PUT request would always be quick and successful. My issue was that the occasional 20s timeout is just too long and ruins the user experience. I've tried adding the Net::HTTP workaround to my Rails project initializers, but I still saw these occasional timeouts.
@pickerflicker Here is a sample http handler that uses Excon:

```ruby
require 'excon'

class ExconHandler
  def handle(req, resp, &read_block)
    options = {
      :method => req.http_method.downcase.to_sym,
      :path => req.path,
      :query => req.querystring,
      :headers => req.headers,
      :body => req.body_stream,
    }
    options[:response_block] = read_block if block_given?
    connection = Excon.new(req.endpoint)
    excon_resp = connection.request(options)
    resp.status = excon_resp.status.to_i
    resp.headers = excon_resp.headers
    resp.body = excon_resp.body unless block_given?
  end
end

AWS.config(http_handler: ExconHandler.new)
```

I would be interested to know if using this resolves your issue. There are a few features missing from this implementation, but they could all be addressed without much effort.
@trevorrowe Sorry, there's no good way for me to try this out now. This timeout issue didn't show up in QA (probably not enough requests) and I can't experiment on production.

@pickerflicker What sort of frequency were you seeing the timeouts + retries in production previously?

@trevorrowe The timeouts would occur at least once every 2 hours. I would see a long POST request take a little over 20 seconds in New Relic. These requests all had the same signature: a first PUT request timing out after 20 seconds, followed by one retry which was always successful and quick.
The problem isn't solved yet; please let us know the progress.
We've been able to resolve these issues in the V2 Ruby SDK by patching Net::HTTP. These patches are:

You can see the patches in the v2 SDK here:

@mikhailov If you are interested in testing a patch, I'd be willing to backport these into a branch of the v1 Ruby SDK.
@trevorrowe that's great, we can test the patch, thanks! A question about s3_endpoint: we still use it, but it seems to be deprecated in favor of region. Can we still use the s3_endpoint syntax with aws-sdk-ruby v2?
@mikhailov you can use |
Net::HTTP has a few bugs regarding Expect-100-continue and how it retries idempotent HTTP requests. This commit enables two patches that address these issues. Without these patches, the user may experience S3 timeout issues. Unfortunately, the built-in retry logic is buggy: when *some* of the body is sent before a network failure, it fails to rewind the body before sending the retry. This results in a partial body of fewer than "Content-Length" bytes. S3 eventually times out the request waiting for all of the bytes. This commit makes two major changes to avoid this scenario:

* It enables the Expect-100-continue HTTP logic BEFORE sending bytes to Amazon S3. This eliminates most of the common failure scenarios where bytes are sent and the request is broken by the remote end (especially for auth errors).
* It patches the Net::HTTP transport method to rewind the request body stream, if the body responds to #rewind, before attempting a retry.

Previously, the 100-continue behavior and the second patch were opt-in, but this is problematic enough for Ruby 2 users that this commit enables them by default. See #241
@mikhailov I've pushed a branch, s3-write-timeout-patch, which backports some changes from the V2 SDK into this repo. Please feel free to give this a spin and let me know if it resolves the issue you are experiencing.

Thanks for your hard work over such a long period of time on this! Very exciting.
👏 |
I should add a caveat that the fix only applies for Ruby versions 1.9.3+. Ruby 1.8.7 and Ruby 1.9.2 do not support HTTP Expect-100-continue and will therefore continue to have their connections closed by Amazon S3 under certain circumstances.

Ruby 1.8.7 is retired, and 1.9.3 will be EOL soon (support for Ruby version 1.9.3 ends on February 23, 2015). 🎆 🚢 👍
For some reason aws-sdk is throwing timeout errors, and this was a known problem (aws/aws-sdk-ruby#241) which is currently supposed to be fixed. Anyway, for some reason this happens only with Tempfiles that were changed by ImageMagick; I tried calling #fsync but it didn't work. So, to avoid this bug, we simply reopen the tempfile to get a fresh file descriptor again.
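The reopen trick described above looks roughly like this (a sketch; the S3 upload call itself is omitted, and the file contents are a placeholder):

```ruby
require 'tempfile'

tmp = Tempfile.new(['image', '.png'])
tmp.write('bytes written by an external tool')  # stand-in for ImageMagick output
tmp.flush

# Reopen by path to get a fresh file descriptor, instead of handing the
# original (possibly stale) Tempfile handle to the SDK's write call.
fresh = File.open(tmp.path, 'rb')
data = fresh.read   # in real code: s3_object.write(fresh)
fresh.close
tmp.close!
```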
This is an odd one. The following code generates a timeout for me when run with ruby 2.0.0-p0, but it's OK with 1.9.3-p385. Using aws-sdk 1.9.2.

The calls to `bucket.exists?` and `job_flows.each` are significant: if you remove either of these, the write succeeds. The error generated: