Store assets on S3 as well as NFS and optionally serve from S3 on public URLs #74
Conversation
The `S3Storage#save` method is now called when an asset is successfully virus scanned. At the moment `S3Storage#save` is a no-op, but this commit ensures that it is called whenever an asset is marked as clean. We're planning to treat files uploaded to S3 as disposable while we work to get this functionality integrated. For that reason, and to keep the impact on the application minimal, we are not storing the "uploaded" state in the database.
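As a rough illustration of the wiring described here, something like the sketch below; only `S3Storage#save` and `Services.cloud_storage` come from this PR, and the `mark_as_clean!` callback name is an assumption for illustration.

```ruby
# Hedged sketch: when the virus scan succeeds the asset is marked clean,
# and at that point the (currently no-op) S3Storage#save is invoked.
class Asset
  def mark_as_clean!
    update_attribute(:state, 'clean')
    Services.cloud_storage.save(self)
  end
end
```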
I think we should remove the
This relies on the following environment variables:

* `AWS_ACCESS_KEY_ID`
* `AWS_SECRET_ACCESS_KEY`
* `AWS_REGION`
* `AWS_S3_BUCKET_NAME`

The first three are standard for the `aws-sdk` gem. The last one is specific to this app.
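For reference, a minimal sketch of how these variables might be consumed. The `aws-sdk` gem picks up the first three from the environment by itself, so only the bucket name needs plumbing; the exact wiring in the app is an assumption.

```ruby
require 'aws-sdk'

# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION are read from the
# environment by the aws-sdk gem; only the bucket name needs explicit handling.
bucket = Aws::S3::Resource.new.bucket(ENV['AWS_S3_BUCKET_NAME'])
bucket.object('example-key').upload_file('/path/to/file') # illustrative call only
```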
Previously the upload to S3 happened within the `Asset#scan_for_viruses` job. Now it happens within the new `Asset#upload_to_s3` job. This means that the default behaviour (i.e. everything except the upload to S3) is completely unchanged by the addition of the upload to S3.
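A minimal sketch of that separation, assuming `delayed_job`'s `handle_asynchronously` is used for both methods (the app's actual enqueueing mechanism may differ):

```ruby
class Asset
  def scan_for_viruses
    # existing virus-scanning behaviour, unchanged; on success the upload job is enqueued
  end
  handle_asynchronously :scan_for_viruses

  def upload_to_s3
    Services.cloud_storage.save(self)
  end
  handle_asynchronously :upload_to_s3
end
```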
The uploads to S3 happen in a `Delayed::Job`. While we are working on configuring S3 in the various deployment environments, if this job fails it may be due to a configuration error. In that case we'd like to know what the exception was so that we can fix it. This commit notifies Errbit (via the Airbrake library) and re-raises the exception so that `Delayed::Job` will retry it (in case we fix the problem, or it was temporary). Note that this change replicates the pattern used in `Asset#scan_for_viruses`, the other method invoked via `Delayed::Job`. A better approach would be to implement a generic mechanism for reporting all failures that occur in a `Delayed::Job`. See #68 for suggestions on how to implement this.
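The pattern in question is roughly the following (a sketch, assuming the Airbrake gem is configured to report to Errbit; the method body is illustrative only):

```ruby
def upload_to_s3
  Services.cloud_storage.save(self)
rescue StandardError => e
  Airbrake.notify(e) # report to Errbit so configuration errors are visible
  raise              # re-raise so Delayed::Job retries the job
end
```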
This is a very naïve approach to serving assets stored on S3 (instead of on NFS) from the Asset Manager app. The current behaviour should be unchanged and so this should be safe to deploy to production. The new behaviour is triggered by adding "stream_from_s3=true" to the URL query string.

The default behaviour uses `ActionController::DataStreaming#send_file` [1] in combination with the `config.action_dispatch.x_sendfile_header` option and the nginx configuration so that the asset is actually streamed directly from disk by nginx and does not pass through the Rails app at all. The new behaviour downloads the file from S3 into a StringIO object in memory and then uses `ActionController::DataStreaming#send_data` [2] to send the data to the browser.

The new approach is unlikely to be viable in production, because it means that all data for all assets has to be downloaded from S3, temporarily stored in memory, and then streamed to the browser from the Rails app, tying up more memory and a Ruby thread/process each time. This could easily mean that the app is overloaded and requests are blocked or result in a time-out. However, we think the new behaviour is a useful first step to verify that the app is correctly wired up to use the relevant AWS credentials, etc.

The only way the new behaviour differs from the default behaviour from the client browser's point of view is that an `ETag` header is sent in the response. This is because `Rack::ETag` [3] in the Rails middleware stack automatically sets an `ETag` header on all String bodies. Since in the new approach the file response is sent as a String from the Rails app, passing through the middleware stack, and it does not respond to the `to_path` method, an `ETag` header is added. It's not obvious whether this is actually a problem, but we should definitely investigate further if we want to pursue using the new behaviour as the default behaviour.

[1]: http://api.rubyonrails.org/v4.2.7.1/classes/ActionController/DataStreaming.html#method-i-send_file
[2]: http://api.rubyonrails.org/v4.2.7.1/classes/ActionController/DataStreaming.html#method-i-send_data
[3]: http://www.rubydoc.info/gems/rack/1.6.4/Rack/ETag
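In outline, the two code paths look something like the sketch below. The controller, action and helper names are assumptions; only `send_file`, `send_data` and the `stream_from_s3` query parameter come from the description above.

```ruby
class MediaController < ApplicationController
  def download
    asset = Asset.find(params[:id])

    if params[:stream_from_s3] == 'true'
      # New behaviour: download the object from S3 into memory and stream it
      # from the Rails app via send_data (Rack::ETag will add an ETag header).
      io = Services.cloud_storage.load(asset) # assumed helper returning an IO
      send_data(io.read, filename: File.basename(asset.file.path), disposition: 'inline')
    else
      # Default behaviour: send_file plus x_sendfile_header lets nginx stream
      # the file straight from disk, bypassing the Rails app.
      send_file(asset.file.path, disposition: 'inline')
    end
  end
end
```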
This commit introduces `S3Storage.build` as a way for `Services.cloud_storage` to construct the service object. If the bucket name is not set then an instance of the new `S3Storage::Null` class is returned instead. Calls to methods on this instance do not attempt to connect to AWS, so we should avoid lots of exceptions being reported to Errbit. We're making an assumption here that if the bucket name is set then it's very likely that the other AWS environment variables will also be set.
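A sketch of the builder / null-object shape described here; only `S3Storage.build`, `S3Storage::Null` and `Services.cloud_storage` come from the commit, and the method bodies are assumptions.

```ruby
class S3Storage
  # Null object returned when no bucket name is configured: its methods do
  # nothing, so no AWS connection is attempted and no exceptions reach Errbit.
  class Null
    def save(*); end
  end

  def self.build
    bucket_name = ENV['AWS_S3_BUCKET_NAME']
    return Null.new if bucket_name.nil? || bucket_name.empty?
    new(bucket_name)
  end

  def initialize(bucket_name)
    @bucket_name = bucket_name
  end

  def save(asset)
    # real upload to the configured bucket goes here
  end
end

module Services
  def self.cloud_storage
    @cloud_storage ||= S3Storage.build
  end
end
```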
@chrisroos, @chrislo: I've renamed the environment variable to
This looks good to me. I checked out this branch to my local development environment to test it. Testing file upload without setting any environment variables gave the following output from
Testing file upload with the env variables set as per the explanation above gave
I was able to view the uploaded asset at

The only thing I could think to change here would be to not run the delayed job at all if the environment variable wasn't set, rather than run a job which succeeds but does not have the expected side effect; this might make it a little easier to see at a glance what's going on if we have misconfigured something. Or perhaps we could log something informative instead? I think we'll be able to work it out either way though, so 👍
@chrislo: I think the AWS logging might help in that scenario, but I take your point.
@floehopper - ah, apologies. It wasn't clear to me from that commit what that would do - but yes, if it logs all of the AWS requests I don't think we need anything else.
We're enhancing Asset Manager to upload files to, and serve files from S3. This PR sets the AWS environment variables required by Asset Manager. We're safe to use the standard AWS environment variable names because we rely on `govuk_setenv` to provide each application with its own environment. See the related PRs in asset-manager and govuk-terraform-provisioning: * alphagov/asset-manager#74 * alphagov/govuk-terraform-provisioning#125
This is the first step in changing the Asset Manager to store its assets in and serve its assets from AWS S3, i.e. to avoid using NFS.
This PR means that whenever an asset is successfully virus scanned a new asynchronous job is kicked off to upload the asset to S3. The fact that it's a separate job should mean that the current external behaviour is unaffected; the assets are still stored on NFS as before.
Also, if `stream_from_s3=true` is added to the query string of the public URL for an asset, then the asset will be streamed from S3 via the Asset Manager app. The default behaviour is unaffected.

Serving assets through the Rails app and not using `X-Sendfile` and nginx is naïve and may well not be practical for production use, but deploying these changes will allow us to check that we have all the appropriate infrastructure & credentials in place before we consider alternative options. This is explained in more detail in the relevant commit note.

The new behaviour requires the following standard environment variables to be set:
* `AWS_ACCESS_KEY_ID`
* `AWS_SECRET_ACCESS_KEY`
* `AWS_REGION`

And this application-specific environment variable:

* `AWS_S3_BUCKET_NAME`
Supersedes #63.