New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store assets on S3 as well as NFS and optionally serve from S3 on public URLs #74

Merged
merged 8 commits into from Jul 19, 2017

Conversation

Projects
None yet
3 participants
@floehopper
Contributor

floehopper commented Jul 19, 2017

This is the first step in changing the Asset Manager to store its assets in and serve its assets from AWS S3, i.e. to avoid using NFS.

This PR means that whenever an asset is successfully virus scanned a new asynchronous job is kicked off to upload the asset to S3. The fact that it's a separate job should mean that the current external behaviour is unaffected; the assets are still stored on NFS as before.

Also if stream_from_s3=true is added to the query string of the public URL for an asset, then the asset will be streamed from S3 via the Asset Manager app. The default behaviour is unaffected.

Serving assets through the Rails app and not using X-Sendfile and nginx is naïve and may well not be practical for production use, but deploying these changes will allow us to check that we have all the appropriate infrastructure & credentials in place before we consider alternative options. This is explained in more detail in the relevant commit note.

The new behaviour requires the following standard environment variables to be set:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_REGION

And this application specific environment variable:

  • AWS_S3_BUCKET_NAME

Supersedes #63.

Add hook to save assets to cloud storage as well as NFS
The `S3Storage#save` method is called when an asset is virus scanned
successfully. At the moment the `S3Storage#save` method is a no-op but
this commit ensures that the method is called when an asset is marked as
clean.

We're planning to treat files that are uploaded to S3 as disposable
while we work to get this functionality integrated. For that reason,
and to have a minimal impact on the application, we are not storing
the "uploaded" state in the database.
@chrisroos

This comment has been minimized.

Show comment
Hide comment
@chrisroos

chrisroos Jul 19, 2017

Contributor

I think we should remove the ASSET_MANAGER prefix from the ASSET_MANAGER_S3_BUCKET_NAME environment variable. I've recently discovered that this sort of namespacing is unnecessary. App-specific environment variables are written to app-specific directories by Puppet and are only loaded/read in the processes that run the app to avoid interfering with other apps on the same server.

Contributor

chrisroos commented Jul 19, 2017

I think we should remove the ASSET_MANAGER prefix from the ASSET_MANAGER_S3_BUCKET_NAME environment variable. I've recently discovered that this sort of namespacing is unnecessary. App-specific environment variables are written to app-specific directories by Puppet and are only loaded/read in the processes that run the app to avoid interfering with other apps on the same server.

@chrisroos chrisroos assigned chrisroos and unassigned chrisroos Jul 19, 2017

floehopper and others added some commits Jul 18, 2017

Save asset to AWS S3 bucket when virus scan passes
This relies on the following environment variables:

    AWS_ACCESS_KEY_ID
    AWS_SECRET_ACCESS_KEY
    AWS_REGION

    AWS_S3_BUCKET_NAME

The first three are standard for the `aws-sdk` gem. The last one is
specific to this app.
Notify Errbit via Airbrake if upload to S3 fails
The uploads to S3 happen in a `Delayed::Job`. While we are working
on configuring S3 in the various deployment environments, if this job
fails it may be due to a configuration error. In that case we'd like
to know what the exception was so that we can fix it. This commit
notifies Errbit (via the Airbrake library) and re-raises the exception
so that `Delayed::Job` will retry it (in case we fix the problem, or it
was temporary).

Note that this change replicates the pattern used in
`Asset#scan_for_viruses`, the other method invoked via `Delayed::Job`.
A better solution would be to implement a generic solution for reporting
all failures that occur in a `Delayed::Job`. See #68 for suggestions on
how to implement this.
Optionally stream asset from S3 on public URL
This is a very naïve approach to serving assets stored on S3 (instead of
on NFS) from the Asset Manager app. The current behaviour should be
unchanged and so this should be safe to deploy to production.

The new behaviour is triggered by adding "stream_from_s3=true" to the
URL query string.

The default behaviour uses `ActionController::DataStreaming#send_file`
[1] in combination with the `config.action_dispatch.x_sendfile_header`
option and the nginx configuration so that the asset is actually
streamed directly from disk by nginx and does not pass through the Rails
app at all.

The new behaviour downloads the file from S3 into a StringIO object in
memory and then uses `ActionController::DataStreaming#send_data` [2] to
send the data to the browser.

The new approach is unlikely to be viable in production, because it
means that all data for all assets has to be downloaded from S3,
temporarily stored in memory, and then streamed to the browser from the
Rails app, tying up more memory and a Ruby thread/process each time.
This could easily mean that the app is overloaded and requests are
blocked or result in a time-out.

However, we think the new behaviour is a useful first step to verify
that the app is correctly wired up to use the relevant AWS credentials,
etc.

The only way the new behaviour differs from the default behaviour from
the client browser's point of view is that an `ETag` header is sent in
the response. This is because `Rack::ETag` [3] in the Rails middleware
stack automatically sets an `ETag` header on all String bodies. Since
in the new approach the file response is sent as a String from the Rails
app passing through the middleware stack and it does not respond to the
`to_path` method, an `ETag` header is added.

It's not obvious whether this is actually a problem, but we should
definitely investigate further if we want to pursue using the new
behaviour as the default behaviour.

[1]: http://api.rubyonrails.org/v4.2.7.1/classes/ActionController/DataStreaming.html#method-i-send_file
[2]: http://api.rubyonrails.org/v4.2.7.1/classes/ActionController/DataStreaming.html#method-i-send_data
[3]: http://www.rubydoc.info/gems/rack/1.6.4/Rack/ETag
Use a separate Delayed::Job to upload assets to S3
Previously the upload to S3 happened within the
`Asset#scan_for_viruses` job. Now it happens within the new
`Asset#upload_to_s3` job.

This means that the default behaviour (i.e. everything except the upload
to S3) is completely unchanged by the addition of the upload to S3.
Avoid connecting to AWS if S3 bucket name not set
This commit introduces `S3Storage.build` to as a way for
`Services.cloud_storage` to construct the service object. If the bucket
name is not set then an instance of the new `S3Storage::Null` class is
returned. Calls to methods on this instance do not attempt to connect to
AWS and hence we should avoid lots of exceptions being reported to
Errbit.

We're making an assumption here that if the bucket name is set then it's
very likely that the other AWS environment variables will be set.
@floehopper

This comment has been minimized.

Show comment
Hide comment
@floehopper

floehopper Jul 19, 2017

Contributor

@chrisroos, @chrislo: I've renamed the environment variable to AWS_S3_BUCKET_NAME to make it non-application-specific and using the AWS prefix to associate it with the other related env vars. I'm about to force-push the changes.

Contributor

floehopper commented Jul 19, 2017

@chrisroos, @chrislo: I've renamed the environment variable to AWS_S3_BUCKET_NAME to make it non-application-specific and using the AWS prefix to associate it with the other related env vars. I'm about to force-push the changes.

@chrislo

This comment has been minimized.

Show comment
Hide comment
@chrislo

chrislo Jul 19, 2017

Contributor

This looks good to me. I checked out this branch to my local development environment to test it. Testing file upload without setting any environment variables gave the following output from rake jobs:work

[Worker(host:munro.local pid:57177)] Job Asset#scan_for_viruses (id=596f70632da79ade16000001) RUNNING
[Worker(host:munro.local pid:57177)] Job Asset#scan_for_viruses (id=596f70632da79ade16000001) COMPLETED after 8.6514
[Worker(host:munro.local pid:57177)] Job Asset#save_to_cloud_storage (id=596f706e2da79adf59000000) RUNNING
[Worker(host:munro.local pid:57177)] Job Asset#save_to_cloud_storage (id=596f706e2da79adf59000000) COMPLETED after 0.0012
[Worker(host:munro.local pid:57177)] 2 jobs processed at 0.2305 j/s, 0 failed

Testing file upload with the env variables set as per the explanation above gave

[Worker(host:munro.local pid:58096)] Job Asset#scan_for_viruses (id=596f715a2da79ae2e3000001) RUNNING
[Worker(host:munro.local pid:58096)] Job Asset#scan_for_viruses (id=596f715a2da79ae2e3000001) COMPLETED after 8.8415
[Worker(host:munro.local pid:58096)] Job Asset#save_to_cloud_storage (id=596f71632da79ae2f0000000) RUNNING
[Worker(host:munro.local pid:58096)] Job Asset#save_to_cloud_storage (id=596f71632da79ae2f0000000) COMPLETED after 0.5874
[Worker(host:munro.local pid:58096)] 2 jobs processed at 0.2113 j/s, 0 failed

I was able to view the uploaded asset at /media/<id>/image.jpeg?stream_from_s3=true and /media/<id>/image.jpeg and the response to the former had the ETAG header set as per fe04bbe.

The only thing I could think to change here would be to not run the delayed job at all if the environment variable wasn't set, rather than run a job which suceeds but does not have the expected side effect - this might make it a little easier to see at a glance what's going on if we have misconfigured something. Or perhaps we could log something informative instead? I think we'll be able to work it out either way though so 👍

Contributor

chrislo commented Jul 19, 2017

This looks good to me. I checked out this branch to my local development environment to test it. Testing file upload without setting any environment variables gave the following output from rake jobs:work

[Worker(host:munro.local pid:57177)] Job Asset#scan_for_viruses (id=596f70632da79ade16000001) RUNNING
[Worker(host:munro.local pid:57177)] Job Asset#scan_for_viruses (id=596f70632da79ade16000001) COMPLETED after 8.6514
[Worker(host:munro.local pid:57177)] Job Asset#save_to_cloud_storage (id=596f706e2da79adf59000000) RUNNING
[Worker(host:munro.local pid:57177)] Job Asset#save_to_cloud_storage (id=596f706e2da79adf59000000) COMPLETED after 0.0012
[Worker(host:munro.local pid:57177)] 2 jobs processed at 0.2305 j/s, 0 failed

Testing file upload with the env variables set as per the explanation above gave

[Worker(host:munro.local pid:58096)] Job Asset#scan_for_viruses (id=596f715a2da79ae2e3000001) RUNNING
[Worker(host:munro.local pid:58096)] Job Asset#scan_for_viruses (id=596f715a2da79ae2e3000001) COMPLETED after 8.8415
[Worker(host:munro.local pid:58096)] Job Asset#save_to_cloud_storage (id=596f71632da79ae2f0000000) RUNNING
[Worker(host:munro.local pid:58096)] Job Asset#save_to_cloud_storage (id=596f71632da79ae2f0000000) COMPLETED after 0.5874
[Worker(host:munro.local pid:58096)] 2 jobs processed at 0.2113 j/s, 0 failed

I was able to view the uploaded asset at /media/<id>/image.jpeg?stream_from_s3=true and /media/<id>/image.jpeg and the response to the former had the ETAG header set as per fe04bbe.

The only thing I could think to change here would be to not run the delayed job at all if the environment variable wasn't set, rather than run a job which suceeds but does not have the expected side effect - this might make it a little easier to see at a glance what's going on if we have misconfigured something. Or perhaps we could log something informative instead? I think we'll be able to work it out either way though so 👍

@floehopper

This comment has been minimized.

Show comment
Hide comment
@floehopper

floehopper Jul 19, 2017

Contributor

@chrislo: I think the AWS logging might help in that scenario, but I take your point.

Contributor

floehopper commented Jul 19, 2017

@chrislo: I think the AWS logging might help in that scenario, but I take your point.

@floehopper floehopper merged commit 507e514 into master Jul 19, 2017

2 checks passed

continuous-integration/jenkins/branch This commit looks good
Details
security/snyk No new vulnerabilities
Details

@floehopper floehopper deleted the store-assets-on-both-nfs-and-s3-and-optionally-serve-from-s3-on-public-urls branch Jul 19, 2017

@chrislo

This comment has been minimized.

Show comment
Hide comment
@chrislo

chrislo Jul 19, 2017

Contributor

@floehopper - ah, apologies. It wasn't clear to me from that commit what that would do - but yes, if it logs all of the AWS requests I don't think we need anything else.

Contributor

chrislo commented Jul 19, 2017

@floehopper - ah, apologies. It wasn't clear to me from that commit what that would do - but yes, if it logs all of the AWS requests I don't think we need anything else.

chrisroos added a commit to alphagov/govuk-puppet that referenced this pull request Jul 20, 2017

Add S3 bucket config to Asset Manager
We're enhancing Asset Manager to upload files to, and serve files from
S3. This PR sets the AWS environment variables required by Asset
Manager.

We're safe to use the standard AWS environment variable names because we
rely on `govuk_setenv` to provide each application with its own
environment.

See the related PRs in asset-manager and govuk-terraform-provisioning:

* alphagov/asset-manager#74
* alphagov/govuk-terraform-provisioning#125

@chrislo chrislo referenced this pull request Jul 20, 2017

Closed

Use S3 for storing and serving assets #76

10 of 10 tasks complete

chrisroos added a commit to alphagov/govuk-puppet that referenced this pull request Jul 21, 2017

Add S3 bucket config to Asset Manager
We're enhancing Asset Manager to upload files to, and serve files from
S3. This PR sets the AWS environment variables required by Asset
Manager.

We're safe to use the standard AWS environment variable names because we
rely on `govuk_setenv` to provide each application with its own
environment.

See the related PRs in asset-manager and govuk-terraform-provisioning:

* alphagov/asset-manager#74
* alphagov/govuk-terraform-provisioning#125

@chrisroos chrisroos added this to the Stream assets from S3 milestone Jul 31, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment