AWS Glacier Support #1109

Closed
pctj101 opened this Issue Aug 21, 2012 · 49 comments


pctj101 commented Aug 21, 2012

Is anyone working towards adding Glacier support for this gem? Didn't want to duplicate effort...

Owner

geemus commented Aug 21, 2012

@pctj101 - you are the first to mention it that I am aware of, so I think you are probably pretty safe to take a shot at it if you would like.

Contributor

ericchernuka commented Aug 22, 2012

@pctj101 I'd totally be willing to provide testing support if you're looking to add Glacier support.

Contributor

alhafoudh commented Aug 22, 2012

+1

pctj101 commented Aug 22, 2012

Looks really different from S3. I figure a few things...

  1. I'm sure that Fog already has a good way to make signatures on the request. I haven't found it, but I'm sure that's the benefit of using Fog.

  1.5) Setup of Glacier (wouldn't be the first thing I'd tackle)

  2. After uploading whatever "archive" file, we get some response that says okay, "Amazon gave it this Archive ID"
    http://docs.amazonwebservices.com/amazonglacier/2012-06-01/dev/api-archive-post.html

  3. For restore, we need to make some signed request to initiate the restore, and Amazon says "Okay, we're working on it"

  4. Then Amazon sends an Amazon SNS notification (at a later time) that the archive is ready for download. We'd have to catch that notification and fetch the file. I suppose that would be like:
    4a) Some program that catches the SNS message
    4b) Some function call that downloads the file
    4c) Some option to combine 4a and 4b: waiting for the SNS message and starting an autodownload would be convenient, but Amazon says it may take 4-6 hours. Not so awesome to wait and block for 4-6 hours.

Here's what kinda got me stuck...

  • Not sure how to sign requests (didn't spend a bunch of time looking either).
  • Because of the asynchronous nature of 3 & 4, I'm not really sure how to structure the program. Do we really want to make fog wait around for an SNS message? Or is that not Fog's place?
Owner

geemus commented Aug 22, 2012

@pctj101 - mostly we are not yet using signature version 4, so we might have to do something there. That said it looks at least somewhat similar to S3 signing, which you can find here: https://github.com/fog/fog/blob/master/lib/fog/aws/storage.rb#L298

As for asynchronous stuff, I'd say that the requests should return immediately, but that there should maybe be helper methods that would allow you to poll/wait if you so choose.
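
For illustration, a minimal sketch of that opt-in poll/wait idea, using the wait_for helper that fog models already provide. The job model and its `completed` attribute are assumptions about the eventual interface, not existing API:

    # assumes an eventual Glacier job model exposing `completed`; wait_for
    # reloads the model until the block returns true or the timeout expires
    job = glacier.vaults.get('my-vault').jobs.get(job_id)
    job.wait_for(4 * 3600, 60) { completed }  # check every 60s, up to 4 hours
    puts job.status_code

The requests themselves stay synchronous and return immediately; only callers who opt in pay the polling cost.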

pctj101 commented Aug 23, 2012

So what happens is that after a restore is requested, X hours later Amazon sends a message to SNS, which is their notification service. SNS can do an HTTP POST, phone SMS, or queue the message in Amazon SQS for programmatic retrieval. In a robotic sense, having fog subscribe to this queue and monitor for messages makes the most sense; however, I'm sure there is some nutball who would rather get an email or HTTP POST instead. So as a flexible library, we might not want to bind the way you get the notification too tightly to only one notification protocol, be it HTTP or SQS.

So part A) we request something like a restore or archive listing
Part B) we listen for the SNS message via SQS (sketched below), or the user can build their own HTTP POST receiver
Part C) we act on the info from Part B, which I don't believe needs to be immediate.

That's what I see so far
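
A rough sketch of Part B using fog's existing SNS and SQS bindings, assuming the Glacier job was initiated with an SNS topic. Queue/topic wiring (such as the SQS queue policy) is omitted, and the exact response shapes are assumptions:

    sns = Fog::AWS::SNS.new(:aws_access_key_id => key, :aws_secret_access_key => secret)
    sqs = Fog::AWS::SQS.new(:aws_access_key_id => key, :aws_secret_access_key => secret)

    # deliver Glacier job notifications published to the topic into the queue
    sns.subscribe(topic_arn, queue_arn, 'sqs')

    # poll the queue; the SNS payload carries the Glacier JobId/StatusCode
    messages = []
    while messages.empty?
      messages = sqs.receive_message(queue_url).body['Message'] || []
      sleep 60 if messages.empty?
    end

This keeps fog out of the notification-delivery business: email or HTTP POST subscribers work the same way on the SNS side.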

Contributor

alhafoudh commented Aug 23, 2012

Don't think about the sync stuff. Let's just create an interface to manage the storing, retrieval, and status of archives/vaults.
Let the user deal with the SNS/email stuff. Fog already provides APIs for SNS. The Glacier fog extension should not include "polling" or any "waiting" for the archive retrieval.

Owner

geemus commented Aug 23, 2012

@pctj101 - yeah, I think I'm with @alhafoudh now that I understand more. Polling/waiting for hours doesn't make much sense; I hadn't realized what the timescale was (minutes of polling seems fine). I think you can safely skip that and focus on just setting up the interface.

pctj101 commented Aug 23, 2012

Well... crud... AWS already has a "Storage" module. Glacier is yet another "Storage" module. Er.... what do I do? Should it be Fog::Storage::AWS2? But then the directory structure and shared code would be all out of whack.

Or Fog::Archive::AWS?

Contributor

alhafoudh commented Aug 23, 2012

I vote for Fog::Archive::AWS

Contributor

ericchernuka commented Aug 23, 2012

I also vote Fog::Archive::AWS.


pctj101 commented Aug 23, 2012

Wow, thanks for the votes... I didn't do it that way, but only because I wanted to leverage some of the stuff in storage (since I have no idea what's going on in this code base)... But yeah, I'd love for it to be Fog::Archive::AWS (knowing nothing else).

Here's how far I got... time to sleep...

pctj101/fog@e0e3093

(As you can tell, I'm far from Mr. Wizard)

pctj101 commented Aug 23, 2012

Definitely could use help getting that AWS Signature Version 4 working... egads!

http://docs.amazonwebservices.com/general/latest/gr/signature-version-4.html
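
For reference, a minimal sketch of the v4 derivation those docs describe. Building the canonical request (sorted headers, encoded path and query) is elided, and none of this is fog API; it's just the raw HMAC chain that replaces the older signing schemes:

    require 'openssl'

    def sigv4_signature(secret_key, region, service, canonical_request, now = Time.now.utc)
      date      = now.strftime('%Y%m%d')
      timestamp = now.strftime('%Y%m%dT%H%M%SZ')
      scope     = "#{date}/#{region}/#{service}/aws4_request"

      string_to_sign = ['AWS4-HMAC-SHA256', timestamp, scope,
                        OpenSSL::Digest::SHA256.hexdigest(canonical_request)].join("\n")

      # derive the signing key through the chained HMACs the docs describe
      hmac      = lambda { |k, d| OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha256'), k, d) }
      k_date    = hmac.call('AWS4' + secret_key, date)
      k_region  = hmac.call(k_date, region)
      k_service = hmac.call(k_region, service)
      k_signing = hmac.call(k_service, 'aws4_request')

      OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('sha256'), k_signing, string_to_sign)
    end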

Owner

geemus commented Aug 23, 2012

Yeah, Archive seems good, thanks guys!

@pctj101 - I'll add some comments to your code. I can help out with signature version 4, but probably not in the next couple days (I'm leaving to travel to a conference in the next couple hours).

Owner

geemus commented Aug 23, 2012

Signatures, YUCK! I'll try to help out on that as soon as I can though.

Contributor

alhafoudh commented Aug 23, 2012

I was just looking how the signature is done at the moment.

@geemus should we update the signature method to the new version, or should we abstract the signature system to support more than one signature version across AWS services?

I can also look at the new signature.

Contributor

alhafoudh commented Aug 23, 2012

A version 4 signature implementation for Ruby already exists in the aws-sdk gem.

Take a look: https://github.com/amazonwebservices/aws-sdk-for-ruby/blob/master/lib/aws/core/signature/version_4.rb

I suppose we don't want to couple fog with aws-sdk gem, so we can borrow some pieces.

pctj101 commented Aug 24, 2012

Yeah I saw the aws-sdk gem's v4 code.

  • May not want to couple fog with it
  • Not sure if we can "borrow code" without affecting the license structure
  • May have to reformat the signed bits anyway, to the point that we may as well rewrite it

@alhafoudh Agree with your point on making v4 sigs available across services if that's the direction everything is headed.

Owner

geemus commented Aug 24, 2012

Yeah, I'd say we should try to make it generally available but not use aws-sdk. If you don't get to it, I can probably try later in the weekend or early next week.

pctj101 commented Aug 24, 2012

Just FYI, I won't have time to look at this for a couple days. If anyone else makes progress then please share :)

tomash commented Aug 25, 2012

I'd love to help with adding glacier support, how far did you get with it, @pctj101 ?

pctj101 commented Aug 26, 2012

@tomash - So far no further than the current commit. -- pctj101/fog@e0e3093

Have to take care of other things first before I can return to this. I'm sure we all look forward to anything you can do.

tomash commented Aug 26, 2012

all right, i hope to have something to show by thursday :)

Owner

geemus commented Aug 28, 2012

@tomash - great, let me know if you have questions.

I think for now we should probably just setup v4 for this in particular, rather than changing all signing. But we should start moving toward v4 where we can (just want to make sure we test individual services along the way).

Contributor

fcheung commented Sep 2, 2012

Bleurgh, I'm an idiot - I only checked pull requests & the mailing list when seeing if anyone else had started. Anyway, my efforts are at #1124

tomash commented Sep 3, 2012

awesome! (i didn't have time to work on it, unfortunately)

Owner

geemus commented Sep 3, 2012

@fcheung - no worries, easy to lose track of what's going on.

@pctj101 - looks like you may have gotten beaten to the punch, but if you could check out #1124 for us and make sure it looks good based on the stuff you'd already looked through, I'd appreciate it.

@tomash - no worries and I imagine there will be plenty of additional opportunities to participate.

pctj101 commented Sep 4, 2012

@geemus - Just came back to see what was up and it appears there are geniuses in the house. I'll definitely go check out what @fcheung did after I get some sleep :)

Owner

geemus commented Sep 4, 2012

@pctj101 - great, I appreciate the help!

pctj101 commented Sep 5, 2012

I guess my level of wizardry isn't as high... but here's what I did today running through some of @fcheung's work.

Because the responses aren't really parsed out (that I know of), the API isn't super easy to use yet.

Some items I've tested:

  • Listing Vault - No issue
  • Creating Vault - No issue
  • Deleting Vault - No issue
  • Getting Notification Configuration - No issue
  • Listing Jobs - No issue
  • Creating Archive - No issue
  • Initiating Archive List - Awaiting completion
  • Initiating Archive Retrieval - Awaiting completion

Awaiting completion of the Amazon jobs initiated above so I can continue testing.

If anyone is knowledgeable on parsing "Excon::Response" structures, there's plenty of "ease of use" work that could be done to parse the Hash and make the values accessible via Ruby function calls/accessors.

Open questions in my head:

  • What's the best way to load a giant file into memory to upload using "create_archive"? (Or is it better to just multipart upload? Yeah? :) )
  • For gigantic files, because you have to calculate the checksum of the entire archive, what's the best way to do it in Ruby without consuming a huge amount of RAM? (See the tree-hash sketch after this list.)
    test_file = '/test/test1.dat'
    body = File.open(test_file)
    description = "Test Archive"
    result = glacier.create_archive(thevault["VaultName"], body)
    # fails: NoMethodError: undefined method `bytesize' for #<File:/test/test1.dat>
  • get_job_output documents a parameter '# * response_block<~Proc> Proc to use for streaming the response', but I haven't figured out how to use it yet
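
On the checksum question: Glacier's checksum is a SHA-256 tree hash over 1 MiB chunks, so it can be computed streaming. A standalone sketch (not fog API):

    require 'digest'

    MEGABYTE = 1024 * 1024

    # stream the file in 1 MiB chunks, hash each chunk, then combine the
    # digests pairwise up the tree; peak memory is one chunk + digest list
    def tree_hash(io)
      digests = []
      while (chunk = io.read(MEGABYTE))
        digests << Digest::SHA256.digest(chunk)
      end
      return Digest::SHA256.hexdigest('') if digests.empty? # degenerate empty input
      while digests.size > 1
        digests = digests.each_slice(2).map { |a, b| b ? Digest::SHA256.digest(a + b) : a }
      end
      digests.first.unpack('H*').first
    end

    File.open('/test/test1.dat', 'rb') { |f| puts tree_hash(f) }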

pctj101 commented Sep 5, 2012

This is far from beautiful, but in case this boilerplate saves anyone any time....
https://gist.github.com/3638336

As usual, suggestions welcome to improve my hack and slash.

Contributor

fcheung commented Sep 5, 2012

Hi,

thanks for the feedback! The slow responses are quite a PITA for testing. It should probably indeed be using Fog::Storage.parse_data(data)

create_archive can't do multipart (by definition), you need to use the multipart apis (fog could hide that I suppose, but I like requests mapping 1-1 onto amazon apis)

Were you using the raw APIs or the model stuff? You should be able to do:

    vault = glacier.vaults.create(:id => 'vault') # or glacier.vaults.get('vault')
    vault.archives.create :body => File.open(test_file), :multipart_chunk_size => 1024*1024

You should be able to upload terabyte-sized files this way without ever having more than one chunk of the file in memory.

The response_block param is a regular Excon param. Again if you use the models, that's hidden from you:

job.get_output(:io => File.new(...)) will write the output to a file

I wrote a bunch of this up at http://www.spacevatican.org/2012/9/4/using-glacier-with-fog/

pctj101 commented Sep 5, 2012

Thanks for the pointers... that... really helps....

pctj101 commented Sep 5, 2012

@fcheung Yup, your 'vault.archives' commands are so much easier to work with than my "hard way".

I'm still waiting for jobs to finish (as I'm sure you're much more painfully aware than I).

At this point, I consider it to be usable for my purposes, which is uploading giant zip files. So I have no more "personal demands" off the top of my head...

Contributor

fcheung commented Sep 5, 2012

Great. Re hard way versus models, I think that's kind of the idea: the raw requests need good knowledge of the underlying APIs but are the most powerful (e.g. if you wanted to upload chunks in parallel, as in the sketch below), while the models hide that nastiness and make it all nice and easy to use.
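
For instance, parallel part uploads with the raw requests might look roughly like this. The request names follow #1124, but the exact signatures and the helpers here are assumptions, not the reviewed API:

    part_size = 1024 * 1024
    upload_id = glacier.initiate_multipart_upload('my-vault', part_size).body['UploadId']

    # upload each part in its own thread; offsets identify the parts
    threads = (0...part_count).map do |i|
      Thread.new do
        chunk = read_chunk(path, i, part_size)  # hypothetical helper returning part i
        glacier.upload_part('my-vault', upload_id, chunk, i * part_size)
      end
    end
    threads.each(&:join)

    glacier.complete_multipart_upload('my-vault', upload_id, total_size, full_tree_hash)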

pctj101 commented Sep 6, 2012

@fcheung - Just wanted to share an archive restoration observation with you... didn't dig into the code yet, but wanted to try a restore:

    restore_target = '/tmp/whatever'

    # works, data restored
    File.open(restore_target, "w") {|f| vault.jobs[0].get_output(:io => f) }

    # file created, 0 bytes
    vault.jobs[0].get_output(:io => File.open(restore_target, "w") )

Not sure why... but I'll dig into it later (or you might already know)

pctj101 commented Sep 6, 2012

BTW @fcheung on your blog:

    vault.archive.create :body => File.new('somefile'), :multipart_chunk_size => 1024*1024

Should be:

    vault.archives.create :body => File.new('somefile'), :multipart_chunk_size => 1024*1024

(.archives with an 's')

Contributor

fcheung commented Sep 6, 2012

Did you close the file afterwards? If the file isn't closed, the data may still be buffered.


pctj101 commented Sep 6, 2012

ooh... well looking at the code:

  • This works, probably because the file gets closed at the end of the block (guessing, but I think File.open {|f| ...} does close the file automatically):

    File.open(restore_target, "w") {|f| vault.jobs[0].get_output(:io => f) }

  • I have no idea how this one would get closed, since the File.open handle never gets assigned to a variable that could be closed:
    vault.jobs[0].get_output(:io => File.open(restore_target, "w") )
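
Given the buffering point above, a sketch of how the non-block form could be made safe: close the handle explicitly, flushing buffered output even if get_output raises.

    f = File.open(restore_target, "w")
    begin
      vault.jobs[0].get_output(:io => f)
    ensure
      f.close
    end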

Also, on an unrelated note, I notice that you cache the value of jobs and archives for the vault. I think performance-wise this makes a whole bunch of sense, but there doesn't seem to be a good way to "refresh" unless you grab the vaults from the glacier object again (glacier.vaults). I'm not familiar enough with fog to know if this is good/bad or has a convention for when to cache and when not to. But just mentioning it.

Contributor

fcheung commented Sep 6, 2012

I think I'm doing the standard fog thing. You should be able to call reload on any of the collections (vaults.reload)
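
For example (collection names as used earlier in the thread; reload behavior per fcheung's note):

    vault.jobs.reload      # re-fetch the job list from Glacier
    vault.archives.reload  # likewise for archives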

pctj101 commented Sep 6, 2012

There is indeed a reload command. Well then.... I think all questions answered yet again!

pctj101 commented Sep 8, 2012

@geemus - Looks like @fcheung really did a bang-up job of getting Glacier kicked off in a very usable way. Is it time to close the issue? Or do you usually leave it open for "related items"?

pctj101 commented Sep 8, 2012

Okay, since I'm more useful this way... I've created a very basic command-line utility to upload a file to Glacier and log it to DynamoDB. Because Archive IDs are rather "temporary" until you can run an inventory, it's often helpful to have a good log of what you've uploaded, along with whatever metadata doesn't fit into a typical "Glacier Archive".

Take a look and see if it's useful/hackable for you if you're looking for example code or something like it.

https://github.com/pctj101/icypop

pctj101 commented Sep 9, 2012

@fcheung - Have you had any trouble downloading the results of the InventoryRetrieval? (Archive Retrieval no issue)

After the InventoryRetrieval job returns StatusCode = Succeeded, then:

irb(main):835:0> job = vault.jobs.get("THEJOBID")
=>   <Fog::AWS::Glacier::Job
    id="THEJOBID",
    action="InventoryRetrieval",
    archive_id=nil,
    archive_size=0,
    completed=true,
    completed_at=2012-09-09 11:47:17 UTC,
    created_at=2012-09-09 07:28:43 UTC,
    inventory_size=1422,
    description="2012-09-09 17:28:42 +1000 Inventory Request",
    tree_hash=nil,
    sns_topic="arn:aws:sns:ap-northeast-1:###:---",
    status_code="Succeeded",
    status_message="Succeeded",
    vault_arn="arn:aws:glacier:ap-northeast-1:###:vaults/----",
    format=nil,
    type=nil
  >


# Try #1 - SocketError
irb(main):813:0> File.open('myarchive') do |f|
irb(main):814:1*   job.get_output :io => f
irb(main):815:1> end
Excon::Errors::SocketError: not opened for writing (IOError)
from /Users/dummy/glacier_test/fog/lib/fog/aws/models/glacier/job.rb:51:in `write'
from /Users/dummy/glacier_test/fog/lib/fog/aws/models/glacier/job.rb:51:in `block in get_output'
from /Users/dummy/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/excon-0.16.1/lib/excon/response.rb:49:in `call'
from /Users/dummy/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/excon-0.16.1/lib/excon/response.rb:49:in `parse'
from /Users/dummy/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/excon-0.16.1/lib/excon/connection.rb:255:in `request_kernel'
from /Users/dummy/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/excon-0.16.1/lib/excon/connection.rb:101:in `request'
from /Users/dummy/glacier_test/fog/lib/fog/core/connection.rb:20:in `request'
from /Users/dummy/glacier_test/fog/lib/fog/aws/glacier.rb:163:in `request'
from /Users/dummy/glacier_test/fog/lib/fog/aws/requests/glacier/get_job_output.rb:35:in `get_job_output'
from /Users/dummy/glacier_test/fog/lib/fog/aws/models/glacier/job.rb:54:in `get_output'
from (irb):814:in `block in irb_binding'
from (irb):813:in `open'
from (irb):813
from /Users/dummy/.rbenv/versions/1.9.2-p290/bin/irb:12:in `<main>'



# Try #2 - MultiJson::DecodeError
irb(main):820:0> restore_target = "abcde"
=> "abcde"
irb(main):821:0>     File.open(restore_target, "w") {|f| job.get_output(:io => f) }
MultiJson::DecodeError: A JSON text must at least contain two octets!
from /Users/dummy/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/json-1.7.5/lib/json/common.rb:155:in `initialize'
from /Users/dummy/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/json-1.7.5/lib/json/common.rb:155:in `new'
from /Users/dummy/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/json-1.7.5/lib/json/common.rb:155:in `parse'
from /Users/dummy/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/multi_json-1.3.6/lib/multi_json/adapters/json_common.rb:7:in `load'
from /Users/dummy/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/multi_json-1.3.6/lib/multi_json.rb:93:in `load'
from /Users/dummy/glacier_test/fog/lib/fog/core/json.rb:36:in `decode'
from /Users/dummy/glacier_test/fog/lib/fog/aws/glacier.rb:165:in `request'
from /Users/dummy/glacier_test/fog/lib/fog/aws/requests/glacier/get_job_output.rb:35:in `get_job_output'
from /Users/dummy/glacier_test/fog/lib/fog/aws/models/glacier/job.rb:54:in `get_output'
from (irb):821:in `block in irb_binding'
from (irb):821:in `open'
from (irb):821
from /Users/dummy/.rbenv/versions/1.9.2-p290/bin/irb:12:in `<main>'

pctj101 commented Sep 9, 2012

One thing I notice is that...when I try to download an InventoryRetrieval job output...

/lib/fog/aws/requests/glacier/get_job_output.rb

This call:

    request(options.merge(
      :expects    => [200, 206],
      :idempotent => true,
      :headers    => headers,
      :method     => :get,
      :path       => path
    ))

To this path:
get_job_output: Path: /-/vaults/myvaultname/jobs/jobid/output

This returns nothing in the body... which is a bit strange.

Contributor

fcheung commented Sep 9, 2012

There's nothing in the body because it's been streamed into the IO object provided. We then try to load that JSON (because the content type is application/json). Depending on your MultiJson backend that is either OK (e.g. yajl-ruby) or raises an error (json gem). Should be fixed as of 1ca73fc
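
A sketch of the guard being described (assumed shape; see 1ca73fc for the actual fix): when the caller supplied an IO/block, the body has already been streamed, so the JSON decode should be skipped.

    response = @connection.request(params, &block)
    unless params[:response_block] # body already consumed by the caller's IO
      response.body = Fog::JSON.decode(response.body) unless response.body.empty?
    end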

pctj101 commented Sep 9, 2012

@fcheung You are one heck of a wizard. I'm going to try out the code you mention above.

pctj101 commented Sep 9, 2012

@fcheung - Works like a charm!

Owner

geemus commented Sep 10, 2012

@fcheung - thanks again, you rock!

@pctj101 - icypop looks pretty cool, thanks for helping out on this and working on that. I'll go ahead and close this since it seems like we've got our bases covered now.

geemus closed this Sep 10, 2012
