
S3 datasets #147

Merged
merged 3 commits on Aug 31, 2015

Conversation

danfruehauf
Contributor

This is an S3 crawlable dataset implementation; it is rather concise and gives thredds the capability of browsing S3 buckets. At the moment it will cache files for up to 1 minute (hardcoded, but this can be modified).

A configuration for a dataset would look like:

  <datasetScan name="Amazon S3 Dataset" ID="testS3" path="IMOS" location="s3://bucket-name/path">
    <metadata inherited="true">
      <serviceName>all</serviceName>
      <dataType>Grid</dataType>
    </metadata>
    <addDatasetSize/>
    <crawlableDatasetImpl className="thredds.crawlabledataset.CrawlableDatasetAmazonS3"/>
  </datasetScan>

At IMOS, we are about to migrate large portions of our data to S3. Thredds is one of our main front-facing applications, and we have decided to implement proper S3 support for it rather than use one of the available (and experimental) filesystem interfaces to S3.
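
A minimal sketch of the 1-minute file caching described above, assuming Ehcache 2.x; the cache name, size limit, and keys are hypothetical and not taken from the PR:

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;
import net.sf.ehcache.config.CacheConfiguration;

import java.io.File;

public class S3FileCacheSketch {
    public static void main(String[] args) {
        // Hypothetical cache name and heap limit; the 60-second TTL mirrors the
        // "cache files for up to 1 minute" behaviour described above.
        CacheManager manager = CacheManager.create();
        Cache cache = new Cache(new CacheConfiguration("s3Files", 1000).timeToLiveSeconds(60));
        manager.addCache(cache);

        // Remember where a downloaded object lives and look it up within the TTL window.
        cache.put(new Element("s3://bucket-name/path/file.nc", new File("/tmp/file.nc")));
        Element hit = cache.get("s3://bucket-name/path/file.nc");
        System.out.println(hit != null ? hit.getObjectValue() : "expired or absent");

        manager.shutdown();
    }
}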

@danfruehauf
Contributor Author

And as another side note: the master branch doesn't build, and neither does v4.5.2. Both complain about com.coverity.security.Escape.

@danfruehauf force-pushed the s3_datasets branch 4 times, most recently from 16d002a to 89afcc1 (July 24, 2015 07:34)
@danfruehauf
Contributor Author

Scratch that, master does build. It was my messy Gradle environment.

@danfruehauf force-pushed the s3_datasets branch 3 times, most recently from c88bab8 to 1b0d872 (July 27, 2015 05:22)
@cwardgar
Contributor

Hi David,

Would you mind providing unit tests for CrawlableDatasetAmazonS3?

@danfruehauf
Contributor Author

@cwardgar It's Dan, and yes, I could. Mind you, CrawlableDatasetFile didn't really contain proper tests and CrawlableDatasetDods didn't have tests at all, so I thought I might get away without them.

What about the rest of the PR? Is there anything you don't like that stands out?

And another thing - the tests are failing, but I don't think it's because of those changes.

@cwardgar
Contributor

Oops, sorry Dan. Yeah, some of the CrawlableDataset stuff is pretty old (circa 2005) and our test coverage of it isn't the best. It's an ongoing effort to improve it.

Mostly, I'm concerned about performance, particularly getFile(). It overrides CrawlableDatasetFile.getFile() and I think a lot of users of that method assume that the file is local instead of in the cloud. And I know for a fact that getFile() gets called in a few spots where we don't actually need the file's content; just its metadata. That's a failing of our code, probably, but it could have serious performance implications nonetheless.
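
To illustrate the concern, a hypothetical sketch; the method names follow the CrawlableDataset style discussed here but are assumptions, not verified THREDDS signatures:

import java.io.File;
import thredds.crawlabledataset.CrawlableDataset;
import thredds.crawlabledataset.CrawlableDatasetFile;

public class GetFileConcernSketch {

    // Anti-pattern for an S3-backed dataset: getFile() has to materialize the
    // whole object on local disk before the File can report its size.
    static long sizeViaFile(CrawlableDatasetFile ds) {
        File local = ds.getFile();   // may trigger a full S3 download
        return local.length();
    }

    // Cheaper: ask the dataset itself, which an S3 implementation could answer
    // from the object listing it already holds (assumed metadata accessor).
    static long sizeViaMetadata(CrawlableDataset ds) {
        return ds.length();
    }
}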

TestCalendarDateUnit.testMinMax() is failing on your branch, but not on Unidata/thredds:master. Maybe it's because you based yours on an old version (4.5.1?)?

@danfruehauf
Contributor Author

@cwardgar It is rebased against master (last commit by you at Tue Jul 21 17:24:45 2015 -0600).

And yes, I'm well aware of that; this is why I use ehcache for the files (and clean them up). You can witness the performance at http://thredds-systest.aodn.org.au

A note about that: the thredds server is not in AWS next to the data, but it still works alright. I doubt you can get much better performance out of it, to be honest.

I'm working on writing some tests now; they will mostly be for the utility functions, as it is a bit difficult to test the S3 interaction.

 * crawl S3 hierarchies
 * download files
 * avoid double downloading files
 * cache directory listing
 * remove files from disk after they are expired
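
A minimal sketch of the kind of utility-level unit test mentioned above; the helper methods are hypothetical, purely to show the shape of such a test:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class S3UriHelperTest {

    // Hypothetical helpers, not the PR's actual utility functions.
    static String bucket(String s3Uri) {
        return s3Uri.substring("s3://".length()).split("/", 2)[0];
    }

    static String key(String s3Uri) {
        String[] parts = s3Uri.substring("s3://".length()).split("/", 2);
        return parts.length > 1 ? parts[1] : "";
    }

    @Test
    public void splitsBucketAndKey() {
        assertEquals("imos-test-data-1", bucket("s3://imos-test-data-1/opendap/file.nc"));
        assertEquals("opendap/file.nc", key("s3://imos-test-data-1/opendap/file.nc"));
    }
}
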
@danfruehauf
Contributor Author

@cwardgar I rebased against latest master and also added tests. If you think there should be more test coverage, please give me some pointers. Generally speaking, the heavy lifting of the code is just interacting with the S3 library - which will be difficult to test and also not that beneficial.

@cwardgar
Contributor

@danfruehauf Oh cool, I didn't realize that you had it running on a server. So to enable it, you just have something like

<crawlableDatasetImpl className="thredds.crawlabledataset.CrawlableDatasetAmazonS3"/>

in your catalog somewhere?

@cwardgar
Contributor

Also, would you mind sharing your catalog? You said that your server is not next to your data, so it sounds like I can test on my machine without making any changes.

@danfruehauf
Contributor Author

In the body of the PR there is a snippet of the catalog, but here it is again:

<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         name="Unidata THREDDS Data Server"
         version="1.0.2">
  <service name="regGriddedServices" serviceType="Compound" base="">
    <service name="dapService" serviceType="OPENDAP" base="/thredds/dodsC/"/>
    <service name="httpService" serviceType="HTTPServer" base="/thredds/fileServer/"/>
  </service>
  <datasetScan name="S3 IMOS test data 1" ID="IMOS" path="IMOS" location="s3://imos-test-data-1/opendap">
    <metadata inherited="true">
      <serviceName>regGriddedServices</serviceName>
      <dataType>Grid</dataType>
    </metadata>
    <filter>
      <include wildcard="*.nc"/>
      <include wildcard="*.nc.gz"/>
      <include wildcard="*.ncml"/>
    </filter>
    <crawlableDatasetImpl className="thredds.crawlabledataset.CrawlableDatasetAmazonS3"/>
  </datasetScan>
</catalog>

This bucket is publicly available, so you should be able to test it from your machine too. Please note this is a test bucket; it will probably get deleted at some point.

@danfruehauf
Contributor Author

@cwardgar Another thing I haven't mentioned: the bucket is in the Australia (Sydney) region, so it might be a bit slower for you if you are in Boulder :)

@cwardgar
Contributor

@danfruehauf This is definitely a feature we're interested in and — as implemented — it doesn't impact the rest of THREDDS at all really.

However, THREDDS v5.0.0 is on the horizon, and the crawlableDatasetImpl functionality that CrawlableDatasetAmazonS3 depends on is gone there. Before we can merge this in, we need to figure out how best to integrate it into v5.0.0. I'll leave this pull request open and keep you posted.

@danfruehauf
Contributor Author

@cwardgar That's good to know. As you said, it doesn't impact the rest of the thredds code (except for maybe those minor added dependencies).

So there's another branch that's the actual development branch? I thought master was what I should base the work against. Given that, I'd be happy to perform the required work for thredds v5.0.0. In the end there isn't much happening in that interface implementation; the two main core functions, which will probably stay the same, are:

  • listS3Dir
  • getS3File
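
A rough sketch of what those two operations typically look like against the AWS SDK for Java v1; the class and variable names are placeholders, not the PR's actual code:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class S3OpsSketch {
    private final AmazonS3 s3 = new AmazonS3Client();

    // "listS3Dir": list the immediate children of a prefix, using "/" as the
    // delimiter so S3 reports sub-"directories" as common prefixes.
    List<String> listS3Dir(String bucket, String prefix) {
        ObjectListing listing = s3.listObjects(new ListObjectsRequest()
                .withBucketName(bucket)
                .withPrefix(prefix)
                .withDelimiter("/"));
        List<String> children = new ArrayList<>(listing.getCommonPrefixes());
        for (S3ObjectSummary summary : listing.getObjectSummaries()) {
            children.add(summary.getKey());
        }
        return children;
    }

    // "getS3File": download an object to a local file so the rest of thredds
    // can treat it like any other file on disk.
    File getS3File(String bucket, String key, File dest) {
        s3.getObject(new GetObjectRequest(bucket, key), dest);
        return dest;
    }
}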

Please keep me posted. We'd be happy to see this feature both in thredds v5.0.0 and on master (v4.6.x) if possible.

Further down the track I think there should be a few more improvements, namely:

  • Allow AWS credentials from config file
  • Allow setting S3 endpoint from config file
  • Control caching from config file (ehcache)

And did the config I provided work for you? I always say that "seeing is believing".

@cwardgar
Contributor

@danfruehauf Hi Dan. master is the "near-term" development branch, currently 4.6.3-SNAPSHOT. We also have a branch called 5.0.0, which we've already put a lot of work into.

We're not sure yet what the best approach is for adding the functionality to 5.0, but we'll be talking about it at our meeting tomorrow.

And yes, the catalog worked for me! There was a noticeable delay in directory traversal, but not a prohibitively long one (and maybe that's mostly due to the bucket being physically far from me). I'd be curious to test it out with larger files.

@danfruehauf
Contributor Author

@cwardgar Thanks for the info, that's good news.

Just so you know, I destroyed our imos-test-data-1 bucket, but you can still test with imos-test-data-2 if you need to:

<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         name="Unidata THREDDS Data Server"
         version="1.0.2">
  <service name="regGriddedServices" serviceType="Compound" base="">
    <service name="dapService" serviceType="OPENDAP" base="/thredds/dodsC/"/>
    <service name="httpService" serviceType="HTTPServer" base="/thredds/fileServer/"/>
  </service>
  <datasetScan name="S3 IMOS test data 2" ID="IMOS" path="IMOS" location="s3://imos-test-data-2/opendap">
    <metadata inherited="true">
      <serviceName>regGriddedServices</serviceName>
      <dataType>Grid</dataType>
    </metadata>
    <filter>
      <include wildcard="*.nc"/>
      <include wildcard="*.nc.gz"/>
      <include wildcard="*.ncml"/>
    </filter>
    <crawlableDatasetImpl className="thredds.crawlabledataset.CrawlableDatasetAmazonS3"/>
  </datasetScan>
</catalog>

It turns out imos-test-data-1 was accidentally created in US-east, so it was actually closer to you than to us. imos-test-data-2 really is in the Sydney region, so expect it to be even slower.

As for us, testing with imos-test-data-2 gives us performance indistinguishable from filesystem access, even though our thredds machine is not on AWS.

We are hosting some gridded satellite data on our thredds catalog; I will try experimenting with it (roughly 70 MB of data per file) and let you know how it goes.

@danfruehauf
Contributor Author

@cwardgar Now that the meeting has happened, is there any news regarding this PR and functionality?

@cwardgar
Contributor

@danfruehauf The plan is to merge it into master and release it as part of 4.6.3, either tomorrow or next week. I'd just like to take another pass over it and possibly improve the testing.

@danfruehauf
Contributor Author

@cwardgar That's great news!

I think the implementation is generally pretty concise and does most of the heavy lifting. I tried to follow the coding conventions of this project as much as possible. One thing I'm not very familiar with is the ehcache part; it would probably be better to drive those retention-period variables from an external config file.

@cwardgar
Contributor

@danfruehauf @DennisHeimbigner

We noticed that you're using URLs of the form s3://<bucket>/<object>, rather than http(s)://<bucket>.s3.amazonaws.com/<object>. Did you just do that for brevity, or is the s3: protocol/prefix an established practice when working with AWS? Our chief concern is that it limits our ability to specify "http:" versus "https:".

Also, did you destroy both of your test buckets? I'm getting "The specified bucket does not exist" at http://imos-test-data-2.s3.amazonaws.com/opendap

@cwardgar
Contributor

Before I merge, I'm going to try to improve the test coverage. That'll require heavy mocking, so I'm going to do it with Spock, which is my new favorite testing framework. I may rewrite some of your tests to use it. You can follow along in our s3_datasets branch, if you want.

@danfruehauf
Contributor Author

@cwardgar By default the S3 client will use HTTPS. I actually disable it explicitly in https://github.com/Unidata/thredds/pull/147/files#diff-4f74c3ba230ed262459960022533c90dR250

s3://bucket/object is the standard way of referring to objects on S3 as far as I know. We cannot use an http(s):// type URI because we need some way of hinting that this is actually an S3 URI and not plain HTTP/S. And yes, this is the proper standard way as far as I am concerned.
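
As a rough illustration of why the s3:// form is convenient as an internal hint, a sketch with placeholder values (not the PR's code; it assumes a bucket name that parses as a valid URI host, and uses the AWS SDK for Java v1):

import java.net.URI;

import com.amazonaws.ClientConfiguration;
import com.amazonaws.Protocol;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class S3UriSketch {
    public static void main(String[] args) {
        // The s3:// scheme flags "this is an S3 object" while still splitting
        // cleanly into bucket and key. The key below is a placeholder.
        URI uri = URI.create("s3://imos-data/IMOS/some/file.nc");
        String bucket = uri.getHost();            // "imos-data"
        String key = uri.getPath().substring(1);  // "IMOS/some/file.nc"
        System.out.println(bucket + " / " + key);

        // The SDK client defaults to HTTPS; plain HTTP is an explicit
        // ClientConfiguration choice, as mentioned above.
        ClientConfiguration config = new ClientConfiguration().withProtocol(Protocol.HTTP);
        AmazonS3 s3 = new AmazonS3Client(config);
        System.out.println(s3.getClass().getSimpleName());
    }
}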

Sorry about deleting the buckets. We now have our "proper" bucket, which will become production; it is under s3://imos-data/IMOS, so feel free to use that URI for testing. To browse the bucket using HTTP, you can go to http://imos-data.s3.amazonaws.com/index.html

In terms of test coverage, go for it. I am familiar with Spock, but you will probably be much faster than me at implementing the tests.

Let me know if I can assist with anything else.

@cwardgar
Contributor

Merged: dcf26fe

THREDDS-in-the-cloud is a high priority for us right now, so I spent a significant amount of time polishing this feature. Big changes:

  • Broke the old CrawlableDatasetAmazonS3 into 5 separate classes. The idea here was to simplify the code, ease unit testing, and make CrawlableDatasetAmazonS3 as light as possible, since it won't work in 5.0.0.
  • Switched from Ehcache to Guava Cache. Guava caches are just flat-out superior (a minimal sketch follows this list).
  • Wrote a ton of tests using the Spock framework. Spock is just awesome, especially for mocking and stubbing, which I did a lot of. Coverage is about 80-90% on average, but I lack integration tests. We don't have a persistent S3 bucket here at Unidata that we can place test datasets in.
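
A minimal sketch of the Guava-style cache mentioned in the second bullet; the size bound, retention period, and names are assumptions, not the values in the merged code:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;

import java.io.File;
import java.util.concurrent.TimeUnit;

public class GuavaCacheSketch {
    public static void main(String[] args) {
        // Delete the local copy when an entry is evicted, so downloaded S3
        // objects don't accumulate on disk.
        RemovalListener<String, File> deleteOnEvict =
                notification -> notification.getValue().delete();

        Cache<String, File> s3FileCache = CacheBuilder.newBuilder()
                .maximumSize(1000)                        // hypothetical bound on cached objects
                .expireAfterAccess(1, TimeUnit.MINUTES)   // hypothetical retention period
                .removalListener(deleteOnEvict)
                .build();

        s3FileCache.put("s3://bucket-name/path/file.nc", new File("/tmp/file.nc"));
        System.out.println(s3FileCache.getIfPresent("s3://bucket-name/path/file.nc"));
    }
}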

Performance is adequate, but only when navigating directories with relatively few entries. Otherwise, an excessive number of listObjects requests are made that really slow down navigation (you can see this in threddsServlet.log). This can be fixed, but not without modifying the upstream machinery that works with CrawlableDatasets, which I am loath to do on 4.6. I think in 5.0, we can get this right.

Thanks for the contribution, Dan!

@danfruehauf
Contributor Author

@cwardgar Awesome news. That's pretty good. I think we will switch promptly to the upstream branch then!

@danfruehauf
Contributor Author

@cwardgar I've been experimenting with your new implementation, trying to access regular folders we have (100-2000 files). The new implementation you have provided is too slow for our needs.

My original implementation displays 1800 files in a folder within a second or so; your new implementation takes probably minutes, if not longer.

Looks like we will be waiting for 5.0 to actually use that feature.

@cwardgar
Contributor

@danfruehauf I was able to add a second level of caching to CrawlableDatasetAmazonS3 that dramatically improves navigation speed: 7875515. The performance is on par with the JavaScript widget that you have in your bucket. If you want, you can pull in the change from master and give it a shot.

The way in which upstream clients work with CrawlableDatasets is still problematic: unnecessary listObjects() requests are made on child directories that we haven't even changed into yet. It just turns out that this wasn't the main cause of the navigation slowness, as I had thought.

Also, there's #186, which is an issue that neither of our implementations addresses. It doesn't look like I'll have any time soon to work on it, but I'd welcome any pull requests you might make.

@rsignell-usgs
Contributor

@danfruehauf, are you guys still using this approach?

@cwardgar
Contributor

In case someone stumbles upon this thread in the future (or someone wants to direct-link to this comment), I thought I'd give a brief description of the current state of S3 dataset support in THREDDS:

First off, I want to stress that I consider CrawlableDatasetAmazonS3 experimental. By no means is it ready for production. Aside from #186 and general concerns about caching, the big issue is that THREDDS assumes that its CrawlableDatasets generate their directory listings from a local file system. When that's not the case, performance problems abound.

Secondly, the feature is only available in the 4.6.X line, starting at 4.6.3. As of 2018-03-27, the feature is NOT available in 5.X. That could change in the future, but I can guarantee that it won't be included in 5.0. Basically, the CrawlableDataset class no longer exists in 5.X, so CrawlableDatasetAmazonS3 would have to be altered to suit whatever the new machinery is. I have not looked into it, so I have no idea how much work that would entail.

CrawlableDatasetAmazonS3 works by downloading the entire NetCDF file from S3 and then opening it as a local dataset. This has some pluses and minuses:

Pluses:

  • Easiest to implement by far. NetCDF-Java generally assumes that the datasets it's opening are local. Support for remote datasets is limited (just OPeNDAP?).
  • Follows Amazon's recommendation to NOT attempt ranged requests of S3 objects. Apparently the performance is poor, but I haven't actually benchmarked it.

Minuses:

  • Must download entire file for any size request.
  • Must manage downloaded files. Where to put them? When to cache? When to delete?

Our implementation hasn't changed substantially since 2015. It currently isn't a high priority for us.
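
A rough sketch of the download-then-open-locally pattern described above, using the AWS SDK for Java v1 plus NetCDF-Java; the bucket, key, and temp-file handling are placeholders, not the actual THREDDS code:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;

import ucar.nc2.NetcdfFile;

import java.io.File;
import java.io.IOException;

public class DownloadThenOpenSketch {
    public static void main(String[] args) throws IOException {
        AmazonS3 s3 = new AmazonS3Client();

        // 1. Pull the entire object down to local disk (the first "minus":
        //    even a small read pays for the whole download).
        File local = File.createTempFile("s3-dataset-", ".nc");
        s3.getObject(new GetObjectRequest("imos-data", "IMOS/some/file.nc"), local);

        // 2. Open it with NetCDF-Java exactly as if it had always been local.
        NetcdfFile ncfile = NetcdfFile.open(local.getPath());
        try {
            System.out.println(ncfile.getVariables());
        } finally {
            ncfile.close();
            local.delete();   // the caching/cleanup question from the second "minus"
        }
    }
}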

@rschmunk
Contributor

And GF the "entire dataset" is "large"
