Conversation
And as another side note: the master branch doesn't build. Neither does v4.5.2. Both complain about …
Scratch that, master does build. It was my messy Gradle environment.
Hi David, would you mind providing unit tests for CrawlableDatasetAmazonS3?
@cwardgar It's Dan, and yes, I could. Mind you, the … What about the rest of the PR? Is there anything that stands out that you don't like? And another thing: the tests are failing, but I don't think it's because of these changes.
Oops, sorry Dan. Yeah, some of the … Mostly, I'm concerned about performance, particularly …
@cwardgar It is rebased against master (last commit by you at …). And yes, I'm well aware of that; this is why I use Ehcache for the files (and it cleans them up). You can see the performance at http://thredds-systest.aodn.org.au. A note about that: the THREDDS server is not in AWS next to the data, but it still works all right. I doubt you can get much better performance out of it, to be honest. I'm working on writing some tests now; they will mostly cover the utility functions, as the S3 interaction is a bit difficult to test.
* crawl S3 hierarchies
* download files
* avoid downloading files twice
* cache directory listings
* remove files from disk after they have expired
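The caching behaviour described in the list above (cache directory listings, expire cached data after a fixed period) can be sketched with plain JDK classes. This is a minimal illustration only: the PR itself uses Ehcache for this, and every class and method name below is hypothetical, not the PR's actual API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: a time-limited cache for S3 directory listings,
// mimicking the expiry behaviour the PR delegates to Ehcache.
// Names are hypothetical, not taken from the PR code.
public class ListingCache {
    private static class Entry {
        final List<String> listing;
        final long storedAtMillis;
        Entry(List<String> listing, long storedAtMillis) {
            this.listing = listing;
            this.storedAtMillis = storedAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public ListingCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns the cached listing for an S3 "directory" key, or null if
    // it is absent or older than the TTL (expired entries are evicted).
    public List<String> get(String s3Key, long nowMillis) {
        Entry e = cache.get(s3Key);
        if (e == null) return null;
        if (nowMillis - e.storedAtMillis > ttlMillis) {
            cache.remove(s3Key);  // expired: drop it so the caller re-lists S3
            return null;
        }
        return e.listing;
    }

    public void put(String s3Key, List<String> listing, long nowMillis) {
        cache.put(s3Key, new Entry(listing, nowMillis));
    }
}
```

Passing the clock in explicitly (`nowMillis`) keeps the expiry logic easy to unit-test without sleeping, which matters given the later discussion in this thread about improving test coverage.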
@cwardgar I rebased against the latest master and also added tests. If you think there should be more test coverage, please give me some pointers. Generally speaking, the heavy lifting of the code is just interacting with the S3 library, which is difficult to test and also not that beneficial to mock.
@danfruehauf Oh cool, I didn't realize that you had it in a running server. So to enable it, you just have something like … in your catalog somewhere?
Also, would you mind sharing your catalog? You said that your server is not next to your data, so it sounds like I can test on my machine without making any changes.
In the body of the PR there is a snippet of the catalog, but here it is again: … This bucket is publicly available, so you should be able to test from your machine too. Please note this is a test bucket; it will probably get deleted at some point.
@cwardgar Another thing I haven't mentioned: the bucket is in the Australia (Sydney) region, so it might be a bit slower for you if you are in Boulder :)
@danfruehauf This is definitely a feature we're interested in and, as implemented, it doesn't really impact the rest of THREDDS at all. However, THREDDS v5.0.0 is on the horizon and the …
@cwardgar That's good to know. As you said, it doesn't impact the rest of the THREDDS code (except maybe for those minor added dependencies). So there's another branch that's the actual development branch? I thought …
Please keep me posted. We'd be happy to see that feature in THREDDS v5.0.0, but also on master (v4.6.x) if possible. Further down the track I think there should be a few more improvements, namely: …
And did the config I provided work for you? I always say that "seeing is believing".
@danfruehauf Hi Dan. We're not sure yet what the best approach is for adding the functionality to 5.0, but we'll be talking about it at our meeting tomorrow. And yes, the catalog worked for me! There was a noticeable delay in directory traversal, but not prohibitively long (and maybe that's mostly due to the bucket being physically far from me). I'd be curious to test it out with larger files.
@cwardgar Thanks for the info, that's good news. Just so you know, I destroyed our …
So … As for us, testing with … We are hosting some gridded satellite data on our THREDDS catalog; I will try experimenting with it (each file is roughly 70 MB) and let you know how it goes.
@cwardgar After the meeting, any news regarding that PR and its functionality?
@danfruehauf The plan is to merge it into master and release it as part of 4.6.3, either tomorrow or next week. I'd just like to take another pass over it and possibly improve the testing.
@cwardgar That's great news! I think the implementation is generally pretty concise and does most of the heavy lifting. I tried to follow the coding conventions of this project as much as possible. Perhaps the one part I'm not very familiar with is the Ehcache piece. It would probably be better to drive those retention-period variables from an external config file.
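Externalizing the retention periods as suggested above could look something like the Ehcache 2.x `ehcache.xml` fragment below. This is a sketch only: the cache names, entry limits, and TTL values are assumptions for illustration, not the PR's actual configuration.

```xml
<!-- Hypothetical ehcache.xml fragment: cache names and values are
     illustrative, not taken from the PR. -->
<ehcache>
  <!-- Cached S3 directory listings, expired after 60 seconds -->
  <cache name="s3DirectoryListings"
         maxEntriesLocalHeap="1000"
         timeToLiveSeconds="60"/>
  <!-- Downloaded S3 files kept on local disk, expired after 1 hour -->
  <cache name="s3DownloadedFiles"
         maxEntriesLocalHeap="10000"
         timeToLiveSeconds="3600"/>
</ehcache>
```

With something like this in place, operators could tune the hardcoded one-minute retention mentioned in the PR description without rebuilding THREDDS.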
@danfruehauf @DennisHeimbigner We noticed that you're using URLs of the form …
Also, did you destroy both of your test buckets? I'm getting "The specified bucket does not exist" at http://imos-test-data-2.s3.amazonaws.com/opendap
Before I merge, I'm going to try to improve the test coverage. That'll require heavy mocking, so I'm going to do it with Spock, which is my new favorite testing framework. I may rewrite some of your tests to use it. You can follow along in our s3_datasets branch, if you want. |
@cwardgar By default the S3 client will use HTTPS. I actually disable it explicitly in https://github.com/Unidata/thredds/pull/147/files#diff-4f74c3ba230ed262459960022533c90dR250
Sorry about deleting the buckets; we now have our "proper" bucket, which will become production, under … In terms of test coverage, go for it. I am familiar with Spock, but you will probably be much faster than me at implementing the tests. Let me know if I can assist with anything else.
Merged: dcf26fe. THREDDS-in-the-cloud is a high priority for us right now, so I spent a significant amount of time polishing this feature. Big changes: …
Performance is adequate, but only when navigating directories with relatively few entries; otherwise, an excessive number of … Thanks for the contribution, Dan!
@cwardgar Awesome news. That's pretty good. I think we will switch to the upstream branch promptly then!
@cwardgar I've been experimenting with your new implementation, trying to access regular folders we have (100-2000 files), and it is too slow for our needs. My original implementation displays 1800 files in a folder within a second or so; the new one will probably take minutes, if not longer. Looks like we will be waiting for 5.0 to actually use this feature.
@danfruehauf I was able to add a second level of caching to … The way in which upstream clients work with …
Also, there's #186, an issue that neither of our implementations addresses. It doesn't look like I'll have time to work on it any time soon, but I'd welcome any pull requests you might make.
@danfruehauf , are you guys still using this approach? |
In case someone stumbles upon this thread in the future (or wants to direct-link to this comment), I thought I'd give a brief description of the current state of S3 dataset support in THREDDS. First off, I want to stress that I consider … Secondly, the feature is only available in the …
Pluses: …
Minuses: …
Our implementation hasn't changed substantially since 2015. It currently isn't a high priority for us. |
And if the "entire dataset" is "large" …
This is an S3 CrawlableDataset implementation; it is rather concise and gives THREDDS the capability of browsing S3 buckets. At the moment it will cache files for up to 1 minute (hardcoded, but this can be modified).
A configuration for a dataset would look like: …
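The original catalog snippet did not survive extraction, so here is a reconstruction of what such a configuration might look like, based only on the class name mentioned in this thread. The bucket name, `path`, and `serviceName` values are placeholders, not the PR's verbatim example:

```xml
<!-- Hypothetical example: bucket name, path, and service name are
     placeholders, not the actual snippet from this PR. -->
<datasetScan name="IMOS S3 data" path="s3-data"
             location="s3://imos-data/opendap/"
             serviceName="all">
  <crawlableDatasetImpl
      className="thredds.crawlabledataset.s3.CrawlableDatasetAmazonS3"/>
</datasetScan>
```

The `crawlableDatasetImpl` element is the standard THREDDS mechanism for plugging a custom CrawlableDataset class into a `datasetScan`.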
At IMOS, we are about to migrate large portions of our data to S3. THREDDS is one of our main front-facing applications, and we have decided to implement proper S3 support for it rather than use one of the available (and experimental) filesystem interfaces to S3.