Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure blob objectstore #2621

Merged
merged 26 commits into from Jul 28, 2016

Conversation

Projects
None yet
7 participants
@zfrenchee
Copy link
Contributor

commented Jul 18, 2016

I've been speaking with @bgruening about this. For our purposes we need to use Azure to store our data, and this is a first step in that direction. This PR provides a new backend for the galaxy ObjectStore: Azure Blob Storage. The adapter was heavily modeled off of the S3 adapter in the same directory.

Alexander Lenail added some commits Jul 18, 2016

Alexander Lenail
first commit
modeled off S3 adapter
Alexander Lenail
@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 18, 2016

WARNING: At present, this PR is not ready to merge. Though it would do no immediate harm, I would like to test it further before merging.

@nsoranzo nsoranzo changed the title Azure blob objectstore [WIP] Azure blob objectstore Jul 18, 2016

@jxtx jxtx added the status/WIP label Jul 18, 2016

Alexander Lenail added some commits Jul 19, 2016

Alexander Lenail
Modified galaxy.objectstore.rst
Not sure how RST works — would appreciate some guidance here.
@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 20, 2016

@jxtx @nsoranzo I believe the checks began failing when I merged dev back into this branch. Travis says something about flake8. Can I assume this is not my problem?

@dannon

This comment has been minimized.

Copy link
Member

commented Jul 20, 2016

@zfrenchee It does look like in the Travis output that the problematic file is azureObjectstore.py -- see the results here: https://travis-ci.org/galaxyproject/galaxy/jobs/145915553#L168

Let me know if you want help fixing this up, though, I'd be happy to give it a once-over and open a PR to your branch.

@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 20, 2016

@dannon oh my gosh! I scrolled to the bottom much too fast. These seems like lint errors, let me look through them and see if I can fix them all in one pass. I'll save that review of yours for when I need it.

from galaxy.exceptions import ObjectNotFound, ObjectInvalid
from galaxy.util import umask_fix_perms, safe_relpath, directory_hash_id
from galaxy.util.sleeper import Sleeper
dfrom ..objectstore import ObjectStore, convert_bytes

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jul 21, 2016

Member

Typo dfrom

This comment has been minimized.

Copy link
@zfrenchee

zfrenchee Jul 21, 2016

Author Contributor

Sorry!

Alexander Lenail
@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 21, 2016

@dannon, Travis hasn't looked through the latest stuff here yet, but I think I'm ready for that first review. =)

@nsoranzo

This comment has been minimized.

Copy link
Member

commented Jul 21, 2016

Alexander Lenail
@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 21, 2016

Hello @nsoranzo , @dannon :
Do I need to solve all the lint errors travis mentions? I believe the remaining lint errors are not issues.

azure_blob.py:8:1: F401 'subprocess' imported but unused
azure_blob.py:23:5: F401 'azure.common.AzureConflictHttpError' imported but unused
azure_blob.py:23:5: F401 'azure.common.AzureMissingResourceHttpError' imported but unused
azure_blob.py:40:12: F821 undefined name 'BlobService'
azure_blob.py:40:31: E701 multiple statements on one line (colon)
azure_blob.py:60:5: E266 too many leading '#' for block comment
azure_blob.py:216:17: F841 local variable 'blob' is assigned to but never used
azure_blob.py:259:5: E266 too many leading '#' for block comment
azure_blob.py:487:5: E266 too many leading '#' for block comment
@dannon

This comment has been minimized.

Copy link
Member

commented Jul 21, 2016

@zfrenchee At least the "BlobService" statement seems like it might be problematic. Here's what I'd do: https://github.com/dannon/galaxy/tree/minor_azure_tweaks

(I tried to open a PR on your fork, but for some reason I'm unable to)

(edit the second: github decided I was able to all of a sudden? See https://github.com/zfrenchee/galaxy/pull/1)

Alexander Lenail and others added some commits Jul 21, 2016

Alexander Lenail
Alexander Lenail
Merge pull request #1 from dannon/minor_azure_tweaks
Address minor azure objblob lint issues, and one UndefinedObject issue.
@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 21, 2016

@dannon @natefoo I believe you two are the experts on the ObjectStore.

One thing you might notice about this PR is that all of the files which are stored as blobs end up being publicly accessible via a URL, which means "security by obscurity" is the only line of defense. You can imagine this wouldn't be tolerable for clinical data, for example. I am not entirely sure but I do believe ../s3.py objectStore plugin is the same way. I was wondering whether it would be possible to support data privacy with an objectstore plugin, which would be quite important for the group I belong to. I'm not sure exactly how to do this, and wanted to get this up and running for publicly available blobs first, but I think I'd like to be able to support that before merging this (though we can also do two merges).

I'd love to get your thoughts and ideas about this.

@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 21, 2016

Oh, and finally, I should mention that I still haven't figured out how to test this, mostly because I haven't quite figured out how configuration files work in galaxy. Specifically: I added an entry to config/object_store_conf.xml.sample : <object_store type="Azure"> which has parameters like <auth account_name="..." account_key="...." />. How do those get filled in? Is it in galaxy.ini like this page seems to indicate (I think)? If so, why is none of the code on that page in galaxy.ini to start with?

(edit: jk I got it to work finally)

Alexander Lenail added some commits Jul 21, 2016

@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 22, 2016

@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 25, 2016

@dannon still not sure how to test this, since I'm not sure how to get the dependencies fixed.

@dannon

This comment has been minimized.

Copy link
Member

commented Jul 26, 2016

@zfrenchee I think you're probably running into the issue I've fixed here: https://github.com/zfrenchee/galaxy/pull/2

Alexander Lenail
Merge pull request #2 from dannon/azure-blob-objectstore
Fixes azure_blob detection for conditional dependency resolution
@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2016

Thanks for the PR -- that will make this more scalable in the longer term. I'm not sure that was the issue though =(.

@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2016

Strangely enough, when I do ./run.sh I see:

Requirement already satisfied (use --upgrade to upgrade): azure-storage in ./.venv/lib/python2.7/site-packages (from -r /dev/stdin (line 1))
Requirement already satisfied (use --upgrade to upgrade): azure-common in ./.venv/lib/python2.7/site-packages (from azure-storage->-r /dev/stdin (line 1))

So my guess is that there's something funky going on...

@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2016

Aha -- I believe this may have something to do with the version:

~/ » python
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.72)] on darwin
>>> from azure.storage.blob import BlockBlobService
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name BlockBlobService
>>> from azure.storage.blob import BlobService
>>>

Azure storage recently moved from "BlobService" to "BlockBlobService". My PR assumes the newest azure sdk, but my bet is, like my computer, the sdk galaxy is trying to install is out of date.

Alexander Lenail
@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2016

yup that fixed it. Upgraded from version 0.20.0 to 0.32.0 (which is either the most recent or close to it)

Alexander Lenail added some commits Jul 26, 2016

@dannon

This comment has been minimized.

Copy link
Member

commented Jul 26, 2016

@zfrenchee Great! My guess is you had an older version installed (maybe via 'pip install azure' -- I think the version depended on there is 0.20.0) so Galaxy wasn't automatically installing the most up-to-date and compatible version in your .venv.

Alexander Lenail
@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2016

@dannon I have verified that this works! (and by that I mean that uploading data via the galaxy UI properly uploads data to MS azure). I think it's probably fair to remove the WIP label and discuss more rigorous testing strategies.

@bgruening bgruening added status/review and removed status/WIP labels Jul 27, 2016

@bgruening bgruening changed the title [WIP] Azure blob objectstore Azure blob objectstore Jul 27, 2016

@bgruening

This comment has been minimized.

Copy link
Member

commented Jul 27, 2016

@zfrenchee @dannon awesome work!
I removed the WIP label and added the review label.

@galaxybot galaxybot added this to the 16.07 milestone Jul 27, 2016

@dannon dannon self-assigned this Jul 27, 2016

@dannon

This comment has been minimized.

Copy link
Member

commented Jul 27, 2016

@zfrenchee Don't worry about the new conflicts. I'm going to test a few more things and will merge manually, resolving them.

@dannon

This comment has been minimized.

Copy link
Member

commented Jul 28, 2016

I'm going to follow up with a separate PR to fix one bug I found, but this is nice and it's working fine for me. Thanks @zfrenchee!

Merging locally to resolve conflicts.

@dannon dannon merged commit b4896bc into galaxyproject:dev Jul 28, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@zfrenchee zfrenchee deleted the zfrenchee:azure-blob-objectstore branch Jul 28, 2016

@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Jul 28, 2016

@dannon if you aren't already exhausted by azure, I'm going to get started on another objectstore backend using another azure product they call "data lake". We at MIT will be using azure blobs in the short term, but may move to "data lake" in the medium term. Keep an eye out for that PR, and thanks for all the help so far =)

@natefoo

This comment has been minimized.

Copy link
Member

commented Aug 10, 2016

One thing you might notice about this PR is that all of the files which are stored as blobs end up being publicly accessible via a URL, which means "security by obscurity" is the only line of defense. You can imagine this wouldn't be tolerable for clinical data, for example. I am not entirely sure but I do believe ../s3.py objectStore plugin is the same way. I was wondering whether it would be possible to support data privacy with an objectstore plugin, which would be quite important for the group I belong to. I'm not sure exactly how to do this, and wanted to get this up and running for publicly available blobs first, but I think I'd like to be able to support that before merging this (though we can also do two merges).

@zfrenchee at least in the case of S3, aren't they private and only accessible because Galaxy has a secret key that allows access to the bucket?

@zfrenchee

This comment has been minimized.

Copy link
Contributor Author

commented Aug 10, 2016

Hi @natefoo
I think at some point I figured out how to do this, and I agree with you about s3. I'll need to double check before we upload clinical data, but I think this Azure connector is now private by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.