Feature request: Use adlfs to access public blobs #97

Closed
raybellwaves opened this issue Sep 12, 2020 · 5 comments

@raybellwaves
Contributor

raybellwaves commented Sep 12, 2020

Originally posted at https://stackoverflow.com/questions/63856476/adlfs-create-azureblobfilesystem-with-credential-none but should probably post here. I need rep points to create adlfs and fsspec tags...

I'm going through this:
https://azure.microsoft.com/en-us/services/open-datasets/catalog/goes-16/

and I was curious how to do the equivalent of s3fs.S3FileSystem(anon=True) with adlfs.AzureBlobFileSystem.

It seems credential=None is a good candidate (https://docs.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.containerclient?view=azure-python) but as far as I'm aware it hasn't been implemented?

>>> import adlfs
>>> fs = adlfs.AzureBlobFileSystem(credential=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ray/local/bin/anaconda3/envs/test_env/lib/python3.8/site-packages/fsspec/spec.py", line 52, in __call__
    obj = super().__call__(*args, **kwargs)
TypeError: __init__() missing 1 required positional argument: 'account_name'
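
For reference, the underlying azure-storage-blob SDK does accept credential=None for public containers, so a workaround outside adlfs is possible today. A minimal sketch, using the GOES-16 account URL and container from the open dataset catalog (untested, just to illustrate the anonymous-access pattern):

from azure.storage.blob import ContainerClient

# Anonymous (credential=None) read access to the public GOES-16 container
container = ContainerClient(
    account_url="https://goes.blob.core.windows.net",
    container_name="noaa-goes16",
    credential=None,
)
for blob in container.list_blobs(name_starts_with="ABI-L2-MCMIPF/2020/001/00/"):
    print(blob.name)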
@raybellwaves
Contributor Author

I believe public data on Azure is stored in a slightly different way than on AWS.

For example, GOES-16 on Azure is stored at https://goes.blob.core.windows.net. What's the nomenclature for this in an az/adlfs connection_string? What do you call the goes part? Storage name? Blob name?
I see an example here: BlobEndpoint=https://storagesample.blob.core.windows.net;. Once you have that, you can access the data in the noaa-goes16 container; e.g. the path to a file would be noaa-goes16/ABI-L2-MCMIPF/2020/001/00/OR_ABI-L2-MCMIPF-M6_G16_s20200010000216_e20200010009536_c20200010010028.nc

For reference, if you mount the blob in Storage Explorer you see additional info when copying the file:
{"CloudHub.Azure.Storage.Blobs":{"connectionString":"BlobEndpoint=https://goes.blob.core.windows.net;SharedAccessSignature=","containerName":"noaa-goes16","subscription":null,"accountUri":"https://goes.blob.core.windows.net","sourceFolder":"ABI-L2-MCMIPF/2020/001/00/","items":[{"relativePath":"OR_ABI-L2-MCMIPF-M6_G16_s20200010000216_e20200010009536_c20200010010028.nc","snapshot":""}],"sasToken":"","service":"blob"}}

In AWS (and using s3fs) you can grab a file (note the files do not overlap, as the products have different frequencies) as:

import s3fs

# Anonymous access to the public NOAA GOES-16 bucket on AWS
fs = s3fs.S3FileSystem(anon=True)
file = "OR_ABI-L2-MCMIPC-M6_G16_s20200010001184_e20200010003557_c20200010004112.nc"
fs.get("noaa-goes16/ABI-L2-MCMIPC/2020/001/00/{}".format(file), file)

Therefore it would probably need an extra arg, something like:
adlfs.AzureBlobFileSystem(credential=None, storage_name="goes")

That's because other datasets are hosted on different storage accounts,
e.g. https://azure.microsoft.com/en-us/services/open-datasets/catalog/hls/ is at https://hlssa.blob.core.windows.net
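
In other words, the adlfs mirror of the s3fs snippet above might look roughly like this once anonymous access is supported (a sketch only; account_name is the keyword the constructor already requires, and fs.get() comes from the fsspec base class):

import adlfs

# Hypothetical anonymous download from the public "goes" storage account
fs = adlfs.AzureBlobFileSystem(account_name="goes")
file = "OR_ABI-L2-MCMIPF-M6_G16_s20200010000216_e20200010009536_c20200010010028.nc"
fs.get("noaa-goes16/ABI-L2-MCMIPF/2020/001/00/{}".format(file), file)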

@raybellwaves raybellwaves changed the title Create AzureBlobFileSystem using credential=None? Feature request: Use adlfs to access public blobs Sep 17, 2020
@lostmygithubaccount

Are there any workarounds to read from public blobs currently? I'm trying to read from Azure Open Datasets hosted on a public blob.

@hayesgb
Collaborator

hayesgb commented Sep 22, 2020

I've added anonymous authentication to AzureBlobFileSystem with #106 and #107, so the above GOES example works as follows:

from adlfs import AzureBlobFileSystem

storage_options = {'account_name': 'goes'}
fs = AzureBlobFileSystem(**storage_options)
fs.ls("noaa-goes16")

An example Jupyter Notebook on Azure can be found here. The above doesn't work if you just try fs.ls("/").

I believe @raybellwaves is right -- the trick is making sure you get the account_name and container right.

BTW -- the GOES data is netCDF (the kind of thing you'd open with xarray rather than a dataframe), so I didn't try opening a file here.
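
For anyone who does want to open one of those netCDF files, a sketch that should work through the filesystem object (untested here; assumes xarray and the h5netcdf engine are installed):

import xarray as xr
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="goes")
path = "noaa-goes16/ABI-L2-MCMIPF/2020/001/00/OR_ABI-L2-MCMIPF-M6_G16_s20200010000216_e20200010009536_c20200010010028.nc"

# fs.open() returns a file-like object that the h5netcdf engine can read directly
with fs.open(path) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)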

Also successfully accessed the data here as follows:

storage_options={"account_name": "hlssa"}
fs = AzureBlobFileSystem(**storage_options)
fs.ls("hls")
ddf = dd.read_csv("abfs://hls/S2_TilingSystem2-1.txt", storage_options=storage_options)

Both .head() and .tail() work
tag: @lostmygithubaccount
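
If you don't need dask, the same file can also be read with plain pandas through the filesystem object (a sketch, untested here):

import pandas as pd
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="hlssa")

# Open the public blob anonymously and hand the file object straight to pandas
with fs.open("hls/S2_TilingSystem2-1.txt") as f:
    df = pd.read_csv(f)

print(df.head())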

@lostmygithubaccount

Will this be published to PyPI soon, @hayesgb?

@hayesgb
Collaborator

hayesgb commented Sep 25, 2020 via email
