Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAST: Add flat option to Observations.download_products #2511

Merged
merged 7 commits into from
Sep 10, 2022

Conversation

jdavies-st
Copy link
Contributor

@jdavies-st jdavies-st commented Sep 8, 2022

This adds a flat boolean option to Observations.download_products() that does not make any subdirectories for the downloads. Default is turned off, i.e. False.

If set to True, and download_dir is specified, it will put all files into download_dir without subdirectories.

If set to True and download_dir is not specified, it will put files in the current working directory, again with no subdirectories.

Setting flat=True has no effect when curl_flag=True, as the bash script is server-side generated at MAST.

The default of False retains the current behavior and puts files into the standard directory structure of mastDownload/ <obs_collection> / <obs_id> /.

Resolves #2500.

@codecov
Copy link

codecov bot commented Sep 8, 2022

Codecov Report

Merging #2511 (279c683) into main (c9972e0) will decrease coverage by 0.00%.
The diff coverage is 78.57%.

@@            Coverage Diff             @@
##             main    #2511      +/-   ##
==========================================
- Coverage   62.98%   62.98%   -0.01%     
==========================================
  Files         133      133              
  Lines       17308    17314       +6     
==========================================
+ Hits        10902    10905       +3     
- Misses       6406     6409       +3     
Impacted Files Coverage Δ
astroquery/mast/observations.py 76.01% <78.57%> (-0.66%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Comment on lines +616 to +621
if not flat:
local_path = os.path.join(base_dir, data_product['obs_collection'], data_product['obs_id'])
if not os.path.exists(local_path):
os.makedirs(local_path)
else:
local_path = base_dir
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered using Path from pathlib?

Suggested change
if not flat:
local_path = os.path.join(base_dir, data_product['obs_collection'], data_product['obs_id'])
if not os.path.exists(local_path):
os.makedirs(local_path)
else:
local_path = base_dir
if flat:
local_path = Path(base_dir)
else:
local_path = Path(base_dir, data_product['obs_collection'], data_product['obs_id'])
local_path.mkdir(parents=True, exist_ok=True)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider, I just didn't know what the downstream effects of that would be, given that all paths in the module use the older os.path routines, as do most in the core library. You'll notice in the test I did use the pytest tmp_path fixture, which supplies a temporary Path object as location to write the temporary downloaded file. And it works just fine. So it should work, but maybe it would be good to do a consistent cleanup throughout the module?

I did fix

https://github.com/astropy/astroquery/pull/2511/files#diff-55036b43100098d3295a04581bde036bd91d0371ed055fb0dabd0d8ddbfb6b5bL738

base_dir = download_dir.rstrip('/') + "/mastDownload"

which I believe was probably a bug, as that cannot work with a Windows filesystem. But I held off doing anymore changes that were not focused on adding this new feature. Happy to do that in a subsequent PR so as not to detract from this one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be better to be consistent throughout the module and switching to pathlib sounds like something that deserves a separate pull request.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not mix the two without a good reason, and even pathlib's own documentation refers that os.path can be used. That being said I would merge a PR if it consistently changes all os.path to pathlib in the package (but if so, I would still not like to see / operators as those may be misleading during quick code reads (and because we still get a lot of contributions doing raw sting manipulations and additions using '/' in them.

@bsipocz bsipocz added the mast label Sep 8, 2022
@bsipocz bsipocz added this to the v0.4.7 milestone Sep 8, 2022
Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a possibly relevant test failure (in fact likely introduced in the previous PRs), could you please have a look at it, too?

======================================================= FAILURES =======================================================
_______________________________ TestMast.test_observations_get_cloud_uris_no_duplicates ________________________________

self = <astroquery.mast.tests.test_mast_remote.TestMast object at 0x117450c70>
msa_product_table = <Table masked=True length=6>
 obsID   obs_collection dataproduct_type             obs_id             ...  size  parent...
87601178           JWST            image jw02736008001_03101_00004_nrs2 ... 567360     87602009     PUBLIC           1

    def test_observations_get_cloud_uris_no_duplicates(self, msa_product_table):
    
        # Get a product list with 6 duplicate JWST MSA config files
        products = msa_product_table
    
        # enable access to public AWS S3 bucket
        mast.Observations.enable_cloud_dataset()
    
        # Check duplicate cloud URIs as well
>       uris = mast.Observations.get_cloud_uris(products)

astroquery/mast/tests/test_mast_remote.py:347: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
astroquery/mast/observations.py:791: in get_cloud_uris
    return self._cloud_connection.get_cloud_uri_list(data_products, include_bucket, full_url)
astroquery/mast/cloud.py:161: in get_cloud_uri_list
    return [self.get_cloud_uri(product, include_bucket, full_url) for product in data_products]
astroquery/mast/cloud.py:161: in <listcomp>
    return [self.get_cloud_uri(product, include_bucket, full_url) for product in data_products]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <astroquery.mast.cloud.CloudAccess object at 0x1174632b0>
data_product = <Row index=0>
 obsID   obs_collection dataproduct_type             obs_id             description  type               ...                 -- CALJWST         --        2736 jw02736008001_01_msa.fits 567360     87602009     PUBLIC           1
include_bucket = True, full_url = False

    def get_cloud_uri(self, data_product, include_bucket=True, full_url=False):
        """
        For a given data product, returns the associated cloud URI.
        If the product is from a mission that does not support cloud access an
        exception is raised. If the mission is supported but the product
        cannot be found in the cloud, the returned path is None.
    
        Parameters
        ----------
        data_product : `~astropy.table.Row`
            Product to be converted into cloud data uri.
        include_bucket : bool
            Default True. When false returns the path of the file relative to the
            top level cloud storage location.
            Must be set to False when using the full_url argument.
        full_url : bool
            Default False. Return an HTTP fetchable url instead of a cloud uri.
            Must set include_bucket to False to use this option.
    
        Returns
        -------
        response : str or None
            Cloud URI generated from the data product. If the product cannot be
            found in the cloud, None is returned.
        """
    
        s3_client = self.boto3.client('s3', config=self.config)
    
        path = utils.mast_relative_path(data_product["dataURI"])
        if path is None:
            raise InvalidQueryError("Malformed data uri {}".format(data_product['dataURI']))
    
        if 'galex' in path:
            path = path.lstrip("/mast/")
        else:
            path = path.lstrip("/")
    
        try:
            s3_client.head_object(Bucket=self.pubdata_bucket, Key=path)
            if include_bucket:
                path = "s3://{}/{}".format(self.pubdata_bucket, path)
            elif full_url:
                path = "http://s3.amazonaws.com/{}/{}".format(self.pubdata_bucket, path)
            return path
        except self.botocore.exceptions.ClientError as e:
            if e.response['Error']['Code'] != "404":
                raise
    
>       warnings.warn("Unable to locate file {}.".format(data_product['productFilename']), NoResultsWarning)
E       astroquery.exceptions.NoResultsWarning: Unable to locate file jw02736008001_01_msa.fits.

astroquery/mast/cloud.py:135: NoResultsWarning

@@ -584,7 +584,7 @@ def download_file(self, uri, *, local_path=None, base_url=None, cache=True, clou

return status, msg, url

def _download_files(self, products, base_dir, *, cache=True, cloud_only=False,):
def _download_files(self, products, base_dir, flat=False, *, cache=True, cloud_only=False,):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just being extra nitpicky/pedantic as this is not public api, please add the kwarg after the *

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. Thanks. It would be nice to eventually expunge that * and be specific about what args are accepted, but I suspect that's a separate issue.

Copy link
Member

@bsipocz bsipocz Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what you mean, the * is doing exactly that enforcement of being specific.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that it would be better for documentation (autodoc) to specify all the args that * allows, but I guess that varies with dataset as the allowable filters vary? No matter, not super relevant this PR. And it's a private method, so not that important for documenting the API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may still misunderstand you, there is no args in *, it's the syntax to enforce keyword-only arguments (not to be the same as when it says *args or similar which would allow additional positional arguments to be passed.)

https://peps.python.org/pep-3102/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* limits the number of positional arguments, see https://peps.python.org/pep-3102/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thank you! I didn't know about this fun feature. 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of my favourites, and would really love to see that we actually get to #1746 (mast is already done thanks to Jenny)

@jdavies-st
Copy link
Contributor Author

@bsipocz That test failure is real and peculiar, as I recall it passing when I developed the test. I see that it does fail now, even in the refactored form. And I can't get it to pass. That said, it's not required that it finds the file on S3, only that it removes duplicates, so I've told it to ignore the NoResultsWarning. The test is only failing because warnings are turned into exceptions.

The AWS S3 stuff is a bit opaque to me I'm afraid. If you have any suggestions on a better way test, I'd be happy to hear.

Anyway, test now passes, though it may fail if it suddenly starts finding that file in S3, which means something like with pytest.warns(None): might be the better context manager.

@jdavies-st
Copy link
Contributor Author

Btw, this PR does not implement flat downloads when curl_flag=True is used. This is because the curl script that is downloaded is generated server side at MAST via MAST's bundle API, and astroquery.mast has no control over how this is done. Bummer.

So that is a current limitation.

@eerovaher
Copy link
Member

There should be a warning if flat and curl_flag are both True.

@jdavies-st
Copy link
Contributor Author

There should be a warning if flat and curl_flag are both True.

Yes, agree. Will add.

@@ -642,8 +648,8 @@ def _download_curl_script(self, products, out_dir):
"""

url_list = [("uri", url) for url in products['dataURI']]
download_file = "mastDownload_" + time.strftime("%Y%m%d%H%M%S")
local_path = os.path.join(out_dir.rstrip('/'), download_file + ".sh")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .rstrip("/") here is unnecessary and is actually a bug if the path is a Path object, something writing the test caught.

@bsipocz bsipocz merged commit fae02b8 into astropy:main Sep 10, 2022
@bsipocz
Copy link
Member

bsipocz commented Sep 10, 2022

Thank you @jdavies-st!

@jdavies-st jdavies-st deleted the mast-no-subdirs branch September 10, 2022 06:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: provided flat directory structure option for mast.Observations.download_products
4 participants