Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cull duplicate dataURIs for MAST in download_products #2497

Merged
merged 7 commits into from
Sep 6, 2022

Conversation

jdavies-st
Copy link
Contributor

@jdavies-st jdavies-st commented Aug 19, 2022

Resolves #2496

I'm not precisely sure how to test this in a unit test, as I don't understand the mocking system. And there doesn't seem to be any JWST unit tests yet.

Regardless, this is what I see when I run the workflow in the issue above locally.

Before this PR:

>>> manifest = Observations.download_products(data_products)
Downloading URL https://mast.stsci.edu/api/v0.1/Download/file?uri=mast:JWST/product/jw02736008001_01_msa.fits to ./mastDownload/JWST/jw02736008001_03103_00003_nrs2/jw02736008001_01_msa.fits ...
|============================================================================| 567k/567k (100.00%)         1s
Downloading URL https://mast.stsci.edu/api/v0.1/Download/file?uri=mast:JWST/product/jw02736008001_01_msa.fits to ./mastDownload/JWST/jw02736008001_03103_00001_nrs2/jw02736008001_01_msa.fits ...
|============================================================================| 567k/567k (100.00%)         1s
Downloading URL https://mast.stsci.edu/api/v0.1/Download/file?uri=mast:JWST/product/jw02736008001_01_msa.fits to ./mastDownload/JWST/jw02736008001_03103_00002_nrs1/jw02736008001_01_msa.fits ...
|============================================================================| 567k/567k (100.00%)         0s
Downloading URL https://mast.stsci.edu/api/v0.1/Download/file?uri=mast:JWST/product/jw02736008001_01_msa.fits to ./mastDownload/JWST/jw02736008001_03103_00002_nrs2/jw02736008001_01_msa.fits ...
|============================================================================| 567k/567k (100.00%)         1s
Downloading URL https://mast.stsci.edu/api/v0.1/Download/file?uri=mast:JWST/product/jw02736008001_01_msa.fits to ./mastDownload/JWST/jw02736008001_03103_00003_nrs1/jw02736008001_01_msa.fits ...
|============================================================================| 567k/567k (100.00%)         0s
Downloading URL https://mast.stsci.edu/api/v0.1/Download/file?uri=mast:JWST/product/jw02736008001_01_msa.fits to ./mastDownload/JWST/jw02736008001_03103_00003_nrs2/jw02736008001_01_msa.fits ...
|============================================================================| 567k/567k (100.00%)         1s

This PR:

manifest = Observations.download_products(data_products)
Downloading URL https://mast.stsci.edu/api/v0.1/Download/file?uri=mast:JWST/product/jw02736008001_01_msa.fits to ./mastDownload/JWST/jw02736008001_03103_00001_nrs1/jw02736008001_01_msa.fits ...
|============================================================================| 567k/567k (100.00%)         0s

Any further suggestions or existing examples on how to test this in a unit test would be appreciated.

@codecov
Copy link

codecov bot commented Aug 19, 2022

Codecov Report

Merging #2497 (0c96a7e) into main (f0de9d4) will increase coverage by 0.13%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##             main    #2497      +/-   ##
==========================================
+ Coverage   62.86%   62.99%   +0.13%     
==========================================
  Files         133      133              
  Lines       17276    17302      +26     
==========================================
+ Hits        10860    10899      +39     
+ Misses       6416     6403      -13     
Impacted Files Coverage Δ
astroquery/exceptions.py 0.00% <0.00%> (ø)
astroquery/mast/observations.py 76.47% <80.00%> (+0.05%) ⬆️
astroquery/alma/core.py 48.19% <0.00%> (+4.83%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@keflavich keflavich requested a review from ceb8 August 19, 2022 13:54
Copy link
Contributor

@keflavich keflavich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, it's a simple fix and the issue describes the problem well. But, I defer to @ceb8 for final review.

@keflavich
Copy link
Contributor

Regarding the unit tests - I think it would be a good idea to have some JWST-specific ones, since that ought to be in high demand in coming months & years.

In the meantime, you could perhaps riff off of this example:

def test_observations_download_file(patch_post, tmpdir):
# pull a single data product
products = mast.Observations.get_product_list('2003738726')
uri = products['dataURI'][0]

and just add a check that the list of uris is unique (assuming you can find a data set ID that has non-unique entries that need to be filtered)

@bsipocz
Copy link
Member

bsipocz commented Aug 19, 2022

I suppose it would be also useful to be able to turn off the progress bar, but that points beyond the scope here.

@bsipocz bsipocz added this to the v0.4.7 milestone Aug 19, 2022
bsipocz
bsipocz previously approved these changes Aug 19, 2022
Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a changelog entry and then this is good to go.

@jaymedina
Copy link
Contributor

Hi @jdavies-st , this looks good, although I would suggest having the fix under get_product_list instead of download_products . Methods such as download_products, get_cloud_uris, etc all bank on the output received from get_product_list, so fixing that output straightaway before feeding the product list anywhere else would be cleaner.

@jdavies-st
Copy link
Contributor Author

jdavies-st commented Aug 22, 2022

Thanks for the feedback!

I had thought of putting the unique() call in get_product_list, but currently for JWST, one often has to repeatedly call this function for each observation, as passing all observations of a proposal (or even some in batches) times out. I.e. the following recipe from the astroquery.mast doc examples just doesn't work for most JWST proposals:

from astroquery.mast import Observations


obs_table = Observations.query_criteria(proposal_id="2736", instrument_name="NIRSPEC")
data_products = Observations.get_product_list(obs_table)

You'll get a timeout or it will just hang and fail. One has to break up into chunks or just loop through each obs.

As an example, there can be 250k products (many duplicates) for a couple dozen NIRSpec observations which include level 3 products. And for scientists, this is going to be the way they will want to query - get me all the data for my <proposal_id>, maybe minus the guide star data.

So if I have to already call in a loop, say like the following:

from astroquery.mast import Observations
from astropy.table import unique, vstack


obs_table = Observations.query_criteria(proposal_id="2736", instrument_name="NIRSPEC")
data_products_list = []
for obs in obs_table:
    data_products_list.append(Observations.get_product_list(obs))
data_products = vstack(data_products_list)

I still will have lots of duplicates in my final vstacked table, as between observations is where all the duplication is occurring not within, especially for an instrument like NIRSpec where each level 3 product uses the same level 2 products as input. So if I have NIRSPec MOS with 100 spectra, and get the level 3 products, then I will get 100 copies of the level 2 products, one for each level 3 observation.

Also, if one is interested in the parent observation of a data product, each of the many copies of a particular file to download will have different parent metadata. So culling early might make people miss data if later in their script (before download) they select by observation, say the newest observation that was just taken.

So that was my thinking in leaving the duplicate culling to the last moment, before download, as that's really when you want to cull. Just don't make me download this file 600 times. =)

Thoughts? Or do you think it would still be best to cull early?

@jaymedina
Copy link
Contributor

jaymedina commented Aug 22, 2022

Hi @jdavies-st , those are good points. Doing the cull during the download step instead of the get_product_list step means the get_cloud_uri(s) methods will still return duplicate products, so just for completion I would suggest adding the unique() to the get_cloud_uri and get_cloud_uris methods as well. And I'm in agreement with @keflavich that a JWST-specific unit test that checks for duplicates would be another good addition to this PR. Thanks very much for catching this!

I also want to mention that the hang-up issue you're experiencing when trying to download large amounts of products is on our radar. We're currently trying to find the root cause of this and will implement a fix so users don't have to do those iterative product queries anymore.

@ceb8
Copy link
Member

ceb8 commented Aug 22, 2022

I think that dealing with this duplicate issue is important but a have a few concerns about this solution. Mostly that it breaks the connection between the metadata and file location. As @jdavies-st mentions above the metadata is unique for each parent observation even though the file is the same. Ironically the download location was designed to be guaranteed to be unique for each individual observation because a few Hubble products do not have unique names (while being unique files I believe), a fact which is now causing this problem for JWST.

I have a few ideas for solutions:

  1. Having a flat directory download option (with an argument flag or some such) where all files are dumped in the given download directory with no subdirectories is something users have been requesting for a while, maybe now is the time to implement this.
    In this case once the first copy is downloaded all other download attempts will be short circuited by the cache functionality which will find the file already downloaded and not download it again.
  • Pros:
    Adds functionality that users already want
    Means that each indvidual metadata entry will be associated with the file location
  • Cons:
    JWST users have to set the flag and if they forget they will still get all the copies
    A little more work to implement
  1. Have JWST files go to local_path = os.path.join(base_dir, data_product['obs_collection']) rather than local_path = os.path.join(base_dir, data_product['obs_collection']), data_product['obs_id']). This means that all JWST files go in the same directory, so lilke option one the caching takes over and prevents duplicate downloads.
  • Pros:
    Does not requere a user argument
    Means that each indvidual metadata entry will be associated with the file location.
  • Cons:
    Means harding coding different treatment for JWST
    Not a general solution
  1. The fix you've already implemented but with a check and warning if the unique call indicates duplicates present. I think the warning is necessary because it is a departure from the functionality up till now which gives the user one file per product row, and the user will need to know they need to do a manual cross-match on dataURI to connect downloaded files to their metadata entries.
  • Pros:
    Easiest coding wise, just add warning
  • Cons:
    Breaks the metadata to file connection

I defer to @jaymedina to decide what solution is best, but at minimum I do think the user needs to be warned if they are going to get a download manifest with a different number of rows than they supplied to the download function.

@bsipocz bsipocz dismissed their stale review August 22, 2022 16:34

I revoke my approval in favour of the discussion above.

@jaymedina
Copy link
Contributor

Thanks @ceb8 , I'm in favor of option 1 especially if this functionality has already been requested by users. Perhaps also adding a warning in the docstring to users: If they have JWST products in their product list + don't use the flat download directory kwd argument that they may be re-downloading files unintentionally.

@jdavies-st
Copy link
Contributor Author

jdavies-st commented Aug 24, 2022

@ceb8 I agree that option 1) above is great solution. It is something I'd like to see solved as well, as one always has to move the files to a flat structure so that the calibration pipeline can be run. Though I didn't find a Github issue here for that. But I'm not convinced that JWST users will enjoy having another kwarg to control this behavior. It would be nice if it was smart. And if one forgets to add the flat=True or similar kwarg, one will still get the old behavior of multiple downloads, which is not desirable ever. It places a big strain on all infrastructure involved - mast servers, bandwidth, local storage and researchers' time. The duplicates can make the download size (and time) be a factor of 10 to 100 times larger. This is significant.

Option 2 would also work very well and would prevent users from having to use flat=True or similar. If the current behavior of organizing into obsid subdirs was brought about by HST not having unique filenames, this is not a problem with JWST. So while one could argue option 2 is not a general solution, its actually the HST handling which is not a general solution. There should be no assumption that the same files won't be found across multiple obs_ids given the JWST calibration pipeline routinely creates level 3 products for almost all instruments and observing modes.

That said, I think the 2 options above solve an issue that is separate from the one this PR solves, or at least broader scope. And this PR is actually compatible with both options 1 and 2. It will be needed if a user forgets flat=True in option 1, and it will be technically equivalent in option 2, but much more efficient. Instead of having to iterate through 10 or 100 times more data products and checking if they exist, this sort-circuits for cases where the same product is listed more than once (often hundreds of times). It will be much faster, and there's little downside except the manifest will now be smaller and won't have duplicates.

Happy to add a warning in this PR if that is thought to be needed, though I think once one is downloading, the resulting products on disk no longer have different metadata - they are the same files. The only place metadata will be lost is in the manifest table of the downloads (and the files won't be found multiple times across multiple subdirs).

Thoughts on the way forward?

I've added a test to this PR based on the example in the issue.

@bsipocz
Copy link
Member

bsipocz commented Aug 24, 2022

It is something I'd like to see solved as well, as one always has to move the files to a flat structure

totally independently I was about to open an issue about this as I've just run into it.

@ceb8
Copy link
Member

ceb8 commented Aug 25, 2022

@jdavies-st what I meant about metatdata was not in the file itself but in the connection between the file and the prodct list, because the way you do it the file will appear under in single JWST/obsid directory while being associated with a lot of different obsids. This is not prohibitive, but is why I think it is important to have a warning trigger when this happens. Also it is a change to what the user can expect in that up till now they were guarenteed a manifest with a one-to-one correspondence to their input product list, and I wouldn't want to change that behaviour without letting the user know.

I think your point about it being inadvisable for this to happen ever is a good one, which makes me think that a combination of 3 and 1 are probably correct, with this PR being the immediate fix and then the flat directory option being considered relatively high priority (within the constraints on the MAST team of course). Thoughts @jaymedina?

@eerovaher
Copy link
Member

Have you considered using hard links? That would allow you to maintain the directory structure while still avoiding downloading and writing duplicate files.

@jaymedina
Copy link
Contributor

jaymedina commented Aug 25, 2022

What do you mean by hard links @eerovaher ?

I just spoke with one of the MAST database handlers to confirm that the reason for the current convention was because of different Hubble products having the same name and this was their reply:

I'm not aware of any duplicated HST pipeline products with the same name. The original reason for the dir structure was to put all files for the same observation in the same directory which is a 'natural' way to group them. The problem is JWST which doesn't have such a logical structure with the same file being used for multiple observations so it appears in multiple places.

So if this is true, Method 2 could be a general solution. They also suggested going with a directory structure like:
->obs_collection sub-directory (as mentioned in Clara's Method 2)
--> sub-sub-directory for auxiliary files that are shared across observations like asn files, guidestar files, msa files, etc.

This would generalize the file directory structure so it's compatible across missions, and the cacheing would control duplicates in the auxiliary files sub-sub-directory. So this would basically be Method 2 with some flare. Let me know your thoughts and if you see potential problems with this convention.


EDIT: Scratch that, I received feedback from others that there were duplicate files seen in the GALEX mission among others, so method 2 is probably a no.

One other suggestion would be to fiddle with the manifest itself in download_files and download_curl_scripts. So for example in download_files, adding an if/else after calling local_path stating that if a file is already in manifest_array, add the next row with a status message denoting that this file is a duplicate and will be skipped. You would have to assert that the product is not part of any HST mission, though.

@eerovaher
Copy link
Member

What do you mean by hard links @eerovaher ?

https://en.wikipedia.org/wiki/Hard_link

@jdavies-st
Copy link
Contributor Author

jdavies-st commented Aug 26, 2022

Thanks @ceb8. I'm in full agreement. I will add a warning to the user in this PR.

Btw, if one uses the MAST web portal to pull the same level 3 obs that is in the test in this PR, one gets the same 6 MSA config files in one's basket (due to 2 detectors, 3 nods):

Screen Shot 2022-08-26 at 11 19 22

So the exact same MSA config file listed 6 times for each "dataset". But if I hit the download button and look at the resulting curl script:

# Download Product Files:



cat <<EOT
<<< Downloading File: mast:JWST/product/jw02736008001_01_msa.fits
                  To: ${DOWNLOAD_FOLDER}/JWST/jw02736008001_03101_00002_nrs1/jw02736008001_01_msa.fits
EOT

curl --globoff --location-trusted -f --progress-bar --create-dirs $CONT --output ./${DOWNLOAD_FOLDER}'/JWST/jw02736008001_03101_00002_nrs1/jw02736008001_01_msa.fits' 'https://mast.stsci.edu/api/v0.1/Download/file?bundle_name=MAST_2022-08-26T0519.sh&uri=mast:JWST/product/jw02736008001_01_msa.fits'




exit 0

It has culled them to a single download, but keeping the subdir structure. So the MAST portal is doing the same thing that this PR is doing. Confluence.

Of course better would be not to do the subdirs in the first place. A separate PR, and further discussion.

@jdavies-st
Copy link
Contributor Author

@jaymedina I've added the duplicate culling to get_cloud_uris as well. It looks like that's the only that is needed for cloud URIs, as get_cloud_uri only works with a single data product.

@jdavies-st
Copy link
Contributor Author

I believe this PR is complete and ready for final review.

Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good to me, thanks @jdavies-st! Though, before going ahead and merging this I would like to get an approval from either @ceb8 or @jaymedina.

@jaymedina
Copy link
Contributor

This looks good to me, thanks for the updates! This just needs a commit squash and I believe a rebase, but @bsipocz can confirm if this is the case or not.

@eerovaher
Copy link
Member

As it currently stands, if a query results in a DuplicateResultsWarning then there isn't any way for the user to perform the same query without the warning being emitted. In such cases the warning should go through the logging system, see https://docs.python.org/3/howto/logging.html#when-to-use-logging

@bsipocz
Copy link
Member

bsipocz commented Sep 6, 2022

Thanks for approving @jaymedina! No need to squash nor to rebase (commits are logical chunks and not back and forth, and there are no duplicates or conflicts).

@bsipocz bsipocz merged commit 3a57b4c into astropy:main Sep 6, 2022
@bsipocz
Copy link
Member

bsipocz commented Sep 6, 2022

Thanks @jdavies-st!

@bsipocz
Copy link
Member

bsipocz commented Sep 6, 2022

As a follow-up, I agree with @eerovaher, warning that a user can do nothing about are not ideal, please consider using the logger instead.

@jdavies-st
Copy link
Contributor Author

@eerovaher yes, I agree. Thanks for pointing that out to me. I will make a follow-up PR to fix this.

@jdavies-st jdavies-st deleted the mast-cull-duplicate-downloads branch September 7, 2022 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JWST MAST product lists often list and download the same product multiple times
6 participants