Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated bosh pull to allow multiple zenodo ids to be pulled at once #438

Merged
merged 10 commits into from
Apr 1, 2019

Conversation

ramou
Copy link
Contributor

@ramou ramou commented Mar 10, 2019

This closes #435

  • Added tests to see if pulling multiple items worked
    • Added a test to show that if you duplicate a searched-for zenodo id
      it collapses it and gives results as if you did not send duplicates
    • Added a test to show that if only some of the zenodo ids were valid
      then it would raise an exception on the first invalid zenodo id found
  • Switched the mocks in test_pull to use side_effects because it's
    cooler and better, and I can more readily mock multiple different types
    of response
  • added verbose feedback about squashing duplicate zenodo ids
  • changed any places using pull to pass a list of ids and to expect a
    list as a response
  • ParseArgs kindly takes care of passing no zenodo ids, but there is no
    test for this

Stuart Thiel added 2 commits March 9, 2019 19:23
 - Added tests to see if pulling multiple items worked
  - Added a test to show that if you duplicate a searched-for zenodo id
it collapses it and gives results as if you did not send duplicates
  - Added a test to show that if only some of the zenodo ids were valid
then it would raise an exception on the first invalid zenodo id found
 - Switched the mocks in test_pull to use `side_effects` because it's
cooler and better, and I can more readily mock multiple different types
of response
 - added verbose feedback about squashing duplicate zenodo ids
 - changed any places using pull to pass a list of ids and to expect a
list as a response
 - ParseArgs kindly takes care of passing no zenodo ids, but there is no
test for this
@coveralls
Copy link

coveralls commented Mar 10, 2019

Coverage Status

Coverage decreased (-0.05%) to 93.527% when pulling 8b13cfe on ramou:i435 into f1c4cc7 on boutiques:develop.

@ramou ramou changed the base branch from master to develop March 11, 2019 15:50
Copy link
Member

@glatard glatard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks great to me! @erinb90, would you have time to look at it before we merge?

self.cached_fname = os.path.join(self.cache_dir,
"zenodo-{0}.json".format(self.zid))
discarded_zids = zids
zids = list(dict.fromkeys(zids))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is meant to remove duplicates, then perhaps add a comment for it.

+ downloaded[0])
json_files.append(downloaded[0])
else:
raise_error(ZenodoError, "Seached-for descriptor \"{0}\" "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: Searched

+ downloaded[0])
json_files.append(downloaded[0])
else:
raise_error(ZenodoError, "Seacrhed-for descriptor \"{0}\" "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's still a typo here 😛

print_info("Downloading descriptor %s"
% file_name)
downloaded = urlretrieve(file_path, entry["fname"])
if(self.verbose):
Copy link
Contributor

@erinb90 erinb90 Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the below line should be printed even in non-verbose mode. Without it the user has no idea where the descriptor was downloaded and whether the download was successful.

Copy link
Contributor

@erinb90 erinb90 Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you don't need parentheses around self.verbose (for this and everywhere else in the file)

@ramou
Copy link
Contributor Author

ramou commented Mar 12, 2019

Both the use of verbose and the format of the if are from the original. It is not unreasonable to formalize how we use verbose or to add something to what pycodestyles is checking for, but I think that's unrelated to this issue.

return downloaded[0]
raise_error(ZenodoError, "Descriptor not found")
if not len(r.json()["hits"]["hits"]):
raise_error(ZenodoError, "Descriptor \"{0}\" "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be nicer if, instead of raising an exception after the first invalid Zenodo ID, we simply printed an error message without terminating the program. That way, the valid ones would still be downloaded and the user would only have to rerun the command for the invalid ones. The logger has a print_error function for this purpose.

@erinb90
Copy link
Contributor

erinb90 commented Mar 12, 2019

In searcher.py, at the part where it prints [ INFO ] Search successful, could you add the search query to this info message? Now that we're doing multiple searches in a single bosh command, this would make things more clear.

@erinb90
Copy link
Contributor

erinb90 commented Mar 12, 2019

Both the use of verbose and the format of the if are from the original. It is not unreasonable to formalize how we use verbose or to add something to what pycodestyles is checking for, but I think that's unrelated to this issue.

True, those parentheses were already there. I'm using Pycharm and it underlines everywhere that has redundant parentheses, so I can fix all of them at some point in the future. Pycodestyle doesn't care about it so it's not a big deal.

However, the "Downloaded descriptor to..." line was intentionally printed in non-verbose mode before, so I'd like to keep it that way if others agree.

@ramou
Copy link
Contributor Author

ramou commented Mar 12, 2019 via email

@erinb90
Copy link
Contributor

erinb90 commented Mar 12, 2019

I had considered that, but once again I went with an approach consistent with the previous implementation. Specifically, raising an exception on error. Since there is no cost to rerun the whole command again because cached descriptors should be identical by definition, it seemed the intent was to force the user to give a fully correct set of parameters, vs a tool for validating zenodo ids. We expect all correct ids and complain to the user when, exceptionally, this is not the case. I'm not opposed to a refinement of the current intent, but unless I have misinterpreted, it may be best to do that in another issue.

On Tue, Mar 12, 2019, 10:53 Erin Benderoff @.> wrote: @.* commented on this pull request. ------------------------------ In tools/python/boutiques/puller.py <#438 (comment)>: > - for hit in r.json()["hits"]["hits"]: - file_path = hit["files"][0]["links"]["self"] - file_name = file_path.split(os.sep)[-1] - if hit["id"] == int(self.zid): - if not os.path.exists(self.cache_dir): - os.makedirs(self.cache_dir) - if(self.verbose): - print_info("Downloading descriptor %s" - % file_name) - downloaded = urlretrieve(file_path, self.cached_fname) - print("Downloaded descriptor to " + downloaded[0]) - return downloaded[0] - raise_error(ZenodoError, "Descriptor not found") + if not len(r.json()["hits"]["hits"]): + raise_error(ZenodoError, "Descriptor "{0}" " I think it might be nicer if, instead of raising an exception after the first invalid Zenodo ID, we simply printed an error message without terminating the program. That way, the valid ones would still be downloaded and the user would only have to rerun the command for the invalid ones. The logger has a print_error function for this purpose. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#438 (review)>, or mute the thread https://github.com/notifications/unsubscribe-auth/ABunQOOoFEtSzOC1CY45s1FFAKjRzjZHks5vV79SgaJpZM4bm--2 .

Well it made sense to raise an exception when we were only pulling one ID at a time, but now that we accept a list, we need to decide whether to terminate after seeing an invalid one or letting it proceed through the whole thing. I personally think the latter is better, otherwise you might end up with some descriptors downloaded and others not, which can be confusing. @glatard what do you think?

@ramou
Copy link
Contributor Author

ramou commented Mar 12, 2019

However, the "Downloaded descriptor to..." line was intentionally printed in non-verbose mode before, so I'd like to keep it that way if others agree.

When checking for consistency in the searcher update I noted again that we only show things found in cache in verbose mode, but as you say, originally always posted the download URL. This seems inconsistent with the rest of puller.py. It's all a matter of the purpose of pull, currently given as Download a descriptor from Zenodo.

I'd note that puller doesn't actually do what its description says it does. it Ensures that Zenodo descriptors are locally cached, downloading them if needed. With that description it might be clearer why I think that we should be more consistent with the verbose output, either putting it all under verbose or always showing the found location. Even all that bugs me because what really matters is the end-location in cache, not whether it was downloaded or the URL of the download; I can see that being desirable if you wanted to be able to see and nuke the local cache to force a download, but we currently only have one remote source and descriptors on zenodo are immutable so this seems like over-engineering.

@erinb90
Copy link
Contributor

erinb90 commented Mar 12, 2019

However, the "Downloaded descriptor to..." line was intentionally printed in non-verbose mode before, so I'd like to keep it that way if others agree.

When checking for consistency in the searcher update I noted again that we only show things found in cache in verbose mode, but as you say, originally always posted the download URL. This seems inconsistent with the rest of puller.py. It's all a matter of the purpose of pull, currently given as Download a descriptor from Zenodo.

I'd note that puller doesn't actually do what its description says it does. it Ensures that Zenodo descriptors are locally cached, downloading them if needed. With that description it might be clearer why I think that we should be more consistent with the verbose output, either putting it all under verbose or always showing the found location. Even all that bugs me because what really matters is the end-location in cache, not whether it was downloaded or the URL of the download; I can see that being desirable if you wanted to be able to see and nuke the local cache to force a download, but we currently only have one remote source and descriptors on zenodo are immutable so this seems like over-engineering.

Good points. Ok, so can you change the description of the puller to what you said?

And I realize now that it's probably better not to show any info messages in non-verbose mode. When using bosh pull as part of another command (e.g. bosh exec launch with a Zenodo ID), we don't care about the location. We just want to use the descriptor, whether from cache or downloaded. Sorry for making you change it and then change it back 😛

@ramou
Copy link
Contributor Author

ramou commented Mar 12, 2019 via email

@glatard glatard merged commit 1f9d722 into boutiques:develop Apr 1, 2019
@ramou ramou deleted the i435 branch May 28, 2019 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants