Revise DataLad additions over git/git-annex section #64

Closed
mih opened this issue Apr 17, 2021 · 6 comments

@mih
Member

mih commented Apr 17, 2021

This section is arguably the key section of "Statement of need" and in turn the entire paper. Currently it puts forth 5 reasons:

  1. They are generic and lack support for domain-specific solutions
  2. They require a layer above to establish a distribution
  3. Modularization is needed to scale
  4. Annotation of changes is not "re-executable"
  5. Git and git-annex do not necessarily facilitate the best scientific workflow

I would propose to trim the list, and to straighten the argument:

A. Seamless nesting of independent modular units (with emphasis on "seamless", which is what DataLad adds to Git's submodules)
B. Reproducible execution (or capture of actionable provenance)
C. Interoperability adapters and interfaces (more of a collection of the former, rather than a definition of the latter)

I think 1-5 are outcomes that can be achieved with A-C, rather than the technological contribution.

The current text seems easy to sort under A, B, and C, where it can illustrate more or less intuitive use cases that show why one would want such features.

The description of B could be extended to reach beyond provenance capture and hint at a wider metadata support.

@yarikoptic
Member

Well, there are always multiple ways to present an argument ;) And the dimensions we characterize against would never be totally orthogonal.

  • The current paragraphs in the DataLad subsection of "Statement of need" answer the question "Why are Git and git-annex alone not enough?", which complements "Why Git and git-annex". IMHO such a formulation fits the "Statement of need" section quite well. If we go for A-C, I think the subsection names and the wording within them would need to be adjusted to reflect the new structure, or the proposed A-C would need to be renamed to pretty much the current ones (see mapping below) so that they reflect the deficiencies we are addressing instead of the strengths of DataLad.
  • overall it boils down to mapping of 1 -> C, 3 -> A, 4 -> B with pretty much a complete removal of 5 (which I think would be a loss) and making 2 just a footnote of A.
  • "Wider metadata support" ATM has little to nothing to do with B (reproducible execution), but it already fits well with the distribution aspect. Thus, if it is to be refactored into those three, it might better be hinted at under "modular" (after all, "aggregation" of metadata across subdatasets is a unique feature of DataLad here as well).

Overall, besides a possible "contraction", the description above has not yet convinced me that the change would provide a clearer presentation of the "Statement of need": it would pretty much boil down to losing the "distribution" and "best practices" arguments, and it seems it would require reshaping the entire "Statement of need" presentation.

@mih
Member Author

mih commented Apr 17, 2021

We seem to have rather different views on what DataLad contributes in essence. I would prefer to have each point be a crisp declaration of an added value. However, none of the five points (just looking at the tag lines) clicks with me. That may be just me. However, I don't think I am able to improve upon the present points.

@bpoldrack
Member

bpoldrack commented Apr 18, 2021

I lean towards @mih's view here.
A notion of "distribution" that would imply anything on top of git/git-annex has always been a very vague thing to me. As far as I consider it something valuable, it is an emergent property of datalad's entirety that is shown best in use cases (essentially: the handbook). 5 is a vague thing to me, too.

For me it boils down to A, B, C. Possibly plus a clearer notion of "making git/git-annex easier to handle", which is somewhat hidden in A's seamless.

@yarikoptic
Member

"distribution" aspect is what started it all, and it is still there. Ok, let's sacrifice 5, and go with A(3), B(4), C(1), and move distribution (2) to 4 and see if it survives.

@adswa
Member

adswa commented Apr 18, 2021

I will try to implement the proposed structure

@mih
Member Author

mih commented Apr 21, 2021

As the OP, I think the manuscript has evolved in the spirit of this issue, hence it should be safe to close it.

@mih mih closed this as completed Apr 21, 2021

4 participants