Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: Ensure that datalad urls are posix #7183

Merged
merged 1 commit into from
Nov 30, 2022
Merged

Conversation

adswa
Copy link
Member

@adswa adswa commented Nov 16, 2022

Fixes #7182

If the source in a clone command is a windows path, the resulting datalad_url in the .gitmodules file of the superdataset contains it verbatim, including escaped backslashes. Ensuring that it is posix makes it functional, and compliant to git's url format.

@adswa adswa added semver-patch Increment the patch version when merged CHANGELOG-missing When a PR's description does not contain a changelog item, yet. labels Nov 16, 2022
@github-actions github-actions bot removed the CHANGELOG-missing When a PR's description does not contain a changelog item, yet. label Nov 16, 2022
@codecov
Copy link

codecov bot commented Nov 17, 2022

Codecov Report

Base: 88.70% // Head: 88.70% // No change to project coverage 👍

Coverage data is based on head (c868ae9) compared to base (c868ae9).
Patch has no changes to coverable lines.

❗ Current head c868ae9 differs from pull request most recent head a45453a. Consider uploading reports for the commit a45453a to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##            maint    #7183   +/-   ##
=======================================
  Coverage   88.70%   88.70%           
=======================================
  Files         355      355           
  Lines       46672    46672           
  Branches     6352     6352           
=======================================
  Hits        41400    41400           
  Misses       5257     5257           
  Partials       15       15           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@adswa adswa requested a review from bpoldrack November 17, 2022 11:03
@@ -385,6 +386,8 @@ def __call__(
# from git default behavior WRT a submodule's name vs
# its path when we made this a new subdataset.
subds_name = path.relative_to(ds.pathobj)
# ensure posix url, relative Windows paths would be butchered
source = Path(source).as_posix() if not source.startswith('ria') else source
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bpoldrack I would be interested to learn if you know a more elegant way to do this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't even think ria is an exclusion criterion here. Also: You are assuming source to always be a path (what is Path('https://somewhere.org/blob') supposed to do?)

I think the issue is with source already. If this was the spot to fix it, we would have cloned from something with \ already. Which may work, but is not the intention really. So, in my view, source needs to be fixed before cloning if it happens to be a path. This may already happen somewhere in clone_dataset. There should be something that determines the giturl which is what is actually used and it's part of some dict (something_props?). I guess, this is what is supposed to be used here and possibly fixed for windows paths before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: You are assuming source to always be a path (what is Path('https://somewhere.org/blob') supposed to do?)

I believe pathlib actually works well for non-paths - at least the cases I could think of. E.g., consider

>>>from pathlib import Path
>>>Path('https://somewhere.org/blob').as_posix()
'https:/somewhere.org/blob'
>>>Path('git@github.com:datalad/datalad.git').as_posix()
'git@github.com:datalad/datalad.git'
>>>Path('adina@juseless.inm7.de:/home/scratch/weee').as_posix()
 'adina@juseless.inm7.de:/home/scratch/weee'

What else is a potential clone source beyond those URLs, local paths, and ria URLs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adswa

I believe pathlib actually works well for non-paths

Well, yes - I think b/c at least on unix such URLs are actually valid paths. Did you try that on Windows? May be it works, too, in that Path would become a PurePosixPath or something, but I'm not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re ria: The check with startswith is fine, the question is: What to do then? I think, the same should apply, since everything after the ria prefix is just a URL:

if source.startswith('ria'):
    source = 'ria+' + Path(source[4:]).as_posix()
else:
    source = Path(source).as_posix()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try that on Windows? May be it works, too, in that Path would become a PurePosixPath or something, but I'm not sure.

This is on Windows 11 with Python 3.10

In [1]: from pathlib import Path 
In [2]: Path('https://somewhere.org/blob').as_posix()  
Out[2]: 'https:/somewhere.org/blob'                                                                                                                                                                                                            
In [3]: Path('git@github.com:datalad/datalad.git').as_posix()                                                           
Out[3]: 'git@github.com:datalad/datalad.git'                                                                                                                                                                                                    
In [4]: Path('adina@juseless.inm7.de:/home/scratch/weee').as_posix()                                                    
Out[4]: 'adina@juseless.inm7.de:/home/scratch/weee'                                                                                                                              

Copy link
Member Author

@adswa adswa Nov 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just adding a bit of trial-and-error on Windows: It seems that in order to be able to create a RIA store on Windows, the URL has to be Posix (and can never contain the drive (e.g., C:)

C:\Users\mih\TMP\someds>datalad create-sibling-ria -s new --new-store-ok ria+file:/Users/mih/TMP/somestore
[INFO] create siblings 'new' and 'new-storage' ...
[INFO] Fetching updates for Dataset("C:\Users\mih\TMP\someds")
update(ok): . (dataset)
update(ok): . (dataset)
[INFO] Configure additional publication dependency on "new-storage"
configure-sibling(ok): . (sibling)
create-sibling-ria(ok): C:\Users\mih\TMP\someds (dataset)
action summary:
  configure-sibling (ok: 1)
  create-sibling-ria (ok: 1)
  update (ok: 1)
0.00 [00:04, ?/s]
C:\Users\mih\TMP\someds>git remote -v
new     \Users\mih\TMP\somestore\7b9\5bb8c-750f-4e2e-8e91-a083af8b2b13 (fetch)
new     \Users\mih\TMP\somestore\7b9\5bb8c-750f-4e2e-8e91-a083af8b2b13 (push)
new-storage
store   \Users\mih\TMP\RIA\7b9\5bb8c-750f-4e2e-8e91-a083af8b2b13 (fetch)
store   \Users\mih\TMP\RIA\7b9\5bb8c-750f-4e2e-8e91-a083af8b2b13 (push)
store-storage
C:\Users\mih\TMP\someds>datalad create-sibling-ria -s new2 --new-store-ok ria+file:\Users\mih\TMP\someotherstore
[INFO] create siblings 'new2' and 'new2-storage' ...
create-sibling-ria(error): C:\Users\mih\TMP\someds (dataset) [initremote failed.
stdout: initremote new2-storage
failed

stderr: git-annex: Non-absolute object tree base path configuration: \Users\mih\TMP\someotherstore
initremote: 1 failed
]
0.00 [00:03, ?/s]

Cloning from a RIA store (into a superdataset) works with Posix-like paths and also with non posix-like paths., and requires the drive name.

C:\Users\mih\TMP>datalad clone ria+file:///C:/Users/mih/TMP/somestore#7b95bb8c-750f-4e2e-8e91-a083af8b2b13 riaclone
[INFO] Detected a filesystem without fifo support.
[INFO] Disabling ssh connection caching.
[INFO] Detected a crippled filesystem.
[INFO] Disabling core.symlinks.
[INFO] Entering an adjusted branch where files are unlocked as this filesystem does not support locked files.
[INFO] Switched to branch 'adjusted/master(unlocked)'
install(ok): C:\Users\mih\TMP\riaclone (dataset)

C:\Users\mih\TMP>datalad clone ria+file:///C:\Users\mih\TMP\somestore#7b95bb8c-750f-4e2e-8e91-a083af8b2b13 riaclone2
[INFO] Detected a filesystem without fifo support.
[INFO] Disabling ssh connection caching.
[INFO] Detected a crippled filesystem.
[INFO] Disabling core.symlinks.
[INFO] Entering an adjusted branch where files are unlocked as this filesystem does not support locked files.
[INFO] Switched to branch 'adjusted/master(unlocked)'
install(ok): C:\Users\mih\TMP\riaclone2 (dataset)

In almost akl cases, the recorded URLs are quite butchered:

[submodule "riasubds2"]
	path = riasubds2
	url = file:///C%3A%5CUsers%5Cmih%5CTMP%5Csomestore/7b9/5bb8c-750f-4e2e-8e91-a083af8b2b13
	datalad-id = 7b95bb8c-750f-4e2e-8e91-a083af8b2b13
	datalad-url = "ria+file:///C:\\Users\\mih\\TMP\\somestore#7b95bb8c-750f-4e2e-8e91-a083af8b2b13"
[submodule "riaclone3"]
	path = riaclone3
	url = file:///C%3A/Users/mih/TMP/somestore/7b9/5bb8c-750f-4e2e-8e91-a083af8b2b13
	datalad-id = 7b95bb8c-750f-4e2e-8e91-a083af8b2b13
	datalad-url = "ria+file:///C:/Users/mih/TMP/somestore#7b95bb8c-750f-4e2e-8e91-a083af8b2b13"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR: Using file: URLs on windows is a story on its own. That's unrelated to what's going on here. If you want to try create-sibling-ria with such a URL, it should be generated by datalad.support.network.get_local_file_url(.., compatibility='git') and then prefixed with ria+. That's not nice (and may still be buggy), but I don't see exactly how create-sibling-ria is supposed to fix a given URL that is incorrect.

Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so I reconsidered this entirely. Looking into the issue, the purpose of datalad-url and the commit introducing it (6c8669e), I figure:

It's supposed to store the original URL given in order for later clone calls (via get) to be able to trigger the same treatment (considering ria, url rewrites, etc.). The value is not used by git-clone (that's what url is for), but by datalad get.

This means, I think, that URLs should remain unmodified (If there's a problem with a file: URL especially on windows, that's a different story needing a different solution).

For paths, things are arguably different. Obviously, what's currently stored is invalid and would need fixing indeed. But: Is making it POSIX-y the correct thing to do? It's then not even recognizable as a windows path anymore. Meaning: If there was a clone routine (possibly via patch) that would actually need to be triggered again, should it then be deprived of the information that this was a windows path (which may be the very info relevant)? So, technically correct would likely be to store a valid windows path rather than a POSIX path.

On the other hand, there's nothing in clone I am aware of that would actually trigger any sort of rewrite with a local path as source. In addition, somewhat "secretly" committing local paths is a questionable thing in the first place.

In conclusion, I am currently thinking of changing this patch to simply not store anything in case source is a path rather than a URL.

WDYT, @datalad/developers?

@adswa
Copy link
Member Author

adswa commented Nov 28, 2022

Thanks for looking into it more.

Is making it POSIX-y the correct thing to do?

I don't feel like I know enough about the code that may use this URL, but POSIX-y is at least git clone's own convention, regardless of operating system: https://www.git-scm.com/docs/git-clone#_git_urls

So, technically correct would likely be to store a valid windows path rather than a POSIX path.

Then maybe pathlibs PurePath? It would do-the-correct-thing(TM) on the given operating system it is on, if I read it correctly.

I am currently thinking of changing this patch to simply not store anything in case source is a path rather than a URL.

I'm not sure I reach the same conclusion. I remember trouble with old datasets without a dataset_id in .datalad/config when some parts of the code checked for the dataset ID. What if something at some point relies on the presence of datalad-urls in some decision making?

@bpoldrack
Copy link
Member

but POSIX-y is at least git clone's own convention,

Yes, but this is the irrelevant part here. datalad-url is not supposed to be used with git but with datalad clone only, during a datalad get later on. At the point where we store it here, source was already used, so it can be used the same way again.

Then maybe pathlibs PurePath?

Path would as well, I think. The value in source isn't the issue - this is correct I think. What's the issue is that it's stored wrong. The double-backslash appears to interprete the python string escape as literal when writing to file. That's what makes the stored-to-file value invalid while source worked just fine. What's supposed to be stored is what print("some\\backslash") would do, what's actually stored is some\\backslash. Not sure where this really happens, but I suppose it's down the road in Runner/subprocess.Popen (where source is passed to).

What if something at some point relies on the presence of datalad-urls in some decision making?

It's not currently the case (only actual use in get and it's guarded against non-existence) and I wouldn't know why anything ever should rely on it. Anything that does would rely on the assumption that something was cloned with datalad, breaking the notion that we should be able to deal with any git repo. So, if code was introduced that relies on this being present, I'd consider this code buggy, not the absence of this entry.

However, I am not entirely sure not storing it is the best solution, but I also don't see how storing something that deviates from what was given to clone can fulfill the purpose of the datalad-url entry at all. Again: All this is about is being able to trigger any special treatment during datalad clone that was based on the source as given to the command. This cannot be accomplished by providing a later datalad get with something different.

So, to me it's either

  • fix the stored to value to be the actual windows path that was used, or
  • not commit paths at all, since their only utility cross clones is as relative URL anyway and that is covered by the actual url entry in .gitmodules, not the datalad-url.

@bpoldrack
Copy link
Member

Verdict from call: Do go with storing the POSIX path.

I'll push a minor change and go with it.

@bpoldrack
Copy link
Member

Slight change and squashed. Are you OK with this, @adswa ?

@bpoldrack bpoldrack merged commit 405ece5 into datalad:maint Nov 30, 2022
@mih
Copy link
Member

mih commented Dec 1, 2022

Late to the party, but nevertheless wanted to mention Path.as_uri(), as a robust and revertable way to turn a native path into a valid URL that can be converted back to a native path (via urllib). See the FileUrlOperations class in datalad-next for an example. This approach avoids custom conventions.

@bpoldrack
Copy link
Member

Doesn't work with relative paths, though, @mih. And storing absolute ones seems wrong to me.

@yarikoptic-gitmate
Copy link
Collaborator

PR released in 0.17.10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-patch Increment the patch version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Butchered datalad_urls on Windows when cloning subdatasets from local paths
5 participants