Datalad get can't find URL despite registering via addurls (and I can see the URL with git annex whereis) #7582
Comments
Thanks for the detailed report. I think you are doing everything correctly, so this is puzzling... I can confirm that when I use the URL from your

The crucial error seems to be "Unable to access these remotes: web", and I must say I never saw it happen before. Searching gives me this thread, which basically amounts to a) no internet connection (at least to that server), or b) file changed on the server, or the server reports size incorrectly (though I suppose git-annex's error message could have improved since).

Here are the things I would try to debug (although I am making guesses in a general direction here) - if you don't mind, please also share your outputs for those commands:
From use of |
Here's the output of your debugging suggestions, thanks for your help @mslw !
And just to check the precise size, I got it in bytes as well:
The size exactly matches what came out of curl:
Could I have possibly turned off the web remote, or screwed it up in some way? I did mess around with remotes a bit at first when I thought that I wanted a special remote, but I don't think I would have tried turning off web.
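For anyone repeating the size check described above, here is a minimal sketch of one way to do it. It assumes the file's git-annex key records the size in an `-s<bytes>` field (as in `SHA256E-s<bytes>--<hash>.<ext>`). The function names `annex_key_size` and `remote_content_length` are made up for illustration, and the key used in the usage comment is a placeholder, not a key from this repository.

```shell
# Hypothetical helper: print the size field of a git-annex key,
# e.g. "SHA256E-s1048576--<hash>.glb" prints 1048576.
# Prints nothing if the key carries no -s<bytes> field.
annex_key_size() {
    printf '%s\n' "$1" | sed -n 's/^[^-]*-s\([0-9][0-9]*\)-.*$/\1/p'
}

# Hypothetical helper: print the Content-Length the server reports for a URL.
# Follows redirects and keeps the value from the last response seen.
remote_content_length() {
    curl -sIL "$1" | tr -d '\r' |
        awk 'tolower($1) == "content-length:" { n = $2 } END { print n }'
}

# Illustrative usage inside the dataset (needs git-annex and network access):
#   key=$(git annex lookupkey 'models/Alpaca 3rd Carpal L.glb')
#   annex_key_size "$key"
#   remote_content_length "$url"    # $url: the URL shown by git annex whereis
```

If the two numbers differ, that would point to the "file changed on the server or size reported incorrectly" explanation; if they match, as reported here, the cause is more likely elsewhere.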
I added the URLs using a bash script, since I wanted to set up a process that will be super easy for future contributors to this project. It first checks if each model already has a URL (by seeing if the output of
Then when I finally run the addurls piece, I use this snippet (
If more of the bash script would help, I can certainly put more in here. |
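As an illustration of the URL-registration step, here is a minimal sketch of how a script might turn a file name containing spaces into the percent-encoded URL seen elsewhere in this thread. It is not the script used in this report: the names `BASE_URL`, `model_url`, and `print_addurl` are hypothetical, it only encodes spaces, and it prints a `git annex addurl --file` command (the form used in the reproducer later in the thread) rather than calling `datalad addurls`.

```shell
# Hypothetical base URL; the value is the server path used in this thread.
BASE_URL='https://3dviewer.sites.carleton.edu/carcas/carcas-models'

# Percent-encode the spaces in a dataset-relative path and prepend BASE_URL.
# Only spaces are handled; other reserved characters would need a fuller encoder.
model_url() {
    printf '%s/%s\n' "$BASE_URL" "$(printf '%s' "$1" | sed 's/ /%20/g')"
}

# Print (rather than run) the git-annex command that would register the URL,
# so the result can be reviewed before anything is written to the annex.
print_addurl() {
    printf "git annex addurl --file '%s' '%s'\n" "$1" "$(model_url "$1")"
}

# Illustrative usage:
#   print_addurl 'models/Alpaca 3rd Carpal L.glb'
```

Printing the command first is a deliberate choice: it makes it easy to eyeball every generated URL before registering it.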
@yarikoptic I'm running Linux, I believe one of the Ubuntu LTS's on the server, and Fedora 39 on my personal laptop. I'd like this to work on Windows, but so far there has not been a Windows machine involved. Isn't the Windows separator In case it's still relevant, it looks like running |
Thanks for the additional details. Again, everything for

I also have no concerns about the addurls call or format (side note: if you find working with json output easier, you can also use

Which brings us to

Regarding messing around with remotes, you can check

Here's my minimal attempt at a reproducer (reduced to just git-annex calls, and just the one file):
Maybe you could try that (incl. creating a fresh repo) on your laptop - if it fails, it would point to a more general problem, if it works it would point to a repo-specific problem, but I am out of ideas 😕 |
FTR, I noticed that this is crossposted from https://git-annex.branchable.com/forum/How_to_allow_clones_to_get_files_via_URL__63__/ where it got no answers - that's perfectly fine, but good to keep track of :) |
@watson-e-and you are totally right about FWIW I have tried to replicate with an independent minimal reproducer -- but failed:
❯ mkdir d; cd d; git init; git annex init "CARCAS models on the 3dviewers server"; mkdir models; git annex addurl --file 'models/Alpaca 3rd Carpal L.glb' 'https://3dviewer.sites.carleton.edu/carcas/carcas-models/models/Alpaca%203rd%20Carpal%20L.glb' ; git annex drop *; git annex get *
Initialized empty Git repository in /tmp/d/.git/
init CARCAS models on the 3dviewers server ok
(recording state in git...)
addurl https://3dviewer.sites.carleton.edu/carcas/carcas-models/models/Alpaca%203rd%20Carpal%20L.glb
(to models/Alpaca 3rd Carpal L.glb) ok
(recording state in git...)
drop models/Alpaca 3rd Carpal L.glb ok
(recording state in git...)
get models/Alpaca 3rd Carpal L.glb (from web...)
ok
(recording state in git...)
❯ echo $?
0
If you could share that resultant git/git-annex repo (could be private) or at least that

Also, given that you have git-annex 10.20230626-g8594d49 -- did you try a more recent release? FWIW -- I had a nearby
This is exciting: I tried making a minimal example involving both the server and my own laptop, and everything worked exactly as expected. I set up the dataset, added a model, added a url to the model using

It seems, then, that I've eliminated software as the source of the error. Since the history of my dataset isn't that important to the project, my instinct is to start from scratch and see if I still have issues with the URLs.
About cross posts, yep, this is a cross post both from git annex's forum and from Datalad's help forum at Neurostars. I was reluctant to post an issue here because I suspected the problem was more a user error than a software bug. I plan on putting an update to each of those once my problem is resolved, or at least linking to this issue. |
The beauty of git-annex is that "secret sauce" for its functioning is really just a bunch of text files within |
It's a much less publicized resource, but our lab has a note on that in the knowledge that we started a while ago: Create a DataLad dataset from a published collection of files. There is a small difference in starting points -- it looks like you add URLs to files already present in the dataset, while the note uses |
@yarikoptic I'm not totally sure what you mean about comparing the differences. How can both be remotes? Should I make a third repository, and if so, do I install the working and non-working copies as subdatasets? That doesn't seem right... I'm also happy to go digging around in the files if there's somewhere that explicitly has the settings for different remotes in git annex in a text file. I haven't found one yet, but it seems like there should be one. |
Just like that:
mkdir repo3; cd repo3; git init
git remote add --fetch remote1 location1
git remote add --fetch remote2 location2
git diff remote1/git-annex remote2/git-annex
;-) So yes, a third repository, but with the other two added as remotes rather than as subdatasets.
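The same comparison can be wrapped in a small function that uses a throwaway scratch repository, so neither clone is modified. This is only a sketch: the function name `compare_annex_branches` and the remote names `working` and `broken` are made up, and it assumes both repositories are reachable as local paths and each has a local `git-annex` branch.

```shell
# Hypothetical helper: diff the git-annex branches of two local clones.
# Usage: compare_annex_branches <working-repo> <broken-repo> [path...]
compare_annex_branches() {
    # Resolve to absolute paths so the remotes still resolve from the scratch repo.
    working=$(cd "$1" && pwd) || return 1
    broken=$(cd "$2" && pwd) || return 1
    shift 2

    # Throwaway repository that only exists to hold the two remotes.
    scratch=$(mktemp -d) || return 1
    git init -q "$scratch" &&
        git -C "$scratch" remote add working "$working" &&
        git -C "$scratch" remote add broken "$broken" &&
        git -C "$scratch" fetch -q working &&
        git -C "$scratch" fetch -q broken &&
        git -C "$scratch" --no-pager diff working/git-annex broken/git-annex -- "$@"
    status=$?
    rm -rf "$scratch"
    return "$status"
}

# Illustrative usage, limiting the diff to the remote bookkeeping files:
#   compare_annex_branches ~/carcas-working ~/carcas-broken remote.log uuid.log trust.log
```

Passing paths after the two repositories narrows the output, which may help when most of the diff is per-file location logs.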
@mslw Thanks for that link, I wish I had stumbled across it earlier! What I want to do does seem quite simple and straightforward once you know the tools that are needed. If I have time and/or the support of the rest of the team for this project, I might explore if this is something that could make a good Use Case in the Datalad Handbook. People interested in digital humanities or other collaborative projects with 3D models might want to replicate this workflow of datalad + github + web server for local development of better features with the model viewer, and I believe that the project I'm working on is in part intended as an exemplar. It would certainly be on the simpler side, but I think it could be worth it. |
@yarikoptic I'm having trouble since I don't have enough disk space on my server to make a full copy there, and I'm trying to see if there's a non-annoying way to get all 59 of the models onto my laptop to build a working copy locally. This is based on the understanding that I need to use a working copy that has the same files in it to get good results out of As I'm typing this, I'm realizing that it can't hurt to try. |
Ok, it might have to wait a couple days for me to get a full example working locally so that I can get less cluttered results from I got a bunch of outputs that look like this, which I assume is from all my 58 models being present as broken symlinks in the not working repo, and not present at all in the working repo. I'm pretty confident, since I can see
There's a handful of more interesting ones, and of course I might have missed some useful ones. This one seems to be about a
There's also some that seem to be referencing the
In these last two, you can clearly see the remnants of my first bad attempt at adding the URLs by creating a remote called 'serverweb'. I tried to delete it when I realized that I wanted the command
I also discovered that git annex will give you a little more information on remotes if you run
Not working repo:
The big differences I'm seeing are
Do any of these look like the source of the issue? |
In the working one you seem to have just 1 key total, which is odd. Isn't there a key for that "models/Alpaca 3rd Carpal L.glb"? What is it? Check for a diff on that key. Is it private data/URLs? If not, let me repeat my request:
@yarikoptic Oops, I lost track of your request. Yes, the repository is public. Here's the link: https://github.com/DigitalCarleton/carcas |
that one has no |
My apologies, this is the right repository, it's just that I didn't also link to the subdataset where the problems really are. I forgot that the link doesn't work on Github. If you recursively clone the repository at the link I gave you, you should be able to see the |
I now have a working version at [https://github.com/DigitalCarleton/carcas]. Creating everything from scratch, it worked perfectly. Things I did differently
If I have time, I'll look into what went wrong by trying to compare this working version with the non-working version. |
I will then close this issue for now
What is the problem?
I’ve been trying to set up a dataset that primarily lives on a web server, but needs to be clone-able by other people. The annex files are visible and downloadable from the server’s website. In particular, the files I’m concerned about here are in a subdataset.
I would like people to be able to clone the dataset from Github, and then (whether or not they have permission to push back to Github) run datalad get to download files from the web server. The web server does not show the hidden files like .git, and so cannot be used as a remote, I believe.

I used datalad addurls to add the URL of each file on the server to each file in the annex. When I run git annex whereis filename, it shows up that it lives on the server in the server’s local copy of the dataset, and that it lives on the web, with a correct URL. In fact, if I click on that URL and open it in a browser, it downloads my file.

The dataset lives on Github, but the annex does not. When I make a clone of the superdataset on my personal computer, I get messages like
Then when I'm in the dataset carcas-models that has the annex and I run datalad get models/Alpaca\ 3rd\ Carpal\ L.glb, I get this error message:

I suspect my problem is with how I set things up with git annex, because when I try git annex get models/Alpaca\ 3rd\ Carpal\ L.glb, I get the error:

I'm confused on how to debug this because when I run git annex whereis models/Alpaca\ 3rd\ Carpal\ L.glb, everything looks correct:
What's the correct way to set up this use case? I don't think that I want the server to be a special remote, because the hidden files like .gitattributes aren't visible. I want to be able to put more files on the server, add their URLs based on where they are on the server, and push to Github so that other people can get these files if they want.
What steps will reproduce the problem?
I'm not sure how to reproduce without access to another web server.
DataLad information
Datalad 0.19.6
Git annex 10.20230626-g8594d49
Additional context
No response
Have you had any success using DataLad before?
This is my first time using Datalad, but everything else about using it has gone quite successfully.