Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create-sibling-github created repos make git-annex branch the default #4997

Closed
yarikoptic opened this issue Oct 8, 2020 · 19 comments
Closed
Labels

Comments

@yarikoptic
Copy link
Member

I think it was not happening before and might be due to changes in github and getting away from master being a default...

Originally reported by @jwodder in the scope of dandi/dandi-cli#250 (comment)

I wanted to do a demonstration and also ran into it

exit:2 /tmp/tipp-refernece > datalad create-sibling-github tipp-reference-datalad
.: github(-) [https://github.com/yarikoptic/tipp-reference-datalad.git (git)]
'https://github.com/yarikoptic/tipp-reference-datalad.git' configured as sibling 'github' for Dataset(/tmp/tipp-refernece)

/tmp/tipp-refernece > datalad publish --to github
[INFO   ] Publishing Dataset(/tmp/tipp-refernece) to github 
publish(ok): /tmp/tipp-refernece (dataset) [pushed to github: ['[new branch]', '[new branch]']] 

Since we just say "new branch" I do not know yet either may be it is due to the order we are pushing the branches -- and if git-annex is first -- it would make it the default.

DataLad 0.13.4.dev10 WTF (configuration, datalad, dataset, dependencies, environment, extensions, git-annex, location, metadata_extractors, python, system)

WTF

configuration <SENSITIVE, report disabled by configuration>

datalad

  • full_version: 0.13.4.dev10-g7cfc3
  • version: 0.13.4.dev10

dataset

  • id: a3acbb87-b9a7-4c19-bfee-d81637c18f17
  • metadata: <SENSITIVE, report disabled by configuration>
  • path: /tmp/tipp-refernece
  • repo: AnnexRepo

dependencies

  • annexremote: 1.4.3
  • appdirs: 1.4.3
  • boto: 2.49.0
  • cmd:7z: 16.02
  • cmd:annex: 8.20200810+git143-gdee38c54d-1~ndall+1
  • cmd:bundled-git: 2.24.0
  • cmd:git: 2.24.0
  • cmd:system-git: 2.27.0
  • cmd:system-ssh: 8.1p1
  • exifread: 2.1.2
  • git: 3.1.0
  • gitdb: 4.0.2
  • humanize: 2.3.0
  • iso8601: 0.1.12
  • keyring: 18.0.1
  • keyrings.alt: 3.4.0
  • msgpack: 0.6.2
  • mutagen: 1.40.0
  • requests: 2.23.0
  • scrapy: 1.7.3
  • tqdm: 4.43.0
  • wrapt: 1.11.2

environment

  • GIT_PAGER: less --no-init --quit-if-one-screen
  • LANG: en_US.UTF-8
  • PATH: /home/yoh/proj/datalad/datalad-crawler/venvs/dev3/bin:/home/yoh/gocode/bin:/home/yoh/gocode/bin:/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin

extensions

  • crawler:
    • description: Crawl web resources
    • entrypoints:
      • datalad_crawler.crawl.Crawl:
        • class: Crawl
        • load_error: None
        • module: datalad_crawler.crawl
        • names:
          • crawl
      • datalad_crawler.crawl_init.CrawlInit:
        • class: CrawlInit
        • load_error: None
        • module: datalad_crawler.crawl_init
        • names:
          • crawl-init
          • crawl_init
    • load_error: None
    • module: datalad_crawler
    • version: 0.6.0

git-annex

  • build flags:
    • Assistant
    • Webapp
    • Pairing
    • S3
    • WebDAV
    • Inotify
    • DBus
    • DesktopNotify
    • TorrentParser
    • MagicMime
    • Feeds
    • Testsuite
  • dependency versions:
    • aws-0.20
    • bloomfilter-2.0.1.0
    • cryptonite-0.25
    • DAV-1.3.3
    • feed-1.0.0.0
    • ghc-8.4.4
    • http-client-0.5.13.1
    • persistent-sqlite-2.8.2
    • torrent-10000.1.1
    • uuid-1.3.13
    • yesod-1.6.0
  • key/value backends:
    • SHA256E
    • SHA256
    • SHA512E
    • SHA512
    • SHA224E
    • SHA224
    • SHA384E
    • SHA384
    • SHA3_256E
    • SHA3_256
    • SHA3_512E
    • SHA3_512
    • SHA3_224E
    • SHA3_224
    • SHA3_384E
    • SHA3_384
    • SKEIN256E
    • SKEIN256
    • SKEIN512E
    • SKEIN512
    • BLAKE2B256E
    • BLAKE2B256
    • BLAKE2B512E
    • BLAKE2B512
    • BLAKE2B160E
    • BLAKE2B160
    • BLAKE2B224E
    • BLAKE2B224
    • BLAKE2B384E
    • BLAKE2B384
    • BLAKE2BP512E
    • BLAKE2BP512
    • BLAKE2S256E
    • BLAKE2S256
    • BLAKE2S160E
    • BLAKE2S160
    • BLAKE2S224E
    • BLAKE2S224
    • BLAKE2SP256E
    • BLAKE2SP256
    • BLAKE2SP224E
    • BLAKE2SP224
    • SHA1E
    • SHA1
    • MD5E
    • MD5
    • WORM
    • URL
    • X*
  • local repository version: 8
  • operating system: linux x86_64
  • remote types:
    • git
    • gcrypt
    • p2p
    • S3
    • bup
    • directory
    • rsync
    • web
    • bittorrent
    • webdav
    • adb
    • tahoe
    • glacier
    • ddar
    • git-lfs
    • httpalso
    • hook
    • external
  • supported repository versions:
    • 8
  • upgrade supported from repository versions:
    • 0
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
  • version: 8.20200810+git143-gdee38c54d-1~ndall+1

location

  • path: /tmp/tipp-refernece
  • type: dataset

metadata_extractors

  • annex:
    • load_error: None
    • module: datalad.metadata.extractors.annex
    • version: None
  • audio:
    • load_error: None
    • module: datalad.metadata.extractors.audio
    • version: None
  • datacite:
    • load_error: None
    • module: datalad.metadata.extractors.datacite
    • version: None
  • datalad_core:
    • load_error: None
    • module: datalad.metadata.extractors.datalad_core
    • version: None
  • datalad_rfc822:
    • load_error: None
    • module: datalad.metadata.extractors.datalad_rfc822
    • version: None
  • exif:
    • load_error: None
    • module: datalad.metadata.extractors.exif
    • version: None
  • frictionless_datapackage:
    • load_error: None
    • module: datalad.metadata.extractors.frictionless_datapackage
    • version: None
  • image:
    • load_error: None
    • module: datalad.metadata.extractors.image
    • version: None
  • xmp:
    • load_error: None
    • module: datalad.metadata.extractors.xmp
    • version: None

python

  • implementation: CPython
  • version: 3.8.5

system

@yarikoptic
Copy link
Member Author

oh -- in above there was a partial screw up of mine -- I was in "incoming" branch, and used publish... so I redid it all with push which did show pushing master first according to the output:

/tmp/tipp-refernece > git remote remove github

/tmp/tipp-refernece > datalad create-sibling-github tipp-reference-datalad
.: github(-) [https://github.com/yarikoptic/tipp-reference-datalad.git (git)]
'https://github.com/yarikoptic/tipp-reference-datalad.git' configured as sibling 'github' for Dataset(/tmp/tipp-refernece)

/tmp/tipp-refernece > datalad push --to github                     
publish(ok): /tmp/tipp-refernece (dataset) [refs/heads/master->github:refs/heads/master [new branch]]            
publish(ok): /tmp/tipp-refernece (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]

but it is git-annex which became the default. so I guess it is some kind of "intended" github operation now and we would need an explicit api call to make current pushed branch (master or incoming originally) to be default (if "new branch")

@mih
Copy link
Member

mih commented Oct 9, 2020

The branch that is first pushed being the default on github has been the case for as long as I can remember. I don't think we need an explicit API call to rectify this in DataLad. It is easily fixed by hand https://github.com/datalad/datalad/settings/branches and does not happen with the recommended create-sibling-github + push combo.

@yarikoptic
Copy link
Member Author

I did use the combo in the example right above your comment (previous repo was removed from github manually first). And according to the push result rendering it was the master branch which was pushed first.

It is not as "easily" fixed, when you are to create multiple datasets.

@mih
Copy link
Member

mih commented Oct 10, 2020

Push order was an explicit devgoal for push. I tried again with a fresh repo and master was default.

@yarikoptic
Copy link
Member Author

So order isn't enough and behavior seems to vary between people.

@yarikoptic
Copy link
Member Author

May be there is some race condition at GitHub side and adding a slight delay after pushing master (if we see it is new branch) would mitigate it.... Didn't try yet but that explanation is the only one which makes sense to me. Alternative to delay might be to follow push of the branch with its forced fetch if new.

@yarikoptic
Copy link
Member Author

FWIW, using #5008

reproducer (assumes my github login, adjust for yours)
#!/bin/bash

export PS4='> '
set -x
set -eu
cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"

datalad create testgh; 
cd testgh; 
datalad create-sibling-github -s github --existing replace testgh
datalad push --to github; 

datalad clone git://github.com/yarikoptic/testgh ../dest
git -C ../dest branch -a
output
$> bash gh-default-branch
> set -eu
>> mktemp -d /home/yoh/.tmp/dl-XXXXXXX
> cd /home/yoh/.tmp/dl-z8xaY6l
> datalad create testgh
[INFO   ] Creating a new annex repo at /home/yoh/.tmp/dl-z8xaY6l/testgh 
[INFO   ] Scanning for unlocked files (this may take some time) 
create(ok): /home/yoh/.tmp/dl-z8xaY6l/testgh (dataset)
> cd testgh
> datalad create-sibling-github -s github --existing replace testgh
.: github(-) [https://github.com/yarikoptic/testgh.git (git)]
'https://github.com/yarikoptic/testgh.git' configured as sibling 'github' for Dataset(/home/yoh/.tmp/dl-z8xaY6l/testgh)
> datalad push --to github
publish(ok): /home/yoh/.tmp/dl-z8xaY6l/testgh (dataset) [refs/heads/master->github:refs/heads/master [new branch]]                                                                                                                
publish(ok): /home/yoh/.tmp/dl-z8xaY6l/testgh (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]
> datalad clone git://github.com/yarikoptic/testgh ../dest                                                       
[INFO   ] Scanning for unlocked files (this may take some time)                                                  
[INFO   ] Remote origin uses a protocol not supported by git-annex; setting annex-ignore                         
install(ok): /home/yoh/.tmp/dl-z8xaY6l/dest (dataset)
> git -C ../dest branch -a
* git-annex
  remotes/origin/HEAD -> origin/git-annex
  remotes/origin/git-annex
  remotes/origin/master
bash gh-default-branch  8.21s user 3.42s system 69% cpu 16.748 total

and the thing is that we are pushing ALL refspecs at once in a single git push call. Breaking that apart and pushing one at a time

(dirty patch)
diff --git a/datalad/core/distributed/push.py b/datalad/core/distributed/push.py
index 121bc5bcd..dc2711c86 100644
--- a/datalad/core/distributed/push.py
+++ b/datalad/core/distributed/push.py
@@ -670,11 +670,13 @@ def _append_branch_to_refspec_if_needed(ds, refspecs, branch):
 
 
 def _push_refspecs(repo, target, refspecs, force_git_push, res_kwargs):
-    push_res = repo.push(
+    push_res = sum([
+        repo.push(
         remote=target,
-        refspec=refspecs,
+        refspec=refspec,
         git_options=['--force'] if force_git_push else None,
-    )
+    ) for refspec in refspecs
+    ], [])
     # TODO maybe compress into a single message whenever everything is
     # OK?
     for pr in push_res:
seems to resolve the issue
$> bash gh-default-branch
> set -eu
>> mktemp -d /home/yoh/.tmp/dl-XXXXXXX
> cd /home/yoh/.tmp/dl-uuIWhQz
> datalad create testgh
[INFO   ] Creating a new annex repo at /home/yoh/.tmp/dl-uuIWhQz/testgh 
[INFO   ] Scanning for unlocked files (this may take some time) 
create(ok): /home/yoh/.tmp/dl-uuIWhQz/testgh (dataset)
> cd testgh
> datalad create-sibling-github -s github --existing replace testgh
repository "testgh" already exists on GitHub.
Do you really want to remove it? (choices: yes, [no]): yes

[WARNING] Authentication failed using a token. 
Do you really want to remove it? (choices: yes, [no]): yes

.: github(-) [https://github.com/yarikoptic/testgh.git (git)]
'https://github.com/yarikoptic/testgh.git' configured as sibling 'github' for Dataset(/home/yoh/.tmp/dl-uuIWhQz/testgh)
> datalad push --to github
publish(ok): /home/yoh/.tmp/dl-uuIWhQz/testgh (dataset) [refs/heads/master->github:refs/heads/master [new branch]]                                                                                                                
publish(ok): /home/yoh/.tmp/dl-uuIWhQz/testgh (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]
> datalad clone git://github.com/yarikoptic/testgh ../dest                                                       
[INFO   ] Scanning for unlocked files (this may take some time)                                                  
[INFO   ] Remote origin uses a protocol not supported by git-annex; setting annex-ignore                         
install(ok): /home/yoh/.tmp/dl-uuIWhQz/dest (dataset)
> git -C ../dest branch -a
  git-annex
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/git-annex
  remotes/origin/master

so the solution would be, if remote has no branches yet, first push the first refspec and then all the rest.

yarikoptic added a commit to yarikoptic/datalad that referenced this issue Oct 10, 2020
…te branches yet

This is a central place for logic of pushing refspecs used by push and publish,
so I decided to fix in a single spot instead of duplicating logic or providing
yet another additional wrapper, and thus possibly of help to any other code which
uses .push().

Note:
- logic is in effect only if remote and refspecs are provided. So it will be
  of no effect if either of those not provided, but most likely there would be
  no relevant use case since it would require manual configuration first of the
  refspecs and default remote for a branch within config before submitting a push

Q: may be logic should move even deeper into  push_ generator?

Closes: datalad#4997
@mih
Copy link
Member

mih commented Oct 10, 2020

The reproducer fails to reproduce for me.

(datalad3-dev) mih@meiner /tmp % datalad create testgh;
[INFO   ] Creating a new annex repo at /tmp/testgh 
[INFO   ] Scanning for unlocked files (this may take some time) 
create(ok): /tmp/testgh (dataset)
(datalad3-dev) mih@meiner /tmp % cd testgh/
(datalad3-dev) 2 mih@meiner /tmp/testgh (git)-[master] % datalad create-sibling-github -s github testgh
.: github(-) [https://github.com/mih/testgh.git (git)]
'https://github.com/mih/testgh.git' configured as sibling 'github' for Dataset(/tmp/testgh)
(datalad3-dev) mih@meiner /tmp/testgh (git)-[master] % datalad push --to github;
Update availability for 'github':  75%|█████████████████████████████████▊           | 3.00/4.00 [00:00<00:00, 27.4k Steps/s]Username for 'https://github.com': mih
Password for 'https://mih@github.com': 
publish(ok): /tmp/testgh (dataset) [refs/heads/master->github:refs/heads/master [new branch]]                               
publish(ok): /tmp/testgh (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]                         
(datalad3-dev) mih@meiner /tmp/testgh (git)-[master] % datalad clone git://github.com/yarikoptic/testgh ../dest
[INFO   ] Scanning for unlocked files (this may take some time)                                                             
install(ok): /tmp/dest (dataset)                                                                                            
(datalad3-dev) mih@meiner /tmp/testgh (git)-[master] % git -C ../dest branch -a
(datalad3-dev) mih@meiner /tmp/testgh (git)-[master] % git -C ../dest branch -a |cat
  git-annex
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/git-annex
  remotes/origin/master

@yarikoptic
Copy link
Member Author

Lucky you! May be has something to do with going via password (in my case was one of the tokens)

@mih
Copy link
Member

mih commented Oct 11, 2020

Maybe this is of relevance here (i.e. a change that has happened at GitHub recently that might break behavior):

https://github.blog/changelog/2020-08-26-set-the-default-branch-for-newly-created-repositories/

@yarikoptic
Copy link
Member Author

May be. So did you set it to master? (For me it is main)

@cmadjar
Copy link

cmadjar commented Oct 19, 2020

FYI - I have been noticing that behaviour lately with our CONP dataset automatically crawled. I ran some test locally as I suspected this could be due to Github renaming the default branch for new repos to main instead of master.

In settings/repositories of your organization (or user), the repository default branch has been changed by Github to main instead of master.

If the default branch is set to main, datalad create-sibling-github, followed by datalad push --to github will indeed create a new repository with default branch set to git-annex as can be seen in the following test: https://github.com/cmadjar/test_1_main.

If the default branch is set to master, datalad create-sibling-github, followed by datalad push --to github will create a new repository with the default branch set to master as can be seen in the following test: https://github.com/cmadjar/test_2_master

That made me suspect that Github just picks the first branch in alphabetical order amongst the branches available as the default branch, which would correspond to git-annex.

To confirm that, I created a third dataset with branch bobo and git-annex (removed master), when I run datalad create-sibling-github, followed by datalad push --to github, the default branch becomes bobo as can be seen in the following test: https://github.com/cmadjar/test_2_bobo

No idea if this is helpful to you but just in case, I figured I would share my little investigation of the day :-).

@yarikoptic
Copy link
Member Author

Thank you @cmadjar ! Not sure if may be some order of events wasn't desired, but I did try to change default branch to be master for @dandibot, and then created/pushed sibling and it still ended up with git-annex to be the default one :-( https://github.com/dandisets/000003

@yarikoptic
Copy link
Member Author

ok -- found 1 more, there is also at organization level. Hopefully next one would work better

@cmadjar
Copy link

cmadjar commented Oct 20, 2020

@yarikoptic I think it has to be changed at the organization level setting for it to work. At least, this is the one I played with with my tests yesterday. Curious to see if your next test will work the same way it did in my tests.

Questions:

  • is there a plan to change the default branch created by datalad create to main instead of master in your roadmap?
  • for the already existing datasets if we change our default branch to be named main and delete master, how would DataLad react?

At the moment, we are sticking with master for CONP datasets since we are not sure what would be the impact on the datasets and DataLad if we were to create a main branch, change the default branch to main and then delete master...

@kyleam
Copy link
Contributor

kyleam commented Oct 20, 2020

is there a plan to change the default branch created by datalad create to main instead of master in your roadmap?

As far as I'm aware, it hasn't been discussed one way or the other. But, assuming git 2.28 or later, you can use --initial-branch with datalad create:

datalad create PATH --initial-branch main

You could also set init.defaultBranch in your global git configuration.

for the already existing datasets if we change our default branch to be named main and delete master, how would DataLad react?

I'm not sure, as there are probably lots of angles to consider. DataLad core should be in a good place with respect to not hard coding the default branch name (and we test with a custom init.defaultBranch, gh-4632), but there are additional concerns when renaming the default branch in an existing repo. At the very least, existing consumers would need to manually adjust. And I don't believe the -crawler extension has been updated to avoid hard coding master as the default branch.

@yarikoptic
Copy link
Member Author

FWIW, hopefully we still finalize #5010 one way or another (most likely making it github specific) so the issue would go away

@thomas-bayes
Copy link

In datalad 0.15.5 with git 2.35.0, there is no such option --initial-branch. If anything changed there, it is not documented, the online manual still mentions --initial-branch in the FAQ. What is the current approach to solving this?

@adswa
Copy link
Member

adswa commented Mar 9, 2022

I think this is just confusingly worded in the FAQ, I will fix this.
The --initial-branch parameter does not belong to create, it should be a Git parameter that we pass through to git init. Anything after the path will be treated as intended for Git.
From datalad create --help:

positional arguments:
  PATH                  path where the dataset shall be created, directories [...]
  INIT OPTIONS          options to pass to git init. Any argument specified
                        after the destination path of the repository will be
                        passed to git-init as-is. Note that not all options
                        will lead to viable results. For example '--bare' will
                        not yield a repository where DataLad can adjust files
                        in its working tree.

I checked Git 2.35.0's docs and it still lists the parameter. I'm running 0.35.1 locally and this works:

(handbook2) adina@muninn in /tmp
❱ datalad create abc --initial-branch lookatme                              1 !
[INFO   ] Creating a new annex repo at /tmp/abc 
create(ok): /tmp/abc (dataset)
(handbook2) adina@muninn in /tmp
❱ cd abc
(handbook2) adina@muninn in /tmp/abc on git:lookatme
❱ git branch
  git-annex
* lookatme

Does that work for you as well? Thanks for bringing the suboptimal wording to our attention, I can totally see how this is not clear at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants