Publish to github with special remote for data access #335

Closed
mih opened this Issue Jan 14, 2016 · 9 comments

@mih
Member

mih commented Jan 14, 2016

Here is the protocol for setting up a fairly convenient state to aid public data consumption. This protocol assumes that you only push to a server that will host the data, and never access the local machine from the server (as that will often not be possible).

Start with a regular local annex repo.

git init
git annex init "thebeginning"
git annex add .
git commit

On the remote server, create a new Git repo. We use a non-bare repo. Using a bare repo is possible and makes some of the steps below unnecessary; however, having a work tree on the server allows for additional use cases (browsing, etc.)

# --shared all is likely what you want, but double-check before running
git init --shared=all
# with modern Git the next line is unnecessary; the config setting below replaces it
#git checkout -b ignore
git config receive.denyCurrentBranch updateInstead
mv .git/hooks/post-update.sample .git/hooks/post-update
git update-server-info
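What `receive.denyCurrentBranch=updateInstead` buys us can be sanity-checked entirely locally, before touching a real server. A minimal sketch with plain git only (all paths and names here are hypothetical):

```shell
# Local demo of updateInstead: a push into the checked-out branch
# updates the "server" work tree in place (paths are hypothetical).
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b master server
git -C server config user.email you@example.com
git -C server config user.name you
git -C server config receive.denyCurrentBranch updateInstead
git -C server commit -q --allow-empty -m init
git clone -q server client && cd client
git config user.email you@example.com && git config user.name you
echo data > file && git add file && git commit -qm 'add file'
git push -q origin master
test -f ../server/file    # the pushed file shows up in the server work tree
```

Without that config setting, git would refuse the push into the checked-out branch, which is why the old `checkout -b ignore` trick was needed.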

On the local machine, add the new remote twice: once for push access via SSH, and once for anonymous pull access via HTTP. Then push the content.

git remote add httpdata HTTPURL/.git
git remote add sshdata SSHURL
git push sshdata git-annex
git push sshdata master

Now go back to the remote, and init the annex.

# with the modern Git setup the following is not required
#git checkout master
git annex init "public data"

Now the remote repo can be enabled as a special remote on the client-side. Once done, sync the state with the remote.

git annex initremote datasrc type=git location=HTTPURL/.git autoenable=true
# at this point the http remote is no longer needed
git remote remove httpdata
git fetch sshdata
git annex merge
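To double-check that `initremote` actually recorded the auto-enable setting (so later clones will pick it up), the git-annex branch can be inspected. A sketch, run inside the local repo:

```shell
# the special remote config lives in remote.log on the git-annex branch;
# the datasrc entry should carry autoenable=true
git show git-annex:remote.log
git annex info datasrc
```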

Now the whole thing can be pushed to github. Create a repo on github, add it as a remote to the local repo, and then push:

git push github master
git push github git-annex

At this point, anybody can clone from github and get a local annex repo with an automatically enabled special remote.

git clone https://github.com/SOMETHING mine
git annex init

At this point datasrc has not received any files yet, hence none can be obtained via annex-get. The necessary workflow is as follows; in the original repo do:

git annex copy SOMEFILE --to sshdata
git push github git-annex

This is sufficient to make a file available and the availability known. In the anonymously cloned repo now do:

git pull
git annex get SOMEFILE

and it will be downloaded via HTTP from datasrc.
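To verify that the availability is tracked as intended, `whereis` can be consulted in the clone. A sketch (SOMEFILE as above):

```shell
# datasrc should be listed among the locations holding a copy
git annex whereis SOMEFILE
```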

The above is verified and works. The rest is anticipation and extrapolation:

Further workflow for maintaining sshdata:

git annex sync sshdata

Further workflow for publishing to github

git push github git-annex
git push github master

In a collaborative scenario where the github repo can be modified from elsewhere, git annex sync would be a better fit.
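A sketch of that collaborative flow (remote name as above):

```shell
# let git-annex fetch, merge, and push both master and the git-annex branch
git annex sync github
```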

Further workflow for any anonymous clone:

git pull

The latter cannot use annex sync, because sync would try to push to github without sufficient permissions.
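For completeness: a pull-only sketch that avoids the forbidden push (hedged, since the flag's availability depends on the git-annex version in use):

```shell
# update local state from github without ever pushing back
git annex sync --no-push
```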

@mih mih added the documentation label Jan 14, 2016

@mih mih added this to the Publication (tracking) milestone Jan 14, 2016

@yarikoptic


Member

yarikoptic commented Jan 14, 2016

yeap. Thanks for detailing it

When we get to handles though -- with all the rewrites for datalad's meta (thus no simple pushes of the master containing local urls, or pulls) it would be a bit more involved ;-) Without "global" urls it could indeed have been this straightforward

@mih


Member

mih commented Jan 14, 2016

So let's think again, where we need which URLs exactly.

@mih


Member

mih commented Apr 6, 2016

@bpoldrack @yarikoptic Amazingly, there is no need for an API key to use the github REST API. Here is how to create a repo via python-github:

import github as gh
g=gh.Github('myusername', 'mypasswd')
u=g.get_user()
u.create_repo('apitest', description='some fancy', homepage='example.com')

This should make create_publication_target_github quite a bit simpler than the sshwebserver one...

Cheers to the GitHub folks. Well done!

@mih


Member

mih commented Apr 6, 2016

Moreover, it supports the creation of OAuth tokens, so there is no need to store passwords.
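A hedged sketch using the python-github authorizations API of that era (the endpoint has since been retired by GitHub; the credentials, scopes, and note here are illustrative placeholders):

```python
import github as gh

# the password is used exactly once, to mint a token
g = gh.Github('myusername', 'mypasswd')
auth = g.get_user().create_authorization(scopes=['repo'], note='datalad')
# from here on, authenticate with the token only
g = gh.Github(auth.token)
```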

@yarikoptic


Member

yarikoptic commented Apr 6, 2016

sweet! works even for organizations

In [5]: g = gh.Github('yarikoptic-test', '....')

In [7]: o = g.get_organization('yarikoptic-test-org')

In [8]: o.create_repo('apitest', description='some fancy', homepage='example.com')
Out[8]: <github.Repository.Repository at 0x7fd7880320d0>

as for credentials, indeed could go through OAuth! I guess we should provide some centralized helper for those; meanwhile we could use the Credentials contraption

In [1]: from datalad.downloaders.providers import Credential

In [2]: cred = Credential('github', 'user_password', 'https://github.com/')

In [3]: creds = cred()
You need to authenticate with 'github' credentials. https://github.com/ provides information on how to gain access
user: 
You need to authenticate with 'github' credentials. https://github.com/ provides information on how to gain access
password: 

In [4]: print creds
{'password': '...', 'user': 'yarikoptic-test'}

which uses the keyring module, ATM with the default backend, which should theoretically match the system (in my case -- GNOME's)
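The underlying keyring calls, as a hedged sketch (the service and user names are illustrative, not what Credential actually stores):

```python
import keyring  # third-party module that the Credential machinery relies on

# store and retrieve a secret via whatever backend keyring selected
keyring.set_password('datalad-github', 'yarikoptic-test', 's3cret')
assert keyring.get_password('datalad-github', 'yarikoptic-test') == 's3cret'
```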

@mih mih added the WIP label Oct 14, 2016

mih added a commit to mih/datalad that referenced this issue Oct 15, 2016

NF: `create_sibling_github` (see dataladgh-335)
This is a working preview. Some aspects will be later test(ed/able). But
I do not plan to test the actual interaction with the Github API.

Many possible features are missing (customizable repo name templating,
...). It is unlikely that this PR will get them, because my focus is on
addressing the core of datalad#335

@mih mih referenced this issue Oct 15, 2016

Merged

NF: `create_sibling_github` (see gh-335) #1018


mih added a commit to mih/datalad that referenced this issue Oct 21, 2016

NF: `create_sibling_github` (see dataladgh-335)

mih added a commit that referenced this issue Oct 22, 2016

Merge pull request #1018 from mih/nf-githubsibling
NF: `create_sibling_github` (see gh-335)
@Paolopost


Paolopost commented Nov 10, 2016

I tried the configuration reported above without success.

The main issue is that files in the "mine" repository (the clone from github) appear to be available (according to the 'whereis' command) only on [sshdata] and [public], both of which are not accessible. The files are not indexed by the [datasrc] remote. Following the example, files from the local repo are copied to the server only via [sshdata].
Where am I wrong?

I have an additional question: it looks like the remote [httpdata] doesn't play any role.

P.S. Do you have a recommendation for a Docker image to set up a Git server over HTTP? I tried "cirocosta/gitserver-http", but it supports only bare repos.

@Paolopost


Paolopost commented Nov 11, 2016

An additional detail: on the server the storage is mounted via sshfs.
When I init the annex "public", I receive the following notification:

init public
Detected a filesystem without fifo support.
Disabling ssh connection caching.
ok

I don't believe it is relevant to the issue mentioned above.

@mih mih referenced this issue Jan 19, 2017

Open

`publish` needs attention #1197

@mih


Member

mih commented Feb 3, 2017

Finally, an update on this issue. With #1237 many things get changed (and hopefully improved). There is one more issue to address before this is fully resolved. Here is the current flow, by means of a demo script. It does everything except the special remote setup -- which still needs some thought. Anyway, here it is:

#!/bin/bash -e

wdir="$(mktemp -d)"
cd $wdir

# fresh dataset
datalad create orig
cd orig
# remotes
# own webserver
datalad --dbg create-sibling \
	-s target1 \
	--existing replace \
	<serverhostname>:public_html/demoannex
# these two are TODO for special remote setup
#	--as-common-datasrc ownserver \
#	--target-url http://<serverhostname>/~user/demoannex/.git \
# github
datalad create-sibling-github \
	--github-login <githubuser> \
	--publish-depends target1 \
	--existing reconfigure \
	--access-protocol ssh \
	demoannex

# some new content
touch probe
datalad add probe

# publish to github, which involves transfer to own server too
datalad publish --to github

# random new user gets the new repo from github
cd ..
datalad install -s git@github.com:<githubuser>/demoannex.git public
@mih


Member

mih commented Feb 3, 2017

OK, this is done (with the present state of #1237) -- just put back the two commented lines in the script above.
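For the record, a hedged reassembly of the `create-sibling` call from the script above with those two options restored (placeholders as in the script):

```shell
datalad --dbg create-sibling \
	-s target1 \
	--existing replace \
	--as-common-datasrc ownserver \
	--target-url http://<serverhostname>/~user/demoannex/.git \
	<serverhostname>:public_html/demoannex
```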

@mih mih added fix-implemented and removed WIP labels Feb 3, 2017

@mih mih referenced this issue Feb 10, 2017

Closed

"Reviewer mode" #29

yarikoptic added a commit that referenced this issue Mar 20, 2017

Merge tag '0.5.0' into debian
This release includes an avalanche of bug fixes, enhancements, and
additions which at large should stay consistent with previous behavior
but provide better functioning.  Lots of code was refactored to provide
more consistent code-base, and some API breakage has happened.  Further
work is ongoing to standardize output and results reporting
(see [PR 1350])

- requires [git-annex] >= 6.20161210
- commands should now operate on paths specified (if any), without
  causing side-effects on other dirty/staged files
- [save]
    - `-a` is deprecated in favor of `-u` or `--all-updates`
      so only changes to known components get saved, and no new files
      get automagically added
    - `-S` no longer stores the originating dataset in its commit
       message
- [add]
    - can specify commit/save message with `-m`
- [add-sibling] and [create-sibling]
    - now take the name of the sibling (remote) as a `-s` (`--name`)
      option, not a positional argument
    - `--publish-depends` to setup publishing data and code to multiple
      repositories (e.g. github + webserver) should now be functional
      see [this comment](#335 (comment))
    - got `--publish-by-default` to specify what refs should be published
      by default
    - got `--annex-wanted`, `--annex-groupwanted` and `--annex-group`
      settings which would be used to instruct annex about preferred
      content. [publish] then will publish data using those settings if
      `wanted` is set.
    - got `--inherit` option to automagically figure out url/wanted and
      other git/annex settings for new remote sub-dataset to be constructed
- [publish]
    - got `--skip-failing` refactored into `--missing` option
      which could use new feature of [create-sibling] `--inherit`

- More consistent interaction through ssh - all ssh connections go
  through [sshrun] shim for a "single point of authentication", etc.
- More robust [ls] operation outside of the datasets
- A number of fixes for direct and v6 mode of annex

- New [drop] and [remove] commands
- [clean]
    - got `--what` to specify explicitly what cleaning steps to perform
      and now could be invoked with `-r`
- `datalad` and `git-annex-remote*` scripts now do not use setuptools
  entry points mechanism and rely on simple import to shorten start up time
- [Dataset] is also now using [Flyweight pattern], so the same instance is
  reused for the same dataset
- progressbars should not add more empty lines

- Majority of the commands now go through `_prep` for arguments validation
  and pre-processing to avoid recursive invocations

* tag '0.5.0':
  Preparing for 0.5.0 release
  DOC: more referfences
  more of changelogs
  %s seems to be more resilient than str ;)
  OPT: use eatmydata for some installations to hopefully shave off a few seconds
  BF(workaround): convert all fields to string while sorting meta data
  ENH/BF: crawler - support few more diff modes, stage only non empty filenames, no split 117 ds in openfmri
  BF: explicitly import email.parser submodule (Closes #1104)
  initiated changelog for 0.5