
ENH: Standardized way to add examples to commands #3821

Merged: 13 commits merged into datalad:master from ENH/3757-examples, Oct 29, 2019

Conversation

@adswa (Member) commented Oct 21, 2019

Following up on #3757, I'm drafting a standardized way to include examples in the docstrings or help messages of commands, and I would appreciate critical feedback.

My idea was to mimic the current way in which _params_ within each command class holds the command's parameters and has custom functions to add them to the parser or the docstring.

I propose to have _examples_ as a list of dictionaries, with each dictionary being one command example. The key text holds a description of the example, and the keys code_py and code_cmd hold code snippets.

    _examples_ = [
        dict(text="""Apply the text2git procedure upon creation of a dataset""",
             code_py="create(path='mydataset', cfg_proc='text2git')",
             code_cmd="datalad create -c text2git mydataset"),
    ]

Based on _examples_, additional functions in build_doc and setup_parser build the examples and append them to the docstring or the help message; a minimal sketch of what such a builder could look like is shown below.
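
This is only a sketch under the dictionary format proposed above; the function names build_example and build_examples and the exact rendering are illustrative assumptions, not DataLad's actual implementation:

    def build_example(example, api='python'):
        # Render one example dict: reST for the Python docstring,
        # plain text for the argparse help.
        if api == 'python':
            code = example.get('code_py')
            # '::' opens a literal block in reST
            template = "{}::\n\n   {}\n"
        else:
            code = example.get('code_cmd')
            template = "{}:\n\n   % {}\n"
        return template.format(example['text'], code) if code else ''

    def build_examples(examples, api='python'):
        # Concatenate all rendered examples under an 'Examples' header.
        header = 'Examples\n--------\n' if api == 'python' else '*Examples*\n'
        return header + '\n'.join(build_example(ex, api) for ex in examples)

Something like build_examples(_examples_, api='cmdline') could then be appended to the help text by setup_parser, and the 'python' variant to the docstring by build_doc.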

This is how it currently looks for the command line help:

datalad install --help
Usage: datalad install [-h] [-s SOURCE] [-d DATASET] [-g] [-D DESCRIPTION]
                       [-r] [-R LEVELS] [--nosave] [--reckless] [-J NJOBS]
                       [PATH [PATH ...]]

Install a dataset from a (remote) source.

This command creates a local sibling of an existing dataset from a
(remote) location identified via a URL or path. Optional recursion into
potential subdatasets, and download of all referenced data is supported.
The new dataset can be optionally registered in an existing
superdataset by identifying it via the DATASET argument (the new
dataset's path needs to be located within the superdataset for that).

It is recommended to provide a brief description to label the dataset's
nature *and* location, e.g. "Michael's music on black laptop". This helps
humans to identify data locations in distributed scenarios.  By default an
identifier comprised of user and machine name, plus path will be generated.

When only partial dataset content shall be obtained, it is recommended to
use this command without the GET-DATA flag, followed by a
`get` operation to obtain the desired data.

NOTE
  Power-user info: This command uses git clone, and
  git annex init to prepare the dataset. Registering to a
  superdataset is performed via a git submodule add operation
  in the discovered superdataset.

*Examples*
Install a dataset from Github into the current directory:

   % datalad install https://github.com/datalad-datasets/longnow-podcasts.git

Install a dataset as a subdataset into the current dataset:

   % datalad install -d . --source='https://github.com/datalad-datasets/longnow-podcasts.git'

Install a dataset, and get all contents right away:

   % datalad install --get-data --source https://github.com/datalad-datasets/longnow-podcasts.git

Install a dataset with all its subdatasets:

   % datalad install https://github.com/datalad-datasets/longnow-podcasts.git --recursive

*Arguments*
  PATH                  path/name of the installation target. If no PATH is
                        provided a destination path will be derived from a
                        source URL similar to git clone. [Default: None]

*Options*
  -h, --help, --help-np
                        show this help message. --help-np forcefully disables
                        the use of a pager for displaying the help message
  -s SOURCE, --source SOURCE
                        URL or local path of the installation source.
                        Constraints: value must be a string [Default: None]
  -d DATASET, --dataset DATASET
                        specify the dataset to perform the install operation
                        on. If no dataset is given, an attempt is made to
                        identify the dataset in a parent directory of the
                        current working directory and/or the PATH given.
                        Constraints: Value must be a Dataset or a valid
                        identifier of a Dataset (e.g. a path) [Default: None]
  -g, --get-data        if given, obtain all data content too. [Default:
                        False]
  -D DESCRIPTION, --description DESCRIPTION
                        short description to use for a dataset location. Its
                        primary purpose is to help humans to identify a
                        dataset copy (e.g., "mike's dataset on lab server").
                        Note that when a dataset is published, this
                        information becomes available on the remote side.
                        Constraints: value must be a string [Default: None]
  -r, --recursive       if set, recurse into potential subdataset. [Default:
                        False]
  -R LEVELS, --recursion-limit LEVELS
                        limit recursion into subdataset to the given number of
                        levels. Constraints: value must be convertible to type
                        'int' [Default: None]
  --nosave              by default all modifications to a dataset are
                        immediately saved. Giving this option will disable
                        this behavior. [Default: True]
  --reckless            Set up the dataset to be able to obtain content in the
                        cheapest/fastest possible way, even if this poses a
                        potential risk the data integrity (e.g. hardlink files
                        from a local clone of the dataset). Use with care, and
                        limit to "read-only" use cases. With this flag the
                        installed dataset will be marked as untrusted.
                        [Default: False]
  -J NJOBS, --jobs NJOBS
                        how many parallel jobs (where possible) to use.
                        Constraints: value must be convertible to type 'int',
                        or value must be one of ('auto',) [Default: 'auto']

and for the Python help:
Help on function __call__ in module datalad.core.local.create:

Signature:
create(
    path=None,
    initopts=None,
    force=False,
    description=None,
    dataset=None,
    no_annex=False,
    fake_dates=False,
    cfg_proc=None,
)
Docstring:
Create a new dataset from scratch.

This command initializes a new dataset at a given location, or the
current directory. The new dataset can optionally be registered in an
existing superdataset (the new dataset's path needs to be located
within the superdataset for that, and the superdataset needs to be given
explicitly via `dataset`). It is recommended
to provide a brief description to label the dataset's nature *and*
location, e.g. "Michael's music on black laptop". This helps humans to
identify data locations in distributed scenarios.  By default an identifier
comprised of user and machine name, plus path will be generated.

This command only creates a new dataset, it does not add existing content
to it, even if the target directory already contains additional files or
directories.

Plain Git repositories can be created via the `no_annex` flag.
However, the result will not be a full dataset, and, consequently,
not all features are supported (e.g. a description).

To create a local version of a remote dataset use the
:func:`~datalad.api.install` command instead.

.. note::
  Power-user info: This command uses :command:`git init` and
  :command:`git annex init` to prepare the new dataset. Registering to a
  superdataset is performed via a :command:`git submodule add` operation
  in the discovered superdataset.

Examples
--------
Create a dataset 'mydataset' in the current directory::

   create(path='mydataset')

Apply the text2git procedure upon creation of a dataset::

   create(path='mydataset', cfg_proc='text2git')

Create a subdataset in the root of an existing dataset::

   create(dataset='.', path='mysubdataset')

Create a dataset in an existing, non-empty directory::

   create(force=True, path='.')

Create a plain Git repository::

   create(path='mydataset', no_annex=True)


Parameters
----------
path : str or Dataset or None, optional
  path where the dataset shall be created, directories will be created
  as necessary. If no location is provided, a dataset will be created
  in the current working directory. Either way the command will error
  if the target directory is not empty. Use `force` to create a
  dataset in a non-empty directory. [Default: None]
initopts
  options to pass to :command:`git init`. Options can be given as a
  list of command line arguments or as a GitPython-style option
  dictionary. Note that not all options will lead to viable results.
  For example '--bare' will not yield a repository where DataLad can
  adjust files in its worktree. [Default: None]
force : bool, optional
  enforce creation of a dataset in a non-empty directory. [Default:
  False]
description : str or None, optional
  short description to use for a dataset location. Its primary purpose
  is to help humans to identify a dataset copy (e.g., "mike's dataset
  on lab server"). Note that when a dataset is published, this
  information becomes available on the remote side. [Default: None]
dataset : Dataset or None, optional
  specify the dataset to perform the create operation on. If a dataset
  is given, a new subdataset will be created in it. [Default: None]
no_annex : bool, optional
  if set, a plain Git repository will be created without any annex.
  [Default: False]
fake_dates : bool, optional
  Configure the repository to use fake dates. The date for a new
  commit will be set to one second later than the latest commit in the
  repository. This can be used to anonymize dates. [Default: False]
cfg_proc
  Run cfg_PROC procedure(s) (can be specified multiple times) on the
  created dataset. Use `run_procedure(discover=True)` to get a list of
  available procedures, such as cfg_text2git. [Default: None]
on_failure : {'ignore', 'continue', 'stop'}, optional
  behavior to perform on failure: 'ignore' any failure is reported,
  but does not cause an exception; 'continue' if any failure occurs an
  exception will be raised at the end, but processing other actions
  will continue for as long as possible; 'stop': processing will stop
  on first failure and an exception is raised. A failure is any result
  with status 'impossible' or 'error'. Raised exception is an
  IncompleteResultsError that carries the result dictionaries of the
  failures in its `failed` attribute. [Default: 'continue']
proc_post
  Like `proc_pre`, but procedures are executed after the main command
  has finished. [Default: None]
proc_pre
  DataLad procedure to run prior to the main command. The argument a
  list of lists with procedure names and optional arguments.
  Procedures are called in the order their are given in this list. It
  is important to provide the respective target dataset to run a
  procedure on as the `dataset` argument of the main command.
  [Default: None]
result_filter : callable or None, optional
  if given, each to-be-returned status dictionary is passed to this
  callable, and is only returned if the callable's return value does
  not evaluate to False or a ValueError exception is raised. If the
  given callable supports `**kwargs` it will additionally be passed
  the keyword arguments of the original API call. [Default: None]
result_renderer : {'default', 'json', 'json_pp', 'tailored'} or None, optional
  format of return value rendering on stdout. [Default: None]
result_xfm : {'datasets', 'successdatasets-or-none', 'paths', 'relpaths', 'metadata'} or callable or None, optional
  if given, each to-be-returned result status dictionary is passed to
  this callable, and its return value becomes the result instead. This
  is different from `result_filter`, as it can perform arbitrary
  transformation of the result value. This is mostly useful for top-
  level command invocations that need to provide the results in a
  particular format. Instead of a callable, a label for a pre-crafted
  result transformation can be given. [Default: None]
return_type : {'generator', 'list', 'item-or-list'}, optional
  return value behavior switch. If 'item-or-list' a single value is
  returned instead of a one-item return value list, or a list in case
  of multiple return values. `None` is return in case of an empty
  list. [Default: 'list']
File:      ~/repos/datalad/datalad/core/local/create.py
Type:      FunctionWrapper

Note that in the docstrings, examples use the simple :: reST markup, which I believe Sphinx renders as a code block:

Create a subdataset in the root of an existing dataset::

   create(dataset='.', path='mysubdataset')

If you have any thoughts on this, please let me know. I will in any case draft examples for, hopefully, all commands.

TODO

The dictionary data structure can be expanded with other keys, such as setup_py/setup_cmd and teardown_py/teardown_cmd. These keys could hold the relevant code needed to run the command example and clean up afterwards, e.g.:

        dict(text="""Create a dataset 'mydataset' in the current directory""",
             code_py="create(path='mydataset')",
             code_cmd="datalad create mydataset",
             setup_py="python -c 'from datalad.api import create, remove'",
             setup_cmd="datalad create mydataset",
             teardown_py="remove(path='mydataset')",
             teardown_cmd="datalad remove mydataset"),

This would make it possible to write functions that build the examples, test whether they run, and tear down what an example executed.
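
As an illustration, here is a minimal sketch of such a test helper, assuming the setup_cmd/teardown_cmd keys from the dictionary above; check_cmd_example is a hypothetical name, not existing DataLad test infrastructure:

    import subprocess

    def check_cmd_example(example):
        # Run a command line example together with its setup and
        # teardown steps; a non-zero exit status raises CalledProcessError.
        if example.get('setup_cmd'):
            subprocess.run(example['setup_cmd'], shell=True, check=True)
        try:
            subprocess.run(example['code_cmd'], shell=True, check=True)
        finally:
            # clean up even if the example itself failed
            if example.get('teardown_cmd'):
                subprocess.run(example['teardown_cmd'], shell=True, check=True)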

@codecov (bot) commented Oct 21, 2019

Codecov Report

Merging #3821 into master will increase coverage by 0.03%.
The diff coverage is 97.29%.

@@            Coverage Diff             @@
##           master    #3821      +/-   ##
==========================================
+ Coverage   80.74%   80.77%   +0.03%     
==========================================
  Files         273      273              
  Lines       35905    35963      +58     
==========================================
+ Hits        28990    29050      +60     
+ Misses       6915     6913       -2
Impacted Files Coverage Δ
datalad/distribution/get.py 81.81% <ø> (+1.81%) ⬆️
datalad/core/local/create.py 96.24% <100%> (+0.02%) ⬆️
datalad/distribution/drop.py 84.33% <100%> (+0.19%) ⬆️
datalad/core/local/run.py 92.24% <100%> (+0.03%) ⬆️
datalad/cmdline/main.py 78.83% <100%> (+0.26%) ⬆️
datalad/core/local/status.py 96.07% <100%> (+0.03%) ⬆️
datalad/distribution/install.py 97.75% <100%> (+0.02%) ⬆️
datalad/core/local/save.py 86.84% <100%> (+0.17%) ⬆️
datalad/interface/base.py 90.6% <96.42%> (+0.53%) ⬆️
datalad/support/gitrepo.py 83.5% <0%> (-0.11%) ⬇️
... and 6 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 004daa2...ed58d5d.

@yarikoptic (Member):

I am not engaging, but at first look, it looks great!!!

@adswa (Member, Author) commented Oct 23, 2019

There is one failure not related to my changes.

(due to

Downloading archive: https://storage.googleapis.com/travis-ci-language-archives/python/binaries/ubuntu/16.04/x86_64/python-3.5.tar.bz2
129.62s$ curl -sSf -o python-3.5.tar.bz2 ${archive_url}
curl: (7) Failed to connect to storage.googleapis.com port 443: Connection timed out
Unable to download 3.5 archive. The archive may not exist. Please consider a different version.

)

Importantly, I don't break the docs anymore by trying to improve the docs ;-) If anyone has critical feedback, I'd be very happy to hear it.

Review comment on this snippet:

    _examples_ = [
        dict(text="Drop single file content",
             code_py="drop('path/to/file')",
             code_cmd="datalad drop <path/to/file>"),

(Member) Style comment: If we are using non-placeholder syntax 'path/to/file' for one API, we should stick with it for the other.

"source='https://github.com/datalad-datasets/longnow-podcasts.git')",
code_cmd="datalad install -d . "
"--source='https://github.com/datalad-datasets/longnow-podcasts.git'"),
dict(text="Install a dataset, and get all contents right away",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all content (?), i.e. no plural

@mih (Member) commented Oct 23, 2019

I think this is great (made a few comments above).

I would recommend getting it merged now, rather than making the conversion of all examples everywhere a precondition. In many cases those should be looked at carefully (not just converted) and re-evaluated as to whether they still make sense at all.

What I would appreciate, though, is a short description of how this is meant to work in https://github.com/datalad/datalad/blob/22cef22c0f84c6a9fdd4206f94f29af8f538c670/docs/source/designpatterns.rst

Review threads on datalad/core/local/create.py and datalad/core/local/run.py were marked as resolved (outdated).
adswa and others added 6 commits October 29, 2019 08:09
Looking forward to the ruby on rails API!
This should enable help-like specification in triple-quotes, but also
meaningful formatting across lines in the style of the respective API.
Removes confusing backslashes from docs too.
@adswa adswa marked this pull request as ready for review October 29, 2019 07:57
@adswa adswa changed the title from "[WIP] ENH: Standardized way to add examples to commands" to "ENH: Standardized way to add examples to commands" on Oct 29, 2019
@mih (Member) left a comment

This is ready from my POV. Thanks!

@mih mih merged commit a04e26c into datalad:master Oct 29, 2019
@adswa adswa deleted the ENH/3757-examples branch December 18, 2020 07:32