Updates to documentation
jfischer committed Mar 13, 2019
1 parent dce6ef6 commit ffe37da
Showing 6 changed files with 147 additions and 31 deletions.
63 changes: 62 additions & 1 deletion dataworkspaces/lineage.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,8 @@
"""
API for tracking data lineage
This module provides an API for tracking
*data lineage* -- the history of how a given result was created, including the
versions of original source data and the various steps run in the *data pipeline*
to produce the final result.
"""
import sys
from abc import ABC, abstractmethod
@@ -26,6 +29,10 @@
##########################################################################

class Lineage(contextlib.AbstractContextManager):
    """This is the main object for tracking the execution of a step.
    Rather than instantiating it directly, use the :class:`~LineageBuilder`
    class to construct your :class:`~Lineage` instance.
    """
    def __init__(self, step_name:str, start_time:datetime.datetime,
                 parameters:Dict[str,Any],
                 inputs:List[Union[str, ResourceRef]],
@@ -149,6 +156,56 @@ def make_lineage(parameters:Dict[str,Any], inputs:List[Union[str, ResourceRef]],
command_line=[sys.executable]+sys.argv)

class LineageBuilder:
    """Use this class to declaratively build :class:`~Lineage` objects. Instantiate
    a LineageBuilder instance, and call a sequence of configuration methods
    to specify your inputs, parameters, your workspace (if the script is not
    already inside the workspace), and whether this is a results step. Each
    configuration method returns the builder, so you can chain them together.
    Finally, call :func:`~eval` to instantiate the :class:`~Lineage` object.

    **Configuration Methods**

    To specify the workflow step's name, call one of:

    * :func:`~as_script_step` - the script's name will be used to infer the step
    * with_step_name - explicitly specify the step name

    To specify the parameters of the step (e.g. command line arguments), use the
    :func:`~with_parameters` method.

    To specify the input of the step call one or more of:

    * :func:`~with_input_path` - resolve the local filesystem path to a resource and
      subpath and add it to the lineage as inputs. May be called more than once.
    * :func:`~with_input_paths` - resolve a list of local filesystem paths to
      resources and subpaths and add them to the lineage as inputs. May be called
      more than once.
    * :func:`~with_input_ref` - add the resource and subpath to the lineage as an input.
      May be called more than once.
    * :func:`~with_no_inputs` - mutually exclusive with the other input methods. This
      signals that there are no inputs to this step.

    If you need to specify the workspace's root directory, use the
    :func:`~with_workspace_directory` method. Otherwise, the lineage API will attempt
    to infer the workspace directory by looking at the path of the script.

    Call :func:`~as_results_step` to indicate that this step is producing results.
    This will cause a ``results.json`` file and a ``lineage.json`` file to be created
    in the specified directory.

    **Example**

    Here is an example where we build a :class:`~Lineage` object for a script
    that has one input and produces results::

        lineage = LineageBuilder()\\
                    .as_script_step()\\
                    .with_parameters({'gamma':0.001})\\
                    .with_input_path(args.intermediate_data)\\
                    .as_results_step(args.results_dir).eval()

    **Methods**
    """
    def __init__(self):
        self.step_name = None # type: Optional[str]
        self.command_line = None # type: Optional[List[str]]
@@ -219,6 +276,10 @@ def as_results_step(self, results_dir:str, run_description:Optional[str]=None)\
        return self

    def eval(self) -> Lineage:
        """Validate the current configuration, making sure all required
        properties have been specified, and return a :class:`~Lineage` object
        with the requested configuration.
        """
        assert self.step_name is not None, "Need to specify step name"
        assert self.parameters is not None, "Need to specify parameters"
        assert self.no_inputs or (self.inputs is not None),\
Expand Down
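The chaining style described in the `LineageBuilder` docstring can be illustrated in isolation. The sketch below is a minimal stand-in and not the actual implementation: `SketchBuilder`, its fields, and the dict returned by `eval` are hypothetical (the real `eval` returns a `Lineage` object). It only shows how each configuration method returns the builder so that calls can be chained, and how the final call validates required settings with assertions similar to those in this diff.

```python
from typing import Any, Dict, List, Optional


class SketchBuilder:
    """Hypothetical stand-in that mimics the chaining style of LineageBuilder."""

    def __init__(self) -> None:
        self.step_name: Optional[str] = None
        self.parameters: Optional[Dict[str, Any]] = None
        self.inputs: Optional[List[str]] = None
        self.no_inputs = False

    def with_step_name(self, step_name: str) -> "SketchBuilder":
        self.step_name = step_name
        return self  # returning self is what makes chaining possible

    def with_parameters(self, parameters: Dict[str, Any]) -> "SketchBuilder":
        self.parameters = parameters
        return self

    def with_input_path(self, path: str) -> "SketchBuilder":
        assert not self.no_inputs, "Cannot add inputs after with_no_inputs()"
        if self.inputs is None:
            self.inputs = []
        self.inputs.append(path)  # may be called more than once
        return self

    def with_no_inputs(self) -> "SketchBuilder":
        assert self.inputs is None, "with_no_inputs() excludes other input methods"
        self.no_inputs = True
        return self

    def eval(self) -> Dict[str, Any]:
        # Validate before building, mirroring the assertions shown in the diff.
        assert self.step_name is not None, "Need to specify step name"
        assert self.parameters is not None, "Need to specify parameters"
        assert self.no_inputs or (self.inputs is not None), \
            "Need to specify inputs or call with_no_inputs()"
        return {
            "step_name": self.step_name,
            "parameters": self.parameters,
            "inputs": self.inputs or [],
        }


# The file path here is a made-up example value.
config = (SketchBuilder()
          .with_step_name("train")
          .with_parameters({"gamma": 0.001})
          .with_input_path("intermediate-data/features.csv")
          .eval())
```

Deferring validation to `eval` lets the configuration methods be called in any order, at the cost of reporting a missing setting only when the object is finally built.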
13 changes: 8 additions & 5 deletions docs/conf.py
@@ -12,9 +12,9 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
import os
import sys
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))


# -- Project information -----------------------------------------------------
@@ -43,6 +43,8 @@
'sphinx.ext.viewcode',
]

autodoc_mock_imports=['click']

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

@@ -76,7 +78,8 @@
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'
#html_theme = 'alabaster'
html_theme = 'sphinx_rtd_theme'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
@@ -175,4 +178,4 @@
epub_exclude_files = ['search.html']


# -- Extension configuration -------------------------------------------------
# -- Extension configuration -------------------------------------------------
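The `sys.path` change in `docs/conf.py` appears intended to make the repository root importable, so that autodoc can load `dataworkspaces.lineage` from the source tree. The sketch below shows what that path expression evaluates to. It uses a temporary directory as a stand-in for the project, and `repo_root_for` is a hypothetical helper introduced only for this illustration.

```python
import os
import tempfile


def repo_root_for(conf_file: str) -> str:
    """Return the parent of the directory that holds conf_file.

    Same expression as in docs/conf.py, with __file__ replaced by an argument.
    """
    return os.path.abspath(os.path.join(os.path.dirname(conf_file), '..'))


with tempfile.TemporaryDirectory() as project:
    docs = os.path.join(project, 'docs')
    os.makedirs(docs)
    conf = os.path.join(docs, 'conf.py')  # the file need not exist for path math
    root = repo_root_for(conf)
    # realpath guards against symlinked temporary directories
    resolves_to_project = os.path.realpath(root) == os.path.realpath(project)
    print(root, resolves_to_project)
```

Computing the path from `__file__` rather than from `'.'` means the result does not depend on the directory from which the documentation build is started.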
1 change: 1 addition & 0 deletions docs/index.rst
@@ -34,6 +34,7 @@ Unix-like systems, including Linux, MacOS, and on Windows via the
tutorial
commands
resources
lineage
internals


2 changes: 1 addition & 1 deletion docs/internals.rst
@@ -1,6 +1,6 @@
.. _internals:

5. Internals: Developer's Guide
6. Internals: Developer's Guide
===============================
This section is a guide for people working on the development of Data Workspaces
or people who wish to extend it (e.g. through their own resource types or
81 changes: 57 additions & 24 deletions docs/intro.rst
@@ -8,7 +8,7 @@ Here is a quick example to give you a flavor of the project, using
`scikit-learn <https://scikit-learn.org>`_
and the famous digits dataset running in a Jupyter Notebook.

First, install the libary::
First, install [#introf1]_ the library::

  pip install dataworkspaces

@@ -24,6 +24,8 @@ one for the source code, and one for the results. These are special
subdirectories, in that they are *resources* which can be tracked and versioned
independently.

.. [#introf1] See the :ref:`Installation section <installation>` for more options and details.

Now, we are going to add our source data to the workspace. This resides in an
external, third-party git repository. It is simple to add::

@@ -108,6 +110,60 @@ Some things you can do from here:
* More complex scenarios involving multi-step data pipelines can easily
be automated. See the documentation for details.

.. _installation:

Installation
------------
Now, let us look in more detail at the options for installation.

Prerequisites
~~~~~~~~~~~~~
This software runs directly on Linux and macOS. Windows is supported via the
`Windows Subsystem for Linux <https://docs.microsoft.com/en-us/windows/wsl/install-win10>`_. The following software should be pre-installed:

* git
* Python 3.5 or later
* Optionally, the `rclone <https://rclone.org>`_ utility, if you are going to be
using it to sync with a remote copy of your data.
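As a rough aid, the prerequisites listed above can be checked from Python before installing. This is a minimal sketch using only the standard library. It assumes the tools are installed under the executable names `git` and `rclone` and are on the `PATH`; `check_prerequisites` is a hypothetical helper, not part of Data Workspaces.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 5)):
    """Return a dict mapping each prerequisite to True (found) or False."""
    return {
        "python": sys.version_info >= min_python,
        "git": shutil.which("git") is not None,
        # rclone is optional: only needed to sync with a remote copy of the data
        "rclone (optional)": shutil.which("rclone") is not None,
    }


status = check_prerequisites()
for name, found in status.items():
    print("{:<18} {}".format(name, "ok" if found else "missing"))
```

A missing `rclone` entry is not an error unless you plan to sync with a remote copy of your data.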

Installation from the Python Package Index (PyPI)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The easiest way to install Data Workspaces is via
the Python Package Index at http://pypi.org.

We recommend first creating a
`virtual environment <https://docs.python.org/3/library/venv.html#venv-def>`_
to contain the Data Workspaces software and any other software needed for your
project. Using the standard Python 3 distribution, you can create and *activate*
a virtual environment via::

  python3 -m venv VIRTUAL_ENVIRONMENT_PATH
  source VIRTUAL_ENVIRONMENT_PATH/bin/activate

If you are using the `Anaconda <https://www.anaconda.com/distribution/>`_
distribution of Python 3, you can create and activate a virtual environment via::

  conda create --name VIRTUAL_ENVIRONMENT_NAME
  conda activate VIRTUAL_ENVIRONMENT_NAME

Now that you have your virtual environment set up, we can install the actual library::

  pip install dataworkspaces

To verify that it was installed correctly, run::

  dws --help


Installation via the source tree
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can clone the source tree and install it as follows::

  git clone git@github.com:data-workspaces/data-workspaces-core.git
  cd data-workspaces-core
  pip install `pwd`
  dws --help # just a sanity check that it was installed correctly


Concepts
--------
@@ -156,29 +212,6 @@ Taken together, these features let you:
6. Easily reproduce your environment on a new machine to parallelize work.
7. Publish your environment on a site like GitHub or GitLab for others to download and explore.

Installation
------------
Rerequisites
~~~~~~~~~~~~
This software runs directly on Linux and MacOSx. Windows is supported by via the
`Windows Subsystem for Linux <https://docs.microsoft.com/en-us/windows/wsl/install-win10>`_. The following software should be pre-installed:

* git
* Python 3.5 or later
* Optionally, the `rclone <https://rclone.org>`_ utility, if you are going to be
using it to sync with a remote copy of your data.

Installation via pip
~~~~~~~~~~~~~~~~~~~~
TODO

Installation via the source tree
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can clone the source tree and install it as follows::

git clone git@github.com:jfischer/data-workspaces-python.git
cd data-workspaces-python
pip install `pwd`

Command Line Interface
-----------------------
18 changes: 18 additions & 0 deletions docs/lineage.rst
@@ -0,0 +1,18 @@
.. _lineage:

5. Lineage API
==============
The Lineage API is provided by the module ``dataworkspaces.lineage``.

.. automodule:: dataworkspaces.lineage
   :no-undoc-members:


.. autoclass:: Lineage()
   :members:
   :no-undoc-members:


.. autoclass:: LineageBuilder
   :members:
   :undoc-members:
