Doc cleanup
jfischer committed Mar 15, 2019
1 parent ffe37da commit ccaa724
Showing 3 changed files with 43 additions and 23 deletions.
21 changes: 18 additions & 3 deletions dataworkspaces/lineage.py
@@ -3,6 +3,7 @@
*data lineage* -- the history of how a given result was created, including the
versions of original source data and the various steps run in the *data pipeline*
to produce the final result.
"""
import sys
from abc import ABC, abstractmethod
@@ -61,10 +62,20 @@ def __init__(self, step_name:str, start_time:datetime.datetime,
         self.in_progress = True

     def add_output_path(self, path:str):
-        (name, subpath) = self.resources.map_local_path_to_resource(path)
-        self.step.add_output(self.store, ResourceRef(name, subpath))
+        """Resolve the path to a resource name and subpath. Add
+        that to the lineage as an output of the step. From this point on,
+        if the step fails (:func:`~abort` is called), the associated resource
+        and subpath will be marked as being in an "unknown" state.
+        """
+        (name, subpath) = self.resources.map_local_path_to_resource(path)
+        self.step.add_output(self.store, ResourceRef(name, subpath))

     def add_output_ref(self, ref:ResourceRef):
+        """Add the resource reference to the lineage as an output of the step.
+        From this point on, if the step fails (:func:`~abort` is called), the
+        associated resource and subpath will be marked as being in an
+        "unknown" state.
+        """
         self.step.add_output(self.store, ref)

     def abort(self):
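
The docstrings added in this hunk describe a contract: once an output has been registered, an aborted step leaves that resource and subpath in an "unknown" state. A minimal sketch of that contract is shown here. It is not the dataworkspaces implementation: `ToyRef`, `ToyStep`, the `states` dictionary, the `complete` method, and the state names other than "unknown" are hypothetical stand-ins, and only the method names `add_output_ref` and `abort` are borrowed from the diff.

```python
from typing import Dict, List, NamedTuple, Optional


class ToyRef(NamedTuple):
    """Hypothetical stand-in for a (resource name, subpath) reference."""
    name: str
    subpath: Optional[str] = None


class ToyStep:
    """Hypothetical step that remembers its outputs and their states."""

    def __init__(self, states: Dict[ToyRef, str]):
        self.states = states              # shared record of output states
        self.outputs: List[ToyRef] = []

    def add_output_ref(self, ref: ToyRef) -> None:
        # Registering an output means the step may start writing to it.
        self.outputs.append(ref)
        self.states[ref] = "in_progress"  # assumed state name

    def complete(self) -> None:
        # Assumed success path: every registered output is fully written.
        for ref in self.outputs:
            self.states[ref] = "written"  # assumed state name

    def abort(self) -> None:
        # After a failure we cannot tell how much was written,
        # so every registered output becomes "unknown".
        for ref in self.outputs:
            self.states[ref] = "unknown"


states: Dict[ToyRef, str] = {}
step = ToyStep(states)
ref = ToyRef("results", "model.pkl")
step.add_output_ref(ref)
step.abort()
print(states[ref])  # prints "unknown"
```

The point of registering outputs before writing them is that a later failure can be traced to every location the step may have touched, even if nothing was actually written there.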
@@ -103,7 +114,11 @@ def __exit__(self, exc_type, exc_val, exc_tb):
         return False # don't suppress any exception

 class ResultsLineage(Lineage):
-    """Lineage for a results step.
+    """Lineage for a results step. This marks the :class:`~Lineage`
+    object as generating results. At the end of the step execution,
+    a ``lineage.json`` file will be written to the results directory.
+    It also adds the :func:`~write_results` method for writing a
+    JSON summary of the final results.
     """
     def __init__(self, step_name:str, start_time:datetime.datetime,
                  parameters:Dict[str,Any],
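
The new `ResultsLineage` docstring describes two behaviors: a `lineage.json` file written to the results directory when the step ends, and a `write_results` method for a JSON summary. The sketch below illustrates that pattern with a plain context manager. It is a toy under stated assumptions, not the library's class: `ToyResultsLineage`, its constructor arguments, the `results.json` file name, and the JSON layout are all hypothetical, and writing `lineage.json` only on a clean exit is an assumption the diff does not confirm.

```python
import json
import os
import tempfile
from typing import Any, Dict


class ToyResultsLineage:
    """Hypothetical context manager; not the dataworkspaces class."""

    def __init__(self, step_name: str, parameters: Dict[str, Any],
                 results_dir: str):
        self.step_name = step_name
        self.parameters = parameters
        self.results_dir = results_dir

    def write_results(self, metrics: Dict[str, Any]) -> None:
        # Assumed file name; the diff does not name the summary file.
        path = os.path.join(self.results_dir, "results.json")
        with open(path, "w") as f:
            json.dump({"step": self.step_name, "metrics": metrics}, f, indent=2)

    def __enter__(self) -> "ToyResultsLineage":
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> bool:
        if exc_type is None:
            # Assumption: record the lineage only when the step succeeded.
            path = os.path.join(self.results_dir, "lineage.json")
            with open(path, "w") as f:
                json.dump({"step": self.step_name,
                           "parameters": self.parameters}, f, indent=2)
        return False  # don't suppress any exception


results_dir = tempfile.mkdtemp()
with ToyResultsLineage("train", {"lr": 0.01}, results_dir) as lineage:
    lineage.write_results({"accuracy": 0.9})
print(sorted(os.listdir(results_dir)))  # prints ['lineage.json', 'results.json']
```

Returning `False` from `__exit__` mirrors the context line in the hunk above: an exception raised inside the `with` block still propagates to the caller.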
13 changes: 13 additions & 0 deletions docs/index.rst
@@ -26,6 +26,19 @@ Data Workspaces runs on
 Unix-like systems, including Linux, MacOS, and on Windows via the
 `Windows Subsystem for Linux <https://docs.microsoft.com/en-us/windows/wsl/install-win10>`_.

+Data Workspaces lets you:
+
+1. Track and version all the different resources for your data science project
+   from one place.
+2. Automatically track the full history of your experimental results. Scripts can easily be
+   developed to build reports on these results.
+3. Reproduce any prior experiment, including the source data, code, and configuration parameters used.
+4. Go back to a prior experiment as a "branching-off" point to explore additional permuations.
+5. Collaborate with others on the same project, sharing data, code, and results.
+6. Easily reproduce your environment on a new machine to parallelize work.
+7. Publish your environment on a site like GitHub or GitLab for others to download and explore.
+
+
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
32 changes: 12 additions & 20 deletions docs/intro.rst
@@ -177,7 +177,7 @@ pipeline.

 A workspace contains one or more *resources*. Each resource represents
 a collection of data that has a particular *role* in the project -- source
-data, intermediate data (generated by processinng the original source data),
+data, intermediate data (generated by processing the original source data),
 code, and results. Resources can be subdirectories in the workspace's
 Git repository, separate git repositories, local directories, or remote
 systems (e.g. an S3 bucket or a remote server's files accessed via ssh).
@@ -192,25 +192,17 @@ resource, you can *restore* back to any prior snapshot.
 *Results resources* are handled a little differently than other types: they
 are always additive. Each snapshot of a results resource takes the current files
 in the resource and moves it to a snapshot-specific subdirectory. This lets you
-view and compare the results of all your prior experiements.
-
-Building on Git, a data workspace can be synced with a remote copy, called the *origin*.
-The data workspace command line tool, ``dws``, provides ``push`` and ``pull`` commands,
-similar to their Git analogs. In addition to the workspace itself, these commands can sync
-all the resources referenced by the workspace. Finally, there is a ``clone`` command which can initialize
-your environment on a new machine.
-
-Taken together, these features let you:
-
-1. Track and version all the different resources for your data science project
-   from one place.
-2. Automatically track the full history of your experimental results. Scripts can easily be
-   developed to build reports on these results.
-3. Reproduce any prior experiment, including the source data, code, and configuration parameters used.
-4. Go back to a prior experiment as a "branching-off" point to explore additional permuations.
-5. Collaborate with others on the same project, sharing data, code, and results.
-6. Easily reproduce your environment on a new machine to parallelize work.
-7. Publish your environment on a site like GitHub or GitLab for others to download and explore.
+view and compare the results of all your prior experiments.
+
+You interact with your data workspace through the ``dws`` command line tool,
+which like Git, has various subcommands for the actions you might take
+(e.g. creating a new snapshot, syncing with a remote repository, etc.).
+
+Beyond the basic versioning of your project through snapshots, you can use
+the :ref:`Lineage API <lineage>` to track each step of your workflow, including inputs/outputs,
+parameters, and metrics (accuracy, loss, precision, recall, roc, etc.). This lineage data is
+saved with your snapshots so you can understand how you arrived at each
+of your results.


 Commmand Line Interface
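
The docs/intro.rst text in this commit says that a snapshot of a results resource moves the current files into a snapshot-specific subdirectory, so earlier results are never overwritten. A rough sketch of that additive behavior follows. It assumes a hypothetical `snapshot_results` helper and a `snapshots/<name>` directory layout; neither comes from the Data Workspaces code, which may organize snapshots differently.

```python
import os
import shutil
import tempfile


def snapshot_results(results_dir: str, snapshot_name: str) -> str:
    """Move the current top-level files into a per-snapshot subdirectory.

    Hypothetical helper: the 'snapshots/<name>' layout is an assumption.
    """
    target = os.path.join(results_dir, "snapshots", snapshot_name)
    os.makedirs(target, exist_ok=True)
    for entry in os.listdir(results_dir):
        src = os.path.join(results_dir, entry)
        if os.path.isfile(src):  # leave earlier snapshot directories alone
            shutil.move(src, os.path.join(target, entry))
    return target


results_dir = tempfile.mkdtemp()
for run, accuracy in [("run-1", "0.81"), ("run-2", "0.86")]:
    # Each experiment writes to the same file name in the results resource.
    with open(os.path.join(results_dir, "results.txt"), "w") as f:
        f.write(accuracy)
    snapshot_results(results_dir, run)

print(sorted(os.listdir(os.path.join(results_dir, "snapshots"))))
# prints ['run-1', 'run-2']
```

Because each run's `results.txt` ends up under its own subdirectory, a reporting script can walk the snapshot directories and compare metrics across experiments.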
