Skip to content

Commit

Permalink
More lineage docs
Browse files Browse the repository at this point in the history
  • Loading branch information
jfischer committed Mar 15, 2019
1 parent ccaa724 commit 9a0e7e6
Show file tree
Hide file tree
Showing 3 changed files with 94 additions and 1 deletion.
92 changes: 91 additions & 1 deletion dataworkspaces/lineage.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,96 @@
versions of original source data and the various steps run in the *data pipeline*
to produce the final result.
The basic idea is that your workflow is a sequence of *pipeline steps*::
---------- ---------- ---------- ----------
| | | | | | | |
| Step 1 |---->| Step 2 |---->| Step 3 |---->| Step 4 |
| | | | | | | |
---------- ---------- ---------- ----------
A step could be a command line script, a Jupyter notebook or perhaps
a step in an automated workflow tool (e.g. Apache Airflow).
Each step takes a number of *inputs* and *parameters* and generates *outputs*.
The inputs are resources in your workspace (or subpaths within a resource) from
which the step will read to perform its task. The parameters are configuration
values passed to the step (e.g. the command line arguments of a script). The outputs
are the resources (or subpaths within a resource), which are written to by
the step. The outputs may represent results or intermediate data to be consumed
by downstream steps.
The lineage API captures this data for each step. Here is a view of the data captured::
Parameters
|| || ||
\/ \/ \/
------------
=>| |=>
Input resources =>| Step i |=> Output resources
=>| |=>
------------
To do this, we need use the following classes:
* :class:`~ResourceRef` - A reference to a resource for use as a step input or output.
A reference contains a resource name and an optional path within that resource.
* :class:`~Lineage` - The main lineage object, instantiated at the start of your step.
At the beginning of your step, you specify the inputs, parameters, and outputs. At the
end of the step, the data is saved, along with any results you might have from that step.
Lineage instances are
`context managers <https://docs.python.org/3/reference/datamodel.html#context-managers>`_,
which means you can use a ``with`` statement to manage their lifecycle.
* :class:`~LineageBuilder` - This is a helper class to guide the creation of your lineage
object.
**Example**
Here is an example usage of the lineage API in a command line script::
import argparse
from dataworkspaces.lineage import LineageBuilder
def main():
...
parser = argparse.ArgumentParser()
parser.add_argument('--gamma', type=float, default=0.01,
help="Regularization parameter")
parser.add_argument('input_data_dir', metavar='INPUT_DATA_DIR', type=str,
help='Path to input data')
parser.add_argument('results_dir', metavar='RESULTS_DIR', type=str,
help='Path to where results should be stored')
args = parser.parse_args()
...
# Create a LineageBuilder instance to specify the details of the step
# to the lineage API.
builder = LineageBuilder()\\
.as_script_step()\\
.with_paramerers({'gamma':args.gamma})\\
.with_input_path(args.input_data_dir)\\
.as_results_step(args.results_dir)
# builder.eval() will construct the lineage object. We call it within a
# with statement to get automatic save/cleanup when we leave the
# with block.
with builder.eval() as lineage:
... do your work here ...
# all done, write the results
lineage.write_results({'accuracy':accuracy,
'precision':precision,
'recall':recall,
'roc_auc':roc_auc})
# When leaving the with block, the lineage is automatically saved to the
# workspace. If an exception is thrown, the lineage is not saved, but the
# outputs are marked as being in an unknown state.
return 0
# boilerplate to call our main function if this is called as a script.
if __name__ == '__main__:
sys.exit(main())
"""
import sys
from abc import ABC, abstractmethod
Expand Down Expand Up @@ -217,7 +307,7 @@ class LineageBuilder:
.as_script_step()\\
.with_parameters({'gamma':0.001})\\
.with_input_path(args.intermediate_data)\\
.as_results_step().eval()
.as_results_step('../results').eval()
**Methods**
"""
Expand Down
1 change: 1 addition & 0 deletions docs/.gitignore
Original file line number Diff line number Diff line change
@@ -1 +1,2 @@
_build/
.DS_Store
2 changes: 2 additions & 0 deletions docs/lineage.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@ The Lineage API is provided by the module ``dataworkspaces.lineage``.
.. automodule:: dataworkspaces.lineage
:no-undoc-members:

Classes
-------

.. autoclass:: Lineage()
:members:
Expand Down

0 comments on commit 9a0e7e6

Please sign in to comment.