This section is a guide for people working on the development of Data Workspaces or people who wish to extend it (e.g. through their own resource types or kits).
In summary:
- Install Python 3 via the Anaconda distribution.
- Make sure you have the git and make utilities on your system.
- Create a virtual environment called dws and activate it.
- Install mypy via pip.
- Clone the main data-workspaces-core repo.
- Install the dataworkspaces package into your virtual environment.
- Download and configure rclone.
- Run the tests.
Here are the details:
We recommend using Anaconda3 for development, as it can be easily installed on Mac and Linux and includes most of the packages you will need for development and data science projects. You will also need some basic system utilities: git and make (which may already be installed).
Once you have Anaconda3 installed, create a virtual environment for your work:
conda create --name dws
To activate the environment:
conda activate dws
You will need the mypy type checker, which is run as part of the tests. It is best to install via pip to get the latest version (some older versions may be buggy). Once you have activated your environment, mypy may be installed as follows:
pip install mypy
Next, clone the Data Workspaces main source tree:
git clone git@github.com:data-workspaces/data-workspaces-core.git
Now, we install the data workspaces library via pip, using editable mode so that changes to the source tree are immediately visible:
cd data-workspaces-core
pip install --editable `pwd`
With this setup, you should not have to configure PYTHONPATH.
Next, we install rclone, a file-copying utility used by the rclone resource. You can download the latest rclone executable from http://rclone.org; just make sure that the executable is available in your executable path. Alternatively, on Linux, you can install rclone via your system's package manager. To configure rclone, see the instructions here <rclone_config> in the Resource Reference.
Now, you should be ready to run the tests:
cd tests
make test
The tests will print a lot to the console. If all goes well, it should end with something like this:
----------------------------------------------------------------------
Ran 40 tests in 23.664s
OK
Here is a block diagram of the system architecture:
The core of the system is the Workspaces API, which is located in dataworkspaces/workspace.py. This API provides a base Workspace class for the workspace and a base Resource class for resources. There are also several mixin classes which define extensions to the basic functionality. See Core Workspace and Resource API <workspace_api> for details.
The Workspace class has one or more backends which implement the storage and management of the workspace metadata. Currently, there is only one complete backend, the git backend, which stores its metadata in a git repository.
Independent of the workspace backend are resource types, which provide concrete implementations of the Resource base class. These currently include resource types for git, git subdirectories, rclone, and local files.
Above the workspace API sits the Command API, which implements command functions for operations like init, clone, push, pull, snapshot and restore. There is a thin layer above this for programmatic access (dataworkspaces.api) as well as a full command line interface, implemented using the click package (see dataworkspaces.dws).
The Lineage API is also implemented on top of the basic workspace interface and provides a way for pipeline steps to record their inputs, outputs, and code dependencies.
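Conceptually, a lineage record ties each pipeline step to the data and code that produced its outputs. The following is a simplified, self-contained sketch of that idea; it does not use the actual dataworkspaces.lineage API, and the function and field names here are illustrative only.

```python
import hashlib
import json

def record_step_lineage(step_name, parameters, input_paths, output_paths, code_version):
    """Build a simplified lineage record for one pipeline step.

    Mirrors the idea of the Lineage API: each step captures its inputs,
    outputs, parameters, and code dependencies so that any result can be
    traced back to the data and code that produced it.
    """
    record = {
        "step": step_name,
        "parameters": parameters,
        "inputs": input_paths,
        "outputs": output_paths,
        "code_version": code_version,
    }
    # Hashing the canonical serialization gives a stable id for this run.
    serialized = json.dumps(record, sort_keys=True)
    record["hash"] = hashlib.sha1(serialized.encode("utf-8")).hexdigest()
    return record

# Hypothetical step: paths and parameters below are made up.
rec = record_step_lineage(
    "featurize",
    {"ngrams": 2},
    ["source-data/raw.csv"],
    ["intermediate-data/features.csv"],
    "git:1a2b3c",
)
print(rec["step"], rec["hash"][:8])
```

Chaining such records (each step's inputs matching an earlier step's outputs) is what lets the real Lineage API reconstruct the full provenance of a result.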
Finally, kits provide integration with user-facing libraries and applications, such as Scikit-learn and Jupyter notebooks.
The code is organized as follows:

dataworkspaces/
  api.py - API to run a subset of the workspace commands from Python. This is useful for building integrations.
  dws.py - the command line interface
  errors.py - common exception class definitions
  lineage.py - the generic lineage API
  workspace.py - the core API for workspaces and resources
  backends/ - implementations of workspace backends
  utils/ - lower-level utilities used by the upper layers
  resources/ - implementations of the resource types
  commands/ - implementations of the individual dws commands
  third_party/ - third-party code (e.g. git-fat)
  kits/ - adapters to specific external technologies
When using the git backend, a data workspace is contained within a Git repository. The metadata about resources, snapshots and lineage is stored in the subdirectory .dataworkspace. The various resources can be other subdirectories of the workspace's repository or may be external to the main Git repo.
The layout for the files under the .dataworkspace directory is as follows:

.dataworkspace/
  config.json - overall configuration (e.g. workspace name, global params)
  local_params.json - local parameters (e.g. hostname); not checked into git
  resources.json - lists all resources and their config parameters
  resource_local_params.json - configuration for resources that is local to this machine (e.g. path to the resource); not checked into git
  current_lineage/ - contains lineage files reflecting the current state of each resource; not checked into git
  file/ - contains metadata for local-files-based resources; in particular, has the file-level hash information for snapshots
  snapshots/ - snapshot metadata
    snapshot-<HASHCODE>.json - lists the hashes for each resource in the snapshot. The hash of this file is the hash of the overall snapshot.
    snapshot_history.json - metadata for the past snapshots
  snapshot_lineage/ - contains lineage data for past snapshots
    <HASHCODE>/ - directory containing the lineage files current at the time of the snapshot associated with the hashcode. Unlike current_lineage, this is checked into git.
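The snapshot hashing scheme above can be sketched as follows. This is a minimal illustration of the idea that the overall snapshot hash is the hash of the file listing the per-resource hashes; the resource names and hash values are made up, and the actual file format and hash algorithm used by dataworkspaces may differ.

```python
import hashlib
import json

# Illustrative per-resource hashes, as a snapshot-<HASHCODE>.json file
# might list them (names and values here are fabricated).
resource_hashes = {
    "code": "f" * 40,
    "source-data": "0" * 40,
    "results": "9" * 40,
}

# Serialize the listing canonically, as if writing the snapshot file.
snapshot_file_content = json.dumps(resource_hashes, sort_keys=True).encode("utf-8")

# The hash of that file serves as the hash of the overall snapshot, so
# the snapshot id changes whenever any resource's hash changes.
snapshot_hash = hashlib.sha1(snapshot_file_content).hexdigest()
print(snapshot_hash)
```

This layering (per-resource hashes rolled up into one content-addressed file) is the same trick git uses for tree objects.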
In designing the workspace database, we follow these guidelines:
- Use JSON as our file format where possible - it is human-readable and editable, and easy to work with from within Python.
- Local or uncommitted state is not stored in Git (we use .gitignore to keep those files outside of the repo). Such files are initialized by dws init and dws clone.
- Avoid git merge conflicts by storing data in separate files where possible. For example, the resources.json file should really be broken up into one file per resource, stored under a common directory (see issue #13).
- Use git's design as an inspiration. It provides an efficient and flexible representation.
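The one-file-per-resource idea from the guidelines can be sketched as below. This is an illustrative layout, not the current on-disk format: with each resource's parameters in its own JSON file, concurrent changes to different resources touch different files and so do not collide in git merges.

```python
import json
import os
import tempfile

# Hypothetical workspace with a per-resource metadata directory.
workspace_dir = tempfile.mkdtemp()
resources_dir = os.path.join(workspace_dir, ".dataworkspace", "resources")
os.makedirs(resources_dir)

# Two made-up resources; the parameter keys are illustrative.
resources = {
    "raw-data": {"type": "git-subdirectory", "role": "source-data"},
    "results": {"type": "local-files", "role": "results"},
}

# One JSON file per resource instead of a single resources.json.
for name, params in resources.items():
    with open(os.path.join(resources_dir, name + ".json"), "w") as f:
        json.dump(params, f, indent=2, sort_keys=True)

print(sorted(os.listdir(resources_dir)))
```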
The bulk of the work for each command is done by the core Workspace API and its backends. The command function itself (dataworkspaces.commands.COMMAND_NAME) performs parameter checking, calls the associated parts of the Workspace API, and handles user interactions when needed.
Resources are orthogonal to commands and represent the collections of files to be versioned.
A resource may have one of four roles:
- Source Data Set - this should be treated read-only by the ML pipeline. Source data sets can be versioned.
- Intermediate Data - derived data created from the source data set(s) via one or more data pipeline stages.
- Results - the outputs of the machine learning / data science process.
- Code - code used to create the intermediate data and results, typically in a git repository or Docker container.
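In code, the four roles are represented as string constants. The following is a simplified stand-in for the library's ResourceRoles; the exact string values used by dataworkspaces may differ.

```python
# Simplified stand-in for the resource role constants (values illustrative).
class ResourceRoles:
    SOURCE_DATA_SET = "source-data"
    INTERMEDIATE_DATA = "intermediate-data"
    RESULTS = "results"
    CODE = "code"

RESOURCE_ROLE_CHOICES = [
    ResourceRoles.SOURCE_DATA_SET,
    ResourceRoles.INTERMEDIATE_DATA,
    ResourceRoles.RESULTS,
    ResourceRoles.CODE,
]

def is_read_only(role):
    # Source data sets are treated as read-only by the ML pipeline.
    return role == ResourceRoles.SOURCE_DATA_SET

print(is_read_only(ResourceRoles.SOURCE_DATA_SET))
```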
The treatment of resources may vary based on the role. We now look at resource functionality per role.
We want the ability to name source data sets and swap them in and out without changing other parts of the workspace. This still needs to be implemented.
For intermediate data, we may want to delete it from the current state of the workspace if it becomes out of date (e.g. a data source version is changed or swapped out). This still needs to be implemented.
In general, results should be additive.
For the snapshot command, we move the results to a specific subdirectory per snapshot. The name of this subdirectory is determined by a template that can be changed by setting the parameter results.subdir. By default, the template is {DAY}/{DATE_TIME}-{USER}-{TAG}. The moving of files is accomplished via the method results_move_current_files(rel_path, exclude) on the Resource <resources> class. The snapshot() method of the resource is still called as usual, after the result files have been moved.
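How the template might be expanded can be sketched with Python's str.format. The template string below is the documented default; the expansion function, its name, and the exact timestamp formats are assumptions for illustration, not the library's actual logic.

```python
import datetime

# Documented default template for results.subdir.
template = "{DAY}/{DATE_TIME}-{USER}-{TAG}"

def expand_results_subdir(template, when, user, tag):
    """Illustrative expansion of the results.subdir template.

    The date formats chosen here are assumptions; the real
    implementation may format these fields differently.
    """
    return template.format(
        DAY=when.strftime("%Y-%m-%d"),
        DATE_TIME=when.strftime("%Y-%m-%dT%H.%M.%S"),
        USER=user,
        TAG=tag,
    )

when = datetime.datetime(2019, 5, 1, 13, 30, 0)
print(expand_results_subdir(template, when, "alice", "baseline"))
# -> 2019-05-01/2019-05-01T13.30.00-alice-baseline
```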
Individual files may be excluded from being moved to a subdirectory. This is done through a configuration command. We still need to decide where this configuration would be stored (perhaps in the resources.json file?). The excluded files would be passed in the exclude set to results_move_current_files.
If we run restore to revert the workspace to an older state, we should not revert the results database. It should always be kept at the latest version. This is done by always putting results resources into the leave set, as if specified in the --leave option. If the user puts a results resource in the --only set, we will error out for now.
The module dataworkspaces.api provides a simplified, high-level programmatic interface to Data Workspaces. It is intended for integration with third-party tooling.
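The "thin layer" pattern used by such a programmatic interface can be sketched as follows. All names and signatures here are illustrative stand-ins, not the real dataworkspaces.api functions: the API function validates its arguments, then delegates to the underlying command implementation.

```python
def _snapshot_command(workspace_dir, tag, message):
    # Stand-in for the underlying command implementation
    # (dataworkspaces.commands.*); returns a fabricated snapshot hash.
    return "deadbeef"

def take_snapshot(workspace_dir=".", tag=None, message=""):
    """Illustrative thin API wrapper: validate, then delegate."""
    if tag is not None and not tag.replace("-", "").isalnum():
        raise ValueError("Invalid snapshot tag: %s" % tag)
    return _snapshot_command(workspace_dir, tag, message)

print(take_snapshot(tag="v1", message="first cut"))
```

Keeping the public functions this thin means third-party tools and the command line interface exercise the same underlying code paths.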
dataworkspaces.api
Here is the detailed documentation for the Core Workspace API, found in dataworkspaces.workspace.
dataworkspaces.workspace
Workspace
ResourceRoles
Resource
RESOURCE_ROLE_CHOICES
WorkspaceFactory
load_workspace
find_and_load_workspace
init_workspace
ResourceFactory
FileResourceMixin
LocalStateResourceMixin
Workspace backends should inherit from either SyncedWorkspaceMixin or CentralWorkspaceMixin.
SyncedWorkspaceMixin
CentralWorkspaceMixin
To support snapshots, the interfaces defined by SnapshotWorkspaceMixin and SnapshotResourceMixin should be implemented by workspace backends and resources, respectively. SnapshotMetadata defines the metadata to be stored for each snapshot.
SnapshotMetadata
SnapshotWorkspaceMixin
SnapshotResourceMixin
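The mixin layering described above can be sketched as a small class hierarchy: a backend picks a synchronization style (synced vs. central) and optionally adds snapshot support. The class and method bodies below are illustrative toys modeled on that structure, not the real API.

```python
class Workspace:
    """Toy stand-in for the base workspace class."""
    def __init__(self, name):
        self.name = name

class SyncedWorkspaceMixin:
    """Toy mixin: workspaces synced between clones (like the git backend)."""
    def pull(self):
        return "pulled %s" % self.name

class SnapshotWorkspaceMixin:
    """Toy mixin: adds snapshot support on top of the base workspace."""
    def snapshot(self, tag):
        return "snapshot %s of %s" % (tag, self.name)

class GitBackendWorkspace(SyncedWorkspaceMixin, SnapshotWorkspaceMixin, Workspace):
    """A git-style backend: synced between clones, supports snapshots."""

ws = GitBackendWorkspace("my-project")
print(ws.pull())
print(ws.snapshot("v1"))
```

The point of the mixin design is that a backend only implements the capabilities it actually supports; a central (non-synced) backend would simply mix in different classes.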