updated readme to refer to quickstart rather than include directly
jfischer committed Jan 14, 2020
1 parent 10ec0b6 commit 3118bb2
Showing 1 changed file with 3 additions and 111 deletions.
114 changes: 3 additions & 111 deletions README.rst
@@ -21,117 +21,9 @@ Windows Subsystem for Linux.

Quick Start
===========
Here is a quick example to give you a flavor of the project, using
`scikit-learn <https://scikit-learn.org>`_
and the famous digits dataset running in a Jupyter Notebook.

First, install the library::

    pip install dataworkspaces

Now, we will create a workspace::

    mkdir quickstart
    cd ./quickstart
    dws init --create-resources code,results

This created our *workspace* (which is a git repository under the covers)
and initialized it with two subdirectories,
one for the source code, and one for the results. These are special
subdirectories, in that they are *resources* which can be tracked and versioned
independently.

Now, we are going to add our source data to the workspace. This resides in an
external, third-party git repository. It is simple to add::

    git clone https://github.com/jfischer/sklearn-digits-dataset.git
    dws add git --role=source-data --read-only ./sklearn-digits-dataset

The first line (``git clone ...``) makes a local copy of the Git repository for the
Digits dataset. The second line (``dws add git ...``) adds the repository to the workspace
as a resource to be tracked as part of our project. The ``--role`` option tells Data
Workspaces how we will use the resource (as source data), and the ``--read-only``
option indicates that we should treat the repository as read-only and never try to
push it to its ``origin`` (as you do not have write permissions to the ``origin``
copy of this repository).

Now, we can create a Jupyter notebook for running our experiments::

    cd ./code
    jupyter notebook

This will bring up the Jupyter app in your browser. Click on the *New*
dropdown (on the right side) and select "Python 3". Once in the notebook,
click on the current title ("Untitled", at the top, next to "Jupyter")
and change the title to ``digits-svc``.

Now, type the following Python code in the first cell::

    import numpy as np
    from os.path import join
    from sklearn.svm import SVC
    from dataworkspaces.kits.scikit_learn import load_dataset_from_resource,\
                                                 train_and_predict_with_cv
    RESULTS_DIR='../results'
    dataset = load_dataset_from_resource('sklearn-digits-dataset')
    train_and_predict_with_cv(SVC, {'gamma':[0.01, 0.001, 0.0001]}, dataset,
                              RESULTS_DIR, random_state=42)

Now, run the cell. It will take a few seconds to train and test the
model. You should then see::

    Best params were: {'gamma': 0.001}
    accuracy: 0.99
    classification report:
                  precision    recall  f1-score   support
             0.0       1.00      1.00      1.00        33
             1.0       1.00      1.00      1.00        28
             2.0       1.00      1.00      1.00        33
             3.0       1.00      0.97      0.99        34
             4.0       1.00      1.00      1.00        46
             5.0       0.98      0.98      0.98        47
             6.0       0.97      1.00      0.99        35
             7.0       0.97      0.97      0.97        34
             8.0       1.00      1.00      1.00        30
             9.0       0.97      0.97      0.97        40
       micro avg       0.99      0.99      0.99       360
       macro avg       0.99      0.99      0.99       360
    weighted avg       0.99      0.99      0.99       360
    Wrote results to results:results.json
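
For readers curious about what a cross-validated grid search of this kind
involves, here is a minimal sketch using plain scikit-learn. It is not the
implementation of ``train_and_predict_with_cv``. The helper name
``grid_search_svc``, the 80/20 split, the five-fold cross validation, and the
use of scikit-learn's bundled copy of the digits data (rather than the
``sklearn-digits-dataset`` resource) are all assumptions made for illustration.
It also writes no ``results.json`` and records no lineage.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC


def grid_search_svc(param_grid, test_size=0.2, random_state=42):
    """Hypothetical helper: tune an SVC by cross validation, then score it
    on a held-out test split. Returns (best_params, accuracy, report)."""
    # Assumption: scikit-learn's bundled digits data stands in for the resource.
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)

    # Try every parameter combination with 5-fold cross validation
    # on the training split, then refit on the best combination.
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X_train, y_train)

    # Evaluate the refit best estimator on data it has not seen.
    predictions = search.predict(X_test)
    return (search.best_params_,
            accuracy_score(y_test, predictions),
            classification_report(y_test, predictions))


if __name__ == "__main__":
    best_params, accuracy, report = grid_search_svc(
        {"gamma": [0.01, 0.001, 0.0001]})
    print("Best params were:", best_params)
    print("accuracy: %.2f" % accuracy)
    print(report)
```

The exact numbers may differ from the output shown above, since the split and
cross-validation settings here are guesses.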

Now, you can save and shut down your notebook. If you look at the
directory ``quickstart/results``, you should see a ``results.json``
file with information about your run.
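
If you would rather inspect the file from Python than open it in an editor,
a minimal sketch follows. The helper names ``load_results`` and
``show_results`` are hypothetical, and the code deliberately assumes nothing
about which keys the file contains.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Hypothetical helper: return the parsed contents of results.json."""
    with open(Path(results_dir) / "results.json", encoding="utf-8") as f:
        return json.load(f)


def show_results(results_dir):
    """Pretty-print whatever was recorded, without assuming particular keys."""
    print(json.dumps(load_results(results_dir), indent=2, sort_keys=True))
```

For example, ``show_results("results")`` run from the ``quickstart``
directory would print the contents of ``quickstart/results/results.json``.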

Next, let us take a *snapshot*, which will record the state of
the workspace and save the data lineage along with our results::

    dws snapshot -m "first run with SVC" SVC-1

``SVC-1`` is the *tag* of our snapshot.
If you look in ``quickstart/results``, you will see that the results
(currently just ``results.json``) have been moved to the subdirectory
``snapshots/HOSTNAME-SVC-1`` (where ``HOSTNAME`` is the hostname of your
local machine). A file, ``lineage.json``, containing a full
data lineage graph for our experiment has also been
created in that directory.

Some things you can do from here:

* Run more experiments and save their results by snapshotting the workspace.
  If, at some point, we want to go back to our first experiment, we can run
  ``dws restore SVC-1``. This will restore the state of the source data and
  code subdirectories, but leave the full history of the results.
* Upload your workspace to GitHub or any other Git hosting service,
  either to keep a backup copy or to share it with others.
  Others can download it via ``dws clone``.
* More complex scenarios involving multi-step data pipelines can easily
  be automated. See the documentation for details.

Please see the
`Quickstart Section <https://data-workspaces-core.readthedocs.io/en/latest/intro.html#quick-start>`_
of the documentation.

Documentation
=============