Update docs
daizutabi committed Jun 1, 2020
1 parent 95061b7 commit b49e39d
Showing 22 changed files with 312 additions and 204 deletions.
11 changes: 11 additions & 0 deletions docs/api/ivory.callbacks.md
@@ -0,0 +1,11 @@
# Callbacks

## ivory.callbacks.results

![mkapi](ivory.callbacks.results.Results)

![mkapi](ivory.callbacks.results.concatenate)

## ivory.callbacks.early_stopping

![mkapi](ivory.callbacks.early_stopping.EarlyStopping)
42 changes: 42 additions & 0 deletions docs/index.md
@@ -11,3 +11,45 @@ Ivory is library-agnostic. You can use it with any machine learning library.
Get started using the Quickstart.

- [Quickstart](quickstart)

{{ ## cache:clear }}

Or take a look at the code below.

```python
import numpy as np

from ivory.callbacks.results import Results
from ivory.core.data import Data, Dataset, Datasets
from ivory.core.run import Run
from ivory.sklearn.estimator import Estimator
from ivory.sklearn.metrics import Metrics

data = Data()
data.index = np.arange(30)
data.input = np.arange(60).reshape(30, -1)
data.target = np.sum(data.input, axis=1)
data.fold = data.index % 4
datasets = Datasets(data, Dataset, fold=0)

estimator = Estimator(
    model='sklearn.ensemble.RandomForestRegressor',
    n_estimators=10,
    max_depth=5,
)

run = Run(
    name='first example',
    datasets=datasets,
    estimator=estimator,
    results=Results(),
    metrics=Metrics()
)
run.start()
```

```python
import matplotlib.pyplot as plt

plt.scatter(run.results.val.target, run.results.val.output)
```
99 changes: 50 additions & 49 deletions docs/quickstart.md
@@ -10,9 +10,9 @@ Install Ivory using `pip`.
$ pip install ivory
~~~

## Using an Ivory Client
## Ivory Client

Ivory has the `Client` class that manages the workflow of machine learning. Let's create your first `Client` instance. In this quickstart, we are working with examples under the `examples` directory.
Ivory has the `Client` class that manages the workflow of machine learning. Let's create your first `Client` instance. In this quickstart, we are working with examples under the `examples` directory. Pass `examples` to the first argument of `ivory.create_client()`:

```python hide
import os
@@ -29,42 +29,42 @@ client = ivory.create_client("examples")
client
```

The representation of the `client` shows that it has two objects. These objects can be accessed by *index notation* or *dot notation*.
The representation of the `client` shows that it has two instances. These instances can be accessed by *index notation* or *dot notation*.

```python
client[0] # or client['tracker'], or client.tracker
```

The first object is a `Tracker` instance which connects Ivory to [MLFlow Tracking](https://mlflow.org/docs/latest/tracking.html).
The first instance is a `Tracker` instance that connects Ivory to [MLFlow Tracking](https://mlflow.org/docs/latest/tracking.html).

Because a `Client` instance is an iterable, you can get all of the objects by applying `list()` to it.
Because a `Client` instance is an iterable, you can get all of the instances by applying `list()` to it.

```python
list(client)
```
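
As a rough illustration of how *index notation* and *dot notation* can coexist on one object, here is a minimal sketch in plain Python. `Container` is a hypothetical class written only for this example; it is not Ivory's `Client`, and it assumes that iteration yields the stored objects.

```python
class Container:
    """Hypothetical stand-in for a client-like object (not Ivory's code)."""

    def __init__(self, **instances):
        # A dict keeps insertion order, so positions are stable.
        self._instances = dict(instances)

    def __getitem__(self, key):
        # An integer selects by position, a string selects by name.
        if isinstance(key, int):
            return list(self._instances.values())[key]
        return self._instances[key]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self.__dict__["_instances"][name]
        except KeyError:
            raise AttributeError(name) from None

    def __iter__(self):
        return iter(self._instances.values())


container = Container(tracker="a tracker", tuner="a tuner")
# All three notations reach the same object.
same = container[0] == container["tracker"] == container.tracker
print(same, list(container))
```

Supporting `__iter__` is what makes `list(container)` work in this sketch.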

The second objects is named `tuner`.
The second instance is named `tuner`.

```python
client.tuner
```

A `Tuner` instance connects Ivory to [Optuna: A hyperparameter optimization framework](https://preferred.jp/en/projects/optuna/).

We can customize these objects with a YAML file named `client.yml` under the woking directory. In our case, the file just contains the minimum settings.
We can customize these objects with a YAML file named `client.yml` under the working directory. In our case, the file just contains the minimum settings.

#File client.yml {%=/examples/client.yml%}

!!! note
    A YAML file for client is not required. If there is no file for client, Ivory creates a default client with a tracker and without a tuner.
    If you don't need any customization, the YAML file for client is not required. If there is no file for client, Ivory creates a default client with a tracker and tuner. (So, the above file is unnecessary.)

    If you don't need a tracker, for example in debugging, use `ivory.create_client(tracker=False)`.
    If you don't need a tracker and/or tuner, for example in debugging, use `ivory.create_client(tracker=False, tuner=False)`.

## Create NumPy data

In this quickstart, we try to predict rectangles area from thier width and height using [PyTorch](https://pytorch.org/). First, prepare the data as [NumPy](https://numpy.org/) arrays. In `rectangle/data.py` under the working directory, a `create_data()` function is defined. The `ivory.create_client()` function automatically inserts the working directory to `sys.path`, so that we can import the module regardless of the current directory.
In this quickstart, we try to predict rectangles area from their width and height using [PyTorch](https://pytorch.org/). First, prepare the data as [NumPy](https://numpy.org/) arrays. In `rectangle/data.py` under the working directory, a `create_data()` is defined. The `ivory.create_client()` automatically inserts the working directory to `sys.path`, so that we can import the module regardless of the current directory.

Let's check the `create_data()` function defined in `rectangle/data.py` and an example output:
Let's check the `create_data()` defined in `rectangle/data.py` and an example output:

```python hide
import rectangle.data
@@ -83,11 +83,7 @@ xy
z
```

## Set of Data classes

Ivory defines a set of Data classes (`Data`, `Dataset`, `Datasets`). But now, we use the `Data` class only.

In the above file, the `kfold_split()` function creates a fold array.
`ivory.utils.fold.kfold_split()` creates a fold array.

```python
import numpy as np
@@ -96,7 +92,11 @@ from ivory.utils.fold import kfold_split
kfold_split(np.arange(10), n_splits=3)
```
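
To make the idea of a fold array concrete, here is a small sketch using only the standard library. `make_fold_array()` is a hypothetical helper written for this example. It does not reproduce the implementation or the exact output of `kfold_split()`; it only shows the general shape of such an array, with one fold number per sample.

```python
import random


def make_fold_array(n_samples, n_splits, seed=0):
    """Hypothetical helper: assign each sample a fold number in [0, n_splits)."""
    # Cycle through the fold numbers so that fold sizes differ by at most one.
    folds = [i % n_splits for i in range(n_samples)]
    # Shuffle so that fold membership does not follow the sample order.
    random.Random(seed).shuffle(folds)
    return folds


folds = make_fold_array(10, n_splits=3)
print(folds)

# Samples whose fold number is 0 form the validation set of fold 0.
val_index = [i for i, fold in enumerate(folds) if fold == 0]
train_index = [i for i, fold in enumerate(folds) if fold != 0]
print(val_index, train_index)
```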

Now, we can get a `Data` instance.
## Set of Data Classes

Ivory defines a set of base classes for data (`Data`, `Dataset`, `Datasets`, and `DataLoaders`) that user's custom classes can inherit. But now, we use the `Data` only.

Now, we can get a `rectangle.data.Data` instance.

```python
data = rectangle.data.Data()
@@ -107,11 +107,11 @@ data
data.get(0) # get data of index = 0.
```

This returned value is a tuple of (index, input, target). Ivory always keeps data index so that we can know where a sample comes from.
The returned value is a tuple of (index, input, target). Ivory always keeps data index so that we can know where a sample comes from.

## Define a model

We use a simple MLP model here.
We use a simple MLP model. Note that the number of hidden layers and the size of each hidden layer are customizable.

```python hide
import rectangle.torch
@@ -125,15 +125,15 @@ Ivory configures a run using a YAML file. Here is a full example.

#File torch.yaml {%=/examples/torch.yml%}

Let's create a run by `Client.create_run()`
Let's create a run calling the `Client.create_run()`.

```python
run = client.create_run('torch')
run
```

!!! note
    `Client.create_run(<name>)` creates an experiment named `<name>` if it hasn't existed yet. By cliking an icon (<i class="far fa-eye-slash" style="font-size:0.8rem; color: #ff8888;"></i>) in the above cell, you can see the log.
    `Client.create_run(<name>)` creates an experiment named `<name>` if it hasn't existed yet. By clicking an icon (<i class="far fa-eye-slash" style="font-size:0.8rem; color: #ff8888;"></i>) in the above cell, you can see the log.

Or you can directly create an experiment then make the experiment create a run:

@@ -142,21 +142,22 @@ run
run = experiment.create_run()
~~~

A `Run` instance has a `params` attribute that holds the parameters for the run.
A `Run` instance has an attribute `params` that holds the parameters for the run.

```python
import yaml

print(yaml.dump(run.params, sort_keys=False))
```

This is similar to the YAML file we read before, but is slightly changed by the Ivory Client.
This is similar to the YAML file we read before, but has been slightly changed.

* Run and experiment sections are inserted.
* ExperimentID and RunID are assigned by MLFlow Tracking.
* Default classes are specified, for example `ivory.torch.trainer.Trainer` for a `trainer` instance.
* Run and experiment keys are inserted.
* Run name is assigned by Ivory Client.
* Experiment ID and Run ID are assigned by MLFlow Tracking.
* Default classes are specified, for example the `ivory.torch.trainer.Trainer` class for a `trainer` instance.

The `Client.create_run()` method can take keyword arguments to modify these parameters:
The `Client.create_run()` can take keyword arguments to modify these parameters:

```python
run = client.create_run(
@@ -204,7 +205,7 @@ run.results.val.target[:5]

## Test a model

Testing a model is as simple as training. Just call `run.start('test')` instead of a (default) `'train'` argument.
Testing a model is as simple as training. Just call `Run.start('test')` instead of a (default) `'train'` argument.

```python
run.start('test')
@@ -225,23 +226,23 @@ run.results.test.target[:5]

## Task for multiple runs

Ivory implements a special run type called **Task** which controls multiple nested runs. A task is useful for parameter search or cross validation.
Ivory implements a special run type called **Task** that controls multiple nested runs. A task is useful for parameter search or cross validation.

```python
task = client.create_task('torch')
task
```

The `Task` class has two methods to generate multiple runs: `Task.product()` and `Task.chain()`. These two methods have the same functionality as [`itertools`](https://docs.python.org/3/library/itertools.html) of the Python standard library. Let's try to perform cross validation.
The `Task` class has two functions to generate multiple runs: `Task.product()` and `Task.chain()`. These two functions have the same functionality as [`itertools`](https://docs.python.org/3/library/itertools.html) of the Python standard library. Let's try to perform cross validation.

```python
runs = task.product(fold=range(4), verbose=0, epochs=3)
runs
```

Like `itertools`'s functions, `Task.product()` and `Task.chain()` return a generator, which yields runs that are configured by different parameters you specify. In this case, this generator will yield 4 runs with a fold number ranging from 0 to 3 for each. A `task` instance doesn't start any training by itself. In addtion, you can pass fixed parameters to update the original parameters in the YAML file.
Like `itertools`'s functions, `Task.product()` and `Task.chain()` return a generator, which yields runs that are configured by different parameters you specify. In this case, this generator will yield 4 runs with a fold number ranging from 0 to 3 for each. A `task` instance doesn't start any training by itself. In addition, you can pass fixed parameters to update the original parameters in the YAML file.
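
The analogy with `itertools` can be sketched with plain parameter dictionaries. The two generator functions below, `product_params()` and `chain_params()`, are hypothetical names used only in this example. They illustrate one plausible reading of "product" and "chain" for parameter search and are not taken from Ivory's source.

```python
import itertools


def product_params(**grid):
    """Yield one parameter dict per combination (Cartesian product)."""
    names = list(grid)
    for values in itertools.product(*grid.values()):
        yield dict(zip(names, values))


def chain_params(**grid):
    """Yield one parameter dict per value, varying one name at a time."""
    for name, values in grid.items():
        for value in values:
            yield {name: value}


# Nothing is computed until the generators are consumed.
product_list = list(product_params(fold=range(2), lr=[0.1, 0.01]))
chain_list = list(chain_params(fold=range(2), lr=[0.1, 0.01]))
print(product_list)
print(chain_list)
```

In this sketch the product yields 2 x 2 = 4 dictionaries, while the chain yields 2 + 2 = 4 dictionaries that each change a single parameter.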

Then start 4 runs by a `for` loop including `run.start('both')`. Here `'both'` means execution of test after training.
Then start 4 runs by a `for` loop including `run.start('both')`. Here `'both'` means successive test after training.

```python
for run in runs:
@@ -250,7 +251,7 @@ for run in runs:

## Collect runs

Our client has a `Tracker` instance. It stores the state of runs in background using MLFlow Tracking. The `Client` class provides several methods to access the stored runs. For example, `Client.search_run_ids()` returns a generator which yields RunID created by MLFlow Tracking.
Our client has a `Tracker` instance. It stores the state of runs in background using MLFlow Tracking. The `Client` provides several functions to access the stored runs. For example, `Client.search_run_ids()` returns a generator that yields Run ID assigned by MLFlow Tracking.

```python
# A helper function.
@@ -273,12 +274,12 @@ print_run_info(run_ids)
```

```python
# If `parent_run_id` is specified, nested runs having the parent are returned.
# If `parent_run_id` is specified, nested runs with the parent are returned.
run_ids = client.search_run_ids('torch', parent_run_id=task.id)
print_run_info(run_ids)
```

`Client.get_run_id()` and `Client.get_run_ids()` fetch RunID from run name, more strictly, (run class name in lower case) plus (run number).
`Client.get_run_id()` and `Client.get_run_ids()` fetch Run ID from run name, more strictly, a key-value pair of (run class name in lower case, run number).

```python
run_ids = [client.get_run_id('torch', run=0),
@@ -293,10 +294,10 @@ print_run_info(run_ids)

## Load runs and results

The `Client` instance can load runs. First select RunID(s) to load. We want to perform cross validation here, so we need the run collection created by `task#0`. In this case, we can use `Client.get_nested_run_ids()`. Why don't we use `Client.search_run_ids()` as we did above? Because we don't have an easy way to get a very long RunID after we restart a Python session and lose the `Task` instance. On the ohter hand, a run name is easy to manage and write.
A `Client` instance can load runs. First select Run ID(s) to load. We want to perform cross validation here, so we need the run collection created by `task#0`. In this case, we can use `Client.get_nested_run_ids()`. Why don't we use `Client.search_run_ids()` as we did above? Because we don't have an easy way to get a very long Run ID after we restart a Python session and lose the `Task` instance. On the other hand, a run name is easy to manage and write.

```python
# Assume that we restart a session so we have no run instances now.
# Assume that we restarted a session so we have no run instances now.
run_ids = list(client.get_nested_run_ids('torch', task=0))
print_run_info(run_ids)
```
@@ -308,7 +309,7 @@ run = client.load_run(run_ids[0])
run
```

Note that the `Client.load_run()` function doesn't require an experiment name because RunID is [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier).
Note that the `Client.load_run()` doesn't require an experiment name because Run ID is [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier).
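
The reason a UUID makes the experiment name unnecessary can be sketched as follows. The `store`, `create_run()` and `load_run()` names here are hypothetical and exist only in this example. The sketch also assumes that a Run ID looks like the hexadecimal form of a random UUID, which is not something the text above states.

```python
import uuid

# Hypothetical in-memory store keyed by run ID only.
store = {}


def create_run(experiment, name):
    run_id = uuid.uuid4().hex  # 32 hexadecimal characters
    store[run_id] = {"experiment": experiment, "name": name}
    return run_id


def load_run(run_id):
    # No experiment name is needed: the ID alone identifies the run.
    return store[run_id]


first = create_run("torch", "run#0")
second = create_run("sklearn", "run#0")
print(load_run(second))
```

Both runs share the name `run#0`, yet each can be loaded unambiguously because the IDs do not collide across experiments.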

As you expected, the fold number is 3.

@@ -334,50 +335,50 @@ print('[target]')
print(target)
```

If you don't need a whole run instance, the `Client.load_instance()` function is a better choice to save time and memory.
If you don't need a whole run instance, `Client.load_instance()` is a better choice to save time and memory.

```python
results = client.load_instance(run_ids[0], 'results')
results
```

```python
for mode in results: # Yield a mode.
    print(mode, results[mode].output.shape)
for mode, result in results.items():
    print(mode, result.output.shape)
```

For cross validation, we need 4 runs. In order to load multiple runs' results at the same time, the Ivory `Client` provides a convenient method.
For cross validation, we need 4 runs. In order to load multiple runs' results at the same time, the Ivory `Client` provides a convenient function.

```python
results = client.load_results(run_ids, verbose=False) # No progress bar.
results
```

```python
for mode, result in results.items(): # Yield a (mode, result).
for mode, result in results.items():
    print(mode, result.output.shape)
```

!!! note
    `Client.load_results()` drops train data for saving memory.

The lengths of validation data and test data are both 800 (200 times 4). But be careful about the test data. The length of unique samples is 200 (one fold size).
The lengths of the validation and test data are both 800 (200 times 4). But be careful about the test data. The length of unique samples should be 200 (one fold size).

```python
import numpy as np

len(np.unique(results.val.index)), len(np.unique(results.test.index))
```

Usually, duplicated samples in test data are averaged for ensembling. The `Results.mean()` function performs this *mean reduction* and returns a newly created `Results` instance.
Usually, duplicated samples in test data are averaged for ensembling. `Results.mean()` performs this *mean reduction* and returns a newly created `Results` instance.

```python
reduced_results = results.mean()
for mode, result in reduced_results.items():
    print(mode, result.output.shape)
```
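
The idea behind *mean reduction* can be shown on plain lists. The sketch below is only an illustration of averaging outputs that share an index; `mean_by_index()` is a hypothetical helper and does not mirror how `Results.mean()` is implemented.

```python
from collections import defaultdict


def mean_by_index(index, output):
    """Average the outputs of samples that share the same index."""
    groups = defaultdict(list)
    for i, value in zip(index, output):
        groups[i].append(value)
    unique_index = sorted(groups)
    means = [sum(groups[i]) / len(groups[i]) for i in unique_index]
    return unique_index, means


# Two models each predicted the same three test samples.
index = [0, 1, 2, 0, 1, 2]
output = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]
unique_index, means = mean_by_index(index, output)
print(unique_index, means)
```

Six predictions collapse to three, one per unique sample, which matches the drop from 800 to 200 test rows described above.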

Compare these results.
Compare these two results.

```python
index = results.test.index
@@ -393,7 +394,7 @@ print('[reduced_results]')
print(x)
```

For convenience, the `Client.load_results()` function has a `reduction` keyword argument.
For convenience, the `Client.load_results()` has a `reduction` keyword argument.

```python
results = client.load_results(run_ids, reduction='mean', verbose=False)
@@ -405,15 +406,15 @@ for mode, result in results.items():
    print(mode, result.output.shape)
```

A cross validation (CV) score can be calculated as follows:
The cross validation (CV) score can be calculated as follows:

```python
true = results.val.target
pred = results.val.output
np.mean(np.sqrt((true - pred) ** 2)) # Use any function for your metric.
```
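
One detail of the expression above is worth spelling out: the square root is taken element by element before averaging, so for real numbers it equals the mean absolute error, not the root mean squared error. The following sketch compares the two on made-up numbers. The function names and data are illustrative only and use plain Python instead of NumPy.

```python
import math


def mean_absolute_error(true, pred):
    # Same value as mean(sqrt((true - pred) ** 2)) for real numbers.
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def root_mean_squared_error(true, pred):
    # The square root is applied after averaging the squared errors.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


true = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 2.0, 3.0, 8.0]  # A single large error of 4.
mae = mean_absolute_error(true, pred)
rmse = root_mean_squared_error(true, pred)
print(mae, rmse)
```

The root mean squared error weights the single large error more heavily, so which metric you choose changes how outliers affect the CV score.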

And we got a prediction for the test data using 4 MLP models.
And we got prediction for the test data using 4 MLP models.

```python
results.test.output[:5]