Commit
drafting serving doc
antonymayi committed Aug 22, 2021
1 parent 83874b7 commit ea393b8
Showing 13 changed files with 256 additions and 43 deletions.
6 changes: 3 additions & 3 deletions docs/faq.rst
@@ -17,7 +17,7 @@ FAQs
====

What data format is used in the pipeline between the actors?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
------------------------------------------------------------

ForML actually doesn't care. It is only responsible for wiring up the actors in the desired graph but is fairly
agnostic about the actual payload exchanged between them. It is the responsibility of the project implementor to engage
@@ -30,9 +30,9 @@ independent of the data formats being passed through.
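
For illustration, an actor might simply pass pandas structures along the graph. The following is a non-authoritative
sketch assuming the ``forml.flow.task`` actor API (the exact import path and method signatures may differ between
versions)::

    import pandas as pd

    from forml.flow import task  # assumed location of the actor API


    class NaNImputer(task.Actor):
        """Stateful actor exchanging pandas payload with its neighbours."""

        def __init__(self):
            self._fill: dict = {}

        def train(self, features: pd.DataFrame, label: pd.Series) -> None:
            # remember the per-column means observed in the training data
            self._fill = features.mean(numeric_only=True).to_dict()

        def apply(self, features: pd.DataFrame) -> pd.DataFrame:
            # impute missing values using the means captured during train
            return features.fillna(self._fill)

Any other payload type (numpy arrays, plain mappings, Spark dataframes) would work equally well as long as the
neighbouring actors understand it.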


Can a Feed engage multiple reader types so that I can mix for example file based datasources with data in a DB?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
---------------------------------------------------------------------------------------------------------------

No. It sounds like a cool idea to have a DSL interpreter that can just get raw data from any possible reader type and
natively implement the ETL operations on top of it, but since there are existing dedicated ETL platforms doing exactly
that (like the `Presto DB <https://prestodb.io/>`_, which ForML already can integrate with), trying to support the same
that (like the `Trino DB <https://trino.io/>`_, which ForML already can integrate with), trying to support the same
feature on the feed level would be unnecessarily stretching the project goals too far.
Binary file added docs/images/serving-components.odg
Binary file not shown.
Binary file added docs/images/serving-components.png
Binary file not shown.
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -89,10 +89,11 @@ Content

.. toctree::
:maxdepth: 2
:caption: Runtime Manual
:caption: Operations Manual

platform
feed
registry/index
runner/index
sink
serving
22 changes: 14 additions & 8 deletions docs/lifecycle.rst
@@ -16,8 +16,13 @@
Lifecycle
=========

Machine learning projects are operated in typical modes that are followed in a particular order. This pattern is what we
call a lifecycle. ForML supports two specific lifecycles depending on the project stage.
Machine learning projects are operated in typical stages that are followed in a particular order. This pattern is what
we call a *lifecycle*. ForML supports two specific lifecycles depending on the project stage.

.. note::
   Do not confuse the lifecycles with *operational modes*. ForML projects can be operated in a number of modes
   (:ref:`cli/batch <platform-cli>` - as used in the examples below, :doc:`interactively <interactive>` or using the
   :doc:`serving layer <serving>`), each of which is subject to a particular lifecycle.

.. _lifecycle-development:

@@ -31,7 +36,7 @@ the development process allowing to quickly see the effect of the project change

The expected behaviour of the particular mode depends on the correct project setup as per the :doc:`project` sections.

The modes of a research lifecycle are:
The stages of a development lifecycle are:

Test
Simply run through the unit tests defined as per the :doc:`testing` framework.
@@ -41,8 +46,8 @@ Test
$ python3 setup.py test

Evaluate
Perform an evaluation based on the specs defined in ``evaluation.py`` and return the metrics. This can be defined
either as cross-validation or hold-out training. One of the potential use-cases might be a CI integration
Perform a backtesting evaluation based on the specs defined in ``evaluation.py`` and return the metrics. This can be
defined either as cross-validation or hold-out training. One of the potential use-cases might be a CI integration
to continuously monitor (evaluate) the changes in the project development.

Example::
Expand All @@ -59,7 +64,8 @@ Tune

Train
Run the pipeline in the standard train mode. This will produce all the defined models but since it won't persist
them, this mode is useful merely for testing the training (or displaying the task graph on the :doc:`graphviz`).
them, this mode is useful merely for testing the training (or displaying the task graph on the
:doc:`Graphviz runner <runner/graphviz>`).

Example::

@@ -96,8 +102,8 @@ becomes available for the *production lifecycle*. Contrary to the research, this
the project source code working copy as it operates solely on the published artifact plus potentially previously
persisted model generations.

The production lifecycle is operated using the CLI (see :doc:`runtime` for full synopsis) and offers the following
modes:
The production lifecycle is either exercised in batch mode using :ref:`the CLI <platform-cli>` or
embedded within a :doc:`serving layer <serving>`. In any case, the stages of the production lifecycle are:

Train
Fit (incrementally) the stateful parts of the pipeline using new labelled data producing a new *Generation* of
7 changes: 4 additions & 3 deletions docs/platform.rst
@@ -16,12 +16,13 @@
Platform Setup
==============

Platform is a configuration-driven selection of particular *providers* implementing four abstract concepts:
Platform is a configuration-driven selection of particular *providers* implementing a number of abstract concepts:

* :doc:`runner/index`
* :doc:`registry/index`
* :doc:`feed`
* :doc:`sink`
* :ref:`Serving components <serving-components>`
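
A platform is then assembled by selecting one provider per concept in the configuration file. The snippet below is an
illustrative sketch only — the ``dask`` and ``filesystem`` references mirror the bundled providers, but the section
layout, key names and options are assumptions to be checked against the individual provider docs:

.. code-block:: toml

    [RUNNER.compute]
    provider = "dask"           # use the bundled Dask runner

    [REGISTRY.homedir]
    provider = "filesystem"     # file-based model registry
    path = "~/.forml/registry"  # hypothetical option name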

ForML uses an internal *bank* of available provider implementations of the different possible types. Provider instances
are registered in this bank using one of two possible *references*:
@@ -134,8 +135,8 @@ using a config file specified in the top-level ``logcfg`` option in the main `co
CLI
---

The production :doc:`lifecycle <lifecycle>` management can be fully operated from command-line using the following
syntax:
The production :doc:`lifecycle <lifecycle>` management can be fully operated in a batch mode from command-line using
the following syntax:

.. code-block:: none
95 changes: 69 additions & 26 deletions docs/project.rst
@@ -16,6 +16,23 @@
Project
=======

+--------------------------+----------------------------------------------+
| Pipeline                 |                                              |
+--------------------------+----------------------------------------------+
| Rollout Strategy         |                                              |
+--------------------------+----------------------------------------------+
| Dataset Specification    | Query/ies, ordinal column, label extraction  |
+--------------------------+----------------------------------------------+
| Training Schedule        |                                              |
+--------------------------+----------------------------------------------+
| Evaluation Schedule      |                                              |
+--------------------------+----------------------------------------------+
| Loss Function            |                                              |
+--------------------------+----------------------------------------------+
| Evaluation Strategy      |                                              |
+--------------------------+----------------------------------------------+
| Hyperparameter Tuning    |                                              |
+--------------------------+----------------------------------------------+

Starting New Project
--------------------

@@ -43,6 +60,7 @@ project component structure wrapped within the python application layout might l

<project_name>
  ├── setup.py
  ├── rollout.py
  ├── <optional_project_namespace>
  │    └── <project_name>
  │         ├── __init__.py
@@ -51,7 +69,9 @@ project component structure wrapped within the python application layout might l
  │         │    ├── <moduleX>.py  # arbitrary user defined module
  │         │    └── <moduleY>.py
  │         ├── source.py
  │         └── evaluation.py  # here the component is just a module
  │         ├── evaluation.py  # here the component is just a module
  │         ├── schedule.py
  │         └── tuning.py
  ├── tests
  │    ├── __init__.py
  │    ├── test_pipeline.py
@@ -98,8 +118,8 @@ the custom locations of its project components using the ``component`` parameter

.. _project-pipeline:

Pipeline (``pipeline.py``)
''''''''''''''''''''''''''
Pipeline Topology (``pipeline.py``)
'''''''''''''''''''''''''''''''''''

Pipeline definition is the heart of the project component structure. The framework needs to understand the
pipeline as a *Directed Acyclic Task Dependency Graph*. For this purpose, it comes with a concept of *Operators* that
@@ -118,32 +138,11 @@ exposed to the framework via the ``component.setup()`` handler::
FLOW = preprocessing.NaNImputer() >> model.LR(random_state=42, solver='lbfgs')
component.setup(FLOW)

.. _project-evaluation:

Evaluation (``evaluation.py``)
''''''''''''''''''''''''''''''

Definition of the model evaluation strategy for both the development and production lifecycle.

.. note:: The whole evaluation implementation is an interim solution; a more robust concept with a different API is
   on the roadmap.

The evaluation strategy again needs to be submitted to the framework using the ``component.setup()`` handler::

from sklearn import model_selection, metrics
from forml.project import component
from forml.lib.flow.operator.folding import evaluation

EVAL = evaluation.MergingScorer(
crossvalidator=model_selection.StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
metric=metrics.log_loss)
component.setup(EVAL)


.. _project-source:

Source (``source.py``)
''''''''''''''''''''''
Dataset Specification (``source.py``)
'''''''''''''''''''''''''''''''''''''

This component is a fundamental part of the :doc:`IO concept<io>`. A project can define the ETL process of sourcing
data into the pipeline using the :doc:`DSL <dsl>` referring to some :ref:`catalogized schemas
@@ -156,6 +155,9 @@ example below or documented in the :ref:`Source Descriptor Reference <io-source-
composition domain is separate from the main pipeline so adding an operator to the source composition vs
pipeline composition might have a different effect.

Part of the dataset specification can also be a reference to the *ordinal* column (used for determining data ranges for
splitting or incremental operations) and *label* columns for supervised learning/evaluation.

The Source descriptor again needs to be submitted to the framework using the ``component.setup()`` handler::

from forml.lib.flow.operator import cast
@@ -179,6 +181,47 @@ The Source descriptor again needs to be submitted to the framework using the ``c
component.setup(ETL)
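
Putting the pieces together, a complete source descriptor might look roughly like the following sketch — here the
``mycatalog.titanic`` schema is purely hypothetical and the ``Source.query`` factory signature (including the
``ordinal`` keyword) is an assumption to be verified against the Source Descriptor Reference::

    from forml.project import component

    from mycatalog import titanic  # hypothetical catalogized schema

    # DSL query selecting the feature columns
    FEATURES = titanic.Passenger.select(
        titanic.Passenger.Pclass,
        titanic.Passenger.Age,
        titanic.Passenger.Fare,
    )

    # assumed factory signature - label and ordinal column references attached
    ETL = component.Source.query(
        FEATURES,
        titanic.Passenger.Survived,        # label column for supervised training
        ordinal=titanic.Passenger.Ticket,  # ordinal column for range splitting
    )

    component.setup(ETL)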


.. _project-evaluation:

Evaluation Strategy (``evaluation.py``)
'''''''''''''''''''''''''''''''''''''''

Definition of the model evaluation strategy for both the development (backtesting) and production
:doc:`lifecycle <lifecycle>`.

.. note:: The whole evaluation implementation is an interim solution; a more robust concept with a different API is
   on the roadmap.

The evaluation strategy again needs to be submitted to the framework using the ``component.setup()`` handler::

from sklearn import model_selection, metrics
from forml.project import component
from forml.lib.flow.operator.folding import evaluation

EVAL = evaluation.MergingScorer(
crossvalidator=model_selection.StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
metric=metrics.log_loss)
component.setup(EVAL)


.. _project-tuning:

Hyperparameter Tuning Strategy (``tuning.py``)
''''''''''''''''''''''''''''''''''''''''''''''


.. _project-schedule:

Scheduling Rules (``schedule.py``)
''''''''''''''''''''''''''''''''''


.. _project-rollout:

Rollout Strategy (``rollout.py``)
''''''''''''''''''''''''''''''''''


Tests
'''''

9 changes: 9 additions & 0 deletions docs/registry/index.rst
@@ -33,3 +33,12 @@ API

.. autoclass:: forml.runtime.asset.persistent.Registry
:members:


Providers
---------

.. toctree::
:maxdepth: 2

filesystem
10 changes: 10 additions & 0 deletions docs/runner/index.rst
@@ -33,3 +33,13 @@ API

.. autoclass:: forml.runtime.Runner
:members:


Providers
---------

.. toctree::
:maxdepth: 2

dask
graphviz
