Commit
drafting serving doc
antonymayi committed Aug 22, 2021
1 parent 83874b7 commit ea393b8
Showing 13 changed files with 256 additions and 43 deletions.
6 changes: 3 additions & 3 deletions docs/faq.rst
@@ -17,7 +17,7 @@ FAQs
====

What data format is used in the pipeline between the actors?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
------------------------------------------------------------

ForML actually doesn't care. It is only responsible for wiring up the actors in the desired graph but is fairly
agnostic about the actual payload exchanged between them. It is the responsibility of the project implementor to engage
@@ -30,9 +30,9 @@ independent of the data formats being passed through.
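
For illustration, an actor might simply pass pandas structures along the graph. The following is a non-authoritative
sketch assuming the ``forml.flow.task`` actor API (the exact import path and method signatures may differ between
versions)::

    import pandas as pd

    from forml.flow import task  # assumed location of the actor API


    class NaNImputer(task.Actor):
        """Stateful actor exchanging pandas payload with its neighbours."""

        def __init__(self):
            self._fill: dict = {}

        def train(self, features: pd.DataFrame, label: pd.Series) -> None:
            # remember the per-column means observed in the training data
            self._fill = features.mean(numeric_only=True).to_dict()

        def apply(self, features: pd.DataFrame) -> pd.DataFrame:
            # impute missing values using the means captured during train
            return features.fillna(self._fill)

Any other payload type (numpy arrays, plain mappings, Spark dataframes) would work equally well as long as the
neighbouring actors understand it.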


Can a Feed engage multiple reader types so that I can mix for example file based datasources with data in a DB?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
---------------------------------------------------------------------------------------------------------------

No. It sounds like a cool idea to have a DSL interpreter that can just get raw data from any possible reader type and
natively implement the ETL operations on top of it, but since there are existing dedicated ETL platforms doing exactly
that (like the `Presto DB <https://prestodb.io/>`_, which ForML already can integrate with), trying to support the same
that (like the `Trino DB <https://trino.io/>`_, which ForML already can integrate with), trying to support the same
feature on the feed level would be unnecessarily stretching the project goals too far.
Binary file added docs/images/serving-components.odg
Binary file not shown.
Binary file added docs/images/serving-components.png
Binary file not shown.
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -89,10 +89,11 @@ Content

.. toctree::
:maxdepth: 2
:caption: Runtime Manual
:caption: Operations Manual

platform
feed
registry/index
runner/index
sink
serving
22 changes: 14 additions & 8 deletions docs/lifecycle.rst
@@ -16,8 +16,13 @@
Lifecycle
=========

Machine learning projects are operated in typical modes that are followed in a particular order. This pattern is what we
call a lifecycle. ForML supports two specific lifecycles depending on the project stage.
Machine learning projects are operated in typical stages that are followed in a particular order. This pattern is what
we call a *lifecycle*. ForML supports two specific lifecycles depending on the project stage.

.. note::
   Do not confuse the lifecycles with *operational modes*. ForML projects can be operated in a number of modes
   (:ref:`cli/batch <platform-cli>` - as used in the examples below, :doc:`interactively <interactive>` or using the
   :doc:`serving layer <serving>`), each of which is subject to a particular lifecycle.

.. _lifecycle-development:

@@ -31,7 +36,7 @@ the development process allowing to quickly see the effect of the project change

The expected behaviour of the particular mode depends on the correct project setup as per the :doc:`project` sections.

The modes of a research lifecycle are:
The stages of a development lifecycle are:

Test
Simply run through the unit tests defined as per the :doc:`testing` framework.
@@ -41,8 +46,8 @@ Test
$ python3 setup.py test

Evaluate
Perform an evaluation based on the specs defined in ``evaluation.py`` and return the metrics. This can be defined
either as cross-validation or hold-out training. One of the potential use-cases might be a CI integration
Perform a backtesting evaluation based on the specs defined in ``evaluation.py`` and return the metrics. This can be
defined either as cross-validation or hold-out training. One of the potential use-cases might be a CI integration
to continuously monitor (evaluate) the changes in the project development.

Example::
Expand All @@ -59,7 +64,8 @@ Tune

Train
Run the pipeline in the standard train mode. This will produce all the defined models but since it won't persist
them, this mode is useful merely for testing the training (or displaying the task graph on the :doc:`graphviz`).
them, this mode is useful merely for testing the training (or displaying the task graph on the
:doc:`Graphviz runner <runner/graphviz>`).

Example::

@@ -96,8 +102,8 @@ becomes available for the *production lifecycle*. Contrary to the research, this
the project source code working copy as it operates solely on the published artifact plus potentially previously
persisted model generations.

The production lifecycle is operated using the CLI (see :doc:`runtime` for full synopsis) and offers the following
modes:
The production lifecycle is either exercised in batch mode using :ref:`the CLI <platform-cli>` or
embedded within a :doc:`serving layer <serving>`. In any case, the stages of the production lifecycle are:

Train
Fit (incrementally) the stateful parts of the pipeline using new labelled data producing a new *Generation* of
7 changes: 4 additions & 3 deletions docs/platform.rst
@@ -16,12 +16,13 @@
Platform Setup
==============

Platform is a configuration-driven selection of particular *providers* implementing four abstract concepts:
Platform is a configuration-driven selection of particular *providers* implementing a number of abstract concepts:

* :doc:`runner/index`
* :doc:`registry/index`
* :doc:`feed`
* :doc:`sink`
* :ref:`Serving components <serving-components>`
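
A platform is then assembled by selecting one provider per concept in the configuration file. The snippet below is an
illustrative sketch only — the ``dask`` and ``filesystem`` references mirror the bundled providers, but the section
layout, key names and options are assumptions to be checked against the individual provider docs:

.. code-block:: toml

    [RUNNER.compute]
    provider = "dask"           # use the bundled Dask runner

    [REGISTRY.homedir]
    provider = "filesystem"     # file-based model registry
    path = "~/.forml/registry"  # hypothetical option name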

ForML uses an internal *bank* of available provider implementations of the different possible types. Provider instances
are registered in this bank using one of two possible *references*:
@@ -134,8 +135,8 @@ using a config file specified in the top-level ``logcfg`` option in the main `co
CLI
---

The production :doc:`lifecycle <lifecycle>` management can be fully operated from command-line using the following
syntax:
The production :doc:`lifecycle <lifecycle>` management can be fully operated in a batch mode from command-line using
the following syntax:

.. code-block:: none
95 changes: 69 additions & 26 deletions docs/project.rst
@@ -16,6 +16,23 @@
Project
=======

+--------------------------+----------------------------------------------+
| Pipeline                 |                                              |
+--------------------------+----------------------------------------------+
| Rollout Strategy         |                                              |
+--------------------------+----------------------------------------------+
| Dataset Specification    | Query/ies, ordinal column, label extraction  |
+--------------------------+----------------------------------------------+
| Training Schedule        |                                              |
+--------------------------+----------------------------------------------+
| Evaluation Schedule      |                                              |
+--------------------------+----------------------------------------------+
| Loss Function            |                                              |
+--------------------------+----------------------------------------------+
| Evaluation Strategy      |                                              |
+--------------------------+----------------------------------------------+
| Hyperparameter Tuning    |                                              |
+--------------------------+----------------------------------------------+

Starting New Project
--------------------

@@ -43,6 +60,7 @@ project component structure wrapped within the python application layout might l

<project_name>
  ├── setup.py
  ├── rollout.py
  ├── <optional_project_namespace>
  │    └── <project_name>
  │         ├── __init__.py
@@ -51,7 +69,9 @@ project component structure wrapped within the python application layout might l
  │         │    ├── <moduleX>.py  # arbitrary user defined module
  │         │    └── <moduleY>.py
  │         ├── source.py
  │         └── evaluation.py  # here the component is just a module
  │         ├── evaluation.py  # here the component is just a module
  │         ├── schedule.py
  │         └── tuning.py
  ├── tests
  │    ├── __init__.py
  │    ├── test_pipeline.py
@@ -98,8 +118,8 @@ the custom locations of its project components using the ``component`` parameter

.. _project-pipeline:

Pipeline (``pipeline.py``)
''''''''''''''''''''''''''
Pipeline Topology (``pipeline.py``)
'''''''''''''''''''''''''''''''''''

Pipeline definition is the heart of the project component structure. The framework needs to understand the
pipeline as a *Directed Acyclic Task Dependency Graph*. For this purpose, it comes with a concept of *Operators* that
@@ -118,32 +138,11 @@ exposed to the framework via the ``component.setup()`` handler::
FLOW = preprocessing.NaNImputer() >> model.LR(random_state=42, solver='lbfgs')
component.setup(FLOW)

.. _project-evaluation:

Evaluation (``evaluation.py``)
''''''''''''''''''''''''''''''

Definition of the model evaluation strategy for both the development and production lifecycle.

.. note:: The whole evaluation implementation is an interim solution; a more robust concept with a different API is
   on the roadmap.

The evaluation strategy again needs to be submitted to the framework using the ``component.setup()`` handler::

from sklearn import model_selection, metrics
from forml.project import component
from forml.lib.flow.operator.folding import evaluation

EVAL = evaluation.MergingScorer(
crossvalidator=model_selection.StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
metric=metrics.log_loss)
component.setup(EVAL)


.. _project-source:

Source (``source.py``)
''''''''''''''''''''''
Dataset Specification (``source.py``)
'''''''''''''''''''''''''''''''''''''

This component is a fundamental part of the :doc:`IO concept<io>`. A project can define the ETL process of sourcing
data into the pipeline using the :doc:`DSL <dsl>` referring to some :ref:`catalogized schemas
@@ -156,6 +155,9 @@ example below or documented in the :ref:`Source Descriptor Reference <io-source-
composition domain is separate from the main pipeline so adding an operator to the source composition vs
pipeline composition might have a different effect.

Part of the dataset specification can also be a reference to the *ordinal* column (used for determining data ranges for
splitting or incremental operations) and *label* columns for supervised learning/evaluation.

The Source descriptor again needs to be submitted to the framework using the ``component.setup()`` handler::

from forml.lib.flow.operator import cast
@@ -179,6 +181,47 @@ The Source descriptor again needs to be submitted to the framework using the ``c
component.setup(ETL)
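
Putting the pieces together, a complete source descriptor might look roughly like the following sketch — here the
``mycatalog.titanic`` schema is purely hypothetical and the ``Source.query`` factory signature (including the
``ordinal`` keyword) is an assumption to be verified against the Source Descriptor Reference::

    from forml.project import component

    from mycatalog import titanic  # hypothetical catalogized schema

    # DSL query selecting the feature columns
    FEATURES = titanic.Passenger.select(
        titanic.Passenger.Pclass,
        titanic.Passenger.Age,
        titanic.Passenger.Fare,
    )

    # assumed factory signature - label and ordinal column references attached
    ETL = component.Source.query(
        FEATURES,
        titanic.Passenger.Survived,        # label column for supervised training
        ordinal=titanic.Passenger.Ticket,  # ordinal column for range splitting
    )

    component.setup(ETL)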


.. _project-evaluation:

Evaluation Strategy (``evaluation.py``)
'''''''''''''''''''''''''''''''''''''''

Definition of the model evaluation strategy for both the development (backtesting) and production
:doc:`lifecycle <lifecycle>`.

.. note:: The whole evaluation implementation is an interim solution; a more robust concept with a different API is
   on the roadmap.

The evaluation strategy again needs to be submitted to the framework using the ``component.setup()`` handler::

from sklearn import model_selection, metrics
from forml.project import component
from forml.lib.flow.operator.folding import evaluation

EVAL = evaluation.MergingScorer(
crossvalidator=model_selection.StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
metric=metrics.log_loss)
component.setup(EVAL)


.. _project-tuning:

Hyperparameter Tuning Strategy (``tuning.py``)
''''''''''''''''''''''''''''''''''''''''''''''


.. _project-schedule:

Scheduling Rules (``schedule.py``)
''''''''''''''''''''''''''''''''''


.. _project-rollout:

Rollout Strategy (``rollout.py``)
''''''''''''''''''''''''''''''''''


Tests
'''''

9 changes: 9 additions & 0 deletions docs/registry/index.rst
@@ -33,3 +33,12 @@ API

.. autoclass:: forml.runtime.asset.persistent.Registry
:members:


Providers
---------

.. toctree::
:maxdepth: 2

filesystem
10 changes: 10 additions & 0 deletions docs/runner/index.rst
@@ -33,3 +33,13 @@ API

.. autoclass:: forml.runtime.Runner
:members:


Providers
---------

.. toctree::
:maxdepth: 2

dask
graphviz
