doc wip
antonymayi committed Nov 13, 2020
1 parent 475be1b commit ce62433
Showing 57 changed files with 1,375 additions and 418 deletions.
2 changes: 2 additions & 0 deletions .flake8
@@ -1,4 +1,6 @@
[flake8]
show-source = true
enable-extensions=G
max-line-length = 120
ignore = E731,W504,I001,W503
exclude = .git,__pycache__,.eggs,*.egg
22 changes: 11 additions & 11 deletions .gitignore
@@ -1,20 +1,20 @@
*.egg-info/
.tox/
build/
_build/
dist/
.tox/
docs/**/_*
htmlcov/
coverage.xml
junit.xml
.eggs
*.egg-info/
*.pyc
.idea
.cache
.coverage
.coverage.*
*.log
*.log.?
*.dot
*.dot.*
.pytest_cache/
.cache
.coverage
.coverage.*
.eggs
.idea
.ipynb_checkpoints/
.pytest_cache/
coverage.xml
junit.xml
45 changes: 39 additions & 6 deletions README.md
@@ -24,15 +24,48 @@ ForML
[![Coverage Status](https://img.shields.io/codecov/c/github/formlio/forml/master.svg)](https://codecov.io/github/formlio/forml?branch=master)
[![Documentation Status](https://readthedocs.org/projects/forml/badge/?version=latest)](https://forml.readthedocs.io/en/latest/?badge=latest)
[![License](http://img.shields.io/:license-Apache%202-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.txt)
[![PyPI version](https://badge.fury.io/py/forml.svg)](https://badge.fury.io/py/forml)
[![PyPI version](https://badge.fury.io/py/forml.svg)](https://pypi.org/project/forml/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)


ForML is a lifecycle management framework for Data science projects.
ForML is a framework for researching, implementing and operating data science projects.

Use ForML to formally describe a data science problem as a composition of high-level operators.
ForML expands your project into a task dependency graph specific to a given life-cycle phase and executes it using any
of its supported runners.

Getting Started
---------------
Solutions built on ForML are naturally easy to reuse, extend, reproduce, or share and collaborate on.

Please visit the [documentation](docs) for help with [installing ForML](docs/installation.rst) or
the [examples](docs/examples.rst) to find some demo project implementations.

Not Just Another DAG
--------------------

Despite a *DAG* (directed acyclic graph) being at the heart of its operations, ForML stands out amongst the many other
task dependency processing systems due to:

a. Its specialization in machine learning problems, which is wired right into the flow topology.
b. Its concept of high-level operator composition, which helps wrap complex ML techniques into simple reusable units.
c. Its abstraction of runtime dependencies, which allows running the same project using different technologies.


History
-------

ForML started as an open-source project in response to the ever-painful transition of data science research into
production. While there are other projects trying to solve this problem, they are typically either generic data
processing systems too low-level to provide out-of-the-box ML lifecycle routines, or special scientific frameworks that
are, at the other end, too high-level to allow for robust operations.


Resources
---------

* [Documentation](https://forml.readthedocs.io/en/latest/)
* [Source Code](https://github.com/formlio/forml/)
* Mailing lists:

* Developers: `forml-dev@googlegroups.com`
* Users: `forml-users@googlegroups.com`

* [Issue Tracker](https://github.com/formlio/forml/issues/)
* [PyPI Repository](https://pypi.org/project/forml/)
131 changes: 131 additions & 0 deletions docs/concept.rst
@@ -0,0 +1,131 @@
.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

Concept
=======

ForML aims to address a wide spectrum of challenges emerging from classical ML projects. It combines a unique set of
features to help with all of the project phases starting from research and prototyping to delivery and beyond.

The following table presents a quick overview of the key features brought by ForML:

+----------------------------------+---------------------------------------------------------------------------+
| Merit | Contributing factor |
+==================================+===========================================================================+
| *flexibility*, *agility* | iterative development, continuous experimentation, optional interactivity |
+----------------------------------+---------------------------------------------------------------------------+
| *unification*, *reusability* | project structure convention, high-level workflow API |
+----------------------------------+---------------------------------------------------------------------------+
| *reproducibility*, *consistency* | versioned pipeline artifacts, native operator modality |
+----------------------------------+---------------------------------------------------------------------------+
| *portability*, *operability* | multilevel abstraction, pluggable provisioning, runtime independence |
+----------------------------------+---------------------------------------------------------------------------+

Conceptually, these features can be split into two domains as presented in the next sections. The first approaches
a project from its implementation perspective, while the other deals with its operational aspects.


Project Formalization
---------------------

Formalization is the prime concept ForML is built upon. Having a common *component structure* for ML projects,
an *expression API* for their workflows and a generic *DSL* describing the required data sources leads to a cleaner
implementation that is easier to maintain, extend or exchange between different environments.

.. _concept-project:

Project Component Structure
ForML introduces its own convention for organizing machine learning projects on the module level. This provides
a common basic structure that can be understood across projects and helps ForML itself (among others) to understand
a project just by visiting the expected places.

More details about project layout are explained in the :doc:`project` sections.

.. _concept-dsl:

Data Source DSL
ForML comes with a custom DSL for specifying the data sources required by the project. This allows decoupling the
project from particular data formats and storage technologies so that it refers to the data only through
*catalogized schemas*. It is then down to the particular execution platform to feed the project pipeline with
the actual data based on the given DSL query.

Example of data source DSL::

    student.join(person, student.surname == person.surname) \
        .join(school, student.school == school.sid) \
        .select(student.surname.alias('student'), school['name'], function.Cast(student.score, kind.String())) \
        .where(student.score < 2) \
        .orderby(student.level, student.score)

Full guide and the DSL references can be found in the :doc:`dsl/index` sections.
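
To illustrate the idea of a query being a mere description that is only later transcoded for a particular reader,
here is a minimal toy sketch. It is not the ForML DSL: ``Column``, ``Query`` and ``to_sql`` are hypothetical names
made up for this example, and SQL is just one assumed target a reader might transcode to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical column reference within a catalogued schema (not a ForML class)."""

    table: str
    name: str

    def __str__(self) -> str:
        return f"{self.table}.{self.name}"

    def __lt__(self, value: object) -> str:
        # The comparison evaluates nothing; it only records the predicate.
        return f"{self} < {value!r}"


@dataclass(frozen=True)
class Query:
    """Immutable query description; no data is read until something transcodes it."""

    table: str
    columns: tuple = ()
    condition: str = ""
    ordering: tuple = ()

    def select(self, *columns: Column) -> "Query":
        return Query(self.table, columns, self.condition, self.ordering)

    def where(self, condition: str) -> "Query":
        return Query(self.table, self.columns, condition, self.ordering)

    def orderby(self, *columns: Column) -> "Query":
        return Query(self.table, self.columns, self.condition, columns)


def to_sql(query: Query) -> str:
    """One possible reader-specific transcoding of the description (here a SQL string)."""
    columns = ", ".join(map(str, query.columns)) or "*"
    sql = f"SELECT {columns} FROM {query.table}"
    if query.condition:
        sql += f" WHERE {query.condition}"
    if query.ordering:
        sql += f" ORDER BY {', '.join(map(str, query.ordering))}"
    return sql


surname = Column("student", "surname")
score = Column("student", "score")
level = Column("student", "level")
QUERY = Query("student").select(surname, score).where(score < 2).orderby(level, score)
SQL = to_sql(QUERY)
```

The project only ever holds the ``QUERY`` description; a different reader could turn the same object into something
other than SQL without the project changing.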

.. _concept-workflow:

Workflow Expression API
ForML provides a convenient interface for describing the project workflow using high-level expressions for
*operator compositions* that transparently expand into a low-level acyclic task dependency graph (DAG) of primitive
*actors*. Based on the internal architecture of operators, ForML is able to derive different DAG shapes from the
same workflow depending on the actual lifecycle phase implementing the requested pipeline mode. This leaves the
workflow definition very clean with all the main complexity being carried out in lower layers.

Example of a simple workflow::

    FLOW = SimpleImputer(strategy='mean') >> LogisticRegression(max_iter=3, solver='lbfgs')

More on the *Operators* and *Actors* is discussed in the :doc:`workflow` sections. See also the :doc:`lifecycle`
sections for details on the supported pipeline modes.
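
How a composed expression can expand into different task graphs may be easier to see on a small toy sketch. It does
not use the actual ForML classes: ``Operator``, ``Composition`` and ``expand`` are hypothetical names, and the
assumption that each operator contributes exactly one task per mode is a simplification made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical stand-in for a high-level operator (not the ForML class)."""

    name: str

    def __rshift__(self, other: "Operator") -> "Composition":
        return Composition((self, other))


@dataclass(frozen=True)
class Composition:
    """Ordered chain of operators recorded by the ``>>`` expressions."""

    steps: tuple

    def __rshift__(self, other: Operator) -> "Composition":
        return Composition(self.steps + (other,))

    def expand(self, mode: str) -> list:
        """Unroll the chain into (upstream, downstream) task edges for the given mode."""
        # Simplifying assumption: one task per operator and mode.
        tasks = [f"{step.name}.{mode}" for step in self.steps]
        return list(zip(tasks, tasks[1:]))


FLOW = Operator("imputer") >> Operator("scaler") >> Operator("classifier")
TRAIN_EDGES = FLOW.expand("train")
TUNE_EDGES = FLOW.expand("tune")
```

The same ``FLOW`` expression yields a separate set of edges for each mode, which mirrors the idea of deriving the DAG
shape from the lifecycle phase rather than from the workflow definition.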


Runtime Independence
--------------------

ForML has been carefully designed to entirely abstract away all of the fundamental runtime dependencies so that project
implementation is decoupled from any particular execution mechanism, storage technology, or data format. This allows
running the same unchanged project against an arbitrary combination of these runtime *providers*. Specific providers are
selected by the configuration of the runtime environment called simply the :doc:`Platform <platform>`.

.. _concept-io:

Data Providers & Result Consumers
The data source DSL defined within the project gets transcoded into reader-specific ETL code and then served
by one of the available schema-matching *feed* providers. Feeds can potentially serve an arbitrary number of
data sources that are advertised against the same *schema catalogues* referred to by projects. A platform can be
preconfigured with multiple different feeds held in a *pool*, which at query time selects the most suitable feed to
serve the given project query.

Similarly, any output produced by the ForML pipeline gets captured by the platform and sent to a configured *sink*.
A platform can specify a different sink provider for each pipeline mode.

See the :doc:`io` sections for more info about the related concepts.
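
The pool selection described above can be pictured with the following toy sketch. The ``Feed`` class, the ``match``
function and the priority-based tie-breaking are all assumptions made for this example; the actual matching logic
used by ForML is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feed:
    """Hypothetical feed advertising the catalogue schemas it is able to serve."""

    name: str
    schemas: frozenset
    priority: int = 0


def match(pool: list, required: set) -> Feed:
    """Pick the highest-priority feed that serves all of the required schemas."""
    candidates = [feed for feed in pool if required <= feed.schemas]
    if not candidates:
        raise LookupError(f"No feed in the pool serves {sorted(required)}")
    return max(candidates, key=lambda feed: feed.priority)


POOL = [
    Feed("files", frozenset({"student"}), priority=1),
    Feed("warehouse", frozenset({"student", "school", "person"}), priority=0),
]
SELECTED = match(POOL, {"student", "school"})
```

A query touching only ``student`` would be served by the ``files`` feed in this sketch, while one that also needs
``school`` falls through to ``warehouse``.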

.. _concept-persistence:

Persistence
A fundamental aspect of a project lifecycle is the pipeline state transition occurring during *train* and/or *tune*
modes. Each of these transitions produces a new *Generation*. Generations based on the same build of a project belong
to one *Lineage*.

Both Lineages and Generations are *project artifacts* that require a persistent runtime storage called *Registry*
that allows publishing, locating and fetching these entities. See the :doc:`registry/index` section for the list of
existing registry implementations and their configurations.
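
The relation between Lineages and Generations can be sketched with a toy in-memory registry. This is not one of the
ForML registry implementations: the ``Registry`` class, its ``publish`` and ``fetch`` methods, and the convention
that generation ``0`` means the latest one are all assumptions made for this example.

```python
from collections import defaultdict


class Registry:
    """Hypothetical in-memory registry; a real one would use persistent storage."""

    def __init__(self) -> None:
        # Maps a lineage (project build) to the ordered states of its generations.
        self._lineages = defaultdict(list)

    def publish(self, lineage: str, state: bytes) -> int:
        """Store a new generation under the lineage and return its 1-based number."""
        self._lineages[lineage].append(state)
        return len(self._lineages[lineage])

    def fetch(self, lineage: str, generation: int = 0) -> bytes:
        """Fetch the state of a generation; 0 (a convention of this sketch) means the latest."""
        return self._lineages[lineage][generation - 1]


REGISTRY = Registry()
FIRST = REGISTRY.publish("1.0", b"state-after-train")
SECOND = REGISTRY.publish("1.0", b"state-after-tune")
LATEST = REGISTRY.fetch("1.0")
```

Each *train* or *tune* transition adds one more generation to the lineage of the given build, and earlier
generations remain retrievable by their number.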

.. _concept-execution:

Execution
At runtime, the native actor DAG produced through the operator composition gets transformed into a representation
of the selected third-party task dependency *runner* and the actual execution is carried out under its control.

The list of supported runners shipped with ForML and their documentation can be found in the :doc:`runner/index`
section.
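
As a rough illustration of what executing a task dependency graph means, the following toy sketch visits tasks in
dependency order using the standard library ``graphlib`` module (Python 3.9+). It is not a ForML runner; the ``run``
function and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9


def run(dag: dict) -> list:
    """Toy 'runner': visit the tasks in an order that respects their dependencies."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        executed.append(task)  # a real runner would invoke the actor here
    return executed


# Each task maps to the set of tasks it depends on.
DAG = {"impute": {"read"}, "train": {"impute"}, "write": {"train"}}
ORDER = run(DAG)
```

Because the graph is described independently of the code that walks it, the same ``DAG`` could be handed over to
a different execution mechanism, which is the point of keeping the runner pluggable.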
27 changes: 20 additions & 7 deletions docs/conf.py
@@ -30,9 +30,9 @@
import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'src'))
import forml # noqa: E402
sys.path.insert(0, os.path.abspath('..'))

import forml # noqa: E402

# -- Project information -----------------------------------------------------

@@ -49,22 +49,25 @@
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.coverage',
'sphinx.ext.viewcode',
'sphinx.ext.graphviz',
'sphinx.ext.intersphinx',
'sphinx.ext.napoleon',
'sphinx_rtd_theme',
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
templates_path = ['templates']

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
exclude_patterns = ['_build']

intersphinx_mapping = {
'dask': ('https://docs.dask.org/en/latest/', None),
'setuptools': ('https://setuptools.readthedocs.io/en/latest/', None),
'pandas': ('https://pandas.pydata.org/pandas-docs/stable/', None),
'python': ('https://docs.python.org/3/', None),
}
@@ -74,13 +77,23 @@
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'
html_theme = 'sphinx_rtd_theme'

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_static_path = ['static']

html_show_sourcelink = False

html_show_copyright = False

# == Extensions configuration ==================================================

# -- Options for sphinx.ext.autodoc --------------------------------------------
# See: https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html

autoclass_content = 'both'
autosummary_generate = True
napoleon_numpy_docstring = False
napoleon_use_rtype = False
40 changes: 40 additions & 0 deletions docs/dsl/index.rst
@@ -0,0 +1,40 @@
.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

Data Source DSL
===============

To allow projects to :ref:`specify <project-source>` their data requirements in a portable way, ForML comes with its
generic DSL, which is interpreted at :doc:`runtime <../platform>` by the :doc:`feeds subsystem <../feed>`.

Schema
------



Query
-----


Functions
---------

.. autosummary::
   :recursive:
   :toctree: _auto

   forml.io.dsl.function.aggregate
   forml.io.dsl.function.conversion
   forml.io.dsl.function.datetime
   forml.io.dsl.function.math
38 changes: 38 additions & 0 deletions docs/faq.rst
@@ -0,0 +1,38 @@
.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

FAQs
====

What data format is used in the pipeline between the actors?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ForML actually doesn't care. It is only responsible for wiring up the actors in the desired graph but is fairly
agnostic about the actual payload exchanged between them. It is the responsibility of the project implementor to engage
actors that understand each other.

For convenience, the :doc:`lib` shipped with ForML contains certain actor/operator implementations that expect
the data to be `Pandas <https://pandas.pydata.org/>`_ dataframes. This is, however, merely a practical choice of the
flow library (or a controversy that might get it removed from the ForML framework in the long term), while the ForML
core is truly independent of the data formats being passed through.


Can a Feed engage multiple reader types so that I can mix for example file based datasources with data in a DB?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

No. It sounds like a cool idea to have a DSL interpreter that can just get raw data from any possible reader type and
natively implement the ETL operations on top of it, but since there are existing dedicated ETL platforms doing exactly
that (like `Presto DB <https://prestodb.io/>`_, which ForML can already integrate with), trying to support the same
feature on the feed level would stretch the project goals unnecessarily far.
