# Project Resources Management

Objective:

> Learning how ML solutions and their lifecycles can be managed as software projects.

Principles:

1. Project resources are organized to facilitate development, collaboration, and maintainability.
2. The **development lifecycle** covers all project development phases leading to a model release.
3. The **model registry** is the persistence layer holding the published project artifacts at rest.
4. The **production lifecycle** takes care of all necessary model operations after the release.

## Project Setup

ForML-based solutions are standard Python projects, typically with [the following minimal structure](https://docs.forml.io/en/latest/project.html):

| Component          | Location                 |
|--------------------|--------------------------|
| Project Descriptor | `pyproject.toml`         |
| Dependencies       | `pyproject.toml`         |
| Data Requirements  | `<module>/source.py`     |
| Evaluation Spec    | `<module>/evaluation.py` |
| Model Pipeline     | `<module>/pipeline.py`   |
| Tests              | `tests/`                 |

Other typical components not used directly by ForML (see, for example, [Cookiecutter Datascience](https://drivendata.github.io/cookiecutter-data-science/)):
* `data/`
* `docs/`
* `notebooks/`
* CI/CD descriptors
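
The minimal structure from the table above can be checked mechanically. The following is a small sketch using only the standard library; the helper name `missing_components` and the temporary demo layout are hypothetical and not part of ForML:

```python
import pathlib
import tempfile

# Paths from the component table; "{module}" stands for the project package.
REQUIRED = (
    "pyproject.toml",
    "{module}/source.py",
    "{module}/evaluation.py",
    "{module}/pipeline.py",
    "tests",
)


def missing_components(root, module):
    """Return the required paths that do not exist under ``root``."""
    root = pathlib.Path(root)
    expected = [template.format(module=module) for template in REQUIRED]
    return [path for path in expected if not (root / path).exists()]


# Demo on a throwaway directory holding only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    demo = pathlib.Path(tmp)
    (demo / "dummy").mkdir()
    (demo / "tests").mkdir()
    (demo / "pyproject.toml").touch()
    (demo / "dummy" / "source.py").touch()
    missing = missing_components(demo, "dummy")

print(missing)
```

Running it against a real project root (for example `missing_components("../dummy", "dummy")`) should return an empty list once all components are in place.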

### Starting a New Project

For the sake of this tutorial, let's start a new project called `dummy`:

In [1]:
! forml project --path .. init dummy

In [None]:
%cd ../dummy

In [None]:
! tree .

ForML adopts the standard Python `pyproject.toml` descriptor:

In [None]:
from IPython import display
display.Code('pyproject.toml')

Let's keep our _Dummy_ project under version control:

In [None]:
! git init .
! git add .

### Filling in the Project Components

#### Data Requirements

Let's define the [dummy/source.py](../dummy/dummy/source.py):

1. Open the [dummy/source.py](../dummy/dummy/source.py) component.
2. Update it with the final query DSL used previously in the chapter [2-task-dependency-management](2-task-dependency-management.ipynb#Operators):
```python
from forml import project
from forml.pipeline import payload

from dummycatalog import Foo

FEATURES = Foo.select(Foo.Level, Foo.Value)
OUTCOMES = Foo.Label

SOURCE = project.Source.query(FEATURES, OUTCOMES) >> payload.ToPandas()

project.setup(SOURCE)
```
3. **SAVE THE [dummy/source.py](../dummy/dummy/source.py) FILE!**

In [6]:
! git add dummy/source.py

#### Evaluation

Let's configure the [dummy/evaluation.py](../dummy/dummy/evaluation.py):

1. Open the [dummy/evaluation.py](../dummy/dummy/evaluation.py) component.
2. Update it with the evaluation descriptor shown previously in the chapter [3-evaluation](3-evaluation.ipynb#Cross-validation-Method):
```python
from sklearn import metrics
from sklearn import model_selection

from forml import evaluation, project

EVALUATION = project.Evaluation(
    evaluation.Function(metrics.log_loss),
    evaluation.CrossVal(
        crossvalidator=model_selection.StratifiedKFold(
            n_splits=3, shuffle=True, random_state=42
        )
    ),
)

project.setup(EVALUATION)
```
3. **SAVE THE [dummy/evaluation.py](../dummy/dummy/evaluation.py) FILE!**

In [7]:
! git add dummy/evaluation.py
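
To make the evaluation descriptor more tangible, here is a plain-Python sketch of its two ingredients: the log-loss metric and a stratified split into folds. It is an illustration only. The function names are hypothetical, the fold assignment is a simple round-robin without shuffling, and scikit-learn's `log_loss` and `StratifiedKFold` differ in their details.

```python
import math
from collections import defaultdict


def log_loss(labels, probabilities, eps=1e-15):
    """Binary log loss: mean negative log-likelihood of the true labels."""
    total = 0.0
    for label, probability in zip(labels, probabilities):
        # Clip to avoid log(0) for overconfident predictions.
        probability = min(max(probability, eps), 1 - eps)
        total += label * math.log(probability)
        total += (1 - label) * math.log(1 - probability)
    return -total / len(labels)


def stratified_folds(labels, n_splits):
    """Deal the indices of each class round-robin into ``n_splits`` folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


loss = log_loss([1, 0], [0.5, 0.5])  # an uninformed guess costs ln(2)
folds = stratified_folds([0, 0, 0, 1, 1, 1], n_splits=3)
print(loss, folds)
```

Each fold receives one sample of each class, so every fold preserves the class ratio of the full dataset, which is the point of stratification.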

#### Pipeline

Let's set up the entire [dummy/pipeline.py](../dummy/dummy/pipeline.py) workflow:

1. Open the [dummy/pipeline.py](../dummy/dummy/pipeline.py) component.
2. Update it with all the actors, operators, and their composition as explored previously in the chapter [2-task-dependency-management](2-task-dependency-management.ipynb).
3. Save the file!

```python
import typing

import pandas
from imblearn import over_sampling

from forml import project, flow
from forml.pipeline import payload, wrap

with wrap.importer():
    from sklearn.linear_model import LogisticRegression


@wrap.Actor.apply
def OrdActor(data: pandas.DataFrame, *, column: str) -> pandas.Series:
    return data[column].apply(lambda v: ord(v[0].lower()))


@wrap.Actor.train
def CenterActor(
    state: typing.Optional[float],
    data: pandas.DataFrame,
    labels: pandas.Series,
    *,
    column: str
) -> float:
    return data[column].mean()


@CenterActor.apply
def CenterActor(
    state: float, data: pandas.DataFrame, *, column: str
) -> pandas.Series:
    return data[column] - state


@wrap.Actor.train
def MinMax(
    state: typing.Optional[tuple[float, float]],
    data: pandas.DataFrame,
    labels: pandas.Series,
    *,
    column: str
) -> tuple[float, float]:
    min_ = data[column].min()
    return min_, data[column].max() - min_


@wrap.Operator.mapper
@MinMax.apply
def MinMax(
    state: tuple[float, float], data: pandas.DataFrame, *, column: str
) -> pandas.DataFrame:
    data[column] = (data[column] - state[0]) / state[1]
    return data


@wrap.Actor.apply
def OverSampler(
    features: pandas.DataFrame,
    labels: pandas.Series,
    *,
    random_state: typing.Optional[int] = None
):
    """Stateless actor with two input and two output ports for oversampling the features/labels of the minority class."""
    return over_sampling.RandomOverSampler(random_state=random_state).fit_resample(
        features, labels
    )


class Balancer(flow.Operator):
    """Balancer operator inserting the provided sampler into the ``train`` & ``label`` paths."""

    def __init__(self, sampler: flow.Builder = OverSampler.builder(random_state=42)):
        self._sampler = sampler

    def compose(self, scope: flow.Composable) -> flow.Trunk:
        left = scope.expand()
        sampler = flow.Worker(self._sampler, 2, 2)
        sampler[0].subscribe(left.train.publisher)
        new_features = flow.Future()
        new_features[0].subscribe(sampler[0])
        sampler[1].subscribe(left.label.publisher)
        new_labels = flow.Future()
        new_labels[0].subscribe(sampler[1])
        return left.use(
            train=left.train.extend(tail=new_features),
            label=left.label.extend(tail=new_labels),
        )


PIPELINE = (
    Balancer()
    >> payload.MapReduce(
        OrdActor.builder(column="Level"), CenterActor.builder(column="Value")
    )
    >> MinMax(column="Level")
    >> LogisticRegression(random_state=42)
)

project.setup(PIPELINE)
```

In [8]:
! git add dummy/pipeline.py
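
The actors in this pipeline separate a *train* step, which learns a state, from an *apply* step, which uses it. The sketch below mirrors that arithmetic on plain Python lists, without ForML or pandas. All function names are hypothetical stand-ins for the decorated actors above.

```python
from statistics import mean


def ord_apply(values):
    """Stateless: code point of the lowercased first character."""
    return [ord(value[0].lower()) for value in values]


def center_train(values):
    """Learn the state (the mean) from the training data."""
    return mean(values)


def center_apply(state, values):
    """Subtract the previously learned mean."""
    return [value - state for value in values]


def minmax_train(values):
    """Learn the state as a (minimum, range) pair."""
    low = min(values)
    return low, max(values) - low


def minmax_apply(state, values):
    """Rescale using the previously learned minimum and range."""
    low, span = state
    return [(value - low) / span for value in values]


levels = ord_apply(["Alpha", "beta", "Charlie"])
state = minmax_train(levels)
scaled = minmax_apply(state, levels)
centered = center_apply(center_train([1.0, 2.0, 3.0]), [1.0, 2.0, 3.0])
# The state learned in train mode is reused unchanged on unseen data:
unseen = minmax_apply(state, ord_apply(["delta"]))
print(levels, state, scaled, centered, unseen)
```

Note that the unseen value falls outside the `[0, 1]` range because the state was fitted on the training data only.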

#### Dependencies

Let's add the explicit dependencies used in this project into the [pyproject.toml](../dummy/pyproject.toml):

1. Open the [pyproject.toml](../dummy/pyproject.toml).
2. Update it with the code below adding the new dependency of `imbalanced-learn==0.10.1`:
```toml
[project]
name = "dummy"
version = "0.1.dev1"
dependencies = [
    "forml==0.93",
    "imbalanced-learn==0.10.1"
]

[tool.forml]
package = "dummy"
```
3. **SAVE THE [pyproject.toml](../dummy/pyproject.toml) FILE!**

In [9]:
! git add pyproject.toml

### Adding a Unit Test for Our Balancer Operator

In [10]:
! touch tests/test_pipeline.py

Edit the created [test_pipeline.py](../dummy/tests/test_pipeline.py) and implement the unit test:

1. Open the [test_pipeline.py](../dummy/tests/test_pipeline.py).
2. Update it with the code below providing the `TestBalancer` unit test implementation based on the chapter [2-task-dependency-management](2-task-dependency-management.ipynb):
```python
from forml import testing

from dummy import pipeline

class TestBalancer(testing.operator(pipeline.Balancer)):
    """Balancer unit tests."""

    default_oversample = (
        testing.Case()
        .train([[1], [1], [0]], [1, 1, 0])
        .returns([[1], [1], [0], [0]], labels=[1, 1, 0, 0])
    )
```
3. **SAVE THE [test_pipeline.py](../dummy/tests/test_pipeline.py) FILE!**

In [11]:
! git add tests/test_pipeline.py
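
To see why the test case expects one extra `[0]` row, consider this plain-Python sketch of random oversampling. It is an illustration of the idea, not the imbalanced-learn implementation. The function name `oversample` is hypothetical.

```python
import random
from collections import Counter


def oversample(features, labels, seed=None):
    """Append randomly repeated minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_features, new_labels = list(features), list(labels)
    for label, count in counts.items():
        # Rows of this class to draw from (with replacement).
        pool = [row for row, tag in zip(features, labels) if tag == label]
        for _ in range(target - count):
            new_features.append(rng.choice(pool))
            new_labels.append(label)
    return new_features, new_labels


balanced_features, balanced_labels = oversample([[1], [1], [0]], [1, 1, 0], seed=42)
print(balanced_features, balanced_labels)
```

Because the minority class has a single row here, the outcome does not depend on the random seed, which is what makes this input convenient for a deterministic unit test.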

## Development Lifecycle

The [development lifecycle](https://docs.forml.io/en/latest/lifecycle.html#development-life-cycle) covers all the project development phases leading to a model release. 

### Visualizing the Train DAG

In [12]:
! forml project train -R graphviz

running train


This produces an SVG file at [dummy/forml.dot.svg](./img/train.svg) that visualizes the train workflow:

_dummy/forml.dot.svg_:
[![train flow](img/train.svg)](img/train.svg)

### Performing Development Evaluation

In [None]:
! forml project eval

### Running Tests

In [None]:
! forml project test

### Releasing

Once we are happy with the achieved results (good evaluation metric, unit tests passing), we can proceed to release the model version.

Let's start by committing and tagging the project codebase:

In [None]:
! git commit -m 'Released 0.1.dev1'
! git tag 0.1.dev1

Now we can kick off the release process to package the model artifact and publish it into the model registry:

In [None]:
! forml project release

## Model Registry

The [model registry](https://docs.forml.io/en/latest/registry.html) is the interface for managing published models throughout the production lifecycle. It can be provided by a number of different implementations.

The registry is organized as a tree with three levels, `project` / `release` / `generation`:

In [None]:
! forml model list

In [None]:
! forml model list dummy

In [19]:
! forml model list dummy 0.1.dev1

In [None]:
! tree /opt/forml/assets/registry/

## Production Lifecycle

The [production lifecycle](https://docs.forml.io/en/latest/lifecycle.html#production-life-cycle) takes care of all the necessary model operations after its release.

### Model Training

In [21]:
! forml model train dummy

In [None]:
! forml model list dummy 0.1.dev1

In [None]:
! tree /opt/forml/assets/registry/
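
The `project` / `release` / `generation` hierarchy, and the effect of training on it, can be pictured with a small conceptual model. This is not ForML's API or its storage format. The nested dictionary, the function names, and the generation numbering starting at 1 are all assumptions made for illustration.

```python
# Conceptual model only: project -> release -> list of generations.
Registry = dict[str, dict[str, list[int]]]


def listing(registry, project=None, release=None):
    """Return the children of the given level of the tree."""
    if project is None:
        return list(registry)
    if release is None:
        return list(registry[project])
    return [str(generation) for generation in registry[project][release]]


def train(registry, project, release):
    """Record a new generation under an existing release."""
    generations = registry[project][release]
    generation = len(generations) + 1  # assumed numbering scheme
    generations.append(generation)
    return generation


registry: Registry = {"dummy": {"0.1.dev1": []}}  # state right after a release
projects = listing(registry)
releases = listing(registry, "dummy")
before = listing(registry, "dummy", "0.1.dev1")
train(registry, "dummy", "0.1.dev1")
after = listing(registry, "dummy", "0.1.dev1")
print(projects, releases, before, after)
```

In this picture a release is created once from the project code, while each training run adds a new generation beneath it.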

### Follow-up Steps

We will leave the remaining steps of the production lifecycle to the final chapter, [3-solution](../3-solution), which works with a real dataset.