# Integrate with Prefect

``rubicon_ml`` offers an integration with [Prefect](https://www.prefect.io/), an open-source
workflow management engine widely used in the Python ecosystem. In Prefect, a unit of
work is called a **task**, and a collection of **tasks** makes up a **flow**, which
represents your workflow pipeline. You can integrate ``rubicon_ml`` into your workflows to
persist metadata about your experiments.

We'll [run a Prefect server locally](https://docs.prefect.io/core/getting_started/installation.html#running-the-local-server-and-ui)
for this example. The flows below use the ``@flow`` decorator, so they assume a Prefect
release that provides it. On recent releases, if you already have Prefect installed, you can
start a local server and UI with one command. Calling a flow function then runs it in the
current process, so no separate agent is needed:

```bash
prefect server start
```

For more context, check out the
[workflow README on GitHub](https://github.com/capitalone/rubicon-ml/tree/main/rubicon/workflow).

### Setting up a simple flow

Now we can get started! Writing your own Prefect **tasks** is straightforward, but
``rubicon_ml`` also ships a set of prebuilt **tasks** for its most common logging calls.

In [1]:
from rubicon_ml.workflow.prefect import (
    get_or_create_project_task,
    create_experiment_task,
    log_artifact_task,
    log_dataframe_task,
    log_feature_task,
    log_metric_task,
    log_parameter_task,
)

We'll need a workflow to integrate ``rubicon_ml`` logging into, so let's put together a simple one.
To mimic a realistic example, we'll create tasks for loading data, splitting said data, extracting
features, and training a model.

In [2]:
from prefect import task


@task
def load_data():
    from sklearn.datasets import load_wine
    
    return load_wine()

In [3]:
@task
def split_data(dataset):
    from sklearn.model_selection import train_test_split

    return train_test_split(
        dataset.data,
        dataset.target,
        test_size=0.25,
    )
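
Because ``split_data`` does not pass ``random_state``, ``train_test_split`` shuffles the data differently on every run, so the accuracy reported later can change from run to run. The sketch below shows a seeded variant as a plain function (it could be decorated with ``@task`` in the same way). The name ``split_data_seeded`` and its ``seed`` parameter are assumptions made for illustration and are not part of the original workflow.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split


def split_data_seeded(dataset, seed=0):
    """Split like `split_data`, but reproducibly (hypothetical helper)."""
    return train_test_split(
        dataset.data,
        dataset.target,
        test_size=0.25,
        random_state=seed,  # fixing the seed fixes the shuffle
    )


wine = load_wine()
first = split_data_seeded(wine)
second = split_data_seeded(wine)

# Same seed, same split: the held-out labels match element for element.
same_split = bool((first[3] == second[3]).all())
print(same_split)
```

Leaving the split unseeded, as the tutorial does, is fine for a demo. Seeding mainly matters when you want logged metrics to be comparable across runs.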

In [4]:
@task
def get_feature_names(dataset):
    return dataset.feature_names

In [5]:
@task
def fit_pred_model(
    train_test_split_result,
    n_components,
    n_neighbors,
    is_standardized,
):
    from sklearn import metrics
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X_train, X_test, y_train, y_test = train_test_split_result

    if is_standardized:
        classifier = make_pipeline(
            StandardScaler(),
            PCA(n_components=n_components),
            KNeighborsClassifier(n_neighbors=n_neighbors),
        )
    else:
        classifier = make_pipeline(
            PCA(n_components=n_components),
            KNeighborsClassifier(n_neighbors=n_neighbors),
        )
        
    classifier.fit(X_train, y_train)
    pred_test = classifier.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, pred_test)
    
    return accuracy

Without ``rubicon_ml``, here's what this simple **flow** would look like:

In [6]:
from prefect import flow


n_components = 2
n_neighbors = 5
is_standardized = True


@flow
def wine_flow():
    wine_dataset = load_data()
    
    feature_names = get_feature_names(wine_dataset)
    train_test_split = split_data(wine_dataset)
    
    return fit_pred_model(
        train_test_split,
        n_components,
        n_neighbors,
        is_standardized,
    )

### Running a flow and viewing results

Now we'll run the **flow** by calling it like a regular function. It returns the accuracy
computed by ``fit_pred_model``.

In [7]:
accuracy = wine_flow()
accuracy

0.9333333333333333

### Adding Rubicon to your flow

We can leverage the Prefect tasks within the ``rubicon_ml`` library to log all the
info we want about our model run. Then, we can use the standard ``rubicon_ml`` logging
library to retrieve and inspect our metrics and other logged data. This is much
simpler than digging through the state of each executed **task** and extracting
its results.

Here's the same flow from above, this time with ``rubicon_ml`` **tasks** integrated.

In [8]:
from prefect import unmapped


n_components = 2
n_neighbors = 5
is_standardized = True


@flow
def rubicon_wine_flow():
    project = get_or_create_project_task(
        "memory",
        ".",
        "Wine Classification with Prefect",
    )
    experiment = create_experiment_task(
        project,
        name="logged from a Prefect task",
    )
    
    wine_dataset = load_data()
    feature_names = get_feature_names(wine_dataset)
    train_test_split = split_data(wine_dataset)
    
    log_feature_task.map(unmapped(experiment), feature_names)
    log_parameter_task(experiment, "n_components", n_components)
    log_parameter_task(experiment, "n_neighbors", n_neighbors)
    log_parameter_task(experiment, "is_standardized", is_standardized)
    
    accuracy = fit_pred_model(
        train_test_split,
        n_components,
        n_neighbors,
        is_standardized,
    )
    
    log_metric_task(experiment, "accuracy", accuracy)

    return accuracy
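
The ``log_feature_task.map(unmapped(experiment), feature_names)`` call runs the task once per feature name while passing the same experiment to every run. The plain-Python analogy below illustrates that calling pattern only. It is not how Prefect implements mapping, and ``log_feature`` and the list standing in for an experiment are hypothetical.

```python
from itertools import repeat


def log_feature(experiment, feature_name):
    """Hypothetical stand-in for `log_feature_task`: record one feature name."""
    experiment.append(feature_name)
    return feature_name


experiment = []  # hypothetical stand-in for a rubicon_ml experiment
feature_names = ["alcohol", "malic_acid", "ash"]

# `repeat(experiment)` plays the role of `unmapped(experiment)`: the same
# object is handed to every call, while `feature_names` is iterated over.
results = list(map(log_feature, repeat(experiment), feature_names))

print(results)
print(experiment)
```

Unlike the built-in ``map``, Prefect submits mapped task runs to its task runner, so they may finish in any order. That is why the features retrieved later are not necessarily listed in the dataset's original order.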

Again, we'll run the **flow**.

In [9]:
accuracy = rubicon_wine_flow()
accuracy

1.0

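This time we can use ``rubicon_ml`` to inspect our accuracy, among other things! (The accuracy
differs from the earlier run because ``split_data`` draws a new random train/test split each
time it runs.)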

In [10]:
from rubicon_ml import Rubicon


rubicon = Rubicon(persistence="memory", root_dir=".")
project = rubicon.get_project("Wine Classification with Prefect")
    
experiment = project.experiments()[0]

features = [f.name for f in experiment.features()]
parameters = [(p.name, p.value) for p in experiment.parameters()]
metrics = [(m.name, m.value) for m in experiment.metrics()]

print(
    f"experiment {experiment.id}\n"
    f"features: {features}\nparameters: {parameters}\n"
    f"metrics: {metrics}"
)

experiment 35173154-a912-4ed0-9920-188d62e6f3d2
features: ['proline', 'proanthocyanins', 'flavanoids', 'alcalinity_of_ash', 'nonflavanoid_phenols', 'hue', 'alcohol', 'od280/od315_of_diluted_wines', 'malic_acid', 'ash', 'magnesium', 'total_phenols', 'color_intensity']
parameters: [('n_components', 2), ('n_neighbors', 5), ('is_standardized', True)]
metrics: [('accuracy', 1.0)]
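
Once the parameters and metrics are retrieved as ``(name, value)`` pairs, it can be convenient to flatten them into one record per experiment for side-by-side comparison. The sketch below uses the values printed above. The helper ``to_record`` is hypothetical and not part of ``rubicon_ml``.

```python
def to_record(experiment_id, parameters, metrics):
    """Flatten (name, value) pairs into a single dict (hypothetical helper)."""
    record = {"experiment_id": experiment_id}
    record.update(parameters)  # dict.update accepts an iterable of pairs
    record.update(metrics)
    return record


# Values copied from the output above.
parameters = [("n_components", 2), ("n_neighbors", 5), ("is_standardized", True)]
metrics = [("accuracy", 1.0)]

record = to_record("35173154-a912-4ed0-9920-188d62e6f3d2", parameters, metrics)
print(record)
```

A list of such records, one per experiment, can be sorted or filtered with ordinary Python, or loaded into a dataframe if one is available.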
