# Hello `rubicon-ml`!

This notebook highlights the main features of the `rubicon-ml` logging library
in the context of a classification problem. `rubicon-ml` can track model metadata
and artifacts to aid in reproducible experimentation.

## loading & preparing the data

First, let's take a look at the data we'll be using.

In [1]:
! pip install palmerpenguins



More information on the palmer penguins dataset can be found
[here](https://allisonhorst.github.io/palmerpenguins/).

In [2]:
from palmerpenguins import load_penguins

penguins_df = load_penguins()
penguins_df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |


We can validate that we're looking at the right data by comparing this plot to the
one at the link above.

In [3]:
import plotly.express as px

px.scatter(penguins_df, x="flipper_length_mm", y="bill_length_mm", color="species")

Before we get started, we'll encode the categorical variables in the palmer penguins
dataset.

In [4]:
from sklearn.preprocessing import LabelEncoder

penguin_encoder = LabelEncoder()

for column in ["species", "island", "sex"]:
    penguins_df[column] = penguin_encoder.fit_transform(penguins_df[column])

penguins_df.head()

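| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 39.1 | 18.7 | 181.0 | 3750.0 | 1 | 2007 |
| 1 | 0 | 2 | 39.5 | 17.4 | 186.0 | 3800.0 | 0 | 2007 |
| 2 | 0 | 2 | 40.3 | 18.0 | 195.0 | 3250.0 | 0 | 2007 |
| 3 | 0 | 2 | NaN | NaN | NaN | NaN | 2 | 2007 |
| 4 | 0 | 2 | 36.7 | 19.3 | 193.0 | 3450.0 | 0 | 2007 |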
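
Note that the loop above refits a single `LabelEncoder` for every column, so once it finishes the encoder only remembers the classes of the last column (`sex`). If the integer codes need to be mapped back to their labels later, one option is to keep one encoder per column. The following is a minimal sketch on a tiny hand-made frame; `toy_df`, `encoded_df` and `encoders` are illustrative names that are not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for the penguins data (illustrative values only).
toy_df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
        "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
    }
)

# Keep one fitted encoder per column so each mapping can be inverted later.
encoders = {}
encoded_df = toy_df.copy()
for column in ["species", "island"]:
    encoders[column] = LabelEncoder()
    encoded_df[column] = encoders[column].fit_transform(toy_df[column])

# Classes are sorted alphabetically, so the integer codes follow that order.
decoded_species = encoders["species"].inverse_transform(encoded_df["species"])
print(list(encoded_df["species"]), list(decoded_species))
```

With a dictionary of encoders, `encoders["island"].classes_` still holds the island names after the loop has moved on to other columns.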


## setting up a pipeline

We'll be leveraging Scikit-learn quite a bit in this example, but `rubicon-ml` can
be integrated into any Python codebase! Let's split the palmer penguins dataset into
a train and test set.

In [5]:
from sklearn.model_selection import train_test_split

train_penguins_df, test_penguins_df = train_test_split(penguins_df, test_size=.30)

target_column = "species"
feature_columns = [c for c in train_penguins_df.columns if c != target_column]

X_train, y_train = train_penguins_df[feature_columns], train_penguins_df[target_column]
X_test, y_test = test_penguins_df[feature_columns], test_penguins_df[target_column]

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((240, 7), (240,), (104, 7), (104,))
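
The split above is random, so the shapes stay the same between runs but the selected rows (and therefore the score) do not. Since the notebook is about reproducible experimentation, it may be worth pinning the seed. The sketch below uses synthetic data, a seed of 42 and stratification on the target; none of these appear in the original notebook and all are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 rows, 2 features, 2 balanced classes.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# random_state pins the shuffle; stratify keeps class proportions in both splits.
split_a = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

# Same seed, same rows in the same order.
print(np.array_equal(X_train_a, X_train_b), X_train_a.shape, X_test_a.shape)
```

Logging the seed as a parameter alongside the model's hyperparameters would make the split itself part of the experiment record.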

Now we can build a simple pipeline to train a classifier on the palmer penguin data.
We'll start with an imputer to fill out some missing data then train a k-nearest
neighbors classifier.

In [6]:
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

steps = [
    ("si", SimpleImputer(strategy="mean")),
    ("kn", KNeighborsClassifier(n_neighbors=5)),
]

penguin_pipeline = Pipeline(steps=steps)
penguin_pipeline.fit(X_train, y_train)

score = penguin_pipeline.score(X_test, y_test)
score

0.7788461538461539

## logging metadata to `rubicon-ml`

Now that we've fit our pipeline, let's log our inputs and outputs to `rubicon-ml`
so we can revisit them later.

In [7]:
from rubicon_ml import Rubicon

rubicon = Rubicon(
    persistence="filesystem",
    root_dir="./rubicon-root",
    auto_git_enabled=True,
)
project = rubicon.get_or_create_project(name="demo")

experiment = project.log_experiment(name="classifying penguins")
parameter_strategy = experiment.log_parameter(name="strategy", value="mean")
parameter_n_neighbors = experiment.log_parameter(name="n_neighbors", value=5)
metric_accuracy = experiment.log_metric(name="accuracy", value=score)

Users can retrieve parameters, metrics, and lots of other metadata from logged experiments.

In [8]:
print(experiment)
print(experiment.branch_name, experiment.commit_hash)
print()
print([(p.name, p.value) for p in experiment.parameters()])
print([(m.name, m.value) for m in experiment.metrics()])

Experiment(project_name='demo', id='8dd488f3-7dd7-4ad5-ac82-221e5f6db900', name='classifying penguins', description=None, model_name=None, branch_name='example', commit_hash='b66a307cfb99ea91ec2f11b6fdd16ce41e3c54ab', training_metadata=None, tags=[], created_at=datetime.datetime(2022, 3, 29, 18, 36, 1, 48644))
example b66a307cfb99ea91ec2f11b6fdd16ce41e3c54ab

[('strategy', 'mean'), ('n_neighbors', 5)]
[('accuracy', 0.7788461538461539)]


## logging with the `RubiconPipeline`

The experiment tracking code above can be easily inserted into any Python codebase with a little
bit of thought around what you want to be logging. However, when using Scikit-learn, `rubicon-ml`
can do a little bit of that thinking for you!

The `RubiconPipeline` is a drop-in replacement for the existing Scikit-learn `Pipeline` we used above.
However, the `RubiconPipeline` is able to intelligently log a new experiment that contains each input 
to the pipeline every time the pipeline is fit. Scoring the pipeline will also log metrics.

Here, we'll simply instantiate a `RubiconPipeline` with the same steps we used above - but this time
we'll give it a `rubicon-ml` project too!

In [9]:
from rubicon_ml.sklearn import RubiconPipeline

rubicon_penguin_pipeline = RubiconPipeline(
    project=project,
    experiment_kwargs={"name": "KNeighborsClassifier", "tags": ["knn"]},
    steps=steps,
)
rubicon_penguin_pipeline.fit(X_train, y_train)

pipeline_experiment = rubicon_penguin_pipeline.experiment

rubicon_penguin_pipeline.score(X_test, y_test)

0.7788461538461539

After fitting and scoring the new `RubiconPipeline`, we can clearly see that we've captured
a lot more information than the manually logged pipeline with far less code!

In [10]:
print(pipeline_experiment)
print()
print([(p.name, p.value) for p in pipeline_experiment.parameters()])
print([(m.name, m.value) for m in pipeline_experiment.metrics()])

Experiment(project_name='demo', id='8aa5f5a0-3448-4ce3-8124-e8858b4d5bc1', name='KNeighborsClassifier', description=None, model_name=None, branch_name='example', commit_hash='b66a307cfb99ea91ec2f11b6fdd16ce41e3c54ab', training_metadata=None, tags=['knn'], created_at=datetime.datetime(2022, 3, 29, 18, 36, 3, 798132))

[('si__add_indicator', False), ('si__copy', True), ('si__fill_value', None), ('si__missing_values', nan), ('si__strategy', 'mean'), ('si__verbose', 0), ('kn__algorithm', 'auto'), ('kn__leaf_size', 30), ('kn__metric', 'minkowski'), ('kn__metric_params', None), ('kn__n_jobs', None), ('kn__n_neighbors', 5), ('kn__p', 2), ('kn__weights', 'uniform')]
[('score', 0.7788461538461539)]
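
The `si__` and `kn__` prefixes in the logged parameter names follow Scikit-learn's `<step name>__<parameter>` convention for nested parameters. The names above match what `Pipeline.get_params()` reports for each step, although how `RubiconPipeline` collects them internally is an assumption here. A small sketch with plain Scikit-learn:

```python
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline(
    steps=[
        ("si", SimpleImputer(strategy="mean")),
        ("kn", KNeighborsClassifier(n_neighbors=5)),
    ]
)

# Nested parameters are exposed as "<step name>__<parameter name>".
step_params = {
    name: value
    for name, value in pipeline.get_params().items()
    if "__" in name
}
print(step_params["si__strategy"], step_params["kn__n_neighbors"])

# The same names are what a grid search or a set_params call addresses.
pipeline.set_params(kn__n_neighbors=8)
print(pipeline.get_params()["kn__n_neighbors"])
```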


## grid search (and other meta-estimators)

The `RubiconPipeline` itself is a Scikit-learn estimator, meaning that it can be leveraged
with any existing meta-estimator that Scikit-learn or a variety of other tooling suites
offer. The great part about integrating `rubicon-ml` into Scikit-learn is that so many other
tools have also decided to make the same integration, effectively giving `rubicon-ml` compatibility
with them as well.

`rubicon-ml` is particularly useful in evaluating hyperparameter- and feature-space searches while using
tools that leverage the Scikit-learn estimator specification. For this particular example, we'll use
the built-in `GridSearchCV`, but `rubicon-ml` can work with any searching tool that leverages the
Scikit-learn API like Optuna or Hyperopt.

The grid search below will train 36 separate models and automatically log an experiment representing
each of them to the newly created "grid search demo" project.
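
The figure of 36 comes from the grid size times the number of cross-validation folds: 3 imputation strategies times 6 neighbor counts gives 18 candidates, and each candidate is fit once per fold. The arithmetic can be checked with the standard library alone; `param_grid`, `n_folds` and `candidates` are illustrative names.

```python
from itertools import product

param_grid = {
    "si__strategy": ["mean", "median", "most_frequent"],
    "kn__n_neighbors": [2, 4, 8, 16, 32, 64],
}
n_folds = 2  # two cross-validation folds, as in the grid search

# Every combination of one value per parameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)  # 18 36
```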

In [11]:
from sklearn.model_selection import GridSearchCV

parameters = {
    "si__strategy": ["mean", "median", "most_frequent"],
    "kn__n_neighbors": [2, 4, 8, 16, 32, 64],
}

grid_search_project = rubicon.get_or_create_project(name="grid search demo")
rubicon_penguin_pipeline.project = grid_search_project

grid_search = GridSearchCV(
    rubicon_penguin_pipeline,
    cv=2,
    param_grid=parameters,
    refit=False,
    verbose=True,
)

grid_search.fit(X_train, y_train)
grid_search_project.experiments()

Fitting 2 folds for each of 18 candidates, totalling 36 fits


[<rubicon_ml.client.experiment.Experiment at 0x1609a2020>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a14e0>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a1c60>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a07c0>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a18a0>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a02b0>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a1600>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a0970>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a1360>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a0eb0>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a1300>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a0520>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a00a0>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a0b50>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a0d60>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a11e0>,
 <rubicon_ml.client.experiment.Experiment at 0x1609a0460

## widgets & visualizations

In a real-world scenario, many modelers will end up with hundreds, thousands or even more experiments.
Inspecting this many experiments via a Python interpreter can get messy quickly, so we designed a suite
of widgets that can run in-line in a notebook or on their own server to make exploring logged experiments
simpler!

All of the available visualizations can be found [here](https://capitalone.github.io/rubicon-ml/visualizations.html)
and we're always open to requests for new ones! The experiment table rendered in the following cell is
a simple overview of each of the experiments logged to a project.

In [12]:
from rubicon_ml.viz import ExperimentsTable

ExperimentsTable(experiments=grid_search_project.experiments()).show()

Dash is running on http://127.0.0.1:8050/

 * Serving Flask app 'rubicon_ml.viz.base' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


If you're following along with the demo live, make sure to select and publish a few experiments
from the experiment table above - it'll be necessary to continue executing the demo.

`rubicon-ml` experiments can be published programmatically as well. If you did not load the widget,
you can publish the experiments like this:

```python
from rubicon_ml import publish

publish(grid_search_project.experiments(), "./rubicon-ml-catalog.yml")
```

## sharing & collaborating

Sharing results and collaborating with others to draw better conclusions is a huge part
of the experimentation phase. Sometimes these collaborators may be just as technical as
the model developer, or they could be analysts from a completely different domain. The
goal of `rubicon-ml`'s `intake` integration is to make sharing `rubicon-ml` experiments
easy regardless of the viewer's technical skill level.

`intake` catalogs can be read with a single call to the `intake` library and return fully
operational `rubicon-ml` experiment objects. Future plans include exposing this capability
via a simple CLI so users don't even need to launch a Python interpreter. Soon we'll be
able to directly launch visualizations from catalog files too!

The nice thing about the `intake` catalogs is that it's much easier to share experiments
by sharing a path to a single file than it is to tell people _"you need experiments 
3b6fe9b7-a547-40d4-82ba-e81dd02b1bfb and d1ddfc75-cefb-46ba-b4e4-2cd8e045d7d0 from
that S3 bucket we always use..."_ All that info is succinctly captured in the generated
catalog, ready for `intake` to interpret and turn back into `rubicon-ml` experiments.

Access controls are handled by the underlying filesystems - as long as the experimenter
and collaborator both have access, they'll both be able to share experiments.

In [13]:
import intake

catalog = intake.open_catalog("./rubicon-ml-catalog.yml")

for source in catalog:
    catalog[source].discover()
    
shared_experiments = [catalog[source].read() for source in catalog]

[(e.id, e.metric(name="score").value) for e in shared_experiments]

[('4d0029fc-fa04-4907-95ef-46a04ac9b62d', 0.7),
 ('706a27a5-2ef6-4a5b-89e1-ead0060b4660', 0.725),
 ('786a9813-d7eb-4e94-9d37-642a27540a45', 0.7)]

## trying a new model

Now, imagine we're a model developer that's collaborating with the old version of us
that just trained the k-nearest neighbors model. We've had the best k-nearest neighbor
models shared with us via the `intake` catalog above and we think we can do better with
a random forest classifier.

Similarly to how we did it above, we'll train a new model and log some new experiments
to a new `rubicon-ml` project in a completely different location. This time our grid search
will log 30 experiments.

In [14]:
from sklearn.ensemble import RandomForestClassifier

new_steps = [
    ("si", SimpleImputer(strategy="mean")),
    ("rf", RandomForestClassifier(n_estimators=100)),
]

new_rubicon = Rubicon(
    persistence="filesystem",
    root_dir="./new-rubicon-root",
    auto_git_enabled=True,
)
new_project = new_rubicon.get_or_create_project(name="demo")

new_pipeline = RubiconPipeline(
    project=new_project,
    experiment_kwargs={"name": "RandomForestClassifier", "tags": ["rf"]},
    steps=new_steps,
)

new_parameters = {
    "si__strategy": ["mean", "median", "most_frequent"],
    "rf__n_estimators": [25, 50, 100, 200, 400],
}

new_grid_search = GridSearchCV(
    new_pipeline,
    cv=2,
    param_grid=new_parameters,
    refit=False,
    verbose=True,
)

new_grid_search.fit(X_train, y_train)

Fitting 2 folds for each of 15 candidates, totalling 30 fits


GridSearchCV(cv=2,
             estimator=RubiconPipeline(experiment_kwargs={'name': 'RandomForestClassifier',
                                                          'tags': ['rf']},
                                       project=<rubicon_ml.client.project.Project object at 0x16189a7d0>,
                                       steps=[('si', SimpleImputer()),
                                              ('rf',
                                               RandomForestClassifier())]),
             param_grid={'rf__n_estimators': [25, 50, 100, 200, 400],
                         'si__strategy': ['mean', 'median', 'most_frequent']},
             refit=False, verbose=True)

Let's use `rubicon-ml`'s `Dashboard` to inspect these new experiments. The dashboard is a
way to combine `rubicon-ml` widgets to create custom dashboards for model interpretability.

You can read more about the dashboard [here](https://capitalone.github.io/rubicon-ml/visualizations/dashboard.html).

In [15]:
from rubicon_ml.viz import Dashboard, MetricCorrelationPlot

Dashboard(
    experiments=new_project.experiments(),
    widgets=[
        [ExperimentsTable()],
        [MetricCorrelationPlot(parameter_names=["si__strategy", "rf__n_estimators"])],
    ],
).serve()

Dash is running on http://127.0.0.1:8051/

 * Serving Flask app 'rubicon_ml.viz.base' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


We can even publish catalogs containing experiments from completely different `rubicon-ml` 
locations. This is particularly useful when logging and retrieving experiments from S3.

Inspecting the "combined-catalog.yml" file, we can see that there are experiments from
multiple sources. When we read the whole catalog, we'll get a list of experiments as if
they were all loaded from the same place!

In [16]:
from rubicon_ml import publish

combined_catalog = publish(
    shared_experiments + new_project.experiments(),
    "./combined-catalog.yml",
)

## custom estimators and loggers

Say we have some custom code that doesn't fit the Scikit-learn estimator specification,
but we'd like to leverage `rubicon-ml`'s automatic logging instead of adding logging
statements all over our codebase like the very first example in this notebook.

Maybe we decided that both the k-nearest neighbors and random forest classifiers have
some value and should be considered in the final output, so we score predictions with
a block of code like this:

```python
def combined_score(knn, rf, X, y):
    # `knn` and `rf` are already-fitted classifiers; `score` needs the labels too
    knn_score = knn.score(X, y)
    rf_score = rf.score(X, y)

    # weight the random forest twice as heavily as k-nearest neighbors
    return (knn_score + (rf_score * 2)) / 3
```

Instead of manually logging these results to `rubicon-ml`, we can wrap the code
above in a class that fits the Scikit-learn estimator specification!

In [17]:
from sklearn.base import BaseEstimator

class ComboEstimator(BaseEstimator):
    def __init__(self, n_neighbors=2, n_estimators=25):
        super().__init__()
        
        self.n_neighbors = n_neighbors
        self.n_estimators = n_estimators
        
        self.knn = KNeighborsClassifier(n_neighbors=n_neighbors)
        self.rf = RandomForestClassifier(n_estimators=n_estimators)
        
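    def fit(self, X, y):
        self.knn.fit(X, y)
        self.rf.fit(X, y)

        # Scikit-learn convention: `fit` returns the fitted estimator
        return self

    def score(self, X, y):
        # each classifier's `score` needs the true labels to compute accuracy
        knn_score = self.knn.score(X, y)
        rf_score = self.rf.score(X, y)

        return (knn_score + (rf_score * 2)) / 3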

Now, all the parameters of the new estimator will be automatically logged when it's
placed in a `RubiconPipeline`!
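
This works because `BaseEstimator` derives an estimator's parameters from the signature of its `__init__`, which is why each argument is stored on `self` under the same name. A minimal sketch with a hypothetical `TinyEstimator` (not part of the notebook) shows what `get_params()` reports; that `rubicon-ml`'s default logger reads exactly this dictionary is an assumption.

```python
from sklearn.base import BaseEstimator


class TinyEstimator(BaseEstimator):
    """Hypothetical estimator used only to illustrate parameter discovery."""

    def __init__(self, n_neighbors=2, n_estimators=25):
        # Attribute names must match the __init__ argument names.
        self.n_neighbors = n_neighbors
        self.n_estimators = n_estimators


estimator = TinyEstimator(n_neighbors=16)
params = estimator.get_params()
print(params)

# set_params updates the attribute of the same name.
estimator.set_params(n_estimators=100)
print(estimator.n_estimators)
```

Anything stored under a different attribute name, or computed inside `__init__`, would not show up in `get_params()` and so would need a custom logger to be captured.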

If we want to get into custom loggers that can do more than just log parameters,
we'll need to extend `rubicon-ml`'s `EstimatorLogger`. Here, we'll create a new
`ModelLogger` that pickles the trained models and logs them to `rubicon-ml` as
artifacts, which can be arbitrary files.

In [18]:
import pickle

from rubicon_ml.sklearn.estimator_logger import EstimatorLogger

class ModelLogger(EstimatorLogger):
    def log_parameters(self):
        super().log_parameters()
        
        self.experiment.log_artifact(data_bytes=pickle.dumps(self.estimator.knn), name="knn")
        self.experiment.log_artifact(data_bytes=pickle.dumps(self.estimator.rf), name="rf")

We can pass the new `ComboEstimator` in a tuple along with its custom logger we just
implemented above to `rubicon-ml`'s `make_pipeline` function. Like the `RubiconPipeline`,
`rubicon-ml`'s version of `make_pipeline` should function very similarly to the Scikit-learn
version. The main difference is that it accepts a project as well as optional custom loggers.

After we've fit our pipeline, we can retrieve the logged artifacts.

In [19]:
from rubicon_ml.sklearn import make_pipeline

another_pipeline = make_pipeline(
    new_project,
    SimpleImputer(strategy="mean"),
    (ComboEstimator(n_neighbors=16, n_estimators=100), ModelLogger()),
)

another_pipeline.fit(X_train, y_train)

[(a.name, a) for a in another_pipeline.experiment.artifacts()]

[('knn', <rubicon_ml.client.artifact.Artifact at 0x161c046a0>),
 ('rf', <rubicon_ml.client.artifact.Artifact at 0x161c04a00>)]

The artifacts' data can be un-pickled to retrieve the trained model objects.

In [20]:
for artifact in another_pipeline.experiment.artifacts():
    print(pickle.loads(artifact.data))

KNeighborsClassifier(n_neighbors=16)
RandomForestClassifier()
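
An un-pickled model is ready to use without refitting. The sketch below round-trips a small classifier through `pickle` on made-up data and checks that predictions survive; it does not touch `rubicon-ml` at all, and the data and variable names are illustrative. As with any pickle, only load bytes from sources you trust, and expect the best compatibility when the same Scikit-learn version is used for saving and loading.

```python
import pickle

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up, well-separated data: two clusters on a line.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Serialize to bytes (the form passed to `log_artifact` above) and restore.
model_bytes = pickle.dumps(model)
restored_model = pickle.loads(model_bytes)

X_new = np.array([[0.05], [4.9]])
print(restored_model.predict(X_new))
```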
