# Quick Look

This is a simple classification example based on
[this Scikit-learn example](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py).
The wine dataset used in this example classifies different Italian wines by the cultivators that produced them
based on 13 features describing their chemical makeup. More information on this dataset can be found by running
``print(wine.DESCR)`` after the dataset has been loaded.

We'll use this example to take a quick look at each of ``rubicon_ml``'s core components - logging, the dashboard, and
our sharing process.

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split


wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data,
    wine.target,
    test_size=0.50,
)
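The split above is not seeded, so the accuracy values logged later can differ from run to run. As a minimal sketch, assuming you want a repeatable split, you could pass ``random_state`` to ``train_test_split`` and check the shapes of the result. The seed value ``0`` is an arbitrary choice for illustration and is not part of the original example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

# 178 samples, 13 chemical features, 3 target classes
n_samples, n_features = wine.data.shape
n_classes = len(wine.target_names)

# random_state=0 is an arbitrary seed (an assumption for this sketch);
# any fixed integer makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    wine.data,
    wine.target,
    test_size=0.50,
    random_state=0,
)

print(n_samples, n_features, n_classes)
print(X_train.shape, X_test.shape)
```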

## Logging Library

The logging library provides an API for storing and retrieving model inputs, outputs, and analyses. Using ``fsspec``,
it supports three types of persistence:

* **local**: for personal projects and individual exploration
* **S3**: for collaborating, sharing, and backing up local work
* **in-memory**: for testing and development - no need to clean up after yourself!

Within ``rubicon_ml``, the implementations of these filesystems are very lightweight, with most functionality coming
directly from ``fsspec``. As a result, ``rubicon_ml``’s persistence layer is easy to extend and could
support any other type of persistence that ``fsspec`` supports.
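To give a feel for what ``fsspec`` provides on its own, here is a small sketch that writes and reads a file through its built-in in-memory filesystem. This uses ``fsspec`` directly and is not how ``rubicon_ml`` is implemented internally. The path and file contents are made up for illustration.

```python
import fsspec

# "memory" is one of fsspec's built-in filesystem implementations;
# swapping the protocol name (e.g. "file") keeps the same calls.
fs = fsspec.filesystem("memory")

# Hypothetical directory and file names, used only for this sketch.
directory = "/rubicon-sketch"
path = f"{directory}/notes.txt"

fs.makedirs(directory, exist_ok=True)

with fs.open(path, "wb") as f:
    f.write(b"accuracy: 0.97")

with fs.open(path, "rb") as f:
    contents = f.read()

exists = fs.exists(path)
print(contents, exists)
```

Because every backend exposes the same ``open``, ``exists``, and ``makedirs`` calls, code written against one filesystem needs few changes to target another.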

### Local Logging

First, we'll configure our ``Rubicon`` object and create a project.

In [2]:
import os

from rubicon_ml import Rubicon


root_dir = os.environ.get("RUBICON_ROOT", "rubicon-root")
root_path = f"{os.getcwd()}/{root_dir}"

rubicon = Rubicon(persistence="filesystem", root_dir=root_path)
project = rubicon.get_or_create_project("Wine Classification")

project

<rubicon_ml.client.project.Project at 0x15ac30370>

We’ll also create a ``run_experiment`` function to abstract our logging into a single spot. This
isn't necessary - ``rubicon_ml`` is designed to be flexible and fit into your workflow however you see fit!

In [3]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def run_experiment(project, is_standardized=False, n_components=2):
    experiment = project.log_experiment(model_name=GaussianNB.__name__)
    
    experiment.log_parameter("is_standardized", is_standardized)
    experiment.log_parameter("n_components", n_components)

    for name in wine.feature_names:
        experiment.log_feature(name)

    if is_standardized:
        classifier = make_pipeline(
            StandardScaler(),
            PCA(n_components=n_components),
            GaussianNB(),
        )
    else:
        classifier = make_pipeline(
            PCA(n_components=n_components),
            GaussianNB(),
        )
        
    classifier.fit(X_train, y_train)
    pred_test = classifier.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, pred_test)
    
    experiment.log_metric("accuracy", accuracy)
    
    confusion_matrix = pd.crosstab(
        wine.target_names[y_test],
        wine.target_names[pred_test],
        rownames=["actual"],
        colnames=["predicted"],
    )
    
    experiment.log_dataframe(
        confusion_matrix,
        tags=["confusion matrix"],
    )

    if accuracy >= 0.9:
        experiment.add_tags(["success"])
    else:
        experiment.add_tags(["failure"])
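The confusion matrix in ``run_experiment`` is built by indexing the array of class names with integer labels and passing the results to ``pd.crosstab``. The toy labels below are made up to show that step in isolation. They are not results from the wine dataset.

```python
import numpy as np
import pandas as pd

# Made-up class names and labels, for illustration only.
target_names = np.array(["class_0", "class_1", "class_2"])
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 1, 2, 1, 0])

# Indexing a NumPy array of names with integer labels maps each
# label to its class name, giving readable row and column headers.
confusion = pd.crosstab(
    target_names[y_true],
    target_names[y_pred],
    rownames=["actual"],
    colnames=["predicted"],
)

# Fraction of predictions that match the true label.
accuracy = (y_true == y_pred).mean()

print(confusion)
print(accuracy)
```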

Now we can run our experiment across a few different parameter sets! We'll end up with 14
unique experiments.

In [4]:
for is_standardized in [True, False]:
    for n_components in range(1, 15, 2):
        run_experiment(
            project,
            is_standardized=is_standardized,
            n_components=n_components,
        )
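The count of 14 follows from the two loops: two values of ``is_standardized`` times seven odd values of ``n_components``. A short sketch with ``itertools.product`` enumerates the same grid without running any experiments. The variable names are illustrative.

```python
from itertools import product

standardized_options = [True, False]
component_options = list(range(1, 15, 2))  # odd values from 1 through 13

# Every (is_standardized, n_components) pair the nested loops visit.
parameter_grid = list(product(standardized_options, component_options))

print(component_options)
print(len(parameter_grid))
```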

### S3 Logging

Logging to S3 is as simple as changing the ``root_dir`` when instantiating the ``Rubicon`` object:

```python
rubicon = Rubicon(
    persistence="filesystem",
    root_dir="s3://my-bucket/path/to/rubicon-root",
)
```

### In-memory Logging

The same is true of in-memory logging! In-memory logging doesn't require a ``root_dir``, but if
you're working with other in-memory ``fsspec`` filesystems you can still provide one.

```python
rubicon = Rubicon(persistence="memory")
```

## Dashboard

``rubicon_ml`` comes with a UI add-on (installable as ``rubicon-ml[ui]`` if you're using ``pip`` to manage
your Python environment) that allows you to explore, visualize, and compare data within your
projects. If you have ``git`` integration enabled, the dashboard will group your
experiments by commit automatically and link you directly to the corresponding model code.

In [5]:
from rubicon_ml.ui import Dashboard


Dashboard(rubicon=rubicon).run_server_inline()

 * Running on http://127.0.0.1:8050/ (Press CTRL+C to quit)
127.0.0.1 - - [12/May/2021 09:58:14] "GET /_alive_10a36bff-b1c3-4f41-951e-b57ddc890f93 HTTP/1.1" 200 -


Dash app running on http://127.0.0.1:8050/


![dashboard](quick-look-dashboard.png "dashboard")

By default, the dashboard launches on your localhost's port 8050. To view the dashboard inline in a notebook
or in a new JupyterLab window, instantiate the ``Dashboard`` object with ``mode="inline"`` or 
``mode="jupyterlab"``, respectively.

You don't even have to launch the dashboard from a Python interpreter. ``rubicon_ml`` comes packaged with a CLI that
can launch the dashboard from your terminal.

```
rubicon_ml ui --root-dir ./rubicon-root
```

## Publishing & Sharing

Once you have a project stored in S3, anyone with access to that bucket can use the Python library
to pull down your project and explore the data themselves, or visualize the project within the
dashboard.

Additionally, ``rubicon_ml`` offers a process to share a selected subset of your logged data via publishing and
consuming custom [Intake](https://github.com/intake/intake) catalogs.

### Publishing

Here, we’ll publish some experiments by generating an Intake catalog. The catalog file, generated as YAML,
simply points to the actual data and can be shared and versioned independently of your ``rubicon_ml`` project data.

You can use the ``experiment_tags`` parameter to publish experiments with specific tags, like **success**.
The resulting file will have a root ``sources`` key, followed by a number of sources representing experiments
and/or projects. In this example, you'll notice the ``urlpaths`` are all local since we're using a local filesystem.
Publishing is most useful when referencing remote data in S3. That's because publishing does not change or move the
data - it only references its location.

In [6]:
_ = rubicon.publish(
    project.name,
    experiment_tags=["success"],
    output_filepath="./wine_catalog.yml",
)

In [7]:
! head -7 wine_catalog.yml

sources:
  experiment_54d24e34_c0af_4565_909d_65b5900ba0e0:
    args:
      experiment_id: 54d24e34-c0af-4565-909d-65b5900ba0e0
      project_name: Wine Classification
      urlpath: /Users/ryan/oss/rubicon/notebooks/rubicon-root
    driver: rubicon_ml_experiment
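Since the catalog is plain YAML, it can also be inspected without ``intake``. The sketch below parses the excerpt shown above with PyYAML (assumed to be installed) and lists each source with its driver and ``urlpath``. The catalog text is embedded as a string so the example stands alone.

```python
import yaml  # PyYAML

# The catalog excerpt from above, embedded so the sketch is self-contained.
catalog_text = """\
sources:
  experiment_54d24e34_c0af_4565_909d_65b5900ba0e0:
    args:
      experiment_id: 54d24e34-c0af-4565-909d-65b5900ba0e0
      project_name: Wine Classification
      urlpath: /Users/ryan/oss/rubicon/notebooks/rubicon-root
    driver: rubicon_ml_experiment
"""

catalog_dict = yaml.safe_load(catalog_text)

# Each entry under "sources" names a driver and the arguments
# needed to locate the published data.
for source_name, source in catalog_dict["sources"].items():
    print(source_name, source["driver"], source["args"]["urlpath"])
```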


### Consuming Published Projects & Experiments

If you’ve been given a catalog of published experiments, you can easily load these with ``intake``
and the custom ``rubicon_ml.intake_rubicon`` driver, which both come packaged with the Python library.

In [8]:
import intake


catalog = intake.open_catalog("./wine_catalog.yml")

Then, you can use the ``catalog`` object to load the published projects or experiments into memory.
From there, you'll have a ``rubicon_ml.Project`` or ``rubicon_ml.Experiment`` object you can use as you
normally would!

In [9]:
first_source = catalog[list(catalog)[0]]

first_source.discover()
first_source.read()

<rubicon_ml.client.experiment.Experiment at 0x15f5437c0>

Since this catalog doesn't point to any remote data, it's not especially useful.
Let's clean up after ourselves and get rid of it.

In [10]:
! rm wine_catalog.yml