# Rubicon Quick Look

This is a simple classification example based 
on this [`scikit-learn` example](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py).
The wine dataset used in this example classifies different Italian wines by the cultivators that produced them based 
on 13 features describing their chemical makeup. More information on this dataset can be found by running 
`print(wine.DESCR)` after the dataset has been loaded.

The pipeline contains an optional `StandardScaler` configured by the `is_standardized` parameter, a `PCA` 
configured by the `n_components` parameter, and finally a `GaussianNB`.

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.50)
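
The pipeline described above can also be tried on its own, before any logging is involved. The following is a minimal sketch: the helper name `build_classifier`, the variable names, and the fixed `random_state=42` are assumptions added for illustration and do not appear in the original example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_classifier(is_standardized=False, n_components=2):
    """Hypothetical helper: optional StandardScaler, then PCA, then GaussianNB."""
    steps = [PCA(n_components=n_components), GaussianNB()]
    if is_standardized:
        # Scaling has to happen before PCA, so it goes first in the pipeline.
        steps.insert(0, StandardScaler())
    return make_pipeline(*steps)


wine_data = load_wine()

# random_state is an assumption, fixed here only so that reruns are comparable.
X_fit, X_eval, y_fit, y_eval = train_test_split(
    wine_data.data, wine_data.target, test_size=0.50, random_state=42
)

# Fit one pipeline without scaling and one with it, and record test accuracy.
scores = {}
for flag in (False, True):
    clf = build_classifier(is_standardized=flag, n_components=2)
    clf.fit(X_fit, y_fit)
    scores[flag] = clf.score(X_eval, y_eval)

print(scores)
```

Comparing the two entries in `scores` gives a quick sense of how much standardization matters for this dataset, which is the effect the experiments below log.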

## Local Logging

In [None]:
from rubicon import Rubicon

root_dir = "./rubicon-root"

rubicon = Rubicon(persistence="filesystem", root_dir=root_dir)
project = rubicon.get_or_create_project("Wine Classification")

print(project)

In [None]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def run_experiment(project, is_standardized=False, n_components=2):
    experiment = project.log_experiment(model_name=GaussianNB.__name__)
    
    experiment.log_parameter("is_standardized", is_standardized)
    experiment.log_parameter("n_components", n_components)

    for name in wine.feature_names:
        experiment.log_feature(name)

    if is_standardized:
        classifier = make_pipeline(StandardScaler(), PCA(n_components=n_components), GaussianNB())
    else:
        classifier = make_pipeline(PCA(n_components=n_components), GaussianNB())
        
    classifier.fit(X_train, y_train)
    pred_test = classifier.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, pred_test)
    
    experiment.log_metric("accuracy", accuracy)
    
    confusion_matrix = pd.crosstab(
        wine.target_names[y_test], wine.target_names[pred_test],
        rownames=["actual"], colnames=["predicted"],
    )
    
    experiment.log_dataframe(
        confusion_matrix, tags=["confusion matrix"]
    )

    if accuracy >= .9:
        experiment.add_tags(["success"])
    else:
        experiment.add_tags(["failure"])

We'll use the `multiprocessing` package to run 14 uniquely configured experiments in parallel.

In [None]:
import multiprocessing

processes = []

for is_standardized in [True, False]:
    for n_components in range(1, 15, 2):
        processes.append(multiprocessing.Process(
            target=run_experiment, args=[project],
            kwargs={"is_standardized": is_standardized, "n_components": n_components}
        ))
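
The count of 14 comes from crossing the two `is_standardized` values with the seven odd values of `n_components` produced by `range(1, 15, 2)`. As a small sketch, the same grid can be enumerated with `itertools.product`; the names `standardized_options`, `component_options`, and `grid` are illustrative and not part of the original notebook.

```python
from itertools import product

standardized_options = [True, False]
component_options = list(range(1, 15, 2))  # odd values from 1 through 13

# One dictionary of keyword arguments per experiment configuration.
grid = [
    {"is_standardized": is_standardized, "n_components": n_components}
    for is_standardized, n_components in product(standardized_options, component_options)
]

print(len(grid))
```

Each dictionary in `grid` has the same shape as the `kwargs` passed to `multiprocessing.Process` above.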

In [None]:
for process in processes:
    process.start()
    
for process in processes:
    process.join()

## S3 Logging

Logging to S3 is as simple as changing the `root_dir` when instantiating the `Rubicon` object:

```python
rubicon = Rubicon(persistence="filesystem", root_dir="s3://my-bucket/path/to/rubicon-root")
```

`rubicon` can even help you sync a locally logged project to S3 after the fact:

In [None]:
s3_path = "s3://my-bucket/path/to/rubicon-root"
rubicon.sync(project.name, s3_path)

We can create a second `rubicon` client pointing to S3 to verify our experiments were copied.

In [None]:
rubicon_s3 = Rubicon(persistence="filesystem", root_dir=s3_path)
project_s3 = rubicon_s3.get_project("Wine Classification")

assert len(project_s3.experiments()) == len(project.experiments())

## Publishing & Sharing

Once we've synced our local project with S3, anyone with access to that bucket and knowledge of how to use the `rubicon` library can use it to pull the project and explore the data themselves.

We've also made it even easier for users who aren't familiar with `rubicon`! Collaborators and reviewers can use [**Intake**](https://intake.readthedocs.io/en/latest/), an open source library whose goal is *"taking the pain out of data access and distribution"*, to load published `rubicon` experiments locally.

First, we'll "publish" some `rubicon` experiments by generating an Intake catalog that we can share. "Publishing" a project or experiment does not change or move the data; it is simply a way to designate an object as finalized. A **catalog** is a single YAML file that works as a pointer to the actual Rubicon data. Sharing and versioning this one file is much simpler than doing the same with an entire directory of data.

**Note**: We're using the **S3 client** we just created to publish here. Sharing a catalog that points to local files on your own machine isn't going to be any help unless you're certain your collaborators will be working on the same machine!

In [None]:
rubicon_s3.publish(project_s3.name, experiment_tags=["success"], output_filepath="./wine_catalog.yml")

! ls -l wine_catalog.yml
! cat wine_catalog.yml

## Consuming Published Data

As a consumer of Rubicon data, reading the catalog and loading the published Rubicon experiments into memory only requires the Intake drivers included in `rubicon.intake-rubicon`. Intake was installed as a part of the `rubicon` library, so we can use it directly to load our data.

In [None]:
import intake

catalog = intake.open_catalog("./wine_catalog.yml")
list(catalog)

Check out the collaboration and review example notebook for more details on reading catalogs.

## Dashboard

If you’ve installed the `ui` extra, you can use the **CLI** to launch the Rubicon Dashboard locally and explore your logged data. The dashboard is fully interactive and offers a number of ways to compare your experiments, with even more on the way! (Interrupt the kernel from the menu above to stop running the dashboard.)

In [None]:
! rubicon ui --root-dir "./rubicon-root"