# Sharing Experiments

In the [first part](https://capitalone.github.io/rubicon-ml/quick-look/logging-experiments.html)
of the quick look, we learned how to log ``rubicon_ml`` experiments in the context of a
simple classification problem. We also performed a small hyperparameter search to show how ``rubicon_ml``
can be used to compare the results of multiple model fits.

Inspecting our model fit results in the same session in which we trained the model is certainly useful, but
sharing experiments lets us collaborate with teammates and compare new model training results to old
experiments.

First, we'll create a ``rubicon_ml`` entry point and get the project we logged in the first part of the
quick look.

In [1]:
from rubicon_ml import Rubicon

rubicon = Rubicon(persistence="filesystem", root_dir="./rubicon-root")
project = rubicon.get_project(name="classifying penguins")
project

<rubicon_ml.client.project.Project at 0x16f06d030>

Let's say we want to share the results of our hyperparameter search with a teammate so they can evaluate them.
``rubicon_ml``'s ``publish`` function takes a list of experiments as input and uses ``intake`` to generate a catalog
containing the bare-minimum amount of metadata needed to retrieve an experiment, like its ID and filepath. More on ``intake``
can be found [in their docs](https://intake.readthedocs.io/en/latest/).

Hyperparameter searches can span thousands of combinations, so sharing every file ``rubicon_ml`` logs during
training quickly becomes unwieldy. That's why we use ``intake`` via our ``publish`` function to share only what needs
to be shared, in a single YAML file. Later, users can use that YAML file to retrieve the experiments listed in it.

**Note**: Sharing experiments relies on both the sharer and the recipient having access to the same underlying data source.
In this example, we're using a local filesystem - so these experiments couldn't actually be shared with anyone other than
people on this same physical machine. To get the most out of sharing, log your experiments to an S3 bucket that all teammates
have access to.

In [2]:
from rubicon_ml import publish

publish(
    project.experiments(tags=["parameter search"]),
    output_filepath="./penguin_catalog.yml",
)

!head -7 penguin_catalog.yml

sources:
  experiment_08625083_d5ec_4303_aeb1_8fd3b695a79a:
    args:
      experiment_id: 08625083-d5ec-4303-aeb1-8fd3b695a79a
      project_name: classifying penguins
      urlpath: ./rubicon-root
    driver: rubicon_ml_experiment


Each catalog contains one "source" per ``rubicon_ml`` experiment. A source holds the minimum metadata needed
to retrieve the associated experiment - the ``experiment_id``, the ``project_name`` and the ``urlpath`` to the root of the
``rubicon_ml`` repository used as an entry point. The ``rubicon_ml_experiment`` driver can be found
[within our library](https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/intake_rubicon/experiment.py)
and leverages the metadata in the YAML catalog to return the experiment objects associated with it.
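
To make the catalog layout concrete, here is a minimal sketch that rebuilds the structure shown in the YAML output using plain dictionaries. It is not how ``publish`` is implemented. The helper names ``source_name`` and ``build_sources`` are hypothetical, and the naming rule (an ``experiment_`` prefix with hyphens replaced by underscores) is inferred only from the sample output.

```python
def source_name(experiment_id):
    """Derive a catalog source name from an experiment ID (rule inferred from the sample output)."""
    return "experiment_" + experiment_id.replace("-", "_")


def build_sources(experiment_ids, project_name, urlpath):
    """Build a dict mirroring the `sources` section of the YAML catalog."""
    sources = {}
    for experiment_id in experiment_ids:
        sources[source_name(experiment_id)] = {
            "args": {
                "experiment_id": experiment_id,
                "project_name": project_name,
                "urlpath": urlpath,
            },
            "driver": "rubicon_ml_experiment",
        }
    return {"sources": sources}


catalog_dict = build_sources(
    ["08625083-d5ec-4303-aeb1-8fd3b695a79a"],
    project_name="classifying penguins",
    urlpath="./rubicon-root",
)
print(list(catalog_dict["sources"]))
```

Each entry carries only three arguments plus a driver name, which is why the catalog stays small even when the experiments themselves have many logged files.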

Provided the recipient of the shared YAML catalog has read access to the filesystem represented by ``urlpath``,
they can now use ``intake`` directly to read the catalog and load in the shared ``rubicon_ml`` experiments
for their own inspection.

In [10]:
import intake

catalog = intake.open_catalog("./penguin_catalog.yml")

for source in catalog:
    catalog[source].discover()
    
shared_experiments = [catalog[source].read() for source in catalog]

print("shared experiments:")
for experiment in shared_experiments:
    print(
        f"\tid: {experiment.id}, "
        f"parameters: {[(p.name, p.value) for p in experiment.parameters()]}, "
        f"metrics: {[(m.name, m.value) for m in experiment.metrics()]}"
    )

shared experiments:
	id: 08625083-d5ec-4303-aeb1-8fd3b695a79a, parameters: [('strategy', 'mean'), ('n_neighbors', 15)], metrics: [('accuracy', 0.7403846153846154)]
	id: 0a9b624d-705f-4fa4-9238-7948ca38cf5c, parameters: [('strategy', 'mean'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7403846153846154)]
	id: 0cd1373b-b7a3-4b12-9a2d-8a054c1034b9, parameters: [('strategy', 'mean'), ('n_neighbors', 5)], metrics: [('accuracy', 0.7211538461538461)]
	id: 165c8123-aee4-4830-b7e8-6cbab7ae4586, parameters: [('strategy', 'median'), ('n_neighbors', 5)], metrics: [('accuracy', 0.8365384615384616)]
	id: 1a692338-9131-4763-b7df-c5cca8df88ae, parameters: [('strategy', 'median'), ('n_neighbors', 5)], metrics: [('accuracy', 0.7115384615384616)]
	id: 1c740872-aab5-4b88-92f2-e6287b2425a6, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7884615384615384)]
	id: 252ecbbd-8a8c-4ede-b457-d0332530a06c, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 5)]
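
Once the shared experiments are loaded, a recipient will typically want to rank them. The sketch below shows one way to do that. So that it runs without the catalog, it uses a hypothetical ``SharedResult`` dataclass as a stand-in for the loaded experiment objects, populated by hand with three of the results printed above.

```python
from dataclasses import dataclass


@dataclass
class SharedResult:
    """Hypothetical stand-in for a loaded experiment: ID, parameters and accuracy."""

    id: str
    parameters: dict
    accuracy: float


# Three of the results printed above, copied by hand for illustration.
results = [
    SharedResult(
        "08625083-d5ec-4303-aeb1-8fd3b695a79a",
        {"strategy": "mean", "n_neighbors": 15},
        0.7403846153846154,
    ),
    SharedResult(
        "165c8123-aee4-4830-b7e8-6cbab7ae4586",
        {"strategy": "median", "n_neighbors": 5},
        0.8365384615384616,
    ),
    SharedResult(
        "1c740872-aab5-4b88-92f2-e6287b2425a6",
        {"strategy": "most_frequent", "n_neighbors": 10},
        0.7884615384615384,
    ),
]

# Sort from highest to lowest accuracy and report the top result.
ranked = sorted(results, key=lambda r: r.accuracy, reverse=True)
best = ranked[0]
print(f"best: {best.id} {best.parameters} accuracy={best.accuracy:.4f}")
```

With real experiment objects, the same ranking could be driven by the values returned from ``experiment.metrics()``, as in the loop above.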

Suppose we already have an ``intake`` catalog file and would like to append new experiments to it rather than write a separate file. To do this, we pass the optional ``catalog_filepath`` argument to the ``publish`` function, as seen below.

In [21]:

# Read the experiments from the classifying penguins catalog
penguin_catalog = intake.open_catalog("./penguin_catalog.yml")
for source in penguin_catalog:
    penguin_catalog[source].discover()

shared_experiments = [penguin_catalog[source].read() for source in penguin_catalog]

# Choose the experiment with ID 0a9b624d-705f-4fa4-9238-7948ca38cf5c, which does not
# yet exist in the `update_catalog_test.yml` catalog
random_new_experiment = [shared_experiments[1]]

# Update the existing catalog file without discovering and reading its contents
new_catalog = publish(
    catalog_filepath="./update_catalog_test.yml",
    experiments=random_new_experiment,
)

!head -100 update_catalog_test.yml

sources:
  experiment_08625083_d5ec_4303_aeb1_8fd3b695a79a:
    args:
      experiment_id: 08625083-d5ec-4303-aeb1-8fd3b695a79a
      project_name: classifying penguins
      urlpath: ./rubicon-root
    driver: rubicon_ml_experiment
  experiment_0a9b624d_705f_4fa4_9238_7948ca38cf5c:
    args:
      experiment_id: 0a9b624d-705f-4fa4-9238-7948ca38cf5c
      project_name: classifying penguins
      urlpath: ./rubicon-root
    driver: rubicon_ml_experiment
  experiment_0cd1373b_b7a3_4b12_9a2d_8a054c1034b9:
    args:
      experiment_id: 0cd1373b-b7a3-4b12-9a2d-8a054c1034b9
      project_name: classifying penguins
      urlpath: ./rubicon-root
    driver: rubicon_ml_experiment
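
Conceptually, appending to a catalog amounts to merging new source entries into the existing ``sources`` mapping. The sketch below illustrates that idea with plain dictionaries. It is an assumption about the observable result, not a description of how ``publish`` works internally, and ``entry`` and ``merge_sources`` are hypothetical helpers.

```python
def entry(experiment_id):
    """Build one catalog source entry in the shape shown in the YAML above."""
    return {
        "args": {
            "experiment_id": experiment_id,
            "project_name": "classifying penguins",
            "urlpath": "./rubicon-root",
        },
        "driver": "rubicon_ml_experiment",
    }


def merge_sources(existing, new):
    """Return a catalog dict holding the union of both `sources` mappings."""
    merged = dict(existing.get("sources", {}))  # copy so the input is not mutated
    merged.update(new.get("sources", {}))  # new entries win on name clashes
    return {"sources": merged}


existing = {
    "sources": {
        "experiment_08625083_d5ec_4303_aeb1_8fd3b695a79a": entry(
            "08625083-d5ec-4303-aeb1-8fd3b695a79a"
        ),
    }
}
new = {
    "sources": {
        "experiment_0a9b624d_705f_4fa4_9238_7948ca38cf5c": entry(
            "0a9b624d-705f-4fa4-9238-7948ca38cf5c"
        ),
    }
}

updated = merge_sources(existing, new)
print(sorted(updated["sources"]))
```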
