# Logging Asynchronously

The asynchronous `rubicon` client reads and writes `rubicon` objects using Python's built-in `asyncio` module. `rubicon` itself is computationally lightweight, but reading from and writing to S3 takes time over the network. With `asyncio`, we can communicate with S3 asynchronously while other work executes.

There are **two main differences** between the asynchronous and standard `rubicon` clients:

* the asynchronous client is for **S3 logging only**
* the asynchronous client's functions **return coroutines** rather than their standard return values
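
To make the second difference concrete, here is a minimal sketch that uses only `asyncio`. `fetch_value` is a hypothetical stand-in for any asynchronous client call and is not part of `rubicon`. Calling it produces a coroutine object, and the return value only becomes available once that coroutine is awaited.

```python
import asyncio
import inspect


async def fetch_value():
    # hypothetical stand-in for an asynchronous client call
    await asyncio.sleep(0.01)  # simulate a short network round trip
    return "value"


async def main():
    pending = fetch_value()  # nothing has run yet; this is a coroutine object
    is_coroutine = inspect.iscoroutine(pending)
    result = await pending  # awaiting runs the coroutine and yields its result
    return is_coroutine, result


is_coroutine, result = asyncio.run(main())
print(is_coroutine, result)
```

Inside a notebook, where an event loop is already running, a top-level `await main()` takes the place of `asyncio.run(main())`. The cells below rely on the same top-level `await`.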

In [None]:
from rubicon.client.asynchronous import Rubicon

s3_bucket = "my-bucket"
root_dir = f"s3://{s3_bucket}/rubicon-root"

rubicon = Rubicon(persistence="filesystem", root_dir=root_dir)
project = await rubicon.get_or_create_project(
    "Asynchronous Experiments", description="Training multiple models asynchronously"
)

print(project)

We'll take another look at the Iris dataset in this example.

In [None]:
from datetime import datetime
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

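# seed the split and the models from the current time
# (`datetime.utcnow` is deprecated as of Python 3.12)
random_state = int(datetime.now().timestamp())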

iris = load_iris()
iris_datasets = train_test_split(iris['data'], iris['target'], random_state=random_state)

We can define an asynchronous `run_experiment` function that logs a new experiment to the provided `project`, then trains, scores, and logs a model of type `classifier_cls` using the training and testing data in `iris_datasets`.

In [None]:
import asyncio
from collections import namedtuple

SklearnTrainingMetadata = namedtuple("SklearnTrainingMetadata", "module_name method")

async def run_experiment(project, classifier_cls, iris_datasets, **kwargs):
    X_train, X_test, y_train, y_test = iris_datasets
    
    # await logging the experiment so we can log other things to it
    experiment = await project.log_experiment(
        training_metadata=[SklearnTrainingMetadata("sklearn.datasets", "load_iris")],
        model_name=classifier_cls.__name__,
        tags=[classifier_cls.__name__],
    )
    
    # gather a list of coroutines that will log objects to the experiment
    rubicon_logging_coroutines = []
    
    for key, value in kwargs.items():
        parameter_coroutine = experiment.log_parameter(key, value)
        rubicon_logging_coroutines.append(parameter_coroutine)
    
    # `iris` is the dataset loaded in the previous cell
    parameter_coroutine = experiment.log_parameter("n_features", len(iris.feature_names))
    rubicon_logging_coroutines.append(parameter_coroutine)
    
    for name in iris.feature_names:
        feature_coroutine = experiment.log_feature(name)
        rubicon_logging_coroutines.append(feature_coroutine)
        
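    # model training is synchronous; the coroutines collected above have not started yet
    classifier = classifier_cls(**kwargs)
    classifier.fit(X_train, y_train)

    # `score` predicts on the test set internally and returns the mean accuracy
    accuracy = classifier.score(X_test, y_test)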
    
    metric_coroutine = experiment.log_metric("accuracy", accuracy)
    rubicon_logging_coroutines.append(metric_coroutine)

    if accuracy >= 0.94:
        tag_coroutine = experiment.add_tags(["success"])
    else:
        tag_coroutine = experiment.add_tags(["failure"])
    rubicon_logging_coroutines.append(tag_coroutine)
    
    # run all logging coroutines concurrently and wait for every one to finish
    await asyncio.gather(*rubicon_logging_coroutines)
    
    return experiment
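
The collect-then-gather pattern used in `run_experiment` can be sketched in isolation. The example below assumes a hypothetical `fake_log` coroutine that simulates a network write with `asyncio.sleep`; it is not a `rubicon` function. Because the simulated writes wait concurrently, the total elapsed time should be close to a single delay rather than the sum of all of them.

```python
import asyncio
import time


async def fake_log(name, delay=0.05):
    # hypothetical stand-in for a logging call that waits on the network
    await asyncio.sleep(delay)
    return name


async def log_all(names):
    # build the coroutines first; none of them runs until gathered
    logging_coroutines = [fake_log(name) for name in names]

    start = time.perf_counter()
    logged = await asyncio.gather(*logging_coroutines)
    elapsed = time.perf_counter() - start

    return logged, elapsed


names = ["n_estimators", "random_state", "accuracy", "success"]
logged, elapsed = asyncio.run(log_all(names))
print(logged, f"{elapsed:.2f}s")
```

Note that a bare coroutine does not start running when it is created. In `run_experiment`, the parameter and feature writes therefore begin at the final `asyncio.gather` call, after the model has been fit and scored.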

This time we'll take a look at two more classifiers, **decision tree** and **k-neighbors**, in addition to the **random forest** classifier. Each classifier will be run across four sets of parameters (provided as `kwargs` to `run_experiment`), for a total of 12 experiments. Here, we'll build up a list of coroutines that will run each experiment asynchronously.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

coroutines = []

for n_estimators in [10, 20, 30, 40]:
    coroutines.append(run_experiment(
        project, RandomForestClassifier, iris_datasets,
        n_estimators=n_estimators, random_state=random_state,
    ))
    
for criterion in ["gini", "entropy"]:
    for splitter in ["best", "random"]:
        coroutines.append(run_experiment(
            project, DecisionTreeClassifier, iris_datasets,
            criterion=criterion, splitter=splitter, random_state=random_state,
        ))

for n_neighbors in [5, 10, 15, 20]:
    coroutines.append(run_experiment(
        project, KNeighborsClassifier, iris_datasets, n_neighbors=n_neighbors,
    ))

Let's run all our experiments asynchronously!

In [None]:
experiments = await asyncio.gather(*coroutines)
experiments
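
`asyncio.gather` returns its results in the order the awaitables were passed in, not the order in which they finish, so `experiments[0]` corresponds to the first coroutine appended to `coroutines`. A small sketch of that behavior, with a hypothetical `finish_after` coroutine standing in for `run_experiment`:

```python
import asyncio


async def finish_after(label, delay):
    # hypothetical stand-in for run_experiment; completes after `delay` seconds
    await asyncio.sleep(delay)
    return label


async def main():
    # the first coroutine is the slowest, so it completes last
    return await asyncio.gather(
        finish_after("first", 0.03),
        finish_after("second", 0.02),
        finish_after("third", 0.01),
    )


results = asyncio.run(main())
print(results)  # ['first', 'second', 'third']
```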

Now we can validate that we successfully logged all 12 experiments to our project.

In [None]:
len(await project.experiments())

Let's see which experiments we tagged as successful and what type of model they used.

In [None]:
for e in await project.experiments(tags=["success"]):
    print(f"experiment {e.id} was successful using a {e.model_name}")

We can also take a deeper look at any of our experiments.

In [None]:
experiment = experiments[0]

print(f"training_metadata: {SklearnTrainingMetadata(*experiment.training_metadata)}")
print(f"tags: {await experiment.tags}")
print("parameters:")
for parameter in await experiment.parameters():
    print(f"\t{parameter.name}: {parameter.value}")
print("features:")
for feature in await experiment.features():
    print(f"\t{feature.name}")
print("metrics:")
for metric in await experiment.metrics():
    print(f"\t{metric.name}: {metric.value}")

Or we could grab the project's data as a Dask dataframe for further manipulation:

In [None]:
ddf = await rubicon.get_project_as_dask_df("Asynchronous Experiments")
ddf.compute()