# Logging Concurrently
 
Let's see how a few other popular ``scikit-learn`` models perform with the
wine dataset. ``rubicon`` logging is thread-safe, so we can test many
model configurations at once.

In [1]:
import os

from rubicon import Rubicon


root_dir = f"{os.path.dirname(os.getcwd())}/rubicon-root"

rubicon = Rubicon(persistence="filesystem", root_dir=root_dir)
project = rubicon.get_or_create_project(
    "Concurrent Experiments",
    description="training multiple models in parallel",
)

project

<rubicon.client.project.Project at 0x15f6c5d00>

For a recap of the contents of the wine dataset, check out ``wine.DESCR``
and ``wine.data``. We'll split the data into training and testing sets, holding out 25% for testing.

In [2]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split


wine = load_wine()
wine_feature_names = wine.feature_names
wine_datasets = train_test_split(
    wine["data"],
    wine["target"],
    test_size=0.25,
)

We'll use this ``run_experiment`` function to log a new **experiment** to
the provided **project**, then train, score, and log a model of type ``classifier_cls`` using
the training and testing data in ``wine_datasets``.

In [3]:
from collections import namedtuple


SklearnTrainingMetadata = namedtuple("SklearnTrainingMetadata", "module_name method")

def run_experiment(project, classifier_cls, wine_datasets, feature_names, **kwargs):
    X_train, X_test, y_train, y_test = wine_datasets
    
    experiment = project.log_experiment(
        training_metadata=[
            SklearnTrainingMetadata("sklearn.datasets", "load_wine"),
        ],
        model_name=classifier_cls.__name__,
        tags=[classifier_cls.__name__],
    )
    
    for key, value in kwargs.items():
        experiment.log_parameter(key, value)
    
    for name in feature_names:
        experiment.log_feature(name)
        
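    # Train the classifier with the same parameters we just logged.
    classifier = classifier_cls(**kwargs)
    classifier.fit(X_train, y_train)
    classifier.predict(X_test)
    
    # Mean accuracy on the held-out test data.
    accuracy = classifier.score(X_test, y_test)
    
    experiment.log_metric("accuracy", accuracy)

    # Tag the experiment by whether it reached the 95% accuracy threshold.
    if accuracy >= 0.95:
        experiment.add_tags(["success"])
    else:
        experiment.add_tags(["failure"])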

This time we'll take a look at three classifiers - ``RandomForestClassifier``, ``DecisionTreeClassifier``, and
``KNeighborsClassifier`` - to see which performs best. Each classifier will be run with four sets of parameters
(passed as ``kwargs`` to ``run_experiment``), for a total of 12 experiments. Here, we'll build a list of
processes, one per experiment, so that they can all run in parallel.

In [4]:
import multiprocessing

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


processes = []

for n_estimators in [10, 20, 30, 40]:
    processes.append(multiprocessing.Process(
        target=run_experiment,
        args=[project, RandomForestClassifier, wine_datasets, wine_feature_names],
        kwargs={"n_estimators": n_estimators},
    ))

for n_neighbors in [5, 10, 15, 20]:
    processes.append(multiprocessing.Process(
        target=run_experiment,
        args=[project, KNeighborsClassifier, wine_datasets, wine_feature_names],
        kwargs={"n_neighbors": n_neighbors},
    ))

for criterion in ["gini", "entropy"]:
    for splitter in ["best", "random"]:
        processes.append(multiprocessing.Process(
            target=run_experiment,
            args=[project, DecisionTreeClassifier, wine_datasets, wine_feature_names],
            kwargs={
                "criterion": criterion,
                "splitter": splitter,
            },
        ))
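
The total of 12 comes from four parameter sets per classifier. As a rough sketch, the same parameter sets can be enumerated on their own with only the standard library. The list names here (``random_forest_kwargs`` and so on) are illustrative and not part of ``rubicon`` or the cells above.

```python
from itertools import product

# Four settings each for the random forest and k-nearest-neighbors models.
random_forest_kwargs = [{"n_estimators": n} for n in [10, 20, 30, 40]]
k_neighbors_kwargs = [{"n_neighbors": n} for n in [5, 10, 15, 20]]

# Two criteria x two splitters gives four decision tree settings,
# in the same order as the nested loops above.
decision_tree_kwargs = [
    {"criterion": criterion, "splitter": splitter}
    for criterion, splitter in product(["gini", "entropy"], ["best", "random"])
]

all_kwargs = random_forest_kwargs + k_neighbors_kwargs + decision_tree_kwargs
print(len(all_kwargs))  # 12
```

Each dictionary corresponds to the ``kwargs`` passed to one ``multiprocessing.Process``.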

Let's run all our experiments in parallel!

In [5]:
for process in processes:
    process.start()
    
for process in processes:
    process.join()

If you're following along at home, you'll find that Python's ``multiprocessing`` doesn't
work well in IPython. The cell above won't actually spawn the ``run_experiment`` processes unless
it's run from a ``__main__`` process, so let's use this script to actually write our data.
It contains the exact same code defined in the cells above.

In [6]:
%run ./logging-concurrently.py
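
The script itself isn't reproduced here. As a minimal sketch of the general pattern such a script relies on, the example below guards process creation with ``if __name__ == "__main__":``. Under the ``spawn`` start method, child processes re-import the main module, so the guard keeps them from starting processes of their own. ``square_worker`` and ``build_processes`` are hypothetical stand-ins, not the contents of ``logging-concurrently.py``.

```python
import multiprocessing


def square_worker(value):
    # Hypothetical stand-in for run_experiment: does a little work and reports it.
    print(f"worker got {value}, squared is {value * value}")


def build_processes(values):
    # Create (but do not start) one process per input value.
    return [
        multiprocessing.Process(target=square_worker, args=(value,))
        for value in values
    ]


if __name__ == "__main__":
    # Only the process that runs this file as a script reaches this block.
    processes = build_processes([1, 2, 3])

    for process in processes:
        process.start()

    for process in processes:
        process.join()
```

The worker is defined at module level so that child processes can find it when they import the module.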

Now we can validate that we successfully logged all 12 experiments to our project.

In [7]:
len(project.experiments())

12

Let's see which experiments we tagged as successful and what type of model they used.

In [8]:
for e in project.experiments(tags=["success"]):    
    print(f"experiment {e.id[:8]} was successful using a {e.model_name}")

experiment 25e04cff was successful using a RandomForestClassifier
experiment 98e20ab6 was successful using a RandomForestClassifier
experiment e89050d5 was successful using a RandomForestClassifier
experiment fcb26efe was successful using a RandomForestClassifier


We can also take a deeper look at any of our experiments.

In [9]:
first_experiment = project.experiments()[0]

training_metadata = SklearnTrainingMetadata(*first_experiment.training_metadata)
tags = first_experiment.tags

parameters = [(p.name, p.value) for p in first_experiment.parameters()]
metrics = [(m.name, m.value) for m in first_experiment.metrics()]
    
print(
    f"experiment {first_experiment.id}\n"
    f"training metadata: {training_metadata}\ntags: {tags}\n"
    f"parameters: {parameters}\nmetrics: {metrics}"
)

experiment 0321ae46-9845-41e9-97a4-49637736ebbb
training metadata: SklearnTrainingMetadata(module_name='sklearn.datasets', method='load_wine')
tags: ['DecisionTreeClassifier', 'failure']
parameters: [('criterion', 'entropy'), ('splitter', 'best')]
metrics: [('accuracy', 0.8666666666666667)]
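
The same accessors can also rank experiments rather than inspect just one. The sketch below assumes only the attributes used above (``metrics()``, ``name``, ``value``, ``model_name``). ``best_by_metric`` and ``fake_experiment`` are hypothetical helpers, and the stand-in objects let the example run without a ``rubicon`` project.

```python
from types import SimpleNamespace


def best_by_metric(experiments, metric_name="accuracy"):
    """Return (experiment, value) for the highest logged `metric_name`, or None."""
    scored = []
    for experiment in experiments:
        for metric in experiment.metrics():
            if metric.name == metric_name:
                scored.append((experiment, metric.value))
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[1])


def fake_experiment(model_name, accuracy):
    # Stand-in exposing the same attributes the cells above read from an experiment.
    metric = SimpleNamespace(name="accuracy", value=accuracy)
    return SimpleNamespace(model_name=model_name, metrics=lambda: [metric])


stand_ins = [
    fake_experiment("KNeighborsClassifier", 0.8),
    fake_experiment("RandomForestClassifier", 1.0),
    fake_experiment("DecisionTreeClassifier", 0.866667),
]

best, value = best_by_metric(stand_ins)
print(best.model_name, value)
```

With a real project, ``project.experiments()`` would take the place of ``stand_ins``.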


Or we could grab the project's data as a dataframe!

In [10]:
ddf = rubicon.get_project_as_dask_df("Concurrent Experiments")
ddf.compute()

Unnamed: 0,id,name,description,model_name,commit_hash,tags,created_at,n_estimators,n_neighbors,criterion,splitter,accuracy
0,2c64f744-b4d6-4d63-bc78-6e297dfd1f64,,,DecisionTreeClassifier,,"[DecisionTreeClassifier, failure]",2021-04-16 14:26:02.849734,,,entropy,random,0.933333
1,d08469d4-ca8f-4e30-bad5-d7171a5f0892,,,DecisionTreeClassifier,,"[DecisionTreeClassifier, failure]",2021-04-16 14:26:02.816129,,,gini,random,0.911111
2,98685436-93a6-41ef-bdd8-ab5c731e377b,,,DecisionTreeClassifier,,"[DecisionTreeClassifier, failure]",2021-04-16 14:26:02.784963,,,gini,best,0.866667
3,0321ae46-9845-41e9-97a4-49637736ebbb,,,DecisionTreeClassifier,,"[DecisionTreeClassifier, failure]",2021-04-16 14:26:02.777539,,,entropy,best,0.866667
4,0f34115e-a912-416b-8798-25ad8c632649,,,KNeighborsClassifier,,"[KNeighborsClassifier, failure]",2021-04-16 14:26:02.688498,,10.0,,,0.8
5,87b8c62d-b2dd-4174-b467-2666409045e3,,,KNeighborsClassifier,,"[KNeighborsClassifier, failure]",2021-04-16 14:26:02.601041,,15.0,,,0.777778
6,25e04cff-1e65-44fc-abf2-fb80cbed8b1d,,,RandomForestClassifier,,"[success, RandomForestClassifier]",2021-04-16 14:26:02.548160,30.0,,,,1.0
7,fcb26efe-7f92-4e71-a9b3-a3050eba0d06,,,RandomForestClassifier,,"[success, RandomForestClassifier]",2021-04-16 14:26:02.452911,10.0,,,,0.955556
8,fbe7e8d6-3b19-4739-b12c-4da54434d9e4,,,KNeighborsClassifier,,"[KNeighborsClassifier, failure]",2021-04-16 14:26:02.449763,,20.0,,,0.755556
9,87de17d7-ef18-4eee-aac7-31fdce6041a0,,,KNeighborsClassifier,,"[KNeighborsClassifier, failure]",2021-04-16 14:26:02.417662,,5.0,,,0.755556
