# Logging Basics

Let's see how `rubicon` can help us create an optimal ML model. We'll use supervised learning to train a classification model using `scikit-learn`'s [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) that can predict the class of Iris.

Let's load the Iris dataset and take a quick look at its contents:

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
print(iris['DESCR'][:193])

We'll load the data into a dask dataframe so it's easier to work with throughout our example.

In [None]:
from dask import dataframe as dd

iris_data = dd.from_array(iris.data, columns=iris.feature_names)
iris_data.compute()

Each sample (row in our dataframe) holds the measurements of a plant from one of the 3 classes of Iris. We'll train on 75% of the data and hold out the remaining 25% to make predictions using our model. Then, we'll compare those predictions with the recorded classes to compute our model's accuracy.

Let's shuffle the data and define our training and prediction subsets:

In [None]:
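from datetime import datetime, timezone
from sklearn.model_selection import train_test_split

# Seed from the current UTC time; datetime.utcnow() is deprecated as of Python 3.12.
random_state = int(datetime.now(timezone.utc).timestamp())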
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=random_state)
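Because the seed comes from the current time, each run of the notebook produces a different split, and recording `random_state` is what lets us reproduce a given split later. The sketch below illustrates the idea with only the standard library. `shuffle_split` is a hypothetical helper written for this illustration; it is not how `train_test_split` is implemented.

```python
import random


def shuffle_split(rows, test_fraction=0.25, seed=None):
    """Hypothetical helper: shuffle a copy of `rows`, then split it in two."""
    rng = random.Random(seed)  # private generator, seeded for repeatability
    shuffled = list(rows)
    rng.shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


rows = list(range(20))
train_a, test_a = shuffle_split(rows, seed=42)
train_b, test_b = shuffle_split(rows, seed=42)

# The same seed gives the same split, and every row lands in exactly one subset.
print(train_a == train_b and test_a == test_b)
print(len(train_a), len(test_a))
```

Calling the helper again with a different seed would generally give a different split, which is why the seed is worth logging alongside the other parameters.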

Let's define a function for fitting our model. We'll use sklearn's [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
from sklearn.ensemble import RandomForestClassifier

def fit_classifier(n_estimators, n_features, random_state, X_train, y_train):
    # Note: n_features is accepted but not passed to the classifier.
    rfc = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    rfc.fit(X_train, y_train)

    return rfc

RandomForestClassifier accepts `n_estimators` (the number of trees in the forest) as a **parameter**. What if we wanted to tune this **parameter** to see how it affected our output **metric**, accuracy?

This is where `rubicon`'s **experiment** logging comes in handy. Let's get `rubicon` set up:

In [None]:
from rubicon import Rubicon

root_dir = "./rubicon-root"
rubicon = Rubicon(persistence="filesystem", root_dir=root_dir)

Now, let's create a project to hold our various **experiments** as we try different values of `n_estimators`.

In [None]:
project = rubicon.create_project("Iris Model", description="Using scikit-learn to create an Iris classifier.")
print(project)

Let's log some **experiments** and tag them with "success" if their accuracy is greater than or equal to 94% (arbitrarily chosen for this example) so we can easily retrieve them later.

In [None]:
from collections import namedtuple

SklearnTrainingMetadata = namedtuple("SklearnTrainingMetadata", "module_name method")

for e in range(1, 100, 5):
    n_estimators = e
    n_features = len(iris.feature_names)
    
    experiment = project.log_experiment(
        training_metadata=[SklearnTrainingMetadata("sklearn.datasets", "load_iris")],
        model_name="Iris Prediction Model",
        tags=["iris"],
    )
    
    experiment.log_parameter("n_estimators", n_estimators)
    experiment.log_parameter("n_features", n_features)
    experiment.log_parameter("random_state", random_state)
    
    rfc = fit_classifier(n_estimators, n_features, random_state, X_train, y_train)
    
    feature_importances = list(zip(iris.feature_names, rfc.feature_importances_))
    for name, importance in feature_importances:
        experiment.log_feature(name, importance=importance)
    
    accuracy = rfc.score(X_test, y_test)
    print(f"Accuracy: {accuracy}")
    
    experiment.log_metric("accuracy", accuracy)
    
    if accuracy >= 0.94:
        experiment.add_tags(["success"])
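For a classifier, `score` returns the mean accuracy: the fraction of test samples whose predicted class matches the recorded one. Here is a minimal sketch of that calculation and of the tagging rule, using made-up labels. `accuracy_sketch` and `tags_for` are hypothetical names used only for this illustration.

```python
def accuracy_sketch(y_true, y_pred):
    """Fraction of predictions that match the recorded labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def tags_for(accuracy, threshold=0.94):
    """Always tag "iris"; add "success" at or above the threshold."""
    tags = ["iris"]
    if accuracy >= threshold:
        tags.append("success")
    return tags


# Made-up labels for illustration: 19 of 20 predictions are correct.
y_true = [0, 1, 2, 1, 0] * 4
y_pred = list(y_true)
y_pred[3] = 2  # introduce a single mistake

accuracy = accuracy_sketch(y_true, y_pred)
print(accuracy, tags_for(accuracy))
```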

Let's fetch the **experiments** tagged with "success":

In [None]:
success_experiments = project.experiments(tags=["success"])
for success_experiment in success_experiments:
    for p in success_experiment.parameters():
        if p.name == 'n_estimators':
            print(f"n_estimators value: {p.value}")
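Conceptually, filtering by tag selects the experiments whose tags include the requested ones. The sketch below mimics that with plain in-memory records. `ExperimentRecord` and `with_tags` are hypothetical stand-ins, the values are invented, and none of this describes how `rubicon` stores or queries experiments. In particular, the sketch assumes a record must carry every requested tag.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    """Hypothetical stand-in for a logged experiment."""
    parameters: dict
    metrics: dict
    tags: set = field(default_factory=set)


def with_tags(records, tags):
    """Return the records that carry every requested tag (assumed semantics)."""
    wanted = set(tags)
    return [r for r in records if wanted <= r.tags]


# Invented values, for illustration only.
records = [
    ExperimentRecord({"n_estimators": 1}, {"accuracy": 0.89}, {"iris"}),
    ExperimentRecord({"n_estimators": 6}, {"accuracy": 0.95}, {"iris", "success"}),
    ExperimentRecord({"n_estimators": 11}, {"accuracy": 0.97}, {"iris", "success"}),
]

for record in with_tags(records, ["success"]):
    print(f"n_estimators value: {record.parameters['n_estimators']}")
```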

While this is a simple example, it shows how we can use `rubicon` to track our model's performance over time as we try different **parameters** to optimize our **metrics**. We also logged the training metadata, **parameters**, and **features** used for each **experiment** so we can analyze that data later on.

`rubicon` supports even more logging capabilities, like logging **artifacts** and **dataframes**, to ensure complete reproducibility as well.

-----

Now that we're "done" logging our data, we'd likely want to explore it. We could use `rubicon`'s Dashboard to visualize our logged data:

In [None]:
from rubicon.ui import Dashboard

Dashboard(persistence="filesystem", root_dir="./rubicon-root").run_server()

Or we could grab the project's data as a dask dataframe, which can be computed and then easily manipulated:

In [None]:
ddf = rubicon.get_project_as_dask_df("Iris Model")
ddf.compute()
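Once the project's data is in tabular form, a natural follow-up is to ask which run scored best. Here is a rough sketch of that kind of manipulation, using plain dictionaries in place of dataframe rows so it runs without extra dependencies. The column names and values are invented, and the real dataframe's columns may differ.

```python
# Invented rows standing in for a computed dataframe; real column names may differ.
rows = [
    {"n_estimators": 1, "accuracy": 0.89},
    {"n_estimators": 6, "accuracy": 0.95},
    {"n_estimators": 11, "accuracy": 0.97},
    {"n_estimators": 16, "accuracy": 0.97},
]

# Rank by accuracy (highest first), breaking ties in favor of fewer trees.
ranked = sorted(rows, key=lambda row: (-row["accuracy"], row["n_estimators"]))
best = ranked[0]
print(best)
```

Preferring the smaller forest on a tie is a design choice for this sketch: with equal accuracy, fewer trees are cheaper to train and evaluate.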