# `rubicon` - `sklearn` Integration

Integrate `rubicon` into your scikit-learn pipelines without scattering logging statements throughout your code.

## Simple pipeline run

We'll set up a simple pipeline and use `rubicon`'s sklearn integration to automatically log parameters and metrics as we fit and score the model.

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from rubicon import Rubicon
from rubicon.sklearn import RubiconPipeline

rubicon = Rubicon(persistence="memory", root_dir="root")
project = rubicon.get_or_create_project("Rubicon Pipeline Example")

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = RubiconPipeline([('scaler', StandardScaler()), ('svc', SVC())], project)
pipe

In [None]:
pipe.fit(X_train, y_train)

In [None]:
pipe.score(X_test, y_test)

### Fetch the results from Rubicon

During the pipeline run, an experiment was created automatically, and the corresponding parameters and metrics were logged to it. We can use the `rubicon` library to retrieve these experiments or view them on the dashboard.

In [None]:
experiments = project.experiments()
print(f"{len(experiments)} experiment(s)")
experiment = experiments[0]
experiment

In [None]:
for param in experiment.parameters():
    print(f"{param.name}: {param.value}")

In [None]:
for metric in experiment.metrics():
    print(f"{metric.name}: {metric.value}")
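
To work with the logged values programmatically rather than printing them, they can be gathered into plain dictionaries. The sketch below relies only on the `name` and `value` attributes used in the loops above. `summarize_experiment` is a hypothetical helper, not part of `rubicon`, and the `SimpleNamespace` objects are stand-ins for a real experiment (with made-up parameter and metric names) so that the snippet runs on its own.

```python
from types import SimpleNamespace


def summarize_experiment(experiment):
    """Collect an experiment's parameters and metrics into two dicts."""
    params = {p.name: p.value for p in experiment.parameters()}
    metrics = {m.name: m.value for m in experiment.metrics()}
    return params, metrics


# Stand-in for a logged experiment; names and values are illustrative only.
fake_experiment = SimpleNamespace(
    parameters=lambda: [
        SimpleNamespace(name="svc__C", value=1.0),
        SimpleNamespace(name="svc__kernel", value="rbf"),
    ],
    metrics=lambda: [SimpleNamespace(name="score", value=0.88)],
)

params, metrics = summarize_experiment(fake_experiment)
print(params)
print(metrics)
```

With a real project, the same helper could be called on `project.experiments()[0]`, assuming the experiment exposes `parameters()` and `metrics()` as shown above.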

## Use GridSearch

`GridSearchCV` is commonly used to test many different parameter combinations across an estimator or pipeline.
Each set of parameters tried in the grid search will be logged as an individual experiment.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
categories = ["alt.atheism", "talk.religion.misc"]

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))

In [None]:
# Start from a clean root directory; -f avoids an error if it does not exist yet.
! rm -rf ./rubicon-root

In [None]:
from rubicon import Rubicon
from rubicon.sklearn import FilterLogger, RubiconPipeline


rubicon = Rubicon(persistence="filesystem", root_dir="./rubicon-root")
project = rubicon.get_or_create_project("Rubicon Grid Search Example")

pipeline = RubiconPipeline(
    project,
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier()),
    ],
    user_defined_loggers={
        "vect": FilterLogger(ignore=["dtype"]),
        "tfidf": FilterLogger(ignore_all=True),
        "clf": FilterLogger(select=["max_iter", "alpha", "penalty"]),
    },
)
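
The three `FilterLogger` options used above (`select`, `ignore`, and `ignore_all`) control which of a step's parameters get logged. The sketch below illustrates the behavior their names suggest on a plain dictionary. It is not `rubicon`'s implementation: `filter_params` is a hypothetical function, the sample parameter values are made up, and the precedence between the options is an assumption.

```python
def filter_params(params, select=None, ignore=None, ignore_all=False):
    """Illustrative stand-in for parameter filtering (not rubicon's code)."""
    if ignore_all:
        # Log nothing for this step.
        return {}
    if select is not None:
        # Keep only the explicitly selected parameters.
        return {k: v for k, v in params.items() if k in select}
    # Otherwise keep everything except the ignored parameters.
    ignored = set(ignore or [])
    return {k: v for k, v in params.items() if k not in ignored}


# Made-up subset of classifier parameters.
clf_params = {"alpha": 1e-05, "max_iter": 20, "penalty": "l2", "tol": 0.001}

selected = filter_params(clf_params, select=["max_iter", "alpha", "penalty"])
without_tol = filter_params(clf_params, ignore=["tol"])
nothing = filter_params(clf_params, ignore_all=True)

print(selected)
print(without_tol)
print(nothing)
```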

In [None]:
parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (20,),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

grid_search = GridSearchCV(pipeline, parameters, cv=2, n_jobs=-1, verbose=1)

print("Performing grid search...")
print("RubiconPipeline:", [name for name, *_ in pipeline.steps])

grid_search.fit(data.data, data.target, tags=["gridsearch", "param set 1"])
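
To get a rough sense of how much work this grid implies, the parameter combinations can be enumerated by hand. The sketch below expands the same grid with `itertools.product`, which mirrors how a grid search forms its candidates; the variable names are illustrative. How many experiments are logged per candidate depends on how `RubiconPipeline` behaves inside cross-validation, so the sketch counts only candidates and fits.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (20,),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

names = sorted(parameters)
# One dict per candidate: the Cartesian product of all value lists.
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

n_candidates = len(candidates)
n_folds = 2  # matches cv=2 above
n_fits = n_candidates * n_folds  # excludes the final refit on the best candidate

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```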

Fetching just the best parameters from the `GridSearchCV` object involves digging into the object's attributes.

In [None]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

### Fetch the results from Rubicon

With `rubicon`, by contrast, we can inspect the details of every pipeline run in just a few lines of code!

In [None]:
experiments = project.experiments()

num_experiments = len(experiments)
num_parameters_each = len(experiments[0].parameters())

print(f"{num_experiments} experiments logged with {num_parameters_each} parameters each")
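
Once every run is available as an experiment, picking out the best one is a matter of comparing a logged metric. The sketch below shows one way to do that using only the `metrics()`, `name`, and `value` interface seen earlier. `metric_value` and `best_experiment` are hypothetical helpers, the metric name `"score"` is an assumption, and the `SimpleNamespace` objects are stand-ins with invented values so that the snippet runs without a populated project.

```python
from types import SimpleNamespace


def metric_value(experiment, metric_name):
    """Return the value of the named metric, or None if it was not logged."""
    for metric in experiment.metrics():
        if metric.name == metric_name:
            return metric.value
    return None


def best_experiment(experiments, metric_name):
    """Return the experiment with the highest value for the named metric."""
    scored = [(metric_value(e, metric_name), e) for e in experiments]
    scored = [(value, e) for value, e in scored if value is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


def make_fake_experiment(label, score):
    # Stand-in for a logged experiment; the values are illustrative only.
    return SimpleNamespace(
        label=label,
        metrics=lambda: [SimpleNamespace(name="score", value=score)],
    )


fake_experiments = [
    make_fake_experiment("run-a", 0.91),
    make_fake_experiment("run-b", 0.94),
    make_fake_experiment("run-c", 0.89),
]

best = best_experiment(fake_experiments, "score")
print(best.label, metric_value(best, "score"))
```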

In [None]:
from rubicon.ui import Dashboard

Dashboard(persistence="filesystem", root_dir="./rubicon-root").run_server()

## What's Next?

User-defined loggers attached to individual pipeline steps:

In [None]:
from my_lib import MyLogger, MyOtherLogger

pipeline = RubiconPipeline(
    [
        ("vect", CountVectorizer(), MyLogger),
        ("tfidf", TfidfTransformer(), MyOtherLogger),
        ("clf", SGDClassifier()),
    ],
    project,
)

Reconstructing `RubiconPipeline`s from logged experiments:

In [None]:
project = rubicon.get_or_create_project("Rubicon Grid Search Example")
experiment = project.experiments()[0]

pipeline = experiment.reconstruct_pipeline()
pipeline.score(my_data)