# Scikit-learn

This example shows how to integrate `rubicon` into your scikit-learn pipelines to automatically log parameters and metrics as you fit and score a model.

## Simple pipeline run

Using `rubicon`'s `RubiconPipeline` class, you can set up an enhanced sklearn pipeline with automated logging:

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from rubicon import Rubicon
from rubicon.sklearn import RubiconPipeline

rubicon = Rubicon(persistence="memory", root_dir="root")
project = rubicon.get_or_create_project("Rubicon Pipeline Example")

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = RubiconPipeline(project, [('scaler', StandardScaler()), ('svc', SVC())])
pipe

In [None]:
pipe.fit(X_train, y_train)

In [None]:
pipe.score(X_test, y_test)

### Fetch the results from Rubicon

During the pipeline run, an experiment was automatically created and the corresponding parameters and metrics were logged to it. Afterwards, you can use the `rubicon` library to pull these experiments back, or view them by running the dashboard.

In [None]:
experiments = project.experiments()
print(f"{len(experiments)} experiment(s)")
experiment = experiments[0]
experiment

In [None]:
for param in experiment.parameters():
    print(f"{param.name}: {param.value}")

In [None]:
for metric in experiment.metrics():
    print(f"{metric.name}: {metric.value}")

## A more realistic example using GridSearch

Grid search is commonly used to try many parameter combinations across an estimator or pipeline in the hope of finding the best-performing set. A `RubiconPipeline` can be passed to sklearn's `GridSearchCV` to automatically log each set of parameters tried in the grid search as an individual experiment. All of these experiments can then be explored (and filtered or sorted) within the dashboard.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
categories = ["alt.atheism", "talk.religion.misc"]

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))

Optionally, pass the `user_defined_loggers` argument to the `RubiconPipeline` to control exactly which parameters are logged for specific estimators. For example, you can use the `FilterEstimatorLogger` class to select or ignore parameters on any estimator. Here we'll use a combination of select and ignore so that we log only the relevant params.

In [None]:
from rubicon import Rubicon
from rubicon.sklearn import FilterEstimatorLogger, RubiconPipeline


rubicon = Rubicon(persistence="filesystem", root_dir="./rubicon-root")
project = rubicon.get_or_create_project("Rubicon Grid Search Example")

pipeline = RubiconPipeline(
    project,
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier()),
    ],
    user_defined_loggers = {
        "vect": FilterEstimatorLogger(select=["input", "decode_error", "max_df", "lowercase", "ngram_range"]),
        "tfidf": FilterEstimatorLogger(ignore_all=True),
        "clf": FilterEstimatorLogger(select=["max_iter", "alpha", "penalty"]),
    },
    experiment_kwargs={"name": "RubiconPipeline experiment", "tags": ["gridsearch"], "model_name": "myModel"}
)
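The names passed to `select` above are ordinary sklearn parameter names. As a minimal sketch using only sklearn (it does not call `rubicon`, and the `wanted` mapping and `selected_params` dictionary are illustrative names, not part of either library), `get_params()` can be used to check that each name exists on its estimator and to preview the default values that would be picked up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Parameter names copied from the `select` lists in the pipeline above.
wanted = {
    "vect": (
        CountVectorizer(),
        ["input", "decode_error", "max_df", "lowercase", "ngram_range"],
    ),
    "clf": (SGDClassifier(), ["max_iter", "alpha", "penalty"]),
}

selected_params = {}
for step, (estimator, names) in wanted.items():
    available = estimator.get_params()
    # Fail early on a misspelled name instead of silently skipping it.
    missing = [name for name in names if name not in available]
    if missing:
        raise KeyError(f"{step}: unknown parameter(s) {missing}")
    selected_params[step] = {name: available[name] for name in names}

for step, params in selected_params.items():
    print(step, params)
```

Running a check like this before building the pipeline is one way to catch typos in parameter names early.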

In [None]:
parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (20,),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

grid_search = GridSearchCV(pipeline, parameters, cv=2, n_jobs=-1, refit=False, verbose=1)

print("Performing grid search...")
print("RubiconPipeline:", [name for name, *_ in pipeline.steps])

grid_search.fit(data.data, data.target, tags=["param set 1"])
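For a sense of scale, the grid above can be enumerated with sklearn's `ParameterGrid`. The sketch below re-declares the same grid so that it runs on its own; `n_candidates` and `n_fits` are illustrative variable names.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated here so the snippet is self-contained.
parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (20,),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}
cv = 2  # number of cross-validation folds passed to GridSearchCV

grid = ParameterGrid(parameters)
n_candidates = len(grid)    # distinct parameter sets: 3 * 2 * 1 * 2 * 2
n_fits = n_candidates * cv  # each parameter set is fitted once per fold

print(f"{n_candidates} parameter sets, {n_fits} fits")

# Each element of the grid is a plain dict of parameter name -> value.
first = list(grid)[0]
print(first)
```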

### Fetching results

Fetching the best parameters from the `GridSearchCV` object involves digging into the object's properties, and it doesn't paint the full picture of the rest of the experimentation:

In [None]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
# refit=False, so best_estimator_ is not set; best_params_ holds the winning values
best_parameters = grid_search.best_params_
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
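As a small, self-contained sketch of the same pattern on synthetic data (it uses a plain sklearn `Pipeline` rather than `RubiconPipeline`, and the `svc__C` grid is an arbitrary choice for illustration): when `refit=False`, the search exposes `best_params_` and `best_score_`, but `best_estimator_` is never set. The `cv_results_` dictionary holds the score of every candidate, not only the best one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__C": [0.1, 1.0, 10.0]}  # illustrative grid

search = GridSearchCV(pipe, param_grid, cv=2, refit=False)
search.fit(X, y)

print("Best score: %0.3f" % search.best_score_)
print("Best parameters:", search.best_params_)

# With refit=False no final model is trained, so this attribute is absent.
has_best_estimator = hasattr(search, "best_estimator_")
print("best_estimator_ available:", has_best_estimator)

# Every candidate, ordered by rank (1 is best).
results = search.cv_results_
ranked = sorted(
    zip(results["rank_test_score"], results["params"], results["mean_test_score"]),
    key=lambda row: row[0],
)
for rank, params, score in ranked:
    print(f"rank {rank}: {params} -> {score:0.3f}")
```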

With `rubicon`, we can view all of the experiments and easily see how they trend towards the optimal solution:

In [None]:
from rubicon.ui import Dashboard

# could also be run using the CLI
Dashboard(persistence="filesystem", root_dir="./rubicon-root").run_server()