# Logging Your Own `rubicon-ml` Project

`rubicon-ml` plays nicely with several popular projects in the PyData ecosystem. If you're already
using Scikit-learn, logging with `rubicon-ml` is simple! The code in the cell below is a copy of
[this Scikit-learn example](https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py)
from the Scikit-learn documentation. You'll create a `rubicon-ml` project, change one line in the cell below,
and you'll be logging!

You don't need to run this cell before adding in `rubicon-ml`. Jump down to the next one to see how!

In [None]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Mathieu Blondel <mathieu@mblondel.org>
# License: BSD 3 clause
from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

print(__doc__)

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')


# #############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()

# #############################################################################
# Define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(data.data, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

First, create a project to store your logged model runs. You'll also import
`RubiconPipeline`, which will replace the standard Scikit-learn `Pipeline` in the
example above.

In [None]:
import os

from rubicon import Rubicon
from rubicon.sklearn import RubiconPipeline


rubicon = Rubicon(persistence="filesystem", root_dir=f"{os.getcwd()}/rubicon-root")
project = rubicon.get_or_create_project(name="Binder Pipeline")

Then you just need to swap the Scikit-learn `Pipeline` with the `RubiconPipeline` and pass
in the project you just created as the first parameter. After editing the first cell, your
pipeline definition should look like this:


```python
pipeline = RubiconPipeline(project, [
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
```

When everything looks good, give the first cell a run to log some data! Once it's finished,
you can use the dashboard to explore all the model runs in the **Binder Pipeline** project.
This time we'll run the dashboard in a new window within JupyterLab by instantiating the
dashboard with `mode="jupyterlab"`.

You'll notice the output from the original example states that 120 models were fit: one for
each of the 24 parameter sets across 5 folds of the data. For each of those 120 fits, you'll
see an experiment logged by `rubicon-ml` on the dashboard. You'll also see a 121st experiment
without a score, which was generated when the grid search's best estimator was refit at the
end of the search.
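
The arithmetic behind those counts can be checked with a short sketch. It rebuilds the
active (uncommented) entries of the `parameters` dict from the first cell and assumes
`GridSearchCV` is using its default 5-fold cross-validation, since the example doesn't pass
`cv`. The names `n_folds`, `parameter_sets`, and `n_fits` are made up for illustration and
aren't part of the original example or of `rubicon-ml`.

```python
from itertools import product

# Same grid as the `parameters` dict in the first cell (active entries only).
parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (20,),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

# Assumption: GridSearchCV runs its default 5-fold cross-validation,
# because the example does not pass `cv`.
n_folds = 5

# Picking one value per parameter gives one candidate parameter set.
parameter_sets = [
    dict(zip(parameters, values)) for values in product(*parameters.values())
]

n_parameter_sets = len(parameter_sets)
n_fits = n_parameter_sets * n_folds

print(f"{n_parameter_sets} parameter sets x {n_folds} folds = {n_fits} fits")
print(f"{n_fits + 1} experiments expected, including the final refit")
```

Uncommenting more entries in `parameters` multiplies the number of parameter sets, so the
number of logged experiments grows just as quickly as the number of fits.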

These experiments contain a lot of data, as `rubicon-ml`'s
Scikit-learn logging is very verbose by default. To learn more about `rubicon-ml`'s
Scikit-learn integration and filtering logged estimator parameters, check out the
`integration-sklearn` example in `notebooks/integrations`.

In [None]:
from rubicon.ui import Dashboard


Dashboard(
    persistence="filesystem",
    root_dir=f"{os.getcwd()}/rubicon-root",
    mode="jupyterlab",
).run_server(debug=True)