# Rubicon Integration in AVAL pipeline

## Stage 1 - Callback reference
- Inside each component (`RFE` and `Bayesian Search`), import the respective callback classes from the `mltrain.callbacks.rubicon_ml` package.
- Because the callback classes currently live inside `mltrain`, we need to instantiate a `Rubicon` object ourselves right before calling the `fit` function.
- From the `Rubicon` object, create the project (or retrieve it if it already exists) and assign it to the class-level attribute on the callback class: `RubiconCallbackClassName.project`.
- Repeat this for any `experiment` tags that need to be attached to the current run of the `Bayesian Search` or the `RFE`.
- Note: This implementation may not be final; it is a workaround based on the way `mltrain` callbacks are structured.
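
The class-attribute wiring described above can be sketched without `mltrain` installed. `DemoCallback` below is a hypothetical stand-in, not one of the real `mltrain` callback classes, and the way it consumes `project` and `experiment_tags` is an assumption. Only the pattern of setting attributes on the class before `fit` is taken from these notes.

```python
import secrets


class DemoCallback:
    """Hypothetical stand-in for an mltrain Rubicon callback class."""

    # Class-level attributes: set once on the class, visible to every instance.
    project = None
    experiment_tags = []

    def __call__(self, trial_number):
        # A real callback would log an experiment to `self.project` here;
        # this stand-in only reports what it would use.
        return {
            "project": self.project,
            "tags": list(self.experiment_tags),
            "trial": trial_number,
        }


# Configure the class (not an instance) right before calling `fit`.
run_id = secrets.token_hex(15)  # 15 random bytes rendered as 30 hex characters
DemoCallback.project = "Pipeline-Run-Persistance"  # placeholder for a Rubicon project
DemoCallback.experiment_tags = [run_id]

# Any instance created later, e.g. inside the search object, sees the same values.
record = DemoCallback()(trial_number=0)
print(record["tags"] == [run_id])
```

Because the values live on the class, the component does not need access to the callback instance that the search object creates internally, which is why the workaround fits the current `mltrain` structure.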

In [None]:
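import secrets

from rubicon_ml import Rubicon
from mltrain.callbacks.rubicon_ml.optuna import RubiconOptunaCallback
from mltrain.callbacks.rubicon_ml.RFE import RubiconRFECallback

# Create (or reuse) the Rubicon project that will hold this pipeline run.
# NOTE: filesystem persistence normally also expects a `root_dir` argument;
# the location is not specified in these notes.
rubicon = Rubicon(persistence="filesystem")
project = rubicon.get_or_create_project("Pipeline-Run-Persistance")

# Attach the project and a per-run tag to the callback class before `fit`.
RubiconOptunaCallback.project = project
identifier = secrets.token_hex(15)  # random 30-character hex run identifier
RubiconOptunaCallback.experiment_tags = [identifier]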


## Stage 2 - Rubicon Project and Tags
- Inside the component function, when creating the `RFE` or the `mlOptunaCV` object, pass the imported callback class through the `callbacks` parameter so that it is invoked during `fit`. The example uses the `Optuna` object.

In [None]:

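from optuna.samplers import TPESampler

# `mlOptunaCV`, `BGSCallback`, `scorer_mapper`, `model_class`, `config`,
# `train_idx` and `test_idx` are provided by the surrounding component code.
bgs_config = config["grid_search"]["bayesian_grid_search"]
bgs = mlOptunaCV(
    estimator=model_class,
    param_space=bgs_config["param_space"],
    sampler=TPESampler(),
    n_trials=bgs_config["n_trials"],
    scoring=scorer_mapper(config["scorer"]),
    callbacks=[BGSCallback, RubiconOptunaCallback],
    cv=[(train_idx, test_idx)],
)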


## Stage 3 - Local Data Retrieval in Components
- At this stage, the Rubicon project has been populated with the `experiments` from the current run.
- We can retrieve the project once the `fit` function has finished.
- To verify this, list `RubiconCallbackClassName.project.experiments()` or launch the `Dashboard`.


In [None]:
from rubicon_ml.viz.dashboard import Dashboard

# Experiments live on the project, not on the callback class itself.
experiments = RubiconOptunaCallback.project.experiments()
default_dashboard = Dashboard(experiments=experiments)
default_dashboard.show()

## Stage 4 - Retrieval in other components
- This stage matters because retrieving tracked runs in later components is one of the main reasons for integrating the `Rubicon` workflow. There are two potential options for retrieving projects in different components.
- The first option is to pass the `Rubicon` project object itself between components. The object is an entry point to the project and to all experiments tracked within it. This stays local: the project is handed from one component to the next so that each stage can add experiments to it.
- The second option is to create a new `Rubicon` object in each component and rely on an intermediary file system that is set up by the first component that creates projects, most likely `Data Prep` or something similar. Any component can then access that file system to retrieve `Rubicon` projects and their experiments.

In [None]:
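# Method 1
# Inside Bayesian Search Stage

# For the purposes of this notebook, return_tuple stands in for the `return` statement.
return_tuple = (grid_search_result(config, metadata), RubiconOptunaCallback.project)

# Inside Model Fit Stage
# In a separate component, `project` arrives as the second element of the tuple above.
project = RubiconOptunaCallback.project
default_dashboard = Dashboard(experiments=project.experiments())
default_dashboard.show()

# Method 2
# Inside Bayesian Search Stage

rubicon = Rubicon(persistence="filesystem")
project = rubicon.get_or_create_project("Pipeline-Run-Persistance")
# `log_experiment` takes experiment fields rather than an `Experiment` object.
# This copies only the name and tags; if the callback's project already uses
# this file system, its experiments are persisted and the loop can be dropped.
for experiment in RubiconOptunaCallback.project.experiments():
    project.log_experiment(name=experiment.name, tags=experiment.tags)

# Inside Model Fit Stage
# A new `Rubicon` object is created in each component.
rubicon = Rubicon(persistence="filesystem")
project = rubicon.get_or_create_project("Pipeline-Run-Persistance")
default_dashboard = Dashboard(experiments=project.experiments())
default_dashboard.show()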

    

## Stage 5 - Data persistence and next steps
- This stage has the most nuance. We first need to decide what kind of persistence the `Rubicon` project and its experiments should have.
- The two most common types currently used with `Rubicon` are filesystem persistence and `S3` persistence. Each requires a slightly different approach.
- The filesystem approach needs a way to create a file system on the selected `PVC` to mount.
- We can do this by leveraging an `initContainer` in the podspec that uses `mkdir -p` to create the directory where `Rubicon` projects will live.
- The other way to persist data is via `S3`. This is simpler in one sense, since we can potentially connect to the buckets directly through `Rubicon`'s S3 integration within the component itself. There is room for adjustment, however, because the `Rubicon` data is also used in other components: sending data to `S3` from one component only to receive it again in the next would be inefficient.
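
The `initContainer` idea for the filesystem approach can be sketched as a plain Python dictionary that mirrors the Kubernetes pod spec fields. The volume name, claim name, mount path, and image below are hypothetical placeholders. How this fragment gets attached to a pipeline component depends on the orchestration tooling, which these notes do not specify.

```python
import json

# All names below are hypothetical placeholders.
VOLUME_NAME = "rubicon-store"
CLAIM_NAME = "rubicon-pvc"
MOUNT_PATH = "/mnt/rubicon"
RUBICON_ROOT = f"{MOUNT_PATH}/rubicon-root"

volume_mount = {"name": VOLUME_NAME, "mountPath": MOUNT_PATH}

pod_spec_fragment = {
    # The PVC that every component mounts.
    "volumes": [
        {"name": VOLUME_NAME, "persistentVolumeClaim": {"claimName": CLAIM_NAME}}
    ],
    # Runs before the component container starts.
    "initContainers": [
        {
            "name": "init-rubicon-root",
            "image": "busybox",
            # Create the directory where Rubicon projects will live.
            "command": ["sh", "-c", f"mkdir -p {RUBICON_ROOT}"],
            "volumeMounts": [volume_mount],
        }
    ],
}

print(json.dumps(pod_spec_fragment, indent=2))
```

A component that mounts the same volume could then point `Rubicon` at `RUBICON_ROOT` as its root directory, so that every stage reads and writes the same project files.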

In [None]:

# Method 1
# The PVC-mount method needs more research into the Kubernetes API.

# Method 2
from rubicon_ml import publish

# `experiment_tags` is already a list, so pass it directly rather than nesting it.
catalog = publish(
    project.experiments(tags=RubiconOptunaCallback.experiment_tags),
    output_filepath="./pipeline_persisence_catalog.yml",
)

# From here we can push the catalog file to S3 using standard methods.
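
One possible form of the final upload step is sketched below, assuming a boto3-style S3 client whose `upload_file` method takes a local filename, a bucket, and an object key. The bucket name and key are hypothetical, and `RecordingClient` is a stand-in used only to dry-run the function without credentials.

```python
def push_catalog(s3_client, catalog_path, bucket, key):
    """Upload the published catalog file with a boto3-style S3 client."""
    # boto3's S3 client exposes upload_file(Filename, Bucket, Key).
    s3_client.upload_file(catalog_path, bucket, key)
    return f"s3://{bucket}/{key}"


class RecordingClient:
    """Stand-in client for a dry run; records calls instead of uploading."""

    def __init__(self):
        self.calls = []

    def upload_file(self, filename, bucket, key):
        self.calls.append((filename, bucket, key))


# Dry run with the stand-in. In a real component, pass boto3.client("s3") instead.
client = RecordingClient()
uri = push_catalog(
    client,
    "./pipeline_persisence_catalog.yml",
    bucket="example-bucket",  # hypothetical bucket name
    key="rubicon/pipeline_persisence_catalog.yml",  # hypothetical object key
)
print(uri)
```

Passing the client in as an argument keeps the component code testable and leaves credential handling to whatever mechanism the cluster already uses.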