## 1 Set up a local tracking server

- Start the local tracking server

    ```bash
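    cd /opt/mlflow-tracking-server/
    # Directories for run metadata (backend store) and logged artifacts
    mkdir -p backend artifacts
    # Serves on port 5000 by default; --host 0.0.0.0 listens on all interfaces
    mlflow server --backend-store-uri ./backend --default-artifact-root ./artifacts/ --host 0.0.0.0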
    ```


- In the project folder, link to the server's artifact directory so that the local artifact path is the same as the one the local tracking server uses
    
    ```bash
    ln -s /opt/mlflow-tracking-server/artifacts artifacts
    ```
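
The symlink matters because the server was started with a relative artifact root (`./artifacts/`), so the project folder needs an `artifacts` path that resolves to the same directory. A minimal sketch of how that could be checked from Python follows. The helper `same_artifact_root` is a hypothetical name, not part of MLflow, and the demonstration uses a temporary directory as a stand-in for `/opt/mlflow-tracking-server/artifacts` and the project folder.

```python
import os
import tempfile
from pathlib import Path


def same_artifact_root(project_path, server_root):
    """Return True if both paths resolve to the same existing directory.

    Hypothetical helper for illustration; not part of MLflow.
    """
    project = Path(project_path).resolve()
    server = Path(server_root).resolve()
    return project.is_dir() and project == server


# Self-contained demonstration in a temporary directory. The paths below are
# stand-ins for the server's artifact directory and the project folder.
with tempfile.TemporaryDirectory() as tmp:
    server_root = Path(tmp) / "server" / "artifacts"
    project_dir = Path(tmp) / "project"
    server_root.mkdir(parents=True)
    project_dir.mkdir()

    # Same effect as `ln -s <server_root> artifacts` inside the project folder
    link = project_dir / "artifacts"
    os.symlink(server_root, link, target_is_directory=True)

    linked_ok = same_artifact_root(link, server_root)
    unlinked_ok = same_artifact_root(project_dir / "missing", server_root)
    print(linked_ok, unlinked_ok)  # → True False
```

In the actual project folder the call would be `same_artifact_root("artifacts", "/opt/mlflow-tracking-server/artifacts")`.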

## 2 Use a Databricks Connect enabled environment

In [3]:
from databricks_jupyterlab.connect import dbcontext, is_remote

if is_remote():
    # Remote: create (or reuse) the Databricks Spark context and display it
    display(dbcontext())
else:
    # Local: no Spark session
    spark = None

Spark context already exists


## 3 Model development

Start with a small parameter grid locally, then run the full parameter-space search remotely.

In [4]:
%%time
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from databricks_jupyterlab.gridsearchcv import GridSearchCV

if is_remote():
    # Remote: full parameter grid (864 combinations)
    data_path = "/dbfs/data/digits/digits.csv"
    tracking_uri = None
    experiment = "/Shared/experiments/digits-spark-sklearn"
    param_grid = {
        "max_depth": [3, None],
        "max_features": [1, 3, 10],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 3, 10],
        "bootstrap": [True, False],
        "criterion": ["gini", "entropy"],
        "n_estimators": [10, 20, 40, 80]
    }
else:
    # Local: reduced parameter grid (48 combinations) for a quick check
    data_path = "~/Data/digits/digits.csv"
    tracking_uri = "http://localhost:5000"
    experiment = "digits-spark-sklearn"
    param_grid = {
        "max_depth": [3, None],
        "max_features": [1, 3],
        "min_samples_split": [2, 10],
        "min_samples_leaf": [1, 10],
        "n_estimators": [10, 20, 40]
    }

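df = pd.read_csv(data_path)
# Features are all columns except the label column "target"
X = df.loc[:, df.columns != "target"].values
y = df["target"].values

# spark is None for local runs and the Databricks Spark session for remote runs
cv = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, spark=spark)
cv.fit(X, y)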

Remote crossvalidation, paramter grid size: 864



CPU times: user 1.52 s, sys: 296 ms, total: 1.82 s
Wall time: 38.6 s
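
The grid size reported above is the product of the number of values per parameter. The sketch below recomputes it for both grids using only the standard library. `grid_size` is a hypothetical helper, and the fit count assumes that each candidate is fitted once per fold, as in scikit-learn's own `GridSearchCV`; whether the `databricks_jupyterlab` wrapper does exactly this is an assumption.

```python
from math import prod

# Parameter grids copied from the cell above
remote_grid = {
    "max_depth": [3, None],
    "max_features": [1, 3, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 3, 10],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
    "n_estimators": [10, 20, 40, 80],
}
local_grid = {
    "max_depth": [3, None],
    "max_features": [1, 3],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 10],
    "n_estimators": [10, 20, 40],
}


def grid_size(param_grid):
    """Number of parameter combinations in an exhaustive grid (hypothetical helper)."""
    return prod(len(values) for values in param_grid.values())


cv_folds = 3  # matches cv=3 in the cell above
remote_size = grid_size(remote_grid)
local_size = grid_size(local_grid)

# Assumption: one fit per candidate per fold
print(remote_size, remote_size * cv_folds)  # → 864 2592
print(local_size, local_size * cv_folds)    # → 48 144
```

The local grid is 18 times smaller than the remote one, which keeps the local run short enough for a quick check before the full search is sent to the cluster.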


### Tracking

In [None]:
cv.log_cv(tracking_uri=tracking_uri, experiment=experiment, name="digits-01")
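
How `log_cv` works internally is not shown here. As a rough illustration of what logging grid-search results to MLflow can look like, the sketch below turns a scikit-learn style `cv_results_` dictionary into one record per candidate and logs each record as a nested run. Everything in it is an assumption rather than the `databricks_jupyterlab` implementation: the helper names `cv_results_to_records` and `log_records` are hypothetical, the scores in `toy_results` are made-up placeholder values, and it is assumed that the results follow scikit-learn's `cv_results_` layout (`params`, `mean_test_score`, `std_test_score`).

```python
def cv_results_to_records(cv_results):
    """Convert a scikit-learn style cv_results_ dict into per-candidate records.

    Hypothetical helper; parameter values are stringified so that None is
    logged as "None".
    """
    records = []
    for i, params in enumerate(cv_results["params"]):
        records.append({
            "params": {key: str(value) for key, value in params.items()},
            "metrics": {
                "mean_test_score": float(cv_results["mean_test_score"][i]),
                "std_test_score": float(cv_results["std_test_score"][i]),
            },
        })
    return records


def log_records(records, tracking_uri, experiment, name):
    """Log one parent run with one nested child run per candidate (sketch)."""
    import mlflow  # imported lazily so the helper above works without MLflow

    if tracking_uri is not None:
        mlflow.set_tracking_uri(tracking_uri)  # None keeps MLflow's default
    mlflow.set_experiment(experiment)
    with mlflow.start_run(run_name=name):
        for i, record in enumerate(records):
            with mlflow.start_run(run_name=f"{name}-{i}", nested=True):
                mlflow.log_params(record["params"])
                mlflow.log_metrics(record["metrics"])


# Made-up placeholder results, only to show the data layout
toy_results = {
    "params": [
        {"max_depth": 3, "n_estimators": 10},
        {"max_depth": None, "n_estimators": 20},
    ],
    "mean_test_score": [0.5, 0.75],
    "std_test_score": [0.125, 0.25],
}
records = cv_results_to_records(toy_results)
print(len(records), records[1]["params"])
```

With the local tracking server from section 1 running, `log_records(records, "http://localhost:5000", "digits-spark-sklearn", "digits-01")` would send the records to it.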