# Using the Gordo MLflow reporter with AzureML

## Building on a cluster
When a gordo workflow is generated from a YAML config using `kubectl apply -f config.yml`, the model is built by the model builder pod. If a remote logging "reporter" was configured in the `config.yml`, then at the end of the model building step the metadata will be logged with the specified reporter. 

**Note**
When using the MLflow reporter, the cluster running the workflow must have the AzureML workspace credentials set in the environment variable `AZUREML_WORKSPACE_STR`, and `DL_SERVICE_AUTH_STR` must be set as well.

The cluster should use the workspace credentials for its own deployment stage, e.g. "production", "staging", or "testing".
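
Before starting a build that reports to AzureML, it can help to confirm that both variables are present. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of gordo.

```python
import os

# Environment variables the MLflow reporter expects, as named above.
REQUIRED_VARS = ("AZUREML_WORKSPACE_STR", "DL_SERVICE_AUTH_STR")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty.

    Hypothetical helper for illustration; it is not part of gordo.
    """
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All reporter credentials are set.")
```

The check only verifies that the variables are non-empty; it does not validate the credentials themselves.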

While reporters can be defined in the globals runtime when using the workflow generator, they must be defined by machine when building locally.
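
If a config only lists reporters globally, a local build needs them repeated on every machine. The sketch below shows one way this could be done on an already-parsed config; `add_reporters_to_machines` is a hypothetical helper, not part of gordo, and it assumes the config is a plain dict with a `machines` list laid out like the YAML in this notebook.

```python
import copy


def add_reporters_to_machines(config, reporters):
    """Return a copy of `config` where every machine lists `reporters`.

    Hypothetical helper for illustration; it is not part of gordo.
    Reporters already present on a machine are not duplicated.
    """
    config = copy.deepcopy(config)
    for machine in config.get("machines", []):
        runtime = machine.setdefault("runtime", {})
        existing = runtime.setdefault("reporters", [])
        for reporter in reporters:
            if reporter not in existing:
                existing.append(reporter)
    return config


# Made-up machine names, only to show the resulting structure.
config = {"machines": [{"name": "machine-a"}, {"name": "machine-b"}]}
updated = add_reporters_to_machines(
    config, ["gordo.reporters.mlflow.MlFlowReporter"]
)
for machine in updated["machines"]:
    print(machine["name"], machine["runtime"]["reporters"])
```

The input dict is deep-copied, so the original config is left untouched.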

In [None]:
import os

from azureml.core.workspace import Workspace
from azureml.core.authentication import InteractiveLoginAuthentication
import mlflow

from gordo.reporters.mlflow import get_mlflow_client

In [None]:
# Note that dummy tag names are used here. The data provider is
# patched for testing the notebook code, but to properly test this
# workflow the tags will need to be changed or the
# RandomDataProvider used.
config_str = """
apiVersion: equinor.com/v1
kind: Gordo
metadata:
  name: test-project
spec:
  deploy-version: 0.52.1
  config:
    machines:
      - dataset:
          tags:
            - TRA-Tag-1
            - TRA-Tag-2
          target_tag_list:
            - TRA-Tag-3
            - TRA-Tag-4
          train_end_date: '2019-03-01T00:00:00+00:00'
          train_start_date: '2019-01-01T00:00:00+00:00'
          data_provider:            
            interactive: True
        metadata:
          information: 'Use RandomForestRegressor to predict separate set of tags.'
        model:
          gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector:
            base_estimator:
              sklearn.compose.TransformedTargetRegressor:
                transformer: sklearn.preprocessing.data.MinMaxScaler
                regressor:
                  sklearn.pipeline.Pipeline:
                    steps:
                      - sklearn.decomposition.pca.PCA
                      - sklearn.multioutput.MultiOutputRegressor:
                          estimator:
                            sklearn.ensemble.forest.RandomForestRegressor:
                              n_estimators: 35
                              max_depth: 10
        name: supervised-random-forest-anomaly
        # During local building, reporters must be defined by machine
        runtime:
          reporters:
            - gordo.reporters.mlflow.MlFlowReporter
globals:
  runtime:
    builder:
      # Remote logging is deactivated by default when nothing is set.
      remote_logging:
        enable: False
    """


## Building locally

To build machines locally but log remotely, configure `AZUREML_WORKSPACE_STR` and `DL_SERVICE_AUTH_STR` as described above, then pass the config, including its reporter configuration, to the `gordo.builder.local_build.local_build` function.

In [None]:
from gordo.builder.local_build import local_build
import os

# This downloads 1yr of data from the datalake,
# so it will of course take some time
model, machine = next(local_build(config_str))

In [None]:
# During a deployment, the CLI build method calls the reporters.
# In a local build, we'll do that manually
machine.report()

# Reviewing results

## AzureML Frontend

The AzureML frontend can be helpful for quickly checking that your results are populating correctly, for example during a gordo deployment. [Portal Link](https://ml.azure.com/?wsid=/subscriptions/019958ea-fe2c-4e14-bbd9-0d2db8ed7cfc/resourcegroups/gordo-ml-workspace-poc-rg/workspaces/gordo-ml-workspace-poc-ml)

## Querying with MlflowClient


The requirements for using MLflow with AzureML are installed with gordo, so you can use the client directly from your gordo `virtualenv`.

The following are some general examples. Further documentation on the client is available [here](https://www.mlflow.org/docs/latest/tracking.html#querying-runs-programmatically), and the API documentation is [here](https://www.mlflow.org/docs/latest/python_api/mlflow.tracking.html).



In [None]:
# If you want to configure the client to query results on AzureML,
# define the connection arguments in a kwargs dict.
workspace_kwargs = {
    "subscription_id": "<value>",
    "resource_group": "<value>",
    "workspace_name": "<value>",
    "auth": InteractiveLoginAuthentication(force=True),
}

# To login automatically, provide the service principal 
# arguments in a kwargs dict
service_principal_kwargs = {                 
    "tenant_id": "<value>",
    "service_principal_id": "<value>",
    "service_principal_password": "<value>"
    }

# For the case of this example, we'll just run things locally, so we'll
# just pass empty dicts, which is the default when no arguments are passed.
workspace_kwargs = {}
service_principal_kwargs = {}
client = get_mlflow_client(workspace_kwargs, service_principal_kwargs)

### Experiments
Each build of a machine corresponds to a new run in an experiment named after that machine. With each subsequent deployment, a new run is added under each built machine's name.

In [None]:
# Get all experiments (can take a bit)
experiments = client.list_experiments()

# We've only built one machine, but list_experiments() still returns a list
for exp in experiments:
    print(exp.name)

In [None]:
# Get a single experiment by name
exp = client.get_experiment_by_name("supervised-random-forest-anomaly")
print(exp)

In [None]:
# Find experiments matching some pattern
experiment_ids = [e.experiment_id for e in experiments if e.name.startswith("super")]
exp_id = experiment_ids[0]
print(exp_id)
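
For patterns more flexible than a prefix, shell-style wildcards from the standard library can be applied to experiment names. This is a minimal sketch: `experiments_matching` is a hypothetical helper, and the objects below are stand-ins that only mimic the `name` and `experiment_id` attributes used above.

```python
from fnmatch import fnmatch
from types import SimpleNamespace


def experiments_matching(experiments, pattern):
    """Return experiments whose `name` matches a shell-style `pattern`."""
    return [e for e in experiments if fnmatch(e.name, pattern)]


# Stand-in objects so the sketch runs without a tracking server; with a
# real client, pass the list of experiments retrieved from it instead.
fake_experiments = [
    SimpleNamespace(experiment_id="1", name="supervised-random-forest-anomaly"),
    SimpleNamespace(experiment_id="2", name="unsupervised-autoencoder"),
]
matches = experiments_matching(fake_experiments, "super*")
print([e.experiment_id for e in matches])
```

A pattern such as `"*anomaly"` would select experiments by suffix in the same way.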

### Runs
Runs can be searched with some [built-in arguments](https://www.mlflow.org/docs/latest/python_api/mlflow.tracking.html#mlflow.tracking.MlflowClient.search_runs), or with basic SQL-like filter expressions passed to the `filter_string` argument.

In [None]:
# Order the results by a metric
runs = client.search_runs(experiment_ids=experiment_ids, max_results=50, order_by=["metrics.r_2"])

print("Number of runs:", len(runs))
print("Example:", runs[0])
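
Once runs have been retrieved, it is often convenient to flatten them into plain dicts for comparison. The sketch below assumes each run exposes `info.run_id` and a `data.metrics` dict, as the run objects above do; the helper names and the metric values are made up for illustration.

```python
from types import SimpleNamespace


def runs_to_rows(runs):
    """Flatten runs into dicts holding the run_id plus all logged metrics."""
    return [{"run_id": run.info.run_id, **run.data.metrics} for run in runs]


def best_run(rows, metric, higher_is_better=True):
    """Pick the row with the best `metric`, skipping rows that lack it."""
    candidates = [row for row in rows if metric in row]
    if not candidates:
        return None
    pick = max if higher_is_better else min
    return pick(candidates, key=lambda row: row[metric])


def fake_run(run_id, metrics):
    """Build a stand-in with the same attribute layout as a run object."""
    return SimpleNamespace(
        info=SimpleNamespace(run_id=run_id),
        data=SimpleNamespace(metrics=metrics),
    )


# Made-up runs and values; replace with the `runs` list from a real search.
fake_runs = [
    fake_run("a", {"r2-score": 0.71}),
    fake_run("b", {"r2-score": 0.84}),
    fake_run("c", {}),
]
rows = runs_to_rows(fake_runs)
best = best_run(rows, "r2-score")
print(best["run_id"], best["r2-score"])
```

Sorting client-side like this is only practical for small result sets; for larger ones, the `order_by` argument keeps the work on the server.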

In [None]:
# Using an SQL filter string

# First we can get a single run and look at what metrics are logged in gordo
runs = client.search_runs(experiment_ids=experiment_ids, max_results=1)
runs[0].data.metrics.keys()

# We can then search for runs matching a certain R2 score range
# Note that the identifier must be enclosed in backticks or double quotes
runs = client.search_runs(experiment_ids=experiment_ids, filter_string='metrics.`r2-score` < 8', max_results=10)

print("Number of runs:", len(runs))
print("Example:", runs[0])
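
Because metric names containing characters such as `-` must be quoted, building filter strings by hand is error-prone. The following sketch wraps the quoting in small helpers; `metric_filter` and `combine` are hypothetical names, not part of MLflow or gordo, and the thresholds are arbitrary.

```python
def metric_filter(name, operator, value):
    """Build a filter clause for a metric, quoting its name in backticks."""
    allowed = {"=", "!=", ">", ">=", "<", "<="}
    if operator not in allowed:
        raise ValueError(f"Unsupported operator: {operator!r}")
    if "`" in name:
        raise ValueError("Metric names containing backticks cannot be quoted")
    return f"metrics.`{name}` {operator} {value}"


def combine(*clauses):
    """Join clauses with AND to form a single filter string."""
    return " AND ".join(clauses)


filter_string = combine(
    metric_filter("r2-score", ">", 0.5),
    metric_filter("r2-score", "<", 8),
)
print(filter_string)
```

The resulting string can be passed as the `filter_string` argument of `search_runs`.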

There are some handy tools in the `azureml-sdk` as well. For example, you can bring up a widget displaying information about a run, and get metrics as iterables.

In [None]:
# We'll put this in an if statement, so the rest of this
# notebook can be tested
if False:
    from azureml.widgets import RunDetails
    from azureml.core.experiment import Experiment
    from azureml.core.run import Run

    # `ws` must be an azureml Workspace; this requires real values
    # in workspace_kwargs rather than the empty dict used above.
    ws = Workspace(**workspace_kwargs)
    experiment = Experiment(ws, experiments[-1].name)
    azure_run = next(experiment.get_runs())
    RunDetails(azure_run).show()

In [None]:
if False:
    import matplotlib.pyplot as plt
    # Or do some things yourself
    metrics = azure_run.get_metrics()
    print(metrics.keys())
    plt.plot(range(len(metrics["accuracy"])), metrics["accuracy"])
    plt.show()
    print(azure_run.properties)
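
When working with the metrics dict directly, a metric may hold either a single value or a sequence of values, depending on how often it was logged. The sketch below normalizes both cases to a list; this shape is an assumption about the returned dict, `metric_series` is a hypothetical helper, and the values are made up.

```python
def metric_series(metrics, name):
    """Return the values logged for `name` as a list.

    Assumes `metrics` maps each name to either a single value or a
    list/tuple of values; a single value is wrapped in a list.
    """
    value = metrics[name]
    return list(value) if isinstance(value, (list, tuple)) else [value]


# Made-up values standing in for the metrics dict of a real run.
example_metrics = {"accuracy": [0.61, 0.74, 0.80], "r2-score": 0.84}
for name in example_metrics:
    series = metric_series(example_metrics, name)
    print(name, "points:", len(series), "last:", series[-1])
```

With every metric in list form, the same plotting code works whether a metric was logged once or many times.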

### Artifacts
Artifacts are files, such as JSON, images, pickled models, etc. The following examples show how to explicitly upload and download them on AzureML for a given `run_id`.

In [None]:
import os
import uuid
import json
import shutil

run_id = client.list_run_infos(exp.experiment_id)[-1].run_id
art_id = uuid.uuid4().hex

# Upload artifacts
local_path = os.path.abspath(f"./{exp.name}_{run_id}/")
if os.path.isdir(local_path):
    shutil.rmtree(local_path)
os.makedirs(local_path, exist_ok=True)

# Use a context manager so the file is flushed and closed before upload
with open(os.path.join(local_path, f"{art_id}.json"), "w") as f:
    json.dump({"a": 42.0, "b": "text"}, f)

client.log_artifacts(run_id, local_path)

In [None]:
# Get artifacts for a given Run
artifacts = client.list_artifacts(run_id)

# Make a new path to save these to
new_local_path = os.path.join(local_path, "downloaded")
os.makedirs(new_local_path, exist_ok=True)

# Iterate over Run's artifacts and save them
for f in artifacts:
    client.download_artifacts(run_id=run_id, path=f.path, dst_path=new_local_path)
    print("Downloaded:", f)
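
After downloading, JSON artifacts can be read back with the standard library. The following is a minimal sketch: `load_json_artifacts` is a hypothetical helper, and a temporary directory stands in for the download folder so the example runs on its own.

```python
import json
import os
import tempfile


def load_json_artifacts(directory):
    """Return {relative_path: parsed_content} for every .json file found."""
    loaded = {}
    for root, _dirs, files in os.walk(directory):
        for filename in sorted(files):
            if filename.endswith(".json"):
                full_path = os.path.join(root, filename)
                with open(full_path) as handle:
                    key = os.path.relpath(full_path, directory)
                    loaded[key] = json.load(handle)
    return loaded


# A temporary directory stands in for the folder the artifacts were
# downloaded to; point the helper at that folder in real use.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "example.json"), "w") as handle:
        json.dump({"a": 42.0, "b": "text"}, handle)
    contents = load_json_artifacts(tmp)

print(contents)
```

Files that are not JSON, such as images or pickled models, are ignored by this helper.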