# Compare model evaluation metrics tracked with MLflow

This notebook shows how to download the fmeval metrics tracked with MLflow and compare them visually using [radar or spider charts](https://en.wikipedia.org/wiki/Radar_chart).

In [None]:
from dotenv import load_dotenv
from fmeval_mlflow import get_metrics_from_experiment

We set the environment variables `MLFLOW_TRACKING_URI` and `MLFLOW_TRACKING_USERNAME` from the `.env` file created in [00-Setup](./00-Setup.ipynb).
Alternatively, you can set the tracking URI using the `MLflow` SDK method:

``` python
mlflow.set_tracking_uri(tracking_server_arn)
```

In [None]:
load_dotenv()
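
If the later MLflow calls fail to connect, it can help to confirm that both variables were actually loaded. Below is a minimal sketch using only the standard library; the helper name `missing_tracking_vars` is made up for illustration and is not part of `fmeval_mlflow` or MLflow.

```python
import os

# The two variables this notebook expects to find in the environment.
REQUIRED_VARS = ("MLFLOW_TRACKING_URI", "MLFLOW_TRACKING_USERNAME")


def missing_tracking_vars(env=None):
    """Hypothetical helper: return required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_tracking_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("MLflow tracking variables are set.")
```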

We organize the metrics into two categories: those for which a larger value is better, and those for which a smaller value is better. The two categories are plotted on separate charts, which makes the comparison easier to read at a glance.

In [None]:
larger_better = ["factual_knowledge", "summarization_accuracy"]
smaller_better = ["toxicity"]
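
To make the split concrete, here is a small sketch on a toy table. It assumes a long-format DataFrame with `evaluation`, `metric`, and `value` columns, as used later in this notebook; the rows and numbers are invented for illustration.

```python
import pandas as pd

larger_better = ["factual_knowledge", "summarization_accuracy"]
smaller_better = ["toxicity"]

# Toy long-format table; the numbers are made up for illustration only.
toy = pd.DataFrame(
    {
        "evaluation": ["factual_knowledge", "summarization_accuracy", "toxicity"],
        "metric": ["factual_knowledge", "rouge", "toxicity"],
        "value": [0.6, 0.3, 0.01],
    }
)

# Each subset ends up on its own chart.
higher = toy[toy["evaluation"].isin(larger_better)]
lower = toy[toy["evaluation"].isin(smaller_better)]
print(higher["metric"].tolist())
print(lower["metric"].tolist())
```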

### Plotting function
This plotting function makes it easier to create consistent radar plots.

In [None]:
import numpy as np
import plotly.graph_objects as go


def create_trace(values, categories, name: str):
    """Build one radar trace, skipping metrics whose value is NaN."""
    mask = ~np.isnan(values)
    values = values[mask]
    categories = categories[mask]
    return go.Scatterpolar(r=values, theta=categories, fill="toself", name=name)


def create_spider_fig(
    df,
    title: str | None = None,
    fig: go.Figure | None = None,
    aggregation: str = "run_id",
):
    """Plot one radar trace per `aggregation` group, labeled by `model_id`."""
    if fig is None:
        fig = go.Figure()
    traces = df.groupby([aggregation])[["model_id", "metric", "value"]].apply(
        lambda x: create_trace(
            x["value"].values,
            x["metric"].values,
            x["model_id"].iloc[0],
        )
    )
    for trace in traces:
        fig.add_trace(trace)

    fig.update_layout(
        polar=dict(
            radialaxis=dict(
                visible=True,
                # range=[0, max(values) + max(values) * 0.1]
            )
        ),
        title=title,
    )

    return fig
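
`create_trace` filters out NaN values before building a trace, so a metric with no recorded value is left off that trace instead of producing a broken polygon. The masking step can be tried in isolation with made-up values:

```python
import numpy as np

# Invented example: the second metric has no recorded value.
values = np.array([0.8, np.nan, 0.4])
categories = np.array(["rouge", "meteor", "bertscore"])

# Same masking step as in create_trace: keep only entries with a real value.
mask = ~np.isnan(values)
print(values[mask].tolist())      # [0.8, 0.4]
print(categories[mask].tolist())  # ['rouge', 'bertscore']
```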

## Compare runs

In [None]:
experiment_name = "fmeval-mlflow-simple-runs"

### Retrieving metrics

The retrieval of the metrics from the MLflow experiment is encapsulated in the utility function `get_metrics_from_experiment()`. You can check the details of the code in [utils.py](utils.py).
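
For intuition only: `mlflow.search_runs` returns one row per run, with each metric in a column prefixed `metrics.`, while the cells below work with a long table that has `metric` and `value` columns. The sketch below shows one way such a reshaping could be done with pandas. The helper `runs_to_long` is hypothetical and is not the code of `get_metrics_from_experiment()`, which may work differently; the toy table stands in for a real runs table.

```python
import pandas as pd

PREFIX = "metrics."


def runs_to_long(runs: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: reshape wide `metrics.*` columns to long format."""
    metric_cols = [c for c in runs.columns if c.startswith(PREFIX)]
    id_cols = [c for c in runs.columns if c not in metric_cols]
    long = runs.melt(
        id_vars=id_cols, value_vars=metric_cols, var_name="metric", value_name="value"
    )
    # Strip the "metrics." prefix and drop metrics that a run never logged.
    long["metric"] = long["metric"].str[len(PREFIX):]
    return long.dropna(subset=["value"]).reset_index(drop=True)


# Toy stand-in for a runs table; all values are invented.
runs = pd.DataFrame(
    {
        "run_id": ["a", "b"],
        "metrics.rouge": [0.3, 0.4],
        "metrics.toxicity": [0.01, None],
    }
)
long = runs_to_long(runs)
print(long)
```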

In [None]:
metrics = get_metrics_from_experiment(experiment_name)
metrics.pivot_table(
    index=["evaluation", "metric"], columns=["model_id"], values="value"
)
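
The `pivot_table` call above turns the long metrics table into one column per model, which is convenient for reading values side by side. A small sketch on invented data, assuming the same column names:

```python
import pandas as pd

# Toy long-format metrics; model names and values are invented.
toy = pd.DataFrame(
    {
        "model_id": ["model-a", "model-a", "model-b", "model-b"],
        "evaluation": ["toxicity"] * 4,
        "metric": ["toxicity", "insult", "toxicity", "insult"],
        "value": [0.02, 0.01, 0.05, 0.03],
    }
)

# One row per (evaluation, metric) pair, one column per model.
wide = toy.pivot_table(
    index=["evaluation", "metric"], columns=["model_id"], values="value"
)
print(wide)
```

Note that `pivot_table` aggregates with the mean by default, so if the same model and metric appear in several runs, the table shows their average.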

### Create plots

In [None]:
fig = create_spider_fig(
    metrics[metrics["evaluation"].isin(larger_better)], aggregation="run_id"
)
fig.show()

In [None]:
fig = create_spider_fig(
    metrics[metrics["evaluation"].isin(smaller_better)], aggregation="run_id"
)
fig.show()

## Compare nested runs

In [None]:
experiment_name = "fmeval-mlflow-nested-runs"

### Retrieving metrics

The retrieval of the metrics from the MLflow experiment is encapsulated in the utility function `get_metrics_from_experiment()`. You can check the details of the code in [utils.py](utils.py).

In [None]:
metrics = get_metrics_from_experiment(experiment_name)

In [None]:
metrics.pivot_table(
    index=["evaluation", "metric"], columns=["model_id"], values="value"
)

### Create plots

In [None]:
fig = create_spider_fig(
    metrics[metrics["evaluation"].isin(larger_better)],
    aggregation="tags.mlflow.parentRunId",
)
fig.show()
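
MLflow records the parent of a nested run in the `mlflow.parentRunId` tag, which appears here as the column `tags.mlflow.parentRunId`. Grouping by that column collects the metrics of all child runs under one parent into a single trace. The toy example below illustrates the difference, assuming (this layout is an assumption, not taken from the experiment) that each parent run has one child run per evaluation:

```python
import pandas as pd

# Toy nested-run layout: each parent run has one child run per evaluation.
# Run IDs, model names, and values are invented.
toy = pd.DataFrame(
    {
        "run_id": ["c1", "c2", "c3", "c4"],
        "tags.mlflow.parentRunId": ["p1", "p1", "p2", "p2"],
        "model_id": ["model-a", "model-a", "model-b", "model-b"],
        "metric": ["factual_knowledge", "rouge", "factual_knowledge", "rouge"],
        "value": [0.6, 0.3, 0.5, 0.4],
    }
)

# Grouping by child run gives four single-metric groups;
# grouping by parent run gives two groups with both metrics each.
by_child = toy.groupby("run_id")["metric"].apply(list)
by_parent = toy.groupby("tags.mlflow.parentRunId")["metric"].apply(list)
print(len(by_child), len(by_parent))
```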

In [None]:
fig = create_spider_fig(
    metrics[metrics["evaluation"].isin(smaller_better)],
    aggregation="tags.mlflow.parentRunId",
)
fig.show()