# Experiment Analysis: Regression

In this notebook we'll look at the results of the CASH controller
trained on a set of regression datasets.

The results we'll look at are from the following FloydHub jobs:
[197](https://www.floydhub.com/nielsbantilan/projects/deep-cash/197),
[200](https://www.floydhub.com/nielsbantilan/projects/deep-cash/200)

In [51]:
%load_ext autoreload
%autoreload 2



In [53]:
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import seaborn as sns

from plotly import tools
from plotly.offline import iplot, init_notebook_mode

init_notebook_mode(connected=False)

sns.set_style("whitegrid")

%matplotlib inline

In [92]:
JOB_ENTROPY_COEF_MAP = {
    197: 0.2,
    200: 0.0,
}

JOB_NUMS = JOB_ENTROPY_COEF_MAP.keys()
results = pd.concat([
    pd.read_csv("../floyd_outputs/%d/rnn_cash_controller_experiment.csv" % job_num)
    .assign(job_number=job_num)
    .assign(
        entropy_coef=lambda df: df.job_number.map(JOB_ENTROPY_COEF_MAP),
        job_trial_id=lambda df: df.job_number.astype(str).str.cat(
            df.trial_number.astype(str), sep="-"))
    for job_num in JOB_NUMS
])
results.head()

,episode,data_env_names,losses,aggregate_gradients,mean_rewards,mean_validation_scores,std_validation_scores,n_successful_mlfs,n_unique_mlfs,n_unique_hyperparams,mlf_diversity,hyperparam_diversity,best_validation_scores,trial_number,job_number,entropy_coef,job_trial_id
0,1,cleveland,10.910141,-0.202859,0.743546,0.874776,0.8300808,12,16,16,1.0,1.0,2.708622e-08,0,197,0.2,197-0
1,2,diabetes,-6.297242,0.080175,0.004811,5148.254,538.8327,4,16,16,1.0,1.0,4569.841,0,197,0.2,197-0
2,3,detroit,-2.664061,-0.105272,0.072067,538972.0,1415013.0,8,15,16,0.933333,1.0,2.902976e-07,0,197,0.2,197-0
3,4,pol,-2.216673,0.579007,0.11785,908238800000.0,1573116000000.0,4,16,16,1.0,1.0,5.293618e-07,0,197,0.2,197-0
4,5,vineyard,5.9798,-0.253207,0.522985,21799340000.0,65398020000.0,10,15,16,0.933333,1.0,1.950208e-12,0,197,0.2,197-0


Compute the exponentially-weighted mean of the metrics for each job.

In [93]:
METRICS = [
    "losses",
    "aggregate_gradients",
    "best_validation_scores",
    "mean_rewards",
    "mean_validation_scores",
    "n_successful_mlfs",
    "mlf_diversity",
    "hyperparam_diversity",
]

mean_results = (
    results
    .set_index(["episode", "job_number"])
    # group_keys=False stops pandas >= 2.0 from prepending job_number to the index
    .groupby("job_number", group_keys=False)
    .apply(lambda df: df[METRICS].ewm(alpha=0.05).mean())
    .reset_index()
)

mean_results.head()

,episode,job_number,losses,aggregate_gradients,best_validation_scores,mean_rewards,mean_validation_scores,n_successful_mlfs,mlf_diversity,hyperparam_diversity
0,1,197,10.910141,-0.202859,2.708622e-08,0.743546,0.874776,12.0,1.0,1.0
1,2,197,2.085842,-0.057713,2343.508,0.364708,2640.557,7.897436,1.0,1.0
2,3,197,0.42067,-0.074386,1521.945,0.262117,190662.1,7.933392,0.976629,1.0
3,4,197,-0.290228,0.101737,1111.704,0.223229,244816700000.0,6.873143,0.982928,1.0
4,5,197,1.095603,0.023285,865.9896,0.289483,195524300000.0,7.564255,0.971967,1.0
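As a check on what `ewm(alpha=0.05).mean()` computes: with pandas' default `adjust=True`, the smoothed value at step t is a weighted average of every observation so far, where the observation k steps back gets weight (1 - alpha)^k. The sketch below reimplements that recursion in plain Python; `ewm_mean` is a hypothetical helper (it does no NaN handling) and is not part of the notebook. Applied to the first three `losses` of job 197 it reproduces the `losses` column of `mean_results.head()` above (10.910141, 2.085842, 0.42067).

```python
def ewm_mean(values, alpha):
    """Exponentially-weighted mean with bias adjustment (like pandas' adjust=True).

    Hypothetical helper for illustration; the notebook itself uses
    DataFrame.ewm(alpha=alpha).mean(). NaN values are not handled.
    """
    decay = 1.0 - alpha
    weighted_sum = 0.0  # running sum of (1 - alpha)^k * x_{t-k}
    weight_total = 0.0  # running sum of (1 - alpha)^k
    smoothed = []
    for x in values:
        # decay every older observation by one step, then add the new one with weight 1
        weighted_sum = decay * weighted_sum + x
        weight_total = decay * weight_total + 1.0
        smoothed.append(weighted_sum / weight_total)
    return smoothed


# first three `losses` of job 197, copied from results.head() above
first_losses = [10.910141, -6.297242, -2.664061]
print([round(v, 6) for v in ewm_mean(first_losses, alpha=0.05)])
```

Because the first weight total is 1, the first smoothed value always equals the first observation, which is why episode 1 is identical in `results` and `mean_results`.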


In [94]:
import colorlover
import math

JOB_NUMBERS = list(results.job_number.unique())

PALETTE = colorlover.scales["4"]["qual"]["Paired"]

METRIC_PALETTE_MAP = {
    m: PALETTE[-len(JOB_NUMBERS):]
    for m in METRICS
}


def subplot_coords(iterable, ncols, one_indexed=True):
    n = len(iterable)
    # float division + int() gives an integer row count on Python 2 and 3
    nrows = int(math.ceil(n / float(ncols)))
    offset = 1 if one_indexed else 0
    return {
        "nrows": nrows,
        "ncols": ncols,
        "coords":  [(i + offset, j + offset)
                    for i in range(nrows)
                    for j in range(ncols)]
    }


def create_time_series(x, y, group_name, showlegend, group_colormap=None):
    line_dict = dict(width=1)
    if group_colormap is not None:
        line_dict.update(dict(color=group_colormap[group_name]))
    return go.Scatter(
        x=x,
        y=y,
        name=group_name,
        legendgroup=group_name,
        mode='lines',
        line=line_dict,
        showlegend=showlegend,
        opacity=0.7,
    )


def create_multi_time_series(
        results, group_column, metric, legend_metric="mlf_diversity"):
    groups = results[group_column].unique()
    cm = {g: PALETTE[i] for i, g in enumerate(groups)}
    showlegend = True if metric == legend_metric else False
    return (
        results
        .groupby(group_column)
        .apply(lambda df: create_time_series(
            df["episode"],
            df[metric],
            df[group_column].iloc[0],
            showlegend,
            cm
        ))
        .tolist())

coords = subplot_coords(METRICS, 2)
fig = tools.make_subplots(
    rows=coords["nrows"],
    cols=coords["ncols"],
    subplot_titles=METRICS,
    vertical_spacing=0.1,
    print_grid=False)

for i, metric in enumerate(METRICS):
    traces = create_multi_time_series(
        mean_results.astype({"job_number": str}),
        "job_number",
        metric,
    )
    row_i = coords["coords"][i][0]
    col_i = coords["coords"][i][1]
    for trace in traces:
        fig.append_trace(trace, row_i, col_i)
    # add x-axis titles on the bottom ncols plots
    if i >= (coords["ncols"] * coords["nrows"] - coords["ncols"]):
        xax = "xaxis%s" % ("" if i == 0 else i + 1)
        fig.layout[xax].update({"title": "episode"})

fig.layout.update({
    "height": 800,
})

## Plotting the Exponential Moving Average of Model Fit Metrics

To assess the controllers trained under different `entropy_coef` settings,
we look at the following model fit metrics:

- `losses` and `aggregate_gradients` provide a general sense of the shape of
  the objective function during the course of training. Roughly speaking,
  a negative loss indicates that the controller's observed reward is worse
  than its expected reward, and a positive loss indicates the converse.
- `best_validation_scores`, `mean_validation_scores`, and `mean_rewards`
  are different ways of looking at the validation performance of the
  controller's proposed ML frameworks.
- `n_successful_mlfs` is a "nice-to-know" metric that counts the successful
  ML frameworks per episode (i.e., those that did not raise an error during
  MLF fitting or scoring). Ideally it would be normalized as a fraction of
  the iterations (MLF proposals) per episode.
- `mlf_diversity` indicates the diversity of MLFs proposed by the controller
  per episode: 1.0 indicates all proposals are unique MLFs, while 0.0 indicates
  that all proposed MLFs are the same.
- `hyperparam_diversity` is similar to `mlf_diversity` in meaning but measures
  the diversity in hyperparameter settings.
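The diversity values in the sample rows of `results.head()` are consistent with diversity = (n_unique - 1) / (n_proposals - 1) with 16 proposals per episode: 16 unique MLFs gives 1.0 and 15 unique MLFs gives 14/15, about 0.933333. This formula and the 16-proposal count are inferred from the rows shown, not taken from the `deep_cash` source, so treat both as assumptions. The sketch below uses them to sanity-check `mlf_diversity` and to apply the normalization of `n_successful_mlfs` proposed in the list above.

```python
import pandas as pd

# ASSUMPTION: each episode contains 16 MLF proposals; inferred from the sample
# rows (n_unique_hyperparams tops out at 16), not read from deep_cash.
N_PROPOSALS_PER_EPISODE = 16


def diversity(n_unique, n_proposals=N_PROPOSALS_PER_EPISODE):
    """Hypothetical diversity score: 1.0 if all proposals are unique, 0.0 if all are identical."""
    return (n_unique - 1) / (n_proposals - 1)


# a few rows copied from results.head() above
sample = pd.DataFrame({
    "data_env_names": ["cleveland", "diabetes", "detroit"],
    "n_successful_mlfs": [12, 4, 8],
    "n_unique_mlfs": [16, 16, 15],
    "mlf_diversity": [1.0, 1.0, 0.933333],
})

sample = sample.assign(
    # normalize the success count into a fraction of proposals per episode
    frac_successful_mlfs=lambda df: df.n_successful_mlfs / N_PROPOSALS_PER_EPISODE,
    # recompute the diversity score with the assumed formula
    mlf_diversity_check=lambda df: diversity(df.n_unique_mlfs),
)
print(sample)
```

The same `assign` call could be applied to the full `results` frame to plot `frac_successful_mlfs` instead of the raw count, provided the proposal count really is fixed per episode.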

In [95]:
iplot(fig)

## Notes

- The controller with `entropy_coef = 0.1` (light blue) yields the highest `mean_rewards`,
  but `mlf_diversity` and `hyperparam_diversity` indicate that it converges to a single
  ML framework (with varying hyperparameter settings) at around 200 episodes.
- Increasing `entropy_coef` produces controllers that keep exploring for the whole of
  training and never converge to a smaller set of ML frameworks.
- For the set of `algorithm_components` available to the controller as of git commit
  `761b9cf`, `entropy_coef=0.2` seems to be the "goldilocks" setting, in which
  `mean_rewards` rises to ~70% while the controller still proposes a diverse set of MLFs.
- Note, however, that the `best_validation_scores` achieved by higher-`entropy_coef`
  controllers are still fairly high, e.g. ~90% `f1_scores` for `entropy_coef=0.4`.


In [96]:
# compute the exponential moving average of metrics per job and dataset
mean_results_by_data = (
    results
    .set_index(
        ["episode", "job_number", "data_env_names"])
    .groupby(["job_number", "data_env_names"], group_keys=False)
    .apply(lambda df: df[METRICS].ewm(alpha=0.05).mean())
    .reset_index()
)

mean_results_by_data.head()

,episode,job_number,data_env_names,losses,aggregate_gradients,best_validation_scores,mean_rewards,mean_validation_scores,n_successful_mlfs,mlf_diversity,hyperparam_diversity
0,1,197,cleveland,10.910141,-0.202859,2.708622e-08,0.743546,0.874776,12.0,1.0,1.0
1,2,197,diabetes,-6.297242,0.080175,4569.841,0.004811,5148.254,4.0,1.0,1.0
2,3,197,detroit,-2.664061,-0.105272,2.902976e-07,0.072067,538972.0,8.0,0.933333,1.0
3,4,197,pol,-2.216673,0.579007,5.293618e-07,0.11785,908238800000.0,4.0,1.0,1.0
4,5,197,vineyard,5.9798,-0.253207,1.950208e-12,0.522985,21799340000.0,10.0,0.933333,1.0


In [122]:
from collections import defaultdict

COLORMAP = {
    g: PALETTE[i] for i, g in
    enumerate(mean_results_by_data.job_number.unique())}

def time_series(df, y, legend_env="cleveland"):
    # show each job's legend entry only once, on the `legend_env` panel
    line_dict = dict(width=1)
    job_number = df["job_number"].iloc[0]
    env_name = df["data_env_names"].iloc[0]
    color = COLORMAP.get(job_number)
    showlegend = env_name == legend_env
    if color is not None:
        line_dict.update(dict(color=color))
    return go.Scatter(
        x=df["episode"],
        y=df[y],
        name=str(job_number),
        legendgroup=str(job_number),
        line=line_dict,
        mode='lines',
        showlegend=showlegend,
    )

# _time_series_data maps (env_name, job_number) tuples to traces;
# time_series_data regroups it as {env_name: {job_number: trace}}.
_time_series_data = (
    mean_results_by_data
    .groupby(["data_env_names", "job_number"])
    .apply(time_series, y="mean_rewards")
    .to_dict()
)

time_series_data = defaultdict(dict)
for (data_env, job_num), trace in _time_series_data.items():
    time_series_data[data_env][job_num] = trace

coords = subplot_coords(time_series_data, 3)
fig = tools.make_subplots(
    rows=coords["nrows"],
    cols=coords["ncols"],
    subplot_titles=list(time_series_data.keys()),
    vertical_spacing=0.1,
    print_grid=False)

for i, (data_env, traces) in enumerate(time_series_data.items()):
    row_i, col_i = coords["coords"][i]
    for trace in traces.values():
        fig.append_trace(trace, row_i, col_i)
    # add x-axis titles on the bottom ncols plots
    if i >= (coords["ncols"] * coords["nrows"] - coords["ncols"]):
        xax = "xaxis%s" % ("" if i == 0 else i + 1)
        fig.layout[xax].update({"title": "episode"})

fig.layout.update({
    "height": 1000,
})

## Plotting the Exponential Moving Average of `mean_rewards` by Dataset

In [123]:
iplot(fig)

The per-dataset `mean_rewards` of the controllers trained under different
`entropy_coef` settings show the same general pattern described in the
[notes](#Notes) section above.
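To compare the controllers dataset by dataset without reading values off the plots, one option is to tabulate the last smoothed value of a metric for each dataset and job. This is a sketch assuming the `mean_results_by_data` columns shown above; `final_ewm_by_dataset` is a hypothetical helper, not part of `deep_cash`.

```python
import pandas as pd


def final_ewm_by_dataset(mean_results_by_data, metric="mean_rewards"):
    """Last smoothed value of `metric` for each (dataset, job) pair.

    Expects the columns built above: episode, job_number, data_env_names, <metric>.
    Returns a table with datasets as rows and job numbers as columns.
    """
    return (
        mean_results_by_data
        .sort_values("episode")  # make "last" mean "latest episode"
        .groupby(["data_env_names", "job_number"])[metric]
        .last()
        .unstack("job_number")
    )
```

Calling `final_ewm_by_dataset(mean_results_by_data)` would give one row per dataset and one column per job (197, 200), which makes the per-dataset ordering of the controllers easy to scan.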

## The Best MLFs Found by the Controllers

In [125]:
# load the best MLFs saved under floyd_outputs/<job>/cash_controller_mlfs_trial_*/
import joblib
import re

from pathlib2 import Path

from deep_cash import utils

best_mlfs = []
for job_id in JOB_NUMS:
    job_output_fp = Path("../floyd_outputs/%d" % job_id)
    for fp in job_output_fp.glob("cash_controller_mlfs_trial_*/*.pkl"):
        mlf = joblib.load(str(fp))
        # file names look like best_mlf_episode_<episode>.pkl
        episode = int(re.match(r"best_mlf_episode_(\d+)\.pkl", fp.name).group(1))
        mlf_str = "NONE" if mlf is None else utils._ml_framework_string(mlf)
        best_mlfs.append([job_id, episode, mlf_str])

best_mlfs = pd.DataFrame(
    best_mlfs, columns=["job_number", "episode", "mlf"])
best_mlfs.head()

,job_number,episode,mlf
0,197,407,NumericImputer > OneHotEncoder > Normalizer > ...
1,197,1315,NumericImputer > OneHotEncoder > MinMaxScaler ...
2,197,1473,NumericImputer > OneHotEncoder > MinMaxScaler ...
3,197,361,NumericImputer > OneHotEncoder > StandardScale...
4,197,375,NumericImputer > OneHotEncoder > RobustScaler ...


The plot below is fairly messy if we look at all of the `entropy_coef`
conditions, so by default we only show the best MLFs proposed by the
`entropy_coef = {0.1, 0.2}` controllers. We can see that past the
100th episode of training, the `entropy_coef=0.1` controller converges
on proposing `OneHotEncoder > Imputer > MinMaxScaler > PCA > LogisticRegression`
almost exclusively, while the `entropy_coef=0.2` controller is still proposing
a wide variety of MLFs.
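Whether a controller has "converged" on one MLF can also be quantified from `best_mlfs`: for each job, compute the share of late episodes whose best MLF is that job's most common one. This is a sketch; the helper name `modal_mlf_share` is hypothetical, and the `min_episode=100` cutoff simply mirrors the episode mentioned in the paragraph above.

```python
import pandas as pd


def modal_mlf_share(best_mlfs, min_episode=100):
    """Share of episodes after `min_episode` whose best MLF is the job's most common one.

    Expects the columns built above: job_number, episode, mlf.
    Values near 1.0 suggest the controller keeps proposing a single MLF.
    """
    late = best_mlfs[best_mlfs["episode"] > min_episode]
    return (
        late.groupby("job_number")["mlf"]
        # value_counts is sorted in descending order, so iloc[0] is the modal share
        .agg(lambda s: s.value_counts(normalize=True).iloc[0])
        .rename("modal_mlf_share")
    )
```

Running `modal_mlf_share(best_mlfs)` gives one number per job, which is easier to compare across `entropy_coef` settings than the scatter plot below.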

In [129]:
def create_best_mlf_timeline(x, y, color, name):
    # with mode='markers', plotly colors the points via `marker`, not `line`
    return go.Scatter(
        x=x,
        y=y,
        name=name,
        mode='markers',
        opacity=0.7,
        marker=dict(color=color)
    )


plot_mlf = best_mlfs

traces = (
    plot_mlf.groupby("job_number")
    .apply(lambda df: create_best_mlf_timeline(
        df.episode, df.mlf, COLORMAP.get(df.name), str(df.name)))
).tolist()

fig = go.Figure(
    data=traces,
    layout=dict(
        height=600,
        margin=dict(l=600),
        hovermode="closest"
    ))
iplot(fig)