# Classification Surrogate Tests

We want to test whether a classification surrogate can correctly identify unknown constraints that are defined by categorical criteria. In practice, this covers scenarios where specialists look at a set of experiments and label the outcomes as 'acceptable', 'unacceptable', 'ideal', etc.

This involves new models that produce `CategoricalOutput`s rather than continuous outputs. Mathematically, let $g_{\theta}:\mathbb{R}^d\to[0,1]^c$ be a function with learnable parameters $\theta$ that outputs a probability vector over $c$ potential classes (i.e. for an input $x\in\mathbb{R}^d$, $g_{\theta}(x)^\top\mathbf{1}=1$, where $\mathbf{1}$ is the vector of all 1's), and let the acceptability criteria for the corresponding classes be given by $a\in\{0,1\}^c$. We can then compute the scalar output $g_{\theta}(x)^\top a\in[0,1]$, which is the expected value of acceptance (the predicted probability that $x$ falls in an acceptable class) and can be passed to the optimizer as a constraint function.
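
As a minimal numeric sketch of this expected-acceptance value, the snippet below takes a made-up probability vector (not produced by any BoFire model) over the classes 'unacceptable', 'acceptable' and 'ideal' and forms its inner product with an acceptability vector $a$. The function name `expected_acceptance` is hypothetical.

```python
def expected_acceptance(probs, acceptable):
    """Return g(x)^T a: the probability mass assigned to acceptable classes."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    return sum(p * a for p, a in zip(probs, acceptable))


# Hypothetical class probabilities for (unacceptable, acceptable, ideal)
probs = [0.2, 0.5, 0.3]
# Acceptability vector a: 'acceptable' and 'ideal' both count as acceptable
a = [0, 1, 1]

acceptance = expected_acceptance(probs, a)
print(round(acceptance, 6))  # 0.8
```

Because $a$ only contains zeros and ones, the result is simply the summed probability of the acceptable classes, so it always lies in $[0,1]$.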

In this script, we look at the [Rosenbrock function constrained to a disk](https://en.wikipedia.org/wiki/Test_functions_for_optimization#cite_note-12), which attains its global minimum at $(x_0^*,x_1^*)=(1.0, 1.0)$. To exercise the functionality offered by BoFire, we label all points inside the disk $x_0^2+x_1^2\le2$ as 'acceptable' and further label points inside the intersection of this disk and the disk $(x_0-1)^2+(x_1-1)^2\le1.0$ as 'ideal'; points outside the disk $x_0^2+x_1^2\le2$ are labeled 'unacceptable'.

In [None]:
# Import packages
import numpy as np
import pandas as pd

import bofire.strategies.api as strategies
from bofire.data_models.api import Domain, Inputs, Outputs
from bofire.data_models.features.api import (
    CategoricalInput,
    CategoricalOutput,
    ContinuousInput,
    ContinuousOutput,
)
from bofire.data_models.objectives.api import (
    ConstrainedCategoricalObjective,
    MinimizeObjective,
    MinimizeSigmoidObjective,
)

## Manual setup of the optimization domain

The following cells show how to manually set up the optimization problem in BoFire for didactic purposes.

In [None]:
# Write helper functions which give the objective and the constraints
def rosenbrock(x: pd.DataFrame) -> pd.Series:
    # Rosenbrock objective, evaluated row-wise on the columns x_0 and x_1
    assert "x_0" in x.columns
    assert "x_1" in x.columns
    return (1 - x["x_0"]) ** 2 + 100 * (x["x_1"] - x["x_0"] ** 2) ** 2


def constraints(x: pd.DataFrame) -> list[str]:
    # Label each row as "ideal", "acceptable" or "unacceptable"
    assert "x_0" in x.columns
    assert "x_1" in x.columns
    feasibility_vector = []
    for _, row in x.iterrows():
        # Disk of radius sqrt(2) centered at the origin
        in_disk = row["x_0"] ** 2 + row["x_1"] ** 2 <= 2.0
        # Disk of radius 1 centered at (1, 1)
        in_shifted_disk = (row["x_0"] - 1.0) ** 2 + (row["x_1"] - 1.0) ** 2 <= 1.0
        if in_disk and in_shifted_disk:
            feasibility_vector.append("ideal")
        elif in_disk:
            feasibility_vector.append("acceptable")
        else:
            feasibility_vector.append("unacceptable")
    return feasibility_vector

In [None]:
# Set up the inputs and outputs; the categorical input is included just as an example
input_features = Inputs(
    features=[ContinuousInput(key=f"x_{i}", bounds=(-1.75, 1.75)) for i in range(2)]
    + [CategoricalInput(key="x_3", categories=["0", "1"], allowed=[True, True])],
)

# Here the minimize objective is used; to maximize, use the maximize objective instead.
output_features = Outputs(
    features=[
        ContinuousOutput(key=f"f_{0}", objective=MinimizeObjective(w=1.0)),
        CategoricalOutput(
            key=f"f_{1}",
            categories=["unacceptable", "acceptable", "ideal"],
            objective=ConstrainedCategoricalObjective(
                categories=["unacceptable", "acceptable", "ideal"],
                desirability=[False, True, True],
            ),
        ),  # This function will be associated with learning the categories
        ContinuousOutput(
            key=f"f_{2}",
            objective=MinimizeSigmoidObjective(w=1.0, tp=0.0, steepness=0.5),
        ),
    ],
)

# Create domain
domain1 = Domain(inputs=input_features, outputs=output_features)

# Sample random points
sample_df = domain1.inputs.sample(100)

# Evaluate the outputs: two continuous ones (f_0, f_2) and one categorical one (f_1)
sample_df["f_0"] = rosenbrock(x=sample_df)
sample_df["f_1"] = constraints(x=sample_df)
sample_df["f_2"] = sample_df["x_3"].astype(float) + 1e-2 * np.random.uniform(
    size=(len(sample_df),),
)
sample_df.head(5)

|   | x_0 | x_1 | x_3 | f_0 | f_1 | f_2 |
|---|---|---|---|---|---|---|
| 0 | 0.895917 | 0.153962 | 0 | 42.092749 | ideal | 0.006245 |
| 1 | -0.247675 | -0.996756 | 1 | 113.514085 | acceptable | 1.009216 |
| 2 | 1.170557 | 1.694727 | 0 | 10.560566 | unacceptable | 0.006636 |
| 3 | 1.348852 | -0.967109 | 0 | 776.586073 | unacceptable | 0.001465 |
| 4 | -0.767254 | -1.614315 | 1 | 488.441092 | unacceptable | 1.006189 |


In [None]:
# Plot the sample df
import math

import plotly.express as px


fig = px.scatter(
    sample_df,
    x="x_0",
    y="x_1",
    color="f_1",
    width=550,
    height=525,
    title="Samples with labels",
)
fig.add_shape(
    type="circle",
    xref="x",
    yref="y",
    opacity=0.1,
    fillcolor="red",
    x0=-math.sqrt(2),
    y0=-math.sqrt(2),
    x1=math.sqrt(2),
    y1=math.sqrt(2),
    line_color="red",
)
fig.add_shape(
    type="circle",
    xref="x",
    yref="y",
    opacity=0.2,
    fillcolor="LightSeaGreen",
    x0=0,
    y0=0,
    x1=2,
    y1=2,
    line_color="LightSeaGreen",
)
fig.show()

## Evaluate the classification model performance (outside of the optimization procedure)

In [None]:
# Import packages
import bofire.surrogates.api as surrogates
from bofire.data_models.surrogates.api import ClassificationMLPEnsemble
from bofire.surrogates.diagnostics import ClassificationMetricsEnum


# Instantiate the surrogate data model
surrogate_data = ClassificationMLPEnsemble(
    inputs=domain1.inputs,
    outputs=Outputs(features=[domain1.outputs.get_by_key("f_1")]),
    lr=0.03,
    n_epochs=100,
    hidden_layer_sizes=(
        4,
        2,
    ),
    weight_decay=0.0,
    batch_size=10,
    activation="tanh",
)
surrogate = surrogates.map(surrogate_data)

# Cross-validate the surrogate on the classification data (3 folds)
cv_df = sample_df.drop(["f_0", "f_2"], axis=1)
cv_df["valid_f_1"] = 1
cv_train, cv_test, _ = surrogate.cross_validate(cv_df, folds=3)

In [None]:
# Print training performance
cv_train.get_metrics(
    metrics=ClassificationMetricsEnum,
    combine_folds=True,
)

|   | ACCURACY | F1 |
|---|---|---|
| 0 | 0.85 | 0.85 |


In [None]:
# Print test performance
cv_test.get_metrics(
    metrics=ClassificationMetricsEnum,
    combine_folds=True,
)

|   | ACCURACY | F1 |
|---|---|---|
| 0 | 0.7 | 0.7 |
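
The ACCURACY and F1 values coincide in both outputs above. That is what one would expect if F1 is micro-averaged over the classes, because for single-label multiclass predictions micro-averaged precision and recall both reduce to accuracy. Whether BoFire computes F1 this way is an assumption here; the sketch below only illustrates the identity on made-up labels, and the helper names `accuracy` and `micro_f1` are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives and negatives pooled over all classes."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += p == c and t == c
            fp += p == c and t != c
            fn += p != c and t == c
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels, for illustration only
y_true = ["ideal", "acceptable", "unacceptable", "acceptable", "unacceptable"]
y_pred = ["ideal", "unacceptable", "unacceptable", "acceptable", "acceptable"]

acc = accuracy(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
print(round(acc, 3), round(f1, 3))  # 0.6 0.6
```

Every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the pooled precision and recall are equal.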


## Setup strategy and ask for candidates

Now we set up a `SoboStrategy` for generating candidates. The categorical output is modelled using the surrogate from above and enters the acquisition function optimization as an output constraint (constrained expected improvement). For more details, have a look at this notebook: https://github.com/pytorch/botorch/blob/main/notebooks_community/clf_constrained_bo.ipynb and/or this paper: https://arxiv.org/abs/2402.07692.
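
To illustrate the idea behind constrained expected improvement, the sketch below weights the analytic expected improvement (for minimization, under a Gaussian posterior) by a classifier's probability of feasibility. This is the textbook single-point form and not the `qLogEI` implementation used by BoFire or BoTorch; all function names and numbers are hypothetical.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """Analytic EI for minimization, given posterior mean/std and the incumbent."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)


def constrained_ei(mu, sigma, best, p_feasible):
    """EI weighted by the predicted probability that the point is acceptable."""
    return expected_improvement(mu, sigma, best) * p_feasible


# Same objective posterior, different predicted acceptance probabilities
likely_ok = constrained_ei(mu=0.0, sigma=1.0, best=0.0, p_feasible=0.9)
likely_bad = constrained_ei(mu=0.0, sigma=1.0, best=0.0, p_feasible=0.1)
print(likely_ok > likely_bad)  # True
```

With this weighting, a point that looks promising for the objective but is predicted to be 'unacceptable' receives a low acquisition value, which steers the search toward the acceptable region.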


In [None]:
from bofire.data_models.acquisition_functions.api import qLogEI
from bofire.data_models.strategies.api import SoboStrategy
from bofire.data_models.surrogates.api import BotorchSurrogates


strategy_data = SoboStrategy(
    domain=domain1,
    acquisition_function=qLogEI(),
    surrogate_specs=BotorchSurrogates(
        surrogates=[surrogate_data],
    ),
)

strategy = strategies.map(strategy_data)

strategy.tell(sample_df)

In [None]:
candidates = strategy.ask(10)
candidates

|   | x_0 | x_1 | x_3 | f_1_pred | f_1_sd | f_1_unacceptable_prob | f_1_acceptable_prob | f_1_ideal_prob | f_0_pred | f_2_pred | f_1_unacceptable_sd | f_1_acceptable_sd | f_1_ideal_sd | f_0_sd | f_2_sd | f_0_des | f_2_des | f_1_des |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.52121 | 0.269344 | 0 | acceptable | 0.0 | -0.599523 | 0.006038 | 0.000453 | 0.801745 | 0.197802 | 8.447225 | 0.005222 | 0.000587 | 0.439619 | 0.439731 | -0.801745 | 0.475295 | 0.006491 |
| 1 | -0.534604 | 0.281354 | 0 | unacceptable | 0.0 | 0.992266 | 0.005395 | 0.000284 | 0.998946 | 0.00077 | 8.983122 | 0.005228 | 0.000315 | 0.001168 | 0.000942 | -0.998946 | 0.499904 | 0.005679 |
| 2 | -1.310769 | 1.75 | 0 | unacceptable | 0.0 | 1.591719 | 0.006466 | 0.399993 | 0.599155 | 0.000852 | 16.280403 | 0.006001 | 0.546267 | 0.54694 | 0.000798 | -0.599155 | 0.499893 | 0.406459 |
| 3 | -0.05134 | -7.7e-05 | 0 | unacceptable | 0.0 | 0.696201 | 0.005941 | 0.000278 | 0.99896 | 0.000762 | 7.099107 | 0.005202 | 0.000313 | 0.001147 | 0.000927 | -0.99896 | 0.499905 | 0.006219 |
| 4 | 0.320754 | 0.112356 | 0 | acceptable | 0.0 | -0.130758 | 0.006143 | 0.000301 | 0.998758 | 0.000941 | 6.648483 | 0.005214 | 0.000342 | 0.001474 | 0.00127 | -0.998758 | 0.499882 | 0.006444 |
| 5 | 0.798712 | 0.635166 | 0 | ideal | 0.0 | -0.338928 | 0.005531 | 0.59617 | 0.202582 | 0.201248 | 7.134383 | 0.005218 | 0.544056 | 0.445691 | 0.445637 | -0.202582 | 0.474865 | 0.601701 |
| 6 | 0.619285 | 0.342774 | 0 | ideal | 0.0 | -0.461361 | 0.005945 | 0.008842 | 0.626766 | 0.364392 | 8.069609 | 0.005224 | 0.013143 | 0.494871 | 0.499344 | -0.626766 | 0.454577 | 0.014787 |
| 7 | -0.650821 | 0.401533 | 0 | unacceptable | 0.0 | 1.701163 | 0.005345 | 0.00031 | 0.998893 | 0.000797 | 8.275364 | 0.005238 | 0.000327 | 0.00125 | 0.000996 | -0.998893 | 0.4999 | 0.005655 |
| 8 | -1.33106 | 1.75 | 0 | unacceptable | 0.0 | 2.129736 | 0.006499 | 0.399992 | 0.599152 | 0.000855 | 16.455936 | 0.006021 | 0.546263 | 0.546937 | 0.000796 | -0.599152 | 0.499893 | 0.406491 |
| 9 | -0.969561 | 0.933397 | 0 | unacceptable | 0.0 | 5.872413 | 0.005596 | 0.376938 | 0.622362 | 0.0007 | 9.472464 | 0.00527 | 0.5172 | 0.517651 | 0.00099 | -0.622362 | 0.499912 | 0.382533 |
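
The `*_des` columns can be cross-checked against the predictions in the same row. The sketch below uses the values from row 0 of the candidates output above and assumes (inferred from the numbers, not taken from BoFire's documentation) that `f_0_des` is the negated prediction for `MinimizeObjective` with `w=1.0`, that `f_2_des` follows the logistic form $1 - 1/(1+e^{-s(x-tp)})$ with steepness $s$ and turning point $tp$, and that `f_1_des` is the summed value of the two desirable class columns. The helper name `minimize_sigmoid` is hypothetical.

```python
import math


def minimize_sigmoid(x, tp, steepness):
    """Assumed logistic desirability: close to 1 below tp, close to 0 above it."""
    return 1.0 - 1.0 / (1.0 + math.exp(-steepness * (x - tp)))


# Values copied from row 0 of the candidates output
f_0_pred = 0.801745
f_2_pred = 0.197802
f_1_acceptable_prob = 0.006038
f_1_ideal_prob = 0.000453

f_0_des = -1.0 * f_0_pred  # minimize objective with w=1.0
f_2_des = minimize_sigmoid(f_2_pred, tp=0.0, steepness=0.5)
f_1_des = f_1_acceptable_prob + f_1_ideal_prob  # desirability=[False, True, True]

print(round(f_0_des, 6), round(f_2_des, 6), round(f_1_des, 6))
```

Under these assumptions the recomputed values agree with the `f_0_des`, `f_2_des` and `f_1_des` entries of row 0 (-0.801745, 0.475295 and 0.006491) to the printed precision.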


## Check classification of proposed candidates

Use the labeling logic from above to compute the true class of each proposed candidate and compare it with the predicted class.

In [None]:
# Append to the candidates
candidates["f_1_true"] = constraints(x=candidates)

In [None]:
# Print results
candidates[["x_0", "x_1", "f_1_pred", "f_1_true"]]

|   | x_0 | x_1 | f_1_pred | f_1_true |
|---|---|---|---|---|
| 0 | 0.52121 | 0.269344 | acceptable | ideal |
| 1 | -0.534604 | 0.281354 | unacceptable | acceptable |
| 2 | -1.310769 | 1.75 | unacceptable | unacceptable |
| 3 | -0.05134 | -7.7e-05 | unacceptable | acceptable |
| 4 | 0.320754 | 0.112356 | acceptable | acceptable |
| 5 | 0.798712 | 0.635166 | ideal | ideal |
| 6 | 0.619285 | 0.342774 | ideal | ideal |
| 7 | -0.650821 | 0.401533 | unacceptable | acceptable |
| 8 | -1.33106 | 1.75 | unacceptable | unacceptable |
| 9 | -0.969561 | 0.933397 | unacceptable | acceptable |
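
The comparison above can be summarized with two simple rates. The sketch below copies the `f_1_pred` and `f_1_true` columns from the output above into plain lists and computes how often the predicted class matches exactly, and how often it at least agrees on acceptability (treating 'acceptable' and 'ideal' as desirable, as in the objective). The variable names are hypothetical.

```python
# Labels copied from the candidates comparison above, in row order 0..9
f_1_pred = [
    "acceptable", "unacceptable", "unacceptable", "unacceptable", "acceptable",
    "ideal", "ideal", "unacceptable", "unacceptable", "unacceptable",
]
f_1_true = [
    "ideal", "acceptable", "unacceptable", "acceptable", "acceptable",
    "ideal", "ideal", "acceptable", "unacceptable", "acceptable",
]

# Same desirability as in the ConstrainedCategoricalObjective
desirable = {"unacceptable": False, "acceptable": True, "ideal": True}

pairs = list(zip(f_1_pred, f_1_true))
exact_match = sum(p == t for p, t in pairs) / len(pairs)
acceptability_match = sum(desirable[p] == desirable[t] for p, t in pairs) / len(pairs)

print(exact_match, acceptability_match)  # 0.5 0.6
```

In this run, every disagreement on acceptability is a truly acceptable point that the classifier predicted as 'unacceptable'; none of the truly unacceptable candidates was predicted as acceptable.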
