Classification surrogates #297

Merged
merged 38 commits into main from classification_surrogates on Feb 27, 2024
Conversation

@gmancino (Contributor) commented Oct 3, 2023

Add Classification Models for Surrogates

Adding new surrogates that allow classification of output values (e.g. 'unacceptable', 'acceptable', or 'ideal') for modeling unknown constraints. Concretely, if $g_{\theta}:\mathbb{R}^d\to[0,1]^c$ represents a function with learnable parameters $\theta$ which outputs a probability vector over $c$ potential classes (i.e. for input $x\in\mathbb{R}^d$, $g_{\theta}(x)^\top\mathbf{1}=1$, where $\mathbf{1}$ is the vector of all 1's), and we have acceptability criteria for the corresponding classes given by $a\in\{0,1\}^c$, then we can compute the expected acceptability as a scalar via $g_{\theta}(x)^\top a\in[0,1]$, which can be passed in as a constrained objective function.
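The expected-acceptability computation above can be sketched in plain Python (helper names here are illustrative, not BoFire's API; the actual surrogate works on torch tensors):

```python
import math

def softmax(logits):
    # Turn raw class scores into a probability vector that sums to 1.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_acceptability(logits, acceptability):
    # g_theta(x)^T a: total probability mass on the acceptable classes.
    probs = softmax(logits)
    return sum(p * a for p, a in zip(probs, acceptability))
```

With acceptability a = (0, 1, 1) for ('unacceptable', 'acceptable', 'ideal'), the result is the predicted probability that the point is at least acceptable.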

Classification Models

We add a new objective function in 'bofire/data_models/objectives/categorical.py' which can be passed to outputs of type CategoricalOutput. These are instantiated with a list of probability scales (i.e. the acceptability criteria vector) via the desirability argument and inherit the categories from the corresponding output. Using this new objective:

  1. We implement an MLP ensemble method to start ('bofire/data_models/surrogates/mlp.py') which outputs a $c$-dimensional probability vector for each datapoint
  2. The predicted value (stored in {key}_{class}_prob of the predictions) is the argmax along this probability vector, while the objective value (stored in {key}_{class}_{des} of the predictions) is the inner-product of the probability vector with the acceptability criteria vector
    • This value is also computed in the constrained_objective2botorch function, where it currently undergoes the inverse sigmoid (logit) transformation so that the value is maintained in probability space downstream
  3. We pass in the objective value to BoTorch as a constraint
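Steps 2–3 can be sketched as follows (hypothetical helper name; in the PR this logic lives in constrained_objective2botorch and operates on torch tensors):

```python
import math

def acceptability_constraint(probs, acceptability, eps=1e-8):
    # Expected acceptability: inner product of the probability vector
    # with the 0/1 acceptability criteria vector.
    p = sum(pr * a for pr, a in zip(probs, acceptability))
    # Inverse sigmoid (logit); clamp away from 0/1 so the log stays finite.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))
```

Applying a sigmoid downstream recovers the value in probability space.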

@gmancino (Contributor, Author) commented Oct 3, 2023

@jduerholt, please let me know if the style of the updates here is appropriate. I will build more models if the initial idea makes sense.

@jduerholt (Contributor) left a comment

Hi Gabe, this looks really impressive. I just added some initial questions, as I am not yet fully through this beast :D

bofire/strategies/doe/utils_categorical_discrete.py Outdated Show resolved Hide resolved
bofire/data_models/features/categorical.py Outdated Show resolved Hide resolved
return (
    [
        lambda Z: -1.0
        * objective.w
@jduerholt (Contributor):

Why objective.w?

@gmancino (Contributor, Author):

This is an error. I was testing things locally; it will be changed in the next revision :)

raise ValueError("Objective values have to be less than or equal to 1.")
if o < 0:
    raise ValueError("Objective values have to be greater than or equal to zero.")
for w in weights:
@jduerholt (Contributor):

If we go with this structure, this validation should move inside the objective class as a validator there; or we just do it via the Annotated types, but then it belongs there as well.

@gmancino (Contributor, Author):

I see, I will think more about it.

bofire/data_models/features/categorical.py Outdated Show resolved Hide resolved
bofire/surrogates/mlp_classifier.py Outdated Show resolved Hide resolved
bofire/surrogates/mlp_classifier.py Outdated Show resolved Hide resolved
@gmancino (Contributor, Author) commented Oct 5, 2023

@jduerholt thank you so much for your feedback! I think I have addressed all of your concerns, but please let me know if there are any additional errors. If it all looks good, I plan on implementing something like pytorch/botorch#640 for a categorical GP. This should cover our initial bases :)

@jduerholt (Contributor) left a comment

Hi Gabe,

sorry that it took so long.

I really like how you managed to get the predictions from the multilabel classification into a BoTorch constraint. Very smart. I was just wondering whether a torch.exp transformation is needed due to the log_softmax. In particular, the use of log_softmax should be discussed in the context of versatility.

Best,

Johannes

bofire/data_models/domain/features.py Show resolved Hide resolved
]
if len(categorical_cols) == 0:
return candidates
for col in categorical_cols:
@jduerholt (Contributor):

Suggested change
for col in categorical_cols:
for feat in self.get(CategoricalOutputFeature):
    col = f"{feat.key}_pred"
    if col not in candidates:
        raise ValueError(f"missing column {col}")
    feat.validate_experimental(candidates[col])  # this is doing what you do in the second if

This is, for me, a bit cleaner and more readable.

(f"{obj.key}_pred", obj.categories)
for obj in self.get_by_objective(includes=CategoricalObjective)
]
if len(categorical_cols) == 0:
@jduerholt (Contributor):

This can be removed if you do it as shown below.

def to_dict(self) -> Dict:
"""Returns the categories and corresponding objective values as dictionary"""
return dict(zip(self.categories, self.objective))
return dict(zip(self.categories, self.objective.desirability))
@jduerholt (Contributor):

This and the following methods are problematic here, as they are not objective-agnostic: if one assigned a different objective with a different attribute structure, this method would fail. They should be moved into the respective objective class.

@gmancino (Contributor, Author):

This would mean we set up functions within the objective which take in the categories and do some processing. It's still unclear to me why the categories cannot be passed directly to the objective upon instantiation.


def __call__(self, values: pd.Series) -> pd.Series:
return values.map(self.to_dict()).astype(float)
return values.map(self.to_dict())
@jduerholt (Contributor):

This also needs to call a method within the objective.
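A minimal stand-in for the mapping discussed in this thread (plain Python with example labels; the PR's version uses pd.Series.map with .astype(float)):

```python
categories = ("unacceptable", "acceptable", "ideal")  # example labels
desirability = (0.0, 1.0, 1.0)                        # example objective values

# to_dict(): category label -> objective value
lookup = dict(zip(categories, desirability))

def map_values(values):
    # __call__(): pandas equivalent is values.map(lookup).astype(float)
    return [float(lookup[v]) for v in values]
```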

return self.X[i], self.y[i]


class MLPClassifier(nn.Module):
@jduerholt (Contributor):

Can we also reuse code here by basing this and the MLP for regression on the same parent class?

self.layers = nn.Sequential(*layers)

def forward(self, x):
return nn.functional.log_softmax(self.layers(x), dim=1)
@jduerholt (Contributor):

What happens if we have binary classification? Then we should just have one output neuron and no need for a softmax, or?

@jduerholt (Contributor):

And why the log softmax? Is it because of the constraint handling?
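For reference, log_softmax returns log-probabilities, so a torch.exp is needed before the outputs can be interpreted as probabilities; a dependency-free illustration of the same computation:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax: subtract the log-sum-exp.
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]
```

Exponentiating the result recovers a probability vector that sums to 1.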

return nn.functional.log_softmax(self.layers(x), dim=1)


class _MLPClassifierEnsemble(EnsembleModel):
@jduerholt (Contributor):

Also here, code reuse should be possible, or?

current_loss += loss.item()


class MLPClassifierEnsemble(BotorchSurrogate, TrainableSurrogate):
@jduerholt (Contributor):

Also here, code reuse should be possible.

* (
    Z[..., idx : idx + len(objective.desirability)]
    * torch.tensor(objective.desirability).to(**tkwargs)
).sum(-1)
@jduerholt (Contributor):

This is a dot product or? Why not use torch.dot?
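It is a dot product, but torch.dot only accepts 1-D tensors, while Z here carries batch dimensions; the elementwise multiply followed by .sum(-1) is the batched dot product. A plain-Python illustration for a 2-D batch:

```python
def batched_dot(Z, a):
    # Row-wise inner product: equivalent to (Z * a).sum(-1) in torch,
    # which broadcasts over arbitrary leading batch dimensions.
    return [sum(z_i * a_i for z_i, a_i in zip(row, a)) for row in Z]
```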

@jduerholt (Contributor):

But really a genius idea.

@jduerholt (Contributor):

But could it be that you overlooked that the classification model is not returning probabilities but log probabilities? So, a torch.exp is needed, or?

@jduerholt (Contributor):

But maybe, I overlooked a transformation somewhere.

@gmancino (Contributor, Author) commented:

Hi @jduerholt, I have made the corresponding changes to this PR which we discussed. Some tests are currently failing because I changed TCategoryVals to be of type Tuple[str, ...], which makes the categories immutable. Hence, the tests fail in each hard-coded location of categories (which are currently lists). I know we discussed using Tuples instead of Lists for this, so it may be worth changing by hand unless you have some other opinion? There may be additional technical comments on the changes made which need to be addressed first ;)

@jduerholt (Contributor) left a comment

Hi Gabe,

I looked over the first part of the PR and left inline comments. I will look over the rest during the Christmas days.

Main comments are due to missing tests.

Thanks!

Best,

Johannes

"""

w: TWeight = 1.0
desirability: Tuple[float, ...]
@jduerholt (Contributor):

Suggested change
desirability: Tuple[float, ...]
desirability: Annotated[Tuple[Annotated[float, Field(ge=0, le=1)], ...], Field(min_items=2)]

from bofire.data_models.surrogates.trainable import TrainableSurrogate


class MLPClassifierEnsemble(BotorchSurrogate, TrainableSurrogate):
@jduerholt (Contributor):

Base class called MLPEnsemble, with child classes ClassifierMLPEnsemble and RegressionMLPEnsemble.

@@ -165,7 +165,7 @@ def is_categorical(s: pd.Series, categories: List[str]):
TDescriptors = Annotated[List[str], Field(min_items=1)]


TCategoryVals = Annotated[List[str], Field(min_items=2)]
TCategoryVals = Tuple[str, ...]
@jduerholt (Contributor):

Suggested change
TCategoryVals = Tuple[str, ...]
TCategoryVals = Annotated[Tuple[str, ...], min_length=2]

This would be how one should do it in pydantic 2, as soon as the migration is done. Here it introduces problems due to pydantic version 1. For this reason, I would leave it as we had it originally. Note that you do not have to change it everywhere, since pydantic should do the casting. It also looks like the error which you have in the tests comes from some manual changes, since for me it works when I only change TCategoryVals in the main branch as you did above and do not change anything else. But we should do this as soon as we have pydantic 2. Can you add this as a suggestion in the pydantic2 PR?

bofire/data_models/domain/features.py Show resolved Hide resolved
bofire/data_models/domain/features.py Show resolved Hide resolved
Returns:
bool: True if the output type is valid for the surrogate chosen, False otherwise
"""
return True if isinstance(my_type, ContinuousOutput) else False
@jduerholt (Contributor):

Suggested change
return True if isinstance(my_type, ContinuousOutput) else False
return isinstance(my_type, ContinuousOutput)

Returns:
bool: True if the output type is valid for the surrogate chosen, False otherwise
"""
return True if isinstance(my_type, ContinuousOutput) else False
@jduerholt (Contributor):

Suggested change
return True if isinstance(my_type, ContinuousOutput) else False
return isinstance(my_type, ContinuousOutput)

Returns:
bool: True if the output type is valid for the surrogate chosen, False otherwise
"""
return True if isinstance(my_type, ContinuousOutput) else False
@jduerholt (Contributor):

Suggested change
return True if isinstance(my_type, ContinuousOutput) else False
return isinstance(my_type, ContinuousOutput)

Returns:
bool: True if the output type is valid for the surrogate chosen, False otherwise
"""
return True if isinstance(my_type, ContinuousOutput) else False
@jduerholt (Contributor):

Suggested change
return True if isinstance(my_type, ContinuousOutput) else False
return isinstance(my_type, ContinuousOutput)

Returns:
bool: True if the output type is valid for the surrogate chosen, False otherwise
"""
return True if isinstance(my_type, ContinuousOutput) else False
@jduerholt (Contributor):

Suggested change
return True if isinstance(my_type, ContinuousOutput) else False
return isinstance(my_type, ContinuousOutput)

pred_cols = []
sd_cols = []
for featkey in self.outputs.get_keys():
if hasattr(self.outputs.get_by_key(featkey), "categories"):
@gmancino (Contributor, Author):

@jduerholt, there is a type error failing here. How do you recommend checking if the output type is categorical? Line 71 below also fails...

@jduerholt (Contributor):

for feat in self.outputs.get(CategoricalOutput): Doing it like this, you only get categorical output features ;) The same filtering also applies to get_keys. Have a look at the methods in domain/features.py; they are super helpful and I use them a lot, so you should familiarize yourself with them.

@jduerholt (Contributor):

If you only need the keys, use get_keys(CategoricalOutput); it could be that you still have a type error, then add a type: ignore.
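The type-based filtering pattern described here can be illustrated with a minimal mock (simplified stand-ins for BoFire's feature classes, not the real API):

```python
class Output:
    def __init__(self, key):
        self.key = key

class ContinuousOutput(Output):
    pass

class CategoricalOutput(Output):
    def __init__(self, key, categories):
        super().__init__(key)
        self.categories = categories

def get(features, feature_type):
    # Mirrors Outputs.get(...): keep only features of the requested type.
    return [f for f in features if isinstance(f, feature_type)]

def get_keys(features, feature_type):
    # Mirrors Outputs.get_keys(...): same filter, but return the keys.
    return [f.key for f in get(features, feature_type)]
```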

batch_size: int = 10,
n_epoches: int = 200,
lr: float = 1e-4,
shuffle: bool = True,
weight_decay: float = 0.0,
loss_function: Union[
@gmancino (Contributor, Author):

@jduerholt there is a type error here. I tried removing the type from the loss function and also tried nn.Module. I could try nn.modules.loss unless you have any other suggestions?

@jduerholt (Contributor):

Puh, no idea, then just set a type: ignore behind it ...

@jduerholt (Contributor) commented Feb 1, 2024

Regarding your failing test: something went wrong in one of your merges against main. In main, the line with the correct terms looks as follows:

terms = ["1", "x0", "x1", "x2", "x0 ** 2", "x1 ** 2", "x2 ** 2"]

In your branch it looks like this:

terms = ["1", "x0", "x1", "x2", "x0**2", "x1**2", "x2**2"]

In main we updated it at some point to the version looking like "x ** 2", which is the new formulaic format, whereas your branch still has "x**2", which is the old format.

I assume some problems with a merge.

@@ -74,7 +74,7 @@ def test_get_formula_from_string():
assert all(term in np.array(model_formula, dtype=str) for term in terms)

# linear and quadratic
terms = ["1", "x0", "x1", "x2", "x0 ** 2", "x1 ** 2", "x2 ** 2"]
terms = ["1", "x0", "x1", "x2", "x0**2", "x1**2", "x2**2"]
@jduerholt (Contributor):

One can also see here that your branch differs from main in doe/test_utils.py.

@gmancino (Contributor, Author) commented Feb 1, 2024

@jduerholt This is ready for another round of reviews :) Hopefully the last one!

@gmancino (Contributor, Author) left a comment

I should have addressed all previous concerns. The tutorials also show the functionality of the new feature. We can add extra tests, but I believe I added the major ones for now.

bofire/data_models/domain/features.py Show resolved Hide resolved
bofire/data_models/domain/features.py Show resolved Hide resolved
itertools.chain.from_iterable(
[
[f"{key}_pred", f"{key}_sd", f"{key}_des"]
for key in self.get_keys_by_objective(Objective)
[f"{obj.key}_pred", f"{obj.key}_sd", f"{obj.key}_des"]
@gmancino (Contributor, Author):

Done

@jduerholt (Contributor) left a comment

Hi Gabe,

thank you very much. Looks overall good. I added only some minor points. The next step would then be to add tests, or?

Regarding your notebook, let us discuss whether we want to set up an example where the classifier has better performance, or what do you think?

Best,

Johannes

bofire/data_models/features/categorical.py Outdated Show resolved Hide resolved
@@ -359,38 +361,63 @@ class CategoricalOutput(Output):
order_id: ClassVar[int] = 9

categories: TCategoryVals
objective: Annotated[List[Annotated[float, Field(ge=0, le=1)]], Field(min_length=2)]
objective: Optional[AnyCategoricalObjective] = Field(
default_factory=lambda: ConstrainedCategoricalObjective(
@jduerholt (Contributor):

This seems wrong to me: why give an objective with categories "a", "b" as default? One can build a default objective by accessing the categories attribute.

@gmancino (Contributor, Author):

Should we set desirabilities to be an array of True's in this instance?

@gmancino (Contributor, Author):

This actually ties into the questions above (i.e. regarding None types). I am also not sure how to instantiate in this instance, since it seems a bit "chicken-and-egg." Either way, we just need a default case here. Let me know what you prefer.

@jduerholt (Contributor):

Let us do the following: we implement the default generation in the field validator for the CategoricalOutput:

def validate_objective(cls, objective, info):
    if objective is None:
        return ConstrainedCategoricalObjective(
            categories=info.data["categories"],
            desirabilities=[True] * len(info.data["categories"]),
        )
    return objective  # otherwise, do the validation already implemented in the validator

By this we then set as default that everything is allowed, which is the same as ignoring the CategoricalOutput in the optimization.

In the attribute, we then set the following:

objective: Optional[AnyCategoricalObjective] = Field(default=None, validate_default=True)

Does this make sense to you?
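The default proposed here ("everything allowed") can be illustrated without pydantic (hypothetical function shape; the real version is a pydantic field validator on the model):

```python
def validate_objective(objective, categories):
    # If no objective is given, accept every category; this is equivalent
    # to ignoring the CategoricalOutput in the optimization.
    if objective is None:
        return {
            "categories": tuple(categories),
            "desirabilities": [True] * len(categories),
        }
    return objective
```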

@gmancino (Contributor, Author):

Intuitively this makes sense; however, just applying this as is actually causes the following error: FAILED tests/bofire/data_models/serialization/test_deserialization.py::test_outputs_should_be_deserializable[outputs_spec0]

This does not make sense, since these output types are Continuous. Care to comment?

tests/bofire/data_models/specs/features.py Show resolved Hide resolved
return objective

@classmethod
def from_objective(
@jduerholt (Contributor):

Could not find a test for this.

bofire/data_models/features/categorical.py Show resolved Hide resolved
bofire/data_models/surrogates/mlp.py Show resolved Hide resolved
bofire/surrogates/surrogate.py Outdated Show resolved Hide resolved
bofire/surrogates/trainable.py Show resolved Hide resolved
bofire/utils/torch_tools.py Outdated Show resolved Hide resolved
tests/bofire/utils/test_torch_tools.py Show resolved Hide resolved
@jduerholt (Contributor) left a comment

Hi Gabe,

due to lack of time, not a full review, but a few minor things to work on. We are very close to the finish line.

Best,

Johannes

loc=0,
column=f"{feat.key}_pred",
value=predictions.filter(regex=f"{feat.key}(.*)_prob")
.idxmax(1)
@jduerholt (Contributor):

This is duplicated with surrogate.py; could we also write a helper function for this and call it in both places, some method called, for example, postprocess_categorical_predictions? We could place it in surrogate.py. What do you think?

@gmancino (Contributor, Author):

I will add it to the naming_conventions file, unless you can tell me how to access the surrogates within the predictives to call a method specified within the surrogate class?
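A sketch of such a postprocess helper (plain Python; the PR's pandas version does predictions.filter(regex=...).idxmax(1)). It assumes column names of the form {key}_{class}_prob with no underscores inside the class label:

```python
def postprocess_categorical_predictions(prob_columns):
    # prob_columns: {"<key>_<class>_prob": probability, ...}
    # The predicted class is the argmax over the probability columns.
    best = max(prob_columns, key=prob_columns.get)
    return best.rsplit("_", 2)[1]  # recover the class label from the column name
```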

for cat in outputs.get_by_key(featkey).categories # type: ignore
]
sd_cols = sd_cols + [
f"{featkey}_{cat}_sd"
@jduerholt (Contributor):

Nice. Could you also add a test for it?

bofire/utils/naming_conventions.py Outdated Show resolved Hide resolved
bofire/utils/naming_conventions.py Outdated Show resolved Hide resolved
tests/bofire/surrogates/test_mlp.py Show resolved Hide resolved
@gmancino (Contributor, Author) commented:

@jduerholt, I have updated all of the changes previously requested (including tests). Please let me know what you think once you have some time :)

@jduerholt (Contributor) left a comment

Looks very good; the last two things are commented inline. When they are finished, we can merge it in ;)

bofire/data_models/objectives/categorical.py Outdated Show resolved Hide resolved
bofire/data_models/objectives/categorical.py Outdated Show resolved Hide resolved
@gmancino (Contributor, Author) commented:

@jduerholt the requested changes are complete :)

@jduerholt (Contributor) left a comment

Thank you very much for all your efforts and your patience with me!

@jduerholt jduerholt merged commit 7b15308 into main Feb 27, 2024
10 checks passed
@jduerholt jduerholt deleted the classification_surrogates branch February 27, 2024 08:05