Classification surrogates #297

Merged
merged 38 commits into main from classification_surrogates on Feb 27, 2024
Conversation

@gmancino (Contributor) commented Oct 3, 2023

Add Classification Models for Surrogates

Adding new surrogates that allow classification of output values (e.g. 'unacceptable', 'acceptable', or 'ideal') for modeling unknown constraints. Concretely, if $g_{\theta}:\mathbb{R}^d\to[0,1]^c$ represents a function with learnable parameters $\theta$ which outputs a probability vector over $c$ potential classes (i.e. for input $x\in\mathbb{R}^d$, $g_{\theta}(x)^\top\mathbf{1}=1$, where $\mathbf{1}$ is the vector of all 1's), and we have acceptability criteria for the corresponding classes given by $a\in\{0,1\}^c$, then we can compute the expected acceptability as a scalar via $g_{\theta}(x)^\top a\in[0,1]$, which can be passed in as a constrained objective function.
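The expected-acceptability computation above can be sketched in plain Python (helper names here are illustrative, not BoFire's API; the actual surrogate works on torch tensors):

```python
import math

def softmax(logits):
    # Turn raw class scores into a probability vector that sums to 1.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_acceptability(logits, acceptability):
    # g_theta(x)^T a: total probability mass on the acceptable classes.
    probs = softmax(logits)
    return sum(p * a for p, a in zip(probs, acceptability))
```

With acceptability a = (0, 1, 1) for ('unacceptable', 'acceptable', 'ideal'), the result is the predicted probability that the point is at least acceptable.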

Classification Models

We add a new objective function in 'bofire/data_models/objectives/categorical.py' which can be passed to outputs of type CategoricalOutput. These are instantiated with a list of probability scales (i.e. the acceptability criteria vector) via the desirability argument and inherit the categories from the corresponding output. Using this new objective:

  1. We implement an MLP ensemble method to start ('bofire/data_models/surrogates/mlp.py') which outputs a $c$-dimensional probability vector for each datapoint
  2. The predicted value (stored in {key}_{class}_prob of the predictions) is the argmax along this probability vector, while the objective value (stored in {key}_{class}_{des} of the predictions) is the inner-product of the probability vector with the acceptability criteria vector
    • This value is also computed in the constrained_objective2botorch function, where it currently undergoes the inverse sigmoid (logit) transformation so that the value is maintained in probability space downstream
  3. We pass in the objective value to BoTorch as a constraint
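Steps 2–3 can be sketched as follows (hypothetical helper name; in the PR this logic lives in constrained_objective2botorch and operates on torch tensors):

```python
import math

def acceptability_constraint(probs, acceptability, eps=1e-8):
    # Expected acceptability: inner product of the probability vector
    # with the 0/1 acceptability criteria vector.
    p = sum(pr * a for pr, a in zip(probs, acceptability))
    # Inverse sigmoid (logit); clamp away from 0/1 so the log stays finite.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))
```

Applying a sigmoid downstream recovers the value in probability space.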

@gmancino (Contributor, Author) commented Oct 3, 2023

@jduerholt, please let me know if the style of the updates here is appropriate. I will build more models if the initial idea makes sense.

@jduerholt (Contributor) left a comment

Hi Gabe, this looks really impressive. I just added some initial questions, as I am not yet fully through this beast :D

bofire/strategies/doe/utils_categorical_discrete.py Outdated Show resolved Hide resolved
bofire/data_models/features/categorical.py Outdated Show resolved Hide resolved
return (
    [
        lambda Z: -1.0
        * objective.w
@jduerholt (Contributor):

Why objective.w?

@gmancino (Contributor, Author):

This is an error. I was testing things locally; it will be changed in the next revision :)

raise ValueError("Objective values have to be less than or equal to 1.")
if o < 0:
    raise ValueError("Objective values have to be greater than or equal to zero.")
for w in weights:
@jduerholt (Contributor):

If we go with this structure, this validation should move inside the objective class as a validator there; or we just do it via the Annotated types, but then it belongs there as well.

@gmancino (Contributor, Author):

I see, I will think more about it.

bofire/data_models/features/categorical.py Outdated Show resolved Hide resolved
bofire/surrogates/mlp_classifier.py Outdated Show resolved Hide resolved
bofire/surrogates/mlp_classifier.py Outdated Show resolved Hide resolved
@gmancino (Contributor, Author) commented Oct 5, 2023

@jduerholt thank you so much for your feedback! I think I have addressed all of your concerns, but please let me know if there are any additional errors. If it all looks good, I plan on implementing something like pytorch/botorch#640 for a categorical GP. This should cover our initial bases :)

@jduerholt (Contributor) left a comment

Hi Gabe,

sorry that it took so long.

I really like how you managed to get the predictions from the multilabel classification into a BoTorch constraint. Very smart. I was just wondering whether a torch.exp transformation is needed due to the log_softmax. In particular, the use of log_softmax should be discussed in the context of versatility.

Best,

Johannes

bofire/data_models/domain/features.py Show resolved Hide resolved
]
if len(categorical_cols) == 0:
return candidates
for col in categorical_cols:
@jduerholt (Contributor):

Suggested change
for col in categorical_cols:
for feat in self.get(CategoricalOutputFeature):
    col = f"{feat.key}_pred"
    if col not in candidates:
        raise ValueError(f"missing column {col}")
    feat.validate_experimental(candidates[col])  # this is doing what you do in the second if

This is, for me, a bit cleaner and more readable.

(f"{obj.key}_pred", obj.categories)
for obj in self.get_by_objective(includes=CategoricalObjective)
]
if len(categorical_cols) == 0:
@jduerholt (Contributor):

This can be removed if you do it as shown below.

def to_dict(self) -> Dict:
"""Returns the categories and corresponding objective values as dictionary"""
return dict(zip(self.categories, self.objective))
return dict(zip(self.categories, self.objective.desirability))
@jduerholt (Contributor):

This and the following methods are problematic here, as they are not objective-agnostic: if one assigned a different objective with a different attribute structure, this method would fail. They should be moved into the respective objective class.

@gmancino (Contributor, Author):

This would mean we set up functions within the objective which take in the categories and do some processing. It's still unclear to me why the categories cannot be passed directly to the objective upon instantiation.


def __call__(self, values: pd.Series) -> pd.Series:
return values.map(self.to_dict()).astype(float)
return values.map(self.to_dict())
@jduerholt (Contributor):

This also needs to call a method within the objective.
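A minimal stand-in for the mapping discussed in this thread (plain Python with example labels; the PR's version uses pd.Series.map with .astype(float)):

```python
categories = ("unacceptable", "acceptable", "ideal")  # example labels
desirability = (0.0, 1.0, 1.0)                        # example objective values

# to_dict(): category label -> objective value
lookup = dict(zip(categories, desirability))

def map_values(values):
    # __call__(): pandas equivalent is values.map(lookup).astype(float)
    return [float(lookup[v]) for v in values]
```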

return self.X[i], self.y[i]


class MLPClassifier(nn.Module):
@jduerholt (Contributor):

Can we also reuse code here by basing this and the MLP for regression on the same parent class?

self.layers = nn.Sequential(*layers)

def forward(self, x):
return nn.functional.log_softmax(self.layers(x), dim=1)
@jduerholt (Contributor):

What happens if we have binary classification? Then we should just have one output neuron and no need for a softmax, or?

@jduerholt (Contributor):

And why the log softmax? Is it because of the constraint handling?
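For reference, log_softmax returns log-probabilities, so a torch.exp is needed before the outputs can be interpreted as probabilities; a dependency-free illustration of the same computation:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax: subtract the log-sum-exp.
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]
```

Exponentiating the result recovers a probability vector that sums to 1.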

return nn.functional.log_softmax(self.layers(x), dim=1)


class _MLPClassifierEnsemble(EnsembleModel):
@jduerholt (Contributor):

Also here, code reuse should be possible, or?

current_loss += loss.item()


class MLPClassifierEnsemble(BotorchSurrogate, TrainableSurrogate):
@jduerholt (Contributor):

Also here, code reuse should be possible.

* (
    Z[..., idx : idx + len(objective.desirability)]
    * torch.tensor(objective.desirability).to(**tkwargs)
).sum(-1)
@jduerholt (Contributor):

This is a dot product or? Why not use torch.dot?
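It is a dot product, but torch.dot only accepts 1-D tensors, while Z here carries batch dimensions; the elementwise multiply followed by .sum(-1) is the batched dot product. A plain-Python illustration for a 2-D batch:

```python
def batched_dot(Z, a):
    # Row-wise inner product: equivalent to (Z * a).sum(-1) in torch,
    # which broadcasts over arbitrary leading batch dimensions.
    return [sum(z_i * a_i for z_i, a_i in zip(row, a)) for row in Z]
```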

@jduerholt (Contributor):

But really a genius idea.

@jduerholt (Contributor):

But could it be that you overlooked that the classification model is not returning probabilities but log probabilities? So, a torch.exp is needed, or?

@jduerholt (Contributor):

But maybe, I overlooked a transformation somewhere.

@gmancino (Contributor, Author) commented:

Hi @jduerholt, I have made the corresponding changes to this PR which we discussed. Some tests are currently failing because I changed TCategoryVals to be of type Tuple[str, ...], which makes the categories immutable. Hence, the tests fail in each hard-coded location of categories (which are currently lists). I know we discussed using Tuples instead of Lists for this, so it may be worth changing by hand unless you have some other opinion? There may be additional technical comments on the changes made which need to be addressed first ;)

@jduerholt (Contributor) left a comment

Hi Gabe,

I looked over the first part of the PR and left inline comments. I will look over the rest during the Christmas days.

Main comments are due to missing tests.

Thanks!

Best,

Johannes

"""

w: TWeight = 1.0
desirability: Tuple[float, ...]
@jduerholt (Contributor):

Suggested change
desirability: Tuple[float, ...]
desirability: Annotated[Tuple[Annotated[float, Field(ge=0, le=1)], ...], Field(min_items=2)]

from bofire.data_models.surrogates.trainable import TrainableSurrogate


class MLPClassifierEnsemble(BotorchSurrogate, TrainableSurrogate):
@jduerholt (Contributor):

Base class called MLPEnsemble, with child classes ClassifierMLPEnsemble and RegressionMLPEnsemble.

@@ -165,7 +165,7 @@ def is_categorical(s: pd.Series, categories: List[str]):
TDescriptors = Annotated[List[str], Field(min_items=1)]


TCategoryVals = Annotated[List[str], Field(min_items=2)]
TCategoryVals = Tuple[str, ...]
@jduerholt (Contributor):

Suggested change
TCategoryVals = Tuple[str, ...]
TCategoryVals = Annotated[Tuple[str, ...], min_length=2]

This would be how one should do it in pydantic 2, as soon as the migration is done. Here it introduces problems due to pydantic version 1. For this reason, I would leave it as we had it originally. Note that you do not have to change it everywhere, since pydantic should do the casting. It also looks like the error which you have in the tests comes from some manual changes, since for me it works when I only change TCategoryVals in the main branch as you did above and do not change anything else. But we should do this as soon as we have pydantic 2. Can you add this as a suggestion in the pydantic2 PR?

bofire/data_models/domain/features.py Show resolved Hide resolved
bofire/data_models/domain/features.py Show resolved Hide resolved
Returns:
bool: True if the output type is valid for the surrogate chosen, False otherwise
"""
return True if isinstance(my_type, ContinuousOutput) else False
@jduerholt (Contributor):

Suggested change
return True if isinstance(my_type, ContinuousOutput) else False
return isinstance(my_type, ContinuousOutput)

Returns:
bool: True if the output type is valid for the surrogate chosen, False otherwise
"""
return True if isinstance(my_type, ContinuousOutput) else False
@jduerholt (Contributor):

Suggested change
return True if isinstance(my_type, ContinuousOutput) else False
return isinstance(my_type, ContinuousOutput)

Returns:
bool: True if the output type is valid for the surrogate chosen, False otherwise
"""
return True if isinstance(my_type, ContinuousOutput) else False
@jduerholt (Contributor):

Suggested change
return True if isinstance(my_type, ContinuousOutput) else False
return isinstance(my_type, ContinuousOutput)

Returns:
bool: True if the output type is valid for the surrogate chosen, False otherwise
"""
return True if isinstance(my_type, ContinuousOutput) else False
@jduerholt (Contributor):

Suggested change
return True if isinstance(my_type, ContinuousOutput) else False
return isinstance(my_type, ContinuousOutput)

Returns:
bool: True if the output type is valid for the surrogate chosen, False otherwise
"""
return True if isinstance(my_type, ContinuousOutput) else False
@jduerholt (Contributor):

Suggested change
return True if isinstance(my_type, ContinuousOutput) else False
return isinstance(my_type, ContinuousOutput)

pred_cols = []
sd_cols = []
for featkey in self.outputs.get_keys():
if hasattr(self.outputs.get_by_key(featkey), "categories"):
@gmancino (Contributor, Author):

@jduerholt, there is a type error failing here. How do you recommend checking if the output type is categorical? Line 71 below also fails...

@jduerholt (Contributor):

for feat in self.outputs.get(CategoricalOutput): Doing it like this, you only get categorical output features ;) The same filtering also applies to get_keys. Have a look at the methods in domain/features.py; they are super helpful and I use them a lot, so you should familiarize yourself with them.

@jduerholt (Contributor):

If you only need the keys, use get_keys(CategoricalOutput); it could be that you still have a type error, then add a type: ignore.
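The type-based filtering pattern described here can be illustrated with a minimal mock (simplified stand-ins for BoFire's feature classes, not the real API):

```python
class Output:
    def __init__(self, key):
        self.key = key

class ContinuousOutput(Output):
    pass

class CategoricalOutput(Output):
    def __init__(self, key, categories):
        super().__init__(key)
        self.categories = categories

def get(features, feature_type):
    # Mirrors Outputs.get(...): keep only features of the requested type.
    return [f for f in features if isinstance(f, feature_type)]

def get_keys(features, feature_type):
    # Mirrors Outputs.get_keys(...): same filter, but return the keys.
    return [f.key for f in get(features, feature_type)]
```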

batch_size: int = 10,
n_epoches: int = 200,
lr: float = 1e-4,
shuffle: bool = True,
weight_decay: float = 0.0,
loss_function: Union[
@gmancino (Contributor, Author):

@jduerholt there is a type error here. I tried removing the type from the loss function and also tried nn.Module. I could try nn.modules.loss unless you have any other suggestions?

@jduerholt (Contributor):

Puh, no idea, then just set a type: ignore behind it ...

@jduerholt (Contributor) commented Feb 1, 2024

Regarding your failing test: something went wrong in one of your merges against main. In main, the line with the correct terms looks as follows:

terms = ["1", "x0", "x1", "x2", "x0 ** 2", "x1 ** 2", "x2 ** 2"]

In your branch it looks like this:

terms = ["1", "x0", "x1", "x2", "x0**2", "x1**2", "x2**2"]

In main we updated it at some point to the version looking like "x ** 2", which is the new formulaic format, whereas your branch still has "x**2", which is the old format.

I assume some problems with a merge.

@@ -74,7 +74,7 @@ def test_get_formula_from_string():
assert all(term in np.array(model_formula, dtype=str) for term in terms)

# linear and quadratic
terms = ["1", "x0", "x1", "x2", "x0 ** 2", "x1 ** 2", "x2 ** 2"]
terms = ["1", "x0", "x1", "x2", "x0**2", "x1**2", "x2**2"]
@jduerholt (Contributor):

One can also see here that your branch differs from main in doe/test_utils.py.

@gmancino (Contributor, Author) commented Feb 1, 2024

@jduerholt This is ready for another round of reviews :) Hopefully the last one!

@gmancino (Contributor, Author) left a comment

I should have addressed all previous concerns. The tutorials also show the functionality of the new feature. We can add extra tests, but I believe I added the major ones for now.

bofire/data_models/domain/features.py Show resolved Hide resolved
bofire/data_models/domain/features.py Show resolved Hide resolved
itertools.chain.from_iterable(
[
[f"{key}_pred", f"{key}_sd", f"{key}_des"]
for key in self.get_keys_by_objective(Objective)
[f"{obj.key}_pred", f"{obj.key}_sd", f"{obj.key}_des"]
@gmancino (Contributor, Author):

Done

@jduerholt (Contributor) left a comment

Hi Gabe,

thank you very much. Looks overall good. I added only some minor points. The next step would then be to add tests, or?

Regarding your notebook, let us discuss whether we want to set up an example where the classifier has better performance, or what do you think?

Best,

Johannes

bofire/data_models/features/categorical.py Outdated Show resolved Hide resolved
@@ -359,38 +361,63 @@ class CategoricalOutput(Output):
order_id: ClassVar[int] = 9

categories: TCategoryVals
objective: Annotated[List[Annotated[float, Field(ge=0, le=1)]], Field(min_length=2)]
objective: Optional[AnyCategoricalObjective] = Field(
default_factory=lambda: ConstrainedCategoricalObjective(
@jduerholt (Contributor):

This seems wrong to me: why give an objective with categories "a", "b" as default? One can build a default objective by accessing the categories attribute.

@gmancino (Contributor, Author):

Should we set desirabilities to be an array of True's in this instance?

@gmancino (Contributor, Author):

This actually ties into the questions above (i.e. regarding None types). I am also not sure how to instantiate in this instance, since it seems a bit "chicken-and-egg." Either way, we just need a default case here. Let me know what you prefer.

@jduerholt (Contributor):

Let us do the following: we implement the default generation in the field validator for the CategoricalOutput:

def validate_objective(cls, objective, info):
    if objective is None:
        return ConstrainedCategoricalObjective(
            categories=info.data["categories"],
            desirabilities=[True] * len(info.data["categories"]),
        )
    return objective  # otherwise, do the validation already implemented in the validator

By this we then set as default that everything is allowed, which is the same as ignoring the CategoricalOutput in the optimization.

In the attribute, we then set the following:

objective: Optional[AnyCategoricalObjective] = Field(default=None, validate_default=True)

Does this make sense to you?
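The default proposed here ("everything allowed") can be illustrated without pydantic (hypothetical function shape; the real version is a pydantic field validator on the model):

```python
def validate_objective(objective, categories):
    # If no objective is given, accept every category; this is equivalent
    # to ignoring the CategoricalOutput in the optimization.
    if objective is None:
        return {
            "categories": tuple(categories),
            "desirabilities": [True] * len(categories),
        }
    return objective
```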

@gmancino (Contributor, Author):

Intuitively this makes sense; however, just applying this as is actually causes the following error: FAILED tests/bofire/data_models/serialization/test_deserialization.py::test_outputs_should_be_deserializable[outputs_spec0]

This does not make sense, since these output types are Continuous. Care to comment?

tests/bofire/data_models/specs/features.py Show resolved Hide resolved
return objective

@classmethod
def from_objective(
@jduerholt (Contributor):

Could not find a test for this.

bofire/data_models/features/categorical.py Show resolved Hide resolved
bofire/data_models/surrogates/mlp.py Show resolved Hide resolved
bofire/surrogates/surrogate.py Outdated Show resolved Hide resolved
bofire/surrogates/trainable.py Show resolved Hide resolved
bofire/utils/torch_tools.py Outdated Show resolved Hide resolved
tests/bofire/utils/test_torch_tools.py Show resolved Hide resolved
@jduerholt (Contributor) left a comment

Hi Gabe,

due to lack of time, not a full review, but a few minor things to work on. We are very close to the finish line.

Best,

Johannes

loc=0,
column=f"{feat.key}_pred",
value=predictions.filter(regex=f"{feat.key}(.*)_prob")
.idxmax(1)
@jduerholt (Contributor):

This is duplicated with surrogate.py; could we also write a helper function for this and call it in both places, some method called, for example, postprocess_categorical_predictions? We could place it in surrogate.py. What do you think?

@gmancino (Contributor, Author):

I will add it to the naming_conventions file, unless you can tell me how to access the surrogates within the predictives to call a method specified within the surrogate class?
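A sketch of such a postprocess helper (plain Python; the PR's pandas version does predictions.filter(regex=...).idxmax(1)). It assumes column names of the form {key}_{class}_prob with no underscores inside the class label:

```python
def postprocess_categorical_predictions(prob_columns):
    # prob_columns: {"<key>_<class>_prob": probability, ...}
    # The predicted class is the argmax over the probability columns.
    best = max(prob_columns, key=prob_columns.get)
    return best.rsplit("_", 2)[1]  # recover the class label from the column name
```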

for cat in outputs.get_by_key(featkey).categories # type: ignore
]
sd_cols = sd_cols + [
f"{featkey}_{cat}_sd"
@jduerholt (Contributor):

Nice. Could you also add a test for it?

bofire/utils/naming_conventions.py Outdated Show resolved Hide resolved
bofire/utils/naming_conventions.py Outdated Show resolved Hide resolved
tests/bofire/surrogates/test_mlp.py Show resolved Hide resolved
@gmancino (Contributor, Author) commented:

@jduerholt, I have updated all of the changes previously requested (including tests). Please let me know what you think once you have some time :)

@jduerholt (Contributor) left a comment

Looks very good; the last two things are commented inline. When they are finished, we can merge it in ;)

bofire/data_models/objectives/categorical.py Outdated Show resolved Hide resolved
bofire/data_models/objectives/categorical.py Outdated Show resolved Hide resolved
@gmancino (Contributor, Author) commented:

@jduerholt the requested changes are complete :)

@jduerholt (Contributor) left a comment

Thank you very much for all your efforts and your patience with me!

@jduerholt jduerholt merged commit 7b15308 into main Feb 27, 2024
10 checks passed
@jduerholt jduerholt deleted the classification_surrogates branch February 27, 2024 08:05