Adapt ZairaChem to regression tasks #31

Open
4 tasks
miquelduranfrigola opened this issue Dec 5, 2023 · 3 comments
@miquelduranfrigola
Member

miquelduranfrigola commented Dec 5, 2023

Motivation

At the moment, ZairaChem only works with binary classification tasks. In real-world scenarios, however, we often encounter regression tasks, for example predicting IC50 or pChEMBL values. We would like to extend ZairaChem to support regression tasks.

Suggested approach

We see two possible approaches to the problem:

  • Extend ZairaChem with AutoML regression modules: The natural approach would be to extend ZairaChem with AutoML regression modules, like the ones provided by FLAML, AutoGluon, etc. While this sounds very reasonable, it may present additional challenges, such as new validation metrics, harmonization of the y variable, etc.
  • Divide the regression problem into n classification tasks: An alternative solution would be to simply divide the regression problem into n classification tasks, for example by cutting at different percentiles. Then, for each percentile, we would have a classification problem for which we could use the vanilla ZairaChem. At the end of the procedure, we could train a meta-regressor on the predicted probabilities at each cutoff. This approach would obviously be much slower, but it may be robust and easier to implement.
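
The second option can be sketched outside ZairaChem as follows. This is only an illustration under stated assumptions: the data are synthetic, scikit-learn's `LogisticRegression` stands in for a vanilla ZairaChem run at each cutoff, and the percentile list is an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 400 "molecules", 8 descriptors, continuous y.
X = rng.normal(size=(400, 8))
y = X @ rng.normal(size=8) + 0.3 * rng.normal(size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 1. Cut the continuous target at several percentiles of the training values.
percentiles = [10, 25, 50, 75, 90]  # arbitrary choice for illustration
cutoffs = np.percentile(y_train, percentiles)

# 2. Train one binary classifier per cutoff.
#    LogisticRegression is a placeholder for a ZairaChem classification run.
train_probs = np.zeros((len(y_train), len(cutoffs)))
test_probs = np.zeros((len(y_test), len(cutoffs)))
for j, cutoff in enumerate(cutoffs):
    labels = (y_train > cutoff).astype(int)
    clf = LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities, so the meta-regressor is not trained on
    # predictions made for samples the classifier has already seen.
    train_probs[:, j] = cross_val_predict(
        clf, X_train, labels, cv=5, method="predict_proba"
    )[:, 1]
    test_probs[:, j] = clf.fit(X_train, labels).predict_proba(X_test)[:, 1]

# 3. Meta-regressor: map the per-cutoff probabilities back to a continuous value.
meta = LinearRegression().fit(train_probs, y_train)
y_pred = meta.predict(test_probs)

print("Pearson r on held-out data:", np.corrcoef(y_test, y_pred)[0, 1])
```

Each cutoff requires a full classifier fit, so the cost grows linearly with the number of cutoffs, which is the computational concern raised above.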

It is not clear yet which approach is best. I am personally inclined towards the second option, although it may end up being too computationally demanding. In the roadmap below, I assume we take this option.

Roadmap

  • Harmonize y data for a given regression task. Sometimes, regression values are awkwardly distributed and we need to clean them up prior to training. For example, we may want to log-transform values, or power-transform them, or simply remove outliers. While this has been partially implemented in ZairaChem already, a production-ready module is not available yet.
  • Parallelize or, at least, organize multiple ZairaChem runs (for each binary classification cutoff) in a centralized manner, including a shared folder.
  • Write a meta-regressor that takes the output probabilities at each cutoff as input features and returns a regression value. The architecture of the meta-regressor should be as simple as possible, ideally a linear regression or an SVR.
  • Extend default ZairaChem plots to illustrate performance in a regression scenario.
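
As a rough illustration of the y-harmonization step, the sketch below log-transforms IC50 values and flags outliers. It is not the existing ZairaChem implementation: the function name `harmonize_ic50`, the assumption that inputs are in nM, and the 1.5 × IQR outlier rule are all hypothetical choices made for this example.

```python
import numpy as np

def harmonize_ic50(ic50_nm):
    """Convert IC50 values (assumed to be in nM) to pIC50 and flag outliers.

    Returns (pic50, keep), where `keep` is a boolean mask that is False for
    non-positive or missing inputs and for values outside 1.5 x IQR.
    """
    ic50_nm = np.asarray(ic50_nm, dtype=float)
    valid = np.isfinite(ic50_nm) & (ic50_nm > 0)

    # pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)
    pic50 = np.full(ic50_nm.shape, np.nan)
    pic50[valid] = 9.0 - np.log10(ic50_nm[valid])

    # Simple IQR rule on the transformed values (an assumption, not a standard).
    q1, q3 = np.percentile(pic50[valid], [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    keep = valid.copy()
    keep[valid] = (pic50[valid] >= low) & (pic50[valid] <= high)
    return pic50, keep

# Toy values: six plausible measurements, one extreme value, one zero, one missing.
raw = [50, 80, 100, 120, 150, 200, 1e-6, 0.0, np.nan]
pic50, keep = harmonize_ic50(raw)
y_clean = pic50[keep]
```

Returning a mask rather than a filtered array keeps the rows aligned with the input molecules, so the same mask can be applied to the descriptors.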
@HellenNamulinda
Collaborator

@miquelduranfrigola,
While the 2nd option might be 'easier to implement', I doubt that it is the optimal solution.

  • Dividing continuous outcomes/values into classes inherently sacrifices precision, potentially leading to less accurate predictions.
  • As you also highlighted, the multiple classification models will potentially increase complexity and computational overhead (we wouldn't want to increase the computational cost for users).
  • Won't classifying continuous values require task-specific decisions for cutoff points? Will the same cutoffs be adaptable to different regression scenarios?
  • For the meta-regressor, will it be using the same validation metrics for classification, or new metrics?

Classification might be a temporary workaround.
So, directly embracing AutoML regression modules would be the ideal approach to ensure optimal accuracy and alignment with the continuous nature of regression problems.

@miquelduranfrigola
Member Author

Hello @HellenNamulinda - thanks for your insightful comments. I completely agree with your points.
This is certainly something we can discuss and it will require some thinking. We are faced here with a cost-benefit problem, i.e. your project has limited time, and we need to do our best to produce an acceptable outcome. Let's evaluate the roadmap together. If we find out that regression tasks are feasible, then I am more than open to trying this avenue. For now, for the purposes of the project proposal, can we mention that both options will be considered? Do you agree with this approach?

@GemmaTuron
Member

We will start with some light tests outside ZairaChem with @JHlozek: we will select one model we know well and compare a normal regression with a classifier-based surrogate regression. We will then decide which approach to take.
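
For such a comparison, both approaches would need to be scored with regression metrics on the same held-out set. A minimal sketch, assuming RMSE, MAE and R² are the metrics of interest; the helper name `regression_report` is hypothetical and the numbers are toy values for illustration, not results:

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return RMSE, MAE and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
        "r2": float(1.0 - ss_res / ss_tot),
    }

# Toy held-out values and two hypothetical sets of predictions.
y_true = [5.0, 6.0, 7.0, 8.0]
direct_pred = [5.1, 5.9, 7.2, 7.8]     # e.g. from a direct regressor
surrogate_pred = [5.5, 5.5, 7.5, 7.5]  # e.g. from the classifier-based surrogate

direct = regression_report(y_true, direct_pred)
surrogate = regression_report(y_true, surrogate_pred)
for name, report in [("direct", direct), ("surrogate", surrogate)]:
    print(name, report)
```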
