Model interpretability roadmap #35
Progress Updates
Next
Thanks @HellenNamulinda - this is useful. We did not discuss much about
Hello @miquelduranfrigola, I've compared XGBoost's performance with default parameters against parameters optimized by Optuna. Surprisingly, the default parameters seem to yield better results. It's worth noting that the parameters optimized by Optuna can vary from study to study, introducing some uncertainty in the results. I may need to adjust the search space to align more closely with the default parameters. Additionally, I've observed that training a CatBoost model is consistently slower, which prolongs the optimization process with Optuna. However, I believe Optuna is still valuable; it's essential to carefully define the search parameters to achieve optimal results.
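As a reference, a minimal sketch of how such an Optuna study for XGBoost might be set up, with the search space centred on XGBoost's defaults as suggested above; the toy dataset, parameter names, and ranges are illustrative assumptions, not the project's actual tuning code:

```python
# Illustrative Optuna study for XGBoost regression; the search bounds are
# deliberately centred on XGBoost's defaults (eta=0.3, max_depth=6, subsample=1.0).
import optuna
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=50, noise=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 0.05, 0.5, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 9),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
    }
    model = xgb.XGBRegressor(**params, random_state=0)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_val, model.predict(X_val))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```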
Thanks @HellenNamulinda, this is useful. I agree we need to use Optuna. We'll have to play a bit with the search space, then, and perhaps increase the number of iterations.
Progress Updates
Next
Thanks @HellenNamulinda, all next steps sound good to me.
From the meetings, @miquelduranfrigola, we agreed to experiment and compare performance with RDKit descriptors without feature selection (using all the descriptors). Note: all the experiments were done using CatBoost with default parameters. Trying zero-shot, XGBoost's performance on the pf_3d7_ic50 data improved.
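A rough sketch of computing the full RDKit descriptor set with no feature selection, as mentioned above; this uses RDKit directly and is not xai4chem's own featurization code, and the example molecule is arbitrary:

```python
# All RDKit descriptors for one molecule, with no feature selection.
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator

descriptor_names = [name for name, _fn in Descriptors._descList]
calculator = MolecularDescriptorCalculator(descriptor_names)

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example
values = calculator.CalcDescriptors(mol)
print(f"{len(descriptor_names)} descriptors computed")
```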
From last week: also, use zero-shot AutoML (ersilia-os/xai4chem@f5d5ad8). However, FLAML's zero-shot only supports XGBoost and not CatBoost.
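For reference, a hedged sketch of FLAML's zero-shot mode, assuming (per the FLAML documentation) that flaml.default exposes drop-in estimators such as XGBRegressor whose hyperparameters are chosen from the training data without any search; the data handling here is illustrative:

```python
# Zero-shot AutoML with FLAML: no tuning loop, hyperparameters are picked at fit time.
from flaml.default import XGBRegressor  # assumed import path per FLAML's zero-shot docs
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBRegressor()  # data-dependent defaults, no search
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```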
To be able to interpret other trained models besides the regression models developed using xai4chem, it was best to have explain_model as a separate module (independent of the regressor). With the explain module, interpretability plots can be generated even for trained classification models.
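A minimal sketch of what such a standalone explain module could look like (the function and argument names are illustrative, not xai4chem's actual API); since it only needs a fitted tree model and a feature matrix, it works for regressors and classifiers alike:

```python
# Hypothetical standalone explain module built on SHAP's TreeExplainer.
import shap

def explain_model(model, X, feature_names=None):
    """Return SHAP values for any fitted tree-based model (CatBoost, XGBoost, ...)."""
    explainer = shap.TreeExplainer(model)
    shap_values = explainer(X)  # a shap.Explanation object
    if feature_names is not None:
        shap_values.feature_names = feature_names
    return shap_values
```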
Hello @miquelduranfrigola, in our pipeline we choose features from any of the three descriptors (small: datamol, mid-size: RDKit, large: Mordred) or the count-based Morgan fingerprints. For interpretability, we are currently saving three plots: a bar plot, a beeswarm plot, and a waterfall plot for the first data sample (this can be generated for other samples as well). All the other usage details are documented in the README. Some pending concerns: the benchmark and the MMV data. And this brings us to combining descriptors and fingerprints?
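A self-contained sketch of saving the three plots mentioned above (bar, beeswarm, and a waterfall for the first sample); the toy model, data, and file names are assumptions:

```python
# Generate and save the three SHAP plots for a toy XGBoost model.
import matplotlib.pyplot as plt
import shap
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=20, random_state=0)
model = xgb.XGBRegressor(random_state=0).fit(X, y)

shap_values = shap.TreeExplainer(model)(X)

for name, plot in [
    ("bar", lambda: shap.plots.bar(shap_values, show=False)),
    ("beeswarm", lambda: shap.plots.beeswarm(shap_values, show=False)),
    ("waterfall", lambda: shap.plots.waterfall(shap_values[0], show=False)),  # first sample
]:
    plot()
    plt.savefig(f"shap_{name}.png", bbox_inches="tight")
    plt.close()
```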
Thanks @HellenNamulinda — very informative. Let's first close the pending concerns, and then we will look into whether or not to blend descriptors and fingerprints.
@miquelduranfrigola, this week I'm working on mapping interpretations (Shapley values) onto chemical structures for fingerprint features.
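One possible way to do this for Morgan bits, sketched with plain RDKit: the bitInfo record tells which atom environments set each bit, and per-atom weights can then be drawn as a similarity map. The per-bit SHAP values, the rule of summing them over each bit's centre atoms, and the example molecule are assumptions, and a recent RDKit is assumed where GetSimilarityMapFromWeights accepts a draw2d object:

```python
# Map per-bit SHAP values back onto atoms via Morgan bitInfo, then draw them.
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import SimilarityMaps, rdMolDraw2D

mol = Chem.MolFromSmiles("CCOc1ccc2nc(S(N)(=O)=O)sc2c1")
bit_info = {}
AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048, bitInfo=bit_info)

# shap_per_bit: {bit index -> SHAP value} for this molecule (placeholder values here)
shap_per_bit = {bit: 0.1 for bit in bit_info}

atom_weights = defaultdict(float)
for bit, environments in bit_info.items():
    for center_atom, _radius in environments:
        atom_weights[center_atom] += shap_per_bit.get(bit, 0.0)

weights = [atom_weights[i] for i in range(mol.GetNumAtoms())]
drawer = rdMolDraw2D.MolDraw2DCairo(400, 400)
SimilarityMaps.GetSimilarityMapFromWeights(mol, weights, draw2d=drawer)
drawer.FinishDrawing()
with open("shap_atom_map.png", "wb") as fh:
    fh.write(drawer.GetDrawingText())
```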
Hello @miquelduranfrigola, over the past weeks we have been modelling the MMV data (both the small set (LDH assay) and the large set (luminescence assay)), first as a regression problem and then as a classification problem.
Our initial results showed that Morgan fingerprints with a reduced number of features (100) did not significantly compromise the model's performance, giving a better R2 score and mean absolute error (MAE). This indicated that a smaller, more interpretable set of features could be used without losing much predictive power. However, despite using the same descriptors and feature selections, the R2 scores were generally low, indicating that our models could not adequately explain the variability in the data. Because of this, we tried classification instead of regression and compared classification performance metrics.
We experimented with a cutoff of 30% (and 40%). In both cases, the dataset was extremely imbalanced. The default prediction threshold (0.5) generally provided high precision but low recall, indicating that while the models were good at identifying active compounds, they missed many true positives. Regarding adapting regression to ZairaChem, you had mentioned that the second option might be used if the classification performance stands out.
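For the imbalanced classification setting, a small, self-contained illustration (toy data and model, not the MMV models) of moving the decision threshold below the default 0.5 to trade precision for recall:

```python
# Pick a decision threshold from the precision-recall curve instead of using 0.5.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best_threshold = thresholds[np.argmax(f1[:-1])]  # threshold maximising F1
print(f"best threshold: {best_threshold:.2f}, F1: {f1[:-1].max():.3f}")
```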
Hi @HellenNamulinda, this is a great summary. Before including it in ZairaChem, let's focus on packaging xai4chem nicely, including updating the README file if necessary. To me, the most important step now is to get a nice report at the end of the run. The more I think about it, the more I realize that we may want to run interpretability from multiple scopes, for example physicochemistry, fingerprints, etc., in independent runs. Can you please list here which descriptors are fully implemented already? Thanks!
Hello @miquelduranfrigola, as before, xai4chem has 3 descriptor types: datamol (small), RDKit (medium size), and Mordred (large size). For fingerprints, 2 options are currently supported (RDKit or Morgan). There is a slight difference in how these two fingerprint types are mapped back to molecules, and this has already been implemented. As for the nice report, would a canvas be a good way to present it? There are, however, several interpretability plots generated.
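As a reference point for the two fingerprint options, a sketch using plain RDKit (the radius, sizes, and example molecule are assumptions, and this is not xai4chem's own featurizer): count-based Morgan fingerprints and the path-based RDKit fingerprint:

```python
# The two fingerprint types: count-based Morgan and the RDKit path-based fingerprint.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example

morgan_counts = AllChem.GetHashedMorganFingerprint(mol, 2, nBits=2048)  # count-based
rdkit_fp = Chem.RDKFingerprint(mol, fpSize=2048)                        # bit vector

print(len(morgan_counts.GetNonzeroElements()), "non-zero Morgan counts")
print(rdkit_fp.GetNumOnBits(), "bits set in the RDKit fingerprint")
```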
Hi @HellenNamulinda this sounds good to me. Thanks for the update.
Adding an interpretability module to ZairaChem
Background
This project is related to @HellenNamulinda's MSc thesis at Makerere University. The thesis is co-supervised by Dr. Joyce Nakatumba-Nabende. At the moment, ZairaChem does not have any explainable AI (XAI) capabilities. The goal of this project is to develop an automated tool for model interpretability that can be incorporated into ZairaChem. While there are many approaches for chemistry, here we will focus on the following:
Objectives
Steps
FAQ
Where do we create issues?
Most issues related to this work should be created in the xai4chem repository. When we reach the point of integration with ZairaChem, we can create corresponding issues there.
Is there a more comprehensive description of the project available?
Yes. This is part of @HellenNamulinda's MSc project and she is writing a thesis accordingly. A project proposal document is already available.