-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make SpeciesFeature
implementations, tests
#72
Comments
SpeciesFeature
testsSpeciesFeature
implementations, tests
Okay so @DhairyaLGandhi and I finally have an occasion to actually do this (for that most noblest of goals of completely trouncing deepchem's performance using AGN), and we had a chat about it earlier and realized some things. @thazhemadam, would love your input... At the moment, the (implicit) presumption is that the workflow looks like (presuming that each step has access only to the outputs of the previous one) structure (e.g. SMILES, CIF, etc...) --> some intermediate structural representation such as However, for the case of structure (SMILES) --> GraphMol --> compute features --> convert to AtomGraph and store in and the API as presently designed does not easily accommodate this. I could see two generic forms of solution to this:
A related question (especially if we follow route 1 above) is whether the type hierarchy of feature descriptors as we've constructed it is actually adding anything. In the preliminary solution @DhairyaLGandhi had sketched out, he basically defined new structs for each feature (hybridization, aromaticity, valence, etc. etc.) without ever making use of the If we were to do this via Finally, the other thing I like about option 2 is that it maintains relative ease in changing around featurizations after the fact (e.g. adding/removing FD's) without completely starting from scratch. If you carry along the structure representation, you have everything you need to compute additional features, etc. Okay, that was a lot, but I think writing it up helped me to clarify some things in my head about possible ways to proceed. But please LMK your thoughts! If we have general consensus on an approach, I can make a branch and start sketching things out. |
I think the design goal of maintaining the "decodability" present in FeaturizedAtoms is good (I like it too). What would we need to preserve it and get a bit more flexibility? Ideally AtomGraph constructed from any source would be equivalent, but storing the source would in effect delineate them. There are pros and cons to this. It makes decoding easier, but adds overheads at all times (I'm unsure if that's worth it tbh, but we wouldn't know that for some time yet I think) |
Just to clarify, the decodability isn't an issue with either of the routes outlined above, because that's just a matter of "inverting" the one-hot encoder, which we can already do. The other kind of separation of concerns that your approach didn't have was the separation between the calculating of the value of the feature (e.g. calling the MolecularGraph function) and the actual encoding of that value (e.g. calling the onehot-encoder). I think it's important to maintain this as well, because it allows experimentation with different parameters of encoding (e.g. different numbers of bins for continuous-valued features) or even different encoding approaches (e.g. direct numerical encoding, something I've wanted to experiment with). Those things being said, I definitely think you're asking the right kinds of questions. Ultimately, I think both routes also can offer the flexibility we need, but with different balances between fixed and variable costs of implementation: option 1 = low fixed cost, high variable cost (fairly easy to get up and running, but more work required each time we want to encode a different feature) In general in cases like this, I tend to lean towards the higher fixed cost, lower variable cost option...of course, as you also point out, option 2 also entails an increase in overhead, but it's unclear how much or if that matters. I think overhead that lives purely within ChemistryFeaturization is generally not an issue, but anything that might potentially be part of a model (as we're currently playing around with in the AD stuff) obviously has to be performant. However, it's also true that storing the structure could have benefits in the AD case, since it will make it simpler to propagate "all the way" back... |
To clarify, the design was to separate out the concerns and enforce it via API, the encoding etc are details that can (and should) be made part of the complete system. What happens inside the function is something that is completely up to us after all. Consider another design where everything is lazy and the featurizations only output a nested thunk which is materialized at the end. In this case, the encoding can happen efficiently at the end. From an AD pov, I would suggest doing what makes sense for the problem. Memory overhead can catch us in AD sometimes, but hopefully we don't have too much memory to hold on to. |
I agree. This would also break our logical flow requirement of AtomGraphs being the type that is primarily involved in feature generation.
I feel like the logically generic and deterministic guarantee we provide here right now might be worth holding on to.
That's a good question. I think
Agreed. Overall, I think Option 2. might be the better option. If the original representation is stored, then our current workflow wouldn't break, and we can simply apply TLDR; |
Thanks (as always) for a thoughtful and detailed perspective. I'll make a branch and start sketching this out and make a draft PR once I think I have something vaguely sensible, and we can comment/iterate from there. |
I also don't think the amount of memory we would be dealing with would be too high, and if storing the initial representation also would benefit in AD, then the tradeoff might be worth it. But, if cumulative cost proves that isn't, I'd assume we could just as easily extract the "pure" AtomGraph representation (which is basically the same structure, without the "original" GraphMol/Crystal, etc representation), create a FeaturizedAtoms object using it and use that in the models instead. |
I may be missing something but what is the benefit to AD? Holding on to memory is one of the things to avoid in AD usually, unless we can avoid computing something expensive. |
I just meant that it could be cleaner to propagate the forces back to the original structure if it's attached to the same object. Not that it makes the AD itself easier/faster. |
This may end up going along with Weave stuff since most of the actual examples of
SpeciesFeature
that we have atm are rdkit things, but filing it as a separate issue for now.The text was updated successfully, but these errors were encountered: