-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow mismatched ContactDifference #77
Conversation
Sorry for taking ages to go over this. July was heavily focused in another direction (project that had funding until July 31) and so I'm still catching up on things that got drowned in that (and got drowned before that). Thoughts:
This seems reasonable, but what's the reason for not having the change as the last argument (thus avoiding API break)? It's not a complaint; I'm just curious on the logic.
All of this seems like an effort to guess what the user wants when the answer really isn't clear: In the face of ambiguity, refuse the temptation to guess. The current code involves some fancy attempts to guess, but is there a problem if we just say "we don't know"? I think the only thing that is actually used is the topology, right? Isn't everything else saved simply for provenance? ( For the topology, my suggestion would be to allow an optional Error out if there are incompatible topologies without an override topology. For cases that require more expertise, I have no qualm with requiring the user to explain what is desired. |
Looking at the call signature I thought it would be more likely that people would want to change this behavior compared to changing the raised error class. This might be my bias though.
It is just that these values might not make sense after a safe/load cycle, but I will check if setting to
There is a bit of logic that we need to make sure the maps actually map in a sensible way. I could set it up so that
I kinda expected a clever (the follow is untested, but I would expect it to work) full_traj = md.load(traj_file.ext)
print(full_traj)
sub_atoms = [[1,2,3], [4,5,6], [7,8,12]]
# Needed if you want to remove the full drug residue
all_atoms = list(range(full_traj.top.n_atoms))
d_atoms = [j for i in sub_atoms for j in i]
not_d_atoms = [i for i in all_atoms if i not in d_atoms]
# slice sub residues
sub_trajs = [full_traj.atom_slice(i) for i in sub_atoms]
# Optional if you want to remove the original residue
full_traj = full_traj.atom_slice(not_d_atoms)
# Now we restack the trajectory
for sub_traj in sub_trajs:
full_traj = full_traj.stack(sub_traj)
print(full_traj)
The code in this comment would obviously not work for these situations |
Seems reasonable to me. I think the
I think that getting the sensible mapping is the hard part, and I think that it should be the user's responsibility. This is where the "retopologize" idea comes in, as a way to facilitate this for the user. My concerns here are specifically about point (ii) in dealing with atoms, i.e.,:
The problem is, which atoms can be fairly compared? I'd argue that backbone and probably even most CBs are fine (maybe not with Pro), but beyond that, the mapping rules would get complicated. Try to map heavy atoms in arginine to heavy atoms in phenylalanine. Both have the same number, but I see no physically meaningful correspondence. Atoms in MDTraj have In general, I'd lean toward disabling atom contacts when the atoms differ (and also disabling things like Also, in an unrelated point: separation of concerns. We're talking about the general problem of working with multiple topologies, even if the specific use is for a
This might work, although I think it's more straightforward to just create the desired topology. The "retopologize" idea really should be split into two cases:
|
The main issues with topologies that I would like to solve in this PR: Just to make clear I understood you correctly:
if the user gives a topology to override_topology,
That seems like in-scope for this PR and doable. Out of the scope for this PR (in my opinion):
A last comment: |
Scope of the PR: totally agree. The point is to make this easy to change to something with retopologize if/when that is available (specifically, step iia might be overtaken by the retopologize). It is not to include the retopologize stuff in this PR -- way too much. One change in the details: part iib, don't disable residue contacts; just error. If we can't do atoms contacts and we can't do residue contacts, immediately tell the user that this isn't going to work.
Sorry -- is this about things needed for the retopologizing stuff or about checking equality here? The hash for a residue tells you what MDTraj considers as defining the residue. Unfortunately, that includes the resid ( |
retopologizing (specifically case 1 where you named I think I have enough info for this PR, I will ping you when it is ready for review |
Glancing again at this: Many of these changes can go outside the That also helps to tell CodeClimate to shut up -- you have more freedom to structure things. |
Will do, just had an hour left to do some coding, so wanted to at least implement what we discussed before. Hope to have a couple left over hours later this week to round it up |
Alright, the current wall I am bashing my head against:
So far the only maybe viable solution I could find was to maybe implement something [like this](https://code.activestate.com/recipes/408713-late-binding-properties-allowing-subclasses-to-ove/. But I first wanted to check in if @dwhswenson might have another idea to circumvent this |
This seems more complicated than it originally looked (if I understand correctly, you need to dynamically monkey-patch the object while you're initializing it). Monkey-patching on initialization is a red flag that what you really want is a subclass. As is, this subclass would have to explicitly use Again, it would be good for as much of the mismatch-related code as possible to be outside of the class, so it can be reused. Probably as bare functions in a separate file. In the long run, it would be useful to make this into a On another note, if you add an example of this for the docs, please make it a separate notebook. I think the main notebook has gotten a bit long and isn't as straight-to-the-point as I'd like, so I'm working on rewriting it as several notebooks. |
I do agree that this is better suited in a subclass. There will still be some monkeypatching as we don't know before initialisation if But I might be able to work around the issue; current plan will be overriding all the functionality with the errors in the class definition, and then monkeypatching them at init to |
Maybe multiple subclasses |
@dwhswenson Just to summarize what this does:
How do we define two things as "the same":
When does
If this is all ok with you I don't mind writing an extra example notebook to illustrate this purpose.
Most of the |
This one should now be ready for a review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only glanced over it so far; first set of changes is to move ContactDifference._check_compatibility
and everything in the call tree of that (_fix_self
, _check_topology
, _disable_*
) into a separate file; maybe map_compatibility.py
. They only need self
to find each other as methods, and I could imagine these methods will be reused in other objects that combine two ContactObject
s.
The map_compatiblity.check_compatibility()
(or fix_parameters
? that seems more appropriate; perhaps for the filename too) method could then take allow_mismatched_residues
, allow_mismatched_atoms
(or a shorter name if you can come up with one) to be passed from the class on initialization. These could even be private class variables so __init__
doesn't need to be repeated.
That should actually get rid of the of disable_*
methods, I think. You can assemble the error string in check_topology
from some string patterns that can be module-level private constants.
Another approach, if you don't like passing the allow_mismatched_atoms
type variables around, is to wrap this into a ParameterFixer
class, which stores those on initialization.
Also, I didn't like the idea of overriding _check_compatibility
in ContactDifference
to have different inputs/outputs. So this also solves that.
More detailed review after these changes, but at first glance the basic idea looks good.
@dwhswenson This should cover your initial feedback, please have another look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Questions about the notebook: mainly, I'm confused by some if the output.
-
In the truncated protein, I feel like this is not the expected behavior. It seems to me that missing parts of the map should be treated like a NaN. That is, if you don't have them in one map, they shouldn't show results in the final map (because you can't actually say anything about the difference). That would make this example an empty map; maybe better to use the trajectory/first frame as in the main example?
-
I also don't understand the results when residue info is missing. Why does this all show along the diagonal? And again, shouldn't this actually be an empty map?
Otherwise, code looks good. A few small changes and clarifications requested, and the err=None
idea is just a thought; if you like it, implement it; if not, no worries.
Let's start with the easy question
This is due to the
So you would restrict the difference to the atoms (and residues) that exist in both maps? |
Co-authored-by: David W.H. Swenson <dwhs@hyperblazer.net>
Aha... makes sense. Could you add something to the notebook explaining that?
Yes, at least I think this makes more sense for
|
@dwhswenson Just to summarize what this does:
How do we define two things as "the same":
When does
On top of that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One required change (just renaming parameters now so we don't break API by doing it later) and one optional change. Otherwise, looks good!
Should solve: #60 and #73
Current summary is descibed in this comment
This PR: 1. [break of nonpublic api] adds a `return_failed=False` option to `_check_compatibility` if set to `True` it returns the failed attributes instead of raising an error, or an empty set instead of `True` if all attributes are equal. This breaks API because it was not added as the last argument. 2. Add the option `allow_incompatable=False` to `ContactDifference.__init__()` if `True` it will try to fix differences in the following ways: - `query`/`haystack`: We join the sets and assume they are ok - `cutoff`: we take the bigger cutoff - `n_neighbors_ignored`: we take the smaller number-
- if all of these atoms are equal, we assume the residues to also be equal and return the default topology(with the most atoms)
- if both positive and negative have the same atom indices involved in the contactmap but the atoms are not equal:
- If neither is the case, we disable
- If the residue indices are equal between the atoms involved in both quaries and haystacks: We take the (updated) default topology and join any residue name that is not equal
- else we disable
topology, main idea: we only look at the atoms involved in the quaries and haystacks of both
positive
andnegative
:We default to the topology with the most atoms and assume it also contains the most residues
For the atoms:
We take default topology with the most atoms, but join the names of the atoms that differ
atom_contacts
For the residues (if case i. was not true for the atoms):
.residue_contacts
@dwhswenson do you agree with these 'fixes'? (I would like to have some input here before I develop tests and examples)