Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow mismatched ContactDifference #77

Merged
merged 25 commits into from
Oct 21, 2020

Conversation

sroet
Copy link
Collaborator

@sroet sroet commented Jul 8, 2020

Should solve: #60 and #73

Current summary is descibed in this comment

This PR: 1. [break of nonpublic api] adds a `return_failed=False` option to `_check_compatibility` if set to `True` it returns the failed attributes instead of raising an error, or an empty set instead of `True` if all attributes are equal. This breaks API because it was not added as the last argument. 2. Add the option `allow_incompatable=False` to `ContactDifference.__init__()` if `True` it will try to fix differences in the following ways: - `query`/`haystack`: We join the sets and assume they are ok - `cutoff`: we take the bigger cutoff - `n_neighbors_ignored`: we take the smaller number
  • topology, main idea: we only look at the atoms involved in the quaries and haystacks of both positive and negative :
    We default to the topology with the most atoms and assume it also contains the most residues
    For the atoms:

    1. if all of these atoms are equal, we assume the residues to also be equal and return the default topology(with the most atoms)
    2. if both positive and negative have the same atom indices involved in the contactmap but the atoms are not equal:
      We take default topology with the most atoms, but join the names of the atoms that differ
    3. If neither is the case, we disable atom_contacts

    For the residues (if case i. was not true for the atoms):

    1. If the residue indices are equal between the atoms involved in both quaries and haystacks: We take the (updated) default topology and join any residue name that is not equal
    2. else we disable .residue_contacts

@dwhswenson do you agree with these 'fixes'? (I would like to have some input here before I develop tests and examples)

@dwhswenson
Copy link
Owner

Sorry for taking ages to go over this. July was heavily focused in another direction (project that had funding until July 31) and so I'm still catching up on things that got drowned in that (and got drowned before that).

Thoughts:

[break of nonpublic api] adds a return_failed=False option to _check_compatibility
if set to True it returns the failed attributes instead of raising an error, or an empty set instead of True if all attributes are equal.
This breaks API because it was not added as the last argument.

This seems reasonable, but what's the reason for not having the change as the last argument (thus avoiding API break)? It's not a complaint; I'm just curious on the logic.

Add the option allow_incompatable=False to ContactDifference.__init__()
if True it will try to fix differences in the following ways:

All of this seems like an effort to guess what the user wants when the answer really isn't clear: In the face of ambiguity, refuse the temptation to guess. The current code involves some fancy attempts to guess, but is there a problem if we just say "we don't know"? I think the only thing that is actually used is the topology, right? Isn't everything else saved simply for provenance? (cutoff, query, haystack...) If they conflict, and the user has explicitly said ignore conflicts, I think everything except the topology can be set to None. (Does that raise any issues you see?) Original values are still accessible (for provenance) in positive.foo and negative.foo.

For the topology, my suggestion would be to allow an optional topology argument to __init__, or maybe call it override_topology. In the long run, I'm interested in the idea of a retopologize function, either here or (better) in MDTraj. The idea comes from my work on drug binding. Many drugs are larger than a single amino acid, but treated as a single residue in PDB. So residue-residue contacts make more sense if you can split the drug into subunits (and sometimes how to do this is obvious). But the same idea could be used to create a fake Q/L residue, as desired in #60, and could even group residues together for more complicated (e.g., deletion/insertion) mutations.

Error out if there are incompatible topologies without an override topology. For cases that require more expertise, I have no qualm with requiring the user to explain what is desired.

@sroet
Copy link
Collaborator Author

sroet commented Aug 13, 2020

This seems reasonable, but what's the reason for not having the change as the last argument (thus avoiding API break)? It's not a complaint; I'm just curious on the logic.

Looking at the call signature I thought it would be more likely that people would want to change this behavior compared to changing the raised error class. This might be my bias though.

All of this seems like an effort to guess what the user wants when the answer really isn't clear: In the face of ambiguity, refuse the temptation to guess. The current code involves some fancy attempts to guess, but is there a problem if we just say "we don't know"? I think the only thing that is actually used is the topology, right? Isn't everything else saved simply for provenance? (cutoff, query, haystack...) If they conflict, and the user has explicitly said ignore conflicts, I think everything except the topology can be set to None. (Does that raise any issues you see?) Original values are still accessible (for provenance) in positive.foo and negative.foo.

It is just that these values might not make sense after a safe/load cycle, but I will check if setting to None raises any issues for that (and add tests)

For the topology, my suggestion would be to allow an optional topology argument to init, or maybe call it override_topology.

There is a bit of logic that we need to make sure the maps actually map in a sensible way. I could set it up so that True uses the internal function as described, and giving it a topology just does basic checks (makes sure all residue and atom indices of the map are actually present in the topology)

The idea comes from my work on drug binding. Many drugs are larger than a single amino acid, but treated as a single residue in PDB. So residue-residue contacts make more sense if you can split the drug into subunits (and sometimes how to do this is obvious).

I kinda expected a clever atom_slice and stack might work?

(the follow is untested, but I would expect it to work)

full_traj = md.load(traj_file.ext)
print(full_traj)
sub_atoms = [[1,2,3], [4,5,6], [7,8,12]]

# Needed if you want to remove the full drug residue
all_atoms = list(range(full_traj.top.n_atoms))
d_atoms = [j for i in sub_atoms for j in i]
not_d_atoms = [i for i in all_atoms if i not in d_atoms]

# slice sub residues
sub_trajs =  [full_traj.atom_slice(i) for i in sub_atoms]

# Optional if you want to remove the original residue
full_traj = full_traj.atom_slice(not_d_atoms)

# Now we restack the trajectory
for sub_traj in sub_trajs:
    full_traj = full_traj.stack(sub_traj)
print(full_traj)

But the same idea could be used to create a fake Q/L residue, as desired in #60, and could even group residues together for more complicated (e.g., deletion/insertion) mutations.

The code in this comment would obviously not work for these situations

@dwhswenson
Copy link
Owner

Looking at the call signature I thought it would be more likely that people would want to change this behavior compared to changing the raised error class.

Seems reasonable to me. I think the err argument may be a testing artifact that would be better handled by a mock.patch, although I might have had a good reason for it.

There is a bit of logic that we need to make sure the maps actually map in a sensible way. I could set it up so that True uses the internal function as described, and giving it a topology just does basic checks (makes sure all residue and atom indices of the map are actually present in the topology)

I think that getting the sensible mapping is the hard part, and I think that it should be the user's responsibility. This is where the "retopologize" idea comes in, as a way to facilitate this for the user. My concerns here are specifically about point (ii) in dealing with atoms, i.e.,:

if both positive and negative have the same atom indices involved in the contactmap but the atoms are not equal:
We take default topology with the most atoms, but join the names of the atoms that differ

The problem is, which atoms can be fairly compared? I'd argue that backbone and probably even most CBs are fine (maybe not with Pro), but beyond that, the mapping rules would get complicated. Try to map heavy atoms in arginine to heavy atoms in phenylalanine. Both have the same number, but I see no physically meaningful correspondence.

Atoms in MDTraj have name as well as symbol, which could allow us to guarantee mappings of backbone atoms. But I don't think that requiring the same name will necessarily make sense in general: Imagine someone who wanted to compare contacts in an Asp-Glu mutation. CG in Asp might naturally map to CD in Glu. All of this points to requiring user intervention (or something from us that facilitates user intervention, i.e., retopologizing).

In general, I'd lean toward disabling atom contacts when the atoms differ (and also disabling things like most_common_atoms_for_residue). This can be re-enabled if there's ever a MultiTopology (see below).

Also, in an unrelated point: separation of concerns. We're talking about the general problem of working with multiple topologies, even if the specific use is for a ContactDifference. Move the topology-fixing code out of ContactDifference, perhaps into its own module.

I kinda expected a clever atom_slice and stack might work?

This might work, although I think it's more straightforward to just create the desired topology. The "retopologize" idea really should be split into two cases:

  1. Rewriting how the topology assigns atoms to residues. This could either split a residue into subresidues (same chain, same resSeq, multiple resids) as with large ligand, or it could combine multiple residues into a larger residue (multiple resSeq for a single resid?), as with coarse graining. This would have the same number of atoms in either representation, and the atoms would represent the same thing.

  2. MultiTopology where there may be different numbers of atoms. Example: take a sequence alignment as a mapping, thus allowing any kinds of mutations (point, insertion, deletion) to be compared (needs a dummy residue, but that's not hard). This could provide/allow the user to provide the physically meaningful mapping between atoms.

@sroet
Copy link
Collaborator Author

sroet commented Aug 13, 2020

The problem is, which atoms can be fairly compared? I'd argue that backbone and probably even most CBs are fine (maybe not with Pro), but beyond that, the mapping rules would get complicated. Try to map heavy atoms in arginine to heavy atoms in phenylalanine. Both have the same number, but I see no physically meaningful correspondence.

The main issues with topologies that I would like to solve in this PR:
if the topologies differ, but not in a meaningful way for the contact map (different number of solvent molecules, which are not part of query/haystack), (part of the) the difference should work.

Just to make clear I understood you correctly:
If override_topology=True

  • i. check if the Atoms involved in the maps are equal. If this is not true, disable atom functionality (atom_contacts and the other functions)
  • ii. check if the Residues of the Atoms involved in the maps are equal, if not:
    • iia. if the residue indices are equal between the two topologies (just mutation): update the residue name to indicate this
    • iib. if the residue indices are not equal disable residue_contacts as well (we don't know how to do the mapping)

if the user gives a topology to override_topology,

  • we just check that all the atom indices of the contact map are present in the topology, otherwise disable atom based functions
  • same for residue indices

That seems like in-scope for this PR and doable.

Out of the scope for this PR (in my opinion):

  • anything more elaborate that tries to fix the underlying maps (so not straight up disabling)
  • We might be able to map contact maps from one topology on top of another (say a residue deletion) by remapping the indices of the map of the protein with deletion to the corresponding indices of the complete protein
  • Similar for insertion
  • We might allow a user to provide an atom-index mapping so that atom_contacts make sense
  • We might use topology.to_bondgraph() and do some graph similarity analysis to figure out the right guess

A last comment:
resid needs to be continuous and unique while resSeq can be an arbitrary int. Now, I MDTraj uses resid internally on the topology.residue(i) function. So we might need to do some more elaborate checking

@dwhswenson
Copy link
Owner

Scope of the PR: totally agree. The point is to make this easy to change to something with retopologize if/when that is available (specifically, step iia might be overtaken by the retopologize). It is not to include the retopologize stuff in this PR -- way too much.

One change in the details: part iib, don't disable residue contacts; just error. If we can't do atoms contacts and we can't do residue contacts, immediately tell the user that this isn't going to work.

resid needs to be continuous and unique while resSeq can be an arbitrary int. Now, I MDTraj uses resid internally on the topology.residue(i) function. So we might need to do some more elaborate checking

Sorry -- is this about things needed for the retopologizing stuff or about checking equality here? The hash for a residue tells you what MDTraj considers as defining the residue. Unfortunately, that includes the resid (index) but does not include chainid or any other information about things on levels other than the residue. (It looks like equality checks are by id, so two residues are only Python-equal if they're the same object in memory.)

@sroet
Copy link
Collaborator Author

sroet commented Aug 13, 2020

Sorry -- is this about things needed for the retopologizing stuff or about checking equality here?

retopologizing (specifically case 1 where you named resSeq and resid) one point to keep track of for our codebase: we select residues from topology.residue(res_idx)which usesresid`, so we should be careful that we keep track of that somewhere. ;)

I think I have enough info for this PR, I will ping you when it is ready for review

@dwhswenson
Copy link
Owner

Glancing again at this: Many of these changes can go outside the ContactDifference class. This is a general solution to trying manage when the user provides two conflicting topologies. I'd make them bare functions in a separate file. That way they can be easily reused in other situations (maybe there's something with ContactTrajectory that should act this way?)

That also helps to tell CodeClimate to shut up -- you have more freedom to structure things.

@sroet
Copy link
Collaborator Author

sroet commented Sep 7, 2020

Glancing again at this: Many of these changes can go outside the ContactDifference class. This is a general solution to trying manage when the user provides two conflicting topologies. I'd make them bare functions in a separate file. That way they can be easily reused in other situations (maybe there's something with ContactTrajectory that should act this way?)

Will do, just had an hour left to do some coding, so wanted to at least implement what we discussed before. Hope to have a couple left over hours later this week to round it up

@sroet
Copy link
Collaborator Author

sroet commented Oct 7, 2020

Alright, the current wall I am bashing my head against:

  • overriding inherited properties in a subclass is not allowed (such as self.haystack_residues) and I want to disable these methods/properties when atoms or residues don't make sense.

So far the only maybe viable solution I could find was to maybe implement something [like this](https://code.activestate.com/recipes/408713-late-binding-properties-allowing-subclasses-to-ove/.

But I first wanted to check in if @dwhswenson might have another idea to circumvent this

@dwhswenson
Copy link
Owner

This seems more complicated than it originally looked (if I understand correctly, you need to dynamically monkey-patch the object while you're initializing it). Monkey-patching on initialization is a red flag that what you really want is a subclass. MismatchedContactDifference would be a reasonable name, and then you should be able do it all with more standard inheritance approaches. From a usage standpoint, this has the same requirement that the user must explicitly ask to do this, but instead of a flag parameter in the __init__, you just have a separate class for this functionality.

As is, this subclass would have to explicitly use ContactObject.__init__ instead of super().__init__. That's a sign of a design flaw. I think the flaw can be fixed by changing the ContactDifference.__init__ to use self.__class__._check_compatibility(positive, negative), and overriding _check_compatibility in the subclass. Still not the prettiest code, but better than abusing the MRO.

Again, it would be good for as much of the mismatch-related code as possible to be outside of the class, so it can be reused. Probably as bare functions in a separate file. In the long run, it would be useful to make this into a MismatchMixin class that has the new _check_compatilibity, but right now I don't think various ContactObject subclasses are uniform enough for that to work.

On another note, if you add an example of this for the docs, please make it a separate notebook. I think the main notebook has gotten a bit long and isn't as straight-to-the-point as I'd like, so I'm working on rewriting it as several notebooks.

@sroet
Copy link
Collaborator Author

sroet commented Oct 8, 2020

This seems more complicated than it originally looked (if I understand correctly, you need to dynamically monkey-patch the object while you're initializing it). Monkey-patching on initialization is a red flag that what you really want is a subclass. MismatchedContactDifference would be a reasonable name, and then you should be able do it all with more standard inheritance approaches. From a usage standpoint, this has the same requirement that the user must explicitly ask to do this, but instead of a flag parameter in the init, you just have a separate class for this functionality.

I do agree that this is better suited in a subclass. There will still be some monkeypatching as we don't know before initialisation if atoms or residues are mismatched.

But I might be able to work around the issue; current plan will be overriding all the functionality with the errors in the class definition, and then monkeypatching them at init to super().func in order to enable them if they are allowed.

@dwhswenson
Copy link
Owner

There will still be some monkeypatching as we don't know before initialisation if atoms or residues are mismatched.

Maybe multiple subclasses AtomMismatch... and ResidueMismatch...? And if you want an easy way to create it for either case, make a factory function.

@sroet
Copy link
Collaborator Author

sroet commented Oct 9, 2020

@dwhswenson
This is now feature complete and has full test coverage.

Just to summarize what this does:

  • ContactDifference does not care anymore about atoms or residues not involved in the actual map (but checks both topologies for all atoms involved in both maps).
  • ContactDifference does not care about any of the other attributes (cutoff, query, haystack etc ...) anymore
  • ContactDifference.topology will get a custom residue name if that is the only difference. (so no different atoms) (See below how we define "different")
  • If either the (involved) residue indices don't overlap, or the involved atom indices an informative error is raised
  • AtomMismatchedContactDifference allows for the atoms to be different, but not the residues
  • ResidueMismatchedContactDifference allows for the residues to be different but not the atoms
  • OverrideTopologyContactDifference allows the user to give a topology, this one does need to contain all involved atoms and residues (this also allows a user to choose the residue name if they don't want the renaming)

How do we define two things as "the same":
atoms (all of the following have to be true):

  • the name (excluding the residue part) needs to be identical (e.g. XXX-C0 == YYY-C0, but XXX-C0 != XXX-C1)
  • the element (md.element...) needs to be identical
  • the index needs to be identical
  • the serial needs to be identical (this one might be too strict, what do you think @dwhswenson )

residues (all of the following have to be true):

  • the name has to be the same

When does ContactDifference fail?

  • If some atom indices are not equal or missing in one of the topologies
  • If the set of residue indices differ between the map topologies (done by comparingset([top.atom(i).residue.index for i in map_atoms]))
  • If not all the residues are present in both maps

If this is all ok with you I don't mind writing an extra example notebook to illustrate this purpose.

Again, it would be good for as much of the mismatch-related code as possible to be outside of the class, so it can be reused. Probably as bare functions in a separate file. In the long run, it would be useful to make this into a MismatchMixin class that has the new _check_compatilibity, but right now I don't think various ContactObject subclasses are uniform enough for that to work.

Most of the topology mismatch code is pulled to topol.py, which functions did you want to pull out further?

@sroet sroet changed the title WIP: Allow for mismatched ContactDifference [REVIEW] Allow for mismatched ContactDifference Oct 11, 2020
@sroet sroet requested a review from dwhswenson October 11, 2020 16:12
@sroet
Copy link
Collaborator Author

sroet commented Oct 11, 2020

This one should now be ready for a review

Copy link
Owner

@dwhswenson dwhswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only glanced over it so far; first set of changes is to move ContactDifference._check_compatibility and everything in the call tree of that (_fix_self, _check_topology, _disable_*) into a separate file; maybe map_compatibility.py. They only need self to find each other as methods, and I could imagine these methods will be reused in other objects that combine two ContactObjects.

The map_compatiblity.check_compatibility() (or fix_parameters? that seems more appropriate; perhaps for the filename too) method could then take allow_mismatched_residues, allow_mismatched_atoms (or a shorter name if you can come up with one) to be passed from the class on initialization. These could even be private class variables so __init__ doesn't need to be repeated.

That should actually get rid of the of disable_* methods, I think. You can assemble the error string in check_topology from some string patterns that can be module-level private constants.

Another approach, if you don't like passing the allow_mismatched_atoms type variables around, is to wrap this into a ParameterFixer class, which stores those on initialization.

Also, I didn't like the idea of overriding _check_compatibility in ContactDifference to have different inputs/outputs. So this also solves that.

More detailed review after these changes, but at first glance the basic idea looks good.

@sroet
Copy link
Collaborator Author

sroet commented Oct 12, 2020

@dwhswenson This should cover your initial feedback, please have another look

@sroet sroet requested a review from dwhswenson October 12, 2020 20:38
Copy link
Owner

@dwhswenson dwhswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions about the notebook: mainly, I'm confused by some if the output.

  • In the truncated protein, I feel like this is not the expected behavior. It seems to me that missing parts of the map should be treated like a NaN. That is, if you don't have them in one map, they shouldn't show results in the final map (because you can't actually say anything about the difference). That would make this example an empty map; maybe better to use the trajectory/first frame as in the main example?

  • I also don't understand the results when residue info is missing. Why does this all show along the diagonal? And again, shouldn't this actually be an empty map?

Otherwise, code looks good. A few small changes and clarifications requested, and the err=None idea is just a thought; if you like it, implement it; if not, no worries.

contact_map/contact_map.py Outdated Show resolved Hide resolved
contact_map/contact_map.py Outdated Show resolved Hide resolved
contact_map/topology.py Outdated Show resolved Hide resolved
contact_map/topology.py Show resolved Hide resolved
@sroet
Copy link
Collaborator Author

sroet commented Oct 13, 2020

Let's start with the easy question

I also don't understand the results when residue info is missing. Why does this all show along the diagonal? And again, shouldn't this actually be an empty map?

This is due to the n_neighbors_ignored, as all atoms are part of the same residue, the broken_residue_frame is actually empty (all atom contacts are ignored as they are part of the same residue).

In the truncated protein, I feel like this is not the expected behavior. It seems to me that missing parts of the map should be treated like a NaN. That is, if you don't have them in one map, they shouldn't show results in the final map (because you can't actually say anything about the difference). That would make this example an empty map; maybe better to use the trajectory/first frame as in the main example?

So you would restrict the difference to the atoms (and residues) that exist in both maps?
Essentially changing these lines to be an intersect and altering how the actual difference maps are generated (so not using all the counts from positive and negative)? I think this is a fair expectation, and I can implement this if that is what you mean. (It will require more altering of the ContactDifference class)

Co-authored-by: David W.H. Swenson <dwhs@hyperblazer.net>
@dwhswenson
Copy link
Owner

This is due to the n_neighbors_ignored, as all atoms are part of the same residue, the broken_residue_frame is actually empty (all atom contacts are ignored as they are part of the same residue).

Aha... makes sense. Could you add something to the notebook explaining that?

So you would restrict the difference to the atoms (and residues) that exist in both maps?

Yes, at least I think this makes more sense for ContactDifference stuff. I've been trying to figure out how we might use this parameter mismatch idea in, for example, concatenating ContactTrajectory objects (I'm not sure if there's a use case where that makes sense, but that's a reason I'm suggesting keeping the parameter fixing stuff out of the inheritance chain.) My gut feeling is that you'd want to keep all of them in that situation.

ContactSimilarity would be the same as ContactDifference. I implemented that recently, by the way -- as a quickly hacked subclass of ContactDifference. Similarity measure was sqrt(freq_A * freq_B). I haven't seen that anywhere in the literature, but was doing some analysis and it made sense to have similarity as well as difference.

@sroet
Copy link
Collaborator Author

sroet commented Oct 16, 2020

@dwhswenson
This is altered based on the new behavior, is feature complete has full test coverage and an example.

Just to summarize what this does:

  • [api-break of non-public api] map._check_compatibility will now return an empty set (Falsey instead of True) if two maps are compatible.
  • ContactDifference only cares about atoms and residues that can be present in both maps and returns nan for anything outside of that.
  • ContactDifference does not care about any of the other attributes (cutoff, query, haystack etc ...) anymore
  • ContactDifference.topology will get a custom residue name if that is the only difference. (so no different atoms) (See below how we define "different")
  • If either the overlapipng residues are not equal, or the overlapping atom indices an informative error is raised
  • AtomMismatchedContactDifference allows for the atoms to be different, but not the residues
  • ResidueMismatchedContactDifference allows for the residues to be different but not the atoms
  • OverrideTopologyContactDifference allows the user to give a topology, this one does need to contain all involved atoms and residues (this also allows a user to choose the residue name if they don't want the renaming)

How do we define two things as "the same":
atoms (all of the following have to be true):

  • the name (excluding the residue part) needs to be identical (e.g. XXX-C0 == YYY-C0, but XXX-C0 != XXX-C1)
  • the element (md.element...) needs to be identical
  • the index needs to be identical
  • the serial needs to be identical (this one might be too strict, what do you think @dwhswenson )

residues (all of the following have to be true):

  • the name, resSeq and segment_id have to be the same

When does ContactDifference fail?

  • If some atoms are not equal or missing if an overriding topology is given
  • If the residues are not equal or missing if an overriding topology is given

On top of that ContactCount now has a filter function that will return a filtered ContactCount which only contains contacts where both contact indices are in a list of idx.

@sroet sroet requested a review from dwhswenson October 16, 2020 20:48
Copy link
Owner

@dwhswenson dwhswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One required change (just renaming parameters now so we don't break API by doing it later) and one optional change. Otherwise, looks good!

contact_map/fix_parameters.py Outdated Show resolved Hide resolved
contact_map/contact_map.py Outdated Show resolved Hide resolved
@sroet sroet requested a review from dwhswenson October 20, 2020 18:16
@dwhswenson dwhswenson changed the title [REVIEW] Allow for mismatched ContactDifference Allow for mismatched ContactDifference Oct 21, 2020
@dwhswenson dwhswenson changed the title Allow for mismatched ContactDifference Allow mismatched ContactDifference Oct 21, 2020
@dwhswenson dwhswenson merged commit 350a28e into dwhswenson:master Oct 21, 2020
@dwhswenson dwhswenson mentioned this pull request Oct 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants