Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De-increment nincluded/nreported after merge #47

Merged
merged 6 commits into from Aug 2, 2023

Conversation

zdk123
Copy link
Contributor

@zdk123 zdk123 commented Jul 11, 2023

Stub of a fix for #46. It would probably be better if there was a true fix in libhmmer.

Does this look safe to do? I can write a formal test if so.

@codecov
Copy link

codecov bot commented Jul 11, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.12% 🎉

Comparison is base (43e7d76) 76.52% compared to head (0d85341) 76.65%.
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #47      +/-   ##
==========================================
+ Coverage   76.52%   76.65%   +0.12%     
==========================================
  Files           7        7              
  Lines        6965     6968       +3     
==========================================
+ Hits         5330     5341      +11     
+ Misses       1635     1627       -8     
Flag Coverage Δ
v3.10 76.65% <100.00%> (+0.12%) ⬆️
v3.11 76.65% <100.00%> (+0.12%) ⬆️
v3.7 76.62% <100.00%> (+0.12%) ⬆️
v3.8 76.65% <100.00%> (+0.12%) ⬆️
v3.9 76.65% <100.00%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
pyhmmer/plan7.pyx 73.87% <100.00%> (+0.26%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@althonos
Copy link
Owner

You can actually set nincluded and nreported to zero for all hits before calling p7_tophits_Threshold, I think the threshold will set the counts to the right number on its own.

@zdk123
Copy link
Contributor Author

zdk123 commented Jul 18, 2023

Close this in favor of linking to the fixed hmmer library?

@althonos
Copy link
Owner

Nah, until a new numbered release of HMMER3 I'm not gonna bump the local version, so a temporary patch is welcome. It would be great If you can update the code as suggested above, and add a reminder to remove the patch later, similar to:

pyhmmer/pyhmmer/plan7.pyx

Lines 7881 to 7882 in e8f70f8

# TODO(@althonos): Replace with `p7_tophits_Clone` as implemented
# in EddyRivasLab/hmmer#273 when formally released.

Thanks 🙏

@zdk123
Copy link
Contributor Author

zdk123 commented Jul 20, 2023

@althonos Follow on bug and a question for you.

I am attempting to write a test but realize that users don't have read access to tophits._th.hit - this only gets exposed by the internal hmmer domain table writer.

I thought instead to expose nincluded/nreported for all the hits in the state dictionary.

This okay to add here?

        for i in range(self._th.N):
            offset = (<ptrdiff_t> self._th.hit[i] - <ptrdiff_t> &self._th.unsrt[0]) // sizeof(P7_HIT)
            hits.append(offset)
            hits_nreported.append(self._th.hit[i].nreported)
            hits_nincluded.append(self._th.hit[i].nincluded)

I've done this locally in a branch and found that, actually, to my surprise that nreported/nincluded might not being set for each hit after re-serializing in TopHits.__setstate__. I think we need the following addition here:

        for i, offset in enumerate(state["hit"]):
            self._th.hit[i] = &self._th.unsrt[offset]
            self._th.hit[i].nreported = state["hits_nreported"][i]
            self._th.hit[i].nincluded = state["hits_nincluded"][i]

Is this desirable to have or do you want to have another way to return/set this data other than the state dictionary?

@althonos
Copy link
Owner

althonos commented Jul 21, 2023

You can get the nreported and nincluded using len(hit.domains.included) and len(hit.domains.reported) because I made these properties return a sized iterator ☺️ These numbers should be read-only because the TopHits instances are immutable on the Python side, ignoring the option to sort the hits.

That's indeed a problem if the attributes are not reset after a __setstate__.

@althonos
Copy link
Owner

I've done this locally in a branch and found that, actually, to my surprise that nreported/nincluded might not being set for each hit after re-serializing in TopHits.setstate. I think we need the following addition here:

Actually that should be handle by p7_hit_Deserialize for each hit I think! So no need to do it in the Cython code.

@zdk123
Copy link
Contributor Author

zdk123 commented Jul 29, 2023

Looks like my updated test caught another bug on the number of included domains, when merging with an empty TopHit object (and maybe more generally?)

from my branch:

from pyhmmer.plan7 import HMMFile, Pipeline, TopHits
from pyhmmer.easel import SequenceFile

thioesterase = HMMFile("pyhmmer/tests/data/hmms/db/Thioesterase.hmm").read()

with SequenceFile('pyhmmer/tests/data/seqs/938293.PRJEB85.HG003687.faa', digital=True) as seqfile:
    proteins = seqfile.read_block()

pli = Pipeline(thioesterase.alphabet)

hits = pli.search_hmm(thioesterase, proteins)

[len(hit.domains.reported) for hit in hits]
# 1
[len(hit.domains.included) for hit in hits]
# 0

empty = TopHits()
empty_merged = empty.merge(hits)
[len(hit.domains.reported) for hit in empty_merged]
# 1
[len(hit.domains.included) for hit in empty_merged]
# 1

It looks like like empty TopHit gets initialized with inclusion thresholds of domain score with something other than the Pipeline defaults:

empty.incdomT
# 0.0
empty_merged.incdomT
# 0.0
hits.incdomT
# None

I understand this is extremely edge case and probably could just remove the test, but noting it here because it looks like merging TopHit objects with different inclusion/reporting cutoffs isn't being explicitly handled and maybe that should raise an Error?

@althonos
Copy link
Owner

Good catch! Merging TopHits with different parameters should indeed be disallowed. But I'm thinking how to fix this when TopHits are created from the Python constructor, I can't think of a simple solution...

@@ -71,6 +71,8 @@ def assertHitEqual(self, h1, h2):
"attribute {!r} differs".format(attr)
)
self.assertEqual(len(h1.domains), len(h2.domains))
self.assertEqual(len(h1.domains.reported), len(h2.domains.reported))
# self.assertEqual(len(h1.domains.included), len(h2.domains.included))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests start failing if I uncomment this line

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually again an issue with HMMER, in p7_tophits_Threshold:

if (p7_pli_DomainReportable(pli, th->hit[h]->dcl[d].bitscore, th->hit[h]->dcl[d].lnP))
  th->hit[h]->dcl[d].is_reported = TRUE;
if ((th->hit[h]->flags & p7_IS_INCLUDED) &&
    p7_pli_DomainIncludable(pli, th->hit[h]->dcl[d].bitscore, th->hit[h]->dcl[d].lnP))
  th->hit[h]->dcl[d].is_included = TRUE;

means that if a domain was previously included/reported, calling p7_tophits_Threshold again with a different threshold will not exclude it, it can only include/report new domains. Changing it to:

th->hit[h]->dcl[d].is_reported = p7_pli_DomainReportable(pli, th->hit[h]->dcl[d].bitscore, th->hit[h]->dcl[d].lnP);
th->hit[h]->dcl[d].is_included = (th->hit[h]->flags & p7_IS_INCLUDED) && p7_pli_DomainIncludable(pli, th->hit[h]->dcl[d].bitscore, th->hit[h]->dcl[d].lnP);

should fix it :)

@althonos
Copy link
Owner

althonos commented Aug 2, 2023

Okay, thanks a lot 👍
I'm gonna merge as-is, then try to figure out a way to fix the second issue you raised :)

@althonos althonos merged commit 60776bc into althonos:master Aug 2, 2023
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants