screed doesn't uppercase DNA in loaded records #1434
Comments
Hey, this breaks all of our diginorm and trim-low-abund consistency checks. Sweet :)
Instead of using a macro, would it be any slower to use an inline function that raises an exception on an unknown character?
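For illustration, here is a rough Python sketch of that idea: a lookup that raises on any character outside uppercase ACGT instead of silently mapping it. The name `strict_twobit` and the bit assignments are hypothetical; the real code is a C++ macro in kmer_hash.hh, and whether an inline function would be slower there can only be settled by benchmarking.

```python
# Assumed two-bit encoding, for illustration only.
_TWOBIT = {'A': 0, 'T': 1, 'C': 2, 'G': 3}


def strict_twobit(ch):
    """Return the two-bit code for an uppercase base, or raise on anything else."""
    try:
        return _TWOBIT[ch]
    except KeyError:
        raise ValueError("invalid DNA character: %r" % ch)
```

With this variant, lowercase or ambiguous input fails loudly at hashing time rather than producing a wrong hash.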
A duplicate of #370, basically, except that back then we thought we'd handled it in scripts.
This was referenced Sep 4, 2016
betatim added a commit to betatim/khmer that referenced this issue Sep 5, 2016 (9157008)
ctb added this to the 2.1 milestone Sep 26, 2016
ctb closed this Oct 3, 2016
ctb commented Sep 3, 2016 • edited
In the course of mucking about with hash functions, I discovered that there are multiple places in the code where bad data is fed to our hashing code.
More specifically,
our hashing code (`twobit_repr` etc. in kmer_hash.hh) intentionally does not handle lowercase DNA characters (it treats them as 'G').

This is an excellent combination of optimization (avoiding too much input DNA sanitization 'cause it makes hashing slow) with poorly specified library behavior by screed.
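To make the failure mode concrete, here is a minimal Python sketch of a permissive two-bit encoding. It mirrors the behavior described above (anything that is not 'A', 'T' or 'C' falls through to the code for 'G'), but the function names and the exact bit assignments are assumptions for illustration, not khmer's actual code.

```python
def permissive_twobit(ch):
    """Encode one base in two bits; unrecognized characters fall through to G's code."""
    if ch == 'A':
        return 0
    if ch == 'T':
        return 1
    if ch == 'C':
        return 2
    return 3  # 'G', but also 'a', 'c', 't', 'N', ...


def hash_kmer(kmer):
    """Pack a k-mer into an integer, two bits per base (forward strand only)."""
    h = 0
    for ch in kmer:
        h = (h << 2) | permissive_twobit(ch)
    return h


# Lowercase input silently hashes as if every base were G.
print(hash_kmer("ACGT") == hash_kmer("acgt"))  # False
print(hash_kmer("acgt") == hash_kmer("GGGG"))  # True
```

No error is raised anywhere, so the only symptom is that counts for lowercase reads end up under the wrong k-mers.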
This affects a number of tests, which is how I found it. For example, `tests/test_filter_abund.py::test_filter_abund_6_trim_high_abund_Z` will fail if you uppercase `record.sequence` in `khmer/thread_utils.py::verbose_loader`.

There are lots of possible solutions -
`KHMER_EXTRA_SANITY_CHECKS` turned on;

but this kind of thing has happened enough in the past that we should probably think about how to check for it more systematically. Thoughts welcome.
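As a rough sketch of the experiment mentioned above (uppercasing `record.sequence` as records are loaded), a wrapper generator might look like this. `Record` and `uppercase_records` are hypothetical stand-ins for illustration; this is not screed's record type or khmer's `verbose_loader`.

```python
from collections import namedtuple

# Hypothetical stand-in for a loaded sequence record.
Record = namedtuple("Record", ["name", "sequence"])


def uppercase_records(records):
    """Yield copies of each record with the sequence uppercased."""
    for record in records:
        yield record._replace(sequence=record.sequence.upper())


reads = [Record("r1", "acgtn"), Record("r2", "ACGT")]
cleaned = list(uppercase_records(reads))
```

Note that this only normalizes case; characters such as 'N' would still reach the hashing code unchanged.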