screed doesn't uppercase DNA in loaded records #1434

Closed
ctb opened this Issue Sep 3, 2016 · 4 comments

Comments

Projects
None yet
1 participant
Owner

ctb commented Sep 3, 2016 edited

In the course of mucking about with hash functions, I discovered that there are multiple places in the code where bad data is fed to our hashing code.

More specifically,

  • our hashing code (twobit_repr etc in kmer_hash.hh) intentionally does not handle lower case DNA characters (it treats them as 'G');
  • screed does not upper case DNA sequence when it loads it;
  • so, any place where screed is used to load sequences has the potential to allow through lowercase DNA, which are then treated as 'G'.

This is an excellent combination of optimization (avoiding too much input DNA sanitization 'cause it makes hashing slow) with poorly specified library behavior by screed.

This affects a number of tests, which is how I found it. For example, tests/test_filter_abund.py::test_filter_abund_6_trim_high_abund_Z will fail if you uppercase record.sequence in khmer/thread_utils.py::verbose_loader.

There are lots of possible solutions -

  • amend all the relevant scripts to use uppercase;
  • make sure we periodically recompile and run the tests with KHMER_EXTRA_SANITY_CHECKS turned on;
  • change the default behavior of screed;

but this kind of thing has happened enough in the past that we should probably think about how to check for it more systematically. Thoughts welcome.

ctb referenced this issue Sep 3, 2016

Merged

fix lowercase actgn input handling. #1435

9 of 9 tasks complete
Owner

ctb commented Sep 3, 2016

Hey, this breaks all of our diginorm and trim-low-abund consistency checks. Sweet :)

tests/test_script_output.py::test_normalize_by_median_k21_C20_M1e7 FAILED
tests/test_script_output.py::test_normalize_by_median_k21_C15_M1e7 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_C0_M1e7_diginorm FAILED
tests/test_script_output.py::test_trim_low_abund_k21_C0_M1e7_diginorm_dn15 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_C2_M1e7_diginorm_dn15 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C2 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C3 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4_variable FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4_variable_Z25 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4_variable_Z15 FAILED

ctb referenced this issue Sep 3, 2016

Closed

Support for k>32 #1426

Owner

ctb commented Sep 3, 2016

Instead of using a macro, would it be any slower to use an inline function that raises an exception on an unknown character?

Owner

ctb commented Sep 4, 2016

A duplicate of #370, basically, except that back then we thought we'd handled it in scripts.
Sigh.

@betatim betatim added a commit to betatim/khmer that referenced this issue Sep 5, 2016

@betatim betatim twobit_repr as lookup table
Exploratory branch for #1434. Use a lookup table and a function
instead of a macro.
9157008

betatim referenced this issue Sep 5, 2016

Closed

twobit_repr as lookup table #1438

0 of 9 tasks complete

ctb added this to the 2.1 milestone Sep 26, 2016

Owner

ctb commented Oct 3, 2016

Fixed in #1435. Leaving #370 open until we do a survey of the codebase.

ctb closed this Oct 3, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment