screed doesn't uppercase DNA in loaded records #1434

ctb · 2016-09-03T16:26:18Z

In the course of mucking about with hash functions, I discovered that there are multiple places in the code where bad data is fed to our hashing code.

More specifically,

our hashing code (twobit_repr etc in kmer_hash.hh) intentionally does not handle lower case DNA characters (it treats them as 'G');
screed does not upper case DNA sequence when it loads it;
so, any place where screed is used to load sequences has the potential to allow through lowercase DNA, which are then treated as 'G'.

This is an excellent combination of optimization (avoiding too much input DNA sanitization 'cause it makes hashing slow) with poorly specified library behavior by screed.

This affects a number of tests, which is how I found it. For example, tests/test_filter_abund.py::test_filter_abund_6_trim_high_abund_Z will fail if you uppercase record.sequence in khmer/thread_utils.py::verbose_loader.

There are lots of possible solutions -

amend all the relevant scripts to use uppercase;
make sure we periodically recompile and run the tests with KHMER_EXTRA_SANITY_CHECKS turned on;
change the default behavior of screed;

but this kind of thing has happened enough in the past that we should probably think about how to check for it more systematically. Thoughts welcome.

The text was updated successfully, but these errors were encountered:

ctb · 2016-09-03T16:35:37Z

Hey, this breaks all of our diginorm and trim-low-abund consistency checks. Sweet :)

tests/test_script_output.py::test_normalize_by_median_k21_C20_M1e7 FAILED
tests/test_script_output.py::test_normalize_by_median_k21_C15_M1e7 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_C0_M1e7_diginorm FAILED
tests/test_script_output.py::test_trim_low_abund_k21_C0_M1e7_diginorm_dn15 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_C2_M1e7_diginorm_dn15 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C2 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C3 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4_variable FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4_variable_Z25 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4_variable_Z15 FAILED

ctb · 2016-09-03T22:11:00Z

Instead of using a macro, would it be any slower to use an inline function that raises an exception on an unknown character?

ctb · 2016-09-04T22:26:06Z

A duplicate of #370, basically, except that back then we thought we'd handled it in scripts.
Sigh.

Exploratory branch for dib-lab#1434. Use a lookup table and a function instead of a macro.

ctb · 2016-10-03T21:18:34Z

Fixed in #1435. Leaving #370 open until we do a survey of the codebase.

ctb mentioned this issue Sep 3, 2016

fix lowercase actgn input handling. #1435

Merged

9 tasks

ctb mentioned this issue Sep 3, 2016

Support for k>32 #1426

Closed

This was referenced Sep 4, 2016

Factor out sequence preprocessing of reads with 'N's in them #986

Closed

Test all scripts/ for appropriate behavior when len(read) < K #859

Open

betatim added a commit to betatim/khmer that referenced this issue Sep 5, 2016

twobit_repr as lookup table

9157008

Exploratory branch for dib-lab#1434. Use a lookup table and a function instead of a macro.

betatim mentioned this issue Sep 5, 2016

twobit_repr as lookup table #1438

Closed

9 tasks

ctb added this to the 2.1 milestone Sep 26, 2016

ctb closed this as completed Oct 3, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

screed doesn't uppercase DNA in loaded records #1434

screed doesn't uppercase DNA in loaded records #1434

ctb commented Sep 3, 2016 •

edited

ctb commented Sep 3, 2016

ctb commented Sep 3, 2016

ctb commented Sep 4, 2016

ctb commented Oct 3, 2016

screed doesn't uppercase DNA in loaded records #1434

screed doesn't uppercase DNA in loaded records #1434

Comments

ctb commented Sep 3, 2016 • edited

ctb commented Sep 3, 2016

ctb commented Sep 3, 2016

ctb commented Sep 4, 2016

ctb commented Oct 3, 2016

ctb commented Sep 3, 2016 •

edited