Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

screed doesn't uppercase DNA in loaded records #1434

Closed
ctb opened this issue Sep 3, 2016 · 4 comments
Closed

screed doesn't uppercase DNA in loaded records #1434

ctb opened this issue Sep 3, 2016 · 4 comments
Milestone

Comments

@ctb
Copy link
Member

ctb commented Sep 3, 2016

In the course of mucking about with hash functions, I discovered that there are multiple places in the code where bad data is fed to our hashing code.

More specifically,

  • our hashing code (twobit_repr etc in kmer_hash.hh) intentionally does not handle lower case DNA characters (it treats them as 'G');
  • screed does not upper case DNA sequence when it loads it;
  • so, any place where screed is used to load sequences has the potential to allow through lowercase DNA, which are then treated as 'G'.

This is an excellent combination of optimization (avoiding too much input DNA sanitization 'cause it makes hashing slow) with poorly specified library behavior by screed.

This affects a number of tests, which is how I found it. For example, tests/test_filter_abund.py::test_filter_abund_6_trim_high_abund_Z will fail if you uppercase record.sequence in khmer/thread_utils.py::verbose_loader.

There are lots of possible solutions -

  • amend all the relevant scripts to use uppercase;
  • make sure we periodically recompile and run the tests with KHMER_EXTRA_SANITY_CHECKS turned on;
  • change the default behavior of screed;

but this kind of thing has happened enough in the past that we should probably think about how to check for it more systematically. Thoughts welcome.

@ctb
Copy link
Member Author

ctb commented Sep 3, 2016

Hey, this breaks all of our diginorm and trim-low-abund consistency checks. Sweet :)

tests/test_script_output.py::test_normalize_by_median_k21_C20_M1e7 FAILED
tests/test_script_output.py::test_normalize_by_median_k21_C15_M1e7 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_C0_M1e7_diginorm FAILED
tests/test_script_output.py::test_trim_low_abund_k21_C0_M1e7_diginorm_dn15 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_C2_M1e7_diginorm_dn15 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C2 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C3 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4_variable FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4_variable_Z25 FAILED
tests/test_script_output.py::test_trim_low_abund_k21_M1e7_C4_variable_Z15 FAILED

@ctb ctb mentioned this issue Sep 3, 2016
@ctb
Copy link
Member Author

ctb commented Sep 3, 2016

Instead of using a macro, would it be any slower to use an inline function that raises an exception on an unknown character?

@ctb
Copy link
Member Author

ctb commented Sep 4, 2016

A duplicate of #370, basically, except that back then we thought we'd handled it in scripts.
Sigh.

betatim added a commit to betatim/khmer that referenced this issue Sep 5, 2016
Exploratory branch for dib-lab#1434. Use a lookup table and a function
instead of a macro.
@ctb ctb added this to the 2.1 milestone Sep 26, 2016
@ctb
Copy link
Member Author

ctb commented Oct 3, 2016

Fixed in #1435. Leaving #370 open until we do a survey of the codebase.

@ctb ctb closed this as completed Oct 3, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant