[MRG] remove redundant ACGT-checking code. #1590

Merged
ctb merged 55 commits into master from remove/is_valid_dna on Jun 2, 2017

5 participants
Owner

ctb commented Jan 24, 2017 edited

This removes check_and_process_read from Hashtable, and eliminates sequence cleaning from non-bulk-loading code. All such Python-accessible code should use the cleaned_seq attribute that is available from both the ReadParser code / Read objects and the khmer.utils.broken_paired_reader code.


See #1541 (comment) for background context.

Prior to this PR,

  • consume_fasta & other bulk sequence loading functions automatically upper-cased DNA and ignored any sequence (no matter how long) that contained non-ACGT characters.
  • trim_on_abundance and trim_below_abundance similarly upper-cased DNA and ignored non-ACGT-containing strings.
  • find_spectral_error_positions raised an exception on non-ACGT-containing strings.
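
The pre-PR behavior listed above can be pictured with a minimal pure-Python sketch. The names `legacy_normalize` and `legacy_strict` are hypothetical stand-ins for illustration only; they are not khmer functions, and the real checks lived in the C++ code.

```python
VALID_BASES = frozenset("ACGT")


def legacy_normalize(seq):
    """Upper-case seq; return None if it contains any non-ACGT character.

    Hypothetical stand-in for the old bulk-loading and trim_* behavior,
    where such sequences were ignored rather than processed.
    """
    upper = seq.upper()
    if set(upper) <= VALID_BASES:
        return upper
    return None


def legacy_strict(seq):
    """Like legacy_normalize, but raise instead of ignoring.

    Hypothetical stand-in for the old find_spectral_error_positions
    behavior, which raised an exception on non-ACGT input.
    """
    cleaned = legacy_normalize(seq)
    if cleaned is None:
        raise ValueError("sequence contains non-ACGT characters")
    return cleaned
```

In this sketch a single N anywhere in a read is enough to discard the whole read, regardless of its length, which is the behavior the PR removes.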

This PR updates the test code introduced in #1633 to match the new behavior.

This PR also includes #1661.


  • Is it mergeable?
  • make test: Did it pass the tests?
  • make clean diff-cover: If it introduces new functionality in
    scripts/, is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle: Is it well
    formatted?
  • Did it change the command-line interface? Only backwards-compatible
    additions are allowed without a major version increment. Changing file
    formats also requires a major version number increment.
  • For substantial changes or changes to the command-line interface, is it
    documented in CHANGELOG.md? See keepachangelog
    for more details.
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)
@ctb ctb remove unneeded check_and_process_read code
7d46f07

ctb changed the title from remove unneeded check_and_normalize_read calls to remove redundant ACGT-checking code. Jan 24, 2017

Owner

ctb commented Jan 24, 2017

Although, reading #1491, I'm wondering whether we should be using cleaned_seq or an equivalent in the C++ bulk consume_fasta code instead of read.sequence. If so, then we really need to add some tests so that the current code breaks :).

codecov-io commented Jan 24, 2017 edited

Codecov Report

Merging #1590 into master will increase coverage by <.01%.
The diff coverage is 0%.

@@            Coverage Diff            @@
##           master   #1590      +/-   ##
=========================================
+ Coverage    0.05%   0.05%   +<.01%     
=========================================
  Files          91      91              
  Lines       11500   11483      -17     
  Branches     3063    3056       -7     
=========================================
  Hits            6       6              
+ Misses      11494   11477      -17
Impacted Files               Coverage Δ
include/oxli/hashtable.hh    0% <ø> (ø) ⬆️
src/oxli/subset.cc           0% <0%> (ø) ⬆️
src/oxli/hashtable.cc        0% <0%> (ø) ⬆️
src/oxli/labelhash.cc        0% <0%> (ø) ⬆️
src/oxli/hashgraph.cc        0% <0%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4fc53f6...1a54b45.

Owner

ctb commented Feb 14, 2017

Also see #1619 (comment)

Member

camillescott commented Feb 14, 2017

Also see #1595 -- some more sophisticated alphabet checking was added to the parsers. Basically, you can give it one of the alphabets (as defined in alphabets.hh) by name, and it'll use that for its checks; as perhaps implied by that, I'm in favor of moving this sort of checking code into the parsers. Alternatively, the graphs could take a parameter for whether they should run any checks on what they consume (also using the alphabets).
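
As a rough illustration of the "pick an alphabet by name" idea, here is a minimal Python sketch. The table `ALPHABETS`, its keys, and the function `is_valid` are all hypothetical; they are not copied from alphabets.hh and do not reflect the actual parser API.

```python
# Hypothetical alphabet table: names and contents are illustrative only
# and are NOT taken from khmer's alphabets.hh.
ALPHABETS = {
    "DNA_SIMPLE": "ACGT",
    "DNA_N": "ACGTN",
}


def is_valid(seq, alphabet="DNA_SIMPLE"):
    """Return True if every character of seq (upper-cased) is in the
    alphabet registered under the given name."""
    allowed = frozenset(ALPHABETS[alphabet])
    return all(ch in allowed for ch in seq.upper())
```

With this shape, a parser (or a graph taking a "run checks" parameter) would only need the alphabet name, e.g. `is_valid(read_seq, "DNA_N")`, to decide whether a read passes.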

Member

camillescott commented Feb 14, 2017

And, looking at #1511 closer, moving that checking code into the parsers is already becoming SOP. Yay.

Owner

ctb commented Feb 14, 2017

ctb referenced this pull request Feb 16, 2017

Closed

Introduce consume_seqfile_banding method #1571

12 of 12 tasks complete

ctb changed the base branch to remove/is_valid_dna_tests from master Feb 19, 2017

Owner

ctb commented Feb 20, 2017 edited

@standage note that one thing the partitioning code does is fail to output reads containing Ns.

Contributor

standage commented Feb 20, 2017

note that one thing the partitioning code does is fail to output reads containing Ns

Roger that, good to know.

Owner

ctb commented Feb 20, 2017

ctb added some commits Feb 25, 2017

@ctb ctb Merge branch 'master' of github.com:dib-lab/khmer into remove/is_valid_dna
0f1bf48
@ctb ctb remove check_and_normalize_read from trim_on_stoptags bfeebef
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna efd2bb3
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna 6d97388
@ctb ctb split tests out to test_sequence_validation 0cdd034
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna f63a4d8
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna c7123da
@ctb ctb add exceptions cad8213
@ctb ctb Merge remote-tracking branch 'origin/master' into remove/is_valid_dna c269436
@ctb ctb fix setup.py for offline operation 990dc37
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna 85a730e
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna 264dc29
@ctb ctb remove check_and_normalize_read from hashgraph 416f94b
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna cec6243
@ctb ctb fix problems revealed by larger ksize 7375dbd
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna 12aa4d8
@ctb ctb updated read cleaning in labelhash & tests 06342dc
@ctb ctb removed check_and_normalize_read, yay 8b34213
@ctb ctb uncomment pytest-runner req b17f386
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna 3164408
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna 8b846a1
@ctb ctb switch to countingtype f708a76
@ctb ctb switch to countingtype x 2 b2d0838
@ctb ctb update trim functions on bad dna to be properly undefined 10ce9bd
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna 98cd756
@ctb ctb Merge branch 'fix/consume_partitioned_err' into remove/is_valid_dna 8a683c8
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna fc63010
@ctb ctb Merge branch 'fix/consume_partitioned_err' into remove/is_valid_dna 01c3b44
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna cb3d6cc
@ctb ctb add partition ID to new 'bad' sequence 9a5080b
@ctb ctb adjust tests for new 'bad dna' sequence 3f6a673
@ctb ctb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna d4eb8a5

ctb changed the base branch to master from remove/is_valid_dna_tests Mar 20, 2017

Owner

ctb commented Mar 20, 2017

I wonder if this code is any faster now that we don't iterate across every sequence 2 or 3 times?

@betatim betatim Merge branch 'master' into remove/is_valid_dna
122258c
Member

betatim commented Mar 22, 2017

Time to review?

Owner

ctb commented Mar 22, 2017

Member

betatim commented Mar 22, 2017

Ran ./scripts/abundance-dist-single.py -x 1e8 -b ecoli_ref-5m.fastq somehisto.hist -k 31 -s, which takes about 120s on my machine on both master and this branch :-/

ctb changed the title from remove redundant ACGT-checking code. to [MRG] remove redundant ACGT-checking code. Jun 2, 2017

Owner

ctb commented Jun 2, 2017

@betatim @luizirber @camillescott @standage this PR is now ready for merge into master and (IMO) should be given high priority for review. @camillescott's prediction of many merge conflicts was mistaken, thank goodness - only one small conflict to be resolved!

This merge significantly changes the details around ACTGN handling and will require a bump to khmer 3.0.

Owner

ctb commented Jun 2, 2017

Tests pass!

@standage

Looks good to me. As I understand it, khmer now makes no attempt to clean up kmers that are passed to the sketch.add() function, and does a [^ACGT] --> A conversion for any bulk sequence loading code. Correct?
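
For readers following along, the conversion described in the comment above can be sketched in a few lines of Python. This assumes the behavior exactly as summarized there (upper-case, then replace anything outside ACGT with A); `clean_sequence` is a hypothetical name, not khmer's implementation.

```python
import re

# Matches any single character that is not one of A, C, G, T.
_NON_ACGT = re.compile(r"[^ACGT]")


def clean_sequence(seq):
    """Upper-case seq, then replace every non-ACGT character with 'A'.

    Illustrative sketch only; the cleaned string always has the same
    length as the input.
    """
    return _NON_ACGT.sub("A", seq.upper())
```

Under this sketch a read such as ACNNGT is kept and loaded as ACAAGT rather than being dropped, while k-mers passed one at a time are left untouched.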

+ # because different hash functions do different things with
+ # non-ACTG characters. So all we want to do is verify that the
+ # functions execute w/o error on the k-mers before the "bad" DNA,
+ # and don't return positions in the "good" DNA.
@standage

standage Jun 2, 2017

Contributor

The meaning of the phrase "and don't return positions in the 'good' DNA" is unclear to me.

@ctb

ctb Jun 2, 2017

Owner

All of the hash functions should behave the same way on "good" (pure ACTG) sequences, and there should be no errors in that region to trim at.

Contributor

standage commented Jun 2, 2017

Also, does the partitioning code now output reads containing Ns @ctb?

Owner

ctb commented Jun 2, 2017

@standage yes; re partitioning and ignoring of reads containing N, see: this removed line

@ctb ctb merged commit e768617 into master Jun 2, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

ctb deleted the remove/is_valid_dna branch Jun 2, 2017
