[MRG] remove redundant ACGT-checking code. #1590
ctb changed the title from "remove unneeded check_and_normalize_read calls" to "remove redundant ACGT-checking code." Jan 24, 2017
|
Although reading #1491 I'm wondering if we should be using |
codecov-io commented Jan 24, 2017
Codecov Report
Coverage diff, master vs. #1590:
  Coverage: 0.05% -> 0.05% (+<.01%)
  Files: 91 -> 91
  Lines: 11500 -> 11483 (-17)
  Branches: 3063 -> 3056 (-7)
  Hits: 6 -> 6
  Misses: 11494 -> 11477 (-17)
|
This was referenced Jan 26, 2017
|
Also see #1619 (comment) |
ctb referenced this pull request Feb 14, 2017: Discrepancies between exact counts and approximate counts #1619 (Closed)
|
Also see #1595 -- some more sophisticated alphabet checking was added to the parsers. Basically, you can give it one of the alphabets (as defined in |
|
And, looking at #1511 closer, moving that checking code into the parsers is already becoming SOP. Yay. |
|
yeppers.
|
ctb referenced this pull request Feb 19, 2017: some bulk sequence loading tests that nail down current ACGTN behavior. #1633 (Merged)
ctb changed the base branch to remove/is_valid_dna_tests from master Feb 19, 2017
|
@standage note that one thing the partitioning code does is fail to output reads containing Ns. |
ctb
added some commits
Feb 20, 2017
Roger that, good to know. |
|
On Mon, Feb 20, 2017 at 08:07:57AM -0800, Daniel Standage wrote:
> note that one thing the partitioning code does is fail to output reads containing Ns
Roger that, good to know.
(for now - fixing it in this PR :)
|
ctb
added some commits
Feb 25, 2017
ctb changed the base branch to master from remove/is_valid_dna_tests Mar 20, 2017
|
I wonder if this code is any faster now that we don't iterate across every sequence 2 or 3 times? |
|
Time to review? |
|
On Wed, Mar 22, 2017 at 07:47:42AM -0700, Tim Head wrote:
> Time to review?
yes! but this will not be merged until after release.
|
|
Ran |
This was referenced May 18, 2017
ctb changed the title from "remove redundant ACGT-checking code." to "[MRG] remove redundant ACGT-checking code." Jun 2, 2017
|
@betatim @luizirber @camillescott @standage this PR is now ready for merge into master and (IMO) should be given high priority for review. @camillescott's prediction of many merge conflicts was mistaken, thank goodness - only one small conflict to be resolved! This merge significantly changes the details around ACTGN handling and will require a bump to khmer 3.0. |
|
Tests pass! |
standage
approved these changes
Jun 2, 2017
Looks good to me. As I understand it, khmer now makes no attempt to clean up kmers that are passed to the sketch.add() function, and does a [^ACGT] --> A conversion for any bulk sequence loading code. Correct?
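The [^ACGT] --> A conversion summarized above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not khmer's actual implementation: the helper name clean_sequence is hypothetical, and upper-casing the read before substitution is an assumption.

```python
import re

# Hypothetical helper, not part of khmer's API: it only mimics the
# "[^ACGT] --> A" cleanup described in the review comment.
_NON_ACGT = re.compile(r"[^ACGT]")


def clean_sequence(seq):
    """Return seq upper-cased, with every non-ACGT character replaced by 'A'.

    Upper-casing first is an assumption made for this sketch.
    """
    return _NON_ACGT.sub("A", seq.upper())


if __name__ == "__main__":
    # An N in the middle of a read becomes an A; the length is unchanged.
    print(clean_sequence("acgtNacgt"))
```

Because the substitution is one character for one character, the cleaned read keeps its length, so k-mer positions in the cleaned read line up with positions in the original.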
+ # because different hash functions do different things with
+ # non-ACTG characters. So all we want to do is verify that the
+ # functions execute w/o error on the k-mers before the "bad" DNA,
+ # and don't return positions in the "good" DNA.
standage Jun 2, 2017 Contributor
The meaning of the phrase "and don't return positions in the 'good' DNA" is unclear to me.
ctb
Jun 2, 2017
Owner
The behavior of the hash functions on ACTG should all be the same with respect to "good" sequences, and there should be no errors in there to trim at.
|
Also, does the partitioning code now output reads containing Ns @ctb? |
|
@standage yes; re partitioning and ignoring of reads containing N, see: this removed line |
ctb commented Jan 24, 2017 (edited)
This removes check_and_process_read from Hashtable, and eliminates sequence cleaning from non-bulk-loading code. All such Python-accessible code should use the cleaned_seq attribute that is available from both the ReadParser code / Read objects and the khmer.utils.broken_paired_reader code.
See #1541 (comment) for background context.
Prior to this PR,
- consume_fasta & other bulk sequence loading functions automatically upper-cased DNA and ignored any sequence (no matter how long) that contained non-ACGT.
- trim_on_abundance and trim_below_abundance similarly upper-cased DNA and ignored non-ACGT-containing strings.
- find_spectral_error_positions raised an exception on non-ACGT-containing strings.
This PR updates the test code introduced in #1633 to match the new behavior.
This PR also includes #1661.
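To make the change in bulk loading concrete, here is a rough pure-Python sketch that contrasts the prior behavior (skip any read containing non-ACGT) with the behavior summarized in review (replace non-ACGT with A and keep the read). The function names are hypothetical, a Counter stands in for a khmer table, and the replacement rule is taken from the review summary in this thread rather than from the khmer source.

```python
import re
from collections import Counter

NON_ACGT = re.compile(r"[^ACGT]")


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def load_skipping_invalid(reads, k):
    """Prior behavior as described: upper-case each read and ignore the
    whole read if it contains any non-ACGT character."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        if NON_ACGT.search(seq):
            continue  # the entire read is dropped, however long it is
        counts.update(kmers(seq, k))
    return counts


def load_cleaning(reads, k):
    """Assumed new behavior: replace non-ACGT with 'A' and count all k-mers."""
    counts = Counter()
    for read in reads:
        seq = NON_ACGT.sub("A", read.upper())
        counts.update(kmers(seq, k))
    return counts


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGNAC"]
    print(load_skipping_invalid(reads, 3))
    print(load_cleaning(reads, 3))
```

With the two reads in the usage example, the skipping loader counts k-mers from the first read only, while the cleaning loader also counts the k-mers of the second read after its N has been turned into an A.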
- make test: Did it pass the tests?
- make clean diff-cover: If it introduces new functionality in scripts/ is it tested?
- make format diff_pylint_report cppcheck doc pydocstyle: Is it well formatted?
additions are allowed without a major version increment. Changing file
formats also requires a major version number increment.
documented in CHANGELOG.md? See keepachangelog for more details.
changes were made?
tested for streaming IO?)