Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hash functions supporting k > 32 to Counttable #1511

Merged
merged 116 commits into from
Feb 14, 2017
Merged

Conversation

ctb
Copy link
Member

@ctb ctb commented Nov 12, 2016

This follows from #1504 and adds support for irreversible hash functions (supporting k > 32).

Specifically, in this PR we:

  • update Counttable to use murmurhash and thus supported k > 32;
  • introduce a MurmurHashKmerIterator for hashing strings;
  • use parameterized tests to test the CPython API for _Countgraph, _Counttable, _SmallCountgraph, _SmallCounttable _Nodegraph, and _Nodetable;
  • provide some minimal documentation for the Python API for tables;

Basically this is many small changes that are now reasonably well tested and understood, taking advantage of the refactoring done in #1504.

  • Is it mergeable?
  • make test Did it pass the tests?
  • make clean diff-cover If it introduces new functionality in
    scripts/ is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle Is it well
    formatted?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
    http://en.wikipedia.org/wiki/Changelog#Format
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)
  • Is the Copyright year up to date?

@codecov-io
Copy link

codecov-io commented Nov 12, 2016

Codecov Report

❗ No coverage uploaded for pull request base (master@8c2b32c). Click here to learn what that means.
The diff coverage is 34.09%.

@@            Coverage Diff            @@
##             master    #1511   +/-   ##
=========================================
  Coverage          ?   69.79%           
=========================================
  Files             ?       66           
  Lines             ?     8966           
  Branches          ?     3059           
=========================================
  Hits              ?     6258           
  Misses            ?     1025           
  Partials          ?     1683
Impacted Files Coverage Δ
khmer/_cpy_nodetable.hh 50% <ø> (ø)
khmer/init.py 96.87% <100%> (ø)
lib/hllcounter.cc 60.6% <100%> (ø)
khmer/_khmer.cc 57.42% <22.33%> (ø)
lib/hashtable.cc 55.39% <32.43%> (ø)
lib/hashtable.hh 74.13% <36.36%> (ø)
lib/kmer_hash.cc 86.27% <50%> (ø)
lib/storage.cc 47.89% <50%> (ø)
lib/storage.hh 89.79% <75%> (ø)
lib/kmer_hash.hh 86.95% <80%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8c2b32c...be56172. Read the comment docs.

@ctb
Copy link
Member Author

ctb commented Nov 12, 2016

w00t!

This code:

import khmer

a = khmer.Counttable(5, 1e6, 3)
b = khmer.Countgraph(5, 1e6, 3)

print(a.get_kmer_hashes("ATGGAGGACAAGTAT"))
print(b.get_kmer_hashes("ATGGAGGACAAGTAT"))

print(a.hash('ATGGA'))
print(b.hash('ATGGA'))

now yields:

[2898244513215582206, 7782325612895303798, 17716873001369615262, 5266100104194697999, 16527671766578686784, 1954197895523434265, 9372126835751090531, 2912821416709855779, 8698352977978348149, 13735319396097939869]
[124, 499, 666, 422, 242, 474, 374, 131, 151, 52, 73]
2898244513215582206
124

so we're finally able to swap out hash functions on Hashtable classes!

@ctb
Copy link
Member Author

ctb commented Nov 12, 2016

(and reversibility is no longer a requirement on Hashtable derived classes, only on Hashgraph derivatives.)

@ctb
Copy link
Member Author

ctb commented Nov 12, 2016

So this now works, too:

import khmer

a = khmer.Counttable(33, 1e6, 3)

print(a.get_kmer_counts("ATGGATATGGAGGACAAGTATATGGAGGACAAGTATATGGAGGACAAGTAT"))
print(a.consume("ATGGATATGGAGGACAAGTATATGGAGGACAAGTATATGGAGGACAAGTAT"))
print(a.get_kmer_counts("ATGGATATGGAGGACAAGTATATGGAGGACAAGTATATGGAGGACAAGTAT"))

@betatim
Copy link
Member

betatim commented Nov 14, 2016

What am I missing here:

s = "ATGGATATGGAGGACAAGTATATGGAGGACAAGTATATGGAGGACAAGTAT"
a =khmer.Counttable(33, 1e6, 3)
a.get_kmer_hashes(s[:33]) # -> []
a.get_kmer_hashes(s[:34]) # -> [48584371158645721]

@ctb
Copy link
Member Author

ctb commented Nov 14, 2016 via email

lib/hashtable.cc Outdated
}


void Hashtable::get_kmer_counts(const std::string &s,
std::vector<BoundedCounterType> &counts) const
{
KmerIterator kmers(s.c_str(), _ksize);
KmerHashIterator * kmers = new_kmer_iterator(s);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would wrap it in std::unique_ptr< KmerHashIterator > kmers(new_kmer_iterator(s)) and then get rid of the delete.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably also change new_kmer_iterator to return a unique_ptr directly.

def test_counttable_no_unhash():
x = khmer.Counttable(4, 21, 3)

try:
Copy link
Member

@betatim betatim Nov 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

->

with pytest.raises(ValueError):
  x.reverse_hash(1)

@@ -148,29 +148,9 @@ static bool convert_PyObject_to_HashIntoType(PyObject * value,
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment for this function needs updating (delete last two lines)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@ctb
Copy link
Member Author

ctb commented Nov 15, 2016 via email

@ctb
Copy link
Member Author

ctb commented Nov 15, 2016 via email

@ctb
Copy link
Member Author

ctb commented Nov 15, 2016

Question for all you foolish mortals: how should we handle testing of CPython class hierarchies?

Do we just test most of the methods on the base types and then test only the derived methods on derived types, and rely on code coverage to tell us when we're missing something?

Or do we massively duplicate tests?

Or what?

@betatim
Copy link
Member

betatim commented Nov 15, 2016

I'd probably go for option (a).

Another option I like is to devise a set of tests/checks for things that are graph-ish and apply them (via a for loop) to all things that want to be graph-ish (same for table-ish, X-ish). And then you add tests for each type that test what is special about that type.

@betatim
Copy link
Member

betatim commented Nov 15, 2016

I can't work out why travis is ignoring this. According to the logs it is ignoring this "as per configuration" which isn't super helpful for debugging. One thing that might wake travis up is if you change the target of this PR to master (which it should be anyway no?)

@standage
Copy link
Member

@betatim Consider the following (from our .travis.yml file).

branches:
  only: 
    - master

@ctb
Copy link
Member Author

ctb commented Nov 15, 2016 via email

@ctb ctb changed the base branch from refactor/storage2 to master November 15, 2016 17:49
@betatim
Copy link
Member

betatim commented Feb 13, 2017

when I run the coverage tools locally coverage-gcovr.xml is empty. At some point we stopped generating gcda files when the tests run which is what gcovr needs (I think). Trying to find a "good" commit to git bisect.

Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

* get_raw_tables
* do_subset_partition_with_abundance

Nodegraph:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this is under the "counting types" heading.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed 8f9e8b0

} catch (khmer_exception &e) {
PyErr_SetString(PyExc_ValueError, e.what());
return NULL;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

w00t!

@@ -0,0 +1,442 @@
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are all pretty simple and cover the basic features well.

@@ -509,6 +507,53 @@ const
return posns;
}

class MurmurKmerHashIterator : public KmerHashIterator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be in lib/kmer_hash.cc (or lib/kmer_hash.hh)?

else:
raise Exception("unknown tabletype")

z = kh.get('ATGGC')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this should be loaded.get shouldn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed dc6b287 - thanks!

@betatim betatim mentioned this pull request Feb 14, 2017
8 tasks
@ctb
Copy link
Member Author

ctb commented Feb 14, 2017 via email

@ctb
Copy link
Member Author

ctb commented Feb 14, 2017 via email

@luizirber
Copy link
Member

luizirber commented Feb 14, 2017 via email

@ctb
Copy link
Member Author

ctb commented Feb 14, 2017 via email

@luizirber
Copy link
Member

luizirber commented Feb 14, 2017 via email

@ctb
Copy link
Member Author

ctb commented Feb 14, 2017

The Mac OS X build is holding this up. Can we merge, pretty please? :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants