C++ library design #1592

Open
standage opened this Issue Jan 25, 2017 · 3 comments

Comments

Projects
None yet
2 participants
Contributor

standage commented Jan 25, 2017 edited

The vast majority of khmer's functionality depends on two data data structures (sketches?): Bloom filter and Count-min sketch. And REALLY these are pretty much the same data structure at the conceptual and interface level, the only difference being whether the internal implementation stores k-mer counts or binary k-mer presence/absence values.

The khmer code base now includes several variations on this core data structure: different hash functions, with or without graph traversal, etc., with plans for even more. Supporting all of these variations long term is going to require some investment in refactoring, I think. Adding new functionality (like new hash functions or support for new input formats) currently involves touching a lot of code, and has proven more difficult and time intensive than anticipated. Plus, with our medium- to long-term plan to transition to Cython for the C++/Python glue, I think there's occasion to consider what we could be doing more cleanly at the C++ level (example). (I of course acknowledge the tension between engineering the software and doing useful stuff with the software, a perennial source of headache as @ctb so eloquently remarked in his latest Moore Foundation report.)

In this issue I just wanted to collect all of the relevant components and dimensions we need to consider, as a reference for future work on the core library. Anything I'm missing?

  • storage
    • 8 bits per entry (k-mer counts)
    • 4 bits per entry (small k-mer counts)
    • 1 bit per entry (k-mer presence/absence)
    • potentially others (such as 16 bit counters)?
  • data input format
    • Fasta/Fastq
    • BAM (also supported by SeqAn)
  • strategies for handling non-ACGT characters in the input
    • ignore entire sequences containing non-ACGT
    • ignore only k-mers containing non-ACGT
    • convert non-ACGT characters to A
    • regardless of strategy:
      • collect and report statistics on the number of reads / k-mers ignored or modified
      • preserve original read sequence for any output produced, regardless of how read is used internally
  • hash functions
    • reversible hash supporting graph traversal
    • non-reversible hash supporting k>32
    • strand-specific hash function?
Owner

ctb commented Jan 25, 2017

Contributor

standage commented Jan 25, 2017

I'm positive about refactoring and so on if it going to simplify, enable better correctness, and speed up R&D.

Of course. I still do occasionally take time to work on khmer for khmer's sake, but recently my contributions have all be in support of what I need to get done with kevlar. I'm not suggesting we reorganize (or redisorganize :-) for its own sake, but to simplify, to bring the model reflected in our code into better alignment with our conceptual model. To get rid of duplicated code, to decouple classes that don't need to be tightly coupled. I think this will do a lot to speed up R&D.

In principle, I agree with the sentiment that gradual refactoring is preferable to large-scale refactoring. That said, recent experience shows that some (even small) changes end up touching a lot of code, and require a lot of discipline not to try to Solve All The Problems.

Owner

ctb commented Jan 25, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment