make C++ code that mimics the utils sequence handling code #1483

ctb · 2016-10-06T04:03:56Z

I feel like the code in utils.py introduced in #1435 is headed in the right direction. It'd be nice to have C++ versions of the ReadBundle and broken_paired_iter functions, for speed.

ctb · 2016-10-06T13:26:46Z

#769 is relevant.

betatim · 2016-10-21T12:49:59Z

I'll take a look at this.

betatim · 2016-11-15T06:50:10Z

Regarding doing things "for speed": using khmer.ReadParser instead of screed speeds up broken_paired_reader by a bit. The next slowest thing is check_is_pair.

Not sure there is any benefit in moving ReadBundle to c++, at least not for speed.
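For readers who have not looked at utils.py, here is a rough sketch of the kind of name comparison a pair check has to do. This is not khmer's check_is_pair; the helper names split_name and looks_like_pair are made up for illustration, and only two common naming conventions are assumed (frag/1 + frag/2, and Casava 1.8 style "frag 1:..." + "frag 2:...").

```python
def split_name(name):
    """Split a read name into (identifier, rest) at the first whitespace."""
    parts = name.split(None, 1)
    if not parts:
        return "", ""
    return parts[0], (parts[1] if len(parts) > 1 else "")


def looks_like_pair(name1, name2):
    """Return True if two read names look like mates of one pair.

    Hypothetical helper covering two conventions only:
      * 'frag/1' followed by 'frag/2'
      * 'frag 1:...' followed by 'frag 2:...' (Casava 1.8 style)
    """
    id1, rest1 = split_name(name1)
    id2, rest2 = split_name(name2)
    if not id1 or not id2:
        return False
    # 'frag/1' and 'frag/2': the part before the suffix must match.
    if id1.endswith("/1") and id2.endswith("/2"):
        return id1[:-2] == id2[:-2] and id1[:-2] != ""
    # 'frag 1:...' and 'frag 2:...': identical identifier, mate number in the rest.
    if id1 == id2 and rest1.startswith("1:") and rest2.startswith("2:"):
        return True
    return False
```

Even in this reduced form each call does several splits and string comparisons per pair of records, which may be part of why a check like this shows up in a profile once parsing itself gets faster.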

ctb · 2016-11-15T13:59:38Z

can broken_paired_reader be converted into C++, do you think?

betatim · 2016-11-16T09:27:02Z

You could. What is the motivation? Do you want to use it from other c++ code?

If you build a larger input with

for i in {1..2000}; do cat tests/test-data/paired-mixed.fq >> /tmp/paired-mixed.fq; done

and run kernprof -l -v scripts/extract-paired-reads.py /tmp/paired-mixed.fq on it, you go from 1.3s to 0.5s by switching the read parser. After applying that switch, about 44% of the time in broken_paired_reader goes to the for loop and 35% to calling check_is_pair.

ctb · 2016-11-16T13:49:16Z

speed.

here's an evaluation file that's a bit bigger for you to play with :)

https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m.fastq.gz

betatim · 2016-11-16T14:31:49Z

http://gph.is/1TKQFaC

because I liked it so much. I'll take a look.

While poking around the other day I found remnants like https://github.com/dib-lab/khmer/blob/master/lib/read_parsers.hh#L111-L115 and https://github.com/dib-lab/khmer/blob/master/lib/read_parsers.cc#L204-L208. Is there any memory of that still around? It looks like it once was the start of something similar to broken_paired_reader.

ctb · 2016-11-16T14:33:11Z

On Wed, Nov 16, 2016 at 06:31:50AM -0800, Tim Head wrote:

> http://gph.is/1TKQFaC
>
> because I liked it so much. I'll take a look.
>
> While poking around the other day I found remnants like https://github.com/dib-lab/khmer/blob/master/lib/read_parsers.hh#L111-L115 and https://github.com/dib-lab/khmer/blob/master/lib/read_parsers.cc#L204-L208 is there any memory on that still around? It looks like it once was the start of something similar to broken_paired_reader

Cleanup on aisle 7! Cleanup on aisle 7!

(That code should be removed.)

betatim · 2016-11-17T13:12:38Z

A broken_paired_reader lives in #1502. Do we wait for the cython revolution or do our own thing in the meantime?

ctb · 2016-11-17T13:37:46Z

make the argument for incorporating the auto-generated C++ code :).

betatim · 2016-11-18T10:42:33Z

Made this little comparison and was surprised at how slow GzipFile is, and that ReadParser is essentially no faster than zcat + python. Output of running the script:

<function read_readparser at 0x101cee598> 23.03s
<function read_plain at 0x101cee400> 21.17s
<function read_gzcat at 0x101cee488> 20.80s
<function read_gzip at 0x101cee510> 84.69s

scripts/extract-paired-reads.py ecoli_ref-5m.fastq.gz -s ecoli_ref-5m.fastq.se -p /dev/null takes about 75s (65s if you use an already gunzipped fastq)

Just reading through the file (for line in file: pass) takes a few seconds.
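A comparison along these lines could be set up roughly as below. This is only a sketch, not the script that produced the numbers above: the function names mirror the ones in the output, but their bodies are assumptions (each just counts lines), read_readparser is left out because it needs khmer installed, and read_gzcat assumes a gzip binary on the PATH.

```python
import gzip
import subprocess
import time


def read_plain(path):
    """Count lines in an already uncompressed file."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def read_gzip(path):
    """Count lines while decompressing in-process with the gzip module."""
    with gzip.open(path, "rb") as handle:
        return sum(1 for _ in handle)


def read_gzcat(path):
    """Count lines from an external 'gzip -dc' process piped into Python."""
    proc = subprocess.Popen(["gzip", "-dc", path], stdout=subprocess.PIPE)
    with proc.stdout as stream:
        count = sum(1 for _ in stream)
    if proc.wait() != 0:
        raise RuntimeError("gzip -dc failed")
    return count


def time_reader(func, path):
    """Return (line_count, elapsed_seconds) for one pass over the file."""
    start = time.perf_counter()
    count = func(path)
    return count, time.perf_counter() - start
```

Calling time_reader(read_gzip, "ecoli_ref-5m.fastq.gz") and the same for the other readers gives one timing per strategy; the absolute numbers will depend on the machine and the Python version.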

betatim · 2016-11-18T13:17:32Z

> make the argument for incorporating the auto-generated C++ code :).

I'm taking that as a "we do our own thing". Working on that now.

ctb · 2016-11-18T13:47:21Z

On Fri, Nov 18, 2016 at 05:17:33AM -0800, Tim Head wrote:

> > make the argument for incorporating the auto-generated C++ code :).
>
> I'm taking that as a "we do our own thing". Working on that now.

well, no... if the Cython code looks fine and compiles across platforms,
then just copy it in and edit it appropriately, add tests. We just have
to take ownership of the C++ output rather than the Cython code for 2.1.

betatim · 2016-11-18T14:09:37Z

What do you mean by "take ownership"? Adding it to the repo directly is easy, editing it by hand is ~impossible. It won't add a runtime dependency on cython, but there will be a new dev dependency.

Currently you distribute wheels, right? So it might even be that nothing would change for pip install khmer.

ctb · 2016-11-18T14:11:29Z

On Fri, Nov 18, 2016 at 06:09:37AM -0800, Tim Head wrote:

> What do you mean with take ownership? Adding it to the repo directly is easy, editing it by hand is ~impossible. It won't add a runtime dependency on cython though but there will be a new dev dependency.

Let's give it a try and take a close look at any new headers that are
introduced. I've been having trouble with @camillescott's branch but
that seems to be more the c++11 headers than anything else.

betatim · 2016-11-18T14:45:52Z

What is the deal with:

rp = khmer.ReadParser("myfile")
for pair in rp.iter_read_pairs():
  ...stuff...

souping that up to support broken paired reads seems like the way to go if we want small movements? (or is there a trap here that I haven't spotted?)
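To make "support broken paired reads" concrete, here is a minimal pure-Python sketch of the iteration logic, assuming records arrive as (name, sequence) tuples in file order. The function broken_pairs and its default pair test are hypothetical and are not the khmer API; the real broken_paired_reader in utils.py has more options that are not reproduced here.

```python
def broken_pairs(records, is_pair=None):
    """Yield (read1, read2) for proper pairs and (read, None) for orphans.

    `records` is any iterable of (name, sequence) tuples in file order.
    `is_pair` is an optional callable taking two records.
    """
    if is_pair is None:
        # Hypothetical default: only the 'frag/1', 'frag/2' convention.
        def is_pair(a, b):
            return (a[0].endswith("/1") and b[0].endswith("/2")
                    and a[0][:-2] == b[0][:-2])

    previous = None
    for record in records:
        if previous is None:
            # Nothing buffered yet: hold this record until we see the next one.
            previous = record
        elif is_pair(previous, record):
            yield previous, record
            previous = None
        else:
            # The buffered record has no mate; emit it as an orphan.
            yield previous, None
            previous = record
    if previous is not None:
        # A trailing record without a mate.
        yield previous, None


reads = [("a/1", "AC"), ("a/2", "GT"), ("b/1", "AA"), ("c/1", "CC"), ("c/2", "GG")]
for first, second in broken_pairs(reads):
    print(first[0], second[0] if second else None)
```

The loop only ever buffers one record, so the same structure should carry over to a C++ or Cython iterator that wraps the parser and looks one read ahead.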

ctb · 2016-11-18T14:46:49Z

On Fri, Nov 18, 2016 at 06:45:52AM -0800, Tim Head wrote:

> What is the deal with:
>
> rp = khmer.ReadParser("myfile")
> for pair in rp.iter_read_pairs():
>   ...stuff...
>
> souping that up to support broken paired reads seems like the way to go if we want small movements? (or is there a trap here that I haven't spotted?)

sounds good.

ctb · 2017-02-19T19:25:05Z

The cython revolution in #1595 will apparently fix all these things.

betatim mentioned this issue Oct 24, 2016

expose khmer.Read to python and use ReadParser in scripts #1491

Merged

8 tasks

standage mentioned this issue Nov 17, 2016

Drop "allow" mode for parsing paired reads #1534

Merged

9 tasks

This was referenced Jan 25, 2017

Add cleaned_seq attribute to Read class #1591

Merged

Add cleaned_seq attribute to Record dib-lab/screed#66

Closed
