Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make C++ code that mimics the utils sequence handling code #1483

Open
ctb opened this issue Oct 6, 2016 · 18 comments
Open

make C++ code that mimics the utils sequence handling code #1483

ctb opened this issue Oct 6, 2016 · 18 comments

Comments

@ctb
Copy link
Member

ctb commented Oct 6, 2016

I feel like the code in utils.py introduced #1435 is headed in the right direction. It'd be nice to have C++ versions of the ReadBundle and broken_paired_iter functions, for speed.

@ctb
Copy link
Member Author

ctb commented Oct 6, 2016

#769 is relevant.

@betatim
Copy link
Member

betatim commented Oct 21, 2016

I'll take a look at this.

@betatim
Copy link
Member

betatim commented Nov 15, 2016

Regarding doing things "for speed": using khmer.ReadParser instead of screed speeds up broken_pair_reader by a bit. Next slowest thing is check_is_pair.

Not sure there is any benefit in moving ReadBundle to c++, at least not for speed.

@ctb
Copy link
Member Author

ctb commented Nov 15, 2016

can broken_paired_reader be converted into C++, do you think?

@betatim
Copy link
Member

betatim commented Nov 16, 2016

You could. What is the motivation? Do you want to use it from other c++ code?


If you run kernprof -l -v scripts/extract-paired-reads.py /tmp/paired-mixed.fq on the output of
for i in {1..2000}; do cat tests/test-data/paired-mixed.fq >> /tmp/paired-mixed.fq ;done you go from 1.3s to 0.5s by switching the read parser. After applying that switch you spend about 44% of your time in broken_paired_reader on the for loop, and 35% calling check_is_pair.

@ctb
Copy link
Member Author

ctb commented Nov 16, 2016

speed.

here's an evaluation file that's a bit bigger for you to play with :)

https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m.fastq.gz

@betatim
Copy link
Member

betatim commented Nov 16, 2016

http://gph.is/1TKQFaC

because I liked it so much. I'll take a look.

While poking around the other day I found remnants like https://github.com/dib-lab/khmer/blob/master/lib/read_parsers.hh#L111-L115 and https://github.com/dib-lab/khmer/blob/master/lib/read_parsers.cc#L204-L208 is there any memory on that still around? It looks like it once was the start of something similar to broken_paired_reader

@ctb
Copy link
Member Author

ctb commented Nov 16, 2016

On Wed, Nov 16, 2016 at 06:31:50AM -0800, Tim Head wrote:

http://gph.is/1TKQFaC

because I liked it so much. I'll take a look.

While poking around the other day I found remnants like https://github.com/dib-lab/khmer/blob/master/lib/read_parsers.hh#L111-L115 and https://github.com/dib-lab/khmer/blob/master/lib/read_parsers.cc#L204-L208 is there any memory on that still around? It looks like it once was the start of something similar to broken_paired_reader

Cleanup on aisle 7! Cleanup on aisle 7!

*That code should be removed.)

@betatim
Copy link
Member

betatim commented Nov 17, 2016

A broken_paired_reader lives in #1502 do we wait for the cython revolution or do our own thing in the meantime?

@ctb
Copy link
Member Author

ctb commented Nov 17, 2016 via email

@betatim
Copy link
Member

betatim commented Nov 18, 2016

Made this little comparison and was surprised at how slow GzipFile is, and that ReadParser is essentially no faster than zcat + python. Output of running the script:

<function read_readparser at 0x101cee598> 23.03s
<function read_plain at 0x101cee400> 21.17s
<function read_gzcat at 0x101cee488> 20.80s
<function read_gzip at 0x101cee510> 84.69s

scripts/extract-paired-reads.py ecoli_ref-5m.fastq.gz -s ecoli_ref-5m.fastq.se -p /dev/null takes about 75s (65s if you use an already gunzipped fastq)

Just reading through the file (for line in file: pass) takes a few seconds.

@betatim
Copy link
Member

betatim commented Nov 18, 2016

make the argument for incorporating the auto-generated C++ code :).

I'm taking that as a "we do our own thing". Working on that now.

@ctb
Copy link
Member Author

ctb commented Nov 18, 2016

On Fri, Nov 18, 2016 at 05:17:33AM -0800, Tim Head wrote:

make the argument for incorporating the auto-generated C++ code :).

I'm taking that as a "we do our own thing". Working on that now.

well, no... if the Cython code looks fine and compiles across platforms,
then just copy it in and edit it appropriately, add tests. We just have
to take ownership of the C++ output rather than the Cython code for 2.1.

@betatim
Copy link
Member

betatim commented Nov 18, 2016

What do you mean with take ownership? Adding it to the repo directly is easy, editing it by hand is ~impossible. It won't add a runtime dependency on cython though but there will be a new dev dependency.

Currently you distribute wheels right? So might even be that for pip install khmer nothing would change.

@ctb
Copy link
Member Author

ctb commented Nov 18, 2016

On Fri, Nov 18, 2016 at 06:09:37AM -0800, Tim Head wrote:

What do you mean with take ownership? Adding it to the repo directly is easy, editing it by hand is ~impossible. It won't add a runtime dependency on cython though but there will be a new dev dependency.

Let's give it a try and take a close look at any new headers that are
introduced. I've been having trouble with @camillescott's branch but
that seems to be more the c++11 headers than anything else.

@betatim
Copy link
Member

betatim commented Nov 18, 2016

What is the deal with:

rp = khmer.ReadParser("myfile")
for pair in rp.iter_read_pairs():
  ...stuff...

souping that up to support broken paired reads seems like the way to go if we want small movements? (or is there a trap here that I haven't spotted?)

@ctb
Copy link
Member Author

ctb commented Nov 18, 2016

On Fri, Nov 18, 2016 at 06:45:52AM -0800, Tim Head wrote:

What is the deal with:

rp = khmer.ReadParser("myfile")
for pair in rp.iter_read_pairs():
  ...stuff...

souping that up to support broken paired reads seems like the way to go if we want small movements? (or is there a trap here that I haven't spotted?)

sounds good.

@ctb
Copy link
Member Author

ctb commented Feb 19, 2017

The cython revolution in #1595 will apparently fix all these things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants