test use of numpy array instead of str for storing BiologicalSequence._sequence #60
Comments
If |
A few more comments on working with numpy:
@Jorge-C there was a further discussion with additional examples of numpy approaches in #59. Definitely agree re: Regarding the hashing, that is a very good point. Looks like one way to go is to hash a view |
Thanks @wasade I had missed that discussion. Hashing arrays like that is a nifty trick! |
I've done a bunch of barebones Sequence classes, using different approaches (cython, LUT, translation tables, etc) just to get a taste of what can be done. Here's the gist and a link to view it in nbviewer. The general idea is that numpy+cython win, and the fastest approach there would rc "O(1e6)" sequences in roughly "O(1)" seconds. Ah, and don't use |
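For readers skimming the thread, the lookup-table (LUT) idea mentioned above can be sketched in plain numpy roughly as follows. This is not the code from the gist: `COMPLEMENT_LUT` and `reverse_complement` are made-up names, only unambiguous DNA characters are mapped, and any other byte is passed through unchanged.

```python
import numpy as np

# 256-entry table indexed by byte value; identity by default.
COMPLEMENT_LUT = np.arange(256, dtype=np.uint8)
for base, comp in zip(b"ACGTacgt", b"TGCAtgca"):
    COMPLEMENT_LUT[base] = comp


def reverse_complement(seq_bytes):
    """Reverse-complement an ASCII DNA sequence given as bytes."""
    arr = np.frombuffer(seq_bytes, dtype=np.uint8)
    # Reverse with a strided view, then complement with one fancy-indexing pass.
    return COMPLEMENT_LUT[arr[::-1]].tobytes()


rc = reverse_complement(b"AACG")  # b"CGTT"
```

The whole operation is a single vectorized table lookup, which is why the LUT variants do well without any per-character Python code.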
That is awesome! I think there are some clear winners and losers. The code
Oops, there isn't a version using Nonetheless, I'd like to see a more complete benchmark (maybe including Another approach that I don't really know how to write in cython would be
These comparisons are very impressive, thanks for putting this together! Rob On Feb 5, 2014, at 10:31 PM, Jorge Cañardo Alastuey <notifications@github.com> wrote: I've done a bunch of barebones Sequence classes, using different approaches (cython, LUT, translation tables, etc) just to get a taste of what can be done. Here's the gist (https://gist.github.com/Jorge-C/d51b48d3e18897c46ea2) and a link to view it in nbviewer (http://nbviewer.ipython.org/urls/gist.github.com/Jorge-C/d51b48d3e18897c46ea2/raw/73d7e11e4b72d6ba90e0021931afa230e63031e9/cython+sequences.ipynb?create=1). The general idea is that numpy+cython win, and the fastest approach there would rc "O(1e6)" sequences in roughly "O(1)" seconds. Ah, and don't use np.char.translate (I didn't even bother to include it because it was slower than a pure python list comprehension). Please point out optimizations/mistakes! |
Yes, this is really useful, thanks @Jorge-C. I'm not going to have time to work on this code today, but will tomorrow. Also, everyone, remember that there are still unassigned core objects and very frequently used functionality that we want to port from PyCogent. |
@Jorge-C I agree about the object instantiation and it is worth testing, though even if there is a small overhead there the benefits of using numpy with cython likely will still outperform at the end of the day. One other thing that could be done to reduce overhead would be to implement the fasta/fastq parsers in cython, which would further streamline this whole process |
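To make the parser point concrete, here is a rough pure-Python sketch of a FASTA reader that hands back numpy arrays directly, so no intermediate `str`-based sequence object is built. It is only an illustration under stated assumptions: `parse_fasta` and `_to_array` are hypothetical names, the input is assumed to be ASCII, and a cython version would replace the line loop rather than look like this.

```python
import io

import numpy as np


def _to_array(chunks):
    """Join sequence lines and encode them as a uint8 array."""
    return np.frombuffer("".join(chunks).encode("ascii"), dtype=np.uint8)


def parse_fasta(handle):
    """Yield (label, uint8 array) pairs from a FASTA-formatted text handle."""
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if label is not None:
                yield label, _to_array(chunks)
            label, chunks = line[1:], []
        else:
            chunks.append(line)
    if label is not None:
        yield label, _to_array(chunks)


records = list(parse_fasta(io.StringIO(">s1\nACGT\nAC\n>s2\nGG\n")))
```

Each record's array can then be passed straight to numpy or cython routines without another conversion step.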
re: the switch statement, you'd have to inline C I believe. But, I really do think the overall benefits of numpy here outweigh everything, |
Fixed in #920. Still more performance benchmarking and improvements to make (part of the upcoming sprint's goals) but |
Suggested by @wasade. I did some initial testing, and it shouldn't be hard to make this change. See discussion on #53, but in particular:
#53 (diff)
#53 (diff)