Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test use of numpy array instead of str for storing BiologicalSequence._sequence #60

Closed
gregcaporaso opened this issue Jan 26, 2014 · 12 comments

Comments

@gregcaporaso
Copy link
Contributor

Suggested by @wasade. I did some initial testing, and it shouldn't be hard to make this change. See discussion on #53, but in particular:

#53 (diff)
#53 (diff)

@Jorge-C
Copy link
Contributor

Jorge-C commented Jan 26, 2014

If self._sequence becomes a numpy ndarray, this will rise a TypeError: unhashable type: 'numpy.ndarray'

@Jorge-C
Copy link
Contributor

Jorge-C commented Jan 26, 2014

A few more comments on working with numpy:

  • Instead of s = array(list("AATTGGCC")), this doesn't need the intermediate list: np.{fromstring|array}("AATTGGCC", dtype='c') (I haven't tested which one is faster).
  • map(ord, s) where s is a numpy array can be rewritten to take no time nor memory: s.view(np.uint8).
  • An important caveat with advanced indexing (ie, array[[1, 2]]) is that it's slower (around 3-5x) than using np.take.
  • To perform rev-comp, we should benchmark np.char.translate and a LUT. Other options?

@wasade
Copy link
Collaborator

wasade commented Jan 27, 2014

@Jorge-C there was a further discussion with additional examples of numpy approaches in #59. Definitely agree re: fromstring or array. s.view is the way to go. I didn't realize that indexing was slower than .take, thank you for pointing that out. Also didn't realize there was a np.char.translate -- it is likely pretty optimized but agree its worth a quick bench. The nice thing about doing this also entirely in numpy is that it gets even easier to drop to Cython.

Regarding the hashing, that is a very good point. Looks like one way to go is to hash a view

@Jorge-C
Copy link
Contributor

Jorge-C commented Jan 27, 2014

Thanks @wasade I had missed that discussion.

Hashing arrays like that is a nifty trick!

@Jorge-C
Copy link
Contributor

Jorge-C commented Feb 6, 2014

I've done a bunch of barebones Sequence classes, using different approaches (cython, LUT, translation tables, etc) just to get a taste of what can be done. Here's the gist and a link to view it in nbviewer.

The general idea is that numpy+cython win, and the fastest approach there would rc "O(1e6)" sequences in roughly "O(1)" seconds. Ah, and don't use np.char.translate, I didn't even bother to include it because it was slower than a pure python list comprehension). Please point out optimizations/mistakes!

@wasade
Copy link
Collaborator

wasade commented Feb 6, 2014

That is awesome! I think there are some clear winners and losers. The code
is very clean too. BTW, I didn't realize you could inline cython's
annotate... very cool.

On Wed, Feb 5, 2014 at 10:31 PM, Jorge Cañardo Alastuey <
notifications@github.com> wrote:

I've done a bunch of barebones Sequence classes, using different
approaches (cython, LUT, translation tables, etc) just to get a taste of
what can be done. Here's the gisthttps://gist.github.com/Jorge-C/d51b48d3e18897c46ea2and a link to view it in
nbviewerhttp://nbviewer.ipython.org/urls/gist.github.com/Jorge-C/d51b48d3e18897c46ea2/raw/73d7e11e4b72d6ba90e0021931afa230e63031e9/cython+sequences.ipynb?create=1
.

The general idea is that numpy+cython win, and the fastest approach there
would rc "O(1e6)" sequences in roughly "O(1)" seconds. Ah, and don't use
np.char.translate, I didn't even bother to include it because it was
slower than a pure python list comprehension). Please point out
optimizations/mistakes!

Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-34293900
.

@Jorge-C
Copy link
Contributor

Jorge-C commented Feb 6, 2014

Ups, there isn't a version using np.take/advanced indexing (despite being
faster than plain iterative python it doesn't come close to the faster
ones).

Nonetheless, I'd like to see a more complete benchmark (maybe including
loading sequences from the file?) because converting strings to numpy
arrays isn't free (and the shorter the worse).

Another approach that I don't really know how to write in cython would be
to iterate the string and have a switch statement to do the actual rev-comp
(kind of C/C++). It could avoid instantiation costs and still be fast
rev-complementing.

2014-02-05 Daniel McDonald notifications@github.com:

That is awesome! I think there are some clear winners and losers. The code
is very clean too. BTW, I didn't realize you could inline cython's
annotate... very cool.

On Wed, Feb 5, 2014 at 10:31 PM, Jorge Cañardo Alastuey <
notifications@github.com> wrote:

I've done a bunch of barebones Sequence classes, using different
approaches (cython, LUT, translation tables, etc) just to get a taste of
what can be done. Here's the gist<
https://gist.github.com/Jorge-C/d51b48d3e18897c46ea2>and a link to view
it in
nbviewer<
http://nbviewer.ipython.org/urls/gist.github.com/Jorge-C/d51b48d3e18897c46ea2/raw/73d7e11e4b72d6ba90e0021931afa230e63031e9/cython+sequences.ipynb?create=1>

.

The general idea is that numpy+cython win, and the fastest approach
there
would rc "O(1e6)" sequences in roughly "O(1)" seconds. Ah, and don't use
np.char.translate, I didn't even bother to include it because it was
slower than a pure python list comprehension). Please point out
optimizations/mistakes!

Reply to this email directly or view it on GitHub<
https://github.com/biocore/bipy/issues/60#issuecomment-34293900>
.


Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-34294773
.

@rob-knight
Copy link

These comparisons are very impressive — thanks for putting this together!

Rob

On Feb 5, 2014, at 10:31 PM, Jorge Cañardo Alastuey <notifications@github.commailto:notifications@github.com> wrote:

I've done a bunch of barebones Sequence classes, using different approaches (cython, LUT, translation tables, etc) just to get a taste of what can be done. Here's the gisthttps://gist.github.com/Jorge-C/d51b48d3e18897c46ea2 and a link to view it in nbviewerhttp://nbviewer.ipython.org/urls/gist.github.com/Jorge-C/d51b48d3e18897c46ea2/raw/73d7e11e4b72d6ba90e0021931afa230e63031e9/cython+sequences.ipynb?create=1.

The general idea is that numpy+cython win, and the fastest approach there would rc "O(1e6)" sequences in roughly "O(1)" seconds. Ah, and don't use np.char.translate, I didn't even bother to include it because it was slower than a pure python list comprehension). Please point out optimizations/mistakes!


Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-34293900.

@gregcaporaso
Copy link
Contributor Author

Yes, this is really useful, thanks @Jorge-C. I'm not going to have time to work on this code today, but will tomorrow.

Also, everyone, remember that there are still unassigned core objects and very frequently used functionality that we want to port from PyCogent.

@wasade
Copy link
Collaborator

wasade commented Feb 6, 2014

@Jorge-C I agree about the object instantiation and it is worth testing, though even if there is a small overhead there the benefits of using numpy with cython likely will still outperform at the end of the day. One other thing that could be done to reduce overhead would be to implement the fasta/fastq parsers in cython, which would further streamline this whole process

@wasade
Copy link
Collaborator

wasade commented Feb 6, 2014

re: the switch statement, you'd have to inline C I believe. But, I really do think the overall benefits of numpy here outweigh everything,

@jairideout jairideout removed this from the 0.4.0: Beta Release milestone Mar 9, 2015
@jairideout
Copy link
Contributor

Fixed in #920. Still more performance benchmarking and improvements to make (part of the upcoming sprint's goals) but ndarray is used internally in Sequence objects now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants