test use of numpy array instead of str for storing BiologicalSequence._sequence #60

gregcaporaso · 2014-01-26T19:39:52Z

Suggested by @wasade. I did some initial testing, and it shouldn't be hard to make this change. See discussion on #53, but in particular:

#53 (diff)
#53 (diff)

Jorge-C · 2014-01-26T22:03:06Z

If self._sequence becomes a numpy ndarray, this will rise a TypeError: unhashable type: 'numpy.ndarray'

Jorge-C · 2014-01-26T23:48:06Z

A few more comments on working with numpy:

Instead of s = array(list("AATTGGCC")), this doesn't need the intermediate list: np.{fromstring|array}("AATTGGCC", dtype='c') (I haven't tested which one is faster).
map(ord, s) where s is a numpy array can be rewritten to take no time nor memory: s.view(np.uint8).
An important caveat with advanced indexing (ie, array[[1, 2]]) is that it's slower (around 3-5x) than using np.take.
To perform rev-comp, we should benchmark np.char.translate and a LUT. Other options?

wasade · 2014-01-27T07:10:51Z

@Jorge-C there was a further discussion with additional examples of numpy approaches in #59. Definitely agree re: fromstring or array. s.view is the way to go. I didn't realize that indexing was slower than .take, thank you for pointing that out. Also didn't realize there was a np.char.translate -- it is likely pretty optimized but agree its worth a quick bench. The nice thing about doing this also entirely in numpy is that it gets even easier to drop to Cython.

Regarding the hashing, that is a very good point. Looks like one way to go is to hash a view

Jorge-C · 2014-01-27T15:57:23Z

Thanks @wasade I had missed that discussion.

Hashing arrays like that is a nifty trick!

Jorge-C · 2014-02-06T05:31:50Z

I've done a bunch of barebones Sequence classes, using different approaches (cython, LUT, translation tables, etc) just to get a taste of what can be done. Here's the gist and a link to view it in nbviewer.

The general idea is that numpy+cython win, and the fastest approach there would rc "O(1e6)" sequences in roughly "O(1)" seconds. Ah, and don't use np.char.translate, I didn't even bother to include it because it was slower than a pure python list comprehension). Please point out optimizations/mistakes!

wasade · 2014-02-06T05:53:17Z

That is awesome! I think there are some clear winners and losers. The code
is very clean too. BTW, I didn't realize you could inline cython's
annotate... very cool.

On Wed, Feb 5, 2014 at 10:31 PM, Jorge Cañardo Alastuey <
notifications@github.com> wrote:

I've done a bunch of barebones Sequence classes, using different
approaches (cython, LUT, translation tables, etc) just to get a taste of
what can be done. Here's the gisthttps://gist.github.com/Jorge-C/d51b48d3e18897c46ea2and a link to view it in
nbviewerhttp://nbviewer.ipython.org/urls/gist.github.com/Jorge-C/d51b48d3e18897c46ea2/raw/73d7e11e4b72d6ba90e0021931afa230e63031e9/cython+sequences.ipynb?create=1
.

The general idea is that numpy+cython win, and the fastest approach there
would rc "O(1e6)" sequences in roughly "O(1)" seconds. Ah, and don't use
np.char.translate, I didn't even bother to include it because it was
slower than a pure python list comprehension). Please point out
optimizations/mistakes!

Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-34293900
.

Jorge-C · 2014-02-06T06:12:26Z

Ups, there isn't a version using np.take/advanced indexing (despite being
faster than plain iterative python it doesn't come close to the faster
ones).

Nonetheless, I'd like to see a more complete benchmark (maybe including
loading sequences from the file?) because converting strings to numpy
arrays isn't free (and the shorter the worse).

Another approach that I don't really know how to write in cython would be
to iterate the string and have a switch statement to do the actual rev-comp
(kind of C/C++). It could avoid instantiation costs and still be fast
rev-complementing.

2014-02-05 Daniel McDonald notifications@github.com:

That is awesome! I think there are some clear winners and losers. The code
is very clean too. BTW, I didn't realize you could inline cython's
annotate... very cool.

On Wed, Feb 5, 2014 at 10:31 PM, Jorge Cañardo Alastuey <
notifications@github.com> wrote:

I've done a bunch of barebones Sequence classes, using different
approaches (cython, LUT, translation tables, etc) just to get a taste of
what can be done. Here's the gist<
https://gist.github.com/Jorge-C/d51b48d3e18897c46ea2>and a link to view
it in
nbviewer<
http://nbviewer.ipython.org/urls/gist.github.com/Jorge-C/d51b48d3e18897c46ea2/raw/73d7e11e4b72d6ba90e0021931afa230e63031e9/cython+sequences.ipynb?create=1>

.

The general idea is that numpy+cython win, and the fastest approach
there
would rc "O(1e6)" sequences in roughly "O(1)" seconds. Ah, and don't use
np.char.translate, I didn't even bother to include it because it was
slower than a pure python list comprehension). Please point out
optimizations/mistakes!

Reply to this email directly or view it on GitHub<
https://github.com/biocore/bipy/issues/60#issuecomment-34293900>
.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-34294773
.

rob-knight · 2014-02-06T13:18:46Z

These comparisons are very impressive — thanks for putting this together!

Rob

On Feb 5, 2014, at 10:31 PM, Jorge Cañardo Alastuey <notifications@github.com mailto:notifications@github.com> wrote:

I've done a bunch of barebones Sequence classes, using different approaches (cython, LUT, translation tables, etc) just to get a taste of what can be done. Here's the gisthttps://gist.github.com/Jorge-C/d51b48d3e18897c46ea2 and a link to view it in nbviewerhttp://nbviewer.ipython.org/urls/gist.github.com/Jorge-C/d51b48d3e18897c46ea2/raw/73d7e11e4b72d6ba90e0021931afa230e63031e9/cython+sequences.ipynb?create=1.

The general idea is that numpy+cython win, and the fastest approach there would rc "O(1e6)" sequences in roughly "O(1)" seconds. Ah, and don't use np.char.translate, I didn't even bother to include it because it was slower than a pure python list comprehension). Please point out optimizations/mistakes!

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-34293900.

gregcaporaso · 2014-02-06T13:42:02Z

Yes, this is really useful, thanks @Jorge-C. I'm not going to have time to work on this code today, but will tomorrow.

Also, everyone, remember that there are still unassigned core objects and very frequently used functionality that we want to port from PyCogent.

wasade · 2014-02-06T15:32:35Z

@Jorge-C I agree about the object instantiation and it is worth testing, though even if there is a small overhead there the benefits of using numpy with cython likely will still outperform at the end of the day. One other thing that could be done to reduce overhead would be to implement the fasta/fastq parsers in cython, which would further streamline this whole process

wasade · 2014-02-06T15:42:32Z

re: the switch statement, you'd have to inline C I believe. But, I really do think the overall benefits of numpy here outweigh everything,

jairideout · 2015-05-08T22:04:17Z

Fixed in #920. Still more performance benchmarking and improvements to make (part of the upcoming sprint's goals) but ndarray is used internally in Sequence objects now.

ghost assigned gregcaporaso Jan 26, 2014

This was referenced Jan 26, 2014

replace BiologicalSequence._hamming_distance with call to scipy.spatial.distance.hamming #61

Closed

added first pass at BiologicalSequence object #53

Merged

gregcaporaso mentioned this issue Feb 5, 2014

Make rc a classmethod #73

Closed

marcelcaraciolo mentioned this issue Apr 25, 2014

Sequence Representation in Biopandas genomika/biopandas#2

Open

gregcaporaso added the performance-enhancement label May 12, 2014

gregcaporaso added this to the 0.2.0: pre-SciPy release milestone May 12, 2014

gregcaporaso mentioned this issue May 13, 2014

should translation be implemented as a class rather than a method? #59

Closed

This was referenced Jun 10, 2014

k_words performance #411

Closed

using maketrans instead dict for complement map? #412

Closed

gregcaporaso removed this from the 0.2.0: first alpha release milestone Jul 11, 2014

ebolyen added enhancement question labels Feb 12, 2015

ebolyen added this to the 0.4.0: Beta Release milestone Feb 12, 2015

jairideout unassigned gregcaporaso Feb 12, 2015

jairideout added chapter and removed question labels Feb 12, 2015

ebolyen mentioned this issue Feb 12, 2015

Refactor Sequence Objects #826

Closed

7 tasks

jairideout added this to the 0.3.0: Sequences, Collections, and Alignments milestone Mar 9, 2015

jairideout removed this from the 0.4.0: Beta Release milestone Mar 9, 2015

jairideout closed this as completed May 8, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test use of numpy array instead of str for storing BiologicalSequence._sequence #60

test use of numpy array instead of str for storing BiologicalSequence._sequence #60

gregcaporaso commented Jan 26, 2014

Jorge-C commented Jan 26, 2014

Jorge-C commented Jan 26, 2014

wasade commented Jan 27, 2014

Jorge-C commented Jan 27, 2014

Jorge-C commented Feb 6, 2014

wasade commented Feb 6, 2014

Jorge-C commented Feb 6, 2014

rob-knight commented Feb 6, 2014

gregcaporaso commented Feb 6, 2014

wasade commented Feb 6, 2014

wasade commented Feb 6, 2014

jairideout commented May 8, 2015

test use of numpy array instead of str for storing BiologicalSequence._sequence #60

test use of numpy array instead of str for storing BiologicalSequence._sequence #60

Comments

gregcaporaso commented Jan 26, 2014

Jorge-C commented Jan 26, 2014

Jorge-C commented Jan 26, 2014

wasade commented Jan 27, 2014

Jorge-C commented Jan 27, 2014

Jorge-C commented Feb 6, 2014

wasade commented Feb 6, 2014

Jorge-C commented Feb 6, 2014

rob-knight commented Feb 6, 2014

gregcaporaso commented Feb 6, 2014

wasade commented Feb 6, 2014

wasade commented Feb 6, 2014

jairideout commented May 8, 2015