Permalink
Browse files

PrimarySeq::length() speedup

Avoid calculating length again when length is already recorded
  • Loading branch information...
1 parent 95d1f94 commit 7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5 @fangly fangly committed Oct 29, 2012
Showing with 15 additions and 12 deletions.
  1. +13 −10 Bio/PrimarySeq.pm
  2. +2 −2 t/Seq/PrimarySeq.t
View
@@ -261,7 +261,7 @@ sub direct_seq_set {
=head2 seq
Title : seq
- Usage : $string = $obj->seq()
+ Usage : $string = $obj->seq();
Function: Returns the sequence as a string of letters. The
case of the letters is left up to the implementer.
Suggested cases are upper case for proteins and lower case for
@@ -454,22 +454,25 @@ sub subseq {
=cut
sub length {
- my $self = shift;
- my $len = CORE::length($self->seq() || '');
-
- if(@_) {
- my $val = shift;
- if(defined($val) && $len && ($len != $val)) {
+ my ($self, $val) = @_;
+ if (defined $val) {
+ my $len = CORE::length($self->seq() || '');
+ if ($len && ($len != $val)) {
$self->throw("You're trying to lie about the length: ".
"is $len but you say ".$val);
}
$self->{'_seq_length'} = $val;
- } elsif(defined($self->{'_seq_length'})) {
- return $self->{'_seq_length'};
+ return $val;
+ } else {
+ if (defined $self->{'_seq_length'}) {
+ return $self->{'_seq_length'};
+ } else {
+ return CORE::length($self->seq() || '');
+ }
}
- return $len;
}
+
=head2 display_id
Title : display_id or display_name
View
@@ -8,7 +8,7 @@ BEGIN {
use lib '.';
use Bio::Root::Test;
- test_begin( -tests => 87 );
+ test_begin( -tests => 88 );
use_ok('Bio::PrimarySeq');
use_ok('Bio::Location::Simple');
@@ -44,7 +44,7 @@ is $seq->namespace, "t";
is $seq->version(0), 0;
is $seq->lsid_string(), "bioperl.org:t:X677667";
is $seq->namespace_string(), "t:X677667.0";
-$seq->version(47);
+ok $seq->version(47);
is $seq->version, 47;
is $seq->namespace_string(), "t:X677667.47";
is $seq->description(), 'Sample Bio::Seq object';

5 comments on commit 7436a1b

Owner

hyphaltip replied Oct 29, 2012

Should you invalidate length when every seq() is ever called with an updated value?

I have to ask, is this really a performance problem?

Should you invalidate length when every seq() is ever called with an updated value?

It already was..

I have to ask, is this really a performance problem?

A major one, specifically when repeatedly taking subsequences from Mb+ size genomes, for instance in https://sourceforge.net/projects/biogrinder (profiling with NYTProf of grinder led @fangly and myself to here). There's validation code that uses length() a lot (perhaps too much?).

Member

fangly replied Oct 29, 2012

In the present case, it makes a lot of sense to avoid an extraneous call to length() because every call to length() retrieves the full sequence through seq() in order to calculate its length. In the case of long sequences (e.g. a full-length microbial genome), this means passing MBs of data back and forth between seq() and length(), which is slow.

PS/ For the unaware, Bio::DB::Fasta is best suited to handle very large sequences.

Owner

hyphaltip replied Oct 30, 2012

that's great- I just wanted to have some documentation as to why this extra code was included. I'm all for optimizations when there is a reason for them that is tested.

Code for comparison here, if that helps (before after 7sec/1sec, all the 6secs is taken calling $newseq->seq, which implies validation, which needs the length
https://gist.github.com/3977545

Memoization of length was already implemented, just badly. So there's no extra features being introduced here

Please sign in to comment.