Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speeding up DB::Fasta::subseq #95

Closed
rocky opened this issue Dec 19, 2014 · 15 comments
Closed

Speeding up DB::Fasta::subseq #95

rocky opened this issue Dec 19, 2014 · 15 comments

Comments

@rocky
Copy link
Collaborator

rocky commented Dec 19, 2014

This follows Pull request #96.

The background is basically that in running Variant Effect Prediction, a significant portion of time is spent in the stripping carriage returns and line feeds in the subseq method via:

$data =~ s/\n//g;
$data =~ s/\r//g

Compiling the match portion in a regular expression outside of the function helps, but better is doing it in C.

This commit I hope will be acceptable, and if not should serve as a concrete discussion for further work.

In that commit, in the Inline-C-Fasts branch of my fork, if module Inline::C is around, that will compile use a C function that overwrites another function of the same name which is a pure Perl function.

See also these benchmarks.

@fjossandon
Copy link
Member

I have been testing https://github.com/rocky/bioperl-live/commit/65e599b126d1598b594ecaca53777382686d0084 for a while with and without Inline in my system and it works well, but I have 2 requests for you.

  1. Please use a named scalar instead of $_ in the strip_nlcr subroutine; we have seen problems in the past for abusing the use of $_ that later had to be eradicated (like in 41a5483), so please use "my $string = shift;" or something like that.
  2. Something that I didn't realize before is that Inline produce a new folder called _Inline with some subfolders (lib, build) and files (.lock, .dll) that remains after execution. Can that be removed after execution to keep the place clean??
    Cheers.

@rocky
Copy link
Collaborator Author

rocky commented Dec 20, 2014

Thanks for checking things over. Since there seems to be agreement to consider Inline::C, I've closed the old pull request and created pull request #96 . In that, I've removed the use of $_ and added a my variable instead.

As for removing the Inline folder, that can be done, but I think you would be doing everyone a disservice and I suspect you misunderstand. So let me explain.

You only need to compile the C code once, not every time you run the program. And that implies that you really don't even need a C compiler around all the time, just when you want to create this speed-up code.

It is possible or even likely that in packaging this for an OS, say a .deb for Ubuntu, the packager will include the compiled code in that Inline folder, and thus the system which it the Bio::Perl package is installed in doesn't need to have a C compiler.

Adding a cache folder or additional files is not uncommon. Python3 uses such a folder. Python2 adds files with the extension .pyc, and I'm sure some versions of Perl do the same. The LLVM implementation of Ruby, rubinius, I believe also creates a folder in the later versions. Emacs Lisp changes the extension from .el to .elc.

Finally, let me again say thanks for checking this over, but please check it over again. Although I've run the test suite, and tried my best to check things over, I make lots of mistakes.

That's why I work on debuggers.

@fjossandon
Copy link
Member

Ok, thanks for your explanation, now I have a clearer picture. I have only 1 remaining question. You say that once the compiled file has been created, Inline will not touch it again, right? This means that if a change is made to the C code for whatever reason, the compiled file will be out of date and it will be necessary to delete it so it compiles again???

The code looks fine on first glance, but I will check it over again of course, don't worry (I'm downloading your branch right in this moment). Thanks to you for the patch. Looking around for similar cases, Bio::DB::Qual have a similar case on "subqual" that could benefit from the same approach. I will probably merge on Monday to test and give time to any other people speak if they want to.

Cheers.

@rocky
Copy link
Collaborator Author

rocky commented Dec 20, 2014

On Sat, Dec 20, 2014 at 4:07 PM, Francisco J. Ossandon <
notifications@github.com> wrote:

Ok, thanks for your explanation, now I have a clearer picture. I have only
1 remaining question. You say that once the compiled file has been created,
Inline will not touch it again, right? This means that if a change is made
to the C code for whatever reason, the compiled file will be out of date
and it will be necessary to delete it so it compiles again???

I could be wrong, but from the tests I have tried, if you change the
source C source code, the files in _Inline get recompiled. I'm not sure how
it figures this out although I can think of ways it might. The hex numbers
after the name Fasta_028f, and Fasta_2c5e seem suggestive.

And even if it does the stupid thing of recompiling every time, I would
imagine in the future it might not, because it doesn't have to.

The code looks fine on first glance, but I will check it over again of
course, don't worry (I'm downloading your branch right in this moment).
Thanks to you for the patch. Looking around for similar cases,
Bio::DB::Qual have a similar case on "subqual" that could benefit from the
same approach. I will probably merge on Monday to test and give time to any
other people speak if they want to.

Yes, please take your time. Things can only benefit from more careful
scrutiny by you and others. I am concerned that the added complexity really
does speed things the way my tests seem to indicate.

So I welcome independent verification from others. One of the things that
is tricky is that the time spent is dependent on data - whether there are substitutions
done or not, the size of the strings involved and so on. So I would be
interested in how this changes real runs. After all, that's what we want to
improve.

If this does have a significant impact and you want to use it in other
modules, I guess you may want to hoist this out of this module and put it
in a common module.

Cheers.

Cheers.


Reply to this email directly or view it on GitHub
#95 (comment).

@rocky
Copy link
Collaborator Author

rocky commented Dec 20, 2014

By the way, what's up with the FASTA/FASTQ XS bindings mentioned before?

@fjossandon
Copy link
Member

I don't know about XS, @cjfields mentioned it in #94, so maybe should ask him?

@fjossandon
Copy link
Member

After testing and testing I found no problem, so I'm merging your pull request now @rocky. I will be absent a couple of days for the holidays, but I was thinking on moving later the sub code from Bio::DB::Fasta to "Bio::DB::IndexBase", since IndexBase is the base to all the other Bio::DB modules (meaning that it could benefit Bio::DB::Qual without duplicating code). If you feel like doing it, it would be great, if not I will look on to it later.
Happy holidays!

@fjossandon
Copy link
Member

Note: I will export the changes to v1.6.x branch in a few days.

@rocky
Copy link
Collaborator Author

rocky commented Dec 24, 2014

Thanks for merging.

My experience is that it is better to make smaller more-frequent changes like this one, rather than wait and instead do fewer larger, more-encompassing changes.

Otherwise things bog down. Again, in my experience the quicker the turn-around the easier it is to get people to help out.

In the larger picture, yes, this should go in a more general module such as Bio::DB::IndexBase as you suggest — but as a separate change.

Personally, I am a little less concerned with that change because from my profiling, the second use in Bio::DB::Fasta::header just isn't used a lot. Think about it: that's used only to read the header line which typically will be less than 0.1% of the lines of the file. And I think the same is likely to be true for Bio::DB::Qual. So this might fall more in the category of abstraction nicety rather than speedup that anyone will notice.

Are there benchmark tests for Bio::Perl? If so, are they regularly run before releases to gauge changes in performance?

Happy holidays!

@fjossandon
Copy link
Member

@rocky, I made an experiment to see the effect of moving the strip_crnl to the base module IndexedBase.pm and found that using strip_crnl compiled with Inline C from IndexedBase.pm (instead of being a Perl sub) make it die with an error:

Usage: Bio::DB::IndexedBase::_strip_crnl_C(str) at Bio/DB/Fasta.pm line 350.

So this works => "$test_data = strip_crnl_C($test_data);" # Perl sub
But this don't => "$test_data = Bio::DB::IndexedBase->_strip_crnl_C($test_data);" # Inline C sub

When comparing speeds between locally held subrotine (sub_subst) vs imported from IndexedBase (sub_subst_index) using the Perl sub version it returns similar speeds.

raw_subst         290/s              --            -99%            -99%
sub_subst       49261/s          16864%              --             -0%
sub_subst_index 49505/s          16948%              0%              --

The C code in IndexedBase is identical (besides changing its name a little to compare it side by side with the local sub. Do you have any idea of why the Inline C sub would give an error when imported from another module??

@fjossandon
Copy link
Member

I forgot, as far as I know, there are no benchmark tests setup to monitor performance changes, only the test files for checking proper results, but no speed tests.

@rocky
Copy link
Collaborator Author

rocky commented Dec 31, 2014

Do you have any idea of why the Inline C sub would give an error when imported from another module??

The question leads me to believe there is a misunderstanding in how -> works, because it has nothing to do with "importing", and everything to do with ->.

Writing Bio::DB::IndexedBase->strip_C($test_data) in this specific case is the same as writing Bio::DB::IndexBase::strip_C(Bio::DB::IndexBase, $test_data). The C code as written takes just one argument, $test_data, not two.

So see if calling the routine as DB::IndexedBase::strip_C($test_data) works instead.

Also please, let's not try to make this OO by somehow adding that class extra class object. There is nothing OO about this routine. Use OO when it is warranted and don't use it when it is not.

In EnsEMBL's Variant Effect Prediction, it looks like there is a major source of slowness by an overuse of OO. This routine was written to remove up a bottleneck. Using it via DB::IndexedBase::strip_C($test_data) is fine. Using it via DB::IndexedBase->strip_C($test_data) will require more (misguided) coding to make this slower without adding any clarity or benefit.

@fjossandon
Copy link
Member

I was not trying to go OO, it was just a syntax confusion on my part. =)
Thanks for the explanation, now I have fixed the syntax and it works fine. I will push the change in a few minutes, once I run the test suite in case I overlooked something.

@fjossandon
Copy link
Member

Ok, final edit on 01ec10d. Issue closed

@rocky
Copy link
Collaborator Author

rocky commented Jan 2, 2015

Ok, good to hear we're on the same page here. Looked over the final commits and it all looks good and better than before. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants