Perl Module that uses consensus sequences to find Exon-Intron Boundaries
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.



BoundaryFinder - Exon/Intron Boundary Finder Module


use BoundaryFinder;

$seq = "aaactgcaaggt";
my $hits = BoundaryFinder::score_seq(-seq => $seq, -cons_seq => $cons_seq);


The BoundaryFinder uses consensus sequences to find and score boundaries in a sequence. This was developed for finding exon boundaries, but it can be used with any consensus sequence.

The boundary finder works best when you finding boundaries in small sequences. This is because the number of matches is often large, especially if there are no "100%" positions. (e.g. 100% of nucleotides have to be an "A" at position x, or it does not match)


The general procedure of using the library to find exon boundaries is as follows:

1. Use the BlastManager to blast your query sequence agaist the subject database.

This will give a BlastResults oject, or a set of blast hits.

2. Determine what BlastResults or hits you want use.

Sometimes blast returns hits that are not relevant to your query. These could be weak matches that blast returned that do not make sense with your query. You can trim these out by changing the blast parameters (such as a minimum "e" value) or iterating through the BlastResults and removing the unwanted hits.

3. Use BoundaryFinder to find the boundaries of the blast hits

Each query is a pair of sequences representing the 5` and 3` ends of a blast hit.

This can be automatically done by passing a BoundaryResults object, or done manually. It will create a query with a 50 extra nucleotides from each end of the edges of the blast hit. You can easily specify a buffer size other than 50.


This hastily done ascii example shows how the BoundaryFinder will
create queries automatically from a hit. Given the blast result below,
a sequence is created by using 50 nucleotides upstream (5') to the end
of the blast hit (3' end). This is also done for the 3' end, with 
          50 NT after the end of the hit.

        | |||||||||||    ... and so on
|< -50  ^  ------------- To the end of the blast hit --------------->|

^ = Points to the 5' edge of the blast result.


This can be done manually too, and it takes only a little more work. You will just have to figure out the buffers for each hit, and add them with the BoundaryFinder::addQuery method.

4. Run the BoundaryFinder to get a BoundaryResults object.

The BoundaryResults object contains all the non-zero scores, and their positions in the blast database for easy retrieval. It is simply a wrapper of two hashes that contain the position as keys, and score as the value.


# First run blasts
my $query = "protein.faa";
my $db = "blast_databases/dmel"; # Name of the database to query

# Init the BlastManager
my $bm = BlastManager->new(-blast_db => $db);
# set the expect value (optional
$bm->evalue(-e => 1e-1);       

# Run the tblastn 
my $br = $bm->tblastn(-query => $query);

# Load the 5' and 3' consensus sequences for Drosophila
$seqIO = new ConsensusSeqIO(-path => "t/drosophila.cons");
my $cons_seq_5;
my $cons_seq_3;

$cons_seq_5 = $seqIO->next_seq;
$cons_seq_3 = $seqIO->next_seq;

# Run the blast
# The 5' and 3' are switched because that is the way the consensus sequences
# are from the literature
my $bf_results = BoundaryFinder::score_hit(-hit => $hit, 
                                    -db => $db,
                                    -cons_seq_5 => $cons_seq_3,
                                    -cons_seq_3 => $cons_seq_5  );


Nathan Elmore


Blast+ 2.2.28