-
Notifications
You must be signed in to change notification settings - Fork 182
/
bptutorial.pl
executable file
·3088 lines (2386 loc) · 118 KB
/
bptutorial.pl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# $Id$
=head1 NAME
BioPerlTutorial - a tutorial for bioperl
=head1 AUTHOR
Cared for by Peter Schattner <schattner@alum.mit.edu>
Copyright Peter Schattner
=head1 DESCRIPTION
This tutorial includes "snippets" of code and text from various
Bioperl documents including module documentation, example scripts
and "t" test scripts. You may distribute this tutorial under the
same terms as perl itself.
This document is written in Perl POD (plain old documentation)
format. You can run this file through your favorite pod translator
(pod2html, pod2man, pod2text, etc.) if you would like a more
convenient formatting.
Table of Contents
I. Introduction
I.1 Overview
I.2 Software requirements
I.2.1 For minimal bioperl installation
I.2.2 For complete installation
I.3 Installation procedures
I.4 Additional comments for non-unix users
II. Brief overview to bioperl\'s objects
II.1 Sequence objects: (Seq, PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI)
II.2 Alignment objects (SimpleAlign, UnivAln)
II.3 Location objects (Simple, Split, Fuzzy)
II.4 Interface objects and implementation objects
III. Using bioperl
III.1 Accessing sequence data from local and remote databases
III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)
III.1.2 Indexing and accessing local databases (Bio::Index::*, bpindex.pl, bpfetch.pl)
III.2 Transforming formats of database/ file records
III.2.1 Transforming sequence files (SeqIO)
III.2.2 Transforming alignment files (AlignIO)
III.3 Manipulating sequences
III.3.1 Manipulating sequence data with Seq methods (Seq)
III.3.2 Obtaining basic sequence statistics- MW, residue &codon frequencies (SeqStats)
III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme)
III.3.4 Identifying amino acid cleavage sites (Sigcleave)
III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern
III.4 Searching for "similar" sequences
III.4.1 Running BLAST locally (StandAloneBlast)
III.4.2 Running BLAST remotely (using RemoteBlast.pm)
III.4.3 Parsing BLAST reports with Blast.pm
III.4.4 Parsing BLAST reports with BPlite, BPpsilite and BPbl2seq
III.4.5 Parsing HMM reports (HMMER::Results)
III.5 Creating and manipulating sequence alignments
III.5.1 Aligning 2 sequences with Smith-Waterman (pSW)
III.5.2 Aligning 2 sequences with Blast using bl2seq and AlignIO
III.5.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
III.5.4 Manipulating / displaying alignments (SimpleAlign, UnivAln)
III.6 Searching for genes and other structures on genomic DNA
(Genscan, Sim4, ESTScan, MZEF, Grail, Genemark, EPCR)
III.7 Developing machine readable sequence annotations
III.7.1 Representing sequence annotations (Annotation, SeqFeature)
III.7.2 Representing and large and/or changing sequences (LiveSeq,LargeSeq)
III.7.3 Representing related sequences - mutations, polymorphisms etc (Allele, SeqDiff,)
III.7.4 Sequence XML representations - generation and parsing (SeqIO::game)
IV. Related projects - biocorba, biopython, biojava, EMBOSS, Ensembl, AnnotationWorkbench
IV.1 Biocorba
IV.2 Biopython and biojava
IV.3 EMBOSS
IV.2 Ensembl and bioperl-db
IV.4 The Annotation Workbench and bioperl-gui
V. Appendices
V.1 Finding out which methods are used by which Bioperl Objects
V.2 Tutorial Demo Scripts
=head1 I. Introduction
=head2 I.1 Overview
Bioperl is a collection of perl modules that facilitate the
development of perl scripts for bio-informatics applications. As
such, it does not include ready to use programs in the sense that many
commercial packages and free web-based interfaces (eg Entrez, SRS) do.
On the other hand, bioperl does provide reusable perl modules that
facilitate writing perl scripts for sequence manipulation, accessing
of databases using a range of data formats and execution and parsing
of the results of various molecular biology programs including Blast,
clustalw, TCoffee, genscan, ESTscan and HMMER. Consequently, bioperl
enables developing scripts that can analyze large quantities of
sequence data in ways that are typically difficult or impossible with
web based systems.
In order to take advantage of bioperl, the user needs a basic
understanding of the perl programming language including an
understanding of how to use perl references, modules, objects and
methods. If these concepts are unfamiliar the user is referred to any
of the various introductory / intermediate books on perl. (I\'ve liked
S. Holzmer\'s Perl Core Language, Coriolis Technology Press, for
example). This tutorial is not intended to teach the fundamentals of
perl to those with little or no experience in the perl language. On
the other hand, advanced knowledge of perl - such as how to write a
perl object - is not required for successfully using bioperl.
Bioperl is open source software that is still under active
development. The advantages of open source software are well known.
They include the ability to freely examine and modify source code and
exemption from software licensing fees. However, since open source
software is typically developed by a large number of volunteer
programmers, the resulting code is often not as clearly organized and
its user interface not as standardized as in a mature commercial
product. In addition, in any project under active development,
documentation may not keep up with the development of new features.
Consequently the learning curve for actively developed, open source
source software is sometimes steep.
This tutorial is intended to ease the learning curve for new users of
bioperl. To that end the tutorial includes:
=over 4
=item *
Descriptions of what bio-informatics tasks can be handled with bioperl
=item *
Directions on where to find the methods to accomplish these tasks
within the bioperl package
=item *
Recommendations on where to go for additional information.
=item *
The POD documentation should contain runnable code in the SYNOPSIS section
which is meant to illustrate the use of a module and its methods.
=back
Running the tutorial.pl script while going through this tutorial - or
better yet, stepping through it with an interactive debugger - is a
good way of learning bioperl. The tutorial script is also a good
place from which to cut-and-paste code for your scripts(rather than
using the code snippets in this tutorial). The tutorial script should
work on your machine - and if it doesn\'t it would probably be a good
idea to find out why, before getting too involved with bioperl!
This tutorial does not intend to be a comprehensive description of all
the objects and methods available in bioperl. For that the reader is
directed to the documentation included with each of the modules. A
very useful interface for finding one's way within all the module
documentation can be found at http://doc.bioperl.org/bioperl-live/.
This interface lists all bioperl modules and descriptions of all
of their methods.
One potential problem in locating the correct documentation is that
multiple methods in different modules may all share the same name.
Moreover, because of perl's complex method of "inheritance",
it is not often clear which of the identically named methods is being
called by a given object. One way to resolve this question is by using
the software described in Appendix V.1.
For those who prefer more visual descriptions,
http://doc.bioperl.org/bioperl-live/ also offers links to three
PDF files which contain schematics that describe how many of the bioperl
objects related to one another.
=head2 I.2 Software requirements
=cut
=head2 I.2.1 Minimal bioperl installation
For a "minimal" installation of bioperl, you will need to have perl
itself installed as well as the bioperl "core modules". Bioperl has
been tested primarily using perl 5.005 and perl 5.6.
The minimal bioperl installation should still work under perl 5.004.
However, as increasing numbers of bioperl objects are using modules
from CPAN (see below), problems have been observed for bioperl running
under perl 5.004. So if you are having trouble running bioperl under
perl 5.004, you should probably upgrade your version of perl.
In addition to a current version of perl, the new user of bioperl is
encouraged to have access to, and familiarity with, an interactive
perl debugger. Bioperl is a large collection of complex interacting
software objects. Stepping through a script with an interactive
debugger is a very helpful way of seeing what is happening in such a
complex software system - especially when the software is not behaving
in the way that you expect. The free graphical debugger ptkdb
(available as Devel::ptkdb from CPAN) is highly recommended. Active
State offers a commercial graphical debugger for windows systems. The
standard perl distribution also contains a powerful interactive
debugger - though with a more cumbersome (command line) interface.
The Perl tool Data::Dumper used with the syntax:
use Data::Dumper;
printer Dumper($seq);
can also be helpful for obtaining debugging information on perl objects.
=head2 I.2.2 Complete installation
Taking full advantage of bioperl requires software beyond that for the
minimal installation. This additional software includes perl modules
from CPAN, bioperl perl extensions, a bioperl xs-extension, and
several standard compiled bioinformatics programs.
B<Perl - extensions>
The following perl modules are available from bioperl
(http://bioperl.org/Core/external.shtml)or from CPAN
(http://www.perl.com/CPAN/) are used by bioperl. The listing also
indicates what bioperl features will not be available if the
corresponding CPAN module is not downloaded. If these modules are not
available (eg non-unix operating systems), the remainder of bioperl
should still function correctly.
For accessing remote databases you will need:
=over 2
=item *
File-Temp-0.09
=item *
IO-String-1.01
=back
For accessing Ace databases you will need:
=over 1
=item *
AcePerl-1.68.
=back
For remote blast searches you will need:
=over 7
=item *
libwww-perl-5.48
=item *
Digest-MD5-2.12.
=item *
HTML-Parser-3.13
=item *
libnet-1.0703
=item *
MIME-Base64-2.11
=item *
URI-1.09
=item *
IO-stringy-1.216
=back
For xml parsing you will need:
=over 5
=item *
libxml-perl-0.07
=item *
XML-Parser-2.30
=item *
XML-Twig-2.02
=item *
XML-Writer-0.4
=item *
expat-1.95.1 from http://sourceforge.net/projects/expat/
=back
For more current and additional information on external modules
required by bioperl, check http://bioperl.org/Core/external.shtml
B<Bioperl c extensions & external bio-informatics programs>
Bioperl also uses several c-programs for sequence alignment and local
blast searching. To use these features of bioperl you will need an
ANSI C or Gnu C compiler as well as the actual program available from
sources such as:
for smith-waterman alignments- bioperl-ext-0.6 from
http://bioperl.org/Core/external.shtml
for clustalw alignments-
http://corba.ebi.ac.uk/Biocatalog/Alignment_Search_software.html/
for tcoffee alignments-
http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html
for local blast searching- ftp://ncbi.nlm.nih.gov/blast
=head2 I.3 Installation
The actual installation of the various system components is
accomplished in the standard manner:
=over 6
=item *
Locate the package on the network
=item *
Download
=item *
Decompress (with gunzip or a similiar utility)
=item *
Remove the file archive (eg with tar -xvf)
=item *
Create a "makefile" (with "perl Makefile.PL" for perl modules or a
supplied "install" or "configure" program for non-perl program
=item *
Run "make", "make test" and "make install" This procedure must be
repeated for every CPAN module, bioperl-extension and external
module to be installed. A helper module CPAN.pm is available from
CPAN which automates the process for installing the perl modules.
=back
For the external programs (clustal, Tcoffee, ncbi-blast), there is an
extra step:
=over 1
=item *
Set the relevant environmental variable (CLUSTALDIR, TCOFFEEDIR or
BLASTDIR) to the directory holding the executable in your startup
file - eg in .bashrc. (For running local blasts, it is also
necessary that the name of local-blast database directory is known
to bioperl. This will typically happen automatically, but in case
of difficulty, refer to the documentation for StandAloneBlast.pm)
=back
The only likely complication (at least on unix systems) that may occur
is if you are unable to obtain system level writing privileges. For
instructions on modifying the installation in this case and for more
details on the overall installation procedure, see the README file in
the bioperl distribution as well as the README files in the external
programs you want to use (eg bioperl-ext, clustalw, TCoffee,
NCBI-blast).
=head2 I.4 Additional comments for non-unix users
Bioperl has mainly been developed and tested under various unix
environments (including Linux) and this tutorial is intended primarily
for unix users. The minimal installation of bioperl should work
under other OS\'s (NT, windows,Mac). However, bioperl has not been
widely tested under these OS\'s.
Todd Richmond has written of his experiences with BioPerl on MacOs 9
at http://bioperl.org/Core/mac-bioperl.html. There is also a
description of bioperl on windows by Jurgen Pletinckx at
http://www.bioperl.org/Core/windows-bioperl.html. (Note that currently
these documents describe relaease 0.7.x of bioperl.) Minimal bioperl
does run without problems on Mac OS X sincel it is a Unix system. However,
external precompiled programs (eg NCBI local Blast) and other useful
auxiliary programs such as perl-TK and ptkdb are in many cases not yet
available under OS X.
Many bioperl features require the use of CPAN modules, compiled
extensions or external programs. These features will probably will
not work under some or all of these other operating systems. If a
script attempts to access these features from a non-unix OS, bioperl
is designed to simply report that the desired capability is not available.
However, since the testing of bioperl in these environments has been
limited, the script may well crash in a less "graceful" manner.
=head1 II. Brief introduction to bioperl\'s objects
The purpose of this tutorial is to get you using bioperl to solve
real-life bioinformatics problems as quickly as possible. The aim is
not to explain the structure of bioperl objects or perl
object-oriented programming in general. Indeed, the relationships
among the bioperl objects is not simple; however, understanding them
in detail is fortunately not necessary for successfully using the
package.
Nevertheless, a little familiarity with the bioperl object "bestiary"
can be very helpful even to the casual user of bioperl. For example
there are (at least) six different "sequence objects" - Seq,
PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI. Understanding the
relationships among these objects - and why there are so many of them
- will help you select the appropriate one to use in your script.
=head2 II.1 Sequence objects: (Seq, PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI)
Seq is the central sequence object in bioperl. When in doubt this is
probably the object that you want to use to describe a dna, rna or
protein sequence in bioperl. Most common sequence manipulations can
be performed with Seq. These capabilities are described in sections
III.3.1 and III.7.1.
Seq objects can be created explicitly (see section III.2.1 for an
example). However usually Seq objects will be created for you
automatically when you read in a file containing sequence data using
the SeqIO object. This procedure is described in section III.2.1. In
addition to storing its identification labels and the sequence itself,
a Seq object can store multiple annotations and associated "sequence
features". This capability can be very useful - especially in
development of automated genome annotation systems, see section
III.7.1.
On the other hand, if you need a script capable of simultaneously
handling many (hundreds or thousands) sequences at a time, then the
overhead of adding annotations to each sequence can be significant.
For such applications, you will want to use the PrimarySeq object.
PrimarySeq is basically a "stripped down" version of Seq. It contains
just the sequence data itself and a few identifying labels (id,
accession number, molecule type = dna, rna, or protein). For
applications with hundreds or thousands or sequences, using PrimarySeq
objects can significantly speed up program execution and decrease the
amount of RAM the program requires.
What is (for historical reasons) called a LocatableSeq object might be
more appropriately called an "AlignedSeq" object. It is a Seq object
which is part of a multiple sequence alignment. It has "start" and
"end" positions indicating from where in a larger sequence it may have
been extracted. It also may have "gap" symbols corresponding to the
alignment to which it belongs. It is used by the alignment
object SimpleAlign and other modules that use SimpleAlign objects (eg
AlignIO, pSW). In general you don\'t have to worry about creating
LocatableSeq objects because they will be made for you automatically
when you create an alignment (using pSW, Clustalw, Tcoffee or bl2seq)
or when input an alignment data file using AlignIO. However if you
need to input a sequence alignment by hand (ieg to build a SimpleAlign
object), you will need to input the sequences as LocatableSeqs.
A LargeSeq object is a special type of Seq object used for handling
very long ( eg E<gt> 100 MB) sequences. If you need to manipulate such
long sequences see section III.7.2 which describes LargeSeq objects.
A LiveSeq object is another specialized object for storing sequence
data. LiveSeq addresses the problem of features whose location on a
sequence changes over time. This can happen, for example, when
sequence feature objects are used to store gene locations on newly
sequenced genomes - locations which can change as higher quality
sequencing data becomes available. Although a LiveSeq object is not
implemented in the same way as a Seq object, LiveSeq does implement
the SeqI interface (see below). Consequently, most methods available
for Seq objects will work fine with LiveSeq objects. Section III.7.2
contains further discussion of LiveSeq objects.
SeqI objects are Seq "interface objects" (see section II.4) They are
used to ensure bioperl\'s compatibility with other software packages.
SeqI and other interface objects are not likely to be relevant to the
casual bioperl user.
*** Having described these other types of sequence objects, the
"bottom line" still is that if you store your sequence data in Seq
objects (which is where they\'ll be if you read them in with
SeqIO), you will usually do just fine. ***
=head2 II.2 Alignment objects (SimpleAlign, UnivAln)
There are two "alignment objects" in bioperl: SimpleAlign and UnivAln.
Both store an array of sequences as an alignment. However their
internal data structures are quite different and converting between
them - though certainly possible - is rather awkward. In contrast to
the sequence objects - where there are good reasons for having 6
different classes of objects, the presence of two alignment objects is
just an unfortunate relic of the two systems having been designed
independently at different times.
In earlier releases of bioperl, SimpleAlign and UnivAln each had some
significant capabilities that the other lacked. However, as of the current
release, most of the basic features of UnivAln have been implemented
in SimpleAlign. Consequently, except in cases where you want to do very
intricate "slicing and dicing" of alignments, you will be best off to use
SimpleAlign objects for your multiple sequence alignments. For more
details on these objects see section III.5.4.
=head2 II.3 Location objects
A "Location object" is designed to be associated with a Sequence Feature object
to indicate where on a larger structure (eg a chromosome or contig)
the feature can be found. The reason why this simple concept has evolved
in a collection of rather complicated objects is that
1) Some objects have multiple locations or sub-locations (eg a gene\'s exons
may have multiple start and stop locations)
2) In unfinished genomes, the precise locations of features is not known with
certainty.
Bioperl\'s various Location objects address these complications. In addition
there are "CoordinatePolicy" objects that allow the user to specify how to
measure the "length" of a feature if its precise start and end coordinates are
not know. In most cases, you will not need to worry about these complications
if you are using bioperl to handle simple features with well-defined start
and stop locations. However, if you are using bioperl to annotate partially
or unfinished genomes or to read annotations of such genomes with bioperl,
understanding the various Location objects will be important. See the
documentation of the various modules in the Bio::Locations directory for more
information.
=head2 II.4 Interface objects and implementation objects
Since release 0.6, bioperl has been moving to separate interface and
implementation objects. An interface is solely the definition of what
methods one can call on an object, without any knowledge of how it is
implemented. An implementation is an actual, working implementation of
an object. In languages like Java, interface definition is part of the
language. In Perl, you have to roll your own.
In bioperl, the interface objects usually have names like
Bio::MyObjectI, with the trailing I indicating it is an interface
object. The interface objects mainly provide documentation on what the
interface is, and how to use it, without any implementations (though
there are some exceptions). Although interface objects are not of
much direct utility to the casual bioperl user, being aware of their
existence is useful since they are the basis to understanding how
bioperl programs can communicate with other bioinformatics projects
such as Ensembl and the Annotation Workbench (see section IV)
=head1 III. Using bioperl
Bioperl provides software modules for many of the typical tasks of
bioinformatics programming. These include:
=over 7
=item * Accessing sequence data from local and remote databases
=item * Transforming formats of database/ file records
=item * Manipulating individual sequences
=item * Searching for "similar" sequences
=item * Creating and manipulating sequence alignments
=item * Searching for genes and other structures on genomic DNA
=item * Developing machine readable sequence annotations
=back
The following sections describe how bioperl can help perform all of
these tasks.
=head2 III.1 Accessing sequence data from local and remote databases
Much of bioperl is focused on sequence manipulation. However, before
bioperl can manipulate sequences, it needs to have access to sequence
data. Now one can directly enter data sequence data into a bioperl
Seq object, eg:
$seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact',
'-desc'=>'Sample Bio::Seq object',
'-display_id' => 'something',
'-accession_number' => 'accnum',
'-alphabet' => 'dna' );
However, in most cases, it is preferable to access sequence data from
some online data file or database (Note that in common with
conventional bioinformatics usage we will call a "database" what might
be more appropriately referred to as an "indexed flat file".) Bioperl
supports accessing remote databases as well as developing indices for
setting up local databases.
=head2 III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)
Accessing sequence data from the principal molecular biology databases
is straightforward in bioperl. Data can be accessed by means of the
sequence\'s accession number or id. Batch mode access is also
supported to facilitate the efficient retrieval of multiple sequences.
For retrieving data from genbank, for example, the code could be as
follows:
$gb = new Bio::DB::GenBank();
$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
$seq2 = $gb->get_Seq_by_acc('AF303112'))
$seqio = $gb->get_Stream_by_batch([ qw(J00522 AF303112 2981014)]));
Bioperl currently supports sequence data retrieval from the genbank,
genpept, swissprot and gdb databases. Bioperl also supports retrieval
from a remote Ace database. This capability requires the presence of
the external AcePerl module. You need to download and install the aceperl
module from http://stein.cshl.org/AcePerl/.
=head2 III.1.2 Indexing and accessing local databases (Bio::Index::*,
bpindex.pl, bpfetch.pl)
Alternately, bioperl permits indexing local sequence data files by
means of the Bio::Index objects. The following sequence data formats
are supported: genbank, swissprot, pfam, embl and fasta. Once the set
of sequences have been indexed using Bio::Index, individual sequences
can be accessed using syntax very similar to that described above for
accessing remote databases. For example, if one wants to set up an
indexed (flat-file) database of fasta files, and later wants then to
retrieve one file, one could write a scripts like:
# script 1: create the index
use Bio::Index::Fasta; # using fasta file format
$Index_File_Name = shift;
$inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
$inx->make_index(@ARGV);
# script 2: retrieve some files
use Bio::Index::Fasta;
$Index_File_Name = shift;
$inx = Bio::Index::Fasta->new($Index_File_Name);
foreach $id (@ARGV) {
$seq = $inx->fetch($id); # Returns Bio::Seq object
# do something with the sequence
}
To facilitate the creation and use of more complex or flexible
indexing systems, the bioperl distribution includes two sample scripts
bpindex.pl and bpfetch.pl. These scripts can be used as templates to
develop customized local data-file indexing systems.
=head2 III.2 Transforming formats of database/ file records
=cut
=head2 III.2.1 Transforming sequence files (SeqIO)
A common - and tedious - bioinformatics task is that of converting
sequence data among the many widely used data formats. Bioperl\'s
SeqIO object, however, makes this chore a breeze. SeqIO can read a
stream of sequences - located in a single or in multiple files - in
any of six formats: Fasta, EMBL. GenBank, Swissprot, PIR and GCG.
Once the sequence data has been read in with SeqIO, it is available to
bioperl in the form of Seq objects. Moreover, the Seq objects can
then be written to another file (again using SeqIO) in any of the
supported data formats making data converters simple to implement, for
example:
use Bio::SeqIO;
$in = Bio::SeqIO->new('-file' => "inputfilename",
'-format' => 'Fasta');
$out = Bio::SeqIO->new('-file' => ">outputfilename",
'-format' => 'EMBL');
while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }
In addition, perl "tied filehandle" syntax is available to SeqIO,
allowing you to use the standard E<lt>E<gt> and print operations to read and
write sequence objects, eg:
$in = Bio::SeqIO->newFh('-file' => "inputfilename" ,
'-format' => 'Fasta');
$out = Bio::SeqIO->newFh('-format' => 'EMBL');
print $out $_ while <$in>;
=head2 III.2.2 Transforming alignment files (AlignIO)
Data files storing multiple sequence alignments also appear in varied
formats. AlignIO is the bioperl object for data conversion of
alignment files. AlignIO is patterned on the SeqIO object and shares
most of SeqIO\'s features. AlignIO currently supports input in the
following formats: fasta, mase, stockholm, prodom, selex, bl2seq,
msf/gcg and output in these formats: : fasta, mase, selex, clustalw,
msf/gcg. One significant difference between AlignIO and SeqIO is that
AlignIO handles IO for only a single alignment at a time (SeqIO.pm
handles IO for multiple sequences in a single stream.) Syntax for
AlignIO is almost identical to that of SeqIO: use Bio::AlignIO;
$in = Bio::AlignIO->new('-file' => "inputfilename" ,
'-format' => 'fasta');
$out = Bio::AlignIO->new('-file' => ">outputfilename",
'-format' => 'pfam');
while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }
The only difference is that here, the returned object reference, $aln,
is to a SimpleAlign object rather than a Seq object.
AlignIO also supports the tied filehandle syntax described above for
SeqIO. (Note that currently AlignIO is usable only with SimpleAlign
alignment objects. IO for UnivAln objects can only be done for files
in fasta data format.)
=head2 III.3 Manipulating sequences
=cut
=head2 III.3.1 Manipulating sequence data with Seq methods
OK, so we know how to retrieve sequences and access them as Seq
objects. Let\'s see how we can use the Seq objects to manipulate our
sequence data and retrieve information. Seq provides multiple methods
for performing many common (and some not-so-common) tasks of sequence
manipulation and data retrieval. Here are some of the most useful:
The following methods return strings
$seqobj->display_id(); # the human read-able id of the sequence
$seqobj->seq(); # string of sequence
$seqobj->subseq(5,10); # part of the sequence as a string
$seqobj->accession_number(); # when there, the accession number
$seqobj->alphabet(); # one of 'dna','rna','protein'
$seqobj->primary_id(); # a unique id for this sequence irregardless
# of its display_id or accession number
The following methods return an array of Bio::SeqFeature objects
$seqobj->top_SeqFeatures # The 'top level' sequence features
$seqobj->all_SeqFeatures # All sequence features, including sub
# seq features
Sequence features will be discussed further in section III.7 on
machine-readable sequence annotation.
The following methods returns new sequence objects, but do not transfer features across
$seqobj->trunc(5,10) # truncation from 5 to 10 as new object
$seqobj->revcom # reverse complements sequence
$seqobj->translate # translation of the sequence
Note that some methods return strings, some return arrays and some
return references to objects. Here (as elsewhere in perl and bioperl)
it is the user\'s responsibility to check the relevant documentation so
they know the format of the data being returned.
Many of these methods are self-explanatory. However, bioperl\'s flexible
translation methods warrant further comment. Translation in bioinformatics
can mean two slightly different things:
=over 2
=item 1 Translating a nucleotide sequence from start to end.
=item 2 Taking into account the constraints of real coding regions in mRNAs.
=back
For historical reasons the bioperl implementation of translation does
the first of these tasks easily. Any sequence object which is not of type
'protein' can be translated by simply calling the method which returns
a protein sequence object:
$translation1 = $my_seq_object->translate;
However, the translate method can also be passed several optional
parameters to modify its behavior. For example, the first two
arguments to "translate" can be used to modify the characters used to
represent stop (default '*') and unknown amino acid ('X'). (These are
normally best left untouched.) The third argument determines the
frame of the translation. The default frame is "0". To get
translations in the other two forward frames, we would write:
$translation2 = $my_seq_object->translate(undef,undef,1);
$translation3 = $my_seq_object->translate(undef,undef,2);
The fourth argument to "translate" makes it possible to use
alternative genetic codes. There are currently 16 codon tables
defined, including tables for 'Verterbate Mitochondrial', 'Bacterial',
'Alternative Yeast Nuclear' and 'Ciliate, Dasycladacean and Hexamita
Nuclear' translation. These tables are located in the object
Bio::Tools::CodonTable which is used by the translate method. For
example, for mitochondrial translation:
$human_mitochondrial_translation =
$my_seq_object->translate(undef,undef,undef, 2);
If we want to translate full coding regions (CDS) the way major
nucleotide databanks EMBL, GenBank and DDBJ do it, the translate
method has to perform more tricks. Specifically, 'translate' needs to
confirm that the sequence has appropriate start and terminator codons
at the beginning and the end of the sequence and that there are no
terminator codons present within the sequence. In addition, if the
genetic code being used has an atypical (non-ATG) start codon, the
translate method needs to convert the initial amino acid to
methionine. These checks and conversions are triggered by setting the
fifth argument of the translate method to evaluate to "true".
If argument 5 is set to true and the criteria for a proper CDS are
not met, the method, by default, issues a warning. By setting the
sixth argument to evaluate to "true", one can instead instruct
the program to die if an improper CDS is found, e.g.
$protein_object =
$cds->translate(undef,undef,undef,undef,1,'die_if_errors');
=head2 III.3.2 Obtaining basic sequence statistics- MW, residue &codon
frequencies(SeqStats, SeqWord)
In addition to the methods directly available in the Seq object,
bioperl provides various "helper" objects to determine additional
information about a sequence. For example, the SeqStats object
provides methods for obtaining the molecular weight of the sequence as
well the number of occurrences of each of the component residues
(bases for a nucleic acid or amino acids for a protein.) For nucleic
acids, SeqStats also returns counts of the number of codons used. For
example:
use SeqStats
$seq_stats = Bio::Tools::SeqStats->new($seqobj);
$weight = $seq_stats->get_mol_wt();
$monomer_ref = $seq_stats->count_monomers();
$codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence
Note: sometimes sequences will contain "ambiguous" codes. For this
reason, get_mol_wt() returns (a reference to) a two element array
containing a greatest lower bound and a least upper bound of the
molecular weight.
The SeqWords object is similar to SeqStats and provides methods for
calculating frequencies of "words" (eg tetramers or hexamers) within
the sequence.
=head2 III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme)
Another common sequence manipulation task for nucleic acid sequences
is locating restriction enzyme cutting sites. Bioperl provides the
RestrictionEnzyme object for this purpose. Bioperl\'s standard
RestrictionEnzyme object comes with data for XXX different restriction
enzymes. A list of the available enzymes can be accessed using the
available_list() method. For example to select all available enzymes
that with cutting patterns that are six bases long one would write:
$re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
@sixcutters = $re->available_list(6);
Once an appropriate enzyme has been selected, the sites for that
enzyme on a given nucleic acid sequence can be obtained using the
cut_seq() method. The syntax for performing this task is:
$re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI');
# $seqobj is the Seq object for the dna to be cut
@fragments = $re1->cut_seq($seqobj);
Adding an enzyme not in the default list is easily accomplished:
$re2 = new Bio::Tools::RestrictionEnzyme('-NAME' =>'EcoRV--GAT^ATC',
'-MAKE' =>'custom');
Once the custom enzyme object has been created, cut_seq() can be
called in the usual manner.
=head2 III.3.4 Identifying amino acid cleavage sites (Sigcleave)
For amino acid sequences we may be interested to know whether the
amino acid sequence contains a cleavable "signal sequence" for
directing the transport of the protein within the cell. SigCleave is
a program (originally part of the EGCG molecular biology package) to
predict signal sequences, and to identify the cleavage site.
The "threshold" setting controls the score reporting. If no value for
threshold is passed in by the user, the code defaults to a reporting
value of 3.5. SigCleave will only return score/position pairs which
meet the threshold limit.
There are 2 accessor methods for this object. "signals" will return a
perl hash containing the sigcleave scores keyed by amino acid
position. "pretty_print" returns a formatted string similar to the
output of the original sigcleave utility.
Syntax for using the modules is as follows:
use Bio::Tools::Sigcleave;
$sigcleave_object = new Bio::Tools::Sigcleave
('-file'=>'sigtest.aa',
'-threshold'=>'3.5'
'-desc'=>'test sigcleave protein seq',
'-type'=>'AMINO
');
%raw_results = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;
Note that Sigcleave is passed a raw sequence (or file containing a
sequence) rather than a sequence object when it is created. Also note
that the "type" in the Sigcleave object is "amino" whereas in a Seq
object it is "protein".
=head2 III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern
OddCodes:
For some purposes it\'s useful to have a listing of an amino acid
sequence showing where the hydrophobic amino acids are located or
where the positively charged ones are. Bioperl provides this
capability via the module OddCodes.pm.
For example, to quickly see where the charged amino acids are located
along the sequence we perform:
use Bio::Tools::OddCodes;
$oddcode_obj = Bio::Tools::OddCodes->new($amino_obj);
$output = $oddcode_obj->charge();
The sequence will be transformed into a three-letter sequence (A,C,N)
for negative (acidic), positive (basic), and neutral amino acids. For
example the ACDEFGH would become NNAANNC.
For a more complete chemical description of the sequence one can call
the chemical() method which turns sequence into one with an 8-letter
chemical alphabet { A (acidic), L (aliphatic), M (amide), R
(aromatic), C (basic), H (hydroxyl), I (imino), S (sulfur) }:
$output = $oddcode_obj->chemical();
In this case the sample sequence ACDEFGH would become LSAARAC.
OddCodes also offers translation into alphabets showing alternate
characteristics of the amino acid sequence such as hydrophobicity,
"functionality" or grouping using Dayhoff\'s definitions. See the
documentation for OddCodes.pm for further details.
SeqPattern:
The SeqPattern object is used to manipulate sequences that include
perl "regular expressions". A key motivation for SeqPattern is to
have a way of generating a reverse complement of a nucleic acid
sequence pattern that includes ambiguous bases and/or regular
expressions. This capability leads to significant performance gains
when pattern matching on both the sense and anti-sense strands of a
query sequence are required. Typical syntax for using SeqPattern is
shown below. For more information, there are several interesting
examples in the script SeqPattern.pl in the examples directory.
Use Bio::Tools::SeqPattern;
$pattern = '(CCCCT)N{1,200}(agggg)N{1,200}(agggg)';
$pattern_obj = new Bio::Tools::SeqPattern('-SEQ' =>$pattern,
'-TYPE' =>'dna');
$pattern_obj2 = $pattern_obj->revcom();
$pattern_obj->revcom(1); ## returns expanded rev complement pattern.
=head2 III.4 Searching for "similar" sequences
One of the basic tasks in molecular biology is identifying sequences
that are, in some way, similar to a sequence of interest. The Blast
programs, originally developed at the NCBI, are widely used for
identifying such sequences. Bioperl offers a number of modules to
facilitate running Blast as well as to parse the often voluminous
reports produced by Blast.
=head2 III.4.1 Running BLAST locally (StandAloneBlast)
There are several reasons why one might want to run the Blast programs
locally - speed, data security, immunity to network problems, being
able to run large batch runs etc. The NCBI provides a downloadable
version of blast in a stand-alone format, and running blast locally
without any use of perl or bioperl - is completely
straightforward. However, there are situations where having a perl
interface for running the blast programs locally is convenient.
The module StandAloneBlast.pm offers the ability to wrap local calls
to blast from within perl. All of the currently available options of
NCBI Blast (eg PSIBLAST, PHIBLAST, bl2seq) are available from within
the bioperl StandAloneBlast interface. Of course, to use
StandAloneBlast, one needs to have installed locally ncbi-blast as
well as one or more blast-readable databases.
Basic usage of the StandAloneBlast.pm module is simple. Initially, a
local blast "factory object" is created.
@params = ('program' => 'blastn',