Print functions and back-tick markup for AlignIO page etc
See #47.
peterjc committed Apr 29, 2016
1 parent ceef80e commit 4bd8c87
Showing 1 changed file with 80 additions and 77 deletions.
157 changes: 80 additions & 77 deletions wiki/AlignIO.md
@@ -6,15 +6,15 @@ tags:
- Wiki Documentation
---

This page describes Bio.AlignIO, a new multiple sequence Alignment
This page describes `Bio.AlignIO`, a new multiple sequence Alignment
Input/Output interface for BioPython 1.46 and later.

In addition to the built in API documentation, there is a whole chapter
in the [Tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
on Bio.AlignIO, and although there is some overlap it is well worth
reading in addition to this WIKI page. There is also the [API
reading in addition to this page. There is also the [API
documentation](http://biopython.org/DIST/docs/api/Bio.AlignIO-module.html)
(which you can read online, or from within Python with the help
(which you can read online, or from within Python with the `help()`
command).

Aims
@@ -23,21 +23,21 @@ Aims
You may already be familiar with the [Bio.SeqIO](SeqIO "wikilink")
module which deals with files containing one or more sequences
represented as [SeqRecord](SeqRecord "wikilink") objects. The purpose of
the SeqIO module is to provide a simple uniform interface to assorted
the `SeqIO` module is to provide a simple uniform interface to assorted
sequence file formats.

Similarly, Bio.AlignIO deals with files containing one or more sequence
alignments represented as Alignment objects. Bio.AlignIO uses the same
set of functions for input and output as in Bio.SeqIO, and the same
Similarly, `Bio.AlignIO` deals with files containing one or more sequence
alignments represented as Alignment objects. `Bio.AlignIO` uses the same
set of functions for input and output as in `Bio.SeqIO`, and the same
names for the file formats supported.

Note that the inclusion of Bio.AlignIO does lead to some duplication or
choice in how to deal with some file formats. For example, Bio.AlignIO
and Bio.Nexus will both read alignments from NEXUS files - but Bio.NEXUS
allows more control and the use of trees.
Note that the inclusion of `Bio.AlignIO` does lead to some duplication or
choice in how to deal with some file formats. For example, `Bio.AlignIO`
and `Bio.Nexus` will both read alignments from NEXUS files - but
`Bio.NEXUS` allows more control and the use of trees.

My vision is that for reading or writing sequence alignments you should
try Bio.AlignIO as your first choice. In some cases you may only care
try `Bio.AlignIO` as your first choice. In some cases you may only care
about the sequences themselves, in which case try using
[Bio.SeqIO](SeqIO "wikilink") on the alignment file directly. Unless you
have some very specific requirements, I hope this should suffice.
@@ -98,48 +98,50 @@ Fib\_gamma](http://pfam.sanger.ac.uk/family?acc=PF09395). At the time of
writing, this contained 14 sequences with an alignment length of 77
amino acids, and is shown below in the PFAM or Stockholm format:

# STOCKHOLM 1.0
#=GS Q7ZVG7_BRARE/37-110 AC Q7ZVG7.1
#=GS Q6X871_SCAAQ/1-77 AC Q6X871.1
#=GS O02676_CROCR/1-77 AC O02676.1
#=GS Q6X869_TENEC/1-77 AC Q6X869.1
#=GS FIBG_HUMAN/40-116 AC P02679.3
#=GS O02689_TAPIN/1-77 AC O02689.1
#=GS O02688_PIG/1-77 AC O02688.1
#=GS O02672_9CETA/1-77 AC O02672.1
#=GS O02682_EQUPR/1-77 AC O02682.1
#=GS Q6X870_CYNVO/1-77 AC Q6X870.1
#=GS FIBG_RAT/40-116 AC P02680.3
#=GS Q6X866_DROAU/1-76 AC Q6X866.1
#=GS O93568_CHICK/40-116 AC O93568.1
#=GS FIBG_XENLA/38-114 AC P17634.1
Q7ZVG7_BRARE/37-110 GFGTYCPTTCGVADYLQRYKPDMDKKLDDMEQDLEEIANLTRGAQDKVVYLK---DSEAQAQKQSPDTYIKKSSNML
Q6X871_SCAAQ/1-77 RFGSYCPTTCGIADFLSTYQATVDKDLQTLEDILSQAENKTMEAKELVKAIQVSYLPEDPARPNRVELATKDSKKMM
O02676_CROCR/1-77 RFGSYCPTTCGIADFLSTYQTGVXNDLRTLEDLLSGIENKTSEAKELIKSIQVSYNPNEPPKPNTIVSATKDSKKMM
Q6X869_TENEC/1-77 RFGSYCPTTCGIADFLSTYQGSIDKDLQTLEDILNQVENKTXEASELIKSIQVSYNPDEPPRPNMIEGATQKSKKML
FIBG_HUMAN/40-116 RFGSYCPTTCGIADFLSTYQTKVDKDLQSLEDILHQVENKTSEVKQLIKAIQLTYNPDESSKPNMIDAATLKSRKML
#=GS FIBG_HUMAN/40-116 DR PDB; 1qvh L;14-45
#=GS FIBG_HUMAN/40-116 DR PDB; 1fza C;88-90
#=GS FIBG_HUMAN/40-116 DR PDB; 1fzb C;88-90
#=GS FIBG_HUMAN/40-116 DR PDB; 1fzb F;88-90
#=GS FIBG_HUMAN/40-116 DR PDB; 1qvh I;14-45
#=GS FIBG_HUMAN/40-116 DR PDB; 1fza F;88-90
#=GR FIBG_HUMAN/40-116 SS CCXCXBXXHHHHHHHHHHHHHHHHHHHHHHHXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX-CC
O02689_TAPIN/1-77 RFGSYCPTTCGIADFLSTYQTXVDKDLQVLEDILNQAENKTSEAKELIKAIQVRYKPDEPTKPGGIDSATRESKKML
O02688_PIG/1-77 RFGSYCPTMCGIAGFLSTYQNTVEKDLQNLEGILHQVENKTSEARELIKAIQISYNPEDLSKPDRIQSATKESKKML
O02672_9CETA/1-77 RFGSYCPTTCGVADFLSNYQTSVDKDLQNLEGILYQVENKTSEARELVKAIQISYNPDEPSKPNNIESATKNSKRMM
O02682_EQUPR/1-77 RFGSYCPTTCGIADFLSNYQTSVDKDLQDFEDILHRAENQTSEAEQLIQAIRTSYNPDEPPKTGRIDAATRESKKMM
Q6X870_CYNVO/1-77 RFGSYCPTTCGIADFLSTYQTKVDEDLQNLEDILYRVENRTSEAKELIKAIQVDYNPGEPPKQSVTEGATQNAKKMV
FIBG_RAT/40-116 RFGSYCPTTCGISDFLNSYQTDVDTDLQTLENILQRAENRTTEAKELIKAIQVYYNPDQPPKPGMIEGATQKSKKMV
Q6X866_DROAU/1-76 RFGSYCPTTCGIADFLNKYQTTIDQDLRHMEETLRDIDNKTAESTLLIQKIQIGQTPDPRPQ-NVIGDVTQKSRKMI
O93568_CHICK/40-116 RFGSYCPTTCGIADFFNKYRLTTDGELLEIEGLLQQATNSTGSIEYLIQHIKTIYPSEKQTLPQSIEQLTQKSKKII
#=GS O93568_CHICK/40-116 DR PDB; 1m1j F;14-90
#=GS O93568_CHICK/40-116 DR PDB; 1m1j C;14-90
#=GR O93568_CHICK/40-116 SS CCEEEEE-CCCCCCCCCCCCCHHHCCCCCHHHHHHHHHHHHHHHCCCCCCHHHHS-SSTT--SS-HHHHHHHHHHHH
FIBG_XENLA/38-114 RFGEYCPTTCGISDFLNRYQENVDTDLQYLENLLTQISNSTSGTTIIVEHLIDSGKKPATSPQTAIDPMTQKSKTCW
#=GC SS_cons CCECEEE-CCCCCCCCCCCCCHHHCCCCCHHHHHHHHHHHHHHHCCCCCCHHHHS-SSTT--SS-HHHHHHHHHHCC
#=GC seq_cons RFGSYCPTTCGIADFLSsYQssVDcDLQsLEsILpplEN+ToEAc-LIKuIQlsYsP--ss+PstI-uATpcSKKMl
//
```
# STOCKHOLM 1.0
#=GS Q7ZVG7_BRARE/37-110 AC Q7ZVG7.1
#=GS Q6X871_SCAAQ/1-77 AC Q6X871.1
#=GS O02676_CROCR/1-77 AC O02676.1
#=GS Q6X869_TENEC/1-77 AC Q6X869.1
#=GS FIBG_HUMAN/40-116 AC P02679.3
#=GS O02689_TAPIN/1-77 AC O02689.1
#=GS O02688_PIG/1-77 AC O02688.1
#=GS O02672_9CETA/1-77 AC O02672.1
#=GS O02682_EQUPR/1-77 AC O02682.1
#=GS Q6X870_CYNVO/1-77 AC Q6X870.1
#=GS FIBG_RAT/40-116 AC P02680.3
#=GS Q6X866_DROAU/1-76 AC Q6X866.1
#=GS O93568_CHICK/40-116 AC O93568.1
#=GS FIBG_XENLA/38-114 AC P17634.1
Q7ZVG7_BRARE/37-110 GFGTYCPTTCGVADYLQRYKPDMDKKLDDMEQDLEEIANLTRGAQDKVVYLK---DSEAQAQKQSPDTYIKKSSNML
Q6X871_SCAAQ/1-77 RFGSYCPTTCGIADFLSTYQATVDKDLQTLEDILSQAENKTMEAKELVKAIQVSYLPEDPARPNRVELATKDSKKMM
O02676_CROCR/1-77 RFGSYCPTTCGIADFLSTYQTGVXNDLRTLEDLLSGIENKTSEAKELIKSIQVSYNPNEPPKPNTIVSATKDSKKMM
Q6X869_TENEC/1-77 RFGSYCPTTCGIADFLSTYQGSIDKDLQTLEDILNQVENKTXEASELIKSIQVSYNPDEPPRPNMIEGATQKSKKML
FIBG_HUMAN/40-116 RFGSYCPTTCGIADFLSTYQTKVDKDLQSLEDILHQVENKTSEVKQLIKAIQLTYNPDESSKPNMIDAATLKSRKML
#=GS FIBG_HUMAN/40-116 DR PDB; 1qvh L;14-45
#=GS FIBG_HUMAN/40-116 DR PDB; 1fza C;88-90
#=GS FIBG_HUMAN/40-116 DR PDB; 1fzb C;88-90
#=GS FIBG_HUMAN/40-116 DR PDB; 1fzb F;88-90
#=GS FIBG_HUMAN/40-116 DR PDB; 1qvh I;14-45
#=GS FIBG_HUMAN/40-116 DR PDB; 1fza F;88-90
#=GR FIBG_HUMAN/40-116 SS CCXCXBXXHHHHHHHHHHHHHHHHHHHHHHHXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX-CC
O02689_TAPIN/1-77 RFGSYCPTTCGIADFLSTYQTXVDKDLQVLEDILNQAENKTSEAKELIKAIQVRYKPDEPTKPGGIDSATRESKKML
O02688_PIG/1-77 RFGSYCPTMCGIAGFLSTYQNTVEKDLQNLEGILHQVENKTSEARELIKAIQISYNPEDLSKPDRIQSATKESKKML
O02672_9CETA/1-77 RFGSYCPTTCGVADFLSNYQTSVDKDLQNLEGILYQVENKTSEARELVKAIQISYNPDEPSKPNNIESATKNSKRMM
O02682_EQUPR/1-77 RFGSYCPTTCGIADFLSNYQTSVDKDLQDFEDILHRAENQTSEAEQLIQAIRTSYNPDEPPKTGRIDAATRESKKMM
Q6X870_CYNVO/1-77 RFGSYCPTTCGIADFLSTYQTKVDEDLQNLEDILYRVENRTSEAKELIKAIQVDYNPGEPPKQSVTEGATQNAKKMV
FIBG_RAT/40-116 RFGSYCPTTCGISDFLNSYQTDVDTDLQTLENILQRAENRTTEAKELIKAIQVYYNPDQPPKPGMIEGATQKSKKMV
Q6X866_DROAU/1-76 RFGSYCPTTCGIADFLNKYQTTIDQDLRHMEETLRDIDNKTAESTLLIQKIQIGQTPDPRPQ-NVIGDVTQKSRKMI
O93568_CHICK/40-116 RFGSYCPTTCGIADFFNKYRLTTDGELLEIEGLLQQATNSTGSIEYLIQHIKTIYPSEKQTLPQSIEQLTQKSKKII
#=GS O93568_CHICK/40-116 DR PDB; 1m1j F;14-90
#=GS O93568_CHICK/40-116 DR PDB; 1m1j C;14-90
#=GR O93568_CHICK/40-116 SS CCEEEEE-CCCCCCCCCCCCCHHHCCCCCHHHHHHHHHHHHHHHCCCCCCHHHHS-SSTT--SS-HHHHHHHHHHHH
FIBG_XENLA/38-114 RFGEYCPTTCGISDFLNRYQENVDTDLQYLENLLTQISNSTSGTTIIVEHLIDSGKKPATSPQTAIDPMTQKSKTCW
#=GC SS_cons CCECEEE-CCCCCCCCCCCCCHHHCCCCCHHHHHHHHHHHHHHHCCCCCCHHHHS-SSTT--SS-HHHHHHHHHHCC
#=GC seq_cons RFGSYCPTTCGIADFLSsYQssVDcDLQsLEsILpplEN+ToEAc-LIKuIQlsYsP--ss+PstI-uATpcSKKMl
//
```

You will notice that there is plenty of annotation information here,
including accession numbers for each sequence and also some PDB database
@@ -149,53 +151,54 @@ chick fibrinogen proteins.
This file contains a single alignment, so we can use the
`Bio.AlignIO.read()` function to load it in Biopython. Let's assume
you have downloaded this alignment from Sanger, or have copy and pasted
the text above, and saved this as a file called `PF09395\_seed.sth` on
the text above, and saved this as a file called `PF09395_seed.sth` on
your computer. Then in python:

``` python
from Bio import AlignIO
alignment = AlignIO.read(open("PF09395_seed.sth"), "stockholm")
print "Alignment length %i" % alignment.get_alignment_length()
print("Alignment length %i" % alignment.get_alignment_length())
for record in alignment :
print record.seq, record.id
print(record.seq + " " + record.id)
```

That should give:

Alignment length 77
GFGTYCPTTCGVADYLQRYKPDMDKKLDDMEQDLEEIANLTRGAQDKVVYLK---DSEAQAQKQSPDTYIKKSSNML Q7ZVG7_BRARE/37-110
RFGSYCPTTCGIADFLSTYQATVDKDLQTLEDILSQAENKTMEAKELVKAIQVSYLPEDPARPNRVELATKDSKKMM Q6X871_SCAAQ/1-77
RFGSYCPTTCGIADFLSTYQTGVXNDLRTLEDLLSGIENKTSEAKELIKSIQVSYNPNEPPKPNTIVSATKDSKKMM O02676_CROCR/1-77
RFGSYCPTTCGIADFLSTYQGSIDKDLQTLEDILNQVENKTXEASELIKSIQVSYNPDEPPRPNMIEGATQKSKKML Q6X869_TENEC/1-77
RFGSYCPTTCGIADFLSTYQTKVDKDLQSLEDILHQVENKTSEVKQLIKAIQLTYNPDESSKPNMIDAATLKSRKML FIBG_HUMAN/40-116
RFGSYCPTTCGIADFLSTYQTXVDKDLQVLEDILNQAENKTSEAKELIKAIQVRYKPDEPTKPGGIDSATRESKKML O02689_TAPIN/1-77
RFGSYCPTMCGIAGFLSTYQNTVEKDLQNLEGILHQVENKTSEARELIKAIQISYNPEDLSKPDRIQSATKESKKML O02688_PIG/1-77
RFGSYCPTTCGVADFLSNYQTSVDKDLQNLEGILYQVENKTSEARELVKAIQISYNPDEPSKPNNIESATKNSKRMM O02672_9CETA/1-77
RFGSYCPTTCGIADFLSNYQTSVDKDLQDFEDILHRAENQTSEAEQLIQAIRTSYNPDEPPKTGRIDAATRESKKMM O02682_EQUPR/1-77
RFGSYCPTTCGIADFLSTYQTKVDEDLQNLEDILYRVENRTSEAKELIKAIQVDYNPGEPPKQSVTEGATQNAKKMV Q6X870_CYNVO/1-77
RFGSYCPTTCGISDFLNSYQTDVDTDLQTLENILQRAENRTTEAKELIKAIQVYYNPDQPPKPGMIEGATQKSKKMV FIBG_RAT/40-116
RFGSYCPTTCGIADFLNKYQTTIDQDLRHMEETLRDIDNKTAESTLLIQKIQIGQTPDPRPQ-NVIGDVTQKSRKMI Q6X866_DROAU/1-76
RFGSYCPTTCGIADFFNKYRLTTDGELLEIEGLLQQATNSTGSIEYLIQHIKTIYPSEKQTLPQSIEQLTQKSKKII O93568_CHICK/40-116
RFGEYCPTTCGISDFLNRYQENVDTDLQYLENLLTQISNSTSGTTIIVEHLIDSGKKPATSPQTAIDPMTQKSKTCW FIBG_XENLA/38-114
```
Alignment length 77
GFGTYCPTTCGVADYLQRYKPDMDKKLDDMEQDLEEIANLTRGAQDKVVYLK---DSEAQAQKQSPDTYIKKSSNML Q7ZVG7_BRARE/37-110
RFGSYCPTTCGIADFLSTYQATVDKDLQTLEDILSQAENKTMEAKELVKAIQVSYLPEDPARPNRVELATKDSKKMM Q6X871_SCAAQ/1-77
RFGSYCPTTCGIADFLSTYQTGVXNDLRTLEDLLSGIENKTSEAKELIKSIQVSYNPNEPPKPNTIVSATKDSKKMM O02676_CROCR/1-77
RFGSYCPTTCGIADFLSTYQGSIDKDLQTLEDILNQVENKTXEASELIKSIQVSYNPDEPPRPNMIEGATQKSKKML Q6X869_TENEC/1-77
RFGSYCPTTCGIADFLSTYQTKVDKDLQSLEDILHQVENKTSEVKQLIKAIQLTYNPDESSKPNMIDAATLKSRKML FIBG_HUMAN/40-116
RFGSYCPTTCGIADFLSTYQTXVDKDLQVLEDILNQAENKTSEAKELIKAIQVRYKPDEPTKPGGIDSATRESKKML O02689_TAPIN/1-77
RFGSYCPTMCGIAGFLSTYQNTVEKDLQNLEGILHQVENKTSEARELIKAIQISYNPEDLSKPDRIQSATKESKKML O02688_PIG/1-77
RFGSYCPTTCGVADFLSNYQTSVDKDLQNLEGILYQVENKTSEARELVKAIQISYNPDEPSKPNNIESATKNSKRMM O02672_9CETA/1-77
RFGSYCPTTCGIADFLSNYQTSVDKDLQDFEDILHRAENQTSEAEQLIQAIRTSYNPDEPPKTGRIDAATRESKKMM O02682_EQUPR/1-77
RFGSYCPTTCGIADFLSTYQTKVDEDLQNLEDILYRVENRTSEAKELIKAIQVDYNPGEPPKQSVTEGATQNAKKMV Q6X870_CYNVO/1-77
RFGSYCPTTCGISDFLNSYQTDVDTDLQTLENILQRAENRTTEAKELIKAIQVYYNPDQPPKPGMIEGATQKSKKMV FIBG_RAT/40-116
RFGSYCPTTCGIADFLNKYQTTIDQDLRHMEETLRDIDNKTAESTLLIQKIQIGQTPDPRPQ-NVIGDVTQKSRKMI Q6X866_DROAU/1-76
RFGSYCPTTCGIADFFNKYRLTTDGELLEIEGLLQQATNSTGSIEYLIQHIKTIYPSEKQTLPQSIEQLTQKSKKII O93568_CHICK/40-116
RFGEYCPTTCGISDFLNRYQENVDTDLQYLENLLTQISNSTSGTTIIVEHLIDSGKKPATSPQTAIDPMTQKSKTCW FIBG_XENLA/38-114
```

Alignment Output
----------------

As in [Bio.SeqIO](SeqIO "wikilink"), there is a single output function
**Bio.AlignIO.write()**. This takes three arguments: some alignments, a
`Bio.AlignIO.write()`. This takes three arguments: some alignments, a
file handle to write to, and the format to use.
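
As a rough illustration of those three arguments, here is a minimal sketch. It assumes a recent Biopython in which alignments are `MultipleSeqAlignment` objects from `Bio.Align` (the page itself only says "Alignment objects"); the two toy sequences, their ids, and the in-memory `StringIO` handle are made up for the example and are not part of the wiki page.

``` python
from io import StringIO

from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical two-sequence alignment; every row must have the same length.
alignment = MultipleSeqAlignment(
    [
        SeqRecord(Seq("ACTG-A"), id="seq1", description=""),
        SeqRecord(Seq("ACT-CA"), id="seq2", description=""),
    ]
)

# An in-memory handle stands in for a real file opened for writing.
handle = StringIO()

# Arguments: the alignment(s), the handle to write to, the format name.
count = AlignIO.write([alignment], handle, "fasta")
text = handle.getvalue()
print(text)
```

In current Biopython, `AlignIO.write()` returns the number of alignments written, which can serve as a quick sanity check after writing.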

As of Biopython 1.48, the alignment object acquired a **format()**
As of Biopython 1.48, the alignment object acquired a `.format()`
method to give a string containing the alignment in the specified file
format, e.g.

``` python
AlignIO.read(open("PF09395_seed.sth"), "stockholm")
print alignment.format("fasta")
print(alignment.format("fasta"))
```

This wiki section needs to be filled out, so in the short term please
refer to the Bio.AlignIO chapter in the Tutorial.
Please refer to the Bio.AlignIO chapter in the Tutorial for more details.

File Format Conversion
----------------------