# Copyright 2006-2007 by Peter Cock. All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license. Please see the LICENSE file that should have been included
# as part of this package.
#
#Nice link:
# http://www.ebi.ac.uk/help/formats_frame.html

"""Sequence input/output as SeqRecord objects.

The Bio.SeqIO module is also documented by a whole chapter in the Biopython
tutorial, and by the wiki http://biopython.org/wiki/SeqIO on the website.
The approach is designed to be similar to the bioperl SeqIO design.

Input
=====
The main function is Bio.SeqIO.parse(...) which takes an input file handle,
and format string. This returns an iterator giving SeqRecord objects.

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    for record in SeqIO.parse(handle, "fasta") :
        print record
    handle.close()

Note that the parse() function will invoke the relevant parser for the
format with its default settings. You may want more control, in which case
you need to create a format specific sequence iterator directly.

For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records
using a sequence iterator can save you a lot of memory (RAM). There is
less benefit for interlaced file formats (e.g. most multiple alignment file
formats). However, an iterator only lets you access the records one by one.

If you want random access to the records by number, turn this into a list:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    records = list(SeqIO.parse(handle, "fasta"))
    handle.close()
    print records[0]

If you want random access to the records by a key such as the record id,
turn the iterator into a dictionary:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
    handle.close()
    print record_dict["gi:12345678"]

If you expect your file to contain one-and-only-one record, then we provide
the following 'helper' function which will return a single SeqRecord, or
raise an exception if there are no records or more than one record:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    record = SeqIO.read(handle, "fasta")
    handle.close()
    print record

This style is useful when you expect a single record only (and would
consider multiple records an error). For example, when dealing with GenBank
files for bacterial genomes or chromosomes, there is normally only a single
record. Alternatively, use this with a handle when downloading a single
record from the internet.

However, if you just want the first record from a file containing multiple
records, use the iterator's next() method:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    record = SeqIO.parse(handle, "fasta").next()
    handle.close()
    print record

The above code will work as long as the file contains at least one record.
Note that if there is more than one record, the remaining records will be
silently ignored.
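
The difference between "exactly one record" and "just the first record" can
be sketched on a plain Python iterator, without Biopython. This is only an
illustration (written for Python 3): read_single is a hypothetical helper
name, not part of Bio.SeqIO, and plain strings stand in for SeqRecord objects.

```python
def read_single(iterator):
    # Hypothetical helper: return the only item, or raise ValueError
    # if the iterator yields no items or more than one.
    iterator = iter(iterator)
    missing = object()
    first = next(iterator, missing)
    if first is missing:
        raise ValueError("No records found")
    if next(iterator, missing) is not missing:
        raise ValueError("More than one record found")
    return first


# Stand-in data: plain strings instead of SeqRecord objects.
one_record = ["alpha"]
three_records = ["alpha", "beta", "gamma"]

only = read_single(one_record)        # exactly one item, so it is returned
first = next(iter(three_records))     # first item; the rest are ignored

try:
    read_single(three_records)
    too_many = False
except ValueError:
    too_many = True
```

Using a private sentinel object rather than None means an iterator that
legitimately yields None would still be counted correctly.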

Input - Alignments
==================
Currently an alignment class cannot be created from SeqRecord objects.
Instead, use the to_alignment(...) function, like so:

    from Bio import SeqIO
    handle = open("example.aln", "rU")
    alignment = SeqIO.to_alignment(SeqIO.parse(handle, "clustal"))
    handle.close()

This function may be removed in future once alignments can be created
directly from SeqRecord objects.

Output
======
Use the function Bio.SeqIO.write(...), which takes a complete set of
SeqRecord objects (either as a list, or an iterator), an output file handle
and of course the file format.

    from Bio import SeqIO
    records = ...
    handle = open("example.faa", "w")
    SeqIO.write(records, handle, "fasta")
    handle.close()

In general, you are expected to call this function once (with all your
records) and then close the file handle.

Output - Advanced
=================
The effect of calling write() multiple times on a single file will vary
depending on the file format, and is best avoided unless you have a strong
reason to do so.

Trying this for certain alignment formats (e.g. phylip, clustal, stockholm)
would have the effect of concatenating several multiple sequence alignments
together. Such files are created by the PHYLIP suite of programs for
bootstrap analysis.

For sequential file formats (e.g. fasta, genbank) each "record block" holds
a single sequence. For these files it would probably be safe to call
write() multiple times.
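
As a rough illustration of why repeated calls are harmless for sequential
formats, the sketch below (written for Python 3) uses a hypothetical stand-in
writer, write_fasta, and an in-memory handle; it is not the Bio.SeqIO API, and
records are simplified to (identifier, sequence) pairs. Because each record
block is independent, two batches written to one handle should give the same
text as a single call with all the records.

```python
from io import StringIO


def write_fasta(records, handle):
    # Hypothetical stand-in writer: each (identifier, sequence) pair
    # becomes one self-contained FASTA record block.
    for identifier, sequence in records:
        handle.write(">%s\n%s\n" % (identifier, sequence))


# Two separate calls on the same handle...
handle = StringIO()
write_fasta([("seq1", "ACGT")], handle)
write_fasta([("seq2", "GGCC"), ("seq3", "TTAA")], handle)
combined = handle.getvalue()
handle.close()

# ...compared with one call holding all the records.
handle = StringIO()
write_fasta([("seq1", "ACGT"), ("seq2", "GGCC"), ("seq3", "TTAA")], handle)
single_call = handle.getvalue()
handle.close()
```

The same reasoning does not carry over to alignment formats, where each call
writes a header and a complete alignment block.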

File Formats
============
When specifying formats, use lowercase strings.

Old Files
=========
The modules Bio.SeqIO.FASTA and Bio.SeqIO.generic are deprecated and may be
removed.
"""

#TODO
# - define policy on reading aligned sequences with gaps in
#   (e.g. - and . characters) including how the alphabet interacts
#
# - Can we build the to_alignment(...) functionality
#   into the generic Alignment class instead?
#
# - How best to handle unique/non unique record.id when writing.
#   For most file formats reading such files is fine; The stockholm
#   parser would fail.
#
# - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf)
#   http://www.bioperl.org/wiki/MSF_multiple_alignment_format
#
# - Writing NEXUS multiple alignment format (*.nxs)
#   http://www.bioperl.org/wiki/NEXUS_multiple_alignment_format
#   Can we simply offload to Bio.Nexus for this?

"""
FAO BioPython Developers
========================
The way I envision this SeqIO system working is that for any sequence file
format we have an iterator that returns SeqRecord objects.

This also applies to interlaced file formats (like clustal) where the file
cannot be read record by record. You should still return an iterator!

These file format specific sequence iterators may be implemented as:
* Classes which take a handle for __init__ and provide the __iter__ method
* Functions that take a handle, and return an iterator object
* Generator functions that take a handle, and yield SeqRecord objects
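
The generator function option can be sketched as follows. This is a minimal
illustration written for Python 3, not the FastaIO implementation: the name
simple_fasta_iterator is hypothetical, it yields (identifier, sequence) tuples
instead of SeqRecord objects, and it does no error checking.

```python
from io import StringIO


def simple_fasta_iterator(handle):
    # Generator function: takes a handle and yields one record at a time,
    # so only the current record is held in memory.
    identifier = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if identifier is not None:
                yield identifier, "".join(chunks)
            title = line[1:].split(None, 1)
            identifier = title[0] if title else ""
            chunks = []
        elif identifier is not None:
            # Sequence line; drop any spaces used for layout.
            chunks.append(line.replace(" ", ""))
    if identifier is not None:
        yield identifier, "".join(chunks)


example = StringIO(">alpha first test\nACGT\nAC GT\n>beta\nTTTT\n")
records = list(simple_fasta_iterator(example))
```

The caller opened the handle, so the generator leaves closing it to the
caller, in keeping with the convention used by parse().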

It is then trivial to turn this iterator into a list of SeqRecord objects,
an in memory dictionary, or a multiple sequence alignment object.

For building the dictionary by default the id property of each SeqRecord is
used as the key. You should always populate the id property, and it should
be unique. For some file formats the accession number is a good choice.
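
When writing a new iterator it may be worth checking that the ids it produces
really are unique before relying on the dictionary approach. A minimal sketch
(written for Python 3), assuming a hypothetical duplicate_ids helper and a
stand-in record class; neither is part of Biopython:

```python
from collections import Counter


def duplicate_ids(records, key_function=None):
    # Hypothetical helper: return a sorted list of keys seen more than once.
    # Records only need an 'id' attribute unless key_function is given.
    if key_function is None:
        key_function = lambda rec: rec.id
    counts = Counter(key_function(rec) for rec in records)
    return sorted(key for key, count in counts.items() if count > 1)


class FakeRecord(object):
    # Minimal stand-in for SeqRecord, for illustration only.
    def __init__(self, id):
        self.id = id


records = [FakeRecord("P12345"), FakeRecord("Q67890"), FakeRecord("P12345")]
repeated = duplicate_ids(records)
```

An empty result means the default key would be safe for these records.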

When adding a new file format, please use the same lower case format name
as BioPerl, or if they have not defined one, try the names used by EMBOSS.

See also http://biopython.org/wiki/SeqIO_dev

--Peter
"""

import os
#from cStringIO import StringIO
from StringIO import StringIO
from Bio.Alphabet import generic_alphabet, generic_protein
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align.Generic import Alignment

import FastaIO
import InsdcIO #EMBL and GenBank
import StockholmIO
import ClustalIO
import PhylipIO
import NexusIO
import SwissIO

#Convention for format names is "mainname-subtype" in lower case.
#Please use the same names as BioPerl where possible.
#
#Note that this simple system copes with defining
#multiple possible iterators for a given format/extension
#with the -subtype suffix

_FormatToIterator ={"fasta" : FastaIO.FastaIterator,
                    "genbank" : InsdcIO.GenBankIterator,
                    "genbank-cds" : InsdcIO.GenBankCdsFeatureIterator,
                    "embl" : InsdcIO.EmblIterator,
                    "embl-cds" : InsdcIO.EmblCdsFeatureIterator,
                    "clustal" : ClustalIO.ClustalIterator,
                    "phylip" : PhylipIO.PhylipIterator,
                    "nexus" : NexusIO.NexusIterator,
                    "stockholm" : StockholmIO.StockholmIterator,
                    "swiss" : SwissIO.SwissIterator,
                    }

_FormatToWriter ={"fasta" : FastaIO.FastaWriter,
                  "phylip" : PhylipIO.PhylipWriter,
                  "stockholm" : StockholmIO.StockholmWriter,
                  "clustal" : ClustalIO.ClustalWriter,
                  }

def write(sequences, handle, format) :
    """Write complete set of sequences to a file.

    sequences - A list (or iterator) of SeqRecord objects
    handle    - File handle object to write to
    format    - What format to use.

    You should close the handle after calling this function.

    There is no return value.
    """

    #Try and give helpful error messages:
    if isinstance(handle, basestring) :
        raise TypeError("Need a file handle, not a string (i.e. not a filename)")
    if not isinstance(format, basestring) :
        raise TypeError("Need a string for the file format (lower case)")
    if not format :
        raise ValueError("Format required (lower case string)")
    if format <> format.lower() :
        raise ValueError("Format string '%s' should be lower case" % format)
    if isinstance(sequences,SeqRecord):
        raise ValueError("Use a SeqRecord list/iterator, not just a single SeqRecord")

    #Map the file format to a writer class
    try :
        writer_class = _FormatToWriter[format]
    except KeyError :
        raise ValueError("Unknown format '%s'" % format)

    writer_class(handle).write_file(sequences)
    #Don't close the file, as that would prevent things like
    #creating concatenated phylip files for bootstrapping.
    #handle.close()
    return

def parse(handle, format) :
    """Turns a sequence file into an iterator returning SeqRecords.

    handle   - handle to the file.
    format   - string describing the file format.

    If you have the file name in a string 'filename', use:

    from Bio import SeqIO
    my_iterator = SeqIO.parse(open(filename,"rU"), format)

    If you have a string 'data' containing the file contents, use:

    from Bio import SeqIO
    from StringIO import StringIO
    my_iterator = SeqIO.parse(StringIO(data), format)

    Note that the file will be parsed with default settings,
    which may result in a generic alphabet or other non-ideal
    settings. For more control, you must use the format specific
    iterator directly...

    Use the Bio.SeqIO.read(handle, format) function when you expect
    a single record only.
    """

    #Try and give helpful error messages:
    if isinstance(handle, basestring) :
        raise TypeError("Need a file handle, not a string (i.e. not a filename)")
    if not isinstance(format, basestring) :
        raise TypeError("Need a string for the file format (lower case)")
    if not format :
        raise ValueError("Format required (lower case string)")
    if format <> format.lower() :
        raise ValueError("Format string '%s' should be lower case" % format)

    #Map the file format to a sequence iterator:
    try :
        iterator_generator = _FormatToIterator[format]
    except KeyError :
        raise ValueError("Unknown format '%s'" % format)

    #It's up to the caller to close this handle - they opened it.
    return iterator_generator(handle)

def read(handle, format) :
    """Turns a sequence file into a single SeqRecord.

    handle   - handle to the file.
    format   - string describing the file format.

    If the handle contains no records, or more than one record,
    an exception is raised. For example, using a GenBank file
    containing one record:

    from Bio import SeqIO
    record = SeqIO.read(open("example.gbk"), "genbank")

    If however you want the first record from a file containing
    multiple records, this function would raise an exception.
    Instead use:

    from Bio import SeqIO
    record = SeqIO.parse(open("example.gbk"), "genbank").next()

    Use the Bio.SeqIO.parse(handle, format) function if you want
    to read multiple records from the handle.
    """
    iterator = parse(handle, format)
    try :
        first = iterator.next()
    except StopIteration :
        first = None
    if first is None :
        raise ValueError, "No records found in handle"
    try :
        second = iterator.next()
    except StopIteration :
        second = None
    if second is not None :
        raise ValueError, "More than one record found in handle"
    return first

def to_dict(sequences, key_function=None) :
    """Turns a sequence iterator or list into a dictionary.

    sequences  - An iterator that returns SeqRecord objects,
                 or simply a list of SeqRecord objects.
    key_function - Optional function which when given a SeqRecord
                   returns a unique string for the dictionary key.

    e.g. key_function = lambda rec : rec.name
    or,  key_function = lambda rec : rec.description.split()[0]

    If key_function is omitted then record.id is used, on the
    assumption that the record objects returned are SeqRecords
    with a unique id field.

    If there are duplicate keys, an error is raised.

    Example usage:

    from Bio import SeqIO
    filename = "example.fasta"
    d = SeqIO.to_dict(SeqIO.parse(open(filename, "rU"), "fasta"),
        key_function = lambda rec : rec.description.split()[0])
    print len(d)
    print d.keys()[0:10]
    key = d.keys()[0]
    print d[key]
    """
    if key_function is None :
        key_function = lambda rec : rec.id

    d = dict()
    for record in sequences :
        key = key_function(record)
        if key in d :
            raise ValueError("Duplicate key '%s'" % key)
        d[key] = record
    return d

def to_alignment(sequences, alphabet=generic_alphabet, strict=True) :
    """Returns a multiple sequence alignment.

    sequences - An iterator that returns SeqRecord objects,
                or simply a list of SeqRecord objects.
                All the record sequences must be the same length.
    alphabet  - Optional alphabet. Strongly recommended.
    strict    - Optional, defaults to True. Should error checking
                be done?
    """
    #TODO - Move this functionality into the Alignment class instead?
    alignment_length = None
    alignment = Alignment(alphabet)
    for record in sequences :
        if strict :
            if alignment_length is None :
                alignment_length = len(record.seq)
            elif alignment_length <> len(record.seq) :
                raise ValueError("Sequences of different lengths")

            if not isinstance(record.seq.alphabet, alphabet.__class__) :
                raise ValueError("Incompatible sequence alphabet")

            #ToDo, additional checks on the specified alignment...
            #Should we look at the alphabet.contains() method?

        #This is abusing the "private" records list,
        #we should really have a method like add_sequence
        #but which takes SeqRecord objects. See also Bug 1944
        alignment._records.append(record)
    return alignment

if __name__ == "__main__" :
    #Run some tests...
    from Bio.Alphabet import generic_nucleotide
    from sets import Set

    # Fasta file with unusual layout, from here:
    # http://virgil.ruc.dk/kurser/Sekvens/Treedraw.htm
    faa_example = \
""">V_Harveyi_PATH
mknwikvava aialsaatvq aatevkvgms gryfpftfvk qdklqgfevd mwdeigkrnd
ykieyvtanf sglfglletg ridtisnqit mtdarkakyl fadpyvvdga qitvrkgnds
iqgvedlagk tvavnlgsnf eqllrdydkd gkiniktydt giehdvalgr adafimdrls
alelikktgl plqlagepfe tiqnawpfvd nekgrklqae vnkalaemra dgtvekisvk
wfgaditk
>B_subtilis_YXEM
mkmkkwtvlv vaallavlsa cgngnssske ddnvlhvgat gqsypfayke ngkltgfdve
vmeavakkid mkldwkllef sglmgelqtg kldtisnqva vtderketyn ftkpyayagt
qivvkkdntd iksvddlkgk tvaavlgsnh aknleskdpd kkiniktyet qegtlkdvay
grvdayvnsr tvliaqikkt glplklagdp ivyeqvafpf akddahdklr kkvnkaldel
rkdgtlkkls ekyfneditv eqkh
>FLIY_ECOLI
mklahlgrqa lmgvmavalv agmsvksfad egllnkvker gtllvglegt yppfsfqgdd
gkltgfevef aqqlakhlgv easlkptkwd gmlasldskr idvvinqvti sderkkkydf
stpytisgiq alvkkgnegt iktaddlkgk kvgvglgtny eewlrqnvqg vdvrtydddp
tkyqdlrvgr idailvdrla aldlvkktnd tlavtgeafs rqesgvalrk gnedllkavn
daiaemqkdg tlqalsekwf gadvtk
>Deinococcus_radiodurans
mkksllslkl sgllvpsvla lslsacssps stlnqgtlki amegtyppft skneqgelvg
fdvdiakava qklnlkpefv ltewsgilag lqankydviv nqvgitperq nsigfsqpya
ysrpeiivak nntfnpqsla dlkgkrvgst lgsnyekqli dtgdikivty pgapeiladl
vagridaayn drlvvnyiin dqklpvrgag qigdaapvgi alkkgnsalk dqidkaltem
rsdgtfekis qkwfgqdvgq p
>B_subtilis_GlnH_homo_YCKK
mkkallalfm vvsiaalaac gagndnqskd nakdgdlwas ikkkgvltvg tegtyepfty
hdkdtdkltg ydveviteva krlglkvdfk etqwgsmfag lnskrfdvva nqvgktdred
kydfsdkytt sravvvtkkd nndikseadv kgktsaqslt snynklatna gakvegvegm
aqalqmiqqa rvdmtyndkl avlnylktsg nknvkiafet gepqstyftf rkgsgevvdq
vnkalkemke dgtlskiskk wfgedvsk
>YA80_HAEIN
mkkllfttal ltgaiafstf shageiadrv ektktllvgt egtyapftfh dksgkltgfd
vevirkvaek lglkvefket qwdamyagln akrfdvianq tnpsperlkk ysfttpynys
ggvivtkssd nsiksfedlk grksaqsats nwgkdakaag aqilvvdgla qslelikqgr
aeatindkla vldyfkqhpn sglkiaydrg dktptafafl qgedalitkf nqvlealrqd
gtlkqisiew fgyditq
>E_coli_GlnH
mksvlkvsla altlafavss haadkklvva tdtafvpfef kqgdkyvgfd vdlwaaiake
lkldyelkpm dfsgiipalq tknvdlalag ititderkka idfsdgyyks gllvmvkann
ndvksvkdld gkvvavksgt gsvdyakani ktkdlrqfpn idnaymelgt nradavlhdt
pnilyfikta gngqfkavgd sleaqqygia fpkgsdelrd kvngalktlr engtyneiyk
kwfgtepk
>HISJ_E_COLI
mkklvlslsl vlafssataa faaipqniri gtdptyapfe sknsqgelvg fdidlakelc
464 krintqctfv enpldalips lkakkidaim sslsitekrq qeiaftdkly aadsrlvvak
465 nsdiqptves lkgkrvgvlq gttqetfgne hwapkgieiv syqgqdniys dltagridaa
466 fqdevaaseg flkqpvgkdy kfggpsvkde klfgvgtgmg lrkednelre alnkafaemr
467 adgtyeklak kyfdfdvygg"""
468
469 # This alignment was created from the fasta example given above
470 aln_example = \
471 """CLUSTAL X (1.83) multiple sequence alignment
472
473
474 V_Harveyi_PATH --MKNWIKVAVAAIA--LSAA------------------TVQAATEVKVG
475 B_subtilis_YXEM MKMKKWTVLVVAALLAVLSACG------------NGNSSSKEDDNVLHVG
476 B_subtilis_GlnH_homo_YCKK MKKALLALFMVVSIAALAACGAGNDNQSKDNAKDGDLWASIKKKGVLTVG
477 YA80_HAEIN MKKLLFTTALLTGAIAFSTF-----------SHAGEIADRVEKTKTLLVG
478 FLIY_ECOLI MKLAHLGRQALMGVMAVALVAG---MSVKSFADEG-LLNKVKERGTLLVG
479 E_coli_GlnH --MKSVLKVSLAALTLAFAVS------------------SHAADKKLVVA
480 Deinococcus_radiodurans -MKKSLLSLKLSGLLVPSVLALS--------LSACSSPSSTLNQGTLKIA
481 HISJ_E_COLI MKKLVLSLSLVLAFSSATAAF-------------------AAIPQNIRIG
482 : . : :.
483
484 V_Harveyi_PATH MSGRYFPFTFVKQ--DKLQGFEVDMWDEIGKRNDYKIEYVTANFSGLFGL
485 B_subtilis_YXEM ATGQSYPFAYKEN--GKLTGFDVEVMEAVAKKIDMKLDWKLLEFSGLMGE
486 B_subtilis_GlnH_homo_YCKK TEGTYEPFTYHDKDTDKLTGYDVEVITEVAKRLGLKVDFKETQWGSMFAG
487 YA80_HAEIN TEGTYAPFTFHDK-SGKLTGFDVEVIRKVAEKLGLKVEFKETQWDAMYAG
488 FLIY_ECOLI LEGTYPPFSFQGD-DGKLTGFEVEFAQQLAKHLGVEASLKPTKWDGMLAS
489 E_coli_GlnH TDTAFVPFEFKQG--DKYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPA
490 Deinococcus_radiodurans MEGTYPPFTSKNE-QGELVGFDVDIAKAVAQKLNLKPEFVLTEWSGILAG
491 HISJ_E_COLI TDPTYAPFESKNS-QGELVGFDIDLAKELCKRINTQCTFVENPLDALIPS
492 ** .: *::::. : :. . ..:
493
494 V_Harveyi_PATH LETGRIDTISNQITMTDARKAKYLFADPYVVDG-AQITVRKGNDSIQGVE
495 B_subtilis_YXEM LQTGKLDTISNQVAVTDERKETYNFTKPYAYAG-TQIVVKKDNTDIKSVD
496 B_subtilis_GlnH_homo_YCKK LNSKRFDVVANQVG-KTDREDKYDFSDKYTTSR-AVVVTKKDNNDIKSEA
497 YA80_HAEIN LNAKRFDVIANQTNPSPERLKKYSFTTPYNYSG-GVIVTKSSDNSIKSFE
498 FLIY_ECOLI LDSKRIDVVINQVTISDERKKKYDFSTPYTISGIQALVKKGNEGTIKTAD
499 E_coli_GlnH LQTKNVDLALAGITITDERKKAIDFSDGYYKSG-LLVMVKANNNDVKSVK
500 Deinococcus_radiodurans LQANKYDVIVNQVGITPERQNSIGFSQPYAYSRPEIIVAKNNTFNPQSLA
501 HISJ_E_COLI LKAKKIDAIMSSLSITEKRQQEIAFTDKLYAADSRLVVAKNSDIQP-TVE
502 *.: . * . * *: : : .
503
504 V_Harveyi_PATH DLAGKTVAVNLGSNFEQLLRDYDKDGKINIKTYDT--GIEHDVALGRADA
505 B_subtilis_YXEM DLKGKTVAAVLGSNHAKNLESKDPDKKINIKTYETQEGTLKDVAYGRVDA
506 B_subtilis_GlnH_homo_YCKK DVKGKTSAQSLTSNYNKLATN----AGAKVEGVEGMAQALQMIQQARVDM
507 YA80_HAEIN DLKGRKSAQSATSNWGKDAKA----AGAQILVVDGLAQSLELIKQGRAEA
508 FLIY_ECOLI DLKGKKVGVGLGTNYEEWLRQNV--QGVDVRTYDDDPTKYQDLRVGRIDA
509 E_coli_GlnH DLDGKVVAVKSGTGSVDYAKAN--IKTKDLRQFPNIDNAYMELGTNRADA
510 Deinococcus_radiodurans DLKGKRVGSTLGSNYEKQLIDTG---DIKIVTYPGAPEILADLVAGRIDA
511 HISJ_E_COLI SLKGKRVGVLQGTTQETFGNEHWAPKGIEIVSYQGQDNIYSDLTAGRIDA
512 .: *: . : .: : * :
513
514 V_Harveyi_PATH FIMDRLSALE-LIKKT-GLPLQLAGEPFETI-----QNAWPFVDNEKGRK
515 B_subtilis_YXEM YVNSRTVLIA-QIKKT-GLPLKLAGDPIVYE-----QVAFPFAKDDAHDK
516 B_subtilis_GlnH_homo_YCKK TYNDKLAVLN-YLKTSGNKNVKIAFETGEPQ-----STYFTFRKGS--GE
517 YA80_HAEIN TINDKLAVLD-YFKQHPNSGLKIAYDRGDKT-----PTAFAFLQGE--DA
518 FLIY_ECOLI ILVDRLAALD-LVKKT-NDTLAVTGEAFSRQ-----ESGVALRKGN--ED
519 E_coli_GlnH VLHDTPNILY-FIKTAGNGQFKAVGDSLEAQ-----QYGIAFPKGS--DE
520 Deinococcus_radiodurans AYNDRLVVNY-IINDQ-KLPVRGAGQIGDAA-----PVGIALKKGN--SA
521 HISJ_E_COLI AFQDEVAASEGFLKQPVGKDYKFGGPSVKDEKLFGVGTGMGLRKED--NE
522 . .: : . .
523
524 V_Harveyi_PATH LQAEVNKALAEMRADGTVEKISVKWFGADITK----
525 B_subtilis_YXEM LRKKVNKALDELRKDGTLKKLSEKYFNEDITVEQKH
526 B_subtilis_GlnH_homo_YCKK VVDQVNKALKEMKEDGTLSKISKKWFGEDVSK----
527 YA80_HAEIN LITKFNQVLEALRQDGTLKQISIEWFGYDITQ----
528 FLIY_ECOLI LLKAVNDAIAEMQKDGTLQALSEKWFGADVTK----
529 E_coli_GlnH LRDKVNGALKTLRENGTYNEIYKKWFGTEPK-----
530 Deinococcus_radiodurans LKDQIDKALTEMRSDGTFEKISQKWFGQDVGQP---
531 HISJ_E_COLI LREALNKAFAEMRADGTYEKLAKKYFDFDVYGG---
532 : .: .: :: :** . : ::*. :
533 """
534
535 # This is the clustal example (above) but output in phylip format,
536 # with truncated names. Note there is an ambiguity here: two
537 # different sequences both called "B_subtilis", originally
538 # "B_subtilis_YXEM" and "B_subtilis_GlnH_homo_YCKK"
539 phy_example = \
540 """ 8 286
541 V_Harveyi_ --MKNWIKVA VAAIA--LSA A--------- ---------T VQAATEVKVG
542 B_subtilis MKMKKWTVLV VAALLAVLSA CG-------- ----NGNSSS KEDDNVLHVG
543 B_subtilis MKKALLALFM VVSIAALAAC GAGNDNQSKD NAKDGDLWAS IKKKGVLTVG
544 YA80_HAEIN MKKLLFTTAL LTGAIAFSTF ---------- -SHAGEIADR VEKTKTLLVG
545 FLIY_ECOLI MKLAHLGRQA LMGVMAVALV AG---MSVKS FADEG-LLNK VKERGTLLVG
546 E_coli_Gln --MKSVLKVS LAALTLAFAV S--------- ---------S HAADKKLVVA
547 Deinococcu -MKKSLLSLK LSGLLVPSVL ALS------- -LSACSSPSS TLNQGTLKIA
548 HISJ_E_COL MKKLVLSLSL VLAFSSATAA F--------- ---------- AAIPQNIRIG
549
550 MSGRYFPFTF VKQ--DKLQG FEVDMWDEIG KRNDYKIEYV TANFSGLFGL
551 ATGQSYPFAY KEN--GKLTG FDVEVMEAVA KKIDMKLDWK LLEFSGLMGE
552 TEGTYEPFTY HDKDTDKLTG YDVEVITEVA KRLGLKVDFK ETQWGSMFAG
553 TEGTYAPFTF HDK-SGKLTG FDVEVIRKVA EKLGLKVEFK ETQWDAMYAG
554 LEGTYPPFSF QGD-DGKLTG FEVEFAQQLA KHLGVEASLK PTKWDGMLAS
555 TDTAFVPFEF KQG--DKYVG FDVDLWAAIA KELKLDYELK PMDFSGIIPA
556 MEGTYPPFTS KNE-QGELVG FDVDIAKAVA QKLNLKPEFV LTEWSGILAG
557 TDPTYAPFES KNS-QGELVG FDIDLAKELC KRINTQCTFV ENPLDALIPS
558
559 LETGRIDTIS NQITMTDARK AKYLFADPYV VDG-AQITVR KGNDSIQGVE
560 LQTGKLDTIS NQVAVTDERK ETYNFTKPYA YAG-TQIVVK KDNTDIKSVD
561 LNSKRFDVVA NQVG-KTDRE DKYDFSDKYT TSR-AVVVTK KDNNDIKSEA
562 LNAKRFDVIA NQTNPSPERL KKYSFTTPYN YSG-GVIVTK SSDNSIKSFE
563 LDSKRIDVVI NQVTISDERK KKYDFSTPYT ISGIQALVKK GNEGTIKTAD
564 LQTKNVDLAL AGITITDERK KAIDFSDGYY KSG-LLVMVK ANNNDVKSVK
565 LQANKYDVIV NQVGITPERQ NSIGFSQPYA YSRPEIIVAK NNTFNPQSLA
566 LKAKKIDAIM SSLSITEKRQ QEIAFTDKLY AADSRLVVAK NSDIQP-TVE
567
568 DLAGKTVAVN LGSNFEQLLR DYDKDGKINI KTYDT--GIE HDVALGRADA
569 DLKGKTVAAV LGSNHAKNLE SKDPDKKINI KTYETQEGTL KDVAYGRVDA
570 DVKGKTSAQS LTSNYNKLAT N----AGAKV EGVEGMAQAL QMIQQARVDM
571 DLKGRKSAQS ATSNWGKDAK A----AGAQI LVVDGLAQSL ELIKQGRAEA
572 DLKGKKVGVG LGTNYEEWLR QNV--QGVDV RTYDDDPTKY QDLRVGRIDA
573 DLDGKVVAVK SGTGSVDYAK AN--IKTKDL RQFPNIDNAY MELGTNRADA
574 DLKGKRVGST LGSNYEKQLI DTG---DIKI VTYPGAPEIL ADLVAGRIDA
575 SLKGKRVGVL QGTTQETFGN EHWAPKGIEI VSYQGQDNIY SDLTAGRIDA
576
577 FIMDRLSALE -LIKKT-GLP LQLAGEPFET I-----QNAW PFVDNEKGRK
578 YVNSRTVLIA -QIKKT-GLP LKLAGDPIVY E-----QVAF PFAKDDAHDK
579 TYNDKLAVLN -YLKTSGNKN VKIAFETGEP Q-----STYF TFRKGS--GE
580 TINDKLAVLD -YFKQHPNSG LKIAYDRGDK T-----PTAF AFLQGE--DA
581 ILVDRLAALD -LVKKT-NDT LAVTGEAFSR Q-----ESGV ALRKGN--ED
582 VLHDTPNILY -FIKTAGNGQ FKAVGDSLEA Q-----QYGI AFPKGS--DE
583 AYNDRLVVNY -IINDQ-KLP VRGAGQIGDA A-----PVGI ALKKGN--SA
584 AFQDEVAASE GFLKQPVGKD YKFGGPSVKD EKLFGVGTGM GLRKED--NE
585
586 LQAEVNKALA EMRADGTVEK ISVKWFGADI TK----
587 LRKKVNKALD ELRKDGTLKK LSEKYFNEDI TVEQKH
588 VVDQVNKALK EMKEDGTLSK ISKKWFGEDV SK----
589 LITKFNQVLE ALRQDGTLKQ ISIEWFGYDI TQ----
590 LLKAVNDAIA EMQKDGTLQA LSEKWFGADV TK----
591 LRDKVNGALK TLRENGTYNE IYKKWFGTEP K-----
592 LKDQIDKALT EMRSDGTFEK ISQKWFGQDV GQP---
593 LREALNKAFA EMRADGTYEK LAKKYFDFDV YGG---
594 """
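# Illustrative sketch only (not part of the Bio.SeqIO API): the _demo_* names
# below are hypothetical and exist purely to show why truncating identifiers
# to ten characters, as in the phylip example above, can leave two different
# records sharing one name.
_demo_full_names = ["V_Harveyi_PATH", "B_subtilis_YXEM",
                    "B_subtilis_GlnH_homo_YCKK", "YA80_HAEIN"]
# Keep only the first ten characters of each name
_demo_short_names = [name[:10] for name in _demo_full_names]
# Group the original names by their truncated form
_demo_clashes = {}
for _demo_full, _demo_short in zip(_demo_full_names, _demo_short_names):
    _demo_clashes.setdefault(_demo_short, []).append(_demo_full)
# Only truncated names shared by more than one record are ambiguous
_demo_ambiguous = dict((short, fulls) for short, fulls in _demo_clashes.items()
                       if len(fulls) > 1)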
595 # This is the clustal example (above) but output in NEXUS format.
596 nxs_example = \
597 """#NEXUS
598 BEGIN DATA;
599 dimensions ntax=8 nchar=286;
600 format missing=?
601 symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"
602 interleave datatype=PROTEIN gap= -;
603
604 matrix
605 V_Harveyi_PATH --MKNWIKVAVAAIA--LSAA------------------TVQAATEVKVG
606 B_subtilis_YXEM MKMKKWTVLVVAALLAVLSACG------------NGNSSSKEDDNVLHVG
607 B_subtilis_GlnH_homo_YCKK MKKALLALFMVVSIAALAACGAGNDNQSKDNAKDGDLWASIKKKGVLTVG
608 YA80_HAEIN MKKLLFTTALLTGAIAFSTF-----------SHAGEIADRVEKTKTLLVG
609 FLIY_ECOLI MKLAHLGRQALMGVMAVALVAG---MSVKSFADEG-LLNKVKERGTLLVG
610 E_coli_GlnH --MKSVLKVSLAALTLAFAVS------------------SHAADKKLVVA
611 Deinococcus_radiodurans -MKKSLLSLKLSGLLVPSVLALS--------LSACSSPSSTLNQGTLKIA
612 HISJ_E_COLI MKKLVLSLSLVLAFSSATAAF-------------------AAIPQNIRIG
613
614 V_Harveyi_PATH MSGRYFPFTFVKQ--DKLQGFEVDMWDEIGKRNDYKIEYVTANFSGLFGL
615 B_subtilis_YXEM ATGQSYPFAYKEN--GKLTGFDVEVMEAVAKKIDMKLDWKLLEFSGLMGE
616 B_subtilis_GlnH_homo_YCKK TEGTYEPFTYHDKDTDKLTGYDVEVITEVAKRLGLKVDFKETQWGSMFAG
617 YA80_HAEIN TEGTYAPFTFHDK-SGKLTGFDVEVIRKVAEKLGLKVEFKETQWDAMYAG
618 FLIY_ECOLI LEGTYPPFSFQGD-DGKLTGFEVEFAQQLAKHLGVEASLKPTKWDGMLAS
619 E_coli_GlnH TDTAFVPFEFKQG--DKYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPA
620 Deinococcus_radiodurans MEGTYPPFTSKNE-QGELVGFDVDIAKAVAQKLNLKPEFVLTEWSGILAG
621 HISJ_E_COLI TDPTYAPFESKNS-QGELVGFDIDLAKELCKRINTQCTFVENPLDALIPS
622
623 V_Harveyi_PATH LETGRIDTISNQITMTDARKAKYLFADPYVVDG-AQITVRKGNDSIQGVE
624 B_subtilis_YXEM LQTGKLDTISNQVAVTDERKETYNFTKPYAYAG-TQIVVKKDNTDIKSVD
625 B_subtilis_GlnH_homo_YCKK LNSKRFDVVANQVG-KTDREDKYDFSDKYTTSR-AVVVTKKDNNDIKSEA
626 YA80_HAEIN LNAKRFDVIANQTNPSPERLKKYSFTTPYNYSG-GVIVTKSSDNSIKSFE
627 FLIY_ECOLI LDSKRIDVVINQVTISDERKKKYDFSTPYTISGIQALVKKGNEGTIKTAD
628 E_coli_GlnH LQTKNVDLALAGITITDERKKAIDFSDGYYKSG-LLVMVKANNNDVKSVK
629 Deinococcus_radiodurans LQANKYDVIVNQVGITPERQNSIGFSQPYAYSRPEIIVAKNNTFNPQSLA
630 HISJ_E_COLI LKAKKIDAIMSSLSITEKRQQEIAFTDKLYAADSRLVVAKNSDIQP-TVE
631
632 V_Harveyi_PATH DLAGKTVAVNLGSNFEQLLRDYDKDGKINIKTYDT--GIEHDVALGRADA
633 B_subtilis_YXEM DLKGKTVAAVLGSNHAKNLESKDPDKKINIKTYETQEGTLKDVAYGRVDA
634 B_subtilis_GlnH_homo_YCKK DVKGKTSAQSLTSNYNKLATN----AGAKVEGVEGMAQALQMIQQARVDM
635 YA80_HAEIN DLKGRKSAQSATSNWGKDAKA----AGAQILVVDGLAQSLELIKQGRAEA
636 FLIY_ECOLI DLKGKKVGVGLGTNYEEWLRQNV--QGVDVRTYDDDPTKYQDLRVGRIDA
637 E_coli_GlnH DLDGKVVAVKSGTGSVDYAKAN--IKTKDLRQFPNIDNAYMELGTNRADA
638 Deinococcus_radiodurans DLKGKRVGSTLGSNYEKQLIDTG---DIKIVTYPGAPEILADLVAGRIDA
639 HISJ_E_COLI SLKGKRVGVLQGTTQETFGNEHWAPKGIEIVSYQGQDNIYSDLTAGRIDA
640
641 V_Harveyi_PATH FIMDRLSALE-LIKKT-GLPLQLAGEPFETI-----QNAWPFVDNEKGRK
642 B_subtilis_YXEM YVNSRTVLIA-QIKKT-GLPLKLAGDPIVYE-----QVAFPFAKDDAHDK
643 B_subtilis_GlnH_homo_YCKK TYNDKLAVLN-YLKTSGNKNVKIAFETGEPQ-----STYFTFRKGS--GE
644 YA80_HAEIN TINDKLAVLD-YFKQHPNSGLKIAYDRGDKT-----PTAFAFLQGE--DA
645 FLIY_ECOLI ILVDRLAALD-LVKKT-NDTLAVTGEAFSRQ-----ESGVALRKGN--ED
646 E_coli_GlnH VLHDTPNILY-FIKTAGNGQFKAVGDSLEAQ-----QYGIAFPKGS--DE
647 Deinococcus_radiodurans AYNDRLVVNY-IINDQ-KLPVRGAGQIGDAA-----PVGIALKKGN--SA
648 HISJ_E_COLI AFQDEVAASEGFLKQPVGKDYKFGGPSVKDEKLFGVGTGMGLRKED--NE
649
650 V_Harveyi_PATH LQAEVNKALAEMRADGTVEKISVKWFGADITK----
651 B_subtilis_YXEM LRKKVNKALDELRKDGTLKKLSEKYFNEDITVEQKH
652 B_subtilis_GlnH_homo_YCKK VVDQVNKALKEMKEDGTLSKISKKWFGEDVSK----
653 YA80_HAEIN LITKFNQVLEALRQDGTLKQISIEWFGYDITQ----
654 FLIY_ECOLI LLKAVNDAIAEMQKDGTLQALSEKWFGADVTK----
655 E_coli_GlnH LRDKVNGALKTLRENGTYNEIYKKWFGTEPK-----
656 Deinococcus_radiodurans LKDQIDKALTEMRSDGTFEKISQKWFGQDVGQP---
657 HISJ_E_COLI LREALNKAFAEMRADGTYEKLAKKYFDFDVYGG---
658 ;
659 end;
660 """
661
662 # This example uses DNA, from here:
663 # http://www.molecularevolution.org/resources/fileformats/
664 nxs_example2 = \
665 """#NEXUS
666
667 Begin data;
668 Dimensions ntax=10 nchar=705;
669 Format datatype=dna interleave=yes gap=- missing=?;
670 Matrix
671 Cow ATGGCATATCCCATACAACTAGGATTCCAAGATGCAACATCACCAATCATAGAAGAACTA
672 Carp ATGGCACACCCAACGCAACTAGGTTTCAAGGACGCGGCCATACCCGTTATAGAGGAACTT
673 Chicken ATGGCCAACCACTCCCAACTAGGCTTTCAAGACGCCTCATCCCCCATCATAGAAGAGCTC
674 Human ATGGCACATGCAGCGCAAGTAGGTCTACAAGACGCTACTTCCCCTATCATAGAAGAGCTT
675 Loach ATGGCACATCCCACACAATTAGGATTCCAAGACGCGGCCTCACCCGTAATAGAAGAACTT
676 Mouse ATGGCCTACCCATTCCAACTTGGTCTACAAGACGCCACATCCCCTATTATAGAAGAGCTA
677 Rat ATGGCTTACCCATTTCAACTTGGCTTACAAGACGCTACATCACCTATCATAGAAGAACTT
678 Seal ATGGCATACCCCCTACAAATAGGCCTACAAGATGCAACCTCTCCCATTATAGAGGAGTTA
679 Whale ATGGCATATCCATTCCAACTAGGTTTCCAAGATGCAGCATCACCCATCATAGAAGAGCTC
680 Frog ATGGCACACCCATCACAATTAGGTTTTCAAGACGCAGCCTCTCCAATTATAGAAGAATTA
681
682 Cow CTTCACTTTCATGACCACACGCTAATAATTGTCTTCTTAATTAGCTCATTAGTACTTTAC
683 Carp CTTCACTTCCACGACCACGCATTAATAATTGTGCTCCTAATTAGCACTTTAGTTTTATAT
684 Chicken GTTGAATTCCACGACCACGCCCTGATAGTCGCACTAGCAATTTGCAGCTTAGTACTCTAC
685 Human ATCACCTTTCATGATCACGCCCTCATAATCATTTTCCTTATCTGCTTCCTAGTCCTGTAT
686 Loach CTTCACTTCCATGACCATGCCCTAATAATTGTATTTTTGATTAGCGCCCTAGTACTTTAT
687 Mouse ATAAATTTCCATGATCACACACTAATAATTGTTTTCCTAATTAGCTCCTTAGTCCTCTAT
688 Rat ACAAACTTTCATGACCACACCCTAATAATTGTATTCCTCATCAGCTCCCTAGTACTTTAT
689 Seal CTACACTTCCATGACCACACATTAATAATTGTGTTCCTAATTAGCTCATTAGTACTCTAC
690 Whale CTACACTTTCACGATCATACACTAATAATCGTTTTTCTAATTAGCTCTTTAGTTCTCTAC
691 Frog CTTCACTTCCACGACCATACCCTCATAGCCGTTTTTCTTATTAGTACGCTAGTTCTTTAC
692
693 Cow ATTATTTCACTAATACTAACGACAAAGCTGACCCATACAAGCACGATAGATGCACAAGAA
694 Carp ATTATTACTGCAATGGTATCAACTAAACTTACTAATAAATATATTCTAGACTCCCAAGAA
695 Chicken CTTCTAACTCTTATACTTATAGAAAAACTATCA---TCAAACACCGTAGATGCCCAAGAA
696 Human GCCCTTTTCCTAACACTCACAACAAAACTAACTAATACTAACATCTCAGACGCTCAGGAA
697 Loach GTTATTATTACAACCGTCTCAACAAAACTCACTAACATATATATTTTGGACTCACAAGAA
698 Mouse ATCATCTCGCTAATATTAACAACAAAACTAACACATACAAGCACAATAGATGCACAAGAA
699 Rat ATTATTTCACTAATACTAACAACAAAACTAACACACACAAGCACAATAGACGCCCAAGAA
700 Seal ATTATCTCACTTATACTAACCACGAAACTCACCCACACAAGTACAATAGACGCACAAGAA
701 Whale ATTATTACCCTAATGCTTACAACCAAATTAACACATACTAGTACAATAGACGCCCAAGAA
702 Frog ATTATTACTATTATAATAACTACTAAACTAACTAATACAAACCTAATGGACGCACAAGAG
703
704 Cow GTAGAGACAATCTGAACCATTCTGCCCGCCATCATCTTAATTCTAATTGCTCTTCCTTCT
705 Carp ATCGAAATCGTATGAACCATTCTACCAGCCGTCATTTTAGTACTAATCGCCCTGCCCTCC
706 Chicken GTTGAACTAATCTGAACCATCCTACCCGCTATTGTCCTAGTCCTGCTTGCCCTCCCCTCC
707 Human ATAGAAACCGTCTGAACTATCCTGCCCGCCATCATCCTAGTCCTCATCGCCCTCCCATCC
708 Loach ATTGAAATCGTATGAACTGTGCTCCCTGCCCTAATCCTCATTTTAATCGCCCTCCCCTCA
709 Mouse GTTGAAACCATTTGAACTATTCTACCAGCTGTAATCCTTATCATAATTGCTCTCCCCTCT
710 Rat GTAGAAACAATTTGAACAATTCTCCCAGCTGTCATTCTTATTCTAATTGCCCTTCCCTCC
711 Seal GTGGAAACGGTGTGAACGATCCTACCCGCTATCATTTTAATTCTCATTGCCCTACCATCA
712 Whale GTAGAAACTGTCTGAACTATCCTCCCAGCCATTATCTTAATTTTAATTGCCTTGCCTTCA
713 Frog ATCGAAATAGTGTGAACTATTATACCAGCTATTAGCCTCATCATAATTGCCCTTCCATCC
714
715 Cow TTACGAATTCTATACATAATAGATGAAATCAATAACCCATCTCTTACAGTAAAAACCATA
716 Carp CTACGCATCCTGTACCTTATAGACGAAATTAACGACCCTCACCTGACAATTAAAGCAATA
717 Chicken CTCCAAATCCTCTACATAATAGACGAAATCGACGAACCTGATCTCACCCTAAAAGCCATC
718 Human CTACGCATCCTTTACATAACAGACGAGGTCAACGATCCCTCCCTTACCATCAAATCAATT
719 Loach CTACGAATTCTATATCTTATAGACGAGATTAATGACCCCCACCTAACAATTAAGGCCATG
720 Mouse CTACGCATTCTATATATAATAGACGAAATCAACAACCCCGTATTAACCGTTAAAACCATA
721 Rat CTACGAATTCTATACATAATAGACGAGATTAATAACCCAGTTCTAACAGTAAAAACTATA
722 Seal TTACGAATCCTCTACATAATGGACGAGATCAATAACCCTTCCTTGACCGTAAAAACTATA
723 Whale TTACGGATCCTTTACATAATAGACGAAGTCAATAACCCCTCCCTCACTGTAAAAACAATA
724 Frog CTTCGTATCCTATATTTAATAGATGAAGTTAATGATCCACACTTAACAATTAAAGCAATC
725
726 Cow GGACATCAGTGATACTGAAGCTATGAGTATACAGATTATGAGGACTTAAGCTTCGACTCC
727 Carp GGACACCAATGATACTGAAGTTACGAGTATACAGACTATGAAAATCTAGGATTCGACTCC
728 Chicken GGACACCAATGATACTGAACCTATGAATACACAGACTTCAAGGACCTCTCATTTGACTCC
729 Human GGCCACCAATGGTACTGAACCTACGAGTACACCGACTACGGCGGACTAATCTTCAACTCC
730 Loach GGGCACCAATGATACTGAAGCTACGAGTATACTGATTATGAAAACTTAAGTTTTGACTCC
731 Mouse GGGCACCAATGATACTGAAGCTACGAATATACTGACTATGAAGACCTATGCTTTGATTCA
732 Rat GGACACCAATGATACTGAAGCTATGAATATACTGACTATGAAGACCTATGCTTTGACTCC
733 Seal GGACATCAGTGATACTGAAGCTATGAGTACACAGACTACGAAGACCTGAACTTTGACTCA
734 Whale GGTCACCAATGATATTGAAGCTATGAGTATACCGACTACGAAGACCTAAGCTTCGACTCC
735 Frog GGCCACCAATGATACTGAAGCTACGAATATACTAACTATGAGGATCTCTCATTTGACTCT
736
737 Cow TACATAATTCCAACATCAGAATTAAAGCCAGGGGAGCTACGACTATTAGAAGTCGATAAT
738 Carp TATATAGTACCAACCCAAGACCTTGCCCCCGGACAATTCCGACTTCTGGAAACAGACCAC
739 Chicken TACATAACCCCAACAACAGACCTCCCCCTAGGCCACTTCCGCCTACTAGAAGTCGACCAT
740 Human TACATACTTCCCCCATTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAAT
741 Loach TACATAATCCCCACCCAGGACCTAACCCCTGGACAATTCCGGCTACTAGAGACAGACCAC
742 Mouse TATATAATCCCAACAAACGACCTAAAACCTGGTGAACTACGACTGCTAGAAGTTGATAAC
743 Rat TACATAATCCCAACCAATGACCTAAAACCAGGTGAACTTCGTCTATTAGAAGTTGATAAT
744 Seal TATATGATCCCCACACAAGAACTAAAGCCCGGAGAACTACGACTGCTAGAAGTAGACAAT
745 Whale TATATAATCCCAACATCAGACCTAAAGCCAGGAGAACTACGATTATTAGAAGTAGATAAC
746 Frog TATATAATTCCAACTAATGACCTTACCCCTGGACAATTCCGGCTGCTAGAAGTTGATAAT
747
748 Cow CGAGTTGTACTACCAATAGAAATAACAATCCGAATGTTAGTCTCCTCTGAAGACGTATTA
749 Carp CGAATAGTTGTTCCAATAGAATCCCCAGTCCGTGTCCTAGTATCTGCTGAAGACGTGCTA
750 Chicken CGCATTGTAATCCCCATAGAATCCCCCATTCGAGTAATCATCACCGCTGATGACGTCCTC
751 Human CGAGTAGTACTCCCGATTGAAGCCCCCATTCGTATAATAATTACATCACAAGACGTCTTG
752 Loach CGAATGGTTGTTCCCATAGAATCCCCTATTCGCATTCTTGTTTCCGCCGAAGATGTACTA
753 Mouse CGAGTCGTTCTGCCAATAGAACTTCCAATCCGTATATTAATTTCATCTGAAGACGTCCTC
754 Rat CGGGTAGTCTTACCAATAGAACTTCCAATTCGTATACTAATCTCATCCGAAGACGTCCTG
755 Seal CGAGTAGTCCTCCCAATAGAAATAACAATCCGCATACTAATCTCATCAGAAGATGTACTC
756 Whale CGAGTTGTCTTACCTATAGAAATAACAATCCGAATATTAGTCTCATCAGAAGACGTACTC
757 Frog CGAATAGTAGTCCCAATAGAATCTCCAACCCGACTTTTAGTTACAGCCGAAGACGTCCTC
758
759 Cow CACTCATGAGCTGTGCCCTCTCTAGGACTAAAAACAGACGCAATCCCAGGCCGTCTAAAC
760 Carp CATTCTTGAGCTGTTCCATCCCTTGGCGTAAAAATGGACGCAGTCCCAGGACGACTAAAT
761 Chicken CACTCATGAGCCGTACCCGCCCTCGGGGTAAAAACAGACGCAATCCCTGGACGACTAAAT
762 Human CACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGTCTAAAC
763 Loach CACTCCTGGGCCCTTCCAGCCATGGGGGTAAAGATAGACGCGGTCCCAGGACGCCTTAAC
764 Mouse CACTCATGAGCAGTCCCCTCCCTAGGACTTAAAACTGATGCCATCCCAGGCCGACTAAAT
765 Rat CACTCATGAGCCATCCCTTCACTAGGGTTAAAAACCGACGCAATCCCCGGCCGCCTAAAC
766 Seal CACTCATGAGCCGTACCGTCCCTAGGACTAAAAACTGATGCTATCCCAGGACGACTAAAC
767 Whale CACTCATGGGCCGTACCCTCCTTGGGCCTAAAAACAGATGCAATCCCAGGACGCCTAAAC
768 Frog CACTCGTGAGCTGTACCCTCCTTGGGTGTCAAAACAGATGCAATCCCAGGACGACTTCAT
769
770 Cow CAAACAACCCTTATATCGTCCCGTCCAGGCTTATATTACGGTCAATGCTCAGAAATTTGC
771 Carp CAAGCCGCCTTTATTGCCTCACGCCCAGGGGTCTTTTACGGACAATGCTCTGAAATTTGT
772 Chicken CAAACCTCCTTCATCACCACTCGACCAGGAGTGTTTTACGGACAATGCTCAGAAATCTGC
773 Human CAAACCACTTTCACCGCTACACGACCGGGGGTATACTACGGTCAATGCTCTGAAATCTGT
774 Loach CAAACCGCCTTTATTGCCTCCCGCCCCGGGGTATTCTATGGGCAATGCTCAGAAATCTGT
775 Mouse CAAGCAACAGTAACATCAAACCGACCAGGGTTATTCTATGGCCAATGCTCTGAAATTTGT
776 Rat CAAGCTACAGTCACATCAAACCGACCAGGTCTATTCTATGGCCAATGCTCTGAAATTTGC
777 Seal CAAACAACCCTAATAACCATACGACCAGGACTGTACTACGGTCAATGCTCAGAAATCTGT
778 Whale CAAACAACCTTAATATCAACACGACCAGGCCTATTTTATGGACAATGCTCAGAGATCTGC
779 Frog CAAACATCATTTATTGCTACTCGTCCGGGAGTATTTTACGGACAATGTTCAGAAATTTGC
780
781 Cow GGGTCAAACCACAGTTTCATACCCATTGTCCTTGAGTTAGTCCCACTAAAGTACTTTGAA
782 Carp GGAGCTAATCACAGCTTTATACCAATTGTAGTTGAAGCAGTACCTCTCGAACACTTCGAA
783 Chicken GGAGCTAACCACAGCTACATACCCATTGTAGTAGAGTCTACCCCCCTAAAACACTTTGAA
784 Human GGAGCAAACCACAGTTTCATGCCCATCGTCCTAGAATTAATTCCCCTAAAAATCTTTGAA
785 Loach GGAGCAAACCACAGCTTTATACCCATCGTAGTAGAAGCGGTCCCACTATCTCACTTCGAA
786 Mouse GGATCTAACCATAGCTTTATGCCCATTGTCCTAGAAATGGTTCCACTAAAATATTTCGAA
787 Rat GGCTCAAATCACAGCTTCATACCCATTGTACTAGAAATAGTGCCTCTAAAATATTTCGAA
788 Seal GGTTCAAACCACAGCTTCATACCTATTGTCCTCGAATTGGTCCCACTATCCCACTTCGAG
789 Whale GGCTCAAACCACAGTTTCATACCAATTGTCCTAGAACTAGTACCCCTAGAAGTCTTTGAA
790 Frog GGAGCAAACCACAGCTTTATACCAATTGTAGTTGAAGCAGTACCGCTAACCGACTTTGAA
791
792 Cow AAATGATCTGCGTCAATATTA---------------------TAA
793 Carp AACTGATCCTCATTAATACTAGAAGACGCCTCGCTAGGAAGCTAA
794 Chicken GCCTGATCCTCACTA------------------CTGTCATCTTAA
795 Human ATA---------------------GGGCCCGTATTTACCCTATAG
796 Loach AACTGGTCCACCCTTATACTAAAAGACGCCTCACTAGGAAGCTAA
797 Mouse AACTGATCTGCTTCAATAATT---------------------TAA
798 Rat AACTGATCAGCTTCTATAATT---------------------TAA
799 Seal AAATGATCTACCTCAATGCTT---------------------TAA
800 Whale AAATGATCTGTATCAATACTA---------------------TAA
801 Frog AACTGATCTTCATCAATACTA---GAAGCATCACTA------AGA
802 ;
803 End;
804 """
805
806 # This example uses amino acids, from here:
807 # http://www.molecularevolution.org/resources/fileformats/
808 nxs_example3 = \
809 """#NEXUS
810
811 Begin data;
812 Dimensions ntax=10 nchar=234;
813 Format datatype=protein gap=- interleave;
814 Matrix
815 Cow MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
816 Carp MAHPTQLGFKDAAMPVMEELLHFHDHALMIVLLISTLVLYIITAMVSTKLTNKYILDSQE
817 Chicken MANHSQLGFQDASSPIMEELVEFHDHALMVALAICSLVLYLLTLMLMEKLS-SNTVDAQE
818 Human MAHAAQVGLQDATSPIMEELITFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQE
819 Loach MAHPTQLGFQDAASPVMEELLHFHDHALMIVFLISALVLYVIITTVSTKLTNMYILDSQE
820 Mouse MAYPFQLGLQDATSPIMEELMNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
821 Rat MAYPFQLGLQDATSPIMEELTNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
822 Seal MAYPLQMGLQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
823 Whale MAYPFQLGFQDAASPIMEELLHFHDHTLMIVFLISSLVLYIITLMLTTKLTHTSTMDAQE
824 Frog MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQE
825
826 Cow VETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDS
827 Carp IEIVWTILPAVILVLIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLGFDS
828 Chicken VELIWTILPAIVLVLLALPSLQILYMMDEIDEPDLTLKAIGHQWYWTYEYTDFKDLSFDS
829 Human METVWTILPAIILVLIALPSLRILYMTDEVNDPSLTIKSIGHQWYWTYEYTDYGGLIFNS
830 Loach IEIVWTVLPALILILIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLSFDS
831 Mouse VETIWTILPAVILIMIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDS
832 Rat VETIWTILPAVILILIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDS
833 Seal VETVWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLNFDS
834 Whale VETVWTILPAIILILIALPSLRILYMMDEVNNPSLTVKTMGHQWYWSYEYTDYEDLSFDS
835 Frog IEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDS
836
837 Cow YMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLN
838 Carp YMVPTQDLAPGQFRLLETDHRMVVPMESPVRVLVSAEDVLHSWAVPSLGVKMDAVPGRLN
839 Chicken YMTPTTDLPLGHFRLLEVDHRIVIPMESPIRVIITADDVLHSWAVPALGVKTDAIPGRLN
840 Human YMLPPLFLEPGDLRLLDVDNRVVLPIEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLN
841 Loach YMIPTQDLTPGQFRLLETDHRMVVPMESPIRILVSAEDVLHSWALPAMGVKMDAVPGRLN
842 Mouse YMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAVPSLGLKTDAIPGRLN
843 Rat YMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAIPSLGLKTDAIPGRLN
844 Seal YMIPTQELKPGELRLLEVDNRVVLPMEMTIRMLISSEDVLHSWAVPSLGLKTDAIPGRLN
845 Whale YMIPTSDLKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLN
846 Frog YMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLH
847
848 Cow QTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML-------
849 Carp QAAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLEHFENWSSLMLEDASLGS
850 Chicken QTSFITTRPGVFYGQCSEICGANHSYMPIVVESTPLKHFEAWSSL------LSS
851 Human QTTFTATRPGVYYGQCSEICGANHSFMPIVLELIPLKIFEM-------GPVFTL
852 Loach QTAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLSHFENWSTLMLKDASLGS
853 Mouse QATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI-------
854 Rat QATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI-------
855 Seal QTTLMTMRPGLYYGQCSEICGSNHSFMPIVLELVPLSHFEKWSTSML-------
856 Whale QTTLMSTRPGLFYGQCSEICGSNHSFMPIVLELVPLEVFEKWSVSML-------
857 Frog QTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSML-EASL--
858 ;
859 End;
860 """
861
862 # This example with its slightly odd (partial) annotation is from here:
863 # http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html
864 sth_example = \
865 """# STOCKHOLM 1.0
866 #=GF ID CBS
867 #=GF AC PF00571
868 #=GF DE CBS domain
869 #=GF AU Bateman A
870 #=GF CC CBS domains are small intracellular modules mostly found
871 #=GF CC in 2 or four copies within a protein.
872 #=GF SQ 67
873 #=GS O31698/18-71 AC O31698
874 #=GS O83071/192-246 AC O83071
875 #=GS O83071/259-312 AC O83071
876 #=GS O31698/88-139 AC O31698
877 #=GS O31698/88-139 OS Bacillus subtilis
878 O83071/192-246 MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
879 #=GR O83071/192-246 SA 999887756453524252..55152525....36463774777
880 O83071/259-312 MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY
881 #=GR O83071/259-312 SS CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE
882 O31698/18-71 MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAIPVLDPS
883 #=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
884 O31698/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
885 #=GR O31698/88-139 SS CCCCCCCHHHHHHHHHHH..HEEEEEEE....EEEEEEEEEEH
886 #=GC SS_cons CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
887 O31699/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
888 #=GR O31699/88-139 AS ________________*__________________________
889 #=GR_O31699/88-139_IN ____________1______________2__________0____
890 //
891 """
892
893 # Interlaced example from BioPerl documentation. Also note the blank line.
894 # http://www.bioperl.org/wiki/Stockholm_multiple_alignment_format
895 sth_example2 = \
896 """# STOCKHOLM 1.0
897 #=GC SS_cons .................<<<<<<<<...<<<<<<<........>>>>>>>..
898 AP001509.1 UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGU
899 #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..--
900 AE007476.1 AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU
901 #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>----
902
903 #=GC SS_cons ......<<<<<<<.......>>>>>>>..>>>>>>>>...............
904 AP001509.1 CUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU
905 #=GR AP001509.1 SS -------<<<<<--------->>>>>--->>>>>>>>---------------
906 AE007476.1 UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU
907 #=GR AE007476.1 SS ------.<<<<<--------->>>>>.-->>>>>>>>---------------
908 //"""
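# Illustrative sketch only (this is not the Bio.SeqIO Stockholm parser): a
# hypothetical helper showing one way the blocks of an interleaved Stockholm
# file, like the example above, could be joined per identifier. It assumes
# every sequence line is "name<whitespace>fragment" and skips markup lines.
def _demo_join_interleaved(text):
    order = []    # identifiers in order of first appearance
    chunks = {}   # identifier -> list of sequence fragments
    for line in text.split("\n"):
        line = line.strip()
        if not line or line.startswith("#") or line == "//":
            continue  # blank separator, annotation, or end-of-alignment marker
        name, fragment = line.split(None, 1)
        if name not in chunks:
            order.append(name)
            chunks[name] = []
        chunks[name].append(fragment.strip())
    return [(name, "".join(chunks[name])) for name in order]

# Tiny made-up input (not taken from any real database entry)
_demo_text = """# STOCKHOLM 1.0
seq1 ACGU-
seq2 AC-UU

seq1 GGCA
seq2 GG-A
//"""
_demo_joined = _demo_join_interleaved(_demo_text)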
909
910 # Sample GenBank record from here:
911 # http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
912 gbk_example = \
913 """LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
914 DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
915 (AXL2) and Rev7p (REV7) genes, complete cds.
916 ACCESSION U49845
917 VERSION U49845.1 GI:1293613
918 KEYWORDS .
919 SOURCE Saccharomyces cerevisiae (baker's yeast)
920 ORGANISM Saccharomyces cerevisiae
921 Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
922 Saccharomycetales; Saccharomycetaceae; Saccharomyces.
923 REFERENCE 1 (bases 1 to 5028)
924 AUTHORS Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
925 TITLE Cloning and sequence of REV7, a gene whose function is required for
926 DNA damage-induced mutagenesis in Saccharomyces cerevisiae
927 JOURNAL Yeast 10 (11), 1503-1509 (1994)
928 PUBMED 7871890
929 REFERENCE 2 (bases 1 to 5028)
930 AUTHORS Roemer,T., Madden,K., Chang,J. and Snyder,M.
931 TITLE Selection of axial growth sites in yeast requires Axl2p, a novel
932 plasma membrane glycoprotein
933 JOURNAL Genes Dev. 10 (7), 777-793 (1996)
934 PUBMED 8846915
935 REFERENCE 3 (bases 1 to 5028)
936 AUTHORS Roemer,T.
937 TITLE Direct Submission
938 JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
939 Haven, CT, USA
940 FEATURES Location/Qualifiers
941 source 1..5028
942 /organism="Saccharomyces cerevisiae"
943 /db_xref="taxon:4932"
944 /chromosome="IX"
945 /map="9"
946 CDS <1..206
947 /codon_start=3
948 /product="TCP1-beta"
949 /protein_id="AAA98665.1"
950 /db_xref="GI:1293614"
951 /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
952 AEVLLRVDNIIRARPRTANRQHM"
953 gene 687..3158
954 /gene="AXL2"
955 CDS 687..3158
956 /gene="AXL2"
957 /note="plasma membrane glycoprotein"
958 /codon_start=1
959 /function="required for axial budding pattern of S.
960 cerevisiae"
961 /product="Axl2p"
962 /protein_id="AAA98666.1"
963 /db_xref="GI:1293615"
964 /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
965 TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
966 VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
967 VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
968 TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
969 YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
970 DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
971 DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
972 NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
973 CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
974 NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
975 SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
976 YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
977 HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
978 VDFSNKSNVNVGQVKDIHGRIPEML"
979 gene complement(3300..4037)
980 /gene="REV7"
981 CDS complement(3300..4037)
982 /gene="REV7"
983 /codon_start=1
984 /product="Rev7p"
985 /protein_id="AAA98667.1"
986 /db_xref="GI:1293616"
987 /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ
988 FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD
989 KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR
990 RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK
991 LISGDDKILNGVYSQYEEGESIFGSLF"
992 ORIGIN
993 1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
994 61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
995 121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
996 181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
997 241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
998 301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa
999 361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat
1000 421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctcaaagc tccttgccga
1001 481 gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc
1002 541 tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga
1003 601 acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta
1004 661 cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag
1005 721 ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa
1006 781 aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata
1007 841 cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga
1008 901 gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac
1009 961 tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg
1010 1021 acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc
1011 1081 tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa
1012 1141 acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca
1013 1201 ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac
1014 1261 ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa
1015 1321 actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag
1016 1381 gattttctgc cgttgaggta gaattcgaat tagtcatcgg ggctcaccag ttaactacct
1017 1441 ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac
1018 1501 ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa
1019 1561 acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc
1020 1621 cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata
1021 1681 cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca
1022 1741 ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc
1023 1801 cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc
1024 1861 aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca
1025 1921 agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc
1026 1981 tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg
1027 2041 caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt
1028 2101 acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc
1029 2161 cagcagccaa taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg
1030 2221 ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca
1031 2281 gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata
1032 2341 atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg
1033 2401 atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga
1034 2461 aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt
1035 2521 ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat
1036 2581 tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt
1037 2641 cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc
1038 2701 tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag
1039 2761 aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta
1040 2821 tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa
1041 2881 caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact
1042 2941 ctcaaagcgg taaaaacgga atcactccca caacaatgtc aacttcatct tctgacgatt
1043 3001 ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa
1044 3061 gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag
1045 3121 ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct
1046 3181 taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt
1047 3241 agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact
1048 3301 taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa
1049 3361 attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg
1050 3421 tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt
1051 3481 aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc
1052 3541 tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca
1053 3601 ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc
1054 3661 ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc
1055 3721 tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat
1056 3781 aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa
1057 3841 agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga
1058 3901 acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat
1059 3961 acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc
1060 4021 tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc
1061 4081 tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa
1062 4141 gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg
1063 4201 atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc
1064 4261 ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt
1065 4321 tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg
1066 4381 cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg
1067 4441 ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt
1068 4501 agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt
1069 4561 tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat
1070 4621 tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc
1071 4681 atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct
1072 4741 tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta
1073 4801 gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac
1074 4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct
1075 4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct
1076 4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc
1077 //"""
1078
1079 # GenBank format protein (aka GenPept) file from:
1080 # http://www.molecularevolution.org/resources/fileformats/
1081 gbk_example2 = \
1082 """LOCUS AAD51968 143 aa linear BCT 21-AUG-2001
1083 DEFINITION transcriptional regulator RovA [Yersinia enterocolitica].
1084 ACCESSION AAD51968
1085 VERSION AAD51968.1 GI:5805369
1086 DBSOURCE locus AF171097 accession AF171097.1
1087 KEYWORDS .
1088 SOURCE Yersinia enterocolitica
1089 ORGANISM Yersinia enterocolitica
1090 Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
1091 Enterobacteriaceae; Yersinia.
1092 REFERENCE 1 (residues 1 to 143)
1093 AUTHORS Revell,P.A. and Miller,V.L.
1094 TITLE A chromosomally encoded regulator is required for expression of the
1095 Yersinia enterocolitica inv gene and for virulence
1096 JOURNAL Mol. Microbiol. 35 (3), 677-685 (2000)
1097 MEDLINE 20138369
1098 PUBMED 10672189
1099 REFERENCE 2 (residues 1 to 143)
1100 AUTHORS Revell,P.A. and Miller,V.L.
1101 TITLE Direct Submission
1102 JOURNAL Submitted (22-JUL-1999) Molecular Microbiology, Washington
1103 University School of Medicine, Campus Box 8230, 660 South Euclid,
1104 St. Louis, MO 63110, USA
1105 COMMENT Method: conceptual translation.
1106 FEATURES Location/Qualifiers
1107 source 1..143
1108 /organism="Yersinia enterocolitica"
1109 /mol_type="unassigned DNA"
1110 /strain="JB580v"
1111 /serotype="O:8"
1112 /db_xref="taxon:630"
1113 Protein 1..143
1114 /product="transcriptional regulator RovA"
1115 /name="regulates inv expression"
1116 CDS 1..143
1117 /gene="rovA"
1118 /coded_by="AF171097.1:380..811"
1119 /note="regulator of virulence"
1120 /transl_table=11
1121 ORIGIN
1122 1 mestlgsdla rlvrvwrali dhrlkplelt qthwvtlhni nrlppeqsqi qlakaigieq
1123 61 pslvrtldql eekglitrht candrrakri klteqsspii eqvdgvicst rkeilggisp
1124 121 deiellsgli dklerniiql qsk
1125 //"""
1126
1127
1128 swiss_example = \
1129 """ID 104K_THEAN Reviewed; 893 AA.
1130 AC Q4U9M9;
1131 DT 18-APR-2006, integrated into UniProtKB/Swiss-Prot.
1132 DT 05-JUL-2005, sequence version 1.
1133 DT 31-OCT-2006, entry version 8.
1134 DE 104 kDa microneme-rhoptry antigen precursor (p104).
1135 GN ORFNames=TA08425;
1136 OS Theileria annulata.
1137 OC Eukaryota; Alveolata; Apicomplexa; Piroplasmida; Theileriidae;
1138 OC Theileria.
1139 OX NCBI_TaxID=5874;
1140 RN [1]
1141 RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
1142 RC STRAIN=Ankara;
1143 RX PubMed=15994557; DOI=10.1126/science.1110418;
1144 RA Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,
1145 RA Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,
1146 RA Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,
1147 RA Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,
1148 RA Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,
1149 RA Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,
1150 RA Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,
1151 RA Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,
1152 RA Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,
1153 RA Barrell B.G., Hall N.;
1154 RT "Genome of the host-cell transforming parasite Theileria annulata
1155 RT compared with T. parva.";
1156 RL Science 309:131-133(2005).
1157 CC -!- SUBCELLULAR LOCATION: Cell membrane; lipid-anchor; GPI-anchor
1158 CC (Potential). In microneme/rhoptry complexes (By similarity).
1159 DR EMBL; CR940353; CAI76474.1; -; Genomic_DNA.
1160 DR InterPro; IPR007480; DUF529.
1161 DR Pfam; PF04385; FAINT; 4.
1162 KW Complete proteome; GPI-anchor; Lipoprotein; Membrane; Repeat; Signal;
1163 KW Sporozoite.
1164 FT SIGNAL 1 19 Potential.
1165 FT CHAIN 20 873 104 kDa microneme-rhoptry antigen.
1166 FT /FTId=PRO_0000232680.
1167 FT PROPEP 874 893 Removed in mature form (Potential).
1168 FT /FTId=PRO_0000232681.
1169 FT COMPBIAS 215 220 Poly-Leu.
1170 FT COMPBIAS 486 683 Lys-rich.
1171 FT COMPBIAS 854 859 Poly-Arg.
1172 FT LIPID 873 873 GPI-anchor amidated aspartate
1173 FT (Potential).
1174 SQ SEQUENCE 893 AA; 101921 MW; 2F67CEB3B02E7AC1 CRC64;
1175 MKFLVLLFNI LCLFPILGAD ELVMSPIPTT DVQPKVTFDI NSEVSSGPLY LNPVEMAGVK
1176 YLQLQRQPGV QVHKVVEGDI VIWENEEMPL YTCAIVTQNE VPYMAYVELL EDPDLIFFLK
1177 EGDQWAPIPE DQYLARLQQL RQQIHTESFF SLNLSFQHEN YKYEMVSSFQ HSIKMVVFTP
1178 KNGHICKMVY DKNIRIFKAL YNEYVTSVIG FFRGLKLLLL NIFVIDDRGM IGNKYFQLLD
1179 DKYAPISVQG YVATIPKLKD FAEPYHPIIL DISDIDYVNF YLGDATYHDP GFKIVPKTPQ
1180 CITKVVDGNE VIYESSNPSV ECVYKVTYYD KKNESMLRLD LNHSPPSYTS YYAKREGVWV
1181 TSTYIDLEEK IEELQDHRST ELDVMFMSDK DLNVVPLTNG NLEYFMVTPK PHRDIIIVFD
1182 GSEVLWYYEG LENHLVCTWI YVTEGAPRLV HLRVKDRIPQ NTDIYMVKFG EYWVRISKTQ
1183 YTQEIKKLIK KSKKKLPSIE EEDSDKHGGP PKGPEPPTGP GHSSSESKEH EDSKESKEPK
1184 EHGSPKETKE GEVTKKPGPA KEHKPSKIPV YTKRPEFPKK SKSPKRPESP KSPKRPVSPQ
1185 RPVSPKSPKR PESLDIPKSP KRPESPKSPK RPVSPQRPVS PRRPESPKSP KSPKSPKSPK
1186 VPFDPKFKEK LYDSYLDKAA KTKETVTLPP VLPTDESFTH TPIGEPTAEQ PDDIEPIEES
1187 VFIKETGILT EEVKTEDIHS ETGEPEEPKR PDSPTKHSPK PTGTHPSMPK KRRRSDGLAL
1188 STTDLESEAG RILRDPTGKI VTMKRSKSFD DLTTVREKEH MGAEIRKIVV DDDGTEADDE
1189 DTHPSKEKHL STVRRRRPRP KKSSKSSKPR KPDSAFVPSI IFIFLVSLIV GIL
1190 //
1191 ID 104K_THEPA Reviewed; 924 AA.
1192 AC P15711; Q4N2B5;
1193 DT 01-APR-1990, integrated into UniProtKB/Swiss-Prot.
1194 DT 01-APR-1990, sequence version 1.
1195 DT 31-OCT-2006, entry version 31.
1196 DE 104 kDa microneme-rhoptry antigen precursor (p104).
1197 GN OrderedLocusNames=TP04_0437;
1198 OS Theileria parva.
1199 OC Eukaryota; Alveolata; Apicomplexa; Piroplasmida; Theileriidae;
1200 OC Theileria.
1201 OX NCBI_TaxID=5875;
1202 RN [1]
1203 RP NUCLEOTIDE SEQUENCE [GENOMIC DNA].
1204 RC STRAIN=Muguga;
1205 RX MEDLINE=90158697; PubMed=1689460; DOI=10.1016/0166-6851(90)90007-9;
1206 RA Iams K.P., Young J.R., Nene V., Desai J., Webster P., Ole-Moiyoi O.K.,
1207 RA Musoke A.J.;
1208 RT "Characterisation of the gene encoding a 104-kilodalton microneme-
1209 RT rhoptry protein of Theileria parva.";
1210 RL Mol. Biochem. Parasitol. 39:47-60(1990).
1211 RN [2]
1212 RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
1213 RC STRAIN=Muguga;
1214 RX PubMed=15994558; DOI=10.1126/science.1110439;
1215 RA Gardner M.J., Bishop R., Shah T., de Villiers E.P., Carlton J.M.,
1216 RA Hall N., Ren Q., Paulsen I.T., Pain A., Berriman M., Wilson R.J.M.,
1217 RA Sato S., Ralph S.A., Mann D.J., Xiong Z., Shallom S.J., Weidman J.,
1218 RA Jiang L., Lynn J., Weaver B., Shoaibi A., Domingo A.R., Wasawo D.,
1219 RA Crabtree J., Wortman J.R., Haas B., Angiuoli S.V., Creasy T.H., Lu C.,
1220 RA Suh B., Silva J.C., Utterback T.R., Feldblyum T.V., Pertea M.,
1221 RA Allen J., Nierman W.C., Taracha E.L.N., Salzberg S.L., White O.R.,
1222 RA Fitzhugh H.A., Morzaria S., Venter J.C., Fraser C.M., Nene V.;
1223 RT "Genome sequence of Theileria parva, a bovine pathogen that transforms
1224 RT lymphocytes.";
1225 RL Science 309:134-137(2005).
1226 CC -!- SUBCELLULAR LOCATION: Cell membrane; lipid-anchor; GPI-anchor
1227 CC (Potential). In microneme/rhoptry complexes.
1228 CC -!- DEVELOPMENTAL STAGE: Sporozoite antigen.
1229 DR EMBL; M29954; AAA18217.1; -; Unassigned_DNA.
1230 DR EMBL; AAGK01000004; EAN31789.1; -; Genomic_DNA.
1231 DR PIR; A44945; A44945.
1232 DR InterPro; IPR007480; DUF529.
1233 DR Pfam; PF04385; FAINT; 4.
1234 KW Complete proteome; GPI-anchor; Lipoprotein; Membrane; Repeat; Signal;
1235 KW Sporozoite.
1236 FT SIGNAL 1 19 Potential.
1237 FT CHAIN 20 904 104 kDa microneme-rhoptry antigen.
1238 FT /FTId=PRO_0000046081.
1239 FT PROPEP 905 924 Removed in mature form (Potential).
1240 FT /FTId=PRO_0000232679.
1241 FT COMPBIAS 508 753 Pro-rich.
1242 FT COMPBIAS 880 883 Poly-Arg.
1243 FT LIPID 904 904 GPI-anchor amidated aspartate
1244 FT (Potential).
1245 SQ SEQUENCE 924 AA; 103626 MW; 289B4B554A61870E CRC64;
1246 MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
1247 QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
1248 DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
1249 GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
1250 YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
1251 TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
1252 THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
1253 EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
1254 QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
1255 SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
1256 PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
1257 DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK
1258 DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
1259 SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
1260 TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
1261 KKPDSAYIPS ILAILVVSLI VGIL
1262 //
1263 ID 108_SOLLC Reviewed; 102 AA.
1264 AC Q43495;
1265 DT 15-JUL-1999, integrated into UniProtKB/Swiss-Prot.
1266 DT 01-NOV-1996, sequence version 1.
1267 DT 31-OCT-2006, entry version 37.
1268 DE Protein 108 precursor.
1269 OS Solanum lycopersicum (Tomato) (Lycopersicon esculentum).
1270 OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
1271 OC Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
1272 OC asterids; lamiids; Solanales; Solanaceae; Solanum; Lycopersicon.
1273 OX NCBI_TaxID=4081;
1274 RN [1]
1275 RP NUCLEOTIDE SEQUENCE [MRNA].
1276 RC STRAIN=cv. VF36; TISSUE=Anther;
1277 RX MEDLINE=94143497; PubMed=8310077; DOI=10.1104/pp.101.4.1413;
1278 RA Chen R., Smith A.G.;
1279 RT "Nucleotide sequence of a stamen- and tapetum-specific gene from
1280 RT Lycopersicon esculentum.";
1281 RL Plant Physiol. 101:1413-1413(1993).
1282 CC -!- TISSUE SPECIFICITY: Stamen- and tapetum-specific.
1283 CC -!- SIMILARITY: Belongs to the A9/FIL1 family.
1284 DR EMBL; Z14088; CAA78466.1; -; mRNA.
1285 DR PIR; S26409; S26409.
1286 DR InterPro; IPR013770; LPT_helical.
1287 DR InterPro; IPR003612; LTP/seed_store/tryp_amyl_inhib.
1288 DR Pfam; PF00234; Tryp_alpha_amyl; 1.
1289 DR SMART; SM00499; AAI; 1.
1290 KW Signal.
1291 FT SIGNAL 1 30 Potential.
1292 FT CHAIN 31 102 Protein 108.
1293 FT /FTId=PRO_0000000238.
1294 FT DISULFID 41 77 By similarity.
1295 FT DISULFID 51 66 By similarity.
1296 FT DISULFID 67 92 By similarity.
1297 FT DISULFID 79 99 By similarity.
1298 SQ SEQUENCE 102 AA; 10576 MW; CFBAA1231C3A5E92 CRC64;
1299 MASVKSSSSS SSSSFISLLL LILLVIVLQS QVIECQPQQS CTASLTGLNV CAPFLVPGSP
1300 TASTECCNAV QSINHDCMCN TMRIAAQIPA QCNLPPLSCS AN
1301 //
1302 """
1303
    print "#########################################################"
    print "# Sequence Input Tests #"
    print "#########################################################"

    #ToDo - Check alphabet, or at least DNA/amino acid, for those
    #       file types that specify it (e.g. Nexus, GenBank)
    #
    #Each test is a tuple of (data, format, expected record count,
    #expected last record id, expected last sequence or None,
    #whether the record ids are unique and so safe to use as dictionary keys)
    tests = [
        (aln_example, "clustal", 8, "HISJ_E_COLI",
         "MKKLVLSLSLVLAFSSATAAF-------------------AAIPQNIRIG" + \
         "TDPTYAPFESKNS-QGELVGFDIDLAKELCKRINTQCTFVENPLDALIPS" + \
         "LKAKKIDAIMSSLSITEKRQQEIAFTDKLYAADSRLVVAKNSDIQP-TVE" + \
         "SLKGKRVGVLQGTTQETFGNEHWAPKGIEIVSYQGQDNIYSDLTAGRIDA" + \
         "AFQDEVAASEGFLKQPVGKDYKFGGPSVKDEKLFGVGTGMGLRKED--NE" + \
         "LREALNKAFAEMRADGTYEKLAKKYFDFDVYGG---", True),
        (phy_example, "phylip", 8, "HISJ_E_COL", None, False),
        (nxs_example, "nexus", 8, "HISJ_E_COLI", None, True),
        (nxs_example2, "nexus", 10, "Frog",
         "ATGGCACACCCATCACAATTAGGTTTTCAAGACGCAGCCTCTCCAATTATAGAAGAATTA" + \
         "CTTCACTTCCACGACCATACCCTCATAGCCGTTTTTCTTATTAGTACGCTAGTTCTTTAC" + \
         "ATTATTACTATTATAATAACTACTAAACTAACTAATACAAACCTAATGGACGCACAAGAG" + \
         "ATCGAAATAGTGTGAACTATTATACCAGCTATTAGCCTCATCATAATTGCCCTTCCATCC" + \
         "CTTCGTATCCTATATTTAATAGATGAAGTTAATGATCCACACTTAACAATTAAAGCAATC" + \
         "GGCCACCAATGATACTGAAGCTACGAATATACTAACTATGAGGATCTCTCATTTGACTCT" + \
         "TATATAATTCCAACTAATGACCTTACCCCTGGACAATTCCGGCTGCTAGAAGTTGATAAT" + \
         "CGAATAGTAGTCCCAATAGAATCTCCAACCCGACTTTTAGTTACAGCCGAAGACGTCCTC" + \
         "CACTCGTGAGCTGTACCCTCCTTGGGTGTCAAAACAGATGCAATCCCAGGACGACTTCAT" + \
         "CAAACATCATTTATTGCTACTCGTCCGGGAGTATTTTACGGACAATGTTCAGAAATTTGC" + \
         "GGAGCAAACCACAGCTTTATACCAATTGTAGTTGAAGCAGTACCGCTAACCGACTTTGAA" + \
         "AACTGATCTTCATCAATACTA---GAAGCATCACTA------AGA", True),
        (nxs_example3, "nexus", 10, "Frog",
         'MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQE' + \
         'IEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDS' + \
         'YMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLH' + \
         'QTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSML-EASL--', True),
        (faa_example, "fasta", 8, "HISJ_E_COLI",
         'mkklvlslslvlafssataafaaipqnirigtdptyapfesknsqgelvgfdidlakelc' + \
         'krintqctfvenpldalipslkakkidaimsslsitekrqqeiaftdklyaadsrlvvak' + \
         'nsdiqptveslkgkrvgvlqgttqetfgnehwapkgieivsyqgqdniysdltagridaa' + \
         'fqdevaasegflkqpvgkdykfggpsvkdeklfgvgtgmglrkednelrealnkafaemr' + \
         'adgtyeklakkyfdfdvygg', True),
        (sth_example, "stockholm", 5, "O31699/88-139",
         'EVMLTDIPRLHINDPIMK--GFGMVINN------GFVCVENDE', True),
        (sth_example2, "stockholm", 2, "AE007476.1",
         'AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU' + \
         'UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU', True),
        (gbk_example, "genbank", 1, "U49845.1", None, True),
        (gbk_example2,"genbank", 1, 'AAD51968.1',
         "MESTLGSDLARLVRVWRALIDHRLKPLELTQTHWVTLHNINRLPPEQSQIQLAKAIGIEQ" + \
         "PSLVRTLDQLEEKGLITRHTCANDRRAKRIKLTEQSSPIIEQVDGVICSTRKEILGGISP" + \
         "DEIELLSGLIDKLERNIIQLQSK", True),
        (gbk_example, "genbank-cds", 3, "AAA98667.1",
         'MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQFVPINRHPALIDYIEE' + \
         'LILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKDDQIITETEVFDEFRSS' + \
         'LNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVDSLEEKAEIERDSNWVKC' + \
         'QEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISGDDKILNGVYSQYEEGESI' + \
         'FGSLF', True),
        (swiss_example,"swiss", 3, "Q43495",
         "MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSP" + \
         "TASTECCNAVQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN", True),
        ]

    for (data, format, rec_count, last_id, last_seq, dict_check) in tests:

        print "%s file with %i records" % (format, rec_count)

        print "Bio.SeqIO.parse(handle)"

        #Basic check, turning the iterator into a list...
        #This uses "for x in iterator" internally.
        iterator = parse(StringIO(data), format=format)
        as_list = list(iterator)
        assert len(as_list) == rec_count, \
            "Expected %i records, found %i" \
            % (rec_count, len(as_list))
        assert as_list[-1].id == last_id, \
            "Expected '%s' as last record ID, found '%s'" \
            % (last_id, as_list[-1].id)
        if last_seq :
            assert as_list[-1].seq.tostring() == last_seq

        #Test iteration including use of the next() method and "for x in iterator"
        iterator = parse(StringIO(data), format=format)
        count = 1
        record = iterator.next()
        assert record is not None
        assert isinstance(record, SeqRecord)
        #print record
        for record in iterator :
            assert record.id == as_list[count].id
            assert record.seq.tostring() == as_list[count].seq.tostring()
            count = count + 1
        assert count == rec_count
        assert record is not None
        assert record.id == last_id

        #Test iteration using just next() method
        iterator = parse(StringIO(data), format=format)
        count = 0
        while True :
            try :
                record = iterator.next()
            except StopIteration :
                break
            if record is None : break
            assert record.id == as_list[count].id
            assert record.seq.tostring() == as_list[count].seq.tostring()
            count = count + 1
        assert count == rec_count

        print "parse(...)"
        iterator = parse(StringIO(data), format=format)
        for (i, record) in enumerate(iterator) :
            assert record.id == as_list[i].id
            assert record.seq.tostring() == as_list[i].seq.tostring()
        assert i+1 == rec_count

        print "parse(handle to empty file)"
        iterator = parse(StringIO(""), format=format)
        assert len(list(iterator))==0

        if dict_check :
            print "to_dict(parse(...))"
            seq_dict = to_dict(parse(StringIO(data), format=format))
            assert Set(seq_dict.keys()) == Set([r.id for r in as_list])
            assert last_id in seq_dict
            assert seq_dict[last_id].seq.tostring() == as_list[-1].seq.tostring()

            if len(Set([len(r.seq) for r in as_list]))==1 :
                #All the sequences in the example are the same length,
                #so it makes sense to try turning this file into an alignment.
                print "to_alignment(parse(handle))"
                alignment = to_alignment(parse(handle = StringIO(data), format=format))
                assert len(alignment._records)==rec_count
                assert alignment.get_alignment_length() == len(as_list[0].seq)
                for i in range(0, rec_count) :
                    assert as_list[i].id == alignment._records[i].id
                    assert as_list[i].id == alignment.get_all_seqs()[i].id
                    assert as_list[i].seq.tostring() == alignment._records[i].seq.tostring()
                    assert as_list[i].seq.tostring() == alignment.get_all_seqs()[i].seq.tostring()

        print "read(...)"
        if rec_count == 1 :
            record = read(StringIO(data), format)
            assert isinstance(record, SeqRecord)
        else :
            try :
                record = read(StringIO(data), format)
                assert False, "Should have failed"
            except ValueError :
                #Expected to fail, as read(...) requires exactly one record
                pass

        print

    print "Checking phy <-> aln examples agree using list(parse(...))"
    #Only compare the first 10 characters of the record.id as they
    #are truncated in the phylip file. Cannot use to_dict(parse(...))
    #on the phylip file as there is a repeated id.
    aln_list = list(parse(StringIO(aln_example), format="clustal"))
    phy_list = list(parse(StringIO(phy_example), format="phylip"))
    assert len(aln_list) == len(phy_list)
    assert Set([r.id[0:10] for r in aln_list]) == Set([r.id for r in phy_list])
    for i in range(0, len(aln_list)) :
        assert aln_list[i].id[0:10] == phy_list[i].id
        assert aln_list[i].seq.tostring() == phy_list[i].seq.tostring()

    print "Checking nxs <-> aln examples agree using parse"
    #Step through both iterators in parallel, comparing record by record.
    #Unlike the phylip file, the nexus file keeps the full record.id,
    #so the ids can be compared without truncation.
    aln_iter = parse(StringIO(aln_example), format="clustal")
    nxs_iter = parse(StringIO(nxs_example), format="nexus")
    while True :
        try :
            aln_record = aln_iter.next()
        except StopIteration :
            aln_record = None
        try :
            nxs_record = nxs_iter.next()
        except StopIteration :
            nxs_record = None
        if aln_record is None or nxs_record is None :
            #Both iterators must run out at the same time
            assert aln_record is None
            assert nxs_record is None
            break
        assert aln_record.id == nxs_record.id
        assert aln_record.seq.tostring() == nxs_record.seq.tostring()

    print "Checking faa <-> aln examples agree using to_dict(parse(...))"
    #In my examples, aln_example is an alignment of faa_example
    aln_dict = to_dict(parse(StringIO(aln_example), format="clustal"))
    faa_dict = to_dict(parse(StringIO(faa_example), format="fasta"))

    ids = Set(aln_dict.keys())
    assert ids == Set(faa_dict.keys())

    for id in ids :
        #The aln file contains gaps as "-", and this fasta file does not
        assert aln_dict[id].seq.tostring().upper().replace("-","") == \
               faa_dict[id].seq.tostring().upper()

    print
    print "#########################################################"
    print "# Sequence Output Tests #"
    print "#########################################################"
    print

    general_output_formats = ["fasta"]
    alignment_formats = ["phylip","stockholm","clustal"]
    for (in_data, in_format, rec_count, last_id, last_seq, unique_ids) in tests:
        if unique_ids :
            in_list = list(parse(StringIO(in_data), format=in_format))
            seq_lengths = [len(r.seq) for r in in_list]
            output_formats = general_output_formats[:]
            if min(seq_lengths)==max(seq_lengths) :
                output_formats.extend(alignment_formats)
                print "Checking conversion from %s (including to alignment formats)" % in_format
            else :
                print "Checking conversion from %s (excluding alignment formats)" % in_format
            for out_format in output_formats :
                print "Converting %s iterator -> %s" % (in_format, out_format)
                output = open("temp.txt","w")
                iterator = parse(StringIO(in_data), format=in_format)
                #I am using an iterator here deliberately, as some format
                #writers (e.g. phylip and stockholm) will have to cope with
                #this and get the record count.

                try :
                    write(iterator, output, out_format)
                except ValueError, e:
                    print "FAILED: %s" % str(e)
                    #Close the partial output, then try the next format instead...
                    output.close()
                    continue

                output.close()

                print "Checking %s <-> %s" % (in_format, out_format)
                out_list = list(parse(open("temp.txt","rU"), format=out_format))

                assert rec_count == len(out_list)
                if last_seq :
                    assert last_seq == out_list[-1].seq.tostring()
                if out_format=="phylip" :
                    assert last_id[0:10] == out_list[-1].id
                else :
                    assert last_id == out_list[-1].id

                #Compare every record, not just the last one
                for i in range(0, rec_count) :
                    assert in_list[i].seq.tostring() == out_list[i].seq.tostring()
                    if out_format=="phylip" :
                        assert in_list[i].id[0:10] == out_list[i].id
                    else :
                        assert in_list[i].id == out_list[i].id
            print

    print "#########################################################"
    print "# SeqIO Tests finished #"
    print "#########################################################"