# Copyright 2006-2010 by Peter Cock. All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license. Please see the LICENSE file that should have been included
# as part of this package.
#
# Nice link:
# http://www.ebi.ac.uk/help/formats_frame.html

r"""Sequence input/output as SeqRecord objects.

Bio.SeqIO is also documented at SeqIO_ and by
a whole chapter in our tutorial:

  - `HTML Tutorial`_
  - `PDF Tutorial`_

.. _SeqIO: http://biopython.org/wiki/SeqIO
.. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html
.. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Input
-----
The main function is Bio.SeqIO.parse(...) which takes an input file handle
(or in recent versions of Biopython alternatively a filename as a string),
and a format string. This returns an iterator giving SeqRecord objects:

    >>> from Bio import SeqIO
    >>> for record in SeqIO.parse("Fasta/f002", "fasta"):
    ...     print("%s %i" % (record.id, len(record)))
    gi|1348912|gb|G26680|G26680 633
    gi|1348917|gb|G26685|G26685 413
    gi|1592936|gb|G29385|G29385 471

Note that the parse() function will invoke the relevant parser for the
format with its default settings. You may want more control, in which case
you need to create a format-specific sequence iterator directly.

Input - Single Records
----------------------
If you expect your file to contain one-and-only-one record, then we provide
the following 'helper' function which will return a single SeqRecord, or
raise an exception if there are no records or more than one record:

    >>> from Bio import SeqIO
    >>> record = SeqIO.read("Fasta/f001", "fasta")
    >>> print("%s %i" % (record.id, len(record)))
    gi|3318709|pdb|1A91| 79

This style is useful when you expect a single record only (and would
consider multiple records an error). For example, when dealing with GenBank
files for bacterial genomes or chromosomes, there is normally only a single
record. Alternatively, use this with a handle when downloading a single
record from the internet.

However, if you just want the first record from a file containing multiple
records, use the next() function on the iterator (or under Python 2, the
iterator's next() method):

    >>> from Bio import SeqIO
    >>> record = next(SeqIO.parse("Fasta/f002", "fasta"))
    >>> print("%s %i" % (record.id, len(record)))
    gi|1348912|gb|G26680|G26680 633

The above code will work as long as the file contains at least one record.
Note that if there is more than one record, the remaining records will be
silently ignored.
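
The difference between the two styles can be sketched in plain Python. This
is only an illustration under stated assumptions, not how Bio.SeqIO is
implemented: read_single and first_only are hypothetical helper names, and
any iterable stands in for the iterator returned by parse().

```python
_MISSING = object()


def read_single(records):
    # Strict style: exactly one item is required, anything else is an error.
    iterator = iter(records)
    first = next(iterator, _MISSING)
    if first is _MISSING:
        raise ValueError("No records found")
    if next(iterator, _MISSING) is not _MISSING:
        raise ValueError("More than one record found")
    return first


def first_only(records):
    # Lenient style: take the first item and ignore whatever follows.
    # An empty input still fails, here with StopIteration.
    return next(iter(records))
```

With these helpers, read_single(["a", "b"]) raises ValueError, while
first_only(["a", "b"]) returns "a" and never looks at "b".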


Input - Multiple Records
------------------------
For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records,
using a sequence iterator can save you a lot of memory (RAM). There is
less benefit for interlaced file formats (e.g. most multiple alignment file
formats). However, an iterator only lets you access the records one by one.

If you want random access to the records by number, turn this into a list:

    >>> from Bio import SeqIO
    >>> records = list(SeqIO.parse("Fasta/f002", "fasta"))
    >>> len(records)
    3
    >>> print(records[1].id)
    gi|1348917|gb|G26685|G26685

If you want random access to the records by a key such as the record id,
turn the iterator into a dictionary:

    >>> from Bio import SeqIO
    >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta"))
    >>> len(record_dict)
    3
    >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"]))
    413

However, using list() or the to_dict() function will load all the records
into memory at once, and therefore may not be feasible for very large files.
Instead, for *some* file formats Bio.SeqIO provides an indexing approach
providing dictionary-like access to any record. For example,

    >>> from Bio import SeqIO
    >>> record_dict = SeqIO.index("Fasta/f002", "fasta")
    >>> len(record_dict)
    3
    >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"]))
    413
    >>> record_dict.close()

Many but not all of the supported input file formats can be indexed like
this. For example "fasta", "fastq", "qual" and even the binary format "sff"
work, but alignment formats like "phylip", "clustal" and "nexus" will not.
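
The general idea behind such an index can be sketched with the standard
library alone: remember where each record starts in the file rather than
holding the parsed records in memory. This is a simplified illustration for
FASTA-like data and is not the Bio.SeqIO.index implementation; the names
build_offset_index and fetch_raw are hypothetical, and empty ">" header
lines are not handled.

```python
import io


def build_offset_index(handle):
    # Map each identifier to (start_offset, length_in_bytes).
    # The handle must be opened in binary mode and support tell().
    index = {}
    current_id = None
    start = 0
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line or line.startswith(b">"):
            # A new header (or end of file) closes the previous record.
            if current_id is not None:
                index[current_id] = (start, offset - start)
            if not line:
                break
            current_id = line[1:].split(None, 1)[0].decode()
            start = offset
    return index


def fetch_raw(handle, index, key):
    # Seek to the record and return its bytes exactly as stored.
    start, length = index[key]
    handle.seek(start)
    return handle.read(length)


data = b">alpha first\nACGT\nAC\n>beta second\nGGCC\n"
handle = io.BytesIO(data)
index = build_offset_index(handle)
raw_beta = fetch_raw(handle, index, "beta")
```

Only the small dictionary of offsets is kept in memory; each record is read
from the handle on request.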

In most cases you can also use SeqIO.index to get the record from the file
as a raw string (not a SeqRecord). This can be useful for example to extract
a sub-set of records from a file where SeqIO cannot output the file format
(e.g. the plain text SwissProt format, "swiss") or where it is important to
keep the output 100% identical to the input. For example,

    >>> from Bio import SeqIO
    >>> record_dict = SeqIO.index("Fasta/f002", "fasta")
    >>> len(record_dict)
    3
    >>> print(record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode())
    >gi|1348917|gb|G26685|G26685 human STS STS_D11734.
    CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC
    ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA
    GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA
    GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT
    TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG
    AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA
    <BLANKLINE>
    >>> print(record_dict["gi|1348917|gb|G26685|G26685"].format("fasta"))
    >gi|1348917|gb|G26685|G26685 human STS STS_D11734.
    CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC
    TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT
    TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA
    ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG
    AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA
    TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG
    CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA
    <BLANKLINE>
    >>> record_dict.close()

Here the original file and what Biopython would output differ in the line
wrapping. Also note that under Python 3, the get_raw method will return a
bytes string, hence the use of decode to turn it into a (unicode) string.
This is unnecessary on Python 2.

Also note that the get_raw method will preserve the newline endings. This
example FASTQ file uses Unix style endings (b"\n" only),

    >>> from Bio import SeqIO
    >>> fastq_dict = SeqIO.index("Quality/example.fastq", "fastq")
    >>> len(fastq_dict)
    3
    >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792")
    >>> raw.count(b"\n")
    4
    >>> raw.count(b"\r\n")
    0
    >>> b"\r" in raw
    False
    >>> len(raw)
    78
    >>> fastq_dict.close()

Here is the same file but using DOS/Windows new lines (b"\r\n" instead),

    >>> from Bio import SeqIO
    >>> fastq_dict = SeqIO.index("Quality/example_dos.fastq", "fastq")
    >>> len(fastq_dict)
    3
    >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792")
    >>> raw.count(b"\n")
    4
    >>> raw.count(b"\r\n")
    4
    >>> b"\r\n" in raw
    True
    >>> len(raw)
    82
    >>> fastq_dict.close()

Because this uses two bytes for each new line, the raw record is four bytes
longer (82 rather than 78) than the Unix equivalent with only one byte.


Input - Alignments
------------------
You can read in alignment files as alignment objects using Bio.AlignIO.
Alternatively, reading in an alignment file format via Bio.SeqIO will give
you a SeqRecord for each row of each alignment:

    >>> from Bio import SeqIO
    >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"):
    ...     print("%s %i" % (record.id, len(record)))
    gi|167877390|gb|EDS40773.1| 447
    gi|167234445|ref|NP_001107837. 447
    gi|74100009|gb|AAZ99217.1| 447
    gi|13990994|dbj|BAA33523.2| 447
    gi|56122354|gb|AAV74328.1| 447


Output
------
Use the function Bio.SeqIO.write(...), which takes a complete set of
SeqRecord objects (either as a list, or an iterator), an output file handle
(or in recent versions of Biopython an output filename as a string) and of
course the file format::

    from Bio import SeqIO
    records = ...
    SeqIO.write(records, "example.faa", "fasta")

Or, using a handle::

    from Bio import SeqIO
    records = ...
    with open("example.faa", "w") as handle:
        SeqIO.write(records, handle, "fasta")

You are expected to call this function once (with all your records) and if
using a handle, make sure you close it to flush the data to the hard disk.


Output - Advanced
-----------------
The effect of calling write() multiple times on a single file will vary
depending on the file format, and is best avoided unless you have a strong
reason to do so.

If you give a filename, then each time you call write() the existing file
will be overwritten. For sequential file formats (e.g. fasta, genbank) each
"record block" holds a single sequence. For these files it would probably
be safe to call write() multiple times by re-using the same handle.

However, trying this for certain alignment formats (e.g. phylip, clustal,
stockholm) would have the effect of concatenating several multiple sequence
alignments together. Such files are created by the PHYLIP suite of programs
for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead.
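
The filename versus handle distinction can be demonstrated with the standard
library. The sketch below assumes a hypothetical write_lines helper that
accepts either a filename or an open handle; it only mimics the behaviour
described above and is not the Bio.SeqIO.write code.

```python
import io
import os
import tempfile


def write_lines(lines, target):
    # Hypothetical stand-in for a writer taking a filename or a handle.
    # A filename is opened in "w" mode, which truncates any existing file;
    # an open handle is simply written to at its current position.
    if isinstance(target, str):
        with open(target, "w") as handle:
            return write_lines(lines, handle)
    count = 0
    for line in lines:
        target.write(line + "\n")
        count += 1
    return count


path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Two calls with a filename: the second call replaces the first output.
write_lines([">a", "ACGT"], path)
write_lines([">b", "GGCC"], path)
with open(path) as handle:
    by_filename = handle.read()

# Two calls re-using one handle: the second output follows the first.
buffer = io.StringIO()
write_lines([">a", "ACGT"], buffer)
write_lines([">b", "GGCC"], buffer)
by_handle = buffer.getvalue()
```

Here by_filename holds only the second record, while by_handle holds both
records one after the other.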


Conversion
----------
The Bio.SeqIO.convert(...) function provides an easy interface for simple
file format conversions. Additionally, it may use file format specific
optimisations so this should be the fastest way too.

In general however, you can combine the Bio.SeqIO.parse(...) function with
the Bio.SeqIO.write(...) function for sequence file conversion. Using
generator expressions or generator functions provides a memory efficient way
to perform filtering or other extra operations as part of the process.
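
The parse, filter, write pattern can be sketched without Biopython to show
why it is memory efficient. The functions parse_fasta and write_tab below
are hypothetical simplified stand-ins (records are plain tuples, and empty
">" header lines are not handled); with Bio.SeqIO you would pass a generator
expression over parse(...) to write(...) in the same way.

```python
import io


def parse_fasta(handle):
    # Generator function: yield (identifier, sequence) pairs one at a time.
    name, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def write_tab(records, handle):
    # Write one "identifier<TAB>sequence" line per record; return the count.
    count = 0
    for name, sequence in records:
        handle.write("%s\t%s\n" % (name, sequence))
        count += 1
    return count


source = io.StringIO(">short\nACG\n>long one\nACGTACGT\nAC\n")
output = io.StringIO()
# The generator expression filters lazily, so only one record is held in
# memory at any time during the conversion.
long_records = (rec for rec in parse_fasta(source) if len(rec[1]) >= 5)
written = write_tab(long_records, output)
```

Nothing is read from the input until write_tab starts consuming the
generator, and records that fail the length filter are never written.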


File Formats
------------
When specifying the file format, use lowercase strings. The same format
names are also used in Bio.AlignIO and include the following:

    - abi - Applied Biosystems' sequencing trace format
    - ace - Reads the contig sequences from an ACE assembly file.
    - embl - The EMBL flat file format. Uses Bio.GenBank internally.
    - fasta - The generic sequence file format where each record starts with
      an identifier line starting with a ">" character, followed by
      lines of sequence.
    - fastq - A "FASTA like" format used by Sanger which also stores PHRED
      sequence quality values (with an ASCII offset of 33).
    - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS
    - fastq-solexa - Original Solexa/Illumina variant of the FASTQ format which
      encodes Solexa quality scores (not PHRED quality scores) with an
      ASCII offset of 64.
    - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format
      which encodes PHRED quality scores with an ASCII offset of 64
      (not 33). Note that as of version 1.8 of the CASAVA pipeline, Illumina
      will produce FASTQ files using the standard Sanger encoding.
    - genbank - The GenBank or GenPept flat file format.
    - gb - An alias for "genbank", for consistency with NCBI Entrez Utilities
    - ig - The IntelliGenetics file format, apparently the same as the
      MASE alignment format.
    - imgt - An EMBL like format from IMGT where the feature tables are more
      indented to allow for longer feature types.
    - phd - Output from PHRED, used by PHRAP and CONSED for input.
    - pir - A "FASTA like" format introduced by the National Biomedical
      Research Foundation (NBRF) for the Protein Information Resource
      (PIR) database, now part of UniProt.
    - seqxml - SeqXML, simple XML format described in Schmitt et al (2011).
    - sff - Standard Flowgram Format (SFF), typical output from Roche 454.
    - sff-trim - Standard Flowgram Format (SFF) with given trimming applied.
    - swiss - Plain text Swiss-Prot aka UniProt format.
    - tab - Simple two column tab separated sequence files, where each
      line holds a record's identifier and sequence. For example,
      this is used by Agilent's eArray software when saving
      microarray probes in a minimal tab delimited text file.
    - qual - A "FASTA like" format holding PHRED quality values from
      sequencing DNA, but no actual sequences (usually provided
      in separate FASTA files).
    - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain
      text format which we call "swiss")

Note that while Bio.SeqIO can read all the above file formats, it cannot
write to all of them.

You can also use any file format supported by Bio.AlignIO, such as "nexus",
"phylip" and "stockholm", which gives you access to the individual sequences
making up each alignment as SeqRecords.
"""

from __future__ import print_function
from Bio._py3k import basestring

__docformat__ = "restructuredtext en"  # not just plaintext

# TODO
# - define policy on reading aligned sequences with gaps in
#   (e.g. - and . characters) including how the alphabet interacts
#
# - How best to handle unique/non unique record.id when writing.
#   For most file formats reading such files is fine; The stockholm
#   parser would fail.
#
# - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf)
#   http://www.bioperl.org/wiki/MSF_multiple_alignment_format

"""
FAO Biopython Developers
------------------------
The way I envision this SeqIO system working is that for any sequence file
format we have an iterator that returns SeqRecord objects.

This also applies to interlaced file formats (like clustal - although that
is now handled via Bio.AlignIO instead) where the file cannot be read record
by record. You should still return an iterator, even if the implementation
could just as easily return a list.

These file format specific sequence iterators may be implemented as:

    - Classes which take a handle for __init__ and provide the __iter__ method
    - Functions that take a handle, and return an iterator object
    - Generator functions that take a handle, and yield SeqRecord objects
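
The three styles listed above can be sketched as follows. This is a toy
illustration only: the names LineRecordIterator, line_record_iterator and
line_record_generator are hypothetical, and each non-blank line is treated
as one "record" (a plain string standing in for a SeqRecord).

```python
import io


class LineRecordIterator:
    # Style 1: a class that takes a handle in __init__ and provides __iter__.
    def __init__(self, handle):
        self.handle = handle

    def __iter__(self):
        for line in self.handle:
            if line.strip():
                yield line.strip()


def line_record_iterator(handle):
    # Style 2: a function that takes a handle and returns an iterator object.
    return iter(LineRecordIterator(handle))


def line_record_generator(handle):
    # Style 3: a generator function that takes a handle and yields records.
    for line in handle:
        if line.strip():
            yield line.strip()


EXAMPLE = '''first

second
'''

as_class = list(LineRecordIterator(io.StringIO(EXAMPLE)))
as_function = list(line_record_iterator(io.StringIO(EXAMPLE)))
as_generator = list(line_record_generator(io.StringIO(EXAMPLE)))
```

All three give the caller the same thing to loop over, so the calling code
does not need to know which style a particular parser uses.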

It is then trivial to turn this iterator into a list of SeqRecord objects,
an in-memory dictionary, or a multiple sequence alignment object.

For building the dictionary by default the id property of each SeqRecord is
used as the key. You should always populate the id property, and it should
be unique in most cases. For some file formats the accession number is a good
choice. If the file itself contains ambiguous identifiers, don't try to
disambiguate them - return them as is.

When adding a new file format, please use the same lower case format name
as BioPerl, or if they have not defined one, try the names used by EMBOSS.

See also http://biopython.org/wiki/SeqIO_dev

--Peter
"""


from Bio.File import as_handle
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet

from . import AbiIO
from . import AceIO
from . import FastaIO
from . import IgIO  # IntelliGenetics or MASE format
from . import InsdcIO  # EMBL and GenBank
from . import PdbIO
from . import PhdIO
from . import PirIO
from . import SeqXmlIO
from . import SffIO
from . import SwissIO
from . import TabIO
from . import QualityIO  # FastQ and qual files
from . import UniprotIO


# Convention for format names is "mainname-subtype" in lower case.
# Please use the same names as BioPerl or EMBOSS where possible.
#
# Note that this simple system copes with defining
# multiple possible iterators for a given format/extension
# with the -subtype suffix
#
# Most alignment file formats will be handled via Bio.AlignIO

_FormatToIterator = {"fasta": FastaIO.FastaIterator,
                     "gb": InsdcIO.GenBankIterator,
                     "genbank": InsdcIO.GenBankIterator,
                     "genbank-cds": InsdcIO.GenBankCdsFeatureIterator,
                     "embl": InsdcIO.EmblIterator,
                     "embl-cds": InsdcIO.EmblCdsFeatureIterator,
                     "imgt": InsdcIO.ImgtIterator,
                     "ig": IgIO.IgIterator,
                     "swiss": SwissIO.SwissIterator,
                     "pdb-atom": PdbIO.PdbAtomIterator,
                     "pdb-seqres": PdbIO.PdbSeqresIterator,
                     "phd": PhdIO.PhdIterator,
                     "ace": AceIO.AceIterator,
                     "tab": TabIO.TabIterator,
                     "pir": PirIO.PirIterator,
                     "fastq": QualityIO.FastqPhredIterator,
                     "fastq-sanger": QualityIO.FastqPhredIterator,
                     "fastq-solexa": QualityIO.FastqSolexaIterator,
                     "fastq-illumina": QualityIO.FastqIlluminaIterator,
                     "qual": QualityIO.QualPhredIterator,
                     "sff": SffIO.SffIterator,
                     # Not sure about this in the long run:
                     "sff-trim": SffIO._SffTrimIterator,
                     "uniprot-xml": UniprotIO.UniprotIterator,
                     "seqxml": SeqXmlIO.SeqXmlIterator,
                     "abi": AbiIO.AbiIterator,
                     "abi-trim": AbiIO._AbiTrimIterator,
                     }

_FormatToWriter = {"fasta": FastaIO.FastaWriter,
                   "gb": InsdcIO.GenBankWriter,
                   "genbank": InsdcIO.GenBankWriter,
                   "embl": InsdcIO.EmblWriter,
                   "imgt": InsdcIO.ImgtWriter,
                   "tab": TabIO.TabWriter,
                   "fastq": QualityIO.FastqPhredWriter,
                   "fastq-sanger": QualityIO.FastqPhredWriter,
                   "fastq-solexa": QualityIO.FastqSolexaWriter,
                   "fastq-illumina": QualityIO.FastqIlluminaWriter,
                   "phd": PhdIO.PhdWriter,
                   "qual": QualityIO.QualPhredWriter,
                   "sff": SffIO.SffWriter,
                   "seqxml": SeqXmlIO.SeqXmlWriter,
                   }

_BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim"]


def write(sequences, handle, format):
    """Write complete set of sequences to a file.

    - sequences - A list (or iterator) of SeqRecord objects, or (if using
      Biopython 1.54 or later) a single SeqRecord.
    - handle - File handle object to write to, or filename as string
      (note older versions of Biopython only took a handle).
    - format - lower case string describing the file format to write.

    If you pass a handle rather than a filename, you should close it
    yourself after calling this function.

    Returns the number of records written (as an integer).
    """
    from Bio import AlignIO

    # Try and give helpful error messages:
    if not isinstance(format, basestring):
        raise TypeError("Need a string for the file format (lower case)")
    if not format:
        raise ValueError("Format required (lower case string)")
    if format != format.lower():
        raise ValueError("Format string '%s' should be lower case" % format)

    if isinstance(sequences, SeqRecord):
        # This raised an exception in older versions of Biopython
        sequences = [sequences]

    if format in _BinaryFormats:
        mode = 'wb'
    else:
        mode = 'w'

    with as_handle(handle, mode) as fp:
        # Map the file format to a writer class
        if format in _FormatToWriter:
            writer_class = _FormatToWriter[format]
            count = writer_class(fp).write_file(sequences)
        elif format in AlignIO._FormatToWriter:
            # Try and turn all the records into a single alignment,
            # and write that using Bio.AlignIO
            alignment = MultipleSeqAlignment(sequences)
            alignment_count = AlignIO.write([alignment], fp, format)
            assert alignment_count == 1, \
                "Internal error - the underlying writer " \
                "should have returned 1, not %s" % repr(alignment_count)
            count = len(alignment)
            del alignment_count, alignment
        elif format in _FormatToIterator or format in AlignIO._FormatToIterator:
            raise ValueError("Reading format '%s' is supported, but not writing"
                             % format)
        else:
            raise ValueError("Unknown format '%s'" % format)

        assert isinstance(count, int), "Internal error - the underlying %s " \
            "writer should have returned the record count, not %s" \
            % (format, repr(count))

    return count


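# Illustration only, not part of the Bio.SeqIO API: a minimal, standard-library
# sketch of the lookup pattern write() uses above, where a lower case format
# name selects a writer and read-only or unknown names raise ValueError. The
# names WRITERS, READ_ONLY_FORMATS, write_records and _write_fasta are
# hypothetical stand-ins, and records are plain (name, sequence) tuples rather
# than SeqRecord objects.

```python
import io


def _write_fasta(records, handle):
    """Write (name, sequence) pairs as FASTA and return the record count."""
    count = 0
    for name, seq in records:
        handle.write(">%s\n%s\n" % (name, seq))
        count += 1
    return count


# Hypothetical tables standing in for the real format dictionaries.
WRITERS = {"fasta": _write_fasta}
READ_ONLY_FORMATS = {"abi"}  # assumed readable but not writable in this sketch


def write_records(records, handle, fmt):
    """Dispatch on a lower case format name and return the record count."""
    if not isinstance(fmt, str):
        raise TypeError("Need a string for the file format (lower case)")
    if not fmt:
        raise ValueError("Format required (lower case string)")
    if fmt != fmt.lower():
        raise ValueError("Format string '%s' should be lower case" % fmt)
    if fmt in WRITERS:
        return WRITERS[fmt](records, handle)
    if fmt in READ_ONLY_FORMATS:
        raise ValueError("Reading format '%s' is supported, but not writing"
                         % fmt)
    raise ValueError("Unknown format '%s'" % fmt)


buffer = io.StringIO()
written = write_records([("Alpha", "ACCGGATGTA"), ("Beta", "AGGCTCGGTTA")],
                        buffer, "fasta")
```

# Keeping the "readable but not writable" case separate from the "unknown"
# case lets the error message tell the caller which of the two went wrong.

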
def parse(handle, format, alphabet=None):
    r"""Turns a sequence file into an iterator returning SeqRecords.

    - handle - handle to the file, or the filename as a string
      (note older versions of Biopython only took a handle).
    - format - lower case string describing the file format.
    - alphabet - optional Alphabet object, useful when the sequence type
      cannot be automatically inferred from the file itself
      (e.g. format="fasta" or "tab")

    Typical usage, opening a file to read in, and looping over the record(s):

    >>> from Bio import SeqIO
    >>> filename = "Fasta/sweetpea.nu"
    >>> for record in SeqIO.parse(filename, "fasta"):
    ...     print("ID %s" % record.id)
    ...     print("Sequence length %i" % len(record))
    ...     print("Sequence alphabet %s" % record.seq.alphabet)
    ID gi|3176602|gb|U78617.1|LOU78617
    Sequence length 309
    Sequence alphabet SingleLetterAlphabet()

    For file formats like FASTA where the alphabet cannot be determined, it
    may be useful to specify the alphabet explicitly:

    >>> from Bio import SeqIO
    >>> from Bio.Alphabet import generic_dna
    >>> filename = "Fasta/sweetpea.nu"
    >>> for record in SeqIO.parse(filename, "fasta", generic_dna):
    ...     print("ID %s" % record.id)
    ...     print("Sequence length %i" % len(record))
    ...     print("Sequence alphabet %s" % record.seq.alphabet)
    ID gi|3176602|gb|U78617.1|LOU78617
    Sequence length 309
    Sequence alphabet DNAAlphabet()

    If you have a string 'data' containing the file contents, you must
    first turn this into a handle in order to parse it:

    >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n"
    >>> from Bio import SeqIO
    >>> try:
    ...     from StringIO import StringIO  # Python 2
    ... except ImportError:
    ...     from io import StringIO  # Python 3
    ...
    >>> for record in SeqIO.parse(StringIO(data), "fasta"):
    ...     print("%s %s" % (record.id, record.seq))
    Alpha ACCGGATGTA
    Beta AGGCTCGGTTA

    Use the Bio.SeqIO.read(...) function when you expect a single record
    only.
    """
    # NOTE - The above docstring has some raw \n characters needed
    # for the StringIO example, hence the whole docstring is in raw
    # string mode (see the leading r before the opening quote).
    from Bio import AlignIO

    # Hack for SFF, will need to make this more general in future
    if format in _BinaryFormats:
        mode = 'rb'
    else:
        mode = 'rU'

    # Try and give helpful error messages:
    if not isinstance(format, basestring):
        raise TypeError("Need a string for the file format (lower case)")
    if not format:
        raise ValueError("Format required (lower case string)")
    if format != format.lower():
        raise ValueError("Format string '%s' should be lower case" % format)
    if alphabet is not None and not (isinstance(alphabet, Alphabet) or
                                     isinstance(alphabet, AlphabetEncoder)):
        raise ValueError("Invalid alphabet, %s" % repr(alphabet))

    with as_handle(handle, mode) as fp:
        # Map the file format to a sequence iterator:
        if format in _FormatToIterator:
            iterator_generator = _FormatToIterator[format]
            if alphabet is None:
                i = iterator_generator(fp)
            else:
                try:
                    i = iterator_generator(fp, alphabet=alphabet)
                except TypeError:
                    i = _force_alphabet(iterator_generator(fp), alphabet)
        elif format in AlignIO._FormatToIterator:
            # Use Bio.AlignIO to read in the alignments
            i = (r for alignment in AlignIO.parse(fp, format,
                                                  alphabet=alphabet)
                 for r in alignment)
        else:
            raise ValueError("Unknown format '%s'" % format)
        # This imposes some overhead... wait until we drop Python 2.4 to fix it
        for r in i:
            yield r


def _force_alphabet(record_iterator, alphabet):
    """Iterate over records, over-riding the alphabet (PRIVATE)."""
    # Assume the alphabet argument has been pre-validated
    given_base_class = _get_base_alphabet(alphabet).__class__
    for record in record_iterator:
        if isinstance(_get_base_alphabet(record.seq.alphabet),
                      given_base_class):
            record.seq.alphabet = alphabet
            yield record
        else:
            raise ValueError("Specified alphabet %s clashes with "
                             "that determined from the file, %s"
                             % (repr(alphabet), repr(record.seq.alphabet)))


def read(handle, format, alphabet=None):
    """Turns a sequence file into a single SeqRecord.

    - handle - handle to the file, or the filename as a string
      (note older versions of Biopython only took a handle).
    - format - lower case string describing the file format.
    - alphabet - optional Alphabet object, useful when the sequence type
      cannot be automatically inferred from the file itself
      (e.g. format="fasta" or "tab")

    This function is for use parsing sequence files containing
    exactly one record. For example, reading a GenBank file:

    >>> from Bio import SeqIO
    >>> record = SeqIO.read("GenBank/arab1.gb", "genbank")
    >>> print("ID %s" % record.id)
    ID AC007323.5
    >>> print("Sequence length %i" % len(record))
    Sequence length 86436
    >>> print("Sequence alphabet %s" % record.seq.alphabet)
    Sequence alphabet IUPACAmbiguousDNA()

    If the handle contains no records, or more than one record,
    a ValueError is raised. For example:

    >>> from Bio import SeqIO
    >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank")
    Traceback (most recent call last):
        ...
    ValueError: More than one record found in handle

    If you want only the first record from a file containing multiple
    records, this function will raise an exception (as shown in the
    example above). Instead use:

    >>> from Bio import SeqIO
    >>> record = next(SeqIO.parse("GenBank/cor6_6.gb", "genbank"))
    >>> print("First record's ID %s" % record.id)
    First record's ID X55053.1

    Use the Bio.SeqIO.parse(handle, format) function if you want
    to read multiple records from the handle.
    """
    iterator = parse(handle, format, alphabet)
    try:
        first = next(iterator)
    except StopIteration:
        first = None
    if first is None:
        raise ValueError("No records found in handle")
    try:
        second = next(iterator)
    except StopIteration:
        second = None
    if second is not None:
        raise ValueError("More than one record found in handle")
    return first


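# Illustration only, not part of the Bio.SeqIO API: the "exactly one record"
# check in read() above works for any iterator, so it can be sketched with the
# standard library alone. The helper names exactly_one and first_of are
# hypothetical. This variant uses a private sentinel instead of None, an
# assumption made so that an iterator which legitimately yields None is still
# handled correctly.

```python
def exactly_one(iterable):
    """Return the only item from iterable, otherwise raise ValueError."""
    iterator = iter(iterable)
    sentinel = object()  # unique marker meaning "the iterator was exhausted"
    first = next(iterator, sentinel)
    if first is sentinel:
        raise ValueError("No records found in handle")
    if next(iterator, sentinel) is not sentinel:
        raise ValueError("More than one record found in handle")
    return first


def first_of(iterable):
    """Return the first item only, ignoring any that follow."""
    return next(iter(iterable))
```

# exactly_one(["a"]) returns "a", while an empty or two-item list raises
# ValueError; first_of(["a", "b"]) returns "a" without consuming "b".

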
def to_dict(sequences, key_function=None):
    """Turns a sequence iterator or list into a dictionary.

    - sequences - An iterator that returns SeqRecord objects,
      or simply a list of SeqRecord objects.
    - key_function - Optional callback function which when given a
      SeqRecord should return a unique key for the dictionary.

    e.g. key_function = lambda rec : rec.name
    or,  key_function = lambda rec : rec.description.split()[0]

    If key_function is omitted then record.id is used, on the assumption
    that the record objects returned are SeqRecords with a unique id.

    If there are duplicate keys, a ValueError is raised.

    Example usage, defaulting to using the record.id as key:

    >>> from Bio import SeqIO
    >>> filename = "GenBank/cor6_6.gb"
    >>> format = "genbank"
    >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format))
    >>> print(sorted(id_dict))
    ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1']
    >>> print(id_dict["L31939.1"].description)
    Brassica rapa (clone bif72) kin mRNA, complete cds.

    A more complex example, using the key_function argument in order to
    use a sequence checksum as the dictionary key:

    >>> from Bio import SeqIO
    >>> from Bio.SeqUtils.CheckSum import seguid
    >>> filename = "GenBank/cor6_6.gb"
    >>> format = "genbank"
    >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format),
    ...               key_function = lambda rec : seguid(rec.seq))
    >>> for key, record in sorted(seguid_dict.items()):
    ...     print("%s %s" % (key, record.id))
    /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1
    BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1
    SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1
    TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1
    l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1
    uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1

    This approach is not suitable for very large sets of sequences, as all
    the SeqRecord objects are held in memory. Instead, consider using the
    Bio.SeqIO.index() function (if it supports your particular file format).
    """
    if key_function is None:
        key_function = lambda rec: rec.id

    d = dict()
    for record in sequences:
        key = key_function(record)
        if key in d:
            raise ValueError("Duplicate key '%s'" % key)
        d[key] = record
    return d


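# Illustration only, not the actual Bio.SeqIO.index() implementation: the
# docstring of index() below describes scanning the file once to note where
# each record starts, then jumping to that location on access. A minimal
# standard-library sketch of that idea for FASTA-like data follows. The names
# build_offset_index and fetch_record are hypothetical, records are returned
# as plain (id, sequence) tuples rather than SeqRecord objects, and every
# header line is assumed to have an identifier after the ">".

```python
import io


def build_offset_index(handle):
    """Map each record id to the byte offset of its '>' header line."""
    offsets = {}
    while True:
        position = handle.tell()  # offset of the line about to be read
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            key = line[1:].split(None, 1)[0].decode("ascii")
            if key in offsets:
                raise ValueError("Duplicate key '%s'" % key)
            offsets[key] = position
    return offsets


def fetch_record(handle, offsets, key):
    """Seek to one record and parse only that record."""
    handle.seek(offsets[key])
    header = handle.readline()
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the next record
        chunks.append(line.strip())
    name = header[1:].split(None, 1)[0].decode("ascii")
    return name, b"".join(chunks).decode("ascii")


data = io.BytesIO(b">Alpha\nACCGG\nATGTA\n>Beta\nAGGCTCGGTTA\n")
offsets = build_offset_index(data)
beta = fetch_record(data, offsets, "Beta")
```

# Only the offsets are kept in memory, so the cost per record is one
# dictionary entry rather than a fully parsed record. The handle is opened in
# binary mode so that tell() and seek() work on plain byte positions.

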
9bb8590 Peter Cock Auto detect BGZF in SeqIO index functions
peterjc authored
730 def index(filename, format, alphabet=None, key_function=None):
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
731 """Indexes a sequence file and returns a dictionary like object.
732
bc094b8 Travis Wrightsman finalized restructured text format
twrightsman authored
733 - filename - string giving name of file to be indexed
734 - format - lower case string describing the file format
735 - alphabet - optional Alphabet object, useful when the sequence type
736 cannot be automatically inferred from the file itself
737 (e.g. format="fasta" or "tab")
738 - key_function - Optional callback function which when given a
739 SeqRecord identifier string should return a unique
740 key for the dictionary.
1a082f9 Connor McCoy Trim trailing space
cmccoy authored
741
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
742 This indexing function will return a dictionary like object, giving the
743 SeqRecord objects as values:
744
745 >>> from Bio import SeqIO
7fd8672 Peter Cock Renaming new function Bio.SeqIO.indexed_dict() to just Bio.SeqIO.index()...
peterjc authored
746 >>> records = SeqIO.index("Quality/example.fastq", "fastq")
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
747 >>> len(records)
748 3
802134e Peter Cock sorted(mydict.keys()) -> sorted(mydict)
peterjc authored
749 >>> sorted(records)
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
750 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792']
b45cc28 Peter Cock Largely automated print function style in the doctests.
peterjc authored
751 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta"))
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
752 >EAS54_6_R1_2_1_540_792
753 TTGGCAGGCCAAGGCCGATGGATCA
754 <BLANKLINE>
755 >>> "EAS54_6_R1_2_1_540_792" in records
756 True
b45cc28 Peter Cock Largely automated print function style in the doctests.
peterjc authored
757 >>> print(records.get("Missing", None))
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
758 None
883504e Peter Cock Close index handles in SeqIO doctests
peterjc authored
759 >>> records.close()
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
760
9bb8590 Peter Cock Auto detect BGZF in SeqIO index functions
peterjc authored
761 If the file is BGZF compressed, this is detected automatically. Ordinary
762 GZIP files are not supported:
8afddca Peter Cock Indexing BGZF (and GZIP) compressed files in Bio.SeqIO
peterjc authored
763
764 >>> from Bio import SeqIO
9bb8590 Peter Cock Auto detect BGZF in SeqIO index functions
peterjc authored
765 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq")
8afddca Peter Cock Indexing BGZF (and GZIP) compressed files in Bio.SeqIO
peterjc authored
766 >>> len(records)
767 3
b45cc28 Peter Cock Largely automated print function style in the doctests.
peterjc authored
768 >>> print(records["EAS54_6_R1_2_1_540_792"].seq)
8afddca Peter Cock Indexing BGZF (and GZIP) compressed files in Bio.SeqIO
peterjc authored
769 TTGGCAGGCCAAGGCCGATGGATCA
883504e Peter Cock Close index handles in SeqIO doctests
peterjc authored
770 >>> records.close()
8afddca Peter Cock Indexing BGZF (and GZIP) compressed files in Bio.SeqIO
peterjc authored
771
db35871 Christian Brueffer Minor documentation fixes.
cbrueffer authored
772 Note that this pseudo dictionary will not support all the methods of a
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
773 true Python dictionary, for example values() is not defined since this
774 would require loading all of the records into memory at once.
775
776 When you call the index function, it will scan through the file, noting
777 the location of each record. When you access a particular record via the
778 dictionary methods, the code will jump to the appropriate part of the
779 file and then parse that section into a SeqRecord.
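The scan-then-seek idea described above can be illustrated with a small stand-alone sketch. This is not the Bio.SeqIO implementation: the helper names build_offset_index and fetch_raw are hypothetical, only plain FASTA is handled, and the record is returned as raw text rather than parsed into a SeqRecord.

```python
def build_offset_index(path):
    """Scan a FASTA file once, mapping each record id to its byte offset."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            start = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                words = line[1:].split(None, 1)
                if words:
                    # The id is the first word after the '>' character
                    offsets[words[0].decode("ascii")] = start
    return offsets


def fetch_raw(path, offsets, key):
    """Jump straight to one record and return its raw text."""
    with open(path, "rb") as handle:
        handle.seek(offsets[key])
        lines = [handle.readline()]  # the '>' title line
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):
                break  # start of the next record
            lines.append(line)
    return b"".join(lines).decode("ascii")
```

Only the small dictionary of offsets is held in memory, which is why this approach suits large files and why methods that would need every record at once are not offered.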
780
781 Note that not all the input formats supported by Bio.SeqIO can be used
782 with this index function. It is designed to work only with sequential
783 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any
784 interlaced file format (e.g. alignment formats such as "clustal").
785
786 For small files, it may be more efficient to use an in memory Python
787 dictionary, e.g.
788
789 >>> from Bio import SeqIO
883504e Peter Cock Close index handles in SeqIO doctests
peterjc authored
790 >>> records = SeqIO.to_dict(SeqIO.parse("Quality/example.fastq", "fastq"))
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
791 >>> len(records)
792 3
802134e Peter Cock sorted(mydict.keys()) -> sorted(mydict)
peterjc authored
793 >>> sorted(records)
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
794 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792']
b45cc28 Peter Cock Largely automated print function style in the doctests.
peterjc authored
795 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta"))
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
796 >EAS54_6_R1_2_1_540_792
797 TTGGCAGGCCAAGGCCGATGGATCA
798 <BLANKLINE>
799
77301f7 Peter Cock Fixed a typo and tweaked docstring line wrapping
peterjc authored
800 As with the to_dict() function, by default the id string of each record
801 is used as the key. You can specify a callback function to transform
32a20f4 Christian Brueffer More typo and duplicate word fixes.
cbrueffer authored
802 this (the record identifier string) into your preferred key. For example:
c879a25 Peter Cock Basic key_function in Bio.SeqIO.indexed_dict() as per mailing list discu...
peterjc authored
803
804 >>> from Bio import SeqIO
7fedbbd Peter Cock No code changes. Removing white space before ':' character in SeqIO and ...
peterjc authored
805 >>> def make_tuple(identifier):
c879a25 Peter Cock Basic key_function in Bio.SeqIO.indexed_dict() as per mailing list discu...
peterjc authored
806 ... parts = identifier.split("_")
807 ... return int(parts[-2]), int(parts[-1])
7fd8672 Peter Cock Renaming new function Bio.SeqIO.indexed_dict() to just Bio.SeqIO.index()...
peterjc authored
808 >>> records = SeqIO.index("Quality/example.fastq", "fastq",
809 ... key_function=make_tuple)
c879a25 Peter Cock Basic key_function in Bio.SeqIO.indexed_dict() as per mailing list discu...
peterjc authored
810 >>> len(records)
811 3
802134e Peter Cock sorted(mydict.keys()) -> sorted(mydict)
peterjc authored
812 >>> sorted(records)
c879a25 Peter Cock Basic key_function in Bio.SeqIO.indexed_dict() as per mailing list discu...
peterjc authored
813 [(413, 324), (443, 348), (540, 792)]
b45cc28 Peter Cock Largely automated print function style in the doctests.
peterjc authored
814 >>> print(records[(540, 792)].format("fasta"))
c879a25 Peter Cock Basic key_function in Bio.SeqIO.indexed_dict() as per mailing list discu...
peterjc authored
815 >EAS54_6_R1_2_1_540_792
816 TTGGCAGGCCAAGGCCGATGGATCA
817 <BLANKLINE>
818 >>> (540, 792) in records
819 True
820 >>> "EAS54_6_R1_2_1_540_792" in records
821 False
b45cc28 Peter Cock Largely automated print function style in the doctests.
peterjc authored
822 >>> print(records.get("Missing", None))
c879a25 Peter Cock Basic key_function in Bio.SeqIO.indexed_dict() as per mailing list discu...
peterjc authored
823 None
883504e Peter Cock Close index handles in SeqIO doctests
peterjc authored
824 >>> records.close()
c879a25 Peter Cock Basic key_function in Bio.SeqIO.indexed_dict() as per mailing list discu...
peterjc authored
825
77301f7 Peter Cock Fixed a typo and tweaked docstring line wrapping
peterjc authored
826 Another common use case would be indexing an NCBI style FASTA file,
827 where you might want to extract the GI number from the FASTA identifier
828 to use as the dictionary key.
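A minimal sketch of such a key function, assuming identifiers of the form 'gi|7525076|ref|NP_051101.1|'. The function only sees the identifier string, never the SeqRecord, and the filename in the trailing comment is hypothetical.

```python
def get_gi(identifier):
    """Return the GI number from an NCBI style FASTA identifier."""
    parts = identifier.split("|")
    # list.index raises ValueError if there is no "gi" field,
    # so an identifier in another style fails loudly here.
    return parts[parts.index("gi") + 1]


# Possible usage (hypothetical filename):
# records = SeqIO.index("proteins.faa", "fasta", key_function=get_gi)
```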
c879a25 Peter Cock Basic key_function in Bio.SeqIO.indexed_dict() as per mailing list discu...
peterjc authored
829
77301f7 Peter Cock Fixed a typo and tweaked docstring line wrapping
peterjc authored
830 Notice that unlike the to_dict() function, here the key_function does
831 not get given the full SeqRecord to use to generate the key. Doing so
832 would impose a severe performance penalty as it would require the file
833 to be completely parsed while building the index. Currently a full
834 parse is avoided for most file formats.
1a082f9 Connor McCoy Trim trailing space
cmccoy authored
835
391b511 Peter Cock Rename Bio.SeqIO.index_many() to index_db() as discussed on mailing list
peterjc authored
836 See also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict()
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
837 """
7a82dba Carlos Peña PEP8 fixes E265
carlosp420 authored
838 # Try and give helpful error messages:
7fedbbd Peter Cock No code changes. Removing white space before ':' character in SeqIO and ...
peterjc authored
839 if not isinstance(filename, basestring):
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
840 raise TypeError("Need a filename (not a handle)")
7fedbbd Peter Cock No code changes. Removing white space before ':' character in SeqIO and ...
peterjc authored
841 if not isinstance(format, basestring):
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
842 raise TypeError("Need a string for the file format (lower case)")
7fedbbd Peter Cock No code changes. Removing white space before ':' character in SeqIO and ...
peterjc authored
843 if not format:
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
844 raise ValueError("Format required (lower case string)")
7fedbbd Peter Cock No code changes. Removing white space before ':' character in SeqIO and ...
peterjc authored
845 if format != format.lower():
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
846 raise ValueError("Format string '%s' should be lower case" % format)
516bfe5 Peter Cock PEP8 style using autopep8
peterjc authored
847 if alphabet is not None and not (isinstance(alphabet, Alphabet) or
7fedbbd Peter Cock No code changes. Removing white space before ':' character in SeqIO and ...
peterjc authored
848 isinstance(alphabet, AlphabetEncoder)):
8af957d Peter Cock Adding the Bio.SeqIO.indexed_dict() function developed on a github branc...
peterjc authored
849 raise ValueError("Invalid alphabet, %r" % (alphabet,))
850
7a82dba Carlos Peña PEP8 fixes E265
carlosp420 authored
851 # Map the file format to a sequence iterator:
d18aab8 Peter Cock autopep8 E261 - Fix spacing after comment hash
peterjc authored
852 from ._index import _FormatToRandomAccess # Lazy import
85c7109 Peter Cock Small refactoring of Bio.SeqIO index imports
peterjc authored
853 from Bio.File import _IndexedSeqFileDict
a7cd981 Peter Cock Move core Bio.SeqIO index code to Bio.File
peterjc authored
854 try:
85c7109 Peter Cock Small refactoring of Bio.SeqIO index imports
peterjc authored
855 proxy_class = _FormatToRandomAccess[format]
a7cd981 Peter Cock Move core Bio.SeqIO index code to Bio.File
peterjc authored
856 except KeyError:
857 raise ValueError("Unsupported format %r" % format)
f9cc32f Peter Cock Use Bio.File._IndexedSeqFileDict in Bio.SearchIO.index(...)
peterjc authored
858 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \
859 % (filename, format, alphabet, key_function)
f7e072f Peter Cock Refactor Bio.SeqIO.index(...) internals
peterjc authored
860 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet),
f9cc32f Peter Cock Use Bio.File._IndexedSeqFileDict in Bio.SearchIO.index(...)
peterjc authored
861 key_function, repr, "SeqRecord")
ded6f5e Peter Cock Moving alignment consensus code into Bio.Alphabet, currently as some pri...
peterjc authored
862
516bfe5 Peter Cock PEP8 style using autopep8
peterjc authored
863
391b511 Peter Cock Rename Bio.SeqIO.index_many() to index_db() as discussed on mailing list
peterjc authored
864 def index_db(index_filename, filenames=None, format=None, alphabet=None,
516bfe5 Peter Cock PEP8 style using autopep8
peterjc authored
865 key_function=None):
adb28a3 Peter Cock Adding Bio.SeqIO.index_many(...) function using SQLite3
peterjc authored
866 """Index several sequence files and return a dictionary like object.
867
868 The index is stored in an SQLite database rather than in memory (as in the
869 Bio.SeqIO.index(...) function).
1a082f9 Connor McCoy Trim trailing space
cmccoy authored
870
bc094b8 Travis Wrightsman finalized restructured text format
twrightsman authored
871 - index_filename - Where to store the SQLite index
872 - filenames - list of strings specifying file(s) to be indexed, or when
873 indexing a single file this can be given as a string.
874 (optional if reloading an existing index, but must match)
875 - format - lower case string describing the file format
876 (optional if reloading an existing index, but must match)
877 - alphabet - optional Alphabet object, useful when the sequence type
878 cannot be automatically inferred from the file itself
879 (e.g. format="fasta" or "tab")
880 - key_function - Optional callback function which when given a
881 SeqRecord identifier string should return a unique
882 key for the dictionary.
1a082f9 Connor McCoy Trim trailing space
cmccoy authored
883
adb28a3 Peter Cock Adding Bio.SeqIO.index_many(...) function using SQLite3
peterjc authored
884 This indexing function will return a dictionary like object, giving the
885 SeqRecord objects as values:
886
887 >>> from Bio.Alphabet import generic_protein
888 >>> from Bio import SeqIO
889 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"]
890 >>> def get_gi(name):
891 ... parts = name.split("|")
892 ... i = parts.index("gi")
893 ... assert i != -1
894 ... return parts[i+1]
895 >>> idx_name = ":memory:"  # use an in-memory SQLite DB for this test
391b511 Peter Cock Rename Bio.SeqIO.index_many() to index_db() as discussed on mailing list
peterjc authored
896 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi)
adb28a3 Peter Cock Adding Bio.SeqIO.index_many(...) function using SQLite3
peterjc authored
897 >>> len(records)
898 95
899 >>> records["7525076"].description
900 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]'
901 >>> records["45478717"].description
902 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]'
883504e Peter Cock Close index handles in SeqIO doctests
peterjc authored
903 >>> records.close()
adb28a3 Peter Cock Adding Bio.SeqIO.index_many(...) function using SQLite3
peterjc authored
904
905 In this example the two files contain 85 and 10 records respectively.
1a082f9 Connor McCoy Trim trailing space
cmccoy authored
906
6c1db87 Peter Cock Minor docstring clarification
peterjc authored
907 BGZF compressed files are supported, and detected automatically. Ordinary
908 GZIP compressed files are not supported.
9bb8590 Peter Cock Auto detect BGZF in SeqIO index functions
peterjc authored
909
d3499c1 Peter Cock Add see also note for glob module to index_db functions
peterjc authored
910 See also: Bio.SeqIO.index() and Bio.SeqIO.to_dict(), and the Python module
911 glob which is useful for building lists of files.
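As a sketch of the glob suggestion, a small helper (the name list_sequence_files and the default pattern are illustrative assumptions) can build the list of filenames. Note that index_db checks for a real list, which sorted() provides.

```python
import glob
import os


def list_sequence_files(directory, pattern="*.faa"):
    """Return a sorted list of the filenames in directory matching pattern."""
    return sorted(glob.glob(os.path.join(directory, pattern)))


# Possible usage (hypothetical index filename):
# files = list_sequence_files("GenBank")
# records = SeqIO.index_db("proteins.idx", files, "fasta")
```

Sorting keeps the file order stable between runs, which matters because the filenames must match when an existing index is reloaded.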
adb28a3 Peter Cock Adding Bio.SeqIO.index_many(...) function using SQLite3
peterjc authored
912 """
7a82dba Carlos Peña PEP8 fixes E265
carlosp420 authored
913 # Try and give helpful error messages:
adb28a3 Peter Cock Adding Bio.SeqIO.index_many(...) function using SQLite3
peterjc authored
914 if not isinstance(index_filename, basestring):
915 raise TypeError("Need a string for the index filename")
ef941e7 Peter Cock Accept a single filename as a string in Bio.SeqIO.index_db()
peterjc authored
916 if isinstance(filenames, basestring):
7a82dba Carlos Peña PEP8 fixes E265
carlosp420 authored
917 # Make the API a little more friendly, and more similar
918 # to Bio.SeqIO.index(...) for indexing just one file.
ef941e7 Peter Cock Accept a single filename as a string in Bio.SeqIO.index_db()
peterjc authored
919 filenames = [filenames]
adb28a3 Peter Cock Adding Bio.SeqIO.index_many(...) function using SQLite3
peterjc authored
920 if filenames is not None and not isinstance(filenames, list):
516bfe5 Peter Cock PEP8 style using autopep8
peterjc authored
921 raise TypeError(
922 "Need a list of filenames (as strings), or one filename")
adb28a3 Peter Cock Adding Bio.SeqIO.index_many(...) function using SQLite3
peterjc authored
923 if format is not None and not isinstance(format, basestring):
924 raise TypeError("Need a string for the file format (lower case)")
925 if format and format != format.lower():
926 raise ValueError("Format string '%s' should be lower case" % format)
516bfe5 Peter Cock PEP8 style using autopep8
peterjc authored
927 if alphabet is not None and not (isinstance(alphabet, Alphabet) or
adb28a3 Peter Cock Adding Bio.SeqIO.index_many(...) function using SQLite3
peterjc authored
928 isinstance(alphabet, AlphabetEncoder)):
929 raise ValueError("Invalid alphabet, %r" % (alphabet,))
930
7a82dba Carlos Peña PEP8 fixes E265
carlosp420 authored
931 # Map the file format to a sequence iterator:
eddb1a9 Peter Cock Turn sibling imports into relative imports (via 2to3)
peterjc authored
932 from ._index import _FormatToRandomAccess # Lazy import
85c7109 Peter Cock Small refactoring of Bio.SeqIO index imports
peterjc authored
933 from Bio.File import _SQLiteManySeqFilesDict
439cead Peter Cock Small refactor to _SQLiteManySeqFilesDict to help using beyond Bio.SeqIO
peterjc authored
934 repr = "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \
935 % (index_filename, filenames, format, alphabet, key_function)
bfa8b25 Christian Brueffer Add a blank line before and after functions (PEP8 E301).
cbrueffer authored
936
1c8d82e Peter Cock Refactor _SQLiteManySeqFilesDict with factory pattern.
peterjc authored
937 def proxy_factory(format, filename=None):
938 """Given a filename returns proxy object, else boolean if format OK."""
939 if filename:
940 return _FormatToRandomAccess[format](filename, format, alphabet)
941 else:
942 return format in _FormatToRandomAccess
bfa8b25 Christian Brueffer Add a blank line before and after functions (PEP8 E301).
cbrueffer authored
943
85c7109 Peter Cock Small refactoring of Bio.SeqIO index imports
peterjc authored
944 return _SQLiteManySeqFilesDict(index_filename, filenames,
1c8d82e Peter Cock Refactor _SQLiteManySeqFilesDict with factory pattern.
peterjc authored
945 proxy_factory, format,
946 key_function, repr)
adb28a3 Peter Cock Adding Bio.SeqIO.index_many(...) function using SQLite3
peterjc authored
947
7262e20 Peter Cock Adding Bio.SeqIO.convert() and Bio.AlignIO.convert() functions, with doc...
peterjc authored
948
7fedbbd Peter Cock No code changes. Removing white space before ':' character in SeqIO and ...
peterjc authored
949 def convert(in_file, in_format, out_file, out_format, alphabet=None):
7262e20 Peter Cock Adding Bio.SeqIO.convert() and Bio.AlignIO.convert() functions, with doc...
peterjc authored
950 """Convert between two sequence file formats, return number of records.
951
bc094b8 Travis Wrightsman finalized restructured text format
twrightsman authored
952 - in_file - an input handle or filename
953 - in_format - input file format, lower case string
954 - out_file - an output handle or filename
955 - out_format - output file format, lower case string
956 - alphabet - optional alphabet to assume
7262e20 Peter Cock Adding Bio.SeqIO.convert() and Bio.AlignIO.convert() functions, with doc...
peterjc authored
957
bc094b8 Travis Wrightsman finalized restructured text format
twrightsman authored
958 **NOTE** - If you provide an output filename, it will be opened which will
77301f7 Peter Cock Fixed a typo and tweaked docstring line wrapping
peterjc authored
959 overwrite any existing file without warning. This may happen even if
960 the conversion is aborted (e.g. an invalid out_format name is given).
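One way to guard against this, sketched here for Python 3 under the assumption that the output is a filename rather than a handle, is to convert into a temporary file and only replace the real output on success. The helper convert_safely and its convert_func argument are hypothetical and not part of Bio.SeqIO.

```python
import os
import tempfile


def convert_safely(convert_func, out_path):
    """Call convert_func(tmp_path), then move the result over out_path.

    convert_func is any callable taking an output filename, e.g.
    lambda p: SeqIO.convert("in.fastq", "fastq", p, "fasta")
    """
    directory = os.path.dirname(os.path.abspath(out_path))
    # Create the temporary file next to the target so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        result = convert_func(tmp_path)
        os.replace(tmp_path, out_path)
    except BaseException:
        # Conversion failed: discard the partial output, keep the old file
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return result
```

If the conversion raises, any existing file at out_path is left untouched.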
7262e20 Peter Cock Adding Bio.SeqIO.convert() and Bio.AlignIO.convert() functions, with doc...
peterjc authored
961
962 For example, going from a filename to a handle:
963
964 >>> from Bio import SeqIO
988fbcf Peter Cock Python 3 fallback for StringIO in doctests
peterjc authored
965 >>> try:
966 ... from StringIO import StringIO # Python 2
967 ... except ImportError:
968 ... from io import StringIO # Python 3
969 ...
7262e20 Peter Cock Adding Bio.SeqIO.convert() and Bio.AlignIO.convert() functions, with doc...
peterjc authored
970 >>> handle = StringIO("")
971 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta")
972 3
b45cc28 Peter Cock Largely automated print function style in the doctests.
peterjc authored
973 >>> print(handle.getvalue())
7262e20 Peter Cock Adding Bio.SeqIO.convert() and Bio.AlignIO.convert() functions, with doc...
peterjc authored
974 >EAS54_6_R1_2_1_413_324
975 CCCTTCTTGTCTTCAGCGTTTCTCC
976 >EAS54_6_R1_2_1_540_792
977 TTGGCAGGCCAAGGCCGATGGATCA
978 >EAS54_6_R1_2_1_443_348
979 GTTGCTTCTGGCGTGGGTGGGGGGG
980 <BLANKLINE>
981 """
7a82dba Carlos Peña PEP8 fixes E265
carlosp420 authored
982 # Hack for SFF, will need to make this more general in future
516bfe5 Peter Cock PEP8 style using autopep8
peterjc authored
983 if in_format in _BinaryFormats:
03a17db Connor McCoy Remove unused handle_close; no isinstance check
cmccoy authored
984 in_mode = 'rb'
516bfe5 Peter Cock PEP8 style using autopep8
peterjc authored
985 else:
03a17db Connor McCoy Remove unused handle_close; no isinstance check
cmccoy authored
986 in_mode = 'rU'
987
7a82dba Carlos Peña PEP8 fixes E265
carlosp420 authored
988 # Don't open the output file until we've checked the input is OK?
516bfe5 Peter Cock PEP8 style using autopep8
peterjc authored
989 if out_format in ["sff", "sff_trim"]:
03a17db Connor McCoy Remove unused handle_close; no isinstance check
cmccoy authored
990 out_mode = 'wb'
516bfe5 Peter Cock PEP8 style using autopep8
peterjc authored
991 else:
03a17db Connor McCoy Remove unused handle_close; no isinstance check
cmccoy authored
992 out_mode = 'w'
993
7a82dba Carlos Peña PEP8 fixes E265
carlosp420 authored
994 # This will check the arguments and issue error messages,
995 # after we have opened the file which is a shame.
eddb1a9 Peter Cock Turn sibling imports into relative imports (via 2to3)
peterjc authored
996 from ._convert import _handle_convert # Lazy import
a38504c Connor McCoy Rename: seq_handle -> as_handle
cmccoy authored
997 with as_handle(in_file, in_mode) as in_handle:
998 with as_handle(out_file, out_mode) as out_handle:
ddb0a8d Connor McCoy Use context managers for AlignIO, SeqIO
cmccoy authored
999 count = _handle_convert(in_handle, in_format,
1000 out_handle, out_format,
1001 alphabet)
7262e20 Peter Cock Adding Bio.SeqIO.convert() and Bio.AlignIO.convert() functions, with doc...
peterjc authored
1002 return count
1a082f9 Connor McCoy Trim trailing space
cmccoy authored
1003
516bfe5 Peter Cock PEP8 style using autopep8
peterjc authored
1004
883504e Peter Cock Close index handles in SeqIO doctests
peterjc authored
1005 # This helpful trick for testing no longer works with the
1006 # local imports :(
1007 #
7a82dba Carlos Peña PEP8 fixes E265
carlosp420 authored
1008 # if __name__ == "__main__":
883504e Peter Cock Close index handles in SeqIO doctests
peterjc authored
1009 # from Bio._utils import run_doctest
1010 # run_doctest()