initial git commit.

0 parents commit 3cf16bc8e796dc8574fe6f3d019582c4e7eae2af @brentp committed Jul 14, 2010
Showing with 1,761 additions and 0 deletions.
  1. +45 −0 CHANGELOG.txt
  2. +3 −0 MANIFEST.in
  3. +188 −0 README.txt
  4. +284 −0 ez_setup.py
  5. +136 −0 pyfasta/__init__.py
  6. +203 −0 pyfasta/fasta.py
  7. +293 −0 pyfasta/records.py
  8. +225 −0 pyfasta/split_fasta.py
  9. +8 −0 setup.cfg
  10. +30 −0 setup.py
  11. +50 −0 tests.sh
  12. +49 −0 tests/bench.py
  13. +6 −0 tests/data/three_chrs.fasta
  14. +30 −0 tests/data/three_chrs.fasta.orig
  15. +210 −0 tests/test_all.py
  16. +1 −0 upload.sh
@@ -0,0 +1,45 @@
+Changes
+=======
+0.3.9
+-----
+* only require 'r' (not r+) for memory map.
+
+0.3.8
+-----
+* clean up logic for mixing inplace/non-inplace flattened files.
+ if the inplace is available, it is always used.
+
+0.3.6/7
+-------
+* don't re-flatten the file every time!
+* allow spaces before and after the header in the original fasta.
+
+0.3.5
+-----
+
+* update docs in README.txt for new CLI stuff.
+* allow flattening inplace.
+* get rid of memmap (results in faster parsing).
+
+0.3.4
+-----
+
+* restore python2.5 compatibility.
+* CLI: add ability to exclude sequence from extract
+* CLI: allow splitting based on header.
+
+0.3.3
+-----
+
+* include this file in the tar ball (thanks wen h.)
+
+0.3.2
+-----
+
+* separate out backends into records.py
+
+* use nosetests (python setup.py nosetests)
+
+* add a TCRecord backend for next-gen sequencing, available if tc is (easy-)installed.
+
+* improve test coverage.
@@ -0,0 +1,3 @@
+include CHANGELOG.txt
+include tests/*.py
+include tests/data/*.fasta
@@ -0,0 +1,188 @@
+==================================================
+pyfasta: pythonic access to fasta sequence files.
+==================================================
+
+
+:Author: Brent Pedersen (brentp)
+:Email: bpederse@gmail.com
+:License: MIT
+
+.. contents ::
+
+Implementation
+==============
+
+Requires Python >= 2.5. Stores a flattened version of the fasta file without
+spaces or headers and uses either a mmap of numpy binary format or fseek/fread so the
+*sequence data is never read into memory*. Saves a pickle (.gdx) of the start, stop
+(for fseek/mmap) locations of each header in the fasta file for internal use.
+
+Usage
+=====
+::
+
+ >>> from pyfasta import Fasta
+
+ >>> f = Fasta('tests/data/three_chrs.fasta')
+ >>> sorted(f.keys())
+ ['chr1', 'chr2', 'chr3']
+
+ >>> f['chr1']
+ NpyFastaRecord(0..80)
+
+
+Slicing
+-------
+::
+
+ >>> f['chr1'][:10]
+ 'ACTGACTGAC'
+
+ # get the 1st basepair in every codon (it's python yo)
+ >>> f['chr1'][::3]
+ 'AGTCAGTCAGTCAGTCAGTCAGTCAGT'
+
+ # can query by a 'feature' dictionary
+ >>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9})
+ 'CTGACTGA'
+
+ # same as:
+ >>> f['chr1'][1:9]
+ 'CTGACTGA'
+
+ # with reverse complement (automatic for - strand)
+ >>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9, 'strand': '-'})
+ 'TCAGTCAG'
+
+Numpy
+=====
+
+The default is to use a memmapped numpy array as the backend, in which case it's possible to
+get back an array directly...
+::
+
+ >>> f['chr1'].tostring = False
+ >>> f['chr1'][:10] # doctest: +NORMALIZE_WHITESPACE
+ memmap(['A', 'C', 'T', 'G', 'A', 'C', 'T', 'G', 'A', 'C'], dtype='|S1')
+
+ >>> import numpy as np
+ >>> a = np.array(f['chr2'])
+ >>> a.shape[0] == len(f['chr2'])
+ True
+
+ >>> a[10:14] # doctest: +NORMALIZE_WHITESPACE
+ array(['A', 'A', 'A', 'A'], dtype='|S1')
+
+mask a sub-sequence
+::
+
+ >>> a[11:13] = np.array('N', dtype='S1')
+ >>> a[10:14].tostring()
+ 'ANNA'
+
+
+Backends (Record class)
+=======================
+It's also possible to specify another record class as the underlying work-horse
+for slicing and reading. Currently, the available classes are:
+
+ * NpyFastaRecord (the default), which uses numpy memmap
+ * FastaRecord, which uses fseek/fread
+ * MemoryRecord which reads everything into memory and must reparse the original
+ fasta every time.
+ * TCRecord which is identical to NpyFastaRecord except that it saves the index
+ in a TokyoCabinet hash database, for cases when there are enough records that
+ loading the entire index from a pickle into memory is unwise. (NOTE: the
+ sequence is not loaded into memory in either case).
+
+It's possible to specify the class used with the `record_class` kwarg to the `Fasta`
+constructor:
+::
+
+ >>> from pyfasta import FastaRecord # default is NpyFastaRecord
+ >>> f = Fasta('tests/data/three_chrs.fasta', record_class=FastaRecord)
+ >>> f['chr1']
+ FastaRecord('tests/data/three_chrs.fasta.flat', 0..80)
+
+other than the repr, it should behave exactly like the Npy record class backend
+
+it's possible to create your own using a sub-class of FastaRecord. see the source
+in pyfasta/records.py for details.
+
+Flattening
+==========
+In order to efficiently access the sequence content, pyfasta saves a separate, flattened file with all newlines and headers removed from the sequence. In the case of large fasta files, one may not wish to save 2 copies of a 5GB+ file. In that case, it's possible to flatten the file "inplace", keeping all the headers, and retaining the validity of the fasta file -- with the only change being that the new-lines are removed from each sequence. This can be specified via `flatten_inplace=True`
+::
+
+ >>> import os
+ >>> os.unlink('tests/data/three_chrs.fasta.gdx') # cleanup non-inplace idx
+ >>> f = Fasta('tests/data/three_chrs.fasta', flatten_inplace=True)
+ >>> f['chr1'] # note the difference in the output from above.
+ NpyFastaRecord(6..86)
+
+ # sequence is the same as when requested from the non-inplace file above.
+ >>> f['chr1'][1:9]
+ 'CTGACTGA'
+
+ # the flattened file is kept as a place holder without the sequence data.
+ >>> open('tests/data/three_chrs.fasta.flat').read()
+ '@flattened@'
+
+
+Command Line Interface
+======================
+there's also a command line interface to manipulate / view fasta files.
+the `pyfasta` executable is installed via setuptools, running it will show
+help text.
+
+split a fasta file into 6 new files of relatively even size:
+
+ $ pyfasta **split** -n 6 original.fasta
+
+split the fasta file into one new file per header with "%(seqid)s" being filled into each filename.:
+
+ $ pyfasta **split** --header "%(seqid)s.fasta" original.fasta
+
+create 1 new fasta file with the sequence split into 10K-mers:
+
+ $ pyfasta **split** -n 1 -k 10000 original.fasta
+
+2 new fasta files with the sequence split into 10K-mers with 2K overlap:
+
+ $ pyfasta **split** -n 2 -k 10000 -o 2000 original.fasta
+
+
+show some info about the file (and show gc content):
+
+ $ pyfasta **info** --gc tests/data/three_chrs.fasta
+
+
+**extract** sequence from the file. use the header flag to make
+a new fasta file. the args are a list of sequences to extract.
+
+ $ pyfasta **extract** --header --fasta tests/data/three_chrs.fasta seqa seqb seqc
+
+**extract** sequence from a file using a file containing the headers *not* wanted in the new file:
+
+ $ pyfasta extract --header --fasta input.fasta --exclude --file seqids_to_exclude.txt
+
+
+**flatten** a file inplace, for faster later use by pyfasta, and without creating another copy. (`Flattening`_)
+
+ $ pyfasta flatten input.fasta
+
+cleanup
+=======
+(though for real use these will remain for faster access)
+::
+
+ >>> os.unlink('tests/data/three_chrs.fasta.gdx')
+ >>> os.unlink('tests/data/three_chrs.fasta.flat')
+
+Testing
+=======
+there is currently > 99% test coverage for the 2 modules and all included
+record classes. to run the tests:
+::
+
+ $ python setup.py nosetests
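
The README in this commit describes two ideas that are easier to follow with a small example: the fasta is flattened (headers and newlines removed) with a (start, stop) offset saved per header, and `f.sequence()` takes a feature dictionary whose `start`/`stop` are 1-based and inclusive, reverse-complementing for the `-` strand. The following is a minimal sketch of that idea using plain in-memory strings. The helper names `flatten` and `sequence` and the `COMPLEMENT` table are hypothetical, chosen for illustration; this is not the code in pyfasta/fasta.py or pyfasta/records.py, which work on files via mmap or fseek/fread.

```python
# Illustrative sketch only: these helpers are hypothetical, not pyfasta's code.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def flatten(fasta_text):
    """Return (flat, index).

    flat is the sequence data with headers and newlines removed;
    index maps each header to its (start, stop) offsets within flat.
    """
    chunks = []
    index = {}
    name = None
    start = pos = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # close the previous record before starting a new one
            if name is not None:
                index[name] = (start, pos)
            name = line[1:].strip()
            start = pos
        else:
            chunks.append(line)
            pos += len(line)
    if name is not None:
        index[name] = (start, pos)
    return "".join(chunks), index


def sequence(flat, index, feature):
    """Fetch a feature given 1-based, inclusive 'start' and 'stop'."""
    offset, end = index[feature["chr"]]
    # 1-based inclusive [start, stop] maps to the 0-based slice [start - 1:stop]
    lo = offset + feature["start"] - 1
    hi = min(offset + feature["stop"], end)  # never read into the next record
    seq = flat[lo:hi]
    if feature.get("strand") == "-":
        # reverse complement: complement each base, then reverse
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


fasta_text = ">chr1\nACTGACTG\nACTGACTG\n>chr2\nAAAA\nCCCC\n"
flat, index = flatten(fasta_text)
print(index)
print(sequence(flat, index, {"chr": "chr1", "start": 2, "stop": 9}))
print(sequence(flat, index, {"chr": "chr1", "start": 2, "stop": 9, "strand": "-"}))
```

With this toy input, the two queries return 'CTGACTGA' and 'TCAGTCAG', the same values the README's doctests show for the equivalent `f.sequence(...)` calls. Storing only offsets in the index is what lets the real backends seek straight to a record instead of reparsing the file.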