Sternum

description

Sternum is a python package that map Next Generation Sequencing reads to a reference geneome using different algorithms and data structures.

Available methods

1. Suffix Tree (Trie)

Example from: http://btechsmartclass.com/data_structures/tries.html

trie is implemented using dictionaries in python. Each node represents one nucleotide and points to a new dictionary for the followong nucleotide. The last charcter of any suffix is '$' points to a list of information about readID and position of this suffix

{0: {A: {OTHER RECURSIVE DICTIONARIES} C: {A: {$: [[ID, pos], [ID, pos]]}}}}

To find a pattern in trie , the pattern is matched against the trie letter by letter untill it reaches '$'. If the pattern fails to reach a trie leaf it returns -1, else, [ID, pos]

trie is constructed from m genome in linear time O(|m|) and each pattern n is found in linear time O(|n|)

2. Suffix Array (SA)

Example from: https://discuss.codechef.com/t/a-tutorial-on-suffix-arrays/2950

sa is implemented using list in python. Each index points to ith order kmer in the whole reference.

[indecies of ordered kmers alphabtically]

To find a pattern in sa , binary search is performed using these indecies since the kmers are ordered. the pattern is matched each time against the kmer selected by binary seach untill it reaches a match. If the pattern fails to reach a match it returns -1, else, [ID, pos]

sa is constructed from m genome in logarithmic time O(|m.log(m)|) and each pattern n is found in logarithmic time O(|m|.log(|m|)) , however, it's stored in much less memory size than trie

3. Burrow Wheeler Transform (BWT)

Example from: Zhang, D., Liu, Q., Wu, Y., Li, Y., & Xiao, L. (2013). Compression and Indexing Based on BWT: A Survey. 2013 10Th Web Information System And Application Conference. doi: 10.1109/wisa.2013.20\

bwt is implemented using list and dictionary in python. Each index points to a suffix in the whole reference.

{index: 'nucleotide', index: 'nucleotide', ..., index: '$', ..., index: 'nucleotide'}

along with last2first list to point to this suffix in a column of ordered suffixes, which necessary to search for pattern through going back and forth from BWT to the first column.

Example from: Rocha, M., & Ferreira, P. Bioinformatics Algorithms (p. 357)\

To find a pattern in bwt , back and forth walk through the inverted pattern in the first and last column. the pattern is matched if it reaches its first character in the last column. If the pattern fails to reach a match it returns -1, else, [ID, pos]

bwt is constructed from m genome in logarithmic time O(|m|.log(|m|)) and each pattern n is found in linear time O(|n|).

Input and Output

Input

Reference genome file -r str()
fasta/fastq file
Sequence reads file -s str()
fasta/fastq file. paired reads are not supported
kmer size -k int()
smaller kmer size means more allowance for mismatches
Minimum count of matched kmers -cint()
minimum allowed number of kmers matching the reference in sequence to be reported.
Minimum matching percentage -p int()
minimum allowed percentage of similarity between matched part of read and covered stretch of genome
Batch size -b int()
how many sequence reads to be loaded into memory per iteration to be mapped

Output

pseudo SAM file -o str()
a file containing only read name, reference name, position, and read sequence per match

Getting started

Installing the package

Download the latest release: releases

In your command line environment:

pip install path/to/sternum-x.x.x-py2.py3-none-any.whl

Run an example

In your Command Line environment:

sternum -m method -r path/to/reference.fastx -s path/to/sequence.fastx -k kmerSize -p minPercentage -c minKcount -b batchSize -o path/to/myfirstmap

this file will be genrated: path/to/myfirstmap*$m*$sequenceID.sam
you can try with files in sternum/data

Comparison with Bowtie2 and BWA

CPU seconds to map different number of reads with defaul parameters of each tool
Memory used to map different number of reads with defaul parameters of each tool
Percentage of mapped reads by Sternum(BWT) compared to other tools

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
data		data
figures		figures
src		src
tests		tests
.gitignore		.gitignore
.travis.yml		.travis.yml
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sternum

description

Available methods

1. Suffix Tree (Trie)

2. Suffix Array (SA)

3. Burrow Wheeler Transform (BWT)

Input and Output

Input

Reference genome file `-r` str()

Sequence reads file `-s` str()

kmer size `-k` int()

Minimum count of matched kmers `-c`int()

Minimum matching percentage `-p` int()

Batch size `-b` int()

Output

pseudo SAM file `-o` str()

Getting started

Installing the package

Run an example

Comparison with Bowtie2 and BWA

About

Releases 2

Packages

Languages

anazhmetdin/Sternum

Folders and files

Latest commit

History

Repository files navigation

Sternum

description

Available methods

1. Suffix Tree (Trie)

2. Suffix Array (SA)

3. Burrow Wheeler Transform (BWT)

Input and Output

Input

Reference genome file -r str()

Sequence reads file -s str()

kmer size -k int()

Minimum count of matched kmers -cint()

Minimum matching percentage -p int()

Batch size -b int()

Output

pseudo SAM file -o str()

Getting started

Installing the package

Run an example

Comparison with Bowtie2 and BWA

About

Topics

Resources

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Reference genome file `-r` str()

Sequence reads file `-s` str()

kmer size `-k` int()

Minimum count of matched kmers `-c`int()

Minimum matching percentage `-p` int()

Batch size `-b` int()

pseudo SAM file `-o` str()

Packages