Code to classify open reading frames into coding and noncoding

deprekate/goodorfs

Introduction

GOODORFS is a tool that classifies open reading frames (ORFs) as coding or noncoding.

It takes as input a FASTA file representing the entire genome. It finds all potential open reading frames and, for each one, calculates the Energy Density Profile from its amino acid frequencies.
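As a rough illustration of the first steps described above, the sketch below scans the forward strand of a sequence for open reading frames and counts the amino acid frequencies of each one. It is not taken from GOODORFS: the function names (`find_orfs`, `amino_acid_frequencies`), the set of start codons, and the minimum length are assumptions made for this example, and the Energy Density Profile calculation itself is not shown.

```python
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop codon).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STARTS = {"ATG", "GTG", "TTG"}  # assumed start codons for this sketch


def find_orfs(seq, min_codons=3):
    """Yield (start, stop, frame) for forward-strand ORFs.

    Coordinates are 1-based and inclusive, and the stop codon is included.
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon in STARTS:
                start = pos  # first start codon since the previous stop
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if (pos - start) // 3 >= min_codons:
                    yield start + 1, pos + 3, frame + 1
                start = None


def amino_acid_frequencies(orf_seq):
    """Translate an ORF and return amino acid -> relative frequency (stops ignored)."""
    codons = [orf_seq[i:i + 3] for i in range(0, len(orf_seq) - 2, 3)]
    residues = [CODON_TABLE[c] for c in codons if CODON_TABLE.get(c, "*") != "*"]
    counts = Counter(residues)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


genome = "CCATGGAAGTCAACTAAGG"  # toy sequence, not real data
for start, stop, frame in find_orfs(genome):
    print(start, stop, frame, amino_acid_frequencies(genome[start - 1:stop]))
```

A complete ORF finder would also scan the reverse complement to cover the three reverse frames; that part is left out here to keep the sketch short.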

To install GOODORFS, use pip:

pip3 install goodorfs

or install from source:

git clone https://github.com/deprekate/goodorfs.git
cd goodorfs
python3 setup.py install

To run GOODORFS, provide the path to a FASTA file. The default output uses the same format as Glimmer's LONGORFS program, so that it can serve as a drop-in replacement. The columns are: orf_id, start_location, stop_location, frame, and a column of zeros as filler.

$ goodorfs.py tests/NC_001416.fna
00001     191     736  +2   0.000
00002     711    2636  +3   0.000
00003    2633    2839  +2   0.000
00004    3270    2830  -1   0.000
00005    2836    4437  +1   0.000
00006    5095    4604  -2   0.000
00007    4283    5737  +2   0.000
...
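For downstream scripts, the five-column output shown above can be read with a few lines of Python. This is a minimal sketch, assuming whitespace-separated columns exactly as in the sample; `OrfRecord` and `parse_orf_table` are hypothetical names introduced here and are not part of GOODORFS.

```python
from typing import NamedTuple


class OrfRecord(NamedTuple):
    orf_id: str
    start: int
    stop: int
    frame: int    # signed: positive for forward frames, negative for reverse
    score: float  # filler column, 0.000 in the default output


def parse_orf_table(text):
    """Parse LONGORFS-style lines into OrfRecord tuples."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank lines and anything that is not a data row
        orf_id, start, stop, frame, score = fields
        records.append(
            OrfRecord(orf_id, int(start), int(stop), int(frame), float(score))
        )
    return records


sample = """\
00001     191     736  +2   0.000
00004    3270    2830  -1   0.000
"""
for rec in parse_orf_table(sample):
    strand = "+" if rec.frame > 0 else "-"
    length = abs(rec.stop - rec.start) + 1
    print(rec.orf_id, strand, length)
```

Note that for reverse-strand ORFs the start coordinate is larger than the stop coordinate, which is why the length uses the absolute difference.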

GOODORFS can also output the nucleotide sequences in FASTA format for use in other applications:

$ goodorfs.py -Y fna tests/NC_001416.fna | head
>NC_001416_orf1 [START=191] [STOP=736]
ATGGAAGTCAACAAAAAGCAGCTGGCTGACATTTTCGGTGCGAGTATCCGTACCATTCA...
>NC_001416_orf2 [START=711] [STOP=2636]
GTGAATATATCGAACAGTCAGGTTAACAGGCTGCGGCATTTTGTCCGCGCCGGGCTTCG...
>NC_001416_orf3 [START=2633] [STOP=2839]
ATGACGCGACAGGAAGAACTTGCCGCTGCCCGTGCGGCACTGCATGACCTGATGACAGG...
>NC_001416_orf4 [START=3270] [STOP=2830]
GTGCATGGCCACACCTTCCCGAATCATCATGGTAAACGTGCGTTTTCGCTCAACGTCAA...
...
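The coordinates in the FASTA headers can be recovered with a regular expression when the sequences are passed to other tools. The sketch below assumes the header layout shown in the sample (`>name [START=n] [STOP=n]`) and allows sequences to be wrapped over several lines; `read_orf_fasta` is a hypothetical helper, not part of GOODORFS.

```python
import re

# Matches headers such as ">NC_001416_orf1 [START=191] [STOP=736]".
HEADER = re.compile(r"^>(\S+)\s+\[START=(\d+)\]\s+\[STOP=(\d+)\]")


def read_orf_fasta(lines):
    """Yield (name, start, stop, sequence) from fna-style output lines."""
    name = start = stop = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, start, stop, "".join(chunks)
            match = HEADER.match(line)
            if match is None:
                raise ValueError(f"unrecognised header: {line}")
            name = match.group(1)
            start, stop = int(match.group(2)), int(match.group(3))
            chunks = []
        elif line and name is not None:
            chunks.append(line)  # sequence line belonging to the current record
    if name is not None:
        yield name, start, stop, "".join(chunks)


example = [
    ">demo_orf1 [START=3] [STOP=17]",  # toy record, not real output
    "ATGGAAGTC",
    "AACTAA",
]
for record in read_orf_fasta(example):
    print(record)
```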

We have started testing GOODORFS on metagenomes. All that is needed is to bin the reads according to their GC content and then run the bins through GOODORFS in batches to predict gene fragments within the reads.

We have added a script that groups the reads according to GC content. It prints batches of 500 reads separated by the null character, which allows the output to be piped to xargs. To run GOODORFS on the supplied sample metagenome (which is in FASTA format), run the command:

python3 scripts/bin_reads.py tests/ERR5004783_part.fasta | xargs -0 -I {} sh -c "echo '{}' | ./goodorfs.py -Y fna"
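To make the binning idea concrete, here is a minimal sketch of grouping reads by GC content and emitting null-separated batches. It is not the code in `scripts/bin_reads.py`: the function names, the bin width of 0.05, and the choice to fill batches one bin at a time are all assumptions made for illustration.

```python
import sys
from collections import defaultdict


def gc_content(seq):
    """Return the fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def batch_by_gc(reads, bin_width=0.05, batch_size=500):
    """Group (header, sequence) reads into GC bins, then yield FASTA text batches.

    Each batch holds at most batch_size reads, all drawn from the same GC bin.
    """
    bins = defaultdict(list)
    for header, seq in reads:
        bins[int(gc_content(seq) / bin_width)].append((header, seq))
    for key in sorted(bins):
        members = bins[key]
        for i in range(0, len(members), batch_size):
            yield "".join(f">{h}\n{s}\n" for h, s in members[i:i + batch_size])


if __name__ == "__main__":
    # Toy reads, not real data.
    reads = [("r1", "GGCC"), ("r2", "ATAT"), ("r3", "GCGC")]
    for batch in batch_by_gc(reads):
        # A null byte after each batch lets `xargs -0` treat it as one argument.
        sys.stdout.write(batch + "\0")
```

Keeping each batch within a single GC bin means the reads processed together have similar base composition, which is the point of binning before running GOODORFS.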
