fzyan/MetaCSST

###############################################################################################
##Package: Metagenomic Complex Sequence Scanning Tool (MetaCSST)                             ##
##Developer: Fazhe Yan                                                                       ##
##Email: fazheyan33@163.com ; ccwei@sjtu.edu.cn                                              ##
##Department: Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University  ##
###############################################################################################

##################
## Introduction ##
##################
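Metagenomic Complex Sequence Scanning Tool (MetaCSST) is a tool for predicting diversity-generating retroelements (DGRs) in sequenced genomes and in metagenomic datasets. It is based on a Generalized Hidden Markov Model (GHMM) that uses motif patterns to identify the elements of a DGR: the template repeat (TR), the variable repeat (VR) and the reverse transcriptase (RT).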

###############
## Copyright ##
###############
This software is free for personal, academic and non-profit use and is available from https://github.com/fzyan/MetaCSST (GitHub website).
For commercial users, please contact <ccwei@sjtu.edu.cn>.

#########################
## System requirements ##
#########################
Linux operating system, with about 2 GB of memory when running multiple threads.
Perl 5.8.5 or later and gcc 4.1.2 or later.

###########
## Usage ##
###########
1>Identify substructures (TR, VR or RT) in DGRs:
    ./MetaCSSTsub -build TR.config -in $fa [-out $out_dir] [-thread $thread]
    or ./MetaCSSTsub -build VR.config -in $fa [-out $out_dir] [-thread $thread]
    or ./MetaCSSTsub -build RT.config -in $fa [-out $out_dir] [-thread $thread]
    
    # $fa : input file in FASTA format (preprocessing may be needed first: ./src/chomp.pl $input)
    # $out_dir : output directory. If not given, the default output directory is "out_metacsst"
    # $thread : number of threads, default 1
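
The three invocations above differ only in the config file, so they can be generated in one place. The Python sketch below builds the argument list for each run using only the options documented above. The helper names and the per-element output directory names (out_TR, out_VR, out_RT) are illustrative assumptions, not part of MetaCSST.

```python
def build_sub_command(config, fasta, out_dir=None, thread=None, exe="./MetaCSSTsub"):
    """Return the argv list for one MetaCSSTsub run (hypothetical helper)."""
    cmd = [exe, "-build", config, "-in", fasta]
    # -out and -thread are optional, exactly as in the usage line above.
    if out_dir is not None:
        cmd += ["-out", out_dir]
    if thread is not None:
        cmd += ["-thread", str(thread)]
    return cmd


def build_all_sub_commands(fasta, thread=1):
    """One command per element type; the out_<element> directory names are assumed."""
    return [
        build_sub_command(f"{element}.config", fasta, out_dir=f"out_{element}", thread=thread)
        for element in ("TR", "VR", "RT")
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) from the directory that contains MetaCSSTsub and the config files.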

2>DGR prediction
    Step1: ./MetaCSSTmain -build arg.config -in $fa [-out $out_dir1] [-thread $thread]
    	   #identify the substructures using the GHMM
    Step2: perl src/callVR.pl $out_dir1/raw.gtf $fa $out-tmp
    	   #call VRs according to the identified TRs
    Step3: perl src/removeRepeat.pl $out-tmp $out-DGR
    	   #remove identical TR-VR pairs generated by callVR.pl
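
The three steps above can be chained from a small driver. The following Python sketch only assembles the documented commands in order and runs them one after another. The function names and the default file names (out_dir1, out_tmp.txt, out_DGR.gtf) are assumptions chosen for illustration.

```python
import subprocess


def build_dgr_pipeline(fasta, out_dir1="out_dir1", tmp_file="out_tmp.txt",
                       dgr_file="out_DGR.gtf", thread=1):
    """Return the Step1-Step3 commands as argv lists (file names are assumed defaults)."""
    raw_gtf = f"{out_dir1}/raw.gtf"  # written by MetaCSSTmain in Step1
    return [
        ["./MetaCSSTmain", "-build", "arg.config", "-in", fasta,
         "-out", out_dir1, "-thread", str(thread)],
        ["perl", "src/callVR.pl", raw_gtf, fasta, tmp_file],
        ["perl", "src/removeRepeat.pl", tmp_file, dgr_file],
    ]


def run_dgr_pipeline(fasta, **kwargs):
    """Run the steps in order; check=True stops at the first failing step."""
    for cmd in build_dgr_pipeline(fasta, **kwargs):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to print the commands for inspection before anything is run.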

###############
## OUT files ##
###############
1>Identify substructures (TR, VR or RT) in DGRs:
    out_dir/out.txt   : identified substructures
        #Each input FASTA sequence is followed by the elements found in it
	#File format example:
	    >gi|377805758|gb|JQ680349.1|			                                                           ##ID
	    CCCACAGTGCGTGTATGAT......GATTAATACAGAATTACTACG							      	   ##sequence
	    Score:6.57      +       matchSeq(31631-31680):CTATCTTTGGGATATTCTATAGTTCTAGCTATAACATCAATTCCACCAAC	      	   ##element1
	    Score:62.73     -       matchSeq(39481-39544):AACAACAGCTGGAACGTGAACTTTAGTAATGGCAACTTCAACAACAACAACAAGTACAACAGTA ##element2

	    #Format of each identified element:
	    	 Score:$score  $strand  matchSeq($start-$end):$sequence
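
Element lines in this format can be read with a regular expression. The sketch below is a minimal parser written against the example lines shown above. The function name and the returned dictionary keys are illustrative choices, and the "+"/"-" field is interpreted as the strand.

```python
import re

# Matches lines such as:
#   Score:6.57      +       matchSeq(31631-31680):CTATC...
ELEMENT_RE = re.compile(
    r"^Score:(?P<score>[-+]?\d+(?:\.\d+)?)\s+(?P<strand>[+-])\s+"
    r"matchSeq\((?P<start>\d+)-(?P<end>\d+)\):(?P<sequence>[A-Za-z]+)"
)


def parse_element(line):
    """Return a dict for an element line, or None for ID and sequence lines."""
    match = ELEMENT_RE.match(line.strip())
    if match is None:
        return None
    return {
        "score": float(match["score"]),
        "strand": match["strand"],
        "start": int(match["start"]),
        "end": int(match["end"]),
        "sequence": match["sequence"],
    }
```

Lines that do not start with "Score:" (the ">" header and the sequence itself) return None, so a caller can walk out.txt line by line and attach each parsed element to the most recent header.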

    out_dir/align.txt : count matrix for each position, used to build PWMs
    out_dir/score.txt : PWMs (position weight matrices, used as scoring matrices)
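
To illustrate how a per-position count matrix relates to a PWM, the sketch below applies the generic log-odds construction with a pseudocount and a uniform background. This is a textbook formulation shown under those assumptions. The exact formula used by MetaCSST and the on-disk layout of align.txt and score.txt are not described here, so the function names and the data layout are hypothetical.

```python
import math


def counts_to_pwm(count_columns, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores (generic method)."""
    pwm = []
    for column in count_columns:
        # Add the pseudocount to every base so zero counts stay finite.
        total = sum(column.values()) + pseudocount * len(column)
        pwm.append({
            base: math.log2(((count + pseudocount) / total) / background)
            for base, count in column.items()
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position scores of a window with the same length as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))
```

A position where every training sequence carries the same base receives a positive score for that base and negative scores for the others, while a position with equal counts scores zero for all bases.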

2>DGR prediction
    Step1: 
    	   out_dir1/raw.gtf : identified TRs and RTs
	       #ID  element  score  strand  start  end  sequence
	   out_dir1/align.txt : count matrix for each position, used to build PWMs
	   out_dir1/score.txt : PWMs (position weight matrices, used as scoring matrices)

    Step2:
	   out_tmp2.txt : TRs are followed by paired VRs
	       #ID  TR  strand  original_start  original_end  start  end  A-to-N-substitutions  Non-A-to-N-substitutions  sequence
	       #ID  VR  strand  *  		*	      start  end  A-to-N-substitutions  Non-A-to-N-substitutions  sequence
	       #ID  RT  strand  start		end	      sequence
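
The three record layouts above can be mapped to named fields when post-processing the Step2 or Step3 output. The sketch below assumes the columns are whitespace-separated and that IDs contain no whitespace; neither is stated above. The field names and the function name are illustrative, with the third column treated as the "+"/"-" strand.

```python
# Column names after the ID and element type, per record layout (names assumed).
FIELDS = {
    "TR": ["strand", "original_start", "original_end", "start", "end",
           "a_to_n", "non_a_to_n", "sequence"],
    "VR": ["strand", "original_start", "original_end", "start", "end",
           "a_to_n", "non_a_to_n", "sequence"],
    "RT": ["strand", "start", "end", "sequence"],
}
INT_FIELDS = {"original_start", "original_end", "start", "end", "a_to_n", "non_a_to_n"}


def parse_dgr_line(line):
    """Return a dict for a TR/VR/RT record, or None if the line does not fit."""
    parts = line.split()
    if len(parts) < 2 or parts[1] not in FIELDS:
        return None
    names = FIELDS[parts[1]]
    if len(parts) - 2 != len(names):
        return None
    record = {"id": parts[0], "element": parts[1]}
    for name, value in zip(names, parts[2:]):
        if name in INT_FIELDS:
            # VR records carry "*" in the original_start/original_end columns.
            record[name] = None if value == "*" else int(value)
        else:
            record[name] = value
    return record
```

Returning None for lines that do not fit any layout lets a caller skip unexpected lines instead of failing partway through a file.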
	       
    Step3:
           out-DGR.gtf : The file format is the same as out_tmp2.txt
	   
###########
## Files ##
###########
    |-MetaCSSTmain		executable program to predict DGRs
    |-MetaCSSTsub		executable program to identify TRs, VRs or RTs
    |-arg(/TR/VR/RT).config	config files for the GHMM
    |-align/*align		alignment matrices used to build the GHMM
    |-main.cpp			source code to build MetaCSSTmain
    |-sub.cpp			source code to build MetaCSSTsub
    |-ghmm.h & fun.h		some functions, structures and objects
    |-callVR.pl			used to search VRs according to the raw GTF file generated by MetaCSSTmain
    |-removeRepeat.pl		remove identical TR-VR pairs generated by callVR.pl
    |-chomp.pl			preprocess the input file in FASTA format
    |-addition/merged*		collected DGRs
    |-addition/training		training set
    |-addition/test		test set
    |-addition/classify		classification of TRs/VRs/RTs, generated by MUSCLE
    |-example and example.sh	an example of identifying DGRs
    |-callORF.pl  		script to call Open Reading Frames
    |-coden.txt			codon table used to call ORFs

##################
## Installation ##
##################
MetaCSSTmain and MetaCSSTsub are shipped as executable programs.
If you want to modify the code and recompile:
   g++ src/main.cpp -o MetaCSSTmain_new -lpthread
   g++ src/sub.cpp -o MetaCSSTsub_new -lpthread

#############
## Contact ##
#############
If you have any questions, feel free to contact us:
   fazheyan33@163.com
   ccwei@sjtu.edu.cn
