Reference-Free Sequence Classification Tool for DNA sequences in metagenomic samples

cobilab/RFSC

License: GPL v3

RFSC is a Reference-Free Sequence Classification Tool that uses machine learning classifiers and relies on an ensemble of experts to provide efficient classification in metagenomic contexts.

System Requirements

Laptop computer running Linux Ubuntu (for example, 18.04 LTS or higher) with GCC (https://gcc.gnu.org), Conda (https://docs.conda.io) and CMake (https://cmake.org) installed. The hardware must have at least 32 GB of RAM and an 800 GB disk. If the database is not rebuilt, only about 10 GB of disk space is needed. Furthermore, for the installation to work correctly, Docker and Docker Compose must be installed on the system (https://docs.docker.com/engine/install/ubuntu/).
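The requirements above can be checked before installing. The following is a minimal sketch and not part of RFSC: the function names (`missing_tools`, `total_ram_gb`, `free_disk_gb`, `preflight`) are hypothetical helpers, and the RAM check assumes a Linux system that exposes /proc/meminfo.

```shell
#!/usr/bin/env bash
# Pre-flight check sketch (illustrative helper, not shipped with RFSC).

# Print the name of every listed command that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Total RAM in whole GB, read from /proc/meminfo (Linux only).
total_ram_gb() {
  if [ -r /proc/meminfo ]; then
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
  else
    echo "unknown"
  fi
}

# Free space in whole GB on the filesystem that holds the given directory.
free_disk_gb() {
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Report what is missing; this only prints, it never aborts.
preflight() {
  local missing
  missing=$(missing_tools gcc conda cmake docker docker-compose)
  if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
  fi
  echo "RAM: $(total_ram_gb) GB (at least 32 GB required)"
  echo "Free disk: $(free_disk_gb .) GB (800 GB to rebuild the database, about 10 GB otherwise)"
}

preflight
```

Run it from the directory where RFSC will be cloned, so that the disk figure refers to the right filesystem.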

Installation

Using Docker

git clone https://github.com/cobilab/RFSC
cd RFSC
chmod +x RFSC.sh 
chmod +777 src/*.sh
docker-compose build
docker-compose up -d && docker exec -it rfsc bash && docker-compose down
./RFSC.sh --install #install tools

Download NCBI Reference Databases

./RFSC.sh --build-ref-virus --build-ref-bacteria --build-ref-archaea --build-ref-protozoa \
  --build-ref-fungi --build-ref-plant --build-ref-mitochondrial --build-ref-plastid

or

./RFSC.sh -dviral -dbact -darch -dprot -dfung -dplan -dmito -dplas
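When only some domains are needed, the long-form flags can be assembled from a list. This is a sketch under an assumption: the commands above show all eight flags together, and it is assumed here (not confirmed by this README) that a subset is also accepted. `build_ref_flags` is a hypothetical helper, and the command is only printed for review, not executed.

```shell
# Build the --build-ref-* flags for a chosen subset of domains.
# Domain names are taken from the long options shown above.
build_ref_flags() {
  local domain flags=""
  for domain in "$@"; do
    case "$domain" in
      virus|bacteria|archaea|protozoa|fungi|plant|mitochondrial|plastid)
        flags="$flags --build-ref-$domain" ;;
      *)
        echo "unknown domain: $domain" >&2
        return 1 ;;
    esac
  done
  # Drop the leading space left by the first concatenation.
  echo "${flags# }"
}

# Print the command instead of running it, so it can be checked first.
echo ./RFSC.sh $(build_ref_flags virus archaea)
```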

Global Results

Real Sequence Classification

Obtain the classification report for KNN, GNB and XGBoost:

./RFSC.sh -runAll #classification report table
./RFSC.sh -runAll F1Score # Weighted-averaged F1-score
./RFSC.sh -runAll Accuracy # Average Accuracy
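For reference, the two summary metrics named above are usually defined as follows. This assumes RFSC follows the common convention (the one used by scikit-learn's `average="weighted"`), which this README does not state explicitly. Here $C$ is the number of classes, $n_c$ the number of true samples of class $c$, $N$ the total number of samples, and $P_c$, $R_c$ the per-class precision and recall.

```latex
\[
F1_c = \frac{2\,P_c\,R_c}{P_c + R_c},
\qquad
F1_{\text{weighted}} = \sum_{c=1}^{C} \frac{n_c}{N}\, F1_c,
\qquad
N = \sum_{c=1}^{C} n_c
\]
\[
\text{Accuracy} = \frac{\text{number of correctly classified samples}}{N}
\]
```

Because each class is weighted by its share of the samples, the weighted F1-score is dominated by the larger domains when the classes are imbalanced.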

Generate mutated data, and perform classification

To gather a small set of sequences from the 8 domains, run the script:

./RFSC.sh -mget  

To compute features for mutated sequences, run the script:

./RFSC.sh -cfem

To perform classification for all mutated sequences, run the script:

./RFSC.sh -cclm 
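The three steps above depend on each other, so it can help to chain them and stop at the first failure. The wrapper below is a sketch: `RFSC_BIN` and `DRY_RUN` are hypothetical variables introduced for illustration, not RFSC options. By default it only prints the commands.

```shell
# Run the mutated-data steps in order (-mget, -cfem, -cclm).
# RFSC_BIN and DRY_RUN are illustrative variables, not part of RFSC.
RFSC_BIN="${RFSC_BIN:-./RFSC.sh}"
DRY_RUN="${DRY_RUN:-1}"

run_mutated_pipeline() {
  local step
  for step in -mget -cfem -cclm; do
    if [ "$DRY_RUN" = "1" ]; then
      # Dry run: show what would be executed.
      echo "$RFSC_BIN $step"
    else
      # Real run: abort as soon as one step fails.
      "$RFSC_BIN" "$step" || { echo "step $step failed" >&2; return 1; }
    fi
  done
}

run_mutated_pipeline
```

Set `DRY_RUN=0` to execute the steps. The same pattern would apply to the synthetic workflow by replacing the step list with `-sget -cfes -ccls`.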

Synthetic Sequence Generation and Classification

To gather a small set of sequences from the 8 domains, run the script:
./RFSC.sh -sget

To create the synthetic hybrid sequences and compute their features, run the script:

./RFSC.sh -cfes

To classify the synthetic hybrid sequences and obtain the classification report for KNN, GNB and XGBoost, run the script:

./RFSC.sh -ccls

To perform classification of the synthetic sequences using Kraken2 (only for comparison purposes; requires a Kraken2 installation), first download the Kraken2 database from https://benlangmead.github.io/aws-indexes/k2
To obtain the same results, use the Standard database containing "archaea, bacteria, viral, plasmid, human1, UniVec_Core", created on 5/17/2021 (38.6 GB). Then run the script:

./RFSC.sh -ckra
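Before running the comparison, it may be worth confirming that Kraken2 is installed and that the downloaded database is complete. The sketch below checks for the three index files a Kraken2 database contains (hash.k2d, opts.k2d, taxo.k2d). `kraken2_db_ok` and the `KRAKEN2_DB` path are hypothetical; this README does not say where RFSC.sh expects the database to be placed.

```shell
# Succeed only if the directory holds the three Kraken2 index files.
kraken2_db_ok() {
  local db_dir="$1" f
  for f in hash.k2d opts.k2d taxo.k2d; do
    [ -s "$db_dir/$f" ] || { echo "missing or empty: $db_dir/$f" >&2; return 1; }
  done
}

# Hypothetical location of the extracted Standard database.
KRAKEN2_DB="${KRAKEN2_DB:-./k2_standard}"

if command -v kraken2 >/dev/null 2>&1 && kraken2_db_ok "$KRAKEN2_DB" 2>/dev/null; then
  echo "Kraken2 and the database at $KRAKEN2_DB look usable"
else
  echo "Kraken2 or the database at $KRAKEN2_DB is not ready"
fi
```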

Running Examples

✨ Generate a synthetic sequence, then perform a Reference-Free Reconstruction of it:

./RFSC.sh --clean y
./RFSC.sh --threads 8 --gen-adapters
./RFSC.sh --efetch-fasta 155971 Input_Data/EntrezGenomes 
./RFSC.sh --efetch-fasta EF491856.1 Input_Data/EntrezGenomes 
./RFSC.sh --efetch-fasta MT682520 Input_Data/EntrezGenomes
./RFSC.sh -synt Input_Data/EntrezGenomes/155971.fna Input_Data/EntrezGenomes/EF491856.1.fna Input_Data/EntrezGenomes/MT682520.fna
./RFSC.sh -trim TT PE --run-de-novo
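The fetch and synthesis commands above repeat the same accession list. A small sketch that derives both from one list is shown below; `fasta_paths` is a hypothetical helper, and the `<output dir>/<accession>.fna` naming is taken from the `-synt` call in the example. The commands are printed rather than executed.

```shell
# Map accessions to the FASTA paths that --efetch-fasta is expected to create,
# following the <output dir>/<accession>.fna pattern seen in the example.
fasta_paths() {
  local out_dir="$1" acc paths=""
  shift
  for acc in "$@"; do
    paths="$paths $out_dir/$acc.fna"
  done
  # Drop the leading space left by the first concatenation.
  echo "${paths# }"
}

OUT_DIR="Input_Data/EntrezGenomes"

# Print one fetch command per accession.
for acc in 155971 EF491856.1 MT682520; do
  echo ./RFSC.sh --efetch-fasta "$acc" "$OUT_DIR"
done

# Print the matching synthesis command.
echo ./RFSC.sh -synt $(fasta_paths "$OUT_DIR" 155971 EF491856.1 MT682520)
```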
✨ Reference-Based Classification, using FALCON-meta:
(This requires that the reference databases have already been built and that the Reference-Free Reconstruction stage has finished.)


./RFSC.sh --threads 8 --set-len-cov 100 3 --set-threshold-max-min 70 1 --run-falcon SO Viral
Results show the list of possible candidates for this sequence.
✨ Reference-Free Classification, using XGBoost:


./RFSC.sh --threads 8 --efetch-fasta 155971 RefFree
./RFSC.sh --run-xgboost
The expected result of this test is that the sequence is classified as viral.

Tools Integrated in RFSC

| Tool | URL |
| --- | --- |
| AC | https://github.com/cobilab/ac |
| Blastn | https://blast.ncbi.nlm.nih.gov/Blast.cgi |
| Cryfa | https://github.com/cobilab/cryfa |
| Entrez | https://www.ncbi.nlm.nih.gov/genome |
| FALCON-meta | https://github.com/cobilab/falcon |
| FASTP | https://github.com/OpenGene/fastp |
| GeCo3 | https://github.com/cobilab/geco3 |
| GTO | https://cobilab.github.io/gto/ |
| metaSPAdes | https://cab.spbu.ru/software/meta-spades/ |
| ORFfinder | https://www.ncbi.nlm.nih.gov/orffinder/ |
| ORFM | https://github.com/wwood/OrfM |
| SoD | https://github.com/pratas/SoD.git |
| Trimmomatic | http://www.usadellab.org/cms/?page=trimmomatic |

License

GNU GPL

✨Developed to make a change!✨
