- metaspades https://github.com/ablab/spades#sec2
- MetaQuast http://quast.sourceforge.net/metaquast
- Metabat https://bitbucket.org/berkeleylab/metabat/src/master/
- emapper http://eggnog-mapper.embl.de
- diamond https://github.com/bbuchfink/diamond
- hmmer https://github.com/EddyRivasLab/hmmer
- bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
For gene prediction, functional annotation, or pipeline management:
- prodigal https://github.com/hyattpd/Prodigal
- orfm https://github.com/wwood/OrfM
- Trinotate https://github.com/Trinotate/Trinotate.github.io/wiki
- Docker https://www.docker.com/ (already in container)
- Nextflow https://www.nextflow.io/ (Managing data flow)
Build the container from the Dockerfile:
docker build -t my_dockerpipeline my_FOLDER/
To use the container, you need a folder that is shared between the host and Docker:
mkdir $HOME/dockershare
cd $HOME/dockershare
Run docker example
cd $HOME/dockershare
docker run -it --rm -v $HOME/dockershare:/data my_dockerpipeline -i /data/"FOLDER WITH SAMPLES" -o /data/"FOLDER WITH SAMPLES"
#Check for installed image
docker images
https://hub.docker.com/r/amartinsan/metagenompipeline
#You can pull the image from Docker Hub. REMEMBER to specify the version or tag (1.0 in this case)
docker pull amartinsan/metagenompipeline:1.0
Instead of fastqc/multiqc and trimmomatic, the idea is to use fastp to filter low-quality reads from paired-end data.
A CD-HIT-DUP pass can also be added to filter further, although it is very time consuming and not always recommended.
- Quality-processed seqs with a "Q_" prefix
- quality folder with the original seqs and the .json and .html outputs of the fastp quality metrics
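As a rough sketch of the quality step, the function below prints a fastp command for one paired-end sample. Nothing is executed, so the command can be reviewed first. The input naming (`<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`), the report file names and the default thread count are assumptions for illustration, not the names used inside the container.

```shell
# Print the fastp command for one paired-end sample (nothing is executed).
# Assumed input naming: <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.
fastp_cmd() {
  sample="$1"          # sample name, without spaces
  threads="${2:-4}"    # hypothetical default
  echo "mkdir -p quality"
  echo "fastp -w ${threads}" \
       "-i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz" \
       "-o Q_${sample}_R1.fastq.gz -O Q_${sample}_R2.fastq.gz" \
       "-j quality/${sample}.json -h quality/${sample}.html"
}

fastp_cmd sampleA 8
```

To actually run the printed commands, pipe them into a shell: `fastp_cmd sampleA 8 | sh`.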
metaSPAdes or MEGAHIT assembly, followed by MetaQuast on contigs and scaffolds.
A cd-hit-est pass can also be done, although it takes a good amount of time (not recommended).
- Directory named: Q_"SAMPLE NAME"_SPADES_ASSEMBLY or Q_"SAMPLE NAME"_MEGAHIT_ASSEMBLY
It contains the assembly results.
- Directory named: contigs_Quast_"Q_SAMPLE_NAME"
It contains the metrics of MetaQuast for the assembly.
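The assembly step could be sketched as follows for the metaSPAdes route only (MEGAHIT is not covered here). The function prints the commands instead of running them. The read file names follow the "Q_" prefix convention, but the `_R1`/`_R2` suffixes and the thread default are assumptions.

```shell
# Print metaSPAdes and MetaQuast commands for one sample (nothing is executed).
# Assumed read naming: Q_<sample>_R1.fastq.gz and Q_<sample>_R2.fastq.gz.
assembly_cmds() {
  sample="$1"
  threads="${2:-8}"    # hypothetical default
  out="Q_${sample}_SPADES_ASSEMBLY"
  echo "metaspades.py -t ${threads}" \
       "-1 Q_${sample}_R1.fastq.gz -2 Q_${sample}_R2.fastq.gz -o ${out}"
  echo "metaquast.py -t ${threads}" \
       "${out}/contigs.fasta ${out}/scaffolds.fasta -o contigs_Quast_Q_${sample}"
}

assembly_cmds sampleA 16
```

Printing first makes it easy to check the output directory names before a long assembly is started; `assembly_cmds sampleA 16 | sh` runs them.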
This step takes the output of the assembly (scaffolds) and uses bowtie2 to generate the index and the .sam and .bam files needed for binning with MetaBAT.
CheckM is then used to check the quality of the bins.
- bowtieINDEX folder with bowtie results.
- scaffolds.sort.bam (and sam) and stats.
- bins folder with assembled bins.
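A minimal sketch of the mapping and binning step is shown below. It prints the commands rather than running them. samtools and `jgi_summarize_bam_contig_depths` (shipped with MetaBAT 2) are not named in this README and are assumed here, as are the `depth.txt` and `checkm_out` names and the read naming.

```shell
# Print mapping, binning and bin-quality commands (nothing is executed).
# Assumptions: samtools is installed; reads are Q_<sample>_R1/_R2.fastq.gz.
binning_cmds() {
  sample="$1"
  asm="$2"             # assembly directory containing scaffolds.fasta
  threads="${3:-8}"    # hypothetical default
  idx="bowtieINDEX/scaffolds"
  echo "mkdir -p bowtieINDEX bins"
  echo "bowtie2-build ${asm}/scaffolds.fasta ${idx}"
  echo "bowtie2 -p ${threads} -x ${idx}" \
       "-1 Q_${sample}_R1.fastq.gz -2 Q_${sample}_R2.fastq.gz -S scaffolds.sam"
  echo "samtools sort -@ ${threads} -o scaffolds.sort.bam scaffolds.sam"
  echo "samtools index scaffolds.sort.bam"
  echo "jgi_summarize_bam_contig_depths --outputDepth depth.txt scaffolds.sort.bam"
  echo "metabat2 -t ${threads} -i ${asm}/scaffolds.fasta -a depth.txt -o bins/bin"
  echo "checkm lineage_wf -t ${threads} -x fa bins checkm_out"
}

binning_cmds sampleA Q_sampleA_SPADES_ASSEMBLY 16
```

The depth file is what lets MetaBAT use coverage in addition to sequence composition, which is why the sorted BAM is produced before binning.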
Genes are predicted with Prodigal from the contigs.fasta file of the assembly.
- prodigal folder with the predicted proteins and genes.
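The prediction step might look like the sketch below, which prints a Prodigal command in metagenome mode. `protein_spades.fasta` is the file name used later in this README. The `genes_spades.*` names are hypothetical.

```shell
# Print the Prodigal command for an assembly (nothing is executed).
# genes_spades.fasta and genes_spades.gff are hypothetical output names.
prodigal_cmds() {
  contigs="$1"         # e.g. <assembly dir>/contigs.fasta
  echo "mkdir -p prodigal"
  echo "prodigal -p meta -i ${contigs}" \
       "-a prodigal/protein_spades.fasta -d prodigal/genes_spades.fasta" \
       "-f gff -o prodigal/genes_spades.gff"
}

prodigal_cmds Q_sampleA_SPADES_ASSEMBLY/contigs.fasta
```

Here `-a` writes the protein translations and `-d` the nucleotide sequences of the predicted genes; `prodigal_cmds <contigs> | sh` runs the commands.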
This is where it gets tricky, depending on the program and database used.
Here we are using MMseqs2 with eggNOG emapper. It uses the protein_spades.fasta file from the Prodigal prediction.
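As a hedged sketch of the emapper step, the function below prints the database download and annotation commands. It assumes a recent eggNOG-mapper 2.x release in which `download_eggnog_data.py -M` fetches the MMseqs2 database and `emapper.py -m mmseqs` selects MMseqs2 search. The `emapper` output folder and the data directory are hypothetical names.

```shell
# Print eggNOG-mapper commands in MMseqs2 mode (nothing is executed).
# Assumes eggNOG-mapper 2.x; folder names are hypothetical.
emapper_cmds() {
  proteins="$1"        # e.g. prodigal/protein_spades.fasta
  data_dir="$2"        # where the eggNOG databases are stored
  threads="${3:-8}"    # hypothetical default
  echo "mkdir -p ${data_dir} emapper"
  echo "download_eggnog_data.py -y -M --data_dir ${data_dir}"
  echo "emapper.py -m mmseqs --itype proteins -i ${proteins}" \
       "--data_dir ${data_dir} --cpu ${threads}" \
       "--output_dir emapper -o protein_spades"
}

emapper_cmds prodigal/protein_spades.fasta eggnog_db 16
```

The download only has to be done once, so in practice the second printed line can be dropped on later runs.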
As always, getting the databases is the issue.
Download : https://www.uniprot.org/help/downloads
A regular blastp of the protein_spades.fasta file obtained from the assembly:
blastp -query protein_spades.fasta -db uniprot_sprot.pep -num_threads $THREADS -out swissblast.fasta
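blastp needs a formatted database before the search above will work. The sketch below prints a `makeblastdb` command followed by a tabular variant of the search. The FASTA file name, the e-value cutoff, the tabular output format and the `swissblast.tsv` name are assumptions, not settings taken from the pipeline.

```shell
# Print BLAST database creation and search commands (nothing is executed).
# The e-value cutoff and tabular output are illustrative assumptions.
blast_cmds() {
  db_fasta="$1"        # downloaded UniProt/Swiss-Prot FASTA file
  threads="${2:-4}"    # hypothetical default
  echo "makeblastdb -in ${db_fasta} -dbtype prot -out uniprot_sprot.pep"
  echo "blastp -query protein_spades.fasta -db uniprot_sprot.pep" \
       "-num_threads ${threads} -evalue 1e-5 -outfmt 6 -out swissblast.tsv"
}

blast_cmds uniprot_sprot.fasta 8
```

Output format 6 gives one tab-separated line per hit, which is easier to filter than the default pairwise report.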
For diamond, the chosen database (reference.fasta here) first has to be converted to diamond's reference format:
./diamond makedb --in reference.fasta -d referenceDB
# running a search in blastp mode
./diamond blastp -d referenceDB -q protein_spades.fasta -o matches.tsv
First, the database has to be in profile form (UniProt or InterPro recommended):
hmmalign --amino hmmer_profileDATABASE protein_spades.fasta
###############################################################
Project 319234 awarded to Dr. Rosa María Gutierrez Ríos
#################################################################