Workflow for Metagenomics Data in 2022Summer Maize Rhizomicrobiome

ZJLEOWANG3/Metagenomics

This workflow processes the 2022 Summer metagenomics data.

  • The files have been generated by the Illumina pipeline software v2.18.

  • The sequences are in FASTQ format. Depending on the instrument, a FASTQ file may contain both "filtered" and "not filtered" reads. Files from the NextSeq500 contain only "not filtered" reads, i.e. reads that passed Illumina's Chastity filter. Each read's sequence header carries a filter flag (Y/N): "Y" means the read was filtered out, i.e. flagged as low quality; "N" means it passed.

  • Base quality scores are in Sanger FASTQ format (the offset is ASCII 33).

  • Download the shell scripts from the cluster to the local machine

scp li.gua@xfer.discovery.neu.edu:/home/li.gua/scratch/ZIJIAN/CROPPS_2022_Summer/*.sh /Users/zijianleowang/Desktop/NEU_Server
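The filter flag and the ASCII-33 quality offset described above can be checked directly on a FASTQ file. The following is a minimal sketch, assuming Illumina headers of the form `@<id> <read>:<Y|N>:<control>:<index>` and plain four-line FASTQ records; the function names `count_filter_flags` and `phred33_scores` are made up for this example and are not part of the repository.

```shell
# Count reads by Illumina filter flag in an uncompressed FASTQ file.
# Assumption: the second whitespace-separated header field looks like
# "<read>:<Y|N>:<control>:<index>", where Y means "filtered out".
count_filter_flags() {
    awk 'NR % 4 == 1 {            # header line of each 4-line record
             split($2, f, ":")    # f[2] holds the filter flag
             n[f[2]]++
         }
         END { printf "Y=%d N=%d\n", n["Y"], n["N"] }' "$1"
}

# Decode a Sanger (Phred+33) quality string into numeric scores.
phred33_scores() {
    printf '%s' "$1" |
        od -An -v -tu1 |          # one unsigned decimal per byte
        awk '{ for (i = 1; i <= NF; i++) { printf "%s%d", sep, $i - 33; sep = " " } }
             END { print "" }'
}
```

For NextSeq500 files, `count_filter_flags sample_R1.fastq` would be expected to report `Y=0`, since only reads that passed the filter are present.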

Some Hints

  • Learn Slurm and Sbatch here
sq-myjobs # see which jobs are running
:set nu! # in vim, toggle line numbers
  • Log in to the NEU cluster to process the data
ssh discovery
cd scratch/ZIJIAN/CROPPS_2022_Summer
  • Get a compute node
get-node-interactive 
  • Check conda environment
conda info --envs
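Later steps submit work with `sbatch --wrap`. For longer commands, a small batch script is often easier to read and rerun. The sketch below only prints such a script; `make_sbatch_script` is a hypothetical helper, and the `#SBATCH` lines are the long forms of the `-J`, `-c` and `--time` options used elsewhere in this README.

```shell
# Print a minimal Slurm batch script to stdout.
# Usage: make_sbatch_script JOB_NAME CPUS WALLTIME COMMAND...
make_sbatch_script() {
    job=$1; cpus=$2; walltime=$3
    shift 3
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=$job
#SBATCH --cpus-per-task=$cpus
#SBATCH --time=$walltime
$*
EOF
}
```

For example, `make_sbatch_script kaiju 16 24:00:00 bash kaiju.sh > kaiju.job.sh` followed by `sbatch kaiju.job.sh` would submit the Kaiju step; the file name `kaiju.job.sh` is only an illustration.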

Download

  • Create the download.sh file
vim download.sh
#!/bin/bash
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_11N_ACTCTAGG_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=162668027&refid=985527"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_11N_ACTCTAGG_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1236375108&refid=985528"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_12N_TCTTACGC_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=13177059&refid=985529"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_12N_TCTTACGC_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=637362183&refid=985530"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_18N_CTTAATAG_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1805060388&refid=985531"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_18N_CTTAATAG_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=281761865&refid=985532"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=831132020&refid=985533"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=169574154&refid=985534"
  • Download Data from BioHPC
bash download.sh
  • Save it as a backup
sbatch --time 24:00:00 -c 8 -J gzip.raw --wrap="tar -czvf CROPPS_2022_Summer.sorting.tar.gz *.fastq.gz" # the downloads are still gzipped at this step
mv CROPPS_2022_Summer.sorting.tar.gz /home/li.gua/Downloads/ZIJIAN/CROPPS_2022_Summer.sorting.tar.gz # run after the sbatch job has finished
  • Extract the data (if gunzip reports an error, re-download the affected files)
gunzip *.gz
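A truncated download usually shows up as a gzip error. One way to find the affected files before extracting is `gzip -t`, which tests an archive without writing anything. This is a small sketch; `check_gz` is a hypothetical helper name.

```shell
# Report which of the given .gz files fail gzip's integrity test.
# Prints one "CORRUPT <file>" line per bad file; returns non-zero if any failed.
check_gz() {
    status=0
    for f in "$@"; do
        if ! gzip -t "$f" 2>/dev/null; then
            echo "CORRUPT $f"
            status=1
        fi
    done
    return $status
}
```

Running `check_gz *.fastq.gz && gunzip *.fastq.gz` extracts only when every archive is intact; any file listed as CORRUPT can be fetched again with its `wget` line from download.sh.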

Prepare PE Name

  • Get the paired-end file names for downstream analysis
bash get_pari.sh
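The contents of the pairing script are not shown here. As an illustration of the idea only, the sketch below lists every sample prefix that has both an `_R1.fastq` and an `_R2.fastq` file, which is the naming pattern produced by the download step after extraction. `list_pairs` is a hypothetical name, and the real script may work differently.

```shell
# List sample prefixes that have both an R1 and an R2 FASTQ file in DIR.
# Assumes the naming pattern <prefix>_R1.fastq / <prefix>_R2.fastq.
list_pairs() {
    dir=${1:-.}
    for r1 in "$dir"/*_R1.fastq; do
        [ -e "$r1" ] || continue          # no match: the glob stays literal
        prefix=${r1%_R1.fastq}
        if [ -e "${prefix}_R2.fastq" ]; then
            basename "$prefix"
        else
            echo "missing R2 for $(basename "$prefix")" >&2
        fi
    done
}
```

The printed prefixes can be redirected to a text file and read line by line by the later trimming, assembly and mapping scripts.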

Sanity Check by Kaiju

  • Install Kaiju from the Releases section on the right-hand side of its GitHub page
  • Build the database
mkdir kaijudb
cd kaijudb
~/opt/kaiju/1.9.1/bin/kaiju-makedb -s <DB> # replace <DB> with the database name; we chose refseq
  • Run Kaiju
kaiju.sh
  • Convert the output to a human-readable table
kaiju.table.sh
  • Check the results (figure: kaiju example)

  • Download the tables to the local machine

mkdir ~/Desktop/NEU_Server/kaiju
scp li.gua@xfer.discovery.neu.edu:/home/li.gua/scratch/ZIJIAN/CROPPS_2022_Summer/kaiju/*.tsv /Users/zijianleowang/Desktop/NEU_Server/kaiju/
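Once the tables are local, a quick look at the most abundant taxa is usually enough for a sanity check. The sketch below assumes the table has a header row and the tab-separated columns `file`, `percent`, `reads`, `taxon_id`, `taxon_name`; that layout is an assumption about what kaiju.table.sh writes, so check the first line of a real table before relying on it. `top_taxa` is a hypothetical helper.

```shell
# Print the N most abundant taxa (percent and name) from a Kaiju summary table.
# Assumed columns: file, percent, reads, taxon_id, taxon_name (tab-separated, with header).
top_taxa() {
    tsv=$1; n=${2:-10}
    tail -n +2 "$tsv" |                           # drop the header row
        sort -t "$(printf '\t')" -k2,2nr |        # sort by percent, descending
        head -n "$n" |
        cut -f 2,5                                # keep percent and taxon name
}
```

For a rhizosphere sample, seeing mostly soil and plant-associated bacteria near the top would suggest the reads are what they should be.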

QC & Trim

bash trim.sh
# Eye-check the quality of trimmed reads
vim ./err/trimmed/13697_32712_179493_H5LVWAFX5_CROPPS_11N_ACTCTAGG.err 
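Besides reading the log, it can help to compare read counts before and after trimming. This is a minimal sketch for uncompressed four-line FASTQ files; `count_reads` and `percent_kept` are hypothetical helpers, and the paths of the trimmed files depend on how trim.sh is configured.

```shell
# Count reads in an uncompressed FASTQ file (4 lines per read).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Percentage of reads kept after trimming, to one decimal place.
# Usage: percent_kept RAW.fastq TRIMMED.fastq
percent_kept() {
    raw=$(count_reads "$1"); trimmed=$(count_reads "$2")
    awk -v r="$raw" -v t="$trimmed" \
        'BEGIN { if (r > 0) printf "%.1f\n", 100 * t / r; else print "NA" }'
}
```

A sharp drop in the percentage for one sample compared with the others is a hint to look at that sample's log more closely.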

Assembly

wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Linux.tar.gz
tar -xzf SPAdes-3.15.5-Linux.tar.gz
mkdir -p ~/opt/spades/3.15.5 # make sure the target directory exists
mv SPAdes-3.15.5-Linux/* ~/opt/spades/3.15.5
  • Run the assembly
spades.sh
vim assemble/13697_32712_179493_H5LVWAFX5_CROPPS_18N_CTTAATAG.assembled/contigs.fasta
# sanity check: paste a few contig sequences into blastn and see whether they give meaningful hits or no significant match
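Instead of copying sequences out of vim by hand, the first few contigs can be printed with a short awk filter and pasted into the blastn web form. `first_contigs` is a hypothetical helper written for this example.

```shell
# Print the first N records of a FASTA file (default 3).
# Usage: first_contigs contigs.fasta [N]
first_contigs() {
    fasta=$1; n=${2:-3}
    awk -v n="$n" '/^>/ { if (++seen > n) exit } { print }' "$fasta"
}
```

For example, `first_contigs assemble/13697_32712_179493_H5LVWAFX5_CROPPS_18N_CTTAATAG.assembled/contigs.fasta 3` prints the first three contigs of that assembly, headers included.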

Contigs QC

  • Check contig statistics with QUAST
bash quast.sh
# copy to local and check html files
scp -r li.gua@xfer-00.discovery.neu.edu:/home/li.gua/scratch/ZIJIAN/CROPPS_2022_Summer/quast /Users/zijianleowang/Desktop/NEU_Server
  • The QUAST report.html looks like the figure below (figure: quast report)
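One of the headline numbers in the QUAST report is N50. To make clear what it measures, here is a small sketch that computes it directly from a FASTA file: sort contig lengths in descending order and report the length at which the running total reaches half of all bases. `fasta_lengths` and `n50` are hypothetical helpers. The value may differ from QUAST's, since QUAST by default ignores contigs below a minimum length while this sketch counts every contig.

```shell
# Print sequence lengths of a FASTA file, one per line (handles wrapped sequences).
fasta_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# N50: the length L such that contigs of length >= L hold at least half of all bases.
n50() {
    fasta_lengths "$1" | sort -nr |
        awk '{ len[NR] = $1; total += $1 }
             END {
                 for (i = 1; i <= NR; i++) {
                     sum += len[i]
                     if (2 * sum >= total) { print len[i]; exit }
                 }
             }'
}
```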

  • Filter contigs and simplify their names with anvi'o

conda activate anvio-7
bash anvio.filtsimp.sh # by default, removes contigs shorter than 500 bp
  • Map the trimmed reads back to the filtered contigs
bash bowtie2.build.sh # build the reference index
bash bowtie2.sh # by default, -X 1000 sets the maximum fragment length
  • Downstream processing of the mappings
bash samtool.view.sh
bash samtool.sort.sh
bash picard.sh # remove duplicates
bash samtool.index.sh # to index bam files
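The wrapper scripts above hide the actual tool calls. As a rough guide to what such a chain typically looks like, the sketch below prints (but does not run) one plausible command sequence for a single sample. Everything here is an assumption: the helper `print_mapping_commands`, all file names, and the choice of Picard's `MarkDuplicates` with the legacy `KEY=value` argument style. The repository's scripts may use different names and options.

```shell
# Print (do not run) one plausible command sequence for mapping a sample.
# All file names are hypothetical; the repository's scripts may differ.
print_mapping_commands() {
    s=$1
    cat <<EOF
bowtie2-build $s.contigs.fa $s.index
bowtie2 -x $s.index -1 ${s}_R1.trimmed.fastq -2 ${s}_R2.trimmed.fastq -X 1000 -S $s.sam
samtools view -b -o $s.bam $s.sam
samtools sort -o $s.sorted.bam $s.bam
picard MarkDuplicates I=$s.sorted.bam O=$s.dedup.bam M=$s.dup_metrics.txt REMOVE_DUPLICATES=true
samtools index $s.dedup.bam
EOF
}
```

The order matters: duplicate marking expects a coordinate-sorted BAM, and indexing comes last so that the index matches the final deduplicated file.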

Binning

bash anvio.gen.contigs.db.sh # convert FASTA files into contigs-db files
bash anvio.run.hmms.sh # decorate the contigs database with hits from HMM models
sbatch --time 24:00:00 -c 16 -J db --wrap=". ~/.bashrc; conda activate anvio-7; anvi-setup-ncbi-cogs -T 16 --just-do-it" # IMPORTANT: run this if the log files of the command below report that the COG database is missing
bash anvio.run.ncbi.cogs.sh # annotate genes in the contigs-db with functions from NCBI's Clusters of Orthologous Groups
bash anvio.profile.sh # default minimum contig length is 500
  • Binning
bash anvio.merge.sh # merge profile db
# for the command below to run successfully, one line of the installed scikit-learn must be edited
# vim /home/li.gua/.local/env/python-3.10.4-2023-01-27-venv-anvio-7/lib/python3.10/site-packages/sklearn/utils/validation.py
# change line 1894 to: feature_names = np.asarray(X.columns.astype(str), dtype=object), because all column names must be strings
bash anvio.cluster.contigs.sh # by default, using 500 length and 
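The manual edit to validation.py described in the comments above can also be applied with `sed`, which keeps a backup and is easy to repeat after reinstalling the environment. This is a sketch under one loud assumption: that the unpatched line contains exactly `np.asarray(X.columns, dtype=object)`. If the installed scikit-learn version differs, the substitution simply does nothing. `patch_feature_names` is a hypothetical helper.

```shell
# Replace X.columns with X.columns.astype(str) in the given file, keeping FILE.bak.
# Assumption: the unpatched text is exactly "np.asarray(X.columns, dtype=object)".
patch_feature_names() {
    sed -i.bak \
        's/np\.asarray(X\.columns, dtype=object)/np.asarray(X.columns.astype(str), dtype=object)/' \
        "$1"
}
```

Running it twice is harmless, because the pattern no longer matches once the line has been patched. Note that a second run overwrites the `.bak` file with the already patched version, so copy the first backup elsewhere if the original is needed.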
  • Visualize
# first window: log in to the cluster with port 8080 forwarded
ssh -L 8080:localhost:8080 discovery
# in that same window: get a compute node and start the anvi'o server
get-node-interactive
conda activate anvio-7
cd scratch/ZIJIAN/CROPPS_2022_Summer/
anvi-interactive -p anvio/anvio.profile/13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT.db/PROFILE_merged/PROFILE.db -c anvio/anvio.gen.contigs.db/13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT.db --server-only -P 8080 --show-all-layers -C concoct
# second window: forward port 8080 from the login node to the compute node obtained above
ssh discovery
ssh -L 8080:localhost:8080 $node_name_of_above
  • Check on the local machine
http://localhost:8080/ # open in a browser, e.g. Google Chrome
# click "Draw"
  • Example anvi'o binning (figure: Anvio-interact Bin)

ORF

Binning
