To process 2022 Summer Metagenomics Data.
-
The files have been generated by the Illumina pipeline software v2.18.
-
The sequences are in FASTQ format. A FASTQ file may contain both "filtered" and "not filtered" reads, depending on the instrument used. Files from the NextSeq 500 contain only "not filtered" reads, i.e. reads that pass Illumina's chastity filter. The pass-filter status is indicated by a Y/N flag in each read's sequence header: "Y" means the read is filtered (out), i.e. flagged as low quality.
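The per-read flag can be tallied with awk (a self-contained demo on two synthetic records; on real data, pipe `zcat sample.fastq.gz` into the same awk):

```shell
# The filter flag is the 2nd colon-separated field of the header's second
# word, e.g. "1:N:0:ACTCTAGG" -> N (passed) or Y (filtered out).
printf '@r1 1:N:0:AAAA\nACGT\n+\nIIII\n@r2 1:Y:0:AAAA\nACGT\n+\nIIII\n' \
  | awk 'NR % 4 == 1 {split($2, a, ":"); c[a[2]]++}
         END {for (f in c) print f, c[f]}'
```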
-
Base quality scores are in Sanger FASTQ format (the offset is ASCII 33).
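For example, the quality character 'I' decodes to Phred 40 under this offset:

```shell
# Decode a FASTQ quality character under the Sanger convention:
# Phred score = ASCII code - 33. 'I' is ASCII 73, so Q = 40.
q_char="I"
ascii=$(printf '%d' "'$q_char")
echo $(( ascii - 33 ))   # → 40
```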
-
- Download shell scripts to the local machine
scp li.gua@xfer.discovery.neu.edu:/home/li.gua/scratch/ZIJIAN/CROPPS_2022_Summer/*.sh /Users/zijianleowang/Desktop/NEU_Server
- Learn Slurm and sbatch
sq-myjobs # see what jobs are running
set nu! # in vim, toggle line numbers
- Log in to the NEU cluster to process the data
ssh discovery
cd scratch/ZIJIAN/CROPPS_2022_Summer
- Get computation node
get-node-interactive
- Check conda environment
conda info --envs
- Make download.sh file
vim download.sh
#!/bin/bash
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_11N_ACTCTAGG_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=162668027&refid=985527"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_11N_ACTCTAGG_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1236375108&refid=985528"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_12N_TCTTACGC_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=13177059&refid=985529"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_12N_TCTTACGC_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=637362183&refid=985530"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_18N_CTTAATAG_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1805060388&refid=985531"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_18N_CTTAATAG_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=281761865&refid=985532"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=831132020&refid=985533"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=169574154&refid=985534"
- Download Data from BioHPC
bash download.sh
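Before archiving, it is worth confirming each download is an intact gzip file (interrupted transfers are the usual cause of later `gunzip` errors); a minimal check:

```shell
# gzip -t test-decompresses each archive without writing any output;
# any FAIL line means that file should be re-downloaded.
for f in *.fastq.gz; do
  gzip -t "$f" && echo "OK   $f" || echo "FAIL $f"
done
```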
- Save it as a backup
sbatch --time 24:00:00 -c 8 -J gzip.raw --wrap="tar -czvf CROPPS_2022_Summer.sorting.tar.gz *.fastq"
mv CROPPS_2022_Summer.sorting.tar.gz /home/li.gua/Downloads/ZIJIAN/CROPPS_2022_Summer.sorting.tar.gz
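A checksum recorded next to the backup makes it possible to verify the copy later (a sketch using the archive name above):

```shell
# Record and later verify an md5 checksum of the backup archive.
md5sum CROPPS_2022_Summer.sorting.tar.gz > CROPPS_2022_Summer.sorting.tar.gz.md5
md5sum -c CROPPS_2022_Summer.sorting.tar.gz.md5   # prints "... OK" if intact
```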
- Extract your data (if this errors, try re-downloading the data)
gunzip *.gz
- Get paired file names for downstream analysis
bash get_pari.sh
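The pairing logic can be sketched in a few lines (assuming the `*_R1.fastq` / `*_R2.fastq` naming seen above):

```shell
# Emit "R1 R2" pairs by substituting _R1 -> _R2 in each forward-read name
# and keeping only pairs whose reverse-read file actually exists.
for r1 in *_R1.fastq; do
  r2="${r1/_R1/_R2}"
  [ -f "$r2" ] && echo "$r1 $r2"
done
```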
- Sanity check by Kaiju
- Install from a release on the right-hand side of the GitHub page
- make database
mkdir kaijudb
cd kaijudb
~/opt/kaiju/1.9.1/bin/kaiju-makedb -s <DB> # <DB> we choose refseq
- Run Kaiju
bash kaiju.sh
- Transform to a human-readable table
bash kaiju.table.sh
mkdir ~/Desktop/NEU_Server/kaiju
scp li.gua@xfer.discovery.neu.edu:/home/li.gua/scratch/ZIJIAN/CROPPS_2022_Summer/kaiju/*.tsv /Users/zijianleowang/Desktop/NEU_Server/kaiju/
- Trim reads
bash trim.sh
# Eye-check the quality of trimmed reads
vim ./err/trimmed/13697_32712_179493_H5LVWAFX5_CROPPS_11N_ACTCTAGG.err
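Besides eye-checking the logs, comparing read counts before and after trimming shows how much was removed; FASTQ stores 4 lines per record, so count = lines / 4 (self-contained demo; substitute the real trimmed file path):

```shell
# Build a 2-read demo FASTQ, then count records as line count / 4.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' > demo.fastq
echo $(( $(wc -l < demo.fastq) / 4 ))   # → 2
```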
- Install SPAdes
wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Linux.tar.gz
tar -xzf SPAdes-3.15.5-Linux.tar.gz
mkdir -p ~/opt/spades/3.15.5
mv SPAdes-3.15.5-Linux/* ~/opt/spades/3.15.5
- Run the assembly
bash spades.sh
- Sanity check by blastn
vim assemble/13697_32712_179493_H5LVWAFX5_CROPPS_18N_CTTAATAG.assembled/contigs.fasta
# copy a few sequences from the contigs into blastn to sanity-check whether they return meaningful hits or no significant match
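To grab just the first contig for pasting into blastn, awk can stop at the second header (demo on a synthetic FASTA; point it at contigs.fasta on real data):

```shell
# Print the first record of a FASTA: stop when the 2nd '>' header appears.
printf '>contig_1\nACGTACGT\n>contig_2\nTTTT\n' > demo.fasta
awk '/^>/{n++} n == 2 {exit} {print}' demo.fasta
```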
- check contigs statistics using quast
bash quast.sh
# copy to local and check html files
scp -r li.gua@xfer-00.discovery.neu.edu:/home/li.gua/scratch/ZIJIAN/CROPPS_2022_Summer/quast /Users/zijianleowang/Desktop/NEU_Server
- Filter contigs with anvio
conda activate anvio-7
bash anvio.filtsimp.sh # by default, remove contigs length less than 500 bps
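What the length filter does can be sketched in plain awk (a hedged approximation with synthetic demo input; the actual script presumably wraps an anvio helper): buffer each record and print it only if the sequence is at least `min` bp.

```shell
# Demo input: one 500 bp contig (kept) and one 4 bp contig (dropped).
{ printf '>keep\n'; printf 'A%.0s' $(seq 500); printf '\n>drop\nACGT\n'; } > demo.fasta
# Length-filter a FASTA: print a record only if its sequence >= min bp.
awk -v min=500 '
  /^>/ { if (seq != "" && length(seq) >= min) print hdr "\n" seq
         hdr = $0; seq = ""; next }
       { seq = seq $0 }
  END  { if (seq != "" && length(seq) >= min) print hdr "\n" seq }
' demo.fasta
```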
- map trimmed reads back to filtered contigs
bash bowtie2.build.sh # build the reference
bash bowtie2.sh # by default -X 1000 for maximum length fragment
- downstream mapping processing
bash samtool.view.sh
bash samtool.sort.sh
bash picard.sh # remove duplicates
bash samtool.index.sh # to index bam files
- Create Anvio Contigs Database by official tutorial
bash anvio.gen.contigs.db.sh # FASTA files into contigs-db files
bash anvio.run.hmms.sh # decorate your contigs database with hits from HMM models
sbatch --time 24:00:00 -c 16 -J db --wrap=". ~/.bashrc; conda activate anvio-7; anvi-setup-ncbi-cogs -T 16 --just-do-it" # NOTE: run this first if the command below reports no database in its log files
bash anvio.run.ncbi.cogs.sh # annotate genes in your contigs-db with functions from the NCBI's Clusters of Orthologous Groups
- Profiling BAM files by official tutorial
bash anvio.profile.sh # default min contigs length is 500
- Binning
bash anvio.merge.sh # merge profile db
# to run the command below successfully, we need to edit a python script:
# vim /home/li.gua/.local/env/python-3.10.4-2023-01-27-venv-anvio-7/lib/python3.10/site-packages/sklearn/utils/validation.py
# change line 1894 to feature_names = np.asarray(X.columns.astype(str), dtype=object) because sklearn requires all column names to be strings
bash anvio.cluster.contigs.sh # by default, using 500 length and
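The manual edit described in the comments above can also be applied with sed (same path as in the comment; `-i.bak` keeps a backup of the original file):

```shell
# Patch sklearn's validation.py so DataFrame column names are cast to str.
sed -i.bak \
  's/np\.asarray(X\.columns, dtype=object)/np.asarray(X.columns.astype(str), dtype=object)/' \
  /home/li.gua/.local/env/python-3.10.4-2023-01-27-venv-anvio-7/lib/python3.10/site-packages/sklearn/utils/validation.py
```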
- Visualize
# in the first local terminal window, forward port 8080 to discovery
ssh -L 8080:localhost:8080 discovery
# at one server window
get-node-interactive
conda activate anvio-7
cd scratch/ZIJIAN/CROPPS_2022_Summer/
anvi-interactive -p anvio/anvio.profile/13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT.db/PROFILE_merged/PROFILE.db -c anvio/anvio.gen.contigs.db/13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT.db --server-only -P 8080 --show-all-layers -C concoct
# at another server window
ssh discovery
ssh -L 8080:localhost:8080 $node_name_of_above
- Check in a local browser
http://localhost:8080/ # open in a web browser
# click Draw