To process 2022 Summer Metagenomics Data.
-
The files have been generated by the Illumina pipeline software v2.18.
-
The sequences are in FASTQ format. A FASTQ file may contain both "filtered" and "not filtered" reads, depending on the instrument used. Files from the NextSeq 500 contain only "not filtered" reads, i.e. reads that pass Illumina's chastity filter. The pass-filter status is indicated by a Y/N flag in each read's sequence header: "Y" means the read is filtered (out), i.e. flagged as low quality.
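The per-read flag can be tallied with awk (a self-contained demo on two synthetic records; on real data, pipe `zcat sample.fastq.gz` into the same awk):

```shell
# The filter flag is the 2nd colon-separated field of the header's second
# word, e.g. "1:N:0:ACTCTAGG" -> N (passed) or Y (filtered out).
printf '@r1 1:N:0:AAAA\nACGT\n+\nIIII\n@r2 1:Y:0:AAAA\nACGT\n+\nIIII\n' \
  | awk 'NR % 4 == 1 {split($2, a, ":"); c[a[2]]++}
         END {for (f in c) print f, c[f]}'
```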
-
Base quality scores are in Sanger FASTQ format (the offset is ASCII 33).
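For example, the quality character 'I' decodes to Phred 40 under this offset:

```shell
# Decode a FASTQ quality character under the Sanger convention:
# Phred score = ASCII code - 33. 'I' is ASCII 73, so Q = 40.
q_char="I"
ascii=$(printf '%d' "'$q_char")
echo $(( ascii - 33 ))   # → 40
```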
-
- Download shell scripts to the local machine
scp li.gua@xfer.discovery.neu.edu:/home/li.gua/scratch/ZIJIAN/CROPPS_2022_Summer/*.sh /Users/zijianleowang/Desktop/NEU_Server
- Learn Slurm and sbatch
sq-myjobs # see what jobs are running
set nu! # in vim, toggle line numbers
- Log in to the NEU cluster to process the data
ssh discovery
cd scratch/ZIJIAN/CROPPS_2022_Summer
- Get computation node
get-node-interactive
- Check conda environment
conda info --envs
- Make download.sh file
vim download.sh
#!/bin/bash
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_11N_ACTCTAGG_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=162668027&refid=985527"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_11N_ACTCTAGG_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1236375108&refid=985528"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_12N_TCTTACGC_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=13177059&refid=985529"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_12N_TCTTACGC_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=637362183&refid=985530"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_18N_CTTAATAG_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1805060388&refid=985531"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_18N_CTTAATAG_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=281761865&refid=985532"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT_R1.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=831132020&refid=985533"
wget -q -c -O 13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT_R2.fastq.gz "http://cbsuapps.biohpc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=169574154&refid=985534"
- Download Data from BioHPC
bash download.sh
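Before archiving, it is worth confirming each download is an intact gzip file (interrupted transfers are the usual cause of later `gunzip` errors); a minimal check:

```shell
# gzip -t test-decompresses each archive without writing any output;
# any FAIL line means that file should be re-downloaded.
for f in *.fastq.gz; do
  gzip -t "$f" && echo "OK   $f" || echo "FAIL $f"
done
```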
- Save it as a backup
sbatch --time 24:00:00 -c 8 -J gzip.raw --wrap="tar -czvf CROPPS_2022_Summer.sorting.tar.gz *.fastq"
mv CROPPS_2022_Summer.sorting.tar.gz /home/li.gua/Downloads/ZIJIAN/CROPPS_2022_Summer.sorting.tar.gz
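A checksum recorded next to the backup makes it possible to verify the copy later (a sketch using the archive name above):

```shell
# Record and later verify an md5 checksum of the backup archive.
md5sum CROPPS_2022_Summer.sorting.tar.gz > CROPPS_2022_Summer.sorting.tar.gz.md5
md5sum -c CROPPS_2022_Summer.sorting.tar.gz.md5   # prints "... OK" if intact
```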
- Extract your data (if this errors, try re-downloading the data)
gunzip *.gz
- Get paired file names for downstream analysis
bash get_pari.sh
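The pairing logic can be sketched in a few lines (assuming the `*_R1.fastq` / `*_R2.fastq` naming seen above):

```shell
# Emit "R1 R2" pairs by substituting _R1 -> _R2 in each forward-read name
# and keeping only pairs whose reverse-read file actually exists.
for r1 in *_R1.fastq; do
  r2="${r1/_R1/_R2}"
  [ -f "$r2" ] && echo "$r1 $r2"
done
```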
- Sanity check by Kaiju
- Install from a release on the right-hand side of the GitHub page
- make database
mkdir kaijudb
cd kaijudb
~/opt/kaiju/1.9.1/bin/kaiju-makedb -s <DB> # <DB> we choose refseq
- Run Kaiju
bash kaiju.sh
- Transform to a human-readable table
bash kaiju.table.sh
mkdir ~/Desktop/NEU_Server/kaiju
scp li.gua@xfer.discovery.neu.edu:/home/li.gua/scratch/ZIJIAN/CROPPS_2022_Summer/kaiju/*.tsv /Users/zijianleowang/Desktop/NEU_Server/kaiju/
- Trim reads
bash trim.sh
# Eye-check the quality of trimmed reads
vim ./err/trimmed/13697_32712_179493_H5LVWAFX5_CROPPS_11N_ACTCTAGG.err
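Besides eye-checking the logs, comparing read counts before and after trimming shows how much was removed; FASTQ stores 4 lines per record, so count = lines / 4 (self-contained demo; substitute the real trimmed file path):

```shell
# Build a 2-read demo FASTQ, then count records as line count / 4.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' > demo.fastq
echo $(( $(wc -l < demo.fastq) / 4 ))   # → 2
```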
- Install SPAdes
wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Linux.tar.gz
tar -xzf SPAdes-3.15.5-Linux.tar.gz
mkdir -p ~/opt/spades/3.15.5
mv SPAdes-3.15.5-Linux/* ~/opt/spades/3.15.5
- Run the assembly
bash spades.sh
- Sanity check by blastn
vim assemble/13697_32712_179493_H5LVWAFX5_CROPPS_18N_CTTAATAG.assembled/contigs.fasta
# copy a few sequences from the contigs into blastn to sanity-check whether they return meaningful hits or no significant match
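To grab just the first contig for pasting into blastn, awk can stop at the second header (demo on a synthetic FASTA; point it at contigs.fasta on real data):

```shell
# Print the first record of a FASTA: stop when the 2nd '>' header appears.
printf '>contig_1\nACGTACGT\n>contig_2\nTTTT\n' > demo.fasta
awk '/^>/{n++} n == 2 {exit} {print}' demo.fasta
```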
- check contigs statistics using quast
bash quast.sh
# copy to local and check html files
scp -r li.gua@xfer-00.discovery.neu.edu:/home/li.gua/scratch/ZIJIAN/CROPPS_2022_Summer/quast /Users/zijianleowang/Desktop/NEU_Server
- Filter contigs with anvio
conda activate anvio-7
bash anvio.filtsimp.sh # by default, remove contigs length less than 500 bps
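What the length filter does can be sketched in plain awk (a hedged approximation with synthetic demo input; the actual script presumably wraps an anvio helper): buffer each record and print it only if the sequence is at least `min` bp.

```shell
# Demo input: one 500 bp contig (kept) and one 4 bp contig (dropped).
{ printf '>keep\n'; printf 'A%.0s' $(seq 500); printf '\n>drop\nACGT\n'; } > demo.fasta
# Length-filter a FASTA: print a record only if its sequence >= min bp.
awk -v min=500 '
  /^>/ { if (seq != "" && length(seq) >= min) print hdr "\n" seq
         hdr = $0; seq = ""; next }
       { seq = seq $0 }
  END  { if (seq != "" && length(seq) >= min) print hdr "\n" seq }
' demo.fasta
```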
- map trimmed reads back to filtered contigs
bash bowtie2.build.sh # build the reference
bash bowtie2.sh # by default -X 1000 for maximum length fragment
- downstream mapping processing
bash samtool.view.sh
bash samtool.sort.sh
bash picard.sh # remove duplicates
bash samtool.index.sh # to index bam files
- Create Anvio Contigs Database by official tutorial
bash anvio.gen.contigs.db.sh # FASTA files into contigs-db files
bash anvio.run.hmms.sh # decorate your contigs database with hits from HMM models
sbatch --time 24:00:00 -c 16 -J db --wrap=". ~/.bashrc; conda activate anvio-7; anvi-setup-ncbi-cogs -T 16 --just-do-it" # NOTE: run this first if the command below reports no database in its log files
bash anvio.run.ncbi.cogs.sh # annotate genes in your contigs-db with functions from the NCBI's Clusters of Orthologous Groups
- Profiling BAM files by official tutorial
bash anvio.profile.sh # default min contigs length is 500
- Binning
bash anvio.merge.sh # merge profile db
# to run the command below successfully, we need to edit a python script:
# vim /home/li.gua/.local/env/python-3.10.4-2023-01-27-venv-anvio-7/lib/python3.10/site-packages/sklearn/utils/validation.py
# change line 1894 to feature_names = np.asarray(X.columns.astype(str), dtype=object) because sklearn requires all column names to be strings
bash anvio.cluster.contigs.sh # by default, using 500 length and
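The manual edit described in the comments above can also be applied with sed (same path as in the comment; `-i.bak` keeps a backup of the original file):

```shell
# Patch sklearn's validation.py so DataFrame column names are cast to str.
sed -i.bak \
  's/np\.asarray(X\.columns, dtype=object)/np.asarray(X.columns.astype(str), dtype=object)/' \
  /home/li.gua/.local/env/python-3.10.4-2023-01-27-venv-anvio-7/lib/python3.10/site-packages/sklearn/utils/validation.py
```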
- Visualize
# in the first local terminal window, forward port 8080 to discovery
ssh -L 8080:localhost:8080 discovery
# at one server window
get-node-interactive
conda activate anvio-7
cd scratch/ZIJIAN/CROPPS_2022_Summer/
anvi-interactive -p anvio/anvio.profile/13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT.db/PROFILE_merged/PROFILE.db -c anvio/anvio.gen.contigs.db/13697_32712_179493_H5LVWAFX5_CROPPS_22N_ATAGCCTT.db --server-only -P 8080 --show-all-layers -C concoct
# at another server window
ssh discovery
ssh -L 8080:localhost:8080 $node_name_of_above
- Check in a local browser
http://localhost:8080/ # open in a web browser
# click Draw