-
Notifications
You must be signed in to change notification settings - Fork 0
05_COMAPPING
I decided to go back and reformat the contigs so that each contig had a prefix indicating which sample it belonged to. The reasoning for this is that we might be able to have a central Anvio profile which includes contigs from all samples.
One issue I saw was that using a cutoff threshold of 2500 was eliminating 70% of the contigs in the sample. This could be a huge disadvantage, as we are essentially throwing away a majority of the data at this step. But I tried it anyway.
for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2019`; \
do anvi-script-reformat-fasta $SET/$SET.final.contigs.fa -l 2500 \
--simplify-names -o $SET/$SET.fa; anvi-gen-contigs-database \
--num-threads 40 -f $SET/$SET.fa -o $SET/$SET.contigs.db; \
done
# redid the contig databases after reformatting the contigs from each sample with
# their respective sample name prefixes (changes "-" to "_" in these names):
for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
do bar=${SET//-/_}; anvi-script-reformat-fasta $SET/final.contigs.fa -l 1000 \
--simplify-names --prefix c_$bar -o $SET/$SET.prefixed.fa; \
done
for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
do anvi-gen-contigs-database --num-threads 40 -f $SET/$SET.prefixed.fa \
-o $SET/$SET.prefixed.contigs.db& \
done
while read line;
do
SET=$(echo $line);
samples=$(echo "$(cat AMOR_2020_Good)"); # return all entries in file
# delimiter=",";
# declare -a Smparray=($(echo $samples | tr "$delimiter" " ")); # create sample name array
# for samp in "${Smparray[@]}";
# do
bowtie2 --threads 40 \
-x 05_COMAPPING/$SET \
-1 02_HUMAN_Decontam/$SET-cleanR1.fq \
-2 02_HUMAN_Decontam/$SET-cleanR2.fq \
--no-unal \
-S 05_COMAPPING/$SET/$SET.sam;
samtools view -F 4 -bS 05_COMAPPING/$SET/$SET.sam > 05_COMAPPING/$SET/$SET-RAW.bam;
samtools sort 05_COMAPPING/$SET/$SET-RAW.bam -o 05_COMAPPING/$SET/$SET.bam;
samtools index 05_COMAPPING/$SET/$SET.bam;
rm 05_COMAPPING/$SET/$SET.sam 05_COMAPPING/$SET/$SET-RAW.bam;
# done;
done < AMOR_2020_Good
# For 2019 samples, made a symlink to the QC'ed data and ran their own loop
while read line;
do
SET=$(echo $line);
samples=$(echo "$(cat AMOR_2019)"); # return all entries in file
bowtie2 --threads 40 \
-x 05_COMAPPING/$SET \
-1 05_COMAPPING/01_QC/$SET-QUALITY_PASSED_R1.fastq \
-2 05_COMAPPING/01_QC/$SET-QUALITY_PASSED_R2.fastq \
--no-unal \
-S 05_COMAPPING/$SET/$SET.sam;
samtools view -F 4 -bS 05_COMAPPING/$SET/$SET.sam > 05_COMAPPING/$SET/$SET-RAW.bam;
samtools sort 05_COMAPPING/$SET/$SET-RAW.bam -o 05_COMAPPING/$SET/$SET.bam;
samtools index 05_COMAPPING/$SET/$SET.bam;
rm 05_COMAPPING/$SET/$SET.sam 05_COMAPPING/$SET/$SET-RAW.bam;
done < AMOR_2019
2. Second, I made a concatenated contigs file that includes all contigs of all samples we want to use in this study.
I concatenate the contig files using the prefixed contig files in each sample subdirectory. A disadvantage of this method is that there are probably many reads that are able to map to many contigs, and this could make the data "noisy and difficult" https://groups.google.com/g/anvio/c/G-PXEjqcbmc?pli=1.
find -type f -name '*prefixed.fa' -exec cat {} + > merged.contigs.fa
anvi-script-reformat-fasta -l 2500 merged.contigs.fa -o merged.contigs2500.fa
anvi-gen-contigs-database --num-threads 40 -f merged.contigs2500.fa -o merged.contigs2500.db
This is what I would call "co-mapping" rather than "co-assembly". This is for the sake of improving differential coverage information for binning later.
# Decided to switch over to the SAGA server for this task. Bowtie does not parallelize.
# this means that we are either SOL (on kjempefuru) or we need HPC.
# So I recreated the indices on SAGA.
# Create folders for each of the mappings to live, named for the assembly sample mapped to.
samples=$(echo "$(cat /cluster/projects/nn9836k/Metagenomics_AMOR_2020/05_COMAPPING/AMOR_2020_Good)")
for i in $samples; do mkdir $i; done
# Build the bowtie2 databases
for SET in `cat AMOR_2020_Good`; do bowtie2-build 03_INDIV_ASSEMBLY/$SET/$SET.prefixed.fa \
05_COMAPPING/INDICES/$SET --threads 20; \
done
#### BOWTIE BUILD
# copy contig data from kjempefuru to SAGA (within kjempefuru)
tocopy=$(find -type f -name '*.prefixed.fa') # everything's in subdirectories. Annoying.
scp $tocopy ede041@saga.sigma2.no:/cluster/projects/nn9836k/Metagenomics_AMOR_2020/04_CONTIGS/
# Also copied the simple text files containing the list of sample names to process.
# THEN -- and I'm very happy I took some time to do this... I figured out how to do a thing:
# reference: https://www.blopig.com/blog/2020/02/using-slurm-a-little-bit-more-efficiently/
# 1. I created a template slurm script for the jobs to use, with <samp> where sample names should go.
# just called it bowtie_template.sh
# 2. Then I created a script that found / replaced the placeholder with sample names I want, and made
./Make_child_scripts.sh
# new baby scripts that had the respective sample name in the script filename. They all look like
# the template, with respective sample info filled in like so:
#!/usr/bin/bash
# every job must be accounted for
#SBATCH --account=nn9836k
#SBATCH --job-name=14ROV7-A_to_12ROV10-HD34D
# every job requires some specification of the number of cores to be used
#SBATCH --ntasks=1
# every job requires some specification of the memory (RAM) it needs
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=5G
# every job requires a runtime limit
#SBATCH --time=5:00:00
# setting up software environment
module purge
module load Bowtie2/2.4.2-GCC-10.2.0
module load SAMtools/1.11-GCC-10.2.0
# set up paths
INDEX_PATH="/cluster/projects/nn9836k/Metagenomics_AMOR_2020/05_COMAPPING/INDICES"
READS_PATH="/cluster/projects/nn9836k/Metagenomics_AMOR_2020/02_NOHUMAN"
COMAP_PATH="/cluster/projects/nn9836k/Metagenomics_AMOR_2020/05_COMAPPING"
# Run bowtie2 on the data
bowtie2 --threads 40 -x ${INDEX_PATH}/12ROV10-HD34D \
-1 ${READS_PATH}/14ROV7-A-cleanR1.fq -2 ${READS_PATH}/14ROV7-A-cleanR2.fq \
--no-unal -S ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A.sam
samtools view -F 4 -bS ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A.sam > ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A-RAW.bam
samtools sort ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A-RAW.bam -o ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A.bam
samtools index ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A.bam
rm ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A.sam ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A-RAW.bam
# 3. I made a script that calls all the slurm jobs!
./Submit_slurms.sh
# The first two scripts I just built locally on my machine and uploaded the results to the server.
# Only the submission script really needs to be run on the server.
##### BOWTIE2 MAPPING
# Next, comes the actual mapping step.
# Within the COMAPPING directory where the bowtie indexes are, create directories called after each sample.
for i in $samples; do mkdir $i; done