Skip to content

05_COMAPPING

eolesin edited this page May 26, 2021 · 16 revisions

0. Create nice deflines and build the contigs databases for both the 2019 and 2020 samples.

I decided to go back and reformat the contigs so that each contig had a prefix indicating which sample it belonged to. The reasoning for this is that we might be able to have a central Anvio profile which includes contigs from all samples.

One issue I saw was that using a cutoff threshold of 2500 was eliminating 70% of the contigs in the sample. This could be a huge disadvantage, as we are essentially throwing away a majority of the data at this step. But I tried it anyway.

1. At first, I made individual contig databases for each of the samples

for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2019`; \
    do anvi-script-reformat-fasta $SET/$SET.final.contigs.fa -l 2500 \
    --simplify-names -o $SET/$SET.fa; anvi-gen-contigs-database \
    --num-threads 40 -f $SET/$SET.fa -o $SET/$SET.contigs.db; \
done

# redid the contig databases after reformatting the contigs from each sample with
# their respective sample name prefixes (changes "-" to "_" in these names):
for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
    do bar=${SET//-/_}; anvi-script-reformat-fasta $SET/final.contigs.fa -l 1000   \
    --simplify-names --prefix c_$bar -o $SET/$SET.prefixed.fa;  \
done

for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
    do anvi-gen-contigs-database --num-threads 40 -f $SET/$SET.prefixed.fa \ 
    -o $SET/$SET.prefixed.contigs.db& \
done

Mapped each sample to its assembly

while read line;
	do
	SET=$(echo $line);
	samples=$(echo "$(cat AMOR_2020_Good)");        # return all entries in file
#    	delimiter=",";          
#    	declare -a Smparray=($(echo $samples | tr "$delimiter" " ")); # create sample name array
#    	for samp in "${Smparray[@]}";
#    	do
	bowtie2 --threads 40 \
	-x 05_COMAPPING/$SET \
	-1 02_HUMAN_Decontam/$SET-cleanR1.fq \
	-2 02_HUMAN_Decontam/$SET-cleanR2.fq \
	--no-unal \
	-S 05_COMAPPING/$SET/$SET.sam;
	samtools view -F 4 -bS 05_COMAPPING/$SET/$SET.sam > 05_COMAPPING/$SET/$SET-RAW.bam; 
            samtools sort 05_COMAPPING/$SET/$SET-RAW.bam -o 05_COMAPPING/$SET/$SET.bam;     
            samtools index 05_COMAPPING/$SET/$SET.bam;     
            rm 05_COMAPPING/$SET/$SET.sam 05_COMAPPING/$SET/$SET-RAW.bam;    
#        done;  
    done < AMOR_2020_Good

# For 2019 samples, made a symlink to the QC'ed data and ran their own loop
while read line;
	do
	SET=$(echo $line);
	samples=$(echo "$(cat AMOR_2019)");        # return all entries in file
	bowtie2 --threads 40 \
	-x 05_COMAPPING/$SET \
	-1 05_COMAPPING/01_QC/$SET-QUALITY_PASSED_R1.fastq \
	-2 05_COMAPPING/01_QC/$SET-QUALITY_PASSED_R2.fastq \
	--no-unal \
	-S 05_COMAPPING/$SET/$SET.sam;
	samtools view -F 4 -bS 05_COMAPPING/$SET/$SET.sam > 05_COMAPPING/$SET/$SET-RAW.bam; 
            samtools sort 05_COMAPPING/$SET/$SET-RAW.bam -o 05_COMAPPING/$SET/$SET.bam;     
            samtools index 05_COMAPPING/$SET/$SET.bam;     
            rm 05_COMAPPING/$SET/$SET.sam 05_COMAPPING/$SET/$SET-RAW.bam;    
    done < AMOR_2019

2. Second, I made a concatenated contigs file that includes all contigs of all samples we want to use in this study.

I concatenate the contig files using the prefixed contig files in each sample subdirectory. A disadvantage of this method is that there are probably many reads that are able to map to many contigs, and this could make the data "noisy and difficult" https://groups.google.com/g/anvio/c/G-PXEjqcbmc?pli=1.

find -type f -name '*prefixed.fa' -exec cat {} + > merged.contigs.fa

anvi-script-reformat-fasta -l 2500 merged.contigs.fa -o merged.contigs2500.fa
anvi-gen-contigs-database --num-threads 40 -f merged.contigs2500.fa  -o merged.contigs2500.db

3. Third, I went ahead and mapped all reads from all samples to each individual assembly.

This is what I would call "co-mapping" rather than "co-assembly". This is for the sake of improving differential coverage information for binning later.



# Decided to switch over to the SAGA server for this task. Bowtie does not parallelize.
# this means that we are either SOL (on kjempefuru) or we need HPC.
# So I recreated the indices on SAGA.

# Create folders for each of the mappings to live, named for the assembly sample mapped to.

samples=$(echo "$(cat /cluster/projects/nn9836k/Metagenomics_AMOR_2020/05_COMAPPING/AMOR_2020_Good)")
for i in $samples; do mkdir $i; done

# Build the bowtie2 databases
for SET in `cat AMOR_2020_Good`; do bowtie2-build 03_INDIV_ASSEMBLY/$SET/$SET.prefixed.fa \
 05_COMAPPING/INDICES/$SET --threads 20; \
done

### CO-MAPPING using slurm jobs

BOWTIE BUILD

copy contig data from kjempefuru to SAGA (within kjempefuru)

tocopy=$(find -type f -name '*.prefixed.fa') # everything's in subdirectories. Annoying. scp $tocopy ede041@saga.sigma2.no:/cluster/projects/nn9836k/Metagenomics_AMOR_2020/04_CONTIGS/

Also copied the simple text files containing the list of sample names to process.

THEN -- and I'm very happy I took some time to do this... I figured out how to do a thing:

1. I created a template slurm script for the jobs to use, with where sample names should go.

just called it bowtie_template.sh

2. Then I created a script that found / replaced the placeholder with sample names I want, and made

./Make_child_scripts.sh

new baby scripts that had the respective sample name in the script filename. They all look like

the template, with respective sample info filled in like so:

#!/usr/bin/bash

every job must be accounted for

#SBATCH --account=nn9836k #SBATCH --job-name=14ROV7-A_to_12ROV10-HD34D

every job requires some specification of the number of cores to be used

#SBATCH --ntasks=1

every job requires some specification of the memory (RAM) it needs

#SBATCH --cpus-per-task=8 #SBATCH --mem-per-cpu=5G

every job requires a runtime limit

#SBATCH --time=5:00:00

setting up software environment

module purge module load Bowtie2/2.4.2-GCC-10.2.0 module load SAMtools/1.11-GCC-10.2.0

set up paths

INDEX_PATH="/cluster/projects/nn9836k/Metagenomics_AMOR_2020/05_COMAPPING/INDICES" READS_PATH="/cluster/projects/nn9836k/Metagenomics_AMOR_2020/02_NOHUMAN" COMAP_PATH="/cluster/projects/nn9836k/Metagenomics_AMOR_2020/05_COMAPPING"

Run bowtie2 on the data

bowtie2 --threads 40 -x ${INDEX_PATH}/12ROV10-HD34D
-1 ${READS_PATH}/14ROV7-A-cleanR1.fq -2 ${READS_PATH}/14ROV7-A-cleanR2.fq
--no-unal -S ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A.sam samtools view -F 4 -bS ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A.sam > ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A-RAW.bam samtools sort ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A-RAW.bam -o ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A.bam samtools index ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A.bam rm ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A.sam ${COMAP_PATH}/12ROV10-HD34D/14ROV7-A-RAW.bam

3. I made a script that calls all the slurm jobs!

./Submit_slurms.sh

The first two scripts I just built locally on my machine and uploaded the results to the server.

Only the submission script really needs to be run on the server.

BOWTIE2 MAPPING

Next, comes the actual mapping step.

Within the COMAPPING directory where the bowtie indexes are, create directories called after each sample.

for i in $samples; do mkdir $i; done

Clone this wiki locally