Skip to content

05_COMAPPING

eolesin edited this page May 19, 2021 · 16 revisions

0. Create nice deflines and build the contigs databases for both the 2019 and 2020 samples.

I decided to go back and reformat the contigs so that each contig had a prefix indicating which sample it belonged to. The reasoning for this is that we might be able to have a central Anvio profile which includes contigs from all samples.

One issue I saw was that using a cutoff threshold of 2500 was eliminating 70% of the contigs in the sample. This could be a huge disadvantage, as we are essentially throwing away a majority of the data at this step. But I tried it anyway.

1. At first, I made individual contig databases for each of the samples

for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2019`; \
    do anvi-script-reformat-fasta $SET/$SET.final.contigs.fa -l 2500 \
    --simplify-names -o $SET/$SET.fa; anvi-gen-contigs-database \
    --num-threads 40 -f $SET/$SET.fa -o $SET/$SET.contigs.db; \
done

# redid the contig databases after reformatting the contigs from each sample with
# their respective sample name prefixes (changes "-" to "_" in these names):
for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
    do bar=${SET//-/_}; anvi-script-reformat-fasta $SET/final.contigs.fa -l 1000   \
    --simplify-names --prefix c_$bar -o $SET/$SET.prefixed.fa;  \
done

for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
    do anvi-gen-contigs-database --num-threads 40 -f $SET/$SET.prefixed.fa \ 
    -o $SET/$SET.prefixed.contigs.db& \
done

2. Second, I made a concatenated contigs file that includes all contigs of all samples we want to use in this study.

I concatenate the contig files using the prefixed contig files in each sample subdirectory. A disadvantage of this method is that there are probably many reads that are able to map to many contigs, and this could make the data "noisy and difficult" https://groups.google.com/g/anvio/c/G-PXEjqcbmc?pli=1.

find -type f -name '*prefixed.fa' -exec cat {} + > merged.contigs.fa

anvi-script-reformat-fasta -l 2500 merged.contigs.fa -o merged.contigs2500.fa
anvi-gen-contigs-database --num-threads 40 -f merged.contigs2500.fa  -o merged.contigs2500.db

3. Third, I went ahead and mapped all reads from all samples to each individual assembly.

This is what I would call "co-mapping" rather than "co-assembly". This is for the sake of improving differential coverage information for binning later.

# Build the bowtie2 databases
for SET in `cat AMOR_2020_Good`; do bowtie2-build 03_INDIV_ASSEMBLY/$SET/$SET.prefixed.fa \
 05_COMAPPING/$SET --threads 20; \
done

# Create folders for each of the mappings to live, named for the sample used for the assembly mapped to.

samples=$(echo "$(cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good)")
for i in $samples; do mkdir $i; done


# Perform the mapping. Each sample has its folder. Each folder contains mapping data for all samples.
# Weirdly only works when copy and pasted from a txt file.... or as a script:
while read line;
	do
	SET=$(echo $line);
	samples=$(echo "$(cat AMOR_2020_Good)");        # return all entries in file
    	delimiter=",";          
    	declare -a Smparray=($(echo $samples | tr "$delimiter" " ")); # create sample name array
    	for samp in "${Smparray[@]}";
    	do
	bowtie2 --threads 40 \
	-x 05_COMAPPING/$SET \
	-1 02_HUMAN_Decontam/$samp-cleanR1.fq \
	-2 02_HUMAN_Decontam/$samp-cleanR2.fq \
	--no-unal \
	-S 05_COMAPPING/$SET/$samp.sam;
	samtools view -F 4 -bS 05_COMAPPING/$SET/$samp.sam > 05_COMAPPING/$SET/$samp-RAW.bam; 
            samtools sort 05_COMAPPING/$SET/$samp-RAW.bam -o 05_COMAPPING/$SET/$samp.bam;     
            samtools index 05_COMAPPING/$SET/$samp.bam;     
            rm 05_COMAPPING/$SET/$samp.sam 05_COMAPPING/$SET/$samp-RAW.bam;    
        done;  
    done < AMOR_2020_Good

Indexing seems... broken or something when I see the result on kjempefuru. I'm not sure why. Some of my 2.bt2 files have nothing in them... it looks like it might be related to the conda build. Maybe this is something that SAGA is better suited for, this comapping. It's going really really slowly on kjempefuru. Perhaps doing the mappings as separate slurm jobs is the solution.

CO-MAPPING using slurm jobs

#### BOWTIE BUILD
# copy contig data from kjempefuru to SAGA (within kjempefuru)

tocopy=$(find -type f -name '*.prefixed.fa') # everything's in subdirectories. Annoying.
scp $tocopy ede041@saga.sigma2.no:/cluster/projects/nn9836k/Metagenomics_AMOR_2020/04_CONTIGS/

# Also copied the simple text files containing the list of sample names to process.
# THEN -- and I'm very happy I took some time to do this... I figured out how to do a thing.
# 1. I created a template slurm script for the jobs to use, with <samp> where sample names should go.
# just called it bowtie_template.sh
# 2. Then I created a script that found / replaced the placeholder with sample names I want, and made

./Make_child_scripts.sh

# new baby scripts that had the respective sample name in the script filename.
# 3. I made a script that calls all the slurm jobs!

./Submit_slurms.sh

# The first two scripts I just built locally on my machine and uploaded the results to the server. 
# Only the submission script really needs to be run on the server.

##### BOWTIE2 MAPPING

# Next, comes the actual mapping step. 
# Within the COMAPPING directory where the bowtie indexes are, create directories called after each sample.

for i in $samples; do mkdir $i; done


Clone this wiki locally