Skip to content

05_COMAPPING

eolesin edited this page May 19, 2021 · 16 revisions

0. Create nice deflines and build the contigs databases for both the 2019 and 2020 samples.

I decided to go back and reformat the contigs so that each contig had a prefix indicating which sample it belonged to. The reasoning for this is that we might be able to have a central Anvio profile which includes contigs from all samples.

One issue I saw was that using a cutoff threshold of 2500 was eliminating 70% of the contigs in the sample. This could be a huge disadvantage, as we are essentially throwing away a majority of the data at this step. But I tried it anyway.

1. At first, I made individual contig databases for each of the samples

for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2019`; \
    do anvi-script-reformat-fasta $SET/$SET.final.contigs.fa -l 2500 \
    --simplify-names -o $SET/$SET.fa; anvi-gen-contigs-database \
    --num-threads 40 -f $SET/$SET.fa -o $SET/$SET.contigs.db; \
done

# redid the contig databases after reformatting the contigs from each sample with
# their respective sample name prefixes (changes "-" to "_" in these names):
for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
    do bar=${SET//-/_}; anvi-script-reformat-fasta $SET/final.contigs.fa -l 1000   \
    --simplify-names --prefix c_$bar -o $SET/$SET.prefixed.fa;  \
done

for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
    do anvi-gen-contigs-database --num-threads 40 -f $SET/$SET.prefixed.fa \ 
    -o $SET/$SET.prefixed.contigs.db& \
done

2. Second, I made a concatenated contigs file that includes all contigs of all samples we want to use in this study.

I concatenate the contig files using the prefixed contig files in each sample subdirectory. A disadvantage of this method is that there are probably many reads that are able to map to many contigs, and this could make the data "noisy and difficult" https://groups.google.com/g/anvio/c/G-PXEjqcbmc?pli=1.

find -type f -name '*prefixed.fa' -exec cat {} + > merged.contigs.fa

anvi-script-reformat-fasta -l 2500 merged.contigs.fa -o merged.contigs2500.fa
anvi-gen-contigs-database --num-threads 40 -f merged.contigs2500.fa  -o merged.contigs2500.db

3. Third, I went ahead and mapped all reads from all samples to each individual assembly.

This is what I would call "co-mapping" rather than "co-assembly". This is for the sake of improving differential coverage information for binning later.

# Build the bowtie2 databases
for SET in `cat AMOR_2020_Good`; do bowtie2-build 03_INDIV_ASSEMBLY/$SET/$SET.prefixed.fa 05_COMAPPING/$SET --threads 20; done

# Create folders for each of the mappings to live, named for the sample used for the assembly mapped to.

samples=$(echo "$(cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good)")
for i in $samples; do mkdir $i; done


# Perform the mapping. Each sample has its folder. Each folder contains mapping data for all samples.
while read line;
	do
	SET=$(echo $line);
	samples=$(echo "$(cat AMOR_2020_Good)");          	# return all entries in file
    	delimiter=",";          
    	declare -a Smparray=($(echo $samples | tr "$delimiter" " "));         # create sample name array
    	for samp in "${Smparray[@]}";
    	do
	bowtie2 --threads 40 \
	-x 05_COMAPPING/$SET \
	-1 02_HUMAN_Decontam/$samp-cleanR1.fq \
	-2 02_HUMAN_Decontam/$samp-cleanR2.fq \
	--no-unal \
	-S 05_COMAPPING/$SET/$samp.sam;
	samtools view -F 4 -bS 05_COMAPPING/$SET/$samp.sam > 05_COMAPPING/$SET/$samp-RAW.bam&& 
            samtools sort 05_COMAPPING/$SET/$samp-RAW.bam -o 05_COMAPPING/$SET/$samp.bam;     
            samtools index 05_COMAPPING/$SET/$samp.bam;     
            rm 05_COMAPPING/$SET/$samp.sam 05_COMAPPING/$SET/$samp-RAW.bam;    
        done;  
    done < AMOR_2020_Good
 

Clone this wiki locally