
05_COMAPPING

eolesin edited this page May 19, 2021 · 16 revisions

0. Create nice deflines and build the contigs databases for both the 2019 and 2020 samples.

I decided to go back and reformat the contigs so that each contig name carries a prefix indicating which sample it belongs to. The reasoning is that this might let us build a central anvi'o profile that includes contigs from all samples.

One issue I saw was that a minimum contig length of 2500 (the -l cutoff) eliminated 70% of the contigs in the sample. This could be a huge disadvantage, as we are essentially throwing away the majority of the contigs at this step. But I tried it anyway.
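To put a number on that loss before committing to a cutoff, the contigs and bases that survive a given minimum length can be counted directly from the FASTA file. The following is a minimal sketch, not part of the original workflow: the function name `contig_retention` is hypothetical, and it assumes that contigs of exactly the cutoff length are kept.

```shell
# Report how many contigs and bases survive a minimum-length cutoff.
# Usage: contig_retention <fasta> <min_len>
# Assumption: a contig is kept when its length is >= min_len.
contig_retention() {
    awk -v min="$2" '
        # A header line closes the previous record and starts a new one.
        /^>/ { if (seen) tally(len); seen = 1; len = 0; next }
        # Sequence lines: strip whitespace and add to the current length.
             { gsub(/[ \t\r]/, ""); len += length($0) }
        END  { if (seen) tally(len)
               printf "contigs kept: %d/%d\nbases kept: %d/%d\n", kept, n, kept_bp, total_bp }
        function tally(l) { n++; total_bp += l; if (l >= min) { kept++; kept_bp += l } }
    ' "$1"
}
```

For example, `contig_retention $SET/$SET.final.contigs.fa 2500` could be compared with the same call at 1000. The share of bases kept is often more informative than the share of contigs, because short contigs are numerous but hold few bases.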

1. First, I made an individual contigs database for each sample.

for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2019`; \
    do anvi-script-reformat-fasta $SET/$SET.final.contigs.fa -l 2500 \
    --simplify-names -o $SET/$SET.fa; anvi-gen-contigs-database \
    --num-threads 40 -f $SET/$SET.fa -o $SET/$SET.contigs.db; \
done

# redid the contig databases after reformatting the contigs from each sample with
# their respective sample name prefixes (changes "-" to "_" in these names):
for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
    do bar=${SET//-/_}; anvi-script-reformat-fasta $SET/final.contigs.fa -l 1000   \
    --simplify-names --prefix c_$bar -o $SET/$SET.prefixed.fa;  \
done

# note: the trailing "&" starts every database build in the background at once
for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
    do anvi-gen-contigs-database --num-threads 40 -f $SET/$SET.prefixed.fa \
    -o $SET/$SET.prefixed.contigs.db& \
done
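Since the central profile depends on every contig carrying its sample prefix, it may be worth checking the prefixed FASTA files before building on them. This is only a sketch: `count_unprefixed` and the sample name `Sample-01` are hypothetical, and the `sed` call is a portable stand-in for the `${SET//-/_}` substitution used in the reformatting loop.

```shell
# Count deflines in a FASTA file that do NOT start with the expected prefix.
# Usage: count_unprefixed <fasta> <prefix>
count_unprefixed() {
    awk -v p=">$2" '/^>/ && index($0, p) != 1 { bad++ } END { print bad + 0 }' "$1"
}

# Hypothetical sample name; sed mirrors the "-" to "_" replacement.
SET="Sample-01"
bar=$(printf '%s' "$SET" | sed 's/-/_/g')
echo "c_$bar"    # prints c_Sample_01
```

A call such as `count_unprefixed $SET/$SET.prefixed.fa "c_$bar"` should then print 0 for each sample.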

2. Second, I made a concatenated contigs file that includes all contigs of all samples we want to use in this study.

I concatenated the prefixed contig files from each sample subdirectory. A disadvantage of this method is that many reads can probably map to more than one contig, and this could make the data "noisy and difficult" (https://groups.google.com/g/anvio/c/G-PXEjqcbmc?pli=1).

find . -type f -name '*prefixed.fa' -exec cat {} + > merged.contigs.fa

anvi-script-reformat-fasta -l 2500 merged.contigs.fa -o merged.contigs2500.fa
anvi-gen-contigs-database --num-threads 40 -f merged.contigs2500.fa  -o merged.contigs2500.db
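The merged file is filtered without --simplify-names, so the per-sample prefixes are the only thing keeping contig names unique. A quick duplicate check before building the database could look like the sketch below. The function name `duplicate_deflines` is hypothetical, and it compares only the first whitespace-separated word of each defline.

```shell
# Print every contig name that occurs more than once in a FASTA file.
# Usage: duplicate_deflines <fasta>
duplicate_deflines() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort | uniq -d
}
```

Running `duplicate_deflines merged.contigs2500.fa` should print nothing if every name is unique. Any name it does print points to a sample that was concatenated twice or reformatted without its prefix.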

3. Third, I went ahead and mapped all reads from all samples to each individual assembly.

This is what I would call "co-mapping" rather than "co-assembly". The goal is to improve the differential coverage information available for binning later.

# Build the bowtie2 databases
for SET in `cat AMOR_2020_Good`; do bowtie2-build 03_INDIV_ASSEMBLY/$SET/$SET.prefixed.fa 05_COMAPPING/$SET --threads 20; done

# Create one folder per assembly to hold its mappings, named after the sample the assembly came from.
for i in $samples; do mkdir $i; done


# Perform the mapping. Each sample has its folder. Each folder contains mapping data for all samples.
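
The mapping commands themselves are not shown on this page. As an illustration only, an all-against-all loop could be planned as below. This sketch prints the bowtie2 commands rather than running them. The function name `plan_comapping`, the read file names (`<sample>_R1.fastq.gz`, `<sample>_R2.fastq.gz`), the reads directory and the SAM output paths are all assumptions. Only the index location `05_COMAPPING/$SET` comes from the bowtie2-build step above.

```shell
# Print (do not run) one bowtie2 command per assembly/read-set pair.
# Usage: plan_comapping <sample_list_file> <reads_dir>
# ASSUMPTIONS: read file names and output layout are hypothetical.
plan_comapping() {
    list=$1; reads=$2
    # Outer loop: the assembly (bowtie2 index) being mapped to.
    while read -r asm; do
        [ -n "$asm" ] || continue
        # Inner loop: the sample whose reads are being mapped.
        while read -r smp; do
            [ -n "$smp" ] || continue
            echo "bowtie2 --threads 20 -x 05_COMAPPING/$asm" \
                 "-1 $reads/${smp}_R1.fastq.gz -2 $reads/${smp}_R2.fastq.gz" \
                 "-S 05_COMAPPING/$asm/$smp.sam"
        done < "$list"
    done < "$list"
}
```

With N samples in the list this prints N × N commands, which makes the cost of co-mapping visible before anything is started. The printed lines can be reviewed and then piped to a shell or a job scheduler.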
