-
Notifications
You must be signed in to change notification settings - Fork 0
05_COMAPPING
I decided to go back and reformat the contigs so that each contig had a prefix indicating which sample it belonged to. The reasoning for this is that we might be able to have a central Anvio profile which includes contigs from all samples.
One issue I saw was that using a cutoff threshold of 2500 was eliminating 70% of the contigs in the sample. This could be a huge disadvantage, as we are essentially throwing away a majority of the data at this step. But I tried it anyway.
for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2019`; \
do anvi-script-reformat-fasta $SET/$SET.final.contigs.fa -l 2500 \
--simplify-names -o $SET/$SET.fa; anvi-gen-contigs-database \
--num-threads 40 -f $SET/$SET.fa -o $SET/$SET.contigs.db; \
done
# redid the contig databases after reformatting the contigs from each sample with
# their respective sample name prefixes (changes "-" to "_" in these names):
for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
do bar=${SET//-/_}; anvi-script-reformat-fasta $SET/final.contigs.fa -l 1000 \
--simplify-names --prefix c_$bar -o $SET/$SET.prefixed.fa; \
done
for SET in `cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good`; \
do anvi-gen-contigs-database --num-threads 40 -f $SET/$SET.prefixed.fa \
-o $SET/$SET.prefixed.contigs.db& \
done
2. Second, I made a concatenated contigs file that includes all contigs of all samples we want to use in this study.
I concatenate the contig files using the prefixed contig files in each sample subdirectory. A disadvantage of this method is that there are probably many reads that are able to map to many contigs, and this could make the data "noisy and difficult" https://groups.google.com/g/anvio/c/G-PXEjqcbmc?pli=1.
find -type f -name '*prefixed.fa' -exec cat {} + > merged.contigs.fa
anvi-script-reformat-fasta -l 2500 merged.contigs.fa -o merged.contigs2500.fa
anvi-gen-contigs-database --num-threads 40 -f merged.contigs2500.fa -o merged.contigs2500.db
This is what I would call "co-mapping" rather than "co-assembly". This is for the sake of improving differential coverage information for binning later.
# Build the bowtie2 databases
for SET in `cat AMOR_2020_Good`; do bowtie2-build 03_INDIV_ASSEMBLY/$SET/$SET.prefixed.fa \
05_COMAPPING/$SET --threads 20; \
done
# Create folders for each of the mappings to live, named for the sample used for the assembly mapped to.
samples=$(echo "$(cat /export/dahlefs/work/Metagenomes_chimneys_2020_workfolder/AMOR_2020_Good)")
for i in $samples; do mkdir $i; done
# Perform the mapping. Each sample has its folder. Each folder contains mapping data for all samples.
# Weirdly only works when copy and pasted from a txt file.... or as a script:
while read line;
do
SET=$(echo $line);
samples=$(echo "$(cat AMOR_2020_Good)"); # return all entries in file
delimiter=",";
declare -a Smparray=($(echo $samples | tr "$delimiter" " ")); # create sample name array
for samp in "${Smparray[@]}";
do
bowtie2 --threads 40 \
-x 05_COMAPPING/$SET \
-1 02_HUMAN_Decontam/$samp-cleanR1.fq \
-2 02_HUMAN_Decontam/$samp-cleanR2.fq \
--no-unal \
-S 05_COMAPPING/$SET/$samp.sam;
samtools view -F 4 -bS 05_COMAPPING/$SET/$samp.sam > 05_COMAPPING/$SET/$samp-RAW.bam;
samtools sort 05_COMAPPING/$SET/$samp-RAW.bam -o 05_COMAPPING/$SET/$samp.bam;
samtools index 05_COMAPPING/$SET/$samp.bam;
rm 05_COMAPPING/$SET/$samp.sam 05_COMAPPING/$SET/$samp-RAW.bam;
done;
done < AMOR_2020_Good
Indexing seems... broken or something when I see the result on kjempefuru. I'm not sure why. Some of my 2.bt2 files have nothing in them... it looks like it might be related to the conda build. Maybe this is something that SAGA is better suited for, this comapping. It's going really really slowly on kjempefuru. Perhaps doing the mappings as separate slurm jobs is the solution.
#### BOWTIE BUILD
# copy contig data from kjempefuru to SAGA (within kjempefuru)
tocopy=$(find -type f -name '*.prefixed.fa') # everything's in subdirectories. Annoying.
scp $tocopy ede041@saga.sigma2.no:/cluster/projects/nn9836k/Metagenomics_AMOR_2020/04_CONTIGS/
# Also copied the simple text files containing the list of sample names to process.
# THEN -- and I'm very happy I took some time to do this... I figured out how to do a thing.
# 1. I created a template slurm script for the jobs to use, with <samp> where sample names should go.
# just called it bowtie_template.sh
# 2. Then I created a script that found / replaced the placeholder with sample names I want, and made
./Make_child_scripts.sh
# new baby scripts that had the respective sample name in the script filename.
# 3. I made a script that calls all the slurm jobs!
./Submit_slurms.sh
# The first two scripts I just built locally on my machine and uploaded the results to the server.
# Only the submission script really needs to be run on the server.
##### BOWTIE2 MAPPING
# Next, comes the actual mapping step.
# Within the COMAPPING directory where the bowtie indexes are, create directories called after each sample.
for i in $samples; do mkdir $i; done