
03_ASSEMBLY

eolesin edited this page Feb 17, 2021 · 8 revisions

Step 0: Metadata preparation

Gather and integrate the metadata you may need later in the analysis. We had sequencing information, sample details, and file-size records of the human contamination that we removed with bbmap in the previous step. I wrote a script, integrate.py, that joins these files on a shared column name.
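As a rough illustration of joining metadata files on a shared column, here is a minimal sketch. It is not the contents of integrate.py; the function names, the tab delimiter, and the example column names ("sample", "site", "human_bytes") are assumptions made for the example.

```python
import csv


def read_table(path, delimiter="\t"):
    """Read a delimited text file into a list of dicts, one per row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def join_on(left_rows, right_rows, key):
    """Left-join two lists of row dicts on the column named `key`.

    Rows of `left_rows` that have no match in `right_rows` are kept unchanged.
    If `right_rows` repeats a key, the last row with that key wins.
    """
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        merged = dict(row)
        merged.update(lookup.get(row[key], {}))
        joined.append(merged)
    return joined


# Hypothetical rows standing in for two metadata files.
samples = [{"sample": "S1", "site": "A"}, {"sample": "S2", "site": "B"}]
contamination = [{"sample": "S1", "human_bytes": "120"}]
integrated = join_on(samples, contamination, "sample")
```

A left join keeps every sample even when one of the files has no record for it, which makes gaps in the metadata easy to spot later.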

Step 1: Create a program that picks out samples based on metadata values and puts them into a comma-separated list.

I hard-coded some of the values in the script, but left input lines in place where the script could prompt for them instead. I didn't need that much flexibility to build the lists for these assemblies, but we may need it for future file lists.

See: pickout_input2.py
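The selection step could look something like the sketch below. It is not pickout_input2.py itself; the function name, the column names, and the sample identifiers are hypothetical.

```python
def pick_samples(rows, column, value, name_column="sample"):
    """Return the names of rows whose `column` equals `value`, comma-separated.

    `rows` is a list of dicts (for example, as produced by csv.DictReader).
    """
    return ",".join(row[name_column] for row in rows if row[column] == value)


# Hypothetical metadata rows.
metadata = [
    {"sample": "S1", "site": "Favne"},
    {"sample": "S2", "site": "Bruse"},
    {"sample": "S3", "site": "Favne"},
]
favne_samples = pick_samples(metadata, "site", "Favne")
```

Passing the column and value as arguments, rather than hard-coding them, is one way to get the flexibility mentioned above without prompting for input.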

Step 2: Use pickout_input2.py to populate a file with the datasets you want to run megahit with.

We can easily run different co-assemblies and are open to hearing other strategies, but for now we decided to run one co-assembly per site. Some sites have around 22 samples; others have just 5. The co-assemblies are:

  • Lokis Castle
  • All Favne (NPD field) - 22 samples
  • Jan Mayen Gradient
  • Bruse gradient
  • Soria Moria
  • Ægir
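The loop in Step 3 reads three space-separated fields from each line of the sample file: a dataset name, a comma-separated list of R1 files, and a comma-separated list of R2 files. A minimal sketch of building one such line follows; the function name, the READS directory, and the `_R1.fastq.gz` / `_R2.fastq.gz` naming are assumptions, not the project's actual layout.

```python
def megahit_line(dataset, samples, read_dir="READS"):
    """Build one 'dataset R1s R2s' line for the sample file.

    The file naming pattern is hypothetical; adjust it to the real read files.
    """
    if any(ch.isspace() for ch in dataset):
        # The Step 3 loop splits fields on spaces, so the name must be one word.
        raise ValueError(f"dataset name contains whitespace: {dataset!r}")
    r1 = ",".join(f"{read_dir}/{s}_R1.fastq.gz" for s in samples)
    r2 = ",".join(f"{read_dir}/{s}_R2.fastq.gz" for s in samples)
    return f"{dataset} {r1} {r2}"


line = megahit_line("Favne", ["S1", "S2"])
```

Because the fields are split on spaces, a site name with a space in it would need a single-word form (for example, an underscore in place of the space) before it is written to the file.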

Step 3: Perform the co-assemblies using a bash loop with megahit:

Start a "screen" session, activate the conda environment that has megahit installed, and run:

# Each line of megahitSamples.txt holds three space-separated fields:
# dataset name, comma-separated R1 files, comma-separated R2 files.
while read -r Dataset R1s R2s; do
     megahit -1 "$R1s" -2 "$R2s" --min-contig-len 1000 -m 0.85 \
             -o "03_ASSEMBLIES/$Dataset/" -t 40
done < METADATA/megahitSamples.txt

# This loop runs megahit once per line of megahitSamples.txt, taking the dataset name and read files from that line.
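Since each co-assembly can run for a long time, it may be worth checking the sample file before starting the loop. The sketch below is one possible check, not part of the original workflow; the function name and the example paths are hypothetical.

```python
import os


def check_sample_line(line, exists=os.path.exists):
    """Return a list of problems found in one 'dataset R1s R2s' line.

    `exists` is the function used to test for a file; it defaults to
    os.path.exists and can be swapped out for testing.
    """
    fields = line.split()
    if len(fields) != 3:
        return [f"expected 3 fields, found {len(fields)}"]
    dataset, r1s, r2s = fields
    r1_files, r2_files = r1s.split(","), r2s.split(",")
    problems = []
    if len(r1_files) != len(r2_files):
        problems.append(
            f"{dataset}: {len(r1_files)} R1 files but {len(r2_files)} R2 files"
        )
    for path in r1_files + r2_files:
        if not exists(path):
            problems.append(f"{dataset}: missing file {path}")
    return problems


def check_sample_file(path):
    """Check every non-empty line of the sample file; return all problems."""
    problems = []
    with open(path) as handle:
        for line in handle:
            if line.strip():
                problems.extend(check_sample_line(line))
    return problems
```

Calling `check_sample_file("METADATA/megahitSamples.txt")` from the project directory returns an empty list when every line has three fields, paired read counts, and files that exist.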

Step 4: Simplify deflines.

From this step on, we follow the co-assembly workflow used for the TARA Oceans metagenomes: https://merenlab.org/data/tara-oceans-mags/

# In 03_ASSEMBLIES:
# Create a file to hold the co-assembly names. Listing only the subdirectories
# keeps set.txt itself out of the list.
ls -d */ | tr -d '/' > set.txt

# Rename the final.contigs.fa file from each assembly after its co-assembly.

while read -r SET; do mv "$SET/final.contigs.fa" "$SET/$SET-contigs.fa"; done < set.txt

# Now simplifying the deflines.

for SET in $(cat set.txt)
do
    anvi-script-reformat-fasta "$SET/$SET-contigs.fa" \
                               --simplify-names \
                               -o "$SET/$SET-contigs-FIXED.fa" \
                               --prefix "$SET"
    mv "$SET/$SET-contigs-FIXED.fa" "$SET/$SET-contigs.fa"
done
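To confirm that the deflines were simplified, a quick check along these lines may help. It assumes that simplified names take the form prefix, underscore, number; that pattern and the function name are assumptions for this sketch, and the example deflines are illustrative.

```python
def unexpected_deflines(fasta_lines, prefix):
    """Return deflines that do not look like '>prefix_<digits>'.

    `fasta_lines` is any iterable of lines, such as an open FASTA file.
    """
    bad = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a defline
        name = line[1:].strip()
        head, _, number = name.rpartition("_")
        if head != prefix or not number.isdigit():
            bad.append(name)
    return bad


# Illustrative input: one simplified defline and one assembler-style defline.
example = [
    ">Favne_000000000001",
    "ACGT",
    ">k141_5 flag=1 multi=2.0 len=1200",
    "GGCC",
]
leftovers = unexpected_deflines(example, "Favne")
```

An empty result for each `$SET/$SET-contigs.fa` file suggests that every contig name carries its co-assembly prefix, which keeps names unique when contigs from different co-assemblies are combined later.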
