03_ASSEMBLY

Step 0: Metadata preparation

Gather and integrate the metadata you expect to need later in the analysis. We had sequencing information, sample details, and file-size records of the human contamination that we removed with bbmap in the previous step. I wrote a script, integrate.py, that joins these files on a shared column name.
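The join can be sketched roughly as follows. This is not the contents of integrate.py; it is a minimal illustration using only the standard library, and the column name `sample`, the other column names, and the example values are all made up for the demonstration. It assumes an inner join (rows without a match are dropped), which may or may not be what integrate.py does.

```python
import csv
import io


def read_table(handle, delimiter="\t"):
    """Read a delimited table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def join_on_column(left, right, key):
    """Inner-join two lists of row dicts on a shared column name."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Columns from the right table are appended to the left row.
            joined.append({**row, **match})
    return joined


# Hypothetical example tables; real files would be opened with open(path).
seq_info = read_table(io.StringIO("sample\treads\nS1\t1000\nS2\t2500\n"))
sample_details = read_table(io.StringIO("sample\tsite\nS1\tSiteA\nS2\tSiteB\n"))

merged = join_on_column(seq_info, sample_details, "sample")
```

Joining a third table (for example the file-size records) would be a second call to `join_on_column` on the result.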

Step 1: Create a program that picks out samples based on metadata values and puts them into a comma-separated list.

I hard-coded some of the information into the script, but left input lines in place where one could instead prompt for it. I didn't need that much flexibility to create lists for assemblies, but we may need it for future file lists.

See: pickout_input2.py
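For readers who have not opened the script, the idea can be sketched as below. This is an illustration only, not the code in pickout_input2.py: the function names, the `sample` and `site` columns, and the `_R1.fastq.gz` suffix are assumptions chosen for the example.

```python
def pick_samples(rows, column, value, name_column="sample"):
    """Return the sample names whose metadata column equals the given value."""
    return [row[name_column] for row in rows if row[column] == value]


def comma_list(samples, suffix=""):
    """Join sample names (plus an optional file suffix) with commas."""
    return ",".join(f"{sample}{suffix}" for sample in samples)


# Hypothetical metadata rows, e.g. as produced by csv.DictReader.
rows = [
    {"sample": "S1", "site": "SiteA"},
    {"sample": "S2", "site": "SiteB"},
    {"sample": "S3", "site": "SiteA"},
]

site_a_r1 = comma_list(pick_samples(rows, "site", "SiteA"), "_R1.fastq.gz")
```

Hard-coding `column` and `value` gives the fixed behaviour described above; replacing them with `input()` calls gives the more flexible variant.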

Step 2: Use pickout_input2.py to populate a file with the datasets you want to run megahit with.

We are open to (and can easily run) different co-assemblies, but for now we decided to run one assembly per site. Some sites have around 22 samples; others have just 5. We would be glad to hear other strategies, but for now we settled on the following co-assemblies:

Loki's Castle
All Favne (NPD field) - 22 samples
Jan Mayen Gradient
Bruse gradient
Soria Moria
Ægir
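Each co-assembly then becomes one line of the samples file. The sketch below assumes the layout that the loop in Step 3 reads: three space-separated fields (dataset name, comma-separated R1 files, comma-separated R2 files). The file-name suffixes and sample names are hypothetical. Because the fields are split on spaces, the sketch replaces spaces in site names with underscores; that substitution is my assumption, not something the original scripts are known to do.

```python
def megahit_line(dataset, samples, r1_suffix="_R1.fastq.gz", r2_suffix="_R2.fastq.gz"):
    """Build one 'dataset R1s R2s' line for the samples file."""
    # A space inside the dataset name would shift the fields, so remove it.
    name = dataset.replace(" ", "_")
    r1s = ",".join(f"{sample}{r1_suffix}" for sample in samples)
    r2s = ",".join(f"{sample}{r2_suffix}" for sample in samples)
    return f"{name} {r1s} {r2s}"


# Hypothetical sample names for one of the sites listed above.
line = megahit_line("Soria Moria", ["S1", "S2"])
```

Writing one such line per site to METADATA/megahitSamples.txt gives the input for Step 3.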

Step 3: Perform the co-assemblies using a bash loop with megahit:

Start a "screen" session, activate your conda environment with megahit installed, and:

# Each line of megahitSamples.txt holds: dataset name, R1 list, R2 list (space-separated).
while read -r Dataset R1s R2s; do
     # -1/-2 take comma-separated lists of paired-end read files.
     megahit -1 "$R1s" -2 "$R2s" --min-contig-len 1000 -m 0.85 \
             -o "03_ASSEMBLIES/$Dataset/" -t 40
done < METADATA/megahitSamples.txt

# This loop runs megahit once per line of megahitSamples.txt; each line supplies the dataset name, the comma-separated R1 files, and the comma-separated R2 files.
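Since one malformed line can waste a long assembly run, it may be worth checking the samples file before starting. The following is a small sketch under the same assumptions about the file layout (three whitespace-separated fields, comma-separated read lists); the function name is hypothetical and the check does not verify that the files exist.

```python
def parse_megahit_line(line):
    """Split one samples-file line and check that R1 and R2 lists pair up."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, found {len(fields)}: {line!r}")
    dataset, r1s, r2s = fields
    r1_files = r1s.split(",")
    r2_files = r2s.split(",")
    if len(r1_files) != len(r2_files):
        raise ValueError(f"{dataset}: {len(r1_files)} R1 files but {len(r2_files)} R2 files")
    return dataset, r1_files, r2_files


# Hypothetical line in the assumed format.
parsed = parse_megahit_line("SiteA S1_R1.fastq.gz,S2_R1.fastq.gz S1_R2.fastq.gz,S2_R2.fastq.gz")
```

Looping over the file with `for line in open("METADATA/megahitSamples.txt")` and calling this function on each non-empty line would flag problems up front.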

Step 4: Simplify deflines.

Begin following the TARA Oceans assembly of co-assemblies method: https://merenlab.org/data/tara-oceans-mags/

# In 03_ASSEMBLIES:
# Create a file holding the co-assembly names (one subdirectory per assembly).
# Listing only the directories keeps set.txt itself out of the list, so the loops stop after the last subdirectory.
ls -d */ | tr -d '/' > set.txt

# This just renames the individual final.contigs.fa files from each assembly.

while read -r SET; do mv "$SET/final.contigs.fa" "$SET/$SET-contigs.fa"; done < set.txt

# Now simplifying the deflines.

for SET in $(cat set.txt)
do
    # Write the simplified FASTA to a temporary file, then replace the original.
    anvi-script-reformat-fasta "$SET/$SET-contigs.fa" \
                               --simplify-names \
                               -o "$SET/$SET-contigs-FIXED.fa" \
                               --prefix "$SET"
    mv "$SET/$SET-contigs-FIXED.fa" "$SET/$SET-contigs.fa"
done
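To make clear what "simplifying deflines" means, here is a rough Python illustration of the idea: each header line is replaced by the co-assembly prefix plus a running number, and the sequence lines are left unchanged. This is not anvi'o's implementation; the zero-padded numbering format and the example MEGAHIT-style headers are assumptions for the sketch.

```python
def simplify_deflines(fasta_lines, prefix, width=12):
    """Yield FASTA lines with each header replaced by prefix_<running number>."""
    counter = 0
    for line in fasta_lines:
        if line.startswith(">"):
            counter += 1
            yield f">{prefix}_{counter:0{width}d}"
        else:
            yield line.rstrip("\n")


# Hypothetical contigs with long assembler headers.
original = [
    ">k141_1 flag=1 multi=3.0000 len=1200",
    "ACGT",
    ">k141_2 flag=0 multi=2.0000 len=1500",
    "GGCC",
]
simplified = list(simplify_deflines(original, "SiteA"))
```

Using the co-assembly name as the prefix keeps contig names unique when contigs from several co-assemblies are later combined.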

In 2020, the Dahle group sent 60 samples for sequencing from various chimneys across the AMOR. This wiki shares the pipeline I used to process that dataset. The intent is to be specific about every step involved, so that other lab members do not have to repeat the same time-consuming processes. Hosting it on my Git page adds accountability and gives you someone to email if something doesn't work for you. :)
