This code shows how to trim and assemble paired-end metagenome .fastq.gz files in Bash using BBTools and SPAdes.

Sample data from ENA:

Arctic Ocean metagenomes sampled aboard CGC Healy during the 2015 GEOTRACES Arctic research cruise
Secondary Study Accession: ERP015773
Study Title: Arctic Ocean metagenomes from HLY1502
Center Name: UNIVERSITY OF ALASKA FAIRBANKS
Study Name: Arctic Ocean metagenomes
ENA-FIRST-PUBLIC: 2016-05-27
ENA-LAST-UPDATE: 2016-05-25

Can be found at: https://www.ebi.ac.uk/ena/browser/view/PRJEB14154?show=reads

I used the first five pairs of generated FASTQ files.

Move into the content folder:

In [1]:
cd example_content

Examine the contents. Notice that all of the .fastq.gz files are in separate subfolders.

In [2]:
ls

ERR1424899	ERR1424900	ERR1424901	ERR1424902	ERR1424903


Move up one directory and create a new subdirectory, `all_data`, to collect all of the .fastq.gz files in one place. Then confirm with `ls` that the directory was created.

In [3]:
cd ..

First, we need to make a copy of the original data before moving it.

In [4]:
cp -R example_content example_content_copy

In [5]:
mkdir all_data

In [6]:
ls

Mapping_simple.ipynb			references
all_data				spadsandstatswrapcommands.txt
bowtie2andsamtoolscommands.txt		trimming_assembly_basic.ipynb
example_content				trimming_classification_basic.ipynb
example_content_2			trimming_classification_moreqc.ipynb
example_content_copy


Move back to the example_content directory.

In [7]:
cd example_content

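Locate all files ending in .gz in the subfolders of this directory and move them to `all_data`. The `*` wildcard matches any characters preceding .gz. The `-mindepth 2` option restricts the search to entries at least two levels below the starting point (the starting directory itself is depth 0 and its direct children are depth 1), so only files inside subfolders are matched. `-type f` limits matches to regular files. In `-exec`, the `{}` placeholder is replaced by the path of each matching file and `\;` ends the command, so `mv` runs once per file. `-print` lists each file as it is processed so the user can monitor progress.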

In [8]:
find . -mindepth 2 -type f -name '*.gz' -print -exec mv {} ../all_data \;

./ERR1424899/ERR1424899_1.fastq.gz
./ERR1424899/ERR1424899_2.fastq.gz
./ERR1424900/ERR1424900_1.fastq.gz
./ERR1424900/ERR1424900_2.fastq.gz
./ERR1424901/ERR1424901_1.fastq.gz
./ERR1424901/ERR1424901_2.fastq.gz
./ERR1424902/ERR1424902_1.fastq.gz
./ERR1424902/ERR1424902_2.fastq.gz
./ERR1424903/ERR1424903_1.fastq.gz
./ERR1424903/ERR1424903_2.fastq.gz
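
Before running a `find ... -exec mv` on real data, it can help to rehearse it on a throwaway layout. The sketch below is only an illustration: the directory and file names (`runA`, `runB`, `top_level.gz`) are made up, and everything happens in a temporary directory. It first runs `find` with `-print` alone as a dry run, then repeats the same filters with the move attached, and shows that a .gz file sitting directly in the starting directory (depth 1) is left alone by `-mindepth 2`.

```shell
# Throwaway layout mimicking example_content; all names below are made up.
demo=$(mktemp -d)
mkdir -p "$demo/example_content/runA" "$demo/example_content/runB" "$demo/all_data"
touch "$demo/example_content/runA/runA_1.fastq.gz" \
      "$demo/example_content/runA/runA_2.fastq.gz" \
      "$demo/example_content/runB/runB_1.fastq.gz" \
      "$demo/example_content/top_level.gz"   # depth 1: should be skipped

cd "$demo/example_content" || exit 1

# Dry run: list what would be moved, without moving anything.
find . -mindepth 2 -type f -name '*.gz' -print | sort

# Real run: same filters, now with the move attached.
find . -mindepth 2 -type f -name '*.gz' -exec mv {} ../all_data \;

# Count what arrived in the destination.
moved=$(find ../all_data -type f | wc -l | tr -d ' ')
echo "moved: $moved"
```

Running the `-print`-only version first is a cheap way to confirm the filters match exactly the files you expect before anything is moved.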


Move into the `all_data` directory to check that all of the `.fastq.gz` files were moved.

In [9]:
cd ../all_data

In [10]:
ls

ERR1424899_1.fastq.gz	ERR1424901_1.fastq.gz	ERR1424903_1.fastq.gz
ERR1424899_2.fastq.gz	ERR1424901_2.fastq.gz	ERR1424903_2.fastq.gz
ERR1424900_1.fastq.gz	ERR1424902_1.fastq.gz
ERR1424900_2.fastq.gz	ERR1424902_2.fastq.gz


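Now we delete the original example_content directory, which no longer contains any files (only empty subfolders). The copy made earlier, example_content_copy, still holds the original data.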

In [11]:
rm -r ../example_content
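
Since `rm -r` cannot be undone, one option is to guard the deletion with a check that no regular files remain. The helper below, `remove_if_no_files`, is a hypothetical name of my own and not part of any tool used here; the demonstration runs on made-up directories inside a temporary folder.

```shell
# Hypothetical helper: delete a directory tree only when it holds no regular files.
remove_if_no_files() {
    dir=$1
    remaining=$(find "$dir" -type f | wc -l | tr -d ' ')
    if [ "$remaining" -eq 0 ]; then
        rm -r "$dir"
        echo "removed $dir"
    else
        echo "kept $dir: $remaining file(s) still inside"
    fi
}

# Toy demonstration in a temporary directory (names are made up).
demo=$(mktemp -d)
mkdir -p "$demo/emptied/runA" "$demo/not_emptied/runB"
touch "$demo/not_emptied/runB/left_behind.fastq.gz"

remove_if_no_files "$demo/emptied"       # only empty subfolders: removed
remove_if_no_files "$demo/not_emptied"   # still holds a file: kept
```

In this walkthrough the equivalent call would be `remove_if_no_files ../example_content`, run from inside `all_data`.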

Let's go through trimming and assembly using one pair of read files.

First, I am going to trim the reads with BBDuk, a tool from BBTools, before assembling them with SPAdes.
BBDuk is called as `bbduk.sh`.

Let's start with basic trimming.


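Provide the two input files and the output path. Because a single `out=` is given for paired input, the pairs are written to one interleaved file. The reference used for trimming is `adapters`, the adapter reference shipped with BBTools, which contains Illumina adapter sequences. `ktrim` sets the trimming direction: `ktrim=r` removes the matched adapter k-mer and everything to its right (the 3' end), while `ktrim=l` would trim to the left (the 5' end). In this case, we trim the 3' adapter. `k=23` is the k-mer length used for matching, `mink=11` allows shorter k-mers (down to 11) at the ends of reads, and `hdist=1` tolerates one mismatch. `tbo` (trim by overlap) also trims adapters detected from the overlap between the two reads of a pair. `minlen=70` discards reads shorter than 70 bases after trimming, and `ow=t` overwrites any existing output.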

In [12]:
mkdir ../trimmed_reads

In [13]:
bbduk.sh in=ERR1424899_1.fastq.gz in2=ERR1424899_2.fastq.gz out=../trimmed_reads/trimmed_ERR1424899.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo minlen=70 ref=adapters ordered ow=t

/usr/local/Cellar/bbtools/38.95/libexec//calcmem.sh: line 75: [: -v: unary operator expected
Max memory cannot be determined.  Attempting to use 1400 MB.
If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command, 
or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
java -ea -Xmx1400m -Xms1400m -cp /usr/local/Cellar/bbtools/38.95/libexec/current/ jgi.BBDuk in=ERR1424899_1.fastq.gz in2=ERR1424899_2.fastq.gz out=../trimmed_reads/trimmed_ERR1424899.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo minlen=70 ref=adapters ow=t
Executing jgi.BBDuk [in=ERR1424899_1.fastq.gz, in2=ERR1424899_2.fastq.gz, out=../trimmed_reads/trimmed_ERR1424899.fq.gz, ktrim=r, k=23, mink=11, hdist=1, tbo, minlen=70, ref=adapters, ow=t]
Version 38.95

maskMiddle was disabled because useShortKmers=true
0.062 seconds.
Initial:
Memory: max=1468m, total=1468m, free=1439m, used=29m

Added 217135 kmers; time: 	0.285 seconds.
Memory: max=1468m, total=1468m, free=1434m

Notice that 90% of reads (96% of bases) were retained after k-mer trimming and trimming by overlap.
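
The retention figure can also be checked independently of the BBDuk log by counting reads before and after. Here is a minimal sketch, assuming standard FASTQ with exactly four lines per record (wrapped sequences would break it). `count_reads` is a hypothetical helper, and the two toy files are made up so the example runs anywhere.

```shell
# Hypothetical helper: count records in a gzipped FASTQ, assuming four lines per record.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l | tr -d ' ')
    echo $(( lines / 4 ))
}

# Toy files standing in for the raw and trimmed data (contents are made up).
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n@r4\nCATG\n+\nIIII\n' | gzip > "$tmp/before.fq.gz"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > "$tmp/after.fq.gz"

before=$(count_reads "$tmp/before.fq.gz")
after=$(count_reads "$tmp/after.fq.gz")

# Integer percentage of reads retained.
echo "retained $after of $before reads ($(( 100 * after / before ))%)"
```

On the real data, the "before" count would be the sum over the `_1` and `_2` input files, and the "after" count would come from the single interleaved output file.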

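Next, let's remove any synthetic DNA (spike-ins) and other such artifacts from the trimmed reads using `bbduk.sh`. Without a `ktrim` setting, BBDuk discards reads that match the reference instead of trimming them. `ref=artifacts,phix` selects two references: known sequencing artifacts and PhiX, a bacteriophage genome that is often spiked in as a control during sequencing runs. The `cardinality` argument reports an estimate of the number of unique k-mers. `ordered` keeps the output reads in the same order as the input.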

In [14]:
mkdir ../artfilt_reads

In [15]:
bbduk.sh in=../trimmed_reads/trimmed_ERR1424899.fq.gz  out=../artfilt_reads/artfilt_ERR1424899.fq.gz  k=31 ref=artifacts,phix ordered cardinality ow=t; 

/usr/local/Cellar/bbtools/38.95/libexec//calcmem.sh: line 75: [: -v: unary operator expected
Max memory cannot be determined.  Attempting to use 1400 MB.
If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command, 
or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
java -ea -Xmx1400m -Xms1400m -cp /usr/local/Cellar/bbtools/38.95/libexec/current/ jgi.BBDuk in=../trimmed_reads/trimmed_ERR1424899.fq.gz out=../artfilt_reads/artfilt_ERR1424899.fq.gz k=31 ref=artifacts,phix ordered cardinality ow=t
Executing jgi.BBDuk [in=../trimmed_reads/trimmed_ERR1424899.fq.gz, out=../artfilt_reads/artfilt_ERR1424899.fq.gz, k=31, ref=artifacts,phix, ordered, cardinality, ow=t]
Version 38.95

Set ORDERED to true
0.035 seconds.
Initial:
Memory: max=1468m, total=1468m, free=1439m, used=29m

Added 92346 kmers; time: 	0.174 seconds.
Memory: max=1468m, total=1468m, free=1433m, used=35m

Input is being processed as paired
Started output streams:	0

There were no contaminants, so 100% of reads and bases from the prior step were retained.

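Finally, let's take the trimmed, contaminant-free reads and do further quality trimming to remove any low-quality or low-entropy reads. `qtrim=r trimq=10` trims bases below Q10 from the right end, `maq=8` discards reads with an average quality below 8, `maxns=0` discards reads containing any N, and `entropy=0.95` discards reads whose entropy falls below 0.95.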

In [16]:
mkdir ../qtrimmed_reads

In [17]:
bbduk.sh in=../artfilt_reads/artfilt_ERR1424899.fq.gz out=../qtrimmed_reads/qtrimmed_ERR1424899.fq.gz  qtrim=r trimq=10 minlen=70 ordered maxns=0 maq=8 entropy=0.95 ow=t;

/usr/local/Cellar/bbtools/38.95/libexec//calcmem.sh: line 75: [: -v: unary operator expected
Max memory cannot be determined.  Attempting to use 1400 MB.
If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command, 
or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
java -ea -Xmx1400m -Xms1400m -cp /usr/local/Cellar/bbtools/38.95/libexec/current/ jgi.BBDuk in=../artfilt_reads/artfilt_ERR1424899.fq.gz out=../qtrimmed_reads/qtrimmed_ERR1424899.fq.gz qtrim=r trimq=10 minlen=70 ordered maxns=0 maq=8 entropy=0.95 ow=t
Executing jgi.BBDuk [in=../artfilt_reads/artfilt_ERR1424899.fq.gz, out=../qtrimmed_reads/qtrimmed_ERR1424899.fq.gz, qtrim=r, trimq=10, minlen=70, ordered, maxns=0, maq=8, entropy=0.95, ow=t]
Version 38.95

Set ORDERED to true
0.047 seconds.
Initial:
Memory: max=1468m, total=1468m, free=1439m, used=29m

Input is being processed as paired
Started output streams:	0.102 seconds.
Processing time:   		49.072 seconds.



Notice that quality trimming discarded no low-quality reads or bases. However, 15% of reads were discarded due to low entropy (likely repeats), resulting in 84% of reads and bases being retained from the prior step.
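
The three QC steps above were run for a single sample. To extend them to the other pairs, one approach is a loop that first prints the commands as a dry run, so they can be inspected before anything is executed. In the sketch below the flags are copied from the cells above; the loop, the two-sample list, and the temporary plan file are my own illustration, and nothing is actually run through BBDuk.

```shell
# Dry run: print the three BBDuk commands for each sample instead of executing them.
# Flags are copied from the cells above; the sample list and the loop are illustrative.
samples="ERR1424899 ERR1424900"
plan=$(mktemp)

for s in $samples; do
    echo "bbduk.sh in=${s}_1.fastq.gz in2=${s}_2.fastq.gz out=../trimmed_reads/trimmed_${s}.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo minlen=70 ref=adapters ordered ow=t"
    echo "bbduk.sh in=../trimmed_reads/trimmed_${s}.fq.gz out=../artfilt_reads/artfilt_${s}.fq.gz k=31 ref=artifacts,phix ordered cardinality ow=t"
    echo "bbduk.sh in=../artfilt_reads/artfilt_${s}.fq.gz out=../qtrimmed_reads/qtrimmed_${s}.fq.gz qtrim=r trimq=10 minlen=70 ordered maxns=0 maq=8 entropy=0.95 ow=t"
done > "$plan"

# Show the planned commands and count them (three per sample).
cat "$plan"
n_cmds=$(wc -l < "$plan" | tr -d ' ')
echo "planned commands: $n_cmds"
```

Once the printed commands look right, the plan file could be executed from inside `all_data`, after the three output directories have been created.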

Now we are ready to assemble the fully QC'd and trimmed reads using SPAdes. Notice that I have selected `--only-assembler` to skip SPAdes' read error correction step because we have already QC'd this data. `--meta` selects metagenomic mode, and `--12` indicates that the input file contains interlaced forward and reverse paired-end reads.

In [19]:
mkdir ../SPAdes_out

In [21]:
spades.py -o ../SPAdes_out/ERR1424899_spades --meta --12 ../qtrimmed_reads/qtrimmed_ERR1424899.fq.gz --only-assembler

Command line: /Users/ashley/miniconda3/envs/metagenome/bin/spades.py	-o	/Volumes/Ashley's External Drive/metagenome-demos/SPAdes_out/ERR1424899_spades	--meta	--12	/Volumes/Ashley's External Drive/metagenome-demos/qtrimmed_reads/qtrimmed_ERR1424899.fq.gz	--only-assembler	

System information:
  SPAdes version: 3.15.2
  Python version: 3.7.11
  OS: Darwin-21.2.0-x86_64-i386-64bit

Output dir: /Volumes/Ashley's External Drive/metagenome-demos/SPAdes_out/ERR1424899_spades
Mode: ONLY assembling (without read error correction)
Debug mode is turned OFF

Dataset parameters:
  Metagenomic mode
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: not specified
      right reads: not specified
      interlaced reads: ["/Volumes/Ashley's External Drive/metagenome-demos/qtrimmed_reads/qtrimmed_ERR1424899.fq.gz"]
      single reads: not specified
      merged reads: not specified
Assembly parameters:
  k: [21, 33, 55]
  Repeat resolution is enabled
  Misma

Let's navigate to the output folder and see what's there.

In [23]:
cd ../SPAdes_out/ERR1424899_spades

In [26]:
ls

K21					misc
K33					params.txt
K55					pipeline_state
assembly_graph.fastg			run_spades.sh
assembly_graph_after_simplification.gfa	run_spades.yaml
assembly_graph_with_scaffolds.gfa	scaffolds.fasta
before_rr.fasta				scaffolds.paths
contigs.fasta				spades.log
contigs.paths				split_input
dataset.info				strain_graph.gfa
first_pe_contigs.fasta			tmp
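
A quick first look at an assembly is to count the contigs and sum their lengths. Here is a minimal sketch using `grep` and `awk` on a toy FASTA file; the headers and sequences are made up and only imitate the `NODE_..._length_..._cov_...` naming style, so the numbers say nothing about the real `contigs.fasta`.

```shell
# Toy FASTA standing in for contigs.fasta; headers and sequences are made up.
tmp=$(mktemp -d)
cat > "$tmp/contigs.fasta" <<'EOF'
>NODE_1_length_12_cov_3.0
ACGTACGTACGT
>NODE_2_length_8_cov_2.5
GGCCTTAA
>NODE_3_length_4_cov_1.0
ACGT
EOF

# Count header lines to get the number of contigs.
n_contigs=$(grep -c '^>' "$tmp/contigs.fasta")

# Sum the length of every non-header line to get the total assembled bases.
total_bp=$(awk '!/^>/ { total += length($0) } END { print total + 0 }' "$tmp/contigs.fasta")

echo "contigs: $n_contigs, total bp: $total_bp"
```

Pointing the same two commands at `contigs.fasta` or `scaffolds.fasta` in the SPAdes output folder gives a rough size check before moving on to more thorough assembly statistics.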
