This notebook shows how to generate basic reports and then carry out clean-up and taxonomic classification of paired-end metagenome .fastq.gz files in Bash.

Sample data from ENA:

Metagenomes from the Tara Oceans Project

Can be found at https://www.ebi.ac.uk/ena/browser/view/PRJEB1787?show=reads

In example code, I have used the 4th-7th listed pairs of Generated FASTQ files

move into content folder:

In [31]:
cd example_content_2

examine content...notice all of the .fastq.gz files are in separate subfolders

In [32]:
ls

ERR1701760	ERR315858	ERR315859


move up one directory and create a new subdirectory to move all of the .fastq.gz files into one place. Then check that the directory was made with ls.

In [33]:
cd ..

First, we need to make a copy of the original data before moving it

In [None]:
cp -R example_content_2 example_content_2_copy
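Before moving anything, it can be worth confirming that the backup is complete. The following is a minimal sketch on a throwaway directory: the `scratch` variable and the placeholder file are hypothetical and exist only for the demonstration. `diff -r` exits with status 0 only when the two directory trees are identical.

```shell
# Build a small throwaway tree that mimics the layout above (placeholder file only).
scratch=$(mktemp -d)
mkdir -p "$scratch/example_content_2/ERR315859"
echo "placeholder" > "$scratch/example_content_2/ERR315859/ERR315859_1.fastq.gz"

# Copy it the same way as in the cell above.
cp -R "$scratch/example_content_2" "$scratch/example_content_2_copy"

# diff -r walks both trees and returns 0 only if every file matches.
if diff -r "$scratch/example_content_2" "$scratch/example_content_2_copy" > /dev/null; then
    backup_status="identical"
else
    backup_status="different"
fi
echo "backup is $backup_status"
```

On the real data the same check would be `diff -r example_content_2 example_content_2_copy`.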

In [None]:
mkdir all_data

In [None]:
ls

move back to example_content_2 directory

In [None]:
cd example_content_2

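Locate all files ending in `.gz` in the subfolders of this directory and move them to `all_data`. The `*` wildcard matches any characters that precede `.gz`. The `-mindepth 2` option restricts matches to items at least two levels below the starting point (the starting directory itself is depth 0 and its direct contents are depth 1), so only files inside subdirectories are considered. `-exec mv {} ../all_data \;` runs `mv` once per match, with `{}` replaced by the path of the matched file. The `-print` action lists each file as it is processed, which lets the user monitor progress.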

In [None]:
find . -mindepth 2 -type f -name '*.gz' -print -exec mv {} ../all_data \;
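Because `mv` is hard to undo, one option is to run `find` with `-print` alone first and add `-exec` only once the listing looks right. The sketch below tries this on a throwaway tree; the `scratch` directory and the empty placeholder files (including `top_level.gz`) are hypothetical and only illustrate how `-mindepth 2` skips files that sit directly in the starting directory.

```shell
# Throwaway tree: two sample folders plus one file at the top level.
scratch=$(mktemp -d)
mkdir -p "$scratch/content/ERR315858" "$scratch/content/ERR315859" "$scratch/all_data"
touch "$scratch/content/ERR315858/ERR315858_1.fastq.gz" \
      "$scratch/content/ERR315859/ERR315859_1.fastq.gz" \
      "$scratch/content/top_level.gz"

# Dry run: -print only lists the matches, nothing is moved yet.
find "$scratch/content" -mindepth 2 -type f -name '*.gz' -print

# Real run: same tests, now with -exec mv.
find "$scratch/content" -mindepth 2 -type f -name '*.gz' -exec mv {} "$scratch/all_data" \;

# Count what was moved and what stayed behind (the depth-1 file).
moved=$(find "$scratch/all_data" -type f -name '*.gz' | wc -l | tr -d ' ')
left=$(find "$scratch/content" -type f -name '*.gz' | wc -l | tr -d ' ')
echo "moved=$moved left_behind=$left"
```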

move into the `all_data` subdirectory to check that all the `.fastq.gz` files have moved.

In [None]:
cd ../all_data

In [None]:
ls
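With many samples, scanning the `ls` output by eye gets unreliable. As a rough sketch, a small helper can confirm that every forward file has a reverse mate. The function name `check_pairs` is hypothetical, and the demonstration runs on empty placeholder files in a scratch directory rather than on the real reads.

```shell
# check_pairs DIR: report any *_1.fastq.gz without a matching *_2.fastq.gz.
check_pairs() {
    dir=$1
    missing=0
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"        # swap the _1 suffix for _2
        if [ ! -e "$r2" ]; then
            echo "missing mate for $(basename "$r1")"
            missing=$((missing + 1))
        fi
    done
    echo "$missing unpaired file(s)"
    [ "$missing" -eq 0 ]
}

# Demonstration: one complete pair and one forward file without a mate.
scratch=$(mktemp -d)
touch "$scratch/ERR315858_1.fastq.gz" "$scratch/ERR315858_2.fastq.gz" \
      "$scratch/ERR315859_1.fastq.gz"
pair_report=$(check_pairs "$scratch")
echo "$pair_report"
```

From inside `all_data`, the equivalent call would be `check_pairs .`.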

Now, we delete the original example_content_2 directory, which holds only empty subfolders at this point.

In [None]:
rm -r ../example_content_2

I am going to perform various quality control read removal and read trimming steps using the tools in BBTools. I will start with one pair of read files to demonstrate syntax.

First, I am going to call the tool clumpify with `clumpify.sh` to remove optical duplicates. Optical duplicates arise when the optical sensor of the sequencing platform detects one read as multiple reads; they are identified as identical reads lying within a particular distance of each other on the flowcell. This is a non-issue for patterned flow cells. Note that for paired-end reads written to a single `out` file, the output is interleaved: both reads of each pair are stored in the same file. The argument `groups` is used to decrease memory usage. The user can specify a number of groups to make the temporary data files used during processing as small as needed. To learn about decreasing `clumpify` memory usage, check out the [official documentation](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/clumpify-guide/)

In [None]:
mkdir ../optical_deduped_reads

In [None]:
clumpify.sh in1=ERR315859_1.fastq.gz in2=ERR315859_2.fastq.gz out=../optical_deduped_reads/optdedup_ERR315859.fastq.gz -Xmx2000m groups=20 dedupe optical ow=t
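Since both mates end up in one output file, a quick sanity check is to count the records: the total should be even, and comparing counts between steps shows how many reads each step removed. This is a minimal sketch; the helper name `count_reads` and the tiny demo file are hypothetical, and the count assumes standard four-line FASTQ records.

```shell
# count_reads FILE: number of FASTQ records in a gzipped file (4 lines per record).
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Tiny interleaved demo file: one pair = two records.
scratch=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r1/2\nTTGA\n+\nIIII\n' | gzip > "$scratch/demo.fastq.gz"

n_reads=$(count_reads "$scratch/demo.fastq.gz")
echo "records: $n_reads (pairs: $((n_reads / 2)))"
```

For the real output the call would be `count_reads ../optical_deduped_reads/optdedup_ERR315859.fastq.gz`.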

Second, I am going to call the tool `filterbytile.sh` to remove low-quality reads from the optical-deduped files by flowcell tile. The tool does this by averaging the quality of all reads within a micro-tile area and then keeping or discarding the entire micro-tile.

In [None]:
mkdir ../tile_filtered_reads

In [None]:
filterbytile.sh in=../optical_deduped_reads/optdedup_ERR315859.fastq.gz out=../tile_filtered_reads/tilefilt_ERR315859.fastq.gz ow=t

Third, I will call the tool bbduk with `bbduk.sh` to trim adapters. The reference library used for trimming is `adapters`, which contains all Illumina adapter sequences. `ktrim` determines whether the 3' (right, `r`) or 5' (left, `l`) adapters are trimmed; in this case, we are setting it to trim the 3' adapter. `k` specifies the k-mer length in bp, `mink` is the minimum allowable k-mer length at the end of the sequence, `hdist` is the number of allowable mismatches, and `ow=t` allows existing files to be overwritten. The argument `ordered` means that the tool will keep the same read order as the input, which was set by `clumpify`.

In [None]:
mkdir ../trimmed_reads

In [None]:
bbduk.sh in=../tile_filtered_reads/tilefilt_ERR315859.fastq.gz out=../trimmed_reads/trimmed_ERR315859.fastq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=90 ref=adapters ordered ow=t
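To make the effect of `ktrim=r` and `minlen` more concrete, here is a toy illustration in plain shell. It is not how BBDuk works internally: BBDuk screens many reference k-mers and tolerates mismatches, while this sketch looks for a single exact k-mer. The read sequence is made up, and the 23-mer is assumed to be the start of an adapter purely for the example.

```shell
# Toy example only: one exact k-mer instead of BBDuk's full k-mer index.
read_seq="ACGTACGTACGTAGATCGGAAGAGCACACGTCTGAACT"   # hypothetical read
adapter_kmer="AGATCGGAAGAGCACACGTCTGA"               # 23-mer, assumed adapter start

# Right-trimming: drop the matched k-mer and everything to its right.
trimmed=${read_seq%%"$adapter_kmer"*}
echo "before: ${#read_seq} bp, after: ${#trimmed} bp -> $trimmed"

# minlen=90 would then discard reads that became too short.
if [ "${#trimmed}" -lt 90 ]; then
    verdict="discard"
else
    verdict="keep"
fi
echo "verdict with minlen=90: $verdict"
```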

Fourth, I will remove any synthetic DNA (spike-ins) and other such artifacts from the trimmed reads using `bbduk.sh`. The argument `cardinality` will approximate the number of unique k-mers. `phix` refers to a virus that is often spiked in during sequencing runs.

In [None]:
mkdir ../artfilt_reads

In [None]:
bbduk.sh in=../trimmed_reads/trimmed_ERR315859.fastq.gz out=../artfilt_reads/artfilt_ERR315859.fastq.gz k=31 ref=artifacts,phix ordered cardinality ow=t

Fifth, I will trim low-quality regions from reads and discard reads with a lot of repeats (low-entropy reads) using `bbduk.sh`. The minimum average quality a read needs in order to be retained is set by the argument `maq`.

In [None]:
mkdir ../qtrimmed_reads

In [None]:
bbduk.sh in=../artfilt_reads/artfilt_ERR315859.fastq.gz out=../qtrimmed_reads/qtrimmed_ERR315859.fastq.gz qtrim=r trimq=10 minlen=90 ordered maxns=0 maq=8 entropy=.95 ow=t
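For intuition about what an average-quality threshold such as `maq=8` measures, the sketch below computes the plain arithmetic mean of the Phred scores in a quality string, assuming Phred+33 encoding. The helper name `mean_phred` and the quality string are hypothetical, and BBDuk's own averaging may differ in detail (for example by averaging error probabilities), so this is an illustration rather than a reimplementation.

```shell
# mean_phred QUALITY_STRING: arithmetic mean Phred score, assuming Phred+33.
mean_phred() {
    # od prints each character as its decimal byte value; awk subtracts the offset.
    printf '%s' "$1" | od -An -tu1 |
        awk '{ for (i = 1; i <= NF; i++) { sum += $i - 33; n++ } }
             END { if (n) printf "%.1f\n", sum / n }'
}

# Four high-quality bases (I = Q40) followed by four poor ones (# = Q2).
demo_quality='IIII####'
mean_q=$(mean_phred "$demo_quality")
echo "mean quality of $demo_quality: $mean_q"
```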

Finally, I will use the package `sourmash` to compute the MinHash signature of the fully cleaned and QC'd read file. We won't be able to visualize sample similarity and dissimilarity through signatures until we have processed all samples, so we will get to this later.

In [None]:
mkdir ../MinHashSigs

In [None]:
sourmash compute -k 31 --scaled=1000 ../qtrimmed_reads/qtrimmed_ERR315859.fastq.gz --output ../MinHashSigs/qtrimmed_ERR315859.fastq.gz.sig

Now that the reads have been QC'd, cleaned and trimmed, let's try to do a simple classification of the 16S genes using the package Kraken2, which was installed with Homebrew on my machine. The raw per-read output can be redirected to `/dev/null` when it is not needed.

In [None]:
cd ..

In [None]:
mkdir reports

In [None]:
mkdir text

Before running Kraken2, we need to define the path of the taxonomy databases

In [None]:
export KRAKEN2_DB_PATH='/Users/ashley/Applications/kraken2-2.0.9-beta/'
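A wrong database path is an easy mistake to make, so it may help to check the database folder before classifying. A built Kraken2 database contains the files `hash.k2d`, `opts.k2d` and `taxo.k2d`. The helper name `check_kraken_db` is hypothetical, and the demonstration below uses empty placeholder files in a scratch directory instead of a real database.

```shell
# check_kraken_db DIR: succeed only if the three Kraken2 database files are present.
check_kraken_db() {
    db_dir=$1
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -e "$db_dir/$f" ]; then
            echo "missing $db_dir/$f"
            return 1
        fi
    done
    echo "database looks complete: $db_dir"
}

# Demonstration with placeholder files (not a usable database).
scratch=$(mktemp -d)
mkdir -p "$scratch/silva" "$scratch/empty"
touch "$scratch/silva/hash.k2d" "$scratch/silva/opts.k2d" "$scratch/silva/taxo.k2d"

check_kraken_db "$scratch/silva"
check_kraken_db "$scratch/empty" || echo "fix the path before running kraken2"
```

With the variable exported above, the real check would be `check_kraken_db "${KRAKEN2_DB_PATH}silva"`.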

We will run kraken2 taxonomy (using the SILVA 16S database) on the interleaved, QC'd and trimmed file, and produce a report. Confidence (0-1) will be set to 0.5. To skip producing the default per-read text output, follow the input file name with `> /dev/null`. If skipping text output, do not pass the `--output` option.

In [None]:
/Users/ashley/Applications/kraken2-2.0.9-beta/kraken2 --db silva --confidence 0.5 --output text/kraken2_text_qtrimmed_ERR315859.tsv --report reports/kraken2_report_qtrimmed_ERR315859.tsv qtrimmed_reads/qtrimmed_ERR315859.fastq.gz

The report has a particular format. Let's take a look at the first few lines. The first column is the percentage of sequence fragments covered by the clade rooted at this taxon, the second column is the number of sequence fragments covered by that clade, the 3rd column is the number of sequence fragments assigned directly to the taxon, the 4th column is the rank code (for example U for unclassified, D for domain, G for genus, S for species), the 5th column is the NCBI taxonomic ID, and the final column is the indented scientific name. The `|` notation for paired reads belongs to the per-read text output rather than to the report: there, the 4th column takes the form bp forward read|bp reverse read, and `|:|` marks the boundary between the two reads in the 5th column.

In [None]:
head reports/kraken2_report_qtrimmed_ERR315859.tsv
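Beyond `head`, the report is easy to query with `awk` because it is tab-separated. As a rough sketch, the lines below pull out the genus-level rows and sort them by clade read count. The demo report is invented (taxa `GenusA` and `GenusB` and all numbers are hypothetical) and only mimics the six-column layout.

```shell
# Invented report lines in the six-column, tab-separated layout.
scratch=$(mktemp -d)
report="$scratch/demo_report.tsv"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    " 40.00" 40 40 U 0   "unclassified" \
    " 60.00" 60 0  R 1   "root" \
    " 20.00" 20 20 G 102 "      GenusB" \
    " 35.00" 35 5  G 101 "      GenusA" > "$report"

# Keep genus rows (rank code G), strip the name indentation,
# and sort by the number of fragments in the clade (column 2).
top_genera=$(awk -F'\t' '$4 == "G" { gsub(/^ +/, "", $6); print $2 "\t" $6 }' "$report" |
             sort -k1,1nr)
echo "$top_genera"
```

On the real data, `"$report"` would be replaced by `reports/kraken2_report_qtrimmed_ERR315859.tsv`.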

Next, let's take a look at the text output, which also has a particular format. The first column indicates whether the sequence was classified (C) or not (U), the second column is the sequence ID within the fastq file, the 3rd column is the taxonomy ID assigned to the sequence by Kraken2 (0 if unclassified), the 4th column is the sequence length in bp, and the last column is the k-mer map, listing which taxon each run of k-mers mapped to. For example, 0:66 means 66 k-mers were mapped to unclassified.
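A quick way to summarize the text output is to tally the first column. The sketch below does this with `awk` on three invented lines that follow the five-column layout described above; the read IDs, taxonomy IDs and k-mer maps are hypothetical.

```shell
# Invented per-read output lines: status, read ID, taxonomy ID, length, k-mer map.
scratch=$(mktemp -d)
kraken_out="$scratch/demo_output.tsv"
printf '%s\t%s\t%s\t%s\t%s\n' \
    C read1 101 96 "101:50 0:12" \
    U read2 0   96 "0:62" \
    C read3 102 90 "102:40 0:16" > "$kraken_out"

# Count how many reads were classified (C) and unclassified (U).
class_summary=$(awk -F'\t' '{ n[$1]++ }
    END { printf "classified=%d unclassified=%d\n", n["C"], n["U"] }' "$kraken_out")
echo "$class_summary"

# Peek at the first lines, as done for the report.
head -n 2 "$kraken_out"
```

For the real file, the input would be `text/kraken2_text_qtrimmed_ERR315859.tsv`.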

Now we will use Krona to visualize our results. Because Krona was installed with conda into my metagenome conda environment, I need to activate that environment first. I have done this in my native terminal using `conda activate` because it is buggy in the Jupyter notebook bash environment.

Because we are only interested in visualizing the 16S composition of these read files, we will only update the Krona taxonomy to include the SILVA database.

In [None]:
ktUpdateTaxonomy.sh --only-build /Users/ashley/Applications/kraken2-2.0.9-beta/silva/taxonomy/

In [None]:
mkdir krona_outputs

In [None]:
cd qtrimmed_reads

In [None]:
ktImportTaxonomy -o ../krona_outputs/krona_qtrimmed_ERR315859.html -t 5 -m 3 -tax /Users/ashley/Applications/kraken2-2.0.9-beta/silva/taxonomy/ ../reports/kraken2_report_qtrimmed_ERR315859.tsv

The interactive Krona graph html files will now be available in the output folder, and will look like [this](https://github.com/ashleybc/metagenome-only/blob/main/trimming-classification/visuals/Kronagraph.png)

Now that we know what to do for one pair of reads, let's loop through all of the read pairs. This loop will set the variable `prefix` to each unique read set name

In [None]:
cd ../all_data

In [None]:
for prefix in $(ls *.gz | cut -f1 -d'_' | sort -u); do
    echo "$prefix"
    # Resolve the forward and reverse read files for this sample
    read1=( ${prefix}*_1.fastq.gz )
    read2=( ${prefix}*_2.fastq.gz )

    # 1. Remove optical duplicates (interleaved output)
    clumpify.sh in1=${read1} in2=${read2} out=../optical_deduped_reads/optdedup_${prefix}.fastq.gz groups=20 dedupe optical ow=t
    # 2. Filter low-quality reads by flowcell tile
    filterbytile.sh in=../optical_deduped_reads/optdedup_${prefix}.fastq.gz out=../tile_filtered_reads/tilefilt_${prefix}.fastq.gz ow=t
    # 3. Trim adapters
    bbduk.sh in=../tile_filtered_reads/tilefilt_${prefix}.fastq.gz out=../trimmed_reads/trimmed_${prefix}.fastq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=90 ref=adapters ordered ow=t
    # 4. Remove synthetic artifacts and PhiX
    bbduk.sh in=../trimmed_reads/trimmed_${prefix}.fastq.gz out=../artfilt_reads/artfilt_${prefix}.fastq.gz k=31 ref=artifacts,phix ordered cardinality ow=t
    # 5. Quality-trim and drop low-entropy reads
    bbduk.sh in=../artfilt_reads/artfilt_${prefix}.fastq.gz out=../qtrimmed_reads/qtrimmed_${prefix}.fastq.gz qtrim=r trimq=10 minlen=90 ordered maxns=0 maq=8 entropy=.95 ow=t
    # 6. Compute the MinHash signature
    sourmash compute -k 31 --scaled=1000 ../qtrimmed_reads/qtrimmed_${prefix}.fastq.gz --output ../MinHashSigs/qtrimmed_${prefix}.fastq.gz.sig
    # 7. Classify with Kraken2 (input file name now includes the .fastq.gz extension)
    /Users/ashley/Applications/kraken2-2.0.9-beta/kraken2 --db silva --confidence 0.5 --output ../text/kraken2_text_qtrimmed_${prefix}.tsv --report ../reports/kraken2_report_qtrimmed_${prefix}.tsv ../qtrimmed_reads/qtrimmed_${prefix}.fastq.gz
    # 8. Build the Krona chart
    ktImportTaxonomy -o ../krona_outputs/krona_qtrimmed_${prefix}.html -t 5 -m 3 -tax /Users/ashley/Applications/kraken2-2.0.9-beta/silva/taxonomy/ ../reports/kraken2_report_qtrimmed_${prefix}.tsv
done
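Before launching the full loop, which can run for a long time, it may be useful to do a dry run that only prints what would be executed. The sketch below is one possible variant, not the loop used above: it globs over the forward read files instead of parsing `ls`, derives the prefix with `basename`, and skips samples without a reverse file. It runs on empty placeholder files in a scratch directory, and only the first pipeline command is echoed.

```shell
# Placeholder files standing in for two samples.
scratch=$(mktemp -d)
touch "$scratch/ERR315858_1.fastq.gz" "$scratch/ERR315858_2.fastq.gz" \
      "$scratch/ERR315859_1.fastq.gz" "$scratch/ERR315859_2.fastq.gz"

prefixes=""
for r1 in "$scratch"/*_1.fastq.gz; do
    prefix=$(basename "$r1" _1.fastq.gz)      # e.g. ERR315858
    r2="$scratch/${prefix}_2.fastq.gz"
    if [ ! -e "$r2" ]; then
        echo "skipping $prefix: no reverse reads" >&2
        continue
    fi
    # Dry run: print the command instead of executing it.
    echo "clumpify.sh in1=$r1 in2=$r2 out=../optical_deduped_reads/optdedup_${prefix}.fastq.gz groups=20 dedupe optical ow=t"
    prefixes="$prefixes $prefix"
done
echo "would process:$prefixes"
```

Once the printed commands look right, removing the `echo` wrapper turns the dry run into the real thing.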

Now that MinHash signatures have been computed for all reads, we can do an all vs. all signature comparison and visualize it.

In [None]:
mkdir ../MinHashPlots

In [None]:
sourmash compare --traverse-directory ./ ../MinHashSigs/*.sig -k 31 -o meta_comp

In [None]:
sourmash plot --pdf --labels meta_comp