# Detaxizer Notebook

The notebook provides a step-by-step explanation of the `nf-core/detaxizer` pipeline.

To familiarize yourself with nf-core pipelines, [click here](https://nf-co.re/). **Detaxizer** processes raw metagenomic sequencing data (FASTQ format) and enables the detection and, optionally, the removal of specific taxa.

To achieve this, tools such as bbduk, kraken2 and blastn are applied. This notebook describes the necessary libraries and the automated creation of the sample sheet (the input for detaxizer) for easy reproducibility.

For further information check out the official repository: https://github.com/nf-core/detaxizer/tree/1.0.0 



## Important setup:

To use this pipeline, conda must be installed ([installation guide](https://conda.io/projects/conda/en/latest/user-guide/install/index.html)).

If you are running this notebook on the M3 cluster, you should already have conda installed.

The second step is to set up a Nextflow environment, which is required for the pipeline. This environment installs all necessary software dependencies in isolation, preventing conflicts and ensuring reproducibility of the workflow.

Whenever you install a new library, do so inside a dedicated environment: create it with `conda create --name <env_name>` and activate it with `conda activate <env_name>`.

Last but not least, update your `~/.bashrc` file. If you are part of the M3 team you can add an alias such as `alias m3='ssh username@l1.m3c.uni-tuebingen.de'` which is a shortcut for accessing the server.

Also, set a common directory for storing downloaded nf-core pipelines, and adjust the path to match your system. An example would be:
```
export NXF_SINGULARITY_CACHEDIR="/mnt/lustre/groups/maier/YOUR_M3HPC_USERNAME/bin/nf-core"
```
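
Whether the variable is visible to the R session behind this notebook can be checked with base R. The following is only a minimal sketch; the helper name `check_cache_dir` is hypothetical and not part of the pipeline. Note that a variable exported in `~/.bashrc` is only visible if R was started from a shell that read that file.

```r
# Hypothetical helper: report whether an environment variable is set
# and whether it points to an existing directory.
check_cache_dir = function(var = "NXF_SINGULARITY_CACHEDIR") {
  value = Sys.getenv(var, unset = "")
  list(
    variable = var,
    is_set = nzchar(value),
    dir_exists = nzchar(value) && dir.exists(value)
  )
}

# Show the result for the cache directory variable
str(check_cache_dir())
```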

To run multiple tasks at the same time, I recommend using the terminal multiplexer `screen`. This way, your sessions keep running even if you lose your HPC connection.
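
As a quick sanity check, the tools mentioned in this section can be looked up on the `PATH` from R. This is a sketch under the assumption that the tools are exposed as executables; the helper name `tools_available` is hypothetical, and a tool reported as missing here may still be available in a login shell or after activating an environment.

```r
# Hypothetical helper: TRUE for every command-line tool found on the PATH.
tools_available = function(tools) {
  paths = Sys.which(tools)              # "" for tools that are not found
  setNames(nzchar(unname(paths)), tools)
}

print(tools_available(c("conda", "nextflow", "screen")))
```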

You might encounter problems when running an nf-core pipeline with your setup. One common mistake is having the wrong version of the pipeline, which clashes with your settings. More is detailed in the last section.

## Libraries to load beforehand

In [1]:
# Package loading:
library(tidyverse)
library(conflicted)

“package ‘ggplot2’ was built under R version 4.3.3”
“package ‘stringr’ was built under R version 4.3.3”
“package ‘forcats’ was built under R version 4.3.3”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Solve conflicts

In [2]:
# Specifying preferences to solve conflicts
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::lag)

[conflicted] Will prefer dplyr::filter over any other package.
[conflicted] Will prefer dplyr::lag over any other package.


## Set paths

In [3]:
# Set the base directory. Place to test small data sets.
base_dir = "/mnt/lustre/groups/maier/maina479/projects/Detaxizer_Notebook_final/data/raw_data"

# Output directory
out_dir = "/mnt/lustre/groups/maier/maina479/projects/Detaxizer_Notebook_final/data/output"
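dir.create(out_dir, recursive = TRUE, showWarnings = FALSE) # No warning if the directory already exists

# Sheet directory
sheet_dir = file.path(out_dir, "sheets")
dir.create(sheet_dir, showWarnings = FALSE)

# Pipeline output
output_detax = file.path(out_dir, "output_detax")
dir.create(output_detax, showWarnings = FALSE)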

## Automate the samplesheet.csv

The following code uses the base directory (`base_dir`) where the input data is stored. All files ending with `.fastq.gz` in this directory and its subfolders are listed and split into `forward_reads` (file name contains the R1 label) and `reverse_reads` (file name contains the R2 label).

Then the sample ID, which identifies each sample, is extracted from the forward and reverse reads. The reverse reads are reordered so that both lists have the same sample order.

In [4]:
# List all files ending with ".fastq.gz" in the "raw_data" directory, including all subdirectories
data_fastqs = list.files(path = base_dir, 
  pattern = "\\.fastq\\.gz$", 
  recursive = TRUE, 
  full.names = TRUE)

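# Column "short_reads_fastq_1" needs the R1 files and column "short_reads_fastq_2" needs the R2 files, so they are filtered separately.
# The label is matched against the file name only, so that "R1"/"R2" in a directory name cannot cause a mismatch.
forward_reads = data_fastqs[grepl("R1", basename(data_fastqs))]
reverse_reads = data_fastqs[grepl("R2", basename(data_fastqs))]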

# Get the sample name (ID), which is everything before the first "_". Adjust this for your file name convention
sample_id = function(path) {
  full_name = basename(path)
  sub("_.*", "", full_name)
}

# Extract the sample IDs from the forward_reads
apply_ids = sapply(forward_reads, sample_id)

# Since the order of reverse_reads can be different, align in the same order as forward_reads
reverse_reads = reverse_reads[ match(apply_ids, sapply(reverse_reads, sample_id)) ]
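
The pairing logic can be tried on a toy example before running it on real data. The sketch below uses hypothetical file names that are not part of the data set, assumes the same "everything before the first `_`" naming rule, and adds a few checks that stop with an error if a forward read has no partner.

```r
# Hypothetical file names following the "<sample>_..._R1/R2" convention.
fwd_toy = c("/data/S1_L001_R1_001.fastq.gz", "/data/S2_L001_R1_001.fastq.gz")
rev_toy = c("/data/S2_L001_R2_001.fastq.gz", "/data/S1_L001_R2_001.fastq.gz")

# Same rule as in the notebook: the ID is everything before the first "_".
get_id = function(path) sub("_.*", "", basename(path))

# Reorder the reverse reads so that they follow the forward reads.
fwd_ids = get_id(fwd_toy)
rev_sorted = rev_toy[match(fwd_ids, get_id(rev_toy))]

# Checks: every forward read has a partner, IDs are unique, and the order matches.
stopifnot(!anyNA(rev_sorted))
stopifnot(anyDuplicated(fwd_ids) == 0)
stopifnot(identical(get_id(rev_sorted), fwd_ids))

print(data.frame(sample = fwd_ids,
                 R1 = basename(fwd_toy),
                 R2 = basename(rev_sorted)))
```

If `match()` finds no partner for a sample, it returns `NA`, which would otherwise end up silently in the sample sheet; the first check catches this case.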

## Create samplesheet.csv for input of pipeline
Finally, the samplesheet is created as a data frame with the columns `sample`, `short_reads_fastq_1`, `short_reads_fastq_2` and `long_reads_fastq_1`; the last column stays empty in our case because the data set contains no long reads.

It is then saved as a .csv file in the desired location and ready to use in the next step.

In [5]:
# Specify sample IDs together with the FASTQ file paths in a data frame (structure similar to a table for data storage)
samplesheet = data.frame(
  sample = apply_ids,
  short_reads_fastq_1 = forward_reads, # R1 forward reads
  short_reads_fastq_2 = reverse_reads, # R2 reverse reads
  long_reads_fastq_1 = rep("", length(apply_ids)) # Not present in our data set so one empty per row
)

# Save as a .csv file
sheet_path = file.path(sheet_dir, "samplesheet.csv")
write.csv(samplesheet, file = sheet_path, row.names = FALSE, quote = FALSE)

cat("Samplesheet was created in samplesheet.csv\n")

Samplesheet was created in samplesheet.csv
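
Before handing the sheet to the pipeline, it can be read back and checked. The following sketch is an assumption-laden illustration, not the pipeline's own validation: the helper name `validate_samplesheet` is hypothetical, the expected columns simply mirror the ones used in this notebook, and the toy paths do not exist on disk.

```r
# Hypothetical helper: stop with an error if the sheet looks malformed.
validate_samplesheet = function(path) {
  sheet = read.csv(path, stringsAsFactors = FALSE, colClasses = "character")
  expected = c("sample", "short_reads_fastq_1",
               "short_reads_fastq_2", "long_reads_fastq_1")
  r1 = sheet$short_reads_fastq_1
  r2 = sheet$short_reads_fastq_2
  stopifnot(
    "unexpected column names" = identical(colnames(sheet), expected),
    "duplicated sample IDs" = anyDuplicated(sheet$sample) == 0,
    "missing R1 or R2 path" = all(!is.na(r1) & nzchar(r1), !is.na(r2) & nzchar(r2))
  )
  invisible(sheet)
}

# Toy sheet written to a temporary file (paths are placeholders).
toy_path = tempfile(fileext = ".csv")
toy_sheet = data.frame(
  sample = c("S1", "S2"),
  short_reads_fastq_1 = c("/data/S1_R1.fastq.gz", "/data/S2_R1.fastq.gz"),
  short_reads_fastq_2 = c("/data/S1_R2.fastq.gz", "/data/S2_R2.fastq.gz"),
  long_reads_fastq_1 = ""
)
write.csv(toy_sheet, toy_path, row.names = FALSE, quote = FALSE)

checked = validate_samplesheet(toy_path)
print(checked)
```

For the real sheet, `validate_samplesheet(sheet_path)` would be the corresponding call; a check with `file.exists()` on the two read columns could be added in the same way.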


## Run pipeline

Run the **Detaxizer** pipeline (repository: nf-core/detaxizer) with the command shown in the next block.

The test profile (`-profile test`) is a configuration profile shipped with the pipeline that uses minimal resources; on the M3 HPC, the `m3c` profile is available instead. The input is the samplesheet.csv file prepared above, and the output directory (`--outdir`) is populated after execution. All files and directories must be specified with the correct path. Pin the pipeline version with `-r`, since a mismatched version can cause problems.

When constructing the shell command, include the parameters `--enable_filter` and `--filter_trimmed`, as well as `--reads_minlength 70`. The first parameter activates the filter step; the second specifies that the reads already trimmed by detaxizer (not the raw reads) are used as input for the filtering step (according to https://nf-co.re/detaxizer/1.0.0/parameters). Finally, filtering out reads shorter than 70 base pairs (bp) is essential: to use the parameter `perform_shortread_redundancyestimation` for checking metagenome coverage in the Taxprofiler notebook, the k-mer size must be at least 24 bp; otherwise, an error will appear. For this reason, always include `--reads_minlength 70` when running **Detaxizer**.

In [11]:
glue::glue("cd {out_dir} && \\
conda activate {conda_env} && \\
nextflow run nf-core/detaxizer -r 1.0.0 \\
-profile m3c \\
--input {samples_sheet} \\
--outdir {pipeline_out} \\
--filter_trimmed \\
--enable_filter \\
--reads_minlength 70",
    out_dir = out_dir,
    conda_env = "nextflow",
    samples_sheet = sheet_path,
    pipeline_out = output_detax)
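
The `glue()` call only builds the command as a string. One option for running it, for example inside a `screen` session, is to save it as a shell script first. The sketch below is one possible way to do that; the helper name `write_run_script`, the script name `run_detaxizer.sh`, and the shortened placeholder command are assumptions for illustration.

```r
# Hypothetical helper: write a shell command into an executable script file.
write_run_script = function(command, script_path) {
  writeLines(c("#!/usr/bin/env bash", as.character(command)), script_path)
  Sys.chmod(script_path, mode = "0755")
  invisible(script_path)
}

# Placeholder command; in the notebook this would be the result of glue().
demo_cmd = "nextflow run nf-core/detaxizer -r 1.0.0 -profile m3c --input samplesheet.csv --outdir output_detax"
demo_script = write_run_script(demo_cmd, file.path(tempdir(), "run_detaxizer.sh"))

# Show what was written
cat(readLines(demo_script), sep = "\n")
```

The script can then be started from a terminal with `bash run_detaxizer.sh`.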

## Pipeline Output

Important folders for further metagenome processing and analysis are the following:
- `output_detax/filter`, which is later used in the Taxprofiler Notebook.
- `output_detax/multiqc`, specifically the "multiqc_report.html" for quality control.
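
After a run, the presence of these outputs can be checked from R. This is a minimal sketch: the helper name `check_outputs` is hypothetical, it assumes the report sits directly inside the `multiqc` folder, and it is demonstrated on a mock directory rather than on real pipeline output.

```r
# Hypothetical helper: report which of the expected outputs exist.
check_outputs = function(pipeline_dir) {
  expected = c(
    filter = file.path(pipeline_dir, "filter"),
    multiqc_report = file.path(pipeline_dir, "multiqc", "multiqc_report.html")
  )
  vapply(expected, file.exists, logical(1))
}

# Mock output directory that contains only the "filter" folder.
mock_dir = file.path(tempdir(), "output_detax_demo")
dir.create(file.path(mock_dir, "filter"), recursive = TRUE, showWarnings = FALSE)

print(check_outputs(mock_dir))
```

For the real run, `check_outputs(output_detax)` would be the corresponding call.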
