# 1. nfcore/sarek workflow

From the [nfcore/sarek website](https://nf-co.re/sarek/3.2.3):

> nf-core/sarek is a workflow designed to detect variants on whole genome or targeted sequencing data. Initially designed for Human, and Mouse, it can work on any species with a reference genome.

> The pipeline is built using [Nextflow](https://www.nextflow.io/), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner.

# Contents

[1a. Prerequisites](#prereq2)

[1b. nfcore/sarek workflow overview](#overview2)

[1c. Running a test dataset](#test)

[1d. Initial setup](#setup)

[1e. Creating a samplesheet](#sampsheet)

**************

## 1a. Prerequisites <a class="anchor" id="prereq2"></a>

To run the nfcore/sarek workflow, you will need:

1. **A QUT HPC account.** If you are seeing this Notebook, you most likely already have an HPC account. If not, you can request an account, or any other HPC or bioinformatics support, via the portal here: https://eresearchqut.atlassian.net/servicedesk/customer/portals

2. **Nextflow installed on your HPC account.** If you haven't already installed Nextflow, or you need to update it to the latest version, click the following link, which downloads and opens a Jupyter Notebook that steps you through installing Nextflow: https://jupyterhub.eres.qut.edu.au/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Feresearchqut%2FJupyter_Nextflow_install&urlpath=lab%2Ftree%2FJupyter_Nextflow_install%2FNextflow.ipynb&branch=main

3. **Your sequence data files** (fastq.gz format) copied to a location on the QUT HPC that you can access. If you are unsure of the location of your data files, submit a service request through the portal (link above).
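
As a quick check of the software prerequisites above, the following sketch reports whether a command is available on your `PATH`. The helper name `check_tool` is hypothetical (it is not part of Nextflow or this Notebook). `singularity` is checked only because the test run in section 1c uses the `singularity` profile.

```shell
# Hypothetical helper: report whether a command can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found"
    fi
}

check_tool nextflow
check_tool singularity
```

If either tool is reported as `NOT found`, revisit the installation Notebook linked above or submit a service request through the portal.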

*************

## 1b. nfcore/sarek workflow overview <a class="anchor" id="overview2"></a>

Below is a schematic outlining the nfcore/sarek workflow. For a more detailed overview, click [here](https://nf-co.re/sarek/3.2.3). To see how to run nfcore/sarek from the command line, see [here](https://nf-co.re/sarek/3.2.3/docs/usage). To see what results nfcore/sarek generates, click [here](https://nf-co.re/sarek/3.2.3/docs/output).

<div>
<img src="https://raw.githubusercontent.com/nf-core/sarek/3.2.3//docs/images/sarek_workflow.png" width="300">
    </div>

**************

## 1c. Running a test dataset <a class="anchor" id="test"></a>

This section runs a small test dataset through the nfcore/sarek workflow to confirm that nfcore/sarek works on your HPC account.

<div class="alert alert-block alert-success">
Run the following code cell to run the nfcore/sarek test. Note that this is set to run without output messages (`-q`), as these would span multiple pages. Instead, `nextflow log -f status` is run after the test run has finished, to see if the job ran successfully. The test takes approximately 3 minutes to complete.
</div>

In [None]:
nextflow -q run nf-core/sarek -r 3.2.2 --outdir test -profile test,singularity
nextflow log -f status

**The last 3 lines of the output should show a table with columns such as `TIMESTAMP`, `DURATION` and `STATUS` (you can ignore any `WARN` messages). If the `STATUS` is `OK`, the test run was successful and you can go to the next section. If the `STATUS` is not `OK`, submit a support request through the [Portal](https://eresearchqut.atlassian.net/servicedesk/customer/portals).**
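
If you would rather check the result programmatically than read the table by eye, the sketch below extracts the status of the most recent run. It assumes that `nextflow log` prints tab-separated columns with `STATUS` as the fourth column. Verify this against the output on your own system. The helper name `last_run_status` is hypothetical.

```shell
# Hypothetical helper: read `nextflow log` output on stdin and print the
# STATUS value of the last (most recent) row.
# ASSUMPTION: columns are tab-separated and STATUS is the 4th column.
last_run_status() {
    awk -F'\t' 'NF > 0 { status = $4 }
                END { gsub(/^ +| +$/, "", status); print status }'
}

# Only query Nextflow if it is actually installed.
if command -v nextflow >/dev/null 2>&1; then
    nextflow log | last_run_status
fi
```

A printed value of `OK` corresponds to a successful run. Anything else should be treated the same way as a failed test above.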

*************

## 1d. Initial setup <a class="anchor" id="setup"></a>

<div class="alert alert-block alert-info">
Enter the path of the directory that contains your sequence data files. You can find this path by typing 'pwd' on the command line while you are in that directory, or by contacting the HPC staff via the [portal](https://eresearchqut.atlassian.net/servicedesk/customer/portals). The command below should have the form `root_path=/directory/containing/my/fastq/files`.
</div>

In [None]:
root_path=/work/eresearch_bio/projects/ATGC/Andrea_Magnioni/20230725_WES/data

<div class="alert alert-block alert-success">
Now move to the above directory (cd = change directory): 
</div>

In [None]:
cd $root_path

**NOTE: the above two code cells must be run every time you use this Notebook.**

<div class="alert alert-block alert-warning">
To see if you are in the correct directory, run the 'ls' code cell below. You should see a list of all your sample files ('...fastq.gz' or '...fq.gz' files). If you don't see the files, you've entered the above location incorrectly and need to correct and re-run the above code cell.
</div>

In [None]:
ls
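
When a directory holds many files, a count can be easier to read than the raw `ls` listing. The following sketch counts the files ending in `.fastq.gz` or `.fq.gz` directly inside a directory. The helper name `count_fastq` is hypothetical, and the sketch assumes your data files sit at the top level of the directory rather than in subfolders.

```shell
# Hypothetical helper: count FASTQ files directly inside a directory.
# $1: directory to inspect (defaults to the current directory).
count_fastq() {
    find "${1:-.}" -maxdepth 1 ! -type d \
        \( -name '*.fastq.gz' -o -name '*.fq.gz' \) | wc -l | tr -d ' '
}

echo "FASTQ files found: $(count_fastq .)"
```

For paired-end data you would typically expect an even number, two files per sample and lane.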

********

## 1e. Creating a samplesheet <a class="anchor" id="sampsheet"></a>

nfcore/sarek requires a samplesheet that contains, at minimum, your patient IDs, sample IDs, [sequencing lane number](https://www.biostars.org/p/48906/), and the sequence data files (fastq files) associated with them. Details about the required samplesheet structure, as well as an example samplesheet, can be seen [here](https://nf-co.re/sarek/3.2.3/docs/usage#input-sample-sheet-configurations).
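
For orientation, a paired-end FASTQ samplesheet is a comma-separated file with the columns `patient,sample,lane,fastq_1,fastq_2`, as described in the usage documentation linked above. The sketch below prints a header and one row. The helper `samplesheet_row` and every value passed to it (`patient1`, `sample1`, `lane_1`, the file paths) are hypothetical placeholders, not values from your project.

```shell
# Hypothetical helper: print one samplesheet row.
# $1 patient ID, $2 sample ID, $3 lane, $4 forward-read file, $5 reverse-read file
samplesheet_row() {
    printf '%s,%s,%s,%s,%s\n' "$1" "$2" "$3" "$4" "$5"
}

echo "patient,sample,lane,fastq_1,fastq_2"
# Placeholder values only; replace with your own IDs and file paths.
samplesheet_row patient1 sample1 lane_1 \
    /path/to/sample1_R1.fastq.gz /path/to/sample1_R2.fastq.gz
```

Each sample and lane combination gets its own row, so a sample sequenced across two lanes appears twice with different `lane` values.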

**IMPORTANT: this section requires considerable user input, as the sample names and data files differ between projects. Care must be taken to enter the data correctly, case-sensitive with no extra spaces or missing information. At the end of this section you can view your created samples table to see if you have entered the information correctly. If not, you can re-run this whole section to recreate the samplesheet.**

<div class="alert alert-block alert-info">
First, input the naming convention of your data files. You can see this from the output of the `ls` command above. The files should end in either `.fastq.gz` or `.fq.gz`.
</div>

In [None]:
fq_format=*.fastq.gz
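
Before continuing, it can help to confirm that the pattern you entered matches at least one file in the current directory. A minimal sketch, assuming the pattern is stored in `fq_format` as above. The helper name `pattern_has_matches` is hypothetical.

```shell
# Hypothetical helper: succeed if the glob pattern in $1 matches at least
# one existing file in the current directory.
pattern_has_matches() {
    # $1 is left unquoted on purpose so that the shell expands the pattern.
    for f in $1; do
        [ -e "$f" ] && return 0
    done
    return 1
}

fq_format='*.fastq.gz'
if pattern_has_matches "$fq_format"; then
    echo "Pattern $fq_format matches files in $(pwd)"
else
    echo "No files match $fq_format in $(pwd); try '*.fq.gz' or check your directory"
fi
```

Quoting the pattern in the assignment stores it literally, and the shell only expands it at the point where `$1` is used unquoted inside the helper.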