# 1. nfcore/sarek workflow

From the [nfcore/sarek website](https://nf-co.re/sarek/3.2.3):

> nf-core/sarek is a workflow designed to detect variants on whole genome or targeted sequencing data. Initially designed for Human, and Mouse, it can work on any species with a reference genome.

> The pipeline is built using [Nextflow](https://www.nextflow.io/), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner.

# Contents

[1a. Prerequisites](#prereq2)

[1b. nfcore/sarek workflow overview](#overview2)

[1c. Running a test dataset](#test)

[1d. Initial setup](#setup)

[1e. Creating a samplesheet](#sampsheet)

**************

## 1a. Prerequisites <a class="anchor" id="prereq2"></a>

To run the nfcore/sarek workflow, you will need:

1. **A QUT HPC account.** If you are seeing this Notebook, you most likely already have an HPC account. Regardless, you can request an account, or any other HPC or bioinformatics support, via the portal here: https://eresearchqut.atlassian.net/servicedesk/customer/portals

2. **Nextflow installed on your HPC account.** If you haven't already installed Nextflow, or you need to update Nextflow to the latest version, click on the following link, which will download and open a Jupyter Notebook that will step you through installing Nextflow: https://jupyterhub.eres.qut.edu.au/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Feresearchqut%2FJupyter_Nextflow_install&urlpath=lab%2Ftree%2FJupyter_Nextflow_install%2FNextflow.ipynb&branch=main

3. **Your sequence data files** (fastq.gz format) copied to a location on the QUT HPC that you can access. If you are unsure of the location of your data files, submit a service request through the portal (link above).
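
The Nextflow prerequisite can be checked from a terminal before going further. The following is a minimal sketch and not part of the original Notebook: the helper name `check_tool` is made up for illustration, and it only tests whether a command is on your `PATH`, not which version is installed.

```shell
# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found"
        return 1
    fi
}

# Nextflow is required by this Notebook; '|| true' keeps the cell going
# even when the check fails, so the message is still shown.
check_tool nextflow || true
```

If the message says `NOT found`, follow the installation link in step 2 before continuing.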

*************

## 1b. nfcore/sarek workflow overview <a class="anchor" id="overview2"></a>

Below is a schematic outlining the nfcore/sarek workflow. For a more detailed overview, click [here](https://nf-co.re/sarek/3.2.3). To see how to run nfcore/sarek from the command-line, see [here](https://nf-co.re/sarek/3.2.3/docs/usage). To see what results nfcore/sarek generates, click [here](https://nf-co.re/sarek/3.2.3/docs/output).

<div>
<img src="https://raw.githubusercontent.com/nf-core/sarek/3.2.3//docs/images/sarek_workflow.png" width="300">
    </div>

**************

## 1c. Running a test dataset <a class="anchor" id="test"></a>

This section will run a small test dataset through the nfcore/sarek workflow, to see if you can successfully run nfcore/sarek.

<div class="alert alert-block alert-success">
Run the following code cell to run the nfcore/sarek test. Note that this is set to run without output messages (`-q`), as these would span multiple pages. Instead, `nextflow log -f status` is run after the test run has finished, to see if the job ran successfully. The test takes approximately 3 minutes to complete.
</div>

In [None]:
mkdir -p "$HOME/nftemp" && cd "$HOME/nftemp"
nextflow -q run nf-core/sarek -r 3.2.2 --outdir test -profile test,singularity
nextflow log -f status

**The last 3 lines of the output should show a table with columns such as `TIMESTAMP`, `DURATION` and `STATUS` (you can ignore any `WARN` messages). If the `STATUS` is `OK`, the test run was successful and you can go to the next section. If the `STATUS` is not `OK`, submit a support request through the [Portal](https://eresearchqut.atlassian.net/servicedesk/customer/portals).**

<div class="alert alert-block alert-success">
To run the above test, Nextflow downloads test data and generates numerous output directories and files. These should be removed after the test is run, by running the following:  
    </div>

In [None]:
cd $HOME
rm -rf nftemp

*************

## 1d. Initial setup <a class="anchor" id="setup"></a>

<div class="alert alert-block alert-info">
Enter the directory that contains your sequence data files. You can find this directory path by typing 'pwd' on the command-line when you are in that directory, or by contacting the HPC staff via the [portal](https://eresearchqut.atlassian.net/servicedesk/customer/portals). The structure of the below command should be `root_path=/directory/containing/my/fastq/files`.
</div>

In [None]:
root_path=/work/eresearch_bio/projects/ATGC/Andrea_Magnioni/20230725_WES/data/jupyter_dev

<div class="alert alert-block alert-success">
Now move to the above directory (cd = change directory): 
</div>

In [None]:
cd "$root_path"

<div class="alert alert-block alert-danger">
NOTE: the above two code cells must be run every time you use this Notebook.
    </div>

<div class="alert alert-block alert-warning">
To see if you are in the correct directory, run the 'ls' code cell below. You should see a list of all your sample files ('...fastq.gz' or '...fq.gz' files). If you don't see the files, you've entered the above location incorrectly and need to correct and re-run the above code cell.
</div>

In [None]:
ls
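
As a complement to `ls`, the sketch below counts the files whose names end in `fastq.gz` or `fq.gz`. It is an illustration only: it builds a throwaway directory of invented, empty files (`demo_dir`, `sampleA_...`, `sampleB_...`) so that it can be run anywhere. To check your own data, you would point `find` at `"$root_path"` instead.

```shell
# Build a throwaway directory with invented, empty files for the demonstration.
demo_dir=$(mktemp -d)
touch "$demo_dir/sampleA_R1_001.fastq.gz" "$demo_dir/sampleA_R2_001.fastq.gz" \
      "$demo_dir/sampleB_1.fq.gz" "$demo_dir/sampleB_2.fq.gz" "$demo_dir/notes.txt"

# Count files ending in fastq.gz or fq.gz (top level only, like 'ls').
n_fastq=$(find "$demo_dir" -maxdepth 1 -type f \
    \( -name '*fastq.gz' -o -name '*fq.gz' \) | wc -l)
echo "Number of fastq files: $n_fastq"

# Remove the throwaway directory again.
rm -rf "$demo_dir"
```

For paired-end data the count should be an even number, two files per sample and lane.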

********

## 1e. Creating a samplesheet <a class="anchor" id="sampsheet"></a>

nfcore/sarek requires a samplesheet that contains, at minimum, 5 columns:
1. patient ID
2. sample ID
3. [sequencing lane number](https://www.biostars.org/p/48906/)
4. Read 1 sequence data file (fastq file) for that sample and lane
5. Read 2 sequence data file (fastq file) for that sample and lane

Details about the required samplesheet structure, as well as an example samplesheet, can be seen [here](https://nf-co.re/sarek/3.2.3/docs/usage#input-sample-sheet-configurations).

<div class="alert alert-block alert-danger">
IMPORTANT: this section requires considerable user input, as the sample names and data files differ between projects. Care must be taken to enter the data correctly, case-sensitive with no extra spaces or missing information. At the end of this section you can view your created samples table to see if you have entered the information correctly. If not, you can re-run this whole section to recreate the samplesheet.
</div>

<div class="alert alert-block alert-danger">
FASTQ FILE NAMING CONVENTIONS: sequence data files (fastq files) are typically paired-end, with two fastq files per sample, Read 1 (R1) and Read 2 (R2). In addition, fastq files are typically gzipped (`filename.fastq.gz`; nfcore/sarek requires this) and the extension may be written as `fastq` or just `fq`. This results in a few different naming patterns. To create your samplesheet correctly, you need to enter the naming patterns that match your fastq files in the code cells below (i.e. the last few characters at the end of each file name that are the same in ALL Read 1 files and in ALL Read 2 files, respectively). Some examples of fastq naming conventions are `R1_001.fastq.gz`, `1_fq.gz`, `R2_fq.gz`, `2_fastq.gz`, etc.
</div>

For more information about fastq files, see [here](https://knowledge.illumina.com/software/general/software-general-reference_material-list/000002211).
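
To make the idea of a common naming pattern concrete, the sketch below takes two invented file names and strips the Read 1 pattern from the end of each, leaving the part that identifies the sample. The file names and the variables `demo_fq1_format`, `example_files` and `prefixes` are assumptions for illustration only.

```shell
# Hypothetical Read 1 pattern, kept in its own variable so it does not
# overwrite anything you set elsewhere in the Notebook.
demo_fq1_format=R1_001.fastq.gz

# Invented file names following one common naming convention.
example_files="tumour_S1_L001_R1_001.fastq.gz normal_S2_L001_R1_001.fastq.gz"

prefixes=""
for f in $example_files; do
    # ${f%"$demo_fq1_format"} removes the pattern from the end of the name.
    prefix=${f%"$demo_fq1_format"}
    echo "$f -> $prefix"
    prefixes="$prefixes$prefix "
done
```

If the pattern does not match the end of a file name, the name is printed unchanged, which is a quick sign that the pattern was entered incorrectly.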

<div class="alert alert-block alert-info">
First, input the common naming pattern of your Read 1 data files. You can see this from the `ls` command above.
</div>

In [6]:
fq1_format=R1_001.fastq.gz

<div class="alert alert-block alert-info">
Enter the Read 2 files naming pattern.
</div>

In [5]:
fq2_format=R2_001.fastq.gz

<div class="alert alert-block alert-success">
Now you should be able to generate the samplesheet columns for Read 1 and Read 2, by running the below code cell:
</div>

In [7]:
# "$PWD" adds full path
# -maxdepth 1 doesn't look in subdirs
# sort puts both lists in the same order, so Read 1 and Read 2 files line up
find "$PWD" -maxdepth 1 -type f -iname "*$fq1_format" | sort > forward_reads.txt
find "$PWD" -maxdepth 1 -type f -iname "*$fq2_format" | sort > reverse_reads.txt
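
Because the two lists are later combined line by line, each Read 1 file needs its Read 2 partner. The sketch below shows one way to check this on a throwaway directory of invented, empty files: it derives the expected Read 2 name from each Read 1 name and reports any that are missing. `demo_dir`, `demo_fq1`, `demo_fq2`, `missing` and the file names are assumptions for illustration.

```shell
demo_fq1=R1_001.fastq.gz
demo_fq2=R2_001.fastq.gz

# Throwaway directory: sampleB deliberately has no Read 2 file.
demo_dir=$(mktemp -d)
touch "$demo_dir/sampleA_$demo_fq1" "$demo_dir/sampleA_$demo_fq2" \
      "$demo_dir/sampleB_$demo_fq1"

missing=0
for r1 in "$demo_dir"/*"$demo_fq1"; do
    # Swap the Read 1 pattern at the end of the name for the Read 2 pattern.
    r2="${r1%"$demo_fq1"}$demo_fq2"
    if [ ! -f "$r2" ]; then
        echo "No Read 2 partner for: $(basename "$r1")"
        missing=$((missing + 1))
    fi
done
echo "Read 1 files without a partner: $missing"

# Remove the throwaway directory again.
rm -rf "$demo_dir"
```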

Now you will need to manually input your samples data (patient ID, sample ID, lane number) in the below code cells.

<div class="alert alert-block alert-danger">
NOTE: more than one sample may come from a single patient, and samples may span multiple lanes, so the patient ID and sample ID may need to be entered multiple times. You should know your sample/patient relationships, and you can find out about the lane information by contacting your sequencing facility, or by submitting a support request at the eResearch portal: https://eresearchqut.atlassian.net/servicedesk/customer/portals
</div>

<div class="alert alert-block alert-info">
Enter your patient IDs in the code cell below, between 'echo -e ' and ' > patient.txt'. Separate each patient ID with a comma, with no spaces. If you only have one sample per patient, your patient ID and sample ID may be the same.
</div>

In [9]:
echo -e IFGMEL012-MEL,IFGMEL012-MEL,IFGMEL017-MEL,IFGMEL017-MEL,IFGMEL020-MEL,IFGMEL020-PBMC,IFGMEL029-MEL,IFGMEL029-PBMC,IFGMEL030-MEL,IFGMEL030-PBMC,IFGMEL035-MEL,IFGMEL035-PBMC > patient.txt

<div class="alert alert-block alert-warning">
TIP: make sure you have entered the same number of patient IDs and sample IDs as you have Read 1 and Read 2 fastq files. To see how many patient IDs, etc., you need to enter, run the below code cell:
</div>

In [8]:
cat forward_reads.txt | wc -l

12
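
A missing comma is easy to overlook in a long list, and it silently merges two IDs into one. As a sketch of how to catch this, the snippet below counts the comma-separated entries in an invented list and compares the result with the number expected. `demo_ids` and `expected` are made-up values; in the Notebook you would use the contents of `patient.txt` and the file count printed above.

```shell
# Invented list: the comma between the last two IDs is missing on purpose.
demo_ids="P1-MEL,P1-PBMC,P2-MELP2-PBMC"
expected=4

# awk splits the line on commas; NF is the number of fields found.
n_ids=$(echo "$demo_ids" | awk -F, '{ print NF }')

if [ "$n_ids" -eq "$expected" ]; then
    echo "OK: $n_ids entries"
else
    echo "Mismatch: found $n_ids entries, expected $expected"
fi
```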


<div class="alert alert-block alert-info">
Enter your sample IDs in the code cell below. If the patient and sample IDs are the same, you can copy and paste the entire cell contents from the patient ID cell above.
</div>

In [19]:
echo -e IFGMEL012-MEL,IFGMEL012-MEL,IFGMEL017-MEL,IFGMEL017-MEL,IFGMEL020-MEL,IFGMEL020-PBMC,IFGMEL029-MEL,IFGMEL029-PBMC,IFGMEL030-MEL,IFGMEL030-PBMC,IFGMEL035-MEL,IFGMEL035-PBMC > sample.txt

<div class="alert alert-block alert-info">
Enter the lane number for each sample, separated by commas with no spaces. Again, some samples may be sequenced across multiple lanes. This information is provided by your sequencing facility. If none of your samples were sequenced across multiple lanes, you can just enter `lane_1,lane_2,lane_3,...` etc.
</div>

In [20]:
echo -e lane_1,lane_2,lane_1,lane_2,lane_1,lane_1,lane_1,lane_1,lane_1,lane_1,lane_1,lane_1 > lane.txt
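
If you would rather not type the lane labels by hand, they can be derived from the order of the sample IDs. The sketch below is one possible approach and rests on an assumption: each repeat of a sample ID corresponds to the next lane of that sample. The sample names and the variables `demo_samples` and `demo_lanes` are invented, and the result should still be checked against what your sequencing facility reports.

```shell
# Invented sample IDs, one per line; S1 appears twice (two lanes).
demo_samples=$(printf '%s\n' S1 S1 S2)

# For each ID, count how many times it has been seen so far and use that
# count as the lane number; paste then joins the lines with commas.
demo_lanes=$(echo "$demo_samples" \
    | awk '{ seen[$0]++; print "lane_" seen[$0] }' \
    | paste -sd, -)
echo "$demo_lanes"
```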

<div class="alert alert-block alert-success">
Run the below code cell to convert all your sample information files to columns.
</div>

In [21]:
# Substitutes the commas with newlines, giving one entry per line
tr ',' '\n' < patient.txt > patient_ID.txt
tr ',' '\n' < sample.txt > sample_ID.txt
tr ',' '\n' < lane.txt > lane_ID.txt

<div class="alert alert-block alert-warning">
Optional check: run the below code cell to confirm that the patient IDs are now listed one per line. The number of lines should match the number of Read 1 files counted earlier.
</div>

In [None]:
cat patient_ID.txt

<div class="alert alert-block alert-success">
Now combine all the columns into a single table.
</div>

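In [None]:
# Combine the five columns side by side, separated by commas
paste -d, patient_ID.txt sample_ID.txt lane_ID.txt forward_reads.txt reverse_reads.txt > files_cols.txt

# Add the header row that nf-core/sarek expects
echo "patient,sample,lane,fastq_1,fastq_2" | cat - files_cols.txt > samplesheet.csv

# Clean up the intermediate files only (not every .txt file in the directory)
rm -f patient.txt sample.txt lane.txt patient_ID.txt sample_ID.txt lane_ID.txt \
    forward_reads.txt reverse_reads.txt files_cols.txt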
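
To view and sanity-check a finished samplesheet, something like the sketch below can be used. It assumes a comma-separated file with the five columns listed at the start of this section (header `patient,sample,lane,fastq_1,fastq_2`, as in the nf-core/sarek usage documentation), and it runs on a small invented example (`demo_sheet`) so that it does not depend on your own files. It reports how many rows do not have exactly five fields.

```shell
demo_sheet=$(mktemp)

# Invented samplesheet: the last row is missing its lane column on purpose.
cat > "$demo_sheet" << 'EOF'
patient,sample,lane,fastq_1,fastq_2
P1,P1-MEL,lane_1,/data/P1-MEL_R1_001.fastq.gz,/data/P1-MEL_R2_001.fastq.gz
P1,P1-PBMC,/data/P1-PBMC_R1_001.fastq.gz,/data/P1-PBMC_R2_001.fastq.gz
EOF

# View the table.
cat "$demo_sheet"

# Count rows whose number of comma-separated fields is not 5.
bad_rows=$(awk -F, 'NF != 5 { n++ } END { print n + 0 }' "$demo_sheet")
echo "Rows with the wrong number of columns: $bad_rows"

# Remove the temporary example file.
rm -f "$demo_sheet"
```

A result of 0 means every row has the expected number of columns. It does not confirm that the IDs are paired with the right files, so the table should still be read through by eye.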