# Nextflow Workshop: Introduction to Nextflow, Managing Dependencies and Containers

In this notebook, we will explore how Nextflow enables the integration of containers and dependency management. The reference documentation is available at [Nextflow Containers](https://training.nextflow.io/latest/basic_training/containers/).


## Introduction to Nextflow

Nextflow is a workflow engine that integrates various languages and execution environments. Its integration with containers (Docker, Singularity, etc.) greatly improves the reproducibility of analyses.

In this workshop, you will learn to:
- Configure Nextflow to use containers.
- Define and manage tool dependencies.
- Execute pipelines using container images.


## Installation and Configuration of Nextflow

If you haven't installed Nextflow yet, you can do so using the following command in your terminal:

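```bash
curl -s https://get.nextflow.io | bash
```

This downloads a `nextflow` launcher into the current directory; move it to a directory on your `PATH` to run it from anywhere.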

In [None]:
# Check the Nextflow version (requires Nextflow to be installed and in the PATH)
!nextflow -version

# Nextflow Installation Dependencies

Nextflow itself only needs Java. Docker and Conda are needed for the container and environment-management parts of this workshop, so make sure all three are installed and properly configured:

1. **Java**
2. **Docker**
3. **Conda**


#### 1. Java

Nextflow is built on the Java Virtual Machine (JVM), so a Java runtime is required. Here are some key points:

- **Recommended Version:**  
  Recent Nextflow releases require Java 17 or later; older releases also ran on Java 11 (and, before that, Java 8). Check the installation page for the release you are using.

- **Installation:**  
  You can install OpenJDK via your package manager. For example, on Ubuntu:
  ```bash
  sudo apt-get update
  sudo apt-get install openjdk-17-jre
  ```
  After installation, verify by running:


In [None]:
%%bash
java -version
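The version check above can also be scripted. The following is a minimal sketch, not part of Nextflow or Java: `java_major` is a hypothetical helper that extracts the major version from a version string (including the legacy `1.8` numbering), and the minimum of 17 is an assumption that you should adjust to your Nextflow release.

```bash
# Hypothetical helper: print the major version for strings such as
# "1.8.0_292" (legacy numbering), "11.0.19", or "17.0.8".
java_major() {
  local v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: 1.8 means Java 8
  esac
  echo "${v%%[._-]*}"     # keep everything before the first '.', '_' or '-'
}

# Assumed minimum; adjust to the requirement of your Nextflow release.
MIN_JAVA=17

if command -v java >/dev/null 2>&1; then
  # `java -version` prints to stderr; the version is the first quoted string.
  found=$(java -version 2>&1 | awk -F'"' 'NR==1 {print $2}')
  if [ "$(java_major "$found")" -ge "$MIN_JAVA" ] 2>/dev/null; then
    echo "Java $found looks recent enough"
  else
    echo "Java $found may be too old (want >= $MIN_JAVA)"
  fi
else
  echo "java not found on PATH"
fi
```

The comparison is done on the major version only, which is enough to tell a Java 8 or 11 runtime from a newer one.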

#### 2. Docker

Docker is needed if you plan to run Nextflow processes inside containers, which enhances reproducibility and portability of your workflows.

- **Installation:**  
  Follow the official [Docker installation guide](https://docs.docker.com/engine/install/) for your operating system.
  
- **Permissions:**  
  Ensure your user is added to the `docker` group to avoid running Docker commands with `sudo`. For example:
  ```bash
  sudo usermod -aG docker $USER
  ```
**Note:** After running this command, log out and log back in (or run `newgrp docker` in the current shell) for the group change to take effect.

Test the Docker installation with:

In [None]:
%%bash
newgrp docker
docker run --rm hello-world


#### 3. Conda

Conda is a popular package and environment manager often used in bioinformatics workflows.

- **Usage in Nextflow:**  
  While not strictly required to run Nextflow, Conda is very useful for managing dependencies and installing bioinformatics tools. Nextflow can automatically create and use Conda environments if specified in your pipeline.

- **Installation:**  
  You can install Miniconda (a minimal version of Conda) by following the instructions on the [Miniconda website](https://docs.conda.io/en/latest/miniconda.html).

- **Verification:**  
  Once installed, verify by running:
  ```bash
  conda --version
  ```
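To check all prerequisites in one pass, a small loop over `command -v` can help. This is only a sketch; `check_tool` is a hypothetical helper name, and the tool list simply mirrors the dependencies described above plus Nextflow itself.

```bash
# Hypothetical helper: report whether a command is available on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok       $1"
  else
    echo "missing  $1"
    return 1
  fi
}

# Check every prerequisite; `|| true` keeps the loop going after a miss
# so that all tools are reported.
for tool in java docker conda nextflow; do
  check_tool "$tool" || true
done
```

A tool reported as `missing` is either not installed or not on the `PATH` of the shell that runs the notebook.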


# Introduction to SSH and SLURM

In high-performance computing (HPC) environments or clusters, it is essential to know how to remotely access systems and submit jobs to a queue. In this section, we will provide a brief introduction on using SSH to access remote systems and on SLURM, a widely used job management and scheduling system in clusters.

## Basic Use of SSH

SSH (Secure Shell) is a network protocol that allows you to securely access remote systems over an untrusted network. With SSH, you can log in to remote servers, transfer files, and execute commands on the remote machine.

**Basic Commands:**

- **Connect to a Remote Server:**
  ```bash
  ssh user@server_address
  ```


To access the server configured for the hackathon, you need the previously generated private key, which was shared through Drive. Use it as follows:

```bash
ssh -i hackathonv2 user@192.5.87.172
```
where `hackathonv2` is the private key file.


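Example, with the key stored in `~/.ssh/`:

```bash
ssh -i ~/.ssh/hackathonv2 andres@192.5.87.172
```

- **Copy Files Using SCP:**
  ```bash
  scp -i hackathonv2 <myfile> <myuser>@192.5.87.172:/home/<myuser>/
  ```

- **Transfer Entire Directories:**
  ```bash
  scp -r directory user@server_address:/destination/path
  ```

- **Connect to Jupyter Through an SSH Tunnel:**
  ```bash
  ssh -i ~/.ssh/hackathonv2 -L 9090:localhost:9001 shared@192.5.87.172
  ```
  The `-L` option forwards port 9090 on your machine to port 9001 on the server, so a Jupyter server listening on port 9001 there can be opened at `http://localhost:9090`.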
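Typing the key, user, and address every time is error-prone. One option is an entry in the SSH client configuration. The sketch below assumes a host alias `hackathon` (a hypothetical name), the key stored at `~/.ssh/hackathonv2`, and the `<myuser>` placeholder replaced with your own account before use.

```bash
# Location of the SSH client configuration (default: ~/.ssh).
SSH_DIR="${SSH_DIR:-$HOME/.ssh}"
KEY="$SSH_DIR/hackathonv2"

mkdir -p "$SSH_DIR" && chmod 700 "$SSH_DIR"

# SSH refuses private keys that other users can read.
if [ -f "$KEY" ]; then
  chmod 600 "$KEY"
fi

# Append a host entry (run once; edit <myuser> afterwards).
# LocalForward repeats the -L tunnel on every connection to this alias.
cat >> "$SSH_DIR/config" <<'EOF'
Host hackathon
    HostName 192.5.87.172
    User <myuser>
    IdentityFile ~/.ssh/hackathonv2
    LocalForward 9090 localhost:9001
EOF
chmod 600 "$SSH_DIR/config"
```

With this entry in place, `ssh hackathon` and `scp <myfile> hackathon:` use the key and address automatically.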

## Introduction to SLURM

SLURM (Simple Linux Utility for Resource Management) is a job scheduling and management system used in computing clusters. It allows users to submit, monitor, and control jobs on the cluster.

**Basic SLURM Concepts:**

- **sbatch:**  
  Submits a job script to the SLURM queue.
  
- **squeue:**  
  Displays the list of jobs in the queue or running.
  
- **scancel:**  
  Allows you to cancel a job in the queue or running.

**Example SLURM Script:**

```bash
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --time=00:01:00
#SBATCH --partition=local
#SBATCH --ntasks=1

# Execute the pipeline or desired command
nextflow run pipeline.nf -profile docker
```
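The three commands listed above are typically used together: submit, note the job ID, then monitor or cancel by ID. The sketch below assumes the script was saved as `run_pipeline.sh` (a hypothetical file name) and that `sbatch` prints its usual `Submitted batch job <id>` confirmation; `job_id_from_sbatch` is a helper written for this example, not a SLURM command.

```bash
# Hypothetical helper: read sbatch's confirmation line from stdin,
# e.g. "Submitted batch job 12345", and print only the job ID.
job_id_from_sbatch() {
  awk '/^Submitted batch job/ {print $4}'
}

# Usage on a cluster (commented out because it needs SLURM):
#   JOB_ID=$(sbatch run_pipeline.sh | job_id_from_sbatch)
#   squeue -j "$JOB_ID"     # show the state of this job only
#   scancel "$JOB_ID"       # cancel it if something went wrong
```

Keeping the job ID in a variable avoids copying it by hand from the terminal output.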


# Introduction to Docker Management and Manipulation

Docker is a powerful platform that allows you to package applications and their dependencies into a single container. Containers are lightweight, portable, and reproducible environments. This makes Docker especially useful in bioinformatics pipelines, where reproducibility and dependency management are critical.


## Basic Docker Commands

Here are some essential Docker commands you should know:

- **docker build:**  
  Builds a Docker image from a Dockerfile.  
  Example:  
  ```bash
  docker build -t my_image .
  ```

- **docker run:**  
  Runs a container based on a Docker image.  
  ```bash
  docker run --rm my_image
  ```

- **docker ps:**  
  Lists currently running containers.  
  ```bash
  docker ps
  ```

- **docker stop:**  
  Stops a running container.  
  ```bash
  docker stop <container_id>
  ```

## Creating a Dockerfile for a Bioinformatics Pipeline

A Dockerfile is a text file that contains a series of instructions on how to build a Docker image. For a bioinformatics pipeline, your Dockerfile might install necessary tools (e.g., samtools, bwa, etc.) and set up the environment required for your analysis.

Below is an example Dockerfile that installs samtools for a bioinformatics pipeline.


In [None]:
# Use an official Ubuntu base image
FROM ubuntu:20.04

# Prevent interactive dialogue during installation
ENV DEBIAN_FRONTEND=noninteractive

# Update the package repository and install necessary packages, including samtools
RUN apt-get update && \
    apt-get install -y samtools && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set the working directory inside the container
WORKDIR /data

# Set the entrypoint (optional) to allow easy use of the image
ENTRYPOINT ["samtools"]


## Building and Running Your Docker Image

1. **Building the Docker Image:**  
   Save the above Dockerfile in a directory (e.g., `docker-samtools/`). Then build the image by running:
   ```bash
   docker build -t bioinfo-samtools docker-samtools/
   ```

2. **Running the Docker Image:**  
   Once built, you can run a container based on the image. For example, to display the version of samtools installed:
   ```bash
   docker run --rm bioinfo-samtools --version
   ```
   The `--rm` flag tells Docker to remove the container once it stops. Because the image's entrypoint is `samtools`, the arguments after the image name are passed directly to samtools.
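To work on real files, the container needs access to a host directory. The following sketch wraps `docker run` in a shell function; `samtools_docker` is a hypothetical name, and it assumes the `bioinfo-samtools` image built above, whose working directory is `/data`.

```bash
# Hypothetical wrapper: run samtools from the bioinfo-samtools image on
# files in the current directory.
#   -v "$PWD":/data   mounts the current directory at the image's WORKDIR
#   -u uid:gid        makes output files belong to you rather than root
samtools_docker() {
  docker run --rm \
    -u "$(id -u):$(id -g)" \
    -v "$PWD":/data \
    bioinfo-samtools "$@"
}

# Example (requires Docker and the image built above):
#   samtools_docker view -H input.sam
```

File arguments are resolved inside the container, so they must be given relative to the current directory.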

# Pulling a Container in Nextflow

Nextflow makes it easy to work with containers. When a process declares an image with the `container` directive and a container engine is enabled, the image is looked up locally first. If it isn't there, it is pulled from a container registry (such as Docker Hub or Quay.io).


## How Nextflow Pulls a Container

When you run a pipeline with a process that specifies a container, Nextflow does the following:

- **Local Check:** It checks if the image exists on the host.
- **Automatic Pull:** If the image is missing, Nextflow automatically downloads (pulls) it from the default container registry.
- **Caching:** Once pulled, the image is cached locally to speed up subsequent executions.


## Using Containers in Nextflow

Nextflow allows direct integration with containers. To use Docker or Singularity, specify the container image in the process with the `container` directive and enable the engine in the configuration or on the command line. Below is a basic example:

```nextflow
process sayHello {
    container 'ubuntu:latest'
    
    script:
    """
    echo "Hello from an Ubuntu container"
    """
}

workflow {
    sayHello()
}
```

When you run this pipeline with Docker enabled (for example, `nextflow run hello.nf -with-docker`), the `ubuntu:latest` image is pulled automatically if it isn't already available on your system.

## Advanced Options and Considerations

Nextflow provides additional options to customize container behavior:

- **Custom Registries:**  
  You can configure Nextflow to pull images from alternative registries if needed.

- **Image Pull Policies:**  
  By default, Nextflow pulls the image only if it is not found locally. You can adjust this behavior via configuration settings (e.g., forcing a pull every time).

- **Singularity Support:**  
  Similar principles apply when using Singularity. Nextflow will pull the required Singularity image if it's not present locally.
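The options above map to settings in the `docker` and `singularity` configuration scopes. The sketch below writes them to a separate file; the file name `containers.config`, the registry host, and the cache path are placeholders chosen for this example. Forcing a pull through `runOptions` relies on the `--pull` flag of `docker run`, which is an assumption about your Docker version rather than a dedicated Nextflow setting.

```bash
# Write a small, separate config file (hypothetical name: containers.config).
cat > containers.config <<'EOF'
docker {
    enabled    = true
    // Passed to `docker run`; --pull=always asks Docker to refresh the image.
    runOptions = '--pull=always'
    // Alternative registry, prepended to image names that lack one.
    // The host below is a placeholder, so the line is left commented out.
    // registry = 'registry.example.org'
}

// Singularity alternative (enable only one engine at a time):
// singularity {
//     enabled    = true
//     autoMounts = true
//     cacheDir   = '/path/to/singularity-cache'   // placeholder path
// }
EOF

# Apply it on top of the default configuration:
#   nextflow run hello.nf -c containers.config
```

Keeping these settings in their own file makes it easy to try them without touching an existing `nextflow.config`.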


## Running a Pipeline from the Notebook

To execute Nextflow commands or scripts from the notebook, you can use code cells with `!` (shell command) or the `%%bash` cell magic.

For example, if you have the `hello.nf` file in the same directory, you can run:

```bash
!nextflow run hello.nf
```


In [None]:
!nextflow run hello.nf

# Nextflow Profiles and Enabling Docker

In Nextflow, *profiles* allow you to define configuration settings for different environments or use cases (for example, development, production, or containerized execution). This is especially useful for conditionally enabling Docker and customizing other runtime parameters.


## What is a Nextflow Profile?

A Nextflow profile is a set of configuration settings that are applied when you run your pipeline with that specific profile. In your `nextflow.config` file, you can define sections for a `docker` profile where Docker is enabled, along with other settings such as image pull policies or resource configurations.


```nextflow
// nextflow.config

// Global configuration (default values)
docker {
    enabled = false   // Default: Docker is disabled
}

// Profile definitions
profiles {
    docker {
        docker.enabled = true   // Enable Docker when using this profile
        // Additional Docker-specific settings:
        // For example, specify custom run options (adjust as needed)
        docker.runOptions = '-u $(id -u):$(id -g)'
    }
    
    // You can define other profiles, for example a 'standard' or 'local' profile
    standard {
        // Standard environment configuration without containers
        docker.enabled = false
    }
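}
```

Select a profile at run time with the `-profile` option, for example `nextflow run hello.nf -profile docker`.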


# Challenge: SAM to Sorted & Indexed BAM Pipeline

In this exercise, you will design a Nextflow pipeline that performs the following steps using Docker containers with samtools:
1. Convert an input SAM file to an unsorted BAM file.
2. Sort the BAM file.
3. Index the sorted BAM file.

This pipeline will help you practice process chaining, container usage, and reproducibility in bioinformatics workflows.


## Step 1: Prepare the Input SAM File

For this challenge, create a minimal SAM file. You can use the following example content and save it as `input.sam`:

```sam
@HD	VN:1.0	SO:unsorted
@SQ	SN:chr1	LN:1000
read1	0	chr1	100	60	50M	*	0	0	ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC	*
```
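SAM fields must be separated by tab characters, which are easily turned into spaces when copying from a notebook or web page. As a precaution, the same file can be generated with `printf`, where each `\t` is an explicit tab. The content is identical to the example above.

```bash
# Write input.sam with explicit tab characters (\t) between fields.
printf '@HD\tVN:1.0\tSO:unsorted\n'  > input.sam
printf '@SQ\tSN:chr1\tLN:1000\n'    >> input.sam
printf 'read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t%s\t*\n' \
  'ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC' >> input.sam

# Sanity check: print the field count and sequence length of each
# alignment line (SAM requires 11 mandatory fields, and the sequence
# length must match the 50M CIGAR).
awk -F'\t' '!/^@/ { print NF " fields, SEQ length " length($10) }' input.sam
```

If the field count is lower than expected, the tabs were probably lost and samtools will reject the file.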


## Step 2: Create the Nextflow Pipeline Script

Create a Nextflow script named `sam_to_bam.nf` with three processes:
1. **convertSamToBam:** Converts the SAM file to an unsorted BAM file.
2. **sortBam:** Sorts the unsorted BAM file.
3. **indexBam:** Indexes the sorted BAM file.
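If you want a starting point, the sketch below writes one possible version of `sam_to_bam.nf`. It is not the only valid solution. It assumes the `bioinfo-samtools` image built earlier, an input named `input.sam`, and a `results` output directory (all of which you can change). Note that this image sets `ENTRYPOINT ["samtools"]`, which may interfere with how Nextflow launches task scripts; if tasks fail at startup, rebuild the image without the `ENTRYPOINT` line, or check whether your Nextflow release documents the `NXF_CONTAINER_ENTRYPOINT_OVERRIDE` environment variable.

```bash
# Write a skeleton pipeline; the quoted 'EOF' keeps ${...} literal.
cat > sam_to_bam.nf <<'EOF'
// Assumptions: image name, input file name, and output directory.
params.input = 'input.sam'

process convertSamToBam {
    container 'bioinfo-samtools'

    input:
    path sam

    output:
    path 'unsorted.bam'

    script:
    """
    samtools view -b -o unsorted.bam ${sam}
    """
}

process sortBam {
    container 'bioinfo-samtools'
    publishDir 'results', mode: 'copy'

    input:
    path bam

    output:
    path 'sorted.bam'

    script:
    """
    samtools sort -o sorted.bam ${bam}
    """
}

process indexBam {
    container 'bioinfo-samtools'
    publishDir 'results', mode: 'copy'

    input:
    path bam

    output:
    path "${bam}.bai"

    script:
    """
    samtools index ${bam}
    """
}

workflow {
    sam_ch = Channel.fromPath(params.input, checkIfExists: true)
    convertSamToBam(sam_ch)
    sortBam(convertSamToBam.out)
    indexBam(sortBam.out)
}
EOF
```

Each process consumes the output channel of the previous one, which is what chains the three steps into a single pipeline.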


## Step 3: Configure Nextflow to Use Docker

Create a file named `nextflow.config` with the following content to enable Docker:

```groovy
docker.enabled = true

profiles {
    docker {
        docker.enabled = true
    }
}
```


To run the pipeline, open a terminal (or a cell in Jupyter using bash) and execute:

In [None]:
%%bash
newgrp docker
nextflow run sam_to_bam.nf -profile docker

## Conclusions and Next Steps

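In this notebook, we covered:
- How to install Nextflow and verify its dependencies (Java, Docker, Conda).
- How to reach a remote server with SSH and submit jobs with SLURM.
- How to build and run Docker images, and how to define Nextflow processes that use containers.
- How to enable Docker through profiles and execute pipelines from a notebook.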

For more details, review the [official Nextflow Containers documentation](https://training.nextflow.io/latest/basic_training/containers/) and experiment with different images and configurations based on your analysis needs.