# A short introduction to containerized software

After using nf-core pipelines to answer bioinformatics questions, we now turn to the technology that these pipelines are built on.

Today, we will focus on containerization, namely via Docker. 



1. Check if Docker is installed.

In [None]:
!docker info
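`docker info` fails in the same way whether the client is missing or the daemon is simply not running. A minimal sketch, assuming a POSIX shell (in the notebook it would go into a `%%bash` cell), that tells the two cases apart; the helper name `have_cmd` is made up for this illustration:

```shell
#!/bin/sh
# have_cmd is a hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if ! have_cmd docker; then
    echo "docker client not found on PATH"
elif docker info >/dev/null 2>&1; then
    echo "docker client found and daemon reachable"
else
    echo "docker client found, but the daemon did not respond"
fi
```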

### What is a container?

A container image is a lightweight, packaged environment that contains all the software needed for execution, such as libraries, dependencies and configuration. A container is a running instance of such an image.

### Why do we use containers?

Docker images can be versioned, and each version corresponds to a specific set of versioned software, so research conducted on containerized infrastructure is highly reusable and reproducible. Because containers should run "out of the box", they also speed up development by reducing the effort spent on software installation.

### What is a Docker image?

The container image is the template from which a container platform such as Docker starts a container. Images are usually built from a Dockerfile, which contains the build instructions for the image, but they can also be created from an existing container. Images can be stored in an image registry such as Docker Hub, which makes them easy to distribute worldwide.

### Let's run our first Docker image:

### Log in to Docker

In [None]:
!docker login

### Run your first Docker container

In [None]:
!docker run hello-world

### Find the container ID

In [None]:
!docker ps -a

### Delete the container again and prove that it is deleted

In [None]:
!docker rm e4c5d9809fb1

In [None]:
!docker ps -a -f "id=e4c5d9809fb1"
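Copying the ID by hand works, but the ID is different on every machine. A sketch that looks the ID up instead, assuming the containers to remove were all created from the `hello-world` image and have already exited; the function name `remove_containers_of` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: remove every stopped container that was created from
# the given image, then report how many containers of that image are left.
remove_containers_of() {
    image="$1"
    # -q prints only IDs; the "ancestor" filter selects containers by image.
    ids=$(docker ps -a -q --filter "ancestor=$image")
    if [ -n "$ids" ]; then
        # Intentionally unquoted so that each ID becomes its own argument.
        docker rm $ids >/dev/null
    fi
    remaining=$(docker ps -a -q --filter "ancestor=$image" | wc -l)
    echo "containers left for $image: $((remaining))"
}

# Only call docker when the client is actually available.
if command -v docker >/dev/null 2>&1; then
    remove_containers_of hello-world
fi
```

An empty listing after the removal is the same proof as the filtered `docker ps -a` call above, just without a hard-coded ID.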

### FastQC is a very useful tool, as you learned last week. Let's try to run it from the command line

Link to the software: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Please describe the steps you took to download and run the software for the example fastq file from last week below:

1. !brew install fastqc (installation using Homebrew on macOS)
2. !mkdir -p fastqc_results (FastQC does not create the output directory itself)
3. !fastqc -o fastqc_results -t 4 SRX19144488_SRR23195511_1.fastq.gz (run FastQC with 4 threads)

### Very well, now let's try to make use of its Docker container

1. Create a container holding FastQC using Seqera Containers (https://seqera.io/containers/)
2. Use the container to generate a FastQC HTML report for the example fastq file

In [None]:
# pull the container
!docker pull community.wave.seqera.io/library/fastqc:0.12.1--af7a5314d5015c29

In [None]:
# run the container and save the results to a "fastqc_results" directory
# (FastQC does not create the output directory, so make sure it exists first)
!mkdir -p fastqc_results
!docker run --rm -v "$(pwd)":/data community.wave.seqera.io/library/fastqc:0.12.1--af7a5314d5015c29 fastqc "/data/SRX19144488_SRR23195511_1.fastq.gz" -o "/data/fastqc_results"
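The `-v "$(pwd)":/data` option bind-mounts the current host directory at `/data` inside the container, so every path passed to `fastqc` has to be written the way the container sees it. A small sketch of that translation; `to_container_path` and the example host directory `/home/me/course` are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: rewrite a host path below $host_root into the path the
# container sees when $host_root is mounted at $mount_point.
to_container_path() {
    host_root="$1"; mount_point="$2"; host_path="$3"
    case "$host_path" in
        "$host_root"/*) echo "$mount_point/${host_path#"$host_root"/}" ;;
        *) echo "not below $host_root: $host_path" >&2; return 1 ;;
    esac
}

# Example with an assumed host directory /home/me/course:
to_container_path /home/me/course /data /home/me/course/SRX19144488_SRR23195511_1.fastq.gz
# prints /data/SRX19144488_SRR23195511_1.fastq.gz
```

Files outside the mounted directory are not visible to the container, which is why the helper refuses them.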

### Now that you know how to use a Docker container, which approach was easier, and which approach will be easier in the future?

It depends. With a package manager such as brew, installation is often (but not always) quite easy, and the tools can be run without mounting directories or caring about the directory structure inside the container. However, there are many situations in which containers are the better choice, for example when:
- The software version needs to be documented
- The software requires a special set of dependencies
- The software is not directly available for the current OS
- The software needs to be pre-built
- Many different tools are needed
- [...]

### What would you say, which approach is more reproducible?

In general, the containerized approach, because the image version pins the exact software version together with its dependencies.

### Compare the file to last week's FastQC results: are they identical?
### Is the FastQC version identical?
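One way to check the version without opening the HTML report: FastQC also writes a zip archive next to the HTML, and the `fastqc_data.txt` inside it starts with a `##FastQC` line followed by a tab and the version. A sketch, assuming that file has already been extracted from both archives; the function name and the paths in the usage comment are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the version recorded in a fastqc_data.txt file.
# Assumes the first line has the form "##FastQC<TAB><version>".
fastqc_version_of() {
    head -n 1 "$1" | cut -f 2
}

# Usage, with assumed paths to the two extracted files:
#   [ "$(fastqc_version_of last_week/fastqc_data.txt)" = \
#     "$(fastqc_version_of fastqc_results/fastqc_data.txt)" ] && echo "same version"
```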

## Dockerfiles

We have now used Docker containers and images directly to support our research.

Let's create our own toy Dockerfile including the "cowsay" tool (https://en.wikipedia.org/wiki/Cowsay)

Hints:
1. Docker images are usually Linux-based. On a Debian or Ubuntu base image, you need the apt-get command to install "cowsay"

### Explain the RUN and ENV lines you added to the file

RUN apt-get update

This runs apt-get in update mode, which refreshes the package index from the configured software repositories. It does not upgrade packages that are already installed.

RUN apt-get install -y curl cowsay

This runs apt-get in install mode and installs the packages curl and cowsay. The -y flag answers the confirmation prompt automatically, which is needed because the build runs non-interactively.

ENV PATH="/usr/games:$PATH"

This prepends /usr/games to the existing $PATH environment variable. When you run a command, the shell searches the directories listed in $PATH in order and executes the first matching executable it finds. cowsay is installed into /usr/games by default, and that directory is not on the default $PATH of the image. Adding it makes binaries located there (such as cowsay) callable by name. Without this line, cowsay could only be run via its full path, /usr/games/cowsay.
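The lookup behaviour can be tried out without Docker. A minimal sketch, assuming a POSIX shell; the command name `moo-demo` is made up so that it does not clash with a real program:

```shell
#!/bin/sh
# Build a scratch directory holding a tiny executable named "moo-demo".
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho moo\n' > "$demo_dir/moo-demo"
chmod +x "$demo_dir/moo-demo"

# Not on PATH yet: lookup by bare name fails, the full path still works.
command -v moo-demo >/dev/null 2>&1 || echo "moo-demo: not found by name"
"$demo_dir/moo-demo"

# Prepend the directory, as ENV PATH="/usr/games:$PATH" does in the image.
PATH="$demo_dir:$PATH"
moo-demo
```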

In [None]:
!docker build -t cowsay:latest -f cowsay .
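The build command above expects a Dockerfile named `cowsay` in the current directory. A minimal sketch of what such a file could contain, assembled from the three lines explained above. The `ubuntu:22.04` base image is an assumption, since the `FROM` line of the original file is not shown here. To avoid overwriting an existing file, the sketch writes into a scratch directory:

```shell
#!/bin/sh
# Write the example Dockerfile into a scratch directory.
build_dir=$(mktemp -d)
cat > "$build_dir/cowsay" <<'EOF'
# Assumed base image; any Debian or Ubuntu image with apt-get should work.
FROM ubuntu:22.04

# Refresh the package index, then install curl and cowsay.
RUN apt-get update
RUN apt-get install -y curl cowsay

# cowsay is installed into /usr/games, which is not on PATH by default.
ENV PATH="/usr/games:$PATH"
EOF

echo "Dockerfile written to $build_dir/cowsay"
# To build it: docker build -t cowsay:latest -f "$build_dir/cowsay" "$build_dir"
```

Many Dockerfiles merge the two apt-get lines into a single RUN so that a cached package index is never combined with a newer install line; the sketch keeps them separate to match the explanation above.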

In [None]:
# make sure that the image has been built
!docker images

In [None]:
# run a container from the image (without a message, cowsay prints an empty speech bubble)
!docker run --rm cowsay:latest cowsay "Hello from Docker"

## Let's do some bioinformatics: create a new Dockerfile that holds the salmon tool used in RNA-seq

To do so, use "curl" in your new Dockerfile to get salmon from https://github.com/COMBINE-lab/salmon/releases/download/v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz
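A possible shape for such a Dockerfile, written into a scratch directory so that no existing file is overwritten. Several details are assumptions and not taken from the course material: the `ubuntu:22.04` base image, the install location `/opt/salmon`, and that the archive unpacks into a single top-level directory containing `bin/salmon`:

```shell
#!/bin/sh
# Write the example Dockerfile into a scratch directory.
salmon_dir=$(mktemp -d)
cat > "$salmon_dir/salmon" <<'EOF'
# Assumed base image.
FROM ubuntu:22.04

# curl for the download, ca-certificates so that HTTPS can be verified.
RUN apt-get update && apt-get install -y curl ca-certificates

# -f fails on HTTP errors, -L follows the redirect GitHub uses for releases.
# --strip-components=1 drops the archive's top-level directory (assumed layout).
RUN mkdir -p /opt/salmon \
    && curl -fsSL -o /tmp/salmon.tar.gz https://github.com/COMBINE-lab/salmon/releases/download/v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz \
    && tar -xzf /tmp/salmon.tar.gz -C /opt/salmon --strip-components=1 \
    && rm /tmp/salmon.tar.gz

# Assumed location of the salmon binary after unpacking.
ENV PATH="/opt/salmon/bin:$PATH"
EOF

echo "Dockerfile written to $salmon_dir/salmon"
```

The archive contains an x86_64 binary, which is why the build and run commands in this section pass `--platform linux/amd64`.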

In [None]:
# use the file "salmon_docker" in this directory to build a new docker image

In [None]:
# build the image
!docker build --platform linux/amd64 -t salmon:latest -f salmon .

In [None]:
# run a container from the image to print the salmon version
!docker run --platform linux/amd64 --rm salmon:latest salmon --version

## Do you think bioinformaticians have to create a Docker image every time they want to run a tool?

No, not necessarily. If the tool is only run once, the effort may not be worth it. However, if the tool is used often and the work needs to be reproducible, a Docker image should be considered. In many cases a ready-made image already exists.

Find the salmon docker image online and run it on your computer.

In [None]:
!docker pull combinelab/salmon

In [None]:
!docker run --platform linux/amd64 combinelab/salmon salmon --version

## What is https://biocontainers.pro/ ?

BioContainers is a community-driven project that provides the infrastructure and basic guidelines to create, manage and distribute bioinformatics packages (e.g. Conda) and containers (e.g. Docker, Singularity). BioContainers is based on the popular frameworks Conda, Docker and Singularity.

https://biocontainers-edu.readthedocs.io/en/latest/what_is_biocontainers.html

## Are there other ways to create Docker (or Apptainer) images?
Besides writing a Dockerfile, it is also possible to build a new image by starting a container from an existing image, installing software via the command line, and then creating an image from that container (docker commit). However, this is not really reproducible and shouldn't be the default process.
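For illustration only, the container-to-image route can be sketched with `docker commit`. The function name, the container name `scratch-build` and the arguments in the usage comment are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: run an install command in a fresh container, snapshot
# the stopped container as a new image, then remove the container.
image_from_container() {
    base_image="$1"; new_image="$2"; install_cmd="$3"
    docker run --name scratch-build "$base_image" sh -c "$install_cmd" &&
        docker commit scratch-build "$new_image" &&
        docker rm scratch-build
}

# Usage (needs a running Docker daemon):
#   image_from_container ubuntu:22.04 cowsay:committed \
#       "apt-get update && apt-get install -y cowsay"
```

The install command is not recorded as a build step in the resulting image, which is the reason this route is hard to reproduce.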

Nowadays there are also higher-level frameworks in which you define a list of software to install; the framework then generates a Dockerfile from that list and builds the image from it.

Example:
- You define a list of software that is available on Conda
- The framework takes an image with a preinstalled Conda environment and simply runs conda install [...] on top of it

Wave is one example of a framework that provides such features.


## What is https://seqera.io/containers/ ?

It's a wrapper built on top of the Wave technology that builds images on demand from a configuration file. If a matching image already exists, it is reused from a cache.