# Container Basics for Research

In this workshop, we will start with a simple application running in a Docker container. We will take a closer look at the key components and environments that are needed. 

There are several container technologies available, but Docker is the most popular one. We will focus on Docker containers in this workshop.

We will also explore different ways of running containers on AWS using different services.

Why containers for research:
- Repeatable and shareable tools and applications
- Portable: run in different environments (develop on a laptop, test on-prem, run at large scale in the cloud)
- Stackable: run different applications in a pipeline, each with a different OS if needed


In [None]:
# Run this once per kernel. bioinfokit is only used in the last step (analyzing fastq data); skip it if you don't run that step.
!pip install bioinfokit 


In [None]:
import boto3
import botocore
import json
import time
import os
import base64
import docker
import pandas as pd

import project_path # path to helper methods
from lib import workshop
from botocore.exceptions import ClientError

# create a bucket for the workshop to store output files. 

session = boto3.session.Session()
bucket = workshop.create_bucket_name('container-ws-')
session.resource('s3').create_bucket(Bucket=bucket)

print(bucket)

First, let's create a helper magic that writes a cell's contents to a file from the notebook, filling in any `{name}` placeholders from notebook variables.

In [None]:
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writetemplate(line, cell):
    with open(line, 'w+') as f:
        f.write(cell.format(**globals()))
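
Because the magic passes the cell through `str.format(**globals())`, literal braces in a template have to be doubled. The following is a minimal sketch of that behaviour; `render_template` and the sample values are hypothetical and only stand in for the magic and the notebook's variables.

```python
def render_template(template: str, variables: dict) -> str:
    """Substitute {name} placeholders the same way the writetemplate magic does."""
    return template.format(**variables)


# {bucket} is a placeholder; ${{PATH}} is an escaped literal that becomes ${PATH}.
template = 'export PATH="/opt/tool/bin:${{PATH}}"\naws s3 ls s3://{bucket}\n'
rendered = render_template(template, {"bucket": "example-bucket"})
print(rendered)
```

A single-brace shell expression such as `${PATH}` would raise a `KeyError` unless a variable named `PATH` exists, which is why templates in this notebook write `${{PATH}}`.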

## Running an application in a container locally.

This SageMaker Jupyter notebook runs on an EC2 instance with the Docker daemon installed, so we can build and test Docker containers on the same instance.

We are going to build a simple web server container that says "Hello World!".

### Let's start with the Dockerfile
Think of the Dockerfile as the automation script for the steps you would usually perform on a Linux VM; it just runs inside a container. You start with a base image (in this case ubuntu:18.04), then you install, configure, compile, or build the software you need.


In [None]:
%%writetemplate Dockerfile
FROM ubuntu:18.04
  
# Install dependencies and apache web server
RUN apt-get update && apt-get -y install apache2

# Create the index html
RUN echo 'Hello World!' > /var/www/html/index.html

# Configure apache 
RUN echo '. /etc/apache2/envvars' > /root/run_apache.sh && \
 echo 'mkdir -p /var/run/apache2' >> /root/run_apache.sh && \
 echo 'mkdir -p /var/lock/apache2' >> /root/run_apache.sh && \
 echo '/usr/sbin/apache2 -D FOREGROUND' >> /root/run_apache.sh && \
 chmod 755 /root/run_apache.sh

EXPOSE 80

CMD /root/run_apache.sh

### Now let's build the container. 

The server that runs this SageMaker Jupyter notebook happens to have the Docker runtime installed.
`docker build` uses the "Dockerfile" in the current directory, and "-t" tags the resulting image. The image is kept in the local Docker image registry.

We will later learn how to push the image to an external image registry (AWS ECR, for example).

In [None]:
!docker build -t simple_server .

### Run the container 

Run the container locally. We bind container port 80 to localhost port 8080 ("-d" runs the container detached, in the background).

We then use curl to access the web server on port 8080.


In [None]:
c_id = !docker run  -d -p 8080:80 simple_server
    
!curl http://localhost:8080
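
A detached container may need a moment before the web server accepts connections, so a request issued immediately after `docker run -d` can fail. One way to handle this, sketched here with only the standard library, is to poll the URL until it responds. The helper name `wait_for_http` and its defaults are assumptions for illustration, not part of the workshop code.

```python
import http.client
import time
import urllib.request


def wait_for_http(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll url until it answers with HTTP 200 or the timeout expires."""
    # An empty ProxyHandler bypasses any configured proxy, which suits local checks.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # not ready yet (urllib's URLError is a subclass of OSError)
        time.sleep(interval)
    return False


# Hypothetical usage after `docker run -d -p 8080:80 simple_server`:
# wait_for_http("http://localhost:8080")
```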


In [None]:
        
docker_client = docker.from_env()
simple_server_container = docker_client.containers.get(c_id[0])


def list_all_running_containers():
    docker_client = docker.from_env()
    container_list = docker_client.containers.list()
    for c in container_list:
        print(c.attrs['Id'], c.attrs['State']['Status'])
    return container_list

running_containers = list_all_running_containers()

# Now stop the running container
simple_server_container.stop()
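
The same container could also be started through the Docker SDK for Python instead of the `docker run` command line. The sketch below shows how a CLI-style `-p HOST:CONTAINER` flag corresponds to the `ports` mapping accepted by `containers.run`; the helper `port_flag_to_sdk` is hypothetical, and the SDK call is left commented out because it needs a running Docker daemon.

```python
def port_flag_to_sdk(port_flag: str, protocol: str = "tcp") -> dict:
    """Convert a CLI-style 'HOST:CONTAINER' port flag into the SDK's ports mapping."""
    host_port, container_port = port_flag.split(":")
    # The SDK keys the mapping by container port/protocol; the value is the host port.
    return {f"{int(container_port)}/{protocol}": int(host_port)}


ports = port_flag_to_sdk("8080:80")
print(ports)

# With a Docker daemon available, the mapping could be used like this:
# import docker
# container = docker.from_env().containers.run("simple_server", detach=True, ports=ports)
# container.stop()
```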

## Let's run some real workload

We are going to use fasterq-dump (https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) from the NCBI SRA (Sequence Read Archive) Toolkit (https://github.com/ncbi/sra-tools) to extract fastq files from SRA accessions.

The command takes a package name (an SRA accession) as an argument:
```
$ fasterq-dump SRR000001
```

The base image is provided by https://hub.docker.com/r/pegi3s/sratoolkit/

The workflow of the container:
1. Upon start, the container runs a script, "sratest.sh".
2. sratest.sh will "prefetch" the data package, whose name is passed via an environment variable.
3. sratest.sh then runs fasterq-dump on the data package.
4. sratest.sh will then upload the result to s3://{bucket}.

The output of fasterq-dump will be stored in s3://{bucket}/data/sra-toolkit/fasterq/{PACKAGE_NAME}


In [None]:
PACKAGE_NAME='SRR000002'

# this is where the output will be stored
sra_prefix = 'data/sra-toolkit/fasterq'
sra_output = f"s3://{bucket}/{sra_prefix}"

# To run the Docker container locally, the AWS CLI inside the container needs access credentials.
# Pass the current keys and session token to the container via environment variables.
credentials = boto3.session.Session().get_credentials()
current_credentials = credentials.get_frozen_credentials()    

# Please don't print those out:  
access_key=current_credentials.access_key
secret_key=current_credentials.secret_key
token=current_credentials.token
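
Since the container receives its settings and credentials through environment variables, it can help to gather them in one place and to have a version that is safe to display. The sketch below is one possible way to do that; the helper names and the placeholder values are assumptions for illustration, and only the variable names match what the container in this notebook expects.

```python
import shlex


def build_container_env(package_name, sra_output, access_key, secret_key, token):
    """Collect the environment variables the container expects into one dict."""
    return {
        "PACKAGE_NAME": package_name,
        "SRA_OUTPUT": sra_output,
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "AWS_SESSION_TOKEN": token,
    }


def env_to_flags(env: dict) -> str:
    """Render the dict as shell-quoted --env flags for docker run."""
    return " ".join("--env " + shlex.quote(f"{key}={value}") for key, value in env.items())


def masked(env: dict) -> dict:
    """Return a copy that is safe to print: AWS credential values are hidden."""
    return {key: ("****" if key.startswith("AWS_") else value) for key, value in env.items()}


# Placeholder values only; real credentials should never be printed.
example_env = build_container_env(
    "SRR000002", "s3://example-bucket/data", "EXAMPLEKEY", "example-secret", "example-token"
)
print(masked(example_env))
```

Each variable appears exactly once in the dict, which also avoids passing the same `--env` flag twice by accident.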


In [None]:
%%writetemplate sratest.sh
#!/bin/bash
set -x

# this is where ncbi/sra-toolkit is installed inside the pegi3s/sratoolkit image
#export PATH="/opt/sratoolkit.2.9.6-ubuntu64/bin:${{PATH}}"
prefetch $PACKAGE_NAME --output-directory /tmp
fasterq-dump $PACKAGE_NAME -e 18
aws s3 sync . $SRA_OUTPUT/$PACKAGE_NAME
    

In [None]:
%%writetemplate Dockerfile.pegi3s
FROM pegi3s/sratoolkit

ENV TZ=America/New_York
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

RUN apt-get update --fix-missing && apt-get install -y unzip python3 awscli
# ENV persists across layers; a variable exported in a RUN step is lost when that step ends
ENV PATH=/usr/local/bin/aws/bin:$PATH
ADD sratest.sh /usr/local/bin/sratest.sh
RUN chmod +x /usr/local/bin/sratest.sh
WORKDIR /tmp
ENTRYPOINT ["/usr/local/bin/sratest.sh"]


In [None]:
!docker build -t myncbi/sra-tools -f Dockerfile.pegi3s .

In [None]:
PACKAGE_NAME='SRR000002'

# only run this when you need to clean up the registry and storage
#!docker system prune -a -f
!docker run --env SRA_OUTPUT=$sra_output --env PACKAGE_NAME=$PACKAGE_NAME --env AWS_ACCESS_KEY_ID=$access_key --env AWS_SECRET_ACCESS_KEY=$secret_key --env AWS_SESSION_TOKEN=$token myncbi/sra-tools:latest
    
    

In [None]:
# Now try a different package
PACKAGE_NAME = 'SRR000003'
!docker run --env SRA_OUTPUT=$sra_output --env PACKAGE_NAME=$PACKAGE_NAME --env AWS_ACCESS_KEY_ID=$access_key --env AWS_SECRET_ACCESS_KEY=$secret_key --env AWS_SESSION_TOKEN=$token myncbi/sra-tools:latest


### Build your own docker image

So far, we have been using the existing pegi3s/sratoolkit image. Let's build our own image from an Ubuntu base image.

1. Install tzdata. It is a dependency of some of the other packages we need. Normally we would not need to install it explicitly, but tzdata prompts for a timezone during installation, which would halt the Docker build. So we install it separately with DEBIAN_FRONTEND="noninteractive" and -y.
2. Install wget, libxml-libxml-perl, and awscli.
3. Download the sratoolkit Ubuntu binary and unpack it into /opt.
4. Set the PATH to include sratoolkit/bin.
5. USER nobody is needed to set the permission for the sratoolkit configuration.
6. Use the same sratest.sh script.

In [None]:
%%writetemplate Dockerfile.myown
#FROM ubuntu:18.04  
FROM public.ecr.aws/ubuntu/ubuntu:latest

RUN apt-get update 

RUN DEBIAN_FRONTEND="noninteractive" apt-get -y install tzdata \
        && apt-get install -y wget libxml-libxml-perl awscli

RUN wget -q https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.0/sratoolkit.2.10.0-ubuntu64.tar.gz -O /tmp/sratoolkit.tar.gz \
        && tar zxf /tmp/sratoolkit.tar.gz -C /opt/ && rm /tmp/sratoolkit.tar.gz

ENV PATH="/opt/sratoolkit.2.10.0-ubuntu64/bin/:${{PATH}}"

ADD sratest.sh /usr/local/bin/sratest.sh
RUN chmod +x /usr/local/bin/sratest.sh
WORKDIR /tmp
USER nobody
ENTRYPOINT ["/usr/local/bin/sratest.sh"]

In [None]:
# Build the image
!docker build -t myownncbi/sra-tools -f Dockerfile.myown .

In [None]:
PACKAGE_NAME='SRR000004'

!docker run --env SRA_OUTPUT=$sra_output --env PACKAGE_NAME=$PACKAGE_NAME --env AWS_ACCESS_KEY_ID=$access_key --env AWS_SECRET_ACCESS_KEY=$secret_key --env AWS_SESSION_TOKEN=$token    myownncbi/sra-tools:latest


In [None]:
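# download the output files from S3 into matching local directories
s3_client = session.client('s3')
objs = s3_client.list_objects(Bucket=bucket, Prefix=sra_prefix)
# 'Contents' is missing from the response when no object matches the prefix
for obj in objs.get('Contents', []):
    fn = obj['Key']
    # recreate the key's directory structure locally before downloading
    os.makedirs(os.path.dirname(fn), exist_ok=True)
    s3_client.download_file(bucket, fn, fn)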
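
A single `list_objects` call returns at most 1,000 keys, so a prefix with more objects would be downloaded only partially. The sketch below uses the S3 client's `list_objects_v2` paginator to walk every page; `download_prefix` is a hypothetical helper, and it expects a boto3 S3 client to be passed in.

```python
import os


def download_prefix(s3_client, bucket: str, prefix: str, dest_dir: str = ".") -> list:
    """Download every object under prefix, following pagination; return local paths."""
    downloaded = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):  # "Contents" is absent on empty pages
            key = obj["Key"]
            if key.endswith("/"):  # skip zero-byte "folder" placeholder keys
                continue
            local_path = os.path.join(dest_dir, key)
            os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
            s3_client.download_file(bucket, key, local_path)
            downloaded.append(local_path)
    return downloaded


# Hypothetical usage with the notebook's variables:
# download_prefix(session.client("s3"), bucket, sra_prefix)
```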



In [None]:

# Read the downloaded fastq file for the last package with bioinfokit
from bioinfokit.analys import fastq
fastq_iter = fastq.fastq_reader(file=f"{sra_prefix}/{PACKAGE_NAME}/{PACKAGE_NAME}.fastq") 
# iterate over all records, printing only the first 10
i = 0
for record in fastq_iter:
    # get sequence headers, sequence, and quality values
    header_1, sequence, header_2, qual = record
    # get sequence length
    sequence_len = len(sequence)
    # count A bases
    a_base = sequence.count('A')
    if i < 10:
        print(sequence, qual, a_base, sequence_len)
    i += 1

print(f"Total number of records for package {PACKAGE_NAME} : {i}")
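
Each record unpacked in the loop above corresponds to four lines of the fastq file: a header starting with `@`, the sequence, a separator starting with `+`, and the quality string. The following is a minimal standard-library sketch of such a reader, assuming single-line sequences. It is not how bioinfokit is implemented; `read_fastq` and the sample reads are made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (header, sequence, separator, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header, sequence, separator, quality


# Two made-up reads, only to exercise the reader.
sample = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n@read2\nAACC\n+read2\n!!!!\n")
records = list(read_fastq(sample))
a_counts = [seq.count("A") for _, seq, _, _ in records]
print(len(records), a_counts)  # prints: 2 [3, 2]
```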

In [None]:
!aws s3 rb s3://$bucket --force  
!rm -rf $sra_prefix


## Other ways to run the container 

We looked at creating and running containers locally in this notebook. Please check out the notebook/hpc/hatch-fastqc notebook for running containers with the AWS Batch service.