## Biotech Blueprint Example Nextflow Notebook

At a high level, this notebook does the following:

1. Copies the example `nf-core/rnaseq/` workflow from THIS Jupyter environment to S3 (`workflowBucket`/`workflowFolderPrefix`).
2. Submits the Nextflow head node which downloads the `nf-core/rnaseq/` workflow from that S3 location.
3. As the Nextflow head node runs, it submits each subsequent Nextflow process as an additional Batch job.
4. Tails the head node's CloudWatch Logs output so you can observe progress.

![Biotech Blueprint Nextflow diagram](nextflowbatchhelper/bb.nextflow.diagram.png "Biotech Blueprint Nextflow diagram")


We first initialize the API clients, the workflow staging location, and the job queue and job definition names. If you deployed this using the Biotech Blueprint Informatics Catalog, these values will have already been set for you.

In [None]:
import boto3
import botocore
import time
batchClient = boto3.client('batch')
s3Client = boto3.resource('s3')
logClient = boto3.client('logs')

workflowBucket = 'xxWorkflowBucketxx' # replace with your bucket name
workflowFolderPrefix = 'xxWorkflowPrefixxx' # replace with your object key
jobQueueName = 'xxJobQueuexx'
headNodeJobDef = 'nextflow'
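
If you did not deploy through the Informatics Catalog, it may help to confirm that none of the settings still holds a template placeholder before submitting anything. The following is a minimal sketch; the helper name `unset_settings` and the example values are hypothetical and not part of the original notebook, and the check assumes placeholders always look like `xxSomethingxx`.

```python
import re

# Template placeholders in this notebook look like 'xxWorkflowBucketxx'.
# Assumption: every placeholder starts and ends with 'xx'.
PLACEHOLDER = re.compile(r"^xx\w+xx$")

def unset_settings(settings):
    """Return the names of settings that still hold a template placeholder."""
    return [name for name, value in settings.items() if PLACEHOLDER.match(value)]

# Hypothetical values, for illustration only.
example = {
    "workflowBucket": "xxWorkflowBucketxx",
    "workflowFolderPrefix": "nf-core/rnaseq",
    "jobQueueName": "xxJobQueuexx",
}
print(unset_settings(example))
```

In the notebook itself you could pass a dictionary built from `workflowBucket`, `workflowFolderPrefix`, and `jobQueueName`, and stop if the returned list is not empty.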


Stage the workflow folder from this Jupyter environment to the S3 bucket that the head node will download it from.

In [None]:
%%script bash -s "$workflowBucket" "$workflowFolderPrefix"
aws s3 sync --only-show-errors --exclude '.*' nf-core/rnaseq s3://$1/$2
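
The `--exclude '.*'` filter in the sync command above is meant to keep dot-files out of the staged workflow. As a rough illustration, the sketch below approximates that filter with Python's `fnmatch`, assuming the pattern is evaluated against each path relative to the source folder and that `*` also matches `/`. The helper `is_excluded` and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase

def is_excluded(relative_path, pattern=".*"):
    """Approximate the sync exclude filter: match the pattern against the relative path."""
    return fnmatchcase(relative_path, pattern)

# Hypothetical paths relative to nf-core/rnaseq, for illustration only.
example_paths = [
    "main.nf",
    "nextflow.config",
    ".gitignore",
    ".git/config",
    "docs/.hidden",
]
staged = [path for path in example_paths if not is_excluded(path)]
print(staged)
```

Under these assumptions, only paths that begin with a dot at the top level of the folder are skipped; a dot-file nested in a regular subfolder would still be staged.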

Here we submit the head node to AWS Batch. There are a few key points to highlight:

The Nextflow container is set up such that the first parameter is the staging location of the workflow folder. Any additional parameters are passed, in order, directly to the Nextflow command entrypoint.

* Note that the `--reads` parameter uses Nextflow's glob matching pattern. If more than one read pair in that 1000 Genomes S3 location matched the pattern, the rnaseq pipeline would kick off for every pair it matched. Just one job submission can kick off thousands of jobs!
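
To make the pairing behaviour of the `--reads` glob more concrete, here is a minimal sketch that imitates it in plain Python. It is not Nextflow's implementation: the regular expression is a hand-written equivalent of `SRR*_{1,2}.filt.fastq.gz`, the function `group_read_pairs` is hypothetical, the file names are made up, and dropping incomplete pairs is an assumption of this sketch.

```python
import re
from collections import defaultdict

# Hand-written regex equivalent of the glob SRR*_{1,2}.filt.fastq.gz.
# Group 1 is the sample prefix, group 2 is the mate number.
PAIR_PATTERN = re.compile(r"^(SRR.*)_([12])\.filt\.fastq\.gz$")

def group_read_pairs(file_names):
    """Group matching file names by sample prefix, keeping complete pairs only."""
    mates = defaultdict(dict)
    for name in file_names:
        match = PAIR_PATTERN.match(name)
        if match:
            sample, mate = match.groups()
            mates[sample][mate] = name
    return {
        sample: [found["1"], found["2"]]
        for sample, found in sorted(mates.items())
        if "1" in found and "2" in found
    }

# Hypothetical file names, for illustration only.
example_files = [
    "SRR000001_1.filt.fastq.gz",
    "SRR000001_2.filt.fastq.gz",
    "SRR000002_1.filt.fastq.gz",
    "SRR000002_2.filt.fastq.gz",
    "SRR000003_1.filt.fastq.gz",  # no second mate, so not a pair
    "SRR000001.filt.fastq.gz",    # does not match the pattern
]
pairs = group_read_pairs(example_files)
for sample, files in pairs.items():
    print(sample, files)
```

Each key in `pairs` corresponds to one run of the pipeline's per-sample processes, which is why a broad glob can fan out into a large number of Batch jobs.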

In [None]:

response = batchClient.submit_job(
    jobDefinition=headNodeJobDef,
    jobName='rnaseq-NextflowHeadNode',
    jobQueue=jobQueueName,
    containerOverrides={
        'command': [
            "s3://{0}/{1}".format(workflowBucket, workflowFolderPrefix),
            "--reads", "s3://1000genomes/phase3/data/HG00243/sequence_read/SRR*_{1,2}.filt.fastq.gz",
            "--genome", "GRCh37",
            "--skip_qc"
        ]
    }
)

headNodeJobId = response['jobId']
print('Job ID: {0}'.format(headNodeJobId))



The helper below is a hastily written method that tails the head node's CloudWatch logs so you can follow the progress of the job. You can also observe progress in the AWS Batch console, and you can go even deeper by looking at the CloudWatch log output of any process Nextflow has started.
* Note: If this is the first time you are running something on Batch, there will be a slight delay (tens of seconds) while compute resources are provisioned. You will experience a similar delay whenever Batch has scaled your desired CPU count down to zero because there were no jobs to run.

In [None]:
from nextflowbatchhelper.tailCloudwatch import cloudWatchTail
tail = cloudWatchTail()
tail.startTail(headNodeJobId)
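
The source of `cloudWatchTail` is not shown here, so the following is only a sketch of how such a tail could work, not the helper's actual implementation. It polls `describe_jobs` for the job status and reads new lines with `get_log_events`. The function names, the `poll_seconds` and `emit` parameters, and the default log group `/aws/batch/job` are assumptions; adjust the log group if your job definition logs elsewhere.

```python
import time

def _drain(logs_client, log_group, stream, token, emit):
    """Emit all log events after `token` and return the latest forward token."""
    while True:
        kwargs = {
            "logGroupName": log_group,
            "logStreamName": stream,
            "startFromHead": True,
        }
        if token:
            kwargs["nextToken"] = token
        page = logs_client.get_log_events(**kwargs)
        for event in page["events"]:
            emit(event["message"])
        new_token = page["nextForwardToken"]
        # The end of the stream is reached when the token stops changing.
        if new_token == token:
            return token
        token = new_token

def tail_batch_job(batch_client, logs_client, job_id,
                   log_group="/aws/batch/job", poll_seconds=5, emit=print):
    """Emit new log lines for a Batch job until it succeeds or fails."""
    next_token = None
    while True:
        job = batch_client.describe_jobs(jobs=[job_id])["jobs"][0]
        # The log stream name is only present once the job has started.
        stream = job.get("container", {}).get("logStreamName")
        if stream:
            next_token = _drain(logs_client, log_group, stream, next_token, emit)
        if job["status"] in ("SUCCEEDED", "FAILED"):
            return job["status"]
        time.sleep(poll_seconds)
```

With the clients created earlier in this notebook, a call would look like `tail_batch_job(batchClient, logClient, headNodeJobId)`. The sketch does no error handling, so a log stream that is not yet readable would raise an exception rather than be retried.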