# Run a Hail notebook in the background

On the *All of Us* Workbench, users are logged out after 30 minutes of inactivity. For long-running Hail jobs, this means that some notebook cell outputs may never be populated, even though the notebook continues to run to completion.

<div class="alert alert-block alert-success">
    <b>If you wish to capture all Hail notebook cell outputs</b>, use this notebook to run your long-running Hail notebook (or any other long-running notebook).<p>Also note that <a href="https://support.terra.bio/hc/en-us/articles/360029761352-Preventing-runaway-costs-with-notebook-auto-pause-#h_de5698f5-3c82-4763-aaaf-ea7df6a1869c">the cluster will autopause after 24 hours</a>. To prevent your cluster from shutting down if your background Hail job takes longer than 24 hours, be sure to log in and start a notebook, any notebook, to reset the autopause timer.</p>
</div>

How to use this notebook:
1. Copy this notebook to the workspace that contains the long-running notebook you wish to run in the background.
1. Edit the filename in `NOTEBOOK_TO_RUN` to be the name of the notebook in this workspace you wish to run.
1. Use menu `Cell -> Run All` to tell the kernel to run this notebook to completion.
1. Close this tab and do other work. (Note: when you close the tab, the outputs in *this* notebook will no longer be updated. That is fine, because the outputs we want are those of the Hail notebook.)
1. When the Hail job is complete:
    1. Your Hail cluster will pause (if you have no other notebooks open).
    1. You will see a copy of your notebook in the notebooks tab with a date and timestamp suffix. This is the output file.
    1. You will also see the Hail log in `gs://<workspace bucket>/hail-logs/YYYYMMDD/hail*.log`.

If you check back and do not see the output notebook, the Hail job is likely still running. To confirm this, first check whether your Cloud analysis environment is still running. If it is, run `yarn application -list` from the terminal to see whether a Spark context is still processing data.
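The check for the output notebook can also be done in code by looking for files that carry the date and timestamp suffix. The following is a minimal sketch, assuming the executed copy is written next to the original on the local disk; `find_output_notebooks` is a hypothetical helper name, not something the Workbench provides.

```python
import glob
import os


def find_output_notebooks(notebook_to_run, directory='.'):
    """Return executed copies of `notebook_to_run`, oldest first.

    Hypothetical helper: assumes outputs are named
    '<original name without .ipynb>_YYYYMMDD_HHMMSS.ipynb'.
    """
    stem = notebook_to_run
    if stem.endswith('.ipynb'):
        stem = stem[:-len('.ipynb')]
    # glob.escape protects characters such as '[' or '*' in the names.
    date_part = '[0-9]' * 8
    time_part = '[0-9]' * 6
    pattern = os.path.join(
        glob.escape(directory),
        f'{glob.escape(stem)}_{date_part}_{time_part}.ipynb',
    )
    # The timestamp format sorts lexicographically in chronological order.
    return sorted(glob.glob(pattern))
```

An empty list means no executed copy has been written to that directory yet.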

## Setup

In [1]:
import nbformat
from nbconvert.preprocessors import CellExecutionError
from nbconvert.preprocessors import ExecutePreprocessor

import os
import time

<div class="alert alert-block alert-warning">
<b>Change the cell below</b> to hold the name of the notebook in this workspace that you wish to run in the background. There is no need to modify any of the other cells.
</div>

In [2]:
#---------[ CHANGE THIS TO BE THE NAME OF THE NOTEBOOK YOU WISH TO RUN ]-----------
NOTEBOOK_TO_RUN = '02 Step 2 extract Nov 2021 SNPs from reference 1KGenomes and HGMD.ipynb'

<div class="alert alert-block alert-danger" >
    NB: Change the following to <b>R</b> if you have an R notebook or to <b>PYTHON</b> if you have a Python notebook.
</div>

In [3]:
KERNEL = 'PYTHON'

## Programmatically set output paths

Formulate a new filename for the output notebook so that it includes a date and timestamp for when it was executed.

In [4]:
TIMESTAMP_FILE_SUFFIX = time.strftime('_%Y%m%d_%H%M%S.ipynb')
OUTPUT_NOTEBOOK = NOTEBOOK_TO_RUN.replace('.ipynb', TIMESTAMP_FILE_SUFFIX)

print(f'Executed notebook will be written to filename "{OUTPUT_NOTEBOOK}" on the local disk and the workspace bucket.')

Executed notebook will be written to filename "02 Step 2 extract Nov 2021 SNPs from reference 1KGenomes and HGMD_20220329_220334.ipynb" on the local disk and the workspace bucket.
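Note that `str.replace` substitutes every occurrence of `.ipynb` in the name, not only the final extension. If a notebook name could contain `.ipynb` more than once, one alternative is to split off the extension explicitly. The following is a sketch of that idea; `timestamped_name` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import os
import time


def timestamped_name(notebook_name, when=None):
    """Insert '_YYYYMMDD_HHMMSS' before the file extension.

    Hypothetical helper. `when` is an optional time.struct_time;
    the current local time is used when it is omitted.
    """
    stem, ext = os.path.splitext(notebook_name)
    if when is None:
        when = time.localtime()
    stamp = time.strftime('_%Y%m%d_%H%M%S', when)
    return f'{stem}{stamp}{ext}'
```

For a name with a single `.ipynb` extension, this produces the same result as the cell above.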


<div class="alert alert-block alert-danger" >
    NB: Run the next cell only if you use Hail
</div>

In [5]:
DATESTAMP = time.strftime('%Y%m%d')
HAIL_LOG_DIR_FOR_PROVENANCE = f'{os.getenv("WORKSPACE_BUCKET")}/hail-logs/{DATESTAMP}/'

print(f'Hail logs will be copied to {HAIL_LOG_DIR_FOR_PROVENANCE}')

Hail logs will be copied to gs://fc-secure-30fdbdfd-a46b-406d-9617-1bc69ae1da9d/hail-logs/20220329/


## Execute the notebook and capture provenance

In [6]:
def get_kernel(kernel):
    return 'ir' if kernel.lower() == 'r' else 'python3'

KERNEL_NAME = get_kernel(KERNEL)
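`get_kernel` treats every value other than `R` as Python, so a typo in `KERNEL` silently selects the Python kernel. A stricter variant could reject unknown values instead. The following is a sketch only; `get_kernel_strict` and `KERNEL_NAMES` are hypothetical names, and the kernel names `ir` and `python3` are taken from the cell above.

```python
# Hypothetical mapping from the user-facing setting to the Jupyter kernel name.
KERNEL_NAMES = {'r': 'ir', 'python': 'python3'}


def get_kernel_strict(kernel):
    """Return the Jupyter kernel name, raising ValueError on unknown input."""
    key = kernel.strip().lower()
    if key not in KERNEL_NAMES:
        raise ValueError(
            f'Unknown kernel {kernel!r}; expected one of: R, PYTHON'
        )
    return KERNEL_NAMES[key]
```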

<div class="alert alert-block alert-info">
The next cell will use <kbd>nbconvert</kbd> to run the notebook in the background until it completes (or yields an error) and will capture all notebook outputs as a separate notebook file.
</div>

In [None]:
# See also https://nbconvert.readthedocs.io/en/latest/execute_api.html
# Read the notebook, then close the input file before the long-running execution starts.
with open(NOTEBOOK_TO_RUN, encoding='utf-8') as f_in:
    nb = nbformat.read(f_in, as_version=4)

# timeout=-1 disables the per-cell timeout so that long-running cells are not interrupted.
ep = ExecutePreprocessor(timeout=-1, kernel_name=KERNEL_NAME)
try:
    out = ep.preprocess(nb, {'metadata': {'path': ''}})
except CellExecutionError:
    out = None
    msg = 'Error executing the notebook "%s".\n\n' % NOTEBOOK_TO_RUN
    msg += 'See notebook "%s" for the traceback.' % OUTPUT_NOTEBOOK
    print(msg)
finally:
    # Always write the executed notebook, even after an error, so that partial outputs are kept.
    with open(OUTPUT_NOTEBOOK, mode='w', encoding='utf-8') as f_out:
        nbformat.write(nb, f_out)

# Save the executed notebook to the workspace bucket.
WORKSPACE_BUCKET = os.getenv('WORKSPACE_BUCKET')
!gsutil cp "{OUTPUT_NOTEBOOK}" {WORKSPACE_BUCKET}/notebooks/

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


2022-03-29 22:03:42 WARN  Hail:43 - This Hail JAR was compiled for Spark 3.1.1, running with Spark 3.1.2.
  Compatibility is not guaranteed.




<div class="alert alert-block alert-danger" >
    NB: Run the next cell only if you use Hail
</div>

<div class="alert alert-block alert-info">
    When the notebook has finished execution, the next cell will use <kbd>gsutil</kbd> to copy all Hail logs from the execution directory on the local disk to the workspace bucket.
</div>

In [None]:
%%bash -s $HAIL_LOG_DIR_FOR_PROVENANCE

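# Compress the Hail logs, keeping the uncompressed originals on the local disk.
gzip --keep hail*.log
# Copy the compressed logs to the provenance directory passed as the first argument.
gsutil -m cp hail*.log.gz "${1}"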
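The compression step in the cell above can also be written in Python using only the standard library, which may be convenient when the logs need further handling before upload. The following is a sketch under that assumption; `gzip_keep` is a hypothetical helper, and the upload to the bucket would still be done with `gsutil` as above.

```python
import glob
import gzip
import shutil


def gzip_keep(pattern='hail*.log'):
    """Compress each file matching `pattern` to '<name>.gz', keeping the original.

    Hypothetical helper that mirrors `gzip --keep`. Returns the list of
    compressed file paths in sorted order.
    """
    compressed = []
    for path in sorted(glob.glob(pattern)):
        gz_path = path + '.gz'
        # Stream the file through gzip so large logs are not loaded into memory.
        with open(path, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)
        compressed.append(gz_path)
    return compressed
```

Unlike `gzip --keep`, this sketch overwrites an existing `.gz` file of the same name.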