# Write filtered AoU BGEN file

In this notebook, we write data from the *All of Us* whole genome sequence matrix table to a BGEN file for use with other tools such as [PLINK2](https://www.cog-genomics.org/plink/2.0/) and [regenie](https://rgcgithub.github.io/regenie/).

This work is part of a larger project to [Demonstrate the Potential for Pooled Analysis of All of Us and UK Biobank Genomic Data](https://docs.google.com/document/d/19ZS0z_-7FEM37pNDAXaWaqBSLnqyd9MZEkiOmtF3n_0/edit#). Specifically, it supports the **siloed** analysis portion of that project.

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the <i>All of Us</i> Workbench.
    <ul>
        <li>Use "Recommended Environment" <kbd><b>Hail Genomics Analysis</b></kbd> which creates compute type <kbd>Dataproc Cluster</kbd> with reasonable defaults for CPU, RAM, disk, and number of workers. If you like, you can increase the number of workers to make this job complete faster.</li>
        <li>This notebook can take several hours to run. We recommend running it in the background via <kbd>run_notebook_in_the_background</kbd>.</li>
        <ul>
            <li>chr21 took about 5 hours with 5 worker nodes.</li>
            <li>chr1 - chr22 <b>TODO(margaret) be sure to write down how long this takes and with how many worker nodes</b></li>
        </ul>
    </ul>
</div>

In [None]:
from datetime import datetime
import hail as hl
import os
import time

## Retrieve the exome capture regions

For details, see https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=3803.

In [None]:
%%bash

wget -nd -nv biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/xgen_plus_spikein.GRCh38.bed

gsutil cp xgen_plus_spikein.GRCh38.bed ${WORKSPACE_BUCKET}/data/ukb/

## Define constants

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

#---[ Inputs ]---
# Matrix table was provided by AoU.
AOU_MT = 'gs://fc-secure-4adb21f6-46f4-4a79-99f9-afd63890c6d0/data/beta/beta_wgs_98622.mt'
# This file is from https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=3803.
EXOME_REGIONS = f'{os.getenv("WORKSPACE_BUCKET")}/data/ukb/xgen_plus_spikein.GRCh38.bed'
# Use autosomes only.
INTERVALS_TO_EXAMINE = ['chr1-chr22']
INTERVALS_TO_EXAMINE = ['chr21']  # TODO: Remove this override to process all autosomes.
INTERVALS_TO_EXAMINE_NAME = '_'.join(INTERVALS_TO_EXAMINE).replace(':', 'range')

#---[ Outputs ]---
# Create a timestamp for a folder of results generated today.
DATESTAMP = time.strftime('%Y%m%d')
TIMESTAMP = time.strftime('%Y%m%d_%H%M%S')
WORK_DIR = !pwd

OUTPUT_BGEN = f'{os.getenv("WORKSPACE_BUCKET")}/data/aou/geno/{DATESTAMP}/aou-alpha3-{INTERVALS_TO_EXAMINE_NAME}' # Hail will add the .bgen suffix.
HAIL_LOG = f'{WORK_DIR[0]}/hail-write-filtered-bgen-{TIMESTAMP}.log'
HAIL_LOG_DIR_FOR_PROVENANCE = f'{os.getenv("WORKSPACE_BUCKET")}/hail-logs/{DATESTAMP}/'

In [None]:
OUTPUT_BGEN
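
To make the naming scheme concrete, here is a minimal sketch of how the interval list becomes the BGEN file stem. The helper name `interval_label`, the bucket `gs://example-bucket`, and the fixed datestamp are hypothetical stand-ins; the constants cell above reads the bucket from `WORKSPACE_BUCKET` and uses today's date.

```python
def interval_label(intervals):
    """Join interval strings into a single label, as in the constants cell."""
    # ':' is replaced so that an interval such as 'chr22:1-1000' does not
    # put a colon into the output file name.
    return '_'.join(intervals).replace(':', 'range')


# Hypothetical values, for illustration only.
example_bucket = 'gs://example-bucket'
example_datestamp = '20210101'

label = interval_label(['chr21', 'chr22:1-1000'])
output_stem = f'{example_bucket}/data/aou/geno/{example_datestamp}/aou-alpha3-{label}'

print(label)
print(output_stem + '.bgen')
```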

## Check access

In [None]:
!gsutil ls {AOU_MT}

## Start Hail 

In [None]:
# See also https://towardsdatascience.com/fetch-failed-exception-in-apache-spark-decrypting-the-most-common-causes-b8dff21075c
# See https://spark.apache.org/docs/2.4.7/configuration.html

EXTRA_SPARK_CONFIG = {
    # If set to "true", performs speculative execution of tasks. This means if one or more tasks are running
    # slowly in a stage, they will be re-launched.
    'spark.speculation': 'true', # Default is false.
    
    # Fraction of tasks which must be complete before speculation is enabled for a particular stage.
    'spark.speculation.quantile': '0.95', # Default is 0.75

    # Default timeout for all network interactions. This config will be used in place of 
    # spark.core.connection.ack.wait.timeout, spark.storage.blockManagerSlaveTimeoutMs, 
    # spark.shuffle.io.connectionTimeout, spark.rpc.askTimeout or spark.rpc.lookupTimeout if they are not configured.
    'spark.network.timeout': '180s', # Default is 120s
        
    # (Netty only) Fetches that fail due to IO-related exceptions are automatically retried if this is set to a
    # non-zero value. This retry logic helps stabilize large shuffles in the face of long GC pauses or transient
    # network connectivity issues.
    'spark.shuffle.io.maxRetries': '10',  # Default is 3
    
    # (Netty only) How long to wait between retries of fetches. The maximum delay caused by retrying is 15 seconds
    # by default, calculated as maxRetries * retryWait.
    'spark.shuffle.io.retryWait': '15s',  # Default is 5s
    
    # Number of failures of any particular task before giving up on the job. The total number of failures spread
    # across different tasks will not cause the job to fail; a particular task has to fail this number of attempts.
    # Should be greater than or equal to 1. Number of allowed retries = this value - 1.
    'spark.task.maxFailures': '10', # Default is 4.

    # Number of consecutive stage attempts allowed before a stage is aborted.
    'spark.stage.maxConsecutiveAttempts': '10' # Default is 4.
}
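
The 15-second figure quoted from the Spark documentation applies to the defaults (3 retries, 5 s apart). As a quick check of what the overrides above imply, the sketch below applies the same `maxRetries * retryWait` formula. The helper `max_retry_delay_seconds` is a hypothetical name used only for this illustration.

```python
def max_retry_delay_seconds(max_retries, retry_wait_seconds):
    """Worst-case delay added by shuffle fetch retries: maxRetries * retryWait."""
    return max_retries * retry_wait_seconds


default_delay = max_retry_delay_seconds(3, 5)       # Spark defaults
configured_delay = max_retry_delay_seconds(10, 15)  # values in EXTRA_SPARK_CONFIG

print(f'default: {default_delay}s, configured: {configured_delay}s')
```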

In [None]:
hl.init(spark_conf=EXTRA_SPARK_CONFIG,
        min_block_size=50,
        default_reference='GRCh38',
        log=HAIL_LOG)

Check the configuration.

In [None]:
sc = hl.spark_context()
config = sc._conf.getAll()
config.sort()
config

# Load exome capture regions

In [None]:
ukb_exome_capture_regions = hl.import_bed(EXOME_REGIONS)

In [None]:
ukb_exome_capture_regions.describe()

In [None]:
ukb_exome_capture_regions.aggregate(hl.agg.counter(ukb_exome_capture_regions.interval.start.contig))

In [None]:
ukb_exome_capture_regions.show(5)

# Read the matrix table

In [None]:
aou_mt = hl.read_matrix_table(AOU_MT)

In [None]:
aou_mt.n_partitions()

In [None]:
aou_mt.describe()

## Filter to our intervals of interest

In [None]:
if len(INTERVALS_TO_EXAMINE) > 0:
    aou_mt = hl.filter_intervals(
        aou_mt,
        [hl.parse_locus_interval(x) for x in INTERVALS_TO_EXAMINE],
        keep=True)

## Filter to include only exonic variants

In [None]:
aou_mt = aou_mt.filter_rows(hl.is_defined(ukb_exome_capture_regions[aou_mt.locus]), keep=True)
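
Conceptually, this filter keeps a variant when its locus falls inside any capture interval. The plain-Python sketch below illustrates that test outside of Hail, assuming BED's standard 0-based, half-open coordinates and 1-based variant positions. The function name `in_capture_regions` and the toy intervals are hypothetical; in the notebook, Hail performs the equivalent lookup through the interval-keyed table.

```python
from bisect import bisect_right


def in_capture_regions(regions, contig, position):
    """Return True if a 1-based position lies in any BED interval on the contig.

    `regions` maps contig -> sorted, non-overlapping list of (start, end) pairs
    in BED coordinates (0-based start, exclusive end), so each pair covers the
    1-based positions start + 1 through end, inclusive.
    """
    intervals = regions.get(contig, [])
    # Locate the last interval whose 0-based start is below the 1-based position.
    i = bisect_right(intervals, (position - 1, float('inf'))) - 1
    if i < 0:
        return False
    start, end = intervals[i]
    return start < position <= end


# Toy capture regions, for illustration only.
toy_regions = {'chr21': [(100, 200), (500, 600)]}

for pos in (100, 101, 200, 201):
    print(pos, in_capture_regions(toy_regions, 'chr21', pos))
```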

## Examine variants with filter flags

In [None]:
aou_mt_rows = aou_mt.rows()
aou_mt_rows.group_by(aou_mt_rows.filters).aggregate(n = hl.agg.count()).show()

<div class="alert alert-block alert-info">
<b>Note:</b> AoU VCFs and UKB VCFs use different soft thresholds for what is flagged by VCF filters, so all variant QC happens further downstream. If you do want to make use of the AoU VCF filter flags, uncomment the code in the following two cells.
</div>

In [None]:
#aou_mt = aou_mt.filter_rows(hl.is_missing(aou_mt.filters), keep = True)

In [None]:
#aou_mt_rows = aou_mt.rows()
#aou_mt_rows.group_by(aou_mt_rows.filters).aggregate(n = hl.agg.count()).show()

## Create an rsid

This is needed by PLINK. The identifier has the form `contig_position_ref_alt`.

In [None]:
aou_mt = aou_mt.annotate_rows(
    rsid = aou_mt.locus.contig + '_' + hl.str(aou_mt.locus.position)
            + '_' + aou_mt.alleles[0] + '_' + aou_mt.alleles[1])
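
As a plain-Python illustration of the identifier built above, the sketch below reproduces the same string concatenation. The function `make_variant_id` is a hypothetical helper, not part of Hail or PLINK, and the loci are made up.

```python
def make_variant_id(contig, position, alleles):
    """Build a contig_position_ref_alt ID from the first two alleles."""
    return f'{contig}_{position}_{alleles[0]}_{alleles[1]}'


print(make_variant_id('chr21', 10001, ['A', 'G']))
# At a multi-allelic site, only the first alternate allele appears in the ID.
print(make_variant_id('chr21', 10002, ['C', 'T', 'CT']))
```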

## Convert to biallelic variants

In [None]:
# For efficiency, do not pass the biallelic variants to the split method,
# just add the corresponding annotations.
aou_bi = aou_mt.filter_rows(hl.len(aou_mt.alleles) == 2)
aou_bi = aou_bi.annotate_rows(a_index = 1)
aou_bi = aou_bi.annotate_rows(was_split = False)

# Split the multi-allelic sites into biallelic sites.
aou_multi = aou_mt.filter_rows(hl.len(aou_mt.alleles) > 2)
aou_split = hl.split_multi_hts(aou_multi,
                               keep_star=False,
                               left_aligned=False,
                               vep_root='vep',
                               permit_shuffle=False)

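# Union the split multi-allelic rows with the rows that were already biallelic.
aou_prepared = aou_split.union_rows(aou_bi)

# Recompute the ID after the split. The ID annotated earlier used alleles[1] of the
# unsplit site, so every row split from one multi-allelic site would share the same ID.
aou_prepared = aou_prepared.annotate_rows(
    rsid = aou_prepared.locus.contig + '_' + hl.str(aou_prepared.locus.position)
            + '_' + aou_prepared.alleles[0] + '_' + aou_prepared.alleles[1])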
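
To illustrate what splitting does to genotypes, the sketch below mimics the downcoding step in plain Python: in the row created for alternate allele `a_index`, every other allele is recoded as reference. This is a simplified model of the behavior documented for `hl.downcode` on unphased calls, not Hail's implementation, and the function name `downcode_genotype` is hypothetical.

```python
def downcode_genotype(genotype, a_index):
    """Recode a genotype for the biallelic row of alternate allele `a_index`.

    `genotype` is a tuple of allele indices (0 = reference), or None if missing.
    Alleles equal to `a_index` become 1; all other alleles become 0.
    """
    if genotype is None:
        return None
    return tuple(sorted(1 if allele == a_index else 0 for allele in genotype))


# A 1/2 genotype at a site with two alternate alleles is heterozygous
# in both of the resulting biallelic rows.
for a_index in (1, 2):
    print(a_index, downcode_genotype((1, 2), a_index))
```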

# Write the matrix table to BGEN

See the Hail documentation for [export_bgen](https://hail.is/docs/0.2/methods/impex.html#hail.methods.export_bgen).

In [None]:
start = datetime.now()
print(start)

In [None]:
homref_gp = hl.literal([1.0, 0.0, 0.0])
het_gp = hl.literal([0.0, 1.0, 0.0])
homvar_gp = hl.literal([0.0, 0.0, 1.0])

aou_prepared = aou_prepared.annotate_entries(
    GP = hl.case()
        .when(aou_prepared.GT.is_hom_ref(), homref_gp)
        .when(aou_prepared.GT.is_het(), het_gp)
        .default(homvar_gp)
)
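
BGEN stores genotype probabilities, so the cell above encodes each hard call as a one-hot vector in the order (hom-ref, het, hom-var). A plain-Python sketch of the same mapping follows. The helper `gp_from_alt_count` is hypothetical, and it assumes diploid, biallelic genotypes, which is what the split step produces for autosomes.

```python
def gp_from_alt_count(n_alt):
    """Map a diploid alt-allele count (0, 1, or 2) to one-hot genotype probabilities."""
    one_hot = {
        0: [1.0, 0.0, 0.0],  # homozygous reference
        1: [0.0, 1.0, 0.0],  # heterozygous
        2: [0.0, 0.0, 1.0],  # homozygous alternate
    }
    return one_hot[n_alt]


for n_alt in range(3):
    print(n_alt, gp_from_alt_count(n_alt))
```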

In [None]:
hl.methods.export_bgen(mt=aou_prepared, output=OUTPUT_BGEN, gp=aou_prepared.GP, rsid=aou_prepared.rsid, parallel=None)

In [None]:
end = datetime.now()
print(end)
print(end - start)

In [None]:
start = datetime.now()
print(start)

In [None]:
hl.methods.index_bgen(OUTPUT_BGEN + '.bgen')

In [None]:
end = datetime.now()
print(end)
print(end - start)

# Provenance

In [None]:
# Copy the Hail log to the workspace bucket so that we can retain it.
!gzip --keep {HAIL_LOG}
!gsutil cp {HAIL_LOG}.gz {HAIL_LOG_DIR_FOR_PROVENANCE}

In [None]:
print(datetime.now())

In [None]:
!pip3 freeze