# Filter BED file

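In this notebook, we use [PLINK2](https://www.cog-genomics.org/plink/2.0/) to filter the UK Biobank 200k exome release BED files. The output keeps only the samples in our phenotype file and only the autosomal variants that fall within the exome capture regions and have a minor allele count of at least 6.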

Note that this work is part of a larger project to [Demonstrate the Potential for Pooled Analysis of All of Us and UK Biobank Genomic Data](https://github.com/all-of-us/ukb-cross-analysis-demo-project). Specifically, this notebook belongs to the **siloed** analysis portion of the project.

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the UK Biobank Research Analysis Platform.
    <ul>
        <li>Use compute type 'Standard VM' with sufficient CPU and RAM (e.g. start with 8 CPUs and 30 GB RAM, increase if needed).</li>
        <li>This notebook can take a while to run (e.g., 30 minutes for the larger chromosomes). We recommend running it in the background via <kbd>dx run dxjupyterlab</kbd>, which also captures provenance.</li>
    </ul>
</div>

To run on a single chromosome at a time:
```
CHROM=21
dx run dxjupyterlab \
    --instance-type=mem2_ssd1_v2_x8 \
    -icmd="papermill 03_ukb_plink_bed_filter.ipynb 03_ukb_plink_bed_filter_chr${CHROM}_$(date +%Y%m%d).ipynb -p CHROM ${CHROM}" \
    -iin=03_ukb_plink_bed_filter.ipynb \
    --folder=outputs/plink-make-bed/$(date +%Y%m%d)/
```

To run on all chromosomes in parallel:
```
for CHROM in {1..22}; do
    dx run dxjupyterlab \
        --instance-type=mem2_ssd1_v2_x8 \
        -icmd="papermill 03_ukb_plink_bed_filter.ipynb 03_ukb_plink_bed_filter_chr${CHROM}_$(date +%Y%m%d).ipynb -p CHROM ${CHROM}" \
        -iin=03_ukb_plink_bed_filter.ipynb \
        --folder=outputs/plink-make-bed/$(date +%Y%m%d)/ \
        --yes

done
```

See also https://platform.dnanexus.com/app/dxjupyterlab

In [None]:
import os
import pandas as pd

## Setup plink2

https://www.cog-genomics.org/plink/2.0/

In [None]:
%%bash

##### plink 2 install
# PLINK_VERSION only labels the local zip file; the download URL below is not version-pinned.
PLINK_VERSION=2.3.Alpha
PLINK_ZIP_PATH="/tmp/plink-${PLINK_VERSION}.zip"
curl -L -o "${PLINK_ZIP_PATH}" https://s3.amazonaws.com/plink2-assets/alpha2/plink2_linux_x86_64.zip
mkdir -p /tmp/plink2/
unzip -o "${PLINK_ZIP_PATH}" -d /tmp/plink2/

In [None]:
!/tmp/plink2/plink2 --version # --help

## Define constants

This notebook takes the UK Biobank whole exome sequencing (WES) data as input.

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

#---[ Inputs ]---
CHROM = '22'
# This was created via ukb_rap_siloed_analyses/02_ukb_lipids_phenotype.ipynb
PHENOTYPES = '/mnt/project/outputs/r-prepare-phenotype/20220217/ukb_200kwes_lipids_phenotype.tsv'

#---[ Outputs ]---
FIXED_FAM = 'ukb23155_FIXED_b0_v1.fam'

In [None]:
# Constants that depend on parameters injected by papermill.

#---[ Inputs ]---
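# Provided by UKB RAP. Spaces and commas are backslash-escaped because these paths are
# interpolated into shell commands. Raw f-strings keep the backslashes without
# triggering Python's invalid-escape-sequence warnings.
LOCAL_BED = rf'/mnt/project/Bulk/Exome\ sequences/Population\ level\ exome\ OQFE\ variants\,\ PLINK\ format/ukb23155_c{CHROM}_b0_v1'
LOCAL_FAM = rf'/mnt/project/Bulk/Exome\ sequences/Population\ level\ exome\ OQFE\ variants\,\ PLINK\ format/ukb23155_c{CHROM}_b0_v1.fam'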

#---[ Outputs ]---
PLINK_OUTPUT_FILENAME_PREFIX = f'ukb_200kwes_chr{CHROM}'
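
The input paths above are written with shell escaping, so they cannot be passed directly to Python functions such as `os.path.exists`. The following is a minimal sketch of one way to check that the inputs exist before launching a long run. The helper name `unescape_shell_path` and the example value are illustrative and not part of the original notebook.

```python
import os
import shlex


def unescape_shell_path(escaped: str) -> str:
    """Turn a backslash-escaped shell path into the plain path Python expects."""
    parts = shlex.split(escaped)
    if len(parts) != 1:
        raise ValueError(f"expected exactly one path, got {len(parts)}: {escaped!r}")
    return parts[0]


# Hypothetical example value that mirrors the escaping style of LOCAL_BED.
example = r'/mnt/project/Bulk/Exome\ sequences/Population\ level\ exome\ OQFE\ variants\,\ PLINK\ format/ukb23155_c22_b0_v1'
plain = unescape_shell_path(example)
print(plain)

# Report whether each file of the PLINK fileset is present.
for ext in ('.bed', '.bim', '.fam'):
    print(ext, os.path.exists(plain + ext))
```

In the notebook itself, the same helper could be applied to `LOCAL_BED` instead of the example string.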

## Get capture region intervals

For details, see https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=3803.

In [None]:
!wget -nd https://biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/xgen_plus_spikein.GRCh38.bed
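
Before using the capture intervals as a filter, it can help to see how many intervals and bases they cover per chromosome. The sketch below assumes the file is tab-separated with chromosome, start, and end in the first three columns, as in standard BED. The function name `summarize_bed` and the demo intervals are made up for illustration.

```python
import csv
from collections import defaultdict


def summarize_bed(lines):
    """Return {chrom: (n_intervals, total_bases)} for BED-formatted lines.

    BED coordinates are 0-based and half-open, so an interval spans
    end - start bases.
    """
    counts = defaultdict(int)
    bases = defaultdict(int)
    for row in csv.reader(lines, delimiter='\t'):
        # Skip blank lines and optional comment/track/browser header lines.
        if not row or row[0].startswith(('#', 'track', 'browser')):
            continue
        chrom, start, end = row[0], int(row[1]), int(row[2])
        counts[chrom] += 1
        bases[chrom] += end - start
    return {chrom: (counts[chrom], bases[chrom]) for chrom in counts}


# Made-up intervals, not taken from the real capture file.
demo = ["chr21\t100\t250\n", "chr21\t300\t400\n", "chr22\t0\t10\n"]
demo_summary = summarize_bed(demo)
print(demo_summary)
```

For the downloaded file, an open file handle can be passed in place of the demo list, e.g. `with open('xgen_plus_spikein.GRCh38.bed') as f: summarize_bed(f)`.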

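## Copy fam locally for editing

Write a local copy of the `.fam` file in which the placeholder value `redacted` is replaced with `-9`, the value PLINK conventionally uses for a missing phenotype.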

In [None]:
!sed -r -e 's|redacted|-9|' {LOCAL_FAM} > {FIXED_FAM}
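
The `sed` expression has no `g` flag, so it replaces only the first `redacted` on each line. A small Python sketch of the same substitution, plus a check for leftover placeholders, is shown here. The function names and the demo rows are invented for illustration and do not come from the real `.fam` file.

```python
def fix_fam_line(line: str) -> str:
    """Mirror sed 's|redacted|-9|': replace only the first 'redacted' per line."""
    return line.replace('redacted', '-9', 1)


def count_redacted(lines) -> int:
    """Count lines that still contain the 'redacted' placeholder."""
    return sum('redacted' in line for line in lines)


# Invented rows in the six-column .fam layout (FID IID father mother sex phenotype).
demo_fam = [
    "1000001 1000001 0 0 2 redacted\n",
    "1000002 1000002 0 0 1 redacted\n",
]
demo_fixed = [fix_fam_line(line) for line in demo_fam]
print(demo_fixed)
print("lines still containing 'redacted':", count_redacted(demo_fixed))
```

Running `count_redacted` over the lines of the local fixed copy would flag any line where more than one field carried the placeholder.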

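# Write out the filtered BED file

The command below applies four filters: `--chr 1-22` keeps autosomal variants, `--keep` keeps the samples listed in the phenotype file, `--extract bed0` keeps variants inside the capture intervals (read as 0-based BED coordinates), and `--mac 6` keeps variants with a minor allele count of at least 6.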

In [None]:
!/tmp/plink2/plink2 \
  --bfile {LOCAL_BED} \
  --psam {FIXED_FAM} \
  --chr 1-22 \
  --keep {PHENOTYPES} \
  --extract bed0 xgen_plus_spikein.GRCh38.bed \
  --mac 6 \
  --make-bed \
  --out {PLINK_OUTPUT_FILENAME_PREFIX}_plink_makebed
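
After the run, a quick sanity check is to count how many samples and variants survived the filters. The following sketch reads the `.fam` and `.bim` files that `--make-bed` writes next to the `.bed` file. The function name `summarize_plink_output` and the tiny demo fileset are hypothetical.

```python
import os
import tempfile
from collections import Counter


def summarize_plink_output(prefix):
    """Return (n_samples, {chrom: n_variants}) for a PLINK 1 binary fileset.

    Samples are counted as non-empty .fam lines; variants are counted per
    chromosome using the first column of the .bim file.
    """
    with open(prefix + '.fam') as fam:
        n_samples = sum(1 for line in fam if line.strip())
    with open(prefix + '.bim') as bim:
        per_chrom = Counter(line.split()[0] for line in bim if line.strip())
    return n_samples, dict(per_chrom)


# Demo on a tiny made-up fileset written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_prefix = os.path.join(tmp, 'demo_plink_makebed')
    with open(demo_prefix + '.fam', 'w') as f:
        f.write('1 1 0 0 1 -9\n2 2 0 0 2 -9\n')
    with open(demo_prefix + '.bim', 'w') as f:
        f.write('22\tvar_a\t0\t100\tA\tG\n22\tvar_b\t0\t200\tC\tT\n')
    demo_result = summarize_plink_output(demo_prefix)
print(demo_result)
```

In this notebook the call would be `summarize_plink_output(f'{PLINK_OUTPUT_FILENAME_PREFIX}_plink_makebed')`. For a single-chromosome run, the result should list only that chromosome.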

In [None]:
%%bash

ls -lat | head

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze