# Filter BED file

In this notebook, we use [PLINK2](https://www.cog-genomics.org/plink/2.0/) to filter PLINK BED files down to the relevant variants: autosomal variants inside the exome capture regions with a minor allele count of at least 6.

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the UK Biobank Research Analysis Platform.
    <ul>
        <li>Use compute type 'Standard VM' with sufficient CPU and RAM (e.g. start with 8 CPUs and 30 GB RAM, increase if needed).</li>
        <li>This notebook runs quickly, but running it in the background via <kbd>dx run</kbd> is still recommended to capture provenance.</li>
    </ul>
</div>

```
CHROM=21
dx run dxjupyterlab \
    --instance-type=mem2_ssd1_v2_x8 \
    -icmd="papermill 1_ukb_plink_bed_filter.ipynb 1_ukb_plink_bed_filter_chr${CHROM}_$(date +%Y%m%d).ipynb -p CHROM ${CHROM}" \
    -iin=1_ukb_plink_bed_filter.ipynb \
    --folder=outputs/plink-make-bed/$(date +%Y%m%d)/
```

In [1]:
import os
import pandas as pd

## Setup plink2

Download and unpack the PLINK2 binary (see https://www.cog-genomics.org/plink/2.0/).

In [2]:
%%bash

##### plink 2 install
PLINK_VERSION=2.3.Alpha
PLINK_ZIP_PATH=/tmp/plink-$PLINK_VERSION.zip
curl -L -o $PLINK_ZIP_PATH https://s3.amazonaws.com/plink2-assets/alpha2/plink2_linux_x86_64.zip
mkdir -p /tmp/plink2/
unzip -o $PLINK_ZIP_PATH -d /tmp/plink2/

Archive:  /tmp/plink-2.3.Alpha.zip
  inflating: /tmp/plink2/plink2      


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8671k  100 8671k    0     0  8273k      0  0:00:01  0:00:01 --:--:-- 8273k


In [3]:
!/tmp/plink2/plink2 --version # --help

PLINK v2.00a2.3LM 64-bit Intel (24 Jan 2020)


## Define constants

The input is the UK Biobank whole-exome sequencing (WES) data in PLINK format.

In [4]:
CHROM = '22'
# Raw f-strings (rf'...') keep the shell escapes (\ and \,) literal and avoid
# Python's invalid-escape-sequence warning.
LOCAL_BED = rf'/mnt/project/Bulk/Exome\ sequences/Population\ level\ exome\ OQFE\ variants\,\ PLINK\ format/ukb23155_c{CHROM}_b0_v1'
LOCAL_FAM = rf'/mnt/project/Bulk/Exome\ sequences/Population\ level\ exome\ OQFE\ variants\,\ PLINK\ format/ukb23155_c{CHROM}_b0_v1.fam'
FIXED_FAM = 'ukb23155_FIXED_b0_v1.fam'

# Outputs
PLINK_OUTPUT_FILENAME_PREFIX = f'ukb_200kwes_chr{CHROM}'
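
The input paths contain spaces and a comma, so they must reach the shell as a single argument. As an alternative to hand-written backslash escapes, the following is a minimal sketch using `shlex.quote` from the Python standard library. The names `base_dir`, `local_bed`, and `quoted_bed` are hypothetical and are not used elsewhere in this notebook.

```python
import shlex

CHROM = "22"

# Same directory as in the constants above, written without backslash escapes.
base_dir = "/mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format"
local_bed = f"{base_dir}/ukb23155_c{CHROM}_b0_v1"

# shlex.quote wraps the path in single quotes so the shell sees one argument.
quoted_bed = shlex.quote(local_bed)
print(quoted_bed)
```

A value quoted this way could be interpolated into a `!` shell command in the same way as the escaped constants.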

## Get capture region intervals

In [6]:
!wget -nd biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/xgen_plus_spikein.GRCh38.bed

--2021-10-28 16:23:43--  http://biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/xgen_plus_spikein.GRCh38.bed
Resolving biobank.ndph.ox.ac.uk (biobank.ndph.ox.ac.uk)... 163.1.206.99
Connecting to biobank.ndph.ox.ac.uk (biobank.ndph.ox.ac.uk)|163.1.206.99|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/xgen_plus_spikein.GRCh38.bed [following]
--2021-10-28 16:23:43--  https://biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/xgen_plus_spikein.GRCh38.bed
Connecting to biobank.ndph.ox.ac.uk (biobank.ndph.ox.ac.uk)|163.1.206.99|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4892340 (4.7M) [application/vnd.realvnc.bed]
Saving to: ‘xgen_plus_spikein.GRCh38.bed’


2021-10-28 16:23:43 (121 MB/s) - ‘xgen_plus_spikein.GRCh38.bed’ saved [4892340/4892340]
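
The capture file is later passed to PLINK2 with `--extract bed0`, which treats the intervals as 0-based, half-open BED coordinates. The sketch below illustrates that membership test for a 1-based variant position. The intervals are made up for illustration, and the three-column layout and `chr`-prefixed chromosome names are assumptions rather than facts about the downloaded file.

```python
import io

import pandas as pd

# Made-up intervals for illustration; not taken from the real capture file.
bed_text = "chr22\t100\t200\nchr22\t500\t650\nchr21\t100\t200\n"
intervals = pd.read_csv(
    io.StringIO(bed_text),
    sep="\t",
    header=None,
    names=["chrom", "start", "end"],
)


def in_capture_region(chrom: str, pos: int) -> bool:
    """Return True if a 1-based position falls inside any 0-based, half-open interval.

    A 0-based interval [start, end) covers the 1-based positions start+1 .. end.
    """
    hits = intervals[
        (intervals["chrom"] == chrom)
        & (intervals["start"] < pos)
        & (pos <= intervals["end"])
    ]
    return not hits.empty


print(in_capture_region("chr22", 100))  # just before the first interval
print(in_capture_region("chr22", 101))  # first base covered by the first interval
```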



## Copy fam locally for editing

In [7]:
!sed -r -e 's|redacted|-9|' {LOCAL_FAM} > {FIXED_FAM}
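
The `sed` command above replaces the first `redacted` token on each line of the `.fam` file with `-9`, the conventional PLINK code for a missing phenotype. A minimal Python sketch of the same substitution, using a hypothetical `.fam` row and a hypothetical helper name `fix_fam_line`:

```python
import re


def fix_fam_line(line: str) -> str:
    """Replace the first 'redacted' on a line with -9.

    Mirrors `sed 's|redacted|-9|'`, which has no `g` flag and therefore
    performs at most one replacement per line.
    """
    return re.sub(r"redacted", "-9", line, count=1)


# Hypothetical .fam row: FID IID father mother sex phenotype
example = "1000001 1000001 0 0 2 redacted"
print(fix_fam_line(example))
```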

# Write out the filtered BED file

In [12]:
!/tmp/plink2/plink2 \
  --bfile {LOCAL_BED} \
  --psam {FIXED_FAM} \
  --chr 1-22 \
  --extract bed0 xgen_plus_spikein.GRCh38.bed \
  --mac 6 \
  --make-bed \
  --out {PLINK_OUTPUT_FILENAME_PREFIX}_plink_makebed

PLINK v2.00a2.3LM 64-bit Intel (24 Jan 2020)   www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ukb_200kwes_22_plink_makebed.log.
Options in effect:
  --bfile /mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format/ukb23155_c22_b0_v1
  --chr 1-22
  --extract bed0 xgen_plus_spikein.GRCh38.bed
  --mac 6
  --make-bed
  --out ukb_200kwes_22_plink_makebed
  --psam ukb23155_FIXED_b0_v1.fam

Start time: Thu Oct 28 16:32:27 2021
31091 MiB RAM detected; reserving 15545 MiB for main workspace.
Using up to 16 threads (change this with --threads).
200643 samples (110425 females, 90012 males, 206 ambiguous; 200643 founders)
loaded from ukb23155_FIXED_b0_v1.fam.
414980 variants loaded from /mnt/project/Bulk/Exome sequences/Population level
exome OQFE variants, PLINK format/ukb23155_c22_b0_v1.bim.
Note: No phenotype data present.
--extract bed0: 211721 variants excluded.
Calculating allele frequencie
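
The `--mac 6` option in the command above removes variants whose minor allele count is below 6. The sketch below shows that calculation for a single biallelic variant, starting from genotype counts. It is an illustration of the idea under that biallelic assumption, not PLINK2's implementation, and the function names are hypothetical.

```python
def minor_allele_count(n_hom_ref: int, n_het: int, n_hom_alt: int) -> int:
    """Minor allele count for a biallelic variant, ignoring missing genotypes."""
    alt_alleles = n_het + 2 * n_hom_alt
    ref_alleles = n_het + 2 * n_hom_ref
    return min(alt_alleles, ref_alleles)


def passes_mac_filter(n_hom_ref: int, n_het: int, n_hom_alt: int, threshold: int = 6) -> bool:
    """Keep the variant only if its minor allele count reaches the threshold."""
    return minor_allele_count(n_hom_ref, n_het, n_hom_alt) >= threshold


# Five heterozygous carriers give a minor allele count of 5, so the variant is dropped.
print(passes_mac_filter(n_hom_ref=100, n_het=5, n_hom_alt=0))
# Four heterozygotes plus one alt homozygote give a count of 6, so the variant is kept.
print(passes_mac_filter(n_hom_ref=100, n_het=4, n_hom_alt=1))
```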

In [9]:
%%bash

ls -lat | head

total 3043604
-rw-r--r-- 1 root root 3099598676 Oct 28 16:26 ukb_200kwes_22_plink_makebed.bed
-rw-r--r-- 1 root root       1272 Oct 28 16:26 ukb_200kwes_22_plink_makebed.log
drwxr-xr-x 1 root root        420 Oct 28 16:24 .
-rw-r--r-- 1 root root    2111836 Oct 28 16:24 ukb_200kwes_22_plink_makebed.bim
-rw-r--r-- 1 root root    5015801 Oct 28 16:24 ukb_200kwes_22_plink_makebed.fam
-rw-r--r-- 1 root root    5015801 Oct 28 16:23 ukb23155_FIXED_b0_v1.fam
-rw-r--r-- 1 root root        112 Oct 25 17:17 untitled.txt
drwxr-xr-x 1 root root         92 Oct 25 17:16 .ipynb_checkpoints
drwxr-xr-x 1 root root        112 Oct 21 16:21 ..


# Provenance 

In [10]:
%%bash

date

Thu Oct 28 16:26:39 UTC 2021


In [11]:
%%bash

pip3 freeze

ansiwrap==0.8.4
anyio==3.3.4
argcomplete==1.12.3
argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1605217006479/work
asn1crypto @ file:///tmp/build/80754af9/asn1crypto_1596577642040/work
async-generator==1.10
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1620387926260/work
backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work
backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1618230623929/work
bash_kernel==0.7.2
black==21.8b0
bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1629908509068/work
brotlipy==0.7.0
certifi==2021.5.30
cffi==1.14.0
chardet==3.0.4
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click==8.0.1
colorama==0.4.4
conda==4.10.3
conda-package-handling @ file:///tmp/build/80754af9/conda-package-handling_1618262155238/work
contextvars==2.4
cryptography==2.3.1
cycler==0.10.0
datac