# Merge BED files

In this notebook, we merge per-chromosome BED files into a single fileset using [PLINK](https://www.cog-genomics.org/plink/1.9/).

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the UK Biobank Research Analysis Platform.
    <ul>
        <li>Use compute type 'Standard VM' with sufficient CPU and RAM (e.g. start with 8 CPUs and 30 GB RAM, increase if needed).</li>
        <li>This notebook runs quickly, but it is still recommended to run it in the background via <kbd>dx run</kbd> to capture provenance.</li>
    </ul>
</div>

```
dx run dxjupyterlab \
    --instance-type=mem2_ssd1_v2_x8 \
    -icmd="papermill 2_ukb_plink_merge_bed_files.ipynb 2_ukb_plink_merge_bed_files_$(date +%Y%m%d).ipynb" \
    -iin=2_ukb_plink_merge_bed_files.ipynb \
    --folder=outputs/plink-merge-bed/$(date +%Y%m%d)/
```

In [22]:
import os
import pandas as pd

## Setup plink

https://www.cog-genomics.org/plink/1.9/

In [31]:
%%bash

##### plink 1 install
PLINK_VERSION=20210606
PLINK_ZIP_PATH=/tmp/plink-$PLINK_VERSION.zip
curl -L -o $PLINK_ZIP_PATH https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_$PLINK_VERSION.zip
mkdir -p /tmp/plink/
unzip -o $PLINK_ZIP_PATH -d /tmp/plink/

Archive:  /tmp/plink-20210606.zip
  inflating: /tmp/plink/plink        
  inflating: /tmp/plink/LICENSE      
  inflating: /tmp/plink/toy.ped      
  inflating: /tmp/plink/toy.map      
  inflating: /tmp/plink/prettify     


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8708k  100 8708k    0     0  8153k      0  0:00:01  0:00:01 --:--:-- 8146k


In [35]:
!/tmp/plink/plink --version # --help

PLINK v1.90b6.24 64-bit (6 Jun 2021)


## Define constants

This notebook takes as input the UK Biobank WES data, as further filtered by notebook `1_ukb_plink_bed_filter.ipynb`.

In [52]:
BED_PATTERN = '/mnt/project/outputs/plink-make-bed/20211028/ukb_200kwes_chr*_plink_makebed'

# Outputs
BED_MERGE_LIST = 'bed_merge_list.txt'
PLINK_OUTPUT_FILENAME_PREFIX = 'ukb_200kwes_filtered'
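
Before building the merge list, it can help to confirm which per-chromosome filesets actually match `BED_PATTERN`. The following is a minimal sketch, not part of the original notebook: the helper names `find_filesets` and `missing_autosomes` are hypothetical, and it assumes the pattern contains a single `*` standing for the chromosome label.

```python
import glob
import re


def find_filesets(bed_pattern):
    """Map each chromosome label matched by '*' to its fileset prefix.

    Hypothetical helper: assumes bed_pattern contains exactly one '*'.
    """
    # Build a regex from the glob pattern that captures whatever '*' matched.
    regex = re.compile(
        re.escape(bed_pattern).replace(r'\*', '([^/]+)') + r'\.bed$'
    )
    found = {}
    for bed_path in glob.glob(bed_pattern + '.bed'):
        match = regex.match(bed_path)
        if match:
            # Strip the '.bed' suffix to recover the PLINK prefix.
            found[match.group(1)] = bed_path[:-len('.bed')]
    return found


def missing_autosomes(found):
    """Return the autosomes (1-22) that have no matching .bed file."""
    return [chrom for chrom in range(1, 23) if str(chrom) not in found]
```

Calling `missing_autosomes(find_filesets(BED_PATTERN))` should return an empty list if every autosome has a `.bed` file in the input folder. This sketch checks only the `.bed` files, not the matching `.bim` and `.fam` files.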

## Write out the merge list

In [49]:
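# Chromosome 1 is the base fileset passed to PLINK via --bed/--bim/--fam.
chrom = 1
INITIAL_BED = BED_PATTERN.replace('*', str(chrom))

# Chromosomes 2-22 go into the merge list, one "bed bim fam" triple per line.
with open(BED_MERGE_LIST, mode='w') as the_file:
    for chrom in range(2, 23):
        path = BED_PATTERN.replace('*', str(chrom))
        the_file.write(f'{path}.bed {path}.bim {path}.fam\n')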

In [50]:
!cat {BED_MERGE_LIST}

/mnt/project/outputs/plink-make-bed/20211028/ukb_200kwes_chr22_plink_makebed.bed /mnt/project/outputs/plink-make-bed/20211028/ukb_200kwes_chr22_plink_makebed.bim /mnt/project/outputs/plink-make-bed/20211028/ukb_200kwes_chr22_plink_makebed.fam 
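
A merge list that points at a missing file only fails once PLINK starts reading it, so a quick check beforehand can save a run. The sketch below is an illustration rather than part of the original workflow: `check_merge_list` is a hypothetical helper, and it assumes every non-empty line holds exactly three whitespace-separated paths (`.bed`, `.bim`, `.fam`), which is the layout this notebook writes.

```python
import os


def check_merge_list(merge_list_path):
    """Return (number of filesets listed, list of paths that do not exist).

    Hypothetical helper: assumes each non-empty line has three paths.
    """
    n_filesets = 0
    missing = []
    with open(merge_list_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            paths = line.split()
            if not paths:
                continue  # ignore blank lines
            if len(paths) != 3:
                raise ValueError(
                    f'line {line_number}: expected 3 paths, got {len(paths)}'
                )
            n_filesets += 1
            # Record every listed file that is not present on disk.
            missing.extend(p for p in paths if not os.path.exists(p))
    return n_filesets, missing
```

For the loop over chromosomes 2 to 22 shown earlier, `check_merge_list(BED_MERGE_LIST)` would be expected to report 21 filesets and an empty list of missing paths.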


# Write out the merged BED file

In [51]:
!/tmp/plink/plink \
  --bed {INITIAL_BED}.bed \
  --bim {INITIAL_BED}.bim \
  --fam {INITIAL_BED}.fam \
  --merge-list {BED_MERGE_LIST} \
  --make-bed \
  --out {PLINK_OUTPUT_FILENAME_PREFIX}_plink_mergebed

PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ukb_200kwes_filtered_plink_mergebed.log.
Options in effect:
  --bed /mnt/project/outputs/plink-make-bed/20211028/ukb_200kwes_chr21_plink_makebed.bed
  --bim /mnt/project/outputs/plink-make-bed/20211028/ukb_200kwes_chr21_plink_makebed.bim
  --fam /mnt/project/outputs/plink-make-bed/20211028/ukb_200kwes_chr21_plink_makebed.fam
  --make-bed
  --merge-list bed_merge_list.txt
  --out ukb_200kwes_filtered_plink_mergebed

31091 MB RAM detected; reserving 15545 MB for main workspace.
position.
position.
position.
Performing single-pass merge (200643 people, 54245 variants).
Merged fileset written to ukb_200kwes_filtered_plink_mergebed-merge.bed +
ukb_200kwes_filtered_plink_mergebed-merge.bim +
ukb_200kwes_filtered_plink_mergebed-merge.fam .
54245 variants loaded from .bim file.
200643 people (90012 males, 110425 females, 206 am
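
The log above reports 200643 people and 54245 variants. One simple way to double-check a written fileset against those numbers is to count lines: a `.fam` file has one line per sample and a `.bim` file has one line per variant. The sketch below assumes plain, uncompressed text files; `count_lines` and `fileset_dimensions` are hypothetical helper names, not part of PLINK or of this notebook.

```python
def count_lines(path):
    """Count the lines in a text file without loading it into memory."""
    with open(path, 'rb') as handle:
        return sum(1 for _ in handle)


def fileset_dimensions(prefix):
    """Return (n_samples, n_variants) for the PLINK fileset at `prefix`.

    Reads <prefix>.fam (one line per sample) and <prefix>.bim
    (one line per variant); the binary .bed file is not inspected.
    """
    return count_lines(prefix + '.fam'), count_lines(prefix + '.bim')
```

For example, `fileset_dimensions(f'{PLINK_OUTPUT_FILENAME_PREFIX}_plink_mergebed')` can be compared with the people and variant counts that PLINK prints in its log.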

In [9]:
%%bash

ls -lat | head

total 3043604
-rw-r--r-- 1 root root 3099598676 Oct 28 16:26 ukb_200kwes_22_plink_makebed.bed
-rw-r--r-- 1 root root       1272 Oct 28 16:26 ukb_200kwes_22_plink_makebed.log
drwxr-xr-x 1 root root        420 Oct 28 16:24 .
-rw-r--r-- 1 root root    2111836 Oct 28 16:24 ukb_200kwes_22_plink_makebed.bim
-rw-r--r-- 1 root root    5015801 Oct 28 16:24 ukb_200kwes_22_plink_makebed.fam
-rw-r--r-- 1 root root    5015801 Oct 28 16:23 ukb23155_FIXED_b0_v1.fam
-rw-r--r-- 1 root root        112 Oct 25 17:17 untitled.txt
drwxr-xr-x 1 root root         92 Oct 25 17:16 .ipynb_checkpoints
drwxr-xr-x 1 root root        112 Oct 21 16:21 ..


# Provenance 

In [10]:
%%bash

date

Thu Oct 28 16:26:39 UTC 2021


In [11]:
%%bash

pip3 freeze

ansiwrap==0.8.4
anyio==3.3.4
argcomplete==1.12.3
argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1605217006479/work
asn1crypto @ file:///tmp/build/80754af9/asn1crypto_1596577642040/work
async-generator==1.10
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1620387926260/work
backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work
backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1618230623929/work
bash_kernel==0.7.2
black==21.8b0
bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1629908509068/work
brotlipy==0.7.0
certifi==2021.5.30
cffi==1.14.0
chardet==3.0.4
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click==8.0.1
colorama==0.4.4
conda==4.10.3
conda-package-handling @ file:///tmp/build/80754af9/conda-package-handling_1618262155238/work
contextvars==2.4
cryptography==2.3.1
cycler==0.10.0
datac