# Merge BED files

In this notebook, we merge the filtered BED files using [PLINK](https://www.cog-genomics.org/plink/1.9/).

Note that this work is part of a larger project to [Demonstrate the Potential for Pooled Analysis of All of Us and UK Biobank Genomic Data](https://docs.google.com/document/d/19ZS0z_-7FEM37pNDAXaWaqBSLnqyd9MZEkiOmtF3n_0/edit#). Specifically, this notebook belongs to the **siloed** analysis portion of that project.

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the UK Biobank Research Analysis Platform.
    <ul>
        <li>Use compute type 'Single Node' with sufficient CPU and RAM (e.g. start with 8 CPUs and 30 GB RAM, increase if needed).</li>
        <li>This notebook runs quickly, but we still recommend running it in the background via <kbd>dx run dxjupyterlab</kbd> to capture provenance.</li>
    </ul>
</div>

```
dx run dxjupyterlab \
    --instance-type=mem2_ssd1_v2_x8 \
    -icmd="papermill 04_ukb_plink_merge_bed_files.ipynb 04_ukb_plink_merge_bed_files_$(date +%Y%m%d).ipynb" \
    -iin=04_ukb_plink_merge_bed_files.ipynb \
    --folder=outputs/plink-merge-bed/$(date +%Y%m%d)/
```
See also https://platform.dnanexus.com/app/dxjupyterlab

In [None]:
import os
import pandas as pd

## Set up PLINK

https://www.cog-genomics.org/plink/1.9/

In [None]:
%%bash

##### plink 1 install
PLINK_VERSION=20210606
PLINK_ZIP_PATH=/tmp/plink-$PLINK_VERSION.zip
curl -L -o $PLINK_ZIP_PATH https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_$PLINK_VERSION.zip
mkdir -p /tmp/plink/
unzip -o $PLINK_ZIP_PATH -d /tmp/plink/

In [None]:
!/tmp/plink/plink --version # --help

## Define constants

This notebook takes as input the UK Biobank WES data, as filtered by notebook `03_ukb_plink_bed_filter.ipynb`.

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

#---[ Inputs ]---
# These files were created via ukb_rap_siloed_analyses/03_ukb_plink_bed_filter.ipynb
BED_PATTERN = '/mnt/project/outputs/plink-make-bed/20220217/ukb_200kwes_chr*_plink_makebed'

#---[ Outputs ]---
BED_MERGE_LIST = 'bed_merge_list.txt'
PLINK_OUTPUT_FILENAME_PREFIX = 'ukb_200kwes_filtered'
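
Because these constants are papermill parameters, they can in principle be overridden at run time instead of edited in place, assuming the constants cell carries the `parameters` tag. The sketch below only assembles a command line using papermill's `-p NAME VALUE` option. The helper name `build_papermill_command`, the output notebook name, and the `YYYYMMDD` folder are hypothetical placeholders, not values from this project.

```python
import shlex


def build_papermill_command(input_nb, output_nb, parameters):
    """Return a papermill CLI invocation that overrides the given parameters."""
    args = ['papermill', input_nb, output_nb]
    for name, value in parameters.items():
        args += ['-p', name, str(value)]
    # Quote each argument so the '*' in BED_PATTERN is not expanded by the shell.
    return ' '.join(shlex.quote(arg) for arg in args)


# Hypothetical override: point BED_PATTERN at a different (placeholder) dated folder.
command = build_papermill_command(
    '04_ukb_plink_merge_bed_files.ipynb',
    '04_ukb_plink_merge_bed_files_rerun.ipynb',
    {'BED_PATTERN': '/mnt/project/outputs/plink-make-bed/YYYYMMDD/ukb_200kwes_chr*_plink_makebed'},
)
print(command)
```

The resulting string could be passed as the `-icmd` value in the `dx run dxjupyterlab` call shown earlier.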

## Write out the merge list

In [None]:
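# The chromosome 1 fileset is passed to PLINK directly; the rest go in the merge list.
chrom = 1
INITIAL_BED = BED_PATTERN.replace('*', str(chrom))
print(INITIAL_BED)

# One line per remaining autosome (chr2 through chr22): its .bed, .bim, and .fam paths.
with open(BED_MERGE_LIST, mode='w') as the_file:
    for chrom in range(2, 23):
        path = BED_PATTERN.replace('*', str(chrom))
        the_file.write(f'{path}.bed {path}.bim {path}.fam\n')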

In [None]:
!cat {BED_MERGE_LIST}
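
Before starting the merge, it can help to confirm that every fileset named above is actually present, since a missing `.bed`, `.bim`, or `.fam` file would otherwise only surface once PLINK is running. The following is a minimal sketch of such a check. The helper name `missing_fileset_members` is hypothetical, and the pattern is repeated here only so that the block is self-contained.

```python
import os


def missing_fileset_members(prefixes):
    """Return the .bed/.bim/.fam paths that do not exist for the given prefixes."""
    missing = []
    for prefix in prefixes:
        for extension in ('.bed', '.bim', '.fam'):
            path = prefix + extension
            if not os.path.exists(path):
                missing.append(path)
    return missing


# Same value as BED_PATTERN above, redefined so the sketch runs on its own.
pattern = '/mnt/project/outputs/plink-make-bed/20220217/ukb_200kwes_chr*_plink_makebed'
prefixes = [pattern.replace('*', str(chrom)) for chrom in range(1, 23)]

missing = missing_fileset_members(prefixes)
print(f'{len(missing)} of {3 * len(prefixes)} expected files are missing')
for path in missing:
    print(path)
```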

# Write out the merged BED file

In [None]:
!/tmp/plink/plink \
  --bed {INITIAL_BED}.bed \
  --bim {INITIAL_BED}.bim \
  --fam {INITIAL_BED}.fam \
  --merge-list {BED_MERGE_LIST} \
  --make-bed \
  --out {PLINK_OUTPUT_FILENAME_PREFIX}_plink_mergebed

In [None]:
%%bash

ls -lat | head
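
Beyond listing the output files, a quick sanity check is to count the variants and samples in the merged fileset. A `.bim` file has one line per variant and a `.fam` file has one line per sample, so plain line counts are enough. This is a sketch rather than part of the original workflow. The helper names are hypothetical, and the merged prefix is written out literally to match the `--out` value used above.

```python
import os


def count_lines(path):
    """Count the lines in a text file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def summarize_fileset(prefix):
    """Return (variant_count, sample_count) for a PLINK binary fileset prefix."""
    return count_lines(prefix + '.bim'), count_lines(prefix + '.fam')


# Matches {PLINK_OUTPUT_FILENAME_PREFIX}_plink_mergebed from the merge cell.
merged_prefix = 'ukb_200kwes_filtered_plink_mergebed'

if os.path.exists(merged_prefix + '.bim') and os.path.exists(merged_prefix + '.fam'):
    variants, samples = summarize_fileset(merged_prefix)
    print(f'{variants} variants and {samples} samples in {merged_prefix}')
else:
    print(f'{merged_prefix} not found; run the merge cell first')
```

If every per-chromosome input holds the same samples, the merged sample count should match each input `.fam`, and the merged variant count should not exceed the sum of the input `.bim` line counts.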

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze