# Compute LD and PCA via PLINK

In this notebook, we perform linkage disequilibrium (LD) pruning and principal component analysis (PCA) using [PLINK2](https://www.cog-genomics.org/plink/2.0/).

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the All of Us Workbench.
    <ul>
        <li>Use compute type 'Standard VM' with sufficient CPU and RAM (e.g., start with 8 CPUs and 30 GB of RAM, and increase if needed).</li>
        <li>This notebook can take a while to run. We recommend running it in the background via <kbd>run_notebook_in_the_background</kbd>.</li>
    </ul>
</div>
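
Before starting, it can help to confirm that the VM actually has the resources suggested above. The following is a minimal sketch, assuming a Linux VM (the `os.sysconf` keys used here are not available on every platform). `machine_resources` is a hypothetical helper, not part of the Workbench tooling.

```python
import os


def machine_resources():
    """Return (logical CPU count, total physical RAM in GiB) for this machine."""
    cpu_count = os.cpu_count() or 1
    # Total RAM = page size * number of physical pages (Linux-specific sysconf keys).
    total_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
    return cpu_count, total_bytes / 1024 ** 3


cpus, ram_gib = machine_resources()
print(f'{cpus} CPUs, {ram_gib:.1f} GiB RAM')
```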

In [None]:
from datetime import datetime
import os
import pandas as pd
import time

## Set up plink2

Download and unpack the PLINK 2.0 Linux binary (see https://www.cog-genomics.org/plink/2.0/).

In [None]:
%%bash

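# Install PLINK 2: download the Linux x86_64 build and unpack it into /tmp/plink2/.
# PLINK_VERSION is only used to name the local zip file; it does not select which build is downloaded.
PLINK_VERSION=2.3.Alpha
PLINK_ZIP_PATH=/tmp/plink-$PLINK_VERSION.zip
curl -L -o "$PLINK_ZIP_PATH" https://s3.amazonaws.com/plink2-assets/alpha2/plink2_linux_x86_64.zip
mkdir -p /tmp/plink2/
unzip -o "$PLINK_ZIP_PATH" -d /tmp/plink2/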

In [None]:
!/tmp/plink2/plink2 --version # --help

## Define constants

The input is the merged BGEN file created via `write_bgen.ipynb`, along with its `.sample` file.

In [None]:
REMOTE_MERGED_BGEN = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210906/ukb-aou-alpha2-chr1-chr22.bgen'
REMOTE_MERGED_BGEN_SAMPLE = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210906/ukb-aou-alpha2-chr1-chr22.sample'

LOCAL_MERGED_BGEN = os.path.basename(REMOTE_MERGED_BGEN)
LOCAL_MERGED_BGEN_SAMPLE = os.path.basename(REMOTE_MERGED_BGEN_SAMPLE)

In [None]:
RESULT_BUCKET = os.getenv("WORKSPACE_BUCKET")
DATESTAMP = time.strftime('%Y%m%d')

# Outputs: local filename prefix and the bucket folder the results are copied to.
PLINK_OUTPUT_FILENAME_PREFIX = 'aou_alpha3_ukb_lipids'
PLINK_OUTPUTS = f'{RESULT_BUCKET}/data/merged/plink/{DATESTAMP}/'

## Copy data locally for testing

In [None]:
!gsutil cp -n {REMOTE_MERGED_BGEN} {REMOTE_MERGED_BGEN_SAMPLE} .    

# Compute Linkage Disequilibrium via PLINK

In [None]:
!/tmp/plink2/plink2 \
  --bgen {LOCAL_MERGED_BGEN} ref-first \
  --chr 1-22 \
  --indep-pairwise 200 50 0.25 \
  --out {PLINK_OUTPUT_FILENAME_PREFIX}_plink_ld
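
The three positional values passed to `--indep-pairwise` above are, in order, the window size (in variants), the step size (in variants), and the r² threshold above which one variant of a correlated pair is pruned. As a sketch, the same command can be assembled from named values so that each one is easier to adjust. `build_ld_prune_command` and its parameter names are hypothetical; the flags are those already used in the cell above.

```python
def build_ld_prune_command(bgen_path, out_prefix, window_variants=200,
                           step_variants=50, r2_threshold=0.25,
                           plink_path='/tmp/plink2/plink2'):
    """Assemble the LD-pruning command as an argument list (nothing is executed)."""
    return [
        plink_path,
        '--bgen', bgen_path, 'ref-first',
        '--chr', '1-22',
        '--indep-pairwise', str(window_variants), str(step_variants), str(r2_threshold),
        '--out', out_prefix,
    ]


command = build_ld_prune_command('ukb-aou-alpha2-chr1-chr22.bgen',
                                 'aou_alpha3_ukb_lipids_plink_ld')
print(' '.join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` in place of the `!` shell call.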

In [None]:
%%bash

ls -lat | head

In [None]:
%%bash

wc -l *prune*
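
The `wc -l` call above reports how many variants were kept (`.prune.in`) and removed (`.prune.out`). The same summary can be produced in Python. This is a sketch: `summarize_pruning` is a hypothetical helper that assumes both files list one variant ID per line.

```python
def count_lines(path):
    """Count non-empty lines in a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def summarize_pruning(prefix):
    """Return (kept, removed, fraction kept) for <prefix>.prune.in and <prefix>.prune.out."""
    kept = count_lines(f'{prefix}.prune.in')
    removed = count_lines(f'{prefix}.prune.out')
    total = kept + removed
    fraction_kept = kept / total if total else 0.0
    return kept, removed, fraction_kept
```

In this notebook it would be called as `summarize_pruning(f'{PLINK_OUTPUT_FILENAME_PREFIX}_plink_ld')`.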

In [None]:
!gsutil -m cp {PLINK_OUTPUT_FILENAME_PREFIX}* {PLINK_OUTPUTS}

In [None]:
!gsutil ls {PLINK_OUTPUTS}

# Compute Principal Components Analysis via PLINK

<div class="alert alert-block alert-warning">
    <b>Note</b>: the <kbd>--memory</kbd> parameter below (specified in MiB) assumes the machine has 30 GB of RAM. Adjust this value if the machine has a different amount of RAM.
</div>

In [None]:
!/tmp/plink2/plink2 \
  --bgen {LOCAL_MERGED_BGEN} ref-first \
  --chr 1-22 \
  --extract {PLINK_OUTPUT_FILENAME_PREFIX}_plink_ld.prune.in \
  --pca 15 approx \
  --memory 27500 \
  --out {PLINK_OUTPUT_FILENAME_PREFIX}_plink_pca
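
Rather than hard-coding the `--memory` value, it can be derived from the machine size by reserving a fixed share of total RAM for PLINK. The following is a sketch assuming a 90% share, which is an illustrative choice rather than a PLINK recommendation. `plink_memory_mib` is a hypothetical helper, and its result is in MiB, the unit `--memory` expects.

```python
def plink_memory_mib(total_ram_gib, percent_for_plink=90):
    """Return a --memory value in MiB, leaving headroom for the OS and the notebook kernel."""
    total_mib = int(total_ram_gib * 1024)
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_mib * percent_for_plink // 100


print(plink_memory_mib(30))  # 27648
```

For a 30 GB machine this gives 27648, close to the 27500 used in the command above.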

In [None]:
%%bash

ls -lat | head

In [None]:
!gsutil -m cp {PLINK_OUTPUT_FILENAME_PREFIX}* {PLINK_OUTPUTS}

In [None]:
!gsutil ls {PLINK_OUTPUTS}
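
PLINK 2 writes the principal components to `<prefix>.eigenvec` (per-sample scores) and `<prefix>.eigenval`. The following is a sketch for reading the scores back. It assumes the usual layout: a whitespace-delimited header line beginning with `#` that contains an `IID` column and `PC1`, `PC2`, ... columns, followed by one line per sample. `read_eigenvec` is a hypothetical helper.

```python
def read_eigenvec(path):
    """Parse a PLINK 2 .eigenvec file into (pc_names, {sample IID: [PC scores]})."""
    with open(path) as handle:
        # The header starts with '#', e.g. '#FID IID PC1 PC2 ...' or '#IID PC1 PC2 ...'.
        header = handle.readline().lstrip('#').split()
        iid_column = header.index('IID')
        pc_columns = [i for i, name in enumerate(header) if name.startswith('PC')]
        pc_names = [header[i] for i in pc_columns]
        scores = {}
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            scores[fields[iid_column]] = [float(fields[i]) for i in pc_columns]
    return pc_names, scores
```

In this notebook it would be called as `read_eigenvec(f'{PLINK_OUTPUT_FILENAME_PREFIX}_plink_pca.eigenvec')`.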

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze