# Compute LD and PCA via PLINK

In this notebook, we perform linkage disequilibrium (LD) pruning and principal component analysis (PCA) using [PLINK2](https://www.cog-genomics.org/plink/2.0/).

Note that this work is part of a larger project to [Demonstrate the Potential for Pooled Analysis of All of Us and UK Biobank Genomic Data](https://github.com/all-of-us/ukb-cross-analysis-demo-project). Specifically, this notebook belongs to the **siloed** analysis portion of the project.

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the <i>All of Us</i> Workbench.
    <ul>
        <li>Use "Recommended Environment" <kbd><b>General Analysis</b></kbd> which creates compute type <kbd><b>Standard VM</b></kbd> with sufficient CPU and RAM (e.g. start with <b>8 CPUs</b> and <b>30 GB RAM</b>, increase if needed).</li>
        <li>This notebook can take a while to run <b>TBD DETAILS CHR21 VS. ALL CHRS</b>. We recommend running it in the background via <kbd>run_notebook_in_the_background</kbd>.</li>
    </ul>
</div>

In [None]:
from datetime import datetime
import os
import time

## Set up plink2

https://www.cog-genomics.org/plink/2.0/

In [None]:
%%bash

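##### plink 2 install
# The download URL is not versioned, so PLINK_VERSION only labels the local zip file.
# Confirm the installed build with `plink2 --version`.
PLINK_VERSION=2.3.Alpha
PLINK_ZIP_PATH="/tmp/plink-${PLINK_VERSION}.zip"
curl -L -o "${PLINK_ZIP_PATH}" https://s3.amazonaws.com/plink2-assets/alpha2/plink2_linux_x86_64.zip
mkdir -p /tmp/plink2/
unzip -o "${PLINK_ZIP_PATH}" -d /tmp/plink2/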

In [None]:
!/tmp/plink2/plink2 --version # --help

## Define constants

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

#---[ Inputs ]---
# The BGEN file was created via aou_workbench_siloed_analyses/02_aou_write_filtered_bgen.ipynb.
REMOTE_BGEN = 'gs://fc-secure-098ff3db-05c2-4426-8914-a26608668529/data/aou/geno/20220304/aou-alpha3-chr1-chr22.bgen'
# The sample file was created via aou_workbench_siloed_analyses/02_aou_write_filtered_bgen.ipynb.
REMOTE_BGEN_SAMPLE = 'gs://fc-secure-098ff3db-05c2-4426-8914-a26608668529/data/aou/geno/20220304/aou-alpha3-chr1-chr22.sample'
# These two files were created via notebook aou_workbench_siloed_analyses/03_aou_variant_qc.ipynb
# NOTE: use variant QC files created for regenie step 2, not step 1, here.
REMOTE_VARIANT_QC_ID = 'gs://fc-secure-471c1068-cd3d-4b43-9b5d-a618c85ceea5/data/aou/variant-qc/20220208/aou_alpha3_lipids_step2QC_plink.id'
REMOTE_VARIANT_QC_SNPLIST = 'gs://fc-secure-471c1068-cd3d-4b43-9b5d-a618c85ceea5/data/aou/variant-qc/20220208/aou_alpha3_lipids_step2QC_plink.snplist'

#---[ Outputs ]---
# Create a timestamp for a folder of results generated today.
DATESTAMP = time.strftime('%Y%m%d')

OUTPUT_FILENAME_PREFIX = 'aou_alpha3_lipids'
OUTPUT_FOLDER = f'{os.getenv("WORKSPACE_BUCKET")}/data/aou/ld-pca/{DATESTAMP}/'

In [None]:
LOCAL_BGEN = os.path.basename(REMOTE_BGEN)
LOCAL_BGEN_SAMPLE = os.path.basename(REMOTE_BGEN_SAMPLE)
LOCAL_VARIANT_QC_ID = os.path.basename(REMOTE_VARIANT_QC_ID)
LOCAL_VARIANT_QC_SNPLIST = os.path.basename(REMOTE_VARIANT_QC_SNPLIST)
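
One thing to watch in the constants above: if `WORKSPACE_BUCKET` is not set, `os.getenv` returns `None` and the f-string silently produces a path starting with `None/`. The sketch below shows one possible guard. The helper name `build_output_folder` is hypothetical and not part of the original notebook.

```python
import time


def build_output_folder(bucket, datestamp=None):
    """Return the dated results folder, failing fast if the bucket is missing.

    `bucket` is expected to be the value of os.getenv("WORKSPACE_BUCKET").
    """
    if not bucket or not bucket.startswith('gs://'):
        raise ValueError(f'Expected a gs:// bucket path, got {bucket!r}')
    if datestamp is None:
        # Same format as DATESTAMP above.
        datestamp = time.strftime('%Y%m%d')
    return f'{bucket.rstrip("/")}/data/aou/ld-pca/{datestamp}/'
```

It could be used as `OUTPUT_FOLDER = build_output_folder(os.getenv("WORKSPACE_BUCKET"))`, which raises immediately instead of failing later during the upload step.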

## Copy data locally

In [None]:
# -n (no-clobber) skips any file that already exists locally.
!gsutil cp -n {REMOTE_BGEN} {REMOTE_BGEN_SAMPLE} .

In [None]:
!gsutil cp {REMOTE_VARIANT_QC_ID} {REMOTE_VARIANT_QC_SNPLIST} .

# Compute linkage disequilibrium via plink2

In [None]:
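# --keep takes a list of sample IDs and --extract a list of variant IDs,
# both produced by the variant QC notebook.
# --indep-pairwise 200 50 0.25 prunes in windows of 200 variants, sliding 50 variants
# at a time, until no pair within a window has r^2 above 0.25.
!/tmp/plink2/plink2 \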
  --bgen {LOCAL_BGEN} ref-first \
  --sample {LOCAL_BGEN_SAMPLE} \
  --chr 1-22 \
  --keep {LOCAL_VARIANT_QC_ID} \
  --extract {LOCAL_VARIANT_QC_SNPLIST} \
  --indep-pairwise 200 50 0.25 \
  --out {OUTPUT_FILENAME_PREFIX}_plink_ld
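
To make the `0.25` threshold concrete, here is a minimal sketch of the quantity being compared: the squared Pearson correlation between the allele counts of two variants. It is a toy illustration only, not PLINK's implementation. It ignores missing genotypes and dosages, and the example genotype vectors are made up.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two allele-count vectors."""
    n = len(x)
    if n != len(y) or n == 0:
        raise ValueError('x and y must be non-empty and the same length')
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float('nan')  # monomorphic variant: correlation is undefined
    return (cov * cov) / (var_x * var_y)


R2_THRESHOLD = 0.25  # same value as the third --indep-pairwise argument

# Made-up allele counts (0, 1, or 2) for six samples.
variant_a = [0, 1, 2, 0, 1, 2]
variant_b = [0, 1, 2, 0, 1, 2]  # identical to variant_a
variant_c = [0, 0, 0, 2, 2, 2]  # uncorrelated with variant_a

print(r_squared(variant_a, variant_b) > R2_THRESHOLD)  # True: one of the pair would be pruned
print(r_squared(variant_a, variant_c) > R2_THRESHOLD)  # False: both could be kept
```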

In [None]:
%%bash

ls -lat | head

In [None]:
%%bash

wc -l *prune*
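
The same line counts can be turned into a retained fraction in Python. This is a minimal sketch, assuming PLINK wrote both `<prefix>.prune.in` and `<prefix>.prune.out` with one variant ID per line. The helper names are hypothetical.

```python
def count_lines(path):
    """Count the lines in a text file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def summarize_pruning(prefix):
    """Summarize how many variants LD pruning kept and removed."""
    kept = count_lines(f'{prefix}.prune.in')
    removed = count_lines(f'{prefix}.prune.out')
    total = kept + removed
    return {
        'kept': kept,
        'removed': removed,
        'fraction_kept': kept / total if total else float('nan'),
    }
```

For this notebook the call would be `summarize_pruning(f'{OUTPUT_FILENAME_PREFIX}_plink_ld')`.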

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}_plink_ld* {OUTPUT_FOLDER}

In [None]:
!gsutil ls {OUTPUT_FOLDER}

# Compute principal components analysis via plink2

<div class="alert alert-block alert-warning">
    <b>Note</b>: the <kbd>--memory</kbd> value below is given in MiB and assumes the machine has 30 GB of RAM, leaving some headroom for the operating system and the notebook kernel. Adjust this value if the machine has more or less than 30 GB of RAM.
</div>
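
One way to avoid hard-coding the value is to derive it from the machine's RAM. The sketch below is an assumption-laden example: the helper names are hypothetical, the 90% share is an arbitrary choice that lands close to the 27500 used below, and `detect_total_ram_bytes` relies on POSIX `sysconf` names available on Linux.

```python
import os


def plink_memory_mib(total_bytes, percent=90):
    """Return a --memory value in MiB as a percentage of total RAM."""
    total_mib = total_bytes // (1024 ** 2)
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_mib * percent // 100


def detect_total_ram_bytes():
    """Total physical RAM in bytes (Linux; uses POSIX sysconf)."""
    return os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')


# A machine with exactly 30 GiB of RAM:
print(plink_memory_mib(30 * 1024 ** 3))  # 27648
```

On the VM itself, `plink_memory_mib(detect_total_ram_bytes())` would give a value to pass to `--memory`.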

In [None]:
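# --extract keeps only the LD-pruned variants (.prune.in) from the previous section.
# --pca 15 approx computes the top 15 principal components with PLINK2's
# approximate (randomized) algorithm.
!/tmp/plink2/plink2 \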
  --bgen {LOCAL_BGEN} ref-first \
  --sample {LOCAL_BGEN_SAMPLE} \
  --chr 1-22 \
  --keep {LOCAL_VARIANT_QC_ID} \
  --extract {OUTPUT_FILENAME_PREFIX}_plink_ld.prune.in \
  --pca 15 approx \
  --memory 27500 \
  --out {OUTPUT_FILENAME_PREFIX}_plink_pca

In [None]:
%%bash

ls -lat | head

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}_plink_pca* {OUTPUT_FOLDER}

In [None]:
!gsutil ls {OUTPUT_FOLDER}
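
As a quick sanity check on the PCA output, the eigenvalues can be converted to relative shares. This is a minimal sketch, assuming the run produced `<prefix>.eigenval` with one eigenvalue per line. The helper names are hypothetical. Because only the top 15 eigenvalues are computed, the shares are relative to those 15 components, not to the total genetic variance.

```python
def read_eigenvalues(path):
    """Read one eigenvalue per line, skipping blank lines."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]


def relative_variance(eigenvalues):
    """Share of each eigenvalue relative to the sum of those provided."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]
```

For this notebook the input would be `read_eigenvalues(f'{OUTPUT_FILENAME_PREFIX}_plink_pca.eigenval')`.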

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze