# Perform variant QC

In this notebook, we perform variant quality control using [PLINK2](https://www.cog-genomics.org/plink/2.0/).

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the All of Us Workbench.
    <ul>
        <li>Use compute type 'Standard VM' with sufficient CPU and RAM (e.g. start with 8 CPUs and 30 GB RAM, increase if needed).</li>
        <li>This notebook can take a while to run. We recommend running it in the background via <kbd>run_notebook_in_the_background</kbd>.</li>
    </ul>
</div>

In [None]:
import os
import time

## Setup plink2

https://www.cog-genomics.org/plink/2.0/

In [None]:
%%bash

##### plink 2 install
PLINK_VERSION=2.3.Alpha
PLINK_ZIP_PATH=/tmp/plink-$PLINK_VERSION.zip
curl -L -o $PLINK_ZIP_PATH https://s3.amazonaws.com/plink2-assets/alpha2/plink2_linux_x86_64.zip
mkdir -p /tmp/plink2/
unzip -o $PLINK_ZIP_PATH -d /tmp/plink2/

In [None]:
!/tmp/plink2/plink2 --version # --help

## Define constants

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

#---[ Inputs ]---
# The BGEN file created via aou_workbench_siloed_analyses/02_aou_write_filtered_bgen.ipynb.
REMOTE_BGEN = 'gs://fc-secure-440c511e-7fff-417c-9c86-f8ab51bfc618/data/aou/20211110/aou-alpha2-chr1-chr22.bgen'
REMOTE_BGEN_SAMPLE = 'gs://fc-secure-440c511e-7fff-417c-9c86-f8ab51bfc618/data/aou/20211110/aou-alpha2-chr1-chr22.sample'
# This CSV was created via notebook aou_workbench_siloed_analyses/01_aou_lipids_phenotype.ipynb
REMOTE_PHENOTYPES = 'gs://fc-secure-471c1068-cd3d-4b43-9b5d-a618c85ceea5/data/aou/pheno/20220203/aou_alpha3_lipids_phenotype.csv'

#---[ Outputs ]---
# Create a timestamp for a folder of results generated today.
DATESTAMP = time.strftime('%Y%m%d')
OUTPUT_FILENAME_PREFIX = 'aou_alpha3_lipids'
OUTPUT_FOLDER = f'{os.getenv("WORKSPACE_BUCKET")}/data/aou/variant-qc/{DATESTAMP}/'
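
If `WORKSPACE_BUCKET` is not set, the f-string above silently produces a path beginning with `None/`, and the problem only surfaces at the final `gsutil cp`. The following is a minimal sketch of an early check. The helper name `build_output_folder` is hypothetical, and the check assumes the bucket value starts with `gs://`, as the input paths in this notebook do.

```python
def build_output_folder(bucket, datestamp, subdir='data/aou/variant-qc'):
    """Return the dated output folder, failing early if the bucket looks wrong.

    Hypothetical helper, not part of the original notebook. It assumes the
    workspace bucket is a gs:// path.
    """
    if not bucket or not bucket.startswith('gs://'):
        raise ValueError(f'Expected a gs:// bucket path, got {bucket!r}')
    # Strip any trailing slash so the joined path has no double slash.
    return f'{bucket.rstrip("/")}/{subdir}/{datestamp}/'
```

It could be used as `OUTPUT_FOLDER = build_output_folder(os.getenv('WORKSPACE_BUCKET'), DATESTAMP)`, which raises immediately rather than after the long-running plink2 steps.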

In [None]:
LOCAL_BGEN = os.path.basename(REMOTE_BGEN)
LOCAL_BGEN_SAMPLE = os.path.basename(REMOTE_BGEN_SAMPLE)
LOCAL_PHENOTYPES = os.path.basename(REMOTE_PHENOTYPES)

## Copy data locally

In [None]:
!gsutil cp -n {REMOTE_BGEN} {REMOTE_BGEN_SAMPLE} .

In [None]:
!gsutil cp {REMOTE_PHENOTYPES} .

# Variant QC for step 1 via plink2

Use plink2 to perform the variant QC, following the [regenie recommendations on exclusion files](https://rgcgithub.github.io/regenie/recommendations/#exclusion-files), and obtain a subset of SNPs roughly equal in number to the number of samples.

In [None]:
LOCAL_BGEN

In [None]:
!/tmp/plink2/plink2 \
  --bgen {LOCAL_BGEN} ref-first \
  --sample {LOCAL_BGEN_SAMPLE} \
  --chr 1-22 \
  --keep {LOCAL_PHENOTYPES} \
  --geno 0.1 \
  --mind 0.1 \
  --mac 100 \
  --hwe 1e-15 \
  --write-snplist \
  --write-samples \
  --no-id-header \
  --out {OUTPUT_FILENAME_PREFIX}_step1QC_plink

In [None]:
!ls -lth . | head

In [None]:
!head {OUTPUT_FILENAME_PREFIX}_step1QC_plink.id

In [None]:
!tail {OUTPUT_FILENAME_PREFIX}_step1QC_plink.id

In [None]:
!wc -l {OUTPUT_FILENAME_PREFIX}_step1QC_plink.id

In [None]:
!head {OUTPUT_FILENAME_PREFIX}_step1QC_plink.snplist

In [None]:
!tail {OUTPUT_FILENAME_PREFIX}_step1QC_plink.snplist

In [None]:
!wc -l {OUTPUT_FILENAME_PREFIX}_step1QC_plink.snplist
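
The goal stated for step 1 is a variant count roughly equal to the sample count. As a rough sketch, the two `wc -l` results can be compared in Python. The helper names below are hypothetical, and the sample count relies on `--no-id-header` having been used, so that the `.id` file has one line per sample and no header.

```python
def count_nonblank_lines(path):
    """Count the non-empty lines in a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def variants_per_sample(snplist_path, id_path):
    """Return (n_variants, n_samples, ratio) for a plink2 QC output pair.

    Hypothetical helper: assumes one variant ID per line in the .snplist
    file and one sample per line (no header) in the .id file.
    """
    n_variants = count_nonblank_lines(snplist_path)
    n_samples = count_nonblank_lines(id_path)
    if n_samples == 0:
        raise ValueError(f'No samples found in {id_path}')
    return n_variants, n_samples, n_variants / n_samples
```

For example, `variants_per_sample(f'{OUTPUT_FILENAME_PREFIX}_step1QC_plink.snplist', f'{OUTPUT_FILENAME_PREFIX}_step1QC_plink.id')` returns the two counts and their ratio. A ratio far from 1 may suggest revisiting the thresholds.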

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}* {OUTPUT_FOLDER}

# Variant QC for step 2 via plink2

In [None]:
!/tmp/plink2/plink2 \
  --bgen {LOCAL_BGEN} ref-first \
  --sample {LOCAL_BGEN_SAMPLE} \
  --chr 1-22 \
  --keep {LOCAL_PHENOTYPES} \
  --geno 0.1 \
  --mind 0.1 \
  --hwe 1e-15 \
  --write-snplist \
  --write-samples \
  --no-id-header \
  --out {OUTPUT_FILENAME_PREFIX}_step2QC_plink

# This differs from step 1 QC in that the following parameter was removed
#    --mac 100 \
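
Because step 2 only drops the `--mac 100` filter, one would expect every variant that passed step 1 to also pass step 2. The sketch below checks that expectation once both `.snplist` files exist. The function names are hypothetical and the files are assumed to hold one variant ID per line.

```python
def read_variant_ids(path):
    """Read a .snplist file into a set of variant IDs, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def only_in_step1(step1_snplist, step2_snplist):
    """Return variant IDs present in the step 1 list but absent from step 2.

    Hypothetical helper: an empty result is what we would expect if step 2
    differs from step 1 only by a removed filter.
    """
    return read_variant_ids(step1_snplist) - read_variant_ids(step2_snplist)
```

Calling `only_in_step1(f'{OUTPUT_FILENAME_PREFIX}_step1QC_plink.snplist', f'{OUTPUT_FILENAME_PREFIX}_step2QC_plink.snplist')` and finding a non-empty set would be worth investigating before moving on.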

In [None]:
!ls -lth . | head

In [None]:
!head {OUTPUT_FILENAME_PREFIX}_step2QC_plink.id

In [None]:
!tail {OUTPUT_FILENAME_PREFIX}_step2QC_plink.id

In [None]:
!wc -l {OUTPUT_FILENAME_PREFIX}_step2QC_plink.id

In [None]:
!head {OUTPUT_FILENAME_PREFIX}_step2QC_plink.snplist

In [None]:
!tail {OUTPUT_FILENAME_PREFIX}_step2QC_plink.snplist

In [None]:
!wc -l {OUTPUT_FILENAME_PREFIX}_step2QC_plink.snplist

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}* {OUTPUT_FOLDER}

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze