# Perform variant quality control

In this notebook, we perform variant quality control using [PLINK2](https://www.cog-genomics.org/plink/2.0/).

Note that this work is part of a larger project to [Demonstrate the Potential for Pooled Analysis of All of Us and UK Biobank Genomic Data](https://github.com/all-of-us/ukb-cross-analysis-demo-project). Specifically, this notebook belongs to the **siloed** analysis portion of the project.

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the UK Biobank Research Analysis Platform.
    <ul>
        <li>Use compute type 'Single Node' with sufficient CPU and RAM (e.g. start with 8 CPUs and 30 GB RAM, increase if needed).</li>
        <li>This notebook can take a while to run (e.g., 45 minutes). We recommend running it in the background via <kbd>dx run dxjupyterlab</kbd>, which also captures provenance.</li>
    </ul>
</div>

```
dx run dxjupyterlab \
    --instance-type=mem2_ssd1_v2_x8 \
    -icmd="papermill 05_ukb_plink_variant_qc.ipynb 05_ukb_plink_variant_qc_$(date +%Y%m%d).ipynb" \
    -iin=05_ukb_plink_variant_qc.ipynb \
    --folder=outputs/plink-variant-qc/$(date +%Y%m%d)/
```
See also https://platform.dnanexus.com/app/dxjupyterlab

In [None]:
import os
import pandas as pd

## Setup plink2

https://www.cog-genomics.org/plink/2.0/

In [None]:
%%bash

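##### plink 2 install
# PLINK_VERSION only labels the local zip file; the URL below does not pin a release.
PLINK_VERSION=2.3.Alpha
PLINK_ZIP_PATH="/tmp/plink-${PLINK_VERSION}.zip"
# Download the Linux x86_64 build and unpack it into /tmp/plink2/.
curl -L -o "$PLINK_ZIP_PATH" https://s3.amazonaws.com/plink2-assets/alpha2/plink2_linux_x86_64.zip
mkdir -p /tmp/plink2/
unzip -o "$PLINK_ZIP_PATH" -d /tmp/plink2/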

In [None]:
!/tmp/plink2/plink2 --version # --help

## Define constants

This notebook takes as input the UK Biobank WES data that was filtered by notebook `1_ukb_plink_bed_filter.ipynb` and then merged by notebook `04_ukb_plink_merge_bed_files.ipynb`.

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

#---[ Inputs ]---
# This was created via ukb_rap_siloed_analyses/04_ukb_plink_merge_bed_files.ipynb
BED_FILE = '/mnt/project/outputs/plink-merge-bed/20220420/ukb_200kwes_filtered_plink_mergebed'
# This was created via ukb_rap_siloed_analyses/02_ukb_lipids_phenotype.ipynb
PHENOTYPES = '/mnt/project/outputs/r-prepare-phenotype/20220308/ukb_200kwes_lipids_phenotype.tsv'

#---[ Outputs ]---
PLINK_OUTPUT_FILENAME_PREFIX = 'ukb_200kwes_lipids'
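
Because `--bfile` takes a file prefix rather than a single file, a mistake in `BED_FILE` only surfaces once PLINK starts. The following is a minimal sketch of a pre-flight check, assuming the standard `.bed`/`.bim`/`.fam` trio sits under that prefix. The helper name `missing_bfile_parts` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

# Hypothetical helper (not from the original notebook): list which members of
# a PLINK 1 binary fileset are absent for a given --bfile prefix.
def missing_bfile_parts(prefix):
    """Return the .bed/.bim/.fam paths under `prefix` that do not exist."""
    expected = [f"{prefix}{ext}" for ext in (".bed", ".bim", ".fam")]
    return [path for path in expected if not Path(path).is_file()]
```

In this notebook, `missing_bfile_parts(BED_FILE)` would be expected to return an empty list when the merged fileset is mounted correctly.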

# Perform variant QC via PLINK for regenie step 1

Use [plink2 to perform the variant QC](https://rgcgithub.github.io/regenie/recommendations/#exclusion-files) and obtain a subset of variants whose count is roughly equal to the number of samples.

In [None]:
!/tmp/plink2/plink2 \
  --bfile {BED_FILE} \
  --chr 1-22 \
  --keep {PHENOTYPES} \
  --geno 0.1 \
  --mind 0.1 \
  --mac 100 \
  --hwe 1e-15 \
  --write-snplist \
  --write-samples \
  --no-id-header \
  --out {PLINK_OUTPUT_FILENAME_PREFIX}_regenie_step1_plink_variant_qc
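
To see whether the retained variant count is in fact close to the sample count, the two list files can be tallied. This is a minimal sketch, assuming `--write-snplist` produced `<prefix>.snplist` and `--write-samples` with `--no-id-header` produced a header-free `<prefix>.id`. The helper names are hypothetical and not part of the original notebook.

```python
# Hypothetical helpers (not from the original notebook) for tallying the
# variant and sample lists written by the plink2 command above.
def count_nonempty_lines(path):
    """Count the non-blank lines in a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())

def summarize_qc(out_prefix):
    """Tally variants (.snplist) and samples (.id) for a plink2 --out prefix."""
    n_variants = count_nonempty_lines(f"{out_prefix}.snplist")
    n_samples = count_nonempty_lines(f"{out_prefix}.id")
    # Assumes the .id file has no header line (--no-id-header).
    ratio = n_variants / n_samples if n_samples else float("nan")
    return {"variants": n_variants, "samples": n_samples,
            "variants_per_sample": ratio}
```

In this notebook the prefix would be `f"{PLINK_OUTPUT_FILENAME_PREFIX}_regenie_step1_plink_variant_qc"`. A `variants_per_sample` value near 1 would match the stated goal.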

In [None]:
%%bash

ls -lat | head

# Perform variant QC via PLINK for regenie step 2

Use [plink2 to perform the variant QC](https://rgcgithub.github.io/regenie/recommendations/#exclusion-files) to omit low-quality variants and samples from consideration in regenie step 2.

In [None]:
!/tmp/plink2/plink2 \
  --bfile {BED_FILE} \
  --chr 1-22 \
  --keep {PHENOTYPES} \
  --geno 0.1 \
  --mind 0.1 \
  --hwe 1e-15 \
  --write-snplist \
  --write-samples \
  --no-id-header \
  --out {PLINK_OUTPUT_FILENAME_PREFIX}_regenie_step2_plink_variant_qc

# This differs from step 1 QC in that the following parameter was removed
#    --mac 100 \
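
Since the two commands differ only in `--mac 100`, the step 1 variant list would be expected to be a subset of the step 2 list. This is a minimal sketch of that comparison, assuming both `.snplist` files hold one variant ID per line. The helper names are hypothetical, and duplicated IDs (such as `.`) would collapse in the sets used here.

```python
# Hypothetical helpers (not from the original notebook) for comparing the
# step 1 and step 2 variant lists.
def read_ids(path):
    """Read one ID per line into a set, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}

def compare_snplists(step1_path, step2_path):
    """Count variants unique to each list and variants present in both."""
    step1 = read_ids(step1_path)
    step2 = read_ids(step2_path)
    return {
        "step1_only": len(step1 - step2),
        "step2_only": len(step2 - step1),
        "shared": len(step1 & step2),
    }
```

A nonzero `step1_only` count would suggest that the two runs used different inputs or filters beyond `--mac`.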

In [None]:
%%bash

ls -lat | head

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze