# Run a GWAS via regenie

In this notebook, we perform a genome-wide association study using [regenie](https://rgcgithub.github.io/regenie/) via [dsub](https://github.com/databiosphere/dsub).

# Setup 

In [None]:
from datetime import datetime
import os
import pandas as pd
import time

## Setup plink2

https://www.cog-genomics.org/plink/2.0/

In [None]:
%%bash

##### plink 2 install
PLINK_VERSION=2.3.Alpha
PLINK_ZIP_PATH=/tmp/plink-$PLINK_VERSION.zip
curl -L -o $PLINK_ZIP_PATH https://s3.amazonaws.com/plink2-assets/alpha2/plink2_linux_x86_64.zip
mkdir -p /tmp/plink2/
unzip -o $PLINK_ZIP_PATH -d /tmp/plink2/

In [None]:
!/tmp/plink2/plink2 --version # --help

## Setup regenie

Note: regenie is already installed locally by default.

For longer-running jobs, we will run it via dsub using the Docker image `briansha/regenie:v2.0.1_boost`, which has regenie installed.

In [None]:
!regenie --version # --help

## Setup dsub

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the All of Us Workbench. It runs fine on the default Cloud Environment. 
</div>

In [None]:
!pip3 install --upgrade dsub

In [None]:
!dsub --version # --help

In [None]:
%%bash

gcloud auth list

<div class="alert alert-block alert-warning">
    <b>Note:</b> (1) You must use your own PET account. (2) Your PET account has to be granted access to run itself as a service account.
</div>

## Setup bgen_reader

In [None]:
!pip3 install bgen-reader

## Define constants

The BGEN file was created via `write_bgen_20210719_172314.ipynb`. It is still to be determined whether it is in the correct format for regenie.

Note that Brian successfully created a BGEN for regenie using this command:
`./plink2 --bfile 'MEGA_data_common_filtered_final' --chr 1-22 --export bgen-1.2 bits=8  --out 'MEGA_data_common_filtered_final_chr1_22'`

In [None]:
REMOTE_MERGED_BGEN = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210805/ukb-aou-alpha1-chr1-chr22.bgen'
REMOTE_MERGED_BGEN_SAMPLE = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210805/ukb-aou-alpha1-chr1-chr22.sample'

LOCAL_MERGED_BGEN = os.path.basename(REMOTE_MERGED_BGEN)
LOCAL_MERGED_BGEN_SAMPLE = os.path.basename(REMOTE_MERGED_BGEN_SAMPLE)

These CSVs were created via notebook `AOU_UKB_phenotype_refined.ipynb`.

In [None]:
REMOTE_PHENOTYPES = [
    'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/MergedData_HDL_Iteration2_ForGWAS.csv',
    'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/MergedData_LDL_Iteration2_ForGWAS.csv',
    'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/MergedData_TC_Iteration2_ForGWAS.csv',
    'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/MergedData_TG_Iteration2_ForGWAS.csv'
]

LOCAL_PHENOTYPES = [os.path.basename(pheno) for pheno in REMOTE_PHENOTYPES]

REGENIE_PHENOTYPES = 'aou_alpha1_ukb_lipids_phenotypes_and_covariates.tsv'

In [None]:
RESULT_BUCKET = os.getenv("WORKSPACE_BUCKET")
DATESTAMP = time.strftime('%Y%m%d')

# Outputs
REGENIE_OUTPUTS = f'{RESULT_BUCKET}/data/regenie/{DATESTAMP}/'

## Copy data locally for testing

In [None]:
!gsutil cp {REMOTE_MERGED_BGEN} {REMOTE_MERGED_BGEN_SAMPLE} .    

In [None]:
for remote_pheno_file in REMOTE_PHENOTYPES:
    !gsutil cp {remote_pheno_file} .    

# Reshape the phenotypes for regenie 

## Read in the four CSVs.

In [None]:
raw_pheno_dfs = [pd.read_csv(local_pheno_file) for local_pheno_file in LOCAL_PHENOTYPES]

In [None]:
[p.shape for p in raw_pheno_dfs]

In [None]:
[p.columns for p in raw_pheno_dfs]

In [None]:
raw_pheno_dfs[0].dtypes

## Join the four CSVs into a single dataframe.

In [None]:
index_col = ['eid', 'sampleid', 'sex', 'age', 'age2', 'pc1', 'pc2', 'pc3', 'pc4',
             'pc5', 'pc6', 'pc7', 'pc8', 'pc9', 'pc10', 'statin', 'cohort']
pheno_dfs = [pd.read_csv(local_pheno_file, index_col=index_col) for local_pheno_file in LOCAL_PHENOTYPES]

In [None]:
[p.shape for p in pheno_dfs]

In [None]:
[p.columns for p in pheno_dfs]

In [None]:
pheno_dfs[0].head()

In [None]:
pheno_df = pd.concat(pheno_dfs, axis=1)

pheno_df.shape
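
The join above relies on `pd.concat(..., axis=1)` aligning rows on the shared index, so each phenotype file contributes one column. Here is a minimal sketch of that behavior. The two toy frames, their shortened key columns, and all values are made up for illustration and are not taken from the real CSVs.

```python
import pandas as pd

# Toy stand-ins for two of the phenotype CSVs; all values are made up.
key_cols = ['sampleid', 'cohort']
hdl = pd.DataFrame({'sampleid': [1, 2, 3], 'cohort': ['AOU', 'AOU', 'UKB'],
                    'HDLnorm': [0.1, -0.4, 1.2]}).set_index(key_cols)
ldl = pd.DataFrame({'sampleid': [3, 1], 'cohort': ['UKB', 'AOU'],
                    'LDLnorm': [0.7, -0.2]}).set_index(key_cols)

# Duplicate index keys make a column-wise concat fail or misalign, so check first.
for name, df in [('hdl', hdl), ('ldl', ldl)]:
    assert not df.index.duplicated().any(), f'duplicate keys in {name}'

# axis=1 performs an outer join on the index: rows missing from one
# frame get NaN in that frame's columns.
toy_joined = pd.concat([hdl, ldl], axis=1)
print(toy_joined)
```

If a covariate value differs between two files for the same person, the index keys no longer match and that person ends up on two rows, so comparing the joined shape with the input shapes is a useful sanity check.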

## Check the resulting dataframe.

In [None]:
pheno_df.dtypes

In [None]:
pheno_df = pheno_df.reset_index()

In [None]:
pheno_df.shape

In [None]:
pheno_df.dtypes

In [None]:
pheno_df.head()

In [None]:
pheno_df = pheno_df.astype({'sampleid': 'int32'})

In [None]:
pheno_df.head()

In [None]:
pheno_df.groupby('sex')['sex'].count()

In [None]:
pheno_df.groupby('cohort')['cohort'].count()

In [None]:
pheno_df.query('eid == sampleid').groupby('cohort')['cohort'].count()

In [None]:
pheno_df.query('eid == sampleid & cohort == "UKB"')

In [None]:
pheno_df.query('eid != sampleid').groupby('cohort')['cohort'].count()

In [None]:
pheno_df[['HDLnorm', 'LDLnorm', 'TCnorm', 'TGnorm']].describe()

In [None]:
[f'{col}: {pheno_df[col].isnull().sum()}' for col in sorted(pheno_df.columns)]

## Create the TSV for regenie.

### Add the FID and IID columns.

In [None]:
pheno_df['FID'] = pheno_df['sampleid'].astype('str') + '_' + pheno_df['cohort'].str.lower()

In [None]:
pheno_df['IID'] = pheno_df['FID']
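
regenie matches phenotype rows to genotypes by ID, so it is worth confirming that the FID and IID values built above agree with the IDs in the `.sample` file. The sketch below assumes the usual Oxford `.sample` layout (a header line with `ID_1` and `ID_2`, then a line of column types). The helper `read_sample_ids` and all toy IDs are hypothetical.

```python
import io
import pandas as pd

def read_sample_ids(handle):
    """Return the set of (ID_1, ID_2) pairs from an Oxford-format .sample file.

    Assumes a header line, a second line of column types, then one
    whitespace-separated row per sample.
    """
    sample_df = pd.read_csv(handle, sep=r'\s+', skiprows=[1], dtype=str)
    return set(zip(sample_df['ID_1'], sample_df['ID_2']))

# Made-up file content and phenotype IDs, for illustration only.
toy_sample = io.StringIO('ID_1 ID_2 missing\n0 0 0\n1_aou 1_aou 0\n2_ukb 2_ukb 0\n')
toy_pheno = pd.DataFrame({'FID': ['1_aou', '3_ukb'], 'IID': ['1_aou', '3_ukb']})

sample_ids = read_sample_ids(toy_sample)
pheno_ids = set(zip(toy_pheno['FID'], toy_pheno['IID']))
print('in phenotypes but not in .sample:', sorted(pheno_ids - sample_ids))
print('in .sample but not in phenotypes:', sorted(sample_ids - pheno_ids))
```

On the real data, the same helper could be called as `read_sample_ids(LOCAL_MERGED_BGEN_SAMPLE)` and compared against `pheno_df[['FID', 'IID']]`, provided the file follows the assumed layout.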

### TEMPORARY: fill in NA covariates

In [None]:
pheno_df['sex'] = pheno_df['sex'].fillna('other')

In [None]:
[f'{col}: {pheno_df[col].isnull().sum()}' for col in sorted(pheno_df.columns)]

### Then write the TSV to disk so that regenie can read it.

In [None]:
pheno_df.columns

In [None]:
pheno_df[['FID', 'IID', 'sex', 'age', 'age2',  'cohort',
          'pc1', 'pc2', 'pc3', 'pc4', 'pc5', 'pc6', 'pc7', 'pc8', 'pc9', 'pc10',
          'HDLnorm', 'LDLnorm', 'TCnorm', 'TGnorm']].to_csv(
    REGENIE_PHENOTYPES,
    sep='\t',
    na_rep='NA',
    index=False
)

In [None]:
!head {REGENIE_PHENOTYPES}
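
Beyond eyeballing the first lines, a few structural checks can be scripted. regenie expects the phenotype and covariate tables to start with `FID` and `IID` columns and to use `NA` for missing values. The helper `check_regenie_table` below is a hypothetical sketch run on made-up data, not part of regenie.

```python
import io
import pandas as pd

def check_regenie_table(handle, required_cols):
    """Load a tab-separated phenotype/covariate table and run basic checks."""
    # pandas parses 'NA' as missing by default, matching na_rep='NA' on write.
    table = pd.read_csv(handle, sep='\t', dtype={'FID': str, 'IID': str})
    if list(table.columns[:2]) != ['FID', 'IID']:
        raise ValueError('first two columns must be FID and IID')
    missing_cols = [c for c in required_cols if c not in table.columns]
    if missing_cols:
        raise ValueError(f'missing columns: {missing_cols}')
    if table.duplicated(subset=['FID', 'IID']).any():
        raise ValueError('duplicate FID/IID rows')
    return table

# Made-up table with one missing phenotype value.
toy_tsv = io.StringIO(
    'FID\tIID\tage\tHDLnorm\n'
    '1_aou\t1_aou\t54\t0.3\n'
    '2_ukb\t2_ukb\t61\tNA\n'
)
toy_table = check_regenie_table(toy_tsv, ['age', 'HDLnorm'])
print(toy_table.isna().sum())
```

Passing `REGENIE_PHENOTYPES` as the handle, together with the phenotype and covariate names used later, would run the same checks on the real file.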

# Check the BGEN file

There were a few problems that needed to be fixed:
* The first time I created it, I hit [this issue](https://hail.zulipchat.com/#narrow/stream/123010-Hail-0.2E2.20support/topic/hl.2Eexport_bgen).
* The second time, I realized that PLINK required rsids, so those were computed by Hail and written to a new BGEN.

In [None]:
!head {LOCAL_MERGED_BGEN_SAMPLE}

In [None]:
from bgen_reader import read_bgen

In [None]:
bgen = read_bgen(LOCAL_MERGED_BGEN, verbose=True)

In [None]:
# Variants metadata.
print(bgen["variants"].head())

In [None]:
# Samples read from the bgen file.
print(bgen["samples"].head())

In [None]:
# Samples read from the bgen file.
print(bgen["samples"].tail())

In [None]:
# There are X variants in total.
print(len(bgen["genotype"]))

In [None]:
# For performance and memory reasons, this library avoids accessing the bgen
# file as much as possible. The `compute` function tells the library to
# actually read the file and retrieve some data.
geno = bgen["genotype"][0].compute()
print(geno.keys())
# Let's have a look at the probabilities regarding the first variant.
print(geno["probs"])
# The above matrix is of size samples-by-(combination-of-alleles).
print(geno["probs"].shape)
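
For a biallelic variant in diploid samples, the probability matrix has three columns per sample. A quick way to sanity-check it is to convert it to allele dosages. The sketch below uses a made-up matrix and assumes the columns are ordered P(AA), P(AB), P(BB). Confirm the ordering for the real file before reusing it.

```python
import numpy as np

# Made-up genotype probabilities for 4 samples at one biallelic variant.
# Columns are assumed to be ordered P(AA), P(AB), P(BB).
toy_probs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.2, 0.7],
    [np.nan, np.nan, np.nan],  # missing genotype
])

# Expected count of the second allele (B) per sample; missing rows stay NaN.
toy_dosage = toy_probs @ np.array([0.0, 1.0, 2.0])

# Frequency of allele B over the non-missing samples.
b_freq = np.nanmean(toy_dosage) / 2
print(toy_dosage, b_freq)
```

Applied to `geno["probs"]`, the same two lines give a per-variant allele frequency that can be compared against what PLINK reports.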

# Variant QC via PLINK

Per Margaret, use [plink2 to perform the variant QC](https://rgcgithub.github.io/regenie/recommendations/#exclusion-files) and obtain a subset of SNPs roughly equal to the number of samples.


We'll run this locally since it's pretty quick.

In [None]:
!/tmp/plink2/plink2 \
  --bgen {LOCAL_MERGED_BGEN} ref-first \
  --chr 1-22 \
  --geno 0.1 \
  --mind 0.1 \
  --mac 100 \
  --hwe 1e-15 \
  --write-snplist \
  --write-samples \
  --no-id-header \
  --out aou_alpha1_ukb_lipids_plink

# This is too strict and removes too many samples.
#  --maf 0.01 \
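
To make the variant-level thresholds concrete, here is a rough sketch of what `--geno` and `--mac` filter on, using hard-call genotypes in NumPy. It is a simplification: PLINK works on the BGEN data directly, and the sample-level `--mind` and the `--hwe` test are not reproduced here. The function `variant_qc_mask` and the toy matrix are hypothetical.

```python
import numpy as np

def variant_qc_mask(genotypes, max_missing, min_mac):
    """Return a boolean keep-mask over variants.

    genotypes: samples-by-variants array of alternate allele counts
    (0, 1, 2), with np.nan for missing calls.
    """
    called = ~np.isnan(genotypes)
    missing_rate = 1.0 - called.mean(axis=0)
    alt_count = np.nansum(genotypes, axis=0)
    total_alleles = 2 * called.sum(axis=0)
    # Minor allele count: the smaller of the two allele counts.
    mac = np.minimum(alt_count, total_alleles - alt_count)
    return (missing_rate <= max_missing) & (mac >= min_mac)

# Made-up data: 5 samples (rows) by 3 variants (columns).
toy_genotypes = np.array([
    [0, 1, np.nan],
    [1, 0, np.nan],
    [2, 0, 1],
    [0, 0, 2],
    [1, 0, 1],
])
toy_keep = variant_qc_mask(toy_genotypes, max_missing=0.1, min_mac=2)
print(toy_keep)
```

In this toy example the first variant passes, the second fails the minor allele count threshold, and the third fails the missingness threshold.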

In [None]:
!ls -lth . | head

In [None]:
!head aou_alpha1_ukb_lipids_plink.id

In [None]:
!tail aou_alpha1_ukb_lipids_plink.id

In [None]:
!wc -l aou_alpha1_ukb_lipids_plink.id

In [None]:
!head aou_alpha1_ukb_lipids_plink.snplist

In [None]:
!tail aou_alpha1_ukb_lipids_plink.snplist

In [None]:
!wc -l aou_alpha1_ukb_lipids_plink.snplist

In [None]:
!gsutil -m cp aou_alpha1_ukb_lipids* {REGENIE_OUTPUTS}

# regenie

This work is based on https://github.com/briansha/Regenie_WDL/blob/master/regenie.wdl

See also:
* regenie documentation https://rgcgithub.github.io/regenie/options/#input
* dsub documentation https://github.com/DataBiosphere/dsub/blob/main/docs/input_output.md

## Step 1

From https://rgcgithub.github.io/regenie/overview/:
> In the first step a subset of genetic markers are used to fit a whole genome regression model that captures a good fraction of the phenotype variance attributable to genetic effects.

In [None]:
# Parameters to add
# 8 core machine
# 11 GB ram
# 500 GB disk

!regenie \
    --step 1 \
    --bgen={LOCAL_MERGED_BGEN} \
    --ref-first \
    --sample={LOCAL_MERGED_BGEN_SAMPLE} \
    --phenoFile={REGENIE_PHENOTYPES} \
    --phenoColList=LDLnorm,HDLnorm,TCnorm,TGnorm \
    --covarFile={REGENIE_PHENOTYPES} \
    --catCovarList=sex,cohort \
    --covarColList=age,age2,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10 \
    --extract aou_alpha1_ukb_lipids_plink.snplist \
    --bsize 1000 \
    --verbose \
    --out aou_alpha1_ukb_lipids_regenie_part1


# Note that no samples were omitted by the QC step, so we are leaving out this file since regenie 
# complained that it did not match the samples in the BGEN file.
#     --keep aou_alpha1_ukb_lipids.id \


In [None]:
!ls -lth . | head

In [None]:
!gsutil -m cp aou_alpha1_ukb_lipids* {REGENIE_OUTPUTS}
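
Step 1 writes `aou_alpha1_ukb_lipids_regenie_part1_pred.list`, which Step 2 reads via `--pred`. A small parser helps confirm that every phenotype has a prediction file before starting Step 2. The sketch assumes each line holds a phenotype name followed by a file path. The helper name and the toy paths are made up.

```python
import io

def parse_pred_list(handle):
    """Map phenotype name -> LOCO prediction file path.

    Assumes each non-empty line is '<phenotype> <path>' separated by whitespace.
    """
    loco_files = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        phenotype, path = line.split(maxsplit=1)
        loco_files[phenotype] = path
    return loco_files

# Made-up content for illustration only.
toy_pred_list = io.StringIO(
    'LDLnorm /home/jupyter/out_1.loco\n'
    'HDLnorm /home/jupyter/out_2.loco\n'
)
toy_loco = parse_pred_list(toy_pred_list)
print(toy_loco)
```

Because the list stores file paths, it may need to be regenerated or edited if the `.loco` files are moved, for example when Step 2 runs on a different machine than Step 1.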

## Step 2

From https://rgcgithub.github.io/regenie/overview/:
> In the second step, a larger set of genetic markers (e.g. imputed markers) are tested for association with the phenotype conditional upon the prediction from the regression model in Step 1, using a leave one chromosome out (LOCO) scheme, that avoids proximal contamination.

In [None]:
!regenie \
    --step 2 \
    --bgen={LOCAL_MERGED_BGEN} \
    --ref-first \
    --sample={LOCAL_MERGED_BGEN_SAMPLE} \
    --phenoFile={REGENIE_PHENOTYPES} \
    --phenoColList=LDLnorm,HDLnorm,TCnorm,TGnorm \
    --covarFile={REGENIE_PHENOTYPES} \
    --catCovarList=sex,cohort \
    --covarColList=age,age2,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10 \
    --firth \
    --approx \
    --pThresh 0.01 \
    --pred aou_alpha1_ukb_lipids_regenie_part1_pred.list \
    --bsize 400 \
    --split \
    --out aou_alpha1_ukb_lipids_regenie_part2


In [None]:
!ls -lh aou_alpha1_ukb_lipids*

In [None]:
!gsutil -m cp aou_alpha1_ukb_lipids* {REGENIE_OUTPUTS}

In [None]:
!gsutil ls {REGENIE_OUTPUTS}
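
With `--split`, Step 2 is expected to write one results file per phenotype. The sketch below shows one way to load such a file and pull out genome-wide significant variants. The space-separated column layout and the toy rows are assumptions for illustration, so check the header of the real files before relying on it.

```python
import io
import math
import pandas as pd

# Made-up rows in the space-separated layout assumed for a Step 2 results file.
toy_results = io.StringIO(
    'CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ INFO N TEST BETA SE CHISQ LOG10P\n'
    '1 1000 rs1 A G 0.21 1 5000 ADD 0.01 0.02 0.25 0.21\n'
    '1 2000 rs2 C T 0.35 1 5000 ADD 0.15 0.02 56.2 13.2\n'
)
results_df = pd.read_csv(toy_results, sep=' ')

# LOG10P holds -log10(p); 5e-8 is the conventional genome-wide threshold.
threshold = -math.log10(5e-8)
results_df['P'] = 10.0 ** (-results_df['LOG10P'])
toy_hits = results_df[results_df['LOG10P'] > threshold]
print(toy_hits[['CHROM', 'GENPOS', 'ID', 'BETA', 'SE', 'LOG10P']])
```

Filtering on `LOG10P` rather than on the converted p-value avoids underflow for very strong associations.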

# Appendix

## Use QCtool to subset the BGEN

## Check missingness

## regenie via dsub

Still re-writing the sections below to run this at scale in the background via dsub.
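
As a starting point, here is a rough sketch of how the dsub call might be assembled in Python, using the machine size noted in Step 1 (8 cores, 11 GB RAM, 500 GB disk). It only builds the argument list and submits nothing. The function `build_dsub_command`, the project name, and the bucket paths are hypothetical, and a real run on the Workbench would likely need further flags such as `--user-project`, `--service-account`, `--network`, and `--subnetwork`.

```python
import shlex

def build_dsub_command(project, image, logging_path, command, inputs, output_dir,
                       min_cores=8, min_ram_gb=11, disk_size_gb=500):
    """Assemble a dsub argv list; nothing is submitted here."""
    argv = [
        'dsub',
        '--provider', 'google-cls-v2',
        '--project', project,
        '--image', image,
        '--logging', logging_path,
        '--min-cores', str(min_cores),
        '--min-ram', str(min_ram_gb),
        '--disk-size', str(disk_size_gb),
    ]
    # Each input is exposed to the command as an environment variable
    # holding the local path of the copied file.
    for env_name, gcs_path in inputs.items():
        argv += ['--input', f'{env_name}={gcs_path}']
    argv += ['--output-recursive', f'OUT_DIR={output_dir}']
    argv += ['--command', command]
    return argv

# Hypothetical project, paths, and command, for illustration only.
toy_argv = build_dsub_command(
    project='example-project',
    image='briansha/regenie:v2.0.1_boost',
    logging_path='gs://example-bucket/dsub/logs/',
    command='regenie --step 1 --bgen "${BGEN}" --sample "${SAMPLE}" --out "${OUT_DIR}/part1"',
    inputs={'BGEN': 'gs://example-bucket/data/merged.bgen',
            'SAMPLE': 'gs://example-bucket/data/merged.sample'},
    output_dir='gs://example-bucket/data/regenie/',
)
print(shlex.join(toy_argv))
```

Building the command as a list keeps quoting under control, and the printed string can be reviewed before anything is submitted.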

## Compress Hail logs

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze