# Run a GWAS via regenie

In this notebook, we perform a genome-wide association study using [regenie](https://rgcgithub.github.io/regenie/) via [dsub](https://github.com/databiosphere/dsub).

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the All of Us Workbench.
    <ul>
        <li>Use compute type 'Standard VM' with sufficient CPU and RAM (e.g. start with 8 CPUs and 30 GB RAM, increase if needed).</li>
        <li>This notebook can take a while to run. We recommend running it in the background via <kbd>run_notebook_in_the_background</kbd>.</li>
    </ul>
</div>

In [None]:
from datetime import datetime
import os
import pandas as pd
import time

## Setup plink2

https://www.cog-genomics.org/plink/2.0/

In [None]:
%%bash

##### plink 2 install
PLINK_VERSION=2.3.Alpha
PLINK_ZIP_PATH=/tmp/plink-$PLINK_VERSION.zip
curl -L -o $PLINK_ZIP_PATH https://s3.amazonaws.com/plink2-assets/alpha2/plink2_linux_x86_64.zip
mkdir -p /tmp/plink2/
unzip -o $PLINK_ZIP_PATH -d /tmp/plink2/

In [None]:
!/tmp/plink2/plink2 --version # --help

## Setup Bgen reader

In [None]:
!pip3 install bgen-reader

## Setup regenie

Note: regenie is already installed locally by default, but we are choosing to update to a more recent version.

In [None]:
!regenie --help # --help  --version

In [None]:
%%bash

REGENIE_VERSION=v2.2.4
rm -f regenie.zip
curl -L -o regenie.zip "https://github.com/rgcgithub/regenie/releases/download/${REGENIE_VERSION}/regenie_${REGENIE_VERSION}.gz_x86_64_Linux.zip"
unzip -o regenie.zip

In [None]:
!./regenie_v2.2.4.gz_x86_64_Linux --version # --help

## Define constants

The BGEN file was created via `write_filtered_aou_bgen.ipynb`.

In [None]:
REMOTE_BGEN = 'gs://fc-secure-440c511e-7fff-417c-9c86-f8ab51bfc618/data/aou/20211110/aou-alpha2-chr1-chr22.bgen'
REMOTE_BGEN_SAMPLE = 'gs://fc-secure-440c511e-7fff-417c-9c86-f8ab51bfc618/data/aou/20211110/aou-alpha2-chr1-chr22.sample'

LOCAL_BGEN = os.path.basename(REMOTE_BGEN)
LOCAL_BGEN_SAMPLE = os.path.basename(REMOTE_BGEN_SAMPLE)

These CSVs were created via the notebook `AOU_UKB_phenotype_refined.ipynb`.

In [None]:
REMOTE_PHENOTYPES = [
    'gs://fc-secure-440c511e-7fff-417c-9c86-f8ab51bfc618/data/phenotypes/20211223/AOU_HDLmat_lipids_phenotype_alpha2.csv',
    'gs://fc-secure-440c511e-7fff-417c-9c86-f8ab51bfc618/data/phenotypes/20211223/AOU_LDLmat_lipids_phenotype_alpha2.csv',
    'gs://fc-secure-440c511e-7fff-417c-9c86-f8ab51bfc618/data/phenotypes/20211223/AOU_TCmat_lipids_phenotype_alpha2.csv',
    'gs://fc-secure-440c511e-7fff-417c-9c86-f8ab51bfc618/data/phenotypes/20211223/AOU_TGmat_lipids_phenotype_alpha2.csv'
]

LOCAL_PHENOTYPES = [os.path.basename(pheno) for pheno in REMOTE_PHENOTYPES]

In [None]:
RESULT_BUCKET = os.getenv("WORKSPACE_BUCKET")
DATESTAMP = time.strftime('%Y%m%d')

# Outputs
OUTPUT_FILENAME_PREFIX = 'aou_alpha2_lipids'
REGENIE_PHENOTYPES = f'{OUTPUT_FILENAME_PREFIX}_phenotypes_and_covariates.tsv'
REGENIE_SAMPLE_IIDS = f'{OUTPUT_FILENAME_PREFIX}_sample_iids.txt'
REGENIE_OUTPUTS = f'{os.getenv("WORKSPACE_BUCKET")}/data/aou/regenie/{DATESTAMP}/'
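
`os.getenv` returns `None` when `WORKSPACE_BUCKET` is unset, which would silently produce an output path beginning with `None/`. A minimal sketch of a guard follows; `build_output_dir` is a hypothetical helper name and the bucket in the example is made up.

```python
import time


def build_output_dir(bucket, datestamp=None):
    """Return the dated regenie output prefix, failing loudly on a missing bucket.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if not bucket or not bucket.startswith('gs://'):
        raise ValueError(f'Expected a gs:// bucket path, got {bucket!r}')
    if datestamp is None:
        datestamp = time.strftime('%Y%m%d')
    return f'{bucket.rstrip("/")}/data/aou/regenie/{datestamp}/'


# Example with a made-up bucket name.
example_dir = build_output_dir('gs://example-bucket', datestamp='20220101')
print(example_dir)
```

In the notebook this could be called as `build_output_dir(os.getenv("WORKSPACE_BUCKET"))`.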

## Copy data locally for testing

In [None]:
!gsutil cp -n {REMOTE_BGEN} {REMOTE_BGEN_SAMPLE} .

In [None]:
for remote_pheno_file in REMOTE_PHENOTYPES:
    !gsutil cp -n {remote_pheno_file} .

# Reshape the phenotypes for regenie

**TODO(Margaret)** the code in this section was needed previously to reshape the phenotypes into the correct form for regenie. Either modify this code as needed, or remove it, if you update the R notebook to emit a phenotype file that can be used by regenie directly.

## Read in the four CSVs.

In [None]:
raw_pheno_dfs = [pd.read_csv(local_pheno_file) for local_pheno_file in LOCAL_PHENOTYPES]

In [None]:
[p.shape for p in raw_pheno_dfs]

In [None]:
[p.columns for p in raw_pheno_dfs]

In [None]:
raw_pheno_dfs[0].dtypes

In [None]:
pheno_df = pd.merge(raw_pheno_dfs[0],raw_pheno_dfs[1], on="sampleid")
pheno_df = pd.merge(pheno_df,raw_pheno_dfs[2], on="sampleid")
pheno_df = pd.merge(pheno_df,raw_pheno_dfs[3], on="sampleid")

pheno_df.shape
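
The chained merges above leave the shared covariates duplicated with `_x` and `_y` suffixes. One possible alternative, sketched here on toy data, is to bring in only the columns that the left-hand frame does not already have. This assumes the shared covariates are identical across the four files. Note that the resulting columns would be unsuffixed (`sex` rather than `sex_x`), so later cells would need adjusting. `merge_on_sample` is a hypothetical helper name.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the per-phenotype CSVs; the values are made up.
shared = {'sampleid': [1, 2, 3], 'sex': ['F', 'M', 'F'], 'age': [40, 55, 63]}
toy_dfs = [
    pd.DataFrame({**shared, 'HDLnorm': [0.1, -0.3, 1.2]}),
    pd.DataFrame({**shared, 'LDLnorm': [0.5, 0.0, -0.7]}),
    pd.DataFrame({**shared, 'TCnorm': [-1.1, 0.4, 0.2]}),
]


def merge_on_sample(left, right, key='sampleid'):
    """Inner-join on `key`, bringing in only the columns that `left` lacks."""
    new_cols = [c for c in right.columns if c not in left.columns]
    return pd.merge(left, right[[key] + new_cols], on=key)


merged = reduce(merge_on_sample, toy_dfs)
print(list(merged.columns))
```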

In [None]:
#for col in pheno_df.columns:
#    print(col)
#    index_no = pheno_df.columns.get_loc(col)
#    print(index_no)

In [None]:
columns=['eid_y','sex_y','age_y','age2_y','pc1_y','pc2_y','pc3_y','pc4_y','pc5_y','pc6_y','pc7_y','pc8_y','pc9_y','pc10_y','statin_y','cohort_y']
pheno_df = pheno_df.drop(columns=columns)
    
for col in pheno_df.columns:
    print(col)

In [None]:
pheno_df = pheno_df.loc[:,~pheno_df.columns.duplicated()]

In [None]:
pheno_df.shape

In [None]:
for col in pheno_df.columns:
    print(col)

## Join the four CSVs into a single CSV.

In [None]:
#index_col = ['eid', 'sampleid', 'sex', 'age', 'age2', 'pc1', 'pc2', 'pc3', 'pc4',
#             'pc5', 'pc6', 'pc7', 'pc8', 'pc9', 'pc10', 'statin']
#pheno_dfs = [pd.read_csv(local_pheno_file, index_col=index_col) for local_pheno_file in LOCAL_PHENOTYPES]


In [None]:
#[p.shape for p in pheno_dfs]

In [None]:
#pheno_dfs[0].dtypes

In [None]:
#[p.columns for p in pheno_dfs]

In [None]:
# Uncomment to see row level data.
#pheno_dfs[0].head()

In [None]:
#pheno_df = pd.concat(pheno_dfs, axis=1) # , join='inner'
#pheno_df.shape

## Check the resulting dataframe.

In [None]:
pheno_df.dtypes

In [None]:
#pheno_df = pheno_df.reset_index()

In [None]:
pheno_df.head(1)

In [None]:
pheno_df.shape

In [None]:
pheno_df.dtypes

In [None]:
# Uncomment to see row level data.
#pheno_df.head()

In [None]:
# Define variable types
pheno_df = pheno_df.astype({'sampleid': 'int32'})
#pheno_df = pheno_df.astype({'age_x': 'float64'})
#pheno_df = pheno_df.astype({'age2_x': 'float64'})
#pheno_df = pheno_df.astype({'statin_x': 'int64'})
#pheno_df = pheno_df.astype({'TCraw': 'float64'})
#pheno_df = pheno_df.astype({'TGraw': 'float64'})

In [None]:
pheno_df.dtypes

In [None]:
pheno_df.head(10)

In [None]:
pheno_df.groupby('sex_x')['sex_x'].count()

In [None]:
len(pheno_df.sampleid)

In [None]:
len(pheno_df.sampleid.unique())

In [None]:
pheno_df[['HDLnorm', 'LDLnorm', 'TCnorm', 'TGnorm']].describe()

In [None]:
[f'{col}: {pheno_df[col].isnull().sum()}' for col in sorted(pheno_df.columns)]
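
The checks above (total versus unique sample IDs, and per-column missing counts) can be gathered into one function. This is a minimal sketch on toy data; `check_phenotypes` is a hypothetical helper name.

```python
import pandas as pd


def check_phenotypes(df, id_col='sampleid'):
    """Return (number of duplicated IDs, per-column missing counts)."""
    n_duplicated = int(df[id_col].duplicated().sum())
    missing = df.isnull().sum().sort_index()
    return n_duplicated, missing


# Toy data with made-up values: one repeated ID and two missing measurements.
toy = pd.DataFrame({
    'sampleid': [1, 2, 2, 3],
    'HDLnorm': [0.1, None, 0.4, 0.2],
    'LDLnorm': [0.3, 0.5, 0.5, None],
})
n_dup, missing = check_phenotypes(toy)
print(n_dup)
print(missing)
```

A nonzero duplicate count would mean the `FID`/`IID` columns derived from `sampleid` are not unique.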

## Create the TSV for regenie.

### Add the FID and IID columns.

In [None]:
pheno_df['FID'] = pheno_df['sampleid']

In [None]:
pheno_df['IID'] = pheno_df['FID']

In [None]:
len(pheno_df.IID)

In [None]:
len(pheno_df.IID.unique())

In [None]:
IIDs = pheno_df.IID.unique()
len(IIDs)

In [None]:
pheno_df[['IID']].to_csv(
    REGENIE_SAMPLE_IIDS,
    sep='\t',
    na_rep='NA',
    index=False
)

In [None]:
!head {REGENIE_SAMPLE_IIDS}

### Fill in NA covariates

In [None]:
pheno_df['sex_x'] = pheno_df['sex_x'].fillna('other')

In [None]:
pheno_df.groupby('sex_x')['sex_x'].count()

In [None]:
[f'{col}: {pheno_df[col].isnull().sum()}' for col in sorted(pheno_df.columns)]

### Then write the TSV to disk so that regenie can read it.

In [None]:
pheno_df.columns

In [None]:
pheno_df[['FID', 'IID', 'sex_x', 'age_x', 'age2_x',
          'pc1_x', 'pc2_x', 'pc3_x', 'pc4_x', 'pc5_x', 'pc6_x', 'pc7_x', 'pc8_x', 'pc9_x', 'pc10_x',
          'HDLnorm', 'LDLnorm', 'TCnorm', 'TGnorm']].to_csv(
    REGENIE_PHENOTYPES,
    sep='\t',
    na_rep='NA',
    index=False
)
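
The TSV written above starts with `FID` and `IID` columns and encodes missing values as `NA`. A minimal sketch of a check for that layout, using only the standard library, is shown below. `validate_regenie_table` is a hypothetical helper name and the toy rows are made up.

```python
import csv
import io


def validate_regenie_table(handle, numeric_cols):
    """Check the FID/IID header and that numeric columns hold numbers or 'NA'.

    Returns the number of data rows read. Hypothetical helper for illustration.
    """
    reader = csv.DictReader(handle, delimiter='\t')
    if reader.fieldnames is None or reader.fieldnames[:2] != ['FID', 'IID']:
        raise ValueError(f'First two columns must be FID and IID, got {reader.fieldnames}')
    n_rows = 0
    for row in reader:
        for col in numeric_cols:
            value = row[col]
            if value != 'NA':
                float(value)  # Raises ValueError on a non-numeric entry.
        n_rows += 1
    return n_rows


# Toy table with made-up values, in the same layout as the TSV written above.
toy_tsv = 'FID\tIID\tsex_x\tage_x\tHDLnorm\n1\t1\tF\t40\t0.1\n2\t2\tM\t55\tNA\n'
n_rows = validate_regenie_table(io.StringIO(toy_tsv), ['age_x', 'HDLnorm'])
print(n_rows)
```

On the real file this could be called as `validate_regenie_table(open(REGENIE_PHENOTYPES), ['age_x', 'HDLnorm', ...])`.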

In [None]:
# Shows row-level data; comment out the next line to suppress it.
!head {REGENIE_PHENOTYPES}

In [None]:
!gsutil cp {REGENIE_PHENOTYPES} {REGENIE_OUTPUTS}
!gsutil cp {REGENIE_SAMPLE_IIDS} {REGENIE_OUTPUTS}

In [None]:
!gsutil ls {REGENIE_OUTPUTS}

### Check BGEN file

In [None]:
!head {LOCAL_BGEN_SAMPLE}

In [None]:
from bgen_reader import read_bgen

In [None]:
bgen = read_bgen(LOCAL_BGEN, verbose=True)

In [None]:
# Variants metadata.
print(bgen["variants"].head())

In [None]:
# Samples read from the bgen file.
print(bgen["samples"].head())

In [None]:
# Last few samples read from the bgen file.
print(bgen["samples"].tail())

In [None]:
# There are X variants in total.
print(len(bgen["genotype"]))

In [None]:
# For performance and memory reasons, this library avoids accessing the bgen
# file as much as possible. The `compute` function tells the library to
# actually read the file and retrieve the data.
geno = bgen["genotype"][0].compute()
print(geno.keys())
# Let's have a look at the probabilities regarding the first variant.
print(geno["probs"])
# The above matrix is of size samples-by-(combination-of-alleles).
print(geno["probs"].shape)
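
For a biallelic variant in diploid samples, the probability matrix has three columns, one per genotype. A minimal sketch of collapsing it to an expected allele dosage is shown below. It assumes the column order is (homozygous first allele, heterozygous, homozygous second allele) and uses made-up probabilities; `probs_to_dosage` is a hypothetical helper name.

```python
import numpy as np


def probs_to_dosage(probs):
    """Expected count of the second allele from an (n_samples, 3) probability matrix."""
    probs = np.asarray(probs, dtype=float)
    if probs.ndim != 2 or probs.shape[1] != 3:
        raise ValueError('Expected shape (n_samples, 3) for a biallelic diploid variant')
    # Weight each genotype column by the number of second-allele copies it carries.
    return probs @ np.array([0.0, 1.0, 2.0])


# Made-up probabilities for three samples.
toy_probs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.25, 0.5, 0.25],
]
dosage = probs_to_dosage(toy_probs)
print(dosage)
```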

# Variant QC via PLINK

Use [plink2 to perform the variant QC](https://rgcgithub.github.io/regenie/recommendations/#exclusion-files) and obtain a subset of SNPs roughly equal to the number of samples.

In [None]:
!/tmp/plink2/plink2 \
  --bgen {LOCAL_BGEN} ref-first \
  --chr 1-22 \
  --geno 0.1 \
  --mind 0.1 \
  --maf 0.001 \
  --hwe 1e-15 \
  --write-snplist \
  --write-samples \
  --no-id-header \
  --keep {REGENIE_SAMPLE_IIDS} \
  --out {OUTPUT_FILENAME_PREFIX}_plink

# This ended up producing the following error in Step 1:
# Uh-oh, SNP chr1_939398_GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA_G has low variance (=0.000000).
#  --mac 100 \
#  --maf 0.01
#  --maf 0.05

In [None]:
!ls -lth . | head

In [None]:
!head {OUTPUT_FILENAME_PREFIX}_plink.id

In [None]:
!tail {OUTPUT_FILENAME_PREFIX}_plink.id

In [None]:
!wc -l {OUTPUT_FILENAME_PREFIX}_plink.id

In [None]:
!head {OUTPUT_FILENAME_PREFIX}_plink.snplist

In [None]:
!tail {OUTPUT_FILENAME_PREFIX}_plink.snplist

In [None]:
!wc -l {OUTPUT_FILENAME_PREFIX}_plink.snplist
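
The goal stated above is a SNP subset roughly equal in size to the number of samples. A minimal sketch of checking that ratio from the two plink2 output files follows, assuming neither file has a header line (as with `--no-id-header`). The helper names and the toy files are hypothetical.

```python
import os
import tempfile


def count_lines(path):
    """Count lines in a text file without loading it all into memory."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def snp_to_sample_ratio(snplist_path, id_path):
    """Number of retained variants divided by the number of retained samples."""
    return count_lines(snplist_path) / count_lines(id_path)


# Toy files with made-up IDs, standing in for the plink2 outputs.
with tempfile.TemporaryDirectory() as tmp:
    snplist = os.path.join(tmp, 'toy.snplist')
    ids = os.path.join(tmp, 'toy.id')
    with open(snplist, 'w') as handle:
        handle.write('rs1\nrs2\nrs3\n')
    with open(ids, 'w') as handle:
        handle.write('1\n2\n')
    ratio = snp_to_sample_ratio(snplist, ids)
print(ratio)
```

A ratio far from 1 would suggest tightening or loosening the QC thresholds.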

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}* {REGENIE_OUTPUTS}

# regenie

This work is based on https://github.com/briansha/Regenie_WDL/blob/master/regenie.wdl

See also:
* regenie documentation https://rgcgithub.github.io/regenie/options/#input
* dsub documentation https://github.com/DataBiosphere/dsub/blob/main/docs/input_output.md

## Step 1

From https://rgcgithub.github.io/regenie/overview/:
> In the first step a subset of genetic markers are used to fit a whole genome regression model that captures a good fraction of the phenotype variance attributable to genetic effects.

In [None]:
# Parameters to add
# 8 core machine
# 11 GB ram
# 500 GB disk

!./regenie_v2.2.4.gz_x86_64_Linux \
    --step 1 \
    --bgen={LOCAL_BGEN} \
    --ref-first \
    --sample={LOCAL_BGEN_SAMPLE} \
    --phenoFile={REGENIE_PHENOTYPES} \
    --phenoColList=LDLnorm,HDLnorm,TCnorm,TGnorm \
    --covarFile={REGENIE_PHENOTYPES} \
    --catCovarList=sex_x \
    --covarColList=age_x,age2_x,pc1_x,pc2_x,pc3_x,pc4_x,pc5_x,pc6_x,pc7_x,pc8_x,pc9_x,pc10_x \
    --extract {OUTPUT_FILENAME_PREFIX}_plink.snplist \
    --bsize 1000 \
    --verbose \
    --out {OUTPUT_FILENAME_PREFIX}_regenie_part1


# Note that no samples were omitted by the QC step, so we are leaving out this file since regenie 
# complained that it did not match the samples in the BGEN file.
#     --keep {OUTPUT_FILENAME_PREFIX}.id \
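
Step 1 and Step 2 need to name the same phenotype and covariate columns, and those names must match the header of the phenotype TSV. One way to keep them in sync is to build the shared arguments once in Python. This is a sketch only: `shared_regenie_args` is a hypothetical helper, the file names are placeholders, and only flags already used in this notebook appear.

```python
import shlex

PHENOTYPES = ['LDLnorm', 'HDLnorm', 'TCnorm', 'TGnorm']
CAT_COVARIATES = ['sex_x']
COVARIATES = ['age_x', 'age2_x'] + [f'pc{i}_x' for i in range(1, 11)]


def shared_regenie_args(bgen, sample, pheno_file):
    """Arguments that should be identical in Step 1 and Step 2."""
    return [
        f'--bgen={bgen}',
        '--ref-first',
        f'--sample={sample}',
        f'--phenoFile={pheno_file}',
        f'--phenoColList={",".join(PHENOTYPES)}',
        f'--covarFile={pheno_file}',
        f'--catCovarList={",".join(CAT_COVARIATES)}',
        f'--covarColList={",".join(COVARIATES)}',
    ]


# Placeholder file names, for illustration only.
args = shared_regenie_args('toy.bgen', 'toy.sample', 'toy_phenotypes.tsv')
step1_cmd = ['./regenie_v2.2.4.gz_x86_64_Linux', '--step', '1'] + args
print(shlex.join(step1_cmd))
```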


In [None]:
!ls -lth . | head

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}* {REGENIE_OUTPUTS}

In [None]:
!gsutil ls {REGENIE_OUTPUTS}

## Step 2

From https://rgcgithub.github.io/regenie/overview/:
> In the second step, a larger set of genetic markers (e.g. imputed markers) are tested for association with the phenotype conditional upon the prediction from the regression model in Step 1, using a leave one chromosome out (LOCO) scheme, that avoids proximal contamination.

In [None]:
!./regenie_v2.2.4.gz_x86_64_Linux \
    --step 2 \
    --bgen={LOCAL_BGEN} \
    --ref-first \
    --sample={LOCAL_BGEN_SAMPLE} \
    --phenoFile={REGENIE_PHENOTYPES} \
    --phenoColList=LDLnorm,HDLnorm,TCnorm,TGnorm \
    --covarFile={REGENIE_PHENOTYPES} \
    --catCovarList=sex_x \
    --covarColList=age_x,age2_x,pc1_x,pc2_x,pc3_x,pc4_x,pc5_x,pc6_x,pc7_x,pc8_x,pc9_x,pc10_x \
    --firth --approx \
    --pthresh 0.01 \
    --pred {OUTPUT_FILENAME_PREFIX}_regenie_part1_pred.list \
    --bsize 400 \
    --out {OUTPUT_FILENAME_PREFIX}_regenie_part2
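
The leave-one-chromosome-out (LOCO) idea quoted above can be illustrated with a toy calculation: when testing variants on a given chromosome, the genetic prediction used as an offset excludes that chromosome's own contribution. This is a conceptual sketch with random made-up numbers, not regenie's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, chromosomes = 5, [1, 2, 3]

# Made-up per-chromosome contributions to each sample's genetic prediction.
per_chrom = {c: rng.normal(size=n_samples) for c in chromosomes}
total = sum(per_chrom.values())

# LOCO prediction for chromosome c: everything except chromosome c's own contribution.
loco = {c: total - per_chrom[c] for c in chromosomes}

print(loco[1])
```

Excluding the tested chromosome keeps a variant's own signal out of the prediction it is tested against.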

In [None]:
!ls -lth {OUTPUT_FILENAME_PREFIX}*

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}* {REGENIE_OUTPUTS}

In [None]:
!gsutil ls {REGENIE_OUTPUTS}
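
Once Step 2 finishes, the association results can be loaded for inspection. The sketch below assumes a space-delimited results table with a `LOG10P` column holding -log10(p), as described in the regenie documentation. The rows are made up and show only a subset of columns.

```python
import io

import pandas as pd

# Toy rows with made-up values; the column names are assumed from the regenie docs.
toy_results = io.StringIO(
    'CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ N TEST BETA SE CHISQ LOG10P\n'
    '1 1000 var1 A G 0.10 5000 ADD 0.02 0.01 4.0 1.34\n'
    '1 2000 var2 C T 0.25 5000 ADD 0.30 0.03 100.0 8.00\n'
)

results = pd.read_csv(toy_results, sep=' ')

# Convert -log10(p) back to a p-value.
results['P'] = 10.0 ** (-results['LOG10P'])

# Conventional genome-wide significance threshold.
significant = results[results['P'] < 5e-8]
print(significant[['ID', 'P']])
```

On real output, `toy_results` would be replaced by the path to one of the per-phenotype result files.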

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze