# Perform LD clumping analysis

## As-Is Software Disclaimer

This notebook is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

[MIT License](https://github.com/dnanexus/UKB_RAP/blob/main/LICENSE) applies to this notebook.

## JupyterLab app details (launch configuration)

Recommended configuration
- Runtime: ~20 min
- Cluster configuration: `Single Node`
- Recommended instance: `mem2_ssd1_v2_x32`
- Cost: ~£0.5

## Dependencies

|Library |License|
|:------------- |:-------------|
|[pandas](https://pandas.pydata.org/) |[BSD-3](https://github.com/pandas-dev/pandas/blob/main/LICENSE)|
|[numpy](https://numpy.org/) |[BSD-3](https://github.com/numpy/numpy/blob/main/LICENSE.txt)|
|[bgenix](https://enkre.net/cgi-bin/code/bgen/doc/trunk/doc/wiki/bgenix.md) | [Boost Software License (MIT-like)](https://enkre.net/cgi-bin/code/bgen/file?name=LICENSE_1_0.txt&ci=trunk)|
|[PLINK](https://www.cog-genomics.org/plink/1.9/) |[GPL](https://github.com/chrchang/plink-ng/blob/master/1.9/LICENSE)|
|[PLINK2](https://www.cog-genomics.org/plink/2.0/) |[GPL](https://github.com/chrchang/plink-ng/blob/master/2.0/COPYING)|



## Introduction

This notebook:
- Extracts significant GWAS results
- Subsets the BGEN file for each chromosome, keeping only significant variants
- Converts BGEN to PLINK
- Performs LD clumping
- Uploads results to UKB RAP

## Prepare environment

Uncomment the install commands if you are comfortable with each library's license and want to install it and run the parts of the notebook that depend on it.

In [None]:
%%capture captured
%%bash
# Install PLINK
#cd /opt/notebooks
#wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20230116.zip
#unzip -o plink_linux_x86_64_20230116.zip

In [None]:
!./plink --version

In [None]:
%%capture captured
%%bash
# Install bgenix
#cd /opt/notebooks
#wget http://code.enkre.net/bgen/tarball/release/bgen.tgz
#tar xvfz bgen.tgz > /dev/null
#cd bgen.tgz/
#./waf configure 
#./waf 
#./build/test/unit/test_bgen
#./build/apps/bgenix -g example/example.16bits.bgen -list
#cd /opt/notebooks

In [None]:
%%capture captured
%%bash
# Install PLINK2
#cd /opt/notebooks
#wget https://s3.amazonaws.com/plink2-assets/alpha3/plink2_linux_avx2_20220814.zip
#unzip -o plink2_linux_avx2_20220814.zip

In [None]:
!./plink2 --version
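
The later cells call `./plink`, `./plink2` and `/opt/notebooks/bgen.tgz/build/apps/bgenix` by path, so it can help to confirm that all three exist before starting the long-running loop. The check below is a minimal sketch; the helper name `missing_tools` is hypothetical and not part of the original notebook.

```python
import os

def missing_tools(paths):
    """Return the paths that are not existing, executable files (hypothetical helper)."""
    return [p for p in paths if not (os.path.isfile(p) and os.access(p, os.X_OK))]

# Paths as used by the cells in this notebook
tools = ["./plink", "./plink2", "/opt/notebooks/bgen.tgz/build/apps/bgenix"]
missing = missing_tools(tools)
print("Missing tools:", missing if missing else "none")
```

An empty list means the install cells above ran successfully, or the tools were already present.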

In [None]:
import glob
import numpy as np 
import pandas as pd  
import os
import shutil
import subprocess

In [None]:
! dx download -f /Data/gwas_results_imputed_gel/ischemia_cc.REGENIE_WGR_additive.REGENIE_PLOTS.lmm.tsv.gz

In [None]:
! dx download -f /Data/ischemia_df.phe

In [None]:
pd.set_option('display.max_columns', 500)
# None disables column-width truncation (-1 is deprecated in recent pandas versions)
pd.set_option('display.max_colwidth', None)

In [None]:
imputation_folder = 'Imputation from genotype (GEL)'
imputation_field_id = '21008'
output_dir = '/Data/'

In [None]:
%%bash
# Create symlink for imputed data
DIR='/mnt/project/Bulk/Imputation/Imputation*from*genotype*(GEL)'
ln -sf $DIR /opt/notebooks/imputed
DIR2=/mnt/project/Bulk-DRL/GEL_imputed_sample_files_fixed/
ln -sf $DIR2 /opt/notebooks/samples
DIR3=/mnt/project/Data/gel_impute_qc/
ln -sf $DIR3 /opt/notebooks/keepsnps

## Load GWAS results data

In [None]:
gwas = pd.read_csv("ischemia_cc.REGENIE_WGR_additive.REGENIE_PLOTS.lmm.tsv.gz", sep='\t', compression='gzip')
gwas.head()

In [None]:
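# Subset GWAS results to genome-wide significant variants (p < 5e-8) that have an rsID
# na=False treats a missing variant name as "not an rsID"
sig_gwas = gwas.loc[(gwas['Pval'] < 5e-8) & (gwas['Name'].str.startswith('rs', na=False))].sort_values(by='Pval')
sig_gwas["Chr"] = sig_gwas["Chr"].astype(str)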
sig_gwas.head()

In [None]:
sig_gwas.Chr.unique()

In [None]:
sig_gwas.shape

In [None]:
sig_gwas.to_csv('significant_variants.txt', index=False, sep='\t')

In [None]:
pheno = pd.read_csv("ischemia_df.phe", sep='\t')
pheno.head()

In [None]:
pheno[['FID']].to_csv('eids_to_keep.txt', index=False, sep='\t', header=False)

## Run LD clumping
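
Conceptually, clumping walks through the variants from most to least significant, makes each not-yet-assigned variant an index variant, and attaches to it the nearby variants that are in LD with it. The toy function below sketches that greedy idea with the same thresholds passed to PLINK in the next cell (`--clump-p1 1`, `--clump-r2 0.1`, `--clump-kb 250`). It is a simplified illustration, not PLINK's implementation: the function name, the precomputed `r2` matrix, and the four variants are all made up for the example, and PLINK's secondary p-value filter (`--clump-p2`) is left out.

```python
import numpy as np

def greedy_clump(positions, pvalues, r2, p1=1.0, r2_threshold=0.1, window_kb=250):
    """Toy greedy clumping for one chromosome.

    Returns {index_variant: [variants clumped with it]} using array positions as IDs.
    """
    positions = np.asarray(positions)
    pvalues = np.asarray(pvalues)
    r2 = np.asarray(r2)
    order = np.argsort(pvalues, kind="stable")  # most significant first
    assigned = np.zeros(len(pvalues), dtype=bool)
    clumps = {}
    for idx in order:
        # Skip variants already in a clump or not significant enough to be an index
        if assigned[idx] or pvalues[idx] > p1:
            continue
        assigned[idx] = True
        near = np.abs(positions - positions[idx]) <= window_kb * 1000
        linked = near & (r2[idx] >= r2_threshold) & ~assigned
        clumps[int(idx)] = [int(j) for j in np.flatnonzero(linked)]
        assigned |= linked
    return clumps

# Hypothetical toy data: four variants, positions in base pairs
positions = [100_000, 150_000, 180_000, 900_000]
pvalues = [1e-12, 3e-9, 2e-10, 4e-9]
r2 = np.array([
    [1.00, 0.80, 0.05, 0.00],
    [0.80, 1.00, 0.02, 0.00],
    [0.05, 0.02, 1.00, 0.00],
    [0.00, 0.00, 0.00, 1.00],
])
clumps = greedy_clump(positions, pvalues, r2)
print(clumps)  # {0: [1], 2: [], 3: []}
```

Variant 1 is absorbed by variant 0 because the two are close together and strongly correlated. Variant 2 is also nearby but falls below the r² threshold, and variant 3 lies outside the 250 kb window, so both become index variants of their own.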

In [None]:
# Iterate over the chromosomes that have significant variants
chromosomes = sig_gwas['Chr'].unique()

# Maximum number of rsIDs passed to bgenix per call
BATCH_SIZE = 10000

# Start from an empty working directory
if os.path.exists('tmp'):
    shutil.rmtree('tmp')

os.mkdir('tmp')

# Sorting with int assumes that all chromosome labels are numeric
for chromosome in sorted(chromosomes, key=int):
    # Some SNP blocks were corrupted; exclude SNPs from those regions
    rs_names = sig_gwas[(sig_gwas['Chr'] == chromosome) & (~sig_gwas['Name'].isin(['rs934198', 'rs17062991']))]['Name']

    print(chromosome)

    for i, start_idx in enumerate(range(0, len(rs_names), BATCH_SIZE)):
        # Positional slice, independent of the Series index labels
        rs_names_batch = rs_names.iloc[start_idx:start_idx + BATCH_SIZE]
        print(chromosome, start_idx, len(rs_names_batch))
        rs_names_batch.to_csv('rs_batch.txt', header=False, index=False)

        new_bgen_name = f'tmp/{chromosome}_{i}.bgen'
        plink_output_prefix = f'tmp/plink_{chromosome}_{i}'
        # The doubled suffix matches the file names expected by the merge step
        plink_ld_output_prefix = f'tmp/plink_{chromosome}_{i}_ld_clumped_ld_clumped'

        print('Extract significant rsIDs')
        # bgenix writes the subset BGEN to stdout. stderr is discarded: check_call
        # never reads a pipe, so stderr=PIPE could block once the pipe buffer fills.
        with open(new_bgen_name, 'wb') as new_bgen:
            subprocess.check_call(['/opt/notebooks/bgen.tgz/build/apps/bgenix', '-g', f'imputed/ukb21008_c{chromosome}_b0_v1.bgen',
                            '-incl-rsids', 'rs_batch.txt'], stdout=new_bgen, stderr=subprocess.DEVNULL)

        print('Make PLINK files')
        subprocess.check_call(['./plink2', '--bgen', new_bgen_name, 'ref-first', '--sample',
                           f'samples/ukb21008_c{chromosome}_b0_v1.sample', '--rm-dup', 'force-first', '--out', plink_output_prefix,
                           '--keep-fam', 'eids_to_keep.txt', '--make-bed'])

        print(f'Perform LD clumping - output {plink_ld_output_prefix}')
        subprocess.check_call([
            './plink', '--bfile', plink_output_prefix,
            '--extract', f'keepsnps/ukb21008_c{chromosome}_b0_v1_qc_pass.snplist', '--keep-fam', 'eids_to_keep.txt',
            '--clump-p1', '1', '--clump-r2', '0.1', '--clump-kb', '250', '--clump', 'significant_variants.txt',
            '--clump-snp-field', 'Name', '--clump-field', 'Pval', '--out', plink_ld_output_prefix
        ])


Merge the per-chromosome (and per-batch) results into one file.

In [None]:
%%bash
head -n 1 tmp/plink_1_0_ld_clumped_ld_clumped.clumped > plink_all_ld_clumped_ld_clumped.clumped
tail -n +2 -q tmp/plink_*_ld_clumped_ld_clumped.clumped | head -n -2 >> plink_all_ld_clumped_ld_clumped.clumped
sed -i '/^$/d' plink_all_ld_clumped_ld_clumped.clumped
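
The shell commands above take the header from `tmp/plink_1_0_ld_clumped_ld_clumped.clumped`, so they rely on chromosome 1 having produced a result file. As a possible alternative, the sketch below merges the same whitespace-delimited tables with pandas and does not depend on any particular chromosome. The function name `merge_clumped` and the two toy input files are hypothetical; the column names follow the PLINK 1.9 `.clumped` layout.

```python
import glob
import os
import tempfile

import pandas as pd

def merge_clumped(pattern):
    """Concatenate whitespace-delimited .clumped tables matching a glob pattern."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # Blank lines at the end of each file are skipped by read_csv by default
    frames = [pd.read_csv(path, sep=r"\s+") for path in paths]
    merged = pd.concat(frames, ignore_index=True)
    return merged.sort_values(["CHR", "BP"]).reset_index(drop=True)

# Hypothetical toy files standing in for the per-chromosome PLINK output
header = " CHR    F   SNP     BP        P  TOTAL  NSIG  S05  S01  S001  S0001  SP2\n"
rows = {
    "plink_2_0_ld_clumped_ld_clumped.clumped": "   2    1   rsB   5000   1e-09   0  0  0  0  0  0  NONE\n",
    "plink_1_0_ld_clumped_ld_clumped.clumped": "   1    1   rsA   1000   1e-10   1  0  0  0  0  1  rsC(1)\n",
}
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, row in rows.items():
        with open(os.path.join(tmp_dir, name), "w") as fh:
            fh.write(header + row + "\n\n")
    merged = merge_clumped(os.path.join(tmp_dir, "plink_*_ld_clumped_ld_clumped.clumped"))

print(merged[["CHR", "SNP", "BP", "P"]])
```

In the notebook itself the call would be `merge_clumped('tmp/plink_*_ld_clumped_ld_clumped.clumped')`, followed by `to_csv` to write the combined table. Note that this writes a delimited text file rather than PLINK's fixed-width layout.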

In [None]:
!dx upload plink_all_ld_clumped_ld_clumped.clumped --path /Data/LD_clump/

## Output files

- Table with the results of LD clumping, one row per index variant (`plink_all_ld_clumped_ld_clumped.clumped`)