# Identifying loci of interest from association study results

The aim of this workflow is to select subsets of SNPs of potential interest (low p-value) from GWAS results for follow-up analysis such as fine-mapping. We use LD clumping to account for LD between SNPs. The outcome of LD clumping is a set of clumps, each containing SNPs in LD with one another and representing the association signal at one locus of interest. SNPs in different clumps are approximately independent at the chosen $r^2$ threshold. The SNPs in a clump can be used to establish the boundaries of a locus to be investigated further.
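
To make the idea concrete, the following is a minimal Python sketch of greedy clumping on toy data. It is not PLINK's implementation: the function name `clump`, the toy SNPs and the `r2` lookup are all hypothetical, and each SNP is assigned to at most one clump (the workflow below instead passes `--clump-allow-overlap`).

```python
from typing import Callable, Dict, List, Tuple

# (variant id, chromosome, base-pair position, p-value)
Snp = Tuple[str, str, int, float]


def clump(snps: List[Snp], r2: Callable[[str, str], float],
          p1: float = 5e-8, p2: float = 1.0,
          r2_min: float = 0.3, kb: int = 2000) -> Dict[str, List[str]]:
    """Greedy clumping sketch: returns {index SNP id: [ids clumped with it]}."""
    by_p = sorted(snps, key=lambda s: s[3])  # most significant first
    assigned = set()
    clumps: Dict[str, List[str]] = {}
    for idx_id, idx_chr, idx_pos, idx_p in by_p:
        if idx_p > p1:
            break  # all remaining SNPs are less significant than p1
        if idx_id in assigned:
            continue  # already absorbed by an earlier clump
        assigned.add(idx_id)
        members = []
        for sid, chrom, pos, p in by_p:
            if sid in assigned or chrom != idx_chr:
                continue
            in_window = abs(pos - idx_pos) <= kb * 1000
            if in_window and p <= p2 and r2(idx_id, sid) >= r2_min:
                members.append(sid)
                assigned.add(sid)
        clumps[idx_id] = members
    return clumps


# Hypothetical toy data: two signals on chromosome 1, one SNP on chromosome 2.
toy_snps = [
    ("rs1", "1", 1_000, 1e-10),
    ("rs2", "1", 2_000, 1e-4),
    ("rs3", "1", 5_000_000, 1e-9),
    ("rs4", "1", 5_001_000, 0.5),
    ("rs5", "2", 1_000, 0.2),
]
toy_ld = {frozenset(("rs1", "rs2")): 0.8, frozenset(("rs3", "rs4")): 0.1}


def toy_r2(a: str, b: str) -> float:
    # Pairs not listed in toy_ld are treated as unlinked.
    return toy_ld.get(frozenset((a, b)), 0.0)


result = clump(toy_snps, toy_r2)
print(result)
```

Here `rs2` joins the clump of `rs1` because it is nearby and in strong LD with it, whereas `rs4` stays out of the clump of `rs3` because their $r^2$ is below the threshold.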

LD clumping is implemented in PLINK. A practical challenge is that some LD reference genotype data, such as that from UK Biobank, is too large to efficiently use for LD clumping. We implemented this pipeline to allow for determining LD based only on selected samples. We have successfully applied the procedure in the UK Biobank data we have analyzed.

**A note to readers:** please read through the "Input", "Parameter settings for clumping" and "Output" sections, then follow the instructions in "Minimal working example illustration" to complete an exercise analysis with the minimal working example data we provide. The code implementing the analysis makes up the rest of the notebook; interested readers may read the section "Command interface" and onwards to learn about the implementation.

## Input

- GWAS summary statistics
- LD reference panel

## Parameter settings for clumping

1. Which reference dataset to use? 

    * Public data such as the 1000 Genomes Project (e.g. our `1000G_CEU` bundle), HapMap (e.g. our `hapmap_CEU_r23a_filtered` bundle), UK10K, or the HRC reference panel
    * In-sample LD: use the same genotype files from which the GWAS results were generated, if available
    
2. What is the significance threshold for the index variant (p1) we should use for the analyses? 
    
    p=5e-08
    
3. What significance threshold to use for the SNPs to be clumped? 
   
   p=1 (this will include all the SNPs in LD with the index SNP)
   
4. What LD $r^2$ to use? 
   
   r2=0.3 or even lower to capture bigger LD blocks
   
5. What window size in kb to use? 
   
   We use 2Mb in this analysis, which is much larger than the average LD block size in humans
   
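Below are the relevant PLINK options. The numeric values shown for the first four are PLINK's defaults; `P_value` is an example column name (PLINK's default field is `P`), and the remaining entries are optional flags.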

```
--clump-p1 0.0001: significance threshold for Index SNPs
--clump-p2 0.01: Secondary significance threshold for clumped SNPs
--clump-r2 0.50: LD threshold for clumping
--clump-kb 250: Physical distance threshold for clumping
--clump-field P_value: To specify the name of the field for P-value
--clump-verbose: to add a more detailed report of SNPs in each clump
--clump-best: to select the single best proxy
--clump-allow-overlap: allow for overlap between clumped regions
```
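
As an illustration of how the recommended settings above map onto these flags, here is a small Python sketch that assembles a PLINK command line. The helper name `build_clump_command` and the file names are hypothetical; only flags that appear in this notebook are used, and the command is printed rather than executed.

```python
import shlex
from typing import List


def build_clump_command(bfile: str, sumstats: List[str], field: str,
                        p1: float = 5e-8, p2: float = 1.0,
                        r2: float = 0.3, kb: int = 2000) -> List[str]:
    """Assemble (but do not run) a PLINK 1.9 clumping command."""
    return [
        "plink",
        "--bfile", bfile,
        "--clump", ",".join(sumstats),  # several files are comma-separated
        "--clump-field", field,         # name of the p-value column
        "--clump-p1", str(p1),          # index SNP significance threshold
        "--clump-p2", str(p2),          # threshold for SNPs to be clumped
        "--clump-r2", str(r2),          # LD threshold
        "--clump-kb", str(kb),          # window size in kb
    ]


# Hypothetical inputs, for illustration only.
cmd = build_clump_command("ref_geno", ["trait.snp_stats.gz"], "P")
print(" ".join(shlex.quote(arg) for arg in cmd))
```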

## Output

A list of regions in the following format:

```
chr start end top_snp all_snps
```

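where the last column, `all_snps`, lists the SNP IDs in the LD cluster. Note that the `.clumped_region` file written by the current implementation (see the example below) contains only the first three columns.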

## Minimal working example illustration

We demonstrate this pipeline on a minimal working example (MWE) dataset generated by the `LMM.ipynb` workflow and available on [Google Drive](https://drive.google.com/drive/u/0/folders/1h931uJPKuCQyh_vi08Xfjn2NnzI2AZmt) (access granted upon request). Being a toy example, the dataset does not have a strong GWAS signal, so we illustrate clumping with a relaxed index SNP threshold of p-value < 5E-4 (`--clump-p1 0.0005`). LD is computed from 200 samples for illustration.

In [1]:
sos run LD_Clumping.ipynb \
    --cwd clumping_output \
    --genoFile data/1000G.EUR.mwe.pruned.bed \
    --sampleFile data/1000G.EUR.mwe.pruned.fam \
    --sumstatsFiles data/1000G.EUR.pheno_x.regenie.snp_stats.gz \
    --clump-p1 0.0005 \
    --clump-field P \
    --clump-annotate BP \
    --ld-sample-size 200 \
    --reference-genotype-prefix mwe

INFO: Running default: Perform LD-clumping in PLINKv1.9
INFO: Running reference_1: 
INFO: Running filter_plink: Select a subset of samples from the plink files
INFO: filter_plink is completed.
INFO: filter_plink output:   clumping_output/cache/1000G.EUR.mwe.pruned.200.bed
INFO: reference_1 is completed.
INFO: reference_1 output:   clumping_output/cache/1000G.EUR.mwe.pruned.200.bed
INFO: Running reference_2: Merge all the .bed files into one reference file to use in clumping step
INFO: reference_2 is completed.
INFO: reference_2 output:   clumping_output/mwe.200.ref_geno.bed
INFO: default is completed.
INFO: default output:   clumping_output/1000G.EUR.pheno_x.regenie.snp_stats.clumped clumping_output/1000G.EUR.pheno_x.regenie.snp_stats.clumped_region
INFO: Workflow default (ID=wc1f850dcba7e55a1) is executed su

The clumped regions can be found as:

In [2]:
head clumping_output/1000G.EUR.pheno_x.regenie.snp_stats.clumped_region

1 248715699 249172500
2 152287831 152331133
2 211348599 211499959
4 11788538 11807469
4 132424256 132586858
4 166669197 167485564
5 32500390 32625480
5 100665906 100883179
5 145830216 145926587
6 4526996 4551432


These regions can be used in subsequent workflows for refined analysis including statistical fine-mapping.
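
As a sketch of how a downstream step might consume this file, the Python snippet below parses white-space separated `chr start end` lines into typed records and reports each region's width. The `Region` type and `parse_regions` helper are hypothetical names; the two example lines are copied from the output above.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_regions(text: str) -> List[Region]:
    """Parse 'chr start end' lines, skipping blank or incomplete lines."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        regions.append(Region(fields[0], int(fields[1]), int(fields[2])))
    return regions


example = """1 248715699 249172500
2 152287831 152331133
"""
regions = parse_regions(example)
for region in regions:
    # Width in base pairs of each clumped region
    print(region.chrom, region.start, region.end, region.end - region.start)
```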

## Command interface

In [None]:
sos run LD_Clumping.ipynb -h

## Global parameter settings

In [None]:
[global]
# Working directory: change accordingly
parameter: cwd = path
# Path to bgen or plink files
parameter: genoFile = paths
# Path to sample files
parameter: sampleFile = path
# Path to summary stats file
parameter: sumstatsFiles = paths
# Path to samples of unrelated individuals
parameter: unrelated_samples = path(".")
# Reference genotype file
parameter: reference_genotype_prefix = str
# Number of samples to use to compute LD
parameter: ld_sample_size = 2000
# Clumping parameters
parameter: clump_field = str
parameter: clump_annotate = ""
parameter: clump_p1 = 5e-08
parameter: clump_p2 = 0.01
# r2 = 0.04 => r = 0.2
parameter: clump_r2 = 0.04
parameter: clump_kb = 2000
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# File names for clumping
parameter: clumpFile = path(f'{cwd}/' + "_".join([f'{x:bn}' for x in sumstatsFiles]) + '.clumped')
# Output the bgen file with 8bit formatting
#parameter: bgen_bits=16
# Specific number of threads to use
parameter: numThreads = 5
# Specify the container to use
parameter: container = ''
if not container:
    container = None
if unrelated_samples == path("."):
    unrelated_samples = sampleFile
clumpregionFile = f'{clumpFile:n}.clumped_region'

## Select a random subset of unrelated samples

The unrelated sample file should be a text file containing white space separated columns. It should have a column named `IID` for sample IDs.
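
The selection logic of this step can be sketched in Python as follows. This mirrors the R code in the next cell in spirit only: the function name `select_ld_samples` is hypothetical, IDs are passed in as plain lists rather than read from files, and Python's random generator will not reproduce the samples drawn by R's `set.seed(1)`.

```python
import random
from typing import List


def select_ld_samples(available_ids: List[str], unrelated_ids: List[str],
                      ld_sample_size: int, seed: int = 1) -> List[str]:
    """Keep available samples that are unrelated, then subsample if needed."""
    unrelated = set(unrelated_ids)
    eligible = [s for s in available_ids if s in unrelated]
    if ld_sample_size < len(eligible):
        # Fixed seed so that repeated runs pick the same subset
        return random.Random(seed).sample(eligible, ld_sample_size)
    # Fewer eligible samples than requested: use all of them
    return eligible


# Hypothetical sample IDs
available = [f"sample{i}" for i in range(10)]
unrelated_list = available[:6]
picked = select_ld_samples(available, unrelated_list, ld_sample_size=3)
print(picked)
```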

In [None]:
# Create a white-space delimited file with a list of unrelated samples in data.
[filter_samples: provides = f'{cwd}/{unrelated_samples:bn}.{ld_sample_size}.txt']
input: unrelated_samples, sampleFile
output: f'{cwd}/{unrelated_samples:bn}.{ld_sample_size}.txt'
R: container = container, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    set.seed(1)
    all_unrelated = read.table(${_input[0]:r}, header=F${"" if _input[0].suffix == ".fam" else ", skip=2"})
    avail_samples = read.table(${_input[1]:r}, header=F${"" if _input[1].suffix == ".fam" else ", skip=2"})
    unrelated_samples = which(avail_samples[,1] %in% all_unrelated[,1])
    dat = avail_samples[unrelated_samples,]
    if (${ld_sample_size} < nrow(dat)) {
      dat = dat[sample(1:nrow(dat), ${ld_sample_size}), 1, drop=F]
    } else {
      dat = dat[, 1, drop=F]
    }
    write.table(dat, ${_output:r}, quote=F, row.names=F, col.names=F)    

## Filter the BGEN files

This step is based on the samples selected in the previous step.

In [None]:
# Select a subset of samples from the bgen files
[filter_bgen_1]
depends: f'{cwd}/{unrelated_samples:bn}.{ld_sample_size}.txt'
input: genoFile, group_by=1
output: f'{cwd}/cache/{_input:bn}.{ld_sample_size}.bgen'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h',  mem = '60G', tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= True, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
   qctool -g {_input} -s {sampleFile} -og {_output} -os {_output:n}.sample -incl-samples {_depends}

## Make PLINK files with the selected samples

The BGEN files extracted have to be converted to PLINK format because currently LD clumping in PLINK 1.9 does not work with BGEN format.

In [None]:
# Make the binary files for bgen input using the selected samples and exclude repeated variant ids
[filter_bgen_2]
depends: Py_Module('xxhash')
output: f'{cwd}/cache/{_input:bn}.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    # try to create the index if it does not exist
    bgenix -g ${_input} -index || true
    # get a list of duplicated SNP IDs
    bgenix -g ${_input} -list 2>/dev/null | grep -v "#" | tail -n+2 | cut -d$'\t' -f2 | sort | uniq -d > ${_output:n}.exclude
    plink2 \
    --bgen ${_input} ref-first \
    --sample ${_input:n}.sample \
    --make-bed \
    --exclude ${_output:n}.exclude \
    --out ${_output:n} \
    --threads ${numThreads} \
    --memory 12000
    
python: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    # Fix overly long SNP names: any ID of 30 characters or more is shortened, well under the 50-character limit.
    # If not dealt with, long names result in a false insufficient-memory alert and an error in the next step.
    import pandas as pd
    from xxhash import xxh32 as xxh
    def shorten_id(x):
        return x if len(x) < 30 else f"{x.split('_')[0]}_{xxh(x).hexdigest()}"

    dat = pd.read_csv('${_output:n}.bim', header=None, sep='\t')
    dat.columns = ['chrom', 'id', 'gd', 'pos', 'a1', 'a2']
    dat['id'] = dat['id'].apply(shorten_id)
    dat.to_csv('${_output:n}.bim', sep='\t', header=False, index=False)

## Filter PLINK files

In [None]:
# Select a subset of samples from the plink files
[filter_plink]
depends: f'{cwd}/{unrelated_samples:bn}.{ld_sample_size}.txt'
input: genoFile, group_by=1
output: f'{cwd}/cache/{_input:bn}.{ld_sample_size}.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h',  mem = '60G', tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= True, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    plink --bfile {_input:n} --keep-fam {_depends} --make-bed --out {_output:n} --threads {numThreads} --memory 12000

In [None]:
[reference_1]
bgen = [x for x in genoFile if x.suffix == '.bgen']
plink = [x for x in genoFile if x.suffix == '.bed']
input: genoFile, group_by=1
output: f'{cwd}/cache/{_input:bn}.{ld_sample_size}.bed'
if len(bgen):
    sos_run('filter_bgen', genoFile=bgen)
if len(plink):
    sos_run('filter_plink', genoFile=plink)

## Merge all chromosomes into one file

This is necessary for LD clumping in PLINK to work properly. At this point we cannot `--merge` and `--make-bed` starting from bgen files in PLINK 2, so we have to stick to PLINK 1.9. PLINK 1.9 cannot deal with very long variant names, and it cannot handle multiallelic variants when merging different bed files, so duplicated variants have to be removed and indels renamed to fewer than 50 characters beforehand.
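
The two clean-up rules can be illustrated with a short Python sketch. It is an approximation under stated assumptions: the function name `clean_variant_ids` is hypothetical, every copy of a duplicated ID is dropped, and `hashlib.sha1` (truncated to 8 hex digits) stands in for the `xxhash` digest used by the workflow, so the shortened names will differ from the workflow's.

```python
import hashlib
from collections import Counter
from typing import List


def clean_variant_ids(ids: List[str], max_len: int = 30) -> List[str]:
    """Drop IDs that occur more than once, then shorten overly long IDs."""
    counts = Counter(ids)
    kept = [v for v in ids if counts[v] == 1]

    def shorten(v: str) -> str:
        if len(v) < max_len:
            return v
        # Keep the part before the first underscore and append a short hash
        digest = hashlib.sha1(v.encode()).hexdigest()[:8]
        return f"{v.split('_')[0]}_{digest}"

    return [shorten(v) for v in kept]


# Hypothetical variant IDs: one duplicate and one long indel name
ids = ["rs1", "rs2", "rs2", "1:1000_" + "A" * 40 + "_T"]
cleaned = clean_variant_ids(ids)
print(cleaned)
```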

In [None]:
# Merge all the .bed files into one reference file to use in clumping step
[reference_2]
input: group_by='all'
output: f'{cwd}/{reference_genotype_prefix}.{ld_sample_size}.ref_geno.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    echo -e ${' '.join([str(x)[:-4] for x in _input[1:]])} | sed 's/ /\n/g' > ${_output:n}.merge_list
    plink \
    --bfile ${_input[0]:n} \
    --merge-list ${_output:n}.merge_list \
    --make-bed \
    --out ${_output:n} \
    --threads ${numThreads} \
    --memory 48000

## Perform LD clumping per chrom

Note: the same fields (e.g. `SNP` and `P`) are extracted from all result files; it is not possible to specify different fields for different files.
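
Because of this constraint, it may help to check the headers of all summary statistics files before clumping. The Python sketch below does so for plain or gzip-compressed, white-space delimited files; the helper name `missing_columns` and the toy file are hypothetical.

```python
import gzip
import os
import tempfile
from typing import List


def missing_columns(path: str, required: List[str]) -> List[str]:
    """Return the required column names absent from the file's header line."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        header = handle.readline().split()
    return [name for name in required if name not in header]


# Write a tiny hypothetical summary statistics file and check it.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.snp_stats.gz")
    with gzip.open(toy_path, "wt") as out:
        out.write("CHR BP SNP P\n1 1000 rs1 0.01\n")
    ok = missing_columns(toy_path, ["SNP", "P"])
    bad = missing_columns(toy_path, ["SNP", "P_value"])

print("missing with field P:", ok)
print("missing with field P_value:", bad)
```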

In [None]:
# Perform LD-clumping in PLINKv1.9
[default]
parameter: verbose = True
input: f'{cwd}/{reference_genotype_prefix}.{ld_sample_size}.ref_geno.bed'
output: clumpFile, clumpregionFile 
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G',cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'   
    plink \
    --bfile ${_input:n} \
    --clump ${sumstatsFiles:,} \
    --clump-field ${clump_field} \
    --clump-p1 ${clump_p1} \
    --clump-p2 ${clump_p2} \
    --clump-r2 ${clump_r2} \
    --clump-kb ${clump_kb} \
    ${("--clump-verbose") if verbose else ""} \
    ${("--clump-annotate %s" % clump_annotate) if clump_annotate else ""} \
    --clump-allow-overlap \
    --out ${_output[0]:n} \
    --threads ${numThreads} \
    && touch ${_output[0]} # need to touch and create empty file because some chroms may not have anything significant to clump.
    grep "RANGE" ${_output[0]} | awk -F ":" '{print $2, $3}' | sort -V | sed 's/\.\./ /g; s/^[[:blank:]]*//g ; s/chr//g' > ${_output[1]}
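
For readers less familiar with the `grep`/`awk`/`sed` pipeline above, the same extraction can be sketched in Python. It assumes the verbose report contains lines of the form `RANGE: chr1:248715699..249172500`, which is what the shell pipeline appears to expect; the function name `extract_ranges` is hypothetical, and the sort only approximates `sort -V`.

```python
import re
from typing import Iterable, List, Tuple

# Assumed line format: "RANGE: chr<chrom>:<start>..<end>"
RANGE_RE = re.compile(r"RANGE:\s*(?:chr)?(\S+?):(\d+)\.\.(\d+)")


def extract_ranges(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Collect (chrom, start, end) from RANGE lines, numeric chromosomes first."""
    found = []
    for line in lines:
        match = RANGE_RE.search(line)
        if match:
            found.append((match.group(1), int(match.group(2)), int(match.group(3))))

    def sort_key(region):
        chrom, start, _ = region
        rank = int(chrom) if chrom.isdigit() else float("inf")
        return (rank, chrom, start)

    return sorted(found, key=sort_key)


# Hypothetical excerpt of a verbose clumping report
report = [
    "          RANGE: chr2:152287831..152331133",
    "           SPAN: 43kb",
    "          RANGE: chr1:248715699..249172500",
]
ranges = extract_ranges(report)
for chrom, start, end in ranges:
    print(chrom, start, end)
```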