# Identifying loci of interest from GWAS

The aim is to select subsets of SNPs of potential interest (low p-value) from GWAS results. We use the LD clumping method to account for linkage disequilibrium (LD) between SNPs. The output of LD clumping is a subset of independent SNPs for each locus of interest. We then use these SNPs to establish the boundaries of the loci to be investigated further.
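To make the idea concrete, here is a minimal sketch of greedy clumping on a single chromosome. It is a simplified illustration, not PLINK's actual implementation: the function name `greedy_clump`, the toy SNPs and the pairwise r2 lookup are all assumptions made for this example, and overlapping clumps (`--clump-allow-overlap`) are not modelled.

```python
from typing import Dict, List, Tuple


def greedy_clump(
    snps: List[Tuple[str, int, float]],   # (snp_id, position_bp, p_value)
    r2: Dict[frozenset, float],           # pairwise LD, keyed by frozenset({id1, id2})
    p1: float = 5e-8,                     # index SNP significance threshold
    p2: float = 1.0,                      # threshold for SNPs to be clumped
    r2_min: float = 0.04,                 # LD threshold
    window_bp: int = 2_000_000,           # physical distance threshold
) -> Dict[str, List[str]]:
    """Return {index_snp: [clumped SNP IDs]} for SNPs on one chromosome."""
    pos = {sid: bp for sid, bp, _ in snps}
    pval = {sid: p for sid, _, p in snps}
    assigned = set()
    clumps: Dict[str, List[str]] = {}
    # Visit SNPs from most to least significant
    for sid, bp, p in sorted(snps, key=lambda s: s[2]):
        if p > p1:
            break                          # remaining SNPs cannot be index SNPs
        if sid in assigned:
            continue                       # already part of an earlier clump
        assigned.add(sid)
        members = []
        for other in pos:
            if other in assigned or pval[other] > p2:
                continue
            if abs(pos[other] - bp) > window_bp:
                continue
            if r2.get(frozenset((sid, other)), 0.0) >= r2_min:
                members.append(other)
                assigned.add(other)
        clumps[sid] = members
    return clumps


# Toy data (hypothetical values)
snps = [
    ("rs1", 100_000, 1e-10),
    ("rs2", 150_000, 1e-6),
    ("rs3", 900_000, 2e-9),
    ("rs4", 5_000_000, 1e-9),
]
r2 = {frozenset(("rs1", "rs2")): 0.8, frozenset(("rs1", "rs3")): 0.01}
clumps = greedy_clump(snps, r2)
print(clumps)
```

In this toy example `rs2` is absorbed into the clump led by `rs1`, while `rs3` and `rs4` remain independent index SNPs with empty clumps.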

## Input

- Summary statistics
- LD reference panel

**Settings for clumping analysis:**

1. Which reference dataset to use? Options are 1000G_CEU, hapmap_CEU_r23a_filtered, UK10K, HRC reference panel
    
    * FIXME: SNPs that are not in the reference panel will not be output as index SNPs
    * FIX: use the bgen files with the genotype information, from which we extract 1000 individuals
    
    
2. What is the significance threshold for the index variant (p1) we should use for the analyses? 
    
    p=5e-08
    
3. What significance threshold to use for the SNPs to be clumped? 
   
   p=1 (this will include all the SNPs)
   
4. What LD r2 to use? 
   
   r2=0.3 or even lower to capture bigger LD blocks
   
5. What window size in kb to use (research about the average LD in the human genome for CEU population)? 
   
   The initial choice was 1 Mb (1000 kb); after discussion with the team, 2 Mb (2000 kb) was chosen instead

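Below are the relevant PLINK clumping options. The values shown for `--clump-p1`, `--clump-p2`, `--clump-r2` and `--clump-kb` are the PLINK defaults; the field name given to `--clump-field` is only an example.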

```
--clump-p1 0.0001: significance threshold for Index SNPs
--clump-p2 0.01: Secondary significance threshold for clumped SNPs
--clump-r2 0.50: LD threshold for clumping
--clump-kb 250: Physical distance threshold for clumping
--clump-field P_BOLT_LMM: To specify the name of the field for P-value
--clump-verbose: to add a more detailed report of SNPs in each clump
--clump-best: to select the single best proxy
--clump-allow-overlap: allow for overlap between clumped regions
```

## Output

A list of regions in the following format:

```
chr start end top_snp all_snps
```

where the last column, `all_snps`, lists the SNP IDs in the LD cluster.
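As an illustration of how such a line could be consumed downstream, the sketch below parses one region record. The whitespace-separated columns follow the format above, but the comma separator inside `all_snps`, the `Region` type, the `parse_region` helper and the example values are assumptions made for this example.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    top_snp: str
    all_snps: List[str]


def parse_region(line: str, snp_sep: str = ",") -> Region:
    """Parse one 'chr start end top_snp all_snps' line.

    snp_sep is an assumption: the separator used inside all_snps
    is not specified in this document.
    """
    chrom, start, end, top_snp, all_snps = line.split()
    return Region(chrom, int(start), int(end), top_snp, all_snps.split(snp_sep))


# Hypothetical record
region = parse_region("1 1000000 3000000 rs123 rs123,rs456,rs789")
print(region.chrom, region.end - region.start, len(region.all_snps))
```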

## Command interface

In [2]:
sos run LD_Clumping.ipynb -h

usage: sos run LD_Clumping.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  filter_samples
  default

Global Workflow Options:
  --cwd VAL (as path, required)
                        Working directory: change accordingly
  --bfile VAL (as path, required)
                        Genotype file in plink binary format
  --bgenFile  paths

                        Path to bgen files
  --sampleFile VAL (as path, required)
                        Path to sample files
  --sumstatsFiles  paths

                        Path to summary stats file
  --unrelated-samples VAL (as path, required)
                        Path to samples of unrelated individuals
  --ld-sample-size 1000 (as int)
   

## Global parameter settings

In [None]:
[global]
# Working directory: change accordingly
parameter: cwd = path
# Genotype file in plink binary format
parameter: bfile = path
# Path to bgen files
parameter: bgenFile = paths
# Path to sample files
parameter: sampleFile = path
# Path to summary stats file
parameter: sumstatsFiles = paths
# Path to samples of unrelated individuals
parameter: unrelated_samples = path
# Number of samples to use to compute LD
parameter: ld_sample_size = 1000
# Clumping parameters
parameter: clump_field = str
parameter: clump_annotate = str
parameter: clump_p1 = 5e-08
parameter: clump_p2 = 1
# r2 = 0.04 => r = 0.2
parameter: clump_r2 = 0.04
parameter: clump_kb = 2000
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Output the bgen file with 8bit formatting
#parameter: bgen_bits=16
# Specific number of threads to use
parameter: numThreads = int
# Load specific modules for each step
parameter: qctool_module = '''
module load QCTOOL/2.0-foss-2016b-rc7-CentOS6.8
echo "Module qctool loaded"
{cmd}
'''
parameter: plink2_module = '''
module load PLINK/2_x86_64_20180428
echo "Module plink2 loaded"
{cmd}
'''

parameter: plink_module = '''
module load PLINK/1.90-beta5.3
echo "Module plink loaded"
{cmd}
'''

## Illustration with a minimal working example


```
JOB_OPT='-j 2'
```

On a minimal working example (MWE) dataset generated by the `LMM.ipynb` workflow,
```
sos run LD_Clumping.ipynb \
    --cwd output \
    --bfile data/genotypes.bed \
    --sampleFile data/imputed_genotypes.sample \
    --bgenFile data/imputed_genotypes_chr*.bgen \
    --sumstatsFiles output/phenotypes_BMI.fastGWA.snp_stats.gz \
    --unrelated-samples data/unrelated_samples.txt \
    --numThreads 5 \
    --clump-p1 0.05 \
    --clump-field P \
    --clump-annotate BP \
    $JOB_OPT
```

## Note

In step `default_2`, to satisfy PLINK's requirements, duplicate variants have to be removed and variant IDs longer than 50 characters (typically indels) have to be shortened. PLINK 1.9 cannot handle very long variant names, and it cannot handle multiallelic variants when merging different bed files. The merge option is not available in PLINK 2.0 at the moment.
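The duplicate detection in `default_2` is done with `sort | uniq -d` on the list of variant IDs. The same logic can be sketched in Python as follows; the helper name `duplicated_ids` and the toy IDs are hypothetical.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return each variant ID that occurs more than once, listed once (like `uniq -d`)."""
    counts = Counter(ids)
    return sorted(v for v, n in counts.items() if n > 1)


# Toy list of variant IDs (hypothetical)
ids = ["rs1", "rs2", "rs2", "1:100_A_T", "rs3", "1:100_A_T"]
exclude = duplicated_ids(ids)
print(exclude)
```

Every variant carrying one of the IDs in this list is then dropped by `--exclude`, so no copy of a duplicated ID is kept.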

## Select a random subset of unrelated samples

The unrelated sample file should be a text file containing white space separated columns. It should have a column named `IID` for sample IDs.
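For readers less familiar with R, the selection logic of this step can be sketched in Python. This is only an illustration under stated assumptions: `pick_ld_samples` is a hypothetical helper, the toy file contents are made up, and the random draw will not reproduce the samples chosen by the R code.

```python
import io
import random


def pick_ld_samples(unrelated_fh, sample_fh, ld_sample_size, seed=1):
    """Pick up to ld_sample_size unrelated samples present in the .sample file."""
    # The unrelated-samples file has a header row containing an IID column
    header = unrelated_fh.readline().split()
    iid_col = header.index("IID")
    unrelated = {line.split()[iid_col] for line in unrelated_fh if line.strip()}
    # The .sample file has two header lines; the first column holds the sample ID
    lines = sample_fh.read().splitlines()[2:]
    available = [ln.split()[0] for ln in lines if ln.strip()]
    keep = [s for s in available if s in unrelated]
    if ld_sample_size < len(keep):
        keep = random.Random(seed).sample(keep, ld_sample_size)
    return keep


# Toy inputs (hypothetical)
unrelated_txt = "FID IID\n1 1\n2 2\n3 3\n4 4\n"
sample_txt = "ID_1 ID_2 missing\n0 0 0\n1 1 0\n2 2 0\n5 5 0\n"

all_kept = pick_ld_samples(io.StringIO(unrelated_txt), io.StringIO(sample_txt), 1000)
subset = pick_ld_samples(io.StringIO(unrelated_txt), io.StringIO(sample_txt), 1)
print(all_kept, subset)
```

When fewer unrelated samples are available than requested, all of them are kept, which mirrors the `else` branch of the R code.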

In [None]:
# Create a white-space delimited file with a list of unrelated samples in data.
[filter_samples: provides = f'{cwd}/{unrelated_samples:bn}.{ld_sample_size}.txt']
input: unrelated_samples, sampleFile
output: f'{cwd}/{unrelated_samples:bn}.{ld_sample_size}.txt'
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    set.seed(1)
    all_unrelated = read.table(${_input[0]:r}, header=T)
    avail_samples = read.table(${_input[1]:r}, header=F, skip=2)
    unrelated_samples = which(avail_samples[,1] %in% all_unrelated$IID)
    dat = avail_samples[unrelated_samples,]
    if (${ld_sample_size} < nrow(dat)) {
      dat = dat[sample(1:nrow(dat), ${ld_sample_size}), 1, drop=F]
    } else {
      dat = dat[, 1, drop=F]
    }
    write.table(dat, ${_output:r}, quote=F, row.names=F, col.names=F)    

## Filter the BGEN files

This step uses the samples selected in the previous step.

In [None]:
# Select a subset of samples from the BGEN files
[default_1]
depends: f'{cwd}/{unrelated_samples:bn}.{ld_sample_size}.txt'
input: bgenFile, group_by=1
output: f'{cwd}/{_input:bn}.{ld_sample_size}.bgen', f'{cwd}/{_input:bn}.{ld_sample_size}.sample'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h',  mem = '60G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', template = '{cmd}' if executable('qctool').target_exists() else qctool_module
    qctool \
    -g ${_input} \
    -s ${sampleFile} \
    -og ${_output[0]} \
    -os ${_output[1]} \
    -incl-samples ${_depends}

## Make PLINK files with the selected samples

The BGEN files extracted have to be converted to PLINK format because currently LD clumping in PLINK 1.9 does not work with BGEN format.

In [None]:
# Make the binary files for the selected samples excluding repeated variant id
[default_2]
depends: Py_Module('xxhash')
output: f'{_input[0]:n}.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', template = '{cmd}' if executable('plink2').target_exists() else plink2_module
    # try create index if not exist
    bgenix -g ${_input[0]} -index || true
    # get a list of duplicated SNP IDs
    bgenix -g ${_input[0]} -list 2>/dev/null | grep -v "#" | tail -n+2 | cut -d$'\t' -f2 | sort | uniq -d > ${_output:n}.exclude
    plink2 \
    --bgen ${_input[0]} ref-first \
    --sample ${_input[1]} \
    --make-bed \
    --exclude ${_output:n}.exclude \
    --out ${_output:n} \
    --threads ${numThreads} \
    --memory 12000
    
python: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    # Shorten long SNP names: IDs of 30 or more characters are replaced by their prefix plus a hash.
    # If not dealt with, long IDs trigger a false insufficient-memory alert and an error in the next step
    import pandas as pd
    from xxhash import xxh32 as xxh
    def shorten_id(x):
        return x if len(x) < 30 else f"{x.split('_')[0]}_{xxh(x).hexdigest()}"

    dat = pd.read_csv('${_output:n}.bim', header=None, sep='\t')
    dat.columns = ['chrom', 'id', 'gd', 'pos', 'a1', 'a2']
    dat['id'] = dat['id'].apply(shorten_id)
    dat.to_csv('${_output:n}.bim', sep='\t', header=False, index=False)
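
The effect of the ID shortening can be tried out in isolation with the sketch below. Note the swap: it uses `hashlib.md5` truncated to 8 hex characters as a stand-in for `xxhash.xxh32`, purely so that the example runs with the standard library, which means the resulting IDs differ from those produced by the workflow. The example IDs are hypothetical.

```python
import hashlib


def shorten_id(variant_id: str, max_len: int = 30) -> str:
    """Keep short IDs; replace long ones by '<prefix>_<hash>'.

    Stand-in for the workflow's xxh32-based version: md5 is used here
    only to avoid the xxhash dependency, so the digests are not the same.
    """
    if len(variant_id) < max_len:
        return variant_id
    prefix = variant_id.split("_")[0]
    digest = hashlib.md5(variant_id.encode()).hexdigest()[:8]
    return f"{prefix}_{digest}"


short = shorten_id("rs123")
long_id = "1:1000_" + "A" * 40 + "_T"   # hypothetical long indel ID
shortened = shorten_id(long_id)
print(short, shortened, len(shortened))
```

Because the hash is deterministic, the same long ID always maps to the same shortened ID, so the renaming is reproducible across runs.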

## Merge all chroms to one file

This is necessary for LD clumping in PLINK to work properly. We cannot merge bgen files to bed in PLINK 2 at this point.

In [1]:
# Merge all the .bed files into one reference file 
[default_3]
input: group_by = 'all'
output: f'{cwd}/' + "_".join([f'{x:bn}' for x in sumstatsFiles]) + '.ref_geno.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', template = '{cmd}' if executable('plink').target_exists() else plink_module
    echo -e ${' '.join([str(x)[:-4] for x in _input[1:]])} | sed 's/ /\n/g' > ${_output:n}.merge_list
    plink \
    --bfile ${_input[0]:n} \
    --merge-list ${_output:n}.merge_list \
    --make-bed \
    --out ${_output:n} \
    --threads ${numThreads} \
    --memory 48000

## Perform LD clumping per chrom

Note: The same fields are extracted from all results files (e.g. SNP and P) -- i.e. it is not possible to specify different fields from different files

In [2]:
# Perform LD-clumping in PLINKv1.9
[default_4]
output: f'{_input:nn}.clumped', f'{_input:nn}.clumped_region'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G',cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', template = '{cmd}' if executable('plink').target_exists() else plink_module    
    plink \
    --bfile ${_input:n} \
    --clump ${sumstatsFiles:,} \
    --clump-field ${clump_field} \
    --clump-p1 ${clump_p1} \
    --clump-p2 ${clump_p2} \
    --clump-r2 ${clump_r2} \
    --clump-kb ${clump_kb} \
    --clump-verbose \
    --clump-annotate ${clump_annotate} \
    --clump-allow-overlap \
    --out ${_output[0]:n} \
    --threads ${numThreads} \
    && touch ${_output[0]} # need to touch and create empty file because some chroms may not have anything significant to clump.
    grep "RANGE" ${_output[0]} | awk -F ":" '{print $2, $3}' | sort -V | sed 's/\.\./ /g; s/^[[:blank:]]*//g ; s/chr//g' > ${_output[1]}
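
The final `grep | awk | sort | sed` pipeline turns the `RANGE` lines of the verbose report into `chr start end` records. A Python sketch of the same extraction is given below. It assumes that the lines look like `RANGE: chr1:100..400`, which is inferred from the shell pipeline above rather than checked against a real report; `parse_ranges` and the toy report text are hypothetical.

```python
import re

# Assumed layout of a RANGE line: "RANGE: chr1:100..400" (the "chr" prefix is optional)
RANGE_RE = re.compile(r"RANGE:\s*(?:chr)?(\S+?):(\d+)\.\.(\d+)")


def chrom_key(chrom):
    """Sort numeric chromosomes numerically, others (e.g. X) afterwards."""
    return (0, int(chrom), "") if chrom.isdigit() else (1, 0, chrom)


def parse_ranges(report_text):
    """Return a sorted list of (chrom, start, end) tuples from RANGE lines."""
    regions = []
    for line in report_text.splitlines():
        m = RANGE_RE.search(line)
        if m:
            regions.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return sorted(regions, key=lambda r: (chrom_key(r[0]), r[1], r[2]))


# Toy report excerpt (hypothetical)
report = """
          RANGE: chr2:500..900
           SPAN: 0kb
          RANGE: chr1:2000..3000
          RANGE: chr1:100..400
"""
regions = parse_ranges(report)
for chrom, start, end in regions:
    print(chrom, start, end)
```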