# Compute pairwise LD for selected SNPs
Though it is straightforward enough to do this in R or Python, I use PLINK to compute the LD matrix.

PLINK is highly efficient. First I extract the variants of interest from the full data set. Then, for a quick look, I use `--indep-pairwise` to prune SNPs at a given LD threshold. Finally I use the `--r2` option to compute LD statistics and report a formal summary.
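For reference, here is a minimal Python sketch of the quantity being computed. It assumes genotypes are coded as 0/1/2 allele counts with no missing values, and it takes r² to be the squared Pearson correlation between allele counts, which to my understanding is what plain `--r2` reports. The `genotypes` array and the function name `pairwise_r2` are made up for illustration and are not part of the pipeline.

```python
import numpy as np

def pairwise_r2(genotypes):
    """Squared Pearson correlation between the columns (SNPs)
    of a samples x SNPs matrix of allele counts."""
    r = np.corrcoef(genotypes, rowvar=False)
    return r ** 2

# Toy data (hypothetical): 6 samples x 3 SNPs, coded as 0/1/2 allele counts.
# The first two SNPs are identical, so their r2 is 1.
genotypes = np.array([
    [0, 0, 2],
    [1, 1, 1],
    [2, 2, 0],
    [0, 0, 2],
    [1, 1, 0],
    [2, 2, 1],
])
r2 = pairwise_r2(genotypes)
print(np.round(r2, 3))
```

This works for a handful of SNPs, but the full correlation matrix grows quadratically with the number of variants, which is one reason to hand the job to PLINK.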

In [6]:
[global]
cwd = '~/Documents/GTEx'
mash_snps = '~/GIT/github/gtexresults_mash/Data/maxz.txt'
mash_snps_all = "${cwd!a}/mash_revision/MatrixEQTLSumStats.Portable.h5"
snp_list = "${cwd!a}/mash_revision/snp_eqtls.txt"
snp_list_random = "${cwd!a}/mash_revision/snp_random.txt"
genotype_data = "${cwd!a}/genotype_plink/GTEx7.Imputed.bed"

## Prepare SNP ID list to extract
For example, the code below prepares the SNP list from the eQTLs in the `mash` paper.

### From text file
In particular [this file](https://github.com/stephenslab/gtexresults_mash/blob/master/Data/maxz.txt).

In [3]:
%preview /home/gaow/GIT/github/gtexresults_mash/Data/maxz.txt --limit 2

"Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" "Artery_Coronary" "Artery_Tibial" "Brain_Anterior_cingulate_cortex_BA24" "Brain_Caudate_basal_ganglia" "Brain_Cerebellar_Hemisphere" "Brain_Cerebellum" "Brain_Cortex" "Brain_Frontal_Cortex_BA9" "Brain_Hippocampus" "Brain_Hypothalamus" "Brain_Nucleus_accumbens_basal_ganglia" "Brain_Putamen_basal_ganglia" "Breast_Mammary_Tissue" "Cells_EBV-transformed_lymphocytes" "Cells_Transformed_fibroblasts" "Colon_Sigmoid" "Colon_Transverse" "Esophagus_Gastroesophageal_Junction" "Esophagus_Mucosa" "Esophagus_Muscularis" "Heart_Atrial_Appendage" "Heart_Left_Ventricle" "Liver" "Lung" "Muscle_Skeletal" "Nerve_Tibial" "Ovary" "Pancreas" "Pituitary" "Prostate" "Skin_Not_Sun_Exposed_Suprapubic" "Skin_Sun_Exposed_Lower_leg" "Small_Intestine_Terminal_Ileum" "Spleen" "Stomach" "Testis" "Thyroid" "Uterus" "Vagina" "Whole_Blood"
"ENSG00000000419.8_20_49461813_G_C_b37" 0.140035769899035 0.0475048018545887 -0.185199569115366 -0.7074

In [8]:
%sosrun get_list
[get_list]
output: "${snp_list!a}"
import numpy as np
np.savetxt("${snp_list!a}", 
           [':'.join(x.split('_')[1:3]) for x in np.loadtxt("${mash_snps!a}", dtype = 'str', delimiter=" ", skiprows=1, usecols=(0,)).tolist()],
          fmt = '%s')

20:49461813
1:169695110
1:169655079
1:27888990
1:196513323
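The conversion from row names to `chr:pos` IDs can be illustrated on the single row shown in the preview above. This is a small sketch; the helper name `to_chr_pos` is made up for illustration, and it assumes every row name follows the `GENE_CHR_POS_REF_ALT_BUILD` layout seen in that row.

```python
def to_chr_pos(row_name):
    """Convert 'GENE_CHR_POS_REF_ALT_BUILD' to 'CHR:POS'."""
    fields = row_name.split('_')
    # fields[1] is the chromosome, fields[2] is the position
    return ':'.join(fields[1:3])

example = "ENSG00000000419.8_20_49461813_G_C_b37"
print(to_chr_pos(example))

# The row names in maxz.txt are wrapped in double quotes. The quotes end up
# in the first and last fields, so they do not affect the result.
quoted = '"ENSG00000000419.8_20_49461813_G_C_b37"'
print(to_chr_pos(quoted))
```

Both calls print `20:49461813`, matching the first line of the output above.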


### From HDF5 table row names
In particular, from my GTEx V6 summary statistics database `MatrixEQTLSumStats.Portable.h5`.

In [17]:
%sosrun get_all_list
[get_all_list]
output: snp_list, snp_list_random
python:
    import h5py, numpy as np
    f = h5py.File(${mash_snps_all!ar}, 'r')
    max_list = [':'.join(x.decode().split('_')[1:3]) for x in f['max']['rownames']]
    random_list = [':'.join(x.decode().split('_')[1:3]) for x in f['null']['rownames']]
    f.close()
    np.savetxt(${snp_list!ar}, max_list, fmt = '%s')
    np.savetxt(${snp_list_random!ar}, random_list, fmt = '%s')

20:49461813
1:169695110
1:169655079
1:27888990
1:196513323


20:49782767
20:49654572
20:49392478
1:169117725
1:170340311


## Compute LD for given list of variants
The code chunk below extracts SNPs from the GTEx genotype data and computes `r2` via PLINK.
* Steps 1 & 2 extract the data and run PLINK to get a quick estimate of how many SNPs are pruned at a given cutoff, for example 0.2.
* Steps 3 & 4 compute LD more formally, on a per-chromosome basis.

In [24]:
%sosrun get_ld:1-2 -b ~/Documents/GTEx/bin
[get_ld_1]
# extract SNPs for given list
parameter: input_list = snp_list
depends: genotype_data
input: input_list
output: "${cwd!a}/mash_revision/${input!bn}.extracted.bed"
task: workdir = cwd
run:
    plink --bfile ${genotype_data!n} \
      --extract ${input} \
      --no-sex --no-pheno --no-parents \
      --make-bed \
      --out ${output!n}

[get_ld_2]
# Quickly survey how many SNPs are removed for given cutoff
pairwise_ld_param = '10000 500 0.2'
output: "${input!n}.prune.out"
run:
    plink --bfile ${input!n} \
          --indep-pairwise ${pairwise_ld_param} \
          --out ${output!nn}

[get_ld_3]
# split data by chrom
chroms = [i+1 for i in range(22)]
input: for_each = 'chroms', pattern = '{name}.prune.out'
output: expand_pattern('{_name}_chr{_chroms}.bed')
task: workdir = cwd
run:
    plink --bfile ${_input!n} \
          --chr ${_chroms} \
          --allow-no-sex \
          --make-bed \
          --out ${_output!n}

[get_ld_4]
# compute LD
input: group_by = 1, pattern = '{name}.bed'
output: expand_pattern('{_name}.r2.ld.gz')
task: workdir = cwd
run:
    plink --bfile ${_input!n} \
          --out ${_output!nn} \
          --r2 square gz


plink --bfile /home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted \
      --indep-pairwise 10000 500 0.2 \
      --out /home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted
PLINK v1.90b4.3 64-bit (9 May 2017)            www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted.log.
Options in effect:
  --bfile /home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted
  --indep-pairwise 10000 500 0.2
  --out /home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted

32120 MB RAM detected; reserving 16060 MB for main workspace.
13030 variants loaded from .bim file.
635 people (0 males, 0 females, 635 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 635 founders and

Here I executed steps 1 & 2. The full PLINK log is visible if you expand the code chunk above; the relevant message is:

```
Pruning complete.  4204 of 13030 variants removed.
```
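To put that message in perspective, the arithmetic is sketched below. The two counts are copied from the PLINK message; the variable names are only for illustration.

```python
n_total = 13030    # variants loaded, from the PLINK log
n_removed = 4204   # variants removed by --indep-pairwise at the 0.2 cutoff

n_kept = n_total - n_removed
frac_removed = n_removed / n_total
print(f"kept {n_kept} variants; removed {frac_removed:.1%}")
```

So roughly a third of the selected SNPs are pruned at this cutoff, and 8826 remain.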

## Examine LD strengths per chromosome
This section takes a closer look at the LD pattern using results from steps 3 & 4. The cell below loads all the data and computes some summary statistics on LD strength.

In [19]:
setwd('~/Documents/GTEx/mash_revision/')
# r2 thresholds to summarize
grid = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99)
# rows: thresholds; columns: chromosomes
prop = matrix(0, length(grid), 22)
snps = matrix(0, length(grid), 22)
n_snps = 0
for (i in 1:22) {
    tmp = read.table(paste0('GTEx7.Imputed.extracted_chr', i, '.r2.ld.gz'))
    tmp[is.na(tmp)] = 0
    rownames(tmp) = colnames(tmp)
    # keep only the strict lower triangle so each SNP pair is counted once
    diag(tmp) = 0
    tmp[upper.tri(tmp)] = 0
    n_snps = n_snps + dim(tmp)[1]
    for (j in 1:length(grid)) {
        m = abs(tmp) > grid[j]
        # proportion of pairs above the threshold
        prop[j,i] = sum(m) / ((dim(tmp)[1]^2 - dim(tmp)[1])/ 2)
        # unique SNPs involved in at least one such pair
        ss = c(rownames(m)[row(m)[which(m)]], colnames(m)[col(m)[which(m)]])
        snps[j,i] = length(unique(ss))
    }
}
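The same two summaries can be sketched in Python on a toy matrix, which makes the logic easier to check by hand. This is an illustrative re-expression of the R cell above, not the code that produced the results; the function name `ld_summary` and the `toy` matrix are made up for the example.

```python
import numpy as np

def ld_summary(r2, thresholds):
    """For each threshold, return (proportion of SNP pairs with r2 above it,
    number of distinct SNPs appearing in at least one such pair)."""
    r2 = np.nan_to_num(np.asarray(r2, dtype=float))  # treat NaN as 0
    n = r2.shape[0]
    rows, cols = np.tril_indices(n, k=-1)  # each unordered pair once
    pair_r2 = r2[rows, cols]
    n_pairs = n * (n - 1) / 2
    summary = []
    for t in thresholds:
        hit = pair_r2 > t
        involved = np.union1d(rows[hit], cols[hit])
        summary.append((hit.sum() / n_pairs, involved.size))
    return summary

# Toy symmetric r2 matrix for 4 SNPs (hypothetical values)
toy = np.array([
    [1.0,  0.9, 0.05, 0.0],
    [0.9,  1.0, 0.3,  0.0],
    [0.05, 0.3, 1.0,  np.nan],
    [0.0,  0.0, np.nan, 1.0],
])
thresholds = [0.1, 0.5]
result = ld_summary(toy, thresholds)
for t, (prop_pairs, n_involved) in zip(thresholds, result):
    print(t, round(prop_pairs, 3), n_involved)
```

With 4 SNPs there are 6 pairs. Two of them exceed 0.1 and involve 3 distinct SNPs; one exceeds 0.5 and involves 2.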

Here is a summary of the proportion of SNP pairs on each chromosome (row) with LD greater than each threshold (column):


In [20]:
prop = data.frame(t(prop))
colnames(prop) = grid
rownames(prop) = paste('chr', 1:22)
prop
apply(prop, 2, mean)

chromosome,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99
chr 1,0.002647735,0.001113455,0.0007692388,0.000552005,0.0004029846,0.0003053868,0.0002277283,0.0001595147,0.0001007462,7.031242e-05,2.728542e-05
chr 2,0.002238748,0.001330724,0.0009680365,0.0007227658,0.0005505545,0.0003913894,0.000297456,0.000198304,0.0001148076,6.001305e-05,1.304631e-05
chr 3,0.003638957,0.002022455,0.0015067842,0.0011447171,0.000826537,0.0005522437,0.0004022967,0.0003218374,0.000241378,0.000135318,5.485865e-05
chr 4,0.002470868,0.001535395,0.0011591727,0.0008744636,0.0007016045,0.0005694182,0.0004067273,0.0002643727,0.0001728591,0.0001016818,4.067273e-05
chr 5,0.002170284,0.001363383,0.000984975,0.0008180301,0.0006455203,0.0004451864,0.0003561491,0.0002448525,0.0001446856,8.347245e-05,1.669449e-05
chr 6,0.003293191,0.001729617,0.0010977303,0.0007287453,0.0005719267,0.0004520066,0.0003735973,0.0002859634,0.0002029417,0.0001245324,3.68985e-05
chr 7,0.004022743,0.002099676,0.00150117,0.0011332362,0.0008388891,0.0006230346,0.00050039,0.0003384991,0.0002305719,0.0001422677,4.905784e-05
chr 8,0.003694415,0.001943986,0.0015568721,0.0013128219,0.0010603561,0.0007069041,0.0005301781,0.0002945434,0.000159895,0.0001094018,1.683105e-05
chr 9,0.003993205,0.0023165,0.0017575985,0.001411962,0.0010736794,0.000698627,0.0005809635,0.0004559461,0.0001838492,7.353969e-05,2.206191e-05
chr 10,0.003673287,0.002107207,0.0016042781,0.0012732366,0.0009803922,0.0007257448,0.0005665903,0.0003883372,0.0002164502,0.0001527884,3.81971e-05


Here is a summary of the number of unique SNPs on each chromosome (row) involved in at least one pair with LD greater than each threshold (column):


In [21]:
snps = data.frame(t(snps))
colnames(snps) = grid
rownames(snps) = paste('chr', 1:22)
snps

chromosome,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99
chr 1,957,775,650,553,454,389,316,236,159,116,49
chr 2,531,408,350,278,236,194,166,123,76,46,10
chr 3,483,369,302,264,228,191,146,120,94,60,29
chr 4,234,171,138,110,101,88,69,48,34,20,8
chr 5,346,262,211,181,158,122,107,76,46,29,6
chr 6,405,305,248,194,160,129,110,91,68,42,16
chr 7,397,321,273,233,190,149,121,98,78,50,19
chr 8,286,208,177,150,131,94,79,62,38,26,4
chr 9,313,247,212,172,149,123,108,90,42,20,6
chr 10,328,261,214,189,147,113,97,76,47,34,12


The proportion of SNPs involved at each threshold is:


In [22]:
apply(snps, 2, sum) / n_snps

In [23]:
%sessioninfo

0,1
SoS Version,0.9.8.10
numpy,1.13.1

0,1
Kernel,ir
Language,R
R version 3.4.0 (2017-04-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: BunsenLabs GNU/Linux 8.7 (Hydrogen) Matrix products: default BLAS: /usr/lib64/microsoft-r/3.4/lib64/R/lib/libRblas.so LAPACK: /usr/lib/libopenblasp-r0.2.12.so locale:  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] RevoUtilsMath_10.0.0 loaded via a namespace (and not attached):  [1] compiler_3.4.0 R6_2.2.0 magrittr_1.5 [4] RevoUtils_10.0.4 IRdisplay_0.4.4 pbdZMQ_0.2-5 [7] tools_3.4.0 crayon_1.3.2 uuid_0.1-2 [10] stringi_1.1.5 IRkernel_0.8.7.9000 jsonlite_1.4 [13] stringr_1.2.0 digest_0.6.12 repr_0.12.0 [16] evaluate_0.10
