# HDL in MVP

## Reference panel

Downloaded with `download_1000G()` from `bigsnpr`.

The panel includes 2,490 (mostly unrelated) individuals and ~1.7M SNPs in common with either HapMap3 or the UK Biobank.

## Base data: summary statistics from MVP

Posterior betas for the HDL trait.

## Target data: UK Biobank

Covariates, the HDL-related phenotype, and genotypes of 2,000 individuals.

## Model

Effect sizes are estimated with the auto model. The algorithm is run with 30 different values of p (the proportion of causal variants), ranging from 1e-4 to 0.9, and with the heritability estimate from LD score regression as the initial value for heritability.
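
As a rough illustration of the grid of p values described above, the sketch below prints 30 values starting at 1e-4 (that is, 10^-4) and ending at 0.9. It assumes the values are evenly spaced on a log scale, which the text does not state, and the variable names are illustrative only.

```shell
# Print 30 candidate values of p between 1e-4 and 0.9.
# Assumption: the values are evenly spaced on a log scale (not stated in the text).
p_grid=$(awk 'BEGIN {
    n = 30; lo = 1e-4; hi = 0.9
    step = (log(hi) - log(lo)) / (n - 1)
    for (i = 0; i < n; i++)
        printf "%.6g\n", exp(log(lo) + i * step)
}')
echo "$p_grid"
```

Log spacing puts most of the grid at small p, where sparse genetic architectures differ most from one another.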



## Code that may be useful for data preparation

Code to extract the SNP list from the summary statistics:

```
# Load the HDL summary statistics (expects an `rsid` column)
sumstats <- readRDS("mvpdata/pos_sumstats_hdl.rds")
# Write one rsID per line, with no header, row names, or quotes
write.table(sumstats$rsid, file = "mvpdata/pos_sumstats_hdl.snplist", sep = " ",
    row.names = FALSE, col.names = FALSE, quote = FALSE)
```

## Step 1: Common SNPs

In [16]:

    sos run ldpred.ipynb common_snp \
        --outpath res-data \
        --testpath ukbiobank \
        --stat_snp mvpdata/pos_sumstats_hdl.snplist \
        --ref_snp 1000G/1000G.QC.snplist \
        --test_snp ukbiobank/UKB.TMP.snplist \
        --summstats_file mvpdata/pos_sumstats_hdl.rds \
        --sub_stats mvpdata/pos_sumstats_hdl.SUB.rds \
        --stats_comsnp mvpdata/common.snplist \
        --test_comsnp ukbiobank/common.snplist \
        --ref_comsnp 1000G/common.snplist

    INFO: Running common_snp:
    INFO: common_snp (index=0) is ignored due to saved signature
    INFO: common_snp output:   mvpdata/pos_sumstats_hdl.SUB.rds mvpdata/common.snplist... (4 items)
    INFO: Workflow common_snp (ID=w858c37480708ebb0) is ignored with 1 ignored step.
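
The `common_snp` step appears to restrict the three SNP lists (summary statistics, reference panel, test data) to their intersection. Its actual implementation lives in `ldpred.ipynb` and is not shown here. A minimal shell sketch of the idea, using made-up file names and rsIDs:

```shell
# Toy SNP lists standing in for the summary-statistics, reference-panel,
# and test-data lists (file names and rsIDs are made up for illustration).
tmp=$(mktemp -d)
printf 'rs1\nrs2\nrs3\nrs4\n' > "$tmp/stat.snplist"
printf 'rs2\nrs3\nrs4\nrs5\n' > "$tmp/ref.snplist"
printf 'rs3\nrs4\nrs6\n' > "$tmp/test.snplist"

# comm requires sorted input; -12 keeps only the lines present in both files.
sort -u "$tmp/stat.snplist" > "$tmp/stat.sorted"
sort -u "$tmp/ref.snplist" > "$tmp/ref.sorted"
sort -u "$tmp/test.snplist" > "$tmp/test.sorted"
comm -12 "$tmp/stat.sorted" "$tmp/ref.sorted" \
    | comm -12 - "$tmp/test.sorted" > "$tmp/common.snplist"

# Keep the result in a variable, then clean up the temporary directory.
common=$(cat "$tmp/common.snplist")
echo "$common"
rm -r "$tmp"
```

The resulting list can then be passed to `--extract` in PLINK to subset each dataset to the shared variants.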

## Step 2: Subsetting the reference panel

In [4]:

    sos run ldpred.ipynb subsets \
        --outpath res-data \
        --testpath ukbiobank \
        --bed_file 1000G/1000G.bed \
        --fam_file 1000G/1000G.QC.fam \
        --snp_file 1000G/common.snplist \
        --sub_bedfile 1000G/1000G.SUB.bed

    INFO: Running subsets:
    INFO: subsets (index=0) is ignored due to saved signature
    INFO: subsets output:   1000G/1000G.SUB.bed 1000G/1000G.SUB.bim... (3 items)
    INFO: Workflow subsets (ID=wc3f510c0cb7ea1e6) is ignored with 1 ignored step.

In total, 31,566 variants are retained. Equivalent PLINK code:

    ./plink \
        --bfile 1000G/1000G \
        --keep 1000G/1000G.QC.fam \
        --extract 1000G/common.snplist \
        --make-bed \
        --out 1000G/1000G.SUB


## Step 3: SNP matching


In [5]:

    sos run ldpred.ipynb data_load \
        --ref_bfile 1000G/1000G.SUB.bed \
        --summstats_file mvpdata/pos_sumstats_hdl.SUB.rds \
        --n_eff 200000 \
        --ref_file 1000G/1000G.SUB.rds \
        --test_snplist UKB.SUB.snplist \
        --outpath res-data \
        --testpath ukbiobank

    INFO: Running data_load_10:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    Error : File '1000G/1000G.SUB.bk' already exists.
    INFO: data_load_10 is completed.
    INFO: Running data_load_20:
    INFO: data_load_20 is completed.
    INFO: data_load_20 output:   res-data/SumStats.RData
    INFO: Running data_load_30:
    Loading required package: bigstatsr
    31,566 variants to be matched.
    1,134 ambiguous SNPs have been removed.
    30,4
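
The "ambiguous SNPs" message in the log matches the output format of `snp_match()` in `bigsnpr`, which drops strand-ambiguous variants (allele pairs A/T and C/G) when strand flipping is enabled. Assuming that is what happens here, the filter can be sketched in shell on a toy allele table (the rsIDs and alleles below are made up):

```shell
# Toy allele table: rsID, allele 0, allele 1 (values are made up).
alleles=$(printf 'rs1 A G\nrs2 A T\nrs3 C G\nrs4 T C\nrs5 G C\n')

# Drop strand-ambiguous SNPs: the allele pair is its own reverse complement
# (A/T, T/A, C/G, G/C), so a strand flip cannot be detected from alleles alone.
kept=$(echo "$alleles" | awk '{
    pair = toupper($2 $3)
    if (pair != "AT" && pair != "TA" && pair != "CG" && pair != "GC") print $1
}')
echo "$kept"
```

In this toy table only `rs1` and `rs4` survive; the other three pairs look identical on either strand.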

## Step 4: Quality control (optional)

In [9]:

    sos run ldpred.ipynb QControl \
        --qc_in res-data/MatchedSnp.RData \
        --outpath res-data \
        --testpath ukbiobank

    INFO: Running QControl:
    Loading required package: bigstatsr
       user  system elapsed
      0.054   0.010   2.790
    Saving 7 x 7 in image
    [1] "30376 over 30432 were removed in Quality Control."
    INFO: QControl is completed.
    INFO: QControl output:   res-data/sd.rds res-data/QCMatchedSnp.RData... (4 items)
    INF

## Step 5: Subsetting the test data

In [11]:

    sos run ldpred.ipynb subsets \
        --outpath res-data \
        --testpath ukbiobank \
        --bed_file ukbiobank/UKB.QC.bed \
        --fam_file ukbiobank/UKB.QC.fam \
        --snp_file ukbiobank/UKB.SUB.snplist \
        --sub_bedfile ukbiobank/UKB.SUB.bed

    INFO: Running subsets:
    PLINK v1.90b6.22 64-bit (16 Apr 2021)          www.cog-genomics.org/plink/1.9/
    (C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
    Logging to ukbiobank/UKB.SUB.log.
    Options in effect:
      --bfile ukbiobank/UKB.QC
      --extract ukbiobank/UKB.SUB.snplist
      --keep ukbiobank/UKB.QC.fam
      --make-bed
      --out ukbiobank/UKB.SUB

    8192 MB RAM detected; reserving 4096 MB for main workspace.
    47605 variants loaded from .bim file.
    2000 people (1085 males, 915 females) loaded from .fam.
    --extract: 30432 variants remaining.
    --keep: 2000 people remaining.
    Using 1 thread (no multithreaded calculations invoked).
    Before main variant filters, 2000 founders and 0 nonfounders present.
    Calculating allele frequencies... done.
    30432 variants and 2000 peopl

Equivalent PLINK code:
```
cd ukbiobank
./plink \
    --bfile UKB.QC \
    --extract UKB.SUB.snplist \
    --make-bed \
    --out UKB.SUB
cd ..
```

## Step 6: Calculate LD matrix and correlation

In [13]:

    sos run ldpred.ipynb LD \
        --outpath res-data \
        --testpath ukbiobank \
        --ld_in res-data/MatchedSnp.Rdata

    INFO: Running LD:
    Loading required package: bigstatsr
    In file.remove(paste0(tmp, ".sbk")) :
      cannot remove file './res-data/file1067e5aa21f64.sbk', reason 'No such file or directory'
    INFO: LD is completed.
    INFO: LD output:   res-data/LdMatrix.Rdata
    INFO: Workflow LD (ID=we860c6f45697626b) is executed successfully with 1 completed step.

## Step 7: Estimate posterior effect sizes and PRS

In [15]:

    sos run ldpred.ipynb load_testdata \
        --outpath res-data \
        --testpath ukbiobank \
        --test_bfile ukbiobank/UKB.SUB.bed

    INFO: Running load_testdata:
    Loading required package: bigstatsr
    [1] "/Users/zhangmengyu/Documents/bioworkflows/ldpred/ukbiobank/UKB.SUB.rds"
    INFO: load_testdata is completed.
    INFO: Workflow load_testdata (ID=w9d96a61448de1008) is executed successfully with 1 completed step.

In [None]:

    sos run ldpred.ipynb inf_prs \
        --outpath res-data \
        --testpath ukbiobank \
        --inf_in res-data/LdMatrix.Rdata \
        --test_file ukbiobank/UKB.SUB.rds

In [3]:

    sos run ldpred.ipynb load_testdata+inf_prs \
        --outpath res-data \
        --testpath ukbiobank \
        --inf_in res-data/LdMatrix.Rdata \
        --test_file ukbiobank/UKB.SUB.rds \
        --test_bfile ukbiobank/UKB.SUB.bed

    INFO: Running load_testdata:
    Loading required package: bigstatsr
    [1] "/Users/zhangmengyu/Documents/bioworkflows/ldpred/ukbiobank/UKB.SUB.rds"
    INFO: load_testdata is completed.
    INFO: Running inf_prs:
    Loading required package: bigstatsr
    30,432 variants to be matched.
    0 ambiguous SNPs have been removed.
    30,432 variants have been matched; 0 were flipped and 29,999 were reversed.
    INFO: inf_prs is completed.
    INFO: inf_prs output:   res-data/InfPred.Rdata
    INFO: Workflow load_testdata+inf_prs (ID=w3d7fec231a0c181f) is executed successfully with 2 completed steps.

In [4]:

    sos run ldpred.ipynb grid_prs \
        --outpath res-data \
        --testpath ukbiobank \
        --grid_in res-data/LdMatrix.Rdata \
        --test_file ukbiobank/UKB.SUB.rds

    INFO: Running grid_prs:
    Loading required package: bigstatsr
    30,432 variants to be matched.
    0 ambiguous SNPs have been removed.
    30,432 variants have been matched; 0 were flipped and 29,999 were reversed.
    INFO: grid_prs is completed.
    INFO: grid_prs output:   res-data/GridPred.Rdata
    INFO: Workflow grid_prs (ID=wcd327e167525fb83) is executed successfully with 1 completed step.

In [5]:

    sos run ldpred.ipynb auto_prs \
        --outpath res-data \
        --testpath ukbiobank \
        --auto_in res-data/LdMatrix.Rdata \
        --test_file ukbiobank/UKB.SUB.rds

    INFO: Running auto_prs:
    Loading required package: bigstatsr
    30,432 variants to be matched.
    0 ambiguous SNPs have been removed.
    30,432 variants have been matched; 0 were flipped and 29,999 were reversed.
    Saving 7 x 7 in image
    INFO: auto_prs is completed.
    INFO: auto_prs output:   res-data/AutoPred.Rd

## Step 8: Predict phenotypes

In [6]:

    sos run ldpred.ipynb null_phenopred \
        --outpath res-data \
        --testpath ukbiobank \
        --cov_file ukbiobank/UKB.ind.cov \
        --trait_file ukbiobank/UKB.hdl.cov

    INFO: Running null_phenopred:
    Loading required package: bigstatsr
    # A tibble: 1 x 3
      model          R2   MSE
      <chr>       <dbl> <dbl>
    1 NULL model 0.0397 0.943
    null device
              1
    INFO: null_phenopred is completed.
    INFO: null_phenopred output:   res-data/summary/NullSummary.pdf res-data/model/NullModel.Rdata
    INFO: Workflow null_phenopred (ID=w5023cab995f4589a) is executed successfully with 1 completed step.

In [7]:

    sos run ldpred.ipynb inf_phenopred \
        --outpath res-data \
        --testpath ukbiobank \
        --cov_file ukbiobank/UKB.ind.cov \
        --trait_file ukbiobank/UKB.hdl.cov \
        --inf_file res-data/InfPred.Rdata \
        --response 1

    INFO: Running inf_phenopred:
    Loading required package: bigstatsr
    # A tibble: 1 x 3
      model         R2   MSE
      <chr>      <dbl> <dbl>
    1 Inf model 0.0397 0.943
    null device
              1
    INFO: inf_phenopred is completed.
    INFO: inf_phenopred output:   res-data/summary/InfSummary.pdf res-data/model/InfModel.Rdata
    INFO: Workflow inf_phenopred (ID=w2f8790d57588343f) is executed successfully with 1 completed step.

In [11]:

    sos run ldpred.ipynb grid_phenopred \
        --outpath res-data \
        --testpath ukbiobank \
        --cov_file ukbiobank/UKB.ind.cov \
        --trait_file ukbiobank/UKB.hdl.cov \
        --grid_file res-data/GridPred.Rdata \
        --response 1

    INFO: Running grid_phenopred:
    Loading required package: bigstatsr
    Saving 7 x 7 in image
    # A tibble: 1 x 3
      model          R2   MSE
      <chr>       <dbl> <dbl>
    1 Grid model 0.0406 0.947
    pdf
      2
    INFO: grid_phenopred is completed.
    INFO: grid_phenopred output:   res-data/summary/GridSummary.pdf res-data/model/GridModel.Rdata... (3 items)
    INFO: Workflow grid_phenopred (ID=w23e15a0ea89a18bb) is executed successfully with 1 completed step.

In [9]:

    sos run ldpred.ipynb auto_phenopred \
        --outpath res-data \
        --testpath ukbiobank \
        --cov_file ukbiobank/UKB.ind.cov \
        --trait_file ukbiobank/UKB.hdl.cov \
        --auto_file res-data/AutoPred.Rdata \
        --response 1

    INFO: Running auto_phenopred:
    Loading required package: bigstatsr
    # A tibble: 1 x 3
      model          R2   MSE
      <chr>       <dbl> <dbl>
    1 Auto model 0.0398 0.943
    null device
              1
    INFO: auto_phenopred is completed.
    INFO: auto_phenopred output:   res-data/summary/AutoSummary.pdf res-data/model/AutoModel.Rdata
    INFO: Workflow auto_phenopred (ID=w97e2f857c0c491cd) is executed successfully with 1 completed step.
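
For readers comparing the `R2` and `MSE` columns across the four models, the sketch below computes both metrics from observed and predicted values using the usual definitions (MSE as the mean squared residual, R2 as 1 - SS_res / SS_tot). How `ldpred.ipynb` computes them is not shown in this log, so this is an assumption, and the toy numbers are made up.

```shell
# Toy observed (column 1) and predicted (column 2) phenotypes; values are made up.
pairs=$(printf '1.0 1.5\n2.0 2.0\n3.0 2.5\n4.0 4.0\n')

# MSE = mean squared residual; R2 = 1 - SS_res / SS_tot.
metrics=$(echo "$pairs" | awk '{
    y[NR] = $1; yhat[NR] = $2; sum += $1
}
END {
    mean = sum / NR
    for (i = 1; i <= NR; i++) {
        ss_res += (y[i] - yhat[i]) ^ 2
        ss_tot += (y[i] - mean) ^ 2
    }
    printf "R2=%.3f MSE=%.3f\n", 1 - ss_res / ss_tot, ss_res / NR
}')
echo "$metrics"
```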