# HDL in MVP

## Reference panel

The reference panel is obtained with `download_1000G()` in `bigsnpr`.

It includes 503 (mostly unrelated) European individuals and ~1.7M SNPs in common with either HapMap3 or the UK Biobank. The classification of European populations can be found at [IGSR](https://www.internationalgenome.org/category/population/). The European individual IDs are taken from the [IGSR data portal](https://www.internationalgenome.org/data-portal/sample).

## Base data: summary statistics from MVP

Posterior betas for the HDL trait.

## Target data: UK Biobank

Covariates, HDL-related phenotypes, and genotypes of the 2000 individuals listed in `UKB.QC.fam`.

## Model

The auto model runs the algorithm with 30 different initial values of $p$ (the proportion of causal variants), ranging from $10^{-4}$ to 0.9, and uses the heritability $h^2$ estimated by LD score regression as the initial value.

The grid model tries a grid of parameters: values of $p$ ranging from 0 to 1, and three values of $h^2$ equal to 0.7, 1, and 1.4 times the initial $h^2$ estimated by LD score regression.
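
As a rough illustration of the two parameter sets described above, the sketch below prints 30 starting values of $p$ between $10^{-4}$ and 0.9 and the three $h^2$ candidates. The log spacing, the helper names `seq_log` and `h2_grid`, and the initial $h^2$ of 0.3 are assumptions made for illustration only; the workflow's own code may differ.

```shell
# Print n log-spaced values between lo and hi (inclusive), one per line.
# Log spacing is an assumption; the text only gives the range and the count.
seq_log() {
    awk -v lo="$1" -v hi="$2" -v n="$3" 'BEGIN {
        for (i = 0; i < n; i++)
            printf "%.4g\n", exp(log(lo) + i * (log(hi) - log(lo)) / (n - 1))
    }'
}

# Print the three h2 candidates: 0.7, 1 and 1.4 times the initial estimate.
h2_grid() {
    awk -v h2="$1" 'BEGIN { printf "%.4g %.4g %.4g\n", 0.7 * h2, h2, 1.4 * h2 }'
}

seq_log 1e-4 0.9 30   # 30 starting values of p for the auto model
h2_grid 0.3           # hypothetical initial h2 of 0.3
```

With an initial estimate of 0.3, `h2_grid` prints `0.21 0.3 0.42`.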

## Data preparation

Use `awk` to select columns from the phenotype file and save them to the trait file `UKB.hdl.cov` and the covariate file `UKB.ind.cov`.
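
A minimal sketch of this column selection, assuming a hypothetical whitespace-delimited phenotype file `UKB.pheno.txt` with columns `FID IID sex age smoking alcohol hdl`. The input file name, the column order, and the toy values are assumptions; only the two output file names and the covariates used in Step 8 come from this document.

```shell
# Build a toy phenotype file (hypothetical layout and values).
workdir=$(mktemp -d)
cat > "$workdir/UKB.pheno.txt" <<'EOF'
FID IID sex age smoking alcohol hdl
1 1 0 54 1 2 1.42
2 2 1 61 0 3 1.10
EOF

# Trait file: individual IDs plus the HDL column.
awk '{ print $1, $2, $7 }' "$workdir/UKB.pheno.txt" > "$workdir/UKB.hdl.cov"

# Covariate file: individual IDs plus sex, age, smoking and alcohol.
awk '{ print $1, $2, $3, $4, $5, $6 }' "$workdir/UKB.pheno.txt" > "$workdir/UKB.ind.cov"

cat "$workdir/UKB.hdl.cov"
```

Adjust the column numbers to the actual layout of the phenotype file before use.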


## Step 1: Common SNPs

    sos run ldpred.ipynb extract_snp -v1 \
        --outpath res-data \
        --testpath ukbiobank \
        --ref_bed 1000G/1000G.EUR.bed \
        --test_bed ukbiobank/UKB.bed \
        --ref_snp 1000G/1000G.QC.snplist \
        --test_snp ukbiobank/UKB.QC.snplist \
        --summstats_file mvpdata/pos_sumstats_hdl.rds \
        --stat_snp mvpdata/pos_sumstats_hdl.snplist

    PLINK v1.90b6.22 64-bit (16 Apr 2021)          www.cog-genomics.org/plink/1.9/
    (C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
    Logging to ukbiobank/UKB.QC.log.
    Options in effect:
      --bfile ukbiobank/UKB
      --geno 0.01
      --maf 0.01
      --mind 0.01
      --out ukbiobank/UKB.QC
      --write-snplist

    8192 MB RAM detected; reserving 4096 MB for main workspace.
    529024 variants loaded from .bim file.
    366732 people (168262 males, 198470 females) loaded from .fam.
    282 people removed due to missing genotype data (--mind).
    IDs written to ukbiobank/UKB.QC.irem .
    Using 1 thread (no multithreaded calculations invoked).
    Before main variant filters, 366450 founders and 0 nonfounders present.
    Calculating allele frequencies... done.
    Total genotyping rate in remaining samples is 0.997994.
    81 variants remove

    sos run ldpred.ipynb common_snp \
        --outpath res-data \
        --testpath ukbiobank \
        --stat_snp mvpdata/pos_sumstats_hdl.snplist \
        --ref_snp 1000G/1000G.QC.snplist \
        --test_snp ukbiobank/UKB.QC.snplist \
        --summstats_file mvpdata/pos_sumstats_hdl.rds \
        --sub_stats mvpdata/pos_sumstats_hdl.SUB.rds

    INFO: Running common_snp:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    [1] "There are  409486  common SNPs."
    INFO: common_snp is completed.
    INFO: common_snp output:   mvpdata/pos_sumstats_hdl.SUB.rds res-data/common.snplist
    INFO: Workflow common_snp (ID=w5eab50cc3f321432) is executed successfully with 1 completed step.
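
The intersection computed in this step can be pictured with standard shell tools. This is only a sketch on toy SNP lists, not the workflow's implementation: `common_snps` is a hypothetical helper, and it assumes one SNP ID per line with no embedded whitespace.

```shell
# Print the SNP IDs that occur in every list passed as an argument.
# After per-file de-duplication, an ID shared by all lists appears once per list.
common_snps() {
    n_lists=$#
    for f in "$@"; do sort -u "$f"; done |
        sort | uniq -c | awk -v n="$n_lists" '$1 == n { print $2 }'
}

# Toy stand-ins for the reference, target and summary-statistics SNP lists.
workdir=$(mktemp -d)
printf 'rs1\nrs2\nrs3\n' > "$workdir/ref.snplist"
printf 'rs2\nrs3\nrs4\n' > "$workdir/test.snplist"
printf 'rs3\nrs2\nrs5\n' > "$workdir/stat.snplist"

common_snps "$workdir/ref.snplist" "$workdir/test.snplist" "$workdir/stat.snplist" \
    > "$workdir/common.snplist"
cat "$workdir/common.snplist"   # rs2 and rs3
```

The resulting list plays the role of `res-data/common.snplist`, which the next step passes to PLINK via `--extract`.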

## Step 2: Subsetting the reference panel

    sos run ldpred.ipynb subsets \
        --outpath res-data \
        --testpath ukbiobank \
        --bed_file 1000G/1000G.EUR.bed \
        --fam_file 1000G/1000G.EUR.fam \
        --snp_file res-data/common.snplist \
        --sub_bedfile 1000G/1000G.SUB.bed

    INFO: Running subsets:
    PLINK v1.90b6.22 64-bit (16 Apr 2021)          www.cog-genomics.org/plink/1.9/
    (C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
    Logging to 1000G/1000G.SUB.log.
    Options in effect:
      --bfile 1000G/1000G.EUR
      --extract res-data/common.snplist
      --keep 1000G/1000G.EUR.fam
      --make-bed
      --out 1000G/1000G.SUB

    8192 MB RAM detected; reserving 4096 MB for main workspace.
    1664852 variants loaded from .bim file.
    503 people (240 males, 263 females) loaded from .fam.
    --extract: 409486 variants remaining.
    --keep: 503 people remaining.
    Using 1 thread (no multithreaded calculations invoked).
    Before main variant filters, 503 founders and 0 nonfounders present.
    Calculating allele frequencies... done.
    409486 variants and 503 people pass fi

In total, 31566 variants.

The equivalent PLINK command is:

    ./plink \
        --bfile 1000G/1000G.EUR \
        --keep 1000G/1000G.EUR.fam \
        --extract res-data/common.snplist \
        --make-bed \
        --out 1000G/1000G.SUB


## Step 3: SNP matching

    sos run ldpred.ipynb data_load \
        --outpath res-data \
        --testpath ukbiobank \
        --ref_bfile 1000G/1000G.SUB.bed \
        --ref_file 1000G/1000G.SUB.rds \
        --summstats_file mvpdata/pos_sumstats_hdl.SUB.rds \
        --n_eff 200000 \
        --test_snplist UKB.SUB.snplist

    INFO: Running data_load_10:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    [1] "/Users/zhangmengyu/Documents/bioworkflows/ldpred/1000G/1000G.SUB.rds"
    INFO: data_load_10 is completed.
    INFO: data_load_10 output:   res-data/SumStats.rds
    INFO: Running data_load_20:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    409,486 variants to be matched.
    21,954 ambiguous SNPs have been removed.
    387,532 variants have been matched; 0 were flipped and 211,410 were reversed.
    INFO: data_load_20 is completed.
    INFO: data_load_20 output:   res-data/MatchedSnp.RData ukbiobank/UKB.SUB.snplist
    INF
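
The counts in the log are consistent: 409,486 variants minus 21,954 ambiguous SNPs leaves 387,532 matched variants. Strand-ambiguous SNPs are those with an A/T or C/G allele pair, for which the strand cannot be resolved from the alleles alone. The sketch below illustrates such a filter on a toy file; the three-column layout `SNP A1 A2` and the helper `drop_ambiguous` are assumptions for illustration, and the workflow itself appears to do the matching in R rather than in the shell.

```shell
# Keep only variants whose allele pair is not strand-ambiguous (A/T or C/G).
# Assumed input layout (hypothetical): SNP A1 A2, whitespace-delimited.
drop_ambiguous() {
    awk '{
        pair = toupper($2) toupper($3)
        if (pair != "AT" && pair != "TA" && pair != "CG" && pair != "GC") print
    }' "$1"
}

# Toy allele table.
workdir=$(mktemp -d)
cat > "$workdir/toy.alleles" <<'EOF'
rs1 A G
rs2 A T
rs3 C G
rs4 T C
EOF

drop_ambiguous "$workdir/toy.alleles"   # keeps rs1 and rs4

# Check of the counts reported in the log above.
echo $((409486 - 21954))                # 387532
```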

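## Step 4: Quality control (optional)

This step drops the vast majority of variants: in the run below, 387086 of the 387532 matched variants were removed, leaving 446.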

    sos run ldpred.ipynb QControl \
        --qc_in res-data/MatchedSnp.RData \
        --outpath res-data \
        --testpath ukbiobank \
        --test_snplist UKB.QC.SUB.snplist

    INFO: Running QControl:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    Saving 7 x 7 in image
    [1] "387086 over 387532 were removed in Quality Control."
    INFO: QControl is completed.
    INFO: QControl output:   res-data/QcMatchedSnp.RData res-data/plots/QcPlot.png... (3 items)
    INFO: Workflow QControl (ID=w4c4dd

## Step 5: Subsetting the target data

    sos run ldpred.ipynb subsets \
        --outpath res-data \
        --testpath ukbiobank \
        --bed_file ukbiobank/UKB.bed \
        --fam_file ukbiobank/UKB.QC.fam \
        --snp_file ukbiobank/UKB.QC.SUB.snplist \
        --sub_bedfile ukbiobank/UKB.SUB.bed

    INFO: Running subsets:
    PLINK v1.90b6.22 64-bit (16 Apr 2021)          www.cog-genomics.org/plink/1.9/
    (C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
    Logging to ukbiobank/UKB.SUB.log.
    Options in effect:
      --bfile ukbiobank/UKB
      --extract ukbiobank/UKB.QC.SUB.snplist
      --keep ukbiobank/UKB.QC.fam
      --make-bed
      --out ukbiobank/UKB.SUB

    8192 MB RAM detected; reserving 4096 MB for main workspace.
    529024 variants loaded from .bim file.
    366732 people (168262 males, 198470 females) loaded from .fam.
    --extract: 446 variants remaining.
    --keep: 2000 people remaining.
    Using 1 thread (no multithreaded calculations invoked).
    Before main variant filters, 2000 founders and 0 nonfounders present.
    Calculating allele frequencies... done.
    Total genotyping rate i

## Step 6: Calculate LD matrix and correlation

    sos run ldpred.ipynb LD \
        --outpath res-data \
        --testpath ukbiobank \
        --ld_in res-data/QcMatchedSnp.RData

    INFO: Running LD:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    INFO: LD is completed.
    INFO: LD output:   res-data/LdMatrix.Rdata
    INFO: Workflow LD (ID=w10cce9261a8a96e3) is executed successfully with 1 completed step.

## Step 7: Estimate posterior effect sizes and PRS

    sos run ldpred.ipynb load_testdata+inf_prs \
        --outpath res-data \
        --testpath ukbiobank \
        --inf_in res-data/LdMatrix.Rdata \
        --test_bfile ukbiobank/UKB.SUB.bed \
        --test_file ukbiobank/UKB.SUB.rds

    INFO: Running load_testdata:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    Error : File 'ukbiobank/UKB.SUB.bk' already exists.
    INFO: load_testdata is completed.
    INFO: Running inf_prs:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    [1] "finish imputation"
    446 variants to be matched.
    0 ambiguous SNPs have been removed.
    446 variants have been matched; 0 were flipped and 446 were reversed.
    [1] "finish estimate beta"
    INFO: inf_prs is completed.
    INFO: inf_prs output:   res-data/InfPred.Rdata
    INFO: Workflow load_testdata+inf_prs (ID=w54dc303e0045d534) is executed successfully with 2 completed steps.

    sos run ldpred.ipynb grid_prs \
        --outpath res-data \
        --testpath ukbiobank \
        --grid_in res-data/LdMatrix.Rdata \
        --test_bfile ukbiobank/UKB.SUB.bed \
        --cov_file ukbiobank/UKB.ind.cov \
        --trait_file ukbiobank/UKB.hdl.cov \
        --test_file ukbiobank/UKB.SUB.rds \
        --response continuous

    INFO: Running grid_prs:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    446 variants to be matched.
    0 ambiguous SNPs have been removed.
    446 variants have been matched; 0 were flipped and 446 were reversed.
    Saving 7 x 7 in image
    INFO: grid_prs is completed.
    INFO: grid_prs output:   res-data/GridPred.Rdata res-data/plots/GridPlot.png
    INFO: Workflow grid_prs (ID=w3fc91275489be46d) is executed successfully with 1 completed step.

    sos run ldpred.ipynb auto_prs \
        --outpath res-data \
        --testpath ukbiobank \
        --auto_in res-data/LdMatrix.Rdata \
        --test_bfile ukbiobank/UKB.SUB.bed \
        --test_file ukbiobank/UKB.SUB.rds

    INFO: Running auto_prs:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    446 variants to be matched.
    0 ambiguous SNPs have been removed.
    446 variants have been matched; 0 were flipped and 446 were reversed.
    Saving 7 x 7 in image
    INFO: auto_prs is completed.
    INFO: auto_prs output:   res-data/AutoPred.Rda

## Step 8: Predict phenotypes

Null model: Traits ~ Sex + Age + Smoking + Alcohol

    sos run ldpred.ipynb null_phenopred \
        --outpath res-data \
        --testpath ukbiobank \
        --cov_file ukbiobank/UKB.ind.cov \
        --trait_file ukbiobank/UKB.hdl.cov \
        --response continuous

    INFO: Running null_phenopred:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    # A tibble: 1 x 3
      model          R2   MSE
      <chr>       <dbl> <dbl>
    1 NULL model 0.0349 0.943
    null device
              1
    INFO: null_phenopred is completed.
    INFO: null_phenopred output:   res-data/summary/NullSummary.pdf res-data/model/NullModel.Rdata
    INFO: Workflow null_phenopred (ID=wdc18cb54145d3956) is executed successfully with 1 completed step.

Inf/grid/auto model: Traits ~ Sex + Age + Smoking + Alcohol + PRS

    sos run ldpred.ipynb inf_phenopred \
        --outpath res-data \
        --testpath ukbiobank \
        --cov_file ukbiobank/UKB.ind.cov \
        --trait_file ukbiobank/UKB.hdl.cov \
        --prs_file res-data/InfPred.Rdata \
        --mod_summary InfSummary.pdf \
        --model InfModel.Rdata \
        --response continuous

    INFO: Running inf_phenopred:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    # A tibble: 1 x 2
          R2   MSE
       <dbl> <dbl>
    1 0.0363 0.946
    null device
              1
    INFO: inf_phenopred is completed.
    INFO: inf_phenopred output:   res-data/summary/InfSummary.pdf res-data/model/InfModel.Rdata
    INFO: Workflow inf_phenopred (ID=w37ea6141f308b849) is executed successfully with 1 completed step.

    sos run ldpred.ipynb grid_phenopred \
        --outpath res-data \
        --testpath ukbiobank \
        --cov_file ukbiobank/UKB.ind.cov \
        --trait_file ukbiobank/UKB.hdl.cov \
        --prs_file res-data/GridPred.Rdata \
        --mod_summary GridSummary.pdf \
        --model GridModel.Rdata \
        --response continuous

    INFO: Running grid_phenopred:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    # A tibble: 1 x 2
          R2   MSE
       <dbl> <dbl>
    1 0.0460 0.937
    null device
              1
    INFO: grid_phenopred is completed.
    INFO: grid_phenopred output:   res-data/summary/GridSummary.pdf res-data/model/GridModel.Rdata
    INFO: Workflow grid_phenopred (ID=w2a8b5d758bf9e758) is executed successfully with 1 completed step.

    sos run ldpred.ipynb auto_phenopred \
        --outpath res-data \
        --testpath ukbiobank \
        --cov_file ukbiobank/UKB.ind.cov \
        --trait_file ukbiobank/UKB.hdl.cov \
        --prs_file res-data/AutoPred.Rdata \
        --mod_summary AutoSummary.pdf \
        --model AutoModel.Rdata \
        --response 1

    INFO: Running auto_phenopred:
    1: Setting LC_COLLATE failed, using "C"
    2: Setting LC_TIME failed, using "C"
    3: Setting LC_MESSAGES failed, using "C"
    4: Setting LC_MONETARY failed, using "C"
    Loading required package: bigstatsr
    # A tibble: 1 x 2
          R2   MSE
       <dbl> <dbl>
    1 0.0414 0.943
    null device
              1
    INFO: auto_phenopred is completed.
    INFO: auto_phenopred output:   res-data/summary/AutoSummary.pdf res-data/model/AutoModel.Rdata
    INFO: Workflow auto_phenopred (ID=wd0834b76da1c03f1) is executed successfully with 1 completed step.

# Results

The following table shows the adjusted R-squared of each HDL prediction model. QC refers to the quality control in Step 4.

|   Betas   | QC? |   Null  |   Inf   |   Grid  |   Auto  |
|:---------:|:---:|:-------:|:-------:|:-------:|:-------:|
|  Original | Yes | 0.03486 | 0.03549 | 0.03431 | 0.03564 |
|  Original |  No | -$^{*}$ | -$^{*}$ | -$^{*}$ | -$^{*}$ |
| Posterior | Yes | 0.03486 | 0.03628 | 0.04595 | 0.04142 |
| Posterior |  No | 0.03486 | 0.03516 | -$^{*}$ | -$^{*}$ |

$*$: No result available; these runs take a very long time to finish (more than 4 hours).
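
One way to read the table is as the gain from adding the PRS to the covariates, obtained by subtracting the null-model value from each model's value. The sketch below does this for the posterior-beta row with QC, using numbers copied from the table; `increment` is a hypothetical helper introduced only for this illustration.

```shell
# Adjusted R-squared of the null model, copied from the table.
null_r2=0.03486

# Print the difference between a model's adjusted R-squared and the null model's.
increment() {
    awk -v m="$1" -v n="$null_r2" 'BEGIN { printf "%.5f\n", m - n }'
}

increment 0.03628   # Inf  -> 0.00142
increment 0.04595   # Grid -> 0.01109
increment 0.04142   # Auto -> 0.00656
```

On this row the grid model shows the largest gain over the null model, followed by the auto and infinitesimal models.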