# LDpred2 Pipeline for Polygenic Risk Score Prediction

This notebook shows the pipeline for PRS prediction using the R package `bigsnpr`.

## Aim

The pipeline predicts PRS using the infinitesimal, grid and auto models of LDpred2 to estimate effect sizes.

## Input

1. Reference panel for LD matrix and correlation calculation `(.bed/.bim/.fam)`
    - `--bed_file=path`
2. Genotype data from bed file `(.rds)`
    - `--ref_file=path`
3. Summary statistics data `(.rds)`
    - `--summstats_file=path`
4. Effective sample size of the summary statistics, used when estimating effect sizes
    - `--n_eff=2000000`
    
Note: the reference panel is taken from the 1000 Genomes Project.

## Output

The pipeline saves the results of every step.

* Step 33: QC plot for quality control
    - `--qc_plot='path/QcPlot.png'`
* Step 35: LD matrix and correlation matrix.
    - `--ld_out='path/LdOutput.RData'`
* Step 40: SNP heritability $h^2$
    - `--ldreg_out='path/LdRegOut.RData'`
* Step 50: Estimated effect size (inf/grid/auto)
    - `xxx_beta ='path/xxxBeta.RData'`
* Step 60: Predicted PRS (inf/grid/auto) and convergence plot of proportion of causal variants $p$ and heritability
    - `xxx_prs = 'path/xxxPrs.RData'`
    - `conv_plot = 'path/ConvPlot.png'`

## Reference

1. [LDpred2: better, faster, stronger](https://www.biorxiv.org/content/10.1101/2020.04.28.066720v3.full.pdf)
2. [bigsnpr](https://cran.rstudio.com/web/packages/bigsnpr/bigsnpr.pdf)
3. [Tutorial for LDpred2](https://privefl.github.io/bigsnpr/articles/LDpred2.html)

## Data preparation

* Summary statistics (n*9)

Column: "chr", "pos", "rsid", "a1", "a0", "n_eff", "beta", "beta_se", "p"

* Reference panel

    genotypes (n * # of snp):
    
        matrix that contains 0,1,2
    
    fam (n * 6):
    
        "family.ID", "sample.ID", "paternal.ID", "maternal.ID", "sex", "affection"

    map (# of snp * 6):
    
        "chromosome", "marker.ID", "genetic.dist", "physical.pos", "allele1", "allele2"

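As a minimal sketch, a summary-statistics table with the nine columns listed above could be assembled and stored as `.rds` as follows. All values, the SNP identifiers and the temporary output path are hypothetical and only illustrate the expected layout.

```r
# Toy summary statistics with the nine columns expected by the pipeline
# (all values are made up for illustration).
sumstats <- data.frame(
  chr     = c(1L, 1L, 2L),
  pos     = c(10583L, 13302L, 20147L),
  rsid    = c("rs0001", "rs0002", "rs0003"),
  a1      = c("A", "T", "G"),
  a0      = c("G", "C", "A"),
  n_eff   = 2000000,
  beta    = c(0.012, -0.034, 0.005),
  beta_se = c(0.004, 0.010, 0.006),
  stringsAsFactors = FALSE
)
# Two-sided p-value from the Wald z-statistic
sumstats$p <- 2 * pnorm(-abs(sumstats$beta / sumstats$beta_se))

# Save in the .rds format read by the pipeline (hypothetical path)
out_file <- tempfile(fileext = ".rds")
saveRDS(sumstats, out_file)
print(names(sumstats))
```
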


## Command Interface

In [3]:
sos run ldpred2.ipynb -h

usage: sos run ldpred2.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  inf
  auto
  grid

Global Workflow Options:
  --summstats-file VAL (as path, required)
                        path to summary statistics files, genotypes, phenotypes
                        and covaraites data "*.rds"
  --bed-file VAL (as path, required)
                        "*.bed"
  --geno-file VAL (as path, required)
                        "*.rds"
  --new-geno VAL (as path, required)
                        "*.rds"
  --n-eff 2000000 (as int)
                        parameter: tmpdir_ = "tmp-dataset"
  --summary-stat 'path/SumStats.RData'
                        Predict PRS
  --qc-in 'path/QcInput.RDat


## Global Parameter Setting

In [None]:
[global]

### Data preparation

# path to summary statistics files, genotypes, phenotypes and covariates data
# "*.rds"
parameter: summstats_file = path
# "*.bed"
parameter: bed_file = path
# "*.rds"
parameter: ref_file = path

#parameter: tmpdir_ = "tmp-dataset"
parameter: n_eff = 2000000

# Predict PRS
parameter: summary_stat = 'path/SumStats.RData'
parameter: qc_in = 'path/QcInput.RData'
parameter: qc_plot = 'path/QcPlot.png'
parameter: conv_plot = 'path/ConvPlot.png'
parameter: sd_out = 'path/sd.rds'
parameter: ld_in = 'path/LdInput.RData'
parameter: ld_out = 'path/LdOutput.RData'
parameter: ldreg_out = 'path/LdRegOut.RData'
parameter: inf_beta = 'path/InfBeta.RData'
parameter: grid_beta = 'path/GridBeta.RData'
parameter: auto_beta = 'path/AutoBeta.RData'
parameter: inf_prs = 'path/InfPrs.RData'
parameter: grid_prs = 'path/GridPrs.RData'
parameter: auto_prs = 'path/AutoPrs.RData'

# Predict Phenotype


# Binary or continuous phenotype
parameter: response = 1

## Example command

```
sos run ldpred2.ipynb auto \
    --summstats_file mvpdata/sumstats_ldl.rds \
    --bed_file  1000G/1000G.bed \
    --ref_file 1000G/1000G.rds \
    --summary_stat './res-data/SumStats.RData' \
    --qc_in './res-data/QcInput.RData' \
    --qc_plot './res-data/QcPlot.png' \
    --conv_plot './res-data/ConvPlot.png' \
    --ld_in './res-data/LdInput.RData' \
    --sd_out './res-data/sd.rds' \
    --ld_out './res-data/LdOutput.RData' \
    --ldreg_out './res-data/LdRegOut.RData' \
    --inf_beta './res-data/InfBeta.RData' \
    --grid_beta './res-data/GridBeta.RData' \
    --auto_beta './res-data/AutoBeta.RData' \
    --inf_prs './res-data/InfPrs.RData' \
    --grid_prs './res-data/GridPrs.RData' \
    --auto_prs './res-data/AutoPrs.RData' \
    --response 1 \
    --n_eff 200000
```


## Workflow



### Predict PRS 

#### Process bed file

In [None]:
[data_load_10]

input: bed_file

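R: expand=True
    suppressMessages(library(bigsnpr))
    # Convert bed/bim/fam to the bigSNP format (.rds/.bk);
    # try() lets the step continue when the backing files already exist
    try(snp_readBed("{_input}"))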

#### Load reference panel and summary statistics

Save `sumstats` as `SumStats.RData`.

In [None]:
[data_load_20]

input: summstats_file
output: summary_stat

R: expand=True
    # Read in the summary statistic file
    sumstats <- readRDS("{_input}") 
    sumstats$n_eff = {n_eff}
    save(sumstats, file = "{_output}")

#### Load genotype data (UK Biobank)

For PRS calculation and phenotype comparison.

#### SNP matching and genetic positions (cM) from 1000 Genomes

* Perform SNP matching with `snp_match(sumstats, map)` to get `info_snp`

Match alleles between the summary statistics `sumstats` and the SNP information from `obj.bigSNP`.

* Get genetic positions (in cM) from the 1000 Genomes genetic maps with `snp_asGeneticPos(CHR, POS, dir = ".")`

Use genetic maps available at https://github.com/joepickrell/1000-genomes-genetic-maps/ to interpolate physical positions (in bp) to genetic positions (in cM).


Save `obj.bigSNP`, `genotype`, `map`, `CHR`, `POS`, `info_snp` and `POS2` as `QcInput.RData`.

In [None]:
[data_load_30]

input: geno = ref_file, sums = summary_stat
output: qc_in
 
R: expand=True 
    suppressMessages(library(bigsnpr))
    load("{_input["sums"]}")
    # now attach the genotype object
    obj.bigSNP <- snp_attach("{_input["geno"]}")
    
    # Assign the genotype to a variable for easier downstream analysis
    genotype <- obj.bigSNP$genotypes
    
    # extract the SNP information from the genotype
    map <- obj.bigSNP$map[-(2:3)]
    names(map) <- c("chr", "pos", "a1", "a0")  
    
    # Rename the data structures
    CHR <- map$chr
    POS <- map$pos   

    # perform SNP matching
    info_snp <- snp_match(sumstats, map)
    
    # get genetic positions (cM) from the 1000 Genomes genetic maps
    # downloads the genetic map files to the current directory (".")
    POS2 <- snp_asGeneticPos(CHR, POS, dir = ".")
   
   
    
    # save data to Rdata file
    save(obj.bigSNP, genotype, map, CHR, POS, info_snp, POS2, file = "{_output}")
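
The idea behind allele matching can be illustrated with a simplified base-R sketch on toy data. It only joins on chromosome and position and reverses the effect sign when `a1` and `a0` are swapped. It is not the `snp_match()` implementation, which also deals with cases such as strand flips, and all values are hypothetical.

```r
# Toy summary statistics and reference map (values are illustrative).
sumstats <- data.frame(chr = c(1, 1, 2), pos = c(100, 200, 300),
                       a1 = c("A", "C", "G"), a0 = c("G", "T", "T"),
                       beta = c(0.10, -0.20, 0.30))
map <- data.frame(chr = c(1, 1, 2), pos = c(100, 200, 300),
                  a1 = c("A", "T", "A"), a0 = c("G", "C", "C"))

# Join on chromosome and position; allele columns get suffixes.
m <- merge(sumstats, map, by = c("chr", "pos"), suffixes = c(".ss", ".ref"))

same    <- m$a1.ss == m$a1.ref & m$a0.ss == m$a0.ref   # same orientation
swapped <- m$a1.ss == m$a0.ref & m$a0.ss == m$a1.ref   # a1/a0 reversed

# Reverse the effect sign when the alleles are swapped, drop mismatches.
m$beta[swapped] <- -m$beta[swapped]
matched <- m[same | swapped, c("chr", "pos", "beta")]
print(matched)
```

In this toy example the first variant is kept as is, the second has its effect sign reversed, and the third is dropped because its alleles do not agree with the reference in either orientation.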
    

#### Quality Control

In [1]:
[QControl]

input: qc_in
output: sdout = sd_out, ldin = ld_in, qcplot = qc_plot


R: expand=True
    suppressMessages(library(bigsnpr))
    suppressMessages(library(tidyverse))
    load("{_input}")
    NCORES = nb_cores()
    ind.val = 1:nrow(genotype)
    sd <- runonce::save_run(
      sqrt(big_colstats(genotype, ind.val, ncores = NCORES)$var),
      file = "{_output["sdout"]}"
    )

    sd_val <- sd[info_snp$`_NUM_ID_`]

    sd_ss <- with(info_snp, 2 / sqrt(n_eff * beta_se^2))

    is_bad <- sd_ss < (0.5 * sd_val) | 
            sd_ss > (sd_val + 0.1) |  ##### fixme>
            sd_ss < 0.1 | 
            sd_val < 0.05
      
    qplot(sd_val, sd_ss, color = is_bad, alpha = I(0.5)) +
      theme_bigstatsr() +
      coord_equal() +
      scale_color_viridis_d(direction = -1) +
      geom_abline(linetype = 2, color = "red") +
      labs(x = "Standard deviations in the validation set",
           y = "Standard deviations derived from the summary statistics",
           color = "Removed?")
    ggsave("{_output["qcplot"]}")      
      
    n = nrow(info_snp)
    print(paste0(sum(is_bad, na.rm = TRUE), " of ", n, " SNPs were removed in quality control."))
           
    info_snp = info_snp[!is_bad, ]
           
    save(obj.bigSNP, genotype, map, CHR, POS, info_snp, POS2, file = "{_output["ldin"]}")
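
The QC rule compares the standard deviation implied by the summary statistics, $2 / \sqrt{n_{\text{eff}} \cdot \text{se}(\hat\beta)^2}$, with the standard deviation computed from the genotypes. A small sketch on toy numbers (hypothetical values, same thresholds as in the step above) shows which variants would be flagged:

```r
# Toy inputs (illustrative values only).
n_eff   <- 2000000
beta_se <- c(0.0020, 0.0030, 0.0100, 0.0020)
sd_val  <- c(0.70, 0.50, 0.60, 0.04)   # SDs computed from genotypes

# SD implied by the summary statistics (same formula as in the QC step)
sd_ss <- 2 / sqrt(n_eff * beta_se^2)

# Same four criteria as in the QC step
is_bad <- sd_ss < (0.5 * sd_val) |
          sd_ss > (sd_val + 0.1) |
          sd_ss < 0.1 |
          sd_val < 0.05

print(round(sd_ss, 3))
print(is_bad)
```

Here the third variant is flagged because its implied SD is less than half of the genotype SD, and the fourth because its genotype SD is below 0.05.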

#### Calculate LD

* Calculate LD using the genotypes from `obj.bigSNP` and the genetic positions in cM (distance)

Save `info_snp`, `ld`, `fam.order`, `corr`, `NCORES`, `genotype` as `LdOutput.RData`.

In [None]:
[LD_corr]
input: ldin = ld_in
output: ld_out

R: expand = True
    suppressMessages(library(bigsnpr))
    suppressMessages(library(data.table))
    suppressMessages(library(bigsparser))
    load("{_input["ldin"]}") 
    source("rcode/LD_corr.R")
    save(info_snp, ld, fam.order, corr, NCORES, genotype, file = "{_output}")

#### Perform LD score regression 

Use the function `snp_ldsc()` to obtain the SNP heritability $h^2$.

Save `h2_est`, `df_beta`, `corr`, `NCORES`, `info_snp`, `genotype` as `LdRegOut.RData`.

In [None]:
[LD_reg]

input: ld_out
output: ldreg_out

R: expand=True
    suppressMessages(library(bigsnpr))
    suppressMessages(library(tidyverse))
    suppressMessages(library(data.table))
    load("{_input}")
    df_beta <- info_snp[,c("beta", "beta_se", "n_eff", "_NUM_ID_")]
    ldsc <- snp_ldsc(ld, 
                    length(ld), 
                    chi2 = (df_beta$beta / df_beta$beta_se)^2,
                    sample_size = df_beta$n_eff, 
                    blocks = NULL)
    h2_est <- ldsc[["h2"]]
    save(h2_est, df_beta, corr, NCORES, info_snp, genotype,file = "{_output}")
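
To make the regression idea concrete, the sketch below uses the usual LD score regression expectation $E[\chi^2_j] = a + N h^2 \ell_j / M$ on noise-free toy data and recovers $h^2$ from the slope with ordinary least squares. This is a simplified illustration with hypothetical numbers. It leaves out the regression weights and block-jackknife standard errors that a full LD score regression would normally include.

```r
# Toy LD scores and noise-free chi-square statistics generated from
#   E[chi2_j] = intercept + N * h2 * ld_j / M
M       <- 5         # number of variants (hypothetical)
N       <- 10000     # sample size (hypothetical)
h2_true <- 0.3
ld      <- c(1.2, 2.5, 3.1, 4.8, 6.0)
chi2    <- 1 + N * h2_true * ld / M

# Unweighted least squares; the slope equals N * h2 / M
fit    <- lm(chi2 ~ ld)
h2_hat <- unname(coef(fit)["ld"]) * M / N
print(h2_hat)
```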

#### Get adjusted betas

##### Infinitesimal model

Save `beta_inf`, `df_beta`, `corr`, `NCORES`, `info_snp`, `genotype` as `InfBeta.RData`.

In [None]:
[inf_10]

input: ldreg_out
output: inf_beta

R: expand=True
    suppressMessages(library(bigsnpr))
    suppressMessages(library(data.table))
    load("{_input}")
    ## adjusted beta ##
    beta_inf <- snp_ldpred2_inf(corr, df_beta, h2 = h2_est)
    # save data
    save(beta_inf, df_beta, corr, NCORES,info_snp, genotype, file = "{_output}")
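
For intuition, the infinitesimal model amounts to a ridge-type shrinkage of the marginal effects through the LD matrix. The sketch below assumes the standard closed form $(R + \frac{M}{N h^2} I)^{-1} \hat\beta$ on standardized effects, with toy values. It is an illustration and not the `snp_ldpred2_inf()` source, which also handles the rescaling of effects.

```r
# Toy inputs (hypothetical values).
corr     <- matrix(c(1.0, 0.4,
                     0.4, 1.0), nrow = 2)   # LD correlation matrix
beta_hat <- c(0.05, 0.02)                   # standardized marginal effects
N  <- 1000                                  # GWAS sample size
h2 <- 0.5                                   # SNP heritability
M  <- ncol(corr)                            # number of variants

# Ridge-type shrinkage: solve (R + M / (N * h2) * I) beta = beta_hat
lambda   <- M / (N * h2)
beta_inf <- solve(corr + diag(lambda, M), beta_hat)
print(beta_inf)
```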

##### Grid model

Save `beta_grid`, `df_beta`, `corr`, `NCORES`, `info_snp`, `genotype` as `GridBeta.RData`.

In [None]:
[grid_10]

input: ldreg_out
output: grid_beta

R: expand=True
    suppressMessages(library(bigsnpr))
    suppressMessages(library(data.table))
    load("{_input}")
    # Prepare data for grid model
    p_seq <- signif(seq_log(1e-4, 1, length.out = 17), 2)
    h2_seq <- round(h2_est * c(0.7, 1, 1.4), 4)
    grid.param <-
        expand.grid(p = p_seq,
                h2 = h2_seq,
                sparse = c(FALSE, TRUE))
    # Get adjusted beta from grid model
    beta_grid <- snp_ldpred2_grid(corr, df_beta, grid.param, ncores = NCORES)
    # save data
    save(beta_grid, df_beta, corr, NCORES,info_snp, genotype, file = "{_output}")
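
The size of the hyper-parameter grid can be checked without any genotype data. The sketch below rebuilds the grid in base R. `seq_log_base()` is a hypothetical stand-in for `seq_log()` and `h2_est = 0.3` is an assumed value.

```r
# Log-spaced sequence; a base-R stand-in for seq_log()
seq_log_base <- function(from, to, length.out) {
  exp(seq(log(from), log(to), length.out = length.out))
}

h2_est <- 0.3   # hypothetical heritability estimate

p_seq  <- signif(seq_log_base(1e-4, 1, length.out = 17), 2)
h2_seq <- round(h2_est * c(0.7, 1, 1.4), 4)
grid.param <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE))

# 17 values of p x 3 values of h2 x 2 sparsity options = 102 models
print(dim(grid.param))
print(h2_seq)
```

Each row of `grid.param` corresponds to one column of `beta_grid`, so the grid model returns one vector of adjusted effects per parameter combination.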

##### Auto model

Save `beta_auto`, `df_beta`, `multi_auto`, `corr`, `NCORES`, `info_snp`, `genotype` as `AutoBeta.RData`.

In [None]:
[auto_10]

input: ldreg_out
output: auto_beta

R: expand=True
    suppressMessages(library(bigsnpr))
    suppressMessages(library(data.table))
    load("{_input}")
    # Get adjusted beta from the auto model
    multi_auto <- snp_ldpred2_auto(
        corr,
        df_beta,
        h2_init = h2_est,
        vec_p_init = seq_log(1e-4, 0.9, length.out = 30),
        ncores = NCORES
    )
    beta_auto <- sapply(multi_auto, function(auto)
        auto$beta_est)
    # save data
    save(beta_auto, df_beta,multi_auto, corr, NCORES,info_snp, genotype, file = "{_output}")

#### Get PRS

##### Infinitesimal model

Save `pred_inf` as `InfPrs.RData`.

In [None]:
[inf_20]

input: infbeta = inf_beta
output: inf_prs

R: expand=True
    suppressMessages(library(bigsnpr))
    suppressMessages(library(data.table))
    load("{_input["infbeta"]}")

    # calculate PRS for all samples
    ind.test <- 1:nrow(genotype)
    pred_inf <- big_prodVec(    genotype,
                                beta_inf,
                                ind.row = ind.test,
                                ind.col = info_snp$`_NUM_ID_`)
    save(pred_inf, file = "{_output}")
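
Conceptually, the score is the product of the genotype matrix and the vector of adjusted effects. A toy base-R example with hypothetical genotypes and effects shows the same computation on an ordinary matrix:

```r
# Toy genotype matrix: 3 individuals x 4 variants, coded 0/1/2
G <- matrix(c(0, 1, 2, 1,
              2, 0, 1, 0,
              1, 1, 0, 2),
            nrow = 3, byrow = TRUE)
beta <- c(0.5, -0.2, 0.1, 0.3)   # hypothetical adjusted effects

# PRS = genotype matrix times effect vector, one score per individual
prs <- drop(G %*% beta)
print(prs)
```

In the pipeline, `big_prodVec()` performs this product on the file-backed genotype matrix, restricted to the rows in `ind.row` and the matched variants in `ind.col`.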

##### Grid model

Save `pred_grid` as `GridPrs.RData`.

In [5]:
[grid_20]

input: gridbeta = grid_beta
output: grid_prs

R: expand=True
    suppressMessages(library(bigsnpr))
    suppressMessages(library(data.table))
    load("{_input["gridbeta"]}")

    ind.test <- 1:nrow(genotype)
    pred_grid <- big_prodMat(   genotype, 
                                beta_grid, 
                                ind.col = info_snp$`_NUM_ID_`)
    save(pred_grid, file = "{_output}")

##### Auto model

Save `pred_auto` as `AutoPrs.RData`.

In [None]:
[auto_20]

input: autobeta = auto_beta
output: autoprs = auto_prs, convplot = conv_plot 

R: expand=True
    suppressMessages(library(bigsnpr))
    suppressMessages(library(data.table))
    suppressMessages(library(ggplot2))
    load("{_input["autobeta"]}")

    
    # calculate PRS for all samples, one column per chain
    ind.test <- 1:nrow(genotype)
    pred_auto <-
        big_prodMat(genotype,
                    beta_auto,
                    ind.row = ind.test,
                    ind.col = info_snp$`_NUM_ID_`)
    # keep chains whose prediction spread is close to the median spread
    pred_scaled <- apply(pred_auto, 2, sd)
    final_beta_auto <-
        rowMeans(beta_auto[,
                    abs(pred_scaled -
                        median(pred_scaled)) <
                        3 * mad(pred_scaled), drop = FALSE])
    # final PRS from the averaged effects of the retained chains
    pred_auto <-
        big_prodVec(genotype,
            final_beta_auto,
            ind.row = ind.test,
            ind.col = info_snp$`_NUM_ID_`)

    # convergence plot of p and h2 for one chain
    auto <- multi_auto[[2]]
    plot_grid(
      qplot(y = auto$path_p_est) +
        theme_bigstatsr() +
        geom_hline(yintercept = auto$p_est, col = "blue") +
        scale_y_log10() +
        labs(y = "p"),
      qplot(y = auto$path_h2_est) +
        theme_bigstatsr() +
        geom_hline(yintercept = auto$h2_est, col = "blue") +
        labs(y = "h2"),
      ncol = 1, align = "hv"
    )
    ggsave("{_output["convplot"]}", width = 10, height = 7)
    # save the PRS so that the declared output `autoprs` is produced
    save(pred_auto, file = "{_output["autoprs"]}")
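
The chain-filtering rule for the auto model keeps the chains whose prediction spread lies within three MADs of the median spread, then averages their effects. It can be tried on toy data in base R. The matrices below are hypothetical, and the fifth chain is deliberately made to diverge.

```r
# Toy matrix of effects: 4 variants x 5 chains (hypothetical values);
# chain 5 has diverged.
beta_auto <- cbind(c(0.10, 0.00, -0.05, 0.02),
                   c(0.11, 0.01, -0.04, 0.02),
                   c(0.09, 0.00, -0.05, 0.03),
                   c(0.10, 0.01, -0.06, 0.02),
                   c(5.00, 3.00, -4.00, 2.00))
# Toy genotypes: 6 individuals x 4 variants, coded 0/1/2
G <- matrix(c(0, 1, 2, 1,
              2, 0, 1, 0,
              1, 1, 0, 2,
              0, 2, 1, 1,
              2, 2, 0, 0,
              1, 0, 2, 2), nrow = 6, byrow = TRUE)

pred_auto <- G %*% beta_auto          # one PRS column per chain
sc   <- apply(pred_auto, 2, sd)       # spread of each chain's predictions
keep <- abs(sc - median(sc)) < 3 * mad(sc)

# Average the effects of the retained chains only
final_beta_auto <- rowMeans(beta_auto[, keep, drop = FALSE])
print(keep)
print(final_beta_auto)
```

The diverged chain is excluded, and the final effects are the row means of the remaining four chains.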

### Predict phenotype

In [4]:
[phepred_10]