# Example polygenic risk scores: participant height

> Polygenic scores are important tools for understanding complex genetic associations. In this notebook, we show how to derive polygenic scores based on summary statistics and a matrix of correlation between genetic variants. We will use the R package `bigsnpr`, which implements the LDpred2 method (https://doi.org/10.1093/bioinformatics/btaa1029).

> As input, we will use the same data as in the previous GWAS examples. This notebook focuses on a linear model using participant height data. In the next one, we will use a more complex example with logistic regression and blood pressure data.

- runtime: 1h
- recommended instance: mem1_ssd1_v2_x16
- estimated cost: <£1.00

This notebook depends on:
* **Notebook 107** - maf_flt_8chroms* prefixed files

## Install required packages

The function `p_load` from `pacman` loads packages into R.
If a given package is missing, it is installed automatically; this can take a considerable amount of time for packages that need to compile C or Fortran code.

The following packages are needed to run this notebook:

- `dplyr` - tabular data manipulation in R, required to pre-process and filter phenotypic data
- `parallel` - parallel computation in R
- `bigsnpr` - run statistics on file-backed arrays, needed to calculate the approximate singular value decomposition (SVD) needed for PCA plots
- `bigparallelr` - controls parallel computation using file-backed arrays
- `ggplot2` - needed for graphics 
- `readr` - read tabular text files, used here to read the `.fam` file
- `tidyr` - reshape tabular data

In [None]:
if(!require(pacman)) install.packages("pacman")
pacman::p_load(dplyr, bigsnpr, ggplot2, readr, tidyr, bigparallelr)

## Read the EIDs of individuals in the exome cohort from `.fam` file

This file can be created using **Notebook 107**.

In [None]:
system('dx download -fr bed_maf/maf_flt_8chroms*', intern=TRUE)
exome_eids <- read_table('maf_flt_8chroms.fam', col_names = LETTERS[1:6]) %>% pull(B)
str(exome_eids)
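
For orientation, a PLINK `.fam` file has six whitespace-separated columns and no header: family ID, within-family ID, paternal ID, maternal ID, sex, and phenotype. The cell above pulls the second column (`B`) as the participant EID. The following is a minimal sketch with a toy file; the IDs and the column names (`fid`, `iid`, ...) are made up for illustration and are not taken from the real dataset.

```r
# Toy .fam file: six whitespace-separated columns, no header (made-up IDs)
fam_path <- tempfile(fileext = ".fam")
writeLines(c(
  "1000001 1000001 0 0 1 -9",
  "1000002 1000002 0 0 2 -9",
  "1000003 1000003 0 0 2 -9"
), fam_path)

# Descriptive column names chosen for this sketch
fam <- read.table(
  fam_path,
  col.names = c("fid", "iid", "father", "mother", "sex", "phenotype")
)

# The within-family ID (second column) plays the role of the EID
toy_eids <- fam$iid
print(toy_eids)
```

Using descriptive names instead of `LETTERS[1:6]` is only a readability choice; the values read are the same.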

## Simulate height data for 2500 individuals

In this step, we sample a participant's height from a normal distribution. The parameters are based on data observed from the British population. You can try inputting real data here - your project has to have access to the *Participant standing height* field. You can retrieve this data using a cohort browser or following the methods in **Notebook 201** (for R) or **103** (for Python).

In [4]:
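pheno <- tibble(eid=exome_eids) %>%
    filter(eid>0) %>%
    sample_n(2500) %>%
    # keep rows in .fam order so they line up with the genotype rows read below
    arrange(match(eid, exome_eids)) %>%
    mutate(height = rnorm(length(eid), 177.8, 5.97) %>% round)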
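
When real phenotypes are used instead of simulated ones, row order matters: the genotype rows read later follow the order of the `.fam` file, so the phenotype table should be in that order too. The sketch below illustrates one way to align the two with `match()`, using hypothetical IDs (`fam_ids`) and base R only.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

# Hypothetical sample IDs in the order they appear in a .fam file
fam_ids <- c(11, 12, 13, 14, 15, 16)

# Phenotype table for a random subset, in shuffled order
toy_pheno <- data.frame(eid = sample(fam_ids, 4))
toy_pheno$height <- round(rnorm(nrow(toy_pheno), mean = 177.8, sd = 5.97))

# Rows that would be kept from the genotype file (always in .fam order)
kept_rows <- which(fam_ids %in% toy_pheno$eid)

# Reorder the phenotype table so that row i matches genotype row i
toy_pheno <- toy_pheno[match(fam_ids[kept_rows], toy_pheno$eid), ]
rownames(toy_pheno) <- NULL
print(toy_pheno)
```

With purely simulated heights the order has no effect on the results, but with measured phenotypes a mismatch would silently pair each genotype with the wrong individual.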

## Read and preview the plink bed/bim/fam files

In [None]:
bedfile <- normalizePath("maf_flt_8chroms.bed")
tmpfile <- normalizePath("bigsnpr_input_prc_height", mustWork = FALSE)
# remove backing files left over from a previous run
unlink(paste0(tmpfile, c(".bk", ".rds")))
snp_readBed2(bedfile, backingfile = tmpfile, ind.row=which(exome_eids %in% pheno$eid))

In [None]:
obj.bigSNP <- snp_attach(paste0(tmpfile, ".rds"))
str(obj.bigSNP, max.level = 2, strict.width = "cut")

## PCA

In [7]:
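NCORES <- 1
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
# indices of variants in long-range LD regions, excluded from the PCA
ind.excl <- snp_indLRLDR(infos.chr = as.integer(as.factor(CHR)), infos.pos = POS)
# LD clumping: keep a subset of variants that are not in strong LD with each other
ind.keep <- snp_clumping(G, infos.chr = as.integer(as.factor(CHR)), exclude = ind.excl, ncores = NCORES)
# partial SVD of the scaled genotype matrix (first 10 components by default)
obj.svd <- big_randomSVD(G, fun.scaling = snp_scaleBinom(), ind.col = ind.keep, ncores = NCORES)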
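
To make the PCA step more concrete, here is a minimal base-R sketch of the same idea on a small simulated genotype matrix. It applies binomial scaling (centre by 2p, divide by sqrt(2p(1 - p)), where p is the observed allele frequency) and then an exact SVD. This is an illustration only: it uses `svd()` rather than the randomized, file-backed algorithm in `big_randomSVD`, and all object names (`geno`, `p_hat`, `pcs`) are made up for the example.

```r
set.seed(42)  # arbitrary seed for a reproducible toy example

# Toy genotype matrix: 50 individuals x 20 variants, coded 0/1/2
n <- 50
m <- 20
maf <- runif(m, 0.2, 0.5)
geno <- sapply(maf, function(p) rbinom(n, size = 2, prob = p))

# Binomial scaling: centre by 2p and divide by sqrt(2p(1 - p)),
# with p estimated as the observed allele frequency of each variant
p_hat <- colMeans(geno) / 2
geno_scaled <- sweep(geno, 2, 2 * p_hat, "-")
geno_scaled <- sweep(geno_scaled, 2, sqrt(2 * p_hat * (1 - p_hat)), "/")

# Exact SVD, keeping the first 3 components
k <- 3
dec <- svd(geno_scaled, nu = k, nv = k)

# PC scores: left singular vectors scaled by the singular values
pcs <- dec$u %*% diag(dec$d[1:k])
dim(pcs)
```

In the notebook, the left singular vectors `obj.svd$u` are later passed to the regression model as covariates, which is a common way to account for population structure.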

## Polygenic risk scores

### Divide the dataset into train set and test set

We use 2000 individuals to train our model and the remaining 500 to test it.

In [8]:
ind.train <- sample(nrow(G), 2000)
ind.test <- setdiff(rows_along(G), ind.train)
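
Because the split is random, the exact results will differ between runs. A minimal sketch of a reproducible split is shown below, assuming a fixed seed is acceptable for your analysis; the seed value and the names `idx_train` and `idx_test` are arbitrary choices for this example.

```r
set.seed(2024)  # arbitrary seed; any fixed value makes the split repeatable

n_total <- 2500
idx_train <- sample(n_total, 2000)
idx_test <- setdiff(seq_len(n_total), idx_train)

# Sanity checks: 500 test individuals, no overlap with the training set
length(idx_test)
length(intersect(idx_train, idx_test))
```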

### Train linear regression model

In [9]:
cmsa.lin <- big_spLinReg(
    X = G, 
    y.train = pheno$height[ind.train], 
    ind.train = ind.train, 
    covar.train = obj.svd$u[ind.train, ],
    alphas = c(1, 0.5, 0.05, 0.001),
    ncores = NCORES
)

### Get the model predictions for the test set

In [10]:
preds <- predict(cmsa.lin, X = G, ind.row = ind.test, covar.row = obj.svd$u[ind.test, ])

### Calculate the root-mean-square error (RMSE) of the predictions

Please note that since we simulated the phenotypic data, there is no actual association between phenotypes and genotypes.
Running this model on phenotypes obtained from actual participants and a significantly larger sample should yield a smaller error.

In [None]:
RMSE = function(m, o){
  sqrt(mean((m - o)^2))
}
RMSE(preds, pheno$height[ind.test])


We can observe that the RMSE of 5.94 cm is nearly identical to the standard deviation of height used in our simulation (5.97 cm).

To get further intuition about the meaning of RMSE in predictive models, we can evaluate a trivial predictor that assigns the population mean to everyone.
As expected, its error is very similar to that of our model trained on random data (5.94 cm).

In [None]:
RMSE(rep(177.8, length(ind.test)), pheno$height[ind.test])
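
The link between this baseline and the standard deviation can be checked directly. In the sketch below, which uses freshly simulated heights rather than the notebook's `pheno` table, predicting the sample mean for everyone gives an RMSE equal to the standard deviation computed with a divisor of n instead of n - 1. The function name `rmse` and the seed are choices made for this example.

```r
set.seed(7)  # arbitrary seed

rmse <- function(predicted, observed) sqrt(mean((predicted - observed)^2))

# Simulated heights with the same parameters as in the notebook
heights <- rnorm(500, mean = 177.8, sd = 5.97)
n_obs <- length(heights)

# Baseline: predict the sample mean for everyone
baseline <- rep(mean(heights), n_obs)
baseline_rmse <- rmse(baseline, heights)

# Standard deviation with divisor n (sd() itself divides by n - 1)
sd_n <- sd(heights) * sqrt((n_obs - 1) / n_obs)

c(baseline_rmse = baseline_rmse, sd_n = sd_n)
```

A model only adds value when its RMSE falls clearly below this baseline.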

### Calculate R-squared

Squared correlation is another useful metric for linear model performance.
Since our model is based on randomly generated phenotypes this value is close to 0.

In [None]:
cor(preds, pheno$height[ind.test])^2
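
To see how this metric behaves when predictions do carry signal, the sketch below compares two hypothetical sets of predictions on simulated heights: one unrelated to the outcome and one correlated with it. It also checks that the squared correlation equals the R-squared of a simple linear regression of the observed values on the predictions. All names and noise levels are assumptions made for this illustration.

```r
set.seed(11)  # arbitrary seed

observed <- rnorm(500, mean = 177.8, sd = 5.97)

# Predictions unrelated to the outcome, as with the simulated phenotypes
noise_pred <- rnorm(500, mean = 177.8, sd = 1)
r2_noise <- cor(noise_pred, observed)^2

# Predictions that share signal with the outcome (noise SD equal to outcome SD)
signal_pred <- observed + rnorm(500, mean = 0, sd = 5.97)
r2_signal <- cor(signal_pred, observed)^2

# Squared correlation matches the R-squared of a simple linear regression
r2_lm <- summary(lm(observed ~ signal_pred))$r.squared

c(r2_noise = r2_noise, r2_signal = r2_signal, r2_lm = r2_lm)
```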

## Closing remarks

In conclusion, if these results had been obtained from real data, we would conclude that there is no association between the genetic variants and human height.
In reality, genetic variants can explain ~60% of phenotypic variance for height (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4250049/), with a typical PRS model achieving an R-squared of ~0.4 for variants alone and ~0.7 for models that also include biological sex (https://academic.oup.com/jcem/article/106/7/1918/6206752).