# LD Score Regression Tutorial

Author: Anmol Singh (singh.anmol@columbia.edu)

## Background

A brief review of LDSC is available [here](https://cumc.github.io/xqtl-pipeline/pipeline/integrative_analysis/enrichment/ldsc.html). That page also implements a pipeline version of LDSC for real-world data analysis. In this tutorial we focus on an example data-set to illustrate how LDSC works.

## Software setup

### Installing LDSC with Conda

Make sure that Python 2 is installed, along with conda if you want to install LDSC through conda. Conda can be installed using this link: https://store.continuum.io/cshop/anaconda/.

#### Step 1: Clone the Github Repository

`git clone https://github.com/bulik/ldsc.git`

`cd ldsc`

#### Step 2: Create and Activate the Conda Environment

`conda env create --file environment.yml`

`source activate ldsc`

#### Step 3: Check to see if the main python scripts used for analysis are executable

`./ldsc.py -h`

`./munge_sumstats.py -h`

Both `./ldsc.py -h` and `./munge_sumstats.py -h` should print the list of available command-line options. If either one does not, something is wrong with the installation.

### Installing LDSC without Conda

#### Step 1: Clone the Github Repository

`git clone https://github.com/bulik/ldsc.git`

`cd ldsc`

#### Step 2: Make the python scripts executable

`chmod +x ldsc.py`

`chmod +x munge_sumstats.py`

#### Step 3: Check to see if the main python scripts used for analysis are executable

`./ldsc.py -h`

`./munge_sumstats.py -h`

`make_annot.py` requires pybedtools. If you need to make binary annotations and did not install LDSC through conda, you must either install pybedtools on the cluster or use it through the Docker image below.

#### Step 4: Load Docker Image that has pybedtools

`module load Singularity`

`module load R`

`singularity pull docker://quay.io/biocontainers/pybedtools-0.8.0-py27he860b03_1`

Through this image you will now be able to use the `make_annot.py` script with no issues. 

## Example Analysis 1: Simple LD Score Regression

This is a simple example of non-partitioned LD Score Regression.

You can find the plink files needed for this tutorial here: https://data.broadinstitute.org/alkesgroup/LDSCORE/1000G_Phase3_plinkfiles.tgz. This archive contains the bed/bim/fam files for 489 subjects covering all 1000 Genomes Phase 3 SNPs, which will be used as the reference panel for our analysis.

After downloading the data, we can look at a simple example that calculates the LD Scores for the 1000 Genomes Phase 3 variants on a single chromosome. The regression needs LD Scores for every chromosome, but the command is the same for each one, so I will show the output for only one. A convenient way to loop over all the chromosomes in parallel is to use xargs:

```
seq 1 22 | xargs -I {} -P 4 python ldsc.py --bfile 1000G.EUR.QC.{} --l2 --ld-wind-cm 1 --out tutorial.{}
```

This xargs command runs the command once per input line, substituting the chromosome number for the placeholder {} (set with the -I flag). The -P 4 flag allows up to 4 of these commands to run at the same time: chromosomes 1 to 4 start together, and each time one finishes the next chromosome starts.

For the command flags: `--bfile` gives the prefix of the plink bed/bim/fam files, `--l2` indicates that you want to calculate LD Scores, `--ld-wind-cm 1` sets a 1 cM window for the LD Score calculation, and `--out` gives the prefix to use for the output files.
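The same per-chromosome loop can also be driven from Python. The following is a minimal sketch, not part of LDSC: the helper names `build_ldsc_commands` and `run_commands` are hypothetical, and it assumes that `python2` and `ldsc.py` are reachable from the working directory. The driver itself is written for Python 3 and only launches `ldsc.py` as a subprocess.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_ldsc_commands(bfile_prefix, out_prefix, chromosomes=range(1, 23)):
    """Return one ldsc.py argument list per chromosome (hypothetical helper)."""
    commands = []
    for chrom in chromosomes:
        commands.append([
            "python2", "ldsc.py",
            "--bfile", f"{bfile_prefix}{chrom}",
            "--l2",
            "--ld-wind-cm", "1",
            "--out", f"{out_prefix}{chrom}",
        ])
    return commands


def run_commands(commands, max_workers=4):
    """Run the commands with at most max_workers running at once; return exit codes."""
    def run_one(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


# Build the 22 commands and show the last one without running anything.
commands = build_ldsc_commands("1000G.EUR.QC.", "tutorial.")
print(" ".join(commands[21]))
```

Calling `run_commands(commands)` would then launch the jobs, four at a time by default. A non-zero entry in the returned list points at a chromosome whose run failed.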

```
python2 ldsc.py \
    --bfile 1000G.EUR.QC.7 \
    --l2 \
    --ld-wind-cm 1 \
    --out tutorial.7
```

```
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--ld-wind-cm 1.0 \
--out tutorial.7 \
--bfile /mnt/mfs/statgen/Anmol/training_files/testing/ldsc/AD_Variants/1000G_EUR_Phase3_plink/1000G.EUR.QC.7 \
--l2  

Beginning analysis at Thu Jan  6 20:53:02 2022
Read list of 589569 SNPs from /mnt/mfs/statgen/Anmol/training_files/testing/ldsc/AD_Variants/1000G_EUR_Phase3_plink/1000G.EUR.QC.7.bim
Read list of 489 individuals from /mnt/mfs/statgen/Anmol/training_files/testing/ldsc/AD_Variants/1000G_EUR_Phase3_plink/1000G.EUR.QC.7.fam
Reading genotypes from /mnt/mfs/statgen/Anmol/training_files/testing/ldsc/AD_Variants/1000G_EUR_Phase3_plink/1000G.EUR.QC.7.bed
Estimating LD Score.
Writing L
```

The output of this command shows a summary of the LD Scores and the MAF/LD Score correlation matrix which is useful for conducting QC on the analysis. The MAF and LD Scores should be positively correlated.

The command also creates a gzipped file containing the LD Scores. The first rows of the chromosome 22 file are shown below.

```
import pandas as pd
results = pd.read_csv("tutorial.22.l2.ldscore.gz", sep="\t")
results.head()
```

|  | CHR | SNP | BP | L2 |
|---|---|---|---|---|
| 0 | 22 | rs587616822 | 16050840 | 3.795 |
| 1 | 22 | rs62224609 | 16051249 | 10.431 |
| 2 | 22 | rs587646183 | 16052463 | 1.361 |
| 3 | 22 | rs139918843 | 16052684 | 4.825 |
| 4 | 22 | rs587743102 | 16052837 | 2.057 |
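Because the regression needs LD Scores for every chromosome, it can help to confirm that all 22 files exist before moving on. This is a minimal sketch under the assumption that the files follow the `tutorial.<chr>.l2.ldscore.gz` naming seen above; the helper `missing_ldscore_files` is hypothetical and not part of LDSC.

```python
import os


def missing_ldscore_files(prefix, chromosomes=range(1, 23), suffix=".l2.ldscore.gz"):
    """Return the chromosomes whose LD Score file is not found on disk."""
    return [c for c in chromosomes if not os.path.exists(f"{prefix}{c}{suffix}")]


# "tutorial." is the --out prefix used in the LD Score commands of this tutorial.
missing = missing_ldscore_files("tutorial.")
if missing:
    print("LD Scores still missing for chromosomes:", missing)
else:
    print("LD Score files found for all 22 chromosomes.")
```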


After calculating LD Scores for each chromosome, it is time to set up the summary statistic file for the phenotype you are analyzing. The summary statistic file we will use is for BMI and can be downloaded here: http://www.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files.

For the tutorial you also need the list of HapMap Phase 3 SNPs, which is used to restrict the summary statistic file to the SNPs recommended for the regression. The authors recommend this restriction because most GWAS summary statistics do not include imputation quality, so keeping only HapMap Phase 3 SNPs ensures that the analysis uses common, well-imputed variants. This file can be downloaded here: https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/w_hm3.snplist.bz2.

The summary statistic file should have the following columns, with the following names, for the analysis to work:

SNP -- SNP identifier (e.g., rs number).

N -- sample size (which may vary from SNP to SNP).

P -- p-value.

A1 -- first allele (effect allele).

A2 -- second allele (other allele).

Signed summary statistic -- Z, BETA, or odds ratio (labeled OR). This column is optional if A1 is always the risk-increasing allele: add the --a1-inc flag to the command and ldsc will calculate the Z score for each SNP for you.
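The column requirements above can be checked before running the munge step. The sketch below is a hypothetical helper, not part of LDSC, and it is deliberately strict: it only accepts the exact names listed above, whereas munge_sumstats.py may also recognize other header names.

```python
REQUIRED = ["SNP", "N", "P", "A1", "A2"]
SIGNED = ["Z", "BETA", "OR"]


def check_sumstats_header(columns, a1_inc=False):
    """Return a list of problems found in a summary-statistics header."""
    problems = [f"missing column: {name}" for name in REQUIRED if name not in columns]
    # A signed statistic is only optional when --a1-inc will be passed.
    if not a1_inc and not any(name in columns for name in SIGNED):
        problems.append("no signed statistic (Z, BETA or OR) and --a1-inc not set")
    return problems


# Illustrative header without a signed statistic.
header = "SNP A1 A2 N P".split()
print(check_sumstats_header(header))               # reports the missing signed statistic
print(check_sumstats_header(header, a1_inc=True))  # nothing to report with --a1-inc
```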

Once you have set up the summary statistic file with these column headers you can reformat it for the analysis using the following command:

```
python2 munge_sumstats.py \
    --sumstats GIANT_BMI_Speliotes2010_publicrelease_HapMapCeuFreq.txt \
    --merge-alleles w_hm3.snplist \
    --out BMI \
    --a1-inc
```

```
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./munge_sumstats.py \
--out BMI \
--merge-alleles w_hm3.snplist/w_hm3.snplist \
--a1-inc  \
--sumstats GIANT_BMI_Speliotes2010_publicrelease_HapMapCeuFreq.txt 

Interpreting column names as follows:
Allele2:	Allele 2, interpreted as non-ref allele for signed sumstat.
MarkerName:	Variant ID (e.g., rs number)
Allele1:	Allele 1, interpreted as ref allele for signed sumstat.
p:	p-Value
N:	Sample size

Reading list of SNPs for allele merge from w_hm3.snplist/w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from GIANT_BMI_Speliotes2010_publicrelease_HapMapCeuFreq.txt into memory 5000000 SNPs at a time.
Read 2471516 SNPs from --s
```

This produces a gzipped file called BMI.sumstats.gz, which is the summary statistic file used in the rest of the analysis. It contains one row per variant with the allele information and the Z score calculated by munge_sumstats.py.

Now we can run the simple LD Score regression using the command below. For this analysis we first need to download the regression weights for the HapMap SNPs excluding the HLA region, which can be found here: https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/weights_hm3_no_hla.tgz. The authors excluded the HLA region because of its unusual genetic architecture and LD pattern.

For the command flags: `--h2` requests heritability estimation by LD Score regression and takes the gzipped summary statistic file we made in the last part, `--ref-ld-chr` gives the prefix of the reference panel LD Scores calculated in the section above, and `--w-ld-chr` gives the prefix of the files that contain the regression weights. For the two `-chr` flags, ldsc appends the chromosome number to the prefix, so `tutorial.` is read as tutorial.1 through tutorial.22.

```
python2 ldsc.py \
    --h2 BMI.sumstats.gz \
    --ref-ld-chr tutorial. \
    --w-ld-chr ./weights_hm3_no_hla/weights. \
    --out tutorial
```

```
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--h2 BMI.sumstats.gz \
--ref-ld-chr tutorial. \
--out tutorial \
--frqfile-chr 1000G_frq/1000G.mac5eur. \
--w-ld-chr weights_hm3_no_hla/weights. 

Beginning analysis at Fri Jan  7 01:15:50 2022
The frequency file is unnecessary and is being ignored.
Reading summary statistics from BMI.sumstats.gz ...
Read summary statistics for 1040803 SNPs.
Reading reference panel LD Score from tutorial.[1-22] ... (ldscore_fromlist)
Read reference panel LD Scores for 9997231 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from weights_hm3_no_hla/weights.[1-22] ... (ldscore_fromlist)
Read regression we
```

We have now estimated the heritability of the BMI phenotype, which is reported in the output above. This value should be between 0 and 1, but the estimate is not constrained to that range and can fall slightly below 0 because of sampling error.

Heritability is formally defined as the proportion of phenotypic variation (VP) that is due to variation in genetic values (VG).

Thus, in this case the proportion of phenotypic variance for BMI that is due to genetic factors is relatively low.

Lambda GC is the genomic inflation factor, which tells us how much systematic bias is present in our data. It is calculated here as median(chi^2)/0.4549. The value should be close to 1.
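As a small worked example of the Lambda GC formula, the sketch below divides the median chi-square statistic by 0.4549. The function name `lambda_gc` and the input values are made up for illustration and do not come from the BMI data.

```python
from statistics import median


def lambda_gc(chi2_values):
    """Genomic inflation factor: median chi-square statistic divided by 0.4549."""
    return median(chi2_values) / 0.4549


# Made-up chi-square statistics, for illustration only.
example_chi2 = [0.10, 0.30, 0.50, 1.20, 3.80]
print(round(lambda_gc(example_chi2), 3))
```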

Mean chi^2 is the mean chi-square statistic and should be above 1.02.

Intercept is the LD Score regression intercept. The intercept should be close to 1, unless the summary statistics have been corrected with genomic control (GC), a correction for inflation from sources such as population stratification, in which case it will often be lower. The intercept in our case is below 1 because the summary statistics file we used has been GC corrected.

Ratio is (intercept-1)/(mean(chi^2)-1), which measures the proportion of the inflation in the mean chi^2 that the LD Score regression intercept ascribes to causes other than polygenic heritability. The value of ratio should be close to zero, though in practice values of 10-20% are not uncommon, probably due to sample/reference LD Score mismatch or model misspecification (e.g., low LD variants have slightly higher h^2 per SNP).
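The ratio formula can be written out directly. This is a minimal sketch with a hypothetical function name and made-up inputs; the guard for mean chi^2 at or below 1 is an added assumption so that the division stays meaningful.

```python
def ldsc_ratio(intercept, mean_chi2):
    """Compute (intercept - 1) / (mean(chi^2) - 1)."""
    if mean_chi2 <= 1:
        raise ValueError("ratio is only meaningful when mean chi^2 is above 1")
    return (intercept - 1.0) / (mean_chi2 - 1.0)


# Made-up values: an intercept of 1.02 with a mean chi^2 of 1.20.
print(round(ldsc_ratio(1.02, 1.20), 2))
```

With these inputs, about 10% of the inflation in the mean chi^2 would be ascribed to causes other than polygenic heritability.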

## Example Analysis 2: Partitioned LD Score Regression

We first make the annotation file from an annotation bed file using the make_annot.py script that comes with ldsc. For this tutorial we will use a histone mark annotation from adipose tissue, Adipose_Tissue.H3K27ac. The bed file for this annotation is available in a Google Drive folder (https://drive.google.com/drive/folders/1HdG-QsCl6fAspSxGsuoOCapwfnXCyfnU?usp=sharing), so you can download it to run the commands below. The command that makes the annotation file for one chromosome of the 1000 Genomes Phase 3 variants (the reference data) is:

```
python2 make_annot.py \
    --bed-file Adipose_Tissue.H3K27ac.bed \
    --bimfile 1000G.EUR.QC.22.bim \
    --annot-file Adipose_Tissue.H3K27ac.22.annot.gz
```

```
import pandas as pd
results = pd.read_csv("Adipose_Tissue.H3K27ac.22.annot.gz", sep="\t")
results.head()
```

|  | ANNOT |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |


This command outputs a file with one 0/1 value for each variant in the bim file, as shown above. A value of 1 means the variant lies within one of the regions in the annotation bed file.
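To illustrate what the 0/1 values mean, here is a simplified sketch of interval membership for a single chromosome. It is not how make_annot.py is implemented (that script relies on pybedtools). The function name is hypothetical, and the sketch assumes that the intervals follow the BED convention (0-based start, exclusive end) while SNP positions are 1-based, as in a bim file.

```python
from bisect import bisect_left


def annotate_positions(positions, intervals):
    """Return 1 for each 1-based position inside any (start, end) interval, else 0.

    Intervals are assumed to be BED-style, so bp is inside when start < bp <= end.
    """
    # Merge overlapping or touching intervals so the interval ends are sorted.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    ends = [end for _, end in merged]

    flags = []
    for bp in positions:
        i = bisect_left(ends, bp)  # first merged interval whose end is >= bp
        flags.append(1 if i < len(merged) and merged[i][0] < bp else 0)
    return flags


# Made-up positions and intervals, for illustration only.
print(annotate_positions([5, 10, 11, 25], [(0, 10), (20, 30)]))
```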

After the annotation file is made we can use it to calculate the LD Scores for this annotation, using the command below for chromosome 22. Remember that you have to repeat the same command for every chromosome in the reference panel. In this case the program recommends that you only print LD Scores for HapMap Phase 3 SNPs. This can be achieved with the HapMap SNP list file, which can be found here: https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/w_hm3.snplist.bz2. **You must remove the A1 and A2 columns from this file and keep only the SNP column before using the command below.**
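One way to strip the extra columns is sketched below. It assumes that the decompressed SNP list is whitespace-delimited with a header row (SNP, A1, A2), and it writes only the SNP identifiers, one per line, without the header. The function name and the output file name in the usage note are hypothetical.

```python
def write_snp_only(src_path, dst_path):
    """Copy the first (SNP) column of src_path to dst_path, skipping the header row."""
    count = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        next(src, None)  # skip the assumed "SNP A1 A2" header line
        for line in src:
            fields = line.split()
            if fields:  # ignore blank lines
                dst.write(fields[0] + "\n")
                count += 1
    return count
```

For example, `write_snp_only("w_hm3.snplist", "w_hm3.snps_only.txt")` would return the number of SNP identifiers written, and the new file could then be passed to `--print-snps`.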

For this command the difference is that we add the --annot flag which indicates the annotation file we are using and the --thin-annot flag which indicates that the annotation file does not contain any information about the SNPs (rs number, CHR, and BP) and only contains the binary scores for the annotation.

Make sure your annotation files have the same prefix as the LD Score files you will create. ldsc will not be able to read the annotation files during the regression if their prefix differs.

```
python2 ldsc.py \
    --bfile 1000G.EUR.QC.22 \
    --l2 \
    --ld-wind-cm 1 \
    --annot Adipose_Tissue.H3K27ac.22.annot.gz \
    --thin-annot \
    --out Adipose_Tissue.H3K27ac.22 \
    --print-snps w_hm3.snplist
```

```
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--print-snps /mnt/mfs/statgen/Anmol/training_files/testing/ldsc/AD_Variants/hapmap_snplist.txt \
--ld-wind-cm 1.0 \
--out Adipose_Tissue.H3K27ac.22 \
--bfile /mnt/mfs/statgen/Anmol/training_files/testing/ldsc/AD_Variants/1000G_EUR_filtered/1000G.22.final \
--thin-annot  \
--annot Adipose_Tissue.H3K27ac.22.annot.gz \
--l2  

Beginning analysis at Sun Jun 27 07:38:17 2021
Read list of 113121 SNPs from /mnt/mfs/statgen/Anmol/training_files/testing/ldsc/AD_Variants/1000G_EUR_filtered/1000G.22.final.bim
Read 1 annotations for 113121 SNPs from Adipose_Tissue.H3K27ac.22.annot.gz
Read list of 489 individuals from /mnt/mfs/statgen/Anmol/
```

This command outputs the same gzipped LD score file as the simple case but instead of just an LD Score column, it will have one LD Score column for each annotation that you are calculating LD Scores for.

```
import pandas as pd
results = pd.read_csv("Adipose_Tissue.H3K27ac.22.l2.ldscore.gz", sep="\t")
results.head()
```

|  | CHR | SNP | BP | L2 |
|---|---|---|---|---|
| 0 | 22 | rs7287144 | 16886873 | 6.932 |
| 1 | 22 | rs5748662 | 16892858 | 5.78 |
| 2 | 22 | rs4010554 | 16894264 | 7.341 |
| 3 | 22 | rs4010558 | 16896762 | 7.412 |
| 4 | 22 | rs2379981 | 17030792 | 19.616 |


Now that we have calculated the LD Scores for each chromosome for our annotation, we can use these LD Scores to conduct the Partitioned LD Score Regression for our annotation. In this case we have to make sure that our annotation files are in the same folder and have the same prefix name as our LD Score files. Now we can conduct the Regression for our annotation:

The new flag --frqfile-chr supplies the allele frequencies for the reference panel SNPs, since only SNPs with MAF > 0.05 are used in the analysis. The baseline annotation consists of all 1's, so it contains every SNP and acts as an intercept-like term for the annotations in the LD Score regression.

```
python2 ldsc.py \
    --h2 BMI.sumstats.gz \
    --ref-ld-chr base.,Adipose_Tissue.H3K27ac. \
    --w-ld-chr weights_hm3_no_hla/weights. \
    --overlap-annot \
    --frqfile-chr 1000G_frq/1000G.mac5eur. \
    --out Adipose_Tissue.H3K27ac
```

```
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--h2 BMI.sumstats.gz \
--ref-ld-chr baselineLD/base.,Adipose_Tissue.H3K27ac. \
--out Adipose_Tissue.H3K27ac \
--overlap-annot  \
--frqfile-chr 1000G_frq/1000G.mac5eur. \
--w-ld-chr weights_hm3_no_hla/weights. 

Beginning analysis at Thu Jan  6 00:56:01 2022
Reading summary statistics from BMI.sumstats.gz ...
Read summary statistics for 1040803 SNPs.
Reading reference panel LD Score from baselineLD/base.,Adipose_Tissue.H3K27ac.[1-22] ... (ldscore_fromlist)
Read reference panel LD Scores for 1168549 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from weights_hm3_no_hla/weights.[1-22] ..
```

Here the comma indicates that we are concatenating the baseline model with the new annotation.

The results are written to a .results file, which shows the proportion of heritability and the enrichment attributable to each category for the trait you are studying, in this case BMI.

The results file for this analysis looks like this, where L2_1 represents our Adipose Tissue Annotation and baseL2_0 describes the baseline annotation:

```
import pandas as pd
results = pd.read_csv("Adipose_Tissue.H3K27ac.results", sep="\t")
results.head()
```

|  | Category | Prop._SNPs | Prop._h2 | Prop._h2_std_error | Enrichment | Enrichment_std_error | Enrichment_p |
|---|---|---|---|---|---|---|---|
| 0 | baseL2_0 | 1.028145 | 0.978165 | 0.00083 | 0.951388 | 0.000807 | 1.793667e-37 |
| 1 | L2_1 | 0.600255 | 0.99393 | 0.063355 | 1.655848 | 0.105547 | 2.752202e-08 |
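In the results table above, the Enrichment column equals Prop._h2 divided by Prop._SNPs, up to rounding. The short sketch below recomputes it from the two rows shown; the function name `enrichment` is hypothetical.

```python
def enrichment(prop_h2, prop_snps):
    """Enrichment of a category: share of heritability divided by share of SNPs."""
    return prop_h2 / prop_snps


# (Category, Prop._SNPs, Prop._h2) copied from the results table above.
rows = [("baseL2_0", 1.028145, 0.978165),
        ("L2_1", 0.600255, 0.99393)]
for category, prop_snps, prop_h2 in rows:
    print(category, round(enrichment(prop_h2, prop_snps), 4))
```

An enrichment above 1 means that the category explains a larger share of heritability than its share of SNPs would suggest.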
