# Stratified LD Score Regression 
This notebook implements the pipeline of [S-LDSC](https://github.com/bulik/ldsc/wiki) for LD score and functional enrichment analysis. It was written by Anmol Singh (singh.anmol@columbia.edu), with input from Dr. Gao Wang.

**FIXME: the initial draft is complete but pending Gao's review and documentation with minimal working example**

The pipeline integrates GWAS summary statistics, annotation data, and LD reference panel data to compute functional enrichment for each user-provided epigenomic annotation using the S-LDSC model. We start with an introduction, setup instructions, and minimal working examples. The workflow code, which can be run with SoS on any data, is at the end.

## A brief review of Stratified LD score regression

Here I briefly review LD Score Regression and what it is used for. For more in-depth information on LD Score Regression please read the following three papers:

1. "LD Score regression distinguishes confounding from polygenicity in genome-wide association studies" by Bulik-Sullivan et al (2015)

2. "Partitioning heritability by functional annotation using genome-wide association summary statistics" by Finucane et al (2015)

3. "Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection" by Gazal et al (2017)

As stated in Bulik-Sullivan et al (2015), both confounding factors and polygenic effects can inflate test statistics, and other methods cannot distinguish inflation due to confounding bias from true polygenic signal. LD Score Regression (LDSC) is a technique that aims to separate the contributions of confounding factors and polygenic effects using only GWAS summary statistics.

This approach uses regression to measure the relationship between Linkage Disequilibrium (LD) scores and the test statistics of SNPs from the GWAS summary statistics. Variants in LD with a "causal" variant show an elevation in association test statistics proportional to their LD (measured by $r^2$) with the causal variant within a certain window size (for example 1 cM or 1 kb). In contrast, inflation from confounders such as population stratification that arises purely from genetic drift will not correlate with LD. For a polygenic trait, SNPs with a high LD score will on average have larger $\chi^2$ statistics than SNPs with a low LD score. Thus, if we regress the $\chi^2$ statistics from GWAS against LD Score, the intercept minus one is an estimator of the mean contribution of confounding bias to the inflation in the test statistics. This regression model is known as LD Score regression.

### LDSC model

Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$, where $p$ is the minor allele frequency (MAF), the expected $\chi^2$ statistic of variant $j$ is:

$$E[\chi^2|l_j] = Nh^2l_j/M + Na + 1 \quad (1)$$

where $N$ is the sample size; $M$ is the number of SNPs, such that $h^2/M$ is the average heritability explained per SNP; $a$ measures the contribution of confounding biases, such as cryptic relatedness and population stratification; and $l_j = \sum_k r^2_{jk}$ is the LD Score of variant $j$, which measures the amount of genetic variation tagged by $j$. A full derivation of this equation is provided in the Supplementary Note of Bulik-Sullivan et al (2015). An alternative derivation is provided in the Supplementary Note of Zhu and Stephens (2017) AoAS.

From this we can see that LD Score regression can be used to compute the SNP-based heritability of a phenotype or trait from GWAS summary statistics alone; unlike other methods such as REML, it does not require genotype information.
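To make Equation 1 concrete, the following is a minimal sketch of how $h^2$ and the intercept can be recovered from a regression of $\chi^2$ on LD Score. It is an illustration under stated assumptions, not the `ldsc` implementation: the data are simulated, the noise is Gaussian, the fit is plain unweighted least squares (`ldsc` itself uses a weighted regression), and the function name `fit_ldsc` and all numeric values are hypothetical.

```python
import numpy as np

def fit_ldsc(chi2, ld_scores, n_samples, n_snps):
    """Unweighted least-squares fit of Equation 1; returns (h2, intercept)."""
    # polyfit returns coefficients from highest degree down: [slope, intercept]
    slope, intercept = np.polyfit(ld_scores, chi2, deg=1)
    h2 = slope * n_snps / n_samples  # because slope = N * h2 / M
    return h2, intercept

# Hypothetical simulation settings (not taken from any real data set)
rng = np.random.default_rng(0)
N, M = 50_000, 1_000_000          # sample size, number of SNPs
true_h2, true_a = 0.5, 1e-6       # heritability, confounding term
ld = rng.uniform(1.0, 300.0, size=20_000)   # LD Scores of regression SNPs

# Expected chi-square under Equation 1, plus illustrative Gaussian noise
expected_chi2 = N * true_h2 * ld / M + N * true_a + 1.0
chi2 = expected_chi2 + rng.normal(0.0, 1.0, size=ld.size)

h2_hat, intercept_hat = fit_ldsc(chi2, ld, N, M)
confounding_hat = intercept_hat - 1.0   # estimate of N * a

print(f"estimated h2: {h2_hat:.3f}")
print(f"estimated intercept: {intercept_hat:.3f}")
```

The slope carries the polygenic signal and the intercept carries the confounding term, which is why the two sources of inflation can be told apart.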

### Stratified LDSC

Heritability is the proportion of phenotypic variation ($V_P$) that is due to variation in genetic values ($V_G$), and thus tells us how much of the difference in observed phenotypes in a sample is due to genetic differences in that sample. The analysis can also be extended to partition the heritability of a phenotype/trait across categories.

Partitioned heritability analysis, or Stratified LD Score Regression (S-LDSC), gains power by leveraging LD Score information and by using SNPs that have not reached genome-wide significance to partition the heritability of a trait over categories, which many other methods do not do.


S-LDSC relies on the fact that the $\chi^2$ association statistic for a given SNP includes the effects of all SNPs tagged by this SNP. In a region of high LD, a given GWAS SNP therefore represents the effects of a group of SNPs in that region.

S-LDSC determines that a category of SNPs is enriched for heritability if SNPs with high LD to that category have more significant $\chi^2$ statistics than SNPs with low LD to that category.

Here, enrichment of a category is defined as the proportion of SNP heritability in the category divided by the proportion of SNPs in that category.

More precisely, under a polygenic model, the expected $\chi^2$ statistic of SNP $j$ is

$$E[\chi^2_j] = N\sum_C \tau_C l(j,C) + Na + 1 \quad (2)$$

where $N$ is the sample size; $C$ indexes categories; $l(j, C) = \sum_{k \in C} r^2_{jk}$ is the LD score of SNP $j$ with respect to category $C$; $a$ is a term that measures the contribution of confounding biases; and, if the categories are disjoint, $\tau_C$ is the per-SNP heritability in category $C$. If the categories overlap, then the per-SNP heritability of SNP $j$ is $\sum_{C: j \in C} \tau_C$. Equation 2 allows us to estimate $\tau_C$ via a (computationally simple) multiple regression of $\chi^2$ against $l(j, C)$, for either a quantitative or case-control study.
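The multiple regression in Equation 2 and the enrichment definition can be sketched as follows. This is a simplified illustration, not the `ldsc` implementation: it assumes two overlapping categories (a hypothetical "base" category containing all SNPs and one hypothetical annotation), uses noiseless expected $\chi^2$ values so that the regression recovers the inputs exactly, and uses ordinary least squares. All numbers are made up.

```python
import numpy as np

# Hypothetical settings: base category holds all SNPs, annot is a subset
N = 50_000                                # GWAS sample size
M_total, M_annot = 1_000_000, 100_000     # SNPs in base / in annotation
tau_true = np.array([2e-7, 1e-6])         # tau_base, tau_annot
a_true = 1e-6                             # confounding term

# Simulated stratified LD scores l(j, C) for the regression SNPs
rng = np.random.default_rng(1)
n_reg = 5_000
l_base = rng.uniform(1.0, 300.0, size=n_reg)
l_annot = l_base * rng.uniform(0.0, 0.3, size=n_reg)
L = np.column_stack([l_base, l_annot])

# Expected chi-square from Equation 2 (noiseless, for illustration only)
chi2 = N * (L @ tau_true) + N * a_true + 1.0

# Multiple regression of chi-square on the LD scores plus an intercept column
X = np.column_stack([L, np.ones(n_reg)])
coef, *_ = np.linalg.lstsq(X, chi2, rcond=None)
tau_hat = coef[:2] / N
intercept_hat = coef[2]

# With overlapping categories, per-SNP heritability is the sum of tau_C
# over all categories that contain the SNP
h2_total = M_total * tau_hat[0] + M_annot * tau_hat[1]
h2_annot = M_annot * (tau_hat[0] + tau_hat[1])

# Enrichment = proportion of heritability / proportion of SNPs
enrichment = (h2_annot / h2_total) / (M_annot / M_total)
print(f"enrichment of annotation: {enrichment:.2f}")
```

In this toy setting the annotation holds 10% of the SNPs but 40% of the heritability, so its enrichment is 4.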

To see how these methods have been applied to real-world data, as well as further discussion of the methods and comparisons to other methods, please read the three papers listed at the top of the document.

## Command Interface

In [116]:
!sos run LDSC_Code.ipynb -h

usage: sos run LDSC.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  make_annot
  munge_sumstats_no_sign
  munge_sumstats_sign
  calc_ld_score
  calc_enrichment

Sections
  make_annot:
    Workflow Options:
      --bed VAL (as str, required)
                        path to bed file
      --bim VAL (as str, required)
                        path to bim file
      --annot VAL (as str, required)
                        name of output annotation file
  munge_sumstats_no_sign: This option is for when the summary statistic file
                        does not contain a signed summary statistic (Z or Beta).
                        In this case,the program will calculate Z for you based

## Make Annotation File

In [93]:

[make_annot]

# Make annotation file from a bed file

# path to bed file (UCSC BED file with the regions that make up the annotation)
parameter: bed = str 
#path to bim file
parameter: bim = str
#name of output annotation file
parameter: annot = str
bash: expand = True
    make_annot.py --bed-file {bed} --bimfile {bim} --annot-file {annot}
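As a rough illustration of what this step produces, the sketch below marks each SNP from a bim file with 1 if it falls inside any BED region and 0 otherwise. It is a simplified stand-in and not the code of `make_annot.py`: the function name `make_thin_annot` and the toy data are hypothetical, and it assumes the usual BED convention of 0-based, half-open intervals with `chr`-prefixed chromosome names, against 1-based SNP positions in the bim file.

```python
def make_thin_annot(bim_rows, bed_intervals):
    """Return one 0/1 annotation value per SNP, in bim order.

    bim_rows: list of (chromosome, 1-based base-pair position)
    bed_intervals: list of (chrom_name, start, end), 0-based half-open
    """
    annot = []
    for chrom, bp in bim_rows:
        # A 1-based position bp lies in [start, end) when start < bp <= end
        hit = any(
            name == f"chr{chrom}" and start < bp <= end
            for name, start, end in bed_intervals
        )
        annot.append(int(hit))
    return annot

# Toy example (made-up coordinates)
bim_rows = [(22, 100), (22, 150), (22, 200), (22, 201), (21, 120)]
bed_intervals = [("chr22", 99, 150), ("chr22", 200, 300)]

annot = make_thin_annot(bim_rows, bed_intervals)
print(annot)
```

The output has exactly one value per SNP and no SNP, CHR, or BP columns, which is the "thin" layout that the LD score step later in this notebook expects.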

## Munge Summary Statistics (Option 1: No Signed Summary Statistic)

In [None]:
#This option is for when the summary statistic file does not contain a signed summary statistic (Z or Beta).
#In this case, the program will calculate Z for you based on A1 being the risk allele
[munge_sumstats_no_sign]



#path to summary statistic file
parameter: sumst = str
#path to Hapmap3 SNPs file, keep all columns (SNP, A1, and A2) for the munge_sumstats program
parameter: alleles = "w_hm3.snplist"
#path to output file
parameter: output = str

bash: expand = True
    munge_sumstats.py --sumstats {sumst} --merge-alleles {alleles} --out {output} --a1-inc

## Munge Summary Statistics (Option 2: Signed Summary Statistic)

In [None]:
# This option is for when the summary statistic file does contain a signed summary statistic (Z or Beta)
[munge_sumstats_sign]



#path to summary statistic file
parameter: sumst = str
#path to Hapmap3 SNPs file, keep all columns (SNP, A1, and A2) for the munge_sumstats program
parameter: alleles = "w_hm3.snplist"
#path to output file
parameter: output = str

bash: expand = True
    munge_sumstats.py --sumstats {sumst} --merge-alleles {alleles} --out {output}
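The difference between the two options comes down to how a Z score is obtained from a p-value. The sketch below illustrates the idea using only the Python standard library; it is an assumption-laden illustration, not the code of `munge_sumstats.py`. The helper name `p_to_z` is hypothetical, and the p-value is assumed to be two-sided.

```python
from statistics import NormalDist

def p_to_z(p_value, sign=1.0):
    """Convert a two-sided p-value to a Z score.

    sign: any number whose sign gives the effect direction for allele A1
    (for example a Beta). Leave the default when A1 is assumed to be the
    trait-increasing allele, as in Option 1.
    """
    # Lower-tail quantile at p/2 is negative, so negate to get |Z|
    magnitude = -NormalDist().inv_cdf(p_value / 2.0)
    return magnitude if sign >= 0 else -magnitude

# Option 1: no signed statistic, so every Z is taken as positive
z_unsigned = p_to_z(0.05)

# Option 2: a signed statistic (here a hypothetical Beta of -0.12) sets the sign
z_signed = p_to_z(0.05, sign=-0.12)

print(round(z_unsigned, 3), round(z_signed, 3))
```

For an odds ratio, the direction would be taken relative to 1 rather than 0, for example by passing the log odds ratio as `sign`.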

## Calculate LD Scores

**Make sure to delete the SNP, CHR, and BP columns from the annotation files if they are present; otherwise this code will not work. If these columns are present, make sure the annotation file is sorted before deleting them.**

In [None]:
#Calculate LD Scores
#Make sure to delete the SNP, CHR, and BP columns from the annotation files if they are present; otherwise this code will not work.
#If these columns are present, make sure the annotation file is sorted before deleting them.
[calc_ld_score]

#Prefix of the PLINK fileset (.bed/.bim/.fam), passed to --bfile without a file extension
parameter: bim = str
#Path to annotation File. Make sure to remove the SNP, CHR, and BP columns from the annotation file if present before running.
parameter: annot_file = str
#name of output file
parameter: output = str
#path to Hapmap3 SNPs file, remove the A1 and A2 columns for the Calculate LD Scores program 
parameter: snplist = "w_hm3.snplist"

bash: expand = True
    ldsc.py --bfile {bim} --l2 --ld-wind-cm 1 --annot {annot_file} --thin-annot --out {output} --print-snps {snplist}
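For intuition about what an LD score is, the sketch below computes $l_j = \sum_k r^2_{jk}$ directly from a small genotype matrix, summing only over SNPs within a window of SNP $j$. It is a toy illustration, not the `ldsc` algorithm: the function name `ld_scores` and the data are hypothetical, the window is expressed in the same arbitrary units as the positions (the command above uses a 1 cM window), and it uses raw sample $r^2$, whereas `ldsc` additionally applies a small-sample bias adjustment.

```python
import numpy as np

def ld_scores(genotypes, positions, window):
    """LD score of each SNP: sum of r^2 with all SNPs within `window`.

    genotypes: (individuals x SNPs) array of allele counts
    positions: SNP coordinates, one per column of `genotypes`
    """
    # Squared Pearson correlation between every pair of SNP columns
    r2 = np.corrcoef(genotypes, rowvar=False) ** 2
    positions = np.asarray(positions)
    # Boolean mask of SNP pairs no further apart than the window
    in_window = np.abs(positions[:, None] - positions[None, :]) <= window
    return (r2 * in_window).sum(axis=1)

# Toy genotypes: SNP 0 and SNP 1 are identical, SNP 2 is uncorrelated
genotypes = np.array(
    [[0, 0, 1],
     [1, 1, 0],
     [2, 2, 1],
     [1, 1, 0]],
    dtype=float,
)

# All three SNPs within one window
wide = ld_scores(genotypes, [100, 200, 300], window=1000)
# SNP 1 moved far away, so it no longer contributes to SNP 0 (and vice versa)
narrow = ld_scores(genotypes, [100, 200_000, 300], window=1000)

print(wide, narrow)
```

Each SNP always counts itself ($r^2 = 1$), so perfectly correlated neighbours inside the window raise the score above 1 while distant or uncorrelated SNPs add nothing.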

## Calculate Functional Enrichment using Annotations

In [None]:
#Calculate Enrichment Scores for Functional Annotations
[calc_enrichment]

#Path to Summary statistics File
parameter: sumstats = str
#Path to Reference LD Scores Files (Base Annotation + Annotation you want to analyze, format like minimal working example)
parameter: ref_ld = str
#Path to LD Weight Files (Format like minimal working example)
parameter: w_ld = str
#path to frequency files (Format like minimal working example)
parameter: frq_file = str
#Output name
parameter: output = str

bash: expand = True
    ldsc.py --h2 {sumstats} --ref-ld-chr {ref_ld} --w-ld-chr {w_ld} --overlap-annot --frqfile-chr {frq_file} --out {output}