# Fine-mapping with PolyFun

## Aim

The purpose of this notebook is to demonstrate a functionally-informed fine-mapping workflow using the PolyFun method.

## Methods Overview 

## Input 

1) GWAS summary statistics including the following variables: 

- variant_id - variant ID 
- P - p-value 
- CHR - chromosome number 
- BP - base pair position
- A1 - The effect allele (i.e., the sign of the effect size is with respect to A1)
- A2 - the second allele 
- MAF - minor allele frequency 
- BETA - effect size 
- SE - effect size standard error

2) SNP-identifier file or S-LDSC (stratified LD-score regression) LD-score and annotation file

   SNP-identifier file should include the following columns: 

- CHR - chromosome
- BP - base pair position (in hg19 coordinates)
- A1 - The effect allele 
- A2 - the second allele

3) LD-score weights file
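
Before running the workflow, it can help to confirm that a summary statistics table carries the columns listed above. The following is a minimal sketch, not part of PolyFun: the helper name `missing_columns` and the toy table are hypothetical, and only the column names come from the list in this section.

```python
import pandas as pd

# Column names taken from the Input section of this notebook.
REQUIRED_SUMSTATS_COLUMNS = ["variant_id", "P", "CHR", "BP", "A1", "A2", "MAF", "BETA", "SE"]


def missing_columns(df, required=REQUIRED_SUMSTATS_COLUMNS):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Tiny in-memory table standing in for a real summary statistics file;
# the MAF column is left out on purpose.
example = pd.DataFrame({
    "variant_id": ["rs1", "rs2"],
    "P": [0.01, 0.5],
    "CHR": [1, 1],
    "BP": [1000, 2000],
    "A1": ["A", "G"],
    "A2": ["C", "T"],
    "BETA": [0.02, -0.01],
    "SE": [0.008, 0.015],
})

print(missing_columns(example))  # prints ['MAF']
```

For a real file, the table could be loaded with `pd.read_csv(path, sep="\t")` first, assuming it is tab-delimited.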


## Output

A `.gz` file containing input summary statistics columns and additionally the following columns:

- PIP - posterior causal probability
- BETA_MEAN - posterior mean of causal effect size (in standardized genotype scale)
- BETA_SD - posterior standard deviation of causal effect size (in standardized genotype scale)
- CREDIBLE_SET - the index of the first (typically smallest) credible set that the SNP belongs to (0 means none).
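
To illustrate how the `CREDIBLE_SET` column can be read, the sketch below groups variants by credible set and orders them by PIP. It assumes the output table also carries a `SNP` identifier column; the helper name `credible_set_members` and the toy values are made up for illustration.

```python
import pandas as pd


def credible_set_members(df):
    """Map each credible set index (> 0) to its SNPs, ordered by descending PIP."""
    # CREDIBLE_SET == 0 means the variant belongs to no credible set.
    in_set = df[df["CREDIBLE_SET"] > 0].sort_values("PIP", ascending=False)
    return {int(cs): group["SNP"].tolist() for cs, group in in_set.groupby("CREDIBLE_SET")}


# Toy fine-mapping output (values are illustrative, not real results).
toy = pd.DataFrame({
    "SNP": ["rsA", "rsB", "rsC", "rsD"],
    "PIP": [0.60, 0.35, 0.99, 0.02],
    "CREDIBLE_SET": [1, 1, 2, 0],
})

print(credible_set_members(toy))  # prints {1: ['rsA', 'rsB'], 2: ['rsC']}
```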


## Workflow

### Step 1: Compute Prior Causal Probabilities

#### Method 1: Use precomputed prior causal probabilities

Use precomputed prior causal probabilities of 19 million imputed UK Biobank SNPs with MAF>0.1%, based on a meta-analysis of 15 UK Biobank traits. 

In [None]:
[prior_causal_prob]
parameter: sumstats = AD_sumstats_Jansenetal_2019sept.txt.gz
parameter: container = none
bash: container = container 
    mkdir -p /output
    python extract_snpvar.py \
        --sumstats sumstats \
        --out /output/snps_with_var.gz \
        --allow-missing

#### Method 2: Compute via L2-regularized extension of S-LDSC (preferred)

Compute via an L2-regularized extension of stratified LD-score regression (S-LDSC). 

1) Create a munged summary statistics file in a PolyFun-friendly parquet format.

In [None]:
[munged_sumstats]
parameter: sumstats = AD_sumstats_Jansenetal_2019sept.txt.gz
parameter: sample_size = int
parameter: container = none
bash: container = container 
    mkdir -p /SLDSC_output
    python munge_polyfun_sumstats.py \
      --sumstats sumstats \
      --n sample_size \
      --out /SLDSC_output/sumstats_munged.parquet \
      --min-info 0 \
      --min-maf 0

2) Run PolyFun with L2-regularized S-LDSC

In [None]:
[L2_regu_SLDSC]
parameter: container = none
parameter: ref_ld = /baselineLF2.2.UKB/baselineLF2.2.UKB.
parameter: ref_wgt = /weights.UKB.l2.ldscore/weights.UKB.
bash: container=container
    python polyfun.py \
    --compute-h2-L2 \
    --no-partitions \
    --output-prefix /SLDSC_output/run \
    --sumstats /SLDSC_output/sumstats_munged.parquet \
    --ref-ld-chr ref_ld \
    --w-ld-chr ref_wgt \
    --allow-missing

#### Method 3: Compute Non-parametrically

1) Create a munged summary statistics file in a PolyFun-friendly parquet format.

In [None]:
[munged_sumstats2]
parameter: sumstats = AD_sumstats_Jansenetal_2019sept.txt.gz
parameter: sample_size = int
parameter: container = none
bash: container = container 
    mkdir -p /SLDSC_output
    python munge_polyfun_sumstats.py \
      --sumstats sumstats \
      --n sample_size \
      --out /SLDSC_output/sumstats_munged.parquet \
      --min-info 0 \
      --min-maf 0

2) Run PolyFun with L2-regularized S-LDSC

In [None]:
[L2_regu_SLDSC2]
parameter: container = none
parameter: ref_ld = example_data/annotations.
parameter: ref_wgt = example_data/weights.
bash: container=container
    python polyfun.py \
    --compute-h2-L2 \
    --output-prefix output/testrun \
    --sumstats example_data/sumstats.parquet \
    --ref-ld-chr ref_ld \
    --w-ld-chr ref_wgt

3) Compute LD-scores for each SNP bin

In [None]:
[ld_snpbin]
bash:
    python polyfun.py \
    --compute-ldscores \
    --output-prefix output/testrun \
    --bfile-chr example_data/reference. \
    --chr 1

4) Re-estimate per-SNP heritabilities via S-LDSC

In [None]:
[re_SLDSC]
bash:
    python polyfun.py \
    --compute-h2-bins \
    --output-prefix output/testrun \
    --sumstats example_data/sumstats.parquet \
    --w-ld-chr example_data/weights.

### Step 2: Create functional annotations 

#### Method 1: Use existing functional annotation files 

Use functional annotations for ~19 million UK Biobank imputed SNPs with MAF>0.1%, based on the baseline-LF 2.2.UKB annotations.

Download (30G): https://data.broadinstitute.org/alkesgroup/LDSCORE/baselineLF_v2.2.UKB.polyfun.tar.gz

#### Method 2: Create annotations 

To create your own annotations, for each chromosome, the following files are needed: 

1) A `.gz` or `.parquet` Annotations file containing the following columns:

- CHR - chromosome number
- BP - base pair position
- SNP - dbSNP reference number 
- A1 - The effect allele 
- A2 - the second allele
- Arbitrary additional columns representing annotations 

2) A `.l2.M` white-space delimited file containing a single line with the sums of the columns of each annotation

3) (Optional) A `.l2.M_5_50` file, which is the `.l2.M` file restricted to common SNPs (MAF between 5% and 50%)
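
As a rough illustration of item 2, the sketch below sums each annotation column of a toy annotations table and writes the sums as one whitespace-delimited line. The helper names (`annotation_sums`, `write_l2_M`), the toy annotation names, and the output file name are hypothetical. It also assumes that every column other than CHR, BP, SNP, A1 and A2 is an annotation.

```python
import os
import tempfile

import pandas as pd

# Non-annotation columns of the annotations file, as listed above.
META_COLUMNS = ["CHR", "BP", "SNP", "A1", "A2"]


def annotation_sums(df_annot):
    """Sum every annotation column (all columns except the metadata columns)."""
    annot_cols = [c for c in df_annot.columns if c not in META_COLUMNS]
    return df_annot[annot_cols].sum(axis=0)


def write_l2_M(df_annot, path):
    """Write the annotation sums to path as a single whitespace-delimited line."""
    sums = annotation_sums(df_annot)
    with open(path, "w") as f:
        f.write(" ".join(str(v) for v in sums.values) + "\n")


# Toy annotations table with two made-up binary annotations.
toy = pd.DataFrame({
    "CHR": [1, 1, 1],
    "BP": [100, 200, 300],
    "SNP": ["rs1", "rs2", "rs3"],
    "A1": ["A", "C", "G"],
    "A2": ["G", "T", "A"],
    "base": [1, 1, 1],
    "coding": [0, 1, 0],
})

# Write into a temporary directory so the example leaves the working directory untouched.
out_path = os.path.join(tempfile.mkdtemp(), "toy.1.l2.M")
write_l2_M(toy, out_path)
with open(out_path) as f:
    print(f.read())
```

A `.l2.M_5_50` file could be produced the same way after filtering the table to common SNPs, which would require allele frequencies from another source since the annotations file described here has no MAF column.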


### Step 3: Compute LD-scores for annotations 

#### Method 1: Compute with reference panel of sequenced individuals 

The reference panel should include at least 3,000 sequenced individuals from the target population.
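
One quick way to check the panel size is to count the lines of the PLINK `.fam` file that accompanies the reference panel, since a `.fam` file holds one line per individual. This is a minimal sketch; the helper names and the toy file are hypothetical.

```python
import os
import tempfile


def count_individuals(fam_path):
    """Count non-empty lines in a PLINK .fam file (one line per individual)."""
    with open(fam_path) as f:
        return sum(1 for line in f if line.strip())


def panel_is_large_enough(fam_path, min_individuals=3000):
    """Return True if the panel reaches the suggested minimum size."""
    return count_individuals(fam_path) >= min_individuals


# Toy .fam file with three individuals, written to a temporary directory.
fam_path = os.path.join(tempfile.mkdtemp(), "toy.fam")
with open(fam_path, "w") as f:
    f.write("F1 I1 0 0 1 -9\nF2 I2 0 0 2 -9\nF3 I3 0 0 1 -9\n")

print(count_individuals(fam_path))      # prints 3
print(panel_is_large_enough(fam_path))  # prints False
```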

In [None]:
[ld_score]
parameter: container = none
parameter: ref_ld = reference.1
parameter: annot_file = annotations.1.annot.parquet
bash: container = container
    mkdir -p output
    python compute_ldscores.py \
    --bfile ref_ld \
    --annot annot_file \
    --out output/ldscores1.parquet

#### Method 2: Compute with pre-computed UK Biobank LD matrices 

Matrices download: https://data.broadinstitute.org/alkesgroup/UKBB_LD

In [None]:
[ld_score_uk]
parameter: container = none
parameter: annot_file = annotations.1.annot.parquet
bash: container = container
    mkdir -p output
    python compute_ldscores_from_ld.py \
    --annot annot_file \
    --ukb \
    --out output/ldscores2.parquet

#### Method 3: Compute with own pre-computed LD matrices

Your own pre-computed LD matrices must be in `.bcor` format.

In [None]:
[ld_score_own]
parameter: container = none
parameter: annot_file = annotations.1.annot.parquet
parameter: sample_size = int
bash: container = container
    mkdir -p output
    python compute_ldscores_from_ld.py \
    --annot annot_file \
    --out output/ldscores3.parquet \
    --n sample_size \
    bcor_files/*.bcor

### Step 4: Functionally informed fine mapping with finemapper

The input summary statistics file must contain a `SNPVAR` column (per-SNP heritability) to perform functionally informed fine-mapping.
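
The sketch below shows one way to check for the `SNPVAR` column and to view per-SNP heritabilities as relative prior weights within a locus. It is an illustration only: the helper name `locus_prior_weights` is hypothetical, and both the normalization (weights proportional to `SNPVAR`, summing to 1 within the locus) and the half-open interval convention are assumptions rather than a description of what `finemapper.py` does internally.

```python
import pandas as pd


def locus_prior_weights(df, chrom, start, end):
    """Return SNPVAR rescaled to sum to 1 over SNPs with start <= BP < end on chrom."""
    if "SNPVAR" not in df.columns:
        raise ValueError("summary statistics have no SNPVAR column")
    # Assumption: the locus is treated as a half-open interval [start, end).
    locus = df[(df["CHR"] == chrom) & (df["BP"] >= start) & (df["BP"] < end)]
    total = locus["SNPVAR"].sum()
    if total <= 0:
        raise ValueError("no positive SNPVAR values in the requested locus")
    return locus["SNPVAR"] / total


# Toy table (illustrative values); the third SNP lies outside the locus.
toy = pd.DataFrame({
    "CHR": [1, 1, 1],
    "BP": [46500000, 47000000, 50000000],
    "SNPVAR": [1e-6, 3e-6, 5e-6],
})

weights = locus_prior_weights(toy, chrom=1, start=46000001, end=49000001)
print(weights)
```

In this toy example the second SNP receives three times the weight of the first, mirroring the ratio of their `SNPVAR` values.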

In [None]:
[fine_mapping]
parameter: genotype_file = example_data/chr1
parameter: sumstat = example_data/chr1.finemap_sumstats.txt.gz
parameter: sample_size = 383290
parameter: chr = 1
parameter: start = 46000001
parameter: end = 49000001
parameter: output_path = output/finemap.1.46000001.49000001.gz
bash: 
    mkdir -p LD_cache
    mkdir -p output

    python finemapper.py \
    --geno genotype_file \
    --sumstats sumstat \
    --n sample_size \
    --chr chr \
    --start start \
    --end end \
    --method susie \
    --max-num-causal 5 \
    --cache-dir LD_cache \
    --out output_path

## Minimal Working Example

In [None]:
bash: 
    python extract_snpvar.py \
        --sumstats AD_sumstats_Jansenetal_2019sept.txt.gz \
        --out output/AD_snps_with_var.gz \
        --allow-missing

In [None]:
bash: 
    python finemapper.py \
    --geno UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720_chr1 \
    --sumstats output/AD_snps_with_var.gz \
    --n 410905 \
    --chr 1 \
    --start 46000001 \
    --end 49000001 \
    --method susie \
    --max-num-causal 5 \
    --allow-missing \
    --cache-dir LD_cache \
    --out output/finemap.1.46000001.49000001.gz

### Summary

In [23]:
bash:
    gzcat output/finemap.1.46000001.49000001.gz | head

CHR	SNP	BP	A1	A2	SNPVAR	Z	N	P	PIP	BETA_MEAN	BETA_SD	DISTANCE_FROM_CENTER	CREDIBLE_SET
1	rs2088102	46032974	T	C	1.70060e-06	1.25500e+01	383290	3.97510e-36	1.00000e+00	-2.03917e-02	1.61901e-03	1456799	1
1	rs7528714	47966058	G	A	1.18040e-06	5.14320e+00	383290	2.70098e-07	9.97870e-01	7.42146e-03	1.62305e-03	476285	2
1	rs7528075	47870271	G	A	1.18040e-06	4.40160e+00	383290	1.07456e-05	9.76545e-01	-5.98945e-03	1.81667e-03	380498	3
1	rs212968	48734666	G	A	1.70060e-06	-3.01130e+00	383290	2.60132e-03	3.75823e-01	-1.56305e-03	2.23942e-03	1244893	0
1	rs2622911	47837404	C	A	1.70060e-06	3.12520e+00	383290	1.77684e-03	3.71804e-01	1.54312e-03	2.22812e-03	347631	0
1	rs4511165	48293181	G	A	1.70060e-06	-1.18940e+00	383290	2.34282e-01	5.75970e-02	1.60630e-04	7.52226e-04	803408	0
1	rs3766196	47284526	C	A	6.93040e-06	-5.92360e-02	383290	9.52764e-01	4.89776e-02	-5.06776e-07	3.48039e-04	205247	0
1	rs12567716	48197570	T	C	1.18040e-06	2.14810e+00	383290	3.17058e-02	4.45128e-02	-1.28281e-04	6.81457e-04	707797	0


In [25]:
import numpy as np
import pandas as pd

data = pd.read_csv('output/finemap.1.46000001.49000001.gz', sep="\t")

data.head(5)
    
# Variants assigned to a credible set (CREDIBLE_SET == 0 means none)
in_cs = data['CREDIBLE_SET'] > 0
num_var_cs = int(in_cs.sum())
# Count distinct non-zero credible set indices, so the result is correct even if no 0 is present
total_cs = data.loc[in_cs, 'CREDIBLE_SET'].nunique()
# Guard against division by zero when no credible set was found
avg_var_cs = num_var_cs / total_cs if total_cs else float('nan')
pip50 = int((data['PIP'] > 0.5).sum())
pip95 = int((data['PIP'] > 0.95).sum())

result = (
    f"Number of variants with PIP > 0.5: {pip50}\n"
    f"Number of variants with PIP > 0.95: {pip95}\n"
    f"Number of variants that have credible sets: {num_var_cs}\n"
    f"Number of unique credible sets: {total_cs}\n"
    f"Average number of variants per credible set: {avg_var_cs}"
)

# Overwrite rather than append so that re-running the cell does not duplicate the summary
with open('results.txt', 'w') as the_file:
    the_file.write(result)

with open('results.txt') as f:
    contents = f.readlines()
    print(contents)

['Number of variants with PIP > 0.5: 3\n', 'Number of variants with PIP > 0.95: 3\n', 'Number of variants that have credible sets: 3\n', 'Number of unique credible sets: 3\n', 'Average number of variants per credible set: 1.0']
