# Fine-mapping with PolyFun

## Aim

The purpose of this notebook is to demonstrate a functionally-informed fine-mapping workflow using the PolyFun method.

## Methods Overview 

## Input 

1) GWAS summary statistics including the following variables: 

- variant_id - variant ID 
- P - p-value 
- CHR - chromosome number 
- BP - base pair position
- A1 - The effect allele (i.e., the sign of the effect size is with respect to A1)
- A2 - the second allele 
- MAF - minor allele frequency 
- BETA - effect size 
- SE - effect size standard error

2) SNP-identifier file or S-LDSC (stratified LD-score regression) LD-score and annotation file

   SNP-identifier file should include the following columns: 

- CHR - chromosome
- BP - base pair position (in hg19 coordinates)
- A1 - The effect allele 
- A2 - the second allele

3) LD-score weights file 
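
Before running any step, it can help to confirm that the input tables carry the columns listed above. The following is a minimal sketch, not part of PolyFun: the helper name `missing_columns` and the toy table are assumptions made for illustration, and the values are invented.

```python
import pandas as pd

# Column names copied from the input lists above.
SUMSTATS_COLUMNS = ["variant_id", "P", "CHR", "BP", "A1", "A2", "MAF", "BETA", "SE"]
SNP_ID_COLUMNS = ["CHR", "BP", "A1", "A2"]

def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Toy table standing in for a real summary statistics file (made-up values).
toy = pd.DataFrame({
    "variant_id": ["rs1", "rs2"],
    "P": [0.01, 0.5],
    "CHR": [1, 1],
    "BP": [1000, 2000],
    "A1": ["A", "C"],
    "A2": ["G", "T"],
    "BETA": [0.1, -0.02],
    "SE": [0.04, 0.03],
})

# The toy table deliberately omits MAF, so the first check reports it.
print(missing_columns(toy, SUMSTATS_COLUMNS))
print(missing_columns(toy, SNP_ID_COLUMNS))
```

For a real file, the toy table would be replaced by something like `pd.read_csv(path, sep="\t")`, with the separator depending on how the summary statistics were written.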


## Output

A `.gz` file containing input summary statistics columns and additionally the following columns:

- PIP - posterior causal probability
- BETA_MEAN - posterior mean of causal effect size (in standardized genotype scale)
- BETA_SD - posterior standard deviation of causal effect size (in standardized genotype scale)
- CREDIBLE_SET - the index of the first (typically smallest) credible set that the SNP belongs to (0 means none).
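
To make the meaning of the output columns concrete, the sketch below groups SNPs by `CREDIBLE_SET` and reports the size and largest `PIP` of each set. The function name `credible_set_table` and the toy values are assumptions for illustration only; real output would be read from the `.gz` file described above.

```python
import pandas as pd

def credible_set_table(df):
    """One row per credible set: number of SNPs and the largest PIP in the set."""
    # CREDIBLE_SET == 0 means the SNP belongs to no credible set.
    in_cs = df[df["CREDIBLE_SET"] > 0]
    return (
        in_cs.groupby("CREDIBLE_SET")["PIP"]
        .agg(n_snps="count", max_pip="max")
        .reset_index()
    )

# Toy fine-mapping output (made-up values).
toy_output = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3", "rs4"],
    "PIP": [0.6, 0.3, 0.01, 0.99],
    "CREDIBLE_SET": [1, 1, 0, 2],
})

table = credible_set_table(toy_output)
print(table)
```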


## Workflow

### Step 1: Compute Prior Causal Probabilities

#### Method 1: Use precomputed prior causal probabilities

Use precomputed prior causal probabilities of 19 million imputed UK Biobank SNPs with MAF>0.1%, based on a meta-analysis of 15 UK Biobank traits. 

In [None]:
[prior_causal_prob]
parameter: sumstats = AD_sumstats_Jansenetal_2019sept.txt.gz
parameter: container = none
bash: container = container 
    mkdir -p /output
    python extract_snpvar.py \
        --sumstats sumstats \
        --out /output/snps_with_var.gz \
        --allow-missing

#### Method 2: Compute via L2-regularized extension of S-LDSC (preferred)

Compute the priors via an L2-regularized extension of stratified LD-score regression (S-LDSC). The procedure for both methods is shown in this workflow. 

In [None]:
[munged_sumstats]
parameter: sumstats = AD_sumstats_Jansenetal_2019sept.txt.gz
parameter: sample_size = int
parameter: container = none
bash: container = container 
    mkdir -p /SLDSC_output
    python munge_polyfun_sumstats.py \
      --sumstats sumstats \
      --n sample_size \
      --out /SLDSC_output/sumstats_munged.parquet \
      --min-info 0 \
      --min-maf 0

### Step 2: Create functional annotations 

#### Method 1: Use existing functional annotation files 

Use functional annotations for ~19 million UK Biobank imputed SNPs with MAF>0.1%, based on the baseline-LF 2.2.UKB annotations

Download (30G): https://data.broadinstitute.org/alkesgroup/LDSCORE/baselineLF_v2.2.UKB.polyfun.tar.gz

#### Method 2: Create annotations 

To create your own annotations, for each chromosome, the following files are needed: 

1) A `.gz` or `.parquet` Annotations file containing the following columns:

- CHR - chromosome number
- BP - base pair position
- SNP - dbSNP reference number 
- A1 - The effect allele 
- A2 - the second allele
- Arbitrary additional columns representing annotations 

2) A `.l2.M` white-space delimited file containing a single line with the sums of the columns of each annotation

3) (Optional) A `.l2.M_5_50` file, which is the `.l2.M` file restricted to common SNPs (MAF between 5% and 50%) 
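
The `.l2.M` and `.l2.M_5_50` contents described above can be derived from the annotations table itself. The following is a minimal sketch under stated assumptions: the helper names, the toy annotation columns (`base`, `coding`), and the separate `maf` series are all hypothetical, since the annotations file described above does not carry a MAF column.

```python
import pandas as pd

# Identifier columns from the annotations file; everything else is an annotation.
ID_COLUMNS = ["CHR", "BP", "SNP", "A1", "A2"]

def annotation_sums(annot, snp_mask=None):
    """Column sums of every annotation column, optionally over a subset of SNPs."""
    annot_cols = [c for c in annot.columns if c not in ID_COLUMNS]
    rows = annot if snp_mask is None else annot[snp_mask]
    return rows[annot_cols].sum()

def format_m_line(sums):
    """Single white-space delimited line holding one sum per annotation."""
    return " ".join(str(v) for v in sums.tolist()) + "\n"

# Toy annotations for three SNPs on chromosome 1 (made-up values).
annot = pd.DataFrame({
    "CHR": [1, 1, 1],
    "BP": [100, 200, 300],
    "SNP": ["rs1", "rs2", "rs3"],
    "A1": ["A", "C", "G"],
    "A2": ["G", "T", "A"],
    "base": [1, 1, 1],
    "coding": [1, 0, 1],
})

# Hypothetical MAF values, aligned row by row with the annotations table.
maf = pd.Series([0.20, 0.01, 0.40])
common = maf.between(0.05, 0.50)  # common SNPs: MAF between 5% and 50%

print(format_m_line(annotation_sums(annot)), end="")          # content for .l2.M
print(format_m_line(annotation_sums(annot, common)), end="")  # content for .l2.M_5_50
```

Each returned line can then be written to the per-chromosome file with an ordinary `open(path, "w")` call.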


### Step 3: Compute LD-scores for annotations 

#### Method 1: Compute with reference panel of sequenced individuals 

Reference panel should have at least 3000 sequenced individuals from target population.

In [None]:
[ld_score]
parameter: container = none
parameter: ref_ld = reference.1
parameter: annot_file = annotations.1.annot.parquet
bash: container = container
    mkdir -p output
    python compute_ldscores.py \
    --bfile ref_ld \
    --annot annot_file \
    --out output/ldscores1.parquet

#### Method 2: Compute with pre-computed UK Biobank LD matrices 

Matrices download: https://data.broadinstitute.org/alkesgroup/UKBB_LD

In [None]:
[ld_score_uk]
parameter: container = none
parameter: annot_file = annotations.1.annot.parquet
bash: container = container
    mkdir -p output
    python compute_ldscores_from_ld.py \
    --annot annot_file \
    --ukb \
    --out output/ldscores2.parquet

#### Method 3: Compute with own pre-computed LD matrices

Own pre-computed LD matrices should be in `.bcor` format. 

In [None]:
[ld_score_own]
parameter: container = none
parameter: annot_file = annotations.1.annot.parquet
parameter: sample_size = int
bash: container = container
    mkdir -p output
    python compute_ldscores_from_ld.py \
    --annot annot_file \
    --out output/ldscores3.parquet \
    --n sample_size \
    bcor_files/*.bcor

### Step 4: Run PolyFun with L2-regularized S-LDSC

If prior causal probabilities have not been computed, skip `polyfun.py` and use `finemapper.py` directly to perform non-functionally-informed fine-mapping. 

In [None]:
[L2_regu_SLDSC]
parameter: container = none
parameter: ref_ld = /baselineLF2.2.UKB/baselineLF2.2.UKB.
parameter: ref_wgt = /weights.UKB.l2.ldscore/weights.UKB.
bash: container=container
    python polyfun.py \
    --compute-h2-L2 \
    --no-partitions \
    --output-prefix /SLDSC_output/run \
    --sumstats /SLDSC_output/sumstats_munged.parquet \
    --ref-ld-chr ref_ld \
    --w-ld-chr ref_wgt \
    --allow-missing

### Step 5: Functionally informed fine mapping with finemapper

Input summary statistics file must have `SNPVAR` column (per-SNP heritability) to perform functionally-informed fine-mapping.
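
A quick pre-flight check can catch a missing `SNPVAR` column before fine-mapping starts. The sketch below is illustrative only: the helper name `has_snpvar` and the toy tables are assumptions, not part of PolyFun.

```python
import pandas as pd

def has_snpvar(sumstats):
    """True when a SNPVAR column exists and has no missing values."""
    return "SNPVAR" in sumstats.columns and bool(sumstats["SNPVAR"].notna().all())

# Toy tables (made-up values): one with per-SNP heritability, one without.
with_prior = pd.DataFrame({"SNP": ["rs1", "rs2"], "SNPVAR": [1e-6, 3e-7]})
without_prior = pd.DataFrame({"SNP": ["rs1", "rs2"]})

print(has_snpvar(with_prior))
print(has_snpvar(without_prior))
```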

In [None]:
[fine_mapping]
parameter: genotype_file = example_data/chr1
parameter: sumstat = example_data/chr1.finemap_sumstats.txt.gz
parameter: sample_size = 383290
parameter: chr = 1
parameter: start = 46000001
parameter: end = 49000001
parameter: output_path = output/finemap.1.46000001.49000001.gz
bash: 
    mkdir -p LD_cache
    mkdir -p output

    python finemapper.py \
    --geno genotype_file \
    --sumstats sumstat \
    --n sample_size \
    --chr chr \
    --start start \
    --end end \
    --method susie \
    --max-num-causal 5 \
    --cache-dir LD_cache \
    --out output_path

## Minimal Working Example

In [None]:
module load Singularity

In [2]:
[munged_sumstats]
parameter: sumstats = example_data/boltlmm_sumstats.gz
parameter: sample_size = 327209
parameter: output_path = example_data/sumstats_munged.parquet
bash: container= none
    python munge_polyfun_sumstats.py \
    --sumstats sumstats \
    --n sample_size \
    --out output_path \
    --min-info 0.6 \
    --min-maf 0.001

In [None]:
[ld_score]
parameter: container = none
parameter: ref_ld = reference.1
parameter: annot_file = annotations.1.annot.parquet
bash: container = container
    mkdir -p output
    python compute_ldscores.py \
    --bfile ref_ld \
    --annot annot_file \
    --out output/ldscores1.parquet

In [None]:
[L2_regu_SLDSC]
parameter: output_path = output/testrun
parameter: sumstats = example_data/sumstats.parquet
parameter: ref_ld = example_data/annotations.
parameter: ref_wgt = example_data/weights.
bash: container=none
    mkdir -p output
    python polyfun.py \
    --compute-h2-L2 \
    --no-partitions \
    --output-prefix output_path \
    --sumstats sumstats \
    --ref-ld-chr ref_ld \
    --w-ld-chr ref_wgt

In [None]:
[fine_mapping]
parameter: genotype_file = example_data/chr1
parameter: sumstats = example_data/chr1.finemap_sumstats.txt.gz
parameter: sample_size = 383290
parameter: chr = 1
parameter: start = 46000001
parameter: end = 49000001
parameter: output_path = output/finemap.1.46000001.49000001.gz
bash: 
    mkdir -p LD_cache
    mkdir -p output

    python finemapper.py \
    --geno genotype_file \
    --sumstats sumstats \
    --n sample_size \
    --chr chr \
    --start start \
    --end end \
    --method susie \
    --max-num-causal 5 \
    --cache-dir LD_cache \
    --out output_path

### Summary

In [10]:
import numpy as np
import pandas as pd

data = pd.read_csv('finemap.1.46000001.49000001', sep="\t")

data.head(5)
    

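# CREDIBLE_SET == 0 means the SNP is in no credible set
num_var_cs = np.count_nonzero(data['CREDIBLE_SET'])
# count distinct non-zero set indices (correct even if no SNP has CREDIBLE_SET == 0)
total_cs = data.loc[data['CREDIBLE_SET'] > 0, 'CREDIBLE_SET'].nunique()
avg_var_cs = float(num_var_cs) / total_cs if total_cs else float('nan')
pip50 = int((data['PIP'] > 0.5).sum())
pip95 = int((data['PIP'] > 0.95).sum())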

result = "Number of variants with PIP > 0.5: " + str(pip50) + "\n" + "Number of variants with PIP > 0.95: " + str(pip95) + "\n" \
    + "Number of variants that have credible sets: " + str(num_var_cs) + "\n" \
    + "Number of unique credible sets: " + str(total_cs) + "\n" \
    + "Average number of variants per credible set: " + str(avg_var_cs) 


with open('results.txt', 'a') as the_file:
    the_file.write(result)

with open('results.txt') as f:
    contents = f.readlines()
    print(contents)

['Number of variants with PIP > 0.5: 3\n', 'Number of variants with PIP > 0.95: 3\n', 'Number of variants that have credible sets: 3\n', 'Number of unique credible sets: 3\n', 'Average number of variants per credible set: 1.0']


In [None]:
import os.path
import glob

# get the location of finemapping result files
file_with_annot_location = os.path.join('/mnt', 'mfs', 'statgen','tl3030','AD_2021_output','with_annot', 'finemap.*.gz')
print(file_with_annot_location)
filenames_with_annot = glob.glob(file_with_annot_location)

frames = []

for f in filenames_with_annot:
    # read the data
    outfile = pd.read_csv(f, delimiter = "\t")
    
    # keep SNPs with PIP >= 0.95
    frames.append(outfile[outfile['PIP']>=0.95])

# DataFrame.append was removed in pandas 2.0; concatenate once instead
snp_with_annot = pd.concat(frames) if frames else pd.DataFrame()
    
    
# remove duplicated SNPs
snp_with_annot_uniq = snp_with_annot.drop_duplicates(subset='SNP', keep='first')

frames = []

for f in filenames_with_annot:
    # read the data
    outfile = pd.read_csv(f, delimiter = "\t")
    
    # keep SNPs that belong to a credible set
    frames.append(outfile[outfile['CREDIBLE_SET']>0])

# DataFrame.append was removed in pandas 2.0; concatenate once instead
CS_with_annot = pd.concat(frames) if frames else pd.DataFrame()
    
    
# remove duplicated SNPs
CS_with_annot_uniq = CS_with_annot.drop_duplicates(subset='SNP', keep='first')

pd.options.mode.chained_assignment = None

# Read in the range file
region_range = pd.read_csv("/mnt/mfs/statgen/tl3030/range.csv").dropna()
#chr1_160990767_161203192 = pd.read_csv("/mnt/mfs/statgen/tl3030/finemapping_result_97gene/finemap.1.160990767.161203192.gz", delimiter = "\t")
#print(chr1_160990767_161203192.head())

bpcol = CS_with_annot_uniq[['CHR', 'BP']]
#print(bpcol.head())

# Assign each SNP to the gene region it belongs to
j = 0
for i, bp in bpcol.iterrows():
    #print(i, bp['CHR'])
    for k,row in region_range.iterrows():
        if (bp['CHR'] == row['Chr']) and (bp['BP'] > row['start']) and (bp['BP'] < row['end']):
            #print(row['Chr'],row['Gene Name'])
            #print(i, bp['CHR'], row['Chr'], row['Gene Name'])
            CS_with_annot_uniq.iloc[j,15] = row['Gene Name']
            #pass
            #CS_with_annot_uniq.loc[j,'GENE']= row['Gene Name']
    j += 1

    
CS_with_annot_uniq = CS_with_annot_uniq.sort_values(by=['CHR']) # sort by chromosome before writing
CS_with_annot_uniq.to_csv('/mnt/mfs/statgen/tl3030/AD_2021_output/variants_with_CS_2021sumstat_97genes_with_annot.txt', index=False, sep='\t', mode='w')
num_of_CS_with_annot = CS_with_annot_uniq.drop_duplicates(subset=['CREDIBLE_SET', 'GENE'], keep = 'last').reset_index(drop = True)
num_of_CS_with_annot.sort_values(by=['GENE'])
num_of_gene_with_annot = CS_with_annot_uniq.drop_duplicates(subset=['GENE'], keep = 'last').reset_index(drop = True)
check_frequency_with_annot = CS_with_annot_uniq.groupby(["CREDIBLE_SET", "GENE"]).size().reset_index(name="Time")
CS_with_1_variant_with_annot = check_frequency_with_annot[check_frequency_with_annot['Time'] == 1]
gene_with_annot = CS_with_annot_uniq['GENE']
gene_list_with_annot = gene_with_annot.drop_duplicates()


print(num_of_CS_with_annot.shape[0])
print(num_of_gene_with_annot.shape[0])
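
The region-assignment loop above writes into a fixed column position (`iloc[j,15]`), which depends on the exact column layout of the fine-mapping output. As a hedged alternative, the same rule can be sketched with label-based indexing. The function name `assign_genes` and the toy tables are assumptions for illustration; the column names `Chr`, `start`, `end`, and `Gene Name` follow the range file used above.

```python
import pandas as pd

def assign_genes(snps, regions):
    """Return a copy of snps with a GENE column filled from the matching region.

    A SNP matches a region when the chromosome agrees and
    start < BP < end (strict bounds, as in the loop above).
    If regions overlap, the later row in `regions` wins.
    """
    out = snps.copy()
    out["GENE"] = None
    for _, region in regions.iterrows():
        inside = (
            (out["CHR"] == region["Chr"])
            & (out["BP"] > region["start"])
            & (out["BP"] < region["end"])
        )
        out.loc[inside, "GENE"] = region["Gene Name"]
    return out

# Toy inputs (made-up values and gene names).
toy_snps = pd.DataFrame({"CHR": [1, 1, 2], "BP": [150, 500, 150]})
toy_regions = pd.DataFrame({
    "Chr": [1, 2],
    "start": [100, 100],
    "end": [200, 200],
    "Gene Name": ["GENE_A", "GENE_B"],
})

assigned = assign_genes(toy_snps, toy_regions)
print(assigned)
```

Working on a copy also avoids the chained-assignment warning that the cell above silences.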

### Summary of Fine-mapping Result Without Functional Annotations

In [None]:
import os.path
# get the location of finemapping result files
file_without_annot_location = os.path.join('/mnt', 'mfs', 'statgen','tl3030','AD_2021_output','without_annot', 'finemap.*.gz')

import glob
# get a list of result file name
filenames_without_annot = glob.glob(file_without_annot_location)

frames = []

for f in filenames_without_annot:
    # read the data
    outfile = pd.read_csv(f, delimiter = "\t")
    
    # keep SNPs with PIP >= 0.95
    frames.append(outfile[outfile['PIP']>=0.95])

# DataFrame.append was removed in pandas 2.0; concatenate once instead
snp_without_annot = pd.concat(frames) if frames else pd.DataFrame()


# remove duplicated SNPs
snp_without_annot_uniq = snp_without_annot.drop_duplicates(subset='SNP', keep='first')

frames = []

for f in filenames_without_annot:
    # read the data
    outfile = pd.read_csv(f, delimiter = "\t")
    
    # keep SNPs that belong to a credible set
    frames.append(outfile[outfile['CREDIBLE_SET']>0])

# DataFrame.append was removed in pandas 2.0; concatenate once instead
CS_without_annot = pd.concat(frames) if frames else pd.DataFrame()

# remove duplicated SNPs
CS_without_annot_uniq = CS_without_annot.drop_duplicates(subset='SNP', keep='first')


# Read in the range file
region_range = pd.read_csv("/mnt/mfs/statgen/tl3030/range.csv").dropna()

bpcol = CS_without_annot_uniq[['CHR', 'BP']]

# Assign each SNP to the gene region it belongs to
j = 0
for i, bp in bpcol.iterrows():
    #print(i, bp['CHR'])
    for k,row in region_range.iterrows():
        if (bp['CHR'] == row['Chr']) and (bp['BP'] > row['start']) and (bp['BP'] < row['end']):
            CS_without_annot_uniq.iloc[j,15] = row['Gene Name']
            #CS_without_annot_uniq.loc[j,'GENE']= row['Gene Name']
    j += 1


num_of_CS_without_annot = CS_without_annot_uniq.drop_duplicates(subset=['CREDIBLE_SET', 'GENE'], keep = 'last').reset_index(drop = True)
num_of_CS_without_annot.sort_values(by=['GENE'])
num_of_gene_without_annot = CS_without_annot_uniq.drop_duplicates(subset=['GENE'], keep = 'last').reset_index(drop = True)
check_frequency_without_annot = CS_without_annot_uniq.groupby(["CREDIBLE_SET", "GENE"]).size().reset_index(name="Time")
CS_with_1_variant_without_annot = check_frequency_without_annot[check_frequency_without_annot['Time'] == 1]
gene_without_annot = CS_without_annot_uniq['GENE']
gene_list_without_annot = gene_without_annot.drop_duplicates()