# Extracting Credible Sets

The [Chiou et al., 2021 paper](nature.com/articles/s41588-021-00823-0) applied fine-mapping to a large cohort of GWAS data to identify 81 unique loci associated with T1D, along with ~60 additional independent signals. I originally tried the Ay lab fine-mapping pipeline, but it yielded only 10 regions (I did not directly compare the two sets of results). Using the published fine-mapped results, we should be able to get a better list of SGLs. In this notebook I extract the credible sets for downstream use; the data come from Supplementary Table 1 of the publication.

In [1]:
import os
import pandas as pd 

# # add tabix to path variable 
# os.environ["PATH"] += ":/mnt/bioadhoc-temp/Groups/vd-ay/jreyna/software/mamba/envs/hichip-db/bin/"

# change the working directory
os.chdir('/mnt/BioHome/jreyna/jreyna-temp/projects/t1d-loop-catalog/')

In [2]:
outdir = 'results/hg38/external_studies/chiou_2021/processing/finemapping/'
os.makedirs(outdir, exist_ok=True)

## Load the variant data

In [4]:
# load the data
cs_fn = 'results/hg38/external_studies/chiou_2021/Supplemental1.credible_sets.41586_2021_3552_MOESM4_ESM.xlsx'
cs_vars = pd.read_excel(cs_fn, skiprows=2)
cs_vars.columns = ['rsid', 'hg38_id', 'hg19_id', 'chrom', 'hg38_pos', 'hg38_signal_name', 'ppa']

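# parse the hg19 position and the major/minor alleles from the hg19 ID
# (the unpacking below requires exactly four ':'-separated fields per ID)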
_, pos, nmajor, nminor = zip(*cs_vars['hg19_id'].str.split(':'))
cs_vars['hg19_pos'] = pos
cs_vars['major_allele'] = nmajor
cs_vars['minor_allele'] = nminor

# add the hg19 signal name by swapping the hg19 position into the hg38 signal name
def create_hg19_id(sr):
    # the position is the second ':'-separated field of the signal name
    info = sr['hg38_signal_name'].split(':')
    info[1] = sr['hg19_pos']
    return ':'.join(info)
cs_vars['hg19_signal_name'] = cs_vars.apply(create_hg19_id, axis=1)

# add 0-based start positions for BED format (start = 1-based position - 1)
cs_vars['hg19_start'] = cs_vars['hg19_pos'].astype(int) - 1
cs_vars['hg38_start'] = cs_vars['hg38_pos'].astype(int) - 1

# rename pos to end, since the 1-based position doubles as the BED end coordinate
cs_vars.rename(columns={'hg19_pos': 'hg19_end', 'hg38_pos': 'hg38_end'}, inplace=True)
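
The cell above parses variant IDs and converts 1-based positions into BED-style intervals. As a minimal sketch of the same idea on made-up data, assuming IDs follow a `chrom:pos:major:minor` layout (inferred from the four-way unpacking above, not confirmed against the supplementary table), `str.split(..., expand=True)` can do the parsing in one vectorized call. The names `toy`, `parts` and `bed` are hypothetical.

```python
import pandas as pd

# toy variant IDs in the assumed chrom:pos:major:minor layout (made-up values)
toy = pd.DataFrame({'hg19_id': ['1:1000:A:G', '2:2500:C:T']})

# split each ID into four columns in a single vectorized call
parts = toy['hg19_id'].str.split(':', expand=True)
parts.columns = ['chrom', 'pos', 'major_allele', 'minor_allele']

# BED intervals are 0-based and half-open: a single-base variant at
# 1-based position p spans [p - 1, p)
parts['end'] = parts['pos'].astype(int)
parts['start'] = parts['end'] - 1

bed = parts[['chrom', 'start', 'end', 'major_allele', 'minor_allele']]
print(bed)
```

Casting the position to `int` before deriving `start` keeps both coordinates numeric, which avoids mixing string and integer columns in the output.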

In [5]:
# extract hg19 data
hg19_cols = ['chrom', 'hg19_start', 'hg19_end', 'rsid', 'hg19_id', 'hg19_signal_name', 'ppa', 'major_allele', 'minor_allele']
hg19_data = cs_vars[hg19_cols].copy()  # copy so the column rename below does not touch cs_vars

# remove hg19 from column names
hg19_data.columns = ['chrom', 'start', 'end', 'rsid', 'id', 'signal_name', 'ppa', 'major_allele', 'minor_allele']

# save the data with a full set of columns
full_fn = os.path.join(outdir, 'hg19.finemapping.full.bed')
hg19_data.to_csv(full_fn, sep='\t', index=False, header=True)

# make a list of columns with just genetic coordinate information & save
basic_cols = ['chrom', 'start', 'end', 'rsid', 'ppa']
basic_fn = os.path.join(outdir, 'hg19.finemapping.basic.bed')
hg19_data[basic_cols].to_csv(basic_fn, sep='\t', index=False, header=True)

In [6]:
# extract hg38 data
hg38_cols = ['chrom', 'hg38_start', 'hg38_end', 'rsid', 'hg38_id', 'hg38_signal_name', 'ppa', 'major_allele', 'minor_allele']
hg38_data = cs_vars[hg38_cols].copy()  # copy so the column rename below does not touch cs_vars

# remove hg38 from column names
hg38_data.columns = ['chrom', 'start', 'end', 'rsid', 'id', 'signal_name', 'ppa', 'major_allele', 'minor_allele']

# save the data with a full set of columns
full_fn = os.path.join(outdir, 'hg38.finemapping.full.bed')
hg38_data.to_csv(full_fn, sep='\t', index=False, header=True)

# save just the genetic coordinate information (reuses basic_cols from the hg19 cell)
basic_fn = os.path.join(outdir, 'hg38.finemapping.basic.bed')
hg38_data[basic_cols].to_csv(basic_fn, sep='\t', index=False, header=True)
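
A quick round-trip check can confirm that files written this way read back with the expected columns and single-base intervals. The sketch below is only an illustration on made-up rows: it writes to an in-memory buffer rather than the real output paths, and the names `toy`, `buf`, `check` and `is_single_base` are hypothetical.

```python
import io
import pandas as pd

# made-up rows using the same column layout as the "basic" files
toy = pd.DataFrame({
    'chrom': ['chr1', 'chr2'],
    'start': [99, 199],
    'end': [100, 200],
    'rsid': ['rs_a', 'rs_b'],
    'ppa': [0.9, 0.1],
})

# write tab-separated text with a header row, as in the cells above
buf = io.StringIO()
toy.to_csv(buf, sep='\t', index=False, header=True)

# read it back and verify that every interval covers exactly one base
buf.seek(0)
check = pd.read_csv(buf, sep='\t')
is_single_base = (check['end'] - check['start']) == 1
print(is_single_base.all())
```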


In [None]:
# save_data['ppa_pct'] = pd.cut(save_data.PPA,
#                               bins=[0,0.1,0.2,.5,0.8,0.9,1],
#                               labels=['0-10%', '10-20%', '20-50%', '50-80%', '80-90%', '90-100%'] )
# save_data.loc[save_data['ppa_pct'] == '80-90%']
# save_data.loc[save_data.ppa_pct == '50-80%']
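
The commented-out lines above bin PPA values into labelled ranges with `pd.cut`. A minimal runnable sketch of that binning on made-up values follows, assuming the same bin edges and labels; `toy` and `high` are hypothetical names, and the real column name may differ.

```python
import pandas as pd

# made-up PPA values for illustration only
toy = pd.DataFrame({'ppa': [0.05, 0.5, 0.85, 1.0]})

bins = [0, 0.1, 0.2, 0.5, 0.8, 0.9, 1]
labels = ['0-10%', '10-20%', '20-50%', '50-80%', '80-90%', '90-100%']

# pd.cut uses right-closed bins by default, e.g. (0.2, 0.5]
toy['ppa_pct'] = pd.cut(toy['ppa'], bins=bins, labels=labels)

# select the variants that fall in a single bin
high = toy.loc[toy['ppa_pct'] == '80-90%']
print(toy)
```

Because the bins are right-closed, a PPA of exactly 0 would be left unbinned (NaN) unless `include_lowest=True` is passed.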
