Here, we develop the network-based GWAS methods that we will benchmark:

**Network-based gene re-ranking methods (these methods may not provide a statistically robust p-value)**
1. Threshold-based network propagation: propagate all gene p-value hits below a certain p-value threshold
2. Prop x GWAS: construct an "empirical p-value" for every gene after propagation with method 1, then multiply it by the original gene-level p-value to create an "adjusted association/p-value" for re-ranking

**Network-boosted p-value methods (the p-values from these methods can be multiplied with the original gene-level p-values to create a "boosted" p-value. The p-value boost is limited to a factor of 1/(n+1), where n is the number of degree-preserved network shuffles created.)**
3. Propagate the -log10(p-value) of each gene across the actual network and the null networks. For each gene in the network, calculate an empirical p-value (rank / 1001) and a normal-distribution p-value (using the mean and standard deviation of the null network values). Multiply these values by the original gene-level p-value.
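
As a rough sketch of method 3 on made-up numbers (the helper name `empirical_p` and all values are hypothetical, not part of the pipeline), the empirical p-value of a gene is its rank among the actual and null propagation scores divided by n+1, so its smallest possible value is 1/(n+1). Ties are counted against the actual score here, which is a slightly more conservative choice than the average-rank tie handling of pandas `rank`.

```python
import numpy as np

def empirical_p(actual, nulls):
    """Rank of the actual score among actual + null scores, divided by n + 1.

    Rank 1 means the actual score exceeds every null score.
    """
    nulls = np.asarray(nulls, dtype=float)
    rank = 1 + np.sum(nulls >= actual)  # ties counted against the actual score
    return rank / float(len(nulls) + 1)

# Hypothetical values: one gene, n = 4 null networks
gwas_p = 1e-4
actual_score = 0.9
null_scores = [0.2, 0.5, 0.1, 0.3]

emp_p = empirical_p(actual_score, null_scores)   # rank 1 of 5 -> 1 / 5
boosted_p = gwas_p * emp_p                       # boost cannot exceed 1 / (n + 1)
print(emp_p, boosted_p)
```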

# Load network kernel

In [1]:
import os
import pandas as pd
import numpy as np

In [2]:
# First, we will load the network kernel of the network we will be using
kernel = pd.read_hdf('/Users/Dan/Data/GWAS_for_class/kernels/PCNet_kernel.hdf')

In [3]:
all(kernel.columns==kernel.index)

True

In [4]:
# Get all genes in the network
network_genes = list(kernel.index)

In [5]:
# Load the scz gene-level p-value table, rename its last column to 'TopSNP P-Value',
# and save a copy to wd so the analysis can be re-performed from there
# (previous VEGAS2 Alzheimer's input, kept for reference:)
#vegas_pval_table = pd.read_csv('/cellar/users/jkhuang/Data/Projects/Network_GWAS/Data/IGAP/IGAP_boosting_results/vegas_results/vegas_df.txt', sep='\t')

wd='/Users/Dan/Data/GWAS_for_class/'
vegas_pval_table = pd.read_csv(wd+'scz_gene_10k.txt',sep='\t', usecols=[1,2,3,4,5,6,7,8])
vegas_pval_table.columns = list(vegas_pval_table.columns)[:-1]+['TopSNP P-Value']
vegas_pval_table.to_csv(wd+'scz_limited_toPCnet.csv', sep='\t', index=False)

In [8]:
vegas_pval_table.head()

Unnamed: 0,Gene,Chr,Gene Start,Gene End,nSNPs,TopSNP,TopSNP Pos,TopSNP P-Value
0,HIST1H4K,6,27906930,27907284,8,rs34706883,27913234.0,5.07118e-10
1,HIST1H2AK,6,27913636,27914096,16,rs34706883,27913234.0,5.07118e-10
2,HIST1H2BN,6,27914418,27914867,17,rs34706883,27913234.0,5.07118e-10
3,HIST1H2AL,6,27941085,27941555,10,rs13199772,27942064.0,7.05379e-10
4,HIST1H1B,6,27942548,27943338,10,rs13199772,27942064.0,7.05379e-10


In [7]:
# Load the gene p-value table and create the binary propagation vector
pval_table_files = [fn for fn in os.listdir(wd) if fn.endswith('.csv')]
pval_table_files.sort()

threshold = 1e-6
threshold_genes = {}
prop_vectors = []
# Name of this analysis; it matches the table written above (scz_limited_toPCnet.csv)
name = 'scz_limited_toPCnet'
pval_table = vegas_pval_table
threshold_genes[name] = pval_table[pval_table['TopSNP P-Value'] < threshold]
prop_vector = (pval_table.set_index('Gene').ix[network_genes]['TopSNP P-Value'] < threshold).astype(float)
prop_vector.name = name
prop_vectors.append(prop_vector)
prop_vector_matrix = pd.concat(prop_vectors, axis=1).ix[network_genes].T

# Propagate genes with p < threshold (1e-6)

In [19]:
# Propagate binary vectors over the network
prop_val_matrix = np.dot(prop_vector_matrix, kernel)
prop_val_table = pd.DataFrame(prop_val_matrix, index = prop_vector_matrix.index, columns = prop_vector_matrix.columns)
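
To make the propagation step concrete, here is a minimal sketch on a hypothetical 4-gene kernel (the kernel values, gene names and `toy_trait` label are invented for illustration). Multiplying a binary seed vector by the kernel is the same as summing the kernel rows of the seed genes.

```python
import numpy as np
import pandas as pd

genes = ['A', 'B', 'C', 'D']
# Hypothetical 4-gene kernel; row i holds the score gene i spreads to each gene
toy_kernel = pd.DataFrame(
    [[0.6, 0.2, 0.1, 0.1],
     [0.2, 0.5, 0.2, 0.1],
     [0.1, 0.2, 0.5, 0.2],
     [0.1, 0.1, 0.2, 0.6]],
    index=genes, columns=genes)

# One binary seed vector (a single row) with genes A and C as seeds
seeds = pd.DataFrame([[1.0, 0.0, 1.0, 0.0]], index=['toy_trait'], columns=genes)

# Propagate: (1 x genes) seed matrix times (genes x genes) kernel
prop = pd.DataFrame(np.dot(seeds.values, toy_kernel.values),
                    index=seeds.index, columns=genes)

# Same result as adding up the kernel rows of the seed genes
row_sum = toy_kernel.loc[['A', 'C']].sum(axis=0)
print(prop)
```

The column order of the seed matrix must match the row order of the kernel, which is why the notebook re-indexes every vector by `network_genes` before the dot product.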

In [27]:
prop_val_table.T.head()
prop_val_table.T.sort_values(by='scz_limited_toPCnet', ascending=False).to_csv('/Users/Dan/Data/GWAS_for_class/scz_e6_pcnet.txt', sep='\t')

In [80]:
# For each propagation result:
outdir = '/Users/Dan/Data/GWAS_for_class/'
for i in prop_val_table.index:
    # Get all gene p-values and combine with propagation values
    pval_table = pd.read_csv(wd+i+'.csv', sep='\t').set_index('Gene')
    network_gene_gwas_pvals = pval_table.ix[network_genes]['TopSNP P-Value'].fillna(1.0)
    prop_gwas_table = pd.concat([network_gene_gwas_pvals, prop_val_table.ix[i]], axis=1)
    prop_gwas_table.columns = ['GWAS P-Value', 'Prop Value']
    # Get all genes that were not seed genes
    network_non_seed = list(set(network_genes) - set(threshold_genes[i]['Gene']))
    prop_gwas_table_filt = prop_gwas_table.ix[network_non_seed]
    # Update gene propagation table with non-seed gene ranks/empirical p-values, and adjusted ranks
    prop_gwas_table_filt['Prop Rank'] = prop_gwas_table_filt['Prop Value'].rank(ascending=False)
    prop_gwas_table_filt['Prop Empirical P-Value'] = prop_gwas_table_filt['Prop Rank'] / float(prop_gwas_table_filt.shape[0])
    prop_gwas_table_filt['Adjusted P-Value'] = prop_gwas_table_filt['GWAS P-Value'].multiply(prop_gwas_table_filt['Prop Empirical P-Value'])
    prop_gwas_table_filt['Adjusted Rank'] = prop_gwas_table_filt['Adjusted P-Value'].rank()
    # Construct top genes table
    top_genes = threshold_genes[i][['Gene','TopSNP P-Value']].set_index('Gene')
    top_network_genes = pd.concat([top_genes, prop_gwas_table.ix[top_genes.index]['Prop Value']], axis=1)
    top_network_genes.columns = ['GWAS P-Value', 'Prop Value']
    top_network_genes['Prop Rank'] = 0
    top_network_genes['Prop Empirical P-Value'] = None
    top_network_genes['Adjusted P-Value'] = None
    top_network_genes['Adjusted Rank'] = 0  
    # Combine top GWAS genes and new propagation ranking table
    combined_prop_gwas_table = pd.concat([top_network_genes, prop_gwas_table_filt])
    combined_prop_gwas_table_sorted = combined_prop_gwas_table[prop_gwas_table_filt.columns].sort_values(by=['Prop Rank', 'Adjusted Rank', 'GWAS P-Value'])
    combined_prop_gwas_table_sorted.to_csv(outdir+i+'_prop_results.csv', sep='\t')
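
The re-ranking arithmetic in the loop above can be followed on a small, hypothetical table of non-seed genes (gene names and values are invented). The rank-based "empirical p-value" is only a re-ranking aid, not a calibrated p-value.

```python
import pandas as pd

# Hypothetical non-seed genes with GWAS p-values and propagation scores
table = pd.DataFrame(
    {'GWAS P-Value': [0.5, 0.01, 0.2, 0.03],
     'Prop Value':   [0.9, 0.1, 0.4, 0.7]},
    index=['G1', 'G2', 'G3', 'G4'])

# Rank 1 = highest propagation score; dividing by the gene count gives the
# rank-based "empirical p-value"
table['Prop Rank'] = table['Prop Value'].rank(ascending=False)
table['Prop Empirical P-Value'] = table['Prop Rank'] / float(table.shape[0])
# Multiply by the GWAS p-value and rank again (smallest product first)
table['Adjusted P-Value'] = table['GWAS P-Value'] * table['Prop Empirical P-Value']
table['Adjusted Rank'] = table['Adjusted P-Value'].rank()

ranked = table.sort_values('Adjusted Rank')
print(ranked)
```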

# Propagate transformed gene p-values over network and null networks
Transform all p-values to -log10(p-value) for propagation, capping the maximum value at 6 (p=1e-6). All genes in the network that have no p-value are given a p-value of 1. As a second variant, also set all p-values larger than 0.1 to 1 before the transformation. Both adjustments are noise-reduction techniques.

The code below is replicated in the script at: /cellar/users/jkhuang/Data/Projects/Network_GWAS/src/construct_null_prop_tables.py  
Run this code on a large machine; it is heavy and may take a lot of time and memory.

In [81]:
# First, we will load the network kernel of the network we will be using
kernel = pd.read_hdf('/cellar/users/jkhuang/Data/Projects/Network_GWAS/Data/Network_Data/PCNet_kernel.hdf')

In [82]:
# Get all genes in the network
network_genes = list(kernel.index)

In [121]:
# Load each gene p-value table and create 2 propagation vectors
# Keep only the genes in the network (taken from the PCNet kernel loaded above)
wd = '/cellar/users/jkhuang/Data/nosnap/Projects/NBGWAS/snp2gene/'
pval_table_files = [fn for fn in os.listdir(wd) if fn.endswith('kb.csv')]

# For the VEGAS file only
# pval_table_files = [fn for fn in os.listdir(wd) if 'VEGAS2' in fn]

pval_table_files.sort()
# Change the p-value thresholds here for null network propagation if desired
max_p, min_p = 0.1, 1e-6
min_cap_prop_vectors = []
end_cap_prop_vectors = []
for fn in pval_table_files:
    name = fn.split('.')[0]
    pval_table = pd.read_csv(wd+fn, sep='\t').set_index('Gene')
    # Get all network gene p-values, set as 1 if missing
    network_pvals = pval_table.ix[network_genes]['TopSNP P-Value'].fillna(1.0)
    # Clip network gene p-values
    network_pvals_min_cap = network_pvals.clip(lower=min_p)
    network_pvals_end_cap = network_pvals.clip(lower=min_p)
    network_pvals_end_cap[network_pvals_end_cap > max_p] = 1
    # Transform values to propagation vectors
    min_cap_prop_vector = -np.log10(network_pvals_min_cap)
    min_cap_prop_vector[min_cap_prop_vector <= 0] = 0
    min_cap_prop_vector.name = name
    min_cap_prop_vectors.append(min_cap_prop_vector)
    end_cap_prop_vector = -np.log10(network_pvals_end_cap)
    end_cap_prop_vector[end_cap_prop_vector <= 0] = 0
    end_cap_prop_vector.name = name
    end_cap_prop_vectors.append(end_cap_prop_vector)
min_cap_prop_vector_matrix = pd.concat(min_cap_prop_vectors, axis=1).ix[network_genes].T
end_cap_prop_vector_matrix = pd.concat(end_cap_prop_vectors, axis=1).ix[network_genes].T
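
The two transforms built in the cell above can be checked on a handful of hypothetical p-values (the gene names and the small `network_genes` list here are invented for the sketch). `reindex` is used in place of `.ix` to pull network genes and leave missing ones as NaN.

```python
import numpy as np
import pandas as pd

max_p, min_p = 0.1, 1e-6  # same thresholds as the cell above

# Hypothetical gene-level p-values; G5 is a network gene with no GWAS p-value
pvals = pd.Series([1e-9, 1e-4, 0.05, 0.5], index=['G1', 'G2', 'G3', 'G4'])
network_genes = ['G1', 'G2', 'G3', 'G4', 'G5']
network_pvals = pvals.reindex(network_genes).fillna(1.0)

# Transform 1: floor p-values at min_p, so -log10(p) is capped at 6
min_cap = -np.log10(network_pvals.clip(lower=min_p))
min_cap[min_cap <= 0] = 0

# Transform 2: additionally reset p > max_p to 1, so weak signals score 0
end_cap_p = network_pvals.clip(lower=min_p)
end_cap_p[end_cap_p > max_p] = 1.0
end_cap = -np.log10(end_cap_p)
end_cap[end_cap <= 0] = 0

print(pd.DataFrame({'min_cap': min_cap, 'end_cap': end_cap}))
```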

In [234]:
# Propagate the transformed p-value vectors over the network
outdir = '/cellar/users/jkhuang/Data/nosnap/Projects/NBGWAS/prop_results/'
prop_val_matrix1 = np.dot(min_cap_prop_vector_matrix, kernel)
prop_val_table1 = pd.DataFrame(prop_val_matrix1, index = min_cap_prop_vector_matrix.index, 
                               columns = min_cap_prop_vector_matrix.columns)
prop_val_table1.to_csv(outdir+'full_prop_results_1.csv', sep='\t')

prop_val_matrix2 = np.dot(end_cap_prop_vector_matrix, kernel)
prop_val_table2 = pd.DataFrame(prop_val_matrix2, index = end_cap_prop_vector_matrix.index, 
                               columns = end_cap_prop_vector_matrix.columns)
prop_val_table2.to_csv(outdir+'full_prop_results_2.csv', sep='\t')

In [262]:
# Construct null propagation matrices

# Initialize null propagation distribution tables
null_prop_tables1 = {}
for name in prop_val_table1.index:
    null_prop_tables1[name] = []
    
null_prop_tables2 = {}
for name in prop_val_table2.index:
    null_prop_tables2[name] = []    

In [263]:
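# n is the number of null networks + 1
# NOTE: nulls_dir (the directory holding the null kernels) is not defined in this notebook; set it before running this cell
n = 4
for i in range(1, n):
    # Load null kernel and re-order the transformed p-value vector matrices (if needed)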
    null_kernel = pd.read_hdf(nulls_dir+'null_PCNet_kernel_'+repr(i)+'.hdf')
    null_network_genes = list(null_kernel.index)
    min_cap_prop_vector_matrix = min_cap_prop_vector_matrix[null_network_genes]
    end_cap_prop_vector_matrix = end_cap_prop_vector_matrix[null_network_genes]

    # Propagate the first set of transformed p-value vectors over the network
    null_prop_val_matrix1 = np.dot(min_cap_prop_vector_matrix, null_kernel)
    null_prop_val_table1 = pd.DataFrame(null_prop_val_matrix1, index = min_cap_prop_vector_matrix.index, 
                                        columns = min_cap_prop_vector_matrix.columns)
    # Add new propagation result rows to each table for first propagation style
    for name in null_prop_val_table1.index:
        null_prop_tables1[name].append(null_prop_val_table1.ix[name])

    # Propagate the second set of transformed p-value vectors over the network
    null_prop_val_matrix2 = np.dot(end_cap_prop_vector_matrix, null_kernel)
    null_prop_val_table2 = pd.DataFrame(null_prop_val_matrix2, index = end_cap_prop_vector_matrix.index, 
                                        columns = end_cap_prop_vector_matrix.columns)
    # Add new propagation result rows to each table for second propagation style
    for name in null_prop_val_table2.index:
        null_prop_tables2[name].append(null_prop_val_table2.ix[name])
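
The null kernels loaded in the loop above are described as coming from degree-preserved shuffles of the network; how they were generated is not shown in this notebook. One common way to build such a shuffle is repeated double-edge swaps, sketched below on a toy edge list. The function names and parameters are hypothetical, and this is not necessarily the procedure used to create the null PCNet kernels.

```python
import random

def double_edge_swap_shuffle(edges, n_swaps, seed=0, max_tries=10000):
    """Shuffle an undirected simple graph while keeping every node's degree.

    Picks two edges (a, b) and (c, d) and rewires them to (a, d) and (c, b),
    skipping swaps that would create self-loops or duplicate edges.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = set(frozenset(e) for e in edge_list)
    done, tries = 0, 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # a shared node would give a self-loop or a no-op
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # would duplicate an existing edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

def degrees(edges):
    """Count how many edges touch each node."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

toy_edges = [('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'E'), ('E', 'F'), ('F', 'A')]
shuffled = double_edge_swap_shuffle(toy_edges, n_swaps=3)
print(shuffled)
```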
    

In [264]:
# Concatenate all the null network tables together
# The script version will save the individual elements of the dictionary to disk instead
null_prop_dfs1 = {}
for table in null_prop_tables1:
    null_prop_table1 = pd.concat(null_prop_tables1[table], axis=1)
    null_prop_table1.columns = ['Null Prop '+repr(i) for i in range(1, n)]
    null_prop_dfs1[table] = null_prop_table1        
    #null_prop_table1.to_csv(outdir+table+'_null_prop_results_1.csv', sep='\t')
    
null_prop_dfs2 = {}
for table in null_prop_tables2:
    null_prop_table2 = pd.concat(null_prop_tables2[table], axis=1)
    null_prop_table2.columns = ['Null Prop '+repr(i) for i in range(1, n)]
    null_prop_dfs2[table] = null_prop_table2        
    #null_prop_table2.to_csv(outdir+table+'_null_prop_results_2.csv', sep='\t')

# Get NullProp p-values for all diseases

In [None]:
import scipy.stats as stats

In [142]:
print wd
print outdir
null_outdir = '/cellar/users/jkhuang/Data/nosnap/Projects/NBGWAS/null_prop_results/'
print null_outdir

/cellar/users/jkhuang/Data/nosnap/Projects/NBGWAS/snp2gene/
/cellar/users/jkhuang/Data/nosnap/Projects/NBGWAS/prop_results/
/cellar/users/jkhuang/Data/nosnap/Projects/NBGWAS/null_prop_results/


In [199]:
NullProp1_table = pd.read_csv(outdir+'full_prop_results_1.csv', sep='\t', index_col=0)
NullProp2_table = pd.read_csv(outdir+'full_prop_results_2.csv', sep='\t', index_col=0)

In [221]:
# List the VEGAS2 gene p-value tables to process
wd = '/cellar/users/jkhuang/Data/nosnap/Projects/NBGWAS/snp2gene/'
methods = [fn.split('.')[0] for fn in os.listdir(wd) if fn.endswith('.csv') and 'VEGAS2' in fn]
methods.sort()

In [223]:
for name in methods:
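    # Load gene p-values from file
    pval_table = pd.read_csv(wd+name+'.csv', sep='\t').set_index('Gene')
    # Get all network gene p-values for this table, set as 1 if missing
    # (otherwise network_pvals would be left over from an earlier cell)
    network_pvals = pval_table.ix[network_genes]['TopSNP P-Value'].fillna(1.0)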
    
    ##### NullProp1 #####
    # Load full propagation results on PCNet
    NullProp1_results = NullProp1_table.ix[name]
    NullProp1_results.name = 'Prop Value'
    # Load null propagation results on shuffled PCNets
    NullProp1_null_results = pd.read_csv(null_outdir+name+'_null_prop_results_1.csv', sep='\t', index_col=0)
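    # Calculate empirical p-values for all genes:
    # rank of the actual value among (actual + null) values, divided by (number of nulls + 1)
    NullProp1_combined_table = pd.concat([NullProp1_results, NullProp1_null_results], axis=1)
    NullProp1_empirical_p = NullProp1_combined_table.rank(axis=1, ascending=False)['Prop Value']/float(NullProp1_combined_table.shape[1])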
    # Calculate normal p-values for all genes
    NullProp1_null_mean = NullProp1_null_results.mean(axis=1)
    NullProp1_null_std = NullProp1_null_results.std(axis=1)
    NullProp1_normal_table = pd.concat([NullProp1_results, NullProp1_null_mean, NullProp1_null_std], axis=1)
    NullProp1_normal_table.columns = ['Prop Value', 'Null Mean', 'Null Std']
    NullProp1_normal_table['Z'] = (NullProp1_normal_table['Prop Value'].subtract(NullProp1_normal_table['Null Mean'])).divide(NullProp1_normal_table['Null Std'])
    NullProp1_normal_table['P'] = stats.norm.sf(NullProp1_normal_table['Z'])
    # Calculate boosted p-values
    NullProp1_results_table = pd.concat([network_pvals, NullProp1_results, NullProp1_empirical_p, NullProp1_normal_table['P']], axis=1)
    NullProp1_results_table.columns = ['GWAS P', 'Prop Value', 'Empirical P', 'Normal P']
    NullProp1_results_table['Boosted P (Empirical)'] = NullProp1_results_table['GWAS P'].multiply(NullProp1_results_table['Empirical P'])
    NullProp1_results_table['Boosted P (Normal)'] = NullProp1_results_table['GWAS P'].multiply(NullProp1_results_table['Normal P'])
    NullProp1_results_table.to_csv(outdir+name+'_NullProp1_results.csv', sep='\t')
    
    ##### NullProp2 #####
    # Load full propagation results on PCNet
    NullProp2_results = NullProp2_table.ix[name]
    NullProp2_results.name = 'Prop Value'
    # Load null propagation results on shuffled PCNets
    NullProp2_null_results = pd.read_csv(null_outdir+name+'_null_prop_results_2.csv', sep='\t', index_col=0)
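    # Calculate empirical p-values for all genes:
    # rank of the actual value among (actual + null) values, divided by (number of nulls + 1)
    NullProp2_combined_table = pd.concat([NullProp2_results, NullProp2_null_results], axis=1)
    NullProp2_empirical_p = NullProp2_combined_table.rank(axis=1, ascending=False)['Prop Value']/float(NullProp2_combined_table.shape[1])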
    # Calculate normal p-values for all genes
    NullProp2_null_mean = NullProp2_null_results.mean(axis=1)
    NullProp2_null_std = NullProp2_null_results.std(axis=1)
    NullProp2_normal_table = pd.concat([NullProp2_results, NullProp2_null_mean, NullProp2_null_std], axis=1)
    NullProp2_normal_table.columns = ['Prop Value', 'Null Mean', 'Null Std']
    NullProp2_normal_table['Z'] = (NullProp2_normal_table['Prop Value'].subtract(NullProp2_normal_table['Null Mean'])).divide(NullProp2_normal_table['Null Std'])
    NullProp2_normal_table['P'] = stats.norm.sf(NullProp2_normal_table['Z'])
    # Calculate boosted p-values
    NullProp2_results_table = pd.concat([network_pvals, NullProp2_results, NullProp2_empirical_p, NullProp2_normal_table['P']], axis=1)
    NullProp2_results_table.columns = ['GWAS P', 'Prop Value', 'Empirical P', 'Normal P']
    NullProp2_results_table['Boosted P (Empirical)'] = NullProp2_results_table['GWAS P'].multiply(NullProp2_results_table['Empirical P'])
    NullProp2_results_table['Boosted P (Normal)'] = NullProp2_results_table['GWAS P'].multiply(NullProp2_results_table['Normal P'])
    NullProp2_results_table.to_csv(outdir+name+'_NullProp2_results.csv', sep='\t')    
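
The normal-distribution p-value used above reduces to a z-score against the null mean and standard deviation, followed by an upper-tail probability. The sketch below uses hypothetical values for one gene and computes the tail with `math.erfc` so that it has no SciPy dependency; `stats.norm.sf` returns the same quantity.

```python
import math
import numpy as np

def normal_sf(z):
    """Upper-tail probability of a standard normal, P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical propagation score for one gene and its scores on 5 null networks
prop_value = 0.8
null_values = np.array([0.30, 0.35, 0.40, 0.45, 0.50])

null_mean = null_values.mean()
null_std = null_values.std(ddof=1)   # sample std, matching the pandas default
z = (prop_value - null_mean) / null_std
normal_p = normal_sf(z)
print(z, normal_p)
```

Unlike the rank-based empirical p-value, this value is not bounded below by 1/(n+1), so it can become very small when the null standard deviation is small.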

# Sweep "Top Hits" parameter and propagate over PCNet
Last Updated: April 06, 2018  
Here, we will construct a binary matrix with the various top gene thresholds for each disease and multiply that result with the PCNet kernel.

In [469]:
import os
import pandas as pd
import numpy as np
import networkx as nx

In [225]:
# First, we will load the network kernel of the network we will be using
kernel = pd.read_hdf('/cellar/users/jkhuang/Data/Projects/Network_GWAS/Data/Network_Data/PCNet_kernel.hdf')

In [226]:
# Get all genes in the network
network_genes = list(kernel.index)

In [451]:
# Establish top gene thresholds (by % and p-value)
#p_thresholds = [1e-8, 5e-8, 1e-7, 5e-7, 1e-6, 5e-6, 1e-5, 5e-5, 1e-4]
p_thresholds = []
for i in range(-10, -2):
    p_thresholds = p_thresholds+[round(pct, 10) for pct in list(np.linspace(10**i, 9*10**i, 9))]
p_thresholds.append(0.01)

In [453]:
top_pct = [round(pct, 4) for pct in list(np.linspace(0.0001, 0.0009, 9))] + [round(pct, 3) for pct in list(np.linspace(0.001, 0.009, 9))] + [round(pct, 2) for pct in list(np.linspace(0.01, 0.15, 15))]   

In [454]:
# Get each gene p-value table
wd = '/cellar/users/jkhuang/Data/nosnap/Projects/NBGWAS/snp2gene/'
pval_table_files = [fn for fn in os.listdir(wd) if fn.endswith('_min_p_10kb.csv')]
pval_table_files.sort()

In [455]:
# Create binary propagation vector
threshold_genes = {}
prop_vector_list = []
for fn in pval_table_files:
    # Set disease table
    disease = fn.split('_')[0]
    # Load gene p-value assignments
    pval_table = pd.read_csv(wd+fn, sep='\t')
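    # Sort gene p-value assignments in ascending order, and break ties with SNP distance to gene
    pval_table['SNP Distance'] = abs(pval_table['TopSNP Pos'].subtract(pval_table['Gene Start']))
    # sort_values returns a new table, so assign it back; reset the index so row order follows the sort
    pval_table = pval_table.sort_values(by=['TopSNP P-Value', 'SNP Distance']).reset_index(drop=True)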
    # Get seed genes for each disease
    for p_thresh in p_thresholds:
        vector_name = ' '.join([disease, 'p', '<', str(p_thresh)])
        # Get top genes for each disease by p threshold
        threshold_genes[vector_name] = pval_table[pval_table['TopSNP P-Value'] < p_thresh]
        prop_vector = (pval_table.set_index('Gene').ix[network_genes]['TopSNP P-Value'] < p_thresh).astype(float)
        prop_vector.name = vector_name
        prop_vector_list.append(prop_vector)
    for pct in top_pct:
        vector_name = ' '.join([disease, 'Top', str(pct*100)+'%'])
        # Get top genes for each disease by % threshold
        threshold_genes[vector_name] = pval_table.iloc[:int(np.ceil(pct*pval_table.shape[0]))]
        prop_vector = (pval_table.set_index('Gene', drop=False).ix[network_genes]['Gene'].isin(threshold_genes[vector_name]['Gene'])).astype(float)
        prop_vector.name = vector_name
        prop_vector_list.append(prop_vector)
prop_vector_matrix = pd.concat(prop_vector_list, axis=1).ix[network_genes].T            
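
As a small illustration of the sweep above, the sketch below builds the binary seed matrix for two p-value thresholds and one percentage threshold. The gene table, the `toy` label and the short threshold lists are hypothetical; the table is assumed to be sorted by ascending p-value already.

```python
import numpy as np
import pandas as pd

# Hypothetical gene table, already sorted by ascending p-value
pval_table = pd.DataFrame({'Gene': ['G1', 'G2', 'G3', 'G4', 'G5'],
                           'TopSNP P-Value': [1e-9, 5e-7, 2e-4, 0.03, 0.6]})
network_genes = ['G2', 'G3', 'G5', 'G6']  # G6 has no GWAS p-value

net_p = pval_table.set_index('Gene')['TopSNP P-Value'].reindex(network_genes)
vectors = []
for p_thresh in [1e-6, 1e-3]:
    v = (net_p < p_thresh).astype(float)   # missing genes compare as False
    v.name = 'toy p < ' + str(p_thresh)
    vectors.append(v)
for pct in [0.4]:
    k = int(np.ceil(pct * pval_table.shape[0]))   # top 40% of 5 genes = 2 genes
    top = set(pval_table['Gene'].iloc[:k])
    v = pd.Series([float(g in top) for g in network_genes], index=network_genes)
    v.name = 'toy Top ' + str(pct * 100) + '%'
    vectors.append(v)

# One row per threshold, one column per network gene
prop_vector_matrix = pd.concat(vectors, axis=1).T
print(prop_vector_matrix)
```

Seed genes that are not in the network (G1 here) count toward the threshold but contribute nothing to the propagation vector, which is what the 'Seed / Thresh Genes' statistic later reports.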

In [471]:
%%time
# Load PCNet
network_file = '/cellar/users/jkhuang/Data/Projects/Network_GWAS/Data/Network_Data/PCNet.txt'
network = nx.read_edgelist(network_file, delimiter='\t', data=False)
print 'Number of network nodes:', len(network.nodes())
print 'Number of network edges:', len(network.edges())

Number of network nodes: 19781
Number of network edges: 2724724
CPU times: user 30.9 s, sys: 932 ms, total: 31.8 s
Wall time: 31.9 s


In [482]:
# Get degree of each node in the network
network_degree = pd.Series(dict(network.degree()))
total_network_degree = float(network_degree.sum())

In [485]:
# Construct seed statistics table
# Get table columns
diseases = [fn.split('_')[0] for fn in pval_table_files]

# Get table indices
value_label = ['Seed / Thresh Genes', 'Thresh P', 'Thresh %', 'Net Seed %', 'Seed Degree %']
index_list = []
for p_thresh in p_thresholds:
    vector_name = ' '.join(['p', '<', str(p_thresh)])
    for label in value_label:
        index_list.append((vector_name, label))
for pct in top_pct:
    vector_name = ' '.join(['Top', str(pct*100)+'%'])    
    for label in value_label:
        index_list.append((vector_name, label))
        
# Construct seed statistics table
seed_table = pd.DataFrame(index=pd.MultiIndex.from_tuples(index_list), columns=diseases)

In [486]:
# Get seed statistics for each threshold
prop_vector_matrix_sums = prop_vector_matrix.sum(axis=1)
for i in prop_vector_matrix_sums.index:
    disease = i[:i.find(' ')]
    if disease == 'ALZ':
        fn = 'ALZ_IGAP_min_p_10kb.csv'
    else:
        fn = disease+'_WTCCC_min_p_10kb.csv'
    pval_table = pd.read_csv(wd+fn, sep='\t')
    index1 = i[i.find(' ')+1:]
    if '<' in index1:
        seed_table.set_value((index1, 'Thresh P'), disease, float(index1.split(' ')[-1]))
        thresh_pct = round(float(threshold_genes[i].shape[0]) / float(pval_table.shape[0])*100, 2)
        seed_table.set_value((index1, 'Thresh %'), disease, thresh_pct)
    else:
        thresh_p = threshold_genes[i]['TopSNP P-Value'].max()
        seed_table.set_value((index1, 'Thresh P'), disease, thresh_p)
        seed_table.set_value((index1, 'Thresh %'), disease, round(float(index1.split(' ')[-1][:-1]),2))
    
    num_seed_genes = int(prop_vector_matrix_sums.ix[i])
    num_thresh_genes = int(threshold_genes[i].shape[0])
    seed_table.set_value((index1, 'Seed / Thresh Genes'), disease, str(num_seed_genes)+' / '+str(num_thresh_genes))

    pct_seeds = round((num_seed_genes / float(len(network_genes)))*100, 2)
    seed_table.set_value((index1, 'Net Seed %'), disease, pct_seeds)
    
    degree_pct = round((network_degree.ix[prop_vector_matrix.ix[i][prop_vector_matrix.ix[i]==1].index].sum() / total_network_degree)*100, 2)
    seed_table.set_value((index1, 'Seed Degree %'), disease, degree_pct)
seed_table.to_csv('/cellar/users/jkhuang/Data/Projects/Network_GWAS/Results/min_p_10kb_seeds_table.csv', sep='\t')
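
The 'Net Seed %' and 'Seed Degree %' statistics can be reproduced on a toy network (the edge list and seed set below are hypothetical). The degree share shows how much of the network's connectivity the seeds hold, which can differ a lot from their share of nodes when a seed is a hub.

```python
import pandas as pd

# Hypothetical toy network as an edge list
edges = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('D', 'E')]
# Degree of each node = number of edge endpoints that mention it
degree = pd.Series([u for e in edges for u in e]).value_counts()
total_degree = float(degree.sum())          # equals 2 * number of edges

seed_genes = ['A']                          # hypothetical seed set (a hub)
net_seed_pct = round(len(seed_genes) / float(len(degree)) * 100, 2)
seed_degree_pct = round(degree.loc[seed_genes].sum() / total_degree * 100, 2)
print(net_seed_pct, seed_degree_pct)
```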

In [458]:
print (prop_vector_matrix.columns==kernel.columns).all()
print (prop_vector_matrix.columns==kernel.index).all()

True
True


In [459]:
%%time
# Propagate binary vectors over the network
prop_val_matrix = np.dot(prop_vector_matrix, kernel)
prop_val_table = pd.DataFrame(prop_val_matrix, index = prop_vector_matrix.index, columns = prop_vector_matrix.columns)

CPU times: user 1min 6s, sys: 44 ms, total: 1min 6s
Wall time: 8.52 s


In [460]:
# For each propagation result:
outdir = '/cellar/users/jkhuang/Data/Projects/Network_GWAS/Results/min_p_10kb_prop_results/'
for i in prop_val_table.index:
    disease = i[:i.find(' ')]
    if disease == 'ALZ':
        fn = 'ALZ_IGAP_min_p_10kb.csv'
    else:
        fn = disease+'_WTCCC_min_p_10kb.csv'
    # Get all gene p-values and combine with propagation values
    pval_table = pd.read_csv(wd+fn, sep='\t').set_index('Gene')
    network_gene_gwas_pvals = pval_table.ix[network_genes]['TopSNP P-Value'].fillna(1.0)
    prop_gwas_table = pd.concat([network_gene_gwas_pvals, prop_val_table.ix[i]], axis=1)
    prop_gwas_table.columns = ['GWAS P-Value', 'Prop Value']
    # Get all genes that were not seed genes
    network_non_seed = list(set(network_genes) - set(threshold_genes[i]['Gene']))
    prop_gwas_table_filt = prop_gwas_table.ix[network_non_seed]
    # Update gene propagation table with non-seed gene ranks/empirical p-values, and adjusted ranks
    prop_gwas_table_filt['Prop Rank'] = prop_gwas_table_filt['Prop Value'].rank(ascending=False)
    prop_gwas_table_filt['Prop Empirical P-Value'] = prop_gwas_table_filt['Prop Rank'] / float(prop_gwas_table_filt.shape[0])
    prop_gwas_table_filt['Adjusted P-Value'] = prop_gwas_table_filt['GWAS P-Value'].multiply(prop_gwas_table_filt['Prop Empirical P-Value'])
    prop_gwas_table_filt['Adjusted Rank'] = prop_gwas_table_filt['Adjusted P-Value'].rank()
    # Construct top genes table
    top_genes = threshold_genes[i][['Gene','TopSNP P-Value']].set_index('Gene')
    top_network_genes = pd.concat([top_genes, prop_gwas_table.ix[top_genes.index]['Prop Value']], axis=1)
    top_network_genes.columns = ['GWAS P-Value', 'Prop Value']
    top_network_genes['Prop Rank'] = 0
    top_network_genes['Prop Empirical P-Value'] = None
    top_network_genes['Adjusted P-Value'] = None
    top_network_genes['Adjusted Rank'] = 0  
    # Combine top GWAS genes and new propagation ranking table
    combined_prop_gwas_table = pd.concat([top_network_genes, prop_gwas_table_filt])
    combined_prop_gwas_table_sorted = combined_prop_gwas_table[prop_gwas_table_filt.columns].sort_values(by=['Prop Rank', 'Adjusted Rank', 'GWAS P-Value'])
    combined_prop_gwas_table_sorted.to_csv(outdir+i+'_prop_results.csv', sep='\t')

# Propagate the top 1% of genes from the previous results, using a transformed score (-log10) instead of a binary value
We will sweep the cap on the -log10 value from 100 down to 6.

In [225]:
# First, we will load the network kernel of the network we will be using
kernel = pd.read_hdf('/cellar/users/jkhuang/Data/Projects/Network_GWAS/Data/Network_Data/PCNet_kernel.hdf')

In [226]:
# Get all genes in the network
network_genes = list(kernel.index)

In [489]:
# Establish maximum value of transformed p-values to propagate
max_log10p_score = [100, 75, 50, 25, 20]+list(range(15,5,-1))

In [454]:
# Get each gene p-value table
wd = '/cellar/users/jkhuang/Data/nosnap/Projects/NBGWAS/snp2gene/'
pval_table_files = [fn for fn in os.listdir(wd) if fn.endswith('_min_p_10kb.csv')]
pval_table_files.sort()

In [571]:
# Create weighted propagation vectors (using top 1.0% propagation)
pct = 0.01
threshold_genes = {}
prop_vector_list = []
for fn in pval_table_files:
    # Set disease table
    disease = fn.split('_')[0]
    # Load gene p-value assignments
    pval_table = pd.read_csv(wd+fn, sep='\t')
    # Sort gene p-value assignments in ascending order, and break ties with SNP distance to gene
    pval_table['SNP Distance'] = abs(pval_table['TopSNP Pos'].subtract(pval_table['Gene Start']))
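    # sort_values returns a new table, so assign it back; reset the index so row order follows the sort
    pval_table = pval_table.sort_values(by=['TopSNP P-Value', 'SNP Distance']).reset_index(drop=True)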
    # Get seed genes for each disease
    for thresh in max_log10p_score:
        vector_name = ' '.join([disease, 'Weighted Top 1.0%', 'Prop, Max Weight =', str(thresh)])
        # Get top genes for each disease by % threshold
        threshold_genes[vector_name] = pval_table.iloc[:int(np.ceil(pct*pval_table.shape[0]))]
        # Get all genes in the network with p-values
        net_pval_table = pval_table.set_index('Gene', drop=False).ix[network_genes]
        net_pval_seeds = pd.concat([net_pval_table['TopSNP P-Value'].fillna(1), 
                                    (net_pval_table['Gene'].isin(threshold_genes[vector_name]['Gene'])).astype(float)], 
                                   axis=1)
        # Transform all p-values for genes in the network
        net_pval_seeds['TopSNP Score'] = -np.log10(net_pval_seeds['TopSNP P-Value'])
        # Clip transformed scores
        net_pval_seeds['TopSNP Score Clip'] = net_pval_seeds['TopSNP Score'].clip(lower=0, upper=thresh)
        # Keep only transformed scores of seed genes
        seed_values = abs(net_pval_seeds['TopSNP Score Clip'].multiply(net_pval_seeds['Gene']))
        seed_values.name = vector_name
        prop_vector_list.append(seed_values)
prop_vector_matrix = pd.concat(prop_vector_list, axis=1).ix[network_genes].T            
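
The effect of the cap sweep is easiest to see on a few hypothetical genes (the names, p-values and the 'Is Seed' column are invented for this sketch): the score of each seed is min(-log10(p), cap), and every non-seed gene is set to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical network genes: p-values and a 0/1 flag for "is a top-1% seed"
net = pd.DataFrame({'TopSNP P-Value': [1e-30, 1e-8, 1e-3, 1.0],
                    'Is Seed':        [1.0,   1.0,  0.0,  0.0]},
                   index=['G1', 'G2', 'G3', 'G4'])

score = -np.log10(net['TopSNP P-Value'])
weighted = {}
for cap in [100, 20, 6]:
    # Cap the score, then zero out every non-seed gene
    weighted[cap] = (score.clip(lower=0, upper=cap) * net['Is Seed']).abs()

print(pd.DataFrame(weighted))
```

With a high cap the strongest seed dominates the propagation; at a cap of 6 all seeds below p=1e-6 carry the same weight, which approaches the binary propagation used earlier.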



In [582]:
print (prop_vector_matrix.columns==kernel.columns).all()
print (prop_vector_matrix.columns==kernel.index).all()

True
True


In [574]:
%%time
# Propagate weighted vectors over the network
prop_val_matrix = np.dot(prop_vector_matrix, kernel)
prop_val_table = pd.DataFrame(prop_val_matrix, index = prop_vector_matrix.index, columns = prop_vector_matrix.columns)

CPU times: user 12.4 s, sys: 0 ns, total: 12.4 s
Wall time: 1.6 s


In [581]:
# For each propagation result:
outdir = '/cellar/users/jkhuang/Data/Projects/Network_GWAS/Results/min_p_10kb_prop_results/'
for i in prop_val_table.index:
    disease = i[:i.find(' ')]
    if disease == 'ALZ':
        fn = 'ALZ_IGAP_min_p_10kb.csv'
    else:
        fn = disease+'_WTCCC_min_p_10kb.csv'
    # Get all gene p-values and combine with propagation values
    pval_table = pd.read_csv(wd+fn, sep='\t').set_index('Gene')
    network_gene_gwas_pvals = pval_table.ix[network_genes]['TopSNP P-Value'].fillna(1.0)
    prop_gwas_table = pd.concat([network_gene_gwas_pvals, prop_val_table.ix[i]], axis=1)
    prop_gwas_table.columns = ['GWAS P-Value', 'Prop Value']
    # Get all genes that were not seed genes
    network_non_seed = list(set(network_genes) - set(threshold_genes[i]['Gene']))
    prop_gwas_table_filt = prop_gwas_table.ix[network_non_seed]
    # Update gene propagation table with non-seed gene ranks/empirical p-values, and adjusted ranks
    prop_gwas_table_filt['Prop Rank'] = prop_gwas_table_filt['Prop Value'].rank(ascending=False)
    prop_gwas_table_filt['Prop Empirical P-Value'] = prop_gwas_table_filt['Prop Rank'] / float(prop_gwas_table_filt.shape[0])
    prop_gwas_table_filt['Adjusted P-Value'] = prop_gwas_table_filt['GWAS P-Value'].multiply(prop_gwas_table_filt['Prop Empirical P-Value'])
    prop_gwas_table_filt['Adjusted Rank'] = prop_gwas_table_filt['Adjusted P-Value'].rank()
    # Construct top genes table
    top_genes = threshold_genes[i][['Gene','TopSNP P-Value']].set_index('Gene')
    top_network_genes = pd.concat([top_genes, prop_gwas_table.ix[top_genes.index]['Prop Value']], axis=1)
    top_network_genes.columns = ['GWAS P-Value', 'Prop Value']
    top_network_genes['Prop Rank'] = 0
    top_network_genes['Prop Empirical P-Value'] = None
    top_network_genes['Adjusted P-Value'] = None
    top_network_genes['Adjusted Rank'] = 0  
    # Combine top GWAS genes and new propagation ranking table
    combined_prop_gwas_table = pd.concat([top_network_genes, prop_gwas_table_filt])
    combined_prop_gwas_table_sorted = combined_prop_gwas_table[prop_gwas_table_filt.columns].sort_values(by=['Prop Rank', 'Adjusted Rank', 'GWAS P-Value'])
    combined_prop_gwas_table_sorted.to_csv(outdir+i+'_prop_results.csv', sep='\t')