# Assigning GWAS p-values to genes

In this notebook, we present two methods for assigning gene-level p-values using the following information:
 - SNP-level p-values: These are obtained from your GWAS, typically using software such as PLINK or SNPTEST.
 - Gene positions: These can be downloaded from the UCSC Genome Browser or NCBI.

In [132]:
from nbgwas import NBGWAS_snp2gene as snp2gene
import pandas as pd
import time

### File Format: SNP-level p-values
This file is a 4-column delimited file with the following columns in this order:

1. rsID
2. Chromosome
3. SNP Position (using the genome build corresponding to your Gene Positions file)
4. SNP P-value

The file may contain more than four columns. By default, the first four columns are read, in order, as the columns listed above.

To use different columns, pass their column numbers to the 'cols' parameter in the same order as listed above. The numbers are zero-indexed and must be given as a comma-separated string with no spaces (e.g. '0,1,2,3' selects the first four columns in order).
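
To make the 'cols' convention concrete, the sketch below parses a zero-indexed, comma-separated column string and uses it to pick the four required columns with pandas. It is not the `nbgwas` implementation: the helper name `select_snp_columns` and the inline sample data are hypothetical and exist only for illustration.

```python
import io
import pandas as pd

def select_snp_columns(handle, cols='0,1,2,3', delimiter='\t'):
    """Hypothetical helper: read a headerless delimited file and keep four columns."""
    col_idx = [int(c) for c in cols.split(',')]
    if len(col_idx) != 4:
        raise ValueError("'cols' must name exactly 4 columns")
    table = pd.read_csv(handle, sep=delimiter, header=None)
    # Reorder to rsID, chromosome, position, p-value
    table = table.iloc[:, col_idx]
    table.columns = ['Marker', 'Chr', 'Pos', 'P-Value']
    return table

# Made-up data with an extra allele column between position and p-value
sample = "rs1\t1\t100\tA\t0.5\nrs2\t1\t200\tG\t0.01\n"
snps = select_snp_columns(io.StringIO(sample), cols='0,1,2,4')
print(snps)
```

Here `cols='0,1,2,4'` skips the fourth column of the sample file, in the same way that `cols='0,1,2,7'` selects the eighth column of a real file as the p-value.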

In [133]:
# TODO: read this directly from the repository Data folder
#snp_summary_file = '/Users/Dan/data/GWAS_for_class/pgc.bip.2012-04/pgc.bip.full.2012-04.txt'
snp_summary_file = '/Users/Dan/data/GWAS_for_class/pgc.scz.full.2012-04.txt'
snp_summary = snp2gene.load_SNP_pvals(snp_summary_file, delimiter='\t', header=False, cols='0,1,2,7')

In [134]:
snp_summary.head()

| | Marker | Chr | Pos | P-Value |
|---|---|---|---|---|
| 0 | rs3131972 | 1 | 742584 | 0.761033 |
| 1 | rs3131969 | 1 | 744045 | 0.784919 |
| 2 | rs3131967 | 1 | 744197 | 0.79352 |
| 3 | rs1048488 | 1 | 750775 | 0.761041 |
| 4 | rs12562034 | 1 | 758311 | 0.987899 |


In [135]:
# Index labels that occur on more than one row
index_counts = snp_summary.index.value_counts()
multiple_marks = index_counts[index_counts > 1].index

### File Format: Gene Positions
This file is a 4-column delimited file with the following columns in this order:

1. Gene Name/Symbol
2. Chromosome
3. Transcription Start Position
4. Transcription End Position

The downloaded file may need to be reformatted to match this layout. As with the SNP-level file, it may contain more than four columns, in which case the columns to use can be specified as described above.
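
As a rough illustration of that reformatting step, the sketch below reduces a transcript-level table to the four-column layout. The input column names (`name2`, `chrom`, `txStart`, `txEnd`) are assumptions modeled on a UCSC refGene-style export, and the rows are made up; check the names against your own download. Collapsing transcripts to the earliest start and latest end per gene is one possible choice, not necessarily how the file used in this notebook was built.

```python
import pandas as pd

# Made-up rows in a refGene-like layout (column names are assumptions)
raw = pd.DataFrame({
    'name2':   ['GENE_A', 'GENE_A', 'GENE_B'],
    'chrom':   ['chr1', 'chr1', 'chr2'],
    'txStart': [1000, 1200, 5000],
    'txEnd':   [2000, 2500, 5600],
})

# Collapse transcripts to one row per gene: earliest start, latest end
gene_pos = (raw.groupby(['name2', 'chrom'], as_index=False)
               .agg({'txStart': 'min', 'txEnd': 'max'}))

# Drop the 'chr' prefix so chromosome labels match the SNP file
gene_pos['chrom'] = gene_pos['chrom'].str.replace('chr', '', regex=False)
gene_pos = gene_pos[['name2', 'chrom', 'txStart', 'txEnd']]

# Tab-delimited text without header or index, matching header=False in the loader call
text = gene_pos.to_csv(sep='\t', header=False, index=False)
print(text)
```

Passing a file path as the first argument to `to_csv` writes the same text to disk.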

In [136]:
# TODO: read this directly from the repository Data folder
gene_pos_file = '/Users/Dan/data/GWAS_for_class/hg18/glist-hg18_proteinCoding.txt'

hg18_gene_pos = snp2gene.load_gene_pos(gene_pos_file, delimiter='\t', header=False)

In [137]:
hg18_gene_pos.head()

| Gene | Chr | Start | End |
|---|---|---|---|
| A1BG | 19 | 63551643 | 63565932 |
| A1CF | 10 | 52271589 | 52315441 |
| A2M | 12 | 9111570 | 9159825 |
| A2ML1 | 12 | 8911704 | 8930864 |
| A3GALT2 | 1 | 33544953 | 33559286 |


# Assigning GWAS p-values to genes - Minimum P Method
1. For each gene in the genome (or as defined by the Gene Positions file), we collect all SNPs that fall within a specified genomic distance upstream or downstream of the gene body (transcription start site to transcription end site). This distance is given in kilobases (e.g. if 'window' is set to 5, all SNPs within 5 kb of the gene body are collected).
2. Each gene is then assigned the minimum p-value across all SNPs falling within the specified window.
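
The two steps can be traced on a toy example for a single gene. All SNPs and the gene coordinates below are made up; only the column names follow the SNP table shown earlier.

```python
import pandas as pd

# Made-up SNPs: two inside the window, one outside, one on another chromosome
snps = pd.DataFrame({
    'Marker':  ['rs_a', 'rs_b', 'rs_c', 'rs_d'],
    'Chr':     ['1', '1', '1', '2'],
    'Pos':     [6000, 12000, 31000, 12000],
    'P-Value': [0.20, 0.03, 0.001, 0.0001],
})

# Hypothetical gene on chromosome 1 spanning positions 10,000-20,000
chrom, start, end = '1', 10000, 20000
window = 5              # kilobases
dist = window * 1000    # base pairs

# Step 1: SNPs on the same chromosome within [start - dist, end + dist]
in_window = snps[(snps['Chr'] == chrom)
                 & (snps['Pos'] >= start - dist)
                 & (snps['Pos'] <= end + dist)]

# Step 2: the gene takes the smallest p-value among those SNPs
top = in_window.loc[in_window['P-Value'].idxmin()]
print(len(in_window), top['Marker'], top['P-Value'])
```

Note that `rs_c` and `rs_d` have smaller p-values than `rs_b`, but neither is assigned to the gene: `rs_c` lies beyond the 5 kb window and `rs_d` is on a different chromosome.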

In [138]:
def min_p(SNP_summary, gene_positions, window):
    """Assign each gene the smallest p-value among SNPs within `window` kb of the gene body."""
    starttime = time.time()
    dist = window * 1000  # window is given in kilobases
    genelist = list(gene_positions.index)
    min_p_list = []
    # Compare chromosomes as strings (note: this modifies SNP_summary in place)
    SNP_summary['Chr'] = SNP_summary['Chr'].astype(str)
    for gene in genelist:
        # .loc replaces the deprecated .ix for label-based lookup
        gene_info = gene_positions.loc[gene]
        chrom = str(gene_info['Chr'])
        start = gene_info['Start']
        stop = gene_info['End']
        # Get all SNPs on same chromosome
        SNP_summary_filt1 = SNP_summary[SNP_summary['Chr'] == chrom]
        # Get all SNPs after window start position
        SNP_summary_filt2 = SNP_summary_filt1[SNP_summary_filt1['Pos'] >= (start - dist)]
        # Get all SNPs before window end position
        SNP_summary_filt3 = SNP_summary_filt2[SNP_summary_filt2['Pos'] <= (stop + dist)]
        # Get min_p statistics for this gene
        if len(SNP_summary_filt3) >= 1:
            # idxmin returns the index label of the row with the smallest p-value
            min_p_data = SNP_summary_filt3.loc[SNP_summary_filt3['P-Value'].idxmin()]
            min_p_list.append([gene, chrom, start, stop, SNP_summary_filt3.shape[0], min_p_data['Marker'], int(min_p_data['Pos']), min_p_data['P-Value']])
        else:
            min_p_list.append([gene, chrom, start, stop, 0, None, None, None])
    min_p_table = pd.DataFrame(min_p_list, columns=['Gene', 'Chr', 'Gene Start', 'Gene End', 'nSNPs', 'TopSNP', 'TopSNP Pos', 'TopSNP P-Value'])
    # Distance from the top SNP to the gene start position
    min_p_table['SNP Distance'] = abs(min_p_table['TopSNP Pos'].subtract(min_p_table['Gene Start']))
    # Drop genes with no SNPs in the window, then sort by p-value
    min_p_table = min_p_table.dropna().sort_values(by=['TopSNP P-Value', 'Chr', 'Gene Start'])
    print('P-Values assigned to genes: {} seconds'.format(time.time() - starttime))
    return min_p_table

In [139]:
min_p_table = min_p(snp_summary, hg18_gene_pos, 10)

P-Values assigned to genes: 1367.42107391 seconds
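
The run above took roughly 1,367 seconds, largely because the full SNP table is filtered again for every gene. One possible alternative, sketched below, sorts the SNPs once per chromosome and locates each gene's window with a binary search (`numpy.searchsorted`). The function name `min_p_sorted` is hypothetical, the sketch returns fewer columns than `min_p`, and it has not been benchmarked against the data used in this notebook.

```python
import numpy as np
import pandas as pd

def min_p_sorted(snp_summary, gene_positions, window):
    """Hypothetical alternative to min_p: binary search on sorted SNP positions."""
    dist = window * 1000
    # Work on a copy so the caller's table is not modified
    snps = snp_summary.assign(Chr=snp_summary['Chr'].astype(str))
    # Sort SNPs by position once per chromosome
    by_chr = {c: g.sort_values('Pos') for c, g in snps.groupby('Chr')}
    rows = []
    for gene, info in gene_positions.iterrows():
        chrom = str(info['Chr'])
        if chrom not in by_chr:
            continue
        g = by_chr[chrom]
        pos = g['Pos'].values
        # Index range of SNPs with start - dist <= Pos <= end + dist
        lo = np.searchsorted(pos, info['Start'] - dist, side='left')
        hi = np.searchsorted(pos, info['End'] + dist, side='right')
        if hi > lo:
            hits = g.iloc[lo:hi]
            # Positional argmin avoids relying on unique index labels
            top = hits.iloc[np.argmin(hits['P-Value'].values)]
            rows.append([gene, chrom, hi - lo, top['Marker'], top['P-Value']])
    return pd.DataFrame(rows, columns=['Gene', 'Chr', 'nSNPs', 'TopSNP', 'TopSNP P-Value'])

# Made-up data: GENE_A has two SNPs within 5 kb, GENE_B has none
snps = pd.DataFrame({
    'Marker': ['rs_a', 'rs_b', 'rs_c'],
    'Chr': [1, 1, 2],
    'Pos': [6000, 12000, 500],
    'P-Value': [0.2, 0.03, 0.5],
})
genes = pd.DataFrame({'Chr': [1, 2], 'Start': [10000, 90000], 'End': [20000, 95000]},
                     index=['GENE_A', 'GENE_B'])
result = min_p_sorted(snps, genes, 5)
print(result)
```

Genes with no SNPs in the window are skipped here, which matches the effect of the `dropna()` call in `min_p`.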




In [141]:
min_p_table.head(50)

| | Gene | Chr | Gene Start | Gene End | nSNPs | TopSNP | TopSNP Pos | TopSNP P-Value | SNP Distance |
|---|---|---|---|---|---|---|---|---|---|
| 5905 | HIST1H4K | 6 | 27906930 | 27907284 | 8 | rs34706883 | 27913234.0 | 5.07118e-10 | 6304.0 |
| 5867 | HIST1H2AK | 6 | 27913636 | 27914096 | 16 | rs34706883 | 27913234.0 | 5.07118e-10 | 402.0 |
| 5883 | HIST1H2BN | 6 | 27914418 | 27914867 | 17 | rs34706883 | 27913234.0 | 5.07118e-10 | 1184.0 |
| 5868 | HIST1H2AL | 6 | 27941085 | 27941555 | 10 | rs13199772 | 27942064.0 | 7.05379e-10 | 979.0 |
| 5855 | HIST1H1B | 6 | 27942548 | 27943338 | 10 | rs13199772 | 27942064.0 | 7.05379e-10 | 484.0 |
| 5893 | HIST1H3I | 6 | 27947601 | 27948078 | 10 | rs13199772 | 27942064.0 | 7.05379e-10 | 5537.0 |
| 5906 | HIST1H4L | 6 | 27948904 | 27949268 | 10 | rs13199772 | 27942064.0 | 7.05379e-10 | 6840.0 |
| 9968 | PGBD1 | 6 | 28357342 | 28378305 | 30 | rs6901575 | 28358963.0 | 1.23604e-09 | 1621.0 |
| 5858 | HIST1H1E | 6 | 26264537 | 26265322 | 11 | rs3857546 | 26265741.0 | 1.4581e-09 | 1204.0 |
| 5873 | HIST1H2BD | 6 | 26266327 | 26279555 | 15 | rs3857546 | 26265741.0 | 1.4581e-09 | 586.0 |


In [142]:
min_p_table.to_csv('/Users/Dan/data/GWAS_for_class/scz_gene_10k.txt',sep='\t')

In [143]:
sum(min_p_table['TopSNP P-Value']<5e-6)

66