# Assigning GWAS p-values to genes

In this notebook, we present two methods for assigning gene-level p-values using the following information:
 - SNP-level p-values: These are obtained from your GWAS, typically using software such as PLINK or SNPTEST
 - Gene positions: These can be downloaded from the UCSC Genome Browser or NCBI

In [2]:
#from nbgwas import NBGWAS_snp2gene as snp2gene

import pandas as pd

### File Format: SNP-level p-values
This file is a 4-column delimited file with the following columns in this order:

1. rsID
2. Chromosome
3. SNP Position (using the genome build corresponding to your Gene Positions file)
4. SNP P-value

The file may contain more than four columns. By default, the first four columns are read, in order, as the four fields listed above.

To use different columns, pass their column numbers to the 'cols' parameter in the same order as the list above. The value must be a comma-separated string with no spaces, and the column numbers are indexed from 0 (e.g. '0,1,2,3' selects the first four columns in order).

In [28]:
snp_summary_file = '/cellar/users/jkhuang/Data/Projects/Network_GWAS/Data/IGAP/IGAP_stage_1.txt'
snp_summary = load_SNP_pvals(snp_summary_file, delimiter='\t', header=True, cols='2,0,1,7')
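
The column selection described above can be sketched with plain pandas. This is a minimal illustration, not the actual implementation of `load_SNP_pvals`: the helper `select_columns`, the output column names, and the toy data are all assumptions made for this example.

```python
import io
import pandas as pd

def select_columns(table, cols='0,1,2,3', names=('rsID', 'chr', 'pos', 'pval')):
    """Hypothetical helper: pick four columns by 0-based position and rename them."""
    idx = [int(c) for c in cols.split(',')]
    if len(idx) != 4:
        raise ValueError("'cols' must name exactly four columns")
    selected = table.iloc[:, idx].copy()
    selected.columns = list(names)
    return selected

# Toy file with made-up values: chromosome, position, rsID, an unused column, p-value
raw = io.StringIO(
    "chr\tpos\tsnp\tbeta\tp\n"
    "1\t1000\trs1\t0.10\t0.5\n"
    "1\t2500\trs2\t-0.20\t0.001\n"
)
table = pd.read_csv(raw, sep='\t')

# rsID is column 2, chromosome column 0, position column 1, p-value column 4
snps = select_columns(table, cols='2,0,1,4')
print(snps)
```

The order of the numbers in 'cols' matters: it maps source columns onto rsID, chromosome, position, and p-value, in that order.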

### File Format: Gene Positions
This file is a 4-column delimited file with the following columns in this order:

1. Gene Name/Symbol
2. Chromosome
3. Transcription Start Position
4. Transcription End Position

Files downloaded from these sites may not match this format, so you may need to construct the file yourself. As with the SNP file, it may contain more than four columns, and the columns to use can be selected with the 'cols' parameter as described above.

In [20]:
gene_pos_file = '/cellar/users/jkhuang/Data/Projects/Network_GWAS/Data/glist-hg19.txt'
hg19_gene_pos = load_gene_pos(gene_pos_file, delimiter=' ', header=False, cols='3,0,1,2')
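
As a rough sketch of what conforming to this format involves, the snippet below reads a headerless, space-delimited table whose columns are in "chromosome start end gene" order and reorders them. The rows and the output column names are made up for illustration, and this is not the implementation of `load_gene_pos`.

```python
import io
import pandas as pd

# Made-up rows in "chromosome start end gene" order, space-delimited, no header
raw = io.StringIO(
    "1 2000 4000 GENE_A\n"
    "1 46000 48000 GENE_B\n"
    "X 100000 120000 GENE_C\n"
)
# Read chromosome as text so that labels such as 'X' and '1' share one type
table = pd.read_csv(raw, sep=' ', header=None, dtype={0: str})

# Reorder to gene, chromosome, start, end (the equivalent of cols='3,0,1,2')
genes = table.iloc[:, [3, 0, 1, 2]].copy()
genes.columns = ['gene', 'chr', 'start', 'end']

# Basic sanity check: every gene should start at or before its end
assert (genes['start'] <= genes['end']).all()
print(genes)
```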

# Assigning GWAS p-values to genes - Closest Gene Method
1. Each SNP is assigned to the gene with the closest transcription start site (regardless of strand). The transcription start site must fall within the specified genomic distance, upstream or downstream, of the SNP. This distance is given in kilobases (e.g. if 'window' is set to 5, each SNP is matched to the nearest gene whose transcription start site lies within 5 kb on either side of the SNP).
2. Each gene with at least one assigned SNP then receives the minimum p-value among the SNPs assigned to it.
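
The two steps above can be sketched as follows. This is a simple, unoptimized illustration under stated assumptions, not the package's own implementation: the function name `closest_gene_pvals`, the column names, and the toy data are hypothetical, and the 'start' column is treated as the transcription start site.

```python
import pandas as pd

def closest_gene_pvals(snps, genes, window=5):
    """Sketch: assign each SNP to the gene with the nearest start site within
    `window` kb, then give each gene the minimum p-value of its SNPs."""
    window_bp = window * 1000
    assigned = []
    for chrom, chrom_snps in snps.groupby('chr'):
        chrom_genes = genes[genes['chr'] == chrom]
        if chrom_genes.empty:
            continue
        for snp in chrom_snps.itertuples(index=False):
            # Absolute distance from this SNP to every start site on the chromosome
            dist = (chrom_genes['start'] - snp.pos).abs()
            nearest = dist.idxmin()
            if dist.loc[nearest] <= window_bp:
                assigned.append((chrom_genes.loc[nearest, 'gene'], snp.pval))
    assigned = pd.DataFrame(assigned, columns=['gene', 'pval'])
    return assigned.groupby('gene')['pval'].min()

# Toy data with made-up values
snps = pd.DataFrame({
    'rsID': ['rs1', 'rs2', 'rs3', 'rs4'],
    'chr': [1, 1, 1, 2],
    'pos': [1000, 2500, 50000, 500],
    'pval': [0.5, 0.001, 0.2, 0.03],
})
genes = pd.DataFrame({
    'gene': ['A', 'B', 'C'],
    'chr': [1, 1, 2],
    'start': [2000, 46000, 100000],
    'end': [4000, 48000, 120000],
})

closest = closest_gene_pvals(snps, genes, window=5)
print(closest)
```

In this toy example rs4 has no start site within 5 kb, so it is left unassigned and gene C receives no p-value.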

# Assigning GWAS p-values to genes - Minimum P Method
1. For each gene in the genome (or as defined by the Gene Positions file), we collect all SNPs within a specified genomic distance, upstream or downstream, of the gene body (transcription start site to transcription end site). This distance is given in kilobases (e.g. if 'window' is set to 5, all SNPs within 5 kb of the gene body are collected).
2. Each gene is then assigned the minimum p-value across all SNPs falling within the specified window.
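
A minimal sketch of these two steps is shown below. As before, the function name `min_p_pvals`, the column names, and the toy data are assumptions made for illustration and do not come from the package itself.

```python
import pandas as pd

def min_p_pvals(snps, genes, window=5):
    """Sketch: for each gene, take the minimum p-value over all SNPs lying
    within `window` kb of the gene body (start to end)."""
    window_bp = window * 1000
    results = {}
    for gene in genes.itertuples(index=False):
        in_window = (
            (snps['chr'] == gene.chr)
            & (snps['pos'] >= gene.start - window_bp)
            & (snps['pos'] <= gene.end + window_bp)
        )
        # Genes with no SNP in their window receive no p-value
        if in_window.any():
            results[gene.gene] = snps.loc[in_window, 'pval'].min()
    return pd.Series(results, name='pval')

# Toy data with made-up values
snps = pd.DataFrame({
    'rsID': ['rs1', 'rs2', 'rs3', 'rs4'],
    'chr': [1, 1, 1, 2],
    'pos': [1000, 5000, 16000, 500],
    'pval': [0.5, 0.001, 0.2, 0.03],
})
genes = pd.DataFrame({
    'gene': ['A', 'B', 'C'],
    'chr': [1, 1, 2],
    'start': [2000, 8000, 100000],
    'end': [4000, 12000, 120000],
})

min_p = min_p_pvals(snps, genes, window=5)
print(min_p)
```

Unlike the closest gene method, a single SNP can contribute to several genes here: rs2 falls within 5 kb of both A and B, so both genes receive its p-value.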