<a href="https://colab.research.google.com/github/USCbiostats/PM570-Colab/blob/main/Lecture-3.SimpleGWAS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GWAS using JAX and pandas-plink

Let's clone the GitHub repo that contains our simulation and utility functions. We'll also need to download some real genotype data from the 1000 Genomes Project (1000G).

In [2]:
!rm -rf /content/PM570-Colab/
!git clone https://github.com/USCbiostats/PM570-Colab.git
!pip install pandas_plink
!wget https://github.com/mancusolab/sushie/raw/main/data/plink/EUR.bed
!wget https://github.com/mancusolab/sushie/raw/main/data/plink/EUR.bim
!wget https://github.com/mancusolab/sushie/raw/main/data/plink/EUR.fam

Cloning into 'PM570-Colab'...
remote: Enumerating objects: 121, done.
remote: Counting objects: 100% (121/121), done.
remote: Compressing objects: 100% (90/90), done.
remote: Total 121 (delta 57), reused 82 (delta 26), pack-reused 0
Receiving objects: 100% (121/121), 20.59 KiB | 10.30 MiB/s, done.
Resolving deltas: 100% (57/57), done.


In [4]:
import sys
sys.path.append('/content/PM570-Colab/')

import jax
import jax.numpy as jnp
import jax.random as rdm

# let's make sure we're using 64-bit precision so we don't lose accuracy
# in our GWAS results
# again, this only works on startup!
# (`from jax.config import config` no longer works in recent JAX releases)
jax.config.update("jax_enable_x64", True)

from sim import geno, trait
from util import gwas

N = 5000
P = 1000
PROP = 0.1
H2G = 0.1

key = rdm.PRNGKey(0)
key, geno_key, trait_key = rdm.split(key, 3)

# simulate genotypes without linkage disequilibrium (LD)
X = geno.naive_sim_genotype(N, P, geno_key)

# simulate phenotype using genotype data
y = trait.naive_trait_sim(X, PROP, H2G, trait_key)

# perform GWAS scan using OLS
gwas_df = gwas.trait_scan_ols(X, y)

# any genome-wide significant hits (p < 5e-8)?
gwas_df[gwas_df["log.pval"] < jnp.log(5e-8)]

Unnamed: 0,beta,se,zscore,log.pval
58,-0.1284935610802055,0.0228666130296959,-5.619265123056758,-17.769546955739166
241,-0.1190209450811848,0.02064956150107,-5.7638485483102055,-18.616486402928206
491,-0.1546328965608824,0.0226773609497793,-6.818822388695407,-25.414105063149474
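
`gwas.trait_scan_ols` comes from the course repo and its body isn't shown here. As a rough idea of what a per-variant OLS scan can look like, here is a minimal sketch that regresses the phenotype on each variant separately and returns the same four quantities as the table above. The name `marginal_ols_scan`, the centering step, and the normal-theory p-value are assumptions made for illustration, not the repo's implementation.

```python
import jax.numpy as jnp
import jax.random as rdm
from jax.scipy.stats import norm


def marginal_ols_scan(X, y):
    """Regress y on each column of X separately (hypothetical helper)."""
    n = X.shape[0]

    # centering X and y absorbs the intercept term
    Xc = X - jnp.mean(X, axis=0)
    yc = y - jnp.mean(y)

    # per-variant sum of squares and slope estimate
    xtx = jnp.sum(Xc ** 2, axis=0)
    beta = (Xc.T @ yc) / xtx

    # residual sum of squares for each single-variant model,
    # with n - 2 degrees of freedom (intercept + slope)
    rss = jnp.sum(yc ** 2) - beta ** 2 * xtx
    se = jnp.sqrt(rss / (n - 2) / xtx)

    zscore = beta / se
    # two-sided p-value on the natural-log scale (normal approximation)
    log_pval = jnp.log(2.0) + norm.logcdf(-jnp.abs(zscore))
    return beta, se, zscore, log_pval


# toy data: only the first of five variants affects the trait
key = rdm.PRNGKey(0)
x_key, e_key = rdm.split(key)
X = rdm.normal(x_key, (500, 5))
y = 0.5 * X[:, 0] + rdm.normal(e_key, (500,))

beta, se, zscore, log_pval = marginal_ols_scan(X, y)
print(beta)
print(log_pval < jnp.log(5e-8))
```

Because every variant gets its own one-predictor model, the whole scan reduces to a few column-wise sums, which is why it vectorizes so well in JAX.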

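The simulation helpers `geno.naive_sim_genotype` and `trait.naive_trait_sim` are also imported from the repo. The sketch below shows one way such a simulation could work, assuming `PROP` is the proportion of causal variants and `H2G` is the fraction of phenotypic variance explained by genotype. The function names, the allele-frequency range, and the rescaling step are hypothetical choices, not the repo's code.

```python
import jax.numpy as jnp
import jax.random as rdm


def sim_genotype(key, n, p):
    """Independent (no-LD) standardized genotypes (hypothetical helper)."""
    freq_key, geno_key = rdm.split(key)
    # assumed allele-frequency range, chosen only for illustration
    freq = rdm.uniform(freq_key, (p,), minval=0.05, maxval=0.5)
    # two independent allele draws per individual per variant
    alleles = rdm.bernoulli(geno_key, freq, shape=(2, n, p))
    G = jnp.sum(alleles, axis=0).astype(jnp.float32)
    # standardize each variant to mean 0, variance 1
    return (G - jnp.mean(G, axis=0)) / jnp.std(G, axis=0)


def sim_trait(key, X, prop, h2g):
    """Additive trait with genetic variance h2g (hypothetical helper)."""
    n, p = X.shape
    c_key, b_key, e_key = rdm.split(key, 3)
    n_causal = max(1, int(prop * p))
    # pick causal variants and draw their effect sizes
    causal = rdm.choice(c_key, p, (n_causal,), replace=False)
    beta = jnp.zeros(p).at[causal].set(rdm.normal(b_key, (n_causal,)))
    g = X @ beta
    # rescale the genetic component so its variance equals h2g
    g = g * jnp.sqrt(h2g / jnp.var(g))
    # environmental noise carries the remaining 1 - h2g
    e = rdm.normal(e_key, (n,)) * jnp.sqrt(1.0 - h2g)
    return g + e


key = rdm.PRNGKey(0)
g_key, t_key = rdm.split(key)
X = sim_genotype(g_key, 2000, 50)
y = sim_trait(t_key, X, 0.1, 0.1)
print(X.shape, y.shape, jnp.var(y))
```

Splitting the key before each random draw mirrors the `rdm.split` pattern used in the cells of this notebook and keeps the genotype and trait draws independent.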

In [14]:
N = 5000
PROP = 0.1
H2G = 0.01

# let's perform a simulation using real genotype data to reflect real LD patterns
key, geno_key, trait_key = rdm.split(key, 3)

# point to the EUR PLINK data and specify sample size
X = geno.sim_geno_from_plink("EUR", N, geno_key)

# simulate phenotype using simulated geno data
y = trait.naive_trait_sim(X, PROP, H2G, trait_key)

# standardize the phenotype to mean 0 and variance 1
y = y - jnp.mean(y)
y = y / jnp.std(y)

# perform GWAS scan using OLS
gwas_df = gwas.trait_scan_ols(X, y)

# any genome-wide significant hits (p < 5e-8)?
gwas_df[gwas_df["log.pval"] < jnp.log(5e-8)]

Unnamed: 0,beta,se,zscore,log.pval
68,-0.0769071536306739,0.0141030711758369,-5.4532202717972975,-16.82198567715937
71,0.0778504320550798,0.0141020356335307,5.520510235414025,-17.202755619673642
72,0.0788957577729472,0.0141008732258459,5.595097304210682,-17.629961991214245
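
The filter `log.pval < jnp.log(5e-8)` compares p-values on the natural-log scale. The printed `log.pval` values are consistent with the log of a two-sided p-value from a standard normal applied to `zscore`. This is inferred from the numbers in the table, not from the repo's source. Under that assumption, the small sketch below recomputes the value for two of the rows above using only the standard library. The helper name `log_pval_from_z` is made up for illustration.

```python
import math


def log_pval_from_z(z):
    """Natural-log two-sided p-value for a z-score (standard normal)."""
    # P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.log(math.erfc(abs(z) / math.sqrt(2.0)))


# genome-wide significance threshold on the log scale
threshold = math.log(5e-8)
print("threshold:", round(threshold, 4))

# z-scores copied from rows 68 and 72 of the table above
for idx, z in [(68, -5.4532202717972975), (72, 5.595097304210682)]:
    lp = log_pval_from_z(z)
    print(idx, round(lp, 4), lp < threshold)
```

Row 68 clears the threshold only narrowly, which is easy to miss when looking at raw log values. Working on the log scale matters for stronger signals: `math.erfc` underflows to zero for very large |z|, so a real implementation would compute the log tail probability directly rather than taking the log afterwards.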
