### *cis*- and *trans*-QTL mapping with tensorQTL

This notebook provides examples for running *cis*- and *trans*-QTL mapping with tensorQTL, using open-access data from the [GEUVADIS](https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/) project.

#### Requirements
An environment configured with a GPU and ~50 GB of memory.

#### Test dataset

*Note: these files are provided for testing/benchmarking purposes only. They do not constitute an official release from the GEUVADIS project, and no quality control was applied.*

Genotypes (in PLINK and VCF formats) and normalized expression data are available [here](https://personal.broadinstitute.org/francois/geuvadis/).

Alternatively, to download the files required for these examples, uncomment and run the cell below.

In [1]:
# !wget https://personal.broadinstitute.org/francois/geuvadis/GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.bed
# !wget https://personal.broadinstitute.org/francois/geuvadis/GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.bim
# !wget https://personal.broadinstitute.org/francois/geuvadis/GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.fam
# !wget https://personal.broadinstitute.org/francois/geuvadis/GEUVADIS.445_samples.covariates.txt
# !wget https://personal.broadinstitute.org/francois/geuvadis/GEUVADIS.445_samples.expression.bed.gz
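If `wget` is not available, the same files can be fetched from Python. The following is a minimal sketch using only the standard library; the helper names `missing_files` and `download_missing` are hypothetical and not part of tensorQTL. The download call is left commented out, mirroring the cell above.

```python
import os
import urllib.request

# Base URL and file names copied from the wget commands above
BASE_URL = 'https://personal.broadinstitute.org/francois/geuvadis/'
FILES = [
    'GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.bed',
    'GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.bim',
    'GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.fam',
    'GEUVADIS.445_samples.covariates.txt',
    'GEUVADIS.445_samples.expression.bed.gz',
]

def missing_files(directory='.'):
    """Return the names in FILES that are not yet present in `directory`."""
    return [f for f in FILES if not os.path.exists(os.path.join(directory, f))]

def download_missing(directory='.'):
    """Download only the files that are missing (requires network access)."""
    for f in missing_files(directory):
        urllib.request.urlretrieve(BASE_URL + f, os.path.join(directory, f))

# download_missing()  # uncomment to fetch the files
```

Checking for existing files first avoids re-downloading the large genotype file when the cell is re-run.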

In [2]:
import pandas as pd
import numpy as np
import tensorqtl
# import tensorqtl.genotypeio as genotypeio
import genotypeio
import matplotlib.pyplot as plt
import tensorflow as tf
tf.keras.backend.clear_session()


# define paths to data
plink_prefix_path = 'GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered'
expression_bed = 'GEUVADIS.445_samples.expression.bed.gz'
covariates_file = 'GEUVADIS.445_samples.covariates.txt'
prefix = 'GEUVADIS.445_samples'

# load phenotypes and covariates
phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T

# PLINK reader for genotypes
pr = genotypeio.PlinkReader(plink_prefix_path, select_samples=phenotype_df.columns)


Mapping files: 100%|██████████| 3/3 [00:26<00:00, 11.97s/it]
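Before mapping, it can help to confirm that the phenotype columns and the (transposed) covariate rows refer to the same samples in the same order. The sketch below illustrates such a check on toy data; the helper `check_sample_alignment` and the sample IDs are hypothetical, and whether tensorQTL performs an equivalent check internally is not shown in this notebook.

```python
import pandas as pd

def check_sample_alignment(phenotype_df, covariates_df):
    """True if phenotype columns and covariate rows list the same samples, in the same order."""
    return list(phenotype_df.columns) == list(covariates_df.index)

# Toy data with hypothetical sample IDs: phenotypes x samples
toy_phenotype_df = pd.DataFrame([[0.1, -0.3, 1.2]], index=['gene1'],
                                columns=['S1', 'S2', 'S3'])
# Covariates stored as covariates x samples, then transposed as in the cell above
toy_covariates_df = pd.DataFrame({'S1': [1.0, 0.0], 'S2': [0.5, 1.0], 'S3': [-0.2, 0.0]},
                                 index=['PC1', 'sex']).T

aligned = check_sample_alignment(toy_phenotype_df, toy_covariates_df)
# Reversing the covariate rows breaks the ordering
misaligned = check_sample_alignment(toy_phenotype_df, toy_covariates_df.iloc[::-1])
```

On the real data, the same call would be `check_sample_alignment(phenotype_df, covariates_df)`.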


### *cis*-QTL: nominal p-values for all variant-phenotype pairs

In [3]:
# map all cis-associations (results for each chromosome are written to file)

# all genes
# tensorqtl.map_cis_nominal(pr, phenotype_df, phenotype_pos_df, covariates_df, prefix)

# genes on chr18
tensorqtl.map_cis_nominal(pr, phenotype_df.loc[phenotype_pos_df['chr']=='chr18'],
                          phenotype_pos_df, covariates_df, prefix)

cis-QTL mapping: nominal associations for all variant-phenotype pairs
  * 445 samples
  * 301 phenotypes
  * 26 covariates
  * 13380684 variants
  Mapping chromosome chr18
  * checking phenotypes: 301/301

In [4]:
# load results
pairs_df = pd.read_parquet('{}.cis_qtl_pairs.chr18.parquet'.format(prefix))
pairs_df.head()

| | phenotype_id | variant_id | tss_distance | maf | ma_samples | ma_count | pval_nominal | slope | slope_se |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000263006.6 | chr18_10644_C_G_b38 | -98421 | 0.016854 | 15 | 15 | 0.577392 | -0.118906 | 0.213232 |
| 1 | ENSG00000263006.6 | chr18_10847_C_A_b38 | -98218 | 0.019101 | 17 | 17 | 0.142049 | -0.299512 | 0.203613 |
| 2 | ENSG00000263006.6 | chr18_11275_G_A_b38 | -97790 | 0.024719 | 22 | 22 | 0.734261 | 0.0571 | 0.168095 |
| 3 | ENSG00000263006.6 | chr18_11358_G_A_b38 | -97707 | 0.024719 | 22 | 22 | 0.734261 | 0.0571 | 0.168095 |
| 4 | ENSG00000263006.6 | chr18_11445_G_A_b38 | -97620 | 0.023596 | 21 | 21 | 0.592376 | 0.092131 | 0.171947 |
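A common next step with the nominal results is to pick the most significant variant per phenotype. The sketch below does this with plain pandas on the five rows shown above (only a subset of columns is reproduced); on the full output the same `groupby` would apply to `pairs_df` directly.

```python
import pandas as pd

# The five rows displayed above, restricted to the columns needed here
example_pairs_df = pd.DataFrame({
    'phenotype_id': ['ENSG00000263006.6'] * 5,
    'variant_id': ['chr18_10644_C_G_b38', 'chr18_10847_C_A_b38', 'chr18_11275_G_A_b38',
                   'chr18_11358_G_A_b38', 'chr18_11445_G_A_b38'],
    'pval_nominal': [0.577392, 0.142049, 0.734261, 0.734261, 0.592376],
    'slope': [-0.118906, -0.299512, 0.0571, 0.0571, 0.092131],
})

# For each phenotype, keep the row with the smallest nominal p-value
top_idx = example_pairs_df.groupby('phenotype_id')['pval_nominal'].idxmin()
top_df = example_pairs_df.loc[top_idx]
```

Note that these nominal p-values are not corrected for the number of variants tested per phenotype.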


### *cis*-QTL: empirical p-values for phenotypes

In [5]:
# all genes
# cis_df = tensorqtl.map_cis(pr, phenotype_df, phenotype_pos_df, covariates_df)

# genes on chr18
cis_df = tensorqtl.map_cis(pr, phenotype_df.loc[phenotype_pos_df['chr']=='chr18'], phenotype_pos_df, covariates_df)

cis-QTL mapping: empirical p-values for phenotypes
  * 445 samples
  * 301 phenotypes
  * 26 covariates
  * 13380684 variants
  Mapping chromosome chr18
  * checking phenotypes: 301/301
  * loading genotypes
  * computing permutations for phenotype 301/301
  Time elapsed: 0.61 min
done.


In [6]:
cis_df.head()

| phenotype_id | num_var | beta_shape1 | beta_shape2 | true_df | pval_true_df | variant_id | tss_distance | ma_samples | ma_count | maf | ref_factor | pval_nominal | slope | slope_se | pval_perm | pval_beta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000263006.6 | 6126 | 1.054404 | 1229.290527 | 377.742554 | 3.473348e-40 | chr18_112535_G_A_b38 | 3470 | 212 | 251 | 0.282022 | 1 | 3.472972e-44 | 0.726806 | 0.04615 | 0.0001 | 4.378814e-39 |
| ENSG00000101557.14 | 6361 | 1.02672 | 1268.857544 | 379.920746 | 3.178142e-11 | chr18_210698_T_C_b38 | 52315 | 192 | 222 | 0.249438 | 1 | 3.529043e-12 | -0.19159 | 0.026736 | 0.0001 | 2.529346e-08 |
| ENSG00000079134.11 | 6929 | 1.033736 | 1174.474731 | 371.952698 | 3.815188e-08 | chr18_243547_T_A_b38 | -24503 | 293 | 383 | 0.430337 | 1 | 5.776256e-09 | -0.122455 | 0.02059 | 0.0001 | 3.149913e-05 |
| ENSG00000263884.1 | 6929 | 1.056725 | 1135.027954 | 367.828247 | 0.0007802722 | chr18_584440_G_C_b38 | 316292 | 81 | 88 | 0.098876 | 1 | 0.0003468848 | -0.331377 | 0.091862 | 0.568743 | 0.5620703 |
| ENSG00000158270.11 | 8143 | 1.030213 | 1472.01123 | 380.06308 | 1.698875e-09 | chr18_519222_C_T_b38 | 18500 | 108 | 115 | 0.129213 | 1 | 2.780035e-10 | -0.386568 | 0.059763 | 0.0001 | 1.671679e-06 |
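The `pval_beta` column gives one empirical p-value per phenotype, which still needs a correction across phenotypes. As an illustration only, the sketch below applies the Benjamini-Hochberg procedure to the five `pval_beta` values shown above. This is a swapped-in, simpler technique than the q-value approach often used in QTL pipelines, and the helper `bh_adjust` is hypothetical, not a tensorQTL function.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # scale the i-th smallest p-value by n / i
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, starting from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted

# pval_beta values from the five rows displayed above
pval_beta = [4.378814e-39, 2.529346e-08, 3.149913e-05, 0.5620703, 1.671679e-06]
adjusted = bh_adjust(pval_beta)
n_significant = int((adjusted <= 0.05).sum())
```

With only five phenotypes the adjustment is illustrative; in practice it would be applied to the full `cis_df['pval_beta']` column.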


### *trans*-QTL mapping

In [7]:
# load all genotypes into memory
genotype_df = genotypeio.load_genotypes(plink_prefix_path)

Mapping files: 100%|██████████| 3/3 [00:26<00:00, 11.96s/it]


Loading genotypes ... done.


In [8]:
# run mapping
# to limit output size, only associations with p-value <= 1e-5 are returned
trans_df = tensorqtl.map_trans(genotype_df, phenotype_df, covariates_df, batch_size=50000,
                               return_sparse=True, pval_threshold=1e-5, maf_threshold=0.05)

trans-QTL mapping
  * 445 samples
  * 19836 phenotypes
  * 26 covariates
  * 13380684 variants
  Mapping batches
  * processing batch 268/268
    time elapsed: 1.24 min
  * filtering output by MAF >= 0.05
done.
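The 268 batches reported in the log are consistent with splitting all variants into chunks of `batch_size`. A quick arithmetic check, assuming batches are formed by simple chunking of the variant list:

```python
import math

n_variants = 13380684   # from the log above
batch_size = 50000      # value passed to map_trans

# number of chunks needed to cover all variants
n_batches = math.ceil(n_variants / batch_size)
```

Under the same assumption, a smaller `batch_size` would mean more batches with fewer variants each.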


In [9]:
# remove cis-associations
trans_df = tensorqtl.filter_cis(trans_df, phenotype_pos_df.T.to_dict(), window=1000000)

In [10]:
trans_df.head()

| | variant_id | phenotype_id | pval | maf |
|---|---|---|---|---|
| 0 | chr1_10177_A_AC_b38 | ENSG00000169203.16 | 6.134267e-06 | 0.42809 |
| 4 | chr1_30923_G_T_b38 | ENSG00000278668.1 | 8.346152e-06 | 0.123595 |
| 5 | chr1_30923_G_T_b38 | ENSG00000156531.16 | 1.818107e-06 | 0.123595 |
| 6 | chr1_47159_T_C_b38 | ENSG00000185324.21 | 7.856671e-25 | 0.05618 |
| 7 | chr1_49554_A_G_b38 | ENSG00000271155.1 | 3.945792e-06 | 0.051685 |
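The cis-filtering step in cell 9 can be pictured as dropping every pair whose variant lies on the same chromosome as the phenotype and within `window` bp of its TSS. The sketch below illustrates that idea on toy data; it is not tensorQTL's implementation. It assumes variant IDs follow the `chr_pos_ref_alt_b38` pattern seen above and that the position dictionary has `chr` and `tss` keys; the helper `drop_cis`, the gene names, and the TSS positions are all hypothetical.

```python
import pandas as pd

def drop_cis(trans_df, tss_dict, window=1000000):
    """Remove pairs where the variant is within `window` bp of the phenotype's TSS."""
    # parse chromosome and position from IDs such as 'chr1_10177_A_AC_b38'
    parts = trans_df['variant_id'].str.split('_', expand=True)
    var_chr = parts[0]
    var_pos = parts[1].astype(int)
    gene_chr = trans_df['phenotype_id'].map(lambda g: tss_dict[g]['chr'])
    gene_tss = trans_df['phenotype_id'].map(lambda g: tss_dict[g]['tss'])
    is_cis = (var_chr == gene_chr) & ((var_pos - gene_tss).abs() <= window)
    return trans_df[~is_cis]

# Toy associations (variant IDs from the table above, hypothetical gene names)
toy_trans_df = pd.DataFrame({
    'variant_id': ['chr1_10177_A_AC_b38', 'chr1_30923_G_T_b38', 'chr1_47159_T_C_b38'],
    'phenotype_id': ['geneA', 'geneB', 'geneC'],
})
# Hypothetical TSS positions
toy_tss = {
    'geneA': {'chr': 'chr1', 'tss': 500000},    # same chromosome, within 1 Mb: cis
    'geneB': {'chr': 'chr1', 'tss': 5000000},   # same chromosome, beyond 1 Mb: kept
    'geneC': {'chr': 'chr7', 'tss': 40000},     # different chromosome: kept
}
filtered = drop_cis(toy_trans_df, toy_tss)
```

The original row index is preserved after filtering, which matches the gaps in the index of the `trans_df.head()` output above.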
