### *cis*- and *trans*-QTL mapping with tensorQTL

This notebook provides examples for running *cis*- and *trans*-QTL mapping with tensorQTL, using open-access data from the [GEUVADIS](https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/) project.

#### Requirements
An environment configured with a GPU and ~50GB of memory.
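
Because tensorQTL is built on PyTorch, it can help to confirm that a GPU is visible before starting a long run. The sketch below is not part of the original notebook: `describe_device` is a hypothetical helper name, and it only reports what PyTorch can see. It does not verify the memory requirement stated above.

```python
def describe_device():
    """Return a one-line description of the compute device visible to PyTorch."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # properties of the first CUDA device
        props = torch.cuda.get_device_properties(0)
        return "GPU: {} ({:.1f} GiB)".format(props.name, props.total_memory / 1024**3)
    return "no CUDA GPU detected"


print(describe_device())
```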

#### Test dataset

*Note: these files are provided for testing/benchmarking purposes only. They do not constitute an official release from the GEUVADIS project, and no quality control was applied.*

Genotypes (in PLINK and VCF format) and normalized expression data are available [here](https://personal.broadinstitute.org/francois/geuvadis/).

Alternatively, to download the files required for these examples, uncomment and run the cell below.

In [1]:
# !wget https://personal.broadinstitute.org/francois/geuvadis/GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.bed
# !wget https://personal.broadinstitute.org/francois/geuvadis/GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.bim
# !wget https://personal.broadinstitute.org/francois/geuvadis/GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.fam
# !wget https://personal.broadinstitute.org/francois/geuvadis/GEUVADIS.445_samples.covariates.txt
# !wget https://personal.broadinstitute.org/francois/geuvadis/GEUVADIS.445_samples.expression.bed.gz

In [2]:
import pandas as pd
import numpy as np
import tensorqtl
from tensorqtl import genotypeio, cis, trans
import matplotlib.pyplot as plt

# define paths to data
# prefix must match the downloaded .bed/.bim/.fam file names
plink_prefix_path = 'GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup'
expression_bed = 'GEUVADIS.445_samples.expression.bed.gz'
covariates_file = 'GEUVADIS.445_samples.covariates.txt'
prefix = 'GEUVADIS.445_samples'

# load phenotypes and covariates
phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T

# PLINK reader for genotypes
pr = genotypeio.PlinkReader(plink_prefix_path)
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]

Mapping files: 100%|██████████| 3/3 [00:23<00:00,  7.79s/it]
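
The `.T` in the loading code implies that the covariates file stores covariates in rows and samples in columns, so the transposed table is samples x covariates, while the phenotype table is phenotypes x samples. A minimal sketch of a sanity check on sample ordering follows, using small toy tables; the sample IDs, gene names and the `samples_aligned` helper are hypothetical and not part of tensorQTL.

```python
import pandas as pd

# Toy stand-ins for the real tables (all IDs and values are made up)
phenotype_df = pd.DataFrame(
    [[0.1, -0.3, 1.2], [0.5, 0.0, -0.7]],
    index=['gene_a', 'gene_b'], columns=['s1', 's2', 's3'])

# covariates x samples, as assumed for the file on disk
covariates_raw = pd.DataFrame(
    [[1.0, 0.2, -0.4], [0.0, 1.0, 0.0]],
    index=['PC1', 'sex'], columns=['s1', 's2', 's3'])
covariates_df = covariates_raw.T  # samples x covariates


def samples_aligned(phenotype_df, covariates_df):
    """True if phenotype columns and covariate rows list the same samples in the same order."""
    return list(phenotype_df.columns) == list(covariates_df.index)


print(samples_aligned(phenotype_df, covariates_df))
```

If the check fails on real data, reindexing one table by the other's sample order (for example `covariates_df.loc[phenotype_df.columns]`) is one way to restore agreement.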


### *cis*-QTL: nominal p-values for all variant-phenotype pairs

In [3]:
# map all cis-associations (results for each chromosome are written to file)

# all genes
# cis.map_nominal(genotype_df, variant_df, phenotype_df, phenotype_pos_df, covariates_df, prefix)

# genes on chr18
cis.map_nominal(genotype_df, variant_df, phenotype_df.loc[phenotype_pos_df['chr']=='chr18'],
                phenotype_pos_df.loc[phenotype_pos_df['chr']=='chr18'], covariates_df, prefix)

cis-QTL mapping: nominal associations for all variant-phenotype pairs
  * 445 samples
  * 301 phenotypes
  * 26 covariates
  * 13369268 variants
  * checking phenotypes: 301/301
  * Computing associations
    Mapping chromosome chr18
    processing phenotype 301/301
    time elapsed: 0.03 min
    * writing output
done.


In [4]:
# load results
pairs_df = pd.read_parquet('{}.cis_qtl_pairs.chr18.parquet'.format(prefix))
pairs_df.head()

| | phenotype_id | variant_id | tss_distance | maf | ma_samples | ma_count | pval_nominal | slope | slope_se |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000263006.6 | chr18_10644_C_G_b38 | -98421 | 0.016854 | 15 | 15 | 0.580872 | -0.117761 | 0.213125 |
| 1 | ENSG00000263006.6 | chr18_10847_C_A_b38 | -98218 | 0.019101 | 17 | 17 | 0.142884 | -0.298726 | 0.203505 |
| 2 | ENSG00000263006.6 | chr18_11275_G_A_b38 | -97790 | 0.024719 | 22 | 22 | 0.745231 | 0.054619 | 0.167981 |
| 3 | ENSG00000263006.6 | chr18_11358_G_A_b38 | -97707 | 0.024719 | 22 | 22 | 0.745231 | 0.054619 | 0.167981 |
| 4 | ENSG00000263006.6 | chr18_11445_G_A_b38 | -97620 | 0.023596 | 21 | 21 | 0.603276 | 0.089378 | 0.171851 |
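
The columns in this output are related by simple arithmetic, which the sketch below checks on the five rows shown. It assumes diploid genotypes, so that `maf` equals `ma_count / (2 * n_samples)` with the 445 samples reported in the log, and it derives a t-statistic as `slope / slope_se`. The names `pairs_head`, `maf_check` and `t_stat` are introduced here for illustration only.

```python
import pandas as pd

n_samples = 445  # from the mapping log

# the five rows displayed above, re-entered by hand
pairs_head = pd.DataFrame({
    'variant_id': ['chr18_10644_C_G_b38', 'chr18_10847_C_A_b38', 'chr18_11275_G_A_b38',
                   'chr18_11358_G_A_b38', 'chr18_11445_G_A_b38'],
    'maf': [0.016854, 0.019101, 0.024719, 0.024719, 0.023596],
    'ma_count': [15, 17, 22, 22, 21],
    'pval_nominal': [0.580872, 0.142884, 0.745231, 0.745231, 0.603276],
    'slope': [-0.117761, -0.298726, 0.054619, 0.054619, 0.089378],
    'slope_se': [0.213125, 0.203505, 0.167981, 0.167981, 0.171851],
})

# minor allele frequency recomputed from the minor allele count
pairs_head['maf_check'] = pairs_head['ma_count'] / (2 * n_samples)

# t-statistic implied by the effect size and its standard error
pairs_head['t_stat'] = pairs_head['slope'] / pairs_head['slope_se']

# strongest association among the displayed rows
best = pairs_head.loc[pairs_head['pval_nominal'].idxmin()]
print(pairs_head[['variant_id', 'maf', 'maf_check', 't_stat']])
print(best['variant_id'])
```

The same `idxmin` pattern, combined with `groupby('phenotype_id')`, would pick the top nominal variant per gene from the full `pairs_df`.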


### *cis*-QTL: empirical p-values for phenotypes

In [5]:
# all genes
# cis_df = cis.map_cis(genotype_df, variant_df, phenotype_df, phenotype_pos_df, covariates_df)

# genes on chr18
cis_df = cis.map_cis(genotype_df, variant_df, phenotype_df.loc[phenotype_pos_df['chr']=='chr18'],
                     phenotype_pos_df.loc[phenotype_pos_df['chr']=='chr18'], covariates_df)

cis-QTL mapping: empirical p-values for phenotypes
  * 445 samples
  * 301 phenotypes
  * 26 covariates
  * 13369268 variants
  * checking phenotypes: 301/301
  * computing permutations
    processing phenotype 301/301
  Time elapsed: 0.53 min
done.


In [6]:
cis_df.head()

| phenotype_id | num_var | beta_shape1 | beta_shape2 | true_df | pval_true_df | variant_id | tss_distance | ma_samples | ma_count | maf | ref_factor | pval_nominal | slope | slope_se | pval_perm | pval_beta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000263006.6 | 6120 | 1.053295 | 1072.929199 | 368.270782 | 3.674464e-39 | chr18_112535_G_A_b38 | 3470 | 212 | 251 | 0.282022 | 1 | 4.050327e-44 | 0.726425 | 0.046171 | 0.0001 | 4.997086e-38 |
| ENSG00000101557.14 | 6355 | 1.031151 | 1111.177124 | 372.139893 | 5.012966e-11 | chr18_210698_T_C_b38 | 52315 | 192 | 222 | 0.249438 | 1 | 3.505407e-12 | -0.191712 | 0.026749 | 0.0001 | 3.266306e-08 |
| ENSG00000079134.11 | 6921 | 1.031874 | 1311.657837 | 379.678375 | 2.626606e-08 | chr18_243547_T_A_b38 | -24503 | 293 | 383 | 0.430337 | 1 | 5.47369e-09 | -0.12272 | 0.020602 | 0.0001 | 2.448904e-05 |
| ENSG00000263884.1 | 6921 | 1.045952 | 1219.967041 | 374.754089 | 0.0007087774 | chr18_584440_G_C_b38 | 316292 | 81 | 88 | 0.098876 | 1 | 0.0003540397 | -0.330811 | 0.091845 | 0.561944 | 0.5579992 |
| ENSG00000158270.11 | 8134 | 1.03446 | 1467.754639 | 379.428467 | 1.538423e-09 | chr18_519222_C_T_b38 | 18500 | 108 | 115 | 0.129213 | 1 | 2.409727e-10 | -0.388277 | 0.059808 | 0.0001 | 1.421253e-06 |
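
Several rows share `pval_perm = 0.0001`, which is what a standard empirical p-value gives when no permutation beats the observed statistic and 10,000 permutations are used (an assumption here; the permutation count is not shown in the output). The sketch below illustrates that generic formula, `(r + 1) / (n + 1)`, on simulated null statistics. It is not tensorQTL's code, and `empirical_pvalue` is a hypothetical helper.

```python
import numpy as np


def empirical_pvalue(observed, null_stats):
    """(r + 1) / (n + 1), where r counts null statistics >= the observed one."""
    null_stats = np.asarray(null_stats)
    r = int(np.sum(null_stats >= observed))
    return (r + 1) / (len(null_stats) + 1)


rng = np.random.default_rng(0)
null_stats = rng.random(10000)  # simulated null statistics in [0, 1)

# observed statistic larger than every permuted one: the p-value hits its floor
p_floor = empirical_pvalue(2.0, null_stats)

# observed statistic at the median of the null: p-value close to 0.5
p_mid = empirical_pvalue(np.median(null_stats), null_stats)

print(p_floor, p_mid)
```

This floor is why a column such as `pval_beta` is useful for strong signals: its values in the table extend far below 0.0001, and the `beta_shape1`/`beta_shape2` columns suggest it comes from a beta distribution fitted to the permutation results.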


### *trans*-QTL mapping

In [7]:
# run mapping
# to limit output size, only associations with p-value <= 1e-5 are returned
trans_df = trans.map_trans(genotype_df, phenotype_df, covariates_df, batch_size=20000,
                           return_sparse=True, pval_threshold=1e-5, maf_threshold=0.05)

trans-QTL mapping
  * 445 samples
  * 19836 phenotypes
  * 26 covariates
  * 13369268 variants
    processing batch 669/669
    elapsed time: 1.09 min
  * 7620376 variants passed MAF >= 0.05 filtering
done.
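
The numbers in this log make the purpose of `pval_threshold` concrete. The arithmetic below uses only values printed above (variant and phenotype counts, batch size) and assumes uniformly distributed p-values under the null to estimate how many pairs would pass the threshold by chance alone.

```python
import math

n_variants_total = 13369268   # variants loaded
n_variants_tested = 7620376   # variants passing MAF >= 0.05
n_phenotypes = 19836
batch_size = 20000
pval_threshold = 1e-5

# number of variant batches, matching "processing batch 669/669"
n_batches = math.ceil(n_variants_total / batch_size)

# total variant-phenotype tests after MAF filtering
n_tests = n_variants_tested * n_phenotypes

# expected number of pairs below the threshold if every test were null
expected_null_hits = n_tests * pval_threshold

print(n_batches, n_tests, expected_null_hits)
```

With roughly 1.5 × 10^11 tests, a dense result matrix is impractical, and even a 1e-5 cutoff can be expected to let through on the order of a million chance associations, so the returned table still needs multiple-testing correction.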


In [8]:
# remove cis-associations
trans_df = trans.filter_cis(trans_df, phenotype_pos_df.T.to_dict(), variant_df, window=5000000)
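
Conceptually, removing *cis*-associations means dropping every pair whose variant lies on the same chromosome as the phenotype and within `window` base pairs of its TSS. The sketch below shows that idea on toy tables. It is not the implementation of `trans.filter_cis`; `drop_cis_pairs` is a hypothetical helper, and the `chr`/`tss` column names for phenotype positions are assumptions.

```python
import pandas as pd


def drop_cis_pairs(pairs_df, tss_df, variant_df, window=5000000):
    """Keep pairs whose variant is on another chromosome or more than `window` bp from the TSS."""
    variant_pos = variant_df.loc[pairs_df['variant_id']]
    tss = tss_df.loc[pairs_df['phenotype_id']]
    same_chrom = variant_pos['chrom'].to_numpy() == tss['chr'].to_numpy()
    distance = abs(variant_pos['pos'].to_numpy() - tss['tss'].to_numpy())
    is_cis = same_chrom & (distance <= window)
    return pairs_df[~is_cis]


# toy inputs (all IDs and positions are made up)
variant_df = pd.DataFrame({'chrom': ['chr1', 'chr1', 'chr2'], 'pos': [1000, 9000000, 500]},
                          index=['v1', 'v2', 'v3'])
tss_df = pd.DataFrame({'chr': ['chr1', 'chr1'], 'tss': [2000, 600]}, index=['g1', 'g2'])
pairs_df = pd.DataFrame({'variant_id': ['v1', 'v2', 'v3'],
                         'phenotype_id': ['g1', 'g1', 'g2'],
                         'pval': [1e-8, 2e-7, 3e-6]})

filtered = drop_cis_pairs(pairs_df, tss_df, variant_df)
print(filtered)
```

In this toy case the `v1`/`g1` pair is removed (same chromosome, 1 kb apart), while `v2`/`g1` survives because it lies beyond the 5 Mb window and `v3`/`g2` survives because the chromosomes differ.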

In [9]:
trans_df.head()

| | variant_id | phenotype_id | pval | maf |
|---|---|---|---|---|
| 0 | chr1_10177_A_AC_b38 | ENSG00000169203.16 | 6.147418e-06 | 0.42809 |
| 2 | chr1_30923_G_T_b38 | ENSG00000278668.1 | 8.097084e-06 | 0.123595 |
| 3 | chr1_30923_G_T_b38 | ENSG00000156531.16 | 1.868912e-06 | 0.123595 |
| 4 | chr1_47159_T_C_b38 | ENSG00000185324.21 | 8.467489e-25 | 0.05618 |
| 5 | chr1_49554_A_G_b38 | ENSG00000271155.1 | 3.804483e-06 | 0.051685 |
