# Lab 6: TensorQTL

In eQTL analysis, we test each gene in the transcriptome for association with a set of variants.
In cis-eQTL analysis, a single gene typically has thousands of variants to test.
Matrix eQTL @matrixqtl was developed to reduce the computational burden of eQTL identification.
Compared to other QTL analysis tools available at the time, Matrix eQTL is **orders of magnitude faster** because of its preprocessing and its use of large matrix operations for the computationally expensive parts of the analysis.

To compare genes that harbor different LD structures and different numbers of tested variants, the statistical test underlying cis-eQTL analysis requires permutation.
This makes the analysis computationally intensive even with Matrix eQTL.
To overcome the computational burden of permutation, FastQTL @fastqtl proposed an approximate simulation scheme that is efficient for large-scale transcriptome-wide analyses, *e.g.* Geuvadis and GTEx.

In the past few years, GPUs have been widely adopted for many kinds of computation.
TensorQTL @tensorqtl was recently developed and can run on both CPU and GPU.
With a GPU enabled, it is reported to run about 100 times faster than on a CPU.
Today, we will learn to use `tensorqtl` on a CPU. (Running on a GPU needs only a few more setup steps and uses the same command!)

By the end of the lab you should be able to:

- **Understand the types of files required for tensorQTL**  
- **Interpret the results of tensorQTL**  


# tensorQTL

## Input files for eQTL analysis

* **Phenotype**: a matrix of gene expression levels for each individual (gene x individual)
* **Covariate**: a matrix of covariate values for each individual (covariate x individual)
* **Genotype**: a matrix of genotype dosages (of the effect allele) for each variant and individual (in PLINK format)
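
As a rough illustration of the three matrix layouts listed above, the following sketch builds toy tables with pandas. The sample IDs, gene names, covariate names, variant names, and TSS positions are all made up for illustration; they are not taken from the lab data, and this is not how `tensorqtl` itself loads its inputs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
samples = [f"S{i}" for i in range(1, 6)]  # hypothetical sample IDs

# Phenotype: gene x individual expression matrix
phenotype = pd.DataFrame(
    rng.normal(size=(2, 5)), index=["geneA", "geneB"], columns=samples
)
# Genomic position (TSS) of each gene, needed to define the cis-window
phenotype_pos = pd.DataFrame(
    {"chr": ["chr22", "chr22"], "tss": [17_000_000, 30_000_000]},
    index=phenotype.index,
)

# Covariate: covariate x individual matrix
covariates = pd.DataFrame(
    rng.normal(size=(3, 5)), index=["cov1", "cov2", "cov3"], columns=samples
)

# Genotype: variant x individual dosage matrix (0, 1, or 2 effect alleles)
genotype = pd.DataFrame(
    rng.integers(0, 3, size=(4, 5)),
    index=["var1", "var2", "var3", "var4"],
    columns=samples,
)

# All three tables must describe the same individuals in the same order
assert list(phenotype.columns) == list(covariates.columns) == list(genotype.columns)
print(phenotype.shape, covariates.shape, genotype.shape)
```

The final check matters in practice: a mismatch in sample order between the tables would silently pair the wrong expression values with the wrong genotypes.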

Note that we also need to know the genomic position of each gene (for example, the position of its transcription start site, TSS), since cis-eQTL analysis tests only nearby variants. The example data for this lab is in `data/lab6`:

## Input files description

* **Phenotype**: `GEUVADIS.chr22.expression.bed.gz`
* **Covariate**: `GEUVADIS.445_samples.covariates.txt`
* **Genotype**: `GEUVADIS.hg38.chr22.*`

*Problem 1*

How many covariates are in the example data?

## Compute nominal p-value in cis-eQTL analysis

The nominal p-value is the p-value observed under the linear model $\tilde{Y} \sim X$, where $\tilde{Y}$ is the residual expression level after regressing out covariates and $X$ is the genotype dosage of the variant of interest.
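
The model above can be sketched for a single variant/gene pair with NumPy and SciPy. This is a minimal illustration on simulated data, not the `tensorqtl` implementation. It assumes that the genotype is also residualized on the covariates (which makes the slope identical to the one from the full regression) and that the degrees of freedom are $n - k - 2$ for $k$ covariates. The function names `residualize` and `nominal_association` are hypothetical.

```python
import numpy as np
from scipy import stats


def residualize(v, C):
    """Residuals of v after least-squares regression on an intercept and C."""
    design = np.column_stack([np.ones(len(v)), C])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ coef


def nominal_association(y, x, C):
    """Slope, standard error, and two-sided p-value for y ~ x given covariates C."""
    n, k = C.shape
    dof = n - k - 2  # assumption: intercept + k covariates + genotype
    y_res = residualize(y, C)
    x_res = residualize(x, C)
    sxx = x_res @ x_res
    slope = (x_res @ y_res) / sxx
    resid = y_res - slope * x_res
    slope_se = np.sqrt((resid @ resid) / dof / sxx)
    pval = 2 * stats.t.sf(abs(slope / slope_se), dof)
    return slope, slope_se, pval


# Simulated example: 500 individuals, 3 covariates, true effect size 0.5
rng = np.random.default_rng(1)
n = 500
C = rng.normal(size=(n, 3))
x = rng.binomial(2, 0.3, size=n).astype(float)
y = 0.5 * x + C @ np.array([0.3, -0.2, 0.1]) + rng.normal(size=n)

slope, slope_se, pval = nominal_association(y, x, C)
print(f"slope={slope:.3f} se={slope_se:.3f} p={pval:.2e}")
```

The three returned values correspond in spirit to the `slope`, `slope_se`, and `pval_nominal` columns reported for each variant/gene pair.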

Let's compute the nominal p-value for all cis-eQTL candidates.
Here we define the cis-window as 10 kb on either side of the TSS.
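
To make the cis-window concrete, here is a small sketch of how candidate variants could be selected for one gene. The positions and variant IDs are invented, the helper `variants_in_cis_window` is hypothetical, and treating both window boundaries as inclusive is an assumption rather than a statement about `tensorqtl`'s exact behavior.

```python
import numpy as np
import pandas as pd


def variants_in_cis_window(variant_pos, tss, window=10_000):
    """Boolean mask of variants within +/- window bp of the TSS (inclusive)."""
    pos = np.asarray(variant_pos)
    return (pos >= tss - window) & (pos <= tss + window)


# Toy variant table for one chromosome
variants = pd.DataFrame(
    {
        "variant_id": ["v1", "v2", "v3", "v4"],
        "pos": [989_999, 990_000, 1_005_000, 1_010_001],
    }
)
tss = 1_000_000  # hypothetical TSS position of the gene

in_window = variants_in_cis_window(variants["pos"], tss)
print(variants.loc[in_window, "variant_id"].tolist())  # ['v2', 'v3']
```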

In [1]:
mkdir -p output_lab6

tensorqtl \
  --covariates data/lab6/GEUVADIS.445_samples.covariates.txt \
  --window 10000 \
  --mode cis_nominal \
  data/lab6/GEUVADIS.hg38.chr22 \
  data/lab6/GEUVADIS.chr22.expression.bed.gz \
  output_lab6/cis_nominal

[Jun 25 06:57:17] Running TensorQTL: cis-QTL mapping
  * reading phenotypes (data/lab6/GEUVADIS.chr22.expression.bed.gz)
  * cis-window detected as position ± 10,000
  * reading covariates (data/lab6/GEUVADIS.445_samples.covariates.txt)
  * loading genotypes
Mapping files: 100%|██████████████████████████████| 3/3 [00:00<00:00,  9.36it/s]
cis-QTL mapping: nominal associations for all variant-phenotype pairs
  * 445 samples
  * 555 phenotypes
  * 26 covariates
  * 182851 variants
  * cis-window: ±10,000
  * checking phenotypes: 555/555
  * Computing associations
    Mapping chromosome chr22
    processing phenotype 555/555
    time elapsed: 0.20 min
    * writing output
done.
[Jun 25 06:57:31] Finished mapping

*Problem 2*

From the log messages of the `tensorqtl` run, how many genes were analyzed?

The output contains every variant/gene pair tested, regardless of significance.
In practice, this can be a huge amount of data.

The output file is in `parquet` format, a binary format that gives better I/O performance than a human-readable text file.
The short Python script below converts the `parquet` file to a text table in CSV format.

In [4]:
import pandas as pd
df = pd.read_parquet("output_lab6/cis_nominal.cis_qtl_pairs.chr22.parquet")
print(df)
df.to_csv("output_lab6/cis_nominal.cis_qtl_pairs.chr22.csv")


             phenotype_id              variant_id  ...     slope  slope_se
0      ENSG00000206195.10  chr22_15775793_G_A_b38  ... -0.073546  0.091817
1      ENSG00000206195.10  chr22_15776728_G_C_b38  ...  0.038667  0.238783
2      ENSG00000206195.10  chr22_15776825_C_T_b38  ... -0.093063  0.125423
3      ENSG00000206195.10  chr22_15776849_G_T_b38  ...  0.013793  0.057066
4      ENSG00000206195.10  chr22_15777039_T_C_b38  ...  0.104574  0.146789
...                   ...                     ...  ...       ...       ...
57451  ENSG00000079974.17  chr22_50791885_T_G_b38  ...  0.025340  0.103589
57452  ENSG00000079974.17  chr22_50791894_C_G_b38  ...  0.025340  0.103589
57453  ENSG00000079974.17  chr22_50792314_A_C_b38  ...  0.133574  0.246519
57454  ENSG00000079974.17  chr22_50792792_A_G_b38  ...  0.427086  0.099061
57455  ENSG00000079974.17  chr22_50793326_C_T_b38  ... -0.022783  0.176548

[57456 rows x 9 columns]


*Problem 3*

How many variant/gene pairs were tested and reported?

*Problem 4*

Which gene has the strongest association?


## Perform cis-eQTL analysis with adaptive permutation

To identify eGenes (genes whose expression is significantly regulated by genetic variation), we need to perform permutations to obtain a gene-level p-value, as mentioned above.
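
The idea behind a permutation-based gene-level p-value can be sketched as follows. This is a plain, fixed-size permutation test on simulated data, shown only to illustrate the principle; it does not implement the adaptive scheme or the Beta approximation that appears in the log output. It assumes the expression vector and genotypes have already been adjusted for covariates and that no variant is monomorphic. The names `abs_corr` and `gene_level_pvalue` are hypothetical.

```python
import numpy as np


def abs_corr(y, X):
    """|Pearson r| between y (n,) and each column of X (n, m)."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    return np.abs(yc @ Xc) / (np.linalg.norm(yc) * np.linalg.norm(Xc, axis=0))


def gene_level_pvalue(y, X, n_perm=1000, seed=0):
    """Empirical p-value of the strongest association among all cis variants."""
    rng = np.random.default_rng(seed)
    observed = abs_corr(y, X).max()
    # Permuting y breaks the genotype-expression link but keeps the LD among variants
    null_max = np.array(
        [abs_corr(rng.permutation(y), X).max() for _ in range(n_perm)]
    )
    return (1 + np.sum(null_max >= observed)) / (1 + n_perm)


# Simulated gene with 50 cis variants in 200 individuals
rng = np.random.default_rng(2)
X = rng.binomial(2, 0.3, size=(200, 50)).astype(float)
y_null = rng.normal(size=200)                # no eQTL
y_alt = 1.0 * X[:, 0] + rng.normal(size=200)  # first variant is an eQTL

p_null = gene_level_pvalue(y_null, X)
p_alt = gene_level_pvalue(y_alt, X)
print(f"null gene: p={p_null:.3f}, eQTL gene: p={p_alt:.4f}")
```

Because each permutation takes the maximum over all variants of the gene, the resulting p-value accounts for both the number of variants tested and their LD structure, which is what makes gene-level p-values comparable across genes.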
Here is how it can be done using `tensorqtl`.

In [5]:
tensorqtl \
  --covariates data/lab6/GEUVADIS.445_samples.covariates.txt \
  --window 10000 \
  --mode cis \
  data/lab6/GEUVADIS.hg38.chr22 \
  data/lab6/GEUVADIS.chr22.expression.bed.gz \
  output_lab6/cis


[Jun 25 07:17:31] Running TensorQTL: cis-QTL mapping
  * reading phenotypes (data/lab6/GEUVADIS.chr22.expression.bed.gz)
  * cis-window detected as position ± 10,000
  * reading covariates (data/lab6/GEUVADIS.445_samples.covariates.txt)
  * loading genotypes
Mapping files: 100%|██████████████████████████████| 3/3 [00:00<00:00, 10.66it/s]
cis-QTL mapping: empirical p-values for phenotypes
  * 445 samples
  * 555 phenotypes
  * 26 covariates
  * 182851 variants
  * cis-window: ±10,000
  * checking phenotypes: 555/555
  * computing permutations
    processing phenotype 555/555
  Time elapsed: 2.11 min
done.
  * writing output
Computing q-values
  * Number of phenotypes tested: 555
  * Correlation between Beta-approximated and empirical p-values: 1.0000
  * Proportion of significant phenotypes (1-pi0): 0.87
  * QTL phenotypes @ FDR 


The output contains gene-level statistics obtained from adaptive permutation, with one row per gene (in `txt.gz` format).
To obtain eGenes at FDR 10%, we can collect all genes with `qval` smaller than 0.1.
To obtain the cis-eQTLs for these eGenes, we can collect all variant/gene pairs whose `pval_nominal` (reported in the `cis_nominal` run) is smaller than the gene's `pval_nominal_threshold`.
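
The two filtering steps above could look like this in pandas. The tables here are tiny made-up stand-ins for the gene-level output and the `cis_nominal` pairs; only the column names `phenotype_id`, `variant_id`, `qval`, `pval_nominal`, and `pval_nominal_threshold` come from the lab, and all values are invented.

```python
import pandas as pd

# Toy gene-level table (one row per gene)
genes = pd.DataFrame(
    {
        "phenotype_id": ["geneA", "geneB", "geneC"],
        "qval": [0.01, 0.30, 0.08],
        "pval_nominal_threshold": [1e-4, 2e-4, 5e-4],
    }
)

# Toy variant/gene pair table from the nominal pass
pairs = pd.DataFrame(
    {
        "phenotype_id": ["geneA", "geneA", "geneB", "geneC", "geneC"],
        "variant_id": ["v1", "v2", "v3", "v4", "v5"],
        "pval_nominal": [1e-6, 0.2, 1e-8, 4e-4, 6e-4],
    }
)

# Step 1: eGenes at FDR 10%
egenes = genes[genes["qval"] < 0.1]

# Step 2: keep pairs of eGenes that pass the gene-specific nominal threshold
merged = pairs.merge(
    egenes[["phenotype_id", "pval_nominal_threshold"]],
    on="phenotype_id",
    how="inner",
)
eqtls = merged[merged["pval_nominal"] < merged["pval_nominal_threshold"]]
print(eqtls[["phenotype_id", "variant_id", "pval_nominal"]])
```

Note that the pair for `geneB` is dropped even though its nominal p-value is very small, because `geneB` is not an eGene at the chosen FDR.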

*Problem 5*

Which gene has the most significant q-value?

Note: your own cis output file might lack the q-value column (column 18, called `qval`) and the `pval_nominal_threshold` information. If this is the case, simply use the pre-computed `pre_run.cis_qtl.txt` in the data folder.

*Problem 6*

Select a gene with q-value < 0.05 and visualize its cis-eQTL results by plotting $-\log(p)$ on the y-axis and distance to TSS on the x-axis. Add a horizontal line indicating the gene's `pval_nominal_threshold`.


**References**:

Ongen, Halit, Alfonso Buil, Andrew Anand Brown, Emmanouil T Dermitzakis, and Olivier Delaneau. 2016. “Fast and Efficient QTL Mapper for Thousands of Molecular Phenotypes.” Bioinformatics 32 (10). Oxford University Press: 1479–85.

Shabalin, Andrey A. 2012. “Matrix eQTL: Ultra Fast eQTL Analysis via Large Matrix Operations.” Bioinformatics 28 (10). Oxford University Press: 1353–8.

Taylor-Weiner, Amaro, François Aguet, Nicholas J Haradhvala, Sager Gosai, Shankara Anand, Jaegil Kim, Kristin Ardlie, Eliezer M Van Allen, and Gad Getz. 2019. “Scaling Computational Genomics to Millions of Individuals with GPUs.” Genome Biology 20 (1). BioMed Central: 1–5.