# Quantile QTL and Quantile TWAS (Transcriptome-Wide Association Study) Univariate Analysis
This notebook introduces [quantile regression of QTL](https://pmc.ncbi.nlm.nih.gov/articles/PMC11291931/) and [quantile TWAS](https://arxiv.org/pdf/2207.12081) as tools for exploring quantile-specific associations.


## Introduction & Background
Traditional QTL and TWAS analyses predominantly focus on identifying genetic variants that influence the mean levels of gene expression or trait phenotypes. While this has led to the discovery of numerous associations, these methods overlook potential genetic effects that vary across different parts of the expression distribution. For example, some genetic variants might disproportionately affect individuals with very high or very low expression levels, which cannot be captured by mean-based approaches. This limitation motivates the need for methods that explore the entire distribution of gene expression, capturing these heterogeneous effects.

Quantile regression, a powerful statistical method that models different quantiles of the response variable, provides a solution. By extending standard TWAS and QTL approaches with quantile regression, we aim to uncover genetic associations that manifest at different expression levels, providing a more nuanced understanding of the relationship between genetic variation and complex traits. This approach is particularly useful in cases where genetic effects are not uniform across the expression distribution, potentially revealing novel associations missed by traditional mean-based models.

## Methods
### Quantile QTL: Identify genetic variants that influence gene expression at different quantile levels.
- Quantile QTL analysis assesses the effects of genetic variants across different quantiles of gene expression. Instead of focusing only on the mean, as linear regression (LR) does, we screen for significant SNPs at multiple quantile levels, which shows how genetic effects differ across the gene expression distribution.
- Given a response variable *Y*, the quantile regression model at the *τ-th* quantile for the *j-th* variant *X_j* and covariates *C* is:

$$
Q_Y(\tau | X_j, C) = X_j \beta_j(\tau) + C \alpha(\tau)
$$

**Where**:
- *Y* represents the phenotype
- *X_j* represents the j-th genetic variant
- *C* represents the covariates
- *β_j(τ)* represents the coefficient for the j-th variant at the τ-th quantile
- *α(τ)* represents the coefficients for the covariates at the τ-th quantile
- *τ (tau)* represents the quantile level, strictly between 0 and 1.
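
As a minimal sketch of how a model of this form is estimated (standard quantile regression theory, not code taken from this pipeline), the coefficients at level τ minimize the check (pinball) loss ρ_τ(u) = u(τ − 1{u < 0}) summed over samples. The toy example below uses hypothetical function names and made-up values, and shows that minimizing this loss over a constant recovers a sample quantile. In the regression setting the constant is replaced by X_j β_j(τ) + C α(τ).

```python
def check_loss(u, tau):
    """Check (pinball) loss for a single residual u at quantile level tau."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(y, fitted, tau):
    """Sum of check losses for observations y against fitted values."""
    return sum(check_loss(yi - fi, tau) for yi, fi in zip(y, fitted))


def best_constant(y, tau):
    """Brute-force minimizer over the observed values (toy illustration only)."""
    candidates = sorted(set(y))
    return min(candidates, key=lambda c: total_loss(y, [c] * len(y), tau))


# Made-up expression values, not data from this study.
y = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = best_constant(y, 0.5)   # minimizer at tau = 0.5
upper_fit = best_constant(y, 0.9)    # minimizer at tau = 0.9
print(median_fit, upper_fit)
```

The outlier at 100 pulls the upper quantile but leaves the median untouched, which is the kind of distribution-specific behavior the quantile model is designed to expose.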


### Quantile TWAS: Integrate quantile-specific weights to better predict trait associations in specific expression regimes. 
The TWAS association analysis utilizes the integrated quantile TWAS weights to identify SNPs with quantile-specific effects on gene expression, potentially revealing disease associations that vary across different expression levels.

![image.png](image/quantile_twas.png)  
Wang, T., Ionita-Laza, I., & Wei, Y. (2024). A unified quantile framework for nonlinear heterogeneous transcriptome-wide associations. *Annals of Applied Statistics*.


## Data
- `Genotype Data`: SNP information from the ROSMAP cohort study
- `Covariates data`: Includes factors such as APOE, sex, genotype PCs, and hidden factors.
- `LD Reference`: Population-specific LD matrices from 20240409_ADSP_LD_matrix
- `GWAS Summary Statistics`: Alzheimer's disease GWAS (Bellenguez 2022)
- `Phenotype Data`: ROSMAP eQTL/pQTL data

The analysis includes 11 datasets spanning different tissue types and QTL contexts. The table below provides an overview of the datasets, their context, and type, along with sample sizes:

| Tissue Type                                  | Context                  | QTL Type      | Sample Size (Approx.) |
|----------------------------------------------|--------------------------|---------------|-----------------------|
| Astrocytes (Ast)                             | Ast_mega_eQTL            | eQTL          | ~780                  |
| Microglia (Mic)                              | Mic_mega_eQTL            | eQTL          | ~780                  |
| Oligodendrocytes (Oli)                       | Oli_mega_eQTL            | eQTL          | ~780                  |
| Excitatory Neurons (Exc)                     | Exc_mega_eQTL            | eQTL          | ~780                  |
| Inhibitory Neurons (Inh)                     | Inh_mega_eQTL            | eQTL          | ~780                  |
| Oligodendrocyte Precursor Cells (OPC)        | OPC_mega_eQTL            | eQTL          | ~780                  |
| DLPFC (DeJager)                              | DLPFC_DeJager_eQTL       | eQTL          | ~1140                 |
| PCC (DeJager)                                | PCC_DeJager_eQTL         | eQTL          | ~570                  |
| AC (DeJager)                                 | AC_DeJager_eQTL          | eQTL          | ~740                  |
| Monocytes                                    | monocyte_ROSMAP_eQTL     | eQTL          | ~610                  |
| DLPFC (Bennett)                              | DLPFC_Bennett_pQTL       | pQTL          | ~400                  |


### Additional Information:

- **Number of Genes**: 157 genes across these datasets were analyzed in the first iteration.


## Pipeline Overview
### 1. Quantile QTL
- *tau_list* = [0.05, 0.10, 0.15, ..., 0.90, 0.95] for QTL analysis

**Steps**: 
- `QR Screening`: Apply the QRank test using a list of quantile tau values. Screen SNPs by fitting QR models at each tau and retaining significant SNPs with a `combined q-value (using Cauchy combination method) < 0.05`.
- `LD Clumping and Pruning`: Perform LD clumping and pruning on the significant SNPs from the QR screening step. This process reduces redundancy among SNPs by selecting independent signals.
- `Estimate Beta Coefficients`: For the SNPs that passed LD clumping, calculate the beta coefficients using `QR marginal model` across different tau values and compute beta heterogeneity to assess how the effects of each SNP vary across different quantiles of gene expression. 
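
The Cauchy combination used in the screening step can be sketched as follows. This is the standard equal-weight Cauchy combination statistic, written with a hypothetical function name; the pipeline's own implementation, including any weighting, numerical safeguards for extremely small p-values, and the subsequent q-value computation, is not shown here and may differ.

```python
import math


def cauchy_combine(pvalues, weights=None):
    """Combine p-values with the Cauchy combination test.

    Equal weights are assumed when none are given (an assumption of this sketch).
    """
    if weights is None:
        weights = [1.0 / len(pvalues)] * len(pvalues)
    # Map each p-value to a standard Cauchy variate and take the weighted sum.
    stat = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    # Upper-tail probability of the standard Cauchy distribution.
    return 0.5 - math.atan(stat) / math.pi


# Made-up quantile-specific p-values for one SNP.
p_by_tau = [0.20, 0.04, 0.001, 0.03, 0.50]
p_combined = cauchy_combine(p_by_tau)
print(p_combined)
```

The combined p-value is dominated by the smallest quantile-specific p-values, so a SNP that is significant at only a few quantile levels can still pass the screen.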

### 2. Quantile TWAS  
- *tau_list* = [0.01, 0.02, 0.03, ..., 0.98, 0.99] for TWAS analysis
#### 2.1 Quantile TWAS Weights Calculation    
For significant QTLs (combined q-value < 0.05), the steps are:
- `LD and MAF Filtering`: Filter the significant SNPs based on LD reference panels and minor allele frequency (MAF) thresholds to ensure robust and meaningful results, excluding rare variants and redundant SNPs.
- `Correlation Filtering`: Further refine the SNP set by removing highly correlated SNPs (e.g., correlation threshold of 0.8).
- `Fit QR for All Quantiles`: Fit the full QR model for the filtered SNPs and covariates across 99 tau values ranging from 0.01 to 0.99.
- **Output**: TWAS beta weights for each SNP at each tau with Pseudo R² values for each tau, providing insights into the model's fit.
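
One way to picture the correlation filtering step is a greedy pass that keeps a SNP only if its absolute Pearson correlation with every previously kept SNP is at most the threshold. This is an illustrative sketch with hypothetical names and made-up dosages; the order in which the pipeline visits SNPs and how it breaks ties are not specified in this notebook.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length dosage vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # Assumes neither SNP is monomorphic (MAF filtering happens earlier).
    return sxy / math.sqrt(sxx * syy)


def correlation_filter(genotypes, threshold=0.8):
    """Greedy filter: keep a SNP only if |r| <= threshold with all kept SNPs."""
    kept = []
    for snp_id, dosages in genotypes.items():
        if all(abs(pearson(dosages, genotypes[k])) <= threshold for k in kept):
            kept.append(snp_id)
    return kept


# Made-up dosages for six samples.
genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 1, 1],  # highly correlated with snp_a
    "snp_c": [2, 0, 1, 1, 0, 2],  # uncorrelated with snp_a
}
kept = correlation_filter(genotypes)
print(kept)
```

Removing near-duplicate SNPs before the joint fit keeps the full QR model from splitting one signal across several collinear predictors.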

#### 2.2 Quantile TWAS Association Analysis
- `Clustering Quantile TWAS Weights`: Cluster the TWAS weights across quantiles for each SNP. This can be done using either:
    - `Fixed Regions`: Predefined regions (e.g., A1, A2, A3) based on gene structure or expression levels.
    - `Dynamic Regions`: Data-driven clustering using methods like hierarchical clustering with modularity optimization to define groups (C1, C2, ... Cn).
- `Integration of Weights`: Integrate the clustered TWAS weights across quantiles using trapezoidal integration to obtain aggregate weights for each region.
- `GWAS Integration and TWAS P-value Calculation`: Apply the aggregated weights to GWAS summary statistics to calculate TWAS z-scores and p-values for each quantile region.
- **Output**: TWAS p-values that reflect SNP effects on gene expression across different quantiles, capturing quantile-specific disease associations.
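
The integration and association steps can be sketched as below. The first function applies the trapezoidal rule to each SNP's weight curve over a quantile region; the second uses the usual summary-statistic TWAS form, z = wᵀz_GWAS / sqrt(wᵀRw), where R is the LD matrix. Function names, inputs, and the absence of any normalization by region width are assumptions of this sketch, not details confirmed by the pipeline.

```python
import math


def integrate_weights(taus, weights_by_tau, lo, hi):
    """Trapezoidal integral over tau in [lo, hi] of each SNP's weight curve.

    weights_by_tau[i][j] is the weight of SNP j at quantile level taus[i].
    """
    idx = [i for i, t in enumerate(taus) if lo <= t <= hi]
    n_snps = len(weights_by_tau[0])
    out = [0.0] * n_snps
    for a, b in zip(idx[:-1], idx[1:]):
        width = taus[b] - taus[a]
        for j in range(n_snps):
            out[j] += 0.5 * width * (weights_by_tau[a][j] + weights_by_tau[b][j])
    return out


def twas_z(weights, gwas_z, ld):
    """Summary-statistic TWAS z-score: w'z / sqrt(w' R w)."""
    n = len(weights)
    num = sum(w * z for w, z in zip(weights, gwas_z))
    var = sum(weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n))
    return num / math.sqrt(var)


# Made-up inputs: two SNPs with constant weights across 99 quantile levels.
taus = [i / 100 for i in range(1, 100)]
weights_by_tau = [[1.0, 2.0] for _ in taus]
a1_weights = integrate_weights(taus, weights_by_tau, 0.01, 0.33)  # fixed region A1
z_a1 = twas_z(a1_weights, [2.0, 2.0], [[1.0, 0.5], [0.5, 1.0]])
print(a1_weights, z_a1)
```

Because the z-score is scale-invariant in the weights, multiplying all integrated weights in a region by a constant leaves the TWAS statistic unchanged.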

## Input

1. A list of regions to be analyzed (optional); the last column of this file should be region name.
2. Either a list of per chromosome genotype files, or one file for genotype data of the entire genome. Genotype data has to be in PLINK `bed` format. 
3. Vector of lists of phenotype files per region to be analyzed, in UCSC `bed.gz` with index in `bed.gz.tbi` formats.
4. Vector of covariate files corresponding to the lists above.
5. Customized association windows file for variants (cis or trans). If it is not provided, a fixed-size window around the region (a cis-window) will be used.
6. Optionally a vector of names of the phenotypic conditions in the form of `cond1 cond2 cond3` separated with whitespace.

Inputs 2 and 3 should be outputs from the `genotype_per_region` and `annotate_coord` modules in previous preprocessing steps. Input 4 should be the output of the `covariate_preprocessing` pipeline, which contains genotype PCs, phenotypic hidden confounders, and fixed covariates.

### Example genotype data

```
#chr        path
chr21 /mnt/mfs/statgen/xqtl_workflow_testing/protocol_example.genotype.chr21.bed
chr22 /mnt/mfs/statgen/xqtl_workflow_testing/protocol_example.genotype.chr22.bed
```

Alternatively, simply use `protocol_example.genotype.chr21_22.bed` if all chromosomes are in the same file.

### Example phenotype list

```
#chr    start   end ID  path
chr12   752578  752579  ENSG00000060237  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   990508  990509  ENSG00000082805  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
```

## Output
### Quantile QTL Output:

For each gene, several summary statistics files are generated, including nominal test statistics for the significant quantile QTLs.

#### Data Structure
The columns of the quantile QTL nominal association result are as follows:

- chr : Variant chromosome.
- pos : Variant chromosomal position (basepairs).
- ref : Variant reference allele (A, C, T, or G).
- alt : Variant alternate allele.
- phenotype_id: Molecular trait identifier (gene).
- variant_id: ID of the variant (rsid or chr:position:ref:alt).
- p_qr: the composite QR p-value, integrated across multiple quantile levels using the Cauchy combination method.
- p_qr_0.05 to p_qr_0.95: quantile-specific QR p-values for the quantile levels 0.05, 0.1, ..., 0.95.   
- q_qr: the integrated QR q-value across multiple quantile levels. 
- q_qr_0.05 to q_qr_0.95: quantile-specific QR q-values for the quantile levels 0.05, 0.1, ..., 0.95. 
- coef_qr_0.05 to coef_qr_0.95: quantile-specific QR coefficients of the significant results for the quantile levels 0.05, 0.1, ..., 0.95.
- beta_heterogeneity: heterogeneity of beta coefficients across multiple quantiles for each variant_id, computed as log(sd(beta) / abs(mean(beta))).
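
The `beta_heterogeneity` formula above can be illustrated with a short sketch. The function name and coefficient values are made up, and the use of the sample standard deviation (n − 1 denominator, as in R's `sd()`) is an assumption.

```python
import math
import statistics


def beta_heterogeneity(betas):
    """log(sd(beta) / |mean(beta)|) over quantile-specific coefficients."""
    sd = statistics.stdev(betas)   # sample SD (n - 1); an assumption of this sketch
    mean = statistics.mean(betas)
    return math.log(sd / abs(mean))


# Made-up coefficient profiles across five quantile levels.
flat = [0.50, 0.52, 0.49, 0.51, 0.50]     # stable effect across quantiles
rising = [0.10, 0.30, 0.50, 0.70, 0.90]   # effect grows with the quantile level
print(beta_heterogeneity(flat), beta_heterogeneity(rising))
```

A more negative value indicates coefficients that are stable relative to their mean, while values closer to or above zero indicate effects that vary substantially across quantiles.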

#### Coefficient Heterogeneity Plot
![image.png](image/quantile_qtl_beta_heter_flat_picalm.png)  
**Fig 1.** Coefficient estimates for SNP chr11:86157598:T:C across quantiles from a quantile eQTL analysis of the ***PICALM*** gene in the Mic_mega_eQTL dataset.

![image.png](image/quantile_qtl_beta_heter_upward.png)  
**Fig 2.** Coefficient estimates for SNP chr2:130047952:A:C across quantiles from a quantile eQTL analysis of the ***BIN1*** gene in the PCC_DeJager_eQTL dataset.

- The x-axis represents quantile levels (ranging from 0.05 to 0.95), and the y-axis shows the corresponding coefficient values for the SNP at each quantile, indicating its effect on gene expression across the distribution. 
- In **Fig 1**, the coefficients remain relatively stable across quantiles, indicating a low degree of heterogeneity in the SNP's effects on gene expression. 
- However, in **Fig 2**, we observe an upward trend in the coefficients. This highlights a possible heterogeneous effect where the SNP exerts stronger influence in specific parts of the gene expression distribution. This contrast between the two figures illustrates the varying nature of SNP effects, potentially driven by complex genetic and environmental interactions across expression levels.



Additionally, the matrix of quantile TWAS weights and pseudo R² values is calculated and saved. The pseudo R² follows Koenker and Machado's method, as described in [Koenker and Machado, 1999: Inference in Quantile Regression](https://www.maths.usyd.edu.au/u/jchan/GLM/Koenker&Machado1999InferenceQuantileReg.pdf).
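
As a rough sketch of the Koenker and Machado pseudo R², the measure at level τ is 1 − V̂(τ)/Ṽ(τ), where V̂ and Ṽ are the summed check losses of the full and restricted models. Which restricted model the pipeline uses (for example intercept-only or covariates-only) is not stated here, so the example below assumes an intercept-only fit; names and values are hypothetical.

```python
def check_loss_sum(residuals, tau):
    """Summed check (pinball) loss of a set of residuals at level tau."""
    return sum(u * (tau - (1.0 if u < 0 else 0.0)) for u in residuals)


def pseudo_r2(y, fitted_full, fitted_restricted, tau):
    """Pseudo R2 at level tau: 1 - V_full / V_restricted."""
    v_full = check_loss_sum([yi - fi for yi, fi in zip(y, fitted_full)], tau)
    v_restricted = check_loss_sum(
        [yi - fi for yi, fi in zip(y, fitted_restricted)], tau
    )
    return 1.0 - v_full / v_restricted


# Made-up example at the median; the restricted model predicts the sample median.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted_full = [1.5, 2.0, 3.0, 4.0, 4.5]
fitted_restricted = [3.0] * 5
r2_median = pseudo_r2(y, fitted_full, fitted_restricted, 0.5)
print(r2_median)
```

Unlike the least-squares R², this quantity is local to a single quantile level, which is why the pipeline reports one value per tau.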

### Quantile TWAS Output
Example: Results for `PICALM` Gene in `Mic_mega_quantile_eQTL`

| chr  | molecular_id    | TSS       | start     | end       | context                   |
|------|-----------------|-----------|-----------|-----------|---------------------------|
| 11   | ENSG00000073921  | 86069881  | 84957175  | 87360000  | Mic_mega_quantile_eQTL     |
| 11   | ENSG00000073921  | 86069881  | 84957175  | 87360000  | Mic_mega_quantile_eQTL     |
| 11   | ENSG00000073921  | 86069881  | 84957175  | 87360000  | Mic_mega_quantile_eQTL     |
| 11   | ENSG00000073921  | 86069881  | 84957175  | 87360000  | Mic_mega_quantile_eQTL     |
| 11   | ENSG00000073921  | 86069881  | 84957175  | 87360000  | Mic_mega_quantile_eQTL     |

#### Additional Information:

| gwas_study       | method | quantile_start | quantile_end | pseudo_R2_avg | twas_z     | twas_pval     | type  | block                        |
|------------------|--------|----------------|--------------|---------------|------------|---------------|-------|------------------------------|
| Bellenguez_2022  | A1     | 0.01           | 0.33         | 0.6754932     | -10.094891 | 5.819559e-24  | eQTL  | chr11_84267999_86714492      |
| Bellenguez_2022  | A2     | 0.34           | 0.66         | 0.6751989     | -10.440459 | 1.620251e-25  | eQTL  | chr11_84267999_86714492      |
| Bellenguez_2022  | A3     | 0.67           | 0.99         | 0.6749045     | -8.872584  | 7.146782e-19  | eQTL  | chr11_84267999_86714492      |
| Bellenguez_2022  | C1     | 0.01           | 0.43         | 0.6754486     | -10.513485 | 7.487443e-26  | eQTL  | chr11_84267999_86714492      |
| Bellenguez_2022  | C2     | 0.44           | 0.99         | 0.6750071     | -9.609285  | 7.305415e-22  | eQTL  | chr11_84267999_86714492      |

---

### Description of Results:
- **Gene**: PICALM (ENSG00000073921)  
- **GWAS Study**: Bellenguez_2022  
- **Quantile-based TWAS Analysis**: The gene was analyzed using multiple quantile windows, with weights calculated for each region (A1, A2, A3 for fixed regions and C1, C2 for dynamic regions).
- **Quantile Ranges**: The gene expression was split into different quantiles from 0.01 to 0.99, and TWAS weights were calculated for each of these segments. For example:
    - **Fixed Region** A1(tau:0.01-0.33), A2(tau:0.34-0.66), A3(tau:0.67-0.99)  
    - **Dynamic Region** C1(tau:0.01-0.43), C2(tau:0.44-0.99)  
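- **Pseudo R² Values**: The pseudo R² (`pseudo_R2_avg`) summarizes the goodness of fit of the quantile regression model in each quantile window. It is based on the quantile check loss, so it is analogous to, but not identical to, the proportion of variance explained in least-squares regression.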
- **TWAS Z-scores and P-values**: These columns provide the Z-scores and p-values associated with the quantile-specific TWAS analysis for this gene. For example, in the A1 region, the Z-score is -10.09, with a p-value of 5.82e-24, which is significant.
- **Blocks**: The analysis was conducted on the genomic block **chr11_84267999_86714492**.
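
The reported p-values are consistent with a two-sided normal test on the TWAS z-score, which can be checked with the sketch below. The function name is hypothetical and the pipeline's own computation may be implemented differently; `math.erfc` is used because it stays accurate in the far tail.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# z-score of the A1 region for PICALM, taken from the table above.
p_a1 = two_sided_p(-10.094891)
print(p_a1)
```

The result agrees with the tabulated A1 p-value of 5.819559e-24 to within rounding.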


## Advantages of Quantile QTL and Quantile TWAS
### Advantages of Quantile QTL:
- **Detects Non-linear and Conditional Effects**:  
Quantile QTL captures the effects of SNPs across different levels of gene expression, allowing the identification of variants that influence gene expression in specific regions of the distribution rather than just the mean. This enables the detection of non-linear and context-specific genetic effects.

- **Heterogeneity Assessment**:  
Quantile QTL allows for a detailed heterogeneity analysis by examining how SNP effects change across the distribution. This helps identify SNPs that may have varying impacts depending on the expression level, providing a deeper understanding of genetic influence on gene regulation.

- **Explores Gene-Environment Interactions**:  
By considering SNP effects at different quantiles, the method may also uncover gene-environment interactions where the effect of a variant depends on the gene expression context (e.g., under certain environmental conditions or stresses).

### Advantages of Quantile TWAS:
- **Captures Distribution-wide Genetic Effects**:  
Quantile TWAS investigates how SNPs affect gene expression across the entire expression distribution, not just the mean. This allows for comprehensive analysis of genetic effects on gene expression, identifying associations that may vary across different expression levels.

- **Integrates Flexibly Across Quantiles**:  
Quantile TWAS can flexibly integrate SNP effects across predefined (fixed) or data-driven (dynamic) quantile regions, allowing for a more nuanced understanding of gene regulation and its relationship to complex traits and diseases.

- **Biological Insights into Disease Mechanisms**:   
By focusing on quantile-specific effects, Quantile TWAS provides valuable biological insights into how genetic variation affects disease processes. This can reveal mechanisms that are masked in traditional TWAS, particularly in diseases where the regulation of gene expression is critical at specific levels.