# Replication study
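This notebook validates QTL calls (for example eQTLs) by measuring how well the significant pairs from one analysis replicate in a second analysis.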

In [None]:
sos run Replication-study.ipynb replication \
        --cwd ./ \
        --analysis_1_path analysis_1_path.txt \
        --analysis_2_path analysis_2_path.txt \
        --ID_keys  ID_keys.txt \
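        --container containers/bioinfo.sif
# If needed, add cluster submission parameters such as --walltime, --mem or --numThreads to the command above.
# Note: --name has no default in the [global] section, so it must also be supplied.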

### Methods overview
* Goal: Because normalization and covariate extraction procedures differ between pipelines, QTL calls made from the same data set can still be inconsistent. We aim to quantify the difference between the calling results of two pipelines.

* Settings and notations:

  * Outcome: $Y_{ir}, i=1,2,\cdots,n; r = 1,2,\cdots,R$
  * SNP: $X_{ij}, i=1,2,\cdots,n; j = 1,2,\cdots,J$
  * P-value for the pair of outcome $r$ and SNP $j$ in test $k$: $p^{(k)}_{jr}$
  * Test result (significance indicator) for the pair of outcome $r$ and SNP $j$ in test $k$: $I^{(k)}_{jr}$
  * To keep the results comparable, the test results being compared should come from the same multiple testing procedure. For example, with BH adjustment: $I_{jr}^{(k)} = I\{p_{jr}^{(k)}\leq \text{cutoff}_{jr}\}$, where $\text{cutoff}_{jr}$ is the BH-adjusted cutoff for each pair of features

* Contingency table

  |                                            | Test 2 is significant for pair $(j,r)$ | Test 2 is non-significant for pair $(j,r)$ | Total |
  | ------------------------------------------ | -------------------------------------- | ------------------------------------------ | ----- |
  | Test 1 is significant for pair $(j,r)$     | TP                                     | FN                                         | m0    |
  | Test 1 is non-significant for pair $(j,r)$ | FP                                     | TN                                         | m1    |
  | Total                                      | S                                      | N                                          | m     |
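
The BH-based indicators and the contingency table above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the helper names `bh_reject` and `contingency`, the level `alpha = 0.05`, and the toy p-values are all hypothetical, and BH is applied separately to each analysis.

```python
def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up: True where the hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    # Reject every hypothesis ranked at or below k_max.
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


def contingency(sig1, sig2):
    """Cross-tabulate two indicator lists using the labels of the table above."""
    tp = sum(1 for a, b in zip(sig1, sig2) if a and b)
    fn = sum(1 for a, b in zip(sig1, sig2) if a and not b)
    fp = sum(1 for a, b in zip(sig1, sig2) if not a and b)
    tn = sum(1 for a, b in zip(sig1, sig2) if not a and not b)
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn,
            "m0": tp + fn, "m1": fp + tn,
            "S": tp + fp, "N": fn + tn, "m": tp + fn + fp + tn}


# Toy p-values for five (j, r) pairs in each analysis (hypothetical data).
p_test1 = [0.001, 0.008, 0.039, 0.041, 0.6]
p_test2 = [0.002, 0.5, 0.001, 0.7, 0.9]
table = contingency(bh_reject(p_test1), bh_reject(p_test2))
print(table)
```

Both lists must be indexed by the same pairs $(j,r)$ in the same order, so any ID matching has to happen before this step.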


### Data input
* Newly generated sumstats: `/mnt/vast/hpc/csg/ROSMAP_methy_QTL_beta/association_scan/methyl_QTL_2/TensorQTL/emprical.cis_sumstats.txt`
* Original sumstats: `/mnt/vast/hpc/csg/jt3386/original_data/mQTLs.txt`

In [None]:
[global]
# Current working directory, where the output file is written
parameter: cwd = path
# File with the q-values from the first analysis, used to define the significant pairs in analysis 1. It should contain the molecular feature name, the SNP ID, and the q-value of each pair.
parameter: analysis_1_path = path
# File with the p-values from the second analysis, used to estimate the pi1 statistic in analysis 2. It should contain the molecular feature name, the SNP ID, and the p-value of each pair.
parameter: analysis_2_path = path
# File used to reconcile ID differences so that pairs from the two analyses can be matched
parameter: ID_keys = path
# Name of the output file (without extension), also used as the label in the result table
parameter: name = path

# The following cluster submission parameters may be useful
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "2G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""

In [1]:
[replication]
output: f'{cwd}/{name}.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
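    # Set up
    library(tidyverse)
    library(qvalue)
    # Both inputs are read as tab-delimited tables. The code below expects a `pair` column in each,
    # a `qvalue` column in analysis 1, and a `pValue` column in analysis 2.
    analysis1_data = read.delim("$[analysis_1_path]")
    analysis2_data = read.delim("$[analysis_2_path]")
    # FIXME: the pair IDs are assumed to be already matched between the two analyses (ID_keys is not used yet).
    # The ID matching should be split out into its own, more general step.
    # Pairs that are significant in analysis 1 (q-value <= 0.05)
    sig1 = analysis1_data %>% filter(qvalue <= 0.05)
    # P-values from analysis 2 for those pairs
    p2_sig1 = analysis2_data %>% filter(pair %in% sig1$pair)
    # Estimate pi0 (the proportion of true nulls) among these p-values; pi1 = 1 - pi0
    pi0 = pi0est(p2_sig1$pValue)
    pi1 = 1-pi0$pi0
    # Output
    output_tib = tibble(name = "$[name]",
                        pi1 = pi1)
    output_tib %>% readr::write_delim("$[cwd]/$[name].txt", delim = "\t")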
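
For readers who want to see the arithmetic behind the pi1 statistic, the following Python sketch mirrors the steps of the R code above with one simplification: it uses Storey's estimator at a single fixed λ, whereas `qvalue::pi0est` by default smooths over a grid of λ values, so the numbers will generally differ from the pipeline's. The function names, the dictionary inputs keyed by pair ID, and the choice `lam = 0.5` are assumptions made for illustration.

```python
def storey_pi0(pvalues, lam=0.5):
    """Fixed-lambda Storey estimate of the null proportion pi0.

    pi0(lam) = #{p > lam} / (m * (1 - lam)), capped at 1.
    """
    if not 0 <= lam < 1:
        raise ValueError("lam must be in [0, 1)")
    m = len(pvalues)
    if m == 0:
        raise ValueError("no p-values supplied")
    n_above = sum(1 for p in pvalues if p > lam)
    return min(n_above / (m * (1 - lam)), 1.0)


def pi1_replication(q_analysis1, p_analysis2, q_cutoff=0.05, lam=0.5):
    """pi1 of analysis-2 p-values among pairs significant in analysis 1.

    q_analysis1: dict mapping pair ID -> q-value from analysis 1
    p_analysis2: dict mapping pair ID -> p-value from analysis 2
    """
    # Pairs significant in analysis 1.
    sig_pairs = {pair for pair, q in q_analysis1.items() if q <= q_cutoff}
    # Analysis-2 p-values restricted to those pairs.
    p2_sig1 = [p for pair, p in p_analysis2.items() if pair in sig_pairs]
    return 1.0 - storey_pi0(p2_sig1, lam)


# Toy inputs (hypothetical pair IDs and values).
q1 = {"a": 0.01, "b": 0.02, "c": 0.5, "d": 0.03, "e": 0.04}
p2 = {"a": 0.001, "b": 0.9, "c": 0.7, "d": 0.01, "e": 0.2}
pi1 = pi1_replication(q1, p2)
print(pi1)
```

A pi1 close to 1 would suggest that most pairs significant in analysis 1 also carry signal in analysis 2. With only a handful of pairs, as in this toy input, the estimate is very noisy.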

# Please finish the following steps:

1. FIXME: refine the workflow and discuss it with Gao.
2. Create two result files as a minimal working example (MWE), with the q-values and p-values randomly sampled from a uniform distribution, and run the SoS pipeline on them as input.
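
One possible way to generate the two MWE input files from step 2 is sketched below. The column names `pair`, `qvalue`, and `pValue` follow what the R code in the `[replication]` step reads. The function name `write_mwe`, the feature and SNP naming scheme, the `feature:snp` pair format, and the output file names are assumptions chosen for illustration.

```python
import csv
import random
from pathlib import Path


def write_mwe(out_dir, n_pairs=1000, seed=1):
    """Write two tab-delimited files with uniform random q-values and p-values."""
    rng = random.Random(seed)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical IDs: ten SNPs per molecular feature.
    pairs = [(f"feature_{i // 10}", f"snp_{i}") for i in range(n_pairs)]
    # (file name, value column) for analysis 1 and analysis 2.
    specs = [("analysis_1_path.txt", "qvalue"),
             ("analysis_2_path.txt", "pValue")]
    paths = []
    for filename, value_col in specs:
        path = out_dir / filename
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(["molecular_feature", "snp_id", "pair", value_col])
            for feature, snp in pairs:
                # rng.random() draws from the uniform distribution on [0, 1).
                writer.writerow([feature, snp, f"{feature}:{snp}", rng.random()])
        paths.append(path)
    return paths
```

Calling `write_mwe("mwe")` would create both files under a hypothetical `mwe/` directory, which can then be passed as `--analysis_1_path` and `--analysis_2_path`. Since both columns are uniform noise, a pi1 estimate near zero is the expected outcome, which makes this a useful sanity check for the pipeline.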
