# A robust and versatile computational protocol for fine-mapping and integrative analyses of multiple molecular quantitative trait loci

## Introduction

(Figure 1 Overview flow chart)

### Development of the protocol


### Comparison with other methods


### Applications of the method


## Materials

### Required software dependency
The analyses described on this page can be run on Ubuntu Linux. Windows users can run them through the Windows Subsystem for Linux (WSL).


To run our pipeline, the following software must be installed on the operating system:
1. Python 3
2. Singularity
3. The Script of Scripts (SoS) workflow system
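
A quick way to confirm these prerequisites before starting is to check that each executable is on the `PATH`. The sketch below assumes the command names `python3`, `singularity`, and `sos`; these are assumptions about a typical installation and may differ on your system.

```python
import shutil

# Command names are assumptions about a typical installation; adjust as needed.
REQUIRED_TOOLS = ("python3", "singularity", "sos")


def find_tools(names=REQUIRED_TOOLS):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the command names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


for tool, location in find_tools().items():
    print(f"{tool}: {location or 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` should be installed, or its directory added to `PATH`, before running the commands in this protocol.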

The bioinformatics software needed for each analysis is packaged in Singularity containers, which can be downloaded from [Synapse](https://www.synapse.org/#!Synapse:syn37177618).

-  The `container` parameter of each command in this protocol gives the path to the Singularity `.sif` file for the respective step.

### Equipment and hardware
We highly recommend running our pipeline on a computational cluster with a Linux-based OS. On a personal computer, at least 60 GB of memory is needed.

### Required data

#### Reference data setup 

This section describes downloading, indexing, and (where necessary) preprocessing the reference data in preparation for use throughout the pipeline.

We have included the PDF document compiled by the Data Standardization Working Group [on Synapse](https://www.synapse.org/#!Synapse:syn36416587) as well as on the [ADSP Dashboard](https://www.niagads.org/adsp/content/adspgcadgenomeresources-v2pdf). It lists the reference data to use for the project.

The processed reference data (see the Methods section and the rest of the analysis for details) can be found [in this folder on Synapse](https://www.synapse.org/#!Synapse:syn36416587).

##### Reference data downloads 
The following commands download the raw reference data, before any processing. Downloading all the data takes around one hour.

In [None]:
sos run pipeline/reference_data.ipynb download_hg_reference --cwd reference_data
sos run pipeline/reference_data.ipynb download_gene_annotation --cwd reference_data
sos run pipeline/reference_data.ipynb download_ercc_reference --cwd reference_data
sos run pipeline/reference_data.ipynb download_dbsnp --cwd reference_data

##### Convert transcript feature file gff3 to gtf

**Input:** an uncompressed gff3 file.

**Output:** a gtf file without HLA, ALT, Decoy but with ERCC.


In [None]:
sos run pipeline/reference_data.ipynb hg_gtf \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --container containers/rna_quantification.sif --stranded

##### Collapse transcript features into genes

**Input:** a gtf file.

**Output:** a gtf file with a collapsed gene model.

In [None]:
sos run pipeline/reference_data.ipynb gene_annotation \
    --cwd reference_data \
    --ercc-gtf reference_data/ERCC92.gtf \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --container containers/rna_quantification.sif --stranded

##### Process reference fasta file.

**Input:** a reference fasta file.

**Output:** a reference fasta file without HLA, ALT, Decoy but with ERCC.

In [None]:
sos run pipeline/reference_data.ipynb hg_reference \
    --cwd reference_data \
    --ercc-reference reference_data/ERCC92.fa \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --container containers/rna_quantification.sif

##### Generate STAR index based on reference fasta

**Input:** a processed reference fasta file.

**Output:** A folder of STAR index.

This step requires at least 40 GB of memory and takes around 1.5 hours in total.

In [None]:
sos run pipeline/reference_data.ipynb STAR_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --container containers/rna_quantification.sif \
    --mem 40G

#### Input data setup 

The input data used in this protocol can be downloaded from [Synapse](https://www.synapse.org/#!Synapse:syn36416559/files/). The data originally come from the ROSMAP dataset [cite here]. We de-identified the data and kept only a subset of the samples and genomic regions to serve as a demonstration.

## Procedure 1: Molecular phenotype calling

(Figure 2 molecular phenotype flow chart)

### Bulk RNA-Seq
Our protocol can generate two types of molecular phenotypes from bulk RNA-Seq data: gene expression and splicing events. After the shared QC and alignment steps, the STAR step generates the `Aligned.sortedByCoord.out.md.bam` files that serve as the input for both.

#### QC on fastq files

Before the STAR alignment, the `fastqc` step generates a brief summary of the quality of the fastq files.

**Input:** 

- `samples`: a collection of fastq files and a `fastqlist` file giving the sample name, file names, and optionally the strandedness and read length of each sample.

- `data-dir`: the folder containing the files described in `samples`

```
ID fq1 fq2 strand read_length
sample_1 samp1_r1.fq.gz samp1_r2.fq.gz rf 100
sample_2 samp2_r1.fq.gz samp2_r2.fq.gz fr 150
sample_3 samp3_r1.fq.gz samp3_r2.fq.gz strand_missing 75
```


**Output:** a collection of new fastq files without adaptors and a new sample list file corresponding to the newly generated fastq files.

**Option**
- `cwd` indicates the working directory of the command
- `STAR-index`, `reference-fasta`, `ref-flat`, and `gtf` are reference files prepared in the reference data setup stage

In [None]:
sos run pipeline/RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir protocol_data/input_data/RNASeq/fastq \
    --container containers/rna_quantification.sif \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf
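
Before running the step, the `fastqlist` layout shown above can be sanity-checked with a short script. This is a minimal sketch, not part of the pipeline: it assumes the file is whitespace-delimited with a header row, and the function name `parse_fastqlist` is hypothetical.

```python
REQUIRED_COLUMNS = ("ID", "fq1", "fq2")  # strand and read_length are optional


def parse_fastqlist(text):
    """Parse whitespace-delimited fastqlist text into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("fastqlist is empty")
    header = lines[0].split()
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    records = []
    for line_number, line in enumerate(lines[1:], start=2):
        fields = line.split()
        if len(fields) != len(header):
            raise ValueError(
                f"line {line_number}: expected {len(header)} fields, got {len(fields)}"
            )
        record = dict(zip(header, fields))
        if "read_length" in record:
            record["read_length"] = int(record["read_length"])
        records.append(record)
    sample_ids = [record["ID"] for record in records]
    if len(sample_ids) != len(set(sample_ids)):
        raise ValueError("duplicate sample IDs in fastqlist")
    return records


example = """ID fq1 fq2 strand read_length
sample_1 samp1_r1.fq.gz samp1_r2.fq.gz rf 100
sample_2 samp2_r1.fq.gz samp2_r2.fq.gz fr 150
sample_3 samp3_r1.fq.gz samp3_r2.fq.gz strand_missing 75
"""
records = parse_fastqlist(example)
# Samples whose strandedness still has to be detected downstream.
needs_detection = [r["ID"] for r in records if r.get("strand") == "strand_missing"]
print(needs_detection)  # ['sample_3']
```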

#### Trim adaptor (Optional)


The optional `fastp_trim_adaptor` step uses `fastp` [cite], a C++ command-line tool, to detect and remove adaptors automatically. Along with the adaptor-free fastq files, it produces a new `fastqlist` text file that replaces the original one as the input for the `STAR_output` step. Because the fastq files used in this protocol were converted from bam files and already lack adaptors, this step can be safely skipped.

In [None]:
sos run pipeline/RNA_calling.ipynb fastp_trim_adaptor \
    --cwd output/rnaseq --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir protocol_data/input_data/RNASeq/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat

#### Read alignment via STAR

The STAR [cite] alignment step produces the input needed for both eQTL and splicing QTL analyses, along with multiple QC metrics generated through Picard [cite]. If the strand information is missing from the fastqlist for certain samples, the `STAR_output` step automatically detects the strandedness of these samples from `ReadsPerGene.out.tab`. The criteria for determining the strand follow `how_are_we_stranded_here` [cite].
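
The idea behind strand detection from `ReadsPerGene.out.tab` can be sketched as follows. STAR writes four tab-separated columns per gene (gene ID, unstranded counts, counts for reads on the forward strand, counts for reads on the reverse strand), preceded by summary rows whose names start with `N_`. The cutoffs (0.9 and 0.4 to 0.6), the mapping of columns 3 and 4 to the `fr` and `rf` labels, and the labels `unstranded` and `undetermined` are illustrative assumptions, not the values confirmed to be used by `STAR_output`.

```python
def strand_fractions(reads_per_gene_text):
    """Return the fractions of stranded gene counts in columns 3 and 4."""
    forward = reverse = 0
    for line in reads_per_gene_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 4 or fields[0].startswith("N_"):
            continue  # skip malformed lines and summary rows such as N_unmapped
        forward += int(fields[2])
        reverse += int(fields[3])
    total = forward + reverse
    if total == 0:
        raise ValueError("no stranded gene counts found")
    return forward / total, reverse / total


def call_strand(forward_fraction, reverse_fraction,
                stranded_cutoff=0.9, unstranded_band=(0.4, 0.6)):
    """Classify a library from its strand fractions (cutoffs are assumptions)."""
    if forward_fraction > stranded_cutoff:
        return "fr"
    if reverse_fraction > stranded_cutoff:
        return "rf"
    low, high = unstranded_band
    if low <= forward_fraction <= high:
        return "unstranded"
    return "undetermined"


example = "\n".join([
    "N_unmapped\t100\t100\t100",
    "N_multimapping\t50\t50\t50",
    "N_noFeature\t10\t500\t20",
    "N_ambiguous\t5\t1\t4",
    "GENE_A\t200\t4\t196",
    "GENE_B\t300\t6\t294",
])
forward_fraction, reverse_fraction = strand_fractions(example)
print(call_strand(forward_fraction, reverse_fraction))  # rf
```

In this toy example, 490 of the 500 stranded gene counts fall in the fourth column, so the library is called `rf`.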


**Input** 
- `samples`:
    - Without adaptor trimming: The fastqlist file described in the **input** section of the `fastp_trim_adaptor` step
    - With adaptor trimming: The fastqlist file output from `fastp_trim_adaptor`

**Output**
- An `Aligned.sortedByCoord.out.md.bam` file for each sample, which serves as the input for calling both gene expression and alternative splicing events.
- A `bam_list` file that lists the corresponding bam file for each sample, along with its strandedness.


**Option**
- The `--uncompressed` option indicates that the fastq input is not in gz format. When the fastq input is compressed, remove this option.
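
Whether `--uncompressed` is needed can be decided by inspecting the fastq files rather than trusting their extensions. The sketch below checks for the two-byte gzip magic number; the helper names `is_gzipped` and `uncompressed_option` are hypothetical and not part of the pipeline.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def uncompressed_option(fastq_paths):
    """Return the extra command-line arguments implied by the inputs."""
    states = {is_gzipped(path) for path in fastq_paths}
    if not states:
        raise ValueError("no fastq files given")
    if len(states) > 1:
        raise ValueError("inputs mix compressed and uncompressed fastq files")
    return [] if states == {True} else ["--uncompressed"]


# Demonstration on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "reads.fq")
    packed = os.path.join(tmp, "reads.fq.gz")
    with open(plain, "w") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    with gzip.open(packed, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    print(uncompressed_option([plain]))   # ['--uncompressed']
    print(uncompressed_option([packed]))  # []
```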

**Time and Memory**

The following step requires at least 40 GB of memory and takes around 2 hours in total.

**Critical** 
- Note that the gtf file used here is `reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf`, the one **before** `Collapse transcript features into genes`, i.e. the one without `gene` in its file name.

In [None]:
sos run pipeline/RNA_calling.ipynb STAR_output \
    --cwd output/rnaseq --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir protocol_data/input_data/RNASeq/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat --uncompressed 

#### Gene expression Calling
**Input** 
- `bam_list`: The output from `STAR_output`

**Output**
- A `rnaseqc.gene_tpm.gct.gz` and a `rnaseqc.gene_readsCount.gct.gz` file, both of which serve as input to the RNA-seq QC step.

- A `multiqc_report.html` file that describes the overall quality metrics for both alignment and expression calling.

**Critical** 
- Note that the gtf file used here is `reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf`, the one **after** `Collapse transcript features into genes`, i.e. the one with `gene` in its file name.

In [None]:
sos run pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd output/rnaseq \
    --samples input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist    --data-dir input_data/RNASeq/fastq \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --bam_list output/rnaseq/xqtl_protocol_data_bam_list

#### Gene expression QC

**Input** 
- `tpm-gct`,`counts-gct`: The output from `rnaseqc_call`

**Output**
- A `rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz` and a `rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz` file, both of which serve as input to the normalization step.

- A `multiqc_report.html` file that describes the overall quality metrics for both alignment and expression calling.


In [None]:
sos run pipeline/bulk_expression_QC.ipynb qc \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_readsCount.gct.gz \
    --container containers/rna_quantification.sif 

#### Gene expression Normalization

**Input** 
- `tpm-gct`,`counts-gct`: The output from `qc` of `bulk_expression_QC`

**Output**
- An `expression.bed.gz` file that serves as the input for downstream eQTL analysis.

**Option**
- `sample_participant_lookup`: an index file that maps the sample names in the RNA-Seq data to those in the genotype data.

- `annotation-gtf`: the same gtf file used in `rnaseqc_call`, i.e. the one after **Collapse transcript features into genes**


In [None]:
sos run pipeline/bulk_expression_normalization.ipynb normalize \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf  \
    --container containers/rna_quantification.sif \
    --count-threshold 1 --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq

#### Splicing Event Calling
In this step, intron usage ratios are quantified with LeafCutter [cite].


**Input:** 
- `samples`: the `bam_list` which is the output from `STAR_output` 

**Output:** 
- an `intron_usage_perind.counts.gz` file with intron usage ratios, which is the input to normalization
- an `intron_usage_perind_numers.counts.gz` file with the actual intron usage counts, which is the input to `annotate_leafcutter_isoforms`

In [None]:
sos run pipeline/splicing_calling.ipynb leafcutter \
    --cwd output/leaf_cutter/ \
    --samples output/rnaseq/xqtl_protocol_data_bam_list \
    --container containers/leafcutter.sif 
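
To make the notion of an intron usage ratio concrete: for one sample, each intron's ratio is its read count divided by the total read count of the intron cluster it belongs to. The sketch below illustrates that arithmetic only. It assumes intron identifiers of the form `chrom:start:end:cluster`, which is a hypothetical layout chosen for the example, and it does not reproduce LeafCutter's clustering itself.

```python
from collections import defaultdict


def intron_usage_ratios(counts):
    """Compute per-intron usage ratios for one sample.

    counts maps an intron ID ('chrom:start:end:cluster') to its read count.
    Introns in clusters with zero total reads get None.
    """
    cluster_totals = defaultdict(int)
    for intron_id, count in counts.items():
        cluster = intron_id.rsplit(":", 1)[1]
        cluster_totals[cluster] += count
    ratios = {}
    for intron_id, count in counts.items():
        total = cluster_totals[intron_id.rsplit(":", 1)[1]]
        ratios[intron_id] = count / total if total else None
    return ratios


sample_counts = {
    "chr1:100:200:clu_1": 30,
    "chr1:100:300:clu_1": 10,
    "chr2:500:900:clu_2": 0,
    "chr2:500:950:clu_2": 0,
}
ratios = intron_usage_ratios(sample_counts)
print(ratios["chr1:100:200:clu_1"])  # 0.75
print(ratios["chr2:500:900:clu_2"])  # None
```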

#### Splicing Event QC and Normalization

**Input:** 
- `ratios`: the `intron_usage_perind.counts.gz` output from `leafcutter` 

**Output:** 
- a QCed and normalized phenotype table `gz_raw_data.qqnorm.txt`


In [None]:
sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
    --cwd output/leaf_cutter/ \
    --ratios output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz \
    --container containers/leafcutter.sif 
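
The `qqnorm` in the output name refers to quantile normalization toward a standard normal distribution. As a rough illustration, the sketch below applies a generic rank-based inverse normal transform to one row of values. The `(rank - 0.5) / n` offset and the averaging of tied ranks are assumptions for this example and may differ from what `leafcutter_norm` actually does.

```python
from statistics import NormalDist


def rank_inverse_normal(values):
    """Map values to standard-normal quantiles according to their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda index: values[index])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank, averaged over ties
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((rank - 0.5) / n) for rank in ranks]


usage = [0.2, 0.5, 0.1, 0.9, 0.3]
normalized = rank_inverse_normal(usage)
print([round(value, 3) for value in normalized])
```

The transform preserves the ordering of the input and maps the median value to 0, so downstream association tests see approximately normal phenotypes regardless of the original scale.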

#### Splicing Event Annotation

**Input:** 
- `intron_count`: the `intron_usage_perind_numers.counts.gz` output from `leafcutter` 
- `phenoFile`: the `gz_raw_data.qqnorm.txt` output from `leafcutter_norm`

**Output:** 
- a `gz_raw_data.qqnorm.formated.bed.gz` file that is the input for downstream sQTL analysis.
- a `gz_raw_data.qqnorm.phenotype_group.txt` file that must be passed to the TensorQTL association scan along with the sQTL `bed.gz` files

In [None]:
sos run pipeline/gene_annotation.ipynb annotate_leafcutter_isoforms \
    --cwd output/leaf_cutter/ \
    --intron_count output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind_numers.counts.gz \
    --phenoFile output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.txt \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --container containers/bioinfo.sif \
    --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq

### Methylation

#### Methylation Calling, QC, and Normalization

**Input:** 
- `sample-sheet`: a csv file output by Illumina, by default located in the parent folder that holds the idat files for all samples.

**Output:** 
- a `sesame.M.bed.gz` file that is the input to the soft imputation and NA removal step.

**Option:**

- `sample_sheet_header_rows`: the number of header rows in the `sample-sheet`. The default is 7, but it can be changed to match different formats.

**Time and Memory**
- `time elapsed`: 1550.12s
- `max vms_memory`: 14.62GB


In [None]:
sos run pipeline/methylation_calling.ipynb sesame \
    --sample-sheet input_data/Methylation/xqtl_protocol_data_arrayMethylation_covariates.tsv  \
    --container containers/methylation.sif --sample_sheet_header_rows 0 --cwd output/methylation/

#### Methylation soft Imputation and NA removal

Many entries of the methylation matrix are NA. By default, all probes with a missing rate above 10% are removed. Soft imputation is then applied to fill in the remaining NA cells.
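
The probe filter can be illustrated with a few lines of code. This sketch covers only the missing-rate filter, using `None` to stand in for NA. The soft imputation step is not reproduced here, and the function name `filter_probes_by_missing_rate` is hypothetical.

```python
def filter_probes_by_missing_rate(matrix, max_missing_rate=0.10):
    """Keep probes whose fraction of missing values does not exceed the cutoff.

    matrix maps a probe ID to its list of per-sample values (None = NA).
    """
    kept = {}
    for probe_id, values in matrix.items():
        missing_rate = sum(value is None for value in values) / len(values)
        if missing_rate <= max_missing_rate:
            kept[probe_id] = values
    return kept


methylation = {
    "probe_complete": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "probe_one_na": [0.1, None, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "probe_two_na": [None, None, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}
filtered = filter_probes_by_missing_rate(methylation)
print(sorted(filtered))  # ['probe_complete', 'probe_one_na']
```

A probe with exactly 10% missing values is kept, because only probes with more than 10% missing are removed. Its remaining NA cell is what the imputation step would then fill in.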


**Input:** 
- `phenoFile`: the `sesame.M.bed.gz` output from `sesame`

**Output:** 
- a `sesame.M.bed.gz` file that is free of NA and is the input for downstream mQTL analysis.

**Time and Memory**
- `time elapsed`: 143.58s
- `max vms_memory`: 8.1GB

In [None]:
sos run pipeline/phenotype_formatting.ipynb bed_filter_na \
        --phenoFile output/methylation/xqtl_protocol_data_arrayMethylation_covariates.sesame.M.bed.gz \
        --cwd ./output/methylation/

## Procedure 2: Data Processing and QC

### Phenotype processing
Each molecular phenotype file is partitioned into one `bed.gz` file per chromosome, which allows `cis-eQTL association testing` to run in parallel.


**Input:** 
- `phenoFile`: 
    - eQTL: the `expression.bed.gz` output from the `normalize` step of gene expression calling.
    - sQTL: the `gz_raw_data.qqnorm.formated.bed.gz` from `annotate_leafcutter_isoforms`.
    - mQTL: the `sesame.M.bed.gz` output from `bed_filter_na` of methylation calling.

**Output:** 
- a collection of `bed.gz` files, listed in a `bed.per_chrom.recipe` file, for each molecular phenotype.

**Time and Memory**
- `time elapsed`: 8.60s
- `max vms_memory`: 15.36GB


In [None]:
sos run pipeline/phenotype_formatting.ipynb partition_by_chrom \
    --cwd output/data_preprocessing/protocol-eqtl/phenotype_data  \
    --phenoFile output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
    --region-list  <(zcat output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tmm.expression.bed.gz  | cut -f 1,2,3,4)  \
    --container containers/rna_quantification.sif \
    --mem 16G       

sos run pipeline/phenotype_formatting.ipynb partition_by_chrom \
    --cwd output/data_preprocessing/protocol-sqtl/phenotype_data  \
    --phenoFile output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.formated.bed.gz \
    --region-list  <(zcat output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.formated.bed.gz  | cut -f 1,2,3,4)  \
    --container containers/rna_quantification.sif \
    --mem 16G       

sos run pipeline/phenotype_formatting.ipynb partition_by_chrom \
    --cwd output/data_preprocessing/protocol-mqtl/phenotype_data  \
    --phenoFile output/methylation/xqtl_protocol_data_arrayMethylation_covariates.sesame.beta.filter_na.soft.bed.gz \
    --region-list  <(zcat output/methylation/xqtl_protocol_data_arrayMethylation_covariates.sesame.beta.filter_na.soft.bed.gz  | cut -f 1,2,3,4)  \
    --container containers/rna_quantification.sif \
    --mem 16G       
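
Conceptually, the partitioning groups the rows of a BED file by their first column and records which output file holds each chromosome. The sketch below shows that logic on in-memory lines. It assumes a tab-delimited file whose header line starts with `#`, and the two-column recipe layout and output naming are hypothetical stand-ins for the actual `bed.per_chrom.recipe` format.

```python
from collections import defaultdict


def partition_by_chromosome(bed_lines):
    """Split BED lines into a header and a dict of chromosome -> data lines."""
    header = None
    by_chrom = defaultdict(list)
    for line in bed_lines:
        if line.startswith("#"):
            header = line  # assumed header convention
            continue
        chrom = line.split("\t", 1)[0]
        by_chrom[chrom].append(line)
    return header, dict(by_chrom)


def build_recipe(by_chrom, prefix):
    """List (chromosome, output file name) pairs; the layout is hypothetical."""
    return [(chrom, f"{prefix}.{chrom}.bed.gz") for chrom in sorted(by_chrom)]


bed_lines = [
    "#chr\tstart\tend\tID\tsample_1\tsample_2",
    "chr1\t100\t101\tgene_a\t0.5\t-0.2",
    "chr2\t300\t301\tgene_b\t1.1\t0.4",
    "chr1\t900\t901\tgene_c\t-0.7\t0.9",
]
header, by_chrom = partition_by_chromosome(bed_lines)
for chrom, path in build_recipe(by_chrom, "expression"):
    print(chrom, path, len(by_chrom[chrom]))
```

Because each chromosome ends up in its own file, the association scan for one chromosome never has to read phenotypes from another, which is what makes the per-chromosome jobs independent.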

### Genotype QC and PCA 

### Factor analysis for phenotype

## Procedure 3: Univariate analysis

### Cis Association Scan

### Trans Association Scan

### Univariate Finemapping

## Procedure 4: Multi-variate analysis

### Multivariate Adaptive Shrinkage (MASH)

### Multivariate-Finemapping

## Anticipated results

### Association Scan

### MASH

### Fine mapping