# Alternative splicing from RNA-seq data

This document shows how to use the modules for data preparation, quantification, and quality control plus normalization in the analysis of splicing events, and how to convert the results to molecular phenotype data in `bed` format. In particular:

1. `molecular_phenotypes/calling/RNA_calling.ipynb`
2. `molecular_phenotypes/calling/splicing_calling.ipynb`
3. `molecular_phenotypes/QC/splicing_normalization.ipynb`
4. `data_preprocessing/phenotype/gene_annotation.ipynb`

Two tools, leafCutter and Psichomics, are used in this splicing analysis workflow; please check the corresponding modules for code documentation. Various reference data need to be prepared before using this workflow; please check [this module](https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/reference_data.html) to download and prepare them.

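Because every step below reads reference files and containers from fixed locations, it can help to confirm they exist before launching anything. The following is a minimal sketch, not part of the pipeline: `missing_paths` is a hypothetical helper, and the paths in `required` are simply copied from the commands in this document and may differ in your setup.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Paths copied from the commands in this document (assumed layout);
# adjust them to wherever your reference data and containers live.
required = [
    "reference_data/STAR_Index/",
    "reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf",
    "containers/rna_quantification.sif",
]

for path in missing_paths(required):
    print(f"missing: {path}")
```

An empty printout means all listed inputs were found relative to the current working directory.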

## Minimal Working Example

As a minimal working example, some toy `fastq` data can be found on [Google Drive](https://drive.google.com/drive/u/0/folders/11kQv7PXozsKkgeqADH-28bC_kZ-w_oHo). These `fastq` files can be used to test the "fastqc", "fastp_trim_adaptor", and "STAR_output" steps below.

Since the STAR-aligned output from the data above contains too little information to generate valid output in leafcutter and psichomics, some leafcutter and psichomics inputs have been prepared and can be downloaded [here](https://drive.google.com/drive/folders/1lpcx3eKG2UpauntLUuJ6bMBjHyIhWW_R). The minimal working example files are publicly available data from the [1000 Genomes Project](https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/), an international research project with an extensive catalog of human genome variation. For the minimal working example, 3 of 465 unrelated human lymphoblastoid cell lines from the 1000 Genomes Project were selected to produce leafcutter and psichomics example inputs via STAR alignment. For details on how the minimal working example was prepared, please check [this document](https://docs.google.com/document/d/1Gmk8C-zhfQRLceYE9ViGl_JcoSbkJn3jPe-E9y3L-UM/edit).


## To generate `fastqc` report

In [None]:
sos run pipeline/RNA_calling.ipynb fastqc \
    --cwd output/ \
    --samples data/sample_fastq.list \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \
    --container containers/rna_quantification.sif

## To trim adaptors
After the adaptors are trimmed, a new sample list is generated that points to the trimmed `fastq` files.

In [None]:
sos run pipeline/RNA_calling.ipynb fastp_trim_adaptor \
    --cwd output2 \
    --samples data/sample_fastq.list \
    --data-dir data \
    --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat 

## To align the reads with STAR

In [None]:
sos run pipeline/RNA_calling.ipynb STAR_output \
    --cwd output2 \
    --samples data/sample_fastq.list \
    --data-dir data \
    --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat

The LeafCutter and Psichomics parts below are parallel branches. They should be run independently; the inputs and outputs of one do not depend on the other.

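Since the two branches share no inputs or outputs, they can be launched side by side while the steps inside each branch stay sequential. The sketch below illustrates that idea only; `run_branch` and `run_branches` are hypothetical helpers, and the demo commands are placeholders that should be replaced with the `sos run` invocations of each branch, split into argument lists.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_branch(commands):
    """Run one branch's commands in order; raise on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return len(commands)


def run_branches(branches):
    """Run each branch in its own thread; steps within a branch stay ordered."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(run_branch, cmds)
                   for name, cmds in branches.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Placeholder commands (assumption) so the sketch runs anywhere.
demo = {
    "leafcutter": [[sys.executable, "-c", "print('leafcutter step')"]],
    "psichomics": [[sys.executable, "-c", "print('psichomics step')"]],
}
print(run_branches(demo))
```

The returned dictionary reports how many steps completed per branch; a failing step raises `subprocess.CalledProcessError` for that branch without affecting the other.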
# LeafCutter workflow

## Intron usage ratio quantification via `leafCutter`
*  `input`: a metadata file listing the locations of all `Aligned.sortedByCoord.out.bam` files to be analyzed.
*  `output`: a file of intron usage ratios, ending with "_intron_usage_perind.counts.gz"

In [None]:
sos run pipeline/splicing_calling.ipynb leafcutter \
    --cwd leafcutter_output/ \
    --samples sample_fastq_bam_list \
    --container containers/leafcutter.sif

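To make "intron usage ratio" concrete, the sketch below converts a single table entry into a number. It assumes each entry is a `numerator/denominator` string (reads supporting the intron over reads in its cluster), which is how LeafCutter's per-individual counts are commonly laid out; check your own file before relying on this. `usage_ratio` is a hypothetical helper, not a pipeline function.

```python
def usage_ratio(entry):
    """Convert a 'numerator/denominator' string to a float, or None if undefined."""
    num, denom = entry.split("/")
    num, denom = float(num), float(denom)
    if denom == 0:
        return None  # the cluster has no reads in this sample
    return num / denom


# Illustrative values for one intron across three samples (made-up data).
row = ["3/12", "0/0", "5/5"]
print([usage_ratio(x) for x in row])  # [0.25, None, 1.0]
```

Entries with a zero denominator are left undefined here; how such values are handled downstream is up to the normalization step.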
## QC and Normalization of leafCutter outputs
*  `input`: the "_intron_usage_perind.counts.gz" file from the previous step
*  `output`: a QC'd and normalized phenotype table, ending with "qqnorm.txt"

In [None]:
sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
    --cwd leafcutter_output/ \
    --ratios leafcutter_output/sample_fastq_bam_list_intron_usage_perind.counts.gz \
    --container containers/leafcutter.sif 

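The "qqnorm" in the output name refers to quantile normalization of each phenotype. As a rough illustration of the idea, the sketch below applies a generic rank-based inverse normal transform to one row of ratios. It is an assumption-laden stand-in, not the module's code: the `(rank - 0.5) / n` offset and the helper names are chosen here for simplicity, and the actual implementation may use a different offset and additional filtering.

```python
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_to_normal(values):
    """Map values to standard-normal quantiles using (rank - 0.5) / n."""
    n = len(values)
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.5) / n) for r in average_ranks(values)]


# Made-up usage ratios for one intron across three samples.
ratios = [0.30, 0.10, 0.20]
print([round(z, 3) for z in rank_to_normal(ratios)])
```

The transform keeps the ordering of samples but replaces the values with quantiles of a standard normal distribution, which is the shape most QTL mappers expect for phenotypes.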
## Post-processing leafcutter outputs to make them TensorQTL ready
*  `input`: the outputs of the previous two steps and the gtf file.
*  `output`: a file in bed format, ending with "formated.bed.gz"

In [None]:
sos run pipeline/code/data_preprocessing/phenotype/gene_annotation.ipynb map_leafcutter_cluster_to_gene \
    --cwd leafcutter_output \
    --intron_count leafcutter_output/sample_fastq_bam_list_intron_usage_perind.counts.gz \
    --phenoFile leafcutter_output/sample_fastq_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.txt \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \
    --container containers/bioinfo.sif

In [None]:
sos run pipeline/code/data_preprocessing/phenotype/gene_annotation.ipynb annotate_leafcutter_isoforms \
    --cwd leafcutter_output \
    --intron_count leafcutter_output/sample_fastq_bam_list_intron_usage_perind.counts.gz \
    --phenoFile leafcutter_output/sample_fastq_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.txt \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \
    --container containers/bioinfo.sif

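A quick sanity check on the resulting bed file can catch layout problems before QTL mapping. The sketch below assumes the TensorQTL phenotype layout: one header line starting with `#`, four leading columns (chromosome, start, end, phenotype ID), then one column per sample. The column names in the example header and the helper `check_bed_header` are hypothetical.

```python
def check_bed_header(header_line):
    """Return the sample IDs if the header has the four leading BED columns."""
    fields = header_line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected four annotation columns plus at least one sample")
    if not fields[0].startswith("#"):
        raise ValueError("header line should start with '#'")
    return fields[4:]


# Made-up header for illustration.
header = "#chr\tstart\tend\tID\tsample_1\tsample_2\n"
print(check_bed_header(header))  # ['sample_1', 'sample_2']
```

For a real file, the first line can be read with `gzip.open(path, "rt")` and passed to the same function.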
# Psichomics workflow

## Percent Spliced In (PSI) quantification for alternative splicing events via `Psichomics`
*  `input`: a metadata file listing the locations of all `SJ.out.tab` files to be analyzed.
*  `output`: `psi_raw_data.tsv`, which contains percent spliced in values for each alternative splicing event

In [None]:
sos run pipeline/splicing_calling.ipynb psichomics \
    --cwd psichomics_output/ \
    --samples sample_fastq_bam_list \
    --splicing_annotation hg38_suppa.rds \
    --container containers/psichomics.sif

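For readers new to PSI, the sketch below shows the generic definition: the fraction of junction reads that support inclusion of an event out of all reads that support either inclusion or exclusion. It is an illustration of the quantity, not of how `Psichomics` computes it for each event type; the function name and the `min_reads` threshold of 10 are assumptions made for this example.

```python
def percent_spliced_in(inclusion_reads, exclusion_reads, min_reads=10):
    """Return inclusion / (inclusion + exclusion), or None below `min_reads`."""
    total = inclusion_reads + exclusion_reads
    if total < min_reads or total == 0:
        return None  # too little coverage to estimate PSI reliably
    return inclusion_reads / total


print(percent_spliced_in(30, 10))  # 0.75
print(percent_spliced_in(2, 1))    # None
```

Returning `None` for low-coverage events keeps unreliable estimates out of the table rather than reporting them as 0 or 1.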
## QC and Normalization of psichomics outputs
*  `input`: the "psi_raw_data.tsv" file from the previous step
*  `output`: a QC'd and normalized phenotype table, ending with "qqnorm.txt"

In [None]:
sos run pipeline/splicing_normalization.ipynb psichomics_norm \
    --cwd psichomics_output \
    --ratios psichomics_output/psi_raw_data.tsv \
    --container containers/psichomics.sif

## Post-processing psichomics outputs to make them TensorQTL ready
*  `input`: the "qqnorm.txt" output from the previous step and the gtf file.
*  `output`: a file in bed format, ending with "formated.bed.gz"

In [None]:
sos run pipeline/code/data_preprocessing/phenotype/gene_annotation.ipynb annotate_psichomics_isoforms \
    --cwd psichomics_output \
    --phenoFile psichomics_output/psichomics_raw_data_bedded.qqnorm.txt \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \
    --container containers/bioinfo.sif