# MWE for eQTL
This notebook serves as a demonstration and record of how to generate the inputs for, and how to use, the xQTL calling and discovery pipeline.

## Materials

### Generating the reference data


In [None]:
sos run pipeline/reference_data.ipynb download_hg_reference --cwd reference_data    &
sos run pipeline/reference_data.ipynb download_gene_annotation --cwd reference_data &
sos run pipeline/reference_data.ipynb download_ercc_reference --cwd reference_data &

**Warning: the following steps are resource intensive and should be run with `-J 1 -c csg.yml -q csg` so that they are submitted to a cluster with at least 16GB of memory (the default).**

To format reference data:

In [None]:
sos run reference_data.ipynb hg_reference \
    --cwd reference_data \
    --ercc-reference reference_data/ERCC92.fa \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --container container/rna_quantification.sif

In [None]:

sos run pipeline/reference_data.ipynb hg_gtf \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \
    --container containers/rna_quantification.sif -J 1 -c csg.yml -q csg &

To format gene feature data:

In [None]:
sos run reference_data.ipynb gene_annotation \
    --cwd reference_data \
    --ercc-gtf reference_data/ERCC92.gtf \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --container container/rna_quantification.sif --stranded

In [None]:
sos run reference_data.ipynb gene_annotation \
    --cwd reference_data \
    --ercc-gtf reference_data/ERCC92.gtf \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta.fasta \
    --container container/rna_quantification.sif --stranded

**Note that for an unstranded RNA-seq protocol, use the switch `--no-stranded` in the command above instead of `--stranded`. More details can be found later in the document.**

To generate STAR index using the GTF annotation file before gene model collapse:

In [None]:
sos run pipeline/reference_data.ipynb STAR_index \
    --cwd reference_data3 \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf \
    --container containers/rna_quantification.sif \
    --mem 40G &

**Note that the command above requires at least 40G of memory and takes a long time to complete.**

To generate RSEM index:

In [None]:
sos run reference_data.ipynb RSEM_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container container/rna_quantification.sif \
    --mem 40G

To generate the SUPPA annotation for psichomics:

In [None]:
sos run pipeline/reference_data.ipynb SUPPA_annotation \
    --hg_gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.gtf

### Downloading the MWE

The samples we use are the first 50 samples of [this project](https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/). Note that the FTP server of this project cannot be used because of a mismatch between the BAM and BAI files. Instead, each BAM/BAI file pair was downloaded using wget.

In [None]:
cd /mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/data/bam

In [None]:
# Download each BAM listed in 50_samples_links together with its .bai index
while read -r url; do
    wget "$url" "$url.bai"
done < 50_samples_links
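
Because the download depends on every BAM arriving with its index, a quick check after the loop can be useful. The following is a minimal sketch, not part of the pipeline: `list_unindexed_bams` is a hypothetical helper name, and it only detects index files that are missing or empty. It does not verify that an index actually matches its BAM.

```shell
# Hypothetical helper (not part of the pipeline): print every BAM in a
# directory whose .bai index is missing or empty.
list_unindexed_bams() (
    dir="$1"
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue          # glob matched nothing
        [ -s "$bam.bai" ] || echo "$bam"   # index missing or zero-sized
    done
)

# Example usage (run from the download directory):
#   list_unindexed_bams .
```

An empty result means every downloaded BAM has a non-empty `.bai` next to it.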

The genotype data are downloaded using:

In [None]:
wget https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/GEUVADIS.chr21.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz \
     https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/GEUVADIS.chr22.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz

In [2]:
cd ../../

Note that generating the STAR index takes a significant amount of time.



### Preprocessing the MWE

Because our RNA-seq calling pipeline starts from fastq files, the phenotype data of the MWE require some preprocessing. The genotype data can be used as is.

#### Generating the input phenotype data
Command 1 extracts only chromosomes 21 and 22 from each BAM file in the designated directory; command 2 then converts the subsetted BAM files into fastq files. Doing so keeps our MWE at a manageable size.

In [None]:
sos run pipeline/phenotype_formatting.ipynb bam_subsetting  --phenoFile `ls data/bam/*.bam` --cwd data/bam  --container containers/rna_quantification.sif  

In [None]:
sos run pipeline/phenotype_formatting.ipynb bam_to_fastq  --phenoFile `ls data/test/*subsetted.bam` --cwd data/fastq  --container containers/rna_quantification.sif  
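
For readers who want to see what these two steps amount to, the sketch below prints roughly equivalent `samtools` commands for one BAM file. It is an illustration under stated assumptions, not necessarily what the pipeline runs internally: `print_subset_cmds` is a hypothetical helper, the region names (`21 22` versus `chr21 chr22`) depend on how the BAM was aligned, and the BAM is assumed to be indexed. The commands are printed rather than executed so they can be reviewed first.

```shell
# Hypothetical helper (not part of the pipeline): print the samtools commands
# that subset one indexed BAM to the given regions and convert the result to
# paired fastq files. Nothing is executed here.
print_subset_cmds() (
    bam="$1"; shift                # remaining arguments are region names
    stem="${bam%.bam}"
    # Step 1: keep only reads overlapping the requested regions
    echo "samtools view -b -o $stem.subsetted.bam $bam $*"
    # Step 2: group mates by name, then write read 1 and read 2 separately
    echo "samtools collate -u -O $stem.subsetted.bam | samtools fastq -1 $stem.subsetted.1.fastq -2 $stem.subsetted.2.fastq -0 /dev/null -s /dev/null -n"
)

# Example usage (region names are an assumption):
#   print_subset_cmds data/bam/HG00096.1.M_111124_6.bam 21 22
```

Piping the printed lines into `sh` would run them, provided `samtools` is installed and the region names match the BAM header.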

The outputs are shown below.

In [None]:
ls -lah data/bam/*.bam

In [None]:
ls -lah data/fastq/*.fastq

#### Generating the input fastq list
The RNA calling section takes as input a list in the following format, which was generated manually. Two optional columns, `strand` and `read_length`, let the user specify a different strand and read length for each sample. It is not necessary to include them, however: the pipeline can detect the strand from the output of the STAR alignment.

In [1]:
head data/MWE.fastq_list 

ID      fq1     fq2     strand
HG00096.1.M_111124_6	HG00096.1.M_111124_6.subsetted.1.fastq 	HG00096.1.M_111124_6.subsetted.2.fastq 	strand_missing 
HG00101.1.M_111124_4	HG00101.1.M_111124_4.subsetted.1.fastq 	HG00101.1.M_111124_4.subsetted.2.fastq 	strand_missing 
HG00104.1.M_111124_5	HG00104.1.M_111124_5.subsetted.1.fastq 	HG00104.1.M_111124_5.subsetted.2.fastq 	strand_missing 
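
The list above was written by hand, but a file with the same first three columns could also be derived from the fastq directory. The following is a minimal sketch under assumptions: `make_fastq_list` is a hypothetical helper, the files are assumed to follow the `<ID>.subsetted.1.fastq` / `<ID>.subsetted.2.fastq` naming shown above, and the optional `strand` and `read_length` columns are left out.

```shell
# Hypothetical helper (not part of the pipeline): write a tab-separated
# ID / fq1 / fq2 table for every read-1 fastq that has a matching read-2 file.
make_fastq_list() (
    dir="$1"
    printf 'ID\tfq1\tfq2\n'
    for fq1 in "$dir"/*.subsetted.1.fastq; do
        [ -e "$fq1" ] || continue                  # glob matched nothing
        base="$(basename "$fq1")"
        id="${base%.subsetted.1.fastq}"            # sample ID is the file stem
        if [ ! -e "$dir/$id.subsetted.2.fastq" ]; then
            echo "skipping $id: read-2 file not found" >&2
            continue
        fi
        printf '%s\t%s\t%s\n' "$id" "$base" "$id.subsetted.2.fastq"
    done
)

# Example usage:
#   make_fastq_list data/fastq > data/MWE.fastq_list
```

Samples without a read-2 file are reported on standard error and skipped, so the resulting list contains only complete pairs.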


## Protocol 1: Molecular Phenotype Calling

### RNA Seq Alignment

In [None]:
sos run pipeline/RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --samples data/MWE.fastq_list  \
    --data-dir data/fastq \
    --container containers/rna_quantification.sif

To align the reads with STAR and generate the bam_list recipe for the downstream molecular phenotype count matrices, run the command below. The `-J 20 -c csg.yml -q csg` part is crucial because it requests the memory required for the STAR alignment.

In [None]:
sos run pipeline/RNA_calling.ipynb STAR_output \
    --cwd output/rnaseq \
    --samples data/MWE.fastq_list \
    --data-dir data/fastq \
    --STAR-index reference_data3/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat  -J 20 -c csg.yml -q csg

### Gene expression count matrix

In [None]:
sos run pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd output/rnaseq \
    --samples data/MWE.fastq_list \
    --data-dir data/fastq \
    --STAR-index reference_data3/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat  \
    --bam_list output/rnaseq/MWE_bam_list

### Splicing count matrix

#### Leafcutter

In [None]:
sos run pipeline/splicing_calling.ipynb leafcutter \
    --cwd output/leaf_cutter/ \
    --samples output/rnaseq/MWE_bam_list \
    --container containers/leafcutter.sif 

#### Psichomics

In [None]:
sos run pipeline/splicing_calling.ipynb psichomics \
    --cwd output/psichomics/ \
    --samples output/rnaseq/MWE_bam_list \
    --splicing_annotation hg38_suppa.rds \
    --container container/psichomics.sif

### Preparing for the xQTL Discovery Pipeline

## Protocol 2: xQTL Discovery Pipeline

### Genotype QC

### Data Preprocessing

### Association Testing

### Sumstat Merging