# xqtl protocol data for eqtl
This notebook serves as a demonstration and record of how the xqtl protocol data were generated and how to use them with the xqtl calling and discovery pipeline.

## Materials

### Generating the reference data
**Trouble: running `download_hg_reference` is likely to fail with a connection time-out error. The only fix we have found is to retry.**

Approximate time: 60 min

In [None]:
sos run pipeline/reference_data.ipynb download_hg_reference --cwd reference_data    &
sos run pipeline/reference_data.ipynb download_gene_annotation --cwd reference_data &
sos run pipeline/reference_data.ipynb download_ercc_reference --cwd reference_data &
sos run pipeline/reference_data.ipynb download_dbsnp --cwd reference_data &
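Because the time-out in `download_hg_reference` appears to be transient, the retry can be automated. The following is a minimal sketch and not part of the pipeline: `run_with_retries` is a hypothetical helper, and the attempt count and waiting time are arbitrary assumptions.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=5, wait_seconds=60, runner=subprocess.run):
    """Run `cmd` until it exits with status 0 or `max_attempts` is reached.

    Returns the number of attempts used. `runner` can be replaced for testing.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Wait before retrying; the delay is an arbitrary assumption.
            time.sleep(wait_seconds)
    raise RuntimeError(f"command failed after {max_attempts} attempts: {cmd}")


# The download step from the cell above, written as an argument list.
DOWNLOAD_HG_REFERENCE = [
    "sos", "run", "pipeline/reference_data.ipynb", "download_hg_reference",
    "--cwd", "reference_data",
]
```

Calling `run_with_retries(DOWNLOAD_HG_REFERENCE)` from the repository root would rerun the download until it succeeds or five attempts have failed.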

**Warning: the following step is memory intensive and should be run with `-J 1 -c csg.yml -q csg` so that it is submitted to a cluster with a minimum of 16 GB of memory (default).**

To format reference data:

Approximate time: 1 min
Mem: 8G

In [None]:
sos run pipeline/reference_data.ipynb hg_reference \
    --cwd reference_data \
    --ercc-reference reference_data/ERCC92.fa \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --container containers/rna_quantification.sif -J 1 -c csg.yml -q csg  &

Approximate time: 8 min
Mem: 16G

In [None]:
sos run pipeline/reference_data.ipynb hg_gtf \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --container containers/rna_quantification.sif --stranded -J 1 -c csg.yml -q csg  &

To format gene feature data:

In [None]:
sos run pipeline/reference_data.ipynb gene_annotation \
    --cwd reference_data \
    --ercc-gtf reference_data/ERCC92.gtf \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --container containers/rna_quantification.sif --stranded

**Note that for an unstranded RNA-seq protocol, the switch `--no-stranded` should be passed to the command above instead of `--stranded`. More details can be found later in the document.**

Generating the STAR index without the GTF annotation file allows the read length to be customized later on, during STAR alignment. STAR needs at least 40G of memory to build the index.
Approximate time: 30 min
Mem: 40G

In [None]:
sos run pipeline/reference_data.ipynb STAR_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --container containers/rna_quantification.sif \
    --mem 40G -J 1 -c csg.yml -q csg  &

**Note that the command above requires at least 40G of memory and takes quite a while to complete.**

To generate RSEM index:

Approximate time: 1 min

In [None]:
sos run pipeline/reference_data.ipynb RSEM_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif  &

To generate the ref.flat annotation for Picard QC:

In [None]:
sos run pipeline/reference_data.ipynb RefFlat_generation \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf 

### Downloading the data

We use 49 samples from the [ROSMAP dataset](https://www.synapse.org/#!Synapse:syn4164376). The processed and de-identified data used in this protocol paper can be found [here]().

In [None]:
cd /mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/ROSMAP_data/bam

In [None]:
for i in `cat 50_samples_synapse_id`; do 
synapse get $i;
done

The genotype data are downloaded using:

In [None]:
wget https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/GEUVADIS.chr21.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz \
     https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/GEUVADIS.chr22.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz

In [2]:
cd ../../


### Preprocessing the xqtl protocol data.

Since we use fastq files as the starting point of our RNA-seq calling pipeline, the phenotype part of the xqtl protocol data requires some preprocessing.



#### Generating the input phenotype data
Command 1 keeps only chromosomes 21 and 22 from each bam file in the designated directory; command 2 then converts the subsetted bam files into fastq files. Doing so keeps our xqtl protocol data at a manageable size.

In [None]:
sos run pipeline/phenotype_formatting.ipynb bam_subsetting  \
    --phenoFile `ls ROSMAP_data/RNASeq/*.bam` \
    --cwd ROSMAP_data/RNASeq/subsetted  \
    --container containers/rna_quantification.sif -J 50 -q csg -c csg.yml

In [None]:
sos run pipeline/phenotype_formatting.ipynb bam_to_fastq  \
    --phenoFile `ls ROSMAP_data/RNASeq/subsetted/*.bam` \
    --cwd ROSMAP_data/RNASeq/fastq  \
    --container containers/rna_quantification.sif -J 50 -q csg -c csg.yml
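Outside the pipeline, the two steps above roughly correspond to a `samtools view` call restricted to two regions followed by a `samtools fastq` call. The sketch below only builds the argument lists. It is an illustration under stated assumptions, not the pipeline's implementation: `subset_commands` is a hypothetical helper, the contig names `chr21` and `chr22` are assumed (some BAM files use `21` and `22`), and the input BAM is assumed to be indexed.

```python
from pathlib import Path


def subset_commands(bam, out_dir, regions=("chr21", "chr22")):
    """Return (view_cmd, fastq_cmd) argument lists for one indexed BAM file."""
    stem = Path(bam).stem
    prefix = str(Path(out_dir) / f"{stem}.subsetted")
    subset_bam = f"{prefix}.bam"
    # Keep only reads overlapping the requested regions (needs a BAM index).
    view_cmd = ["samtools", "view", "-b", "-o", subset_bam, bam, *regions]
    # Write the two mates of each pair to separate FASTQ files.
    # Assumption: mates are adjacent in the input; a coordinate-sorted BAM
    # would normally be passed through `samtools collate` first.
    fastq_cmd = [
        "samtools", "fastq",
        "-1", f"{prefix}.1.fastq",
        "-2", f"{prefix}.2.fastq",
        subset_bam,
    ]
    return view_cmd, fastq_cmd
```

The returned lists can be passed to `subprocess.run` one after the other. The output names follow the `<sample>.subsetted.1.fastq` pattern used in the renaming step later in this notebook.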

#### Creation of sample name mapper and masks
To match and de-identify the samples in both the genotype and phenotype data, an index file was created with the following code:

In [None]:
echo -e "fq1\tfq2" > xqtl_protocol_data_sample_list
paste <(ls *.1.fastq) <(ls *.2.fastq) >> xqtl_protocol_data_sample_list
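The `paste` of two `ls` listings pairs the files purely by sort order, so it silently misaligns if any sample lacks one of its two mate files. A minimal sketch of an explicit check is given below; `pair_fastqs` is a hypothetical helper and assumes the `<prefix>.1.fastq` / `<prefix>.2.fastq` naming used above.

```python
def pair_fastqs(filenames):
    """Pair `<prefix>.1.fastq` with `<prefix>.2.fastq` by their shared prefix."""
    mates = {}
    for name in filenames:
        for mate in ("1", "2"):
            suffix = f".{mate}.fastq"
            if name.endswith(suffix):
                # Group the file under everything that precedes the mate suffix.
                mates.setdefault(name[: -len(suffix)], {})[mate] = name
    unpaired = sorted(p for p, found in mates.items() if set(found) != {"1", "2"})
    if unpaired:
        raise ValueError(f"samples missing a mate file: {unpaired}")
    return [(mates[p]["1"], mates[p]["2"]) for p in sorted(mates)]
```

For example, `pair_fastqs(os.listdir("."))` run in the fastq directory returns one `(fq1, fq2)` tuple per sample, or raises an error naming the incomplete samples.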

***The following code is run in Python.***

In [None]:
import pandas as pd

# fastq pairs listed in the previous step
a = pd.read_csv("xqtl_protocol_data_sample_list", sep="\t")
# the RNA-seq ID is the part of the fastq file name before the first "."
a["rnaseq_id"] = [x.split(".")[0] for x in a.fq1]
b = pd.read_csv("filtered_sample_index", sep="\t")
c = pd.read_csv("ROSMAP_assay_rnaSeq_metadata.csv", sep=",")
# keep the selected samples, then attach the ROSMAP RNA-seq metadata
ab = a.merge(b, on="rnaseq_id")
abc = ab.merge(c, left_on="rnaseq_id", right_on="specimenID")
abc.to_csv("../../comprehensive_xqtl_protocol_sample_index.tsv", sep="\t", index=False)

`ROSMAP_assay_rnaSeq_metadata.csv` can be downloaded from [ROSMAP metadata](https://www.synapse.org/#!Synapse:syn21088596), whereas `filtered_sample_index` is an internal file we used to determine which samples to use. To keep the data de-identified, this file will not be released to the public.

#### De-identifying the input phenotype data
In compliance with HIPAA and the ROSMAP regulations, we need to de-identify the data before releasing them to the public.

In [None]:
readarray -t array1 <  <(tail -49 ../../comprehensive_xqtl_protocol_sample_index.tsv | cut -f5)
readarray -t array2 <  <(tail -49 ../../comprehensive_xqtl_protocol_sample_index.tsv | cut -f3)

In [None]:
for i in ${!array1[*]} ; do mv ${array1[$i]}.subsetted.1.fastq Sample_${array2[$i]}.subsetted.1.fastq   ;done
for i in ${!array1[*]} ; do mv ${array1[$i]}.subsetted.2.fastq Sample_${array2[$i]}.subsetted.2.fastq   ;done
for i in ${!array1[*]} ; do mv ${array1[$i]}.subsetted.1.stderr Sample_${array2[$i]}.subsetted.1.stderr   ;done
for i in ${!array1[*]} ; do mv ${array1[$i]}.subsetted.1.stdout Sample_${array2[$i]}.subsetted.1.stdout   ;done
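Since `mv` overwrites silently, it can help to build and inspect the full list of renames before touching any file. The sketch below mirrors the loops above under stated assumptions: `plan_renames` and `apply_renames` are hypothetical helpers, and the old and new IDs are taken to be the two arrays read from the sample index.

```python
import os


def plan_renames(old_ids, new_ids, suffixes, prefix="Sample_"):
    """Return a list of (old_name, new_name) pairs without touching any file."""
    if len(old_ids) != len(new_ids):
        raise ValueError("old and new ID lists differ in length")
    if len(set(new_ids)) != len(new_ids):
        raise ValueError("new IDs are not unique; files would be overwritten")
    return [
        (f"{old}{suffix}", f"{prefix}{new}{suffix}")
        for old, new in zip(old_ids, new_ids)
        for suffix in suffixes
    ]


def apply_renames(plan, directory="."):
    """Rename every planned file that exists; return the pairs that were skipped."""
    skipped = []
    for old, new in plan:
        source = os.path.join(directory, old)
        if os.path.exists(source):
            os.rename(source, os.path.join(directory, new))
        else:
            skipped.append((old, new))
    return skipped
```

Printing the plan first gives a dry run; passing it to `apply_renames` afterwards performs the renames and reports any files that were expected but not found.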

#### Generating the input fastq list
The RNA calling section takes as input a list in the following format, which was generated manually. We allow two optional columns, `strand` and `read_length`, so that users can specify a different strand and read length for each sample. Including them is not necessary, however: our pipeline can detect the strand from the output of the STAR alignment.

***The following code is run in Python.***

In [None]:
import pandas as pd

abc = pd.read_csv("comprehensive_xqtl_protocol_sample_index.tsv", sep="\t")
abc = abc[["sample_id", "fq1", "fq2", "strand", "readLength"]]
# replace the original ID at the start of each fastq file name with the de-identified sample ID
abc["fq1"] = [".".join([x] + y.split(".")[1:]) for x, y in zip(abc.sample_id, abc.fq1)]
abc["fq2"] = [".".join([x] + y.split(".")[1:]) for x, y in zip(abc.sample_id, abc.fq2)]
# rename the columns to the names expected in the fastq list
abc.columns = ["ID", "fq1", "fq2", "strand", "read_length"]
abc.to_csv("xqtl_protocol_data.fastqlist", sep="\t", index=False)
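Before the list is handed to the RNA calling section, its layout can be checked with a few lines of standard-library Python. This is a minimal sketch: `check_fastq_list` is a hypothetical helper, and it assumes only what is stated above, namely that `ID`, `fq1` and `fq2` are required and that the file is tab-separated. The uniqueness check on `ID` is an added assumption.

```python
import csv

REQUIRED_COLUMNS = ("ID", "fq1", "fq2")


def check_fastq_list(lines):
    """Validate a tab-separated fastq list; return the number of samples.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    columns = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    if missing:
        raise ValueError(f"fastq list is missing required columns: {missing}")
    ids = [row["ID"] for row in reader]
    # Assumption: every sample appears on exactly one row.
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate sample IDs in fastq list")
    return len(ids)
```

A typical call would be `check_fastq_list(open("xqtl_protocol_data.fastqlist"))`, which should return 49 for the data set described here.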

#### Subsetting and Indexing the genotypes
Since we only use 49 samples, we extract those 49 samples from the genotype data to save memory and time.

In [None]:
cd ../
echo -e "old_name\tnew_name" > xqtl_protocol_data_sample_list
paste <(cut -f6 ../comprehensive_xqtl_protocol_data_sample_index.tsv ) <(cut -f1 ../comprehensive_xqtl_protocol_data_sample_index.tsv  ) >> xqtl_protocol_data_sample_mask

In [None]:
bcftools view DEJ_11898_B01_GRM_WGS_2017-05-15_21.recalibrated_variants.vcf.gz -S <(cat ../comprehensive_xqtl_protocol_data_sample_index.tsv | cut -f6 | tail -49 ) | \
bcftools reheader --samples xqtl_protocol_data_sample_mask  -Oz -o DEJ_11898_B01_GRM_WGS_2017-05-15_21.recalibrated_variants.xqtl_protocol_data.vcf

bcftools view DEJ_11898_B01_GRM_WGS_2017-05-15_22.recalibrated_variants.vcf.gz -S <(cat ../comprehensive_xqtl_protocol_data_sample_index.tsv | cut -f6 | tail -49 ) | \
bcftools reheader --samples xqtl_protocol_data_sample_mask  -Oz -o  DEJ_11898_B01_GRM_WGS_2017-05-15_22.recalibrated_variants.xqtl_protocol_data.vcf

bgzip DEJ_11898_B01_GRM_WGS_2017-05-15_21.recalibrated_variants.xqtl_protocol_data.vcf
bgzip DEJ_11898_B01_GRM_WGS_2017-05-15_22.recalibrated_variants.xqtl_protocol_data.vcf
tabix DEJ_11898_B01_GRM_WGS_2017-05-15_21.recalibrated_variants.xqtl_protocol_data.vcf.gz
tabix DEJ_11898_B01_GRM_WGS_2017-05-15_22.recalibrated_variants.xqtl_protocol_data.vcf.gz
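After subsetting and renaming, it is worth confirming that the output VCF contains the expected de-identified sample names. The sketch below reads them from the `#CHROM` header line using only the standard library; `vcf_sample_names` is a hypothetical helper, and it relies on the fact that bgzip-compressed files can be read by Python's `gzip` module.

```python
import gzip


def vcf_sample_names(path):
    """Return the sample names from the #CHROM header line of a gzipped VCF."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields; samples follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                # Reached the variant records without seeing a header line.
                break
    raise ValueError(f"no #CHROM header line found in {path}")
```

Comparing `len(vcf_sample_names(...))` with 49, and the names themselves with the new names in the sample mask, gives a quick sanity check on both output files.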

## Protocol 1: Molecular Phenotype Calling

### RNA Seq Alignment

In [None]:
sos run pipeline/RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --samples ROSMAP_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir ROSMAP_data/RNASeq/fastq \
    --container containers/rna_quantification.sif \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf -J 50 -q csg -c csg.yml

To align the reads with STAR and generate the bam_list recipe for the downstream molecular phenotype count matrices, run the command below. The `-J 50 -c csg.yml -q csg2` part is crucial because it requests the memory required for the STAR alignment.

In [None]:
nohup sos run pipeline/RNA_calling.ipynb STAR_output \
    --cwd output/rnaseq --samples ROSMAP_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir ROSMAP_data/RNASeq/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat -J 50 -c csg.yml -q csg2 --uncompressed &

### Gene expression bed file

In [None]:
sos run pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd output/rnaseq \
    --samples ROSMAP_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir ROSMAP_data/RNASeq/fastq \
    --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf.ref.flat  \
    --bam_list output/rnaseq/xqtl_protocol_data_bam_list  -J 50 -c csg.yml -q csg2 

In [None]:
sos run pipeline/bulk_expression_QC.ipynb qc \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_readsCount.gct.gz \
    --container containers/rna_quantification.sif 

In [None]:
nohup sos run pipeline/bulk_expression_normalization.ipynb normalize \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf  \
    --container containers/rna_quantification.sif \
    --count-threshold 1 --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq &

### Partition by chromosome

### Splicing count matrix

#### Leafcutter

In [None]:
sos run pipeline/splicing_calling.ipynb leafcutter \
    --cwd output/leaf_cutter/ \
    --samples output/rnaseq/xqtl_protocol_data_bam_list \
    --container containers/leafcutter.sif 

#### Psichomics


In [None]:
sos run pipeline/splicing_calling.ipynb psichomics \
    --cwd output/psichomics/ \
    --samples output/rnaseq/xqtl_protocol_data_bam_list \
    --splicing_annotation hg38_suppa.rds \
    --container containers/psichomics.sif

### Preparing the xqtl discovery pipeline
The commands for the analyses in the xqtl pipeline can be generated with the command generator we provide. The command generator requires what we call a recipe file. The code to generate it and the recipe we will be using are as follows:

## Protocol 2: Xqtl Discovery pipeline

### Genotype QC
It will take ~25G of memory and ~5 min to complete the VCF_QC step. 

In [None]:
sos run pipeline/VCF_QC.ipynb qc \
            --genoFile ROSMAP_data/Genotype/DEJ_11898_B01_GRM_WGS_2017-05-15_21.recalibrated_variants.xqtl_protocol_data.add_chr.vcf.gz \
                       ROSMAP_data/Genotype/DEJ_11898_B01_GRM_WGS_2017-05-15_22.recalibrated_variants.xqtl_protocol_data.add_chr.vcf.gz \
            --dbsnp-variants reference_data/00-All.add_chr.variants.gz \
            --reference-genome reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
            --cwd output/genotype --container containers/bioinfo.sif -J 2 -q csg2 -c csg.yml --mem 25G &

Since the genotype data are split by chromosome, we need to merge the output plink files. When the input VCF to VCF_QC covers the whole genome, this step can be skipped.

In [None]:
sos run pipeline/genotype_formatting.ipynb merge_plink \
            --genoFile `ls output/genotype/*.leftnorm.filtered.bed` \
            --cwd output/genotype --container containers/bioinfo.sif &
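The backtick `ls` in the merge command only lists the `.bed` files, but a PLINK binary fileset also needs the matching `.bim` and `.fam` files. The following is a minimal sketch of a check that can be run before merging; `incomplete_plink_filesets` is a hypothetical helper, and the glob pattern is copied from the command above.

```python
from pathlib import Path


def incomplete_plink_filesets(directory, pattern="*.leftnorm.filtered.bed"):
    """Return the names of .bed files that lack a matching .bim or .fam file."""
    incomplete = []
    for bed in sorted(Path(directory).glob(pattern)):
        # with_suffix swaps only the final extension, e.g. ".bed" -> ".bim".
        companions = [bed.with_suffix(".bim"), bed.with_suffix(".fam")]
        if not all(c.exists() for c in companions):
            incomplete.append(bed.name)
    return incomplete
```

An empty list from `incomplete_plink_filesets("output/genotype")` indicates that every per-chromosome fileset is complete.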

In [None]:
sos run pipeline/GWAS_QC.ipynb king \
   --cwd output/genotype \
   --genoFile output/genotype/xqtl_protocol_data.bed \
   --container containers/bioinfo.sif \
   --walltime 48h   --no-maximize_unrelated

### Data Preprocessing

### Association Testing

### Sumstat Merging