# Reference Processing
This module performs various preprocessing steps for reference data.
In particular:
1. Convert gff3 to gtf

Input: an uncompressed gff3 file (i.e. one that can be viewed with `cat`).

Output: a gtf file.

2. Produce a gene-collapsed version of the gtf

Input: a gtf file.

Output: a gtf file with collapsed gene models.


3. Generate STAR index based on gtf and reference fasta

Input: a gtf file and an accompanying fasta file.

Output: A folder of STAR index.


4. Generate RSEM index based on gtf and reference fasta

Input: a gtf file and an accompanying fasta file.

Output: A folder of RSEM index.
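
Step 1 above expects an uncompressed gff3 file. As a minimal sketch (not part of the workflow itself), the following Python snippet checks whether a file is gzip-compressed by looking at its first two bytes. The function name `is_gzip_compressed` and the temporary file names are hypothetical and chosen only for illustration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def is_gzip_compressed(file_path):
    """Return True if the file starts with the gzip magic number."""
    with open(file_path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Small demonstration on throwaway files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "example.gff3")
    with open(plain_path, "w") as handle:
        handle.write("##gff-version 3\n")

    gz_path = os.path.join(tmp_dir, "example.gff3.gz")
    with gzip.open(gz_path, "wt") as handle:
        handle.write("##gff-version 3\n")

    print(is_gzip_compressed(plain_path))
    print(is_gzip_compressed(gz_path))
```

A file that fails this check would need to be decompressed before it is passed to the `gff3_to_gtf` step.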

In [None]:
[global]
# The output directory for generated files. MUST BE FULL PATH
parameter: wd = path("./")
cwd = wd
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""


In [None]:
[gff3_to_gtf]
parameter: gff3_file = path
input: gff3_file
output: f'{wd}/{_input:n}.gtf'
bash: container=container, expand= "${ }", stderr = f'{_input[0]}.stderr', stdout = f'{_input[0]}.stdout'
        gffread ${_input} -T -o ${_output}

### GTF reformatting

This step modifies the gtf file for the following reasons:
1. RSEM requires the GTF input to use the same chromosome name format as the fasta file.

**For STAR, this problem can be solved with the currently commented-out `--sjdbGTFchrPrefix "chr"` option.**

2. `collapse_annotation.py` from GTEx requires the gtf to have `transcript_type` instead of `transcript_biotype` in its annotation.
**This problem can be solved by modifying `collapse_annotation.py` while building the docker image.**

Once the problem with RSEM is solved, or when RSEM is no longer needed, the aforementioned remedies can be implemented and this step can be removed.
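
The two adjustments described above can be sketched in Python on a single GTF line. This is an illustrative sketch, not the workflow's implementation (the workflow step uses R). The function name `harmonize_gtf_line` and the flag `fasta_uses_chr` are hypothetical, and the sketch assumes tab-separated GTF lines with the attributes in the ninth column. Unlike the workflow step, which skips a fixed number of header rows, this sketch passes comment lines through unchanged.

```python
def harmonize_gtf_line(line, fasta_uses_chr):
    """Match the chromosome prefix to the FASTA and rename transcript_biotype."""
    if line.startswith("#") or not line.strip():
        return line  # header comments and blank lines are left untouched

    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")

    chrom = fields[0]
    if fasta_uses_chr and not chrom.startswith("chr"):
        fields[0] = "chr" + chrom          # FASTA uses "chr", GTF does not
    elif not fasta_uses_chr and chrom.startswith("chr"):
        fields[0] = chrom[len("chr"):]     # FASTA has no "chr", GTF does

    if len(fields) >= 9:
        # collapse_annotation.py expects transcript_type in the attributes
        fields[8] = fields[8].replace("transcript_biotype", "transcript_type")

    return "\t".join(fields) + newline


example = '1\thavana\ttranscript\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_biotype "lncRNA";\n'
print(harmonize_gtf_line(example, fasta_uses_chr=True), end="")
```

Only a leading `chr` is stripped or added, so contig names that merely contain the letters `chr` elsewhere are not altered.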

In [None]:
[chrom_reformating]
# Reference genome
parameter: gtf = path
parameter: fasta = path
parameter: empty_rows = 5
input: fasta, gtf
output:  f'{wd}/{_input[1]:bn}.reformated.gtf'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}',container=container
R: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    library("readr")
    library("stringr")
    library("dplyr")
    options(scipen = 999)
    fasta = system("head -1 ${_input[0]}",intern = TRUE)
    gtf = read_delim("${_input[1]}", col_names  = F,"\t", skip = ${empty_rows})
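    # Default to the unmodified table so gtf_mod exists even when the names already match
    gtf_mod = gtf
    if(!str_detect(fasta,">chr")){
        # FASTA has no "chr" prefix: strip a leading "chr" from the GTF
        gtf_mod = gtf%>%mutate(X1 = str_remove(X1,"^chr"))
    } else if(!any(str_detect(gtf$X1[1],"chr"))) {
        # FASTA uses "chr" but the GTF does not: add the prefix
        gtf_mod = gtf%>%mutate(X1 = paste0("chr",X1))
    }
    # any() is needed because if() requires a single TRUE/FALSE value
    if(any(str_detect(gtf_mod$X9,"transcript_biotype"))){gtf_mod = gtf_mod%>%mutate(X9 = str_replace_all(X9,"transcript_biotype","transcript_type"))}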
    gtf_mod%>%write.table("${_output}",sep = "\t",quote = FALSE,col.names = F,row.names = F)

In [None]:
[collapse]
parameter: gtf = path
input: gtf
output: f'{wd}/{_input:bn}.collapsed.gtf'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    collapse_annotation.py ${_input} ${_output}

## Generating indexing file for `STAR` 
This step generates the index files for STAR alignment. The index only needs to be generated once and can then be reused.

At least 40GB of memory is needed.
### Step Inputs:
* `wd`: the output directory; the index is written to `{wd}/STAR_Index`.
* `gtf` and `fasta`: paths to the reference annotation and sequence. Both need to be uncompressed.
* `sjdbOverhang`: specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to ReadLength-1, where ReadLength is the length of the reads.

### Step Output:
* Index files stored in `{wd}/STAR_Index`, which will be used by `STAR`
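
To make the ReadLength-1 rule concrete, here is a minimal Python sketch that derives `sjdbOverhang` from a read length and assembles the same STAR options used by the workflow step. The helper name `build_star_index_command` and the example paths are hypothetical. The sketch only builds the argument list and does not run STAR.

```python
def build_star_index_command(fasta, gtf, genome_dir, read_length, threads=8):
    """Return the STAR genomeGenerate command as a list of arguments."""
    if read_length < 2:
        raise ValueError("read_length must be at least 2")

    sjdb_overhang = read_length - 1  # ideal value: ReadLength - 1

    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", str(genome_dir),
        "--genomeFastaFiles", str(fasta),
        "--sjdbGTFfile", str(gtf),
        "--sjdbOverhang", str(sjdb_overhang),
        "--runThreadN", str(threads),
    ]


# Hypothetical paths, for illustration only.
command = build_star_index_command(
    fasta="reference.fa",
    gtf="annotation.gtf",
    genome_dir="STAR_Index",
    read_length=101,
)
print(" ".join(command))
```

With 101-base reads this yields `--sjdbOverhang 100`. The list form can be passed to `subprocess.run` if one wants to launch STAR outside the workflow.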

In [None]:
[STAR_indexing]

# The directory for STAR index
# Reference genome
parameter: gtf = path
parameter: fasta = path

# Length of the genomic sequence around annotated junctions; ideally ReadLength - 1
parameter: sjdbOverhang = 150
input: fasta, gtf
output: f'{wd}/STAR_Index/genomeParameters.txt'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '40G', tags = f'{step_name}_{_output[0]:bn}'
bash: container=container, expand= "${ }", stderr = f'{_input[0]}.stderr', stdout = f'{_input[0]}.stdout'
    STAR --runMode genomeGenerate \
         --genomeDir ${_output:d} \
         --genomeFastaFiles ${_input[0]} \
         --sjdbGTFfile ${_input[1]} \
         --sjdbOverhang ${sjdbOverhang} \
         --runThreadN ${numThreads} #--sjdbGTFchrPrefix "chr" 

## Generating indexing file for `RSEM`
This step generates the index files for `RSEM`. The index only needs to be generated once.

### Step Inputs:

* `wd`: the output directory; the index is written to `{wd}/RSEM_Index`.
* `gtf` and `fasta`: paths to the reference annotation and sequence.
* `name`: a required string parameter. It is declared by the step but not used in the command; the reference prefix is fixed to `rsem_reference`.

### Step Outputs:
* Index files stored in `{wd}/RSEM_Index`, which will be used by `RSEM`
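
The RSEM step declares `RSEM_Index/rsem_reference.idx.fa` as its output and passes the path with its last two extensions removed to `rsem-prepare-reference` as the reference name. A minimal Python sketch of that derivation follows. The helper names `rsem_reference_prefix` and `build_rsem_index_command` and the example paths are hypothetical, and the sketch only assembles the argument list without running RSEM.

```python
from pathlib import PurePosixPath


def rsem_reference_prefix(index_file):
    """Strip the last two extensions (e.g. '.idx.fa') to get the reference name."""
    path = PurePosixPath(index_file)
    return str(path.with_suffix("").with_suffix(""))


def build_rsem_index_command(fasta, gtf, index_file, threads=8):
    """Return the rsem-prepare-reference command as a list of arguments."""
    return [
        "rsem-prepare-reference",
        str(fasta),
        rsem_reference_prefix(index_file),
        "--gtf", str(gtf),
        "--num-threads", str(threads),
    ]


# Hypothetical paths, for illustration only.
command = build_rsem_index_command(
    fasta="reference.fa",
    gtf="annotation.gtf",
    index_file="/ref/RSEM_Index/rsem_reference.idx.fa",
)
print(" ".join(command))
```

Downstream RSEM commands would then be pointed at the same prefix, here `/ref/RSEM_Index/rsem_reference`.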

### Example Command

In [None]:
[RSEM_indexing]
# Output directory:

# Reference genome
parameter: gtf = path
parameter: fasta = path
parameter: name = str
input: fasta, gtf
output: f'{wd}/RSEM_Index/rsem_reference.idx.fa'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '40G', tags = f'{step_name}_{_output[0]:bn}'
bash: container=container, expand= "${ }", stderr = f'{_input[0]}.stderr', stdout = f'{_input[0]}.stdout'
    rsem-prepare-reference \
            ${_input[0]} \
            ${_output:nn} \
            --gtf ${_input[1]} \
            --num-threads ${numThreads}