# mRNA mapping using Salmon

Followed documentation: https://salmon.readthedocs.io/en/latest/building.html

## 1. Installation and preparation

Downloaded the reference files from:
https://www.ensembl.org/info/data/ftp/index.html

In [None]:
bash generateDecoyTranscriptome.sh -g reference/mm10.fa -t reference/Mus_musculus.GRCm38.cdna.all.fa -a reference/Mus_musculus.GRCm38.97.gtf -o index/

In [None]:
salmon index -t index/gentrome.fa -i transcripts_index --decoys index/decoys.txt -k 31

In [None]:
salmon quant -i transcripts_index/ --libType A -r ~/data/mouse_brain_GSE100265/fastq/mRNA_SRP096017/SRR5144100_1.fastq --validateMappings -o output/SRR5144100

In [None]:
bash runSalmonForList.sh fastq_list_mRNA.txt ~/methods/salmon/output ~/methods/salmon/transcripts_index/

## 3. Sanity Check

Created sanity_check/reference/cdr1as.fa and sanity_check/reference/cdr1as_location.gtf, containing only the sequence and the location of Cdr1as.
In the knockout mice, the genomic locus of Cdr1as has been removed. As a sanity check, I want to verify that this knockout effect is visible in the mRNA data. Because Cdr1as is not annotated in the GTF file, the effect cannot be seen in the data mapped so far.
"For example, the splice sites of the well-known circRNAs circSRY and circCDR1as are not annotated in linear RNA transcripts, and therefore these circRNAs will not be reported by reference-guided methods." (DOI: 10.1038/srep38820)

### Determining the coordinates of Cdr1as

According to the supplementary material (DOI: 10.1126/science.aam8526), the Cdr1as transcript was defined on the positive strand at chrX:58436423-58439349. These are coordinates in the mm9 reference. As I am using the mm10 reference, the corresponding coordinates have to be computed. I did this using the Batch Coordinate Conversion (liftOver) tool from UCSC.
The coordinates for Cdr1as in the mm10 mouse reference are chrX:61183248-61186174.
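
The same conversion can probably be done locally with the UCSC `liftOver` command-line binary instead of the web tool. The following is only a sketch: it assumes `liftOver` is on the PATH and that the chain file `mm9ToMm10.over.chain.gz` has been downloaded from UCSC into the working directory. The BED file names are made up for illustration. BED uses 0-based start coordinates, so the start is decremented by one before conversion and incremented again for display.

```shell
# Write the mm9 interval as BED (0-based start, end unchanged).
# chrX:58436423-58439349 (1-based) becomes start 58436422 in BED.
printf 'chrX\t%d\t%d\tCdr1as\t0\t+\n' $((58436423 - 1)) 58439349 > cdr1as_mm9.bed

# Assumption: liftOver binary and chain file are available locally.
if command -v liftOver >/dev/null 2>&1 && [ -f mm9ToMm10.over.chain.gz ]; then
    # Arguments: input BED, chain file, converted output, unmapped records
    liftOver cdr1as_mm9.bed mm9ToMm10.over.chain.gz cdr1as_mm10.bed cdr1as_unmapped.bed
    # Convert back to 1-based coordinates for display.
    awk '{ printf "%s:%d-%d\n", $1, $2 + 1, $3 }' cdr1as_mm10.bed
else
    echo "liftOver or chain file not found; skipping conversion" >&2
fi
```

If anything ends up in `cdr1as_unmapped.bed`, the interval could not be converted cleanly and should be checked by hand.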

### Reference preparation

I manually created a GTF file containing the coordinates of Cdr1as. Using samtools, I extracted the FASTA sequence of that region.

In [None]:
samtools faidx ~/methods/salmon/sanity_check/reference/mm10.fa chrX:61183248-61186174 > ~/methods/salmon/sanity_check/reference/cdr1as.fa
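
The manually created GTF file is not shown here. A minimal sketch of how such a file could be written is given below. The `source` column (`manual`) and the `gene_id`/`transcript_id` values are assumptions for illustration, not the contents of the actual file. An `exon` line is included because, as far as I can tell, the decoy script reads `exon` features from the annotation.

```shell
# Hypothetical IDs; coordinates are the mm10 ones determined above.
gtf=cdr1as_location.gtf
attrs='gene_id "Cdr1as"; transcript_id "Cdr1as";'

# GTF columns: seqname, source, feature, start, end, score, strand, frame, attributes
printf 'chrX\tmanual\ttranscript\t61183248\t61186174\t.\t+\t.\t%s\n' "$attrs" > "$gtf"
printf 'chrX\tmanual\texon\t61183248\t61186174\t.\t+\t.\t%s\n' "$attrs" >> "$gtf"

# Quick check: every line must have exactly 9 tab-separated fields.
awk -F'\t' 'NF != 9 { bad = 1 } END { exit bad }' "$gtf" && echo "GTF looks well-formed"
```

The chromosome name (`chrX`) has to match the naming used in the genome FASTA; otherwise the region will not be found.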

### Mapping using Salmon

In [None]:
bash ~/methods/salmon/generateDecoyTranscriptome.sh -g ~/methods/salmon/sanity_check/reference/mm10.fa -t ~/methods/salmon/sanity_check/reference/cdr1as.fa -a ~/methods/salmon/sanity_check/reference/cdr1as_location.gtf -o ~/methods/salmon/sanity_check/index

In [None]:
salmon index -t ~/methods/salmon/sanity_check/index/gentrome.fa -i ~/methods/salmon/sanity_check/transcripts_index --decoys ~/methods/salmon/sanity_check/index/decoys.txt -k 31

In [None]:
salmon quant -i ~/methods/salmon/sanity_check/transcripts_index/ --libType A -r ~/data/mouse_brain_GSE100265/fastq/mRNA_SRP096017/SRR5144100_1.fastq --validateMappings -o ~/methods/salmon/sanity_check/output/SRR5144100

In [None]:
bash ~/circRNA-detection/scripts/runSalmonForList.sh ~/circRNA-detection/scripts/fastq_list_mRNA.txt ~/methods/salmon/sanity_check/output/ ~/methods/salmon/sanity_check/transcripts_index/
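
To read off the knockout effect, the Cdr1as row can be pulled from each sample's `quant.sf`. The sketch below assumes one subdirectory per sample under the output directory (as in the single-sample command above) and that the target name equals the FASTA header written by `samtools faidx`, i.e. `chrX:61183248-61186174`. The function name `collect_target` is made up for this example.

```shell
# Print sample name, TPM and NumReads for one target from every quant.sf
# found under an output directory (one subdirectory per sample assumed).
collect_target() {
    outdir=$1
    target=$2
    printf 'sample\tTPM\tNumReads\n'
    for qf in "$outdir"/*/quant.sf; do
        [ -f "$qf" ] || continue
        sample=$(basename "$(dirname "$qf")")
        # quant.sf columns: Name, Length, EffectiveLength, TPM, NumReads
        awk -F'\t' -v s="$sample" -v t="$target" \
            '$1 == t { print s "\t" $4 "\t" $5 }' "$qf"
    done
}

# Example call (target name assumed to match the samtools faidx header):
# collect_target ~/methods/salmon/sanity_check/output "chrX:61183248-61186174"
```

Since the sanity-check index contains a single target transcript, TPM is normalised over that one target only, so `NumReads` is likely the more informative column when comparing wild-type and knockout samples.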

### Todo

Redo the mapping using a non-automatic libType. Check the results using --libType SR (stranded, reverse).
From paper:
"Reads mapping to genomic features were counted using htseq-count tool 10 (51) in the stranded mode (--stranded=reverse), with Ensembl release 67 used as a genomic feature reference (52). The reference GTF file, downloaded from Ensembl, was extended with Cdr1as transcript, which was defined on the positive strand at chrX:58436423-58439349."
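
A possible form of the rerun is sketched below. It uses the same index and FASTQ paths as the sanity check above and only changes `--libType`. The helper `run_quant` and the output directory `output_SR` are assumptions for this example. For single-end reads, `SR` in Salmon should correspond to the `--stranded=reverse` setting of htseq-count quoted above. The last command looks up which library type Salmon inferred in the earlier automatic run, assuming that run wrote its usual `lib_format_counts.json`.

```shell
# Run salmon quant on one single-end FASTQ with an explicit library type.
# Arguments: index directory, libType, FASTQ file, output directory.
run_quant() {
    salmon quant -i "$1" --libType "$2" -r "$3" --validateMappings -o "$4"
}

fq=~/data/mouse_brain_GSE100265/fastq/mRNA_SRP096017/SRR5144100_1.fastq

if command -v salmon >/dev/null 2>&1 && [ -f "$fq" ]; then
    run_quant ~/methods/salmon/sanity_check/transcripts_index/ SR "$fq" \
        ~/methods/salmon/sanity_check/output_SR/SRR5144100   # hypothetical output directory

    # Compare with the library type inferred by the earlier --libType A run.
    grep '"expected_format"' \
        ~/methods/salmon/sanity_check/output/SRR5144100/lib_format_counts.json
fi
```

Writing to a separate output directory keeps the automatic and the `SR` results side by side for comparison.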