# PZQ-ER-ES RNA-seq


## Aim

We will analyze gene expression in juvenile and adult worms from the SmLE-PZQ-ER and SmLE-PZQ-ES populations to test whether differences in gene expression are associated with the phenotype. The hypothesis is that we should observe differences in expression between ER and ES adults but not necessarily between ER and ES juveniles, as juveniles naturally recover from the PZQ treatment independently of the adult status.


## Environment and data

In [None]:
conda env create -f .env/env.yml

In [2]:
# Activate the environment
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate PZQ-R


In [None]:
gdir="data/genome"
# -p also creates the parent "data" folder and is a no-op if the folder exists
mkdir -p "$gdir"

wget -P "$gdir" ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS14/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa.gz

pigz -d "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa.gz"

# Preparing indices
#bwa index "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa"
#samtools faidx "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa"
#gatk CreateSequenceDictionary -R "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa"

In [None]:
wget -P "$gdir" ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS14/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3.gz

pigz -d "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3.gz"

STAR will be used to align the data and RSEM to generate transcripts per million (TPM) values. Both tools require a preparation step for the reference genome.

### STAR reference genome

Creating a STAR reference genome requires an annotation file. The Sanger Institute provided us with a GFF file, which is a format that can normally be used with STAR. However, my first attempt to generate a STAR reference genome using the `--sjdbGTFtagExonParentTranscript Parent` option, as mentioned in the manual, did not produce gene counts after running STAR on a sample (the gene count file contained only the first 4 lines). This problem is very similar to [this one](https://groups.google.com/forum/#!msg/rna-star/oRvzihFXE8k/Xa-7YgUUBgAJ). I therefore converted the GFF file into a GTF file, which is the default format used by STAR, and this solved the problem.
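The symptom described above (a gene count file limited to its first 4 lines) can be checked automatically. The sketch below assumes the gene count file is STAR's `ReadsPerGene.out.tab`, whose first four lines are summary rows; the helper name `has_gene_counts` and the example path are hypothetical and not part of the original analysis.

```bash
# Hypothetical helper: succeeds only if a STAR gene count file contains
# gene rows beyond its four summary lines.
has_gene_counts() {
    local counts_file="$1"
    local n_lines
    n_lines=$(wc -l < "$counts_file")
    (( n_lines > 4 ))
}

# Example (the path is an assumption):
# has_gene_counts "data/libraries/sample1/ReadsPerGene.out.tab" || echo "No gene counts found"
```

Running such a check on a single test sample before launching the whole pipeline may avoid aligning every library against a reference that yields no counts.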

Because the data were generated on different platforms that produced different read lengths, I made two reference genomes using different values for the `--sjdbOverhang` option (read length minus 1), as recommended in the STAR documentation:
* A value of 75 for libraries that have 76 bp paired-end reads (the Protasio *et al.* 2012 data).
* A value of 99 for libraries that have 100 bp paired-end reads (all the others).
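The overhang values above follow the "read length minus 1" rule. As a minimal sketch, the read length can be measured from a FASTQ file and the overhang derived from it; the helper names `max_read_length` and `sjdb_overhang` and the example file name are hypothetical.

```bash
# Hypothetical helper: print the longest read length found in a FASTQ stream
# read from standard input (sequences are on every 4th line, starting at line 2).
max_read_length() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Hypothetical helper: sjdbOverhang is the read length minus 1.
sjdb_overhang() {
    echo $(( $1 - 1 ))
}

# Example (the file name is an assumption):
# len=$(zcat data/libraries/sample1/sample1_R1.fastq.gz | head -n 40000 | max_read_length)
# sjdb_overhang "$len"
```

Using the longest read rather than the first one is a deliberate choice here, since trimmed libraries may contain shorter reads.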

In [5]:
# Convert GFF into GTF file
gffread "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3" -T -o "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gtf"


In [None]:
# sjdbOverhang value (read length - 1); set it to match the libraries
i=139

# Make STAR ref folder (-p: no error if it already exists)
mkdir -p "$gdir/S.mansoni_STAR_${i}"

STAR --runMode genomeGenerate \
     --runThreadN $(nproc)    \
     --genomeDir "$gdir/S.mansoni_STAR_${i}" \
     --genomeFastaFiles "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa" \
     --sjdbGTFfile "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gtf" \
     --sjdbOverhang $i
#     --outFileNamePrefix "$gdir/"

# Move log
mv Log.out "$gdir/S.mansoni_STAR_${i}/"

May 01 15:11:21 ..... started STAR run
May 01 15:11:21 ... starting to generate Genome files
May 01 15:12:20 ... starting to sort Suffix Array. This may take a long time...
May 01 15:12:25 ... sorting Suffix Array chunks and saving them to disk...


### RSEM reference

RSEM requires a reference built from the annotation (here the GTF file) and the reference genome file, as mentioned in the documentation.

In [6]:
# Make RSEM ref folder (-p: no error if it already exists)
mkdir -p "$gdir/S.mansoni_RSEM"

rsem-prepare-reference --gtf "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gtf" \
        -p $(nproc) \
        "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa" \
        "$gdir/S.mansoni_RSEM/S.mansoni" > "$gdir/S.mansoni_RSEM/log"

In [3]:
gdir="data/genome"


## Data quality

In [13]:
resdir="results"
[[ -d "$resdir" ]] || mkdir "$resdir"

multiqc -ip -o "$resdir/1-report/" data/libraries

[INFO   ]         multiqc : This is MultiQC v1.8
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching   : /data/infectious/schistosome/06 - PZQ resistance/2020-04-19 PZQ-ES-ER juveniles RNA-seq/1-Analysis/data/libraries
Searching 312 files..  [####################################]  100%
[INFO   ]            star : Found 24 reports and 24 gene count files
[INFO   ]          fastqc : Found 48 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : results/1-report/multiqc_report.html
[INFO   ]         multiqc : Data        : results/1-report/multiqc_data
[INFO   ]         multiqc : MultiQC complete

## Alignment and quantification

A snakemake pipeline aligns the reads with STAR and performs transcript quantification with RSEM. It requires a cluster running Sun/Oracle Grid Engine. If the data or genome folders have been moved or renamed, the paths must be updated accordingly in the snakefile.
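Before submitting jobs to the cluster, it may help to confirm that the folders the pipeline relies on are where the snakefile expects them. The sketch below is not part of the original workflow: the helper name `missing_dirs` is hypothetical, and the folder names in the example are the ones used elsewhere in this notebook.

```bash
# Hypothetical helper: print every argument that is not an existing directory;
# the exit status is non-zero if at least one is missing.
missing_dirs() {
    local rc=0 d
    for d in "$@"; do
        if [[ ! -d "$d" ]]; then
            echo "$d"
            rc=1
        fi
    done
    return "$rc"
}

# Example:
# missing_dirs data/libraries data/genome || echo "Update the paths in the snakefile"
# A dry run then lists the planned jobs without running them:
# snakemake --snakefile snakefile -n
```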

In [None]:
# Directory to store status files
[[ ! -d status ]] && mkdir status

# Snakemake pipeline
snakemake --snakefile snakefile --cluster "qsub -V -cwd -o status -j y -r y -pe smp 12 -S /bin/bash" --jobs 24 -w 300

In [10]:
ldir="data/libraries"
qdir="$resdir/3-quantification"

[[ -d "$qdir" ]] || mkdir "$qdir"

rsem-generate-data-matrix "$ldir/"*/*isoforms.results > "$qdir/PZQ-ER-ES.isoform.counts.matrix"
rsem-generate-data-matrix "$ldir/"*/*genes.results > "$qdir/PZQ-ER-ES.gene.counts.matrix"

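A quick sanity check on the matrices is to count their sample columns. The sketch below assumes the header line written by `rsem-generate-data-matrix` starts with an empty field followed by one field per input file; the helper name `matrix_sample_count` is hypothetical. If the 24 STAR reports found by MultiQC correspond to 24 libraries, the expected count would be 24.

```bash
# Hypothetical helper: print the number of sample columns in a tab-separated
# matrix whose header is an empty first field followed by one field per sample.
matrix_sample_count() {
    awk -F'\t' 'NR == 1 { print NF - 1; exit }' "$1"
}

# Example:
# matrix_sample_count "$qdir/PZQ-ER-ES.gene.counts.matrix"
```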

## Analysis

We use R scripts for analyzing the data.

In [None]:
# GLM-PCA to test whether effects other than biological ones exist
Rscript scripts/RNA-seq_PCA.R

In [None]:
# Formal analysis of the complete RNA-seq data
Rscript scripts/RNA-seq_analysis.R

In [None]:
# Specific analysis of Smp_246790
Rscript scripts/RNA-seq_TRP_analysis.R