# PZQ-ER-ES RNA-seq


## Aim

We analyze gene expression in juvenile and adult worms from the SmLE-PZQ-ER and SmLE-PZQ-ES populations to test whether differences in gene expression are associated with the phenotype. The hypothesis is that we should observe expression differences between ER and ES adults but not necessarily between ER and ES juveniles, as juveniles naturally recover from the PZQ treatment independently of the adult status.


## Environment and data

In [None]:
conda env create -f .env/env.yml

The cell below must be run each time a new Jupyter session is started.

In [None]:
# Activate the environment
source $(sed "s,/bin/conda,," <<<$CONDA_EXE)/etc/profile.d/conda.sh
conda activate PZQ-R

# Remove potential variable interferences
export PERL5LIB=""
export PYTHONNOUSERSITE=1
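
To illustrate what the `sed` expression in the activation cell does, here is a minimal sketch using a hypothetical `CONDA_EXE` value (the path `/opt/miniconda3` is an assumption for illustration, not the actual installation path). It strips the trailing `/bin/conda` to recover the conda base directory, which contains the `conda.sh` initialization script.

```shell
# Hypothetical value; on a real system CONDA_EXE is set by conda itself
conda_exe_demo="/opt/miniconda3/bin/conda"

# Strip "/bin/conda" to obtain the base directory of the installation
conda_base=$(printf '%s\n' "$conda_exe_demo" | sed "s,/bin/conda,,")

# Path of the script that defines the "conda activate" shell function
conda_init="$conda_base/etc/profile.d/conda.sh"
echo "$conda_init"
```

Sourcing this script is what makes `conda activate` available in a non-interactive shell such as a notebook cell.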

The cell below must be run only once at the time of the environment creation.

In [None]:
# Installing needed R packages
Rscript ".env/R package dependencies.R"

### Sequencing data

This step downloads the fastq files of the different samples from the SRA repository.

In [None]:
# Data directory
ldir="data/libraries"
[[ ! -d "$ldir" ]] && mkdir -p "$ldir"

In [None]:
# Bioproject
bioproject=     # ERP114942 !! TO BE UPDATED

# Download the run information of the project
wget -q -O runinfo "http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&rettype=runinfo&db=sra&term=${bioproject}"

# Field of interest (library name and weblink)
fdn=$(head -n 1 runinfo | tr "," "\n" | grep -w -n "SampleName" | cut -d ":" -f 1)
fdr=$(head -n 1 runinfo | tr "," "\n" | grep -w -n "Run" | cut -d ":" -f 1)
flk=$(head -n 1 runinfo | tr "," "\n" | grep -w -n "download_path" | cut -d ":" -f 1)

# Download fastq files
while read line
do
    # Filename, run and download link
    fln=$(cut -d "," -f $fdn <<<$line)
    run=$(cut -d "," -f $fdr <<<$line)
    lnk=$(cut -d "," -f $flk <<<$line)
    
    # Download
    echo "$fln"
    [[ ! -d "$ldir/$fln/" ]] && mkdir -p "$ldir/$fln/"
    retry=0
    
    while [[ $retry -lt 2 ]]
    do
        # Download sra file
        wget -q -c -O "$ldir/$fln/$run" "$lnk"
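        # Check integrity: leave the loop on success, count the failed attempt otherwise
        # (an explicit test is needed because ((retry++)) returns a non-zero status when retry is 0)
        if vdb-validate -q "$ldir/$fln/$run" &> /dev/null
        then
            break
        else
            retry=$((retry + 1))
        fi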
    done
    
    # If max download attempt reached, issue message and move to the next
    [[ $retry -eq 2 ]] && echo "$run: downloading problem" >> "$ldir/download_issue" && continue
    
    # Convert sra into fastq
    fastq-dump -O "$ldir/$fln/" --split-files "$ldir/$fln/$run"
    rm "$ldir/$fln/$run"
    
    # Rename file with more meaningful name
    mv "$ldir/$fln/${run}_1.fastq" "$ldir/$fln/${fln}_R1.fastq"
    mv "$ldir/$fln/${run}_2.fastq" "$ldir/$fln/${fln}_R2.fastq"
    
done < <(tail -n +2 runinfo | sed "/^$/d")

# Compress files
pigz "$ldir/"*/*

rm runinfo*
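
The column lookup used in the download cell can be tried on a small example. The sketch below assumes a hypothetical two-line runinfo excerpt (the run name, link and sample name are invented, and real runinfo files contain many more columns). The helper `col_index` is introduced only for this illustration; it wraps the same `head | tr | grep | cut` chain used for `fdn`, `fdr` and `flk`.

```shell
# Hypothetical runinfo excerpt, for illustration only
runinfo_demo=$(mktemp)
printf '%s\n' \
    'Run,ReleaseDate,download_path,SampleName' \
    'RUN0001,2020-01-01,https://example.org/RUN0001,sample_A' > "$runinfo_demo"

# Return the 1-based position of a column given its exact header name
# (grep -w avoids matching longer names that merely contain the word)
col_index() {
    head -n 1 "$2" | tr "," "\n" | grep -w -n "$1" | cut -d ":" -f 1
}

fdn=$(col_index "SampleName" "$runinfo_demo")
fdr=$(col_index "Run" "$runinfo_demo")
flk=$(col_index "download_path" "$runinfo_demo")

# Read the sample name of the first data row using the column index
first_sample=$(tail -n +2 "$runinfo_demo" | head -n 1 | cut -d "," -f "$fdn")
echo "SampleName is column $fdn, first sample: $first_sample"

rm -f "$runinfo_demo"
```

Looking columns up by name rather than by fixed position keeps the loop working if the column order of the runinfo file changes.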

### Genome data and annotation

The genome data is downloaded from [WormBase ParaSite](https://parasite.wormbase.org). We use release 14 (WBPS14). The data is then indexed for the different tools used.

In [None]:
gdir="data/genome"
[[ ! -d "$gdir" ]] && mkdir -p "$gdir"

# Genome
wget -P "$gdir" ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS14/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa.gz
pigz -d "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa.gz"

# Annotation
wget -P "$gdir" ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS14/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3.gz
pigz -d "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3.gz"
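
Before indexing, it can be useful to check that the sequence names used in the annotation all exist in the genome file, since the aligner and the quantification tool rely on matching names. The following is a minimal sketch on hypothetical miniature files (the sequences and gene IDs are invented); on the real data the two temporary files would be replaced by the downloaded FASTA and GFF files.

```shell
# Hypothetical miniature genome and annotation, for illustration only
fa_demo=$(mktemp); gff_demo=$(mktemp)
fa_names=$(mktemp); gff_names=$(mktemp)
printf '>SM_V7_1 demo\nACGT\n>SM_V7_2 demo\nACGT\n' > "$fa_demo"
printf '##gff-version 3\n' > "$gff_demo"
printf 'SM_V7_1\tdemo\tgene\t1\t4\t.\t+\t.\tID=gene:g1\n' >> "$gff_demo"
printf 'SM_V7_3\tdemo\tgene\t1\t4\t.\t+\t.\tID=gene:g2\n' >> "$gff_demo"

# Sequence names in the FASTA (header up to the first space)
grep "^>" "$fa_demo" | sed 's/^>//; s/ .*//' | sort -u > "$fa_names"

# Sequence names in the annotation (first column of non-comment lines)
grep -v "^#" "$gff_demo" | cut -f 1 | sort -u > "$gff_names"

# Names present in the annotation but absent from the genome
unknown=$(comm -13 "$fa_names" "$gff_names")
echo "annotation sequences missing from the genome: $unknown"

rm -f "$fa_demo" "$gff_demo" "$fa_names" "$gff_names"
```

An empty result means every annotated sequence is present in the genome file.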

STAR is used to align the data and RSEM is used to generate transcripts per million (TPM) values. Both tools require a preparation step for the reference genome.

### STAR reference genome

Creating a STAR reference genome requires an annotation file. The Sanger Institute provided us with a GFF file, a format that can normally be used with STAR. However, my first attempt to generate a STAR reference genome using the `--sjdbGTFtagExonParentTranscript Parent` option, as described in the manual, did not yield gene counts after running STAR on a sample (the gene count file contained only the first 4 lines). This problem is very similar to [this one](https://groups.google.com/forum/#!msg/rna-star/oRvzihFXE8k/Xa-7YgUUBgAJ). I therefore converted the GFF file into a GTF file, the default format used by STAR, which solved the problem.

Because the data used were generated on different platforms that produced different read sizes, I made two reference genomes using different values for the `--sjdbOverhang` option as recommended in the STAR documentation:
* A value of 75 for libraries that have 76 bp paired-end reads (the Protasio *et al.* 2012 libraries).
* A value of 99 for libraries that have 100 bp paired-end reads (all the others).
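
The rule behind these values is that `--sjdbOverhang` should ideally equal the maximum read length minus one. As a rough sketch, the value can be derived from a FASTQ file as shown below. The FASTQ content is a hypothetical single 8 bp read, and inspecting only the first 1000 reads is an arbitrary choice made for this illustration.

```shell
# Hypothetical FASTQ with a single 8 bp read, for illustration only
fq_demo=$(mktemp)
printf '%s\n' '@read1' 'ACGTACGT' '+' 'IIIIIIII' > "$fq_demo"

# Longest read among the first 1000 records
# (sequences are on every 4th line, starting at line 2)
max_len=$(head -n 4000 "$fq_demo" |
    awk 'NR % 4 == 2 { if (length($0) > m) m = length($0) } END { print m + 0 }')

# Overhang: maximum read length minus one
overhang=$((max_len - 1))
echo "read length: $max_len, overhang: $overhang"

rm -f "$fq_demo"
```

For compressed libraries, the same pipeline can be fed from `pigz -dc` instead of reading the file directly.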

In [None]:
# Convert GFF into GTF file
gffread "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3" -T -o "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gtf"
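
STAR relates exons to transcripts and genes through the `transcript_id` and `gene_id` attributes by default, so a quick check of the converted file can help diagnose empty gene counts. The sketch below runs on a hypothetical two-line GTF (the IDs are invented) in which the second exon deliberately lacks `gene_id`. It is only a sanity check, not part of the original workflow.

```shell
# Hypothetical GTF excerpt, for illustration only
gtf_demo=$(mktemp)
printf 'SM_V7_1\tdemo\texon\t100\t200\t.\t+\t.\ttranscript_id "t1"; gene_id "g1";\n' > "$gtf_demo"
printf 'SM_V7_1\tdemo\texon\t300\t400\t.\t+\t.\ttranscript_id "t1";\n' >> "$gtf_demo"

# Count exon records lacking either attribute in the 9th column
missing=$(awk -F "\t" '
    $3 == "exon" && ($9 !~ /transcript_id/ || $9 !~ /gene_id/) { n++ }
    END { print n + 0 }' "$gtf_demo")
echo "exon records missing gene_id or transcript_id: $missing"

rm -f "$gtf_demo"
```

A count of 0 on the real GTF file would indicate that every exon carries both attributes.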

In [None]:
# Overhang (read length minus 1)
i=149

# Make STAR ref folder
mkdir "$gdir/S.mansoni_STAR_${i}"

STAR --runMode genomeGenerate \
     --runThreadN $(nproc)    \
     --genomeDir "$gdir/S.mansoni_STAR_${i}" \
     --genomeFastaFiles "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa" \
     --sjdbGTFfile "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gtf" \
     --sjdbOverhang $i
#     --outFileNamePrefix "$gdir/"

# Move log
mv Log.out "$gdir/S.mansoni_STAR_${i}/"

### RSEM reference

RSEM requires a reference built from the annotation (here the GTF file generated above) and the reference genome file, as described in its documentation.

In [None]:
# Make RSEM ref folder
mkdir "$gdir/S.mansoni_RSEM"

rsem-prepare-reference --gtf "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gtf" \
        -p $(nproc) \
        "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa" \
        "$gdir/S.mansoni_RSEM/S.mansoni" > "$gdir/S.mansoni_RSEM/log"

## Data quality

In [None]:
resdir="results"
[[ -d "$resdir" ]] || mkdir "$resdir"

multiqc -ip -o "$resdir/1-report/" data/libraries

## Alignment and quantification

We use a Snakemake pipeline to align the reads with STAR and to quantify transcripts with RSEM. It requires a cluster running Sun/Oracle Grid Engine. If the data or genome folders have been modified, the paths must be updated accordingly in the snakefile.

In [None]:
# Directory to store status files
statdir=status
[[ ! -d "$statdir" ]] && mkdir "$statdir"

# Snakemake pipeline
snakemake --snakefile snakefile --cluster "qsub -V -cwd -o $statdir -j y -r y -pe smp 10 -S /bin/bash" --jobs 24 -w 120

In [None]:
ldir="data/libraries/"
qdir="$resdir/3-quantification"

[[ -d "$qdir" ]] || mkdir "$qdir"

rsem-generate-data-matrix "$ldir/"*/*isoforms.results > "$qdir/PZQ-ER-ES.isoform.counts.matrix"
rsem-generate-data-matrix "$ldir/"*/*genes.results > "$qdir/PZQ-ER-ES.gene.counts.matrix"
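
The resulting matrices use the input file paths as column names, which is cumbersome downstream. The sketch below shows one way to shorten them to sample names. It assumes a header made of tab-separated, double-quoted paths and runs on a hypothetical two-sample matrix (sample names, gene ID and counts are invented); the real header should be inspected before applying it.

```shell
# Hypothetical matrix mimicking the assumed layout, for illustration only
mat_demo=$(mktemp)
printf '\t"data/libraries//ER_1/ER_1.genes.results"\t"data/libraries//ES_1/ES_1.genes.results"\n' > "$mat_demo"
printf 'Smp_demo1\t10.00\t12.00\n' >> "$mat_demo"

# On the header line: drop quotes, directories and the RSEM suffix
clean=$(awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 {
        for (i = 2; i <= NF; i++) {
            gsub(/"/, "", $i)
            sub(/.*\//, "", $i)
            sub(/\.(genes|isoforms)\.results$/, "", $i)
        }
    }
    { print }' "$mat_demo")

clean_header=$(printf '%s\n' "$clean" | head -n 1)
printf '%s\n' "$clean"

rm -f "$mat_demo"
```

Count rows are printed unchanged; only the first line is rewritten.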

## Gene and isoform expression analysis


### Factors structuring the data

We first check that the transcriptomic variation is primarily explained by the sample type rather than by any other factor. For this, we perform a GLM-PCA on the expression data of each sample.

In [None]:
# GLM-PCA to test whether effects other than biological ones exist
Rscript scripts/RNA-seq_PCA.R

This shows that the samples cluster first by stage, then by sex. No other factor (such as sequencing lane) explains the structure of the data.


### Global analysis

We look at differences in gene and isoform expression between the ER and ES populations and between stages. We also highlight the genes under QTL 2 and QTL 3. The QTL were obtained from the genome-wide association study (GWAS) conducted on SmLE-PZQ-R (see the XX).

In [None]:
# Directory
genedir="results/2-QTL"
[[ ! -d "$genedir" ]] && mkdir -p "$genedir"

# BED intervals of the QTL boundaries
bed2=$(echo -e "SM_V7_2\t291191\t1457462")
bed3=$(echo -e "SM_V7_3\t22805\t4013538")

# List of genes under QTL of chr. 2
bedtools intersect -a "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3" -b <(echo "$bed2") -wa |\
    awk '$3 == "gene"' |\
    cut -f 9 |\
    cut -d ";" -f 1 |\
    cut -d ":" -f 2 |\
    sort -u > "$genedir/QTL_genes_chr2"

# List of genes under QTL of chr. 3
bedtools intersect -a "$gdir/schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3" -b <(echo "$bed3") -wa |\
    awk '$3 == "gene"' |\
    cut -f 9 |\
    cut -d ";" -f 1 |\
    cut -d ":" -f 2 |\
    sort -u > "$genedir/QTL_genes_chr3"
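
The gene ID extraction performed by the `cut` chain can be followed on a small example. The sketch below uses a hypothetical three-line GFF (the gene IDs are invented) and replaces `bedtools intersect` with a plain `awk` overlap test so that it runs without bedtools. The overlap condition assumes a 0-based, half-open BED interval compared against 1-based, inclusive GFF coordinates.

```shell
# Hypothetical GFF excerpt, for illustration only
gff_demo=$(mktemp)
printf 'SM_V7_2\tdemo\tgene\t300000\t310000\t.\t+\t.\tID=gene:Smp_demo1;Name=Smp_demo1\n' > "$gff_demo"
printf 'SM_V7_2\tdemo\tmRNA\t300000\t310000\t.\t+\t.\tID=transcript:Smp_demo1.1;Parent=gene:Smp_demo1\n' >> "$gff_demo"
printf 'SM_V7_2\tdemo\tgene\t2000000\t2010000\t.\t-\t.\tID=gene:Smp_demo2;Name=Smp_demo2\n' >> "$gff_demo"

# Keep gene features overlapping the chr. 2 QTL interval, then reduce
# the attribute column "ID=gene:<id>;..." to the bare gene ID
genes=$(awk -F "\t" -v chr="SM_V7_2" -v bs=291191 -v be=1457462 \
        '$1 == chr && $3 == "gene" && $4 <= be && $5 > bs' "$gff_demo" |
    cut -f 9 |
    cut -d ";" -f 1 |
    cut -d ":" -f 2 |
    sort -u)
echo "$genes"

rm -f "$gff_demo"
```

Only the first gene falls within the interval; the mRNA line is discarded by the feature type filter and the second gene lies outside the QTL boundaries.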

In [None]:
# Formal analysis of the complete RNA-seq data
Rscript scripts/RNA-seq_analysis.R

In [None]:
# Specific analysis of Smp_246790
Rscript scripts/RNA-seq_TRP_analysis.R

In [None]:
Rscript scripts/Fig.6.R