## Great, now that we have discussed a little, let's continue

Given that the authors' current approach lacks reproducibility, we will explore an alternative: analysing the data with nf-core pipelines.

Please explain how we will achieve reproducibility for the course with this approach.


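nf-core pipelines follow strict guidelines and standardized operations: each pipeline release is versioned, every tool runs in a container or environment with a fixed software version, and all samples are processed with the same parameters, so the analysis can be repeated on another machine.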

You have successfully downloaded 2 of the fastq files we will use in our study.

What is the next step if we first want a count table and a quality check of our fastq files? What is the pipeline for this called?

The pipeline is called rnaseq (https://nf-co.re/rnaseq/). To run it, we need to prepare the samplesheet so that it matches the example the authors show. Next, the pipeline is run with the following command:
nextflow run nf-core/rnaseq \
    --input <SAMPLESHEET> \
    --outdir <OUTDIR> \
    --gtf <GTF> \
    --fasta <GENOME FASTA> \
    -profile <docker/singularity/.../institute>

Analyze the 2 files using an nf-core pipeline.

What does this pipeline do?

Which are the main tools that will be used in the pipeline?

The pipeline has these steps:

    1 Merge re-sequenced FastQ files (cat)
    2 Auto-infer strandedness by subsampling and pseudoalignment (fq, Salmon)
    3 Read QC (FastQC)
    4 UMI extraction (UMI-tools)
    5 Adapter and quality trimming (Trim Galore!)
    6 Removal of genome contaminants (BBSplit)
    7 Removal of ribosomal RNA (SortMeRNA)
    8 Choice of multiple alignment and quantification routes (For STAR the sentieon implementation can be chosen):
        STAR -> Salmon
        STAR -> RSEM
        HiSAT2 -> NO QUANTIFICATION
    9 Sort and index alignments (SAMtools)
    10 UMI-based deduplication (UMI-tools)
    11 Duplicate read marking (picard MarkDuplicates)
    12 Transcript assembly and quantification (StringTie)
    13 Create bigWig coverage files (BEDTools, bedGraphToBigWig)
    14 Extensive quality control:
        RSeQC
        Qualimap
        dupRadar
        Preseq
        DESeq2
        Kraken2 -> Bracken on unaligned sequences; optional
    15 Pseudoalignment and quantification (Salmon or ‘Kallisto’; optional)
    16 Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks (MultiQC, R)

The tools are given in parentheses in the list above.

Like all other nf-core pipelines, the chosen pipeline takes a samplesheet as input.

Use Python and pandas to create the samplesheet for your 2 samples. Feel free to make use of the table you created earlier today.

Choose your sample names wisely: they are the link between the results and the metadata. If you can't find a sample in the metadata later, the analysis was useless.

In [6]:
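# Build the nf-core/rnaseq samplesheet from the fetchngs output
import pandas as pd

# Samplesheet written by nf-core/fetchngs, one row per downloaded sample
sampleSheet = pd.read_csv("fetchngs-out/samplesheet/samplesheet.csv")

# Keep only the columns that nf-core/rnaseq expects
sampleSheet = sampleSheet[['sample', 'fastq_1', 'fastq_2', 'strandedness']]

# Write the samplesheet without the pandas index column
sampleSheet.to_csv("samplesheet.csv", index=False)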
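Before launching the pipeline, it can help to sanity-check the samplesheet. The following is a minimal sketch, not part of nf-core itself: `check_samplesheet` is a hypothetical helper, the example rows are made up, and the rules it checks (no spaces in sample names, gzipped FastQ files, strandedness being one of `auto`, `forward`, `reverse`, `unstranded`) reflect our understanding of the nf-core/rnaseq samplesheet format and should be confirmed against the pipeline documentation.

```python
REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2", "strandedness")
# Assumed set of accepted values; check the nf-core/rnaseq usage docs
ALLOWED_STRANDEDNESS = {"auto", "forward", "reverse", "unstranded"}


def check_samplesheet(rows):
    """Return a list of human-readable problems found in the samplesheet rows.

    `rows` is a list of dicts, one per samplesheet line.
    """
    problems = []
    for number, row in enumerate(rows, start=1):
        missing = [column for column in REQUIRED_COLUMNS if column not in row]
        if missing:
            problems.append(f"row {number}: missing columns {missing}")
            continue
        sample = str(row["sample"]).strip()
        if not sample or " " in sample:
            problems.append(f"row {number}: sample name is empty or contains spaces")
        if not str(row["fastq_1"]).endswith((".fastq.gz", ".fq.gz")):
            problems.append(f"row {number}: fastq_1 is not a gzipped FastQ file")
        if row["strandedness"] not in ALLOWED_STRANDEDNESS:
            problems.append(f"row {number}: unknown strandedness {row['strandedness']!r}")
    return problems


# Hypothetical rows for illustration; real rows come from the samplesheet
good_row = {"sample": "SAMPLE_A", "fastq_1": "a_1.fastq.gz",
            "fastq_2": "a_2.fastq.gz", "strandedness": "auto"}
bad_row = {"sample": "sample B", "fastq_1": "b_1.txt",
           "fastq_2": "", "strandedness": "sense"}
print(check_samplesheet([good_row, bad_row]))
```

With pandas, the rows can be obtained from the data frame with `sampleSheet.to_dict("records")`.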

Explain all the parameters you set and why you set them in this way.



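    -profile docker: every tool runs in its Docker container, so the software versions are fixed
    --input: the samplesheet created above
    --gtf: annotation file from NCBI (reference); alternatively, --genome GRCm38 selects a preconfigured reference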
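The parameters above can be collected in one place before launching the pipeline. The following is a minimal sketch under stated assumptions: `build_rnaseq_command` is a hypothetical helper, the file names (`samplesheet.csv`, `results`, `genome.gtf`, `genome.fa`) and the release number are illustrative only, and the script only builds and prints the command rather than running it. Pinning a release with Nextflow's `-r` option is one way to make sure a re-run uses the same pipeline code.

```python
import shlex


def build_rnaseq_command(samplesheet, outdir, gtf, fasta,
                         profile="docker", revision=None):
    """Assemble the nf-core/rnaseq command line as a list of arguments."""
    cmd = ["nextflow", "run", "nf-core/rnaseq"]
    if revision is not None:
        # -r pins the pipeline release so that a re-run uses the same code
        cmd += ["-r", revision]
    cmd += [
        "--input", samplesheet,
        "--outdir", outdir,
        "--gtf", gtf,
        "--fasta", fasta,
        "-profile", profile,
    ]
    return cmd


# Hypothetical paths and release number, for illustration only
command = build_rnaseq_command(
    samplesheet="samplesheet.csv",
    outdir="results",
    gtf="genome.gtf",
    fasta="genome.fa",
    revision="3.14.0",
)
print(shlex.join(command))
```

Once the files exist, the list could be handed to `subprocess.run(command, check=True)`; keeping the printed command in the notebook documents exactly how the results were produced.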

## Browsing the results

How did the pipeline perform?

Explain the quality control steps. Are you happy with the quality, and why? If not, why not?
Please give additional information on:
- ribosomal RNA (rRNA)
- Duplication
- GC content

What are the possible steps that could lead to poorer results?

Quality control starts by merging re-sequenced FastQ files of the same sample and inferring the library’s strandedness. Raw reads are assessed with FastQC for base quality, adapter contamination, and base composition. FastQC also reports the GC content, which reveals bias or strong deviations between samples, so that samples affected by technical artifacts can be excluded. Low-quality bases and adapters are trimmed, and ribosomal RNA sequences are filtered out. The cleaned reads are then aligned, sorted, and indexed, and duplicates are marked. RSeQC, Qualimap, dupRadar, and Preseq are used to check alignment quality, coverage, strand specificity, and duplication levels. Finally, MultiQC combines all results into one report so that sample quality and consistency can be reviewed easily.
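To make the GC content check more concrete, here is a minimal sketch of how a per-read GC fraction can be computed. It is not FastQC's implementation: `gc_fraction` and `mean_gc` are hypothetical helpers, the reads are toy strings rather than course data, and ambiguous bases such as `N` are simply ignored.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the A/C/G/T bases of one read."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    # Reads without any unambiguous base get a GC fraction of 0.0
    return gc / total if total else 0.0


def mean_gc(reads):
    """Average per-read GC fraction over a collection of reads."""
    values = [gc_fraction(read) for read in reads]
    return sum(values) / len(values)


# Toy reads for illustration only
toy_reads = ["ACGT", "GGCC", "ATAT", "GCNA"]
for read in toy_reads:
    print(read, round(gc_fraction(read), 3))
print("mean GC:", round(mean_gc(toy_reads), 3))
```

A sample whose distribution of per-read GC fractions is shifted or has extra peaks compared with the other samples is a candidate for contamination or library bias.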

Any of these steps could lead to poorer results, because they rely on thresholds that, if set incorrectly, filter out too much or too little of the data.

Would you exclude any samples? If yes, which and why?

Samples 2 and 4 should be excluded because of their projections in the PCA: they lie very far away from all the other samples, which cluster together. If we removed these samples, the PCA could more accurately capture what we expect to be real variance between the remaining samples rather than noise we are not interested in.
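The visual judgement on the PCA plot can be backed by a simple numeric check. The sketch below is an illustration under several assumptions, not the pipeline's method: the count matrix is made up (samples in rows, genes in columns), a plain log2(count + 1) transform stands in for the transformation the pipeline's DESeq2 step would apply, and the cutoff of 3 median absolute deviations is an arbitrary choice. `pca_scores` and `flag_outliers` are hypothetical helper names.

```python
import numpy as np


def pca_scores(counts, n_components=2):
    """Project samples (rows) onto the top principal components of log-counts."""
    logged = np.log2(np.asarray(counts, dtype=float) + 1.0)
    centered = logged - logged.mean(axis=0)
    # Rows of vt are the principal axes; u * s gives the sample coordinates
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def flag_outliers(scores, n_mads=3.0):
    """Flag samples unusually far from the median position in PCA space."""
    center = np.median(scores, axis=0)
    dist = np.linalg.norm(scores - center, axis=1)
    mad = np.median(np.abs(dist - np.median(dist)))
    return dist > np.median(dist) + n_mads * mad


# Made-up counts: six samples, four genes; rows 2 and 4 differ strongly
toy_counts = [
    [100, 200, 50, 400],
    [900, 20, 600, 30],
    [110, 190, 55, 410],
    [10, 2000, 5, 4000],
    [95, 210, 48, 390],
    [105, 205, 52, 405],
]
flags = flag_outliers(pca_scores(toy_counts))
for index, is_outlier in enumerate(flags, start=1):
    print(f"sample {index}: {'outlier' if is_outlier else 'ok'}")
```

A flagged sample should still be checked against the metadata before removal, since a biologically distinct group can also sit far from the rest.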

What would you now do to continue the experiment? What are the scientists trying to figure out? Which packages in R or Python would you use?

The next step is to perform differential expression analysis. The objective is to discover genes that are differentially expressed between groups, leading to a better understanding of chronic pain and treatment response. We could use R packages such as ggVennDiagram and pheatmap to visualize the results. In Python, Matplotlib or Plotly could be used for volcano plots or MA plots that highlight genes that are differentially expressed with a significant p-value.
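As an illustration of what a volcano plot encodes, the sketch below turns differential expression results into plot coordinates and labels. Everything in it is an assumption for demonstration: the gene names and values are invented, the cutoffs (absolute log2 fold change of at least 1, p-value below 0.05) are common conventions rather than values from this study, and `classify_gene` and `volcano_points` are hypothetical helpers.

```python
import math


def classify_gene(log2_fold_change, p_value, lfc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene as up, down, or not significant for a volcano plot."""
    if p_value < p_cutoff and log2_fold_change >= lfc_cutoff:
        return "up"
    if p_value < p_cutoff and log2_fold_change <= -lfc_cutoff:
        return "down"
    return "not significant"


def volcano_points(results):
    """Convert (gene, log2FC, p-value) tuples into (gene, x, y, label) tuples.

    x is the log2 fold change and y is -log10 of the p-value.
    """
    return [
        (gene, lfc, -math.log10(p), classify_gene(lfc, p))
        for gene, lfc, p in results
    ]


# Invented results for illustration only
toy_results = [
    ("geneA", 2.5, 0.001),
    ("geneB", -1.8, 0.01),
    ("geneC", 0.3, 0.5),
]
for point in volcano_points(toy_results):
    print(point)
```

The resulting x and y values can be passed to a scatter plot in Matplotlib or Plotly, coloured by label.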