# Differential expression analysis


You have run the nf-core/rnaseq pipeline and checked the first quality-control metrics of your FASTQ files. That was only the primary analysis, however, and we now want to take it further.

Because the pipeline is computationally demanding, you ran it on only two of the 16 samples in the study yesterday. We therefore provide an essential output of the nf-core/rnaseq pipeline in the `data` folder: the combined expression matrix produced by Salmon, which gives the transcript abundance of each gene (rows) in each sample (columns).

We would now like to understand exactly how expression differs between our groups of mice.
Which pipeline would you use for this?

Name of pipeline: nf-core/differentialabundance

Have a close look at the pipeline's "Usage" page on the [nf-core docs](https://nf-co.re). You will need to create a samplesheet (based on the column names in the provided matrix).
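One possible way to derive the samplesheet from the matrix header is sketched below. It assumes that the first two columns of the matrix are gene annotation columns (`gene_id` and `gene_name`), that every sample name ends in `_<replicate number>`, and that the grouping column is called `condition`; all three are assumptions to check against your own file. The header line in the example is made up for illustration.

```python
import csv
import io
import re


def samplesheet_from_header(header_line, n_annotation_cols=2):
    """Derive (sample, condition) rows from the header line of a count matrix."""
    columns = header_line.rstrip("\n").split("\t")
    samples = columns[n_annotation_cols:]  # skip the gene annotation columns
    rows = []
    for sample in samples:
        # Assumption: the group is the sample name without its trailing replicate number.
        condition = re.sub(r"_\d+$", "", sample)
        rows.append({"sample": sample, "condition": condition})
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["sample", "condition"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative header only; read the first line of the real matrix instead.
example_header = "gene_id\tgene_name\tSNI_Sal_2\tSNI_oxy_1\tSham_oxy_1\n"
example_rows = samplesheet_from_header(example_header)
print(to_csv(example_rows))
```

For the real data, the header would come from the first line of `salmon.merged.gene_counts.tsv`, and the CSV text would be written to `samplesheet.csv`.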

Please paste the command you used here. You may need to inspect the provided expression matrix more closely and create additional files, such as a samplesheet (based on the column names) or a contrasts file (there happens to be one in `data/` that you can use).

```bash
nextflow run nf-core/differentialabundance \
    -r 1.5.0 \
    --input samplesheet.csv \
    --contrasts contrasts.csv \
    --matrix salmon.merged.gene_counts.tsv \
    --outdir output \
    -profile docker
```

Explain all the parameters you set and why you set them in this way. If you used or created additional files as input, explain what they are used for.

| Parameter                | Value                          | Explanation                                                                                           |
|---------------------------|--------------------------------|-------------------------------------------------------------------------------------------------------|
| `-r`                     | `1.5.0`                        | Specifies the pipeline release version to ensure reproducibility. Using version `1.5.0` guarantees that the same workflow, dependencies, and behavior are used consistently. |
| `--input`                | `samplesheet.csv`              | Samplesheet listing every sample ID (matching the column names of the count matrix) together with its metadata, such as the experimental group. The pipeline needs it to assign samples to the groups that are compared. |
| `--contrasts`            | `contrasts.csv`                | File defining experimental groups and conditions to be compared (e.g., case vs control). Used for differential expression analysis. |
| `--matrix`               | `salmon.merged.gene_counts.tsv`| Pre-computed count matrix file containing merged gene expression counts across samples, generated by Salmon quantification. |
| `--outdir`               | `output`                       | Directory where all pipeline results, reports, and processed files will be stored. Helps organize outputs cleanly. |
| `-profile`               | `docker`                       | Execution profile specifying containerization. Using `docker` ensures all software dependencies are encapsulated, improving portability and reproducibility. |
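Because the contrasts file refers to groups defined in the samplesheet, it can be worth checking that the two files agree before launching the pipeline. The sketch below assumes the contrasts layout `id,variable,reference,target` described on the Usage page and a grouping column named `condition`; the contrast IDs and sample rows are hypothetical, and the `contrasts.csv` provided in `data/` may look different.

```python
import csv
import io

# Hypothetical file contents, for illustration only.
CONTRASTS_CSV = """id,variable,reference,target
SNI_oxy_vs_SNI_Sal,condition,SNI_Sal,SNI_oxy
Sham_oxy_vs_Sham_Sal,condition,Sham_Sal,Sham_oxy
"""

SAMPLESHEET_CSV = """sample,condition
SNI_Sal_1,SNI_Sal
SNI_oxy_1,SNI_oxy
Sham_Sal_1,Sham_Sal
Sham_oxy_1,Sham_oxy
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def check_contrasts(contrasts, samples):
    """Return a list of human-readable problems; an empty list means the files agree."""
    problems = []
    for contrast in contrasts:
        variable = contrast["variable"]
        if not samples or variable not in samples[0]:
            problems.append(f"{contrast['id']}: column '{variable}' missing from samplesheet")
            continue
        levels = {row[variable] for row in samples}
        for role in ("reference", "target"):
            if contrast[role] not in levels:
                problems.append(
                    f"{contrast['id']}: {role} '{contrast[role]}' not found in '{variable}'"
                )
    return problems


problems = check_contrasts(read_rows(CONTRASTS_CSV), read_rows(SAMPLESHEET_CSV))
print(problems or "all contrasts match the samplesheet")
```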

What were the outputs of the pipeline?

The “tables” folder contains the result tables computed from the provided expression matrix, including the differential expression statistics.
The “plots” directory presents these results as more accessible figures.
The “report” directory contains a “study.html” file that summarises the key statistics together with the generated plots.
Overall, the pipeline performed a differential expression analysis that shows which genes are significantly up- or downregulated between the studied conditions.

Would you exclude any samples? If yes, which and why?

SNI_Sal_2 and SNI_Sal_4 contribute strongly to the variance shown in the PCA. In the dendrogram they cluster together but are clearly separated from the other samples.

Sham_oxy_1 exhibits an absolute MAD score greater than 4. Although it was not flagged as an outlier by the built-in detection method, a value above 4 can still be considered indicative of a potential outlier.
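To make the idea of a MAD score concrete, here is a minimal sketch of a generic median-absolute-deviation score: the distance of each value from the median, divided by the scaled MAD. It is not necessarily the exact computation used in the pipeline report. The per-sample values and sample names are invented, and both the cutoff of 4 and the scaling constant 1.4826 are assumptions.

```python
from statistics import median


def mad_scores(values, scale=1.4826):
    """Return (value - median) / (scale * MAD) for each named value."""
    center = median(values.values())
    mad = scale * median(abs(v - center) for v in values.values())
    if mad == 0:
        raise ValueError("MAD is zero; scores are undefined for these values")
    return {name: (value - center) / mad for name, value in values.items()}


# Invented per-sample summary values (for example, a mean correlation to all other samples).
example_values = {
    "sample_a": 0.98,
    "sample_b": 0.97,
    "sample_c": 0.99,
    "sample_d": 0.98,
    "sample_e": 0.80,
}

scores = mad_scores(example_values)
# Assumed cutoff: flag samples whose absolute score exceeds 4.
flagged = [name for name, score in scores.items() if abs(score) > 4]
print(flagged)
```

In this toy example only `sample_e` lies far enough from the median to be flagged; a stricter or looser cutoff changes which samples are reported, which is why a sample can look suspicious without being flagged automatically.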

How many genes were differentially expressed in each contrast? Does this confirm what the paper mentions?

SNI_oxy versus SNI_Sal in Condition: 1 up, 17 down
Sham_oxy versus Sham_Sal in Condition: 7 up, 0 down
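Counts like these can be re-derived from a differential expression table. The sketch below assumes a tab-separated table with DESeq2-style columns `log2FoldChange` and `padj`, and it assumes the thresholds adjusted p-value ≤ 0.05 and |log2 fold change| ≥ 1; check the parameters of your own run before relying on them. The table content is made up.

```python
import csv
import io

# Made-up table for illustration; real tables are in the pipeline's output folder.
EXAMPLE_TABLE = (
    "gene_id\tlog2FoldChange\tpadj\n"
    "geneA\t2.1\t0.001\n"    # significant, up
    "geneB\t-1.5\t0.01\n"    # significant, down
    "geneC\t0.3\t0.0001\n"   # significant but fold change too small
    "geneD\t-3.0\tNA\n"      # no adjusted p-value, skipped
    "geneE\t1.2\t0.2\n"      # fold change large enough but not significant
)


def count_de_genes(table_text, max_padj=0.05, min_abs_log2fc=1.0):
    """Count up- and downregulated genes that pass both thresholds."""
    up = down = 0
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        if row["padj"] in ("", "NA") or row["log2FoldChange"] in ("", "NA"):
            continue  # genes without a test result cannot be called
        padj = float(row["padj"])
        log2fc = float(row["log2FoldChange"])
        if padj > max_padj or abs(log2fc) < min_abs_log2fc:
            continue
        if log2fc > 0:
            up += 1
        else:
            down += 1
    return up, down


up, down = count_de_genes(EXAMPLE_TABLE)
print(f"{up} up, {down} down")
```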

TBD

The paper mentions differentially expressed genes in three brain regions: the NAc, mPFC and VTA. Briefly explain what these three regions are.

- NAc (Nucleus Accumbens): Main region related to the brain’s reward system.
- mPFC (medial Prefrontal Cortex): Relevant for decision-making, emotion regulation, and executive control.
- VTA (Ventral Tegmental Area): Region relevant for dopamine release, central to reward, learning, and motivation.

Is there any way to tell from the paper and its Materials and Methods section which genes are included in these regions?

TBD

Once you have your list of differentially expressed genes, do you think just communicating those to the biologists would be sufficient? What does the publication state?

TBD

Please reproduce the Venn diagram from Figure 3, not taking the brain regions into account but only the contrasts mentioned.
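A Venn diagram is a drawing of set overlaps, so the numbers in each region can be computed before anything is plotted. The sketch below does this for two contrasts using Python sets. The gene identifiers are placeholders, not results from this analysis, and the drawing itself is left to the plotting library of your choice.

```python
def venn_regions(genes_a, genes_b):
    """Split two gene sets into the three regions of a two-set Venn diagram."""
    return {
        "only_a": genes_a - genes_b,
        "both": genes_a & genes_b,
        "only_b": genes_b - genes_a,
    }


# Placeholder gene lists; replace them with the DE genes of each contrast.
de_genes_contrast_a = {"gene1", "gene2", "gene3", "gene4"}
de_genes_contrast_b = {"gene3", "gene4", "gene5"}

regions = venn_regions(de_genes_contrast_a, de_genes_contrast_b)
region_sizes = {name: len(genes) for name, genes in regions.items()}
print(region_sizes)
```

The three sizes are the numbers that go into the left-only, shared, and right-only areas of the diagram.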

TBD