## Great, now that we discussed a little let's continue

**Given that the current approach utilized by the authors lacks reproducibility, we will explore an alternative method by leveraging nf-core pipelines for data analysis.**

**Please explain how we will achieve reproducibility for the course with this approach.**


Reproducibility can be improved by using an nf-core pipeline.
The process is transparent, since the tools used are easy to see, and the tool versions can be checked for each run. Additionally, Docker containers provide identical software environments across machines. Lastly, the strict documentation guidelines for nf-core pipelines improve replicability.

**You have successfully downloaded 2 of the fastq files we will use in our study.**

**What is the next step if we want to first have a count table and check the quality of our fastq files? What is the pipeline called to do so?**

First we want to run quality control on the FastQ files and perform read alignment.

The nf-core rnaseq pipeline covers both steps and produces the count table.

**Analyze the 2 files using an nf-core pipeline.**

**What does this pipeline do?**

The rnaseq pipeline is used to analyze RNA sequencing data with a reference genome and annotation. It performs quality control, trimming, and alignment, and produces a gene expression matrix and a Quality Control report.




**Which are the main tools that will be used in the pipeline?**



- **Merge re-sequenced FastQ files** (`cat`)
- **Auto-infer strandedness by subsampling and pseudoalignment** (`fq`, `Salmon`)
- **Read QC** (`FastQC`)
- **UMI extraction** (`UMI-tools`)
- **Adapter and quality trimming** (`Trim Galore!`)
- **Removal of genome contaminants** (`BBSplit`)
- **Removal of ribosomal RNA** (`SortMeRNA`)
- **Alignment and quantification routes:**
  - `STAR` -> `Salmon`
  - `STAR` -> `RSEM`
  - `HiSAT2` -> No quantification
- **Sort and index alignments** (`SAMtools`)
- **UMI-based deduplication** (`UMI-tools`)
- **Duplicate read marking** (`picard MarkDuplicates`)
- **Transcript assembly and quantification** (`StringTie`)
- **Create bigWig coverage files** (`BEDTools`, `bedGraphToBigWig`)
- **Extensive quality control:**
  - `RSeQC`
  - `Qualimap`
  - `dupRadar`
  - `Preseq`
  - `DESeq2`
- **Pseudoalignment and quantification** (`Salmon` or `Kallisto`; optional)
- **Present QC for various checks** (`MultiQC`, `R`):
  - Raw read
  - Alignment
  - Gene biotype
  - Sample similarity
  - Strand-specificity checks

**Like all other nf-core pipelines, the chosen pipeline takes a samplesheet as input.**

**Use Python and pandas to create the samplesheet for your 2 samples. Feel free to make use of the table you created earlier today.**

**Choose your sample names wisely, they must be the connection of the results to the metadata. If you can't find the sample in the metadata later, the analysis was useless.**

In [17]:
# Creating the sample sheet

'''
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
'''

import os

output_name = "samplesheet.csv"

# metadata
input_file = "data/input_day2_part2.txt"

# Directory containing the FastQ files
fastq_directory = "data/fastq" # fastq files omitted for size

# Read the input file
with open(input_file, "r") as file:
    samples = file.read().splitlines()

output_lines = ["sample,fastq_1,fastq_2,strandedness"]

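# Process each sample in the input
for sample in samples:
    # Extract the run accession number
    run_accession = sample.split("_")[0]

    # Find the corresponding FastQ files; sort them so that read 1 comes before
    # read 2, because os.listdir() returns entries in arbitrary order
    fastq_files = sorted(
        f for f in os.listdir(fastq_directory)
        if run_accession in f
    )

    # Paired-end data: expect exactly two files per run
    if len(fastq_files) != 2:
        raise ValueError(f"Expected 2 FastQ files for {run_accession}, found {len(fastq_files)}")

    # Create the output line with file paths the pipeline can resolve
    fastq_1 = os.path.join(fastq_directory, fastq_files[0])
    fastq_2 = os.path.join(fastq_directory, fastq_files[1])
    output_lines.append(f"{sample},{fastq_1},{fastq_2},auto")

# Write the samplesheet
with open(output_name, "w") as file:
    file.write("\n".join(output_lines) + "\n")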
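The task asks for pandas, so here is a minimal sketch of the same pairing logic that returns a `DataFrame`. The function name `build_samplesheet`, the `RUN0001`-style accessions and the file names are hypothetical placeholders, and the sketch assumes (as in the cell above) that the run accession is the part of the sample name before the first underscore.

```python
import pandas as pd

COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def build_samplesheet(sample_names, fastq_paths):
    """Pair each sample with its two FastQ files and return a samplesheet table."""
    rows = []
    for sample in sample_names:
        # Assumption: the run accession is the text before the first "_"
        accession = sample.split("_")[0]
        # Sorting puts the "_1" file before the "_2" file
        matches = sorted(p for p in fastq_paths if accession in p)
        if len(matches) != 2:
            raise ValueError(f"expected 2 FastQ files for {sample}, found {len(matches)}")
        rows.append({
            "sample": sample,
            "fastq_1": matches[0],
            "fastq_2": matches[1],
            "strandedness": "auto",
        })
    return pd.DataFrame(rows, columns=COLUMNS)


# Hypothetical accessions and file names, for illustration only
demo = build_samplesheet(
    ["RUN0001_Sham_oxy_1", "RUN0002_SNI_Sal_1"],
    [
        "data/fastq/RUN0002_2.fastq.gz",
        "data/fastq/RUN0001_1.fastq.gz",
        "data/fastq/RUN0001_2.fastq.gz",
        "data/fastq/RUN0002_1.fastq.gz",
    ],
)
print(demo.to_csv(index=False))
```

Calling `demo.to_csv("samplesheet.csv", index=False)` would write the file instead of printing it. Keeping the group and replicate in the sample name is what links the results back to the metadata later.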


In [None]:
# post here the command you used to run nf-core/rnaseq

!nextflow run nf-core/rnaseq \
    --input samplesheet.csv \
    --outdir rnaseq_output \
    --genome GRCm38 \
    -profile docker \
    --aligner hisat2 \
    --max_cpus 8 \
    --max_memory '15.GB'


**Explain all the parameters you set and why you set them in this way.**



--input samplesheet.csv     # The input samplesheet. Required.

--outdir rnaseq_output      # The output directory. Required.

--genome GRCm38             # Mouse reference genome. The mm10 reference files could not be downloaded, so the built-in --genome option is used with GRCm38 (the same mouse assembly) instead.

-profile docker             # Runs every tool in a Docker container, which is the recommended setup.

--aligner hisat2            # HISAT2 needs far less memory than the default STAR aligner, which is necessary with only 15.4 GB of memory.

--max_cpus 8                # Caps CPU usage at what the PC can provide.

--max_memory 15.GB          # Caps memory usage at what the PC can provide.
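The parameters above can also be assembled programmatically, which makes typos such as a stray space after `--` harder to introduce. This is a minimal sketch: `build_rnaseq_command` is a hypothetical helper, the values are the ones from this run, and the memory limit is written in the `15.GB` form that Nextflow uses for memory units.

```python
import shlex


def build_rnaseq_command(params, profile="docker"):
    """Assemble a `nextflow run nf-core/rnaseq` call from a dict of pipeline parameters."""
    cmd = ["nextflow", "run", "nf-core/rnaseq", "-profile", profile]
    for name, value in params.items():
        # Pipeline parameters take two dashes directly followed by the name
        cmd += [f"--{name}", str(value)]
    return cmd


# Values taken from the run described above
params = {
    "input": "samplesheet.csv",
    "outdir": "rnaseq_output",
    "genome": "GRCm38",
    "aligner": "hisat2",
    "max_cpus": 8,
    "max_memory": "15.GB",
}

command = build_rnaseq_command(params)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` when Nextflow is installed; printing it first makes the exact call easy to record alongside the results.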

## Browsing the results

**How did the pipeline perform?**

Since the pipeline took too long on my device, results from another run were used instead.

In those results, the pipeline was executed successfully on 15th August and took 1 h 24 min.

**Explain the quality control steps. Are you happy with the quality and why. If not, why not.**
**Please give additional information on :**

**- ribosomal rRNA**

**- Duplication**

**- GC content**

**What are the possible steps that could lead to poorer results?**

**Would you exclude any samples? If yes, which and why?**

Since 32 files were analyzed, I looked at two randomly chosen examples in detail. First, I looked at the FastQC reports:

_Sham_oxy_1_1_fastqc.html_ 

and

_SNI_Sal_1_1_fastqc.html_


_Sham_oxy_1_1_fastqc.html_ has generally good results. Towards the end of the reads the quality declines considerably, as expected, so read trimming is recommended here. Adapter sequence is also still present according to the FastQC report.

The GC content deviates strongly from the expected distribution and there are many overrepresented sequences. This could be a sign of rRNA contamination.


_SNI_Sal_1_1_fastqc.html_ has similar results, with better quality towards the end of the reads and fewer issues with the GC content. It also has some overrepresented sequences, but these are most likely adapter sequences.

Since these issues can be addressed by trimming, the pipeline should be able to improve the quality, and the reads can be used.


_Sham_oxy_1_fastqc.html_

- rRNA: 1.40% (acceptable level of rRNA contamination)
- Duplication: 38.9%
- GC content: 63%

_SNI_Sal_1_fastqc.html_

- rRNA: 0.02% (acceptable level of rRNA contamination)
- Duplication: 73.7%
- GC content: 48%


In the MultiQC report, some samples show high rRNA contamination, specifically _SNI_Sal_2_ and _SNI_Sal_4_. Additionally, those samples failed the strand check.
The higher duplication values in some samples can be explained by RNA-seq libraries often being dominated by the transcripts of a few highly expressed genes, so a higher-than-usual duplication level is expected.

I would exclude _SNI_Sal_2_ and _SNI_Sal_4_ for those reasons.

Additionally, _Sham_oxy_1_ maps poorly. MultiQC shows that less than 30% of its reads could be mapped because the reads were too short. For this reason, I would also exclude _Sham_oxy_1_.

The quality issues are likely due to poor library preparation. Including the low-quality samples could lead to a worse overall result.

Alternatively, given the small sample size, all samples could be kept for the analysis and removed afterwards if the downstream analysis shows any issues.
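The checks discussed above can be sketched as a small flagging step over the per-sample metrics. The metric values are the ones quoted for the two example samples; the cut-offs in `THRESHOLDS` and `GC_RANGE` are hypothetical values chosen for illustration and are not taken from the pipeline or the paper.

```python
# QC metrics quoted above for the two example samples (percentages)
qc = {
    "Sham_oxy_1": {"rrna": 1.40, "duplication": 38.9, "gc": 63.0},
    "SNI_Sal_1": {"rrna": 0.02, "duplication": 73.7, "gc": 48.0},
}

# Hypothetical cut-offs, for illustration only
THRESHOLDS = {"rrna": 5.0, "duplication": 70.0}
GC_RANGE = (40.0, 60.0)


def qc_flags(metrics):
    """Return the names of the metrics that fall outside the chosen limits."""
    flags = [name for name, limit in THRESHOLDS.items() if metrics[name] > limit]
    if not GC_RANGE[0] <= metrics["gc"] <= GC_RANGE[1]:
        flags.append("gc")
    return flags


report = {sample: qc_flags(metrics) for sample, metrics in qc.items()}
for sample, flags in report.items():
    print(sample, "->", ", ".join(flags) if flags else "no flags")
```

A flag here only marks a sample for a closer look. As noted above, high duplication on its own is expected in RNA-seq and is not a reason for exclusion.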

**What would you now do to continue the experiment? What are the scientists trying to figure out? Which packages on R or python would you use?**

To continue the experiment, we now want to compare gene expression between the four groups (Sham_oxy, SNI_oxy, Sham_Sal and SNI_Sal).


The paper states that differential expression analysis was performed.
The goal was to compare the effects of oxycodone withdrawal versus saline treatment in SNI and sham mice.


The ggVennDiagram R package can be used to generate Venn diagrams, and the R package pheatmap can be used to generate heatmaps for the different samples.
Pathway analysis can be conducted using Ingenuity Pathway Analysis (IPA).
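The Venn diagram step comes down to set overlaps between the lists of differentially expressed genes from each comparison. A minimal sketch with plain Python sets follows; the comparison names and gene identifiers are hypothetical placeholders, not results from this study.

```python
# Hypothetical DE gene lists; in practice these would come from the
# differential expression results of each comparison
de_genes = {
    "oxy_vs_sal_in_SNI": {"geneA", "geneB", "geneC", "geneD"},
    "oxy_vs_sal_in_sham": {"geneC", "geneD", "geneE"},
}


def venn_regions(set_a, set_b):
    """Split two gene sets into the three regions of a two-set Venn diagram."""
    return {
        "only_a": set_a - set_b,
        "shared": set_a & set_b,
        "only_b": set_b - set_a,
    }


regions = venn_regions(de_genes["oxy_vs_sal_in_SNI"], de_genes["oxy_vs_sal_in_sham"])
for name, genes in regions.items():
    print(name, len(genes), sorted(genes))
```

Genes in the `shared` region respond to oxycodone withdrawal in both SNI and sham mice, while the `only_a` and `only_b` regions are specific to one condition.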
