# Day 2

**Today, we will start using nf-core pipelines to find differentially abundant genes in our dataset.** 
**We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10**

**1. Please take some time to read through the paper and understand their approach, hypotheses and goals.**

**What was the objective of the study?**


To better understand the transcriptomic effects of chronic opioid exposure and physical dependence under chronic pain states in the brain reward circuitry, and the role of HDAC1/HDAC2 inhibition in this context.

They administered high doses of oxycodone, an opioid, to mice with prolonged spared nerve injury, followed by spontaneous withdrawal.


**What do the conditions mean?**

oxy: oxycodone injection


sal: saline injection

**What do the genotypes mean?**

SNI: spared nerve injury


Sham: sham (fake) surgery without nerve injury, used as the control

**Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.**

**What would you do?**

I would first check the quality of the raw reads and trim them where necessary. Then I would align the reads against a reference genome and quantify the expression per gene.
Afterwards, I would analyze and compare the differences in gene expression between the four groups.

**Which groups would you compare to each other?**

I would compare sal vs. oxy groups to analyze withdrawal and transcriptomic changes, and SNI vs. Sham groups to analyze chronic pain vs. no chronic pain.

**Please also mention which outcome you would expect to see from each comparison.**

Comparing sal vs. oxy groups shows the impact of opioids and withdrawal relative to a control. No signs of opioid usage or withdrawal are expected in sal mice, while they are expected in oxy mice.

Comparing SNI vs. Sham mice shows the difference between a chronic pain condition and no chronic pain. The Sham mice act as a control group in which no expression changes related to chronic pain are expected, while the SNI group should show them.
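
The comparisons above can be written down explicitly. The following is a minimal sketch that enumerates the four groups of the 2x2 design and one possible set of pairwise contrasts. Comparing within each level of the other factor is an assumption made here for illustration, not something taken from the paper.

```python
from itertools import product

# Factor levels as named in this notebook
conditions = ["Sal", "Oxy"]   # injection: saline vs. oxycodone
genotypes = ["Sham", "SNI"]   # surgery: sham vs. spared nerve injury

# The four experimental groups of the 2x2 design
groups = [f"{c}_{g}" for c, g in product(conditions, genotypes)]

# Assumed contrasts: drug effect within each surgery group,
# then injury effect within each injection group
contrasts = (
    [(f"Oxy_{g}", f"Sal_{g}") for g in genotypes]
    + [(f"{c}_SNI", f"{c}_Sham") for c in conditions]
)

for treated, reference in contrasts:
    print(f"{treated} vs. {reference}")
```

Keeping the reference group second in each pair makes the direction of a later fold change unambiguous.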

In [112]:
# sorting the table
import pandas as pd


# reference
'''
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
'''


# Read Excel file
df = pd.read_excel("data/conditions_runs_oxy_project.xlsx", sheet_name="Sheet1")


Sal_Sham = df[(df['condition: Sal'] == "x") & (df['Genotype: Sham'] == "x")]

Sal_SNI = df[(df['condition: Sal'] == "x") & (df['Genotype: SNI'] == "x")]
Oxy_Sham = df[(df['Condition: Oxy'] == "x") & (df['Genotype: Sham'] == "x")]
Oxy_SNI = df[(df['Condition: Oxy'] == "x") & (df['Genotype: SNI'] == "x")]

samples_Sal_Sham = Sal_Sham.iloc[:, 1]
samples_Sal_SNI = Sal_SNI.iloc[:, 1]
samples_Oxy_Sham = Oxy_Sham.iloc[:, 1]
samples_Oxy_SNI = Oxy_SNI.iloc[:, 1]


# Sorted table

sorted_Sal_Sham = pd.DataFrame({'sample': samples_Sal_Sham, 'fastq_1': None, 'fastq_2': None})
print("sorted_Sal_Sham")
print(sorted_Sal_Sham)
sorted_Sal_Sham.name = "Sal_Sham"

sorted_Sal_SNI = pd.DataFrame({'sample': samples_Sal_SNI, 'fastq_1': None, 'fastq_2': None})
print("sorted_Sal_SNI")
print(sorted_Sal_SNI)
sorted_Sal_SNI.name = "Sal_SNI"

sorted_Oxy_Sham = pd.DataFrame({'sample': samples_Oxy_Sham, 'fastq_1': None, 'fastq_2': None})
print("sorted_Oxy_Sham")
print(sorted_Oxy_Sham)
sorted_Oxy_Sham.name = "Oxy_Sham"

sorted_Oxy_SNI = pd.DataFrame({'sample': samples_Oxy_SNI, 'fastq_1': None, 'fastq_2': None})
print("sorted_Oxy_SNI")
print(sorted_Oxy_SNI)
sorted_Oxy_SNI.name = "Oxy_SNI"

sorted_Sal_Sham
         sample fastq_1 fastq_2
2   SRR23195507    None    None
7   SRR23195512    None    None
10  SRR23195515    None    None
15  SRR23195520    None    None
sorted_Sal_SNI
         sample fastq_1 fastq_2
0   SRR23195505    None    None
5   SRR23195510    None    None
8   SRR23195513    None    None
13  SRR23195518    None    None
sorted_Oxy_Sham
         sample fastq_1 fastq_2
1   SRR23195506    None    None
6   SRR23195511    None    None
9   SRR23195514    None    None
14  SRR23195519    None    None
sorted_Oxy_SNI
         sample fastq_1 fastq_2
3   SRR23195508    None    None
4   SRR23195509    None    None
11  SRR23195516    None    None
12  SRR23195517    None    None
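
An alternative to four separate tables is one tidy table with a `condition` and a `genotype` column. The sketch below uses a small stand-in for the Excel sheet: the column name `Run` is an assumption, the other headers are copied from the cell above, and the four rows are the first four runs from the printed output.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Excel sheet: an "x" marks the level that applies.
# "Run" is an assumed column name; the real sheet may label it differently.
wide = pd.DataFrame({
    "Run": ["SRR23195505", "SRR23195506", "SRR23195507", "SRR23195508"],
    "condition: Sal": ["x", None, "x", None],
    "Condition: Oxy": [None, "x", None, "x"],
    "Genotype: SNI": ["x", None, None, "x"],
    "Genotype: Sham": [None, "x", "x", None],
})

# Collapse the "x" columns into one label per factor.
# This assumes every run is either Sal or Oxy, and either SNI or Sham.
tidy = pd.DataFrame({
    "sample": wide["Run"],
    "condition": np.where(wide["condition: Sal"] == "x", "Sal", "Oxy"),
    "genotype": np.where(wide["Genotype: SNI"] == "x", "SNI", "Sham"),
})
tidy["group"] = tidy["condition"] + "_" + tidy["genotype"]
print(tidy)
```

With this layout, filtering, counting, and sorting all work on a single DataFrame.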


In [69]:
# 1. How many samples do you have per condition?

print("1. How many samples do you have per condition?\n")

samples_per_Sal = sorted_Sal_Sham.shape[0] + sorted_Sal_SNI.shape[0]

print("Samples per Sal ", samples_per_Sal)

samples_per_Oxy = sorted_Oxy_Sham.shape[0] + sorted_Oxy_SNI.shape[0]

print("Samples per Oxy ", samples_per_Oxy)

1. How many samples do you have per condition?

Samples per Sal  8
Samples per Oxy  8


In [70]:
# 2. How many samples do you have per genotype?

print("2. How many samples do you have per genotype?\n")

samples_per_Sham = sorted_Oxy_Sham.shape[0] + sorted_Sal_Sham.shape[0]

print("Samples per Sham ", samples_per_Sham)

samples_per_SNI = sorted_Oxy_SNI.shape[0] + sorted_Sal_SNI.shape[0]

print("Samples per SNI ", samples_per_SNI)

2. How many samples do you have per genotype?

Samples per Sham  8
Samples per SNI  8


In [71]:
# 3. How often do you have each condition per genotype?

print("3. How often do you have each condition per genotype?\n")

print("Sample for Sal_Sham ", sorted_Sal_Sham.shape[0])
print("Sample for Sal_SNI ", sorted_Sal_SNI.shape[0])
print("Sample for Oxy_Sham ", sorted_Oxy_Sham.shape[0])
print("Sample for Oxy_SNI ", sorted_Oxy_SNI.shape[0])

3. How often do you have each condition per genotype?

Sample for Sal_Sham  4
Sample for Sal_SNI  4
Sample for Oxy_Sham  4
Sample for Oxy_SNI  4
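
The three counting questions can also be answered in one step with `pd.crosstab`. This sketch rebuilds the design from the group membership printed above; the `design` table and its column names are introduced here for illustration.

```python
import pandas as pd

# Group membership copied from the printed tables above
groups = {
    "Sal_Sham": ["SRR23195507", "SRR23195512", "SRR23195515", "SRR23195520"],
    "Sal_SNI": ["SRR23195505", "SRR23195510", "SRR23195513", "SRR23195518"],
    "Oxy_Sham": ["SRR23195506", "SRR23195511", "SRR23195514", "SRR23195519"],
    "Oxy_SNI": ["SRR23195508", "SRR23195509", "SRR23195516", "SRR23195517"],
}

# One row per run, with the group name split into its two factors
rows = [
    {"sample": run, "condition": name.split("_")[0], "genotype": name.split("_")[1]}
    for name, runs in groups.items()
    for run in runs
]
design = pd.DataFrame(rows)

# Cell counts per condition and genotype; margins add the per-factor totals
counts = pd.crosstab(design["condition"], design["genotype"], margins=True)
print(counts)
```

The inner cells give the samples per condition and genotype combination, and the `All` row and column give the totals per genotype and per condition.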


**They were so kind to also provide you with the information of the number of bases per run, so that you can know how much space the data will take on your Cluster.<br>**
**Add a new column to your fancy table with this information (base_counts.csv) and sort your dataframe according to this information and the condition.**



In [91]:
# Read csv file
csv = pd.read_csv("data/base_counts.csv")

samples = [sorted_Oxy_SNI, sorted_Oxy_Sham, sorted_Sal_SNI, sorted_Sal_Sham]

# Add the base count of each run, looked up by run accession
bases_by_run = csv.set_index('Run')['Bases']
for table in samples:
    table['Bases'] = table['sample'].map(bases_by_run)
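
The task also asks for the table to be sorted by base count and condition. A minimal sketch of that step on a single combined table follows. The run IDs and base counts are invented placeholders, and the `design` layout with a `condition` column is an assumption.

```python
import pandas as pd

# Toy tables: run IDs and base counts are invented placeholders
design = pd.DataFrame({
    "sample": ["RUN_A", "RUN_B", "RUN_C", "RUN_D"],
    "condition": ["Oxy", "Sal", "Oxy", "Sal"],
})
base_counts = pd.DataFrame({
    "Run": ["RUN_D", "RUN_C", "RUN_B", "RUN_A"],
    "Bases": [400, 100, 200, 300],
})

# Attach the base counts by run accession, then drop the duplicate key column
merged = design.merge(
    base_counts, left_on="sample", right_on="Run", how="left"
).drop(columns="Run")

# Sort by condition first, then by size within each condition
ordered = merged.sort_values(["condition", "Bases"]).reset_index(drop=True)
print(ordered)
```

A left merge keeps every run from the design table, so a run missing from the base count file shows up as `NaN` instead of silently disappearing.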


**Then select the 2 smallest runs from your dataset**

In [109]:
# Track the two smallest runs as [bases, run accession, group name]
lowest = [[float('inf'), None, None], [float('inf'), None, None]]

for table in samples:
    for _, row in table.iterrows():
        entry = [row.Bases, row['sample'], table.name]
        if row.Bases < lowest[0][0]:
            # New minimum: the previous minimum becomes the runner-up
            lowest[1] = lowest[0]
            lowest[0] = entry
        elif row.Bases < lowest[1][0]:
            lowest[1] = entry


print("The smallest samples are")
for sample in lowest:
    print(sample[1], " with ", sample[0], " bases")



The smallest samples are
SRR23195516  with  6203117700  bases
SRR23195511  with  6456390900  bases
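
The same selection can be done with `DataFrame.nsmallest` once all runs are in one table. The sketch below uses invented run IDs and base counts. The values are deliberately in descending order, because that is the case in which a hand-written loop has to move the old minimum to second place.

```python
import pandas as pd

# Toy table: run IDs and base counts are invented placeholders
runs = pd.DataFrame({
    "sample": ["RUN_A", "RUN_B", "RUN_C", "RUN_D"],
    "group": ["Oxy_SNI", "Oxy_Sham", "Sal_SNI", "Sal_Sham"],
    "Bases": [300, 200, 100, 400],
})

# The two rows with the fewest bases, smallest first
smallest = runs.nsmallest(2, "Bases")

for row in smallest.itertuples(index=False):
    print(row.sample, "with", row.Bases, "bases")
```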


**and download them from SRA (maybe an nf-core pipeline can help here?...)**

In [111]:
# Creating the csv file

file_name = "results/ids.csv"

with open(file_name, "w") as file:
    for sample in lowest:
        file.write(f"{sample[1]}\n")

In [None]:
# Downloading the files

!nextflow run nf-core/fetchngs --input ./results/ids.csv --outdir ./results -profile docker
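
Before starting a long download it can help to check the ID file. The following sketch writes one accession per line and creates the target directory if it is missing. The helper `write_ids` and the regular expression are assumptions made for illustration: the pattern only accepts run accessions of the form SRR/ERR/DRR plus digits, which is deliberately narrower than what fetchngs may accept.

```python
import re
import tempfile
from pathlib import Path

# Assumed pattern for run accessions (SRR..., ERR..., DRR...)
RUN_ID = re.compile(r"^[SED]RR\d+$")

def write_ids(run_ids, path):
    """Write one run accession per line, rejecting malformed IDs."""
    bad = [run_id for run_id in run_ids if not RUN_ID.match(run_id)]
    if bad:
        raise ValueError(f"Not run accessions: {bad}")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # e.g. create results/
    path.write_text("\n".join(run_ids) + "\n")
    return path

# Usage with the two runs selected above, written to a temporary directory
out_dir = Path(tempfile.mkdtemp())
ids_file = write_ids(["SRR23195516", "SRR23195511"], out_dir / "results" / "ids.csv")
print(ids_file.read_text())
```

The path passed to `--input` has to be the path of the file that was actually written.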


**While your files are downloading, get back to the paper and explain how you would try to reproduce the analysis.<br>**
**When you are done with this, shout so we can discuss the different ideas.**


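To reproduce the results, I would check the quality of the raw reads, trim adapters and low-quality bases, align the reads of all samples to the mouse reference genome, and quantify the expression per gene. I would then run a differential expression analysis and compare the results between the different groups.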

**Use the pipeline to download the files. If it requires more specs, still write down the commands.**


In [121]:
# Create table for Day 2 Part 2 analysis
file_name = "results/input_day2_part2.txt"

with open(file_name, "w") as file:
    for sample in lowest:
        file.write(f"{sample[1]}_{sample[2]}\n")
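
If the next step expects a samplesheet in the format of the reference shown at the top of this notebook (`sample,fastq_1,fastq_2,strandedness`), it could be built as sketched below. The helper `build_samplesheet`, the `fastq` directory, and the `_1`/`_2` file names are hypothetical: the real paths depend on where the download step puts the files, and paired-end reads are assumed.

```python
import csv
import io

def build_samplesheet(runs, fastq_dir="fastq"):
    """Return samplesheet text for (run accession, group name) pairs.

    The FASTQ paths are hypothetical placeholders, not verified file names.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample", "fastq_1", "fastq_2", "strandedness"])
    for run_id, group in runs:
        writer.writerow([
            f"{run_id}_{group}",                 # same naming as the cell above
            f"{fastq_dir}/{run_id}_1.fastq.gz",  # assumed forward reads
            f"{fastq_dir}/{run_id}_2.fastq.gz",  # assumed reverse reads
            "auto",                              # as in the reference example
        ])
    return buffer.getvalue()

# The two smallest runs and their groups, taken from the output above
sheet = build_samplesheet([("SRR23195516", "Oxy_SNI"), ("SRR23195511", "Oxy_Sham")])
print(sheet)
```

Using the `csv` module instead of manual string joins keeps the quoting correct if a path ever contains a comma.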