# What to expect

In notebook 3A we ran a differential gene expression analysis on the example dataset <i>Schistosoma mansoni</i> and visualised the results in a volcano plot. 

In this notebook we will apply the same methods to our dataset of choice.

# Set up
First, we need to import the required libraries and install PyDESeq2 again.

In [None]:
# import required libraries
import pandas as pd

# install PyDESeq2 and import the required classes
! pip install --quiet pydeseq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

We now need to create the dds object again, as we will use it for the next steps of the analysis.

<div class="alert alert-block alert-warning">
    
Create the counts and the metadata tables and restrict them so that they only contain the conditions we want to compare.

In [None]:
# load the counts and metadata again - remember that we have to transpose the ReadsPerGene table to use it in PyDESeq2
counts = 
metadata = 

# restrict to the 2 stages we want to compare.
counts_s = 
metadata_s = 
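
The two steps above (transpose, then restrict) can be sketched on a made-up table. This is only an illustration: the gene IDs, sample names, the `stage` column and its values are all assumptions, and your own files and column names will differ.

```python
import pandas as pd

# Toy ReadsPerGene-style table: genes as rows, samples as columns (made-up values)
reads_per_gene = pd.DataFrame(
    {
        "s1": [10, 200, 0],
        "s2": [12, 180, 1],
        "s3": [50, 90, 3],
        "s4": [55, 85, 2],
        "s5": [5, 5, 5],
    },
    index=["geneA", "geneB", "geneC"],
)

# PyDESeq2 expects samples as rows and genes as columns, so transpose
counts = reads_per_gene.T

# Toy metadata: one row per sample, indexed by the same sample names
metadata = pd.DataFrame(
    {"stage": ["early", "early", "late", "late", "other"]},
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Keep only the samples that belong to the two stages being compared
keep = metadata["stage"].isin(["early", "late"])
metadata_s = metadata[keep]
counts_s = counts.loc[metadata_s.index]

print(counts_s.shape)  # (samples, genes)
```

Selecting `counts_s` with `metadata_s.index` keeps the two tables in the same sample order, which is what the dataset object relies on.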

Now that we have the counts and metadata, we can create the DESeq2 dataset again.

In [None]:
# create DESeq2 dataset object using the design factor appropriate to your dataset
dds = DeseqDataSet(
    counts=counts_s,
    metadata=metadata_s,
    design=,
    refit_cooks=True
)

# Differential Expression analysis

We will now apply the `deseq2` method to our dds object. Remember that this method normalises the data, estimates the dispersion and calculates the log fold change (LFC) based on the design factor.

In [None]:
# Run DESeq2
dds.deseq2()

To perform statistical analysis, use the class `DeseqStats` on our dds object, and store the output in a new object called "stat_res". 

In [None]:
stat_res=DeseqStats(dds, contrast=[<design>, <stumpy OR wildtype 22-24 hrs post invasion>, <slender OR wildtype 16hrs post invasion>])

Now, we have to generate a summary of the statistical analysis contained in the "stat_res" object. To do that, we use the `summary` method.

In [None]:
stat_res.summary()

Store the results in a dataframe called "res", so you can work with the results.

To do this, access the `results_df` attribute of your stat_res object.

In [None]:
res =
res

To make sure you understand the differential expression analysis, do the exercise below. The answers will help you complete the bioinformatics analysis summary.

<div class="alert alert-block alert-warning">
    
- Pick the first gene in the res dataframe that has a Log2FoldChange higher than 1 or smaller than -1
- Find the reads for that gene in each of your samples. You can find them by exploring the "ReadPerGene.csv" file that you saved in workshop 2B
- Explain how the expression of that gene changes between samples and how that relates to the Log2FoldChange value shown in your differential expression results
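
As a rough guide for the last point, the relationship between counts and the log2 fold change can be sketched with made-up numbers. This is a naive ratio of means, not the calculation PyDESeq2 performs (it fits a model to normalised counts), so the value in your results will not match a hand calculation exactly. The counts below are assumptions for illustration only.

```python
import numpy as np

# Made-up normalised counts for one gene, two replicates per condition
reference = np.array([100.0, 120.0])  # the reference level of the contrast
tested = np.array([420.0, 460.0])     # the level being compared against it

# Naive fold change: mean expression in the tested level over the reference level
fold_change = tested.mean() / reference.mean()
log2_fold_change = np.log2(fold_change)

# Here the gene is 4 times higher in the tested level, so log2FC is 2
print(fold_change, log2_fold_change)
```

A positive value means the gene is more highly expressed in the tested level (the second element of the contrast) than in the reference level (the third element); a negative value means the opposite.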

In [None]:
#Add your code here

You can use this text cell to make notes, or add the answers straight into your analysis summary

\

\

\

# Cleaning and exploring the results

As we saw in notebook 3A, there might be p-values of 0.0 in your analysis results. Replace those 0.0 values with a very small number; otherwise taking -log10 of them later gives infinite values that cannot be plotted.
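
One possible way to do this, shown on a made-up table rather than your results: the name `toy_res`, its values, and the choice of `np.finfo(float).tiny` as the "very small number" are assumptions for this sketch, and any small positive value would serve.

```python
import numpy as np
import pandas as pd

# Toy results table (made-up values) with p-values reported as exactly 0.0
toy_res = pd.DataFrame(
    {"pvalue": [0.0, 0.03, 0.5], "padj": [0.0, 0.04, 0.6]},
    index=["geneA", "geneB", "geneC"],
)

# Smallest positive normal float; an arbitrary choice of "very small number"
tiny = np.finfo(float).tiny

# Replace exact zeros so that -log10(padj) stays finite later on
toy_res[["pvalue", "padj"]] = toy_res[["pvalue", "padj"]].replace(0.0, tiny)

print((toy_res["padj"] == 0).sum())  # number of zeros left in padj
```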

In [None]:
import numpy as np

# replace p-values of 0 with a very small number


In [None]:
# make a new folder to save the differential expression analysis results
! mkdir -p "analysis/<dataset>/de"
# save the results with a sensible name


Now remove genes with very low expression, using a baseMean threshold of 10.
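
A minimal sketch of this kind of filter, using a made-up table (the name `toy_res` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy results table (made-up values)
toy_res = pd.DataFrame(
    {"baseMean": [3.2, 10.0, 250.7], "log2FoldChange": [2.1, -0.4, 1.5]},
    index=["geneA", "geneB", "geneC"],
)

# Keep genes whose baseMean is at least 10, i.e. drop those with baseMean < 10
toy_filtered = toy_res[toy_res["baseMean"] >= 10]

print(list(toy_filtered.index))
```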

In [None]:
# remove results with baseMean<10


<div class="alert alert-block alert-warning">

Find out the following - the answers will be helpful to complete the bioinformatics analysis summary:

- How many genes are significantly differentially expressed?
- For how many of these genes is the fold change (FC) greater than 2 or less than 0.5?
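
The counting logic can be sketched with boolean masks on a made-up table. The name `toy_res` and its values are assumptions; the sketch takes padj<0.05 as the significance cut-off used later in this notebook, and uses the fact that FC greater than 2 corresponds to log2FoldChange greater than 1, while FC less than 0.5 corresponds to log2FoldChange less than -1.

```python
import pandas as pd

# Toy results table (made-up values)
toy_res = pd.DataFrame(
    {
        "log2FoldChange": [2.3, -1.8, 0.4, -0.2, 1.5],
        "padj": [0.001, 0.02, 0.01, 0.7, 0.3],
    },
    index=["geneA", "geneB", "geneC", "geneD", "geneE"],
)

# True for genes whose adjusted p-value is below the cut-off
significant = toy_res["padj"] < 0.05

# True for genes whose expression at least doubles or halves
large_change = toy_res["log2FoldChange"].abs() > 1

# Summing a boolean Series counts the True values
n_significant = int(significant.sum())
n_sig_large = int((significant & large_change).sum())

print(n_significant, n_sig_large)
```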

In [None]:
#Add your code here

# Visualisation

Visualise your results in a volcano plot.

In [None]:
import matplotlib.pyplot as plt

<div class="alert alert-block alert-warning">

Using matplotlib (plt), make a scatter plot that: 
* in the X axis plots the log2FoldChange values from the "res" dataframe 
* in the Y axis plots the -log10 of the padj values from the "res" dataframe

In [None]:
# First, create a new column in the dataframe res that contains the -log10(padj)

In [None]:
# Now, make the scatter plot 

<div class="alert alert-block alert-warning">

Make the volcano plot fancier by 

1- colouring dots depending on:
* whether the corresponding gene is up- or downregulated -> We will consider that a gene is up- or downregulated if its expression at least doubles or halves between the two conditions
* whether the change in expression of the corresponding gene is significant -> We will use padj<0.05

2- adding:

* axes labels
* lines at the threshold values
* legend

3- save it as a png image so you can use it later
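
The steps in this list can be sketched end to end on made-up values. Everything specific here is an assumption for illustration: the name `toy_res`, its numbers, the colours, the labels and the output file name. The `matplotlib.use("Agg")` line is only there so the sketch runs without a display; leave it out inside a notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy results table (made-up values)
toy_res = pd.DataFrame(
    {
        "log2FoldChange": [2.3, -1.8, 0.4, -0.2, 1.5],
        "padj": [0.001, 0.02, 0.01, 0.7, 0.3],
    }
)
toy_res["neg_log10_padj"] = -np.log10(toy_res["padj"])

# Thresholds: expression doubles or halves (log2FC of 1 or -1), padj < 0.05
lfc_cut, padj_cut = 1, 0.05
sig = toy_res["padj"] < padj_cut
up = toy_res[sig & (toy_res["log2FoldChange"] > lfc_cut)]
down = toy_res[sig & (toy_res["log2FoldChange"] < -lfc_cut)]

fig, ax = plt.subplots()
# Plot every gene in grey first, then overplot the up and down subsets
ax.scatter(toy_res["log2FoldChange"], toy_res["neg_log10_padj"],
           color="grey", label="not significant")
ax.scatter(down["log2FoldChange"], down["neg_log10_padj"],
           color="blue", label="downregulated")
ax.scatter(up["log2FoldChange"], up["neg_log10_padj"],
           color="red", label="upregulated")

ax.set_xlabel("log2 fold change")
ax.set_ylabel("-log10(padj)")

# Dashed lines at the thresholds used for the colouring
for x in (-lfc_cut, lfc_cut):
    ax.axvline(x, color="grey", linestyle="--")
ax.axhline(-np.log10(padj_cut), color="grey", linestyle="--")
ax.legend()

# Save as a png (here into the system temp folder)
out_path = os.path.join(tempfile.gettempdir(), "toy_volcano.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```

Drawing all genes first and then overplotting the coloured subsets means that the significant points sit on top of the grey ones.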

In [None]:
# define which parameters determine if a gene is significantly up or down

# plot all the genes and label as non-significant

# colour downregulated genes in blue

# colour upregulated genes in red

# add axes labels

# add threshold lines
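plt.axvline(-1,color="grey",linestyle="--")  # log2FoldChange = -1: expression halves
plt.axvline(1,color="grey",linestyle="--")  # log2FoldChange = 1: expression doubles
plt.axhline(-np.log10(0.05),color="grey",linestyle="--")  # padj = 0.05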

# add a legend

# save as png


Remember that you can, if you wish, create an interactive volcano plot to explore your results further. You might find it useful for the development of your grant proposal.

That is the end of the analysis in Python. Well done! You made it!

We will now move on to exploring which GO terms and metabolic pathways are represented in our lists of up- and down-regulated genes. To make sure everyone is on the same page, we have uploaded the lists of up- and down-regulated genes to Learn. Please use those from now on. 