# What to expect

We will start our analysis from the mapped reads. The mapping was done using the package [STAR](https://pubmed.ncbi.nlm.nih.gov/23104886/). In this notebook we will take a look at the outputs generated by STAR for the <i>Schistosoma mansoni</i> dataset. We will guide you through the process of combining the results for each sample in the dataset into a single dataframe and viewing them using Principal Component Analysis (PCA). In the second part of this session, you will repeat this process for your choice of dataset in notebook 2B.

# The files

For each dataset, you will find some information in `data/<dataset>`:

1. `README` file with information on how the data was generated
2. `list_ids` file with the names of the samples - called accession numbers
3. `metadata` file with information on the experimental conditions for each sample

For each dataset, you will also find some files stored in `analysis/<dataset>`:

1. `star` folder - contains some of the outputs generated by STAR (others have been omitted to save space)
2. `multiqc` folder - contains a summary of the quality of mappings by STAR

We will use these files to complete our analysis.

# Checking the quality of mapping
We have mapped our reads to the genome, so we can compare expression levels between our samples. But before moving on to the next step, we should check the quality of our mapping. 

As explained before, the number of reads that map to a gene gives us a measure of how much that gene is being expressed. Therefore, for us to be confident in our quantification, the majority of reads should map to one and only one gene in the reference genome. There are valid reasons why a read can map to more than one gene, for example in the case of families of highly related genes. Reasons to be concerned include insufficiently stringent mapping criteria, short reads that can map to several genes non-specifically, or other technical reasons. As a standard guide, a good quality sample will usually have at least 75% of the reads uniquely mapped.
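As a small illustration of the 75% rule of thumb, the sketch below computes the percentage of uniquely mapped reads for two made-up samples. The sample names and read totals are hypothetical and are not taken from the MultiQC report.

```python
# Hypothetical read totals for two samples (illustrative numbers, not from the dataset)
mapping_stats = {
    "sample_A": {"total_reads": 1_000_000, "uniquely_mapped": 820_000},
    "sample_B": {"total_reads": 1_200_000, "uniquely_mapped": 780_000},
}

THRESHOLD = 75.0  # the rule-of-thumb percentage mentioned above


def percent_uniquely_mapped(stats):
    """Return the percentage of reads that were uniquely mapped."""
    return 100 * stats["uniquely_mapped"] / stats["total_reads"]


for sample, stats in mapping_stats.items():
    pct = percent_uniquely_mapped(stats)
    verdict = "OK" if pct >= THRESHOLD else "check this sample"
    print(f"{sample}: {pct:.1f}% uniquely mapped ({verdict})")
```

A sample that falls below the threshold is not automatically unusable, but it is worth looking at more closely before interpreting its counts.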

One of the tools frequently used to check mapping quality is [MultiQC](https://multiqc.info/). MultiQC analyses the mapping for each of our samples and produces an HTML report summarizing it. We have run MultiQC on our files and provided the report in the analysis subdirectories. Try opening `analysis/Schistosoma_mansoni/multiqc/multiqc_report.html` by double-clicking (you may need to click 'Trust HTML' in a banner).

<div class="alert alert-block alert-warning">

Discuss in your group:

- What % of reads mapped to only one gene of the reference genome?
- Are there any samples that look less good? In what way? How might this impact your results?

You can add some notes from your discussion in this text cell

\

\

\

<div class="alert alert-block alert-success">
<b>Learning outcomes</b>

You should now...

- Know how to interpret a MultiQC report
- Understand why it is important to check the quality of mapping

# Combining data across samples

Now that we are happy with our mapping, we can move on with the analysis. 

### Looking at mapping results
In the `star` folder there is a file for each of our samples. Each file contains the mapping result for that particular sample. Let's have a look at the mapping results for one of the samples.

In [None]:
! head -n 30 analysis/Schistosoma_mansoni/star/ERR022875ReadsPerGeneUnst.out.tab

It would be very inconvenient if we had to open each file separately in order to look at the mapping results. Therefore, we will combine all the results in one dataframe. It will be our master dataframe. 

### Getting the list of samples in Python

To create the master dataframe, we first need to provide Python with a list of samples (accessions). These are stored in the file `data/Schistosoma_mansoni/list_ids.txt`. 

<div class="alert alert-block alert-warning">
Create a list that contains all the accession numbers (sample names) in our experiment.  

<details>
<summary><i>Hints and tips</i></summary>
    
    We practised this in the previous notebook. If you need to, go back and use that code.

</details>

In [None]:
# Create a new list called accessions. The list will be empty for now. 

# open the file in read mode
# use a for loop over the lines in the file
# remove any whitespace/newline characters from the line
# add the new accession to the list

# have a look at the list to check it all worked well

### Create the master dataframe
Now that we have our accessions list, we are going to use it to create our master dataframe.  

We will work with pandas dataframes, as we did in notebook 1.

The dataframe should have the gene name as index, and a column of reads per gene for each sample. Each column should have the accession number of that sample as column name.

We need some code that:
- for each accession number in the accessions list, reads the corresponding `ReadsPerGeneUnst.out.tab` file
- puts the contents of that file into a dataframe that has the gene names as index and a single column of counts, named after the accession number
- adds the column we just generated to a master dataframe using `join`. You used `join` in notebook 1.

Because this is again a repetitive process (the code has to do the same thing for each of our samples) we will need to write another loop.
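Before writing the loop, it can help to see how the different join types behave when two tables do not share exactly the same index. The sketch below uses two made-up single-sample tables (the gene and sample names are hypothetical) and prints which genes survive each type of join.

```python
import pandas as pd

# Two toy single-sample tables with partly different genes (all names are made up)
sample_1 = pd.DataFrame({"sample_1": [10, 5, 3]}, index=["geneA", "geneB", "geneC"])
sample_2 = pd.DataFrame({"sample_2": [7, 2]}, index=["geneB", "geneD"])

# Try each join type and compare which genes are kept
for how in ["left", "right", "inner", "outer"]:
    joined = sample_1.join(sample_2, how=how)
    print(f"how='{how}': {joined.shape[0]} rows -> {sorted(joined.index)}")
```

Compare the rows kept by each option with what you want the master dataframe to contain, and note where missing values appear.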

<div class="alert alert-block alert-warning">
    
Try to fill the gaps `...` in the code below to create the master dataframe.

In [None]:
# Import pandas library
import pandas as pd

# Create the master dataframe
master_df = pd.DataFrame()


# Create a loop that for each accession in the list "accessions", will:
# (1) print the accession number (this allows us to check that the code is working well)
# (2) read the ReadsPerGeneUnst.out.tab file for that same accession number into a temporary dataframe "counts_df" that has
# the gene names column as index and has the column names ["gene", "reads"]
# (3) copy the gene counts data ("reads") to a new column named as the accession number
# (4) Extract that new column with the current accession's read counts into a smaller dataframe "accession_df"
# (5) use a dataframe join to add this column to the master dataframe. Note that not all samples might express the same genes, 
# so the indexes might not be the same. Think about the type of join you need here.

for ...:
    print(accession)
    file = f"analysis/Schistosoma_mansoni/star/{accession}ReadsPerGeneUnst.out.tab"
    counts_df = ...
    ...
    ...
    ...
    master_df = master_df.join(...)


Let's have a look at the master dataframe we just created.

In [None]:
master_df

### Filter and save the master dataframe

Hopefully you can see that, although the structure of the dataframe looks good, we need to get rid of the first 5 rows, as they contain summary information instead of gene counts. In DExB2, you already learned ways to either remove rows (week 2, class 3) or to slice the dataframe (week 4, class 8).

In addition, it is likely that not all samples express the same genes. If reads were mapped to a gene for sample 1, but not for sample 2, in our dataframe we will have a number of reads for sample 1 and an empty value "NaN" for sample 2. We want to replace those empty values with 0. You already practised how to change NaN values in a dataframe in notebook 1. 

We also want to save the master dataframe as a csv file. 


<div class="alert alert-block alert-warning">

Try to fill the gaps `...` in the code below to clean the dataframe and save it as csv

In [None]:
# Keep only the relevant rows
master_df_clean = ...
# Replace any empty values with 0
...
# save the dataframe as a csv file called ReadsPerGene
...
# have a look at the cleaned master_df
master_df_clean

# Normalisation
The number of reads mapping to each gene is proportional to the number of expressed transcripts of that gene in that sample. However, there are other factors which affect the number of reads that map to each gene. When comparing gene expression, we need to take those factors into account to minimise their impact, so that our comparisons are reliable. Normalisation is the process of scaling the raw counts to account for these other factors so that the expression levels are more comparable.

The main factors that we should consider are:

-	Sequencing depth: this is the total number of reads obtained (the “Gene counts” in our MultiQC report). As you saw in the MultiQC report, this was different between samples. 

-	Gene length: longer genes will usually have more reads than shorter genes.

-	Overall RNA composition: the presence of a small number of very highly expressed genes, or differences in the number of genes expressed between samples can skew our analysis.


There are different methods for normalisation, depending on the comparison we want to perform. For example, an analysis might look within a single sample to see which genes are more highly expressed in that sample; or it might compare the expression of a gene between 2 different samples.  

The following table is taken from a [Harvard Chan Bioinformatics Core tutorial](https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html) on the topic:

| Normalisation method | Description | Accounted factors |
|:--------------------:|:-----------:|:-----------------:|
| CPM (counts per million) | counts scaled by total number of reads | sequencing depth |
| TPM (transcripts per kilobase million) | counts per length of transcript (kb) per million reads mapped | sequencing depth and gene length |
| DESeq2’s median of ratios | counts divided by sample-specific size factors | sequencing depth and RNA composition |
| EdgeR’s trimmed mean of M values (TMM) | uses a weighted trimmed mean of the log expression ratios between samples | sequencing depth and RNA composition |
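To make the first two rows of the table concrete, here is a minimal sketch of CPM and TPM on a toy count matrix. The counts and gene lengths are made up for illustration; in a real analysis the lengths would come from the genome annotation.

```python
import pandas as pd

# Toy raw counts: rows are genes, columns are samples (all values are made up)
raw_counts = pd.DataFrame(
    {"sample_1": [100, 300, 600], "sample_2": [200, 600, 1200]},
    index=["geneA", "geneB", "geneC"],
)
# Hypothetical gene lengths in kilobases
gene_length_kb = pd.Series([1.0, 2.0, 4.0], index=raw_counts.index)

# CPM: scale each sample so that its counts sum to one million
cpm = raw_counts / raw_counts.sum(axis=0) * 1e6

# TPM: first divide by gene length, then scale each sample to one million
reads_per_kb = raw_counts.div(gene_length_kb, axis=0)
tpm = reads_per_kb / reads_per_kb.sum(axis=0) * 1e6

print(cpm.round(0))
print(tpm.round(0))
```

In this toy example `sample_2` has exactly twice the counts of `sample_1` for every gene, so both samples end up with identical CPM and TPM values: the difference in sequencing depth has been removed.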

<div class="alert alert-block alert-warning">
Discuss in your group which normalisation method might be more appropriate for our analysis.

You can add some notes from your discussion in this text cell

\

\

\

<div class="alert alert-block alert-success">
<b>Learning outcomes</b>
    
You should now know...

- What normalisation is 
- Which factors may need to be controlled for different types of gene count comparisons and why

In this case, we will use the DESeq2 inbuilt method for normalisation. To implement it, we will use the python package [PyDESeq2](https://pydeseq2.readthedocs.io/en/latest/api/index.html). 

This method requires two inputs:

- a table with all our counts, just like the dataframe we have just created, but it needs to be transposed, so that each row is a sample and each column is a gene. You already learned how to transpose a dataframe in DExB2, week 2, class 4.
- a metadata table that specifies what each sample is

<div class="alert alert-block alert-warning">
    
Create the counts and the metadata tables

In [None]:
# create the counts matrix by transposing our cleaned master dataframe (master_df_clean)
counts = ...

# create the metadata table. The index should be the accession number.
metadata = ...

#let's have a look at the metadata to make sure it looks right. 
print(metadata)

We are now ready to go ahead with the PyDESeq2 analysis

In [None]:
# To start the analysis, let's install PyDESeq2
! pip install --quiet pydeseq2

# and import the "DeseqDataSet" class from the dds module of PyDESeq2
from pydeseq2.dds import DeseqDataSet

Now, we will use the DeseqDataSet class to create a "dds" object, which is an annotated data matrix, called [AnnData](https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.html#anndata.AnnData). To create the dds object, we provide some arguments: 

- `counts` is our transposed master_df matrix
- `metadata` is our metadata table that we just created
- `refit_cooks` indicates whether genes with outlier counts, flagged using Cook's distance, should be refitted during the analysis. You do not have to worry about the details; we will simply set it to `True`.
- `design` is where we indicate what we want to use to compare samples. In our experiment we want to compare gene expression between the different developmental stages, so we specify "stage" as our design factor

In [None]:
dds = DeseqDataSet(
    counts=counts,
    metadata=metadata,
    refit_cooks=True,
    design="stage"
)

dds

Once we have created the dds object, we will apply the `deseq2` method to it. This method normalises the data, estimates the dispersion and calculates the log fold change (LFC) based on the design factor.
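To make the normalisation step less of a black box, the sketch below works through the median-of-ratios idea on a toy matrix with NumPy. It is a simplified illustration with made-up counts, not PyDESeq2's actual implementation, which may differ in details such as how genes with zero counts are handled.

```python
import numpy as np

# Toy counts: rows are samples, columns are genes (same orientation as `counts`)
toy_counts = np.array(
    [
        [100, 200, 50, 400],
        [200, 400, 100, 800],   # same profile as the first sample, sequenced twice as deeply
        [150, 300, 75, 6000],   # one very highly expressed gene
    ],
    dtype=float,
)

# 1. Per-gene reference: geometric mean across samples (computed on the log scale)
log_counts = np.log(toy_counts)
log_geo_means = log_counts.mean(axis=0)

# 2. For each sample, the log ratio of every gene to its reference
log_ratios = log_counts - log_geo_means

# 3. Size factor: the median ratio within each sample
size_factors = np.exp(np.median(log_ratios, axis=1))

# 4. Normalised counts: raw counts divided by the sample's size factor
normed = toy_counts / size_factors[:, None]

print("size factors:", size_factors.round(3))
print(normed.round(1))
```

Because the size factor is a median, the single very highly expressed gene in the third sample does not distort the scaling of the other genes. This is how the method accounts for RNA composition as well as sequencing depth.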

In [None]:
dds.deseq2()
dds

Note that new elements have been added to the AnnData object, including the LFC and the normed counts. Let's have a look at the normed counts.

In [None]:
# View the normed counts
dds.layers['normed_counts']

<div class="alert alert-block alert-success">
<b>Learning outcome</b>
    
- You should now know how to use PyDESeq2 to normalise your data

# PCA Plot

In our experiment, we have 12 different samples (three replicates of 3 hr schistosomulum, four replicates of 24 hr schistosomulum, four replicates of cercarium and one replicate of platyhelminth adult). For each of these samples, we have gene counts for thousands of genes. It would be useful at this point to have an overview of the data. We expect the replicates within each stage to be very similar to each other, but different from the other stages. Principal Component Analysis (PCA) helps us check this. You studied PCA in detail during DExB2; look at week 7, class 14 if you need a refresher. This [video](https://www.youtube.com/watch?v=5vgP05YpKdE) provides a simple overview of the concept. 

In our experiment, PCA will look at all the normalised gene counts and construct weighted combinations of genes (the "components") that describe as much of the variation between samples as possible. Plotting the first 2 components identified in the analysis can therefore be a useful way to visualise the effect of experimental covariates as well as batch effects.
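As a reminder of what happens under the hood, here is a minimal sketch of PCA on a toy matrix using NumPy's singular value decomposition. The data are randomly generated with two made-up groups of samples; this only illustrates the idea and is not necessarily how scanpy computes it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 6 samples x 4 genes; samples 0-2 and 3-5 form two made-up groups
group_shift = np.array([6.0, -6.0, 0.0, 0.0])
data = rng.normal(size=(6, 4))
data[3:] += group_shift

# Centre each gene, then decompose with singular value decomposition
centred = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centred, full_matrices=False)

scores = U * S                        # sample coordinates on each component
loadings = Vt                         # row i: weight of every gene in component i
var_explained = S**2 / np.sum(S**2)   # fraction of variance per component

print("variance explained:", var_explained.round(3))
print("PC1 gene weights:", loadings[0].round(2))
print("PC1 sample scores:", scores[:, 0].round(2))
```

The two groups differ only in the first two genes, so those genes should carry the largest weights on PC1, and the PC1 scores should separate the two groups.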

We are going to use PCA to have a look at our data. We will perform the PCA and create the plot with the python library [scanpy](https://scanpy.readthedocs.io/en/stable/).

In [None]:
! pip install --quiet scanpy

In [None]:
import scanpy as sc

# Use scanpy to plot the first 2 components
sc.tl.pca(dds)
sc.pl.pca(dds, size=200, color="stage", annotate_var_explained=True)

<div class="alert alert-block alert-warning">

Discuss in your group:
- Is there a separation of the different developmental stages in the PCA plot?
- How well do the replicates for each stage cluster together?
- How much variance is explained by the first 2 principal components?
- What is PC1 separating?
- What is PC2 separating?

You can add some notes from your discussion in this text cell

\

\

\

We can now look at which genes contribute to each component, and compare how much each gene contributes. These contributions are called "loadings".

In [None]:
sc.pl.pca_loadings(dds, components = '1,2')

<div class="alert alert-block alert-warning">

Discuss in your group:

- Which 3 genes contribute most to PC1?
- Which 2 genes contribute most to PC2?

You can add some notes from your discussion in this text cell

\

\

\

<div class="alert alert-block alert-success">
<b>Learning outcomes</b>
    
You should now know...
    
- Why we perform PCA to visualise data
- How to interpret a PCA plot
- How to use scanpy to perform PCA and extract the loadings for each component