# What to expect

In notebook 2A we looked at the output of STAR and combined the results for each sample in the <i>Schistosoma mansoni</i> example dataset into a single dataframe. We considered ways to normalise the gene count data and viewed the results using Principal Component Analysis. In this second part of the session, you will repeat most of this process for your choice of dataset.

# Checking the quality of mapping
As before, open `analysis/<dataset>/multiqc/multiqc_report.html` by double-clicking it and take a look at the report.

<div class="alert alert-block alert-warning">

Answer the following questions. Your answers will be helpful to complete the bioinformatics analysis summary.

- What % of reads mapped to only one gene of the reference genome?
- Are there any samples that look less good? If so, in what way? How might this impact your results?

You can use this text cell to make notes, or add the answers straight into your analysis summary.

\

\

\

# Combining data across samples

As with the example dataset in 2A, we have a set of files in the `star` folder. Each file contains the mapping result for one particular sample. Have a look at the mapping results for one of the samples.

In [None]:
#Add code here to look at the mapping results for one sample in your dataset

To analyse the data further, we need to combine the results from each sample into one dataframe, which will be our master dataframe.

The dataframe should have the gene name as its index and one column of reads per gene for each sample. Each column should be named with the accession number of its sample.
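The sketch below illustrates the target structure on toy data. It assumes the per-sample files follow STAR's `ReadsPerGene.out.tab` layout (tab-separated, no header: gene ID, then unstranded, first-strand and second-strand counts, with four `N_` summary rows at the top) and that the unstranded column is the one you want. The accession numbers, gene names and the in-memory `toy_files` dictionary are made up for illustration, so your own code from notebook 2A may well differ.

```python
import io
import pandas as pd

# Hypothetical contents of two per-sample files (accessions and genes are made up).
toy_files = {
    "SRR0000001": "\n".join([
        "N_unmapped\t5\t5\t5",
        "N_multimapping\t3\t3\t3",
        "N_noFeature\t2\t9\t4",
        "N_ambiguous\t1\t0\t1",
        "geneA\t10\t1\t9",
        "geneB\t0\t0\t0",
    ]),
    "SRR0000002": "\n".join([
        "N_unmapped\t7\t7\t7",
        "N_multimapping\t2\t2\t2",
        "N_noFeature\t4\t8\t6",
        "N_ambiguous\t0\t0\t0",
        "geneA\t20\t2\t18",
        "geneB\t3\t1\t2",
    ]),
}

columns = []
for accession, text in toy_files.items():
    # The first field becomes the index; the remaining columns are labelled 1, 2, 3.
    table = pd.read_csv(io.StringIO(text), sep="\t", header=None, index_col=0)
    # Column 1 holds the unstranded counts (an assumption; check your library type).
    columns.append(table[1].rename(accession))

# One column per sample, one row per gene.
toy_master_df = pd.concat(columns, axis=1)
toy_master_df.index.name = "gene"

# Remove the four summary rows so that only genes remain.
toy_master_clean = toy_master_df[~toy_master_df.index.str.startswith("N_")]
print(toy_master_clean)
```

With real data, the strings in `toy_files` would be replaced by the paths of the files in your `star` folder.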

<div class="alert alert-block alert-warning">

Take the code you used in notebook 2A to generate the master dataframe for the example dataset and modify it as required to create the master dataframe for your dataset.

In [None]:
# Using a loop, create the list with the accession numbers 


# have a look at the list to check it all worked well

In [None]:
# Import pandas
import pandas as pd

# Create the master dataframe
master_df = pd.DataFrame()


# Use a loop to add the index and the data to the master_df

# Have a look at the master_df


In [None]:
# Clean the dataframe to make sure it is ready for the next step of the analysis

# Have a look at the cleaned master_df

<div class="alert alert-block alert-warning">

Take a few moments to look at the dataframe and compare it with the one the rest of your group obtained. Did you get the same output? Does it look like the one shown on the projector? If not, try to figure out why, and feel free to ask your peers and the demonstrators for help.

In [None]:
# Save the dataframe as a csv file so we can look at it later
# (replace <dataset> with the name of your dataset folder)
master_df_clean.to_csv("analysis/<dataset>/star/ReadsPerGene.csv")

# Normalisation
As with the example dataset, we will now use the Python package [PyDESeq2](https://pydeseq2.readthedocs.io/en/latest/api/index.html) to normalise our gene counts and quantify the log fold change.
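To build some intuition for what the normalisation does, here is a minimal NumPy sketch of the median-of-ratios idea used in DESeq2-style size-factor estimation. It is only an illustration on made-up counts and is not PyDESeq2's actual code, which may differ in its details.

```python
import numpy as np

# Toy counts: rows are samples, columns are genes (values are made up).
toy_counts = np.array([
    [10.0, 20.0, 30.0],  # sample sequenced at lower depth
    [20.0, 40.0, 60.0],  # same expression profile, twice the depth
])

# Use only genes with non-zero counts in every sample, so the logs are defined.
expressed = (toy_counts > 0).all(axis=0)
log_counts = np.log(toy_counts[:, expressed])

# Per-gene reference: the log of the geometric mean across samples.
log_reference = log_counts.mean(axis=0)

# Size factor per sample: the median ratio of its counts to the reference.
size_factors = np.exp(np.median(log_counts - log_reference, axis=1))

# Dividing by the size factors puts the samples on a common scale.
normalised = toy_counts / size_factors[:, None]
print(size_factors)
print(normalised)
```

Because the second toy sample is simply the first at twice the depth, its size factor is twice as large and the two normalised profiles coincide.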

As before, we will need a counts table and a metadata table. For each dataset, we have provided metadata in `data/<dataset>/metadata.csv`, so we just need to read that into a dataframe.
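As a reminder of the shapes involved, the toy sketch below transposes a small genes-by-samples table into a samples-by-genes counts table and reads a metadata table indexed by accession. The column names `accession` and `condition` and all values are hypothetical; check the header of your own `metadata.csv` before reusing them.

```python
import io
import pandas as pd

# Toy master dataframe: genes as rows, one column per sample (made-up values).
toy_master = pd.DataFrame(
    {"SRR0000001": [10, 0], "SRR0000002": [20, 3]},
    index=["geneA", "geneB"],
)

# Transposing gives one row per sample and one column per gene.
toy_counts = toy_master.T

# Toy metadata file; the column names are assumptions for illustration.
toy_csv = "accession,condition\nSRR0000001,control\nSRR0000002,treated\n"
toy_metadata = pd.read_csv(io.StringIO(toy_csv), index_col="accession")

# Both tables should list the same samples, in the same order.
print(list(toy_counts.index) == list(toy_metadata.index))
```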

In [None]:
#create the counts matrix by transposing our master_df
counts =

#create the metadata table. The index should be the accession number.
metadata =

# Let's have a look at the metadata to make sure it looks right.
metadata

Hopefully you have noticed that the experiment in your dataset is a bit more complex than that in the example dataset. In your dataset, there are other variables (different genotypes for <i>Plasmodium</i>, different species for <i>Trypanosoma</i>). However, we do not need the data for all the experimental conditions for our analysis, because:
- For <i>Plasmodium</i>, we only want to compare the wildtype at the different timepoints.
- For <i>Trypanosoma</i>, we only want to compare <i>Trypanosoma brucei brucei</i> with different morphologies.

Therefore, we have to create a filtered version of the metadata (metadata_s) and counts (counts_s) to use with PyDESeq2, so that they only contain the conditions we are interested in. We practised how to do this in notebook 1. 
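As a reminder of the pattern from notebook 1, the toy sketch below keeps only the samples that match one condition and then uses the filtered metadata index to select the matching rows of the counts table. The column names, sample names and values are invented; your metadata will use different ones.

```python
import pandas as pd

# Toy metadata: one row per sample (all names and values are made up).
toy_metadata = pd.DataFrame(
    {
        "genotype": ["wildtype", "wildtype", "mutant", "mutant"],
        "timepoint": ["0h", "24h", "0h", "24h"],
    },
    index=["S1", "S2", "S3", "S4"],
)

# Toy counts: samples as rows, genes as columns.
toy_counts = pd.DataFrame(
    {"geneA": [10, 12, 50, 55], "geneB": [3, 4, 0, 1]},
    index=toy_metadata.index,
)

# Boolean mask selecting the samples of interest.
keep = toy_metadata["genotype"] == "wildtype"
toy_metadata_s = toy_metadata[keep]

# Select the same samples, in the same order, from the counts table.
toy_counts_s = toy_counts.loc[toy_metadata_s.index]
print(toy_metadata_s)
print(toy_counts_s)
```

Indexing the counts with the filtered metadata index keeps the two tables aligned sample by sample.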

<div class="alert alert-block alert-warning">
    
Create the counts and the metadata tables so that they only contain the conditions we want to compare.

In [15]:
#add code below to create the filtered dataframes
counts_s = 
metadata_s = 


Now we are ready to generate the `DeseqDataSet` object using the relevant `design` and run its `deseq2()` method.

<div class="alert alert-block alert-warning">
    
Fill in the code below with the relevant design factor for your analysis.

In [None]:
! pip install --quiet pydeseq2
from pydeseq2.dds import DeseqDataSet

dds = DeseqDataSet(
    counts=counts_s,
    metadata=metadata_s,
    refit_cooks=True,
    design=
)

dds.deseq2()

# PCA Plot

Now take a look at how the overall data looks on a Principal Component Analysis plot of PC1 and PC2. 
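If it helps to recall what the plot represents, the NumPy sketch below computes principal component scores and the fraction of variance each component explains for a tiny made-up table. It illustrates the general technique only, under the assumption of a simple log transform and mean-centring, and is not necessarily what scanpy does internally.

```python
import numpy as np

# Toy counts: rows are samples, columns are genes (values are made up).
# Samples 0-1 and samples 2-3 mimic two experimental groups.
toy = np.array([
    [10.0, 200.0, 5.0],
    [12.0, 210.0, 4.0],
    [50.0, 20.0, 6.0],
    [55.0, 25.0, 5.0],
])

# Log-transform and centre each gene on its mean.
log_toy = np.log1p(toy)
centred = log_toy - log_toy.mean(axis=0)

# The singular value decomposition gives the principal components.
U, S, Vt = np.linalg.svd(centred, full_matrices=False)

# Coordinates of each sample on each component.
scores = U * S

# Fraction of the total variance carried by each component.
explained = S**2 / np.sum(S**2)
print(scores[:, :2])
print(explained)
```

In this toy table the two groups differ strongly in the first two genes, so they fall on opposite sides of PC1.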

In [None]:
! pip install --quiet scanpy
import scanpy as sc

sc.tl.pca(dds)
sc.pl.pca(dds, color=<design factor>, size=200, annotate_var_explained=True)

<div class="alert alert-block alert-warning">

Answer the following questions. Your answers will be helpful to complete the bioinformatics analysis summary.

- Is there a separation between the groups?
- Does reproducibility look good?
- What is PC1 separating?
- What is PC2 separating?

You can use this text cell to make notes, or add the answers straight into your analysis summary.

\

\

\

Look at the loadings associated with the two components.
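A loading is the weight a gene contributes to a component, so the top-ranked gene is the one with the largest absolute weight. The sketch below ranks genes this way on made-up data with hypothetical gene names; it illustrates the idea and is not a substitute for the scanpy plot.

```python
import numpy as np
import pandas as pd

# Toy counts: rows are samples, columns are genes (values are made up).
genes = ["geneA", "geneB", "geneC"]
toy = np.array([
    [10.0, 200.0, 5.0],
    [12.0, 210.0, 4.0],
    [50.0, 20.0, 6.0],
    [55.0, 25.0, 5.0],
])

# Log-transform, centre each gene, and decompose.
log_toy = np.log1p(toy)
centred = log_toy - log_toy.mean(axis=0)
_, _, Vt = np.linalg.svd(centred, full_matrices=False)

# Rows of Vt are components; transpose so that rows are genes.
loadings = pd.DataFrame(Vt[:2].T, index=genes, columns=["PC1", "PC2"])

# The top-ranked gene per component has the largest absolute loading.
top_genes = loadings.abs().idxmax()
print(loadings)
print(top_genes)
```

The sign of a loading only indicates direction along the component, which is why the ranking uses absolute values.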

In [None]:
sc.pl.pca_loadings(dds, components = '1,2')

<div class="alert alert-block alert-warning">

Answer the following question. Your answer will be helpful to complete the bioinformatics analysis summary.

What is the top-ranked gene for each component? What do these genes encode?

You can use this text cell to make notes, or add the answers straight into your analysis summary.

\

\

\