## DExB2 Example Class Test - Rice RNASeq

    # Analysis modules - make sure you run this first to import all the modules you'll need
    import warnings
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np
    import seaborn as sns
    %matplotlib inline
    from statsmodels.formula.api import ols

    warnings.filterwarnings("ignore")

This project has been looking for genetic material to enhance rice yields under drought. RNASeq data has been generated from several tissues for cultivated rice and drought-tolerant wild varieties.  

The raw expression data is spread across 6 files:  
        
        Cult_1.txt
        Cult_2.txt
        Cult_3.txt
        Wild_1.txt
        Wild_2.txt
        Wild_3.txt
        
The key columns are the gene name and the Transcripts per Million (TPM).
The other columns are the raw number of reads mapping to each gene (NumReads), the Length of the gene, and the Effective Length, which takes into account all factors being modeled that affect the probability of sampling fragments from this transcript, including the fragment length distribution and sequence-specific and GC-fragment bias.

The annotation is here:

        Rice_annot.txt
        
        

#### Read the RNASeq data in and join together to make a tidy table
You want the TPM (transcripts per million) column from each dataframe.

To set the index as the gene name use:

    index_col=0
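
A minimal sketch of reading one file, assuming the files are tab-separated with a header row and columns named `Name`, `Length`, `EffectiveLength`, `TPM` and `NumReads`. Neither the separator nor the exact column names are stated above, so check a real file first. The toy text stands in for a file such as `Cult_1.txt`.

```python
import io
import pandas as pd

# Toy stand-in for one expression file. The separator and column names
# are assumptions, so inspect a real file before relying on them.
toy_file = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "gene_a\t1200\t1000.0\t10.5\t42\n"
    "gene_b\t800\t600.0\t250.0\t600\n"
    "gene_c\t1500\t1300.0\t0.0\t0\n"
)

# index_col=0 makes the first column (the gene name) the row index.
cult_1 = pd.read_csv(toy_file, sep="\t", index_col=0)

n_rows = cult_1.shape[0]            # number of data rows (header excluded)
max_tpm = cult_1["TPM"].max()       # largest TPM value in the table
top_gene = cult_1["TPM"].idxmax()   # index label (gene name) of that value
print(n_rows, max_tpm, top_gene)
```

With a real file, replace `toy_file` with the file name, for example `pd.read_csv("Cult_1.txt", sep="\t", index_col=0)` if the file is indeed tab-separated.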
    

<div class = "alert alert-danger">
Q1 How many data rows are there in the third replicate of the wild rice?  (1 mark)
 
    
    a 8723    
    b 11296  
    c 13409   
    d 22467   
    e 45072    
    
Enter your answer on LEARN

<div class = "alert alert-danger">
Q2 What is the maximum TPM for any gene in rep 1 of the Cultivated Rice?  (1 mark)

    a 3427.257  
    b 7730.83 
    c 21684.946  
    d 25196.333  
    e 41984.324  
    
Enter your answer on LEARN

<div class = "alert alert-danger">
Q3 What is the name of the gene with the highest TPM in Cultivated Rice rep 1? (1 mark)

    a Scaffolds_1827_0.16_mRNA_1 
    b Scaffolds_1075_2.1_mRNA_1
    c Scaffolds_1400_0.36_mRNA_1  
    d Scaffolds_1745_0.15_mRNA_1  
    e Scaffolds_1853_0.18_mRNA_1  
    
Enter your answer on LEARN

<div class = "alert alert-danger">
Q4 Plot TPM by NumReads for Cultivated Rice rep 1 as an lmplot and upload the plot to LEARN (2 marks)

<div class = "alert alert-danger">
Q5 Fit a model for the line in this plot using ols().  Which of these statements is true?  (1 mark)

    a NumReads drives TPM
    b The intercept is significantly different from zero
    c The slope has a coefficient of 0.27
    d The P-value for NumReads supports a non-zero intercept
    e The t statistic for the slope shows a positive relationship
    
Enter your answer on LEARN  
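
As a rough illustration of fitting a line with `ols()`, the sketch below uses made-up data with a known slope rather than the rice data. It assumes TPM is the response and NumReads the explanatory variable, which is one reading of "TPM by NumReads".

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Toy data with a known linear trend plus noise (not the real rice data).
rng = np.random.default_rng(0)
num_reads = rng.uniform(0, 1000, size=50)
toy = pd.DataFrame({
    "NumReads": num_reads,
    "TPM": 0.5 * num_reads + rng.normal(0, 20, size=50),
})

# The formula reads "response ~ explanatory": TPM modelled from NumReads.
fit = ols("TPM ~ NumReads", data=toy).fit()

# One entry each for the intercept and the slope (NumReads).
print(fit.params)    # estimated coefficients
print(fit.tvalues)   # t statistics
print(fit.pvalues)   # p-values
```

`fit.summary()` prints the same quantities in a single table. The same data frame can be passed to `sns.lmplot(data=toy, x="NumReads", y="TPM")` to draw the fitted line over the points.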

Concatenate the TPM values across dataframes. Remember that you want only the TPM (transcripts per million) column from each dataframe, along with the gene name index. (If you try to include more than this column from each dataframe, you will run into frustrating problems with column headers.)

Use  

    pd.concat  with keys=['cult_1', 'cult_2', 'cult_3', 'wild_1', 'wild_2', 'wild_3']
    
This gives nested column headers.  Reset to the top level using:  

    df.columns = df.columns.get_level_values(0)
    
Sort by the values for replicate 1 of the cultivated rice.  Use:

    df.sort_values()
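
The three steps above can be sketched on two toy tables. This assumes the pieces are joined side by side with `axis=1`, which is what produces the nested column headers. The gene names and values are invented.

```python
import pandas as pd

# Two toy replicate tables indexed by gene name (stand-ins for the six real files).
cult_1 = pd.DataFrame({"TPM": [5.0, 90.0, 1.0], "NumReads": [10, 300, 2]},
                      index=["gene_a", "gene_b", "gene_c"])
wild_1 = pd.DataFrame({"TPM": [7.0, 20.0, 3.0], "NumReads": [12, 80, 5]},
                      index=["gene_a", "gene_b", "gene_c"])

# Double brackets keep each piece as a one-column DataFrame.
# axis=1 places the pieces side by side; keys labels each piece.
tpm = pd.concat([cult_1[["TPM"]], wild_1[["TPM"]]], axis=1,
                keys=["cult_1", "wild_1"])

# Columns are now two-level, e.g. ('cult_1', 'TPM'); keep only the top level.
tpm.columns = tpm.columns.get_level_values(0)

# Sort so the gene most highly expressed in cult_1 comes first.
tpm = tpm.sort_values(by="cult_1", ascending=False)
print(tpm)
```

After sorting, `tpm.iloc[0]` is the row for the top gene in `cult_1`, so its value in any other column can be read directly.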

<div class = "alert alert-danger">
Q6 What is the TPM in wild rice replicate 1 for the gene most highly expressed in cultivated rice replicate 1?  (1 mark)

    a 785.21 
    b 9071.88  
    c 28106.24  
    d 5026.12  
    e 58203.12 
    
Enter your answer on LEARN


Read in the annotations.  Use df.loc to find the row where 'Gene_model' matches the name of the most highly expressed gene in cultivated rice replicate 1.
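
A minimal sketch of the lookup, assuming the annotation file is tab-separated. `Gene_model` is the column named above; the `Annotation` column name and all values are hypothetical placeholders.

```python
import io
import pandas as pd

# Toy stand-in for Rice_annot.txt. Only 'Gene_model' comes from the text
# above; the separator and the 'Annotation' column name are assumptions.
toy_annot = io.StringIO(
    "Gene_model\tAnnotation\n"
    "gene_a\tExample annotation A\n"
    "gene_b\tExample annotation B\n"
)
annot = pd.read_csv(toy_annot, sep="\t")

top_gene = "gene_b"  # placeholder for the gene name found earlier

# A boolean mask inside .loc keeps only the rows where the names match.
hit = annot.loc[annot["Gene_model"] == top_gene]
print(hit)
```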


<div class = "alert alert-danger">
Q7 What is the annotation for the most highly expressed gene in cultivated rice replicate 1? (1 mark)

    a External stimuli response.salinity 
    b Solute transport.primary  
    c Cell wall organisation 
    d Protein homeostasis 
    Xe Tonoplast intrinsic protein  
    
Enter your answer on LEARN


Plot a heatmap of the TPM data using sns.heatmap()  


To remove the influence of highly expressed outliers use 

    robust=True
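
To see what `robust=True` does, the sketch below builds a small random table with one extreme value. With `robust=True`, seaborn sets the colour limits from percentiles of the data instead of the minimum and maximum, so a single outlier no longer stretches the colour scale. All values and labels are invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Toy TPM table: 20 genes by 3 samples, plus one very highly expressed outlier.
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.uniform(0, 50, size=(20, 3)),
                   columns=["cult_1", "cult_2", "wild_1"],
                   index=[f"gene_{i}" for i in range(20)])
toy.iloc[0, 0] = 5000.0

# robust=True takes the colour limits from percentiles, not min and max.
ax = sns.heatmap(toy, robust=True)
ax.set_xlabel("Sample")
ax.set_ylabel("Gene")

# Upper colour limit actually used; it stays far below the outlier.
upper_limit = ax.collections[0].get_clim()[1]
print(upper_limit)
```

To save the figure for upload, something like `ax.figure.savefig("heatmap.png", bbox_inches="tight")` should work; the file name is only an example.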
    

<div class = "alert alert-danger">
Q8 Upload the heatmap to LEARN (2 marks)

Re-arrange the dataframe to long form using

    pd.melt
    
The value name should be 'TPM'.

The value vars will be ['cult_1', 'cult_2', 'cult_3', 'wild_1', 'wild_2', 'wild_3']
    
You will need to reset the index first using:
    
    df.reset_index(inplace = True)
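
A small sketch of the reshaping, assuming the gene-name index is called `Name` so that it becomes a `Name` column after the reset. An unnamed index would instead become a column called `index`. Only two samples are shown to keep the toy table short.

```python
import pandas as pd

# Toy wide table: one row per gene, one column per sample.
tpm = pd.DataFrame({"cult_1": [5.0, 90.0], "wild_1": [7.0, 20.0]},
                   index=pd.Index(["gene_a", "gene_b"], name="Name"))

# Move the gene names out of the index so melt can keep them as an id column.
tpm.reset_index(inplace=True)

# Long form: one row per (gene, sample) pair, with the value column named TPM.
long = pd.melt(tpm, id_vars="Name", value_vars=["cult_1", "wild_1"],
               value_name="TPM")

# The row with the largest TPM; its 'variable' entry names the sample.
top = long.loc[long["TPM"].idxmax()]
print(top["variable"], top["TPM"])
```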

<div class = "alert alert-danger">
Q9 Which of the samples has the maximum TPM overall? (1 mark)

    Xa Cultivated 1   
    b Cultivated 2 
    c Cultivated 3   
    d Wild 1    
    e Wild 2  
    f Wild 3
    
Enter your answer on LEARN

Split the variable column with sample type and replicate into two columns.  Use:  
    
    df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
    
Drop the original 'variable' column
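
The template above splits on a space. If the sample labels are the `pd.concat` keys such as `cult_1`, the separator would be an underscore instead. The sketch below assumes that, and assumes the new columns are called `Sample` and `Rep` to match the regression formula used in the next step.

```python
import pandas as pd

# Toy long-form table with the sample label in the 'variable' column.
long = pd.DataFrame({
    "Name": ["gene_a", "gene_a", "gene_b", "gene_b"],
    "variable": ["cult_1", "wild_2", "cult_1", "wild_2"],
    "TPM": [5.0, 7.0, 90.0, 20.0],
})

# Labels here use an underscore, so split on "_" rather than " ".
# n=1 splits once only; expand=True returns the two parts as two columns.
long[["Sample", "Rep"]] = long["variable"].str.split("_", n=1, expand=True)

# The combined label is no longer needed.
long = long.drop(columns="variable")
print(long)
```

Note that `Rep` holds strings ("1", "2") after the split, which matters later because string columns are treated as categories in a model formula.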

Use multiple regression analysis with two categorical explanatory variables and a numerical response variable to look at the effect of Sample and Replicate on TPM.  

Use the formula 'TPM ~ Sample + Rep'
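
A hedged sketch of the model on invented data, in which wild samples are deliberately shifted upwards so the Sample effect is obvious. It assumes `Sample` and `Rep` are string columns, so the formula interface treats both as categorical.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Toy long-form data: 2 sample types x 3 replicates x 20 values each.
# Wild values are shifted by +30 so that Sample has a clear effect.
rng = np.random.default_rng(2)
rows = []
for sample, shift in [("cult", 0.0), ("wild", 30.0)]:
    for rep in ["1", "2", "3"]:
        for value in rng.normal(100.0 + shift, 5.0, size=20):
            rows.append({"Sample": sample, "Rep": rep, "TPM": value})
toy = pd.DataFrame(rows)

# String columns are dummy-coded; the first level of each is the baseline.
fit = ols("TPM ~ Sample + Rep", data=toy).fit()

# One p-value per term: the intercept, the Sample contrast, and the Rep contrasts.
print(fit.pvalues)
```

If `Rep` were stored as a number, it would be fitted as a continuous slope instead; wrapping it as `C(Rep)` in the formula forces categorical treatment.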


<div class = "alert alert-danger">
Q10 What is the P value for the effect of Sample? (1 mark)

    Xa 0.788 
    b 0.023 
    c 0.001 
    d 0.432 
    e 0.037
    
Enter your answer on LEARN

These genes are involved in salt tolerance.  Are salt tolerance genes significantly differentially expressed between wild and cultivated rice?

    ['Scaffolds_120_4.20_mRNA_1',
     'Scaffolds_1226_3.13_mRNA_1',
     'Scaffolds_1308_3.10_mRNA_1',
     'Scaffolds_1516_1.16_mRNA_1',
     'Scaffolds_2846_0.55_mRNA_1',
     'Scaffolds_486_2.19_mRNA_1',
     'Scaffolds_585_0.35_mRNA_1']
 
Subset the dataframe using
 
      df[df['Column'].isin([list])]
      
 
Then make a boxplot to compare expression of these genes between all the samples.
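
The subsetting and plotting can be sketched as follows. The column names `Name`, `Sample`, `Rep` and `TPM` are assumptions about the long-form table, and the short gene list is a hypothetical stand-in for the salt tolerance list above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit in a notebook
import pandas as pd
import seaborn as sns

# Toy long-form table (invented values).
long = pd.DataFrame({
    "Name": ["gene_a", "gene_b", "gene_c"] * 2,
    "Sample": ["cult"] * 3 + ["wild"] * 3,
    "Rep": ["1"] * 6,
    "TPM": [5.0, 90.0, 1.0, 7.0, 20.0, 3.0],
})

genes_of_interest = ["gene_a", "gene_b"]  # hypothetical stand-in gene list

# isin() gives True for rows whose gene name is in the list.
subset = long[long["Name"].isin(genes_of_interest)]

# One box per sample type; TPM on the y axis.
ax = sns.boxplot(data=subset, x="Sample", y="TPM")
print(subset)
```

To show all six samples separately, one option is to add `hue="Rep"` or to plot against a column that still holds the combined sample label.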


<div class = "alert alert-danger">
Q11 Upload the boxplot to LEARN (2 marks)