
# Basic Statistics

In this Jupyter notebook, I will assemble the dataframe that I will use throughout the rest of the project. Then, I will calculate some relevant statistics: the number of genes and isoforms identified in total, the number of differentially expressed genes per genotype, average effect sizes, and the percent internalization for each pair of genotypes. Finally, I will briefly show the benefits of pooling samples instead of using a general linear model (GLM) with interactions to analyze this dataset. To begin, I will import all the relevant libraries.

In [1]:
import pandas as pd
import numpy as np
import scipy
import statsmodels.api as sm
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rc
import os

# import own libraries
import pretty_table as pretty

# plotting settings
rc('text', usetex=True)
rc('text.latex', preamble=r'\usepackage{cmbright}')
rc('font', **{'family': 'sans-serif', 'sans-serif': ['Helvetica']})

%matplotlib inline

# Render inline figures as PNG, including high-resolution (retina) output.
%config InlineBackend.figure_formats = {'png', 'retina'}

# JB's favorite Seaborn settings for notebooks
# (named rc_params so the dict does not shadow matplotlib's rc function)
rc_params = {'lines.linewidth': 2, 
             'axes.labelsize': 18, 
             'axes.titlesize': 18, 
             'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc_params)
sns.set_style("dark")

# more parameters
mpl.rcParams['xtick.labelsize'] = 16 
mpl.rcParams['ytick.labelsize'] = 16 
mpl.rcParams['legend.fontsize'] = 14

Having imported all the libraries I need, I will define the parameters that I will use for the rest of this Notebook.

In [2]:
strains = ['PS4187', 'PS4087', 'PS4176']  # strains to be analyzed
q = 0.01  # q-value cutoff

# map each strain to its position in the list above (used later for sorting)
strain_dict = {g: i for i, g in enumerate(strains)}

# experimental design matrix fed into sleuth
genmap = pd.read_csv('../sleuth/rna_seq_info.txt', sep='\t', comment='#')

Next, I will open the files containing the beta values output by sleuth for each strain. If a strain is not in the list of strains to be analyzed, I will skip its file. Finally, I will concatenate everything into a single dataframe and work with that dataframe from here on.

In [3]:
# open all beta.csv files, and concatenate them into a single dataframe
frames = []
for root, dirs, files in os.walk("../sleuth/sleuth_strains"):
    for file in files:
        # don't open likelihood ratio test if there is one
        if file == 'lrt.csv':
            continue
        
        # extract the strain identifier from the filename (minus the extension)
        strain = file[:-4].replace('_', '-')
        
        # if the strain isn't in our strainlist, drop it
        if strain[2:] not in strains:
            continue
        
        # open the dataframe, and add strain, genotype and order columns
        df = pd.read_csv(os.path.join(root, file), sep=',')
        df.sort_values('target_id', inplace=True)
        df['strain'] = strain.replace('b-', '')
        # look up the genotype for this file in the design matrix
        genotype = genmap[genmap.strain == file[:-4]].genotype.unique()[0]
        df['genotype'] = genotype.replace('b_', '').replace('_', '-')
        df['order'] = strain_dict[df.strain.unique()[0]]
        
        # add the dataframe to the list
        frames.append(df)

# concatenate, then drop rows with a missing gene ID, beta value or q-value
tidy = pd.concat(frames)
tidy.dropna(subset=['ens_gene', 'b', 'qval'], inplace=True)
tidy['absb'] = tidy.b.abs()
tidy.sort_values(['order', 'target_id'], ascending=True, inplace=True)
tidy.head()

# save this new dataframe
tidy.to_csv('../input/quantifications.csv', index=False)
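The filename handling in the loop above can be checked in isolation. The following is a minimal sketch, assuming files are named like `b_PS4187.csv`; the file list and the helper `parse_strain` are hypothetical and are not part of the original analysis.

```python
import os

def parse_strain(filename, prefix_len=2):
    """Return the strain ID encoded in a sleuth output filename (hypothetical helper)."""
    stem = os.path.splitext(filename)[0]   # drop the '.csv' extension
    label = stem.replace('_', '-')         # 'b_PS4187' -> 'b-PS4187'
    return label[prefix_len:]              # strip the leading 'b-'

strains = ['PS4187', 'PS4087', 'PS4176']
strain_dict = {s: i for i, s in enumerate(strains)}

# hypothetical directory listing; 'b_N2.csv' stands in for an unlisted strain
files = ['b_PS4187.csv', 'b_PS4087.csv', 'b_N2.csv', 'lrt.csv']

# skip the likelihood ratio test file and any strain not in the strain list
kept = [f for f in files if f != 'lrt.csv' and parse_strain(f) in strains]

# the 'order' column is the position of the strain in the strain list
order = {parse_strain(f): strain_dict[parse_strain(f)] for f in kept}
print(kept, order)
```

Only files whose parsed strain appears in `strains` survive the filter, which is the same rule the loop applies before reading each CSV.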

Now that we have the dataframe, we can ask how many transcripts we identified in all genotypes, and how many genes these transcripts correspond to.

In [4]:
# target_id identifies isoforms; ens_gene identifies genes
total_isoforms = tidy.target_id.nunique()
total_genes = tidy.ens_gene.nunique()
print("Total isoforms identified in all genotypes: {0}".format(total_isoforms))
print("Total genes identified in all genotypes: {0}".format(total_genes))

Total isoforms identified in all genotypes: 21954
Total genes identified in all genotypes: 15121
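The isoform count exceeds the gene count because several isoforms can map to the same gene. A toy illustration, using made-up identifiers rather than the real data:

```python
import pandas as pd

# made-up data: isoforms t1 and t2 both belong to gene g1
toy = pd.DataFrame({
    'target_id': ['t1', 't2', 't3', 't1', 't2', 't3'],
    'ens_gene':  ['g1', 'g1', 'g2', 'g1', 'g1', 'g2'],
    'strain':    ['A', 'A', 'A', 'B', 'B', 'B'],
})

n_isoforms = toy.target_id.nunique()  # distinct isoforms across all strains
n_genes = toy.ens_gene.nunique()      # distinct genes across all strains
print(n_isoforms, n_genes)
```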


I can also identify the number of differentially expressed genes for each genotype.

In [5]:
pretty.table_print(['Genotype', 'Strain', 'DEG'], space=30)
for x in tidy.strain.unique():
    # select the DE isoforms in the current genotype:
    sel = (tidy.qval < q) & (tidy.strain == x)
    # extract the number of unique genes:
    s = tidy[sel].ens_gene.unique().shape[0]
    genotype = tidy[tidy.strain == x].genotype.unique()[0]
    pretty.table_print([genotype, x, s], space=30)

Genotype                      Strain                        DEG                           
dpy22.bx93                    PS4187                        205                           
dpy22-sy622                   PS4087                        1562                          
dpy22-transhet-sy622-bx93     PS4176                        848                           
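The same per-strain count can also be obtained without an explicit loop. This is a sketch of one alternative, shown on made-up data with the column names used in `tidy`; it is not the code that produced the table above.

```python
import pandas as pd

# made-up data with the same column names as the tidy dataframe
toy = pd.DataFrame({
    'strain':   ['A', 'A', 'A', 'B', 'B', 'B'],
    'ens_gene': ['g1', 'g1', 'g2', 'g1', 'g2', 'g3'],
    'qval':     [0.001, 0.002, 0.5, 0.2, 0.003, 0.004],
})
q = 0.01  # q-value cutoff

# keep significant rows, then count the unique genes within each strain
deg_counts = toy[toy.qval < q].groupby('strain').ens_gene.nunique()
print(deg_counts)
```

One difference from the loop: a strain with no significant rows would be absent from `deg_counts` rather than reported as zero.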


Finally, I will calculate a statistic I refer to as internalization. Simply put, internalization measures how much of a smaller set is contained within a larger set: it is the number of differentially expressed isoforms shared by two genotypes, divided by the number of differentially expressed isoforms in the genotype that has fewer of them. For example, I want to know what fraction of the transcripts that are differentially expressed in *mdt-12(bx93)* are also differentially expressed in *mdt-12(sy622)*. The closer the internalization fraction is to 100%, the more completely *mdt-12(bx93)* is engulfed by *mdt-12(sy622)*.

In [6]:
sig = (tidy.qval < q)
# note: the overlap is computed on isoforms (target_id), not on genes
pretty.table_print(['pair', 'shared GENES', 'internalization'], space=50)
genotypes = tidy.genotype.unique()
for i, g1 in enumerate(genotypes):
    genes1 = tidy[sig & (tidy.genotype == g1)]

    for g2 in genotypes[i+1:]:
        genes2 = tidy[sig & (tidy.genotype == g2)]
        
        # size of the smaller of the two sets of DE isoforms
        smaller = min(genes1.shape[0], genes2.shape[0])
        if smaller == 0:
            continue
        
        # find the overlap between the two:
        n_iso = genes2[genes2.target_id.isin(genes1.target_id)].shape[0]
        if n_iso > 0:
            # .0f rather than .2g, so that 100 prints as '100' and not '1e+02'
            message = "{0}---{1}\t{2}\t{3:.0f}%".format(g1, g2, n_iso, 100*n_iso/smaller)
            pretty.table_print(message.split('\t'), space=50)

pair                                              shared GENES                                      internalization                                   
dpy22.bx93---dpy22-sy622                          148                                               71%                                               
dpy22.bx93---dpy22-transhet-sy622-bx93            128                                               62%                                               
dpy22-sy622---dpy22-transhet-sy622-bx93           425                                               48%                                               
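The internalization statistic can also be written as a small function on sets. This is a minimal sketch with made-up isoform IDs; the function name `internalization` is my own label for illustration.

```python
def internalization(a, b):
    """Fraction of the smaller set that is contained in the larger set."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return float('nan')  # undefined when either set is empty
    return len(a & b) / smaller

# made-up isoform IDs: 3 of the 4 isoforms in `small` also appear in `large`
small = ['t1', 't2', 't3', 't4']
large = ['t2', 't3', 't4', 't5', 't6', 't7', 't8', 't9']

frac = internalization(small, large)
print("{0:.0f}%".format(100 * frac))
```

Because the denominator is the size of the smaller set, the statistic is symmetric in its arguments and reaches 100% exactly when the smaller set is a subset of the larger one.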
