# Table of Contents
1. [Basic Statistics](#Basic-Statistics)

# Basic Statistics

In this Jupyter notebook, I will assemble the dataframe that I will use throughout the rest of the project. Then, I will calculate some relevant statistics: the number of genes and isoforms identified in total, the number of differentially expressed genes per genotype, the average effect sizes, and the percent internalization of each genotype. Finally, I will briefly show the benefits of pooling samples instead of using a general linear model (GLM) with interactions to analyze this dataset. To begin, I will import all the relevant libraries.

In [1]:
import pandas as pd
import numpy as np
import os

# import own libraries
import pretty_table as pretty

Having imported all the libraries I need, I will define the parameters that I will use for the rest of this notebook.

In [2]:
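# strains to analyze; the list order sets the sort order used below
strains = ['PS4187', 'PS4087', 'PS4176', 'EW15', 'MT4866', 'MT2124']
q = 0.1  # q-value cutoff

# map each strain to its position in the list above
strain_dict = {g: i for i, g in enumerate(strains)}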

# experimental design matrix fed into sleuth
genmap = pd.read_csv('../sleuth/rna_seq_info.txt', sep='\t', comment='#')
# strain to genotype information:
names = pd.read_csv('../input/strain_to_genotype.csv', comment='#')

Next, I will open the files containing the beta values output by sleuth for each strain. If a strain is not in the list of strains to be analyzed, I will skip its file. Finally, I will concatenate all of the data into a single dataframe, and work with that dataframe from then on.

In [3]:
# open all beta.csv files, and concatenate them into a single dataframe
frames = []
for root, dirs, files in os.walk("../sleuth/sleuth_strains"):
    for file in files:
        # don't open likelihood ratio test if there is one
        if file == 'lrt.csv':
            continue

        # extract the strain identifier from the filename
        # (drop the '.csv' extension and turn underscores into hyphens)
        strain = file[:-4].replace('_', '-')

        # if the strain isn't in our strain list, skip it
        # (the first two characters are a prefix, not part of the name)
        if strain[2:] not in strains:
            continue

        # open the dataframe, and add strain and order columns
        df = pd.read_csv(os.path.join(root, file), sep=',')
        df.sort_values('target_id', inplace=True)
        df['strain'] = strain.replace('b-', '')
        df['order'] = strain_dict[df.strain.unique()[0]]

        # add the dataframe to the list
        frames += [df]

# concatenate, then drop rows missing a gene ID, beta or q-value
tidy = pd.concat(frames)
tidy.dropna(subset=['ens_gene', 'b', 'qval'], inplace=True)

# sort by a pre-determined value, and then by target_id. Always sort like this
tidy.sort_values(['order', 'target_id'], ascending=True, inplace=True)

# drop the first column because it doesn't contain anything.
tidy.drop([tidy.columns[0]], axis=1, inplace=True)


tidy.head()

Unnamed: 0,target_id,pval,qval,b,se_b,mean_obs,var_obs,tech_var,sigma_sq,smooth_sigma_sq,final_sigma_sq,ens_gene,ext_gene,description,transcript_biotype,strain,order
10030,2L52.1a,0.631823,1.0,0.180717,0.377153,3.894122,1.016851,0.138264,0.065421,0.075103,0.075103,WBGene00007063,2L52.1,,protein_coding,PS4187,0
10031,2RSSE.1a,0.623173,1.0,0.174885,0.355922,4.155312,0.594458,0.130296,-0.040266,0.059725,0.059725,WBGene00007064,2RSSE.1,,protein_coding,PS4187,0
10032,2RSSE.1b,0.813709,1.0,-0.125369,0.532026,3.77613,4.014773,0.153215,0.271363,0.083518,0.271363,WBGene00007064,2RSSE.1,,protein_coding,PS4187,0
10033,3R5.1a,0.698628,1.0,-0.359459,0.928418,3.571237,1.255171,0.772584,0.520357,0.100956,0.520357,WBGene00007065,pot-3,Protection Of Telomeres 1 (Pot1) homolog [Sou...,protein_coding,PS4187,0
10034,3R5.1b,0.899553,1.0,0.061415,0.486547,4.709488,0.096104,0.317846,-0.251474,0.037246,0.037246,WBGene00007065,pot-3,Protection Of Telomeres 1 (Pot1) homolog [Sou...,protein_coding,PS4187,0
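
The filename handling in the loop above can be isolated into a small helper. This is only a sketch: `parse_strain` is a hypothetical function that is not part of the notebook, and it assumes the sleuth output files are named like `b_<strain>.csv`, which is inferred from the string slicing in the loop and not stated anywhere in the project.

```python
import os


def parse_strain(filename, prefix='b_'):
    """Return the strain encoded in a filename, or None if there is none.

    Assumes names of the form '<prefix><strain>.csv' (an assumption).
    """
    stem, ext = os.path.splitext(filename)
    if ext != '.csv' or not stem.startswith(prefix):
        return None
    return stem[len(prefix):]


strains = ['PS4187', 'PS4087', 'PS4176', 'EW15', 'MT4866', 'MT2124']
# made-up file listing, including a file to skip and an unlisted strain
files = ['b_PS4187.csv', 'b_EW15.csv', 'lrt.csv', 'b_XX1.csv']

kept = [parse_strain(f) for f in files if parse_strain(f) in strains]
print(kept)
```

Returning `None` for anything that does not match the pattern means files such as `lrt.csv` are skipped by the same membership test that skips unlisted strains.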


In [4]:
def strain_to_map(names, col='Alleles'):
    """
    A function to generate a dictionary of strain values to another column
    in the `names` dataframe.
    
    Params:
    -------
    names: pandas DataFrame. Must contain a `Strain` column
    col: string. Column name to associate values with.
    
    Output:
    strain_to: dictionary of Strain values to `col` values.
    """
    strain_to  = {names.Strain.values[i]: names[col].values[i]
                  for i in np.arange(len(names))}
    return strain_to

strain_to_allele = strain_to_map(names)
strain_to_genotype = strain_to_map(names, 'Genotype')
strain_to_fancy = strain_to_map(names, 'FancyName')
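
For comparison, the same kind of lookup can be built with pandas alone. The sketch below uses a two-row stand-in for the `names` table (the real file has more rows and columns), with the strain and allele pairs taken from the results reported later in this notebook.

```python
import pandas as pd

# small stand-in for the `names` table read from strain_to_genotype.csv
names = pd.DataFrame({
    'Strain': ['PS4187', 'PS4087'],
    'Alleles': ['bx93', 'sy622'],
})

# index by strain, pick the column of interest, and convert to a dict
strain_to_allele = names.set_index('Strain')['Alleles'].to_dict()
print(strain_to_allele)
```

Either approach produces a plain dictionary that can be passed to `Series.map`.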

In [5]:
tidy['fancy'] = tidy.strain.map(strain_to_fancy)
tidy['allele'] = tidy.strain.map(strain_to_allele)
tidy['genotype'] = tidy.strain.map(strain_to_allele)

In [6]:
# save this new dataframe
tidy.to_csv('../input/quantifications.csv', index=False)

Now that we have the dataframe, we can ask how many transcripts we identified in all genotypes, and how many genes these transcripts correspond to.

In [7]:
total_genes_id = tidy.target_id.unique().shape[0]

m = "Total isoforms identified in all genotypes: {0}"
print(m.format(total_genes_id))

m = "Total genes identified in all genotypes: {0}"
print(m.format(len(tidy.ens_gene.unique())))

Total isoforms identified in all genotypes: 19416
Total genes identified in all genotypes: 13704


I can also identify the number of differentially expressed genes for each genotype.

In [8]:
sig = (tidy.qval < q)
pretty.table_print(['Genotype', 'Strain', 'DEG'])
for genotype, group in tidy[sig].groupby('genotype'):
    # get the strain name and count the unique DE genes in this genotype:
    strain = group.strain.unique()[0]
    deg = len(group.ens_gene.unique())
    pretty.table_print([genotype, strain, deg])

Genotype            Strain              DEG                 
bx93                PS4187              266                 
bx93/sy622          PS4176              2128                
ga80                EW15                4613                
n1046               MT2124              2526                
n2021               MT4866              509                 
sy622               PS4087              2036                
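
The same per-genotype count can be written as a single groupby expression. The sketch below runs on a made-up toy table, not the real data, and assumes only the column names used above (`genotype`, `ens_gene`, `qval`).

```python
import pandas as pd

# toy data: genotype 'a' has one significant gene with two isoforms,
# genotype 'b' has two significant genes
toy = pd.DataFrame({
    'genotype': ['a', 'a', 'a', 'b', 'b'],
    'ens_gene': ['g1', 'g1', 'g2', 'g1', 'g3'],
    'qval': [0.01, 0.05, 0.5, 0.02, 0.09],
})
q = 0.1  # q-value cutoff

# keep significant rows, then count unique genes within each genotype
deg = toy[toy.qval < q].groupby('genotype').ens_gene.nunique()
print(deg)
```

Counting unique `ens_gene` values, and not rows, ensures that a gene with several differentially expressed isoforms is counted once.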


Finally, I will calculate a statistic I refer to as internalization. Simply put, internalization measures how much of a smaller set is contained within a larger one: it is the number of shared transcripts divided by the size of the smaller set. For example, I want to know what fraction of the transcripts that are differentially expressed in *mdt-12(bx93)* are also differentially expressed in *mdt-12(sy622)*. The closer the internalization fraction is to 100%, the more *mdt-12(bx93)* is engulfed by *mdt-12(sy622)*.
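
As a toy illustration of this definition, the sketch below computes internalization for two small, made-up sets of transcript IDs. The `internalization` helper is hypothetical and is not the code used in the analysis that follows.

```python
def internalization(set1, set2):
    """Fraction of the smaller set that is also found in the larger set."""
    smaller, larger = sorted([set(set1), set(set2)], key=len)
    if not smaller:
        raise ValueError('both sets must be non-empty')
    return len(smaller & larger) / len(smaller)


# made-up transcript IDs: 3 of the 4 transcripts in `small` are in `large`
small = {'t1', 't2', 't3', 't4'}
large = {'t1', 't2', 't3', 't5', 't6', 't7'}

print(internalization(small, large))
```

Because the denominator is always the smaller set, the statistic is symmetric in its arguments and equals 1 only when the smaller set is fully contained in the larger one.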

In [9]:
pretty.table_print(['pair', 'shared GENES', 'internalization',
                    'internal'], space=17)

passed = []
for genotype1, group1 in tidy[sig].groupby('genotype'):
    for genotype2, group2 in tidy[sig].groupby('genotype'):
        # skip self-comparisons
        if genotype1 == genotype2:
            continue

        # record the reversed pair so that each unordered pair
        # is only reported once
        passed += [genotype2 + genotype1]
        if genotype1 + genotype2 in passed:
            continue

        # find the size of each group (each row is a DE isoform):
        size1 = group1.shape[0]
        size2 = group2.shape[0]
        # count the isoforms (target_ids) shared by both genotypes
        overlap = group1[group1.target_id.isin(group2.target_id)]
        overlap = len(overlap.target_id.unique())
        
        # find whichever group is smaller:
        denominator = np.min([size1, size2])
        
        # record which group is smaller:
        if size1 < size2:
            internal = genotype1
        else:
            internal = genotype2
        
        # generate a message to pretty print
        m = "{0}-{1}\t".format(genotype1, genotype2)
        m += "{0}\t".format(overlap)
        m += "{0:.2g}\t".format(overlap/denominator)
        m += "{0}".format(internal)
        
        # pretty print:
        pretty.table_print(m.split('\t'), space=17)

pair             shared GENES     internalization  internal         
bx93-bx93/sy622  192              0.72             bx93             
bx93-ga80        184              0.69             bx93             
bx93-n1046       130              0.49             bx93             
bx93-n2021       56               0.21             bx93             
bx93-sy622       189              0.71             bx93             
bx93/sy622-ga80  1399             0.62             bx93/sy622       
bx93/sy622-n1046 981              0.44             bx93/sy622       
bx93/sy622-n2021 250              0.47             n2021            
bx93/sy622-sy622 836              0.39             sy622            
ga80-n1046       1947             0.76             n1046            
ga80-n2021       324              0.61             n2021            
ga80-sy622       1314             0.62             sy622            
n1046-n2021      144              0.27             n2021            
n1046-sy622      597              