* Introduction

Ken tried to compute a total FPKM broken down by spike vs non-spike.

See mccue-single-cell-spike-v-nonspike.pdf

He'd like me to try to duplicate it to see if it's reasonable:

    |-------------------------+----------+----------+------+------------+--------|
    |                         |          |   Spikes |      | Non_spikes |        |
    |-------------------------+----------+----------+------+------------+--------|
    | cell_type               | run_type | tot_fpkm |    n |   tot_fpkm |      n |
    |-------------------------+----------+----------+------+------------+--------|
    | Hs_asp_purkinje_UMB5294 | pooled   |   709835 | 1656 |   15714689 | 189110 |
    | Hs_asp_purkinje_UMB5294 | single   |   486962 | 1748 |   15815523 | 181740 |
    | Hs_purkinje             | pooled   |   221500 | 1840 |   17227824 | 271866 |
    | Hs_purkinje             | single   |    33904 | 1840 |   17615127 | 247399 |
    | Mm_layer_V_pyramidal    | pooled   |    98677 | 1380 |   12765754 | 215976 |
    | Mm_layer_V_pyramidal    | single   |   136069 | 1380 |   12420768 | 205757 |
    | Mm_purkinje             | pooled   |    71468 | 1472 |   14203585 | 272805 |
    | Mm_purkinje             | single   |    74021 | 1472 |   13836904 | 244684 |


In [1]:
import pandas
from rsemcache import RSEMCache

In [2]:
rsems = RSEMCache('rsem-genes.h5')

In [3]:
tot_fpkms = []
for experiment in sorted(rsems.experiments):
    libraries = rsems.experiments[experiment]
    df = rsems.get_gene_expression(libraries)
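    # Boolean masks over the gene index; spike-in IDs start with 'gSpikein'.
    spikes = df.index.str.startswith('gSpikein')
    notspikes = ~spikes
    # Sum down each column, then across columns, for one total per experiment.
    spike_fpkms = df[spikes].sum().sum()
    # count() tallies non-null cells, so this counts gene/library pairs, not distinct genes.
    spike_count = df[spikes].count().sum()
    nonspike_fpkms = df[notspikes].sum().sum()
    nonspike_count = df[notspikes].count().sum()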
    ratio = spike_fpkms / nonspike_fpkms
    tot_fpkms.append((experiment, spike_fpkms, spike_count, nonspike_fpkms, nonspike_count, ratio ))

tot_fpkms = pandas.DataFrame(
    tot_fpkms, 
    columns=['name', 'spike_fpkms', 'spike_count', 'nonspike_fpkms', 'nonspike_count', 'ratio'])
tot_fpkms = tot_fpkms.set_index('name')
tot_fpkms

name,spike_fpkms,spike_count,nonspike_fpkms,nonspike_count,ratio
Hs_asp_purkinje_UMB5294_poolsplit,709835.74,1728,15465362.09,1051992,0.045898
Hs_asp_purkinje_UMB5294_single,486962.17,1824,15552696.86,1110436,0.03131
Hs_purkinje_poolsplit,221500.44,1920,16954734.14,1168880,0.013064
Hs_purkinje_single,339045.25,1920,17342437.78,1168880,0.01955
Mm_layer_V_pyramidal_poolsplit,98677.12,1440,12535781.58,1043910,0.007872
Mm_layer_V_pyramidal_single,159694.86,1728,14708965.11,1252692,0.010857
Mm_purkinje_poolsplit,71468.2,1536,13958857.99,1113504,0.00512
Mm_purkinje_single,74021.29,1536,13592688.46,1113504,0.005446


In [4]:
tot_tpm = []
for experiment in sorted(rsems.experiments):
    libraries = rsems.experiments[experiment]
    df = rsems.get_gene_expression(libraries, quantification='TPM')
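    # Boolean masks over the gene index; spike-in IDs start with 'gSpikein'.
    spikes = df.index.str.startswith('gSpikein')
    notspikes = ~spikes
    # Sum down each column, then across columns, for one total per experiment.
    spike_tpm = df[spikes].sum().sum()
    # count() tallies non-null cells, so this counts gene/library pairs, not distinct genes.
    spike_count = df[spikes].count().sum()
    nonspike_tpm = df[notspikes].sum().sum()
    nonspike_count = df[notspikes].count().sum()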
    ratio = spike_tpm / nonspike_tpm
    tot_tpm.append((experiment, spike_tpm, spike_count, nonspike_tpm, nonspike_count, ratio ))

tot_tpm = pandas.DataFrame(
    tot_tpm, 
    columns=['name', 'spike_tpm', 'spike_count', 'nonspike_tpm', 'nonspike_count', 'ratio'])
tot_tpm = tot_tpm.set_index('name')
tot_tpm

name,spike_tpm,spike_count,nonspike_tpm,nonspike_count,ratio
Hs_asp_purkinje_UMB5294_poolsplit,777299.36,1728,17222703.18,1051992,0.045132
Hs_asp_purkinje_UMB5294_single,591185.36,1824,18408815.33,1110436,0.032114
Hs_purkinje_poolsplit,255767.61,1920,19744230.31,1168880,0.012954
Hs_purkinje_single,397361.72,1920,19602639.31,1168880,0.020271
Mm_layer_V_pyramidal_poolsplit,117042.64,1440,14882956.82,1043910,0.007864
Mm_layer_V_pyramidal_single,189410.55,1728,17810588.88,1252692,0.010635
Mm_purkinje_poolsplit,81485.19,1536,15918513.39,1113504,0.005119
Mm_purkinje_single,86628.55,1536,15913372.3,1113504,0.005444
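
A quick sanity check on the TPM totals above, offered as a sketch rather than part of the original analysis. It assumes that each library's TPM values sum to one million, so spike plus non-spike TPM for an experiment should be close to a whole number of millions, and that number is the implied library count. The numbers are copied by hand from the table; the `check` frame and its column names are made up for illustration.

```python
import pandas

# Totals copied from the TPM table above.
tot_tpm = pandas.DataFrame(
    [
        ('Hs_asp_purkinje_UMB5294_poolsplit', 777299.36, 1728, 17222703.18, 1051992),
        ('Hs_asp_purkinje_UMB5294_single', 591185.36, 1824, 18408815.33, 1110436),
        ('Hs_purkinje_poolsplit', 255767.61, 1920, 19744230.31, 1168880),
        ('Hs_purkinje_single', 397361.72, 1920, 19602639.31, 1168880),
        ('Mm_layer_V_pyramidal_poolsplit', 117042.64, 1440, 14882956.82, 1043910),
        ('Mm_layer_V_pyramidal_single', 189410.55, 1728, 17810588.88, 1252692),
        ('Mm_purkinje_poolsplit', 81485.19, 1536, 15918513.39, 1113504),
        ('Mm_purkinje_single', 86628.55, 1536, 15913372.3, 1113504),
    ],
    columns=['name', 'spike_tpm', 'spike_count', 'nonspike_tpm', 'nonspike_count'],
).set_index('name')

total = tot_tpm['spike_tpm'] + tot_tpm['nonspike_tpm']

# Assumption: every library contributes 1e6 TPM, so total / 1e6 is the library count.
implied_libraries = (total / 1e6).round().astype(int)

# Distance from an exact multiple of 1e6; this should be rounding noise only.
residual = total - implied_libraries * 1e6

# Hypothetical summary frame: per-library counts implied by the totals.
check = pandas.DataFrame({
    'implied_libraries': implied_libraries,
    'residual_tpm': residual.round(2),
    'spikes_per_library': tot_tpm['spike_count'] / implied_libraries,
    'genes_per_library': tot_tpm['nonspike_count'] / implied_libraries,
})
print(check)
```

If the assumption holds, `residual_tpm` stays within a few TPM of zero and the per-library counts come out as whole numbers. This also makes it explicit that `spike_count` and `nonspike_count` scale with the number of libraries in an experiment.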


TPM stands for Transcripts Per Million. It is a relative measure of transcript abundance, and the TPM values of all transcripts in a sample sum to 1 million. FPKM stands for Fragments Per Kilobase of transcript per Million mapped reads. It is another relative measure of transcript abundance. If we define $\bar{l}$ to be the mean transcript length in a sample, which can be calculated as

$\bar{l} = \sum_i \frac{TPM_i}{10^6} \cdot \mathrm{effective\_length}_i$ (where $i$ runs over every transcript),

then the following equation holds:

$FPKM_i = \frac{10^3}{\bar{l}} \cdot TPM_i$.
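
The conversion above can be sketched in a few lines of pandas. This is a minimal illustration on a made-up three-transcript library, not output from `rsem-genes.h5`; the function names, the transcript IDs, and the toy numbers are all hypothetical, and `l_bar` in the code is the mean transcript length from the formula.

```python
import pandas


def mean_effective_length(tpm, effective_length):
    """Mean transcript length, weighting each transcript by its TPM fraction."""
    return (tpm / 1e6 * effective_length).sum()


def fpkm_from_tpm(tpm, effective_length):
    """Rescale TPM to FPKM using the per-library factor 1e3 / l_bar."""
    l_bar = mean_effective_length(tpm, effective_length)
    return 1e3 / l_bar * tpm


# Hypothetical single library; the TPM column sums to 1e6 by construction.
toy = pandas.DataFrame(
    {
        'TPM': [500000.0, 300000.0, 200000.0],
        'effective_length': [1000.0, 2000.0, 500.0],
    },
    index=['tx_a', 'tx_b', 'gSpikein_toy'],
)

l_bar = mean_effective_length(toy['TPM'], toy['effective_length'])
toy['FPKM'] = fpkm_from_tpm(toy['TPM'], toy['effective_length'])

# Spike / non-spike ratio in each unit, for this one library.
is_spike = toy.index.str.startswith('gSpikein')
ratio_tpm = toy.loc[is_spike, 'TPM'].sum() / toy.loc[~is_spike, 'TPM'].sum()
ratio_fpkm = toy.loc[is_spike, 'FPKM'].sum() / toy.loc[~is_spike, 'FPKM'].sum()

print(l_bar)
print(toy)
print(ratio_tpm, ratio_fpkm)
```

Because the factor $10^3$ divided by the mean length is a single constant per library, the spike to non-spike ratio within one library is the same in FPKM and in TPM. Once several libraries with different mean lengths are summed together, the two ratios can differ slightly, as they do between the FPKM and TPM tables.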