#Index

* <a href="#Show-Describe">Show Describe</a>.
* <a href="#Show-FPKMs">Show FPKMs</a>.
* <a href="#List-libraries-without-spikes">List libraries without spikes</a>.
* <a href="#List-libraries-with-Spikes">List libraries with Spikes</a> that said they didn't have spikes.

#Introduction

While trying to deal with our single cell QC, I noticed that some libraries declare that they used spikes, but their spike expression is mostly zeros, with only a few spikes at quite low expression. That implies they don't actually have spikes added.

The DCC asked for a list of which ones were missing, and instead of cluttering one of my other notebooks I thought I'd put the answer to that question in its own notebook.

In [1]:
import pandas
from IPython.display import display

In [2]:
spike_store = pandas.HDFStore('all-rna-spikes.h5', 'r')

#Show Describe

<a href="#Index">Index</a>

Let's look at what the unfiltered describe() output looks like so we can see what a reasonable threshold for detection might be. For ENCSR156CIL we can see a mix of libraries with expression, e.g. ENCLB063ZZZ, and ones that really look like there's no expression, e.g. ENCLB265JPE. In the next cell let's look at the raw FPKMs for those two libraries.

In [3]:
spike_store['/references/ENCSR156CIL'].describe().T

library,count,mean,std,min,25%,50%,75%,max
ENCLB063ZZZ,96,1709.371563,7406.443508,0,0.6325,7.870,120.8450,59553.99
ENCLB064ZZZ,96,1598.284167,7029.320688,0,0.4200,7.135,110.1900,57540.82
ENCLB265JPE,96,0.412813,1.625931,0,0.0000,0.000,0.0000,11.34
ENCLB672PSC,96,0.297604,1.218616,0,0.0000,0.000,0.0000,7.72
ENCLB061ZZZ,96,2050.480208,9393.375634,0,0.4750,9.210,138.4850,79611.67
ENCLB062ZZZ,96,2215.719271,10480.867024,0,0.3675,10.305,142.7650,90866.44
ENCLB459IUG,96,398.621250,1517.159741,0,0.0000,1.385,30.2275,10717.37
ENCLB779RPP,96,479.373854,1846.159082,0,0.0200,1.020,39.1575,12137.86
ENCLB484KMD,96,141.023542,524.732269,0,0.0000,0.445,8.4625,3191.32
ENCLB596KKZ,96,100.464792,378.783051,0,0.0000,0.600,5.3450,2345.50


#Show FPKMs

<a href="#Index">Index</a>

Here are just two libraries, one that I think has expression (ENCLB063ZZZ) and one that I think doesn't (ENCLB265JPE). As you can see, a number of spikes in the first library have very high expression levels, while the second stays near zero.

In [4]:
spike_store['/references/ENCSR156CIL'][['ENCLB063ZZZ', 'ENCLB265JPE']]

gene_id,ENCLB063ZZZ,ENCLB265JPE
gSpikein_ERCC-00002,25980.82,5.71
gSpikein_ERCC-00003,1741.77,1.00
gSpikein_ERCC-00004,9874.01,2.56
gSpikein_ERCC-00007,0.00,0.00
gSpikein_ERCC-00009,1528.75,0.00
gSpikein_ERCC-00012,0.27,0.00
gSpikein_ERCC-00013,1.39,0.00
gSpikein_ERCC-00014,0.87,0.00
gSpikein_ERCC-00016,0.11,0.00
gSpikein_ERCC-00017,0.31,0.00
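
To put a number on the difference between these two libraries, here is a rough sketch that rebuilds the ten rows shown above as a DataFrame and counts how many spikes reach a cutoff. The 20 FPKM cutoff is the one used by the filter later in this notebook; the column names `max_fpkm` and `spikes_at_or_above_cutoff` are just labels I made up for this example.

```python
import pandas

# The ten spike-in rows copied from the table above (FPKM values).
fpkm = pandas.DataFrame(
    {
        "ENCLB063ZZZ": [25980.82, 1741.77, 9874.01, 0.00, 1528.75,
                        0.27, 1.39, 0.87, 0.11, 0.31],
        "ENCLB265JPE": [5.71, 1.00, 2.56, 0.00, 0.00,
                        0.00, 0.00, 0.00, 0.00, 0.00],
    },
    index=["gSpikein_ERCC-%05d" % n for n in (2, 3, 4, 7, 9, 12, 13, 14, 16, 17)],
)
fpkm.index.name = "gene_id"

cutoff = 20  # FPKM

# One row per library: its highest spike and how many spikes reach the cutoff.
summary = pandas.DataFrame(
    {
        "max_fpkm": fpkm.max(),
        "spikes_at_or_above_cutoff": (fpkm >= cutoff).sum(),
    }
)
print(summary)
```

In these ten rows ENCLB063ZZZ has four spikes at or above the cutoff and ENCLB265JPE has none, which is the pattern the filter in the next section relies on.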


#List libraries without spikes

<a href="#Index">Index</a>

This covers all of the RNA-Seq libraries in my snapshot, which I made around Dec 21, 2015.

The filter I'm using should only report libraries in which no spike has expression >= 20 FPKM.

Some libraries list no spike-ins as used; those are in my dataset under the key '/None' and are treated separately.
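
Since the mask-then-dropna idiom is a little indirect, here is a small sketch of how it behaves. The library and spike names and all the values are made up for illustration; only the 20 FPKM threshold comes from this notebook.

```python
import pandas

# Hypothetical FPKM values: rows are spike-ins, columns are libraries.
expression = pandas.DataFrame(
    {
        "lib_spiked": [25000.0, 1500.0, 0.0],
        "lib_unspiked": [5.7, 1.0, 0.0],
    },
    index=["spike_a", "spike_b", "spike_c"],
)

# Boolean indexing keeps values < 20 and turns everything else into NaN.
masked = expression[expression < 20]

# Dropping columns that contain any NaN leaves only the libraries
# where every single spike is below 20 FPKM.
low_expression = masked.dropna(axis=1, how="any")

# The same selection written as an explicit per-column test.
same = expression.loc[:, (expression < 20).all(axis=0)]

print(list(low_expression.columns))
```

Only `lib_unspiked` survives, because `lib_spiked` has at least one value at or above the threshold.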

In [5]:
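for spike_in_id in spike_store.keys():
    # '/None' holds libraries that list no spike-ins; those are checked separately.
    if spike_in_id != '/None':
        expression = spike_store[spike_in_id]
        # Mask values >= 20 FPKM to NaN, then keep only the libraries (columns)
        # with no NaN, i.e. libraries where every spike is below 20 FPKM.
        low_expression = expression[expression < 20].dropna(axis=1, how='any')
        if len(low_expression.columns) > 0:
            print('Spike in ID', spike_in_id)
            display(low_expression.describe().T)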


Spike in ID /references/ENCSR133ALU


library,count,mean,std,min,25%,50%,75%,max
ENCLB847UDV,96,0.152708,0.651456,0,0,0,0.0,5.2
ENCLB704CYQ,96,0.17,0.720069,0,0,0,0.0,6.11
ENCLB217DSV,96,0.181042,0.693643,0,0,0,0.005,4.99
ENCLB159SLV,96,0.083958,0.320458,0,0,0,0.0,2.17
ENCLB080NNG,96,0.096042,0.460683,0,0,0,0.0,4.16
ENCLB180OTB,96,0.123958,0.501426,0,0,0,0.0,4.0
ENCLB318WHF,96,0.310417,1.208413,0,0,0,0.0225,9.42
ENCLB590UZK,96,0.341458,1.352162,0,0,0,0.03,10.81
ENCLB074REG,96,0.235208,0.91155,0,0,0,0.0,6.78
ENCLB415KPR,96,0.154688,0.656923,0,0,0,0.0,5.57


Spike in ID /references/ENCSR156CIL


library,count,mean,std,min,25%,50%,75%,max
ENCLB265JPE,96,0.412813,1.625931,0,0,0,0,11.34
ENCLB672PSC,96,0.297604,1.218616,0,0,0,0,7.72
ENCLB038MNH,96,0.2125,0.921084,0,0,0,0,6.52
ENCLB806ISB,96,0.161354,0.719777,0,0,0,0,5.57
ENCLB003LKL,96,0.212917,0.994799,0,0,0,0,7.42
ENCLB436FJW,96,0.509687,1.949613,0,0,0,0,11.89
ENCLB787ATZ,96,0.481875,1.957999,0,0,0,0,13.08
ENCLB500PHK,96,0.329583,1.352818,0,0,0,0,9.34
ENCLB073TQT,96,0.437708,1.832367,0,0,0,0,12.48
ENCLB556JLH,96,0.43,1.795278,0,0,0,0,12.67


Spike in ID /references/ENCSR449DXG


library,count,mean,std,min,25%,50%,75%,max
ENCLB057ZZZ,96,0.000417,0.004082,0,0,0,0,0.04
ENCLB058ZZZ,96,0.0,0.0,0,0,0,0,0.0
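
Since the DCC asked for a list, something flatter than the describe() tables might be easier to hand over. This is only a sketch: `libraries_without_spikes` is a helper name I made up, it takes a plain dict of DataFrames instead of the HDF5 store, and the toy experiment and library names below are invented.

```python
import pandas


def libraries_without_spikes(tables, cutoff=20):
    """Return one row per library whose spike-ins are all below `cutoff` FPKM.

    `tables` maps a spike-in ID to a DataFrame (rows: spikes, columns: libraries).
    """
    rows = []
    for spike_in_id, expression in tables.items():
        if spike_in_id == "/None":
            continue  # libraries that list no spike-ins are handled separately
        below = (expression < cutoff).all(axis=0)
        for library_id in below[below].index:
            rows.append(
                {
                    "spike_in_id": spike_in_id,
                    "library": library_id,
                    "max_fpkm": expression[library_id].max(),
                }
            )
    return pandas.DataFrame(rows, columns=["spike_in_id", "library", "max_fpkm"])


# Hypothetical stand-in for the HDF5 store: one made-up experiment plus '/None'.
toy_tables = {
    "/references/EXP_A": pandas.DataFrame(
        {"lib1": [3000.0, 0.5], "lib2": [4.2, 0.0]}, index=["spike_a", "spike_b"]
    ),
    "/None": pandas.DataFrame({"lib3": [0.0, 0.0]}, index=["spike_a", "spike_b"]),
}

report = libraries_without_spikes(toy_tables)
print(report.to_csv(index=False))
```

With the real data the mapping could be built as `{key: spike_store[key] for key in spike_store.keys()}` while the store is still open.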


#List libraries with Spikes

<a href="#Index">Index</a>

Just to check, make sure that libraries that list no spike-ins have no spike-in expression.

This should only list libraries that have expression > 10 FPKM, which should be none of them.

In [6]:
expression = spike_store['/None']
# Keep libraries (columns) where at least one spike exceeds 10 FPKM.
expressed = expression.loc[:, (expression > 10).any(axis=0)]
if len(expressed.columns) > 0:
    display(expressed.describe().T)

In [7]:
spike_store.close()