## Reading single-cell profiles into memory
- All single-cell information for a plate is stored in one SQLite file per plate.
- SQLite files are huge (up to 50 GB), and loading a whole file into memory may cause out-of-memory errors.


#### Here are alternative ways of handling this issue:

- Reading All the Single Cells of a plate

- Reading random images or a defined subset of the plate images

- Reading a subset of wells from the plate 

- Reading a subset of features from the plate 

- Reading a subset of features and a subset of wells of a plate 
   
- Reading a subset of objects from a subset of wells of the plate
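
Before choosing among these options, it can help to see which tables and columns a plate file contains without loading any rows. The sketch below is a minimal illustration using the standard-library `sqlite3` module. The helper name `list_tables_and_columns` and the toy schema (an `Image` table plus a `Cells` table) are assumptions made for the example, not the readers in `utils.read_data`.

```python
import sqlite3


def list_tables_and_columns(conn):
    """Return {table name: [column names]} without reading any data rows."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in info]
    return schema


# Toy in-memory database standing in for a plate file (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Image (TableNumber TEXT, ImageNumber INTEGER, "
    "Image_Metadata_Well TEXT)"
)
conn.execute(
    "CREATE TABLE Cells (TableNumber TEXT, ImageNumber INTEGER, "
    "ObjectNumber INTEGER, Cells_Intensity_IntegratedIntensity_DNA REAL)"
)

schema = list_tables_and_columns(conn)
for table, columns in schema.items():
    print(table, columns)
```

For a real plate, `sqlite3.connect(fileName)` would replace the in-memory connection; only the schema is read, so this stays cheap even for very large files.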
  
  
**Timing example**
* SQ00015195  :  11.55 GB

- Reading all the single cells of a plate

- Reading random images or a defined subset of the plate images

- Reading a subset of wells from the plate

- Reading a subset of features from the plate
   - One feature: 7 mins

- Reading a subset of features and a subset of wells of a plate
   - One feature and one well: 0.6 mins

- Reading a subset of objects from a subset of wells of the plate
  

In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib notebook
# Project helper modules, imported first so the explicit imports below take precedence
from utils.read_data import *
from utils.visualize_data import *

import os
import sys
import time
from functools import reduce

import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import pearsonr
from sqlalchemy import create_engine

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
# Example dataset:
#     drug rep
meta_lincs = pd.read_csv("/home/ubuntu/bucket/projects/2018_04_20_Rosetta/workspace/results/synth_meta/meta_lincs_repLevel.csv")
rootDirDrug = '/home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace'
batchName = '2016_04_01_a549_48hr_batch1'
p, wells = "SQ00015195", ["A13"]
# Backend layout: <root>/backend/<batch>/<plate>/<plate>.sqlite
fileName = os.path.join(rootDirDrug, "backend", batchName, p, p + ".sqlite")


In [5]:
fileName

'/home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/2016_04_01_a549_48hr_batch1/SQ00015195/SQ00015195.sqlite'

###### Check file size

In [3]:
sqlFileSizeGB = os.stat(fileName).st_size / 1e9  # bytes -> GB (decimal)
print(p, ' : ', sqlFileSizeGB)

SQ00015195  :  11.553037312


## Reading All the Single Cells of a plate

In [None]:
# Option 1: Python SQL reader (SQLAlchemy); reads the listed compartment tables
compartments=["cells", "cytoplasm", "nuclei"]
# compartments=["Neurites","CellBodies","CellBodiesPlusNeurites","Nuclei","Cytoplasm"]

df_p_s=readSingleCellData_sqlalch(fileName,compartments);

# Option 2: R SQL reader; run only one of the two options, since both assign df_p_s
df_p_s=readSingleCellData_r(fileName);

## Reading random images or a defined subset of the plate images

In [None]:
df_p_s=readSingleCellData_sqlalch_random_image_subset(fileName,50);
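
One way a random image subset could be read is to let SQLite pick the images, so that only the objects of the sampled images are returned. This is a sketch under assumptions, not the implementation of `readSingleCellData_sqlalch_random_image_subset`: the function name `read_random_image_subset`, the `Image`/`Cells` tables, and the `TableNumber`/`ImageNumber` join keys are hypothetical stand-ins.

```python
import sqlite3


def read_random_image_subset(conn, n_images, table="Cells"):
    """Return all object rows that belong to n_images randomly chosen images."""
    # The subquery samples image keys inside SQLite; the join then keeps
    # only the objects of those images.
    query = f"""
        SELECT c.*
        FROM "{table}" AS c
        JOIN (SELECT TableNumber, ImageNumber
              FROM Image
              ORDER BY RANDOM()
              LIMIT ?) AS s
          ON c.TableNumber = s.TableNumber
         AND c.ImageNumber = s.ImageNumber
    """
    return conn.execute(query, (n_images,)).fetchall()


# Toy in-memory plate (hypothetical schema): 10 images with 3 objects each.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Image (TableNumber TEXT, ImageNumber INTEGER, "
    "Image_Metadata_Well TEXT)"
)
conn.execute(
    "CREATE TABLE Cells (TableNumber TEXT, ImageNumber INTEGER, "
    "ObjectNumber INTEGER, Cells_Intensity_IntegratedIntensity_DNA REAL)"
)
for image in range(1, 11):
    conn.execute("INSERT INTO Image VALUES ('t', ?, 'A01')", (image,))
    for obj in range(1, 4):
        conn.execute(
            "INSERT INTO Cells VALUES ('t', ?, ?, ?)", (image, obj, 100.0 + obj)
        )

rows = read_random_image_subset(conn, n_images=2)
sampled_images = {row[1] for row in rows}  # column 1 is ImageNumber
print(len(rows), "objects from", len(sampled_images), "images")
```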

## Reading a subset of wells from the plate

In [None]:
df_p_s=readSingleCellData_sqlalch_well_subset(fileName,wells);
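
A well subset can be expressed as a join between the object table and the image table, with the well filter applied inside the database. The sketch below assumes a hypothetical schema in which `Image` carries an `Image_Metadata_Well` column and both tables share `TableNumber` and `ImageNumber`; `read_well_subset` is an illustrative name, not the function used above.

```python
import sqlite3


def read_well_subset(conn, wells, table="Cells"):
    """Return (well, object row...) tuples for objects in the given wells."""
    # One "?" placeholder per well keeps the well names out of the SQL string.
    placeholders = ", ".join("?" for _ in wells)
    query = f"""
        SELECT i.Image_Metadata_Well, c.*
        FROM "{table}" AS c
        JOIN Image AS i
          ON c.TableNumber = i.TableNumber
         AND c.ImageNumber = i.ImageNumber
        WHERE i.Image_Metadata_Well IN ({placeholders})
    """
    return conn.execute(query, list(wells)).fetchall()


# Toy in-memory plate (hypothetical schema): 4 images, 2 objects each.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Image (TableNumber TEXT, ImageNumber INTEGER, "
    "Image_Metadata_Well TEXT)"
)
conn.execute(
    "CREATE TABLE Cells (TableNumber TEXT, ImageNumber INTEGER, "
    "ObjectNumber INTEGER, Cells_Intensity_IntegratedIntensity_DNA REAL)"
)
for image, well in enumerate(["A13", "A13", "P20", "B01"], start=1):
    conn.execute("INSERT INTO Image VALUES ('t', ?, ?)", (image, well))
    for obj in (1, 2):
        conn.execute(
            "INSERT INTO Cells VALUES ('t', ?, ?, ?)", (image, obj, 10.0 * image)
        )

rows = read_well_subset(conn, ["A13"])
print(len(rows), "objects in well A13")
```

How fast such a query runs on a real plate depends on which indexes the SQLite file contains.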

## Reading a subset of objects from a subset of wells of the plate

In [None]:
df_p_s=readSingleCellData_sqlalch_wellAndObject_subset(fileName,wells,50);

## Reading a subset of features from the plate 

In [15]:
selected_features='Cells_Intensity_IntegratedIntensity_DNA'
df_p_s=readSingleCellData_sqlalch_features_subset(fileName,selected_features);


time elapsed: 7.294410037994385
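
Selecting only the needed columns keeps the result small, although the database may still have to scan the whole table. The sketch below illustrates the idea with a hypothetical helper, `read_feature_subset`, which checks the requested feature names against the table schema before building the query. The key columns `TableNumber`, `ImageNumber`, and `ObjectNumber` are assumptions about the schema.

```python
import sqlite3

KEY_COLUMNS = ["TableNumber", "ImageNumber", "ObjectNumber"]  # assumed keys


def read_feature_subset(conn, features, table="Cells"):
    """Return (column names, rows) containing only the keys and chosen features."""
    if isinstance(features, str):
        features = [features]
    # Validate names against the schema, since column names cannot be bound
    # as query parameters.
    available = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    missing = [name for name in features if name not in available]
    if missing:
        raise ValueError(f"Unknown feature(s) in {table}: {missing}")
    columns = KEY_COLUMNS + list(features)
    column_sql = ", ".join(f'"{name}"' for name in columns)
    rows = conn.execute(f'SELECT {column_sql} FROM "{table}"').fetchall()
    return columns, rows


# Toy in-memory table (hypothetical schema) with two feature columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cells (TableNumber TEXT, ImageNumber INTEGER, "
    "ObjectNumber INTEGER, Cells_Intensity_IntegratedIntensity_DNA REAL, "
    "Cells_AreaShape_Area REAL)"
)
for obj in range(1, 6):
    conn.execute(
        "INSERT INTO Cells VALUES ('t', 1, ?, ?, ?)", (obj, 2.5 * obj, 300.0)
    )

columns, rows = read_feature_subset(conn, "Cells_Intensity_IntegratedIntensity_DNA")
print(columns)
print(len(rows), "rows,", len(rows[0]), "columns each")
```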


## Reading a subset of features and a subset of wells of a plate 

In [6]:
selected_features='Cells_Intensity_IntegratedIntensity_DNA'

# Note: this cell reads a different plate (SQ00015199) and well (P20) than the cells above
p,wells="SQ00015199", ['P20']
fileName=rootDirDrug+"/backend/"+batchName+"/"+p+"/"+p+".sqlite"
df_p_s=readSingleCellData_sqlalch_FeatureAndWell_subset(fileName,selected_features,wells);

time elapsed: 5.4183234333992  mins
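
Combining both filters means that only the chosen columns of the chosen wells ever leave the database. The following is a minimal sketch of that combination under the same assumed schema (`Image.Image_Metadata_Well`, shared `TableNumber`/`ImageNumber` keys). `read_feature_and_well_subset` is a hypothetical name and is not claimed to match `readSingleCellData_sqlalch_FeatureAndWell_subset`.

```python
import sqlite3


def read_feature_and_well_subset(conn, features, wells, table="Cells"):
    """Return (column names, rows) for chosen features in chosen wells."""
    # Column names cannot be bound as parameters, so check them first.
    available = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    unknown = sorted(set(features) - available)
    if unknown:
        raise ValueError(f"Unknown feature(s) in {table}: {unknown}")
    feature_sql = ", ".join(f'c."{name}"' for name in features)
    placeholders = ", ".join("?" for _ in wells)
    query = f"""
        SELECT i.Image_Metadata_Well, c.ObjectNumber, {feature_sql}
        FROM "{table}" AS c
        JOIN Image AS i
          ON c.TableNumber = i.TableNumber
         AND c.ImageNumber = i.ImageNumber
        WHERE i.Image_Metadata_Well IN ({placeholders})
    """
    columns = ["Image_Metadata_Well", "ObjectNumber"] + list(features)
    return columns, conn.execute(query, list(wells)).fetchall()


# Toy in-memory plate (hypothetical schema): 3 wells, 2 objects per image.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Image (TableNumber TEXT, ImageNumber INTEGER, "
    "Image_Metadata_Well TEXT)"
)
conn.execute(
    "CREATE TABLE Cells (TableNumber TEXT, ImageNumber INTEGER, "
    "ObjectNumber INTEGER, Cells_Intensity_IntegratedIntensity_DNA REAL, "
    "Cells_AreaShape_Area REAL)"
)
for image, well in enumerate(["A13", "P20", "B01"], start=1):
    conn.execute("INSERT INTO Image VALUES ('t', ?, ?)", (image, well))
    for obj in (1, 2):
        conn.execute(
            "INSERT INTO Cells VALUES ('t', ?, ?, ?, ?)",
            (image, obj, 10.0 * image + obj, 250.0),
        )

columns, rows = read_feature_and_well_subset(
    conn, ["Cells_Intensity_IntegratedIntensity_DNA"], ["P20"]
)
print(columns)
print(sorted(rows))
```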


In [33]:
# df_p_s.columns.duplicated()

In [8]:
blackListFeatures

['Nuclei_Correlation_Manders_AGP_DNA',
 'Nuclei_Correlation_Manders_AGP_ER',
 'Nuclei_Correlation_Manders_AGP_Mito',
 'Nuclei_Correlation_Manders_AGP_RNA',
 'Nuclei_Correlation_Manders_DNA_AGP',
 'Nuclei_Correlation_Manders_DNA_ER',
 'Nuclei_Correlation_Manders_DNA_Mito',
 'Nuclei_Correlation_Manders_DNA_RNA',
 'Nuclei_Correlation_Manders_ER_AGP',
 'Nuclei_Correlation_Manders_ER_DNA',
 'Nuclei_Correlation_Manders_ER_Mito',
 'Nuclei_Correlation_Manders_ER_RNA',
 'Nuclei_Correlation_Manders_Mito_AGP',
 'Nuclei_Correlation_Manders_Mito_DNA',
 'Nuclei_Correlation_Manders_Mito_ER',
 'Nuclei_Correlation_Manders_Mito_RNA',
 'Nuclei_Correlation_Manders_RNA_AGP',
 'Nuclei_Correlation_Manders_RNA_DNA',
 'Nuclei_Correlation_Manders_RNA_ER',
 'Nuclei_Correlation_Manders_RNA_Mito',
 'Nuclei_Correlation_RWC_AGP_DNA',
 'Nuclei_Correlation_RWC_AGP_ER',
 'Nuclei_Correlation_RWC_AGP_Mito',
 'Nuclei_Correlation_RWC_AGP_RNA',
 'Nuclei_Correlation_RWC_DNA_AGP',
 'Nuclei_Correlation_RWC_DNA_ER',
 'Nuclei_Co