###  Cell Painting morphological (CP) and L1000 gene expression (GE) profiles for the following datasets:
 
- **CDRP**-BBBC047-Bray-CP-GE (Cell line: U2OS) : 
    * $\bf{CP}$ There are 30,430 unique compounds for CP dataset, median number of replicates --> 4
    * $\bf{GE}$ There are 21,782 unique compounds for GE dataset, median number of replicates --> 3
    * 20,131 compounds are present in both datasets.

- **CDRP-bio**-BBBC036-Bray-CP-GE (Cell line: U2OS) : 
    * $\bf{CP}$ There are 2,242 unique compounds for CP dataset, median number of replicates --> 8
    * $\bf{GE}$ There are 1,917 unique compounds for GE dataset, median number of replicates --> 2
    * 1,916 compounds are present in both datasets.

    
- **LUAD**-BBBC041-Caicedo-CP-GE (Cell line: A549) : 
    * $\bf{CP}$ There are 593 unique alleles for CP dataset, median number of replicates --> 8
    * $\bf{GE}$ There are 529 unique alleles for GE dataset, median number of replicates --> 8
    * 525 alleles are present in both datasets.
    
    
- **TA-ORF**-BBBC037-Rohban-CP-GE (Cell line: U2OS) :
    * $\bf{CP}$ There are 323 unique alleles for CP dataset, median number of replicates --> 5
    * $\bf{GE}$ There are 327 unique alleles for GE dataset, median number of replicates --> 2
    * 150 alleles are present in both datasets.
    
- **LINCS**-Pilot1-CP-GE (Cell line: A549) :
    * $\bf{CP}$ There are 1570 unique compounds across 7 doses for CP dataset, median number of replicates --> 5
    * $\bf{GE}$ There are 1402 unique compounds for GE dataset, median number of replicates --> 3
    * $N_{p/d}$: 6,984 compound-dose combinations are present in both datasets.
--------------------------------------------
 #### Link to the processed profiles:
 
 https://cellpainting-datasets.s3.us-east-1.amazonaws.com/Rosetta-GE-CP
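
The per-dataset counts listed above (unique perturbations, median replicates per perturbation, and the overlap between CP and GE) can be recomputed from replicate-level tables. The following is a minimal sketch on toy data; the helper `summarize_overlap` and the toy values are hypothetical and are not part of this notebook's utilities.

```python
import pandas as pd

def summarize_overlap(cp_df, ge_df, cp_col, ge_col):
    """Count unique perturbations, median replicates, and shared perturbations."""
    cp_counts = cp_df.groupby(cp_col).size()
    ge_counts = ge_df.groupby(ge_col).size()
    shared = set(cp_counts.index) & set(ge_counts.index)
    return {
        "n_cp": int(cp_counts.shape[0]),
        "n_ge": int(ge_counts.shape[0]),
        "median_reps_cp": float(cp_counts.median()),
        "median_reps_ge": float(ge_counts.median()),
        "n_shared": len(shared),
    }

# Toy replicate-level tables (one row per replicate well)
cp_toy = pd.DataFrame({"Metadata_pert_id": ["A", "A", "B", "B", "B", "C"]})
ge_toy = pd.DataFrame({"pert_id": ["A", "B", "B", "D"]})

summary = summarize_overlap(cp_toy, ge_toy, "Metadata_pert_id", "pert_id")
print(summary)
```

For the real datasets, the same call would take the saved replicate-level tables and the perturbation identifier column used for each modality.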

In [None]:
%matplotlib notebook
%load_ext autoreload
%autoreload 2
import numpy as np
import scipy.spatial
import pandas as pd
import sklearn.decomposition
import matplotlib.pyplot as plt
import seaborn as sns
import os
from cmapPy.pandasGEXpress.parse import parse  # required by the .gctx readers used for LUAD, TA-ORF and LINCS
import sys
sys.path.insert(0, '../utils/') 

from replicateCorrs import replicateCorrs
from saveAsNewSheetToExistingFile import saveAsNewSheetToExistingFile,saveDF_to_CSV_GZ_no_timestamp
from importlib import reload

from readProfiles import *
from pred_models import *

from normalize_funcs import standardize_per_catX
# sns.set_style("whitegrid")
# np.__version__
pd.__version__

###  Input / output files:

- **CDRP**-BBBC047-Bray-CP-GE (Cell line: U2OS) : 
    * $\bf{CP}$ 406 plates
        * Input:  s3://imaging-platform/projects/2018_04_20_Rosetta/workspace/raw-profiles/raw-profiles/CDRP
        * Output: /home/ubuntu/datasetsbucket/Rosetta-GE-CP/preprocessed_data/CDRP-BBBC047-Bray/CellPainting/
        
    * $\bf{GE}$ 
        * Input: .mat files that are generated using https://github.com/broadinstitute/2014_wawer_pnas
        * Output: /home/ubuntu/datasetsbucket/Rosetta-GE-CP/preprocessed_data/CDRP-BBBC047-Bray/L1000/

- **CDRP-bio**-BBBC036-Bray-CP-GE (Cell line: U2OS): 
    * $\bf{CP}$ 55 plates
        * Input:  s3://imaging-platform/projects/2018_04_20_Rosetta/workspace/raw-profiles/CDRP-bioactive
        * Output: /home/ubuntu/datasetsbucket/Rosetta-GE-CP/preprocessed_data/CDRPBIO-BBBC036-Bray/CellPainting/
        
    * $\bf{GE}$ Same source as CDRP; only the bioactive subset is selected.
        
- **LUAD**-BBBC041-Caicedo-CP-GE (Cell line: A549) : 
    * $\bf{CP}$ 
        * Input:   /ipgpu/storage/luad/profiles_cp/LUAD-BBBC043-Caicedo/
        * Output:  /home/ubuntu/datasetsbucket/Rosetta-GE-CP/preprocessed_data/LUAD-BBBC041-Caicedo/CellPainting/
        
    * $\bf{GE}$    
        * Input:   s3://imaging-platform/projects/2018_04_20_Rosetta/workspace/raw-profiles/l1000_LUAD/
        * Output:  /home/ubuntu/datasetsbucket/Rosetta-GE-CP/preprocessed_data/LUAD-BBBC041-Caicedo/L1000/
    
- **TA-ORF**-BBBC037-Rohban-CP-GE (Cell line: U2OS) :
    * $\bf{CP}$ 
        * Input:   s3://imaging-platform/projects/2018_04_20_Rosetta/workspace/raw-profiles/TA-ORF-BBBC037-Rohban/
        * Output:  /home/ubuntu/datasetsbucket/Rosetta-GE-CP/preprocessed_data/TA-ORF-BBBC037-Rohban/CellPainting/
        
    * $\bf{GE}$ 
        * Input:  s3://imaging-platform/projects/2018_04_20_Rosetta/workspace/raw-profiles/l1000_TA_ORF/
                - https://data.broadinstitute.org/icmap/custom/TA/brew/pc/TA.OE005_U2OS_72H/
        * Output:  /home/ubuntu/datasetsbucket/Rosetta-GE-CP/preprocessed_data/TA-ORF-BBBC037-Rohban/L1000/
        
        
        
- **LINCS**-Pilot1-CP-GE (Cell line: A549) :
    * $\bf{CP}$ 
        * Input:   s3://cellpainting-datasets/lincs-cell-painting/profiles/2016_04_01_a549_48hr_batch1
        * Output:  /home/ubuntu/datasetsbucket/Rosetta-GE-CP/preprocessed_data/LINCS-Pilot1/CellPainting/
        
    * $\bf{GE}$ 
        * Input:  s3://imaging-platform/projects/2018_04_20_Rosetta/workspace/raw-profiles/l1000_LINCS
        * Original location at internal servers:        
            - GCTX matrices: /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/collated/2016_04_01_a549_48hr_batch1_L1000/
            - metadata: /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/metadata/2016_04_01_a549_48hr_batch1_L1000/
        * Output:  /home/ubuntu/datasetsbucket/Rosetta-GE-CP/preprocessed_data/LINCS-Pilot1/L1000/
        

In [None]:
fileName='RepCorrDF'
### dirs on gpu cluster
# rawProf_dir='/storage/data/marziehhaghighi/Rosetta/raw-profiles/'
# procProf_dir='/home/marziehhaghighi/workspace_rosetta/workspace/'

### dirs on ec2
rawProf_dir='/home/ubuntu/bucket/projects/2018_04_20_Rosetta/workspace/raw-profiles/'
# procProf_dir='./'
procProf_dir='/home/ubuntu/bucket/projects/2018_04_20_Rosetta/workspace/'
# procProf_dir='/home/ubuntu/datasetsbucket/Rosetta-GE-CP/'
# s3://imaging-platform/projects/2018_04_20_Rosetta/workspace/preprocessed_data
# aws s3 sync preprocessed_data s3://cellpainting-datasets/Rosetta-GE-CP/preprocessed_data --profile jumpcpuser

filename='../../results/RepCor/'+fileName+'.xlsx'


In [None]:
# ls ../../
# https://cellpainting-datasets.s3.us-east-1.amazonaws.com/

# CDRP-BBBC047-Bray

### GE - L1000 - CDRP

In [None]:
os.listdir(rawProf_dir+'/l1000_CDRP/')

In [None]:
cdrp_dataDir=rawProf_dir+'/l1000_CDRP/'
cpd_info = pd.read_csv(cdrp_dataDir+"/compounds.txt", sep="\t", dtype=str)
cpd_info.columns

In [None]:
from scipy.io import loadmat
x = loadmat(cdrp_dataDir+'cdrp.all.prof.mat')

k1=x['metaWell']['pert_id'][0][0]
k2=x['metaGen']['AFFX_PROBE_ID'][0][0]
k3=x['metaWell']['pert_dose'][0][0]
k4=x['metaWell']['det_plate'][0][0]
# pert_dose
# x['metaWell']['pert_id'][0][0][0][0][0]
pertID = []
probID=[]
for r in range(len(k1)):
    v = k1[r][0][0]
    pertID.append(v)
#     probID.append(k2[r][0][0])

for r in range(len(k2)):
    probID.append(k2[r][0][0])
    
pert_dose=[]
det_plate=[]
for r in range(len(k3)):
    pert_dose.append(k3[r][0])
    det_plate.append(k4[r][0][0]) 
    
dataArray=x['pclfc'];
cdrp_l1k_rep = pd.DataFrame(data=dataArray,columns=probID)
cdrp_l1k_rep['pert_id']=pertID
cdrp_l1k_rep['pert_dose']=pert_dose
cdrp_l1k_rep['det_plate']=det_plate
cdrp_l1k_rep['BROAD_CPD_ID']=cdrp_l1k_rep['pert_id'].str[:13]
cdrp_l1k_rep2=pd.merge(cdrp_l1k_rep, cpd_info, how='left',on=['BROAD_CPD_ID'])
l1k_features_cdrp=cdrp_l1k_rep2.columns[cdrp_l1k_rep2.columns.str.contains("_at")]
cdrp_l1k_rep2['pert_id_dose']=cdrp_l1k_rep2['BROAD_CPD_ID']+'_'+cdrp_l1k_rep2['pert_dose'].round(2).astype(str)
cdrp_l1k_rep2['pert_sample_dose']=cdrp_l1k_rep2['pert_id']+'_'+cdrp_l1k_rep2['pert_dose'].round(2).astype(str)

# cdrp_l1k_df.head()
print(cpd_info.shape,cdrp_l1k_rep.shape,cdrp_l1k_rep2.shape)

cdrp_l1k_rep2['pert_id_dose']=cdrp_l1k_rep2['pert_id_dose'].replace('DMSO_-666.0', 'DMSO')
cdrp_l1k_rep2['pert_sample_dose']=cdrp_l1k_rep2['pert_sample_dose'].replace('DMSO_-666.0', 'DMSO')

saveDF_to_CSV_GZ_no_timestamp(cdrp_l1k_rep2,procProf_dir+'preprocessed_data/CDRP-BBBC047-Bray/L1000/replicate_level_l1k.csv.gz');
# cdrp_l1k_rep2.head()
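
The loops in the cell above unwrap the nested arrays that `loadmat` produces for MATLAB cell arrays: each entry is a one-element array inside an `(n, 1)` object array. A minimal sketch of the same unwrapping on toy data follows; `unwrap_cell_strings` and the toy identifiers are hypothetical, and the toy arrays only imitate the layout assumed by the indexing `k1[r][0][0]`.

```python
import numpy as np

def unwrap_cell_strings(cells):
    """Flatten an (n, 1) object array of one-element arrays into a list of str."""
    return [str(cell[0][0]) for cell in cells]

# Toy stand-in for x['metaWell']['pert_id'][0][0]
toy_ids = np.empty((3, 1), dtype=object)
for i, name in enumerate(["BRD-K00000001-001-01-1", "BRD-K00000002-001-01-1", "DMSO"]):
    toy_ids[i, 0] = np.array([name])

# Toy stand-in for a numeric field such as pert_dose, shape (n, 1)
toy_dose = np.array([[5.0], [2.5], [-666.0]])

pert_ids = unwrap_cell_strings(toy_ids)
doses = toy_dose[:, 0].tolist()  # numeric fields only need the column sliced out
print(pert_ids, doses)
```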

In [None]:
# cpd_info

### CP - CDRP

In [None]:
profileType=['_augmented','_normalized']

bioactiveFlag="";# either "-bioactive" or ""

plates=os.listdir(rawProf_dir+'/CDRP'+bioactiveFlag+'/')
for pt in profileType[1:2]:
    repLevelCDRP0=[]
    for p in plates:
#         repLevelCDRP0.append(pd.read_csv(rawProf_dir+'/CDRP/'+p+'/'+p+pt+'.csv'))
        repLevelCDRP0.append(pd.read_csv(rawProf_dir+'/CDRP'+bioactiveFlag+'/'+p+'/'+p+pt+'.csv')) #if bioactive
    repLevelCDRP = pd.concat(repLevelCDRP0)
    metaCDRP1=pd.read_csv(rawProf_dir+'/CP_CDRP/metadata/metadata_CDRP.csv')
    # metaCDRP1=metaCDRP1.rename(columns={"PlateName":"Metadata_Plate_Map_Name",'Well':'Metadata_Well'})
    # metaCDRP1['Metadata_Well']=metaCDRP1['Metadata_Well'].str.lower()
    repLevelCDRP2=pd.merge(repLevelCDRP, metaCDRP1, how='left',on=['Metadata_broad_sample'])
#     repLevelCDRP2['Metadata_Sample_Dose']=repLevelCDRP2['Metadata_broad_sample']+'_'+repLevelCDRP2['Metadata_mmoles_per_liter'].round(0).astype(int).astype(str)
#     repLevelCDRP2['Metadata_Sample_Dose']=repLevelCDRP2['Metadata_pert_id']+'_'+(repLevelCDRP2['Metadata_mmoles_per_liter']*2).round(0).astype(int).astype(str)
    repLevelCDRP2["Metadata_mmoles_per_liter2"]=(repLevelCDRP2["Metadata_mmoles_per_liter"]*2).round(2)
    repLevelCDRP2['Metadata_Sample_Dose']=repLevelCDRP2['Metadata_broad_sample']+'_'+repLevelCDRP2['Metadata_mmoles_per_liter2'].astype(str)

    repLevelCDRP2['Metadata_Sample_Dose']=repLevelCDRP2['Metadata_Sample_Dose'].replace('DMSO_0.0', 'DMSO')
    repLevelCDRP2['Metadata_pert_id']=repLevelCDRP2['Metadata_pert_id'].replace(np.nan, 'DMSO')
    
#     repLevelCDRP2.to_csv(procProf_dir+'preprocessed_data/CDRPBIO-BBBC036-Bray/CellPainting/replicate_level_cp'+pt+'.csv.gz',index=False,compression='gzip')

    if bioactiveFlag:
        dataFolderName='CDRPBIO-BBBC036-Bray'
        saveDF_to_CSV_GZ_no_timestamp(repLevelCDRP2,procProf_dir+'preprocessed_data/'+dataFolderName+\
                                      '/CellPainting/replicate_level_cp'+pt+'.csv.gz')
    else:
        dataFolderName='CDRP-BBBC047-Bray'
        saveDF_to_CSV_GZ_no_timestamp(repLevelCDRP2,procProf_dir+'preprocessed_data/'+dataFolderName+\
                                      '/CellPainting/replicate_level_cp'+pt+'.csv.gz')

    print(metaCDRP1.shape,repLevelCDRP.shape,repLevelCDRP2.shape)
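
The GE and CP cells above each build a sample-plus-dose key (`pert_sample_dose` and `Metadata_Sample_Dose`) so that the two modalities can later be matched. The sketch below reproduces both constructions on toy rows to show that the keys line up. The identifiers and doses are made up, and the factor of 2 on the CP concentration is copied from the notebook code without knowing its rationale.

```python
import pandas as pd

# GE side: pert_id + '_' + rounded dose, with the DMSO sentinel dose collapsed to 'DMSO'
ge = pd.DataFrame({
    "pert_id": ["BRD-K00000001-001-01-1", "DMSO"],
    "pert_dose": [5.0, -666.0],
})
ge["pert_sample_dose"] = ge["pert_id"] + "_" + ge["pert_dose"].round(2).astype(str)
ge["pert_sample_dose"] = ge["pert_sample_dose"].replace("DMSO_-666.0", "DMSO")

# CP side: broad_sample + '_' + (2 * concentration), as in the notebook
cp = pd.DataFrame({
    "Metadata_broad_sample": ["BRD-K00000001-001-01-1", "DMSO"],
    "Metadata_mmoles_per_liter": [2.5, 0.0],
})
cp["Metadata_mmoles_per_liter2"] = (cp["Metadata_mmoles_per_liter"] * 2).round(2)
cp["Metadata_Sample_Dose"] = (
    cp["Metadata_broad_sample"] + "_" + cp["Metadata_mmoles_per_liter2"].astype(str)
)
cp["Metadata_Sample_Dose"] = cp["Metadata_Sample_Dose"].replace("DMSO_0.0", "DMSO")

shared_keys = sorted(set(ge["pert_sample_dose"]) & set(cp["Metadata_Sample_Dose"]))
print(shared_keys)
```

Because both keys are strings, the match depends on the float formatting being identical on both sides, which is why both constructions round before converting to `str`.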

In [None]:
dataFolderName='CDRP-BBBC047-Bray'
cp_feats=repLevelCDRP.columns[repLevelCDRP.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")].tolist()
features_to_remove =find_correlation(repLevelCDRP2[cp_feats], threshold=0.9, remove_negative=False)
repLevelCDRP2_var_sel=repLevelCDRP2.drop(columns=features_to_remove)
saveDF_to_CSV_GZ_no_timestamp(repLevelCDRP2_var_sel,procProf_dir+'preprocessed_data/'+dataFolderName+\
                                  '/CellPainting/replicate_level_cp'+'_normalized_variable_selected'+'.csv.gz')
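
The cell above removes redundant features with `find_correlation`, which is imported from the project's utilities and is not shown in this notebook. As an illustration of the general idea only, the sketch below drops one feature from every pair whose absolute Pearson correlation exceeds a threshold. `drop_correlated_sketch` is a hypothetical stand-in and may differ from the real helper in which member of a pair it removes.

```python
import numpy as np
import pandas as pd

def drop_correlated_sketch(df, threshold=0.9):
    """Return columns that correlate above `threshold` with an earlier column."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

toy = pd.DataFrame({
    "Cells_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Cells_b": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with Cells_a
    "Nuclei_c": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly correlated with both
})

to_remove = drop_correlated_sketch(toy, threshold=0.9)
reduced = toy.drop(columns=to_remove)
print(to_remove, reduced.columns.tolist())
```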

In [None]:
# features_to_remove

In [None]:
repLevelCDRP2['Nuclei_Texture_Variance_RNA_3_0']

In [None]:
# repLevelCDRP2.shape
# cp_scaled.columns[cp_scaled.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")].tolist()

# CDRP-bio-BBBC036-Bray

### GE - L1000 - CDRPBIO

In [None]:
bioactiveFlag="-bioactive";# either "-bioactive" or ""
plates=os.listdir(rawProf_dir+'/CDRP'+bioactiveFlag+'/')

In [None]:
# plates

In [None]:
cdrp_l1k_rep2_bioactive=cdrp_l1k_rep2[cdrp_l1k_rep2["pert_sample_dose"].isin(repLevelCDRP2.Metadata_Sample_Dose.unique().tolist())]


In [None]:
cdrp_l1k_rep.det_plate

### CP - CDRPBIO

In [None]:
profileType=['_augmented','_normalized','_normalized_variable_selected']

bioactiveFlag="-bioactive";# either "-bioactive" or ""

plates=os.listdir(rawProf_dir+'/CDRP'+bioactiveFlag+'/')
for pt in profileType:
    repLevelCDRP0=[]
    for p in plates:
#         repLevelCDRP0.append(pd.read_csv(rawProf_dir+'/CDRP/'+p+'/'+p+pt+'.csv'))
        repLevelCDRP0.append(pd.read_csv(rawProf_dir+'/CDRP'+bioactiveFlag+'/'+p+'/'+p+pt+'.csv')) #if bioactive
    repLevelCDRP = pd.concat(repLevelCDRP0,ignore_index=True)
    metaCDRP1=pd.read_csv(rawProf_dir+'/CP_CDRP/metadata/metadata_CDRP.csv')
    # metaCDRP1=metaCDRP1.rename(columns={"PlateName":"Metadata_Plate_Map_Name",'Well':'Metadata_Well'})
    # metaCDRP1['Metadata_Well']=metaCDRP1['Metadata_Well'].str.lower()
    repLevelCDRP2=pd.merge(repLevelCDRP, metaCDRP1, how='left',on=['Metadata_broad_sample'])
#     repLevelCDRP2['Metadata_Sample_Dose']=repLevelCDRP2['Metadata_broad_sample']+'_'+repLevelCDRP2['Metadata_mmoles_per_liter'].round(0).astype(int).astype(str)
#     repLevelCDRP2['Metadata_Sample_Dose']=repLevelCDRP2['Metadata_pert_id']+'_'+(repLevelCDRP2['Metadata_mmoles_per_liter']*2).round(0).astype(int).astype(str)
    repLevelCDRP2["Metadata_mmoles_per_liter2"]=(repLevelCDRP2["Metadata_mmoles_per_liter"]*2).round(2)
    repLevelCDRP2['Metadata_Sample_Dose']=repLevelCDRP2['Metadata_broad_sample']+'_'+repLevelCDRP2['Metadata_mmoles_per_liter2'].astype(str)

    repLevelCDRP2['Metadata_Sample_Dose']=repLevelCDRP2['Metadata_Sample_Dose'].replace('DMSO_0.0', 'DMSO')
    repLevelCDRP2['Metadata_pert_id']=repLevelCDRP2['Metadata_pert_id'].replace(np.nan, 'DMSO')
    
#     repLevelCDRP2.to_csv(procProf_dir+'preprocessed_data/CDRPBIO-BBBC036-Bray/CellPainting/replicate_level_cp'+pt+'.csv.gz',index=False,compression='gzip')

    if bioactiveFlag:
        dataFolderName='CDRPBIO-BBBC036-Bray'
        saveDF_to_CSV_GZ_no_timestamp(repLevelCDRP2,procProf_dir+'preprocessed_data/'+dataFolderName+\
                                      '/CellPainting/replicate_level_cp'+pt+'.csv.gz')
    else:
        dataFolderName='CDRP-BBBC047-Bray'
        saveDF_to_CSV_GZ_no_timestamp(repLevelCDRP2,procProf_dir+'preprocessed_data/'+dataFolderName+\
                                      '/CellPainting/replicate_level_cp'+pt+'.csv.gz')

    print(metaCDRP1.shape,repLevelCDRP.shape,repLevelCDRP2.shape)

In [None]:
metaCDRP1.shape,repLevelCDRP.shape,repLevelCDRP2.shape

In [None]:
# repLevelCDRP
metaCDRP1.groupby(["Metadata_broad_sample"]).size().sort_values()

In [None]:
# metaCDRP1[metaCDRP1["Metadata_broad_sample"].isin(["BRD-A97701745-001-04-6","BRD-K73323637-003-03-4"])]

In [None]:
# repLevelCDRP2
# repLevelCDRP = pd.concat(repLevelCDRP0,ignore_index=True)
# repLevelCDRP
# repLevelCDRP2=pd.merge(repLevelCDRP, metaCDRP1, how='left',on=['Metadata_broad_sample'])
# repLevelCDRP2

# LUAD-BBBC041-Caicedo

### GE - L1000 - LUAD

In [None]:
os.listdir(rawProf_dir+'/l1000_LUAD/input/')

In [None]:
os.listdir(rawProf_dir+'/l1000_LUAD/output/')

In [None]:
luad_dataDir=rawProf_dir+'/l1000_LUAD/'
luad_info1 = pd.read_csv(luad_dataDir+"/input/TA.OE014_A549_96H.map", sep="\t", dtype=str)
luad_info2 = pd.read_csv(luad_dataDir+"/input/TA.OE015_A549_96H.map", sep="\t", dtype=str)
luad_info=pd.concat([luad_info1, luad_info2], ignore_index=True)
luad_info.head()

In [None]:
luad_l1k_df = parse(luad_dataDir+"/output/high_rep_A549_8reps_141230_ZSPCINF_n4232x978.gctx").data_df.T.reset_index()
luad_l1k_df=luad_l1k_df.rename(columns={"cid":"id"})
# cdrp_l1k_df['XX']=cdrp_l1k_df['cid'].str[0]
# cdrp_l1k_df['BROAD_CPD_ID']=cdrp_l1k_df['cid'].str[2:15]
luad_l1k_df2=pd.merge(luad_l1k_df, luad_info, how='inner',on=['id'])
luad_l1k_df2=luad_l1k_df2.rename(columns={"x_mutation_status":"allele"})

l1k_features=luad_l1k_df2.columns[luad_l1k_df2.columns.str.contains("_at")]
luad_l1k_df2['allele']=luad_l1k_df2['allele'].replace('UnTrt', 'DMSO')
print(luad_info.shape,luad_l1k_df.shape,luad_l1k_df2.shape)
saveDF_to_CSV_GZ_no_timestamp(luad_l1k_df2,procProf_dir+'/preprocessed_data/LUAD-BBBC041-Caicedo/L1000/replicate_level_l1k.csv.gz')

In [None]:
luad_l1k_df_scaled = standardize_per_catX(luad_l1k_df2,'det_plate',l1k_features.tolist());
x_l1k_luad=replicateCorrs(luad_l1k_df_scaled.reset_index(drop=True),'allele',l1k_features,1)
# x_l1k_luad=replicateCorrs(luad_l1k_df2[luad_l1k_df2['allele']!='DMSO'].reset_index(drop=True),'allele',l1k_features,1)
# saveAsNewSheetToExistingFile(filename,x_l1k_luad[2],'l1k-luad')

### CP - LUAD

In [None]:
profileType=['_augmented','_normalized','_normalized_variable_selected']
plates=os.listdir('/storage/luad/profiles_cp/LUAD-BBBC043-Caicedo/')
for pt in profileType[1:2]:
    repLevelLuad0=[]
    for p in plates:
        repLevelLuad0.append(pd.read_csv('/storage/luad/profiles_cp/LUAD-BBBC043-Caicedo/'+p+'/'+p+pt+'.csv'))
    repLevelLuad = pd.concat(repLevelLuad0)
    metaLuad1=pd.read_csv(rawProf_dir+'/CP_LUAD/metadata/combined_platemaps_AHB_20150506_ssedits.csv')
    metaLuad1=metaLuad1.rename(columns={"PlateName":"Metadata_Plate_Map_Name",'Well':'Metadata_Well'})
    metaLuad1['Metadata_Well']=metaLuad1['Metadata_Well'].str.lower()
    # metaLuad2=pd.read_csv('~/workspace_rosetta/workspace/raw_profiles/CP_LUAD/metadata/barcode_platemap.csv')
    # Y[Y['Metadata_Well']=='g05']['Nuclei_Texture_Variance_Mito_5_0']
    repLevelLuad2=pd.merge(repLevelLuad, metaLuad1, how='inner',on=['Metadata_Plate_Map_Name','Metadata_Well'])
    repLevelLuad2['x_mutation_status']=repLevelLuad2['x_mutation_status'].replace(np.nan, 'DMSO')
    cp_features=repLevelLuad2.columns[repLevelLuad2.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")]
#     repLevelLuad2.to_csv(procProf_dir+'preprocessed_data/LUAD-BBBC041-Caicedo/CellPainting/replicate_level_cp'+pt+'.csv.gz',index=False,compression='gzip')
    saveDF_to_CSV_GZ_no_timestamp(repLevelLuad2,procProf_dir+'preprocessed_data/LUAD-BBBC041-Caicedo/CellPainting/replicate_level_cp'+pt+'.csv.gz')
    print(metaLuad1.shape,repLevelLuad.shape,repLevelLuad2.shape)    

In [None]:
pt=['_normalized']
# Read saved data
repLevelLuad2=pd.read_csv('./preprocessed_data/LUAD-BBBC041-Caicedo/CellPainting/replicate_level_cp'+pt[0]+'.csv.gz')

# repLevelTA.head()
cp_features=repLevelLuad2.columns[repLevelLuad2.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")]
cols2remove0=[i for i in cp_features if ((repLevelLuad2[i].isnull()).sum(axis=0)/repLevelLuad2.shape[0])>0.05]
print(cols2remove0)
repLevelLuad2=repLevelLuad2.drop(cols2remove0, axis=1);
cp_features=repLevelLuad2.columns[repLevelLuad2.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")]
repLevelLuad2[cp_features] = repLevelLuad2[cp_features].interpolate()  # interpolate CP feature columns only, not metadata
repLevelLuad2 = standardize_per_catX(repLevelLuad2,'Metadata_Plate',cp_features.tolist());
df1=repLevelLuad2[~repLevelLuad2['x_mutation_status'].isnull()].reset_index(drop=True)
x_cp_luad=replicateCorrs(df1,'x_mutation_status',cp_features,1)
saveAsNewSheetToExistingFile(filename,x_cp_luad[2],'cp-luad')
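
The cell above standardizes features within each plate through `standardize_per_catX`, a helper from the project's utilities whose implementation is not shown here. A minimal sketch of per-group z-scoring with pandas follows. `standardize_per_group_sketch` is hypothetical, and it uses the sample standard deviation (`ddof=1`), which may not match the real helper.

```python
import numpy as np
import pandas as pd

def standardize_per_group_sketch(df, group_col, feature_cols):
    """Z-score each feature column separately within every group of `group_col`."""
    out = df.copy()
    grouped = out.groupby(group_col)[feature_cols]
    out[feature_cols] = (out[feature_cols] - grouped.transform("mean")) / grouped.transform("std")
    return out

toy = pd.DataFrame({
    "Metadata_Plate": ["P1", "P1", "P1", "P2", "P2", "P2"],
    "Cells_AreaShape_Area": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

scaled = standardize_per_group_sketch(toy, "Metadata_Plate", ["Cells_AreaShape_Area"])
print(scaled)
```

After scaling, the two toy plates have the same values even though their raw ranges differ tenfold, which is the plate-effect removal the step is meant to provide.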

# TA-ORF-BBBC037-Rohban

### GE - L1000 - TAORF

In [None]:
taorf_datadir=rawProf_dir+'/l1000_TA_ORF/'
gene_info = pd.read_csv(taorf_datadir+"TA.OE005_U2OS_72H.map.txt", sep="\t", dtype=str)
gene_info

In [None]:
taorf_datadir=rawProf_dir+'/l1000_TA_ORF/'
gene_info = pd.read_csv(taorf_datadir+"TA.OE005_U2OS_72H.map.txt", sep="\t", dtype=str)
# gene_info.columns
# TA.OE005_U2OS_72H_INF_n729x22268.gctx
# TA.OE005_U2OS_72H_QNORM_n729x978.gctx
# TA.OE005_U2OS_72H_ZSPCINF_n729x22268.gctx
# TA.OE005_U2OS_72H_ZSPCQNORM_n729x978.gctx
taorf_l1k0 = parse(taorf_datadir+"TA.OE005_U2OS_72H_ZSPCQNORM_n729x978.gctx")
# taorf_l1k0 = parse(taorf_datadir+"TA.OE005_U2OS_72H_QNORM_n729x978.gctx")
taorf_l1k_df0=taorf_l1k0.data_df
taorf_l1k_df=taorf_l1k_df0.T.reset_index()
l1k_features=taorf_l1k_df.columns[taorf_l1k_df.columns.str.contains("_at")]
taorf_l1k_df=taorf_l1k_df.rename(columns={"cid":"id"})
taorf_l1k_df2=pd.merge(taorf_l1k_df, gene_info, how='inner',on=['id'])
# print(taorf_l1k_df.shape,gene_info.shape,taorf_l1k_df2.shape)
taorf_l1k_df2.head()
# x_genesymbol_mutation
taorf_l1k_df2['pert_id']=taorf_l1k_df2['pert_id'].replace('CMAP-000', 'DMSO')
# compression_opts = dict(method='zip',archive_name='out.csv')  
# taorf_l1k_df2.to_csv(procProf_dir+'preprocessed_data/TA-ORF-BBBC037-Rohban/L1000/replicate_level_l1k.csv.gz',index=False,compression=compression_opts)
saveDF_to_CSV_GZ_no_timestamp(taorf_l1k_df2,procProf_dir+'preprocessed_data/TA-ORF-BBBC037-Rohban/L1000/replicate_level_l1k.csv.gz')
print(gene_info.shape,taorf_l1k_df.shape,taorf_l1k_df2.shape)
# gene_info.head()

In [None]:
taorf_l1k_df2.groupby(['x_genesymbol_mutation']).size().describe()

In [None]:
taorf_l1k_df2.groupby(['pert_id']).size().describe()

#### Check Replicate Correlation

In [None]:
# df1=taorf_l1k_df2[taorf_l1k_df2['pert_id']!='CMAP-000']

df1_scaled = standardize_per_catX(taorf_l1k_df2,'det_plate',l1k_features.tolist());
df1_scaled2=df1_scaled[df1_scaled['pert_id']!='DMSO']
x=replicateCorrs(df1_scaled2,'pert_id',l1k_features,1)
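
`replicateCorrs` comes from the project's utilities and its internals are not shown in this notebook. To make the idea of a replicate-correlation check concrete, the sketch below computes, for each perturbation, the median pairwise Pearson correlation between its replicate profiles. `replicate_corr_sketch` is a hypothetical simplification and does not reproduce the real function's return values or plots.

```python
import numpy as np
import pandas as pd

def replicate_corr_sketch(df, pert_col, feature_cols):
    """Median pairwise Pearson correlation among replicates of each perturbation."""
    scores = {}
    for pert, grp in df.groupby(pert_col):
        if len(grp) < 2:
            continue  # a single replicate has no pair to correlate
        corr = np.corrcoef(grp[feature_cols].to_numpy(dtype=float))
        upper = corr[np.triu_indices_from(corr, k=1)]
        scores[pert] = float(np.median(upper))
    return pd.Series(scores, name="median_replicate_corr")

toy = pd.DataFrame({
    "pert_id": ["A", "A", "B", "B", "C"],
    "f1_at": [1.0, 2.0, 1.0, 3.0, 0.0],
    "f2_at": [2.0, 4.0, 2.0, 2.0, 1.0],
    "f3_at": [3.0, 6.0, 3.0, 1.0, 5.0],
})

rep_corr = replicate_corr_sketch(toy, "pert_id", ["f1_at", "f2_at", "f3_at"])
print(rep_corr)
```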

### CP - TAORF

In [None]:
profileType=['_augmented','_normalized','_normalized_variable_selected']
plates=os.listdir(rawProf_dir+'TA-ORF-BBBC037-Rohban/')
for pt in profileType[0:1]:
    repLevelTA0=[]
    for p in plates:
        repLevelTA0.append(pd.read_csv(rawProf_dir+'TA-ORF-BBBC037-Rohban/'+p+'/'+p+pt+'.csv'))
    repLevelTA = pd.concat(repLevelTA0)
    metaTA1=pd.read_csv(rawProf_dir+'/CP_TA_ORF/metadata/metadata_TA.csv')
    metaTA2=pd.read_csv(rawProf_dir+'/CP_TA_ORF/metadata/metadata_TA_2.csv')
#     metaTA2=metaTA2.rename(columns={"Metadata_broad_sample":"Metadata_broad_sample_2",'Metadata_Treatment':'Gene Allele Name'})
    metaTA=pd.merge(metaTA2, metaTA1, how='left',on=['Metadata_broad_sample'])
#     metaTA2=metaTA2.rename(columns={"Metadata_Treatment":"Metadata_pert_name"})
#     repLevelTA2=pd.merge(repLevelTA, metaTA2, how='left',on=['Metadata_pert_name'])
    repLevelTA2=pd.merge(repLevelTA, metaTA, how='left',on=['Metadata_broad_sample'])

#     repLevelTA2=repLevelTA2.rename(columns={"Gene Allele Name":"Allele"})
    repLevelTA2['Metadata_broad_sample']=repLevelTA2['Metadata_broad_sample'].replace(np.nan, 'DMSO')
    saveDF_to_CSV_GZ_no_timestamp(repLevelTA2,procProf_dir+'/preprocessed_data/TA-ORF-BBBC037-Rohban/CellPainting/replicate_level_cp'+pt+'.csv.gz')
    print(metaTA.shape,repLevelTA.shape,repLevelTA2.shape)


In [None]:
# repLevelTA.head()
cp_features=repLevelTA2.columns[repLevelTA2.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")]
cols2remove0=[i for i in cp_features if ((repLevelTA2[i].isnull()).sum(axis=0)/repLevelTA2.shape[0])>0.05]
print(cols2remove0)
repLevelTA2=repLevelTA2.drop(cols2remove0, axis=1);
# cp_features=list(set(cp_features)-set(cols2remove0))
# repLevelTA2=repLevelTA2.replace('nan', np.nan)
cp_features=repLevelTA2.columns[repLevelTA2.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")]
repLevelTA2[cp_features] = repLevelTA2[cp_features].interpolate()  # interpolate CP feature columns only, not metadata
repLevelTA2 = standardize_per_catX(repLevelTA2,'Metadata_Plate',cp_features.tolist());
df1=repLevelTA2[~repLevelTA2['Metadata_broad_sample'].isnull()].reset_index(drop=True)
x_taorf_cp=replicateCorrs(df1,'Metadata_broad_sample',cp_features,1)
# saveAsNewSheetToExistingFile(filename,x_taorf_cp[2],'cp-taorf')
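
The cleaning steps in the cell above (drop features with more than 5% missing values, then interpolate the remaining gaps) can be checked on a small table. The sketch below uses toy data and hypothetical feature names; it restricts interpolation to the feature columns.

```python
import numpy as np
import pandas as pd

n = 40
toy = pd.DataFrame({
    "Metadata_Plate": ["P1"] * n,
    "Cells_A": np.arange(n, dtype=float),
    "Nuclei_B": np.arange(n, dtype=float),
})
toy.loc[10, "Cells_A"] = np.nan               # 1/40 = 2.5% missing, kept
toy.loc[[5, 6, 20, 30], "Nuclei_B"] = np.nan  # 4/40 = 10% missing, dropped

feature_cols = toy.columns[toy.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")]
null_fraction = toy[feature_cols].isnull().mean()
cols_to_drop = null_fraction[null_fraction > 0.05].index.tolist()

cleaned = toy.drop(columns=cols_to_drop)
kept = [c for c in feature_cols if c not in cols_to_drop]
cleaned[kept] = cleaned[kept].interpolate()   # linear fill along the row order
print(cols_to_drop, cleaned.loc[10, "Cells_A"])
```

Note that linear interpolation fills a gap from the neighbouring rows, so the result depends on row order in the table.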

In [None]:
# plates

# LINCS-Pilot1

### GE - L1000 - LINCS

In [None]:
os.listdir(rawProf_dir+'/l1000_LINCS/2016_04_01_a549_48hr_batch1_L1000/')

In [None]:
os.listdir(rawProf_dir+'/l1000_LINCS/metadata/')

In [None]:
data_meta_match_ls=[['level_3','level_3_q2norm_n27837x978.gctx','col_meta_level_3_REP.A_A549_only_n27837.txt'],
                   ['level_4W','level_4W_zspc_n27837x978.gctx','col_meta_level_3_REP.A_A549_only_n27837.txt'],
                   ['level_4','level_4_zspc_n27837x978.gctx','col_meta_level_3_REP.A_A549_only_n27837.txt'],
                   ['level_5_modz','level_5_modz_n9482x978.gctx','col_meta_level_5_REP.A_A549_only_n9482.txt'],
                   ['level_5_rank','level_5_rank_n9482x978.gctx','col_meta_level_5_REP.A_A549_only_n9482.txt']]

In [None]:
lincs_dataDir=rawProf_dir+'/l1000_LINCS/'
lincs_pert_info = pd.read_csv(lincs_dataDir+"/metadata/REP.A_A549_pert_info.txt", sep="\t", dtype=str)
lincs_meta_level3 = pd.read_csv(lincs_dataDir+"/metadata/col_meta_level_3_REP.A_A549_only_n27837.txt", sep="\t", dtype=str)
# lincs_info1 = pd.read_csv(lincs_dataDir+"/metadata/REP.A_A549_pert_info.txt", sep="\t", dtype=str)
print(lincs_meta_level3.shape)
lincs_meta_level3.head()
# lincs_info2 = pd.read_csv(lincs_dataDir+"/input/TA.OE015_A549_96H.map", sep="\t", dtype=str)
# lincs_info=pd.concat([lincs_info1, lincs_info2], ignore_index=True)
# lincs_info.head()

In [None]:
# lincs_meta_level3.groupby('distil_id').size()
lincs_meta_level3['distil_id'].unique().shape

In [None]:
# lincs_meta_level3.columns.tolist()
# lincs_meta_level3.pert_id

In [None]:
ls /home/ubuntu/workspace_rosetta/workspace/software/2018_04_20_Rosetta/preprocessed_data/LINCS-Pilot1/CellPainting

In [None]:
# procProf_dir+'preprocessed_data/LINCS-Pilot1/'
procProf_dir

In [None]:
for el in data_meta_match_ls:
    lincs_l1k_df=parse(lincs_dataDir+"/2016_04_01_a549_48hr_batch1_L1000/"+el[1]).data_df.T.reset_index()
    lincs_meta0 = pd.read_csv(lincs_dataDir+"/metadata/"+el[2], sep="\t", dtype=str)
    lincs_meta=pd.merge(lincs_meta0, lincs_pert_info, how='left',on=['pert_id'])
    lincs_meta=lincs_meta.rename(columns={"distil_id":"cid"})
    lincs_l1k_df2=pd.merge(lincs_l1k_df, lincs_meta, how='inner',on=['cid'])
    lincs_l1k_df2['pert_id_dose']=lincs_l1k_df2['pert_id']+'_'+lincs_l1k_df2['nearest_dose'].astype(str)
    lincs_l1k_df2['pert_id_dose']=lincs_l1k_df2['pert_id_dose'].replace('DMSO_-666', 'DMSO')
#     lincs_l1k_df2.to_csv(procProf_dir+'preprocessed_data/LINCS-Pilot1/L1000/'+el[0]+'.csv.gz',index=False,compression='gzip')
    saveDF_to_CSV_GZ_no_timestamp(lincs_l1k_df2,procProf_dir+'preprocessed_data/LINCS-Pilot1/L1000/'+el[0]+'.csv.gz')

In [None]:
# lincs_l1k_df2

In [None]:
lincs_l1k_rep['pert_id_dose'].unique()

In [None]:
lincs_l1k_rep = pd.read_csv(procProf_dir+'preprocessed_data/LINCS-Pilot1/L1000/'+data_meta_match_ls[1][0]+'.csv.gz')
# l1k_features=lincs_l1k_rep.columns[lincs_l1k_rep.columns.str.contains("_at")]
# x=replicateCorrs(lincs_l1k_rep[lincs_l1k_rep['pert_iname_x']!='DMSO'].reset_index(drop=True),'pert_id',l1k_features,1)
# # saveAsNewSheetToExistingFile(filename,x[2],'l1k-lincs')
# # lincs_l1k_rep.head()

In [None]:
lincs_l1k_rep.pert_id.unique().shape

In [None]:
lincs_l1k_rep = pd.read_csv(procProf_dir+'preprocessed_data/LINCS-Pilot1/L1000/'+data_meta_match_ls[2][0]+'.csv.gz')
lincs_l1k_rep.columns[lincs_l1k_rep.columns.str.contains('dose')]

In [None]:
lincs_l1k_rep[['pert_dose', 'pert_dose_unit', 'pert_idose', 'nearest_dose']]

In [None]:
lincs_l1k_rep['nearest_dose'].unique()

In [None]:
lincs_l1k_rep.groupby(['nearest_dose']).size()

In [None]:
# lincs_l1k_rep.rna_plate.unique()
primary_dose_mapping = [0.04, 0.12, 0.37, 1.11, 3.33, 10, 20]

In [None]:
lincs_l1k_rep = pd.read_csv(procProf_dir+'preprocessed_data/LINCS-Pilot1/L1000/'+data_meta_match_ls[2][0]+'.csv.gz')
l1k_features=lincs_l1k_rep.columns[lincs_l1k_rep.columns.str.contains("_at")]
lincs_l1k_rep = standardize_per_catX(lincs_l1k_rep,'det_plate',l1k_features.tolist());
x=replicateCorrs(lincs_l1k_rep[lincs_l1k_rep['pert_iname_x']!='DMSO'].reset_index(drop=True),'pert_id',l1k_features,1)

In [None]:
lincs_l1k_rep = pd.read_csv(procProf_dir+'preprocessed_data/LINCS-Pilot1/L1000/'+data_meta_match_ls[2][0]+'.csv.gz')
l1k_features=lincs_l1k_rep.columns[lincs_l1k_rep.columns.str.contains("_at")]
lincs_l1k_rep = standardize_per_catX(lincs_l1k_rep,'det_plate',l1k_features.tolist());
x_l1k_lincs=replicateCorrs(lincs_l1k_rep[lincs_l1k_rep['pert_iname_x']!='DMSO'].reset_index(drop=True),'pert_id_dose',l1k_features,1)
saveAsNewSheetToExistingFile(filename,x_l1k_lincs[2],'l1k-lincs')

In [None]:
saveAsNewSheetToExistingFile(filename,x[2],'l1k-lincs')

In [None]:
lincs_l1k_rep = pd.read_csv(procProf_dir+'preprocessed_data/LINCS-Pilot1/L1000/'+data_meta_match_ls[2][0]+'.csv.gz')

raw data


In [None]:
# set(repLevelLuad2)-set(Y1.columns)

In [None]:
# Y1[['Allele', 'Category', 'Clone ID', 'Gene Symbol']].head()

In [None]:
# repLevelLuad2[repLevelLuad2['PublicID']=='BRDN0000553807'][['Col','InsertLength','NCBIGeneID','Name','OtherDescriptions','PublicID','Row','Symbol','Transcript','Vector','pert_type','x_mutation_status']].head()

#### Check Replicate Correlation

### CP - LINCS

In [None]:
# Ran the following on:
# https://ec2-54-242-99-61.compute-1.amazonaws.com:5006/notebooks/workspace_nucleolar/2020_07_20_Nucleolar_Calico/1-NucleolarSizeMetrics.ipynb
# Metadata
def recode_dose(x, doses, return_level=False):
    # Snap a raw dose x to the nearest entry of doses; a missing dose maps to 0
    if np.isnan(x):
        return 0
    closest_index = np.argmin([np.abs(dose - x) for dose in doses])
    if return_level:
        return closest_index + 1  # 1-based dose level
    else:
        return doses[closest_index]
    
primary_dose_mapping = [0.04, 0.12, 0.37, 1.11, 3.33, 10, 20]


metadata=pd.read_csv("/home/ubuntu/bucket/projects/2018_04_20_Rosetta/workspace/raw-profiles/CP_LINCS/metadata/matadata_lincs_2.csv")
metadata['Metadata_mmoles_per_liter']=metadata.mmoles_per_liter.values.round(2)
metadata=metadata.rename(columns={"Assay_Plate_Barcode": "Metadata_Plate",'broad_sample':'Metadata_broad_sample','well_position':'Metadata_Well'})

lincs_submod_root_dir="/home/ubuntu/datasetsbucket/lincs-cell-painting/"

profileType=['_augmented','_normalized','_normalized_dmso',\
             '_normalized_feature_select','_normalized_feature_select_dmso']
# profileType=['_normalized']
# plates=metadata.Assay_Plate_Barcode.unique().tolist()
plates=metadata.Metadata_Plate.unique().tolist()
for pt in profileType[4:5]:
    repLevelLINCS0=[]
    
    for p in plates:
        profile_add=lincs_submod_root_dir+"/profiles/2016_04_01_a549_48hr_batch1/"+p+"/"+p+pt+".csv.gz"
        if os.path.exists(profile_add):
            repLevelLINCS0.append(pd.read_csv(profile_add))
            
    repLevelLINCS = pd.concat(repLevelLINCS0)
    meta_lincs1=metadata.rename(columns={"broad_sample": "Metadata_broad_sample"})
    # metaCDRP1=metaCDRP1.rename(columns={"PlateName":"Metadata_Plate_Map_Name",'Well':'Metadata_Well'})
    # metaCDRP1['Metadata_Well']=metaCDRP1['Metadata_Well'].str.lower()
    
    repLevelLINCS2=pd.merge(repLevelLINCS,meta_lincs1,how='left', on=["Metadata_broad_sample","Metadata_Well","Metadata_Plate",'Metadata_mmoles_per_liter'])
    

    repLevelLINCS2 = repLevelLINCS2.assign(Metadata_dose_recode=(repLevelLINCS2.Metadata_mmoles_per_liter.apply(
            lambda x: recode_dose(x, primary_dose_mapping, return_level=False))))
    repLevelLINCS2['Metadata_pert_id_dose']=repLevelLINCS2['Metadata_pert_id']+'_'+repLevelLINCS2['Metadata_dose_recode'].astype(str)
#     repLevelLINCS2['Metadata_Sample_Dose']=repLevelLINCS2['Metadata_broad_sample']+'_'+repLevelLINCS2['Metadata_dose_recode'].astype(str)
    repLevelLINCS2['Metadata_pert_id_dose']=repLevelLINCS2['Metadata_pert_id_dose'].replace(np.nan, 'DMSO')
#     saveDF_to_CSV_GZ_no_timestamp(repLevelLINCS2,procProf_dir+'/preprocessed_data/LINCS-Pilot1/CellPainting/replicate_level_cp'+pt+'.csv.gz')
    print(meta_lincs1.shape,repLevelLINCS.shape,repLevelLINCS2.shape)
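
As a quick check of the dose recoding used above, the sketch below restates `recode_dose` in self-contained form (with the NaN check placed first) and applies it to a few example concentrations. The input values are made up for illustration.

```python
import numpy as np

def recode_dose(x, doses, return_level=False):
    """Snap x to the nearest reference dose; a missing dose maps to 0."""
    if np.isnan(x):
        return 0
    closest_index = int(np.argmin([abs(dose - x) for dose in doses]))
    return closest_index + 1 if return_level else doses[closest_index]

primary_dose_mapping = [0.04, 0.12, 0.37, 1.11, 3.33, 10, 20]

raw = [0.041, 1.0, 9.8, float("nan")]
nearest = [recode_dose(x, primary_dose_mapping) for x in raw]
levels = [recode_dose(x, primary_dose_mapping, return_level=True) for x in raw[:3]]
print(nearest, levels)
```

A raw dose that is exactly halfway between two reference doses resolves to the lower one, because `np.argmin` returns the first minimum.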

In [None]:
# (8120, 15) (52223, 1810) (688699, 1825)
# repLevelLINCS

In [None]:
# pd.merge(repLevelLINCS,meta_lincs1,how='left', on=["Metadata_broad_sample"]).shape
repLevelLINCS.shape,meta_lincs1.shape

In [None]:
# (8120, 15) (52223, 1238) (52223, 1253)
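
The shapes recorded in this notebook (52,223 profile rows remaining 52,223 after one merge, but becoming 688,699 in another) are what a left merge produces when the key is or is not unique in the metadata table. The sketch below, on toy data with hypothetical values, shows the row multiplication and how the `validate` argument of `pd.merge` can turn it into an explicit error.

```python
import pandas as pd

profiles = pd.DataFrame({
    "Metadata_broad_sample": ["S1", "S1"],
    "Metadata_Well": ["A01", "A02"],
    "Cells_feat": [0.1, 0.2],
})
meta = pd.DataFrame({
    "Metadata_broad_sample": ["S1", "S1"],
    "Metadata_Well": ["A01", "A02"],
    "Metadata_pert_id": ["P1", "P1"],
})

# Under-specified key: every profile row matches every metadata row for S1
loose = pd.merge(profiles, meta[["Metadata_broad_sample", "Metadata_pert_id"]],
                 how="left", on=["Metadata_broad_sample"])

# Fully specified key: one metadata row per profile row
keys = ["Metadata_broad_sample", "Metadata_Well"]
tight = pd.merge(profiles, meta, how="left", on=keys, validate="many_to_one")

# validate raises instead of silently duplicating rows
try:
    pd.merge(profiles, meta, how="left", on=["Metadata_broad_sample"], validate="many_to_one")
    caught = False
except pd.errors.MergeError:
    caught = True

print(loose.shape, tight.shape, caught)
```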

In [None]:
# ls /home/ubuntu/bucket/projects/2018_04_20_Rosetta/workspace/raw-profiles/CP_LINCS/metadata

In [None]:
csv_l1k_lincs=pd.read_csv('./preprocessed_data/LINCS-Pilot1/L1000/replicate_level_l1k'+'.csv.gz')
csv_pddf=pd.read_csv('./preprocessed_data/LINCS-Pilot1/CellPainting/replicate_level_cp'+pt[0]+'.csv.gz')

In [None]:
csv_l1k_lincs.head()

In [None]:
csv_l1k_lincs.pert_id_dose.unique()

In [None]:
csv_pddf.Metadata_pert_id_dose.unique()

#### Read saved data

In [None]:
repLevelLINCS2.groupby(['Metadata_pert_id']).size()

In [None]:
repLevelLINCS2.groupby(['Metadata_pert_id_dose']).size().describe()

In [None]:
repLevelLINCS2.Metadata_Plate.unique().shape

In [None]:
repLevelLINCS2['Metadata_pert_id_dose'].unique().shape

In [None]:
# csv_pddf['Metadata_mmoles_per_liter'].round(0).unique()
# np.sort(csv_pddf['Metadata_mmoles_per_liter'].unique())

In [None]:
csv_pddf.groupby(['Metadata_dose_recode']).size()#.median()

In [None]:
# repLevelLincs2 must be defined before this cell runs; here it is taken from the saved CP profiles.
repLevelLincs2=csv_pddf.copy()
import gc
cp_features=repLevelLincs2.columns[repLevelLincs2.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")]
cols2remove0=[i for i in cp_features if ((repLevelLincs2[i].isnull()).sum(axis=0)/repLevelLincs2.shape[0])>0.05]
print(cols2remove0)
repLevelLincs3=repLevelLincs2.drop(cols2remove0, axis=1);
print('dropped features with more than 5% missing values')
# cp_features=list(set(cp_features)-set(cols2remove0))
# repLevelTA2=repLevelTA2.replace('nan', np.nan)
del repLevelLincs2
gc.collect()
print('released repLevelLincs2')
cp_features=repLevelLincs3.columns[repLevelLincs3.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")]
repLevelLincs3[cp_features] = repLevelLincs3[cp_features].interpolate()
print('interpolated remaining missing values')
repLevelLincs3 = standardize_per_catX(repLevelLincs3,'Metadata_Plate',cp_features.tolist())
print('standardized features per plate')

# df0=repLevelCDRP3[repLevelCDRP3['Metadata_broad_sample']!='DMSO'].reset_index(drop=True)
# repSizeDF=repLevelLincs3.groupby(['Metadata_broad_sample']).size().reset_index()
repSizeDF=repLevelLincs3.groupby(['Metadata_pert_id_dose']).size().reset_index()

highRepComp=repSizeDF[repSizeDF[0]>1].Metadata_pert_id_dose.tolist()
highRepComp.remove('DMSO')
# df0=repLevelLincs3[(repLevelLincs3['Metadata_broad_sample'].isin(highRepComp)) &\
#                    (repLevelLincs3['Metadata_dose_recode']==1.11)]
df0=repLevelLincs3[(repLevelLincs3['Metadata_pert_id_dose'].isin(highRepComp))]
x_lincs_cp=replicateCorrs(df0,'Metadata_pert_id_dose',cp_features,1)
# saveAsNewSheetToExistingFile(filename,x_lincs_cp[2],'cp-lincs')
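
`replicateCorrs` is defined elsewhere and its exact outputs are not shown in this section. As a hedged illustration of the general idea of replicate correlation, the sketch below computes, for each perturbation with at least two replicates, the median pairwise Pearson correlation between replicate profiles. The function `median_replicate_corr` is hypothetical and may differ from what `replicateCorrs` returns.

```python
import numpy as np
import pandas as pd

def median_replicate_corr(df, pert_col, feature_cols):
    """Median pairwise Pearson correlation among replicates of each perturbation."""
    out = {}
    for pert, grp in df.groupby(pert_col):
        if len(grp) < 2:
            continue  # at least two replicates are needed for a correlation
        # Rows are replicates, so corrcoef returns a replicate-by-replicate matrix.
        corr = np.corrcoef(grp[feature_cols].to_numpy(dtype=float))
        upper = corr[np.triu_indices_from(corr, k=1)]
        out[pert] = float(np.median(upper))
    return pd.Series(out, name="median_rep_corr")

toy = pd.DataFrame({
    "Metadata_pert_id_dose": ["a_1", "a_1", "b_1", "b_1", "c_1"],
    "Cells_f1": [1.0, 2.0, 1.0, 3.0, 0.0],
    "Cells_f2": [2.0, 4.0, 3.0, 2.0, 1.0],
    "Cells_f3": [3.0, 6.0, 2.0, 1.0, 5.0],
})
features = ["Cells_f1", "Cells_f2", "Cells_f3"]
rep_corr = median_replicate_corr(toy, "Metadata_pert_id_dose", features)
print(rep_corr)
```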

In [None]:
repSizeDF

In [None]:
# repLevelLincs2=csv_pddf.copy()

# cp_features=repLevelLincs2.columns[repLevelLincs2.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")]
# cols2remove0=[i for i in cp_features if ((repLevelLincs2[i].isnull()).sum(axis=0)/repLevelLincs2.shape[0])>0.05]
# print(cols2remove0)
# repLevelLincs3=repLevelLincs2.drop(cols2remove0, axis=1);
# # cp_features=list(set(cp_features)-set(cols2remove0))
# # repLevelTA2=repLevelTA2.replace('nan', np.nan)
# repLevelLincs3 = repLevelLincs3.interpolate()

# repLevelLincs3 = standardize_per_catX(repLevelLincs3,'Metadata_Plate',cp_features.tolist());

# cp_features=repLevelLincs3.columns[repLevelLincs3.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")]
# # df0=repLevelCDRP3[repLevelCDRP3['Metadata_broad_sample']!='DMSO'].reset_index(drop=True)
# # repSizeDF=repLevelLincs3.groupby(['Metadata_broad_sample']).size().reset_index()
repSizeDF=repLevelLincs3.groupby(['Metadata_pert_id']).size().reset_index()

highRepComp=repSizeDF[repSizeDF[0]>1].Metadata_pert_id.tolist()
# highRepComp.remove('DMSO')
# df0=repLevelLincs3[(repLevelLincs3['Metadata_broad_sample'].isin(highRepComp)) &\
#                    (repLevelLincs3['Metadata_dose_recode']==1.11)]
df0=repLevelLincs3[(repLevelLincs3['Metadata_pert_id'].isin(highRepComp))]
x_lincs_cp=replicateCorrs(df0,'Metadata_pert_id',cp_features,1)
# saveAsNewSheetToExistingFile(filename,x_lincs_cp[2],'cp-lincs')

In [None]:
# x=replicateCorrs(df0,'Metadata_broad_sample',cp_features,1)
# highRepComp[-1]


In [None]:
saveAsNewSheetToExistingFile(filename,x_lincs_cp[2],'cp-lincs')

In [None]:
# repLevelLincs3.Metadata_Plate
repLevelLincs3.head()

In [None]:
# csv_pddf[(csv_pddf['Metadata_dose_recode']==0.04) & (csv_pddf['Metadata_pert_id']=="BRD-A00147595")][['Metadata_Plate','Metadata_Well']].drop_duplicates()

In [None]:
# csv_pddf[(csv_pddf['Metadata_dose_recode']==0.04) & (csv_pddf['Metadata_pert_id']=="BRD-A00147595") &
#         (csv_pddf['Metadata_Plate']=='SQ00015196') & (csv_pddf['Metadata_Well']=="B12")][csv_pddf.columns[1820:]].drop_duplicates()

In [None]:
# def standardize_per_catX(df,column_name):
column_name='Metadata_Plate'
repLevelLincs_scaled_perPlate=repLevelLincs3.copy()
repLevelLincs_scaled_perPlate[cp_features.tolist()]=repLevelLincs3[cp_features.tolist()+[column_name]].groupby(column_name).transform(lambda x: (x - x.mean()) / x.std()).values
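
The per-plate z-scoring in the cell above can be checked on a small example. The sketch below uses toy data with made-up feature names and selects the feature columns after `groupby`, so that `transform` returns a frame aligned to the original index and no `.values` conversion is needed.

```python
import pandas as pd

toy = pd.DataFrame({
    "Metadata_Plate": ["P1", "P1", "P1", "P2", "P2", "P2"],
    "Cells_f1": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "Nuclei_f2": [0.0, 1.0, 2.0, 5.0, 5.0, 8.0],
})
feature_cols = ["Cells_f1", "Nuclei_f2"]

scaled = toy.copy()
# Subtract the plate mean and divide by the plate (sample) standard deviation.
scaled[feature_cols] = (
    toy.groupby("Metadata_Plate")[feature_cols]
       .transform(lambda x: (x - x.mean()) / x.std())
)
# Each plate should now have mean 0 and standard deviation 1 per feature.
plate_stats = scaled.groupby("Metadata_Plate")[feature_cols].agg(["mean", "std"])
print(plate_stats.round(6))
```

A feature that is constant within a plate has zero standard deviation and produces NaN after scaling, so such columns may need to be dropped or handled separately.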

In [None]:
# def standardize_per_catX(df,column_name):
# # column_name='Metadata_Plate'
#     cp_features=df.columns[df.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")]
#     df_scaled_perPlate=df.copy()
#     df_scaled_perPlate[cp_features.tolist()]=\
#     df[cp_features.tolist()+[column_name]].groupby(column_name)\
#     .transform(lambda x: (x - x.mean()) / x.std()).values
#     return df_scaled_perPlate

In [None]:
df0=repLevelLincs_scaled_perPlate[(repLevelLincs_scaled_perPlate['Metadata_Sample_Dose'].isin(highRepComp))]
x=replicateCorrs(df0,'Metadata_broad_sample',cp_features,1)

# Metadata columns for each dataset

In [None]:
# ds_info_dict layout: 'dataset_name': ['folder_name', [cp_pert_col_name, l1k_pert_col_name], [cp_control_val, l1k_control_val]]
dataset='CDRP'
profileType='normalized'

ds_info_dict={'CDRP':['CDRP-BBBC047-Bray',['Metadata_Sample_Dose','pert_sample_dose'],[['DMSO'],['DMSO']]],
              'CDRP-bio':['CDRPBIO-BBBC036-Bray',['Metadata_Sample_Dose','pert_sample_dose'],[['DMSO'],['DMSO']]],
              'TAORF':['TA-ORF-BBBC037-Rohban',['Metadata_broad_sample','pert_id',],[['DMSO_0.04'],['DMSO_-666']]],
              'LUAD':['LUAD-BBBC041-Caicedo',['x_mutation_status','allele'],[['DMSO_0.04'],['DMSO_-666']]],
              'LINCS':['LINCS-Pilot1',['Metadata_pert_id_dose','pert_id_dose'],[['DMSO'],['DMSO']]]}

dataDir=procProf_dir+'/preprocessed_data/'+ds_info_dict[dataset][0]+'/'

cp_data_repLevel=pd.read_csv(dataDir+'/CellPainting/replicate_level_cp_'+profileType+'.csv.gz')    
l1k_data_repLevel=pd.read_csv(dataDir+'/L1000/replicate_level_l1k.csv.gz')  


# meta_cols_cp=cp_data_repLevel.columns[cp_data_repLevel.columns.str.contains('Metadata')].tolist()
meta_cols_cp=cp_data_repLevel.columns[~cp_data_repLevel.columns.str.contains("Cells_|Cytoplasm_|Nuclei_")].tolist()
meta_cols_l1k=l1k_data_repLevel.columns[~l1k_data_repLevel.columns.str.contains('_at')].tolist()


###########################################
meta_df_cp=pd.DataFrame(index=meta_cols_cp,columns=['unique_values'])
for m in meta_cols_cp:
    unq_vals=cp_data_repLevel[m].unique()
    # Use a single .loc[row, col] assignment; chained .loc[row][col] may write to a temporary copy.
    if len(unq_vals)<=10:
        meta_df_cp.loc[m,'unique_values']=', '.join(list(unq_vals.astype(str)))
    else:
        meta_df_cp.loc[m,'unique_values']=', '.join(list(unq_vals[:10].astype(str)))+', ...'
    
    
meta_df_l1k=pd.DataFrame(index=meta_cols_l1k,columns=['unique_values'])
# for m in meta_cols_l1k:
#     meta_df_l1k.loc[m]['unique_values']=l1k_data_repLevel[m].unique()
    
    
for m in meta_cols_l1k:
    unq_vals=l1k_data_repLevel[m].unique()
    # Use a single .loc[row, col] assignment; chained .loc[row][col] may write to a temporary copy.
    if len(unq_vals)<=10:
        meta_df_l1k.loc[m,'unique_values']=', '.join(list(unq_vals.astype(str)))
    else:
        meta_df_l1k.loc[m,'unique_values']=', '.join(list(unq_vals[:10].astype(str)))+', ...'
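
The two loops above do the same work for the CP and L1000 tables. One possible way to factor them into a single helper is sketched below; `summarize_unique_values` is a hypothetical name, and the toy frame stands in for the real replicate-level data.

```python
import pandas as pd

def summarize_unique_values(df, columns, max_values=10):
    """One row per column listing up to `max_values` unique values as a string."""
    summary = pd.DataFrame(index=columns, columns=["unique_values"], dtype=object)
    for col in columns:
        unq = df[col].unique()
        text = ", ".join(map(str, unq[:max_values]))
        if len(unq) > max_values:
            text += ", ..."  # mark that the list was truncated
        summary.loc[col, "unique_values"] = text
    return summary

toy = pd.DataFrame({"pert_id": ["a", "b", "a", "c"], "dose": [1, 1, 2, 3]})
toy_summary = summarize_unique_values(toy, ["pert_id", "dose"], max_values=2)
print(toy_summary)
```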

In [None]:
print(meta_df_cp.to_markdown())

In [None]:
# meta_df_l1k.to_markdown(index=False)
# list(unq_vals.astype(str))
print(meta_df_l1k.to_markdown())

In [None]:
cp_data_repLevel['Metadata_NCBIGeneID'].unique().shape

In [None]:
print(cp_data_repLevel[cp_data_repLevel.columns[cp_data_repLevel.columns.str.contains('Metadata')]].head().to_markdown())

In [None]:
print(l1k_data_repLevel[l1k_data_repLevel.columns[~l1k_data_repLevel.columns.str.contains('_at')]].head().to_markdown())

# Rename negative controls and add pert_type column

In [None]:
import os
import pandas as pd
import numpy as np
# datasets=['TAORF','LUAD','LINCS', 'CDRP-bio','CDRP']
datasets=['CDRP']
# dataset='CDRP';
profileTypes=['augmented' , 'normalized', 'normalized_variable_selected']

ds_info_dict={'CDRP':['CDRP-BBBC047-Bray',['Metadata_Sample_Dose','pert_sample_dose']],
              'CDRP-bio':['CDRPBIO-BBBC036-Bray',['Metadata_Sample_Dose','pert_sample_dose']],
              'TAORF':['TA-ORF-BBBC037-Rohban',['Metadata_broad_sample','pert_id',]],
              'LUAD':['LUAD-BBBC041-Caicedo',['x_mutation_status','allele']],
              'LINCS':['LINCS-Pilot1',['Metadata_pert_id_dose','pert_id_dose']]}
    
# ds_info_dict={'CDRP':['CDRPBIO-BBBC036-Bray',['Metadata_Sample_Dose','pert_sample_dose'],[['DMSO'],['DMSO']]],
#               'CDRP-bio':['CDRPBIO-BBBC036-Bray',['Metadata_Sample_Dose','pert_sample_dose'],[['DMSO'],['DMSO']]],
#               'TAORF':['TA-ORF-BBBC037-Rohban',['Metadata_broad_sample','pert_id',],[['DMSO_0.04'],['DMSO_-666']]],
#               'LUAD':['LUAD-BBBC041-Caicedo',['x_mutation_status','allele'],[['DMSO_0.04'],['DMSO_-666']]],
#               'LINCS':['LINCS-Pilot1',['Metadata_pert_id_dose','pert_id_dose'],[['DMSO'],['DMSO']]]}

for dataset in datasets:
    dataDir=procProf_dir+'/preprocessed_data/'+ds_info_dict[dataset][0]+'/'
    dataDir_save='../preprocessed_data/'+ds_info_dict[dataset][0]+'/'
    # Create the output folders if they do not exist yet (portable equivalent of mkdir -p).
    os.makedirs(dataDir_save+'L1000', exist_ok=True)
    os.makedirs(dataDir_save+'CellPainting', exist_ok=True)
    pert_ids_col_cp=ds_info_dict[dataset][1][0]
    pert_ids_col_ge=ds_info_dict[dataset][1][1]
    l1k_data_repLevel=pd.read_csv(dataDir+'/L1000/replicate_level_l1k.csv.gz')  

    l1k_data_repLevel[pert_ids_col_ge]=l1k_data_repLevel[pert_ids_col_ge].replace('DMSO','negcon')
    l1k_data_repLevel['pert_type']='trt'
    l1k_data_repLevel.loc[l1k_data_repLevel[pert_ids_col_ge]=='negcon','pert_type']='control'
    l1k_data_repLevel['control_type']=np.nan
    l1k_data_repLevel.loc[l1k_data_repLevel[pert_ids_col_ge]=='negcon','control_type']='negcon'
    saveDF_to_CSV_GZ_no_timestamp(l1k_data_repLevel,dataDir_save+'/L1000/replicate_level_l1k.csv.gz')
    # pert_type is 'trt' or 'control'; control_type is 'negcon' for negative controls and NaN otherwise.
    
    for profileType in profileTypes:
        cp_data_repLevel=pd.read_csv(dataDir+'/CellPainting/replicate_level_cp_'+profileType+'.csv.gz')    
        cp_data_repLevel[pert_ids_col_cp]=cp_data_repLevel[pert_ids_col_cp].replace('DMSO','negcon')
        cp_data_repLevel['pert_type']='trt'
        cp_data_repLevel.loc[cp_data_repLevel[pert_ids_col_cp]=='negcon','pert_type']='control'
        cp_data_repLevel['control_type']=np.nan
        cp_data_repLevel.loc[cp_data_repLevel[pert_ids_col_cp]=='negcon','control_type']='negcon' 
        saveDF_to_CSV_GZ_no_timestamp(cp_data_repLevel,dataDir_save+'/CellPainting/replicate_level_cp_'+profileType+'.csv.gz')

In [None]:
# datasets=['TAORF','LUAD','LINCS', 'CDRP-bio','CDRP']
datasets=['CDRP']
for dataset in datasets:
    dataDir=procProf_dir+'/preprocessed_data/'+ds_info_dict[dataset][0]+'/'
    dataDir_save='../preprocessed_data/'+ds_info_dict[dataset][0]+'/'
#     os.system('mkdir -p '+dataDir_save+'L1000')
#     os.system('mkdir -p '+dataDir_save+'CellPainting')
    pert_ids_col_cp=ds_info_dict[dataset][1][0]
    pert_ids_col_ge=ds_info_dict[dataset][1][1]
    
    fNames=os.listdir(dataDir+'/L1000/')
    fNames.remove('replicate_level_l1k.csv.gz')
    
#     correct_cols_df=pd.read_csv(dataDir_save+'/L1000/'+'replicate_level_l1k.csv.gz') 
    
    for fname in fNames:
        l1k_data_repLevel=pd.read_csv(dataDir+'/L1000/'+fname)  
#         l1k_data_repLevel[pert_ids_col_ge]=correct_cols_df[pert_ids_col_ge]
        l1k_data_repLevel[pert_ids_col_ge]=l1k_data_repLevel[pert_ids_col_ge].replace('DMSO','negcon')
        l1k_data_repLevel['pert_type']='trt'
        l1k_data_repLevel.loc[l1k_data_repLevel[pert_ids_col_ge]=='negcon','pert_type']='control'
        l1k_data_repLevel['control_type']=np.nan
        l1k_data_repLevel.loc[l1k_data_repLevel[pert_ids_col_ge]=='negcon','control_type']='negcon'
        saveDF_to_CSV_GZ_no_timestamp(l1k_data_repLevel,dataDir_save+'/L1000/'+fname)
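
The relabeling steps repeated in the cells above (rename `DMSO` to `negcon`, then add `pert_type` and `control_type`) could be collected in one helper. The sketch below is one possible version on toy data; `label_controls` is a hypothetical name and the toy perturbation id is made up.

```python
import numpy as np
import pandas as pd

def label_controls(df, pert_col, control_value="DMSO", new_label="negcon"):
    """Return a copy with controls renamed and pert_type/control_type columns added."""
    out = df.copy()
    out[pert_col] = out[pert_col].replace(control_value, new_label)
    is_control = out[pert_col] == new_label
    out["pert_type"] = np.where(is_control, "control", "trt")
    # control_type is the new label for controls and NaN for treatments.
    out["control_type"] = is_control.map({True: new_label, False: np.nan})
    return out

toy = pd.DataFrame({"pert_id_dose": ["DMSO", "BRD-1_1.11", "DMSO"],
                    "200814_at": [0.1, 0.5, -0.2]})
labeled = label_controls(toy, "pert_id_dose")
print(labeled)
```

Returning a copy leaves the input frame untouched, which makes it easier to rerun a cell without relabeling an already relabeled table.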
        

In [None]:
import pandas as pd
import numpy as np
# datasets=['TAORF','LUAD','LINCS', 'CDRP-bio','CDRP']
datasets=['TAORF']
# dataset='CDRP';
profileTypes=['augmented' , 'normalized', 'normalized_variable_selected']

ds_info_dict={'CDRP':['CDRP-BBBC047-Bray',['Metadata_Sample_Dose','pert_sample_dose']],
              'CDRP-bio':['CDRPBIO-BBBC036-Bray',['Metadata_Sample_Dose','pert_sample_dose']],
              'TAORF':['TA-ORF-BBBC037-Rohban',['Metadata_broad_sample','pert_id',]],
              'LUAD':['LUAD-BBBC041-Caicedo',['x_mutation_status','allele']],
              'LINCS':['LINCS-Pilot1',['Metadata_pert_id_dose','pert_id_dose']]}
    
# ds_info_dict={'CDRP':['CDRPBIO-BBBC036-Bray',['Metadata_Sample_Dose','pert_sample_dose'],[['DMSO'],['DMSO']]],
#               'CDRP-bio':['CDRPBIO-BBBC036-Bray',['Metadata_Sample_Dose','pert_sample_dose'],[['DMSO'],['DMSO']]],
#               'TAORF':['TA-ORF-BBBC037-Rohban',['Metadata_broad_sample','pert_id',],[['DMSO_0.04'],['DMSO_-666']]],
#               'LUAD':['LUAD-BBBC041-Caicedo',['x_mutation_status','allele'],[['DMSO_0.04'],['DMSO_-666']]],
#               'LINCS':['LINCS-Pilot1',['Metadata_pert_id_dose','pert_id_dose'],[['DMSO'],['DMSO']]]}

for dataset in datasets:
    dataDir=procProf_dir+'/preprocessed_data/'+ds_info_dict[dataset][0]+'/'
    dataDir_save='../preprocessed_data/'+ds_info_dict[dataset][0]+'/'
    
    for profileType in profileTypes:
        cp_data_repLevel=pd.read_csv(dataDir+'/CellPainting/replicate_level_cp_'+profileType+'.csv.gz') 
        cp_data_repLevel=cp_data_repLevel.drop(columns=['Metadata_moa'])
        saveDF_to_CSV_GZ_no_timestamp(cp_data_repLevel,dataDir_save+'/CellPainting/replicate_level_cp_'+profileType+'.csv.gz')

# Landmark gene names across different datasets

In [None]:
from cmapPy.pandasGEXpress.parse import parse

### CDRP
from scipy.io import loadmat
x = loadmat(rawProf_dir+'/l1000_CDRP/cdrp.all.prof.mat')
m1=x['metaGen']['AFFX_PROBE_ID'][0][0]
m2=x['metaGen']['GENE_SYMBOL'][0][0]

cdrp_l1k_prob_ids=[]
GENE_SYMBOLs=[]
for r in range(len(m1)):
    cdrp_l1k_prob_ids.append(m1[r][0][0])
    GENE_SYMBOLs.append(m2[r][0][0]) 
    
    
### LUAD
luad_l1k_df = parse(rawProf_dir+"/l1000_LUAD/output/high_rep_A549_8reps_141230_ZSPCINF_n4232x978.gctx").data_df.T.reset_index()
luad_l1k_prob_ids=luad_l1k_df.columns[luad_l1k_df.columns.str.contains('_at')].tolist()

### TAORF
taorf_l1k_df0=parse(rawProf_dir+"/l1000_TA_ORF/TA.OE005_U2OS_72H_ZSPCQNORM_n729x978.gctx")
taorf_l1k_df=taorf_l1k_df0.data_df.T
taorf_l1k_prob_ids=taorf_l1k_df.columns[taorf_l1k_df.columns.str.contains('_at')].tolist()
    

### LINCS
lincs_l1k_df=parse(rawProf_dir+"/l1000_LINCS/2016_04_01_a549_48hr_batch1_L1000/level_3_q2norm_n27837x978.gctx").data_df.T.reset_index()
lincs_l1k_prob_ids=lincs_l1k_df.columns[lincs_l1k_df.columns.str.contains('_at')].tolist()

In [None]:
union_of_prob_ids_across_DSs=list(set(lincs_l1k_prob_ids+taorf_l1k_prob_ids+luad_l1k_prob_ids+cdrp_l1k_prob_ids))
len(union_of_prob_ids_across_DSs)

In [None]:
meta=pd.read_csv("/home/ubuntu/bucket/projects/2018_04_20_Rosetta/workspace/metadata/affy_probe_gene_mapping.txt",delimiter="\t",header=None, names=["probe_id", "gene"])
meta_gene_probID=meta.set_index('probe_id')
d = dict(zip(meta_gene_probID.index, meta_gene_probID['gene']))

In [None]:
# d

In [None]:
np.unique(list(d.keys())).shape,np.unique(list(d.values())).shape

In [None]:
luad_gene_symbols=[d[x] for x in luad_l1k_prob_ids]
lincs_gene_symbols=[d[x] for x in lincs_l1k_prob_ids]
cdrp_gene_symbols=[d[x] for x in cdrp_l1k_prob_ids]
taorf_gene_symbols=[d[x] for x in taorf_l1k_prob_ids]
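
The list comprehensions above raise a `KeyError` if any probe id is missing from the mapping. A small sketch of a more forgiving lookup is given below; the probe ids and gene names are toy placeholders, not entries from `affy_probe_gene_mapping.txt`.

```python
import pandas as pd

# Toy probe-to-gene table (hypothetical values for illustration).
probe_map = pd.DataFrame({"probe_id": ["p1_at", "p2_at", "p3_at"],
                          "gene": ["GENE1", "GENE2", "GENE3"]})
probe_to_gene = probe_map.set_index("probe_id")["gene"].to_dict()

probe_ids = ["p1_at", "p3_at", "p9_at"]  # p9_at is absent from the table
# dict.get returns None for unknown probes instead of raising KeyError.
gene_symbols = [probe_to_gene.get(p) for p in probe_ids]
unmapped = [p for p in probe_ids if p not in probe_to_gene]
print(gene_symbols, unmapped)
```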

In [None]:
print('LUAD and LINCS overlap: ',len(set(luad_l1k_prob_ids) & set(lincs_l1k_prob_ids)))
print('LUAD and TAORF overlap: ',len(set(luad_l1k_prob_ids) & set(taorf_l1k_prob_ids)))
print('CDRP and LINCS overlap: ',len(set(cdrp_l1k_prob_ids) & set(lincs_l1k_prob_ids)))
print('CDRP and TAORF overlap: ',len(set(cdrp_l1k_prob_ids) & set(taorf_l1k_prob_ids)))

In [None]:
print('LUAD and LINCS overlap: ',len(set(luad_gene_symbols) & set(lincs_gene_symbols)))
print('LUAD and TAORF overlap: ',len(set(luad_gene_symbols) & set(taorf_gene_symbols)))
print('CDRP and LINCS overlap: ',len(set(cdrp_gene_symbols) & set(lincs_gene_symbols)))
print('CDRP and TAORF overlap: ',len(set(cdrp_gene_symbols) & set(taorf_gene_symbols)))
print('CDRP (GENE_SYMBOL field of the .mat file) and TAORF overlap: ',len(set(GENE_SYMBOLs) & set(taorf_gene_symbols)))
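
The pairwise overlap prints above can also be produced for every pair of datasets in one pass. The sketch below uses toy identifier lists in place of the real probe or gene lists, so the counts are illustrative only.

```python
from itertools import combinations
import pandas as pd

# Toy identifier lists standing in for the per-dataset probe or gene lists.
id_lists = {
    "CDRP": ["a", "b", "c", "d"],
    "LUAD": ["b", "c", "d", "e"],
    "TAORF": ["b", "c", "d", "e"],
}
rows = []
for (name1, ids1), (name2, ids2) in combinations(id_lists.items(), 2):
    s1, s2 = set(ids1), set(ids2)
    rows.append({"pair": f"{name1} & {name2}",
                 "overlap": len(s1 & s2),
                 "union": len(s1 | s2)})
overlap_table = pd.DataFrame(rows)
print(overlap_table)
```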

In [None]:
just_in_cdrp=list(set(cdrp_gene_symbols)-set(luad_gene_symbols))
# print(pd.DataFrame(just_in_cdrp,columns=['Just-In-CDRP']).to_markdown())

In [None]:
just_in_therest=list(set(luad_gene_symbols)-set(cdrp_gene_symbols))
# print(pd.DataFrame(just_in_therest,columns=['Just-In-LUAD-LINCS-TAORF']).to_markdown())


In [None]:
# just_in_therest

In [None]:
set(cdrp_gene_symbols) & set(taorf_gene_symbols)

In [None]:
len(set(cdrp_gene_symbols+taorf_gene_symbols))

In [None]:
# cdrp_gene_symbols

In [None]:
union_of_lm_genes_across_DSs=list(set(cdrp_gene_symbols+taorf_gene_symbols))

In [None]:
print('\n'.join(union_of_lm_genes_across_DSs))

In [None]:
# meta[meta['gene']=='CALM1']

In [None]:
meta[meta['probe_id']=='1122_f_at']

In [None]:
df_m=meta[meta['probe_id'].isin(union_of_prob_ids_across_DSs)].reset_index(drop=True)

In [None]:
df_q=pd.read_excel('idmap.xlsx')  
df_q

In [None]:
lst=['ABHD6///LOC643635','HSPA1A///HSPA1B','CALM1///CALM2///CALM3','KHDC1///SPA17','LOC100133724///VDAC1',
'DALRD3///LOC100133719','LOC284889///MIF']
df_m[df_m['gene'].isin(lst)]
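
Entries such as `HSPA1A///HSPA1B` pack several gene symbols into one string, so an exact match on a single symbol will miss them. One possible way to handle this, sketched below with toy probe ids, is to split on `///` and expand each symbol into its own row.

```python
import pandas as pd

# Gene strings follow the '///' convention seen above; probe ids are placeholders.
toy_map = pd.DataFrame({"probe_id": ["p1_at", "p2_at", "p3_at"],
                        "gene": ["HSPA1A///HSPA1B", "CALM1///CALM2///CALM3", "MYC"]})
# Split multi-symbol strings into lists, then give each symbol its own row.
long_map = (toy_map.assign(gene=toy_map["gene"].str.split("///"))
                   .explode("gene")
                   .reset_index(drop=True))
print(long_map)
```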

In [None]:
# print('\n'.join(df_q.merge(df_m, left_on='query', right_on='gene',how='left')['probe_id'].astype(str).tolist()))

In [None]:
pd.set_option('display.max_rows', None)

df_q.merge(df_m, left_on='query', right_on='gene',how='left')[['query','probe_id']]

In [None]:
df_q.merge(df_m, left_on='query', right_on='gene',how='left')
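
To see which queried symbols found no probe in a left merge like the one above, the `indicator` option of `merge` can flag unmatched rows. The sketch below uses toy frames; the column names mirror `df_q` and `df_m`, but the contents are invented.

```python
import pandas as pd

queries = pd.DataFrame({"query": ["MYC", "TP53", "NOTAGENE"]})
toy_map = pd.DataFrame({"probe_id": ["p1_at", "p2_at"], "gene": ["MYC", "TP53"]})

# indicator=True adds a _merge column with left_only / both / right_only.
matched = queries.merge(toy_map, left_on="query", right_on="gene",
                        how="left", indicator=True)
unmatched_queries = matched.loc[matched["_merge"] == "left_only", "query"].tolist()
print(unmatched_queries)
```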