This notebook addresses multivariate analysis of data covering multiple spindle sizes. All images were passed through the pipeline reported in this paper. Here we focus on demonstrating how one can quickly pick the most interesting variables from a multivariate data set. To run the notebook with your own input, just substitute the respective paths.
     
Interesting questions:   
* How do the different spindle types cluster together (hierarchically, as suggested by Steve Altschuler and Lani Wu)?
* How well can these multiple classes be classified?


####Table of contents:   
1. Data folder.  
    * combine data sets
2. Feature description and pre-processing    
3. Feature selection and engineering    
4. Predictive models    
5. Unsupervised models    


Non-standard columns: **'Meta'** columns, **'locator'** columns, **'NA'** columns, **Euclidean distance**, **Zernike polynomials**. 
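As a rough illustration of how some of these column groups could be picked out by name, here is a minimal sketch. It assumes that 'Meta' columns carry the `Metadata_` prefix, that 'locator' columns contain `_Location_`, and that 'NA' columns are those with no non-missing values; these mappings are guesses based on the column names shown later in this notebook. The helper `split_nonstandard_columns` and the column `Feature_A` are hypothetical.

```python
import numpy as np
import pandas as pd

def split_nonstandard_columns(df):
    """Group column names by the naming patterns assumed above."""
    meta = [c for c in df.columns if c.startswith("Metadata_")]
    locator = [c for c in df.columns if "_Location_" in c]
    # A column counts as 'NA' here only if every value is missing.
    all_na = [c for c in df.columns if df[c].isna().all()]
    return {"meta": meta, "locator": locator, "all_na": all_na}

# Toy frame: only the column naming mirrors the real export.
toy = pd.DataFrame({
    "Metadata_Type": ["STG", "CSF"],
    "Metadata_FileLocation": [np.nan, np.nan],
    "Mean_FilteredChromatin_Location_Center_X": [10.5, 12.0],
    "Feature_A": [100.0, 150.0],  # hypothetical measurement column
})
groups = split_nonstandard_columns(toy)
print(groups)
```

The three groups may overlap (here `Metadata_FileLocation` is both a 'Meta' and an 'NA' column), so a drop list built from them should be de-duplicated.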

#I. Data folder

The final data set is built by **combining** the complete data set with the contents of STG8.

####Data set 1 (complete).

In [14]:
!ls ../../A_PAPER2_PIPELINE

Colors.txt                     STG8
Complete Dataset               STG8_CSF_Mixing
Partial Datasets               TPX2
Presentation:Figures:etc       TPX2_feature_selection.numbers


In [15]:
!ls ../../A_PAPER2_PIPELINE/Complete\ Dataset/Dataset

DefaultDB.db                 MyExpt_FilteredSpindles.csv
DefaultDB_MyExpt.properties  MyExpt_Image.csv
MyExpt_Chromatin.csv         MyExpt_PreSpindles.csv
MyExpt_Experiment.csv        MyExpt_Spindles.csv
MyExpt_FilteredChromatin.csv


In [16]:
!head -n3 ../../A_PAPER2_PIPELINE/Complete\ Dataset/Dataset/MyExpt_FilteredSpindles.csv | awk 'BEGIN{FS=","}{print NF}'

203
203
203


####Data set 2 (complete).

In [17]:
!ls ../../A_PAPER2_PIPELINE/STG8/CoO05_Stream_STG8/

DefaultDB.db                 MyExpt_FilteredChromatin.csv
DefaultDB_MyExpt.properties  MyExpt_FilteredSpindles.csv
DefaultOUT.h5                MyExpt_Image.csv
MyExpt_Chromatin.csv         MyExpt_PreSpindles.csv
MyExpt_Experiment.csv        MyExpt_Spindles.csv


In [18]:
!head -n3 ../../A_PAPER2_PIPELINE/STG8/CoO05_Stream_STG8/MyExpt_FilteredSpindles.csv | awk 'BEGIN{FS=","}{print NF}'

203
203
203
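The `awk` checks above only confirm that both files have 203 fields per line; they do not show that the column names match. A minimal sketch of a name-level check follows. The helper `header_diff` and the temporary toy files are hypothetical stand-ins for the two `MyExpt_FilteredSpindles.csv` exports.

```python
import os
import tempfile
import pandas as pd

def header_diff(path_a, path_b):
    """Return the column names unique to each file, reading only the header row."""
    cols_a = pd.read_csv(path_a, nrows=0).columns
    cols_b = pd.read_csv(path_b, nrows=0).columns
    return sorted(set(cols_a) - set(cols_b)), sorted(set(cols_b) - set(cols_a))

# Toy files with deliberately different headers.
tmp = tempfile.mkdtemp()
path_a = os.path.join(tmp, "a.csv")
path_b = os.path.join(tmp, "b.csv")
with open(path_a, "w") as fh:
    fh.write("ImageNumber,ObjectNumber,Metadata_Type\n1,1,STG\n")
with open(path_b, "w") as fh:
    fh.write("ImageNumber,ObjectNumber,Metadata_Set\n1,1,0\n")

only_a, only_b = header_diff(path_a, path_b)
print(only_a, only_b)
```

Two empty lists would mean the files share the same set of column names, so concatenating them should not introduce NaN-filled columns.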


#Building final dataset

We will heavily leverage the analysis and code written for the TPX2 add-in and mixing experiments. 

In [19]:
import pandas as pd
import numpy as np
import sklearn as skl
import matplotlib.pyplot as plt
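# 'display.mpl_style' is no longer a pandas option; matplotlib's 'ggplot' style is used as a close substitute (an assumption)
plt.style.use('ggplot')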
import re
from matplotlib.backends.backend_pdf import PdfPages

%load_ext rpy2.ipython
%matplotlib inline

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [20]:
#!ls ../../A_PAPER2_PIPELINE/Complete\ Dataset/Trop/Colombo/CoO05_Stream

In [21]:
df_complete=pd.read_csv('../../A_PAPER2_PIPELINE/Complete Dataset/Dataset/MyExpt_FilteredSpindles.csv')
df_STG8=pd.read_csv('../../A_PAPER2_PIPELINE/STG8/CoO05_Stream_STG8/MyExpt_FilteredSpindles.csv')
df_trop=pd.read_csv('../../A_PAPER2_PIPELINE/Complete Dataset/Trop/Colombo/CoO05_Stream/MyExpt_FilteredSpindles.csv')

1. **STG8** has both mixed and 'pure' spindles. Removing mixed for now.

In [22]:
np.sum(df_STG8.Metadata_Type == 'STG')

2087

In [10]:
np.sum(df_complete.Metadata_Type == 'STG')

1888

In [11]:
df_complete[df_complete.Metadata_Type == 'STG'].groupby('Metadata_Treatment').count()

Metadata_Treatment,ImageNumber,ObjectNumber,Metadata_Experiment,Metadata_Experimenter,Metadata_FileLocation,Metadata_Frame,Metadata_Series,Metadata_Set,Metadata_Type,Metadata_cvsp,...,Mean_FilteredChromatin_Location_CenterMassIntensity_Y_Rhodamine,Mean_FilteredChromatin_Location_Center_X,Mean_FilteredChromatin_Location_Center_Y,Mean_FilteredChromatin_Location_MaxIntensity_X_DNA,Mean_FilteredChromatin_Location_MaxIntensity_X_Rhodamine,Mean_FilteredChromatin_Location_MaxIntensity_Y_DNA,Mean_FilteredChromatin_Location_MaxIntensity_Y_Rhodamine,Mean_FilteredChromatin_Number_Object_Number,Number_Object_Number,Parent_Spindles
105,24,24,24,24,0,24,24,24,24,24,...,24,24,24,24,24,24,24,24,24,24
130,1,1,1,1,0,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
150,42,42,42,42,0,42,42,42,42,42,...,42,42,42,42,42,42,42,42,42,42
160,29,29,29,29,0,29,29,29,29,29,...,29,29,29,29,29,29,29,29,29,29
200,27,27,27,27,0,27,27,27,27,27,...,27,27,27,27,27,27,27,27,27,27
800,1316,1316,1316,1316,0,1316,1316,1316,1316,1316,...,1316,1316,1316,1316,1316,1316,1316,1316,1316,1316
850,90,90,90,90,0,90,90,90,90,90,...,90,90,90,90,90,90,90,90,90,90
875,142,142,142,142,0,142,142,142,142,142,...,142,142,142,142,142,142,142,142,142,142
890,148,148,148,148,0,148,148,148,148,148,...,148,148,148,148,148,148,148,148,148,148
895,32,32,32,32,0,32,32,32,32,32,...,32,32,32,32,32,32,32,32,32,32


In [26]:
df_complete.groupby(['Metadata_Treatment']).count()

Metadata_Treatment,ImageNumber,ObjectNumber,Metadata_Experiment,Metadata_Experimenter,Metadata_FileLocation,Metadata_Frame,Metadata_Series,Metadata_Set,Metadata_Type,Metadata_cvsp,...,Mean_FilteredChromatin_Location_CenterMassIntensity_Y_Rhodamine,Mean_FilteredChromatin_Location_Center_X,Mean_FilteredChromatin_Location_Center_Y,Mean_FilteredChromatin_Location_MaxIntensity_X_DNA,Mean_FilteredChromatin_Location_MaxIntensity_X_Rhodamine,Mean_FilteredChromatin_Location_MaxIntensity_Y_DNA,Mean_FilteredChromatin_Location_MaxIntensity_Y_Rhodamine,Mean_FilteredChromatin_Number_Object_Number,Number_Object_Number,Parent_Spindles
0,14496,14496,14496,14496,0,14496,14496,14496,14496,14496,...,14496,14496,14496,14496,14496,14496,14496,14496,14496,14496
1,258,258,258,258,0,258,258,258,258,258,...,258,258,258,258,258,258,258,258,258,258
2,203,203,203,203,0,203,203,203,203,203,...,203,203,203,203,203,203,203,203,203,203
3,307,307,307,307,0,307,307,307,307,307,...,307,307,307,307,307,307,307,307,307,307
4,482,482,482,482,0,482,482,482,482,482,...,482,482,482,482,482,482,482,482,482,482
5,807,807,807,807,0,807,807,807,807,807,...,807,807,807,807,807,807,807,807,807,807
6,167,167,167,167,0,167,167,167,167,167,...,167,167,167,167,167,167,167,167,167,167
8,209,209,209,209,0,209,209,209,209,209,...,209,209,209,209,209,209,209,209,209,209
80,289,289,289,289,0,289,289,289,289,289,...,289,289,289,289,289,289,289,289,289,289
84,509,509,509,509,0,509,509,509,509,509,...,509,509,509,509,509,509,509,509,509,509


In [29]:
df_complete[df_complete.Metadata_Treatment==349].groupby(['Metadata_Experiment']).count()

Metadata_Experiment,ImageNumber,ObjectNumber,Metadata_Experimenter,Metadata_FileLocation,Metadata_Frame,Metadata_Series,Metadata_Set,Metadata_Treatment,Metadata_Type,Metadata_cvsp,...,Mean_FilteredChromatin_Location_CenterMassIntensity_Y_Rhodamine,Mean_FilteredChromatin_Location_Center_X,Mean_FilteredChromatin_Location_Center_Y,Mean_FilteredChromatin_Location_MaxIntensity_X_DNA,Mean_FilteredChromatin_Location_MaxIntensity_X_Rhodamine,Mean_FilteredChromatin_Location_MaxIntensity_Y_DNA,Mean_FilteredChromatin_Location_MaxIntensity_Y_Rhodamine,Mean_FilteredChromatin_Number_Object_Number,Number_Object_Number,Parent_Spindles
Exp79,14,14,14,0,14,14,14,14,14,14,...,14,14,14,14,14,14,14,14,14,14
Exp80,7,7,7,0,7,7,7,7,7,7,...,7,7,7,7,7,7,7,7,7,7


In [75]:
#x = list(df_complete[df_complete.Metadata_Type == 'STG'].groupby('Metadata_Experiment').count().iloc[:,1].index)
#y = list(df_complete[df_complete.Metadata_Type == 'STG'].groupby('Metadata_Experiment').count().iloc[:,1])

In [76]:
##https://plot.ly/matplotlib/bar-charts/
##http://matplotlib.org/1.4.2/examples/api/barchart_demo.html
#fig, ax = plt.subplots()

#ax.bar(range(len(x)), y)
##plt.bar(range(len(x)), y)
#ax.set_xticks(np.arange(len(x))+0.35)
#ax.set_xticklabels(x)

In [78]:
df_complete[df_complete.Metadata_Type == 'STG'].groupby('Metadata_Experiment').count().iloc[:,1]

Metadata_Experiment
Exp00    123
Exp01     41
Exp02      3
Exp05     71
Exp06    337
Exp08    384
Exp09      6
Exp11     89
Exp13     54
Exp14    632
Exp15    148
Name: ObjectNumber, dtype: int64

In [40]:
df_STG8[df_STG8.Metadata_Type == 'STG'].groupby('Metadata_Experiment').count().iloc[:,1]

Metadata_Experiment
Exp01     26
Exp02      3
Exp05     65
Exp06    268
Exp08    300
Exp09      6
Exp11     75
Exp13     50
Exp14    421
Exp15    104
Exp16    769
Name: ObjectNumber, dtype: int64
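The two listings above are easier to compare side by side: Exp00 appears only in the complete data set and Exp16 only in STG8. A minimal sketch of such a comparison on toy data follows; the helper `counts_by_experiment` is hypothetical, and missing experiments are filled with 0 as an assumption about how absent groups should be read.

```python
import pandas as pd

def counts_by_experiment(df, type_value="STG"):
    """Number of rows per experiment for one spindle type."""
    subset = df[df["Metadata_Type"] == type_value]
    return subset.groupby("Metadata_Experiment").size()

# Toy stand-ins for df_complete and df_STG8.
old = pd.DataFrame({
    "Metadata_Type": ["STG", "STG", "STG", "CSF"],
    "Metadata_Experiment": ["Exp00", "Exp01", "Exp01", "Exp01"],
})
new = pd.DataFrame({
    "Metadata_Type": ["STG", "STG", "STG"],
    "Metadata_Experiment": ["Exp01", "Exp16", "Exp16"],
})

# Align the two count series on experiment name; absent experiments become 0.
comparison = pd.concat(
    {"complete": counts_by_experiment(old), "STG8": counts_by_experiment(new)},
    axis=1,
).fillna(0).astype(int)
comparison["difference"] = comparison["STG8"] - comparison["complete"]
print(comparison.sort_index())
```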

Replace the STG rows in the complete data set with the rows from STG8 (as advised by AWG).

In [80]:
df_complete[df_complete.Metadata_Type == 'STG'].shape

(1888, 203)

In [115]:
df_complete[df_complete.Metadata_Type == 'STG'].Metadata_Treatment.unique()

array([800, 850, 875, 890, 895, 899])

In [None]:
df_complete[df_complete.Metadata_Type == 'STG'].Metadata_Expe.unique()

In [82]:
df_complete.drop(df_complete[df_complete.Metadata_Type == 'STG'].index, axis = 0, inplace =True)

In [86]:
df_complete=pd.concat([df_complete, df_STG8])

In [87]:
df_complete[df_complete.Metadata_Type == 'STG'].shape

(2087, 203)
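The drop-then-concatenate steps above can be wrapped in a single function that does not modify its inputs. This is a sketch on toy data, not the notebook's own code: `replace_type_rows` is a hypothetical helper, and `ignore_index=True` is an addition that the cells above do not use.

```python
import pandas as pd

def replace_type_rows(base, replacement, type_value="STG"):
    """Drop rows of one type from `base`, then append all rows of `replacement`."""
    kept = base[base["Metadata_Type"] != type_value]
    # ignore_index=True (an addition) renumbers rows so index labels stay unique.
    return pd.concat([kept, replacement], ignore_index=True)

base = pd.DataFrame({
    "Metadata_Type": ["STG", "CSF", "STG", "BDS"],
    "value": [1, 2, 3, 4],
})
replacement = pd.DataFrame({
    "Metadata_Type": ["STG", "STG", "STG", "CSF"],
    "value": [10, 11, 12, 13],
})
combined = replace_type_rows(base, replacement)
print(combined)
```

As in the cells above, every row of the replacement frame is appended, including its non-STG rows. After the swap, the STG count in the result should equal the STG count of the replacement frame, which is the check the `shape` call performs.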

In [88]:
df_complete.shape, df_STG8.shape, df_trop.shape

((25269, 203), (2836, 203), (3455, 203))

Add trop data.

In [10]:
df_complete.Metadata_Type.unique(), df_complete.Metadata_Series.unique(), df_complete.Metadata_Treatment.unique()

(array(['STG', 'BDS', 'CSF', 'CYC'], dtype=object),
 array([0]),
 array([105, 130, 150, 160, 200,   0, 800, 420,   4,   5,   8, 850, 875,
        890,   1,   2,   3, 895, 899,  87,   6,  86,  80,  84, 131, 171,
        302, 201, 203, 205, 501, 503, 505, 349, 512]))

In [27]:
np.unique(df_trop.Metadata_Treatment), df_trop.Metadata_Type.unique()

(array([ 0,  1,  3,  4,  5,  6,  7,  8,  9, 10, 11]),
 array(['CYC'], dtype=object))

In [24]:
np.unique(df_trop.Metadata_Treatment)

array([ 0,  1,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [94]:
df_trop.shape

(3455, 203)

In [95]:
df_trop['Metadata_Type'] = 'TRO'  # a scalar is broadcast to every row

In [96]:
df_trop.Metadata_Type.unique()

array(['TRO'], dtype=object)

In [97]:
df_final = pd.concat([df_complete, df_trop])

In [98]:
df_final.shape

(28724, 203)
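The row count above is the sum of the two inputs (25269 + 3455 = 28724). The trop rows arrive typed as `CYC` and reuse treatment codes such as 0, 1 and 3, so relabelling them is what keeps them distinguishable after the merge. A minimal sketch of the same step with explicit checks follows; `append_with_label` is a hypothetical helper and `ignore_index=True` is an addition not used in the cells above.

```python
import pandas as pd

def append_with_label(base, extra, label, type_col="Metadata_Type"):
    """Relabel every row of `extra`, then stack it under `base`."""
    if list(base.columns) != list(extra.columns):
        raise ValueError("column sets differ; concat would introduce NaN columns")
    relabeled = extra.assign(**{type_col: label})  # returns a copy, `extra` is untouched
    combined = pd.concat([base, relabeled], ignore_index=True)
    # Row counts must add up exactly.
    assert len(combined) == len(base) + len(extra)
    return combined

base = pd.DataFrame({"Metadata_Type": ["STG", "CYC"], "Metadata_Treatment": [800, 0]})
extra = pd.DataFrame({"Metadata_Type": ["CYC", "CYC", "CYC"], "Metadata_Treatment": [0, 1, 3]})
final = append_with_label(base, extra, "TRO")
print(final)
```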

In [99]:
df_final.to_csv('../Data/types_assembled_df_CompleteStgTrop.csv')

**THE FINAL DATASET IS: ../Data/types_assembled_df_CompleteStgTrop.csv**
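One thing to keep in mind when reloading this file: `to_csv` writes the row index by default, so `read_csv` will return an extra `Unnamed: 0` column unless the index is skipped on write or consumed on read. A small sketch on toy data, using in-memory buffers instead of the real path:

```python
import io
import pandas as pd

# Concatenating without ignore_index leaves repeated index labels (0, 1, 0).
frame = pd.concat([
    pd.DataFrame({"Metadata_Type": ["STG", "CSF"]}),
    pd.DataFrame({"Metadata_Type": ["TRO"]}),
])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False keeps the file limited to the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

If the file has already been written with the index, passing `index_col=0` to `read_csv` is an alternative way to avoid the extra column.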

The meta column file was created in TPX2.ipynb. 

#Dropping columns

In [112]:
to_drop = list(pd.read_table('../Data/columns_to_drop.txt',  header = None).iloc[:, 1])

In [29]:
#meta=pd.read_table('../Data/columns_meta.txt',header=0,names=['meta_columns'])

The **target** column should be **Metadata_Treatment** or **Metadata_Type**.
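A minimal sketch of how the drop list and the target column might be applied together. The layout of `columns_to_drop.txt` is assumed here to be tab-separated with no header and the column name in the second field (consistent with the `iloc[:, 1]` call above). The in-memory file, the column `Feature_A`, and the entry `Not_In_Frame` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for ../Data/columns_to_drop.txt.
drop_file = io.StringIO("0\tImageNumber\n1\tObjectNumber\n2\tNot_In_Frame\n")
to_drop = list(pd.read_table(drop_file, header=None).iloc[:, 1])

df = pd.DataFrame({
    "ImageNumber": [1, 2],
    "ObjectNumber": [1, 1],
    "Metadata_Type": ["STG", "CSF"],
    "Feature_A": [0.5, 0.7],  # hypothetical measurement column
})

target_col = "Metadata_Type"  # or "Metadata_Treatment"
# Never drop the target, and ignore names that are absent from the frame.
features = df.drop(columns=[c for c in to_drop if c != target_col], errors="ignore")
y = features.pop(target_col)  # removes the target from `features` and returns it
print(list(features.columns), list(y))
```

Using `errors="ignore"` lets the same drop list be reused across data sets whose columns differ slightly, at the cost of silently skipping misspelled names.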

In [113]:
np.unique(df_STG8[['Metadata_Treatment']])

array([  0, 420, 800, 850, 875, 890, 895, 899])

In [15]:
np.unique(df_STG8[['Metadata_Type']])

array(['CSF', 'STG'], dtype=object)

In [16]:
np.unique(df_STG8[['Metadata_Experimenter']])

array(['MEC'], dtype=object)

In [17]:
np.unique(df_complete[['Metadata_Treatment']])

array([  0,   1,   2,   3,   4,   5,   6,   8,  80,  84,  86,  87, 105,
       130, 131, 150, 160, 171, 200, 201, 203, 205, 302, 349, 420, 501,
       503, 505, 512, 800, 850, 875, 890, 895, 899])

In [18]:
np.unique(df_complete[['Metadata_Type']])

array(['BDS', 'CSF', 'CYC', 'STG'], dtype=object)

In [None]:
np.unique(df_complete[df_complete['Metadata_Type']=='STG']['Metadata_Experimenter'])

In [None]:
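# combine both conditions in one mask; chaining [...][...] applies a mask built on the full frame to the already filtered one
df_complete[(df_complete['Metadata_Type']=='STG') & (df_complete['Metadata_Experimenter']=='MEC')].shape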

In [None]:
df_STG8.groupby(['Metadata_Type','Metadata_Treatment','Metadata_Experimenter','Metadata_Experiment']).count()

In [None]:
df_complete[df_complete['Metadata_Type']=='STG'].groupby(['Metadata_Type','Metadata_Treatment','Metadata_Experimenter','Metadata_Experiment']).count()

In [None]:
np.sum(df_complete[df_complete['Metadata_Type']=='STG'].groupby(['Metadata_Type','Metadata_Experimenter','Metadata_Experiment']).count()['ImageNumber'])

In [None]:
np.sum(df_STG8[df_STG8['Metadata_Type']=='STG'].groupby(['Metadata_Type','Metadata_Experimenter','Metadata_Experiment']).count()['ImageNumber'])