# CytoTable mise en place

This notebook gives a quick demonstration of CytoTable: it converts a set of CellProfiler CSV files into a single Parquet file, then inspects the result to show the basics of using the project.

The name of the notebook comes from the French _mise en place_:
> "Mise en place (French pronunciation: [mi zɑ̃ ˈplas]) is a French culinary phrase which means "putting in place"
> or "gather". It refers to the setup required before cooking, and is often used in professional kitchens to
> refer to organizing and arranging the ingredients ..."
> - [Wikipedia](https://en.wikipedia.org/wiki/Mise_en_place)

In [1]:
import pathlib
from collections import Counter

import pyarrow.parquet as pq

import cytotable

# set up variables for use throughout the notebook
source_path = "../../../tests/data/cellprofiler/examplehuman"
dest_path = "./example.parquet"

In [2]:
# remove the dest_path if it's present
if pathlib.Path(dest_path).is_file():
    pathlib.Path(dest_path).unlink()
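
The cleanup step above can also be written without the explicit existence check. The following is a minimal sketch, assuming Python 3.8 or later (where `Path.unlink` accepts `missing_ok`). The temporary directory and the `demo_path` name are illustrative only and are not part of this notebook's data.

```python
import pathlib
import tempfile

# work in a throwaway directory so the sketch never touches real data
with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical destination file, standing in for dest_path
    demo_path = pathlib.Path(tmp_dir) / "example.parquet"
    demo_path.write_bytes(b"placeholder")

    # missing_ok=True makes the removal idempotent:
    # the second call does not raise even though the file is already gone
    demo_path.unlink(missing_ok=True)
    demo_path.unlink(missing_ok=True)

    removed = not demo_path.exists()

print(removed)
```

Note that `unlink` only removes files, so this form suits a single-file destination like the one used here.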

In [3]:
# show the files we will use as source data with CytoTable
list(pathlib.Path(source_path).glob("*.csv"))

[PosixPath('../../../tests/data/cellprofiler/examplehuman/Experiment.csv'),
 PosixPath('../../../tests/data/cellprofiler/examplehuman/PH3.csv'),
 PosixPath('../../../tests/data/cellprofiler/examplehuman/Cytoplasm.csv'),
 PosixPath('../../../tests/data/cellprofiler/examplehuman/Image.csv'),
 PosixPath('../../../tests/data/cellprofiler/examplehuman/Nuclei.csv'),
 PosixPath('../../../tests/data/cellprofiler/examplehuman/Cells.csv')]

In [4]:
%%time

# run cytotable convert
result = cytotable.convert(
    source_path=source_path,
    dest_path=dest_path,
    # specify a destination data format type
    dest_datatype="parquet",
    # specify a preset which enables quick use of common input file formats
    preset="cellprofiler_csv",
)
result.name

CPU times: user 327 ms, sys: 201 ms, total: 528 ms
Wall time: 22.4 s


'example.parquet'

In [5]:
# show the table head using pandas
pq.read_table(source=result).to_pandas().head()

| | Metadata_ImageNumber | Metadata_Cells_Parent_Nuclei | Metadata_Cytoplasm_Parent_Cells | Metadata_Cytoplasm_Parent_Nuclei | Metadata_ObjectNumber | Image_FileName_DNA | Image_FileName_OrigOverlay | Image_FileName_PH3 | Image_FileName_cellbody | Cytoplasm_AreaShape_Area | ... | Nuclei_Location_Center_X | Nuclei_Location_Center_Y | Nuclei_Location_Center_Z | Nuclei_Location_MaxIntensity_X_DNA | Nuclei_Location_MaxIntensity_X_PH3 | Nuclei_Location_MaxIntensity_Y_DNA | Nuclei_Location_MaxIntensity_Y_PH3 | Nuclei_Location_MaxIntensity_Z_DNA | Nuclei_Location_MaxIntensity_Z_PH3 | Nuclei_Number_Object_Number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | AS_09125_050116030001_D03f00d0.tif | AS_09125_050116030001_D03f00d0_Overlay.png | AS_09125_050116030001_D03f00d1.tif | AS_09125_050116030001_D03f00d2.tif | 288 | ... | 477.099237 | 7.580153 | 0 | 477.0 | 478.0 | 8.0 | 13.0 | 0.0 | 0.0 | 1 |
| 1 | 1 | 2 | 2 | 2 | 2 | AS_09125_050116030001_D03f00d0.tif | AS_09125_050116030001_D03f00d0_Overlay.png | AS_09125_050116030001_D03f00d1.tif | AS_09125_050116030001_D03f00d2.tif | 256 | ... | 495.75 | 11.098684 | 0 | 495.0 | 502.0 | 9.0 | 14.0 | 0.0 | 0.0 | 2 |
| 2 | 1 | 3 | 3 | 3 | 3 | AS_09125_050116030001_D03f00d0.tif | AS_09125_050116030001_D03f00d0_Overlay.png | AS_09125_050116030001_D03f00d1.tif | AS_09125_050116030001_D03f00d2.tif | 52 | ... | 438.959184 | 11.37415 | 0 | 440.0 | 439.0 | 11.0 | 16.0 | 0.0 | 0.0 | 3 |
| 3 | 1 | 4 | 4 | 4 | 4 | AS_09125_050116030001_D03f00d0.tif | AS_09125_050116030001_D03f00d0_Overlay.png | AS_09125_050116030001_D03f00d1.tif | AS_09125_050116030001_D03f00d2.tif | 466 | ... | 80.459184 | 11.163265 | 0 | 80.0 | 81.0 | 13.0 | 10.0 | 0.0 | 0.0 | 4 |
| 4 | 1 | 5 | 5 | 5 | 5 | AS_09125_050116030001_D03f00d0.tif | AS_09125_050116030001_D03f00d0_Overlay.png | AS_09125_050116030001_D03f00d1.tif | AS_09125_050116030001_D03f00d2.tif | 296 | ... | 58.423077 | 15.509615 | 0 | 62.0 | 52.0 | 14.0 | 15.0 | 0.0 | 0.0 | 5 |


In [6]:
# show metadata for the result file
pq.read_metadata(result)

<pyarrow._parquet.FileMetaData object at 0x17e23fab0>
  created_by: parquet-cpp-arrow version 20.0.0
  num_columns: 312
  num_rows: 289
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 87762

In [7]:
# show schema metadata which includes CytoTable information
# note: this information will travel with the file.
pq.read_schema(result).metadata

{b'data-producer': b'https://github.com/cytomining/CytoTable',
 b'data-producer-version': b'0.0.15.post15.dev0+c2a924c'}

In [8]:
# show schema column name summaries
print("Column name prefix counts:")
dict(Counter(w.split("_", 1)[0] for w in pq.read_schema(result).names))

Column name prefix counts:


{'Metadata': 5, 'Image': 4, 'Cytoplasm': 99, 'Cells': 101, 'Nuclei': 103}
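
The prefix summary above relies on splitting each column name at its first underscore and counting the leading part. A small standalone sketch of the same technique follows, using a hand-picked subset of column names rather than the full schema:

```python
from collections import Counter

# illustrative subset of column names, not the full 312-column schema
column_names = [
    "Metadata_ImageNumber",
    "Metadata_ObjectNumber",
    "Image_FileName_DNA",
    "Cytoplasm_AreaShape_Area",
    "Cells_AreaShape_Area",
    "Nuclei_Location_Center_X",
    "Nuclei_Location_Center_Y",
]

# split once on the first underscore and keep the leading prefix
prefix_counts = dict(Counter(name.split("_", 1)[0] for name in column_names))

print(prefix_counts)
# {'Metadata': 2, 'Image': 1, 'Cytoplasm': 1, 'Cells': 1, 'Nuclei': 2}
```

In the notebook output, the five prefix counts sum to 312, which agrees with the `num_columns` value reported by `pq.read_metadata`.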

In [9]:
# show full schema details
pq.read_schema(result)

Metadata_ImageNumber: int64
Metadata_Cells_Parent_Nuclei: int64
Metadata_Cytoplasm_Parent_Cells: int64
Metadata_Cytoplasm_Parent_Nuclei: int64
Metadata_ObjectNumber: int64
Image_FileName_DNA: string
Image_FileName_OrigOverlay: string
Image_FileName_PH3: string
Image_FileName_cellbody: string
Cytoplasm_AreaShape_Area: int64
Cytoplasm_AreaShape_BoundingBoxArea: int64
Cytoplasm_AreaShape_BoundingBoxMaximum_X: int64
Cytoplasm_AreaShape_BoundingBoxMaximum_Y: int64
Cytoplasm_AreaShape_BoundingBoxMinimum_X: int64
Cytoplasm_AreaShape_BoundingBoxMinimum_Y: int64
Cytoplasm_AreaShape_Center_X: double
Cytoplasm_AreaShape_Center_Y: double
Cytoplasm_AreaShape_Compactness: double
Cytoplasm_AreaShape_Eccentricity: double
Cytoplasm_AreaShape_EquivalentDiameter: double
Cytoplasm_AreaShape_EulerNumber: int64
Cytoplasm_AreaShape_Extent: double
Cytoplasm_AreaShape_FormFactor: double
Cytoplasm_AreaShape_MajorAxisLength: double
Cytoplasm_AreaShape_MaxFeretDiameter: double
Cytoplasm_AreaShape_MaximumRadius: d