# Pycytominer General Usage Walkthrough!

Welcome to this walkthrough, where we will guide you through the process of extracting single-cell morphology features using the [`pycytominer`](https://github.com/cytomining/pycytominer) API.

For this walkthrough, we will be working with the NF1-Schwann cell morphology dataset.
If you would like more information about this dataset, you can refer to this [repository](https://github.com/WayScience/Benchmarking_NF1_data).

From the mentioned repo, we specifically used this [dataset](https://github.com/WayScience/Benchmarking_NF1_data/tree/main/4_processing_features/data/Plate2/CellProfiler) and the associated [metadata](https://github.com/WayScience/Benchmarking_NF1_data/tree/main/3_extracting_features/metadata) to generate the walkthrough. 


Let's get started with the walkthrough below!

In [1]:
import pathlib

import pandas as pd

# pycytominer imports
from pycytominer.cyto_utils.cells import SingleCells
from pycytominer import annotate, normalize, feature_select

# ignore warnings
import warnings

warnings.filterwarnings("ignore")

## About the inputs


In this section, we will set up the expected input and output paths that will be generated throughout this walkthrough. Let's take a look at the explanation of these inputs and outputs.

For this workflow, we have two main inputs:

- **plate_data**: This contains the quantified single-cell morphology features that we'll be working with.
- **plate_map**: This contains additional information about the cells, providing valuable context for our single-cell morphology dataset.

Now, let's explore the outputs generated in this workflow. In this walkthrough, we will be generating four profiles:

- **sc_profile_path**: This refers to the single-cell morphology profile that will be generated.
- **anno_profile_path**: This corresponds to the annotated single-cell morphology profile.
- **norm_profile_path**: This represents the normalized single-cell morphology profile.
- **feat_profile_path**: Lastly, this refers to the selected features from the single-cell morphology profile.

These profiles will serve as important outputs that will help us analyze and interpret the single-cell morphology data effectively. Now that we have a clear understanding of the inputs and outputs, let's proceed further in our walkthrough.

In [2]:
# Setting file paths
data_dir = pathlib.Path("./data/").resolve(strict=True)
metadata_dir = (data_dir / "metadata").resolve(strict=True)
out_dir = pathlib.Path("results")
out_dir.mkdir(exist_ok=True)

# input file paths
plate_data = pathlib.Path("./data/nf1_data.sqlite").resolve(strict=True)
plate_map = (metadata_dir / "platemap_NF1_CP.csv").resolve(strict=True)

# setting output paths
sc_profile_path = out_dir / "nf1_single_cell_profile.csv.gz"
anno_profile_path = out_dir / "nf1_annotated_profile.csv.gz"
norm_profile_path = out_dir / "nf1_normalized_profile.csv.gz"
feat_profile_path = out_dir / "nf1_features_profile.csv.gz"
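
The input paths above are resolved with `strict=True`, so a missing file fails immediately rather than later in the workflow. The sketch below illustrates that behavior with the standard library only; the temporary directory and file names are placeholders for illustration, not part of the NF1 dataset.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = pathlib.Path(tmp)

    # A file that exists resolves to an absolute path without error.
    existing = tmp_dir / "nf1_data.sqlite"
    existing.touch()
    resolved = existing.resolve(strict=True)

    # A file that does not exist raises FileNotFoundError when strict=True.
    missing = tmp_dir / "does_not_exist.sqlite"
    try:
        missing.resolve(strict=True)
        raised = False
    except FileNotFoundError:
        raised = True

print(resolved.name, raised)
```

Output paths are not resolved strictly, since those files do not exist until the walkthrough writes them.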


## Generating Merged Single-cell Morphology Dataset

In this section of the walkthrough, our goal is to load the NF1 dataset and create a merged single-cell morphology dataset.

Currently, the NF1 dataset is stored in `sqlite` format, where each table represents a different compartment, such as Image, Cell, Nucleus, and Cytoplasm.
To merge these tables, we will use the `SingleCells` class, which offers a range of functionality designed for single-cell analysis. You can find detailed documentation on this class [here](https://pycytominer.readthedocs.io/en/latest/pycytominer.cyto_utils.html#pycytominer.cyto_utils.cells.SingleCells).

However, for our purpose in this walkthrough, we will focus on using the SingleCells class to merge all the tables within the NF1 sqlite file into a merged single-cell morphology dataset.
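
Before merging, it can help to check which tables a SQLite file actually contains. The sketch below uses only Python's built-in `sqlite3` module on a toy in-memory database. The table names are an assumption chosen to mirror the ones used later in this walkthrough, and `list_tables` is a hypothetical helper, not part of `pycytominer`.

```python
import sqlite3


def list_tables(connection: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Toy in-memory database standing in for the real file (assumption: these
# table names mirror the ones used later in this walkthrough).
conn = sqlite3.connect(":memory:")
for table in ["Per_Image", "Per_Cells", "Per_Cytoplasm", "Per_Nuclei"]:
    conn.execute(f"CREATE TABLE {table} (ImageNumber INTEGER)")

tables = list_tables(conn)
print(tables)
conn.close()
```

To inspect the real dataset, connect to the path stored in `plate_data` instead of `":memory:"`.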

### Updating defaults
Before we proceed further, it is important to update the default parameters in the SingleCells class to accommodate the table name changes in our NF1 dataset.

Since the table names in our NF1 dataset differ from the default table names recognized by the SingleCells class, we need to make adjustments to ensure proper recognition of these table name changes.

In [3]:
# update compartment names and strata
strata = ["Image_Metadata_Well", "Image_Metadata_Plate"]
compartments = ["Per_Cells", "Per_Cytoplasm", "Per_Nuclei"]

# Updating linking columns for merging all compartments
linking_cols = {
    "Per_Cytoplasm": {
        "Per_Cells": "Cytoplasm_Parent_Cells",
        "Per_Nuclei": "Cytoplasm_Parent_Nuclei",
    },
    "Per_Cells": {"Per_Cytoplasm": "Cells_Number_Object_Number"},
    "Per_Nuclei": {"Per_Cytoplasm": "Nuclei_Number_Object_Number"},
}
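
To make the role of the linking columns more concrete, here is a conceptual sketch using plain `pandas` on toy tables. It is not how `pycytominer` implements the merge internally, and all values are made up. It only shows how a parent column in one compartment can be matched to the object number in another.

```python
import pandas as pd

# Toy compartment tables (hypothetical values) using the column names from
# the linking dictionary above.
cytoplasm = pd.DataFrame(
    {
        "ImageNumber": [1, 1],
        "Cytoplasm_Parent_Cells": [1, 2],
        "Cytoplasm_Parent_Nuclei": [3, 4],
        "Cytoplasm_AreaShape_Area": [32065.0, 14466.0],
    }
)
cells = pd.DataFrame(
    {
        "ImageNumber": [1, 1],
        "Cells_Number_Object_Number": [1, 2],
        "Cells_AreaShape_Area": [40000.0, 20000.0],
    }
)
nuclei = pd.DataFrame(
    {
        "ImageNumber": [1, 1],
        "Nuclei_Number_Object_Number": [3, 4],
        "Nuclei_AreaShape_Area": [900.0, 750.0],
    }
)

# Link cytoplasm -> cells, then -> nuclei, matching on the image number plus
# the parent/object-number pair for each compartment.
merged = cytoplasm.merge(
    cells,
    left_on=["ImageNumber", "Cytoplasm_Parent_Cells"],
    right_on=["ImageNumber", "Cells_Number_Object_Number"],
).merge(
    nuclei,
    left_on=["ImageNumber", "Cytoplasm_Parent_Nuclei"],
    right_on=["ImageNumber", "Nuclei_Number_Object_Number"],
)
print(merged.shape)
```

Each row of `merged` then describes one cell, with features from all three compartments side by side.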

Now that we have stored the updated parameters, we can use them as inputs for the `SingleCells` class to merge all the NF1 sqlite tables into a single consolidated dataset.

In [4]:
# setting up sqlite address
sqlite_address = f"sqlite:///{str(plate_data)}"

# loading single-cell morphology data into pycytominer's SingleCells object
single_cell_profile = SingleCells(
    sql_file=sqlite_address,
    compartments=compartments,
    compartment_linking_cols=linking_cols,
    image_table_name="Per_Image",
    strata=strata,
    merge_cols=["ImageNumber"],
    image_cols="ImageNumber",
    load_image_data=True,
)

# merging all sqlite tables into a single tabular dataset
# (no output file is passed here, so the merged DataFrame is returned
# and saved explicitly below)
sc_profile = single_cell_profile.merge_single_cells()

# saving single-cell morphology dataset
sc_profile.to_csv(sc_profile_path, compression="gzip")

# displaying dataset
sc_profile.head()

Unnamed: 0,Metadata_ImageNumber,Image_Metadata_Plate,Image_Metadata_Well,Cytoplasm_Number_Object_Number,Cytoplasm_AreaShape_Area,Cytoplasm_AreaShape_BoundingBoxArea,Cytoplasm_AreaShape_BoundingBoxMaximum_X,Cytoplasm_AreaShape_BoundingBoxMaximum_Y,Cytoplasm_AreaShape_BoundingBoxMinimum_X,Cytoplasm_AreaShape_BoundingBoxMinimum_Y,...,Nuclei_Texture_Variance_DAPI_3_02_256,Nuclei_Texture_Variance_DAPI_3_03_256,Nuclei_Texture_Variance_GFP_3_00_256,Nuclei_Texture_Variance_GFP_3_01_256,Nuclei_Texture_Variance_GFP_3_02_256,Nuclei_Texture_Variance_GFP_3_03_256,Nuclei_Texture_Variance_RFP_3_00_256,Nuclei_Texture_Variance_RFP_3_01_256,Nuclei_Texture_Variance_RFP_3_02_256,Nuclei_Texture_Variance_RFP_3_03_256
0,1,1,C6,1,32065.0,95944.0,1018.0,382.0,750.0,24.0,...,1393.410857,1331.583275,653.826838,618.063979,606.832257,590.114791,147.195839,144.355017,148.179465,148.875403
1,1,1,C6,2,14466.0,58032.0,688.0,337.0,440.0,103.0,...,1369.498111,1276.305513,332.941295,317.56745,321.873215,292.116754,60.632767,61.876198,65.202076,60.022847
2,1,1,C6,3,41368.0,139598.0,1201.0,503.0,888.0,57.0,...,1338.091947,1299.373271,432.829034,398.306003,401.091835,358.84984,74.837374,71.033793,80.523205,80.845266
3,1,1,C6,4,14564.0,52624.0,377.0,470.0,169.0,217.0,...,899.439956,874.837386,211.898029,189.348918,186.3333,188.292692,113.059608,113.194846,110.997393,109.83439
4,1,1,C6,5,27417.0,84084.0,760.0,550.0,487.0,242.0,...,1231.630414,1218.998954,306.13973,295.581509,310.469726,287.78839,496.084704,502.046808,490.259298,491.171009


Now that we have created our merged single-cell profile, let's move on to the next step: loading our `platemaps`. 

Platemaps provide us with additional information that is crucial for our analysis. They contain details such as well positions, genotypes, gene names, perturbation types, and more. In other words, platemaps serve as a valuable source of supplementary information for our single-cell morphology profile.

In [5]:
# loading plate map and display it
platemap_df = pd.read_csv(plate_map)
platemap_df.head(8)

Unnamed: 0,WellRow,WellCol,well_position,gene_name,genotype
0,C,6,C6,NF1,WT
1,C,7,C7,NF1,Null
2,D,6,D6,NF1,WT
3,D,7,D7,NF1,Null
4,E,6,E6,NF1,WT
5,E,7,E7,NF1,Null
6,F,6,F6,NF1,WT
7,F,7,F7,NF1,Null
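
Before annotating, it can be worth checking that the wells in the single-cell profile and the wells in the plate map agree. The sketch below is an optional sanity check on toy data, not a step that `pycytominer` requires. The well lists are hypothetical and only borrow the column names `Image_Metadata_Well` and `well_position` from the tables above.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the two well columns being joined.
profile_wells = pd.Series(["C6", "C6", "C7", "D6", "G9"], name="Image_Metadata_Well")
platemap_wells = pd.Series(["C6", "C7", "D6", "D7"], name="well_position")

# Wells that have cells but no plate map entry cannot be matched to metadata.
missing_from_platemap = sorted(set(profile_wells) - set(platemap_wells))

# Wells listed in the plate map that contributed no cells.
wells_without_cells = sorted(set(platemap_wells) - set(profile_wells))

print("Missing from plate map:", missing_from_platemap)
print("Plate map wells without cells:", wells_without_cells)
```

With real data, the two series would come from `sc_profile["Image_Metadata_Well"]` and `platemap_df["well_position"]`.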


## Annotation

In this step of the walkthrough, we will combine the metadata with the merged single-cell morphology dataset. To accomplish this, we will use the `annotate` function provided by `pycytominer`.

The `annotate` function takes two inputs: the merged single-cell morphology dataset and its associated plate map. The `join_on` argument names the pair of columns to match on: first the plate map's well column (which receives a `Metadata_` prefix, hence `Metadata_well_position`), then the profile's well column. By combining these two datasets, we will generate an annotated profile that contains enriched information.

More information about the `annotate` function can be found [here](https://pycytominer.readthedocs.io/en/latest/pycytominer.html#module-pycytominer.annotate).
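
Conceptually, annotation is a join between the plate map and the single-cell profile on the well column. The sketch below reproduces that idea with plain `pandas` on toy data. It is a simplified illustration under that assumption, not `pycytominer`'s implementation, and the values are made up.

```python
import pandas as pd

# Toy single-cell profile (hypothetical values) keyed by well.
profiles = pd.DataFrame(
    {
        "Image_Metadata_Well": ["C6", "C6", "C7"],
        "Cells_AreaShape_Area": [32065.0, 14466.0, 41368.0],
    }
)

# Toy plate map mirroring the columns of the plate map shown above.
platemap = pd.DataFrame(
    {
        "well_position": ["C6", "C7"],
        "gene_name": ["NF1", "NF1"],
        "genotype": ["WT", "Null"],
    }
)

# Prefix plate map columns so they are recognizable as metadata, then
# attach them to every cell from the matching well.
platemap_prefixed = platemap.add_prefix("Metadata_")
annotated = platemap_prefixed.merge(
    profiles,
    left_on="Metadata_well_position",
    right_on="Image_Metadata_Well",
    how="inner",
)
print(annotated[["Image_Metadata_Well", "Metadata_genotype"]])
```

Every cell inherits the genotype and gene name of the well it was imaged in, which is what makes downstream comparisons between conditions possible.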


In [6]:
# annotating merged single-cell profile with metadata
annotated_df = annotate(
    profiles=sc_profile,
    platemap=platemap_df,
    join_on=["Metadata_well_position", "Image_Metadata_Well"],
)

# save annotated profile
annotated_df.to_csv(anno_profile_path, compression="gzip")

# displaying annotated profile
annotated_df.head()

Unnamed: 0,Metadata_WellRow,Metadata_WellCol,Metadata_gene_name,Metadata_genotype,Metadata_ImageNumber,Metadata_Plate,Metadata_Well,Metadata_Cytoplasm_Parent_Cells,Metadata_Cytoplasm_Parent_Nuclei,Metadata_Cells_Number_Object_Number,...,Nuclei_Texture_Variance_DAPI_3_02_256,Nuclei_Texture_Variance_DAPI_3_03_256,Nuclei_Texture_Variance_GFP_3_00_256,Nuclei_Texture_Variance_GFP_3_01_256,Nuclei_Texture_Variance_GFP_3_02_256,Nuclei_Texture_Variance_GFP_3_03_256,Nuclei_Texture_Variance_RFP_3_00_256,Nuclei_Texture_Variance_RFP_3_01_256,Nuclei_Texture_Variance_RFP_3_02_256,Nuclei_Texture_Variance_RFP_3_03_256
0,C,6,NF1,WT,1,1,C6,1,3,1,...,1393.410857,1331.583275,653.826838,618.063979,606.832257,590.114791,147.195839,144.355017,148.179465,148.875403
1,C,6,NF1,WT,1,1,C6,2,4,2,...,1369.498111,1276.305513,332.941295,317.56745,321.873215,292.116754,60.632767,61.876198,65.202076,60.022847
2,C,6,NF1,WT,1,1,C6,3,5,3,...,1338.091947,1299.373271,432.829034,398.306003,401.091835,358.84984,74.837374,71.033793,80.523205,80.845266
3,C,6,NF1,WT,1,1,C6,4,7,4,...,899.439956,874.837386,211.898029,189.348918,186.3333,188.292692,113.059608,113.194846,110.997393,109.83439
4,C,6,NF1,WT,1,1,C6,5,8,5,...,1231.630414,1218.998954,306.13973,295.581509,310.469726,287.78839,496.084704,502.046808,490.259298,491.171009


## Normalization Step

The next step is to normalize our dataset using the `normalize` function provided by `pycytominer`.
More information about `pycytominer`'s `normalize` function can be found [here](https://pycytominer.readthedocs.io/en/latest/pycytominer.html#module-pycytominer.normalize).

Normalization is a critical preprocessing step that improves the quality of our dataset. It addresses two key challenges: mitigating the impact of outliers and handling variations in value scales. By normalizing the data, we ensure that our downstream analysis is not heavily influenced by these factors.

Additionally, normalization matters for feature selection, which is our last step. By bringing all features to a comparable scale, it allows features to be compared without bias from widely differing value ranges.

We call `normalize` on our annotated single-cell morphology profile with its default settings. By default, it standardizes each feature to zero mean and unit variance (z-scoring) across all samples. Other methods can be selected through its `method` argument.
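
Standardization (z-scoring) is one common way to put features on a comparable scale. The sketch below applies it to a toy table with plain `pandas`. It is a simplified stand-in for illustration, not `pycytominer`'s implementation, and the feature names and values are made up.

```python
import pandas as pd

# Toy feature table (hypothetical values); the columns differ widely in scale.
features = pd.DataFrame(
    {
        "Cells_AreaShape_Area": [100.0, 200.0, 300.0],
        "Nuclei_Texture_Variance": [1.0, 2.0, 3.0],
    }
)

# Z-score each column: subtract its mean and divide by its standard deviation.
# ddof=0 uses the population standard deviation (an illustrative choice).
zscored = (features - features.mean()) / features.std(ddof=0)

print(zscored.round(3))
```

After scaling, both columns carry the same values even though the raw areas were a hundred times larger, so neither dominates a downstream comparison purely because of its units.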

In [7]:
# normalize dataset
normalized_df = normalize(annotated_df)

# save normalized dataset 
normalized_df.to_csv(norm_profile_path, compression="gzip")

# display normalized dataset
normalized_df.head()

Unnamed: 0,Metadata_WellRow,Metadata_WellCol,Metadata_gene_name,Metadata_genotype,Metadata_ImageNumber,Metadata_Plate,Metadata_Well,Metadata_Cytoplasm_Parent_Cells,Metadata_Cytoplasm_Parent_Nuclei,Metadata_Cells_Number_Object_Number,...,Nuclei_Texture_Variance_DAPI_3_02_256,Nuclei_Texture_Variance_DAPI_3_03_256,Nuclei_Texture_Variance_GFP_3_00_256,Nuclei_Texture_Variance_GFP_3_01_256,Nuclei_Texture_Variance_GFP_3_02_256,Nuclei_Texture_Variance_GFP_3_03_256,Nuclei_Texture_Variance_RFP_3_00_256,Nuclei_Texture_Variance_RFP_3_01_256,Nuclei_Texture_Variance_RFP_3_02_256,Nuclei_Texture_Variance_RFP_3_03_256
0,C,6,NF1,WT,1,1,C6,1,3,1,...,-0.277354,-0.278153,0.437881,0.406082,0.358158,0.358081,0.760504,0.736324,0.765755,0.781857
1,C,6,NF1,WT,1,1,C6,2,4,2,...,-0.308203,-0.353348,-0.069536,-0.073583,-0.088878,-0.116121,0.001109,0.015249,0.039691,-0.000841
2,C,6,NF1,WT,1,1,C6,3,5,3,...,-0.34872,-0.321969,0.088417,0.055295,0.035398,-0.009929,0.125722,0.09531,0.173753,0.182583
3,C,6,NF1,WT,1,1,C6,4,7,4,...,-0.91462,-0.899471,-0.260942,-0.278251,-0.301509,-0.281335,0.461035,0.463905,0.440407,0.437947
4,C,6,NF1,WT,1,1,C6,5,8,5,...,-0.486065,-0.431303,-0.111917,-0.108678,-0.106767,-0.123008,3.821213,3.86346,3.759005,3.797123


## Feature Selection


In the final section of our walkthrough, we will utilize the normalized dataset to extract important morphological features and generate a selected features profile.

To accomplish this, we will make use of the `feature_select` function provided by `pycytominer`.
By applying `feature_select` to our dataset, we can identify the most informative morphological features, those that contribute most to the variation observed in our data. These selected features will be used to create our feature profile.

For more detailed information about the feature_select function, its parameters, and its capabilities, please refer to the documentation available [here](https://pycytominer.readthedocs.io/en/latest/pycytominer.html#module-pycytominer.feature_select).
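
To give a feel for what feature selection can involve, the sketch below applies two common filters with plain `pandas`: removing features with near-zero variance and removing features that are highly correlated with one already kept. This is a simplified illustration, not `pycytominer`'s algorithm. The helper names, thresholds, and values are all assumptions made for the example.

```python
import pandas as pd

# Toy normalized features (hypothetical values).
features = pd.DataFrame(
    {
        "feature_a": [0.1, -1.2, 0.8, 1.5, -0.9],
        "feature_b": [0.2, -2.4, 1.6, 3.0, -1.8],  # exactly 2 * feature_a
        "feature_c": [1.0, 1.0, 1.0, 1.0, 1.0],  # constant, carries no information
        "feature_d": [0.5, 0.3, -1.1, 0.9, -0.4],
    }
)


def drop_low_variance(df: pd.DataFrame, min_variance: float = 1e-8) -> pd.DataFrame:
    """Keep only the columns whose variance exceeds min_variance."""
    variances = df.var()
    return df.loc[:, variances > min_variance]


def drop_correlated(df: pd.DataFrame, max_corr: float = 0.9) -> pd.DataFrame:
    """Greedily keep a column only if it is not highly correlated with a kept one."""
    corr = df.corr().abs()
    kept: list[str] = []
    for column in df.columns:
        if all(corr.loc[column, other] <= max_corr for other in kept):
            kept.append(column)
    return df[kept]


selected = drop_correlated(drop_low_variance(features))
print(list(selected.columns))
```

Here the constant column and the redundant copy are dropped, leaving the features that each add distinct information.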

In [8]:
# creating selected features profile 
features_df = feature_select(profiles=normalized_df)

# saving selected features profile 
features_df.to_csv(feat_profile_path, compression="gzip")

# display selected features 
features_df.head()


Unnamed: 0,Metadata_WellRow,Metadata_WellCol,Metadata_gene_name,Metadata_genotype,Metadata_ImageNumber,Metadata_Plate,Metadata_Well,Metadata_Cytoplasm_Parent_Cells,Metadata_Cytoplasm_Parent_Nuclei,Metadata_Cells_Number_Object_Number,...,Nuclei_Texture_Variance_DAPI_3_02_256,Nuclei_Texture_Variance_DAPI_3_03_256,Nuclei_Texture_Variance_GFP_3_00_256,Nuclei_Texture_Variance_GFP_3_01_256,Nuclei_Texture_Variance_GFP_3_02_256,Nuclei_Texture_Variance_GFP_3_03_256,Nuclei_Texture_Variance_RFP_3_00_256,Nuclei_Texture_Variance_RFP_3_01_256,Nuclei_Texture_Variance_RFP_3_02_256,Nuclei_Texture_Variance_RFP_3_03_256
0,C,6,NF1,WT,1,1,C6,1,3,1,...,-0.277354,-0.278153,0.437881,0.406082,0.358158,0.358081,0.760504,0.736324,0.765755,0.781857
1,C,6,NF1,WT,1,1,C6,2,4,2,...,-0.308203,-0.353348,-0.069536,-0.073583,-0.088878,-0.116121,0.001109,0.015249,0.039691,-0.000841
2,C,6,NF1,WT,1,1,C6,3,5,3,...,-0.34872,-0.321969,0.088417,0.055295,0.035398,-0.009929,0.125722,0.09531,0.173753,0.182583
3,C,6,NF1,WT,1,1,C6,4,7,4,...,-0.91462,-0.899471,-0.260942,-0.278251,-0.301509,-0.281335,0.461035,0.463905,0.440407,0.437947
4,C,6,NF1,WT,1,1,C6,5,8,5,...,-0.486065,-0.431303,-0.111917,-0.108678,-0.106767,-0.123008,3.821213,3.86346,3.759005,3.797123


Congratulations! You have successfully completed our walkthrough. We hope that this tutorial has provided you with a basic understanding of how to analyze cell morphology features using pycytominer.

By following the steps outlined in this walkthrough, you have gained valuable insights into processing high-dimensional single-cell morphology data with ease using pycytominer. However, please keep in mind that pycytominer offers a wide range of functionalities beyond what we covered here. We encourage you to explore the documentation to discover more advanced features and techniques.

If you have any questions or need further assistance, don't hesitate to visit the pycytominer repository and post your question in the issues section. The community is there to support you and provide guidance.

Now that you have the knowledge and tools to analyze cell morphology features, have fun exploring and mining your data!