## Perform single-cell quality control to remove erroneous, technical outlier segmentations

Using coSMicQC, we selected three morphology features (related to shape and intensity) to maximize the removal of poor-quality nuclei segmentations that could impact downstream analysis.

The original indices are saved as a parquet file to be used for filtering downstream.
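
A minimal sketch of how the saved indices might be applied downstream, assuming the profiles are reloaded with the same default integer index that was in place during QC. The toy `profiles` and `failed` frames are hypothetical stand-ins for a plate's single-cell profiles and the saved parquet file.

```python
import pandas as pd

# Hypothetical stand-in for the plate's single-cell profiles
profiles = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A01", "A02", "A02"],
        "Nuclei_AreaShape_FormFactor": [0.92, 0.41, 0.94, 0.90],
    }
)

# Hypothetical stand-in for the saved failed-QC table;
# in practice this would be loaded with pd.read_parquet
failed = pd.DataFrame(
    {
        "original_index": [1],
        "cqc.formfactor_displacement_outlier.is_outlier": [True],
    }
)

# Drop every row whose index label was recorded as failing QC
passing = profiles.drop(index=failed["original_index"])
print(passing.shape)
```

This only works if the row order of the profiles is unchanged between QC and filtering; otherwise a merge on the metadata columns would be the safer route.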

## Set papermill parameters

In [1]:
# Papermill notebook parameters
# (including notebook cell tag).
# We set a default here which may be overridden
# during Papermill execution.
# See here for more information:
# https://papermill.readthedocs.io/en/latest/usage-parameterize.html
plate_id = "BR00117010"

In [2]:
# Parameters
plate_id = "BR00117006"


## Import libraries

In [3]:
from pathlib import Path

import cosmicqc
import pandas as pd
from cytodataframe import CytoDataFrame
from pyarrow import parquet

## Load in the plate data to process with only relevant metadata columns

In [4]:
# set path to save qc results
qc_results_dir = Path("./qc_results")
qc_results_dir.mkdir(parents=True, exist_ok=True)

# form a path to the parquet file with single-cell profiles
merged_single_cells = f"/media/jenna/8TB-C/work/JUMP-single-cell/0.download_data/data/plates/{plate_id}/{plate_id}.parquet"

# read only the metadata from parquet file
parquet.ParquetFile(merged_single_cells).metadata

<pyarrow._parquet.FileMetaData object at 0x7049834350e0>
  created_by: parquet-cpp-arrow version 17.0.0
  num_columns: 7390
  num_rows: 491838
  num_row_groups: 53
  format_version: 2.6
  serialized_size: 58923224

In [5]:
# set a list of metadata columns for use throughout
metadata_cols = [
    "Image_TableNumber",
    "Metadata_ImageNumber",
    "Image_Metadata_Site",
    "Metadata_ObjectNumber",
    "Metadata_Plate",
    "Metadata_Well",
    "Image_Metadata_Col",
    "Image_Metadata_Row",
    "Image_FileName_CellOutlines",
    "Image_FileName_NucleiOutlines",
    "Image_FileName_OrigDNA",
    "Image_FileName_OrigRNA",
    "Nuclei_AreaShape_BoundingBoxMaximum_X",
    "Nuclei_AreaShape_BoundingBoxMaximum_Y",
    "Nuclei_AreaShape_BoundingBoxMinimum_X",
    "Nuclei_AreaShape_BoundingBoxMinimum_Y",
]

# read only metadata columns with feature columns used for outlier detection
df_merged_single_cells = pd.read_parquet(
    path=merged_single_cells,
    columns=[
        *metadata_cols,
        "Nuclei_AreaShape_FormFactor",
        "Nuclei_Intensity_MassDisplacement_DNA",
        "Nuclei_AreaShape_Compactness",
    ],
)
print(df_merged_single_cells.shape)
df_merged_single_cells.head()

(491838, 19)


Unnamed: 0,Image_TableNumber,Metadata_ImageNumber,Image_Metadata_Site,Metadata_ObjectNumber,Metadata_Plate,Metadata_Well,Image_Metadata_Col,Image_Metadata_Row,Image_FileName_CellOutlines,Image_FileName_NucleiOutlines,Image_FileName_OrigDNA,Image_FileName_OrigRNA,Nuclei_AreaShape_BoundingBoxMaximum_X,Nuclei_AreaShape_BoundingBoxMaximum_Y,Nuclei_AreaShape_BoundingBoxMinimum_X,Nuclei_AreaShape_BoundingBoxMinimum_Y,Nuclei_AreaShape_FormFactor,Nuclei_Intensity_MassDisplacement_DNA,Nuclei_AreaShape_Compactness
0,100953712601233197982859519342258171194,1,1,2,BR00117006,A01,1,1,A01_s1--cell_outlines.png,A01_s1--nuclei_outlines.png,r01c01f01p01-ch5sk1fk1fl1.tiff,r01c01f01p01-ch3sk1fk1fl1.tiff,906,39,879,10,0.919634,0.157198,1.087389
1,10948179779460722453965434149548305036,2,2,1,BR00117006,A01,1,1,A01_s2--cell_outlines.png,A01_s2--nuclei_outlines.png,r01c01f02p01-ch5sk1fk1fl1.tiff,r01c01f02p01-ch3sk1fk1fl1.tiff,43,32,13,3,0.939079,0.032054,1.064873
2,10948179779460722453965434149548305036,2,2,2,BR00117006,A01,1,1,A01_s2--cell_outlines.png,A01_s2--nuclei_outlines.png,r01c01f02p01-ch5sk1fk1fl1.tiff,r01c01f02p01-ch3sk1fk1fl1.tiff,173,34,143,9,0.941767,0.063976,1.061834
3,10948179779460722453965434149548305036,2,2,3,BR00117006,A01,1,1,A01_s2--cell_outlines.png,A01_s2--nuclei_outlines.png,r01c01f02p01-ch5sk1fk1fl1.tiff,r01c01f02p01-ch3sk1fk1fl1.tiff,821,34,793,5,0.884215,0.416156,1.130947
4,22362509688399705307174787699161992705,4,4,2,BR00117006,A01,1,1,A01_s4--cell_outlines.png,A01_s4--nuclei_outlines.png,r01c01f04p01-ch5sk1fk1fl1.tiff,r01c01f04p01-ch3sk1fk1fl1.tiff,847,93,818,67,0.961052,0.057564,1.040526


## Create mapping for the outlines to the images for CytoDataFrame

In [6]:
# create an outline and orig mapping dictionary to map original images to outlines
# note: we turn off formatting here to avoid the key-value pairing definition
# from being reformatted by black, which is normally preferred.
# fmt: off
outline_to_orig_mapping = {
    rf"{record['Metadata_Well']}_s{record['Image_Metadata_Site']}--nuclei_outlines\.png":
    rf"r{record['Image_Metadata_Row']:02d}c{record['Image_Metadata_Col']:02d}f{record['Image_Metadata_Site']:02d}p(\d{{2}})-.*\.tiff"
    for record in df_merged_single_cells[
        [
            "Metadata_Well",
            "Image_Metadata_Row",
            "Image_Metadata_Col",
            "Image_Metadata_Site",
        ]
    ].to_dict(orient="records")
}
# fmt: on

next(iter(outline_to_orig_mapping.items()))

('A01_s1--nuclei_outlines\\.png', 'r01c01f01p(\\d{2})-.*\\.tiff')
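
To make the mapping easier to read, the sketch below rebuilds the first key-value pair from a single record and checks which file names each regular expression accepts. The `record` values and file names are taken from the table above. How CytoDataFrame applies these patterns internally is not shown in this notebook, so the use of `re.fullmatch` here is an assumption.

```python
import re

# One record, using values from the first row of the plate data
record = {
    "Metadata_Well": "A01",
    "Image_Metadata_Row": 1,
    "Image_Metadata_Col": 1,
    "Image_Metadata_Site": 1,
}

# Same f-string templates as the dictionary comprehension above
outline_pattern = rf"{record['Metadata_Well']}_s{record['Image_Metadata_Site']}--nuclei_outlines\.png"
image_pattern = rf"r{record['Image_Metadata_Row']:02d}c{record['Image_Metadata_Col']:02d}f{record['Image_Metadata_Site']:02d}p(\d{{2}})-.*\.tiff"

image_files = [
    "r01c01f01p01-ch5sk1fk1fl1.tiff",  # site 1, DNA channel
    "r01c01f02p01-ch5sk1fk1fl1.tiff",  # site 2, DNA channel
    "r01c01f01p01-ch3sk1fk1fl1.tiff",  # site 1, RNA channel
]

# The outline pattern matches exactly one outline file name
outline_matches = re.fullmatch(outline_pattern, "A01_s1--nuclei_outlines.png") is not None

# The image pattern matches every channel of the same row, column, and site
matches = [name for name in image_files if re.fullmatch(image_pattern, name)]
print(outline_matches, matches)
```

Both site 1 images match regardless of channel, while the site 2 image is rejected.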

## Detect technically mis-segmented nuclei

We define technically mis-segmented nuclei as segmentations that include more than one nucleus, are under- or over-segmented, or capture background/smudging.

We use the measurements in two different conditions:


1. Irregularly shaped nuclei where the intensity-based center is far from the shape's center

This condition detects technical outliers with an irregular shape and a very large difference between centroids, which can capture multiple types of technical outliers.

- `FormFactor`, which measures how irregular the shape is.
- `MassDisplacement`, which measures how far apart the segmentation-based and intensity-based centroids are (a large distance can reflect multiple nuclei within one segmentation).

We understand that interesting phenotypes will occur, so we evaluate whether this condition identifies the mis-segmentations.
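
As a rough illustration of what these two features capture, the sketch below assumes the CellProfiler definitions: `FormFactor` as 4π·area/perimeter², and `MassDisplacement` as the distance between the shape centroid and the intensity-weighted centroid. The helper functions are hypothetical and simplified (any pixel with positive intensity counts as part of the object); they are not the code used to produce the profiles.

```python
import math


def form_factor(area: float, perimeter: float) -> float:
    """4*pi*area / perimeter**2: 1.0 for a perfect circle, lower when irregular."""
    return 4 * math.pi * area / perimeter**2


def mass_displacement(intensity: list[list[float]]) -> float:
    """Distance between the shape centroid and the intensity-weighted centroid.

    Simplification: every pixel with intensity > 0 belongs to the object.
    """
    pixels = [
        (row, col, value)
        for row, values in enumerate(intensity)
        for col, value in enumerate(values)
        if value > 0
    ]
    total = sum(value for _, _, value in pixels)
    shape_row = sum(row for row, _, _ in pixels) / len(pixels)
    shape_col = sum(col for _, col, _ in pixels) / len(pixels)
    mass_row = sum(row * value for row, _, value in pixels) / total
    mass_col = sum(col * value for _, col, value in pixels) / total
    return math.hypot(shape_row - mass_row, shape_col - mass_col)


circle = form_factor(area=math.pi * 10**2, perimeter=2 * math.pi * 10)
square = form_factor(area=10 * 10, perimeter=4 * 10)

# Evenly stained object versus one with a bright region at one end
uniform = mass_displacement([[1, 1, 1, 1, 1, 1]])
lopsided = mass_displacement([[5, 5, 1, 1, 1, 1]])

print(circle, square, uniform, lopsided)
```

A circle scores about 1.0 and a square about 0.785, so lower `FormFactor` means a less round outline. The evenly stained object has zero displacement, while the lopsided one is pulled toward its bright end, which is the pattern expected when one segmentation covers two nuclei of different brightness.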

### Detect outliers and show single-cell image crops with CytoDataFrame (cdf)

In [7]:
# find irregularly shaped nuclei using thresholds that maximize
# removal of technical outliers and minimize removal of good cells
feature_thresholds = {
    "Nuclei_AreaShape_FormFactor": -2.35,
    "Nuclei_Intensity_MassDisplacement_DNA": 1.5,
}

irregular_nuclei_outliers = cosmicqc.find_outliers(
    df=df_merged_single_cells,
    metadata_columns=metadata_cols,
    feature_thresholds=feature_thresholds,
)

irregular_nuclei_outliers_cdf = CytoDataFrame(
    data=pd.DataFrame(irregular_nuclei_outliers),
    data_context_dir=f"/media/jenna/8TB-C/work/JUMP-download-images/JUMP-single-cell/0.download_data/data/images/{plate_id}/orig",
    data_outline_context_dir=f"/media/jenna/8TB-C/work/JUMP-download-images/JUMP-single-cell/0.download_data/data/images/{plate_id}/outlines",
    segmentation_file_regex=outline_to_orig_mapping,
)[
    [
        "Nuclei_AreaShape_FormFactor",
        "Nuclei_Intensity_MassDisplacement_DNA",
        "Image_FileName_OrigDNA",
    ]
]


print(irregular_nuclei_outliers_cdf.shape)
irregular_nuclei_outliers_cdf.sort_values(
    by="Nuclei_AreaShape_FormFactor", ascending=False
).head(2)

Number of outliers: 3801 (0.77%)
Outliers Range:
Nuclei_AreaShape_FormFactor Min: 0.23113739750863993
Nuclei_AreaShape_FormFactor Max: 0.760638693620542
Nuclei_Intensity_MassDisplacement_DNA Min: 0.6397183314369549
Nuclei_Intensity_MassDisplacement_DNA Max: 7.26280904251638
(3801, 3)


Unnamed: 0,Nuclei_AreaShape_FormFactor,Nuclei_Intensity_MassDisplacement_DNA,Image_FileName_OrigDNA
19664,0.760639,1.086892,
4012,0.760626,0.749094,
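
The printed ranges suggest how the thresholds behave: every flagged nucleus has a low `FormFactor` and a high `MassDisplacement`. The sketch below assumes the thresholds are z-score cutoffs, that a negative threshold flags values below it and a positive one flags values above it, and that all conditions must hold at once. `zscores` and `flag_outliers` are hypothetical helpers written for illustration, not the coSMicQC implementation, and the feature values are toy data.

```python
from statistics import mean, pstdev


def zscores(values: list[float]) -> list[float]:
    """Standardize values using the population standard deviation (an assumption)."""
    mu, sigma = mean(values), pstdev(values)
    return [(value - mu) / sigma for value in values]


def flag_outliers(
    features: dict[str, list[float]], thresholds: dict[str, float]
) -> list[int]:
    """Return positions of rows that satisfy every threshold condition."""
    n_rows = len(next(iter(features.values())))
    flags = [True] * n_rows
    for name, threshold in thresholds.items():
        for i, score in enumerate(zscores(features[name])):
            # Negative threshold: flag low values; positive: flag high values
            passes = score > threshold if threshold > 0 else score < threshold
            flags[i] = flags[i] and passes
    return [i for i, flagged in enumerate(flags) if flagged]


# Toy data: row 5 is irregular AND displaced; row 7 is only displaced
toy_features = {
    "Nuclei_AreaShape_FormFactor": [0.90, 0.92, 0.91, 0.93, 0.89, 0.40, 0.92, 0.91],
    "Nuclei_Intensity_MassDisplacement_DNA": [0.1, 0.2, 0.1, 0.15, 0.1, 3.0, 0.2, 2.5],
}
toy_thresholds = {
    "Nuclei_AreaShape_FormFactor": -2.35,
    "Nuclei_Intensity_MassDisplacement_DNA": 1.5,
}

flagged = flag_outliers(toy_features, toy_thresholds)
print(flagged)
```

Only row 5 is flagged because row 7 has a normal shape, which shows why combining the two features is more conservative than thresholding either one alone.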


### Randomly sample outlier rows to visually inspect if the threshold looks to be optimized

In [8]:
irregular_nuclei_outliers_cdf.sample(n=5, random_state=0)

Unnamed: 0,Nuclei_AreaShape_FormFactor,Nuclei_Intensity_MassDisplacement_DNA,Image_FileName_OrigDNA
93222,0.758155,0.85384,
179009,0.725754,1.016294,
119166,0.72787,0.695138,
232985,0.641028,1.412864,
293186,0.736369,0.844214,


### Generate a new dataframe to save the original indices for filtering prior to further preprocessing

In [9]:
# Create a new dataframe for failed QC indices
failed_qc_indices = irregular_nuclei_outliers[metadata_cols].copy()
failed_qc_indices["original_index"] = failed_qc_indices.index  # Store original index
failed_qc_indices = failed_qc_indices.reset_index(drop=True)  # Reset index
failed_qc_indices["cqc.formfactor_displacement_outlier.is_outlier"] = True

print(failed_qc_indices.shape)
failed_qc_indices.head()

(3801, 18)


Unnamed: 0,Image_TableNumber,Metadata_ImageNumber,Image_Metadata_Site,Metadata_ObjectNumber,Metadata_Plate,Metadata_Well,Image_Metadata_Col,Image_Metadata_Row,Image_FileName_CellOutlines,Image_FileName_NucleiOutlines,Image_FileName_OrigDNA,Image_FileName_OrigRNA,Nuclei_AreaShape_BoundingBoxMaximum_X,Nuclei_AreaShape_BoundingBoxMaximum_Y,Nuclei_AreaShape_BoundingBoxMinimum_X,Nuclei_AreaShape_BoundingBoxMinimum_Y,original_index,cqc.formfactor_displacement_outlier.is_outlier
0,69459963208210734185096132903982422119,74,2,1,BR00117006,A09,9,1,A09_s2--cell_outlines.png,A09_s2--nuclei_outlines.png,r01c09f02p01-ch5sk1fk1fl1.tiff,r01c09f02p01-ch3sk1fk1fl1.tiff,596,31,569,5,102,True
1,272004832586908140273863469260939669360,239,5,3,BR00117006,B03,3,2,B03_s5--cell_outlines.png,B03_s5--nuclei_outlines.png,r02c03f05p01-ch5sk1fk1fl1.tiff,r02c03f05p01-ch3sk1fk1fl1.tiff,473,56,437,29,389,True
2,106551575141517557626179295758086866364,244,1,3,BR00117006,B04,4,2,B04_s1--cell_outlines.png,B04_s1--nuclei_outlines.png,r02c04f01p01-ch5sk1fk1fl1.tiff,r02c04f01p01-ch3sk1fk1fl1.tiff,1064,40,1035,15,395,True
3,245747665552168333667480201534104719103,614,2,1,BR00117006,C21,21,3,C21_s2--cell_outlines.png,C21_s2--nuclei_outlines.png,r03c21f02p01-ch5sk1fk1fl1.tiff,r03c21f02p01-ch3sk1fk1fl1.tiff,192,29,165,2,1184,True
4,164520559472283748380831895597373580469,88,7,1,BR00117006,A10,10,1,A10_s7--cell_outlines.png,A10_s7--nuclei_outlines.png,r01c10f07p01-ch5sk1fk1fl1.tiff,r01c10f07p01-ch3sk1fk1fl1.tiff,886,46,841,3,1450,True


2. Nuclei segmentations with holes that are also irregularly shaped

We include this extra condition because the first condition missed additional poorly segmented cells, especially over-segmented nuclei that contain holes, which is uncommon for a nucleus segmentation.

- `Compactness`, which detects irregular nuclei and nuclei containing holes.
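
To see why holes raise this feature, the sketch below assumes the CellProfiler definition in which `Compactness` is perimeter²/(4π·area), the reciprocal of `FormFactor`, and that the perimeter includes the boundary of any hole. The first rows of the plate data are consistent with the reciprocal relationship (for example, 1/0.919634 ≈ 1.087389). The `compactness` helper is hypothetical.

```python
import math


def compactness(area: float, perimeter: float) -> float:
    """perimeter**2 / (4*pi*area): 1.0 for a filled circle, higher otherwise."""
    return perimeter**2 / (4 * math.pi * area)


# Filled disk of radius 10
disk = compactness(area=math.pi * 10**2, perimeter=2 * math.pi * 10)

# Same disk with a concentric hole of radius 5:
# the area shrinks while the perimeter gains the hole's boundary
ring = compactness(
    area=math.pi * (10**2 - 5**2),
    perimeter=2 * math.pi * (10 + 5),
)

# Reciprocal check against the first row of the plate data
reciprocal_of_form_factor = 1 / 0.919634

print(disk, ring, reciprocal_of_form_factor)
```

The hole triples the score (about 1.0 for the disk versus about 3.0 for the ring), so a single high-side threshold on `Compactness` can catch both ragged outlines and segmentations with holes.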

### Detect outliers and show single-cell image crops with CytoDataFrame (cdf)

In [10]:
# find mis-segmented nuclei using thresholds that maximize
# removal of technical outliers and minimize removal of good cells
feature_thresholds = {
    "Nuclei_AreaShape_Compactness": 4.05,
}

misshapen_nuclei_outliers = cosmicqc.find_outliers(
    df=df_merged_single_cells,
    metadata_columns=metadata_cols,
    feature_thresholds=feature_thresholds,
)

misshapen_nuclei_outliers_cdf = CytoDataFrame(
    data=pd.DataFrame(misshapen_nuclei_outliers),
    data_context_dir=f"/media/jenna/8TB-C/work/JUMP-download-images/JUMP-single-cell/0.download_data/data/images/{plate_id}/orig",
    data_outline_context_dir=f"/media/jenna/8TB-C/work/JUMP-download-images/JUMP-single-cell/0.download_data/data/images/{plate_id}/outlines",
    segmentation_file_regex=outline_to_orig_mapping,
)[
    [
        "Nuclei_AreaShape_Compactness",
        "Image_FileName_OrigDNA",
    ]
]


print(misshapen_nuclei_outliers_cdf.shape)
misshapen_nuclei_outliers_cdf.sort_values(
    by="Nuclei_AreaShape_Compactness", ascending=True
).head(2)

Number of outliers: 2608 (0.53%)
Outliers Range:
Nuclei_AreaShape_Compactness Min: 1.4446025015717159
Nuclei_AreaShape_Compactness Max: 4.326430992036327
(2608, 2)


Unnamed: 0,Nuclei_AreaShape_Compactness,Image_FileName_OrigDNA
241712,1.444603,
315355,1.444606,


### Randomly sample outlier rows to visually inspect if the threshold looks to be optimized

In [11]:
misshapen_nuclei_outliers_cdf.sample(n=5, random_state=0)

Unnamed: 0,Nuclei_AreaShape_Compactness,Image_FileName_OrigDNA
409567,1.4776,
471000,1.455228,
46627,1.8264,
296020,1.4786,
186330,1.460929,


### Save the original indices for failed single cells to parquet files

In [12]:
# Create a new dataframe for poor nuclei outliers
poor_qc_indices = misshapen_nuclei_outliers[metadata_cols].copy()
poor_qc_indices["original_index"] = poor_qc_indices.index  # Store original index
poor_qc_indices = poor_qc_indices.reset_index(drop=True)  # Reset index
poor_qc_indices["cqc.compactness_outlier.is_outlier"] = True

# Merge both outlier dataframes on metadata columns and original_index
failed_qc_indices = failed_qc_indices.merge(
    poor_qc_indices, on=[*metadata_cols, "original_index"], how="outer"
)

# Fill missing values with False (ensuring only detected outliers are True)
failed_qc_indices[
    [
        "cqc.formfactor_displacement_outlier.is_outlier",
        "cqc.compactness_outlier.is_outlier",
    ]
] = failed_qc_indices[
    [
        "cqc.formfactor_displacement_outlier.is_outlier",
        "cqc.compactness_outlier.is_outlier",
    ]
].fillna(
    False
)

# Calculate percentage removed
num_outliers = failed_qc_indices["original_index"].nunique()
total_cells = len(df_merged_single_cells)
percentage_removed = (num_outliers / total_cells) * 100

# Sort by original index to maintain consistent order
failed_qc_indices = failed_qc_indices.sort_values(by="original_index").reset_index(
    drop=True
)

# Save the indices for failed single-cells as a parquet file
failed_qc_indices.to_parquet(
    Path(f"{qc_results_dir}/{plate_id}_failed_qc_indices.parquet")
)

print(f"Failed QC indices shape: {failed_qc_indices.shape}")
print(f"Percentage of single cells removed: {percentage_removed:.2f}%")
failed_qc_indices.head()

Failed QC indices shape: (5423, 19)
Percentage of single cells removed: 1.10%



Unnamed: 0,Image_TableNumber,Metadata_ImageNumber,Image_Metadata_Site,Metadata_ObjectNumber,Metadata_Plate,Metadata_Well,Image_Metadata_Col,Image_Metadata_Row,Image_FileName_CellOutlines,Image_FileName_NucleiOutlines,Image_FileName_OrigDNA,Image_FileName_OrigRNA,Nuclei_AreaShape_BoundingBoxMaximum_X,Nuclei_AreaShape_BoundingBoxMaximum_Y,Nuclei_AreaShape_BoundingBoxMinimum_X,Nuclei_AreaShape_BoundingBoxMinimum_Y,original_index,cqc.formfactor_displacement_outlier.is_outlier,cqc.compactness_outlier.is_outlier
0,69459963208210734185096132903982422119,74,2,1,BR00117006,A09,9,1,A09_s2--cell_outlines.png,A09_s2--nuclei_outlines.png,r01c09f02p01-ch5sk1fk1fl1.tiff,r01c09f02p01-ch3sk1fk1fl1.tiff,596,31,569,5,102,True,False
1,247634561082235873146216276711644102284,118,1,1,BR00117006,A14,14,1,A14_s1--cell_outlines.png,A14_s1--nuclei_outlines.png,r01c14f01p01-ch5sk1fk1fl1.tiff,r01c14f01p01-ch3sk1fk1fl1.tiff,1065,27,1023,2,169,False,True
2,15540175052208848045391526125020691728,124,7,2,BR00117006,A14,14,1,A14_s7--cell_outlines.png,A14_s7--nuclei_outlines.png,r01c14f07p01-ch5sk1fk1fl1.tiff,r01c14f07p01-ch3sk1fk1fl1.tiff,139,51,113,5,183,False,True
3,272004832586908140273863469260939669360,239,5,3,BR00117006,B03,3,2,B03_s5--cell_outlines.png,B03_s5--nuclei_outlines.png,r02c03f05p01-ch5sk1fk1fl1.tiff,r02c03f05p01-ch3sk1fk1fl1.tiff,473,56,437,29,389,True,False
4,106551575141517557626179295758086866364,244,1,1,BR00117006,B04,4,2,B04_s1--cell_outlines.png,B04_s1--nuclei_outlines.png,r02c04f01p01-ch5sk1fk1fl1.tiff,r02c04f01p01-ch3sk1fk1fl1.tiff,994,46,959,13,393,False,True
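
The reported counts can be cross-checked with a little arithmetic. Because the outer merge keeps one row per unique cell, the merged table is the union of the two outlier sets, so the overlap follows from inclusion-exclusion. All numbers below are copied from the outputs above; the variable names are only for illustration.

```python
total_cells = 491_838
formfactor_displacement_outliers = 3_801
compactness_outliers = 2_608
unique_failed = 5_423  # rows in the merged failed-QC table

# Cells flagged by both conditions (inclusion-exclusion)
flagged_by_both = (
    formfactor_displacement_outliers + compactness_outliers - unique_failed
)

percentage_removed = unique_failed / total_cells * 100

print(flagged_by_both)
print(f"{percentage_removed:.2f}%")
```

This gives 986 cells flagged by both conditions and reproduces the 1.10% removal rate.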
