# Identifying On and Off Morphological Signatures

In this section, we identify morphological features whose distributions differ significantly between the reference group and the target group. The goal is to generate morphological signatures associated with each cellular state.

For this analysis:
- The **reference state** is the negative control: failing CF cells treated with DMSO.
- The **target state** is the positive control: healthy CF cells treated with DMSO.

### On-Morphological Features
**On-morphological features** refer to morphological characteristics that show significant differences between the reference and target states. These features represent the on-target morphology associated with the target state. 

On-morphological features are crucial when developing metrics or models to differentiate the target state from the reference. These features signify cellular morphological changes that are specific to the target state and should be prioritized during metric development.

### Off-Morphological Features
**Off-morphological features** refer to morphological characteristics that do not show significant differences between the reference and target states. These features are not associated with the target state and may reflect general cellular morphology unaffected by the treatment or target condition.

These features can be leveraged to:
- Monitor off-target effects.
- Identify morphological changes unrelated to the target state.

### Goal of This Analysis
The goal is to clearly separate on-target morphological features (those associated with healthy CF cells) from off-target features (those not significantly affected by the target condition). This distinction helps in designing metrics for detecting on-target effects while monitoring and minimizing off-target impacts.
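
To make the on/off split concrete, here is a minimal sketch of the thresholding step on its own. The feature names and p-values are invented for illustration; the only real API used is `multipletests` from statsmodels, with the Benjamini-Hochberg method.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values, one per morphological feature (illustrative only)
feature_names = ["Feature_A", "Feature_B", "Feature_C", "Feature_D", "Feature_E"]
raw_p_values = np.array([0.001, 0.008, 0.039, 0.041, 0.6])

# Benjamini-Hochberg correction; the first returned element flags rejected nulls
reject, corrected_p_values, _, _ = multipletests(
    raw_p_values, alpha=0.05, method="fdr_bh"
)

# Significant features form the on-morphology group, the rest the off-morphology group
on_features = [name for name, flag in zip(feature_names, reject) if flag]
off_features = [name for name, flag in zip(feature_names, reject) if not flag]

print("on:", on_features)
print("off:", off_features)
```

In this toy example, `Feature_C` and `Feature_D` have raw p-values below 0.05 but do not survive the correction, so they fall into the off-morphology group.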

In [1]:
from typing import Optional, Tuple, List

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from statsmodels.stats.multitest import multipletests

## Helper functions

In [2]:
def weighted_ks_test(
    reference: pd.DataFrame,
    target: pd.DataFrame,
    p_thresh: Optional[float] = 0.05,
    correction_method: str = "fdr_bh",
) -> Tuple[List[str], List[str]]:
    """Computes a weighted Kolmogorov-Smirnov (KS) statistic between the target and
    reference datasets for each morphological feature and classifies the features by
    significance. Adjusts for imbalanced sample sizes by applying weights to the
    cumulative distribution functions (CDFs). P-values come from SciPy's standard
    two-sample KS test and are corrected for multiple testing.

    Parameters
    ----------
    reference : pd.DataFrame
        A DataFrame containing the morphological features of the reference dataset (e.g.,
        reference cells). Each column represents a different feature, and each row represents
        a single observation (e.g., a cell).

    target : pd.DataFrame
        A DataFrame containing the morphological features of the target dataset (e.g.,
        desired cell state). Each column represents a different feature, and each row 
        represents a single observation (e.g., a cell).

    p_thresh : Optional[float], default=0.05
        The significance threshold for the corrected p-value.

    correction_method : str, default="fdr_bh"
        The method for multiple testing correction. Options include:
        - "fdr_bh" (False Discovery Rate, Benjamini-Hochberg)
        - "bonferroni" (Bonferroni correction)
        Refer to `statsmodels.stats.multitest.multipletests` for other options.

    Returns
    -------
    Tuple[List[str], List[str]]
        - A list of features that are not significantly different between the target
          and reference datasets (off-morphology signatures).
        - A list of features that are significantly different (on-morphology signatures).

    Notes
    -----
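    - This implementation uses weights proportional to the inverse of the dataset sizes
      to adjust for imbalances between the reference and target datasets.
    - The weighted KS statistic is computed for each feature but is not returned.
      Significance is determined from the p-values of the unweighted
      `scipy.stats.ks_2samp` test.
    - Multiple testing correction is applied to the computed p-values.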
    """

    # Store the KS statistics and raw p-values for each feature
    ks_stats = []
    raw_p_values = []
    feature_names = target.columns.tolist()

    # Iterate through each morphological feature in the dataset
    for morphology_feature in feature_names:
        # Step 1: Calculate weights for both target and reference datasets
        # Weights ensure each dataset contributes equally, regardless of sample size
        target_weights = np.ones(len(target)) / len(target)
        reference_weights = np.ones(len(reference)) / len(reference)

        # Step 2: Sort the values of the feature and their corresponding weights
        sorted_target_indices = np.argsort(target[morphology_feature].to_numpy())
        sorted_reference_indices = np.argsort(reference[morphology_feature].to_numpy())

        sorted_target_data = target[morphology_feature].to_numpy()[
            sorted_target_indices
        ]
        sorted_reference_data = reference[morphology_feature].to_numpy()[
            sorted_reference_indices
        ]

        sorted_target_weights = target_weights[sorted_target_indices]
        sorted_reference_weights = reference_weights[sorted_reference_indices]

        # Step 3: Compute the weighted cumulative distribution functions (CDFs)
        # Use cumulative sums of sorted weights to calculate the CDFs
        weighted_target_cdf = np.cumsum(sorted_target_weights) / np.sum(
            sorted_target_weights
        )
        weighted_reference_cdf = np.cumsum(sorted_reference_weights) / np.sum(
            sorted_reference_weights
        )

        # Step 4: Find all unique feature values across both datasets
        # Unique values are necessary for interpolating CDFs
        all_values = np.unique(
            np.concatenate([sorted_target_data, sorted_reference_data])
        )

        # Step 5: Interpolate the CDFs at the unique values
        # Ensures the CDFs can be compared directly at the same points
        target_cdf_at_values = np.interp(
            all_values, sorted_target_data, weighted_target_cdf, left=0, right=1
        )
        reference_cdf_at_values = np.interp(
            all_values, sorted_reference_data, weighted_reference_cdf, left=0, right=1
        )

        # Step 6: Compute the KS statistic
        # The KS statistic is the maximum absolute difference between the two CDFs
        ks_stat = np.max(np.abs(target_cdf_at_values - reference_cdf_at_values))
        ks_stats.append(ks_stat)

        # Step 7: Compute the raw p-value using an unweighted KS test
        # The p-value is used to assess statistical significance
        _, p_val = ks_2samp(target[morphology_feature], reference[morphology_feature])
        raw_p_values.append(p_val)

    # Step 8: Apply multiple testing correction to raw p-values
    # This controls for false positives when testing multiple features
    corrected_results = multipletests(
        raw_p_values, alpha=p_thresh, method=correction_method
    )
    
    # Keep only the per-feature significance flags (first element of the result).
    # The corrected p-values (second element) are not used in this implementation.
    significant_flags = corrected_results[0]  # Boolean flags for significance

    # Step 9: Categorize features based on corrected p-values
    # Separate features into on-morphology (significant) and off-morphology (non-significant)
    on_morphology_signatures = [
        feature_names[i]
        for i, significant in enumerate(significant_flags)
        if significant
    ]
    off_morphology_signatures = [
        feature_names[i]
        for i, significant in enumerate(significant_flags)
        if not significant
    ]

    return off_morphology_signatures, on_morphology_signatures


def create_target_and_ref_datasets(
    reference_rows=10, target_rows=10, cols=10, significant_count=4, seed=0
):
    """
    Create reference and target datasets with the same feature space but different
    numbers of rows and specified significant features.

    Parameters
    ----------
    reference_rows : int, optional
        Number of rows in the reference dataset (default is 10).
    target_rows : int, optional
        Number of rows in the target dataset (default is 10).
    cols : int, optional
        Number of columns (morphology features) shared by both datasets (default is 10).
    significant_count : int, optional
        Number of features that are significantly different between the datasets (default is 4).
    seed : int, optional
        Seed for NumPy's global random number generator (default is 0).

    Returns
    -------
    reference : pd.DataFrame
        The reference dataset with random morphology feature values.
    target : pd.DataFrame
        The target dataset with specified number of significantly different features.
    significant_features : List[str]
        Names of the features that were shifted in the target dataset.
    """
    # Set seed for reproducibility
    np.random.seed(seed)

    # Generate column names for the datasets
    column_names = [f"Morphology_Feature_{i + 1}" for i in range(cols)]

    # Generate random data for the reference group
    reference = pd.DataFrame(
        np.random.normal(0, 1, size=(reference_rows, cols)),
        columns=column_names,
    )

    # Generate random data for the target group
    target = pd.DataFrame(
        np.random.normal(0, 1, size=(target_rows, cols)),
        columns=column_names,
    )

    # Select significant features
    significant_features = np.random.choice(
        column_names, size=significant_count, replace=False
    )

    # Introduce significant differences in the target group
    for feature in significant_features:
        target[feature] += np.random.normal(5, 0.5, size=target_rows)

    # Return the shifted feature names as well so callers can check recovery
    return reference, target, significant_features.tolist()
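
For comparison, the sketch below computes a weighted KS statistic by evaluating step-function empirical CDFs with `np.searchsorted` instead of linear interpolation. It is an illustrative alternative, not the implementation used above, and `weighted_ecdf` and `weighted_ks_statistic` are hypothetical helper names. With uniform weights it should reduce to the standard two-sample KS statistic, which gives a simple sanity check against `scipy.stats.ks_2samp`.

```python
import numpy as np
from scipy.stats import ks_2samp


def weighted_ecdf(values, weights, eval_points):
    """Evaluate a weighted, right-continuous step ECDF at eval_points."""
    order = np.argsort(values)
    sorted_values = values[order]
    # Prepend 0 so that index k gives the total weight of the k smallest samples
    cum_weights = np.concatenate([[0.0], np.cumsum(weights[order])])
    cum_weights /= cum_weights[-1]
    # Number of samples <= each evaluation point
    idx = np.searchsorted(sorted_values, eval_points, side="right")
    return cum_weights[idx]


def weighted_ks_statistic(a, b, a_weights=None, b_weights=None):
    """Maximum absolute difference between two weighted ECDFs."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Default to uniform weights (assumption: every observation counts equally)
    a_weights = np.ones(len(a)) if a_weights is None else np.asarray(a_weights, dtype=float)
    b_weights = np.ones(len(b)) if b_weights is None else np.asarray(b_weights, dtype=float)
    grid = np.concatenate([a, b])
    diff = weighted_ecdf(a, a_weights, grid) - weighted_ecdf(b, b_weights, grid)
    return float(np.max(np.abs(diff)))


# Sanity check on synthetic data with unequal sample sizes
rng = np.random.default_rng(0)
a = rng.normal(0, 1, size=20)
b = rng.normal(1, 1, size=40)
stat = weighted_ks_statistic(a, b)
scipy_stat = ks_2samp(a, b).statistic
print(stat, scipy_stat)
```

Passing non-uniform `a_weights` or `b_weights` changes the statistic, but the p-value from `ks_2samp` would then no longer apply to it directly.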

## Testing our weighted KS

In this section, we use a function that generates random data given the number of rows, the number of columns, and the number of significant features you want. This allows us to test whether our weighted KS test can capture the significant features.

In [19]:
# Example usage
reference, target, significant_features = create_target_and_ref_datasets(
    reference_rows=20, target_rows=40, cols=30, significant_count=10, seed=0
)

In [20]:
reference.head()

,Morphology_Feature_1,Morphology_Feature_2,Morphology_Feature_3,Morphology_Feature_4,Morphology_Feature_5,Morphology_Feature_6,Morphology_Feature_7,Morphology_Feature_8,Morphology_Feature_9,Morphology_Feature_10,...,Morphology_Feature_21,Morphology_Feature_22,Morphology_Feature_23,Morphology_Feature_24,Morphology_Feature_25,Morphology_Feature_26,Morphology_Feature_27,Morphology_Feature_28,Morphology_Feature_29,Morphology_Feature_30
0,1.764052,0.400157,0.978738,2.240893,1.867558,-0.977278,0.950088,-0.151357,-0.103219,0.410599,...,-2.55299,0.653619,0.864436,-0.742165,2.269755,-1.454366,0.045759,-0.187184,1.532779,1.469359
1,0.154947,0.378163,-0.887786,-1.980796,-0.347912,0.156349,1.230291,1.20238,-0.387327,-0.302303,...,-0.895467,0.386902,-0.510805,-1.180632,-0.028182,0.428332,0.066517,0.302472,-0.634322,-0.362741
2,-0.67246,-0.359553,-0.813146,-1.726283,0.177426,-0.401781,-1.630198,0.462782,-0.907298,0.051945,...,-1.16515,0.900826,0.465662,-1.536244,1.488252,1.895889,1.17878,-0.179925,-1.070753,1.054452
3,-0.403177,1.222445,0.208275,0.976639,0.356366,0.706573,0.0105,1.78587,0.126912,0.401989,...,1.867559,0.906045,-0.861226,1.910065,-0.268003,0.802456,0.947252,-0.15501,0.614079,0.922207
4,0.376426,-1.099401,0.298238,1.326386,-0.694568,-0.149635,-0.435154,1.849264,0.672295,0.407462,...,-1.491258,0.439392,0.166673,0.635031,2.383145,0.944479,-0.912822,1.117016,-1.315907,-0.461585


In [21]:
target.head()

,Morphology_Feature_1,Morphology_Feature_2,Morphology_Feature_3,Morphology_Feature_4,Morphology_Feature_5,Morphology_Feature_6,Morphology_Feature_7,Morphology_Feature_8,Morphology_Feature_9,Morphology_Feature_10,...,Morphology_Feature_21,Morphology_Feature_22,Morphology_Feature_23,Morphology_Feature_24,Morphology_Feature_25,Morphology_Feature_26,Morphology_Feature_27,Morphology_Feature_28,Morphology_Feature_29,Morphology_Feature_30
0,3.810596,5.920754,-0.944368,0.238103,-1.405963,-0.590058,-0.110489,-1.6607,0.115148,5.150091,...,0.676461,-0.382009,5.106472,4.276905,4.571031,-1.226196,0.183339,1.670943,5.550692,-0.001385
1,3.431756,4.917136,0.466166,-0.370242,-0.453804,0.403265,-0.918005,0.252497,0.820322,5.917431,...,0.286904,-2.320594,5.53342,5.826136,4.697473,0.449712,-0.067276,-1.318396,5.049491,-0.945616
2,4.573113,4.723658,0.452489,0.097896,-0.448165,-0.649338,-0.023423,1.079195,-2.004216,5.10878,...,-0.591403,1.124419,4.65432,6.327076,4.721959,-2.834555,2.116791,-1.610878,5.098335,2.380745
3,5.231858,6.567697,-1.502397,-1.777667,-0.532703,1.09075,-0.346249,-0.794636,0.197967,6.182621,...,-0.704921,0.679975,3.520733,4.92179,6.26957,-0.101281,-0.803141,-0.464338,6.046289,-0.552541
4,4.477513,3.771621,0.183925,-0.38549,-1.601836,-0.887181,-0.932789,1.243319,0.812674,5.236792,...,-1.177629,-1.140196,6.555146,5.251092,3.650465,0.555787,0.010349,0.720034,2.611602,0.303604


In [22]:
# apply weighted KS to the dummy data
off_morph_signatures, on_morph_signatures = weighted_ks_test(reference, target, p_thresh=0.05)
print("off_morph_signatures:", off_morph_signatures)
print("on_morph_signatures:", on_morph_signatures)

off_morph_signatures: ['Morphology_Feature_3', 'Morphology_Feature_4', 'Morphology_Feature_5', 'Morphology_Feature_6', 'Morphology_Feature_7', 'Morphology_Feature_8', 'Morphology_Feature_9', 'Morphology_Feature_11', 'Morphology_Feature_14', 'Morphology_Feature_15', 'Morphology_Feature_16', 'Morphology_Feature_17', 'Morphology_Feature_19', 'Morphology_Feature_20', 'Morphology_Feature_21', 'Morphology_Feature_22', 'Morphology_Feature_26', 'Morphology_Feature_27', 'Morphology_Feature_28', 'Morphology_Feature_30']
on_morph_signatures: ['Morphology_Feature_1', 'Morphology_Feature_2', 'Morphology_Feature_10', 'Morphology_Feature_12', 'Morphology_Feature_13', 'Morphology_Feature_18', 'Morphology_Feature_23', 'Morphology_Feature_24', 'Morphology_Feature_25', 'Morphology_Feature_29']
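
One way to quantify how well the recovered signatures match the planted ones is to compare the two lists as sets. The helper below is a hypothetical sketch (`score_recovery` is not part of the notebook) that reports precision and recall for the on-morphology features. The feature names in the example call are made up.

```python
from typing import Iterable, Tuple


def score_recovery(
    predicted_on: Iterable[str], true_on: Iterable[str]
) -> Tuple[float, float]:
    """Return (precision, recall) of predicted on-features against the planted ones."""
    predicted, truth = set(predicted_on), set(true_on)
    true_positives = len(predicted & truth)
    # Guard against empty sets to avoid division by zero
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy example: two of three predictions are correct, two of four planted features found
precision, recall = score_recovery(
    ["Feature_A", "Feature_B", "Feature_C"],
    ["Feature_A", "Feature_B", "Feature_D", "Feature_E"],
)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

For the run above, the call would be `score_recovery(on_morph_signatures, significant_features)`.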


## Applying weighted KS to the CFReT dataset 

Setting up paths