# 1. Executing Mean Average Precision (mAP)

Mean Average Precision (mAP) is a flexible statistical framework used to measure the **phenotypic activity** of compounds by comparing them to control groups. In this notebook, we use high-content screening data generated with the Cell Painting assay to identify potential drug candidates that show evidence of reversing the effects of cardiac fibrosis. The dataset comprises **image-based profiles at the replicate level (well-level)**.

#### **Controls Used in the Screen**
To interpret mAP scores, we leverage the following control groups:
- **Negative control**: Failing cardiac fibroblasts (CF) treated with DMSO.
- **Positive control**: Healthy cardiac fibroblasts (CF) treated with DMSO.

#### **Interpreting mAP Scores**
- **High mAP Scores**:  
  Indicate that wells treated with a specific compound are highly phenotypically distinct compared to the control. This suggests the compound induces a strong and specific phenotypic change.
  
- **Low mAP Scores**:  
  Indicate that wells treated with a specific compound are phenotypically similar to the control. This suggests the compound has little to no phenotypic effect or a nonspecific one.
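
To make the two cases above concrete, here is a minimal sketch of how average precision (AP) could be computed for a single query well. It assumes the other wells have already been ranked from most to least similar to the query, with `1` marking a replicate of the same compound and `0` marking a control well. The function name and the toy rankings are illustrative assumptions, not the implementation used by `analysis_utils`.

```python
def average_precision(ranked_labels):
    """AP for one query well.

    ranked_labels is ordered from most to least similar to the query:
    1 marks a replicate of the same compound, 0 marks a control well.
    """
    hits = 0
    precisions = []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where a replicate is retrieved
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Replicates rank ahead of every control well: phenotypically distinct
high = average_precision([1, 1, 0, 0, 0])

# Replicates rank behind the control wells: hard to tell apart from control
low = average_precision([0, 0, 0, 1, 1])
```

Averaging such AP values over the replicates of a compound gives its mAP, so a compound whose replicates consistently rank ahead of the control wells ends up with a high score.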

#### **Biological Interpretation**
mAP scores help determine which compounds exhibit phenotypic changes that resemble those of healthy cells, making them potential candidates for reversing the effects of cardiac fibrosis. By comparing the phenotypic activity of compounds to both positive and negative controls, we can prioritize compounds for further validation.

**Outputs**
- AP scores generated using both the positive and negative controls
- mAP scores generated using both the positive and negative controls

In [1]:
import sys
import warnings
import pathlib

import pandas as pd
from pycytominer.cyto_utils import load_profiles
from tqdm import TqdmWarning

sys.path.append("../../")
from src import io_utils, data_utils, analysis_utils

# Suppress tqdm and runtime warnings
warnings.filterwarnings("ignore", category=TqdmWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)



This code sets up the necessary file paths and directories required for the notebook, ensuring that input files exist. 
It also creates a results folder if it doesn't already exist to store outputs generated during the analysis.

In [2]:
# Set the base data directory and ensure it exists (raises an error if it doesn't)
data_dir = pathlib.Path("../data/").resolve(strict=True)

# Set the metadata directory for the updated plate maps and ensure it exists
metadata_dir = pathlib.Path("../data/metadata/updated_platemaps").resolve(strict=True)

# Path to the updated barcode plate map file; ensure it exists
platemap_path = (metadata_dir / "updated_barcode_platemap.csv").resolve(strict=True)

# Path to the configuration file; strict=True raises an error if it is missing
config_path = pathlib.Path("../config.yaml").resolve(strict=True)

# Set the results directory, resolve its full path, and create it if it doesn't already exist
results_dir = pathlib.Path("./results/map_scores").resolve()
results_dir.mkdir(exist_ok=True, parents=True)

Loading the configuration file and the barcode plate map.

In [3]:
# Load the configuration file and extract the general configs
configs = io_utils.load_config(config_path)
general_configs = configs["general_configs"]

# Load the barcode plate map
barcode = pd.read_csv(platemap_path)

Since these files have undergone feature selection, it is essential to identify the overlapping feature names to ensure accurate and consistent analysis.

In [4]:
shared_cols = None
for aggregated_profile in list(data_dir.glob("*.parquet")):
    # Read the aggregated profile and collect its column names
    agg_df = pd.read_parquet(aggregated_profile)
    columns = list(agg_df.columns)

    # Update the shared_cols list
    if shared_cols is None:
        # Initialize shared columns with the first profile's columns, preserving order
        shared_cols = columns
    else:
        # Retain only the columns present in both the current profile and shared columns
        shared_cols = [col for col in shared_cols if col in columns]


In this section, plates are grouped by the plate map file they share, and each group is assigned a unique batch identifier (`batch_1`, `batch_2`, and so on). For every plate in a batch, the aggregated profile is loaded, restricted to the shared features, and tagged with its plate barcode in a new `Metadata_plate_barcode` column so that profiles from different plates remain distinguishable. The plates of each batch are then concatenated into a single DataFrame and stored under the batch identifier.

In [5]:
# Suffix for aggregated profiles
aggregated_file_suffix = "aggregated_post_fs.parquet"

# Dictionary to store loaded plate data grouped by batch
loaded_plate_batches = {}
loade_shuffled_plate_batches = {}

# Iterate over unique platemap files and their associated plates
for batch_index, (platemap_filename, associated_plates_df) in enumerate(
    barcode.groupby("platemap_file")
):
    # Generate a unique batch ID
    batch_id = f"batch_{batch_index + 1}"

    # Load the platemap CSV file
    platemap_path = (metadata_dir / f"{platemap_filename}.csv").resolve(strict=True)
    platemap_data = pd.read_csv(platemap_path)

    # Extract all plate names associated with the current platemap
    plate_barcodes = associated_plates_df["plate_barcode"].tolist()

    # List to store all loaded and processed aggregated plates for the current batch
    loaded_aggregated_plates = []

    for plate_barcode in plate_barcodes:
        # Resolve the file path for the aggregated plate data
        plate_file_path = (
            data_dir / f"{plate_barcode}_{aggregated_file_suffix}"
        ).resolve(strict=True)

        # Load the aggregated profile data for the current plate
        aggregated_data = load_profiles(plate_file_path)

        # Update loaded data frame with only shared features
        aggregated_data = aggregated_data[shared_cols]

        # Add a new column indicating the source plate for each row
        aggregated_data.insert(0, "Metadata_plate_barcode", plate_barcode)

        # Append the processed aggregated data for this plate to the batch list
        loaded_aggregated_plates.append(aggregated_data)

    # Combine all processed plates for the current batch into a single DataFrame
    combined_aggregated_data = pd.concat(loaded_aggregated_plates)
    meta_concat, feats_concat = data_utils.split_meta_and_features(combined_aggregated_data)

    # Store the combined DataFrame in the loaded_plate_batches dictionary
    loaded_plate_batches[batch_id] = combined_aggregated_data


In this section, we analyze a high-content screening dataset generated from Cell Painting experiments, in which failing cardiac fibroblasts are treated with multiple compounds. Our goal is to calculate the mean average precision (mAP) by comparing the experimental treatments to two controls: a negative control consisting of DMSO-treated failing cardiac fibroblasts and a positive control consisting of DMSO-treated healthy cardiac fibroblasts.

We start by preparing the dataset, copying the profiles, and assigning a reference index to ensure proper grouping of non-DMSO treatment replicates. Metadata and feature columns are separated to facilitate the calculation of average precision (AP) scores. To calculate these scores, we define positive pairs as treatments with the same metadata values (e.g., same treatment type) across all plates. Negative pairs, on the other hand, are determined by comparing all DMSO-treated wells across all plates with all other treatments.
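
The pairing rule described above can be sketched on a toy example. The well IDs and treatment names below are hypothetical, and the actual pairing is handled inside `analysis_utils` according to the configuration file; this only illustrates which wells end up as positive and negative pairs.

```python
from itertools import combinations

# Hypothetical well-level metadata: (well_id, treatment)
wells = [
    ("A01", "DMSO"),
    ("A02", "DMSO"),
    ("B01", "compound_1"),
    ("B02", "compound_1"),
    ("C01", "compound_2"),
    ("C02", "compound_2"),
]

positive_pairs = []
negative_pairs = []
for (well_a, trt_a), (well_b, trt_b) in combinations(wells, 2):
    if trt_a == trt_b and trt_a != "DMSO":
        # Replicates of the same non-DMSO treatment form a positive pair
        positive_pairs.append((well_a, well_b))
    elif (trt_a == "DMSO") != (trt_b == "DMSO"):
        # Exactly one well is DMSO: a treated well paired with a control well
        negative_pairs.append((well_a, well_b))
```

In this sketch, pairs of two DMSO wells and pairs of two different compounds are left out, since they are neither replicates of a treatment nor treatment-versus-control comparisons.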

Once the AP scores are computed, we aggregate them across all plates for each treatment to derive the mean average precision (mAP) score. This process captures the consistency of treatment performance relative to the controls and allows for a comprehensive evaluation of the dataset. Finally, we save both the AP and mAP scores for each control condition, providing a well-structured dataset for further interpretation and downstream analysis.
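
As a rough illustration of the aggregation step, the sketch below averages per-well AP scores into one mAP score per treatment with pandas. The column names `Metadata_treatment`, `average_precision`, and `mean_average_precision`, as well as the AP values, are made up for illustration; the real tables are produced by `analysis_utils.calculate_trt_map_batch_profiles` and may be structured differently.

```python
import pandas as pd

# Hypothetical AP table: one row per treated well (values are illustrative only)
ap_scores = pd.DataFrame(
    {
        "Metadata_treatment": ["compound_1", "compound_1", "compound_2", "compound_2"],
        "Metadata_plate_barcode": ["plate_1", "plate_2", "plate_1", "plate_2"],
        "average_precision": [0.9, 0.7, 0.2, 0.4],
    }
)

# Average the AP scores of each treatment across all plates to obtain its mAP
map_scores = (
    ap_scores.groupby("Metadata_treatment", as_index=False)["average_precision"]
    .mean()
    .rename(columns={"average_precision": "mean_average_precision"})
)
```

Running the same aggregation once per control condition would yield one mAP table relative to the negative control and one relative to the positive control.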

In [6]:
# Execute the mAP pipeline on the original (non-shuffled) profiles
analysis_utils.calculate_trt_map_batch_profiles(
    batched_profiles=loaded_plate_batches,
    configs=configs,
    outdir_path=results_dir,
    shuffled=False
)

                                             