# Configure Aggregate Module Params

Use this notebook to test and confirm the aggregate parameters before running aggregate processing.
Cells marked with `SET PARAMETERS` contain variables that must be set for your specific experimental setup and data organization.
Review and modify these variables as needed before proceeding with the analysis.

## SET PARAMETERS

### Fixed parameters for aggregate processing

- `CONFIG_FILE_PATH`: Path to a Brieflow config file used during processing. Absolute or relative to where workflows are run from.

In [1]:
CONFIG_FILE_PATH = "config/config.yml"

In [2]:
from pathlib import Path

import yaml
import pandas as pd

from lib.aggregate.load import load_hdf_subset, clean_cell_data
from lib.aggregate.feature_processing import (
    feature_transform,
    suggest_parameters,
    grouped_standardization,
)

## SET PARAMETERS

### Loading subset of data

- `POPULATION_FEATURE`: The column name that identifies your perturbation groups (e.g., 'gene_symbol_0' for CRISPR screens, 'treatment' for drug screens)

In [3]:
POPULATION_FEATURE = "gene_symbol_0"

In [4]:
# load config file and determine root path
with open(CONFIG_FILE_PATH, "r") as config_file:
    config = yaml.safe_load(config_file)
ROOT_FP = Path(config["all"]["root_fp"])

# Load subset of data
# Takes ~1 minute
merge_final_fp = ROOT_FP / "merge_process" / "hdfs" / "merge_final.hdf5"
raw_df = load_hdf_subset(
    merge_final_fp, n_rows=50000, population_feature=POPULATION_FEATURE
)

# Remove unassigned cells
clean_df = clean_cell_data(raw_df, POPULATION_FEATURE, filter_single_gene=False)
# pd.Series(raw_df.columns).to_csv('aggregate_4/column_names.csv', index=False)
print(f"Loaded {len(raw_df)} cells with {len(raw_df.columns)} features")

Reading first 50,000 rows from analysis_root/merge_process/hdfs/merge_final.hdf5
Unique populations: 4623
well
A2    25154
A1    24846
Name: count, dtype: int64
Found 23022 cells with assigned perturbations
Loaded 50000 cells with 1683 features
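
As a quick sanity check on the subset, the counts printed above can be turned into an assigned-cell fraction. The sketch below hard-codes the numbers from this run's output; the variable names are illustrative and not part of Brieflow.

```python
# Counts copied from the cell output above (this particular run).
n_loaded = 50_000      # rows read from merge_final.hdf5
n_assigned = 23_022    # cells with an assigned perturbation
well_counts = {"A2": 25_154, "A1": 24_846}

# The per-well counts should add up to the number of rows loaded.
assert sum(well_counts.values()) == n_loaded

# Fraction of the loaded subset that survives the "assigned" filter.
assigned_fraction = n_assigned / n_loaded
print(f"{assigned_fraction:.1%} of loaded cells have an assigned perturbation")
```

If this fraction is much lower than expected for your screen, it may be worth revisiting `POPULATION_FEATURE` or the upstream merge step before tuning the remaining parameters.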


## SET PARAMETERS

### Apply feature transformations

- `TRANFORMATIONS_FP`: Path to a tab-separated (TSV) file containing feature transformation specifications. The file should have a `feature` column and a `transformation` column; each row defines a feature pattern and the transformation to apply to it (e.g., 'log(feature)', 'log(feature-1)').

In [5]:
TRANFORMATIONS_FP = "config/transformations.tsv"
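
To make the file layout concrete, here is a minimal sketch of how a transformations table with `feature` and `transformation` columns could be parsed and applied. The example rows, the substring matching rule, and the helper names `load_rules` and `transform_value` are assumptions for illustration; this is not how `feature_transform` is implemented.

```python
import csv
import io
import math

# Hypothetical contents of a transformations TSV: one row per feature pattern.
EXAMPLE_TSV = "feature\ttransformation\nmean\tlog(feature)\narea\tlog(feature-1)\n"

# Only the two expressions mentioned above are handled in this sketch.
TRANSFORMS = {
    "log(feature)": lambda x: math.log(x),
    "log(feature-1)": lambda x: math.log(x - 1),
}


def load_rules(tsv_text):
    """Parse tab-separated rows into (pattern, function) pairs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row["feature"], TRANSFORMS[row["transformation"]]) for row in reader]


def transform_value(column_name, value, rules):
    """Apply the first rule whose pattern occurs in the column name."""
    for pattern, func in rules:
        if pattern in column_name:
            return func(value)
    return value  # no matching rule: leave the value unchanged


rules = load_rules(EXAMPLE_TSV)
print(transform_value("nucleus_dapi_mean", 1.0, rules))   # log(1.0)
print(transform_value("cell_area", 2.0, rules))           # log(2.0 - 1)
print(transform_value("cell_eccentricity", 0.5, rules))   # no rule matches
```

Note that `math.log` raises `ValueError` for non-positive input, whereas NumPy's `log` on arrays returns NaN or `-inf` and emits a `RuntimeWarning`, so features with zero or negative values deserve a look before choosing a log transformation.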

In [6]:
# load lower case version of channels
channels = [channel.lower() for channel in config["phenotype_process"]["channel_names"]]

# load transformations
transformations = pd.read_csv(TRANFORMATIONS_FP, sep="\t")

# perform feature transformation
transformed_df = feature_transform(clean_df, transformations, channels)
transformed_df



Unnamed: 0,well,tile,cell_0,i_0,j_0,site,cell_1,i_1,j_1,distance,...,cell_number_neighbors_1,cell_percent_touching_1,cell_first_neighbor_distance,cell_second_neighbor_distance,cell_angle_between_neighbors,cytoplasm_number_neighbors_1,cytoplasm_percent_touching_1,cytoplasm_first_neighbor_distance,cytoplasm_second_neighbor_distance,cytoplasm_angle_between_neighbors
9,A2,905,567,1482.224982,1481.369056,217,1968,671.092784,101.958763,0.149559,...,0,0.000000,59.020638,59.446369,157.601917,0.0,0.000000,60.919552,62.456065,150.540954
11,A1,1502,499,1480.591457,1477.269347,354,269,100.472868,99.922481,0.145653,...,0,0.000000,50.862138,74.697479,151.299725,0.0,0.000000,52.537544,71.361624,165.587553
15,A2,1368,522,1480.857229,1476.934664,325,284,100.825243,100.436893,0.180161,...,0,0.000000,90.256086,96.366612,43.202061,0.0,0.000000,94.688817,101.806778,37.325935
19,A2,832,579,1483.300457,1481.761594,218,293,100.020619,101.144330,0.395906,...,1,0.285714,56.619206,67.570229,102.565383,1.0,0.177106,62.803962,66.980804,99.027314
23,A1,418,499,1476.278287,1478.299694,109,286,99.273810,98.892857,0.075175,...,0,0.000000,48.263332,66.138147,165.590938,0.0,0.000000,49.133680,71.571187,174.283650
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49990,A1,838,545,1598.246973,1327.407385,215,344,129.131313,61.484848,0.278980,...,0,0.000000,65.108539,91.830725,138.082263,0.0,0.000000,67.253756,92.008217,136.012198
49992,A1,1438,436,1412.378292,1660.818882,351,171,83.470588,146.098039,0.210144,...,0,0.000000,75.807628,97.823042,59.732586,0.0,0.000000,74.377920,94.409159,101.775957
49996,A2,361,344,1302.862956,1556.767205,82,1489,626.941748,120.485437,0.187461,...,0,0.000000,85.521857,103.909836,63.600237,0.0,0.000000,88.696553,101.486088,64.907789
49998,A2,1495,611,1619.156885,1613.819977,357,392,135.649123,705.315789,0.192781,...,1,0.160839,49.715872,52.119155,174.042063,1.0,0.090551,53.628804,57.179754,162.854502


## SET PARAMETERS

### Standardize features

- `CONTROL_PREFIX`: Prefix identifying control populations.
- `GROUP_COLUMNS`: Columns defining experimental groups (e.g., `['well']` for per-well standardization).
- `INDEX_COLUMNS`: Columns uniquely identifying cells (e.g., `['tile', 'cell_0']`).
- `CAT_COLUMNS`: Categorical columns to preserve.
- `FEATURE_START`: First column containing measured features.
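
The role of `FEATURE_START` can be sketched with a plain list: every column from that name onward is treated as a measured feature, and everything before it as metadata. The column list below is an abridged, hypothetical ordering, not the actual layout of the merged table.

```python
# Abridged, hypothetical column order for illustration only.
columns = [
    "well",
    "tile",
    "cell_0",
    "sgRNA_0",
    "gene_symbol_0",
    "mapped_single_gene",
    "nucleus_dapi_mean",
    "nucleus_coxiv_mean",
    "nucleus_dapi_std",
]

FEATURE_START = "nucleus_dapi_mean"

# Position of the first feature column; raises ValueError if the name is absent.
start = columns.index(FEATURE_START)

metadata_columns = columns[:start]   # everything before FEATURE_START
target_features = columns[start:]    # FEATURE_START and everything after it

print(metadata_columns)
print(target_features)
```

Because the split depends on column order, any metadata column that happens to sit after `FEATURE_START` would be standardized as if it were a feature.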

The helper function `suggest_parameters` suggests candidate values for these parameters.

In [7]:
suggest_parameters(clean_df, POPULATION_FEATURE)

Suggested Parameters:
--------------------------------------------------

Potential control prefixes found:
  - 'TNNT2'
  - 'nontargeting'
  - 'CNTROB'
  - 'INTS10'
  - 'CCNT1'
  - 'INTS5'
  - 'NTF4'
  - 'INTS13'
  - 'TENT4B'
  - 'B3GNT3'
  - 'NT5C3B'
  - 'RINT1'
  - 'PCNT'
  - 'INTS6'
  - 'FNTB'
  - 'INTS9'
  - 'SPINT1'
  - 'NT5C'
  - 'INTS8'
  - 'ARNTL2'
  - 'INTS12'
  - 'ZWINT'
  - 'NTF3'
  - 'TRNT1'
  - 'INTS2'
  - 'INTS3'
  - 'MNT'
  - 'BPNT1'
  - 'FNTA'
  - 'B3GNT4'
  - 'ANTXRL'
  - 'INTS7'
  - 'INTS11'
  - 'TNNT1'
  - 'KNTC1'
  - 'DNTTIP2'
  - 'B3GNT2'
  - 'CNTNAP1'

First few feature columns detected:
  - 'nucleus_dapi_mean'
  - 'nucleus_coxiv_mean'
  - 'nucleus_cenpa_mean'
  - 'nucleus_wga_mean'
  - 'nucleus_dapi_std'

Metadata columns detected:
  - Categorical: well, sgRNA_0, gene_symbol_0, mapped_single_gene


In [8]:
CONTROL_PREFIX = "nontargeting"
GROUP_COLUMNS = ["well"]
INDEX_COLUMNS = ["tile", "cell_0"]
CAT_COLUMNS = ["gene_symbol_0", "sgRNA_0"]
FEATURE_START = "nucleus_dapi_mean"
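
The candidate list printed by `suggest_parameters` is broad: every entry shown contains "NT", which is why real genes such as `TNNT2` appear next to `nontargeting`. The sketch below illustrates why a full, specific prefix is the safer choice. The population labels are hypothetical, and the use of `str.startswith` is an assumption based on the parameter being described as a prefix.

```python
CONTROL_PREFIX = "nontargeting"

# Hypothetical population labels; real labels come from POPULATION_FEATURE.
populations = ["nontargeting_12", "TNNT2", "INTS10", "NTF4", "nontargeting_3"]

# Prefix match: only labels that start with the control prefix count as controls.
controls = [p for p in populations if p.startswith(CONTROL_PREFIX)]

# A short prefix such as "NT" would wrongly select a real gene instead.
short_prefix_hits = [p for p in populations if p.startswith("NT")]

# A case-insensitive substring match on "nt" selects every label here.
substring_hits = [p for p in populations if "nt" in p.lower()]

print(controls)
print(short_prefix_hits)
print(substring_hits)
```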

In [9]:
# Identify features to standardize (FEATURE_START and all columns after it)
feature_start_idx = transformed_df.columns.get_loc(FEATURE_START)
target_features = transformed_df.columns[feature_start_idx:].tolist()
# Standardize the data
standardized_df = grouped_standardization(
    transformed_df,
    population_feature=POPULATION_FEATURE,
    control_prefix=CONTROL_PREFIX,
    group_columns=GROUP_COLUMNS,
    index_columns=INDEX_COLUMNS,
    cat_columns=CAT_COLUMNS,
    target_features=target_features,
    drop_features=False,
)
standardized_df

Unnamed: 0,well,tile,cell_0,i_0,j_0,site,cell_1,i_1,j_1,distance,...,cell_number_neighbors_1,cell_percent_touching_1,cell_first_neighbor_distance,cell_second_neighbor_distance,cell_angle_between_neighbors,cytoplasm_number_neighbors_1,cytoplasm_percent_touching_1,cytoplasm_first_neighbor_distance,cytoplasm_second_neighbor_distance,cytoplasm_angle_between_neighbors
0,A2,905,567,1482.224982,1481.369056,217,1968,671.092784,101.958763,0.149559,...,,,0.123011,-0.747438,0.793857,,,0.198123,-0.535340,0.657262
1,A1,1502,499,1480.591457,1477.269347,354,269,100.472868,99.922481,0.145653,...,,,-0.671283,0.691724,0.647273,,,-0.459141,0.379547,0.910414
2,A2,1368,522,1480.857229,1476.934664,325,284,100.825243,100.436893,0.180161,...,,,2.649679,2.599410,-1.671113,,,2.926781,3.243207,-1.868996
3,A2,832,579,1483.300457,1481.761594,218,293,100.020619,101.144330,0.395906,...,inf,inf,-0.071243,-0.011004,-0.392013,inf,inf,0.350389,-0.100864,-0.492203
4,A1,418,499,1476.278287,1478.299694,109,286,99.273810,98.892857,0.075175,...,,,-0.924515,-0.163673,1.001362,,,-0.801768,0.399206,1.101677
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23017,A1,838,545,1598.246973,1327.407385,215,344,129.131313,61.484848,0.278980,...,,,0.716910,2.403974,0.319788,,,1.022171,2.316367,0.259930
23018,A1,1438,436,1412.378292,1660.818882,351,171,83.470588,146.098039,0.210144,...,,,1.759448,3.002830,-1.621456,,,1.739279,2.541595,-0.493066
23019,A2,361,344,1302.862956,1556.767205,82,1489,626.941748,120.485437,0.187461,...,,,2.266722,3.283209,-1.231594,,,2.442589,3.212413,-1.253540
23020,A2,1495,611,1619.156885,1613.819977,357,392,135.649123,705.315789,0.192781,...,inf,inf,-0.629661,-1.411656,1.148093,inf,inf,-0.390991,-1.041983,0.932024
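
The NaN and `inf` entries in columns such as `cell_number_neighbors_1` are what one would expect if the control cells in a group all share the same value, so the scale used for standardization is zero. The sketch below shows this with a per-group z-score against controls. It assumes a mean and population standard deviation, which may differ from what `grouped_standardization` actually computes; the function name, keys, and data are hypothetical.

```python
import math
import statistics


def standardize_by_group(rows, feature, group_key, population_key, control_prefix):
    """Z-score `feature` within each group using that group's control cells."""
    # Collect control values per group (every group is assumed to have controls).
    control_values = {}
    for row in rows:
        if row[population_key].startswith(control_prefix):
            control_values.setdefault(row[group_key], []).append(row[feature])

    scores = []
    for row in rows:
        ctrl = control_values[row[group_key]]
        center = statistics.mean(ctrl)
        scale = statistics.pstdev(ctrl)
        diff = row[feature] - center
        if scale == 0:
            # Mimic floating-point array division: 0/0 -> NaN, x/0 -> +/-inf.
            scores.append(math.nan if diff == 0 else math.copysign(math.inf, diff))
        else:
            scores.append(diff / scale)
    return scores


# Hypothetical cells: "x" varies among controls, "n" is constant among controls.
rows = [
    {"well": "A1", "gene": "nontargeting_1", "x": 10.0, "n": 0},
    {"well": "A1", "gene": "nontargeting_2", "x": 14.0, "n": 0},
    {"well": "A1", "gene": "TNNT2", "x": 16.0, "n": 1},
    {"well": "A2", "gene": "nontargeting_1", "x": 20.0, "n": 0},
    {"well": "A2", "gene": "nontargeting_2", "x": 24.0, "n": 0},
    {"well": "A2", "gene": "INTS10", "x": 22.0, "n": 0},
]

x_scores = standardize_by_group(rows, "x", "well", "gene", "nontargeting")
n_scores = standardize_by_group(rows, "n", "well", "gene", "nontargeting")
print(x_scores)
print(n_scores)
```

If this is the cause, near-constant features are candidates for exclusion or for a different transformation before aggregation.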


## SET PARAMETERS

### Add file names

- `CHANNEL_DICT`: Maps fluorescence channels to their corresponding image files.
- `BASE_PH_FP`: Path to the `input_ph_tif` folder in your home directory.

In [None]:
CHANNEL_DICT = None
# `os` and `home` were never defined in this notebook; use pathlib (imported above) instead
BASE_PH_FP = str(Path.home() / "input_ph_tif")