# Configure Aggregate Module Params

Use this notebook to test and confirm the aggregate parameters before running aggregate processing.
Cells marked with `SET PARAMETERS` contain crucial variables that need to be set according to your specific experimental setup and data organization.
Please review and modify these variables as needed before proceeding with the analysis.

## SET PARAMETERS

### Fixed parameters for aggregate processing

- `CONFIG_FILE_PATH`: Path to the Brieflow config file used during processing. The path can be absolute or relative to the directory that workflows are run from.

In [1]:
CONFIG_FILE_PATH = "config/config.yml"
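Before running the remaining cells, it can help to confirm that the parsed config contains the entries this notebook reads (`all.root_fp` and `phenotype_process.channel_names`). The following is a minimal sketch, not part of Brieflow: `find_missing_keys` is a hypothetical helper, and `example_config` is a hand-written stand-in for the dictionary that `yaml.safe_load` would return.

```python
def find_missing_keys(config, required):
    """Return the dotted key paths from `required` that are absent in `config`."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            # Stop descending as soon as a level is missing or is not a mapping
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


# Hypothetical stand-in for the parsed contents of config/config.yml
example_config = {
    "all": {"root_fp": "analysis_root"},
    "phenotype_process": {"channel_names": ["DAPI", "COXIV", "CENPA", "WGA"]},
}

# The two entries this notebook reads from the config
required_keys = ["all.root_fp", "phenotype_process.channel_names"]
missing_keys = find_missing_keys(example_config, required_keys)
print(missing_keys)
```

An empty list means both entries are present; any returned path points at the part of the config that still needs to be filled in.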

In [2]:
from pathlib import Path

import yaml

from lib.aggregate.load import load_hdf_subset, clean_cell_data
from lib.aggregate.feature_processing import feature_transform

## SET PARAMETERS

### Loading subset of data

- `POPULATION_FEATURE`: The column name that identifies your perturbation groups (e.g., 'gene_symbol_0' for CRISPR screens, 'treatment' for drug screens)

In [3]:
POPULATION_FEATURE = 'gene_symbol_0'
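To illustrate what the population column is used for, the sketch below groups a few cells by that column and counts the distinct groups. It uses only the standard library, and the `cells` records and gene names are made up for illustration; the real data is loaded from the merged HDF file.

```python
from collections import Counter

POPULATION_FEATURE = "gene_symbol_0"

# Hypothetical per-cell records standing in for rows of the merged data
cells = [
    {"gene_symbol_0": "GENE_A"},
    {"gene_symbol_0": "GENE_B"},
    {"gene_symbol_0": "GENE_A"},
]

# Count how many cells fall into each perturbation group
counts = Counter(cell[POPULATION_FEATURE] for cell in cells)
print(f"Unique populations: {len(counts)}")
```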

In [4]:
# load config file and determine root path
with open(CONFIG_FILE_PATH, "r") as config_file:
    config = yaml.safe_load(config_file)
ROOT_FP = Path(config["all"]["root_fp"])

# Load subset of data
# Takes ~1 minute
merge_final_fp = ROOT_FP / "merge_process" / "hdfs" / "merge_final.hdf5"
raw_df = load_hdf_subset(merge_final_fp, n_rows=50000, population_feature=POPULATION_FEATURE)

# Remove unassigned cells
clean_df = clean_cell_data(raw_df, POPULATION_FEATURE, filter_single_gene=False)
# pd.Series(raw_df.columns).to_csv('aggregate_4/column_names.csv', index=False)
print(f"Loaded {len(raw_df)} cells with {len(raw_df.columns)} features")

Reading first 50,000 rows from analysis_root/merge_process/hdfs/merge_final.hdf5
Unique populations: 4623
well
A2    25154
A1    24846
Name: count, dtype: int64
Found 23022 cells with assigned perturbations
Loaded 50000 cells with 1683 features
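The gap between the 50,000 loaded rows and the 23,022 cells with assigned perturbations comes from dropping cells that have no value in the population column. The internals of `clean_cell_data` are not shown in this notebook, so the following is only a sketch of that kind of filter on plain dictionaries; `drop_unassigned` and the sample rows are hypothetical.

```python
def drop_unassigned(rows, population_feature):
    """Keep only rows whose population column holds a non-empty value."""
    return [
        row
        for row in rows
        if row.get(population_feature) not in (None, "")
    ]


# Hypothetical rows: two assigned cells, one with a null value, one missing the column
sample_rows = [
    {"well": "A1", "gene_symbol_0": "GENE_A"},
    {"well": "A1", "gene_symbol_0": None},
    {"well": "A2", "gene_symbol_0": "GENE_B"},
    {"well": "A2"},
]

assigned_rows = drop_unassigned(sample_rows, "gene_symbol_0")
print(f"Found {len(assigned_rows)} cells with assigned perturbations")
```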


## SET PARAMETERS

### Apply feature transformations

- `TRANFORMATIONS_FP`: TSV file containing feature transformation specifications. Each row defines a feature pattern and its transformation (e.g., 'log(feature)', 'log(feature-1)'). The file should have a `feature` column and a `transformation` column.

In [None]:
TRANFORMATIONS_FP = "config/transformations.tsv"
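As an illustration of the file layout described above, the sketch below parses a small tab-separated specification and applies one of the example transformations to a single value. The file contents, the `read_specs` helper, and the `TRANSFORMS` lookup are assumptions made for this example; how Brieflow actually interprets each row is not shown here.

```python
import csv
import io
import math

# Hypothetical file contents with the two required columns
example_tsv = (
    "feature\ttransformation\n"
    "mean\tlog(feature)\n"
    "max\tlog(feature-1)\n"
)

# Map each example transformation string to a plain function
TRANSFORMS = {
    "log(feature)": lambda x: math.log(x),
    "log(feature-1)": lambda x: math.log(x - 1),
}


def read_specs(text):
    """Parse tab-separated text into (feature, transformation) pairs."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [(row["feature"], row["transformation"]) for row in reader]


specs = read_specs(example_tsv)
for feature, transformation in specs:
    print(feature, "->", transformation)

# Apply 'log(feature-1)' to a single value: log(2.0 - 1)
value = TRANSFORMS["log(feature-1)"](2.0)
```

A lookup table is used instead of evaluating the transformation strings directly, so only known transformations can run.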

In [6]:
# load lower case version of channels
CHANNELS = [channel.lower() for channel in config["phenotype_process"]["channel_names"]]


CHANNELS

['DAPI', 'COXIV', 'CENPA', 'WGA']
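Lower-casing matters because the feature columns embed the channel name in lower case (for example `nucleus_dapi_mean`). A short sketch of that mapping, assuming the `nucleus_{channel}_mean` naming seen in the feature columns; the variable names are illustrative only.

```python
# Channel names as they might appear in the config
channel_names = ["DAPI", "COXIV", "CENPA", "WGA"]

# Lower-case them so they line up with the feature column names
channels = [name.lower() for name in channel_names]

# Build the per-channel nucleus mean column names
mean_columns = [f"nucleus_{channel}_mean" for channel in channels]
print(mean_columns)
```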

In [7]:
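import pandas as pd

# Load the transformation specifications (read as tab-separated, matching the .tsv extension)
transformations = pd.read_csv(TRANFORMATIONS_FP, sep="\t")

# Apply the transformations with the feature_transform function imported above
transformed_df = feature_transform(clean_df, transformations, CHANNELS)
transformed_df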

Columns of `transformed_df` (cell values not shown):

well, tile, cell_0, i_0, j_0, site, cell_1, i_1, j_1, distance, fov_distance_0, fov_distance_1, sgRNA_0, gene_symbol_0, mapped_single_gene, channels_min,
nucleus_dapi_int, nucleus_coxiv_int, nucleus_cenpa_int, nucleus_wga_int,
nucleus_dapi_mean, nucleus_coxiv_mean, nucleus_cenpa_mean, nucleus_wga_mean,
nucleus_dapi_std, nucleus_coxiv_std, nucleus_cenpa_std, nucleus_wga_std,
nucleus_dapi_max, nucleus_coxiv_max, nucleus_cenpa_max, nucleus_wga_max,
nucleus_dapi_min, nucleus_coxiv_min, nucleus_cenpa_min, nucleus_wga_min,
nucleus_dapi_int_edge, nucleus_coxiv_int_edge, nucleus_cenpa_int_edge, nucleus_wga_int_edge,
nucleus_dapi_mean_edge, nucleus_coxiv_mean_edge, nucleus_cenpa_mean_edge, nucleus_wga_mean_edge,
nucleus_dapi_std_edge, nucleus_coxiv_std_edge, nucleus_cenpa_std_edge, nucleus_wga_std_edge,
nucleus_dapi_max_edge, nucleus_coxiv_max_edge, nucleus_cenpa_max_edge, nucleus_wga_max_edge,
nucleus_dapi_min_edge, nucleus_coxiv_min_edge, nucleus_cenpa_min_edge, nucleus_wga_min_edge,
nucleus_dapi_mass_displacement, nucleus_coxiv_mass_displacement,
n