## Process JUMP phenotypic profiles

We applied the AreaShape-only, class-balanced, multiclass elastic net logistic regression model to all single-cell profiles in the JUMP dataset.

We then performed a series of Kolmogorov-Smirnov (KS) tests to identify how the distribution of each phenotype probability under each treatment differed from the corresponding distribution in controls.

See https://github.com/WayScience/JUMP-single-cell/tree/main/3.analyze_data#analyze-predicted-probabilities for complete details.
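As a rough illustration of a single comparison of this kind, the sketch below runs a two-sample KS test with `scipy.stats.ks_2samp` on simulated probabilities. The simulated values and variable names are assumptions made for illustration only; the linked repository documents the actual procedure.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical per-cell probabilities for one phenotype (simulated, not JUMP data)
control_probs = rng.beta(2, 8, size=500)
treatment_probs = rng.beta(4, 6, size=500)

# Two-sample KS test: the statistic is the largest gap between the two empirical CDFs
result = ks_2samp(treatment_probs, control_probs)
ks_statistic, ks_p_value = result.statistic, result.pvalue

print(f"KS statistic: {ks_statistic:.3f}, p-value: {ks_p_value:.3g}")
```

In the loaded results, `comparison_metric_value` and `p_value` hold one such statistic and p-value per plate, well, and phenotype.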

Here, we perform the following:

1. Load these data from the JUMP-single-cell repo
2. Summarize replicate KS test metrics (mean value) and align across cell types and time variables
3. Explore the top results per phenotype/treatment_type/model_type (Supplementary Table S1)
4. Convert the data to wide format

This wide format represents a "phenotypic profile", which we can use in the same way as an image-based morphology profile.
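To make the idea of a wide "phenotypic profile" concrete, here is a minimal sketch on a toy table. The treatment names and scores are made up, and the real conversion may use additional index columns.

```python
import pandas as pd

# Toy long-format table: one row per treatment/phenotype pair (values are made up)
toy_long_df = pd.DataFrame({
    "treatment": ["drug_a", "drug_a", "drug_b", "drug_b"],
    "phenotype": ["Apoptosis", "Interphase", "Apoptosis", "Interphase"],
    "comparison_metric_value": [0.40, 0.05, 0.10, 0.30],
})

# Wide format: one row per treatment, one column per phenotype
toy_profile_df = toy_long_df.pivot(
    index="treatment",
    columns="phenotype",
    values="comparison_metric_value",
)

print(toy_profile_df)
```

Each row of `toy_profile_df` is then a vector of phenotype scores for one treatment, analogous to a feature vector in a morphology profile.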

In [1]:
import pathlib
from typing import List
import pandas as pd

import umap

In [2]:
# Set file paths
# JUMP phenotype probabilities from AreaShape model
commit = "4225e427fd9da59159de69f53be65c31b4d4644a"

url = "https://github.com/WayScience/JUMP-single-cell/raw"
file = "3.analyze_data/class_balanced_well_log_reg_comparison_results/class_balanced_well_log_reg_areashape_model_comparisons.parquet"

jump_sc_pred_file = f"{url}/{commit}/{file}"

# Set constants
n_top_results_to_explore = 10

In [3]:
# Set output files
output_dir = "jump_phenotype_profiles"

cell_type_time_comparison_file = pathlib.Path(output_dir, "jump_compare_cell_types_and_time_across_phenotypes.tsv.gz")
top_results_summary_file = pathlib.Path(output_dir, "jump_most_significant_phenotype_enrichment.tsv")
final_jump_phenotype_file = pathlib.Path(output_dir, "jump_phenotype_profiles.tsv.gz")
shuffled_jump_phenotype_file = pathlib.Path(output_dir, "jump_phenotype_profiles_shuffled.tsv.gz")

## Load and process data

In [4]:
# Load KS test results and drop uninformative columns
jump_pred_df = (
    pd.read_parquet(jump_sc_pred_file)
    .drop(columns=["statistical_test", "comparison_metric"])
)

print(jump_pred_df.shape)
jump_pred_df.head()

(485370, 11)


| | comparison_metric_value | p_value | Metadata_Plate | treatment | Metadata_model_type | treatment_type | Metadata_Well | Cell_type | Time | cell_count | phenotype |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.091654 | 0.01313 | BR00117002 | ABL1 | final | crispr | C01 | A549 | 144 | 592 | ADCCM |
| 1 | 0.118823 | 0.000441 | BR00117002 | ABL1 | final | crispr | C01 | A549 | 144 | 592 | Anaphase |
| 2 | 0.121319 | 0.000273 | BR00117002 | ABL1 | final | crispr | C01 | A549 | 144 | 592 | Apoptosis |
| 3 | 0.054403 | 0.332314 | BR00117002 | ABL1 | final | crispr | C01 | A549 | 144 | 592 | Binuclear |
| 4 | 0.030717 | 0.931704 | BR00117002 | ABL1 | final | crispr | C01 | A549 | 144 | 592 | Elongated |


In [5]:
# Process data to match treatments and scores across cell types
jump_pred_compare_df = (
    jump_pred_df
    # Summarize replicate scores
    .groupby([
        "Cell_type",
        "Time",
        "treatment",
        "treatment_type",
        "Metadata_model_type",
        "phenotype"
    ])
    .agg({
        "comparison_metric_value": "mean",
        "p_value": "mean"
    })
    .reset_index()
    # Compare per treatment scores across cell types
    .pivot(
        index=[
            "treatment",
            "treatment_type",
            "Time",
            "phenotype",
            "Metadata_model_type"
        ],
        columns="Cell_type",
        values=[
            "comparison_metric_value",
            "p_value"
        ]
    )
    .reset_index()
)

# Clean up column names
jump_pred_compare_df.columns = jump_pred_compare_df.columns.map(lambda x: '_'.join(filter(None, x)))

# Output file
jump_pred_compare_df.to_csv(cell_type_time_comparison_file, sep="\t", index=False)

print(jump_pred_compare_df.shape)
jump_pred_compare_df.head()

(37320, 9)


| | treatment | treatment_type | Time | phenotype | Metadata_model_type | comparison_metric_value_A549 | comparison_metric_value_U2OS | p_value_A549 | p_value_U2OS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1-EBIO | compound | 24 | ADCCM | final | 0.042401 | 0.049459 | 0.142509 | 0.53025 |
| 1 | 1-EBIO | compound | 24 | ADCCM | shuffled | 0.027813 | 0.040563 | 0.59524 | 0.539444 |
| 2 | 1-EBIO | compound | 24 | Anaphase | final | 0.036235 | 0.06436 | 0.343061 | 0.285088 |
| 3 | 1-EBIO | compound | 24 | Anaphase | shuffled | 0.033513 | 0.048337 | 0.461138 | 0.346912 |
| 4 | 1-EBIO | compound | 24 | Apoptosis | final | 0.033457 | 0.072364 | 0.468468 | 0.10051 |
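
The column clean-up step in the cell above relies on how `pivot` followed by `reset_index` produces two-level column names. The toy sketch below (made-up values and treatment names) shows the tuples being joined into names such as `p_value_A549`.

```python
import pandas as pd

# Toy data with two cell types per treatment (values are made up)
toy_df = pd.DataFrame({
    "treatment": ["drug_a", "drug_a", "drug_b", "drug_b"],
    "Cell_type": ["A549", "U2OS", "A549", "U2OS"],
    "comparison_metric_value": [0.1, 0.2, 0.3, 0.4],
    "p_value": [0.5, 0.04, 0.2, 0.01],
})

toy_wide_df = toy_df.pivot(
    index="treatment",
    columns="Cell_type",
    values=["comparison_metric_value", "p_value"],
).reset_index()

# Columns are now tuples such as ("treatment", "") and ("p_value", "A549").
# filter(None, x) drops the empty string so "treatment" gets no trailing underscore.
toy_wide_df.columns = toy_wide_df.columns.map(lambda x: "_".join(filter(None, x)))

flattened_columns = list(toy_wide_df.columns)
print(flattened_columns)
```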


In [6]:
# Focus on the top results for downstream interpretation
jump_focused_top_results_df = (
    jump_pred_df
    .groupby(["Metadata_model_type", "treatment_type", "Cell_type", "Time", "phenotype"])
    .apply(lambda x: x.nsmallest(n_top_results_to_explore, "p_value"))
    .reset_index(drop=True)
)

jump_focused_top_results_df.to_csv(top_results_summary_file, sep="\t", index=False)

print(jump_focused_top_results_df.shape)
jump_focused_top_results_df.head()

(3600, 11)


| | comparison_metric_value | p_value | Metadata_Plate | treatment | Metadata_model_type | treatment_type | Metadata_Well | Cell_type | Time | cell_count | phenotype |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.50947 | 2.854138e-58 | BR00116992 | CYT-997 | final | compound | C09 | A549 | 24 | 489 | ADCCM |
| 1 | 0.508134 | 3.137002e-54 | BR00116993 | CYT-997 | final | compound | E06 | A549 | 24 | 456 | ADCCM |
| 2 | 0.398279 | 1.072331e-53 | BR00116993 | fludarabine-phosphate | final | compound | N12 | A549 | 24 | 753 | ADCCM |
| 3 | 0.468131 | 4.861256e-51 | BR00116993 | CYT-997 | final | compound | C09 | A549 | 24 | 509 | ADCCM |
| 4 | 0.495328 | 7.660156e-51 | BR00116992 | CYT-997 | final | compound | E06 | A549 | 24 | 450 | ADCCM |
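
Selecting the top rows per group can also be written without `groupby(...).apply(...)`. The sketch below uses `sort_values` followed by `groupby(...).head(n)` on a toy table; this is an alternative formulation, not the approach used in the cell above, and the data are made up. Recent pandas versions may emit a warning when `apply` operates on the grouping columns, which this form avoids.

```python
import pandas as pd

# Toy results table (values are made up)
toy_results_df = pd.DataFrame({
    "phenotype": ["Apoptosis"] * 3 + ["Interphase"] * 3,
    "treatment": ["a", "b", "c", "a", "b", "c"],
    "p_value": [0.20, 0.01, 0.05, 0.50, 0.30, 0.001],
})

n_top = 2  # hypothetical stand-in for n_top_results_to_explore

toy_top_df = (
    toy_results_df
    # Sort once so the smallest p-values come first within every group
    .sort_values("p_value")
    .groupby("phenotype")
    .head(n_top)
    # Restore a readable order: by group, then by p-value
    .sort_values(["phenotype", "p_value"])
    .reset_index(drop=True)
)

print(toy_top_df)
```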


## Summarize data

In [7]:
# How many unique plates?
jump_pred_df.Metadata_Plate.nunique()

51

In [8]:
# How many KS test results (rows) per treatment type for the final model?
jump_pred_df.query("Metadata_model_type == 'final'").treatment_type.value_counts()

compound    113505
crispr       86130
orf          43050
Name: treatment_type, dtype: int64

In [9]:
# How many unique treatments per treatment type?
jump_pred_df.groupby("treatment_type").treatment.nunique()

treatment_type
compound    302
crispr      160
orf         160
Name: treatment, dtype: int64

In [10]:
# How many KS test results (rows) per phenotype for the final model?
jump_pred_df.query("Metadata_model_type == 'final'").phenotype.value_counts()

ADCCM                 16179
Anaphase              16179
Apoptosis             16179
Binuclear             16179
Elongated             16179
Grape                 16179
Hole                  16179
Interphase            16179
Large                 16179
Metaphase             16179
MetaphaseAlignment    16179
OutOfFocus            16179
Polylobed             16179
Prometaphase          16179
SmallIrregular        16179
Name: phenotype, dtype: int64