In [1]:
import os
import pandas as pd

# __file__ is undefined in a notebook, so use the current working directory
# (normally the notebook's folder) as the base path.
notebook_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(notebook_dir, ".."))

In [2]:
from mozzarellm import analyze_gene_clusters, reshape_to_clusters
from mozzarellm.prompts import ROBUST_SCREEN_CONTEXT, ROBUST_CLUSTER_PROMPT
from mozzarellm.configs import DEFAULT_OPENAI_REASONING_CONFIG

# Store your keys in a .env file, or set your API key here:
# os.environ["OPENAI_API_KEY"] = "your_openai_key_here"
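
Whether mozzarellm reads a .env file on its own is not shown in this notebook. If you need to load one yourself, the following is a minimal standard-library sketch, assuming the file holds simple KEY=VALUE lines. `load_env_file` is a hypothetical helper, not part of mozzarellm; packages such as python-dotenv handle quoting and edge cases more thoroughly.

```python
import os


def load_env_file(path=".env"):
    """Read simple KEY=VALUE lines from `path` into os.environ.

    Hypothetical helper for illustration. Variables that are already
    set in the environment are left untouched.
    """
    loaded = {}
    if not os.path.exists(path):
        return loaded
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            if key and key not in os.environ:
                os.environ[key] = value
                loaded[key] = value
    return loaded
```

Calling `load_env_file()` before the analysis cell makes `OPENAI_API_KEY` available without hard-coding it in the notebook.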

In [3]:
sample_data = pd.read_csv(os.path.join(notebook_dir, "sample_data.csv"))

In [4]:
cluster_df, gene_features = reshape_to_clusters(
    input_df=sample_data, uniprot_col="uniprot_function", verbose=True
)

Using provided DataFrame with 140 rows
Found 140 genes across 6 clusters
Extracting gene features from uniprot_function column
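
The log lines above suggest that the reshaping step groups genes by cluster and pulls the annotation column into a per-gene table. A rough pandas equivalent is sketched below. It is not the library's implementation: the input column names `gene_symbol` and `cluster` are assumptions about the sample file, `reshape_sketch` is a hypothetical name, and the annotation strings are placeholders.

```python
import pandas as pd


def reshape_sketch(df, gene_col="gene_symbol", cluster_col="cluster",
                   uniprot_col="uniprot_function"):
    """Collapse a one-row-per-gene table into one row per cluster."""
    # One row per cluster, with member genes joined by semicolons.
    cluster_df = (
        df.groupby(cluster_col, sort=True)[gene_col]
        .agg(";".join)
        .reset_index()
        .rename(columns={cluster_col: "cluster_id", gene_col: "genes"})
    )
    # One row per gene, keeping only the annotation column.
    gene_features = (
        df[[gene_col, uniprot_col]]
        .drop_duplicates(subset=gene_col)
        .reset_index(drop=True)
    )
    return cluster_df, gene_features


# Toy input with placeholder annotations (not real UniProt text).
toy = pd.DataFrame(
    {
        "gene_symbol": ["AATF", "ABT1", "POMP"],
        "cluster": [21, 21, 167],
        "uniprot_function": ["placeholder A", "placeholder B", "placeholder C"],
    }
)
toy_clusters, toy_features = reshape_sketch(toy)
```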


In [5]:
display(cluster_df)
display(gene_features)

Unnamed: 0,cluster_id,genes
0,21,AATF;ABT1;BYSL;BMS1;C1orf131;EIF3M;EIF4A1;ESF1...
1,37,SRSF3;PDPK1;RICTOR;RPTOR;SEH1L;SGF29;PRKAR1A;P...
2,121,CCDC174;FAM32A;GABPA;SP2;N6AMT1;SETD2;SON;POU5...
3,149,KRAS;BRAF;NDUFV2;NDUFA6;NDUFC1;RAD23B;SNAPC1;N...
4,167,POMP;PSMA2;PSMB7;PSMB3;PSMA7;PSMB2;PSMA1;PSMA4...
5,197,SPAST;NCOR2;NCAPD3;HNRNPD;MCM3;METTL14;METTL3;...


Unnamed: 0,gene_symbol,uniprot_function
0,AATF,"Part of the small subunit (SSU) processome, fi..."
1,ABT1,Could be a novel TATA-binding protein (TBP) wh...
2,BYSL,Required for processing of 20S pre-rRNA precur...
3,BMS1,GTPase required for the synthesis of 40S ribos...
4,C1orf131,"Part of the small subunit (SSU) processome, fi..."
...,...,...
135,METTL3,The METTL3-METTL14 heterodimer forms a N6-meth...
136,FITM1,Plays an important role in the formation of li...
137,PTCHD4,Could act as a repressor of canonical hedgehog...
138,VRK1,Serine/threonine kinase involved in the regula...


In [6]:
print(ROBUST_SCREEN_CONTEXT)


Genes grouped within a cluster tend to exhibit similar morphological phenotypes in this context, suggesting that they may participate in the same biological process or pathway. However, not all clusters will correspond to a defined or coherent biological pathway.

When evaluating pathway confidence, apply these stringent criteria:

HIGH CONFIDENCE:
- Multiple well-established genes (≥3) with strong literature support in the same specific pathway
- Clear functional relationship between genes that explains the observed phenotypic clustering
- Genes represent different aspects or components of the same biological process
- The pathway assignment explains >60% of genes in the cluster

MEDIUM CONFIDENCE:
- Some established genes (1-2) from a specific pathway, with additional supporting genes
- Functional relationship is plausible but has some gaps or uncertainties
- Some genes in the cluster have unclear relationship to the proposed pathway
- The pathway assignment explains 40-60% of genes
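
The numeric thresholds in this context can be read as a simple decision rule. The sketch below encodes only those numeric parts (established-gene count and fraction of genes explained); the qualitative criteria are left to the model. Treating everything else as low confidence is an assumption, since the printed context shows only the HIGH and MEDIUM tiers, and `confidence_from_counts` is a hypothetical helper.

```python
def confidence_from_counts(established_count, explained_fraction):
    """Map the numeric criteria to a confidence tier.

    established_count: number of well-established pathway genes.
    explained_fraction: share of cluster genes explained (0.0 to 1.0).
    """
    # HIGH: at least 3 established genes and more than 60% explained.
    if established_count >= 3 and explained_fraction > 0.60:
        return "HIGH"
    # MEDIUM: at least 1 established gene and at least 40% explained.
    if established_count >= 1 and explained_fraction >= 0.40:
        return "MEDIUM"
    # Assumption: anything else counts as LOW.
    return "LOW"
```

The model may assign a different tier than this rule would, because it also weighs literature support and functional coherence.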

In [7]:
print(ROBUST_CLUSTER_PROMPT)


Analyze gene cluster {cluster_id} to identify the dominant biological pathway and classify genes:

Genes: {gene_list}

For each cluster:
1. Identify the dominant biological pathway, focusing on specific molecular mechanisms rather than general terms
2. For clusters with coherent biological signatures, classify each gene into one of three mutually exclusive categories:
   - ESTABLISHED: Well-known members of the identified pathway with clear functional roles in this pathway
   - UNCHARACTERIZED: Genes with minimal to no functional annotation in ANY published literature
   - NOVEL_ROLE: Genes with published functional annotation in OTHER pathways that may have additional roles in the dominant pathway

3. For both UNCHARACTERIZED and NOVEL_ROLE genes:
   - Assign a priority score (1-10) for follow-up investigation
   - Provide a rationale explaining why this gene merits investigation

4. Provide a concise summary of the key findings for each cluster

When classifying and prioritizing gen
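
The `{cluster_id}` and `{gene_list}` placeholders in this template are filled in per cluster before the prompt is sent. How mozzarellm does this internally is not shown here. The sketch below assumes plain `str.format` substitution on a shortened stand-in template, and the comma-separated gene list is a further assumption.

```python
# Shortened stand-in for the full template, keeping the two placeholders.
template = (
    "Analyze gene cluster {cluster_id} to identify the dominant "
    "biological pathway and classify genes:\n\nGenes: {gene_list}\n"
)

# The semicolon-separated format matches the `genes` column of cluster_df.
genes_field = "POMP;PSMA2;PSMB7"
prompt = template.format(
    cluster_id=167,
    gene_list=", ".join(genes_field.split(";")),
)
print(prompt)
```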

In [8]:
DEFAULT_OPENAI_REASONING_CONFIG

{'MODEL': 'o4-mini',
 'CONTEXT': 'You are an AI assistant specializing in genomics and systems biology with expertise in pathway analysis. Your task is to analyze gene clusters to identify biological pathways and potential novel pathway members based on published literature and gaps in knowledge of gene function.',
 'TEMP': 1.0,
 'MAX_TOKENS': 8000,
 'RATE_PER_TOKEN': 1e-05,
 'DOLLAR_LIMIT': 10.0,
 'LOG_NAME': 'cluster_analysis',
 'API_TYPE': 'openai'}
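
The `MAX_TOKENS`, `RATE_PER_TOKEN`, and `DOLLAR_LIMIT` fields imply a rough spending ceiling. The arithmetic below is a back-of-the-envelope estimate, assuming cost is simply tokens multiplied by the per-token rate and counting only responses of up to `MAX_TOKENS`. How the library actually meters usage is not shown here, and prompt tokens would add to the total.

```python
config = {"MAX_TOKENS": 8000, "RATE_PER_TOKEN": 1e-05, "DOLLAR_LIMIT": 10.0}
n_clusters = 6  # number of clusters in the sample data

# Worst case: every response uses the full token allowance.
max_cost_per_call = config["MAX_TOKENS"] * config["RATE_PER_TOKEN"]
max_cost_for_run = max_cost_per_call * n_clusters

print(f"Upper bound per call: ${max_cost_per_call:.2f}")
print(f"Upper bound for {n_clusters} clusters: ${max_cost_for_run:.2f}")
print("Within limit:", max_cost_for_run <= config["DOLLAR_LIMIT"])
```

Under these assumptions the ceiling is about $0.08 per call and $0.48 for six clusters, well below the $10 limit.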

In [9]:
# Run analysis with OpenAI o4-mini
openai_results = analyze_gene_clusters(
    # Input data options
    input_df=cluster_df,
    # Model and configuration
    model_name="o4-mini",
    config_dict=DEFAULT_OPENAI_REASONING_CONFIG,
    # Analysis context and prompts
    screen_context=ROBUST_SCREEN_CONTEXT,
    cluster_analysis_prompt=ROBUST_CLUSTER_PROMPT,
    # Gene annotations
    gene_annotations_df=gene_features,
    # Processing options
    batch_size=1,
    # Output options
    save_outputs=False,
    outputs_to_generate=["json", "clusters", "flagged_genes"],
)

Loaded data with 6 rows and columns: ['cluster_id', 'genes']
Created annotations dictionary with 140 entries from DataFrame


Processing clusters:   0%|          | 0/6 [00:00<?, ?it/s]

Using provided template string
Appending output format instructions to template
Added 39 gene feature descriptions to prompt


Processing clusters:  17%|█▋        | 1/6 [00:35<02:59, 35.86s/it]

Using provided template string
Appending output format instructions to template
Added 33 gene feature descriptions to prompt


Processing clusters:  33%|███▎      | 2/6 [01:09<02:17, 34.42s/it]

Using provided template string
Appending output format instructions to template
Added 22 gene feature descriptions to prompt


Processing clusters:  50%|█████     | 3/6 [02:10<02:19, 46.62s/it]

Using provided template string
Appending output format instructions to template
Added 19 gene feature descriptions to prompt


Processing clusters:  67%|██████▋   | 4/6 [02:32<01:13, 36.93s/it]

Using provided template string
Appending output format instructions to template
Added 16 gene feature descriptions to prompt


Processing clusters:  83%|████████▎ | 5/6 [02:47<00:29, 29.05s/it]

Using provided template string
Appending output format instructions to template
Added 11 gene feature descriptions to prompt


Processing clusters: 100%|██████████| 6/6 [03:08<00:00, 31.50s/it]
INFO:cluster_analysis_20250506_163322.log:Completed analysis for 6 clusters without saving to disk


In [10]:
openai_results["cluster_df"]

Unnamed: 0,cluster_id,cluster_biological_process,pathway_confidence_level,cluster_importance_score,follow_up_suggestion,established_genes,established_gene_count,uncharacterized_genes,uncharacterized_gene_count,novel_role_genes,...,total_gene_count,highest_unchar_importance,average_unchar_importance,high_unchar_genes,high_unchar_gene_count,highest_novel_role_importance,average_novel_role_importance,high_novel_role_genes,high_novel_role_gene_count,all_cluster_genes
0,21,Small (40S) ribosomal subunit biogenesis via S...,High,2.88,Cluster 21 is highly enriched for known SSU pr...,AATF;BYSL;BMS1;C1orf131;DIMT1;NOP14;PDCD11;RRP...,21,NOC4L;NOL10,2,ESF1;ABT1;DYRK1A;INO80;RAE1;EIF3M;EIF3E;EIF3F;...,...,39,8,8.0,NOC4L:8;NOL10:8,2,6,2.81,,0,AATF;BYSL;BMS1;C1orf131;DIMT1;NOP14;PDCD11;RRP...
1,37,Regulation of mTORC1 and mTORC2 signaling,Medium,1.76,Cluster 37 is enriched for core components and...,MTOR;RPTOR;RICTOR;MLST8;RHEB;PDPK1;WDR24;SEH1L...,9,ERH,1,PRKAR1A;MAP7D1;TRAPPC8,...,13,8,8.0,ERH:8,1,7,6.0,,0,MTOR;RPTOR;RICTOR;MLST8;RHEB;PDPK1;WDR24;SEH1L...
2,121,RNA polymerase II–mediated transcription regul...,High,3.51,Cluster 121 is highly enriched for core compon...,GABPA;GABPB1;SP2;MAX;MYC;FOXN1;SETD2;N6AMT1;ZM...,12,CCDC174;ZBTB11;POU5F1B;FAM32A,4,DDA1;KEAP1;RAB11FIP4;MEMO1;SLC38A2;SPAG5,...,22,9,8.0,CCDC174:9;ZBTB11:8;POU5F1B:8,3,6,3.67,,0,GABPA;GABPB1;SP2;MAX;MYC;FOXN1;SETD2;N6AMT1;ZM...
3,149,Mitochondrial oxidative phosphorylation and re...,High,1.8,Cluster 149 is dominated by mitochondrial oxid...,NDUFV2;NDUFA6;NDUFC1;DMAC1;CYC1;UQCRFS1;ATP5ME...,12,,0,KRAS;BRAF;RAD23B;UBAC2;SNAPC1;NINL;LSM11,...,19,0,0.0,,0,6,3.57,,0,NDUFV2;NDUFA6;NDUFC1;DMAC1;CYC1;UQCRFS1;ATP5ME...
4,167,20S proteasome core complex assembly and ubiqu...,High,1.8,Cluster 167 is almost entirely composed of can...,POMP;PSMA1;PSMA2;PSMA3;PSMA4;PSMA5;PSMA6;PSMA7...,15,,0,UBE2R2,...,16,0,0.0,,0,6,6.0,,0,POMP;PSMA1;PSMA2;PSMA3;PSMA4;PSMA5;PSMA6;PSMA7...
5,197,No coherent biological pathway,Low,0.0,Cluster 197 contains genes spanning microtubul...,,0,,0,,...,0,0,0.0,,0,0,0.0,,0,
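
Several columns in this table pack lists into strings: gene lists are semicolon-separated, and `high_unchar_genes` pairs each gene with its score as `GENE:score`. A small parser makes those fields easier to work with. `parse_scored_genes` is a hypothetical helper written against the format visible above, and it assumes integer scores.

```python
def parse_scored_genes(field):
    """Turn 'GENE:score;GENE:score' into a {gene: score} dict."""
    scores = {}
    # Empty cells may arrive as None or NaN from pandas; treat them as no genes.
    if not isinstance(field, str) or not field.strip():
        return scores
    for item in field.split(";"):
        gene, _, score = item.partition(":")
        if gene and score:
            scores[gene.strip()] = int(score)
    return scores


print(parse_scored_genes("CCDC174:9;ZBTB11:8;POU5F1B:8"))
```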


In [11]:
openai_results["gene_df"]

Unnamed: 0,gene_name,gene_description,gene_importance_score,cluster_id,cluster_biological_process,pathway_confidence_level,cluster_importance_score,follow_up_suggestion,established_genes,established_gene_count,uncharacterized_genes,uncharacterized_gene_count,novel_role_genes,novel_role_gene_count,gene_category
25,TRAPPC8,Functions in ATG9 trafficking for autophagy in...,7,37,Regulation of mTORC1 and mTORC2 signaling,Medium,1.76,Cluster 37 is enriched for core components and...,MTOR;RPTOR;RICTOR;MLST8;RHEB;PDPK1;WDR24;SEH1L...,9,ERH,1,PRKAR1A;MAP7D1;TRAPPC8,3,novel_role
26,DDA1,Scaffolding subunit of CUL4 E3 ligases; may re...,6,121,RNA polymerase II–mediated transcription regul...,High,3.51,Cluster 121 is highly enriched for core compon...,GABPA;GABPB1;SP2;MAX;MYC;FOXN1;SETD2;N6AMT1;ZM...,12,CCDC174;ZBTB11;POU5F1B;FAM32A,4,DDA1;KEAP1;RAB11FIP4;MEMO1;SLC38A2;SPAG5,6,novel_role
27,KEAP1,Well-known NRF2 regulator; could also control ...,6,121,RNA polymerase II–mediated transcription regul...,High,3.51,Cluster 121 is highly enriched for core compon...,GABPA;GABPB1;SP2;MAX;MYC;FOXN1;SETD2;N6AMT1;ZM...,12,CCDC174;ZBTB11;POU5F1B;FAM32A,4,DDA1;KEAP1;RAB11FIP4;MEMO1;SLC38A2;SPAG5,6,novel_role
7,ESF1,Known as a basal transcription regulator; may ...,6,21,Small (40S) ribosomal subunit biogenesis via S...,High,2.88,Cluster 21 is highly enriched for known SSU pr...,AATF;BYSL;BMS1;C1orf131;DIMT1;NOP14;PDCD11;RRP...,21,NOC4L;NOL10,2,ESF1;ABT1;DYRK1A;INO80;RAE1;EIF3M;EIF3E;EIF3F;...,16,novel_role
8,ABT1,A TBP-like transcription activator with no rep...,6,21,Small (40S) ribosomal subunit biogenesis via S...,High,2.88,Cluster 21 is highly enriched for known SSU pr...,AATF;BYSL;BMS1;C1orf131;DIMT1;NOP14;PDCD11;RRP...,21,NOC4L;NOL10,2,ESF1;ABT1;DYRK1A;INO80;RAE1;EIF3M;EIF3E;EIF3F;...,16,novel_role
9,DYRK1A,A dual‐specificity kinase best studied in DNA ...,6,21,Small (40S) ribosomal subunit biogenesis via S...,High,2.88,Cluster 21 is highly enriched for known SSU pr...,AATF;BYSL;BMS1;C1orf131;DIMT1;NOP14;PDCD11;RRP...,21,NOC4L;NOL10,2,ESF1;ABT1;DYRK1A;INO80;RAE1;EIF3M;EIF3E;EIF3F;...,16,novel_role
10,INO80,Chromatin remodeler not linked to SSU processo...,6,21,Small (40S) ribosomal subunit biogenesis via S...,High,2.88,Cluster 21 is highly enriched for known SSU pr...,AATF;BYSL;BMS1;C1orf131;DIMT1;NOP14;PDCD11;RRP...,21,NOC4L;NOL10,2,ESF1;ABT1;DYRK1A;INO80;RAE1;EIF3M;EIF3E;EIF3F;...,16,novel_role
32,KRAS,Oncogenic KRAS is known to drive metabolic rep...,6,149,Mitochondrial oxidative phosphorylation and re...,High,1.8,Cluster 149 is dominated by mitochondrial oxid...,NDUFV2;NDUFA6;NDUFC1;DMAC1;CYC1;UQCRFS1;ATP5ME...,12,,0,KRAS;BRAF;RAD23B;UBAC2;SNAPC1;NINL;LSM11,7,novel_role
39,UBE2R2,UBE2R2 is a well-studied E2 ubiquitin-conjugat...,6,167,20S proteasome core complex assembly and ubiqu...,High,1.8,Cluster 167 is almost entirely composed of can...,POMP;PSMA1;PSMA2;PSMA3;PSMA4;PSMA5;PSMA6;PSMA7...,15,,0,UBE2R2,1,novel_role
23,PRKAR1A,Regulatory subunit of PKA; cAMP–PKA cross‐talk...,6,37,Regulation of mTORC1 and mTORC2 signaling,Medium,1.76,Cluster 37 is enriched for core components and...,MTOR;RPTOR;RICTOR;MLST8;RHEB;PDPK1;WDR24;SEH1L...,9,ERH,1,PRKAR1A;MAP7D1;TRAPPC8,3,novel_role
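
A common next step is to rank candidate genes for follow-up. The sketch below rebuilds a few of the rows shown above into a small DataFrame (four columns only, for brevity) and sorts novel-role genes by score. With the real results you would start from `openai_results["gene_df"]` instead; the tie-break on gene name is an arbitrary choice made here.

```python
import pandas as pd

# A few rows copied from the output above, reduced to four columns.
gene_df_subset = pd.DataFrame(
    {
        "gene_name": ["TRAPPC8", "DDA1", "KRAS", "UBE2R2"],
        "gene_importance_score": [7, 6, 6, 6],
        "cluster_id": [37, 121, 149, 167],
        "gene_category": ["novel_role"] * 4,
    }
)

# Keep novel-role genes, highest score first, ties broken alphabetically.
top_candidates = (
    gene_df_subset[gene_df_subset["gene_category"] == "novel_role"]
    .sort_values(["gene_importance_score", "gene_name"], ascending=[False, True])
    .head(3)
)
print(top_candidates[["gene_name", "cluster_id", "gene_importance_score"]])
```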
