@grapentt grapentt commented Nov 18, 2025

Checklist

  • My pull request has a clear and explanatory title.
  • My pull request passes the Linting test.
  • I added appropriate unit tests and I made sure the code passes all unit tests.
  • My PR follows PEP8 guidelines.
  • My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • I linked to issues and PRs that are relevant to this PR.
  • "Official" HIGH-PPI splits

Description

This PR adds the HIGH-PPI SHS27k + CORUM dataset to TopoBench. It is a natively higher-order simplicial complex dataset that combines a protein-protein interaction network with experimentally validated protein complexes.

Note: This PR focuses on dataset integration and infrastructure. The training pipeline for higher-order prediction tasks will be added in my B.2 submission.

Dataset Structure:

  • 0-cells: 1,553 human proteins
  • 1-cells: 6,660 protein-protein interactions with typed edges (7 interaction types: reaction, binding, ptmod, activation, inhibition, catalysis, expression) plus confidence scores
  • Cells of rank ≥2: ~470 experimentally validated protein complexes from CORUM

Simplicial Complex Construction:

  1. Add proteins as 0-cells
  2. Add HIGH-PPI edges as 1-cells with 8-dim feature vectors (7 interaction types + 1 confidence score)
  3. Process CORUM complexes (top-down, largest first):
    • Add each complex to the simplicial complex (automatically creates all sub-faces via TopoNetX)
    • Mark the complex as positive (+1)
    • Mark all proper sub-faces as negative (-1) unless already labeled by a smaller CORUM complex
    • Boost confidence to 1.0 for edges within CORUM complexes (both HIGH-PPI and CORUM-only edges)
  4. Generate random negative samples proportionally for higher-order cells (rank ≥2) to balance the dataset
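Steps 2 and 3 above can be sketched in plain Python, without TopoNetX, to make the labeling order concrete. This is a minimal illustration, not the code in this PR: the helper names (`edge_features`, `proper_subfaces`, `build_labels`) are hypothetical, the ordering of the seven interaction types inside the feature vector is assumed, CORUM-only edges are assumed to start with all-zero type flags, and a sub-face that is itself a CORUM complex is assumed to end up positive.

```python
from itertools import combinations

# Assumed ordering of the 7 interaction types inside the 8-dim edge vector.
INTERACTION_TYPES = ["reaction", "binding", "ptmod", "activation",
                     "inhibition", "catalysis", "expression"]


def edge_features(types, confidence):
    """Return 7 multi-hot interaction-type flags followed by 1 confidence score."""
    flags = [1.0 if t in types else 0.0 for t in INTERACTION_TYPES]
    return flags + [float(confidence)]


def proper_subfaces(cell, min_size=2):
    """Yield every proper sub-face of `cell` with at least `min_size` vertices."""
    members = sorted(cell)
    for k in range(min_size, len(members)):
        for face in combinations(members, k):
            yield frozenset(face)


def build_labels(ppi_edges, complexes):
    """Build edge features and cell labels.

    ppi_edges: iterable of ((u, v), interaction_types, confidence)
    complexes: iterable of protein collections (CORUM complexes)
    """
    feats = {frozenset(e): edge_features(t, c) for e, t, c in ppi_edges}
    labels = {}
    # Top-down processing: largest complexes first.
    ordered = sorted((frozenset(c) for c in complexes), key=len, reverse=True)
    for cx in ordered:
        # The complex itself is positive, even if a larger complex
        # previously marked it as a negative sub-face.
        labels[cx] = 1
        # Proper sub-faces become negative unless they already carry a label.
        for face in proper_subfaces(cx):
            labels.setdefault(face, -1)
        # Every edge inside the complex gets confidence 1.0; edges that are
        # not in HIGH-PPI are created with all-zero type flags (assumption).
        for pair in combinations(sorted(cx), 2):
            edge = frozenset(pair)
            feats.setdefault(edge, edge_features((), 0.0))
            feats[edge][-1] = 1.0
    return feats, labels


ppi = [(("A", "B"), {"binding"}, 0.4)]
feats, labels = build_labels(ppi, [("A", "B", "C"), ("A", "B")])
```

In this toy input, the pair A-B is both a sub-face of the triple and a complex in its own right, so it is labeled positive, while A-C and B-C stay negative.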

Supported Prediction Tasks (configured, training pipeline in B.2):

  1. Edge score regression: Predict confidence of protein-protein interactions (0-1 continuous)
  2. Edge interaction type classification: Multi-label prediction of 7 interaction types per edge
  3. Higher-order complex prediction: Binary classification of whether a set of proteins forms a real complex (cells of rank ≥2)
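As a rough illustration of how the three tasks relate to the stored data, the sketch below derives a target from the 8-dim edge vector and from the ±1 cell labels. The function names and the position of the confidence score (last entry) are assumptions made for this example; the actual training pipeline is deferred to the B.2 submission.

```python
def edge_target(features, edge_task):
    """Pick the prediction target out of an 8-dim edge feature vector.

    Assumes entries 0-6 are interaction-type flags and entry 7 is the
    confidence score.
    """
    if len(features) != 8:
        raise ValueError("expected an 8-dim edge feature vector")
    if edge_task == "score":
        return features[7]       # regression target in [0, 1]
    if edge_task == "interaction_type":
        return features[:7]      # multi-label target, one flag per type
    raise ValueError(f"unknown edge_task: {edge_task!r}")


def complex_target(label):
    """Map the +1 / -1 cell label to a 0 / 1 binary classification target."""
    return 1 if label == 1 else 0


example_edge = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.8]
score = edge_target(example_edge, "score")
types = edge_target(example_edge, "interaction_type")
```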

Additional context

Data Sources:

  • HIGH-PPI SHS27k: Human protein interaction network with typed edges (Paper)
  • CORUM: the Comprehensive Resource of Mammalian Protein Complexes, a database of experimentally validated complexes

Configuration Options:

  • min_complex_size / max_complex_size: Minimum and maximum size of the CORUM complexes to include (defaults: 2 and 6)
  • target_ranks: List of ranks to predict on (supports single or multi-rank prediction)
  • neg_ratio: Negative sample ratio for complex classification
  • edge_task: Choose between "score" (regression) or "interaction_type" (classification)
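The options above could interact roughly as in the following sketch. It is a hypothetical stand-in rather than the PR's loader: the `PPIConfig` class, both helper functions, and the defaults for `target_ranks` and `neg_ratio` are assumptions (only the 2-6 size range comes from the description), and "proportional" sampling is read here as `neg_ratio` times the number of positives at each target rank.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PPIConfig:
    min_complex_size: int = 2
    max_complex_size: int = 6
    target_ranks: list = field(default_factory=lambda: [2])  # assumed default
    neg_ratio: float = 1.0                                   # assumed default
    edge_task: str = "score"


def filter_complexes(complexes, cfg):
    """Keep complexes whose size lies within the configured bounds."""
    return [frozenset(c) for c in complexes
            if cfg.min_complex_size <= len(set(c)) <= cfg.max_complex_size]


def sample_negatives(proteins, positives, cfg, seed=0):
    """Draw random protein sets that are not known complexes.

    A cell of rank r has r + 1 proteins; for each target rank the number of
    negatives requested is neg_ratio times the number of positives.
    """
    rng = random.Random(seed)
    positives = set(positives)
    pool = sorted(proteins)
    negatives = set()
    for rank in cfg.target_ranks:
        size = rank + 1
        if size > len(pool):
            continue
        n_pos = sum(1 for p in positives if len(p) == size)
        wanted = int(round(cfg.neg_ratio * n_pos))
        found, attempts = set(), 0
        # Cap the attempts so sampling stops even if few candidates exist.
        while len(found) < wanted and attempts < 100 * max(wanted, 1):
            candidate = frozenset(rng.sample(pool, size))
            if candidate not in positives:
                found.add(candidate)
            attempts += 1
        negatives |= found
    return negatives


cfg = PPIConfig(target_ranks=[2], neg_ratio=1.0)
kept = filter_complexes([("A", "B", "C"), tuple("ABCDEFG")], cfg)
negs = sample_negatives(list("ABCDEFGH"), kept, cfg)
```

Here the seven-protein complex is dropped by the size filter, and one random negative triple is drawn to match the single positive triple.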

@grapentt grapentt marked this pull request as draft November 18, 2025 21:49
@levtelyatnikov
Collaborator

Dear Participants,

This is a final reminder regarding the upcoming challenge deadline.

📅 Deadline: Tomorrow, 25th November 2025

✅ Critical Requirement: Please ensure your branch is passing all CI/CD tests.

If you have any pending changes, please push them and verify your build status as soon as possible.

Good luck!

@levtelyatnikov levtelyatnikov added the category-a2 Submission to TDL Challenge 2025: Mission A, Category 2. label Nov 24, 2025
@grapentt grapentt changed the title Category: A2; Team name: TG; Dataset: Simplicial PPI (HIGH-PPI + CORUM) Category: A2; Team name: TG; Dataset: Simplicial PPI (HIGH-PPI SHS27k + CORUM) Nov 25, 2025