Trainer: Refactoring the _pair_clustering_keep_id #52
Conversation
First off, good job disentangling this code. Certainly some "interesting" choices in how and when variables came into existence 😂. Only a small collection of comments.
```python
    parent_domain_dict,
    all_child_columns,
    all_parent_columns,
    parent_primary_key,
```
It looks like this guy is only ever used to create the `parent_primary_key_index` in `_prepare_cluster_data`, which we're already doing outside of this function. Maybe we just pass the index like you do with `_denormalize_parent_data` or `_get_parent_data_clusters`?
Good eye, thanks.
Actually, I'm gonna take that back. It's an inexpensive operation and it saves us one parameter in this function that already has too many. I'll add a comment explaining that.
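For illustration, a minimal, hypothetical sketch of the trade-off being discussed — the names `all_parent_columns` and `parent_primary_key` come from the snippet above, but this helper is not part of the PR:

```python
# Hypothetical helper, not from this PR: derive the primary-key index from
# the column list instead of receiving it as an extra parameter.
def _primary_key_index(all_parent_columns: list[str], parent_primary_key: str) -> int:
    # list.index is a linear scan over the column names -- cheap for typical
    # table widths, which is why recomputing it here is acceptable.
    return all_parent_columns.index(parent_primary_key)


# Example: _primary_key_index(["id", "name", "age"], "id") -> 0
```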
```python
    foreign_key_index: int,
) -> np.ndarray:
    """
    Denormalize the parent data in relation to the child group data,
```
I'm still not 100% clear on what we mean by "denormalize" here. Maybe an example or more details would help?
I knew this was going to be confusing... "Normalize" and "denormalize" in database terms refer to merging the data from one table into another by duplicating values when the tables have a primary-foreign key relationship. The problem is that in our context, the term "normalization" is already used for something else.
The original term used by this code was "repeated" data, which is not entirely wrong. Do you think that's a better name? Or should I expand on the "denormalization" term in the docstring here?
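For what it's worth, here is a toy example (not code from this PR) of what denormalization means in the database sense, assuming a simple pandas parent/child layout:

```python
import pandas as pd

# Toy tables; the names are illustrative, not from this PR.
parent = pd.DataFrame({"parent_id": [1, 2], "region": ["east", "west"]})
child = pd.DataFrame({"child_id": [10, 11, 12], "parent_id": [1, 1, 2]})

# "Denormalizing" merges the parent table into the child table, duplicating
# each parent's values once per matching child row -- hence the original
# code's term "repeated" data.
merged = child.merge(parent, on="parent_id", how="left")
print(merged)
#    child_id  parent_id region
# 0        10          1   east
# 1        11          1   east
# 2        12          2   west
```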
Maybe I should just call it `merge_parent_and_child_data`?
I think that makes sense, but I also think it would be good to add a comment along the lines of what you stated above to make it more obvious what we're doing with the "repeat". If you have a link to an example of this kind of thing in databases, that would also be sufficient.
I have changed it to "merge", and also added a reference to DB denormalization.
Just a small final request about the normalize/denormalize thing.
LGTM!
Added two minor comments.
LONG = "long" | ||
|
||
|
||
class ForeignKeyScalingType(Enum): |
Based on my understanding, this indicates how we want to scale the parent data: `key_normalized = _min_max_normalize_sklearn(reshaped_parent_data)`, for example, where `key_normalized` is the normalized parent data. Initially the name `ForeignKeyScalingType` was confusing to me, as I thought we wanted to scale the foreign key itself.
Yeah, that makes sense. To be honest, I am still a bit confused by this, but it seems like it's really not the foreign key, so I'll rename it.
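For reference, a minimal sketch of min-max scaling with scikit-learn — presumably the kind of thing `_min_max_normalize_sklearn` wraps, though its actual implementation isn't shown in this diff:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative stand-in for the reshaped parent data discussed above.
reshaped_parent_data = np.array([[3.0], [7.0], [11.0]])

# MinMaxScaler rescales each feature column to [0, 1]: (x - min) / (max - min).
key_normalized = MinMaxScaler().fit_transform(reshaped_parent_data)
print(key_normalized.ravel())  # [0.  0.5 1. ]
```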
```python
    key_scaling_type: ForeignKeyScalingType = ForeignKeyScalingType.MINMAX,
) -> np.ndarray:
    """
    Prepare the data for the clustering algorithm, which comprises of denormalizing the parent data,
```
Maybe also adding what we mean by denormalization here would be helpful.
I've changed the terminology here to say "merge" instead of "denormalize" as it's a similar concept and the word is less "loaded" for us.
c83ecea Trainer: Refactoring process_pipeline_data function (#54)
2be0f81 Bump astral-sh/setup-uv from 6.7.0 to 6.8.0 (#56)
004fc67 Trainer: general variable renamings on data_loaders.py and adding a couple more enums (#53)
4c2d75f Trainer: Refactoring the _pair_clustering_keep_id (#52)
d8fc981 Ensemble attack: Meta classifier pipeline (#37)
git-subtree-dir: deps/midst-toolkit
git-subtree-split: c83ecea
PR Type
Fix
Short Description
Clickup Ticket(s): https://app.clickup.com/t/868fuke6e
In order to remove some Ruff ignores, I had to refactor the `_pair_clustering_keep_id` function into many smaller functions. I also renamed it to `_pair_clustering` because I don't know what the `keep_id` part means.

Tests Added
Tests stay the same; the functionality should not change.