Trainer: Refactoring the _pair_clustering_keep_id #52
Conversation
First off, good job disentangling this code. Certainly some "interesting" choices in how and when variables came into existence 😂. Only a small collection of comments.
```python
    parent_domain_dict,
    all_child_columns,
    all_parent_columns,
    parent_primary_key,
```
It looks like this guy is only ever used to create the `parent_primary_key_index` in `_prepare_cluster_data`, which we're already doing outside of this function. Maybe we just pass the index like you do with `_denormalize_parent_data` or `_get_parent_data_clusters`?
Good eye, thanks.
Actually, I'm gonna take that back. It's an inexpensive operation and it saves us one parameter in this function that already has too many. I'll add a comment explaining that.
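For illustration, a minimal, hypothetical sketch of the trade-off being discussed — the names `all_parent_columns` and `parent_primary_key` come from the snippet above, but this helper is not part of the PR:

```python
# Hypothetical helper, not from this PR: derive the primary-key index from
# the column list instead of receiving it as an extra parameter.
def _primary_key_index(all_parent_columns: list[str], parent_primary_key: str) -> int:
    # list.index is a linear scan over the column names -- cheap for typical
    # table widths, which is why recomputing it here is acceptable.
    return all_parent_columns.index(parent_primary_key)


# Example: _primary_key_index(["id", "name", "age"], "id") -> 0
```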
```python
    foreign_key_index: int,
) -> np.ndarray:
    """
    Denormalize the parent data in relation to the child group data,
```
I'm still not 100% clear on what we mean by "denormalize" here. Maybe an example or more details would help?
I knew this was going to be confusing... "Normalize" and "denormalize" in database terms refer to merging the data from one table into another by duplicating values when the tables have a primary-foreign key relationship. The problem is that in our context, the term "normalization" is already used for something else.
The original term used by this code was "repeated" data, which is not entirely wrong. Do you think that's a better name? Or should I expand on the "denormalization" term in the docstring here?
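For what it's worth, here is a toy example (not code from this PR) of what denormalization means in the database sense, assuming a simple pandas parent/child layout:

```python
import pandas as pd

# Toy tables; the names are illustrative, not from this PR.
parent = pd.DataFrame({"parent_id": [1, 2], "region": ["east", "west"]})
child = pd.DataFrame({"child_id": [10, 11, 12], "parent_id": [1, 1, 2]})

# "Denormalizing" merges the parent table into the child table, duplicating
# each parent's values once per matching child row -- hence the original
# code's term "repeated" data.
merged = child.merge(parent, on="parent_id", how="left")
print(merged)
#    child_id  parent_id region
# 0        10          1   east
# 1        11          1   east
# 2        12          2   west
```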
Maybe I should just call it `merge_parent_and_child_data`?
I think that makes sense, but I also think it would be good to add a comment along the lines of what you stated above to make it more obvious what we're doing with the "repeat". If you have a link to an example of this kind of thing in databases, that would also be sufficient.
I have changed it to "merge", and also added a reference to DB denormalization.
Just a small final request about the normalize/denormalize thing.
LGTM!
Added two minor comments.
LONG = "long" | ||
|
||
|
||
class ForeignKeyScalingType(Enum): |
Based on my understanding, this indicates how we want to scale the parent data: `key_normalized = _min_max_normalize_sklearn(reshaped_parent_data)`, for example, where `key_normalized` is the normalized parent data. Initially the name `ForeignKeyScalingType` was confusing to me, as I thought we wanted to scale the foreign key itself.
Yeah, that makes sense. To be honest, I am still a bit confused by this, but it seems like it's really not the foreign key, so I'll rename it.
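For reference, a minimal sketch of min-max scaling with scikit-learn — presumably the kind of thing `_min_max_normalize_sklearn` wraps, though its actual implementation isn't shown in this diff:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative stand-in for the reshaped parent data discussed above.
reshaped_parent_data = np.array([[3.0], [7.0], [11.0]])

# MinMaxScaler rescales each feature column to [0, 1]: (x - min) / (max - min).
key_normalized = MinMaxScaler().fit_transform(reshaped_parent_data)
print(key_normalized.ravel())  # [0.  0.5 1. ]
```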
```python
    key_scaling_type: ForeignKeyScalingType = ForeignKeyScalingType.MINMAX,
) -> np.ndarray:
    """
    Prepare the data for the clustering algorithm, which comprises of denormalizing the parent data,
```
Maybe also adding what we mean by denormalization here would be helpful.
I've changed the terminology here to say "merge" instead of "denormalize" as it's a similar concept and the word is less "loaded" for us.
c83ecea Trainer: Refactoring process_pipeline_data function (#54)
2be0f81 Bump astral-sh/setup-uv from 6.7.0 to 6.8.0 (#56)
004fc67 Trainer: general variable renamings on data_loaders.py and adding a couple more enums (#53)
4c2d75f Trainer: Refactoring the _pair_clustering_keep_id (#52)
d8fc981 Ensemble attack: Meta classifier pipeline (#37)
git-subtree-dir: deps/midst-toolkit
git-subtree-split: c83ecea
PR Type
Fix
Short Description
Clickup Ticket(s): https://app.clickup.com/t/868fuke6e
In order to remove some Ruff ignores, I had to refactor the `_pair_clustering_keep_id` function into many smaller functions. I also renamed it to `_pair_clustering` because I don't know what the `keep_id` part means.

Tests Added
Tests stay the same; the functionality should not change.