GH-36: feature: Serialize pipeline feature sets as DataFrames #41
Conversation
Pull Request Overview
This PR refactors pipeline serialization to use getML DataFrames for transformed features, while renaming keys and updating related tests accordingly.
- Renames parameters and keys from “predicts” to “predictions” and “transforms” to “feature_sets”.
- Updates serialization functions, metadata, and tests to support the new DataFrame-based feature sets.
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/serialize/test_pipeline_information.py | Updates JSON keys to “predictions” and “feature_sets”. |
| tests/unit/serialize/test_pipeline.py | Renames test functions and assertions to validate feature sets and predictions. |
| tests/unit/conftest.py | Updates fixtures and mock pipeline to support new naming and data types. |
| tests/integration/test_serialize_robot.py | Updates integration test directory and file naming for feature sets and predictions. |
| src/getml_io/serialize/pipeline.py | Replaces deprecated serialization helpers with new functions for predictions and feature sets. |
| src/getml_io/metadata/* | Renames type aliases and metadata keys to correspond to the new nomenclature. |
| src/getml_io/metadata/feature_sets.py | Introduces a new type alias for structured feature sets. |
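The last row above mentions a new type alias for structured feature sets. The PR diff is not shown here, so as a rough illustration only, such an alias might map a subset name to its transformed DataFrame; the alias name and shape below are assumptions, not the actual contents of `src/getml_io/metadata/feature_sets.py`.

```python
# Hypothetical sketch of a feature-set type alias; not the actual contents
# of src/getml_io/metadata/feature_sets.py.
from typing import TypeAlias

from getml.data import DataFrame

# Maps a subset name (e.g. "train", "test") to the DataFrame returned by
# pipeline.transform() for that subset.
FeatureSets: TypeAlias = dict[str, DataFrame]
```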
Summary of Changes
Hello @Urfoex, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refactors the pipeline serialization mechanism to enhance the robustness and informativeness of serialized feature sets. Previously stored as simple NumPy arrays, transformed features will now be serialized as full getML DataFrames, preserving crucial metadata like column names and types. This change also introduces clearer terminology by renaming 'transforms' to 'feature_sets' and 'predicts' to 'predictions' throughout the codebase and on-disk storage.
Highlights
- Enhanced Feature Set Serialization: Transformed features, now termed "feature sets", are serialized as getML DataFrames instead of NumPy arrays. This change ensures that column names, types, and other metadata are preserved, leading to a more robust and informative serialization format.
- Terminology Renaming: The terms `transforms` and `predicts` have been consistently renamed to `feature_sets` and `predictions`, respectively. This renaming applies across the codebase, metadata structures (e.g., `PipelineInformation`), and the on-disk storage paths for improved clarity and consistency.
- Refactored Serialization Logic: Generic serialization helpers for pipeline actions have been removed. They are replaced with explicit, dedicated functions (`serialize_predictions` and `serialize_feature_sets`) that handle the specific serialization requirements for each type of output, simplifying the code and making it more readable (a sketch of such a function follows this list).
- Metadata and Storage Updates: The `PipelineInformation` metadata structures and the on-disk folder layout have been updated to reflect the new `predictions` and `feature_sets` terminology. Feature set filenames now include a `features.` prefix (e.g., `features.subset_name.parquet`) to distinguish them clearly.
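To make the refactored serialization logic above concrete, here is a minimal sketch of what a dedicated feature-set serializer could look like. Only the name `serialize_feature_sets` and the `features.<subset>.parquet` naming come from the PR; the signature, the dict-of-DataFrames input, and the use of `to_parquet` in place of the real `serialize_dataframe_or_view` helper are assumptions.

```python
# Illustrative sketch only -- not the implementation merged in this PR.
from pathlib import Path

from getml.data import DataFrame


def serialize_feature_sets(feature_sets: dict[str, DataFrame], target_dir: Path) -> None:
    """Write each transformed feature set as features.<subset>.parquet."""
    target_dir.mkdir(parents=True, exist_ok=True)
    for subset_name, df in feature_sets.items():
        # The real code reuses the existing serialize_dataframe_or_view helper;
        # DataFrame.to_parquet stands in for it in this sketch.
        df.to_parquet(str(target_dir / f"features.{subset_name}.parquet"))
```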
Code Review
The pull request refactors pipeline serialization to handle transformed features as DataFrames, enhancing robustness and information preservation. Key changes include renaming variables for clarity, updating pipeline.transform() to produce DataFrames, and adjusting metadata and tests accordingly. The changes look good overall, with a couple of suggestions to enhance robustness by adding checks for None values before serialization.
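A guard of the kind the review asks for might look like the following sketch. The function name matches the one introduced in the PR, but its signature and the `.npy` on-disk format here are illustrative assumptions.

```python
# Illustrative None guard; not the exact code from the PR.
from pathlib import Path

import numpy as np


def serialize_predictions(predictions: np.ndarray | None, target_dir: Path) -> None:
    """Persist predictions, skipping serialization when none were produced."""
    if predictions is None:
        # Nothing to write: the pipeline was never asked to predict.
        return
    target_dir.mkdir(parents=True, exist_ok=True)
    np.save(target_dir / "predictions.npy", predictions)
```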
Force-pushed from e9cbd64 to 793e020
Refactor pipeline serialization to handle transformed features as full getML DataFrames instead of simple NumPy arrays. This provides a more robust and informative serialization format, preserving column names, types, and other metadata for the feature sets.

Key changes:
- Rename `transforms` to `feature_sets` and `predicts` to `predictions` throughout the codebase for clarity.
- `pipeline.transform()` now produces a `getml.DataFrame`.
- `serialize_feature_sets` saves the resulting DataFrame using the existing `serialize_dataframe_or_view` logic.
- Update `PipelineInformation` metadata, on-disk folder structure, and all related tests to reflect the new model.
- Remove generic serialization helpers in favor of more explicit and readable implementations for predictions and feature sets.
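As a rough illustration of the second and third bullets in the commit message, a caller under the new model might obtain a DataFrame from `transform()` and hand it to the new serializer. Only `pipeline.transform()`, `serialize_feature_sets`, and the `feature_sets` terminology come from the PR; the `df_name` argument (which, in getML, makes `transform()` return a DataFrame rather than a NumPy array), the helper's signature, and the wrapper function are assumptions.

```python
# Caller-side sketch under assumed signatures; adjust to the real API.
from pathlib import Path

import getml
from getml_io.serialize.pipeline import serialize_feature_sets


def export_test_features(
    pipeline: getml.pipeline.Pipeline,
    container: getml.data.Container,
    output_dir: Path,
) -> None:
    # Passing df_name asks getML to return a DataFrame instead of a NumPy array.
    feature_set = pipeline.transform(container.test, df_name="features_test")
    serialize_feature_sets({"test": feature_set}, output_dir / "feature_sets")
```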
Force-pushed from 972dc4e to 349b7a5
Refactor pipeline serialization to handle transformed features as full getML DataFrames instead of simple NumPy arrays. This provides a more robust and informative serialization format, preserving column names, types, and other metadata for the feature sets.

Key changes:
- Rename `transforms` to `feature_sets` and `predicts` to `predictions` throughout the codebase for clarity.
- `pipeline.transform()` now produces a `getml.DataFrame`.
- `serialize_feature_sets` saves the resulting DataFrame using the existing `serialize_dataframe_or_view` logic.
- Update `PipelineInformation` metadata, on-disk folder structure, and all related tests to reflect the new model.
- Remove generic serialization helpers in favor of more explicit and readable implementations for predictions and feature sets.
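For orientation, the updated on-disk layout described in the commit message could be checked with something along these lines. The `feature_sets` folder name is inferred from the new terminology and the `features.<subset>.parquet` pattern from the PR description; the helper itself is hypothetical.

```python
# Hypothetical layout check; the feature_sets folder name is an assumption,
# the features.<subset>.parquet pattern comes from the PR description.
from pathlib import Path


def assert_new_pipeline_layout(pipeline_dir: Path) -> None:
    """Fail loudly if a serialized pipeline does not use the new layout."""
    feature_dir = pipeline_dir / "feature_sets"
    assert feature_dir.is_dir(), f"missing {feature_dir}"
    feature_files = list(feature_dir.glob("features.*.parquet"))
    assert feature_files, "expected at least one features.<subset>.parquet file"
```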