GH-36: feature: Serialize pipeline feature sets as DataFrames #41
Conversation
Pull Request Overview
This PR refactors pipeline serialization to use getML DataFrames for transformed features, while renaming keys and updating related tests accordingly.
- Renames parameters and keys from “predicts” to “predictions” and “transforms” to “feature_sets”.
- Updates serialization functions, metadata, and tests to support the new DataFrame-based feature sets.
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/serialize/test_pipeline_information.py | Updates JSON keys to “predictions” and “feature_sets”. |
| tests/unit/serialize/test_pipeline.py | Renames test functions and assertions to validate feature sets and predictions. |
| tests/unit/conftest.py | Updates fixtures and mock pipeline to support new naming and data types. |
| tests/integration/test_serialize_robot.py | Updates integration test directory and file naming for feature sets and predictions. |
| src/getml_io/serialize/pipeline.py | Replaces deprecated serialization helpers with new functions for predictions and feature sets. |
| src/getml_io/metadata/* | Renames type aliases and metadata keys to correspond to the new nomenclature. |
| src/getml_io/metadata/feature_sets.py | Introduces a new type alias for structured feature sets. |
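The last row above mentions a new type alias for structured feature sets. The PR diff is not shown here, so as a rough illustration only, such an alias might map a subset name to its transformed DataFrame; the alias name and shape below are assumptions, not the actual contents of `src/getml_io/metadata/feature_sets.py`.

```python
# Hypothetical sketch of a feature-set type alias; not the actual contents
# of src/getml_io/metadata/feature_sets.py.
from typing import TypeAlias

from getml.data import DataFrame

# Maps a subset name (e.g. "train", "test") to the DataFrame returned by
# pipeline.transform() for that subset.
FeatureSets: TypeAlias = dict[str, DataFrame]
```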
Summary of Changes
Hello @Urfoex, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refactors the pipeline serialization mechanism to enhance the robustness and informativeness of serialized feature sets. Previously stored as simple NumPy arrays, transformed features will now be serialized as full getML DataFrames, preserving crucial metadata like column names and types. This change also introduces clearer terminology by renaming 'transforms' to 'feature_sets' and 'predicts' to 'predictions' throughout the codebase and on-disk storage.
Highlights
- Enhanced Feature Set Serialization: Transformed features, now termed "feature sets", are serialized as getML DataFrames instead of NumPy arrays. This change ensures that column names, types, and other metadata are preserved, leading to a more robust and informative serialization format.
- Terminology Renaming: The terms `transforms` and `predicts` have been consistently renamed to `feature_sets` and `predictions`, respectively. This renaming applies across the codebase, metadata structures (e.g., `PipelineInformation`), and the on-disk storage paths for improved clarity and consistency.
- Refactored Serialization Logic: Generic serialization helpers for pipeline actions have been removed. They are replaced with explicit, dedicated functions (`serialize_predictions` and `serialize_feature_sets`) that handle the specific serialization requirements for each type of output, simplifying the code and making it more readable (a sketch of such a function follows this list).
- Metadata and Storage Updates: The `PipelineInformation` metadata structures and the on-disk folder layout have been updated to reflect the new `predictions` and `feature_sets` terminology. Feature set filenames now include a `features.` prefix (e.g., `features.subset_name.parquet`) to distinguish them clearly.
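To make the refactored serialization logic above concrete, here is a minimal sketch of what a dedicated feature-set serializer could look like. Only the name `serialize_feature_sets` and the `features.<subset>.parquet` naming come from the PR; the signature, the dict-of-DataFrames input, and the use of `to_parquet` in place of the real `serialize_dataframe_or_view` helper are assumptions.

```python
# Illustrative sketch only -- not the implementation merged in this PR.
from pathlib import Path

from getml.data import DataFrame


def serialize_feature_sets(feature_sets: dict[str, DataFrame], target_dir: Path) -> None:
    """Write each transformed feature set as features.<subset>.parquet."""
    target_dir.mkdir(parents=True, exist_ok=True)
    for subset_name, df in feature_sets.items():
        # The real code reuses the existing serialize_dataframe_or_view helper;
        # DataFrame.to_parquet stands in for it in this sketch.
        df.to_parquet(str(target_dir / f"features.{subset_name}.parquet"))
```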
Code Review
The pull request refactors pipeline serialization to handle transformed features as DataFrames, enhancing robustness and information preservation. Key changes include renaming variables for clarity, updating pipeline.transform() to produce DataFrames, and adjusting metadata and tests accordingly. The changes look good overall, with a couple of suggestions to enhance robustness by adding checks for None values before serialization.
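A guard of the kind the review asks for might look like the following sketch. The function name matches the one introduced in the PR, but its signature and the `.npy` on-disk format here are illustrative assumptions.

```python
# Illustrative None guard; not the exact code from the PR.
from pathlib import Path

import numpy as np


def serialize_predictions(predictions: np.ndarray | None, target_dir: Path) -> None:
    """Persist predictions, skipping serialization when none were produced."""
    if predictions is None:
        # Nothing to write: the pipeline was never asked to predict.
        return
    target_dir.mkdir(parents=True, exist_ok=True)
    np.save(target_dir / "predictions.npy", predictions)
```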
Force-pushed from e9cbd64 to 793e020
Refactor pipeline serialization to handle transformed features as full getML DataFrames instead of simple NumPy arrays. This provides a more robust and informative serialization format, preserving column names, types, and other metadata for the feature sets.

Key changes:
- Rename `transforms` to `feature_sets` and `predicts` to `predictions` throughout the codebase for clarity.
- `pipeline.transform()` now produces a `getml.DataFrame`.
- `serialize_feature_sets` saves the resulting DataFrame using the existing `serialize_dataframe_or_view` logic.
- Update `PipelineInformation` metadata, on-disk folder structure, and all related tests to reflect the new model.
- Remove generic serialization helpers in favor of more explicit and readable implementations for predictions and feature sets.
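As a rough illustration of the second and third bullets in the commit message, a caller under the new model might obtain a DataFrame from `transform()` and hand it to the new serializer. Only `pipeline.transform()`, `serialize_feature_sets`, and the `feature_sets` terminology come from the PR; the `df_name` argument (which, in getML, makes `transform()` return a DataFrame rather than a NumPy array), the helper's signature, and the wrapper function are assumptions.

```python
# Caller-side sketch under assumed signatures; adjust to the real API.
from pathlib import Path

import getml
from getml_io.serialize.pipeline import serialize_feature_sets


def export_test_features(
    pipeline: getml.pipeline.Pipeline,
    container: getml.data.Container,
    output_dir: Path,
) -> None:
    # Passing df_name asks getML to return a DataFrame instead of a NumPy array.
    feature_set = pipeline.transform(container.test, df_name="features_test")
    serialize_feature_sets({"test": feature_set}, output_dir / "feature_sets")
```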
Force-pushed from 972dc4e to 349b7a5
Refactor pipeline serialization to handle transformed features as full getML DataFrames instead of simple NumPy arrays. This provides a more robust and informative serialization format, preserving column names, types, and other metadata for the feature sets.

Key changes:
- Rename `transforms` to `feature_sets` and `predicts` to `predictions` throughout the codebase for clarity.
- `pipeline.transform()` now produces a `getml.DataFrame`.
- `serialize_feature_sets` saves the resulting DataFrame using the existing `serialize_dataframe_or_view` logic.
- Update `PipelineInformation` metadata, on-disk folder structure, and all related tests to reflect the new model.
- Remove generic serialization helpers in favor of more explicit and readable implementations for predictions and feature sets.
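For orientation, the updated on-disk layout described in the commit message could be checked with something along these lines. The `feature_sets` folder name is inferred from the new terminology and the `features.<subset>.parquet` pattern from the PR description; the helper itself is hypothetical.

```python
# Hypothetical layout check; the feature_sets folder name is an assumption,
# the features.<subset>.parquet pattern comes from the PR description.
from pathlib import Path


def assert_new_pipeline_layout(pipeline_dir: Path) -> None:
    """Fail loudly if a serialized pipeline does not use the new layout."""
    feature_dir = pipeline_dir / "feature_sets"
    assert feature_dir.is_dir(), f"missing {feature_dir}"
    feature_files = list(feature_dir.glob("features.*.parquet"))
    assert feature_files, "expected at least one features.<subset>.parquet file"
```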