GH-36: feature: Serialize pipeline feature sets as DataFrames #41

Merged
Urfoex merged 1 commit into develop from refactor/GH-36-serialize-pipeline-predict on Jul 25, 2025

@Urfoex (Collaborator) commented Jun 27, 2025

Refactor pipeline serialization to handle transformed features as full getML DataFrames instead of simple NumPy arrays.

This provides a more robust and informative serialization format, preserving column names, types, and other metadata for the feature sets.

Key changes:

  • Rename transforms to feature_sets and predicts to predictions throughout the codebase for clarity.
  • pipeline.transform() now produces a getml.DataFrame.
  • serialize_feature_sets saves the resulting DataFrame using the existing serialize_dataframe_or_view logic.
  • Update PipelineInformation metadata, on-disk folder structure, and all related tests to reflect the new model.
  • Remove generic serialization helpers in favor of more explicit and readable implementations for predictions and feature sets.
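
To illustrate why a tabular format carries more information than a bare array, here is a minimal sketch of a round trip that keeps column names. It is not the code from this PR: the functions `write_feature_set` and `read_feature_set` are hypothetical, a plain dict of columns stands in for a `getml.DataFrame`, and CSV from the standard library stands in for the Parquet output.

```python
from __future__ import annotations

import csv
from pathlib import Path


def write_feature_set(columns: dict[str, list[float]], path: Path) -> None:
    """Write a column-oriented feature set with a header row.

    Hypothetical stand-in: the PR serializes a getml.DataFrame, not a dict.
    """
    names = list(columns)
    # Transpose the columns into rows so each CSV line is one record.
    rows = zip(*(columns[name] for name in names))
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(names)  # the header is what a bare array would lose
        writer.writerows(rows)


def read_feature_set(path: Path) -> dict[str, list[float]]:
    """Read the file back, recovering the column names from the header."""
    with path.open(newline="") as handle:
        reader = csv.reader(handle)
        names = next(reader)
        columns: dict[str, list[float]] = {name: [] for name in names}
        for row in reader:
            for name, value in zip(names, row):
                columns[name].append(float(value))
    return columns
```

A consumer of the file can address features by name after loading, which is not possible when only the numeric values are stored.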

@Urfoex requested review from Copilot and srnnkls on June 27, 2025 12:28
@Urfoex self-assigned this on Jun 27, 2025
@Urfoex added the enhancement label on Jun 27, 2025

Copilot AI left a comment

Pull Request Overview

This PR refactors pipeline serialization to use getML DataFrames for transformed features, while renaming keys and updating related tests accordingly.

  • Renames parameters and keys from “predicts” to “predictions” and “transforms” to “feature_sets”.
  • Updates serialization functions, metadata, and tests to support the new DataFrame-based feature sets.
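
For readers holding metadata written before this change, the key rename could be applied with a small helper like the one below. This is only a sketch: `rename_legacy_keys` is a hypothetical function that is not part of the PR, and it assumes the old keys sit at the top level of the metadata dictionary, which the conversation does not confirm.

```python
from __future__ import annotations

# Old key -> new key, as described in the PR summary.
RENAMED_KEYS = {"transforms": "feature_sets", "predicts": "predictions"}


def rename_legacy_keys(metadata: dict) -> dict:
    """Return a copy of `metadata` with legacy top-level keys renamed.

    Assumption: the keys appear at the top level; nested structures are
    left untouched.
    """
    return {RENAMED_KEYS.get(key, key): value for key, value in metadata.items()}
```

Keys that are not in the mapping pass through unchanged, so the helper is safe to run on metadata that already uses the new names.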

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/unit/serialize/test_pipeline_information.py | Updates JSON keys to “predictions” and “feature_sets”. |
| tests/unit/serialize/test_pipeline.py | Renames test functions and assertions to validate feature sets and predictions. |
| tests/unit/conftest.py | Updates fixtures and mock pipeline to support new naming and data types. |
| tests/integration/test_serialize_robot.py | Updates integration test directory and file naming for feature sets and predictions. |
| src/getml_io/serialize/pipeline.py | Replaces deprecated serialization helpers with new functions for predictions and feature sets. |
| src/getml_io/metadata/* | Renames type aliases and metadata keys to correspond to the new nomenclature. |
| src/getml_io/metadata/feature_sets.py | Introduces a new type alias for structured feature sets. |

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @Urfoex, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the pipeline serialization mechanism to enhance the robustness and informativeness of serialized feature sets. Previously stored as simple NumPy arrays, transformed features will now be serialized as full getML DataFrames, preserving crucial metadata like column names and types. This change also introduces clearer terminology by renaming 'transforms' to 'feature_sets' and 'predicts' to 'predictions' throughout the codebase and on-disk storage.

Highlights

  • Enhanced Feature Set Serialization: Transformed features, now termed 'feature sets,' are serialized as getML DataFrames instead of NumPy arrays. This change ensures that column names, types, and other metadata are preserved, leading to a more robust and informative serialization format.
  • Terminology Renaming: The terms transforms and predicts have been consistently renamed to feature_sets and predictions, respectively. This renaming applies across the codebase, metadata structures (e.g., PipelineInformation), and the on-disk storage paths for improved clarity and consistency.
  • Refactored Serialization Logic: Generic serialization helpers for pipeline actions have been removed. They are replaced with explicit, dedicated functions (serialize_predictions and serialize_feature_sets) that handle the specific serialization requirements for each type of output, simplifying the code and making it more readable.
  • Metadata and Storage Updates: The PipelineInformation metadata structures and the on-disk folder layout have been updated to reflect the new predictions and feature_sets terminology. Feature set filenames now include a features. prefix (e.g., features.subset_name.parquet) to distinguish them clearly.
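
The naming scheme for feature set files can be sketched as follows. Only the `features.` prefix and the `.parquet` suffix come from the summary above; the helper name `feature_set_path` and the `feature_sets` directory name are assumptions made for illustration.

```python
from pathlib import Path


def feature_set_path(base_dir: Path, subset_name: str) -> Path:
    """Build the path of a serialized feature set for one subset.

    Assumption: feature sets live in a `feature_sets` folder under the
    pipeline directory. The PR only states that the folder layout follows
    the new terminology.
    """
    return base_dir / "feature_sets" / f"features.{subset_name}.parquet"
```

For a subset named `train`, the resulting file name is `features.train.parquet`.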

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request refactors pipeline serialization to handle transformed features as DataFrames, enhancing robustness and information preservation. Key changes include renaming variables for clarity, updating pipeline.transform() to produce DataFrames, and adjusting metadata and tests accordingly. The changes look good overall, with a couple of suggestions to enhance robustness by adding checks for None values before serialization.
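
A None check of the kind suggested here might look like the sketch below. The review does not show the exact suggestions, so `serialize_if_present` is a hypothetical helper, and the serializer passed to it is any callable chosen by the caller.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def serialize_if_present(
    value: Optional[T], serializer: Callable[[T], R]
) -> Optional[R]:
    """Call `serializer` only when `value` is not None.

    Hypothetical guard: returns None instead of passing a missing value
    on to the serializer.
    """
    if value is None:
        return None
    return serializer(value)
```

Whether a missing value should be skipped silently or raise an error is a design choice the review summary leaves open.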

@Urfoex force-pushed the feature/GH-36-serialize-pipeline-predict branch from e9cbd64 to 793e020 on July 25, 2025 07:20
Base automatically changed from feature/GH-36-serialize-pipeline-predict to develop July 25, 2025 12:23
@Urfoex force-pushed the refactor/GH-36-serialize-pipeline-predict branch from 972dc4e to 349b7a5 on July 25, 2025 12:24
@Urfoex merged commit 16c2353 into develop on Jul 25, 2025
3 checks passed
@Urfoex deleted the refactor/GH-36-serialize-pipeline-predict branch on July 25, 2025 13:15